Retrieval-augmented generation (RAG) systems integrate information retrieval with generative language models to improve the relevance, accuracy and explainability of AI-driven responses. This study evaluates how different configurations of embedding and generative models influence the performance of RAG pipelines in knowledge management (KM) scenarios.
The study combines a broad benchmark of embedding and generation components with a contemporary open-weight comparison centered on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3 and Gemma-2-9B-It. Retrieval configurations are evaluated through recall, latency and storage trade-offs, while generation quality is assessed using ROUGE-L, exact match (EM), token-level F1, BERTScore F1, semantic similarity, answer relevance and faithfulness. The benchmark also includes complementary evaluation on SQuAD and HotpotQA, grounded prompting, abstention prompting, error analysis and long-context stress testing.
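For readers unfamiliar with the answer-matching metrics named above, the following is a minimal sketch of exact match and token-level F1, assuming SQuAD-style normalization (lowercasing, punctuation and English article removal). The function names and the normalization steps are illustrative assumptions; the study's own scoring code may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization, not necessarily the study's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the bag of tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a verbose but correct answer such as "in Paris France" scored against the reference "Paris" fails exact match yet still earns partial token-F1 credit, which is why the two metrics are usually reported together.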
Retrieval quality remained the main determinant of end-to-end RAG quality. The strongest shared retrieval setup combined all-mpnet-base-v2, 256-token chunking with 64-token overlap and top-1 retrieval, reaching Recall@1 = 0.938. Among the open-weight generators, Gemma-2-9B-It achieved the strongest lexical and semantic matching, with its best grounded-abstain configuration reaching ROUGE-L = 0.631, EM = 0.456, token-F1 = 0.631 and BERTScore F1 = 0.767. Llama-3-8B-Instruct produced the strongest faithfulness score in the best grounded setting (0.241), while Mistral-7B-Instruct-v0.3 occupied a more conservative operating point with lower answer matching but stronger abstention behavior. HNSW matched exact-search quality for equivalent retrieval configurations while reducing query latency.
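The retrieval setup described above (fixed-size chunks with overlap, scored by Recall@k) can be sketched as follows. This is an illustrative outline only: it assumes the document has already been tokenized into a list, that each query has a single gold chunk identifier, and the helper names `chunk_tokens` and `recall_at_k` are hypothetical rather than taken from the study.

```python
from typing import Sequence


def chunk_tokens(tokens: Sequence, size: int = 256, overlap: int = 64) -> list:
    """Split a token sequence into windows of `size` sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                gold_ids: Sequence[str],
                k: int = 1) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results.

    Assumption: exactly one gold chunk per query.
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("one ranked list is required per gold id")
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)
```

With `size=256` and `overlap=64`, consecutive windows start 192 tokens apart, so a sentence that straddles a chunk boundary is likely to appear intact in at least one chunk. With `k=1`, `recall_at_k` reduces to the share of queries whose top-ranked chunk is the gold chunk.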
The findings support retrieval chunking, top-1 retrieval and grounded prompting as robust design choices for question-answering-oriented RAG. They also suggest that safer abstention-oriented prompting should be treated as a different operating point rather than as a universal default.
More reliable RAG systems can improve access to institutional knowledge, support organizational learning and reduce barriers to expertise discovery, especially when system designs balance quality, latency and computational cost.
The paper contributes a component-level benchmark for RAG in KM settings, richer evaluation dimensions and a more explicit treatment of retrieval/generation trade-offs across historical and contemporary open-weight baselines. The design narrows practical claims to what is supported by multi-dataset evidence, error analysis and long-context testing.
