Purpose

Retrieval-augmented generation (RAG) systems integrate information retrieval with generative language models to improve the relevance, accuracy and explainability of AI-driven responses. This study evaluates how different configurations of embedding and generative models influence the performance of RAG pipelines for knowledge management (KM) scenarios.

Design/methodology/approach

The study combines a broad benchmark of embedding and generation components with a contemporary open-weight comparison centered on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3 and Gemma-2-9B-It. Retrieval configurations are evaluated through recall, latency and storage trade-offs, while generation quality is assessed using ROUGE-L, exact match (EM), token-level F1, BERTScore F1, semantic similarity, answer relevance and faithfulness. The benchmark also includes complementary evaluation on SQuAD and HotpotQA, grounded prompting, abstention prompting, error analysis and long-context stress testing.
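For readers unfamiliar with the answer-matching metrics named above, the following is a minimal sketch of exact match and token-level F1, assuming the SQuAD-style normalization convention (lowercasing, stripping punctuation and English articles). The abstract does not state which normalization the study applied, so the function names and the normalization steps here are illustrative assumptions rather than the authors' implementation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the study may have used another.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this convention EM is strict and token-F1 gives partial credit, so a prediction that adds extra words to a correct answer scores 0 on EM but above 0 on token-F1.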

Findings

Retrieval quality remained the main determinant of end-to-end RAG quality. The strongest shared retrieval setup combined all-mpnet-base-v2, 256-token chunking with 64-token overlap and top-1 retrieval, reaching Recall@1 = 0.938. Among the open-weight generators, Gemma-2-9B-It achieved the strongest lexical and semantic matching, with its best grounded-abstain configuration reaching ROUGE-L = 0.631, EM = 0.456, token-F1 = 0.631 and BERTScore F1 = 0.767. Llama-3-8B-Instruct produced the strongest faithfulness score in the best grounded setting (0.241), while Mistral-7B-Instruct-v0.3 occupied a more conservative operating point with lower answer matching but stronger abstention behavior. HNSW matched exact-search quality for equivalent retrieval configurations while reducing query latency.
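The chunking and retrieval settings reported above can be illustrated with a short sketch. It assumes the input is already a list of tokens (the abstract does not say which tokenizer was used) and that each query has a single gold chunk; `chunk_tokens` and `recall_at_k` are hypothetical helper names introduced only for illustration.

```python
def chunk_tokens(tokens: list, size: int = 256, overlap: int = 64) -> list[list]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens.

    Defaults mirror the 256-token / 64-token-overlap setting in the abstract.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # 192 tokens between window starts by default
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def recall_at_k(ranked_ids: list[list], gold_ids: list, k: int = 1) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results.

    Assumption: exactly one gold chunk per query.
    """
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)
```

With these defaults a 600-token document yields three chunks starting at tokens 0, 192 and 384, and consecutive chunks share 64 tokens. With k = 1, `recall_at_k` credits a query only when the top-ranked chunk is the gold one, which is the condition Recall@1 measures.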

Practical implications

The findings support retrieval chunking, top-1 retrieval and grounded prompting as robust design choices for question-answering-oriented RAG. They also suggest that safer abstention-oriented prompting should be treated as a different operating point rather than as a universal default.

Social implications

More reliable RAG systems can improve access to institutional knowledge, support organizational learning and reduce barriers to expertise discovery, especially when system designs balance quality, latency and computational cost.

Originality/value

The paper contributes a component-level benchmark for RAG in KM settings, richer evaluation dimensions and a more explicit treatment of retrieval/generation trade-offs across historical and contemporary open-weight baselines. The design narrows practical claims to what is supported by multi-dataset evidence, error analysis and long-context testing.

Licensed re-use rights only