Comparative evaluation of embedding and generative model combinations in retrieval-augmented generation pipelines

Hitl, Mateo; Bagić Babac, Marina; Mornar, Vedran

doi:10.1108/DTA-07-2025-0609

Research Article | June 23 2026

Mateo Hitl

University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia

Marina Bagić Babac (ORCID: 0000-0003-4979-2216)

University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia

Vedran Mornar

VERN' University, Zagreb, Croatia

Author & Article Information

Marina Bagić Babac can be contacted at: marina.bagic@fer.hr

Publisher: Emerald Publishing

Received: July 20 2025

Revision Received: May 08 2026

Accepted: May 31 2026

Online ISSN: 2514-9318

Print ISSN: 2514-9288

Funding

This work was supported by the Croatian Science Foundation (Hrvatska zaklada za znanost) under project number IP-2025-02-1267, and by the European Union's Horizon Europe research and innovation programme (European Commission) under Grant No. 101086179.

2026

Emerald Publishing Limited

Licensed re-use rights only

Data Technologies and Applications 1–19.

https://doi.org/10.1108/DTA-07-2025-0609

Purpose

Retrieval-augmented generation (RAG) systems integrate information retrieval with generative language models to improve the relevance, accuracy and explainability of AI-driven responses. This study evaluates how different configurations of embedding and generative models influence the performance of RAG pipelines in knowledge management (KM) scenarios.

Design/methodology/approach

The study combines a broad benchmark of embedding and generation components with a contemporary open-weight comparison centered on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3 and Gemma-2-9B-It. Retrieval configurations are evaluated through recall, latency and storage trade-offs, while generation quality is assessed using ROUGE-L, exact match (EM), token-level F1, BERTScore F1, semantic similarity, answer relevance and faithfulness. The benchmark also includes complementary evaluation on SQuAD and HotpotQA, grounded prompting, abstention prompting, error analysis and long-context stress testing.
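To make two of the generation metrics named above concrete, the following is a minimal sketch of exact match (EM) and token-level F1 in the style commonly used for SQuAD-type question answering. The normalization steps (lowercasing, removing punctuation and English articles) and the function names are assumptions for illustration; the abstract does not state the exact scoring procedure used in the study.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    This normalization is an assumption, not the study's documented procedure.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the shared tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only complete agreement, while token-level F1 gives partial credit when a generated answer contains the reference span plus extra words, which is one reason the two scores can diverge for the same model.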

Findings

Retrieval quality remained the main determinant of end-to-end RAG quality. The strongest shared retrieval setup combined all-mpnet-base-v2, 256-token chunking with 64-token overlap and top-1 retrieval, reaching Recall@1 = 0.938. Among the open-weight generators, Gemma-2-9B-It achieved the strongest lexical and semantic matching, with its best grounded-abstain configuration reaching ROUGE-L = 0.631, EM = 0.456, token-F1 = 0.631 and BERTScore F1 = 0.767. Llama-3-8B-Instruct produced the strongest faithfulness score in the best grounded setting (0.241), while Mistral-7B-Instruct-v0.3 occupied a more conservative operating point with lower answer matching but stronger abstention behavior. HNSW matched exact-search quality for equivalent retrieval configurations while reducing query latency.
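The retrieval setup reported above (256-token chunks, 64-token overlap, top-1 retrieval scored by Recall@1) can be illustrated with a short sketch. It assumes the document is already tokenized into a list, and it assumes Recall@k is computed as the fraction of queries whose gold chunk appears among the top k retrieved chunks; the function names and this definition are illustrative and are not taken from the paper.

```python
def chunk_tokens(tokens, size=256, overlap=64):
    """Split a token list into windows of `size` tokens that share
    `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def recall_at_k(ranked_ids, gold_ids, k=1):
    """Fraction of queries whose gold chunk id is in the top-k ranking.

    ranked_ids: one ranked list of chunk ids per query (best first).
    gold_ids:   the single relevant chunk id for each query (assumed).
    """
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)
```

With the default arguments each window starts 192 tokens after the previous one, so a passage that straddles a chunk boundary is still fully contained in at least one chunk as long as it is no longer than the overlap.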

Practical implications

The findings support retrieval chunking, top-1 retrieval and grounded prompting as robust design choices for question-answering-oriented RAG. They also suggest that safer abstention-oriented prompting should be treated as a different operating point rather than as a universal default.

Social implications

More reliable RAG systems can improve access to institutional knowledge, support organizational learning and reduce barriers to expertise discovery, especially when system designs balance quality, latency and computational cost.

Originality/value

The paper contributes a component-level benchmark for RAG in KM settings, richer evaluation dimensions and a more explicit treatment of retrieval/generation trade-offs across historical and contemporary open-weight baselines. The design narrows practical claims to what is supported by multi-dataset evidence, error analysis and long-context testing.
