This paper aims to evaluate how effectively large language models (LLMs) represent and generate minority cultural knowledge, specifically Taiwanese Hakka culture. To this end, the study proposes a structured and replicable evaluation framework integrating Bloom’s taxonomy and retrieval-augmented generation (RAG). The research is guided by the following questions: (1) How do LLMs perform across different cognitive domains when processing Hakka cultural content? (2) To what extent does the integration of RAG enhance the accuracy and contextual appropriateness of LLM outputs? (3) How do different LLM architectures compare in their ability to recall, analyse and creatively synthesize culturally grounded information?
This study proposes a cognitive benchmarking framework to evaluate how LLMs process and apply culturally specific knowledge. The framework integrates Bloom’s taxonomy with RAG to assess model performance across six hierarchical cognitive domains: remembering, understanding, applying, analysing, evaluating and creating. Using a curated Taiwanese Hakka digital cultural archive as the primary testbed, the evaluation measures the semantic accuracy and cultural relevance of LLM-generated responses.
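The comparison described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the study's implementation: the names `Item`, `token_overlap`, `evaluate` and `compare` are hypothetical, the two answer functions stand in for a baseline LLM and a RAG-augmented LLM, and the token-overlap score is only a placeholder for whatever semantic accuracy and cultural relevance measures the study actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass

# The six Bloom's taxonomy domains named in the text, lowest to highest.
BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analysing", "evaluating", "creating"]


@dataclass
class Item:
    """One benchmark question (hypothetical structure)."""
    level: str      # one of BLOOM_LEVELS
    question: str
    reference: str  # reference answer, e.g. drawn from the archive


def token_overlap(response: str, reference: str) -> float:
    """Placeholder metric: fraction of reference tokens found in the response.

    Assumption: a real evaluation would use a semantic similarity or
    human-rated measure instead of this crude lexical proxy.
    """
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & set(response.lower().split())) / len(ref_tokens)


def evaluate(items, answer_fn, score_fn=token_overlap):
    """Return the mean score per Bloom level for one system."""
    scores = defaultdict(list)
    for item in items:
        if item.level not in BLOOM_LEVELS:
            raise ValueError(f"unknown Bloom level: {item.level}")
        response = answer_fn(item.question)
        scores[item.level].append(score_fn(response, item.reference))
    # Report only levels that have at least one item, in taxonomy order.
    return {lvl: sum(scores[lvl]) / len(scores[lvl])
            for lvl in BLOOM_LEVELS if lvl in scores}


def compare(items, baseline_fn, rag_fn, score_fn=token_overlap):
    """Return the per-level score difference (RAG minus baseline)."""
    baseline = evaluate(items, baseline_fn, score_fn)
    rag = evaluate(items, rag_fn, score_fn)
    return {lvl: rag[lvl] - baseline[lvl] for lvl in baseline}
```

Keeping the scoring function as a parameter means the same loop can be rerun with a stronger metric, and reporting results per level rather than as one aggregate keeps any gap between lower-order and higher-order domains visible.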
The evaluation results indicate that LLMs augmented with RAG exhibit marked improvements over baseline models in the cognitive domains of remembering, understanding and analysing. These enhancements are particularly evident in tasks requiring factual accuracy, contextual relevance and semantic precision, underscoring RAG’s effectiveness in addressing the knowledge sparsity typically observed in underrepresented cultural data sets. However, a notable limitation persists in the domain of creating across all models, including those equipped with RAG. This suggests that while retrieval mechanisms bolster the reproduction and comprehension of cultural knowledge, they do not yet sufficiently support culturally nuanced generative synthesis.
This study introduces a novel evaluation framework integrating cognitive domain benchmarks with RAG-enhanced LLMs to assess cultural knowledge processing. The research advances culturally grounded artificial intelligence (AI) systems and digital archival quality by empirically demonstrating RAG’s impact on factual accuracy in lower- and mid-level cognitive tasks. The findings affirm the strategic value of retrieval integration for enhancing representational fidelity in cultural AI applications, while also highlighting the need for future research into hybrid architectures that combine external grounding with culturally adaptive generation strategies.
