Cultural heritage (CH) texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured knowledge graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model (LLM)-based knowledge extraction from CH documents. We validate the methodology through a case study on authenticity assessment debates.
ATR4CH combines annotation models, ontological frameworks and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts, etc.), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B and GPT-4o-mini).
The methodology successfully extracts complex CH knowledge: 0.96–0.99 F1 for metadata extraction, 0.7–0.8 F1 for entity recognition, 0.65–0.75 F1 for hypothesis extraction, 0.95–0.97 F1 for evidence extraction and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment.
The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing.
To the best of the authors’ knowledge, this is the first systematic methodology for coordinating LLM-based extraction with CH ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. ATR4CH enables CH institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.
1. Introduction
Knowledge graphs (KGs) have become the standard approach for representing and sharing cultural heritage (CH) information in the Linked Open Data (LOD) ecosystem, enabling interoperability between libraries, archives and museums (Barabucci et al., 2021). This effort has concentrated for the most part on creating KGs of metadata, with diversified workflows dedicated to converting semi-structured or already structured sources (catalogues, inventories) into LOD (Bernasconi and Ferilli, 2024). However, the knowledge contained in unstructured texts (descriptive content, contextual information and analytical discourse) remains difficult to extract and structure systematically into queryable formats, and even when integrated into KGs, it is usually kept in long descriptive string fields (Barabucci et al., 2021; Giagnolini et al., 2025). Scholarly authenticity assessment debates exemplify this challenge: complex interpretative knowledge is embedded in natural language discourse but practically absent from structured representations. Additional challenges stem from the inherently interpretative nature of humanities scholarship, which aligns with a constructivist epistemology viewing knowledge as situated, provisional and shaped by the observer's perspective. Checkland and Holwell distinguish between data (passively recorded facts) and capta (knowledge actively constructed by the observer). This distinction challenges the realist assumption that often underpins data practices, in which data are treated as an objective and context-independent representation of reality (Checkland and Holwell, 2006). These epistemological tensions manifest across various forms of CH scholarship, from attribution studies and provenance research to interpretative analysis and critical evaluation.
In authenticity assessment, scholars from different humanities disciplines (e.g., Diplomatics, Paleography, Philology, History) and scientific fields (e.g., Forensics, Materials science, Chemical analysis) frequently arrive at divergent conclusions based on different evidential priorities (Barone, 1912). Inherent factors contributing to this diversity include historical uncertainty, gaps in documentary transmission and subjectivity (Blau, 2011; Gadamer, 2013). Recent theoretical advancements acknowledge the subjectivity and uncertainty inherent in interpreting CH data, recognizing these as essential epistemic characteristics that must be preserved in digital representations (Pasqual, 2025; Piotrowski and Neuwirth, 2020; Piotrowski, 2023). However, current KG implementations represent only simplified versions of scholarly discourse. Whether dealing with artistic attribution, provenance disputes, historical interpretation or authenticity assessment, complex scholarly reasoning gets reduced to simple categorical assertions. Major knowledge bases like Wikidata [1] and DBpedia [2] exemplify this limitation. While Wikipedia articles contain rich discussions with detailed scholarly arguments, evidence analysis and alternative hypotheses, their structured counterparts reduce this complexity to sparse, categorical statements that fail to capture the evidential reasoning, methodological disagreements and evolving consensus that characterize authentic scholarly discourse. Consider the famous Donation of Constantine, a supposed 4th-century decree by Emperor Constantine transferring authority over Rome and the western Roman Empire to the Pope. In the 15th century, Lorenzo Valla exposed the document as a forgery through philological analysis (Valla, 2023), demonstrating that its Latin contained anachronisms from the 8th rather than the 4th century. Despite Valla's compelling evidence, acceptance of this finding evolved gradually over centuries.
As shown in Figure 1, Wikidata categorizes the Donation as a “historical forgery” [3] with no representation of the scholarly debate, while DBpedia [4] similarly lacks structured representation of the authenticity discourse. In contrast, the corresponding Wikipedia page contains extensive discussions of Valla's philological arguments, the specific linguistic evidence, the Church's resistance and subsequent scholarly confirmation. Three interconnected challenges exist. The first is syntactic: representing competing scholarly opinions within formal knowledge representation (KR) systems requires sophisticated mechanisms that traditional implementations struggle to handle effectively (Pasqual, 2025). While theoretical frameworks such as RDF-star, Named Graphs and reification methods provide the necessary expressive power, their practical application demands complex modeling decisions about contradictory evidence, evolving consensus and methodological disagreements, often resulting in oversimplified categorical assertions or unmanageably complex representations. The second challenge is practical: extracting complex scholarly information from textual sources requires enormous manual labor, creating insurmountable scalability barriers. This process requires expert annotators to identify scholarly agents, extract evidential reasoning and capture alternative hypotheses while maintaining consistency across large document collections. CH institutions possess vast textual resources containing sophisticated scholarly analyses, but lack practical means to transform this knowledge into queryable, machine-readable formats. Large Language Models (LLMs) present a promising solution due to their ability to process complex academic discourse, identify implicit relationships and handle domain-specific vocabularies (Khorashadizadeh et al., 2024) without additional training. 
LLMs can bootstrap knowledge extraction (KE) pipelines by eliminating the need for specific annotated training corpora and exploiting transfer learning across domains beyond those on which they were explicitly trained (Brown et al., 2020).
The screenshot displays a Wikidata webpage interface. At the top left, the Wikidata logo appears next to the text “WIKIDATA.” To its right, a horizontal search bar labeled “Search Wikidata” is visible with a “Search” button positioned on the right side of the bar. Below this header, the page title “Donation of Constantine (Q238476)” appears. Directly beneath the title, a horizontal tab bar shows “Item” and “Discussion.” Below the tabs, a description line reads “forged Roman imperial decree by which the emperor Constantine supposedly donated Rome and surrounding territory to the Pope” followed by “Constitutum Constantini.” Further down, a section header labeled “Statements” appears. Under it, a statement labeled “instance of” shows the value “historical forgery,” with “0 references” displayed beneath. To the right of this statement, options such as “edit,” “add reference,” and “add value” are visible. Below the statements section, a labeled field titled “image” appears on the left side. To its right, a small rectangular thumbnail image shows a historical painting scene.
Figure 1. The Donation of Constantine entry in Wikidata. Source: Wikidata
The third challenge is methodological: despite advances in both ontology development and extraction techniques, the literature lacks systematic methodologies that integrate ontology-driven KG generation specifically for CH contexts. While general-purpose text-to-KG frameworks exist (Maynard et al., 2017; Hotho et al., 2020), they typically assume the availability of large annotated training corpora and domain-agnostic entity types. What remains absent is a comprehensive methodological framework that guides practitioners through the complete process: from analyzing source materials and identifying extractable patterns, through developing appropriate annotation schemas and extraction pipelines, to evaluation. Such a methodology must accommodate the resource constraints typical of CH institutions while maintaining the representational fidelity required for humanities scholarship.
This work tackles the following primary research question: How can a systematic methodology integrate LLM-based KE with ontological frameworks to effectively capture and structure the complex interpretative knowledge contained in CH texts?
To systematically address this primary question, we investigate the following sub-questions, which we answer using our authenticity assessment case study:
Methodological Framework: What methodological approach can effectively integrate LLM-based KE with existing ontological frameworks to capture complex scholarly interpretations in CH texts?
Extraction Performance: How accurately can systematic LLM-based pipelines extract different components of scholarly discourse, including metadata, agents, evidential reasoning and interpretative hypotheses?
Representation Fidelity: Do automatically generated KGs adequately represent the complexity and nuance of scholarly interpretations when following structured methodological approaches?
Model Comparison: How do different LLMs perform within structured extraction pipelines for CH texts, and what are the implications for cost-effective deployment?
Methodology Validation: What insights does authenticity assessment validation provide about the methodology's broader applicability to other forms of CH interpretative scholarship?
Our contributions are threefold. First, we present the ATR4CH (Adaptive Text-to-RDF for Cultural Heritage) methodology, which combines annotation development, ontological alignment and pipeline-based extraction. The methodology is a replicable framework that can be adapted across CH domains and institutional resources. Second, we demonstrate the practical implementation by connecting ATR4CH to the Scholarly Evidence Based Interpretation (SEBI) ontology [5] (Pasqual, 2025) to develop an annotation model and an extraction pipeline for complex scholarly discourse. Third, we provide a comprehensive evaluation on a manually annotated sample of Wikipedia articles, establishing performance benchmarks across multiple extraction tasks and model architectures.
Our evaluation demonstrates the methodology's effectiveness, achieving F1-scores of 0.96–0.99 for metadata extraction, 0.7–0.8 for scholarly entity recognition, 0.65–0.75 for hypothesis extraction and 0.95–0.97 for evidence extraction, with 0.62 G-EVAL overall discourse representativeness.
The remainder of the paper is organized as follows: Section 2 reviews related works in KR, extraction methods for the Semantic Web, opinion mining and LLM-based KE approaches. Section 3 presents the ATR4CH methodology, detailing the five-task iterative approach for integrating LLM-based extraction with ontological frameworks in the CH domain. Section 4 describes ATR4CH implementation for the authenticity assessment use case, following the five-task structure: Section 4.1 presents foundational analysis and design, including corpus collection, the SEBI ontology and preliminary analyses; Section 4.2 describes iterative annotation schema development and ground truth (GT) preparation; Section 4.3 details pipeline architecture development; Section 4.4 and Section 4.5 describe pipeline refinement, KG generation and comprehensive evaluation. Section 5 provides experimental results across five evaluation questions (EQs), comparing Claude Sonnet 3.7, Llama 3.3 70B and GPT-4o-mini on metadata extraction, entity recognition, evidence mining, hypothesis extraction and overall KE fidelity. Section 6 discusses findings in relation to research questions, analyzes performance trade-offs, addresses deployment implications and outlines contributions, limitations and future directions. All code is available at https://github.com/aschimmenti/SEBI-Knowledge-Extraction.
2. Related work
The challenges of representing and extracting interpretative knowledge in the CH domain have received increasing attention in recent research. We focus first on conceptual and ontological models developed for multi-perspective KR, and then turn to methods for extracting such interpretations from unstructured texts, including recent advances in LLMs.
2.1 KR of certainty in the Semantic Web
Recent theoretical advancements acknowledged the subjectivity and uncertainty inherent in interpreting CH data, recognizing these aspects as essential epistemic characteristics in analyzing and representing such data (Piotrowski and Neuwirth, 2020). Uncertainty in the CH domain arises not only from the data itself (for example, data extracted from the digitization of a birth certificate) but also from the interpretative connections made by scholars regarding such data, such as identifying a name on a birth certificate with a specific historical figure (Piotrowski, 2023). However, these advancements have not translated into widely adopted practical tools and standards in KGs. LOD is the standard for encoding and publishing CH data on the Web, promoting interoperability and data exchange between institutions. Standard online catalogues (e.g., Europeana) [6] typically provide single-perspective flat metadata, relegating discussions, debates and uncertain facts to free text descriptions (Barabucci et al., 2021).
To the best of our knowledge, Wikidata is the only large-scale data catalogue that employs a custom reification method to integrate claims with varying degrees of truthfulness, i.e., its ranking mechanism. Other knowledge bases like YAGO4 [7] integrate RDF-star to model provenance (e.g., for temporal information) but lack statements on the certainty of the given triple (Govindapillai et al., 2021). Despite the adequate expressive power made available by the Wikidata model, annotators in the CH domain underutilize this feature. Additionally, claims related to CH data seldom make use of qualifiers to encode contextual metadata, likely due to the increased effort required for this type of annotation (Di Pasquale et al., 2024).
Some ontologies have been designed to structure multi-perspective representations in CH data. ICON (Sartini et al., 2023; Baroncini et al., 2023) models visual recognitions in art history using n-ary relations to encode contextual metadata. Digital Hermeneutics (Daquino et al., 2020) employs a layered approach using Named Graphs (Carroll et al., 2005) to represent scholarly interpretations in archival and literary sources. HiCO (Daquino and Tomasi, 2015) and the STAR model (Andrews, 2023) have been designed to represent historical interpretations and arguments. Previous work has introduced the SEBI ontology [8] (Pasqual, 2025), aimed at representing scholarly claims on a hand-curated catalogue of forged manuscripts, which resulted in the BROAST application [9].
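To make the representational issue concrete, the following minimal sketch shows how two competing authenticity claims can coexist when each is wrapped in a standard RDF reification node (`rdf:Statement`) rather than asserted directly. It is an illustration only: the `http://example.org/` namespace, the property names, the `assertedBy` predicate and the second, hypothetical claimant are assumptions, and the sketch does not reproduce the SEBI ontology or any of the models cited above.

```python
from dataclasses import dataclass

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace, not the SEBI namespace


@dataclass(frozen=True)
class Claim:
    claim_id: str
    subject: str
    predicate: str
    obj: str
    asserted_by: str


def reify(claim: Claim) -> list[str]:
    """Serialize one claim as N-Triples lines using rdf:Statement reification."""
    node = f"<{EX}{claim.claim_id}>"
    return [
        f"{node} <{RDF}type> <{RDF}Statement> .",
        f"{node} <{RDF}subject> <{EX}{claim.subject}> .",
        f"{node} <{RDF}predicate> <{EX}{claim.predicate}> .",
        f"{node} <{RDF}object> <{EX}{claim.obj}> .",
        f"{node} <{EX}assertedBy> <{EX}{claim.asserted_by}> .",
    ]


# Two contradictory claims about the same item; all identifiers are illustrative.
claims = [
    Claim("claim1", "DonationOfConstantine", "hasAuthenticityStatus",
          "Forgery", "LorenzoValla"),
    Claim("claim2", "DonationOfConstantine", "hasAuthenticityStatus",
          "Authentic", "HypotheticalDefender"),
]

ntriples = [line for claim in claims for line in reify(claim)]
print("\n".join(ntriples))
```

Because neither base triple is asserted, the graph records who claims what without committing to either position. The cost is five triples per claim, which illustrates why such representations grow complex quickly.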
2.2 KE for the Semantic Web
KE for the Semantic Web transforms unstructured textual content into machine-processable representations conforming to established ontological models. The text-to-KG task refers to systems that process natural language text to generate RDF KGs, where extraction is guided by predefined ontologies and outputs conform to ontological constraints (Maynard et al., 2017). This approach, termed Closed KE or Ontology-Based Information Extraction (Wimalasuriya and Dou, 2010), operates within predefined ontological frameworks specifying permitted entity types, relations and semantic constraints.
The Closed KE paradigm aligns with the Semantic Web vision, as extraction targets conform to ontological schemas designed for machine reasoning and cross-system integration (Berners-Lee et al., 2001). The requirements for ontology-oriented extraction systems manifest across three critical dimensions (Hotho et al., 2020): entity mentions must be mapped to URIs serving as globally unique identifiers within the LOD cloud, elevating Entity Linking from optional refinement to fundamental requirement; data alignment challenges emerge across T-Box alignment (schema extension for novel entity types), A-Box alignment (deduplication and consistency checking) and URI alignment (entity resolution against existing knowledge base entries); provenance tracking and validation mechanisms must ensure generated RDF conforms to both syntactic requirements and semantic constraints defined in the ontology. Text-to-KG systems following the Closed KE paradigm typically decompose extraction into specialized subtasks executed sequentially: preprocessing and sentence segmentation, Named Entity Recognition (NER) to identify and classify entity mentions, Entity Linking to map mentions to canonical identifiers, Relation Extraction to identify semantic relationships, Event Detection when using event-centric ontologies and graph assembly with validation (Maynard et al., 2017). Recent CH projects demonstrate diverse pipeline implementation approaches. The Musical Meetups Knowledge Graph (Alba Morales et al., 2023) combines DBpedia Spotlight entity recognition with LLMs (GPT-3.5-Turbo) for temporal normalization and purpose classification. MusicBO (Gangemi et al., 2024) employs Abstract Meaning Representation as an intermediate layer (Meloni et al., 2017) between linguistic structure and ontological requirements. 
The Odeuropa project (Lisena et al., 2022) achieves tight annotation–ontology integration through frame-based schemes mapping directly to CIDOC CRM extensions using multilingual BERT models.
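The sequential decomposition described above can be sketched as a chain of small functions. This is a toy illustration under strong assumptions: recognition uses a hand-written gazetteer instead of a trained NER model, relation extraction is a single keyword rule, event detection and validation are omitted, and the `http://example.org/` namespace and the `disputesAuthenticityOf` property are hypothetical. The only real identifier is the Wikidata item Q238476 for the Donation of Constantine.

```python
import re

EX = "http://example.org/"  # hypothetical namespace for minted URIs

# Toy gazetteer standing in for a NER model (assumption).
GAZETTEER = {
    "Lorenzo Valla": "Person",
    "Donation of Constantine": "Document",
}

# Toy knowledge base used for Entity Linking.
KB = {"Donation of Constantine": "http://www.wikidata.org/entity/Q238476"}


def segment(text):
    """Split text into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def recognize(sentence):
    """Return (mention, type) pairs found in the sentence."""
    return [(m, t) for m, t in GAZETTEER.items() if m in sentence]


def link(mention):
    """Map a mention to a known URI, or mint a local one if unknown."""
    return KB.get(mention, EX + "entity/" + mention.replace(" ", "_"))


def extract_relations(sentence, linked):
    """Toy rule: 'exposed' links every Person to every Document."""
    if "exposed" not in sentence:
        return []
    persons = [uri for uri, t in linked.values() if t == "Person"]
    documents = [uri for uri, t in linked.values() if t == "Document"]
    return [(p, EX + "disputesAuthenticityOf", d)
            for p in persons for d in documents]


def run_pipeline(text):
    """Run segmentation, NER, linking and relation extraction in order."""
    triples = []
    for sentence in segment(text):
        linked = {m: (link(m), t) for m, t in recognize(sentence)}
        triples.extend(extract_relations(sentence, linked))
    return triples


text = ("Lorenzo Valla exposed the Donation of Constantine as a forgery. "
        "Acceptance of this finding evolved gradually.")
triples = run_pipeline(text)
print(triples)
```

The `link` step shows the URI alignment problem in miniature: one entity resolves to an existing knowledge base entry, while the other receives a locally minted identifier that would later need reconciliation.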
LLMs have introduced new text-to-KG paradigms (Khorashadizadeh et al., 2024; Mihindukulasooriya et al., 2023; Meyer and Stadler, 2024). Allen et al. (2023) identify hybrid neuro-symbolic systems and natural language interfaces as primary architectural directions, the latter enabling domain experts to guide extraction without technical expertise in KR. iText2KG (Lairgi et al., 2024) employs a zero-shot, incremental approach enabling knowledge base expansion without annotated training data. Ringwald (2024) explores pattern-based methods for learning from Wikipedia-DBpedia/Wikidata pairs. For CH-specific applications, Santini (2024) shows LLM-based relation extraction outperforming specialized models like mREBEL on 19th-century Italian texts through broader linguistic knowledge. Schimmenti et al. (2024) demonstrate zero-shot tools like GLiNER (Zaratiana et al., 2024), enabling on-the-fly entity type specification through natural language descriptions. Giagnolini et al. (2025) employ Llama 3.3 70B for text-to-KG extraction from archival metadata, classifying paragraphs by event types before applying event-specific extraction schemas mapped to RiC-O [10].
The integration of LLMs alters resource requirements and performance characteristics. LLMs leverage In-Context Learning (ICL) (Brown et al., 2020), Few-Shot learning strategies (Petroni et al., 2019) and Chain-of-Thought (CoT) prompting to perform extraction with minimal task-specific examples. This capability proves valuable for domains where creating large annotated corpora is often infeasible. On the other hand, LLM-based approaches introduce new challenges: ontological compliance requires sophisticated prompt engineering or post-processing, hallucination risks necessitate careful verification mechanisms and the implicit knowledge encoded in model parameters may not align with domain-specific conceptual frameworks embodied in CH ontologies.
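As a rough illustration of how few-shot prompting and post-processing interact, the sketch below assembles a prompt from one worked example and then filters a model response for ontological compliance and possible hallucinations. No model is called: the response string is hard-coded, and the entity types, prompt wording and the `build_prompt` and `filter_response` helpers are hypothetical rather than the prompts used in this work.

```python
import json

ALLOWED_TYPES = {"Person", "Document", "Organization"}  # hypothetical type set

# One in-context example (text, expected output); illustrative only.
FEW_SHOT = [
    ("Lorenzo Valla exposed the Donation of Constantine as a forgery.",
     [{"text": "Lorenzo Valla", "type": "Person"},
      {"text": "Donation of Constantine", "type": "Document"}]),
]


def build_prompt(text):
    """Assemble instructions, few-shot examples and the target text."""
    parts = [
        "Extract entities as a JSON list of objects with keys 'text' and 'type'.",
        "Allowed types: " + ", ".join(sorted(ALLOWED_TYPES)) + ".",
    ]
    for example_text, example_output in FEW_SHOT:
        parts.append(f"Text: {example_text}\nEntities: {json.dumps(example_output)}")
    parts.append(f"Text: {text}\nEntities:")
    return "\n\n".join(parts)


def filter_response(raw, source_text):
    """Keep only well-formed entities with an allowed type whose surface
    form literally occurs in the source text (a crude hallucination check)."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text, etype = item.get("text"), item.get("type")
        if isinstance(text, str) and text and etype in ALLOWED_TYPES and text in source_text:
            kept.append({"text": text, "type": etype})
    return kept


source = "Valla argued that Rome was never donated."
# Hard-coded stand-in for a model response (assumption).
raw = ('[{"text": "Valla", "type": "Person"}, '
       '{"text": "Pope Sylvester", "type": "Person"}, '
       '{"text": "Rome", "type": "Planet"}]')
print(filter_response(raw, source))
```

The literal substring check is deliberately strict. It rejects entities the model inferred but the text never mentions, at the price of also rejecting legitimate normalizations, which is one reason human oversight remains useful at this stage.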
Evaluation methodologies for generated KGs remain heterogeneous and often inadequate for capturing semantic fidelity beyond surface-level metrics. Traditional precision–recall calculations on entity mentions and relations fail to assess whether extracted knowledge structures adequately represent the complexity and nuance of scholarly discourse. Back-translation approaches (Gangemi et al., 2024) and LLM-based evaluation frameworks like G-EVAL (Liu et al., 2023) offer promising directions, but their applicability across different types of CH interpretative content requires further validation.
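For reference, the surface-level metrics mentioned above can be computed as in the following sketch. It assumes exact matching between predicted and gold items, which is an assumption made for illustration and not necessarily the matching criterion used in the evaluation reported in this paper.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 under exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative labels only; "x" is a spurious prediction.
gold = {"a", "b", "c", "d"}
predicted = {"a", "b", "x"}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Under exact matching, a correct paraphrase of a hypothesis counts as both a false positive and a false negative. This is the limitation that motivates complementary semantic measures for discourse-level content.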
3. The ATR4CH methodology
This section introduces the ATR4CH methodology, an iterative approach for extracting KGs from CH documents using LLMs. ATR4CH recognizes the fundamental interdependence of annotation, KE and ontology alignment, approaching them conjunctively.
3.1 Methodology overview
The ATR4CH methodology transforms three foundational inputs – unstructured document corpus, target ontology and Competency Questions (CQs) – into a validated KE system through five interconnected tasks, producing a working extraction pipeline, refined annotation model and comprehensive evaluation framework.
Figure 2 presents a flowchart of the methodology. The methodology draws on eXtreme Design (Presutti et al., 2009) for iterative ontology engineering centered on CQs, selection methodologies (Tomasi, 2020) and Odeuropa's integrated annotation–ontology approach (Lisena et al., 2022).
The diagram shows five vertically arranged sections labeled “Task 1”, “Task 2”, “Task 3”, “Task 4”, and “Task 5”. In “Task 1”, three boxes positioned at the top left and arranged horizontally from left to right are labeled “Corpus”, “Ontology”, and “CQs”. Above these three boxes, three small icons are displayed. Above “Corpus”, a document-shaped icon appears. Above “Ontology”, a network-structure icon with connected nodes appears. Above “CQs”, an icon showing three circular nodes with question marks inside them appears. A downward arrow extends from these boxes and leads to the box labeled “Select documents and define core ontology patterns”. A downward arrow from this box branches to two boxes labeled “Pilot Corpus” on the left and “COPs” on the right. A rightward arrow extends from “COPs” and leads to the “Task 2” section positioned at the top right. In “Task 2”, a large box positioned at the top is labeled “For each COP from TASK I”. Inside this box, the process begins with a box labeled “Annotate COP on Pilot Corpus”. A rightward arrow leads to the box labeled “Validate with RDF mapping”. A downward arrow extends from “Validate with RDF mapping” and leads to a diamond-shaped text box labeled “Is the COP annotation valid?”. From this diamond, a downward arrow labeled “YES” leads to “COP Annotation Schema”, and a leftward arrow from this box leads to “Proceed to next COP”. From the diamond, a leftward arrow labeled “NO” leads to “Create schema for COP”, which connects upward back to “Annotate COP on Pilot Corpus”. A downward arrow extends from the “Task 2” section to the box labeled “Merge annotations and RDF mapping”. A leftward arrow extends from “Merge annotations and RDF mapping” and branches to three text boxes arranged vertically from top to bottom labeled “COP Annotation Schemas”, “Annotated Pilot Corpus”, and “RDF Mapping”.
A leftward arrow extends from “COP Annotation Schemas” and leads to the “Task 3” section positioned below “Task 1”. In “Task 3”, a large box labeled “For each COP annotated in the Pilot Corpus” appears. Inside this box, a box positioned at the left is labeled “Test on Pilot Corpus”. A rightward arrow connects this box to “Develop COP text to KG module”. A downward arrow from “Test on Pilot Corpus” leads to “Validate against Annotation”. A rightward arrow from “Validate against Annotation” leads to a diamond-shaped text box labeled “Is the output satisfactory?”. From this diamond, a leftward arrow labeled “NO” leads upward to “Develop COP text to KG module”. A downward arrow from the diamond leads to the box labeled “Implement next COP”. A downward arrow extends from the “Task 3” section to the box labeled “Evaluate on Annotated Pilot Corpus”. A rightward arrow extends from this box to the diamond labeled “Is the pipeline satisfactory?”. From this diamond, a rightward arrow leads to the box labeled “Trace back to annotation or pipeline”. A downward arrow extends from the diamond and leads to the “Task 4” section positioned below. In “Task 4”, a downward arrow extends from the diamond-shaped box labeled “Is the pipeline satisfactory?” and branches into three horizontally aligned boxes that are connected by a single solid line. The first box reads “Integrate full text-to-KG pipeline”. The second box reads “Sample text corpus”. The third box reads “Finalize Annotation Model”. A downward arrow extends from the solid line connecting “Integrate full text-to-KG pipeline” and “Sample text corpus” to a box labeled “KG of the test corpus”. Similarly, a downward arrow extends from the solid line connecting “Sample text corpus” and “Finalize Annotation Model” to a box labeled “Ground Truth”. These two boxes are connected by lines and lead downward to the box labeled “Evaluation” under “Task 5”.
In “Task 5”, a downward arrow from “Evaluation” branches into three boxes. The left branch leads to a box labeled “Domain expert evaluation”. The middle branch leads to a box labeled “Precision, Recall, F1”. The right branch first leads to a box labeled “KG Refinement”, which then connects downward to a box labeled “Semantic Evaluation (G-Eval, BLEU, ...)”. Individually, three downward arrows extend from “Domain expert evaluation”, “Precision, Recall, F1”, and “Semantic Evaluation (G-Eval, BLEU, ...)” and lead to the diamond-shaped decision box labeled “Is the output satisfactory?”. From this diamond, a leftward arrow labeled “YES” leads to the box labeled “Output”. A rightward arrow labeled “NO” leads to the box labeled “Go back to specific part to reimplement”.
Figure 2. Flowchart of the ATR4CH methodology showing the iterative task structure. Source: Authors' own work
ATR4CH focuses on integrating annotation and KE pipeline development with ontologies in the CH domain, including CIDOC CRM (Doerr, 2003), Dublin Core [11], FRBR/FRBRoo (IFLA Working Group on FRBR/CRM Dialogue, 2017), HiCO (Daquino and Tomasi, 2015), SKOS [12] and PROV-O (Lebo et al., 2013). The methodology presupposes dual alignment: the ontology must represent relevant domain knowledge, and this knowledge must be present (explicitly or inferably) within source documents. It suits unstructured texts (informative, narrative, scholarly sources) rather than semi-structured documents like catalogues. The methodology leverages LLM capabilities in ICL (Brown et al., 2020), Few-Shot and CoT strategies (Petroni et al., 2019; Lairgi et al., 2024), based on established KE practices (Tamasauskaitė and Groth, 2022).
ATR4CH adopts an incremental, pattern-by-pattern development strategy, iteratively focusing on one Core Ontological Pattern (COP) at a time, identified through target ontology and CQ analysis. Each COP progresses through the complete development cycle – annotation schema design, RDF mapping validation and automated extraction module implementation – before advancing to the next pattern. After several pattern-specific iterations, the methodology transitions from the Pilot Corpus phase to full-scale corpus processing. The annotation model is consolidated into a production-ready version to work as the GT for pipeline evaluation.
3.2 Foundational analysis and design (Task I)
Task I establishes foundational understanding by analyzing the corpus and ontology to identify COPs, addressing data sparseness common in Information Extraction from unstructured texts.
Corpus Analysis: This ontology-dependent activity examines knowledge manifestation in textual discourse, including linguistic patterns, discourse structures and representational strategies. Challenges include implicit mentions requiring contextual inference, long-distance dependencies where KG components are separated by substantial text spans, nested entities in relational structures and ambiguous references. For Wikipedia articles about forged CH items, structural analysis identifies which sections contain scholarly opinions versus tangential debates, enabling focused extraction from high-density sections like “Scholarly analysis.” Content analysis determines whether articles present complete scholarly reasoning or merely final judgments, guiding the methodology toward sources with sufficient depth.
Ontology Analysis: This parallel activity assesses which ontology parts can be populated from source documents, examining alignment between the ontology's conceptual framework and available textual information. It identifies which classes and properties have sufficient textual evidence, which relationships can be inferred from corpus patterns and which elements need omission. The aim is to determine what data are present rather than immediately addressing how to extract. CQs guide prioritization of ontological coverage based on research requirements.
COPs Identification: Based on analyses, this process identifies essential KG patterns required to answer CQs. COPs represent central ontological nodes and relationships that are both present as extractable information and necessary for addressing research questions. Patterns emerge through systematic intersection analysis of CQs, ontological capabilities and textual evidence. Each candidate pattern is evaluated on necessity (required for answering CQs?) and feasibility (sufficient textual evidence for reliable extraction?). Patterns scoring high on both dimensions form the initial set, refined by considering dependencies and complexity. Simpler patterns are prioritized for early iterations to establish baseline functionality. The final selection represents a manageable subset forming the semantic backbone for KE, with patterns ordered by their structural role – foundational metadata patterns preceding interpretative reasoning patterns. These COPs will be processed incrementally through subsequent tasks.
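The necessity and feasibility screening described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the pattern names, the numeric scores, the 0.6 threshold and the `complexity` field are all hypothetical and serve only to show how candidates scoring high on both dimensions are kept and ordered from simpler to more interpretative patterns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePattern:
    name: str
    necessity: float    # 0..1, how strongly the CQs depend on the pattern (assumed scale)
    feasibility: float  # 0..1, how reliably the text supports extraction (assumed scale)
    complexity: int     # 1 = simple metadata, higher = more interpretative (assumed scale)


def select_cops(candidates, threshold=0.6):
    """Keep patterns scoring high on both dimensions, simplest first."""
    kept = [
        c for c in candidates
        if c.necessity >= threshold and c.feasibility >= threshold
    ]
    # Simpler patterns come first; ties are broken by higher necessity.
    return sorted(kept, key=lambda c: (c.complexity, -c.necessity))


# Hypothetical candidates and scores, for illustration only.
candidates = [
    CandidatePattern("document_metadata", 0.9, 0.95, 1),
    CandidatePattern("authenticity_claim", 1.0, 0.7, 3),
    CandidatePattern("physical_condition", 0.3, 0.8, 2),
    CandidatePattern("evidence_for_claim", 0.9, 0.65, 4),
    CandidatePattern("forger_motivation", 0.7, 0.4, 4),
]

selected = select_cops(candidates)
print([c.name for c in selected])
```

In practice the scores would come from the corpus and ontology analyses rather than from fixed numbers, and the threshold would be a project-level decision.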
Pilot Corpus Selection: The Pilot Corpus is a representative document set serving as a development sandbox. It is not a quantitatively representative sample but a qualitative one that must be linguistically, structurally and epistemically representative while remaining manageable for intensive manual work. Selection ensures coverage of linguistic patterns, discourse structures and diverse COP manifestations. Size can be three to five documents, depending on length and information complexity. This set will be used iteratively for developing and validating extraction pipelines for each COP.
3.3 Minimal working annotation development (Task II)
Task II develops annotation schemas incrementally, processing one COP at a time. For each COP identified in Task I (Section 3.2), an annotation schema is developed, applied to the Pilot Corpus and validated through RDF mapping before proceeding to the next pattern. The result is an annotation model that serves as the target schema for automated extraction.
Pattern-by-Pattern Annotation Schema Development: Development proceeds iteratively through identified COPs. For each pattern, an annotation schema captures essential knowledge structures while remaining practical for manual annotation and automated extraction. Schema design accounts for diverse knowledge manifestations in the corpus, including explicit textual mentions and information requiring inference or contextualization.
The annotation schema should prioritize simplicity and feasibility while ensuring adequate coverage. “Minimal” refers to including only necessary annotation elements for extracting identified ontological patterns in the first iterations, avoiding over-annotation that complicates extraction without contributing to answering CQs. If COPs require complex semantic structures beyond simple triple patterns, the annotation schema should include appropriate mechanisms for representing these relationships mappable to RDF (e.g., Named Graphs, reification).
Knowledge Base Integration Strategy: Knowledge base integration enables consistent entity identification and vocabulary alignment between textual mentions and the target ontology. Since COPs typically involve ontological individuals, entities, controlled vocabularies or standardized terminologies, annotators need access to these resources to ensure textual references link to correct ontological entities, preventing inconsistent annotation that would hamper aggregation and reasoning in the final KG.
Integration, whether through local vocabularies or external resources like Wikidata or DBpedia, must be designed early to establish clear protocols for entity linking and vocabulary alignment, guiding both manual annotation and automated extraction in Task III (Section 3.4). The choice between local and external knowledge bases depends on domain coverage, data quality requirements and the specific entity types required by COPs.
Annotation Paradigm: Annotation should follow established corpus linguistics and NLP practices. When resources permit, multiple annotators should annotate the same documents to enable inter-annotator agreement measurement using Cohen's kappa (Carletta, 1996; Cohen, 1960) or Krippendorff's alpha (Krippendorff, 2019), identifying ambiguous categories and revealing where guidelines require clarification. In resource-constrained settings, a single experienced annotator may suffice, but annotation guidelines must be thoroughly documented for reproducibility. The process should be iterative: initial guidelines are refined based on encountered edge cases.
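As an illustration of the agreement measurement mentioned above, the following sketch computes Cohen's kappa for two annotators who assign one categorical label per item. The label set and the example annotations are hypothetical; the formula is the standard one, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six text spans.
annotator_1 = ["forgery", "forgery", "authentic", "disputed", "forgery", "authentic"]
annotator_2 = ["forgery", "authentic", "authentic", "disputed", "forgery", "forgery"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A low value for a given category is a signal that the corresponding guideline needs clarification before the schema is frozen.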
Iterative Development Process for Each COP: Development follows a systematic cycle, ensuring the annotation schema produces RDF structures satisfying COPs:
Schema Design: Develop initial annotation layers based on the current COP, incorporating knowledge base integration protocols through tagsets, controlled vocabularies and standardized terminologies aligning with the target ontology.
Pilot Corpus Annotation: Annotate the entire pilot corpus using the current schema iteration to identify gaps, inconsistencies or bottlenecks.
Mapping Validation: Conduct preliminary mapping exercises from annotated data to RDF format, testing whether resulting KGs satisfy the COP and adequately represent source document semantic content. This mapping serves as a unit test, validating that annotation patterns correctly transform to valid RDF.
Schema Refinement: Refine the annotation model based on issues identified during mapping validation, returning to previous activities as necessary.
Pipeline Development for Current COP: Once the annotation schema has been validated through successful RDF mapping, proceed to Task III (Section 3.4) to develop automated extraction for this pattern using the same Pilot Corpus documents. Only after completing pipeline development and validation for the current COP should development proceed to the next pattern.
These preliminary mapping exercises validate that the annotation schema produces target knowledge structures, serving as an early validation mechanism before proceeding to automated extraction development. The Minimal Working Annotation emerges as the aggregation of annotation schemas developed for individual COPs.
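The idea of using the mapping as a unit test can be sketched as follows. This is a simplified illustration using plain tuples instead of an RDF library; the `ex:` prefix, the field names and the predicate names are hypothetical and do not come from any specific ontology.

```python
# Hypothetical mapping from annotation fields to predicates.
FIELD_TO_PREDICATE = {
    "about": "ex:isAbout",
    "classification": "ex:hasClassification",
    "claimed_by": "ex:claimedBy",
}


def annotation_to_triples(annotation):
    """Map one annotation record (a dict) to a set of (s, p, o) triples."""
    subject = "ex:" + annotation["id"]
    triples = {(subject, "rdf:type", "ex:Claim")}
    for field, predicate in FIELD_TO_PREDICATE.items():
        for value in annotation.get(field, []):
            triples.add((subject, predicate, value))
    return triples


def missing_predicates(triples, required):
    """Return the predicates the COP requires but the mapping did not produce."""
    present = {p for _, p, _ in triples}
    return set(required) - present


# Hypothetical annotation in which the claimant was left unannotated.
annotation = {
    "id": "claim1",
    "about": ["ex:document1"],
    "classification": ["ex:Forgery"],
    "claimed_by": [],
}
required = {"rdf:type", "ex:isAbout", "ex:hasClassification", "ex:claimedBy"}

triples = annotation_to_triples(annotation)
print(missing_predicates(triples, required))
```

A non-empty result points to a gap either in the annotation or in the schema, which is the trigger for the schema refinement step.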
3.4 Pipeline architecture development (Task III)
Task III designs and implements computational tools to automatically extract COPs from text using the annotation model as the target schema, addressing the CH corpora's domain-specific characteristics and limited annotated training data. This task develops and validates extraction capabilities incrementally for each COP using the same Pilot Corpus documents annotated in the previous task.
Task Decomposition and Architecture Design: KE is designed around annotation model elements, prioritizing based on COP semantic importance and accounting for information manifestation patterns from corpus analysis. This modular approach enables incremental extraction where KG components are progressively identified through sequential processing, facilitating debugging and targeted optimization while minimizing error propagation. For the current COP, the extraction pipeline targets the corresponding annotation schema developed in Task II (Section 3.3).
Tool Selection Strategy: Tool choice aligns with available resources and data characteristics:
Low data, low resources: API-based LLMs with few-shot prompting and rule-based entity linking
Moderate data, moderate resources: Hybrid approaches combining pre-trained models with domain-specific fine-tuning
Large data, extensive resources: Custom model training and ensemble methods
Large data, low resources: Structured pipeline approaches leveraging smaller models with knowledge distillation.
LLM-based approaches use structured output generation through JSON schemas (Schick et al., 2023; Qin et al., 2024) and ICL strategies (Brown et al., 2020; Min et al., 2022), combined with specialized NER tools (Devlin et al., 2019) for precise span identification when character-level accuracy is critical.
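To make the structured-output step concrete, the sketch below parses a raw model response and checks it against a flat field-to-type schema before it enters the pipeline. This is a deliberately simplified stand-in for full JSON Schema validation, written with the standard library only; the field names and the sample response are hypothetical.

```python
import json

# Hypothetical flat schema: field name -> expected Python type.
HYPOTHESIS_SCHEMA = {"document": str, "classification": str, "evidence": list}


def parse_llm_output(raw, schema=HYPOTHESIS_SCHEMA):
    """Parse a raw LLM response and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


# Hypothetical model response.
raw_response = (
    '{"document": "charter1", "classification": "forgery", '
    '"evidence": ["anachronistic seal"]}'
)
record = parse_llm_output(raw_response)
print(record["classification"])
```

Rejecting malformed responses at this point keeps parsing errors from propagating into later extraction steps; a failed check can trigger a retry or be logged as an extraction failure.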
Pipeline Implementation: Development targets the annotation schema for the current COP, integrating knowledge base resources and vocabulary standardization protocols through prompt integration or RAG (Lewis et al., 2020). Initial implementation focuses on basic functionality before optimization.
Immediate Validation and Benchmarking: Once extraction for the current COP has been implemented, the pipeline is tested on the Pilot Corpus and results are compared against manual annotations from Task II (Section 3.3). Evaluation strategies range from basic (standard metrics on the Pilot Corpus) to comprehensive (ablation studies and hybrid approach exploration) based on project constraints. This immediate validation enables rapid identification of extraction bottlenecks or misalignments before proceeding to the next COP.
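The basic end of this evaluation range can be sketched as a set comparison between extracted items and manual annotations. The example assumes exact matching of (document, field, value) tuples, which is only one possible matching criterion; the sample data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical manual annotations and pipeline output for one document.
gold = {
    ("doc1", "classification", "forgery"),
    ("doc1", "author", "unknown"),
    ("doc1", "date", "12th century"),
    ("doc1", "place", "Styria"),
}
predicted = {
    ("doc1", "classification", "forgery"),
    ("doc1", "date", "12th century"),
    ("doc1", "place", "Vienna"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Computing the scores per COP rather than over the whole output makes it easier to see which extraction module is the bottleneck.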
Output: An extraction pipeline capable of processing raw text and generating structured outputs following the annotation schema for the specific pattern being developed. Only after successfully validating extraction for the current COP should development proceed to the next pattern, returning to Task II (Section 3.3) to develop its annotation schema. This cycle continues until extraction pipelines have been developed and validated for all identified COPs using the Pilot Corpus.
3.5 Integration and refinement (Task IV)
Task IV harmonizes the COPs (Task I), annotation schemas (Task II) and pipeline components (Task III) into a coherent end-to-end KE system, transitioning the experimental pipeline to production-ready status. The annotation model emerging from processing individual COPs is consolidated into a production version suitable for full corpus processing.
Pipeline Integration: Modular extraction components developed for individual COPs are integrated into a unified text-to-KG pipeline. Integration addresses dependencies between patterns, ensures consistent entity resolution across components and optimizes overall processing architecture. The integrated pipeline processes documents from raw text to complete KGs, instantiating all identified COPs.
End-to-End Pipeline Testing: Comprehensive testing over the pilot corpus processes documents from raw text to final KGs, revealing systematic issues including data sparseness patterns, interaction effects between COP extractors, inconsistent tool coverage across discourse types and representation generation errors. Testing systematically evaluates performance across document types and semantic phenomena, with particular attention to error propagation through pipeline stages.
Annotation Model Refinement to Production Version: Based on testing results, the annotation model evolves into a production-ready version suitable for both manual annotation and automated extraction. This may involve adding elements crucial for automated extraction (coreference chains spanning multiple COPs, disambiguation tags and confidence indicators) while maintaining backward compatibility with COPs. The production annotation model serves as the schema for GT creation in Task V (Section 3.6).
Mapping Algorithm Enhancement: Preliminary mapping algorithms from Task II (Section 3.3) are consolidated and enhanced to handle the complete KG structure, improving the handling of complex semantic structures emerging from COP interactions and adding validation using tools such as SHACL, OWL reasoners, SPARQL Anything (Asprino et al., 2023) and RML (Dimou et al., 2014). Error handling mechanisms manage extraction failures and partial results.
3.6 KE and evaluation (Task V)
The final validation phase employs technical validation and domain-expert evaluation to ensure knowledge structures accurately represent domain-specific discourse complexity, applying the refined system from Task IV (Section 3.5) to test data separate from the Pilot Corpus.
GT Preparation: Comprehensive GT creation using the production annotation model involves annotating a test dataset separate from the Pilot Corpus, covering all COPs from Task I (Section 3.2). Annotation follows the same paradigms established in Task II (Section 3.3), including multiple annotators and inter-annotator agreement measurement when resources permit. The GT serves as the gold standard for systematic evaluation, with mapping algorithms from Task IV (Section 3.5) applied to generate reference RDF. Test dataset size should balance evaluation rigor with annotation resource constraints, typically ranging from 10 to 50 documents, depending on length and complexity.
KE: Test datasets are processed through the complete pipeline under realistic deployment conditions, with systematic documentation of performance and failure modes. This represents the first application of the extraction pipeline beyond the Pilot Corpus used for development.
Multi-Level Evaluation: Multiple complementary approaches address KG evaluation challenges:
Technical Evaluation: Component-level assessment using precision, recall and F1-score evaluates individual extraction tasks independently for each COP. Coverage analysis for CQs examines whether the KG contains sufficient information to answer the original CQs.
Semantic Evaluation: KG “rehydration” (Gardent et al., 2017; Gangemi et al., 2024) enables comparison when structural alignment is impossible. This approach reconstructs natural language text from KG information and compares it with the source text using metrics like BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), BARTScore (Yuan et al., 2021), chrF++ (Popović, 2015) and G-EVAL (Liu et al., 2023), as proposed by He et al. (2025).
Competency-Based Evaluation: SPARQL query suites derived from original CQs verify that the KG satisfies the functional requirements that motivated its construction, aligned with evaluation practices from eXtreme Design (Presutti et al., 2009) using tools like TestaLOD (Carriero et al., 2019).
Domain Expert Validation: A comprehensive review by domain specialists evaluates extraction quality and coherence, with rehydration enabling evaluation by experts without RDF expertise by presenting KG content as natural language.
Iteration Strategy: Evaluation results may trigger returns to earlier tasks: coverage issues lead back to Task II (Section 3.3) or Task IV (Section 3.5), extraction bottlenecks to Task III (Section 3.4), systematic errors requiring architectural restructuring to Task IV (Section 3.5), and fundamental ontological misalignments to Task I (Section 3.2) for COP reassessment.
4. Authenticity assessment case study
This section describes the implementation of the ATR4CH methodology for authenticity assessment debates. The methodology operates on the intersection between ontological representation and textual evidence: the entities, relations and concepts defined by the SEBI ontology must correspond to information that can be identified and extracted from the documents. Following the five-task structure of ATR4CH (Section 3), we present: foundational analysis and design (Section 4.1), iterative annotation schema development (Section 4.2), pipeline architecture development (Section 4.3), system integration and refinement (Section 4.4) and comprehensive evaluation (Section 4.5).
4.1 Foundational analysis and design (Task I)
This subsection implements Task I of the ATR4CH methodology (Section 3.2), establishing a foundational understanding of source materials by analyzing both corpus and ontology to identify COPs for KE. As specified in the methodology, Task I requires three foundational inputs: a corpus of unstructured documents, a target ontology defining KR and CQs specifying information requirements. This subsection presents these inputs and the analytical activities that identify extractable patterns.
4.1.1 Corpus collection and analysis
Following Task I guidelines (Section 3.2), we collected Wikipedia articles on historical forgeries, hoaxes and authenticity controversies through web scraping from Wikipedia's categorical organization system. Initial selection covered 31 categories, including Document and Literary Forgeries, Historical Myths, Conspiracy Theories, Pseudepigraphy and Political forgery. From 1,301 retrieved documents [13], 16 categories and 717 articles were excluded because they presented no scholarly debate or were not about CH items. The final dataset encompasses 581 articles across 15 categories (Table 1).
Table 1. Distribution of articles across Wikipedia categories in the corpus
| Category | Article count |
|---|---|
| Literary forgeries | 138 |
| Pseudepigraphy | 65 |
| Old Testament pseudepigrapha | 60 |
| Forgery controversies | 58 |
| Archaeological forgeries | 52 |
| Musical hoaxes | 44 |
| Art forgers | 40 |
| Document forgeries | 33 |
| Ancient Greek pseudepigrapha | 28 |
| Political forgery | 26 |
| Religious hoaxes | 15 |
| Modern pseudepigrapha | 11 |
| Sculpture forgeries | 7 |
| Political forgeries | 2 |
| Shakespeare authorship question | 2 |
| Total | 581 |
Articles average 8,150 characters and 1,249 tokens, with unique vocabulary averaging 464 tokens per article, indicating substantial lexical diversity. Figure 3 shows a right-skewed distribution, with most articles ranging from 2k to 15k characters and outliers extending beyond 40k characters.
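Per-article statistics of this kind can be computed with a short script. The sketch below is an illustration only: it assumes a simple lowercase word tokenizer based on a regular expression, which is not necessarily the tokenizer behind the reported figures, and the sample texts are hypothetical.

```python
import re
from statistics import mean


def article_stats(text):
    """Character count, token count and unique-token count for one article."""
    # Assumption: tokens are lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    return {"chars": len(text), "tokens": len(tokens), "unique": len(set(tokens))}


def corpus_stats(articles):
    """Average the per-article statistics over a list of article texts."""
    per_article = [article_stats(text) for text in articles]
    return {key: mean(s[key] for s in per_article) for key in ("chars", "tokens", "unique")}


# Hypothetical two-article corpus.
articles = [
    "The charter is a forgery. The seal is genuine.",
    "A forged letter.",
]
first = article_stats(articles[0])
averages = corpus_stats(articles)
print(first, averages)
```

A different tokenizer, such as a model-specific subword tokenizer, would yield different token counts for the same articles.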
Figure 3. Article length distribution showing a right-skewed pattern characteristic of encyclopedic content (x-axis: article length in characters; y-axis: frequency). Source: Authors' own work
Distribution across categories (Figure 4) reflects the natural prevalence of different forgery types, with literary forgeries representing the largest category (138 articles), followed by pseudepigraphy forms (136 articles across subcategories), and archaeological and artistic forgeries (132 articles combined).
Figure 4. Distribution of articles across Wikipedia categories (x-axis: category; y-axis: article count). Source: Authors' own work
Token count distribution by category (Figure 5) reveals substantial variability. The Shakespeare authorship question demonstrates the highest token density (nearly 10K tokens), while musical hoaxes and modern pseudepigraphy exhibit more consistent, moderate-length articles.
Figure 5. Token count distribution by category showing medians, quartiles and outliers (boxplots for the top 15 categories; y-axis: token count). Source: Authors' own work
Notable examples include the Demodocus [14], a fabricated Platonic dialogue whose Wikidata entry [15] employs a deprecated rank for authorship attribution to Plato (Figure 6), demonstrating how existing knowledge bases represent disputed attributions.
[Figure 6: Plato noted as the author of the Demodocus using a deprecated rank. Source: Wikidata. Screenshot of the “author” statement with value “Plato” and reason for deprecated rank “superseded by later scholarship or research”.]
4.1.2 SEBI ontology analysis
Implementing the ontology analysis component of Task I (Section 3.2), we examined the SEBI ontology to assess which ontological elements can be populated from the source documents. The SEBI ontology [16] (Pasqual, 2025) was developed based on scholarly articles (e.g., Härtel, 2017), a catalogue describing 153 known forgeries from Styria (Haider, 2022), and discussions with an expert diplomatist. The data model represents authenticity assessment claims using RDF-star (Hartig, 2017) as a reification method to represent (possibly concurrent) claim contents and contextual information (Daquino et al., 2020).
Each claim provides information about the document: authenticity classification, date and place of creation, author, and intention behind creation. Contextual information includes the evidence collected by scholars to reach their conclusions, the author of the claim and relevant bibliographic entries (using HiCo [17] and PROV-O [18]). Because RDF-star expresses both the content of a claim and its context, the complete evaluation process can be represented.
As shown in Figure 7, each claim contains an authenticity classification. Items are instances of sebi:Forgery, sebi:Authentic, sebi:FormalForgery or sebi:ContentForgery, all subclasses of sebi:Document.
[Figure 7: Selection of classes and properties to represent scholarly claims addressing authenticity assessment of a document. Source: Authors' own work. Class diagram centred on sebi:Document, with the classes sebi:Forgery (and its subclasses sebi:ContentForgery and sebi:FormalForgery), sebi:Authentic and sebi:Unknown, and properties linking the document to its date, location, intention and creator.]
Additionally, each RDF-star quoted triple includes details such as:
the believed creator: sebi:Document → dct:creator → dct:Agent
date of creation: sebi:Document → dct:date → time:Interval
location of creation: sebi:Document → dct:coverage → dct:Location
intention behind creation: sebi:Document → sebi:intended → sebi:Intention.
The dct:date property connects to time:Interval, whose time:hasBeginning and time:hasEnd properties specify creation periods and accommodate fuzzy time spans. Concerning contextual information (Figure 8), each interpretation (a set of claims represented as quoted triples) is typed as hico:InterpretationAct, connected to prov:Agent to record authorship, and linked to supporting evidence (sebi:support sebi:Evidence).
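The claim structure described above can be illustrated with a minimal Python sketch that serializes one claim as Turtle-star text. It is not the authors' implementation: the `Claim` fields, the `ex:` names and the example values are hypothetical placeholders, prefix declarations are omitted, and attaching each quoted triple to the interpretation act through prov:wasDerivedFrom follows the template in the mapping listing of Section 4.2. Asserting the interval boundaries as plain triples is a further assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Claim:
    item: str                          # local name of the assessed document (hypothetical)
    classification: str                # SEBI class local name, e.g. "Forgery"
    agent: str                         # local name of the scholar making the claim
    creator: Optional[str] = None      # believed creator, if the claim states one
    begin_year: Optional[int] = None   # start of the believed creation period
    end_year: Optional[int] = None     # end of the believed creation period


def claim_to_turtle_star(claim: Claim) -> str:
    """Serialize one claim as Turtle-star lines (prefix declarations omitted)."""
    item = f"ex:{claim.item}"
    act = f"ex:{claim.agent}_about_{claim.item}"
    interval = f"ex:{claim.item}_creation_interval"
    has_interval = claim.begin_year is not None and claim.end_year is not None

    # Claim content: every triple is quoted rather than asserted.
    content: List[Tuple[str, str, str]] = [
        (item, "rdf:type", f"sebi:{claim.classification}")
    ]
    if claim.creator is not None:
        content.append((item, "dct:creator", f"ex:{claim.creator}"))
    if has_interval:
        content.append((item, "dct:date", interval))
    lines = [f"<< {s} {p} {o} >> prov:wasDerivedFrom {act} ." for s, p, o in content]

    # Claim context: the interpretation act and the agent it is attributed to.
    lines.append(f"{act} rdf:type hico:InterpretationAct ;")
    lines.append(f"    prov:wasAttributedTo ex:{claim.agent} .")

    # Assumption: the interval boundaries are asserted as ordinary triples.
    if has_interval:
        lines.append(f"{interval} rdf:type time:Interval ;")
        lines.append(f'    time:hasBeginning "{claim.begin_year:04d}"^^xsd:gYear ;')
        lines.append(f'    time:hasEnd "{claim.end_year:04d}"^^xsd:gYear .')
    return "\n".join(lines)


# Illustrative values only; they are not extracted results.
example = Claim(item="item_1", classification="Forgery", agent="scholar_1",
                creator="agent_2", begin_year=750, end_year=800)
turtle = claim_to_turtle_star(example)
print(turtle)
```

Quoting the triples keeps concurrent claims apart: a second scholar asserting sebi:Authentic for the same item would produce a different quoted triple attached to a different interpretation act.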
[Figure 8: Selection of classes and properties to represent contextual information about scholarly claims addressing authenticity assessment of a document. Source: Authors' own work. Diagram linking a claim to hico:InterpretationAct, prov:Agent and sebi:Evidence; evidence is linked to sebi:Feature (sebi:ExtrinsicFeature, sebi:IntrinsicFeature, sebi:Provenance) and to sebi:Evaluation (sebi:consistency, sebi:presence, sebi:reliability, sebi:completeness, sebi:veridicality), with literal values for ov:confidence and forgont:hasEvaluationScore.]
The ontology also models document features and their evaluation. Document features (sebi:Feature) are extrinsic features (sebi:ExtrinsicFeature), intrinsic ones (sebi:IntrinsicFeature) or provenance information (sebi:Provenance), capturing aspects such as ink, support, handwriting and orthography. Each feature is evaluated on established criteria (sebi:Evidence) such as consistency, presence, completeness, veridicality and reliability. A score is associated with each piece of evidence as an xsd:Literal using forgont:hasEvaluationScore. The evaluation score allows negative findings to be recorded (e.g., the absence of a signature is represented as evidence based on the feature “authentication marks” with evaluation “presence” and score false or 0).
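As a minimal sketch of how a negative finding could be encoded, the Python fragment below models one piece of evidence and renders it as Turtle-like text. The `Evidence` class, the `ex:` names and the choice of xsd:boolean for the score are assumptions made for illustration; the property names sebi:assess and sebi:evaluate are taken from the diagram in Figure 8, and typing “authentication marks” as an extrinsic feature is a guess rather than something the ontology description confirms.

```python
from dataclasses import dataclass

CRITERIA = ("consistency", "presence", "completeness", "veridicality", "reliability")
FEATURE_TYPES = ("ExtrinsicFeature", "IntrinsicFeature", "Provenance")


@dataclass(frozen=True)
class Evidence:
    feature: str        # examined feature, e.g. "authentication marks"
    feature_type: str   # one of FEATURE_TYPES
    criterion: str      # one of CRITERIA
    score: bool         # False records a negative finding (e.g. absence)

    def __post_init__(self) -> None:
        if self.feature_type not in FEATURE_TYPES:
            raise ValueError(f"unknown feature type: {self.feature_type}")
        if self.criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {self.criterion}")


def evidence_to_turtle(evidence_id: str, ev: Evidence) -> str:
    """Render one piece of evidence as Turtle-like lines (prefixes omitted)."""
    feature_iri = "ex:" + ev.feature.replace(" ", "_")
    score = "true" if ev.score else "false"
    return "\n".join([
        f"ex:{evidence_id} rdf:type sebi:Evidence ;",
        f"    sebi:assess {feature_iri} ;",
        f"    sebi:evaluate sebi:{ev.criterion} ;",
        f'    forgont:hasEvaluationScore "{score}"^^xsd:boolean .',
        f"{feature_iri} rdf:type sebi:{ev.feature_type} .",
    ])


# The missing-signature example from the text: feature "authentication marks",
# evaluation "presence", score false.
missing_signature = Evidence(
    feature="authentication marks",
    feature_type="ExtrinsicFeature",   # assumption, see lead-in
    criterion="presence",
    score=False,
)
evidence_turtle = evidence_to_turtle("evidence_1", missing_signature)
print(evidence_turtle)
```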
4.1.3 COPs identification
We identified COPs following the COP identification process specified in Task I (Section 3.2). The identification process involved: (1) assessing alignment between CQs, ontological structures and available textual content, (2) identifying patterns with sufficient textual evidence for reliable extraction, (3) prioritizing based on extraction feasibility and CQ relevance and (4) selecting a manageable subset forming the semantic backbone for KE.
Four COPs were identified for extraction, presented in hierarchical priority order:
CH Item Metadata – Alleged information that the object claims about itself (creator, date, location) before scholarly critical analysis. For instance, in the case of the Donation of Constantine, a relevant CQ would be “What does the Donation claim about its author, date and place of creation?”
Scholarly Opinions – Authenticity assessments expressed by scholarly agents, classified as Authentic, Forgery, FormalForgery or ContentForgery. A typical pattern would be “Scholar X evaluates Document Z and concludes Authenticity Status S.” For instance, “Which scholars identify the Donation of Constantine as authentic?”
Evidential Features – Characteristics examined by scholars to support assessments, organized by type with evaluations. Example CQ: “What evidence does Lorenzo Valla cite to support his assessment of the Donation?”
Alternative Hypotheses – Competing scholarly claims about the actual creator, date, location or intended purpose. Example CQ: “What are the competing scholarly hypotheses about the actual date of the Donation's creation?”
These COPs emerged from the intersection of CQs, ontological structures and extractable content patterns identified in corpus analysis, following the systematic approach described in the methodology.
4.1.4 Pilot corpus selection
Following the Task I guidelines for pilot corpus selection (Section 3.2), we defined a qualitative sample of the corpus. Seven articles were chosen (Donation of Constantine, Eremin Letter, Getty Kouros, Historia Augusta, Life of Homer, Marriage Charter of Empress Theophanu and Protocols of the Elders of Zion), each belonging to a different category. Selection criteria included: (1) presence of multiple scholarly perspectives, (2) clear attribution of claims, (3) discussion of evidence-based reasoning and (4) representation of different temporal periods and document types.
4.2 Minimal working annotation development (Task II)
This subsection documents the implementation of Task II (Section 3.3), developing annotation schemas incrementally by processing one pattern at a time. Each pattern identified in Task I received iterative schema development, pilot corpus application and RDF mapping validation before proceeding to the next pattern.
Annotation Schema Design. Following Task II guidelines, the annotation model was developed through INCEpTION (Klie et al., 2018), implementing three core patterns from SEBI corresponding to the first two COPs: CH item metadata, scholarly agents and authenticity opinions. CH item metadata establishes entity linking using INCEpTION's Wikidata integration, reconciling item types to DCMI Type Vocabulary classes (dcmitype:Text, dcmitype:PhysicalObject and dcmitype:Collection). Scholarly agents (Cognizers) correspond to dct:Agent, linked to Wikidata when possible, with fallback strategies for entities without entries. Authenticity claims were modeled through directed relations between Cognizer and item spans, labeled according to SEBI's authenticity categories (Authentic, FormalForgery, ContentForgery, Forgery and Neutral) as shown in Figure 9.
[Figure 9: Example annotation of an entity expressing an opinion about a CH item. Source: Authors' own work. Three annotated template sentences: the document span is linked to its alleged creator (algd:Creator) and alleged date (algd:Date), the scholar span carries the opinion label stm:Forgery and an stm:Object relation to the document, and stm:Feature relations point to the examined feature, labelled with its evaluation criterion (consistency) and polarity (feat:Negative).]
Knowledge Base Integration. Integration with Wikidata through INCEpTION's Knowledge Base module ensures consistent entity identification, preventing inconsistent annotation that would hamper aggregation and reasoning in the final KG. Wikidata was selected for its domain coverage, data quality and compatibility with the entity types required by the COPs.
Annotation Process. Following established corpus linguistics and NLP practices, a single experienced annotator performed the annotation work iteratively, with guidelines refined based on encountered edge cases and thoroughly documented to ensure reproducibility.
RDF Mapping Validation. Preliminary mapping exercises validated that the annotation schema could produce target knowledge structures. Using the Pilot Corpus, annotation-to-RDF mapping was validated through the algorithm in Listing 4.2, serving as a unit test confirming that annotation patterns correctly transformed to valid RDF instantiating SEBI classes and properties.
Core Annotation Mapping Algorithm
STEP 1: Extract Cognizer-Opinion Pairs
    Select all spans marked as Entity
    WHERE span also has Opinion tagset label
    => CognizerSet(Cognizer(CognizerSpan, Opinion, WikidataID))
STEP 2: Extract CH Items
    Select all spans marked as Entity
    WHERE span has ItemTitle label
    => ItemSet(ItemSpan, WikidataID)
STEP 3: Find Relations
    For CognizerSpan in CognizerSet, check if CognizerSpan
    has stm:Object relation to span in ItemSet
    => Valid tuples (Cognizer, Item, Opinion)
STEP 4: Generate RDF for each tuple
    For each matching pattern:
    |-- Generate URI for Cognizer
    |-- Add owl:sameAs + Wikidata ID
    |-- Generate URI for Item
    |-- Add owl:sameAs + Wikidata ID
    |-- Map opinion to corresponding SEBI class (e.g. sebi:Forgery)
    |-- Generate URI for Named Graph (hico:InterpretationAct)
    |-- Generate claim triple as RDF-star statement
    +-- Apply template:
        ex:{cognizer_uri}_about_{item_uri} rdf:type hico:InterpretationAct;
            prov:wasAttributedTo ex:cognizer.
        ex:cognizer rdf:type dct:Agent;
            rdfs:label "CognizerSpan"@language;
            owl:sameAs wd:WikidataID.
        ex:item rdf:type ex:type;
            rdfs:label "ItemSpan"@language;
            owl:sameAs wd:WikidataID.
        << ex:item rdf:type sebi:Opinion >> prov:wasDerivedFrom ex:{cognizer_uri}_about_{item_uri}.
Successful RDF generation from annotations validated the initial schema design, confirming that annotation patterns adequately captured the semantic content required to instantiate SEBI's authenticity claim structure. Following successful validation, development proceeded to the next pattern following the iterative cycle specified in Task II.
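The four steps of the mapping algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the `Span` class and the `(source, label, target)` relation tuples are a hypothetical in-memory representation rather than INCEpTION's export format, URIs are derived from span text with a simple `slug` helper, the item's rdf:type is left out, and the “Neutral” label is skipped because its target class is not stated in the text.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Opinion labels mapped to SEBI classes; "Neutral" is deliberately left unmapped.
OPINION_TO_CLASS = {
    "Authentic": "sebi:Authentic",
    "Forgery": "sebi:Forgery",
    "FormalForgery": "sebi:FormalForgery",
    "ContentForgery": "sebi:ContentForgery",
}


@dataclass(frozen=True)
class Span:
    span_id: str
    text: str
    opinion: Optional[str] = None       # Opinion tagset label, if any
    is_item_title: bool = False         # True when the span has the ItemTitle label
    wikidata_id: Optional[str] = None


def slug(text: str) -> str:
    """Turn a span text into a URI-safe local name."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def describe(span: Span, rdf_type: Optional[str], lang: str) -> str:
    """Emit the label, optional type and optional owl:sameAs link for a span."""
    label = span.text.replace("\\", "\\\\").replace('"', '\\"')
    parts = [f"rdf:type {rdf_type}"] if rdf_type else []
    parts.append(f'rdfs:label "{label}"@{lang}')
    if span.wikidata_id:
        parts.append(f"owl:sameAs wd:{span.wikidata_id}")
    return f"ex:{slug(span.text)} " + " ;\n    ".join(parts) + " ."


def annotations_to_rdf(spans: List[Span],
                       relations: List[Tuple[str, str, str]],
                       lang: str = "en") -> List[str]:
    cognizers = [s for s in spans if s.opinion]                  # STEP 1
    items = {s.span_id: s for s in spans if s.is_item_title}     # STEP 2
    blocks = []
    for cog in cognizers:                                        # STEP 3
        for source, label, target in relations:
            if source != cog.span_id or label != "stm:Object" or target not in items:
                continue
            sebi_class = OPINION_TO_CLASS.get(cog.opinion)
            if sebi_class is None:
                continue  # unmapped opinion label, e.g. "Neutral"
            item = items[target]                                 # STEP 4
            act = f"ex:{slug(cog.text)}_about_{slug(item.text)}"
            blocks.append("\n".join([
                f"{act} rdf:type hico:InterpretationAct ;",
                f"    prov:wasAttributedTo ex:{slug(cog.text)} .",
                describe(cog, "dct:Agent", lang),
                describe(item, None, lang),
                f"<< ex:{slug(item.text)} rdf:type {sebi_class} >> "
                f"prov:wasDerivedFrom {act} .",
            ]))
    return blocks


# "Q_PLACEHOLDER" is a stand-in, not a real Wikidata identifier.
spans = [
    Span("s1", "Scholar A", opinion="Forgery"),
    Span("s2", "Document Z", is_item_title=True, wikidata_id="Q_PLACEHOLDER"),
]
relations = [("s1", "stm:Object", "s2")]
rdf_blocks = annotations_to_rdf(spans, relations)
print("\n\n".join(rdf_blocks))
```

Run on a pilot document, a function of this kind acts as the unit test described above: an annotation pattern that yields no block, or a block that fails to parse, signals a gap in the schema.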
4.3 Pipeline architecture development (Task III)
This subsection implements Task III of the ATR4CH methodology (Section 3.4), designing and implementing computational tools to automatically extract COPs from text using the annotation model developed in Task II as the target schema. Following the methodology's emphasis on incremental development, extraction capabilities were developed and validated for each COP using the same Pilot Corpus documents that were annotated in Task II, rather than building the complete extraction pipeline before testing.
4.3.1 Architectural design and tool selection
Following the task decomposition and tool selection guidelines in Task III (Section 3.4), the pipeline was designed around the annotation model elements developed in Task II, prioritizing based on COP semantic importance and accounting for information manifestation patterns identified during corpus analysis in Task I. The KE task addresses CH corpora's domain-specific characteristics and limited annotated training data through a modular architecture enabling incremental extraction.
The pipeline integrates three complementary technologies, each addressing specific extraction requirements identified through the methodology:
GLiNER for NER: Provides lightweight, generalist NER using custom entity types, enabling precise character-level span identification for entity extraction without requiring task-specific training data. This addresses the limited annotated training data characteristic of CH domains, as identified in the methodology's tool selection strategy.
LLMs for Structured Extraction: Handle complex information extraction through JSON schema-based responses (Schick et al., 2023; Qin et al., 2024) using ICL strategies (Brown et al., 2020; Min et al., 2022). We evaluated three models at varying parameter scales to understand performance trade-offs and cost-effectiveness, following the methodology's resource-adaptive approach:
Claude Sonnet 3.7 [19] as the largest model
Llama 3.3 70B (Dubey et al., 2024) as a medium-sized model
GPT-4o-mini [20] as the smallest model [estimated 8–14 billion active parameters (Ben Abacha et al., 2025)].
Rule-Based Entity Linking: Employs the Wikibase API [21] with domain-specific heuristics, integrating the knowledge base resources identified during annotation model development (Task II, Section 4.2). After evaluating various state-of-the-art solutions, this approach proved most effective for historical entities and CH concepts, providing reliable external knowledge base integration while handling the specialized vocabulary of authenticity assessment debates.
Paragraph-Level Processing Strategy: While the selected LLMs can process complete documents, the system automatically selects only relevant paragraphs whenever possible. This design serves three purposes: (1) reducing content volume per processing step to minimize potential opinion overlap between entities, which we hypothesize improves precision; (2) demonstrating scalability to documents of arbitrary length; and (3) maintaining computational efficiency and cost-effectiveness by minimizing token consumption per API call.
4.3.2 Sequential pipeline implementation
Following the modular approach specified in Task III (Section 3.4), the KE pipeline consists of six sequential components, each enriching the output before passing it to the next. Each component targets specific annotation schema elements from Task II and produces a JSON output following a predefined schema designed for conversion to RDF. Development proceeded through preliminary implementations and sequential testing over the COPs identified in Task I until the complete KG could be extracted.
Figure 10 presents the pipeline architecture, showing the correspondence between pipeline components and the COPs identified in Task I:
[Figure 10: Sequential pipeline architecture for SEBI-based KG generation, showing correspondence between components and COPs. Source: Authors' own work. Flowchart: a document that discusses no CH item is skipped; otherwise item metadata is extracted and injected into later steps, entities are identified with GLiNER and linked to Wikidata, entities expressing an opinion on the CH item feed the extraction of statement provenance, evidence and hypotheses, and the merged KG is reconciled with SEBI individuals using SKOS, enhanced with Wikidata and mapped to RDF to produce the SEBI-KG.]
Metadata Extraction: Targets COP 1 (CH Item Metadata) – Raw text documents → alleged and settled item metadata
Opinion Holder Identification: Targets COP 2 (Scholarly Opinions) – Item metadata + text → entity mentions with opinion classifications
Entity Resolution: Integrates knowledge bases identified in Task II – Entity mentions → Wikidata-linked entity clusters
Opinion Extraction: Completes COP 2 extraction – Linked entities + paragraphs → structured authenticity opinions
Evidence Mining: Targets COP 3 (Evidential Features) – Opinions + contexts → feature evaluations with polarity
Hypothesis Extraction: Targets COP 4 (Alternative Hypotheses) – Evidence + full context → conflicting statements and alternative theories.
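The sequential design, in which each component enriches a shared output before handing it on, can be sketched as below. The component bodies are stubs and the state keys are hypothetical names chosen for illustration; only the ordering of the six components and the rule that a document without a CH item is skipped come from the text and Figure 10.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Component = Callable[[State], State]


def extract_metadata(state: State) -> State:
    # Stub: a real component would prompt an LLM for alleged item metadata.
    items = [{"item": "placeholder item"}] if state["text"].strip() else []
    return {**state, "items": items, "skip": not items}


def stub(output_key: str) -> Component:
    """Build a placeholder component that adds an empty result under output_key."""
    def component(state: State) -> State:
        return {**state, output_key: []}
    return component


PIPELINE: List[Tuple[str, Component]] = [
    ("metadata_extraction", extract_metadata),
    ("opinion_holder_identification", stub("cognizers")),
    ("entity_resolution", stub("linked_entities")),
    ("opinion_extraction", stub("opinions")),
    ("evidence_mining", stub("evidence")),
    ("hypothesis_extraction", stub("hypotheses")),
]


def run_pipeline(text: str, pipeline: List[Tuple[str, Component]] = PIPELINE) -> State:
    """Run the components in order; each receives the output of the previous one."""
    state: State = {"text": text, "trace": []}
    for name, component in pipeline:
        state = component(state)
        state = {**state, "trace": state["trace"] + [name]}
        if state.get("skip"):
            break  # no CH item discussed: the document is skipped
    return state


result = run_pipeline("An article discussing a disputed charter.")
print(result["trace"])
```

Keeping every intermediate result in one JSON-serializable state makes it possible to test each component on the Pilot Corpus in isolation before the next one is added.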
The following subsections detail each component's implementation, demonstrating how the annotation schemas developed in Task II guided extraction design.
4.3.3 Component 1: CH item metadata extraction
COP Addressed: COP 1 (CH Item Metadata, Section 4.1.3)
Annotation Schema: CH Item Metadata Layer (Section 4.2)
Input: Raw Wikipedia articles in markup (.txt files)
Output: Cleaned articles; JSON with alleged item metadata.
This component implements extraction for the first COP, identifying and extracting metadata about CH items discussed in each article. The LLM is instructed to return JSON, following a predefined schema, that describes all items under discussion. The schema targets the alleged metadata elements defined in the annotation model (what items claim to be), including purported authors, creation dates, locations, item types and subject matter. The task relies on ICL in a few-shot setting (with three examples), using CoT reasoning. Figure 11 shows an example input and JSON output.
[Figure 11: CH item metadata extraction from source text to structured JSON output. The input is the opening passage of the Donation of Constantine article; the JSON output contains the fields item (“Donation of Constantine”), alleged_author (“Constantine the Great”), alleged_date (“4th century”), alleged_location (“Rome”) and item_type (“decree”).]
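A minimal sketch of the two deterministic ends of this component, assembling a few-shot prompt and validating the returned JSON, is given below. The field names are those shown in Figure 11; the prompt wording, the function names and the assumption that the response is a single JSON object (rather than a list of items) are illustrative, and no LLM call is made.

```python
import json
from typing import Dict, List

# Field names as they appear in the JSON output of Figure 11.
REQUIRED_FIELDS = ("item", "alleged_author", "alleged_date",
                   "alleged_location", "item_type")


def build_prompt(examples: List[Dict[str, str]], article: str) -> str:
    """Assemble a few-shot prompt; the instruction wording is hypothetical."""
    parts = [
        "Extract the alleged metadata of the cultural heritage item discussed "
        "in the text. Reason step by step, then answer with a JSON object."
    ]
    for example in examples:
        parts.append(f"Text: {example['text']}\nJSON: {example['json']}")
    parts.append(f"Text: {article}\nJSON:")
    return "\n\n".join(parts)


def parse_metadata(raw: str) -> Dict[str, str]:
    """Parse a model response and check that all expected fields are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {field: data[field] for field in REQUIRED_FIELDS}


# The output shown in Figure 11, used here as a stand-in for a model response.
raw_response = (
    '{"item": "Donation of Constantine", '
    '"alleged_author": "Constantine the Great", '
    '"alleged_date": "4th century", '
    '"alleged_location": "Rome", '
    '"item_type": "decree"}'
)
metadata = parse_metadata(raw_response)

# The paper uses three examples; one is enough to show the prompt layout.
prompt = build_prompt(
    [{"text": "The Donation of Constantine is a forged Roman imperial decree.",
      "json": raw_response}],
    "Text of the article to process.",
)
print(metadata["alleged_author"])
```

Validating the response before it enters the pipeline keeps malformed output from propagating to the components that consume the item metadata.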
4.3.4 Component 2: cognizer identification
COP Addressed: COP 2 (Scholarly Opinions, Section 4.1.3)
Annotation Schema: Entity and Opinion layers (Section 4.2)
Input: Cleaned article + item metadata (output of Component 1)
Output: JSON with is_cognizer classification, coreferences.
This component begins extraction of the second COP by identifying scholarly agents. Following the entity identification approach developed during annotation, it employs GLiNER for NER, targeting people, organizations, groups and locations as specified in the annotation schema. GLiNER identifies precise character-level spans, enabling exact identification and grouping of the paragraphs in which each entity appears. The LLM then performs binary classification (is_expressing_opinion: True/False) to identify which entities function as Cognizers, and also returns additional textual mentions and coreferences. The task relies on ICL in a Few-Shot setting (with three examples), using CoT reasoning.
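The span-based grouping of paragraphs described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the `EntitySpan` type, the helper names and the assumption that paragraphs are separated by blank lines are all hypothetical, and the spans are built by string search as a stand-in for the character offsets an NER model would return.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """Character-level entity span (hypothetical stand-in for NER output)."""
    start: int
    end: int
    text: str
    label: str


def paragraph_offsets(article: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of paragraphs, assuming blank-line separators."""
    offsets = []
    cursor = 0
    for paragraph in article.split("\n\n"):
        offsets.append((cursor, cursor + len(paragraph)))
        cursor += len(paragraph) + 2  # skip the two newline characters
    return offsets


def group_paragraphs_by_entity(article: str, spans: list[EntitySpan]) -> dict[str, list[int]]:
    """Map each entity surface form to the indices of the paragraphs containing it."""
    offsets = paragraph_offsets(article)
    groups: dict[str, list[int]] = defaultdict(list)
    for span in spans:
        for index, (start, end) in enumerate(offsets):
            inside = start <= span.start and span.end <= end
            if inside and index not in groups[span.text]:
                groups[span.text].append(index)
    return dict(groups)


def make_span(article: str, text: str, label: str) -> EntitySpan:
    """Build a span by locating the first occurrence of `text` (illustration only)."""
    start = article.index(text)
    return EntitySpan(start, start + len(text), text, label)


article = (
    "Lorenzo Valla argued that the Donation was a forgery.\n\n"
    "The papacy used the Donation in support of its claims."
)
spans = [
    make_span(article, "Lorenzo Valla", "person"),
    make_span(article, "papacy", "organization"),
]
groups = group_paragraphs_by_entity(article, spans)
print(groups)
```

Grouping by offsets rather than by string search over paragraphs keeps each mention tied to the exact location the NER step reported, which matters when the same surface form appears in several paragraphs.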
4.3.5 Component 3: entity resolution and linking
COP Addressed: Supporting infrastructure for all COPs
Annotation Schema: Knowledge Base integration (Section 4.2)
Input: Cognizers and coreferences (output of Component 2)
Output: JSON with relevant paragraphs grouped by Cognizer and biographical information.
This component implements the knowledge base integration strategy established during annotation model development (Task II, Section 4.2). It performs coreference resolution and Entity Linking through a three-stage pipeline. First, it collects all entity mentions across paragraphs and groups them by exact string match. Second, it applies coreference resolution by calculating mention similarity using a Jaccard coefficient over word sets (with common stop words removed), clustering mentions with similarity scores exceeding 0.7 when entity types are compatible. The system selects the longest string as the representative mention for each cluster. Third, it performs Entity Linking by querying the Wikidata Search API (wbsearchentities) with each mention variant, retrieving up to five candidates per query and deduplicating results by Wikidata identifier.
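The second stage can be illustrated with a short sketch. It assumes a small illustrative stop-word list, treats "compatible" entity types as identical types, and uses a greedy single-pass clustering; none of these details are specified in the text, so they should be read as assumptions rather than as the pipeline's actual behaviour.

```python
# Illustrative stop-word list; the list used by the pipeline is not given in the text.
STOP_WORDS = {"the", "of", "a", "an", "and"}


def word_set(mention: str) -> frozenset:
    """Lowercased word set of a mention with stop words removed."""
    return frozenset(w for w in mention.lower().split() if w not in STOP_WORDS)


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over the word sets of two mentions."""
    set_a, set_b = word_set(a), word_set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cluster_mentions(mentions: list[tuple[str, str]], threshold: float = 0.7) -> list[list[str]]:
    """Greedily cluster (text, entity_type) mentions whose similarity exceeds the threshold."""
    clusters: list[tuple[str, list[str]]] = []  # (entity_type, member texts)
    for text, entity_type in mentions:
        for cluster_type, members in clusters:
            # Assumption: types are compatible only when they are equal.
            if cluster_type == entity_type and any(jaccard(text, m) > threshold for m in members):
                if text not in members:
                    members.append(text)
                break
        else:
            clusters.append((entity_type, [text]))
    return [members for _, members in clusters]


def representative(cluster: list[str]) -> str:
    """Select the longest string as the representative mention."""
    return max(cluster, key=len)


mentions = [
    ("Nicholas of Cusa", "person"),
    ("Nicholas Cusa", "person"),
    ("Cusa", "location"),
    ("Lorenzo Valla", "person"),
]
clusters = cluster_mentions(mentions)
print(clusters, representative(clusters[0]))
```

Note that a word-set Jaccard threshold of 0.7 is strict: "Valla" and "Lorenzo Valla" score 0.5 and would not be merged at this stage, so short-form mentions depend on the coreferences returned by Component 2.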
Each candidate undergoes scoring through the Wikidata API (wbgetentities) to retrieve claims and labels. The scoring function combines three weighted components: (1) name similarity between cluster mentions and candidate labels/aliases, calculated using Jaccard similarity over word sets with special handling for first name/initial matching (weight: 0.6 for labels, 0.3 for aliases); (2) entity type compatibility verified through Wikidata property P31 (instance of), with type mappings defined for persons (Q5), organizations (Q43229, Q7278), locations (Q2221906) and groups (Q16334295); and (3) for person entities, occupation relevance assessed through Wikidata property P106 (occupation), comparing retrieved occupation identifiers against a curated vocabulary of scholarly occupations (weight: 0.1). The system applies a minimum threshold of 0.4 for candidate acceptance and selects the highest-scoring candidate per cluster. When multiple candidates exceed the threshold, the system retrieves detailed data in batch requests and calculates an overlap score measuring the proportion of cluster mentions matching each candidate's labels and aliases (using both exact matches and fuzzy matching with a similarity threshold of 0.7), combining this with the initial score through weighted average (0.6 for overlap, 0.4 for initial score).
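A minimal sketch of the scoring logic follows, using the weights and thresholds stated above. Several details are assumptions made for illustration: type compatibility is treated as a hard filter (the text gives no weight for it), the best label and alias similarities are taken over all cluster mentions, word-set Jaccard stands in for the fuzzy matcher, first name/initial handling is omitted, and the occupation identifiers are placeholders rather than real Wikidata IDs.

```python
from dataclasses import dataclass, field

LABEL_WEIGHT, ALIAS_WEIGHT, OCCUPATION_WEIGHT = 0.6, 0.3, 0.1
ACCEPT_THRESHOLD = 0.4
OVERLAP_WEIGHT, INITIAL_WEIGHT = 0.6, 0.4

# P31 (instance of) values per entity type, as listed in the text.
TYPE_TO_P31 = {
    "person": {"Q5"},
    "organization": {"Q43229", "Q7278"},
    "location": {"Q2221906"},
    "group": {"Q16334295"},
}
# Placeholder identifiers standing in for the curated vocabulary of scholarly occupations.
SCHOLARLY_OCCUPATIONS = {"OCC_PHILOLOGIST", "OCC_HISTORIAN"}


@dataclass
class Candidate:
    qid: str
    labels: list
    aliases: list = field(default_factory=list)
    instance_of: set = field(default_factory=set)   # P31 values
    occupations: set = field(default_factory=set)   # P106 values


def jaccard(a: str, b: str) -> float:
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b) if set_a and set_b else 0.0


def best_similarity(mentions, names) -> float:
    return max((jaccard(m, n) for m in mentions for n in names), default=0.0)


def initial_score(mentions, entity_type, candidate) -> float:
    if not (TYPE_TO_P31.get(entity_type, set()) & candidate.instance_of):
        return 0.0  # assumption: incompatible types are rejected outright
    score = LABEL_WEIGHT * best_similarity(mentions, candidate.labels)
    score += ALIAS_WEIGHT * best_similarity(mentions, candidate.aliases)
    if entity_type == "person" and candidate.occupations & SCHOLARLY_OCCUPATIONS:
        score += OCCUPATION_WEIGHT
    return score


def overlap_score(mentions, candidate, fuzzy_threshold=0.7) -> float:
    """Proportion of mentions matching a label or alias exactly or fuzzily."""
    names = candidate.labels + candidate.aliases
    matched = sum(
        1 for m in mentions
        if any(m == n or jaccard(m, n) >= fuzzy_threshold for n in names)
    )
    return matched / len(mentions) if mentions else 0.0


def combined_score(initial: float, overlap: float) -> float:
    return OVERLAP_WEIGHT * overlap + INITIAL_WEIGHT * initial


def select_candidate(mentions, entity_type, candidates):
    scored = [(initial_score(mentions, entity_type, c), c) for c in candidates]
    accepted = [(s, c) for s, c in scored if s >= ACCEPT_THRESHOLD]
    return max(accepted, key=lambda pair: pair[0])[1] if accepted else None


valla = Candidate("Q214115", ["Lorenzo Valla"], ["Laurentius Valla"], {"Q5"}, {"OCC_PHILOLOGIST"})
place = Candidate("PLACEHOLDER_PLACE", ["Valla"], [], {"Q2221906"})
mentions = ["Lorenzo Valla", "Valla"]
chosen = select_candidate(mentions, "person", [valla, place])
```

In this sketch the combined score is only needed when more than one candidate passes the acceptance threshold, mirroring the two-step procedure described in the text.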
4.3.6 Component 4: opinion extraction and classification
COP Addressed: COP 2 (Scholarly Opinions, Section 4.1.3)
Annotation Schema: Opinion classification layer (Section 4.2)
Input: Entity + Wikidata Information (if linked) + paragraphs where entity is mentioned
Output: JSON describing (1) the Cognizer's opinion, (2) their opinion type and (3) the metadata of the opinion.
This component completes extraction of the second COP by extracting and classifying authenticity opinions. If the entity has been successfully linked to Wikidata in Component 3, this information is provided to the model. The extraction process captures the main elements of the annotation schema: opinion targets (which documents or artifacts), opinion types following SEBI classifications (Authentic, Forgery, Formal forgery, Content forgery, Neutral), confidence levels expressed by the Cognizer, temporal contexts (when opinions were expressed) and geographic contexts where relevant.
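As an illustration, the extracted opinion could be checked against the five SEBI opinion types before further processing. The sketch below is not the pipeline's schema: the JSON key names and the `Opinion` record are hypothetical, and only the opinion type labels are taken from the text.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OpinionType(Enum):
    """Opinion types following the SEBI classifications listed above."""
    AUTHENTIC = "Authentic"
    FORGERY = "Forgery"
    FORMAL_FORGERY = "Formal forgery"
    CONTENT_FORGERY = "Content forgery"
    NEUTRAL = "Neutral"


@dataclass
class Opinion:
    cognizer: str
    target: str
    opinion_type: OpinionType
    confidence: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None


def parse_opinion(raw_json: str) -> Opinion:
    """Parse one opinion object; unknown opinion types raise ValueError."""
    data = json.loads(raw_json)
    return Opinion(
        cognizer=data["cognizer"],
        target=data["target"],
        opinion_type=OpinionType(data["opinion_type"]),
        confidence=data.get("confidence"),
        date=data.get("date"),
        location=data.get("location"),
    )


raw = json.dumps({
    "cognizer": "Caesar Baronius",
    "target": "Donation of Constantine",
    "opinion_type": "Forgery",
    "date": "1588-1607",
})
opinion = parse_opinion(raw)
print(opinion.opinion_type.value, opinion.date)
```

Rejecting labels outside the closed vocabulary at parse time keeps malformed LLM output from reaching the RDF mapping step.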
4.3.7 Component 5: evidence mining and feature assessment
COP Addressed: COP 3 (Evidential Features, Section 4.1.3)
Annotation Schema: Evidence and Features Layer (Section 4.4.3)
Input: Structured opinions (output of Component 4) + contextual paragraphs
Output: JSON with supporting evidence and evaluations for each opinion.
This component implements extraction for the third COP, enriching opinions with the evidence and features being evaluated. Features are organized into three categories following the SEBI ontology and the annotation model developed in Task II: intrinsic features (content, language, style, orthography), extrinsic features (handwriting, ink, material support, physical characteristics) and provenance information (historical context, witness accounts, transmission history). For each feature, the system determines evaluation criteria, including consistency, presence, completeness, reliability and veridicality, as specified in the annotation schema. Each evaluation is assigned a polarity (positive, negative or neutral evidence) and is linked to the scholarly opinion it supports.
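The controlled vocabularies named above lend themselves to a simple consistency check on each extracted evaluation. The following sketch assumes hypothetical key names ("feature", "criterion", "polarity") and lowercased vocabulary terms; only the terms themselves come from the text.

```python
from typing import Optional

FEATURE_CATEGORIES = {
    "intrinsic": {"content", "language", "style", "orthography"},
    "extrinsic": {"handwriting", "ink", "material support", "physical characteristics"},
    "provenance": {"historical context", "witness accounts", "transmission history"},
}
EVALUATION_CRITERIA = {"consistency", "presence", "completeness", "reliability", "veridicality"}
POLARITIES = {"positive", "negative", "neutral"}


def categorize_feature(feature: str) -> Optional[str]:
    """Return the feature category, or None if the feature is not in the vocabulary."""
    for category, features in FEATURE_CATEGORIES.items():
        if feature in features:
            return category
    return None


def validate_evaluation(evaluation: dict) -> list:
    """Return a list of problems found in one extracted evaluation (empty if valid)."""
    problems = []
    if categorize_feature(evaluation.get("feature", "")) is None:
        problems.append("unknown feature")
    if evaluation.get("criterion") not in EVALUATION_CRITERIA:
        problems.append("unknown criterion")
    if evaluation.get("polarity") not in POLARITIES:
        problems.append("unknown polarity")
    return problems


# Valla's assessment of the Donation's language, as annotated in Figure 15.
valla_evaluation = {"feature": "language", "criterion": "consistency", "polarity": "negative"}
print(categorize_feature("language"), validate_evaluation(valla_evaluation))
```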
4.3.8 Component 6: hypothesis extraction
COP Addressed: COP 4 (Alternative Hypotheses, Section 4.1.3)
Annotation Schema: Scholarly Hypotheses Layer (Section 4.4.3)
Input: Opinions + evidence evaluations + full document context
Output: JSON with hypotheses about document origins, intent, etc.
This final component implements extraction for the fourth COP, enriching the output with scholars' hypotheses. The hypothesis types correspond directly to the relation types defined in the annotation model: authorship hypotheses (who actually created items if not alleged authors?), dating hypotheses (when were items actually created if not alleged dates?), location hypotheses (where were items actually created if not alleged locations?) and motivation hypotheses (why were items created or forged?). The system also handles cases where Cognizers accept alleged metadata as authentic. For consistency and to avoid negated categories, polarity (positive/negative) is included as a field.
4.3.9 Immediate validation on pilot corpus
Following the immediate validation strategy specified in Task III (Section 3.4), once extraction for each COP was implemented, the pipeline was tested on the Pilot Corpus and results were compared against manual annotations created in Task II. This immediate validation cycle enabled rapid identification of extraction bottlenecks or misalignments before proceeding to the next COP. This cycle continued until extraction pipelines had been developed and validated for all identified COPs using the Pilot Corpus.
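A comparison of this kind can be sketched as a set-based precision, recall and F1 computation. The exact-match criterion and the tuple representation of annotations are assumptions made for illustration, and the example values are toy data, not results from the Pilot Corpus.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Exact-match precision, recall and F1 between extracted and manual annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: (cognizer, opinion type) pairs for one article.
gold = {
    ("Lorenzo Valla", "Forgery"),
    ("Caesar Baronius", "Forgery"),
    ("Johannes Fried", "Forgery"),
}
predicted = {
    ("Lorenzo Valla", "Forgery"),
    ("Caesar Baronius", "Forgery"),
    ("Johannes Fried", "Neutral"),  # deliberately wrong label for illustration
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Running such a check per COP after each implementation step makes it possible to locate which component introduces an error before it propagates downstream.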
4.4 Integration and refinement (Task IV)
This subsection implements Task IV of the ATR4CH methodology (Section 3.5), harmonizing the COPs (Task I), annotation schemas (Task II) and pipeline components (Task III) into a coherent, end-to-end KE system. Task IV represents a critical transition point: the modular components developed and validated on the Pilot Corpus are now integrated into a production-ready system, and the annotation model that emerged from processing individual COPs is consolidated into a production version suitable for full corpus processing and comprehensive GT creation.
4.4.1 Pipeline integration
Following the integration approach specified in Task IV (Section 3.5), the modular extraction components developed for individual COPs were integrated into a unified text-to-knowledge-graph pipeline. Integration addressed dependencies between patterns, ensured consistent entity resolution across components and optimized the overall processing architecture. The integrated pipeline processes documents from raw text to complete KGs that instantiate all identified COPs.
4.4.2 End-to-end pipeline testing
Following the comprehensive testing approach in Task IV (Section 3.5), testing over the pilot corpus processed documents from raw text to final KGs, revealing systematic issues including data sparseness patterns, interaction effects between COP extractors, inconsistent tool coverage across discourse types and representation generation errors. Testing systematically evaluated performance across document types and semantic phenomena, with particular attention to error propagation through pipeline stages.
4.4.3 Production annotation model development
Following the annotation model refinement approach in Task IV (Section 3.5), the annotation model evolved, on the basis of the testing results, into a production-ready version suitable for both manual annotation and automated extraction. Once extraction pipelines for the individual COPs had been developed and validated (Tasks II–III), the annotation schemas were consolidated into a production model suitable for comprehensive GT creation. This refined model captures CH item metadata, evidence and features, and scholarly hypotheses through additional layers developed iteratively during pipeline testing on the Pilot Corpus, maintaining backward compatibility with COPs while adding elements crucial for automated extraction.
CH Item Metadata Layer. This layer captures alleged metadata – descriptive information (creator, date, location) that the document or artifact purports about itself, before scholarly critical analysis. This includes face-value claims presented within the item or by whoever claimed to find the item regarding authorship, creation date, geographic origin and other identifying characteristics. Annotations include AllegedCreator, AllegedDate, AllegedLocation, ItemSubject and ItemType, plus properties for formal forgeries (ItemCreator, ItemDate, ItemLocation) (see Figure 12).
Figure 12. Alleged metadata annotation for the Donation of Constantine. Source: Authors' own work. Annotated sentence: “The Donation of Constantine (Latin: Donatio Constantini) is a forged Roman imperial decree by which the 4th-century emperor Constantine the Great supposedly transferred authority over Rome and the western part of the Roman Empire to the Pope”. Annotations: “Donation of Constantine” is labelled DocumentTitle | Donation of Constantine and is linked by algd:Type to DocumentType | imperial decree (“Roman imperial decree”), by algd:Date to AllegedDate | 4th century | TimeSpan (“4th-century”), by algd:Creator to AllegedCreator | Constantine the Great (“Constantine the Great”) and by algd:Topic to DocumentTopic | authority (“supposedly transferred authority over Rome and the western part of the Roman Empire to the Pope”).
Evidence and Features Layer. This layer generates Evidence nodes connected to InterpretationAct Named Graphs. It employs four tagsets: Feature (SEBI vocabulary terms for intrinsic/extrinsic features and provenance), FeatureAssessment (evaluation perspectives: consistency, presence, completeness, reliability, veridicality), FeatureAssessmentPolarity (negative, neutral, positive) and FeatureAssessmentConfidence.
Consider Lorenzo Valla's assessment of the Donation's language features (Figure 15), which converts to three evidence structures linking textual features to evaluation criteria and polarities.
Listing 4.4.3 shows the evidence mapping algorithm.
Evidence and Feature Mapping Algorithm
STEP 1: Extract Evaluated Features
  Select all spans marked as Feature
  WHERE span also has FeatureAssessment label, FeatureAssessmentPolarity,
        FeatureAssessmentConfidence
  => FeatureSet(FeatureSpan, FeatureClass, FeatureAssessment,
                FeatureAssessmentPolarity, FeatureAssessmentConfidence)
STEP 2: Extract Cognizer Entities
  Select all spans marked as Entity
  WHERE span also has Opinion tagset label
  => CognizerSet(Cognizer(CognizerSpan, Opinion, WikidataID))
STEP 3: Find Relations
  For FeatureSpan in FeatureSet, check if CognizerSpan
  has stm:Feature relation to any span(s) in FeatureSet
  => Valid tuples (Cognizer, FeatureSet)
STEP 4: Generate nodes
  For each matching pattern:
  |-- Generate/Reuse URI for Cognizer
  |-- Add owl:sameAs + Wikidata ID
  |-- Generate URI for sebi:Evidence graph
  |-- match FeatureAssessment individual with sebi:Evaluation individual
  |-- match FeatureAssessmentPolarity
  |-- attach FeatureAssessmentConfidence score
  |-- Generate URI for sebi:Feature graph
  |-- attach FeatureSpan through rdfs:label
  |-- attach FeatureClass through skos:broader
STEP 5: Generate RDF graph
  +-- Apply template:
      kb:{cognizer_uri}_about_{item_uri}_{idx} a sebi:Evidence;
          sebi:assess kb:{feature_uri};
          sebi:evaluate sebi:evaluation_uri;
          sebi:hasEvaluationScore "polarity"@language;
          sebi:support kb:interpretation_act;
          ov:confidence 1.0.
      kb:{feature_uri} a sebi:Feature;
          rdfs:label "{FeatureSpan}"@language;
          sebi:isAssessedBy kb:{cognizer_uri}_about_{item_uri}_{idx};
          skos:broader sebi:{feature_vocabulary_term}.
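The template in Step 5 of the evidence mapping algorithm could be instantiated with plain string formatting, as in the following sketch. The `urify` scheme, the function signature and the mapping of the assessment directly to a `sebi:` local name are assumptions for illustration; prefix declarations and literal escaping are omitted, and the published mapping scripts may differ.

```python
import re


def urify(text: str) -> str:
    """Lowercase a span and join alphanumeric runs with underscores (hypothetical scheme)."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def render_evidence(cognizer, item, idx, feature_span, feature_class,
                    assessment, polarity, confidence, lang="en") -> str:
    """Fill the Evidence/Feature template for one (Cognizer, Feature) tuple."""
    act_uri = f"kb:{urify(cognizer)}_about_{urify(item)}"
    evidence_uri = f"{act_uri}_{idx}"
    feature_uri = f"kb:{urify(feature_span)}"
    lines = [
        f"{evidence_uri} a sebi:Evidence ;",
        f"    sebi:assess {feature_uri} ;",
        f"    sebi:evaluate sebi:{assessment} ;",
        f'    sebi:hasEvaluationScore "{polarity}"@{lang} ;',
        f"    sebi:support {act_uri} ;",
        f"    ov:confidence {confidence} .",
        f"{feature_uri} a sebi:Feature ;",
        f'    rdfs:label "{feature_span}"@{lang} ;',
        f"    sebi:isAssessedBy {evidence_uri} ;",
        f"    skos:broader sebi:{feature_class} .",
    ]
    return "\n".join(lines)


# Valla's assessment of the Donation's language (Figure 15).
turtle = render_evidence(
    "Lorenzo Valla", "Donation of Constantine", 1,
    feature_span="language", feature_class="language",
    assessment="consistency", polarity="negative", confidence=1.0,
)
print(turtle)
```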
Scholarly Hypotheses Layer. This layer captures alternative hypotheses through four relation types linking Cognizers to Wikidata entities: stm:CreatorHypothesis, stm:DatingHypothesis, stm:LocationHypothesis and stm:ReasonHypothesis (see Figures 13–15).
Figure 13. Caesar Baronius's admission of forgery with provenance annotation. Source: Authors' own work. Annotated sentence: “The Donation continued to be tacitly accepted as authentic until Caesar Baronius in his Annales Ecclesiastici (published 1588–1607) admitted that it was a forgery, after which it was almost universally accepted as such”. Annotations: “Caesar Baronius” is labelled Caesar Baronius | stm:Forgery; an stm:Object relation connects this label with Donation of Constantine (“Donation”), an stm:Provenance relation points from it to Annales Ecclesiastici, and an stm:Date relation points from it to TimeSpan (“1588–1607”).
Figure 14. Johannes Fried's hypotheses annotation for the Donation of Constantine. Source: Authors' own work. Annotated text: “In one study, an attempt was made at dating the forgery to the 9th century, and placing its composition at Corbie Abbey, in northern France. German medievalist Johannes Fried draws a distinction between the Donation of Constantine and an earlier, also forged version, the Constitutum Constantini, which was included in the collection of forged documents, the False Decretals, compiled in the latter half of the 9th century. Fried argues the Donation is a later expansion of the much shorter Constitutum”. Annotations: “Johannes Fried” is labelled Johannes Fried | stm:Forgery and is connected by stm:DatingHypothesis to 9th century | TimeSpan, by stm:CreatorHypothesis to Corbie Abbey, by stm:LocationHypothesis to France and by stm:Object to Donation of Constantine. An stm:Feature relation connects the same label with the feature annotation provenance | consistency | Certain | feat:Positive placed on the final sentence.
Listing 4.4.3 details the hypotheses mapping algorithm.
Hypotheses Mapping Algorithm
STEP 1: Extract Hypothesis Relations
  Select all relations of type:
  |-- stm:CreatorHypothesis
  |-- stm:DatingHypothesis
  |-- stm:LocationHypothesis
  |-- stm:ReasonHypothesis
  => HypothesesSet(CognizerSpan, HypothesisType, TargetSpan, WikidataID)
STEP 2: Extract Cognizer Entities
  Select all spans marked as Entity
  WHERE span also has Opinion tagset label
  => CognizerSet(CognizerSpan, Opinion, WikidataID)
STEP 3: Find Valid Patterns
  For each relation in HypothesesSet:
    Check if CognizerSpan exists in CognizerSet
  => Valid tuples (Cognizer, HypothesisType, Target)
STEP 4: Generate Target URIs
  For each matching pattern:
  |-- Generate/Reuse URI for Cognizer
  |-- Generate/Reuse URI for Item
  |-- Generate/Reuse URI for Target entity
  |-- Map HypothesisType to corresponding RDF property
STEP 5: Generate RDF-star Statements
  +-- Apply template:
      kb:{target_uri} a {target_class};
          owl:sameAs wd:wikidata_id;
          # if Wikidata ID not available
          # kb:{urifiedTargetSpan} a {target_uri};
          rdfs:label "{TargetSpan}"@language.
      << kb:{item_uri} dct:creator kb:{target_uri} >>
          prov:wasDerivedFrom kb:cognizer_uri_about_item_uri.
      << kb:{item_uri} dct:date kb:{target_uri} >>
          prov:wasDerivedFrom kb:cognizer_uri_about_item_uri.
      << kb:{item_uri} sebi:location kb:{target_uri} >>
          prov:wasDerivedFrom kb:cognizer_uri_about_item_uri.
      << kb:{item_uri} sebi:intendedTo kb:{target_uri} >>
          prov:wasDerivedFrom kb:cognizer_uri_about_item_uri.
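Steps 3 to 5 of the hypotheses mapping algorithm can be sketched as a filter followed by template rendering. The pairing of hypothesis types with RDF properties below is inferred from the order of the template statements, and the dictionary keys, function names and pre-built URI fragments are hypothetical; this is an illustration rather than the published mapping script.

```python
# Inferred pairing of relation types with the properties used in the template.
HYPOTHESIS_PROPERTY = {
    "stm:CreatorHypothesis": "dct:creator",
    "stm:DatingHypothesis": "dct:date",
    "stm:LocationHypothesis": "sebi:location",
    "stm:ReasonHypothesis": "sebi:intendedTo",
}


def valid_hypotheses(hypotheses: list, cognizers: set) -> list:
    """STEP 3: keep only relations whose source span is a known Cognizer."""
    return [h for h in hypotheses if h["cognizer"] in cognizers]


def render_hypothesis(item_uri: str, hypothesis: dict) -> str:
    """STEPS 4-5: map the hypothesis type to a property and emit an RDF-star statement."""
    prop = HYPOTHESIS_PROPERTY[hypothesis["type"]]
    source = f"kb:{hypothesis['cognizer']}_about_{item_uri}"
    return (
        f"<< kb:{item_uri} {prop} kb:{hypothesis['target']} >>\n"
        f"    prov:wasDerivedFrom {source} ."
    )


# Johannes Fried's dating hypothesis (Figure 14), plus a relation with no matching Cognizer.
hypotheses = [
    {"cognizer": "johannes_fried", "type": "stm:DatingHypothesis", "target": "9th_century"},
    {"cognizer": "unmatched_span", "type": "stm:LocationHypothesis", "target": "france"},
]
cognizers = {"johannes_fried"}
statements = [
    render_hypothesis("donation_of_constantine", h)
    for h in valid_hypotheses(hypotheses, cognizers)
]
print("\n".join(statements))
```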
4.4.4 Mapping algorithm enhancement
Following the mapping algorithm enhancement approach in Task IV (Section 3.5), preliminary mapping algorithms from Task II were consolidated and enhanced to handle the complete KG structure. As the output JSON model closely resembles the GT structure, the mapping uses similar logic, differing only in that it processes plain JSON rather than the JSON UIMA CAS (Common Analysis System) format used by INCEpTION. Error handling mechanisms manage extraction failures and partial results using validation tools such as SHACL and OWL reasoners, as specified in the methodology.
Figure 15. Lorenzo Valla's opinion with feature assessment annotation. Source: Authors' own work. Annotated sentence: “Later, the humanist and scholar Lorenzo Valla argued in his philological study of the text that the language used in manuscript could not be dated to the 4th century.” Annotations: “Lorenzo Valla” is labelled Lorenzo Valla | stm:Forgery; stm:Feature relations and an stm:DatingHypothesis relation extend from this label, and one stm:Feature relation points to the label language | consistency | Certain | feat:Negative placed above the word “language”.
4.4.5 GT statistics
Each annotation layer maps to RDF following SEBI ontology principles, with Wikidata integration providing entity resolution. The INCEpTION project is available on GitHub alongside mapping scripts [22]. Statistics for annotation results are shown in Table 2.
4.5 KE and evaluation (Task V)
This subsection implements Task V of the ATR4CH methodology (Section 3.6), employing technical validation and domain-expert evaluation to ensure knowledge structures accurately represent domain-specific discourse complexity. The refined system from Task IV is applied to test data separate from the Pilot Corpus used for development, representing the first application of the complete pipeline to unseen documents. The final iterations of the prompts are available in the Appendix.
4.5.1 KG generation
Following the KE approach in Task V (Section 3.6), test datasets were processed through the complete pipeline under realistic deployment conditions, with systematic documentation of performance and failure modes. This represented the first application of the extraction pipeline beyond the Pilot Corpus used for development. The final output is mapped to RDF using the algorithms explained in previous sections. This subsection showcases the produced KGs in RDF-star format; the example shown was generated by the pipeline using Llama 3.3 70B.
Figure 16 shows the general structure of a generated KG from the GraphDB interface (Ontotext, 2024). Each CH item is represented with both alleged metadata (what the item claims to be) and scholarly assessments, as shown in Listing 4.5.1. The Donation of Constantine exemplifies this pattern:
Figure 16. Lorenzo Valla's statement about the Donation of Constantine. Source: Authors' own work. The diagram is a node-link knowledge graph with labeled nodes and directed edges. The item node “donation_of_constanti…” has a “type” edge to “Forgery”, a “date” edge to “8th century” and a “wasDerivedFrom” edge to the node “lorenzo_valla_about_d…”. That node has a “wasAttributedTo” edge to “Lorenzo Valla”, a “type” edge to “InterpretationAct” and an “isSupportedBy” edge (with inverse “support”) to a second node labeled “lorenzo_valla_about_d…”. The second node has a “type” edge to “Evidence”, an “evaluate” edge to “presence” and an “assess” edge (with inverse “isAssessedBy”) to “imperial-era formulas”. The node “imperial-era formulas” has a “type” edge to “Feature” and a “broader” edge to “legal_formula”; it is also connected by “isAssessedBy” and “assess” edges to the nodes “nicholas_of_cusa_abo…” and “reginald_peacock_abou…”.
Document representation with alleged and scholarly metadata
# Basic Item information
kb:donation_of_constantine a sebi:Decree;
    dct:title "Donation of Constantine"@en;
    dct:coverage kb:Rome.
# Item type definition, generated from the text:
sebi:Decree rdfs:subClassOf dcmitype:Text;
    rdfs:label "decree"@en.
# Alleged metadata as quoted triples (what the item purports to be)
<< kb:donation_of_constantine dct:creator kb:constantine_the_great >>
    prov:wasDerivedFrom kb:donation_of_constantine_self_statement.
<< kb:donation_of_constantine dct:date kb:constantines_reign_306-337_ad >>
    prov:wasDerivedFrom kb:donation_of_constantine_self_statement.
<< kb:donation_of_constantine dct:coverage kb:Rome >>
    prov:wasDerivedFrom kb:donation_of_constantine_self_statement.
The following listing shows Lorenzo Valla's interpretation of the Donation.
Lorenzo Valla's interpretation with supporting evidence
# Lorenzo Valla as scholarly agent
kb:lorenzo_valla a sebi:Human, dct:Agent;
    rdfs:label "Lorenzo Valla"@en;
    owl:sameAs wd:Q214115;
    skos:altLabel "Valla"@en;
    wd:occupation kb:Latin_Catholic_priest, kb:philologist,
        kb:philosopher, kb:renaissance_humanist.
# Valla's interpretation act
kb:lorenzo_valla_about_donation_of_constantine a hico:InterpretationAct;
    sebi:date kb:1439-1440;
    prov:wasAttributedTo kb:lorenzo_valla;
    prov:wasQuotedFrom "donation_of_constantine"^^xsd:anyURI;
    cito:isSupportedBy kb:lorenzo_valla_about_donation_of_constantine_1.
# Main authenticity claim
<< kb:donation_of_constantine rdf:type sebi:Forgery >>
    prov:wasDerivedFrom kb:lorenzo_valla_about_donation_of_constantine.
# Alternative dating hypothesis
<< kb:donation_of_constantine dct:date kb:8th_century >>
    prov:wasDerivedFrom kb:lorenzo_valla_about_donation_of_constantine.
# Motivation hypothesis
<< kb:donation_of_constantine sebi:intendedTo kb:political_authority >>
    prov:wasDerivedFrom kb:lorenzo_valla_about_donation_of_constantine.
The supporting evidence for Valla's conclusions is captured through the Evidence graph, shown in the following listing.
Lorenzo Valla's philological evidence structure
# Evidence node linking feature assessment to interpretation
kb:lorenzo_valla_about_donation_of_constantine_1 a sebi:Evidence;
    sebi:assess kb:philological_arguments;
    sebi:evaluate sebi:consistency;
    sebi:hasEvaluationScore "negative"@en;
    sebi:support kb:lorenzo_valla_about_donation_of_constantine;
    ov:confidence 1.0.
# Feature being assessed
kb:philological_arguments a sebi:Feature;
    rdfs:label "philological arguments"@en;
    sebi:isAssessedBy kb:lorenzo_valla_about_donation_of_constantine_1;
    skos:broader kb:language.
4.5.2 Evaluation framework
Following the multi-level evaluation approach specified in Task V (Section 3.6), the evaluation framework provides a multi-dimensional assessment of the KG generation pipeline. Multiple complementary approaches address KG evaluation challenges, integrating human assessment throughout the evaluation pipeline and using F1 score and G-EVAL metrics. The framework systematically addresses five EQs, based on the CQs that defined the original SEBI ontology (Pasqual, 2025).
EQ1: CH Item Metadata Extraction Precision. How accurately does the pipeline extract alleged item metadata compared to expert annotations?
Methodology: This is formulated as a multiclass classification task, evaluating the metadata extraction component against the GT. The classification scheme follows standard evaluation practices: True Positive (TP) for exact matches between model output and GT, False Positive (FP) for incorrect model predictions, True Negative (TN) for correctly identified absence of metadata when GT is also empty and False Negative (FN) for missing outputs when GT contains valid metadata. To accommodate acceptable semantic variations (e.g., alternative titles, location aliases), all FP cases are manually reviewed to identify outputs that are semantically equivalent to the GT and should be reclassified as TP.
Metrics: Micro-averaged results for individual metadata categories (Title, Creator, Date, Location) and macro-averaged overall performance using standard precision, recall and F1-score calculations.
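The EQ1 classification rule and its metrics can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the `accepted_aliases` argument (standing in for the manual review of FP cases) and the toy records are all assumptions made for illustration, and the macro-average is taken here as the plain mean of the per-category scores.

```python
from collections import Counter, defaultdict


def classify(predicted, gold, accepted_aliases=frozenset()):
    """Return TP, FP, TN or FN for one metadata field of one item.

    `accepted_aliases` is a hypothetical stand-in for the manual review step:
    values a reviewer judged semantically equivalent to the GT.
    """
    if not predicted and not gold:
        return "TN"  # absence of metadata correctly identified
    if not predicted:
        return "FN"  # GT holds valid metadata, the model returned nothing
    if predicted == gold or predicted in accepted_aliases:
        return "TP"  # exact match, or reclassified after review
    return "FP"      # any other model prediction


def prf(counts):
    """Precision, recall and F1 from a Counter of outcome labels."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_category):
    """Unweighted mean of each metric over the categories."""
    n = len(per_category)
    return tuple(sum(s[i] for s in per_category.values()) / n for i in range(3))


# Toy records (invented): category, model output, GT value, reviewed aliases.
records = [
    ("Creator", "Constantine the Great", "Constantine the Great", frozenset()),
    ("Creator", "Constantine I", "Constantine the Great", frozenset({"Constantine I"})),
    ("Creator", "", "", frozenset()),
    ("Date", "8th century", "4th century", frozenset()),
    ("Date", "4th century", "4th century", frozenset()),
    ("Date", "", "315 AD", frozenset()),
]

outcomes = defaultdict(Counter)
for category, predicted, gold, aliases in records:
    outcomes[category][classify(predicted, gold, aliases)] += 1

per_category = {category: prf(counts) for category, counts in outcomes.items()}
macro = macro_average(per_category)
```

TN cases are counted but do not enter precision, recall or F1, which is why an item with no alleged metadata in either the output or the GT leaves the scores unchanged.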
EQ2: Scholarly Entity Recognition Coverage. How effectively does the entity recognition and opinion frame module identify scholarly agents (Cognizers) present in the source documents?
Methodology: The entity extraction component is evaluated by conducting frequency-based analysis comparing GT entities with model-identified entities.
Metrics: Entity-level recall (proportion of GT entities correctly identified) and the total number of entities detected by the model to assess both coverage and potential over-generation.
EQ3: Evidential Reasoning Extraction Quality. How accurately does the model capture the multi-dimensional evidential reasoning employed by scholars in their interpretations?
Methodology: Given the complex structure of scholarly evidence identified in the ontological framework, where each piece of evidence comprises multiple semantic dimensions (evaluated feature, evaluation perspective, broader feature class, polarity), a custom scoring metric operating on a 4-point scale is implemented. For each evidence prediction, points are assigned based on accuracy across these four dimensions, subtracting one point for each incorrectly identified component. This approach accommodates cases where model outputs are semantically similar but not lexically identical to GT annotations.
Score Interpretation: 0 points indicates complete extraction failure (equivalent to FN or total FP); 1–2 points indicates weak but partially acceptable outputs; and 3–4 points indicates acceptable to strong outputs meeting semantic requirements.
Scope: Evidence evaluation is restricted to entities successfully matched between model output and GT from EQ2.
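The 4-point evidence score described for EQ3 can be sketched as follows. This is an illustrative reading, not the authors' implementation: the dimension keys, the `equivalent` hook (a placeholder for the semantic-similarity judgment, which defaults here to normalized string equality) and the sample values are assumptions. The percentage helper simply divides a mean score by the four available points.

```python
# The four semantic dimensions of a piece of evidence (key names are assumed).
DIMENSIONS = ("feature", "perspective", "broader_feature", "polarity")


def normalize(value):
    """Lower-case, treat underscores as spaces and collapse whitespace."""
    return " ".join(str(value).lower().replace("_", " ").split())


def score_evidence(predicted, gold, equivalent=None):
    """Start from 4 points and subtract one per incorrectly identified dimension."""
    if predicted is None:
        return 0  # nothing extracted: complete failure
    same = equivalent or (lambda a, b: normalize(a) == normalize(b))
    score = len(DIMENSIONS)
    for dim in DIMENSIONS:
        if not same(predicted.get(dim, ""), gold.get(dim, "")):
            score -= 1
    return score


def interpret(score):
    """Map a score to the three bands of the score interpretation."""
    if score == 0:
        return "failure"
    if score <= 2:
        return "weak"
    return "acceptable"


def percentage(mean_score):
    """Express a mean score on the 0-4 scale as a percentage."""
    return 100 * mean_score / len(DIMENSIONS)


# GT values follow the Valla evidence listing; the prediction is invented.
gold = {
    "feature": "philological arguments",
    "perspective": "consistency",
    "broader_feature": "language",
    "polarity": "negative",
}
predicted = {
    "feature": "philological_arguments",
    "perspective": "consistency",
    "broader_feature": "style",
    "polarity": "negative",
}

score = score_evidence(predicted, gold)  # one mismatch: broader_feature
band = interpret(score)
```

In practice the equality check would be replaced by whatever semantic comparison the annotators apply, passed in through `equivalent`.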
EQ4: Hypothesis and Judgment Identification. How accurately does the model extract scholars' interpretative hypotheses and overall authenticity judgments?
Methodology: The same precision, recall and F1-score evaluation framework established for EQ1 is applied to assess the hypothesis extraction component. Model outputs are compared against expert-annotated GT for both specific scholarly hypotheses and overall authenticity determinations.
Scope: Evaluation is limited to the subset of successfully matched entities identified in EQ2 to ensure fair comparison.
EQ5: Overall Discourse Representation Fidelity. Does the complete generated KG provide an adequate representation of the scholarly debate surrounding the CH items' authenticity?
Methodology: To evaluate representation fidelity, G-EVAL (Liu et al., 2023) is employed. Since the KGs only represent opinions inside the text, comparing the source document with a rehydrated version of the KG would heavily bias the evaluation metric. For this reason, similarity-based metrics such as BLEU, ROUGE and COMET computed against the source corpus, as used in Gangemi et al. (2024), are avoided.
G-EVAL evaluates two metrics: debate correctness and debate representativeness. The first evaluates how well individual scholarly entities and their arguments are represented compared to the GT, penalizing omission of specific entities while rewarding accurate representation of facts, claims and evidence with proper domain-specific terminology. The second assesses how comprehensively the overall structure and flow of the authenticity debate is captured, including the breadth of scholarly perspectives and their relationships within the discourse narrative.
Previous evaluation metrics mostly covered matchable entries between GT and output, whereas G-EVAL evaluates the complete output.
Scope: G-EVAL over rehydrated KGs covers the complete pipeline output against the rehydrated GT.
5. Results
This section presents a comprehensive evaluation and preliminary discussion of findings across the five EQs outlined in Section 4.5. We evaluate Claude Sonnet 3.7, Llama 3.3 70B and GPT-4o-mini across multiple dimensions of the authenticity debate extraction task (abbreviated in the tables as Claude, Llama and GPT). We begin with simple exploratory SPARQL queries across the three KGs and compare the results with the GT, as shown in Figure 17.
SPARQL query used to extract entity counts from the KGs for statistical comparison across models
SELECT (COUNT(DISTINCT ?entity) AS ?entityCount)
WHERE {
    ?interpretationAct a hico:InterpretationAct .
    ?interpretationAct prov:wasAttributedTo ?entity .
    ?entity a dct:Agent .
    FILTER(!CONTAINS(STR(?interpretationAct), "self_statement"))
}
Table 3 and Figure 18 provide an overview of the KGs generated by each model compared to the GT. The models produce more triples than the GT (10,000–12,000 vs 4,026), as the GT relies heavily on Wikidata entity linking, while the models extract and create explicit triples for information found directly in the text (such as dates, locations and descriptive metadata). Despite this, the models generate comparable numbers of Interpretation Acts and Cognizers to the GT, suggesting at this stage a similar density of extracted information.
KE overall metrics
| Model | Triples | Interpretation acts | Cognizers |
|---|---|---|---|
| GT | 4,026 | 170 | 164 |
| Claude | 10,173 | 148 | 103 |
| GPT | 12,088 | 247 | 201 |
| Llama | 10,119 | 217 | 172 |
Radar chart of different KG extractions (axes: Triples, Interpretation Acts, Cognizers, Evidence Quality, Metadata (F1 score, avg); series: Ground Truth, Claude, GPT, Llama). Source: Authors' own work
5.1 EQ1: CH item metadata extraction precision
How accurately does the pipeline extract alleged CH item metadata compared to expert annotations?
As shown in Table 4, the performance is high across all models, with F1-scores ranging from 0.97 to 0.99.
CH item metadata extraction performance across three LLMs
| Model | Category | Precision | Recall | F1-score |
|---|---|---|---|---|
| Claude | Titles | 1.000 | 1.000 | 1.000 |
| Claude | Type | 1.000 | 1.000 | 1.000 |
| Claude | Creators | 0.977 | 0.977 | 0.977 |
| Claude | Dates | 0.978 | 1.000 | 0.989 |
| Claude | Locations | 0.978 | 1.000 | 0.989 |
| Claude | Overall | 0.987 | 0.995 | 0.991 |
| GPT | Titles | 0.889 | 1.000 | 0.941 |
| GPT | Type | 0.956 | 1.000 | 0.977 |
| GPT | Creators | 0.956 | 1.000 | 0.977 |
| GPT | Dates | 0.911 | 1.000 | 0.953 |
| GPT | Locations | 1.000 | 1.000 | 1.000 |
| GPT | Overall | 0.942 | 1.000 | 0.970 |
| Llama | Titles | 0.933 | 1.000 | 0.966 |
| Llama | Type | 0.933 | 1.000 | 0.966 |
| Llama | Creators | 0.933 | 1.000 | 0.966 |
| Llama | Dates | 0.867 | 1.000 | 0.929 |
| Llama | Locations | 1.000 | 1.000 | 1.000 |
| Llama | Overall | 0.933 | 1.000 | 0.965 |
Claude Sonnet 3.7 achieves the highest overall performance with an F1-score of 0.991. All models show nearly perfect recall, indicating successful extraction of almost all relevant metadata elements, with precision differences primarily reflecting varying FP rates. Date extraction shows more variability, with Llama 3.3 achieving the lowest precision (0.867) due to a higher FP rate, as it confused the forging date with the alleged date. For this particular task, the challenge was to distinguish between alleged metadata and settled metadata. All models successfully understood the task, with only small drops in precision across models of different parameter sizes.
5.2 EQ2: scholarly entity recognition coverage
How effectively does the entity recognition and opinion frame module identify scholarly agents (Cognizers) present in the source documents? As shown in Table 3, the number of Cognizers is relatively similar across models – Table 5 shows the number of overlapping entities between the model's KG and the GT.
Entity recognition coverage and accuracy
| Model | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|
| Claude | 0.696 | 0.763 | 0.728 | 71 | 31 | 22 |
| GPT | 0.718 | 0.912 | 0.803 | 145 | 57 | 14 |
| Llama | 0.626 | 0.817 | 0.709 | 107 | 64 | 24 |
GPT-4o-mini demonstrates superior entity recognition coverage, identifying 77.3% of scholarly agents present in the GT, significantly outperforming Claude (49.5%) and Llama 3.3 (58.8%). It identified the most entities that were expressing opinions. The perfect match rates indicate the proportion of identified entities that exactly match GT annotations. GPT-4o-mini maintains the highest accuracy at 66.0%.
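As a worked check, the precision, recall and F1 columns of Table 5 can be recomputed from its TP, FP and FN counts. The short Python sketch below uses only the counts reported in the table; the function and variable names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # algebraically equal to 2PR / (P + R)
    return precision, recall, f1


# (TP, FP, FN) per model, copied from Table 5.
TABLE_5_COUNTS = {
    "Claude": (71, 31, 22),
    "GPT": (145, 57, 14),
    "Llama": (107, 64, 24),
}

rows = {
    model: tuple(round(value, 3) for value in precision_recall_f1(*counts))
    for model, counts in TABLE_5_COUNTS.items()
}

for model, (precision, recall, f1) in rows.items():
    print(f"{model}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

The recomputed values agree with the table to three decimal places, for example 71 / (71 + 31) = 0.696 precision and 71 / (71 + 22) = 0.763 recall for Claude.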
5.3 EQ3: evidential reasoning extraction quality
How accurately does the model capture the multi-dimensional evidential reasoning employed by scholars in their interpretations?
Table 6 presents evidence extraction performance using our custom 4-point scoring system that evaluates the accuracy of feature identification, evaluation perspective, feature classification and polarity assessment.
Evidence extraction quality and coverage
| Model | Mean score (0–4) | Percentage score (%) |
|---|---|---|
| Claude | 3.87 | 96.8 |
| GPT-4o-mini | 3.84 | 96.0 |
| Llama 3.3 | 3.81 | 95.3 |
All models demonstrate strong evidence extraction capabilities, with mean scores above 95%. While GPT-4o-mini achieves the highest precision and recall for entities, as shown in Table 5, Claude shows the highest evidence score (96.8%) in Table 6. This pattern suggests that Claude's lower recall in identifying Cognizers translates into higher precision in downstream tasks.
5.4 EQ4: hypothesis and judgment identification
How accurately does the model extract scholars' interpretative hypotheses and overall authenticity judgments?
Table 7 presents performance on extracting scholarly hypotheses about items’ origins and authenticity judgments.
Hypothesis and judgment extraction performance
| Model | Macro F1 | Type F1 | Creator F1 | Date F1 | Location F1 |
|---|---|---|---|---|---|
| Claude | 0.655 | 0.652 | 0.638 | 0.791 | 0.923 |
| GPT | 0.749 | 0.845 | 0.484 | 0.595 | 0.727 |
| Llama | 0.694 | 0.691 | 0.712 | 0.762 | 0.727 |
GPT-4o-mini achieves the highest overall F1-score (0.749) for hypothesis extraction, with particularly strong performance in authenticity type classification (0.845). However, the model shows weaker performance in creator hypothesis identification (0.484), suggesting challenges in extracting attribution hypotheses. Claude demonstrates exceptional performance in geographic hypotheses (0.923 F1) and temporal hypotheses (0.791 F1), indicating strength in extracting alternative location and dating theories. Llama 3.3 shows the most balanced performance across hypothesis types, with particularly strong creator hypothesis extraction (0.712 F1). The variation across hypothesis types reflects the inherent complexity of scholarly reasoning, with location and date hypotheses generally more explicitly stated than creator attributions or underlying motivations.
5.5 EQ5: overall discourse representation fidelity
Does the complete generated KG provide an adequate representation of the scholarly debate surrounding CH Item authenticity?
The empirical threshold, derived from the G-EVAL scores of three well-represented, manually revised articles (Posthumous Diary, Centiloquium and Acámbaro figures), is set at 0.6–0.7. Table 8 shows the per-statement G-EVAL scores for the three models, while Table 9 reports the scores for the whole KG produced from a single Wikipedia article. Claude obtains the highest per-statement mean (0.620). This result is consistent with the other evaluation findings: although the other two models demonstrate higher debate coverage overall, they are penalized for generating more FPs, resulting in lower scores. This evaluation confirms a key pattern in our pipeline: when an entity is correctly identified as a Cognizer, its associated arguments are accurately represented. However, incorrect entity identification leads to error propagation throughout the pipeline, causing the generation of FPs in downstream components. Future iterations of the pipeline should incorporate self-consistency checks at the entity identification stage to reduce error accumulation and improve overall accuracy.
Per-statement correctness (G-EVAL scores on 0–1 scale)
| Model | Mean | Std dev | Range |
|---|---|---|---|
| Claude | 0.620 | 0.133 | 0.333–0.889 |
| GPT | 0.590 | 0.204 | 0.222–0.889 |
| Llama | 0.533 | 0.153 | 0.222–0.889 |
6. Discussion and conclusions
In this section, we discuss the overall performance patterns, identified bottlenecks and potential steps to enhance the KE while answering our RQs (Section 1), followed by our contributions, limitations and future steps.
6.1 Methodological framework validation
To answer RQ1, our five-step ATR4CH methodology proves effective in developing the pipeline within the boundaries of an ontology. The granular evaluation demonstrates that our divide-and-conquer methodology enables systematic refinement of individual components while maintaining system coherence. The result is that different models excel at different subtasks, suggesting potential for hybrid approaches that leverage each model's strengths.
The alignment between G-EVAL and the other evaluations suggests that self-consistency checks throughout the pipeline (such as prompting models to evaluate their own extraction results) could reduce FPs and FNs, although external validation would remain necessary.
6.2 Extraction performance analysis
To answer RQ2, our evaluation reveals component-specific performance patterns across all tested models. Performance varies across extraction tasks, with all models achieving high scores on metadata extraction (F1-scores of 0.965–0.991), moderate performance for entity recognition (F1: 0.709–0.803), strong evidence extraction capabilities (95.3–96.8% accuracy) and more challenging hypothesis extraction (F1-scores of 0.655–0.749). Extracting alleged metadata proves straightforward across models, while capturing nuanced scholarly hypotheses requires more sophisticated interpretation regardless of architecture. The evidence extraction results demonstrate that contemporary LLMs can effectively capture multi-dimensional evidential reasoning, but they can do so only when they can identify the Cognizer – this represents an error propagation problem we identified in the pipeline, as the out-of-GT outputs for evidence extraction are mostly empty or incorrect.
6.3 Representation fidelity and quality assessment
To answer RQ3, the generated KGs demonstrate adequate representation of scholarly debate complexity and nuance. While the representation model proves more than adequate, as already demonstrated in the BROAST catalogue (Pasqual, 2025), the quality of the automatically generated KGs can still be improved.
G-EVAL scores around 0.6 indicate acceptable discourse representation quality with room for improvement. The successful capture of multi-dimensional evidential reasoning (95.3–96.8% accuracy) shows that LLMs can handle complex semantic relationships, suggesting broader applicability to other humanities domains characterized by multi-perspectival interpretation and evidence-based reasoning. However, the model perspective on specific domain terminology and approaches requires improvement, as the G-EVAL evaluation demonstrates.
6.4 Model comparison and performance trade-offs
To answer RQ4, our findings challenge the conventional assumption that larger models always perform better for complex domain tasks. Claude Sonnet 3.7 demonstrates lower recall but higher precision, being more conservative in entity classification but achieving greater accuracy in subsequent extraction steps. GPT-4o-mini shows the opposite pattern with higher recall and competitive precision, while Llama 3.3 70B falls between these approaches. Notably, as seen in Table 7, GPT-4o-mini performs better because it correctly identified more of the Cognizers covered in the GT than the other models, while having the fewest parameters of the three.
The precision–recall trade-off has significant implications for deployment strategies. In production environments where KGs undergo human review and correction, higher recall models may be preferable since updating or deleting erroneous triples is more efficient than creating new KGs from scratch. Conversely, in real-time applications such as RAG systems, where extraction occurs without human supervision, higher precision becomes critical to avoid propagating false information.
6.5 Deployment implications and cost-effectiveness
To answer RQ5, the performance differences between models are relatively modest, while model sizes and costs differ substantially [23]. This suggests that the step-by-step pipeline architecture effectively leverages the capabilities of smaller models, making deployment feasible and more cost-effective for CH institutions with varying computational budgets.
The competitive performance of different model sizes within sequential pipelines opens two promising research directions. First, fine-tuning approaches could specifically target bottlenecks like Cognizer classification of recognized entities. Second, enhanced pre-processing using specialized tools could filter irrelevant entities before they enter the extraction pipeline.
The methodology's adaptability accommodates diverse institutional landscapes: smaller projects can benefit from intensive human-in-the-loop approaches with API-based models, while larger projects can leverage automated scaling through extensive annotation datasets and local deployment.
6.6 Contributions, limitations and future directions
This work makes three contributions. First, we demonstrated the practical application of the SEBI ontology using RDF-star to represent multi-perspective authenticity claims, enabling structured representation of evidence-based scholarly interpretation while preserving provenance and alternative hypotheses. Second, we introduced a comprehensive five-step methodology for building LLM-centric KE pipelines that addresses the unique challenges of humanities texts through systematic integration of annotation models, ontological frameworks and computational tools. The methodology's technology-agnostic design provides a replicable blueprint adaptable to varying project scales and resource constraints. Third, our technical implementation achieved practical feasibility through a sequential LLM pipeline that successfully captures scholarly reasoning, including evidential features, evaluation polarities and alternative hypotheses.
Our approach faces some limitations that can be addressed in future work. The current focus on English Wikipedia sources limits multilingual applicability, particularly important given the glocal nature of CH scholarship. Performance on primary scholarly literature remains untested, and two key bottlenecks emerged: Cognizer classification difficulty and dependency on Wikidata linking for optimal performance.
Future work will prioritize developing multilingual extraction capabilities, implementing targeted improvements for Cognizer identification through fine-tuning or hybrid approaches and creating user-friendly tools that enable CH practitioners to customize the extraction process with appropriate human-in-the-loop interfaces. Additionally, working not only with secondary literature but also with primary works from scholars would be a relevant contribution. While LLMs show promise for structuring complex scholarly debates, complete automation remains premature, suggesting that balanced human-machine collaboration represents the most viable path forward.
AI tool disclosure
This research employed LLMs (Claude Sonnet 3.7, Llama 3.3 70B and GPT-4o-mini) as research subjects for KE experiments, as detailed in the methodological sections of this article. Additionally, G-EVAL, an LLM-based evaluation framework, was used for assessing discourse representation quality. Claude Sonnet 4.5 was used as a writing aid for summarization, translation and code generation when necessary. No AI tools were used in the ideation, analysis or interpretation of this manuscript beyond the specified cases above. The authors maintain full responsibility for the research design, methodology, data interpretation, figure and listing creation, and all conclusions presented.
This research was partially funded by the European Union – Next Generation EU, investment I.4.1 PNRR Patrimonio Culturale, Decreto Ministeriale n. 351, 9 April 2022. We gratefully acknowledge Dr. Cristina Solidoro (Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy) for her bibliographic suggestions and additional domain expertise in support of the case study.
Appendix
Metadata Extraction System Prompt
You are an expert Knowledge Extraction agent. Your task is to extract factual claims about cultural heritage documents - distinguishing between what documents claim about themselves versus what scholars believe about them.
Task
Extract alleged metadata from Wikipedia-style text:
- Alleged authorship: Who the document claims to be by
- Alleged dating: When the document claims to be from
- Alleged location: Where the document claims to originate
- Publisher/Discoverer: Who published, discovered, or brought the document to light
- Actual authorship: Who scholars think really created it
- Actual dating: When scholars think it was really created
- Actual location: Where scholars think it really originated
Examples
Example 1: Chronicle of Valdoria
Input:
The Chronicle of Valdoria was published in 1962 by antiquarian dealer Giuseppe Torretti, who claimed to have discovered the manuscript in the archives of San Pietro monastery. The document purports to be a 12th-century chronicle written by Brother Marcus documenting the founding of the monastery in northern Italy. However, paleographic analysis conducted by Dr. Elena Rossi in 2018 revealed that the parchment contains watermarks not used until the 15th century, and the Latin contains grammatical constructions typical of Renaissance humanists rather than medieval scribes.
Output:
{
  "documents": [
    {
      "document": "Chronicle of Valdoria",
      "alleged_metadata": {
        "alleged_author": ["Brother Marcus"],
        "alleged_date": "12th century",
        "alleged_location": "northern Italy",
        "publisher": ["Giuseppe Torretti"],
        "actual_author": "",
        "actual_date": "15th century",
        "actual_location": ""
      }
    }
  ]
}
Example 2: Codex Aureus Britannicus
Input:
In 1924, German manuscript dealer Heinrich Weber announced the discovery of the Codex Aureus Britannicus, which he claimed to have acquired from a private English collection. The manuscript purports to be an illuminated Gospel book created by Celtic monks at Iona Abbey in the 8th century, allegedly commissioned by King Aethelbald of Mercia. The codex bears an inscription stating it was “written in the year of our Lord 742 by the hand of Brother Columba.” Modern forensic analysis by Cambridge University, however, determined that the gold leaf contains titanium dioxide, a pigment not available until 1916, suggesting the manuscript was created in the early 20th century, possibly in Germany.
Output:
{
  "documents": [
    {
      "document": "Codex Aureus Britannicus",
      "alleged_metadata": {
        "alleged_author": ["Brother Columba"],
        "alleged_date": "742",
        "alleged_location": "Iona Abbey",
        "publisher": ["Heinrich Weber"],
        "actual_author": "",
        "actual_date": "early 20th century",
        "actual_location": "Germany"
      }
    }
  ]
}
Example 3: Letters of Empress Theodora
Input:
The Letters of Empress Theodora were first published by Turkish historian Mehmet Ozkan in 1925, who claimed the documents were discovered during restoration work at Hagia Sophia in Constantinople. The letters allegedly consist of correspondence between the Byzantine Empress Theodora and various nobles, purporting to reveal court intrigues in 6th-century Constantinople. Professor Andreas Mikhailov's textual analysis, published in 2019, demonstrated that the Greek contains modern grammatical forms and references to concepts unknown in the Byzantine period, concluding the letters were fabricated in the 1920s by an unknown forger seeking to capitalize on interest in Byzantine history.
Output:
{
  "documents": [
    {
      "document": "Letters of Empress Theodora",
      "alleged_metadata": {
        "alleged_author": ["Empress Theodora"],
        "alleged_date": "6th century",
        "alleged_location": "Constantinople",
        "publisher": ["Mehmet Ozkan"],
        "actual_author": ["unknown"],
        "actual_date": "1920s",
        "actual_location": ""
      }
    }
  ]
}
Example 4: Multiple Documents – Venetian Statue Collection
Input:
In 1889, art dealer Rodolfo Marinetti claimed to have acquired three ancient Roman marble statues from excavations near Venice. The Venus of San Marco purports to be a 1st-century sculpture by the workshop of Praxiteles, allegedly discovered near the Roman Forum. The Apollo Veneticus claims to be a 2nd-century work commissioned by Emperor Hadrian for his villa at Tivoli. The Minerva Triumphans allegedly dates to the 3rd century and bears an inscription attributing it to the sculptor Gaius Valerius. However, Dr. Maria Santini's 2020 forensic analysis revealed that all three sculptures contain trace amounts of modern portland cement in their marble, indicating they were carved in the late 19th century, likely by the same workshop in Carrara that specialized in creating “ancient” pieces for wealthy collectors.
Output:
{
  "documents": [
    {
      "document": "Venus of San Marco",
      "alleged_metadata": {
        "alleged_author": ["workshop of Praxiteles"],
        "alleged_date": "1st century",
        "alleged_location": "Roman Forum",
        "publisher": ["Rodolfo Marinetti"],
        "actual_author": ["workshop in Carrara"],
        "actual_date": "late 19th century",
        "actual_location": "Carrara"
      }
    },
    {
      "document": "Apollo Veneticus",
      "alleged_metadata": {
        "alleged_author": [],
        "alleged_date": "2nd century",
        "alleged_location": "Hadrian's Villa (Tivoli)",
        "publisher": ["Rodolfo Marinetti"],
        "actual_author": ["workshop in Carrara"],
        "actual_date": "late 19th century",
        "actual_location": "Carrara"
      }
    },
    {
      "document": "Minerva Triumphans",
      "alleged_metadata": {
        "alleged_author": ["Gaius Valerius"],
        "alleged_date": "3rd century",
        "alleged_location": "",
        "publisher": ["Rodolfo Marinetti"],
        "actual_author": ["workshop in Carrara"],
        "actual_date": "late 19th century",
        "actual_location": "Carrara"
      }
    }
  ]
}
Extraction Rules
Extract as alleged metadata:
- Direct document claims: “The text states it was written by …”
- Purported attribution: “allegedly written by,” “purports to be by”
- Self-identification: Document identifying its own author/date/origin
- Avoid using comments or commenting on the data. Dates must be date ranges, entities will be linked later
- AVOID: “unknown (possibly John Doe)” and prefer “unknown”
Extract as actual metadata:
- Scholarly determinations: “historians believe,” “analysis shows”
- Forensic findings: “carbon dating revealed,” “investigations found”
- Academic consensus: “scholars generally agree,” “modern research indicates”
Return null if:
- No authorship, dating, or location claims mentioned
- Only general discussion without metadata specifics
- Pure methodology/technique descriptions
Output Format
Always include both alleged and actual fields, using empty strings when information is not mentioned. Focus on factual claims, not interpretations of authenticity.
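The output format requested by the prompt above can be checked mechanically before any entity linking takes place. The following is a minimal sketch in Python and is not part of the prompt or of the authors' pipeline: the function name `normalize_output`, the field groupings and the decision to coerce author and publisher values into lists are assumptions made for illustration. It relies only on what the examples show, namely that all seven fields sit under the `alleged_metadata` key, that absent values appear as empty strings or empty lists, and that the model may return `null` when nothing is extractable.

```python
import json
from typing import Any

# Field groupings inferred from the prompt's examples (an assumption, not a published schema).
LIST_FIELDS = ("alleged_author", "publisher", "actual_author")
STRING_FIELDS = ("alleged_date", "alleged_location", "actual_date", "actual_location")


def normalize_output(raw: str) -> list[dict[str, Any]]:
    """Parse one model response and return a flat list of document records."""
    data = json.loads(raw)
    if data is None:
        # The prompt allows a bare null when no metadata claims are present.
        return []

    records = []
    for doc in data.get("documents", []):
        meta = doc.get("alleged_metadata", {})
        record: dict[str, Any] = {"document": doc["document"]}

        # The examples mix "" and [...] for person fields; coerce both to lists.
        for field in LIST_FIELDS:
            value = meta.get(field, "")
            if isinstance(value, str):
                value = [value] if value else []
            record[field] = list(value)

        # Missing or null string fields become empty strings, as the prompt requests.
        for field in STRING_FIELDS:
            record[field] = meta.get(field) or ""

        records.append(record)
    return records
```

Coercing person fields to lists keeps records uniform across the examples, where `actual_author` is an empty string in some outputs and a list in others.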
Notes
https://www.wikidata.org/. While Wikidata employs a custom reification method to integrate multiple perspectives through its ranking mechanism, annotators in the CH domain sometimes neglect this feature (Di Pasquale et al., 2024).
Donation of Constantine – Q238476.
Donation of Constantine – DBpedia entry.
Selection performed October 2024.
Claude Sonnet 3.7 Model Card: https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf.
GPT-4o-mini model card: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.
SEBI-KE repository. See “Inception2Graph” folder.
As of May 2025, the Claude-3.7-Sonnet API has a cost of $3/million tokens, GPT-4o-mini $0.60/million tokens and Llama-3.3-70B $0.54/million tokens. The overall cost for 45 articles using the Anthropic API exceeded $20, while for Llama-3.3-70B and GPT-4o-mini it was between $5 and $10.

