Smart question-answering (QA) for construction laws (CLs) holds significant promise for aiding domain professionals with legal inquiries. Existing studies of construction law question-answering (CLQA) rely on learning-based models, which require extensive training data and are limited to a narrow QA scope. Meanwhile, general-purpose large language models (GPLLMs) possess great potential for CLQA but lack domain-specific knowledge. This study proposes a data-driven and expertise-based approach to developing a construction law knowledge repository (CLKR) and validates its effectiveness in enhancing the CLQA performance of GPLLMs.
The methodology includes (1) recognizing 702 candidate CL documents from 374,992 official judgments, (2) building a CLKR with 387 filtered documents covering eight CL knowledge areas, (3) integrating the CLKR with seven representative GPLLMs, and (4) constructing a 2,140-question CLQA dataset from Professional Construction Engineer Qualification Examinations (PCEQEs) held during 2014–2023 to compare CLQA performance between seven pairs of GPLLMs with and without the CLKR.
The CLKR significantly enhances the CLQA performance of the seven GPLLMs, yielding an average accuracy increase of 21.1%, with individual improvements ranging from 9.9% to 44.9%. Furthermore, the CLKR boosts the accuracy of single-answer questions by 14.9% and of multiple-answer questions by 38.3%. Additionally, the accuracy enhancements across the eight CL knowledge areas are between 14.5% and 28.2%.
This study proposes an approach to developing an external knowledge base, the CLKR, to empower GPLLMs, significantly expanding the scope of CLQA while bypassing the complex training of traditional learning-based models. Moreover, this study confirms the effectiveness of the CLKR in augmenting GPLLM performance and offers a reusable CLQA test dataset as a benchmark.
1. Introduction
Broadly defined, construction laws (CLs) include various statutes, acts, regulations, judicial decisions, and legal interpretations (Alhyari and Ani, 2022; Khademi Adel et al., 2022). These laws influence multiple critical aspects of the construction industry, such as contractual breaches (Khademi Adel et al., 2022), safety incident treatment (Zailani et al., 2023), and occupational health risks (Tao et al., 2022; Liu et al., 2021). Addressing such legal issues is crucial for the smooth implementation of construction projects (Tao et al., 2022; Zailani et al., 2023). Question-answering (QA) for CLs currently relies on three main approaches: (1) consulting literature, (2) searching for online information, or (3) seeking guidance from domain experts. Reviewing literature (e.g. books, standards, and statutes) is laborious, often requiring the review of hundreds of documents to answer a single query (Rasool et al., 2024; Choi et al., 2023). Search engines are freely accessible, but they only return information relevant to the question instead of directly delivering specific answers (Oeding et al., 2024; Liu et al., 2024). Consulting domain experts is effective, but the scarcity of experts results in high costs (Alan et al., 2024; Hou and Zhang, 2024). For example, only 8.24 million out of 33.72 million (24.44%) legal cases in Mainland China involved lawyers as of 2023, leaving 75.56% of cases without lawyers (DJ, 2023; SPC, 2023). Therefore, academia and industry worldwide face the challenge of providing construction law question-answering (CLQA) efficiently and cost-effectively. Smart CLQA that automatically answers construction-related legal questions would be a valuable complement, helping resolve basic legal queries in the construction domain at lower cost and in less time.
General-purpose large language models (GPLLMs) (e.g. ChatGPT and Llama) are large language models designed to handle a broad range of language tasks across multiple domains without specific domain specialization (Rizzo et al., 2024; Gilson et al., 2023). GPLLM-based CLQA is advantageous because it is training-free and benefits from strong language understanding (Ghimire et al., 2024; Pursnani et al., 2023; Oh et al., 2023). However, these non-specialized GPLLMs do not incorporate CL knowledge, which may result in unsatisfactory CLQA performance (Su et al., 2024; Tsoutsanis and Tsoutsanis, 2024; Frieder et al., 2023). Supplying this knowledge through an external knowledge base is a natural remedy, but developing such a knowledge base encounters significant challenges: (1) a heavy reliance on expertise-based selection of knowledge-embedded documents, which restricts the CLKR scope and limits updates; (2) a lack of empirical validation regarding whether domain knowledge can enhance CLQA performance; and (3) a shortage of benchmark datasets for CLQA testing.
In response, this study aims to (1) propose a data-driven and expertise-based approach to developing the CLKR, (2) validate the effectiveness of the CLKR in improving GPLLM performance, and (3) provide an openly available CLQA dataset. To achieve these objectives, this study devises a four-phase CLKR development approach to construct a 387-document CLKR, based on 374,992 written judgments and the input of 10 experts. In addition, this study validates the effectiveness of the CLKR by comparing the performance of seven GPLLMs with and without the CLKR, and provides a 2,140-question CLQA dataset based on China’s most authoritative Professional Construction Engineer Qualification Examinations (PCEQEs) in the architecture, engineering, and construction (AEC) field. The remaining sections of this study are organized as follows. Section 2 provides a literature review on existing construction-related QA research and the application of GPLLMs. Section 3 outlines the methodology for building the CLKR and integrating it with GPLLMs. Section 4 evaluates the effectiveness of the CLKR in enhancing the CLQA of GPLLMs. Section 5 discusses the research contributions, potential methods for addressing the long-tail effect, mis-answered question types, and practical implications.
2. Related works
2.1 Existing construction-focused QA studies
In the construction domain, there are pioneering smart QA studies, as listed in Table 1. Existing construction-focused QA studies involve the procurement of construction materials (Lee et al., 2023), construction safety hazards (Wang and El-Gohary, 2023; Tian et al., 2023), construction accidents (Xu et al., 2023), building codes (Xue et al., 2024), construction document text mining (Sun et al., 2020), and construction procedures (Zhong et al., 2020). Although these studies focus on different topics, they still offer significant insights into (1) exploited QA models and (2) QA performance testing for this study.
Table 1. Existing construction-related QA studies
| No | References | QA target area | QA model | Training-free | Knowledge scope for QA | Test dataset: question source | Test dataset: number of questions |
|---|---|---|---|---|---|---|---|
| 1 | Chou et al. (2024) | Risk management in river dredging projects | A BERT-based deep learning model | × | Dredging risk knowledge collected by interviews | Listed by experienced dredging personnel | 16 |
| 2 | Kim et al. (2024) | Construction market knowledge in overseas projects | A BERT-based deep learning model | × | 3 versions of a FIDIC standard contract written in English, Korean, and Indonesian | The FIDIC documents | 80 |
| 3 | Xue et al. (2024) | Building codes | A BERT-based deep learning model | × | 2 Chapters of the IBC 2015 | Manually generated for model testing | 175 |
| 4 | Lee et al. (2023) | Steel manufacturer equipment procurement | A machine learning model combining KG and QA | × | An equipment procurement document from a steel-making company | Generated questions based on relevant arbitration and clause settings | 45 |
| 5 | Tian et al. (2023) | Construction safety hazard | A BERT + BiGRU + Self-Attention-based deep learning model | × | 6,325 safety hazard texts | Dedicated questions for model application | 25 |
| 6 | Wang and El-Gohary (2023) | Construction safety hazard | A CNN-based deep learning model | × | 20 OSHA sections related to fall protection | Manually developed for model testing | 671 |
| 7 | Xu et al. (2023) | Coal mine construction safety | A BERT-BiLSTM-CRF-based deep learning model | × | 43 sections of 80 papers from coal mine construction safety management standard specifications | Example questions used to validate the semantic query and entity information modules | Unspecified |
| 8 | Sun et al. (2020) | Construction document information transmission mining | A TF-IDF-based machine learning model | × | A monthly construction report containing 1,734 words | Posed by three construction managers | 5 |
| 9 | Zhong et al. (2020) | Construction procedural constraint | A BiLSTM- + CRF-based deep learning model | × | 14 types of national standards of CACQ in China | Sentences labeled by experts | 400 |
| 10 | Rajpurkar et al. (2016) | Multiple domains including building regulation domain | A logistic regression-based machine learning model | × | 536 Wikipedia articles | Contributed by 5 civil engineers | Unspecified |
Note(s): BERT: Bidirectional Encoder Representations from Transformers; BiGRU: Bidirectional Gated Recurrent Unit; BIM: Building Information Modeling; BiLSTM: Bidirectional Long Short-Term Memory; CACQ: Code for Acceptance of Construction Quality; CRF: Conditional Random Fields; FIDIC: International Federation of Consulting Engineers; IBC: International Building Code; IE: Information Extraction; KG: Knowledge Graph; NHC: National Hurricane Center; NLG: Natural Language Generation; NLP: Natural Language Processing; NLU: Natural Language Understanding; OSHA: Occupational Safety and Health Administration; TF-IDF: Term Frequency-Inverse Document Frequency
Source(s): Authors’ own work
Existing construction-related smart QA is primarily built upon machine learning and deep learning models (Table 1). Conventional machine learning models used for QA are often based on algorithms such as TF-IDF (Sun et al., 2020) and logistic regression (Rajpurkar et al., 2016). While these machine learning models can be developed for QA with small amounts of training data, their natural language processing (NLP) capabilities are limited, resulting in suboptimal QA performance (Lee et al., 2023). Deep learning models such as BERT (Chou et al., 2024; Tian et al., 2023; Kim et al., 2024), CNN (Lee et al., 2023; Wang and El-Gohary, 2023), and BiLSTM (Xu et al., 2023; Zhong et al., 2020) exhibit stronger NLP capabilities and better QA performance, owing to their greater number of parameters. However, deep learning models necessitate a large volume of data for QA training (Chou et al., 2024; Xue et al., 2024), and annotating QA training data is a high-cost and labor-intensive process (Chou et al., 2024; Lee et al., 2023; Wang and El-Gohary, 2023; Xue et al., 2024). Moreover, whether machine learning or deep learning models are employed, a fundamental limitation persists: these learning-based models are only applicable to questions covered by their training data. In other words, their knowledge scope for QA is relatively narrow, such as one procurement document (Lee et al., 2023), two chapters of a construction code (Xue et al., 2024), three standard contracts (Kim et al., 2024), and a dozen or so regulations (Zhong et al., 2020) (Table 1).
In construction-related QA performance testing, the test sets in existing studies come from two sources: (1) expert-designed question-answer pairs and (2) test datasets built on authoritative exams (Table 1). The first source relies primarily on domain experts to devise question-answer pairs tailored to specific CL subareas, drawing from relevant literature, project materials, and their expertise (Chou et al., 2024; Lee et al., 2023; Sun et al., 2020). This source often requires a significant investment of human resources and time, resulting in relatively small test datasets with at most 671 questions (Table 1) (Xu et al., 2023; Tian et al., 2023). Such limited test datasets may be inadequate for evaluating the QA performance of GPLLMs across a broad domain like construction laws (Martinez-Gil, 2023; Sun et al., 2020). Moreover, due to individuals’ knowledge limitations and biases, the authority of expertise-based QA test sets is often questioned (Zhong et al., 2020). The second source leverages questions from professional examinations (Gencer and Aydin, 2023; Pursnani et al., 2023; Wang and El-Gohary, 2023), which are highly authoritative and facilitate the rapid creation of performance test datasets (Rizzo et al., 2024; Sahin et al., 2024; Frieder et al., 2023). Hence, these examination-based QA test datasets offer valuable guidance for developing the CLKR performance test dataset in this study.
2.2 Studies based on GPLLMs
GPLLMs demonstrate significant advancements in language comprehension, context understanding, and text generation over traditional learning-based models (e.g. machine learning and deep learning models) (Rizzo et al., 2024; Sahin et al., 2024; Oh et al., 2023). These advancements benefit from the remarkable parameter volumes of GPLLMs (Rizzo et al., 2024; Gilson et al., 2023). For example, GPT-4.0 reportedly has 1.8 trillion parameters, Llama-2-70b has 70 billion parameters, and ERNIE-Bot-turbo reportedly also exceeds 1 trillion parameters. In contrast, conventional learning-based models have much smaller parameter scales; even one of the largest, BERT, has only 340 million parameters (Shi et al., 2023), leading to limited capabilities in language processing (Gilson et al., 2023; Saad et al., 2023). State-of-the-art research has applied GPLLMs in various domains, such as orthopedics (Rizzo et al., 2024; Saad et al., 2023), neurosurgery (Sahin et al., 2024), urology (Schoch et al., 2024), nursing (Su et al., 2024), clinical help (Tsoutsanis and Tsoutsanis, 2024), ophthalmology (Antaki et al., 2023), thoracic surgery (Gencer and Aydin, 2023), mathematics (Frieder et al., 2023), and physics (Kortemeyer, 2023).
Although GPLLMs possess more powerful language processing capabilities, their QA performance in specific professional domains remains inadequate due to a lack of domain knowledge (Lu et al., 2024; Su et al., 2024; Tsoutsanis and Tsoutsanis, 2024). Recent studies suggest that integrating external knowledge bases could further enhance the performance of GPLLMs (Lu et al., 2024; Su et al., 2024; Tsoutsanis and Tsoutsanis, 2024). However, few scholars in the AEC field have attempted to develop domain-specific knowledge repositories to lift GPLLMs’ QA performance (Ghimire et al., 2024; Antaki et al., 2023) (Table 2). There is still a lack of empirical validation regarding whether domain knowledge can lift the CLQA performance of GPLLMs.
Table 2. Examples of recent studies on GPLLMs
| No | References | GPLLMs | Specific domain | Test dataset: question sources | Test dataset: question types | Test dataset: number of questions |
|---|---|---|---|---|---|---|
| 1 | Alan et al. (2024) | GPT-3.5 turbo | Islam understanding | Designed by experts | Open-ended | 3 (mentioned by the author) |
| 2 | Hou and Zhang (2024) | GPT-3.5 and GPT-4.0 | Dietary supplement | Information on the MSKCC website | Closed-ended (MSQs and True/False) | 2,000 |
| 3 | Mansurova et al. (2024) | Llama-2-7b and Llama-2-13b | General | TriviaQA open-domain dataset | Closed-ended (Filling in the blank) | 500 |
| 4 | Rasool et al. (2024) | GPT-3.5-turbo and GPT-4 | Healthcare | CogTale dataset | Closed-ended (MMQs, MSQs, True/False, and number extraction) | 337 |
| 5 | Rizzo et al. (2024) | GPT-3.5 turbo and GPT-4 | Orthopaedics | OITE in 2020, 2021, and 2022 | Closed-ended (MSQs) | 207 |
| 6 | Sahin et al. (2024) | GPT-4 | Neurosurgery | The latest six written TNSPBE | Closed-ended (MSQs) | 523 |
| 7 | Schoch et al. (2024) | GPT-3.5 and GPT-4 | Urology | A test book published by the FEBU association | Closed-ended (MSQs) | Around 600 |
| 8 | Su et al. (2024) | GPT-4 | Nursing | Taiwan’s 2022 Nursing Licensing Exam | Closed-ended (MSQs) | 400 |
| 9 | Tsoutsanis and Tsoutsanis (2024) | Llama-2, Google Bard, Bing Chat, and GPT-3.5 | Clinical help | Commercial question banks (i.e. Qbank) for the MSRA exam | Closed-ended (MSQs) | 100 |
| 10 | Antaki et al. (2023) | GPT-3.5 turbo and GPT-4 | Ophthalmology | Basic and Clinical Science Course Self-Assessment Program and an online question bank (i.e. OphthoQuestions) | Closed-ended (MSQs) | 520 |
| 11 | Choi et al. (2023) | ChatGPT | Laws | Exams for law school courses at the University of Minnesota | Closed-ended (MSQs) and open-ended (essay writing) | 107 |
| 12 | Gencer and Aydin (2023) | GPT-3.5 and GPT-4 | Thoracic surgery | Turkish-language thoracic surgery exam questions | Closed-ended (MSQs) | 105 |
| 13 | Gilson et al. (2023) | InstructGPT, GPT-3.5, and ChatGPT | Medicine | A question bank for medical students and the NBME | Closed-ended (MSQs) | 220 |
| 14 | Oh et al. (2023) | GPT-3.5 and GPT-4 | Surgery | The KGSBE in 2020, 2021, and 2022 | Closed-ended (MSQs) | 280 |
| 15 | Pursnani et al. (2023) | GPT-3.5-Legacy, GPT-3.5-Turbo, and GPT-4 | Engineering fundamental knowledge | An unpublished practice exam | Closed-ended (MSQs, MMQs, and filling in the blank) | 134 |
| 16 | Rosól et al. (2023) | GPT-3.5 and GPT-4 | Medicine | 3 versions of PMFE | Closed-ended (MSQs) | 600 |
| 17 | Saad et al. (2023) | GPT-4 | Orthopedics | Mock FRCS Orth Part A | Closed-ended (MSQs) | 240 |
Note(s): CogTale: Cognitive Treatments Article Library and Evaluation; FEBU: Fellow of the European Board of Urology; FRCS Orth: Orthopedic fellow of the Royal College of Surgeons; iDISK: International Dietary Supplement Knowledgebase; KGSBE: Korean General Surgery Board Exams; MSKCC: Memorial Sloan Kettering Cancer Center; MSRA: Multi-Specialty Recruitment Assessment; NBME: National Board of Medical Examiners; OITE: Orthopedic In-Training Examination; PMFE: Polish Medical Final Examination; TNSPBE: Turkish Neurosurgical Society Proficiency Board Exams
Source(s): Authors’ own work
The development of external knowledge bases for the Retrieval-Augmented Generation (RAG) of GPLLMs encounters significant challenges: (1) heavy reliance on expert knowledge to construct the knowledge base, (2) a limited scale of the knowledge base, and (3) the infrequency of updates to the knowledge base. Specifically, the process of building these knowledge bases heavily depends on experts manually selecting domain-relevant documents (Rasool et al., 2024; Alan et al., 2024). This approach risks critical omissions when constructing a large-scale domain knowledge base (e.g. CLKR), thus reducing its comprehensiveness (Lee et al., 2023; Zhong et al., 2020). Furthermore, domain knowledge, such as in CLKR, continually evolves (Ghimire et al., 2024; Khademi Adel et al., 2022). The failure to update the knowledge base can render it not only ineffective but also potentially harmful to GPLLM performance (Gao et al., 2023). Previous research has rarely addressed the issues of updating these knowledge bases or developing their update mechanisms (Hou and Zhang, 2024; Alan et al., 2024; Mansurova et al., 2024).
The exploitation of knowledge bases relies on RAG technology. A variety of RAG frameworks are available, including LangChain (Langchain-ai, 2024), Haystack (deepset-ai, 2024), RAGFlow (infiniflow, 2024), Txtai (neuml, 2024), LLM-App (pathwaycom, 2024), FlashRAG (RUC-NLPIR, 2024), and Cognita (truefoundry, 2024). Each of these RAG frameworks supports a modular architecture with data connectors, document loaders, vector stores, and GPLLM integrations (Hou and Zhang, 2024; Rasool et al., 2024; Mansurova et al., 2024; Petrus, 2024). Among these, LangChain stands out for its open-source nature, active community, and user-friendly API (Alan et al., 2024; Mansurova et al., 2024). Consequently, this study selects LangChain to integrate GPLLMs with the external knowledge base.
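The modular flow that these frameworks share (split, embed, retrieve, prompt) can be sketched without any framework. The minimal Python sketch below is illustrative only and is not the LangChain-based implementation used in this study. The chunk size of 250, the overlap of 50, the top-3 retrieval, and the squared Euclidean distance follow the workflow shown in Figure 1; the character-based splitting, the bag-of-words `toy_embed` function, the prompt wording, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter


def split_into_chunks(text, chunk_size=250, overlap=50):
    """Split text into overlapping windows (measured in characters: an assumption)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def toy_embed(text, vocabulary):
    """Hypothetical embedding: word counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def retrieve(question, chunks, k=3):
    """Return the k chunks whose vectors are closest to the question vector."""
    vocabulary = sorted({word for chunk in chunks for word in tokenize(chunk)})
    question_vector = toy_embed(question, vocabulary)
    scored = [
        (squared_euclidean(toy_embed(chunk, vocabulary), question_vector), index)
        for index, chunk in enumerate(chunks)
    ]
    scored.sort()  # smallest distance first; ties broken by chunk order
    return [chunks[index] for _, index in scored[:k]]


def build_prompt(question, knowledge_chunks):
    """Combine the retrieved knowledge and the question into one prompt string."""
    context = "\n".join(f"- {chunk}" for chunk in knowledge_chunks)
    return f"Known information:\n{context}\n\nQuestion: {question}"
```

In a real deployment, `toy_embed` would be replaced by the embedding model paired with each GPLLM, and the resulting prompt would be sent to the model; the sketch only shows how the pieces fit together.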
The evaluation of QA performance for GPLLMs necessitates test datasets, with question types potentially being open-ended or closed-ended (Table 2). Open-ended questions are typically designed by a few experts and lack authoritative sources (Alan et al., 2024; Choi et al., 2023). These open-ended questions need subjective expert judgment for assessing answer quality, making them unsuitable as benchmark datasets (Harvel et al., 2024; Zheng et al., 2023). In contrast, most closed-ended question sets are based on officially organized exams with objective and consistent answers (Rizzo et al., 2024; Pursnani et al., 2023; Rosól et al., 2023), making them suitable for use as benchmark datasets (Su et al., 2024). Generally, closed-ended questions outperform open-ended ones in terms of question quantity, source authority, answer judgment, and benchmark suitability. Therefore, closed-ended questions are commonly used in research, while open-ended questions are less frequently employed in evaluating the performance of GPLLMs (Table 2). Most existing studies utilize hundreds of closed-ended questions for QA test dataset development, including multiple-choice questions (Sahin et al., 2024; Hou and Zhang, 2024; Rizzo et al., 2024; Schoch et al., 2024; Tsoutsanis and Tsoutsanis, 2024; Antaki et al., 2023), true/false judgments (Hou and Zhang, 2024; Rasool et al., 2024), and fill-in-the-blank questions (Mansurova et al., 2024; Pursnani et al., 2023). Although numerous test datasets exist, there remains a lack of standardized benchmark datasets specifically designed to evaluate and compare CLQA performance.
2.3 Research gaps
Existing studies (Table 1) on construction-related QA have laid the foundation for this work. However, they exploit conventional deep learning or machine learning QA models, which have two disadvantages: (1) the requirement of large amounts of training data and (2) a small-scale QA scope. In contrast, CLQA involves a broad spectrum of subareas (e.g. permits, contracting, safety, disputes, quality, and environment), and it is difficult to acquire sufficient QA pairs across these subareas for training learning-based models. Hence, the conventional learning-based models used in existing studies are ill-suited to CLQA.
Although GPLLMs exhibit higher language processing capacity than conventional learning-based models and hold the advantage of not requiring training, their lack of domain-specific knowledge may compromise CLQA performance (Table 2). Developing domain knowledge bases faces three challenges. First, the development process heavily depends on an expert-driven selection of documents containing domain knowledge, which limits the scope and updates of CLKR. Second, there is insufficient empirical evidence to confirm that domain knowledge improves CLQA performance. Third, there is a lack of benchmark datasets to test and compare CLQA performance.
3. Methodology
To resolve the aforementioned challenges, a four-phase methodology is devised to build a CLKR that lifts the CLQA performance of GPLLMs. The first phase obtains 702 candidate documents from a total of 374,992 written judgments. In the second phase, the candidates are filtered down to 387 CL documents, which cover eight distinct CL areas and form the CLKR. The third phase integrates the CLKR with GPLLMs using LangChain-based RAG technology. The final phase compares the performance of seven GPLLMs, both with and without the CLKR, on a 2,140-question CLQA validation set. This comparison not only measures the GPLLMs’ performance improvements but also validates the effectiveness of the CLKR (Figure 1).
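The first two phases can be illustrated with a short sketch. Figure 1 indicates that document names are recognized by guillemets (《 》), that identical entities are merged and low-frequency items removed, and that experts then filter the candidates by majority voting. The Python sketch below is a hedged illustration of those steps, not the authors' implementation: the regular expression, the `min_frequency` threshold, the strict-majority rule for ties, and all function names are assumptions.

```python
import re
from collections import Counter

# Text enclosed in Chinese guillemets, e.g. 《中华人民共和国建筑法》.
GUILLEMET_PATTERN = re.compile(r"《([^《》]+)》")


def extract_document_names(judgment_text):
    """Return every guillemet-enclosed name found in one judgment."""
    return [name.strip() for name in GUILLEMET_PATTERN.findall(judgment_text)]


def candidate_documents(judgments, min_frequency=2):
    """Count names across judgments and drop low-frequency ones.

    Identical names are merged by exact string match. The threshold
    `min_frequency` is a hypothetical value, not one reported in the study.
    """
    counts = Counter()
    for text in judgments:
        counts.update(extract_document_names(text))
    return {name: n for name, n in counts.items() if n >= min_frequency}


def accepted_by_majority(votes):
    """True if strictly more than half of the expert votes are positive.

    Treating a tie as a rejection is an assumption.
    """
    return sum(bool(v) for v in votes) * 2 > len(votes)
```

Removing non-law documents (for example, a construction log cited in guillemets) still requires human judgment, which is why the expert voting step follows the automatic extraction.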
The illustration is a multi-row flowchart describing a four-phase process for creating and evaluating a construction law knowledge repository (CLKR) used with general-purpose large language models (GPLLMs). Each phase contains three steps connected in sequence.
Phase 1, "Collecting corpora and recognizing candidate documents for CLKR": (a) gather corpora that contain construction laws (source: China Judgments Online); (b) recognize document name entities in the corpora, using guillemets (left and right double angle brackets) as identifiers; (c) cleanse the document name entities by merging identical entities, removing low-frequency items, and removing non-law documents.
Phase 2, "Identifying CL documents and building the CLKR": (a) filter and align the candidate document entities by majority voting among 10 experts; (b) clarify the structure of the CL knowledge areas, referring to a textbook and reviewed by experts; (c) categorize the CL documents and collect the document contents from the Chinese Laws and Regulations Database.
Phase 3, "Incorporating CLKR into GPLLMs for CLQA": (a) split the CL documents into knowledge chunks (chunk size = 250, overlap = 50) and vectorize the chunks with the most suitable embedding model for each LLM; (b) retrieve question-relevant knowledge chunks by extracting the 3 closest knowledge chunks $I$ with the minimum squared Euclidean distance $L_i^2$:
$$L_i^2 = \left\lVert V_i^{\text{knowledge}} - V^{\text{question}} \right\rVert^2, \qquad I = \operatorname{arg\,min}_3 \left( \{ L_i^2 \}_{i=1}^{N} \right)$$
(c) input the combined question and retrieved knowledge into 7 selected GPLLMs, chosen by three criteria: inclusion of both open-source and closed-source GPLLMs, prioritization of GPLLMs with superior performance, and support for automated batch QA.
Phase 4, "Validating the effectiveness of CLKR": (a) devise a validation set for CLQA, deciding question type and size by referring to the existing literature in Table 2; (b) compare performance differences between GPLLMs with and without the CLKR, calculated by accuracy and tested by the Wilcoxon T test:
$$\text{Accuracy} = \frac{M^{MSQ} + M^{MMQ}}{1 \times N^{MSQ} + 2 \times N^{MMQ}}$$
(c) evaluate each individual CL document's impact on performance enhancement by unranked frequency and ranked frequency:
$$\text{Unranked frequency} = \sum_{i=1}^{n} C_i, \qquad \text{Ranked frequency} = \sum_{i=1}^{n} C_i \times \frac{1}{R_i}$$
Figure 1. The phases of building a CLKR to lift the CLQA performance of GPLLMs. Source(s): Authors’ own work
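The phase 4 metrics in Figure 1 can be illustrated with a small worked sketch. The symbols are not defined in the figure itself, so the Python code below rests on stated assumptions: $M^{MSQ}$ and $M^{MMQ}$ are read as the marks earned on single-answer and multiple-answer questions, each single-answer question is worth 1 mark and each multiple-answer question 2 marks (as the denominator suggests), $C_i$ is read as one occurrence of a document among the retrieved chunks, and $R_i$ as the retrieval rank of that occurrence. The function names are hypothetical.

```python
from collections import defaultdict


def clqa_accuracy(marks_msq, marks_mmq, n_msq, n_mmq):
    """Marks earned divided by the maximum attainable marks.

    Assumes 1 mark per single-answer question and 2 marks per
    multiple-answer question, mirroring the denominator in Figure 1.
    """
    max_marks = 1 * n_msq + 2 * n_mmq
    if max_marks == 0:
        raise ValueError("the validation set must contain at least one question")
    return (marks_msq + marks_mmq) / max_marks


def document_frequencies(retrievals):
    """Compute unranked and ranked frequencies per source document.

    `retrievals` holds one list per question, giving the source document of
    each retrieved chunk in rank order (rank 1 = closest chunk). Each
    occurrence adds 1 to the unranked frequency and 1/rank to the ranked one.
    """
    unranked = defaultdict(float)
    ranked = defaultdict(float)
    for documents_in_rank_order in retrievals:
        for rank, document in enumerate(documents_in_rank_order, start=1):
            unranked[document] += 1.0
            ranked[document] += 1.0 / rank
    return dict(unranked), dict(ranked)
```

Under this reading, the ranked frequency rewards documents whose chunks are retrieved near the top, whereas the unranked frequency only counts how often a document is retrieved at all.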
3.1 Recognition of candidate documents for CLKR
This phase (Figure 1) is designed to collect candidate documents for constructing the CLKR in a data-driven manner. It encompasses (1) gathering corpora (i.e. written judgments) containing construction laws, (2) identifying CL document name entities within the written judgments, and (3) cleansing these CL document name entities. Further details are provided in Figure 2a.
The figure is a large multi‑panel diagram illustrating data‑driven recognition and expert‑based determination of construction law (C L) documents for the C L K R. The top band, labeled on the right “(a) Data‑driven recognition of candidate documents for C L K R”, shows a three‑step pipeline. The first box on the left reads “Collect corpora containing C L documents from China Judgements Online”, and beneath it, within a rounded rectangle, “374,992 written judgments” with a small screenshot icon. A right‑pointing arrow leads to the second box “Recognize document name entities by guillemets (left and right double angle brackets)” with a count “775,241 document name entities”. Another arrow leads to “Cleanse the identified document name entities with three criteria” with three stacked numbers: “7,954 non‑duplicate items”, “1,018 items with no less than 5 appearances”, and “702 candidate documents that end with 10 specific terms”. A downward label “Provide 702 candidate documents” points to the lower half of the figure. The bottom half, marked on the right as “(b) Expertise‑based determination of C L documents in C L K R”, is split into three main text boxes at the top and a large visualization underneath. From left to right, the boxes state: “Filter and align the 702 candidate documents”, “Clarify the 8 C L knowledge areas and 164 C L knowledge subareas”, and “Categorize 387 C L documents into 164 distinct C L K subareas (Table S 2)”. Below the first box is a grey panel headed “387 C L documents for C L K R” containing green rounded rectangles labeled “C L D‑001”, “C L D‑002”, “C L D‑003”, “C L D‑004”, “C L D‑005”, “C L D‑006”, followed by dotted ellipsis dots and ending with “C L D‑386” and “C L D‑387”. 
The center of the figure depicts “C L Knowledge” as a circle feeding eight colored second‑layer areas labeled “C 1: Basic Legal Knowledge for Construction”, “C 2: Construction Permits”, “C 3: Contracting and Subcontracting”, “C 4: Construction Project Contracts and Labor Contracts”, “C 5: Environment and Cultural Heritage Protection”, “C 6: Construction Safety”, “C 7: Construction Quality”, and “C 8: Dispute Resolution”. Each C‑area connects to a band of thinner third‑layer labels such as “C 1‑01 to C 1‑29”, “C 2‑01 to C 2‑14”, “C 3‑01 to C 3‑16”, “C 4‑01 to C 4‑23”, “C 5‑01 to C 5‑12”, “C 6‑01 to C 6‑25”, “C 7‑01 to C 7‑23”, and “C 8‑01 to C 8‑22”. From these subarea codes, many multicolored strands flow rightward into a tall rectangular block labeled on the side “387 C L documents in C L K R”, whose interior is filled with vertical green rectangles representing individual documents. At the bottom, a dashed “Legends” box explains icons: open rounded rectangles represent “Second‑layer C L K area”, narrow rounded rectangles indicate “Third‑layer C L K subarea”, and solid green bars denote “Construction law document”. To the right of the legend is a worked example titled “An example of C L knowledge in subarea C 3‑06”. It shows “C L K” leading to “C 3: Contracting and Subcontracting”, then to “C 3‑06: Statutory requirements for winning bids and handling complaints in bidding”, which in turn connects to three specific green document bars labeled: “C L D‑072: Opinions on Promoting the Sustainable and Healthy Development of the Construction Industry”, “C L D‑232: Tendering and Bidding Law of the People’s Republic of China”, and “C L D‑266: Regulations for the Implementation of Bidding and Tendering Law of the People’s Republic of China”. 
The entire diagram emphasizes the progression from hundreds of thousands of judgments to a curated set of 387 construction law documents structured into 8 knowledge areas and 164 subareas within the C L K R.
Figure 2. Building the CLKR by combining data-driven and expertise-based paradigms. Source(s): Authors’ own work
The collection of construction-related legal corpora consists of determining corpora sources and clarifying search keywords. The written judgments of the construction industry encompass various laws referred to in real-world legal cases, which could be an ideal and authoritative source for determining candidate documents. For example, the “China Judgments Online” is established and developed by the government to show all legal case judgments in Mainland China to the public, and its authority and comprehensiveness ensure both the quality and quantity necessary for serving as the corpora source (SPC, 2023). This study uses the keywords “construction engineering” and “construction project” to search for written judgments in the construction industry. These keywords can appear in any part of the judgment text. Considering the extensive historical corpora, this study focuses on the most recent legal cases. As a result, a total of 374,992 written judgments from January 2021 to July 2023 are collected through online retrieval (Figure 2a).
Subsequently, document name entities are recognized in 374,992 written judgments (Figure 2a). Guillemets are a pair of punctuation marks in the form of sideways double chevrons (i.e. 《》), which are used to highlight names of books, articles, laws, regulations, and other documents in Chinese text. This study employs guillemets as identifiers to precisely and efficiently recognize document entities (e.g. 《Management Regulations of Registered Construction Engineers》). Utilizing this identifier-based recognition method, a total of 772,559 document name entities are retrieved from these judgments (Table S1).
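The identifier-based recognition described above can be sketched with a regular expression. This is a minimal illustration rather than the authors' implementation; the function name `extract_document_names` and the sample sentence are assumptions made for the example.

```python
import re
from typing import List

# Match the text enclosed by one pair of guillemets. Excluding the guillemet
# characters inside the group avoids spanning two adjacent document names.
GUILLEMET_PATTERN = re.compile(r"《([^《》]+)》")


def extract_document_names(judgment_text: str) -> List[str]:
    """Return every guillemet-enclosed name, in order of appearance."""
    return [name.strip() for name in GUILLEMET_PATTERN.findall(judgment_text)]


# Hypothetical judgment excerpt (English stand-in for the Chinese original).
sample = (
    "Pursuant to 《Management Regulations of Registered Construction Engineers》 "
    "and the 《Commercial Housing Sales Contract》 signed by both parties, ..."
)
names = extract_document_names(sample)
print(names)
```

Duplicates are deliberately kept at this stage, since the later cleansing step relies on how often each name occurs.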
The cleansing process of document name entities involves merging identical entities, excluding low-frequency items, and eliminating non-law documents (Figure 2a). Specifically, duplicates among the 772,559 document name entities are merged, and 7,954 non-duplicate items are retained (Figure 2a). Then, the document name entities that appear fewer than five times (i.e. an occurrence rate of less than 0.01% in 374,992 cases) are excluded, which removes most name entities containing typographical errors. This reduces the number of documents from 7,954 to 1,018 (Figure 2a). As the naming system for legal documents in Mainland China is rigorous, a legal document name must end with one of 10 specific terms: decision/ruling, guidelines, interpretation, law/code, method/procedure, notice, opinion/advisory, ordinance, regulation/rule, and standards/norm. Thus, non-legal documents lacking these terms are excluded, such as 《Shenzhen Newspaper》 and 《Commercial Housing Sales Contract》. After these exclusions, 702 documents remain as candidates for CLKR construction (Figure 2a). Table S1 shows the whole process from 374,992 written judgments to 702 candidate documents step by step.
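The three cleansing criteria can be expressed as a short pipeline. The sketch below is illustrative only: the function name `cleanse`, the toy entity list, and the English suffixes are assumptions, whereas the actual study matches the terminal terms of Chinese document names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cleanse(
    entities: Iterable[str],
    min_count: int,
    legal_suffixes: Tuple[str, ...],
) -> List[str]:
    """Apply the three cleansing criteria to raw document name entities."""
    # Criterion 1: merge identical entities while counting occurrences.
    counts = Counter(name.strip() for name in entities)
    # Criterion 2: keep only names that appear at least `min_count` times.
    frequent = [name for name, count in counts.items() if count >= min_count]
    # Criterion 3: keep only names ending with a legal-document term.
    return sorted(name for name in frequent if name.endswith(legal_suffixes))


# Toy data: one frequent non-law name and one rare misspelling.
raw_entities = (
    ["Safety Regulation"] * 5
    + ["Shenzhen Newspaper"] * 6
    + ["Safty Regulation"]
    + ["Bidding Law"] * 5
)
candidates = cleanse(raw_entities, min_count=5, legal_suffixes=("Law", "Regulation"))
print(candidates)
```

In this toy run the misspelled name is dropped by the frequency threshold and the newspaper by the suffix rule, mirroring the two kinds of noise the text describes.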
3.2 Determination of construction laws in CLKR
Following the acquisition of 702 candidate documents through the data-driven approach, expertise is applied in this phase to filter the CL documents and construct the CLKR (Figure 2b). The procedure includes (1) filtering out and aligning the document entities, (2) refining a “1-8-164” structure of CL knowledge areas, and (3) categorizing the CL documents and gathering the specific contents.
This step involves filtering out ambiguous named entities and aligning different entities that refer to the same CL document (Figure 2b). Ten experienced experts (Table S1) are invited to manually review the 702 documents one by one. A majority vote is conducted for each removal to minimize the subjectivity of the experts’ judgments: if 6 or more of the 10 experts agree that a document should be excluded, it is removed from the list of candidate documents. Ambiguous named entities, such as “notice”, “guidelines”, and “interpretation”, do not identify a specific document; 17 of the 702 document name entities are excluded on this basis, leaving 685 (Table S1). Subsequently, multiple entities referring to the same CL document are aligned, and redundant document entities are removed. For example, “Civil Code” is removed as it refers to the same document as “Civil Code of the People’s Republic of China”. As a result, 298 of the remaining 685 candidate documents are excluded, leaving 387 CL documents in the final set (Figure 2b). The voting details and the lists of excluded and retained documents are provided in Table S1.
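The majority-vote rule and the resulting document counts can be checked with a few lines of code. This is a minimal sketch; the function name `should_exclude` and the example ballots are hypothetical, and only the threshold of 6 out of 10 and the counts come from the text.

```python
from typing import Sequence


def should_exclude(votes_to_exclude: Sequence[bool], threshold: int = 6) -> bool:
    """Return True when at least `threshold` experts vote for exclusion."""
    return sum(votes_to_exclude) >= threshold


# Hypothetical ballots from 10 experts (True = exclude the document).
clear_majority = [True] * 6 + [False] * 4
tied_vote = [True] * 5 + [False] * 5

print(should_exclude(clear_majority))  # six votes reach the threshold
print(should_exclude(tied_vote))       # a 5-5 tie keeps the document

# Bookkeeping of the counts reported in the text.
after_ambiguity_filter = 702 - 17
final_documents = after_ambiguity_filter - 298
print(after_ambiguity_filter, final_documents)
```

Because the threshold is a strict majority of ten, a tied vote retains the document.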
A three-layer knowledge hierarchy (Figure 2b) is established by referring to a textbook (Li et al., 2021) and consulting the group of ten experts (Table S1). The first layer is the root node, entitled CL knowledge, beneath which lie 8 second-layer knowledge areas concerning basic legal knowledge, permits, contracting and subcontracting, project and labor contracts, environment and cultural heritage protection, safety, quality, and dispute resolution (Figure 2b). These 8 second-layer areas are further subdivided into 164 third-layer CL knowledge subareas to organize the documents in CLKR with a finer granularity (Figure 2b and Table S2).
After establishing the three-layer knowledge hierarchy, the experts categorized 387 CL documents into 164 distinct third-layer CL knowledge subareas (Figure 2b and Table S2). For instance, the CL knowledge subarea C3-06 has three CL documents (CLD): CLD-072, CLD-232, and CLD-266. Each CL document can be categorized into multiple knowledge subareas. This categorization process clarifies the relationships between CL knowledge subareas and CL documents, which further ensures that CLKR comprehensively covers all 164 CL knowledge subareas. The specific contents of these CL documents are obtained from the Chinese Laws & Regulations Database (NPC, 2024). Finally, the CLKR, containing 387 documents across 8 second-layer areas and 164 third-layer subareas, has been developed and released in the GitHub repository.
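Because each CL document can belong to several subareas, the categorization is a many-to-many mapping. The sketch below shows one way to store it and to check subarea coverage. Only the C3-06 entry comes from the worked example in Figure 2b; the subarea codes `C3-XX` and `C3-YY` and both helper functions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

subarea_to_docs: Dict[str, List[str]] = {
    # From the worked example in Figure 2b.
    "C3-06": ["CLD-072", "CLD-232", "CLD-266"],
    # Hypothetical second subarea, included only to show a shared document.
    "C3-XX": ["CLD-232"],
}


def invert(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build the reverse index: document -> subareas it is categorized into."""
    doc_to_subareas: Dict[str, List[str]] = defaultdict(list)
    for subarea, docs in mapping.items():
        for doc in docs:
            doc_to_subareas[doc].append(subarea)
    return dict(doc_to_subareas)


def uncovered_subareas(all_subareas: List[str], mapping: Dict[str, List[str]]) -> List[str]:
    """List the subareas that have no document assigned yet."""
    return [s for s in all_subareas if not mapping.get(s)]


doc_index = invert(subarea_to_docs)
missing = uncovered_subareas(["C3-06", "C3-XX", "C3-YY"], subarea_to_docs)
print(doc_index["CLD-232"], missing)
```

An empty result from `uncovered_subareas` over all 164 subarea codes would correspond to the full coverage the text reports.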
3.3 CLKR-empowered GPLLMs for CLQA
This section uses retrieval-augmented generation (RAG) to integrate the CLKR with GPLLMs for CLQA. It comprises three steps (Figure 1): (1) dividing the 387 CL documents from the CLKR into knowledge chunks (Figure 3a), (2) retrieving knowledge chunks pertinent to the CL question (Figure 3b), and (3) integrating the question with the retrieved knowledge for input into seven selected GPLLMs (Figure 3c).
The figure is a three‑section workflow diagram describing construction‑law‑aware question answering. The top section, labeled “(a) Split 387 documents into knowledge chunks”, begins on the left with a vertical stack of rounded rectangles labeled “C L D‑001”, “C L D‑002”, ellipsis dots, and “C L D‑387”, collectively marked “C L K R”. An arrow labeled “Split” points to a box containing items “Knowledge chunk 1”, “Knowledge chunk 2”, ellipsis, and “Knowledge chunk n”. A second arrow labeled “Embed chunks” leads to a box listing “Knowledge vector 1”, “Knowledge vector 2”, ellipsis, and “Knowledge vector n”. A third arrow leads to a magenta box titled “F A I S S‑formatted vector repository” that shows three example high‑dimensional vectors, such as “0.101, negative 0.002, ellipsis, negative 0.400”, “negative 0.003, negative 0.902, ellipsis, negative 0.007”, ellipsis, and “negative 0.803, 0.005, ellipsis, negative 0.243”, under the heading “Store the vectors”. The middle section is titled “(b) Retrieve question‑relevant knowledge chunks”. On the left, a user icon labeled “User” speaks “Questions” in a bubble, and an arrow labeled “Vectorize” points to a box listing “Question vector 1”, “Question vector 2”, up to “Question vector m”. A horizontal arrow labeled “Compare” connects this box to a box describing retrieved knowledge: rows such as “Relevant knowledge chunk 1‑1 to 1‑3”, “Relevant knowledge chunk 2‑1 to 2‑3”, ellipsis, and “Relevant knowledge chunk m‑1 to m‑3”. A side note states, “Retrieve 3 relevant knowledge chunks for each question”, with a loop arrow back up to the F A I S S vector repository, indicating a similarity search between question vectors and stored knowledge vectors. The bottom section is labeled “Combine question and knowledge” on the left and “(c) Input the combined question and retrieved knowledge into G P L L Ms” on the right. 
Two large ovals represent batched inputs: the top oval shows “Question 1” paired with “Relevant knowledge chunk 1‑1”, “Relevant knowledge chunk 1‑2”, and “Relevant knowledge chunk 1‑3”, while the bottom oval shows “Question m” with “Relevant knowledge chunk m‑1”, “Relevant knowledge chunk m‑2”, and “Relevant knowledge chunk m‑3”. Arrows labeled “Input” point from these ovals to a vertical list titled “General‑purpose large language model (G P L L M)”, which enumerates specific models with icons: “Llama‑2‑70 b”, “Text‑davinci‑003”, “G P T‑3.5 Turbo”, “G P T‑4”, “Chat G L M 2‑6 B”, “E R N I E‑Bot‑turbo”, and “E R N I E‑Bot 4.0”. On the far right, a bubble labeled “Answers” indicates the generated outputs.
Figure 3. The process of leveraging CLKR to empower GPLLM for CLQA. Source(s): Authors’ own work
RAG is a widely adopted method for integrating external knowledge into GPLLMs (Mansurova et al., 2024; Alan et al., 2024). This study implements RAG with the widely used LangChain framework (Langchain-ai, 2024). The CLKR is first loaded using a document loader, and the loaded documents are then divided into smaller knowledge chunks (Figure 3a) rather than being presented as entire documents (Rasool et al., 2024; Alan et al., 2024). Chunk size refers to the token count of each chunk during document splitting, while overlap indicates the tokens repeated between adjacent chunks (Mansurova et al., 2024; Alan et al., 2024). After referring to existing studies (Langchain-ai, 2024; Eleliemy and Ciorba, 2021; Wang et al., 2024), the chunk size is set to 250 tokens. This choice also respects model input limits: certain GPLLMs (e.g. Llama-2-70B) have a 4096-token limit for one input (meta-llama, 2024), which must accommodate three knowledge chunks, the prompt, and the question itself. Setting each chunk size to 250 tokens ensures that the total length of these elements stays within this limit (meta-llama, 2024; Wang et al., 2024). Adjacent chunks overlap by 50 tokens to maintain continuity and prevent information loss at chunk boundaries (Langchain-ai, 2024; Domengo, 2024; Eleliemy and Ciorba, 2021). The CL knowledge chunks are then vectorized using embedding models, which convert them from text into numerical knowledge vectors (Figure 3a). In selecting embedding models, the authors adhere to the recommendations of each GPLLM developer, as the recommended embedding models allow the GPLLM to attain peak performance (Rasool et al., 2024; Hou and Zhang, 2024). For example, ERNIE Embedding-V1 is recommended for ERNIE-Bot 4.0 in its official technical documentation (Table 3). These CL knowledge vectors are stored in a FAISS-formatted vector store for reuse. FAISS and Chroma are among the most commonly used vector stores, but FAISS is particularly noted for its faster retrieval speed in RAG tasks (Langchain-ai, 2024; Rasool et al., 2024).
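The effect of the chunk size and overlap settings can be illustrated with a plain sliding window over a token list. This sketch is not the LangChain splitter used in the study; the function name `split_into_chunks` and the integer "tokens" are assumptions for illustration.

```python
from typing import List, Sequence


def split_into_chunks(
    tokens: Sequence, chunk_size: int = 250, overlap: int = 50
) -> List[list]:
    """Split tokens into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 200 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks


# A hypothetical 600-token document, with integers standing in for tokens.
document_tokens = list(range(600))
chunks = split_into_chunks(document_tokens)
print([len(chunk) for chunk in chunks])

# Three retrieved chunks use at most 750 tokens of a 4096-token input,
# leaving the remainder for the prompt and the question.
retrieved_budget = 3 * 250
print(retrieved_budget, 4096 - retrieved_budget)
```

The last 50 tokens of each chunk reappear at the start of the next one, which is how a clause that straddles a boundary stays intact in at least one chunk.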
Table 3. GPLLMs selected for integration with CLKR
| No | Contributors | GPLLMs | Parameters | Best in processing | Open-/Closed-source | Corresponding embedding models |
|---|---|---|---|---|---|---|
| 1 | Meta | Llama-2-70b | 70 billion | English | Open-source | all-mpnet-base-v2 |
| 2 | OpenAI | text-davinci-003 | Unknown | English | Closed-source | text-embedding-ada-002 |
| 3 | OpenAI | GPT-3.5 Turbo | 20 billion | English | Closed-source | text-embedding-ada-002 |
| 4 | OpenAI | GPT-4 | 1.8 trillion | English | Closed-source | text-embedding-ada-002 |
| 5 | Tsinghua University | ChatGLM2-6B | 6 billion | Chinese | Open-source | text2vec-large-chinese |
| 6 | Baidu | ERNIE-Bot-turbo | 13 billion | Chinese | Closed-source | ERNIE Embedding-V1 |
| 7 | Baidu | ERNIE-Bot 4.0 | >1 trillion | Chinese | Closed-source | ERNIE Embedding-V1 |
Source(s): Authors’ own work
Then, this study uses a Euclidean distance-based method to find question-related knowledge chunks in the FAISS vector store (Figure 3b). The distance is defined as:
$$L_i^2 = \left\lVert V_i^{\mathrm{knowledge}} - V^{\mathrm{question}} \right\rVert_2^2$$
where $L_i^2$ represents the squared Euclidean distance between the $i$th knowledge chunk vector $V_i^{\mathrm{knowledge}}$ and the question vector $V^{\mathrm{question}}$. A smaller distance signifies a higher similarity, indicating that the knowledge chunk is more relevant to the question. The three knowledge vectors nearest to the question vector are defined mathematically as:
$$I = \underset{3}{\arg\min} \left( \left\{ L_i^2 \right\}_{i=1}^{N} \right)$$
where $I$ is the set of indices for the three closest of the $N$ knowledge vectors. The selected knowledge chunks are subsequently used as the background information for addressing the query.
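The retrieval rule can be reproduced with a brute-force search in plain Python. This sketch stands in for the FAISS index used in the study and is meant only to make the distance and selection steps concrete; the function names and the two-dimensional toy vectors are assumptions.

```python
from typing import List, Sequence


def squared_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def top_k_indices(
    knowledge_vectors: Sequence[Sequence[float]],
    question_vector: Sequence[float],
    k: int = 3,
) -> List[int]:
    """Indices of the k knowledge vectors closest to the question vector."""
    distances = [squared_euclidean(vec, question_vector) for vec in knowledge_vectors]
    # Sort chunk indices by distance, smallest first, and keep the first k.
    return sorted(range(len(distances)), key=distances.__getitem__)[:k]


# Toy 2-D embeddings; real embedding vectors have hundreds of dimensions.
knowledge_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2], [3.0, 3.0]]
question_vector = [1.0, 1.0]
nearest = top_k_indices(knowledge_vectors, question_vector)
print(nearest)
```

Ranking by the squared distance gives the same order as ranking by the distance itself, so the square root can be skipped.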
Finally, the top three relevant knowledge chunks and the question are combined into a new query, which is then input into the GPLLM to obtain the corresponding answer (Figure 3c). In this study, seven GPLLMs are selected for testing. The selection criteria are: (1) inclusion of both open-source and closed-source GPLLMs, (2) prioritization of GPLLMs demonstrating superior performance, and (3) a requirement that the GPLLMs support automated batch QA. The chosen GPLLMs in Table 3 represent a mix of open-source and closed-source technologies. As indicated on the LLM leaderboard, the models of OpenAI and Baidu are recognized as the leading closed-source models for English and Chinese, respectively (Pei et al., 2024; Oh et al., 2023). Similarly, the models of Meta and Zhipu stand out as the top open-source models for English and Chinese, respectively (Pei et al., 2024; Lu et al., 2024). Since this study requires each model to answer thousands of questions, only GPLLMs that are either open-source or closed-source with accessible APIs can be included (Sahin et al., 2024; Saad et al., 2023; Gilson et al., 2023). As a result, certain popular GPLLMs like Copilot, which fundamentally rely on OpenAI’s GPLLMs and do not support automated batch QA, have to be excluded from this study (Khan, 2024).
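The combination of a question with its retrieved chunks can be sketched as simple string assembly. The prompt wording below is hypothetical, since the study's actual prompt is not given in this section; the function name `build_query`, the sample question, and the sample chunks are likewise assumptions.

```python
from typing import List


def build_query(question: str, chunks: List[str]) -> str:
    """Combine the retrieved knowledge chunks and the question into one query."""
    # Number the chunks so that the model can tell them apart.
    context = "\n\n".join(
        f"[Knowledge {i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    # Hypothetical template: background knowledge first, then the question.
    return (
        "Background knowledge:\n"
        f"{context}\n\n"
        "Question:\n"
        f"{question}\n\n"
        "Answer with the option letter(s) only."
    )


# Placeholder inputs for illustration.
question = "Which party must apply for the construction permit? A ... B ... C ... D ..."
retrieved = ["Chunk text one.", "Chunk text two.", "Chunk text three."]
query = build_query(question, retrieved)
print(query)
```

For batch QA, the same function would be applied to each question and its three chunks before the query is sent to the model.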
3.4 Effectiveness validation of CLKR
The effectiveness of CLKR is validated by examining whether the performance of GPLLMs improves after integration with CLKR (Figure 1). The validation involves (1) devising a validation set for CLQA, (2) comparing the differences between initial GPLLMs and CLKR-empowered GPLLMs, and (3) evaluating the impact of individual CL documents on CLQA performance enhancement.
As there are no ready-to-use benchmark datasets of CLQA, a validation set comprising 2,140 real questions is developed, which covers first-level PCEQEs (11 test papers) and second-level PCEQEs (13 test papers) from 2014 to 2023 (Figure 4). The PCEQE is the most authoritative qualification assessment for those aspiring to be registered construction engineers in Mainland China (Liu and Low, 2011). Of the 2,140 questions, 1,550 are multiple-choice single-answer questions (MSQs), and 590 are multiple-choice multiple-answer questions (MMQs) (Figure 4). Additionally, each question is labeled with its source PCEQE paper and the corresponding CL knowledge area (Table S3).
The figure is divided into three dashed boxes labeled “Test paper tag”, “Question type tag”, and “C L K area tag”, showing how exam questions are categorized. On the left, under “Test paper tag”, the top half states “1100 questions from 11 test papers of first‑level P C E Q E” above a four-by-three grid of document icons captioned with years: “2014”, “2015”, “2016”, “2017”, “2018”, “2019”, “2020”, “2021”, “2022”, “2022 asterisk”, and “2023”. The bottom half reads “1040 questions from 13 test papers of second‑level P C E Q E” above another grid of document icons labeled “2014”, “2015”, “2016”, “2017”, “2018”, “2019”, “2020”, “2020 asterisk”, “2021”, “2021 asterisk”, “2022”, “2022 asterisk”, and “2023”. Arrows from both groups point to the central box. The middle “Question type tag” box contains four stacked cylindrical icons. For the first‑level questions, the upper cylinder is labeled “770 multiple‑choice single‑answer questions (M S Q s)” and the second cylinder “330 multiple‑choice multiple‑answer questions (M M Q s)”. For the second‑level questions, the third cylinder is labeled “780 M S Q s” and the bottom cylinder “260 M M Q s”. A right‑pointing arrow from this box leads to the third box. The rightmost “C L K area tag” box lists how questions are distributed across eight construction‑law knowledge areas: “457 C 1‑related questions”, “120 C 2‑related questions”, “304 C 3‑related questions”, “397 C 4‑related questions”, “130 C 5‑related questions”, “261 C 6‑related questions”, “244 C 7‑related questions”, and “227 C 8‑related questions”.
Figure 4. The CLQA dataset. Note: * indicates an extra PCEQE held that year. Source(s): Authors’ own work
The performance differences between GPLLMs with and without CLKR are measured by accuracy and statistically tested using the Wilcoxon T Test. Since MSQs and MMQs vary in difficulty, they are assigned different point values in PCEQEs: each MSQ is worth one point, and each MMQ is worth two points. For MMQs, the full two points are awarded if all correct options are selected. Selecting some, but not all, of the correct options earns 0.5 points for each correct option chosen. Selecting any incorrect option results in zero points. The CLQA accuracy of a GPLLM is assessed based on the marks it receives on PCEQE papers or particular question sets (e.g. all 457 questions in the C1 subarea as shown in Figure 4), that is, the marks obtained divided by the full marks available:

$$\text{Accuracy} = \frac{M_{MSQ} + M_{MMQ}}{N_{MSQ} \times 1 + N_{MMQ} \times 2}$$

where $M_{MSQ}$ and $M_{MMQ}$ refer to the marks obtained on MSQs and MMQs, and $N_{MSQ}$ and $N_{MMQ}$ are the numbers of MSQs and MMQs.
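The marking rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' scoring script: the function names, the use of option letters, and the representation of MMQ answers as sets are all assumptions made for the example.

```python
from typing import Sequence, Set, Tuple

MSQ_POINTS = 1.0      # each single-answer question is worth one point
MMQ_POINTS = 2.0      # each multiple-answer question is worth two points
PARTIAL_POINTS = 0.5  # per correct option when the selection is incomplete


def score_msq(selected: str, correct: str) -> float:
    """One point if the single selected option is the correct one."""
    return MSQ_POINTS if selected == correct else 0.0


def score_mmq(selected: Set[str], correct: Set[str]) -> float:
    """Apply the MMQ rules: full marks, partial credit, or zero."""
    if not selected or not selected <= correct:
        return 0.0  # nothing selected, or at least one incorrect option
    if selected == correct:
        return MMQ_POINTS  # all correct options selected
    return PARTIAL_POINTS * len(selected)  # some, but not all, correct options


def accuracy(
    msq_answers: Sequence[Tuple[str, str]],
    mmq_answers: Sequence[Tuple[Set[str], Set[str]]],
) -> float:
    """Marks obtained divided by the full marks of the question set."""
    full_marks = MSQ_POINTS * len(msq_answers) + MMQ_POINTS * len(mmq_answers)
    if full_marks == 0:
        raise ValueError("the question set is empty")
    msq_marks = sum(score_msq(s, c) for s, c in msq_answers)
    mmq_marks = sum(score_mmq(s, c) for s, c in mmq_answers)
    return (msq_marks + mmq_marks) / full_marks
```

For example, two MSQs with one answered correctly, plus three MMQs answered fully, partially (one correct option), and with an incorrect option, give 1 + 2 + 0.5 + 0 = 3.5 marks out of 8.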
Additionally, the Wilcoxon T Test is used to test whether there is a significant difference between the performance obtained by the original GPLLMs and the CLKR-empowered GPLLMs (Figure 1). The mark comparisons are conducted from three perspectives: 24 PCEQE papers, MSQs/MMQs, and 8 second-layer knowledge areas. If CLKR-empowered GPLLMs show significant performance improvements over the original GPLLMs in PCEQEs and across CL knowledge areas, this would strongly validate the effectiveness and comprehensiveness of CLKR. Conversely, the absence of a significant difference would indicate that CLKR is ineffective.
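A paired comparison of this kind can be sketched as follows. The sketch assumes the large-sample normal approximation of the Wilcoxon signed-rank test, without continuity or tie correction, which yields a z-statistic and a two-sided p-value; the function name and this particular variant are assumptions, since the exact software settings used in the study are not stated here.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(
    before: Sequence[float], after: Sequence[float]
) -> Tuple[float, float]:
    """Return (z, two-sided p) for paired samples via the normal approximation."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Sum of the ranks that belong to positive differences (improvements).
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided
```

Under this approximation, 24 paired test papers that all improve give a positive rank sum of 300 against an expected 150 with a standard deviation of 35, so z = 150/35 ≈ 4.286, which is the largest value the statistic can take for 24 pairs.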
Finally, the impact of each of the 387 CL documents on CLQA performance improvement is quantitatively evaluated using two distinct indicators: unranked frequency (Eq. (4)) and ranked frequency (Eq. (5)). For each of the 2,140 questions, the three most question-relevant knowledge chunks are extracted from the 387 documents. Unranked frequency measures how often a document is referenced, counting every appearance of a document as the source of a retrieved knowledge chunk regardless of the similarity rank. It is defined as:

$$UF_d = \frac{\sum_{i=1}^{N} x_{i,d}}{N} \tag{4}$$

where $x_{i,d} = 1$ if document $d$ is the source of the $i$-th retrieved knowledge chunk, otherwise $x_{i,d} = 0$; $N$ refers to the total number of retrieved knowledge chunks. Ranked frequency considers the rank of similarity when counting knowledge chunk-sourced document appearances, which is defined as:
where refers to the rank (i.e. 1, 2, or 3) of the top 3 question-relevant knowledge chunks.
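The unranked indicator amounts to a normalized count of how often each document supplies a retrieved chunk. The sketch below illustrates that counting under assumed names: `retrievals` is a hypothetical list holding, for each question, the source-document identifiers of its retrieved chunks. Only the unranked indicator is sketched; the ranked variant would additionally weight each appearance by its similarity rank as specified in Eq. (5).

```python
from collections import Counter
from typing import Dict, Sequence


def unranked_frequency(retrievals: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Share of all retrieved chunks that come from each source document."""
    # Count every appearance of a document, regardless of similarity rank.
    counts = Counter(doc for chunks in retrievals for doc in chunks)
    total_chunks = sum(counts.values())
    if total_chunks == 0:
        raise ValueError("no knowledge chunks were retrieved")
    return {doc: count / total_chunks for doc, count in counts.items()}


# Two questions with three retrieved chunks each (document IDs are made up).
example = [["doc_a", "doc_b", "doc_a"], ["doc_b", "doc_c", "doc_a"]]
frequencies = unranked_frequency(example)
```

In this toy input, `doc_a` supplies three of the six retrieved chunks, so its unranked frequency is 0.5, and the frequencies of all documents sum to 1.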
4. Results
With the devised four-phase methodology, this study obtains the answers from seven pairs of original and CLKR-empowered GPLLMs, as well as the extracted knowledge chunks during CLKR-engaged CLQA, as depicted in Figure 5. All 29,960 answers provided by these GPLLMs and the knowledge chunks with their similarity ranks are found in Table S4. The authors compare CLQA accuracies of the original versus CLKR-enhanced GPLLMs on different test papers, between MSQs and MMQs, across 8 CL knowledge areas, and in 100 open-ended questions (Figure 5). Additionally, each CL document’s impact is evaluated and compared by unranked and ranked frequency (Figure 5).
Figure 5. The comparison results of GPLLMs’ CLQA accuracies with and without CLKR. The workflow starts from the seven GPLLMs (Llama-2-70b, text-davinci-003, GPT-3.5 Turbo, GPT-4, ChatGLM2-6B, ERNIE-Bot-turbo, and ERNIE-Bot 4.0), which are integrated with the CLKR of 387 CL documents across C1-C8. The seven original GPLLMs give 14,980 answers to the 2,140 questions and receive 8,404 marks. The seven CLKR-empowered GPLLMs give another 14,980 answers, recorded together with the retrieved knowledge chunks and their similarity ranks, and receive 10,202.5 marks.
Performance comparison on each test paper, on MSQs or MMQs, across 8 CL areas, and on open-ended questions (Section 4.1): CLKR enhances the CLQA performance of the 7 GPLLMs by an average of 21.1%, varying from 9.9% to 44.9% (Table 4); CLQA performance on MSQs and MMQs improves by 14.9% and 38.3% (Table 5); CLKR enhances CLQA performance by 14.5% to 28.2% across the eight CL knowledge areas (Table 6); CLKR enhances the 7 GPLLMs’ performance on 100 open-ended questions by an average of 22.0% (Table S4). Impact evaluation of individual CL documents on performance enhancement (Section 4.2): the top 10 (2.6%) documents offer 37.2% (unranked) and 37.3% (ranked) of the knowledge for CLQA (Figure 9); 210 documents retrieved fewer than 5 times offer 3.7% (unranked) and 3.5% (ranked) of the knowledge for CLQA (Table S4). Source(s): Authors’ own work
4.1 Performance comparison between GPLLMs with and without CLKR
4.1.1 CLKR-enabled performance enhancements on PCEQE test papers
The Wilcoxon T Test results demonstrate that CLKR significantly enhances the accuracy of all seven GPLLMs in CLQA (Table 4). On average, CLKR yields a 21.1% relative increase in the accuracy of these GPLLMs, from 0.442 to 0.535 (Table 4). The performance enhancement across the seven GPLLMs ranges from 9.9% to 44.9% (Figure 6). The CLKR-empowered text-davinci-003 exhibits the most substantial improvement, achieving a 44.9% accuracy increase, rising from 0.329 to 0.476 (Figure 6b). Although the CLKR-empowered ERNIE-Bot 4.0 shows the smallest improvement at only 9.9% (Figure 6g), the Wilcoxon T Test still confirms that this improvement is significant (Table 4).
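The enhancement percentages quoted here are relative gains over the baseline accuracy, not differences in percentage points. A minimal sketch of that arithmetic, assuming a hypothetical helper `relative_enhancement` and using rounded average accuracies reported in Table 4, is:

```python
def relative_enhancement(without_clkr: float, with_clkr: float) -> float:
    """Relative accuracy gain over the baseline, in percent."""
    if without_clkr <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (with_clkr - without_clkr) / without_clkr * 100


# Rounded average accuracies (without CLKR, with CLKR) taken from Table 4.
reported_means = {
    "Llama-2-70b": (0.283, 0.363),
    "text-davinci-003": (0.329, 0.476),
    "ERNIE-Bot 4.0": (0.755, 0.830),
}
gains = {
    model: round(relative_enhancement(base, enhanced), 1)
    for model, (base, enhanced) in reported_means.items()
}
```

Recomputing from the rounded means gives 28.3% for Llama-2-70b and 9.9% for ERNIE-Bot 4.0, matching the reported values, and 44.7% for text-davinci-003 against the reported 44.9%. Small differences of this size are expected because the means in the table are rounded to three decimals.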
Figure 6. Performance of original and CLKR-empowered GPLLMs in PCEQEs. Panels (a)-(g) show, for Llama-2-70b, text-davinci-003, GPT-3.5 Turbo, GPT-4, ChatGLM2-6B, ERNIE-Bot-turbo, and ERNIE-Bot 4.0 respectively, the distribution of accuracy (vertical axis, 0.2 to 1.0) on each PCEQE test paper without and with CLKR, with boxes for the 25%-75% range, whiskers for the min-max range, a horizontal line for the median, and a diamond for the mean. A dashed horizontal line marks the passing line (accuracy = 0.6). The mean accuracies and relative gains annotated in the panels are those listed in Table 4; only the CLKR-empowered GPT-4 (0.663) and ERNIE-Bot 4.0 (0.755 without and 0.830 with CLKR) have mean accuracies above the passing line. Note: All the numerical data values are approximated. Source(s): Authors’ own work
Table 4. Wilcoxon T Tests on CLQA accuracy of 7 GPLLMs with and without CLKR in PCEQEs
| No | GPLLM | CLKR | Average accuracy | Accuracy enhancement | z-statistic | p-value |
|---|---|---|---|---|---|---|
| 1 | Llama-2-70b | without | 0.283 | 28.3% | 4.197 | 0.000*** |
| | | with | 0.363 | | | |
| 2 | text-davinci-003 | without | 0.329 | 44.9% | 4.286 | 0.000*** |
| | | with | 0.476 | | | |
| 3 | GPT-3.5 Turbo | without | 0.349 | 36.3% | 4.287 | 0.000*** |
| | | with | 0.476 | | | |
| 4 | GPT-4 | without | 0.528 | 25.4% | 4.171 | 0.000*** |
| | | with | 0.663 | | | |
| 5 | ChatGLM2-6B | without | 0.430 | 11.1% | 3.729 | 0.000*** |
| | | with | 0.478 | | | |
| 6 | ERNIE-Bot-turbo | without | 0.419 | 10.2% | 3.429 | 0.002*** |
| | | with | 0.462 | | | |
| 7 | ERNIE-Bot 4.0 | without | 0.755 | 9.9% | 4.029 | 0.000*** |
| | | with | 0.830 | | | |
| | Average accuracy of 7 GPLLMs | without | 0.442 | 21.1% | NA | NA |
| | | with | 0.535 | | | |
Note(s): *** denote confidence levels above 99%
Source(s): Authors’ own work
CLKR can significantly enhance the CLQA performance of GPLLMs regardless of their NLP capabilities and their training language (i.e. Chinese or English) (Tables 3 and 4). ERNIE-Bot 4.0 leads the individual GPLLMs, with its accuracy improving from 0.755 to 0.830 (Figure 6g). The CLKR-empowered GPT-4 also shows significant gains, with its accuracy increasing from 0.528 to 0.663 (Figure 6d) and thereby surpassing the PCEQE passing mark of 0.6. Although the average accuracy of the other five GPLLMs does not reach the PCEQE passing mark, their considerable improvements also affirm the effectiveness of CLKR in boosting the performance of GPLLMs in CLQA (Figure 6a, 6b, 6c, 6e, and 6f). Meanwhile, the CLKR, which is built from CL documents written in Chinese (Table S2), significantly boosts the CLQA performance of GPLLMs launched by Chinese institutions (i.e. ChatGLM2-6B, ERNIE-Bot-turbo, and ERNIE-Bot 4.0) (Figure 6e, 6f, and 6g). It also provides notable improvements for GPLLMs primarily trained on English corpora, such as Llama-2-70b, text-davinci-003, GPT-3.5 Turbo, and GPT-4 (Figure 6a, 6b, 6c, and 6d).
4.1.2 CLKR-enabled performance enhancements in MSQs and MMQs
In the comparative analysis of CLQA performance (Table 5), the integration of CLKR significantly enhances the performance of the 7 GPLLMs in answering both types of multiple-choice questions (i.e. MSQs and MMQs) (Figure 7). Specifically, the accuracy of GPLLMs on MSQs improves by 14.9%, increasing from 0.569 to 0.654 (Table 5). Text-davinci-003 demonstrates the largest enhancement in MSQs among all GPLLMs, achieving an improvement of 40.4% (Figure 7b). Meanwhile, GPT-3.5 Turbo exhibits the highest improvement in MMQs (Figure 7c), with an increase of 86.2%. Although the accuracy of GPLLMs on MSQs is higher than on MMQs (Figure 7), the relative improvement for MMQs is greater at 38.3%, with accuracy increasing from 0.273 to 0.378 (Table 5). The performance difference between MSQs and MMQs is discussed in Section 5.3. Among the GPLLMs evaluated (Table 5), the CLKR-empowered ERNIE-Bot 4.0 stands out as the top performer, showing superior capability in handling both MSQs and MMQs, with its accuracy on MSQs even exceeding 0.9 (Figure 7g). However, it records the lowest improvement among the seven GPLLMs in MSQs, achieving an increase of only 6.6% (Figure 7g).
Figure 7. Performance comparison of original and CLKR-empowered GPLLMs in MSQs and MMQs. Panels (a)-(g) cover Llama-2-70b, text-davinci-003, GPT-3.5 Turbo, GPT-4, ChatGLM2-6B, ERNIE-Bot-turbo, and ERNIE-Bot 4.0, respectively. Each panel plots accuracy (vertical axis, 0.0 to 1.0) on each PCEQE test paper for MSQs and MMQs, without CLKR (red) and with CLKR (blue), using boxes for the 25%-75% range, whiskers for the min-max range, a horizontal line for the median, and a diamond for the mean. A dashed horizontal line marks the passing line (accuracy = 0.6). The annotated mean accuracies without and with CLKR, and the relative gains, are as follows. Llama-2-70b: MSQs 0.413 to 0.475 (15.0%), MMQs 0.120 to 0.214 (78.7%). Text-davinci-003: MSQs 0.404 to 0.567 (40.4%), MMQs 0.220 to 0.353 (59.2%). GPT-3.5 Turbo: MSQs 0.452 to 0.547 (21.1%), MMQs 0.205 to 0.381 (86.2%). GPT-4: MSQs 0.645 to 0.743 (15.2%), MMQs 0.374 to 0.557 (49.0%). ChatGLM2-6B: MSQs 0.538 to 0.604 (12.4%), MMQs 0.293 to 0.311 (6.2%). ERNIE-Bot-turbo: MSQs 0.680 to 0.731 (7.5%), MMQs 0.071 to 0.103 (44.6%). ERNIE-Bot 4.0: MSQs 0.853 to 0.909 (6.6%), MMQs 0.626 to 0.724 (15.5%). Note: All the numerical data values are approximated. Source(s): Authors’ own work
Table 5. Wilcoxon T Tests on CLQA accuracy of GPLLMs with and without CLKR in MSQs and MMQs
| Question type | CLKR | Average accuracy | Accuracy enhancement | z-statistic | p-value |
|---|---|---|---|---|---|
| MSQs | without | 0.569 | 14.9% | 9.451 | 0.000*** |
| | with | 0.654 | | | |
| MMQs | without | 0.273 | 38.3% | 9.360 | 0.000*** |
| | with | 0.378 | | | |
Source(s): Authors’ own work
4.1.3 CLKR-enabled performance enhancements across 8 CL knowledge areas
The results reveal that the CLKR significantly lifts the CLQA accuracy of GPLLMs across eight CL knowledge areas (C1-C8). Figure 8 shows the extent to which CLKR-empowered GPLLMs improve their question-answering capabilities in each area, although the degree of improvement varies (Figure 8a-8g). For example, GPT-4’s accuracy enhancement is the smallest in C1: basic legal knowledge for construction (14.7%) and the largest in C2: construction permits (41.7%) (Figure 8d). Each GPLLM attains its maximum and minimum accuracy improvements in different CL knowledge areas (Figure 8a-8g). Averaged over the GPLLMs, the CL knowledge area-specific accuracy improvements range from 14.5% to 28.2% (Table 6). These notable increases in CLQA accuracy (Table 6 and Figure 8) not only validate the CLKR’s efficacy but also underscore its comprehensive coverage of CL knowledge areas.
The figure is spread across seven labeled panels (a)–(g), each showing how a G P L L M performs in eight construction law knowledge areas after integrating the construction law knowledge repository (C L K R). All panels share the vertical axis “Accuracy”, ranging from 0.0 to 1.0 with an interval of 0.2, and a common legend: boxes indicate the 25 percent–75 percent range for accuracies across P C E Q E test papers, thin vertical lines show “Min–Max”, diamonds mark “Average accuracy”, red dots show “Accuracy of G P L L Ms on each P C E Q E test paper” without C L K R, and blue dots show “Accuracy of C L K R‑empowered G P L L M s on each P C E Q E test paper”. A green dashed line at 0.6 represents the “Passing Line (Accuracy equals 0.6)”. Panel (a), “Llama‑2‑70 b”, contains eight grouped boxplots labeled C 1 through C 8 along the horizontal axis. For each C‑area, red and blue point clouds cluster around the boxes, with arrows annotating relative improvements. It shows relatively low accuracies in all eight areas, with averages below the passing line. In C 1, mean accuracy rises from “0.287” without C L K R to “0.370” with C L K R, a “29.1 percent” gain. C 2 increases from “0.355” to “0.360” (“1.4 percent”), C 3 from “0.221” to “0.325” (“47.1 percent”), C 4 from “0.302” to “0.360” (“19.0 percent”), C 5 from “0.303” to “0.386” (“27.3 percent”), C 6 from “0.333” to “0.411” (“23.3 percent”), C 7 from “0.324” to “0.379” (“17.1 percent”), and C 8 from “0.314” to “0.405” (“29.1 percent”). Panel (b), “text‑davinci‑003”, shows higher starting accuracies and larger relative gains. In C 1, mean accuracy improves from “0.340” to “0.435” (“28.1 percent”), in C 2 from “0.320” to “0.582” (“82.3 percent”), in C 3 from “0.332” to “0.487” (“46.5 percent”), and in C 4 from “0.299” to “0.426” (“42.4 percent”). 
For C 5, the average rises from “0.348” to “0.548” (“49.8 percent”), for C 6 from “0.348” to “0.533” (“53.0 percent”), for C 7 from “0.366” to “0.531” (“45.2 percent”), and for C 8 from “0.332” to “0.463” (“39.2 percent”). Most C‑area averages with C L K R approach or exceed 0.5, though still near or below the 0.6 threshold. Panel (c), “G P T‑3.5 Turbo”, presents moderate baseline performance that benefits noticeably from C L K R. In C 1, the mean increases from “0.416” to “0.484” (“16.3 percent”), in C 2 from “0.317” to “0.489” (“54.2 percent”), in C 3 from “0.362” to “0.503” (“38.9 percent”), and in C 4 from “0.285” to “0.445” (“56.1 percent”). C 5 improves from “0.400” to “0.455” (“13.6 percent”), C 6 from “0.333” to “0.526” (“57.8 percent”), C 7 from “0.389” to “0.539” (“38.5 percent”), and C 8 from “0.345” to “0.414” (“19.9 percent”). While some enhanced averages approach the passing line, most remain slightly below 0.6. Panel (d), “G P T‑4”, shows the strongest overall performance. Baseline averages are already near or above 0.6 in many C‑areas and consistently rise with C L K R. In C 1, accuracy grows from “0.573” to “0.657” (“14.7 percent”), in C 2 from “0.523” to “0.742” (“41.7 percent”), in C 3 from “0.549” to “0.689” (“25.5 percent”), and in C 4 from “0.481” to “0.615” (“27.9 percent”). For C 5, the mean increases from “0.539” to “0.719” (“33.2 percent”), for C 6 from “0.546” to “0.685” (“25.6 percent”), for C 7 from “0.488” to “0.678” (“39.1 percent”), and for C 8 from “0.590” to “0.681” (“15.5 percent”), with all C L K R‑enhanced averages clearly above the passing line. 
Panel (e), “Chat G L M 2‑6 B”, shows moderate accuracies: in C 1 the mean rises from “0.436” to “0.455” (“4.2 percent” improvement), in C 2 from “0.349” to “0.436” (“25.0 percent”), in C 3 from “0.511” to “0.517” (“1.3 percent”), in C 4 from “0.368” to “0.448” (“17.0 percent”), in C 5 from “0.430” to “0.538” (“20.1 percent”), in C 6 from “0.490” to “0.527” (“7.5 percent”), in C 7 from “0.477” to “0.536” (“12.5 percent”), and in C 8 from “0.430” to “0.513” (“19.3 percent”). Most averages stay below 0.6 but trend upward with C L K R. Panel (f), “E R N I E‑Bot‑turbo”, exhibits higher but still mixed performance: average accuracy in C 1 increases from “0.440” to “0.485” (“10.3 percent”), in C 2 from “0.553” to “0.580” (“4.9 percent”), in C 3 from “0.408” to “0.440” (“7.9 percent”), in C 4 from “0.385” to “0.406” (“5.4 percent”), in C 5 from “0.438” to “0.522” (“19.3 percent”), in C 6 from “0.435” to “0.525” (“20.7 percent”), in C 7 from “0.446” to “0.463” (“1.4 percent”), and in C 8 from “0.452” to “0.511” (“10.5 percent”). Only a few C‑areas approach the 0.6 passing line. Panel (g), “E R N I E‑Bot 4.0”, shows consistently high accuracies above the passing line for all C‑areas. For C 1 the mean improves from “0.770” to “0.849” (“10.3 percent”), for C 2 from “0.729” to “0.843” (“15.7 percent”), for C 3 from “0.761” to “0.862” (“13.3 percent”), for C 4 from “0.737” to “0.778” (“5.6 percent”), for C 5 from “0.715” to “0.828” (“15.8 percent”), for C 6 from “0.771” to “0.819” (“6.3 percent”), for C7 from “0.747” to “0.820” (“9.7 percent”), and for C 8 from “0.819” to “0.896” (“9.4 percent”). The blue point clouds cluster near the top of the chart, indicating robust, high‑accuracy performance in every construction law area once C L K R is integrated. Note: All the numerical data values are approximated.
Performance comparison of original and CLKR-empowered GPLLMs across C1-C8. Source(s): Authors’ own work
Wilcoxon T Tests on CLQA accuracy of GPLLMs with and without CLKR across C1-C8
| Knowledge domain | CLKR | Average accuracy | Accuracy enhancement | z-statistic | p-value |
|---|---|---|---|---|---|
| C1 | without | 0.466 | 14.5% | 6.672 | 0.000*** |
| | with | 0.534 | | | |
| C2 | without | 0.449 | 28.2% | 5.896 | 0.000*** |
| | with | 0.576 | | | |
| C3 | without | 0.449 | 21.6% | 6.825 | 0.000*** |
| | with | 0.546 | | | |
| C4 | without | 0.408 | 21.1% | 7.086 | 0.000*** |
| | with | 0.494 | | | |
| C5 | without | 0.458 | 24.5% | 5.966 | 0.000*** |
| | with | 0.571 | | | |
| C6 | without | 0.465 | 23.6% | 7.427 | 0.000*** |
| | with | 0.575 | | | |
| C7 | without | 0.462 | 21.6% | 7.334 | 0.000*** |
| | with | 0.562 | | | |
| C8 | without | 0.470 | 17.9% | 5.970 | 0.000*** |
| | with | 0.555 | | | |
Source(s): Authors’ own work
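The z-statistics reported in the table come from paired comparisons of accuracy with and without CLKR. As a minimal sketch of how such a statistic can be obtained, the block below computes the Wilcoxon signed-rank z-value with the large-sample normal approximation. The function name, the input accuracies, and the handling of zero differences and ties are assumptions made for illustration; the study does not state which software or correction options were used, so the output is not expected to reproduce the table.

```python
import math


def wilcoxon_z(without, with_clkr):
    """Normal-approximation z-statistic of the Wilcoxon signed-rank test."""
    # Paired differences; zero differences are dropped (a common convention).
    diffs = [round(b - a, 6) for a, b in zip(without, with_clkr)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Sum of ranks of the positive differences, standardized under H0.
    # No tie correction is applied to the variance in this sketch.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


# Hypothetical per-paper accuracies, NOT taken from the study.
without = [0.40, 0.45, 0.50, 0.42, 0.48, 0.44]
with_clkr = [0.47, 0.50, 0.53, 0.51, 0.46, 0.54]
z = wilcoxon_z(without, with_clkr)
print(f"z = {z:.3f}")
```

A positive z indicates that accuracies with CLKR tend to exceed those without it; with only six hypothetical pairs the approximation is rough, whereas the study compares many more paired observations per knowledge area.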
Across the eight CL knowledge areas, ERNIE-Bot 4.0 achieves the highest CLQA accuracy among all individual GPLLMs (Figure 8a–8g), with improvements of 5.6%–15.8% in each area after the integration of CLKR (Figure 8g). GPT-4 shows accuracy below 0.6 in all CL knowledge areas before CLKR incorporation, whereas the accuracy enhancement brought by CLKR enables it to exceed the PCEQE passing threshold of 0.6 in all eight areas (Figure 8d). The other five GPLLMs do not reach the 0.6 passing threshold in any area; however, the integration of CLKR brings marked accuracy improvements across all CL knowledge areas (Figure 8a–8c, 8e, and 8f).
4.1.4 CLKR-enabled performance enhancements on open-ended questions
CLKR improves GPLLMs’ performance not only on closed-ended questions but also on open-ended questions, with an average accuracy improvement of 22.0% across six GPLLMs (Table S4). Figure 9 presents the accuracies of the six pairs of GPLLMs on 100 open-ended questions, which are converted from 100 multiple-choice questions in the test dataset by removing the choices (Table S4). Three experienced experts evaluate the responses of the GPLLMs to the open-ended questions, scoring each response either 0 or 1; the final score of each response is the average of the three experts’ ratings (Table S4). Although no GPLLM achieves an accuracy exceeding 0.6 (the passing threshold of PCEQEs) even with the integration of CLKR, the performance enhancements enabled by CLKR remain considerable (Figure 9).
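The scoring scheme for open-ended questions can be sketched as follows: each response receives three binary ratings, the per-question score is their average, and a model's accuracy is the mean score over all questions. The function names and the ratings below are hypothetical, and the relative-gain formula is an assumption inferred from the way improvements are reported elsewhere in this paper.

```python
def question_score(ratings):
    """Average the 0/1 ratings that the experts give to one response."""
    if any(r not in (0, 1) for r in ratings):
        raise ValueError("each rating must be 0 or 1")
    return sum(ratings) / len(ratings)


def accuracy(all_ratings):
    """Mean per-question score over the whole open-ended question set."""
    scores = [question_score(r) for r in all_ratings]
    return sum(scores) / len(scores)


def relative_gain(without, with_clkr):
    """Relative improvement, e.g. 0.25 -> 0.30 gives 0.2 (20%)."""
    return (with_clkr - without) / without


# Hypothetical ratings for four questions, three experts each.
baseline = [(1, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 1)]
augmented = [(1, 1, 1), (0, 1, 0), (1, 1, 0), (1, 1, 1)]

acc_without = accuracy(baseline)
acc_with = accuracy(augmented)
print(f"{acc_without:.3f} -> {acc_with:.3f} "
      f"({relative_gain(acc_without, acc_with):.1%} gain)")
```

With 100 questions and three experts, every accuracy under this scheme is a multiple of 1/300, which is consistent with values such as 0.407 and 0.277 shown in Figure 9.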
The chart plots “Accuracy” on the vertical axis, ranging from 0.0 to 1.0 with an interval of 0.2, and lists six models along the horizontal axis: “Llama-2-70B”, “GPT-3.5 Turbo”, “GPT-4”, “ChatGLM2-6B”, “ERNIE-Bot-turbo”, and “ERNIE-Bot 4.0”. The legend at the top explains two bars for “Accuracy of GPLLMs on 100 open-ended questions” and “Accuracy of CLKR-empowered GPLLMs on 100 open-ended questions”. For each model, a bar for accuracy of GPLLMs appears on the left with a numeric accuracy printed inside, and a bar for accuracy of CLKR-empowered GPLLMs on the right with a higher value; arrows and percentage labels above indicate the relative improvement. For Llama-2-70B, the original accuracy is “0.407” and the CLKR-empowered accuracy is “0.473”, corresponding to a “16.4 percent” gain. For GPT-3.5 Turbo, accuracy increases from “0.213” to “0.317”, an improvement of “48.4 percent”. For GPT-4, the bars rise from “0.277” to “0.407” with a “47.0 percent” gain. For ChatGLM2-6B, accuracy improves from “0.210” to “0.277”, labeled “31.8 percent”. For ERNIE-Bot-turbo, accuracy rises from “0.467” to “0.510”, giving a “9.3 percent” increase. For ERNIE-Bot 4.0, the values go from “0.483” to “0.527”, a “9.0 percent” gain. A caption beneath the axis notes these as “6 pairs of GPLLMs with and without CLKR”.
Performance comparison of GPLLMs with and without CLKR in 100 open-ended questions. Note: The API of the text-davinci-003 model was no longer accessible when the authors conducted the CLQA on the open-ended question set in Nov. 2024. Source(s): Authors’ own work
4.2 The impact evaluation of each CL document on performance enhancement
A long-tail effect is observed among the CL documents in the CLKR, which conforms to a power law distribution (Figure 10). The top 10 of the 387 documents (2.6%) provide 37.2% and 37.3% of the contextual knowledge for CLQA under unranked frequency (Figure 10a) and ranked frequency (Figure 10b), respectively. The three most frequently retrieved documents are CLD-260: Regulations on the Management of Construction Project Quality, CLD-380: Civil Code of the People’s Republic of China, and CLD-347: Regulations on the Causes of Action for Civil Cases (Figure 10a). Meanwhile, 210 CL documents are retrieved fewer than five times, and 101 CL documents make no contribution to the CLQA at all (Table S4). Nonetheless, despite the minimal individual contributions of most documents, their collective significance as the long tail (Figure 10) remains crucial for enhancing GPLLMs’ CLQA performance.
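The two frequency measures used in Figure 10 can be illustrated with a short sketch. It assumes that the unranked frequency counts every retrieval of a document once, and that the ranked frequency weights each retrieval by the reciprocal of the chunk's position in the retrieved list; this reading of the formulas, the function name, and the toy retrieval log are assumptions. The second part checks the reported top-10 shares using the bar values read from Figure 10, under the further assumption that each of the 2,140 questions retrieves exactly three chunks.

```python
from collections import defaultdict


def document_frequencies(retrievals):
    """retrievals: one list of document IDs per question, best match first."""
    unranked = defaultdict(float)
    ranked = defaultdict(float)
    for doc_ids in retrievals:
        for rank, doc_id in enumerate(doc_ids, start=1):
            unranked[doc_id] += 1        # each retrieval counts once
            ranked[doc_id] += 1 / rank   # each retrieval weighted by 1/rank
    return dict(unranked), dict(ranked)


# Toy retrieval log for two questions (hypothetical).
log = [["CLD-260", "CLD-380", "CLD-347"],
       ["CLD-380", "CLD-260", "CLD-260"]]
unranked, ranked = document_frequencies(log)

# Top-10 values as read from Figure 10 (approximate figure labels).
top10_unranked = [704, 469, 252, 228, 176, 133, 112, 111, 105, 100]
top10_ranked = [428, 298, 145, 138, 100, 82.8, 73.7, 68.7, 67, 62.8]

questions, chunks_per_question = 2140, 3  # assumption: always three chunks
total_unranked = questions * chunks_per_question
total_ranked = questions * sum(1 / r for r in range(1, chunks_per_question + 1))

share_unranked = sum(top10_unranked) / total_unranked
share_ranked = sum(top10_ranked) / total_ranked
print(f"{share_unranked:.1%} unranked, {share_ranked:.1%} ranked")
```

Under these assumptions the computed shares agree with the 37.2% and 37.3% stated in the text, which supports the reciprocal-rank reading of the ranked frequency.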
The figure contains two stacked plots that analyze how often individual construction law documents contribute knowledge chunks, each fitted with a power-law curve and annotated with statistics. Panel (a), titled “Power law distribution of unranked frequency of knowledge chunk-sourced documents”, plots “Unranked frequency of documents” on the vertical axis from 0 to “750” and “Knowledge chunk-sourced documents” along the horizontal axis. Orange vertical bars represent the unranked frequency with a line labeled “Power curve fitting” overlaying them. The curve is steep at the left, where a few documents have very high frequencies, then quickly decays toward near zero as the document index increases, highlighting a long-tailed distribution. An inset bar chart zooms in on “The top 10 frequent documents (unranked)”, showing counts labeled above each bar: “704”, “469”, “252”, “228”, “176”, “133”, “112”, “111”, “105”, and “100”, and document IDs along the bottom such as “CLD-260, CLD-380, CLD-347, CLD-007, CLD-108, CLD-003, CLD-015, CLD-339, CLD-149, CLD-258”. A table to the right titled “Power function” reports the fitted model $y = a \cdot x^{b}$ with parameters $a = 727.39 \pm 5.75$, $b = -0.85 \pm 0.01$, and $R^{2} = 0.98$. The equation $\text{Unranked frequency} = \sum_{i=1}^{n} C_i$ appears below the table. Panel (b), titled “Power law distribution of ranked frequency of knowledge chunk-sourced documents”, uses the same horizontal axis, but the vertical axis is “Ranked frequency of documents”, ranging from 0 to 450. Bars show ranked frequencies with a power-law line labeled “Power curve fitting”. Again, the distribution is highly skewed, with a few documents dominating.
The inset bar chart notes, “The top 10 frequent documents (ranked) provide 37.3 percent of the question-related knowledge”, and displays frequencies above each bar: “428”, “298”, “145”, “138”, “100”, “82.8”, “73.7”, “68.7”, “67”, and “62.8”, for the same leading document IDs. The adjacent table lists the ranked-frequency power-law parameters: $y = a \cdot x^{b}$, with $a = 444.75 \pm 3.96$, $b = -0.85 \pm 0.01$, and $R^{2} = 0.98$. A formula at the bottom reads $\text{Ranked frequency} = \sum_{i=1}^{n} C_i \times \frac{1}{R_i}$.
The power law distribution of CL knowledge-sourced documents for CLQA. Source(s): Authors’ own work
5. Discussion
5.1 Mis-answered question types and potential reasons for imperfect CLQA performance
While incorporating CLKR into GPLLMs has effectively improved their CLQA performance, the accuracy remains suboptimal (Figures 6–9). An analysis of the incorrect answers shows that errors are more likely to occur in multiple-answer, scenario-based, and long-text questions. Beyond the inherent complexity of MMQs compared to MSQs, the main reasons for imperfect performance are (1) insufficient scenario analysis ability and (2) poor understanding of long-text contexts.
The insufficient scenario analysis ability refers to the notably lower accuracy in answering scenario-based questions (SBQs) than recall-based questions (RBQs) (Table S4). SBQs typically present a hypothetical scenario without explicitly specifying the knowledge points required for the answer (Saka et al., 2024; Badyal et al., 2023), whereas RBQs only require recalling specific facts (Badyal et al., 2023; Hwang and Mattila, 2022). The integration of CLKR improves GPLLMs’ accuracy by 23.0% on RBQs but only 4.6% on SBQs, and the minimal enhancement on SBQs limits overall CLQA performance (Table S4). Future efforts to create specific libraries of scenarios within the CL domain may offer a path toward substantially enhancing GPLLMs’ scenario analysis capabilities and improving CLQA performance (Saka et al., 2024; Mansurova et al., 2024).
The poor understanding of long-text contexts is reflected in Pearson correlation tests, which reveal a strong negative correlation between the average question length and the accuracy of GPLLMs, both without and with CLKR (Table S4). Question lengths for MSQs and MMQs are grouped into intervals of 50 (e.g. 0–50 and 50–100), and the average question length and the accuracy of the seven GPLLMs are calculated for each interval (Table S4). The Pearson correlation coefficients between question length and the accuracies of GPLLMs, both with and without CLKR, are below −0.8 (Table S4), indicating that longer questions are associated with lower GPLLM accuracy. Future improvements could focus on streamlining questions to eliminate interference from irrelevant information (Saad et al., 2023; Rizzo et al., 2024), thereby achieving higher performance in CLQA tasks.
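The length analysis described above can be sketched in a few lines: questions are grouped into length intervals of 50, the mean length and accuracy are computed per interval, and the Pearson coefficient is taken over the interval-level pairs. The function names and the records are hypothetical, and assigning a boundary value such as 50 to the upper interval is an assumption, since the text does not say how boundaries are treated.

```python
import math
from collections import defaultdict


def bin_by_length(records, width=50):
    """records: (question_length, is_correct) pairs.

    Returns one (mean length, accuracy) row per length interval.
    """
    bins = defaultdict(list)
    for length, correct in records:
        bins[length // width].append((length, correct))
    rows = []
    for key in sorted(bins):
        items = bins[key]
        mean_length = sum(length for length, _ in items) / len(items)
        acc = sum(correct for _, correct in items) / len(items)
        rows.append((mean_length, acc))
    return rows


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical answers: (question length, 1 if answered correctly else 0).
records = [(20, 1), (40, 1), (60, 1), (80, 0),
           (120, 0), (140, 1), (160, 0), (190, 0)]
rows = bin_by_length(records)
r = pearson_r([m for m, _ in rows], [a for _, a in rows])
print(rows, round(r, 3))
```

Because the correlation is computed on interval averages rather than on individual questions, it describes the trend across length groups and tends to be stronger than a question-level correlation would be.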
Beyond the particular types of legal questions that the GPLLMs struggle with, hallucination is a common reason for imperfect CLQA performance (Su et al., 2024; Mansurova et al., 2024). This study employs two methods (i.e. RAG and a specifically designed prompt) to mitigate hallucinations. First, the RAG provides three CL knowledge chunks to the GPLLMs for each question, and the GPLLMs refer to these chunks when answering. This method reduces hallucinations by supplying relevant information from external knowledge sources (Alan et al., 2024; Hou and Zhang, 2024; Mansurova et al., 2024). Second, a prompt is designed to inform the GPLLMs that (1) answers must be based on the provided knowledge and (2) responses should be limited to the selected option without any additional explanation. The 29,960 answers from the seven pairs of GPLLMs are presented in Table S4, where no significant issues of redundancy, irrelevance, or factual inaccuracies are observed.
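The two mitigation measures can be pictured with a minimal sketch: retrieve the three chunks most similar to the question, then wrap them in a prompt that restricts the model to the supplied knowledge and to the chosen option. Everything here is illustrative: the toy two-dimensional vectors stand in for real embeddings, cosine similarity is an assumed ranking measure, and the prompt wording is hypothetical rather than the prompt used in the study.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_chunks(question_vec, chunks, k=3):
    """chunks: (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(chunks, key=lambda c: cosine(question_vec, c[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, options, knowledge):
    """Hypothetical prompt with the two constraints described in the text."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(knowledge, 1))
    return (
        "Answer using only the knowledge provided below.\n"
        "Reply with the selected option(s) only, without explanation.\n\n"
        f"Knowledge:\n{context}\n\n"
        f"Question: {question}\nOptions: {options}\nAnswer:"
    )


# Toy chunks with made-up vectors (a real system would use an embedding model).
chunks = [("chunk a", [1, 0]), ("chunk b", [0, 1]),
          ("chunk c", [1, 1]), ("chunk d", [-1, 0])]
selected = top_k_chunks([1, 0.2], chunks)
prompt = build_prompt("Who accepts the inspection lot?",
                      "(A) ... (B) ... (C) ... (D) ...", selected)
print(prompt)
```

Keeping retrieval and prompt construction separate makes it possible to change the number of chunks or the instruction text independently when testing their effect on hallucination.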
5.2 Potential methods for addressing long-tail effect among CL documents
To address the long-tail effect for more efficient knowledge chunk retrieval, scholars have primarily proposed two potential methods: (1) building cross-scale CLKRs and (2) removing redundant knowledge chunks. First, the documents within CLKR can be categorized into a frequently used CLKR (FU-CLKR) and a rarely used CLKR (RU-CLKR) based on a certain threshold of retrieval times. During CLQA processing, the system prioritizes retrieving relevant documents from FU-CLKR. If the similarity between the question vector and the knowledge vectors in FU-CLKR falls below a certain threshold, the RAG then extends its search to knowledge chunks in the RU-CLKR (Sasazawa et al., 2023). The cross-scale CLKRs hold the potential to reduce the computational complexity of initial retrieval by focusing on a smaller, more relevant subset of CLKR (Zhu et al., 2024; Sasazawa et al., 2023). Second, although documents with similar named entities are filtered and aligned by experts during the establishment of CLKR (Figure 2b), different documents may contain overlapping knowledge, resulting in redundancy within CLKR and complicating the retrieval process (Yu et al., 2024; Gao et al., 2023). The duplicated or similar contexts in different documents could be eliminated or merged (Gao et al., 2023; Yu et al., 2024; Zhou et al., 2023), ensuring a more streamlined and efficient knowledge repository.
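The cross-scale idea can be sketched as a two-tier lookup. Documents are first split by how often they have been retrieved; a query then searches the frequently used tier and falls back to the rarely used tier only when the best match is weak. The function names, the retrieval-count cut-off of five, the similarity threshold of 0.5, and the dot-product similarity are all assumptions chosen for illustration, not values proposed in the cited works.

```python
def split_by_usage(retrieval_counts, min_count=5):
    """Split document IDs into frequently used and rarely used sets."""
    frequent = {d for d, c in retrieval_counts.items() if c >= min_count}
    rare = set(retrieval_counts) - frequent
    return frequent, rare


def dot(u, v):
    """Dot-product similarity (assumes vectors are already normalized)."""
    return sum(a * b for a, b in zip(u, v))


def two_tier_retrieve(question_vec, fu_chunks, ru_chunks, k=3, threshold=0.5):
    """Search FU-CLKR first; extend to RU-CLKR when the best match is weak.

    fu_chunks / ru_chunks: lists of (text, vector) pairs.
    Returns up to k (similarity, text) pairs, best first.
    """
    def rank(chunks):
        scored = [(dot(question_vec, vec), text) for text, vec in chunks]
        return sorted(scored, key=lambda s: s[0], reverse=True)

    hits = rank(fu_chunks)
    if not hits or hits[0][0] < threshold:
        hits = sorted(hits + rank(ru_chunks), key=lambda s: s[0], reverse=True)
    return hits[:k]


# Hypothetical usage counts and toy vectors.
frequent, rare = split_by_usage({"CLD-260": 704, "CLD-380": 469, "CLD-999": 2})
fu = [("fu chunk", [1, 0])]
ru = [("ru chunk", [0, 1])]
print(two_tier_retrieve([1, 0], fu, ru))  # answered from FU-CLKR alone
print(two_tier_retrieve([0, 1], fu, ru))  # falls back to RU-CLKR
```

The design choice to trigger the fallback on the single best similarity is one of several options; triggering on the k-th best score would extend the search more often and trade speed for recall.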
5.3 Practical implications of this study
The practical implications of this study include (1) offering an additional channel for CLQA queries, (2) highlighting the importance of adding evolving domain-specific knowledge in GPLLM applications, and (3) being transferable to the development of region-specific CLKRs beyond Mainland China.
This study provides engineers with an additional channel for CLQA (Figure 11), complementing traditional approaches such as consulting books (Rasool et al., 2024; Choi et al., 2023), conducting online searches (Oeding et al., 2024), and seeking expert advice (Hou and Zhang, 2024; Alan et al., 2024). Regardless of whether the questions are open-ended or closed-ended, this approach is more efficient than consulting books and online searches while being more cost-effective than seeking expert advice (Chou et al., 2024; Lee et al., 2023). Although current GPLLM-based CLQA systems cannot perfectly answer all questions, they serve as a meaningful supplement to traditional approaches.
The figure contains two annotated screenshots labeled “(a) A CLQA example in the deployed prototype” and “(b) Updating CL documents in the CLKR”. In panel (a), at the very top, a green-bordered text box shows an English multiple-choice item: “According to the ‘Unified Standard for Construction Quality Acceptance of Building Engineering’, who is responsible for accepting the inspection lot for the energy-saving work of main structures, as well as for accepting the concealed work?” followed by options “(A) Supervision Engineer (B) Project Manager (C) Quality Engineer (D) Chief Supervision Engineer” and a label “Translation of the question”. A green annotation on the left labels this region “The Question”. Beneath, on the right, a smaller box indicates the original Chinese exam item with the tag “Q0160 (sourced from 2022* first-level PCEQE)”. Below it is a dialogue bubble containing the Chinese version of the question and, highlighted in red, the system’s final output, “Answer: D. Chief Supervision Engineer”, called out by a red label on the left, “Answer generated by GPLLMs”. On the left side of the screenshot is the Smart CLQA sidebar headed “Dialogue” with version text “Current version v0.2.10”. Within the panel are controls: “Manage knowledge repository”, “Current session: default”, “Please select a dialogue mode: Knowledge repository chat”, “Please select a LLM model: chatglm2-6b (Running)”, “Please select a prompt template: default”, a “Temperature: 0.80” slider, and “Historical dialogue rounds: 3”. Color-coded boxes label “Enable the CLKR”, “Select the GPLLMs for CLQA”, and at the bottom “3 knowledge chunks related to the question”, with a drop-down “Knowledge repository configuration” set to “Construction law knowledge”. On the right, three blue-outlined rectangles stacked vertically present the retrieved supporting texts.
Each shows a document title link and a mix of Chinese characters with an English summary caption. “Knowledge Chunk 1” is described as “A 250-token Knowledge Chunk from ‘CLD-225: Standards for Quality Acceptance of Energy-Efficient Building Construction’”. “Knowledge Chunk 2” reads “A 250-token Knowledge Chunk from ‘CLD-151: Unified Standard for Construction Quality Acceptance of Building Engineering’”. “Knowledge Chunk 3” reads “Another 250-token CL knowledge chunk from ‘CLD-151: Unified Standard for Construction Quality Acceptance of Building Engineering’”. Blue arrows from the sidebar emphasize that these three chunks are retrieved as evidence for the question. Panel (b) shows the same “Smart CLQA” interface focused on repository management. A central grey button labeled “Manage knowledge repository” is highlighted with a turquoise callout “Manage the knowledge repository”. On the right, a form headed “Please select or create a knowledge repository” with a drop-down “Construction law knowledge repository (CLKR) [alias c l g b e‑v m‑1.5]” allows users to “Upload knowledge files” via a magenta-outlined area reading “Drag and drop a document here” and “Browse”, annotated “Add up-to-date documents to the CLKR”. Below, a table lists existing documents, including an entry “CLD-108: Standards for Identifying Bad Behaviors of All Parties in the National Construction Market”, which is highlighted in yellow with a note “Delete the CLD-108: Standards for Identifying Bad Behaviors of All Parties in the National Construction Market”. Buttons at the bottom read “Download selected”, “Re-add to vectorstore”, and a green button “Delete from knowledge repository”, called out as “Delete out-of-date documents from the CLKR”.
The CLQA prototype and the CLKR update. Note: Codes and specifications for deploying the prototype are available in supplemental materials. Source(s): Authors’ own work
In practical GPLLM applications including CLQA, this study highlights the importance of incorporating domain-specific knowledge. The addition of domain-specific knowledge not only improves accuracy but also enhances the explainability of answers (Figure 11a) (Mansurova et al., 2024; Su et al., 2024). CL knowledge, in particular, is constantly evolving: new laws in the construction industry are continuously introduced, while existing laws may be modified or repealed (Li et al., 2021; Tian et al., 2023). The prototype developed in this study includes a feature for updating CL documents in the CLKR (Figure 11b). This feature facilitates the incorporation of up-to-date CL documents and the removal of outdated ones, ensuring that the knowledge base remains current and relevant.
Finally, the proposed methodology can serve as a reference for developing region-specific CLKRs beyond Mainland China. Because construction-related laws, regulations, and standards vary across countries and regions (Alhyari and Ani, 2022; Hansen, 2013), scholars should first collect region-specific corpora (e.g. local written judgments and textbooks) and then identify candidate legal documents automatically. While this study presents an example of a “1-8-164” three-layer knowledge hierarchy (Figure 2b), it is crucial to consult region-specific experts to construct the CL hierarchy used for filtering and categorizing CL documents. With these two adjustments, the proposed data-driven and expertise-based approach for establishing external knowledge bases can be applied to regions beyond Mainland China. All related code for collecting raw documents and building the CLKR is shared in a GitHub repository.
5.4 Limitations and further endeavors
This study shares several limitations with the existing literature, which call for further work: (1) the difficulty of using large-scale open-ended questions in the CLQA dataset (Table 2), (2) the need to mitigate the long-tail effect in the CLKR, and (3) the necessity of continually updating the CLKR and the selected GPLLMs. First, while this research includes a set of 100 open-ended questions (Table S4), performance testing on a larger open-ended question set is challenging because answer evaluation is subjective and requires considerable time and effort. Second, a long-tail effect is identified among the CL documents within the CLKR. Future improvements could focus on developing a cross-scale CLKR (Zhu et al., 2024; Sasazawa et al., 2023) and removing redundant knowledge chunks (Yu et al., 2024; Gao et al., 2023) to streamline the CLKR and maximize the contribution of individual documents. Finally, it is acknowledged that achieving perfect CLQA remains a formidable challenge for current GPLLM technology. Constant updates to the knowledge in the CLKR and the continuous development of more powerful GPLLMs are crucial for enhancing CLQA performance in the future, which requires the collaborative efforts of scholars worldwide.
6. Conclusion
This study proposes an approach for developing a CLKR to empower GPLLMs for CLQA. The development process involves (1) identifying 702 candidate documents from 374,992 written judgments (Figure 2a), (2) building a CLKR with 387 documents covering 8 main areas and 164 subareas (Figure 2b), (3) conducting a three-step integration between the CLKR and GPLLMs (Figure 3), and (4) constructing a 2,140-question validation dataset (Figure 4) to evaluate the efficacy of the CLKR. The findings indicate that the CLKR notably augments GPLLMs’ performance in CLQA (Figures 6–9 and Tables 4–6), increasing accuracy by an average of 21.1%. The CLQA performance enhancements across the CL knowledge areas range from 14.5% to 28.2% (Table 6).
This study makes three major contributions. Firstly, it devises a data-driven and expertise-based method to construct the external knowledge base. Previous studies rely heavily on expertise to manually select documents, limiting QA to a narrow scope and rarely considering reuse and updating (Table 1). The proposed approach not only reduces the reliance on experts during document selection but also broadens the knowledge scope of the external knowledge base, aligning it with the professional examinations (Figure 2). Secondly, this study empirically demonstrates the effectiveness of the CLKR in improving the performance of GPLLMs on CLQA (Figures 6–9 and Tables 4–6). In contrast to existing CLQA-related approaches (Table 1), CLKR-augmented GPLLM-based QA avoids the complex training of conventional learning-based QA models and achieves better CLQA performance. Finally, this work offers an openly available test dataset (Figure 4) as a benchmark to advance the CLQA field. Previous studies have primarily relied on existing datasets and seldom provide access to their test datasets (Tables 1 and 2). This study provides a CLQA dataset comprising 2,140 questions from authoritative PCEQEs (Figure 4) to help objectively assess the effectiveness and comprehensiveness of the CLKR.
This study is financially supported by the National Natural Science Foundation of China (No. 72201057) and the Social Science Foundation of Jiangsu Province (No. 23GLC020). All data, code, results, and videos are provided in the supplemental materials in the GitHub repository (https://github.com/0AnonymousSite0/Question_Answering_of_Construction_Laws).
References
The supplementary material for this article can be found online.
