The illustration is a multi‑row flowchart describing a four‑stage process for creating and evaluating a construction law knowledge repository (C L K R) used with general‑purpose large language models (G P L L M s).

Stage 1, on the top left, is titled “1. Collecting corpora and Recognizing candidate documents for C L K R”. Inside this stage are three sequential boxes connected by right‑pointing arrows. The first box is labeled “Gather corpora that contain construction laws” with a subtitle “Source: China Judgments Online”. The second box is labeled “Recognize document name entities in corpora”, with the line “Identifiers: Guillemets (left and right double angle brackets)”. The third box is a wider rectangle titled “Cleanse the document name entities” and internally split into three vertical panels labeled from left to right: “Merging identical entities”, “removing low‑frequency items”, and “removing non‑law documents”.

Stage 2, on the top right, is titled “2. Identifying C L documents & Building the C L K R”. It contains three more boxes connected by arrows. The first box reads “Filter and align the candidate document entities” with a lower caption “Majority voting by 10 experts” above a row of stylized human icons. The next box is labeled “Clarify the structures of C L knowledge areas” with the note “Referring to a textbook and reviewed by experts”. The final box is titled “Categorize C L documents and collect the document contents” with the line “Collecting from Chinese Laws and Regulations Database”.

Stage 3 in the middle row is titled “Incorporating C L K R into G P L L M s for C L Q A” and contains three main boxes connected by right‑pointing arrows. The first box is labeled “Split C L documents into knowledge chunks”, and inside it two lines of text read “Chunk size equals 250 and Overlap equals 50” and “Chunk vectorization by most suitable embedding model of each L L M”.
The second box is labeled “Retrieve question‑relevant knowledge chunks” and contains the line “Extracting 3 closest knowledge chunks (I) with minimum squared Euclidean distance (L subscript i squared):” followed by the formula “L subscript i squared equals open double vertical bar V subscript i superscript knowledge minus V superscript question close double vertical bar” and the selection rule “I equals arg min subscript 3 ({L subscript i squared} superscript N subscript i equals 1)”. The third box is titled “Input the combined question and retrieved knowledge into 7 selected G P L L M s” and includes a bulleted list labeled “Selecting G P L L M s using three criteria:” with three bullets: “Inclusion of both open‑source and closed‑source G P L L M s”, “Prioritization of G P L L M s with superior performance”, and “Supporting automated batch Q A”.

Stage 4 in the bottom row is titled “Validating the effectiveness of C L K R” and also contains three boxes joined by arrows. The first box reads “Devise a validation set for C L Q A” with a description: “Deciding question type and size by referring to existing literature in Table 2”. The second box is labeled “Compare performance differences between G P L L M s with and without C L K R” and states “Calculated by Accuracy and tested by Wilcoxon T Test”, followed by the formula “Accuracy equals (M superscript M S Q plus M superscript M M Q) over (1 times N superscript M S Q plus 2 times N superscript M M Q)”. The third box is titled “Evaluate individual C L document’s impact on performance enhancement” and explains “Evaluated by Unranked frequency and Ranked frequency” with two equations: “Unranked frequency equals sum from i equals 1 to n of C subscript i” and “Ranked frequency equals sum from i equals 1 to n of C subscript i times 1 over R subscript i”.
Arrows connect all boxes from left to right and top to bottom, visually tracing the workflow from data collection through knowledge integration and quantitative validation.

The phases of building a CLKR to lift the CLQA performance of GPLLMs.
Source(s): Authors’ own work
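
The chunking and retrieval steps of Stage 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the chunk size of 250 and overlap of 50 are assumed here to be measured in characters (the figure does not state the unit), and plain lists of numbers stand in for the embedding vectors that each model's embedding step would produce.

```python
from typing import List, Sequence


def split_into_chunks(text: str, chunk_size: int = 250, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size that share `overlap` units with the previous window.

    Assumption: sizes are counted in characters; the figure does not give the unit.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def squared_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def closest_chunks(
    question_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[int]:
    """Return the indices of the k chunk vectors nearest to the question vector."""
    distances = [squared_euclidean(vec, question_vec) for vec in chunk_vecs]
    return sorted(range(len(distances)), key=distances.__getitem__)[:k]
```

In this sketch, the indices returned by `closest_chunks` select the three chunks whose text would be combined with the question before it is passed to a model. Ties are resolved in favour of the earlier chunk because Python's sort is stable.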