Figure 1

The illustration is a multi-row flowchart describing a four-stage process for creating and evaluating a construction law knowledge repository (CLKR) used with general-purpose legal language models (GPLLMs).

Stage 1, on the top left, is titled “1. Collecting corpora and Recognizing candidate documents for CLKR”. Inside this stage are three sequential boxes connected by right-pointing arrows. The first box is labeled “Gather corpora that contain construction laws” with a subtitle “Source: China Judgments Online”. The second box is labeled “Recognize document name entities in corpora”, with the line “Identifiers: Guillemets (left and right double angle brackets)”. The third box is a wider rectangle titled “Cleanse the document name entities” and internally split into three vertical panels labeled from left to right: “Merging identical entities”, “removing low-frequency items”, and “removing non-law documents”.

Stage 2, on the top right, is titled “2. Identifying CL documents & Building the CLKR”. It contains three more boxes connected by arrows. The first box reads “Filter and align the candidate document entities” with a lower caption “Majority voting by 10 experts” above a row of stylized human icons. The next box is labeled “Clarify the structures of CL knowledge areas” with the note “Referring to a textbook and reviewed by experts”. The final box is titled “Categorize CL documents and collect the document contents” with the line “Collecting from Chinese Laws and Regulations Database”.

Stage 3, in the middle row, is titled “Incorporating CLKR into GPLLMs for CLQA” and contains three main boxes connected by right-pointing arrows. The first box is labeled “Split CL documents into knowledge chunks”, and inside it two lines of text read “Chunk size equals 250 and Overlap equals 50” and “Chunk vectorization by most suitable embedding model of each LLM”.
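The chunking step in the first Stage 3 box can be sketched as a sliding window. This is a minimal illustration, not the authors' implementation: the figure gives only the chunk size (250) and the overlap (50), so measuring both in characters is an assumption, and `split_into_chunks` is a hypothetical helper name.

```python
def split_into_chunks(text, chunk_size=250, overlap=50):
    """Split text into overlapping windows.

    Assumption: sizes are counted in characters; the figure does not
    say whether the unit is characters or tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 200 units after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With these defaults, consecutive chunks share their last and first 50 characters, so a provision that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.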
The second box is labeled “Retrieve question-relevant knowledge chunks” and contains the line “Extracting 3 closest knowledge chunks (I) with minimum squared Euclidean distance ($L_i^2$):” followed by the formula $L_i^2 = \lVert V_i^{\text{knowledge}} - V^{\text{question}} \rVert$ and the selection rule $I = \arg\min_3 \left( \{L_i^2\}_{i=1}^{N} \right)$. The third box is titled “Input the combined question and retrieved knowledge into 7 selected GPLLMs” and includes a bulleted list labeled “Selecting GPLLMs using three criteria:” with three bullets: “Inclusion of both open-source and closed-source GPLLMs”, “Prioritization of GPLLMs with superior performance”, and “Supporting automated batch QA”.

Stage 4, in the bottom row, is titled “Validating the effectiveness of CLKR” and also contains three boxes joined by arrows. The first box reads “Devise a validation set for CLQA” with a description: “Deciding question type and size by referring to existing literature in Table 2”. The second box is labeled “Compare performance differences between GPLLMs with and without CLKR” and states “Calculated by Accuracy and tested by Wilcoxon T Test”, followed by the formula $\text{Accuracy} = \dfrac{M^{MSQ} + M^{MMQ}}{1 \times N^{MSQ} + 2 \times N^{MMQ}}$. The third box is titled “Evaluate individual CL document’s impact on performance enhancement” and explains “Evaluated by Unranked frequency and Ranked frequency” with two equations: $\text{Unranked frequency} = \sum_{i=1}^{n} C_i$ and $\text{Ranked frequency} = \sum_{i=1}^{n} C_i \times \dfrac{1}{R_i}$.

Arrows connect all boxes from left to right and top to bottom, visually tracing the workflow from data collection through knowledge integration and quantitative validation.
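The retrieval rule in the second Stage 3 box (keep the 3 knowledge chunks whose vectors have the smallest squared Euclidean distance to the question vector) can be sketched in plain Python. This is an illustrative sketch under stated assumptions: vectors are plain lists of floats, and the function names `squared_euclidean` and `closest_chunks` are hypothetical. The embedding models that produce the vectors are not shown.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def closest_chunks(question_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors nearest to the question.

    Indices are ordered from nearest to farthest; ties keep the lower
    index first because Python's sort is stable.
    """
    distances = [squared_euclidean(vec, question_vec) for vec in chunk_vecs]
    ranked = sorted(range(len(distances)), key=distances.__getitem__)
    return ranked[:k]
```

Sorting all N distances is the simplest choice for a sketch; for a large repository a partial selection or a vector index would avoid the full sort.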

The phases of building a CLKR to lift the CLQA performance of GPLLMs. Source(s): Authors’ own work
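The Stage 4 formulas described for Figure 1 can be written out as a short sketch. The figure does not define its symbols, so the readings used here are assumptions: M is taken as the points scored and N as the number of questions for each question type (MSQ weighted 1, MMQ weighted 2), C_i as a per-retrieval count for a document, and R_i as the rank at which it was retrieved. All function names are hypothetical.

```python
def accuracy(m_msq, m_mmq, n_msq, n_mmq):
    """Accuracy = (M^MSQ + M^MMQ) / (1 * N^MSQ + 2 * N^MMQ).

    Assumption: m_* are points scored, n_* are question counts.
    """
    max_score = 1 * n_msq + 2 * n_mmq
    if max_score == 0:
        raise ValueError("the validation set must contain at least one question")
    return (m_msq + m_mmq) / max_score


def unranked_frequency(counts):
    """Unranked frequency = sum of C_i over all n retrievals."""
    return sum(counts)


def ranked_frequency(counts, ranks):
    """Ranked frequency = sum of C_i * (1 / R_i); ranks start at 1."""
    if len(counts) != len(ranks):
        raise ValueError("counts and ranks must have the same length")
    return sum(c / r for c, r in zip(counts, ranks))
```

Under these readings, the ranked variant gives less credit to a document retrieved at rank 2 or 3 than to one retrieved at rank 1, while the unranked variant counts every retrieval equally.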