Figure 7

The figure is composed of subplots (a)–(g), each showing how integrating the construction law knowledge repository (C L K R) affects accuracy on multiple‑choice single‑answer questions (M S Q s) and multiple‑choice multiple‑answer questions (M M Q s) for a given G P L L M. All panels share the vertical axis “Accuracy”, ranging from 0.0 to 1.0 with tick marks every 0.2, and a common legend: boxes span the 25 percent–75 percent range of accuracy (red boxes for “without C L K R” and blue boxes for “with C L K R”), whiskers show “Min–Max”, a horizontal line marks the median, diamonds mark the mean, and dots show “Accuracy on each P C E Q E test paper”. A dashed horizontal line marks the “Passing Line (Accuracy equals 0.6)” and extends across each plot from the 0.6 mark on the vertical axis. Panels (a)–(g) cover “Llama‑2‑70 b”, “text‑davinci‑003”, “G P T‑3.5 Turbo”, “G P T‑4”, “Chat G L M 2‑6 B”, “E R N I E‑Bot‑turbo”, and “E R N I E‑Bot 4.0”, respectively. In each panel, two pairs of boxplots appear along the horizontal axis, one pair under “M S Q s” and one under “M M Q s”, with the boxes in each pair labeled “without C L K R” and “with C L K R”.

For Llama‑2‑70 b, the mean M S Q accuracy increases from “0.413” to “0.475” (a “15.0 percent” gain), while the mean M M Q accuracy rises from “0.120” to “0.214” (a “78.7 percent” gain), though both remain below the 0.6 passing line. For text‑davinci‑003, M S Q mean accuracy improves from “0.404” to “0.567” (“40.4 percent”), and M M Q mean accuracy from “0.220” to “0.353” (“59.2 percent”). G P T‑3.5 Turbo shows M S Q mean accuracy increasing from “0.452” to “0.547” (“21.1 percent”) and M M Q mean accuracy from “0.205” to “0.381” (“86.2 percent”). G P T‑4 exhibits high M S Q scores, with the mean rising from “0.645” to “0.743” (“15.2 percent”) and remaining above the passing line; its M M Q mean accuracy grows from “0.374” to “0.557”, a “49.0 percent” gain that moves the distribution close to the 0.6 threshold.
For Chat G L M 2‑6 B, M S Q mean accuracy increases slightly from “0.538” to “0.604” (labeled as a “12.4 percent” improvement in the figure), bringing the mean to approximately the passing line, while M M Q mean accuracy increases from “0.293” to “0.311” (“6.2 percent”) and remains well below it. E R N I E‑Bot‑turbo’s M S Q mean accuracy improves from “0.680” to “0.731” (“7.5 percent”), remaining above 0.6, and its M M Q mean accuracy rises from “0.071” to “0.103” (“44.6 percent”), though it is still low in absolute terms. E R N I E‑Bot 4.0 exhibits the highest M S Q scores, with the mean rising from “0.853” to “0.909” (“6.6 percent”) and staying well above the passing line; its M M Q mean accuracy grows from “0.626” to “0.724”, a “15.5 percent” gain, also remaining above the 0.6 threshold. Note: All numerical values are approximate.
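
The percentage gains quoted above appear to be relative improvements of the mean, that is, (mean with CLKR minus mean without CLKR) divided by the mean without CLKR. This formula is an assumption: the figure does not state it, but it reproduces the labeled percentages to within about one percentage point, which is consistent with the approximated values. The short Python sketch below recomputes the gains from the means listed in the description and checks each CLKR-empowered mean against the 0.6 passing line. The names `MEANS`, `relative_gain`, and `summarize` are illustrative only and do not come from the source.

```python
# Approximate mean accuracies read from the figure description:
# (without CLKR, with CLKR) for each model and question type.
MEANS = {
    "Llama-2-70b":      {"MSQ": (0.413, 0.475), "MMQ": (0.120, 0.214)},
    "text-davinci-003": {"MSQ": (0.404, 0.567), "MMQ": (0.220, 0.353)},
    "GPT-3.5 Turbo":    {"MSQ": (0.452, 0.547), "MMQ": (0.205, 0.381)},
    "GPT-4":            {"MSQ": (0.645, 0.743), "MMQ": (0.374, 0.557)},
    "ChatGLM2-6B":      {"MSQ": (0.538, 0.604), "MMQ": (0.293, 0.311)},
    "ERNIE-Bot-turbo":  {"MSQ": (0.680, 0.731), "MMQ": (0.071, 0.103)},
    "ERNIE-Bot 4.0":    {"MSQ": (0.853, 0.909), "MMQ": (0.626, 0.724)},
}

PASSING_LINE = 0.6  # dashed line in every panel


def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent (assumed definition of the labeled gains)."""
    return (after - before) / before * 100.0


def summarize(means: dict) -> list:
    """Return (model, question type, gain in percent, passes with CLKR) rows."""
    rows = []
    for model, by_type in means.items():
        for qtype, (before, after) in by_type.items():
            gain = round(relative_gain(before, after), 1)
            rows.append((model, qtype, gain, after >= PASSING_LINE))
    return rows


if __name__ == "__main__":
    for model, qtype, gain, passes in summarize(MEANS):
        status = "at or above" if passes else "below"
        print(f"{model:17s} {qtype}: +{gain:.1f} percent, {status} the passing line")
```

Because the inputs are approximate, the recomputed gains can differ slightly from the percentages printed in the figure.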

Performance comparison of original and CLKR-empowered GPLLMs in MSQs and MMQs. Source(s): Authors’ own work