Wilcoxon T Tests on CLQA accuracy of 7 GPLLMs with and without CLKR in PCEQEs
| No | GPLLM | CLKR | Average accuracy | Accuracy enhancement | z-statistic | p-value |
|---|---|---|---|---|---|---|
| 1 | Llama-2-70b | without | 0.283 | 28.3% | 4.197 | 0.000*** |
| | | with | 0.363 | | | |
| 2 | text-davinci-003 | without | 0.329 | 44.9% | 4.286 | 0.000*** |
| | | with | 0.476 | | | |
| 3 | GPT-3.5 Turbo | without | 0.349 | 36.3% | 4.287 | 0.000*** |
| | | with | 0.476 | | | |
| 4 | GPT-4 | without | 0.528 | 25.4% | 4.171 | 0.000*** |
| | | with | 0.663 | | | |
| 5 | ChatGLM2-6B | without | 0.430 | 11.1% | 3.729 | 0.000*** |
| | | with | 0.478 | | | |
| 6 | ERNIE-Bot-turbo | without | 0.419 | 10.2% | 3.429 | 0.002*** |
| | | with | 0.462 | | | |
| 7 | ERNIE-Bot 4.0 | without | 0.755 | 9.9% | 4.029 | 0.000*** |
| | | with | 0.830 | | | |
| | Average accuracy of 7 GPLLMs | without | 0.442 | 21.1% | NA | NA |
| | | with | 0.535 | | | |
Note(s): *** denotes a confidence level above 99% (p < 0.01)
Source(s): Authors’ own work
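
The two derived columns of the table can be illustrated with a minimal sketch. It assumes that "Accuracy enhancement" is the relative gain over the without-CLKR baseline, which is consistent with the reported figures (for Llama-2-70b, (0.363 - 0.283) / 0.283 ≈ 28.3%), and that the z-statistic comes from the normal approximation to a paired Wilcoxon signed-rank test without tie or continuity correction. The function names and the paired scores below are hypothetical and are not the authors' data. Recomputing the enhancement from the rounded accuracies can differ from the table in the last digit.

```python
import math


def relative_enhancement(without, with_clkr):
    """Relative accuracy gain of the with-CLKR condition over the baseline."""
    return (with_clkr - without) / without


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_z(without, with_clkr):
    """z-statistic of the signed-rank test (normal approximation, no corrections)."""
    diffs = [b - a for a, b in zip(without, with_clkr) if b != a]  # drop zero differences
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)  # sum of positive ranks
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / std


# Accuracy values copied from the first row of the table.
print(f"Llama-2-70b enhancement: {relative_enhancement(0.283, 0.363):.1%}")

# Hypothetical paired scores (e.g. correct answers per question set), not real data.
scores_without = [3, 5, 4, 6, 2, 7, 5, 4, 6, 3]
scores_with = [5, 6, 7, 10, 7, 13, 12, 12, 15, 13]
print(f"z = {wilcoxon_z(scores_without, scores_with):.3f}")
```

A positive z indicates that the with-CLKR scores tend to exceed the without-CLKR scores. The p-value then follows from the standard normal distribution, and statistical packages typically add tie and continuity corrections that this sketch omits.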