Performance results of six base models with different prompt strategies
| Metric | GPT-4o base (SP) | GPT-4o base (JP) | Qwen2.5-72B base (SP) | Qwen2.5-72B base (JP) | Qwen2.5-max base (SP) | Qwen2.5-max base (JP) | Qwen3-235B-A22B base (SP) | Qwen3-235B-A22B base (JP) | DeepSeek-V3 base (SP) | DeepSeek-V3 base (JP) | DeepSeek-R1 base (SP) | DeepSeek-R1 base (JP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L | 11.78 | 6.60↓ | 42.56 | 29.98↓ | 59.58 | 36.53↓ | 77.21 | 45.87↓ | 36.20 | 22.91↓ | 186.43 | 122.84↓ |
| JEM | 0.220 | 0.355↑ | 0.184 | 0.291↑ | 0.206 | 0.340↑ | 0.241 | 0.397↑ | 0.220 | 0.362↑ | 0.262 | 0.433↑ |
| F1(MNR) | 0.705 | 0.709↑ | 0.645 | 0.686↑ | 0.651 | 0.682↑ | 0.635 | 0.651↑ | 0.774 | 0.790↑ | 0.753 | 0.762↑ |
| F1(ENR) | 0.300 | 0.412↑ | 0.304 | 0.418↑ | 0.356 | 0.445↑ | 0.410 | 0.507↑ | 0.337 | 0.385↑ | 0.386 | 0.462↑ |
| F1(MNCX) | 0.582 | 0.654↑ | 0.598 | 0.646↑ | 0.643 | 0.710↑ | 0.676 | 0.690↑ | 0.622 | 0.656↑ | 0.701 | 0.737↑ |
| F1(ENCX) | 0.476 | 0.712↑ | 0.515 | 0.718↑ | 0.560 | 0.771↑ | 0.640 | 0.802↑ | 0.591 | 0.671↑ | 0.720 | 0.790↑ |
Notes: L stands for latency (time per query, in seconds). JEM stands for Joint Exact-Match. SP stands for the Separate Prompt strategy and JP for the Joint Prompt strategy. Arrows mark the direction of change of JP relative to SP.
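
To make the SP-to-JP comparison easier to read, the relative change per model can be computed directly from the table. The sketch below is illustrative only: the latency and JEM values are copied from the rows above, while the helper name `relative_change` and the dictionary layout are assumptions introduced here, not part of the original evaluation.

```python
# Values copied from the table: (SP, JP) per model.
# Latency L is in seconds per query; JEM is Joint Exact-Match.
latency = {
    "GPT-4o": (11.78, 6.60),
    "Qwen2.5-72B": (42.56, 29.98),
    "Qwen2.5-max": (59.58, 36.53),
    "Qwen3-235B-A22B": (77.21, 45.87),
    "DeepSeek-V3": (36.20, 22.91),
    "DeepSeek-R1": (186.43, 122.84),
}
jem = {
    "GPT-4o": (0.220, 0.355),
    "Qwen2.5-72B": (0.184, 0.291),
    "Qwen2.5-max": (0.206, 0.340),
    "Qwen3-235B-A22B": (0.241, 0.397),
    "DeepSeek-V3": (0.220, 0.362),
    "DeepSeek-R1": (0.262, 0.433),
}


def relative_change(sp: float, jp: float) -> float:
    """Fractional change of JP relative to SP (hypothetical helper).

    A negative value means the JP figure is lower than the SP figure.
    """
    return (jp - sp) / sp


# Print the per-model change in latency and JEM as percentages.
for model in latency:
    d_latency = relative_change(*latency[model])
    d_jem = relative_change(*jem[model])
    print(f"{model:18s} latency {d_latency:+.1%}  JEM {d_jem:+.1%}")
```

A negative latency change together with a positive JEM change corresponds to the ↓ and ↑ arrows in the table.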