Table 7

Performance results of six base models with different prompt strategies

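| Performance | GPT-4o base (SP) | GPT-4o base (JP) | Qwen2.5-72B base (SP) | Qwen2.5-72B base (JP) | Qwen2.5-max base (SP) | Qwen2.5-max base (JP) | Qwen3-235B-A22B base (SP) | Qwen3-235B-A22B base (JP) | DeepSeek-V3 base (SP) | DeepSeek-V3 base (JP) | DeepSeek-R1 base (SP) | DeepSeek-R1 base (JP) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L | 11.78 | 6.60↓ | 42.56 | 29.98↓ | 59.58 | 36.53↓ | 77.21 | 45.87↓ | 36.20 | 22.91↓ | 186.43 | 122.84↓ |
| JEM | 0.220 | 0.355↑ | 0.184 | 0.291↑ | 0.206 | 0.340↑ | 0.241 | 0.397↑ | 0.220 | 0.362↑ | 0.262 | 0.433↑ |
| F1(MNR) | 0.705 | 0.709↑ | 0.645 | 0.686↑ | 0.651 | 0.682↑ | 0.635 | 0.651↑ | 0.774 | 0.790↑ | 0.753 | 0.762↑ |
| F1(ENR) | 0.300 | 0.412↑ | 0.304 | 0.418↑ | 0.356 | 0.445↑ | 0.410 | 0.507↑ | 0.337 | 0.385↑ | 0.386 | 0.462↑ |
| F1(MNCX) | 0.582 | 0.654↑ | 0.598 | 0.646↑ | 0.643 | 0.710↑ | 0.676 | 0.690↑ | 0.622 | 0.656↑ | 0.701 | 0.737↑ |
| F1(ENCX) | 0.476 | 0.712↑ | 0.515 | 0.718↑ | 0.560 | 0.771↑ | 0.640 | 0.802↑ | 0.591 | 0.671↑ | 0.720 | 0.790↑ |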

Note(s): L stands for latency (time per query, in seconds). JEM stands for Joint Exact-Match. SP stands for Separate Prompt strategy. JP stands for Joint Prompt strategy.

Source(s): Authors’ own work
