Table 4

Macro-F1 and Cohen’s Kappa reliability values

| Metric | ChatGPT 4.0 | ChatGPT 4.5 | ChatGPT 4.5 with deep research functionality activated |
| --- | --- | --- | --- |
| Macro-F1 | 0.6264 | 0.6136 | 0.8630 |
| Cohen’s Kappa | 0.5362 | 0.4985 | 0.7243 |

Note(s): Macro-F1 (excluding zero-support categories) confirms that the Deep Research model (0.8630) substantially outperforms the other configurations (0.6264 and 0.6136). Excluding O5 avoids artificial penalisation due to dataset composition and ensures that the metric reflects performance on observed categories only.

Source(s): Authors’ own work
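To make the note concrete, the following is a minimal sketch of how a Macro-F1 restricted to observed categories and a Cohen's Kappa could be computed. It is not the authors' code: the function names `macro_f1_observed` and `cohens_kappa` and the toy labels are hypothetical, chosen only for illustration. The sketch assumes that "zero-support" means a category with no instances in the reference coding.

```python
from collections import Counter


def macro_f1_observed(y_true, y_pred):
    """Macro-F1 averaged only over categories that occur in y_true.

    Categories with zero support (never present in the reference
    coding) are skipped, so they cannot pull the average down.
    """
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(support):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # tp + fn >= 1 for every observed label, so the denominator is > 0.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def cohens_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between two codings, corrected for chance."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    # Chance agreement from the two marginal label distributions.
    expected = sum(true_counts[l] * pred_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical toy data: O5 is predicted once but never occurs in y_true.
    y_true = ["O1", "O1", "O2", "O2", "O3", "O3"]
    y_pred = ["O1", "O1", "O2", "O3", "O3", "O5"]
    print(round(macro_f1_observed(y_true, y_pred), 4))
    print(round(cohens_kappa(y_true, y_pred), 4))
```

In this toy example, averaging over the three observed categories gives 13/18 (about 0.72). Including the zero-support category O5, whose F1 is 0, would lower the average to about 0.54, which is the kind of penalisation the note describes.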
