Table 4

Macro-F1 and Cohen’s Kappa reliability values

	ChatGPT 4.0	ChatGPT 4.5	ChatGPT 4.5 with deep research functionality activated
Macro-F1	0.6264	0.6136	0.8630
Cohen’s Kappa	0.5362	0.4985	0.7243

Note(s): Macro-F1 is computed over categories with non-zero support only. On this metric, the Deep Research configuration (0.8630) substantially outperforms ChatGPT 4.0 (0.6264) and ChatGPT 4.5 (0.6136). O5, a zero-support category, is excluded so that the score is not artificially penalised by dataset composition and reflects performance on observed categories only.

Source(s): Authors’ own work
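
As an illustration of how the two metrics in Table 4 can be computed, the following is a minimal sketch, not the authors' own procedure. It assumes one reference label and one model label per item. The category names, the toy data, and the function names `macro_f1` and `cohens_kappa` are hypothetical. A category counts as zero-support when it never appears among the reference labels, as O5 does in the toy data.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels, exclude_zero_support=True):
    """Unweighted mean of per-category F1 scores over `labels`."""
    support = Counter(y_true)
    if exclude_zero_support:
        # Drop categories that never occur in the reference labels.
        labels = [label for label in labels if support[label] > 0]
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); scored as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e). Undefined when p_e equals 1."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance from the two marginal distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: O5 belongs to the coding scheme but is never observed.
categories = ["O1", "O2", "O3", "O4", "O5"]
reference = ["O1", "O1", "O2", "O2", "O3", "O4"]
predicted = ["O1", "O2", "O2", "O2", "O3", "O3"]

f1_observed_only = macro_f1(reference, predicted, categories)
f1_all_categories = macro_f1(reference, predicted, categories,
                             exclude_zero_support=False)
kappa = cohens_kappa(reference, predicted)
print(f1_observed_only, f1_all_categories, kappa)
```

In this toy example the score averaged over all five categories is lower than the score averaged over the four observed ones, because the unobserved category contributes an F1 of zero. This is the penalisation that excluding O5 is meant to avoid.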
