Macro-F1 and Cohen’s Kappa reliability values
| Metric | ChatGPT 4.0 | ChatGPT 4.5 | ChatGPT 4.5 with deep research functionality activated |
|---|---|---|---|
| Macro-F1 | 0.6264 | 0.6136 | 0.8630 |
| Cohen’s Kappa | 0.5362 | 0.4985 | 0.7243 |
Note(s): Macro-F1, computed over categories with non-zero support only, confirms that the Deep Research configuration substantially outperforms the other two (0.8630 versus 0.6264 and 0.6136). Excluding O5, a zero-support category, avoids artificial penalisation due to dataset composition and ensures that the metric reflects performance on observed categories only.
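The two metrics can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the labels O1 to O3, and the six toy ratings are hypothetical and chosen only to show how a category with zero support in the reference coding (here O5) is left out of the macro average, and how Cohen's Kappa corrects observed agreement for chance agreement.

```python
from collections import Counter

def macro_f1_observed(y_true, y_pred):
    """Macro-F1 averaged only over categories that occur in y_true."""
    support = Counter(y_true)  # zero-support categories never appear here
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(support):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); the denominator is > 0 because support > 0
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the two raters' marginal label frequencies
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Assumed convention for the degenerate case where both raters
        # use a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy data: the model predicts O5 once, but O5 never occurs
# in the human coding, so it contributes no term to the macro average.
human = ["O1", "O1", "O2", "O2", "O3", "O3"]
model = ["O1", "O2", "O2", "O2", "O3", "O5"]

print(f"Macro-F1 (observed categories): {macro_f1_observed(human, model):.4f}")  # 0.7111
print(f"Cohen's Kappa: {cohen_kappa(human, model):.4f}")  # 0.5385
```

The stray O5 prediction still lowers the score indirectly, because it counts as a false negative for O3; only the separate F1 term for O5, which would be zero by construction, is dropped from the average.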