
This paper investigates whether Large Language Models (LLMs), guided by prompt engineering, can automate the application of the complex Discourse Quality Index (DQI) for measuring deliberative quality. State-of-the-art models from OpenAI and Google are evaluated while the number of In-Context Learning (ICL) examples is varied from 0 to 50. The results show that LLMs achieve high fidelity, comparable to that of human annotators on standard reliability metrics. Performance improves markedly with only a few examples and plateaus at around 25–50 examples. Both models perform well, but the differences between them highlight the interplay between model selection and ICL strategy. Error analysis identifies specific DQI dimensions that require further improvement, suggesting future work on advanced reasoning prompts. The study confirms the viability of LLMs for scaling DQI measurement and provides practical guidance on optimizing ICL strategies. It also contributes a modular, adaptable AI engineering pipeline that researchers can use for their own prompting experiments across a range of measurement tasks.
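To make the idea of varying the number of ICL examples concrete, the following is a minimal sketch of how a prompt with k labeled examples might be assembled. It is not the paper's pipeline: the names `LabeledExample` and `build_icl_prompt`, the "Text:/Label:" layout, and the field names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    text: str   # a speech segment (hypothetical field name)
    label: str  # its human-assigned code (hypothetical field name)


def build_icl_prompt(instructions, examples, target, k):
    """Assemble a prompt with the first k labeled examples, then the target."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and len(examples)")
    parts = [instructions.strip()]
    for ex in examples[:k]:
        parts.append(f"Text: {ex.text}\nLabel: {ex.label}")
    # The target is left unlabeled so the model completes the label.
    parts.append(f"Text: {target}\nLabel:")
    return "\n\n".join(parts)
```

Calling the function with k = 0 yields a zero-shot prompt, and sweeping k over a grid of values gives one prompt condition per setting. The resulting string would then be sent to whichever model API is being evaluated.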
