This exploratory study aims to develop and evaluate artificial intelligence (AI) based predictive models for academic outcomes, offering data-driven insights for school leadership by integrating machine learning (ML) with self-determination theory motivation frameworks.
A comparative analysis was conducted across eight regression models, including deep neural network (DNN), random forest (RF) and gradient boosting (GB), using a small-scale dataset (n = 68). Model performance was assessed through 5× repeated 5-fold cross-validation (CV), with root mean squared error (RMSE) as the primary metric. Statistical significance was assessed with a permutation test of 100 label shuffles per model. To provide actionable transparency, the best-performing model was subjected to Shapley additive explanations (SHAP) analysis and learning curve (LC) analysis to evaluate generalization capabilities and bias-variance tradeoffs.
The RF model performed best, achieving the lowest mean RMSE of 5.2138 (±1.5280), followed by K-Nearest Neighbors (5.3572) and light GB machine (5.5326). Permutation testing indicated statistically significant predictive power for RF (p = 0.0198) and GB (p = 0.0297). SHAP analysis identified first-quarter scores, age and study time as primary predictors, and also placed SDT-related factors such as autonomy importance, relatedness and recommendation likelihood (a proxy for AI app engagement) among the top-10 influential features. The LC indicated a persistent generalization gap, suggesting that while the model captures complex patterns, its current predictive stability is constrained by the small sample size (n = 68).
The primary limitation is the small sample size (n = 68). While CV and permutation tests were used to assess model stability, the small sample means that the results are highly sensitive to the specific characteristics of this cohort. Consequently, these findings cannot be generalized to broader or more diverse student populations without further large-scale validation. High variance and overfitting: As evidenced by the LC, a persistent gap remains between the training and CV RMSE. This indicates a high degree of variance, where the model is still prone to overfitting the noise within the small dataset. The model's predictive accuracy might fluctuate significantly if applied to a different academic environment. Exploratory nature of policy claims: The practical implications discussed, such as using specific SHAP features to drive curriculum changes, should be viewed as hypotheses for future research rather than definitive institutional mandates. The synthetic nature of some data labels and the limited demographic range further necessitate a cautious approach to applying these results to broad educational policy. Constraints of deep learning: The relative underperformance of the DNN (MLPRegressor) further highlights the difficulty of applying complex, “data-hungry” architectures to small-scale educational datasets.
While the RF model and SHAP analysis provide valuable insights, it is important to treat these findings as exploratory and preliminary due to the study's specific context and sample size. Nevertheless, this investigation offers a promising framework for how school leadership can move toward a more proactive, data-informed ecosystem. Targeted interventions and resource allocation: The identification of “score first quarter” and “study time” as primary drivers suggests that academic support should be front-loaded. Rather than waiting for mid-year failures, school leaders can use early quarter data as a “screening tool” to identify students who may require additional mentorship. However, given the exploratory nature of this study, such interventions should be implemented as pilot programs to further validate these predictors in real-world settings.
This study uniquely integrates statistical significance testing with comprehensive explainable AI to bridge the gap between ML and educational psychology. By identifying Autonomy and Personalization as measurable predictors of academic success, it provides a methodological proof-of-concept for how school leadership can leverage SDT-informed AI. Given the exploratory nature and limited sample, these results serve as a preliminary foundation for proactive, motivation-aware educational strategies and targeted digital interventions.
Introduction
Accurately predicting student academic performance has become an essential component of educational leadership, influencing resource allocation, intervention planning and strategic policymaking (Leithwood et al., 2004; Alnasyan et al., 2024). School leaders are increasingly relying on data-driven approaches to make informed decisions that impact student success (Amatullah et al., 2025). However, traditional academic prediction models primarily focus on test scores and historical performance, often neglecting critical psychological and motivational factors that shape learning behaviors (Yağcı, 2022; Nabil et al., 2021).
Advancements in artificial intelligence (AI) and machine learning (ML) have enabled more sophisticated methods for forecasting student achievement (Collier et al., 2024; Nabil et al., 2021). AI-driven predictive analytics have the potential to enhance school leadership strategies, offering administrators deeper insights into student learning behaviors, engagement levels and academic trajectories (Collier et al., 2024; Bellaj et al., 2023). Despite these advancements, school leaders face challenges in implementing AI-based models, particularly in ensuring interpretability, addressing biases and making ethical decisions in student performance predictions (Khosravi et al., 2022).
Despite significant advancements in ML techniques for academic performance forecasting, research incorporating psychological constructs from self-determination theory (SDT), including autonomy, competence and relatedness, remains scarce (Ryan and Deci, 2020). While Orji and Vassileva (2022) explored these constructs in predictive models, most studies still rely primarily on traditional metrics such as test scores and demographic data, overlooking AI-driven approaches that holistically integrate SDT-based learning attributes (Buenaño-Fernandez et al., 2019; Aljaloud et al., 2022; Chavez and Palaoag, 2024).
Additionally, while ML models have demonstrated effectiveness in predicting academic performance, limited research has examined AI-driven interventions for school leadership. Studies such as Nabil et al. (2021) and Bellaj et al. (2023) provide evidence that ML techniques can enhance predictive accuracy, yet further exploration is necessary to integrate these tools into personalized resource allocation strategies and student support programs within educational leadership.
This study aims to compare the effectiveness of various ML models, including deep neural network (DNN), gradient boosting (GB), extreme GB (XGB), light GB machine (LGBM), random forest (RF) and classical algorithms such as linear regression (LR), decision tree (DT) and k-nearest neighbors (KNN), in predicting student academic performance. By incorporating SDT-based motivational constructs alongside test scores, this research provides school leaders with a comprehensive framework for proactive interventions and strategic planning.
The findings offer exploratory insights that may inform future leadership training and the development of preliminary guidelines for school administrators seeking to understand AI-driven patterns in educational outcomes. Additionally, this study highlights the potential role of explainable AI (XAI) techniques, such as Shapley additive explanations (SHAP) analysis, in fostering transparency and supporting ethical considerations during the initial stages of model interpretation (Lundberg and Lee, 2017). By bridging preliminary AI-based analytics, psychological theory and school leadership, this study provides a proof-of-concept for how administrators might eventually adopt data-informed strategies. While the results are constrained by the small sample size, they offer a foundation for further research into fostering student success and examining equity through a more transparent, motivation-aware lens.
Literature review
Predicting student performance is crucial for schools aiming to enhance learning outcomes, provide timely support and allocate resources efficiently (Leithwood et al., 2004; Alnasyan et al., 2024). Traditionally, many studies have relied on grades and test scores to forecast student success. However, these approaches often overlook important psychological and motivational factors that significantly influence learning (Cao et al., 2024; Alnasyan et al., 2024; Sghir et al., 2022).
Recent advances in educational data mining (EDM) and ML have made it possible to analyze large volumes of student data, thereby improving prediction accuracy (Bin Roslan and Chen, 2022; Lampropoulos, 2022; Sekeroglu and Altun, 2023). Despite these technological improvements, schools require AI models that are transparent and easy to interpret, so that teachers and administrators can trust and effectively utilize the predictions (Wang and Luo, 2024). XAI techniques, such as SHAP, address this need by quantifying the contribution of each feature to the model's output, revealing which student factors most influence predictions (Lee et al., 2023; Rajendran et al., 2022).
These AI-driven insights also support more effective budget planning, strategic investment in educational tools and the refinement of teaching methods (Lin and Yu, 2023). However, most AI research focuses predominantly on academic records and rarely incorporates psychological factors, limiting the scope and accuracy of their predictions (Vistorte et al., 2024).
To bridge this gap, the present study integrates SDT into AI-based academic predictions. SDT identifies three fundamental drivers of student motivation: autonomy, competence and relatedness (Ryan and Deci, 2020; Jeno et al., 2018; Liu et al., 2022; Bureau et al., 2021).
Motivation data grounded in SDT are collected through a questionnaire designed for an AI-driven application. This questionnaire assesses student preferences for features such as progress tracking (competence), gamification, personalized feedback (autonomy) and social interaction (relatedness) (Ryan and Deci, 2020; Chavez and Palaoag, 2024). By combining these motivational factors with traditional academic records, this study aims to develop a more comprehensive and accurate model for predicting student success.
AI models for educational performance prediction
Numerous ML models have been leveraged to predict student academic performance, each with distinct strengths and limitations in handling educational data. LR remains a foundational model widely used due to its simplicity and interpretability (Alsariera et al., 2022). Bhutto et al. (2020) applied LR within an E-learning environment using the Kalboard 360 system dataset, considering student satisfaction, system interaction and punctuality, and achieved 71% accuracy. Likewise, Kotsiantis et al. (2003) included LR among 6 ML algorithms for dropout prediction in distance learning, finding it effective based on demographic and performance data from 356 student records, despite being surpassed in accuracy by other models.
DT and KNN algorithms improve pattern recognition capabilities beyond LR but carry risks of inconsistent results due to data sensitivity (Linkon et al., 2024). Hasan et al. (2019) demonstrated DT and KNN's efficacy in predicting final exam marks across three semesters, with DT achieving 94.44% and KNN 89.74% accuracy, excluding basic attributes like attendance but incorporating test and midterm scores along with demographic factors. Optimization techniques applied to DT and KNN, such as Bayesian optimization, have shown significant accuracy improvements in non-educational domains like pedestrian fatality prediction (Yang et al., 2022), highlighting their adaptability and potential for complex educational data.
Ensemble models like RF and GB have gained prominence for combining multiple DTs to deliver more stable and accurate predictions (Agmeyang et al., 2024; Alamer et al., 2025). In education, Syawaludin et al. (2024) used GB and RF models to forecast student grades, integrating variables including GPA, absences, parental support and extracurricular participation, with GB achieving a precision of 0.929. Similarly, Wang et al. (2022) applied LGBM on online learning interaction data, outperforming ten classical ML algorithms and underscoring the importance of behavioral analysis in online education. Advanced GB frameworks such as XGB and LGBM are well-suited for processing large datasets with high precision (Bellaj et al., 2023). Karthika et al. (2025) proposed a stacking ensemble of GB, RF and XGB models, augmented by a support vector regressor meta-learner, achieving an accuracy-like metric of 97.29%, R² of 0.751 and a low mean absolute error of 0.246, demonstrating ensemble learning's robustness and interpretability for student performance prediction. Khosravi and Azarnik (2025) affirmed RF and XGB's effectiveness in EDM despite challenges like temporal and heterogeneous data distributions.
In addition, Kokol et al. (2022) examined ML for small datasets, noting risks of overfitting and poor generalization. They proposed four strategies: dimensionality reduction, data augmentation, data mining and statistical learning using simpler, regularized models. The study emphasized interpretability and robust validation (e.g. nested cross-validation (CV)) as essential for reliability. These practices support effective, transparent ML when data is limited. Furthermore, Kraljevski et al. (2023) addressed industrial small-data challenges, including unlabeled, imbalanced and rare event issues, advocating domain-informed feature engineering, data augmentation and expert knowledge integration to enhance ML reliability in constrained environments. Rather et al. (2024) reviewed deep learning (DL) for small datasets, emphasizing transfer learning, data augmentation and lightweight architectures to improve performance and democratize AI in resource-constrained environments.
DL, a subset of ML, uses DNNs with many layers to model complex patterns in vast, unstructured data, often surpassing human performance in image recognition and natural language tasks. DNNs transform industries, enabling accurate medical diagnoses, real-time decisions in autonomous vehicles and advanced language translation in chatbots. Notably, pruning techniques help make these models more efficient, but interpretability and energy use remain challenges. In education, DNNs and long short-term memory networks excel at revealing nonlinear and temporal relationships. Recent studies show up to 89% accuracy in predicting student success, and with explainable frameworks like SHAP, DNNs can clarify which factors matter most. LSTM-based models, particularly when paired with SMOTE or GANs for balanced data, reach over 98% accuracy for early student risk detection in programming courses, highlighting their practical impact (LeCun et al., 2015; Nabil et al., 2021; Vives et al., 2024).
SDT-based learning
The incorporation of motivational theories like SDT provides a psychological backbone to educational prediction models. SDT posits that autonomy, competence and relatedness are key drivers of intrinsic motivation and academic engagement (Ryan and Deci, 2020). Recent meta-analyses substantiate the linkage between SDT constructs and improved educational outcomes (Howard et al., 2024; Wang et al., 2024; Alamer et al., 2025). Empirical research demonstrates SDT's utility across diverse educational contexts, revealing its efficacy in cultivating motivation and academic success (Ryan and Deci, 2020; Jeno et al., 2018).
Alrabai (2021) implemented a 12-week autonomy-supportive intervention for EFL learners based on SDT, significantly enhancing perceived choice, competence, autonomy support and intrinsic motivation, which collectively improved learners' autonomy and engagement in language learning. Moreover, Chong and Reinders (2025) conducted a comprehensive scoping review of 61 empirical studies on English language learner autonomy, revealing diverse theoretical frameworks and operational definitions. They noted limited use of evaluation measures and recommended clearer theoretical grounding, cross-contextual research, mixed-method approaches and greater emphasis on out-of-class learning to foster autonomy. Complementing this, İrgatoğlu et al. (2022) examined autonomy and strategy use among 155 preparatory students before and during the COVID-19 pandemic. Their findings showed a decline in autonomy from high to moderate levels, alongside reduced strategy use, yet a positive correlation persisted between autonomy and language learning strategies, underscoring their interdependence during challenging contexts.
Concurrently, AI-driven personalized learning leverages such motivational frameworks to tailor interventions and enhance predictive accuracy (Holstein et al., 2020; Roll and Wylie, 2016; Romero and Ventura, 2020). The integration of SDT and ML advances, therefore, presents a promising avenue for understanding and augmenting student motivation within predictive analytics (Goodfellow et al., 2016; LeCun et al., 2015).
Evaluation metrics in educational prediction
The root mean squared error (RMSE) was the primary metric used to evaluate model performance. RMSE measures the average deviation between predicted and actual student scores in original grade units, making results highly interpretable for school leaders (Hodson, 2022).
By squaring errors, RMSE penalizes large prediction gaps more heavily than minor ones. This is critical in education, where failing to identify a student at high risk of failure is more consequential than small fluctuations in passing grades (Miller et al., 2024). Given the small sample (n = 68), RMSE was paired with Repeated CV to ensure reliability and mitigate the impact of outliers (Kokol et al., 2022).
AI-driven decision-making and school leadership
AI and ML are transforming school leadership by enabling data-driven decision-making, optimizing management practices and supporting equitable education. A key component of this transformation is predictive analytics, which uses historical and real-time data to identify patterns and forecast future outcomes. In schools, predictive models help leaders anticipate student performance trends, detect early signs of academic struggle and recommend timely interventions before challenges escalate. This proactive capability enhances equity by ensuring that at-risk learners receive support when it is most effective (Göçen et al., 2025; Zawacki-Richter et al., 2019).
AI-powered platforms such as Civitas Learning demonstrate how predictive analytics consolidates diverse data streams, including student performance, teacher effectiveness and resource utilization, into actionable insights. These tools allow school leaders to make proactive, evidence-based decisions that were previously difficult to achieve (Göçen et al., 2025). Predictive budgeting models similarly strengthen financial management by forecasting future resource needs based on historical spending patterns, helping leaders allocate funds more efficiently and prioritize sustainable investments in educational technology and infrastructure (Lin and Yu, 2023; Buenaño-Fernández et al., 2019).
At the instructional level, predictive analytics supports personalized learning by identifying students at risk of disengagement and suggesting tailored interventions such as adaptive exercises, remedial materials, or one-on-one tutoring (Alnasyan et al., 2024). These systems also enable dynamic curriculum design, recommending content that matches students' interests, goals and learning styles, while monitoring socio-emotional cues to address holistic development (Alyahyan and Dustegor, 2020). For teachers, predictive insights into student learning trends guide professional development initiatives, helping leaders align training with specific classroom needs and improve instructional strategies (Pek et al., 2023).
Moreover, by automating routine administrative tasks such as scheduling, communication and reporting, AI frees school leaders to focus on strategic decision-making and human-centered leadership (Dogan and Arslan, 2025; Karakose and Tülübaş, 2024). Still, the growing reliance on predictive analytics and AI raises ethical challenges related to data privacy, algorithmic fairness and transparency. Without strong governance frameworks and stakeholder engagement, these tools risk reinforcing inequities or undermining trust in educational systems (UNESCO, 2025; Dogan and Arslan, 2025).
Finally, while predictive analytics has significantly advanced evidence-based leadership, most current AI models underutilize psychological and motivational factors. For instance, student motivation, a critical determinant of academic success, is often missing from predictive models, limiting the depth of understanding behind student performance (Wu et al., 2022). Integrating such factors into AI-driven decision-making represents a promising direction for future research and practice, ensuring that predictive analytics not only anticipates outcomes but also addresses the underlying reasons for student success or struggle.
XAI, predictive analytics and SHAP in education
Predictive analytics involves using advanced mathematical and statistical approaches, combined with information technology tools, to detect patterns, relationships and dependencies within complex data. Its main goal is to develop models that can predict the likelihood of future events based on historical information applied to new data sets (Dinov, 2018; Williams, 2011).
In the context of education, these techniques analyze various student characteristics to forecast outcomes like academic success, risk of dropping out, or the need for interventions, thereby supporting informed decision-making by educators (Hsu and Lu, 2024).
However, for predictive analytics to be effectively used in education, it is essential that the predictions are transparent and interpretable. XAI techniques, such as SHAP, play a vital role by shedding light on the inner workings of these models, making AI predictions understandable for educators and school leaders (Li et al., 2024). SHAP quantifies the contribution of each student feature to the prediction outcome, allowing educators to see which factors most influence student success and supporting more targeted interventions (Wang and Luo, 2024).
SHAP provides both local explanations for individual students and global explanations across the entire student body, enhancing the transparency and fairness of predictive analytics in education (Lundberg and Lee, 2017). This ensures that AI-driven insights are not only accurate but also ethically sound and practically useful in educational decision-making (Rajendran et al., 2022).
Ethical concerns in AI-based student analytics
Schools must ensure AI systems are fair and responsible by addressing key concerns:
Data privacy and security: Schools must protect student records and ensure AI tools follow ethical standards (Khosravi et al., 2022).
Fair AI predictions: AI models should not create bias or reinforce existing inequalities (Lundberg and Lee, 2017).
Transparency: Educators need clear explanations of AI predictions to make informed decisions (Bellaj et al., 2023).
Methodology
To rigorously assess the predictive capabilities of various ML algorithms on academic outcomes, a comprehensive evaluation framework was implemented. The overall ML workflow commenced with identifying the relevant data attributes and selecting appropriate columns for analysis. This was followed by a meticulous data preprocessing phase to prepare the features for model training.
Research design
The research design involved a comparative analysis of multiple regression models to predict academic outcomes. The objective was to identify the most effective model and to understand the underlying feature contributions through robust evaluation techniques.
Research instrumentation and reliability
Student motivation was measured using an instrument adapted from “AI-driven mobile application: Unraveling students' motivational feature preferences for reading comprehension” (Chavez and Palaoag, 2024), a validated scale designed to measure the SDT constructs of autonomy, competence and relatedness.
To evaluate the internal consistency of the instrument for this specific cohort, Cronbach's alpha was calculated for the core Likert-scale items. The analysis yielded a coefficient of 0.7415, which exceeds the standard reliability threshold of 0.70 (Nunnally and Bernstein, 1994).
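As a rough illustration of how such a coefficient can be obtained, the sketch below computes Cronbach's alpha from a respondents-by-items matrix using the standard variance formula. The `cronbach_alpha` helper and the response matrix are hypothetical and are not the study's data or code.

```python
import numpy as np

def cronbach_alpha(items) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of Likert scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                               # number of items
    item_variances = items.var(axis=0, ddof=1)       # sample variance per item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of summed scale
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical responses: 6 students x 4 Likert items (1-5).
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
])

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.4f}")
```

Values at or above 0.70 are conventionally read as acceptable internal consistency, which is the threshold the study applies.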
Sampling method
Due to logistical constraints, convenience sampling was implemented. Participants were selected based on accessibility within the school, resulting in 68 respondents. This non-probability sampling method introduces potential selection bias, limiting external validity and generalizability beyond the sampled students.
Data preprocessing
Data preprocessing involved several critical steps to prepare the features for robust model training. To ensure the integrity of the predictive analysis, a strict leakage prevention protocol was implemented; the target variable (3rd Quarter Score) was excluded from the feature matrix. Only baseline academic data (first and second quarter scores), demographics and SDT constructs were retained.
Categorical predictors were transformed using one-hot encoding, while multi-select survey items related to SDT constructs were processed via binary indicator encoding. This resulted in each selectable option being treated as an independent feature, bringing the final dimensionality to 44 features. This granular approach allows the models to identify specific combinations of motivational factors that most strongly correlate with academic success.
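The two encodings described above might look as follows in pandas. This is a minimal sketch: the column names, the option labels and the `;` separator are assumptions made for illustration and do not come from the study's questionnaire export.

```python
import pandas as pd

# Hypothetical survey extract; names and options are illustrative only.
df = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "preferred_features": [
        "progress tracking;gamification",
        "personalized feedback",
        "gamification;social interaction",
    ],
})

# One-hot encoding for a single-choice categorical predictor.
gender_encoded = pd.get_dummies(df["gender"], prefix="gender", dtype=int)

# Binary indicator encoding for a multi-select item:
# each selectable option becomes its own 0/1 column.
feature_flags = (
    df["preferred_features"].str.get_dummies(sep=";").add_prefix("feat_")
)

X = pd.concat([gender_encoded, feature_flags], axis=1)
print(X.columns.tolist())
```

Because every option becomes a separate column, a handful of multi-select items can expand the feature matrix quickly, which is one way a small survey can reach dozens of features.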
Model selection
A diverse suite of eight regression models was selected for evaluation to cover various algorithmic approaches and complexities. These included:
LR: A simple, interpretable baseline model.
KNN: A non-parametric, instance-based learning algorithm.
DT: A foundational tree-based model.
RF: An ensemble method using multiple DTs to improve accuracy and reduce overfitting.
GB: Another powerful ensemble technique that builds models sequentially.
XGB: An optimized distributed GB library designed for speed and performance.
LGBM: A GB framework that uses tree-based learning algorithms, known for its speed and efficiency.
DNN: Implemented using MLPRegressor from sklearn.neural_network, representing a basic DL approach.
Model training and validation
The core of the model evaluation relied on a 5× repeated 5-fold CV technique. This approach partitions the training data (X_train_processed_final, y_train) into five folds. For each of the five folds, the model was trained on the other four folds and validated on the held-out one. This 5-fold process was then repeated five times with different random partitions, yielding a total of 25 performance estimates for each model. This extensive CV strategy was chosen to provide more reliable and less variable estimates of model performance, especially given the dataset size (n = 68).
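A minimal sketch of this validation scheme with scikit-learn is given below. The data are synthetic stand-ins sized like the study's dataset (68 rows, 44 features), and the random seeds and `n_estimators` value are assumptions chosen for the example, not the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the processed training data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(68, 44))
y = 80 + 5 * X[:, 0] + rng.normal(scale=3, size=68)

# 5 folds x 5 repeats = 25 train/validation splits.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximizes scores, so RMSE is returned negated.
neg_rmse = cross_val_score(
    model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_scores = -neg_rmse

print(f"Number of estimates: {len(rmse_scores)}")
print(f"Mean RMSE: {rmse_scores.mean():.4f} (SD {rmse_scores.std():.4f})")
```

The 25 scores share training rows across folds and repeats, so their standard deviation describes spread across splits rather than the error of fully independent samples.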
Handling MLPRegressor Convergence: During initial CV runs, the MLPRegressor (DNN) exhibited ConvergenceWarning messages, indicating that the optimization algorithm did not converge within the default or previously set max_iter limit. To address this, the max_iter parameter for the MLPRegressor was progressively increased to 10,000. Additionally, to ensure a clean re-evaluation and prevent any unintended carry-over of state or duplicate entries, the list of models was explicitly re-initialized before each evaluation loop. This ensured that all models were instantiated and evaluated with their correct and updated configurations.
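One way to guarantee freshly configured models on every run is a small factory function, sketched below. The `make_models` helper is hypothetical, the seed is an assumption, and XGB and LGBM are left out so that the example depends only on scikit-learn.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

def make_models(seed: int = 42) -> dict:
    """Return new, unfitted estimators so no state carries over between runs."""
    return {
        "LR": LinearRegression(),
        "KNN": KNeighborsRegressor(),
        "DT": DecisionTreeRegressor(random_state=seed),
        "RF": RandomForestRegressor(random_state=seed),
        "GB": GradientBoostingRegressor(random_state=seed),
        # Raised iteration cap to address ConvergenceWarning messages.
        "DNN": MLPRegressor(max_iter=10_000, random_state=seed),
    }

# Each evaluation loop starts from its own set of instances.
models_run_1 = make_models()
models_run_2 = make_models()
print(sorted(models_run_1))
```

Calling the factory at the top of each evaluation loop avoids duplicate list entries and stale hyperparameters without relying on manual re-initialization.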
Evaluation metrics
The RMSE was chosen as the primary evaluation metric. RMSE is particularly relevant for regression tasks as it represents the average magnitude of the errors in predicting the target variable, providing an easily interpretable measure in the same units as the predicted outcome. Lower RMSE values indicate better model performance.
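To make the metric concrete, the short example below computes RMSE by hand for three invented scores and contrasts it with the mean absolute error (MAE). The numbers are illustrative only.

```python
import numpy as np

# Invented actual and predicted scores for three students.
y_true = np.array([80.0, 85.0, 90.0])
y_pred = np.array([78.0, 85.0, 95.0])

errors = y_pred - y_true                 # -2, 0, 5
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((4 + 0 + 25) / 3)
mae = np.mean(np.abs(errors))            # (2 + 0 + 5) / 3

print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")  # RMSE = 3.1091, MAE = 2.3333
```

The single 5-point miss pulls RMSE well above MAE, which is the squaring effect that makes RMSE sensitive to large prediction gaps.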
Statistical significance
To ascertain the statistical significance of the observed model performances, permutation tests were conducted for each evaluated model. This involved shuffling the target variable (y_train) 100 times and retraining each model on these permuted datasets. The RMSE scores obtained from these permuted runs were then compared against the RMSE of the model trained on the original, unshuffled data. A p-value less than 0.05 was considered indicative of statistically significant performance, implying that the model's predictive ability was unlikely to have occurred by chance, thus validating its genuine learning from the underlying data patterns.
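The procedure can be sketched with scikit-learn's `permutation_test_score`, shown below on synthetic data. Whether the study used this function or a custom loop is not stated, so treat the function choice, the seeds and the reduced feature count as assumptions. The function reports p = (C + 1) / (n_permutations + 1), where C counts permuted runs that score at least as well as the real one. As an arithmetic observation, the reported values 0.0198 and 0.0297 equal 2/101 and 3/101 to four decimals, which is what this formula gives for C = 1 and C = 2 with 100 permutations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, permutation_test_score

# Synthetic stand-in data (assumption): 68 rows, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(68, 10))
y = 80 + 5 * X[:, 0] + rng.normal(scale=3, size=68)

model = RandomForestRegressor(n_estimators=20, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Scores are negated RMSE, so "greater" means "better".
score, perm_scores, p_value = permutation_test_score(
    model, X, y,
    cv=cv,
    n_permutations=100,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)

# Reproduce the p-value from its definition.
p_manual = (np.sum(perm_scores >= score) + 1) / (len(perm_scores) + 1)

print(f"RMSE on real labels: {-score:.4f}")
print(f"Mean RMSE on shuffled labels: {-perm_scores.mean():.4f}")
print(f"p-value: {p_value:.4f}")
```

With 100 permutations the smallest attainable p-value is 1/101, so results from this design cannot distinguish effects below roughly 0.0099.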
Best model training
Following the identification of the best-performing model (RF) based on the CV results and the statistical significance from permutation tests, this model was trained on the complete training dataset (X_train_processed_final and y_train). This step ensures that the final model leverages all available training data to maximize its predictive power before further interpretability analysis or potential deployment.
Model interpretability
To gain insights into how the best model makes its predictions, SHAP values were computed and visualized. A shap.TreeExplainer was specifically used for the tree-based RF model, as it is an efficient method for this class of models to calculate these values. The SHAP summary plot provided a comprehensive view of feature importance, illustrating not only which features were most influential but also the direction and magnitude of their impact on the model's output for individual predictions. This analysis directly contributed to understanding the key factors influencing academic outcomes.
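To illustrate what a SHAP value represents, the sketch below computes exact Shapley attributions for one instance by enumerating every feature subset. It is a didactic stand-in and not the algorithm used by shap.TreeExplainer. The `shapley_values` helper, the toy predictor and its coefficients are all hypothetical, and absent features are simply set to a baseline value.

```python
from itertools import combinations
from math import factorial

import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one instance via subset enumeration."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = baseline.copy()
                for j in subset:
                    without_i[j] = x[j]
                with_i = without_i.copy()
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Hypothetical toy predictor of a later-quarter score (not the study's RF).
def predict(v):
    q1_score, study_hours, autonomy = v
    return 20 + 0.7 * q1_score + 1.5 * study_hours + 2.0 * autonomy * (study_hours > 2)

x = np.array([85.0, 4.0, 1.0])         # student being explained
baseline = np.array([80.0, 2.0, 0.0])  # reference student

phi = shapley_values(predict, x, baseline)
print(dict(zip(["q1_score", "study_hours", "autonomy"], phi.round(4))))
```

The attributions sum to the difference between the prediction for the student and the prediction for the baseline, and the interaction between study time and autonomy is split evenly between those two features.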
Learning curve analysis
Finally, an LC was generated for the best model to assess its performance as a function of the training set size. This plot helped diagnose potential issues such as overfitting or underfitting by showing the trend of both training and CV RMSE as more data was introduced. The LC provided insights into the model's generalization capabilities and whether acquiring more data would likely improve its performance, directly linking back to the objective of understanding model effectiveness and potential for improvement.
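A learning curve of this kind can be computed with scikit-learn's `learning_curve`, as in the sketch below. The data are synthetic, and the training-size grid, fold count and seeds are assumptions rather than the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, learning_curve

# Synthetic stand-in data (assumption): 68 rows, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(68, 10))
y = 80 + 5 * X[:, 0] + rng.normal(scale=3, size=68)

train_sizes, train_scores, cv_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)

# Average over folds and flip the sign back to RMSE.
train_rmse = -train_scores.mean(axis=1)
cv_rmse = -cv_scores.mean(axis=1)
gap = cv_rmse - train_rmse

for n, tr, cvr, g in zip(train_sizes, train_rmse, cv_rmse, gap):
    print(f"n={n:2d}  train RMSE={tr:.3f}  CV RMSE={cvr:.3f}  gap={g:.3f}")
```

A gap that stays large at the full training size points to high variance, while a CV curve that is still falling suggests that more data would help.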
Ethical considerations
It is crucial to acknowledge the ethical implications when developing predictive models for academic outcomes. Potential biases in the data, fairness of predictions across different demographic groups, and responsible use of model insights are important considerations. Steps were taken to ensure transparency (e.g. SHAP analysis) and robustness of evaluation (e.g. repeated CV, permutation tests), but further investigation into bias detection and mitigation would be essential for any real-world application. Participation in this study was entirely voluntary. To protect the identity of the minor participants (Grade 9 and Grade 11 students), the school name and individual student identities were anonymized. Ethical clearance was obtained through Institutional Approval from the school administration. An informed consent letter was provided to all participants via the digital survey platform, which explained the study's purpose and the students' right to withdraw at any time. For this school-based evaluation, the administration provided oversight for the protection of student welfare.
Results
This section presents the empirical findings from the comprehensive evaluation of various ML models designed to predict academic outcomes. In alignment with the primary objectives of assessing model effectiveness and identifying key influencing factors, the results are meticulously detailed, covering model performance, statistical significance, feature importance, and generalization capabilities.
Model performance evaluation
Across eight distinct regression models, performance was quantified using 5× repeated 5-fold CV, with the RMSE serving as the primary metric. The mean RMSE and its standard deviation (SD) for each model are summarized in Table 1 and plotted in Figure 1 (“Model Comparison: Mean RMSE with SD Error Bars”).
Table 1. Model performance summary
| Model | Mean RMSE | Std Dev RMSE |
|---|---|---|
| RF | 5.2138 | 1.5280 |
| K-Nearest Neighbors (KNN) | 5.3572 | 1.5647 |
| LGBM | 5.5326 | 1.6366 |
| GB | 5.5822 | 1.7067 |
| XGB | 6.0144 | 1.7770 |
| DT | 7.3711 | 1.9404 |
| DNN (MLPRegressor) | 7.7331 | 2.1583 |
| LR | 8.7599 | 2.1287 |
Figure 1. Model comparison: mean RMSE with SD error bars. Horizontal axis: model (Random Forest, KNN, LightGBM, Gradient Boosting, XGBoost, Decision Tree, Deep Neural Network, Linear Regression). Vertical axis: mean RMSE (0 to 10). Each bar shows the mean RMSE for one model, with a vertical error bar. Source: Authors’ own work
Best Performing Model: The RF model achieved the lowest Mean RMSE of 5.2138 (SD: 1.5280), making it the top-performing model.
Other Top Performers: KNN (Mean RMSE: 5.3572), LGBM (Mean RMSE: 5.5326) and GB (Mean RMSE: 5.5822) followed closely. The results for LGBM and GB highlight the effectiveness of ensemble tree-based methods.
Mid-Range Performer: XGB recorded a Mean RMSE of 6.0144.
Lower Performers: DT (Mean RMSE: 7.3711), DNN (Mean RMSE: 7.7331) and LR (Mean RMSE: 8.7599) generally exhibited higher RMSEs, suggesting they were less effective in capturing the underlying patterns or were more prone to issues like overfitting (DT, DNN) or underfitting (LR) for this dataset.
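The evaluation protocol behind these figures (5× repeated 5-fold CV scored by RMSE) can be sketched as below. This is a hedged illustration on synthetic data, assuming scikit-learn's `RepeatedKFold` and `cross_val_score`; the two models, their settings and the random seeds are placeholders, and the ranking on synthetic data need not match Table 1.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the study data (68 rows); NOT the real dataset.
X, y = make_regression(n_samples=68, n_features=10, noise=10.0, random_state=0)

# 5 folds repeated 5 times with different shuffles gives 25 RMSE estimates.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

# Two illustrative models; the study compared eight.
models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "LR": LinearRegression(),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse = -scores  # scikit-learn reports negated RMSE
    summary[name] = (rmse.mean(), rmse.std(), len(rmse))

for name, (mean, sd, n) in summary.items():
    print(f"{name}: mean RMSE={mean:.4f}, SD={sd:.4f} over {n} folds")
```

Reporting the SD across all 25 folds, as in Table 1, shows how much the error estimate depends on the particular split, which matters with only 68 observations.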
Statistical significance of model performance
To validate that the observed model performances were not merely due to chance, permutation tests were conducted with 100 permutations for each model. A p-value threshold of 0.05 was used to determine statistical significance.
Statistically Significant Models: RF (p-value = 0.0198) and GB (p-value = 0.0297) fell below the 0.05 threshold.
These results provide evidence that both the RF and GB models learned genuine, predictive patterns from the dataset. Other models, including KNN (p-value = 0.0594), XGB (p-value = 0.0594), LGBM (p-value = 0.1188) and DNN (p-value = 0.1287), did not meet the conventional threshold for statistical significance in this test. This suggests their predictive power is less reliably distinguishable from random performance, particularly given the small sample size.
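The reported p-values are consistent with the standard permutation estimator p = (C + 1) / (n + 1), where n is the number of permutations and C is the number of permuted runs that scored at least as well as the real model. scikit-learn's `permutation_test_score` uses this formula; that the authors used that function is an assumption. The sketch below inverts the formula for n = 100 to recover the implied counts.

```python
def permutation_p_value(n_as_good_or_better: int, n_permutations: int) -> float:
    """p-value estimator (C + 1) / (n + 1) for a permutation test."""
    return (n_as_good_or_better + 1) / (n_permutations + 1)


N_PERMUTATIONS = 100

# p-values as reported in the text.
reported = {
    "RF": 0.0198,
    "GB": 0.0297,
    "KNN": 0.0594,
    "XGB": 0.0594,
    "LGBM": 0.1188,
    "DNN": 0.1287,
}

# Invert the estimator to get the implied number of permuted runs that
# matched or beat the real model, then check the round trip.
implied_counts = {}
for model, p in reported.items():
    count = round(p * (N_PERMUTATIONS + 1) - 1)
    implied_counts[model] = count
    recomputed = round(permutation_p_value(count, N_PERMUTATIONS), 4)
    print(f"{model}: p={p:.4f} -> {count} of {N_PERMUTATIONS} permutations "
          f"(recomputed p={recomputed:.4f})")
```

Under this reading, one permuted run in 100 matched or beat RF and two matched or beat GB, while five did so for KNN and XGB. With 100 permutations the smallest attainable p-value is 1/101 ≈ 0.0099, so the test cannot resolve differences much finer than 0.01.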
Model generalization (LC)
The LC for the RF model was generated to assess its generalization capabilities across varying training data sizes. This plot displayed the training RMSE and CV RMSE as a function of the number of training samples.
Bias-Variance Trade-off: The LC shown in Figure 2 indicates that the model's error is dominated by variance. Training RMSE stayed low across all training set sizes (roughly 1.2 to 2.0), while CV RMSE remained between roughly 5.1 and 5.6 and fell only slightly, from about 5.5 at the smallest training size to about 5.2 at the largest. The curves did not converge, and the persistent gap between them implies that the model fits the training data far more closely than unseen data. A larger dataset or additional hyperparameter refinement aimed at reducing the variance component would likely improve performance.
Figure 2. Learning curve for random forest. Horizontal axis: training set size (5 to 40). Vertical axis: RMSE (1 to 7). Approximate values read from the plot, as (training set size, RMSE): training RMSE (3.5, 1.78), (14, 1.22), (23, 1.95), (33, 1.75), (43, 1.92); cross-validation RMSE (3.5, 5.48), (14, 5.32), (23, 5.62), (33, 5.10), (43, 5.22). Shaded bands appear around both lines. Source: Authors’ own work
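As a rough check on the bias-variance reading, the gap between the two curves can be computed from the approximate points given in the description of Figure 2. The snippet below is illustrative arithmetic on those approximate values, not a re-run of the study's model.

```python
# Approximate (training set size, RMSE) points read from Figure 2.
sizes = [3.5, 14, 23, 33, 43]
train_rmse = [1.78, 1.22, 1.95, 1.75, 1.92]
cv_rmse = [5.48, 5.32, 5.62, 5.10, 5.22]

# Generalization gap at each training size: CV error minus training error.
gaps = [round(cv - tr, 2) for cv, tr in zip(cv_rmse, train_rmse)]

for n, gap in zip(sizes, gaps):
    print(f"training size ~{n}: gap = {gap:.2f} RMSE points")

# Overall change in CV RMSE from the smallest to the largest training size.
cv_change = round(cv_rmse[-1] - cv_rmse[0], 2)
print(f"CV RMSE change across the curve: {cv_change:+.2f}")
```

On these values the gap never drops below about 3.3 RMSE points, and CV RMSE improves by only about 0.26 across the whole range. That agrees with the high-variance diagnosis given in the Limitations.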
Feature importance (SHAP analysis)
The SHAP summary plot (Figure 3) illustrates the impact of each feature on the model's predictions for the 68 participants. The results show that first-quarter score, age and second-quarter score are the strongest predictors of academic outcomes. The top five influential features are defined in Table 2. Study time also contributes, at a lower rank in the plot.
Figure 3. SHAP summary plot for random forest. Horizontal axis: SHAP value (impact on model output), 0.00 to 1.00. Color bar: feature value, low to high. Features in descending order of importance: num__score 1st quarter, num__Age, num__score 2nd quarter, cat__Autonomy Importance_5, cat__Recommendation Likelihood_4, num__Least Helpful Features_Gamification, cat__Motivational Features Importance_5, remainder__Most Helpful Features_Personalization, cat__Gender_2, num__Least Helpful Features_Progress Tracking, remainder__Motivational Needs_Personalization, num__Study Time, cat__Gender_1, cat__Relatedness_4, cat__Effectiveness_5, cat__AI App Usage_2, remainder__Motivational Needs_Goals, remainder__Most Helpful Features_Progress Tracking, num__Least Helpful Features_Feedback, cat__Effectiveness_2. Source: Authors’ own work
Table 2. Top five influential features
| Feature name | Human-readable definition | Data source |
|---|---|---|
| first quarter score | The student's initial academic grade prior to using the AI tool | Academic record |
| Age | The chronological age of the student at the time of the study | Demographics |
| second quarter score | The student's initial academic grade prior to using the AI tool | Academic record |
| Autonomy importance | A measure of how much a student values the ability to choose their own learning path (SDT construct) | Survey |
| Recommendation likelihood | The student's stated willingness to recommend the AI tool to peers (Engagement metric) | Survey |
Additionally, several motivational constructs from the SDT survey ranked within the top ten most influential features. Specifically, autonomy importance and relatedness showed a clear influence on the model's output. The plot confirms that higher levels of prior academic achievement and perceived autonomy generally lead to higher predicted scores, while the influences of age and study time are more varied across the student group.
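To make concrete what a SHAP value measures, the sketch below computes exact Shapley values by brute force for a toy three-feature scoring function. The function, its coefficients and the example student are hypothetical. The study presumably used the `shap` library on the fitted RF rather than this enumeration, which is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial

import numpy as np


def toy_predict(z: np.ndarray) -> float:
    """Hypothetical score model: [first-quarter score, age, autonomy flag]."""
    q1, age, autonomy = z
    return 0.8 * q1 + 0.3 * age + 2.0 * autonomy + 0.05 * q1 * autonomy


def shapley_values(predict, x: np.ndarray, baseline: np.ndarray) -> np.ndarray:
    """Exact Shapley values of x relative to a single baseline point."""
    n = len(x)

    def value(subset) -> float:
        # Features in `subset` take the student's values; the rest stay at baseline.
        z = baseline.copy()
        for j in subset:
            z[j] = x[j]
        return predict(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


student = np.array([90.0, 15.0, 1.0])   # hypothetical student
baseline = np.array([80.0, 15.0, 0.0])  # hypothetical reference student

phi = shapley_values(toy_predict, student, baseline)
for name, contribution in zip(["q1 score", "age", "autonomy"], phi):
    print(f"{name}: {contribution:+.2f}")
print("sum of contributions:", round(phi.sum(), 2))
print("prediction difference:", round(toy_predict(student) - toy_predict(baseline), 2))
```

The contributions sum exactly to the difference between the student's prediction and the baseline prediction. A SHAP summary plot ranks features by the size of such contributions, aggregated over all students.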
Discussion
This study aimed to rigorously evaluate the effectiveness of various ML models in predicting academic outcomes and to identify the key factors significantly influencing these predictions. Through a comprehensive evaluation framework involving repeated CV, permutation testing and interpretability analysis, we sought to provide robust insights into the predictive landscape of academic success.
Summary of key findings
The evaluation of eight distinct regression models revealed that the RF model demonstrated the best predictive performance, achieving the lowest Mean RMSE of 5.2138. Both RF and GB models exhibited statistically significant performance, suggesting their ability to capture genuine patterns in the data beyond mere chance. SHAP analysis on the best-performing RF model identified first-quarter score, age and second-quarter score as the most influential features in predicting academic outcomes, with study time and several SDT-related factors also contributing. The LC showed a persistent gap between training and CV RMSE, indicating potential for further performance gains with a larger dataset.
Interpretation of results
Model performance
The efficacy of ensemble tree-based methods. Our findings underscore the efficacy of ensemble tree-based models, particularly RF and GB, in predicting academic outcomes. The RF model's leading performance (Mean RMSE: 5.2138, SD: 1.5280) is attributable to its ability to aggregate predictions from multiple DTs, which reduces overfitting and improves generalization. This ensemble approach can capture complex, non-linear relationships within the data, which simpler models like LR (Mean RMSE: 8.7599) struggled with. LR's high RMSE suggests that the relationship between the input features and academic outcomes is not well captured by a linear model, supporting the use of more flexible models. KNN, a non-ensemble method, ranked second (Mean RMSE: 5.3572). LGBM and XGB, both GB frameworks, performed competitively (Mean RMSE: 5.5326 and 6.0144, respectively), further supporting the suitability of tree-based ensembles for this prediction task. The DNN (MLPRegressor), despite increased max_iter settings, yielded a higher Mean RMSE (7.7331) than the top ensemble models. There are three possible explanations: for datasets of this size and complexity, well-tuned ensemble methods can outperform a basic DNN architecture; the DNN's architecture and hyperparameters were not optimally configured for this problem; or the DNN requires significantly more data to realize its full potential.
Statistical significance
Beyond chance performance. The permutation tests provided statistical validation, confirming that the predictive capabilities of RF (p-value: 0.0198) and GB (p-value: 0.0297) were statistically significant (p < 0.05). This finding provides confidence that these models learned meaningful patterns from the underlying data structure rather than merely reflecting random correlations. The non-significant p-values for other models (e.g. KNN, XGB, LGBM, DNN) do not necessarily imply they lack predictive power. Rather, within the context of 100 permutations and given the dataset's characteristics, their performance was less reliably distinguishable from random chance. This highlights the importance of statistical validation, especially in small datasets where apparent performance differences might be spurious.
Learning curves and generalizability
The LC for the RF model indicated a healthy learning process, with both training and CV RMSE decreasing as data size increased. However, a persistent gap between training and validation error remained. This suggests a balanced but delicate bias-variance trade-off; the model is not severely overfitting, but the small sample size (n = 68) prevents the curves from fully converging, indicating potential for further stability with additional data.
Model generalization
The LC for the RF model provided a diagnostic tool for understanding its generalization capabilities. The CV RMSE fell slightly as the training data size increased, which indicates that the model benefits from additional data. However, the training and CV curves did not converge. The persistent gap between them points to a variance component and suggests that the model has not yet reached its optimal performance frontier. Incorporating more data could narrow this gap, leading to stronger generalization and reduced prediction error.
SHAP analysis
The SHAP summary plot for the RF model provides a high-fidelity map of the variables driving academic outcomes. By analyzing the top 10 features, we move beyond “black-box” predictions to understand the specific mechanisms of student success.
1. Academic and demographic foundations
num__score first quarter
This emerged as the most powerful predictor. The high SHAP values associated with higher first-quarter scores suggest a “momentum effect,” where early academic mastery creates a stable trajectory for the final grade. This reinforces the importance of foundational knowledge.
num__Age
Age significantly influenced the model. In an educational context, this likely reflects levels of cognitive maturity or life experience, which can dictate a student's ability to self-regulate and manage complex academic tasks.
num__score second quarter
Similar to the first quarter, the second-quarter performance serves as a secondary validation of the student's academic standing, allowing the model to refine its prediction as the academic year progresses.
2. Digital engagement and AI-app utility
cat__Recommendation Likelihood_4
As a critical proxy for engagement, students who indicated a high likelihood of recommending the AI-driven mobile app showed positive SHAP contributions. This suggests that “advocacy” for the tool is linked to successful learning outcomes; students who find the tool effective enough to recommend are likely those leveraging it most efficiently.
num__Least helpful Features_Gamification
This feature suggests that the model identified a segment of students for whom gamified elements were not helpful. For these students, the presence of gamification may have been a distraction rather than a motivator, negatively impacting their predicted score.
num__Least helpful Features_Progress tracking
Interestingly, the perception of progress tracking as “least helpful” also drove model adjustments. This implies that if a student does not value or understand their progress data, the AI-driven intervention loses its effectiveness.
3. Motivational and SDT-based factors
cat__Autonomy Importance_5
Aligning with SDT, the feature indicating that a student ranked “Autonomy” at the highest level (5) had a notable impact on the model. This suggests that students who value ownership over their learning process are more likely to achieve higher outcomes when using self-paced AI tools.
cat__Motivational features Importance_5
The high importance of this category suggests that the model successfully captured the “motivational profile” of the student. Students who are highly sensitive to specific motivational triggers in the app see a corresponding boost in their predicted performance.
remainder__Most helpful Features_Personalization
The model identified that students who found “Personalization” helpful were more likely to succeed. This is consistent with the core mission of AI-based education: the ability to tailor content to individual needs is a measurable driver of academic success.
cat__Gender_2
The inclusion of gender in the top 10 suggests that there may be nuanced differences in how different demographics interact with the AI-learning environment or respond to specific motivational triggers.
Implications
While the RF model and SHAP analysis provide valuable insights, it is important to treat these findings as exploratory and preliminary due to the study's specific context and sample size. Nevertheless, this investigation offers a promising framework for how school leadership can move toward a more proactive, data-informed ecosystem.
Targeted interventions and resource allocation
The identification of “score first quarter” and “study time” as primary drivers suggests that academic support should be front-loaded. Rather than waiting for mid-year failures, school leaders can use early quarter data as a “screening tool” to identify students who may require additional mentorship. However, given the exploratory nature of this study, such interventions should be implemented as pilot programs to further validate these predictors in real-world settings.
Connecting to SDT motivation and digital engagement
The SHAP analysis provides a unique window into the psychological drivers of success, aligning with SDT:
Autonomy as a leading indicator
The influence of “Autonomy Importance” suggests that students who feel empowered to control their learning path within the AI app tend to perform better. For leadership, this implies that digital tools are most effective when they support student agency rather than rigid, automated instruction.
App advocacy and engagement
The predictive weight of “Recommendation Likelihood” suggests that a student's subjective satisfaction with the AI mobile app is a significant proxy for their academic engagement. Administrators can monitor app “Net Promoter” scores as an early-warning metric; a drop in satisfaction may be a precursor to academic decline.
Limitations
Despite the robust methodology and the statistical significance achieved by the RF model, several critical limitations must be acknowledged:
Sample size and generalizability
The primary limitation is the small sample size (n = 68). While CV and permutation tests were used to ensure model stability, the small N means that the results are highly sensitive to the specific characteristics of this cohort. Consequently, these findings cannot be generalized to broader or more diverse student populations without further large-scale validation.
High variance and overfitting
As evidenced by the LC, a persistent gap remains between the training and CV RMSE. This indicates a high degree of variance, where the model is still prone to overfitting the noise within the small dataset. The model's predictive accuracy might fluctuate significantly if applied to a different academic environment.
Exploratory nature of policy claims
The practical implications discussed, such as using specific SHAP features to drive curriculum changes, should be viewed as hypotheses for future research rather than definitive institutional mandates. The synthetic nature of some data labels and the limited demographic range further necessitate a cautious approach to applying these results to broad educational policy.
Constraints of deep learning
The relative underperformance of the DNN (MLPRegressor) further highlights the difficulty of applying complex, “data-hungry” architectures to small-scale educational datasets.
Conclusion
This study evaluated the efficacy of various ML architectures in predicting academic outcomes within a framework of data-driven decision making. By comparing eight different models, it was determined that ensemble tree-based methods, specifically RF, offer the most robust predictive performance for this task, significantly outperforming linear models and DNNs in a small-sample context.
Beyond raw prediction, the integration of SHAP analysis provided a transparent view into the factors driving student success. The findings suggest that while historical academic performance (first-quarter scores) remains a primary predictor, “soft” factors rooted in SDT, such as autonomy importance and relatedness, play a notable role. Furthermore, student advocacy for the AI-driven mobile app (measured via Recommendation Likelihood) emerged as a useful proxy for engagement, indicating that a student's subjective experience with digital tools is linked to their objective academic achievement.
However, given the exploratory nature of this research and the small sample size (n = 68), these results should be interpreted as preliminary. The LCs highlight a persistent generalization gap, suggesting that while the models show great promise for school leadership, they are currently sensitive to the specific characteristics of the study cohort. Ultimately, this work serves as a “proof of concept” for how AI can move educational leadership from reactive management to proactive, motivation-aware support systems.
Recommendations
Based on the exploratory findings of this study, the following recommendations are offered for school leadership and future academic research:
For school leadership
Prioritize Early-Quarter Screening: Administrators should utilize first-quarter academic data as a primary trigger for student support interventions, as it remains the most influential factor in predicting final outcomes.
Monitor Digital Advocacy: School leaders should treat “Recommendation Likelihood” of AI tools as a leading indicator of student success. A drop in app satisfaction or perceived “Effectiveness” should be viewed by leadership as an early warning signal for potential academic disengagement.
Cultivate Autonomy-Supportive Environments: Given the high SHAP importance of Autonomy, pedagogical strategies and AI tools should be designed to offer students choices and ownership over their learning paths, rather than relying on rigid, top-down instruction.
Implement Pilot Programs: Rather than broad institutional mandates, leaders should use these AI insights to guide small-scale pilot interventions to further validate the predictive features in a live classroom setting.
For future research
Expansion of Dataset Scale: Future studies must prioritize larger, more diverse datasets to bridge the “generalization gap” observed in the LCs and to allow more complex models, such as DNNs, to reach their full potential.
Refinement of SDT Metrics: Researchers should move from synthetic proxies to validated psychometric scales to measure autonomy, competence and relatedness more precisely within the ML pipeline.
Longitudinal Validation: Investigations should track the stability of these predictors across multiple academic years to determine if the identified influential features (like the first-quarter score) remain consistent over time.
Comparison of Regularized Models: Further research is needed to determine if simpler, regularized models (e.g. Lasso or Ridge) can provide more stable performance on small-sample educational data compared to high-variance ensemble methods.
Declaration of generative AI and AI-assisted technologies in the writing and technical process
During the preparation of this work, the authors utilized AI-assisted technologies to support both the manuscript development and the technical modeling phases. In the writing process, AI was used to identify appropriate synonyms, clarify nuances between specialized terminology and assist in paraphrasing complex sections to improve readability. In the technical modeling phase, AI tools provided suggestions for evaluating ML architectures and interpreting model stability through diagnostic curves. Following the use of these tools, the authors rigorously reviewed and edited all generated content and data interpretations to ensure accuracy. The authors take full responsibility for the final content and integrity of the publication.

