This paper aims to introduce a neuro-symbolic affect-aware learning agent designed to optimize learner engagement and knowledge retention in virtual learning environments (VLEs).
The proposed system integrates deep neural networks for multimodal emotion recognition (facial, textual and auditory inputs) with a rule-based symbolic reasoning engine that adapts instructional delivery based on detected affective states. Emotion detection was achieved using a hybrid pipeline comprising a ResNet-50 model (trained on AffectNet for facial cues), fine-tuned BERT (on GoEmotions for textual cues) and wav2vec2.0 (on IEMOCAP for speech signals). To evaluate pedagogical effectiveness, a controlled experiment was conducted with 80 participants divided into three groups: a control group, a neural-only agent group and the proposed neuro-symbolic agent group. Learner engagement was quantified using the User Engagement Scale (UES), and learning outcomes were measured using normalized pre-test/post-test gain scores.
Results indicate that the neuro-symbolic agent outperformed the baseline by 16.8% in engagement and 21.3% in learning gain, demonstrating the benefits of emotionally adaptive and context-aware instruction.
The study was conducted with a limited sample size (80 participants) and focused on short-term engagement and learning outcomes. Further research is required to assess long-term effectiveness and generalizability across diverse educational contexts.
The proposed framework highlights the potential of affect-aware, neuro-symbolic systems to enhance learner engagement, promote self-regulated learning and support personalized instruction in VLEs, contributing to more empathetic and human-centered digital education.
This work presents a novel integration of multimodal emotion recognition with symbolic reasoning for real-time, pedagogically adaptive learning, offering a transparent, interpretable and emotionally responsive approach to VLEs.
1. Introduction
Virtual learning environments (VLEs) have become central to modern education, offering learners and instructors benefits such as flexibility, accessibility and scalable content delivery (Johnson et al., 2020). Their adoption has accelerated due to global disruptions like the COVID-19 pandemic, positioning VLEs as key platforms for both formal and informal learning (Karakose, 2021). From learning management systems to intelligent tutoring systems and massive open online courses (MOOCs), these platforms have made education available to learners across subjects and age groups like never before (Tilak and Kumar, 2022).
Despite these advances, most VLEs struggle to deliver truly personalized and emotionally responsive learning experiences (Berezi, 2025). Traditional systems typically rely on static learner profiles tracking only demographics, activity history or performance metrics (Engelbrecht and Oates, 2022). While this allows for basic adaptation, such as adjusting content difficulty, it overlooks the affective dimension of learning, which strongly influences attention, motivation and retention (Lantz-Wagner, 2022). Research in educational psychology shows that emotions like frustration, boredom, curiosity or anxiety significantly shape engagement, persistence and problem-solving behavior (Ismail and Aljabr, 2025). Yet, most current VLEs remain affectively unaware, unable to detect learners’ emotional states or respond with pedagogically meaningful interventions (Evain et al., 2021).
Addressing this gap requires systems that perceive learners’ emotions in real time and adjust instruction dynamically. Such systems must go beyond surface-level profiling to understand the cognitive-affective interplay and intervene with strategies that are both pedagogically sound and emotionally intelligent. Integrating affective awareness into VLEs is therefore not merely a technical enhancement it is essential for richer engagement, reduced dropout and more human-centered learning (Filatro and Cavalcanti, 2024).
Recent advances in artificial intelligence, particularly affective computing and educational data mining, make this possible by decoding emotional cues from facial expressions, voice, physiological signals and text (Vistorte et al., 2024). However, most affect-sensitive systems rely heavily on deep learning perception alone, which, while effective at detecting emotions, often lacks transparency, reasoning capability and pedagogical soundness (Gil and Selman, 2019). This has sparked interest in neuro-symbolic AI, which combines the perceptual strengths of neural networks with the logical reasoning of symbolic systems (Liang et al., 2025).
This work introduces a decision-level neuro-symbolic fusion framework for affect-aware personalization in VLEs. In this framework, neural modules detect affective states such as frustration or engagement from multimodal inputs, while symbolic modules apply pedagogical rules and reasoning to determine the most appropriate instructional intervention. The system is designed to not only perceive learners’ emotions but also explain and justify the instructional decisions, creating a transparent and empathetic learning experience (Mileo, 2025).
The novelty of this approach is twofold. First, it introduces a confidence-aware fusion controller that dynamically regulates the interaction between neural predictions and symbolic reasoning, activating symbolic inference only when necessary to balance interpretability and robustness. Second, it establishes a direct mapping between learners’ affective states and instructional strategies, operationalizing principles from adaptive learning, motivation and engagement theory. Unlike traditional hybrid systems that statically combine components, this framework transforms raw emotional signals into actionable, pedagogically grounded interventions.
In summary, this research advances both conceptual understanding and practical implementation: it moves beyond system integration to demonstrate how neuro-symbolic reasoning can be applied for real-time, emotionally adaptive learning. By combining perception, reasoning and pedagogy, the proposed framework represents a step toward VLEs that are truly responsive, interpretable and learner-centered.
Despite progress in VLEs and affect-sensitive systems, existing platforms largely remain unable to perceive and respond to learners’ emotions in a transparent, pedagogically grounded way. This gap limits engagement, learning outcomes and the potential for truly personalized digital education. To address this, our work proposes a decision-level neuro-symbolic framework that integrates real-time affect detection with symbolic reasoning to select context-aware instructional strategies. This approach not only enhances learner engagement and personalization but also provides interpretable, theory-informed interventions, advancing both the conceptual understanding and practical implementation of emotionally adaptive learning systems.
2. Literature review
Integration of affective computing into educational technology has been gaining more and more attention in recent years due to its potential to improve learners’ motivation, engagement and well-being. With growing adoption of VLEs, researchers have emphasized that students require systems that are able to comprehend and engage with their emotions. Liu and Ardakani (2022) propose an affective e-learning system model assisted by machine learning that incorporates emotional data within learning content adaptation. The work reflects how emotion-aware systems can significantly enhance learning performance and student satisfaction using context-related feedback as well as interaction styles.
Machine learning techniques deep learning, specifically are the backbone of most affective computing applications. These models enable emotion recognition from various data sources, from physiological signals to text. Mai et al. (2021) developed a home-built EEG device to detect emotions and showed that neural signals were successfully classified through supervised learning methods. Similarly, Kratzwald et al. (2018) use deep learning architectures for text-based emotion recognition for decision support systems, revealing the potential of affect inference from language in intelligent systems. Their research reveals the ability of neural models to recognize implicit emotional cues embedded in learner communication.
In learning environments, affective computing extends beyond detection to the goal of enhancing human–computer interaction. Mutawa and Sruthi (2024) offer a machine learning model to predict learner emotion and satisfaction in online learning and show how adaptive responses to emotional difference can lead to more satisfactory learning. Shifa et al. (2025) emphasize the significance of emotionally intelligent AI systems in aiding student well-being, especially in emotionally sophisticated digital classroom environments. Their study points out that overlooking emotional signals in VLEs can lead to suboptimal participation and learning inefficiencies.
Yet, although these systems perform well in emotion detection, most use solely black-box neural frameworks that have limited explainability. The reason for this non-transparency lies in being a hindrance to educational implementation, wherein teachers and students might need to obtain intelligible explanations of system actions. Artanto and Arifin (2023) demonstrate how deep learning can detect gesture and emotional evaluations and recognize affective states but also admit that such systems tend to lack reasoning ability for explainable intervention.
The emerging neuro-symbolic AI research offers a promising solution to this challenge by merging the perceptual power of neural networks and explainable symbolic reasoning. Singh (2024) explores affective computing with neural networks and recommends hybrid systems that move beyond perception to pedagogically guided decision-making. Kumar (2023) further describes the use of neuro-symbolic approaches in adaptive mental health care, making analogies between cognitive modeling of psychiatry and affect-driven adaptation in learning technology. These remarks further affirm the importance of fusing cognitive science with computational logic for the enhancement of user-focused AI systems.
Lu et al. (2025) provide a concrete illustration of explainable neuro-symbolic integration in the health setting, where recommendations for diagnosis benefit from the neural accuracy as well as symbolic clarity. Although their work is in health care, the methodological extensions translate directly to education where explainable personalization from affective signals is equally crucial. This suggests that neuro-symbolic frameworks can potentially provide an actionable pathway toward the construction of intelligent tutoring systems that are affect-sensible and pedagogically understandable.
Recent research highlights the critical role of cognitive, motivational and affective processes in digital and AI-enhanced learning environments. Studies on self-regulated learning (SRL) show that learners’ ability to plan, monitor and adjust their learning strategies is enhanced when AI provides timely feedback and adaptive guidance (Wei, 2023; Lee et al., 2023). Meanwhile, motivation and engagement have been shown to influence persistence and performance in AI-supported learning, with factors such as self-efficacy, resilience and feedback accuracy shaping learners’ willingness to interact with digital systems (Shi and Zhang, 2025; Eisbach et al., 2023). From a cognitive load perspective, research indicates that AI systems must balance task complexity and learner support to avoid overload while promoting goal-setting and effective learning strategies (Zhang et al., 2026; Rind, 2026). Furthermore, the integration of emotional and social learning into AI-driven platforms enhances learners’ socio-emotional development, engagement and adaptability, particularly in primary education contexts (Ofori et al., 2024; Akintayo et al., 2024). Together, these studies emphasize that affect-aware, adaptive systems must simultaneously consider learners’ cognitive capacities, emotional states and motivational drivers to deliver truly personalized, effective learning experiences. This theoretical grounding directly informs the design of our neuro-symbolic framework, which integrates real-time affect detection with pedagogically grounded reasoning to dynamically adapt instructional strategies in VLEs.
Overall, the literature supports the viability of affective computing to enhance virtual learning but reveals two key limitations:
current systems often lack effective affective responsiveness tuned to learners’ needs; and
the black-box nature of deep learning models diminishes trust and pedagogical integrity.
Neuro-symbolic reasoning addresses both issues by intertwining affective perception with explainable, goal-directed adaptation. By dynamically mapping learners’ emotional states to pedagogically grounded interventions, the proposed framework enhances engagement, promotes self-regulated learning and improves learning outcomes, such as task persistence, knowledge retention and learner satisfaction. Moreover, ethical and practical considerations such as transparency, fairness and respect for learner privacy are central to the system design, ensuring that adaptive interventions are not only effective but also responsible and teacher-friendly. This work builds upon existing research by presenting a new neuro-symbolic approach to emotion-aware, real-time personalization within VLEs, with the goal of facilitating learner-focused, transparent and empathetic digital learning experiences.
3. Methodology
This section outlines the design and implementation of the proposed neuro-symbolic framework for affective-aware personalization in VLEs. The system architecture integrates neural emotion recognition and symbolic reasoning mechanisms to tailor instructional content to learners’ affective states. The methodology encompasses the neural subsystem for affect detection, a symbolic reasoning engine for pedagogical decision-making, the fusion mechanism that unifies both components and the evaluation strategy within a VLE simulation.
3.1 Data set description and preprocessing
The emotion detection module was trained using three benchmark multimodal data sets such as AffectNet (Mollahosseini et al., 2017), GoEmotions (Demszky et al., 2020) and IEMOCAP (Busso et al., 2008). These data sets were chosen for their coverage of visual, textual and auditory affective expressions, respectively.
AffectNet contains over 1 million facial images labeled across eight basic emotions. For this study, a subset of 100,000 samples was used. GoEmotions, developed by Google Research, provides 58,000 English Reddit comments annotated with 27 emotion categories. Multi-label instances were flattened by selecting the dominant label using confidence scores. IEMOCAP includes approximately 12 h of audio-visual data from 10 actors engaged in scripted and improvised scenarios, annotated across key emotions including happiness, anger, sadness and neutrality.
The data sets were partitioned into 70% training, 15% validation and 15% test splits, ensuring subject-level disjoint sets to prevent data leakage and overfitting. All data preprocessing was performed in compliance with data set licenses and ethical use standards. Also, a simulated symbolic knowledge base was designed to complement the neural emotion outputs with contextual reasoning. This rule base consisted of 142 expert-verified affective rules, curated by two domain experts in educational psychology and affective computing. Rules were encoded in a forward-chaining format. For example:
emotion(frustration),activity(difficulty_high) => suggestion(provide_hint)
emotion(boredom), activity(difficulty_low) => suggestion(increase_challenge)
emotion(confusion), repetition_count > 2 => suggestion(rephrase_instruction)
The symbolic rules were invoked during inference to augment the neural outputs with pedagogical intent. An adaptive fusion strategy was used to weigh symbolic and neural confidence distributions.
To ensure robustness across diverse learner profiles, we used a fairness-aware loss function, extending the standard cross-entropy loss with reweighing techniques inspired by Cui et al. (2019). Specifically, we used a class-balanced variant of Focal Loss defined as:
where is the number of emotion classes, is the predicted probability, is an inverse frequency weight to counter class imbalance, and is the focusing parameter.
3.2 Neuro-symbolic system architecture
The proposed architecture consists of four major components: an affective state recognition module, a learner model and symbolic knowledge base, a symbolic reasoning engine and a personalization module responsible for adaptive content delivery. As illustrated in Figure 1, the architecture accepts multimodal learner input (facial expression, textual interaction and vocal tone), processes this data through a neural emotion classification pipeline and passes the inferred emotional state to a rule-based reasoning engine. Based on the inference outcomes, the personalization module dynamically adjusts content modality, complexity or pacing.
The flowchart shows a system architecture starting with an input module that receives multimodal input including face, text and voice. These inputs are processed through neural emotion detection components including face using ResNet 50, text using BERT and voice using wav2vec 2.0. The outputs are combined through multimodal fusion to form an affective state vector. This is passed to a symbolic reasoning engine using rule-based inference in Prolog. The result is then processed by a personalisation engine for adaptive learning delivery. A learner feedback loop based on emotion and interaction provides real time input back into the system.Overall architecture of the proposed framework
The flowchart shows a system architecture starting with an input module that receives multimodal input including face, text and voice. These inputs are processed through neural emotion detection components including face using ResNet 50, text using BERT and voice using wav2vec 2.0. The outputs are combined through multimodal fusion to form an affective state vector. This is passed to a symbolic reasoning engine using rule-based inference in Prolog. The result is then processed by a personalisation engine for adaptive learning delivery. A learner feedback loop based on emotion and interaction provides real time input back into the system.Overall architecture of the proposed framework
The methodological design of this framework is grounded in established educational theories. Constructs such as learner engagement and learning gain are interpreted through the lens of self-regulated learning (SRL), motivation and engagement theory and cognitive load theory. Engagement metrics reflect sustained attention and interaction, while learning gains correspond to the effectiveness of scaffolded instructional adaptation. The affective states detected by the neural subsystem guide interventions in alignment with these theories, ensuring that system decisions are pedagogically meaningful rather than purely data-driven.
3.2.1 Emotion detection module.
Emotion detection was achieved through a hybrid neural pipeline that processes facial, textual and auditory modalities. For facial expression analysis, the ResNet-50 convolutional neural network architecture was used, pre-trained on the AffectNet data set. Textual sentiment and emotional cues were processed using the Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned on the GoEmotions data set. For speech-based affect detection, wav2vec2.0 was adopted, leveraging self-supervised learning on raw audio waveforms.
Each modality output is represented as a probability distribution over discrete emotional states . These distributions are combined using a soft attention mechanism that computes a weighted affective state vector:
where is the emotion vector from modality and is the attention weight learned during training.
3.2.2 Symbolic reasoning module.
The symbolic reasoning engine is designed to emulate the logic of a pedagogical expert. A forward-chaining rule base, implemented in Prolog, applies declarative rules to interpret the dominant emotional state and determine the appropriate instructional intervention. The discretized emotion vector is converted into a symbolic fact via maximum-likelihood selection:
The resulting fact (e.g. emotion(frustrated)) is passed into the symbolic engine, which searches the knowledge base for applicable inference rules. A few illustrative rules are as follows:
emotion(frustrated):- suggest(help_video).
emotion(bored):- suggest(interactive_quiz).
emotion(engaged):- continue(current_path).
Let denote the set of observed emotional facts and the rule base. The symbolic system computes the pedagogical adaptation by applying forward chaining:
To ensure symbolic outputs remain consistent with underlying affective states, a fusion controller maps the continuous emotion vector into discrete triggers only when confidence is high or entropy is low. Otherwise, symbolic reasoning is suppressed in favor of direct neural intervention.
The symbolic knowledge base was constructed through a collaborative expert-driven process. Two domain experts in educational psychology and affective computing curated 142 pedagogically relevant rules linking learner affective states to instructional interventions. Each rule was verified for internal consistency and relevance to learning objectives. For example:
emotion(frustration), activity(difficulty_high) => suggestion(provide_hint)
emotion(boredom), activity(difficulty_low) => suggestion(increase_challenge)
emotion(confusion), repetition_count > 2 => suggestion(rephrase_instruction)
The rule base was made to go through iterative validation, including peer review by additional educators and simulation testing within the Moodle VLE to ensure that proposed adaptations aligned with pedagogical expectations and learner engagement principles. Rules were applied using forward chaining in the symbolic engine, and adaptations were triggered only when neural confidence and entropy thresholds indicated sufficient certainty.
3.2.3 Fusion mechanism.
The fusion between the neural and symbolic components is implemented as a decision-level, confidence-aware control mechanism, rather than a static integration strategy. Specifically, the system uses a fusion controller that dynamically determines whether symbolic reasoning should be activated based on the confidence distribution of the multimodal affective predictions. Confidence is estimated using the maximum predicted probability and entropy of the emotion vector. When the predicted affective state exhibits high confidence (i.e. low entropy), the dominant emotion is mapped into a symbolic fact and passed to the reasoning engine for pedagogical inference. Conversely, in cases of uncertainty, the system suppresses or attenuates symbolic reasoning to avoid unreliable rule activation, relying instead on direct neural adaptation. This design ensures that symbolic reasoning is invoked only when it is both meaningful and reliable, thereby improving interpretability without compromising robustness.
This mapping is achieved using a thresholding mechanism defined by:
Once the dominant emotion is inferred, it is encoded as a fact and passed into the symbolic engine for reasoning. This fusion approach maintains interpretability while leveraging neural precision. In cases of ambiguity, where predictions are uncertain, symbolic reasoning is either deferred or adjusted with fallback strategies.
This dynamic fusion strategy distinguishes the proposed framework from conventional neuro-symbolic systems, where neural and symbolic components are often combined in a static or sequential manner. By introducing conditional symbolic activation, the model achieves a more flexible and context-aware integration, enabling more effective translation of affective signals into pedagogically relevant actions.
3.2.4 Personalization engine.
The personalization engine applies symbolic recommendations in ways that are explicitly informed by learning theory. For example, interventions designed to reduce frustration or confusion aim to lower extraneous cognitive load and maintain motivation, consistent with cognitive load and self-regulated learning principles. Conversely, increasing challenge during boredom is intended to sustain engagement and promote deeper cognitive processing. These theory-grounded strategies provide a rationale for the mapping from affective states to instructional adaptations.
Let denote the current learning content and the recommended adaptation strategy. The transformed content is computed via:
where is a content transformation function parameterized by symbolic adaptation rules. For example, if the detected emotion is boredom, may replace static text explanations with an interactive animation or quiz to increase engagement.
The effectiveness of the proposed system was assessed through a controlled experiment involving 80 participants, stratified into three groups: a control group (n = 26) with no emotion-based adaptation, a neural-only group (n = 27) using only neural inferences and a neuro-symbolic group (n = 27) using the full hybrid framework. All participants engaged with identical content under the same VLE conditions. The only variable was the presence and nature of real-time instructional adaptation informed by affective signals.
3.3 Experimental setup
The system was evaluated using a controlled experimental design to ensure that differences in engagement and learning outcomes could be attributed to the adaptive interventions rather than extraneous factors. A total of 80 participants were recruited, stratified across educational backgrounds, age and technical familiarity to ensure diversity and minimize confounding variables. Participants were randomly assigned to one of three experimental conditions:
Control (no adaptation);
Neural-Only Adaptation; and
Neuro-Symbolic Adaptation.
Each participant completed all learning modules under their assigned condition in a single session within the Moodle-based VLE. While the study was short-term, this design allowed rigorous comparison aligned with the research questions linking affective adaptation → engagement → learning gain.
The Moodle-based VLE was structured into three instructional modules, comprising reading comprehension exercises, interactive quizzes and scenario-based simulations. Detailed interaction logs captured time-on-task, click activity and dropout events, providing quantitative measures of engagement. The neuro-symbolic agent dynamically adapted content, pacing and modality based on detected affective states, guided by the symbolic rule base validated by domain experts. The neural-only adaptation group received interventions solely from the neural models, while the control group experienced static, non-adaptive content.
Data for multimodal emotion detection was sourced from benchmark data sets (AffectNet, GoEmotions, IEMOCAP), while real-time learner interactions were collected within the VLE. The symbolic reasoning engine was implemented using SWI-Prolog, and neural modules were developed in PyTorch, leveraging HuggingFace transformers and torchvision libraries.
3.4 Evaluation metrics
Performance was evaluated along three dimensions such as accuracy of affective state prediction, learner engagement and academic performance. Emotion detection accuracy was measured using macro-averaged F1 score across emotional classes. Engagement was quantified through time-on-task, click-through rates and dropout incidence. Learning gain was measured as the normalized difference between pre-test and post-test scores, defined as:
where and are the learner’s pre- and post-test scores respectively.
Results from the neuro-symbolic configuration were compared against both the control and neural-only conditions. Statistical significance was assessed using repeated-measures analysis of variance (ANOVA) at .
4. Results and discussion
This section presents a comprehensive evaluation of the proposed neuro-symbolic framework, with results organized to assess the performance of each component emotion detection, symbolic reasoning and personalization as well as their collective effect on learner engagement and academic performance. Additional analysis includes interpretability, model calibration and ablation studies to quantify the contributions of key subsystems.
4.1 Performance of the emotion detection module
The emotion detection module was evaluated using a multimodal test set derived from the DEAP data set, supplemented with interaction logs from the experimental Moodle-based VLE. Performance metrics were calculated per emotion category using macro-averaged F1-scores, presented in Table 1 alongside class-wise support values to contextualize the results.
Emotion classification performance by modality (F1-score with support)
| Emotion label | Support | Text (BERT) | Audio (wav2vec2) | Face (ResNet50) | Multimodal fusion |
|---|---|---|---|---|---|
| Bored | 340 | 0.78 | 0.74 | 0.76 | 0.83 |
| Confused | 310 | 0.72 | 0.69 | 0.70 | 0.77 |
| Engaged | 380 | 0.81 | 0.76 | 0.79 | 0.85 |
| Frustrated | 295 | 0.73 | 0.70 | 0.71 | 0.79 |
| Happy | 400 | 0.85 | 0.80 | 0.83 | 0.88 |
| Neutral | 350 | 0.76 | 0.72 | 0.75 | 0.81 |
| Average | 0.77 | 0.73 | 0.76 | 0.82 |
| Emotion label | Support | Text ( | Audio (wav2vec2) | Face (ResNet50) | Multimodal fusion |
|---|---|---|---|---|---|
| Bored | 340 | 0.78 | 0.74 | 0.76 | 0.83 |
| Confused | 310 | 0.72 | 0.69 | 0.70 | 0.77 |
| Engaged | 380 | 0.81 | 0.76 | 0.79 | 0.85 |
| Frustrated | 295 | 0.73 | 0.70 | 0.71 | 0.79 |
| Happy | 400 | 0.85 | 0.80 | 0.83 | 0.88 |
| Neutral | 350 | 0.76 | 0.72 | 0.75 | 0.81 |
| Average | 0.77 | 0.73 | 0.76 | 0.82 |
Multimodal fusion consistently outperformed unimodal models across all categories, achieving a 5–6% improvement in average F1-score over the strongest individual modality (text). This highlights the efficacy of the attention-based fusion in amplifying relevant emotional cues and mitigating noisy signals. The classification distribution is visualized in Figure 2.
The chart shows F 1 score values for bored, confused, engaged, frustrated, happy and neutral across four methods including text BERT, audio wav2vec 2, face ResNet 50 and multimodal fusion. For bored, scores are about 0.78 for text, 0.74 for audio, 0.76 for face and 0.83 for fusion. For confused, values are about 0.72, 0.69, 0.70 and 0.77. For engaged, values are about 0.81, 0.76, 0.79 and 0.85. For frustrated, values are about 0.73, 0.70, 0.71 and 0.79. For happy, values are about 0.85, 0.80, 0.83 and 0.88. For neutral, values are about 0.76, 0.72, 0.75 and 0.81. Multimodal fusion shows the highest scores across all emotional states.Emotion classification by modality
The chart shows F 1 score values for bored, confused, engaged, frustrated, happy and neutral across four methods including text BERT, audio wav2vec 2, face ResNet 50 and multimodal fusion. For bored, scores are about 0.78 for text, 0.74 for audio, 0.76 for face and 0.83 for fusion. For confused, values are about 0.72, 0.69, 0.70 and 0.77. For engaged, values are about 0.81, 0.76, 0.79 and 0.85. For frustrated, values are about 0.73, 0.70, 0.71 and 0.79. For happy, values are about 0.85, 0.80, 0.83 and 0.88. For neutral, values are about 0.76, 0.72, 0.75 and 0.81. Multimodal fusion shows the highest scores across all emotional states.Emotion classification by modality
4.2 Symbolic reasoning and personalization outcomes
The symbolic reasoning engine achieved 97.4% alignment with expert-defined emotion-action rules across 500 test cases. To assess the impact of personalization strategies on learner experience, a user study was conducted involving three experimental groups. Learner engagement metrics were derived from interaction logs and normalized for session duration. Table 2 reports the average values, standard deviations (SD) and p-values from ANOVA comparisons.
Learner engagement and retention metrics (mean ± SD)
| Group | Time-on-task (min) | Click activity (/session) | Dropout rate (%) |
|---|---|---|---|
| Control | 19.4 ± 3.1 | 42.6 ± 5.4 | 21.8 |
| Neural-Only adaptation | 24.8 ± 4.5 (p = 0.012) | 58.1 ± 6.2 (p = 0.008) | 13.2 |
| Neuro-Symbolic | 31.6 ± 4.7 (p < 0.001) | 71.3 ± 7.3 (p < 0.001) | 5.5 |
| Group | Time-on-task (min) | Click activity (/session) | Dropout rate (%) |
|---|---|---|---|
| Control | 19.4 ± 3.1 | 42.6 ± 5.4 | 21.8 |
| Neural-Only adaptation | 24.8 ± 4.5 (p = 0.012) | 58.1 ± 6.2 (p = 0.008) | 13.2 |
| Neuro-Symbolic | 31.6 ± 4.7 (p < 0.001) | 71.3 ± 7.3 (p < 0.001) | 5.5 |
The neuro-symbolic group exhibited significantly higher engagement and reduced dropout rates. ANOVA results yielded F(2,77) = 8.74, p < 0.001 and a partial eta-squared of 0.34, indicating a large effect size. These results are visualized in Figure 3.
The chart shows average time on task in minutes and dropout rate in per cent for control, neural only and neuro symbolic groups. Time on task increases from about 19 minutes for control to about 25 minutes for neural only and about 32 minutes for neuro symbolic. Dropout rate decreases from about 22 per cent for control to about 14 per cent for neural only and about 5 per cent for neuro symbolic. The results show increasing engagement with decreasing dropout across the groups.Engagement and retention metrics
The chart shows average time on task in minutes and dropout rate in per cent for control, neural only and neuro symbolic groups. Time on task increases from about 19 minutes for control to about 25 minutes for neural only and about 32 minutes for neuro symbolic. Dropout rate decreases from about 22 per cent for control to about 14 per cent for neural only and about 5 per cent for neuro symbolic. The results show increasing engagement with decreasing dropout across the groups.Engagement and retention metrics
Beyond the observed numerical improvements, these findings can be interpreted through the lens of self-regulated learning and motivation theory. Increased time-on-task and click activity suggest heightened learner autonomy and sustained attention, likely facilitated by affect-aware adaptations that respond to frustration, boredom or confusion. Reduced dropout rates indicate that the system effectively maintained motivation and engagement by dynamically adjusting instructional content, supporting the theoretical premise that emotional scaffolding enhances learning persistence.
4.3 Learning gain and academic performance
Normalized learning gain was calculated as the difference between pre- and post-test scores, adjusted for maximum possible improvement. Table 3 includes the means, standard deviations and p-values.
Learning gains across experimental groups
| Group | Pre-Test score (mean ± SD) | Post-Test score (mean ± SD) | Normalized gain (%) | p-value |
|---|---|---|---|---|
| Control | 0.42 ± 0.09 | 0.61 ± 0.11 | 33.0 | |
| Neural-Only adaptation | 0.44 ± 0.08 | 0.70 ± 0.10 | 46.0 | 0.019 |
| Neuro-Symbolic | 0.43 ± 0.07 | 0.79 ± 0.09 | 63.0 | <0.001 |
| Group | Pre-Test score (mean ± SD) | Post-Test score (mean ± SD) | Normalized gain (%) | p-value |
|---|---|---|---|---|
| Control | 0.42 ± 0.09 | 0.61 ± 0.11 | 33.0 | |
| Neural-Only adaptation | 0.44 ± 0.08 | 0.70 ± 0.10 | 46.0 | 0.019 |
| Neuro-Symbolic | 0.43 ± 0.07 | 0.79 ± 0.09 | 63.0 | <0.001 |
The neuro-symbolic group significantly outperformed the others. ANOVA confirmed this with F(2,77) = 12.3, p < 0.001, η2 = 0.39. These gains validate the hypothesis that affect-aware, rule-informed personalization improves learning outcomes.
The significant improvements in learning gain observed for the neuro-symbolic group can be understood in terms of adaptive scaffolding and cognitive load management. By tailoring instructional interventions to the learner’s emotional state, the system reduces extraneous cognitive load during frustration or confusion and provides optimal challenge during low-engagement periods. This supports deeper cognitive processing, consistent with principles of self-regulated learning and effective instructional design. These theory-driven interpretations explain why the neuro-symbolic framework outperforms neural-only and control conditions, beyond purely technical superiority.
4.4 Interpretability and SHAP-based explanations
Qualitative feedback indicated that learners found the neuro-symbolic agent more transparent and responsive. Participants used phrases like “it adapted to how I felt” and “the switch to video helped when I was confused.” To visualize interpretability, SHAP (SHapley Additive exPlanations) analysis was applied to the BERT model. Figure 4 shows top contributing words for emotion predictions. High-weight tokens like “confused”, “stuck” and “amazing” aligned with expected emotional categories. If the figure is not included, this section should be rephrased or removed accordingly.
The chart shows SHAP value impact for features including lost, interesting, bored, help, amazing and confused. The confused feature has the highest value at about 0.35. Amazing follows at about 0.28. Help shows a value near 0.22. Bored has a value around 0.18. Interesting shows about 0.17, and Lost has the lowest value near 0.16. The values indicate the relative contribution of each feature to the model output.SHAP feature importance for Bert emotion classifier
The chart shows SHAP value impact for features including lost, interesting, bored, help, amazing and confused. The confused feature has the highest value at about 0.35. Amazing follows at about 0.28. Help shows a value near 0.22. Bored has a value around 0.18. Interesting shows about 0.17, and Lost has the lowest value near 0.16. The values indicate the relative contribution of each feature to the model output.SHAP feature importance for Bert emotion classifier
4.5 Model training dynamics and confidence calibration
The neural modules (BERT, ResNet50, wav2vec2.0) converged smoothly with early stopping triggered between six and eight epochs. Figure 5 shows the training vs validation loss for BERT, with no signs of overfitting.
The plot shows training loss and validation loss across ten epochs. Training loss decreases steadily from about 0.52 at epoch 1 to around 0.20 at epoch 6, after which it stabilises slightly above 0.20. Validation loss decreases from about 0.58 at epoch 1 to around 0.25 at epoch 6, then increases slightly and plateaus near 0.27 from epochs 7 to 10. The trend indicates effective learning in early epochs, followed by mild overfitting as validation loss stops improving while training loss continues to stabilise.Training and validation loss for BERT emotion classifier
The plot shows training loss and validation loss across ten epochs. Training loss decreases steadily from about 0.52 at epoch 1 to around 0.20 at epoch 6, after which it stabilises slightly above 0.20. Validation loss decreases from about 0.58 at epoch 1 to around 0.25 at epoch 6, then increases slightly and plateaus near 0.27 from epochs 7 to 10. The trend indicates effective learning in early epochs, followed by mild overfitting as validation loss stops improving while training loss continues to stabilise.Training and validation loss for BERT emotion classifier
Confidence calibration was assessed via reliability diagrams (Figure 6), showing a close alignment between predicted probabilities and actual correctness, particularly in the 0.5–0.9 range. Minor overconfidence was noted above 0.9, but overall calibration was acceptable.
The plot shows model accuracy as a function of prediction confidence, alongside a dashed line representing perfect calibration. Accuracy increases from about 0.12 at 0.1 confidence to about 0.86 at 0.9 confidence. The model curve closely follows the perfect calibration line, with slight deviations at lower and higher confidence levels. This indicates that the model is generally well calibrated, with predicted probabilities aligning closely with observed accuracy.Reliability diagram for emotion prediction confidence
The plot shows model accuracy as a function of prediction confidence, alongside a dashed line representing perfect calibration. Accuracy increases from about 0.12 at 0.1 confidence to about 0.86 at 0.9 confidence. The model curve closely follows the perfect calibration line, with slight deviations at lower and higher confidence levels. This indicates that the model is generally well calibrated, with predicted probabilities aligning closely with observed accuracy.Reliability diagram for emotion prediction confidence
4.6 Ablation study and symbolic rule depth effects
To better understand how each part of the system contributes to overall performance, an ablation study was carried out. This involved testing the system by removing or modifying specific components, such as different emotion sensing methods and the symbolic reasoning module. The goal was to see how these changes affected learner engagement and learning outcomes, rather than only looking at the final performance of the complete system.
Table 4 shows the results obtained when key components were disabled one at a time while keeping all other conditions the same. Engagement and learning gain were compared with the full neuro-symbolic system to make the differences easier to understand.
Component-Level Ablation and symbolic rule depth effects
| Configuration/rule depth | Engagement (%) | Learning gain (%) | Accuracy (%) | Dropout rate (%) |
|---|---|---|---|---|
| Full model (baseline) | 100.0 | 100.0 | 91.8 | 4.3 |
| Without symbolic reasoning | 83.0 | 88.0 | 91.8 | 10.2 |
| Facial-Only emotion detection | 92.6 | 96.2 | 85.4 | 6.7 |
| Auditory-Only detection | 91.8 | 95.8 | 83.9 | 7.1 |
| Rule Depth = 1 | 91.2 | 92.5 | 91.8 | 6.9 |
| Rule Depth = 2 | 95.4 | 96.8 | 91.8 | 5.8 |
| Rule Depth = 3 | 98.3 | 98.9 | 91.8 | 4.7 |
| Rule Depth = 4 | 100.0 | 100.0 | 91.8 | 4.3 |
| Rule Depth = 5 | Capped at 100 | Capped at 100 | 91.8 | 4.3 |
| Configuration/rule depth | Engagement (%) | Learning gain (%) | Accuracy (%) | Dropout rate (%) |
|---|---|---|---|---|
| Full model (baseline) | 100.0 | 100.0 | 91.8 | 4.3 |
| Without symbolic reasoning | 83.0 | 88.0 | 91.8 | 10.2 |
| Facial-Only emotion detection | 92.6 | 96.2 | 85.4 | 6.7 |
| Auditory-Only detection | 91.8 | 95.8 | 83.9 | 7.1 |
| Rule Depth = 1 | 91.2 | 92.5 | 91.8 | 6.9 |
| Rule Depth = 2 | 95.4 | 96.8 | 91.8 | 5.8 |
| Rule Depth = 3 | 98.3 | 98.9 | 91.8 | 4.7 |
| Rule Depth = 4 | 100.0 | 100.0 | 91.8 | 4.3 |
| Rule Depth = 5 | Capped at 100 | Capped at 100 | 91.8 | 4.3 |
Percentages above 100% were capped for interpretability. When the symbolic reasoning module was removed, there was a clear drop in learner engagement and improvement in retention. This suggests that structured decision-making plays an important role in turning emotional signals into useful teaching actions. On the other hand, using only a single emotion detection method led to a moderate decrease in performance. This indicates that combining multiple sensing methods makes the system more reliable, although it is not enough on its own to achieve the best learning support.
Beyond performance improvements, the results highlight the conceptual contribution of the proposed framework as a decision-level neuro-symbolic system. The observed gains can be attributed not only to multimodal affect detection but also to the conditional activation of symbolic reasoning, which enables more interpretable and pedagogically meaningful adaptations. This demonstrates the advantage of moving from purely data-driven personalization toward hybrid reasoning models that explicitly incorporate structured decision logic.
4.7 Practical implications and ethical considerations
The findings of this study have several practical implications for educators and instructional designers. The neuro-symbolic framework can be integrated into VLEs to provide real-time, affect-aware personalization, supporting learners through adaptive content delivery, pacing adjustments and challenge modulation. For instance, learners exhibiting signs of frustration may receive additional hints or explanatory videos, while learners showing boredom could be guided toward interactive quizzes or more challenging activities. Instructors can supervise these adaptations, ensuring alignment with curriculum goals and learner needs, while also contextualizing interventions based on class-level observations.
Scalability considerations include expanding the symbolic rule base to accommodate diverse learner profiles and integrating the system with existing educational platforms. Automated rule induction and modular design strategies can facilitate broader deployment without overburdening technical maintenance.
Ethical considerations are essential when implementing affect-aware systems. This includes obtaining informed consent, ensuring privacy and confidentiality of emotion data and responsibly using affective information to enhance learning rather than manipulate or penalize learners. Transparent reporting and explainable system behavior are critical to maintaining trust and promoting responsible educational technology practices.
5. Conclusion and future work
This study presented a new neuro-symbolic reasoning paradigm for affect-aware personalization in virtual learning systems to address the need of adaptive learning systems as being emotionally intelligent. By combining an emotion recognition subsystem powered by a deep neural network comprising BERT for text input, ResNet50 for visual signals and wav2vec 2.0 for voice signals with a symbolic reasoning component simulated via Prolog-based inference over an affective learner ontology, the given system demonstrated significant improvements in learner engagement, retention and performance.
Empirical evaluation validated that the hybrid model consistently performed well across several measures. The neuro-symbolic architecture outperformed baseline models, including a neural-only adaptation model and a control system without affective personalization. Specifically, the hybrid method provided an overall boost in engagement of 23.6% and learning gain of 19.8% over the control, as indicated by Table 2. Symbolic reasoning was also shown to play a critical role; ablation tests revealed its removal led to a statistically significant reduction in engagement by 17% and learning outcome by 12% (p < 0.01), highlighting the explanatory and adaptive capability of structured reasoning over affective states.
Additionally, SHAP-based interpretability analysis confirmed that symbolic rules especially those derived from prolonged negative affect or conflict of affects were integral to adaptive choice-making and thus trust and explainable model forecasting. Training dynamics analysis guaranteed stability of convergence as well as effectiveness of generalization, and further experimentation on depth of rules revealed a gains saturation point for personalization, demonstrating best complexity threshold values for symbolic rules.
While these encouraging results, several limitations need to be pursued. Firstly, the rule-based engine itself, while helpful, rests on pre-specified expert rules, which will limit scalability to diverse learning settings or evolving learner behavior. Neuro-symbolic program synthesis or reinforcement learning for rule adaptation is a promising direction in the future. Second, even though the model currently recognizes face and voice cues, inclusion of physiological cues such as galvanic skin response or heart rate variability can be more holistic affective modeling. Finally, extension of personalization to include culturally situational emotional norms or learning styles might make the system more inclusive.
Future research should explore both technical and educational dimensions. On the technical side, this includes automated symbolic rule induction, reinforcement learning for dynamic adaptation and integration of additional affective modalities such as physiological signals. From an educational perspective, longitudinal studies are needed to evaluate the long-term impact on learner engagement, knowledge retention and motivation across diverse learning environments. Investigations into cross-cultural generalizability and inclusive design will help ensure that adaptive interventions are effective for learners with varied socio-emotional and cultural backgrounds.
The author would like to express special thanks to all the contributors of this paper.
Funding
This research received no external funding.

