This study develops an AI-driven multimodal framework integrating automated waste segregation with energy recovery prediction to support Saudi Vision 2030 sustainability goals.
A two-stage vision pipeline (YOLOv9 + Swin Transformer) performs real-time waste detection and classification. Multimodal physicochemical features are modeled using XGBoost, deep neural networks (DNNs), and graph neural networks (GNNs) to predict energy recovery potential. Reinforcement learning (RL) optimizes routing of waste streams to appropriate facilities. Experiments use a Saudi-specific dataset of 50,000 annotated images and 5,000 physico-chemical records.
The integrated framework achieved mAP = 94.3% and R² = 0.96, improving landfill diversion and renewable energy contribution by 12.1% and 15.2%, respectively, compared with baseline models.
Hazardous waste remains underrepresented in the dataset. Future work will address this via targeted data collection and active learning.
The framework provides a deployable solution for real-time waste classification and energy estimation in smart-city contexts, with a projected payback period of approximately 1.5 years for a 1,000-bin deployment.
This study introduces the first Saudi-specific multimodal waste dataset and a unified AI framework bridging waste segregation, energy prediction, and smart-city optimization—an end-to-end solution absent from prior literature.
1. Introduction
Rapid urbanization and population growth in Saudi Arabia have led to a sharp increase in municipal solid waste (MSW) generation, exceeding 15 million tons annually (Alhumoud & Alhumoud, 2021). Traditional reliance on landfilling and manual segregation is environmentally unsustainable, contributing to greenhouse gas emissions, soil contamination, and missed resource recovery opportunities (Hoornweg & Bhada-Tata, 2012). These challenges underscore the need for innovative solutions aligned with Saudi Vision 2030, which emphasizes sustainability, smart city infrastructure, and renewable energy integration (Kingdom of Saudi Arabia, 2017).
Artificial Intelligence (AI) has emerged as a key enabler in modern waste management (Rada et al., 2020). When combined with Internet of Things (IoT) technologies (Pal & Shankar, 2023; Darem et al., 2021; Yitmen, 2023), AI systems support real-time waste monitoring and data-driven decision-making essential for scaling smart city solutions (Sosunova & Porras, 2022). Technologies such as anaerobic digestion, pyrolysis, and waste-to-energy incineration can transform waste into electricity, heat, and biogas (Abdel-Shafy & Mansour, 2018). AI enhances these processes by predicting calorific values and optimizing waste stream allocation (Zieleńska & Bułkowska, 2024).
Despite growing global interest, most prior studies address either waste segregation or energy recovery in isolation. Existing frameworks such as WasteNet, DeepWaste, and SmartBinAI are siloed: none integrates both tasks within a unified, real-time pipeline. The comparative gap is summarized in Table 1.
Table 1. Comparison of the proposed framework against prior systems across four critical dimensions. The proposed framework is the only system to satisfy all four criteria simultaneously.
| System | Saudi dataset | Integrated pipeline | Real-time energy | Smart city integr | Ref. |
|---|---|---|---|---|---|
| WasteNet | No | No | No | No | Prior |
| DeepWaste | No | Partial | No | No | Prior |
| SmartBinAI | No | Partial | No | Partial | Prior |
| Proposed Framework | Yes | Yes | Yes | Yes | This |
This paper bridges that gap by introducing a multimodal, AI-driven framework making three original contributions not found in combination in any prior system:
(1) A two-stage vision pipeline (YOLOv9 + Swin Transformer) for high-accuracy waste detection and classification, including composite categories, with full hyperparameter reproducibility (Tables 2, 3, 5).
(2) A multimodal regression module (XGBoost + DNN + GNN) linking detected waste fractions to real-time calorific value and biogas yield predictions via physicochemical measurements (Table 7).
(3) An integrated reinforcement learning layer with an explicit reward function and state–action space optimizing routing to maximize energy recovery and landfill diversion (Table 8).
The novelty is threefold: (1) the first large-scale Saudi-specific multimodal dataset linking imagery with physico-chemical records; (2) the first unified end-to-end pipeline from visual detection to energy-routing optimization; and (3) a fully specified RL–GNN integration for city-scale waste logistics, validated on independent geographic data.
2. Related work
2.1 Waste management context in Saudi Arabia
Saudi Arabia generates over 15 million tons of MSW annually, with 1.5–2.5 kg per capita per day in major cities. The country's landfill-dependent system produces approximately 3.2 million tonnes of CO2-equivalent methane emissions annually while recovering less than 5% of resources. Vision 2030 targets 40% waste diversion by 2030, but only 12% had been achieved by 2023. MSW composition is dominated by organic matter (40–50%), distinguishing it from GCC peers and limiting the applicability of global AI models trained on different waste profiles.
2.2 Artificial Intelligence for waste classification
Deep learning and computer vision have substantially improved automated waste classification, with YOLO-family detectors and transformer-based models exceeding 90% accuracy on benchmarks such as TrashNet and TACO. However, these models rely on datasets from Western or Southeast Asian contexts. Domain-specific Saudi challenges include high ambient temperatures (40–50°C), dust accumulation, culturally specific items (e.g. Zamzam bottles, prayer mats), and multilingual interface requirements. Transfer-learning experiments by Yitmen (2023) demonstrated 17% accuracy degradation when Western-trained models were deployed on Saudi waste streams without fine-tuning, motivating a locally curated dataset.
2.3 IoT and smart city integration
IoT technologies enable smart bins to stream continuous fill-level and image data, supporting dynamic route planning with operational efficiency gains of up to 40%. Saudi initiatives such as NEOM's smart waste systems show early progress, yet most remain limited to sensing without integrated AI for automated sorting or energy prediction. The proposed framework embeds deep learning into IoT-enabled bins and links them with cloud-based energy analytics aligned with Vision 2030. Governance is addressed through SDAIA's privacy-by-design mandate (SDAIA, 2023), and stratified data collection across diverse socioeconomic zones mitigates algorithmic fairness risks (Esmaeilian et al., 2018).
2.4 Energy recovery and predictive modeling
Waste-to-energy (WTE) technologies convert organic and non-recyclable waste into electricity, heat, or biogas. Recent machine learning approaches predict calorific value and biogas yield, but typically assume homogeneous waste inputs and lack integration with AI-driven segregation. The proposed framework addresses these gaps by combining vision-based classification with physicochemical data for real-time energy prediction. Saudi-specific parameters, such as organic waste yielding 0.42–0.58 m³ CH₄/kg VS, are incorporated to improve local accuracy. Table 1 positions this work against prior systems across four critical dimensions.
3. Methodology
3.1 Framework overview
The proposed methodology is structured as a two-stage AI framework embedded into an IoT-enabled smart city ecosystem. At the edge level, smart bins with cameras perform real-time detection and classification. The cloud layer refines predictions and estimates energy recovery potential. The integration layer coordinates facility allocation using RL, feeding results into the municipal dashboard and smart grid. The complete pipeline is illustrated in Figure 1.
Figure 1. Overall system architecture. The smart waste management system is divided into four layers: Integration Layer, Cloud Layer, Communication Layer, and Edge Layer. The Integration Layer includes the RL Optimizer, Municipal Dashboard, Smart Grid API, and Citizen App. The Cloud Layer consists of the Vision Module, Energy Prediction, and Database. The Vision Module uses YOLOv9 for detection and Swin-T for refinement, outputting waste categories. Energy Prediction uses XGBoost, DNN, and GNN for calorific value prediction, biogas yield estimation, and network-aware optimization. The Database is PostgreSQL. The Communication Layer includes 4G/5G Network, MQTT Broker, AWS IoT Core, and Data Streaming. The Edge Layer features Smart Bin 1, Smart Bin 2, and Smart Bin N, each equipped with various sensors and cameras. Arrows show the flow of data and interactions between these components.
The end-to-end data flow operates as follows. Step 1: YOLOv9 detects waste objects and generates bounding boxes with class probabilities. Step 2: ambiguous detections are forwarded to the cloud-hosted Swin Transformer for refined classification. Step 3: classification outputs are converted into compositional fractions that, combined with sensor-measured physicochemical features, form the input vector for the XGBoost and DNN regressors. Step 4: the GNN aggregates node-level energy predictions across the city-wide collection network, incorporating facility capacities and transportation costs into each node embedding. Step 5: the RL agent takes the GNN-enriched state as input and selects routing actions, with rewards derived from diversion rates and energy output. This explicit causal chain ensures each component is functionally integrated rather than independently stacked.
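Steps 2 and 3 of this data flow can be illustrated with a minimal sketch. It assumes a hypothetical confidence cut-off (`AMBIGUITY_THRESHOLD`) for deciding which detections are "ambiguous", and it assumes compositional fractions are computed from item counts; the paper states neither detail, so both are placeholders rather than the authors' implementation.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass

# Hypothetical cut-off: the paper does not state how "ambiguous" is defined.
AMBIGUITY_THRESHOLD = 0.60


@dataclass
class Detection:
    label: str         # most probable class from the edge detector
    confidence: float  # probability assigned to that class


def needs_cloud_refinement(det: Detection, threshold: float = AMBIGUITY_THRESHOLD) -> bool:
    """Step 2: flag low-confidence detections for the cloud-hosted classifier."""
    return det.confidence < threshold


def compositional_fractions(labels: list[str], categories: list[str]) -> list[float]:
    """Step 3: convert final class labels into per-category count fractions."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        return [0.0] * len(categories)
    return [counts[c] / total for c in categories]


def build_feature_vector(weight_kg: float, moisture_pct: float,
                         density_g_cm3: float, fractions: list[float]) -> list[float]:
    """Assemble the regressor input [w, m, d, f1, ..., fN]."""
    return [weight_kg, moisture_pct, density_g_cm3, *fractions]
```

A mass-weighted fraction would be a reasonable alternative to the count-based one if per-item weights were available from the bin sensors.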
3.2 Data collection and annotation
3.2.1 Image dataset construction
A representative visual dataset was constructed using 12-MP IMX477 cameras deployed at twenty sites across Riyadh and Jeddah between March and August 2024, in collaboration with Riyadh Municipality and the Jeddah City Development Authority. Sites covered residential districts, commercial areas, universities, hospitals, and public spaces to capture demographic and environmental variability, including a 35% rise in organic waste during Ramadan. A total of 127,000 raw images were collected and curated to 50,000 high-quality annotated samples. All images were captured at 1920 × 1080 resolution and resized to 640 × 640 pixels for YOLOv9 and 224 × 224 pixels for the Swin Transformer. Preprocessing applied ImageNet normalization (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]).
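The normalization step can be written out for a single pixel as a short sketch. It assumes 8-bit RGB values are first scaled to [0, 1] before the per-channel mean and standard deviation are applied, which is the usual convention for these ImageNet statistics but is not stated explicitly in the text.

```python
# ImageNet channel statistics as quoted in the text (R, G, B order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel.

    Assumption: inputs are integers or floats in the range 0..255.
    """
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )
```

In practice the same arithmetic is applied to whole image tensors at once; the per-pixel form is shown only to make the formula explicit.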
To expand category diversity, 8,000 images from public datasets were incorporated after domain adaptation. The source breakdown and adaptation methods are detailed in Table 2. Domain adaptation effectiveness was confirmed by a domain classifier whose accuracy fell from 94% (pre-adaptation) to 61% (post-adaptation), indicating substantially reduced distribution shift. External images were relabeled using the Saudi taxonomy; items absent from the taxonomy were excluded. All external samples underwent the same inter-annotator quality check as Saudi images.
Table 2. External dataset sources, number of images used, inclusion criteria, and domain adaptation techniques applied
| Source | Images used | Total available | Inclusion criteria | Domain adaptation applied |
|---|---|---|---|---|
| TrashNet | 3,200 | 2,527ᵃ | 6 overlapping classes | CycleGAN style transfer; color normalization to Saudi lighting profile |
| TACO | 3,100 | 1,500+ | Litter categories matching Saudi taxonomy | Background augmentation with Saudi urban textures; dust simulation |
| WCD | 1,700 | 5,000 | High-resolution outdoor bins | Histogram equalization; temperature-tone correction |
ᵃ TrashNet was augmented to 3,200 images via mirroring and cropping prior to style transfer.
3.2.2 Final dataset statistics
The final dataset contains 50,000 Saudi-context images supplemented by 8,000 harmonized external samples. Data were partitioned at the bin level (70% training, 15% validation, 15% testing) to prevent leakage. Table 3 summarizes the dataset composition across all eight waste categories, including image counts, percentages, average items per image, and visual complexity ratings.
Table 3. Final dataset composition across eight waste categories, including image counts, percentages, average items per image, and visual complexity ratings
| Category | Images | Percentage | Avg. Items/Image | Visual complexity |
|---|---|---|---|---|
| Plastic | 14,200 | 28.4% | 2.3 | High (transparent, colored, reflective) |
| Organic | 11,800 | 23.6% | 1.7 | Medium (irregular shapes, moisture variation) |
| Paper/Cardboard | 8,900 | 17.8% | 1.9 | Medium (folds, stains, varying brightness) |
| Metal | 5,400 | 10.8% | 1.3 | Low–Medium (shiny surfaces, occlusions) |
| Glass | 3,200 | 6.4% | 1.2 | Low (rare items, high reflectivity) |
| Textile | 2,400 | 4.8% | 1.5 | High (deformable, patterned) |
| E-Waste | 1,850 | 3.7% | 1.6 | High (irregular, multi-material) |
| Composite/Mixed | 2,250 | 4.5% | 2.8 | Very High (food-contaminated packaging, multi-material) |
| Total | 50,000 | 100% | – | – |
The dataset is markedly imbalanced: Plastic and Organic together account for 52.0% of images, while the minority classes E-Waste (3.7%) and Textile (4.8%) are sparsely represented. The Composite/Mixed category, although only 4.5% of the dataset, has the highest average item count per image (2.8) and the highest visual complexity rating, which makes it the most difficult category to classify.
3.2.3 Annotation quality assurance
All labels followed a standardized taxonomy. Class definitions: Organic = food-derived material with visible biological degradation; Plastic = any petroleum-based polymer item; Composite = any item where a secondary material covers ≥50% of the surface. Items with less than 50% contamination retain their primary material label. Overlapping items are labeled for the dominant visible fraction, with secondary labels stored in metadata. Disagreements between the two independent annotators were resolved by majority vote among three annotators; the fewer than 3% of images that remained contested were escalated to an expert reviewer. All twelve annotators completed a standardized two-day training session covering the taxonomy and the VGG Image Annotator (VIA) tool. A three-stage quality pipeline was implemented: (1) automated bounding-box integrity checks, (2) inter-annotator agreement scoring (Cohen's κ = 0.87), and (3) expert review for hazardous or composite images.
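For readers unfamiliar with the agreement statistic reported in stage (2), the sketch below computes Cohen's κ for two annotators from its standard definition, κ = (p_o − p_e) / (1 − p_e). The function name and the toy labels are illustrative only; the paper does not describe the tooling used to obtain κ = 0.87.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: annotators disagree on one of four items.
annotator_1 = ["Plastic", "Plastic", "Organic", "Organic"]
annotator_2 = ["Plastic", "Plastic", "Organic", "Plastic"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5
```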
3.3 Physico-chemical data collection
A total of 5,000 physico-chemical records were collected from the same twenty sites at the Saudi Aramco Environmental Laboratory and the National Waste Management Center in Riyadh. Laboratory protocols: calorific value by bomb calorimetry (ISO 1928:2009) on oven-dried samples (105°C, 24 h), measurement uncertainty ±0.15 MJ/kg (k = 2); moisture content by gravimetric method (ASTM D2216), wet-basis, measured before grinding; bulk density by standardized cylindrical container (EN 1097-3); biogas yield by Biochemical Methane Potential assay (VDI 4630) with mesophilic inoculum at 37°C ±0.5°C, inoculum-to-substrate ratio of 2:1 on volatile-solids basis, 30-day incubation, daily gas measurement by water displacement, and methane content analyzed by gas chromatography (Shimadzu GC-2014). The pipeline linking detected category → predicted fraction → predicted calorific value is illustrated in Figure 2.
Figure 2. Confusion matrix for YOLOv9 + Swin classification across six waste categories, illustrating robust performance even on ambiguous composite classes. Rows represent the true class and columns the predicted class for Plastic, Metal, Paper, Organic, Hazardous, and Composite. Diagonal entries (correct classifications) include 480 for Plastic, 460 for Metal, 470 for Paper, 450 for Organic, 90 for Hazardous, and 400 for Composite. Notable off-diagonal entries include 10 Plastic items misclassified as Composite and 20 Organic items misclassified as Composite.
3.4 Waste segregation model
3.4.1 Model selection rationale
Five detection architectures were evaluated on a 5,000-image validation subset along three criteria: accuracy, inference speed, and edge deployment feasibility. As shown in Table 4, YOLOv9-tiny achieves the best accuracy–latency compromise: mAP@0.5 = 92.8% at 45 ms per frame, satisfying the real-time constraint of less than 100 ms on Raspberry Pi 4 hardware.
Table 4. Detection model comparison on the 5,000-image validation subset
| Model | mAP@0.5 | Inference time (ms) | Parameters (M) | Edge deployment |
|---|---|---|---|---|
| YOLOv9-tiny ★ Selected | 92.8% | 45 | 7.2 | ✓ Feasible |
| YOLOv8 | 91.3% | 78 | 11.4 | ⚠ Marginal |
| Faster R-CNN | 93.1% | 312 | 41.8 | ✕ Too slow |
| EfficientDet | 90.7% | 156 | 6.6 | ✕ Too slow |
| DETR | 89.4% | 287 | 86.4 | ✕ Too slow |
Note(s): YOLOv9-tiny (★) selected for best accuracy–latency trade-off and edge deployment feasibility
Although Faster R-CNN attains slightly higher accuracy (+0.3 pp), its 312 ms latency violates real-time constraints. For composite and ambiguous categories, Swin Transformer was compared against ResNet-152, EfficientNet-B7, and standard ViT, as shown in Table 5. Swin Transformer delivers the highest composite F1-score (89.7%) while meeting the cloud inference budget (less than 500 ms). Its shifted-window mechanism reduces attention complexity from $O(n^2)$ to approximately $O(n)$ (Liu et al., 2021).
Table 5. Classification model performance on composite and ambiguous waste categories
| Model | Overall accuracy | Composite F1-Score | Attention mechanism |
|---|---|---|---|
| Swin Transformer ★ Selected | 94.3% | 89.7% | Shifted-window attention |
| ResNet-152 | 91.7% | 84.2% | Convolution only |
| EfficientNet-B7 | 92.1% | 85.8% | Convolution only |
| ViT | 92.1% | 86.4% | Global self-attention |
Note(s): Swin Transformer (★) selected for highest composite F1-score
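The detector selection reported in Table 4 amounts to a constrained choice: keep the candidates that meet the 100 ms edge budget, then take the most accurate of them. The sketch below reproduces that rule using the Table 4 figures. It is a simplification that considers only latency and mAP@0.5; the qualitative "edge deployment" rating used in the paper is not modeled.

```python
# (model, mAP@0.5 in %, inference time in ms), copied from Table 4.
CANDIDATES = [
    ("YOLOv9-tiny", 92.8, 45),
    ("YOLOv8", 91.3, 78),
    ("Faster R-CNN", 93.1, 312),
    ("EfficientDet", 90.7, 156),
    ("DETR", 89.4, 287),
]

LATENCY_BUDGET_MS = 100  # real-time constraint stated for the edge device


def select_detector(candidates, budget_ms):
    """Return the highest-mAP candidate whose latency is under the budget."""
    feasible = [c for c in candidates if c[2] < budget_ms]
    if not feasible:
        raise ValueError("no candidate satisfies the latency budget")
    return max(feasible, key=lambda c: c[1])


selected = select_detector(CANDIDATES, LATENCY_BUDGET_MS)
```

Under this rule Faster R-CNN is excluded despite its higher mAP, which matches the reasoning given in the text.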
3.4.2 Hyperparameter configuration
Complete hyperparameter settings for both models are provided in Table 6 to ensure full reproducibility. This includes input resolutions, optimizer settings, learning rate schedulers, anchor configurations, batch normalization details, augmentation probabilities, and dropout values for both YOLOv9-tiny and the Swin Transformer.
Table 6. Complete hyperparameter configuration for YOLOv9-tiny and Swin Transformer, provided for full reproducibility
| Parameter | YOLOv9-tiny | Swin transformer |
|---|---|---|
| Input resolution | 640 × 640 | 224 × 224 |
| Backbone | CSPDarknet + PAN neck | Swin-B (pretrained ImageNet-22K) |
| Optimizer | SGD (momentum = 0.9, weight decay = 5 × 10⁻⁴) | AdamW (weight decay = 0.01) |
| Learning rate | 0.01 (cosine decay to 0.001) | 5 × 10⁻⁵ (linear warmup, 5 epochs) |
| Batch size | 64 | 32 |
| Epochs | 200 | 100 |
| Anchor settings | Auto-anchored: 3 scales × 3 ratios on training set | – |
| Normalization | BatchNorm (momentum = 0.03, eps = 1 × 10⁻³) | LayerNorm |
| Augmentation | Mosaic (p = 0.8), flip (p = 0.5), HSV jitter (h = 0.015) | RandAugment (n = 2, m = 9), random crop, color jitter |
| Dropout | – | 0.1 (attention), 0.0 (path drop) |
| LR scheduler | Cosine annealing with warm restart | Linear warmup + cosine decay |
3.4.3 Model formulation
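Given an input image $I$, YOLOv9 produces bounding boxes $\{b_i\}$, each associated with a predicted class distribution $p(c \mid b_i)$. The compound detection loss is

$$\mathcal{L}_{\mathrm{det}} = \lambda_{\mathrm{cls}}\, L_{\mathrm{CE}}\big(p(c \mid b_i),\, y_i\big) + \lambda_{\mathrm{box}}\,\big(1 - \mathrm{IoU}(b_i, \hat{b}_i)\big) \tag{1}$$

where $\lambda_{\mathrm{cls}} = 0.5$ and $\lambda_{\mathrm{box}} = 0.05$. The Swin Transformer processes the feature representation $z^{l}$ through

$$\hat{z}^{l} = \mathrm{MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l} \tag{2}$$

$$z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l} \tag{3}$$

The final prediction is

$$\hat{y} = \arg\max_{c \in C}\, p(c \mid I) \tag{4}$$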
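As a worked illustration of the compound detection loss for a single box, the following sketch evaluates the cross-entropy and IoU terms with the stated weights (0.5 for classification, 0.05 for box regression). It assumes boxes in (x1, y1, x2, y2) corner format and a one-hot ground-truth class; both are assumptions for illustration, and the actual YOLOv9 training loss contains further terms not described in the text.

```python
import math


def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cross_entropy(probs, true_index):
    """Cross-entropy against a one-hot target (clamped to avoid log(0))."""
    return -math.log(max(probs[true_index], 1e-12))


def detection_loss(probs, true_index, pred_box, true_box,
                   lambda_cls=0.5, lambda_box=0.05):
    """Weighted sum of the classification and (1 - IoU) box terms."""
    return (lambda_cls * cross_entropy(probs, true_index)
            + lambda_box * (1.0 - iou(pred_box, true_box)))
```

A perfect prediction (probability 1 on the true class, identical boxes) gives a loss of zero, and each term grows independently as classification or localization degrades.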
3.5 Energy recovery prediction
The prediction task is formulated as supervised regression with input feature vector

$$\chi = [w,\, m,\, d,\, f_1,\, f_2,\, \ldots,\, f_N] \tag{5}$$

where $w$ is waste weight (kg), $m$ is moisture content (% wet basis), $d$ is bulk density (g/cm³), and $f_i$ is the fraction of waste category $i$. XGBoost minimizes the regularized objective

$$L = \sum_i (y_i - \hat{y}_i)^2 + \Omega(f_t) \tag{6}$$

The DNN applies layered nonlinear transformations

$$h^{(l)} = \sigma\big(W^{(l)} h^{(l-1)} + b^{(l)}\big) \tag{7}$$

with architecture: three hidden layers (256–128–64 neurons), ReLU activation, 0.3 dropout, Adam (lr = $1 \times 10^{-3}$), 200 epochs with early stopping (patience = 20). Complete model comparison results are provided in Table 7.
Table 7. Energy recovery prediction model comparison on the held-out test set (n = 1,000)
| Model | R² | MAE (MJ/kg) | RMSE (MJ/kg) | Train time | Interpretability |
|---|---|---|---|---|---|
| XGBoost | 0.91 | 1.2 | 1.74 | 8 min | High (SHAP) |
| Random Forest | 0.89 | 1.5 | 2.01 | 12 min | Moderate |
| SVR | 0.84 | 2.1 | 2.89 | 45 min | Low |
| DNN (3 layers) | 0.93 | 1.0 | 1.52 | 22 min | Low |
| GNN (GraphSAGE) ★ | 0.96 | 0.74 | 1.11 | 35 min | Moderate |
Note(s): Best values in italic (★). Classification and regression metrics are reported separately to avoid conflation
Table 7 compares five models for predicting the energy recovery potential of mixed municipal solid waste on the held-out test set. The GNN (GraphSAGE) performs best on every accuracy metric, with the highest R² (0.96) and the lowest MAE (0.74 MJ/kg) and RMSE (1.11), which indicates that graph-based architectures are advantageous for this task.
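The three accuracy metrics in Table 7 follow their standard definitions, which the short sketch below makes explicit. The sample values are made up for illustration and are not taken from the study's test set.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Illustrative calorific values in MJ/kg (not from the paper).
metrics = regression_metrics([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 13.0, 16.0])
```

Because RMSE squares the residuals before averaging, it is always at least as large as MAE, which is consistent with every row of Table 7.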
3.6 Graph neural network for spatial modeling
A two-layer GraphSAGE GNN models spatial dependencies across the collection-treatment network. Graph construction: 847 nodes total (620 collection points, 48 transfer stations, 12 recycling centers, 8 anaerobic digesters, 5 WTE incinerators, 154 road routing hubs) and 2,341 directed edges. Edge weights encode transportation distance (km) normalized by facility capacity (tons/day), with additional features for average travel time and CO₂ emissions per km. Node features for collection nodes include predicted calorific value, compositional fractions, fill level, and historical throughput; facility nodes carry remaining capacity, processing cost, and energy output efficiency. The GNN message-passing update is

$$h_v^{(k+1)} = \sigma\Big(W^{(k)} \cdot \mathrm{MEAN}\big(\{h_u^{(k)} : u \in N(v)\}\big)\Big) \tag{8}$$

GNN outputs feed directly into the RL state vector, linking both components explicitly. Training is supervised on node-level energy recovery labels (Adam, lr = $5 \times 10^{-4}$, 100 epochs). Ablation results confirm a 23% logistics cost reduction versus point-wise routing (p < 0.001; Section 4.5.3).
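The mean-aggregation update in Equation (8) can be traced by hand on a tiny graph. The sketch below implements one layer in plain Python under three assumptions that the text does not specify: the nonlinearity σ is taken to be ReLU, edge weights are ignored, and a node with no neighbors aggregates to a zero vector. The node names are hypothetical.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sage_mean_layer(features, neighbors, weight):
    """One update h_v <- relu(W . MEAN({h_u : u in N(v)})) for every node v."""
    updated = {}
    for node, nbrs in neighbors.items():
        if nbrs:
            aggregated = mean_vectors([features[u] for u in nbrs])
        else:
            # Assumption: isolated nodes receive a zero message.
            aggregated = [0.0] * len(features[node])
        updated[node] = relu(matvec(weight, aggregated))
    return updated


# Hypothetical three-node graph: two bins feeding one WTE facility.
features = {"bin_a": [1.0, 2.0], "bin_b": [3.0, 4.0], "wte": [5.0, -6.0]}
neighbors = {"wte": ["bin_a", "bin_b"], "bin_a": ["wte"], "bin_b": []}
identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = sage_mean_layer(features, neighbors, identity)
```

Stacking two such layers, as in the model described above, lets each node embedding reflect information from nodes up to two hops away.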
3.7 Reinforcement learning for waste routing
A Proximal Policy Optimization (PPO) agent optimizes waste routing decisions. The RL formulation is fully specified. The state $\tau_t$ is a continuous vector comprising the GNN-encoded node embeddings for all $N$ network nodes, facility capacity utilization, current truck fill, and time of day. The action space is discrete over 25 facility nodes, with action masking for capacity-constrained facilities (≥95% capacity). The reward function is

$$R_t = 0.40\,\mathrm{Eff} + 0.35\,\mathrm{Div} - 0.15\,C_{\mathrm{transport}} - 0.10\,P_{\mathrm{violation}}$$

where Eff is normalized energy recovered (kWh/ton), Div is the landfill diversion fraction, $C_{\mathrm{transport}}$ is normalized transportation cost (SAR/ton·km), and $P_{\mathrm{violation}}$ is a capacity-violation penalty. The agent was trained entirely in a discrete-event simulation calibrated on six months of Riyadh operational data (truck schedules, facility throughput, waste volume records). Episodes span one operating day (6 AM to 10 PM), terminating when all bins are serviced or the time limit is reached. Discount factor γ = 0.99; PPO clip ratio ε = 0.2; training ran for 2,000 episodes, converging at approximately episode 1,200. Baselines compared: random routing, greedy nearest-facility, and rule-based routing. The full KPI comparison is provided in Table 8.
Table 8. System-level KPI comparison across routing strategies
| Routing strategy | Diversion rate | Energy yield (MWh/yr) | Transport cost (SAR/ton) | Edge latency |
|---|---|---|---|---|
| Random routing | 51.2% | 13,120 | 112 | 45 ms |
| Greedy nearest-facility | 57.4% | 14,880 | 89 | 45 ms |
| Rule-based routing | 59.1% | 15,240 | 83 | 45 ms |
| RL (PPO) ★ Proposed | 71.2% | 17,520 | 64 | 45 ms |
Note(s): RL (PPO) routing (★) demonstrates statistically significant improvements (p < 0.001) over all baselines on all metrics
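The reward function and the capacity-based action mask from Section 3.7 are simple enough to state in code. The sketch below uses the published weights and the 95% utilization threshold; it assumes all four reward inputs have already been normalized to comparable scales (the exact normalization is not given in the text), and the function names are illustrative.

```python
# Reward weights as stated in Section 3.7.
W_ENERGY, W_DIVERSION, W_TRANSPORT, W_VIOLATION = 0.40, 0.35, 0.15, 0.10

# Facilities at or above this utilization are masked out of the action space.
CAPACITY_MASK_THRESHOLD = 0.95


def routing_reward(eff, div, c_transport, p_violation):
    """R_t = 0.40*Eff + 0.35*Div - 0.15*C_transport - 0.10*P_violation."""
    return (W_ENERGY * eff + W_DIVERSION * div
            - W_TRANSPORT * c_transport - W_VIOLATION * p_violation)


def action_mask(utilization):
    """True where a facility may still be selected (utilization below 95%)."""
    return [u < CAPACITY_MASK_THRESHOLD for u in utilization]


# Example step: good energy recovery and diversion, moderate transport cost.
step_reward = routing_reward(eff=0.5, div=0.6, c_transport=0.2, p_violation=0.0)
allowed = action_mask([0.50, 0.95, 0.99])
```

With inputs in [0, 1], the reward is bounded between −0.25 and 0.75, so the two benefit terms dominate unless transport cost and violations are both high.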
4. Evaluation and results
4.1 Experimental setup
The framework was evaluated using 50,000 annotated waste images and 5,000 physicochemical records, partitioned at the bin level (70/15/15) to prevent data leakage. All models used random seed = 42 and identical five-fold stratified cross-validation splits. Deep learning models were trained on an NVIDIA A100 (40 GB) GPU; regression models on Intel Xeon CPUs. All results correspond to the held-out test set.
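Partitioning at the bin level means every image from a given bin lands in the same split, so near-duplicate frames cannot leak from training into testing. One simple way to obtain such a grouping is to hash the bin identifier, as sketched below. This is an illustrative approach, not the authors' procedure: the bin ID format is hypothetical, the 70/15/15 proportions hold only approximately, and the stratification used in the paper is not reproduced.

```python
import hashlib


def assign_split(bin_id: str, seed: int = 42) -> str:
    """Deterministically map a bin ID to 'train', 'val', or 'test' (about 70/15/15)."""
    digest = hashlib.sha256(f"{seed}:{bin_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < 0.70:
        return "train"
    if bucket < 0.85:
        return "val"
    return "test"


# Every image inherits the split of the bin it was captured from.
images = [("img_0001.jpg", "riyadh-bin-017"), ("img_0002.jpg", "riyadh-bin-017")]
splits = {name: assign_split(bin_id) for name, bin_id in images}
```

Because the assignment depends only on the seed and the bin ID, the split is reproducible across runs and machines without storing an index file.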
4.2 Waste segregation results
The two-stage pipeline achieved state-of-the-art results. YOLOv9-tiny achieved mAP@0.5 = 92.8% (Table 4); adding the Swin Transformer refinement improved overall classification to 94.3% and the composite waste F1-score from 81.2% to 89.7% (+8.5 pp) (Table 5). Figure 2 presents the confusion matrix across all six categories, identifying the most common misclassification (Composite ↔ Plastic) that motivated the Swin refinement stage.
4.3 Energy recovery prediction results
Regression models were evaluated on a 1,000-sample held-out test set; results are reported separately from classification metrics to avoid conflation. Table 7 presents the full model comparison. The GNN delivers the best overall performance (R² = 0.96, MAE = 0.74 MJ/kg) by capturing spatial dependencies across the collection network that point-wise models cannot access. Calorific value is reported in MJ/kg; biogas yield in m³ CH₄/kg VS; facility-level energy output in kWh, with all conversions stated on first use.
4.4 System-level KPIs
With RL optimization, the framework achieved a landfill diversion rate 12.1 percentage points higher than rule-based routing (71.2% vs. 59.1%) and a 15.2% increase in renewable energy contribution. Table 8 provides the full KPI comparison across all four routing strategies. Edge inference latency remained at 45 ms per image, well below the 100 ms real-time threshold. The 23% transportation cost reduction (GNN vs. point-wise routing) is supported by the ablation in Section 4.5.3 and corresponds to simulation results using the calibrated Riyadh operational model.
4.5 Ablation study and component-wise analysis
4.5.1 Vision pipeline ablation
YOLOv9 alone achieved 92.8% mAP@0.5. Adding Swin refinement increased overall mAP to 94.3% (+1.5 pp) and composite F1 from 81.2% to 89.7% (+8.5 pp). All other parameters were held constant across ablation runs.
4.5.2 Energy prediction ablation
XGBoost alone achieved R2 = 0.91. Adding a DNN ensemble improved R2 to 0.93. The full GNN-augmented pipeline achieved R2 = 0.96, confirming that network-level spatial modeling captures dependencies unavailable at the sample level. Composite waste benefits most from GNN augmentation (R2 gain of +0.06 vs. +0.03 for Organic).
4.5.3 RL component ablation
Replacing GNN state representation with point-wise features in the RL agent reduced the diversion rate improvement from 12.1% to 7.3%, confirming the value of GNN–RL integration (p < 0.001). Removing the transportation cost penalty (γ = 0) increased energy yield by only 0.4% but raised transport costs by 31%, validating the multi-objective reward design.
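The trade-off governed by γ can be illustrated with a toy reward. The exact reward used by the RL agent is not given in this section, so the linear form below, the function name `routing_reward`, and the two routing options are assumptions chosen only to mirror the reported pattern (a small energy gain against a large transport cost increase).

```python
def routing_reward(energy_mj, transport_cost, gamma):
    """Scalar reward trading energy yield against a weighted transport cost.

    The linear form and the units are assumptions for illustration only.
    """
    return energy_mj - gamma * transport_cost

# Two hypothetical routing options for the same load: the "far" facility
# yields 0.4% more energy but costs 31% more to reach.
options = {
    "near": {"energy_mj": 100.0, "transport_cost": 10.0},
    "far": {"energy_mj": 100.4, "transport_cost": 13.1},
}

def best_option(gamma):
    """Return the name of the option with the highest reward for this gamma."""
    return max(options, key=lambda name: routing_reward(gamma=gamma, **options[name]))
```

With γ = 0 the agent in this toy setting prefers the distant facility for a marginal energy gain; any sufficiently positive γ reverses that choice.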
4.5.4 Feature importance analysis
SHAP explainability analysis for the XGBoost calorific value predictor identifies the five most influential physicochemical indicators. These are presented in Table 9, including SHAP values, direction of effect, and interpretation for each feature.
Table 9. SHAP feature importance for XGBoost calorific value predictor
| Rank | Feature | SHAP value | Direction | Interpretation |
|---|---|---|---|---|
| 1 | Plastic fraction | +0.42 | Positive | High plastic content strongly increases calorific value |
| 2 | Moisture content | −0.28 | Negative | High moisture suppresses heating value |
| 3 | Density | +0.19 | Positive | Denser samples show more stable combustion properties |
| 4 | Paper fraction | +0.15 | Positive | Moderate positive contribution to heating value |
| 5 | Organic fraction | −0.12 | Negative | High organic content lowers calorific value due to moisture |
Plastic fraction (+0.42) is the strongest positive contributor; moisture content (−0.28) most strongly suppresses heating value. These insights support transparent model interpretation and facilitate communication with municipal regulators.
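For readers unfamiliar with how SHAP attributions arise, the sketch below computes exact Shapley values by brute force for a toy linear model. It is not the study's pipeline: the model coefficients, the baseline sample, and the function names are invented for illustration, and practical SHAP tooling for tree models uses far more efficient algorithms than enumerating orderings.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(x)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)        # start from the reference sample
        prev = predict(current)
        for name in order:
            current[name] = x[name]     # reveal one feature at a time
            new = predict(current)
            phi[name] += new - prev
            prev = new
    return {name: total / len(orderings) for name, total in phi.items()}

# Hypothetical linear toy model of calorific value (MJ/kg); coefficients are invented.
def toy_model(f):
    return 10.0 + 20.0 * f["plastic"] - 15.0 * f["moisture"] + 5.0 * f["paper"]

baseline = {"plastic": 0.20, "moisture": 0.30, "paper": 0.20}
sample = {"plastic": 0.35, "moisture": 0.22, "paper": 0.20}
phi = shapley_values(toy_model, sample, baseline)
```

In this toy case the attributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that makes SHAP values straightforward to explain to non-specialists.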
4.6 Statistical validation and robustness analysis
4.6.1 Cross-validation protocol
A five-fold stratified cross-validation scheme was applied independently to both the detection and energy-prediction components. All five models in the ANOVA comparison (proposed framework, WasteNet equivalent, YOLOv8 baseline, ResNet-152 baseline, random forest energy predictor) used identical folds and random seed = 42, with fold assignment at the bin level to eliminate data leakage. The vision model achieved mAP@0.5 = 94.3% (SD = 0.3%, 95% CI: [93.6%, 95.0%]). Energy prediction achieved R2 = 0.956 (SD = 0.01, 95% CI: [0.94, 0.97]).
4.6.2 Statistical significance testing
Separate one-way ANOVAs were conducted for classification metrics (mAP@0.5) and regression metrics (R2) to avoid conflating distinct measurement types. For classification: F(4,20) = 47.3, p < 0.001. For energy prediction: F(4,20) = 32.8, p < 0.001. Post-hoc Bonferroni-corrected t-tests confirmed the proposed model significantly outperforms all four baselines in both domains. Cohen's d values range from 1.94 to 2.87, representing large to very large practical effects.
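The degrees of freedom reported above follow from five models evaluated on five folds (4 between groups, 20 within). A minimal sketch of the F statistic and a pooled-variance Cohen's d is given below; the per-fold scores are placeholders, not the study's data, and the p-value step (which needs the F-distribution survival function from a statistics library) is omitted.

```python
from statistics import mean, stdev

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Placeholder per-fold mAP@0.5 scores (%), five models x five folds.
scores = [
    [94.1, 94.6, 94.0, 94.5, 94.3],
    [91.0, 91.8, 90.7, 91.5, 91.2],
    [90.2, 90.9, 89.8, 90.6, 90.4],
    [88.9, 89.6, 88.5, 89.3, 89.0],
    [87.5, 88.3, 87.0, 87.9, 87.6],
]
f_stat, df1, df2 = one_way_anova_f(scores)
d = cohens_d(scores[0], scores[1])
```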
4.6.3 Confidence intervals and uncertainty quantification
Bootstrap confidence intervals (10,000 samples): mAP CI = [93.4%, 95.2%]; composite F1 CI = [87.9%, 91.3%] (wider due to label ambiguity); energy R2 CI = [0.94, 0.98]. Illustrative prediction interval: for a sample with 35% plastic, 22% moisture, and density 0.41 g/cm3, predicted calorific value is 18.7 MJ/kg with 95% prediction interval [16.3, 21.1] MJ/kg.
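A percentile bootstrap with 10,000 resamples can be sketched as below. The text does not state which bootstrap variant was used, so the percentile method is an assumption, and the per-image scores are synthetic placeholders; for a metric such as mAP, the `stat` argument would recompute the metric on each resampled set rather than take a simple mean.

```python
import random
from statistics import mean

def bootstrap_ci(values, stat=mean, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for `stat` over `values`."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic placeholder scores, not the study's data.
rng = random.Random(0)
per_image_score = [rng.gauss(0.94, 0.05) for _ in range(100)]
low, high = bootstrap_ci(per_image_score)
```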
4.6.4 Robustness to noise and missing data
With 15% Gaussian noise injected into test images, the system maintained mAP > 90% and R2 > 0.89. XGBoost's built-in missing-value handling preserved R2 = 0.87 with 20% of sensor features randomly missing, demonstrating operational resilience to sensor dropout.
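The sensor-dropout condition can be simulated by masking a random share of feature values before prediction. The sketch below is an assumption about how such a test might be set up, not the study's procedure: the function name `drop_features` and the feature rows are hypothetical. Missing entries are encoded as NaN, which XGBoost treats as missing by default.

```python
import math
import random

def drop_features(rows, missing_rate=0.20, seed=42):
    """Return a copy of `rows` with roughly `missing_rate` of values set to NaN."""
    rng = random.Random(seed)
    return [
        [float("nan") if rng.random() < missing_rate else value for value in row]
        for row in rows
    ]

# Hypothetical feature rows (five physicochemical features per sample).
rows = [[0.35, 0.22, 0.41, 0.15, 0.12] for _ in range(1000)]
masked = drop_features(rows)
missing_fraction = sum(math.isnan(v) for row in masked for v in row) / (
    len(rows) * len(rows[0])
)
```

The masked rows would then be passed to the trained regressor and R2 recomputed against the unchanged targets.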
4.6.5 External validation on Jeddah dataset
An independent Jeddah dataset (8,000 images, separate collection campaign) evaluated geographic generalization. mAP decreased 2.6 pp (94.3% → 91.7%) and R2 decreased 0.03 (0.96 → 0.93). Per-class analysis reveals the composite category experienced the largest decline (−5.1 pp), attributable to tourism-related packaging absent from Riyadh training data. The domain shift for composite waste remains statistically significant (two-sample t-test, p = 0.031), indicating that Jeddah-specific fine-tuning would further improve performance. No domain adaptation was applied to Jeddah images in this evaluation (providing a true out-of-distribution test). Model drift will be monitored via quarterly performance audits, with automated retraining triggered when mAP falls below 90%.
5. Practical implementation and scalability analysis
5.1 Deployment architecture and infrastructure requirements
The framework adopts a scalable edge–cloud architecture. Each smart bin integrates a low-power embedded board, HD camera, environmental sensors, 4G/LoRa connectivity, and a solar-powered IP65 enclosure at an estimated SAR 1,500–1,800 per unit, enabling on-device YOLOv9 inference. Monthly cloud costs for a 1,000-bin fleet remain below SAR 10,000. Deployment proceeds in three stages: a 100-bin pilot in Riyadh (3 months), expansion to 1,000 bins within year one, and nationwide rollout of approximately 10,000 bins over years two and three.
5.2 Economic analysis and return on investment
Total annual costs for a 1,000-bin deployment are estimated at SAR 0.8–0.9 million, including five-year hardware amortization, cloud services, maintenance, and staffing. The framework generates approximately SAR 3.9–4.0 million in annual value through increased landfill diversion, higher material recovery, improved WTE output, and operational efficiencies. This yields a payback period of approximately 1.5 years and a five-year benefit–cost ratio above 4:1. Sensitivity tests confirm the payback remains under two years even under conservative assumptions.
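As a quick consistency check on the figures above, the benefit-cost ratio can be bounded from the stated ranges. This sketch assumes constant, undiscounted annual flows, so the annual ratio equals the five-year ratio; it does not reproduce the payback calculation, which depends on upfront capital figures not itemized in this paragraph.

```python
def benefit_cost_ratio(annual_benefit, annual_cost):
    """Undiscounted ratio of annual value generated to annual cost."""
    return annual_benefit / annual_cost

cost_range = (0.8, 0.9)      # SAR million per year, from the text
benefit_range = (3.9, 4.0)   # SAR million per year, from the text

bcr_low = benefit_cost_ratio(benefit_range[0], cost_range[1])    # pessimistic pairing
bcr_high = benefit_cost_ratio(benefit_range[1], cost_range[0])   # optimistic pairing
net_low = benefit_range[0] - cost_range[1]                       # SAR million per year
net_high = benefit_range[1] - cost_range[0]
```

Even the pessimistic pairing (lowest benefit, highest cost) gives a ratio of about 4.3, consistent with the stated ratio above 4:1.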
5.3 Real-world performance translation
CO2 reduction estimates are derived from: avoided landfill methane calculated using the IPCC MCF default of 1.0 for managed anaerobic landfills; displaced electricity at 0.49 kg CO2/kWh; and avoided transport emissions at 2.68 kg CO2/L diesel. For Riyadh (11,000 tons MSW/day), the 15.2% energy recovery improvement yields approximately 2,280 additional MWh/year and avoids an estimated 8,400 tonnes CO2-equivalent annually, with uncertainty bounds of approximately ±15%.
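The contribution of displaced electricity and the stated uncertainty bounds can be worked through directly from the numbers above. The sketch uses only figures given in the text; attributing the remainder to landfill methane and transport is an inference from the three listed sources, not a reported breakdown.

```python
GRID_FACTOR_KG_PER_KWH = 0.49   # displaced electricity, from the text
extra_energy_mwh = 2_280        # additional MWh/year, from the text
total_avoided_t = 8_400         # tonnes CO2-equivalent/year, from the text
uncertainty = 0.15              # approximately +/-15%, from the text

# MWh -> kWh -> kg CO2 -> tonnes CO2
electricity_t = extra_energy_mwh * 1_000 * GRID_FACTOR_KG_PER_KWH / 1_000

# Share left over for avoided landfill methane and avoided transport emissions.
remainder_t = total_avoided_t - electricity_t

bounds_t = (total_avoided_t * (1 - uncertainty), total_avoided_t * (1 + uncertainty))
```

Displaced electricity accounts for roughly 1,117 tonnes, so most of the estimated 8,400 tonnes would have to come from the methane and transport terms; the ±15% band spans about 7,140 to 9,660 tonnes.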
5.4 Scalability considerations
Edge devices perform local inference independently; cloud resources scale horizontally with additional GPUs and storage. Costs grow sub-linearly from 100 to 10,000 bins due to batch processing and shared infrastructure. Core models transfer across cities with light fine-tuning (approximately 500 local images per city) to accommodate local waste profiles such as tourism-heavy Jeddah or industrial Dammam.
5.5 Integration with existing municipal systems
Standard REST APIs connect the system to GIS platforms for spatial waste pattern visualization, fleet management software for fill-level-based dynamic routing, and energy operators for 24-h-ahead calorific value forecasting to schedule WTE plants. A citizen-facing mobile app in Arabic and English provides bin guidance and segregation incentives. Automated reporting aligns with SDAIA data regulations and generates periodic waste composition and diversion statistics.
5.6 Limitations and deployment risks
Hazardous waste (batteries, medical waste) is underrepresented in the dataset, yielding lower per-class accuracy for these categories. Calorific-value models calibrated on laboratory measurements may overestimate industrial-scale performance due to feedstock heterogeneity. Long-tail categories (electronics, textiles) lack fine-grained routing decisions. Operational risks from extreme heat, dust, connectivity gaps, and vandalism are mitigated through ruggedized enclosures, health-monitoring routines, and fallback communication channels.
5.7 Future enhancements
The three highest-priority extensions are: (1) hazardous waste coverage expansion through targeted collection drives and synthetic augmentation of underrepresented classes; (2) near-infrared (NIR) spectroscopy integration for plastic subtype identification to improve calorific value precision; and (3) federated learning across GCC cities to enable privacy-preserving collaborative model improvement without centralizing raw data.
5.8 Ethical and sustainability considerations
Privacy-by-design principles ensure raw images are stored locally for no more than 72 hours and never leave the edge device; only anonymized classification labels are transmitted to the cloud. A real-time face-detection filter immediately deletes any image that triggers it. Algorithmic fairness is addressed through stratified data collection across diverse socioeconomic zones and quarterly per-district performance audits. Potential social bias from better-segregated affluent areas is mitigated by oversampling from lower-compliance zones and class-balanced training. Workers affected by automation are offered reskilling into technical maintenance and data quality roles. The environmental carbon footprint of all model training is estimated at less than 0.5 tonnes CO2, negligible relative to the 8,400 tonnes CO2-equivalent avoided annually.
6. Conclusion
This study proposed an AI-driven multimodal framework for automated waste segregation and energy recovery prediction within the context of Saudi Vision 2030. The integrated system, spanning YOLOv9 detection, Swin Transformer classification, XGBoost/DNN/GNN energy prediction, and PPO reinforcement learning for routing, achieved mAP = 94.3% and R2 = 0.96 on a locally curated Saudi-specific dataset. Key results: Swin refinement provides a +8.5 pp F1 improvement for composite waste (Figure 2); GNN-based spatial modeling reduces logistics costs by 23% (Table 8); and RL routing achieves 12.1% higher landfill diversion and 15.2% more renewable energy versus rule-based baselines (Table 8). Geographic generalization to an independent Jeddah dataset yielded a modest 2.6 pp mAP decline, with composite categories identified for further domain adaptation. The framework is economically viable (payback ∼1.5 years), environmentally beneficial (∼8,400 tonnes CO2-equivalent avoided annually), and technically scalable from pilot to national rollout. Future work will address hazardous waste coverage, NIR integration, and federated GCC collaboration.
Author contributions statement
All authors designed, wrote, reviewed and approved the final version of the manuscript.
Ethics approval and consent to participate
Not applicable. This study did not involve human participants, human data, or human tissue.
Consent for publication
Not applicable. This article does not contain any individual person's data in any form.

