An AI-driven multimodal framework for automated waste segregation and energy recovery in smart cities: a case study for Saudi Vision 2030

Alshraah, Shadi; Alajmi, Loubna

doi:10.1108/AGJSR-10-2025-0187

Purpose

This study develops an AI-driven multimodal framework integrating automated waste segregation with energy recovery prediction to support Saudi Vision 2030 sustainability goals.

Design/methodology/approach

A two-stage vision pipeline (YOLOv9 + Swin Transformer) performs real-time waste detection and classification. Multimodal physicochemical features are modeled using XGBoost, deep neural networks (DNNs), and graph neural networks (GNNs) to predict energy recovery potential. Reinforcement learning (RL) optimizes routing of waste streams to appropriate facilities. Experiments use a Saudi-specific dataset of 50,000 annotated images and 5,000 physico-chemical records.

Findings

The integrated framework achieved mAP = 94.3% and R² = 0.96, improving landfill diversion and renewable energy contribution by 12.1% and 15.2%, respectively, compared with baseline models.

Research limitations/implications

Hazardous waste remains underrepresented in the dataset. Future work will address this via targeted data collection and active learning.

Practical implications

The framework provides a deployable solution for real-time waste classification and energy estimation in smart-city contexts, with a projected payback period of approximately 1.5 years for a 1,000-bin deployment.

Originality/value

This study introduces the first Saudi-specific multimodal waste dataset and a unified AI framework bridging waste segregation, energy prediction, and smart-city optimization—an end-to-end solution absent from prior literature.

1. Introduction

Rapid urbanization and population growth in Saudi Arabia have led to a sharp increase in municipal solid waste (MSW) generation, exceeding 15 million tons annually (Alhumoud & Alhumoud, 2021). Traditional reliance on landfilling and manual segregation is environmentally unsustainable, contributing to greenhouse gas emissions, soil contamination, and missed resource recovery opportunities (Hoornweg & Bhada-Tata, 2012). These challenges underscore the need for innovative solutions aligned with Saudi Vision 2030, which emphasizes sustainability, smart city infrastructure, and renewable energy integration (Kingdom of Saudi Arabia, 2017).

Artificial Intelligence (AI) has emerged as a key enabler in modern waste management (Rada et al., 2020). When combined with Internet of Things (IoT) technologies (Pal & Shankar, 2023; Darem et al., 2021; Yitmen, 2023), AI systems support real-time waste monitoring and data-driven decision-making essential for scaling smart city solutions (Sosunova & Porras, 2022). Technologies such as anaerobic digestion, pyrolysis, and waste-to-energy incineration can transform waste into electricity, heat, and biogas (Abdel-Shafy & Mansour, 2018). AI enhances these processes by predicting calorific values and optimizing waste stream allocation (Zieleńska & Bułkowska, 2024).

Despite growing global interest, most prior studies address either waste segregation or energy recovery in isolation. Existing frameworks such as WasteNet, DeepWaste, and SmartBinAI are siloed: none integrates both tasks within a unified, real-time pipeline. The comparative gap is summarized in Table 1.

Table 1

Comparison of the proposed framework against prior systems across four critical dimensions. The proposed framework is the only system to satisfy all four criteria simultaneously

System	Saudi dataset	Integrated pipeline	Real-time energy	Smart city integration	Ref.
WasteNet	No	No	No	No	Prior
DeepWaste	No	Partial	No	No	Prior
SmartBinAI	No	Partial	No	Partial	Prior
Proposed Framework	Yes	Yes	Yes	Yes	This

This paper bridges that gap by introducing a multimodal, AI-driven framework making three original contributions not found in combination in any prior system:

A two-stage vision pipeline (YOLOv9 + Swin Transformer) for high-accuracy waste detection and classification, including composite categories, with full hyperparameter reproducibility (Tables 2, 3, 5).
A multimodal regression module (XGBoost + DNN + GNN) linking detected waste fractions to real-time calorific value and biogas yield predictions via physicochemical measurements (Table 7).
An integrated reinforcement learning layer with an explicit reward function and state–action space optimizing routing to maximize energy recovery and landfill diversion (Table 8).

The novelty is threefold: (1) the first large-scale Saudi-specific multimodal dataset linking imagery with physico-chemical records; (2) the first unified end-to-end pipeline from visual detection to energy-routing optimization; and (3) a fully specified RL–GNN integration for city-scale waste logistics, validated on independent geographic data.

2. Related work

2.1 Waste management context in Saudi Arabia

Saudi Arabia generates over 15 million tons of MSW annually, with 1.5–2.5 kg per capita per day in major cities. The country's landfill-dependent system produces approximately 3.2 million tonnes of CO₂-equivalent methane emissions annually while recovering less than 5% of resources. Vision 2030 targets 40% waste diversion by 2030, but only 12% had been achieved by 2023. MSW composition is dominated by organic matter (40–50%), distinguishing it from GCC peers and limiting the applicability of global AI models trained on different waste profiles.

2.2 Artificial Intelligence for waste classification

Deep learning and computer vision have substantially improved automated waste classification, with YOLO-family detectors and transformer-based models exceeding 90% accuracy on benchmarks such as TrashNet and TACO. However, these models rely on datasets from Western or Southeast Asian contexts. Domain-specific Saudi challenges include high ambient temperatures (40–50°C), dust accumulation, culturally specific items (e.g. Zamzam bottles, prayer mats), and multilingual interface requirements. Transfer-learning experiments by Yitmen (2023) demonstrated 17% accuracy degradation when Western-trained models were deployed on Saudi waste streams without fine-tuning, motivating a locally curated dataset.

2.3 IoT and smart city integration

IoT technologies enable smart bins to stream continuous fill-level and image data, supporting dynamic route planning with operational efficiency gains of up to 40%. Saudi initiatives such as NEOM's smart waste systems show early progress, yet most remain limited to sensing without integrated AI for automated sorting or energy prediction. The proposed framework embeds deep learning into IoT-enabled bins and links them with cloud-based energy analytics aligned with Vision 2030. Governance is addressed through SDAIA's privacy-by-design mandate (SDAIA, 2023), and stratified data collection across diverse socioeconomic zones mitigates algorithmic fairness risks (Esmaeilian et al., 2018).

2.4 Energy recovery and predictive modeling

Waste-to-energy (WTE) technologies convert organic and non-recyclable waste into electricity, heat, or biogas. Recent machine learning approaches predict calorific value and biogas yield, but typically assume homogeneous waste inputs and lack integration with AI-driven segregation. The proposed framework addresses these gaps by combining vision-based classification with physicochemical data for real-time energy prediction. Saudi-specific parameters (organic waste yielding 0.42–0.58 m³ CH₄/kg VS) are incorporated to improve local accuracy. Table 1 positions this work against prior systems across four critical dimensions.

3. Methodology

3.1 Framework overview

The proposed methodology is structured as a two-stage AI framework embedded into an IoT-enabled smart city ecosystem. At the edge level, smart bins with cameras perform real-time detection and classification. The cloud layer refines predictions and estimates energy recovery potential. The integration layer coordinates facility allocation using RL, feeding results into the municipal dashboard and smart grid. The complete pipeline is illustrated in Figure 1.

Figure 1

A diagram of the overall system architecture for a smart waste management system.


The diagram represents the overall system architecture for a smart waste management system. It is divided into four layers: Integration Layer, Cloud Layer, Communication Layer, and Edge Layer. The Integration Layer includes the RL Optimizer, Municipal Dashboard, Smart Grid API, and Citizen App. The Cloud Layer consists of the Vision Module, Energy Prediction, and Database. The Vision Module uses YOLOv9 for detection and Swin-T for refinement, outputting waste categories. Energy Prediction uses XGBoost, DNN, and GNN for calorific value prediction, biogas yield estimation, and network-aware optimization. The Database is PostgreSQL. The Communication Layer includes 4G/5G Network, MQTT Broker, AWS IoT Core, and Data Streaming. The Edge Layer features Smart Bin 1, Smart Bin 2, and Smart Bin N, each equipped with various sensors and cameras. The diagram shows the flow of data and interactions between these components.

Overall system architecture

The end-to-end data flow operates as follows. Step 1: YOLOv9 detects waste objects and generates bounding boxes with class probabilities. Step 2: ambiguous detections are forwarded to the cloud-hosted Swin Transformer for refined classification. Step 3: classification outputs are converted into compositional fractions that, combined with sensor-measured physicochemical features, form the input vector for the XGBoost and DNN regressors. Step 4: the GNN aggregates node-level energy predictions across the city-wide collection network, incorporating facility capacities and transportation costs into each node embedding. Step 5: the RL agent takes the GNN-enriched state as input and selects routing actions, with rewards derived from diversion rates and energy output. This explicit causal chain ensures each component is functionally integrated rather than independently stacked.
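Step 2 of this data flow (forwarding ambiguous detections to the cloud) can be illustrated with a minimal sketch. The paper does not state how ambiguity is defined; the confidence threshold of 0.80, the `Detection` container, and the function names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: the paper does not define "ambiguous" numerically.
CONFIDENCE_THRESHOLD = 0.80


@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2) in pixels
    class_probs: dict   # class name -> probability from the edge detector


def top_class(det):
    """Return (label, probability) of the most likely class."""
    label = max(det.class_probs, key=det.class_probs.get)
    return label, det.class_probs[label]


def split_by_confidence(detections, threshold=CONFIDENCE_THRESHOLD):
    """Separate detections accepted at the edge from those sent to the cloud."""
    accepted, forwarded = [], []
    for det in detections:
        _, prob = top_class(det)
        (accepted if prob >= threshold else forwarded).append(det)
    return accepted, forwarded


detections = [
    Detection((10, 10, 120, 200), {"Plastic": 0.95, "Composite": 0.05}),
    Detection((150, 40, 300, 220), {"Plastic": 0.52, "Composite": 0.48}),
]
accepted, forwarded = split_by_confidence(detections)
```

In this toy input the second detection, split almost evenly between Plastic and Composite, is the one that would be forwarded for refinement.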

3.2 Data collection and annotation

3.2.1 Image dataset construction

A representative visual dataset was constructed using 12-MP IMX477 cameras deployed at twenty sites across Riyadh and Jeddah between March and August 2024, in collaboration with Riyadh Municipality and the Jeddah City Development Authority. Sites covered residential districts, commercial areas, universities, hospitals, and public spaces to capture demographic and environmental variability, including a 35% rise in organic waste during Ramadan. A total of 127,000 raw images were collected and curated to 50,000 high-quality annotated samples. All images were captured at 1920×1080 resolution and resized to 640×640 pixels for YOLOv9 and 224×224 pixels for the Swin Transformer. Preprocessing applied ImageNet normalization (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]).
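The ImageNet normalization step can be sketched as follows. The mean and standard deviation are those stated above; the assumption that pixel values are first scaled from 8-bit integers to [0, 1] follows common practice and is not stated in the text, and the function names are illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def normalize_image(image):
    """Apply normalize_pixel to an image stored as rows of (R, G, B) tuples."""
    return [[normalize_pixel(px) for px in row] for row in image]


# A 1x2 toy image: one black and one white pixel.
tiny = [[(0, 0, 0), (255, 255, 255)]]
normalized = normalize_image(tiny)
```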

To expand category diversity, 8,000 images from public datasets were incorporated after domain adaptation. The source breakdown and adaptation methods are detailed in Table 2. Domain adaptation effectiveness was confirmed by a domain classifier whose accuracy fell from 94% (pre-adaptation) to 61% (post-adaptation), indicating substantially reduced distribution shift. External images were relabeled using the Saudi taxonomy; items absent from the taxonomy were excluded. All external samples underwent the same inter-annotator quality check as Saudi images.

Table 2

External dataset sources, number of images used, inclusion criteria, and domain adaptation techniques applied

Source	Images used	Total available	Inclusion criteria	Domain adaptation applied
TrashNet	3,200	2,527^a	6 overlapping classes	CycleGAN style transfer; color normalization to Saudi lighting profile
TACO	3,100	1,500+	Litter categories matching Saudi taxonomy	Background augmentation with Saudi urban textures; dust simulation
WCD	1,700	5,000	High-resolution outdoor bins	Histogram equalization; temperature-tone correction

Note(s): ^a TrashNet augmented to 3,200 via mirroring and cropping prior to style transfer

3.2.2 Final dataset statistics

The final dataset contains 50,000 Saudi-context images supplemented by 8,000 harmonized external samples. Data were partitioned at the bin level (70% training, 15% validation, 15% testing) to prevent leakage. Table 3 summarizes the dataset composition across all eight waste categories, including image counts, percentages, average items per image, and visual complexity ratings.
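Partitioning at the bin level means that every image from a given bin lands in the same partition. A minimal sketch of such a grouped split is given below; the bin identifiers, file names, and the `split_by_bin` helper are hypothetical, and the seed of 42 mirrors the value reported in the experimental setup.

```python
import random


def split_by_bin(image_to_bin, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole bins to train/val/test so no bin spans two partitions."""
    bins = sorted(set(image_to_bin.values()))
    random.Random(seed).shuffle(bins)
    n_train = round(fractions[0] * len(bins))
    n_val = round(fractions[1] * len(bins))
    groups = {
        "train": set(bins[:n_train]),
        "val": set(bins[n_train:n_train + n_val]),
        "test": set(bins[n_train + n_val:]),
    }
    split = {name: [] for name in groups}
    for image, bin_id in image_to_bin.items():
        for name, members in groups.items():
            if bin_id in members:
                split[name].append(image)
    return split


# Hypothetical data: 20 bins with 5 images each.
image_to_bin = {
    f"img_{b}_{i}.jpg": f"bin_{b:02d}" for b in range(20) for i in range(5)
}
split = split_by_bin(image_to_bin)
```

Because the ratios are applied to bins rather than images, the image-level proportions only approximate 70/15/15 when bins contribute different numbers of images.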

Table 3

Final dataset composition across eight waste categories, including image counts, percentages, average items per image, and visual complexity ratings

Category	Images	Percentage	Avg. Items/Image	Visual complexity
Plastic	14,200	28.4%	2.3	High (transparent, colored, reflective)
Organic	11,800	23.6%	1.7	Medium (irregular shapes, moisture variation)
Paper/Cardboard	8,900	17.8%	1.9	Medium (folds, stains, varying brightness)
Metal	5,400	10.8%	1.3	Low–Medium (shiny surfaces, occlusions)
Glass	3,200	6.4%	1.2	Low (rare items, high reflectivity)
Textile	2,400	4.8%	1.5	High (deformable, patterned)
E-Waste	1,850	3.7%	1.6	High (irregular, multi-material)
Composite/Mixed	2,250	4.5%	2.8	Very High (food-contaminated packaging, multi-material)
Total	50,000	100%	–	–

The dataset is markedly imbalanced: Plastic and Organic samples together account for 52.0% of images, while the minority classes E-Waste (3.7%) and Textile (4.8%) are sparsely represented. Although the Composite/Mixed category makes up only 4.5% of the dataset, it is both the most visually complex and the most difficult to classify, with the highest average item count per image (2.8).

3.2.3 Annotation quality assurance

All labels followed a standardized taxonomy. Class definitions: Organic = food-derived material with visible biological degradation; Plastic = any petroleum-based polymer item; Composite = any item where a secondary material covers ≥50% of the surface. Items with less than 50% contamination retain their primary material label. Overlapping items are labeled for the dominant visible fraction, with secondary labels stored in metadata. Disagreements between the two independent annotators were resolved by majority vote among three annotators; the fewer than 3% of images that remained contested were escalated to an expert reviewer. All twelve annotators completed a standardized two-day training session covering the taxonomy and the VGG Image Annotator (VIA) tool. A three-stage quality pipeline was implemented: (1) automated bounding-box integrity checks, (2) inter-annotator agreement scoring (Cohen's κ = 0.87), and (3) expert review for hazardous or composite images.
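The inter-annotator agreement score in stage (2) is Cohen's κ, which corrects the observed agreement for the agreement expected by chance. A minimal sketch of the computation follows; the two label lists are invented for illustration and do not reproduce the reported κ = 0.87.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on six images.
annotator_1 = ["Plastic", "Plastic", "Organic", "Metal", "Composite", "Organic"]
annotator_2 = ["Plastic", "Composite", "Organic", "Metal", "Composite", "Organic"]
kappa = cohens_kappa(annotator_1, annotator_2)
```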

3.3 Physico-chemical data collection

A total of 5,000 physico-chemical records were obtained from samples collected at the same twenty sites and analyzed at the Saudi Aramco Environmental Laboratory and the National Waste Management Center in Riyadh. Laboratory protocols were as follows: calorific value by bomb calorimetry (ISO 1928:2009) on oven-dried samples (105°C, 24 h), with measurement uncertainty ±0.15 MJ/kg (k = 2); moisture content by gravimetric method (ASTM D2216), wet basis, measured before grinding; bulk density by standardized cylindrical container (EN 1097-3); biogas yield by Biochemical Methane Potential assay (VDI 4630) with mesophilic inoculum at 37°C ± 0.5°C, an inoculum-to-substrate ratio of 2:1 on a volatile-solids basis, 30-day incubation, daily gas measurement by water displacement, and methane content analyzed by gas chromatography (Shimadzu GC-2014). The pipeline linking detected category → predicted fraction → predicted calorific value is illustrated in Figure 2.
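Two of the quantities in these protocols reduce to simple arithmetic: wet-basis moisture content is the mass lost on drying divided by the as-received mass, and a 2:1 inoculum-to-substrate ratio fixes the inoculum volatile solids for a given substrate charge. The sketch below shows both; the sample masses and function names are hypothetical.

```python
def moisture_wet_basis(wet_mass_g, dry_mass_g):
    """Moisture content as a percentage of the wet (as-received) mass."""
    if wet_mass_g <= 0 or not 0 <= dry_mass_g <= wet_mass_g:
        raise ValueError("masses must satisfy 0 <= dry <= wet and wet > 0")
    return 100.0 * (wet_mass_g - dry_mass_g) / wet_mass_g


def inoculum_vs_required(substrate_vs_g, isr=2.0):
    """Inoculum volatile solids (g) for a given inoculum-to-substrate ratio."""
    return isr * substrate_vs_g


# Hypothetical sample: 200 g as received, 156 g after oven drying.
moisture = moisture_wet_basis(wet_mass_g=200.0, dry_mass_g=156.0)
# Hypothetical assay charge: 5 g of substrate volatile solids.
inoculum_vs = inoculum_vs_required(substrate_vs_g=5.0)
```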

Figure 2

A confusion matrix for waste classification across six categories.


The confusion matrix displays classification results for six waste categories: Plastic, Metal, Paper, Organic, Hazardous, and Composite. The matrix has six rows and six columns, with each row representing the true class and each column representing the predicted class. The diagonal elements show the correct classifications, with values such as 480 for Plastic, 460 for Metal, 470 for Paper, 450 for Organic, 90 for Hazardous, and 400 for Composite. Off-diagonal elements indicate misclassifications, with notable values such as 10 Plastic items misclassified as Composite and 20 Organic items misclassified as Composite.

Confusion matrix for YOLOv9 + Swin classification across six waste categories, illustrating robust performance even on ambiguous composite classes

3.4 Waste segregation model

3.4.1 Model selection rationale

Five detection architectures were evaluated on a 5,000-image validation subset along three criteria: accuracy, inference speed, and edge deployment feasibility. As shown in Table 4, YOLOv9-tiny achieves the best accuracy–latency compromise: mAP@0.5 = 92.8% at 45 ms per frame, satisfying the real-time constraint of less than 100 ms on Raspberry Pi 4 hardware.

Table 4

Detection model comparison on the 5,000-image validation subset

Model	mAP@0.5	Inference time (ms)	Parameters (M)	Edge deployment
YOLOv9-tiny ★ Selected	92.8%	45	7.2	✓ Feasible
YOLOv8	91.3%	78	11.4	⚠ Marginal
Faster R-CNN	93.1%	312	41.8	✕ Too slow
EfficientDet	90.7%	156	6.6	✕ Too slow
DETR	89.4%	287	86.4	✕ Too slow

Note(s): YOLOv9-tiny (★) selected for best accuracy–latency trade-off and edge deployment feasibility

Although Faster R-CNN attains slightly higher accuracy (+0.3 pp), its 312 ms latency violates real-time constraints. For composite and ambiguous categories, Swin Transformer was compared against ResNet-152, EfficientNet-B7, and standard ViT, as shown in Table 5. Swin Transformer delivers the highest composite F1-score (89.7%) while meeting the cloud inference budget (less than 500 ms). Its shifted-window mechanism reduces attention complexity from O(n²) to approximately O(n) (Liu et al., 2021).

Table 5

Classification model performance on composite and ambiguous waste categories

Model	Overall accuracy	Composite F1-Score	Attention mechanism
Swin Transformer ★ Selected	94.3%	89.7%	Shifted-window attention
ResNet-152	91.7%	84.2%	Convolution only
EfficientNet-B7	92.1%	85.8%	Convolution only
ViT	92.1%	86.4%	Global self-attention

Note(s): Swin Transformer (★) selected for highest composite F1-score
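The detector choice reported in Table 4 can be read as a constrained selection: keep the candidates under the 100 ms edge latency budget, then take the most accurate one. The sketch below applies that rule to the Table 4 figures. It is a simplification, since the paper also weighs edge deployment feasibility, and the function name is illustrative.

```python
# (model, mAP@0.5 in %, inference time in ms), copied from Table 4.
CANDIDATES = [
    ("YOLOv9-tiny", 92.8, 45),
    ("YOLOv8", 91.3, 78),
    ("Faster R-CNN", 93.1, 312),
    ("EfficientDet", 90.7, 156),
    ("DETR", 89.4, 287),
]


def select_detector(candidates, max_latency_ms=100):
    """Pick the most accurate model whose latency is under the budget."""
    feasible = [c for c in candidates if c[2] < max_latency_ms]
    if not feasible:
        raise ValueError("no candidate meets the latency budget")
    return max(feasible, key=lambda c: c[1])


selected = select_detector(CANDIDATES)
```

Under this rule Faster R-CNN is excluded by latency despite its higher mAP, which matches the reasoning given in the text.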

3.4.2 Hyperparameter configuration

Complete hyperparameter settings for both models are provided in Table 6 to ensure full reproducibility. This includes input resolutions, optimizer settings, learning rate schedulers, anchor configurations, batch normalization details, augmentation probabilities, and dropout values for both YOLOv9-tiny and the Swin Transformer.

Table 6

Complete hyperparameter configuration for YOLOv9-tiny and Swin Transformer, provided for full reproducibility

Parameter	YOLOv9-tiny	Swin transformer
Input resolution	640 × 640	224 × 224
Backbone	CSPDarknet + PAN neck	Swin-B (pretrained ImageNet-22K)
Optimizer	SGD (momentum = 0.9, weight decay = 5 × 10⁻⁴)	AdamW (weight decay = 0.01)
Learning rate	0.01 (cosine decay to 0.001)	5 × 10⁻⁵ (linear warmup, 5 epochs)
Batch size	64	32
Epochs	200	100
Anchor settings	Auto-anchored: 3 scales × 3 ratios on training set	–
Normalization	BatchNorm (momentum = 0.03, eps = 1 × 10⁻³)	LayerNorm
Augmentation	Mosaic (p = 0.8), flip (p = 0.5), HSV jitter (h = 0.015)	RandAugment (n = 2, m = 9), random crop, color jitter
Dropout	–	0.1 (attention), 0.0 (path drop)
LR scheduler	Cosine annealing with warm restart	Linear warmup + cosine decay

3.4.3 Model formulation

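Given input image $I$, YOLOv9 produces bounding boxes $\{b_i\}$, each associated with a predicted class distribution $p(c \mid b_i)$. The compound detection loss is

$$\mathcal{L}_{\text{det}} = \lambda_{\text{cls}}\, L_{\text{CE}}\big(p(c \mid b_i),\, y_i\big) + \lambda_{\text{box}}\,\big(1 - \text{IoU}(b_i, \hat{b}_i)\big) \quad (1)$$

where $\lambda_{\text{cls}} = 0.5$ and $\lambda_{\text{box}} = 0.05$. The Swin Transformer processes the feature representation $z^{l}$ through

$$\hat{z}^{l} = \text{MSA}\big(\text{LN}(z^{l})\big) + z^{l} \quad (2)$$

$$z^{l+1} = \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l} \quad (3)$$

where MSA denotes multi-head self-attention, LN layer normalization, and MLP a multilayer perceptron. The final prediction is

$$\hat{y} = \arg\max_{c \in C}\, p(c \mid I) \quad (4)$$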
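As a worked illustration of Equations (1) and (4), the sketch below evaluates the loss for a single box using the stated weights (0.5 for the classification term, 0.05 for the box term). The class probabilities and box coordinates are invented, the helper names are illustrative, and the per-box form shown here leaves out any averaging over boxes that a full training loop would apply.

```python
import math

LAMBDA_CLS = 0.5    # classification weight from the text
LAMBDA_BOX = 0.05   # box-regression weight from the text


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_loss(class_probs, true_class, pred_box, true_box):
    """Per-box loss: weighted cross-entropy plus weighted (1 - IoU)."""
    cross_entropy = -math.log(max(class_probs[true_class], 1e-12))
    return LAMBDA_CLS * cross_entropy + LAMBDA_BOX * (1.0 - iou(pred_box, true_box))


def predict_class(class_probs):
    """Equation (4): the class with the highest probability."""
    return max(class_probs, key=class_probs.get)


# Hypothetical prediction for one box.
probs = {"Plastic": 0.7, "Composite": 0.2, "Organic": 0.1}
loss = detection_loss(probs, "Plastic", (0, 0, 10, 10), (0, 0, 10, 5))
```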

3.5 Energy recovery prediction

The prediction task is formulated as supervised regression with input feature vector

$$\chi = [w, m, d, f_1, f_2, \ldots, f_N] \quad (5)$$

where $w$ is waste weight (kg), $m$ is moisture content (% wet basis), $d$ is bulk density (g/cm³), and $f_i$ is the fraction of waste category $i$. XGBoost minimizes the regularized objective

$$L = \sum_i (y_i - \hat{y}_i)^2 + \Omega(f_t) \quad (6)$$

The DNN applies layered nonlinear transformations

$$h^{(l)} = \sigma\big(W^{(l)} h^{(l-1)} + b^{(l)}\big) \quad (7)$$

with architecture: three hidden layers (256–128–64 neurons), ReLU activation, 0.3 dropout, Adam (lr = 1 × 10⁻³), 200 epochs with early stopping (patience = 20). Complete model comparison results are provided in Table 7.

Table 7

Energy recovery prediction model comparison on the held-out test set (n = 1,000)

Model	R²	MAE (MJ/kg)	RMSE	Train time	Interpretability
XGBoost	0.91	1.2	1.74	8 min	High (SHAP)
Random Forest	0.89	1.5	2.01	12 min	Moderate
SVR	0.84	2.1	2.89	45 min	Low
DNN (3 layers)	0.93	1.0	1.52	22 min	Low
GNN (GraphSAGE) ★	0.96	0.74	1.11	35 min	Moderate

Note(s): Best values in italic (★). Classification and regression metrics are reported separately to avoid conflation

Table 7 compares five models designed to predict the energy recovery of mixed municipal solid waste. The comparison is made on a held-out test set. Notably, the performance of the GNN (GraphSAGE) model is the highest, with a coefficient of determination of 0.96. Furthermore, the GNN had the lowest MAE (0.74 MJ/kg) and the lowest RMSE (1.11). Thus, graph-based architectures are advantageous in this case.
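The input to these regressors is the feature vector of Equation (5). A minimal sketch of how it might be assembled from classification outputs and sensor readings is given below. The paper does not say whether the category fractions are computed by item count, area, or mass; the count-based fractions here are an assumption, as are the function names and sample values. The category list follows Table 3.

```python
from collections import Counter

CATEGORIES = ["Plastic", "Organic", "Paper/Cardboard", "Metal",
              "Glass", "Textile", "E-Waste", "Composite/Mixed"]


def composition_fractions(detected_labels):
    """Share of detected items per category (count-based; an assumption)."""
    counts = Counter(detected_labels)
    total = sum(counts[c] for c in CATEGORIES)
    if total == 0:
        return [0.0] * len(CATEGORIES)
    return [counts[c] / total for c in CATEGORIES]


def build_feature_vector(weight_kg, moisture_pct, density_g_cm3, detected_labels):
    """Assemble [w, m, d, f_1, ..., f_N] in the order of Equation (5)."""
    return ([weight_kg, moisture_pct, density_g_cm3]
            + composition_fractions(detected_labels))


# Hypothetical bin: four detected items plus three sensor readings.
labels = ["Plastic", "Plastic", "Organic", "Paper/Cardboard"]
x = build_feature_vector(12.5, 22.0, 0.41, labels)
```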

3.6 Graph neural network for spatial modeling

A two-layer GraphSAGE GNN models spatial dependencies across the collection-treatment network. Graph construction: 847 nodes total (620 collection points, 48 transfer stations, 12 recycling centers, 8 anaerobic digesters, 5 WTE incinerators, 154 road routing hubs) and 2,341 directed edges. Edge weights encode transportation distance (km) normalized by facility capacity (tons/day), with additional features for average travel time and CO₂ emissions per km. Node features for collection nodes include predicted calorific value, compositional fractions, fill level, and historical throughput; facility nodes carry remaining capacity, processing cost, and energy output efficiency. The GNN message-passing update is

$$h_v^{(k+1)} = \sigma\Big(W^{(k)} \cdot \text{MEAN}\big(\{h_u^{(k)} : u \in N(v)\}\big)\Big) \quad (8)$$

GNN outputs feed directly into the RL state vector, linking both components explicitly. Training is supervised on node-level energy recovery labels (Adam, lr = 5 × 10⁻⁴, 100 epochs). Ablation results confirm a 23% logistics cost reduction versus point-wise routing (p < 0.001; Section 4.5.3).
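The mean-aggregation update of Equation (8) can be traced on a toy graph. The sketch below follows the equation as written, so it aggregates neighbor features only and does not concatenate each node's own representation as the original GraphSAGE formulation does. The graph, the feature values, the weight matrix, and the choice of ReLU for σ are all illustrative assumptions, and edge weights are ignored.

```python
def relu(vector):
    """Element-wise ReLU, used here as the nonlinearity sigma."""
    return [max(0.0, v) for v in vector]


def mat_vec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mean_aggregate(features, neighbors):
    """Element-wise mean of the neighbors' feature vectors."""
    dim = len(next(iter(features.values())))
    if not neighbors:
        return [0.0] * dim
    return [sum(features[u][i] for u in neighbors) / len(neighbors)
            for i in range(dim)]


def sage_layer(features, adjacency, weight):
    """One update h_v <- sigma(W * MEAN({h_u : u in N(v)})) for every node."""
    return {
        v: relu(mat_vec(weight, mean_aggregate(features, adjacency.get(v, []))))
        for v in features
    }


# Hypothetical toy graph: two collection points feeding one facility.
features = {"bin_a": [1.0, 0.0], "bin_b": [3.0, 2.0], "facility": [0.0, 1.0]}
adjacency = {"facility": ["bin_a", "bin_b"],
             "bin_a": ["facility"], "bin_b": ["facility"]}
weight = [[1.0, 0.0], [0.0, -1.0]]  # illustrative values, not trained
updated = sage_layer(features, adjacency, weight)
```

Stacking two such layers, as in the two-layer model described above, lets each node's embedding depend on nodes up to two hops away.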

3.7 Reinforcement learning for waste routing

A Proximal Policy Optimization (PPO) agent optimizes waste routing decisions. The RL formulation is fully specified. The state $\tau_t$ is a continuous vector comprising the GNN-encoded node embeddings for all $N$ network nodes, facility capacity utilization, current truck fill, and time of day. The action space is discrete over 25 facility nodes, with action masking for capacity-constrained facilities (≥95% capacity). The reward function is

$$R_t = 0.40\,\text{Eff} + 0.35\,\text{Div} - 0.15\,C_{\text{transport}} - 0.10\,P_{\text{violation}}$$

where Eff is normalized energy recovered (kWh/ton), Div is the landfill diversion fraction, $C_{\text{transport}}$ is normalized transportation cost (SAR/ton·km), and $P_{\text{violation}}$ is a capacity-violation penalty. The agent was trained entirely in a discrete-event simulation calibrated on six months of Riyadh operational data (truck schedules, facility throughput, waste volume records). Episodes span one operating day (6 AM to 10 PM), terminating when all bins are serviced or the time limit is reached. Discount factor γ = 0.99; PPO clip ratio ε = 0.2; training ran for 2,000 episodes, converging at approximately episode 1,200. Baselines compared: random routing, greedy nearest-facility, and rule-based routing. Full KPI comparison is provided in Table 8.

Table 8

System-level KPI comparison across routing strategies

Routing strategy	Diversion rate	Energy yield (MWh/yr)	Transport cost (SAR/ton)	Edge latency
Random routing	51.2%	13,120	112	45 ms
Greedy nearest-facility	57.4%	14,880	89	45 ms
Rule-based routing	59.1%	15,240	83	45 ms
RL (PPO) ★ Proposed	71.2%	17,520	64	45 ms

Note(s): RL (PPO) routing (★) demonstrates statistically significant improvements (p < 0.001) over all baselines on all metrics
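The reward function and the capacity-based action mask from Section 3.7 can be sketched as follows. The weights and the 95% masking threshold come from the text; the input values and function names are hypothetical, and all reward terms are assumed to be normalized to [0, 1] before weighting.

```python
WEIGHTS = {"energy": 0.40, "diversion": 0.35, "transport": 0.15, "violation": 0.10}
CAPACITY_MASK_THRESHOLD = 0.95  # facilities at or above this are masked


def reward(energy_norm, diversion_frac, transport_norm, violation_penalty):
    """R_t = 0.40*Eff + 0.35*Div - 0.15*C_transport - 0.10*P_violation."""
    return (WEIGHTS["energy"] * energy_norm
            + WEIGHTS["diversion"] * diversion_frac
            - WEIGHTS["transport"] * transport_norm
            - WEIGHTS["violation"] * violation_penalty)


def action_mask(capacity_utilization):
    """True where a facility may still be selected (utilization below 95%)."""
    return [u < CAPACITY_MASK_THRESHOLD for u in capacity_utilization]


# Hypothetical step: good energy and diversion, moderate transport cost.
r = reward(energy_norm=0.8, diversion_frac=0.7,
           transport_norm=0.4, violation_penalty=0.0)
# Hypothetical utilization of four facilities.
mask = action_mask([0.50, 0.95, 0.99, 0.10])
```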

4. Evaluation and results

4.1 Experimental setup

The framework was evaluated using 50,000 annotated waste images and 5,000 physicochemical records, partitioned at the bin level (70/15/15) to prevent data leakage. All models used random seed = 42 and identical five-fold stratified cross-validation splits. Deep learning models were trained on an NVIDIA A100 (40 GB) GPU; regression models on Intel Xeon CPUs. All results correspond to the held-out test set.

4.2 Waste segregation results

The two-stage pipeline achieved state-of-the-art results. YOLOv9-tiny achieved mAP@0.5 = 92.8% (Table 4); adding the Swin Transformer refinement improved overall classification to 94.3% (Table 5) and raised the composite waste F1-score from 81.2% to 89.7% (+8.5 pp). Figure 2 presents the confusion matrix across all six categories, identifying the most common misclassification (Composite ↔ Plastic) that motivated the Swin refinement stage.

4.3 Energy recovery prediction results

Regression models were evaluated on a 1,000-sample held-out test set; results are reported separately from classification metrics to avoid conflation. Table 7 presents the full model comparison. The GNN delivers the best overall performance (R² = 0.96, MAE = 0.74 MJ/kg) by capturing spatial dependencies across the collection network that point-wise models cannot access. Calorific value is reported in MJ/kg; biogas yield in m³ CH₄/kg VS; facility-level energy output in kWh, with all conversions stated on first use.

4.4 System-level KPIs

With RL optimization, the framework achieved a landfill diversion rate 12.1 percentage points higher (59.1% → 71.2%) and a 15.2% increase in renewable energy contribution compared with rule-based routing. Table 8 provides the full KPI comparison across all four routing strategies. Edge inference latency remained at 45 ms per image, well below the 100 ms real-time threshold. The 23% transportation cost reduction (GNN vs. point-wise routing) is supported by the ablation in Section 4.5.3 and corresponds to simulation results using the calibrated Riyadh operational model.

4.5 Ablation study and component-wise analysis

4.5.1 Vision pipeline ablation

YOLOv9 alone achieved 92.8% mAP@0.5. Adding Swin refinement increased overall mAP to 94.3% (+1.5 pp) and composite F1 from 81.2% to 89.7% (+8.5 pp). All other parameters were held constant across ablation runs.

4.5.2 Energy prediction ablation

XGBoost alone achieved R² = 0.91. Adding the DNN improved R² to 0.93. The full GNN-augmented pipeline achieved R² = 0.96, confirming that network-level spatial modeling captures dependencies unavailable at the sample level. Composite waste benefits most from GNN augmentation (R² gain of +0.06 vs. +0.03 for Organic).

4.5.3 RL component ablation

Replacing the GNN state representation with point-wise features in the RL agent reduced the diversion rate improvement from 12.1% to 7.3%, confirming the value of GNN–RL integration (p < 0.001). Removing the transportation cost penalty (reward weight set to 0 instead of 0.15) increased energy yield by only 0.4% but raised transport costs by 31%, validating the multi-objective reward design.

4.5.4 Feature importance analysis

SHAP explainability analysis for the XGBoost calorific value predictor identifies the five most influential physicochemical indicators. These are presented in Table 9, including SHAP values, direction of effect, and interpretation for each feature.

Table 9

SHAP feature importance for XGBoost calorific value predictor

Rank	Feature	SHAP value	Direction	Interpretation
1	Plastic fraction	+0.42	Positive	High plastic content strongly increases calorific value
2	Moisture content	−0.28	Negative	High moisture suppresses heating value
3	Density	+0.19	Positive	Denser samples show more stable combustion properties
4	Paper fraction	+0.15	Positive	Moderate positive contribution to heating value
5	Organic fraction	−0.12	Negative	High organic content lowers calorific value due to moisture

Plastic fraction (+0.42) is the strongest positive contributor; moisture content (−0.28) most strongly suppresses heating value. These insights support transparent model interpretation and facilitate communication with municipal regulators.

4.6 Statistical validation and robustness analysis

4.6.1 Cross-validation protocol

A five-fold stratified cross-validation scheme was applied independently to both the detection and energy-prediction components. All five models in the ANOVA comparison (proposed framework, WasteNet equivalent, YOLOv8 baseline, ResNet-152 baseline, random forest energy predictor) used identical folds and random seed = 42, with fold assignment at the bin level to eliminate data leakage. The vision model achieved mAP@0.5 = 94.3% (SD = 0.3%, 95% CI: [93.6%, 95.0%]). Energy prediction achieved R² = 0.956 (SD = 0.01, 95% CI: [0.94, 0.97]).

4.6.2 Statistical significance testing

Separate one-way ANOVAs were conducted for classification metrics (mAP@0.5) and regression metrics (R²) to avoid conflating distinct measurement types. For classification: F(4,20) = 47.3, p < 0.001. For energy prediction: F(4,20) = 32.8, p < 0.001. Post-hoc Bonferroni-corrected t-tests confirmed the proposed model significantly outperforms all four baselines in both domains. Cohen's d values range from 1.94 to 2.87, representing large to very large practical effects.
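To make the degrees of freedom concrete: five models evaluated on five folds give 25 observations, hence F(4, 20). The sketch below computes a one-way ANOVA F statistic and a pooled-variance Cohen's d in plain Python. The per-fold scores are made-up placeholders, not the study's data, and the 0.05 family-wise significance level used for the Bonferroni threshold is an assumption.

```python
from statistics import mean, variance


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def cohens_d(a, b):
    """Cohen's d using the pooled sample variance of the two groups."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical per-fold mAP scores (percent) for five models, five folds each.
fold_scores = [
    [94.0, 94.6, 94.3, 94.1, 94.5],  # proposed framework
    [91.2, 91.8, 91.5, 91.0, 91.6],  # placeholder baseline 1
    [90.4, 90.9, 90.1, 90.7, 90.5],  # placeholder baseline 2
    [88.9, 89.5, 89.1, 89.3, 88.8],  # placeholder baseline 3
    [87.5, 88.1, 87.9, 87.6, 88.0],  # placeholder baseline 4
]
f_stat, df_between, df_within = one_way_anova(fold_scores)

# Bonferroni threshold for four pairwise comparisons against the proposed model,
# assuming a family-wise alpha of 0.05.
bonferroni_alpha = 0.05 / 4
```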

4.6.3 Confidence intervals and uncertainty quantification

Bootstrap confidence intervals (10,000 samples): mAP CI = [93.4%, 95.2%]; composite F1 CI = [87.9%, 91.3%] (wider due to label ambiguity); energy R² CI = [0.94, 0.98]. Illustrative prediction interval: for a sample with 35% plastic, 22% moisture, and density 0.41 g/cm³, predicted calorific value is 18.7 MJ/kg with 95% prediction interval [16.3, 21.1] MJ/kg.
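A percentile bootstrap of the kind described here can be sketched in a few lines. The example assumes the statistic of interest is a mean over per-sample scores, which may differ from how mAP or R² were actually resampled in the study; the score list is a hypothetical placeholder.

```python
import random


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-sample scores, for illustration only.
scores = [0.90, 0.95, 0.97, 0.92, 0.96, 0.94, 0.93, 0.98, 0.91, 0.95]
ci_low, ci_high = bootstrap_ci(scores)
```

The reported prediction interval for the illustrative sample is symmetric about the point estimate: 18.7 ± 2.4 MJ/kg gives [16.3, 21.1] MJ/kg.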

4.6.4 Robustness to noise and missing data

With 15% Gaussian noise injected into test images, the system maintained mAP > 90% and R² > 0.89. XGBoost's built-in missing-value handling preserved R² = 0.87 with 20% randomly missing sensor features, demonstrating operational resilience to sensor dropout.
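The two perturbations can be sketched as below. Two interpretations are assumed here because the text does not pin them down: "15% Gaussian noise" is taken to mean a standard deviation of 15% of the pixel range, and "20% missing" is applied as a fixed fraction of features per record. The paper and organic fractions in the example record are placeholders.

```python
import random


def add_gaussian_noise(pixels, sigma_fraction=0.15, max_value=255, seed=42):
    """Add zero-mean Gaussian noise (sigma = sigma_fraction * max_value), clipped to range."""
    rng = random.Random(seed)
    sigma = sigma_fraction * max_value
    return [min(max_value, max(0, round(p + rng.gauss(0, sigma)))) for p in pixels]


def drop_features(record, missing_rate=0.20, seed=42):
    """Replace a random subset of feature values with None to mimic sensor dropout."""
    rng = random.Random(seed)
    keys = sorted(record)
    n_missing = round(missing_rate * len(keys))
    dropped = set(rng.sample(keys, n_missing))
    return {k: (None if k in dropped else v) for k, v in record.items()}


# Flattened grayscale pixel values (toy example).
noisy_pixels = add_gaussian_noise([0, 64, 128, 192, 255] * 4)

# Sensor record: the first three values follow the illustrative sample in
# Section 4.6.3; the last two are hypothetical.
record = {
    "plastic_fraction": 0.35,
    "moisture_content": 0.22,
    "density_g_cm3": 0.41,
    "paper_fraction": 0.18,
    "organic_fraction": 0.20,
}
degraded_record = drop_features(record)
```

Missing entries would be passed to XGBoost as NaN, which it treats as missing by default.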

4.6.5 External validation on Jeddah dataset

An independent Jeddah dataset (8,000 images, separate collection campaign) evaluated geographic generalization. mAP decreased 2.6 pp (94.3% → 91.7%) and R² decreased 0.03 (0.96 → 0.93). Per-class analysis reveals the composite category experienced the largest decline (−5.1 pp), attributable to tourism-related packaging absent from Riyadh training data. The domain shift for composite waste remains statistically significant (two-sample t-test, p = 0.031), indicating that Jeddah-specific fine-tuning would further improve performance. No domain adaptation was applied to Jeddah images in this evaluation (providing a true out-of-distribution test). Model drift will be monitored via quarterly performance audits, with automated retraining triggered when mAP falls below 90%.
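The drift-monitoring rule reduces to a threshold check on audited mAP. The sketch below reproduces the reported percentage-point drop and applies the 90% retraining trigger; the function names and the quarterly audit values are hypothetical.

```python
RETRAIN_THRESHOLD_MAP = 90.0  # percent, per the monitoring policy described above


def percentage_point_drop(reference, observed):
    """Difference between two percentages, in percentage points."""
    return round(reference - observed, 1)


def needs_retraining(audit_map, threshold=RETRAIN_THRESHOLD_MAP):
    """True when an audited mAP (in percent) falls below the retraining threshold."""
    return audit_map < threshold


# Riyadh cross-validation mAP versus the Jeddah external validation mAP.
jeddah_drop = percentage_point_drop(94.3, 91.7)

# Hypothetical sequence of quarterly audit results.
quarterly_audits = [93.8, 92.1, 90.4, 89.7]
retrain_flags = [needs_retraining(m) for m in quarterly_audits]
```

Under this rule the Jeddah result of 91.7% would not by itself trigger retraining, although the composite category may still warrant targeted fine-tuning.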

5. Practical implementation and scalability analysis

5.1 Deployment architecture and infrastructure requirements

The framework adopts a scalable edge–cloud architecture. Each smart bin integrates a low-power embedded board, HD camera, environmental sensors, 4G/LoRa connectivity, and a solar-powered IP65 enclosure at an estimated SAR 1,500–1,800 per unit, enabling on-device YOLOv9 inference. Monthly cloud costs for a 1,000-bin fleet remain below SAR 10,000. Deployment proceeds in three stages: a 100-bin pilot in Riyadh (3 months), expansion to 1,000 bins within year one, and nationwide rollout of approximately 10,000 bins over years two and three.

5.2 Economic analysis and return on investment

Total annual costs for a 1,000-bin deployment are estimated at SAR 0.8–0.9 million, including five-year hardware amortization, cloud services, maintenance, and staffing. The framework generates approximately SAR 3.9–4.0 million in annual value through increased landfill diversion, higher material recovery, improved WTE output, and operational efficiencies. This yields a payback period of approximately 1.5 years and a five-year benefit–cost ratio above 4:1. Sensitivity tests confirm the payback remains under two years even under conservative assumptions.
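The benefit–cost ratio can be checked directly from the stated ranges. The sketch assumes annual costs and benefits are constant and undiscounted over the five years, so the five-year ratio equals the annual ratio. The payback period also depends on upfront investment and ramp-up assumptions that are not itemized in this section, so it is not recomputed here.

```python
BINS = 1_000
UNIT_COST_SAR = (1_500, 1_800)      # per-bin hardware estimate (Section 5.1)
ANNUAL_COST_SAR = (0.8e6, 0.9e6)    # includes five-year hardware amortization
ANNUAL_VALUE_SAR = (3.9e6, 4.0e6)   # diversion, recovery, WTE output, efficiencies

# Upfront hardware outlay implied by the per-unit estimate.
hardware_capex = tuple(BINS * cost for cost in UNIT_COST_SAR)

# Most conservative ratio: lowest value over highest cost, and vice versa.
bcr_low = ANNUAL_VALUE_SAR[0] / ANNUAL_COST_SAR[1]
bcr_high = ANNUAL_VALUE_SAR[1] / ANNUAL_COST_SAR[0]

# Net annual benefit in the most conservative case.
net_benefit_low = ANNUAL_VALUE_SAR[0] - ANNUAL_COST_SAR[1]

print(f"Hardware capex: SAR {hardware_capex[0]:,} to {hardware_capex[1]:,}")
print(f"Benefit-cost ratio: {bcr_low:.2f} to {bcr_high:.2f}")
```

Even the conservative end of the range stays above the 4:1 ratio reported above.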

5.3 Real-world performance translation

CO₂ reduction estimates are derived from: avoided landfill methane calculated using the IPCC MCF default of 1.0 for managed anaerobic landfills; displaced electricity at 0.49 kg CO₂/kWh; and avoided transport emissions at 2.68 kg CO₂/L diesel. For Riyadh (11,000 tons MSW/day), the 15.2% energy recovery improvement yields approximately 2,280 additional MWh/year and avoids an estimated 8,400 tonnes CO₂-equivalent annually, with uncertainty bounds of approximately ±15%.
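The displaced-electricity component and the uncertainty bounds follow from simple arithmetic on the figures above. The sketch assumes the 8,400-tonne total is the sum of the three listed components, so the remainder is attributed jointly to avoided landfill methane and transport; that split is an inference, not a reported result.

```python
ADDITIONAL_MWH_PER_YEAR = 2_280
GRID_FACTOR_KG_PER_KWH = 0.49
TOTAL_AVOIDED_T = 8_400          # tonnes CO2-equivalent per year
UNCERTAINTY = 0.15               # approximately +/-15%

# MWh -> kWh, apply the grid emission factor, then kg -> tonnes.
electricity_t = ADDITIONAL_MWH_PER_YEAR * 1_000 * GRID_FACTOR_KG_PER_KWH / 1_000

# Remainder attributed to avoided methane and transport (assumed decomposition).
remaining_t = TOTAL_AVOIDED_T - electricity_t

lower_t = TOTAL_AVOIDED_T * (1 - UNCERTAINTY)
upper_t = TOTAL_AVOIDED_T * (1 + UNCERTAINTY)

print(f"Displaced electricity: {electricity_t:,.0f} t CO2")
print(f"Methane + transport (remainder): {remaining_t:,.0f} t CO2e")
print(f"Uncertainty range: {lower_t:,.0f} to {upper_t:,.0f} t CO2e")
```

On these figures, displaced electricity accounts for roughly 1,117 tonnes, so most of the estimated reduction would come from avoided landfill methane.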

5.4 Scalability considerations

Edge devices perform local inference independently; cloud resources scale horizontally with additional GPUs and storage. Costs grow sub-linearly from 100 to 10,000 bins due to batch processing and shared infrastructure. Core models transfer across cities with light fine-tuning (approximately 500 local images per city) to accommodate local waste profiles such as tourism-heavy Jeddah or industrial Dammam.

5.5 Integration with existing municipal systems

Standard REST APIs connect the system to GIS platforms for spatial waste pattern visualization, fleet management software for fill-level-based dynamic routing, and energy operators for 24-h-ahead calorific value forecasting to schedule WTE plants. A citizen-facing mobile app in Arabic and English provides bin guidance and segregation incentives. Automated reporting aligns with SDAIA data regulations and generates periodic waste composition and diversion statistics.

5.6 Limitations and deployment risks

Hazardous waste (batteries, medical waste) is underrepresented in the dataset, yielding lower per-class accuracy for these categories. Calorific-value models calibrated on laboratory measurements may overestimate industrial-scale performance due to feedstock heterogeneity. Long-tail categories (electronics, textiles) lack fine-grained routing decisions. Operational risks from extreme heat, dust, connectivity gaps, and vandalism are mitigated through ruggedized enclosures, health-monitoring routines, and fallback communication channels.

5.7 Future enhancements

The three highest-priority extensions are: (1) hazardous waste coverage expansion through targeted collection drives and synthetic augmentation of underrepresented classes; (2) near-infrared (NIR) spectroscopy integration for plastic subtype identification to improve calorific value precision; and (3) federated learning across GCC cities to enable privacy-preserving collaborative model improvement without centralizing raw data.

5.8 Ethical and sustainability considerations

Privacy-by-design principles govern image handling: raw images are stored locally for no more than 72 hours and never leave the edge device, and only anonymized classification labels are transmitted to the cloud. A real-time face-detection filter immediately deletes any image that triggers it. Algorithmic fairness is addressed through stratified data collection across diverse socioeconomic zones and quarterly per-district performance audits. Potential social bias from better-segregated affluent areas is mitigated by oversampling from lower-compliance zones and class-balanced training. Workers affected by automation are offered reskilling into technical maintenance and data quality roles. The carbon footprint of all model training is estimated at less than 0.5 tonnes CO₂, negligible relative to the 8,400 tonnes CO₂-equivalent avoided annually.
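The retention and transmission rules can be expressed as a small policy check. This is a hypothetical sketch of the logic only: the class, field names, and payload shape are assumptions, and no actual device firmware or cloud API is implied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_RETENTION = timedelta(hours=72)  # local storage limit stated above


@dataclass
class LocalImage:
    image_id: str
    bin_id: str          # hypothetical identifier of the originating bin
    captured_at: datetime
    label: str           # classification label produced on the edge device
    face_detected: bool  # output of the on-device face-detection filter


def should_delete(image, now):
    """Delete at once on a face hit, otherwise once the 72-hour window has passed."""
    return image.face_detected or (now - image.captured_at) > MAX_RETENTION


def cloud_payload(image):
    """Build the upload message: label metadata only, never pixel data."""
    return {"bin_id": image.bin_id, "label": image.label}


now = datetime(2025, 1, 4, 0, 0)
recent = LocalImage("a1", "bin_7", datetime(2025, 1, 1, 0, 0), "plastic", False)
expired = LocalImage("a2", "bin_7", datetime(2024, 12, 31, 23, 0), "paper", False)
flagged = LocalImage("a3", "bin_9", datetime(2025, 1, 3, 12, 0), "organic", True)
```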

6. Conclusion

This study proposed an AI-driven multimodal framework for automated waste segregation and energy recovery prediction within the context of Saudi Vision 2030. The integrated system, which spans YOLOv9 detection, Swin Transformer classification, XGBoost/DNN/GNN energy prediction, and PPO reinforcement learning for routing, achieved mAP = 94.3% and R² = 0.96 on a locally curated Saudi-specific dataset. Key results: Swin refinement provides a +8.5 pp F1 improvement for composite waste (Figure 2); GNN-based spatial modeling reduces logistics costs by 23% (Table 8); and RL routing achieves 12.1% higher landfill diversion and 15.2% more renewable energy versus rule-based baselines (Table 8). Geographic generalization to an independent Jeddah dataset yielded a modest 2.6 pp mAP decline, with composite categories identified for further domain adaptation. The framework is economically viable (payback ∼1.5 years), environmentally beneficial (∼8,400 tonnes CO₂-equivalent avoided annually), and technically scalable from pilot to national rollout. Future work will address hazardous waste coverage, NIR integration, and federated GCC collaboration.

Author contributions statement

All authors designed, wrote, reviewed and approved the final version of the manuscript.

Ethics approval and consent to participate

Not applicable. This study did not involve human participants, human data, or human tissue.

Consent for publication

Not applicable. This article does not contain any individual person's data in any form.

References

Abdel-Shafy, H. I., & Mansour, M. S. (2018). Solid waste issue: Sources, composition, disposal. Environmental Chemistry Letters, 16(2), 367–393.

Alhumoud, J. M., & Alhumoud, I. M. (2021). Economic viability and analysis of wastewater treatment processes in Kuwait. International Journal of Environment and Waste Management, 27(1), 21–34. doi: https://doi.org/10.1504/ijewm.2021.10033672.

Darem, A. A., Ghaleb, F. A., Al-Hashmi, A. A., Abawajy, J. H., Alanazi, S. M., & Al-Rezami, A. Y. (2021). An adaptive behavioral-based incremental batch learning malware variants detection model. IEEE Access, 9, 97180–97196.

Esmaeilian, B., Wang, B., Lewis, K., Duarte, F., Ratti, C., & Behdad, S. (2018). The future of waste management in smart and sustainable cities. Waste Management, 81, 177–195. doi: https://doi.org/10.1016/j.wasman.2018.09.047.

Hoornweg, D., & Bhada-Tata, P. (2012). What a waste: A global review of solid waste management. World Bank.

Kingdom of Saudi Arabia (2017). Saudi Vision 2030. Government Report.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).

Pal, C., & Shankar, R. (2023). A systematic inquiry of energy management in smart grid. International Journal of Energy Sector Management, 17(5), 989–1012.

Rada, E. C., Magaril, E. R., Schiavon, M., Karaeva, A., Chashchin, M., & Torretta, V. (2020). MSW management in universities: Sharing best practices. Sustainability, 12(12), 5084. doi: https://doi.org/10.3390/su12125084.

SDAIA (2023). National AI Ethics principles. Saudi Data and Artificial Intelligence Authority.

Sosunova, I., & Porras, J. (2022). IoT-enabled smart waste management systems for smart cities. IEEE Access, 10, 73326–73363. doi: https://doi.org/10.1109/access.2022.3188308.

Yitmen, I. (2023). Cognitive digital twins for smart lifecycle management of built environment. CRC Press.

Zieleńska, M., & Bułkowska, K. (2024). Agricultural wastes and their by-products for the energy market. Energies, 17(9), 2099.
