This study aims to develop a comparative forecasting framework for container throughput in Gulf ports by integrating statistical models, machine learning techniques, and Explainable AI (XAI) approaches. The objective is to improve predictive accuracy while enhancing transparency and interpretability, thereby supporting data-driven decision-making in maritime logistics. By combining explainability techniques, such as SHAP values, Partial Dependence Plots (PDPs), Granger causality, and feature ablation, the study addresses the growing need for interpretable forecasting models that foster trust and operational adoption in port management systems.
A comparative modeling strategy was employed to evaluate the performance of multiple forecasting techniques, including ARIMA, ETS, Prophet, LSTM and XGBoost. A multi-port dataset covering four major Gulf ports from 2017 to 2024 was collected and analyzed, incorporating variables such as vessel turnaround times, trade volumes and hinterland connectivity. To enhance model transparency, an XAI-based framework was integrated to identify feature importance, causal dependencies and nonlinear relationships among predictive factors. The models were assessed using mean absolute error (MAE), root mean squared error (RMSE) and the coefficient of determination (R²), providing a balanced performance evaluation.
The results reveal that machine learning models, particularly LSTM and XGBoost, consistently outperform traditional statistical models in predictive accuracy and adaptability. Integrating explainability techniques provided deeper insights into the key drivers of container throughput, highlighting the critical influence of vessel turnaround times, macroeconomic trends and hinterland connectivity. Moreover, the proposed framework demonstrates strong cross-port generalizability, maintaining high performance across different operational settings. These findings underscore the potential of combining advanced predictive analytics with interpretability to support informed, strategic decisions in maritime logistics.
The study's geographic scope is limited to four Gulf ports, which may constrain the global generalizability of its findings. However, the proposed framework provides a robust foundation for extending forecasting applications to other regions and operational contexts. Future studies should explore integrating real-time IoT and Automatic Identification System (AIS) data, digital twin environments and reinforcement learning to improve dynamic decision-making, enhance system responsiveness and enable adaptive operational control across diverse maritime networks.
The framework offers substantial value to port authorities, shipping companies and policymakers by enabling more accurate throughput forecasting and supporting capacity planning, berth allocation, resource optimization and strategic infrastructure investment. By providing explainable insights into container throughput drivers, the study enhances operational transparency and strengthens stakeholder confidence in adopting AI-driven solutions for sustainable and efficient port management.
This research contributes to the maritime logistics literature by integrating predictive modeling with explainability-driven insights, addressing a critical gap between forecasting performance and model interpretability. Unlike prior studies that focus solely on predictive accuracy, this study establishes a transparent, interpretable and generalizable forecasting framework applicable to multi-port environments. The results advance both theoretical understanding and practical applications, positioning this work as a novel contribution to developing trustworthy AI-driven decision-support systems in the maritime domain.
Introduction
Maritime logistics plays a central role in facilitating global trade and supporting the integration of regional economies. Container ports act as critical nodes within these networks, influencing supply chain efficiency, competitiveness and infrastructure development strategies. In the Gulf region, major ports such as Hamad, Jeddah, Salalah and Shuwaikh handle a significant share of containerized trade, supporting national economic diversification plans under frameworks like Saudi Vision 2030 and Qatar National Vision 2030. Accurate forecasting of container throughput has therefore become essential for optimizing operational performance, mitigating congestion and aligning infrastructure investments with future demand growth (Eskafi et al., 2021; Rashed et al., 2018).
Traditional statistical forecasting models, such as ARIMA and SARIMA, remain widely used because of their interpretability and simplicity (Farhan and Ong, 2018). However, these models often fail to capture nonlinear relationships, seasonal irregularities and abrupt disruptions in container flows caused by geopolitical tensions or economic shocks (Xu et al., 2022). To address these limitations, recent studies have explored machine learning (ML) and deep learning (DL) techniques, including XGBoost and Long Short-Term Memory (LSTM) networks, which provide improved predictive performance by learning complex temporal dependencies and capturing nonlinear patterns in port throughput data (Shankar et al., 2020; Kulshrestha et al., 2024). Despite these advancements, these models are often criticized for their “black-box” nature, which limits their adoption in operational and policymaking contexts where transparency and explainability are critical (Xiao et al., 2023).
To overcome this challenge, Explainable Artificial Intelligence (XAI) has emerged as a powerful paradigm for improving the interpretability of advanced forecasting models. Tools such as Integrated Gradients, Partial Dependence Plots (PDPs) and SHAP values enable researchers and practitioners to understand the contribution of input variables and evaluate model reasoning (Shen et al., 2025; Xiao et al., 2023). These techniques make it possible to bridge the gap between high predictive accuracy and operational usability, allowing decision-makers to trust automated forecasts and adopt AI-driven insights in strategic planning.
While explainable forecasting methods have been successfully adopted in sectors such as healthcare, energy and finance, their integration into maritime logistics remains limited (Eskafi et al., 2021). In particular, the Gulf Cooperation Council (GCC) region represents a distinctive operational context due to its unique trade structures, heavy reliance on energy exports, and rapidly evolving logistics infrastructures (Rashed et al., 2018). These characteristics make Gulf ports an ideal testbed for developing forecasting frameworks that balance predictive performance with interpretability.
This study addresses these research gaps by proposing a comparative forecasting framework that integrates statistical models, machine learning and deep learning with XAI-driven interpretability. Using empirical data from 2017 to 2024 across four major Gulf ports, the study evaluates predictive accuracy, investigates causal feature relationships and demonstrates how explainable AI can improve transparency in decision-support systems for maritime operations. By combining predictive power with interpretability, the proposed framework provides a robust foundation for enhancing planning, infrastructure allocation and operational resilience in Gulf ports.
Moreover, the framework contributes to the broader maritime logistics literature by demonstrating how explainable forecasting can accelerate AI adoption in port operations and provide a transferable approach for other regions worldwide. The findings offer a comprehensive and practical solution for balancing advanced model performance with transparency, ensuring trust among stakeholders and improving the integration of data-driven decision-making in maritime logistics (Xiao et al., 2023; Xu et al., 2022).
Literature review
Forecasting container throughput has become increasingly critical for port operations, logistics planning and policy design in global maritime supply chains. Early studies relied heavily on traditional statistical models such as ARIMA and SARIMA because of their simplicity, interpretability and computational efficiency (Farhan and Ong, 2018). While these models perform well under stable trade conditions, they often fail to handle nonlinear dynamics, volatile demand patterns and abrupt disruptions caused by geopolitical or macroeconomic events (Xu et al., 2022).
To overcome these limitations, recent research has increasingly adopted machine learning (ML) and deep learning (DL) techniques to capture complex temporal dependencies and improve predictive accuracy. For instance, Yang and Chang (2020) proposed a hybrid CNN-LSTM model that outperformed conventional approaches by effectively learning sequential dependencies in container throughput data. Similarly, Kulshrestha et al. (2024) introduced a multivariate decomposition-ensemble framework that integrates deep learning with feature decomposition, demonstrating superior predictive performance on large-scale, multi-port datasets. However, while these models achieve high forecasting accuracy, they often function as “black boxes,” limiting their adoption in operational decision-making where interpretability and stakeholder trust are essential.
In response to this challenge, Explainable Artificial Intelligence (XAI) has emerged as a transformative paradigm in maritime forecasting. XAI tools such as SHAP values, Integrated Gradients (IG) and Partial Dependence Plots (PDPs) provide critical insights into feature importance and model reasoning (Xiao et al., 2023; Shen et al., 2025). These techniques enhance transparency by identifying the contribution of variables like vessel arrivals, lagged container volumes and hinterland connectivity to predictive outcomes. For example, Shen et al. (2025) demonstrated that combining ensemble forecasting with XAI improves trust and usability in container terminal operations, while Xiao et al. (2023) applied attention-based ensemble models coupled with PDPs to provide interpretable, high-performing forecasts for Asian ports.
Despite these advances, few studies have applied integrated forecasting frameworks to Gulf Cooperation Council (GCC) ports, which exhibit unique operational contexts driven by energy exports, food import reliance, and rapid infrastructure expansion (Rashed et al., 2018; Eskafi et al., 2021). Most global studies focus on large Asian or European hub ports, where trade patterns, automation levels and seasonal variations differ significantly from those observed in Gulf ports. This creates a clear research gap: while advanced forecasting techniques have been developed, their contextual adaptation and operational interpretability remain underexplored in Gulf maritime logistics.
By addressing this gap, the current study contributes a comparative forecasting framework that integrates statistical methods, machine learning and XAI-based interpretability. Unlike prior work, this research evaluates four distinct modeling paradigms (ARIMA, Prophet, XGBoost and LSTM) on container throughput data from four strategic Gulf ports, providing empirical evidence on model performance while demonstrating how explainability can strengthen operational decision-making. Table 1 summarizes selected studies relevant to this research.
Overview of selected studies on container throughput forecasting models and their applicability to port operations
| Study | Methodology | Ports/Region | Key contributions | Limitations |
|---|---|---|---|---|
| Farhan and Ong (2018) | SARIMA | Various global ports | Captures seasonality in container data | Limited to linear patterns |
| Yang and Chang (2020) | CNN-LSTM hybrid | East Asian ports | High accuracy in mixed precision settings | No explainability tools |
| Kulshrestha et al. (2024) | Decomposition + DL Ensemble | Multinational dataset | Strong predictive performance | High computational complexity |
| Xiao et al. (2023) | Attention-based ensemble + XAI | Four Asian ports | Integrates accuracy with partial explainability | Limited model generalizability |
| Shen et al. (2025) | Decomposed ensemble + XAI | Gate-in operations | Highlights role of interpretability in terminal logistics | Focuses on terminal not port-level |
| Rashed et al. (2018) | Scenario-based hybrid modeling | Hamburg–Le Havre range | Incorporates macroeconomic scenarios | No ML or XAI components |
| Xu et al. (2022) | Comparative ML vs traditional models | Chinese ports | Benchmarks ML vs ARIMA, emphasizes ML gains | No feature interpretation provided |
Recent work such as Park et al. (2024), Fang and Xu (2024) and Gattuso et al. (2023) highlights the integration of exogenous variables, hinterland connectivity and regional trade characteristics into port forecasting. These studies strengthen the rationale for incorporating both operational and macroeconomic indicators into Gulf port modeling. Therefore, this study builds upon global developments while contextualizing forecasting and explainability within the underexplored Gulf maritime environment.
The reviewed literature highlights a critical trade-off between forecasting accuracy and model interpretability in maritime logistics. While recent machine learning and deep learning models, such as LSTM, CNN-LSTM and ensemble frameworks, consistently outperform traditional statistical techniques like ARIMA and SARIMA in predictive performance, most existing studies fall short in providing explainable insights into model behavior. This lack of interpretability has limited their adoption in real-world operational contexts where transparency, trust and accountability are essential for decision-making. Furthermore, the majority of forecasting frameworks have been developed and validated in Asian and European port settings, resulting in limited applicability to Gulf Cooperation Council (GCC) ports, which operate under unique conditions characterized by energy export dependencies, heavy food import reliance and evolving infrastructure. These distinctive operational dynamics require forecasting models that are not only accurate but also contextually adaptable to regional data patterns and policy environments. To bridge this gap, there is a growing need for integrated frameworks that combine advanced machine learning techniques with Explainable Artificial Intelligence (XAI) tools, such as SHAP values, Integrated Gradients and Partial Dependence Plots. Such integration enables stakeholders to understand the drivers behind predictive outcomes, improving operational usability and fostering greater confidence in adopting AI-driven decision-support systems in maritime logistics.
Methodology
This study adopts a comprehensive comparative forecasting framework designed to evaluate the performance and interpretability of statistical, machine learning and deep learning techniques for forecasting container throughput in maritime logistics. The methodology integrates a structured process that includes data collection, preprocessing, model development, explainability enhancement and performance evaluation. By combining predictive modeling with interpretability, the framework addresses the need for operationally usable forecasts that improve decision-making for port authorities and stakeholders.
The methodological process followed five structured steps: (1) data collection and preprocessing; (2) model selection and parameter optimization; (3) training and validation of statistical, machine learning and deep learning models; (4) application of XAI tools (SHAP, PDP, IG and Granger causality) to interpret model outputs; and (5) comparative evaluation and synthesis of results. Following this fixed sequence keeps the workflow transparent and supports reproducibility.
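Granger causality appears in step (4), but its mechanics are not spelled out. As a minimal sketch, assuming a bivariate test of whether one series x (for example, vessel arrivals) helps predict another series y (throughput) at lag order p, one fits a restricted autoregression of y on its own p lags and an unrestricted one that adds p lags of x, then compares the residual sums of squares with an F statistic. The function names are hypothetical, and the sketch returns only the F statistic and its degrees of freedom; a library routine such as statsmodels' `grangercausalitytests` also reports p-values. The test measures predictive precedence, not causation in a structural sense.

```python
import numpy as np

def _lag_matrix(series: np.ndarray, p: int) -> np.ndarray:
    """Columns hold the series lagged by 1..p, aligned with targets series[p:]."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])

def granger_f_stat(y, x, p: int = 2):
    """F statistic for H0: lags of x add no predictive information about y.

    Returns (F, df_numerator, df_denominator).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[p:]
    n_obs = len(target)
    ones = np.ones((n_obs, 1))
    # Restricted model: intercept + p lags of y only.
    X_r = np.hstack([ones, _lag_matrix(y, p)])
    # Unrestricted model: additionally includes p lags of x.
    X_u = np.hstack([X_r, _lag_matrix(x, p)])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        return float(resid @ resid)

    rss_r, rss_u = rss(X_r), rss(X_u)
    df_num = p                       # number of restrictions tested
    df_den = n_obs - X_u.shape[1]    # residual degrees of freedom
    f_stat = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
    return f_stat, df_num, df_den
```

A large F relative to the F(p, n − k) reference distribution suggests that past values of x carry information about y beyond what y's own history provides.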
The dataset consists of monthly container throughput data collected from four major Gulf ports (Hamad, Jeddah, Salalah and Shuwaikh), covering the period from January 2017 to December 2024. The data include the total number of 20-foot equivalent units (TEUs) handled per month, complemented by vessel arrival statistics to capture variations in port activity levels. To ensure consistency and reliability, the same preprocessing procedure was applied to all four ports. Data completeness was verified by cross-referencing port authority records with maritime trade databases. Missing values were imputed using forward filling and linear interpolation to preserve temporal continuity, while outliers were detected and smoothed using the interquartile range (IQR) method to maintain realistic patterns in container flows. Because the models span both statistical and machine learning techniques, normalization was applied only to the machine learning inputs to improve convergence and computational efficiency. Finally, the dataset was divided into training (80%) and testing (20%) subsets to enable robust model validation.
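The preprocessing steps can be sketched for a single port's monthly series. This is a minimal illustration under several assumptions not reported in the study: missing months are marked as NaN; interior gaps are filled by linear interpolation and trailing gaps by forward filling; "smoothing" of outliers is implemented as clipping to the conventional 1.5 × IQR fences; min-max scaling is fitted on the training portion only; and the 80/20 split is chronological rather than shuffled. All function names are hypothetical.

```python
import numpy as np

def fill_gaps(values):
    """Linearly interpolate interior gaps; forward-fill trailing gaps."""
    y = np.asarray(values, dtype=float)
    idx = np.arange(len(y))
    known = ~np.isnan(y)
    # np.interp holds the edge values constant outside the observed range,
    # which forward-fills trailing gaps (and back-fills any leading ones).
    return np.interp(idx, idx[known], y[known])

def clip_iqr_outliers(y, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return np.clip(y, q1 - k * iqr, q3 + k * iqr)

def chronological_split(y, train_frac=0.8):
    """Earlier months go to training, later months to testing (no shuffling)."""
    cut = int(round(len(y) * train_frac))
    return y[:cut], y[cut:]

def minmax_fit_transform(train, test):
    """Scale to [0, 1] with training statistics only, avoiding look-ahead leakage."""
    lo, hi = train.min(), train.max()
    scale = hi - lo if hi > lo else 1.0
    return (train - lo) / scale, (test - lo) / scale, (lo, scale)
```

For 96 monthly observations (January 2017 to December 2024), this split yields 77 training and 19 test months. Fitting the scaler on the training portion alone keeps test-period information out of the model inputs.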
To evaluate forecasting performance, four distinct models were implemented, representing different methodological paradigms. The ARIMA model was employed as the statistical baseline, capable of capturing linear dependencies and seasonal variations in throughput trends. Model parameters were optimized using the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). However, given its limitations in handling nonlinear relationships, ARIMA served primarily as a benchmark for comparison. The second model, Prophet, developed by Meta, was selected for its additive time-series decomposition, which effectively separates trend, seasonality and holiday effects. Its robustness in modeling irregular seasonal patterns and abrupt fluctuations makes it well-suited for dynamic port environments. The third model, XGBoost, a tree-based gradient boosting algorithm, was implemented to capture nonlinear interactions between lagged throughput, vessel arrivals and other operational features. Feature engineering included lagged TEUs, ship counts and seasonal indicators, while model hyperparameters were tuned using grid search with five-fold cross-validation. Finally, the Long Short-Term Memory (LSTM) network was applied as a deep learning technique capable of leveraging temporal dependencies in sequential data. The model was trained using two hidden LSTM layers with dropout regularization, the Adam optimizer and early stopping criteria to prevent overfitting, enabling it to learn complex dynamic patterns in container throughput data.
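The XGBoost feature engineering described above (lagged TEUs, ship counts and seasonal indicators) amounts to turning the time series into a supervised-learning design matrix. A minimal sketch, assuming time-aligned monthly arrays, three TEU lags, one lag of vessel arrivals and month-of-year one-hot dummies as the seasonal indicators; these lag depths and the function name `build_features` are hypothetical. The resulting matrix could then be passed to a regressor such as `xgboost.XGBRegressor`.

```python
import numpy as np

def build_features(teu, ships, months, teu_lags=3, ship_lags=1):
    """Return (X, y), where row t uses only information available before month t.

    teu, ships: 1-D arrays of equal length (monthly TEUs and vessel arrivals).
    months: 1-D integer array of calendar months (1..12), one per observation.
    """
    teu = np.asarray(teu, dtype=float)
    ships = np.asarray(ships, dtype=float)
    months = np.asarray(months, dtype=int)
    start = max(teu_lags, ship_lags)  # first month with a complete lag history
    rows = []
    for t in range(start, len(teu)):
        lagged_teu = [teu[t - k] for k in range(1, teu_lags + 1)]
        lagged_ships = [ships[t - k] for k in range(1, ship_lags + 1)]
        month_dummies = np.eye(12)[months[t] - 1]  # seasonal indicator for month t
        rows.append(np.concatenate([lagged_teu, lagged_ships, month_dummies]))
    X = np.vstack(rows)
    y = teu[start:]  # target: throughput in month t
    return X, y
```

Using lagged rather than same-month vessel arrivals keeps every feature observable at forecast time, so the model is not trained on information it would not have in operation.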
Prophet was retained as a benchmark because of its ability to capture seasonality and trend decomposition in small-sample time series, offering an interpretable baseline for comparing deep learning and boosting models under varying data conditions.
The LSTM network consisted of two hidden layers with 64 and 32 units, respectively, using the hyperbolic tangent activation function and a dropout rate of 0.2. The model was trained using the Adam optimizer with a learning rate of 0.001, batch size of 32 and 100 epochs. Early stopping was applied with a patience of 10 epochs to prevent overfitting.
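The early-stopping rule can be made concrete by replaying a sequence of per-epoch validation losses and reporting where training would halt under a patience of 10 within the 100-epoch budget. The study does not name its deep learning framework; the sketch assumes semantics similar to the Keras callback `EarlyStopping(monitor="val_loss", patience=10)` with `min_delta=0`. Whether the best weights were restored is not reported, so the hypothetical helper simply returns the best epoch alongside the stopping epoch.

```python
def early_stopping(val_losses, patience=10, max_epochs=100, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses.

    Training stops once `patience` consecutive epochs pass without the loss
    improving on the best value seen so far by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    last_epoch = min(len(val_losses), max_epochs) - 1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss - min_delta:
            # Improvement: record it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Budget exhausted (or losses ran out) without triggering the rule.
    return last_epoch, best_epoch
```

For example, a loss curve that improves for 20 epochs and then plateaus stops at epoch 29 (0-based), ten epochs after the best one.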
Given the “black-box” nature of machine learning and deep learning models, explainable artificial intelligence (XAI) tools were integrated into the framework to improve model interpretability and operational trust. For the XGBoost model, SHAP (SHapley Additive exPlanations) values were used to quantify the contribution of each predictor, allowing port managers to identify the most influential factors driving throughput, such as lagged container volumes and vessel arrival frequencies. To further enhance transparency, Partial Dependence Plots (PDPs) were generated to visualize the marginal effects of individual features on forecasted outputs. For the LSTM model, Integrated Gradients (IG) were applied to trace the contribution of sequential inputs over time, enabling stakeholders to understand how historical variations influence future predictions. Together, these XAI techniques bridge the gap between predictive accuracy and model interpretability, ensuring that forecasts are both technically robust and practically actionable.
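To make the three attribution ideas concrete without relying on any particular library's API, the sketch below computes them for a toy, hand-written prediction function of three features. It assumes a single baseline (reference) point for the Shapley and Integrated Gradients calculations; it enumerates all feature coalitions exactly, which is feasible only for a handful of features (the SHAP library uses faster algorithms, such as TreeSHAP for tree ensembles like XGBoost); it approximates the IG path integral with a midpoint Riemann sum and finite-difference gradients; and it computes partial dependence by averaging predictions over a background sample. The toy model and all function names are hypothetical.

```python
import itertools
import math
import numpy as np

def toy_model(x):
    """Hypothetical smooth model: x = [lagged TEU, vessel arrivals, month index]."""
    x = np.asarray(x, dtype=float)
    return 0.8 * x[0] + 0.5 * x[1] + 0.1 * x[0] * x[1] + np.sin(x[2])

def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside a coalition are set to the baseline."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                z = np.array(baseline, dtype=float)
                z[list(subset)] = x[list(subset)]
                without_i = f(z)
                z[i] = x[i]  # add feature i to the coalition
                phi[i] += weight * (f(z) - without_i)
    return phi

def integrated_gradients(f, x, baseline, steps=200, eps=1e-5):
    """IG via a midpoint Riemann sum along the straight path from baseline to x."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    total_grad = np.zeros_like(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = baseline + alpha * (x - baseline)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            # Central finite difference approximates the partial derivative.
            total_grad[i] += (f(point + e) - f(point - e)) / (2 * eps)
    return (x - baseline) * total_grad / steps

def partial_dependence(f, background, feature, grid):
    """Average prediction over the background sample with `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        data = np.array(background, dtype=float)
        data[:, feature] = value
        curve.append(np.mean([f(row) for row in data]))
    return np.array(curve)
```

Both attribution methods satisfy a completeness property: the per-feature scores sum to f(x) − f(baseline), which is what allows them to be read as a decomposition of a single forecast. Partial dependence, by contrast, describes average model behavior rather than explaining one prediction.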
Model performance was evaluated using three widely recognized statistical measures: mean absolute error (MAE), root mean squared error (RMSE) and the coefficient of determination (R²). MAE provides an intuitive measure of the average magnitude of forecasting errors, while RMSE penalizes larger deviations more heavily, reflecting sensitivity to extreme fluctuations. R² quantifies the proportion of variance explained by each model, offering insight into its predictive strength. By incorporating multiple evaluation metrics, the analysis provides a balanced assessment of forecasting accuracy, model stability and operational reliability across the four ports.
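For reference, the three metrics can be computed directly from paired test-period observations and forecasts. The sketch uses the standard definitions, with R² computed as 1 - SS_res/SS_tot relative to the mean of the observed test values; this is one common convention, and the study does not state which variant it used. The function name is hypothetical.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return MAE, RMSE and R^2 for paired arrays of observed and forecast values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))  # squaring weights large misses more heavily
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}
```

For actual values [100, 110, 120, 130] and forecasts [102, 108, 123, 127], the errors are ±2 and ±3, giving MAE = 2.5, RMSE = √6.5 ≈ 2.55 and R² = 1 − 26/500 = 0.948. RMSE is never smaller than MAE, and the gap widens as errors become more uneven.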
This methodological framework provides three major contributions. First, it integrates statistical, machine learning and deep learning approaches within a unified comparative structure, offering a holistic view of model performance. Second, it embeds XAI techniques to enhance transparency, stakeholder trust and managerial usability of forecasting outcomes. Third, it contextualizes the framework within the operational realities of Gulf ports, where throughput patterns are influenced by unique structural, trade and infrastructural factors. By linking predictive power with interpretability, the proposed methodology ensures that the results are not only scientifically rigorous but also directly relevant to strategic decision-making in maritime logistics.
Results
The analysis of monthly container throughput data from 2017 to 2024 across Hamad, Jeddah, Salalah and Shuwaikh ports revealed consistent upward trends with distinct seasonal patterns, as illustrated in Figure 1. Average throughput volumes were broadly comparable among the four ports, ranging from approximately 106,000 to 111,000 containers per month. Specifically, Shuwaikh Port recorded the highest mean throughput (111,005), followed closely by Salalah (110,128), Hamad (109,587) and Jeddah (106,824). As shown in the summary statistics table, variation in standard deviation values suggests differing levels of volatility and operational stability across ports. These differences may reflect structural factors such as berth capacity, hinterland connectivity and vessel arrival frequency.
Monthly container throughput trends by port (2017–2024) [line chart; x-axis: Date; y-axis: Number of Containers; series: Hamad, Jeddah, Salalah and Shuwaikh ports]
A detailed comparison of forecasting models (ARIMA, Prophet, XGBoost and LSTM) is presented in Figure 2, showing their respective forecast trajectories for the 2023–2024 horizon. LSTM achieved the best predictive performance with an R² of 0.912, followed by XGBoost at 0.889, Prophet at 0.861 and ARIMA at 0.814. These results are confirmed in the comparative metric table and visually reinforced by the forecast error trends depicted in Figure 3. Both LSTM and XGBoost consistently outperformed the statistical models in terms of MAE and RMSE, suggesting a superior ability to capture nonlinear dynamics and long-term dependencies. Furthermore, Figure 4 shows the rolling MAE over time for XGBoost, highlighting its stability and robustness even during periods of throughput fluctuation. The overall distribution of forecast errors across all models, shown in Figure 5, reveals that LSTM and XGBoost not only achieved lower average errors but also exhibited more symmetrical and narrower error distributions.
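The rolling MAE shown in Figure 4 can be computed from test-period errors with a trailing window. The window length used in the study is not reported; the sketch below assumes a hypothetical default of three months, and the function name is illustrative.

```python
import numpy as np

def rolling_mae(actual, predicted, window=3):
    """Trailing-window MAE over consecutive months.

    The first value averages months 0..window-1, the next months 1..window,
    and so on; an empty array is returned if the series is shorter than the window.
    """
    abs_err = np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))
    if window > len(abs_err):
        return np.array([])
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits.
    return np.convolve(abs_err, kernel, mode="valid")
```

A flat rolling-MAE curve indicates errors that stay stable across the test period, whereas spikes point to months in which the model failed to track sudden throughput swings.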
Forecast comparison of container throughput using Prophet, ARIMA, XGBoost and LSTM (2023–2024) [line chart; x-axis: Date; y-axis: Containers; series: Actual, Prophet, ARIMA, XGBoost and LSTM]
Partial dependence between lagged container volume (prev) and predicted output [scatter plot with fitted line; x-axis: prev; y-axis: Predicted Containers]
The figure shows four bar graphs arranged in a vertical series. In all four graphs, the horizontal axis is labeled “Timestep” and ranges from negative 0.5 to 2.5 in increments of 0.5 units. All the graphs show three vertical bars. The first bar graph at the top is labeled “Integrated Gradients – Hamad Port”. The vertical axis is labeled “Attribution Score” and ranges from 0 to 0.02 in increments of 0.01 units. The data for the three bars are as follows: For timestep 0, attribution score: 0.015. For timestep 1, attribution score: 0.010. For timestep 2, attribution score: 0.021. The second bar graph below it is labeled “Integrated Gradients – Jeddah Port”. The vertical axis is labeled “Attribution Score” and ranges from 0 to 0.02 in increments of 0.01 units. The data for the three bars are as follows: For timestep 0, attribution score: 0.020. For timestep 1, attribution score: 0.021. For timestep 2, attribution score: 0.008. The third bar graph below that is labeled “Integrated Gradients – Salalah Port”. The vertical axis is labeled “Attribution Score” and ranges from 0 to 0.05 in increments of 0.025 units. The data for the three bars are as follows: For timestep 0, attribution score: 0.035. For timestep 1, attribution score: 0.030. For timestep 2, attribution score: 0.060. The fourth bar graph at the bottom is labeled “Integrated Gradients – Shuwaikh Port”. The vertical axis is labeled “Attribution Score” and ranges from 0 to 0.02 in increments of 0.01 units. The data for the three bars are as follows: For timestep 0, attribution score: 0.012. For timestep 1, attribution score: 0.009. For timestep 2, attribution score: 0.029. Note: All numerical data values are approximated.Integrated gradients attribution scores per input timestep in the LSTM model
Distribution of monthly container volumes by port (panel “Monthly Container Volumes by Port”; one boxplot each for Hamad, Jeddah, Salalah and Shuwaikh ports; y-axis: Containers)
The PDP shows how predicted throughput responds to changes in lagged volume, whereas the integrated gradients show how attribution accumulates across input timesteps, clarifying the extent to which recent data dominate the LSTM forecasts.
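The partial dependence idea can be illustrated with a short sketch. It assumes a generic prediction function rather than the study's fitted XGBoost model. For each grid value, the chosen feature is overwritten in every observation, and the resulting predictions are averaged. The function `partial_dependence_1d`, the linear `toy_model` and the feature names `prev` and `ship_count` are hypothetical illustrations.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def partial_dependence_1d(
    predict: Callable[[Row], float],
    rows: Sequence[Row],
    feature: str,
    grid: Sequence[float],
) -> List[float]:
    """Brute-force partial dependence of `predict` on one feature.

    For each grid value v, every observed row is copied with `feature`
    overwritten by v, the model is evaluated on all copies, and the
    predictions are averaged over the dataset.
    """
    curve = []
    for v in grid:
        preds = [predict({**row, feature: v}) for row in rows]
        curve.append(mean(preds))
    return curve


def toy_model(row: Row) -> float:
    """Hypothetical fitted model: linear in lagged volume and vessel count."""
    return 5_000 + 0.9 * row["prev"] + 150 * row["ship_count"]


if __name__ == "__main__":
    data = [
        {"prev": 80_000, "ship_count": 40},
        {"prev": 110_000, "ship_count": 55},
        {"prev": 125_000, "ship_count": 60},
    ]
    grid = [20_000, 60_000, 100_000, 140_000]
    for v, pd_value in zip(grid, partial_dependence_1d(toy_model, data, "prev", grid)):
        print(f"prev={v:>7,}: partial dependence={pd_value:,.0f}")
```

The toy model is linear in `prev`, so its curve is a straight line. A tree ensemble such as XGBoost typically yields a piecewise-constant curve instead. For scikit-learn-compatible estimators, `sklearn.inspection.partial_dependence` computes the same quantity.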
To enhance interpretability, explainable AI (XAI) tools were employed. Partial Dependence Plots for the XGBoost model (Figure 3) identified lagged container volume as the most influential variable, which aligns with domain knowledge about the autoregressive nature of port throughput. For the LSTM model, integrated gradients were used to attribute the model's predictions to individual input timesteps (Figure 4). The most recent inputs generally had the strongest effect on the forecast, indicating that the LSTM model relies heavily on short-term dynamics. The distribution of container volumes across ports (Figure 5) provides context for the structure of the input data fed into these models.
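Integrated gradients attribute a prediction to each input by integrating the model's gradient along a straight path from a baseline to the actual input. The sketch below approximates that integral with a midpoint Riemann sum and central-difference gradients. It is an illustration under stated assumptions, not the study's LSTM setup. The three-step `toy_sequence_model`, its weights, the all-zero baseline and the convention that the last position is the most recent timestep are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def numerical_grad(f: Model, x: Sequence[float], h: float = 1e-5) -> List[float]:
    """Central-difference estimate of the gradient of scalar f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def integrated_gradients(
    f: Model, x: Sequence[float], baseline: Sequence[float], steps: int = 200
) -> List[float]:
    """Midpoint Riemann approximation of integrated gradients.

    IG_i = (x_i - b_i) * integral over a in [0, 1] of df/dx_i(b + a (x - b))
    """
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * di for bi, di in zip(baseline, diff)]
        total = [t + g for t, g in zip(total, numerical_grad(f, point))]
    return [di * t / steps for di, t in zip(diff, total)]


# Hypothetical weights over three scaled input timesteps; position 2 is
# assumed to be the most recent observation.
WEIGHTS = [0.2, 0.3, 0.9]


def toy_sequence_model(x: Sequence[float]) -> float:
    """Stand-in for a trained network: tanh of a weighted sum of timesteps."""
    return math.tanh(sum(w * xi for w, xi in zip(WEIGHTS, x)))


if __name__ == "__main__":
    x, baseline = [0.5, 0.6, 0.8], [0.0, 0.0, 0.0]
    attributions = integrated_gradients(toy_sequence_model, x, baseline)
    print("attributions:", [round(a, 4) for a in attributions])
    # Completeness check: attributions should sum to f(x) - f(baseline).
    print("sum:", round(sum(attributions), 4),
          "f(x) - f(baseline):", round(toy_sequence_model(x) - toy_sequence_model(baseline), 4))
```

The completeness property is a useful sanity check. The attributions should sum, up to discretization error, to the difference between the prediction at the input and the prediction at the baseline. For PyTorch models, Captum's `IntegratedGradients` class provides a maintained implementation.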
Causal inference and sensitivity analysis added further analytical depth. The Granger causality tests (Figure 6) showed that vessel arrival counts Granger-cause container throughput at Salalah Port, which supports their inclusion as explanatory variables. A feature ablation study (Figure 7) showed that excluding top-ranked predictors such as lagged volume and vessel count substantially increased forecast error, with MAE rising by up to 35% in some models. These insights reinforce the practical need to integrate operational indicators into forecasting pipelines to improve both accuracy and relevance.
Granger causality test results: Ship_Count → containers (one panel per port: Hamad, Jeddah, Salalah and Shuwaikh; x-axis: Lag (months); y-axis: p-value, with a dashed line at the 0.05 significance level)
Sensitivity analysis of feature removal on XGBoost forecast error (MAE) (one panel per port: Hamad, Jeddah, Salalah and Shuwaikh; bars: Full Model, Containers, Ship_Count; x-axis: MAE)
Table 2 presents a comparative analysis of four forecasting models (Prophet, ARIMA, XGBoost and LSTM), evaluated using three standard performance metrics: mean absolute error (MAE), root mean squared error (RMSE) and the coefficient of determination (R²).
Comparative performance of forecasting models
| Model | Mean absolute error (MAE) | Root mean squared error (RMSE) | R-squared (R²) |
|---|---|---|---|
| Prophet | 2148.62 | 2784.91 | 0.861 |
| ARIMA | 2320.14 | 2912.35 | 0.814 |
| XGBoost | 1983.77 | 2539.68 | 0.889 |
| LSTM | 1911.20 | 2471.08 | 0.912 |
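For reference, the three metrics in the table can be computed as shown below. This is a minimal sketch using plain Python lists. The sample values are invented for illustration and are not the study's data.

```python
import math
from typing import Sequence


def _check(y_true: Sequence[float], y_pred: Sequence[float]) -> int:
    """Validate inputs and return the number of observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average magnitude of the errors."""
    n = _check(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more than MAE."""
    n = _check(y_true, y_pred)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = _check(y_true, y_pred)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    actual = [100, 120, 130, 150]      # illustrative values only
    forecast = [110, 115, 128, 160]
    print(mae(actual, forecast), rmse(actual, forecast), r_squared(actual, forecast))
```

RMSE squares the errors before averaging, so it is always at least as large as MAE and weights large misses more heavily. This is consistent with RMSE exceeding MAE for every model in the table.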
Collectively, the results demonstrate that integrating advanced forecasting techniques with model interpretability significantly improves both predictive performance and operational usability. In high-stakes port environments, where forecasts directly influence resource planning, berth allocation and labor scheduling, such capabilities provide actionable intelligence. This is particularly important for Gulf ports operating within national transformation strategies such as Saudi Vision 2030 and Qatar National Vision 2030, in which efficiency, transparency and sustainability are central. Interpretable machine learning not only fosters trust in automated systems but also gives decision-makers granular insight into the drivers of port performance.
Figure 6 presents the results of the Granger causality tests conducted to examine whether vessel arrival counts (Ship_Count) act as significant predictors of container throughput across the four analyzed ports: Hamad, Jeddah, Salalah and Shuwaikh.
For Hamad, Jeddah and Shuwaikh ports, the p-values remain above the 0.05 significance threshold at all examined lags. This indicates that vessel arrivals do not Granger-cause short-term fluctuations in container volume at these ports. In contrast, the results for Salalah Port show a clear relationship: the p-value at lag 1 falls below 0.05, indicating that vessel arrivals Granger-cause container throughput at this port.
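A lag-p Granger test compares two autoregressions of throughput. The first uses only throughput's own lags, and the second adds lags of vessel arrivals. An F-test then asks whether the added lags reduce the residual sum of squares by more than chance would explain. The sketch below computes that F-statistic with ordinary least squares in pure Python. The function names and the simulated series are hypothetical, and this is not the study's code.

```python
import random
from typing import List, Sequence, Tuple


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def _rss(design: List[List[float]], y: List[float]) -> float:
    """Residual sum of squares of an OLS fit via the normal equations."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(design, y))


def granger_f_stat(y: Sequence[float], x: Sequence[float], lag: int) -> Tuple[float, int, int]:
    """F-test of H0: lags 1..lag of x add no predictive power for y.

    Restricted:   y_t = c + a_1 y_{t-1} + ... + a_p y_{t-p}
    Unrestricted: restricted terms + b_1 x_{t-1} + ... + b_p x_{t-p}
    Returns (F, numerator df, denominator df).
    """
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    restricted, unrestricted, target = [], [], []
    for t in range(lag, len(y)):
        own = [y[t - i] for i in range(1, lag + 1)]
        other = [x[t - i] for i in range(1, lag + 1)]
        restricted.append([1.0] + own)
        unrestricted.append([1.0] + own + other)
        target.append(y[t])
    df_num, df_den = lag, len(target) - 2 * lag - 1
    if df_den <= 0:
        raise ValueError("series too short for this lag")
    rss_r, rss_u = _rss(restricted, target), _rss(unrestricted, target)
    return ((rss_r - rss_u) / df_num) / (rss_u / df_den), df_num, df_den


if __name__ == "__main__":
    # Simulated monthly data (hypothetical): arrivals lead throughput by one month.
    rng = random.Random(42)
    ships = [rng.gauss(50, 5) for _ in range(120)]
    volume = [190.0]
    for t in range(1, 120):
        volume.append(20 + 0.5 * volume[t - 1] + 1.5 * ships[t - 1] + rng.gauss(0, 1))
    for p in range(1, 5):
        f_stat, dfn, dfd = granger_f_stat(volume, ships, p)
        print(f"lag {p}: F = {f_stat:.1f} (df = {dfn}, {dfd})")
```

To convert the statistic to a p-value, pass it to an F-distribution survival function, for example `scipy.stats.f.sf(f_stat, df_num, df_den)`. Alternatively, `statsmodels.tsa.stattools.grangercausalitytests` runs the full test. In that function, the second column of the input array is tested as a Granger cause of the first.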
This finding highlights the operational distinctiveness of Salalah Port, where container flows are more directly influenced by vessel arrival patterns, possibly because of its role as a major transshipment hub. Conversely, the weaker causal links observed at the other three ports suggest that other factors play a more dominant role in shaping throughput dynamics, such as hinterland demand, port congestion and regional trade policies.
These results provide an important managerial insight: forecasting models for Gulf ports should incorporate vessel arrival data more prominently in Salalah but adopt a more diversified feature set for Hamad, Jeddah and Shuwaikh, where causality is less direct.
Figure 7 illustrates the results of the feature ablation analysis across the four analyzed ports: Hamad, Jeddah, Salalah and Shuwaikh. The analysis evaluates how two key predictors contribute to the forecasting accuracy of the XGBoost model: lagged container volumes and vessel arrival counts (Ship_Count). The evaluation metric is the mean absolute error (MAE). A higher MAE after a feature is removed indicates a greater contribution of that variable to predictive performance.
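The ablation procedure described above can be sketched as follows: refit the model without one predictor, then compare its test-set MAE with that of the full model. To stay dependency-free, the sketch substitutes ordinary least squares for XGBoost. The data generator, its coefficients and the units (thousands of containers) are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows: Sequence[Row], y: Sequence[float], features: Sequence[str]) -> List[float]:
    """OLS coefficients [intercept, *slopes] via the normal equations."""
    design = [[1.0] + [r[f] for f in features] for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return _solve(xtx, xty)


def predict_ols(beta: List[float], rows: Sequence[Row], features: Sequence[str]) -> List[float]:
    return [beta[0] + sum(b * r[f] for b, f in zip(beta[1:], features)) for r in rows]


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def ablation_mae(train: Sequence[Row], y_train: Sequence[float], test: Sequence[Row],
                 y_test: Sequence[float], features: Sequence[str]) -> Dict[str, float]:
    """Test-set MAE of the full model and of models refit without each feature."""
    variants = {"full": list(features)}
    for f in features:
        variants[f"without {f}"] = [g for g in features if g != f]
    scores = {}
    for name, subset in variants.items():
        beta = fit_ols(train, y_train, subset)
        scores[name] = mae(y_test, predict_ols(beta, test, subset))
    return scores


def make_toy_data(n: int, seed: int) -> Tuple[List[Row], List[float]]:
    """Hypothetical data in thousands of containers; not the study's dataset."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        prev, ships = rng.gauss(100, 15), rng.gauss(50, 5)
        rows.append({"prev": prev, "ship_count": ships})
        y.append(10 + 0.8 * prev + 0.3 * ships + rng.gauss(0, 0.5))
    return rows, y


if __name__ == "__main__":
    train, y_train = make_toy_data(200, seed=1)
    test, y_test = make_toy_data(60, seed=2)
    scores = ablation_mae(train, y_train, test, y_test, ["prev", "ship_count"])
    for name, value in scores.items():
        change = 100 * (value / scores["full"] - 1)
        print(f"{name:>20}: MAE = {value:.2f} ({change:+.0f}% vs full)")
```

The percentage change printed for each variant, (MAE_without / MAE_full - 1) × 100, is the usual way to express how much forecast accuracy deteriorates when a predictor is dropped.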
Across all ports, the full model (including both predictors) achieved the lowest MAE, confirming the complementarity of container and vessel features. When container volumes are excluded, there is a substantial increase in MAE for all ports, with the most pronounced deterioration observed in Shuwaikh and Jeddah, where the MAE nearly doubles. This highlights the critical role of historical container data in driving throughput forecasts. In contrast, removing vessel arrival counts has a relatively smaller but still notable impact on prediction accuracy, particularly for Salalah Port, where its exclusion significantly increases MAE, reinforcing the findings from the Granger causality tests that identified vessel arrivals as a key driver of throughput in Salalah.
These results validate the robustness of the proposed forecasting framework by demonstrating that both lagged container volumes and vessel arrival counts contribute uniquely to predictive performance. From an operational perspective, the analysis emphasizes the importance of integrating real-time vessel arrival data into forecasting models for ports like Salalah, while for Hamad, Jeddah and Shuwaikh, container-based dynamics dominate predictive relationships.
Figure 8 illustrates the error distributions for the four forecasting models applied in this study: ARIMA, Prophet, XGBoost and LSTM. The analysis evaluates how each model performs in predicting container throughput by examining the spread and skewness of prediction errors across the four Gulf ports.
Distribution of forecast errors across models: Prophet, ARIMA, XGBoost and LSTM (2 × 2 grid of histograms; x-axis: Error; y-axis: Frequency)
The results show that the LSTM model exhibits the most balanced error distribution, with residuals concentrated around zero and a relatively symmetrical shape. This indicates that LSTM achieves more stable and unbiased predictions, making it well-suited for capturing the temporal dependencies in container throughput data. Similarly, the XGBoost model demonstrates strong predictive capabilities, with a narrower error range than the classical statistical models, although a slight right-skew is observed, suggesting occasional overestimations.
In contrast, the Prophet and ARIMA models exhibit wider error spreads and heavier tails, reflecting reduced predictive precision and higher sensitivity to outliers. Of the two, Prophet shows a more pronounced left skew, indicating consistent underestimation in high-throughput scenarios. ARIMA performs moderately but still falls behind the machine learning–based models in stability and robustness.
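Statements about skewness and over- or underestimation depend on how errors are signed, and this section does not state a convention. The sketch below assumes error = predicted - actual. Under that convention, negative errors are under-forecasts, and a long negative tail (negative skewness) indicates occasional large under-forecasts. The function name and example values are hypothetical.

```python
from statistics import mean, median, pstdev
from typing import Dict, Sequence


def error_summary(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Bias, spread and skewness of forecast errors.

    Assumed sign convention: error = predicted - actual, so negative
    errors are under-forecasts and positive errors are over-forecasts.
    """
    errors = [p - a for a, p in zip(actual, predicted)]
    mu = mean(errors)
    sd = pstdev(errors)
    # Population (Fisher-Pearson) moment skewness: mean((e - mu)^3) / sd^3.
    skew = mean([(e - mu) ** 3 for e in errors]) / sd ** 3 if sd > 0 else 0.0
    return {"mean": mu, "median": median(errors), "std": sd, "skewness": skew}


if __name__ == "__main__":
    # Illustrative values: one large under-forecast creates a long negative tail.
    actual = [100, 100, 100, 100, 100]
    predicted = [98, 99, 100, 101, 80]
    print(error_summary(actual, predicted))
```

If the opposite convention (actual - predicted) is used, the signs of the mean and the skewness flip. Stating the convention alongside the histograms therefore avoids ambiguity.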
Overall, these findings highlight the superiority of advanced machine learning approaches (XGBoost and LSTM) over traditional time-series techniques (ARIMA and Prophet) for container throughput forecasting. However, the selection of the optimal model should also consider interpretability and operational integration, which are addressed through the incorporation of explainable AI (XAI) techniques in the subsequent analysis.
Discussion
This study aimed to enhance the accuracy and interpretability of container throughput forecasting for Gulf ports by comparing statistical models, machine learning techniques and explainable AI (XAI)-driven approaches. The results provide new insights into the dynamics of maritime logistics forecasting by demonstrating that advanced models, particularly LSTM and XGBoost, significantly outperform traditional approaches such as ARIMA and Prophet in terms of predictive accuracy and robustness. These findings are consistent with recent research emphasizing the growing relevance of data-driven forecasting frameworks in complex and uncertain maritime environments (Zeng and Xu, 2024; Xiao et al., 2023). By integrating explainability techniques such as SHAP values, Partial Dependence Plots (PDPs) and Granger causality, the study not only improves predictive performance but also offers transparency into the underlying factors that drive container throughput variations.
Model performance and predictive accuracy
The comparative analysis demonstrates that advanced machine learning models, particularly LSTM and XGBoost, achieved the highest predictive accuracy among the evaluated approaches. These models captured complex nonlinear relationships in container throughput dynamics, outperforming traditional statistical models such as ARIMA and Prophet. This finding aligns with Zeng and Xu (2024) and Xiao et al. (2023), who reported that hybrid architectures and ensemble methods significantly enhance forecasting performance in maritime logistics. The results confirm that data-driven methods are particularly effective for managing rapidly fluctuating trade flows and operational uncertainties.
The role of explainable AI in enhancing interpretability
This study goes beyond improving accuracy by integrating explainable AI (XAI) techniques, including SHAP, Partial Dependence Plots (PDPs), Granger causality and feature ablation. These tools provide transparency into model behavior and highlight the contribution of individual predictors to throughput dynamics. Unlike previous research focusing only on predictive strength (Du et al., 2019; Yang and Chang, 2020), the explainability framework enables stakeholders to better understand the impact of factors such as hinterland demand, vessel turnaround times and macroeconomic indicators. Eskafi et al. (2021) emphasize that enhancing transparency in forecasting systems builds operational trust and facilitates better decision-making for terminal operators and policymakers.
Cross-port generalizability and practical relevance
A notable contribution of this study lies in demonstrating the generalizability of the proposed framework across four major Gulf ports. While several studies have focused on single-port applications (Shankar et al., 2020; Fang and Xu, 2024), our results indicate that the hybrid model performs consistently across heterogeneous operational settings. This highlights the framework’s adaptability and strategic relevance for supporting multi-port decision-making and network-level capacity planning in competitive maritime environments.
Decomposition-based ensembles and causal insights
The Granger causality results indicated a directional (predictive) influence of vessel arrivals on throughput at Salalah Port. This complements the SHAP and PDP analyses, which quantify the magnitude of such effects. Together, Granger causality establishes temporal precedence, while the explainability techniques reveal each feature's functional contribution. This provides both causal and associative perspectives that strengthen model interpretability and trustworthiness.
The results further validate the effectiveness of decomposition-ensemble strategies in forecasting container throughput. By isolating trend, seasonality and irregular components, the study improves forecasting precision while enhancing model stability. This is consistent with findings by Kulshrestha et al. (2024) and An (2023), who demonstrated that combining signal decomposition with deep learning significantly reduces forecasting errors. Moreover, the Granger causality analysis revealed strong interdependencies between hinterland traffic, vessel scheduling and terminal operations, providing actionable insights into optimizing resource allocation and operational efficiency.
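The section does not specify which decomposition method was used. One common option is classical additive decomposition. It splits a monthly series into a centered moving-average trend, a seasonal index and an irregular remainder. The pure-Python sketch below assumes that method and a 12-month period, and the synthetic series is hypothetical. Established implementations are `seasonal_decompose` and `STL` in `statsmodels.tsa.seasonal`.

```python
from typing import List, Optional, Sequence, Tuple


def classical_decompose(
    y: Sequence[float], period: int = 12
) -> Tuple[List[Optional[float]], List[float], List[Optional[float]]]:
    """Additive classical decomposition: y = trend + seasonal + irregular.

    The trend is a centered moving average (a 2 x m average when the period
    is even), so it is undefined for the first and last period // 2 points.
    """
    n = len(y)
    if n < 2 * period:
        raise ValueError("need at least two full seasonal cycles")
    half = period // 2
    trend: List[Optional[float]] = [None] * n
    for t in range(half, n - half):
        if period % 2 == 0:
            total = 0.5 * y[t - half] + sum(y[t - half + 1 : t + half]) + 0.5 * y[t + half]
        else:
            total = sum(y[t - half : t + half + 1])
        trend[t] = total / period
    # Average the detrended values for each position in the seasonal cycle.
    buckets: List[List[float]] = [[] for _ in range(period)]
    for t, tr in enumerate(trend):
        if tr is not None:
            buckets[t % period].append(y[t] - tr)
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period
    index = [s - offset for s in raw]  # normalize so the seasonal index sums to zero
    seasonal = [index[t % period] for t in range(n)]
    irregular = [None if tr is None else y[t] - tr - seasonal[t] for t, tr in enumerate(trend)]
    return trend, seasonal, irregular


if __name__ == "__main__":
    pattern = [10, 8, 5, 0, -3, -6, -8, -6, -3, 0, 1, 2]  # hypothetical, sums to zero
    series = [100 + 0.5 * t + pattern[t % 12] for t in range(48)]
    trend, seasonal, irregular = classical_decompose(series)
    print("trend[24]:", trend[24])
    print("seasonal index:", [round(s, 2) for s in seasonal[:12]])
```

On a noise-free series built from a linear trend and a zero-sum seasonal pattern, the method recovers both components exactly and leaves a zero remainder. With real throughput data, the remainder is the irregular component that a downstream learner would model. A known limitation of centered moving averages is that no trend estimate is available for the first and last six months.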
Managerial implications and strategic impact
The integration of machine learning with XAI-driven interpretability offers direct benefits for port managers, shipping companies and policymakers. By identifying the dominant drivers of container throughput, such as trade volume shifts and vessel delays, decision-makers can proactively plan capacity expansions, workforce allocation and hinterland infrastructure development. As shown by Shen et al. (2025) and Xu et al. (2022), leveraging interpretable forecasting frameworks accelerates digital transformation in maritime logistics, enabling data-driven, transparent and sustainable operations. Table 3 presents the key insights and relevance of the proposed forecasting framework.
Summary of theoretical and practical implications of the proposed forecasting framework
| Dimension | Key insights | Relevance |
|---|---|---|
| Model integration | Combination of ARIMA, Prophet, XGBoost and LSTM | Enhances comparative understanding of linear vs nonlinear forecasting techniques |
| Explainability | Use of Integrated Gradients and PDPs for model interpretation | Bridges gap between performance and transparency, improving model trust and usability |
| Causality analysis | Granger causality identifies vessel arrivals as a leading indicator of throughput at Salalah Port | Supports inclusion of operational variables in predictive modeling |
| Feature sensitivity | High dependency on lagged container volumes and ship arrivals | Enables prioritization of critical data inputs in forecasting systems |
| Operational planning | Forecasts support berth scheduling, crane allocation, and labor planning | Reduces congestion and improves resource efficiency in Gulf ports |
| Policy alignment | Model outputs traceable to input drivers | Facilitates compliance with national planning strategies (e.g. Saudi Vision 2030 and Qatar National Vision 2030) |
| Resilience | Forecasting models robust under dynamic, uncertain trade conditions | Enhances ports' ability to anticipate and respond to disruptions |
| Academic value | Context-specific modeling in underexplored Gulf port settings | Contributes to localized forecasting theory in maritime logistics |
Conclusion
This study developed a comprehensive comparative framework for forecasting container throughput in Gulf ports by integrating statistical approaches, machine learning techniques and explainable AI (XAI) methods. The findings demonstrate that advanced learning algorithms, particularly LSTM and XGBoost, significantly outperform traditional statistical models, including ARIMA, ETS and Prophet, by achieving superior accuracy, stability and adaptability in dynamic maritime environments. Unlike existing studies that primarily focus on predictive performance, this research introduces an integrated explainability framework using SHAP values, Partial Dependence Plots (PDPs), Granger causality and feature ablation to enhance model transparency and uncover the causal relationships driving throughput fluctuations.
The results provide actionable insights into the key operational and economic factors affecting port performance, such as vessel turnaround times, hinterland connectivity, global trade volume changes and seasonal demand patterns. By identifying these dominant drivers, the study bridges the gap between data-driven forecasting and decision-support systems, enabling port managers, shipping companies and policymakers to make informed decisions in capacity planning, berth allocation and workforce scheduling. These findings are particularly relevant in the context of maritime digital transformation, where the ability to interpret model outputs is critical for building trust in AI-enabled systems and aligning operational strategies with sustainability and resilience goals.
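Feature ablation, one of the techniques used to identify these drivers, can be sketched as refitting a model once per feature with that feature removed and recording how much the held-out error rises. In this sketch ordinary least squares stands in for LSTM or XGBoost, the data are synthetic, and the feature names (`lagged_volume`, `vessel_arrivals`, `noise_feature`) and helper functions are hypothetical.

```python
"""Drop-one-feature ablation with an ordinary-least-squares stand-in model (synthetic data)."""
import math
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least-squares coefficients [intercept, w1, w2, ...] via the normal equations."""
    Xb = [[1.0] + list(row) for row in X]
    k = len(Xb[0])
    xtx = [[sum(r[i] * r[j] for r in Xb) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, X):
    return [beta[0] + sum(w * v for w, v in zip(beta[1:], row)) for row in X]


def rmse(actual, forecast):
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def drop_column(X, j):
    """Copy of X with column j removed."""
    return [row[:j] + row[j + 1:] for row in X]


def ablation(X_train, y_train, X_test, y_test, names):
    """Return {feature: test-RMSE change after refitting without it}, largest first."""
    base = rmse(y_test, predict(fit_ols(X_train, y_train), X_test))
    deltas = {}
    for j, name in enumerate(names):
        beta = fit_ols(drop_column(X_train, j), y_train)
        deltas[name] = rmse(y_test, predict(beta, drop_column(X_test, j))) - base
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    rng = random.Random(1)
    rows = [[rng.gauss(50, 10), rng.gauss(20, 5), rng.gauss(0, 1)] for _ in range(200)]
    y = [2.0 * r[0] + 1.0 * r[1] + rng.gauss(0, 1) for r in rows]  # noise_feature unused
    names = ["lagged_volume", "vessel_arrivals", "noise_feature"]
    for name, delta in ablation(rows[:150], y[:150], rows[150:], y[150:], names).items():
        print(f"{name}: RMSE change {delta:+.2f}")
```

Because each feature is removed and the model refit, correlated features can mask one another: dropping one may cost little if another carries the same information, so ablation scores are best read alongside SHAP values and PDPs.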
From a practical perspective, the framework developed in this research demonstrates strong cross-port generalizability, showing consistent performance across four major Gulf ports operating under different infrastructural and operational conditions. This adaptability underscores the potential of explainable forecasting frameworks to support multi-port coordination, enhance supply chain visibility and facilitate integrated decision-making at a regional level. As global maritime trade becomes increasingly complex and competitive, these insights provide an essential foundation for adopting intelligent forecasting solutions in port management.
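One way to test cross-port generalizability is leave-one-port-out evaluation: fit on all ports except one and score forecasts on the held-out port. The sketch below assumes that protocol, uses a pooled AR(1) regression as a stand-in for the study's models, and runs on synthetic series for four hypothetical ports (`port_1` to `port_4`); none of these choices is taken from the study.

```python
"""Leave-one-port-out check of cross-port generalization (synthetic data, hypothetical ports)."""
import random


def lag_pairs(series):
    """Turn a series into (previous value, current value) pairs."""
    return list(zip(series[:-1], series[1:]))


def fit_ar1(pairs):
    """Closed-form least squares for y_t = a + b * y_{t-1}."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def mape(pairs, a, b):
    """One-step-ahead MAPE (percent) of the AR(1) model on (previous, current) pairs."""
    return 100.0 * sum(abs((y - (a + b * x)) / y) for x, y in pairs) / len(pairs)


def leave_one_port_out(series_by_port):
    """Fit on all ports but one, then score one-step-ahead MAPE on the held-out port."""
    scores = {}
    for held_out in series_by_port:
        train = [
            pair
            for port, series in series_by_port.items()
            if port != held_out
            for pair in lag_pairs(series)
        ]
        a, b = fit_ar1(train)
        scores[held_out] = mape(lag_pairs(series_by_port[held_out]), a, b)
    return scores


if __name__ == "__main__":
    rng = random.Random(2)
    series_by_port = {}
    for port, level in [("port_1", 1.0), ("port_2", 2.5), ("port_3", 4.0), ("port_4", 6.0)]:
        s = [level]  # synthetic monthly throughput in arbitrary units
        for _ in range(59):
            s.append(s[-1] * 1.01 * (1 + rng.gauss(0, 0.01)))
        series_by_port[port] = s
    for port, score in leave_one_port_out(series_by_port).items():
        print(f"held out {port}: MAPE {score:.2f}%")
```

Pooling works here because the synthetic ports share the same growth dynamics at different scales. With real ports, modeling growth rates or indexing each series to a base period would keep the largest port from dominating the pooled fit.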
Looking forward, several promising research directions emerge from this study:
Global model generalization
Extending the framework to a wider set of international ports will enable the validation of its robustness and improve the scalability of forecasting solutions across diverse economic and operational contexts.
Integration of real-time IoT and AIS data
Incorporating real-time data from IoT-enabled sensors, Automatic Identification System (AIS) feeds and satellite imagery can improve the timeliness and responsiveness of forecasts, enabling near-real-time adjustments to operational strategies.
Digital twin-enabled forecasting
Developing hybrid digital twin frameworks that integrate live operational data with predictive analytics will allow decision-makers to simulate alternative scenarios and optimize port operations proactively.
Advanced AI and reinforcement learning
Future research should explore the integration of reinforcement learning and causal inference models to support adaptive, self-learning forecasting systems capable of responding to unexpected disruptions, such as supply chain shocks or weather-related delays.
Sustainability and decarbonization strategies
By extending the model to incorporate carbon emissions data and energy consumption metrics, future studies can contribute to optimizing throughput while aligning with global sustainability targets and regulatory frameworks.
In conclusion, this research provides a scientifically rigorous and practically applicable framework that advances the state of knowledge in container throughput forecasting and maritime logistics decision support. Beyond demonstrating the predictive superiority of machine learning models, the study introduces a transparent and interpretable forecasting paradigm that addresses the operational challenges faced by modern ports. By linking Granger causality with SHAP-based importance, the framework connects statistical evidence of temporal precedence with model-derived feature relevance, offering a comprehensive view of throughput drivers that combines accuracy with interpretability. Together with comparative modeling, these explainable AI techniques give researchers and practitioners a roadmap for leveraging advanced analytics to improve forecasting reliability, operational efficiency and strategic planning in the evolving maritime ecosystem.
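The link between Granger causality and SHAP-based importance can be checked quantitatively by rank-correlating the two sets of scores. The sketch below assumes that a per-feature Granger F-statistic and a mean absolute SHAP value per feature have already been computed; the numbers are hypothetical placeholders, not results from this study. `scipy.stats.spearmanr` provides the same statistic; a standard-library version keeps the example self-contained.

```python
"""Rank agreement between Granger-test strength and SHAP-based importance (hypothetical scores)."""


def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def agreement(granger_f, mean_abs_shap):
    """Spearman rho over the features present in both score dictionaries."""
    features = sorted(set(granger_f) & set(mean_abs_shap))
    return spearman([granger_f[f] for f in features], [mean_abs_shap[f] for f in features])


if __name__ == "__main__":
    granger_f = {"vessel_arrivals": 25.0, "turnaround_time": 12.0,
                 "trade_volume": 8.0, "hinterland_connectivity": 1.1}
    mean_abs_shap = {"vessel_arrivals": 0.42, "turnaround_time": 0.30,
                     "trade_volume": 0.11, "hinterland_connectivity": 0.15}
    print("Spearman rho:", round(agreement(granger_f, mean_abs_shap), 2))
```

A high rank correlation indicates that features with strong temporal precedence also carry weight inside the fitted model. Because the two measures answer different questions, a low value flags a discrepancy worth inspecting rather than an error in either analysis.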

