This research presents machine learning models for predicting international tourist arrivals in Indonesia during the COVID-19 pandemic using multisource Internet data.
To develop the prediction models, this research utilizes multisource Internet data from the TripAdvisor travel forum and Google Trends. Temporal factors, posts and comments, the search query index and previous tourist arrival records serve as predictors. Four sets of predictors and three distinct data compositions were used to train the machine learning models, namely artificial neural networks (ANNs), support vector regression (SVR) and random forest (RF). To evaluate the models, this research uses three accuracy metrics: root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE).
Prediction models trained using multisource Internet data predictors achieve better accuracy than those trained using single-source Internet data or other predictors. In addition, training data that cover the phenomenon of interest, such as the COVID-19 period, enhance the prediction models' learning process and accuracy. The experiments show that the RF models have better prediction accuracy than the ANN and SVR models.
First, this study pioneers the practice of a multisource Internet data approach in predicting tourist arrivals amid the unprecedented COVID-19 pandemic. Second, the use of multisource Internet data to improve prediction performance is validated with real empirical data. Finally, this is one of the few papers to provide perspectives on the current dynamics of Indonesia's tourism demand.
1. Introduction
The increasing use of web-based platforms stimulates the growing availability of structured and unstructured data (Li et al., 2021). Search engines (Bangwayo-Skeete and Skeete, 2015), online forums (Fronzetti Colladon et al., 2019) and photo sharing apps (Miah et al., 2017) are just a handful of applications that contribute to the increasing availability of online data. The availability of online data has attracted academics and practitioners to extract business value from it. The tourism and hospitality industries are not an exception. Tourists have used various online platforms, such as social networks, microblogs, online booking, online reviews and online forums (Li et al., 2021), for their traveling purposes. The data emitted from these online platforms provide valuable customer behavior information (Bangwayo-Skeete and Skeete, 2015; Li et al., 2017). Forecasting models have been one of the most popular use cases that can be improved by utilizing this big Internet data (Song et al., 2019).
Literature on tourism demand forecasting is extensive (Li et al., 2021). Most studies have focused on predicting international tourist flows using various quantitative methods (Song et al., 2019), including time series (Ma et al., 2016; Park et al., 2017), econometric (Padhi and Pati, 2017) and artificial intelligence (AI) (Lv et al., 2018; Sun et al., 2019) approaches. In this big data era, AI-based approaches have grown in popularity (Song et al., 2019) and have been widely used for tourism demand forecasting due to their ability to deal with nonlinear data (Law et al., 2019; Sun et al., 2019; Huang and Hao, 2020). The artificial neural network (ANN), support vector regression (SVR) and random forest (RF) are among the most frequently used AI-based models (Sun et al., 2019; Song et al., 2019; Abellana et al., 2020; Huang and Hao, 2020; Li et al., 2020).
While the use of historical statistics records for forecasting purposes has already matured, forecasting models using Internet data have received increasing attention (Li and Law, 2020; Li et al., 2021). Previous studies have utilized Internet data from different sources, such as search engines (Dergiades et al., 2018; Li et al., 2020), web traffic (Yang et al., 2014; Gunter and Önder, 2016) and social media (Miah et al., 2017; Starosta et al., 2019), for forecasting purposes. Search engine and web traffic data provide structured time-series data, while social media generate unstructured data. Most previous studies focused on utilizing single-source Internet data with notable forecasting accuracy improvements (Bangwayo-Skeete and Skeete, 2015; Park et al., 2017).
Although many studies have explored the use of Internet data to develop more accurate forecasting models, those that attempt to utilize combinations of several types of Internet data remain limited. Since single-source Internet data cannot comprehensively reflect tourists' attention, interests and interactions (Fronzetti Colladon et al., 2019; Li et al., 2021), multisource Internet data can offer a solution to address this drawback. Moreover, numerous issues and challenges are present in integrating different data sources and verifying empirical applications of multisource Internet data (Li et al., 2021). Correspondingly, this study aims to fill the gap by developing tourist arrival forecasts using multiple sources and categories of Internet data based on well-investigated machine learning models, namely ANN, SVR and RF. As a case study, this study opts to predict international tourist arrivals in Indonesia. Furthermore, this study corresponds to the current global tourism trend that has been affected by travel restrictions amid the COVID-19 pandemic. In the face of an unprecedented pandemic, the applicability of Internet data and the developed machine learning solutions must be reexamined. Thus, the main research question of this study is how to develop machine learning models using multisource Internet data that lead to more accurate tourist arrival prediction during the COVID-19 pandemic.
The remainder of this paper is structured as follows. Section 1 provides a brief background, the research gap and the research question. Section 2 presents a literature review on extant tourism forecasting methods and tourism demand forecasting using Internet data. The research method is explained in Section 3. Section 4 presents the case study context. Section 5 provides results and discussion. The last section provides the conclusion, implications, current limitations and future research.
2. Literature review
Existing quantitative methods for tourism forecasting can be classified into three categories: time series, econometric and AI (Song et al., 2019; Li et al., 2021). Time series models provide simplicity by employing a lag of Internet data as explanatory variables (Li et al., 2021). This model can provide accurate predictions, notably for short-term forecasting horizons (Gunter and Önder, 2016; Park et al., 2017). The most commonly used time series models include autoregressive, autoregressive integrated moving average and seasonal autoregressive integrated moving average (Song et al., 2019; Li et al., 2021). The econometric models are concerned with the causality of various explanatory variables (Zhou-Grundy and Turner, 2015; Dergiades et al., 2018). Previous studies demonstrated that econometric models can improve accuracy over more extended time horizons (Bangwayo-Skeete and Skeete, 2015; Gunter and Önder, 2016). However, all variables included in these models should be stationary to avoid spurious results (Huang et al., 2017; Dergiades et al., 2018; Song et al., 2019). The autoregressive distributed lag model, time-varying parameter model and vector autoregression are among the most popular econometric models (Song et al., 2019; Li et al., 2021).
Unlike econometric models, AI-based models can describe nonlinear data without a prior understanding of the correlations between input and output variables (Song et al., 2019). These models rely on built-in feature engineering, which becomes the distinct advantage when dealing with large datasets (Law et al., 2019). This black box nature is often chastised for its lack of theoretical underpinning, poor interpretations of analytical outcomes and questionable explanatory value of input variables (Song et al., 2019; Li et al., 2021). However, AI-based approaches have been widely used because their nonlinear features can enhance forecasting performance (Law et al., 2019; Sun et al., 2019; Huang and Hao, 2020). The ANN is the most frequently used AI-based model, which can deal with almost any nonlinearity (Sun et al., 2019; Song et al., 2019). SVR is also frequently used in tourism demand forecasting due to its ability to model nonlinear data (Abellana et al., 2020; Huang and Hao, 2020; Li et al., 2020). Besides these two models, the RF also has grown in popularity due to its reliability and practical application in various fields (Khaidem et al., 2016; Tyralis and Papacharalampous, 2017; Li et al., 2020).
Previous studies have investigated three categories of Internet data to predict tourism demand: search engine, web traffic and social media. Google Trends (Bangwayo-Skeete and Skeete, 2015) and Baidu (Huang et al., 2017) are examples of search query data generated from search engines. Baidu performed better than Google for tourism forecasting in China due to its market-share advantage in the region, whereas Google performed better in international tourism forecasting contexts (Yang et al., 2015). A Google Analytics account provides web traffic data for a particular website (Yang et al., 2014). Social media data can be obtained from photo-sharing applications (Miah et al., 2017), online forums (Fronzetti Colladon et al., 2019) and news articles (Starosta et al., 2019).
In the context of forecasting using search engine data, Google Trends has been used to predict tourism demand both at the country level (Park et al., 2017) and at the tourist destination level, such as tourist arrivals to five London museums (Volchek et al., 2019) and US National Parks (Clark et al., 2019). Besides Google Trends, several studies with a forecasting context in China have utilized the Baidu index (Huang et al., 2017). Highly correlated query data are a challenge in utilizing search engine data; Li et al. (2017) therefore constructed a composite search index to overcome this problem. Moreover, a corrected aggregate search volume index, adjusted for different search languages and search platforms, is preferable to the nonadjusted index (Dergiades et al., 2018). These studies demonstrated that incorporating search engine data from Google Trends and Baidu can improve forecasting accuracy.
Other researchers have explored the use of web traffic data of destination marketing organizations to predict hotel demand (Yang et al., 2014) and tourist arrivals to Vienna (Gunter and Önder, 2016). Both studies obtained web traffic data using a Google Analytics account. Google Analytics provides two significant types of web traffic data: visitors and visits. The findings showed that web traffic data can reduce forecasting error (Yang et al., 2014) and improve vector autoregression models' performance over a more extended time horizon (Gunter and Önder, 2016).
In terms of social media data, Miah et al. (2017) used geotagged photos uploaded by tourists to Flickr, a photo-sharing social media platform, to predict tourism demand in Melbourne. Another study classified user reviews in social media into positive and negative sentiments (Starosta et al., 2019). In contrast to search engine and web traffic data, these user-generated social media data are commonly unstructured. Processing textual and image data from social media requires advanced data preprocessing techniques. In general, using single-source Internet data to forecast tourism demand has been explored extensively.
While using a single category of Internet data has been well studied, only a few studies explored the use of different categories of Internet data (see Table 1). In this stream, some studies combined Google Trends and the Baidu index to predict tourist arrivals at the city level, such as Hong Kong (Huang and Hao, 2020), Hainan (Yang et al., 2015) and Beijing (Lv et al., 2018; Sun et al., 2019). The results indicated that the forecasting performance of the models using combined search engine data outperformed the ones using individual search engine data. A study combined online reviews from TripAdvisor and Google Trends to predict international airport arrivals to major European capital cities (Fronzetti Colladon et al., 2019). Other researchers utilized Facebook likes data and Google Trends to predict tourist arrivals to Austrian cities (Gunter et al., 2019). At the destination level, online reviews from two platforms, namely Ctrip and Qunar, are combined with the Baidu index to predict tourist arrivals to Mount Siguniang China (Li et al., 2020). The findings showed that better accuracy can be obtained by combining user-generated reviews from several online platforms.
Table 1. Previous research on tourism demand forecasting using Internet data
| Study | Category of Internet data | Predictor variables | Predicted variable | Forecasting methods | COVID-19 context |
|---|---|---|---|---|---|
| Yang et al. (2015) | Search engine | Baidu index and Google Trends | Tourist arrivals to Hainan, China | ARMA, ARMAX | No |
| Lv et al. (2018) | Search engine | Baidu index and Google Trends | Tourism demand to America, Hainan, Beijing and Jiuzhaigou China | SARIMA, MLR, SVR, SLFN, ESN, LSTM, SAEN | No |
| Fronzetti Colladon et al. (2019) | Social media and search engine | Online forum (TripAdvisor) and Google Trends | International airport arrivals to seven major European capital cities | AR, FAAR, FABM, BM | No |
| Gunter et al. (2019) | Social media and search engine | Facebook and Google Trends | Tourist arrivals to four Austrian cities | Naïve, ETS, ARMA, ADLM, MIDAS | No |
| Sun et al. (2019) | Search engine | Baidu index and Google Trends | Tourist arrivals to Beijing | KELM, ARIMAX, ANN, LSSVR | No |
| Li et al. (2020) | Social media and search engine | Online reviews (Ctrip, Qunar) and Baidu index | Tourist arrivals to Mount Siguniang, China | ARIMAX, SVR, RF | No |
| Huang and Hao (2020) | Search engine | Baidu index and Google Trends | Tourist arrivals to Hong Kong | DBEDBN, RW, ARIMAX, SVR, ANN, DBN, EANN | No |
Note(s): ADLM = Autoregressive distributed lag model, ANN = Artificial neural network, AR = Autoregressive, ARIMAX = Autoregressive integrated moving average with exogenous, ARMA = Autoregressive moving average, ARMAX = Autoregressive moving average with exogenous, BM = Bridge model, DBEDBN = Double boosting ensemble deep belief network, DBN = Deep belief network, EANN = Ensemble artificial neural network, ESN = Echo state network, ETS = Exponential smoothing, FAAR = Factor augment autoregressive model, FABM = Factor augmented bridge model, KELM = Kernel extreme learning machines, LSSVR = Least squares support vector regression, LSTM = Long short-term memory, MIDAS = Mixed-data sampling, MLR = Multiple linear regression, RF = Random forest, RW = Random walk, SAEN = Stacked autoencoder with echo-state regression, SARIMA = Seasonal autoregressive integrated moving average, SLFN = Single-hidden Layer Feed-forward Neural Network, SVR = Support vector regression
To the best of our knowledge, studies developing tourism demand forecasting models using multisource Internet data, particularly with different categories of Internet data, remain scarce. Moreover, the applicability of Internet data and the performance of existing machine learning forecasting models must be reexamined under the unprecedented COVID-19 pandemic context. This study fills the gap by utilizing two categories of Internet data, namely search engine (Google Trends) and social media (TripAdvisor travel forum), to develop prediction models that can accurately predict international tourist arrivals in the pandemic context. In addition, this study evaluates the prediction models under different combinations of Internet data and training dataset compositions.
3. Methodology
Figure 1 portrays the research framework of this study, consisting of four main steps, namely (1) data collection, (2) data preparation, (3) model development and (4) model evaluation. First, we collected the data from the Indonesian Statistical Bureau (locally known as Badan Pusat Statistik or BPS), the TripAdvisor travel forum and Google's search engine. In the second step, we conducted data preprocessing followed by feature extraction to obtain valuable and representative information from the dataset. The third step is the forecasting model development phase, followed by model evaluation in the fourth step.
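The data preparation step above (temporal features, the one-month lag of arrivals and standardization) can be sketched as follows; the toy monthly records, the column names and the single lag feature are illustrative assumptions, not the study's actual dataset or schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy monthly records standing in for the merged BPS / TripAdvisor / Google Trends data
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=6, freq="MS"),
    "arrivals": [1200000, 1150000, 1300000, 400000, 120000, 150000],
    "posts_comments": [5400, 5100, 4900, 2100, 800, 950],
    "search_index": [78, 74, 81, 35, 12, 15],
})

# Temporal effect: month and year features (Figure 1)
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Inertia effect: previous month's arrivals as a predictor
df["arrivals_lag1"] = df["arrivals"].shift(1)
df = df.dropna()  # the first month has no lag value

# Standardize the predictors so all inputs share a comparable scale
predictors = ["month", "year", "arrivals_lag1", "posts_comments", "search_index"]
X = StandardScaler().fit_transform(df[predictors])
y = df["arrivals"].to_numpy()
print(X.shape, y.shape)  # (5, 5) (5,)
```

In a real run the lag features would be computed before splitting the data chronologically, so that each month's predictors use only information available at prediction time.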
The diagram presents a left-to-right workflow composed of four large sections labeled “Data Collection”, “Data Preparation”, “Model Development” and “Model Evaluation”, connected by arrows indicating process flow. In “Data Collection”, three stacked rectangular blocks list the sources: “International tourist arrivals”, “Online travel forum” and “Google Trends dataset”. An arrow leads to “Data Preparation”, which contains two subsections. The upper subsection, “Data Preprocessing”, includes blocks labeled “Data transformation” and “Data standardization”. The lower subsection, “Feature Extraction”, is divided into two groups: “Temporal Effect” (“Month” and “Year”) and “Inertia Effect” (“Monthly tourist arrivals”, “Monthly posts and comments” and “Monthly search volume index”). An arrow leads to “Model Development”, which includes three stages: “Training dataset” points to “Prediction model training” (containing “ANN”, “SVR” and “RF”), “Validation dataset” points to “Grid search for hyperparameter” and “Testing dataset” points to “Final prediction models”; downward arrows connect the three stages in sequence. A final arrow leads to “Model Evaluation”, where “Prediction of international tourist arrivals” branches to three metric blocks labeled “RMSE”, “MAPE” and “MAE”.

Figure 1. The research framework
Table 2 shows the specification of the prediction models, namely the predictors and predicted variables. We use four groups of variables: temporal factors, TripAdvisor, Google Trends and international tourist arrivals. In total, we use four different sets of predictors and predicted variables in developing the prediction models with ANN, SVR and RF. We vary the predictors to verify that the proposed multisource Internet data can improve the prediction accuracy. Model evaluation based on root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) was used to examine out-of-sample prediction accuracy. To ensure the robustness of the prediction models using multisource Internet data, we constructed the models using three distinct data compositions with different lengths of training, validation and testing datasets. Because different data-split settings can affect a model's forecasting performance (Yang et al., 2014), it is important to determine which setting leads to the highest prediction accuracy.
Table 2. The specification of the prediction models
| Construct | Attribute | Function | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|---|
| Temporal | Month | Predictor | v | v | v | v |
| | Year | Predictor | v | v | v | v |
| TripAdvisor | Number of posts | Predictor | v | v | | |
| | Number of comments | Predictor | v | v | | |
| Google Trends | Main entry point | Predictor | v | v | | |
| | International travel requirement | Predictor | v | v | | |
| | Tourism planning | Predictor | v | v | | |
| Tourist arrivals | International tourist arrivals in the previous month | Predictor | v | v | | |
| | Monthly international tourist arrivals | Predicted | v | v | v | v |
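The development and evaluation loop described above (a chronological train/validation/test split, grid search for hyperparameters on the validation set, and out-of-sample RMSE/MAE/MAPE) can be sketched as follows. The synthetic data, the 70/15/15 split and the hyperparameter grid are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(42)
n = 54  # e.g. monthly observations, Jan 2017 - Jun 2021
X = rng.normal(size=(n, 8))
y = 100000 + 5000 * X[:, 0] - 3000 * X[:, 1] + rng.normal(scale=1000, size=n)

# Chronological split: one possible "data composition" (here 70/15/15)
i_tr, i_va = int(n * 0.7), int(n * 0.85)
X_tr, X_va, X_te = X[:i_tr], X[i_tr:i_va], X[i_va:]
y_tr, y_va, y_te = y[:i_tr], y[i_tr:i_va], y[i_va:]

# Grid search for hyperparameters against the validation set
best, best_mae = None, float("inf")
for n_trees in (100, 300):
    for depth in (3, 6, None):
        model = RandomForestRegressor(n_estimators=n_trees, max_depth=depth,
                                      random_state=0)
        model.fit(X_tr, y_tr)
        val_mae = mean_absolute_error(y_va, model.predict(X_va))
        if val_mae < best_mae:
            best, best_mae = model, val_mae

# Out-of-sample evaluation with the three accuracy metrics
pred = best.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
mae = mean_absolute_error(y_te, pred)
mape = float(np.mean(np.abs((y_te - pred) / y_te)) * 100)
print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  MAPE={mape:.2f}%")
```

The same loop applies to the ANN and SVR models by swapping the estimator and its grid; only the chronological ordering of the split matters for a time-indexed target like monthly arrivals.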
3.1 Artificial neural networks
A feed-forward neural network consists of an input layer, one or more hidden layers and one output layer, where each neuron in one layer conveys information to all neurons in the subsequent layer (Höpken et al., 2020). In this study, the ANN model consists of an input layer with three neurons that represent the predictor variables, namely the previous tourist arrivals, the number of posts and comments and the search volume index, and an output layer representing the predicted variable, namely international tourist arrivals ($Y$). The output of the hidden neurons ($V_L$) and the international tourist arrivals ($Y$) can be written as in Eqs. (1) and (2):

$$V_L = f\Big(\sum_{i} w_{Li}\, x_i + b_L\Big) \tag{1}$$

$$Y = \sum_{L} w_L V_L + \beta \tag{2}$$

where $w_{Li}$ is the input weight, $x_i$ are the input neurons, $b_L$ is the hidden-layer threshold, $w_L$ is the output weight, $V_L$ is the output of the hidden neurons, $\beta$ is the output-layer threshold, $f$ is the activation function and $Y$ is the output neuron (international tourist arrivals). Figure 2 shows the structure of the feed-forward neural network.
The diagram illustrates a neural network structure arranged from left to right with three sections labeled “Input Layer ($i$)”, “Hidden Layer ($L$)” and “Output Layer”. In the input layer on the left, three circular nodes labeled $x_1$, $x_2$ and $x_i$ receive dashed arrows from the left indicating inputs. Solid lines extend from these input nodes toward the hidden layer in the center, which contains two circular nodes labeled $V_1$ and $V_L$. Multiple connecting lines from the input nodes converge on these hidden nodes, with weights $w_{1i}$ and $w_{Li}$ shown along some of the connections. Bias terms $b_1$ and $b_L$ are indicated by arrows pointing toward the hidden nodes $V_1$ and $V_L$, respectively. From the hidden layer, two arrows extend to a single circular output node labeled $Y$. The connections from $V_1$ and $V_L$ to $Y$ are labeled $w_1$ and $w_L$, and a parameter $\beta$ appears along the upper connection leading toward the output node.

Figure 2. The structure of the feed-forward neural network
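Eqs. (1) and (2) amount to a single forward pass through the network. A minimal NumPy sketch follows; the randomly initialized weights stand in for trained ones, and the `tanh` activation and layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W_in, b, w_out, beta):
    """One feed-forward pass: Eq. (1) then Eq. (2)."""
    V = np.tanh(W_in @ x + b)   # hidden outputs: V_L = f(sum_i w_Li * x_i + b_L)
    return w_out @ V + beta     # output: Y = sum_L w_L * V_L + beta

i, L = 3, 5                     # 3 predictors, 5 hidden neurons (illustrative)
x = rng.normal(size=i)          # standardized predictor values
W_in = rng.normal(size=(L, i))  # input weights w_Li
b = rng.normal(size=L)          # hidden-layer thresholds b_L
w_out = rng.normal(size=L)      # output weights w_L
beta = 0.1                      # output-layer threshold

y_hat = forward(x, W_in, b, w_out, beta)
print(float(y_hat))
```

In practice the weights are fitted by backpropagation (e.g. via a library such as scikit-learn's `MLPRegressor`) rather than set by hand.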
3.2 Support vector regression
Support vector machine (SVM) is a machine learning algorithm that maps data into a high-dimensional feature space through a nonlinear mapping function (Li et al., 2020). SVM classifies the training data vectors, given in Eq. (3), into two segments:

$$\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}, \quad \mathbf{x}_i \in \mathbb{R}^d \tag{3}$$

where $\mathbf{x}_i$ are the training data vectors ($\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3})$, with $x_{i1}$ the previous tourist arrivals, $x_{i2}$ the number of posts and comments and $x_{i3}$ the search volume index), $n$ is the number of training data and $d$ is the input space dimension, given by the number of predictor variables. The training data vectors are classified by a hyperplane that satisfies the following equations:

$$\mathbf{w} \cdot \varphi(\mathbf{x}_i) + b \ge +1 \quad \text{for segment 1}, \qquad \mathbf{w} \cdot \varphi(\mathbf{x}_i) + b \le -1 \quad \text{for segment 2} \tag{4}$$

where $\mathbf{w}$ is the weight vector, $\varphi(\mathbf{x})$ is the mapping of the input space ($\mathbf{x}$) to the high-dimensional space, $b$ is a constant and $\varepsilon$ defines the $\varepsilon$-insensitive loss function.
In Figure 3, we draw two parallel lines, $\mathbf{w} \cdot \mathbf{x} + b = 1$ for one segment and $\mathbf{w} \cdot \mathbf{x} + b = -1$ for the other. In SVR, the model seeks a hyperplane to fit the given training data points with the fitting function $f(\mathbf{x}) = \mathbf{w} \cdot \varphi(\mathbf{x}) + b$ by minimizing the regularized risk function $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$, where $C$ is the regularization parameter and $\xi_i$ and $\xi_i^*$ are the distances from the actual value to the boundary values of the $\varepsilon$-tube. Thus, the nonlinear mapping function can be generated by applying the Lagrange multiplier (Yao et al., 2021):

$$\hat{y} = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(\mathbf{x}_i, \mathbf{x}) + b \tag{5}$$

where $\hat{y}$ is the prediction of tourist arrivals, $\mathbf{x}_i$ are the training data vectors ($\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3})$, with $x_{i1}$ the previous tourist arrivals, $x_{i2}$ the number of posts and comments and $x_{i3}$ the search volume index), $\alpha_i$ and $\alpha_i^*$ are the Lagrange coefficients, $K$ is the kernel function and $b$ is the constant.
The figure shows a two-dimensional scatter plot with a horizontal axis labeled $x$ and a vertical axis labeled $y$. Two groups of circular data points appear on opposite sides of a slanted decision boundary. One group corresponds to “Segment 1” and the other to “Segment 2”, as indicated in the legend on the right, which also lists “Support vectors segment 1” and “Support vectors segment 2”. A central solid diagonal line represents the decision boundary $\mathbf{w} \cdot \mathbf{x} + b = 0$. Two parallel dashed lines on either side, $\mathbf{w} \cdot \mathbf{x} + b = 1$ above and $\mathbf{w} \cdot \mathbf{x} + b = -1$ below, form the margin region. Several points closest to these dashed lines are marked as support vectors for each segment. Arrows between the dashed lines indicate the margin width $2/\|\mathbf{w}\|$, and shorter arrows from the boundary to the margin lines indicate the distances $\varepsilon$ and $-\varepsilon$, respectively. Additional arrows from selected points to the margin lines are labeled $\theta_i$ and $v_i$. The data points from the two segments are distributed on opposite sides of the decision boundary, with support vectors located along the margin lines.

Figure 3. The margin and decision boundary of the support vector machine
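The $\varepsilon$-insensitive SVR described above is rarely implemented from scratch in practice. A minimal sketch with scikit-learn's RBF-kernel SVR follows; the synthetic data stand in for the three predictors, and the values of $C$ and $\varepsilon$ are illustrative, not the study's tuned hyperparameters.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
# Columns stand in for: lagged arrivals, posts+comments, search index
X = rng.normal(size=(48, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=48)

# RBF-kernel SVR: C is the regularization parameter, epsilon the tube width
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X[:40], y[:40])       # fit on the first 40 observations
pred = model.predict(X[40:])    # predict the held-out 8
print(pred.shape)               # (8,)
```

Standardizing inside the pipeline matters for SVR: the RBF kernel is distance-based, so features on wildly different scales would otherwise dominate the kernel.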
3.3 Random forest
The RF has grown in popularity due to its high reliability and practical applicability in various fields (Khaidem et al., 2016; Tyralis and Papacharalampous, 2017; Li et al., 2020). This model combines classification and regression trees with the bagging method to improve accuracy (Breiman, 2001). Figure 4 portrays the process of RF.
The figure shows a top-down workflow illustrating how predictions from multiple decision trees are combined. At the top, a block labeled “Training dataset” splits into two arrows pointing to “Training subset$_1$” on the left and “Training subset$_N$” on the right. Each subset points downward to a decision tree labeled “Tree$_1$” and “Tree$_N$”, respectively, with an ellipsis between them indicating additional trees. Each tree is depicted as a branching structure of circular nodes connected by arrows. From each tree, an arrow points to a block labeled “Testing dataset”. Below these, arrows lead to blocks labeled “Single tree result ($y_1$)” on the left and “Single tree result ($y_N$)” on the right. Finally, arrows from these results converge at the bottom into a block labeled “Average all prediction results ($\hat{y}$)”, indicating the aggregation of predictions from all trees.

Figure 4. The rationale of random forest
First, training subsets are randomly selected from the training dataset. Second, trees are randomly generated and trained using the training subsets. The parent node t splits into two daughter nodes, and the decrease in information impurity due to this split can be written as

Δi(s, t) = i(t) − p_L i(t_L) − p_R i(t_R)

where i(t) is the Gini impurity measure in node t, p_L is the population proportion of the left daughter node t_L and p_R is the population proportion of the right daughter node t_R.
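As an illustration, the split criterion above can be sketched in a few lines (a minimal sketch; the function names and toy labels are ours, not the paper's):

```python
import numpy as np

def gini(labels):
    """Gini impurity i(t) of a node, given the class labels it contains."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Decrease in impurity from a split:
    delta_i = i(t) - p_L * i(t_L) - p_R * i(t_R)."""
    n = len(parent)
    p_l, p_r = len(left) / n, len(right) / n
    return gini(parent) - p_l * gini(left) - p_r * gini(right)

parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]      # a perfect split
print(impurity_decrease(parent, left, right))  # 0.5 - 0 - 0 = 0.5
```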
Third, each tree predicts the testing dataset, and the prediction results generated by all trees are averaged to obtain the final output of tourist arrivals prediction. The final output of RF is as follows:

ŷ = (1/N) Σ_{i=1}^{N} y_i

where ŷ is the final output, N is the number of trees and y_i is the result of a single tree.
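This averaging step can be verified directly with scikit-learn, whose `RandomForestRegressor` exposes its fitted trees (a sketch on synthetic data; the paper does not state which implementation it used):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(54, 4))                 # e.g. 54 monthly observations
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=54)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# The forest's prediction is the average of the individual tree predictions.
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
assert np.allclose(tree_preds.mean(axis=0), rf.predict(X))
```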
4. Case study
4.1 Data collection
As a case study, we analyze international tourist arrivals to Indonesia during the COVID-19 pandemic. First, we collected tourist arrivals data from Statistics Indonesia (BPS) from January 2017 to June 2021. Next, we collected data from a global online tourism platform, TripAdvisor. Table 3 shows a data sample from the Indonesia travel forum on TripAdvisor. The dynamic interactions within online forums can be seen from the number of posts and comments, which vary every day and cover diverse topics (Fronzetti Colladon et al., 2019). More than 43,000 posts and 243,000 user comments were obtained.
Data sample of Indonesia travel forum in TripAdvisor
| Variable | Data type | Data example |
|---|---|---|
| Forum | String | Bali |
| Topic | String | “is bali safe for vacation?” |
| Link of post | String | https://www.tripadvisor.com/ShowTopic-g294226-i7220-k13419945-Is_bali_safe_for_vacation-Bali.html |
| Author of post | String | Olivia |
| Link of the author's profile | String | https://www.tripadvisor.com/Profile/viva99slot?tab=forum |
| Posting date | Date | Dec 10, 2020 |
| Number of comments | Integer | 25 |
| Last comment by user | String | SW0590 |
| Link of the commenter's profile | String | https://www.tripadvisor.com/Profile/SW0590?tab=forum |
Table 4 shows the Google Trends keywords selected for this study. The keywords are categorized into three topics: main entry point, international travel requirement and tourism planning. The search volume index represents search interest on a scale from 0 to 100, where 100 marks the keyword's peak popularity.
Google Trends keywords
| Topic | Keyword |
|---|---|
| Main entry point | Ngurah Rai International Airport |
| Soekarno-Hatta International Airport | |
| Batam ferry terminal | |
| Bali | |
| Jakarta | |
| Batam | |
| International travel requirement | Passport Indonesia |
| Visa Indonesia | |
| Tourism planning | Indonesia hotel |
| Indonesia resort | |
| Indonesia restaurant | |
| Indonesia travel |
Table 5 summarizes the descriptive statistics of the datasets. The statistics consist of monthly international tourist arrivals, daily posts and comments in the Indonesia travel forum, and the monthly search volume index of the selected keywords.
Descriptive statistics of the datasets
| Data source | Variable | Count | Mean | Std. dev | Min | Max |
|---|---|---|---|---|---|---|
| Statistics Indonesia (BPS) (https://www.bps.go.id/) | Tourist arrivals | 54 | 940,902.98 | 518,460.03 | 115,765 | 1,547,231 |
| TripAdvisor (https://www.tripadvisor.com/ShowForum-g294225-i7219-o5320-Indonesia.html) | Posts | 1,642 | 26.54 | 17.70 | 0 | 73 |
| Comments | 1,642 | 148.08 | 100.55 | 0 | 728 | |
| Google Trends (https://trends.google.com/) | Ngurah Rai International airport | 54 | 29.11 | 14.94 | 9 | 98 |
| Soekarno-Hatta International airport | 54 | 69.78 | 15.86 | 36 | 100 | |
| Batam ferry terminal | 54 | 38.94 | 25.64 | 0 | 100 | |
| Bali | 54 | 65.85 | 19.97 | 33 | 100 | |
| Jakarta | 54 | 81.80 | 10.35 | 57 | 100 | |
| Batam | 54 | 78.48 | 10.77 | 56 | 100 | |
| Passport Indonesia | 54 | 39.30 | 13.25 | 14 | 59 | |
| Visa Indonesia | 54 | 65.98 | 23.19 | 25 | 92 | |
| Indonesia hotel | 54 | 67.07 | 13.38 | 36 | 100 | |
| Indonesia resort | 54 | 46.33 | 15.78 | 14 | 100 | |
| Indonesia restaurant | 54 | 50.63 | 11.54 | 21 | 77 | |
| Indonesia travel | 54 | 38.63 | 8.96 | 19 | 66 |
Figure 5 portrays all variables utilized for developing the prediction models. International tourist arrivals have declined sharply since February 2020 due to the government's travel restrictions amid COVID-19. During the outbreak, interaction in the travel forum and the popularity of the selected search keywords also decreased.
Figure 5: All variables for developing the prediction models. Monthly time series from 2017 to 2021 of tourist arrivals, TripAdvisor posts and comments, and the search popularity of the twelve Google Trends keywords; every series declines sharply in early 2020 and remains low through 2021. (All numerical values in the figure are approximate.)
4.2 Data preparation
This phase consists of data preprocessing and feature extraction. In data preprocessing, we transform all data into monthly frequency. We apply a three-month moving average to the Google Trends data to smooth the popularity trends and filter out noise. In the last preprocessing step, we standardize the data using Eq. (8):

z = (x − μ) / σ (8)

where x is the original value, μ is the mean and σ is the standard deviation.
For processing time series data with machine learning methods, we extract two temporal features: month and year. These are converted into dummy variables to prevent information duplication. We also extract inertia (lag) features, which capture each variable's value in the previous month, for all data categories: tourist arrivals, search volume index, and the number of posts and comments.
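The preprocessing and feature-extraction steps above can be sketched with pandas (a hypothetical frame with illustrative column names and random values, not the authors' code):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame covering the study period (54 months).
idx = pd.date_range("2017-01-01", "2021-06-01", freq="MS")
rng = np.random.default_rng(1)
df = pd.DataFrame({"arrivals": rng.integers(100_000, 1_600_000, len(idx)),
                   "trend_bali": rng.integers(0, 101, len(idx))}, index=idx)

# 1) Smooth a Google Trends series with a three-month moving average.
df["trend_bali_ma3"] = df["trend_bali"].rolling(3, min_periods=1).mean()

# 2) Standardize: z = (x - mean) / std.
for col in ["arrivals", "trend_bali_ma3"]:
    df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std()

# 3) Temporal dummy variables and one-month lag (inertia) features.
df = pd.concat([df, pd.get_dummies(df.index.month, prefix="month").set_axis(idx)],
               axis=1)
df["arrivals_lag1"] = df["arrivals"].shift(1)
print(df.shape)
```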
4.3 Model development
We split the entire dataset into training, validation and testing segments. The training datasets are decomposed into three compositions (see Figure 6): (1) January 2017–April 2020 (when COVID-19 began to spread and infect Indonesian citizens), (2) January 2017–August 2020 (when the government implemented international travel restrictions) and (3) January 2017–December 2020 (when the government extended the international travel restrictions and implemented wide-scale social restrictions).
Figure 6: Composition of training, validation and testing datasets. Composition 1: training January 2017–April 2020, validation May–November 2020, testing December 2020–June 2021. Composition 2: training January 2017–August 2020, validation September 2020–January 2021, testing February–June 2021. Composition 3: training January 2017–December 2020, validation January–March 2021, testing April–June 2021.
The model parameters are optimized through a hyperparameter grid search (Lijuan and Guohua, 2016; Bi et al., 2020). First, we optimized the learning rate and the number of hidden layers for the ANN model. Second, three parameters, namely the regularization parameter (C), the kernel and epsilon (ε), are optimized for the SVR model. Lastly, the grid search for the RF model considers the number of variables randomly sampled at each split (Mtry), the number of trees (N trees) and the maximum number of nodes. Table 6 shows the results of the hyperparameter optimization.
Hyperparameter optimization
| Data composition | Method | Hyperparameter | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|---|
| 1 | ANN | Learning rate | 0.01 | 0.01 | 0.1 | 0.1 |
| Hidden layer | 8 | 7 | 7 | 10 | ||
| SVR | C | 0.01 | 0.01 | 0.01 | 0.01 | |
| Kernel | Sigmoid | Sigmoid | Sigmoid | Sigmoid | ||
| Epsilon | 0.025 | 0.025 | 0.05 | 0.05 | ||
| RF | Mtry | 5 | 4 | 5 | 4 | |
| N trees | 10 | 30 | 50 | 10 | ||
| Maximum nodes | 5 | 10 | 10 | 5 | ||
| 2 | ANN | Learning rate | 0.1 | 0.1 | 0.01 | 0.1 |
| Hidden layer | 4 | 3 | 1 | 3 | ||
| SVR | C | 0.01 | 0.01 | 0.01 | 0.01 | |
| Kernel | Sigmoid | Sigmoid | Sigmoid | Sigmoid | ||
| Epsilon | 0.025 | 0.025 | 0.025 | 0.025 | ||
| RF | Mtry | 5 | 5 | 5 | 4 | |
| N trees | 30 | 10 | 30 | 10 | ||
| Maximum nodes | 10 | 10 | 10 | 5 | ||
| 3 | ANN | Learning rate | 0.01 | 0.1 | 0.1 | 0.01 |
| Hidden layer | 2 | 7 | 3 | 2 | ||
| SVR | C | 0.01 | 0.01 | 0.01 | 0.01 | |
| Kernel | Sigmoid | Sigmoid | Sigmoid | Sigmoid | ||
| Epsilon | 0.05 | 0.025 | 0.025 | 0.025 | ||
| RF | Mtry | 3 | 5 | 5 | 4 | |
| N trees | 10 | 40 | 10 | 10 | ||
| Maximum nodes | 5 | 10 | 5 | 10 |
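The grid search described above can be sketched with scikit-learn, using a fixed validation fold analogous to the rolling splits in Figure 6 (synthetic data; the RF parameter grid echoes the values appearing in Table 6, but the code itself is our illustration, not the authors' implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, PredefinedSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=48)

# The last 6 observations act as the fixed validation fold;
# -1 marks rows that are always used for training.
fold = np.r_[np.full(42, -1), np.zeros(6, dtype=int)]

grid = {"max_features": [3, 4, 5],        # Mtry
        "n_estimators": [10, 30, 50],     # N trees
        "max_leaf_nodes": [5, 10]}        # maximum nodes
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      cv=PredefinedSplit(fold),
                      scoring="neg_root_mean_squared_error").fit(X, y)
print(search.best_params_)
```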
4.4 Model evaluation
Evaluating model performance is an inseparable step in developing prediction models. The difference between the predicted and actual values is the prediction error (Li et al., 2017). We evaluate prediction performance using two scale-dependent errors, RMSE and MAE, and a percentage error, MAPE, calculated using Eq. (9)–(11):

RMSE = sqrt( (1/n) Σ_{t=1}^{n} (y_t − ŷ_t)² ) (9)

MAE = (1/n) Σ_{t=1}^{n} |y_t − ŷ_t| (10)

MAPE = (100%/n) Σ_{t=1}^{n} |(y_t − ŷ_t) / y_t| (11)

where y_t is the actual and ŷ_t is the predicted value of tourist arrivals.
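The three metrics can be computed directly (a minimal sketch with made-up toy values):

```python
import numpy as np

def rmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

actual = [100_000, 120_000, 150_000]
pred = [110_000, 114_000, 150_000]
print(rmse(actual, pred), mae(actual, pred), mape(actual, pred))
```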
5. Results and discussion
Tables 7 and 8 summarize the accuracy of all prediction models in terms of RMSE and MAE. Of the 36 models, those utilizing multisource Internet data consistently outperform the models using single-source or no Internet data predictors. The superiority of multisource Internet data also holds across the different data compositions, indicating the robustness of the approach. Furthermore, all prediction models trained using data composition 3 yielded better RMSE and MAE than those trained using data compositions 1 and 2: RMSE and MAE improve substantially when more data from within the outbreak period are incorporated.
RMSE of the prediction models
| Model | Data composition | Predictors | |||
|---|---|---|---|---|---|
| 1 (Temporal + Previous arrivals) | 2 (Temporal + TripAdvisor) | 3 (Temporal + Google Trends) | 4 (Temporal + Previous arrivals + TripAdvisor + Google Trends) | ||
| ANN | Data composition 1 | 410,507.90 | 286,340.00 | 263,504.20 | 115,814.80 |
| Data composition 2 | 244,910.67 | 161,509.29 | 122,716.81 | 62,873.48 | |
| Data composition 3 | 48,094.28 | 23,014.18 | 20,578.84 | 11,698.45 | |
| SVR | Data composition 1 | 251,740.00 | 172,068.00 | 198,134.90 | 164,414.70 |
| Data composition 2 | 180,286.49 | 85,727.30 | 108,142.37 | 78,241.15 | |
| Data composition 3 | 55,374.35 | 31,004.33 | 58,953.95 | 28,175.02 | |
| RF | Data composition 1 | 676,674.60 | 349,367.50 | 154,126.95 | 55,156.28 |
| Data composition 2 | 349,704.78 | 20,014.93 | 100,762.79 | 19,084.13 | |
| Data composition 3 | 289,465.10 | 38,689.82 | 11,798.04 | 10,334.02* | |
Note(s): The italic figures indicate the best performing model across different data compositions and predictors sets within a similar prediction model, and * indicates the best performing model across different data compositions, predictors sets and prediction models
MAE of the prediction models
| Model | Data composition | Predictors | |||
|---|---|---|---|---|---|
| 1 (Temporal + Previous arrivals) | 2 (Temporal + TripAdvisor) | 3 (Temporal + Google Trends) | 4 (Temporal + Previous arrivals + TripAdvisor + Google Trends) | ||
| ANN | Data composition 1 | 386,762.20 | 268,013.40 | 244,880.00 | 107,909.30 |
| Data composition 2 | 243,863.25 | 160,187.97 | 122,073.73 | 61,053.17 | |
| Data composition 3 | 46,388.95 | 19,068.90 | 17,136.90 | 10,686.56 | |
| SVR | Data composition 1 | 241,113.60 | 158,033.00 | 188,119.90 | 156,366.70 |
| Data composition 2 | 179,089.45 | 84,890.89 | 106,825.07 | 77,057.46 | |
| Data composition 3 | 44,949.64 | 26,717.46 | 57,775.94 | 26,049.87 | |
| RF | Data composition 1 | 649,825.46 | 330,855.86 | 144,193.22 | 34,691.26 |
| Data composition 2 | 349,419.29 | 17,407.48 | 98,328.00 | 16,871.83 | |
| Data composition 3 | 289,244.74 | 34,160.84 | 10,816.20 | 9,930.24* | |
Note(s): The italic figures indicate the best performing model across different data compositions and predictors sets within a similar prediction model, and * indicates the best performing model across different data compositions, predictors sets and prediction models
In line with the RMSE and MAE results, Table 9 shows that the prediction models trained using data composition 3 have the lowest MAPE among the data compositions. These findings indicate that training prediction models on data that sufficiently covers unexpected events, such as COVID-19, positively influences prediction accuracy. As noted in a previous study, researchers must develop forecasting models that can account for unforeseen events (Qiu et al., 2021). Overall, the RF model incorporating all predictors and trained using data composition 3 has the highest prediction accuracy.
MAPE of the prediction models
| Model | Data composition | Predictors | |||
|---|---|---|---|---|---|
| 1 (Temporal + Previous arrivals) | 2 (Temporal + TripAdvisor) | 3 (Temporal + Google Trends) | 4 (Temporal + Previous arrivals + TripAdvisor + Google Trends) | ||
| ANN | Data composition 1 | 292.21% | 202.89% | 185.69% | 82.11% |
| Data composition 2 | 186.68% | 122.97% | 93.89% | 47.40% | |
| Data composition 3 | 34.51% | 14.63% | 13.13% | 7.70% | |
| SVR | Data composition 1 | 181.62% | 118.69% | 141.74% | 118.08% |
| Data composition 2 | 137.28% | 64.32% | 81.30% | 59.34% | |
| Data composition 3 | 33.52% | 19.03% | 42.49% | 19.57% | |
| RF | Data composition 1 | 488.66% | 248.91% | 109.15% | 26.74% |
| Data composition 2 | 266.34% | 13.95% | 76.44% | 13.46% | |
| Data composition 3 | 211.23% | 25.93% | 8.00% | 7.09%* | |
Note(s): The italic figures indicate the best performing model across different data compositions and predictors sets within a similar prediction model, and * indicates the best performing model across different data compositions, predictors sets and prediction models
Regarding the impact of different predictor sets on accuracy, prediction models trained using multisource Internet data predict tourist arrivals better than those trained using single-source Internet data or previous tourist arrivals alone. The ANN 4 and RF 4 models, which use the complete set of predictors, consistently outperformed the other three models. For the SVR models, however, the complete set of predictors yields the best RMSE and MAE but not the best MAPE: with data composition 3, the SVR 2 model has a slightly better MAPE than the SVR 4 model, exhibiting greater variation in prediction errors but a better average percentage error.
Evaluating the accuracy of the models utilizing single-source Internet data, Google Trends data yielded better forecasts than online forum data for the ANN and RF models, whereas online forum data yielded better forecasts for the SVR model. The training complexity of Google Trends data might be higher than that of online forum data due to its greater number of attributes, and the training complexity of SVM is itself high (Cervantes et al., 2007). Despite its sound theoretical foundations and accuracy, SVM does not perform well when the dataset contains more noise (Sarker, 2021). In any case, no single method outperforms all others in every forecasting context (Li et al., 2020), and not all Internet data variables improve accuracy (Yang et al., 2015).
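The three accuracy metrics used throughout the evaluation can be computed directly from the actual and predicted series. This is a generic sketch of the standard definitions, not the authors' evaluation script; the sample arrival figures are made up.

```python
import numpy as np

def rmse(actual, pred):
    """Root mean square error: penalizes large errors quadratically."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def mae(actual, pred):
    """Mean absolute error, in the same units as the data."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs(actual - pred)))

def mape(actual, pred):
    """Mean absolute percentage error, in percent (actuals must be nonzero)."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

# Hypothetical three-month comparison of actual vs. predicted arrivals.
actual = [120_000, 150_000, 135_000]
pred = [110_000, 160_000, 130_000]
print(rmse(actual, pred), mae(actual, pred), mape(actual, pred))
```

MAPE is scale-free, which is why it is convenient for comparing models across data compositions of different sizes, whereas RMSE and MAE stay in units of tourist arrivals.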
Figure 7 visually compares the models' predictions with the actual record of international tourist arrivals in Indonesia. The training set of data composition 1 covers only two months of the pandemic (March to April 2020), resulting in a premature learning process and inaccurate forecasts with many overestimations. With data composition 2, the RF model's predictions improve under predictor sets 2 and 4, but they still fail to capture the dynamics of tourist arrivals. The results improve significantly when data composition 3 and predictor set 4 are applied to the ANN and RF models. The SVR model with data composition 3, however, cannot produce good predictions once Google Trends data are appended, owing to the increased model complexity. In general, prediction accuracy improves as the training dataset's coverage of the COVID-19 period grows and the complete set of predictors is used.
Figure 7. Prediction results of international tourist arrivals in Indonesia. [Figure: twelve line charts in a four-by-three grid, with rows corresponding to predictor sets 1–4 and columns to data compositions 1–3. Each chart plots monthly tourist arrivals, comparing the actual record (dashed black line) with the ANN (red), SVR (blue) and RF (yellow) predictions. Under data compositions 1 and 2, all models substantially underestimate the actual arrivals of roughly 100,000–170,000 per month; under data composition 3, the predictions move much closer to the actual values, with predictor set 4 tracking them most closely.]
Predicting tourist arrivals during the COVID-19 period is a nontrivial task: in nonroutine circumstances, standard historical statistical records alone cannot yield accurate forecasts. Nevertheless, alternative data are available; search engine and online forum data are user-generated and publicly accessible. This study has demonstrated that multisource Internet data can significantly improve the accuracy of tourist arrival predictions under the travel restrictions imposed during the pandemic.
6. Conclusion and future work
This research presents machine learning models for predicting international tourist arrivals in Indonesia during the COVID-19 pandemic using multisource Internet data, namely the TripAdvisor travel forum and Google Trends. The results show the positive impact of combining multisource Internet data on forecasting performance. Prediction models that combine predictors from an online travel forum and a search engine are more accurate than those using a single source of Internet data, either the online travel forum or search queries alone. Moreover, our models outperform the prediction model that relies only on historical tourist arrival records.
In developing the model, we decompose the training datasets into three partitions: (1) January 2017–April 2020 (when COVID-19 started to gain public attention and infect Indonesian citizens), (2) January 2017–August 2020 (when the government implemented international travel restrictions) and (3) January 2017–December 2020 (when the government extended the international travel restrictions and implemented wide-scale social restrictions). The prediction models using the third training set perform best, and this result is consistent across all investigated prediction models. Note that the third training set has the most extensive coverage of the pandemic. Thus, training data that more fully cover the phenomenon of interest, such as COVID-19, improve the prediction model's learning process and accuracy. In conclusion, the complete set of predictors and the third data composition applied to the RF model yielded the best prediction performance, ahead of the ANN and SVR models.
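The three training windows described above can be carved out of a monthly series as follows. This is a minimal sketch: the series contents are placeholder values, and the variable names are assumptions rather than the authors' code.

```python
import pandas as pd

# Hypothetical monthly series indexed by month-end dates, Jan 2017 - Jun 2021.
idx = pd.date_range("2017-01-31", "2021-06-30", freq="M")
series = pd.Series(range(len(idx)), index=idx, name="arrivals")

# The three data compositions differ only in where the training window ends;
# pandas partial-string slicing on a DatetimeIndex is end-inclusive.
compositions = {
    1: series[:"2020-04"],  # Jan 2017 - Apr 2020 (COVID-19 emerges)
    2: series[:"2020-08"],  # Jan 2017 - Aug 2020 (travel restrictions)
    3: series[:"2020-12"],  # Jan 2017 - Dec 2020 (extended restrictions)
}

for k, train in compositions.items():
    print(k, len(train), train.index[-1].strftime("%Y-%m"))
```

Composition 3 contains 48 training months versus 40 for composition 1, which is the extra pandemic coverage credited with the improved learning.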
Compared to previous studies using search queries and online forums to predict tourist arrivals (Fronzetti Colladon et al., 2019; Sun et al., 2019; Huang and Hao, 2020), this study offers three contributions. First, it pioneers a multisource Internet data approach to predicting tourist arrivals amid the COVID-19 pandemic. Second, it validates the use of multisource Internet data to improve prediction performance. Third, it is one of the few papers to provide perspectives on the current state of Indonesia's tourism demand.
In terms of managerial implications, the presented forecasting models can support tourism decision-making in many contexts, such as pricing strategies, resource allocation, tourism infrastructure planning and emergency planning (Li et al., 2018; Sun et al., 2019). Accurate forecasts reinforce the foresight capabilities of tourism decision-makers and policymakers, helping governments make better decisions in unexpected situations such as the COVID-19 pandemic. Moreover, fast-growing Internet data allow managers to conduct in-depth analyses of visitor activities, interests and interactions, as well as their influence on tourism demand forecasting. Using Internet data in tourism demand analysis offers several advantages, including timeliness, low cost (since the data are publicly available) and good predictive power. Lastly, Internet data may help overcome the sample-size constraints of consumer survey data (Yang et al., 2015).
This study is not without limitations, which open opportunities for further research. First, it focuses only on international tourist arrivals in Indonesia; the selected keywords are limited and represent only this country's public interests and attention. Further studies can therefore investigate other search queries and travel forums relevant to their specific contexts, and can explore the application of multisource Internet data to other countries or destinations. Second, this study uses only two data variables extracted from an online forum. Other variables, such as a sentiment index summarizing public responses, could also be incorporated, and more external factors could be examined as inputs to the prediction model. Other data sources, such as Facebook, Twitter and other online forums, can be explored to enrich the training data during prediction model development.
Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Declaration of competing interest: The authors declare that there is no conflict of interest related to the publication of this paper.
