Knowing financial and economic information in advance helps in planning and developing policies for every country, especially developing countries such as Thailand and other Asian countries. Unfortunately, missing data or nonresponse is a pervasive problem in many areas of study, including finance and economics. Handling missing data properly before further analysis can improve results markedly and make policy planning more effective. This review of the generalized regression estimators for the population total can be applied to financial, economic and other data when missing data are present.
The generalized regression estimators for the population total, together with their variance estimators, under unequal probability sampling without replacement with missing data are explored under the reverse framework. Applications to financial and economic data in Thailand are also reviewed.
The reviewed literature shows that the proposed estimators perform best, giving smaller variances in all scenarios.
The generalized regression estimators can assist in estimating financial and economic data that contain missing values under different missingness mechanisms, and can be used in other applications where they yield more efficient estimators.
1. Introduction
Generalized regression (GREG) estimation is a design-based method for estimating population totals in survey sampling. It is often applied to financial data, which are seldom complete, so missing data is an inherent issue requiring a solution. Economic advancement is imperative for every country to maintain its infrastructure and its citizens’ quality of life. This calls for statistical analysis of data, where the problems of missing data and of choosing suitable estimators arise. Many measures have been put in place to support economic development in Thailand, as seen in the sustainable development plans of “Thailand 4.0”. Because Thailand depends heavily on revenue from tourism, its economy is liable to fluctuations, most recently due to the coronavirus pandemic. After the loss of revenue from foreign tourists, the economy became more focused on citizens’ assets, income, and cash flow within the country. Numerous policies have been introduced to support individuals’ financial stability and their capability to manage their assets during a pandemic. Analysis of the population’s financial problems is vital for addressing the crisis properly and for devising solutions and support for citizens in need throughout the pandemic. Data on the population’s expenses are required to understand the financial obstacles being faced and to address them suitably.
Furthermore, the government has introduced several measures to stimulate domestic tourism, such as the “Thai Travel Together” campaign, which encourages cash flow within the country and mitigates the hardship inflicted on the economy by the COVID-19 crisis. Other factors also affect the economy, including insufficient investment, which harms it on a large scale. Sustainable development plans target ten industries and aim to resolve the problems of production efficiency and competitiveness in Thailand’s industrial economic structure.
However, missing data or nonresponse often occurs in real-world data and can obscure the facts used for decision making in business and economics, so opportunities are lost due to incomplete data. Missing data arise, for instance, when units do not respond at all or when participants choose not to answer specific questions. When the missingness depends on neither the missing values nor the observed values, the mechanism is called missing completely at random (MCAR) or uniform nonresponse. When the missingness is related to the observed values but not to the missing values themselves, it is called missing at random (MAR). Difficulties in acquiring accurate data can result from a lack of records or from nonresponse in surveys, so resolving nonresponse is imperative for appropriate financial planning, and statistical methods that tackle nonresponse are vital to solving this problem.
The nonresponse issue was first addressed by Hansen and Hurwitz (1946) in the context of mail surveys. They introduced an unbiased estimator for the population mean that used data from a sample survey on both respondents and non-respondents under unequal probability sampling without replacement (UPWOR). Horvitz and Thompson (1952) suggested weighting to create an unbiased estimator of the population total under unequal probability sampling both with and without replacement. The inverse of the first-order inclusion probability is used as the weight to correct the bias. Unfortunately, the Horvitz and Thompson variance is difficult to calculate because it requires joint inclusion probabilities, which are hard to find in some complex survey designs. Later, Hajek (1964) proposed a new estimator to address this issue; its variance estimator gives a smaller variance than that of Horvitz and Thompson (1952), but only when there is no relationship between the study variable and the inclusion probabilities. His estimator for the population total is a ratio estimator, the ratio of the sample means of two random variables, and is approximately unbiased.
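As a minimal sketch of the two estimators just described, the following Python code computes a Horvitz and Thompson total and a Hajek-type ratio total from a sample. The function names, the three sampled values, the inclusion probabilities and the population size N = 8 are all hypothetical and chosen only for illustration.

```python
import numpy as np

def horvitz_thompson_total(y, pi):
    """Horvitz-Thompson total: sampled values weighted by 1 / pi_i."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(y / pi))

def hajek_total(y, pi, N):
    """Hajek-type ratio total: N times a weighted mean with weights 1 / pi_i."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    w = 1.0 / pi
    return float(N * np.sum(w * y) / np.sum(w))

# Hypothetical sample of three units from a population of N = 8 units.
y_sample = [10.0, 20.0, 30.0]
pi_sample = [0.5, 0.25, 0.25]

ht = horvitz_thompson_total(y_sample, pi_sample)
hajek = hajek_total(y_sample, pi_sample, N=8)
print(ht, hajek)
```

The two estimates differ here because the estimated population size implied by the weights (the sum of 1 / pi_i, which is 10) does not equal the assumed N = 8; the ratio form rescales the weights to the known N.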
The GREG estimator is a special type of calibration estimator that improves on this method of estimation by using auxiliary information. It takes the form of the Horvitz and Thompson (1952) estimator combined with a weighting adjustment, which can help reduce nonresponse bias. Bethlehem and Keller (1987) introduced weights derived from linear models, a weighting method that can be used in person-based estimation. Many works have built on GREG to exploit the relationship between the study and auxiliary variables and so improve the efficiency of estimators of the population total or population mean, and of their variance estimators (see, e.g. Montanari, 1987; Särndal et al., 1992; Estevao and Särndal, 2003; Särndal and Lundström, 2005; Särndal, 2007). Under nonresponse, the two-phase framework treats the selection of the sample as the first phase and the nonresponse as the second phase. It is a popular technique for studying the variance of the GREG estimators (see, e.g. Rao, 1990; Särndal, 1992; Deville and Särndal, 1994; Särndal and Lundström, 2005).
Fay (1991) proposed an alternative to the two-phase framework, the reverse framework. The name comes from reversing the order of the phases: nonresponse is considered in the first phase and sampling in the second phase (see, e.g. Shao and Steel, 1999; Haziza and Rao, 2006; Haziza, 2010). Under the reverse framework, the population total estimators and the GREG estimators, along with their variance estimators, have been investigated under the MCAR and MAR nonresponse mechanisms and under different assumptions about the response probabilities and the sampling fractions (Lawson, 2017; Lawson and Ponkaew, 2019; Lawson and Siripanich, 2022; Ponkaew and Lawson, 2023).
In this paper, the GREG estimators under the reverse framework are reviewed. The paper is structured as follows. Section 2 gives the literature review. Sections 3 and 4 review the basic setup and the generalized regression estimators with missing data, respectively. Section 5 presents examples of applications to financial and economic data in Thailand. Finally, Section 6 gives some conclusions and discussion.
2. Literature review
First, we consider how the generalized regression estimators have been developed and how they can be useful for estimating financial, economic, and other data. The generalized regression estimator can estimate the population mean or total. It takes the form of the Horvitz and Thompson (1952) estimator, a well-known population total estimator under unequal probability sampling both with and without replacement. Nevertheless, the Horvitz and Thompson variance estimator requires known joint inclusion probabilities, also called second-order inclusion probabilities, which are the probabilities that two different population units are both selected in the sample. These values are difficult to obtain in complex survey designs, so the Horvitz and Thompson variance estimator is not easy to use in practice. Under unequal probability sampling with replacement, the variance estimators take simple forms because these probabilities are not required, unlike the variance formula under UPWOR, which requires the joint inclusion probabilities.
Some researchers attempted to solve this issue in variance estimation (Sen, 1953; Yates and Grundy, 1953), but their estimators still require the joint inclusion probabilities, which are unknown or hard to find. Therefore, several methods have been suggested for estimating the joint inclusion probabilities (Hartley and Rao, 1962; Hajek, 1964, 1981; Brewer, 2002; Brewer and Donadio, 2003).
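To make the dependence on joint inclusion probabilities concrete, the following Python sketch implements the Sen-Yates-Grundy form of the variance estimator for a fixed-size design. The function name and the toy data are assumptions made for illustration. Simple random sampling without replacement is used as a check because its joint inclusion probabilities are known in closed form, which is exactly what is not available in many complex designs.

```python
import numpy as np

def syg_variance(y, pi, pi_joint):
    """Sen-Yates-Grundy variance estimator of the Horvitz-Thompson total.

    y        : sampled study values (length n)
    pi       : first-order inclusion probabilities of the sampled units
    pi_joint : n x n matrix of joint inclusion probabilities (off-diagonal used)
    """
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    pi_joint = np.asarray(pi_joint, dtype=float)
    z = y / pi
    n = len(y)
    v = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            weight = (pi[i] * pi[j] - pi_joint[i, j]) / pi_joint[i, j]
            v += weight * (z[i] - z[j]) ** 2
    return float(v)

# Check under simple random sampling without replacement (hypothetical data).
N, n = 10, 4
y_sample = np.array([1.0, 2.0, 3.0, 4.0])
pi = np.full(n, n / N)
pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))

v_syg = syg_variance(y_sample, pi, pi_joint)
v_srs = N**2 * (1 - n / N) * y_sample.var(ddof=1) / n  # textbook SRSWOR formula
print(v_syg, v_srs)
```

Under this equal-probability design the two values coincide, which is a convenient sanity check before the function is used with unequal probabilities.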
The GREG estimators assist in estimating the population mean and total when information is available on an auxiliary variable related to the study variable. The GREG estimator has the structure of the Horvitz and Thompson (1952) estimator with an additional adjustment calculated from the auxiliary variable. Optimal GREG estimators were developed using the known value of the regression coefficient in the population (Montanari, 1987; Berger et al., 2003) under different sampling plans such as stratified two-stage cluster sampling. Because the GREG estimator is nonlinear, the Taylor linearization method is used to transform it into a linear form in order to study its variance and the associated variance estimator. A drawback of the GREG variance estimator in this situation is that, like the Horvitz and Thompson (1952) method, it requires known joint inclusion probabilities, which makes the variance complex to calculate under UPWOR. With nonresponse, Särndal and Lundström (2005) introduced an almost unbiased GREG estimator of the population total and a variance estimator under the two-phase framework, which requires nonresponse propensities. Under the reverse framework, some studies have explored GREG estimators with missing data. For the MCAR nonresponse mechanism, a GREG estimator of the population total has been suggested for unit nonresponse in the study variable, with a negligible sampling fraction, under unstratified, one-stage unequal probability sampling (Lawson and Ponkaew, 2019). MCAR is quite a restrictive assumption, since a constant response probability tends not to occur in practice, and the estimator is nonlinear. However, they proposed the modified automated linearization method to deal with this problem and showed that their estimator is almost unbiased and does not require the response probability.
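The structure described at the start of this paragraph, a Horvitz and Thompson total plus a regression adjustment, can be sketched in Python for complete data as follows. This is the textbook form of the GREG estimator with design weights 1 / pi_i, not any of the nonresponse-adjusted estimators reviewed here. The function name, the toy population and the inclusion probabilities are hypothetical.

```python
import numpy as np

def greg_total(y, X, pi, X_total):
    """GREG total: HT total of y plus (X_total - HT total of x)' beta_hat.

    beta_hat is the design-weighted least-squares coefficient.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)        # design weights
    XtWX = (X * d[:, None]).T @ X                # sum of d_i x_i x_i'
    XtWy = (X * d[:, None]).T @ y                # sum of d_i x_i y_i
    beta_hat = np.linalg.solve(XtWX, XtWy)
    ht_y = np.sum(d * y)                         # HT total of y
    ht_x = X.T @ d                               # HT totals of the auxiliaries
    return float(ht_y + (X_total - ht_x) @ beta_hat), beta_hat

# Hypothetical population of 6 units where y = 2 + 3x holds exactly.
x_pop = np.arange(1.0, 7.0)
X_total = np.array([len(x_pop), x_pop.sum()])    # totals of (1, x): [6, 21]

idx = np.array([0, 2, 5])                        # an arbitrary sample of 3 units
X_s = np.column_stack([np.ones(3), x_pop[idx]])
y_s = 2.0 + 3.0 * x_pop[idx]
pi_s = np.array([0.3, 0.5, 0.7])                 # arbitrary inclusion probabilities

estimate, beta_hat = greg_total(y_s, X_s, pi_s, X_total)
print(estimate, beta_hat)
```

When the study variable is an exact linear function of the auxiliary variables, the regression adjustment removes the sampling error entirely and the estimate equals the true total (75 in this toy population), whatever sample and inclusion probabilities are used. This is the sense in which a strong relationship between the study and auxiliary variables improves efficiency.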
Recently, under the same assumptions as the previous work, the ratio method of estimation was applied to create new GREG estimators (Ponkaew and Lawson, 2023). Their estimators are more efficient than the previous ones, with relative bias and root mean square error as the criteria. An application to maize production in Thailand in 2019, based on data from the Office of Agricultural Economics, also showed that their estimators give a smaller variance in estimating the total yield of maize in Thailand, which could help in planning economic policy for Thailand’s agriculture in the future.
A more flexible nonresponse mechanism such as MAR is more practical in realistic situations. Under MAR, an approximately unbiased GREG estimator and its variance estimator under UPWOR have been suggested for less restrictive circumstances: the response probabilities may be known or unknown, the nonresponse mechanism is non-uniform, and the sampling fraction may be small or arbitrary. This type of nonresponse mechanism is called MAR or the ignorable nonresponse mechanism. The less restrictive conditions allow the estimator to supply data that are vital for financial and economic projects in many areas where missingness occurs in the study variable. For example, farm profitability and resilience, which bring in revenue for the country, can be investigated with the GREG estimators by estimating liabilities and net worth using auxiliary variables such as farm type, farm size, region, tenure, and economic performance. Likewise, the GREG estimator can be applied to economic data from the agricultural industry, such as total yield, total profit, and total income, to obtain these values in advance for effective decision making, which can develop economic wealth for the whole nation. Handling missingness appropriately improves the reliability of the data used for planning in Thailand and other countries around the world (Lawson and Siripanich, 2022).
3. Basic setup
The notations and the basic notions under the reverse framework will be introduced. Let be a study variable and a population total of the variable is where and is a population size. Let be an auxiliary variable and the population total of the variable is . The order of the paired ith values of the study variable and auxiliary variable is , . For the ratio estimator, the variable is an auxiliary variable. The auxiliary variables and are used to define the first and joint inclusion probabilities under UPWOR and utilized to construct the ratio estimator respectively. A sample of size is drawn using UPWOR. For selecting the population unit in , the known and nonzero probability is represented by where Let, be the first order inclusion probability and be the second order inclusion probability. Assume that the information of matrix of values or is known for all when . The expectation and variance according to UPWOR sampling are defined as and respectively.
The population total GREG estimator is
where , i = 1, 2, …, n, are the column vectors of the auxiliary variable with , , , and are calculated by the linear assisting model : and that is .
Under nonresponse, and denote the response mechanism and the response indicator variable, respectively.
Let be the response probability shown as Let and be the expectation and variance operators according to the response mechanism, and and be the overall expectation and variance operators, respectively. Therefore, and .
The GREG estimator variance from the reverse framework is
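As a sketch in generic notation, the conditioning argument behind the reverse framework can be written as below. The symbols are assumptions introduced here for illustration: \hat{\theta} stands for any estimator such as a GREG total, S for the sampling design and R for the response mechanism.

```latex
% \hat{\theta}: an estimator (e.g. a GREG total); S: sampling design; R: response mechanism.
\begin{align}
E(\hat{\theta}) &= E_R\{E_S(\hat{\theta} \mid R)\}, \\
V(\hat{\theta}) &= E_R\{V_S(\hat{\theta} \mid R)\} + V_R\{E_S(\hat{\theta} \mid R)\}.
\end{align}
```

The first variance component is the sampling variance given the response pattern, and the second reflects the variability of the response mechanism. The second component is the one that is commonly dropped when the sampling fraction is negligible.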
4. Generalized regression estimators with missing data
Numerous works have investigated the GREG estimators with missing data under the two-phase framework, in which the selected sample is examined in the first phase and the nonresponse in the second phase. Under this framework, the GREG estimator and its variance were studied in the presence of nonresponse (Särndal and Lundström, 2005). They also recommended an automated linearization method for finding the variance of the GREG estimator, in which partial derivatives are not required as they are in the Taylor series linearization (see, e.g. Estevao and Särndal, 2003; Särndal and Lundström, 2005; Särndal, 2007).
A GREG estimator for population total with nonresponse using the two-phase framework is (Särndal and Lundström, 2005)
where ,
The variance of is
where , , , .
When is known for all under the reverse framework, is
where , , , .
When is unknown for all , let be the estimator of , then the estimator of is
where
Apart from the two-phase framework, the reverse framework of Fay (1991) has also been used to study the variance of the GREG estimators, with the order of sample selection and nonresponse reversed. Again, the estimator is nonlinear, so it needs to be transformed into a linear function for variance estimation. Under the reverse framework, a new GREG estimator has been suggested under MCAR, the uniform nonresponse mechanism, where the response probability is constant. Most researchers (Lawson and Ponkaew, 2019; Ponkaew and Lawson, 2023) worked under this assumption because of its simplicity. The new GREG estimator for nonresponse under UPWOR was developed from the estimator of Lawson (2017), a nonlinear, almost unbiased estimator of the population total or mean under probability proportional to size sampling with replacement. The benefit of the Lawson estimator is that the response probability is not required in the estimation, although it assumes that the response probabilities are the same for all units and that the sampling fraction is negligible. Lawson’s (2017) population mean estimator is
When for all units in , then
Additionally, the Lawson (2017) estimator for estimating the population total is
The associated variance estimator for is
The estimated variance of is
The associated variance estimator for the is
and the estimated variance of is
Under the same assumptions, namely that the nonresponse mechanism is MCAR and the sampling fraction is negligible under UPWOR, a new GREG estimator based on the Lawson (2017) estimator has been suggested as follows (Lawson and Ponkaew, 2019).
where , , ,
When the population size is known, the population total GREG estimator is
They also assumed that and as , where is a sequence consisting of positive real numbers. For the GREG estimators’ variance, they considered two situations; replace by then and using the Taylor linearization approach, then
. The estimated variances of these estimators are respectively,
where .
They also studied in theory that and are almost unbiased estimators.
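The idea of an almost unbiased estimator that does not need the response probability under MCAR can be illustrated with a small Monte Carlo sketch in Python. The estimator below is a generic ratio-type (Hajek-style) respondent estimator used as a stand-in; it is not claimed to be the exact estimator of Lawson (2017) or Lawson and Ponkaew (2019). The population, the constant response probability of 0.7 and the use of Poisson sampling (an unequal probability design without replacement with random sample size, chosen here only for simplicity) are all assumptions.

```python
import numpy as np

def respondent_ratio_total(y_resp, pi_resp, N):
    """Ratio-type total from respondents only: a constant response
    probability cancels between numerator and denominator."""
    w = 1.0 / pi_resp
    return N * np.sum(w * y_resp) / np.sum(w)

def simulate_relative_bias(n_rep=2000, seed=1):
    rng = np.random.default_rng(seed)
    N = 500
    size = rng.uniform(1.0, 5.0, N)                  # hypothetical size measure
    y = 10.0 + 4.0 * size + rng.normal(0.0, 2.0, N)  # hypothetical study variable
    pi = 100 * size / size.sum()                     # expected sample size 100
    p = 0.7                                          # constant response probability (MCAR)
    Y = y.sum()
    estimates = np.empty(n_rep)
    for b in range(n_rep):
        sampled = rng.random(N) < pi                 # Poisson sampling
        responded = sampled & (rng.random(N) < p)    # uniform nonresponse
        estimates[b] = respondent_ratio_total(y[responded], pi[responded], N)
    return (estimates.mean() - Y) / Y

relative_bias = simulate_relative_bias()
print(relative_bias)
```

The relative bias is small but not exactly zero, which is what "almost unbiased" means for a ratio-type estimator: the bias vanishes as the sample size grows.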
Later, a new GREG estimator derived from the ratio method was proposed, building on the work of Lawson and Ponkaew (2019) under the same MCAR assumption but extended to cover the situation where the sampling fraction is large and therefore cannot be neglected (Ponkaew and Lawson, 2023). They also covered the cases where the response probabilities are known and unknown, making use of a known auxiliary variable under nonresponse. Usually under the reverse framework the second component of the variance is omitted, but they considered the case in which this component cannot be ignored. Therefore, . Again, they used the automated linearization approach to transform the into a less complex form. They made three assumptions in their study; the response mechanism is MCAR, , and as where or .
Their GREG estimators for population mean and total are respectively,
where , ,
Under the reverse framework the can be gained by,
where ,
The variance of Ponkaew and Lawson (2023) are
- (1)
is
- (2)
is
- (1)
The estimators of are
where , ,
- (2)
The estimators of are
where ,
Unfortunately, the works mentioned above rely on the strong assumption that the nonresponse mechanism is MCAR, where the response probability is constant. Novel GREG estimators for the population mean and total under the more flexible and more practical situation where nonresponse is missing at random (MAR) were proposed based on the previous works, using a known auxiliary variable to improve the efficiency of the estimators (Lawson and Siripanich, 2022). In their study, they assumed that, : as , where is a sequence of positive real numbers and and as , and they considered both negligible and non-negligible sampling fractions.
The Lawson and Siripanich (2022) estimators are
where , ,
Because the estimators are nonlinear, they suggested two techniques for variance estimation, called the modified automated linearization approaches. They suggested replacing by in their estimators and used the Taylor linearization approach to transform the nonlinear estimator into a linear form.
Their variance estimators are
The estimators of are
where , ,
is the estimator of for all , , if is known for all otherwise , if is known for all otherwise .
The estimators of are
where
These GREG estimators can be calculated with any statistical package, e.g. the R program, which was used in the reviewed studies. Because these GREG estimators for missing data under unequal probability sampling are new, there is unfortunately no ready-made function in R for them. However, they are not complex to implement.
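To show how little code such an estimator needs, here is a minimal Python sketch of a generic response-adjusted GREG written in calibration-weight form. It inflates the design weight of each respondent by the inverse of a response probability and then calibrates to known auxiliary totals. It is an illustration of the general idea only, not the specific estimators of the reviewed papers, and the function name, data, inclusion probabilities, response probabilities and auxiliary totals are all hypothetical.

```python
import numpy as np

def adjusted_greg_weights(X_resp, pi_resp, p_resp, X_total):
    """Calibrated weights for respondents.

    Starting weights are 1 / (pi_i * p_i); the g-factor makes the weighted
    auxiliary totals reproduce X_total exactly.
    """
    X_resp = np.asarray(X_resp, dtype=float)
    d = 1.0 / (np.asarray(pi_resp, dtype=float) * np.asarray(p_resp, dtype=float))
    T = (X_resp * d[:, None]).T @ X_resp             # sum of d_i x_i x_i'
    lam = np.linalg.solve(T, X_total - X_resp.T @ d)
    g = 1.0 + X_resp @ lam                           # calibration factors
    return d * g

# Hypothetical sample of 6 units; NaN marks a missing study value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([5.0, np.nan, 11.0, 14.0, np.nan, 20.0])
pi = np.array([0.10, 0.20, 0.15, 0.25, 0.20, 0.30])
p_hat = np.array([0.80, 0.60, 0.70, 0.90, 0.50, 0.75])
X_total = np.array([40.0, 150.0])                    # assumed known totals of (1, x)

resp = ~np.isnan(y)
X = np.column_stack([np.ones(resp.sum()), x[resp]])
w = adjusted_greg_weights(X, pi[resp], p_hat[resp], X_total)
total = float(w @ y[resp])
print(w, total)
```

Because the observed values in this toy example satisfy y = 2 + 3x exactly, the weighted total equals 2 × 40 + 3 × 150 = 530 regardless of the response probabilities supplied, which gives a quick way to check an implementation.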
5. Examples of application to financial and economic data
The GREG estimator was applied to estimate the total monthly household income of five communities in Bang Sue district, Bangkok, Thailand (Lawson and Siripanich, 2022). The results were based on a sample of 195 households drawn by UPWOR with Midzuno's (1952) scheme from 1,181 households, with 30% nonresponse in monthly income. Monthly expenditure, age and working hours per week were used as the auxiliary variables to assist in estimating the total income and its variance. A logistic regression model with age as the covariate was used to estimate the unknown response probabilities.
Their results showed that the suggested GREG estimator gave an estimated total income for all households of 36,068,543 baht, with smaller variances than the Särndal and Lundström (2005) estimator.
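The response-probability step can be sketched in Python as a logistic regression of the response indicator on age, fitted by Newton-Raphson. The ages and response indicators below are invented for illustration and are not the Bang Sue data; the function name and the choice to standardize age are also assumptions.

```python
import numpy as np

def fit_response_probabilities(x, r, n_iter=25):
    """Fit logit P(r = 1) = b0 + b1 * x by Newton-Raphson and return
    the fitted response probabilities for each unit."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    z = (x - x.mean()) / x.std()                     # standardize for stability
    X = np.column_stack([np.ones_like(z), z])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (r - p)                         # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-X @ beta))

# Hypothetical ages and response indicators (1 = income reported).
age = np.array([20, 30, 40, 50, 60, 70, 25, 35, 45, 55], dtype=float)
responded = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1], dtype=float)

p_hat = fit_response_probabilities(age, responded)
print(np.round(p_hat, 3))
```

The fitted probabilities can then replace the unknown response probabilities in a weighted estimator. With an intercept in the model, their average equals the observed response rate, which is a simple check on the fit.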
Data on total monthly household income are key to understanding a core part of a country’s economy. Information on the financial status of citizens sheds light on money flow in the economy and provides valuable insights for designing policies to overcome economic inequalities. Estimating these statistics allows policymakers to identify income disparities within the nation and to introduce measures that promote equality and stabilize the economy, improving quality of life in many respects.
Another example comes from Thailand’s agriculture, one of the sources of income that support Thailand’s economy (Ponkaew and Lawson, 2023). Data on maize in Thailand in 2019 from the Office of Agricultural Economics were studied, based on a sample of 25 provinces selected out of 63 provinces using the UPWOR scheme of Midzuno (1952). The data contained a 30% nonresponse rate. The total yield of maize for all provinces in Thailand in 2019 was estimated with their suggested GREG estimator, using the cultivated area and the harvested area in 2019 as the auxiliary variables and the cultivated area in 2018 as the size variable. The estimated total yield of maize for all provinces in Thailand was 525,124, with a smaller variance than the existing estimator.
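The Midzuno (1952) scheme used in both applications selects the first unit with probability proportional to a size measure and the remaining n - 1 units by simple random sampling without replacement. The Python sketch below draws such a sample and computes the first-order and joint inclusion probabilities implied by the scheme. The size measures are randomly generated placeholders, not the cultivated areas from the study, and the function names are hypothetical.

```python
import numpy as np

def midzuno_inclusion_probs(size, n):
    """First-order and joint inclusion probabilities under the Midzuno scheme."""
    size = np.asarray(size, dtype=float)
    N = len(size)
    p = size / size.sum()                            # first-draw probabilities
    pi = p * (N - n) / (N - 1) + (n - 1) / (N - 1)
    pair = (n - 1) / (N - 1) * (
        (N - n) / (N - 2) * (p[:, None] + p[None, :]) + (n - 2) / (N - 2)
    )
    np.fill_diagonal(pair, pi)                       # diagonal holds pi_i
    return pi, pair

def midzuno_sample(size, n, rng):
    """Draw one Midzuno sample: one PPS draw, then SRSWOR for the rest."""
    size = np.asarray(size, dtype=float)
    N = len(size)
    first = rng.choice(N, p=size / size.sum())
    remaining = np.delete(np.arange(N), first)
    others = rng.choice(remaining, size=n - 1, replace=False)
    return np.sort(np.concatenate(([first], others)))

rng = np.random.default_rng(2019)
size = rng.uniform(1.0, 10.0, 63)                    # placeholder size measures
pi, pi_joint = midzuno_inclusion_probs(size, n=25)
sample = midzuno_sample(size, n=25, rng=rng)
print(sample)
print(round(pi.sum(), 6))
```

The first-order probabilities sum to the sample size, and because the joint probabilities are available in closed form, this scheme is convenient for variance estimators that require them.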
Statistical estimation of agricultural yield is imperative for agricultural countries such as Thailand and a large part of Asia. Agriculture has long been part of these nations’ histories, as their geography and climate favor the successful growing of crops. Today, exports are one of the major sources of income, and a large amount of land is used for farming. Farmers are often short on resources and must go to great lengths to save time and money, to ensure that their yields bring in profit and not losses. Predicting crop yields can help policymakers working with farmers to anticipate food shortages, losses, and the potential risks of farming strategies. As many countries depend on agriculture, accurate estimation of yields is an essential component of their economies.
6. Conclusions and discussions
We can see that the GREG estimators are useful for estimating financial and economic data in Thailand and in other countries. Most such data contain nonresponse, which usually arises during data collection and needs to be taken care of to gain more accuracy. The reviewed GREG estimators for missing data under the reverse framework can benefit the estimation process when applied to real data, e.g. household income, business revenue, and inflation and unemployment rates.
The GREG estimators have been studied under the MCAR and MAR nonresponse mechanisms, both when the sampling fraction is small and can be neglected and when it is large and cannot be omitted. These GREG estimators are almost unbiased, with reduced variance compared with the existing estimators. Their variance estimators are useful for constructing the lower and upper bounds of plausible values for the variable of interest based on survey sampling. A smaller variance from the GREG estimators gives narrower, more precise confidence intervals for financial and economic data.
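The link between an estimated variance and the bounds mentioned above can be sketched with a normal-approximation confidence interval in Python. The estimate of 100 and the variance of 4 are placeholder numbers, not results from the reviewed studies, and the normal approximation itself is an assumption that is usually reasonable for large samples.

```python
import math
from statistics import NormalDist

def normal_confidence_interval(estimate, variance, level=0.95):
    """Two-sided interval: estimate +/- z * sqrt(estimated variance)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# Placeholder estimate and estimated variance.
lower, upper = normal_confidence_interval(100.0, 4.0)
print(lower, upper)
```

Halving the estimated variance shrinks the interval width by a factor of about 0.71 (the square root of one half), which is how a more efficient estimator translates into tighter bounds.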
The GREG estimators can assist in estimating these data, and knowing them helps in planning national policies that increase the value of business and finance in the future. Economic stability can only be maintained through policies and efficient decisions supported by accurate statistical estimation of financial and economic data. Flexible statistical methods can monitor and predict economic trends, employment figures, and inflation rates, which benefits policymakers, economists, and investors. Most crucial is introducing suitable policies to tackle the nation’s financial issues and fill economic niches, for the well-being of the population through sustainable economic growth.
The GREG estimators can be applied in further studies to survey designs other than UPWOR, for instance stratified cluster sampling and cluster sampling, where nonresponse occurs in the study variable, and can assist in any application to real data.
Many thanks to Prof. Sa-Aat Niwitpong and Prof. Hung Nguyen for recommending the Asian Journal of Economics and Banking.
