This study aims to examine the performance of various goodness-of-fit (GoF) tests in assessing the normality of the data set, a crucial step in conducting probabilistic analyses in the geotechnical domain. The evaluation focuses on the efficacy of these tests when applied to small sample sizes and data sets with varying coefficient of variation (CoV). Identifying the most efficient GoF test based on the statistical characteristics of the data can enhance the reliability of results and minimise the risk of misleading conclusions.
Multiple GoF tests, including Shapiro–Wilk (S-W), Lilliefors (LL), Anderson–Darling (A-D), Jarque–Bera (J-B), chi-square (CSQ), Cramér–von Mises (CVM) and D’Agostino and Pearson Omnibus (DP) tests, were used for normality assessment. A computational power analysis was performed through Monte Carlo random sample simulation to determine the optimal sample size required to achieve the desired statistical power. Furthermore, the performance and sensitivity of each GoF test were assessed systematically by varying the sample size and effect size (d) to establish the relationship between the minimum required sample size and CoV.
Power analysis revealed that the S-W test was the most effective in detecting departures from normality, followed by the A-D, J-B, LL and DP tests. The degree of skewness and the CoV of the data sets play a crucial role in optimising the sample size requirements. For the S-W test, the minimum required sample size varies with the CoV as follows: (a) CoV < 10% requires at least 665 samples, (b) 10% ≤ CoV < 30% requires 665–145 samples, (c) 30% ≤ CoV < 50% requires 145–70 samples, (d) 50% ≤ CoV < 100% requires 70–25 samples, and (e) CoV ≥ 100% requires at most 25 samples. In comparison, the CSQ and CVM tests demand substantially larger minimum sample sizes, ranging approximately between 115 and 4,400.
This study presents a comparative analysis of GoF tests applied to geotechnical data, determining the required sample size through power analysis, with a target statistical power of 0.8 at a chosen significance level of 0.05. These findings provide practical guidance for selecting appropriate normality tests and the required minimum sample size for geotechnical data with varying CoV.
1. Introduction
In civil engineering practice, uncertain conditions are common and complicate design owing to factors such as material heterogeneity, characterisation methods, construction practices and environmental conditions (Phoon and Kulhawy, 1999; Baecher and Christian, 2005; IRC SP-11, 2012; Zbiciak et al., 2025). Particularly in the geotechnical domain, the stochastic nature of geomaterial formation, in-situ stress history, long-term exposure to various climatic and environmental conditions, and weathering result in inherent heterogeneity and spatial variability (Phoon and Kulhawy, 1999; Fenton and Griffiths, 2008; Cherubini and Vessia, 2010; Wang et al., 2015; Liu et al., 2024). Furthermore, data collection introduces additional challenges, including sampling bias, limited accessibility and variations in testing methods, which compound measurement errors and data inconsistencies (Akbas and Kulhawy, 2010; Zhao et al., 2013; Li et al., 2021; Kumar et al., 2023). Therefore, the growing interest in accounting for this variability necessitates the application of statistical methods in geotechnical design to represent soil behaviour, quality control levels and performance assessment accurately (Fenton and Griffiths, 2008; Phoon and Ching, 2018; Harle and Wankhade, 2025).
Fitting an appropriate probability model (using inferential statistics) is one way to model the uncertainty in the parameter of interest. For instance, normal and lognormal distributions are commonly used in science and engineering practice due to their ability to represent a wide range of inherent variability, making them easy to implement for practical applications (Theocharis et al., 2024). In the context of geotechnical practice, the index and engineering properties like Atterberg limits, compaction characteristics, compressibility parameters, hydraulic conductivity, degree of saturation, shear strength parameter and so on, are often fitted to either normal or lognormal distribution (Elkateb et al., 2003; Baecher and Christian, 2005; Zhai and Benson, 2006; Wang et al., 2015; Galeandro et al., 2017; Toraldo et al., 2018; Nguyen et al., 2023; Theocharis et al., 2024; Ganesh and Kumar, 2025). Several reliability-based approaches assume that geotechnical parameters such as shear strength, consolidation characteristics and permeability follow a normal or lognormal distribution (Phoon et al., 1995; Nowak and Collins, 2000; Baecher and Christian, 2005; Fenton and Griffiths, 2008; Saseendran and Dodagoudar, 2020). However, an incorrect selection of the underlying probability model can distort reliability estimates, leading to either underestimation or overestimation of the probability of failure.
Furthermore, the statistical techniques commonly employed in material quality control include parametric tests such as one-sample t-tests, two-sample t-tests, analysis of variance (ANOVA) and the F-test of homogeneity of variance, as well as regression and correlation analyses, all of which are routinely applied in practice (Burati et al., 2003; IRC SP-11, 2012; Ganesh and Kumar, 2024). A fundamental assumption underlying these parametric tests is that the sampled data originate from a normally (Gaussian) distributed population (Montgomery and Runger, 2010; Ott and Longnecker, 2015). However, recent observations by Phoon et al. (2023) highlight growing concerns among field practitioners, who insist that many statistical models assume that geotechnical data are homogeneous, independent, sufficient and normally distributed, which makes the analysis unrealistic. This gap between the statistical assumptions and field behaviour underscores the importance of conducting appropriate goodness-of-fit (GoF) tests.
Normality assessment methods can be broadly categorised into graphical approaches, numerical measures and formal statistical tests (D’Agostino and Stephens, 1986; Thode, 2002; Henderson, 2006; Razali and Wah, 2010). In recent years, researchers have also explored the use of machine learning algorithms as complementary tools in assessing the distributional characteristics of the data. Studies by Sigut et al. (2006), Lee et al. (2019) and Simić (2021) demonstrated that neural network and deep learning models can effectively identify non-normal patterns in complex data sets. However, there remains a lack of comprehensive studies that explicitly examine the normality characteristics of laboratory/field-derived geotechnical data. Given the proven reliability and widespread adoption of formal statistical tests, this study focuses on the application of the most frequently cited and commonly adopted GoF tests within the geotechnical domain, as discussed in detail in the following paragraph.
The formal statistical tests (also known as GoF tests) can be classified into the following categories based on their underlying statistical principles:
empirical distribution function tests (Anderson–Darling (A-D) test, Kolmogorov–Smirnov (K-S) test, Lilliefors (LL) test, Cramér–von Mises (CVM) test, etc.);
regression and correlation-based tests (Shapiro–Wilk (S-W) test, Ryan–Joiner test, etc.);
chi-square (CSQ) test; and
moment-based tests (Jarque–Bera (J-B) test, D’Agostino and Pearson (DP) omnibus test, etc.).
Among these tests, a few selected GoF tests have been used most frequently to identify the appropriate probability model for laboratory and in-situ geotechnical data. For example, Benson (1993) used the K-S test to determine the best-fitting probability model for a data set comprising the permeability of remoulded soil samples, with CoV values ranging between 27% and 767%. Previous studies by Haldar and Mahadevan (2000), Burati et al. (2003), Baecher and Christian (2005), Ang and Tang (2007), Fenton and Griffiths (2008), Galeandro et al. (2017) and Theocharis et al. (2024) identified the CSQ, K-S and A-D tests as the most frequently used methods for assessing the normality of geotechnical data sets. Similarly, Phoon and Kulhawy (1999), Baecher and Christian (2005), Akbas and Kulhawy (2010), Galeandro et al. (2017), Toraldo et al. (2018), Nguyen et al. (2023) and Theocharis et al. (2024) documented CoV values for geotechnical properties ranging from 2% to 105%, except for permeability, which may exhibit high CoV values (200%–300%).
Furthermore, Uzielli et al. (2007) recommended the S-W test over the K-S test for geotechnical applications, citing Thode’s (2002) findings, which were not explicitly focused on geotechnical properties. More recently, Somani et al. (2019) employed the S-W test to evaluate the normality of soil-like fractions reclaimed from landfill mining. However, Ganesh and Kumar (2025) applied the K-S test to assess whether the index and engineering properties of processed geomaterials (pond ash-bentonite mixtures) followed a normal or lognormal distribution. Although various GoF tests, such as K-S, CSQ, S-W and A-D, are commonly applied, the efficacy of these tests for geotechnical data has not been addressed or is unknown.
From a general statistical perspective, Ramachandran and Tsokos (2009), Montgomery and Runger (2010) and Ott and Longnecker (2015) suggested that a sample size of 30 or more is typically sufficient for the distribution of the sample mean to be approximately normal, in accordance with the central limit theorem. However, they also reported that this threshold is not fixed because it is derived from extensive simulation studies and may vary depending on the degree of skewness of the data. In addition, the normality of the data should be examined carefully when the sample size is ≤30. For instance, Tang et al. (2017) found that the minimum sample size required to identify the most suitable probability model for geotechnical data lies in the range of 54–458, using Akaike Information Criterion scores (one of the techniques for assessing distribution fit). Their analysis considered data with CoV varying from 10% to 30%. However, in geotechnical engineering practice, constraints on obtaining more than 30 samples are not uncommon, as most studies have noted (Wang et al., 2015; Tang et al., 2017).
Earlier studies by Yazici and Yolacan (2007) and Yap and Sim (2011) optimised the sample size for assessing the performance of the selected GoF tests using power analysis. However, these investigations primarily relied on synthetic data and did not account for the actual variability observed in geotechnical properties. Therefore, the present study aims to evaluate the efficacy of commonly used GoF tests in assessing the normality of geotechnical data sets with diverse statistical characteristics (CoV = 6%, 7%, 12%, 15%, 17%, 19%, 29%, 37%, 94% and 196%). To ensure the accuracy of statistical decisions, sample size optimisation is performed using power analysis, allowing for an effective comparison of the performance and power of each GoF test. Furthermore, to evaluate the test sensitivity under controlled deviations from normality, skewness is artificially induced into the original data using a nonlinear transformation. Based on the findings, a classification framework is proposed to recommend the minimum sample size requirement as a function of the CoV, thereby improving the practical applicability of normality testing in geotechnical engineering.
2. Methodology
2.1 Experimentally derived geotechnical data
This study evaluates multiple data sets encompassing natural geomaterials, soil-like fractions from legacy waste deposits, and processed geomaterials to assess their normality, as summarised in Table 1. Natural geomaterial samples were collected from depths of 0.6–1.0 m at the highway construction sites located in the Karaikal and Nagapattinam regions of Puducherry and Tamil Nadu, India. Fifty field samples were collected from natural ground earth, which are classified as Silty Sand (SM) and Clayey Sand (SC) based on the Unified Soil Classification System (USCS), as reported in Table 2. The key geotechnical properties evaluated from these samples included the liquid limit (data set 1), plastic limit (data set 2), compaction characteristics (data sets 3 and 4) and California Bearing Ratio (CBR) (data set 5), following the guidelines of ASTM D1557 (2015), ASTM D698 (2021), ASTM D4318 (2017) and ASTM D1883 (1999).
Table 1. Summary of experimentally derived geotechnical properties evaluated for normality
| S. no. | Natural geomaterial: liquid limit (%) | Natural geomaterial: plastic limit (%) | Natural geomaterial: dry density (Mg/m3) | Natural geomaterial: moisture content (%) | Natural geomaterial: CBR (%) | Legacy waste: % of fines (< 75 µm) | Pond ash-bentonite mixture: dry density (Mg/m3) | Pond ash-bentonite mixture: moisture content (%) | Pond ash-bentonite mixture: UCS (kPa) | Pond ash-bentonite mixture: permeability (m/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 48 | 23 | 15.18 | 8.0 | 6.10 | 5 | 1.62 | 19.54 | 37.7 | 3.50E-07 |
| 2 | 39 | 16 | 16.06 | 9.0 | 20.0 | 6 | 1.65 | 21.62 | 29.7 | 3.70E-07 |
| 3 | 24 | 14 | 16.06 | 9.0 | 20.6 | 9 | 1.67 | 18.52 | 48.1 | 5.80E-07 |
| 4 | 25 | 16 | 16.06 | 10.0 | 8.10 | 9 | 1.67 | 19.56 | 95.2 | 4.10E-08 |
| 5 | 26 | 20 | 16.2 | 10.0 | 14.4 | 9 | 1.69 | 19.85 | 108.5 | 2.60E-08 |
| 6 | 34 | 19 | 16.2 | 10.0 | 10.9 | 9 | 1.71 | 19.47 | 123.3 | 4.00E-08 |
| 7 | 52 | 24 | 16.2 | 10.5 | 4.20 | 10 | 1.71 | 19.22 | 182.6 | 4.30E-09 |
| 8 | 42 | 21 | 16.35 | 11.0 | 6.00 | 11 | 1.72 | 16.89 | 165.1 | 1.20E-09 |
| 9 | 50 | 18 | 16.35 | 11.0 | 20.9 | 11 | 1.72 | 19.35 | 136.7 | 7.20E-09 |
| 10 | 43 | 16 | 16.57 | 11.0 | 14.3 | 12 | 1.73 | 19.87 | 73.5 | 2.60E-07 |
| 11 | 30 | 18 | 16.64 | 11.0 | 17.1 | 12 | 1.74 | 16.42 | 55.3 | 1.00E-07 |
| 12 | 34 | 18 | 16.64 | 11.0 | 13.6 | 13 | 1.75 | 19.58 | 49.9 | 8.70E-08 |
| 13 | 38 | 17 | 17.07 | 11.0 | 12.1 | 13 | 1.75 | 18.93 | 137.6 | 9.70E-09 |
| 14 | 34 | 18 | 17.07 | 11.0 | 12.9 | 13 | 1.77 | 17.75 | 145 | 1.90E-08 |
| 15 | 40 | 17 | 17.22 | 11.5 | 13.8 | 13 | 1.77 | 17.79 | 169.6 | 7.50E-09 |
| 16 | 39 | 20 | 17.22 | 11.5 | 4.10 | 14 | 1.77 | 18.51 | 228.2 | 2.50E-09 |
| 17 | 37 | 18 | 17.37 | 12.0 | 10.8 | 14 | 1.77 | 19.85 | 204.1 | 1.40E-09 |
| 18 | 42 | 17 | 17.51 | 12.0 | 13.0 | 14 | 1.79 | 18.06 | 263.7 | 4.20E-10 |
| 19 | 32 | 17 | 17.51 | 12.0 | 8.90 | 14 | 1.8 | 14.27 | 90.5 | 1.10E-07 |
| 20 | 34 | 17 | 17.51 | 12.0 | 13.9 | 14 | 1.82 | 17.99 | 77.5 | 5.40E-08 |
| 21 | 30 | 18 | 17.51 | 12.0 | 4.30 | 14 | 1.82 | 15.97 | 106.3 | 7.20E-08 |
| 22 | 25 | 17 | 17.51 | 12.0 | 13.8 | 14 | 1.82 | 14.88 | 235.1 | 4.40E-09 |
| 23 | 35 | 17 | 17.66 | 12.0 | 17.1 | 14 | 1.83 | 15.52 | 230.2 | 2.50E-09 |
| 24 | 43 | 16 | 17.66 | 12.5 | 13.3 | 15 | 1.84 | 14.21 | 284.2 | 7.60E-09 |
| 25 | 38 | 21 | 17.66 | 12.5 | 18.7 | 15 | 1.84 | 13.87 | 367.8 | 6.40E-10 |
| 26 | 31 | 19 | 17.8 | 12.5 | 19.5 | 15 | 1.9 | 13.87 | 376 | 3.40E-10 |
| 27 | 22 | 16 | 17.8 | 12.5 | 6.10 | 15 | 1.92 | 15.25 | 475.2 | 2.60E-10 |
| 28 | 30 | 15 | 17.8 | 12.5 | 20.3 | 15 | 1.94 | 12.61 | 168.8 | 8.40E-08 |
| 29 | 40 | 16 | 18.09 | 12.5 | 14.8 | 16 | 1.94 | 12.21 | 106.8 | 5.40E-08 |
| 30 | 28 | 15 | 18.09 | 12.5 | 18.3 | 16 | 1.94 | 12.66 | 139.6 | 2.00E-08 |
| 31 | 41 | 17 | 18.24 | 12.5 | 17.8 | 16 | 1.94 | 12.79 | 425.9 | 2.20E-09 |
| 32 | 41 | 12 | 18.24 | 13.0 | 13.4 | 16 | 1.95 | 14.34 | 385.4 | 4.80E-09 |
| 33 | 33 | 13 | 18.38 | 13.0 | 13.3 | 16 | 1.97 | 13.23 | 362.6 | 1.70E-09 |
| 34 | 30 | 13 | 18.38 | 13.0 | 14.1 | 16 | 1.97 | 12.70 | 881.6 | 2.30E-10 |
| 35 | 30 | 19 | 18.53 | 13.0 | 14.8 | 17 | 1.98 | 14.03 | 974.9 | 1.10E-10 |
| 36 | 37 | 15 | 18.53 | 13.0 | 33.8 | 18 | 1.99 | 13.68 | 786.7 | 3.00E-10 |
| 37 | 20 | 16 | 18.53 | 13.0 | – | 18 | – | – | – | – |
| 38 | 22 | 18 | 18.53 | 13.0 | – | 20 | – | – | – | – |
| 39 | 24 | 21 | 18.82 | 13.0 | – | 21 | – | – | – | – |
| 40 | 32 | – | 18.82 | 13.0 | – | 21 | – | – | – | – |
| 41 | 43 | – | 19.11 | 13.0 | – | 22 | – | – | – | – |
| 42 | 45 | – | 19.11 | 13.0 | – | 22 | – | – | – | – |
| 43 | 42 | – | 19.11 | 13.0 | – | 23 | – | – | – | – |
| 44 | 41 | – | 19.26 | 13.0 | – | 24 | – | – | – | – |
| 45 | – | – | 19.26 | 13.5 | – | – | – | – | – | – |
| 46 | – | – | 19.26 | 13.5 | – | – | – | – | – | – |
| 47 | – | – | 19.26 | 14.0 | – | – | – | – | – | – |
| 48 | – | – | 20.13 | 14.5 | – | – | – | – | – | – |
| 49 | – | – | 21.15 | 15.0 | – | – | – | – | – | – |
| 50 | – | – | 21.88 | 15.5 | – | – | – | – | – | – |
CBR = California Bearing Ratio; UCS = unconfined compressive strength
Table 2. Summary of descriptive statistics of the geotechnical data used for normality testing
| Description | Properties | Soil type | Sample size (n) | Average | SD | CoV (%) | Skewness (b1) | Kurtosis (b2) |
|---|---|---|---|---|---|---|---|---|
| Data set 1 | Liquid limit (%) | SM, SC | 44 | 35.1 | 7.83 | 22.3 | −0.010 | −0.613 |
| Data set 2 | Plastic limit (%) | SM, SC | 39 | 17.4 | 2.56 | 14.7 | 0.388 | 0.596 |
| Data set 3 | Dry density (Mg/m3) | SM, SC | 50 | 1.821 | 0.14 | 7.4 | 0.605 | 0.896 |
| Data set 4 | Moisture content (%) | SM, SC | 50 | 12.1 | 1.47 | 12.2 | −0.490 | 0.802 |
| Data set 5 | California bearing ratio (%) | SM, SC | 35# | 13.3 | 4.91 | 37.0 | −0.378 | −0.613 |
| Data set 6 | Fines (%) | SM* | 44 | 14.6 | 4.28 | 29.2 | 0.147 | 0.141 |
| Data set 7 | Dry density (Mg/m3) | PA-Bt mixtures | 36 | 1.811 | 0.11 | 5.9 | 0.192 | −1.123 |
| Data set 8 | Moisture content (%) | PA-Bt mixtures | 36 | 16.5 | 2.78 | 16.8 | −0.034 | −1.454 |
| Data set 9 | Unconfined compressive strength (kPa) | PA-Bt mixtures | 36 | 242.5 | 228.71 | 94.3 | 1.954 | 3.665 |
| Data set 10 | Permeability (m/s) | PA-Bt mixtures | 36 | 6.4E-08 | 1.27E-07 | 196.8 | 2.814 | 8.098 |
*Soil-like fraction from legacy waste was classified as equivalent to SM; PA-Bt = pond ash-bentonite; # sample size after removing the outlier
Soil-like fractions were collected from a municipal solid waste open dumpsite in Karaikal, Puducherry. Forty-four samples collected from depths ranging from 1.0 to 4.0 m were analysed for particle size distribution, focusing on the percentage of fines (silt and clay-sized fractions) passing the 75 µm sieve. Based on the USCS classification (ASTM D2487, 2017), the soil-like fraction was identified as equivalent to Silty Sand (SM). The fines content (%) was designated as data set 6.
The processed geomaterial was prepared by blending pond ash with bentonite at inclusion rates of 10%, 20% and 30%. It was then assessed for its suitability in landfill liner and cover system applications. This data set is reported as a laboratory-derived geotechnical data set. Pond ash was sourced from the ash discharge point at the Neyveli Thermal Power Station, Tamil Nadu, and the bentonite was obtained from a commercial supplier. A total of 36 placement conditions, defined by variations in dry density and moisture content, were designated as data sets 7 and 8, respectively. In addition, the unconfined compressive strength (ASTM D2166, 2016) and permeability (ASTM D5084, 2016) of the processed geomaterials were determined and designated as data sets 9 and 10, respectively. The geotechnical properties determined from laboratory testing were subsequently used for statistical normality analysis.
2.2 Selected goodness-of-fit test for normality evaluation
Among the GoF tests, this study uses the K-S test with Lilliefors’ correction, commonly referred to as the LL test, because the classical K-S test assumes that the mean and variance are known before the analysis. In contrast, these parameters are invariably estimated from sample data in geotechnical applications. Alongside the LL test, analyses include the A-D, CSQ and S-W tests, which are most commonly used for normality assessment in geotechnical data sets. The present study further incorporates J-B, DP and CVM tests, which involve relatively straightforward computational procedures. A detailed description of these normality tests, along with their respective rejection criteria, is summarised in Table 3. The LL test measures the maximum deviation between the empirical cumulative distribution function and the hypothesised cumulative distribution function, calculated based on the sample data. The test is known to be less sensitive in the tails. The A-D test adjusts for this limitation by assigning greater weight to tail discrepancies (Stephens, 1974). The CVM test offers a balanced approach, integrating squared differences across the entire data range to provide a meaningful assessment of the distribution.
Table 3. Overview of selected goodness-of-fit tests, associated test statistics and corresponding rejection criteria
| S. no. | Test of normality | Test statistic | Criteria for rejection |
|---|---|---|---|
| 1 | Lilliefors test (Lilliefors, 1967) | $D_n = \max \lvert F_X(x) - S_n(x) \rvert$; $S_n$ = cumulative frequency function; $F_X$ = assumed theoretical CDF; $D_n$ = maximum difference | If $D_n < D_n^{\alpha}$ (the critical value), the proposed theoretical distribution is acceptable at the specified significance level α |
| 2 | J-B test (Jarque and Bera, 1987) | $JB = \frac{n}{6}\left[b_1^2 + \frac{(b_2 - 3)^2}{4}\right]$; $b_1$ = skewness; $b_2$ = kurtosis | The test statistic is compared with 5.9915 |
| 3 | A-D test (Anderson and Darling, 1954) | $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F_X(x_i) + \ln\left(1 - F_X(x_{n+1-i})\right)\right]$; $n$ = number of samples; $a_\alpha$, $b_0$, $b_1$ are constants | The test statistic is compared with the related critical value |
| 4 | DP test (D’Agostino and Pearson, 1973) | $K^2 = Z^2(b_1) + Z^2(b_2)$; $Z(b_1)$, $Z(b_2)$ = normal-transformed skewness and kurtosis | The p-value of the DP statistic is compared with the significance level (α) |
| 5 | CSQ test (Pearson, 1900) | $\chi^2 = \sum_{j} \frac{(O_j - E_j)^2}{E_j}$; $O_j$ = observed values; $E_j$ = expected values | The CSQ value is compared with 5.991 |
| 6 | CVM test (Cramer, 1928; Smirnov, 1936; Stephens, 1974) | $W^2 = \frac{1}{12n} + \sum_{i=1}^{n}\left[\Phi(Z_i) - \frac{2i-1}{2n}\right]^2$; $Z_i$ = standardised (ordered) observation | The calculated CVM statistic is compared with the critical value at a chosen significance level |
| 7 | S-W test (Shapiro and Wilk, 1965) | $W = \frac{b^2}{(n-1)s^2}$; $b = a_1(x_n - x_1) + a_2(x_{n-1} - x_2) + \ldots$; $a_1, a_2, \ldots$ are coefficients based on sample size; $s$ = standard deviation of sample data | The calculated W is compared with the tabulated W |
Moment-based tests, including the J-B test (Jarque and Bera, 1987) and the DP omnibus test (D’Agostino and Pearson, 1973), assess the distributional characteristics based on measures of skewness and kurtosis. In addition, the CSQ test (Pearson, 1900), a classical frequency-based method, was also employed. The S-W test (Shapiro and Wilk, 1965), which leverages order statistics and variance ratios, was included due to its strong performance in small sample analyses. This comparative evaluation aims to provide insights into the performance of these GoF tests under different probabilistic models, thereby facilitating the selection of an appropriate test for geotechnical engineering applications.
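To make the three empirical-distribution statistics described in this section concrete, the following is a minimal Python sketch (standard library only) that computes the Lilliefors maximum deviation, the Anderson–Darling statistic and the Cramér–von Mises statistic for a normal fit whose mean and standard deviation are estimated from the sample. It is an illustration under stated assumptions, not the MATLAB implementation used in this study: the function name `edf_statistics` is hypothetical, the formulas are the standard textbook forms, and no critical values or p-values are computed. The input is the fines content of data set 6 as listed in Table 1.

```python
import math
from statistics import NormalDist, fmean, stdev

# Fines content (%) of the legacy-waste samples (data set 6, Table 1).
FINES = [5, 6, 9, 9, 9, 9, 10, 11, 11, 12, 12, 13, 13, 13, 13,
         14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15,
         16, 16, 16, 16, 16, 16, 17, 18, 18, 20, 21, 21, 22, 22, 23, 24]


def edf_statistics(sample):
    """Return (D_n, A2, W2) for a normal fit with estimated mean and SD.

    Hypothetical helper for illustration; uses 0-based indices, so the
    textbook weight (2i - 1) becomes (2i + 1).
    """
    x = sorted(sample)
    n = len(x)
    fitted = NormalDist(fmean(x), stdev(x))
    f = [fitted.cdf(v) for v in x]  # fitted CDF at each ordered observation

    # Lilliefors / K-S: largest gap between the step ECDF and the fitted CDF.
    d_n = max(max((i + 1) / n - f[i], f[i] - i / n) for i in range(n))

    # Anderson-Darling: the log terms give extra weight to the tails.
    a2 = -n - sum(
        (2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
        for i in range(n)
    ) / n

    # Cramer-von Mises: squared differences accumulated over the whole range.
    w2 = 1.0 / (12 * n) + sum(
        (f[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n)
    )
    return d_n, a2, w2


if __name__ == "__main__":
    d_n, a2, w2 = edf_statistics(FINES)
    print(f"D_n = {d_n:.4f}, A2 = {a2:.4f}, W2 = {w2:.4f}")
```

Each statistic would still need to be compared with the critical value appropriate to estimated parameters and the chosen significance level before a decision on normality is made.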
2.3 Protocol to assess the efficacy of the selected GoF tests using power analysis
The power analysis was conducted through extensive simulations in MATLAB to evaluate the effectiveness of the selected GoF tests. The Lilliefors (lillietest function), A-D (adtest function), chi-square goodness-of-fit (chi2gof function) and Jarque–Bera (jbtest function) tests are built-in MATLAB functions and can be used directly for analysis. However, the Shapiro–Wilk (swtest function), Cramér–von Mises (cmtest function) and D’Agostino–Pearson (DagosPtest function) tests were sourced from the MATLAB File Exchange (BenSaïda, 2025a, 2025b; Trujillo-Ortiz, 2025).
The statistical power of a test is defined as the probability of correctly rejecting the null hypothesis (H0: the distribution is normal) when the alternative hypothesis (HA: the distribution is not normal) is true. Mathematically, it is expressed as Power = 1 − β, where β represents the probability of a Type II error (failing to reject H0 when HA is true). A high power value minimises the Type II error and ensures reliable statistical inference.
In this analysis, Monte Carlo simulations were carried out to increase the sample size by generating random realisations of synthetic data based on the moments and marginal distribution of the original sample data set. Key simulation parameters, including the significance level, number of iterations and sample sizes, were then defined. The simulations were conducted for multiple sample sizes, ranging from 5 to 10,000 in increments of 5 (user-defined). A significance level (α) of 0.05 was selected for hypothesis testing across the simulations.
The power of the selected GoF tests was then calculated by following the protocol described in Steele (2003) and Arnastauskaite et al. (2021):
Step 1: Collection of sample data. The analysis begins with an experimentally obtained data set consisting of observations x1, x2,…, xn.
Step 2: Statistical parameter estimation. Sample moments and the corresponding parameters of the fitted marginal distributions are computed from the collected data.
Step 3: Monte Carlo data generation. Using these estimated parameters, surrogate data sets are generated under the specified alternative distribution through Monte Carlo simulation.
Step 4: Applying the hypothesis test. The test statistic and its p-value are computed for each surrogate data set. If the p-value exceeds the chosen significance level (α = 0.05), the null hypothesis (H0: the distribution is normal) is not rejected; otherwise, H0 is rejected.
Step 5: Simulation and repetition. Steps 2–4 are repeated for k iterations (in this case, k = 10,000) to ensure a robust evaluation of the test performance.
Step 6: Calculating the power of the test. The power (1 − β) is determined as count/k, where count represents the number of times the null hypothesis (H0) is correctly rejected under the assumed alternative distribution across the k iterations.
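The protocol can be sketched in a few lines of Python (standard library only). This is a hedged illustration, not the MATLAB workflow of this study: it uses the Jarque–Bera test because its statistic and the 5.9915 rejection threshold from Table 3 need no external tables, it assumes a lognormal alternative matched to a given mean and CoV (the alternative distribution is an assumption made here for illustration), and the names `jarque_bera`, `lognormal_sample` and `estimate_power` are hypothetical. The asymptotic threshold is known to be approximate for small samples, so the small-n power values are indicative only.

```python
import math
import random
from statistics import fmean


def jarque_bera(sample):
    """Jarque-Bera statistic n/6 * (b1^2 + (b2 - 3)^2 / 4) from sample moments."""
    n = len(sample)
    mean = fmean(sample)
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


def lognormal_sample(rng, n, mean, cov):
    """Step 3: draw n values from a lognormal with the given mean and CoV.

    The lognormal alternative is an assumption of this sketch.
    """
    sigma2 = math.log(1.0 + cov ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return [rng.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(n)]


def estimate_power(n, mean, cov, k=2000, critical=5.9915, seed=1):
    """Steps 3-6: power = count / k, where count is the number of rejections."""
    rng = random.Random(seed)
    count = 0
    for _ in range(k):
        surrogate = lognormal_sample(rng, n, mean, cov)
        if jarque_bera(surrogate) > critical:  # Step 4: reject H0 (normality)
            count += 1
    return count / k


if __name__ == "__main__":
    # Mean and CoV taken from the data set 5 row of Table 2 (13.3, 37%).
    for n in (20, 50, 100, 200):
        print(n, estimate_power(n, mean=13.3, cov=0.37))
```

Scanning the sample size until the estimated power first reaches the 0.8 target gives the minimum sample size for that test and CoV; a smaller k is used here than the 10,000 iterations of the study simply to keep the sketch quick to run.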
For each realisation, seven different normality tests were applied: the S-W test, LL test, A-D test, CVM test, CSQ test, J-B test and DP omnibus test. After computing the power values, the performance of each test was evaluated using its behaviour at the maximum sample size to establish its comparative effectiveness. A higher power indicates a stronger ability to identify deviations from normality, making the corresponding test more reliable for assessing data normality.
3. Results and discussion
3.1 Index and engineering properties of natural/processed geomaterial assessed for normality
The normality assessment for the geotechnical properties, summarised in Table 1, was conducted with a focus on their application in highway subgrade construction and landfill liner and cover materials. Specifically, data sets 1 to 5 were evaluated for their suitability as highway subgrade material with a target CBR value of at least 10%. Meanwhile, data set 6 was assessed to evaluate the suitability of soil-like fractions as backfill material, considering the percentage of fines. Data sets 7 to 10 were analysed for their suitability as liner and cover material, targeting an unconfined compressive strength of ≥ 200 kPa and a permeability (k) of ≤ 10⁻⁷ cm/s to ensure effective containment performance. In addition, conducting statistical parametric tests on these physical and mechanical properties provides valuable insight into the feasibility of utilising these materials in subgrade and landfill containment systems while addressing inherent variability and uncertainty.
3.2 Descriptive statistics of the geotechnical data used for the analysis
Table 2 provides a comprehensive summary of the descriptive statistics of the geotechnical data analysed for normality assessment. Each data set was first organised, and outliers were identified and removed using the statistical hypothesis test proposed by Grubbs (1969) to ensure data consistency and avoid bias. Based on the result of Grubbs’ two-tailed test (significance level, α = 0.01), only data set 5 (CBR) contained an outlier (CBR = 33.8%). This data point deviated significantly from the sample mean and was removed before performing the normality and power analyses. The variability in the data, assessed using the coefficient of variation (CoV), ranged from 6% to 196%. According to Harr’s (1987) classification, the variability is categorised based on the sample CoV. Specifically, data sets 3, 4 and 7 exhibit low variability (CoV < 10%), while data sets 1, 2, 6 and 8 display moderate variability (15% < CoV < 30%). In contrast, CBR (data set 5), UCS (data set 9) and permeability (data set 10) demonstrate high variability (CoV > 30%). High CoV values are common in the geotechnical domain (Benson, 1993; Baecher and Christian, 2005).
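The outlier screening step can be illustrated with a short sketch of the two-sided Grubbs test in its commonly used form, where the critical value is derived from Student's t distribution. The function name is illustrative and the sample values are invented for demonstration; they are not the study's CBR data.

```python
import numpy as np
from scipy import stats


def grubbs_two_sided(x, alpha=0.01):
    """Two-sided Grubbs test for a single outlier (most extreme value)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean, sd = x.mean(), x.std(ddof=1)
    idx = int(np.argmax(np.abs(x - mean)))   # most extreme observation
    g = abs(x[idx] - mean) / sd              # Grubbs statistic
    # Critical value based on the t distribution with n - 2 degrees of freedom
    t = stats.t.ppf(1.0 - alpha / (2.0 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, idx, bool(g > g_crit)


if __name__ == "__main__":
    # Invented values for illustration only
    cbr = [8.1, 9.4, 7.6, 10.2, 8.8, 9.0, 7.9, 9.7, 8.4, 33.8]
    g, g_crit, idx, is_outlier = grubbs_two_sided(cbr, alpha=0.01)
    print(f"G = {g:.3f}, G_crit = {g_crit:.3f}, outlier = {is_outlier} (value {cbr[idx]})")
```

The test flags at most one value per pass, so it would be repeated on the reduced sample if more than one outlier were suspected.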
In addition, Table 2 reports the sample averages, standard deviations, skewness and kurtosis values, which provide qualitative insights into the data distribution characteristics. The skewness coefficient indicates asymmetry in the data distribution: data sets 1, 4 and 8 exhibit negative skewness, indicating a left-skewed distribution, while data sets 2, 3, 5, 6, 7, 9 and 10 display positive skewness, reflecting a right-skewed distribution. Furthermore, the kurtosis analysis, which measures the sharpness of the distribution, reveals that data sets 1 to 8 exhibit a flat, short-tailed distribution (platykurtic, kurtosis < 3). In contrast, data sets 9 and 10 are characterised by sharp peaks and long tails (leptokurtic, kurtosis > 3). Ideally, a data set with skewness near zero and a kurtosis value of approximately three indicates a symmetrical normal distribution (Newell and Hancock, 1984; Thode, 2002). Although the skewness and kurtosis (based on the third and fourth central moments, respectively) serve as functional parameters for assessing normality, relying solely on these metrics may lead to misleading conclusions. Hence, to ensure robust statistical decisions on the normality assessment, the theory-driven normality (GoF) tests discussed in Section 2.2 were applied.
3.3 Normality outcomes for original data sets and necessity of conducting power analysis
Table 4 presents a summary of the hypothesis decisions by various GoF tests in assessing the normality of the original data sets, based on the analytical equations provided in Table 3. The assessment of normality was conducted by comparing the test statistics against the corresponding critical values. Accordingly, both the LL and S-W tests rejected the null hypothesis of normality for data sets 4, 5, 9 and 10. In contrast, the moment-based J-B test failed to reject normality for data sets 1 to 8 and rejected it only for data sets 9 and 10. In addition, the CSQ test rejected the null hypothesis for data sets 2, 4, 5, 7, 8 and 10. However, the CVM test failed to reject the null hypothesis for data sets 1 to 9, indicating variations in sensitivity across different normality tests. Most of the GoF tests rejected normality for data sets 9 and 10; only the CSQ and CVM tests retained normality for data set 9.
Table 4. Summary of hypothesis inference on the normality tests conducted on original and skewed data sets using selected GoF tests
| Data set | LL | A-D | CSQ | J-B | DP | CVM | S-W |
|---|---|---|---|---|---|---|---|
| Original data set | |||||||
| 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 2 | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |
| 3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 4 | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 5 | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 6 | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 7 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| 8 | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ |
| 9 | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ |
| 10 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Skewed data set (k = 2) | |||||||
| 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 2 | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ |
| 3 | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
| 4 | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| 5 | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 6 | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ |
| 7 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 8 | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 9 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 10 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Skewed data set (k = 3) | |||||||
| 1 | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ |
| 2 | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ |
| 3 | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ |
| 4 | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 5 | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 6 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 7 | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 8 | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| 9 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| 10 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
✓ Fails to reject H0 (Normality assumption accepted);
✗ Rejecting H0 (Normality assumption rejected)
Notably, these data sets (9 and 10) exhibit high CoV of 94% and 196%, respectively, and are characterised by long-tailed distributions. Since the lognormal distribution is more commonly used in geotechnical engineering, Monte Carlo simulation studies were conducted to further investigate the boundary conditions under which data conform to a lognormal distribution. To achieve this, a random variable was generated synthetically while systematically varying CoV and kurtosis, keeping skewness constant (zero). The generated data sets were then subjected to GoF tests to determine whether they adhered to a normal (Gaussian) or lognormal distribution. Figure 1 visually represents the progressive shift in data distribution from normal to lognormal as variability increases. The analysis revealed a critical threshold: when a data set exhibits a CoV greater than 30% with a kurtosis value exceeding 3, the lognormal distribution provides a superior fit. These findings reinforce the importance of considering CoV and kurtosis when selecting an appropriate probability distribution for geotechnical data.
Figure 1. Effect of the coefficient of variation and kurtosis value on fitting the data to the lognormal distribution. Scatter plot of the coefficient of kurtosis (0 to 10, vertical axis) against the coefficient of variation (0% to 70%, horizontal axis); dashed boxes mark the data fitted to the lognormal and normal distributions.
Source: Figure by authors
Since the data sets analysed varied in sample size, and different GoF tests use different statistical measures, drawing a definitive conclusion on the superior performance of any particular GoF test is challenging. Generally, the assumption of normality becomes less critical for sample sizes n ≥ 30 (Uttley, 2019), as supported by the central limit theorem. Furthermore, there are reported cases where small-sample data sets pass normality tests even though they are not genuinely drawn from a normally distributed population. This outcome is not necessarily meaningful, as it often results from the test’s lack of statistical power (Öztuna et al., 2006; Ghasemi and Zahediasl, 2012). For small sample sizes, normality tests often lack the sensitivity needed to detect deviations from normality, increasing the likelihood of Type II errors (false acceptance of normality). Consequently, without a power analysis, the appropriate sample size remains uncertain, which may lead to biased conclusions regarding normality.
3.4 Identification of best-fit distributions and selection of surrogate models for power analysis
Before conducting the power analysis through Monte Carlo simulations, it is necessary to define the alternative distribution corresponding to the alternative hypothesis. Accordingly, each data set was evaluated against several theoretical distributions, namely Normal (N), Lognormal (LN), Weibull (WB), Gamma (GA), Generalised Extreme Value (GEV), Gumbel (GB) and Exponential (Exp), as presented in Figure 2. The GoF for each distribution was quantified using mean p-values derived from the seven selected GoF tests. The test results revealed that data sets 1, 4 and 10 were best fitted by the WB distribution, data sets 3, 7 and 9 by the LN distribution, data sets 5, 6 and 8 by the GEV distribution, and data set 2 by the GA distribution. The GB and Exp distributions were found to be the least suitable for the selected data sets. Notably, the analysis indicated that data sets with higher CoV were well fitted by the LN, WB, GEV and GA distributions. Consequently, these four models were adopted as surrogate models for the subsequent power analysis to evaluate the performance of GoF tests and determine the sample size.
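A comparable screening of candidate distributions can be sketched with SciPy. The sketch makes several assumptions that differ from the study: it scores each fit with a single Kolmogorov–Smirnov p-value instead of the mean p-value of seven GoF tests, it fixes the location parameter at zero for the positive-support families, and the function name and input data are illustrative. Because the parameters are estimated from the same sample, the p-values are optimistic and are useful only for ranking.

```python
import numpy as np
from scipy import stats

# Candidate families; fixing loc = 0 for positive-support models is an assumption
CANDIDATES = {
    "N": (stats.norm, {}),
    "LN": (stats.lognorm, {"floc": 0}),
    "WB": (stats.weibull_min, {"floc": 0}),
    "GA": (stats.gamma, {"floc": 0}),
    "GEV": (stats.genextreme, {}),
    "GB": (stats.gumbel_r, {}),
    "Exp": (stats.expon, {"floc": 0}),
}


def rank_candidates(data):
    """Fit each candidate by maximum likelihood and rank by K-S p-value."""
    data = np.asarray(data, dtype=float)
    scores = {}
    for label, (dist, fit_kwargs) in CANDIDATES.items():
        params = dist.fit(data, **fit_kwargs)   # (shape..., loc, scale)
        scores[label] = stats.kstest(data, dist.cdf, args=params).pvalue
    # Highest p-value first, i.e. the best-fitting candidate leads
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    synthetic = rng.lognormal(mean=2.0, sigma=0.5, size=200)  # synthetic, not study data
    for label, p in rank_candidates(synthetic).items():
        print(f"{label:>3}: p = {p:.3f}")
```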
Figure 2. Best-fitting probability model for the original data sets based on the mean p-value estimated from the seven GoF tests. Bar graph of the mean p-value (0 to 1, vertical axis) for data sets 1 to 10 (horizontal axis), with one bar per candidate distribution (Normal, Lognormal, Weibull, Gamma, Generalised Extreme Value, Gumbel and Exponential).
Source: Figure by authors
3.5 Power evaluation of selected GoF test through Monte Carlo simulation
The power analysis revealed key insights into the performance of normality tests applied to geotechnical data. Figure 3(a)–(d) illustrates the variation in the power values with increasing sample sizes for data set 5 (CBR), evaluated under the alternative distributions LN, WB, GA and GEV. The results indicated that the power value consistently increases with sample size and asymptotically approaches 1.0 as the sample size becomes sufficiently large. Moreover, the sample size required to achieve a power value of 1.0 varied considerably across the alternative distributions, ranging from 90 to over 1,000. For instance, at a sample size of 100, using the S-W GoF test, the power values for LN, GA and GEV approached 1.0, whereas the power value for WB was considerably lower, at around 0.1. This observation highlights the strong dependence of power and the corresponding sample size requirements on the shape and location parameters of the underlying alternative marginal distributions.
Figure 3. Power curves for the original data set (CBR – data set 5) and the skewness-induced data sets. Subplots (a)–(d) correspond to the original data, while (a1)–(d1) and (a2)–(d2) represent the data sets transformed with exponents k = 2 and k = 3, respectively. Each panel plots power (1 − β) against sample size (logarithmic axis) for the LN, WB, GA and GEV alternatives, with one curve per GoF test (LL, A-D, CSQ, J-B, DP, CVM and S-W).
Source: Figure by authors
In addition, the type of GoF test used had a marked influence on the power values obtained across the different sample sizes. As observed from Figure 3, the S-W, A-D, LL and J-B tests exhibited relatively higher power values, whereas the CSQ and CVM tests demonstrated comparatively lower power. To determine whether the observed differences in power values among the selected normality tests were statistically significant under various distributional conditions (lognormal, gamma, Weibull and GEV), a non-parametric Friedman test was performed. Following Corder and Foreman (2014), the null hypothesis (H0) states that all the GoF tests yield equivalent power values, while the alternative hypothesis (HA) states that at least one test performs significantly differently. Table 5 summarises the computed power values for data set 5 across various GoF tests and sample sizes. Three representative sample sizes were included in the comparative analysis, and test statistics for each case were calculated using equation (1):
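Equation (1) is the Friedman test statistic. In its standard form, written with the symbols used in this section (m GoF tests, l distributional blocks and rank sums R_i), it reads as follows; this form reproduces the tabulated value F_r ≈ 22.6 for n = 25:

$$F_r = \frac{12}{l\,m\,(m+1)}\sum_{i=1}^{m} R_i^{2} - 3\,l\,(m+1) \qquad (1)$$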
Table 5. Friedman test results comparing mean power values of the GoF tests across different parametric surrogate distributions
| Distribution | Effect size | LL | J-B | A-D | S-W | DP | CVM | CSQ | Fr (test statistic) | Fc (critical) |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample size (n) = 25 | ||||||||||
| LN | 0.22 | 0.382 | 0.497 | 0.521 | 0.597 | 0.412 | 0.010 | 0.046 | 22.6 | 11.6 |
| WB | 0.15 | 0.044 | 0.029 | 0.046 | 0.051 | 0.018 | 0.000 | 0.001 | ||
| GA | 0.18 | 0.167 | 0.229 | 0.217 | 0.273 | 0.176 | 0.000 | 0.007 | ||
| GEV | 0.16 | 0.164 | 0.216 | 0.234 | 0.304 | 0.156 | 0.000 | 0.007 | ||
| Sample size (n) = 50 | ||||||||||
| LN | 0.22 | 0.3820 | 0.4968 | 0.5210 | 0.5966 | 0.4116 | 0.0102 | 0.0456 | 22.4 | 11.6 |
| WB | 0.15 | 0.0440 | 0.0294 | 0.0464 | 0.0506 | 0.0184 | 0.0000 | 0.0014 | ||
| GA | 0.18 | 0.1672 | 0.2286 | 0.2170 | 0.2728 | 0.1764 | 0.0004 | 0.0070 | ||
| GEV | 0.16 | 0.1638 | 0.2158 | 0.2340 | 0.3036 | 0.1560 | 0.0000 | 0.0072 | ||
| Sample size (n) = 100 | ||||||||||
| LN | 0.22 | 0.3820 | 0.4968 | 0.5210 | 0.5966 | 0.4116 | 0.0102 | 0.0456 | 22.7 | 11.6 |
| WB | 0.15 | 0.0440 | 0.0294 | 0.0464 | 0.0506 | 0.0184 | 0.0000 | 0.0014 | ||
| GA | 0.18 | 0.1672 | 0.2286 | 0.2170 | 0.2728 | 0.1764 | 0.0004 | 0.0070 | ||
| GEV | 0.16 | 0.1638 | 0.2158 | 0.2340 | 0.3036 | 0.1560 | 0.0000 | 0.0072 | ||
Here, m denotes the number of GoF tests considered, l represents the number of distributional conditions (blocks), and Ri corresponds to the sum of ranks assigned to each GoF test. Based on the ranking procedure across the distributional block, the calculated Friedman test statistic values (Fr) were 22.6, 22.4 and 22.7, respectively. The critical value for the present case was obtained as 11.6 (where α = 0.05, l = 4, m = 7) from the updated and extended critical values proposed by López-Vázquez and Hochsztain (2019). Since the computed test statistic value exceeds the critical value, it can be inferred that a statistically significant difference exists among the GoF tests with respect to their power values. Given that the significance level (α) was maintained as 0.05 throughout the power analysis, the changes in the power values must come from the sample size (shown in Figure 3) and effect size. The effect size measures how much the observed data deviates from a perfect normal distribution (Cohen, 1992); a detailed discussion of effect size is presented in the subsequent section.
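The ranking procedure can be checked numerically. The sketch below ranks the n = 25 power values of Table 5 within each distributional block and evaluates the standard Friedman statistic, F_r = 12/(l m (m + 1)) ΣR_i² − 3 l (m + 1). The function name is illustrative, and the formula is assumed to be the one intended by equation (1).

```python
import numpy as np
from scipy import stats

TESTS = ["LL", "J-B", "A-D", "S-W", "DP", "CVM", "CSQ"]

# Power values for n = 25 from Table 5 (rows: LN, WB, GA, GEV blocks)
POWER_N25 = np.array([
    [0.382, 0.497, 0.521, 0.597, 0.412, 0.010, 0.046],
    [0.044, 0.029, 0.046, 0.051, 0.018, 0.000, 0.001],
    [0.167, 0.229, 0.217, 0.273, 0.176, 0.000, 0.007],
    [0.164, 0.216, 0.234, 0.304, 0.156, 0.000, 0.007],
])


def friedman_statistic(table):
    """Friedman statistic for l blocks (rows) and m treatments (columns)."""
    table = np.asarray(table, dtype=float)
    l, m = table.shape
    ranks = stats.rankdata(table, axis=1)   # rank the tests within each block
    rank_sums = ranks.sum(axis=0)           # R_i for each GoF test
    fr = 12.0 / (l * m * (m + 1)) * np.sum(rank_sums**2) - 3.0 * l * (m + 1)
    return fr, rank_sums


if __name__ == "__main__":
    fr, rank_sums = friedman_statistic(POWER_N25)
    print(dict(zip(TESTS, rank_sums)))
    print(f"Fr = {fr:.2f}")   # Fr = 22.61, above the critical value of 11.6
```

The S-W test ranks first in every block (rank sum of 28) and the CVM test ranks last in every block (rank sum of 4), which drives the statistic close to its upper bound of l(m − 1) = 24.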
3.6 Assessing the influence of effect size on the power values of selected GoF tests
The sample size required to reach a power value of 1.0 was influenced not only by the specific GoF test but also by the statistical characteristics of the data set, particularly the CoV (effect size). Hence, to further examine the sensitivity of the test methods in detecting deviations from normality, all the selected GoF tests were applied to the original data sets with induced skewness, thereby varying the effect size. For this purpose, the original data were systematically transformed to introduce a controlled level of skewness and mean shift, enabling a detailed evaluation of each test’s decision-making capability. Skewness was induced using a nonlinear power transformation of the form Y = X^k, where Y is the transformed random variable, X is the original random variable, and k is the exponent. In this study, values of k = 2 and k = 3 were adopted to capture the sensitivity of different GoF tests under varying effect sizes. Figure 4 illustrates the standard normal probability plots of the original data set and the data sets induced with skewness.
Figure 4. Standard normal probability plots for: (a) original data set 2 (plastic limit; CoV = 15%), (b) skewed data set (k = 2) and (c) skewed data set (k = 3). Each panel plots probability against z values from −3.5 to 3.5.
Source: Figure by authors
Introducing controlled skewness into the data sets increased the CoV (%) to approximately twice (k = 2) and three times (k = 3) the original values. For k = 2, the descriptive statistics of the original data sets were transformed to: CoV (%) = {43, 30, 15, 23, 88, 56, 12, 33, 198, 315}; Skewness = {0.45, 0.88, 0.87, −0.03, 2.85, 0.94, 0.27, 0.12, 2.98, 4.21}; and Kurtosis = {−0.19, 1.23, 1.56, 0.67, 12.27, 0.57, −1.14, −1.37, 8.49, 19.41}. Likewise, for k = 3, the transformed statistics were: CoV (%) = {62, 46, 23, 34, 153, 84, 18, 47, 276, 392}; Skewness = {0.92, 1.34, 1.16, 0.42, 4.51, 1.49, 0.34, 0.28, 3.44, 5.13}; and Kurtosis = {0.79, 2.33, 2.43, 1.06, 23.83, 1.67, −1.14, −1.18, 11.52, 28.09}.
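The effect of the power transformation on the descriptive statistics can be reproduced on synthetic data. The sketch below assumes a normal sample with a CoV of 15%, chosen only to resemble the variability of data set 2; the function names are illustrative, and excess kurtosis (zero for a normal distribution) is used as an assumed convention. To first order, CoV(X^k) ≈ k · CoV(X) when the CoV is small, which is consistent with the approximate doubling and tripling reported above.

```python
import numpy as np
from scipy import stats


def describe(x):
    """CoV (%), skewness and excess kurtosis of a sample."""
    x = np.asarray(x, dtype=float)
    return {
        "cov_pct": 100.0 * x.std(ddof=1) / x.mean(),
        "skewness": float(stats.skew(x)),
        "excess_kurtosis": float(stats.kurtosis(x)),  # Fisher definition, normal = 0
    }


def induce_skewness(x, k):
    """Nonlinear power transformation Y = X^k."""
    return np.asarray(x, dtype=float) ** k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(loc=20.0, scale=3.0, size=100_000)  # synthetic sample, CoV = 15 %
    for k in (1, 2, 3):
        summary = describe(induce_skewness(x, k))
        print(k, {name: round(value, 2) for name, value in summary.items()})
```

For strongly variable data the first-order approximation no longer holds, so the CoV of the transformed data can grow by a factor different from k.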
Consequently, the effect size was quantified using Cohen’s d, expressed as d = |μ – μ0|/s, where the numerator |μ – μ0| indicates the difference between the means under the null and alternative hypotheses, and s denotes the pooled standard deviation. Generally, a larger effect size (data exhibiting higher CoV) represents a more pronounced deviation from normality, making it easier to detect non-normality. Conversely, a smaller effect size indicates a subtle deviation, requiring a larger sample size to achieve the same statistical power. The computed values indicated an increase in effect size for data sets with the introduction of skewness (following power transformation). For instance, for data set 5 fitted with LN, the effect size of the original data set was 0.223, which increased to 0.371 and 0.698 for transformation exponents k = 2 and k = 3, respectively. Figure 5 presents the responses of various GoF tests at different levels of effect size (d) for both the original and transformed data sets. The effect size ranged from 0.08 for the original data sets to approximately 0.7 for the skewed data sets. Across all sample sizes (n = 25, 50 and 100), the S-W test demonstrated superior performance and higher sensitivity to the effect size variations, followed by the A-D and J-B tests. Moreover, the results revealed that increasing the sample size reduced the effect size needed to reach a given power, which is consistent with the central limit theorem. Even at small effect sizes, a power value of 1.0 was achieved at larger sample sizes, confirming the dependency of test sensitivity on sample size.
Figure 5. Sensitivity of the selected GoF tests illustrated through their power responses across varying sample sizes (n) and effect size conditions. Panels (a)–(d) correspond to the LN, WB, GA and GEV alternatives at n = 25, 50 and 100; each panel plots power (1 − β) against effect size (d), with one curve per GoF test (LL, A-D, CSQ, J-B, DP, CVM and S-W).
Source: Figure by authors
Furthermore, the results summarised in Table 4 reveal a substantial shift in statistical outcome. For moderately skewed data (k = 2), the A-D, S-W and LL tests demonstrated higher sensitivity to the nonlinear transformation, with several data sets that initially satisfied the normality assumption being rejected once skewness was introduced. In contrast, the CVM test showed minimal sensitivity under the same conditions, additionally rejecting normality only for data set 9. As expected, the moment-based test also responded to the transformation, owing to its direct dependence on skewness and kurtosis. With a further increase in skewness (k = 3), nearly all the GoF tests became more sensitive in detecting deviations from normality, as shown in Table 4.
3.7 Optimising the minimum sample size required for normality evaluation
It can be seen from Figures 3 and 5 that, even with a sample size of 100, most of the selected GoF tests did not achieve a power value of 1.0 for most of the data sets. This suggests that evaluating the effectiveness of the GoF tests solely on the basis of achieving a power value of 1.0 may require an impractically large sample size. Consequently, researchers have emphasised the importance of setting an optimal power threshold to balance statistical reliability and sample size limitations. Cohen (1992) reported that selecting a power value less than 0.8 introduces an unacceptably high risk of committing a Type II error, which can potentially lead to erroneous conclusions in statistical analysis. Conventionally, with a significance level (α) of 0.05, a power of 0.80 results in a Type II error probability (β) of 0.2, yielding a β:α ratio of 4:1 (Cohen, 1992). Moreover, Murphy and Myors (2004) noted that a power level of 0.8 implies that the probability of correctly detecting an effect is four times greater than that of failing to do so, whereas a power level of 0.9 increases this ratio to nine. Considering these statistical principles, the effectiveness of the GoF tests in this study was assessed against a minimum power value of 0.8, ensuring a balance between minimising Type II errors and maintaining a feasible sample size.
Based on the power value of 0.8 and the chosen significance level of 0.05, the minimum sample sizes required for normality assessment are summarised in Table 6. The sample sizes obtained from the power analysis indicate that the CSQ and CVM tests require substantially larger samples, typically between 115 and 4,400, which explains their inability to identify non-normality for data sets 1–9, given that the original data sets contain only 36–50 samples. In contrast, the S-W and J-B tests exhibit higher efficiency, with minimum sample size requirements ranging from 8 to 65 for data sets exhibiting strong deviations (higher variability) from normality (e.g. data sets 9 and 10). However, for data sets 1–8, the required sample sizes fall within the range of 50–2,500, indicating that any normality decision drawn from these limited samples is statistically unreliable. In addition, for the LL, A-D and DP tests, the required sample size ranges approximately from 12 to 5,000.
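The search for the minimum sample size can be sketched as a scan over an ascending grid of candidate sample sizes, returning the first one whose Monte Carlo power reaches the target. This is an illustrative sketch under stated assumptions, not the procedure used to produce Table 6: it covers only the S-W test, uses a coarse grid and a reduced number of iterations, and the function names and the lognormal sampler are hypothetical.

```python
import numpy as np
from scipy import stats


def sw_power(sampler, n, k, alpha, rng):
    """Monte Carlo rejection rate of the S-W test at sample size n."""
    rejected = sum(stats.shapiro(sampler(rng, n))[1] < alpha for _ in range(k))
    return rejected / k


def minimum_sample_size(sampler, grid, target=0.8, k=500, alpha=0.05, seed=0):
    """First n in the ascending grid whose estimated power reaches the target."""
    rng = np.random.default_rng(seed)
    for n in grid:
        if sw_power(sampler, n, k, alpha, rng) >= target:
            return n
    return None  # target power not reached within the grid


if __name__ == "__main__":
    # Hypothetical alternative: lognormal with a CoV of about 131 %
    heavy_tail = lambda rng, n: rng.lognormal(0.0, 1.0, size=n)
    print(minimum_sample_size(heavy_tail, grid=range(5, 101, 5)))
```

A finer grid and a larger k tighten the estimate at the cost of run time; because the power estimate is itself random, values close to the 0.8 threshold can shift the result by one grid step.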
Table 6. Summary of minimum sample size requirements for various GoF tests when assessing normality under different surrogate alternative distributions
| Data set | LL | A-D | J-B | DP | S-W | LL | A-D | J-B | DP | S-W |
|---|---|---|---|---|---|---|---|---|---|---|
| Original data | | | | | | | | | | |
| | LN (α = 0.05, Power = 0.8) | | | | | WB (α = 0.05, Power = 0.8) | | | | |
| 1 | 262 | 144 | 130 | 245 | 120 | 1135 | 741 | 693 | 1175 | 625 |
| 2 | 642 | 414 | 328 | 557 | 309 | 476 | 311 | 274 | 501 | 245 |
| 3 | 2511 | 1667 | 1222 | 2071 | 1213 | 227 | 148 | 134 | 226 | 117 |
| 4 | 834 | 551 | 422 | 725 | 405 | 300 | 195 | 177 | 307 | 153 |
| 5 | 70 | 46 | 48 | 73 | 39 | 3696 | 1986 | 1214 | 1532 | 1019 |
| 6 | 133 | 90 | 85 | 134 | 72 | 5170 | 3495 | 2496 | 2352 | 2394 |
| 7 | 3963 | 2551 | 1853 | 3229 | 1909 | 180 | 117 | 107 | 174 | 94 |
| 8 | 472 | 309 | 257 | 416 | 237 | 476 | 312 | 274 | 501 | 245 |
| 9 | 25 | 19 | 23 | 29 | 17 | 49 | 65 | 41 | 65 | 28 |
| 10 | 12 | 11 | 11 | 14 | 10 | 13 | 12 | 13 | 15 | 11 |
| | GA (α = 0.05, Power = 0.8) | | | | | GEV (α = 0.05, Power = 0.8) | | | | |
| 1 | 611 | 390 | 315 | 545 | 287 | 5045 | 3025 | 1745 | 1875 | 1485 |
| 2 | 767 | 956 | 727 | 1174 | 717 | 941 | 616 | 574 | 998 | 528 |
| 3 | 5045 | 3779 | 2732 | 4746 | 2685 | 5045 | 3779 | 2732 | 4746 | 2685 |
| 4 | 2050 | 1313 | 982 | 1679 | 955 | 2033 | 1145 | 794 | 1146 | 636 |
| 5 | 180 | 115 | 109 | 182 | 87 | 157 | 94 | 103 | 189 | 70 |
| 6 | 337 | 215 | 185 | 319 | 161 | 5432 | 3173 | 2416 | 2420 | 2284 |
| 7 | 5133 | 5151 | 4405 | 5199 | 4334 | 5089 | 3673 | 2183 | 2108 | 1927 |
| 8 | 1195 | 721 | 549 | 946 | 528 | 571 | 330 | 287 | 496 | 217 |
| 9 | 49 | 32 | 39 | 62 | 27 | 22 | 17 | 20 | 22 | 15 |
| 10 | 14 | 12 | 15 | 20 | 12 | 10 | 9 | 9 | 13 | 8 |
| Transformed data (K = 2) | | | | | | | | | | |
| | LN (α = 0.05, Power = 0.8) | | | | | WB (α = 0.05, Power = 0.8) | | | | |
| 1 | 70 | 45 | 48 | 72 | 39 | 821 | 467 | 395 | 652 | 294 |
| 2 | 169 | 106 | 99 | 162 | 86 | 6123 | 3875 | 2269 | 2155 | 2025 |
| 3 | 649 | 417 | 335 | 562 | 311 | 582 | 388 | 348 | 618 | 312 |
| 4 | 218 | 146 | 126 | 205 | 109 | 1403 | 1017 | 971 | 1147 | 896 |
| 5 | 23 | 17 | 21 | 26 | 14 | 102 | 62 | 72 | 131 | 48 |
| 6 | 40 | 27 | 32 | 44 | 24 | 200 | 115 | 122 | 227 | 85 |
| 7 | 1012 | 646 | 493 | 846 | 486 | 304 | 194 | 176 | 317 | 155 |
| 8 | 121 | 83 | 79 | 123 | 63 | 6378 | 3873 | 2265 | 2154 | 2004 |
| 9 | 13 | 12 | 13 | 14 | 12 | 14 | 13 | 15 | 19 | 12 |
| 10 | 9 | 8 | 9 | 13 | 7 | 9 | 8 | 9 | 13 | 7 |
| | GA (α = 0.05, Power = 0.8) | | | | | GEV (α = 0.05, Power = 0.8) | | | | |
| 1 | 163 | 105 | 100 | 171 | 80 | 376 | 239 | 210 | 378 | 184 |
| 2 | 380 | 283 | 203 | 357 | 178 | 212 | 135 | 126 | 207 | 107 |
| 3 | 1187 | 933 | 722 | 1164 | 683 | 198 | 134 | 118 | 197 | 104 |
| 4 | 540 | 340 | 273 | 471 | 251 | 3411 | 2987 | 1255 | 1156 | 1217 |
| 5 | 57 | 37 | 44 | 70 | 31 | 814 | 571 | 509 | 905 | 480 |
| 6 | 94 | 61 | 67 | 106 | 49 | 115 | 78 | 76 | 109 | 61 |
| 7 | 1660 | 1464 | 1490 | 1850 | 1044 | 1075 | 943 | 893 | 1160 | 811 |
| 8 | 292 | 187 | 159 | 277 | 140 | 5206 | 3630 | 2083 | 2095 | 1785 |
| 9 | 16 | 13 | 17 | 24 | 13 | 12 | 11 | 12 | 14 | 11 |
| 10 | 11 | 9 | 11 | 14 | 10 | 9 | 7 | 7 | 13 | 7 |
| Transformed data (K = 3) | | | | | | | | | | |
| | LN (α = 0.05, Power = 0.8) | | | | | WB (α = 0.05, Power = 0.8) | | | | |
| 1 | 34 | 24 | 29 | 39 | 22 | 132 | 80 | 89 | 164 | 62 |
| 2 | 77 | 50 | 52 | 80 | 42 | 500 | 294 | 258 | 755 | 195 |
| 3 | 288 | 189 | 165 | 270 | 148 | 5206 | 3630 | 2083 | 2095 | 1785 |
| 4 | 101 | 59 | 59 | 90 | 50 | 4935 | 1107 | 1098 | 1166 | 989 |
| 5 | 14 | 13 | 14 | 15 | 12 | 35 | 32 | 32 | 48 | 21 |
| 6 | 22 | 15 | 20 | 24 | 14 | 55 | 35 | 44 | 73 | 30 |
| 7 | 460 | 286 | 237 | 393 | 218 | 616 | 407 | 366 | 647 | 332 |
| 8 | 57 | 40 | 43 | 61 | 34 | 498 | 290 | 256 | 445 | 194 |
| 9 | 11 | 10 | 11 | 13 | 10 | 12 | 11 | 12 | 14 | 10 |
| 10 | 8 | 7 | 7 | 11 | 6 | 8 | 7 | 7 | 10 | 6 |
| | GA (α = 0.05, Power = 0.8) | | | | | GEV (α = 0.05, Power = 0.8) | | | | |
| 1 | 78 | 50 | 57 | 90 | 42 | 73 | 51 | 52 | 72 | 43 |
| 2 | 174 | 109 | 104 | 177 | 83 | 87 | 61 | 60 | 85 | 51 |
| 3 | 625 | 422 | 336 | 563 | 318 | 123 | 85 | 79 | 123 | 68 |
| 4 | 249 | 159 | 142 | 244 | 120 | 710 | 469 | 422 | 764 | 389 |
| 5 | 31 | 22 | 29 | 42 | 20 | 45 | 34 | 36 | 45 | 30 |
| 6 | 47 | 31 | 38 | 59 | 26 | 36 | 26 | 29 | 36 | 23 |
| 7 | 1085 | 665 | 527 | 877 | 485 | 399 | 262 | 232 | 408 | 200 |
| 8 | 137 | 84 | 86 | 146 | 68 | 339 | 233 | 197 | 351 | 171 |
| 9 | 13 | 11 | 13 | 15 | 11 | 10 | 8 | 9 | 13 | 8 |
| 10 | 10 | 7 | 8 | 12 | 8 | 8 | 6 | 6 | 11 | 6 |
The above findings are consistent with observations by Yap and Sim (2011), who reported that lognormal alternatives generally require smaller sample sizes to achieve a given power compared to Weibull alternatives. However, while Yap and Sim (2011) reported that achieving a power of 1.0 typically requires sample sizes in the range of 100–2,000 (for different alternative surrogate distributions), the corresponding values obtained in this study are comparatively higher, i.e. approximately 100–5,000. This observed difference is justifiable, as synthetic data sets exhibit a controlled level of variability, enabling deviations from normality to be detected with a minimum number of observations. However, field- or laboratory-derived geotechnical data sets inherently exhibit greater variability and uncertainty, thereby necessitating larger sample sizes to achieve equivalent power levels.
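The way a minimum sample size such as those in Table 6 follows from a simulation-based power analysis can be sketched as below. This is a minimal illustration and not the authors' implementation: it uses the J-B test with its asymptotic chi-square p-value (which is only approximate for small samples) because that needs nothing beyond the Python standard library, and the lognormal parameters, the number of replications and the grid of candidate sample sizes are all assumed values.

```python
import math
import random


def jarque_bera_pvalue(x):
    """Asymptotic J-B p-value (chi-square with 2 degrees of freedom)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return math.exp(-jb / 2.0)  # survival function of chi-square(2)


def estimate_power(sampler, n, n_sim=1000, alpha=0.05, seed=0):
    """Share of simulated samples of size n for which normality is rejected."""
    rng = random.Random(seed)
    rejections = sum(
        jarque_bera_pvalue([sampler(rng) for _ in range(n)]) < alpha
        for _ in range(n_sim)
    )
    return rejections / n_sim


def minimum_sample_size(sampler, sizes, target=0.8, **kwargs):
    """First candidate size whose estimated power reaches the target."""
    for n in sizes:
        if estimate_power(sampler, n, **kwargs) >= target:
            return n
    return None


def lognormal_alternative(rng):
    # Assumed surrogate alternative; the parameters are illustrative only.
    return rng.lognormvariate(0.0, 0.5)


print(minimum_sample_size(lognormal_alternative, range(10, 301, 10), n_sim=500))
```

Replacing `lognormal_alternative` with a sampler fitted to a given data set, and `jarque_bera_pvalue` with another test, gives the corresponding entry for that test and surrogate distribution.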
3.8 Establishing the relationship between the coefficient of variation and minimum required sample size
As outlined in the preceding sections, data sets with higher variability consistently required a smaller sample size to detect deviations from normality. For instance, in the original data (k = 1), data sets 9 and 10 required only 8–28 samples, whereas data sets 3 and 7 required between 94 and 2,685 samples when assessed using the S-W test. Similarly, increasing k to 2 and 3 (inducing skewness) led to greater variability, which further reduced the sample size needed for normality assessment. Given the wide disparity in both CoV and sample sizes, a common logarithmic transformation was applied to both the independent (predictor) and dependent (response) variables. The resulting log-log relationship between the CoV and the required sample size is illustrated in Figure 6: as the CoV increased, the sample size required for detecting non-normality decreased. A strong negative linear correlation was observed for the LL, A-D, J-B, DP and S-W tests, with a correlation coefficient of r = −0.93, indicating a strong linear dependency on CoV. Subsequently, a predictive model was developed to estimate the minimum sample size (n) as a function of CoV. Linear log-log models were fitted for the LL, A-D, J-B, DP and S-W tests [equations (2)–(6)], each of the general form:
log n = b0 + b1 log CoV
Figure 6. Relation between CoV and required sample size, fitted using a log-log linear regression model. Panels: LL, A-D, J-B, DP and S-W; x axis: log CoV; y axis: log N. Each panel shows the data points, the fitted regression line with its equation and R² value, and the upper and lower 95% confidence limits.
Source: Figure by authors
where the regression coefficient (b1) varies from −1.362 to −1.536. Since this model was developed from the given sample, it is necessary to verify whether it accurately represents the population model (μy = β0 + β1X + ε) (Ott and Longnecker, 2015). For this purpose, a hypothesis test was conducted on the population regression coefficient (β1) using the test statistic ts = b1/SEb1, where SEb1 is the standard error of b1. The hypotheses were formulated as: Null Hypothesis (H0): β1 = 0 (mean sample size is not linearly related to CoV); and Alternate Hypothesis (HA): β1 ≠ 0 (mean sample size is linearly related to CoV). For all the regression equations (2)–(6), the calculated ts lies between −25 and −28, with SEb1 approximately equal to 0.06, and was compared against the critical t-value of −2. Since the magnitude of ts far exceeded that of the critical t-value, the null hypothesis was rejected, indicating that β1 ≠ 0. This confirms that variation in the sample size can be reliably explained by CoV.
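The slope test described above can be sketched with an ordinary least-squares fit. The code below is an illustration under stated assumptions: the `log_cov` and `log_n` values are made-up points chosen to resemble a log-log trend, not the study's data, and the function name is hypothetical.

```python
import math


def slope_t_test(x, y):
    """Fit y = b0 + b1*x by least squares; return b0, b1, SE(b1) and t = b1/SE(b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual variance with n - 2 degrees of freedom
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b0, b1, se_b1, b1 / se_b1


# Illustrative (hypothetical) log10 values, NOT the study's data
log_cov = [1.00, 1.15, 1.30, 1.45, 1.60, 1.75, 1.90, 2.05]
noise = [0.05, -0.03, 0.02, -0.04, 0.01, 0.03, -0.02, -0.02]
log_n = [4.0 - 1.4 * x + e for x, e in zip(log_cov, noise)]

b0, b1, se_b1, t_s = slope_t_test(log_cov, log_n)
# H0 (beta1 = 0) is rejected when |t_s| exceeds the critical value (about 2 here)
print(round(b1, 3), round(se_b1, 3), round(t_s, 1), abs(t_s) > 2.0)
```

A large negative `t_s` relative to the critical value is what leads to rejecting H0: β1 = 0.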
Moreover, the direct use of the log-log equations (2)–(6) is not convenient for the practical prediction of sample sizes. Accordingly, the back-transformed predictive equations (7)–(11), which follow from the log-log form as n = 10^b0 × CoV^b1, are suggested for operational application, as they allow direct estimation of the minimum sample size (n) for reliable normality assessment when the CoV is provided in percentage terms.
Since the S-W test exhibited higher power at lower sample sizes and performed well with all types of data sets, it is given primary consideration in the context of sample size optimisation with CoV. Table 7 presents recommended sample sizes for different CoV levels, providing practical guidance for data collection based on the variability of the data. For CoV values ranging from 10% to 100%, the mean fit approximation indicates that 665–25 samples are adequate to draw conclusions on normality using the S-W test. However, it is important to note that these values reflect the general trend that higher CoV often coincides with greater skewness. Data sets with similar CoV can still differ in skewness; in such cases, the same estimated sample size acts as a 95% lower confidence bound when skewness is higher, and as a 95% upper confidence bound when skewness is lower, as presented in Table 7. Furthermore, for the J-B, A-D, LL and DP tests, the required sample sizes for 10% ≤ CoV ≤ 100% were estimated to be 750–30, 920–30, 1,400–40 and 1,250–50, respectively.
Table 7. Recommended GoF tests and corresponding minimum sample size requirements (Power ≥ 0.8, α = 0.05) based on data set CoV
| Coefficient of variation | Mean estimate | Lower CB (95%) | Upper CB (95%) | GoF test |
|---|---|---|---|---|
| < 10% | >665 | >334 | >1323 | S-W |
| 10% ≤ CoV < 30% | 665–145 | 334–64 | 1323–322 | S-W |
| 30% ≤ CoV < 50% | 145–70 | 64–30 | 322–167 | S-W |
| 50% ≤ CoV < 100% | 70–25 | 30–10 | 167–68 | S-W |
| ≥ 100% | ≤25 | ≤10 | ≤68 | S-W |
| 10% ≤ CoV ≤ 100% | 750–30 | 394–14 | 1418–78 | J-B |
| 10% ≤ CoV ≤ 100% | 920–30 | 456–12 | 1865–81 | A-D |
| 10% ≤ CoV ≤ 100% | 1400–40 | 715–16 | 2817–106 | LL |
| 10% ≤ CoV ≤ 100% | 1250–50 | 634–19 | 2466–121 | DP |
CB = confidence bound
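To illustrate how a back-transformed log-log model turns a CoV (in percent) into a sample size, the sketch below recovers approximate S-W coefficients from the two end points of the mean-estimate column (CoV = 10% with n = 665, and CoV = 100% with n = 25). These two-point coefficients are an assumption made for illustration and are not the authors' fitted values; the function names are hypothetical.

```python
import math


def two_point_loglog(cov_a, n_a, cov_b, n_b):
    """Solve log10(n) = b0 + b1*log10(CoV) through two (CoV, n) points."""
    b1 = (math.log10(n_b) - math.log10(n_a)) / (math.log10(cov_b) - math.log10(cov_a))
    b0 = math.log10(n_a) - b1 * math.log10(cov_a)
    return b0, b1


def predict_sample_size(cov_percent, b0, b1):
    """Back-transformed model: n = 10**b0 * CoV**b1, with CoV in percent."""
    return 10.0 ** b0 * cov_percent ** b1


# End points of the S-W mean-estimate column (illustrative approximation only)
b0, b1 = two_point_loglog(10.0, 665.0, 100.0, 25.0)
print("slope b1 =", round(b1, 3))
for cov in (10, 30, 50, 100):
    print(cov, round(predict_sample_size(cov, b0, b1)))
```

The slope obtained this way falls inside the reported range of b1 (−1.362 to −1.536), and the intermediate predictions at CoV = 30% and 50% land within about 5% of the tabulated 145 and 70, which is consistent with the tabulated values being rounded.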
4. Recommendations on the use of a specific GoF test for normality assessment
In statistical analysis, sample size plays a crucial role in obtaining reliable point estimates of the parameters of interest, which should satisfy key statistical properties such as unbiasedness, consistency, efficiency and sufficiency (Haldar and Mahadevan, 2000; Ramachandran and Tsokos, 2009). In particular, sufficiency refers to the ability of the samples to capture all available information about the population parameter (e.g. shear strength, permeability, CBR, Atterberg limits). The present study found that the S-W test performed well when the sample size ranged from 665 to 145 and the sample CoV was between 10% and 30%. However, when normality was assessed with a sample size of 20 or 30, the test still indicated normality, even for data that exhibited non-normality. Therefore, acceptance of normality at such a small sample size should not be taken as evidence that the sample adequately represents the population.
In geotechnical quality control, standard statistical tools include confidence interval estimation, control charts (e.g. range charts, standard deviation charts and average charts), and hypothesis testing. When quality control measures focus on the mean (point estimate), decisions heavily rely on the central tendency of the data. As such, data sets with 35 samples and 20 samples may yield different statistical conclusions. In cases where the mean is a critical parameter, the LL test can be used, particularly for assessing normality, as it evaluates the overall agreement between the empirical distribution and the theoretical distribution, with sensitivity to deviations in the central portion of the distribution.
Conversely, in reliability-based analysis of geotechnical structures, where the probability of failure is often governed by the most critical values in the tail region, the A-D test is generally more appropriate. Its increased sensitivity to deviations in the tails enhances the accuracy of assessing whether critical geotechnical data conform to assumed probability models. In the present study, for example, the nonlinear power transformation introduced skewness primarily in the tail region, which the A-D test detected effectively for the skewed data (k = 2 and k = 3; Table 4), whereas the LL and CVM tests were the least sensitive to these deviations. For the chi-square test, the literature offers mixed conclusions regarding its use for assessing univariate distributions. According to Thode (2002), the chi-square test is not recommended because it is challenging to implement compared to other tests. Moreover, this test may not be attractive to beginners, as there is a high chance of incorrectly rejecting the normality assumption due to computational procedures. In contrast, the DP (sample size = 50–1,250) and J-B (sample size = 30–750) tests are simpler to apply and may serve effectively as complementary tests.
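The different emphasis of the LL and A-D tests can be seen in how their statistics are built: the LL statistic is the largest gap between the empirical and fitted normal distribution functions, whereas the A-D statistic sums log-terms that grow quickly when observations sit far out in either tail. The sketch below computes both statistics only (no critical values or p-values); the function name and the example sample with a single extreme value are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def normality_statistics(sample):
    """Return (Lilliefors D, Anderson-Darling A^2) with mean and SD estimated from the sample."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    std_normal = NormalDist()
    # Fitted normal CDF evaluated at the ordered, standardised observations
    f = [std_normal.cdf((v - m) / s) for v in sorted(sample)]
    # Largest distance between the empirical step function and the fitted CDF
    d = max(max((i + 1) / n - fi, fi - i / n) for i, fi in enumerate(f))
    # Log-weighted sum; the log terms emphasise the lower and upper tails
    a2 = -n - sum(
        (2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
        for i in range(n)
    ) / n
    return d, a2


n = 50
# Idealised normal sample built from evenly spaced normal quantiles
clean = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
# Same sample with its largest value replaced by an extreme observation
tail = clean[:-1] + [6.0]

print(normality_statistics(clean))
print(normality_statistics(tail))
```

Comparing the two printed pairs shows how each statistic responds when only the upper tail is disturbed; turning the statistics into test decisions would additionally require the appropriate critical values for each test.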
5. Summary and conclusions
The performance of selected GoF tests in assessing the normality of experimentally derived geotechnical data, including percentage of fines, liquid limit, plastic limit, dry density, moisture content, CBR, unconfined compressive strength and permeability, was evaluated using power analysis. Based on statistical estimates and accepted marginal distributions, a Monte Carlo simulation was used to generate synthetic random samples for assessing normality and estimating the power of each selected GoF test. The analysis provides insights into the sensitivity and reliability of these tests when applied to geotechnical data sets with diverse statistical characteristics. Based on the study, the following conclusions are drawn:
The initial findings from the power analysis indicate that when the available sample size is smaller than the minimum required, the GoF tests may still accept normality (e.g. data set 1) due to a lack of statistical power, thereby increasing the risk of a Type II error, i.e. falsely accepting normality. Conversely, rejecting normality under the same small-sample conditions (e.g. data set 5) remains meaningful, as the conclusion reflects a detectable deviation from normality. Since the Type I error is controlled at α = 0.05, rejection under reduced power provides reliable evidence against the null hypothesis of normality.
For the original data sets, the selected GoF tests achieved a power value of 1.0 at sample sizes ranging from 25 to over 2,000. This considerable variation in the sample size is mainly attributable to the use of different surrogate distributions (LN, WB, GA and GEV) in the alternative hypothesis during the power analysis. Additional analyses indicated that the power values were not only governed by sample size but also by the effect size. Among all the selected GoF tests, the S-W, A-D and J-B tests demonstrated the highest predictive capacity to detect non-normality and required comparatively smaller sample sizes. Notably, the minimum sample sizes derived for field-based geotechnical data sets were considerably larger than those typically reported for synthetic data sets, reflecting the higher inherent variability and uncertainty associated with real-world soil data.
The minimum sample size required to achieve the desired statistical power value of 0.8 at a chosen significance level of 0.05 was determined for each GoF test. Subsequent analyses were conducted to examine the relationship between the optimum sample size and the data's CoV. A clear negative correlation (r = −0.93) was observed, indicating that higher variability reduces the sample size required to detect deviations from normality. Using the fitted linear log-log regression model, the estimated sample sizes required to detect non-normality reliably were approximately 665–25 for S-W, 750–30 for J-B, 920–30 for A-D, 1,400–40 for LL and 1,250–50 for DP, applicable within the CoV range of 10%–100%.
Furthermore, the J-B and DP tests, which are moment-based methods not frequently used in geotechnical data analysis, also exhibited strong performance, followed by the S-W and A-D tests. The LL test exhibited moderate sensitivity and required a slightly higher sample size than the moment-based methods. The CSQ and CVM tests were found to be more effective for larger sample sizes, further reinforcing the importance of sample size considerations in statistical assessments of geotechnical properties.
Notably, the GoF tests used for normality assessment require only key statistical parameters such as the mean, skewness, kurtosis and CoV. As such, the performance and reliability of these tests are influenced primarily by the sample size and the statistical characteristics of the geotechnical data. The findings of this study are therefore applicable not only to the data sets analysed but also to other, non-geotechnical data sets with CoV ranging from 10% to 100%.
Acknowledgements
The authors thank the anonymous reviewers for their valuable comments and suggestions.

