This study aims to investigate how input representation and training-set size influence the performance of multilayer perceptrons (MLP) for predicting axial pile capacity. Empirical procedures often show wide deviations from full-scale load tests, motivating data-driven approaches that better capture soil and geometry variability.
A database of 546 axial load tests is used to evaluate 12 feature representations ranging from geometry-only inputs to averaged soil descriptors and layered soil profiles sampled along the pile shaft. For each feature set, an MLP is tuned using 500 Optuna trials under three training sizes: one-third, two-thirds and the full data set.
Layered inputs generally outperform geometry alone, but averaged soil properties often perform similarly. The combination of geometry with layered soil type, total unit weight and SPT-N provides the most consistent results. Training size is the dominant factor, with large gains from one-third to two-thirds of the data and smaller gains thereafter.
Large geotechnical data sets can support reliable ML models even when many soil parameters are missing or inferred. Averaged soil information performs nearly as well as layered profiles, reducing the need for complete stratigraphic data. Practitioners can therefore use broad but imperfect databases to develop effective prediction tools.
This study provides a comprehensive assessment of feature design and data sufficiency for neural-network-based pile capacity prediction. It demonstrates that while representation matters, training-set size governs most achievable accuracy and clarifies when layered inputs justify added complexity.
Introduction
Estimating axial pile capacity remains a persistent challenge in practice. Many large structures depend on piles, and the consequences of misestimation are significant. Rising construction costs and the need for specialized equipment amplify these consequences. Design workflows often require site-specific validation through static load tests on preliminary or sacrificial piles. This requirement, which is uncommon for other structural elements, reflects the uncertainty that surrounds pile capacity prediction and the limited confidence placed in current design procedures.
Practice continues to rely on empirical and semi-empirical procedures because they are widely accepted, yet comparisons with load tests show wide discrepancies that in extreme cases approach an order of magnitude (Briaud and Tucker, 1988; Machairas et al., 2018; Phoon and Tang, 2019; Rizk et al., 2022; Ozturk et al., 2023a). Part of the problem is the narrow calibration base and the limited ability of these methods to account for soil property changes and stress redistribution after driving. For example, the US Federal Highway Administration (FHWA) design guidance is based on Tomlinson’s (1980) Alpha method, which synthesized 56 driven piles in clay, and Nordlund’s (1963) sand method, which drew on about 41 tests from 8 sites. Similarly, the Lambda approach (Focht and Vijayvergiya, 1972) used 47 pipe pile tests in clay to develop its guidance. Because these procedures lump depth-dependent soil behavior, installation effects and other uncertainties into simplified rules, designers often apply generous resistance or safety factors, which raises costs and can conceal systematic bias. To address these limitations, this study investigates whether expanding the calibration base and refining data representation through machine learning improves predictive consistency.
The search for better geotechnical predictors has shifted much recent work toward machine learning (ML) (Li and Iskander, 2022; Kodsy et al., 2023a; Zhang et al., 2021, 2025; Wang et al., 2025). ML can model nonlinear relations among many inputs and can represent uncertainty more flexibly than rule-based procedures. Prior studies have tested base and ensemble ML learners for axial capacity and have reported significant gains relative to empirical and semi-empirical formulas, although sizeable errors remain and performance varies across data sets (Ozturk et al., 2023b, 2024a, 2024b, 2024c). These results suggest that data-driven models can use geometry and soil information without imposing strict mechanistic assumptions, but they also show that model design and data pre-processing control most of the outcome.
Multilayer perceptrons (MLP) (Hornik et al., 1989) are a class of artificial neural networks (ANN) commonly used for regression problems, of which pile capacity prediction is a candidate. Indeed, MLPs have been applied to this problem for nearly three decades (Lee and Lee, 1996; Pal and Deswal, 2008; Momeni et al., 2014; Moayedi and Armaghani, 2017; Ozturk and Iskander, 2025). However, most prior applications were limited by reliance on a single project database, small sample size, or suboptimal model training. Training on fewer than 100 piles or on piles drawn from a single site can mask intrinsic geologic uncertainty and yield deceptively excellent performance. Such settings suppress the variability in soil conditions and installation practice that dominates uncertainty in the field. Reported metrics then reflect internal consistency within one fabric and driving system rather than general performance across sites. Conventional cross-validation inside a single project can further inflate accuracy if stratigraphy or test setups are the same across training and validation folds. As a result, the external validity of the developed ML models is unclear and transfer to new geology and driving conditions is uncertain.
Data curation and label definition add another layer of difficulty. Capacity is often interpreted by different criteria across sources. For example, Davisson (1972) and other offset capacity methods do not account for the dependence of capacity on displacement, which introduces label noise into supervised learning. Soil indices and in situ measurements may also come from different test protocols and depths. Simple averaging along the embedded length removes the position information that controls the mobilization of shaft and tip resistance. Layer-aware encodings that preserve the profile can carry more information about where strength and density reside, while average encodings are easier to assemble but risk blurring tip effects and stiff interlayers.
In this study, the effect of data representation and sample size on ML-predicted capacity is explored using a database of 546 load tests in a variety of soils. First, 12 feature sets were assembled to test how input design influences MLP performance (Table 1). The first set uses only geometric pile properties because prior studies showed that capacity can be interpreted based on pile geometry alone (Ozturk et al., 2023b). The second set compresses the soil variables into averages over the embedded length and serves as the baseline. The remaining sets preserve depth information by sampling each soil variable at five fixed fractions of the embedded length and include the properties at the tip as a sixth layer. Within this stratified family, individual variables and small groups are paired with geometry to quantify the performance gain from the combinations of variables shown in Table 1. Each of the 12 configurations is trained with the same workflow and tuned with Optuna (Akiba et al., 2019) using 500 trials so that differences in accuracy reflect feature content rather than search effort. This design tests whether layered profiles carry more information than averages and which features contribute most when used alone with geometry or in combination.
Table 1. Input ablation configurations (✓ = included; … = not included; Avg. = averaged over the entire embedded depth)
| Configuration number | Geometry | Soil type | Total unit weight | N value | su | ϕ | qc | Rationale |
|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | … | … | … | … | … | … | Baseline without soil |
| 2 | ✓ | Avg. | Avg. | Avg. | Avg. | Avg. | Avg. | Baseline |
| 3 | ✓ | ✓ | … | … | … | … | … | Soil type |
| 4 | ✓ | … | ✓ | … | … | … | … | Incorporates one feature at a time, as indicated |
| 5 | ✓ | … | … | ✓ | … | … | … | Incorporates one feature at a time, as indicated |
| 6 | ✓ | … | … | … | ✓ | … | … | Incorporates one feature at a time, as indicated |
| 7 | ✓ | … | … | … | … | ✓ | … | Incorporates one feature at a time, as indicated |
| 8 | ✓ | … | … | … | … | … | ✓ | Incorporates one feature at a time, as indicated |
| 9 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Complete set |
| 10 | ✓ | ✓ | … | … | ✓ | ✓ | … | Strength indicators + soil type |
| 11 | ✓ | … | ✓ | … | ✓ | ✓ | … | Similar to 10, but replacing categorical soil type by total unit weight (TUW) |
| 12 | ✓ | ✓ | ✓ | ✓ | … | … | … | Least reliance on correlations |
The second part of the analysis examines the effect of training size. A fixed test set comprising 164 load tests is reserved for evaluation. The remaining data are used at three levels: one third, two thirds, and the full training portion. This mini ablation study measures how performance scales with the amount of training data and whether the advantage of stratified inputs changes with sample size. Crossing the 12 configuration sets with the three training sizes yields 36 models, which provides a consistent basis to separate gains due to representation from gains due to additional data.
The remainder of the paper is organized as follows. The Data set section describes the 546-load-test database, its sources, pile and soil descriptors and the preprocessing steps used to harmonize and infer missing parameters. The Background section summarizes prior comparisons between traditional methods and machine learning models on this database and explains the use of a tuned MLP as the primary model under study. The Methodology section details the feature ablation design, layered versus averaged soil encodings, training–testing splits, Optuna-based hyperparameter optimization, and evaluation metrics. The Results section presents performance trends across feature sets and training sizes, highlighting the relative influence of representation and data volume. The Discussion interprets these trends in the context of geotechnical practice, overfitting and data sufficiency. The Limitations section outlines constraints related to data set size, inferred soil parameters, discretized layering and bounded hyperparameter search. The Practical Implications section translates findings into guidance for database expansion, feature selection and ML deployment in practice. Finally, the Conclusions synthesize the main findings regarding layered encoding, sample size effects and priorities for future data-driven pile capacity modeling.
Data set
To support models that generalize across soil conditions, this study uses a large database of 546 static axial load tests compiled from multiple sources. The set includes 145 steel pipe piles, 165 H-piles, 193 precast concrete piles and 43 timber piles. Diameters range from 10 to 40 in. (25–100 cm) and lengths from 8.5 to 315 ft (2.5–100 m). The distribution of diameter and length by pile type is shown in Figure 1. The database was assembled from Prof. Olson’s collections (Dennis and Olson, 1983a, 1983b; Olson and Iskander, 2014), the FHWA Deep Foundations Load Test Database v2 (Petek et al., 2016) and the Iowa DOT repository (Roling et al., 2011). The combined sources span a wide range of geologic settings and pile types, so variability in soil conditions is better represented than in most prior studies, rather than averaged out.
Figure 1. Distribution of piles used in the analysis. Scatter plot of pile diameter (in., with a secondary axis in m) versus length (ft, with a secondary axis in m) by pile type (steel pipe, 145; H-pile, 165; concrete, 193; timber, 43), with marginal histograms of diameter and length. Most piles cluster between 10 and 25 in. in diameter and 20 and 100 ft in length.
Davisson capacities from the source records were adopted as ground truth, with the understanding that the offset criterion introduces its own interpretation errors (Kodsy et al., 2022a, 2022b). The data set contained pile descriptors (diameter, length, area, circumference, modulus, pile type and pile material) and soil descriptors (soil type, total unit weight, SPT N-values, undrained shear strength (su), friction angle (Φ) and cone tip resistance (qc)). Soil variables were represented in two ways: as an average over the embedded length, and in layered form that records the value at the top, at 20, 40, 60 and 80% of the embedded length, and at the tip.
Corrected SPT N-values (Ncorr), total unit weight, and soil type were available for most strata. When Ncorr needed to be computed or checked, the procedure was as follows. Reported SPT blow counts were treated as N60 when energy details were not available. Overburden effects were then applied using Peck et al. (1974) to obtain the correction factor Cn, with the standard cap, and Ncorr = Cn·N60. Most records already contained N and total unit weight. If Ncorr was missing but qu or su was present, Ncorr was inferred from NAVFAC DM 7.01 (NAVFAC, 1986) correlations between SPT resistance and undrained strength. The reverse inference was used to fill missing strength when Ncorr existed. From Ncorr, friction angle for cohesionless soils, su for cohesive soils, and total unit weight were estimated using NAVFAC and Bowles (1988) relationships. Finally, qc was generated by assigning a Robertson (2016) soil behavior zone from the layer description and applying the zone-specific qc/N ratio for that soil type.
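The overburden correction step can be illustrated with a minimal sketch. It assumes the commonly quoted form of the Peck et al. (1974) factor, Cn = 0.77 log10(20/σ'v) with the effective vertical stress σ'v in tsf (generally recommended for σ'v of about 0.25 tsf or more). The cap of 2.0 and the function names are assumptions made for illustration; the source states only that a "standard cap" was applied.

```python
import math


def overburden_factor(sigma_v_eff_tsf, cap=2.0):
    """Overburden correction factor Cn (Peck et al., 1974 form).

    sigma_v_eff_tsf: effective vertical stress at the test depth, in tsf.
    cap: upper limit on Cn (assumed value; not given in the source).
    """
    if sigma_v_eff_tsf <= 0.0:
        raise ValueError("effective vertical stress must be positive")
    cn = 0.77 * math.log10(20.0 / sigma_v_eff_tsf)
    return min(cn, cap)


def corrected_n(n60, sigma_v_eff_tsf, cap=2.0):
    """Ncorr = Cn * N60, with reported blow counts treated as N60."""
    return overburden_factor(sigma_v_eff_tsf, cap) * n60
```

At an effective stress of 1 tsf the factor is close to unity, so the blow count is essentially unchanged, while very shallow readings are limited by the cap.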
Background
Ozturk and Iskander (2025) used the same 546-pile database to compare the performance of traditional procedures, ML models and a tuned MLP. The traditional models comprised (1) FHWA (Hannigan et al., 2016), (2) US Army Corps of Engineers (USACE) (USACE, 1991), (3) American Petroleum Institute (API) (API, 1993) and (4) Revised Lambda (Kraft et al., 1981). These four were selected because they are available in APILE (Wang et al., 2019), which provides a consistent implementation, and because they represent common practice for axial capacity.
The ML set comprised support vector regression as the base learner and XGBoost as the ensemble benchmark; these were selected because prior work on this database identified them as the strongest baselines in their classes. A multilayer perceptron was included to test whether a carefully tuned neural network could surpass the tree-based ensemble on the same inputs and labels.
An MLP is a fully connected neural network typically used for regression or classification. Inputs pass through one or more hidden layers where linear transformations are followed by nonlinear activations. With sufficient width (i.e. number of nodes in a layer) and appropriate activation functions, MLPs can approximate complex input-output maps and can represent interactions among geometry, soil descriptors, and profile features that are difficult to encode in closed form (Hornik et al., 1989). Training adjusts weights and biases by backpropagation with an optimizer such as SGD (Stochastic Gradient Descent) or Adam (Adaptive Moment Estimation). Regularization and normalization are also used to control overfitting.
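The forward pass described above, linear transformations followed by nonlinear activations and a linear output for regression, can be sketched in a few lines of plain Python. The ReLU activation, the layer layout and all function names are illustrative assumptions, not the tuned architecture from the study.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Linear layer: one output per weight row, y_j = sum_i(w_ji * x_i) + b_j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Forward pass of a regression MLP.

    layers: list of (weights, biases) tuples. Hidden layers use ReLU;
    the final layer is linear and is assumed to have a single output.
    """
    hidden = inputs
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    out_weights, out_biases = layers[-1]
    return dense(hidden, out_weights, out_biases)[0]
```

Training would adjust the entries of `layers` by backpropagation; that step is omitted here and would normally be delegated to a deep learning library.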
Optuna is a hyperparameter optimization framework that automates the search over architecture and training settings. It samples candidates from a defined hyperparameter space, evaluates them on a validation fold, and steers subsequent trials toward promising regions while pruning weak runs. Optuna was selected because its adaptive sampling and early pruning mechanism efficiently navigates a high-dimensional search space while controlling computational cost and maintaining reproducibility across experiments. In this context, Optuna selects the number of layers and neurons, activation functions, normalization, dropout, regularization strengths, optimizer family, learning rate, batch size and loss. Using a fixed optimization framework per configuration standardizes effort across feature sets so performance differences can be attributed to representation and data rather than tuning.
The cross-model comparison on the same 546-pile database is summarized in Figure 2. Among traditional methods, API produced the strongest baseline for this data set and was used as the reference. Support vector regression served as the base ML learner and XGBoost (Chen and Guestrin, 2016) as the ensemble benchmark, the latter because it is widely regarded as the gold standard for tabular data (Shwartz-Ziv and Armon, 2022). An extensively tuned MLP was also trained because XGBoost, while strong on tabular data, is prone to overfitting without tight control of depth, learning rate and tree count. Its tree ensembles partition the feature space and return leaf averages, which limits extrapolation and yields piecewise constant responses outside the support of the training data. Trees are largely insensitive to feature scaling and can down-weight or ignore weak inputs through split selection and regularization. That robustness can mask the effect of representation. The goal here is to test whether layered soil encodings add information beyond averages. An MLP is a smooth, differentiable function approximator that is sensitive to representation and can exploit stratified continuous inputs, making it a better probe of feature design. XGBoost remains the benchmark for comparison, but the MLP is the model under study.
Figure 2. MLP model compared to baseline performance models (Ozturk and Iskander, 2025). Four panels (API, SVR, XGBoost and MLP) plot calculated capacity against interpreted capacity in kips (secondary axes in MN) for train and test data, with a diagonal reference line and dashed bounds. API shows the widest spread and XGBoost the tightest alignment with the reference line. Each panel includes a table with R2, error measures, poor-prediction count and sample counts.
Methodology
The database was randomly split into 70% training and 30% testing portions. All modeling choices, tuning, and ablation studies were performed inside the training portion only. To measure how performance scales with the amount of training data, models in the ablation study were trained on a subset of the training portion, while testing always used the entire testing portion. Predictions and metrics are reported for three groups per run: train, test and unused. “Unused” denotes the part of the original training set that was intentionally left out when a reduced training size was used, then predicted after the model was fitted.
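The split logic can be sketched as follows. This is a minimal illustration under stated assumptions: the reduced training subsets are taken as nested prefixes of one shuffled training pool, and the seed, rounding rule and function name are hypothetical. The source does not state how the subsets were drawn.

```python
import random


def make_splits(n_total, test_fraction=0.30, seed=0):
    """Return index lists for a fixed test set and three training sizes.

    The test set is drawn once and reused for every training size.
    Records in the training pool that are not used for fitting form
    the "unused" group.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)

    n_test = round(n_total * test_fraction)
    test, pool = indices[:n_test], indices[n_test:]

    splits = {}
    for label, fraction in {"1/3": 1 / 3, "2/3": 2 / 3, "full": 1.0}.items():
        n_train = round(len(pool) * fraction)
        splits[label] = {
            "train": pool[:n_train],
            "unused": pool[n_train:],
            "test": test,
        }
    return splits


splits = make_splits(546)
```

With 546 records, this rounding gives a test set of 164 piles, which matches the fixed test set size reported in the paper, and a training pool of 382 piles.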
Soil descriptions from different sources were harmonized into three generalized classes (clay, sand, rock) to reduce labeling noise and improve model stability. All categorical variables were numerically encoded, with pile type and pile material one-hot encoded and soil classes mapped to integer labels. Redundant geometric variables related to layer depths and repeated embedment lengths were removed to prevent information leakage and multicollinearity.
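A minimal sketch of the encoding step is given below, assuming four pile-type categories taken from the database description and an arbitrary integer code for each soil class. The category labels, the specific integer codes and the function names are illustrative assumptions.

```python
# Category labels are assumptions based on the pile types listed for the database.
PILE_TYPES = ["steel pipe", "H-pile", "precast concrete", "timber"]

# Integer labels for the three harmonized soil classes (codes are assumed).
SOIL_CLASS_CODES = {"clay": 0, "sand": 1, "rock": 2}


def one_hot(value, categories):
    """Return a 0/1 indicator vector with a single 1 at the matching category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def encode_record(pile_type, layer_soil_classes):
    """One-hot encode the pile type and integer-encode each layer's soil class."""
    soil_codes = [SOIL_CLASS_CODES[name] for name in layer_soil_classes]
    return one_hot(pile_type, PILE_TYPES) + soil_codes
```

One-hot columns avoid implying an order among pile types, whereas integer labels for soil class keep the layered input compact at the cost of imposing an arbitrary ordering on the classes.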
Inputs were built from two ingredients:
core variables that encode pile geometry and type but no soil information; and
soil descriptors represented either as a single average over the embedded length or as layered values sampled at top, 20, 40, 60 and 80% of the embedded length, and at the tip.
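The two soil encodings can be contrasted with a short sketch. It assumes that a profile is stored as a list of (top depth, bottom depth, value) layers, that the tip corresponds to the full embedded length, and that the average is thickness-weighted. These choices and all names are assumptions for illustration; the source does not specify the averaging rule.

```python
# Sampling positions as fractions of embedded length: top, 20-80 %, and tip.
FRACTIONS = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


def value_at(profile, depth):
    """Return the soil value of the layer containing `depth`."""
    for top, bottom, value in profile:
        if top <= depth < bottom:
            return value
    return profile[-1][2]  # at or below the last boundary: use deepest layer


def layered_encoding(profile, embedded_length):
    """Six values sampled along the pile, preserving position information."""
    return [value_at(profile, f * embedded_length) for f in FRACTIONS]


def average_encoding(profile, embedded_length):
    """Single thickness-weighted average over the embedded length."""
    total = 0.0
    for top, bottom, value in profile:
        thickness = max(0.0, min(bottom, embedded_length) - top)
        total += thickness * value
    return total / embedded_length


# Hypothetical SPT N profile (depths in ft) for a pile embedded 40 ft.
profile = [(0.0, 10.0, 5.0), (10.0, 30.0, 20.0), (30.0, 50.0, 40.0)]
layered = layered_encoding(profile, 40.0)
average = average_encoding(profile, 40.0)
```

In this hypothetical profile the layered vector shows that the strongest material sits at the tip, whereas the single average cannot distinguish that case from a profile with the strong layer near the surface.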
Layered profiles preserve stratigraphy and tip-zone context. Averages were included once as a direct analogue to the previous study for benchmarking under the present tuning configuration. Twelve feature sets were defined as shown in Table 1, as follows:
1 Core only: to measure the value of geometry without soil.
2 Core + Averages: to mirror the previous study’s representation under a smaller tuning budget.
3-8 Core + one layered input at a time: Soil type, total unit weight, N, su, Φ and qc.
9 Core + all layered families to test the upper bound from full stratified information.
10 Core + Soil Type + su + Φ for each layer to provide strength indicators for both cohesive and cohesionless soils.
11 Core + Total Unit Weight + su + Φ for each layer to replace the categorical soil variable with total unit weight as a soil indicator while retaining strength measures.
12 Core + Soil Type + Total Unit Weight + N for each layer to emphasize the three most widely available field descriptors in the database before filling the missing parameters.
Each of the 12 sets was trained three times, using 1/3, 2/3 and the full training portion. This yields 36 MLP configurations that separate gains due to representation from gains due to additional data.
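The experimental grid can be written down compactly. The sketch below encodes the 12 feature sets from Table 1 and crosses them with the three training sizes. The dictionary layout and variable names are hypothetical, and set 2 is assumed to average all six soil variables.

```python
from itertools import product

SOIL_VARIABLES = ["soil_type", "total_unit_weight", "N", "su", "phi", "qc"]
N_LAYERS = 6  # top, 20, 40, 60, 80 % of embedded length, and tip

# Feature sets 1-12; core (geometry and pile type) inputs are always included.
FEATURE_SETS = {
    1: ("none", []),
    2: ("average", SOIL_VARIABLES),
    3: ("layered", ["soil_type"]),
    4: ("layered", ["total_unit_weight"]),
    5: ("layered", ["N"]),
    6: ("layered", ["su"]),
    7: ("layered", ["phi"]),
    8: ("layered", ["qc"]),
    9: ("layered", SOIL_VARIABLES),
    10: ("layered", ["soil_type", "su", "phi"]),
    11: ("layered", ["total_unit_weight", "su", "phi"]),
    12: ("layered", ["soil_type", "total_unit_weight", "N"]),
}
TRAIN_SIZES = ["1/3", "2/3", "full"]

# Every (feature set, training size) pair is one tuned MLP configuration.
RUNS = list(product(FEATURE_SETS, TRAIN_SIZES))


def n_soil_columns(set_number):
    """Number of soil input columns implied by a feature set."""
    encoding, variables = FEATURE_SETS[set_number]
    if encoding == "layered":
        return N_LAYERS * len(variables)
    return len(variables)  # one column per averaged variable, none for core only
```

Under these assumptions the complete layered set (9) carries six times as many soil columns as the averaged baseline (2), which is one reason it may need more training data to pay off.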
For every configuration an MLP was tuned with Optuna for 500 trials. The first 100 trials were random to explore the space broadly. The remaining 400 used Optuna’s model-based sampler with pruning to focus computation on promising regions. The search space covered depth, width, activation, normalization, dropout, L1 and L2 penalties, optimizer family, learning rate, batch size and number of training epochs. Trials were optimized by R2, because earlier work on this database showed that optimizing R2 led to better behavior than optimizing by mean absolute percentage error (MAPE) (Ozturk and Iskander, 2025). Final models were evaluated on the fixed test and the unused groups.
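A tuning loop of this kind might be set up as sketched below. Several assumptions apply: the model-based sampler is taken to be Optuna's TPE sampler with 100 random start-up trials, the pruner is taken to be the median pruner, and every range in the search space is a placeholder rather than the study's actual bounds. `train_and_score` is a hypothetical user-supplied callable that trains an MLP and returns the validation R2.

```python
def suggest_hyperparameters(trial):
    """Sample one candidate from an illustrative search space (ranges assumed)."""
    n_layers = trial.suggest_int("n_layers", 1, 5)
    return {
        "n_layers": n_layers,
        "widths": [
            trial.suggest_int(f"width_{i}", 8, 256, log=True)
            for i in range(n_layers)
        ],
        "activation": trial.suggest_categorical("activation", ["relu", "tanh", "elu"]),
        "normalization": trial.suggest_categorical("normalization", ["none", "batch"]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "l1": trial.suggest_float("l1", 1e-8, 1e-2, log=True),
        "l2": trial.suggest_float("l2", 1e-8, 1e-2, log=True),
        "optimizer": trial.suggest_categorical("optimizer", ["adam", "sgd"]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64, 128]),
        "epochs": trial.suggest_int("epochs", 50, 500),
    }


def run_study(train_and_score, n_trials=500, n_random=100, seed=0):
    """Maximize validation R2: random start-up trials, then guided sampling."""
    import optuna  # imported here so the search space can be inspected without it

    sampler = optuna.samplers.TPESampler(n_startup_trials=n_random, seed=seed)
    study = optuna.create_study(
        direction="maximize",
        sampler=sampler,
        pruner=optuna.pruners.MedianPruner(),
    )

    def objective(trial):
        # train_and_score may call trial.report() and trial.should_prune()
        # during training so that weak runs are stopped early.
        return train_and_score(suggest_hyperparameters(trial), trial)

    study.optimize(objective, n_trials=n_trials)
    return study
```

Calling `run_study` once per configuration with the same `n_trials` keeps the search effort identical across feature sets, which is the property the comparison relies on.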
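To compare performance, the MAPE, given in equation (1), was adopted along with the coefficient of determination, R2, given in equation (2). MAPE measures the relative magnitude of error, while R2 measures how well the predictions track the overall trend of the data:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{Q_{m,i}-Q_{c,i}}{Q_{m,i}}\right| \tag{1}$$

$$R^2 = 1-\frac{\sum_{i=1}^{n}\left(Q_{m,i}-Q_{c,i}\right)^2}{\sum_{i=1}^{n}\left(Q_{m,i}-\bar{Q}_m\right)^2} \tag{2}$$

where Qc is the calculated or predicted capacity, Qm is the measured or interpreted capacity, Q̄m is the mean of the measured capacities and n is the number of load tests. In addition, the mean and standard deviation of Qc/Qm are tracked to assess bias and precision on a scale-free basis. Finally, poor predictions were recorded, defined a priori as cases with Qc/Qm < 0.5 or Qc/Qm > 2.0 (predictions less than one half or greater than twice the interpreted capacity).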
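The evaluation metrics described above can be computed with a short sketch such as the following. It assumes the standard definitions of MAPE and the coefficient of determination and uses the sample standard deviation for Qc/Qm; the function names are illustrative.

```python
from statistics import mean, stdev


def mape(q_calc, q_meas):
    """Mean absolute percentage error relative to the measured capacity."""
    return 100.0 * mean(abs(m - c) / m for c, m in zip(q_calc, q_meas))


def r_squared(q_calc, q_meas):
    """Coefficient of determination; can be negative for poor models."""
    q_bar = mean(q_meas)
    ss_res = sum((m - c) ** 2 for c, m in zip(q_calc, q_meas))
    ss_tot = sum((m - q_bar) ** 2 for m in q_meas)
    return 1.0 - ss_res / ss_tot


def ratio_stats(q_calc, q_meas):
    """Mean (bias) and sample standard deviation (precision) of Qc/Qm."""
    ratios = [c / m for c, m in zip(q_calc, q_meas)]
    return mean(ratios), stdev(ratios)


def poor_count(q_calc, q_meas, low=0.5, high=2.0):
    """Count predictions with Qc/Qm below `low` or above `high`."""
    return sum(1 for c, m in zip(q_calc, q_meas) if not low <= c / m <= high)
```

A mean Qc/Qm above one indicates overprediction on average, and a single large overprediction can drive R2 negative even when most points are close, which is why the ratio statistics and the poor-prediction count are reported alongside it.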
Results
The results are organized to answer two questions: whether layered soil representations improve MLP performance and how training size affects that performance. First, the 500 Optuna trials for every feature set and training size are summarized in Figure 3, where higher R2 indicates a tighter fit to the observed data. The mean of each distribution of R2 is marked as a black circle filled with purple. Two trends are immediately clear. (1) Moving from one third to two thirds to the full training set raises both the central tendency and the upper envelope of R2. (2) Trial outcomes are dispersed within each configuration, with a long tail of weak settings and a smaller but dense fraction of strong ones. This confirms that careful tuning is essential: regardless of the feature set used, hyperparameter tuning has a far greater impact on R2 than the choice of feature set.
Figure 3. Average and individual model performances grouped by feature set and portion of the training set. Three panels (Train: 1/3, Train: 2/3 and Full train) share a y-axis of R2 from −1.0 to 1.0; the x-axes list the feature sets (CORE and its combinations with average soil, type, TUW, N, su, ϕ and qc). Faint points are individual trials and larger, darker points are the average R2 for each feature set. Individual trials are widely scattered, and average values remain below zero for most feature sets.
Next, the single best test R2 achieved for each feature set under the three training sizes is reported in Figure 4. Three observations follow. First, enlarging the training set improves the best attainable R2, with a large gap between one third and the other two sizes, and a smaller gap between two thirds and full. Second, core + average soil information outperforms most layered sets, except when the full training set is combined with all layered information. Third, adding single layered families to the core set usually helps, but soil type and total unit weight alone did not improve performance and in some cases degraded it. This suggests that geotechnical data are extremely noisy, so that large data sets are required to ameliorate the noise, and that pile driving changes soil properties from their measured in situ values sufficiently to make the average good enough for most analyses.
Figure 4. Best R2 comparison on the testing set by feature set and training split. Horizontal dot plot with best R2 on the x-axis (0.2 to 0.7) and the 12 feature sets on the y-axis, with one dot each for Train 1/3, Train 2/3 and Full Train. Full Train and Train 2/3 values cluster near 0.55 to 0.66; Train 1/3 values are lower, mostly near 0.30 to 0.50.
The best tuned model for each feature set under the three training sizes is presented in Figures 5–7. Each panel shows predicted versus interpreted capacity with inset tables reporting R2, MAPE, the mean and standard deviation of Qc/Qm, and counts of poor predictions for the Train, Test and Unused subsets. Using the full training set, Figure 5 shows that many configurations do not capture the low-capacity regime well, as seen by the wide, diffuse clouds at small capacities. All panels exhibit a mean Qc/Qm greater than 1, i.e. a tendency to overpredict capacity, which is the unconservative direction of bias. Any addition of soil information improves on the core feature set in both R2 and MAPE. The core + layered soil type + total unit weight + N configuration gives the strongest overall balance, with the lowest MAPE, a mean Qc/Qm closest to one, the second highest R2, and the fewest poor predictions. Aside from core and core + layered soil type (Configuration 3), the remaining feature combinations show broadly comparable performance under the full training set. Performance on the testing and training sets is also generally consistent, suggesting no overfitting and slight underfitting for low-capacity data points.
Figure 5. Performance of the best models with each feature set using the full training set. Twelve panels, one per feature set, plot calculated capacity against interpreted capacity in kips (secondary axes in MN) for train and test data, with a diagonal reference line and dashed bounds. Data cluster mostly between about 100 and 1000 kips. Each panel includes a table with R2, error measures, poor-prediction count and sample counts.
[Figure: twelve-panel scatter plots of Calculated Capacity (x-axis, kips; secondary axis in MN) versus Interpreted Capacity (y-axis, kips; secondary axis in MN) for train, test and unused points, each panel with a diagonal reference line, dashed boundary lines and a statistics table (R-squared, error measures, poor count, sample counts for the train, test, unused and overall sets). Panels: the same twelve feature sets, from core to core + type + TUW + N.]
Performance of the best models with each feature set using 2/3 of the training set
[Figure: twelve-panel scatter plots of Calculated Capacity (x-axis, kips; secondary axis in MN) versus Interpreted Capacity (y-axis, kips; secondary axis in MN) for train, test and unused points, each panel with a diagonal reference line, dashed boundary lines and a statistics table (R-squared, error measures, poor count, sample counts for the train, test, unused and overall sets). Panels: the same twelve feature sets, from core to core + type + TUW + N. Point spread is wider and alignment with the diagonal is weaker in several panels.]
Performance of the best models with each feature set using 1/3 of the training set
With two thirds of the training data (Figure 6), the lower end of the point clouds widens for core and core + average soil, indicating weaker performance at small capacities. This is notable because the average soil representation was near the top when the full training set was used, yet here it drops behind several layered sets. Across panels, unused points tend to perform worse than train and test points, which signals overfitting that is not obvious from the small train–test gap. Because Optuna selects the model that scores best on the test set, selection bias can reduce the apparent train–test difference; similar means can hide different variances. In this regime, core + ϕ and core + qc (Configurations 7 and 8) compare favorably with the bottom-row combinations.
With one third of the training data (Figure 7), performance falls across all configurations. Point clouds broaden, especially at low capacities, and counts of poor predictions rise. Bias persists, with mean Qc/Qm generally above one. The unused subset usually performs better than the train and test subsets, which this time points to underfitting.
Discussion
The study was designed to test whether layered soil representations improve MLP accuracy and how much that gain depends on training size. The evidence from Figure 3 shows that average performances do not convey the full story: while the core feature set is consistently outperformed by most sets when the full training data are used, its average performance does not fall among the worst. This indicates that while average performance reflects the general tendency of a feature set, the best individual outcomes tell a different story. The added features may introduce noise that only certain hyperparameter configurations end up memorizing, inflating peak performance, or they may represent meaningful but harder-to-learn structure that reduces consistency across trials and lowers the mean even when the top models improve.
Training size emerged as the strongest driver. A large gap in performance is evident in Figure 4 between using one third and two thirds of the data, and a smaller gap between two thirds and full. In three feature sets the best test R2 at two thirds exceeds that at full, most likely due to randomness from the search process and the fixed split, since most feature sets improve with the full data. The same models suffered heavily on the unused subset, revealing overfitting that was hidden between training and testing. The pattern supports a law of diminishing returns in which marginal gains from additional data are largest when the data set is small and taper as it grows. However, this trend should be interpreted cautiously, as the present database size may still be below the threshold where true performance saturation occurs, and further gains could emerge with additional and more diverse data.
These findings place current practice in context. Many traditional methods were built on fewer than 100 load tests and embed strong functional assumptions. When a disciplined neural-network model still struggles at the one-third level, it is unsurprising that legacy equations calibrated on small samples show wide scatter in modern databases. Feature content matters, but less than sample size; adding soil information to core geometric properties typically helps, yet the improvement from richer layered features is smaller than the improvement from increasing the training set from one third to two thirds. Also, many studies in the literature use the testing fold when optimizing the models, essentially leading to data leakage. The difference between the unused fold and both the training and testing folds indicates that the generalizability of the models should be reported more carefully.
Within the layered family, performance is not uniform. Core + average soil and core + all layered soil frequently rank highest on the two-thirds and full sets, with core + qc close behind. Soil type and total unit weight alone do not improve accuracy and can degrade it, suggesting that categorical descriptors and bulk unit weight carry less predictive information than penetration resistance or strength measures at this resolution. Combinations including total unit weight become competitive only when paired with N, su, or ϕ.
Some underfitting persists even with the full training set and a 500-trial search, as the low-capacity regime shows diffuse clouds suggesting limitations in data quality or coverage rather than insufficient tuning. Evaluation must therefore rely on several metrics: R2 alone can be misleading – core + total unit weight shows lower R2 than core + qc but smaller MAPE – so ratio statistics and counts of poor predictions provide essential context, and visual inspection is needed to detect regime-specific errors and systematic offsets.
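As an illustration of why several metrics are needed, the following minimal sketch computes R2, MAPE, the mean ratio Qc/Qm and a count of poor predictions for one set of piles. The function name, the example capacities and the ratio bounds used to flag a poor prediction (0.5 and 2.0) are assumptions made for this example, not values taken from the study.

```python
from statistics import mean


def capacity_metrics(q_measured, q_calculated, poor_low=0.5, poor_high=2.0):
    """Return R2, MAPE (as a fraction), mean Qc/Qm and the poor-prediction count.

    poor_low and poor_high are assumed ratio bounds, not the study's definition.
    """
    if len(q_measured) != len(q_calculated) or not q_measured:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(q_measured, q_calculated))
    q_bar = mean(q_measured)
    ss_res = sum((m - c) ** 2 for m, c in pairs)   # residual sum of squares
    ss_tot = sum((m - q_bar) ** 2 for m in q_measured)  # total sum of squares
    ratios = [c / m for m, c in pairs]             # calculated over measured
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mape": mean([abs(m - c) / m for m, c in pairs]),
        "mean_ratio": mean(ratios),
        "n_poor": sum(1 for r in ratios if r < poor_low or r > poor_high),
    }


# Hypothetical capacities in kips: one low-capacity pile is badly overpredicted.
measured = [100.0, 200.0, 400.0, 800.0]
calculated = [250.0, 210.0, 380.0, 760.0]
metrics = capacity_metrics(measured, calculated)
```

In this made-up case R2 stays above 0.9 because the squared errors are dominated by the larger piles, while MAPE exceeds 0.4 and one pile is flagged as poor, which is the kind of disagreement between metrics described above.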
Overfitting can also be masked when model selection is based on test performance. In the two-thirds regime the unused subset frequently performs far worse than both train and test despite similar train/test means, indicating selection bias that a nested validation scheme would reduce. At one-third training size, performance deteriorates across all feature sets; despite using 127 piles – more than many traditional calibration data sets – the MLP exhibits unstable relationships and underfitting. This underscores the need for more load tests or, when expansion is not feasible, reduced feature complexity or stronger regularization, albeit with an accuracy ceiling.
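One simple way to limit this selection bias can be sketched as a three-way holdout, in which the hyperparameter search only ever sees a validation fold and the test fold is scored once at the end. This is a simpler stand-in for a full nested scheme, and the split fractions, seed and function names below are assumptions for illustration only.

```python
import random


def three_way_split(n_samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle sample indices and split them into train, validation and test lists."""
    if not 0.0 < val_fraction + test_fraction < 1.0:
        raise ValueError("validation and test fractions must sum to less than 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n_samples * test_fraction)
    n_val = round(n_samples * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def select_model(candidates, score_on_validation):
    """Return the candidate with the highest validation score.

    The test fold is deliberately not an input here, so it cannot steer selection.
    """
    return max(candidates, key=score_on_validation)


# 546 is the database size reported in the study; the fractions are assumed.
train_idx, val_idx, test_idx = three_way_split(546)

# Hypothetical validation scores for three candidate models.
val_scores = {"model_a": 0.58, "model_b": 0.63, "model_c": 0.61}
best = select_model(val_scores, val_scores.get)
```

With this arrangement the gap between validation and test scores becomes an honest estimate of how much the search has tuned itself to the fold it was scored on.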
Although the models were explicitly trained to predict Davisson-interpreted capacities, the systematic overprediction in the low-capacity regime may reflect structural characteristics of the Davisson criterion itself rather than purely model deficiency. Prior work (Kodsy et al., 2021, 2022a, 2022b, 2023b) has shown that Davisson-based interpretations can exhibit systematic bias relative to other criteria, and such tendencies may be inherited by data-driven models trained on those labels. In addition, because model selection was guided by R2, optimization naturally emphasizes variance in the mid- to high-capacity range where most data reside, which can shift the fitted relationship upward at the lower tail. Data imbalance – fewer low-capacity cases and greater relative noise at small settlements – may further amplify this effect. Therefore, the observed bias likely arises from a combination of label definition, optimization emphasis and data distribution rather than solely from modeling choices. Finally, although 36 MLPs were trained using different feature sets and training sizes, all performed worse than the model from the previous study, which benefited from a larger hyperparameter search space (Appendix I) and 6,000 trials. This comparison reinforces the importance of extensive optimization.
Limitations
This study uses a larger and more diverse data set than most prior work, but it is still modest from a machine learning perspective. Labels come from interpreted capacities and many soil descriptors were inferred from correlations rather than direct measurements. Assumptions such as estimating ϕ and su from Ncorr and generating synthetic qc from SPT via Robertson zones add noise that neither traditional formulas nor MLPs can remove. These data issues likely cap the attainable accuracy, especially in the low-capacity range.
The proposed layered encoding also has its limitations. Soil properties were sampled at fixed relative depths rather than at actual layer boundaries. This approach was selected to standardize the encoding of all piles. Nevertheless, it may miss thin but mechanically dominant strata or abrupt strength transitions that significantly influence pile capacity. Therefore, the comparable or superior performance of averaged inputs may partly reflect this simplified discretization rather than a true lack of benefit from layered data. Future work should explore boundary-aware or higher-resolution sampling aligned with actual stratigraphy to better assess the value of layered representations.
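The effect of fixed relative-depth sampling can be illustrated with a minimal sketch. The profile, the number of sampling points (five, at segment midpoints) and the function names are hypothetical and are not the encoding parameters used in the study.

```python
def value_at_depth(layers, depth):
    """Return the property value of the layer that contains `depth`.

    `layers` is a list of (bottom_depth, value) tuples sorted by bottom depth.
    """
    for bottom, value in layers:
        if depth <= bottom:
            return value
    return layers[-1][1]  # below the last logged layer: reuse its value


def layered_features(layers, pile_length, n_points=5):
    """Sample the profile at the midpoints of n_points equal shaft segments."""
    depths = [pile_length * (2 * i + 1) / (2 * n_points) for i in range(n_points)]
    return [value_at_depth(layers, d) for d in depths]


def averaged_feature(layers, pile_length):
    """Thickness-weighted average of the property over the logged pile length."""
    total, top = 0.0, 0.0
    for bottom, value in layers:
        bottom = min(bottom, pile_length)  # ignore soil below the pile tip
        if bottom > top:
            total += value * (bottom - top)
            top = bottom
    if top == 0.0:
        raise ValueError("profile does not overlap the pile")
    return total / top


# Hypothetical SPT-N profile: a thin dense layer between depths 3.2 and 3.8.
profile = [(3.2, 5), (3.8, 50), (12.0, 12)]
sampled = layered_features(profile, pile_length=10.0)
average = averaged_feature(profile, pile_length=10.0)
```

Here the sampling depths are 1, 3, 5, 7 and 9, so the layered vector is [5, 5, 12, 12, 12] and the thin dense layer is never sampled, whereas the thickness-weighted average (about 12.0) still registers its contribution. A boundary-aware encoding would avoid this loss at the cost of a variable-length input.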
The search budget was fixed at 500 Optuna trials per configuration across 12 feature sets and 3 training sizes. This constraint kept the 36-run matrix tractable but limited the breadth of architectures, optimizers, and training schedules that could be explored. Completing a single 500-trial run required approximately 1–3 h depending on data set size and convergence behavior; across 36 runs, this corresponded to roughly 72 cumulative computation hours under the current resource budget. Preliminary exploratory studies conducted by the authors indicated that performance improvements beyond approximately 200–300 trials were marginal, with diminishing returns relative to computational cost. For this reason, 500 trials were selected as a balance between adequate coverage of the high-dimensional search space and practical feasibility. Nevertheless, the reported “best” models should be interpreted as optimal within this bounded and resource-constrained search framework rather than guaranteed global optima.
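The budget figures above can be checked with a few lines of arithmetic. The reading that the 72 h total corresponds to an average of about 2 h per run is an inference from the reported 1–3 h range, not a value stated by the authors.

```python
n_feature_sets = 12
n_training_sizes = 3
hours_per_run = (1.0, 3.0)  # reported range for one 500-trial run

n_runs = n_feature_sets * n_training_sizes            # 36 runs in the matrix
low_hours, high_hours = (n_runs * h for h in hours_per_run)
midpoint_hours = (low_hours + high_hours) / 2          # assumes a ~2 h average run

print(n_runs, low_hours, high_hours, midpoint_hours)   # 36 36.0 108.0 72.0
```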
Practical implications
Geotechnical databases assembled from real projects are almost always incomplete. Many layers lack key parameters, and available measurements often mix direct tests with values inferred from empirical correlations. Conventional wisdom holds that such “sub-optimal” data are unsuitable for developing mechanics-based design correlations, especially for models that rely on continuous profile information. The present study offers a counterpoint: although individual soil descriptors may be missing or noisy, a sufficiently large and diverse data set can offset the deficiencies of imperfect input features when machine learning is used.
Across 546 load tests, even feature sets built from averaged soil properties – arguably the coarsest possible encoding – performed nearly as well as more elaborate layered representations when enough training data were used. Models using layered profiles delivered modest gains, but these improvements were consistently smaller than the gains achieved simply by increasing training-set size. This indicates that data quantity and diversity matter more than perfect soil characterization. For many practical applications, curated collections of incomplete records may still support reliable ML models, provided that sample sizes are large and hyperparameter tuning is disciplined. A number of practical takeaways are possible:
Valuable models can be trained without complete stratigraphic detail. When project records lack qc, su or ϕ at all layers, using averaged or partially derived soil information can still yield adequate predictions. However, this result may be tied to the noise level in the present database, which can limit the consistent learnability of layered detail rather than indicating that stratigraphy is unimportant.
Expanding the volume of available tests is more impactful than adding additional parameters. Prioritizing the aggregation of many imperfect load-test records – across sites, agencies or archives – is likely to improve model performance more than demanding perfect soil profiles for each case.
Increasing training size primarily expands inter-site variability – capturing differences in geology, installation practices and stress histories – whereas layered features describe intra-profile variability within a pile. In a database with inferred and heterogeneous soil parameters, additional data may reduce noise and site bias more effectively than increasing feature complexity. In cleaner data sets with consistently measured parameters, layered representations could play a stronger role, so the dominance of training size observed here should be interpreted in light of data quality.
ML workflows should be designed to tolerate noise. Because correlated soil parameters are often inferred rather than measured, model architectures and tuning strategies must anticipate variability and redundancy.
Data-sharing initiatives have outsized benefits. Broad access to diverse load test archives could enable more robust ML models than isolated high-quality data sets.
Overall, the study demonstrates that large, noisy, and incomplete geotechnical data sets – typical of real practice – can still support high-performing ML models. This lowers the barrier for organizations to adopt data-driven prediction tools, even when legacy records are inconsistent or partial.
Conclusions
This study tested MLPs across 12 feature representations and three training sizes to evaluate two questions: whether layered soil information improves prediction, and how performance scales with data. The results show that layered representations generally improve accuracy over geometry alone; however, there is no clear advantage to using layered representations over the average soil encoding. The most consistent configuration paired core with layered soil type, layered total unit weight and layered N, reaching an R2 of 0.64 and a MAPE of 0.43. Single layered families that carry direct strength or testing information, such as N or qc, were competitive; soil type or total unit weight alone were weak. Using all layered variables did not always yield the best outcome, which indicates that adding variables past a core set can bring diminishing returns.
Training size was the dominant factor. Moving from one third to two thirds of the training data produced a large gain, while the gain from two thirds to the full set was smaller. In a few cases the best two-thirds model edged out the full-set model, which is most likely related to randomness in search and splitting rather than a systematic advantage. However, given that the total database contains only 546 piles, it would not be appropriate to interpret this trend as evidence of saturation or diminishing returns. The results indicate that the one-third subset, although already larger than many published data sets, was insufficient for stable and reliable modeling, and that increasing the number of data points improved both central performance and robustness. Whether true convergence has been reached cannot be concluded from this scale of data. Future improvements are therefore more likely to come from expanding the data set, particularly by increasing both volume and diversity of pile–soil conditions, rather than assuming saturation.
Practically, the findings suggest four priorities. First, invest in data size and label quality before expanding the feature set. Second, averaged soil information along the pile length is almost as informative as the layered inputs, whether these are used together or in different combinations; given the added training cost of layered representations, averaging should be the priority. Third, keep the search disciplined with a standard budget and objective. Fourth, evaluate with several metrics, include a validation fold to control selection bias, and support the evaluation with visual inspection when possible.
Declaration of generative artificial intelligence and artificial intelligence-assisted technologies in the writing process
During the preparation of this work the authors used ChatGPT to catch grammatical errors. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
The authors are grateful to the late Prof. Roy Olson of the University of Texas at Austin for making his data available to them.
References
Appendix. Hyperparameter search space (Optuna, 500 trials per configuration)
Optuna explored a fixed, predefined search space covering MLP architecture and training settings:
Number of Hidden Layers: 1, 2, 3, 4, 5.
Neurons per Hidden Layer: 16, 32, 48, 64.
Activation Function (per layer): relu, sigmoid, tanh, softplus, softsign, selu, elu, exponential, linear, hard_sigmoid, swish (silu), gelu, softmax, log_softmax, hard_swish, relu6.
Dropout Rate: 0.1, 0.2, 0.3, 0.4, 0.5.
Normalization: none, batch.
Regularization Type: L1, L2, L1 + L2.
L1 Regularization Coefficient: 1e−6 to 1e−2 (log-uniform).
L2 Regularization Coefficient: 1e−6 to 1e−2 (log-uniform).
Kernel Initializer: glorot_uniform, he_normal, lecun_normal.
Optimizer: Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam, Ftrl.
Learning Rate: 1e−4 to 1e−2 (log-uniform).
Batch Size: 48, 72, 96, …, 256 (step = 24).
Loss Function: MAPE, MAE, Huber, log-cosh.
Gradient Clipping (clipnorm): 0.1–1.0.
Training controls (fixed): up to 3,000 epochs with early stopping (patience 30, best weights restored) and learning-rate reduction on plateau (factor 0.5, patience 15).
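The search space listed above could be expressed as follows. This is a minimal sketch, not the authors' code: the function and parameter names are hypothetical, the per-layer draw of layer width and the conditional sampling of the L1 and L2 coefficients are assumptions, and the batch-size grid follows the stated step of 24 from 48, which stops at 240 rather than the listed upper value of 256. The `trial` argument is expected to provide the `suggest_int`, `suggest_float` and `suggest_categorical` methods of an Optuna trial.

```python
ACTIVATIONS = [
    "relu", "sigmoid", "tanh", "softplus", "softsign", "selu", "elu",
    "exponential", "linear", "hard_sigmoid", "swish", "gelu", "softmax",
    "log_softmax", "hard_swish", "relu6",
]


def suggest_mlp_config(trial):
    """Draw one MLP configuration from the appendix search space."""
    n_layers = trial.suggest_int("n_hidden_layers", 1, 5)
    hidden_layers = [
        {
            # Assumption: width is drawn per layer; the appendix does not say.
            "units": trial.suggest_categorical(f"units_{i}", [16, 32, 48, 64]),
            "activation": trial.suggest_categorical(f"activation_{i}", ACTIVATIONS),
        }
        for i in range(n_layers)
    ]
    reg_type = trial.suggest_categorical("regularization", ["l1", "l2", "l1_l2"])
    config = {
        "hidden_layers": hidden_layers,
        "dropout": trial.suggest_categorical("dropout", [0.1, 0.2, 0.3, 0.4, 0.5]),
        "normalization": trial.suggest_categorical("normalization", ["none", "batch"]),
        "regularization": reg_type,
        "initializer": trial.suggest_categorical(
            "initializer", ["glorot_uniform", "he_normal", "lecun_normal"]),
        "optimizer": trial.suggest_categorical(
            "optimizer",
            ["Adam", "SGD", "RMSprop", "Adagrad", "Adadelta", "Adamax", "Nadam", "Ftrl"]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        # Assumption: grid of 48, 72, ..., 240 (256 is not on a 24-step grid from 48).
        "batch_size": trial.suggest_categorical("batch_size", list(range(48, 257, 24))),
        "loss": trial.suggest_categorical("loss", ["mape", "mae", "huber", "log_cosh"]),
        "clipnorm": trial.suggest_float("clipnorm", 0.1, 1.0),
    }
    # Assumption: a coefficient is sampled only when its penalty is active.
    if reg_type in ("l1", "l1_l2"):
        config["l1"] = trial.suggest_float("l1", 1e-6, 1e-2, log=True)
    if reg_type in ("l2", "l1_l2"):
        config["l2"] = trial.suggest_float("l2", 1e-6, 1e-2, log=True)
    return config
```

With Optuna installed, a function of this kind would typically be called at the top of the objective passed to `study.optimize(objective, n_trials=500)`, with the returned dictionary used to build and train the network under the fixed training controls listed above.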

