Lessons learned from training an MLP model using layered feature sets for estimating pile bearing capacity

Ozturk, Baturalp; Iskander, Magued

doi:10.1108/MLAG-12-2025-0022

Purpose

This study aims to investigate how input representation and training-set size influence the performance of multilayer perceptrons (MLP) for predicting axial pile capacity. Empirical procedures often show wide deviations from full-scale load tests, motivating data-driven approaches that better capture soil and geometry variability.

Design/methodology/approach

A database of 546 axial load tests is used to evaluate 12 feature representations ranging from geometry-only inputs to averaged soil descriptors and layered soil profiles sampled along the pile shaft. For each feature set, an MLP is tuned using 500 Optuna trials under three training sizes: one-third, two-thirds and the full data set.

Findings

Layered inputs generally outperform geometry alone, but averaged soil properties often perform similarly. The combination of geometry with layered soil type, total unit weight and SPT-N provides the most consistent results. Training size is the dominant factor, with large gains from one-third to two-thirds of the data and smaller gains thereafter.

Practical implications

Large geotechnical data sets can support reliable ML models even when many soil parameters are missing or inferred. Averaged soil information performs nearly as well as layered profiles, reducing the need for complete stratigraphic data. Practitioners can therefore use broad but imperfect databases to develop effective prediction tools.

Originality/value

This study provides a comprehensive assessment of feature design and data sufficiency for neural-network-based pile capacity prediction. It demonstrates that while representation matters, training-set size governs most achievable accuracy and clarifies when layered inputs justify added complexity.

Introduction

Estimating axial pile capacity remains a persistent challenge in practice. Many large structures depend on piles, and the consequences of misestimation are significant. Rising construction costs and the need for specialized equipment amplify these consequences. Design workflows often require site-specific validation through static load tests on preliminary or sacrificial piles. This requirement, which is uncommon for other structural elements, reflects the uncertainty that surrounds pile capacity prediction and the limited confidence placed in current design procedures.

Practice continues to rely on empirical and semi-empirical procedures because they are widely accepted, yet comparisons with load tests show wide discrepancies that in extreme cases approach an order of magnitude (Briaud and Tucker, 1988; Machairas et al., 2018; Phoon and Tang, 2019; Rizk et al., 2022; Ozturk et al., 2023a). Part of the problem is the narrow calibration base and the limited ability of these methods to account for soil property changes and stress redistribution after driving. For example, the US Federal Highway Administration (FHWA) design guidance is based on Tomlinson’s (1980) Alpha method, which synthesized 56 driven piles in clay, and on Nordlund’s (1963) sand method, which drew on about 41 tests from 8 sites. Similarly, the Lambda approach (Focht and Vijayvergiya, 1972) used 47 pipe pile tests in clay to develop its guidance. Because these procedures lump depth-dependent soil behavior, installation effects and other uncertainties into simplified rules, designers often apply generous resistance or safety factors, which raises costs and can conceal systematic bias. To address these limitations, this study investigates whether expanding the calibration base and refining data representation through machine learning improves predictive consistency.

The search for better predictors has shifted much recent work in geotechnical engineering toward machine learning (ML) (Li and Iskander, 2022; Kodsy et al., 2023a; Zhang et al., 2021, 2025; Wang et al., 2025). ML can model nonlinear relations among many inputs and can represent uncertainty more flexibly than rule-based procedures. Prior studies have tested base and ensemble ML learners for axial capacity and have reported significant gains relative to empirical and semi-empirical formulas, although sizeable errors remain and performance varies across data sets (Ozturk et al., 2023b, 2024a, 2024b, 2024c). These results suggest that data-driven models can use geometry and soil information without imposing strict mechanistic assumptions, but they also show that model design and data pre-processing control most of the outcome.

Multilayer perceptrons (MLP) (Hornik et al., 1989) are a class of artificial neural networks (ANN) commonly used for regression problems, of which pile capacity prediction is a candidate. Indeed, MLPs have been applied to this problem for nearly three decades (Lee and Lee, 1996; Pal and Deswal, 2008; Momeni et al., 2014; Moayedi and Armaghani, 2017; Ozturk and Iskander, 2025). However, most prior applications were limited by reliance on a single project database, small sample size, or suboptimal model training. Training on fewer than 100 piles, or on piles drawn from a single site, can eliminate intrinsic geologic uncertainty and yield deceptively excellent performance. Such settings suppress the variability in soil conditions and installation practice that dominates uncertainty in the field. Reported metrics then reflect internal consistency within one fabric and driving system rather than general performance across sites. Conventional cross validation inside a single project can further inflate accuracy if stratigraphy or test setups are the same across training and validation folds. As a result, the external validity of the developed ML models is unclear and transfer to new geology and driving conditions is uncertain.

Data curation and label definition add another layer of difficulty. Capacity is often interpreted by different criteria across sources. For example, Davisson (1972) and other offset capacity methods fail to account for the dependence of capacity on displacement, which also introduces label noise into supervised learning. Soil indices and in situ measurements may also come from different test protocols and depths. Simple averaging along the embedded length removes position information that controls the mobilization of shaft and tip resistance. Layer-aware encodings that preserve the profile can carry more information about where strength and density reside, while average encodings are easier to assemble but risk blurring tip effects and stiff interlayers.

In this study, the effect of data representation and sample size on the capacity predicted by ML is explored using a database of 546 load tests in a variety of soils. First, 12 feature sets were assembled to test how input design influences MLP performance (Table 1). The first set uses only geometric pile properties because prior studies showed that capacity can be interpreted based on pile geometry alone (Ozturk et al., 2023b). The second set compresses the soil variables into averages over the embedded length and serves as the baseline. The remaining sets preserve depth information by sampling each soil variable at five fixed fractions of the embedded length and include the properties at the tip as a sixth layer. Within this stratified family, individual variables and small groups are paired with geometry to isolate the performance gain from each of the variable combinations shown in Table 1. Each of the 12 configurations is trained with the same workflow and tuned with Optuna (Akiba et al., 2019) using 500 trials so that differences in accuracy reflect feature content rather than search effort. This design tests whether layered profiles carry more information than averages and which features contribute most when used alone with geometry or in combination.

Table 1

Input ablation configurations

Configuration number	Geometry	Soil type	Total unit weight (TUW)	N value	S_u	ϕ	q_c	Rationale
1	✓	…	…	…	…	…	…	Baseline without soil
2	✓	Average over entire embedded depth						Baseline
3	✓	✓	…	…	…	…	…	Soil type
4	✓	…	✓	…	…	…	…	One feature at a time, as indicated
5	✓	…	…	✓	…	…	…	One feature at a time, as indicated
6	✓	…	…	…	✓	…	…	One feature at a time, as indicated
7	✓	…	…	…	…	✓	…	One feature at a time, as indicated
8	✓	…	…	…	…	…	✓	One feature at a time, as indicated
9	✓	✓	✓	✓	✓	✓	✓	Complete set
10	✓	✓	…	…	✓	✓	…	Strength indicators + soil type
11	✓	…	✓	…	✓	✓	…	Similar to 10, but replacing categorical soil type by TUW
12	✓	✓	✓	✓	…	…	…	Fewest use of correlations

The second part of the analysis examines the effect of training size. A fixed test set comprising 164 load tests is reserved for evaluation. The remaining data are used at three levels: one third, two thirds, and the full training portion. This mini ablation study measures how performance scales with the amount of training data and whether the advantage of stratified inputs changes with sample size. Crossing the 12 configuration sets with the three training sizes yields 36 models, which provides a consistent basis to separate gains due to representation from gains due to additional data.

The remainder of the paper is organized as follows. The Data set section describes the 546-load-test database, its sources, pile and soil descriptors and the preprocessing steps used to harmonize and infer missing parameters. The Background section summarizes prior comparisons between traditional methods and machine learning models on this database and explains the use of a tuned MLP as the primary model under study. The Methodology section details the feature ablation design, layered versus averaged soil encodings, training–testing splits, Optuna-based hyperparameter optimization, and evaluation metrics. The Results section presents performance trends across feature sets and training sizes, highlighting the relative influence of representation and data volume. The Discussion interprets these trends in the context of geotechnical practice, overfitting and data sufficiency. The Limitations section outlines constraints related to data set size, inferred soil parameters, discretized layering and bounded hyperparameter search. The Practical Implications section translates findings into guidance for database expansion, feature selection and ML deployment in practice. Finally, the Conclusions synthesize the main findings regarding layered encoding, sample size effects and priorities for future data-driven pile capacity modeling.

Data set

To support models that generalize across soil conditions, this study uses a large database of 546 static axial load tests compiled from multiple sources. The set includes 145 steel pipe piles, 165 H-piles, 193 precast concrete piles and 43 timber piles. Diameters range from 10 to 40 in. (25–100 cm) and lengths from 8.5 to 315 ft (2.5–100 m). The distribution of diameter and length by pile type is shown in Figure 1. The database was assembled from Prof. Olson’s collections (Dennis and Olson, 1983a, 1983b; Olson and Iskander, 2014), the FHWA Deep Foundations Load Test Database v2 (Petek et al., 2016) and the Iowa DOT repository (Roling et al., 2011). The combined sources span a wide range of geologic settings and pile types, so variability in soil conditions is better represented than in most prior studies, rather than averaged out.

Figure 1

Scatter plot of pile length (y-axis, 0 to 350 ft; secondary axis 0 to 100 m) against pile diameter (x-axis, 0 to 50 in.; secondary axis 0.0 to 1.2 m), with marginal histograms above and to the right. Points are grouped by pile type: steel pipe (145), H-pile (165), concrete (193) and timber (43). Most points cluster between 10 and 25 in. diameter and 20 to 100 ft length. Some larger piles extend to about 40 in. diameter, a few lengths exceed 150 ft, and one exceeds 300 ft. The histograms indicate that most piles have smaller diameters and shorter lengths.

Distribution of piles used in the analysis

Davisson capacities from the source records were adopted as ground truth, with the understanding that the offset criterion introduces its own interpretation errors (Kodsy et al., 2022a, 2022b). The data set contained pile descriptors (diameter, length, area, circumference, modulus, pile type and pile material) and soil descriptors (soil type, total unit weight, SPT N-values, undrained shear strength (s_u), friction angle (ϕ) and cone tip resistance (q_c)). Soil variables were represented in two ways: as an average over the embedded length, and in layered form that records the value at the top, at 20, 40, 60 and 80% of the embedded length, and at the tip.

Corrected SPT N-values (N_corr), total unit weight, and soil type were available for most strata. When N_corr needed to be computed or checked, the procedure was as follows. Reported SPT blow counts were treated as N₆₀ when energy details were not available. Overburden effects were then applied using Peck et al. (1974) to obtain the correction factor C_n, with the standard cap, and N_corr = C_n · N₆₀. Most records already contained N and total unit weight. If N_corr was missing but q_u or s_u was present, N_corr was inferred from NAVFAC DM 7.01 (NAVFAC, 1986) correlations between SPT resistance and undrained strength. The reverse inference was used to fill missing strength when N_corr existed. From N_corr, friction angle for cohesionless soils, s_u for cohesive soils, and total unit weight were estimated using NAVFAC and Bowles (1988) relationships. Finally, q_c was generated by assigning a Robertson (2016) soil behavior zone from the layer description and applying the zone-specific q_c/N ratio for that soil type.
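The overburden step can be illustrated with a minimal sketch. It assumes the commonly cited form of the Peck et al. (1974) factor, C_n = 0.77 log₁₀(20/σ′_v) with the effective vertical stress σ′_v in tons per square foot, and an upper cap of 2.0. The text above does not state the cap value or the stress units, so both are assumptions, as are the function names.

```python
import math

def peck_overburden_factor(sigma_v_eff_tsf, cap=2.0):
    """Overburden factor C_n = 0.77 * log10(20 / sigma'_v).

    sigma_v_eff_tsf: effective vertical stress in tons per square foot.
    cap: upper limit on C_n (assumed value, not taken from the paper).
    """
    if sigma_v_eff_tsf <= 0:
        raise ValueError("effective stress must be positive")
    c_n = 0.77 * math.log10(20.0 / sigma_v_eff_tsf)
    return min(c_n, cap)

def corrected_n(n60, sigma_v_eff_tsf, cap=2.0):
    """N_corr = C_n * N60, following the procedure described above."""
    return peck_overburden_factor(sigma_v_eff_tsf, cap) * n60

# Example: a blow count of 20 at 2 tsf effective stress.
n_corr = corrected_n(20, 2.0)
```

At an effective stress near 1 tsf the factor is close to one, so the correction mainly raises shallow blow counts and lowers deep ones.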

Background

Ozturk and Iskander (2025) used the same 546-pile database to compare the performance of traditional procedures, ML models and a tuned MLP. The traditional models comprised (1) FHWA (Hannigan et al., 2016), (2) US Army Corps of Engineers (USACE) (USACE, 1991), (3) American Petroleum Institute (API) (API, 1993) and (4) Revised Lambda (Kraft et al., 1981). These four were chosen because they are available in APILE (Wang et al., 2019), which provides a consistent implementation, and because they represent common practice for axial capacity.

The ML set comprised support vector regression as the base learner and XGBoost as the ensemble benchmark; these were selected because prior work on this database identified them as the strongest baselines in their classes. A multilayer perceptron was included to test whether a carefully tuned neural network could surpass the tree-based ensemble on the same inputs and labels.

An MLP is a fully connected neural network typically used for regression or classification. Inputs pass through one or more hidden layers where linear transformations are followed by nonlinear activations. With sufficient width (i.e. number of nodes in a layer) and appropriate activation functions, MLPs can approximate complex input-output maps and can represent interactions among geometry, soil descriptors, and profile features that are difficult to encode in closed form (Hornik et al., 1989). Training adjusts weights and biases by backpropagation with an optimizer such as SGD (Stochastic Gradient Descent) or Adam (Adaptive Moment Estimation). Regularization and normalization are also used to control overfitting.
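As a rough illustration of the forward pass described above, the following sketch chains fully connected layers with a ReLU activation on the hidden layers and a linear output for regression. The layer sizes, the ReLU choice and the random initialization are illustrative assumptions, not the architecture used in this study, and training by backpropagation is omitted.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def dense(x, weights, biases):
    """One fully connected layer: y_j = sum_i x_i * W[i][j] + b_j."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.gauss(0.0, 1.0 / n_in ** 0.5) for _ in range(n_out)]
               for _ in range(n_in)]
    return weights, [0.0] * n_out

def mlp_forward(x, layers):
    """Hidden layers use ReLU; the last layer is linear (regression output)."""
    h = x
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    return dense(h, weights, biases)

rng = random.Random(0)
sizes = [4, 8, 8, 1]  # hypothetical: 4 inputs, two hidden layers of 8 nodes, 1 output
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
prediction = mlp_forward([0.5, -1.2, 0.3, 2.0], layers)
```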

Optuna is a hyperparameter optimization framework that automates the search over architecture and training settings. It samples candidates from a defined hyperparameter space, evaluates them on a validation fold, and steers subsequent trials toward promising regions while pruning weak runs. Optuna was selected because its adaptive sampling and early pruning mechanism efficiently navigates a high-dimensional search space while controlling computational cost and maintaining reproducibility across experiments. In this context, Optuna selects the number of layers and neurons, activation functions, normalization, dropout, regularization strengths, optimizer family, learning rate, batch size and loss. Using a fixed optimization framework per configuration standardizes effort across feature sets so performance differences can be attributed to representation and data rather than tuning.

The cross-model comparison on the same 546-pile database is summarized in Figure 2. Among traditional methods, API produced the strongest baseline for this data set and was used as the reference. Support vector regression served as the base ML learner and XGBoost (Chen and Guestrin, 2016) as the ensemble benchmark, because it is widely regarded as the gold standard for tabular data (Shwartz-Ziv and Armon, 2022). An extensively tuned MLP was also trained because XGBoost, while strong on tabular data, is prone to overfitting without tight control of depth, learning rate and tree count. Its tree ensembles partition the feature space and return leaf averages, which limits extrapolation and yields piecewise constant responses outside the support of the training data. Trees are largely insensitive to feature scaling and can down-weight or ignore weak inputs through split selection and regularization. That robustness can mask the effect of representation. Our goal is to test whether layered soil encodings add information beyond averages. An MLP is a smooth, differentiable function approximator that is sensitive to representation and can exploit stratified continuous inputs, making it a better probe of feature design. XGBoost remains the benchmark for comparison, but the MLP is the model under study.

Figure 2

Four scatter panels (API, SVR, XGBoost and MLP) plot calculated capacity (x-axis) against interpreted capacity (y-axis) in kips, with secondary axes in meganewtons. Each panel shows train and test points, a diagonal reference line with dashed bounds, and an inset table of R², error measures, poor-prediction count and sample counts. API has the widest spread, SVR shows a tighter cluster, XGBoost has the tightest alignment with the reference line, and MLP shows a strong positive trend with moderate spread.

MLP model compared to baseline performance models (Ozturk and Iskander, 2025)

Methodology

The database was randomly split into 70% training and 30% testing portions. All modeling choices, tuning, and ablation studies were performed inside the training portion only. To measure how performance scales with the amount of training data, the ablation runs were trained on a subset of the training portion and tested on the entire testing portion. Predictions and metrics are reported for three groups per run: train, test and unused. “Unused” denotes the part of the original training set that was intentionally left out when a reduced training size was used, then predicted after the model was fitted.
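The split into train, test and unused groups can be sketched as follows. The seed, the rounding rule and the choice to take each reduced training set as a prefix of one shuffled ordering are assumptions made for illustration; the study's own sampling procedure is not described at that level of detail.

```python
import random

def split_indices(n_total=546, test_fraction=0.30, seed=0):
    """Shuffle once, then hold out a fixed test set."""
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    n_test = round(n_total * test_fraction)
    return idx[n_test:], idx[:n_test]  # training portion, fixed test set

def reduced_training(train_idx, fraction):
    """Return (used, unused) parts of the training portion."""
    n_used = round(len(train_idx) * fraction)
    return train_idx[:n_used], train_idx[n_used:]

train_idx, test_idx = split_indices()
group_sizes = {}
for label, fraction in (("1/3", 1 / 3), ("2/3", 2 / 3), ("full", 1.0)):
    used, unused = reduced_training(train_idx, fraction)
    group_sizes[label] = (len(used), len(unused), len(test_idx))
```

With 546 records this gives a test set of 164 piles, matching the count quoted in the Introduction, and a training portion of 382 piles.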

Soil descriptions from different sources were harmonized into three generalized classes (clay, sand, rock) to reduce labeling noise and improve model stability. All categorical variables were numerically encoded, with pile type and pile material one-hot encoded and soil classes mapped to integer labels. Redundant geometric variables related to layer depths and repeated embedment lengths were removed to prevent information leakage and multicollinearity.
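A minimal sketch of this encoding step is given below. The keyword map that assigns free-text descriptions to the three generalized classes is entirely hypothetical (including which soils fall under each class and the rule of keying on the last word), as are the integer codes and the category order; only the three classes, the integer mapping for soil and the one-hot treatment of pile type come from the text.

```python
# Integer labels for the three generalized soil classes (code values assumed).
SOIL_CLASS_CODES = {"clay": 0, "sand": 1, "rock": 2}

# Hypothetical keyword map from source descriptions to generalized classes.
KEYWORD_TO_CLASS = {
    "clay": "clay", "silt": "clay",
    "sand": "sand", "gravel": "sand",
    "rock": "rock", "shale": "rock", "limestone": "rock",
}

PILE_TYPES = ["steel pipe", "H-pile", "concrete", "timber"]

def harmonize_soil(description):
    """Map a description to a class using its last word (illustrative rule)."""
    last_word = description.strip().lower().split()[-1]
    if last_word not in KEYWORD_TO_CLASS:
        raise ValueError(f"unrecognized soil description: {description!r}")
    return KEYWORD_TO_CLASS[last_word]

def encode_soil(description):
    return SOIL_CLASS_CODES[harmonize_soil(description)]

def one_hot(value, categories):
    """One indicator column per category."""
    return [1 if value == c else 0 for c in categories]

row = [encode_soil("Silty SAND")] + one_hot("timber", PILE_TYPES)
```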

Inputs were built from two ingredients:

core variables that encode pile geometry and type but no soil information; and
soil descriptors represented either as a single average over the embedded length or as layered values sampled at top, 20, 40, 60 and 80% of the embedded length, and at the tip.
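The two soil encodings can be contrasted with a short sketch. It assumes a profile stored as (top depth, bottom depth, value) tuples starting at the ground surface and extending at least to the tip, a thickness-weighted average, and a convention that a sampling depth on a layer boundary takes the value of the lower layer. These conventions are illustrative assumptions rather than the study's documented procedure.

```python
SAMPLE_FRACTIONS = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)  # top, 20 to 80%, tip

def value_at(profile, depth):
    """Value of the layer containing `depth` (top inclusive, bottom exclusive)."""
    for top, bottom, value in profile:
        if top <= depth < bottom:
            return value
    return profile[-1][2]  # depth at or below the base of the last layer

def layered_features(profile, embedded_length):
    """Six values per soil variable: one per sampling depth."""
    return [value_at(profile, f * embedded_length) for f in SAMPLE_FRACTIONS]

def averaged_feature(profile, embedded_length):
    """Thickness-weighted average over the embedded length."""
    total = 0.0
    for top, bottom, value in profile:
        thickness = max(0.0, min(bottom, embedded_length) - top)
        total += thickness * value
    return total / embedded_length

# Hypothetical SPT-N profile: soft layer over a denser layer, depths in ft.
n_profile = [(0.0, 10.0, 5.0), (10.0, 30.0, 20.0)]
layered = layered_features(n_profile, embedded_length=20.0)
average = averaged_feature(n_profile, embedded_length=20.0)
```

In this example the layered encoding records that the denser material sits in the lower half of the shaft and at the tip, whereas the average reduces the profile to a single number.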

Layered profiles preserve stratigraphy and tip-zone context. Averages were included once as a direct analogue to the previous study for benchmarking under the present tuning configuration. Twelve feature sets were defined as shown in Table 1, as follows:

1 Core only: to measure the value of geometry without soil.

2 Core + Averages: to mirror the previous study’s representation under a smaller tuning budget.

3-8 Core + one layered input at a time: soil type, total unit weight, N, s_u, ϕ and q_c.

9 Core + all layered families: to test the upper bound from full stratified information.

10 Core + Soil Type + s_u + ϕ for each layer: to provide strength indicators for both cohesive and cohesionless soils.

11 Core + Total Unit Weight + s_u + ϕ for each layer: to replace the categorical soil variable with total unit weight as a soil indicator while retaining strength measures.

12 Core + Soil Type + Total Unit Weight + N for each layer: to emphasize the three most widely available field descriptors in the database before filling the missing parameters.

Each of the 12 sets was trained three times, using 1/3, 2/3 and the full training portion. This yields 36 MLP configurations that separate gains due to representation from gains due to additional data.
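The experimental grid can be enumerated in a few lines. The short feature-family keys below are hypothetical labels chosen for this sketch; the set memberships follow Table 1.

```python
from itertools import product

LAYERED_FAMILIES = ["type", "tuw", "n", "su", "phi", "qc"]

# Soil inputs added to the core (geometry and pile type) variables.
FEATURE_SETS = {
    1: [],
    2: ["average_soil"],
    3: ["type"],
    4: ["tuw"],
    5: ["n"],
    6: ["su"],
    7: ["phi"],
    8: ["qc"],
    9: list(LAYERED_FAMILIES),
    10: ["type", "su", "phi"],
    11: ["tuw", "su", "phi"],
    12: ["type", "tuw", "n"],
}

TRAIN_SIZES = ("1/3", "2/3", "full")

# One tuned MLP per (feature set, training size) pair.
runs = list(product(FEATURE_SETS, TRAIN_SIZES))
```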

For every configuration an MLP was tuned with Optuna for 500 trials. The first 100 trials were random to explore the space broadly. The remaining 400 used Optuna’s model-based sampler with pruning to focus computation on promising regions. The search space covered depth, width, activation, normalization, dropout, L₁ and L₂ penalties, optimizer family, learning rate, batch size and number of training epochs. Trials were optimized by R², because earlier work on this database showed that optimizing R² led to better behavior than optimizing by mean absolute percentage error (MAPE) (Ozturk and Iskander, 2025). Final models were evaluated on the fixed test and the unused groups.
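A tuning setup of this kind might look like the sketch below. It assumes Optuna's TPE sampler with 100 random startup trials and the median pruner as stand-ins for the "model-based sampler with pruning"; the text does not name the specific sampler or pruner. All ranges and categorical choices in the search space are illustrative, and `train_and_score` in the usage note is a hypothetical user-supplied function.

```python
def suggest_mlp_params(trial):
    """Draw one candidate; ranges and choices are illustrative assumptions."""
    n_layers = trial.suggest_int("n_layers", 1, 5)
    return {
        "hidden_sizes": [trial.suggest_int(f"units_{i}", 8, 256, log=True)
                         for i in range(n_layers)],
        "activation": trial.suggest_categorical("activation", ["relu", "tanh", "elu"]),
        "normalization": trial.suggest_categorical("normalization", ["none", "batch"]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "l1": trial.suggest_float("l1", 1e-8, 1e-2, log=True),
        "l2": trial.suggest_float("l2", 1e-8, 1e-2, log=True),
        "optimizer": trial.suggest_categorical("optimizer", ["sgd", "adam"]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64, 128]),
        "epochs": trial.suggest_int("epochs", 50, 500),
    }

def make_study(seed=0):
    """Study that maximizes R²: 100 random startup trials, then TPE with pruning."""
    import optuna  # third-party package, imported lazily
    sampler = optuna.samplers.TPESampler(n_startup_trials=100, seed=seed)
    pruner = optuna.pruners.MedianPruner()
    return optuna.create_study(direction="maximize", sampler=sampler, pruner=pruner)
```

Under these assumptions, a run would call `make_study().optimize(lambda t: train_and_score(suggest_mlp_params(t)), n_trials=500)`, where `train_and_score` fits an MLP with the drawn settings and returns its validation R².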

To compare performance, the MAPE in equation (1) was adopted along with the coefficient of determination, R², in equation (2). MAPE measures the relative magnitude of error, while R² measures how well the predictions track the overall trend of the data:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{Q_{c,i} - Q_{m,i}}{Q_{m,i}} \right|$$

(1)

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (Q_{m,i} - Q_{c,i})^{2}}{\sum_{i=1}^{n} (Q_{m,i} - \bar{Q}_{m})^{2}}$$

(2)

where Q_c is the calculated or predicted capacity, Q_m is the measured or interpreted capacity and $\bar{Q}_{m}$ is the mean of the measured capacities. In addition, the mean $μ_{Q_{c} / Q_{m}}$ and standard deviation $σ_{Q_{c} / Q_{m}}$ of the capacity ratio are tracked to assess bias and precision on a scale-free basis. Finally, poor predictions were recorded, defined a priori as cases with Q_c/Q_m < 0.5 or Q_c/Q_m > 2.0 (predictions less than one half or greater than twice the interpreted capacity).
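The metrics above can be computed as in the following sketch. The function name and the use of the sample (n − 1) standard deviation for the capacity ratio are assumptions; the example capacities are made-up numbers used only to exercise the code.

```python
import math

def capacity_metrics(q_calc, q_meas):
    """MAPE, R², mean and standard deviation of Q_c/Q_m, and poor-prediction count."""
    n = len(q_meas)
    ratios = [c / m for c, m in zip(q_calc, q_meas)]
    mape = sum(abs(c - m) / m for c, m in zip(q_calc, q_meas)) / n
    mean_meas = sum(q_meas) / n
    ss_res = sum((m - c) ** 2 for c, m in zip(q_calc, q_meas))
    ss_tot = sum((m - mean_meas) ** 2 for m in q_meas)
    mu = sum(ratios) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in ratios) / (n - 1))  # sample std (assumed)
    poor = sum(1 for r in ratios if r < 0.5 or r > 2.0)
    return {"mape": mape, "r2": 1.0 - ss_res / ss_tot,
            "mu": mu, "sigma": sigma, "poor": poor}

# Made-up capacities in kips, for illustration only.
metrics = capacity_metrics(q_calc=[110.0, 180.0, 900.0], q_meas=[100.0, 200.0, 400.0])
```

In this example the third pile is predicted at more than twice its interpreted capacity, so it is counted as a poor prediction and drives R² negative even though the other two predictions are within 10%.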

Results

The results are organized to answer two questions: whether layered soil representations improve MLP performance and how training size affects that performance. First, the 500 Optuna trials for every feature set and training size are summarized in Figure 3, where higher R² indicates a tighter fit to the observed data. The mean of each distribution of R² is marked as a black circle filled with purple. Two trends are immediately clear. (1) Moving from one third to two thirds to the full training set raises both the central tendency and the upper envelope of R². (2) Trial outcomes are dispersed within each configuration, with a long tail of weak settings and a smaller but dense fraction of strong ones. This confirms that careful tuning is essential: whichever feature set is used, hyperparameter tuning has a far larger impact on R² than the choice of feature set.

Figure 3

Three scatter panels (Train: 1/3, Train: 2/3 and Full train) compare R² across feature sets on a shared y-axis from −1.0 to 1.0. The x-axes list the feature sets, beginning with CORE and followed by combinations with average soil, type, TUW, N, s_u, ϕ and q_c. Each panel shows many faint points for individual trials and larger, darker points for the average R². In all panels, individual trial results are widely scattered and average values remain below zero for most feature sets. Results in the Train: 2/3 and Full train panels are generally less negative than in Train: 1/3.

Average and individual model performances grouped by used feature set and portion of training set

Next, the single best test R² achieved for each feature set under the three training sizes is reported in Figure 4. First, enlarging the training set improves the best attainable R², with a large gap between one third and the other two, and a smaller gap between two thirds and full. Second, core + average soil information outperforms most layered sets; the exception is the full training set combined with all layered information. Third, adding single layered families to the core set usually helps, but soil type and total unit weight alone did not improve performance and in some cases made it worse. This suggests that geotechnical data are extremely noisy, so large data sets are required to ameliorate the noise, and that pile driving changes soil properties from their measured in-situ values sufficiently to make the average good enough for most analyses.

Figure 4

Horizontal dot plot of best R² (x-axis, 0.2 to 0.7) for each feature set (y-axis): CORE, CORE + average soil, CORE + type, CORE + TUW, CORE + N, CORE + s_u, CORE + ϕ, CORE + q_c and several multi-feature combinations. Three dots appear for each feature set: Full Train, Train 1/3 and Train 2/3. Full Train and Train 2/3 values cluster near 0.55 to 0.66. Train 1/3 values are lower, mostly near 0.30 to 0.50. The highest values are generally from Full Train or Train 2/3.

Best R² comparison on testing set by each feature set and training splits

The best tuned model for each feature set under the three training sizes is presented in Figures 5–7. Each panel shows predicted versus interpreted capacity with inset tables reporting R², MAPE, the mean and standard deviation of Q_c/Q_m, and counts of poor predictions for the Train, Test and Unused subsets. Using the full training set, Figure 5 shows that many configurations do not capture the low-capacity regime well, as seen by the wide, diffuse clouds at small capacities. All panels exhibit a bias toward Q_c/Q_m > 1, that is, overprediction of capacity, which is the dangerous kind of bias. Any addition of soil information improves on the core feature set in both R² and MAPE. The core + layered soil type + total unit weight + N configuration (Configuration 12) gives the strongest overall balance, with the lowest MAPE, a mean Q_c/Q_m closest to one, the second highest R², and the fewest poor predictions. Aside from core and core + layered soil type (Configuration 3), the remaining feature combinations show broadly comparable performance under the full training set. Performance between testing and training is also generally consistent, suggesting no overfitting and slight underfitting for low-capacity data points.

Figure 5

Twelve scatter panels plot calculated capacity (x-axis) against interpreted capacity (y-axis) in kips, with secondary axes in meganewtons, for models trained on the full training set. Each panel shows train and test points, a central diagonal reference line, dashed boundary lines, and an inset table of R², error measures, poor-prediction count and sample counts. Panels are labelled CORE, CORE + average soil, CORE + type, CORE + TUW, CORE + N, CORE + s_u, CORE + ϕ, CORE + q_c, CORE + type + TUW + N + s_u + ϕ + q_c, CORE + type + s_u + ϕ, CORE + TUW + s_u + ϕ, and CORE + type + TUW + N. All panels show positive trends, with data clustered mostly between about 100 and 1000 kips. Several enhanced feature combinations show tighter clustering around the central line than CORE alone.

Performance of the best models with each feature set using full training set

Figure 6

Twelve scatter panels plot calculated capacity (x-axis) against interpreted capacity (y-axis) in kips, with secondary axes in meganewtons, for models trained on 2/3 of the training set. Each panel shows train, test and unused points, a central diagonal reference line, dashed boundary lines, and an inset table of R², error measures, poor-prediction count and sample counts for the train, test, unused and overall sets. Panels are labelled CORE, CORE + average soil, CORE + type, CORE + TUW, CORE + N, CORE + s_u, CORE + ϕ, CORE + q_c, CORE + type + TUW + N + s_u + ϕ + q_c, CORE + type + s_u + ϕ, CORE + TUW + s_u + ϕ, and CORE + type + TUW + N. Data cluster mostly between about 100 and 1000 kips. Several expanded feature combinations show tighter grouping around the central line than CORE alone.

Performance of the best models with each feature set using 2/3 of the training set

Figure 7

Twelve scatter panels plot calculated capacity (x-axis) against interpreted capacity (y-axis) in kips, with secondary axes in meganewtons, for models trained on 1/3 of the training set. Each panel shows train, test and unused points, a central diagonal reference line, dashed boundary lines, and an inset table of R², error measures, poor-prediction count and sample counts for the train, test, unused and overall sets. Panels are labelled CORE, CORE + average soil, CORE + type, CORE + TUW, CORE + N, CORE + s_u, CORE + ϕ, CORE + q_c, CORE + type + TUW + N + s_u + ϕ + q_c, CORE + type + s_u + ϕ, CORE + TUW + s_u + ϕ, and CORE + type + TUW + N. Most points cluster between about 100 and 1000 kips. Compared with the larger training sets, point spread is wider and alignment with the central line is weaker in several panels.

Performance of the best models with each feature set using 1/3 of the training set

With two thirds of the training data (Figure 6), the lower end of the point clouds widens for core and core + average soil, indicating weaker performance at small capacities. This is notable because the average soil representation was near the top when the full training set was used, yet here it drops behind several layered sets. Across panels, unused points tend to perform worse than train and test points, which signals overfitting that is not obvious from the small train–test gap. Because Optuna selects the model that scores best on the test set, selection bias can shrink the apparent train–test difference; similar means can hide different variances. In this regime, core + ϕ and core + q_c (Configurations 7 and 8) compare favorably with the bottom-row combinations.

With one third of the training data (Figure 7), performance falls across all configurations. Point clouds broaden, especially at low capacities, and counts of poor predictions rise. Bias persists, with mean Q_c/Q_m generally above one. The unused subset usually performs better than the train and test subsets, which in this case points to underfitting.

Discussion

The study was designed to test whether layered soil representations improve MLP accuracy and how much that gain depends on training size. The evidence from Figure 3 shows that average performances do not convey the full story: while the core feature set is consistently outperformed by most sets when the full training data are used, its average performance does not fall among the worst. This indicates that while average performance reflects the general tendency of a feature set, the best individual outcomes tell a different story. The added features may introduce noise that only certain hyperparameter configurations end up memorizing, inflating peak performance, or they may represent meaningful but harder-to-learn structure that reduces consistency across trials and lowers the mean even when the top models improve.

Training size emerged as the strongest driver. A large gap in performance is evident in Figure 4 between using one third and two thirds of the data, and a smaller gap between two thirds and full. In three feature sets the best test R² at two thirds exceeds that at full, most likely due to randomness from the search process and the fixed split, since most feature sets improve with the full data. The same models suffered heavily on the unused subset, revealing overfitting that was hidden between training and testing. The pattern supports a law of diminishing returns in which marginal gains from additional data are largest when the data set is small and taper as it grows. However, this trend should be interpreted cautiously, as the present database size may still be below the threshold where true performance saturation occurs, and further gains could emerge with additional and more diverse data.

These findings place current practice in context. Many traditional methods were built on fewer than 100 load tests and embed strong functional assumptions. When a disciplined neural-network model still struggles at the one-third level, it is unsurprising that legacy equations calibrated on small samples show wide scatter in modern databases. Feature content matters, but less than sample size; adding soil information to core geometric properties typically helps, yet the improvement from richer layered features is smaller than the improvement from increasing the training set from one third to two thirds. Also, many studies in the literature use the testing fold when optimizing the models, essentially leading to data leakage. The difference between the unused fold and both the training and testing folds indicates that the generalizability of the models should be reported more carefully.

Within the layered family, performance is not uniform. Core + average soil and core + all layered soil frequently rank highest on the two-thirds and full sets, with core + q_c close behind. Soil type and total unit weight alone do not improve accuracy and can degrade it, suggesting that categorical descriptors and bulk unit weight carry less predictive information than penetration resistance or strength measures at this resolution. Combinations including total unit weight become competitive only when paired with N, s_u, or ϕ.

Some underfitting persists even with the full training set and a 500-trial search, as the low-capacity regime shows diffuse clouds suggesting limitations in data quality or coverage rather than insufficient tuning. Evaluation must therefore rely on several metrics: R² alone can be misleading – core + total unit weight shows lower R² than core + q_c but smaller MAPE – so ratio statistics and counts of poor predictions provide essential context, and visual inspection is needed to detect regime-specific errors and systematic offsets.
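
The disagreement between R² and MAPE noted above can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names, the synthetic capacities and the thresholds used to flag a "poor" prediction (`poor_low`, `poor_high`) are assumptions made for this example, not the definitions or data used in the study.

```python
from statistics import mean


def r_squared(measured, calculated):
    """Coefficient of determination of calculated vs. measured capacities."""
    m_bar = mean(measured)
    ss_res = sum((m - c) ** 2 for m, c in zip(measured, calculated))
    ss_tot = sum((m - m_bar) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def mape(measured, calculated):
    """Mean absolute percentage error, expressed as a fraction."""
    return mean(abs(c - m) / m for m, c in zip(measured, calculated))


def ratio_stats(measured, calculated, poor_low=0.5, poor_high=2.0):
    """Mean calculated/measured ratio and count of 'poor' predictions.

    The poor_low and poor_high bounds are hypothetical placeholders.
    """
    ratios = [c / m for m, c in zip(measured, calculated)]
    poor = sum(1 for r in ratios if r < poor_low or r > poor_high)
    return {"mean_ratio": mean(ratios), "poor_count": poor}


# Synthetic capacities (arbitrary units), invented for illustration only.
measured = [100, 200, 400, 1000]
model_a = [180, 320, 420, 1000]  # accurate at high capacity, poor at low
model_b = [105, 210, 380, 800]   # accurate at low capacity, poor at high

for name, calc in (("A", model_a), ("B", model_b)):
    print(name, round(r_squared(measured, calc), 3),
          round(mape(measured, calc), 3), ratio_stats(measured, calc))
```

In this synthetic case model A attains the higher R² because squared errors are dominated by the largest pile, while model B has the smaller MAPE because relative errors weight every pile equally. Reporting both, together with the ratio statistics, exposes the difference.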

Overfitting can also be masked when model selection is based on test performance. In the two-thirds regime the unused subset frequently performs far worse than both train and test despite similar train/test means, indicating selection bias that a nested validation scheme would reduce. At one-third training size, performance deteriorates across all feature sets; despite using 127 piles – more than many traditional calibration data sets – the MLP exhibits unstable relationships and underfitting. This underscores the need for more load tests or, when expansion is not feasible, reduced feature complexity or stronger regularization, albeit with an accuracy ceiling.
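
The selection effect described above can be illustrated with a small simulation. The sketch below is a simplified illustration, not the procedure used in the study: the split fractions, the score level and the noise magnitude are all assumed values, and every candidate model is given the same underlying skill so that any gap between the selection fold and the unused fold comes from selection alone.

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle n indices into train, selection and unused folds.

    The fractions are hypothetical; the unused fold takes the remainder.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n)
    n_select = int(fractions[1] * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_select],
            idx[n_train + n_select:])


def select_and_report(trials):
    """Pick the trial with the best selection-fold score.

    Returns that score together with the same trial's unused-fold score.
    """
    best = max(trials, key=lambda t: t["select_score"])
    return best["select_score"], best["unused_score"]


def simulate_selection_bias(n_trials=500, true_score=0.6, noise=0.05,
                            repeats=200, seed=0):
    """Average gap between selection-fold and unused-fold scores.

    Every trial has identical true skill; scores differ only by noise.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(repeats):
        trials = [{"select_score": true_score + rng.gauss(0, noise),
                   "unused_score": true_score + rng.gauss(0, noise)}
                  for _ in range(n_trials)]
        selected, unused = select_and_report(trials)
        gaps.append(selected - unused)
    return sum(gaps) / len(gaps)


train, select, unused = split_indices(546)
print(len(train), len(select), len(unused))
print(simulate_selection_bias())
```

Because the winner is chosen for its score on the selection fold, that score is optimistically biased even when no trial is truly better than another. A fold that plays no role in selection, as the unused subset does here, gives the less biased estimate.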

Although the models were explicitly trained to predict Davisson-interpreted capacities, the systematic overprediction in the low-capacity regime may reflect structural characteristics of the Davisson criterion itself rather than purely model deficiency. Prior work (Kodsy et al., 2021, 2022a, 2022b, 2023b) has shown that Davisson-based interpretations can exhibit systematic bias relative to other criteria, and such tendencies may be inherited by data-driven models trained on those labels. In addition, because model selection was guided by R², optimization naturally emphasizes variance in the mid- to high-capacity range where most data reside, which can shift the fitted relationship upward at the lower tail. Data imbalance – fewer low-capacity cases and greater relative noise at small settlements – may further amplify this effect. Therefore, the observed bias likely arises from a combination of label definition, optimization emphasis and data distribution rather than solely from modeling choices. Finally, although 36 MLPs were trained using different feature sets and training sizes, all performed worse than the model from the previous study, which benefited from a larger hyperparameter search space (Appendix I) and 6,000 trials. This comparison reinforces the importance of extensive optimization.

Limitations

This study uses a larger and more diverse data set than most prior work, but it is still modest from a machine learning perspective. Labels come from interpreted capacities, and many soil descriptors were inferred from correlations rather than direct measurements. Assumptions such as estimating ϕ and s_u from N_corr and generating synthetic q_c from SPT via Robertson zones add noise that neither traditional formulas nor MLPs can remove. These data issues likely cap the attainable accuracy, especially in the low-capacity range.

The proposed layered encoding also has limitations. Soil properties were sampled at fixed relative depths rather than at actual layer boundaries. This approach was selected to standardize the encoding of all piles. Nevertheless, it may miss thin but mechanically dominant strata or abrupt strength transitions that significantly influence pile capacity. Therefore, the comparable or superior performance of averaged inputs may partly reflect this simplified discretization rather than a true lack of benefit from layered data. Future work should explore boundary-aware or higher-resolution sampling aligned with actual stratigraphy to better assess the value of layered representations.
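
A minimal sketch of how fixed-relative-depth sampling can skip a thin stratum is given below. It is an illustration under stated assumptions rather than the encoding used in the study: the function names, the choice of five sampling points placed at the midpoints of equal segments, and the soil profile itself are all hypothetical.

```python
def value_at_depth(layers, depth):
    """Return the property of the layer containing the given depth.

    layers is a list of (top, bottom, value) tuples ordered by depth.
    """
    for top, bottom, value in layers:
        if top <= depth < bottom:
            return value
    return layers[-1][2]  # depth at or below the last boundary


def sample_layered(layers, pile_length, n_points=5):
    """Sample the profile at fixed relative depths along the shaft.

    Sampling at segment midpoints is an assumption made for this sketch.
    """
    depths = [(i + 0.5) / n_points * pile_length for i in range(n_points)]
    return [value_at_depth(layers, d) for d in depths]


def averaged(layers, pile_length):
    """Thickness-weighted average of the property over the pile length."""
    total = 0.0
    for top, bottom, value in layers:
        overlap = max(0.0, min(bottom, pile_length) - min(top, pile_length))
        total += overlap * value
    return total / pile_length


# Hypothetical profile: (top, bottom, SPT-N), with a thin dense stratum
# between depths 9.0 and 9.8 along a pile of length 20 (same length unit).
profile = [(0.0, 9.0, 10), (9.0, 9.8, 50), (9.8, 20.0, 15)]

print(sample_layered(profile, pile_length=20.0))  # sampled at 2, 6, 10, 14, 18
print(averaged(profile, pile_length=20.0))
```

In this example none of the five sampling depths falls inside the thin dense stratum, so the layered vector carries no trace of it, whereas the thickness-weighted average is shifted by it. This is one way in which an averaged input can match or exceed a coarsely sampled layered input.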

The search budget was fixed at 500 Optuna trials per configuration across 12 feature sets and 3 training sizes. This constraint kept the 36-run matrix tractable but limited the breadth of architectures, optimizers, and training schedules that could be explored. Completing a single 500-trial run required approximately 1–3 h depending on data set size and convergence behavior; across 36 runs, this corresponded to roughly 72 cumulative computation hours under the current resource budget. Preliminary exploratory studies conducted by the authors indicated that performance improvements beyond approximately 200–300 trials were marginal, with diminishing returns relative to computational cost. For this reason, 500 trials were selected as a balance between adequate coverage of the high-dimensional search space and practical feasibility. Nevertheless, the reported “best” models should be interpreted as optimal within this bounded and resource-constrained search framework rather than guaranteed global optima.

Practical implications

Geotechnical databases assembled from real projects are almost always incomplete. Many layers lack key parameters, and available measurements often mix direct tests with values inferred from empirical correlations. Conventional wisdom holds that such “sub-optimal” data are unsuitable for developing mechanics-based design correlations, especially for models that rely on continuous profile information. The present study offers a counterpoint: although individual soil descriptors may be missing or noisy, a sufficiently large and diverse data set can offset the deficiencies of imperfect input features when machine learning is used.

Across 546 load tests, even feature sets built from averaged soil properties – arguably the coarsest possible encoding – performed nearly as well as more elaborate layered representations when enough training data were used. Models using layered profiles delivered modest gains, but these improvements were consistently smaller than the gains achieved simply by increasing training-set size. This indicates that data quantity and diversity matter more than perfect soil characterization. For many practical applications, curated collections of incomplete records may still support reliable ML models, provided that sample sizes are large and hyperparameter tuning is disciplined. A number of practical takeaways are possible:

Valuable models can be trained without complete stratigraphic detail. When project records lack q_c, s_u or ϕ at all layers, using averaged or partially derived soil information can still yield adequate predictions. However, this result may be tied to the noise level in the present database, which can limit the consistent learnability of layered detail rather than indicating that stratigraphy is unimportant.
Expanding the volume of available tests is more impactful than adding additional parameters. Prioritizing the aggregation of many imperfect load-test records – across sites, agencies or archives – is likely to improve model performance more than demanding perfect soil profiles for each case.
Increasing training size primarily expands inter-site variability – capturing differences in geology, installation practices and stress histories – whereas layered features describe intra-profile variability within a pile. In a database with inferred and heterogeneous soil parameters, additional data may reduce noise and site bias more effectively than increasing feature complexity. In cleaner data sets with consistently measured parameters, layered representations could play a stronger role, so the dominance of training size observed here should be interpreted in light of data quality.
ML workflows should be designed to tolerate noise. Because correlated soil parameters are often inferred rather than measured, model architectures and tuning strategies must anticipate variability and redundancy.
Data-sharing initiatives have outsized benefits. Broad access to diverse load test archives could enable more robust ML models than isolated high-quality data sets.

Overall, the study demonstrates that large, noisy, and incomplete geotechnical data sets – typical of real practice – can still support high-performing ML models. This lowers the barrier for organizations to adopt data-driven prediction tools, even when legacy records are inconsistent or partial.

Conclusions

This study tested MLPs across 12 feature representations and three training sizes to address two questions: whether layered soil information improves prediction, and how performance scales with data. The results show that layered representations generally improve accuracy over geometry alone; however, there is no clear advantage in using layered representations over the average soil encoding. The most consistent configuration paired core with layered soil type, layered total unit weight and layered N, with an R² of 0.64 and a MAPE of 0.43. Single layered families that carry direct strength or testing information, such as N or q_c, were competitive; soil type or total unit weight alone was weak. Using all layered variables did not always yield the best outcome, which indicates that adding variables beyond a core set can bring diminishing returns.

Training size was the dominant factor. Moving from one third to two thirds of the training data produced a large gain, while the gain from two thirds to the full set was smaller. In a few cases the best two-thirds model edged out the full-set model, which is most likely related to randomness in search and splitting rather than a systematic advantage. However, given that the total database contains only 546 piles, it would not be appropriate to interpret this trend as evidence of saturation or diminishing returns. The results indicate that the one-third subset – although already larger than many published data sets – was insufficient for stable and reliable modeling, and that increasing the number of data points improved both central performance and robustness. Whether true convergence has been reached cannot be concluded from this scale of data. Future improvements are therefore more likely to come from expanding the data set, particularly by increasing both volume and diversity of pile–soil conditions, rather than assuming saturation.

Practically, the findings suggest four priorities. First, invest in data size and label quality before expanding the feature set. Second, average soil information along the pile length is almost as informative as the layered variables used separately or in combination; given the training cost of layered representations, averaging should be the priority. Third, keep the search disciplined with a standard budget and objective. Fourth, evaluate with several metrics, include a validation fold to control selection bias, and support the evaluation with visual inspection when possible.

Declaration of generative artificial intelligence and artificial intelligence-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT to catch grammatical errors. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

The authors are grateful to the late Prof. Roy Olson of the University of Texas at Austin for making his data available to them.

References

Akiba, T., Sano, S., Yanase, T., Ohta, T. and Koyama, M. (2019), “Optuna: a next-generation hyperparameter optimization framework”, Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, doi: https://doi.org/10.1145/3292500.3330701.

American Petroleum Institute (API) (1993), Recommended Practice for Planning, Designing, and Constructing Fixed Offshore Platforms (RP 2A), API Production Dept, Washington, DC.

Bowles, J. (1988), Foundation Analysis and Design, 4th ed., McGraw-Hill, New York, NY.

Briaud, J. and Tucker, L.M. (1988), “Measured and predicted axial response of 98 piles”, Journal of Geotechnical Engineering, Vol. 114 No. 9, pp. 984-1001, doi: https://doi.org/10.1061/(asce)0733-9410(1988)114:9(984).

Chen, T. and Guestrin, C. (2016), “XGBoost: a scalable tree boosting system”, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, Vol. 1 No. 1, pp. 785-794, doi: https://doi.org/10.1145/2939672.2939785.

Davisson, M.T. (1972), “High capacity piles”, Proceedings of Soil Mechanics Lecture Series on Innovations in Foundation Construction, ASCE, IL Section, Chicago, pp. 81-112.

Dennis, N.D. and Olson, R.E. (1983a), “Axial capacity of steel pipe piles in clay”, Proceedings of Geotechnical Practice in Offshore Engineering, ASCE, Austin, TX, pp. 370-388.

Dennis, N.D. and Olson, R.E. (1983b), “Axial capacity of steel pipe piles in sand”, Proceedings of Geotechnical Practice in Offshore Engineering, ASCE, Austin, TX, pp. 389-402.

Focht, J.A. and Vijayvergiya, V.N. (1972), “A new way to predict capacity of piles in clay”, Offshore Technology Conference, No. OTC-1718-MS, doi: https://doi.org/10.4043/1718-MS.

Hannigan, P.J., Rausche, F., Likins, G.E., Robinson, B. and Becker, M. (2016), Design and Construction of Driven Pile Foundations – Volume I Report No. FHWA-NHI-16-009, National Highway Institute, Federal Highway Administration, U.S. Department of Transportation, Washington, DC.

Hornik, K., Stinchcombe, M. and White, H. (1989), “Multilayer feedforward networks are universal approximators”, Neural Networks, Vol. 2 No. 5, pp. 359-366, doi: https://doi.org/10.1016/0893-6080(89)90020-8.

Kodsy, A., Machairas, N. and Iskander, M. (2021), “Assessment of several interpreted pile capacity criteria for large diameter open-ended piles”, Geotechnical Testing Journal, Vol. 44 No. 5, doi: https://doi.org/10.1520/GTJ20200074.

Kodsy, A., Iskander, M.G. and Pandey, A.S. (2022a), “Universal criterion for interpreting capacity from load tests on piles”, Transportation Research Record: Journal of the Transportation Research Board, Vol. 2676 No. 8, pp. 530-541, doi: https://doi.org/10.1177/03611981221084686.

Kodsy, A., Machairas, N. and Iskander, M. (2022b), “Assessment of several capacity interpretation criteria for drilled shafts”, ASCE Geotechnical and Geoenvironmental Engineering J, Vol. 148 No. 2, doi: https://doi.org/10.1061/(ASCE)GT.1943-5606.0002733.

Kodsy, A., Ozturk, B. and Iskander, M. (2023a), “Forecasting of pile plugging using machine learning”, Acta Geotechnica, Vol. 18 No. 7, pp. 3697-3714, doi: https://doi.org/10.1007/s11440-023-01797-5.

Kodsy, A., Ozturk, B., Bazi, Y. and Iskander, M. (2023b), “Assessment of several interpretation criteria for tensile axial load tests on deep foundations”, Transportation Research Record: Journal of the Transportation Research Board, Vol. 2677 No. 6, Sage, doi: https://doi.org/10.1177/03611981221149435.

Kraft, L.M., Focht, J.A. and Amerasinghe, S.F. (1981), “Friction capacity of piles driven into clay”, Journal of Geotechnical and Geoenvironmental Engineering, Vol. 107 No. GT11.

Lee, I.-M. and Lee, J.-H. (1996), “Prediction of pile bearing capacity using artificial neural networks”, Computers and Geotechnics, Vol. 18 No. 3, pp. 189-200, doi: https://doi.org/10.1016/0266-352x(95)00027-8.

Li, L. and Iskander, M. (2022), “Use of machine learning for classification of sand particles”, Acta Geotechnica, Vol. 17 No. 10, pp. 4739-4759, doi: https://doi.org/10.1007/s11440-021-01443-y.

Machairas, N., Highley, G.A. and Iskander, M.G. (2018), “Evaluation of FHWA pile design method against the FHWA deep foundation load test database version 2.0”, Transportation Research Record: Journal of the Transportation Research Board, Vol. 2672 No. 52, pp. 268-277, doi: https://doi.org/10.1177/0361198118773196.

Moayedi, H. and Armaghani, D.J. (2017), “Optimizing an ANN model with ICA for estimating bearing capacity of driven pile in cohesionless soil”, Engineering with Computers, Vol. 34 No. 2, pp. 347-356, doi: https://doi.org/10.1007/s00366-017-0545-7.

Momeni, E., Nazir, R., Jahed Armaghani, D. and Maizir, H. (2014), “Prediction of pile bearing capacity using a hybrid genetic algorithm-based ANN”, Measurement, Vol. 57, pp. 122-131, doi: https://doi.org/10.1016/j.measurement.2014.08.007.

Naval Facilities Engineering Command (NAVFAC) (1986), Design Manual 7.01 (DM-7.01) Soil Mechanics, NAVFAC, Alexandria, VA.

Nordlund, R.L. (1963), “Bearing capacity of piles in cohesionless soils”, Journal of the Soil Mechanics and Foundations Division, Vol. 89 No. 3, pp. 1-35, doi: https://doi.org/10.1061/jsfeaq.0000507.

Olson, R. and Iskander, M. (2014), “Axial load capacity of pipe piles in sands”, Proceedings of Geotechnical Special Publication 233, ASCE, pp. 209-220.

Ozturk, B. and Iskander, M. (2025), “Comparative prediction of axial pipe pile capacity using SVR, XGBoost, and neural networks”, Geodata and AI, Elsevier, Vol. 6, p. 100049, doi: https://doi.org/10.1016/j.geoai.2025.100049.

Ozturk, B., Kodsy, A. and Iskander, M. (2023b), “Forecasting the capacity of open-ended pipe piles using machine learning”, Infrastructures, Vol. 8 No. 1, p. 12, doi: https://doi.org/10.3390/infrastructures8010012.

Ozturk, B., Kodsy, A. and Iskander, M. (2024a), “Effect of feature selection technique on the pile capacity predicted using machine learning”, Geo-Congress 2024, pp. 153-163, doi: https://doi.org/10.1061/9780784485323.016.

Ozturk, B., Kodsy, A. and Iskander, M. (2024b), “Forecasting the bearing capacity of open-ended pipe piles using machine learning ensemble methods”, IFCEE 2024, American Society of Civil Engineers, Reston, VA, pp. 146-156, doi: https://doi.org/10.1061/9780784485408.016.

Ozturk, B., Kodsy, A. and Iskander, M. (2024c), “Using machine learning to predict axial pile capacity”, Transportation Research Record: Journal of the Transportation Research Board, SAGE Publications, Vol. 2678 No. 11, pp. 2224-2244, doi: https://doi.org/10.1177/03611981241242762.

Ozturk, B., Kodsy, A., Bazi, Y. and Iskander, M.G. (2023a), “Efficacy of several design methods for predicting the axial compressive capacity of piles”, Transportation Research Record: Journal of the Transportation Research Board, Vol. 2677 No. 9, pp. 1-17, doi: https://doi.org/10.1177/03611981231158335.

Pal, M. and Deswal, S. (2008), “Modeling pile capacity using support vector machines and generalized regression neural network”, Journal of Geotechnical and Geoenvironmental Engineering, Vol. 134 No. 7, pp. 1021-1024, doi: https://doi.org/10.1061/(asce)1090-0241(2008)134:7(1021).

Peck, R.B., Hanson, W.E. and Thornburn, T.H. (1974), Foundation Engineering, Wiley, New York, NY.

Petek, K., Mitchell, R. and Ellis, H. (2016), FHWA Deep Foundation Load Test Database Version 2.0 User Manual, Report FHWA-HRT-17-034, FHWA, McLean, VA.

Phoon, K.K. and Tang, C. (2019), “Effect of extrapolation on interpreted capacity and model statistics of steel H-piles”, Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards, Vol. 13 No. 4, pp. 291-302, doi: https://doi.org/10.1080/17499518.2019.1652920.

Rizk, A., Kodsy, A., Iskander, M. and Machairas, N. (2022), “Efficacy of design methods for predicting the capacity of large-diameter open-ended piles”, Journal of Geotechnical and Geoenvironmental Engineering, Vol. 148 No. 10, doi: https://doi.org/10.1061/(asce)gt.1943-5606.0002824.

Robertson, P.K. (2016), “Cone penetration test (CPT)-based soil behaviour type (SBT) classification system – an update”, Canadian Geotechnical Journal, Vol. 53 No. 12, pp. 1910-1927, doi: https://doi.org/10.1139/cgj-2016-0044.

Roling, M.J., Sritharan, S. and Suleiman, M.T. (2011), “Development of LRFD procedures for bridge piles in IA”, Vol. 1 An Electronic Database for Pile Load Tests (PILOT), Rep. TR-573, IA State Univ., Ames, IA.

Shwartz-Ziv, R. and Armon, A. (2022), “Tabular data: deep learning is not all you need”, Information Fusion, Vol. 81, pp. 84-90, doi: https://doi.org/10.1016/j.inffus.2021.11.011.

Tomlinson, M.J. (1980), Foundation Design and Construction, 4th ed., Pitman, London.

U.S. Army Corps of Engineers (USACE) (1991), Design of Pile Foundations, Engineer Manual EM 1110-2-2906, Washington, DC.

Wang, S.T., Arrellaga, J.A. and Vasquez, L. (2019), APILE V2019 Technical Manual - a Program for the Study of Driven Piles under Axial Loads, ENSOFT, Inc., 3003 West Howard Lane, Austin, Texas 78728, USA.

Wang, Y., Wang, L., Liu, S., Han, L., Zhang, W., Hong, L., Zhu, Z. and Zhu, X. (2025), “Region similarity assessment for empowering physics-informed transfer learning-based landslide susceptibility mapping”, Journal of Rock Mechanics and Geotechnical Engineering, Elsevier, doi: https://doi.org/10.1016/j.jrmge.2025.06.030.

Zhang, W., Li, H., Li, Y., Liu, H., Chen, Y. and Ding, X. (2021), “Application of deep learning algorithms in geotechnical engineering: a short critical review”, Artificial Intelligence Review, Vol. 54 No. 8, pp. 5633-5673, doi: https://doi.org/10.1007/s10462-021-09967-1.

Zhang, W., Han, H., Sun, W., Wang, Y., Wu, Z., Xiao, P. and Yan, Y. (2025), “Knowledge-based data-driven prediction of shield tail clearance under karst geological condition”, Geoscience Frontiers, Elsevier, Vol. 17 No. 2, p. 102221, doi: https://doi.org/10.1016/j.gsf.2025.102221.

Appendix. Hyperparameter search space (Optuna, 500 trials per configuration)

Optuna explored a fixed, predefined search space covering MLP architecture and training settings:

Number of Hidden Layers: 1, 2, 3, 4, 5.
Neurons per Hidden Layer: 16, 32, 48, 64.
Activation Function (per layer): relu, sigmoid, tanh, softplus, softsign, selu, elu, exponential, linear, hard_sigmoid, swish (silu), gelu, softmax, log_softmax, hard_swish, relu6.
Dropout Rate: 0.1, 0.2, 0.3, 0.4, 0.5.
Normalization: none, batch.
Regularization Type: L1, L2, L1 + L2.
L1 Regularization Coefficient: 1e−6 to 1e−2 (log-uniform).
L2 Regularization Coefficient: 1e−6 to 1e−2 (log-uniform).
Kernel Initializer: glorot_uniform, he_normal, lecun_normal.
Optimizer: Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam, Ftrl.
Learning Rate: 1e−4 to 1e−2 (log-uniform).
Batch Size: 48, 72, 96, …, 256 (step = 24).
Loss Function: MAPE, MAE, Huber, log-cosh.
Gradient Clipping (clipnorm): 0.1–1.0.

Training controls (fixed): up to 3,000 epochs with early stopping (patience 30, best weights restored) and learning-rate reduction on plateau (factor 0.5, patience 15).
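
For readers who wish to set up a comparable search, the sketch below shows how part of the space listed above could be declared through Optuna's define-by-run interface. It is an assumption-laden illustration, not the authors' code: the function name `suggest_mlp_config` and the parameter names are hypothetical, neuron counts are assumed to be sampled independently per layer, and batch size is omitted because the listed grid is abbreviated. Only `suggest_int`, `suggest_float` and `suggest_categorical` are used on the `trial` object.

```python
ACTIVATIONS = ["relu", "sigmoid", "tanh", "softplus", "softsign", "selu",
               "elu", "exponential", "linear", "hard_sigmoid", "swish",
               "gelu", "softmax", "log_softmax", "hard_swish", "relu6"]
OPTIMIZERS = ["Adam", "SGD", "RMSprop", "Adagrad", "Adadelta", "Adamax",
              "Nadam", "Ftrl"]
LOSSES = ["MAPE", "MAE", "Huber", "log-cosh"]


def suggest_mlp_config(trial):
    """Draw one MLP configuration from an Optuna trial (hypothetical helper)."""
    n_layers = trial.suggest_int("n_layers", 1, 5)
    layers = []
    for i in range(n_layers):
        layers.append({
            # Assumption: width is chosen per layer, like the activation.
            "units": trial.suggest_categorical(f"units_{i}", [16, 32, 48, 64]),
            "activation": trial.suggest_categorical(f"activation_{i}",
                                                    ACTIVATIONS),
        })
    return {
        "layers": layers,
        "dropout": trial.suggest_categorical("dropout",
                                             [0.1, 0.2, 0.3, 0.4, 0.5]),
        "normalization": trial.suggest_categorical("normalization",
                                                   ["none", "batch"]),
        "regularization": trial.suggest_categorical("regularization",
                                                    ["L1", "L2", "L1+L2"]),
        "l1": trial.suggest_float("l1", 1e-6, 1e-2, log=True),
        "l2": trial.suggest_float("l2", 1e-6, 1e-2, log=True),
        "initializer": trial.suggest_categorical(
            "initializer", ["glorot_uniform", "he_normal", "lecun_normal"]),
        "optimizer": trial.suggest_categorical("optimizer", OPTIMIZERS),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2,
                                             log=True),
        "loss": trial.suggest_categorical("loss", LOSSES),
        "clipnorm": trial.suggest_float("clipnorm", 0.1, 1.0),
    }
```

In use, this function would be called at the top of an objective function that builds and trains the network from the returned dictionary and returns the score to be optimized; the objective would then be passed to `study.optimize(objective, n_trials=500)` on a study created with `optuna.create_study`.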

2026

Baturalp Ozturk and Magued Iskander

Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at Link to the terms of the CC BY 4.0 licence.
