The purpose of our research is to ascertain the key drivers of professional soccer player valuation in the transfer market.
Drawing on sports economics, finance and management literature, we connect data-driven approaches to player valuation in the context of organizational decision-making. We evaluate the performance of four predictive models using over 800 real-world transfer fee records and extensive features in the “Big Five” leagues from seasons 2017–2018 to 2019–2020. Subsequently, we leverage Shapley Additive Explanations (SHAP) values, an interpretable machine learning (ML) technique, to identify important features and quantify their contributions to transfer fees.
A few fundamental human capital factors (e.g. age) and labor market variables (e.g. contract remaining) emerge as the key value drivers, outweighing technical capabilities (e.g. goal-scoring). Sport-general features (e.g. composure and reaction) hold greater predictive power than soccer-specific skills (e.g. dribbling).
Our research enhances the explainability and transparency of a reasonably accurate player valuation model in two ways. First, we utilize a rich set of interpretable, fine-grained player features. Second and more importantly, SHAP values allow us to deconstruct player valuation and provide economic interpretations of feature importance at both individual and aggregate levels. We also outline the practical implications of adopting interpretable ML in sports organization decision-making.
Introduction
Transfer is a distinctive economic behavior in the professional soccer market, in contrast to the U.S. franchise systems (e.g. National Football League, National Basketball Association) characterized by player-to-player trading. Soccer player transfers play a pivotal role in boosting team performance (Carmichael and Thomas, 1993) and are a pillar of club business models (Neri et al., 2023). A transfer refers to the movement of a player between two clubs, specifically the acquisition of their “performance rights” (Majewski and Majewska, 2017), often incurring a transfer fee. Table 1 summarizes transfer mechanisms introducing heterogeneity into acquisition costs. Free transfers and player loans do not require transfer fees. The market has a relatively low turnover rate such that a moderate number of players are transferred for fees during each transfer window. Total costs sometimes include bonuses and agent commissions.
A typology of transfer mechanisms
| Dimension | Characteristic |
|---|---|
| Transfer Types | Transfer fee |
| | Free transfer |
| | Player loans |
| | Swap |
| Contractual Clauses | Release clauses |
| | Sell-on percentages |
| | Buyback clauses |
| Intermediaries | Agents |
| | Club executives/sporting directors |
| Associated Financial Flows | Signing bonuses |
| | Agent commissions |
| | Performance-based bonuses |
Comparable to, yet conceptually different from, a transfer fee, market value is a theoretical construct referring to an estimated price a club would be willing to pay for a player rather than a realized price, independent of an actual transfer (Herm et al., 2014; Balliauw et al., 2022). The underlying assumption is that every player has an “objective” market value determined by various characteristics such as age, reputation, past performance, and future potential. However, such a value seldom exists in an imperfect market with information asymmetries manifested by behind-the-scenes negotiations and hidden player personality issues (Follert and Gleißner, 2024). A widely adopted benchmark in transfer negotiations is the crowd-sourced market value published on Transfermarkt.com. Some studies have raised concerns about the validity and reproducibility of the “wisdom of the crowd” approach (for a systematic critique of market values on Transfermarkt.com, cf. Coates and Parshakov, 2022).
The challenges of predicting a soccer player’s transfer fee or market value are inherent in the valuation of multi-attribute strategic assets under uncertainty, be it human capital, venture capital, or real estate. Decision-makers need to mentally process a vast amount of information. Some turn to the “availability heuristic” (Tversky and Kahneman, 1973) or “fast-and-frugal heuristics” (e.g. emphasis on recent performance or name recognition) (Gigerenzer, 2023; Raab and Gigerenzer, 2015). These can be effective cognitive shortcuts. While clubs also solicit expert opinions, the precise valuation criteria remain opaque. Such valuations may be subject to an anchoring bias (e.g. overreliance on record fees or media speculation) and have room for improvement.
Against this backdrop, our overarching research question is:
What are the key drivers of player valuation in the soccer transfer market?
Our motivation is twofold. First, the bulk of prior sports analytics research has focused on performance evaluation, tactical analysis, fan engagement, or match scheduling (Katz et al., 2020; McHale et al., 2012; Pappalardo et al., 2019; Stambulova et al., 2007). Nevertheless, player performance metrics and transfer fees have not been fully integrated, with many player features being unexplored (Gerrard, 2016; Watanabe et al., 2021). This underutilization of available information may cause inefficiencies in the transfer market (Gerrard, 2014). Second, achieving both predictive accuracy and transparent explanations is a non-trivial empirical undertaking. Traditional multiple regression may oversimplify the relationships between transfer fee drivers (Herm et al., 2014; Shmueli and Koppius, 2011), whereas black-box predictive models lack the transparency necessary for building trust with stakeholders (Bauer et al., 2023; Lundberg et al., 2018; Wanless and Naraine, 2023).
By addressing these gaps, we make two major contributions. First, we curate a rich dataset comprising fine-grained player skills (e.g. reaction, composure) and advanced performance metrics. Then, we match these features with real-world transfer fees rather than estimated market values, ensuring data validity and practical relevance. Our research differs from a related study (McHale and Holmes, 2023). We focus on the “Big Five” leagues (English Premier League, Italian Serie A, Spanish La Liga, German Bundesliga, and French Ligue 1) that account for approximately two-thirds of global transfer spending (Poli et al., 2022). In addition to several coarse performance metrics, we utilize a wide set of granular and interpretable features such as composure and ball control, adding specificity to our results. McHale and Holmes (2023) identify generic overall player ratings as a strong predictor of transfer fees, but the specific player features that contribute to this rating remain unclear.
Second, we leverage Shapley Additive Explanations (SHAP) values, an interpretable Machine Learning (ML) technique, to identify key drivers of player valuation at the individual and market levels. SHAP values explain player valuation by visualizing the most important features and quantifying their marginal contributions, which constitutes a novel addition to sports management (Garnica-Caparrós and Memmert, 2021). Moreover, feature importance measured by SHAP values can directly translate to monetary values and therefore have an economic interpretation. By contrast, McHale and Holmes (2023) use a variance-based metric gain score to quantify feature importance, which is hard to interpret practically and cannot point out the directionality of importance (i.e. a positive or negative influence).
Figure 1 is our framework connecting data-driven approaches to strategic decision-making in sports organizations. The rest of our article revolves around this framework. From the top down, the Related Research section primarily focuses on the two theoretical lenses (human capital and pricing), with a brief discussion of analytics-based management paradigms. From the bottom up, the Data and Methodology section describes our observed features and predictive models coupled with an interpretable ML technique. The Results section presents the empirical results. Finally, we lay out managerial implications, limitations, and future research directions.
The diagram shows three boxes arranged horizontally in the center, connected by right arrows and labeled from left to right “Assets (Players),” “Predictions,” and “Explanations.” Above “Assets (Players),” two boxes labeled “Human Capital Theory” (left) and “Pricing Theories” (right) point down to it. Below “Assets (Players),” a box labeled “Data” points up to it.
A framework for data-driven decision-making in sports organizations. Source: Created by the authors
Related Research
Players are the most valuable assets of a soccer club through the lens of human capital (Breuer et al., 2021). Their knowledge, skills, and abilities are intellectual capital and value drivers (Ployhart et al., 2014; Rubio Martin et al., 2022; Wright et al., 1995). Players’ human capital is of entrepreneurial value to clubs (Radaelli et al., 2018). The identification and acquisition of human capital represent the means for sport organizations to succeed athletically and financially.
A strand of literature has identified a handful of human capital factors as salient value drivers: age, nationality, popularity, injury history, match appearances (employee seniority), and playing position (Bryson et al., 2013; Carreras-Simó and García, 2022; Franck and Nüesch, 2008, 2012; Frick, 2007; Herm et al., 2014; Montanari et al., 2008; Pedace, 2008; Shapiro et al., 2017; Stambulova et al., 2007; Kuper and Szymanski, 2009). It is worth noting that popularity originates from intrinsic on-field performance or extrinsic off-field social media celebrity status (Franck and Nüesch, 2008; Franck and Nüesch, 2012; Garcia-del-Barrio and Pujol, 2007; Herm et al., 2014; Shapiro et al., 2017; Rai et al., 2021). The latter often leads to star players earning disproportionately more than their peers–a phenomenon known as the “superstar effect” (Garcia-del-Barrio and Pujol, 2021; Hofmann et al., 2021).
Human capital extends to physiological (e.g. speed, stamina) and psychological (e.g. game intelligence) features (Ali, 2011; Rein and Memmert, 2016; Williams, 2000; Reilly et al., 2000). Ambidexterity (i.e. two-footedness), a special skill, is associated with higher transfer premiums and wages (Bryson et al., 2013; Frick, 2007). These features interact synergistically to affect player value (Ployhart et al., 2014).
At the market level, several contextual factors influence transfer fees. Bargaining power is arguably the most critical theoretical construct. Contract duration is a simple measurement of this construct. More years remaining in a player contract afford the seller club stronger bargaining power (Campa, 2022), because the Bosman ruling eliminates transfer fees upon contract expiration. Bargaining power could also be assessed by clubs’ league rankings or financial resources (Franks et al., 2016; Frick, 2007; Serna Rodríguez et al., 2019; Tunaru et al., 2005). Clubs in wealthy leagues (e.g. the English Premier League) often acquire talent from development-oriented “farm” leagues like the German Bundesliga (Matesanz et al., 2018). Following a recent systematic review (Franceschi et al., 2024), we summarize player value drivers into individual-level human capital factors and market-level contextual factors in Table 2.
Value driver classification and mapping
| Level | Category | Observed feature |
|---|---|---|
| Individual human capital | Demographics | Age, nationality |
| | Popularity | International reputation |
| | Physiological | Reaction, speed, strength, etc. |
| | Psychological | Composure, positioning, aggression, etc. |
| | Seniority | Minutes/games played |
| | Soccer-Specific | Playing position, goals, expected goals, assists, expected assists, yellow/red cards, dribble, ball control, etc. |
| Market | Bargaining Power | Contract remaining, team rating |
| | League | League name |
Classic pricing theories assume monetary values of human capital, including value-based hedonic pricing, risk-aware option pricing, and auction theory. Rosen (1974, p. 34) defines hedonic prices as “the implicit prices of attributes and revealed to economic agents from observed prices of differentiated products and the specific number of characteristics associated with them”. A soccer player possesses a repertoire of features (e.g. human capital factors) that differentiate them from peers. Their transfer fee would therefore be the aggregate of implicit, hedonic values of these utility-generating features. Option pricing conceptualizes a player as a risky asset (Coluccia et al., 2018; Kedar-Levy and Bar-Eli, 2008; Majewski and Majewska, 2017). From this perspective, age is analogous to the asset lifecycle. As a player matures and gains experience, their value appreciates. When they reach the middle or late career, the value has peaked and then begins to decline. Injuries are uncertain risks that could rapidly erode a player’s value. According to auction theory, transfer negotiations resemble asymmetric bidding processes (Rottenberg, 2000). Transfer fees would be the outcome of a bargaining process in which an imbalance of bargaining power exists among buyers (Rottenberg, 2000). Different buyers can submit multiple bids and the seller can accept the preferred one. If a buyer activates the release clause, the seller will be obliged to approve the transfer and no auction will occur.
To identify key performance indicators (KPIs), many sports organizations adopt data-driven, analytics-based paradigms in strategic decision-making, especially consequential human resource management processes such as hiring (Alamar and Methrotra, 2011; Davenport, 2014; Gavião et al., 2023; Gerrard, 2007; Fry and Ohlmann, 2012). ML-assisted recruitment in organizational contexts entails an interplay of technical aspects and social aspects (Sturm et al., 2023). Some sports organizations are reluctant to shed light on proprietary models and certain KPIs for fear of losing a competitive edge (Coleman, 2012; Memmert and Raabe, 2023; Watanabe et al., 2021). Proponents of analytics have a cultural conflict with traditionalists who have a deep appreciation of the game (Alamar and Methrotra, 2011). Transparency of ML predictions, enhanced by interpretable ML techniques, helps analytics departments communicate the rationale behind predictions to stakeholders (e.g. general managers, sporting directors, executives), facilitating ML adoption and fostering trust in algorithms (Coussement et al., 2024; Lolli et al., 2025; Zhou et al., 2025).
Data and Methodology
Data sources
We combine three widely cited open-source soccer datasets, Transfermarkt.com, FBRef.com, and Sofifa, to curate a comprehensive dataset for player valuation (Herm et al., 2014; Müller et al., 2017; Payyappalli and Zhuang, 2019). Real-world transfer fees provided by Transfermarkt.com are our target variable. FBRef.com offers real-world performance data such as playing time, goals, assists, and advanced performance metrics, including expected goals (xG) and expected assists (xA) (Rathke, 2017). Sofifa supplies physiological, psychological, and technical skill features evaluated by the FIFA video game series. Known for its fidelity, Sofifa data is commonly repurposed for research tasks such as ML-based valuation (Al-Asadi and Tasdemır, 2022), fairness assessment (Awasthi et al., 2021), and player potential prediction (Vroonen et al., 2017; Carpita et al., 2021; McHale and Holmes, 2023; Wakelam et al., 2022).
We match players from different data sources by name and date of birth. Because player names are often inconsistent or irregular across sources, we then apply a probabilistic record linkage algorithm to non-exact matches, accepting matches with a similarity score above 0.5 (Stanojevic and Gyarmati, 2016). This results in a final dataset of 831 matched player transfers across the “Big Five” leagues from seasons 2017–2018 to 2019–2020. Appendix 1 provides the full list of features used in our research. Our feature selection is a combination of theoretical justifications, domain expertise, and data availability. Guided by McHale and Holmes (2023), we exclude free transfers from the dataset.
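The matching logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: Python's standard-library `difflib.SequenceMatcher` stands in for the similarity score (the cited linkage algorithm is not reproduced here), and the function names, record layout, and sample records are hypothetical. Only the name and date-of-birth keys and the 0.5 acceptance threshold come from the text.

```python
import unicodedata
from difflib import SequenceMatcher

THRESHOLD = 0.5  # acceptance threshold stated in the text


def normalize(name):
    """Lowercase, strip accents, and collapse whitespace before comparing names."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def similarity(a, b):
    """Stand-in similarity score in [0, 1] (assumption: not the cited algorithm)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link(left, right):
    """For each left record, pick the best-scoring right record with the same
    date of birth and keep it only if the score exceeds THRESHOLD."""
    matches = {}
    for lid, (lname, ldob) in left.items():
        best_id, best_score = None, 0.0
        for rid, (rname, rdob) in right.items():
            if ldob != rdob:
                continue  # dates of birth must agree exactly
            score = similarity(lname, rname)
            if score > best_score:
                best_id, best_score = rid, score
        if best_id is not None and best_score > THRESHOLD:
            matches[lid] = best_id
    return matches


# Hypothetical records: id -> (name, date of birth)
source_a = {
    "a1": ("Alexander Sample", "1999-01-01"),
    "a2": ("José Exámple", "1995-05-05"),
}
source_b = {
    "b1": ("A. Sample", "1999-01-01"),
    "b2": ("Jose Example", "1995-05-05"),
    "b3": ("Someone Else", "1990-09-09"),
}
linked = link(source_a, source_b)
```

Blocking on date of birth keeps the number of fuzzy comparisons small and prevents two different players with similar names from being merged.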
Predictive modeling and experimental setup
Our objective is to approximate the underlying player valuation function $y = f(X)$, where $y$ is a vector of the target variable (i.e. transfer fees) and $X$ represents a vector of features (i.e. value drivers). We apply four regression-based predictive models: Decision Tree (DT), Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost). We evaluate model performance by Root Mean Square Error (RMSE) and $R^2$. We split the dataset into 80% training with five-fold cross-validation and 20% testing.
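The evaluation setup above can be illustrated with a small standard-library sketch. It shows how an 80/20 split, five folds within the training portion, and the two metrics fit together. The helper names, the random seed, and the toy numbers are assumptions for illustration; the source does not state which tooling or seed was used.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def train_test_split_indices(n, test_share=0.2, seed=0):
    """Shuffle row indices and hold out `test_share` of them for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_share)
    return idx[n_test:], idx[:n_test]


def k_fold(train_idx, k=5):
    """Yield (fit, validate) index lists for k-fold CV on the training part."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, folds[i]


train_idx, test_idx = train_test_split_indices(831)  # 831 transfers, per the text
fold_sizes = [len(validate) for _, validate in k_fold(train_idx)]

# Toy check of the metrics on made-up log-fee values
y_true = [15.0, 16.0, 17.0, 18.0]
y_pred = [15.5, 15.5, 17.5, 17.5]
toy_rmse = rmse(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)
```

Hyperparameters would be tuned on the cross-validation folds only, so that the held-out 20% is touched once, for the final comparison.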
Model interpretation
We employ SHAP values to interpret feature contributions (Bauer and Anzer, 2021). SHAP values treat a model prediction as a cooperative game payout distributed among features based on their marginal contributions (Antwarg et al., 2021; Lundberg et al., 2018; Rudin, 2019; Garnica-Caparrós and Memmert, 2021). In general, the explanation model is defined as the following linear function:
$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i \tag{1}$$
where $z' \in \{0, 1\}^M$, $\phi_i \in \mathbb{R}$. $M$ denotes the number of all features. $z'_i$ is a binary indicating whether feature $i$ is present ($z'_i = 1$) or absent ($z'_i = 0$). $\phi_i$ is derived from Equation (2) where $x_i$ is a feature value of an instance (i.e. a player) being explained. $S$ is a subset of feature values excluding $x_i$. $E[f(x) \mid x_S]$ denotes the expected value of the function conditioned on $S$. SHAP values combine these conditional expectations to attribute $\phi_i$ to each feature.
$$\phi_i = \sum_{S \subseteq \{x_1, \ldots, x_M\} \setminus \{x_i\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left( E[f(x) \mid x_{S \cup \{x_i\}}] - E[f(x) \mid x_S] \right) \tag{2}$$
The sum of SHAP values for an instance equals the difference between the prediction and the baseline in Equation (3). This additivity property ensures that SHAP explanations are consistent and complete. SHAP values provide both aggregate-level and individual player-level interpretability to enhance transparency in the complex, high-stakes decision-making of player valuation.
$$\sum_{i=1}^{M} \phi_i = f(x) - E[f(x)] \tag{3}$$
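To make the additivity property concrete, the following sketch computes exact Shapley values by brute-force enumeration for a three-feature toy model and checks that they sum to the prediction minus the baseline. Everything here is hypothetical: the toy model, the instance, and the single background reference point used to stand in for "absent" features (a simplification of the conditional expectation). Production SHAP libraries use far more efficient algorithms for tree models.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition S of the other features."""
    phi = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Shapley weight |S|! (M - |S| - 1)! / M!
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi.append(total)
    return phi


def model(x):
    """Toy model with an interaction term (hypothetical, for illustration only)."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 3.0, 4.0]    # the observation being explained
background = [0.0, 1.0, 2.0]  # reference values used when a feature is "absent"


def value_fn(present):
    """Model output when only the features in `present` take the instance's values."""
    x = [instance[j] if j in present else background[j] for j in range(len(instance))]
    return model(x)


phi = shapley_values(value_fn, 3)
baseline = value_fn(frozenset())
prediction = value_fn(frozenset({0, 1, 2}))
additivity_gap = sum(phi) - (prediction - baseline)
```

In this toy case the interaction between features 0 and 2 is split between them, which is why feature 2 receives a non-zero attribution even though it never enters the model on its own.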
Results
Table 3 presents the performance metrics for all four predictive models. XGBoost achieves the lowest RMSE and the highest $R^2$. Therefore, we base subsequent SHAP value analysis on this model. Appendix 2 documents the tuning parameters of XGBoost.
Model testing results
| Model | RMSE | $R^2$ |
|---|---|---|
| Decision Tree | 1.002 | 0.404 |
| XGBoost | 0.717 | 0.695 |
| Random Forest | 0.884 | 0.536 |
| SVR | 0.797 | 0.623 |
Figure 2 decomposes the SHAP values of the predicted log-transformed transfer fee of 18.162 (£77,206,947) for Matthijs de Ligt. The base log value 15.422 (£4,985,279) is the average predicted transfer fee in the testing set. Red arrows indicate features with positive SHAP values increasing the predicted transfer fee from the base value, while blue arrows represent features with negative SHAP values doing the opposite. In de Ligt’s case, no top feature shows a negative impact. The length of each arrow is proportional to the SHAP value magnitude. Taken together, these SHAP values explain the difference between the base and predicted values. Specifically, Team Rating and Age increase the predicted transfer fee by the widest margins. In the summer of 2019, de Ligt transferred to Juventus, a highly competitive and financially well-endowed club in the Italian Serie A League. His young age also elevates the predicted fee. Other features with positive SHAP values include Movement Reactions, Games, Contract Remaining, Mentality Composure, Attacking Heading Accuracy, and International Reputation.
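The arithmetic behind this decomposition can be checked with a short sketch. It assumes the fees are natural-log transformed, which is consistent with the pound figures quoted above. The contribution values are the two-decimal figures shown in Figure 2, so the reconstructed total matches the reported prediction only up to rounding. Variable names are illustrative.

```python
import math

base_log_fee = 15.422       # E[f(x)], from the text
predicted_log_fee = 18.162  # f(x), from the text

# SHAP contributions as displayed in Figure 2 (rounded to two decimals)
contributions = {
    "team_rating": 0.74,
    "age_tm": 0.67,
    "movement_reactions": 0.46,
    "games": 0.22,
    "contract_remaining": 0.12,
    "minutes": 0.11,
    "mentality_composure": 0.09,
    "attacking_heading_accuracy": 0.09,
    "international_reputation": 0.08,
    "68 other features": 0.17,
}

# Additivity: base value plus all contributions recovers the prediction
reconstructed = base_log_fee + sum(contributions.values())

# Back-transform from log pounds to pounds (assumes natural log)
base_fee = math.exp(base_log_fee)
predicted_fee = math.exp(predicted_log_fee)

# Contributions add in log space, so in pounds they act as multipliers
multipliers = {name: math.exp(value) for name, value in contributions.items()}
```

Under this reading, a contribution of +0.74 on the log scale corresponds to a multiplier of roughly 2.1 on the fee, which is one way to give a SHAP value a monetary interpretation.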
The figure shows a horizontal waterfall chart. The horizontal axis ranges from 15.5 to 18.5 in increments of 0.5. The chart starts at the base value E[f(x)] = 15.422 and ends at the prediction f(x) = 18.162, with rightward arrows showing each feature's contribution: team_rating = 83: +0.74; age_tm = 20: +0.67; movement_reactions = 83: +0.46; games = 29: +0.22; contract_remaining = 5: +0.12; minutes = 2450: +0.11; mentality_composure = 82: +0.09; attacking_heading_accuracy = 85: +0.09; international_reputation = 3: +0.08; 68 other features: +0.17.
Matthijs de Ligt transfer fee prediction – SHAP values. Source: Created by the authors
Next, we present SHAP values in the entire testing set to provide aggregate-level explanations. Figure 3 is a bar chart ranking features by their mean absolute SHAP values in descending order, where features with higher average contributions are more influential in predicting transfer fees. Figure 4 attributes predictions to each feature on the vertical axis by mapping feature values to the corresponding SHAP values on the horizontal axis. In this plot, each dot is a player. Its color reflects the magnitude of a feature value. Dots with similar SHAP values are clustered together. Overall, Contract Remaining emerges as the most important feature, followed by Team Rating and Age. More years remaining in the contracts are associated with marked rises in the predicted transfer fees, as evidenced by the visible gap in the SHAP value distribution. Almost on a par with Contract Remaining, the effect of Team Rating is more continuous. A younger Age contributes to a noticeable increase in the predicted fee and vice versa. Mentality Composure has a narrow SHAP value distribution. Therefore, its positive relationship with the predicted fees remains proportional. Skill Ball Control displays a mixed result: most have modestly negative SHAP values, with only a few high values associated with disproportionately large increases in the predicted fees. More Minutes and Games contribute to the predicted fees. Premier League (a dummy variable) has a clear positive effect. Akin to Skill Ball Control, some high Movement Reactions values boost the predicted fees. Lower Movement Sprint Speed values typically have negative SHAP values. xG makes a meager contribution to predicted transfer fees, and so does Skill Dribbling.
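The ranking used in the bar chart can be sketched in a few lines: for each feature, average the absolute SHAP values across all players, then sort in descending order. The SHAP matrix below is made up for illustration and the function name is hypothetical; it is not the study's data.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across instances, descending."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[k]) for row in shap_rows) / n
        for k, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: one row per player, one column per feature
features = ["xg", "contract_remaining", "age_tm"]
shap_rows = [
    [0.02, 0.50, -0.20],
    [-0.04, -0.90, 0.30],
    [0.06, 0.40, -0.10],
]
ranking = mean_abs_shap(shap_rows, features)
```

Absolute values are averaged because signed contributions can cancel: a feature that raises some predictions and lowers others by similar amounts would otherwise look unimportant.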
The horizontal bar chart ranks the top 20 features by mean absolute SHAP value; the horizontal axis ranges from 0 to 0.3 in increments of 0.05. From top to bottom, the approximate values are: contract_remaining: 0.317; team_rating: 0.304; age_tm: 0.196; mentality_composure: 0.110; skill_ball_control: 0.099; minutes: 0.090; league_name_Premier League: 0.088; movement_reactions: 0.075; attacking_heading_accuracy: 0.071; games: 0.065; movement_sprint_speed: 0.053; xg: 0.044; defending_sliding_tackle: 0.044; power_strength: 0.044; power_long_shots: 0.036; xa: 0.034; mentality_positioning: 0.032; mentality_aggression: 0.027; skill_dribbling: 0.027; mentality_vision: 0.027.
SHAP standard bar chart (Top 20 features). Source: Created by the authors
The scatter plot lists the top 20 features on the vertical axis and the “SHAP value (impact on model output)” on the horizontal axis, which ranges from −1 to 0.75 in increments of 0.25. Each point is colored by feature value, from blue (low) to bright pink (high). The approximate SHAP value ranges are: contract_remaining: −0.937 to 0.597; team_rating: −0.658 to 0.95; age_tm: −0.437 to 0.364; mentality_composure: −0.244 to 0.222; skill_ball_control: −0.142 to 0.398; minutes: −0.25 to 0.222; league_name_Premier League: −0.102 to 0.347; movement_reactions: −0.108 to 0.505; attacking_heading_accuracy: −0.114 to 0.199; games: −0.341 to 0.108; movement_sprint_speed: −0.176 to 0.227; xg: −0.114 to 0.091; defending_sliding_tackle: −0.051 to 0.284; power_strength: −0.159 to 0.114; power_long_shots: −0.051 to 0.176; xa: −0.102 to 0.153; mentality_positioning: −0.057 to 0.159; mentality_aggression: −0.057 to 0.193; skill_dribbling: −0.386 to 0.045; mentality_vision: −0.148 to 0.068.
SHAP values summary plot (Top 20 features). Source: Created by the authors
The horizontal axis of the scatter plot is labeled “S H A P value (impact on model output), and plot ranges from negative 1 to 0.75 in increments of 0.25. The vertical axis shows 20 variables, labeled from top to bottom as “contract underscore remaining,” “team underscore rating,” “age underscore t m,” “mentality underscore composure,” “skill underscore ball underscore control,” “minutes,” “league underscore name underscore Premier League,” “movement underscore reactions,” “attacking underscore heading underscore accuracy,” “games,” “movement underscore sprint underscore speed,” “x g,” “defending underscore sliding underscore tackle,” “power underscore strength,” “power underscore long underscore shots,” “x a,” “mentality underscore positioning,” “mentality underscore aggression,” “skill underscore dribbling,” and “mentality underscore vision.” A vertical color bar on the right side is labeled “Feature value,” with “High” at the top in bright pink and “Low” at the bottom in blue. Points are colored according to this gradient, with pink indicating high feature values and blue indicating low feature values. The values from the graph are as follows: contract underscore remaining: Range: negative 0.937 to 0.597. team underscore rating: Range: negative 0.658 to 0.95. age underscore t m: Range: negative 0.437 to 0.364. mentality underscore composure: Range: negative 0.244 to 0.222. skill underscore ball underscore control: Range: negative 0.142 to 0.398. minutes: Range: negative 0.25 to 0.222. league underscore name underscore Premier underscore League: Range: negative 0.102 to 0.347. movement underscore reactions: Range: negative 0.108 to 0.505. attacking underscore heading underscore accuracy: Range: negative 0.114 to 0.199. games: Range: negative 0.341 to 0.108. movement underscore sprint underscore speed: Range: negative 0.176 to 0.227 x g: Range: negative 0.114 to 0.091. 
defending underscore sliding underscore tackle: Range: negative 0.051 to 0.284 power underscore strength: Range: negative 0.159 to 0.114. power underscore long underscore shots: Range: negative 0.051 to 0.176. x a: Range: negative 0.102 to 0.153. mentality underscore positioning: Range: negative 0.057 to 0.159. mentality underscore aggression: Range: negative 0.057 to 0.193. skill underscore dribbling: Range: negative 0.386 to 0.045. mentality underscore vision: Range: negative 0.148 to 0.068. Note: All numerical values are approximated.SHAP values summary plot (Top 20 features). Source: Created by the authors
In addition, we use SHAP dependence plots (Figures 5–12) to further examine these relationships by visualizing the marginal effects of select features on the predicted transfer fees, ceteris paribus. These plots highlight the nuanced and nonlinear nature of feature contributions. Players with fewer than three years remaining on their contracts tend to have negative SHAP values. In contrast, a substantial increase in predicted transfer fees takes place when Contract Remaining exceeds three years. The effect of Age is nonlinear: it turns negative beyond 26 and declines steadily thereafter. SHAP values increase continuously as Team Rating rises from 70 to 80 and keep increasing, albeit at a slower rate, above 80. Mentality Composure, Skill Ball Control, and Movement Reactions all have high SHAP values at the upper end of their ranges, especially between 70 and 80. SHAP values jump when Minutes range from 1,500 to 2,000 and then fluctuate within a small range. A similar pattern holds for Games, where the SHAP values turn positive only after 20 matches.
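To make the logic of a dependence plot concrete, the following is a minimal sketch rather than the model used in this study. It computes exact Shapley values for a hypothetical three-feature valuation function (`toy_valuation`, with made-up coefficients) relative to a single hypothetical baseline player, and then traces how the attribution for contract length changes as that feature varies. The step at three years and the decline after age 26 are hard-coded assumptions chosen only to mimic the patterns described above. A real SHAP dependence plot instead scatters the SHAP values of all observed players against a background data set.

```python
from itertools import combinations
from math import factorial


def toy_valuation(player):
    """Hypothetical additive valuation function, NOT the paper's fitted model."""
    value = 0.02 * (player["team_rating"] - 75)
    value += 0.4 if player["contract_remaining"] >= 3 else -0.3  # assumed step
    value -= 0.05 * max(player["age_tm"] - 26, 0)  # assumed decline after 26
    return value


def shapley_values(model, player, baseline):
    """Exact Shapley values of `player` relative to one `baseline` player."""
    features = list(player)
    n = len(features)
    phi = {}
    for target in features:
        others = [f for f in features if f != target]
        contribution = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                known = set(coalition)
                # Features outside the coalition take the baseline value.
                without = {f: player[f] if f in known else baseline[f]
                           for f in features}
                with_target = dict(without, **{target: player[target]})
                contribution += weight * (model(with_target) - model(without))
        phi[target] = contribution
    return phi


# Hypothetical baseline and target players (illustrative values only).
baseline = {"contract_remaining": 2, "age_tm": 26, "team_rating": 75}
player = {"contract_remaining": 4, "age_tm": 30, "team_rating": 80}
phi = shapley_values(toy_valuation, player, baseline)

# Dependence-style data: vary contract length, record its Shapley value.
dependence = []
for years in range(7):
    variant = dict(player, contract_remaining=years)
    shap_contract = shapley_values(toy_valuation, variant, baseline)
    dependence.append((years, shap_contract["contract_remaining"]))

for years, value in dependence:
    print(years, round(value, 3))
```

The attributions in `phi` sum to the gap between the prediction for the target player and the prediction for the baseline, which is the additivity property that lets a valuation be decomposed feature by feature.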
Contract remaining SHAP dependence plot. Horizontal axis: contract_remaining; vertical axis: SHAP value for contract_remaining. Source: Created by the authors
Team rating SHAP dependence plot. Horizontal axis: team_rating; vertical axis: SHAP value for team_rating. Source: Created by the authors
Age SHAP dependence plot. Horizontal axis: age_tm; vertical axis: SHAP value for age_tm. Source: Created by the authors
Mentality composure SHAP dependence plot. Horizontal axis: mentality_composure; vertical axis: SHAP value for mentality_composure. Source: Created by the authors
Skill ball control SHAP dependence plot. Horizontal axis: skill_ball_control; vertical axis: SHAP value for skill_ball_control. Source: Created by the authors
Minutes SHAP dependence plot. Horizontal axis: minutes; vertical axis: SHAP value for minutes. Source: Created by the authors
Movement reactions SHAP dependence plot. Horizontal axis: movement_reactions; vertical axis: SHAP value for movement_reactions. Source: Created by the authors
Games SHAP dependence plot. Horizontal axis: games; vertical axis: SHAP value for games. Source: Created by the authors
Discussion
We draw several key findings from our analysis. First, the prominence of fundamental human capital factors and labor market variables in predicting transfer fees contradicts the prima facie expectation that technical capabilities would have a larger influence (Dubois and Walzak, 2025). Chiefly, Age is among the three most influential value drivers. As key value drivers, Games and Minutes measure on-field involvement and signal player experience. This finding underscores the importance of employee seniority (Franceschi et al., 2024; Frick, 2007; McHale and Holmes, 2023). Contract Remaining demonstrates the most substantial influence on predicted transfer fees. When a contract approaches its final stage, particularly the last two years, the incumbent club’s bargaining power weakens significantly, leading to downward pressure on transfer fees. Team Rating measures the sporting strength of the acquiring club and might be a proxy for its financial capacity. Premier League is also a key driver, presumably due to its strong purchasing power and club visibility (Matesanz et al., 2018).
Second, sport-general physiological and psychological features are more impactful than soccer-specific skills. Mentality Composure ranks as the fourth most important value driver. High composure (e.g. above 80), however, yields diminishing returns. The physiological features Movement Reactions and Power Strength contribute more to predicted transfer fees than the domain-specific technique Skill Dribbling. This implies that dribbling, notwithstanding its hedonic value, may not generate commensurate utility from a club’s perspective. Among soccer-specific skills, Skill Ball Control is the most significant contributing factor. While low to average values show a minor negative effect, exceptionally high Skill Ball Control has a disproportionately positive influence on transfer fees. This suggests that this skill, despite being less visible than dribbling or goal scoring, may be an undervalued core competency. The rank of xG highlights the usefulness of a more nuanced measure of attacking contribution in player valuation. Goals are rare events, and raw goal counts are inflated in some leagues (Memmert and Raabe, 2023). In line with Franceschi et al. (2024), yellow or red cards and footedness do not significantly affect predicted transfer fees. In conclusion, we do not intend to downplay the importance of soccer-specific skills. Rather, these features individually have a minuscule influence on player valuation but collectively demonstrate considerable predictive power.
Practical implications
The adoption of SHAP values can be a starting point for human-AI collaboration in sports organization financial strategic decision-making (Dubois and Walzak, 2025). Transparent explanations based on SHAP values for multi-attribute asset valuations may not only enable stakeholders to engage “System 2” thinking (slower, more deliberate, and analytical reasoning) but also reshape their information processing (Bauer et al., 2023). Analytics departments can illustrate the contributions of various features to a valuation through SHAP value visualizations and communicate them to non-technical stakeholders. General managers and sporting directors can then determine whether data-driven assessments substantiate their intuitions or prompt adjustments to the weight of specific information for due diligence. This human-in-the-loop approach could help recalibrate judgment and augment bounded rationality (i.e. domain-specific, idiosyncratic prior knowledge from experiences).
On the demand side, a buyer club may estimate how strongly the features of a target player that align with its tactical or strategic needs would drive the transfer fee. Furthermore, a buyer can generate contrastive explanations on the valuation of similar players and implement “blind scouting” for personnel selection (Dubois and Walzak, 2025), comparing SHAP values of anonymized candidates to reduce popularity or nationality biases. Accordingly, a buyer can allocate the budget to star players whose desirable features warrant a premium or to cost-effective players whose key features are above average (Beiderbeck et al., 2021; Toma and Campobasso, 2023). On the supply side, a seller club may better understand the diminishing returns or outlier effects of certain player features (e.g. Mentality Composure or Skill Ball Control). Clubs whose business model depends on selling youth academy players or key players could tailor talent development schemes to prioritize the features that would improve player brand image and help command higher fees (Hofmann et al., 2021).
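A contrastive explanation of the kind described above could be sketched as follows. The two candidates, their feature names, and their SHAP values are all hypothetical placeholders in model-output units, not results from this study. The helper `contrast` is likewise an illustrative name. The sketch assumes both candidates are explained against the same baseline, so that the feature-wise gaps add up to the gap between the two predicted valuations.

```python
# Hypothetical per-feature SHAP values for two anonymized candidates
# (illustrative numbers only, in model-output units).
candidate_x = {
    "contract_remaining": 0.35,
    "age_tm": 0.10,
    "skill_ball_control": 0.05,
    "minutes": -0.02,
}
candidate_y = {
    "contract_remaining": -0.40,
    "age_tm": 0.12,
    "skill_ball_control": 0.20,
    "minutes": 0.06,
}


def contrast(shap_a, shap_b):
    """Feature-wise SHAP gaps (a minus b), largest absolute gap first."""
    gaps = {feature: shap_a[feature] - shap_b[feature] for feature in shap_a}
    return sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)


gaps = contrast(candidate_x, candidate_y)
total_gap = sum(gap for _, gap in gaps)

for feature, gap in gaps:
    print(f"{feature}: {gap:+.2f}")
print(f"total valuation gap: {total_gap:+.2f}")
```

Because the candidates are identified only by labels, the comparison rests on feature contributions rather than on names or nationalities.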
Limitations and future research
Due to data availability, our feature selection omits contextual factors such as club-specific business strategies and market asymmetries. Inflation and the timing of transfers are outside the purview of our present research (Yang et al., 2024). The most relevant use case of our methodology is modeling transfer fees. A notable exception is the activation of release clauses, which provides a more direct explanation for transfer fees than player features. Controlling for such cases would add rigor and amount to a new research direction. Future research could generalize our methodology to free transfers, loans, or player swaps, exploring key drivers of wages instead of transfer fees. A more holistic approach should model not only transfer fees but also factor in wages, performance bonuses, and even agent commissions. This could help the soccer regulatory body investigate transparency and fairness in club expenditure from a financial fair play standpoint (Neri et al., 2023).
Another future research direction is to operationalize different dimensions of bargaining power through a transfer market network analysis (Liu et al., 2016; Matesanz et al., 2018). In such a network, nodes are clubs, and edges show the flow of players or capital. Edge weights indicate the number of players moving from one club to another or associated fees. Network centrality measures (e.g. in-degree, out-degree, closeness) could capture bargaining power beyond contract remaining. Lastly, future research could ask if organizational fit (e.g. squad chemistry), the alignment between player and team style, would be a key driver in valuation (Al-Madi et al., 2016; Taylor and Giannantonio, 1993).
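As a rough illustration of the transfer network outlined above, the sketch below builds a directed network from a handful of hypothetical transfer records and computes unweighted and fee-weighted degree measures per club. The club names, fees, and the helper `degree_measures` are assumptions made for illustration. Edges are assumed to point from the selling club to the buying club, so out-degree counts players sold and in-degree counts players bought.

```python
from collections import defaultdict

# Hypothetical transfer records: (selling club, buying club, fee in EUR million).
transfers = [
    ("Club A", "Club B", 30.0),
    ("Club A", "Club C", 12.0),
    ("Club B", "Club C", 8.0),
    ("Club D", "Club A", 20.0),
]


def degree_measures(records):
    """Count players sold and bought, and sum the associated fees, per club."""
    stats = defaultdict(lambda: {
        "out_degree": 0,       # players sold
        "in_degree": 0,        # players bought
        "fees_received": 0.0,  # fee-weighted out-degree
        "fees_paid": 0.0,      # fee-weighted in-degree
    })
    for seller, buyer, fee in records:
        stats[seller]["out_degree"] += 1
        stats[seller]["fees_received"] += fee
        stats[buyer]["in_degree"] += 1
        stats[buyer]["fees_paid"] += fee
    return dict(stats)


measures = degree_measures(transfers)
for club in sorted(measures):
    print(club, measures[club])
```

Measures such as closeness would require path computations over the full network and are left out of this sketch.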
We acknowledge that theory-driven approaches, such as structural modeling and causal inference, are well-established research traditions and are integral to hypothesis testing. Our research design choice of predictive modeling is rooted in the growing trend in analytics-based sports management (Bogaert et al., 2017; McHale and Holmes, 2023; Watanabe et al., 2021; Yang et al., 2024), given the multidimensional nature of data (e.g. a large number of features). Our methodology does not substitute for or preclude structural modeling or causal inference. Rather, it shows complementarity and advances methodological pluralism. Structural models offer retrospective, theory-grounded insights, whereas predictive models provide proactive forecasts of transfer fees based on current player characteristics. By using SHAP values to explain feature influence on historical transfer fees, we bridge the gap between black-box predictions and economic interpretations.
Appendix 1
Complete feature set
| Category | Feature | Data Type | Description |
|---|---|---|---|
| Demographics | Age | Numerical | The age of a player in a given season |
| | Nation group | Categorical | All countries are regrouped into 11 labels (France, Italy, England, Germany, Brazil, Argentina, Belgium, Spain, Netherlands, Portugal, other countries) |
| Market | Team rating | Numerical | The average overall rating of all players in each club |
| | League | Categorical | The league a club belongs to |
| | Contract remaining | Numerical | The number of remaining year(s) in each player’s contract |
| Popularity | International reputation | Numerical | The higher the rating, the more famous the player is |
| Seniority | Games | Numerical | The number of games in which a player appears |
| | Minutes | Numerical | The number of minutes a player plays |
| Physiological Attribute | Acceleration | Numerical | The higher the rating, the shorter the time needed to reach the maximum sprint speed |
| | Sprint speed | Numerical | The higher the rating, the faster the player runs while at full speed |
| | Agility | Numerical | The higher the rating, the more agile the player is while moving or turning |
| | Reactions | Numerical | The higher the rating, the more quickly the player responds to a situation around him |
| | Balance | Numerical | The higher the rating, the more easily the player is able to maintain balance when facing physical challenges |
| | Stamina | Numerical | A high stamina rating means the player can spend a longer time sprinting during a game and needs a shorter recovery time |
| | Jumping | Numerical | The higher the rating, the higher the player can jump to win aerial balls |
| | Strength | Numerical | The higher the rating, the more physically strong the player is |
| | Injury risk | Categorical | The chance of a player being injured (e.g. low, medium, high) |
| Psychological Attribute | Aggression | Numerical | The higher the rating, the more successful tackles and the more fouls a player is likely to commit |
| | Composure | Numerical | The higher the rating, the better the player performs under pressure |
| | Vision | Numerical | The higher the rating, the greater the player’s awareness of the position of teammates and opponents is |
| | Positioning | Numerical | The higher the rating, the more likely a player is to occupy advantageous positions for receiving the ball and attacking the opponent’s goal |
| Soccer Performance Metrics and Technical Skills | Position category | Categorical | The general position category of a player |
| | Goals | Numerical | The number of goals a player scores |
| | xg | Numerical | The number of goals a player would have scored given the opportunities (i.e. expected goals) |
| | Assists | Numerical | The number of assists a player makes |
| | xa | Numerical | The number of assists a player would have made given the opportunities (i.e. expected assists) |
| | Red cards | Numerical | The number of red cards a player receives |
| | Yellow cards | Numerical | The number of yellow cards a player receives |
| | Tackles won | Numerical | The number of tackles a player wins |
| | Pressure regain | Numerical | The number of possessions a player regains after applying pressure |
| | Blocks | Numerical | The number of incoming shots a player stops |
| | Interceptions | Numerical | The number of the opposing team’s passes a player catches |
| | Clearance | Numerical | The number of kicks by a player to get the ball away from the danger area |
| | Fouls | Numerical | The number of fouls a player commits |
| | Fouled | Numerical | The number of fouls a player causes the opposing team to commit |
| | Long shots | Numerical | The higher the rating, the more accurate shots from outside the box are |
| | Shot power | Numerical | The higher the rating, the harder the player hits the ball while still keeping a shot accurate |
| | Penalties | Numerical | A high penalties rating means the player is good at taking penalties |
| | Heading accuracy | Numerical | The higher the rating, the more accurate a headed pass or header at goal is going to be |
| | Volleys | Numerical | A high volley rating means accurate shots taken while the ball is in the air |
| | Free kick accuracy | Numerical | The higher the rating, the better the accuracy of a direct free kick on goal |
| | Short passing | Numerical | The higher the rating, the faster and more accurate the short or ground pass will be |
| | Long passing | Numerical | The higher the rating, the faster and more accurate the long pass in the air will be |
| | Dribbling | Numerical | A high dribbling rating means the player will be able to keep better possession of the ball whilst dribbling |
| | Curve | Numerical | The higher the rating, the more curl the player is capable of putting on the ball when passing and shooting |
| | Crossing | Numerical | A high crossing rating means a high probability of a medium or long-range pass from a wide area of the field towards the center of the opponent’s box finding the teammate and circumventing the opponents |
| | Ball control | Numerical | The higher the rating, the less likely the ball is to bounce away from the player after receiving it |
| | Standing tackle | Numerical | The higher the rating, the more likely the player is to perform a standing tackle without committing a foul |
| | Sliding tackle | Numerical | The higher the rating, the more likely the player is to perform a sliding tackle without committing a foul |
| | Marking | Numerical | The higher the rating, the more easily the player can track and defend an opposing player |
| | Weak foot | Numerical | Weak foot is defined as the player’s foot other than the preferred foot. A high weak foot rating means higher shot power and better ball control for the weak foot of that player |
| | gk_kicking | Numerical | Goalkeeper’s ability to distribute long and accurate goal kicks, from out of the hands or on the ground |
| | gk_positioning | Numerical | Goalkeeper’s ability to position himself correctly when saving shots or reacting to crosses |
| | gk_reflexes | Numerical | Goalkeeper’s agility when making a save |
| | gk_diving | Numerical | Goalkeeper’s ability to make a save whilst diving through the air |
| | gk_handling | Numerical | Goalkeeper’s ability to catch the ball and hold onto it |
Appendix 2
Tuning parameters for XGBoost
| Tuning parameter | Description | Default or typical value | Optimal value |
|---|---|---|---|
| Number of trees B | B is also known as the number of estimators. Unlike random forests, XGBoost can overfit if B is too large | A relatively small number of trees (e.g. 100) | 100 |
| Learning rate λ | λ is a small positive number that controls the rate at which boosting learns. Rather than fitting a single large decision tree to the data, boosting learns slowly | Typical values are 0.01 or 0.001 | 0.1 |
| Max depth | The maximum number of nodes allowed from the root to the farthest leaf of a tree. Deeper trees can model more complex relationships by adding more nodes, but may end up fitting noise, causing the model to overfit | 6 | 3 |
| Min child weight | The minimum weight (or number of samples, if all samples have a weight of 1) required to create a new node in the tree. A smaller value lets the algorithm create children that correspond to fewer samples, allowing more complex trees that are, again, more likely to overfit | 1 | 7 |
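
For readers who wish to reproduce a comparable configuration, the optimal values in the table can be expressed as a short Python sketch. This is an illustration under stated assumptions, not the code used in the study: it assumes the scikit-learn wrapper of the `xgboost` package (`XGBRegressor`), whose keyword names `n_estimators`, `learning_rate`, `max_depth` and `min_child_weight` correspond to the four rows above. The names `TUNED_PARAMS` and `build_model` are hypothetical, and all other settings are left at the library defaults.

```python
# Optimal values taken from the "Optimal value" column of the table above.
# Keys follow the keyword names of XGBoost's scikit-learn wrapper (assumption:
# the study's own implementation may have used a different interface).
TUNED_PARAMS = {
    "n_estimators": 100,    # number of trees B
    "learning_rate": 0.1,   # learning rate (lambda in the table)
    "max_depth": 3,         # maximum depth of each tree
    "min_child_weight": 7,  # minimum weight required to create a new node
}


def build_model(params=None):
    """Return an XGBRegressor configured with the tuned values.

    Hypothetical helper for illustration. xgboost is imported lazily so
    that the parameter dictionary can be inspected without the package.
    """
    from xgboost import XGBRegressor

    return XGBRegressor(**(TUNED_PARAMS if params is None else params))
```

With `xgboost` installed, `build_model().fit(X, y)` would train the regressor on a feature matrix `X` and a vector of transfer fees `y`; both are placeholders for data prepared by the reader.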

