Expected goals

Expected goals (xG) is a statistical metric in association football that assigns a probability to each shot resulting in a goal. [1] [2] By summing these probabilities across a match, season, or set of shots, xG is used to estimate how many goals a team or player would be expected to score given the chances created, independent of whether those chances were actually converted. [1] [2]

xG values are produced by statistical or machine-learning models trained on historical shot data. Implementations differ in the data they use (for example, event data alone versus models that also incorporate positional or contextual information) and in which shot features are included; as a result, xG figures from different providers are not necessarily directly comparable. [1] [2] [3] [4]

The same general approach has also been applied in ice hockey analytics, where "expected goals" models have been used as an alternative to goals for evaluating team and player performance in a low-scoring sport. [5] [6]

Meaning

In association football, expected goals (xG) assigns each shot a value between 0 and 1 representing the estimated probability that the shot becomes a goal. [1] [2] The sum of these shot probabilities is an expected value for goals scored over a set of shots (for example, a match, season, or a player's attempts), so team and player xG totals are commonly reported alongside goals scored. [1] [2] [7]

xG values are produced by statistical models trained on historical shot outcomes. Models typically include features describing the shot attempt and its context, such as shot location (often expressed via distance and angle), body part used, and type of assist or phase of play, though the exact inputs and definitions depend on the data source and provider. [1] [4] [7]

As a probability, an xG value of 0.3 is commonly interpreted as meaning that shots of similar characteristics would be expected to be scored around 30% of the time over many repeated instances; it is not a statement about the outcome of any single shot. [4]

Because xG is model-based, different implementations can assign different probabilities to the same shot, particularly when they use different event definitions or additional information such as contextual or positional data. [1] [2] [4]

History and application of xG

Association football

Characteristics of the goal moment determining xG: coordinates, quality, body part, interference from the opponent (G. Kravtsov."Applied Statistics").

There is some debate about the origin of the term expected goals. Vic Barnett and his colleague Sarah Hilditch referred to "expected goals" in their 1993 paper that investigated the effects of artificial pitch (AP) surfaces on home team performance in association football in England. [8] Their paper included this observation:

Quantitatively we find for the AP group about 0.15 more goals per home match than expected and, allowing for the lower than expected goals against in home matches, an excess goal difference (for home matches) of about 0.31 goals per home match. Over a season this yields about 3 more goals for, an improved goal difference of about 6 goals. [9]

Jake Ensum, Richard Pollard and Samuel Taylor (2004) reported their study of data from 37 matches in the 2002 World Cup in which 930 shots and 93 goals were recorded. [10] Their research sought "to investigate and quantify 12 factors that might affect the success of a shot". Their logistic regression identified five factors that had a significant effect on determining the success of a kicked shot: distance from the goal; angle from the goal; whether or not the player taking the shot was at least 1 m away from the nearest defender; whether or not the shot was immediately preceded by a cross; and the number of outfield players between the shot-taker and goal. [10] They concluded "the calculation of shot probabilities allows a greater depth of analysis of shooting opportunities in comparison to recording only the number of shots". [10] In a subsequent paper (2004), Ensum, Pollard and Taylor combined data from the 1986 and 2002 World Cup competitions to identify three significant factors that determined the success of a kicked shot: distance from the goal; angle from the goal; and whether or not the player taking the shot was at least 1 m away from the nearest defender. [11] More recent studies have identified similar factors as relevant for xG metrics. [12]

Howard Hamilton (2009) proposed "a useful statistic in soccer" that "will ultimately contribute to what I call an 'expected goal value' — for any action on the field in the course of a game, the probability that said action will create a goal". [13]

Sander IJtsma (2011) discussed "a method to assign different value to different chances created during a football match" and in doing so concluded: [14]

we now have a system in place in order to estimate the overall value of the chances created by either team during the match. Knowing how many goals a team is expected to score from its chances is of much more value than just knowing how many attempts to score a goal were made. Other applications of this method of evaluation would be to distinguish a lack of quality attempts created from a finishing problem or to evaluate defensive and goalkeeping performances. And a third option would be to plot the balance of play during the match in terms of the quality of chances created in order to graphically represent how the balance of play evolved during the match. [14]

Sarah Rudd (2011) discussed probable goal scoring patterns (P(Goal)) in her use of Markov chains for tactical analysis (including the proximity of defenders) from 123 games in the 2010-2011 English Premier League season. [15] In a video presentation of her paper at the 2011 New England Symposium on Statistics in Sports, Rudd reported her use of analysis methods to compare "expected goals" with actual goals and her process of applying weightings to incremental actions for P(Goal) outcomes. [16]

In April 2012, Sam Green wrote about 'expected goals' in his assessment of Premier League goalscorers. [17] He asked "So how do we quantify which areas of the pitch are the most likely to result in a goal and therefore, which shots have the highest probability of resulting in a goal?". He added:

If we can establish this metric, we can then accurately and effectively increase our chances of scoring and therefore winning matches. Similarly, we can use this data from a defensive perspective to limit the better chances by defending key areas of the pitch. [17]

Green proposed a model to determine "a shot's probability of being on target and/or scored". With this model "we can look at each player's shots and tally up the probability of each of them being a goal to give an expected goal (xG) value". [17]

Ice hockey

In 2004, Alan Ryder shared a methodology for the study of the quality of an ice hockey shot on goal. His discussion opened with the sentence: "Not all shots on goal are created equal". [18] Ryder's model for the measurement of shot quality was:

  • Collect the data and analyze goal probabilities for each shooting circumstance
  • Build a model of goal probabilities that relies on the measured circumstance
  • For each shot, determine its goal probability
  • Expected Goals: EG = the sum of the goal probabilities for each shot
  • Neutralize the variation in shots on goal by calculating Normalized Expected Goals
  • Shot Quality Against
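
The first four steps of this procedure can be sketched in a few lines of Python. This is a minimal illustration rather than Ryder's implementation: the circumstance labels ("close", "far"), the shot records and the function names are all hypothetical, and the normalisation and Shot Quality Against steps are omitted because their formulas are not given here.

```python
from collections import defaultdict

# Hypothetical shot history: (shooting circumstance, 1 if goal else 0).
# Labels and outcomes are invented purely for illustration.
history = [
    ("close", 1), ("close", 0), ("close", 0), ("close", 1),
    ("far", 0), ("far", 0), ("far", 0), ("far", 1), ("far", 0),
]

def goal_probabilities(shots):
    """Empirical goal rate for each shooting circumstance."""
    goals = defaultdict(int)
    attempts = defaultdict(int)
    for circumstance, scored in shots:
        attempts[circumstance] += 1
        goals[circumstance] += scored
    return {c: goals[c] / attempts[c] for c in attempts}

def expected_goals(shot_circumstances, probs):
    """EG = the sum of the goal probabilities for each shot."""
    return sum(probs[c] for c in shot_circumstances)

probs = goal_probabilities(history)   # close: 2/4, far: 1/5
team_shots = ["close", "far", "far"]  # hypothetical shots in one game
eg = expected_goals(team_shots, probs)
print(round(eg, 2))
```

A real model would replace the two coarse labels with the measured circumstances of each shot (for example distance and shot type) and would need far more data per circumstance.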

Ryder concluded:

The model to get to expected goals given the shot quality factors is simply based on the data. There are no meaningful assumptions made. The analytic methods are the classics from statistics and actuarial science. The results are therefore very credible. [19]

In 2007, Ryder issued a "product recall notice" for his shot quality model. [20] He presented "a cautionary note on the calculation of shot quality" and pointed to "data quality problems with the measurement of the quality of a hockey team's shots taken and allowed". [20]

He reported:

I have been worried that there is a systemic bias in the data. Random errors don't concern me. They even out over large volumes of data. But I do think that ... the scoring in certain rinks has a bias towards longer or shorter shots, the most dominant factor in a shot quality model. And I set out to investigate that possibility. [20]

The term 'expected goals' appeared in a paper about ice hockey performance presented by Brian Macdonald [21] at the MIT Sloan Sports Analytics Conference in 2012. Macdonald's method for calculating expected goals was reported in the paper:

We used data from the last four full NHL seasons. For each team, the season was split into two halves. Since midseason trades and injuries can have an impact on a team's performance, we did not use statistics from the first half of the season to predict goals in the second half. Instead, we split the season into odd and even games, and used statistics from odd games to predict goals in even games. Data from 2007-08, 2008-09, and 2009-10 was used as the training data to estimate the parameters in the model, and data from the entire 2010-11 was set aside for validating the model. The model was also validated using 10-fold cross-validation. Mean squared error (MSE) of actual goals and predicted goals was our choice for measuring the performance of our models. [21]
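
The odd/even split and the mean-squared-error criterion described in this passage can be illustrated with a short sketch. It is not Macdonald's model: the goal counts are hypothetical, and the predictor (the average of the odd-numbered games) is a deliberately trivial stand-in used only to show the mechanics of the split and of the error measure.

```python
def split_odd_even(values):
    """Split a season's game-by-game values into odd- and even-numbered games."""
    odd = values[0::2]    # games 1, 3, 5, ...
    even = values[1::2]   # games 2, 4, 6, ...
    return odd, even

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical goals scored by one team in six consecutive games.
goals_by_game = [3, 2, 4, 1, 2, 3]

odd_goals, even_goals = split_odd_even(goals_by_game)

# Stand-in predictor (an assumption): predict every even game with the odd-game mean.
baseline = sum(odd_goals) / len(odd_goals)
predictions = [baseline] * len(even_goals)

mse = mean_squared_error(even_goals, predictions)
print(round(mse, 3))
```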

Model inputs and methods

xG models are typically trained on historical data in which each shot is labelled by whether it resulted in a goal. Many implementations rely on event data that describe the shot and its immediate context, such as distance and angle to goal, body part, type of assist, and whether the attempt was a set piece. [1] [2] [4] Other approaches use synchronised positional (tracking) data to incorporate spatial context, such as the locations of defenders and the goalkeeper at the time of the shot, with the aim of improving probability estimates compared with models based on event data alone. [1] [22]

A variety of modelling techniques have been used, ranging from logistic regression and other probabilistic classifiers to more complex machine-learning approaches. [2] [4] Some studies extend shot-based models by incorporating information from sequences of actions leading to the shot, reflecting the view that chance quality can depend on the build-up as well as the shot itself. [3] Research has also explored more interpretable formulations, such as Bayesian mixed models, to make the influence of shot characteristics and surrounding opponents easier to communicate to practitioners. [23]

Because xG is a probabilistic estimate, model evaluation is commonly framed in terms of both how well a model separates goals from non-goals (discrimination) and how well predicted probabilities align with observed scoring frequencies (calibration). [4] [2] Differences in underlying data (including event definitions and availability of contextual variables) and in modelling choices can therefore lead to systematic differences between xG values produced by different models for the same set of shots. [4] [1]
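
The two evaluation ideas can be made concrete with a small sketch. The choice of metrics is an assumption made for illustration: area under the ROC curve (AUC) stands in for discrimination, and a binned calibration table together with the Brier score stands in for calibration. The predicted probabilities and outcomes are hypothetical.

```python
def auc(probs, outcomes):
    """Chance that a random goal has a higher xG than a random non-goal (ties count 0.5)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def calibration_table(probs, outcomes, n_bins=5):
    """For each non-empty bin: (mean predicted probability, observed goal rate, shots)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            rows.append((mean_p, rate, len(b)))
    return rows

# Hypothetical xG values and outcomes (1 = goal) for eight shots.
xg_values = [0.05, 0.10, 0.10, 0.30, 0.35, 0.70, 0.75, 0.08]
scored = [0, 0, 0, 1, 0, 1, 1, 0]

model_auc = auc(xg_values, scored)
model_brier = brier_score(xg_values, scored)
rows = calibration_table(xg_values, scored)
```

In a well-calibrated model the first two entries of each row are close to each other; with only eight shots the table is far too noisy to draw conclusions, so the example shows the mechanics only.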

Mathematical formulation

A common way to formalise expected goals is to treat each shot as a Bernoulli trial with an estimated probability of scoring. For a set of shots indexed by $i = 1, \dots, n$, let $p_i$ be the expected-goals value for shot $i$, interpreted as the estimated probability that the shot becomes a goal. [1] [2] The total expected goals for the set of shots is then

$$\mathrm{xG} = \sum_{i=1}^{n} p_i.$$

If $X_i$ is a random variable indicating whether shot $i$ results in a goal ($X_i = 1$) or not ($X_i = 0$), with $P(X_i = 1) = p_i$, then the total goals scored is $G = \sum_{i=1}^{n} X_i$ and its expected value is

$$\mathbb{E}[G] = \sum_{i=1}^{n} \mathbb{E}[X_i] = \sum_{i=1}^{n} p_i = \mathrm{xG}.$$

Often shots are assumed to be independent, in which case the variance is

$$\mathrm{Var}(G) = \sum_{i=1}^{n} p_i (1 - p_i).$$

The probability of scoring at least one goal from the set is

$$P(G \ge 1) = 1 - \prod_{i=1}^{n} (1 - p_i).$$

In practice, independence is an approximation because shots within a match may be correlated. [4]

Illustrative example

Suppose a team takes three shots with probabilities of scoring $p_1$, $p_2$ and $p_3$ that sum to 0.70. The total expected goals is $\mathrm{xG} = p_1 + p_2 + p_3 = 0.70$, meaning the team would be expected to score 0.70 goals on average from many repeated sets of chances with similar characteristics. Under an independence assumption, the probability of scoring at least one goal from these three shots is

$$P(\text{at least one goal}) = 1 - (1 - p_1)(1 - p_2)(1 - p_3).$$
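
The quantities in this section can be computed directly. The sketch below uses three hypothetical shot probabilities (0.10, 0.25 and 0.35), chosen only because they sum to 0.70; the function names are likewise illustrative.

```python
import math

def xg_total(probs):
    """Expected goals: the sum of the shot probabilities."""
    return sum(probs)

def goals_variance(probs):
    """Variance of the goal total, assuming independent shots."""
    return sum(p * (1.0 - p) for p in probs)

def prob_at_least_one(probs):
    """Probability of scoring at least once, assuming independent shots."""
    return 1.0 - math.prod(1.0 - p for p in probs)

shots = [0.10, 0.25, 0.35]  # hypothetical values that sum to 0.70

total = xg_total(shots)
variance = goals_variance(shots)
p_any = prob_at_least_one(shots)

print(round(total, 2), round(variance, 3), round(p_any, 5))
```

With these assumed inputs the team scores at least once in a little over half of such sets of chances, even though its expected goals total is below one.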

Model-based estimation

In practice, the shot probabilities are estimated from historical data by fitting a model that maps shot characteristics to the probability of scoring. Studies distinguish between the data representation (for example, event data versus synchronised tracking data), the feature set (which variables are provided to the model), and the model family (for example, regression or other machine-learning approaches). [1] [2] [4] Because providers may use different event definitions, variable availability, and modelling choices, xG values for the same shot can differ across implementations. [4]

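Many published models estimate $p_i$ by applying a function to a linear predictor (a weighted combination of features). A common baseline is logistic regression, where the estimated probability for a shot with features $x_{i1}, \dots, x_{ik}$ is

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}},$$

with coefficients $\beta_0, \dots, \beta_k$ learned from data. [2]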
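
As a rough sketch of what "learned from data" means, the code below fits a one-feature logistic regression by plain gradient ascent on the log-likelihood. Everything in it is an assumption made for illustration: the ten shots are invented, distance is the only feature, and it is rescaled to tens of metres so that a fixed step size remains stable. Published models use far larger datasets and standard fitting routines.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p = sigmoid(b0 + b1 * x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual for this shot
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical shots: distance to goal in metres and outcome (1 = goal).
distances_m = [4, 6, 8, 10, 12, 15, 18, 20, 25, 30]
goals = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Rescale distance to tens of metres (an implementation convenience).
xs = [d / 10.0 for d in distances_m]
b0, b1 = fit_logistic(xs, goals)

p_near = sigmoid(b0 + b1 * 0.5)  # estimated probability at 5 m
p_far = sigmoid(b0 + b1 * 2.5)   # estimated probability at 25 m
```

Because the invented goals cluster at shorter distances, the fitted distance coefficient is negative and the estimated probability falls as distance increases.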

The following worked calculations are illustrative examples of how probability estimates can be computed once a model is fitted.

Example 1: A simple shot-based model (distance and angle)

Many xG models start from information that is recorded for most shots, such as how far the shot was from goal and how open the angle to goal was. A fitted statistical model first produces a numerical score for the shot and then converts that score into a probability between 0 and 1. Logistic regression is a common baseline model of this kind in the literature. [2]

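For illustration, suppose the model uses distance to goal $d$ (metres) and shot angle $\theta$ (radians), and has illustrative coefficients $\beta_0$ (intercept), $\beta_d$ (distance) and $\beta_\theta$ (angle).

For a shot with distance $d$ and angle $\theta$, the model's score is

$$z = \beta_0 + \beta_d d + \beta_\theta \theta.$$

Substituting the shot's values and evaluating each term gives a single number $z$.

The probability estimate is obtained by applying the logistic function

$$p = \frac{1}{1 + e^{-z}}.$$

With the score for this shot, $p \approx 0.14$; a probability of this size corresponds to a score of roughly $z \approx -1.8$.

The model would therefore assign this shot an expected-goals value of about 0.14, meaning that shots with similar characteristics would be expected to be scored about 14% of the time over many comparable attempts. [2] [4]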
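
The calculation can be reproduced numerically. The coefficients and shot values below are hypothetical, chosen only so that the result lands near the 0.14 quoted above; they are not taken from any published model.

```python
import math

def logistic(z):
    """Convert a model score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (assumptions for illustration only).
BETA_0 = -1.2      # intercept
BETA_DIST = -0.10  # per metre of distance to goal
BETA_ANGLE = 1.0   # per radian of shot angle

def shot_score(distance_m, angle_rad):
    """Linear score z = b0 + b_d * d + b_theta * theta."""
    return BETA_0 + BETA_DIST * distance_m + BETA_ANGLE * angle_rad

z = shot_score(distance_m=12.0, angle_rad=0.6)  # -1.2 - 1.2 + 0.6
p = logistic(z)
print(round(z, 2), round(p, 2))
```

The negative distance coefficient and positive angle coefficient encode the usual pattern that closer, more central shots are scored more often.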

Example 2: Adding positional context (defensive pressure)

Some models add contextual information available from tracking or enriched event data, such as an indicator of defensive pressure at the moment of the shot. [1] [22] This can be represented by including an additional variable in the score.

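Continuing Example 1, suppose the model includes a pressure variable $x_{\text{press}}$ scaled from 0 (no pressure) to 1 (high pressure), with coefficient $\beta_{\text{press}}$. The score becomes

$$z_{\text{press}} = \beta_0 + \beta_d d + \beta_\theta \theta + \beta_{\text{press}} x_{\text{press}}.$$

Using the same shot as Example 1 (the same distance $d$ and angle $\theta$) and taking a non-zero pressure value $x_{\text{press}}$, the first three terms are unchanged. From Example 1 they sum to the base score $z$, so $z_{\text{press}} = z + \beta_{\text{press}} x_{\text{press}}$, and therefore the score falls when the pressure coefficient is negative.

Applying the logistic function, $p_{\text{press}} = \frac{1}{1 + e^{-z_{\text{press}}}}$, so with the lower score, $p_{\text{press}} \approx 0.06$.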

In this illustrative calculation, adding pressure reduces the estimate from about 0.14 to about 0.06 for the same distance-and-angle shot, because the model treats the context as making the chance more difficult. [1] [4]
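
A numerical sketch of the same step is given below. The base score of -1.8, the pressure coefficient of -1.2 and the pressure value of 0.8 are all hypothetical, picked so that the two estimates come out near 0.14 and 0.06.

```python
import math

def logistic(z):
    """Convert a model score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs (assumptions for illustration only).
BASE_SCORE = -1.8     # score from distance and angle alone
BETA_PRESSURE = -1.2  # coefficient on the 0-to-1 pressure variable
pressure = 0.8        # fairly high defensive pressure

z_pressure = BASE_SCORE + BETA_PRESSURE * pressure  # -1.8 - 0.96

p_without = logistic(BASE_SCORE)
p_with = logistic(z_pressure)
print(round(p_without, 2), round(p_with, 2))
```

Setting `pressure` to 0 recovers the distance-and-angle estimate, which is one way to check that the extra term only shifts the score.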

Example 3: Incorporating build-up information (sequence features)

Shot-based xG can be extended by including information from the sequence of actions leading to the shot (for example, the type of pass that created the chance). [3] One simplified way to represent this is to add indicator variables to the score.

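Continuing Example 1, suppose the model includes two binary variables:

  • $x_{\text{TB}}$, where $x_{\text{TB}} = 1$ if the shot followed a through ball and $x_{\text{TB}} = 0$ otherwise, and
  • $x_{\text{CB}}$, where $x_{\text{CB}} = 1$ if the shot followed a cutback pass and $x_{\text{CB}} = 0$ otherwise.

Suppose the illustrative coefficients are $\beta_{\text{TB}}$ and $\beta_{\text{CB}}$. The score becomes

$$z_{\text{seq}} = \beta_0 + \beta_d d + \beta_\theta \theta + \beta_{\text{TB}} x_{\text{TB}} + \beta_{\text{CB}} x_{\text{CB}}.$$

Using the same base shot from Example 1, the first three terms again sum to the base score $z$. For example, if $x_{\text{TB}} = 1$ and $x_{\text{CB}} = 0$, only the through-ball coefficient is added, so $z_{\text{seq}} = z + \beta_{\text{TB}}$.

Applying the logistic function, $p_{\text{seq}} = \frac{1}{1 + e^{-z_{\text{seq}}}}$, so with a score that differs from $z$, the estimate $p_{\text{seq}}$ differs from the Example 1 value.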

This illustrative calculation shows how incorporating build-up context can change the estimated probability even when the shot location and angle are unchanged. [3] [4]
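
The effect of the two indicators can be sketched as follows. The base score (-1.8) and both coefficients (0.8 for a through ball, 0.5 for a cutback) are hypothetical values chosen for illustration; real coefficients would be estimated from data.

```python
import math

def logistic(z):
    """Convert a model score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values (assumptions for illustration only).
BASE_SCORE = -1.8        # score from distance and angle alone
BETA_THROUGH_BALL = 0.8  # added when the shot followed a through ball
BETA_CUTBACK = 0.5       # added when the shot followed a cutback

def sequence_score(through_ball, cutback):
    """Base score plus the coefficients of whichever indicators equal 1."""
    return BASE_SCORE + BETA_THROUGH_BALL * through_ball + BETA_CUTBACK * cutback

p_base = logistic(sequence_score(0, 0))
p_through_ball = logistic(sequence_score(1, 0))
p_cutback = logistic(sequence_score(0, 1))

print(round(p_base, 3), round(p_cutback, 3), round(p_through_ball, 3))
```

With positive coefficients both build-up types raise the estimate above the base value; a negative coefficient would lower it instead.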

Example 4: Aggregating shot probabilities into match totals

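Once probabilities have been estimated for individual shots, match or season totals are obtained by summing them. [2] Suppose a team has three shots with estimated probabilities $p_1$, $p_2$ and $p_3$. The expected-goals total is

$$\mathrm{xG} = p_1 + p_2 + p_3.$$

Under the independence assumption, the probability of scoring at least one goal from the three shots is

$$P(G \ge 1) = 1 - \prod_{i=1}^{3} (1 - p_i).$$

Substituting the values gives $1 - (1 - p_1)(1 - p_2)(1 - p_3)$.

On the same assumption, if $G$ is the total number of goals from the three shots, then

$$\mathrm{Var}(G) = \sum_{i=1}^{3} p_i (1 - p_i).$$

Evaluating each term $p_i (1 - p_i)$ and adding them gives the variance of the goal total.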
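
Under the same independence assumption, the full distribution of the goal total (a Poisson binomial distribution) can be built up one shot at a time. The sketch below uses three hypothetical probabilities, 0.10, 0.25 and 0.35, which are assumptions for illustration only.

```python
def goal_distribution(probs):
    """Return [P(G = 0), ..., P(G = n)] for independent shots."""
    dist = [1.0]  # before any shot, zero goals with certainty
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)  # this shot is missed
            new[k + 1] += mass * p      # this shot is scored
        dist = new
    return dist

shots = [0.10, 0.25, 0.35]  # hypothetical shot probabilities

dist = goal_distribution(shots)
xg = sum(k * mass for k, mass in enumerate(dist))           # mean of the distribution
var = sum((k - xg) ** 2 * mass for k, mass in enumerate(dist))
p_at_least_one = 1.0 - dist[0]

for k, mass in enumerate(dist):
    print(k, round(mass, 5))
```

The mean and variance recovered from the distribution agree with the closed-form sums of $p_i$ and $p_i(1 - p_i)$, which gives a simple consistency check.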

Model outputs are assessed for discrimination and calibration, and differences in data availability and event definitions can produce systematic differences between models. [4] [2]

Expected goals against (xGA) is typically defined as the sum of the xG probabilities for shots conceded by a team, and expected goal difference (xGD) as the difference between xG for and xGA over the same sample of matches or minutes. [1] [2] [7]
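
These definitions amount to two sums and a subtraction, as the following sketch shows. The shot probabilities for and against are hypothetical.

```python
def xg_sum(shot_probs):
    """Sum of shot probabilities over a sample of shots."""
    return sum(shot_probs)

# Hypothetical single-match sample (assumptions for illustration only).
shots_for = [0.10, 0.25, 0.35]  # shots taken by the team
shots_against = [0.05, 0.40]    # shots conceded by the team

xg_for = xg_sum(shots_for)
xga = xg_sum(shots_against)
xgd = xg_for - xga  # expected goal difference over the same sample

print(round(xg_for, 2), round(xga, 2), round(xgd, 2))
```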

Expected assists (xA) is used to estimate chance creation from passing. In Opta's terminology, xA measures the likelihood that a completed pass becomes a goal assist, based on factors such as pass type and end location, and can be aggregated for players or teams over a match or season. [7] [24]

Opta's expected goals on target (xGOT) is a post-shot model defined for on-target shots, combining the original xG value with information about where the shot ended up within the goalmouth; it is therefore used in some analyses of shooting and goalkeeping performance. [25] [26] [27]

Limitations and criticism

xG is not a single standardised statistic: models differ in their underlying data, event definitions and included features, and these choices can lead to systematically different probability estimates for the same shots. [4] [2] As a result, xG values from different providers are not always directly comparable, particularly when one model includes additional contextual or positional information that another lacks. [1] [4]

Because xG outputs are probabilities, evaluating an xG model involves both whether it separates goals from non-goals and whether its predicted probabilities match observed scoring frequencies (calibration). [4] [28] Data availability can be a limiting factor: models trained on short time windows or on datasets with restricted variables may be less stable or less well calibrated than models trained on larger, richer datasets. [4]

Sample size and variance

xG totals from a single match (or other short sample) can be volatile, and differences between xG and actual goals over small samples can be driven by randomness as well as by systematic differences in chance quality or performance. [29] [2] For this reason, xG-based comparisons are typically interpreted more cautiously over short periods than over longer samples such as seasons or multi-season datasets. [29]
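
A back-of-the-envelope calculation illustrates why short samples are volatile. It assumes independent shots that all share the same scoring probability, and the figures used (12 shots per match, 38 matches, a probability of 0.10 per shot) are hypothetical. Under those assumptions the standard deviation of goals relative to xG is $\sqrt{(1 - p)/(np)}$, which shrinks as the number of shots $n$ grows.

```python
import math

def relative_spread(n_shots, p):
    """Standard deviation of goals divided by xG, for n independent shots of probability p."""
    xg = n_shots * p
    sd = math.sqrt(n_shots * p * (1.0 - p))
    return sd / xg

# Hypothetical sample sizes (assumptions for illustration only).
match_ratio = relative_spread(12, 0.10)        # one match
season_ratio = relative_spread(12 * 38, 0.10)  # a 38-match season

print(round(match_ratio, 3), round(season_ratio, 3))
```

In this simplified setting the random spread is of the same order as the xG total itself for a single match, but only a fraction of it over a season.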

Interpretation pitfalls

At the player level, comparisons between goals scored and xG are often used to discuss "over-" or "under-performance", but such differences can be noisy over limited samples and can reflect modelling choices and biases as well as player skill. [4] [30] In addition, because xG models differ in what they treat as part of the chance (for example, which contextual variables are included), the same player's "goals minus xG" can change depending on the model and data source used. [4]

Some analyses use post-shot variants that incorporate information about shot placement to separate pre-shot chance quality from the execution of shots and goalkeeping outcomes; such measures are intended to reduce ambiguity about whether a deviation from pre-shot xG reflects shot execution or other factors. [31] [25] [26]

References

  1. Anzer, Gabriel; Bauer, Pascal (29 March 2021). "A Goal Scoring Probability Model for Shots Based on Synchronized Positional and Event Data in Football (Soccer)". Frontiers in Sports and Active Living. 3: 624475. doi:10.3389/fspor.2021.624475. PMC 8056301. PMID 33889843.
  2. Mead, Joseph; O'Hare, Adam; McMenemy, Paul (5 April 2023). "Expected goals in football: Improving model performance and demonstrating value". PLOS ONE. 18 (4): e0282295. Bibcode:2023PLoSO..1882295M. doi:10.1371/journal.pone.0282295. PMC 10075453. PMID 37018167.
  3. Bandara, Ishara (30 October 2024). "Predicting goal probabilities with improved xG models using event sequences in association football". PLOS ONE. 19 (10): e0312278. Bibcode:2024PLoSO..1912278B. doi:10.1371/journal.pone.0312278. PMID 39475977.
  4. Robberechts, Pieter; Davis, Jesse (10 December 2020). "How Data Availability Affects the Ability to Learn Good xG Models". In Brefeld, Ulf; Davis, Jesse; Van Haaren, Jan; Zimmermann, Albrecht (eds.). Machine Learning and Data Mining for Sports Analytics: 7th International Workshop, MLSA 2020, Co-located with ECML/PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings. Communications in Computer and Information Science. Vol. 1324. Cham: Springer. pp. 17–27. doi:10.1007/978-3-030-64912-8_2. ISBN 978-3-030-64911-1. Retrieved 27 January 2026.
  5. Nandakumar, Namita; Jensen, Shane T. (2019). "Historical perspectives and current directions in hockey analytics". Annual Review of Statistics and Its Application. 6: 19–36. doi:10.1146/annurev-statistics-030718-105202. Retrieved 27 January 2026.
  6. Macdonald, Brian (2 March 2012). "An Expected Goals Model for Evaluating NHL Teams and Players" (PDF). MIT Sloan Sports Analytics Conference 2012. Retrieved 27 January 2026.
  7. "Opta Event Definitions: Expected Goals & Expected Assists". Stats Perform. Retrieved 27 January 2026.
  8. Barnett, Vic; Hilditch, S (1993). "The Effect of an Artificial Pitch Surface on Home Team Performance in Football (Soccer)". Journal of the Royal Statistical Society. Series A (Statistics in Society). 156 (1): 39–50. doi:10.2307/2982859. JSTOR 2982859.
  9. Barnett, Vic; Hilditch, S (1993). "The Effect of an Artificial Pitch Surface on Home Team Performance in Football (Soccer)". Journal of the Royal Statistical Society. Series A (Statistics in Society). 156 (1): 47. doi:10.2307/2982859. JSTOR 2982859.
  10. Ensum, Jake; Pollard, Richard; Taylor, Samuel (2004). "Applications of logistic regression to shots at goal in association football: calculation of shot probabilities, quantification of factors and player/team". Journal of Sports Sciences. 22 (6): 504.
  11. Pollard, Richard; Ensum, Jake; Taylor, Samuel (2004). "Estimating the probability of a shot resulting in a goal: The effects of distance, angle and space". International Journal of Soccer and Science. 2 (1): 50–55.
  12. "An examination of expected goals and shot efficiency in soccer" (PDF). Universidad de Alicante.
  13. Hamilton, Howard (8 January 2009). "Moneyball and soccer". Retrieved 6 February 2018.
  14. IJtsma, Sander (13 July 2011). "A chance is a chance is a chance?". Retrieved 4 January 2018.
  15. Rudd, Sarah (24 September 2011). "A Framework for Tactical Analysis and Individual Offensive Production Assessment in Soccer Using Markov Chains" (PDF). Retrieved 7 February 2018.
  16. 2011 NESSIS - Talk by Sarah Rudd on YouTube
  17. Green, Sam (12 April 2012). "Assessing the performance of Premier League goalscorers". Retrieved 4 January 2018.
  18. Ryder, Alan (January 2004). "Shot quality" (PDF). p. 2. Retrieved 4 January 2018.
  19. Ryder, Alan (January 2004). "Shot quality" (PDF). p. 15. Retrieved 5 January 2018.
  20. Ryder, Alan (2007). "Product Recall Notice for 'Shot Quality'" (PDF). Retrieved 5 January 2018.
  21. Macdonald, Brian (March 2012). "An Expected Goals Model for Evaluating NHL Teams and Players" (PDF). Retrieved 3 January 2018.
  22. Lucey, Patrick; Bialkowski, Alina; Monfort, Mathew; Carr, Peter; Matthews, Iain (2015). "'Quality vs Quantity': Improved Shot Prediction in Soccer using Strategic Features from Spatiotemporal Data". MIT Sloan Sports Analytics Conference. Retrieved 27 January 2026.
  23. Iapteff, Lucas; Le Coz, Stéphane; Rioland, Marc; Houde, Thomas; Carling, Chris; Imbach, Fabrice (2025). "Toward interpretable expected goals modeling using Bayesian mixed models". Frontiers in Sports and Active Living. 7: 1504362. doi:10.3389/fspor.2025.1504362. PMC 12055760. PMID 40336706.
  24. "Expected Assists in Context". Stats Perform. Retrieved 27 January 2026.
  25. "Introducing Expected Goals on Target (xGOT)". Stats Perform. Retrieved 27 January 2026.
  26. Ruiz-de-Alarcón-Quintero, Antonio; Gómez-Carmona, Carlos D.; Bastida-Castillo, Antonio; Pino-Ortega, José (2024). "An Expected Goals on Target (xGOT) Metric as a New Key Performance Indicator in Football". Data. 9 (9): 102. doi:10.3390/data9090102.
  27. "Expected Goals (xG)". Hudl Support. Retrieved 27 January 2026.
  28. Davis, Jesse (2024). "Methodology and evaluation in sports analytics". Machine Learning. doi:10.1007/s10994-024-06585-0. Retrieved 27 January 2026.
  29. Brechot, Marc; Flepp, Raphael (May 2020). "Dealing with randomness in match outcomes: how to rethink performance evaluation in European club football using expected goals". Journal of Sports Economics. 21 (4): 335–362. doi:10.1177/1527002519897962. Retrieved 27 January 2026.
  30. Davis, Jesse; Robberechts, Pieter (2024). "Biases in expected goals models confound finishing ability". arXiv:2401.09940 [cs.LG].
  31. "Post-shot expected goals (PSxG)". Hudl Support. Retrieved 27 January 2026.
