Brier score

The Brier score is a strictly proper score function or strictly proper scoring rule that measures the accuracy of probabilistic predictions. For unidimensional predictions, it is strictly equivalent to the mean squared error as applied to predicted probabilities.

The Brier score is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes or classes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). It was proposed by Glenn W. Brier in 1950. [1]

The Brier score can be thought of as a cost function. More precisely, across all items in a set of N predictions, the Brier score measures the mean squared difference between the predicted probability assigned to the possible outcomes for each item and the actual outcome of that item.

Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated. Note that the Brier score, in its most common formulation, takes on a value between zero and one, since this is the square of the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 or 1). In the original (1950) formulation of the Brier score, the range is double, from zero to two.

The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but it is inappropriate for ordinal variables which can take on three or more values.

Definition

The most common formulation of the Brier score is

BS = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2

in which f_t is the probability that was forecast, o_t the actual outcome of the event at instance t (0 if it does not happen and 1 if it does happen), and N is the number of forecasting instances. In effect, it is the mean squared error of the forecast. This formulation is mostly used for binary events (for example "rain" or "no rain"). The above equation is a proper scoring rule only for binary events; if a multi-category forecast is to be evaluated, then the original definition given by Brier below should be used.
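
As an illustration (not part of the original article), a minimal Python sketch of this formulation; the function name and example data are invented for the example:

# Illustrative sketch, not from the article: binary Brier score as a mean squared error.
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    n = len(forecasts)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

# Three days of rain forecasts and what actually happened (1 = rain, 0 = no rain).
forecasts = [0.7, 0.9, 0.1]
outcomes = [1, 1, 0]
print(brier_score(forecasts, outcomes))  # (0.09 + 0.01 + 0.01) / 3 ≈ 0.0367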

Example

Suppose that one is forecasting the probability P that it will rain on a given day. Then the Brier score is calculated as follows:

If the forecast is 100% (P = 1) and it rains, the Brier score is 0, the best score achievable.
If the forecast is 100% and it does not rain, the Brier score is 1, the worst score achievable.
If the forecast is 70% (P = 0.70) and it rains, the Brier score is (0.70 - 1)^2 = 0.09.
In contrast, if the forecast is 70% (P = 0.70) and it does not rain, the Brier score is (0.70 - 0)^2 = 0.49.
Similarly, if the forecast is 30% (P = 0.30) and it does rain, the Brier score is (0.30 - 1)^2 = 0.49.
If the forecast is 50% (P = 0.50), the Brier score is (0.50 - 1)^2 = (0.50 - 0)^2 = 0.25, regardless of whether it rains.

Original definition by Brier

Although the above formulation is the most widely used, the original definition by Brier [1] is applicable to multi-category forecasts as well and remains a proper scoring rule, while the binary form (as used in the examples above) is proper only for binary events. For binary forecasts, the original formulation of Brier's "probability score" has twice the value of the score currently known as the Brier score.

BS = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}(f_{ti} - o_{ti})^2

in which R is the number of possible classes in which the event can fall and N the overall number of instances of all classes; f_{ti} is the predicted probability for class i at instance t, and o_{ti} is 1 if it is the i-th class in instance t and 0 otherwise. For the case Rain / No rain, R = 2, while for the forecast Cold / Normal / Warm, R = 3.
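
As an illustration (not from the original article), a minimal Python sketch of the multi-category form; function and variable names are invented for the example:

# Illustrative sketch, not from the article: Brier's original multi-category probability score.
def brier_score_multicategory(forecasts, outcomes):
    """forecasts: per-instance probability vectors over the R classes (summing to 1).
    outcomes: per-instance one-hot vectors (1 for the class that occurred)."""
    n = len(forecasts)
    return sum(
        sum((f_i - o_i) ** 2 for f_i, o_i in zip(f, o))
        for f, o in zip(forecasts, outcomes)
    ) / n

# Two Cold / Normal / Warm forecasts (R = 3); the observed class is one-hot encoded.
forecasts = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
outcomes = [[1, 0, 0], [0, 0, 1]]
print(brier_score_multicategory(forecasts, outcomes))  # (0.14 + 0.86) / 2 ≈ 0.5

# For a binary event (R = 2) this returns twice the value of the binary formulation above,
# matching the remark about Brier's original "probability score".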

Decompositions

There are several decompositions of the Brier score which provide deeper insight into the behavior of a binary classifier.

3-component decomposition

The Brier score can be decomposed into three additive components: Uncertainty, Reliability, and Resolution (Murphy 1973): [2]

BS = REL - RES + UNC

Each of these components can be decomposed further according to the number of possible classes in which the event can fall. Abusing the equality sign:

REL = \frac{1}{N}\sum_{k=1}^{K} n_k \left(\mathbf{f}_k - \bar{\mathbf{o}}_k\right)^2

RES = \frac{1}{N}\sum_{k=1}^{K} n_k \left(\bar{\mathbf{o}}_k - \bar{\mathbf{o}}\right)^2

UNC = \bar{\mathbf{o}}\left(1 - \bar{\mathbf{o}}\right)

with N being the total number of forecasts issued, K the number of unique forecasts issued, \bar{\mathbf{o}} the observed climatological base rate for the event to occur, n_k the number of forecasts with the same probability category \mathbf{f}_k, and \bar{\mathbf{o}}_k the observed frequency given forecasts of probability \mathbf{f}_k. The bold notation in the above formulas indicates vectors, which is another way of denoting the original definition of the score and decomposing it according to the number of possible classes in which the event can fall. For example, a 70% chance of rain and an occurrence of no rain are denoted as \mathbf{f}_k = (0.3, 0.7) and \mathbf{o}_k = (1, 0) respectively. Operations like the square and multiplication on these vectors are understood to be component-wise. The Brier score is then the sum of the resulting vector on the right hand side.
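
As an illustration (not from the original article), a minimal Python sketch of this three-component decomposition for a binary event, grouping forecasts by their probability value; the function name and data are invented for the example:

# Illustrative sketch, not from the article: Murphy's BS = REL - RES + UNC for binary outcomes.
from collections import defaultdict

def decompose_brier(forecasts, outcomes):
    n = len(forecasts)
    base_rate = sum(outcomes) / n                      # climatological base rate o_bar
    groups = defaultdict(list)                         # outcomes grouped by forecast value f_k
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    rel = sum(len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()) / n
    res = sum(len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()) / n
    unc = base_rate * (1 - base_rate)
    return rel, res, unc

forecasts = [0.8, 0.8, 0.8, 0.8, 0.8, 0.3, 0.3, 0.3, 0.3, 0.3]
outcomes = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
rel, res, unc = decompose_brier(forecasts, outcomes)
bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(rel - res + unc, bs)  # both are 0.165 (up to floating-point rounding)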

Reliability

The reliability term measures how close the forecast probabilities are to the true probabilities, given that forecast. Reliability is defined in the opposite direction to the everyday meaning of the word: if the reliability is 0, the forecast is perfectly reliable. For example, if we group all forecast instances where an 80% chance of rain was forecast, we get perfect reliability only if it rained 4 out of 5 times after such a forecast was issued.

Resolution

The resolution term measures how much the conditional probabilities given the different forecasts differ from the climatic average. The higher this term is, the better. In the worst case, when the climatic probability is always forecast, the resolution is zero. In the best case, when the conditional probabilities are zero and one, the resolution is equal to the uncertainty.

Uncertainty

The uncertainty term measures the inherent uncertainty in the outcomes of the event. For binary events, it is at a maximum when each outcome occurs 50% of the time, and is minimal (zero) if an outcome always occurs or never occurs.

Two-component decomposition

An alternative (and related) decomposition generates two terms instead of three.

The first term is known as calibration (and can be used as a measure of calibration; see statistical calibration) and is equal to the reliability term. The second term is known as refinement; it is an aggregation of resolution and uncertainty and is related to the area under the ROC curve.
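
As an illustration (not from the original article), a minimal Python sketch of the CAL + REF view under the same grouping by forecast value, where calibration equals the reliability term and refinement is the forecast-weighted average of o_bar_k (1 - o_bar_k); the function name and data are invented for the example:

# Illustrative sketch, not from the article: two-component decomposition BS = CAL + REF.
from collections import defaultdict

def cal_ref(forecasts, outcomes):
    n = len(forecasts)
    groups = defaultdict(list)                         # outcomes grouped by forecast value
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    cal = sum(len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()) / n
    ref = sum(len(obs) * (sum(obs) / len(obs)) * (1 - sum(obs) / len(obs)) for obs in groups.values()) / n
    return cal, ref

forecasts = [0.8, 0.8, 0.8, 0.8, 0.8, 0.3, 0.3, 0.3, 0.3, 0.3]
outcomes = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
cal, ref = cal_ref(forecasts, outcomes)
print(cal + ref)  # 0.165, the same Brier score as in the three-component example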

The Brier score, and the CAL + REF decomposition, can be represented graphically through so-called Brier curves, [3] where the expected loss is shown for each operating condition. This makes the Brier score a measure of aggregated performance under a uniform distribution of class asymmetries. [4]

Brier Skill Score (BSS)

A skill score for a given underlying score is an offset and (negatively-) scaled variant of the underlying score such that a skill score value of zero means that the score for the predictions is merely as good as that of a set of baseline or reference or default predictions, while a skill score value of one (100%) represents the best possible score. A skill score value less than zero means that the performance is even worse than that of the baseline or reference predictions. When the underlying score is the Brier score (BS), the Brier skill score (BSS) is calculated as

BSS = 1 - \frac{BS}{BS_{ref}}

where BS_{ref} is the Brier score of the reference or baseline predictions that we seek to improve on. While the reference predictions could in principle be given by any pre-existing model, by default one can use the naïve model that predicts the overall proportion or frequency of a given class in the data set being scored as the constant predicted probability of that class occurring in each instance in the data set. This baseline model represents a "no skill" model that one seeks to improve on. Skill scores originate in the meteorological prediction literature, where the naïve default reference predictions are called the "in-sample climatology" predictions; climatology here means a long-term or overall average of the weather, and in-sample means calculated from the present data set being scored. [5] [6] In this default case, for binary (two-class) classification, the reference Brier score is given by (using the notation of the first equation of this article, at the top of the Definition section):

BS_{ref} = \frac{1}{N}\sum_{t=1}^{N}(\bar{o} - o_t)^2

where \bar{o} is simply the average actual outcome, i.e. the overall proportion of true class 1 in the data set:

\bar{o} = \frac{1}{N}\sum_{t=1}^{N} o_t
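
As an illustration (not from the original article), a minimal Python sketch of the BSS against this in-sample climatology baseline; the function name and data are invented for the example:

# Illustrative sketch, not from the article: Brier skill score versus the climatology baseline.
def brier_skill_score(forecasts, outcomes):
    n = len(forecasts)
    bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    o_bar = sum(outcomes) / n                               # average actual outcome
    bs_ref = sum((o_bar - o) ** 2 for o in outcomes) / n    # score of always forecasting o_bar
    return 1.0 - bs / bs_ref

forecasts = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
outcomes = [1, 1, 1, 0, 0, 1]
print(brier_skill_score(forecasts, outcomes))  # ≈ 0.19: these forecasts beat the climatology baseline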

With a Brier score, lower is better (it is a loss function) with 0 being the best possible score. But with a Brier skill score, higher is better with 1 (100%) being the best possible score.

The Brier skill score can be more interpretable than the Brier score because the BSS is simply the percentage improvement in the BS compared to the reference model, and a negative BSS means the predictions are doing even worse than the reference model, which may not be obvious from looking at the Brier score itself. However, a BSS near 100% should not typically be expected, because this would require that every probability prediction was nearly 0 or 1 (and correct, of course).

Because the Brier score is a strictly proper scoring rule, and the BSS is just an affine transformation of it, the BSS is also a strictly proper scoring rule.

Classification's (probability estimation's) BSS is to its BS as regression's coefficient of determination (R^2) is to its mean squared error (MSE).

Shortcomings

The Brier score becomes inadequate for very rare (or very frequent) events, because it does not sufficiently discriminate between small changes in forecast that are significant for rare events. [7] Wilks (2010) has found that "[Q]uite large sample sizes, i.e. n > 1000, are required for higher-skill forecasts of relatively rare events, whereas only quite modest sample sizes are needed for low-skill forecasts of common events." [8]

Notes

  1. Brier, Glenn W. (1950). "Verification of Forecasts Expressed in Terms of Probability" (PDF). Monthly Weather Review. 78 (1): 1–3. Bibcode:1950MWRv...78....1B. doi:10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2. S2CID 122906757. Archived from the original (PDF) on 2017-10-23.
  2. Murphy, A. H. (1973). "A new vector partition of the probability score". Journal of Applied Meteorology. 12 (4): 595–600. Bibcode:1973JApMe..12..595M. doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  3. Hernandez-Orallo, J.; Flach, P.A.; Ferri, C. (2011). "Brier curves: a new cost-based visualisation of classifier performance" (PDF). Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 585–592.
  4. Hernandez-Orallo, J.; Flach, P.A.; Ferri, C. (2012). "A unified view of performance metrics: translating threshold choice into expected classification loss" (PDF). Journal of Machine Learning Research. 13: 2813–2869.
  5. Ferro, C. A. T.; Fricker, T. E. (2012). "A bias-corrected decomposition of the Brier score" (Notes and Correspondence). Quarterly Journal of the Royal Meteorological Society. 138 (668): 1954–1960.
  6. Bowler, Neill; Dando, Marie; Beare, Sarah; Mylne, Ken. "Numerical Weather Prediction: The MOGREPS short-range ensemble prediction system: Verification report: Trial Performance of MOGREPS: January 2006 – March 2007". Forecasting Research Technical Report No. 503.
  7. Benedetti, Riccardo (2010). "Scoring Rules for Forecast Verification". Monthly Weather Review. 138 (1): 203–211. Bibcode:2010MWRv..138..203B. doi:10.1175/2009MWR2945.1.
  8. Wilks, D. S. (2010). "Sampling distributions of the Brier score and Brier skill score under serial dependence". Quarterly Journal of the Royal Meteorological Society. 136 (1): 2109–2118. Bibcode:2010QJRMS.136.2109W. doi:10.1002/qj.709. S2CID 121504347.
