Scoring rule

Visualization of the expected score under various predictions from some common scoring functions. Dashed black line: forecaster's true belief; red: linear; orange: spherical; purple: quadratic; green: log.

In decision theory, a scoring rule [1] provides a summary measure for the evaluation of probabilistic predictions or forecasts. It is applicable to tasks in which predictions assign probabilities to events, i.e. one issues a probability distribution as prediction. This includes probabilistic classification of a set of mutually exclusive outcomes or classes.


On the other hand, a scoring function [2] provides a summary measure for the evaluation of point predictions, i.e. one predicts a property or functional $\mathrm{T}(F)$, like the expectation or the median.

Scoring rules and scoring functions can be thought of as "cost functions" or "loss functions". They are evaluated as the empirical mean over a given sample, simply called the score. Scores of different predictions or models can then be compared to conclude which model is best.

If a cost is levied in proportion to a proper scoring rule, the minimal expected cost corresponds to reporting the true set of probabilities. Proper scoring rules are used in meteorology, finance, and pattern classification where a forecaster or algorithm will attempt to minimize the average score to yield refined, calibrated probabilities (i.e. accurate probabilities).

Motivation

A calibration curve allows one to judge how well model predictions are calibrated; blue is the best calibrated model (see calibration (statistics)).

Since the metrics in Evaluation of binary classifiers do not evaluate calibration, scoring rules that do are needed. These scoring rules can be used as loss functions in empirical risk minimization.

Definition

Consider a sample space $\Omega$, a σ-algebra $\mathcal{A}$ of subsets of $\Omega$, and a convex class $\mathcal{F}$ of probability measures on $(\Omega, \mathcal{A})$. A function defined on $\Omega$ and taking values in the extended real line, $\overline{\mathbb{R}} = [-\infty, \infty]$, is $\mathcal{F}$-quasi-integrable if it is measurable with respect to $\mathcal{A}$ and is quasi-integrable with respect to all $F \in \mathcal{F}$.

Probabilistic forecast

A probabilistic forecast is any probability measure $F \in \mathcal{F}$.

Scoring rule

A scoring rule is any extended real-valued function $S : \mathcal{F} \times \Omega \to \overline{\mathbb{R}}$ such that $S(F, \cdot)$ is $\mathcal{F}$-quasi-integrable for all $F \in \mathcal{F}$. $S(F, y)$ represents the loss or penalty when the forecast $F$ is issued and the observation $y$ materializes.

Point forecast

A point forecast is a functional, i.e. a potentially set-valued mapping $\mathcal{F} \ni F \mapsto \mathrm{T}(F) \subseteq \Omega$.

Scoring function

A scoring function is any real-valued function $S : \Omega \times \Omega \to \mathbb{R}$, where $S(x, y)$ represents the loss or penalty when the point forecast $x$ is issued and the observation $y$ materializes.

Orientation

Scoring rules and scoring functions are negatively (positively) oriented if smaller (larger) values indicate a better forecast. Here we adhere to negative orientation, hence the association with "loss".

Expected score

We write $S(F, Q) = \int S(F, \omega) \, \mathrm{d}Q(\omega)$ for the expected score of the forecast $F$ under the true distribution $Q$.

Sample average score

Given random samples $y_1, \dots, y_n$ and corresponding forecasts $F_1, \dots, F_n$ or $x_1, \dots, x_n$ (e.g. forecasts from a single model), one calculates the (importance sample) estimated expected score as

$$\frac{1}{n} \sum_{i=1}^{n} S(F_i, y_i) \quad \text{or} \quad \frac{1}{n} \sum_{i=1}^{n} S(x_i, y_i).$$

Average scores are used to compare and rank different forecast(er)s or models.
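As an illustration, here is a minimal sketch in Python of computing a sample average score; the forecasts, outcomes, and the choice of the negatively oriented logarithmic score are assumptions made for the example:

```python
import numpy as np

def log_score(r, y):
    """Logarithmic score (negatively oriented): -ln of the probability
    assigned to the outcome y that materialized."""
    return -np.log(r[y])

# Hypothetical binary forecasts (P(rain), P(no rain)) and observations.
forecasts = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
observations = np.array([0, 1, 0])  # 0 = rain, 1 = no rain

# Sample average score: the empirical mean of the per-case scores.
avg = np.mean([log_score(r, y) for r, y in zip(forecasts, observations)])
print(avg)  # lower is better under negative orientation
```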

Propriety and consistency

Strictly proper scoring rules and strictly consistent scoring functions encourage honest forecasts by maximization of the expected reward: if a forecaster is given a reward of $-S(F, y)$ when $y$ realizes (e.g. $y$ = rain), then the highest expected reward (lowest expected score) is obtained by reporting the true probability distribution. [1]

Proper scoring rules

A scoring rule $S$ is proper relative to $\mathcal{F}$ if (assuming negative orientation) its expected score satisfies

$$S(Q, Q) \leq S(F, Q) \quad \text{for all } F, Q \in \mathcal{F}.$$

It is strictly proper if the above holds with equality if and only if $F = Q$.
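To see propriety numerically, here is a small sketch assuming a binary event with true probability 0.7 and the negatively oriented logarithmic score; the expected score is minimized exactly at the honest report:

```python
import numpy as np

q = 0.7  # true probability of the event (assumed for illustration)

def expected_log_score(p, q):
    """Expected negatively oriented log score when the true event
    probability is q and the forecaster reports p."""
    return -(q * np.log(p) + (1 - q) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmin([expected_log_score(p, q) for p in grid])]
print(best)  # ~0.7: honest reporting minimizes the expected score
```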

Consistent scoring functions

A scoring function $S$ is consistent for the functional $\mathrm{T}$ relative to the class $\mathcal{F}$ if

$$\mathbb{E}_{Y \sim F}[S(t, Y)] \leq \mathbb{E}_{Y \sim F}[S(x, Y)]$$

for all $F \in \mathcal{F}$, all $t \in \mathrm{T}(F)$ and all point forecasts $x$.

It is strictly consistent if it is consistent and equality in the above equation implies that $x \in \mathrm{T}(F)$.
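For instance, the squared error is consistent for the mean and the absolute error for the median. A short sketch checking this on a skewed sample; the exponential distribution and the grid search are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=100_000)  # skewed: mean != median

grid = np.linspace(0.0, 3.0, 301)
# Squared error E[(x - Y)^2] is minimized at the mean ...
x_sq = grid[np.argmin([np.mean((x - y) ** 2) for x in grid])]
# ... and absolute error E[|x - Y|] at the median.
x_abs = grid[np.argmin([np.mean(np.abs(x - y)) for x in grid])]
print(x_sq, np.mean(y))     # both close to 1.0
print(x_abs, np.median(y))  # both close to ln 2 ≈ 0.693
```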

Example application of scoring rules

The logarithmic rule.

An example of probabilistic forecasting is in meteorology where a weather forecaster may give the probability of rain on the next day. One could note the number of times that a 25% probability was quoted, over a long period, and compare this with the actual proportion of times that rain fell. If the actual percentage was substantially different from the stated probability we say that the forecaster is poorly calibrated. A poorly calibrated forecaster might be encouraged to do better by a bonus system. A bonus system designed around a proper scoring rule will incentivize the forecaster to report probabilities equal to his personal beliefs. [3]

In addition to the simple case of a binary decision, such as assigning probabilities to 'rain' or 'no rain', scoring rules may be used for multiple classes, such as 'rain', 'snow', or 'clear', or continuous responses like the amount of rain per day.

The image to the right shows an example of a scoring rule, the logarithmic scoring rule, as a function of the probability reported for the event that actually occurred. One way to use this rule would be as a cost based on the probability that a forecaster or algorithm assigns, evaluated after checking which event actually occurs.

Examples of proper scoring rules

There are infinitely many scoring rules, including entire parameterized families of strictly proper scoring rules. The ones shown below are simply popular examples.

Categorical variables

For a categorical response variable with $C$ mutually exclusive events, $y \in \{1, \dots, C\}$, a probabilistic forecaster or algorithm will return a probability vector $\mathbf{r} = (r_1, \dots, r_C)$ with a probability for each of the $C$ outcomes.

Logarithmic score

Expected value of the logarithmic rule when Event 1 is expected to occur with probability 0.8; the blue line is described by the function $0.8 \log(x) + (1 - 0.8) \log(1 - x)$.

The logarithmic scoring rule is a local strictly proper scoring rule, $L(\mathbf{r}, i) = \log(r_i)$, the logarithm of the probability assigned to the outcome $i$ that materialized. It is also the negative of the surprisal, which is commonly used as a scoring criterion in Bayesian inference; the goal there is to minimize expected surprisal. This scoring rule has strong foundations in information theory.

Here, the score is calculated as the logarithm of the probability estimate for the actual outcome. That is, a prediction of 80% that correctly proved true would receive a score of ln(0.8) = −0.22. This same prediction also assigns 20% likelihood to the opposite case, so if the prediction proves false, it would receive a score based on the 20%: ln(0.2) = −1.6. The goal of a forecaster is to maximize the score, and −0.22 is indeed larger than −1.6.

If one treats the truth or falsity of the prediction as a variable x with value 1 or 0 respectively, and the expressed probability as p, then one can write the logarithmic scoring rule as x ln(p) + (1 − x) ln(1 − p). Note that any logarithmic base may be used, since strictly proper scoring rules remain strictly proper under linear transformation. That is,

$$b \, \log_a(r_i)$$

is strictly proper for all $a > 1$ and $b > 0$.
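A minimal implementation of the logarithmic score for a categorical forecast, reproducing the numbers above; the function name and base argument are illustrative:

```python
import numpy as np

def logarithmic_score(r, i, base=np.e):
    """Log score L(r, i): log of the probability assigned to the
    outcome i that occurred; any base > 1 keeps it strictly proper."""
    return np.log(r[i]) / np.log(base)

r = np.array([0.8, 0.2])        # forecast: 80% rain, 20% no rain
print(logarithmic_score(r, 0))  # rain occurred: ln(0.8) ≈ -0.22
print(logarithmic_score(r, 1))  # no rain:       ln(0.2) ≈ -1.61
```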

Brier/Quadratic score

The quadratic scoring rule is a strictly proper scoring rule

$$Q(\mathbf{r}, i) = 2 r_i - \sum_{j=1}^{C} r_j^2$$

where $r_i$ is the probability assigned to the correct answer and $C$ is the number of classes.

The Brier score, originally proposed by Glenn W. Brier in 1950, [4] can be obtained by an affine transform from the quadratic scoring rule,

$$B(\mathbf{r}, i) = \sum_{j=1}^{C} (y_j - r_j)^2$$

where $y_j = 1$ when the $j$th event is correct and $y_j = 0$ otherwise, and $C$ is the number of classes.

An important difference between these two rules is that a forecaster should strive to maximize the quadratic score $Q$ yet minimize the Brier score $B$. This is due to a negative sign in the affine transformation between them: $B(\mathbf{r}, i) = 1 - Q(\mathbf{r}, i)$.
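A short sketch of both rules for a single three-class forecast, which also exhibits the affine relation between them; the example probabilities are arbitrary:

```python
import numpy as np

def quadratic_score(r, i):
    """Q(r, i) = 2 r_i - sum_j r_j^2 (positively oriented)."""
    return 2 * r[i] - np.sum(r ** 2)

def brier_score(r, i):
    """B(r, i) = sum_j (y_j - r_j)^2 with one-hot outcome y (minimize)."""
    y = np.zeros_like(r)
    y[i] = 1.0
    return np.sum((y - r) ** 2)

r = np.array([0.7, 0.2, 0.1])     # three-class forecast
print(quadratic_score(r, 0))      # 2*0.7 - 0.54 = 0.86
print(brier_score(r, 0))          # 0.14
print(1 - quadratic_score(r, 0))  # equals the Brier score
```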

Hyvärinen scoring rule

The Hyvärinen scoring function (of a density p) is defined by [5]

$$S(p, y) = 2 \, \Delta_y \log p(y) + \big\lVert \nabla_y \log p(y) \big\rVert_2^2$$

where $\Delta$ denotes the Laplacian (the trace of the Hessian) and $\nabla$ denotes the gradient. This scoring rule can be used to computationally simplify parameter inference and address Bayesian model comparison with arbitrarily-vague priors. [5] [6] It was also used to introduce new information-theoretic quantities beyond the existing information theory. [7]
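For a univariate Gaussian density the score has a closed form, since $\nabla_y \log p(y) = -(y - \mu)/\sigma^2$ and $\Delta_y \log p(y) = -1/\sigma^2$; a sketch under that assumption:

```python
def hyvarinen_score_gaussian(y, mu, sigma2):
    """Hyvärinen score 2*Δ log p(y) + ||∇ log p(y)||^2 for a univariate
    Gaussian N(mu, sigma2); the normalizing constant cancels, which is
    why the score suits unnormalized models."""
    grad = -(y - mu) / sigma2  # d/dy log p(y)
    lap = -1.0 / sigma2        # d^2/dy^2 log p(y)
    return 2.0 * lap + grad ** 2

print(hyvarinen_score_gaussian(y=0.5, mu=0.0, sigma2=1.0))  # -2 + 0.25
```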

Spherical score

The spherical scoring rule is also a strictly proper scoring rule

$$S(\mathbf{r}, i) = \frac{r_i}{\lVert \mathbf{r} \rVert_2} = \frac{r_i}{\sqrt{\sum_{j=1}^{C} r_j^2}}.$$

Continuous variables

Continuous ranked probability score

Illustration of the continuous ranked probability score (CRPS): given a sample y and a predicted cumulative distribution F, the CRPS is obtained by computing the difference between the curves at each point x of the support, squaring it, and integrating it over the whole support.

The continuous ranked probability score (CRPS) [8] is a strictly proper scoring rule much used in meteorology. It is defined as

$$\mathrm{CRPS}(F, y) = \int_{\mathbb{R}} \big( F(x) - \mathbb{1}\{x \geq y\} \big)^2 \, \mathrm{d}x$$

where $F$ is the forecast cumulative distribution function, $\mathbb{1}\{x \geq y\}$ is the Heaviside step function centered at the observation $y$. Note that the forecast estimates an entire distribution, so that a cumulative distribution function arises.
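In practice the CRPS of an ensemble forecast is often estimated through the equivalent energy form $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ (estimators of this kind are discussed in [8]); a sketch for a forecast given as samples:

```python
import numpy as np

def crps_ensemble(samples, y):
    """CRPS estimate from forecast samples X ~ F via the energy form
    E|X - y| - 0.5 * E|X - X'| (X, X' independent draws from F)."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(1)
ens = rng.normal(loc=0.0, scale=1.0, size=500)  # forecast ensemble
print(crps_ensemble(ens, y=0.3))
```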

Interpretation of proper scoring rules

All proper scoring rules are equal to weighted sums (integrals with a non-negative weighting functional) of the losses in a set of simple two-alternative decision problems that use the probabilistic prediction, each such decision problem having a particular combination of associated cost parameters for false positive and false negative decisions. A strictly proper scoring rule corresponds to having a nonzero weighting for all possible decision thresholds.

Any given proper scoring rule is equal to the expected losses with respect to a particular probability distribution over the decision thresholds; thus the choice of a scoring rule corresponds to an assumption about the probability distribution of decision problems for which the predicted probabilities will ultimately be employed. For example, the quadratic loss (or Brier) scoring rule corresponds to a uniform probability of the decision threshold being anywhere between zero and one.

The classification accuracy score (percent classified correctly), a single-threshold scoring rule which is zero or one depending on whether the predicted probability is on the appropriate side of 0.5, is a proper scoring rule but not a strictly proper scoring rule, because it is optimized (in expectation) not only by predicting the true probability but by predicting any probability on the same side of 0.5 as the true probability. [9] [10] [11] [12] [13] [14]

Comparison of strictly proper scoring rules

Shown below on the left is a graphical comparison of the Logarithmic, Quadratic, and Spherical scoring rules for a binary classification problem. The x-axis indicates the reported probability for the event that actually occurred.

It is important to note that each of the scores has a different magnitude and location. The magnitude differences are not relevant, however, as scores remain proper under affine transformation. Therefore, to compare different scores it is necessary to move them to a common scale. A reasonable choice of normalization is shown in the figure on the right, where all scores intersect the points (0.5, 0) and (1, 1). This ensures that they yield 0 for a uniform distribution (two probabilities of 0.5 each), reflecting no cost or reward for reporting what is often the baseline distribution. All normalized scores below also yield 1 when the true class is assigned a probability of 1.
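A sketch of this normalization, reading it as the affine map $S \mapsto (S(p) - S(0.5)) / (S(1) - S(0.5))$; this reading is an assumption, but it is the map consistent with the stated intersection points:

```python
import numpy as np

# Binary scores for the true class as a function of reported p.
scores = {
    "logarithmic": lambda p: np.log(p),
    "quadratic":   lambda p: 2 * p - (p**2 + (1 - p)**2),
    "spherical":   lambda p: p / np.sqrt(p**2 + (1 - p)**2),
}

def normalize(s, p):
    """Affine rescaling so the score is 0 at p = 0.5 and 1 at p = 1."""
    return (s(p) - s(0.5)) / (s(1.0) - s(0.5))

for name, s in scores.items():
    print(name, normalize(s, 0.8))  # all comparable on a common scale
```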

Score of a binary classification for the true class, showing logarithmic (blue), spherical (green), and quadratic (red).
Normalized score of a binary classification for the true class, showing logarithmic (blue), spherical (green), and quadratic (red).

Characteristics

Affine transformation

A strictly proper scoring rule, whether binary or multiclass, remains a strictly proper scoring rule after an affine transformation. [3] That is, if $S(\mathbf{r}, i)$ is a strictly proper scoring rule then $a + b \, S(\mathbf{r}, i)$ with $b \neq 0$ is also a strictly proper scoring rule, though if $b < 0$ the optimization sense of the scoring rule switches between maximization and minimization.
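A quick numerical check of this property; a sketch using the expected binary logarithmic score with an assumed true probability of 0.7:

```python
import numpy as np

q = 0.7                          # assumed true event probability
p = np.linspace(0.01, 0.99, 99)  # candidate reports
S = q * np.log(p) + (1 - q) * np.log(1 - p)  # expected log score (maximize)

print(p[np.argmax(S)])          # 0.7: honest report is optimal
print(p[np.argmax(3 + 2 * S)])  # b > 0: still 0.7, still maximize
print(p[np.argmin(3 - 2 * S)])  # b < 0: 0.7 again, but now minimize
```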

Locality

A proper scoring rule is said to be local if its value for the probability of a specific event depends only on the probability of that event. Less formally: the optimal report for a specific event is invariant to all changes in the observation distribution that leave the probability of that event unchanged. All binary scores are local, because the probability assigned to the event that did not occur is fully determined by the probability of the event that did, leaving no degree of freedom to vary over.

Affine functions of the logarithmic scoring rule are the only strictly proper local scoring rules on a finite set that is not binary.

Decomposition

The expectation value of a proper scoring rule $S$ can be decomposed into the sum of three components, called uncertainty, reliability, and resolution, [15] [16] which characterize different attributes of probabilistic forecasts:

$$\mathbb{E}[S] = \mathrm{UNC} + \mathrm{REL} - \mathrm{RES}.$$

If a score is proper and negatively oriented (such as the Brier score), all three terms are non-negative. The uncertainty component is equal to the expected score of the forecast which constantly predicts the average event frequency. The reliability component penalizes poorly calibrated forecasts, in which the predicted probabilities do not coincide with the event frequencies.

The equations for the individual components depend on the particular scoring rule. For the Brier score, they are given by

$$\mathrm{REL} = \mathbb{E}_t \big[ (t - \pi_t)^2 \big], \qquad \mathrm{RES} = \mathbb{E}_t \big[ (\pi_t - \bar{x})^2 \big], \qquad \mathrm{UNC} = \bar{x} (1 - \bar{x})$$

where $\bar{x}$ is the average probability of occurrence of the binary event $x$, and $\pi_t$ is the conditional event probability given the forecast $t$, i.e. $\pi_t = P(x = 1 \mid t)$.
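A sketch of this decomposition for binned binary forecasts, following the component formulas above; the grouping-by-unique-forecast-values approach and the toy data are illustrative:

```python
import numpy as np

def brier_decomposition(t, x):
    """Murphy decomposition of the mean Brier score E[(t - x)^2] into
    REL - RES + UNC, grouping identical forecast values t."""
    t, x = np.asarray(t, float), np.asarray(x, float)
    x_bar = x.mean()
    rel = res = 0.0
    for value in np.unique(t):
        mask = t == value
        pi_t = x[mask].mean()  # conditional event frequency
        w = mask.mean()        # weight of this forecast bin
        rel += w * (value - pi_t) ** 2
        res += w * (pi_t - x_bar) ** 2
    unc = x_bar * (1 - x_bar)
    return rel, res, unc

t = np.array([0.2, 0.2, 0.8, 0.8, 0.8])  # forecasts
x = np.array([0,   1,   1,   1,   0  ])  # binary outcomes
rel, res, unc = brier_decomposition(t, x)
print(rel - res + unc, np.mean((t - x) ** 2))  # the two must agree
```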


References

  1. Gneiting, Tilmann; Raftery, Adrian E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation". Journal of the American Statistical Association. 102 (447): 359–378. doi:10.1198/016214506000001437.
  2. Gneiting, Tilmann (2011). "Making and Evaluating Point Forecasts". Journal of the American Statistical Association. 106 (494): 746–762. arXiv:0912.0902. doi:10.1198/jasa.2011.r10138.
  3. Bickel, E.J. (2007). "Some Comparisons among Quadratic, Spherical, and Logarithmic Scoring Rules". Decision Analysis. 4 (2): 49–65. doi:10.1287/deca.1070.0089.
  4. Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability". Monthly Weather Review. 78 (1): 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  5. Hyvärinen, Aapo (2005). "Estimation of Non-Normalized Statistical Models by Score Matching". Journal of Machine Learning Research. 6 (24): 695–709.
  6. Shao, Stephane; Jacob, Pierre E.; Ding, Jie; Tarokh, Vahid (2019). "Bayesian Model Comparison with the Hyvärinen Score: Computation and Consistency". Journal of the American Statistical Association. 114 (528): 1826–1837. arXiv:1711.00136. doi:10.1080/01621459.2018.1518237.
  7. Ding, Jie; Calderbank, Robert; Tarokh, Vahid (2019). "Gradient Information for Representation and Modeling". Advances in Neural Information Processing Systems. 32: 2396–2405.
  8. Zamo, Michaël; Naveau, Philippe (2018). "Estimation of the Continuous Ranked Probability Score with Limited Information and Applications to Ensemble Weather Forecasts". Mathematical Geosciences. 50 (2): 209–234. doi:10.1007/s11004-017-9709-7.
  9. Savage, Leonard J. (1971). "Elicitation of Personal Probabilities and Expectations". Journal of the American Statistical Association. 66 (336): 783–801.
  10. Schervish, Mark J. (1989). "A General Method for Comparing Probability Assessors". Annals of Statistics. 17 (4): 1856–1879. https://projecteuclid.org/euclid.aos/1176347398
  11. Rosen, David B. (1996). "How good were those probability predictions? The expected recommendation loss (ERL) scoring rule". In Heidbreder, G. (ed.). Maximum Entropy and Bayesian Methods (Proceedings of the Thirteenth International Workshop, August 1993). Kluwer, Dordrecht, The Netherlands.
  12. Roulston, M. S.; Smith, L. A. (2002). "Evaluating probabilistic forecasts using information theory". Monthly Weather Review. 130: 1653–1660. See the appendix "Skill Scores and Cost–Loss".
  13. Buja, Andreas; Stuetzle, Werner; Shen, Yi (2005). "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications". http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.184.5203
  14. Hernandez-Orallo, Jose; Flach, Peter; Ferri, Cesar (2012). "A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss". Journal of Machine Learning Research. 13: 2813–2869. http://www.jmlr.org/papers/volume13/hernandez-orallo12a/hernandez-orallo12a.pdf
  15. Murphy, A.H. (1973). "A new vector partition of the probability score". Journal of Applied Meteorology. 12 (4): 595–600. doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  16. Bröcker, J. (2009). "Reliability, sufficiency, and the decomposition of proper scores". Quarterly Journal of the Royal Meteorological Society. 135 (643): 1512–1519. arXiv:0806.0813. doi:10.1002/qj.456.