# Loss function

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) [1] is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized.

In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century. [2] In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. [3] In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss.

## Examples

### Regret

Leonard J. Savage argued that, when using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and the decision that was in fact taken before they were known.
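
As a minimal sketch of this idea, assuming a small finite table of losses indexed by action and state, regret can be computed by subtracting from each loss the best loss attainable in that state. The table values and the names `regret_table`, `loss`, and `minimax_regret_action` are hypothetical and chosen only for illustration.

```python
def regret_table(loss):
    """Turn loss[action][state] into regret[action][state].

    The regret of an action in a state is its loss minus the smallest
    loss any action could have achieved in that same state.
    """
    n_states = len(next(iter(loss.values())))
    best_per_state = [min(row[s] for row in loss.values()) for s in range(n_states)]
    return {
        action: [row[s] - best_per_state[s] for s in range(n_states)]
        for action, row in loss.items()
    }


# Hypothetical losses; columns are the states (rain, sun).
loss = {"umbrella": [1, 3], "no_umbrella": [10, 0]}
regret = regret_table(loss)

# Minimax regret: choose the action whose worst-case regret is smallest.
minimax_regret_action = min(regret, key=lambda action: max(regret[action]))
print(regret, minimax_regret_action)
```

In this toy table, every state has at least one action with zero regret, which is what distinguishes a regret table from the raw losses.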

### Quadratic loss function

The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is

${\displaystyle \lambda (x)=C(t-x)^{2}\;}$

for some constant C; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1.
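
The following sketch illustrates both points numerically, assuming a positive constant C: the loss is symmetric around the target, and rescaling C leaves the loss-minimizing choice unchanged. The helper `best_choice`, the grid of candidates, and the target values are illustrative assumptions, not part of the definition above.

```python
def quadratic_loss(x, t, c=1.0):
    """Quadratic loss C * (t - x)**2 for a choice x and target t."""
    return c * (t - x) ** 2


def best_choice(candidates, targets, c):
    """Candidate with the smallest total quadratic loss over all targets."""
    return min(candidates, key=lambda x: sum(quadratic_loss(x, t, c) for t in targets))


targets = [1.0, 2.0, 6.0]                      # hypothetical target values
candidates = [i / 10 for i in range(0, 71)]    # grid 0.0, 0.1, ..., 7.0

choice_c1 = best_choice(candidates, targets, c=1.0)
choice_c25 = best_choice(candidates, targets, c=25.0)
print(choice_c1, choice_c25)
```

Both calls select the same candidate, since multiplying every loss by the same positive constant does not change their ordering.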

Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function.

The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used.

### 0-1 loss function

In statistics and decision theory, a frequently used loss function is the 0-1 loss function

${\displaystyle L({\hat {y}},y)=I({\hat {y}}\neq y),\,}$

where ${\displaystyle I}$ is the indicator function.
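
A direct transcription of this definition, as a sketch: the indicator becomes a comparison that returns 0 or 1. The helper `misclassification_rate`, which averages the loss over a set of predictions, and the example labels are added here for illustration only.

```python
def zero_one_loss(y_pred, y_true):
    """Return 1 if the prediction differs from the true value, else 0."""
    return int(y_pred != y_true)


def misclassification_rate(preds, labels):
    """Average 0-1 loss over paired predictions and labels."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    return sum(zero_one_loss(p, y) for p, y in zip(preds, labels)) / len(labels)


rate = misclassification_rate(["a", "b", "a", "a"], ["a", "b", "b", "a"])
print(rate)
```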

## Constructing loss and objective functions

In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker's preference must be elicited and represented by a scalar-valued function (also called a utility function) in a form suitable for optimization, a problem that Ragnar Frisch highlighted in his Nobel Prize lecture. [4] The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences. [5] [6] In particular, Andranik Tangian showed that the most usable objective functions, quadratic and additive, are determined by a few indifference points. He used this property in models for constructing these objective functions from either ordinal or cardinal data elicited through computer-assisted interviews with decision makers. [7] [8] Among other things, he constructed objective functions to optimally distribute budgets among 16 Westphalian universities [9] and European subsidies for equalizing unemployment rates among 271 German regions. [10]

## Expected loss

In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable X.

### Statistics

Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.

#### Frequentist expected loss

We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, Pθ, of the observed data, X. This is also referred to as the risk function [11] [12] [13] [14] of the decision rule δ and the parameter θ. Here the decision rule depends on the outcome of X. The risk function is given by:

${\displaystyle R(\theta ,\delta )=\operatorname {E} _{\theta }L{\big (}\theta ,\delta (X){\big )}=\int _{X}L{\big (}\theta ,\delta (x){\big )}\,\mathrm {d} P_{\theta }(x).}$

Here, θ is a fixed but possibly unknown state of nature, X is a vector of observations stochastically drawn from a population, ${\displaystyle \operatorname {E} _{\theta }}$ is the expectation over all population values of X, dPθ is a probability measure over the event space of X (parametrized by θ) and the integral is evaluated over the entire support of X.
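
When the integral has no convenient closed form, the risk can be approximated by simulation. The sketch below is one possible Monte Carlo approach, not a prescribed method: it repeatedly draws data for a fixed θ, applies the decision rule, and averages the loss. The normal sampling model with unit variance, the sample size of 10, and all function names are assumptions made for the example. For the sample mean under squared error, the exact risk in this setup is 1/10.

```python
import random
import statistics


def monte_carlo_risk(theta, decision_rule, loss, sample_data, n_rep=20000, seed=0):
    """Approximate R(theta, delta) by averaging L(theta, delta(X)) over simulated X."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = sample_data(theta, rng)          # one simulated data set drawn under theta
        total += loss(theta, decision_rule(x))
    return total / n_rep


def sample_normal(theta, rng, n=10):
    """Assumed model: n independent observations from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def squared_error(theta, estimate):
    return (theta - estimate) ** 2


risk_of_mean = monte_carlo_risk(2.0, statistics.fmean, squared_error, sample_normal)
print(risk_of_mean)  # close to 0.1, up to simulation error
```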

#### Bayesian expected loss

In a Bayesian approach, the expectation is calculated using the posterior distribution π* of the parameter θ:

${\displaystyle \rho (\pi ^{*},a)=\int _{\Theta }L(\theta ,a)\,\mathrm {d} \pi ^{*}(\theta ).}$

One then should choose the action a* which minimises the expected loss. Although this will result in choosing the same action as would be chosen using the frequentist risk, the emphasis of the Bayesian approach is that one is only interested in choosing the optimal action under the actual observed data, whereas choosing the actual frequentist optimal decision rule, which is a function of all possible observations, is a much more difficult problem.
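
For a posterior with finitely many support points, the integral reduces to a weighted sum, and the optimal action can be found by direct search. The sketch below assumes a hypothetical three-point posterior, a grid of candidate actions, and squared-error loss; under that loss the minimizer coincides with the posterior mean (1.1 here).

```python
def posterior_expected_loss(action, posterior, loss):
    """Sum of L(theta, action) weighted by the posterior probability of theta."""
    return sum(p * loss(theta, action) for theta, p in posterior.items())


def bayes_action(actions, posterior, loss):
    """Action with the smallest posterior expected loss among the candidates."""
    return min(actions, key=lambda a: posterior_expected_loss(a, posterior, loss))


def squared_error(theta, action):
    return (theta - action) ** 2


posterior = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}   # hypothetical posterior over theta
actions = [i / 10 for i in range(0, 21)]      # candidate actions 0.0, 0.1, ..., 2.0

best = bayes_action(actions, posterior, squared_error)
print(best, posterior_expected_loss(best, posterior, squared_error))
```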

#### Examples in statistics

• For a scalar parameter θ, a decision function whose output ${\displaystyle {\hat {\theta }}}$ is an estimate of θ, and a quadratic loss function (squared error loss)
${\displaystyle L(\theta ,{\hat {\theta }})=(\theta -{\hat {\theta }})^{2},}$
the risk function becomes the mean squared error of the estimate,
${\displaystyle R(\theta ,{\hat {\theta }})=\operatorname {E} _{\theta }(\theta -{\hat {\theta }})^{2}.}$
• In density estimation, the unknown parameter is the probability density itself. The loss function is typically chosen to be a norm in an appropriate function space. For example, for the L2 norm,
${\displaystyle L(f,{\hat {f}})=\|f-{\hat {f}}\|_{2}^{2}\,,}$
the risk function becomes the mean integrated squared error
${\displaystyle R(f,{\hat {f}})=\operatorname {E} \|f-{\hat {f}}\|_{2}^{2}.\,}$

### Economic choice under uncertainty

In economics, decision-making under uncertainty is often modelled using the von Neumann–Morgenstern utility function of the uncertain variable of interest, such as end-of-period wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.
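
As an illustration, the sketch below compares a certain outcome with a gamble under two utility functions. The wealth levels, probabilities, and the choice of logarithmic utility as a stand-in for a risk-averse agent are assumptions made for the example.

```python
import math


def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (wealth, probability) pairs."""
    return sum(p * utility(w) for w, p in lottery)


def identity(w):
    """Linear utility, corresponding to a risk-neutral agent."""
    return w


safe = [(100.0, 1.0)]                   # hypothetical certain end-of-period wealth
risky = [(50.0, 0.5), (160.0, 0.5)]     # hypothetical gamble with higher mean wealth

# Risk-neutral agent: compares expected wealth directly.
neutral_prefers_risky = expected_utility(risky, identity) > expected_utility(safe, identity)
# Risk-averse agent (log utility, an assumption): compares expected log wealth.
averse_prefers_risky = expected_utility(risky, math.log) > expected_utility(safe, math.log)
print(neutral_prefers_risky, averse_prefers_risky)
```

The gamble has the higher expected wealth, yet the concave utility ranks the certain outcome higher, which is why the expectation is taken over utility rather than over wealth.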

## Decision rules

A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:

• Minimax: Choose the decision rule with the lowest worst loss, that is, minimize the worst-case (maximum possible) loss:
${\displaystyle {\underset {\delta }{\operatorname {arg\,min} }}\ \max _{\theta \in \Theta }\ R(\theta ,\delta ).}$
• Invariance: Choose the decision rule which satisfies an invariance requirement.
• Choose the decision rule with the lowest average loss (i.e. minimize the expected value of the loss function):
${\displaystyle {\underset {\delta }{\operatorname {arg\,min} }}\operatorname {E} _{\theta \in \Theta }[R(\theta ,\delta )]={\underset {\delta }{\operatorname {arg\,min} }}\ \int _{\theta \in \Theta }R(\theta ,\delta )\,p(\theta )\,d\theta .}$
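
The minimax and average-loss criteria can select different rules. The sketch below assumes a hypothetical risk table with three decision rules and two possible values of θ, plus an assumed prior weighting for the average; none of these numbers come from the text.

```python
# Hypothetical risks R(theta, delta): one row per rule, one column per theta.
risk = {
    "rule_A": [1.0, 5.0],
    "rule_B": [3.0, 3.5],
    "rule_C": [2.0, 6.0],
}
prior = [0.8, 0.2]  # assumed weights p(theta) for the average-loss criterion

# Minimax: smallest worst-case risk over theta.
minimax_rule = min(risk, key=lambda d: max(risk[d]))

# Lowest average loss: smallest prior-weighted risk.
average_rule = min(risk, key=lambda d: sum(p * r for p, r in zip(prior, risk[d])))

print(minimax_rule, average_rule)
```

Here the minimax criterion favors the rule with the flattest risk profile, while the weighted average favors the rule that performs best under the more probable value of θ.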

## Selecting a loss function

Sound statistical practice requires selecting an estimator consistent with the actual acceptable variation experienced in the context of a particular applied problem. Thus, in the applied use of loss functions, selecting which statistical method to use to model an applied problem depends on knowing the losses that will be experienced from being wrong under the problem's particular circumstances. [15]

A common example involves estimating "location". Under typical statistical assumptions, the mean or average is the statistic for estimating location that minimizes the expected loss experienced under the squared-error loss function, while the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function. Still different estimators would be optimal under other, less common circumstances.
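
A small empirical analogue of this result, as a sketch: over a fixed sample, the total squared-error loss is minimized at the sample mean and the total absolute-difference loss at the sample median. The data values and the grid search are illustrative assumptions; the grid is chosen so that both the mean and the median lie on it.

```python
import statistics


def total_loss(center, data, loss):
    """Sum of loss(x - center) over the sample."""
    return sum(loss(x - center) for x in data)


def square(a):
    return a * a


data = [1.0, 2.0, 3.0, 4.0, 100.0]           # hypothetical sample with one large value
grid = [i / 10 for i in range(0, 1001)]      # candidate locations 0.0, 0.1, ..., 100.0

best_squared = min(grid, key=lambda c: total_loss(c, data, square))
best_absolute = min(grid, key=lambda c: total_loss(c, data, abs))

print(best_squared, statistics.fmean(data))    # both equal the sample mean
print(best_absolute, statistics.median(data))  # both equal the sample median
```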

In economics, when an agent is risk neutral, the objective function is simply expressed as the expected value of a monetary quantity, such as profit, income, or end-of-period wealth. For risk-averse or risk-loving agents, loss is measured as the negative of a utility function, and the objective function to be optimized is the expected value of utility.

Other measures of cost are possible, for example mortality or morbidity in the field of public health or safety engineering.

For most optimization algorithms, it is desirable to have a loss function that is globally continuous and differentiable.

Two very commonly used loss functions are the squared loss, ${\displaystyle L(a)=a^{2}}$, and the absolute loss, ${\displaystyle L(a)=|a|}$. However, the absolute loss has the disadvantage that it is not differentiable at ${\displaystyle a=0}$. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of ${\displaystyle a}$'s (as in ${\textstyle \sum _{i=1}^{n}L(a_{i})}$), the final sum tends to be the result of a few particularly large a-values, rather than an expression of the average a-value.
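
The outlier effect can be made concrete by measuring how much of the total loss comes from the single largest term. The residual values below and the helper `share_of_largest` are hypothetical, chosen so that one residual is much larger than the rest.

```python
def share_of_largest(residuals, loss):
    """Fraction of the summed loss contributed by the single largest term."""
    losses = [loss(a) for a in residuals]
    return max(losses) / sum(losses)


def squared_loss(a):
    return a * a


residuals = [0.5, -1.0, 0.8, -0.3, 10.0]     # hypothetical errors, one outlier

squared_share = share_of_largest(residuals, squared_loss)
absolute_share = share_of_largest(residuals, abs)
print(round(squared_share, 3), round(absolute_share, 3))
```

In this example the outlier accounts for a larger fraction of the summed squared loss than of the summed absolute loss.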

The choice of a loss function is not arbitrary. The choice is very restrictive, and sometimes the loss function may be characterized by its desirable properties. [16] Among the choice principles are, for example, the requirement of completeness of the class of symmetric statistics in the case of i.i.d. observations, the principle of complete information, and some others.

W. Edwards Deming and Nassim Nicholas Taleb argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and that real losses often are not mathematically nice and are not differentiable, continuous, symmetric, etc. For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after cannot, a discontinuity and asymmetry that makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of asymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in real-life problems, perhaps more common than classical smooth, continuous, symmetric, differentiable cases. [17]
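
The gate-closure example can be written as a loss function with a jump at the closing time. This is only a sketch: the cost figures and parameter names are hypothetical and serve to show the discontinuity and asymmetry, not to model real costs.

```python
def gate_loss(arrival_minute, closing_minute,
              wait_cost_per_minute=1.0, missed_flight_cost=500.0):
    """Loss from arriving at the gate at a given minute (hypothetical costs).

    Arriving early costs a little waiting time; arriving after the gate
    closes costs the whole flight, however small the delay.
    """
    if arrival_minute <= closing_minute:
        return wait_cost_per_minute * (closing_minute - arrival_minute)
    return missed_flight_cost


one_minute_early = gate_loss(59, 60)
one_minute_late = gate_loss(61, 60)
print(one_minute_early, one_minute_late)
```

A one-minute error costs very different amounts depending on its direction, and the loss jumps at the closing time, so the function is neither symmetric nor continuous.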

## References

1. Raschka, Sebastian (2019). Python machine learning: machine learning and deep learning with python, scikit-learn, and tensorflow 2. Birmingham: Packt Publishing, Limited. pp. 37–38. ISBN 1-78995-829-6. OCLC 1135663723.
2. Wald, A. (1950). Statistical Decision Functions. Wiley.
3. Cramér, H. (1930). On the mathematical theory of risk. Centraltryckeriet.
4. Frisch, Ragnar (1969). "From utopian theory to practical applications: the case of econometrics". The Nobel Prize–Prize Lecture. Retrieved 15 February 2021.
5. Tangian, Andranik; Gruber, Josef (1997). Constructing Scalar-Valued Objective Functions. Proceedings of the Third International Conference on Econometric Decision Models: Constructing Scalar-Valued Objective Functions, University of Hagen, held in Katholische Akademie Schwerte September 5–8, 1995. Lecture Notes in Economics and Mathematical Systems. 453. Berlin: Springer. doi:10.1007/978-3-642-48773-6. ISBN 978-3-540-63061-6.
6. Tangian, Andranik; Gruber, Josef (2002). Constructing and Applying Objective Functions. Proceedings of the Fourth International Conference on Econometric Decision Models Constructing and Applying Objective Functions, University of Hagen, held in Haus Nordhelle, August 28–31, 2000. Lecture Notes in Economics and Mathematical Systems. 510. Berlin: Springer. doi:10.1007/978-3-642-56038-5. ISBN 978-3-540-42669-1.
7. Tangian, Andranik (2002). "Constructing a quasi-concave quadratic objective function from interviewing a decision maker". European Journal of Operational Research. 141 (3): 608–640. doi:10.1016/S0377-2217(01)00185-0. S2CID 39623350.
8. Tangian, Andranik (2004). "A model for ordinally constructing additive objective functions". European Journal of Operational Research. 159 (2): 476–512. doi:10.1016/S0377-2217(03)00413-2. S2CID 31019036.
9. Tangian, Andranik (2004). "Redistribution of university budgets with respect to the status quo". European Journal of Operational Research. 157 (2): 409–428. doi:10.1016/S0377-2217(03)00271-6.
10. Tangian, Andranik (2008). "Multi-criteria optimization of regional employment policy: A simulation analysis for Germany". Review of Urban and Regional Development. 20 (2): 103–122. doi:10.1111/j.1467-940X.2008.00144.x.
11. Nikulin, M.S. (2001) [1994], "Risk of a statistical procedure", Encyclopedia of Mathematics, EMS Press
12. Berger, James O. (1985). Statistical decision theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. Bibcode:1985sdtb.book.....B. ISBN 978-0-387-96098-2. MR 0804611.
13. DeGroot, Morris (2004) [1970]. Optimal Statistical Decisions. Wiley Classics Library. ISBN 978-0-471-68029-1. MR 2288194.
14. Robert, Christian P. (2007). The Bayesian Choice. Springer Texts in Statistics (2nd ed.). New York: Springer. doi:10.1007/0-387-71599-1. ISBN 978-0-387-95231-4. MR 1835885.
15. Pfanzagl, J. (1994). Parametric Statistical Theory. Berlin: Walter de Gruyter. ISBN 978-3-11-013863-4.
16. Detailed information on mathematical principles of the loss function choice is given in Chapter 2 of the book Klebanov, B.; Rachev, Svetlozar T.; Fabozzi, Frank J. (2009). Robust and Non-Robust Models in Statistics. New York: Nova Scientific Publishers, Inc. (and references there).
17. Deming, W. Edwards (2000). Out of the Crisis. The MIT Press. ISBN 9780262541152.
• Waud, Roger N. (1976). "Asymmetric Policymaker Utility Functions and Optimal Policy under Uncertainty". Econometrica. 44 (1): 53–66. doi:10.2307/1911380. JSTOR 1911380.