Loss functions for classification

Last updated January 27, 2024

Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue) BayesConsistentLosses2.jpg — Bayes consistent loss functions: Zero-one loss (gray), Savage loss (green), Logistic loss (orange), Exponential loss (purple), Tangent loss (brown), Square loss (blue)

In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).^[1] Given ${\mathcal {X}}$ as the space of all possible inputs (usually ${\mathcal {X}}\subset \mathbb {R} ^{d}$ ), and ${\mathcal {Y}}=\{-1,1\}$ as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function $f:{\mathcal {X}}\to {\mathcal {Y}}$ which best predicts a label $y$ for a given input ${\vec {x}}$ .^[2] However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same ${\vec {x}}$ to generate different $y$ .^[3] As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as

Within classification, several commonly used loss functions are written solely in terms of the product of the true label $y$ and the predicted label $f({\vec {x}})$ . Therefore, they can be defined as functions of only one variable $\upsilon =yf({\vec {x}})$ , so that $V(f({\vec {x}}),y)=\phi (yf({\vec {x}}))=\phi (\upsilon )$ with a suitably chosen function ${\displaystyle \phi$ . These are called margin-based loss functions. Choosing a margin-based loss function amounts to choosing $\phi$ . Selection of a loss function within this framework impacts the optimal $f_{\phi }^{*}$ which minimizes the expected risk, see empirical risk minimization.

In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,

{\begin{aligned}I[f]&=\int _{{\mathcal {X}}\times {\mathcal {Y}}}V(f({\vec {x}}),y)\,p({\vec {x}},y)\,d{\vec {x}}\,dy\\[6pt]&=\int _{\mathcal {X}}\int _{\mathcal {Y}}\phi (yf({\vec {x}}))\,p(y\mid {\vec {x}})\,p({\vec {x}})\,dy\,d{\vec {x}}\\[6pt]&=\int _{\mathcal {X}}[\phi (f({\vec {x}}))\,p(1\mid {\vec {x}})+\phi (-f({\vec {x}}))\,p(-1\mid {\vec {x}})]\,p({\vec {x}})\,d{\vec {x}}\\[6pt]&=\int _{\mathcal {X}}[\phi (f({\vec {x}}))\,p(1\mid {\vec {x}})+\phi (-f({\vec {x}}))\,(1-p(1\mid {\vec {x}}))]\,p({\vec {x}})\,d{\vec {x}}\end{aligned}}

The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for $y$ , and the fourth because $p(-1\mid x)=1-p(1\mid x)$ . The term within brackets $[\phi (f({\vec {x}}))p(1\mid {\vec {x}})+\phi (-f({\vec {x}}))(1-p(1\mid {\vec {x}}))]$ is known as the conditional risk.

One can solve for the minimizer of $I[f]$ by taking the functional derivative of the last equality with respect to $f$ and setting the derivative equal to 0. This will result in the following equation

{\frac {\partial \phi (f)}{\partial f}}\eta +{\frac {\partial \phi (-f)}{\partial f}}(1-\eta )=0\;\;\;\;\;(1)

^{[ citation needed ]}^{[ clarification needed ]}

which is also equivalent to setting the derivative of the conditional risk equal to zero.

Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for false positives and false negatives) would be the 0-1 loss function (0–1 indicator function), which takes the value of 0 if the predicted classification equals that of the true class or a 1 if the predicted classification does not match the true class. This selection is modeled by

V(f({\vec {x}}),y)=H(-yf({\vec {x}}))

where $H$ indicates the Heaviside step function. However, this loss function is non-convex and non-smooth, and solving for the optimal solution is an NP-hard combinatorial optimization problem.^[4] As a result, it is better to substitute loss function surrogates which are tractable for commonly used learning algorithms, as they have convenient properties such as being convex and smooth. In addition to their computational tractability, one can show that the solutions to the learning problem using these loss surrogates allow for the recovery of the actual solution to the original classification problem.^[5] Some of these surrogates are described below.

In practice, the probability distribution $p({\vec {x}},y)$ is unknown. Consequently, utilizing a training set of $n$ independently and identically distributed sample points

S=\{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}

drawn from the data sample space, one seeks to minimize empirical risk

I_{S}[f]={\frac {1}{n}}\sum _{i=1}^{n}V(f({\vec {x}}_{i}),y_{i})

as a proxy for expected risk.^[3] (See statistical learning theory for a more detailed description.)

Bayes consistency

Utilizing Bayes' theorem, it can be shown that the optimal $f_{0/1}^{*}$ , i.e., the one that minimizes the expected risk associated with the zero-one loss, implements the Bayes optimal decision rule for a binary classification problem and is in the form of

f_{0/1}^{*}({\vec {x}})\;=\;{\begin{cases}\;\;\;1&{\text{if }}p(1\mid {\vec {x}})>p(-1\mid {\vec {x}})\\\;\;\;0&{\text{if }}p(1\mid {\vec {x}})=p(-1\mid {\vec {x}})\\-1&{\text{if }}p(1\mid {\vec {x}})<p(-1\mid {\vec {x}})\end{cases}}

.

A loss function is said to be classification-calibrated or Bayes consistent if its optimal $f_{\phi }^{*}$ is such that $f_{0/1}^{*}({\vec {x}})=\operatorname {sgn} (f_{\phi }^{*}({\vec {x}}))$ and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function $f_{\phi }^{*}$ by directly minimizing the expected risk and without having to explicitly model the probability density functions.

For convex margin loss $\phi (\upsilon )$ , it can be shown that $\phi (\upsilon )$ is Bayes consistent if and only if it is differentiable at 0 and $\phi '(0)<0$ .^[6]^[1] Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation ^[7]

\phi (v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]\;\;\;\;\;(2)

,

where $f(\eta ),(0\leq \eta \leq 1)$ is any invertible function such that $f^{-1}(-v)=1-f^{-1}(v)$ and $C(\eta )$ is any differentiable strictly concave function such that $C(\eta )=C(1-\eta )$ . Table-I shows the generated Bayes consistent loss functions for some example choices of $C(\eta )$ and $f^{-1}(v)$ . Note that the Savage and Tangent loss are not convex. Such non-convex loss functions have been shown to be useful in dealing with outliers in classification.^[7]^[8] For all loss functions generated from (2), the posterior probability $p(y=1|{\vec {x}})$ can be found using the invertible link function as $p(y=1|{\vec {x}})=\eta =f^{-1}(v)$ . Such loss functions where the posterior probability can be recovered using the invertible link are called proper loss functions.

Table-I
Loss name	$\phi (v)$	$C(\eta )$	$f^{-1}(v)$	$f(\eta )$
Exponential	$e^{-v}$	$2{\sqrt {\eta (1-\eta )}}$	${\frac {e^{2v}}{1+e^{2v}}}$	${\frac {1}{2}}\log({\frac {\eta }{1-\eta }})$
Logistic	${\frac {1}{\log(2)}}\log(1+e^{-v})$	${\frac {1}{\log(2)}}[-\eta \log(\eta )-(1-\eta )\log(1-\eta )]$	${\frac {e^{v}}{1+e^{v}}}$	$\log({\frac {\eta }{1-\eta }})$
Square	$(1-v)^{2}$	$4\eta (1-\eta )$	${\frac {1}{2}}(v+1)$	$2\eta -1$
Savage	${\frac {1}{(1+e^{v})^{2}}}$	$\eta (1-\eta )$	${\frac {e^{v}}{1+e^{v}}}$	$\log({\frac {\eta }{1-\eta }})$
Tangent	$(2\arctan(v)-1)^{2}$	$4\eta (1-\eta )$	$\arctan(v)+{\frac {1}{2}}$	$\tan(\eta -{\frac {1}{2}})$

The sole minimizer of the expected risk, $f_{\phi }^{*}$ , associated with the above generated loss functions can be directly found from equation (1) and shown to be equal to the corresponding $f(\eta )$ . This holds even for the nonconvex loss functions, which means that gradient descent based algorithms such as gradient boosting can be used to construct the minimizer.

Proper loss functions, loss margin and regularization

(Red) standard Logistic loss (
g
=
1
,
m
=
2
{\displaystyle \gamma =1,\mu =2}
) and (Blue) increased margin Logistic loss (
g
=
0.2
{\displaystyle \gamma =0.2}
). LogitLossMarginWithMu.jpg — (Red) standard Logistic loss ( $\gamma =1,\mu =2$ ) and (Blue) increased margin Logistic loss ( $\gamma =0.2$ ).

For proper loss functions, the loss margin can be defined as $\mu _{\phi }=-{\frac {\phi '(0)}{\phi ''(0)}}$ and shown to be directly related to the regularization properties of the classifier.^[9] Specifically a loss function of larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a $\gamma$ parameter and writing the logistic loss as ${\frac {1}{\gamma }}\log(1+e^{-\gamma v})$ where smaller $0<\gamma <1$ increases the margin of the loss. It is shown that this is directly equivalent to decreasing the learning rate in gradient boosting $F_{m}(x)=F_{m-1}(x)+\gamma h_{m}(x),$ where decreasing $\gamma$ improves the regularization of the boosted classifier. The theory makes it clear that when a learning rate of $\gamma$ is used, the correct formula for retrieving the posterior probability is now $\eta =f^{-1}(\gamma F(x))$ .

In conclusion, by choosing a loss function with larger margin (smaller $\gamma$ ) we increase regularization and improve our estimates of the posterior probability which in turn improves the ROC curve of the final classifier.

Square loss

While more commonly used in regression, the square loss function can be re-written as a function $\phi (yf({\vec {x}}))$ and utilized for classification. It can be generated using (2) and Table-I as follows

\phi (v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]=4({\frac {1}{2}}(v+1))(1-{\frac {1}{2}}(v+1))+(1-{\frac {1}{2}}(v+1))(4-8({\frac {1}{2}}(v+1)))=(1-v)^{2}.

The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regards to sample complexity) than for the logistic loss or hinge loss functions.^[1] In addition, functions which yield high values of $f({\vec {x}})$ for some $x\in X$ will perform poorly with the square loss function, since high values of $yf({\vec {x}})$ will be penalized severely, regardless of whether the signs of $y$ and $f({\vec {x}})$ match.

A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for Tikhonov regularization, one can solve for the regularization parameter using leave-one-out cross-validation in the same time as it would take to solve a single problem.^[10]

The minimizer of $I[f]$ for the square loss function can be directly found from equation (1) as

f_{\text{Square}}^{*}=2\eta -1=2p(1\mid x)-1.

Logistic loss

The logistic loss function can be generated using (2) and Table-I as follows

{\begin{aligned}\phi (v)&=C[f^{-1}(v)]+\left(1-f^{-1}(v)\right)\,C'\left[f^{-1}(v)\right]\\&={\frac {1}{\log(2)}}\left[{\frac {-e^{v}}{1+e^{v}}}\log {\frac {e^{v}}{1+e^{v}}}-\left(1-{\frac {e^{v}}{1+e^{v}}}\right)\log \left(1-{\frac {e^{v}}{1+e^{v}}}\right)\right]+\left(1-{\frac {e^{v}}{1+e^{v}}}\right)\left[{\frac {-1}{\log(2)}}\log \left({\frac {\frac {e^{v}}{1+e^{v}}}{1-{\frac {e^{v}}{1+e^{v}}}}}\right)\right]\\&={\frac {1}{\log(2)}}\log(1+e^{-v}).\end{aligned}}

The logistic loss is convex and grows linearly for negative values which make it less sensitive to outliers. The logistic loss is used in the LogitBoost algorithm.

The minimizer of $I[f]$ for the logistic loss function can be directly found from equation (1) as

f_{\text{Logistic}}^{*}=\log \left({\frac {\eta }{1-\eta }}\right)=\log \left({\frac {p(1\mid x)}{1-p(1\mid x)}}\right).

This function is undefined when $p(1\mid x)=1$ or $p(1\mid x)=0$ (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when $p(1\mid x)$ increases and equals 0 when $p(1\mid x)=0.5$ .^[3]

It's easy to check that the logistic loss and binary cross-entropy loss (Log loss) are in fact the same (up to a multiplicative constant ${\frac {1}{\log(2)}}$ ). The cross-entropy loss is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. The cross-entropy loss is ubiquitous in modern deep neural networks.

Exponential loss

The exponential loss function can be generated using (2) and Table-I as follows

\phi (v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]=2{\sqrt {\left({\frac {e^{2v}}{1+e^{2v}}}\right)\left(1-{\frac {e^{2v}}{1+e^{2v}}}\right)}}+\left(1-{\frac {e^{2v}}{1+e^{2v}}}\right)\left({\frac {1-{\frac {2e^{2v}}{1+e^{2v}}}}{\sqrt {{\frac {e^{2v}}{1+e^{2v}}}(1-{\frac {e^{2v}}{1+e^{2v}}})}}}\right)=e^{-v}

The exponential loss is convex and grows exponentially for negative values which makes it more sensitive to outliers. The exponential loss is used in the AdaBoost algorithm.

The minimizer of $I[f]$ for the exponential loss function can be directly found from equation (1) as

f_{\text{Exp}}^{*}={\frac {1}{2}}\log \left({\frac {\eta }{1-\eta }}\right)={\frac {1}{2}}\log \left({\frac {p(1\mid x)}{1-p(1\mid x)}}\right).

Savage loss

The Savage loss^[7] can be generated using (2) and Table-I as follows

\phi (v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]=\left({\frac {e^{v}}{1+e^{v}}}\right)\left(1-{\frac {e^{v}}{1+e^{v}}}\right)+\left(1-{\frac {e^{v}}{1+e^{v}}}\right)\left(1-{\frac {2e^{v}}{1+e^{v}}}\right)={\frac {1}{(1+e^{v})^{2}}}.

The Savage loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. The Savage loss has been used in gradient boosting and the SavageBoost algorithm.

The minimizer of $I[f]$ for the Savage loss function can be directly found from equation (1) as

f_{\text{Savage}}^{*}=\log \left({\frac {\eta }{1-\eta }}\right)=\log \left({\frac {p(1\mid x)}{1-p(1\mid x)}}\right).

Tangent loss

The Tangent loss^[11] can be generated using (2) and Table-I as follows

{\begin{aligned}\phi (v)&=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]=4(\arctan(v)+{\frac {1}{2}})(1-(\arctan(v)+{\frac {1}{2}}))+(1-(\arctan(v)+{\frac {1}{2}}))(4-8(\arctan(v)+{\frac {1}{2}}))\\&=(2\arctan(v)-1)^{2}.\end{aligned}}

The Tangent loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set. The Tangent loss has been used in gradient boosting, the TangentBoost algorithm and Alternating Decision Forests.^[12]

The minimizer of $I[f]$ for the Tangent loss function can be directly found from equation (1) as

f_{\text{Tangent}}^{*}=\tan(\eta -{\frac {1}{2}})=\tan(p(1\mid x)-{\frac {1}{2}}).

Hinge loss

The hinge loss function is defined with $\phi (\upsilon )=\max(0,1-\upsilon )=[1-\upsilon ]_{+}$ , where $[a]_{+}=\max(0,a)$ is the positive part function.

V(f({\vec {x}}),y)=\max(0,1-yf({\vec {x}}))=[1-yf({\vec {x}})]_{+}.

The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function when $\operatorname {sgn} (f({\vec {x}}))=y$ and $|yf({\vec {x}})|\geq 1$ . In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for support vector machines (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized in a linear fashion compared to their distance from the correct boundary.^[4]

While the hinge loss function is both convex and continuous, it is not smooth (is not differentiable) at $yf({\vec {x}})=1$ . Consequently, the hinge loss function cannot be used with gradient descent methods or stochastic gradient descent methods which rely on differentiability over the entire domain. However, the hinge loss does have a subgradient at $yf({\vec {x}})=1$ , which allows for the utilization of subgradient descent methods.^[4] SVMs utilizing the hinge loss function can also be solved using quadratic programming.

The minimizer of $I[f]$ for the hinge loss function is

f_{\text{Hinge}}^{*}({\vec {x}})\;=\;{\begin{cases}1&{\text{if }}p(1\mid {\vec {x}})>p(-1\mid {\vec {x}})\\-1&{\text{if }}p(1\mid {\vec {x}})<p(-1\mid {\vec {x}})\end{cases}}

when $p(1\mid x)\neq 0.5$ , which matches that of the 0–1 indicator function. This conclusion makes the hinge loss quite attractive, as bounds can be placed on the difference between expected risk and the sign of hinge loss function.^[1] The Hinge loss cannot be derived from (2) since $f_{\text{Hinge}}^{*}$ is not invertible.

Generalized smooth hinge loss

The generalized smooth hinge loss function with parameter $\alpha$ is defined as

f_{\alpha }^{*}(z)\;=\;{\begin{cases}{\frac {\alpha }{\alpha +1}}-z&{\text{if }}z\leq 0\\{\frac {1}{\alpha +1}}z^{\alpha +1}-z+{\frac {\alpha }{\alpha +1}}&{\text{if }}0<z<1\\0&{\text{if }}z\geq 1\end{cases}},

where

z=yf({\vec {x}}).

It is monotonically increasing and reaches 0 when $z=1$ .

Related Research Articles

In the calculus of variations and classical mechanics, the Euler–Lagrange equations are a system of second-order ordinary differential equations whose solutions are stationary points of the given action functional. The equations were discovered in the 1750s by Swiss mathematician Leonhard Euler and Italian mathematician Joseph-Louis Lagrange.

Vapnik–Chervonenkis theory was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

In quantum field theory, partition functions are generating functionals for correlation functions, making them key objects of study in the path integral formalism. They are the imaginary time versions of statistical mechanics partition functions, giving rise to a close connection between these two areas of physics. Partition functions can rarely be solved for exactly, although free theories do admit such solutions. Instead, a perturbative approach is usually implemented, this being equivalent to summing over Feynman diagrams.

A first class constraint is a dynamical quantity in a constrained Hamiltonian system whose Poisson bracket with all the other constraints vanishes on the constraint surface in phase space. To calculate the first class constraint, one assumes that there are no second class constraints, or that they have been calculated previously, and their Dirac brackets generated.

In mathematics, the Lerch zeta function, sometimes called the Hurwitz–Lerch zeta function, is a special function that generalizes the Hurwitz zeta function and the polylogarithm. It is named after Czech mathematician Mathias Lerch, who published a paper about the function in 1887.

In theoretical physics, Nordström's theory of gravitation was a predecessor of general relativity. Strictly speaking, there were actually two distinct theories proposed by the Finnish theoretical physicist Gunnar Nordström, in 1912 and 1913 respectively. The first was quickly dismissed, but the second became the first known example of a metric theory of gravitation, in which the effects of gravitation are treated entirely in terms of the geometry of a curved spacetime.

In theoretical physics, a source field is a background field $coupled to the original field as$

In theoretical physics, background field method is a useful procedure to calculate the effective action of a quantum field theory by expanding a quantum field around a classical "background" value B:

In physics and fluid mechanics, a Blasius boundary layer describes the steady two-dimensional laminar boundary layer that forms on a semi-infinite plate which is held parallel to a constant unidirectional flow. Falkner and Skan later generalized Blasius' solution to wedge flow, i.e. flows in which the plate is not parallel to the flow.

In theoretical physics, scalar field theory can refer to a relativistically invariant classical or quantum theory of scalar fields. A scalar field is invariant under any Lorentz transformation.

Intrabeam scattering (IBS) is an effect in accelerator physics where collisions between particles couple the beam emittance in all three dimensions. This generally causes the beam size to grow. In proton accelerators, intrabeam scattering causes the beam to grow slowly over a period of several hours. This limits the luminosity lifetime. In circular lepton accelerators, intrabeam scattering is counteracted by radiation damping, resulting in a new equilibrium beam emittance with a relaxation time on the order of milliseconds. Intrabeam scattering creates an inverse relationship between the smallness of the beam and the number of particles it contains, therefore limiting luminosity.

<span class="mw-page-title-main">Spinodal decomposition</span> Mechanism of spontaneous phase separation

Spinodal decomposition is a mechanism by which a single thermodynamic phase spontaneously separates into two phases. Decomposition occurs when there is no thermodynamic barrier to phase separation. As a result, phase separation via decomposition does not require the nucleation events resulting from thermodynamic fluctuations, which normally trigger phase separation.

In fluid dynamics, Luke's variational principle is a Lagrangian variational description of the motion of surface waves on a fluid with a free surface, under the action of gravity. This principle is named after J.C. Luke, who published it in 1967. This variational principle is for incompressible and inviscid potential flows, and is used to derive approximate wave models like the mild-slope equation, or using the averaged Lagrangian approach for wave propagation in inhomogeneous media.

$<span class="mw-page-title-main">Rogers–Ramanujan continued fraction</span> Continued fraction closely related to the Rogers–Ramanujan identities$

The Rogers–Ramanujan continued fraction is a continued fraction discovered by Rogers (1894) and independently by Srinivasa Ramanujan, and closely related to the Rogers–Ramanujan identities. It can be evaluated explicitly for a broad class of values of its argument.

In fluid dynamics, the mild-slope equation describes the combined effects of diffraction and refraction for water waves propagating over bathymetry and due to lateral boundaries—like breakwaters and coastlines. It is an approximate model, deriving its name from being originally developed for wave propagation over mild slopes of the sea floor. The mild-slope equation is often used in coastal engineering to compute the wave-field changes near harbours and coasts.

Uncertainty theory is a branch of mathematics based on normality, monotonicity, self-duality, countable subadditivity, and product measure axioms.

Static force fields are fields, such as a simple electric, magnetic or gravitational fields, that exist without excitations. The most common approximation method that physicists use for scattering calculations can be interpreted as static forces arising from the interactions between two bodies mediated by virtual particles, particles that exist for only a short time determined by the uncertainty principle. The virtual particles, also known as force carriers, are bosons, with different bosons associated with each force.

Within mathematical analysis, Regularization perspectives on support-vector machines provide a way of interpreting support-vector machines (SVMs) in the context of other regularization-based machine-learning algorithms. SVM algorithms categorize binary data, with the goal of fitting the training set data in a way that minimizes the average of the hinge-loss function and L2 norm of the learned weights. This strategy avoids overfitting via Tikhonov regularization and in the L2 norm sense and also corresponds to minimizing the bias and variance of our estimator of the weights. Estimators with lower Mean squared error predict better or generalize better when given unseen data.

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

<span class="mw-page-title-main">Path integrals in polymer science</span>

A polymer is a macromolecule, composed of many similar or identical repeated subunits. Polymers are common in, but not limited to, organic media. They range from familiar synthetic plastics to natural biopolymers such as DNA and proteins. Their unique elongated molecular structure produces unique physical properties, including toughness, viscoelasticity, and a tendency to form glasses and semicrystalline structures. The modern concept of polymers as covalently bonded macromolecular structures was proposed in 1920 by Hermann Staudinger. One sub-field in the study of polymers is polymer physics. As a part of soft matter studies, Polymer physics concerns itself with the study of mechanical properties and focuses on the perspective of condensed matter physics.

References

1 2 3 4 Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786 . doi:10.1162/089976604773135104. PMID 15070510. S2CID 11845688.
↑ Shen, Yi (2005), Loss Functions For Binary Classification and Class Probability Estimation (PDF), University of Pennsylvania, retrieved 6 December 2014
1 2 3 Rosasco, Lorenzo; Poggio, Tomaso (2014), A Regularization Tour of Machine Learning, MIT-9.520 Lectures Notes, vol. Manuscript
1 2 3 Piyush, Rai (13 September 2011), Support Vector Machines (Contd.), Classification Loss Functions and Regularizers (PDF), Utah CS5350/6350: Machine Learning, retrieved 4 May 2021
↑ Ramanan, Deva (27 February 2008), Lecture 14 (PDF), UCI ICS273A: Machine Learning, retrieved 6 December 2014
↑ Bartlett, Peter L.; Jordan, Michael I.; Mcauliffe, Jon D. (2006). "Convexity, Classification, and Risk Bounds". Journal of the American Statistical Association. 101 (473): 138–156. doi:10.1198/016214505000000907. ISSN 0162-1459. JSTOR 30047445. S2CID 2833811.
1 2 3 Masnadi-Shirazi, Hamed; Vasconcelos, Nuno (2008). "On the Design of Loss Functions for Classification: Theory, Robustness to Outliers, and SavageBoost" (PDF). Proceedings of the 21st International Conference on Neural Information Processing Systems. NIPS'08. USA: Curran Associates Inc.: 1049–1056. ISBN 9781605609492.
↑ Leistner, C.; Saffari, A.; Roth, P. M.; Bischof, H. (September 2009). "On robustness of on-line boosting - a competitive study". 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. pp. 1362–1369. doi:10.1109/ICCVW.2009.5457451. ISBN 978-1-4244-4442-7. S2CID 6032045.
↑ Vasconcelos, Nuno; Masnadi-Shirazi, Hamed (2015). "A View of Margin Losses as Regularizers of Probability Estimates". Journal of Machine Learning Research. 16 (85): 2751–2795. ISSN 1533-7928.
↑ Rifkin, Ryan M.; Lippert, Ross A. (1 May 2007), Notes on Regularized Least Squares (PDF), MIT Computer Science and Artificial Intelligence Laboratory
↑ Masnadi-Shirazi, H.; Mahadevan, V.; Vasconcelos, N. (June 2010). "On the design of robust classifiers for computer vision". 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 779–786. CiteSeerX 10.1.1.172.6416 . doi:10.1109/CVPR.2010.5540136. ISBN 978-1-4244-6984-0. S2CID 632758.
↑ Schulter, S.; Wohlhart, P.; Leistner, C.; Saffari, A.; Roth, P. M.; Bischof, H. (June 2013). "Alternating Decision Forests". 2013 IEEE Conference on Computer Vision and Pattern Recognition. pp. 508–515. CiteSeerX 10.1.1.301.1305 . doi:10.1109/CVPR.2013.72. ISBN 978-0-7695-4989-7. S2CID 6557162.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[mit-1] 1 2 3 4 Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786 . doi:10.1162/089976604773135104. PMID 15070510. S2CID 11845688.

[penn-2] Shen, Yi (2005), Loss Functions For Binary Classification and Class Probability Estimation (PDF), University of Pennsylvania, retrieved 6 December 2014

[mitlec-3] 1 2 3 Rosasco, Lorenzo; Poggio, Tomaso (2014), A Regularization Tour of Machine Learning, MIT-9.520 Lectures Notes, vol. Manuscript

[Utah-4] 1 2 3 Piyush, Rai (13 September 2011), Support Vector Machines (Contd.), Classification Loss Functions and Regularizers (PDF), Utah CS5350/6350: Machine Learning, retrieved 4 May 2021

[uci-5] Ramanan, Deva (27 February 2008), Lecture 14 (PDF), UCI ICS273A: Machine Learning, retrieved 6 December 2014

[6] Bartlett, Peter L.; Jordan, Michael I.; Mcauliffe, Jon D. (2006). "Convexity, Classification, and Risk Bounds". Journal of the American Statistical Association. 101 (473): 138–156. doi:10.1198/016214505000000907. ISSN 0162-1459. JSTOR 30047445. S2CID 2833811.

[:0-7] 1 2 3 Masnadi-Shirazi, Hamed; Vasconcelos, Nuno (2008). "On the Design of Loss Functions for Classification: Theory, Robustness to Outliers, and SavageBoost" (PDF). Proceedings of the 21st International Conference on Neural Information Processing Systems. NIPS'08. USA: Curran Associates Inc.: 1049–1056. ISBN 9781605609492.

[8] Leistner, C.; Saffari, A.; Roth, P. M.; Bischof, H. (September 2009). "On robustness of on-line boosting - a competitive study". 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. pp. 1362–1369. doi:10.1109/ICCVW.2009.5457451. ISBN 978-1-4244-4442-7. S2CID 6032045.

[9] Vasconcelos, Nuno; Masnadi-Shirazi, Hamed (2015). "A View of Margin Losses as Regularizers of Probability Estimates". Journal of Machine Learning Research. 16 (85): 2751–2795. ISSN 1533-7928.

[10] Rifkin, Ryan M.; Lippert, Ross A. (1 May 2007), Notes on Regularized Least Squares (PDF), MIT Computer Science and Artificial Intelligence Laboratory

[11] Masnadi-Shirazi, H.; Mahadevan, V.; Vasconcelos, N. (June 2010). "On the design of robust classifiers for computer vision". 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 779–786. CiteSeerX 10.1.1.172.6416 . doi:10.1109/CVPR.2010.5540136. ISBN 978-1-4244-6984-0. S2CID 632758.

[12] Schulter, S.; Wohlhart, P.; Leistner, C.; Saffari, A.; Roth, P. M.; Bischof, H. (June 2013). "Alternating Decision Forests". 2013 IEEE Conference on Computer Vision and Pattern Recognition. pp. 508–515. CiteSeerX 10.1.1.301.1305 . doi:10.1109/CVPR.2013.72. ISBN 978-0-7695-4989-7. S2CID 6557162.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]