Hinge loss

[Figure: the hinge loss (blue) and the zero-one loss (green) for fixed t = 1, plotted against the prediction y. The plot shows that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.]

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]


For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

$$\ell(y) = \max(0,\ 1 - t \cdot y).$$

Note that $y$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s).

When $t$ and $y$ have the same sign (meaning $y$ predicts the right class) and $|y| \ge 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with $y$, and similarly if $|y| < 1$, even if it has the same sign (correct prediction, but not by enough margin).
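To make the definition concrete, here is a minimal NumPy sketch; the function name and the example values are illustrative, not from the article:

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss for a raw classifier score y and an intended output t = +/-1."""
    return np.maximum(0.0, 1.0 - t * y)

print(hinge_loss(2.0, 1))    # 0.0: correct prediction, outside the margin
print(hinge_loss(0.5, 1))    # 0.5: correct sign, but not by enough margin
print(hinge_loss(-1.0, 1))   # 2.0: wrong sign, the loss grows linearly
```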

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

$$\ell(y) = \max(0,\ 1 + \max_{y \ne t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x})$$

where $t$ is the target label, and $\mathbf{w}_t$ and $\mathbf{w}_y$ are the model parameters.

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

$$\ell(y) = \sum_{y \ne t} \max(0,\ 1 + \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x})$$

A sketch of both multiclass variants for a linear classifier follows; the weight-matrix layout and function names are assumptions made for the example:

```python
import numpy as np

def crammer_singer_loss(W, x, t):
    """Crammer-Singer hinge: max over the incorrect classes y != t."""
    scores = W @ x                      # one score w_y . x per class
    margins = 1.0 + scores - scores[t]
    margins[t] = 0.0                    # exclude the true class from the max
    return max(0.0, margins.max())

def weston_watkins_loss(W, x, t):
    """Weston-Watkins hinge: sum over incorrect classes instead of a max."""
    scores = W @ x
    margins = np.maximum(0.0, 1.0 + scores - scores[t])
    margins[t] = 0.0
    return margins.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 classes, 2 features
x = np.array([1.0, 2.0])
print(crammer_singer_loss(W, x, t=1), weston_watkins_loss(W, x, t=1))
```

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where $\mathbf{w}$ denotes the SVM's parameters, $\mathbf{y}$ the SVM's predictions, $\phi$ the joint feature function, and $\Delta$ the Hamming loss:

$$\ell(\mathbf{y}) = \max(0,\ \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle)$$
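A sketch under the assumption that the output space is small enough to enumerate; `phi`, `delta`, and `candidates` are hypothetical placeholders, and a real structured SVM would replace the loop with loss-augmented inference:

```python
import numpy as np

def structured_hinge_loss(w, phi, delta, x, t, candidates):
    """Margin-rescaled structured hinge over an enumerable output space.

    w: parameter vector; phi(x, y): joint feature map; delta(y, t): task
    loss (e.g. Hamming); candidates: all outputs y, including t itself.
    """
    score_t = w @ phi(x, t)
    return max(max(0.0, delta(y, t) + w @ phi(x, y) - score_t)
               for y in candidates)
```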

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to the model parameters $\mathbf{w}$ of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$ that is given by

$$\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise.} \end{cases}$$
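A direct transcription of this subgradient, together with a single stochastic subgradient step; the step size and function names are illustrative:

```python
import numpy as np

def hinge_subgradient(w, x, t):
    """A subgradient of max(0, 1 - t * (w . x)) with respect to w."""
    if t * (w @ x) < 1.0:
        return -t * x
    return np.zeros_like(w)

def sgd_step(w, x, t, lr=0.1):
    """One stochastic subgradient descent update on a single example."""
    return w - lr * hinge_subgradient(w, x, t)
```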

[Figure: plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piecewise-smooth version by Rennie and Srebro (red). The vertical axis is the loss value and the horizontal axis is z = ty.]

However, since the derivative of the hinge loss at $ty = 1$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

$$\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \le 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \le ty, \end{cases}$$

or the quadratically smoothed

$$\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0,\ 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise,} \end{cases}$$

suggested by Zhang.[8] The modified Huber loss $L$ is a special case of this loss function with $\gamma = 2$, specifically $L(t, y) = 4 \ell_2(y)$.
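Both smoothed variants, transcribed directly from the piecewise definitions above; the function names are illustrative:

```python
def smooth_hinge(y, t):
    """Rennie and Srebro's piecewise-smooth hinge, in terms of z = t * y."""
    z = t * y
    if z <= 0.0:
        return 0.5 - z
    if z < 1.0:
        return 0.5 * (1.0 - z) ** 2
    return 0.0

def quad_smoothed_hinge(y, t, gamma=2.0):
    """Zhang's quadratically smoothed hinge; at gamma = 2 the modified
    Huber loss equals 4 times this function."""
    z = t * y
    if z >= 1.0 - gamma:
        return max(0.0, 1.0 - z) ** 2 / (2.0 * gamma)
    return 1.0 - gamma / 2.0 - z
```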

References

  1. Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786. doi:10.1162/089976604773135104. PMID 15070510.
  2. Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study" (PDF). Multiple Classifier Systems. LNCS. Vol. 3541. pp. 278–285. CiteSeerX 10.1.1.110.6789. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
  3. Doğan, Ürün; Glasmachers, Tobias; Igel, Christian (2016). "A Unified View on Multi-class Support Vector Classification" (PDF). Journal of Machine Learning Research. 17: 1–32.
  4. Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines" (PDF). Journal of Machine Learning Research. 2: 265–292.
  5. Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models" (PDF). Proc. Symp. on Machine Learning in Speech and Language Processing.
  6. Weston, Jason; Watkins, Chris (1999). "Support Vector Machines for Multi-Class Pattern Recognition" (PDF). European Symposium on Artificial Neural Networks.
  7. Rennie, Jason D. M.; Srebro, Nathan (2005). "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels" (PDF). Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling.
  8. Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms" (PDF). ICML.