Triplet loss

Last updated

Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples. It was conceived by Google researchers for their prominent FaceNet algorithm for face detection. [1]

Contents

The triplet loss function minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Triplet Loss Minimization.png
The triplet loss function minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Triplet loss is designed to support metric learning. Namely, to assist training models to learn an embedding (mapping to a feature space) where similar data points are closer together and dissimilar ones are farther apart, enabling robust discrimination across varied conditions. In the context of face detection, data points correspond to images.

Definition

The loss function is defined using triplets of training points of the form . In each triplet, (called an "anchor point") denotes a reference point of a particular identity, (called a "positive point") denotes another point of the same identity in point , and (called a "negative point") denotes an point of an identity different from the identity in point and .

Let be some point and let be the embedding of in the finite-dimensional Euclidean space. It shall be assumed that the L2-norm of is unity (the L2 norm of a vector in a finite dimensional Euclidean space is denoted by .) We assemble triplets of points from the training dataset. The goal of training here is to ensure that, after learning, the following condition (called the "triplet constraint") is satisfied by all triplets in the training data set:

The variable is a hyperparameter called the margin, and its value must be set manually. In the FaceNet system, its value was set as 0.2.

Thus, the full form of the function to be minimized is the following:

Intuition

A baseline for understanding the effectiveness of triplet loss is the contrastive loss [2] , which operates on pairs of samples (rather than triplets). Training with the contrastive loss pulls embeddings of similar pairs closer together, and pushes dissimilar pairs apart. Its pairwise approach is greedy, as it considers each pair in isolation.

Triplet loss innovates by considering relative distances. Its goal is that the embedding of an anchor (query) point be closer to positive points than to negative points (also accounting for the margin). It does not try to further optimize the distances once this requirement is met. This is approximated by simultaneously considering two pairs (anchor-positive and anchor-negative), rather than each pair in isolation.

Triplet "mining"

Triplet mining - various options for selecting the negative
N
(
i
)
{\displaystyle N^{(i)}}
given an anchor
A
(
i
)
{\displaystyle A^{(i)}}
and a positive
P
(
i
)
{\displaystyle P^{(i)}}
. The very-hard and semi-hard negatives violate the triplet requirement, and so we would like the optimizer to push their embeddings farther away from the anchor. Triplet mining.png
Triplet mining - various options for selecting the negative given an anchor and a positive . The very-hard and semi-hard negatives violate the triplet requirement, and so we would like the optimizer to push their embeddings farther away from the anchor.

One crucial implementation detail when training with triplet loss is triplet "mining", which focuses on the smart selection of triplets for optimization. This process adds an additional layer of complexity compared to contrastive loss.

A naive approach to preparing training data for the triplet loss involves randomly selecting triplets from the dataset. In general, the set of valid triplets of the form is very large. To speed-up training convergence, it is essential to focus on challenging triplets. In the FaceNet paper, several options were explored, eventually arriving at the following. For each anchor-positive pair, the algorithm considers only semi-hard negatives. These are negatives that violate the triplet requirement (i.e, are "hard"), but lie farther from the anchor than the positive (not too hard). Restated, for each and , they seek such that:

The rationale for this design choice is heuristic. It may appear puzzling that the mining process neglects "very hard" negatives (i.e., closer to the anchor than the positive). Experiments conducted by the FaceNet designers found that this often leads to a convergence to degenerate local minima.

Triplet mining is performed at each training step, from within the sample points contained in the training batch (this is known as online mining), after embeddings were computed for all points in the batch. While ideally the entire dataset could be used, this is impractical in general. To support a large search space for triplets, the FaceNet authors used very large batches (1800 samples). Batches are constructed by selecting a large number of same-category sample points (40), and randomly selected negatives for them.

Extensions

Triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., ladder) of distance inequalities. This leads to the Ladder Loss, which has been demonstrated to offer performance enhancements of visual-semantic embedding in learning to rank tasks. [3]

In Natural Language Processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture. [4]

Other extensions involve specifying multiple negatives (multiple negatives ranking loss).

See also

Related Research Articles

<span class="mw-page-title-main">Elliptic curve</span> Algebraic curve

In mathematics, an elliptic curve is a smooth, projective, algebraic curve of genus one, on which there is a specified point O. An elliptic curve is defined over a field K and describes points in K2, the Cartesian product of K with itself. If the field's characteristic is different from 2 and 3, then the curve can be described as a plane algebraic curve which consists of solutions (x, y) for:

The Cauchy–Schwarz inequality is an upper bound on the inner product between two vectors in an inner product space in terms of the product of the vector norms. It is considered one of the most important and widely used inequalities in mathematics.

<span class="mw-page-title-main">Pareto distribution</span> Probability distribution

The Pareto distribution, named after the Italian civil engineer, economist, and sociologist Vilfredo Pareto, is a power-law probability distribution that is used in description of social, quality control, scientific, geophysical, actuarial, and many other types of observable phenomena; the principle originally applied to describing the distribution of wealth in a society, fitting the trend that a large portion of wealth is held by a small fraction of the population. The Pareto principle or "80-20 rule" stating that 80% of outcomes are due to 20% of causes was named in honour of Pareto, but the concepts are distinct, and only Pareto distributions with shape value of log45 ≈ 1.16 precisely reflect it. Empirical observation has shown that this 80-20 distribution fits a wide range of cases, including natural phenomena and human activities.

In mathematics, differential forms provide a unified approach to define integrands over curves, surfaces, solids, and higher-dimensional manifolds. The modern notion of differential forms was pioneered by Élie Cartan. It has many applications, especially in geometry, topology and physics.

<span class="mw-page-title-main">Root system</span> Geometric arrangements of points, foundational to Lie theory

In mathematics, a root system is a configuration of vectors in a Euclidean space satisfying certain geometrical properties. The concept is fundamental in the theory of Lie groups and Lie algebras, especially the classification and representation theory of semisimple Lie algebras. Since Lie groups and Lie algebras have become important in many parts of mathematics during the twentieth century, the apparently special nature of root systems belies the number of areas in which they are applied. Further, the classification scheme for root systems, by Dynkin diagrams, occurs in parts of mathematics with no overt connection to Lie theory. Finally, root systems are important for their own sake, as in spectral graph theory.

<span class="mw-page-title-main">Anti-de Sitter space</span> Maximally symmetric Lorentzian manifold with a negative cosmological constant

In mathematics and physics, n-dimensional anti-de Sitter space (AdSn) is a maximally symmetric Lorentzian manifold with constant negative scalar curvature. Anti-de Sitter space and de Sitter space are named after Willem de Sitter (1872–1934), professor of astronomy at Leiden University and director of the Leiden Observatory. Willem de Sitter and Albert Einstein worked together closely in Leiden in the 1920s on the spacetime structure of the universe. Paul Dirac was the first person to rigorously explore anti-de Sitter space, doing so in 1963.

In Bayesian probability theory, if, given a likelihood function , the posterior distribution is in the same probability distribution family as the prior probability distribution , the prior and posterior are then called conjugate distributions with respect to that likelihood function and the prior is called a conjugate prior for the likelihood function .

In the field of mathematics, norms are defined for elements within a vector space. Specifically, when the vector space comprises matrices, such norms are referred to as matrix norms. Matrix norms differ from vector norms in that they must also interact with matrix multiplication.

AdaBoost is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values.

In (unconstrained) mathematical optimization, a backtracking line search is a line search method to determine the amount to move along a given search direction. Its use requires that the objective function is differentiable and that its gradient is known.

In mathematics, there is in mathematical analysis a class of Sobolev inequalities, relating norms including those of Sobolev spaces. These are used to prove the Sobolev embedding theorem, giving inclusions between certain Sobolev spaces, and the Rellich–Kondrachov theorem showing that under slightly stronger conditions some Sobolev spaces are compactly embedded in others. They are named after Sergei Lvovich Sobolev.

In quantum computing, quantum finite automata (QFA) or quantum state machines are a quantum analog of probabilistic automata or a Markov decision process. They provide a mathematical abstraction of real-world quantum computers. Several types of automata may be defined, including measure-once and measure-many automata. Quantum finite automata can also be understood as the quantization of subshifts of finite type, or as a quantization of Markov chains. QFAs are, in turn, special cases of geometric finite automata or topological finite automata.

Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes, and thus also belongs to the class of margin classifier algorithms.

In the mathematical field of analysis, a well-known theorem describes the set of discontinuities of a monotone real-valued function of a real variable; all discontinuities of such a (monotone) function are necessarily jump discontinuities and there are at most countably many of them.

For computer science, in statistical learning theory, a representer theorem is any of several related results stating that a minimizer of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.

<span class="mw-page-title-main">Rectifier (neural networks)</span> Type of activation function

In the context of artificial neural networks, the rectifier or ReLU activation function is an activation function defined as the non-negative part of its argument, i.e., the ramp function:

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

Batch normalization is a method used to make training of artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.

<span class="mw-page-title-main">Continuous Bernoulli distribution</span> Probability distribution

In probability theory, statistics, and machine learning, the continuous Bernoulli distribution is a family of continuous probability distributions parameterized by a single shape parameter , defined on the unit interval , by:

References

  1. Schroff, Florian; Kalenichenko, Dmitry; Philbin, James (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering": 815–823.{{cite journal}}: Cite journal requires |journal= (help)
  2. Bekuzarov, Maksym (2020-04-19). "Losses explained: Contrastive Loss". Medium. Archived from the original on 2024-06-05. Retrieved 2025-01-06.
  3. Zhou, Mo; Niu, Zhenxing; Wang, Le; Gao, Zhanning; Zhang, Qilin; Hua, Gang (2020-04-03). "Ladder Loss for Coherent Visual-Semantic Embedding" (PDF). Proceedings of the AAAI Conference on Artificial Intelligence. 34 (7): 13050–13057. doi: 10.1609/aaai.v34i07.7006 . ISSN   2374-3468. S2CID   208139521.
  4. Reimers, Nils; Gurevych, Iryna (2019-08-27). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv: 1908.10084 [cs.CL].