Triplet loss

Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples. It was conceived by Google researchers for their prominent FaceNet algorithm for face recognition. [1]

The triplet loss function minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

Triplet loss is designed to support metric learning: it helps train models to learn an embedding (a mapping to a feature space) in which similar data points lie closer together and dissimilar ones lie farther apart, enabling robust discrimination across varied conditions. In the context of face recognition, data points correspond to images.

Definition

The loss function is defined using triplets of training points of the form $(A, P, N)$. In each triplet, $A$ (called an "anchor point") denotes a reference point of a particular identity, $P$ (called a "positive point") denotes another point of the same identity as $A$, and $N$ (called a "negative point") denotes a point of an identity different from that of $A$ and $P$.

Let $x$ be some point and let $f(x)$ be the embedding of $x$ in a finite-dimensional Euclidean space. It is assumed that the L2 norm of $f(x)$ is unity (the L2 norm of a vector $X$ in a finite-dimensional Euclidean space is denoted by $\Vert X \Vert_2$). We assemble $m$ triplets of points from the training data set. The goal of training is to ensure that, after learning, the following condition (called the "triplet constraint") is satisfied by all triplets $(A^{(i)}, P^{(i)}, N^{(i)})$ in the training data set:

$$\Vert f(A^{(i)}) - f(P^{(i)}) \Vert_2^2 + \alpha < \Vert f(A^{(i)}) - f(N^{(i)}) \Vert_2^2$$

The variable $\alpha$ is a hyperparameter called the margin, and its value must be set manually. In the FaceNet system, it was set to 0.2.

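Thus, the full form of the function to be minimized over the $m$ training triplets is the following:

$$L = \sum_{i=1}^{m} \max\left( \Vert f(A^{(i)}) - f(P^{(i)}) \Vert_2^2 - \Vert f(A^{(i)}) - f(N^{(i)}) \Vert_2^2 + \alpha,\ 0 \right)$$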
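As an illustration, the loss can be sketched in a few lines of plain Python. For each triplet it adds the squared anchor-positive distance minus the squared anchor-negative distance plus the margin, clipped at zero. This is only a minimal sketch, not the FaceNet implementation: the function names (`l2_normalize`, `squared_distance`, `triplet_loss`) and the toy 2-D embeddings are assumptions made for this example, and a real system would compute embeddings with a neural network and use an array library.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (the embeddings are assumed to have norm 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, margin=0.2):
    """Sum the hinge term max(d_pos - d_neg + margin, 0) over all triplets.

    Each triplet is (anchor, positive, negative), given as embedding vectors.
    The default margin of 0.2 is the value reported for FaceNet.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        d_pos = squared_distance(anchor, positive)
        d_neg = squared_distance(anchor, negative)
        total += max(d_pos - d_neg + margin, 0.0)
    return total


# Hypothetical toy embeddings, chosen only to show both cases.
e1 = l2_normalize([1.0, 0.0])
e2 = l2_normalize([0.0, 1.0])

satisfied = (e1, e1, e2)   # positive coincides with anchor, negative is far away
violating = (e1, e2, e1)   # positive is far away, negative coincides with anchor

print(triplet_loss([satisfied]))  # no penalty: the triplet constraint holds
print(triplet_loss([violating]))  # positive penalty: the constraint is violated
```

A triplet that already satisfies the triplet constraint contributes nothing to the sum, so only violating triplets influence training.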

Selection of triplets

In general, the number of triplets of the form $(A, P, N)$ is very large. To make computation faster, the Google researchers considered only those triplets that violate the triplet constraint. For a given anchor image $A$, they chose the positive image $P$ for which $\Vert f(A) - f(P) \Vert_2^2$ is maximum (such a positive image was called a "hard positive image") and the negative image $N$ for which $\Vert f(A) - f(N) \Vert_2^2$ is minimum (such a negative image was called a "hard negative image"). Since using the whole training data set to determine the hard positive and hard negative images was computationally infeasible, the researchers experimented with several methods for selecting the triplets.
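The hard positive and hard negative selection described above can be sketched as an exhaustive search over a small candidate pool. This is an illustrative sketch rather than the researchers' actual procedure: the function name `hardest_triplet`, the idea of passing in a limited pool (for example, a mini-batch) instead of the whole data set, and the toy embeddings and labels are all assumptions made for this example. The toy vectors are not normalized, to keep the arithmetic easy to follow.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_triplet(anchor_idx, embeddings, labels):
    """Return (hard_positive_idx, hard_negative_idx) for one anchor.

    Hard positive: same identity as the anchor, at maximum distance.
    Hard negative: different identity, at minimum distance.
    """
    anchor = embeddings[anchor_idx]
    anchor_label = labels[anchor_idx]
    positives = [i for i, lab in enumerate(labels)
                 if lab == anchor_label and i != anchor_idx]
    negatives = [i for i, lab in enumerate(labels) if lab != anchor_label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative candidate")

    def dist(i):
        return squared_distance(anchor, embeddings[i])

    return max(positives, key=dist), min(negatives, key=dist)


# Hypothetical candidate pool: three points of identity "a", two of identity "b".
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [0.4, 0.0], [2.0, 0.0]]
labels = ["a", "a", "a", "b", "b"]

hard_pos, hard_neg = hardest_triplet(0, embeddings, labels)
print(hard_pos, hard_neg)
```

In this toy pool the selected positive lies farther from the anchor than the selected negative, so the resulting triplet violates the triplet constraint for any non-negative margin.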

Comparison and extensions

In computer vision tasks such as re-identification, a prevailing belief has been that the triplet loss is inferior to using surrogate losses (i.e., typical classification losses) followed by separate metric learning steps. Hermans, Beyer, and Leibe showed in 2017 that, for models trained from scratch as well as for pretrained models, a variant of triplet loss used for end-to-end deep metric learning outperformed most other methods published at that time. [2]

Additionally, triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., ladder) of distance inequalities. This leads to the Ladder Loss, which has been shown to improve visual-semantic embedding in learning-to-rank tasks. [3]

In natural language processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture. [4]

Other extensions involve specifying multiple negatives (multiple negatives ranking loss).

References

  1. Schroff, Florian; Kalenichenko, Dmitry; Philbin, James (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering". pp. 815–823.
  2. Hermans, Alexander; Beyer, Lucas; Leibe, Bastian (2017-03-22). "In Defense of the Triplet Loss for Person Re-Identification". arXiv:1703.07737 [cs.CV].
  3. Zhou, Mo; Niu, Zhenxing; Wang, Le; Gao, Zhanning; Zhang, Qilin; Hua, Gang (2020-04-03). "Ladder Loss for Coherent Visual-Semantic Embedding". Proceedings of the AAAI Conference on Artificial Intelligence. 34 (7): 13050–13057. doi:10.1609/aaai.v34i07.7006. ISSN 2374-3468. S2CID 208139521.
  4. Reimers, Nils; Gurevych, Iryna (2019-08-27). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084 [cs.CL].