Similarity learning

Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, recommendation systems, visual identity tracking, face verification, and speaker verification.

Learning setup

There are four common setups for similarity and metric distance learning.

Regression similarity learning
In this setup, pairs of objects are given $(x_i^1, x_i^2)$ together with a measure of their similarity $y_i \in \mathbb{R}$. The goal is to learn a function that approximates $f(x_i^1, x_i^2) \approx y_i$ for every new labeled triplet example $(x_i^1, x_i^2, y_i)$. This is typically achieved by minimizing a regularized loss $\min_W \sum_i \text{loss}(W; x_i^1, x_i^2, y_i) + \text{reg}(W)$.
Classification similarity learning
Given are pairs of similar objects $(x_i, x_i^+)$ and non-similar objects $(x_i, x_i^-)$. An equivalent formulation is that every pair $(x_i^1, x_i^2)$ is given together with a binary label $y_i \in \{0, 1\}$ that determines if the two objects are similar or not. The goal is again to learn a classifier that can decide if a new pair of objects is similar or not.
Ranking similarity learning
Given are triplets of objects $(x_i, x_i^+, x_i^-)$ whose relative similarity obeys a predefined order: $x_i$ is known to be more similar to $x_i^+$ than to $x_i^-$. The goal is to learn a function $f$ such that for any new triplet of objects $(x, x^+, x^-)$, it obeys $f(x, x^+) > f(x, x^-)$ (contrastive learning). This setup assumes a weaker form of supervision than in regression, because instead of providing an exact measure of similarity, one only has to provide the relative order of similarity. For this reason, ranking-based similarity learning is easier to apply in real large-scale applications. [1] (A toy sketch of this setup with a bilinear similarity model is given after the paragraph on bilinear models below.)
Locality sensitive hashing (LSH) [2]
Hashes input items so that similar items map to the same "buckets" in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large-scale high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. [3]
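For illustration, the following is a minimal sketch of one common LSH family (random-hyperplane hashing for cosine similarity) in Python with NumPy; the dimension, signature length, and toy vectors are arbitrary placeholders rather than the design of any particular LSH system.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_bits = 64, 16                            # input dimension and signature length (illustrative)
    hyperplanes = rng.normal(size=(n_bits, d))    # one random hyperplane per signature bit

    def lsh_signature(x):
        # The bucket key is the sign pattern of x projected onto the random hyperplanes.
        return tuple(int(b) for b in (hyperplanes @ x > 0))

    x = rng.normal(size=d)
    x_similar = x + 0.01 * rng.normal(size=d)     # small perturbation of x
    x_different = rng.normal(size=d)              # unrelated vector

    buckets = {}
    for name, v in [("x", x), ("x_similar", x_similar), ("x_different", x_different)]:
        buckets.setdefault(lsh_signature(v), []).append(name)

    # With high probability, x and x_similar share a bucket while x_different does not.
    print(buckets)

Nearest-neighbor candidates are then retrieved only from the bucket a query hashes to, instead of scanning the whole collection.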

A common approach for learning similarity is to model the similarity function as a bilinear form. For example, in the case of ranking similarity learning, one aims to learn a matrix $W$ that parametrizes the similarity function $f_W(x, z) = x^\top W z$. When data is abundant, a common approach is to learn a siamese network – a deep network model with parameter sharing.
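As a concrete toy illustration of the ranking setup above with a bilinear model, the sketch below fits $W$ by stochastic subgradient steps on a hinge-type triplet loss; the synthetic triplets, margin, and learning rate are placeholder choices, not a specific published algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10
    W = np.eye(d)                                 # bilinear similarity f_W(x, z) = x^T W z

    def f(x, z):
        return x @ W @ z

    # Toy triplets (x, x_pos, x_neg): x_pos is a noisy copy of x, x_neg is unrelated.
    triplets = [(x, x + 0.1 * rng.normal(size=d), rng.normal(size=d))
                for x in rng.normal(size=(500, d))]

    lr, margin = 0.01, 1.0
    for x, x_pos, x_neg in triplets:
        # Hinge loss max(0, margin - f(x, x_pos) + f(x, x_neg)); update W only when it is violated.
        if margin - f(x, x_pos) + f(x, x_neg) > 0:
            W += lr * (np.outer(x, x_pos) - np.outer(x, x_neg))

    # After training, a noisy copy of a new x should score higher than an unrelated vector.
    x_new = rng.normal(size=d)
    print(f(x_new, x_new + 0.1 * rng.normal(size=d)) > f(x_new, rng.normal(size=d)))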

Metric learning

Similarity learning is closely related to distance metric learning. Metric learning is the task of learning a distance function over objects. A metric or distance function has to obey four axioms: non-negativity, identity of indiscernibles, symmetry and subadditivity (or the triangle inequality). In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric.

When the objects $x_i$ are vectors in $\mathbb{R}^d$, then any matrix $W$ in the symmetric positive semi-definite cone $S_+^d$ defines a distance pseudo-metric on the space of $x$ through the form $D_W(x_1, x_2)^2 = (x_1 - x_2)^\top W (x_1 - x_2)$. When $W$ is a symmetric positive definite matrix, $D_W$ is a metric. Moreover, as any symmetric positive semi-definite matrix $W \in S_+^d$ can be decomposed as $W = L^\top L$ where $L \in \mathbb{R}^{e \times d}$ and $e \geq \operatorname{rank}(W)$, the distance function $D_W$ can be rewritten equivalently as $D_W(x_1, x_2)^2 = \|L(x_1 - x_2)\|_2^2$. The distance $D_W$ thus corresponds to the Euclidean distance between the transformed feature vectors $\tilde{x}_1 = Lx_1$ and $\tilde{x}_2 = Lx_2$.
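The equivalence can be checked numerically; the short sketch below (with an arbitrary $L$ and random vectors as placeholders) verifies that $(x_1 - x_2)^\top W (x_1 - x_2)$ equals $\|L(x_1 - x_2)\|_2^2$ when $W = L^\top L$.

    import numpy as np

    rng = np.random.default_rng(0)
    d, e = 5, 3
    L = rng.normal(size=(e, d))                          # any real e-by-d matrix
    W = L.T @ L                                          # symmetric positive semi-definite by construction

    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    dW_squared = (x1 - x2) @ W @ (x1 - x2)               # (x1 - x2)^T W (x1 - x2)
    euclidean_squared = np.sum((L @ x1 - L @ x2) ** 2)   # ||L x1 - L x2||_2^2

    print(np.isclose(dW_squared, euclidean_squared))     # True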

Many formulations for metric learning have been proposed. [4] [5] Some well-known approaches for metric learning include learning from relative comparisons, [6] which is based on the triplet loss, large margin nearest neighbor, [7] and information theoretic metric learning (ITML). [8]

In statistics, the covariance matrix of the data is sometimes used to define a distance metric called Mahalanobis distance.
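As a minimal sketch (with placeholder sample data), the Mahalanobis distance between two points uses the inverse of the empirical covariance matrix in the role of the matrix $W$ above.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # correlated sample data (placeholder)

    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))           # inverse covariance plays the role of W
    x1, x2 = X[0], X[1]
    distance = np.sqrt((x1 - x2) @ inv_cov @ (x1 - x2))        # Mahalanobis distance
    print(distance)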

Applications

Similarity learning is used in information retrieval for learning to rank, in face verification or face identification, [9] [10] and in recommendation systems. Also, many machine learning approaches rely on some metric. This includes unsupervised learning such as clustering, which groups together close or similar objects. It also includes supervised approaches like the k-nearest neighbors algorithm, which relies on the labels of nearby objects to decide on the label of a new object. Metric learning has been proposed as a preprocessing step for many of these approaches. [11]

Scalability

Metric and similarity learning naively scale quadratically with the dimension of the input space, as one can easily see when the learned metric has a bilinear form $f_W(x, z) = x^\top W z$. Scaling to higher dimensions can be achieved by enforcing a sparseness structure over the matrix model, as done with HDSL [12] and with COMET. [13]
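The quadratic growth is simply the number of free parameters in a dense $W$; the toy count below contrasts it with a diagonal restriction, one elementary example of a sparse structure (the actual structures used by HDSL and COMET differ).

    # Parameter counts for a bilinear model f_W(x, z) = x^T W z at several input dimensions d.
    for d in (100, 1_000, 10_000):
        dense_params = d * d        # full matrix W: grows quadratically with d
        diagonal_params = d         # diagonal W: grows linearly with d
        print(d, dense_params, diagonal_params)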

Software

metric-learn is a free-software Python library that provides implementations of several supervised and weakly-supervised similarity and metric learning algorithms, with an API compatible with scikit-learn. [14] [15] OpenMetricLearning is a Python framework for training and validating models that produce high-quality embeddings. [16]

Further information

For further information on this topic, see the surveys on metric and similarity learning by Bellet et al. [4] and Kulis. [5]

References

  1. Chechik, G.; Sharma, V.; Shalit, U.; Bengio, S. (2010). "Large Scale Online Learning of Image Similarity Through Ranking" (PDF). Journal of Machine Learning Research. 11: 1109–1135.
  2. Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB. Vol. 99. No. 6. 1999.
  3. Rajaraman, A.; Ullman, J. (2010). "Mining of Massive Datasets, Ch. 3".
  4. Bellet, A.; Habrard, A.; Sebban, M. (2013). "A Survey on Metric Learning for Feature Vectors and Structured Data". arXiv: 1306.6709 [cs.LG].
  5. Kulis, B. (2012). "Metric Learning: A Survey". Foundations and Trends in Machine Learning. 5 (4): 287–364. doi:10.1561/2200000019.
  6. Schultz, M.; Joachims, T. (2004). "Learning a distance metric from relative comparisons" (PDF). Advances in Neural Information Processing Systems. 16: 41–48.
  7. Weinberger, K. Q.; Blitzer, J. C.; Saul, L. K. (2006). "Distance Metric Learning for Large Margin Nearest Neighbor Classification" (PDF). Advances in Neural Information Processing Systems. 18: 1473–1480.
  8. Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; Dhillon, I. S. (2007). "Information-theoretic metric learning". International Conference on Machine Learning (ICML): 209–216.
  9. Guillaumin, M.; Verbeek, J.; Schmid, C. (2009). "Is that you? Metric learning approaches for face identification" (PDF). IEEE International Conference on Computer Vision (ICCV).
  10. Mignon, A.; Jurie, F. (2012). "PCCA: A new approach for distance learning from sparse pairwise constraints" (PDF). IEEE Conference on Computer Vision and Pattern Recognition.
  11. Xing, E. P.; Ng, A. Y.; Jordan, M. I.; Russell, S. (2002). "Distance Metric Learning, with Application to Clustering with Side-information" (PDF). Advances in Neural Information Processing Systems. 15: 505–512.
  12. Liu; Bellet; Sha (2015). "Similarity Learning for High-Dimensional Sparse Data" (PDF). International Conference on Artificial Intelligence and Statistics (AISTATS). arXiv: 1411.2374 . Bibcode:2014arXiv1411.2374L.
  13. Atzmon; Shalit; Chechik (2015). "Learning Sparse Metrics, One Feature at a Time" (PDF). J. Mach. Learn. Research (JMLR).
  14. https://github.com/scikit-learn-contrib/metric-learn
  15. Vazelhes; Carey; Tang; Vauquier; Bellet (2020). "metric-learn: Metric Learning Algorithms in Python" (PDF). J. Mach. Learn. Research (JMLR). arXiv: 1908.04710 .
  16. https://github.com/OML-Team/open-metric-learning