Similarity measure

Last updated

In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.

Contents

Cosine similarity is a commonly used similarity measure for real-valued vectors, used in (among other fields) information retrieval to score the similarity of documents in the vector space model. In machine learning, common kernel functions such as the RBF kernel can be viewed as similarity functions. [1]

Use of different similarity measure formulas

Different types of similarity measures exist for various types of objects, depending on the objects being compared. For each type of object there are various similarity measurement formulas. [2]

Similarity between two data points

Image shows the path of calculation when using the Euclidean distance formula Euclidean distance vis.png
Image shows the path of calculation when using the Euclidean distance formula

There are many various options available when it comes to finding similarity between two data points, some of which are a combination of other similarity methods. Some of the methods for similarity measures between two data points include Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance. The Euclidean distance formula is used to find the distance between two points on a plane, which is visualized in the image below. Manhattan distance is commonly used in GPS applications, as it can be used to find the shortest route between two addresses. When you generalize the Euclidean distance formula and Manhattan distance formula you are left with the Minkowski distance formula, which can be used in a wide variety of applications.

Similarity between strings

For comparing strings, there are various measures of string similarity that can be used. Some of these methods include edit distance, Levenshtein distance, Hamming distance, and Jaro distance. The best-fit formula is dependent on the requirements of the application. For example, edit distance is frequently used for natural language processing applications and features, such as spell-checking. Jaro distance is commonly used in record linkage to compare first and last names to other sources.

Similarity between two probability distributions

When comparing probability distributions the Mahalanobis distance formula, Bhattacharyya distance formulas, and the Hellinger distance formula are all very powerful and useful. The Mahalanobis distance formula is commonly used in statistical analysis. It measures the distance between two probability distributions that have different means and variances. This makes it useful for finding outliers across the datasets. The Bhattacharyya distance formula is generally used in image processing, comparing two probability distributions that represent different categories. For example, can be used to classify images based on their features. Hellinger distance is a powerful tool for text mining and classifying documents. Generally, it is used on probability distributions that represent the frequency of different words, allowing you to find similar documents.

Similarity between two sets

A set is a collection of items with no order or repetition. They are generally used to represent relationships or associations between objects or even people. The Jaccard index formula measures the similarity between two sets based on the number of items that are present in both sets relative to the total number of items. The Jaccard index is commonly used in recommendation systems and social media analysis. For example, could recommend a new application to download based on your previous downloads. The Sørensen–Dice coefficient also compares the number of items in both sets to the total number of items present but the weight for the number of shared items is larger. The Sørensen–Dice coefficient is commonly used in biology applications, measuring the similarity between two sets of genes or species.

Similarity between two sequences

When comparing temporal sequences (time series), some similarity measures must additionally account for similarity of two sequences that are not fully aligned.

Use in clustering

Clustering or Cluster analysis is a data mining technique that is used to discover patterns in data by grouping similar objects together. It involves partitioning a set of data points into groups or clusters based on their similarities. One of the fundamental aspects of clustering is how to measure similarity between data points.

Similarity measures play a crucial role in many clustering techniques, as they are used to determine how closely related two data points are and whether they should be grouped together in the same cluster. A similarity measure can take many different forms depending on the type of data being clustered and the specific problem being solved.

One of the most commonly used similarity measures is the Euclidean distance, which is used in many clustering techniques including K-means clustering and Hierarchical clustering. The Euclidean distance is a measure of the straight-line distance between two points in a high-dimensional space. It is calculated as the square root of the sum of the squared differences between the corresponding coordinates of the two points. For example, if we have two data points and , the Euclidean distance between them is

Heatmap of HIST1 region, which is located on mouse chromosome 13 at the following coordinates: [21.7 Mb, 24.1 Mb]. Nuclear Profile Similarity heatmap.png
Heatmap of HIST1 region, which is located on mouse chromosome 13 at the following coordinates: [21.7 Mb, 24.1 Mb].

Another commonly used similarity measure is the Jaccard index or Jaccard similarity, which is used in clustering techniques that work with binary data such as presence/absence data [3] or Boolean data; The Jaccard similarity is particularly useful for clustering techniques that work with text data, where it can be used to identify clusters of similar documents based on their shared features or keywords. [4] It is calculated as the size of the intersection of two sets divided by the size of the union of the two sets.

Similarities among 162 Relevant Nuclear Profile are tested using the Jaccard Similarity measure (see figure with heatmap). The Jaccard similarity of the nuclear profile ranges from 0 to 1, with 0 indicating no similarity between the two sets and 1 indicating perfect similarity with the aim of clustering the most similar nuclear profile.

Manhattan distance, also known as Taxicab geometry, is a commonly used similarity measure in clustering techniques that work with continuous data. It is a measure of the distance between two data points in a high-dimensional space, calculated as the sum of the absolute differences between the corresponding coordinates of the two points.

In spectral clustering, a similarity, or affinity, measure is used to transform data to overcome difficulties related to lack of convexity in the shape of the data distribution. [5] The measure gives rise to an -sized similarity matrix for a set of n points, where the entry in the matrix can be simply the (reciprocal of the) Euclidean distance between and , or it can be a more complex measure of distance such as the Gaussian . [5] Further modifying this result with network analysis techniques is also common. [6]

The choice of similarity measure depends on the type of data being clustered and the specific problem being solved. For example, working with continuous data such as gene expression data, the Euclidean distance or cosine similarity may be appropriate. If working with binary data such as the presence of a genomic loci in a nuclear profile, the Jaccard index may be more appropriate. Lastly, working with data that is arranged in a grid or lattice structure, such as image or signal processing data, the Manhattan distance is particularly useful for the clustering.

Use in recommender systems

Similarity measures are used to develop recommender systems. It observes a user's perception and liking of multiple items. On recommender systems, the method is using a distance calculation such as Euclidean Distance or Cosine Similarity to generate a similarity matrix with values representing the similarity of any pair of targets. Then, by analyzing and comparing the values in the matrix, it is possible to match two targets to a user's preference or link users based on their marks. In this system, it is relevant to observe the value itself and the absolute distance between two values. [7] Gathering this data can indicate a mark's likeliness to a user as well as how mutually closely two marks are either rejected or accepted. It is possible then to recommend to a user targets with high similarity to the user's likes.

Recommender systems are observed in multiple online entertainment platforms, in social media and streaming websites. The logic for the construction of this systems is based on similarity measures. [8]

Use in sequence alignment

Similarity matrices are used in sequence alignment. Higher scores are given to more-similar characters, and lower or negative scores for dissimilar characters.

Nucleotide similarity matrices are used to align nucleic acid sequences. Because there are only four nucleotides commonly found in DNA (Adenine (A), Cytosine (C), Guanine (G) and Thymine (T)), nucleotide similarity matrices are much simpler than protein similarity matrices. For example, a simple matrix will assign identical bases a score of +1 and non-identical bases a score of −1. A more complicated matrix would give a higher score to transitions (changes from a pyrimidine such as C or T to another pyrimidine, or from a purine such as A or G to another purine) than to transversions (from a pyrimidine to a purine or vice versa). The match/mismatch ratio of the matrix sets the target evolutionary distance. [9] [10] The +1/−3 DNA matrix used by BLASTN is best suited for finding matches between sequences that are 99% identical; a +1/−1 (or +4/−4) matrix is much more suited to sequences with about 70% similarity. Matrices for lower similarity sequences require longer sequence alignments.

Amino acid similarity matrices are more complicated, because there are 20 amino acids coded for by the genetic code, and so a larger number of possible substitutions. Therefore, the similarity matrix for amino acids contains 400 entries (although it is usually symmetric). The first approach scored all amino acid changes equally. A later refinement was to determine amino acid similarities based on how many base changes were required to change a codon to code for that amino acid. This model is better, but it doesn't take into account the selective pressure of amino acid changes. Better models took into account the chemical properties of amino acids.

One approach has been to empirically generate the similarity matrices. The Dayhoff method used phylogenetic trees and sequences taken from species on the tree. This approach has given rise to the PAM series of matrices. PAM matrices are labelled based on how many nucleotide changes have occurred, per 100 amino acids. While the PAM matrices benefit from having a well understood evolutionary model, they are most useful at short evolutionary distances (PAM10–PAM120). At long evolutionary distances, for example PAM250 or 20% identity, it has been shown that the BLOSUM matrices are much more effective.

The BLOSUM series were generated by comparing a number of divergent sequences. The BLOSUM series are labeled based on how much entropy remains unmutated between all sequences, so a lower BLOSUM number corresponds to a higher PAM number.

Use in computer vision

The most common method for comparing two images in content-based image retrieval (typically an example image and an image from the database) is using an image distance measure. An image distance measure compares the similarity of two images in various dimensions such as color, texture, shape, and others. For example, a distance of 0 signifies an exact match with the query, with respect to the dimensions that were considered. As one may intuitively gather, a value greater than 0 indicates various degrees of similarities between the images. Search results then can be sorted based on their distance to the queried image. [11] Many measures of image distance (Similarity Models) have been developed. [12]

See also

Related Research Articles

<span class="mw-page-title-main">Euclidean distance</span> Length of a line segment

In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is occasionally called the Pythagorean distance.

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a nucleotide sequence or a protein sequence changes to other character states over evolutionary time. The information is often in the form of log odds of finding two specific character states aligned and depends on the assumed number of evolutionary changes or sequence dissimilarity between compared sequences. It is an application of a stochastic matrix. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where they are used to calculate similarity scores between the aligned sequences.

<span class="mw-page-title-main">Multidimensional scaling</span> Set of related ordination techniques used in information visualization

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of objects or individuals" into a configuration of points mapped into an abstract Cartesian space.

<span class="mw-page-title-main">Cluster analysis</span> Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

<span class="mw-page-title-main">Jaccard index</span> Measure of similarity and diversity between sets

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets.

Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression. These are also of interest while wanting to find a representative using some distance other than squared euclidean distance.

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

<span class="mw-page-title-main">Point accepted mutation</span>

A point accepted mutation — also known as a PAM — is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection. This definition does not include all point mutations in the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are mutations that are lethal or that are rejected by natural selection in other ways.

<span class="mw-page-title-main">BLOSUM</span> Bioinformatics tool

In bioinformatics, the BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.

In mathematics, a Euclidean distance matrix is an n×n matrix representing the spacing of a set of n points in Euclidean space. For points in k-dimensional space k, the elements of their Euclidean distance matrix A are given by squares of distances between them. That is

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in .

The Sørensen–Dice coefficient is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen and Lee Raymond Dice, who published in 1948 and 1945 respectively.

Distance matrices are used in phylogeny as non-parametric distance methods and were originally applied to phenetic data using a matrix of pairwise distances. These distances are then reconciled to produce a tree. The distance matrix can come from a number of different sources, including measured distance or morphometric analysis, various pairwise distance formulae applied to discrete morphological characters, or genetic distance from sequence, restriction fragment, or allozyme data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states.

Consensus clustering is a method of aggregating results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering, it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

In mathematics, a rigid transformation is a geometric transformation of a Euclidean space that preserves the Euclidean distance between every pair of points.

<span class="mw-page-title-main">3D reconstruction from multiple images</span> Creation of a 3D model from a set of images

3D reconstruction from multiple images is the creation of three-dimensional models from a set of images. It is the reverse process of obtaining 2D images from 3D scenes.

The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets.

References

  1. Vert, Jean-Philippe; Tsuda, Koji; Schölkopf, Bernhard (2004). "A primer on kernel methods" (PDF). Kernel Methods in Computational Biology.
  2. https://iq.opengenus.org/similarity-measurements/ "Different Types of Similarity measurements"
  3. Chung, Neo Christopher; Miasojedow, BłaŻej; Startek, Michał; Gambin, Anna (2019). "Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data". BMC Bioinformatics. 20 (S15): 644. doi: 10.1186/s12859-019-3118-5 . ISSN   1471-2105. PMC   6929325 . PMID   31874610.
  4. International MultiConference of Engineers and Computer Scientists : IMECS 2013 : 13-15 March, 2013, the Royal Garden Hotel, Kowloon, Hong Kong. S. I. Ao, International Association of Engineers. Hong Kong: Newswood Ltd. 2013. ISBN   978-988-19251-8-3. OCLC   842831996.{{cite book}}: CS1 maint: others (link)
  5. 1 2 Ng, A.Y.; Jordan, M.I.; Weiss, Y. (2001), "On Spectral Clustering: Analysis and an Algorithm", Advances in Neural Information Processing Systems, MIT Press, 14: 849–856
  6. Li, Xin-Ye; Guo, Li-Jie (2012), "Constructing affinity matrix in spectral clustering based on neighbor propagation", Neurocomputing, 97: 125–130, doi:10.1016/j.neucom.2012.06.023
  7. Bondarenko, Kirill (2019), Similarity metrics in recommender systems , retrieved 25 April 2023
  8. Javed, Mahnoor (2020), "Using Cosine Similarity to Build a Movie Recommendation System", Towards Data Science, retrieved 25 April 2023
  9. States, D; Gish, W; Altschul, S (1991). "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices". Methods: A Companion to Methods in Enzymology. 3 (1): 66. CiteSeerX   10.1.1.114.8183 . doi:10.1016/S1046-2023(05)80165-3.
  10. Sean R. Eddy (2004). "Where did the BLOSUM62 alignment score matrix come from?" (PDF). Nature Biotechnology. 22 (8): 1035–6. doi:10.1038/nbt0804-1035. PMID   15286655. S2CID   205269887. Archived from the original (PDF) on 2006-09-03.
  11. Shapiro, Linda; George Stockman (2001). Computer Vision. Upper Saddle River, NJ: Prentice Hall. ISBN   978-0-13-030796-5.
  12. Eidenberger, Horst (2011). "Fundamental Media Understanding", atpress. ISBN   978-3-8423-7917-6.