This article includes a list of general references, but it lacks sufficient corresponding inline citations .(March 2011) |
The Jaccard index is a statistic used for gauging the similarity and diversity of sample sets. It is defined in general taking the ratio of two sizes (areas or volumes), the intersection size divided by the union size, also called intersection over union (IoU).
It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) [1] and now is often called the critical success index in meteorology. [2] It was later developed independently by Paul Jaccard, originally giving the French name coefficient de communauté (community coefficient), [3] [4] and independently formulated again by T. Tanimoto. [5] Thus, it is also called Tanimoto index or Tanimoto coefficient in some fields.
The Jaccard index measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:
Note that by design, If A intersection B is empty, then J(A, B) = 0. The Jaccard index is widely used in computer science, ecology, genomics, and other sciences, where binary or binarized data are used. Both the exact solution and approximation methods are available for hypothesis testing with the Jaccard index. [6]
Jaccard similarity also applies to bags, i.e., multisets. This has a similar formula, [7] but the symbols used represent bag intersection and bag sum (not union). The maximum value is 1/2.
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard index and is obtained by subtracting the Jaccard index from 1 or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:
An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union. Jaccard distance is commonly used to calculate an n × n matrix for clustering and multidimensional scaling of n sample sets.
This distance is a metric on the collection of all finite sets. [8] [9] [10]
There is also a version of the Jaccard distance for measures, including probability measures. If is a measure on a measurable space , then we define the Jaccard index by
and the Jaccard distance by
Care must be taken if or , since these formulas are not well defined in these cases.
The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity index of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.
Given two objects, A and B, each with n binary attributes, the Jaccard index is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows:
A B | 0 | 1 |
---|---|---|
0 | ||
1 |
Each attribute must fall into one of these four categories, meaning that
The Jaccard similarity index, J, is given as
The Jaccard distance, dJ, is given as
Statistical inference can be made based on the Jaccard similarity index, and consequently related metrics. [6] Given two sample sets A and B with n attributes, a statistical test can be conducted to see if an overlap is statistically significant. The exact solution is available, although computation can be costly as n increases. [6] Estimation methods are available either by approximating a multinomial distribution or by bootstrapping. [6]
When used for binary attributes, the Jaccard index is very similar to the simple matching coefficient. The main difference is that the SMC has the term in its numerator and denominator, whereas the Jaccard index does not. Thus, the SMC counts both mutual presences (when an attribute is present in both sets) and mutual absence (when an attribute is absent in both sets) as matches and compares it to the total number of attributes in the universe, whereas the Jaccard index only counts mutual presence as matches and compares it to the number of attributes that have been chosen by at least one of the two sets.
In market basket analysis, for example, the basket of two consumers who we wish to compare might only contain a small fraction of all the available products in the store, so the SMC will usually return very high values of similarities even when the baskets bear very little resemblance, thus making the Jaccard index a more appropriate measure of similarity in that context. For example, consider a supermarket with 1000 products and two customers. The basket of the first customer contains salt and pepper and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard index would be 1/3, but the similarity becomes 0.998 using the SMC.
In other contexts, where 0 and 1 carry equivalent information (symmetry), the SMC is a better measure of similarity. For example, vectors of demographic variables stored in dummy variables, such as gender, would be better compared with the SMC than with the Jaccard index since the impact of gender on similarity should be equal, independently of whether male is defined as a 0 and female as a 1 or the other way around. However, when we have symmetric dummy variables, one could replicate the behaviour of the SMC by splitting the dummies into two binary attributes (in this case, male and female), thus transforming them into asymmetric attributes, allowing the use of the Jaccard index without introducing any bias. The SMC remains, however, more computationally efficient in the case of symmetric dummy variables since it does not require adding extra dimensions.
If and are two vectors with all real , then their Jaccard similarity index (also known then as Ruzicka similarity) is defined as
and Jaccard distance (also known then as Soergel distance)
With even more generality, if and are two non-negative measurable functions on a measurable space with measure , then we can define
where and are pointwise operators. Then Jaccard distance is
Then, for example, for two measurable sets , we have where and are the characteristic functions of the corresponding set.
The weighted Jaccard similarity described above generalizes the Jaccard Index to positive vectors, where a set corresponds to a binary vector given by the indicator function, i.e. . However, it does not generalize the Jaccard Index to probability distributions, where a set corresponds to a uniform probability distribution, i.e.
It is always less if the sets differ in size. If , and then
Instead, a generalization that is continuous between probability distributions and their corresponding support sets is
which is called the "Probability" Jaccard. [11] It has the following bounds against the Weighted Jaccard on probability vectors.
Here the upper bound is the (weighted) Sørensen–Dice coefficient. The corresponding distance, , is a metric over probability distributions, and a pseudo-metric over non-negative vectors.
The Probability Jaccard Index has a geometric interpretation as the area of an intersection of simplices. Every point on a unit -simplex corresponds to a probability distribution on elements, because the unit -simplex is the set of points in dimensions that sum to 1. To derive the Probability Jaccard Index geometrically, represent a probability distribution as the unit simplex divided into sub simplices according to the mass of each item. If you overlay two distributions represented in this way on top of each other, and intersect the simplices corresponding to each item, the area that remains is equal to the Probability Jaccard Index of the distributions.
Consider the problem of constructing random variables such that they collide with each other as much as possible. That is, if and , we would like to construct and to maximize . If we look at just two distributions in isolation, the highest we can achieve is given by where is the Total Variation distance. However, suppose we weren't just concerned with maximizing that particular pair, suppose we would like to maximize the collision probability of any arbitrary pair. One could construct an infinite number of random variables one for each distribution , and seek to maximize for all pairs . In a fairly strong sense described below, the Probability Jaccard Index is an optimal way to align these random variables.
For any sampling method and discrete distributions , if then for some where and , either or . [11]
That is, no sampling method can achieve more collisions than on one pair without achieving fewer collisions than on another pair, where the reduced pair is more similar under than the increased pair. This theorem is true for the Jaccard Index of sets (if interpreted as uniform distributions) and the probability Jaccard, but not of the weighted Jaccard. (The theorem uses the word "sampling method" to describe a joint distribution over all distributions on a space, because it derives from the use of weighted minhashing algorithms that achieve this as their collision probability.)
This theorem has a visual proof on three element distributions using the simplex representation.
Various forms of functions described as Tanimoto similarity and Tanimoto distance occur in the literature and on the Internet. Most of these are synonyms for Jaccard similarity and Jaccard distance, but some are mathematically different. Many sources [12] cite an IBM Technical Report [5] as the seminal reference.
In "A Computer Program for Classifying Plants", published in October 1960, [13] a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard distance.
In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set (i.e. nonzero) in either sample.
Presented in mathematical terms, if samples X and Y are bitmaps, is the ith bit of X, and are bitwise and , or operators respectively, then the similarity ratio is
If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard index of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.[ citation needed ]
Tanimoto goes on to define a "distance" based on this ratio, defined for bitmaps with non-zero similarity:
This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.
Tanimoto distance is often referred to, erroneously, as a synonym for Jaccard distance . This function is a proper distance metric. "Tanimoto Distance" is often stated as being a proper distance metric, probably because of its confusion with Jaccard distance.[ clarification needed ][ citation needed ]
If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be written as
where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1) then
and
This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of do not necessarily extend to . In particular, the difference function does not preserve triangle inequality, and is not therefore a proper distance metric, whereas is.
There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.
Lipkus [9] uses a definition of Tanimoto similarity which is equivalent to , and refers to Tanimoto distance as the function . It is, however, made clear within the paper that the context is restricted by the use of a (positive) weighting vector such that, for any vector A being considered, Under these circumstances, the function is a proper distance metric, and so a set of vectors governed by such a weighting vector forms a metric space under this function.
In confusion matrices employed for binary classification, the Jaccard index can be framed in the following formula:
where TP are the true positives, FP the false positives and FN the false negatives. [14]
In probability theory, the central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. There are several versions of the CLT, each applying in the context of different conditions.
In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. They are sometimes called Lebesgue spaces, named after Henri Lebesgue, although according to the Bourbaki group they were first introduced by Frigyes Riesz.
In probability, and statistics, a multivariate random variable or random vector is a list or vector of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system — often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of an unspecified person from within a group would be a random vector. Normally each element of a random vector is a real number.
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.
The stress–energy tensor, sometimes called the stress–energy–momentum tensor or the energy–momentum tensor, is a tensor physical quantity that describes the density and flux of energy and momentum in spacetime, generalizing the stress tensor of Newtonian physics. It is an attribute of matter, radiation, and non-gravitational force fields. This density and flux of energy and momentum are the sources of the gravitational field in the Einstein field equations of general relativity, just as mass density is the source of such a field in Newtonian gravity.
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.
In special relativity, four-momentum (also called momentum–energy or momenergy) is the generalization of the classical three-dimensional momentum to four-dimensional spacetime. Momentum is a vector in three dimensions; similarly four-momentum is a four-vector in spacetime. The contravariant four-momentum of a particle with relativistic energy E and three-momentum p = (px, py, pz) = γmv, where v is the particle's three-velocity and γ the Lorentz factor, is
In mathematics, the symmetric difference of two sets, also known as the disjunctive union and set sum, is the set of elements which are in either of the sets, but not in their intersection. For example, the symmetric difference of the sets and is .
In statistics, particularly in hypothesis testing, the Hotelling's T-squared distribution (T2), proposed by Harold Hotelling, is a multivariate probability distribution that is tightly related to the F-distribution and is most notable for arising as the distribution of a set of sample statistics that are natural generalizations of the statistics underlying the Student's t-distribution. The Hotelling's t-squared statistic (t2) is a generalization of Student's t-statistic that is used in multivariate hypothesis testing.
In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted , is a type of statistical distance: a measure of how much a model probability distribution Q is different from a true probability distribution P. Mathematically, it is defined as
A classical field theory is a physical theory that predicts how one or more fields in physics interact with matter through field equations, without considering effects of quantization; theories that incorporate quantum mechanics are called quantum field theories. In most contexts, 'classical field theory' is specifically intended to describe electromagnetism and gravitation, two of the fundamental forces of nature.
In mathematics, probabilistic metric spaces are a generalization of metric spaces where the distance no longer takes values in the non-negative real numbers R ≥ 0, but in distribution functions.
The covariant formulation of classical electromagnetism refers to ways of writing the laws of classical electromagnetism in a form that is manifestly invariant under Lorentz transformations, in the formalism of special relativity using rectilinear inertial coordinate systems. These expressions both make it simple to prove that the laws of classical electromagnetism take the same form in any inertial coordinate system, and also provide a way to translate the fields and forces from one frame to another. However, this is not as general as Maxwell's equations in curved spacetime or non-rectilinear coordinate systems.
In mathematics and economics, transportation theory or transport theory is a name given to the study of optimal transportation and allocation of resources. The problem was formalized by the French mathematician Gaspard Monge in 1781.
In mathematics, the Wasserstein distance or Kantorovich–Rubinstein metric is a distance function defined between probability distributions on a given metric space . It is named after Leonid Vaseršteĭn.
In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
In probability theory, the family of complex normal distributions, denoted or , characterizes complex random variables whose real and imaginary parts are jointly normal. The complex normal family has three parameters: location parameter μ, covariance matrix , and the relation matrix . The standard complex normal is the univariate distribution with , , and .
In probability theory, a logit-normal distribution is a probability distribution of a random variable whose logit has a normal distribution. If Y is a random variable with a normal distribution, and t is the standard logistic function, then X = t(Y) has a logit-normal distribution; likewise, if X is logit-normally distributed, then Y = logit(X)= log (X/(1-X)) is normally distributed. It is also known as the logistic normal distribution, which often refers to a multinomial logit version (e.g.).
Lagrangian field theory is a formalism in classical field theory. It is the field-theoretic analogue of Lagrangian mechanics. Lagrangian mechanics is used to analyze the motion of a system of discrete particles each with a finite number of degrees of freedom. Lagrangian field theory applies to continua and fields, which have an infinite number of degrees of freedom.
In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.