Cosine similarity

Last updated

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in .

Contents

For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents. [1]

The technique is also used to measure cohesion within clusters in the field of data mining. [2]

One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered.

Other names for cosine similarity include Orchini similarity and Tucker coefficient of congruence; the Otsuka–Ochiai similarity (see below) is cosine similarity applied to binary data.

Definition

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

where and are the th components of vectors and , respectively.

The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In the case of information retrieval, the cosine similarity of two documents will range from , since the term frequencies cannot be negative. This remains true when using TF-IDF weights. The angle between two term frequency vectors cannot be greater than 90°.

If the attribute vectors are normalized by subtracting the vector means (e.g., ), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering,

Cosine distance

When the distance between two unit-length vectors is defined to be the length of their vector difference then

Nonetheless the cosine distance [3] is often defined without the square root or factor of 2:

It is important to note that, by virtue of being proportional to squared Euclidean distance, the cosine distance is not a true distance metric; it does not exhibit the triangle inequality property — or, more formally, the Schwarz inequality — and it violates the coincidence axiom. To repair the triangle inequality property while maintaining the same ordering, one can convert to Euclidean distance or angular distance θ = arccos(SC(A, B)). Alternatively, the triangular inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.

Angular distance and similarity

The normalized angle, referred to as angular distance, between any two vectors and is a formal distance metric and can be calculated from the cosine similarity. [4] The complement of the angular distance metric can then be used to define angular similarity function bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:

Or, if the vector elements are always positive:

Unfortunately, computing the inverse cosine (arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.

L2-normalized Euclidean distance

Another effective proxy for cosine distance can be obtained by normalisation of the vectors, followed by the application of normal Euclidean distance. Using this technique each term in each vector is first divided by the magnitude of the vector, yielding a vector of unit length. Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric. Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques. This normalised form distance is often used within many deep learning algorithms.

Otsuka–Ochiai coefficient

In biology, there is a similar concept known as the Otsuka–Ochiai coefficient named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka, [5] Japanese : 大塚 弥之助) [6] and Akira Ochiai (Japanese : 落合 明), [7] also known as the Ochiai–Barkman [8] or Ochiai coefficient, [9] which can be represented as:

Here, and are sets, and is the number of elements in . If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity. It is identical to the score introduced by Godfrey Thomson. [10]

In a recent book, [11] the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka. The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned) [7] by citing an article by Ikuso Hamai (Japanese : 浜井 生三), [12] who in turn cites the original 1936 article by Yanosuke Otsuka. [6]

Properties

The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions. For any positive constant and vector , the vectors and are maximally similar. The measure is thus most appropriate for data where frequency is more important than absolute values; notably, term frequency in documents. However more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence have been shown to have improved semantics in at least some contexts. [13]

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual , and observe that

(polarization identity)

by expansion. When A and B are normalized to unit length, so this expression is equal to

In short, the cosine distance can be expressed in terms of Euclidean distance as

.

The Euclidean distance is called the chord distance (because it is the length of the chord on the unit circle) and it is the Euclidean distance between the vectors which were normalized to unit sum of squared values within them.

Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of (where is the number of dimensions), and although the distribution is bounded between -1 and +1, as grows large the distribution is increasingly well-approximated by the normal distribution. [14] [15] Other types of data such as bitstreams, which only take the values 0 or 1, the null distribution takes a different form and may have a nonzero mean. [16]

Triangle inequality for cosine similarity

The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that

Because the cosine function decreases as an angle in [0, π] radians increases, the sense of these inequalities is reversed when we take the cosine of each value:

Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,

This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C is already known. This is used for example in metric data indexing, but has also been used to accelerate spherical k-means clustering [17] the same way the Euclidean triangle inequality has been used to accelerate regular k-means.

Soft cosine measure

A soft cosine or ("soft" similarity) between two vectors considers similarities between pairs of features. [18] The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which help generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.

For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams, or syntactic n-grams [19] can be quite similar, though formally they are considered as different features in the VSM. For example, words “play” and “game” are different words and thus mapped to different points in VSM; yet they are semantically related. In case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For calculating soft cosine, the matrix s is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. Then we just multiply by this matrix.

Given two N-dimension vectors and , the soft cosine similarity is calculated as follows:

where sij = similarity(featurei, featurej).

If there is no similarity between features (sii = 1, sij = 0 for ij), the given equation is equivalent to the conventional cosine similarity formula.

The time complexity of this measure is quadratic, which makes it applicable to real-world tasks. Note that the complexity can be reduced to subquadratic. [20] An efficient implementation of such soft cosine similarity is included in the Gensim open source library.

See also

Related Research Articles

<span class="mw-page-title-main">Spherical coordinate system</span> Coordinates comprising a distance and two angles

In mathematics, a spherical coordinate system is a coordinate system for three-dimensional space where the position of a given point in space is specified by three real numbers: the radial distancer along the radial line connecting the point to the fixed point of origin; the polar angleθ between the radial line and a polar axis; and the azimuthal angleφ as the angle of rotation of the radial line around the polar axis. (See graphic re the "physics convention".) Once the radius is fixed, the three coordinates (r, θ, φ), known as a 3-tuple, provide a coordinate system on a sphere, typically called the spherical polar coordinates.

<span class="mw-page-title-main">Solid angle</span> Measure of how large an object appears to an observer at a given point in three-dimensional space

In geometry, a solid angle is a measure of the amount of the field of view from some particular point that a given object covers. That is, it is a measure of how large the object appears to an observer looking from that point. The point from which the object is viewed is called the apex of the solid angle, and the object is said to subtend its solid angle at that point.

In mathematics, the dot product or scalar product is an algebraic operation that takes two equal-length sequences of numbers, and returns a single number. In Euclidean geometry, the dot product of the Cartesian coordinates of two vectors is widely used. It is often called the inner product of Euclidean space, even though it is not the only inner product that can be defined on Euclidean space.

In mechanics and geometry, the 3D rotation group, often denoted SO(3), is the group of all rotations about the origin of three-dimensional Euclidean space under the operation of composition.

Unit quaternions, known as versors, provide a convenient mathematical notation for representing spatial orientations and rotations of elements in three dimensional space. Specifically, they encode information about an axis-angle rotation about an arbitrary axis. Rotation and orientation quaternions have applications in computer graphics, computer vision, robotics, navigation, molecular dynamics, flight dynamics, orbital mechanics of satellites, and crystallographic texture analysis.

In linear algebra, two vectors in an inner product space are orthonormal if they are orthogonal unit vectors. A unit vector means that the vector has a length of 1, which is also known as normalized. Orthogonal means that the vectors are all perpendicular to each other. A set of vectors form an orthonormal set if all vectors in the set are mutually orthogonal and all of unit length. An orthonormal set which forms a basis is called an orthonormal basis.

<span class="mw-page-title-main">Spherical harmonics</span> Special mathematical functions defined on the surface of a sphere

In mathematics and physical science, spherical harmonics are special functions defined on the surface of a sphere. They are often employed in solving partial differential equations in many scientific fields. The table of spherical harmonics contains a list of common spherical harmonics.

<span class="mw-page-title-main">Inverse trigonometric functions</span> Inverse functions of sin, cos, tan, etc.

In mathematics, the inverse trigonometric functions are the inverse functions of the trigonometric functions. Specifically, they are the inverses of the sine, cosine, tangent, cotangent, secant, and cosecant functions, and are used to obtain an angle from any of the angle's trigonometric ratios. Inverse trigonometric functions are widely used in engineering, navigation, physics, and geometry.

<span class="mw-page-title-main">Great-circle distance</span> Shortest distance between two points on the surface of a sphere

The great-circle distance, orthodromic distance, or spherical distance is the distance between two points on a sphere, measured along the great-circle arc between them. This arc is the shortest path between the two points on the surface of the sphere.

A parametric surface is a surface in the Euclidean space which is defined by a parametric equation with two parameters . Parametric representation is a very general way to specify a surface, as well as implicit representation. Surfaces that occur in two of the main theorems of vector calculus, Stokes' theorem and the divergence theorem, are frequently given in a parametric form. The curvature and arc length of curves on the surface, surface area, differential geometric invariants such as the first and second fundamental forms, Gaussian, mean, and principal curvatures can all be computed from a given parametrization.

Angular distance or angular separation is the measure of the angle between the orientation of two straight lines, rays, or vectors in three-dimensional space, or the central angle subtended by the radii through two points on a sphere. When the rays are lines of sight from an observer to two points in space, it is known as the apparent distance or apparent separation.

<span class="mw-page-title-main">Sine and cosine</span> Fundamental trigonometric functions

In mathematics, sine and cosine are trigonometric functions of an angle. The sine and cosine of an acute angle are defined in the context of a right triangle: for the specified angle, its sine is the ratio of the length of the side that is opposite that angle to the length of the longest side of the triangle, and the cosine is the ratio of the length of the adjacent leg to that of the hypotenuse. For an angle , the sine and cosine functions are denoted as and .

In geometry, various formalisms exist to express a rotation in three dimensions as a mathematical transformation. In physics, this concept is applied to classical mechanics where rotational kinematics is the science of quantitative description of a purely rotational motion. The orientation of an object at a given instant is described with the same tools, as it is defined as an imaginary rotation from a reference placement in space, rather than an actually observed rotation from a previous placement in space.

<span class="mw-page-title-main">Axis–angle representation</span> Parameterization of a rotation into a unit vector and angle

In mathematics, the axis–angle representation parameterizes a rotation in a three-dimensional Euclidean space by two quantities: a unit vector e indicating the direction (geometry) of an axis of rotation, and an angle of rotation θ describing the magnitude and sense of the rotation about the axis. Only two numbers, not three, are needed to define the direction of a unit vector e rooted at the origin because the magnitude of e is constrained. For example, the elevation and azimuth angles of e suffice to locate it in any particular Cartesian coordinate frame.

In mathematics, vector spherical harmonics (VSH) are an extension of the scalar spherical harmonics for use with vector fields. The components of the VSH are complex-valued functions expressed in the spherical coordinate basis vectors.

<span class="mw-page-title-main">Pythagorean theorem</span> Relation between sides of a right triangle

In mathematics, the Pythagorean theorem or Pythagoras' theorem is a fundamental relation in Euclidean geometry between the three sides of a right triangle. It states that the area of the square whose side is the hypotenuse is equal to the sum of the areas of the squares on the other two sides.

The following are important identities in vector algebra. Identities that only involve the magnitude of a vector and the dot product of two vectors A·B, apply to vectors in any dimension, while identities that use the cross product A×B only apply in three dimensions, since the cross product is only defined there. Most of these relations can be dated to founder of vector calculus Josiah Willard Gibbs, if not earlier.

In orbital mechanics, Gauss's method is used for preliminary orbit determination from at least three observations of the orbiting body of interest at three different times. The required information are the times of observations, the position vectors of the observation points, the direction cosine vector of the orbiting body from the observation points and general physical data.

The concept of angles between lines, between two planes or between a line and a plane can be generalized to arbitrary dimensions. This generalization was first discussed by Camille Jordan. For any pair of flats in a Euclidean space of arbitrary dimension one can define a set of mutual angles which are invariant under isometric transformation of the Euclidean space. If the flats do not intersect, their shortest distance is one more invariant. These angles are called canonical or principal. The concept of angles can be generalized to pairs of flats in a finite-dimensional inner product space over the complex numbers.

References

  1. Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4): 35–43.
  2. P.-N. Tan, M. Steinbach & V. Kumar, Introduction to Data Mining, Addison-Wesley (2005), ISBN   0-321-32136-7, chapter 8; page 500.
  3. Wolfram Research (2007). "CosineDistance – Wolfram Language & System Documentation Center". wolfram.com.{{cite web}}: CS1 maint: numeric names: authors list (link)
  4. "COSINE DISTANCE, COSINE SIMILARITY, ANGULAR COSINE DISTANCE, ANGULAR COSINE SIMILARITY". www.itl.nist.gov. Retrieved 2020-07-11.
  5. Omori, Masae (2004). "Geological idea of Yanosuke Otuka, who built the foundation of neotectonics (geoscientist)". Earth Science. 58 (4): 256–259. doi:10.15080/agcjchikyukagaku.58.4_256.
  6. 1 2 Otsuka, Yanosuke (1936). "The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of the climate having become colder during the Pleistocene in Japan". Bulletin of the Biogeographical Society of Japan. 6 (16): 165–170.
  7. 1 2 Ochiai, Akira (1957). "Zoogeographical studies on the soleoid fishes found in Japan and its neighhouring regions-II". Bulletin of the Japanese Society of Scientific Fisheries. 22 (9): 526–530. doi: 10.2331/suisan.22.526 .
  8. Barkman, Jan J. (1958). Phytosociology and Ecology of Cryptogamic Epiphytes: Including a Taxonomic Survey and Description of Their Vegetation Units in Europe. Assen: Van Gorcum.
  9. Romesburg, H. Charles (1984). Cluster Analysis for Researchers. Belmont, California: Lifetime Learning Publications. p. 149.
  10. Thomson, Godfrey (1916). "A hierarchy without a general factor" (PDF). British Journal of Psychology. 8: 271–281.
  11. Howarth, Richard J. (2017). Dictionary of Mathematical Geosciences: With Historical Notes. Cham: Springer. p. 421. doi:10.1007/978-3-319-57315-1. ISBN   978-3-319-57314-4. S2CID   67081034. […] attributed by him to "Otsuka" [?A. Otsuka of the Dept. of Fisheries, Tohoku University].
  12. Hamai, Ikuso (1955). "Stratification of community by means of "community coefficient" (continued)". Japanese Journal of Ecology. 5 (1): 41–45. doi:10.18960/seitai.5.1_41.
  13. Connor, Richard (2016). A Tale of Four Metrics. Similarity Search and Applications. Tokyo: Springer. doi:10.1007/978-3-319-46759-7_16.
  14. Spruill, Marcus C. (2007). "Asymptotic distribution of coordinates on high dimensional spheres". Electronic Communications in Probability. 12: 234–247. doi: 10.1214/ECP.v12-1294 .
  15. "Distribution of dot products between two random unit vectors in RD". CrossValidated.
  16. Graham L. Giller (2012). "The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity". Giller Investments Research Notes (20121024/1). doi:10.2139/ssrn.2167044. S2CID   123332455.
  17. Schubert, Erich; Lang, Andreas; Feher, Gloria (2021). Reyes, Nora; Connor, Richard; Kriege, Nils; Kazempour, Daniyal; Bartolini, Ilaria; Schubert, Erich; Chen, Jian-Jia (eds.). "Accelerating Spherical k-Means". Similarity Search and Applications. Lecture Notes in Computer Science. 13058. Cham: Springer International Publishing: 217–231. arXiv: 2107.04074 . doi:10.1007/978-3-030-89657-7_17. ISBN   978-3-030-89657-7. S2CID   235790358.
  18. Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David (29 September 2014). "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model". Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043 . Retrieved 7 October 2014.
  19. Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana (2013). Advances in Computational Intelligence. Lecture Notes in Computer Science. Vol. 7630. LNAI 7630. pp. 1–11. doi:10.1007/978-3-642-37798-3_1. ISBN   978-3-642-37798-3.
  20. Novotný, Vít (2018). Implementation Notes for the Soft Cosine Measure. The 27th ACM International Conference on Information and Knowledge Management. Torun, Italy: Association for Computing Machinery. pp. 1639–1642. arXiv: 1808.09407 . doi:10.1145/3269206.3269317. ISBN   978-1-4503-6014-2.