Kendall tau distance

The Kendall tau rank distance is a metric (distance function) that counts the number of pairwise disagreements between two ranking lists. The larger the distance, the more dissimilar the two lists are. Kendall tau distance is also called bubble-sort distance since it is equivalent to the number of swaps that the bubble sort algorithm would take to place one list in the same order as the other list. The Kendall tau distance was created by Maurice Kendall.

Definition

The Kendall tau ranking distance between two lists τ₁ and τ₂ is

K(τ₁, τ₂) = |{(i, j) : i < j, (τ₁(i) < τ₁(j) ∧ τ₂(i) > τ₂(j)) ∨ (τ₁(i) > τ₁(j) ∧ τ₂(i) < τ₂(j))}|

where τ₁(i) and τ₂(i) are the rankings of the element i in τ₁ and τ₂ respectively.

K(τ₁, τ₂) will be equal to 0 if the two lists are identical and n(n − 1)/2 (where n is the list size) if one list is the reverse of the other.

Kendall tau distance may also be defined as

K(τ₁, τ₂) = Σ_{(i,j) ∈ P} K̄_{i,j}(τ₁, τ₂)

where

  - P is the set of unordered pairs of distinct elements in τ₁ and τ₂,
  - K̄_{i,j}(τ₁, τ₂) = 0 if i and j are in the same order in τ₁ and τ₂,
  - K̄_{i,j}(τ₁, τ₂) = 1 if i and j are in opposite order in τ₁ and τ₂.

Kendall tau distance can also be defined as the total number of discordant pairs.

Kendall tau distance in Rankings: A permutation (or ranking) is an array of N integers where each of the integers between 0 and N-1 appears exactly once.

The Kendall tau distance between two rankings is the number of pairs that are in different order in the two rankings. For example, the Kendall tau distance between 0 3 1 6 2 5 4 and 1 0 3 6 4 2 5 is four because the pairs 0-1, 3-1, 2-4, 5-4 are in different order in the two rankings, but all other pairs are in the same order. [1]
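The pairwise-disagreement count can be checked with a minimal brute-force sketch; the function name `kendall_tau_distance` is ours:

```python
from itertools import combinations

def kendall_tau_distance(rank1, rank2):
    """Count item pairs that appear in opposite relative order in the two rankings."""
    pos1 = {v: i for i, v in enumerate(rank1)}
    pos2 = {v: i for i, v in enumerate(rank2)}
    # A pair is discordant when the position differences have opposite signs.
    return sum(
        1
        for a, b in combinations(rank1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )

print(kendall_tau_distance([0, 3, 1, 6, 2, 5, 4], [1, 0, 3, 6, 4, 2, 5]))  # 4
```

This is O(n²) in the number of items, but it follows the definition directly.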

The normalized Kendall tau distance is K(τ₁, τ₂) / (n(n − 1)/2) and therefore lies in the interval [0,1].

If the Kendall tau distance function is computed on the lists directly, as K(L₁, L₂) instead of K(τ₁, τ₂) (where τ₁ and τ₂ are the rankings of the elements of L₁ and L₂ respectively), then the triangle inequality is not guaranteed. The triangle inequality can also fail when there are repetitions in the lists, and in those cases the function is no longer a metric.

Generalised versions of Kendall tau distance have been proposed to give weights to different items and different positions in the ranking. [2]

Comparison to Kendall tau rank correlation coefficient

The Kendall tau distance (K) must not be confused with the Kendall tau rank correlation coefficient (τ) used in statistics.

They are related by τ = 1 − 4K/(n(n − 1)), or more simply by τ = 1 − 2·Kₙ, where Kₙ is the normalised distance (see above).

The distance is a value between 0 and n(n − 1)/2. (The normalised distance is between 0 and 1.)

The correlation is between -1 and 1.

The distance between identical rankings is 0; the correlation between them is 1.

The distance between a ranking and its reverse is n(n − 1)/2; the correlation between them is −1.


For example, comparing the rankings A>B>C>D and A>B>C>D, the distance is 0 and the correlation is 1.

Comparing the rankings A>B>C>D and D>C>B>A, the distance is 6 and the correlation is −1.

Comparing the rankings A>B>C>D and B>D>A>C, the distance is 3 and the correlation is 0.
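The conversion from distance to correlation used in these comparisons can be written as a small helper; the function name `tau_from_distance` is ours:

```python
def tau_from_distance(k, n):
    """Convert a Kendall tau distance k over n items to the rank correlation tau = 1 - 4k / (n(n-1))."""
    return 1 - 4 * k / (n * (n - 1))

# The three example comparisons, each over n = 4 items:
print(tau_from_distance(0, 4))  # 1.0
print(tau_from_distance(6, 4))  # -1.0
print(tau_from_distance(3, 4))  # 0.0
```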

Example

Suppose one ranks a group of five people by height and by weight:

Person           A  B  C  D  E  Ranking
Rank by height   1  2  3  4  5  A>B>C>D>E
Rank by weight   3  4  1  2  5  C>D>A>B>E

Here person A is tallest and third-heaviest, B is the second-tallest and fourth-heaviest, and so on.

In order to calculate the Kendall tau distance between these two rankings, pair each person with every other person and count the number of times the values in list 1 are in the opposite order of the values in list 2.

Pair    Height   Weight   Count
(A,B)   1 < 2    3 < 4
(A,C)   1 < 3    3 > 1    X
(A,D)   1 < 4    3 > 2    X
(A,E)   1 < 5    3 < 5
(B,C)   2 < 3    4 > 1    X
(B,D)   2 < 4    4 > 2    X
(B,E)   2 < 5    4 < 5
(C,D)   3 < 4    1 < 2
(C,E)   3 < 5    1 < 5
(D,E)   4 < 5    2 < 5

Since there are four pairs whose values are in opposite order, the Kendall tau distance is 4. The normalized Kendall tau distance is 4 / (5(5 − 1)/2) = 4/10 = 0.4.

A value of 0.4 indicates that 40% of pairs differ in ordering between the two lists.
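The table above can be reproduced directly from the two rank vectors; a small sketch (the variable names are ours):

```python
from itertools import combinations

height = [1, 2, 3, 4, 5]  # ranks of persons A..E by height
weight = [3, 4, 1, 2, 5]  # ranks of persons A..E by weight

# A pair is discordant when the two rank differences have opposite signs.
discordant = sum(
    1
    for i, j in combinations(range(len(height)), 2)
    if (height[i] - height[j]) * (weight[i] - weight[j]) < 0
)
n = len(height)
print(discordant)                      # 4
print(discordant / (n * (n - 1) / 2))  # 0.4
```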

Computing the Kendall tau distance

A naive implementation in Python (using NumPy) is:

import numpy as np

def normalised_kendall_tau_distance(values1, values2):
    """Compute the normalised Kendall tau distance."""
    n = len(values1)
    assert len(values2) == n, "Both lists have to be of equal length"
    i, j = np.meshgrid(np.arange(n), np.arange(n))
    a = np.argsort(values1)
    b = np.argsort(values2)
    # Each discordant pair is counted twice (once per orientation),
    # so dividing by n(n - 1) yields the normalised distance.
    ndisordered = np.logical_or(
        np.logical_and(a[i] < a[j], b[i] > b[j]),
        np.logical_and(a[i] > a[j], b[i] < b[j]),
    ).sum()
    return ndisordered / (n * (n - 1))

However, this requires O(n²) memory, which is inefficient for large arrays.

Given two rankings τ₁, τ₂, it is possible to rename the items such that τ₁ = (1, 2, 3, ...). Then the problem of computing the Kendall tau distance reduces to computing the number of inversions in τ₂, that is, the number of index pairs i, j such that i < j while τ₂(i) > τ₂(j). There are several algorithms for calculating this number.

Here is a basic C implementation.

#include <stdbool.h>

int kendallTau(short x[], short y[], int len) {
    int i, j, v = 0;
    bool a, b;
    for (i = 0; i < len; i++) {
        for (j = i + 1; j < len; j++) {
            a = x[i] < x[j] && y[i] > y[j];
            b = x[i] > x[j] && y[i] < y[j];
            if (a || b)
                v++;
        }
    }
    return v;
}

float normalize(int kt, int len) {
    return kt / (len * (len - 1) / 2.0);
}
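The reduction to inversion counting described above admits an O(n log n) algorithm based on merge sort; a sketch in Python (the function names are ours):

```python
def count_inversions(a):
    """Return (sorted copy of a, number of pairs i < j with a[i] > a[j]), via merge sort."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_l = count_inversions(a[:mid])
    right, inv_r = count_inversions(a[mid:])
    merged, inv = [], inv_l + inv_r
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            # Every remaining element of left is greater than right[j]:
            # each such pair is an inversion.
            inv += len(left) - i
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv

def kendall_tau_by_inversions(tau1, tau2):
    """Relabel items so tau1 becomes the identity, then count inversions in tau2."""
    pos = {v: i for i, v in enumerate(tau1)}
    relabelled = [pos[v] for v in tau2]
    return count_inversions(relabelled)[1]

print(kendall_tau_by_inversions([0, 3, 1, 6, 2, 5, 4], [1, 0, 3, 6, 4, 2, 5]))  # 4
```

This reproduces the distance of 4 for the example rankings above while doing O(n log n) work instead of examining all O(n²) pairs.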

References

  1. "Sorting Applications".
  2. Kumar, Ravi; Vassilvitskii, Sergei (2010). "Generalized Distances between Rankings" (PDF).
  3. Ionescu, Vlad. "Calculating the number of "inversions" in a permutation". Stack Overflow. Retrieved 24 February 2017.
  4. Chan, Timothy M.; Pătraşcu, Mihai (2010). "Counting Inversions, Offline Orthogonal Range Counting, and Related Problems". Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms. p. 161. CiteSeerX 10.1.1.208.2715. doi:10.1137/1.9781611973075.15. ISBN 978-0-89871-701-3.