Kendall tau distance

The Kendall tau rank distance is a metric (distance function) that counts the number of pairwise disagreements between two ranking lists. The larger the distance, the more dissimilar the two lists are. Kendall tau distance is also called bubble-sort distance since it is equivalent to the number of swaps that the bubble sort algorithm would take to place one list in the same order as the other list. The Kendall tau distance was created by Maurice Kendall.

Definition

The Kendall tau ranking distance between two lists τ1 and τ2 is

K_d(τ1, τ2) = |{(i, j) : i < j, (τ1(i) < τ1(j) ∧ τ2(i) > τ2(j)) ∨ (τ1(i) > τ1(j) ∧ τ2(i) < τ2(j))}|

where τ1(i) and τ2(i) are the rankings of the element i in τ1 and τ2 respectively.

K_d(τ1, τ2) will be equal to 0 if the two lists are identical and n(n − 1)/2 (where n is the list size) if one list is the reverse of the other.

Kendall tau distance may also be defined as

K_d(τ1, τ2) = Σ_{i<j} K̄_ij(τ1, τ2)

where K̄_ij(τ1, τ2) = 0 if i and j are in the same order in τ1 and τ2, and K̄_ij(τ1, τ2) = 1 if they are in the opposite order.

Kendall tau distance can also be defined as the total number of discordant pairs.

Kendall tau distance in rankings: A permutation (or ranking) is an array of N integers where each of the integers between 0 and N − 1 appears exactly once.

The Kendall tau distance between two rankings is the number of pairs that are in different order in the two rankings. For example, the Kendall tau distance between 0 3 1 6 2 5 4 and 1 0 3 6 4 2 5 is four because the pairs 0-1, 3-1, 2-4, 5-4 are in different order in the two rankings, but all other pairs are in the same order.[1]
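This pair-counting definition can be checked directly. The following brute-force Python sketch (the function name is mine, for illustration only) counts the value pairs whose relative order differs between the two permutations:

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Brute-force count of value pairs whose relative order differs."""
    pos1 = {v: i for i, v in enumerate(r1)}  # position of each value in r1
    pos2 = {v: i for i, v in enumerate(r2)}  # position of each value in r2
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )

print(kendall_tau_distance([0, 3, 1, 6, 2, 5, 4], [1, 0, 3, 6, 4, 2, 5]))  # 4
```

Running it on the two example rankings reproduces the distance of four quoted above.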


The normalized Kendall tau distance is

K_n = K_d / (n(n − 1)/2)

and therefore lies in the interval [0, 1].


If the Kendall tau distance function is applied to the raw lists, as K_d(L1, L2) instead of K_d(τ1, τ2) (where τ1 and τ2 are the rankings of the elements of L1 and L2 respectively), then the triangle inequality is not guaranteed. The triangle inequality can also fail when there are repetitions in the lists. In those cases the function is no longer a metric.

Generalised versions of Kendall tau distance have been proposed to give weights to different items and different positions in the ranking.[2]

Comparison to Kendall tau rank correlation coefficient

The Kendall tau distance (K_d) must not be confused with the Kendall tau rank correlation coefficient (τ) used in statistics.

They are related by

τ = 1 − 4·K_d / (n(n − 1)),

or more simply by τ = 1 − 2·K_n, where K_n is the normalised distance (see above).

Still, they are fundamentally different concepts.

The distance is a value between 0 and n(n − 1)/2. (The normalised distance is between 0 and 1.)

The correlation is between −1 and 1.

The distance between equal rankings is 0, and the correlation between equal rankings is 1.

The distance between a ranking and its reverse is n(n − 1)/2, and the correlation between them is −1.


For example, comparing the rankings A>B>C>D and A>B>C>D, the distance is 0 and the correlation is 1.

Comparing the rankings A>B>C>D and D>C>B>A, the distance is 6 and the correlation is −1.

Comparing the rankings A>B>C>D and B>D>A>C, the distance is 3 and the correlation is 0.
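The three comparisons can be reproduced in a few lines of Python. The helper below (a naive sketch, not a library function) counts discordant pairs from rank vectors and applies the relation τ = 1 − 4·K_d/(n(n − 1)):

```python
from itertools import combinations

def distance_and_correlation(x, y):
    """Return (Kendall tau distance, Kendall tau correlation) for two
    rank vectors, using tau = 1 - 4*K_d / (n*(n - 1))."""
    n = len(x)
    # A pair (i, j) is discordant when the two rankings order it oppositely.
    k = sum(1 for i, j in combinations(range(n), 2)
            if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return k, 1 - 4 * k / (n * (n - 1))

# Rank vectors for A>B>C>D versus itself, its reversal, and B>D>A>C:
print(distance_and_correlation([1, 2, 3, 4], [1, 2, 3, 4]))  # (0, 1.0)
print(distance_and_correlation([1, 2, 3, 4], [4, 3, 2, 1]))  # (6, -1.0)
print(distance_and_correlation([1, 2, 3, 4], [3, 1, 4, 2]))  # (3, 0.0)
```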

Example

Suppose one ranks a group of five people by height and by weight:

Person          A   B   C   D   E   Ranking
Rank by height  1   2   3   4   5   A>B>C>D>E
Rank by weight  3   4   1   2   5   C>D>A>B>E

Here person A is the tallest and third-heaviest, B is the second-tallest and fourth-heaviest, and so on.

In order to calculate the Kendall tau distance, pair each person with every other person and count the number of times the values in list 1 are in the opposite order of the values in list 2.

Pair    Height   Weight   Count
(A,B)   1 < 2    3 < 4
(A,C)   1 < 3    3 > 1    X
(A,D)   1 < 4    3 > 2    X
(A,E)   1 < 5    3 < 5
(B,C)   2 < 3    4 > 1    X
(B,D)   2 < 4    4 > 2    X
(B,E)   2 < 5    4 < 5
(C,D)   3 < 4    1 < 2
(C,E)   3 < 5    1 < 5
(D,E)   4 < 5    2 < 5

Since there are four pairs whose values are in opposite order, the Kendall tau distance is 4. The normalized Kendall tau distance is

4 / (5(5 − 1)/2) = 4/10 = 0.4.

A value of 0.4 indicates that 40% of pairs differ in ordering between the two lists.
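As a check, the example can be reproduced with a few lines of Python from the rank vectors in the tables above (a brute-force count, for illustration only):

```python
from itertools import combinations

height = [1, 2, 3, 4, 5]  # ranks of persons A..E by height
weight = [3, 4, 1, 2, 5]  # ranks of persons A..E by weight

n = len(height)
# Count pairs ordered oppositely by the two rankings.
discordant = sum(
    1 for i, j in combinations(range(n), 2)
    if (height[i] - height[j]) * (weight[i] - weight[j]) < 0
)
print(discordant)                       # 4
print(discordant / (n * (n - 1) / 2))   # 0.4
```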

Computing the Kendall tau distance

A naive implementation in Python (using NumPy) is:

import numpy as np

def normalised_kendall_tau_distance(values1, values2):
    """Compute the normalised Kendall tau distance."""
    n = len(values1)
    assert len(values2) == n, "Both lists have to be of equal length"
    i, j = np.meshgrid(np.arange(n), np.arange(n))
    a = np.argsort(values1)
    b = np.argsort(values2)
    ndisordered = np.logical_or(
        np.logical_and(a[i] < a[j], b[i] > b[j]),
        np.logical_and(a[i] > a[j], b[i] < b[j]),
    ).sum()
    return ndisordered / (n * (n - 1))

However, this requires O(n²) memory, which is inefficient for large arrays.

Given two rankings τ1, τ2, it is possible to rename the items such that τ1 = (1, 2, 3, ...). Then the problem of computing the Kendall tau distance reduces to computing the number of inversions in τ2: the number of index pairs i, j such that i < j while τ2(i) > τ2(j). There are several algorithms for calculating this number.[3][4]

Here is a basic C implementation.

#include <stdbool.h>

int kendallTau(short x[], short y[], int len)
{
    int i, j, v = 0;
    bool a, b;
    for (i = 0; i < len; i++) {
        for (j = i + 1; j < len; j++) {
            a = x[i] < x[j] && y[i] > y[j];
            b = x[i] > x[j] && y[i] < y[j];
            if (a || b)
                v++;
        }
    }
    return v;  /* v is never negative, so no abs() is needed */
}

float normalize(int kt, int len)
{
    return kt / (len * (len - 1) / 2.0);
}
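For large inputs, the quadratic pairwise loop can be replaced by an inversion count. One standard O(n log n) approach is merge sort with inversion counting; the following Python sketch assumes the items have already been relabelled so that the first ranking is the identity, leaving only the second ranking to scan:

```python
def count_inversions(a):
    """Return (sorted copy of a, number of pairs i < j with a[i] > a[j]),
    computed by merge sort in O(n log n)."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_left = count_inversions(a[:mid])
    right, inv_right = count_inversions(a[mid:])
    merged, inv = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] precedes left[i..]: each remaining left element
            # forms one inversion with it.
            merged.append(right[j])
            j += 1
            inv += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv
```

For the height/weight example above, relabelling by height leaves the weight ranks (3, 4, 1, 2, 5), and count_inversions recovers the distance 4.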


References

  1. "Sorting Applications".
  2. Ravi Kumar and Sergei Vassilvitskii (2010). Generalized Distances between Rankings (PDF).
  3. Ionescu, Vlad. "calculating the number of "inversions" in a permutation". Stack overflow. Retrieved 24 February 2017.
  4. Chan, Timothy M.; Pătraşcu, Mihai (2010). "Counting Inversions, Offline Orthogonal Range Counting, and Related Problems". Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms. p. 161. CiteSeerX   10.1.1.208.2715 . doi:10.1137/1.9781611973075.15. ISBN   978-0-89871-701-3.