# Kendall rank correlation coefficient

Last updated

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient.

## Contents

It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938, [1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897. [2]

Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of −1) rank between the two variables.

Both Kendall's ${\displaystyle \tau }$ and Spearman's ${\displaystyle \rho }$ can be formulated as special cases of a more general correlation coefficient.

## Definition

Let ${\displaystyle (x_{1},y_{1}),...,(x_{n},y_{n})}$ be a set of observations of the joint random variables X and Y, such that all the values of (${\displaystyle x_{i}}$) and (${\displaystyle y_{i}}$) are unique (ties are neglected for simplicity). Any pair of observations ${\displaystyle (x_{i},y_{i})}$ and ${\displaystyle (x_{j},y_{j})}$, where ${\displaystyle i, are said to be concordant if the sort order of ${\displaystyle (x_{i},x_{j})}$ and ${\displaystyle (y_{i},y_{j})}$ agrees: that is, if either both ${\displaystyle x_{i}>x_{j}}$ and ${\displaystyle y_{i}>y_{j}}$ holds or both ${\displaystyle x_{i} and ${\displaystyle y_{i}; otherwise they are said to be discordant.

The Kendall τ coefficient is defined as:

${\displaystyle \tau ={\frac {({\text{number of concordant pairs}})-({\text{number of discordant pairs}})}{n \choose 2}}.}$ [3]

Where ${\displaystyle {n \choose 2}={n(n-1) \over 2}}$ is the binomial coefficient for the number of ways to choose two items from n items.

### Properties

The denominator is the total number of pair combinations, so the coefficient must be in the range −1  τ  1.

• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value 1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the coefficient has value −1.
• If X and Y are independent, then we would expect the coefficient to be approximately zero.
• An explicit expression for Kendall's rank coefficient is ${\displaystyle \tau ={\frac {2}{n(n-1)}}\sum _{i.

## Hypothesis test

The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of X or Y or the distribution of (X,Y).

Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with mean zero and variance

${\displaystyle {\frac {2(2n+5)}{9n(n-1)}}}$. [4]

## Accounting for ties

A pair ${\displaystyle \{(x_{i},x_{j}),(y_{i},y_{j})\}}$ is said to be tied if ${\displaystyle x_{i}=x_{j}}$ or ${\displaystyle y_{i}=y_{j}}$; a tied pair is neither concordant nor discordant. When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [−1, 1]:

### Tau-a

The Tau-a statistic tests the strength of association of the cross tabulations. Both variables have to be ordinal. Tau-a will not make any adjustment for ties. It is defined as:

${\displaystyle \tau _{A}={\frac {n_{c}-n_{d}}{n_{0}}}}$

where nc, nd and n0 are defined as in the next section.

### Tau-b

The Tau-b statistic, unlike Tau-a, makes adjustments for ties. [5] Values of Tau-b range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

The Kendall Tau-b coefficient is defined as:

${\displaystyle \tau _{B}={\frac {n_{c}-n_{d}}{\sqrt {(n_{0}-n_{1})(n_{0}-n_{2})}}}}$

where

{\displaystyle {\begin{aligned}n_{0}&=n(n-1)/2\\n_{1}&=\sum _{i}t_{i}(t_{i}-1)/2\\n_{2}&=\sum _{j}u_{j}(u_{j}-1)/2\\n_{c}&={\text{Number of concordant pairs}}\\n_{d}&={\text{Number of discordant pairs}}\\t_{i}&={\text{Number of tied values in the }}i^{\text{th}}{\text{ group of ties for the first quantity}}\\u_{j}&={\text{Number of tied values in the }}j^{\text{th}}{\text{ group of ties for the second quantity}}\end{aligned}}}

A simple algorithm developed in BASIC computes Tau-b coefficient using an alternative formula. [6]

Be aware that some statistical packages, e.g. SPSS, use alternative formulas for computational efficiency, with double the 'usual' number of concordant and discordant pairs. [7]

### Tau-c

Tau-c (also called Stuart-Kendall Tau-c) [8] is more suitable than Tau-b for the analysis of data based on non-square (i.e. rectangular) contingency tables. [8] [9] So use Tau-b if the underlying scale of both variables has the same number of possible values (before ranking) and Tau-c if they differ. For instance, one variable might be scored on a 5-point scale (very good, good, average, bad, very bad), whereas the other might be based on a finer 10-point scale.

The Kendall Tau-c coefficient is defined as: [9]

${\displaystyle \tau _{C}={\frac {2(n_{c}-n_{d})}{n^{2}{\frac {(m-1)}{m}}}}}$

where

{\displaystyle {\begin{aligned}n_{c}&={\text{Number of concordant pairs}}\\n_{d}&={\text{Number of discordant pairs}}\\r&={\text{Number of rows}}\\c&={\text{Number of columns}}\\m&=\min(r,c)\end{aligned}}}

## Significance tests

When two quantities are statistically independent, the distribution of ${\displaystyle \tau }$ is not easily characterizable in terms of known distributions. However, for ${\displaystyle \tau _{A}}$ the following statistic, ${\displaystyle z_{A}}$, is approximately distributed as a standard normal when the variables are statistically independent:

${\displaystyle z_{A}={3(n_{c}-n_{d}) \over {\sqrt {n(n-1)(2n+5)/2}}}}$

Thus, to test whether two variables are statistically dependent, one computes ${\displaystyle z_{A}}$, and finds the cumulative probability for a standard normal distribution at ${\displaystyle -|z_{A}|}$. For a 2-tailed test, multiply that number by two to obtain the p-value. If the p-value is below a given significance level, one rejects the null hypothesis (at that significance level) that the quantities are statistically independent.

Numerous adjustments should be added to ${\displaystyle z_{A}}$ when accounting for ties. The following statistic, ${\displaystyle z_{B}}$, has the same distribution as the ${\displaystyle \tau _{B}}$ distribution, and is again approximately equal to a standard normal distribution when the quantities are statistically independent:

${\displaystyle z_{B}={n_{c}-n_{d} \over {\sqrt {v}}}}$

where

${\displaystyle {\begin{array}{ccl}v&=&(v_{0}-v_{t}-v_{u})/18+v_{1}+v_{2}\\v_{0}&=&n(n-1)(2n+5)\\v_{t}&=&\sum _{i}t_{i}(t_{i}-1)(2t_{i}+5)\\v_{u}&=&\sum _{j}u_{j}(u_{j}-1)(2u_{j}+5)\\v_{1}&=&\sum _{i}t_{i}(t_{i}-1)\sum _{j}u_{j}(u_{j}-1)/(2n(n-1))\\v_{2}&=&\sum _{i}t_{i}(t_{i}-1)(t_{i}-2)\sum _{j}u_{j}(u_{j}-1)(u_{j}-2)/(9n(n-1)(n-2))\end{array}}}$

This is sometimes referred to as the Mann-Kendall test. [10]

## Algorithms

The direct computation of the numerator ${\displaystyle n_{c}-n_{d}}$, involves two nested iterations, as characterized by the following pseudocode:

numer := 0 for i := 2..N dofor j := 1..(i − 1) do         numer := numer + sign(x[i] − x[j]) × sign(y[i] − y[j]) return numer

Although quick to implement, this algorithm is ${\displaystyle O(n^{2})}$ in complexity and becomes very slow on large samples. A more sophisticated algorithm [11] built upon the Merge Sort algorithm can be used to compute the numerator in ${\displaystyle O(n\cdot \log {n})}$ time.

Begin by ordering your data points sorting by the first quantity, ${\displaystyle x}$, and secondarily (among ties in ${\displaystyle x}$) by the second quantity, ${\displaystyle y}$. With this initial ordering, ${\displaystyle y}$ is not sorted, and the core of the algorithm consists of computing how many steps a Bubble Sort would take to sort this initial ${\displaystyle y}$. An enhanced Merge Sort algorithm, with ${\displaystyle O(n\log n)}$ complexity, can be applied to compute the number of swaps, ${\displaystyle S(y)}$, that would be required by a Bubble Sort to sort ${\displaystyle y_{i}}$. Then the numerator for ${\displaystyle \tau }$ is computed as:

${\displaystyle n_{c}-n_{d}=n_{0}-n_{1}-n_{2}+n_{3}-2S(y),}$

where ${\displaystyle n_{3}}$ is computed like ${\displaystyle n_{1}}$ and ${\displaystyle n_{2}}$, but with respect to the joint ties in ${\displaystyle x}$ and ${\displaystyle y}$.

A Merge Sort partitions the data to be sorted, ${\displaystyle y}$ into two roughly equal halves, ${\displaystyle y_{\mathrm {left} }}$ and ${\displaystyle y_{\mathrm {right} }}$, then sorts each half recursive, and then merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is equal to:

${\displaystyle S(y)=S(y_{\mathrm {left} })+S(y_{\mathrm {right} })+M(Y_{\mathrm {left} },Y_{\mathrm {right} })}$

where ${\displaystyle Y_{\mathrm {left} }}$ and ${\displaystyle Y_{\mathrm {right} }}$ are the sorted versions of ${\displaystyle y_{\mathrm {left} }}$ and ${\displaystyle y_{\mathrm {right} }}$, and ${\displaystyle M(\cdot ,\cdot )}$ characterizes the Bubble Sort swap-equivalent for a merge operation. ${\displaystyle M(\cdot ,\cdot )}$ is computed as depicted in the following pseudo-code:

function M(L[1..n], R[1..m]) is     i := 1     j := 1     nSwaps := 0     while i ≤ n and j ≤ m doif R[j] < L[i] then             nSwaps := nSwaps + n − i + 1             j := j + 1         else             i := i + 1     return nSwaps

A side effect of the above steps is that you end up with both a sorted version of ${\displaystyle x}$ and a sorted version of ${\displaystyle y}$. With these, the factors ${\displaystyle t_{i}}$ and ${\displaystyle u_{j}}$ used to compute ${\displaystyle \tau _{B}}$ are easily obtained in a single linear-time pass through the sorted arrays.

## Related Research Articles

Autocorrelation, sometimes known as serial correlation in the discrete time case, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the price of a good and the quantity the consumers are willing to purchase, as it is depicted in the so-called demand curve.

In statistics, the Pearson correlation coefficient ― also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ― is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1.

In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter (rho) or as , is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.

In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.

In statistics, a contingency table is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904.

In system analysis, among other fields of study, a linear time-invariant system is a system that produces an output signal from any input signal subject to the constraints of linearity and time-invariance; these terms are briefly defined below. These properties apply to many important physical systems, in which case the response y(t) of the system to an arbitrary input x(t) can be found directly using convolution: y(t) = x(t) ∗ h(t) where h(t) is called the system's impulse response and ∗ represents convolution. What's more, there are systematic methods for solving any such system, whereas systems not meeting both properties are generally more difficult to solve analytically. A good example of an LTI system is any electrical circuit consisting of resistors, capacitors, inductors and linear amplifiers.

An activity coefficient is a factor used in thermodynamics to account for deviations from ideal behaviour in a mixture of chemical substances. In an ideal mixture, the microscopic interactions between each pair of chemical species are the same and, as a result, properties of the mixtures can be expressed directly in terms of simple concentrations or partial pressures of the substances present e.g. Raoult's law. Deviations from ideality are accommodated by modifying the concentration by an activity coefficient. Analogously, expressions involving gases can be adjusted for non-ideality by scaling partial pressures by a fugacity coefficient.

The control variates method is a variance reduction technique used in Monte Carlo methods. It exploits information about the errors in estimates of known quantities to reduce the error of an estimate of an unknown quantity.

In statistics, a rank correlation is any of several statistics that measure an ordinal association—the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. For example, two common nonparametric methods of significance that use rank correlation are the Mann–Whitney U test and the Wilcoxon signed-rank test.

Kendall's W is a non-parametric statistic. It is a normalization of the statistic of the Friedman test, and can be used for assessing agreement among raters. Kendall's W ranges from 0 to 1.

In statistics, a concordant pair is a pair of observations, each on two variables, (X1,Y1) and (X2,Y2), having the property that

The Kendall tau rank distance is a metric that counts the number of pairwise disagreements between two ranking lists. The larger the distance, the more dissimilar the two lists are. Kendall tau distance is also called bubble-sort distance since it is equivalent to the number of swaps that the bubble sort algorithm would take to place one list in the same order as the other list. The Kendall tau distance was created by Maurice Kendall.

In statistics, Goodman and Kruskal's gamma is a measure of rank correlation, i.e., the similarity of the orderings of the data when ranked by each of the quantities. It measures the strength of association of the cross tabulated data when both variables are measured at the ordinal level. It makes no adjustment for either table size or ties. Values range from −1 to +1. A value of zero indicates the absence of association.

Fluorescence cross-correlation spectroscopy (FCCS) was introduced by Eigen and Rigler in 1994 and experimentally realized by Schwille in 1997. It is essentially an extension of the fluorescence correlation spectroscopy (FCS) procedure by utilizing two differentially colored molecules, instead of one. In other words, coincident green and red intensity fluctuations of distinct molecules correlate if green and red labeled particles are moving together through a predefined confocal volume. As a result, FCCS provides a highly sensitive measurement of molecular interactions independent of diffusion rate. This is an important advancement, given that diffusion rate depends only weakly on the size of the molecular complex.

In statistics, L-moments are a sequence of statistics used to summarize the shape of a probability distribution. They are linear combinations of order statistics (L-statistics) analogous to conventional moments, and can be used to calculate quantities analogous to standard deviation, skewness and kurtosis, termed the L-scale, L-skewness and L-kurtosis respectively. Standardised L-moments are called L-moment ratios and are analogous to standardized moments. Just as for conventional moments, a theoretical distribution has a set of population L-moments. Sample L-moments can be defined for a sample from the population, and can be used as estimators of the population L-moments.

In statistical mechanics, the Griffiths inequality, sometimes also called Griffiths–Kelly–Sherman inequality or GKS inequality, named after Robert B. Griffiths, is a correlation inequality for ferromagnetic spin systems. Informally, it says that in ferromagnetic spin systems, if the 'a-priori distribution' of the spin is invariant under spin flipping, the correlation of any monomial of the spins is non-negative; and the two point correlation of two monomial of the spins is non-negative.

In machine learning, a Ranking SVM is a variant of the support vector machine algorithm, which is used to solve certain ranking problems. The ranking SVM algorithm was published by Thorsten Joachims in 2002. The original purpose of the algorithm was to improve the performance of an internet search engine. However, it was found that Ranking SVM also can be used to solve other problems such as Rank SIFT.

In statistics, Somers’ D, sometimes incorrectly referred to as Somer’s D, is a measure of ordinal association between two possibly dependent random variables X and Y. Somers’ D takes values between when all pairs of the variables disagree and when all pairs of the variables agree. Somers’ D is named after Robert H. Somers, who proposed it in 1962.

Tau functions are an important ingredient in the modern theory of integrable systems, and have numerous applications in a variety of other domains. They were originally introduced by Ryogo Hirota in his direct method approach to soliton equations, based on expressing them in an equivalent bilinear form. The term Tau function, or -function, was first used systematically by Mikio Sato and his students in the specific context of the Kadomtsev–Petviashvili equation, and related integrable hierarchies. It is a central ingredient in the theory of solitons. Tau functions also appear as matrix model partition functions in the spectral theory of Random Matrices, and may also serve as generating functions, in the sense of combinatorics and enumerative geometry, especially in relation to moduli spaces of Riemann surfaces, and enumeration of branched coverings, or so-called Hurwitz numbers.

## References

1. Kendall, M. (1938). "A New Measure of Rank Correlation". Biometrika . 30 (1–2): 81–89. doi:10.1093/biomet/30.1-2.81. JSTOR   2332226.
2. Kruskal, W. H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association . 53 (284): 814–861. doi:10.2307/2281954. JSTOR   2281954. MR   0100941.
3. Nelsen, R.B. (2001) [1994], "Kendall tau metric", Encyclopedia of Mathematics , EMS Press
4. Prokhorov, A.V. (2001) [1994], "Kendall coefficient of rank correlation", Encyclopedia of Mathematics , EMS Press
5. Agresti, A. (2010). Analysis of Ordinal Categorical Data (Second ed.). New York: John Wiley & Sons. ISBN   978-0-470-08289-8.
6. Alfred Brophy. "An algorithm and program for calculation of Kendall's rank correlation coefficient" (PDF).
7. IBM (2016). IBM SPSS Statistics 24 Algorithms. IBM. p. 168. Retrieved 31 August 2017.
8. Berry, K. J.; Johnston, J. E.; Zahran, S.; Mielke, P. W. (2009). "Stuart's tau measure of effect size for ordinal variables: Some methodological considerations". Behavior Research Methods. 41 (4): 1144–1148. doi:. PMID   19897822.
9. Stuart, A. (1953). "The Estimation and Comparison of Strengths of Association in Contingency Tables". Biometrika . 40 (1–2): 105–110. doi:10.2307/2333101. JSTOR   2333101.
10. Knight, W. (1966). "A Computer Method for Calculating Kendall's Tau with Ungrouped Data". Journal of the American Statistical Association . 61 (314): 436–439. doi:10.2307/2282833. JSTOR   2282833.