# Rank correlation

In statistics, a rank correlation is any of several statistics that measure an ordinal association—the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. For example, two common nonparametric methods of significance that use rank correlation are the Mann–Whitney U test and the Wilcoxon signed-rank test.

A ranking is a relationship between a set of items such that, for any two items, the first is either 'ranked higher than', 'ranked lower than' or 'ranked equal to' the second. In mathematics, this is known as a weak order or total preorder of objects. It is not necessarily a total order of objects because two different objects can have the same ranking. The rankings themselves are totally ordered. For example, materials are totally preordered by hardness, while degrees of hardness are totally ordered. If two items are the same in rank it is considered a tie.

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from interval and ratio scales by not having category widths that represent equal increments of the underlying attribute.

## Context

If, for example, one variable is the identity of a college basketball program and another variable is the identity of a college football program, one could test for a relationship between the poll rankings of the two types of program: do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank correlation coefficient can measure that relationship, and the measure of significance of the rank correlation coefficient can show whether the measured relationship is small enough to likely be a coincidence.

If there is only one variable, the identity of a college football program, but it is subject to two different poll rankings (say, one by coaches and one by sportswriters), then the similarity of the two different polls' rankings can be measured with a rank correlation coefficient.

As another example, in a contingency table with low income, medium income, and high income in the row variable and educational level (no high school, high school, university) in the column variable, [1] a rank correlation measures the relationship between income and educational level.

In statistics, a contingency table is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. Such tables are heavily used in survey research, business intelligence, engineering and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904.

## Correlation coefficients

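Some of the more popular rank correlation statistics include

• Spearman's ${\displaystyle \rho }$
• Kendall's ${\displaystyle \tau }$
• Goodman and Kruskal's ${\displaystyle \gamma }$
• Somers' D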

An increasing rank correlation coefficient implies increasing agreement between rankings. The coefficient lies in the interval [−1, 1] and assumes the value:

• 1 if the agreement between the two rankings is perfect; the two rankings are the same.
• 0 if the rankings are completely independent.
• −1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the other.

Following Diaconis (1988), a ranking can be seen as a permutation of a set of objects. Thus we can look at observed rankings as data obtained when the sample space is (identified with) a symmetric group. We can then introduce a metric, making the symmetric group into a metric space. Different metrics will correspond to different rank correlations.

In mathematics, permutation is the act of arranging the members of a set into a sequence or order, or, if the set is already ordered, rearranging (reordering) its elements—a process called permuting. Permutations differ from combinations, which are selections of some members of a set regardless of order. For example, written as tuples, there are six permutations of the set {1,2,3}, namely: (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), and (3,2,1). These are all the possible orderings of this three-element set. Anagrams of words whose letters are different are also permutations: the letters are already ordered in the original word, and the anagram is a reordering of the letters. The study of permutations of finite sets is an important topic in the fields of combinatorics and group theory.

In abstract algebra, the symmetric group defined over any set is the group whose elements are all the bijections from the set to itself, and whose group operation is the composition of functions. In particular, the finite symmetric group Sn defined over a finite set of n symbols consists of the permutation operations that can be performed on the n symbols. Since there are n! possible permutation operations that can be performed on a tuple composed of n symbols, it follows that the number of elements of the symmetric group Sn is n!.
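
The metric view of rankings can be made concrete with a small sketch. The two distances below (the number of discordant pairs, and the sum of squared rank differences) are common choices of metric on permutations. The function names and the rescaling `1 - 2 * d / d_max` are illustrative assumptions rather than notation taken from Diaconis, and ties are assumed to be absent.

```python
from itertools import combinations

def kendall_distance(p, q):
    """Number of pairs of objects that rankings p and q order differently."""
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )

def spearman_distance(p, q):
    """Sum of squared differences between the ranks in p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def to_correlation(d, d_max):
    """Rescale a distance in [0, d_max] to a coefficient in [-1, 1]."""
    return 1 - 2 * d / d_max

n = 4
identity = [1, 2, 3, 4]
one_swap = [2, 1, 3, 4]
reverse = [4, 3, 2, 1]

# Largest possible distances, both attained by the reversed ranking.
kendall_max = n * (n - 1) // 2      # 6
spearman_max = (n ** 3 - n) // 3    # 20

tau_like = to_correlation(kendall_distance(identity, one_swap), kendall_max)
rho_like = to_correlation(spearman_distance(identity, one_swap), spearman_max)
print(tau_like, rho_like)
```

For the single swap the two rescaled distances give different values, which illustrates how different metrics lead to different rank correlations.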

## General correlation coefficient

Kendall (1944) showed that his ${\displaystyle \tau }$ (tau) and Spearman's ${\displaystyle \rho }$ (rho) are particular cases of a general correlation coefficient.

Suppose we have a set of ${\displaystyle n}$ objects, which are being considered in relation to two properties, represented by ${\displaystyle x}$ and ${\displaystyle y}$, forming the sets of values ${\displaystyle \{x_{i}\}_{i\leq n}}$ and ${\displaystyle \{y_{i}\}_{i\leq n}}$. To any pair of individuals, say the ${\displaystyle i}$-th and the ${\displaystyle j}$-th, we assign an ${\displaystyle x}$-score, denoted by ${\displaystyle a_{ij}}$, and a ${\displaystyle y}$-score, denoted by ${\displaystyle b_{ij}}$. The only requirement for these functions is that they be anti-symmetric, so ${\displaystyle a_{ij}=-a_{ji}}$ and ${\displaystyle b_{ij}=-b_{ji}}$. (Note that in particular ${\displaystyle a_{ij}=b_{ij}=0}$ if ${\displaystyle i=j}$.) Then the generalized correlation coefficient ${\displaystyle \Gamma }$ is defined as

${\displaystyle \Gamma ={\frac {\sum _{i,j=1}^{n}a_{ij}b_{ij}}{\sqrt {\sum _{i,j=1}^{n}a_{ij}^{2}\sum _{i,j=1}^{n}b_{ij}^{2}}}}}$

Equivalently, if all coefficients are collected into matrices ${\displaystyle A=(a_{ij})}$ and ${\displaystyle B=(b_{ij})}$, with ${\displaystyle A^{\textsf {T}}=-A}$ and ${\displaystyle B^{\textsf {T}}=-B}$, then

${\displaystyle \Gamma ={\frac {\langle A,B\rangle _{\rm {F}}}{\|A\|_{\rm {F}}\|B\|_{\rm {F}}}}}$

where ${\displaystyle \langle A,B\rangle _{\rm {F}}}$ is the Frobenius inner product and ${\displaystyle \|A\|_{\rm {F}}={\sqrt {\langle A,A\rangle _{\rm {F}}}}}$ the Frobenius norm. In particular, the general correlation coefficient is the cosine of the angle between the matrices ${\displaystyle A}$ and ${\displaystyle B}$.
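
As a minimal sketch of this definition, the following computes ${\displaystyle \Gamma }$ as the normalized Frobenius inner product of two score matrices stored as nested lists. The helper names `general_gamma` and `difference_scores` are hypothetical, and the difference score is only one possible anti-symmetric choice.

```python
import math

def general_gamma(a, b):
    """Cosine of the angle between score matrices a and b (Frobenius inner product)."""
    n = len(a)
    inner = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    norm_a = math.sqrt(sum(a[i][j] ** 2 for i in range(n) for j in range(n)))
    norm_b = math.sqrt(sum(b[i][j] ** 2 for i in range(n) for j in range(n)))
    return inner / (norm_a * norm_b)

def difference_scores(values):
    """One possible anti-symmetric score: a_ij = values[j] - values[i]."""
    return [[vj - vi for vj in values] for vi in values]

x = [2.0, 5.0, 1.0, 4.0]
a = difference_scores(x)
neg_a = [[-v for v in row] for row in a]

same = general_gamma(a, a)          # identical scores: angle 0
opposite = general_gamma(a, neg_a)  # reversed scores: angle pi
print(same, opposite)
```

Because ${\displaystyle \Gamma }$ is a cosine, identical score matrices give a value of 1 and exactly opposite ones give −1, up to floating-point rounding.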

### Kendall's ${\displaystyle \tau }$ as a particular case

If ${\displaystyle r_{i}}$, ${\displaystyle s_{i}}$ are the ranks of the ${\displaystyle i}$-th member according to the ${\displaystyle x}$-quality and ${\displaystyle y}$-quality respectively, then we can define

${\displaystyle a_{ij}=\operatorname {sgn}(r_{j}-r_{i}),\quad b_{ij}=\operatorname {sgn}(s_{j}-s_{i}).}$

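Summed over all ordered pairs ${\displaystyle (i,j)}$, ${\displaystyle \sum a_{ij}b_{ij}}$ is twice the number of concordant pairs minus the number of discordant pairs, because each unordered pair is counted once as ${\displaystyle (i,j)}$ and once as ${\displaystyle (j,i)}$ (see Kendall tau rank correlation coefficient). In the absence of ties, the sum ${\displaystyle \sum a_{ij}^{2}}$ is just ${\displaystyle n(n-1)}$, the number of nonzero terms ${\displaystyle a_{ij}}$, as is ${\displaystyle \sum b_{ij}^{2}}$. Thus in this case,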

${\displaystyle \Gamma ={\frac {2\,(({\text{number of concordant pairs}})-({\text{number of discordant pairs}}))}{\sqrt {n(n-1)n(n-1)}}}={\text{Kendall's }}\tau }$
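
The reduction to Kendall's ${\displaystyle \tau }$ can be checked numerically. The sketch below, which assumes rankings without ties and uses hypothetical function names, computes ${\displaystyle \Gamma }$ from sign scores over all ordered pairs and compares it with a direct count of concordant and discordant pairs.

```python
from itertools import combinations

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def gamma_sign_scores(r, s):
    """General coefficient with a_ij = sgn(r_j - r_i), b_ij = sgn(s_j - s_i)."""
    idx = range(len(r))
    num = sum(sign(r[j] - r[i]) * sign(s[j] - s[i]) for i in idx for j in idx)
    sum_a2 = sum(sign(r[j] - r[i]) ** 2 for i in idx for j in idx)
    sum_b2 = sum(sign(s[j] - s[i]) ** 2 for i in idx for j in idx)
    return num / (sum_a2 * sum_b2) ** 0.5

def tau_by_counting(r, s):
    """Kendall's tau from concordant and discordant unordered pairs (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(r)), 2):
        product = (r[j] - r[i]) * (s[j] - s[i])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(r)
    return 2 * (concordant - discordant) / (n * (n - 1))

r = [1, 2, 3, 4, 5]
s = [2, 1, 4, 3, 5]
print(gamma_sign_scores(r, s), tau_by_counting(r, s))
```

For these two rankings there are 8 concordant and 2 discordant pairs, so both routes give 0.6.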

### Spearman's ${\displaystyle \rho }$ as a particular case

If ${\displaystyle r_{i}}$, ${\displaystyle s_{i}}$ are the ranks of the ${\displaystyle i}$-th member according to the ${\displaystyle x}$-quality and ${\displaystyle y}$-quality respectively, we can simply define

${\displaystyle a_{ij}=r_{j}-r_{i},\quad b_{ij}=s_{j}-s_{i}.}$

The sums ${\displaystyle \sum a_{ij}^{2}}$ and ${\displaystyle \sum b_{ij}^{2}}$ are equal, since both ${\displaystyle r_{i}}$ and ${\displaystyle s_{i}}$ range from ${\displaystyle 1}$ to ${\displaystyle n}$. Then we have:

${\displaystyle \Gamma ={\frac {\sum (r_{j}-r_{i})(s_{j}-s_{i})}{\sum (r_{j}-r_{i})^{2}}}}$

Now

${\displaystyle {\begin{aligned}\sum _{i,j=1}^{n}(r_{j}-r_{i})(s_{j}-s_{i})&=\sum _{i=1}^{n}\sum _{j=1}^{n}r_{i}s_{i}+\sum _{i=1}^{n}\sum _{j=1}^{n}r_{j}s_{j}-\sum _{i=1}^{n}\sum _{j=1}^{n}r_{i}s_{j}-\sum _{i=1}^{n}\sum _{j=1}^{n}r_{j}s_{i}\\&=2n\sum _{i=1}^{n}r_{i}s_{i}-2\sum _{i=1}^{n}r_{i}\sum _{j=1}^{n}s_{j}\\&=2n\sum _{i=1}^{n}r_{i}s_{i}-2\left({\frac {1}{2}}n(n+1)\right)^{2}\\&=2n\sum _{i=1}^{n}r_{i}s_{i}-{\frac {1}{2}}n^{2}(n+1)^{2}\end{aligned}}}$

We also have

${\displaystyle S=\sum _{i=1}^{n}(r_{i}-s_{i})^{2}=2\sum r_{i}^{2}-2\sum r_{i}s_{i}}$

and hence

${\displaystyle \sum (r_{j}-r_{i})(s_{j}-s_{i})=2n\sum r_{i}^{2}-{\frac {1}{2}}n^{2}(n+1)^{2}-nS}$

${\displaystyle \sum r_{i}^{2}}$ being the sum of squares of the first ${\displaystyle n}$ naturals equals ${\displaystyle {\frac {1}{6}}n(n+1)(2n+1)}$. Thus, the last equation reduces to

${\displaystyle \sum (r_{j}-r_{i})(s_{j}-s_{i})={\frac {1}{6}}n^{2}(n^{2}-1)-nS}$

Further,

${\displaystyle \sum (r_{j}-r_{i})^{2}=2n\sum r_{i}^{2}-2\sum r_{i}r_{j}=2n\sum r_{i}^{2}-2\left(\sum r_{i}\right)^{2}={\frac {1}{6}}n^{2}(n^{2}-1)}$

and thus, substituting these results into the original formula, we get

${\displaystyle \Gamma _{R}=1-{\frac {6\sum d_{i}^{2}}{n^{3}-n}}}$

where ${\displaystyle d_{i}=r_{i}-s_{i}}$ is the difference between ranks. This is exactly Spearman's rank correlation coefficient ${\displaystyle \rho }$.
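
The derivation can be verified on a small example. The following sketch, with hypothetical function names and assuming untied ranks from 1 to n, evaluates ${\displaystyle \Gamma }$ with difference scores and compares it with the closed form in terms of ${\displaystyle \sum d_{i}^{2}}$.

```python
def gamma_difference_scores(r, s):
    """General coefficient with a_ij = r_j - r_i and b_ij = s_j - s_i."""
    idx = range(len(r))
    num = sum((r[j] - r[i]) * (s[j] - s[i]) for i in idx for j in idx)
    sum_a2 = sum((r[j] - r[i]) ** 2 for i in idx for j in idx)
    sum_b2 = sum((s[j] - s[i]) ** 2 for i in idx for j in idx)
    return num / (sum_a2 * sum_b2) ** 0.5

def spearman_rho(r, s):
    """Closed form 1 - 6 * sum(d_i^2) / (n^3 - n) for untied ranks."""
    n = len(r)
    d2 = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * d2 / (n ** 3 - n)

r = [1, 2, 3, 4, 5]
s = [2, 1, 4, 3, 5]
print(gamma_difference_scores(r, s), spearman_rho(r, s))
```

Here ${\displaystyle \sum d_{i}^{2}=4}$ and ${\displaystyle n=5}$, so both expressions evaluate to 0.8.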

## Rank-biserial correlation

Gene Glass (1965) noted that the rank-biserial can be derived from Spearman's ${\displaystyle \rho }$: "One can derive a coefficient defined on X, the dichotomous variable, and Y, the ranking variable, which estimates Spearman's rho between X and Y in the same way that biserial r estimates Pearson's r between two normal variables" (p. 91). The rank-biserial correlation had been introduced nine years earlier by Edward Cureton (1956) as a measure of rank correlation when the ranks are in two groups.

### Kerby simple difference formula

Dave Kerby (2014) recommended the rank-biserial as the measure to introduce students to rank correlation, because the general logic can be explained at an introductory level. The rank-biserial is the correlation used with the Mann–Whitney U test, a method commonly covered in introductory college courses on statistics. The data for this test consist of two groups, and the outcome for each member is ranked within the study as a whole.

Kerby showed that this rank correlation can be expressed in terms of two concepts: the percent of data that support a stated hypothesis, and the percent of data that do not support it. The Kerby simple difference formula states that the rank correlation is the difference between the proportion of favorable evidence (f) and the proportion of unfavorable evidence (u).

${\displaystyle r=f-u}$

### Example and interpretation

To illustrate the computation, suppose a coach trains long-distance runners for one month using two methods. Group A has 5 runners, and Group B has 4 runners. The stated hypothesis is that method A produces faster runners. The race to assess the results finds that the runners from Group A do indeed run faster, with the following ranks: 1, 2, 3, 4, and 6. The slower runners from Group B thus have ranks of 5, 7, 8, and 9.

The analysis is conducted on pairs, defined as a member of one group compared to a member of the other group. For example, the fastest runner in the study is a member of four pairs: (1,5), (1,7), (1,8), and (1,9). All four of these pairs support the hypothesis, because in each pair the runner from Group A is faster than the runner from Group B. There are a total of 20 pairs, and 19 pairs support the hypothesis. The only pair that does not support the hypothesis is the pair of runners with ranks 5 and 6, because in this pair the runner from Group B had the faster time. By the Kerby simple difference formula, 95% of the data support the hypothesis (19 of 20 pairs), and 5% do not support it (1 of 20 pairs), so the rank correlation is r = 0.95 − 0.05 = 0.90.
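
The pair counting in this example can be reproduced with a short sketch. The function name `rank_biserial` and its argument names are hypothetical; the code assumes that a lower rank means a faster runner and that there are no tied ranks.

```python
from fractions import Fraction

def rank_biserial(favored, other):
    """Kerby simple difference r = f - u over all cross-group pairs.

    `favored` holds the ranks of the group the hypothesis says is faster;
    a lower rank means a faster runner.
    """
    pairs = [(a, b) for a in favored for b in other]
    favorable = sum(1 for a, b in pairs if a < b)
    unfavorable = sum(1 for a, b in pairs if a > b)
    f = Fraction(favorable, len(pairs))
    u = Fraction(unfavorable, len(pairs))
    return f, u, f - u

group_a = [1, 2, 3, 4, 6]
group_b = [5, 7, 8, 9]

f, u, r = rank_biserial(group_a, group_b)
print(f, u, r)  # 19/20 1/20 9/10
```

Exact fractions are used so that the result 9/10 is not affected by floating-point rounding.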

The maximum value for the correlation is r = 1, which means that 100% of the pairs favor the hypothesis. A correlation of r = 0 indicates that half the pairs favor the hypothesis and half do not; in other words, the sample groups do not differ in ranks, so there is no evidence that they come from two different populations. An effect size of r = 0 can be said to describe no relationship between group membership and the members' ranks.

## References

1. Kruskal, William H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association. 53 (284): 814–861. doi:10.2307/2281954. JSTOR 2281954.
• Cureton, Edward E. (1956). "Rank-biserial correlation". Psychometrika. 21 (3): 287–290. doi:10.1007/BF02289138.
• Everitt, B. S. (2002), The Cambridge Dictionary of Statistics, Cambridge: Cambridge University Press, ISBN 0-521-81099-X
• Diaconis, P. (1988), Group Representations in Probability and Statistics, Lecture Notes-Monograph Series, Hayward, CA: Institute of Mathematical Statistics, ISBN 0-940600-14-5
• Glass, Gene V. (1965). "A ranking variable analogue of biserial correlation: implications for short-cut item analysis". Journal of Educational Measurement. 2 (1): 91–95. doi:10.1111/j.1745-3984.1965.tb00396.x.
• Kendall, M. G. (1970), Rank Correlation Methods, London: Griffin, ISBN 0-85264-199-0
• Kerby, Dave S. (2014). "The Simple Difference Formula: An Approach to Teaching Nonparametric Correlation". Comprehensive Psychology. 3 (1). doi:.