Rank correlation

Last updated

In statistics, a rank correlation is any of several statistics that measure an ordinal association—the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. For example, two common nonparametric methods of significance that use rank correlation are the Mann–Whitney U test and the Wilcoxon signed-rank test.

Contents

Context

If, for example, one variable is the identity of a college basketball program and another variable is the identity of a college football program, one could test for a relationship between the poll rankings of the two types of program: do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank correlation coefficient can measure that relationship, and the measure of significance of the rank correlation coefficient can show whether the measured relationship is small enough to likely be a coincidence.

If there is only one variable, the identity of a college football program, but it is subject to two different poll rankings (say, one by coaches and one by sportswriters), then the similarity of the two different polls' rankings can be measured with a rank correlation coefficient.

As another example, in a contingency table with low income, medium income, and high income in the row variable and educational level—no high school, high school, university—in the column variable), [1] a rank correlation measures the relationship between income and educational level.

Correlation coefficients

Some of the more popular rank correlation statistics include

  1. Spearman's ρ
  2. Kendall's τ
  3. Goodman and Kruskal's γ
  4. Somers' D

An increasing rank correlation coefficient implies increasing agreement between rankings. The coefficient is inside the interval [1, 1] and assumes the value:

Following Diaconis (1988), a ranking can be seen as a permutation of a set of objects. Thus we can look at observed rankings as data obtained when the sample space is (identified with) a symmetric group. We can then introduce a metric, making the symmetric group into a metric space. Different metrics will correspond to different rank correlations.

General correlation coefficient

Kendall 1970 [2] showed that his (tau) and Spearman's (rho) are particular cases of a general correlation coefficient.

Suppose we have a set of objects, which are being considered in relation to two properties, represented by and , forming the sets of values and . To any pair of individuals, say the -th and the -th we assign a -score, denoted by , and a -score, denoted by . The only requirement for these functions is that they be anti-symmetric, so and . (Note that in particular if .) Then the generalized correlation coefficient is defined as

Equivalently, if all coefficients are collected into matrices and , with and , then

where is the Frobenius inner product and the Frobenius norm. In particular, the general correlation coefficient is the cosine of the angle between the matrices and .

Kendall's τ as a particular case

If , are the ranks of the -member according to the -quality and -quality respectively, then we can define

The sum is the number of concordant pairs minus the number of discordant pairs (see Kendall tau rank correlation coefficient). The sum is just , the number of terms , as is . Thus in this case,

Spearman's ρ as a particular case

If , are the ranks of the -member according to the and the -quality respectively, we can simply define

The sums and are equal, since both and range from to . Then we have:

now

We also have

and hence

being the sum of squares of the first naturals equals . Thus, the last equation reduces to

Further

and thus, substituting into the original formula these results we get

where is the difference between ranks.

which is exactly Spearman's rank correlation coefficient .

Rank-biserial correlation

Gene Glass (1965) noted that the rank-biserial can be derived from Spearman's . "One can derive a coefficient defined on X, the dichotomous variable, and Y, the ranking variable, which estimates Spearman's rho between X and Y in the same way that biserial r estimates Pearson's r between two normal variables” (p. 91). The rank-biserial correlation had been introduced nine years before by Edward Cureton (1956) as a measure of rank correlation when the ranks are in two groups.

Kerby simple difference formula

Dave Kerby (2014) recommended the rank-biserial as the measure to introduce students to rank correlation, because the general logic can be explained at an introductory level. The rank-biserial is the correlation used with the Mann–Whitney U test, a method commonly covered in introductory college courses on statistics. The data for this test consists of two groups; and for each member of the groups, the outcome is ranked for the study as a whole.

Kerby showed that this rank correlation can be expressed in terms of two concepts: the percent of data that support a stated hypothesis, and the percent of data that do not support it. The Kerby simple difference formula states that the rank correlation can be expressed as the difference between the proportion of favorable evidence (f) minus the proportion of unfavorable evidence (u).

Example and interpretation

To illustrate the computation, suppose a coach trains long-distance runners for one month using two methods. Group A has 5 runners, and Group B has 4 runners. The stated hypothesis is that method A produces faster runners. The race to assess the results finds that the runners from Group A do indeed run faster, with the following ranks: 1, 2, 3, 4, and 6. The slower runners from Group B thus have ranks of 5, 7, 8, and 9.

The analysis is conducted on pairs, defined as a member of one group compared to a member of the other group. For example, the fastest runner in the study is a member of four pairs: (1,5), (1,7), (1,8), and (1,9). All four of these pairs support the hypothesis, because in each pair the runner from Group A is faster than the runner from Group B. There are a total of 20 pairs, and 19 pairs support the hypothesis. The only pair that does not support the hypothesis are the two runners with ranks 5 and 6, because in this pair, the runner from Group B had the faster time. By the Kerby simple difference formula, 95% of the data support the hypothesis (19 of 20 pairs), and 5% do not support (1 of 20 pairs), so the rank correlation is r = .95 - .05 = .90.

The maximum value for the correlation is r = 1, which means that 100% of the pairs favor the hypothesis. A correlation of r = 0 indicates that half the pairs favor the hypothesis and half do not; in other words, the sample groups do not differ in ranks, so there is no evidence that they come from two different populations. An effect size of r = 0 can be said to describe no relationship between group membership and the members' ranks.

Related Research Articles

Correlation Statistical concept

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it actually refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the price of a good and the quantity the consumers are willing to purchase, as it is depicted in the so-called demand curve.

Pearson correlation coefficient Measure of linear correlation

In statistics, the Pearson correlation coefficient ― also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient ― is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1.

Spearmans rank correlation coefficient Nonparametric measure of rank correlation

In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter (rho) or as , is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.

In statistics, the Mann–Whitney U test is a nonparametric test of the null hypothesis that, for randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.

In statistics, an effect size is a number measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the value of a parameter for a hypothetical population, or to the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event happening. Effect sizes complement statistical hypothesis testing, and play an important role in power analyses, sample size planning, and in meta-analyses. The cluster of data-analysis methods concerning effect sizes is referred to as estimation statistics.

In statistics, propagation of uncertainty is the effect of variables' uncertainties on the uncertainty of a function based on them. When the variables are the values of experimental measurements they have uncertainties due to measurement limitations which propagate due to the combination of variables in the function.

Fisher transformation Statistical transformation

In statistics, the Fisher transformation can be used to test hypotheses about the value of the population correlation coefficient ρ between variables X and Y. This is because, when the transformation is applied to the sample correlation coefficient, the sampling distribution of the resulting variable is approximately normal, with a variance that is stable over different values of the underlying true correlation.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the given dataset and those predicted by the linear function of the independent variable.

The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used either to test the location of a set of samples or to compare the locations of two populations using a set of matched samples. When applied to test the location of a set of samples, it serves the same purpose as the one-sample Student's t-test. On a set of matched samples, it is a paired difference test like the paired Student's t-test. Unlike the Student's t-test, the Wilcoxon signed-rank test does not assume that the data is normally distributed. On a wide variety of data sets, it has greater statistical power than the Student's t-test and is more likely to produce a statistically significant result. The cost of this applicability is that it has less statistical power than the Student's t-test when the data is normally distributed.

The Friedman test is a non-parametric statistical test developed by Milton Friedman. Similar to the parametric repeated measures ANOVA, it is used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row together, then considering the values of ranks by columns. Applicable to complete block designs, it is thus a special case of the Durbin test.

The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable is dichotomous; Y can either be "naturally" dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a new variable is artificially dichotomized the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.

Kendall's W is a non-parametric statistic. It is a normalization of the statistic of the Friedman test, and can be used for assessing agreement among raters. Kendall's W ranges from 0 to 1.

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient.

In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. If we are interested in finding to what extent there is a numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another, confounding, variable that is numerically related to both variables of interest. This misleading information can be avoided by controlling for the confounding variable, which is done by computing the partial correlation coefficient. This is precisely the motivation for including other right-side variables in a multiple regression; but while multiple regression gives unbiased results for the effect size, it does not give a numerical value of a measure of the strength of the relationship between the two variables of interest.

In statistics, Goodman and Kruskal's gamma is a measure of rank correlation, i.e., the similarity of the orderings of the data when ranked by each of the quantities. It measures the strength of association of the cross tabulated data when both variables are measured at the ordinal level. It makes no adjustment for either table size or ties. Values range from −1 to +1. A value of zero indicates the absence of association.

Intraclass correlation Descriptive statistic

In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC), is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures it operates on data structured as groups, rather than data structured as paired observations.

Non-random two-liquid model

The non-random two-liquid model is an activity coefficient model that correlates the activity coefficients of a compound with its mole fractions in the liquid phase concerned. It is frequently applied in the field of chemical engineering to calculate phase equilibria. The concept of NRTL is based on the hypothesis of Wilson that the local concentration around a molecule is different from the bulk concentration. This difference is due to a difference between the interaction energy of the central molecule with the molecules of its own kind and that with the molecules of the other kind . The energy difference also introduces a non-randomness at the local molecular level. The NRTL model belongs to the so-called local-composition models. Other models of this type are the Wilson model, the UNIQUAC model, and the group contribution model UNIFAC. These local-composition models are not thermodynamically consistent for a one-fluid model for a real mixture due to the assumption that the local composition around molecule i is independent of the local composition around molecule j. This assumption is not true, as was shown by Flemr in 1976. However, they are consistent if a hypothetical two-liquid model is used.

UNIQUAC

UNIQUAC is an activity coefficient model used in description of phase equilibria. The model is a so-called lattice model and has been derived from a first order approximation of interacting molecule surfaces in statistical thermodynamics. The model is however not fully thermodynamically consistent due to its two liquid mixture approach. In this approach the local concentration around one central molecule is assumed to be independent from the local composition around another type of molecule.

Pitzer equations are important for the understanding of the behaviour of ions dissolved in natural waters such as rivers, lakes and sea-water. They were first described by physical chemist Kenneth Pitzer. The parameters of the Pitzer equations are linear combinations of parameters, of a virial expansion of the excess Gibbs free energy, which characterise interactions amongst ions and solvent. The derivation is thermodynamically rigorous at a given level of expansion. The parameters may be derived from various experimental data such as the osmotic coefficient, mixed ion activity coefficients, and salt solubility. They can be used to calculate mixed ion activity coefficients and water activities in solutions of high ionic strength for which the Debye–Hückel theory is no longer adequate. They are more rigorous than the equations of specific ion interaction theory, but Pitzer parameters are more difficult to determine experimentally than SIT parameters.

Denoising Algorithm based on Relevance network Topology (DART) is an unsupervised algorithm that estimates an activity score for a pathway in a gene expression matrix, following a denoising step. In DART, a weighted average is used where the weights reflect the degree of the nodes in the pruned network. The denoising step removes prior information that is inconsistent with a data set. This strategy substantially improves unsupervised predictions of pathway activity that are based on a prior model, which was learned from a different biological system or context.

References

  1. Kruskal, William H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association . 53 (284): 814–861. doi:10.2307/2281954. JSTOR   2281954.
  2. Kendall, Maurice G (1970). Rank Correlation Methods (4 ed.). Griffin. ISBN   9780852641996.

Further reading