Spearman's rank correlation coefficient

[Figure: Spearman fig1.svg] A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. This means that all data points with greater x values than that of a given data point will have greater y values as well. In contrast, this does not give a perfect Pearson correlation.

[Figure: Spearman fig2.svg] When the data are roughly elliptically distributed and there are no prominent outliers, the Spearman correlation and Pearson correlation give similar values.

[Figure: Spearman fig3.svg] The Spearman correlation is less sensitive than the Pearson correlation to strong outliers that are in the tails of both samples. That is because Spearman's ρ limits the outlier to the value of its rank.

In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as r_s, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.

The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.
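
For instance, a minimal sketch contrasting the two coefficients on a monotone but nonlinear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)
y = x ** 3                     # perfectly monotone, but not linear

print(spearmanr(x, y)[0])      # exactly 1.0
print(pearsonr(x, y)[0])       # less than 1
```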

Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully opposed for a correlation of −1) rank between the two variables.

Spearman's coefficient is appropriate for both continuous and discrete ordinal variables. [1] [2] Both Spearman's ρ and Kendall's τ can be formulated as special cases of a more general correlation coefficient.

Definition and calculation

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables. [3]

For a sample of size n, the n raw scores X_i, Y_i are converted to ranks R(X_i), R(Y_i), and r_s is computed as

r_s = \rho_{R(X), R(Y)} = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma_{R(X)} \sigma_{R(Y)}},

where

\rho denotes the usual Pearson correlation coefficient, but applied to the rank variables,
\operatorname{cov}(R(X), R(Y)) is the covariance of the rank variables,
\sigma_{R(X)} and \sigma_{R(Y)} are the standard deviations of the rank variables.
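
As a concrete illustration, a minimal Python sketch of this definition (scipy.stats.rankdata assigns fractional average ranks to ties, matching the tie treatment described below):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank variables."""
    rx = rankdata(x)   # fractional (average) ranks; ties handled automatically
    ry = rankdata(y)
    return np.corrcoef(rx, ry)[0, 1]
```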

Only if all n ranks are distinct integers can it be computed using the popular formula

r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)},

where

d_i = R(X_i) - R(Y_i) is the difference between the two ranks of each observation,
n is the number of observations.
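
When all n ranks are distinct, a short sketch of this popular formula gives the same value as the rank-based Pearson definition above:

```python
import numpy as np
from scipy.stats import rankdata

def spearman_d2(x, y):
    """Simplified Spearman formula; valid only when all ranks are distinct."""
    d = rankdata(x) - rankdata(y)   # rank differences d_i
    n = len(d)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
```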

Identical values are usually [4] each assigned fractional ranks equal to the average of their positions in the ascending order of the values, which is equivalent to averaging over all possible permutations.

If ties are present in the data set, the simplified formula above yields incorrect results: only if in both variables all ranks are distinct does \sigma_{R(X)} \sigma_{R(Y)} = \operatorname{Var}(R(X)) = \operatorname{Var}(R(Y)) = (n^2 - 1)/12 hold (calculated according to the biased variance). The first equation (normalizing by the standard deviations) may be used even when ranks are normalized to [0, 1] ("relative ranks"), because it is insensitive both to translation and to linear scaling.
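
A small sketch illustrating the discrepancy under ties (the data values are arbitrary, chosen only so that x contains a tie):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

x = [1, 2, 2, 3]   # contains a tie
y = [1, 3, 2, 4]

rx, ry = rankdata(x), rankdata(y)
n = len(x)
print(1 - 6 * np.sum((rx - ry) ** 2) / (n * (n ** 2 - 1)))  # 0.95, incorrect
print(np.corrcoef(rx, ry)[0, 1])                            # ~0.9487, correct
print(spearmanr(x, y)[0])                                   # matches the definition
```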

The simplified method should also not be used in cases where the data set is truncated; that is, when the Spearman's correlation coefficient is desired for the top X records (whether by pre-change rank or post-change rank, or both), the user should use the Pearson correlation coefficient formula given above. [5]

There are several other numerical measures that quantify the extent of statistical dependence between pairs of observations. The most common of these is the Pearson product-moment correlation coefficient, a correlation method similar to Spearman's rank correlation that measures the "linear" relationship between the raw numbers rather than between their ranks.

An alternative name for the Spearman rank correlation is the “grade correlation”; [6] in this, the “rank” of an observation is replaced by the “grade”. In continuous distributions, the grade of an observation is, by convention, always one half less than the rank, and hence the grade and rank correlations are the same in this case. More generally, the “grade” of an observation is proportional to an estimate of the fraction of a population less than a given value, with the half-observation adjustment at observed values. Thus this corresponds to one possible treatment of tied ranks. While unusual, the term “grade correlation” is still in use. [7]
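
Because the Pearson correlation is insensitive to translation, grades (taken here as ranks less one half, a sketch of the convention just described) yield exactly the same correlation as the ranks themselves:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
x, y = rng.random(20), rng.random(20)

gx, gy = rankdata(x) - 0.5, rankdata(y) - 0.5   # "grades"
rx, ry = rankdata(x), rankdata(y)               # ranks

print(np.corrcoef(gx, gy)[0, 1])   # identical to the rank correlation
print(np.corrcoef(rx, ry)[0, 1])
```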

Interpretation

Positive and negative Spearman rank correlations:

[Figure: Spearman fig5.svg] A positive Spearman correlation coefficient corresponds to an increasing monotonic trend between X and Y.

[Figure: Spearman fig4.svg] A negative Spearman correlation coefficient corresponds to a decreasing monotonic trend between X and Y.

The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive. If Y tends to decrease when X increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no tendency for Y to either increase or decrease when X increases. The Spearman correlation increases in magnitude as X and Y become closer to being perfectly monotone functions of each other. When X and Y are perfectly monotonically related, the Spearman correlation coefficient becomes 1. A perfectly monotone increasing relationship implies that for any two pairs of data values (X_i, Y_i) and (X_j, Y_j), the differences X_i − X_j and Y_i − Y_j always have the same sign. A perfectly monotone decreasing relationship implies that these differences always have opposite signs.

The Spearman correlation coefficient is often described as being "nonparametric". This can have two meanings. First, a perfect Spearman correlation results when X and Y are related by any monotonic function. Contrast this with the Pearson correlation, which only gives a perfect value when X and Y are related by a linear function. The other sense in which the Spearman correlation is nonparametric is that its exact sampling distribution can be obtained without requiring knowledge (i.e., knowing the parameters) of the joint probability distribution of X and Y.

Example

In this example, the raw data in the table below are used to calculate the correlation between the IQ of a person and the number of hours spent in front of TV per week. [citation needed]

IQ, x_i    Hours of TV per week, y_i
106        7
100        27
86         2
101        50
99         28
103        29
97         20
113        12
112        6
110        17

Firstly, evaluate d_i^2. To do so, use the following steps, reflected in the table below.

  1. Sort the data by the first column (x_i). Create a new column, rank x_i, and assign it the ranked values 1, 2, 3, ..., n.
  2. Next, sort the data by the second column (y_i). Create a fourth column, rank y_i, and similarly assign it the ranked values 1, 2, 3, ..., n.
  3. Create a fifth column, d_i, to hold the differences between the two rank columns (rank x_i and rank y_i).
  4. Create one final column, d_i^2, to hold the value of column d_i squared.

IQ, x_i    Hours of TV per week, y_i    rank x_i    rank y_i    d_i    d_i^2
86         2                            1           1           0      0
97         20                           2           6           −4     16
99         28                           3           8           −5     25
100        27                           4           7           −3     9
101        50                           5           10          −5     25
103        29                           6           9           −3     9
106        7                            7           3           4      16
110        17                           8           5           3      9
112        6                            9           2           7      49
113        12                           10          4           6      36

With d_i^2 found, add them to find \sum d_i^2 = 194. The value of n is 10. These values can now be substituted back into the equation

r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}

to give

r_s = 1 - \frac{6 \times 194}{10(10^2 - 1)},

which evaluates to ρ = −29/165 = −0.175757575... with a p-value = 0.627188 (using the t-distribution).
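
As a check, a short sketch reproduces both numbers with scipy (scipy's spearmanr uses a t-distribution approximation for the p-value at this sample size):

```python
from scipy.stats import spearmanr

iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]

rho, p = spearmanr(iq, tv)
print(rho)   # -0.17575757... = -29/165
print(p)     # 0.627188... (t-distribution approximation)
```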

[Figure: Spearman's Rank chart.png] Chart of the data presented. It can be seen that there might be a negative correlation, but that the relationship does not appear definitive.

That the value is close to zero shows that the correlation between IQ and hours spent watching TV is very low, although the negative value suggests that the longer the time spent watching television the lower the IQ. In the case of ties in the original values, this formula should not be used; instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are assigned fractional ranks, as described above).

Determining significance

One approach to test whether an observed value of r_s is significantly different from zero (r_s will always satisfy −1 ≤ r_s ≤ 1) is to calculate the probability that it would be greater than or equal to the observed r_s, given the null hypothesis, by using a permutation test. An advantage of this approach is that it automatically takes into account the number of tied data values in the sample and the way they are treated in computing the rank correlation.
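
A minimal sketch of such a permutation test (shuffling one variable breaks any association, so the shuffled correlations sample the null distribution):

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """One-sided permutation p-value for H0: no association."""
    rng = np.random.default_rng(seed)
    observed = spearmanr(x, y)[0]
    y = np.asarray(y)
    hits = sum(
        spearmanr(x, rng.permutation(y))[0] >= observed
        for _ in range(n_perm)
    )
    return observed, hits / n_perm
```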

Another approach parallels the use of the Fisher transformation in the case of the Pearson product-moment correlation coefficient. That is, confidence intervals and hypothesis tests relating to the population value ρ can be carried out using the Fisher transformation:

F(r) = \frac{1}{2} \ln \frac{1 + r}{1 - r} = \operatorname{artanh}(r).

If F(r) is the Fisher transformation of r, the sample Spearman rank correlation coefficient, and n is the sample size, then

z = \sqrt{\frac{n - 3}{1.06}} F(r)

is a z-score for r, which approximately follows a standard normal distribution under the null hypothesis of statistical independence (ρ = 0). [8] [9]
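
A minimal sketch of this z-test (np.arctanh is the Fisher transformation):

```python
import numpy as np
from scipy.stats import norm

def spearman_z_test(r, n):
    """Approximate z-test of H0: rho = 0 via the Fisher transformation."""
    z = np.sqrt((n - 3) / 1.06) * np.arctanh(r)
    return z, 2 * norm.sf(abs(z))   # z-score and two-sided p-value
```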

One can also test for significance using

t = r \sqrt{\frac{n - 2}{1 - r^2}},

which is distributed approximately as Student's t-distribution with n − 2 degrees of freedom under the null hypothesis. [10] A justification for this result relies on a permutation argument. [11]
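
A corresponding sketch of the t-test; applied to the worked example above (r = −29/165, n = 10), it reproduces the quoted p-value of about 0.627:

```python
import numpy as np
from scipy.stats import t as t_dist

def spearman_t_test(r, n):
    """Approximate t-test of H0: rho = 0 with n - 2 degrees of freedom."""
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    return t, 2 * t_dist.sf(abs(t), df=n - 2)   # two-sided p-value

print(spearman_t_test(-29 / 165, 10))   # p ≈ 0.627
```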

A generalization of the Spearman coefficient is useful in the situation where there are three or more conditions, a number of subjects are all observed in each of them, and it is predicted that the observations will have a particular order. For example, a number of subjects might each be given three trials at the same task, and it is predicted that performance will improve from trial to trial. A test of the significance of the trend between conditions in this situation was developed by E. B. Page [12] and is usually referred to as Page's trend test for ordered alternatives.

Correspondence analysis based on Spearman's ρ

Classic correspondence analysis is a statistical method that gives a score to every value of two nominal variables. In this way the Pearson correlation coefficient between them is maximized.

There exists an equivalent of this method, called grade correspondence analysis, which maximizes Spearman's ρ or Kendall's τ. [13]

Approximating Spearman's ρ from a stream

There are two existing approaches to approximating the Spearman's rank correlation coefficient from streaming data. [14] [15] The first approach [14] involves coarsening the joint distribution of (X, Y). For continuous X, Y values: m_1, m_2 cutpoints are selected for X and Y respectively, discretizing these random variables. Default cutpoints are added at −∞ and ∞. A count matrix of size (m_1 + 1) × (m_2 + 1), denoted M, is then constructed, where M[i, j] stores the number of observations that fall into the two-dimensional cell indexed by (i, j). For streaming data, when a new observation arrives, the appropriate M[i, j] element is incremented. The Spearman's rank correlation can then be computed, based on the count matrix M, using linear algebra operations (Algorithm 2 [14]). Note that for discrete random variables, no discretization procedure is necessary. This method is applicable to stationary streaming data as well as large data sets. For non-stationary streaming data, where the Spearman's rank correlation coefficient may change over time, the same procedure can be applied, but to a moving window of observations. When using a moving window, memory requirements grow linearly with chosen window size.
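
A minimal sketch of the count-matrix idea (an illustration under stated assumptions, not the paper's exact Algorithm 2; the cutpoints and the mid-rank reconstruction are illustrative choices):

```python
import numpy as np

class StreamingSpearman:
    """Approximate Spearman's rho from a stream using a count matrix M."""

    def __init__(self, x_cuts, y_cuts):
        self.x_cuts = np.asarray(x_cuts)   # m1 finite cutpoints for X
        self.y_cuts = np.asarray(y_cuts)   # m2 finite cutpoints for Y
        self.M = np.zeros((len(self.x_cuts) + 1, len(self.y_cuts) + 1))

    def update(self, x, y):
        # the -inf/+inf boundary bins are implicit in searchsorted's indexing
        self.M[np.searchsorted(self.x_cuts, x),
               np.searchsorted(self.y_cuts, y)] += 1

    def correlation(self):
        M, n = self.M, self.M.sum()
        row, col = M.sum(axis=1), M.sum(axis=0)
        # average (mid-) rank of the observations falling in each bin
        rx = np.cumsum(row) - row + (row + 1) / 2
        ry = np.cumsum(col) - col + (col + 1) / 2
        # weighted Pearson correlation of the bin mid-ranks
        mx, my = (row / n) @ rx, (col / n) @ ry
        cov = ((M / n) * np.outer(rx - mx, ry - my)).sum()
        sx = np.sqrt((row / n) @ (rx - mx) ** 2)
        sy = np.sqrt((col / n) @ (ry - my) ** 2)
        return cov / (sx * sy)

# usage: s = StreamingSpearman([0.25, 0.5, 0.75], [0.25, 0.5, 0.75])
#        then s.update(x, y) per observation, and s.correlation() at any time
```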

The second approach to approximating the Spearman's rank correlation coefficient from streaming data involves the use of Hermite series based estimators. [15] These estimators, based on Hermite polynomials, allow sequential estimation of the probability density function and cumulative distribution function in univariate and bivariate cases. Bivariate Hermite series density estimators and univariate Hermite series based cumulative distribution function estimators are plugged into a large sample version of the Spearman's rank correlation coefficient estimator, to give a sequential Spearman's correlation estimator. This estimator is phrased in terms of linear algebra operations for computational efficiency (equation (8) and algorithms 1 and 2 [15]). These algorithms are only applicable to continuous random variable data, but have certain advantages over the count matrix approach in this setting. The first advantage is improved accuracy when applied to large numbers of observations. The second advantage is that the Spearman's rank correlation coefficient can be computed on non-stationary streams without relying on a moving window. Instead, the Hermite series based estimator uses an exponential weighting scheme to track time-varying Spearman's rank correlation from streaming data, which has constant memory requirements with respect to "effective" moving window size.

Software implementations

See also


References

  1. Scale types.
  2. Lehman, Ann (2005). Jmp For Basic Univariate And Multivariate Statistics: A Step-by-step Guide. Cary, NC: SAS Press. p. 123. ISBN 978-1-59047-576-8.
  3. Myers, Jerome L.; Well, Arnold D. (2003). Research Design and Statistical Analysis (2nd ed.). Lawrence Erlbaum. p. 508. ISBN 978-0-8058-4037-7.
  4. Dodge, Yadolah (2010). The Concise Encyclopedia of Statistics. Springer-Verlag New York. p. 502. ISBN 978-0-387-31742-7.
  5. Al Jaber, Ahmed Odeh; Elayyan, Haifaa Omar (2018). Toward Quality Assurance and Excellence in Higher Education. River Publishers. p. 284. ISBN 978-87-93609-54-9.
  6. Yule, G. U.; Kendall, M. G. (1968) [1950]. An Introduction to the Theory of Statistics (14th ed.). Charles Griffin & Co. p. 268.
  7. Piantadosi, J.; Howlett, P.; Boland, J. (2007). "Matching the grade correlation coefficient using a copula with maximum disorder". Journal of Industrial and Management Optimization. 3 (2): 305–312. doi:10.3934/jimo.2007.3.305.
  8. Choi, S. C. (1977). "Tests of Equality of Dependent Correlation Coefficients". Biometrika. 64 (3): 645–647. doi:10.1093/biomet/64.3.645.
  9. Fieller, E. C.; Hartley, H. O.; Pearson, E. S. (1957). "Tests for rank correlation coefficients. I". Biometrika. 44 (3–4): 470–481. CiteSeerX 10.1.1.474.9634. doi:10.1093/biomet/44.3-4.470.
  10. Press; Vettering; Teukolsky; Flannery (1992). Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press. p. 640.
  11. Kendall, M. G.; Stuart, A. (1973). "Sections 31.19, 31.21". The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Griffin. ISBN 978-0-85264-215-3.
  12. Page, E. B. (1963). "Ordered hypotheses for multiple treatments: A significance test for linear ranks". Journal of the American Statistical Association. 58 (301): 216–230. doi:10.2307/2282965. JSTOR 2282965.
  13. Kowalczyk, T.; Pleszczyńska, E.; Ruland, F., eds. (2004). Grade Models and Methods for Data Analysis with Applications for the Analysis of Data Populations. Studies in Fuzziness and Soft Computing. 151. Berlin Heidelberg New York: Springer Verlag. ISBN 978-3-540-21120-4.
  14. Xiao, W. (2019). "Novel Online Algorithms for Nonparametric Correlations with Application to Analyze Sensor Data". 2019 IEEE International Conference on Big Data (Big Data): 404–412. doi:10.1109/BigData47090.2019.9006483. ISBN 978-1-7281-0858-2.
  15. Stephanou, Michael; Varughese, Melvin (July 2021). "Sequential estimation of Spearman rank correlation using Hermite series estimators". arXiv:2012.06287 [stat.ME].
  16. "corr". MATLAB documentation. The MathWorks. https://www.mathworks.com/help/stats/corr.html

Further reading