Spurious correlation of ratios

Figure: An illustration of spurious correlation. The figure shows 500 observations of x/z plotted against y/z. The sample correlation is 0.53, even though x, y, and z are statistically independent of each other (i.e., the pairwise correlations between each of them are zero). The z-values are highlighted on a colour scale.

In statistics, spurious correlation of ratios is a form of spurious correlation that arises between ratios of absolute measurements which themselves are uncorrelated. [1] [2]

The phenomenon of spurious correlation of ratios is one of the main motives for the field of compositional data analysis, which deals with the analysis of variables that carry only relative information, such as proportions, percentages and parts-per-million. [3] [4]

Spurious correlation of ratios is distinct from the separate issue of mistaking correlation for causation.

Illustration of spurious correlation

Pearson gives a simple example of spurious correlation: [1]

Select three numbers within certain ranges at random, say x, y, z, these will be pair and pair uncorrelated. Form the proper fractions x/z and y/z for each triplet, and correlation will be found between these indices.

The scatter plot above illustrates this example using 500 observations of x, y, and z. Variables x, y, and z are drawn from normal distributions with means 10, 10, and 30, respectively, and standard deviations 1, 1, and 3, respectively, i.e.,

$$x \sim N(10, 1^2), \qquad y \sim N(10, 1^2), \qquad z \sim N(30, 3^2).$$

Even though x, y, and z are statistically independent and therefore uncorrelated, the ratios x/z and y/z in the sample depicted have a correlation of 0.53. This correlation is induced by the common divisor z, and can be better understood by colouring the points in the scatter plot by z-value: trios of (x, y, z) with relatively large z values tend to appear in the bottom left of the plot, while trios with relatively small z values tend to appear in the top right.
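
For illustration, this example can be reproduced with a short simulation (a Python sketch using NumPy; the seed and variable names are arbitrary choices, not part of the original analysis):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    n = 500

    # Independent draws with the means and standard deviations stated above.
    x = rng.normal(10, 1, n)
    y = rng.normal(10, 1, n)
    z = rng.normal(30, 3, n)

    # The absolute measurements are uncorrelated (sample correlation near 0)...
    print(np.corrcoef(x, y)[0, 1])

    # ...but the ratios sharing the common divisor z are not (typically near 0.5).
    print(np.corrcoef(x / z, y / z)[0, 1])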

Approximate amount of spurious correlation

Pearson derived an approximation of the correlation that would be observed between two indices ($x_1/x_3$ and $x_2/x_4$), i.e., ratios of the absolute measurements $x_1, x_2, x_3, x_4$:

$$\rho \approx \frac{r_{12} v_1 v_2 - r_{14} v_1 v_4 - r_{23} v_2 v_3 + r_{34} v_3 v_4}{\sqrt{v_1^2 + v_3^2 - 2 r_{13} v_1 v_3}\,\sqrt{v_2^2 + v_4^2 - 2 r_{24} v_2 v_4}}$$

where $v_i$ is the coefficient of variation of $x_i$, and $r_{ij}$ the Pearson correlation between $x_i$ and $x_j$.
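
As a sketch of how this approximation can be evaluated numerically (in Python; the function name and the dictionary-based argument layout are illustrative assumptions, not from Pearson or the cited sources):

    import math

    def approx_index_correlation(v, r):
        """Pearson's approximate correlation between the indices x1/x3 and x2/x4.

        v[i] is the coefficient of variation of x_i (i = 1..4);
        r[i, j] is the Pearson correlation between x_i and x_j.
        """
        num = (r[1, 2] * v[1] * v[2] - r[1, 4] * v[1] * v[4]
               - r[2, 3] * v[2] * v[3] + r[3, 4] * v[3] * v[4])
        den = (math.sqrt(v[1] ** 2 + v[3] ** 2 - 2 * r[1, 3] * v[1] * v[3])
               * math.sqrt(v[2] ** 2 + v[4] ** 2 - 2 * r[2, 4] * v[2] * v[4]))
        return num / den

    # Example: a common divisor (x3 = x4, so r34 = 1) and uncorrelated x1, x2, x3.
    v = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1}
    r = {(1, 2): 0, (1, 3): 0, (1, 4): 0, (2, 3): 0, (2, 4): 0, (3, 4): 1}
    print(approx_index_correlation(v, r))  # 0.5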

This expression can be simplified for situations where there is a common divisor by setting $x_3 = x_4$ and assuming $x_1$, $x_2$, and $x_3$ are uncorrelated, giving the spurious correlation:

$$\rho_0 = \frac{v_3^2}{\sqrt{v_1^2 + v_3^2}\,\sqrt{v_2^2 + v_3^2}}.$$

For the special case in which all coefficients of variation are equal (as is the case in the illustration above), $\rho_0 = 0.5$.
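
A quick numerical check of the common-divisor formula against the illustration (again a Python sketch; spurious_correlation is an illustrative name):

    import math

    def spurious_correlation(v1, v2, v3):
        """Spurious correlation between x1/x3 and x2/x3 for uncorrelated
        x1, x2, x3 with coefficients of variation v1, v2, v3."""
        return v3 ** 2 / (math.sqrt(v1 ** 2 + v3 ** 2) * math.sqrt(v2 ** 2 + v3 ** 2))

    # Coefficients of variation in the illustration: 1/10, 1/10, and 3/30 (all 0.1).
    print(spurious_correlation(0.1, 0.1, 0.1))  # 0.5, consistent with the observed 0.53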

Relevance to biology and other sciences

Pearson was joined by Sir Francis Galton [5] and Walter Frank Raphael Weldon [1] in cautioning scientists to be wary of spurious correlation, especially in biology, where it is common [6] to scale or normalize measurements by dividing them by a particular variable or total. The danger Pearson saw was that conclusions would be drawn from correlations that are artifacts of the analysis method rather than actual “organic” relationships.

However, it would appear that spurious correlation (and its potential to mislead) is not yet widely understood. In 1986, John Aitchison, who pioneered the log-ratio approach to compositional data analysis, wrote: [3]

It seems surprising that the warnings of three such eminent statistician-scientists as Pearson, Galton and Weldon should have largely gone unheeded for so long: even today uncritical applications of inappropriate statistical methods to compositional data with consequent dubious inferences are regularly reported.

More recent publications suggest that this lack of awareness prevails, at least in molecular bioscience. [7] [8]

References

  1. Pearson, Karl (1896). "Mathematical Contributions to the Theory of Evolution – On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs". Proceedings of the Royal Society of London. 60 (359–367): 489–498. doi:10.1098/rspl.1896.0076. JSTOR 115879.
  2. Aldrich, John (1995). "Correlations Genuine and Spurious in Pearson and Yule". Statistical Science. 10 (4): 364–376. doi:10.1214/ss/1177009870.
  3. Aitchison, John (1986). The Statistical Analysis of Compositional Data. Chapman & Hall. ISBN 978-0-412-28060-3.
  4. Pawlowsky-Glahn, Vera; Buccianti, Antonella, eds. (2011). Compositional Data Analysis: Theory and Applications. Wiley. doi:10.1002/9781119976462. ISBN 978-0470711354.
  5. Galton, Francis (1896). "Note to the memoir by Professor Karl Pearson, F.R.S., on spurious correlation". Proceedings of the Royal Society of London. 60 (359–367): 498–502. doi:10.1098/rspl.1896.0077. S2CID 170846631.
  6. Jackson, DA; Somers, KM (1991). "The Spectre of 'Spurious' Correlation". Oecologia. 86 (1): 147–151. Bibcode:1991Oecol..86..147J. doi:10.1007/bf00317404. JSTOR 4219582. PMID 28313173. S2CID 1116627.
  7. Lovell, David; Müller, Warren; Taylor, Jen; Zwart, Alec; Helliwell, Chris (2011). "Chapter 14: Proportions, Percentages, PPM: Do the Molecular Biosciences Treat Compositional Data Right?". In Pawlowsky-Glahn, Vera; Buccianti, Antonella (eds.). Compositional Data Analysis: Theory and Applications. Wiley. doi:10.1002/9781119976462. ISBN 978-0470711354.
  8. Lovell, David; Pawlowsky-Glahn, Vera; Egozcue, Juan José; Marguerat, Samuel; Bähler, Jürg (16 March 2015). "Proportionality: A Valid Alternative to Correlation for Relative Data". PLOS Computational Biology. 11 (3): e1004075. Bibcode:2015PLSCB..11E4075L. doi:10.1371/journal.pcbi.1004075. PMC 4361748. PMID 25775355.