In statistics, the Fisher transformation (or Fisher z-transformation) of a Pearson correlation coefficient is its inverse hyperbolic tangent (artanh). When the sample correlation coefficient r is near 1 or -1, its distribution is highly skewed, which makes it difficult to estimate confidence intervals and apply tests of significance for the population correlation coefficient ρ. [1] [2] [3] The Fisher transformation solves this problem by yielding a variable whose distribution is approximately normally distributed, with a variance that is stable over different values of r.
Given a set of N bivariate sample pairs (Xi, Yi), i = 1, ..., N, the sample correlation coefficient r is given by
Here stands for the covariance between the variables and and stands for the standard deviation of the respective variable. Fisher's z-transformation of r is defined as
where "ln" is the natural logarithm function and "artanh" is the inverse hyperbolic tangent function.
If (X, Y) has a bivariate normal distribution with correlation ρ and the pairs (Xi, Yi) are independent and identically distributed, then z is approximately normally distributed with mean
and a standard deviation which does not depend on the value of the correlation rho (i.e., a Variance-stabilizing transformation)
where N is the sample size, and ρ is the true correlation coefficient.
This transformation, and its inverse
can be used to construct a large-sample confidence interval for r using standard normal theory and derivations. See also application to partial correlation.
![]() | This article may require cleanup to meet Wikipedia's quality standards. The specific problem is: the steps of the derivation are not laid out completely.(July 2021) |
Hotelling gives a concise derivation of the Fisher transformation. [4]
To derive the Fisher transformation, one starts by considering an arbitrary increasing, twice-differentiable function of , say . Finding the first term in the large- expansion of the corresponding skewness results [5] in
Setting and solving the corresponding differential equation for yields the inverse hyperbolic tangent function.
Similarly expanding the mean m and variance v of , one gets
and
respectively.
The extra terms are not part of the usual Fisher transformation. For large values of and small values of they represent a large improvement of accuracy at minimal cost, although they greatly complicate the computation of the inverse – a closed-form expression is not available. The near-constant variance of the transformation is the result of removing its skewness – the actual improvement is achieved by the latter, not by the extra terms. Including the extra terms, i.e., computing (z-m)/v1/2, yields:
which has, to an excellent approximation, a standard normal distribution. [6]
The application of Fisher's transformation can be enhanced using a software calculator as shown in the figure. Assuming that the r-squared value found is 0.80, that there are 30 data [ clarification needed ], and accepting a 90% confidence interval, the r-squared value in another random sample from the same population may range from 0.656 to 0.888. When r-squared is outside this range, the population is considered to be different.
The Fisher transformation is an approximate variance-stabilizing transformation for r when X and Y follow a bivariate normal distribution. This means that the variance of z is approximately constant for all values of the population correlation coefficient ρ. Without the Fisher transformation, the variance of r grows smaller as |ρ| gets closer to 1. Since the Fisher transformation is approximately the identity function when |r| < 1/2, it is sometimes useful to remember that the variance of r is well approximated by 1/N as long as |ρ| is not too large and N is not too small. This is related to the fact that the asymptotic variance of r is 1 for bivariate normal data.
The behavior of this transform has been extensively studied since Fisher introduced it in 1915. Fisher himself found the exact distribution of z for data from a bivariate normal distribution in 1921; Gayen in 1951 [8] determined the exact distribution of z for data from a bivariate Type A Edgeworth distribution. Hotelling in 1953 calculated the Taylor series expressions for the moments of z and several related statistics [9] and Hawkins in 1989 discovered the asymptotic distribution of z for data from a distribution with bounded fourth moments. [10]
An alternative to the Fisher transformation is to use the exact confidence distribution density for ρ given by [11] [12] where is the Gaussian hypergeometric function and .
While the Fisher transformation is mainly associated with the Pearson product-moment correlation coefficient for bivariate normal observations, it can also be applied to Spearman's rank correlation coefficient in more general cases. [13] A similar result for the asymptotic distribution applies, but with a minor adjustment factor: see the cited article for details.
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.
In probability theory and statistics, Student's t distribution is a continuous probability distribution that generalizes the standard normal distribution. Like the latter, it is symmetric around zero and bell-shaped.
In probability theory and statistics, the chi-squared distribution with degrees of freedom is the distribution of a sum of the squares of independent standard normal random variables.
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the price of a good and the quantity the consumers are willing to purchase, as it is depicted in the so-called demand curve.
In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.
In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of children from a primary school to have a Pearson correlation coefficient significantly greater than 0, but less than 1.
In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter (rho) or as , is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.
In statistics, propagation of uncertainty is the effect of variables' uncertainties on the uncertainty of a function based on them. When the variables are the values of experimental measurements they have uncertainties due to measurement limitations which propagate due to the combination of variables in the function.
In probability theory and mathematical physics, a random matrix is a matrix-valued random variable—that is, a matrix in which some or all of its entries are sampled randomly from a probability distribution. Random matrix theory (RMT) is the study of properties of random matrices, often as they become large. RMT provides techniques like mean-field theory, diagrammatic methods, the cavity method, or the replica method to compute quantities like traces, spectral densities, or scalar products between eigenvectors. Many physical phenomena, such as the spectrum of nuclei of heavy atoms, the thermal conductivity of a lattice, or the emergence of quantum chaos, can be modeled mathematically as problems concerning large, random matrices.
In probability theory, the Rice distribution or Rician distribution is the probability distribution of the magnitude of a circularly-symmetric bivariate normal random variable, possibly with non-zero mean (noncentral). It was named after Stephen O. Rice (1907–1986).
In statistics, simple linear regression (SLR) is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
In statistics, a pivotal quantity or pivot is a function of observations and unobservable parameters such that the function's probability distribution does not depend on the unknown parameters. A pivot need not be a statistic — the function and its value can depend on the parameters of the model, but its distribution must not. If it is a statistic, then it is known as an ancillary statistic.
In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. When determining the numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another confounding variable that is numerically related to both variables of interest. This misleading information can be avoided by controlling for the confounding variable, which is done by computing the partial correlation coefficient. This is precisely the motivation for including other right-side variables in a multiple regression; but while multiple regression gives unbiased results for the effect size, it does not give a numerical value of a measure of the strength of the relationship between the two variables of interest.
A ratio distribution is a probability distribution constructed as the distribution of the ratio of random variables having two other known distributions. Given two random variables X and Y, the distribution of the random variable Z that is formed as the ratio Z = X/Y is a ratio distribution.
In probability theory and statistics, the Conway–Maxwell–Poisson distribution is a discrete probability distribution named after Richard W. Conway, William L. Maxwell, and Siméon Denis Poisson that generalizes the Poisson distribution by adding a parameter to model overdispersion and underdispersion. It is a member of the exponential family, has the Poisson distribution and geometric distribution as special cases and the Bernoulli distribution as a limiting case.
A product distribution is a probability distribution constructed as the distribution of the product of random variables having two other known distributions. Given two statistically independent random variables X and Y, the distribution of the random variable Z that is formed as the product is a product distribution.
In statistics and in probability theory, distance correlation or distance covariance is a measure of dependence between two paired random vectors of arbitrary, not necessarily equal, dimension. The population distance correlation coefficient is zero if and only if the random vectors are independent. Thus, distance correlation measures both linear and nonlinear association between two random variables or random vectors. This is in contrast to Pearson's correlation, which can only detect linear association between two random variables.
In statistical inference, the concept of a confidence distribution (CD) has often been loosely referred to as a distribution function on the parameter space that can represent confidence intervals of all levels for a parameter of interest. Historically, it has typically been constructed by inverting the upper limits of lower sided confidence intervals of all levels, and it was also commonly associated with a fiducial interpretation, although it is a purely frequentist concept. A confidence distribution is NOT a probability distribution function of the parameter of interest, but may still be a function useful for making inferences.
In probability theory and statistics, the generalized multivariate log-gamma (G-MVLG) distribution is a multivariate distribution introduced by Demirhan and Hamurkaroglu in 2011. The G-MVLG is a flexible distribution. Skewness and kurtosis are well controlled by the parameters of the distribution. This enables one to control dispersion of the distribution. Because of this property, the distribution is effectively used as a joint prior distribution in Bayesian analysis, especially when the likelihood is not from the location-scale family of distributions such as normal distribution.
{{cite journal}}
: Cite journal requires |journal=
(help)