Quantile normalization

Last updated October 04, 2024

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile-normalize a test distribution to a reference distribution of the same length, sort the test distribution and sort the reference distribution. The highest entry in the test distribution then takes the value of the highest entry in the reference distribution, the next highest entry in the test distribution takes the value of the next highest entry in the reference distribution, and so on, until the test distribution is a permutation of the reference distribution.
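The procedure for normalizing to a reference can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `normalize_to_reference` and the sample values are assumptions made for the example, and ties in the test data are simply broken by position.

```python
def normalize_to_reference(test, reference):
    """Give each test entry the reference value of the same rank."""
    if len(test) != len(reference):
        raise ValueError("test and reference must have the same length")
    ref_sorted = sorted(reference)
    # Indices of the test entries, ordered from lowest to highest value.
    # sorted() is stable, so tied entries keep their original order.
    order = sorted(range(len(test)), key=lambda i: test[i])
    result = [None] * len(test)
    for rank, idx in enumerate(order):
        result[idx] = ref_sorted[rank]
    return result


# Hypothetical sample data, chosen only for illustration.
test = [5, 2, 3, 4]
reference = [10, 40, 20, 30]
normalized = normalize_to_reference(test, reference)
print(normalized)  # [40, 10, 20, 30]
```

The output contains exactly the reference values, rearranged to follow the ordering of the test values.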

To quantile-normalize two or more distributions to each other, without a reference distribution, sort each one as before, then set each entry to the average (usually the arithmetic mean) of the entries with the same rank across the distributions. The highest value in every distribution thus becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.

Generally, a reference distribution will be one of the standard statistical distributions, such as the Gaussian distribution or the Poisson distribution. The reference distribution can be generated randomly or by taking regular samples from the cumulative distribution function of the distribution. However, any reference distribution can be used.
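One plausible reading of "regular samples" is to evaluate the inverse cumulative distribution function at evenly spaced probabilities. The sketch below does this for a Gaussian reference using Python's standard library. The function name `gaussian_reference` and the midpoint probabilities (k + 0.5)/n are assumptions of this example; other spacing conventions would serve equally well.

```python
from statistics import NormalDist


def gaussian_reference(n, mu=0.0, sigma=1.0):
    """Return n Gaussian reference values at evenly spaced probabilities."""
    dist = NormalDist(mu, sigma)
    # Midpoint probabilities stay strictly between 0 and 1,
    # where the inverse CDF is finite.
    return [dist.inv_cdf((k + 0.5) / n) for k in range(n)]


ref = gaussian_reference(4)
print([round(v, 3) for v in ref])
```

The resulting list is already sorted from lowest to highest, so it can be used directly as the sorted reference distribution.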

Quantile normalization is frequently used in microarray data analysis. It was introduced as quantile standardization^[1] and then renamed as quantile normalization.^[2]

Example

A quick illustration of such normalizing on a very small dataset, organized into columns (1-3) and rows (A-D):

${\begin{matrix}&{}\\[6pt]&A:\\&B:\\&C:\\&D:\end{matrix}}\quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]5&4&3\\2&1&4\\3&4&6\\4&2&8\end{matrix}}$

For each column, rank the entries from lowest to highest (i to iv):

${\begin{matrix}&{}\\[6pt]&A:\\&B:\\&C:\\&D:\end{matrix}}\quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]5&4&3\\2&1&4\\3&4&6\\4&2&8\end{matrix}}\quad \longrightarrow \quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]{\rm {iv}}&{\rm {iii}}&{\rm {i}}\\{\rm {i}}&{\rm {i}}&{\rm {ii}}\\{\rm {ii}}&{\rm {iii}}&{\rm {iii}}\\{\rm {iii}}&{\rm {ii}}&{\rm {iv}}\end{matrix}}$

Set aside these rank values to use later. Go back to the first set of data. Rearrange each column's values so that each column is in order from lowest to highest. The result is:

${\begin{matrix}&{}\\[6pt]&A:\\&B:\\&C:\\&D:\end{matrix}}\quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]5&4&3\\2&1&4\\3&4&6\\4&2&8\end{matrix}}\quad \longrightarrow \quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]2&1&3\\3&2&4\\4&4&6\\5&4&8\end{matrix}}$

Now find the mean for each row, and rank them lowest to highest (i to iv):

${\begin{aligned}(2+1+3)/3&=2.00{\text{ (rank i)}}\\(3+2+4)/3&=3.00{\text{ (rank ii)}}\\(4+4+6)/3&=4.67{\text{ (rank iii)}}\\(5+4+8)/3&=5.67{\text{ (rank iv)}}\end{aligned}}$

Now take the ranking order from earlier and substitute in the means according to their corresponding ranks:

${\begin{matrix}&{}\\[6pt]&A:\\&B:\\&C:\\&D:\end{matrix}}\quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]{\rm {iv}}&{\rm {iii}}&{\rm {i}}\\{\rm {i}}&{\rm {i}}&{\rm {ii}}\\{\rm {ii}}&{\rm {iii}}&{\rm {iii}}\\{\rm {iii}}&{\rm {ii}}&{\rm {iv}}\end{matrix}}\quad \longrightarrow \quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]5.67&4.67&2.00\\2.00&2.00&3.00\\3.00&4.67&4.67\\4.67&3.00&5.67\end{matrix}}$

These are the new normalized values.
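The steps so far can be reproduced with a short Python sketch. The function name `quantile_normalize` is an assumption of this example, and the data are passed as a list of columns. Tied values are given the lowest rank they share, as in the ranking table above, and no further tie correction is applied.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns to their rank-wise means."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest values across all columns.
    rank_means = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    result = []
    for col in columns:
        # The rank of x is the number of strictly smaller values in its
        # column, so tied values share the lowest applicable rank.
        result.append([rank_means[sum(v < x for v in col)] for x in col])
    return result


# Columns 1-3 of the example, each listed in row order A-D.
columns = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
rounded = [[round(v, 2) for v in col] for col in quantile_normalize(columns)]
for col in rounded:
    print(col)
# [5.67, 2.0, 3.0, 4.67]
# [4.67, 2.0, 4.67, 3.0]
# [2.0, 3.0, 4.67, 5.67]
```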

However, when values within a column are tied in rank, as in column 2, they should instead be assigned the mean of the values for the ranks they would occupy if they were distinct. In column 2, the two tied entries occupy ranks iii and iv, so both are assigned the average of the rank iii and rank iv means, (4.67 + 5.67)/2 = 5.17. This gives the following set of normalized values:

${\begin{matrix}&{}\\[6pt]&A:\\&B:\\&C:\\&D:\end{matrix}}\quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]5.67&{\mathbf {4.67}}&2.00\\2.00&2.00&3.00\\3.00&{\mathbf {4.67}}&4.67\\4.67&3.00&5.67\end{matrix}}\quad \longrightarrow \quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]5.67&{\mathbf {5.17}}&2.00\\2.00&2.00&3.00\\3.00&{\mathbf {5.17}}&4.67\\4.67&3.00&5.67\end{matrix}}$
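The tie rule can be sketched in Python as follows. The function name `quantile_normalize_ties` is an assumption of this example. Each value is assigned the average of the rank means over every rank that it and its tied partners occupy, which reduces to the single rank mean when there is no tie.

```python
def quantile_normalize_ties(columns):
    """Quantile-normalize columns, averaging rank means over tied ranks."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest values across all columns.
    rank_means = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    result = []
    for col in columns:
        new_col = []
        for x in col:
            lo = sum(v < x for v in col)   # first rank occupied by x
            hi = sum(v <= x for v in col)  # one past the last rank occupied
            new_col.append(sum(rank_means[lo:hi]) / (hi - lo))
        result.append(new_col)
    return result


# Columns 1-3 of the example, each listed in row order A-D.
columns = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
rounded = [
    [round(v, 2) for v in col] for col in quantile_normalize_ties(columns)
]
for col in rounded:
    print(col)
# [5.67, 2.0, 3.0, 4.67]
# [5.17, 2.0, 5.17, 3.0]
# [2.0, 3.0, 4.67, 5.67]
```

Only column 2 changes, because it is the only column that contains a tie.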

The new values have nearly the same distribution in every column (column 2 differs slightly because of its tie) and can now be easily compared. Here are the summary statistics for each of the three columns:

${\begin{array}{r}&{}\\[6pt]&{\text{Min}}:\\&{\text{1st Qrt}}:\\&{\text{Median}}:\\&{\text{Mean}}:\\&{\text{3rd Qrt}}:\\&{\text{Max}}:\end{array}}\quad {\begin{matrix}{\underline {1}}&{\underline {2}}&{\underline {3}}\\[6pt]2.00&2.00&2.00\\2.75&2.75&2.75\\3.83&4.08&3.83\\3.83&3.83&3.83\\4.92&5.17&4.92\\5.67&5.17&5.67\end{matrix}}$
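The summary statistics can be checked with Python's standard library. In this sketch the helper name `summarize` is an assumption, the normalized values are entered as exact fractions (14/3 ≈ 4.67, 17/3 ≈ 5.67, 31/6 ≈ 5.17), and the quartiles use the inclusive linear-interpolation convention, which is an assumption that happens to reproduce the figures in the table.

```python
from statistics import mean, quantiles


def summarize(values):
    """Return [min, Q1, median, mean, Q3, max], rounded to 2 decimals."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    stats = (min(values), q1, med, mean(values), q3, max(values))
    return [round(s, 2) for s in stats]


# Tie-corrected normalized values for columns 1-3, rows A-D.
normalized = {
    1: [17 / 3, 2.0, 3.0, 14 / 3],
    2: [31 / 6, 2.0, 31 / 6, 3.0],
    3: [2.0, 3.0, 14 / 3, 17 / 3],
}
summary = {name: summarize(col) for name, col in normalized.items()}
for name, stats in summary.items():
    print(name, stats)
# 1 [2.0, 2.75, 3.83, 3.83, 4.92, 5.67]
# 2 [2.0, 2.75, 4.08, 3.83, 5.17, 5.17]
# 3 [2.0, 2.75, 3.83, 3.83, 4.92, 5.67]
```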

References

[1] Amaratunga, D.; Cabrera, J. (2001). "Analysis of Data from Viral DNA Microchips". Journal of the American Statistical Association. 96 (456): 1161. doi:10.1198/016214501753381814. S2CID 18154109.
[2] Bolstad, B. M.; Irizarry, R. A.; Astrand, M.; Speed, T. P. (2003). "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias". Bioinformatics. 19 (2): 185–193. doi:10.1093/bioinformatics/19.2.185. PMID 12538238.

External links

Normalization of Affymetrix Chips Archived 2016-04-23 at the Wayback Machine

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
