Whitening transformation

Last updated November 20, 2024

A whitening transformation or sphering transformation is a linear transformation that transforms a vector of random variables with a known covariance matrix into a set of new variables whose covariance is the identity matrix, meaning that they are uncorrelated and each have variance 1.^[1] The transformation is called "whitening" because it changes the input vector into a white noise vector.

the decorrelation transform removes only the correlations but leaves variances intact,
the standardization transform sets variances to 1 but leaves correlations intact,
a coloring transformation transforms a vector of white random variables into a random vector with a specified covariance matrix.^[2]

Definition

Suppose $X$ is a random (column) vector with non-singular covariance matrix $\Sigma$ and mean $0$ . Then the transformation $Y=WX$ with a whitening matrix $W$ satisfying the condition $W^{\mathrm {T} }W=\Sigma ^{-1}$ yields the whitened random vector $Y$ with unit diagonal covariance.

If $X$ has non-zero mean $\mu$ , then whitening can be performed by $Y=W(X-\mu )$ .

There are infinitely many possible whitening matrices $W$ that all satisfy the above condition. Commonly used choices are $W=\Sigma ^{-1/2}$ (Mahalanobis or ZCA whitening), $W=L^{T}$ where $L$ is the Cholesky decomposition of $\Sigma ^{-1}$ (Cholesky whitening),^[3] or the eigen-system of $\Sigma$ (PCA whitening).^[4]

Optimal whitening transforms can be singled out by investigating the cross-covariance and cross-correlation of $X$ and $Y$ .^[3] For example, the unique optimal whitening transformation achieving maximal component-wise correlation between original $X$ and whitened $Y$ is produced by the whitening matrix $W=P^{-1/2}V^{-1/2}$ where $P$ is the correlation matrix and $V$ the diagonal variance matrix.

Whitening a data matrix

Whitening a data matrix follows the same transformation as for random variables. An empirical whitening transform is obtained by estimating the covariance (e.g. by maximum likelihood) and subsequently constructing a corresponding estimated whitening matrix (e.g. by Cholesky decomposition).

High-dimensional whitening

This modality is a generalization of the pre-whitening procedure extended to more general spaces where $X$ is usually assumed to be a random function or other random objects in a Hilbert space $H$ . One of the main issues of extending whitening to infinite dimensions is that the covariance operator has an unbounded inverse in $H$ . Nevertheless, if one assumes that Picard condition holds for $X$ in the range space of the covariance operator, whitening becomes possible.^[5] A whitening operator can be then defined from the factorization of the Moore–Penrose inverse of the covariance operator, which has effective mapping on Karhunen–Loève type expansions of $X$ . The advantage of these whitening transformations is that they can be optimized according to the underlying topological properties of the data, thus producing more robust whitening representations. High-dimensional features of the data can be exploited through kernel regressors or basis function systems.^[6]

R implementation

An implementation of several whitening procedures in R, including ZCA-whitening and PCA whitening but also CCA whitening, is available in the "whitening" R package ^[7] published on CRAN. The R package "pfica"^[8] allows the computation of high-dimensional whitening representations using basis function systems (B-splines, Fourier basis, etc.).

Related Research Articles

In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value. It is the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by $,,,, or .$

In signal processing, white noise is a random signal having equal intensity at different frequencies, giving it a constant power spectral density. The term is used with this or similar meanings in many scientific and technical disciplines, including physics, acoustical engineering, telecommunications, and statistical forecasting. White noise refers to a statistical model for signals and signal sources, not to any specific signal. White noise draws its name from white light, although light that appears white generally does not have a flat power spectral density over the visible band.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.

<span class="mw-page-title-main">Principal component analysis</span> Method of data analysis

Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the price of a good and the quantity the consumers are willing to purchase, as it is depicted in the so-called demand curve.

Covariance in probability theory and statistics is a measure of the joint variability of two random variables.

<span class="mw-page-title-main">Covariance matrix</span> Measure of covariance of components of a random vector

In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector.

In statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of children from a primary school to have a Pearson correlation coefficient significantly greater than 0, but less than 1.

In statistics, canonical-correlation analysis (CCA), also called canonical variates analysis, is a way of inferring information from cross-covariance matrices. If we have two vectors X = (X₁, ..., X_n) and Y = (Y₁, ..., Y_m) of random variables, and there are correlations among the variables, then canonical-correlation analysis will find linear combinations of X and Y that have a maximum correlation with each other. T. R. Knapp notes that "virtually all of the commonly encountered parametric tests of significance can be treated as special cases of canonical-correlation analysis, which is the general procedure for investigating the relationships between two sets of variables." The method was first introduced by Harold Hotelling in 1936, although in the context of angles between flats the mathematical concept was published by Camille Jordan in 1875.

In statistics, propagation of uncertainty is the effect of variables' uncertainties on the uncertainty of a function based on them. When the variables are the values of experimental measurements they have uncertainties due to measurement limitations which propagate due to the combination of variables in the function.

In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.

The Mahalanobis distance is a measure of the distance between a point $and a distribution, introduced by P. C. Mahalanobis in 1933. The mathematical details of Mahalanobis distance first appeared in the Journal of The Asiatic Society of Bengal in 1933. Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based on measurements. R.C. Bose later obtained the sampling distribution of Mahalanobis distance, under the assumption of equal dispersion.$

In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in R^p×p; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator. In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.

In statistics and signal processing, a minimum mean square error (MMSE) estimator is an estimation method which minimizes the mean square error (MSE), which is a common measure of estimator quality, of the fitted values of a dependent variable. In the Bayesian setting, the term MMSE more specifically refers to estimation with quadratic loss function. In such case, the MMSE estimator is given by the posterior mean of the parameter to be estimated. Since the posterior mean is cumbersome to calculate, the form of the MMSE estimator is usually constrained to be within a certain class of functions. Linear MMSE estimators are a popular choice since they are easy to use, easy to calculate, and very versatile. It has given rise to many popular estimators such as the Wiener–Kolmogorov filter and Kalman filter.

In probability theory and statistics, the mathematical concepts of covariance and correlation are very similar. Both describe the degree to which two random variables or sets of random variables tend to deviate from their expected values in similar ways.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. When determining the numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another confounding variable that is numerically related to both variables of interest. This misleading information can be avoided by controlling for the confounding variable, which is done by computing the partial correlation coefficient. This is precisely the motivation for including other right-side variables in a multiple regression; but while multiple regression gives unbiased results for the effect size, it does not give a numerical value of a measure of the strength of the relationship between the two variables of interest.

In probability theory and statistics, the covariance function describes how much two random variables change together (their covariance) with varying spatial or temporal separation. For a random field or stochastic process Z(x) on a domain D, a covariance function C(x, y) gives the covariance of the values of the random field at the two locations x and y:

In statistics, the RV coefficient is a multivariate generalization of the squared Pearson correlation coefficient. It measures the closeness of two set of points that may each be represented in a matrix.

In the mathematical theory of probability, multivariate Laplace distributions are extensions of the Laplace distribution and the asymmetric Laplace distribution to multiple variables. The marginal distributions of symmetric multivariate Laplace distribution variables are Laplace distributions. The marginal distributions of asymmetric multivariate Laplace distribution variables are asymmetric Laplace distributions.

References

↑ Koivunen, A.C.; Kostinski, A.B. (1999). "The Feasibility of Data Whitening to Improve Performance of Weather Radar". Journal of Applied Meteorology. 38 (6): 741–749. Bibcode:1999JApMe..38..741K. doi: 10.1175/1520-0450(1999)038<0741:TFODWT>2.0.CO;2 . ISSN 1520-0450.
↑ Hossain, Miliha. "Whitening and Coloring Transforms for Multivariate Gaussian Random Variables". Project Rhea. Retrieved 21 March 2016.
1 2 Kessy, A.; Lewin, A.; Strimmer, K. (2018). "Optimal whitening and decorrelation". The American Statistician. 72 (4): 309–314. arXiv: 1512.00809 . doi:10.1080/00031305.2016.1277159. S2CID 55075085.
↑ Friedman, J. (1987). "Exploratory Projection Pursuit" (PDF). Journal of the American Statistical Association. 82 (397): 249–266. doi:10.1080/01621459.1987.10478427. ISSN 0162-1459. JSTOR 2289161. OSTI 1447861.
↑ Vidal, M.; Aguilera, A.M. (2022). "Novel whitening approaches in functional settings". STAT. 12 (1): e516. doi: 10.1002/sta4.516 . hdl: 1854/LU-8770510 .
↑ Ramsay, J.O.; Silverman, J.O. (2005). Functional Data Analysis. Springer New York, NY. doi:10.1007/b98888. ISBN 978-0-387-40080-8.
↑ "whitening R package" . Retrieved 2018-11-25.
↑ "pfica R package" . Retrieved 2023-02-11.

External links

http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf
The ZCA whitening transformation. Appendix A of Learning Multiple Layers of Features from Tiny Images by A. Krizhevsky.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Koivunen, A.C.; Kostinski, A.B. (1999). "The Feasibility of Data Whitening to Improve Performance of Weather Radar". Journal of Applied Meteorology. 38 (6): 741–749. Bibcode:1999JApMe..38..741K. doi: 10.1175/1520-0450(1999)038<0741:TFODWT>2.0.CO;2 . ISSN 1520-0450.

[2] Hossain, Miliha. "Whitening and Coloring Transforms for Multivariate Gaussian Random Variables". Project Rhea. Retrieved 21 March 2016.

[kessy-3] 1 2 Kessy, A.; Lewin, A.; Strimmer, K. (2018). "Optimal whitening and decorrelation". The American Statistician. 72 (4): 309–314. arXiv: 1512.00809 . doi:10.1080/00031305.2016.1277159. S2CID 55075085.

[4] Friedman, J. (1987). "Exploratory Projection Pursuit" (PDF). Journal of the American Statistical Association. 82 (397): 249–266. doi:10.1080/01621459.1987.10478427. ISSN 0162-1459. JSTOR 2289161. OSTI 1447861.

[5] Vidal, M.; Aguilera, A.M. (2022). "Novel whitening approaches in functional settings". STAT. 12 (1): e516. doi: 10.1002/sta4.516 . hdl: 1854/LU-8770510 .

[6] Ramsay, J.O.; Silverman, J.O. (2005). Functional Data Analysis. Springer New York, NY. doi:10.1007/b98888. ISBN 978-0-387-40080-8.

[7] "whitening R package" . Retrieved 2018-11-25.

[8] "pfica R package" . Retrieved 2023-02-11.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]