# Canonical correlation

In statistics, canonical-correlation analysis (CCA), also called canonical variates analysis, is a way of inferring information from cross-covariance matrices. If we have two vectors X = (X1, ..., Xn) and Y = (Y1, ..., Ym) of random variables, and there are correlations among the variables, then canonical-correlation analysis will find linear combinations of X and Y which have maximum correlation with each other. [1] T. R. Knapp notes that "virtually all of the commonly encountered parametric tests of significance can be treated as special cases of canonical-correlation analysis, which is the general procedure for investigating the relationships between two sets of variables." [2] The method was first introduced by Harold Hotelling in 1936, [3] although in the context of angles between flats the mathematical concept was published by Jordan in 1875. [4]

## Definition

Given two column vectors ${\displaystyle X=(x_{1},\dots ,x_{n})'}$ and ${\displaystyle Y=(y_{1},\dots ,y_{m})'}$ of random variables with finite second moments, one may define the cross-covariance ${\displaystyle \Sigma _{XY}=\operatorname {cov} (X,Y)}$ to be the ${\displaystyle n\times m}$ matrix whose ${\displaystyle (i,j)}$ entry is the covariance ${\displaystyle \operatorname {cov} (x_{i},y_{j})}$. In practice, we would estimate the covariance matrix based on sampled data from ${\displaystyle X}$ and ${\displaystyle Y}$ (i.e. from a pair of data matrices).
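As an illustration of the estimation step, the following is a minimal sketch that computes the sample cross-covariance from a pair of data matrices. The function name `cross_covariance`, the row-per-observation layout, and the synthetic data are assumptions made for this example, not part of the definition above.

```python
import numpy as np

def cross_covariance(X, Y):
    """Sample cross-covariance of two data matrices.

    X has shape (p, n) and Y has shape (p, m), one row per observation.
    Returns the (n, m) matrix whose (i, j) entry estimates cov(x_i, y_j).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must have the same number of observations")
    Xc = X - X.mean(axis=0)  # centre each column
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (X.shape[0] - 1)  # unbiased (p - 1) normalisation

# Synthetic data (assumed): Y depends on the first two columns of X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Y = X[:, :2] + 0.5 * rng.normal(size=(500, 2))
S_xy = cross_covariance(X, Y)
print(S_xy.shape)
```

The result agrees with the off-diagonal block of the joint sample covariance matrix of the stacked variables.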

Canonical-correlation analysis seeks vectors ${\displaystyle a}$ (${\displaystyle a\in \mathbb {R} ^{n}}$) and ${\displaystyle b}$ (${\displaystyle b\in \mathbb {R} ^{m}}$) such that the random variables ${\displaystyle a^{T}X}$ and ${\displaystyle b^{T}Y}$ maximize the correlation ${\displaystyle \rho =\operatorname {corr} (a^{T}X,b^{T}Y)}$. The random variables ${\displaystyle U=a^{T}X}$ and ${\displaystyle V=b^{T}Y}$ are the first pair of canonical variables. Then one seeks vectors maximizing the same correlation subject to the constraint that they are to be uncorrelated with the first pair of canonical variables; this gives the second pair of canonical variables. This procedure may be continued up to ${\displaystyle \min\{m,n\}}$ times.

${\displaystyle (a',b')={\underset {a,b}{\operatorname {argmax} }}\operatorname {corr} (a^{T}X,b^{T}Y)}$

## Computation

### Derivation

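For any pair of random vectors ${\displaystyle U}$ and ${\displaystyle V}$, let ${\displaystyle \Sigma _{UV}}$ denote their cross-covariance matrix, so that ${\displaystyle \Sigma _{XX}}$ and ${\displaystyle \Sigma _{YY}}$ are the covariance matrices of ${\displaystyle X}$ and ${\displaystyle Y}$. The target function to maximize is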

${\displaystyle \rho ={\frac {a^{T}\Sigma _{XY}b}{{\sqrt {a^{T}\Sigma _{XX}a}}{\sqrt {b^{T}\Sigma _{YY}b}}}}.}$

The first step is a change of basis: define

${\displaystyle c=\Sigma _{XX}^{1/2}a,}$
${\displaystyle d=\Sigma _{YY}^{1/2}b.}$

And thus we have

${\displaystyle \rho ={\frac {c^{T}\Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1/2}d}{{\sqrt {c^{T}c}}{\sqrt {d^{T}d}}}}.}$

By the Cauchy–Schwarz inequality, we have

${\displaystyle \left(c^{T}\Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1/2}\right)(d)\leq \left(c^{T}\Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1/2}\Sigma _{YY}^{-1/2}\Sigma _{YX}\Sigma _{XX}^{-1/2}c\right)^{1/2}\left(d^{T}d\right)^{1/2},}$
${\displaystyle \rho \leq {\frac {\left(c^{T}\Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1}\Sigma _{YX}\Sigma _{XX}^{-1/2}c\right)^{1/2}}{\left(c^{T}c\right)^{1/2}}}.}$

There is equality if the vectors ${\displaystyle d}$ and ${\displaystyle \Sigma _{YY}^{-1/2}\Sigma _{YX}\Sigma _{XX}^{-1/2}c}$ are collinear. In addition, the maximum of correlation is attained if ${\displaystyle c}$ is the eigenvector with the maximum eigenvalue for the matrix ${\displaystyle \Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1}\Sigma _{YX}\Sigma _{XX}^{-1/2}}$ (see Rayleigh quotient). The subsequent pairs are found by using eigenvalues of decreasing magnitudes. Orthogonality is guaranteed by the symmetry of the correlation matrices.

Another way of viewing this computation is that ${\displaystyle c}$ and ${\displaystyle d}$ are the left and right singular vectors of the matrix ${\displaystyle \Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1/2}}$ (the cross-covariance matrix of the whitened X and Y) corresponding to the highest singular value.
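The singular-vector view can be sketched numerically as follows. This is a minimal illustration that assumes sample covariance estimates with full rank; the helper names `inv_sqrt` and `first_canonical_pair` and the synthetic data are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T  # V diag(w^{-1/2}) V^T

def first_canonical_pair(X, Y):
    """First canonical correlation and weight vectors (rows of X, Y are observations)."""
    p = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    S_xx = Xc.T @ Xc / (p - 1)
    S_yy = Yc.T @ Yc / (p - 1)
    S_xy = Xc.T @ Yc / (p - 1)
    Wx, Wy = inv_sqrt(S_xx), inv_sqrt(S_yy)
    # Singular vectors of the whitened cross-covariance play the roles of c and d.
    C, s, Dt = np.linalg.svd(Wx @ S_xy @ Wy)
    a = Wx @ C[:, 0]  # a = S_xx^{-1/2} c
    b = Wy @ Dt[0]    # b = S_yy^{-1/2} d
    return s[0], a, b

# Synthetic data (assumed) in which Y is a noisy linear function of part of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
Y = X[:, :2] @ np.array([[1.0, 0.5], [0.0, 1.0]]) + rng.normal(size=(500, 2))
rho, a, b = first_canonical_pair(X, Y)
# The sample correlation of the projections should reproduce the top singular value.
print(np.isclose(rho, np.corrcoef(X @ a, Y @ b)[0, 1]))
```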

### Solution

The solution is therefore:

• ${\displaystyle c}$ is an eigenvector of ${\displaystyle \Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1}\Sigma _{YX}\Sigma _{XX}^{-1/2}}$
• ${\displaystyle d}$ is proportional to ${\displaystyle \Sigma _{YY}^{-1/2}\Sigma _{YX}\Sigma _{XX}^{-1/2}c}$

Reciprocally, there is also:

• ${\displaystyle d}$ is an eigenvector of ${\displaystyle \Sigma _{YY}^{-1/2}\Sigma _{YX}\Sigma _{XX}^{-1}\Sigma _{XY}\Sigma _{YY}^{-1/2}}$
• ${\displaystyle c}$ is proportional to ${\displaystyle \Sigma _{XX}^{-1/2}\Sigma _{XY}\Sigma _{YY}^{-1/2}d}$

Reversing the change of coordinates, we have that

• ${\displaystyle a}$ is an eigenvector of ${\displaystyle \Sigma _{XX}^{-1}\Sigma _{XY}\Sigma _{YY}^{-1}\Sigma _{YX}}$,
• ${\displaystyle b}$ is proportional to ${\displaystyle \Sigma _{YY}^{-1}\Sigma _{YX}a;}$
• ${\displaystyle b}$ is an eigenvector of ${\displaystyle \Sigma _{YY}^{-1}\Sigma _{YX}\Sigma _{XX}^{-1}\Sigma _{XY},}$
• ${\displaystyle a}$ is proportional to ${\displaystyle \Sigma _{XX}^{-1}\Sigma _{XY}b}$.

The canonical variables are defined by:

${\displaystyle U=c'\Sigma _{XX}^{-1/2}X=a'X}$
${\displaystyle V=d'\Sigma _{YY}^{-1/2}Y=b'Y}$
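The eigenvector characterization of ${\displaystyle a}$ and ${\displaystyle b}$ can be sketched as below. This is an illustrative example rather than a recommended implementation: the function name `first_pair_eig`, the unit-variance scaling of the canonical variables, and the synthetic data are assumptions.

```python
import numpy as np

def first_pair_eig(S_xx, S_xy, S_yy):
    """First canonical pair from covariance blocks via the eigenproblem for a."""
    S_yx = S_xy.T
    # M = S_xx^{-1} S_xy S_yy^{-1} S_yx; solve() avoids forming explicit inverses.
    M = np.linalg.solve(S_xx, S_xy @ np.linalg.solve(S_yy, S_yx))
    eigvals, eigvecs = np.linalg.eig(M)  # M is not symmetric in general
    k = np.argmax(eigvals.real)          # largest eigenvalue equals rho^2
    a = eigvecs[:, k].real
    b = np.linalg.solve(S_yy, S_yx @ a)  # b is proportional to S_yy^{-1} S_yx a
    a = a / np.sqrt(a @ S_xx @ a)        # scale so that Var(U) = 1 (a convention)
    b = b / np.sqrt(b @ S_yy @ b)        # scale so that Var(V) = 1 (a convention)
    return a @ S_xy @ b, a, b

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
Y = X[:, :2] + rng.normal(size=(400, 2))
S = np.cov(X.T, Y.T)  # joint 5 x 5 sample covariance
S_xx, S_xy, S_yy = S[:3, :3], S[:3, 3:], S[3:, 3:]
rho, a, b = first_pair_eig(S_xx, S_xy, S_yy)
print(rho)
```

With this scaling the returned correlation is non-negative, and its square is the eigenvalue associated with ${\displaystyle a}$.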

### Implementation

CCA can be computed using singular value decomposition on a correlation matrix. [5] It is available as a function in [6]

CCA computation using singular value decomposition on a correlation matrix is related to the cosine of the angles between flats. The cosine function is ill-conditioned for small angles, leading to very inaccurate computation of highly correlated principal vectors in finite precision computer arithmetic. To fix this trouble, alternative algorithms [7] are available in

## Hypothesis testing

Each row can be tested for significance with the following method. Since the correlations are sorted, saying that row ${\displaystyle i}$ is zero implies all further correlations are also zero. Suppose we have ${\displaystyle p}$ independent observations in a sample, and let ${\displaystyle {\widehat {\rho }}_{i}}$ be the estimated correlation for ${\displaystyle i=1,\dots ,\min\{m,n\}}$. For the ${\displaystyle i}$th row, the test statistic is:

${\displaystyle \chi ^{2}=-\left(p-1-{\frac {1}{2}}(m+n+1)\right)\ln \prod _{j=i}^{\min\{m,n\}}(1-{\widehat {\rho }}_{j}^{2}),}$

which is asymptotically distributed as a chi-squared with ${\displaystyle (m-i+1)(n-i+1)}$ degrees of freedom for large ${\displaystyle p}$. [8] Since all the correlations from ${\displaystyle \min\{m,n\}}$ to ${\displaystyle p}$ are logically zero (and estimated that way also) the product for the terms after this point is irrelevant.

Note that in the small sample size limit with ${\displaystyle p<n+m}$, we are guaranteed that the top ${\displaystyle m+n-p}$ correlations will be identically 1, and hence the test is meaningless. [9]
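The test statistic and its degrees of freedom can be evaluated directly from the estimated correlations. The sketch below uses only the formula given above; the function name `bartlett_statistic` and the example values are hypothetical, and converting the statistic to a p-value (for instance with a chi-squared survival function) is left out.

```python
import math

def bartlett_statistic(rho_hat, p, n, m, i):
    """Chi-squared statistic and degrees of freedom for testing row i.

    rho_hat: estimated canonical correlations in decreasing order
    p: number of independent observations; n, m: dimensions of X and Y
    i: 1-based index of the row being tested
    """
    k = min(m, n)
    if len(rho_hat) != k or not 1 <= i <= k:
        raise ValueError("need min(m, n) correlations and 1 <= i <= min(m, n)")
    # ln of the product over j = i..min(m, n) of (1 - rho_j^2)
    log_prod = sum(math.log(1.0 - r * r) for r in rho_hat[i - 1:])
    stat = -(p - 1 - 0.5 * (m + n + 1)) * log_prod
    dof = (m - i + 1) * (n - i + 1)
    return stat, dof

# Illustrative values (assumed): two canonical correlations from 100 observations.
stat, dof = bartlett_statistic([0.8, 0.3], p=100, n=3, m=2, i=1)
print(stat, dof)
```

A correlation estimated as exactly 1 makes the logarithm undefined, which mirrors the degenerate small-sample case described above.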

## Practical uses

A typical use for canonical correlation in the experimental context is to take two sets of variables and see what is common among the two sets. [10] For example, in psychological testing, one could take two well-established multidimensional personality tests such as the Minnesota Multiphasic Personality Inventory (MMPI-2) and the NEO. By seeing how the MMPI-2 factors relate to the NEO factors, one could gain insight into what dimensions were common between the tests and how much variance was shared. For example, one might find that an extraversion or neuroticism dimension accounted for a substantial amount of shared variance between the two tests.

One can also use canonical-correlation analysis to produce a model equation which relates two sets of variables, for example a set of performance measures and a set of explanatory variables, or a set of outputs and a set of inputs. Constraint restrictions can be imposed on such a model to ensure it reflects theoretical requirements or intuitively obvious conditions. This type of model is known as a maximum correlation model. [11]

Visualization of the results of canonical correlation is usually through bar plots of the coefficients of the two sets of variables for the pairs of canonical variates showing significant correlation. Some authors suggest that they are best visualized by plotting them as heliographs, a circular format with ray-like bars, with each half representing the two sets of variables. [12]

## Examples

Let ${\displaystyle X=x_{1}}$ with zero expected value, i.e., ${\displaystyle \operatorname {E} (X)=0}$.

1. If ${\displaystyle Y=X}$, i.e., ${\displaystyle X}$ and ${\displaystyle Y}$ are perfectly correlated, then, e.g., ${\displaystyle a=1}$ and ${\displaystyle b=1}$, so that the first (and only in this example) pair of canonical variables is ${\displaystyle U=X}$ and ${\displaystyle V=Y=X}$.
2. If ${\displaystyle Y=-X}$, i.e., ${\displaystyle X}$ and ${\displaystyle Y}$ are perfectly anticorrelated, then, e.g., ${\displaystyle a=1}$ and ${\displaystyle b=-1}$, so that the first (and only in this example) pair of canonical variables is ${\displaystyle U=X}$ and ${\displaystyle V=-Y=X}$.

We notice that in both cases ${\displaystyle U=V}$, which illustrates that the canonical-correlation analysis treats correlated and anticorrelated variables similarly.
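Both cases can be checked numerically on simulated data. In this sketch the helper `canonical_pair_1d` and the normally distributed sample are assumptions chosen for illustration.

```python
import numpy as np

def canonical_pair_1d(x, y, a, b):
    """Canonical variables U = a*x, V = b*y and their sample correlation."""
    u, v = a * x, b * y
    return u, v, np.corrcoef(u, v)[0, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x -= x.mean()  # zero sample mean, matching E(X) = 0

# Case 1: Y = X with a = 1, b = 1.  Case 2: Y = -X with a = 1, b = -1.
u1, v1, rho1 = canonical_pair_1d(x, x, 1.0, 1.0)
u2, v2, rho2 = canonical_pair_1d(x, -x, 1.0, -1.0)
print(np.allclose(u1, v1), np.allclose(u2, v2), rho1, rho2)
```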

## Connection to principal angles

Assuming that ${\displaystyle X=(x_{1},\dots ,x_{n})'}$ and ${\displaystyle Y=(y_{1},\dots ,y_{m})'}$ have zero expected values, i.e., ${\displaystyle \operatorname {E} (X)=\operatorname {E} (Y)=0}$, their covariance matrices ${\displaystyle \Sigma _{XX}=\operatorname {Cov} (X,X)=\operatorname {E} [XX']}$ and ${\displaystyle \Sigma _{YY}=\operatorname {Cov} (Y,Y)=\operatorname {E} [YY']}$ can be viewed as Gram matrices in an inner product for the entries of ${\displaystyle X}$ and ${\displaystyle Y}$, correspondingly. In this interpretation, the random variables, entries ${\displaystyle x_{i}}$ of ${\displaystyle X}$ and ${\displaystyle y_{j}}$ of ${\displaystyle Y}$ are treated as elements of a vector space with an inner product given by the covariance ${\displaystyle \operatorname {cov} (x_{i},y_{j})}$; see Covariance#Relationship to inner products.

The definition of the canonical variables ${\displaystyle U}$ and ${\displaystyle V}$ is then equivalent to the definition of principal vectors for the pair of subspaces spanned by the entries of ${\displaystyle X}$ and ${\displaystyle Y}$ with respect to this inner product. The canonical correlations ${\displaystyle \operatorname {corr} (U,V)}$ are equal to the cosines of the principal angles.
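For sampled data, the same correspondence holds between the sample canonical correlations and the principal angles between the column spaces of the centred data matrices. The following sketch computes the cosines from orthonormal bases obtained by QR factorization; the function name `canonical_correlations_qr`, the full-column-rank assumption, and the synthetic data are assumptions of this example.

```python
import numpy as np

def canonical_correlations_qr(X, Y):
    """Cosines of the principal angles between the column spaces of centred X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)  # orthonormal basis for the span of the columns of Xc
    Qy, _ = np.linalg.qr(Yc)
    cosines = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)  # guard against rounding slightly above 1

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
Y = X[:, :2] + rng.normal(size=(300, 2))
cosines = canonical_correlations_qr(X, Y)
print(cosines)
```

The squared cosines coincide with the nonzero eigenvalues of the sample version of ${\displaystyle \Sigma _{XX}^{-1}\Sigma _{XY}\Sigma _{YY}^{-1}\Sigma _{YX}}$.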

## Whitening and probabilistic canonical correlation analysis

CCA can also be viewed as a special whitening transformation where the random vectors ${\displaystyle X}$ and ${\displaystyle Y}$ are simultaneously transformed in such a way that the cross-correlation between the whitened vectors ${\displaystyle X^{CCA}}$ and ${\displaystyle Y^{CCA}}$ is diagonal. [13] The canonical correlations are then interpreted as regression coefficients linking ${\displaystyle X^{CCA}}$ and ${\displaystyle Y^{CCA}}$ and may also be negative. The regression view of CCA also provides a way to construct a latent variable probabilistic generative model for CCA, with uncorrelated hidden variables representing shared and non-shared variability.
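The whitening view can be sketched numerically as follows. This example whitens with the inverse square roots of the sample covariance matrices and then rotates by the singular vectors, which is an assumption made here and may differ in detail from the formulation in the cited work; `cca_whiten` is a hypothetical name. Because it follows the SVD sign convention, the diagonal entries come out non-negative.

```python
import numpy as np

def cca_whiten(X, Y):
    """Simultaneously whiten X and Y so that their cross-covariance is diagonal."""
    p = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    S_xx = Xc.T @ Xc / (p - 1)
    S_yy = Yc.T @ Yc / (p - 1)
    S_xy = Xc.T @ Yc / (p - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return (V / np.sqrt(w)) @ V.T  # V diag(w^{-1/2}) V^T

    Wx, Wy = inv_sqrt(S_xx), inv_sqrt(S_yy)
    C, rho, Dt = np.linalg.svd(Wx @ S_xy @ Wy)  # full SVD: C is n x n, Dt is m x m
    X_cca = Xc @ Wx @ C     # rows are whitened, rotated observations of X
    Y_cca = Yc @ Wy @ Dt.T  # rows are whitened, rotated observations of Y
    return X_cca, Y_cca, rho

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
Y = X[:, :2] + rng.normal(size=(500, 2))
X_cca, Y_cca, rho = cca_whiten(X, Y)
cross = X_cca.T @ Y_cca / (X.shape[0] - 1)  # 3 x 2, diagonal up to rounding
print(np.round(cross, 3))
```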

## References

1. Härdle, Wolfgang; Simar, Léopold (2007). "Canonical Correlation Analysis". Applied Multivariate Statistical Analysis. pp. 321–330. doi:10.1007/978-3-540-72244-1_14. ISBN 978-3-540-72243-4.
2. Knapp, T. R. (1978). "Canonical correlation analysis: A general parametric significance-testing system". Psychological Bulletin. 85 (2): 410–416. doi:10.1037/0033-2909.85.2.410.
3. Hotelling, H. (1936). "Relations Between Two Sets of Variates". Biometrika. 28 (3–4): 321–377. doi:10.1093/biomet/28.3-4.321. JSTOR 2333955.
4. Jordan, C. (1875). "Essai sur la géométrie à ${\displaystyle n}$ dimensions". Bull. Soc. Math. France. 3: 103.
5. Hsu, D.; Kakade, S. M.; Zhang, T. (2012). "A spectral algorithm for learning Hidden Markov Models" (PDF). Journal of Computer and System Sciences. 78 (5): 1460. doi:10.1016/j.jcss.2011.12.025.
6. Huang, S. Y.; Lee, M. H.; Hsiao, C. K. (2009). "Nonlinear measures of association with kernel canonical correlation analysis and applications" (PDF). Journal of Statistical Planning and Inference. 139 (7): 2162. doi:10.1016/j.jspi.2008.10.011.
7. Knyazev, A. V.; Argentati, M. E. (2002). "Principal Angles between Subspaces in an A-Based Scalar Product: Algorithms and Perturbation Estimates". SIAM Journal on Scientific Computing. 23 (6): 2009–2041. doi:10.1137/S1064827500377332.
8. Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate Analysis. Academic Press.
9. Song, Yang; Schreier, Peter J.; Ramírez, David; Hasija, Tanuj. "Canonical correlation analysis of high-dimensional data with very small sample support". arXiv:1604.02047.
10. Sieranoja, S.; Sahidullah, Md; Kinnunen, T.; Komulainen, J.; Hadid, A. (July 2018). "Audiovisual Synchrony Detection with Optimized Audio Features" (PDF). IEEE 3rd Int. Conference on Signal and Image Processing (ICSIP 2018).
11. Tofallis, C. (1999). "Model Building with Multiple Dependent Variables and Constraints". Journal of the Royal Statistical Society, Series D. 48 (3): 371–378. doi:10.1111/1467-9884.00195.
12. Degani, A.; Shafto, M.; Olson, L. (2006). "Canonical Correlation Analysis: Use of Composite Heliographs for Representing Multiple Patterns" (PDF). Diagrammatic Representation and Inference. Lecture Notes in Computer Science. Vol. 4045. p. 93. doi:10.1007/11783183_11. ISBN 978-3-540-35623-3.
13. Jendoubi, T.; Strimmer, K. (2018). "A whitening approach to probabilistic canonical correlation analysis for omics data integration". BMC Bioinformatics. 20 (1): 15. doi:10.1186/s12859-018-2572-9. PMID 30626338.
1. Haghighat, Mohammad; Abdel-Mottaleb, Mohamed; Alhalabi, Wadee (2016). "Discriminant Correlation Analysis: Real-Time Feature Level Fusion for Multimodal Biometric Recognition". IEEE Transactions on Information Forensics and Security. 11 (9): 1984–1996. doi:10.1109/TIFS.2016.2569061.