Estimation of covariance matrices

Last updated

In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix (SCM) is an unbiased and efficient estimator of the covariance matrix if the space of covariance matrices is viewed as an extrinsic convex cone in Rp×p; however, measured using the intrinsic geometry of positive-definite matrices, the SCM is a biased and inefficient estimator. [1] In addition, if the random variable has a normal distribution, the sample covariance matrix has a Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations. Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive. [2] [3] [4]

Contents

Statistical analyses of multivariate data often involve exploratory studies of the way in which the variables change in relation to one another and this may be followed up by explicit statistical models involving the covariance matrix of the variables. Thus the estimation of covariance matrices directly from observational data plays two roles:

  • to provide initial estimates that can be used to study the inter-relationships;
  • to provide sample estimates that can be used for model checking.

Estimates of covariance matrices are required at the initial stages of principal component analysis and factor analysis, and are also involved in versions of regression analysis that treat the dependent variables in a data-set, jointly with the independent variable as the outcome of a random sample.

Estimation in a general context

Given a sample consisting of n independent observations x1,..., xn of a p-dimensional random vector XRp×1 (a p×1 column-vector), an unbiased estimator of the (p×p) covariance matrix

is the sample covariance matrix

where is the i-th observation of the p-dimensional random vector, and the vector

is the sample mean. This is true regardless of the distribution of the random variable X, provided of course that the theoretical means and covariances exist. The reason for the factor n  1 rather than n is essentially the same as the reason for the same factor appearing in unbiased estimates of sample variances and sample covariances, which relates to the fact that the mean is not known and is replaced by the sample mean (see Bessel's correction).

In cases where the distribution of the random variable X is known to be within a certain family of distributions, other estimates may be derived on the basis of that assumption. A well-known instance is when the random variable X is normally distributed: in this case the maximum likelihood estimator of the covariance matrix is slightly different from the unbiased estimate, and is given by

A derivation of this result is given below. Clearly, the difference between the unbiased estimator and the maximum likelihood estimator diminishes for large n.

In the general case, the unbiased estimate of the covariance matrix provides an acceptable estimate when the data vectors in the observed data set are all complete: that is they contain no missing elements. One approach to estimating the covariance matrix is to treat the estimation of each variance or pairwise covariance separately, and to use all the observations for which both variables have valid values. Assuming the missing data are missing at random this results in an estimate for the covariance matrix which is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix.

When estimating the cross-covariance of a pair of signals that are wide-sense stationary, missing samples do not need be random (e.g., sub-sampling by an arbitrary factor is valid).[ citation needed ]

Maximum-likelihood estimation for the multivariate normal distribution

A random vector XRp (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if Σ ∈ Rp × p is a positive-definite matrix and the probability density function of X is

where μRp×1 is the expected value of X. The covariance matrix Σ is the multidimensional analog of what in one dimension would be the variance, and

normalizes the density so that it integrates to 1.

Suppose now that X1, ..., Xn are independent and identically distributed samples from the distribution above. Based on the observed values x1, ..., xn of this sample, we wish to estimate Σ.

First steps

The likelihood function is:

It is fairly readily shown that the maximum-likelihood estimate of the mean vector μ is the "sample mean" vector:

See the section on estimation in the article on the normal distribution for details; the process here is similar.

Since the estimate does not depend on Σ, we can just substitute it for μ in the likelihood function, getting

and then seek the value of Σ that maximizes the likelihood of the data (in practice it is easier to work with log ).

The trace of a 1 × 1 matrix

Now we come to the first surprising step: regard the scalar as the trace of a 1×1 matrix. This makes it possible to use the identity tr(AB) = tr(BA) whenever A and B are matrices so shaped that both products exist. We get

where

is sometimes called the scatter matrix, and is positive definite if there exists a subset of the data consisting of affinely independent observations (which we will assume).

Using the spectral theorem

It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix S has a unique positive-definite symmetric square root S1/2. We can again use the "cyclic property" of the trace to write

Let B = S1/2 Σ −1S1/2. Then the expression above becomes

The positive-definite matrix B can be diagonalized, and then the problem of finding the value of B that maximizes

Since the trace of a square matrix equals the sum of eigenvalues ("trace and eigenvalues"), the equation reduces to the problem of finding the eigenvalues λ1, ..., λp that maximize

This is just a calculus problem and we get λi = n for all i. Thus, assume Q is the matrix of eigen vectors, then

i.e., n times the p×p identity matrix.

Concluding steps

Finally we get

i.e., the p×p "sample covariance matrix"

is the maximum-likelihood estimator of the "population covariance matrix" Σ. At this point we are using a capital X rather than a lower-case x because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. The random matrix S can be shown to have a Wishart distribution with n − 1 degrees of freedom. [5] That is:

Alternative derivation

An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Re-write the likelihood in the log form using the trace trick:

The differential of this log-likelihood is

It naturally breaks down into the part related to the estimation of the mean, and to the part related to the estimation of the variance. The first order condition for maximum, , is satisfied when the terms multiplying and are identically zero. Assuming (the maximum likelihood estimate of) is non-singular, the first order condition for the estimate of the mean vector is

which leads to the maximum likelihood estimator

This lets us simplify

as defined above. Then the terms involving in can be combined as

The first order condition will hold when the term in the square bracket is (matrix-valued) zero. Pre-multiplying the latter by and dividing by gives

which of course coincides with the canonical derivation given earlier.

Dwyer [6] points out that decomposition into two terms such as appears above is "unnecessary" and derives the estimator in two lines of working. Note that it may be not trivial to show that such derived estimator is the unique global maximizer for likelihood function.

Intrinsic covariance matrix estimation

Intrinsic expectation

Given a sample of n independent observations x1,..., xn of a p-dimensional zero-mean Gaussian random variable X with covariance R, the maximum likelihood estimator of R is given by

The parameter belongs to the set of positive-definite matrices, which is a Riemannian manifold, not a vector space, hence the usual vector-space notions of expectation, i.e. "", and estimator bias must be generalized to manifolds to make sense of the problem of covariance matrix estimation. This can be done by defining the expectation of a manifold-valued estimator with respect to the manifold-valued point as

where

are the exponential map and inverse exponential map, respectively, "exp" and "log" denote the ordinary matrix exponential and matrix logarithm, and E[·] is the ordinary expectation operator defined on a vector space, in this case the tangent space of the manifold. [1]

Bias of the sample covariance matrix

The intrinsic bias vector field of the SCM estimator is defined to be

The intrinsic estimator bias is then given by .

For complex Gaussian random variables, this bias vector field can be shown [1] to equal

where

and ψ(·) is the digamma function. The intrinsic bias of the sample covariance matrix equals

and the SCM is asymptotically unbiased as n → ∞.

Similarly, the intrinsic inefficiency of the sample covariance matrix depends upon the Riemannian curvature of the space of positive-definite matrices.

Shrinkage estimation

If the sample size n is small and the number of considered variables p is large, the above empirical estimators of covariance and correlation are very unstable. Specifically, it is possible to furnish estimators that improve considerably upon the maximum likelihood estimate in terms of mean squared error. Moreover, for n < p (the number of observations is less than the number of random variables) the empirical estimate of the covariance matrix becomes singular, i.e. it cannot be inverted to compute the precision matrix.

As an alternative, many methods have been suggested to improve the estimation of the covariance matrix. All of these approaches rely on the concept of shrinkage. This is implicit in Bayesian methods and in penalized maximum likelihood methods and explicit in the Stein-type shrinkage approach.

A simple version of a shrinkage estimator of the covariance matrix is represented by the Ledoit-Wolf shrinkage estimator. [7] [8] [9] [10] One considers a convex combination of the empirical estimator () with some suitable chosen target (), e.g., the diagonal matrix. Subsequently, the mixing parameter () is selected to maximize the expected accuracy of the shrunken estimator. This can be done by cross-validation, or by using an analytic estimate of the shrinkage intensity. The resulting regularized estimator () can be shown to outperform the maximum likelihood estimator for small samples. For large samples, the shrinkage intensity will reduce to zero, hence in this case the shrinkage estimator will be identical to the empirical estimator. Apart from increased efficiency the shrinkage estimate has the additional advantage that it is always positive definite and well conditioned.

Various shrinkage targets have been proposed:

  1. the identity matrix, scaled by the average sample variance;
  2. the single-index model;
  3. the constant-correlation model, where the sample variances are preserved, but all pairwise correlation coefficients are assumed to be equal to one another;
  4. the two-parameter matrix, where all variances are identical, and all covariances are identical to one another (although not identical to the variances);
  5. the diagonal matrix containing sample variances on the diagonal and zeros everywhere else;
  6. the identity matrix. [8]

The shrinkage estimator can be generalized to a multi-target shrinkage estimator that utilizes several targets simultaneously. [11] Software for computing a covariance shrinkage estimator is available in R (packages corpcor [12] and ShrinkCovMat [13] ), in Python (scikit-learn library ), and in MATLAB. [14]

See also

Related Research Articles

<span class="mw-page-title-main">Normal distribution</span> Probability distribution

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is

<span class="mw-page-title-main">Pauli matrices</span> Matrices important in quantum mechanics and the study of spin

In mathematical physics and mathematics, the Pauli matrices are a set of three 2 × 2 complex matrices which are Hermitian, involutory and unitary. Usually indicated by the Greek letter sigma, they are occasionally denoted by tau when used in connection with isospin symmetries.

<span class="mw-page-title-main">Variance</span> Statistical measure of how far values spread from their average

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value. Variance has a central role in statistics, where some ideas that use it include descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling. Variance is an important tool in the sciences, where statistical analysis of data is common. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by , , , , or .

The weighted arithmetic mean is similar to an ordinary arithmetic mean, except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. The notion of weighted mean plays a role in descriptive statistics and also occurs in a more general form in several other areas of mathematics.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other,, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.

<span class="mw-page-title-main">Covariance matrix</span> Measure of covariance of components of a random vector

In probability theory and statistics, a covariance matrix is a square matrix giving the covariance between each pair of elements of a given random vector. Any covariance matrix is symmetric and positive semi-definite and its main diagonal contains variances.

In statistics, the matrix normal distribution or matrix Gaussian distribution is a probability distribution that is a generalization of the multivariate normal distribution to matrix-valued random variables.

Hotellings <i>T</i>-squared distribution

In statistics, particularly in hypothesis testing, the Hotelling's T-squared distribution (T2), proposed by Harold Hotelling, is a multivariate probability distribution that is tightly related to the F-distribution and is most notable for arising as the distribution of a set of sample statistics that are natural generalizations of the statistics underlying the Student's t-distribution. The Hotelling's t-squared statistic (t2) is a generalization of Student's t-statistic that is used in multivariate hypothesis testing.

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.

In directional statistics, the von Mises–Fisher distribution, is a probability distribution on the -sphere in . If the distribution reduces to the von Mises distribution on the circle.

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.

Covariance matrix adaptation evolution strategy (CMA-ES) is a particular kind of strategy for numerical optimization. Evolution strategies (ES) are stochastic, derivative-free methods for numerical optimization of non-linear or non-convex continuous optimization problems. They belong to the class of evolutionary algorithms and evolutionary computation. An evolutionary algorithm is broadly based on the principle of biological evolution, namely the repeated interplay of variation and selection: in each generation (iteration) new individuals are generated by variation, usually in a stochastic way, of the current parental individuals. Then, some individuals are selected to become the parents in the next generation based on their fitness or objective function value . Like this, over the generation sequence, individuals with better and better -values are generated.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

The ensemble Kalman filter (EnKF) is a recursive filter suitable for problems with a large number of variables, such as discretizations of partial differential equations in geophysical models. The EnKF originated as a version of the Kalman filter for large problems, and it is now an important data assimilation component of ensemble forecasting. EnKF is related to the particle filter but the EnKF makes the assumption that all probability distributions involved are Gaussian; when it is applicable, it is much more efficient than the particle filter.

The sample mean and the sample covariance are statistics computed from a sample of data on one or more random variables.

<span class="mw-page-title-main">Normal-inverse-gamma distribution</span>

In probability theory and statistics, the normal-inverse-gamma distribution is a four-parameter family of multivariate continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and variance.

In probability theory, the family of complex normal distributions, denoted or , characterizes complex random variables whose real and imaginary parts are jointly normal. The complex normal family has three parameters: location parameter μ, covariance matrix , and the relation matrix . The standard complex normal is the univariate distribution with , , and .

In probability theory and statistics, the normal-inverse-Wishart distribution is a multivariate four-parameter family of continuous probability distributions. It is the conjugate prior of a multivariate normal distribution with unknown mean and covariance matrix.

References

  1. 1 2 3 Smith, Steven Thomas (May 2005). "Covariance, Subspace, and Intrinsic Cramér–Rao Bounds". IEEE Trans. Signal Process. 53 (5): 1610–1630. doi:10.1109/TSP.2005.845428. S2CID   2751194.
  2. Robust Statistics, Peter J. Huber, Wiley, 1981 (republished in paperback, 2004)
  3. "Modern applied statistics with S", William N. Venables, Brian D. Ripley, Springer, 2002, ISBN   0-387-95457-0, ISBN   978-0-387-95457-8, page 336
  4. Devlin, Susan J.; Gnanadesikan, R.; Kettenring, J. R. (1975). "Robust Estimation and Outlier Detection with Correlation Coefficients". Biometrika . 62 (3): 531–545. doi:10.1093/biomet/62.3.531.
  5. K.V. Mardia, J.T. Kent, and J.M. Bibby (1979) Multivariate Analysis , Academic Press.
  6. Dwyer, Paul S. (June 1967). "Some applications of matrix derivatives in multivariate analysis". Journal of the American Statistical Association. 62 (318): 607–625. doi:10.2307/2283988. JSTOR   2283988.
  7. O. Ledoit and M. Wolf (2004a) "A well-conditioned estimator for large-dimensional covariance matrices Archived 2014-12-05 at the Wayback Machine " Journal of Multivariate Analysis 88 (2): 365—411.
  8. 1 2 A. Touloumis (2015) "Nonparametric Stein-type shrinkage covariance matrix estimators in high-dimensional settings" Computational Statistics & Data Analysis 83: 251—261.
  9. O. Ledoit and M. Wolf (2003) "Improved estimation of the covariance matrix of stock returns with an application to portofolio selection Archived 2014-12-05 at the Wayback Machine " Journal of Empirical Finance10 (5): 603—621.
  10. O. Ledoit and M. Wolf (2004b) "Honey, I shrunk the sample covariance matrix Archived 2014-12-05 at the Wayback Machine " The Journal of Portfolio Management 30 (4): 110—119.
  11. T. Lancewicki and M. Aladjem (2014) "Multi-Target Shrinkage Estimation for Covariance Matrices", IEEE Transactions on Signal Processing , Volume: 62, Issue 24, pages: 6380-6390.
  12. corpcor: Efficient Estimation of Covariance and (Partial) Correlation, CRAN
  13. ShrinkCovMat: Shrinkage Covariance Matrix Estimators, CRAN
  14. MATLAB code for shrinkage targets: scaled identity, single-index model, constant-correlation model, two-parameter matrix, and diagonal matrix.