Hotelling's T-squared distribution

Hotelling's T² distribution
Probability density function: [Hotelling-pdf.png]
Cumulative distribution function: [Hotelling-cdf.png]
Parameters: p – dimension of the random variables; m – related to the sample size
Support: x ∈ (0, +∞)

In statistics, particularly in hypothesis testing, the Hotelling's T-squared distribution (T²), proposed by Harold Hotelling,[1] is a multivariate probability distribution that is closely related to the F-distribution and is most notable for arising as the distribution of a set of sample statistics that are natural generalizations of the statistics underlying Student's t-distribution. The Hotelling's t-squared statistic (t²) is a generalization of Student's t-statistic that is used in multivariate hypothesis testing.[2]


Motivation

The distribution arises in multivariate statistics in undertaking tests of the differences between the (multivariate) means of different populations, where tests for univariate problems would make use of a t-test. The distribution is named for Harold Hotelling, who developed it as a generalization of Student's t-distribution.[1]

Definition

If the vector $d$ is Gaussian multivariate-distributed with zero mean and unit covariance matrix $N(\mathbf{0}_p, \mathbf{I}_{p,p})$ and $M$ is a $p \times p$ matrix with unit scale matrix and $m$ degrees of freedom with a Wishart distribution $W(\mathbf{I}_{p,p}, m)$, then the quadratic form $X$ has a Hotelling distribution (with parameters $p$ and $m$):[3]

$$X = m\, d^{T} M^{-1} d \sim T^2(p, m).$$

Furthermore, if a random variable $X$ has Hotelling's T-squared distribution, $X \sim T^2_{p,m}$, then:[1]

$$\frac{m - p + 1}{p m} X \sim F_{p,\, m - p + 1},$$

where $F_{p,\, m-p+1}$ is the F-distribution with parameters $p$ and $m - p + 1$.
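Since SciPy exposes the F-distribution but not T² directly, this relation gives a convenient way to evaluate the T² density numerically. The following minimal sketch (Python, assuming NumPy and SciPy are available; the helper name `t2_pdf` is ours, not a library function) rescales the F density by the Jacobian of the linear change of variables:

```python
import numpy as np
from scipy import stats

def t2_pdf(x, p, m):
    """Density of Hotelling's T^2(p, m) at x, via the F-distribution.

    If X ~ T^2(p, m), then (m - p + 1) / (p * m) * X ~ F(p, m - p + 1),
    so the T^2 density is the F density rescaled by the Jacobian of
    this linear change of variables.
    """
    scale = (m - p + 1) / (p * m)          # maps T^2 values to F values
    return scale * stats.f.pdf(scale * x, p, m - p + 1)

# Example: density of T^2(p=3, m=10) at x = 5.0
print(t2_pdf(5.0, p=3, m=10))
```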

Hotelling t-squared statistic

Let $\mathbf{x}_1, \dots, \mathbf{x}_n$ be $n$ independent observations of a $p$-variate random vector with mean $\boldsymbol{\mu}$ and sample mean $\overline{\mathbf{x}}$, and let $\hat{\mathbf{\Sigma}}$ be the sample covariance:

$$\hat{\mathbf{\Sigma}} = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})',$$

where we denote transpose by an apostrophe. It can be shown that $\hat{\mathbf{\Sigma}}$ is a positive (semi) definite matrix and $(n-1)\hat{\mathbf{\Sigma}}$ follows a p-variate Wishart distribution with $n - 1$ degrees of freedom.[4] The sample covariance matrix of the mean reads $\hat{\mathbf{\Sigma}}_{\overline{\mathbf{x}}} = \hat{\mathbf{\Sigma}}/n$.

The Hotelling's t-squared statistic is then defined as:[5]

$$t^2 = (\overline{\mathbf{x}} - \boldsymbol{\mu})'\, \hat{\mathbf{\Sigma}}_{\overline{\mathbf{x}}}^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu}) = n\, (\overline{\mathbf{x}} - \boldsymbol{\mu})'\, \hat{\mathbf{\Sigma}}^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu}),$$

which is proportional to the squared Mahalanobis distance between the sample mean and $\boldsymbol{\mu}$. Because of this, one should expect the statistic to assume low values if $\overline{\mathbf{x}} \approx \boldsymbol{\mu}$, and high values if they are different.

From the distribution,

$$t^2 \sim T^2_{p,\, n-1} = \frac{p(n-1)}{n-p} F_{p,\, n-p},$$

where $F_{p,\, n-p}$ is the F-distribution with parameters $p$ and $n - p$.

In order to calculate a p-value (unrelated to the variable $p$ here), note that the distribution of $t^2$ equivalently implies that

$$\frac{n - p}{p(n-1)}\, t^2 \sim F_{p,\, n-p}.$$

Then, use the quantity on the left-hand side to evaluate the p-value corresponding to the sample, which comes from the F-distribution. A confidence region may also be determined using similar logic.
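As a worked illustration, the sketch below (Python with NumPy and SciPy; the helper name `hotelling_one_sample` and the example data are hypothetical) computes $t^2$ and the corresponding p-value for a one-sample test:

```python
import numpy as np
from scipy import stats

def hotelling_one_sample(X, mu0):
    """One-sample Hotelling's t^2 test of H0: mean = mu0.

    X is an (n, p) data matrix; mu0 is the hypothesized mean vector.
    Returns the t^2 statistic and the p-value from F(p, n - p).
    """
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # sample covariance (n - 1 denominator)
    diff = xbar - mu0
    t2 = n * diff @ np.linalg.solve(S, diff)
    f_stat = (n - p) / (p * (n - 1)) * t2  # ~ F(p, n - p) under H0
    p_value = stats.f.sf(f_stat, p, n - p)
    return t2, p_value

# Hypothetical example: 20 observations in 3 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
print(hotelling_one_sample(X, mu0=np.zeros(3)))
```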

Motivation

Let $\mathcal{N}_p(\boldsymbol{\mu}, \mathbf{\Sigma})$ denote a p-variate normal distribution with location $\boldsymbol{\mu}$ and known covariance $\mathbf{\Sigma}$. Let

$$\mathbf{x}_1, \dots, \mathbf{x}_n \sim \mathcal{N}_p(\boldsymbol{\mu}, \mathbf{\Sigma})$$

be $n$ independent identically distributed (iid) random variables, which may be represented as $p \times 1$ column vectors of real numbers. Define

$$\overline{\mathbf{x}} = \frac{\mathbf{x}_1 + \cdots + \mathbf{x}_n}{n}$$

to be the sample mean with covariance $\mathbf{\Sigma}_{\overline{\mathbf{x}}} = \mathbf{\Sigma}/n$. It can be shown that

$$(\overline{\mathbf{x}} - \boldsymbol{\mu})'\, \mathbf{\Sigma}_{\overline{\mathbf{x}}}^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu}) \sim \chi^2_p,$$

where $\chi^2_p$ is the chi-squared distribution with $p$ degrees of freedom.[6]

Proof

To show this, use the fact that $\overline{\mathbf{x}} \sim \mathcal{N}_p(\boldsymbol{\mu}, \mathbf{\Sigma}/n)$ and derive the characteristic function of the random variable $y = (\overline{\mathbf{x}} - \boldsymbol{\mu})'\, \mathbf{\Sigma}_{\overline{\mathbf{x}}}^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu})$. As usual, let $|\cdot|$ denote the determinant of the argument, as in $|\mathbf{\Sigma}|$.

By definition of characteristic function, we have:[7]

$$\varphi_y(\theta) = \operatorname{E} e^{i\theta y} = \operatorname{E} e^{i\theta (\overline{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{\Sigma}_{\overline{\mathbf{x}}}^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu})} = \int e^{i\theta (\overline{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{\Sigma}_{\overline{\mathbf{x}}}^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu})} (2\pi)^{-p/2} |\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{-1/2}\, e^{-(\overline{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{\Sigma}_{\overline{\mathbf{x}}}^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu})/2}\, d\overline{\mathbf{x}}$$

There are two exponentials inside the integral, so by multiplying the exponentials we add the exponents together, obtaining:

$$\varphi_y(\theta) = \int (2\pi)^{-p/2} |\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{-1/2}\, e^{-\frac{1}{2}(\overline{\mathbf{x}} - \boldsymbol{\mu})' \mathbf{\Sigma}_{\overline{\mathbf{x}}}^{-1} (1 - 2i\theta)(\overline{\mathbf{x}} - \boldsymbol{\mu})}\, d\overline{\mathbf{x}}$$

Now take the term $|\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{-1/2}$ off the integral, and multiply everything by an identity $I = |(1 - 2i\theta)^{-1}\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{1/2} \cdot |(1 - 2i\theta)^{-1}\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{-1/2}$, bringing one of them inside the integral:

$$\varphi_y(\theta) = |\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{-1/2}\, |(1 - 2i\theta)^{-1}\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{1/2} \int (2\pi)^{-p/2}\, |(1 - 2i\theta)^{-1}\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{-1/2}\, e^{-\frac{1}{2}(\overline{\mathbf{x}} - \boldsymbol{\mu})' \left((1 - 2i\theta)^{-1}\mathbf{\Sigma}_{\overline{\mathbf{x}}}\right)^{-1} (\overline{\mathbf{x}} - \boldsymbol{\mu})}\, d\overline{\mathbf{x}}$$

But the term inside the integral is precisely the probability density function of a multivariate normal distribution with covariance matrix $(1 - 2i\theta)^{-1}\mathbf{\Sigma}_{\overline{\mathbf{x}}}$ and mean $\boldsymbol{\mu}$, so when integrating over all $\overline{\mathbf{x}}$ it must yield $1$ per the probability axioms. We thus end up with:

$$\varphi_y(\theta) = |\mathbf{\Sigma}_{\overline{\mathbf{x}}}|^{-1/2}\, \left|(1 - 2i\theta)^{-1}\mathbf{\Sigma}_{\overline{\mathbf{x}}}\right|^{1/2} = \left|(1 - 2i\theta)^{-1} \mathbf{I}_p\right|^{1/2}$$

where $\mathbf{I}_p$ is an identity matrix of dimension $p$. Finally, calculating the determinant, we obtain:

$$\varphi_y(\theta) = (1 - 2i\theta)^{-p/2}$$

which is the characteristic function for a chi-square distribution with $p$ degrees of freedom.
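The result can also be checked empirically. The following Monte Carlo sketch (Python with NumPy and SciPy; all parameter choices are arbitrary illustrations) compares the empirical quantiles of the quadratic form with those of $\chi^2_p$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p, n, reps = 4, 50, 20_000

# Known population parameters (arbitrary choices for this demonstration)
mu = np.zeros(p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)        # a positive-definite covariance matrix
Sigma_xbar = Sigma / n                 # covariance of the sample mean

# Draw reps independent samples of size n and form the quadratic form
X = rng.multivariate_normal(mu, Sigma, size=(reps, n))   # shape (reps, n, p)
D = X.mean(axis=1) - mu                                  # sample-mean deviations
vals = np.einsum('ij,jk,ik->i', D, np.linalg.inv(Sigma_xbar), D)

# Empirical quantiles should closely match chi-squared quantiles with p dof
qs = [0.5, 0.9, 0.99]
print(np.quantile(vals, qs))
print(stats.chi2.ppf(qs, df=p))
```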

Two-sample statistic

If $\mathbf{x}_1, \dots, \mathbf{x}_{n_x} \sim N_p(\boldsymbol{\mu}, \mathbf{\Sigma})$ and $\mathbf{y}_1, \dots, \mathbf{y}_{n_y} \sim N_p(\boldsymbol{\mu}, \mathbf{\Sigma})$, with the samples independently drawn from two independent multivariate normal distributions with the same mean and covariance, and we define

$$\overline{\mathbf{x}} = \frac{1}{n_x} \sum_{i=1}^{n_x} \mathbf{x}_i, \qquad \overline{\mathbf{y}} = \frac{1}{n_y} \sum_{i=1}^{n_y} \mathbf{y}_i$$

as the sample means, and

$$\hat{\mathbf{\Sigma}}_{\mathbf{x}} = \frac{1}{n_x - 1} \sum_{i=1}^{n_x} (\mathbf{x}_i - \overline{\mathbf{x}})(\mathbf{x}_i - \overline{\mathbf{x}})', \qquad \hat{\mathbf{\Sigma}}_{\mathbf{y}} = \frac{1}{n_y - 1} \sum_{i=1}^{n_y} (\mathbf{y}_i - \overline{\mathbf{y}})(\mathbf{y}_i - \overline{\mathbf{y}})'$$

as the respective sample covariance matrices, then

$$\hat{\mathbf{\Sigma}} = \frac{(n_x - 1)\hat{\mathbf{\Sigma}}_{\mathbf{x}} + (n_y - 1)\hat{\mathbf{\Sigma}}_{\mathbf{y}}}{n_x + n_y - 2}$$

is the unbiased pooled covariance matrix estimate (an extension of pooled variance).

Finally, the Hotelling's two-sample t-squared statistic is

$$t^2 = \frac{n_x n_y}{n_x + n_y} (\overline{\mathbf{x}} - \overline{\mathbf{y}})'\, \hat{\mathbf{\Sigma}}^{-1} (\overline{\mathbf{x}} - \overline{\mathbf{y}}) \sim T^2(p,\, n_x + n_y - 2).$$

It can be related to the F-distribution by[4]

$$\frac{n_x + n_y - p - 1}{(n_x + n_y - 2)\, p}\, t^2 \sim F(p,\, n_x + n_y - 1 - p).$$

The non-null distribution of this statistic is the noncentral F-distribution (the ratio of a non-central chi-squared random variable and an independent central chi-squared random variable)

$$\frac{n_x + n_y - p - 1}{(n_x + n_y - 2)\, p}\, t^2 \sim F(p,\, n_x + n_y - 1 - p;\, \delta)$$

with

$$\delta = \frac{n_x n_y}{n_x + n_y} \boldsymbol{d}'\, \mathbf{\Sigma}^{-1} \boldsymbol{d},$$

where $\boldsymbol{d} = \boldsymbol{\mu}_x - \boldsymbol{\mu}_y$ is the difference vector between the population means.
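A minimal two-sample sketch follows (Python with NumPy and SciPy; the helper name `hotelling_two_sample` and the example data are hypothetical), computing the pooled covariance, $t^2$, and the F-based p-value:

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(X, Y):
    """Two-sample Hotelling's t^2 test of equal means, common covariance.

    X is (nx, p), Y is (ny, p). Returns t^2 and the p-value from
    F(p, nx + ny - 1 - p).
    """
    nx, p = X.shape
    ny = Y.shape[0]
    diff = X.mean(axis=0) - Y.mean(axis=0)
    # Unbiased pooled covariance matrix estimate
    S = ((nx - 1) * np.cov(X, rowvar=False)
         + (ny - 1) * np.cov(Y, rowvar=False)) / (nx + ny - 2)
    t2 = nx * ny / (nx + ny) * diff @ np.linalg.solve(S, diff)
    f_stat = (nx + ny - p - 1) / ((nx + ny - 2) * p) * t2
    p_value = stats.f.sf(f_stat, p, nx + ny - 1 - p)
    return t2, p_value

# Hypothetical example: two samples from the same 3-variate distribution
rng = np.random.default_rng(2)
print(hotelling_two_sample(rng.normal(size=(15, 3)), rng.normal(size=(25, 3))))
```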

In the two-variable case, the formula simplifies nicely, allowing appreciation of how the correlation, $\rho$, between the variables affects $t^2$. If we define

$$d_1 = \overline{x}_1 - \overline{y}_1, \qquad d_2 = \overline{x}_2 - \overline{y}_2$$

and

$$s_1 = \sqrt{\hat{\Sigma}_{11}}, \qquad s_2 = \sqrt{\hat{\Sigma}_{22}}, \qquad \rho = \hat{\Sigma}_{12}/(s_1 s_2) = \hat{\Sigma}_{21}/(s_1 s_2),$$

then

$$t^2 = \frac{n_x n_y}{(n_x + n_y)(1 - \rho^2)} \left[ \left(\frac{d_1}{s_1}\right)^2 + \left(\frac{d_2}{s_2}\right)^2 - 2\rho \left(\frac{d_1}{s_1}\right)\left(\frac{d_2}{s_2}\right) \right].$$

Thus, if the differences in the two rows of the vector $\boldsymbol{d} = \overline{\mathbf{x}} - \overline{\mathbf{y}}$ are of the same sign, in general, $t^2$ becomes smaller as $\rho$ becomes more positive. If the differences are of opposite sign, $t^2$ becomes larger as $\rho$ becomes more positive.
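This simplification can be verified numerically. The sketch below (Python with NumPy; all values are arbitrary hypothetical choices) evaluates both the matrix form and the two-variable formula and shows that they coincide:

```python
import numpy as np

nx, ny = 12, 15
d = np.array([0.8, 0.5])                   # hypothetical mean differences d1, d2
s1, s2, rho = 1.2, 0.9, 0.6                # hypothetical std devs and correlation

# Matrix form: t^2 = nx*ny/(nx+ny) * d' S^{-1} d
S = np.array([[s1**2, rho * s1 * s2],
              [rho * s1 * s2, s2**2]])
t2_matrix = nx * ny / (nx + ny) * d @ np.linalg.solve(S, d)

# Two-variable formula
z1, z2 = d[0] / s1, d[1] / s2
t2_formula = nx * ny / ((nx + ny) * (1 - rho**2)) * (z1**2 + z2**2 - 2 * rho * z1 * z2)

print(t2_matrix, t2_formula)               # the two values coincide
```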

A univariate special case can be found in Welch's t-test.

More robust and powerful tests than Hotelling's two-sample test have been proposed in the literature; see, for example, the interpoint distance based tests, which can be applied also when the number of variables is comparable with, or even larger than, the number of subjects.[8][9]


References

  1. Hotelling, H. (1931). "The generalization of Student's ratio". Annals of Mathematical Statistics. 2 (3): 360–378. doi:10.1214/aoms/1177732979.
  2. Johnson, R.A.; Wichern, D.W. (2002). Applied Multivariate Statistical Analysis. Vol. 5. Prentice Hall.
  3. Eric W. Weisstein, MathWorld.
  4. Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate Analysis. Academic Press. ISBN 978-0-12-471250-8.
  5. "6.5.4.3. Hotelling's T squared". NIST/SEMATECH e-Handbook of Statistical Methods.
  6. End of chapter 4.2 of Johnson, R.A. & Wichern, D.W. (2002).
  7. Billingsley, P. (1995). "26. Characteristic Functions". Probability and Measure (3rd ed.). Wiley. ISBN 978-0-471-00710-4.
  8. Marozzi, M. (2016). "Multivariate tests based on interpoint distances with application to magnetic resonance imaging". Statistical Methods in Medical Research. 25 (6): 2593–2610. doi:10.1177/0962280214529104. PMID 24740998.
  9. Marozzi, M. (2015). "Multivariate multidistance tests for high-dimensional low sample size case-control studies". Statistics in Medicine. 34 (9): 1511–1526. doi:10.1002/sim.6418. PMID 25630579.