Matrix normal distribution

Notation: $\mathbf{X} \sim \mathcal{MN}_{n\times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$
Parameters:
  $\mathbf{M}$: location ($n \times p$ real matrix)
  $\mathbf{U}$: among-row scale ($n \times n$ positive-definite real matrix)
  $\mathbf{V}$: among-column scale ($p \times p$ positive-definite real matrix)
Support: $\mathbf{X} \in \mathbb{R}^{n \times p}$
Mean: $\mathbf{M}$
Variance: $\mathbf{U}$ (among-row) and $\mathbf{V}$ (among-column)

In statistics, the matrix normal distribution or matrix Gaussian distribution is a probability distribution that is a generalization of the multivariate normal distribution to matrix-valued random variables.

Definition

The probability density function for the random matrix $\mathbf{X}$ ($n \times p$) that follows the matrix normal distribution $\mathcal{MN}_{n\times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$ has the form:

$$ p(\mathbf{X} \mid \mathbf{M}, \mathbf{U}, \mathbf{V}) = \frac{\exp\left(-\tfrac{1}{2} \operatorname{tr}\left[\mathbf{V}^{-1} (\mathbf{X} - \mathbf{M})^{\mathsf{T}} \mathbf{U}^{-1} (\mathbf{X} - \mathbf{M})\right]\right)}{(2\pi)^{np/2} \, |\mathbf{V}|^{n/2} \, |\mathbf{U}|^{p/2}} $$

where $\operatorname{tr}$ denotes trace, $\mathbf{M}$ is $n \times p$, $\mathbf{U}$ is $n \times n$ and $\mathbf{V}$ is $p \times p$, and the density is understood as the probability density function with respect to the standard Lebesgue measure in $\mathbb{R}^{n \times p}$, i.e. the measure corresponding to integration with respect to $dx_{11}\, dx_{21} \cdots dx_{n1}\, dx_{12} \cdots dx_{np}$.
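
For concreteness, the density can be evaluated directly from this formula. Below is a minimal sketch using NumPy (the function name is illustrative, not from the article); it computes the log-density, which is numerically safer than the density itself:

```python
import numpy as np

def matrix_normal_logpdf(X, M, U, V):
    """Log-density of MN_{n x p}(M, U, V) at X (illustrative helper)."""
    n, p = X.shape
    R = X - M
    # Quadratic form tr[V^{-1} R^T U^{-1} R], using solves instead of inverses.
    quad = np.trace(np.linalg.solve(V, R.T) @ np.linalg.solve(U, R))
    _, logdet_U = np.linalg.slogdet(U)
    _, logdet_V = np.linalg.slogdet(V)
    # Log-normalizer: (np/2) log(2 pi) + (n/2) log|V| + (p/2) log|U|.
    return -0.5 * (quad + n * p * np.log(2 * np.pi)
                   + n * logdet_V + p * logdet_U)
```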

The matrix normal is related to the multivariate normal distribution in the following way:

$$ \mathbf{X} \sim \mathcal{MN}_{n\times p}(\mathbf{M}, \mathbf{U}, \mathbf{V}) $$

if and only if

$$ \operatorname{vec}(\mathbf{X}) \sim \mathcal{N}_{np}\left(\operatorname{vec}(\mathbf{M}), \, \mathbf{V} \otimes \mathbf{U}\right), $$

where $\otimes$ denotes the Kronecker product and $\operatorname{vec}(\mathbf{M})$ denotes the vectorization of $\mathbf{M}$.
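
This identity can be checked numerically. The sketch below assumes SciPy's scipy.stats.matrix_normal, whose rowcov and colcov arguments play the roles of U and V here; note that vec stacks columns, hence order='F':

```python
import numpy as np
from scipy.stats import matrix_normal, multivariate_normal

rng = np.random.default_rng(0)
n, p = 3, 2

M = rng.standard_normal((n, p))
A = rng.standard_normal((n, n)); U = A @ A.T + n * np.eye(n)  # n x n SPD
B = rng.standard_normal((p, p)); V = B @ B.T + p * np.eye(p)  # p x p SPD

X = rng.standard_normal((n, p))  # an arbitrary evaluation point

# Matrix normal density of X ...
lhs = matrix_normal(mean=M, rowcov=U, colcov=V).pdf(X)

# ... equals the multivariate normal density of vec(X) with covariance
# kron(V, U); vec stacks columns, so flatten in Fortran order.
rhs = multivariate_normal(mean=M.flatten(order='F'),
                          cov=np.kron(V, U)).pdf(X.flatten(order='F'))

assert np.isclose(lhs, rhs)
```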

Proof

The equivalence between the above matrix normal and multivariate normal density functions can be shown using several properties of the trace and Kronecker product, as follows. We start with the argument of the exponent of the matrix normal PDF:

$$
\begin{aligned}
&-\tfrac{1}{2} \operatorname{tr}\left[\mathbf{V}^{-1} (\mathbf{X} - \mathbf{M})^{\mathsf{T}} \mathbf{U}^{-1} (\mathbf{X} - \mathbf{M})\right] \\
&\quad = -\tfrac{1}{2} \operatorname{vec}(\mathbf{X} - \mathbf{M})^{\mathsf{T}} \operatorname{vec}\left(\mathbf{U}^{-1} (\mathbf{X} - \mathbf{M}) \mathbf{V}^{-1}\right) \\
&\quad = -\tfrac{1}{2} \operatorname{vec}(\mathbf{X} - \mathbf{M})^{\mathsf{T}} \left(\mathbf{V}^{-1} \otimes \mathbf{U}^{-1}\right) \operatorname{vec}(\mathbf{X} - \mathbf{M}) \\
&\quad = -\tfrac{1}{2} \left[\operatorname{vec}(\mathbf{X}) - \operatorname{vec}(\mathbf{M})\right]^{\mathsf{T}} \left(\mathbf{V} \otimes \mathbf{U}\right)^{-1} \left[\operatorname{vec}(\mathbf{X}) - \operatorname{vec}(\mathbf{M})\right],
\end{aligned}
$$

using the cyclic property of the trace, the identity $\operatorname{tr}(\mathbf{A}^{\mathsf{T}} \mathbf{B}) = \operatorname{vec}(\mathbf{A})^{\mathsf{T}} \operatorname{vec}(\mathbf{B})$, and the identity $\operatorname{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^{\mathsf{T}} \otimes \mathbf{A}) \operatorname{vec}(\mathbf{B})$. This is the argument of the exponent of the multivariate normal PDF with respect to Lebesgue measure in $\mathbb{R}^{np}$. The proof is completed by using the determinant property:

$$ |\mathbf{V} \otimes \mathbf{U}| = |\mathbf{V}|^{n} \, |\mathbf{U}|^{p}. $$

Properties

If $\mathbf{X} \sim \mathcal{MN}_{n\times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$, then we have the following properties: [1] [2]

Expected values

The mean, or expected value, is:

$$ E[\mathbf{X}] = \mathbf{M} $$

and we have the following second-order expectations:

$$ E\left[(\mathbf{X} - \mathbf{M})(\mathbf{X} - \mathbf{M})^{\mathsf{T}}\right] = \mathbf{U} \operatorname{tr}(\mathbf{V}) $$

$$ E\left[(\mathbf{X} - \mathbf{M})^{\mathsf{T}}(\mathbf{X} - \mathbf{M})\right] = \mathbf{V} \operatorname{tr}(\mathbf{U}) $$

where $\operatorname{tr}$ denotes trace.

More generally, for appropriately dimensioned matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$:

$$
\begin{aligned}
E[\mathbf{X} \mathbf{A} \mathbf{X}^{\mathsf{T}}] &= \mathbf{U} \operatorname{tr}(\mathbf{A}^{\mathsf{T}} \mathbf{V}) + \mathbf{M} \mathbf{A} \mathbf{M}^{\mathsf{T}} \\
E[\mathbf{X}^{\mathsf{T}} \mathbf{B} \mathbf{X}] &= \mathbf{V} \operatorname{tr}(\mathbf{U} \mathbf{B}^{\mathsf{T}}) + \mathbf{M}^{\mathsf{T}} \mathbf{B} \mathbf{M} \\
E[\mathbf{X} \mathbf{C} \mathbf{X}] &= \mathbf{U} \mathbf{C}^{\mathsf{T}} \mathbf{V} + \mathbf{M} \mathbf{C} \mathbf{M}
\end{aligned}
$$
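
These moments lend themselves to a quick Monte Carlo sanity check. The sketch below (illustrative parameter values; it again assumes SciPy's scipy.stats.matrix_normal) verifies the first second-order expectation empirically:

```python
import numpy as np
from scipy.stats import matrix_normal

rng = np.random.default_rng(1)
n, p = 3, 2
M = np.zeros((n, p))  # zero mean, so E[X X^T] = U tr(V)
U = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])
V = np.array([[1.0, 0.3],
              [0.3, 0.8]])

samples = matrix_normal(mean=M, rowcov=U, colcov=V).rvs(
    size=200_000, random_state=rng)          # shape (200000, n, p)

# Empirical mean of X X^T across samples.
second_moment = np.einsum('kij,klj->il', samples, samples) / len(samples)

# Loose tolerance: Monte Carlo error at this sample size is ~0.01.
assert np.allclose(second_moment, U * np.trace(V), atol=0.1)
```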

Transformation

Transpose transform:

$$ \mathbf{X}^{\mathsf{T}} \sim \mathcal{MN}_{p\times n}(\mathbf{M}^{\mathsf{T}}, \mathbf{V}, \mathbf{U}) $$

Linear transform: let $\mathbf{D}$ ($r \times n$) be of full rank $r \le n$ and $\mathbf{C}$ ($p \times s$) be of full rank $s \le p$; then:

$$ \mathbf{D}\mathbf{X}\mathbf{C} \sim \mathcal{MN}_{r\times s}(\mathbf{D}\mathbf{M}\mathbf{C}, \, \mathbf{D}\mathbf{U}\mathbf{D}^{\mathsf{T}}, \, \mathbf{C}^{\mathsf{T}}\mathbf{V}\mathbf{C}) $$
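
The covariance part of the linear-transform rule can be traced back to the vec/Kronecker relation: $\operatorname{vec}(\mathbf{D}\mathbf{X}\mathbf{C}) = (\mathbf{C}^{\mathsf{T}} \otimes \mathbf{D}) \operatorname{vec}(\mathbf{X})$, so the transformed Kronecker covariance factors as claimed. A minimal numerical check of that factorization, with arbitrary illustrative dimensions and matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r, s = 4, 3, 2, 2
D = rng.standard_normal((r, n))  # full rank r <= n (a.s. for Gaussian draws)
C = rng.standard_normal((p, s))  # full rank s <= p
A = rng.standard_normal((n, n)); U = A @ A.T + n * np.eye(n)
B = rng.standard_normal((p, p)); V = B @ B.T + p * np.eye(p)

# (C^T kron D) (V kron U) (C kron D^T) should equal (C^T V C) kron (D U D^T).
lhs = np.kron(C.T, D) @ np.kron(V, U) @ np.kron(C, D.T)
rhs = np.kron(C.T @ V @ C, D @ U @ D.T)
assert np.allclose(lhs, rhs)
```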

Example

Let's imagine a sample of $n$ independent $p$-dimensional random variables identically distributed according to a multivariate normal distribution:

$$ \mathbf{Y}_i \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad i \in \{1, \dots, n\}. $$

When defining the $n \times p$ matrix $\mathbf{X}$ for which the $i$th row is $\mathbf{Y}_i^{\mathsf{T}}$, we obtain:

$$ \mathbf{X} \sim \mathcal{MN}_{n\times p}(\mathbf{M}, \mathbf{U}, \mathbf{V}) $$

where each row of $\mathbf{M}$ is equal to $\boldsymbol{\mu}^{\mathsf{T}}$, that is $\mathbf{M} = \mathbf{1}_n \boldsymbol{\mu}^{\mathsf{T}}$; $\mathbf{U}$ is the $n \times n$ identity matrix, that is, the rows are independent; and $\mathbf{V} = \boldsymbol{\Sigma}$.
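
Equivalently, the product of the $n$ row densities equals the matrix normal density with these parameters, which the following sketch (illustrative values, assuming SciPy) confirms at a single point:

```python
import numpy as np
from scipy.stats import matrix_normal, multivariate_normal

rng = np.random.default_rng(3)
n, p = 4, 2
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

X = rng.standard_normal((n, p))  # an arbitrary evaluation point

# Product of the n independent row densities ...
lhs = np.prod(multivariate_normal(mean=mu, cov=Sigma).pdf(X))

# ... equals the matrix normal density with M = 1 mu^T, U = I_n, V = Sigma.
M = np.outer(np.ones(n), mu)
rhs = matrix_normal(mean=M, rowcov=np.eye(n), colcov=Sigma).pdf(X)

assert np.isclose(lhs, rhs)
```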

Maximum likelihood parameter estimation

Given $k$ matrices, each of size $n \times p$, denoted $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k$, which we assume have been sampled i.i.d. from a matrix normal distribution, the maximum likelihood estimate of the parameters can be obtained by maximizing:

$$ \prod_{i=1}^{k} p(\mathbf{X}_i \mid \mathbf{M}, \mathbf{U}, \mathbf{V}). $$

The solution for the mean has a closed form, namely

$$ \hat{\mathbf{M}} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{X}_i, $$

but the covariance parameters do not. However, these parameters can be iteratively maximized by zeroing their gradients at:

$$ \hat{\mathbf{U}} = \frac{1}{kp} \sum_{i=1}^{k} (\mathbf{X}_i - \hat{\mathbf{M}}) \hat{\mathbf{V}}^{-1} (\mathbf{X}_i - \hat{\mathbf{M}})^{\mathsf{T}} $$

and

$$ \hat{\mathbf{V}} = \frac{1}{kn} \sum_{i=1}^{k} (\mathbf{X}_i - \hat{\mathbf{M}})^{\mathsf{T}} \hat{\mathbf{U}}^{-1} (\mathbf{X}_i - \hat{\mathbf{M}}). $$

See for example [3] and references therein. The covariance parameters are non-identifiable in the sense that for any scale factor, $s > 0$, we have:

$$ \mathcal{MN}_{n\times p}(\mathbf{M}, \mathbf{U}, \mathbf{V}) = \mathcal{MN}_{n\times p}\left(\mathbf{M}, s\mathbf{U}, \tfrac{1}{s}\mathbf{V}\right). $$
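
A minimal "flip-flop" implementation of these coupled updates might look as follows. This is a sketch rather than the algorithm of [3] (which develops a full EM treatment); the function name and iteration count are illustrative, and the scale of U is pinned each pass because of the non-identifiability just noted:

```python
import numpy as np

def matrix_normal_mle(Xs, n_iter=50):
    """Iterate the stationary-point equations for U and V.
    Xs has shape (k, n, p): k i.i.d. matrix samples."""
    k, n, p = Xs.shape
    M = Xs.mean(axis=0)          # closed-form mean estimate
    R = Xs - M                   # residuals, shape (k, n, p)
    U, V = np.eye(n), np.eye(p)
    for _ in range(n_iter):
        # U <- (1/(k p)) sum_i R_i V^{-1} R_i^T
        U = np.einsum('kia,ab,kjb->ij', R, np.linalg.inv(V), R) / (k * p)
        U /= np.trace(U) / n     # pin the arbitrary scale: tr(U) = n
        # V <- (1/(k n)) sum_i R_i^T U^{-1} R_i
        V = np.einsum('kia,ij,kjb->ab', R, np.linalg.inv(U), R) / (k * n)
    return M, U, V
```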

Drawing values from the distribution

Sampling from the matrix normal distribution is a special case of the sampling procedure for the multivariate normal distribution. Let $\mathbf{X}$ be an $n$ by $p$ matrix of $np$ independent samples from the standard normal distribution, so that

$$ \mathbf{X} \sim \mathcal{MN}_{n\times p}(\mathbf{0}, \mathbf{I}_n, \mathbf{I}_p). $$

Then let

$$ \mathbf{Y} = \mathbf{M} + \mathbf{A}\mathbf{X}\mathbf{B}, $$

so that

$$ \mathbf{Y} \sim \mathcal{MN}_{n\times p}(\mathbf{M}, \mathbf{A}\mathbf{A}^{\mathsf{T}}, \mathbf{B}^{\mathsf{T}}\mathbf{B}), $$

where $\mathbf{A}$ and $\mathbf{B}$ can be chosen by Cholesky decomposition or a similar matrix square-root operation.
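
A minimal sampler following this recipe, assuming only NumPy (the function name is illustrative):

```python
import numpy as np

def sample_matrix_normal(M, U, V, rng):
    """Draw one sample from MN(M, U, V) via Cholesky square roots."""
    n, p = M.shape
    A = np.linalg.cholesky(U)        # lower triangular, A A^T = U
    B = np.linalg.cholesky(V).T      # upper triangular, B^T B = V
    X = rng.standard_normal((n, p))  # X ~ MN(0, I_n, I_p)
    return M + A @ X @ B             # ~ MN(M, U, V)

# Usage: Y = sample_matrix_normal(M, U, V, np.random.default_rng(0))
```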

Relation to other distributions

Dawid (1981) provides a discussion of the relation of the matrix-valued normal distribution to other distributions, including the Wishart distribution, inverse-Wishart distribution and matrix t-distribution, but uses different notation from that employed here.

See also

Multivariate normal distribution
Wishart distribution and inverse-Wishart distribution
Matrix t-distribution and multivariate t-distribution
Normal-Wishart distribution and normal-inverse-Wishart distribution
Complex normal distribution
Kronecker product
Vectorization (mathematics)

References

  1. Gupta, A. K.; Nagar, D. K. (1999). "Chapter 2: Matrix Variate Normal Distribution". Matrix Variate Distributions. CRC Press. ISBN 978-1-58488-046-2. Retrieved 23 May 2014.
  2. Ding, Shanshan; Cook, R. Dennis (2014). "Dimension folding PCA and PFC for matrix-valued predictors". Statistica Sinica. 24 (1): 463–492.
  3. Glanz, Hunter; Carvalho, Luis (2013). "An Expectation-Maximization Algorithm for the Matrix Normal Distribution". arXiv:1309.6609 [stat.ME].
  4. Dawid, A. P. (1981). "Some matrix-variate distribution theory: notational considerations and a Bayesian application". Biometrika. 68 (1): 265–274.