Projection matrix

In statistics, the projection matrix,[1] sometimes also called the influence matrix[2] or hat matrix, maps the vector of response values (dependent variable values) to the vector of fitted values (or predicted values). It describes the influence each response value has on each fitted value.[3][4] The diagonal elements of the projection matrix are the leverages, which describe the influence each response value has on the fitted value for that same observation.

Definition

If the vector of response values is denoted by $\mathbf{y}$ and the vector of fitted values by $\hat{\mathbf{y}}$,

$\hat{\mathbf{y}} = P\mathbf{y}.$

As $\hat{\mathbf{y}}$ is usually pronounced "y-hat", the projection matrix $P$ is also named the hat matrix, as it "puts a hat on $\mathbf{y}$".

The element $p_{ij}$ in the ith row and jth column of $P$ is equal to the covariance between the jth response value and the ith fitted value, divided by the variance of the former:[5]

$p_{ij} = \frac{\operatorname{Cov}[\hat{y}_{i}, y_{j}]}{\operatorname{Var}[y_{j}]}$

Application for residuals

The formula for the vector of residuals $\mathbf{r}$ can also be expressed compactly using the projection matrix:

$\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - P\mathbf{y} = (I - P)\mathbf{y},$

where $I$ is the identity matrix. The matrix $M \equiv I - P$ is sometimes referred to as the residual maker matrix or the annihilator matrix.

The covariance matrix of the residuals $\mathbf{r}$, by error propagation, equals

$\operatorname{Var}[\mathbf{r}] = (I - P)^{\mathsf{T}} \Sigma (I - P),$

where $\Sigma$ is the covariance matrix of the error vector (and, by extension, the response vector as well). For the case of linear models with independent and identically distributed errors in which $\operatorname{Var}[\boldsymbol{\varepsilon}] = \sigma^{2} I$, this reduces to:[3]

$\operatorname{Var}[\mathbf{r}] = (I - P)\sigma^{2}.$
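As a numerical sketch (with arbitrary example data, using the ordinary-least-squares form of $P$ given below), the annihilator property of the residual maker matrix can be checked with NumPy:

```python
import numpy as np

# Arbitrary example: intercept column plus one explanatory variable
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection (hat) matrix
M = np.eye(len(y)) - P                 # residual maker / annihilator matrix

residuals = M @ y
# M annihilates anything in the column space of X ...
print(np.allclose(M @ X, 0))                  # True
# ... so the residuals are orthogonal to the fitted values
print(np.allclose(residuals @ (P @ y), 0))    # True
```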

Intuition

[Figure: a matrix $\mathbf{A}$ with its column space depicted as a green line; the projection of some vector $\mathbf{b}$ onto the column space of $\mathbf{A}$ is the vector $\mathbf{A}\mathbf{x}$.]

From the figure, it is clear that the closest point in the column space of $\mathbf{A}$ to the vector $\mathbf{b}$ is $\mathbf{A}\mathbf{x}$, the point at which the line from $\mathbf{b}$ is orthogonal to the column space of $\mathbf{A}$. A vector that is orthogonal to the column space of a matrix is in the nullspace of the matrix transpose, so

$\mathbf{A}^{\mathsf{T}}(\mathbf{b} - \mathbf{A}\mathbf{x}) = \mathbf{0}.$

From there, one rearranges, so

$\mathbf{A}^{\mathsf{T}}\mathbf{b} = \mathbf{A}^{\mathsf{T}}\mathbf{A}\mathbf{x} \quad \Rightarrow \quad \mathbf{x} = (\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1}\mathbf{A}^{\mathsf{T}}\mathbf{b}.$

Therefore, since $\mathbf{A}\mathbf{x}$ is on the column space of $\mathbf{A}$, the projection matrix, which maps $\mathbf{b}$ onto $\mathbf{A}\mathbf{x}$, is $\mathbf{A}(\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1}\mathbf{A}^{\mathsf{T}}$.
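The orthogonality argument above can be sketched numerically (the vectors here are arbitrary illustrations, not from the source):

```python
import numpy as np

# Column space of A is the line spanned by (1, 2)
A = np.array([[1.0],
              [2.0]])
b = np.array([3.0, 1.0])

# Solve the normal equations A^T A x = A^T b, then project: proj = A x
x = np.linalg.solve(A.T @ A, A.T @ b)
proj = A @ x

# The residual b - proj lies in the nullspace of A^T
print(np.allclose(A.T @ (b - proj), 0))   # True
```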

Linear model

Suppose that we wish to estimate a linear model using linear least squares. The model can be written as

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$

where $\mathbf{X}$ is a matrix of explanatory variables (the design matrix), $\boldsymbol{\beta}$ is a vector of unknown parameters to be estimated, and $\boldsymbol{\varepsilon}$ is the error vector.

Many types of models and techniques are subject to this formulation. A few examples are linear least squares, smoothing splines, regression splines, local regression, kernel regression, and linear filtering.

Ordinary least squares

When the weights for each observation are identical and the errors are uncorrelated, the estimated parameters are

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y},$

so the fitted values are

$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}.$

Therefore, the projection matrix (and hat matrix) is given by

$P \equiv \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}.$
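A minimal NumPy sketch of this formula (the design matrix and responses are arbitrary example data):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Hat matrix P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y                                    # fitted values

# Same fitted values as solving the least-squares problem directly
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(y_hat, X @ beta_hat))          # True
```

In practice one would compute $\hat{\mathbf{y}}$ via a QR or least-squares solver rather than explicitly inverting $\mathbf{X}^{\mathsf{T}}\mathbf{X}$; the dense hat matrix is mainly of diagnostic interest.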

Weighted and generalized least squares

The above may be generalized to the cases where the weights are not identical and/or the errors are correlated. Suppose that the covariance matrix of the errors is $\Sigma$. Then since

$\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^{\mathsf{T}}\Sigma^{-1}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\Sigma^{-1}\mathbf{y},$

the hat matrix is thus

$H = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\Sigma^{-1}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\Sigma^{-1},$

and again it may be seen that $H^{2} = H$, though now it is no longer symmetric.
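Both claims can be checked numerically; in this sketch $\Sigma$ is an arbitrary diagonal covariance (the weighted-least-squares case):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Sigma = np.diag([1.0, 2.0, 3.0, 4.0])    # unequal error variances
Sigma_inv = np.linalg.inv(Sigma)

# Generalized hat matrix H = X (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1}
H = X @ np.linalg.inv(X.T @ Sigma_inv @ X) @ X.T @ Sigma_inv

print(np.allclose(H @ H, H))   # True: still idempotent
print(np.allclose(H, H.T))     # False: no longer symmetric
```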

Properties

The projection matrix has a number of useful algebraic properties.[6][7] In the language of linear algebra, the projection matrix is the orthogonal projection onto the column space of the design matrix $\mathbf{X}$.[4] (Note that $(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}$ is the pseudoinverse of $\mathbf{X}$.) Some facts of the projection matrix in this setting are summarized as follows:[4]

The projection matrix corresponding to a linear model is symmetric ($P^{\mathsf{T}} = P$) and idempotent ($P^{2} = P$). However, this is not always the case; in locally weighted scatterplot smoothing (LOESS), for example, the hat matrix is in general neither symmetric nor idempotent.

For linear models, the trace of the projection matrix is equal to the rank of $\mathbf{X}$, which is the number of independent parameters of the linear model.[9] For other models such as LOESS that are still linear in the observations $\mathbf{y}$, the projection matrix can be used to define the effective degrees of freedom of the model.

Practical applications of the projection matrix in regression analysis include leverage and Cook's distance, which are concerned with identifying influential observations, i.e. observations which have a large effect on the results of a regression.
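The trace and leverage facts above admit a quick numerical check (arbitrary example data again):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
P = X @ np.linalg.inv(X.T @ X) @ X.T

# trace(P) equals rank(X), the number of independent parameters
print(np.isclose(np.trace(P), np.linalg.matrix_rank(X)))   # True

# The diagonal entries of P are the leverages, each between 0 and 1;
# observations far from the mean of the regressor get higher leverage
leverages = np.diag(P)
print(np.all((leverages >= 0) & (leverages <= 1)))         # True
```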

Blockwise formula

Suppose the design matrix can be decomposed by columns as $\mathbf{X} = [\mathbf{A} \ \mathbf{B}]$. Define the hat or projection operator as $P[\mathbf{X}] \equiv \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}$. Similarly, define the residual operator as $M[\mathbf{X}] \equiv I - P[\mathbf{X}]$. Then the projection matrix can be decomposed as follows:[10]

$P[\mathbf{X}] = P[\mathbf{A}] + P[M[\mathbf{A}]\mathbf{B}],$

where, e.g., $P[\mathbf{A}] = \mathbf{A}(\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1}\mathbf{A}^{\mathsf{T}}$ and $M[\mathbf{A}] = I - P[\mathbf{A}]$. There are a number of applications of such a decomposition. In the classical application $\mathbf{A}$ is a column of all ones, which allows one to analyze the effects of adding an intercept term to a regression. Another use is in the fixed effects model, where $\mathbf{A}$ is a large sparse matrix of the dummy variables for the fixed effect terms. One can use this partition to compute the hat matrix of $\mathbf{X}$ without explicitly forming the matrix $\mathbf{X}$, which might be too large to fit into computer memory.
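A numerical sketch of the decomposition (hypothetical data; `proj` implements the $P[\cdot]$ operator defined above):

```python
import numpy as np

def proj(Z):
    """Projection operator P[Z] = Z (Z^T Z)^{-1} Z^T."""
    return Z @ np.linalg.inv(Z.T @ Z) @ Z.T

# Partition X = [A  B]: A a column of ones (intercept), B the regressor
A = np.ones((4, 1))
B = np.array([[0.0], [1.0], [2.0], [3.0]])
X = np.hstack([A, B])

M_A = np.eye(4) - proj(A)              # residual maker of A
# Blockwise decomposition: P[X] = P[A] + P[M[A] B]
P_X = proj(A) + proj(M_A @ B)

print(np.allclose(P_X, proj(X)))       # True
```

With $\mathbf{A}$ a column of ones, $M[\mathbf{A}]\mathbf{B}$ is simply the demeaned regressor, so projecting on it and adding back the projection on the intercept column reproduces the full hat matrix.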


References

  1. Basilevsky, Alexander (2005). Applied Matrix Algebra in the Statistical Sciences. Dover. pp. 160–176. ISBN 0-486-44538-0.
  2. "Data Assimilation: Observation influence diagnostic of a data assimilation system" (PDF). Archived from the original (PDF) on 2014-09-03.
  3. Hoaglin, David C.; Welsch, Roy E. (February 1978). "The Hat Matrix in Regression and ANOVA" (PDF). The American Statistician. 32 (1): 17–22. doi:10.2307/2683469. hdl:1721.1/1920. JSTOR 2683469.
  4. Freedman, David A. (2009). Statistical Models: Theory and Practice. Cambridge University Press.
  5. Wood, Simon N. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.
  6. Gans, P. (1992). Data Fitting in the Chemical Sciences. Wiley. ISBN 0-471-93412-7.
  7. Draper, N. R.; Smith, H. (1998). Applied Regression Analysis. Wiley. ISBN 0-471-17082-8.
  8. Amemiya, Takeshi (1985). Advanced Econometrics. Cambridge: Harvard University Press. pp. 460–461. ISBN 0-674-00560-0.
  9. "Proof that trace of 'hat' matrix in linear regression is rank of X". Stack Exchange. April 13, 2017.
  10. Rao, C. Radhakrishna; Toutenburg, Helge; Shalabh; Heumann, Christian (2008). Linear Models and Generalizations (3rd ed.). Berlin: Springer. p. 323. ISBN 978-3-540-74226-5.