Gauss–Markov theorem

In statistics, the Gauss–Markov theorem (or simply Gauss theorem for some authors) [1] states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. [2] The errors do not need to be normal, nor do they need to be independent and identically distributed (only uncorrelated with mean zero and homoscedastic with finite variance). The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator (which also drops linearity), ridge regression, or simply any degenerate estimator.


The theorem was named after Carl Friedrich Gauss and Andrey Markov, although Gauss' work significantly predates Markov's. [3] But while Gauss derived the result under the assumption of independence and normality, Markov reduced the assumptions to the form stated above. [4] A further generalization to non-spherical errors was given by Alexander Aitken. [5]


Suppose we have, in matrix notation, the linear model

$$y = X\beta + \varepsilon, \quad y, \varepsilon \in \mathbb{R}^n,\ \beta \in \mathbb{R}^K,\ X \in \mathbb{R}^{n \times K},$$

expanding to

$$y_i = \sum_{j=1}^{K} \beta_j X_{ij} + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$

where $\beta_j$ are non-random but unobservable parameters, $X_{ij}$ are non-random and observable (called the "explanatory variables"), $\varepsilon_i$ are random, and so $y_i$ are random. The random variables $\varepsilon_i$ are called the "disturbance", "noise" or simply "error" (to be contrasted with "residual" later in the article; see errors and residuals in statistics). To include a constant in the model above, one can introduce it as a parameter $\beta_{K+1}$ with a newly introduced last column of $X$ equal to one, i.e., $X_{i(K+1)} = 1$ for all $i$. Although the sample responses $y_i$ are observable, the assumptions, statements and proofs that follow treat only the $X_{ij}$ as known, not the $y_i$.

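The Gauss–Markov assumptions concern the set of error random variables $\varepsilon_i$: they have mean zero, $\operatorname{E}[\varepsilon_i] = 0$; they are homoscedastic, that is, all have the same finite variance, $\operatorname{Var}(\varepsilon_i) = \sigma^2 < \infty$ for all $i$; and distinct error terms are uncorrelated, $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$.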

A linear estimator of $\beta_j$ is a linear combination

$$\hat\beta_j = c_{1j} y_1 + \cdots + c_{nj} y_n$$

in which the coefficients $c_{ij}$ are not allowed to depend on the underlying coefficients $\beta_j$, since those are not observable, but are allowed to depend on the values $X_{ij}$, since these data are observable. (The dependence of the coefficients on each $X_{ij}$ is typically nonlinear; the estimator is linear in each $y_i$ and hence in each random $\varepsilon_i$, which is why this is "linear" regression.) The estimator is said to be unbiased if and only if

$$\operatorname{E}[\hat\beta_j] = \beta_j$$

regardless of the values of $X_{ij}$. Now, let $\sum_{j=1}^{K} \lambda_j \beta_j$ be some linear combination of the coefficients. Then the mean squared error of the corresponding estimation is

$$\operatorname{E}\left[\left(\sum_{j=1}^{K} \lambda_j (\hat\beta_j - \beta_j)\right)^2\right],$$

in other words it is the expectation of the square of the weighted sum (across parameters) of the differences between the estimators and the corresponding parameters to be estimated. (Since we are considering the case in which all the parameter estimates are unbiased, this mean squared error is the same as the variance of the linear combination.) The best linear unbiased estimator (BLUE) of the vector $\beta$ of parameters $\beta_j$ is one with the smallest mean squared error for every vector $\lambda$ of linear combination parameters. This is equivalent to the condition that

$$\operatorname{Var}(\tilde\beta) - \operatorname{Var}(\hat\beta)$$

is a positive semi-definite matrix for every other linear unbiased estimator $\tilde\beta$.

The ordinary least squares estimator (OLS) is the function

$$\hat\beta = (X^T X)^{-1} X^T y$$

of $y$ and $X$ (where $X^T$ denotes the transpose of $X$) that minimizes the sum of squares of residuals (misprediction amounts):

$$\sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{K} \hat\beta_j X_{ij}\right)^2.$$

The theorem now states that the OLS estimator is a BLUE. The main idea of the proof is that the least-squares estimator is uncorrelated with every linear unbiased estimator of zero, i.e., with every linear combination $a_1 y_1 + \cdots + a_n y_n$ whose coefficients do not depend upon the unobservable $\beta$ but whose expected value is always zero.
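The closed-form OLS estimator and its minimizing property can be checked numerically. The following is a minimal sketch using NumPy; the toy data, the helper `rss` and the perturbation directions are illustrative assumptions, not part of the theorem.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations, K = 2 columns (constant and one regressor).
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([1.0, 2.9, 5.2, 7.1, 8.8, 11.0])

# Closed-form OLS estimator: solve (X^T X) b = X^T y rather than inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def rss(b):
    """Sum of squared residuals for a candidate coefficient vector b."""
    r = y - X @ b
    return float(r @ r)

# NumPy's least-squares routine should agree with the closed form.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Moving away from the estimate in a few arbitrary directions should raise the RSS.
perturbed = [beta_hat + np.array(d) for d in ([0.1, 0.0], [0.0, 0.1], [-0.1, 0.05])]
print(beta_hat, rss(beta_hat), [rss(b) for b in perturbed])
```

Solving the normal equations with `np.linalg.solve` avoids forming an explicit inverse, which is usually the numerically safer choice.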


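That the OLS estimator indeed minimizes the sum of squares of residuals can be shown by computing the Hessian matrix and showing that it is positive definite.

The function we want to minimize is the sum of squared residuals

$$f(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2$$

for a multiple regression model with $p$ variables. Setting the first derivative to zero gives

$$\frac{d}{d\beta} f = -2 X^T (y - X\beta) = 0_{p+1},$$

where $X$ is the design matrix

$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \in \mathbb{R}^{n \times (p+1)}, \quad n \geq p + 1.$$

The Hessian matrix of second derivatives is

$$\mathcal{H} = 2 X^T X.$$

Assuming the columns of $X$ are linearly independent so that $X^T X$ is invertible, let $X = [v_1 \ v_2 \ \cdots \ v_{p+1}]$; then

$$k_1 v_1 + \dots + k_{p+1} v_{p+1} = 0 \iff k_1 = \dots = k_{p+1} = 0.$$

Now let $k = (k_1, \dots, k_{p+1})^T \in \mathbb{R}^{p+1}$ be an eigenvector of $\mathcal{H}$. Since $k \neq 0$,

$$\| k_1 v_1 + \dots + k_{p+1} v_{p+1} \|^2 = \| X k \|^2 > 0.$$

In terms of vector multiplication, this means

$$k^T \mathcal{H} k = 2 k^T X^T X k = 2 \| X k \|^2 > 0 \quad \text{and} \quad k^T \mathcal{H} k = \lambda k^T k,$$

where $\lambda$ is the eigenvalue corresponding to $k$. Moreover,

$$k^T k = \sum_{i=1}^{p+1} k_i^2 > 0 \implies \lambda > 0.$$

Finally, as the eigenvector $k$ was arbitrary, all eigenvalues of $\mathcal{H}$ are positive, therefore $\mathcal{H}$ is positive definite. Thus,

$$\hat\beta = (X^T X)^{-1} X^T y$$

is indeed a minimum. Because the Hessian is constant and positive definite, $f$ is strictly convex, so this minimum is global and not merely local.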
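The positive-definiteness argument can be illustrated numerically. This is a small sketch under assumed data: the design matrix below is hypothetical, chosen only so that its columns are linearly independent.

```python
import numpy as np

# Hypothetical design matrix with an intercept column and p = 2 regressors.
X = np.array([
    [1.0, 0.5, 2.0],
    [1.0, 1.5, 1.0],
    [1.0, 2.5, 4.0],
    [1.0, 3.5, 3.0],
    [1.0, 4.5, 5.0],
])

H = 2.0 * X.T @ X                      # Hessian of the sum of squared residuals
eigenvalues = np.linalg.eigvalsh(H)    # H is symmetric, so eigvalsh applies

# Full column rank should go together with strictly positive eigenvalues.
full_rank = np.linalg.matrix_rank(X) == X.shape[1]
print(full_rank, eigenvalues)
```

If a column of `X` were replaced by a linear combination of the others, the rank check would fail and the smallest eigenvalue would drop to (numerically) zero.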


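Proof

Let $\tilde\beta = Cy$ be another linear estimator of $\beta$ with $C = (X^T X)^{-1} X^T + D$, where $D$ is a $K \times n$ non-zero matrix. Since we are restricting attention to unbiased estimators, minimum mean squared error implies minimum variance. The goal is therefore to show that such an estimator has a variance no smaller than that of $\hat\beta$, the OLS estimator. We calculate:

$$\begin{aligned}
\operatorname{E}[\tilde\beta] &= \operatorname{E}[Cy] \\
&= \operatorname{E}\left[\left((X^T X)^{-1} X^T + D\right)(X\beta + \varepsilon)\right] \\
&= \left((X^T X)^{-1} X^T + D\right) X\beta + \left((X^T X)^{-1} X^T + D\right) \operatorname{E}[\varepsilon] \\
&= \left((X^T X)^{-1} X^T + D\right) X\beta && (\operatorname{E}[\varepsilon] = 0) \\
&= (X^T X)^{-1} X^T X \beta + D X \beta \\
&= (I_K + DX)\beta.
\end{aligned}$$

Therefore, since $\beta$ is unobservable, $\tilde\beta$ is unbiased if and only if $DX = 0$. Then:

$$\begin{aligned}
\operatorname{Var}(\tilde\beta) &= \operatorname{Var}(Cy) = C \operatorname{Var}(y) C^T = \sigma^2 C C^T \\
&= \sigma^2 \left((X^T X)^{-1} X^T + D\right)\left(X (X^T X)^{-1} + D^T\right) \\
&= \sigma^2 \left((X^T X)^{-1} X^T X (X^T X)^{-1} + (X^T X)^{-1} X^T D^T + D X (X^T X)^{-1} + D D^T\right) \\
&= \sigma^2 (X^T X)^{-1} + \sigma^2 (X^T X)^{-1} (DX)^T + \sigma^2 D X (X^T X)^{-1} + \sigma^2 D D^T \\
&= \sigma^2 (X^T X)^{-1} + \sigma^2 D D^T && (DX = 0) \\
&= \operatorname{Var}(\hat\beta) + \sigma^2 D D^T && (\sigma^2 (X^T X)^{-1} = \operatorname{Var}(\hat\beta)).
\end{aligned}$$

Since $D D^T$ is a positive semidefinite matrix, $\operatorname{Var}(\tilde\beta)$ exceeds $\operatorname{Var}(\hat\beta)$ by a positive semidefinite matrix.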
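The algebra of the proof can be replayed numerically. The sketch below is an illustration under assumptions: the sizes, the error variance `sigma2`, the random design and the particular way of constructing a matrix `D` with `D X = 0` (projection onto the orthogonal complement of the column space of `X`) are all choices made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen only for reproducibility
n, K, sigma2 = 8, 3, 1.5                # hypothetical sizes and error variance
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                       # OLS weights: beta_hat = A y

# Build D with D X = 0 by projecting a random matrix onto the
# orthogonal complement of the column space of X.
M = np.eye(n) - X @ A                   # residual-maker matrix, M X = 0
D = rng.normal(size=(K, n)) @ M
C = A + D                               # weights of the competing linear estimator

var_ols = sigma2 * XtX_inv              # Var(beta_hat) = sigma^2 (X^T X)^{-1}
var_alt = sigma2 * (C @ C.T)            # Var(C y) = sigma^2 C C^T
gap = var_alt - var_ols

print(np.allclose(C @ X, np.eye(K)))                 # unbiasedness: C X = I
print(np.allclose(gap, sigma2 * (D @ D.T)))          # gap equals sigma^2 D D^T
print(np.linalg.eigvalsh((gap + gap.T) / 2).min())   # smallest eigenvalue of the gap
```

The smallest eigenvalue of the gap should be non-negative up to floating-point rounding, which is the numerical counterpart of positive semidefiniteness.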

Remarks on the proof

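As stated before, the condition that $\operatorname{Var}(\tilde\beta) - \operatorname{Var}(\hat\beta)$ is a positive semidefinite matrix is equivalent to the property that the best linear unbiased estimator of $\ell^T \beta$ is $\ell^T \hat\beta$ (best in the sense that it has minimum variance). To see this, let $\ell^T \tilde\beta$ be another linear unbiased estimator of $\ell^T \beta$. Then

$$\begin{aligned}
\operatorname{Var}(\ell^T \tilde\beta) &= \ell^T \operatorname{Var}(\tilde\beta) \ell \\
&= \sigma^2 \ell^T (X^T X)^{-1} \ell + \sigma^2 \ell^T D D^T \ell \\
&= \operatorname{Var}(\ell^T \hat\beta) + \sigma^2 (D^T \ell)^T (D^T \ell) \\
&= \operatorname{Var}(\ell^T \hat\beta) + \sigma^2 \| D^T \ell \|^2 \\
&\geq \operatorname{Var}(\ell^T \hat\beta).
\end{aligned}$$

Moreover, equality holds if and only if $D^T \ell = 0$. We calculate

$$\begin{aligned}
\ell^T \tilde\beta &= \ell^T \left((X^T X)^{-1} X^T + D\right) y \\
&= \ell^T (X^T X)^{-1} X^T y + \ell^T D y \\
&= \ell^T \hat\beta + (D^T \ell)^T y \\
&= \ell^T \hat\beta && (D^T \ell = 0).
\end{aligned}$$

This proves that equality holds if and only if $\ell^T \tilde\beta = \ell^T \hat\beta$, which gives the uniqueness of the OLS estimator as a BLUE.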

Generalized least squares estimator

The generalized least squares (GLS), developed by Aitken, [5] extends the Gauss–Markov theorem to the case where the error vector has a non-scalar covariance matrix. [6] The Aitken estimator is also a BLUE.
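One common way to view the Aitken estimator is as OLS applied to "whitened" data. The sketch below assumes the standard GLS formula $(X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$, which is not written out in the text above, and a hypothetical known covariance matrix `Omega` with AR(1)-style correlations; the data are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)          # seed chosen only for reproducibility
n = 10
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Hypothetical known error covariance: correlation 0.6^{|i-j|} between errors i and j.
idx = np.arange(n)
Omega = 0.6 ** np.abs(idx[:, None] - idx[None, :])

# Direct GLS formula: (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y.
Oinv_X = np.linalg.solve(Omega, X)
Oinv_y = np.linalg.solve(Omega, y)
beta_gls = np.linalg.solve(X.T @ Oinv_X, X.T @ Oinv_y)

# Equivalent route: whiten with the Cholesky factor Omega = L L^T,
# then run OLS on the transformed data (L^{-1} y, L^{-1} X).
L = np.linalg.cholesky(Omega)
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
beta_whitened, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

print(beta_gls, beta_whitened)
```

After whitening, the transformed errors have a scalar covariance matrix, so the Gauss–Markov theorem applies to the transformed regression; that is the sense in which the Aitken estimator is a BLUE.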

Gauss–Markov theorem as stated in econometrics

In most treatments of OLS, the regressors in the design matrix $X$ are assumed to be fixed in repeated samples. This assumption is considered inappropriate for a predominantly nonexperimental science like econometrics. [7] Instead, the assumptions of the Gauss–Markov theorem are stated conditional on $X$.


Linearity

The dependent variable is assumed to be a linear function of the variables specified in the model. The specification must be linear in its parameters. This does not mean that there must be a linear relationship between the independent and dependent variables. The independent variables can take non-linear forms as long as the parameters enter linearly. The equation $y = \beta_0 + \beta_1 x^2$ qualifies as linear, while $y = \beta_0 + \beta_1^2 x$ can be transformed to be linear by replacing $\beta_1^2$ by another parameter, say $\gamma$. An equation with a parameter dependent on an independent variable does not qualify as linear, for example $y = \beta_0 + \beta_1(x) \cdot x$, where $\beta_1(x)$ is a function of $x$.

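Data transformations are often used to convert an equation into a linear form. For example, the Cobb–Douglas function, often used in economics, is nonlinear:

$$Y = A L^{\alpha} K^{1-\alpha} e^{\varepsilon}$$

But it can be expressed in linear form by taking the natural logarithm of both sides: [8]

$$\ln Y = \ln A + \alpha \ln L + (1 - \alpha) \ln K + \varepsilon = \beta_0 + \beta_1 \ln L + \beta_2 \ln K + \varepsilon$$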

This assumption also covers specification issues: assuming that the proper functional form has been selected and there are no omitted variables.

One should be aware, however, that the parameters that minimize the sum of squared residuals of the transformed equation do not necessarily minimize the sum of squared residuals of the original equation.
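The log transformation of a Cobb–Douglas function can be tried on simulated data. In this sketch the parameter values `A_true` and `alpha_true`, the sample size and the noise level are hypothetical; the fit is an unrestricted regression of $\ln Y$ on a constant, $\ln L$ and $\ln K$, so the two slope estimates are not forced to sum to one.

```python
import numpy as np

rng = np.random.default_rng(2)            # seed chosen only for reproducibility
n = 200
A_true, alpha_true = 1.8, 0.3             # hypothetical parameter values
labor = np.exp(rng.uniform(0.0, 3.0, n))
capital = np.exp(rng.uniform(0.0, 3.0, n))
eps = rng.normal(scale=0.05, size=n)
output = A_true * labor**alpha_true * capital**(1 - alpha_true) * np.exp(eps)

# Regress ln Y on a constant, ln L and ln K (linear in the parameters).
Z = np.column_stack([np.ones(n), np.log(labor), np.log(capital)])
b0, b1, b2 = np.linalg.lstsq(Z, np.log(output), rcond=None)[0]

print(np.exp(b0), b1, b2)   # estimates of A, alpha and 1 - alpha
```

The estimates minimize squared errors in $\ln Y$, not in $Y$ itself, which is the caveat raised in the preceding paragraph.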

Strict exogeneity

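For all $n$ observations, the expectation of the error term, conditional on the regressors, is zero: [9]

$$\operatorname{E}[\varepsilon_i \mid X] = \operatorname{E}[\varepsilon_i \mid x_1, \dots, x_n] = 0,$$

where $x_i$ is the data vector of regressors for the $i$th observation, and consequently $X = [x_1^T \ \cdots \ x_n^T]^T$ is the data matrix or design matrix.

Geometrically, this assumption implies that the regressors $x_j$ and the errors $\varepsilon_i$ are orthogonal to each other, so that their inner product (i.e., their cross moment) is zero:

$$\operatorname{E}[x_j \varepsilon_i] = \begin{bmatrix} \operatorname{E}[x_{j1} \varepsilon_i] \\ \vdots \\ \operatorname{E}[x_{jk} \varepsilon_i] \end{bmatrix} = 0 \quad \text{for all } i, j \in \{1, \dots, n\}.$$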

This assumption is violated if the explanatory variables are correlated with the error term, for instance when they are measured with error or are endogenous. [10] Endogeneity can be the result of simultaneity, where causality flows back and forth between the dependent and independent variables. Instrumental variable techniques are commonly used to address this problem.
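The effect of a regressor measured with error can be seen in a small simulation. Everything here is an assumed setup for illustration: a true slope of 2, unit-variance regressor, unit-variance measurement noise, and the helper `ols_slope`. Under these assumptions the slope estimated from the noisy regressor is pulled toward zero.

```python
import numpy as np

rng = np.random.default_rng(3)             # seed chosen only for reproducibility
n = 5000
x_true = rng.normal(size=n)                # regressor without measurement error
y = 2.0 * x_true + rng.normal(size=n)      # hypothetical true slope of 2
x_obs = x_true + rng.normal(size=n)        # the same regressor observed with noise

def ols_slope(x, y):
    """Slope from an OLS fit of y on a constant and x."""
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

slope_true = ols_slope(x_true, y)          # regression on the correctly measured x
slope_obs = ols_slope(x_obs, y)            # regression on the mismeasured x
print(slope_true, slope_obs)
```

In the second regression the measurement noise ends up in the error term and is correlated with the observed regressor, so strict exogeneity fails and OLS is no longer unbiased.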

Full rank

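The sample data matrix $X$ must have full column rank:

$$\operatorname{rank}(X) = k,$$

where $k$ is the number of regressors (columns of $X$). Otherwise $X^T X$ is not invertible and the OLS estimator cannot be computed.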

A violation of this assumption is perfect multicollinearity, i.e. some explanatory variables are linearly dependent. One scenario in which this occurs is the "dummy variable trap", in which a base dummy variable is not omitted, resulting in perfect collinearity between the dummy variables and the constant term. [11]

Multicollinearity (as long as it is not "perfect") can be present, resulting in a less efficient but still unbiased estimate. The estimates will be less precise and highly sensitive to particular sets of data. [12] Multicollinearity can be detected from the condition number or the variance inflation factor, among other tests.
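The rank condition and the condition-number diagnostic can be illustrated as follows. The group dummies and the nearly collinear column `x_near` are hypothetical constructions for this sketch.

```python
import numpy as np

# Hypothetical data: 6 observations in two groups, with one dummy per group.
group_a = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
group_b = 1.0 - group_a
const = np.ones(6)

# Dummy variable trap: constant plus both dummies, so const = group_a + group_b.
X_trap = np.column_stack([const, group_a, group_b])
# Dropping the base dummy restores full column rank.
X_ok = np.column_stack([const, group_a])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

# Near (not perfect) collinearity: full rank, but a large condition number.
x = np.linspace(0.0, 1.0, 6)
x_near = x + 1e-6 * np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
X_near = np.column_stack([const, x, x_near])
cond_ok = np.linalg.cond(X_ok)
cond_near = np.linalg.cond(X_near)

print(rank_trap, rank_ok, cond_ok, cond_near)
```

A rank below the number of columns signals perfect multicollinearity, while a very large condition number with full rank signals the imperfect kind, where OLS can still be computed but is sensitive to small changes in the data.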

Spherical errors

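The outer product of the error vector must be spherical:

$$\operatorname{E}[\varepsilon \varepsilon^T \mid X] = \operatorname{Var}[\varepsilon \mid X] = \sigma^2 I_n, \quad \sigma^2 > 0.$$

This implies the error term has uniform variance (homoscedasticity) and no serial correlation. [13] If this assumption is violated, OLS is still unbiased, but inefficient. The term "spherical errors" comes from the multivariate normal distribution: if $\operatorname{Var}[\varepsilon \mid X] = \sigma^2 I$ in the multivariate normal density, then the equation $f(\varepsilon) = c$ describes a sphere centered at $\mu$ in $n$-dimensional space, with a radius determined by $c$ and $\sigma$. [14]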

Heteroskedasticity occurs when the amount of error is correlated with an independent variable. For example, in a regression of food expenditure on income, the size of the error is correlated with income. Low income people generally spend a similar amount on food, while high income people may spend a very large amount or as little as low income people spend. Heteroskedasticity can also be caused by changes in measurement practices. For example, as statistical offices improve their data, measurement error decreases, so the variance of the error term declines over time.

The spherical-errors assumption is also violated when there is autocorrelation. Autocorrelation can be visualized on a data plot when a given observation is more likely to lie above a fitted regression line if adjacent observations also lie above it. Autocorrelation is common in time series data, where a data series may experience "inertia", for example if a dependent variable takes a while to fully absorb a shock. Spatial autocorrelation can also occur: neighboring geographic areas are likely to have similar errors. Autocorrelation may be the result of misspecification such as choosing the wrong functional form. In these cases, correcting the specification is one possible way to deal with autocorrelation.

In the presence of non-spherical errors, the generalized least squares estimator can be shown to be BLUE. [6]
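The efficiency loss of OLS under heteroskedasticity can be computed exactly rather than simulated. The sketch below assumes a hypothetical diagonal covariance `Omega` whose variances grow with the square of the regressor, and uses the standard expressions $(X^T X)^{-1} X^T \Omega X (X^T X)^{-1}$ for the OLS covariance and $(X^T \Omega^{-1} X)^{-1}$ for the GLS covariance; neither formula is written out in the text above.

```python
import numpy as np

n = 12
x = np.linspace(1.0, 4.0, n)
X = np.column_stack([np.ones(n), x])

# Hypothetical heteroskedastic covariance: error variance grows with x^2.
Omega = np.diag(x**2)

# Exact covariance of OLS when Var(eps | X) = Omega (sandwich form).
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Omega @ X @ XtX_inv

# Exact covariance of GLS: (X^T Omega^{-1} X)^{-1}.
var_gls = np.linalg.inv(X.T @ np.linalg.solve(Omega, X))

gap = var_ols - var_gls
print(np.diag(var_ols), np.diag(var_gls))
print(np.linalg.eigvalsh((gap + gap.T) / 2))   # eigenvalues of the variance gap
```

The eigenvalues of the gap should be non-negative up to rounding, meaning every linear combination of coefficients is estimated at least as precisely by GLS as by OLS in this setting.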


  1. See chapter 7 of Johnson, R. A.; Wichern, D. W. (2002). Applied multivariate statistical analysis. 5. Prentice Hall.
  2. Theil, Henri (1971). "Best Linear Unbiased Estimation and Prediction". Principles of Econometrics. New York: John Wiley & Sons. pp. 119–124. ISBN 0-471-85845-5.
  3. Plackett, R. L. (1949). "A Historical Note on the Method of Least Squares". Biometrika. 36 (3/4): 458–460. doi:10.2307/2332682.
  4. David, F. N.; Neyman, J. (1938). "Extension of the Markoff theorem on least squares". Statistical Research Memoirs. 2: 105–116. OCLC 4025782.
  5. Aitken, A. C. (1935). "On Least Squares and Linear Combinations of Observations". Proceedings of the Royal Society of Edinburgh. 55: 42–48. doi:10.1017/S0370164600014346.
  6. Huang, David S. (1970). Regression and Econometric Methods. New York: John Wiley & Sons. pp. 127–147. ISBN 0-471-41754-8.
  7. Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 13. ISBN 0-691-01018-8.
  8. Walters, A. A. (1970). An Introduction to Econometrics. New York: W. W. Norton. p. 275. ISBN 0-393-09931-8.
  9. Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 7. ISBN 0-691-01018-8.
  10. Johnston, John (1972). Econometric Methods (Second ed.). New York: McGraw-Hill. pp. 267–291. ISBN 0-07-032679-7.
  11. Wooldridge, Jeffrey (2012). Introductory Econometrics (Fifth international ed.). South-Western. p. 220. ISBN 978-1-111-53439-4.
  12. Johnston, John (1972). Econometric Methods (Second ed.). New York: McGraw-Hill. pp. 159–168. ISBN 0-07-032679-7.
  13. Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 10. ISBN 0-691-01018-8.
  14. Ramanathan, Ramu (1993). "Nonspherical Disturbances". Statistical Methods in Econometrics. Academic Press. pp. 330–351. ISBN 0-12-576830-3.
