# Generalized least squares

In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters of a linear regression model when the residuals are correlated with one another to some degree. In these cases, ordinary least squares and weighted least squares can be statistically inefficient, or even give misleading inferences. GLS was first described by Alexander Aitken in 1936. [1]

## Method outline

In standard linear regression models we observe data ${\displaystyle \{y_{i},x_{ij}\}_{i=1,\dots ,n,j=2,\dots ,k}}$ on n statistical units. The response values are placed in a vector ${\displaystyle \mathbf {y} =\left(y_{1},\dots ,y_{n}\right)^{\mathsf {T}}}$, and the predictor values are placed in the design matrix ${\displaystyle \mathbf {X} =\left(\mathbf {x} _{1}^{\mathsf {T}},\dots ,\mathbf {x} _{n}^{\mathsf {T}}\right)^{\mathsf {T}}}$, where ${\displaystyle \mathbf {x} _{i}=\left(1,x_{i2},\dots ,x_{ik}\right)}$ is a vector of the k predictor variables (including a constant) for the ith unit. The model forces the conditional mean of ${\displaystyle \mathbf {y} }$ given ${\displaystyle \mathbf {X} }$ to be a linear function of ${\displaystyle \mathbf {X} }$, and assumes the conditional variance of the error term given ${\displaystyle \mathbf {X} }$ is a known nonsingular covariance matrix ${\displaystyle \mathbf {\Omega } }$. This is usually written as

${\displaystyle \mathbf {y} =\mathbf {X} \mathbf {\beta } +\mathbf {\varepsilon } ,\qquad \operatorname {E} [\varepsilon \mid \mathbf {X} ]=0,\ \operatorname {Cov} [\varepsilon \mid \mathbf {X} ]=\mathbf {\Omega } .}$

Here ${\displaystyle \beta \in \mathbb {R} ^{k}}$ is a vector of unknown constants (known as “regression coefficients”) that must be estimated from the data.

Suppose ${\displaystyle \mathbf {b} }$ is a candidate estimate for ${\displaystyle \mathbf {\beta } }$. Then the residual vector for ${\displaystyle \mathbf {b} }$ will be ${\displaystyle \mathbf {y} -\mathbf {X} \mathbf {b} }$. The generalized least squares method estimates ${\displaystyle \mathbf {\beta } }$ by minimizing the squared Mahalanobis length of this residual vector:

${\displaystyle \mathbf {\hat {\beta }} ={\underset {b}{\operatorname {argmin} }}\,(\mathbf {y} -\mathbf {X} \mathbf {b} )^{\mathsf {T}}\mathbf {\Omega } ^{-1}(\mathbf {y} -\mathbf {X} \mathbf {b} )={\underset {b}{\operatorname {argmin} }}\,\mathbf {y} ^{\mathsf {T}}\,\mathbf {\Omega } ^{-1}\mathbf {y} +(\mathbf {X} \mathbf {b} )^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {X} \mathbf {b} -\mathbf {y} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {X} \mathbf {b} -(\mathbf {X} \mathbf {b} )^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {y} \,,}$

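where the last two terms are scalars and are transposes of one another (because ${\displaystyle \mathbf {\Omega } ^{-1}}$ is symmetric), so they are equal and can be combined, resulting in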

${\displaystyle \mathbf {\hat {\beta }} ={\underset {b}{\operatorname {argmin} }}\,\mathbf {y} ^{\mathsf {T}}\,\mathbf {\Omega } ^{-1}\mathbf {y} +\mathbf {b} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {X} \mathbf {b} -2\mathbf {b} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {y} \,.}$

This objective is a quadratic form in ${\displaystyle \mathbf {b} }$.

Taking the gradient of this quadratic form with respect to ${\displaystyle \mathbf {b} }$ and equating it to zero (when ${\displaystyle \mathbf {b} ={\hat {\beta }}}$) gives

${\displaystyle 2\mathbf {X} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {X} {\hat {\beta }}-2\mathbf {X} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {y} =0}$

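Provided ${\displaystyle \mathbf {X} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {X} }$ is invertible (that is, ${\displaystyle \mathbf {X} }$ has full column rank), solving for ${\displaystyle {\hat {\beta }}}$ yields the explicit formula for the minimizer: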

${\displaystyle \mathbf {\hat {\beta }} =\left(\mathbf {X} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {X} \right)^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {\Omega } ^{-1}\mathbf {y} .}$
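The closed-form estimator can be sketched numerically as follows. This is a minimal illustration with NumPy, not a reference implementation: the helper name `gls`, the synthetic data, and the assumed covariance ${\displaystyle \Omega _{ij}=\rho ^{|i-j|}}$ are all hypothetical choices made for the example. It also evaluates the gradient from the derivation above at the solution, which should vanish up to rounding error.

```python
import numpy as np


def gls(X, y, Omega):
    """GLS estimate (X' Omega^-1 X)^-1 X' Omega^-1 y for a known Omega."""
    # Solve linear systems rather than forming Omega^-1 explicitly.
    Oinv_X = np.linalg.solve(Omega, X)
    Oinv_y = np.linalg.solve(Omega, y)
    return np.linalg.solve(X.T @ Oinv_X, X.T @ Oinv_y)


# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])

# Assumed covariance for the example: Omega[i, j] = rho ** |i - j|.
rho = 0.7
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# Draw errors with covariance Omega via a Cholesky factor.
eps = np.linalg.cholesky(Omega) @ rng.normal(size=n)
y = X @ beta_true + eps

beta_hat = gls(X, y, Omega)

# First-order condition: 2 X' Omega^-1 X b - 2 X' Omega^-1 y at b = beta_hat.
gradient = 2 * X.T @ np.linalg.solve(Omega, X @ beta_hat - y)
print(beta_hat, np.abs(gradient).max())
```

Solving linear systems instead of inverting ${\displaystyle \mathbf {\Omega } }$ is a common numerical choice; the explicit inverse gives the same estimate up to rounding.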

### Properties

The GLS estimator is unbiased, consistent, efficient, and asymptotically normal with ${\displaystyle \operatorname {E} [{\hat {\beta }}\mid \mathbf {X} ]=\beta }$ and ${\displaystyle \operatorname {Cov} [{\hat {\beta }}\mid \mathbf {X} ]=(\mathbf {X} ^{\mathsf {T}}\Omega ^{-1}\mathbf {X} )^{-1}}$.

GLS is equivalent to applying ordinary least squares to a linearly transformed version of the data. To see this, factor ${\displaystyle \mathbf {\Omega } =\mathbf {C} \mathbf {C} ^{\mathsf {T}}}$, for instance using the Cholesky decomposition. Then if we pre-multiply both sides of the equation ${\displaystyle \mathbf {y} =\mathbf {X} \mathbf {\beta } +\mathbf {\varepsilon } }$ by ${\displaystyle \mathbf {C} ^{-1}}$, we get an equivalent linear model ${\displaystyle \mathbf {y} ^{*}=\mathbf {X} ^{*}\mathbf {\beta } +\mathbf {\varepsilon } ^{*}}$ where ${\displaystyle \mathbf {y} ^{*}=\mathbf {C} ^{-1}\mathbf {y} }$, ${\displaystyle \mathbf {X} ^{*}=\mathbf {C} ^{-1}\mathbf {X} }$, and ${\displaystyle \mathbf {\varepsilon } ^{*}=\mathbf {C} ^{-1}\mathbf {\varepsilon } }$. In this model ${\displaystyle \operatorname {Cov} [\varepsilon ^{*}\mid \mathbf {X} ]=\mathbf {C} ^{-1}\mathbf {\Omega } \left(\mathbf {C} ^{-1}\right)^{\mathsf {T}}=\mathbf {I} }$, where ${\displaystyle \mathbf {I} }$ is the identity matrix. Thus we can efficiently estimate ${\displaystyle \mathbf {\beta } }$ by applying ordinary least squares (OLS) to the transformed data, which requires minimizing

${\displaystyle \left(\mathbf {y} ^{*}-\mathbf {X} ^{*}\mathbf {b} \right)^{\mathsf {T}}(\mathbf {y} ^{*}-\mathbf {X} ^{*}\mathbf {b} )=(\mathbf {y} -\mathbf {X} \mathbf {b} )^{\mathsf {T}}\,\mathbf {\Omega } ^{-1}(\mathbf {y} -\mathbf {X} \mathbf {b} ).}$

This has the effect of standardizing the scale of the errors and “de-correlating” them. Since OLS is applied to data with homoscedastic, uncorrelated errors, the Gauss–Markov theorem applies, and therefore the GLS estimate is the best linear unbiased estimator for β.
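The equivalence between GLS and OLS on transformed data can be checked numerically. The sketch below uses NumPy and an arbitrary symmetric positive-definite ${\displaystyle \mathbf {\Omega } }$ generated for illustration; the variable names (`X_star`, `y_star`, and so on) and the synthetic data are assumptions of the example, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Arbitrary symmetric positive-definite covariance (illustrative only).
A = rng.normal(size=(n, n))
Omega = A @ A.T + n * np.eye(n)

# Lower-triangular Cholesky factor, Omega = C C^T.
C = np.linalg.cholesky(Omega)
y = X @ np.array([1.0, 2.0]) + C @ rng.normal(size=n)

# Transformed data: y* = C^-1 y and X* = C^-1 X (via solves, no explicit inverse).
X_star = np.linalg.solve(C, X)
y_star = np.linalg.solve(C, y)

# OLS on the transformed data.
beta_ols_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Direct GLS formula on the original data.
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Covariance of the transformed errors: C^-1 Omega C^-T should be the identity.
C_inv = np.linalg.inv(C)
whitened_cov = C_inv @ Omega @ C_inv.T

print(beta_ols_star, beta_gls)
```

The two estimates agree up to rounding error, and `whitened_cov` is numerically the identity matrix.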

## Weighted least squares

A special case of GLS called weighted least squares (WLS) occurs when all the off-diagonal entries of Ω are 0. This situation arises when the variances of the observed values are unequal (i.e. heteroscedasticity is present), but no correlations exist among the errors of different observations. The weight for unit i is proportional to the reciprocal of the variance of the response for unit i. [2]
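As a small illustration of this special case, the following NumPy sketch computes the weighted estimate with weights equal to reciprocal variances and compares it with the general GLS formula using a diagonal Ω. The variance pattern `sigma2` and the data are hypothetical and assumed known only for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

# Assumed known error variances, growing with x (illustrative only).
sigma2 = 0.5 * x**2
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=n)

# WLS: weights are reciprocals of the variances.
w = 1.0 / sigma2
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# General GLS formula with a diagonal Omega.
Omega = np.diag(sigma2)
beta_gls = np.linalg.solve(
    X.T @ np.linalg.solve(Omega, X), X.T @ np.linalg.solve(Omega, y)
)

# Weights matter only up to proportionality: rescaling them changes nothing.
w10 = 10.0 * w
beta_rescaled = np.linalg.solve(X.T @ (w10[:, None] * X), X.T @ (w10 * y))

print(beta_wls, beta_gls, beta_rescaled)
```

Working with the weight vector directly avoids building the full n × n matrix, which is the usual practical advantage of WLS over the general formula.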

## Feasible generalized least squares

If the covariance of the errors ${\displaystyle \Omega }$ is unknown, one can obtain a consistent estimate of ${\displaystyle \Omega }$, say ${\displaystyle {\widehat {\Omega }}}$, [3] and use it in an implementable version of GLS known as the feasible generalized least squares (FGLS) estimator. In FGLS, modeling proceeds in two stages. (1) The model is estimated by OLS or another consistent (but inefficient) estimator, and the residuals are used to build a consistent estimator of the error covariance matrix. Doing so often requires imposing additional constraints on the model; for example, if the errors follow a time series process, a statistician generally needs some theoretical assumptions about this process to ensure that a consistent estimator is available. (2) Using the consistent estimator of the error covariance matrix, GLS is carried out with ${\displaystyle {\widehat {\Omega }}}$ in place of ${\displaystyle \Omega }$.

Whereas GLS is more efficient than OLS under heteroscedasticity or autocorrelation, this is not necessarily true for FGLS. Provided the error covariance matrix is consistently estimated, the feasible estimator is asymptotically more efficient, but in a small or medium-sized sample it can actually be less efficient than OLS. This is why some authors prefer to use OLS and base their inferences on an alternative estimator of the estimator's variance that is robust to heteroscedasticity or serial autocorrelation. For large samples, however, FGLS is preferred over OLS under heteroskedasticity or serial correlation. [3] [4] A cautionary note is that the FGLS estimator is not always consistent. One case in which FGLS might be inconsistent is when there are individual-specific fixed effects. [5]
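The efficiency of GLS with a known Ω can be illustrated without simulation by comparing exact conditional covariance matrices. For a known Ω, the OLS estimator has covariance ${\displaystyle (X'X)^{-1}X'\Omega X(X'X)^{-1}}$, while GLS has covariance ${\displaystyle (X'\Omega ^{-1}X)^{-1}}$; the difference should be positive semidefinite. The NumPy sketch below checks this for one hypothetical Ω combining unequal variances and serial correlation. It says nothing about FGLS, whose finite-sample behavior depends on how Ω is estimated.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Hypothetical Omega: unequal scales combined with correlation rho ** |i - j|.
rho = 0.6
idx = np.arange(n)
scale = rng.uniform(0.5, 2.0, size=n)
Omega = np.outer(scale, scale) * rho ** np.abs(idx[:, None] - idx[None, :])

# Exact conditional covariance of the GLS estimator.
cov_gls = np.linalg.inv(X.T @ np.linalg.solve(Omega, X))

# Exact conditional covariance of the OLS estimator under the same Omega.
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ Omega @ X @ XtX_inv

# The difference should be positive semidefinite (eigenvalues >= 0).
diff = cov_ols - cov_gls
eigenvalues = np.linalg.eigvalsh((diff + diff.T) / 2)

print(np.diag(cov_ols), np.diag(cov_gls), eigenvalues)
```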

In general this estimator has different properties from GLS. For large samples (i.e., asymptotically) all properties are (under appropriate conditions) shared with GLS, but for finite samples the properties of FGLS estimators are unknown: they vary dramatically with each particular model, and as a general rule their exact distributions cannot be derived analytically. For finite samples, FGLS may even be less efficient than OLS in some cases. Thus, while GLS can be made feasible, it is not always wise to apply this method when the sample is small. A method sometimes used to improve the accuracy of the estimators in finite samples is to iterate: the residuals from FGLS are used to update the error covariance estimator, the FGLS estimate is then updated, and the same idea is applied repeatedly until the estimates change by less than some tolerance. But this method does not necessarily improve the efficiency of the estimator very much if the original sample was small. A reasonable option when samples are not too large is to apply OLS but discard the classical variance estimator

${\displaystyle \sigma ^{2}(X'X)^{-1}}$

(which is inconsistent in this framework) and use a HAC (heteroskedasticity and autocorrelation consistent) estimator instead. For example, in the autocorrelation context we can use the Bartlett estimator (often known as the Newey–West estimator, since these authors popularized its use among econometricians in their 1987 Econometrica article), and in the heteroskedastic context we can use the Eicker–White estimator. This approach is much safer, and it is the appropriate path to take unless the sample is large, where "large" is sometimes a slippery issue (e.g. if the error distribution is asymmetric, the required sample would be much larger).
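As a rough sketch of the heteroskedastic case only, the Eicker–White estimator (its basic form is often labeled HC0) replaces the classical formula with the "sandwich" ${\displaystyle (X'X)^{-1}X'\operatorname {diag} ({\widehat {u}}_{i}^{2})X(X'X)^{-1}}$, where ${\displaystyle {\widehat {u}}_{i}}$ are OLS residuals. The function name `ols_with_hc0` and the data below are hypothetical; the autocorrelation-robust Bartlett/Newey–West variant requires additional lag-weighted terms and is not shown.

```python
import numpy as np


def ols_with_hc0(X, y):
    """OLS coefficients with HC0 (Eicker-White) and classical covariances."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta  # OLS residuals

    # Sandwich: (X'X)^-1 [X' diag(u^2) X] (X'X)^-1
    meat = X.T @ (u[:, None] ** 2 * X)
    cov_hc0 = XtX_inv @ meat @ XtX_inv

    # Classical estimator s^2 (X'X)^-1, shown only for comparison.
    cov_classical = (u @ u) / (n - k) * XtX_inv
    return beta, cov_hc0, cov_classical


# Synthetic heteroskedastic data (illustrative only).
rng = np.random.default_rng(4)
n = 300
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + x * rng.normal(size=n)

beta_ols, cov_hc0, cov_classical = ols_with_hc0(X, y)
se_hc0 = np.sqrt(np.diag(cov_hc0))
se_classical = np.sqrt(np.diag(cov_classical))
print(beta_ols, se_hc0, se_classical)
```

The coefficient estimates are the ordinary OLS ones; only the standard errors used for inference change.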

To construct the FGLS estimator, the ordinary least squares (OLS) estimator is first calculated as usual by

${\displaystyle {\widehat {\beta }}_{\text{OLS}}=(X'X)^{-1}X'y}$

and estimates of the residuals ${\displaystyle {\widehat {u}}_{j}=(y-X{\widehat {\beta }}_{\text{OLS}})_{j}}$ are constructed.

For simplicity, consider the model for heteroskedastic errors. Assume that the variance-covariance matrix ${\displaystyle \Omega }$ of the error vector is diagonal, or equivalently that errors from distinct observations are uncorrelated. Then each diagonal entry may be estimated from the fitted residuals ${\displaystyle {\widehat {u}}_{j}}$, so ${\displaystyle {\widehat {\Omega }}_{\text{OLS}}}$ may be constructed as

${\displaystyle {\widehat {\Omega }}_{\text{OLS}}=\operatorname {diag} ({\widehat {\sigma }}_{1}^{2},{\widehat {\sigma }}_{2}^{2},\dots ,{\widehat {\sigma }}_{n}^{2}).}$

It is important to note that the squared residuals cannot be used directly in the previous expression; an estimator of the error variances is needed. For this, one can use a parametric heteroskedasticity model or a nonparametric estimator. Once this step is complete, we can proceed:

Estimate ${\displaystyle \beta _{FGLS1}}$ by weighted least squares, using ${\displaystyle {\widehat {\Omega }}_{\text{OLS}}}$ as the covariance matrix: [4]

${\displaystyle {\widehat {\beta }}_{FGLS1}=(X'{\widehat {\Omega }}_{\text{OLS}}^{-1}X)^{-1}X'{\widehat {\Omega }}_{\text{OLS}}^{-1}y}$

The procedure can be iterated. The first iteration is given by

${\displaystyle {\widehat {u}}_{FGLS1}=y-X{\widehat {\beta }}_{FGLS1}}$

${\displaystyle {\widehat {\Omega }}_{FGLS1}=\operatorname {diag} ({\widehat {\sigma }}_{FGLS1,1}^{2},{\widehat {\sigma }}_{FGLS1,2}^{2},\dots ,{\widehat {\sigma }}_{FGLS1,n}^{2})}$

${\displaystyle {\widehat {\beta }}_{FGLS2}=(X'{\widehat {\Omega }}_{FGLS1}^{-1}X)^{-1}X'{\widehat {\Omega }}_{FGLS1}^{-1}y}$

This estimation of ${\displaystyle {\widehat {\Omega }}}$ can be iterated to convergence.
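The iteration described above can be sketched as follows. The text leaves the variance model open, so this example assumes one particular parametric form, ${\displaystyle \sigma _{i}^{2}=\exp(z_{i}'\gamma )}$, fitted by regressing the log of the squared residuals on hypothetical variance regressors `Z`. The function names, the tolerance, and the data are illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np


def wls(X, y, sigma2):
    """Weighted least squares with weights 1 / sigma2."""
    w = 1.0 / sigma2
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))


def fgls_heteroskedastic(X, y, Z, tol=1e-8, max_iter=50):
    """Iterated FGLS under the assumed model sigma_i^2 = exp(z_i' gamma)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # stage 1: OLS
    for _ in range(max_iter):
        u = y - X @ beta
        # Fit the variance model on log squared residuals; the small
        # constant guards against log(0). A constant bias in this fit only
        # rescales all weights, which leaves the WLS estimate unchanged
        # as long as Z contains an intercept.
        gamma = np.linalg.lstsq(Z, np.log(u**2 + 1e-12), rcond=None)[0]
        sigma2_hat = np.exp(Z @ gamma)
        beta_new = wls(X, y, sigma2_hat)  # stage 2: GLS with estimated Omega
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, sigma2_hat


# Synthetic data whose error variance follows the assumed model.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma2_true = np.exp(-1.0 + 0.8 * x)
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2_true) * rng.normal(size=n)

beta_fgls, sigma2_hat = fgls_heteroskedastic(X, y, Z=X)
print(beta_fgls)
```

If the variance model is misspecified, the resulting weights are wrong, which is one reason the robust-variance OLS approach mentioned earlier is often preferred in smaller samples.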

Under regularity conditions, the FGLS estimator (or that of any of its iterations, if we iterate a finite number of times) is asymptotically distributed as

${\displaystyle {\sqrt {n}}({\hat {\beta }}_{FGLS}-\beta )\ {\xrightarrow {d}}\ {\mathcal {N}}\!\left(0,\,V\right),}$

where n is the sample size and

${\displaystyle V=\operatorname {p-lim} \left(X'\Omega ^{-1}X/n\right)^{-1}.}$

Here p-lim denotes the limit in probability.

## References

1. Aitken, A. C. (1936). "On Least-squares and Linear Combinations of Observations". Proceedings of the Royal Society of Edinburgh. 55: 42–48.
2. Strutz, T. (2016). Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Springer Vieweg. ISBN 978-3-658-11455-8. Chapter 3.
3. Baltagi, B. H. (2008). Econometrics (4th ed.). New York: Springer.
4. Greene, W. H. (2003). Econometric Analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
5. Hansen, Christian B. (2007). "Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects". Journal of Econometrics. 140 (2): 670–694. doi:10.1016/j.jeconom.2006.07.011.