# Weighted least squares

Weighted least squares (WLS), also known as weighted linear regression, [1] [2] is a generalization of ordinary least squares and linear regression in which knowledge of the variance of observations is incorporated into the regression. WLS is also a specialization of generalized least squares.

## Introduction

A special case of generalized least squares, called weighted least squares, occurs when all the off-diagonal entries of Ω (the correlation matrix of the residuals) are zero; the variances of the observations (along the diagonal of the covariance matrix) may still be unequal (heteroscedasticity).

The fit of a model to a data point is measured by its residual, ${\displaystyle r_{i}}$, defined as the difference between a measured value of the dependent variable, ${\displaystyle y_{i}}$, and the value predicted by the model, ${\displaystyle f(x_{i},{\boldsymbol {\beta }})}$:

${\displaystyle r_{i}({\boldsymbol {\beta }})=y_{i}-f(x_{i},{\boldsymbol {\beta }}).}$

If the errors are uncorrelated and have equal variance, then the minimum of the function

${\displaystyle S({\boldsymbol {\beta }})=\sum _{i}r_{i}({\boldsymbol {\beta }})^{2}}$,

is found when ${\displaystyle {\frac {\partial S\left({\hat {\boldsymbol {\beta }}}\right)}{\partial \beta _{j}}}=0}$ (defining ${\displaystyle {\boldsymbol {\hat {\beta }}}}$).

The Gauss–Markov theorem shows that, when this is so, ${\displaystyle {\hat {\boldsymbol {\beta }}}}$ is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach must be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, ${\displaystyle {\hat {\boldsymbol {\beta }}}}$ is the BLUE if each weight is equal to the reciprocal of the variance of the corresponding measurement:

${\displaystyle {\begin{aligned}S&=\sum _{i=1}^{n}W_{ii}{r_{i}}^{2},&W_{ii}&={\frac {1}{{\sigma _{i}}^{2}}}.\end{aligned}}}$

The gradient equations for this sum of squares are

${\displaystyle -2\sum _{i}W_{ii}{\frac {\partial f(x_{i},{\boldsymbol {\beta }})}{\partial \beta _{j}}}r_{i}=0,\quad j=1,\ldots ,m}$

which, in a linear least-squares system, give the modified normal equations:

${\displaystyle \sum _{i=1}^{n}\sum _{k=1}^{m}X_{ij}W_{ii}X_{ik}{\hat {\beta }}_{k}=\sum _{i=1}^{n}X_{ij}W_{ii}y_{i},\quad j=1,\ldots ,m\,.}$

When the observational errors are uncorrelated and the weight matrix, W, is diagonal, these may be written as

${\displaystyle \mathbf {\left(X^{\textsf {T}}WX\right){\hat {\boldsymbol {\beta }}}=X^{\textsf {T}}Wy} .}$

If the errors are correlated, the resulting estimator is the BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.

When the errors are uncorrelated, it is convenient to simplify the calculations by factoring the weight matrix as ${\displaystyle w_{ii}={\sqrt {W_{ii}}}}$. The normal equations can then be written in the same form as ordinary least squares:

${\displaystyle \mathbf {\left(X'^{\textsf {T}}X'\right){\hat {\boldsymbol {\beta }}}=X'^{\textsf {T}}y'} \,}$

where we define the following scaled matrix and vector:

${\displaystyle {\begin{aligned}\mathbf {X'} &=\operatorname {diag} \left(\mathbf {w} \right)\mathbf {X} ,\\\mathbf {y'} &=\operatorname {diag} \left(\mathbf {w} \right)\mathbf {y} =\mathbf {y} \oslash \mathbf {\sigma } .\end{aligned}}}$

This is a type of whitening transformation; the last expression involves an entrywise division.
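As a concrete illustration, here is a minimal NumPy sketch (the data, uncertainties, and variable names are hypothetical, chosen only for demonstration) showing that solving the weighted normal equations and running ordinary least squares on the whitened system give the same estimate:

```python
import numpy as np

# Hypothetical example: n = 5 observations, m = 2 parameters
# (intercept and slope); sigma holds per-observation standard deviations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])      # design matrix
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.1, 0.5, 0.5, 1.0])    # unequal uncertainties

# Weighted normal equations: (X^T W X) beta = X^T W y, W = diag(1/sigma^2)
W = np.diag(1.0 / sigma**2)
beta_normal = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Whitening: scale each row by w_i = 1/sigma_i, then run ordinary least squares
Xs = X / sigma[:, None]
ys = y / sigma
beta_whitened, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

assert np.allclose(beta_normal, beta_whitened)
```

The whitening form is often preferable numerically, since it avoids forming ${\displaystyle X^{\textsf {T}}WX}$ explicitly.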

For non-linear least-squares systems, a similar argument shows that the normal equations should be modified as follows:

${\displaystyle \mathbf {\left(J^{\textsf {T}}WJ\right)\,{\boldsymbol {\Delta }}\beta =J^{\textsf {T}}W\,{\boldsymbol {\Delta }}y} .\,}$

Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this, feasible generalized least squares (FGLS) techniques may be used; in the WLS case, the FGLS estimator is specialized to a diagonal covariance matrix, yielding a feasible weighted least squares solution.

If the uncertainty of the observations is not known from external sources, then the weights could be estimated from the given observations. This can be useful, for example, to identify outliers. After the outliers have been removed from the data set, the weights should be reset to one. [3]

## Motivation

In some cases the observations may be weighted—for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:

${\displaystyle {\underset {\boldsymbol {\beta }}{\operatorname {arg\ min} }}\,\sum _{i=1}^{n}w_{i}\left|y_{i}-\sum _{j=1}^{m}X_{ij}\beta _{j}\right|^{2}={\underset {\boldsymbol {\beta }}{\operatorname {arg\ min} }}\,\left\|W^{\frac {1}{2}}\left(\mathbf {y} -X{\boldsymbol {\beta }}\right)\right\|^{2}.}$

where ${\displaystyle w_{i}>0}$ is the weight of the ith observation, and W is the diagonal matrix of such weights.

The weights should, ideally, be equal to the reciprocal of the variance of the measurement. (This implies that the observations are uncorrelated. If the observations are correlated, the expression ${\textstyle S=\sum _{k}\sum _{j}r_{k}W_{kj}r_{j}\,}$ applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations). [3] The normal equations are then:

${\displaystyle \left(X^{\textsf {T}}WX\right){\hat {\boldsymbol {\beta }}}=X^{\textsf {T}}W\mathbf {y} .}$

This method is used in iteratively reweighted least squares.
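The idea behind iteratively reweighted least squares can be sketched as follows, here applied to a least-absolute-deviations (L1) fit; the helper `irls_l1`, the data, and the stopping rule are all hypothetical illustrations, not prescribed by the method itself:

```python
import numpy as np

def irls_l1(X, y, n_iter=50, eps=1e-8):
    """Minimal IRLS sketch for a least-absolute-deviations fit.
    Each iteration solves a weighted least squares problem with
    W_ii = 1 / max(|r_i|, eps), down-weighting large residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), eps)
        Xw = X * w[:, None]                        # rows of X scaled by W_ii
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

# Data on the line y = 1 + 2x, with one gross outlier at x = 2
x = np.arange(6.0)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
y[2] += 10.0

beta = irls_l1(X, y)   # the L1 fit essentially ignores the outlier
```

Unlike a plain least-squares fit, the reweighted solution stays close to the line through the five uncontaminated points.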

### Parameter errors and correlation

The estimated parameter values are linear combinations of the observed values

${\displaystyle {\hat {\boldsymbol {\beta }}}=(X^{\textsf {T}}WX)^{-1}X^{\textsf {T}}W\mathbf {y} .}$

Therefore, an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the estimated parameters by ${\displaystyle M^{\beta }}$. Then

${\displaystyle M^{\beta }=\left(X^{\textsf {T}}WX\right)^{-1}X^{\textsf {T}}WMW^{\textsf {T}}X\left(X^{\textsf {T}}W^{\textsf {T}}X\right)^{-1}.}$

When W = M−1, this simplifies to

${\displaystyle M^{\beta }=\left(X^{\textsf {T}}WX\right)^{-1}.}$

When unit weights are used (W = I, the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: M = σ2I, where σ2 is the a priori variance of an observation. In any case, σ2 is approximated by the reduced chi-squared ${\displaystyle \chi _{\nu }^{2}}$:

${\displaystyle {\begin{aligned}M^{\beta }&=\chi _{\nu }^{2}\left(X^{\textsf {T}}WX\right)^{-1},\\\chi _{\nu }^{2}&=S/\nu ,\end{aligned}}}$

where S is the minimum value of the (weighted) objective function:

${\displaystyle S=r^{\textsf {T}}Wr.}$

The denominator, ${\displaystyle \nu =n-m}$, is the number of degrees of freedom; see effective degrees of freedom for generalizations for the case of correlated observations.

In all cases, the variance of the parameter estimate ${\displaystyle {\hat {\beta }}_{i}}$ is given by ${\displaystyle M_{ii}^{\beta }}$ and the covariance between the parameter estimates ${\displaystyle {\hat {\beta }}_{i}}$ and ${\displaystyle {\hat {\beta }}_{j}}$ is given by ${\displaystyle M_{ij}^{\beta }}$. The standard deviation is the square root of variance, ${\displaystyle \sigma _{i}={\sqrt {M_{ii}^{\beta }}}}$, and the correlation coefficient is given by ${\displaystyle \rho _{ij}=M_{ij}^{\beta }/(\sigma _{i}\sigma _{j})}$. These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors, which, by definition, cannot be quantified. Note that even though the observations may be uncorrelated, the parameters are typically correlated.
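The quantities above can be computed directly; here is a minimal NumPy sketch for a straight-line fit (the data and uncertainties are hypothetical, chosen only to illustrate the formulas):

```python
import numpy as np

# Hypothetical straight-line fit: n = 5 observations, m = 2 parameters
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma = np.array([0.1, 0.1, 0.5, 0.5, 1.0])
W = np.diag(1.0 / sigma**2)                       # W = M^{-1}

beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
r = y - X @ beta
nu = len(y) - X.shape[1]                          # degrees of freedom, n - m
chi2_nu = (r @ W @ r) / nu                        # reduced chi-squared, S / nu
M_beta = chi2_nu * np.linalg.inv(X.T @ W @ X)     # parameter covariance matrix

std_err = np.sqrt(np.diag(M_beta))                # parameter standard deviations
rho = M_beta[0, 1] / (std_err[0] * std_err[1])    # correlation of the estimates
```

Even though the observations here are uncorrelated (W is diagonal), `rho` is generally nonzero: the intercept and slope estimates are correlated.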

### Parameter confidence limits

It is often assumed, for want of any concrete evidence but often appealing to the central limit theorem—see Normal distribution#Occurrence and applications—that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation ${\displaystyle \sigma }$. Under that assumption the following probabilities can be derived for a single scalar parameter estimate in terms of its estimated standard error ${\displaystyle se_{\beta }}$:

- 68% that the interval ${\displaystyle {\hat {\beta }}\pm se_{\beta }}$ encompasses the true coefficient value
- 95% that the interval ${\displaystyle {\hat {\beta }}\pm 2se_{\beta }}$ encompasses the true coefficient value
- 99% that the interval ${\displaystyle {\hat {\beta }}\pm 2.5se_{\beta }}$ encompasses the true coefficient value

The assumption is not unreasonable when the number of observations greatly exceeds the number of parameters (${\displaystyle n\gg m}$). If the experimental errors are normally distributed, the parameters will belong to a Student's t-distribution with ${\displaystyle n-m}$ degrees of freedom; when ${\displaystyle n\gg m}$, Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error. [4]

When the number of observations is relatively small, Chebyshev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2, or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.
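Written out, the bound follows directly from Chebyshev's inequality applied to a single parameter estimate:

${\displaystyle P\left(\left|{\hat {\beta }}_{i}-\operatorname {E} [{\hat {\beta }}_{i}]\right|\geq k\sigma _{i}\right)\leq {\frac {1}{k^{2}}},}$

so ${\displaystyle k=2}$ gives ${\displaystyle 1/4=25\%}$ and ${\displaystyle k=3}$ gives ${\displaystyle 1/9\approx 11\%}$.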

### Residual values and correlation

The residuals are related to the observations by

${\displaystyle \mathbf {\hat {r}} =\mathbf {y} -X{\hat {\boldsymbol {\beta }}}=\mathbf {y} -H\mathbf {y} =(I-H)\mathbf {y} ,}$

where H is the idempotent matrix known as the hat matrix:

${\displaystyle H=X\left(X^{\textsf {T}}WX\right)^{-1}X^{\textsf {T}}W,}$

and I is the identity matrix. The variance-covariance matrix of the residuals, Mr is given by

${\displaystyle M^{\mathbf {r} }=(I-H)M(I-H)^{\textsf {T}}.}$

Thus the residuals are correlated, even if the observations are not.

When ${\displaystyle W=M^{-1}}$,

${\displaystyle M^{\mathbf {r} }=(I-H)M.}$
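These relations are easy to check numerically. A short NumPy sketch with a hypothetical design matrix and observation covariance (all values illustrative):

```python
import numpy as np

# Hypothetical straight-line design and heteroscedastic observation covariance
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
sigma = np.array([0.1, 0.1, 0.5, 0.5, 1.0])
M = np.diag(sigma**2)                           # observation covariance
W = np.linalg.inv(M)                            # W = M^{-1}

H = X @ np.linalg.inv(X.T @ W @ X) @ X.T @ W    # hat matrix
I = np.eye(len(x))
M_r = (I - H) @ M                               # residual covariance when W = M^{-1}

assert np.allclose(H @ H, H)                    # H is idempotent
```

The off-diagonal entries of `M_r` are nonzero even though `M` is diagonal, confirming that the residuals are correlated although the observations are not.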

The sum of weighted residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by ${\displaystyle X^{\textsf {T}}W}$:

${\displaystyle X^{\textsf {T}}W{\hat {\mathbf {r} }}=X^{\textsf {T}}W\mathbf {y} -X^{\textsf {T}}WX{\hat {\boldsymbol {\beta }}}=X^{\textsf {T}}W\mathbf {y} -\left(X^{\rm {T}}WX\right)\left(X^{\textsf {T}}WX\right)^{-1}X^{\textsf {T}}W\mathbf {y} =\mathbf {0} .}$

Say, for example, that the first term of the model is a constant, so that ${\displaystyle X_{i1}=1}$ for all i. In that case it follows that

${\displaystyle \sum _{i=1}^{n}X_{i1}W_{ii}{\hat {r}}_{i}=\sum _{i=1}^{n}W_{ii}{\hat {r}}_{i}=0.}$

Thus, in the motivational example, above, the fact that the sum of residual values is equal to zero is not accidental, but is a consequence of the presence of the constant term, α, in the model.
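This identity is straightforward to verify numerically; a sketch with hypothetical data, where the first column of X is the constant term:

```python
import numpy as np

# With a constant column in X, the weighted residuals sum to zero.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])       # first column: constant term
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
W = np.diag(1.0 / np.array([0.1, 0.1, 0.5, 0.5, 1.0])**2)

beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
r_hat = y - X @ beta

# sum_i W_ii r_i vanishes up to floating-point error
assert abs(np.diag(W) @ r_hat) < 1e-8
```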

If experimental error follows a normal distribution then, because of the linear relationship between residuals and observations, so should the residuals; [5] but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.

## References

1. Strutz, T. (2016). Data Fitting and Uncertainty (A Practical Introduction to Weighted Least Squares and Beyond). Springer Vieweg. ISBN 978-3-658-11455-8, chapter 3.
2. Mandel, John (1964). The Statistical Analysis of Experimental Data. New York: Interscience.
3. Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate Analysis. New York: Academic Press. ISBN 0-12-471250-9.