Weighted least squares

Last updated May 10, 2023

Weighted least squares (WLS), also known as weighted linear regression,^[1]^[2] is a generalization of ordinary least squares and linear regression in which knowledge of the variance of observations is incorporated into the regression. WLS is also a specialization of generalized least squares.

Introduction

A special case of generalized least squares called weighted least squares can be used when all the off-diagonal entries of Ω, the covariance matrix of the errors, are null; the variances of the observations (along the covariance matrix diagonal) may still be unequal (heteroscedasticity).

The fit of a model to a data point is measured by its residual, $r_{i}$ , defined as the difference between a measured value of the dependent variable, $y_{i}$ and the value predicted by the model, $f(x_{i},{\boldsymbol {\beta }})$ :

r_{i}({\boldsymbol {\beta }})=y_{i}-f(x_{i},{\boldsymbol {\beta }}).

If the errors are uncorrelated and have equal variance, then the function

S({\boldsymbol {\beta }})=\sum _{i}r_{i}({\boldsymbol {\beta }})^{2},

is minimised at ${\boldsymbol {\hat {\beta }}}$ , such that ${\frac {\partial S}{\partial \beta _{j}}}({\hat {\boldsymbol {\beta }}})=0$ .

The Gauss–Markov theorem shows that, when this is so, ${\hat {\boldsymbol {\beta }}}$ is a best linear unbiased estimator (BLUE). If, however, the measurements are uncorrelated but have different uncertainties, a modified approach might be adopted. Aitken showed that when a weighted sum of squared residuals is minimized, ${\hat {\boldsymbol {\beta }}}$ is the BLUE if each weight is equal to the reciprocal of the variance of the measurement

{\begin{aligned}S&=\sum _{i=1}^{n}W_{ii}{r_{i}}^{2},&W_{ii}&={\frac {1}{{\sigma _{i}}^{2}}}\end{aligned}}

The gradient equations for this sum of squares are

-2\sum _{i}W_{ii}{\frac {\partial f(x_{i},{\boldsymbol {\beta }})}{\partial \beta _{j}}}r_{i}=0,\quad j=1,\ldots ,m

which, in a linear least squares system give the modified normal equations,

\sum _{i=1}^{n}\sum _{k=1}^{m}X_{ij}W_{ii}X_{ik}{\hat {\beta }}_{k}=\sum _{i=1}^{n}X_{ij}W_{ii}y_{i},\quad j=1,\ldots ,m\,.

When the observational errors are uncorrelated and the weight matrix, W=Ω⁻¹, is diagonal, these may be written as

\mathbf {\left(X^{\textsf {T}}WX\right){\hat {\boldsymbol {\beta }}}=X^{\textsf {T}}Wy} .

If the errors are correlated, the resulting estimator is the BLUE if the weight matrix is equal to the inverse of the variance-covariance matrix of the observations.

When the errors are uncorrelated, it is convenient to simplify the calculations to factor the weight matrix as $w_{ii}={\sqrt {W_{ii}}}$ . The normal equations can then be written in the same form as ordinary least squares:

\mathbf {\left(X'^{\textsf {T}}X'\right){\hat {\boldsymbol {\beta }}}=X'^{\textsf {T}}y'} \,

where we define the following scaled matrix and vector:

{\begin{aligned}\mathbf {X'} &=\operatorname {diag} \left(\mathbf {w} \right)\mathbf {X} ,\\\mathbf {y'} &=\operatorname {diag} \left(\mathbf {w} \right)\mathbf {y} =\mathbf {y} \oslash \mathbf {\sigma } .\end{aligned}}

This is a type of whitening transformation; the last expression involves an entrywise division.

For non-linear least squares systems a similar argument shows that the normal equations should be modified as follows.

\mathbf {\left(J^{\textsf {T}}WJ\right)\,{\boldsymbol {\Delta }}\beta =J^{\textsf {T}}W\,{\boldsymbol {\Delta }}y} .\,

Note that for empirical tests, the appropriate W is not known for sure and must be estimated. For this feasible generalized least squares (FGLS) techniques may be used; in this case it is specialized for a diagonal covariance matrix, thus yielding a feasible weighted least squares solution.

If the uncertainty of the observations is not known from external sources, then the weights could be estimated from the given observations. This can be useful, for example, to identify outliers. After the outliers have been removed from the data set, the weights should be reset to one.^[3]

Motivation

In some cases the observations may be weighted—for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:

{\underset {\boldsymbol {\beta }}{\operatorname {arg\ min} }}\,\sum _{i=1}^{n}w_{i}\left|y_{i}-\sum _{j=1}^{m}X_{ij}\beta _{j}\right|^{2}={\underset {\boldsymbol {\beta }}{\operatorname {arg\ min} }}\,\left\|W^{\frac {1}{2}}\left(\mathbf {y} -X{\boldsymbol {\beta }}\right)\right\|^{2}.

where w_i > 0 is the weight of the ith observation, and W is the diagonal matrix of such weights.

The weights should, ideally, be equal to the reciprocal of the variance of the measurement. (This implies that the observations are uncorrelated. If the observations are correlated, the expression ${\textstyle S=\sum _{k}\sum _{j}r_{k}W_{kj}r_{j}\,}$ applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations).^[3] The normal equations are then:

\left(X^{\textsf {T}}WX\right){\hat {\boldsymbol {\beta }}}=X^{\textsf {T}}W\mathbf {y} .

This method is used in iteratively reweighted least squares.

Parameter errors and correlation

The estimated parameter values are linear combinations of the observed values

{\hat {\boldsymbol {\beta }}}=(X^{\textsf {T}}WX)^{-1}X^{\textsf {T}}W\mathbf {y} .

Therefore, an expression for the estimated variance-covariance matrix of the parameter estimates can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the estimated parameters by M^β. Then

M^{\beta }=\left(X^{\textsf {T}}WX\right)^{-1}X^{\textsf {T}}WMW^{\textsf {T}}X\left(X^{\textsf {T}}W^{\textsf {T}}X\right)^{-1}.

When $W = M -1$ , this simplifies to

M^{\beta }=\left(X^{\textsf {T}}WX\right)^{-1}.

When unit weights are used ( $W = I$ , the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: $M = σ 2 I$ , where $σ 2$ is the a priori variance of an observation. In any case, σ² is approximated by the reduced chi-squared $\chi _{\nu }^{2}$ :

{\begin{aligned}M^{\beta }&=\chi _{\nu }^{2}\left(X^{\textsf {T}}WX\right)^{-1},\\\chi _{\nu }^{2}&=S/\nu ,\end{aligned}}

where S is the minimum value of the weighted objective function:

S=r^{\textsf {T}}Wr=\left\|W^{\frac {1}{2}}\left(\mathbf {y} -X{\hat {\boldsymbol {\beta }}}\right)\right\|^{2}.

The denominator, $\nu =n-m$ , is the number of degrees of freedom; see effective degrees of freedom for generalizations for the case of correlated observations.

In all cases, the variance of the parameter estimate ${\hat {\beta }}_{i}$ is given by $M_{ii}^{\beta }$ and the covariance between the parameter estimates ${\hat {\beta }}_{i}$ and ${\hat {\beta }}_{j}$ is given by $M_{ij}^{\beta }$ . The standard deviation is the square root of variance, $\sigma _{i}={\sqrt {M_{ii}^{\beta }}}$ , and the correlation coefficient is given by $\rho _{ij}=M_{ij}^{\beta }/(\sigma _{i}\sigma _{j})$ . These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors, which, by definition, cannot be quantified. Note that even though the observations may be uncorrelated, the parameters are typically correlated.

Parameter confidence limits

It is often assumed, for want of any concrete evidence but often appealing to the central limit theorem—see Normal distribution#Occurrence and applications—that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation $\sigma$ . Under that assumption the following probabilities can be derived for a single scalar parameter estimate in terms of its estimated standard error $se_{\beta }$ (given here):

68% that the interval ${\hat {\beta }}\pm se_{\beta }$ encompasses the true coefficient value
95% that the interval ${\hat {\beta }}\pm 2se_{\beta }$ encompasses the true coefficient value
99% that the interval ${\hat {\beta }}\pm 2.5se_{\beta }$ encompasses the true coefficient value

The assumption is not unreasonable when n >> m. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with n − m degrees of freedom. When n ≫ m Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.^[4]

When the number of observations is relatively small, Chebychev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2, or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.

Residual values and correlation

The residuals are related to the observations by

\mathbf {\hat {r}} =\mathbf {y} -X{\hat {\boldsymbol {\beta }}}=\mathbf {y} -H\mathbf {y} =(I-H)\mathbf {y} ,

where H is the idempotent matrix known as the hat matrix:

H=X\left(X^{\textsf {T}}WX\right)^{-1}X^{\textsf {T}}W,

and I is the identity matrix. The variance-covariance matrix of the residuals, M^r is given by

M^{\mathbf {r} }=(I-H)M(I-H)^{\textsf {T}}.

Thus the residuals are correlated, even if the observations are not.

When $W=M^{-1}$ ,

M^{\mathbf {r} }=(I-H)M.

The sum of weighted residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by X^TW^T:

X^{\textsf {T}}W{\hat {\mathbf {r} }}=X^{\textsf {T}}W\mathbf {y} -X^{\textsf {T}}WX{\hat {\boldsymbol {\beta }}}=X^{\textsf {T}}W\mathbf {y} -\left(X^{\rm {T}}WX\right)\left(X^{\textsf {T}}WX\right)^{-1}X^{\textsf {T}}W\mathbf {y} =\mathbf {0} .

Say, for example, that the first term of the model is a constant, so that $X_{i1}=1$ for all i. In that case it follows that

\sum _{i}^{m}X_{i1}W_{i}{\hat {r}}_{i}=\sum _{i}^{m}W_{i}{\hat {r}}_{i}=0.

Thus, in the motivational example, above, the fact that the sum of residual values is equal to zero is not accidental, but is a consequence of the presence of the constant term, α, in the model.

If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals,^[5] but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.

Related Research Articles

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems by minimizing the sum of the squares of the residuals made in the results of each individual equation.

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal, nor do they need to be independent and identically distributed. The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator, ridge regression, or simply any degenerate estimator.

For statistics and control theory, Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, including statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe. The filter is named after Rudolf E. Kálmán, who was one of the primary developers of its theory.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

In applied statistics, total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.

In statistics, a confidence region is a multi-dimensional generalization of a confidence interval. It is a set of points in an n-dimensional space, often represented as an ellipsoid around a point which is an estimated solution to a problem, although other shapes can occur.

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations.

The Gauss–Newton algorithm is used to solve non-linear least squares problems, which is equivalent to minimizing a sum of squared function values. It is an extension of Newton's method for finding a minimum of a non-linear function. Since a sum of squares must be nonnegative, the algorithm can be viewed as using Newton's method to iteratively approximate zeroes of the components of the sum, and thus minimizing the sum. In this sense, the algorithm is also an effective method for solving overdetermined systems of equations. It has the advantage that second derivatives, which can be challenging to compute, are not required.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

In linear algebra, an idempotent matrix is a matrix which, when multiplied by itself, yields itself. That is, the matrix $is idempotent if and only if . For this product to be defined, must necessarily be a square matrix. Viewed this way, idempotent matrices are idempotent elements of matrix rings.$

In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in a regression model. In these cases, ordinary least squares and weighted least squares can be statistically inefficient or even give misleading inferences. GLS was first described by Alexander Aitken in 1936.

A mixed model, mixed-effects model or mixed error-component model is a statistical model containing both fixed effects and random effects. These models are useful in a wide variety of disciplines in the physical, biological and social sciences. They are particularly useful in settings where repeated measurements are made on the same statistical units, or where measurements are made on clusters of related statistical units. Because of their advantage in dealing with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures analysis of variance.

Equilibrium constants are determined in order to quantify chemical equilibria. When an equilibrium constant $K$ is expressed as a concentration quotient,

The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors, Eicker–Huber–White standard errors, to recognize the contributions of Friedhelm Eicker, Peter J. Huber, and Halbert White.

In statistics, the projection matrix $, sometimes also called the influence matrix or hat matrix, maps the vector of response values to the vector of fitted values. It describes the influence each response value has on each fitted value. The diagonal elements of the projection matrix are the leverages, which describe the influence each response value has on the fitted value for that same observation.$

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ( $).$

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in $space, where is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted i.e., to be influential points. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.$

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

References

↑ "Weighted regression".
↑ "Visualize a weighted regression".
1 2 Strutz, T. (2016). "3". Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Springer Vieweg. ISBN 978-3-658-11455-8.
↑ Mandel, John (1964). The Statistical Analysis of Experimental Data. New York: Interscience.
↑ Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate analysis. New York: Academic Press. ISBN 0-12-471250-9.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "Weighted regression".

[2] "Visualize a weighted regression".

[strutz-3] 1 2 Strutz, T. (2016). "3". Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Springer Vieweg. ISBN 978-3-658-11455-8.

[4] Mandel, John (1964). The Statistical Analysis of Experimental Data. New York: Interscience.

[5] Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate analysis. New York: Academic Press. ISBN 0-12-471250-9.

[1]

[2]

[3]

[4]

[5]