Linear least squares

Last updated

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.

Contents

Basic formulation

Consider the linear equation

 

 

 

 

(1)

where and are given and is variable to be computed. When it is generally the case that ( 1 ) has no solution. For example, there is no value of that satisfies

because the first two rows require that but then the third row is not satisfied. Thus, for the goal of solving ( 1 ) exactly is typically replaced by finding the value of that minimizes some error. There are many ways that the error can be defined, but one of the most common is to define it as This produces a minimization problem, called a least squares problem

 

 

 

 

(2)

The solution to the least squares problem ( 1 ) is computed by solving the normal equation [1]

 

 

 

 

(3)

where denotes the transpose of .

Continuing the example, above, with

we find

and

Solving the normal equation gives

Formulations for Linear Regression

The three main linear least squares formulations are:

Alternative formulations

Other formulations include:

Objective function

In OLS (i.e., assuming unweighted observations), the optimal value of the objective function is found by substituting the optimal expression for the coefficient vector:

where , the latter equality holding since is symmetric and idempotent. It can be shown from this [9] that under an appropriate assignment of weights the expected value of S is . If instead unit weights are assumed, the expected value of S is , where is the variance of each observation.

If it is assumed that the residuals belong to a normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a chi-squared () distribution with m − n degrees of freedom. Some illustrative percentile values of are given in the following table. [10]

109.3418.323.2
2524.337.744.3
10099.3124136

These values can be used for a statistical criterion as to the goodness of fit. When unit weights are used, the numbers should be divided by the variance of an observation.

For WLS, the ordinary objective function above is replaced for a weighted average of residuals.

Discussion

In statistics and mathematics, linear least squares is an approach to fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model. The resulting fitted model can be used to summarize the data, to predict unobserved values from the same system, and to understand the mechanisms that may underlie the system.

Mathematically, linear least squares is the problem of approximately solving an overdetermined system of linear equations Ax = b, where b is not an element of the column space of the matrix A. The approximate solution is realized as an exact solution to Ax = b', where b' is the projection of b onto the column space of A. The best approximation is then that which minimizes the sum of squared differences between the data values and their corresponding modeled values. The approach is called linear least squares since the assumed function is linear in the parameters to be estimated. Linear least squares problems are convex and have a closed-form solution that is unique, provided that the number of data points used for fitting equals or exceeds the number of unknown parameters, except in special degenerate situations. In contrast, non-linear least squares problems generally must be solved by an iterative procedure, and the problems can be non-convex with multiple optima for the objective function. If prior distributions are available, then even an underdetermined system can be solved using the Bayesian MMSE estimator.

In statistics, linear least squares problems correspond to a particularly important type of statistical model called linear regression which arises as a particular form of regression analysis. One basic form of such a model is an ordinary least squares model. The present article concentrates on the mathematical aspects of linear least squares problems, with discussion of the formulation and interpretation of statistical regression models and statistical inferences related to these being dealt with in the articles just mentioned. See outline of regression analysis for an outline of the topic.

Properties

If the experimental errors, , are uncorrelated, have a mean of zero and a constant variance, , the Gauss–Markov theorem states that the least-squares estimator, , has the minimum variance of all estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, the distribution function of the errors need not be a normal distribution . However, for some probability distributions, there is no guarantee that the least-squares solution is even possible given the observations; still, in such cases it is the best estimator that is both linear and unbiased.

For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss–Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.

However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a maximum likelihood estimator. [11]

These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.

Limitations

An assumption underlying the treatment given above is that the independent variable, x, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares or more generally errors-in-variables models, or rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure. [12] [13]

In some cases the (weighted) normal equations matrix XTX is ill-conditioned. When fitting polynomials the normal equations matrix is a Vandermonde matrix. Vandermonde matrices become increasingly ill-conditioned as the order of the matrix increases.[ citation needed ] In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate.[ citation needed ] Various regularization techniques can be applied in such cases, the most common of which is called ridge regression. If further information about the parameters is known, for example, a range of possible values of , then various techniques can be used to increase the stability of the solution. For example, see constrained least squares.

Another drawback of the least squares estimator is the fact that the norm of the residuals, is minimized, whereas in some cases one is truly interested in obtaining small error in the parameter , e.g., a small value of .[ citation needed ] However, since the true parameter is necessarily unknown, this quantity cannot be directly minimized. If a prior probability on is known, then a Bayes estimator can be used to minimize the mean squared error, . The least squares method is often applied when no prior is known. When several parameters are being estimated jointly, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James–Stein estimator. This is an example of more general shrinkage estimators that have been applied to regression problems.

Applications

Least square approximation with linear, quadratic and cubic polynomials. Least Square Approximation.png
Least square approximation with linear, quadratic and cubic polynomials.

Uses in data fitting

The primary application of linear least squares is in data fitting. Given a set of m data points consisting of experimentally measured values taken at m values of an independent variable ( may be scalar or vector quantities), and given a model function with it is desired to find the parameters such that the model function "best" fits the data. In linear least squares, linearity is meant to be with respect to parameters so

Here, the functions may be nonlinear with respect to the variable x.

Ideally, the model function fits the data exactly, so

for all This is usually not possible in practice, as there are more data points than there are parameters to be determined. The approach chosen then is to find the minimal possible value of the sum of squares of the residuals

so to minimize the function

After substituting for and then for , this minimization problem becomes the quadratic minimization problem above with

and the best fit can be found by solving the normal equations.

Example

A plot of the data points (in red), the least squares line of best fit (in blue), and the residuals (in green) Linear least squares example2.svg
A plot of the data points (in red), the least squares line of best fit (in blue), and the residuals (in green)

A hypothetical researcher conducts an experiment and obtains four data points: and (shown in red in the diagram on the right). Because of exploratory data analysis or prior knowledge of the subject matter, the researcher suspects that the -values depend on the -values systematically. The -values are assumed to be exact, but the -values contain some uncertainty or "noise", because of the phenomenon being studied, imperfections in the measurements, etc.

Fitting a line

One of the simplest possible relationships between and is a line . The intercept and the slope are initially unknown. The researcher would like to find values of and that cause the line to pass through the four data points. In other words, the researcher would like to solve the system of linear equations

With four equations in two unknowns, this system is overdetermined. There is no exact solution. To consider approximate solutions, one introduces residuals , , , into the equations:

The th residual is the misfit between the th observation and the th prediction :

Among all approximate solutions, the researcher would like to find the one that is "best" in some sense.

In least squares, one focuses on the sum of the squared residuals:

The best solution is defined to be the one that minimizes with respect to and . The minimum can be calculated by setting the partial derivatives of to zero:

These normal equations constitute a system of two linear equations in two unknowns. The solution is and , and the best-fit line is therefore . The residuals are and (see the diagram on the right). The minimum value of the sum of squared residuals is

This calculation can be expressed in matrix notation as follows. The original system of equations is , where

Intuitively,

More rigorously, if is invertible, then the matrix represents orthogonal projection onto the column space of . Therefore, among all vectors of the form , the one closest to is . Setting

it is evident that is a solution.

Fitting a parabola

The result of fitting a quadratic function
y
=
b
1
+
b
2
x
+
b
3
x
2
{\displaystyle y=\beta _{1}+\beta _{2}x+\beta _{3}x^{2}\,}
(in blue) through a set of data points
(
x
i
,
y
i
)
{\displaystyle (x_{i},y_{i})}
(in red). In linear least squares the function need not be linear in the argument
x
,
{\displaystyle x,}
but only in the parameters
b
j
{\displaystyle \beta _{j}}
that are determined to give the best fit. Linear least squares2.svg
The result of fitting a quadratic function (in blue) through a set of data points (in red). In linear least squares the function need not be linear in the argument but only in the parameters that are determined to give the best fit.

Suppose that the hypothetical researcher wishes to fit a parabola of the form . Importantly, this model is still linear in the unknown parameters (now just ), so linear least squares still applies. The system of equations incorporating residuals is

The sum of squared residuals is

There is just one partial derivative to set to 0:

The solution is , and the fit model is .

In matrix notation, the equations without residuals are again , where now

By the same logic as above, the solution is

The figure shows an extension to fitting the three parameter parabola using a design matrix with three columns (one for , , and ), and one row for each of the red data points.

Fitting other curves and surfaces

More generally, one can have regressors , and a linear model

See also

Related Research Articles

<span class="mw-page-title-main">Least squares</span> Approximation method in statistics

The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal for the theorem to apply, nor do they need to be independent and identically distributed.

<span class="mw-page-title-main">Regression analysis</span> Set of statistical processes for estimating the relationships among variables

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

<span class="mw-page-title-main">Total least squares</span>

In applied statistics, total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.

<span class="mw-page-title-main">Nonlinear regression</span> Regression analysis

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the unequal variance of observations (heteroscedasticity) is incorporated into the regression. WLS is also a specialization of generalized least squares, when all the off-diagonal entries of the covariance matrix of the errors, are null.

<span class="mw-page-title-main">Simple linear regression</span> Linear regression model with a single explanatory variable

In statistics, simple linear regression (SLR) is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In econometrics, the seemingly unrelated regressions (SUR) or seemingly unrelated regression equations (SURE) model, proposed by Arnold Zellner in (1962), is a generalization of a linear regression model that consists of several regression equations, each having its own dependent variable and potentially different sets of exogenous explanatory variables. Each equation is a valid linear regression on its own and can be estimated separately, which is why the system is called seemingly unrelated, although some authors suggest that the term seemingly related would be more appropriate, since the error terms are assumed to be correlated across the equations.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model. It is used when there is a non-zero amount of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences, as compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

In mathematics, a moment matrix is a special symmetric square matrix whose rows and columns are indexed by monomials. The entries of the matrix depend on the product of the indexing monomials only

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.

In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

A "hat", placed over a symbol is a mathematical notation with various uses.

The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors, Eicker–Huber–White standard errors, to recognize the contributions of Friedhelm Eicker, Peter J. Huber, and Halbert White.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ().

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

Numerical methods for linear least squares entails the numerical analysis of linear least squares problems.

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

References

  1. Weisstein, Eric W. "Normal Equation". MathWorld. Wolfram. Retrieved December 18, 2023.
  2. Lai, T.L.; Robbins, H.; Wei, C.Z. (1978). "Strong consistency of least squares estimates in multiple regression". PNAS . 75 (7): 3034–3036. Bibcode:1978PNAS...75.3034L. doi: 10.1073/pnas.75.7.3034 . JSTOR   68164. PMC   392707 . PMID   16592540.
  3. del Pino, Guido (1989). "The Unifying Role of Iterative Generalized Least Squares in Statistical Algorithms". Statistical Science. 4 (4): 394–403. doi: 10.1214/ss/1177012408 . JSTOR   2245853.
  4. Carroll, Raymond J. (1982). "Adapting for Heteroscedasticity in Linear Models". The Annals of Statistics. 10 (4): 1224–1233. doi: 10.1214/aos/1176345987 . JSTOR   2240725.
  5. Cohen, Michael; Dalal, Siddhartha R.; Tukey, John W. (1993). "Robust, Smoothly Heterogeneous Variance Regression". Journal of the Royal Statistical Society, Series C. 42 (2): 339–353. JSTOR   2986237.
  6. Nievergelt, Yves (1994). "Total Least Squares: State-of-the-Art Regression in Numerical Analysis". SIAM Review. 36 (2): 258–264. doi:10.1137/1036055. JSTOR   2132463.
  7. Britzger, Daniel (2022). "The Linear Template Fit". Eur. Phys. J. C. 82 (8): 731. arXiv: 2112.01548 . Bibcode:2022EPJC...82..731B. doi:10.1140/epjc/s10052-022-10581-w. S2CID   244896511.
  8. Tofallis, C (2009). "Least Squares Percentage Regression". Journal of Modern Applied Statistical Methods. 7: 526–534. doi:10.2139/ssrn.1406472. hdl: 2299/965 . SSRN   1406472.
  9. Hamilton, W. C. (1964). Statistics in Physical Science . New York: Ronald Press.
  10. Spiegel, Murray R. (1975). Schaum's outline of theory and problems of probability and statistics. New York: McGraw-Hill. ISBN   978-0-585-26739-5.
  11. Margenau, Henry; Murphy, George Moseley (1956). The Mathematics of Physics and Chemistry . Princeton: Van Nostrand.
  12. 1 2 Gans, Peter (1992). Data fitting in the Chemical Sciences. New York: Wiley. ISBN   978-0-471-93412-7.
  13. Deming, W. E. (1943). Statistical adjustment of Data. New York: Wiley.
  14. Acton, F. S. (1959). Analysis of Straight-Line Data. New York: Wiley.
  15. Guest, P. G. (1961). Numerical Methods of Curve Fitting. Cambridge: Cambridge University Press.[ page needed ]

Further reading