Lack-of-fit sum of squares

Last updated August 24, 2022

In statistics, a sum of squares due to lack of fit, or more tersely a lack-of-fit sum of squares, is one of the components of a partition of the sum of squares of residuals in an analysis of variance. It is used in the numerator of an F-test of the null hypothesis that a proposed model fits well. The other component is the pure-error sum of squares.

The pure-error sum of squares is the sum of squared deviations of each value of the dependent variable from the average value over all observations sharing its independent variable value(s). These are errors that could never be avoided by any predictive equation that assigned a predicted value for the dependent variable as a function of the value(s) of the independent variable(s). The remainder of the residual sum of squares is attributed to lack of fit of the model since it would be mathematically possible to eliminate these errors entirely.

Principle

In order for the lack-of-fit sum of squares to differ from the sum of squares of residuals, there must be more than one value of the response variable for at least one of the values of the set of predictor variables. For example, consider fitting a line

y=\alpha x+\beta \,

by the method of least squares. One takes as estimates of α and β the values that minimize the sum of squares of residuals, i.e., the sum of squares of the differences between the observed y-value and the fitted y-value. To have a lack-of-fit sum of squares that differs from the residual sum of squares, one must observe more than one y-value for each of one or more of the x-values. One then partitions the "sum of squares due to error", i.e., the sum of squares of residuals, into two components:

sum of squares due to error = (sum of squares due to "pure" error) + (sum of squares due to lack of fit).

The sum of squares due to "pure" error is the sum of squares of the differences between each observed y-value and the average of all y-values corresponding to the same x-value.

The sum of squares due to lack of fit is the weighted sum of squares of differences between each average of y-values corresponding to the same x-value and the corresponding fitted y-value, the weight in each case being simply the number of observed y-values for that x-value.^[1]^[2] Because it is a property of least squares regression that the vector whose components are "pure errors" and the vector of lack-of-fit components are orthogonal to each other, the following equality holds:

{\begin{aligned}&\sum ({\text{observed value}}-{\text{fitted value}})^{2}&&{\text{(error)}}\\&\qquad =\sum ({\text{observed value}}-{\text{local average}})^{2}&&{\text{(pure error)}}\\&\qquad \qquad {}+\sum {\text{weight}}\times ({\text{local average}}-{\text{fitted value}})^{2}&&{\text{(lack of fit)}}\end{aligned}}

Hence the residual sum of squares has been completely decomposed into two components.
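The partition described above can be checked numerically. The following is a minimal sketch in plain Python; the data set and the helper names `fit_line` and `partition_error` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def partition_error(xs, ys):
    """Return the (error, pure error, lack of fit) sums of squares."""
    slope, intercept = fit_line(xs, ys)

    # Group the observed y-values by their x-value.
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)

    error = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    pure_error = 0.0
    lack_of_fit = 0.0
    for x, group in groups.items():
        local_average = sum(group) / len(group)
        fitted = slope * x + intercept
        pure_error += sum((y - local_average) ** 2 for y in group)
        # The weight is the number of observed y-values at this x-value.
        lack_of_fit += len(group) * (local_average - fitted) ** 2
    return error, pure_error, lack_of_fit


# Hypothetical data: two y-values at each of four x-values.
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [1, 3, 3, 5, 4, 6, 8, 10]
error, pure_error, lack_of_fit = partition_error(xs, ys)
print(error, pure_error, lack_of_fit)  # approximately 11.6, 8.0, 3.6
```

For this data the fitted line has slope 2.2 and intercept −0.5, and the error sum of squares (about 11.6) equals the pure-error part (8.0) plus the lack-of-fit part (about 3.6), up to floating-point rounding.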

Mathematical details

Consider fitting a line with one predictor variable. Define i as an index of each of the n distinct x values, j as an index of the response variable observations for a given x value, and n_i as the number of y values associated with the i-th x value. The value of each response variable observation can be represented by

Y_{ij}=\alpha x_{i}+\beta +\varepsilon _{ij},\qquad i=1,\dots ,n,\quad j=1,\dots ,n_{i}.

Let

{\widehat {\alpha }},{\widehat {\beta }}\,

be the least squares estimates of the unobservable parameters α and β based on the observed values of x_i and Y_ij.

Let

{\widehat {Y}}_{i}={\widehat {\alpha }}x_{i}+{\widehat {\beta }}\,

be the fitted values of the response variable. Then

{\widehat {\varepsilon }}_{ij}=Y_{ij}-{\widehat {Y}}_{i}\,

are the residuals, which are observable estimates of the unobservable values of the error term ε_ij. Because of the nature of the method of least squares, the whole vector of residuals, with

N=\sum _{i=1}^{n}n_{i}

scalar components, necessarily satisfies the two constraints

\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}=0\,

\sum _{i=1}^{n}\left(x_{i}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}\right)=0.\,

It is thus constrained to lie in an (N − 2)-dimensional subspace of R^N, i.e. there are N − 2 "degrees of freedom for error".
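The two constraints can be illustrated numerically. The sketch below, in plain Python, fits a line to a small hypothetical data set and evaluates both sums over all N observations, which is equivalent to the grouped double sums above. The helper name `line_fit_residuals` is an assumption made for this example.

```python
def line_fit_residuals(xs, ys):
    """Residuals from the least-squares fit of y = slope * x + intercept."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical data: N = 8 observations at n = 4 distinct x-values.
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [1, 3, 3, 5, 4, 6, 8, 10]
residuals = line_fit_residuals(xs, ys)

# First constraint: the residuals sum to zero.
sum_residuals = sum(residuals)
# Second constraint: the residuals are orthogonal to the x-values.
sum_x_residuals = sum(x * e for x, e in zip(xs, residuals))
print(sum_residuals, sum_x_residuals)  # both zero up to rounding error
```

With N = 8 observations and two linear constraints, the residual vector in this example has 8 − 2 = 6 degrees of freedom.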

Now let

{\overline {Y}}_{i\bullet }={\frac {1}{n_{i}}}\sum _{j=1}^{n_{i}}Y_{ij}

be the average of all Y-values associated with the i-th x-value.

We partition the sum of squares due to error into two components:

{\begin{aligned}&\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}^{\,2}=\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\widehat {Y}}_{i}\right)^{2}\\&=\underbrace {\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\overline {Y}}_{i\bullet }\right)^{2}} _{\text{(sum of squares due to pure error)}}+\underbrace {\sum _{i=1}^{n}n_{i}\left({\overline {Y}}_{i\bullet }-{\widehat {Y}}_{i}\right)^{2}.} _{\text{(sum of squares due to lack of fit)}}\end{aligned}}
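One way to see why no cross term appears in this partition is to add and subtract the local average inside each squared residual. The following short derivation is a sketch using only the symbols defined above; it relies on the fact that the fitted value depends on i but not on j.

```latex
\begin{aligned}
\sum_{i=1}^{n}\sum_{j=1}^{n_i}\left(Y_{ij}-\widehat{Y}_i\right)^2
&=\sum_{i=1}^{n}\sum_{j=1}^{n_i}\left[\left(Y_{ij}-\overline{Y}_{i\bullet}\right)+\left(\overline{Y}_{i\bullet}-\widehat{Y}_i\right)\right]^2\\
&=\sum_{i=1}^{n}\sum_{j=1}^{n_i}\left(Y_{ij}-\overline{Y}_{i\bullet}\right)^2
+\sum_{i=1}^{n}n_i\left(\overline{Y}_{i\bullet}-\widehat{Y}_i\right)^2
+2\sum_{i=1}^{n}\left(\overline{Y}_{i\bullet}-\widehat{Y}_i\right)\sum_{j=1}^{n_i}\left(Y_{ij}-\overline{Y}_{i\bullet}\right),\\
\sum_{j=1}^{n_i}\left(Y_{ij}-\overline{Y}_{i\bullet}\right)
&=\sum_{j=1}^{n_i}Y_{ij}-n_i\overline{Y}_{i\bullet}=0.
\end{aligned}
```

Since the inner sum in the cross term is zero for every i, only the pure-error and lack-of-fit sums remain.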

Probability distributions

Sums of squares

Suppose the error terms ε_ij are independent and normally distributed with expected value 0 and variance σ². We treat x_i as constant rather than random. Then the response variables Y_ij are random only because the errors ε_ij are random.

It can be shown to follow that if the straight-line model is correct, then the sum of squares due to error divided by the error variance,

{\frac {1}{\sigma ^{2}}}\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}^{\,2}

has a chi-squared distribution with N − 2 degrees of freedom.

Moreover, given the total number of observations N, the number of levels of the independent variable n, and the number of parameters in the model p:

The sum of squares due to pure error, divided by the error variance σ², has a chi-squared distribution with N − n degrees of freedom;
The sum of squares due to lack of fit, divided by the error variance σ², has a chi-squared distribution with n − p degrees of freedom (here p = 2 as there are two parameters in the straight-line model);
The two sums of squares are probabilistically independent.
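The degrees of freedom listed above can be made plausible by simulation, because the expected value of a chi-squared variable equals its degrees of freedom. The following is a rough Monte Carlo sketch in plain Python, not a proof. The design, the true slope and intercept, σ, the seed and the function names are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def pure_error_and_lack_of_fit(xs, ys):
    """Return the (pure error, lack of fit) sums of squares for a line fit."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar

    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)

    pure_error = 0.0
    lack_of_fit = 0.0
    for x, group in groups.items():
        local_average = sum(group) / len(group)
        pure_error += sum((y - local_average) ** 2 for y in group)
        lack_of_fit += len(group) * (local_average - (slope * x + intercept)) ** 2
    return pure_error, lack_of_fit


def simulate_means(reps=2000, sigma=1.5, seed=0):
    """Average the scaled sums of squares over data drawn from a true line."""
    rng = random.Random(seed)
    xs = [1, 1, 2, 2, 3, 3, 4, 4]           # N = 8, n = 4 (hypothetical design)
    true_slope, true_intercept = 2.0, -1.0  # hypothetical true parameters
    total_pe = 0.0
    total_lof = 0.0
    for _ in range(reps):
        ys = [true_slope * x + true_intercept + rng.gauss(0.0, sigma) for x in xs]
        pe, lof = pure_error_and_lack_of_fit(xs, ys)
        total_pe += pe / sigma ** 2
        total_lof += lof / sigma ** 2
    return total_pe / reps, total_lof / reps


mean_pure_error, mean_lack_of_fit = simulate_means()
# Expected to be near N - n = 4 and n - p = 2 respectively.
print(mean_pure_error, mean_lack_of_fit)
```

The simulated averages should fall close to N − n = 4 and n − p = 2, with small deviations due to sampling noise.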

The test statistic

It then follows that the statistic

{\begin{aligned}F&={\frac {{\text{lack-of-fit sum of squares}}/{\text{degrees of freedom}}}{{\text{pure-error sum of squares}}/{\text{degrees of freedom}}}}\\[8pt]&={\frac {\left.\sum _{i=1}^{n}n_{i}\left({\overline {Y}}_{i\bullet }-{\widehat {Y}}_{i}\right)^{2}\right/(n-p)}{\left.\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\overline {Y}}_{i\bullet }\right)^{2}\right/(N-n)}}\end{aligned}}

has an F-distribution with the corresponding number of degrees of freedom in the numerator and the denominator, provided that the model is correct. If the model is wrong, then the probability distribution of the denominator is still as stated above, and the numerator and denominator are still independent. But the numerator then has a noncentral chi-squared distribution, and consequently the quotient as a whole has a non-central F-distribution.

One uses this F-statistic to test the null hypothesis that the linear model is correct. Since the non-central F-distribution is stochastically larger than the (central) F-distribution, one rejects the null hypothesis if the F-statistic is larger than the critical F value. The critical value is the quantile of the F-distribution, i.e. the value of its inverse cumulative distribution function, at the desired confidence level, with degrees of freedom d₁ = (n − p) and d₂ = (N − n).
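As an illustration of the test, the sketch below computes the F-statistic from given sums of squares and compares it with a critical value. To stay free of external libraries it uses a closed form that holds only when the numerator has exactly two degrees of freedom, where P(F > x) = (1 + 2x/d₂)^(−d₂/2). The input numbers (lack-of-fit sum of squares 3.6, pure-error sum of squares 8.0, N = 8, n = 4) and all function names are hypothetical.

```python
def lack_of_fit_f(lack_of_fit_ss, pure_error_ss, n_total, n_levels, n_params=2):
    """Return (F, d1, d2) for the lack-of-fit test."""
    d1 = n_levels - n_params   # numerator degrees of freedom, n - p
    d2 = n_total - n_levels    # denominator degrees of freedom, N - n
    f_stat = (lack_of_fit_ss / d1) / (pure_error_ss / d2)
    return f_stat, d1, d2


def f_critical_two_numerator_df(confidence, d2):
    """Quantile of F(2, d2); this closed form is valid only for d1 = 2."""
    return (d2 / 2) * ((1 - confidence) ** (-2 / d2) - 1)


def f_p_value_two_numerator_df(f_stat, d2):
    """Upper-tail probability of F(2, d2); valid only for d1 = 2."""
    return (1 + 2 * f_stat / d2) ** (-d2 / 2)


# Hypothetical inputs: N = 8 observations at n = 4 levels, p = 2 parameters.
f_stat, d1, d2 = lack_of_fit_f(3.6, 8.0, n_total=8, n_levels=4)
assert d1 == 2  # the closed forms above require two numerator degrees of freedom

critical = f_critical_two_numerator_df(0.95, d2)
p_value = f_p_value_two_numerator_df(f_stat, d2)
reject = f_stat > critical
print(f_stat, critical, p_value, reject)
```

Here F is about 0.9, well below the 95% critical value of roughly 6.94, so the straight-line model would not be rejected for these numbers. For other degrees of freedom one would normally take the quantile from a statistics library, for example `scipy.stats.f.ppf`.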

The assumptions of normal distribution of errors and independence can be shown to entail that this lack-of-fit test is the likelihood-ratio test of this null hypothesis.

Notes

[1] Brook, Richard J.; Arnold, Gregory C. (1985). Applied Regression Analysis and Experimental Design. CRC Press. pp. 48–49. ISBN 0824772520.
[2] Neter, John; Kutner, Michael H.; Nachtsheim, Christopher J.; Wasserman, William (1996). Applied Linear Statistical Models (Fourth ed.). Chicago: Irwin. pp. 121–122. ISBN 0256117365.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
