# Partition of sums of squares

Last updated

The partition of sums of squares is a concept that permeates much of inferential statistics and descriptive statistics. More properly, it is the partitioning of sums of squared deviations or errors. Mathematically, the sum of squared deviations is an unscaled, or unadjusted measure of dispersion (also called variability). When scaled for the number of degrees of freedom, it estimates the variance, or spread of the observations about their mean value. Partitioning of the sum of squared deviations into various components allows the overall variability in a dataset to be ascribed to different types or sources of variability, with the relative importance of each being quantified by the size of each component of the overall sum of squares.

## Background

The distance from any point in a collection of data, to the mean of the data, is the deviation. This can be written as ${\displaystyle y_{i}-{\overline {y}}}$, where ${\displaystyle y_{i}}$ is the ith data point, and ${\displaystyle {\overline {y}}}$ is the estimate of the mean. If all such deviations are squared, then summed, as in ${\displaystyle \sum _{i=1}^{n}\left(y_{i}-{\overline {y}}\,\right)^{2}}$, this gives the "sum of squares" for these data.

When more data are added to the collection the sum of squares will increase, except in unlikely cases such as the new data being equal to the mean. So usually, the sum of squares will grow with the size of the data collection. That is a manifestation of the fact that it is unscaled.

In many cases, the number of degrees of freedom is simply the number of data in the collection, minus one. We write this as n  1, where n is the number of data.

Scaling (also known as normalizing) means adjusting the sum of squares so that it does not grow as the size of the data collection grows. This is important when we want to compare samples of different sizes, such as a sample of 100 people compared to a sample of 20 people. If the sum of squares were not normalized, its value would always be larger for the sample of 100 people than for the sample of 20 people. To scale the sum of squares, we divide it by the degrees of freedom, i.e., calculate the sum of squares per degree of freedom, or variance. Standard deviation, in turn, is the square root of the variance.

The above describes how the sum of squares is used in descriptive statistics; see the article on total sum of squares for an application of this broad principle to inferential statistics.

## Partitioning the sum of squares in linear regression

Theorem. Given a linear regression model ${\displaystyle y_{i}=\beta _{0}+\beta _{1}x_{i1}+\cdots +\beta _{p}x_{ip}+\varepsilon _{i}}$including a constant${\displaystyle \beta _{0}}$, based on a sample ${\displaystyle (y_{i},x_{i1},\ldots ,x_{ip}),\,i=1,\ldots ,n}$ containing n observations, the total sum of squares ${\displaystyle \mathrm {TSS} =\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}$ can be partitioned as follows into the explained sum of squares (ESS) and the residual sum of squares (RSS):

${\displaystyle \mathrm {TSS} =\mathrm {ESS} +\mathrm {RSS} ,}$

where this equation is equivalent to each of the following forms:

{\displaystyle {\begin{aligned}\left\|y-{\bar {y}}\mathbf {1} \right\|^{2}&=\left\|{\hat {y}}-{\bar {y}}\mathbf {1} \right\|^{2}+\left\|{\hat {\varepsilon }}\right\|^{2},\quad \mathbf {1} =(1,1,\ldots ,1)^{T},\\\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}&=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {y}})^{2}+\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2},\\\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}&=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {y}})^{2}+\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2},\\\end{aligned}}}
where ${\displaystyle {\hat {y}}_{i}}$ is the value estimated by the regression line having ${\displaystyle {\hat {b}}_{0}}$, ${\displaystyle {\hat {b}}_{1}}$, ..., ${\displaystyle {\hat {b}}_{p}}$ as the estimated coefficients. [1]

### Proof

{\displaystyle {\begin{aligned}\sum _{i=1}^{n}(y_{i}-{\overline {y}})^{2}&=\sum _{i=1}^{n}(y_{i}-{\overline {y}}+{\hat {y}}_{i}-{\hat {y}}_{i})^{2}=\sum _{i=1}^{n}(({\hat {y}}_{i}-{\bar {y}})+\underbrace {(y_{i}-{\hat {y}}_{i})} _{{\hat {\varepsilon }}_{i}})^{2}\\&=\sum _{i=1}^{n}(({\hat {y}}_{i}-{\bar {y}})^{2}+2{\hat {\varepsilon }}_{i}({\hat {y}}_{i}-{\bar {y}})+{\hat {\varepsilon }}_{i}^{2})\\&=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {y}})^{2}+\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}+2\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}({\hat {y}}_{i}-{\bar {y}})\\&=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {y}})^{2}+\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}+2\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}({\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{i1}+\cdots +{\hat {\beta }}_{p}x_{ip}-{\overline {y}})\\&=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {y}})^{2}+\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}+2({\hat {\beta }}_{0}-{\overline {y}})\underbrace {\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}} _{0}+2{\hat {\beta }}_{1}\underbrace {\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}x_{i1}} _{0}+\cdots +2{\hat {\beta }}_{p}\underbrace {\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}x_{ip}} _{0}\\&=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {y}})^{2}+\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}=\mathrm {ESS} +\mathrm {RSS} \\\end{aligned}}}

The requirement that the model include a constant or equivalently that the design matrix contain a column of ones ensures that ${\displaystyle \sum _{i=1}^{n}{\hat {\varepsilon }}_{i}=0}$, i.e. ${\displaystyle {\hat {\varepsilon }}^{T}\mathbf {1} =0}$.

The proof can also be expressed in vector form, as follows:

{\displaystyle {\begin{aligned}SS_{\text{total}}=\Vert \mathbf {y} -{\bar {y}}\mathbf {1} \Vert ^{2}&=\Vert \mathbf {y} -{\bar {y}}\mathbf {1} +\mathbf {\hat {y}} -\mathbf {\hat {y}} \Vert ^{2},\\&=\Vert \left(\mathbf {\hat {y}} -{\bar {y}}\mathbf {1} \right)+\left(\mathbf {y} -\mathbf {\hat {y}} \right)\Vert ^{2},\\&=\Vert {\mathbf {\hat {y}} -{\bar {y}}\mathbf {1} }\Vert ^{2}+\Vert {\hat {\varepsilon }}\Vert ^{2}+2{\hat {\varepsilon }}^{T}\left(\mathbf {\hat {y}} -{\bar {y}}\mathbf {1} \right),\\&=SS_{\text{regression}}+SS_{\text{error}}+2{\hat {\varepsilon }}^{T}\left(X{\hat {\beta }}-{\bar {y}}\mathbf {1} \right),\\&=SS_{\text{regression}}+SS_{\text{error}}+2\left({\hat {\varepsilon }}^{T}X\right){\hat {\beta }}-2{\bar {y}}\underbrace {{\hat {\varepsilon }}^{T}\mathbf {1} } _{0},\\&=SS_{\text{regression}}+SS_{\text{error}}.\end{aligned}}}

The elimination of terms in the last line, used the fact that

${\displaystyle {\hat {\varepsilon }}^{T}X=\left(\mathbf {y} -\mathbf {\hat {y}} \right)^{T}X=\mathbf {y} ^{T}(I-X(X^{T}X)^{-1}X^{T})X={\mathbf {y} }^{T}(X-X)={\mathbf {0} }.}$

### Further partitioning

Note that the residual sum of squares can be further partitioned as the lack-of-fit sum of squares plus the sum of squares due to pure error.

## Related Research Articles

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal, nor do they need to be independent and identically distributed. The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator, ridge regression, or simply any degenerate estimator.

In continuum mechanics, the infinitesimal strain theory is a mathematical approach to the description of the deformation of a solid body in which the displacements of the material particles are assumed to be much smaller than any relevant dimension of the body; so that its geometry and the constitutive properties of the material at each point of space can be assumed to be unchanged by the deformation.

In statistics, Deming regression, named after W. Edwards Deming, is an errors-in-variables model which tries to find the line of best fit for a two-dimensional dataset. It differs from the simple linear regression in that it accounts for errors in observations on both the x- and the y- axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

In statistics, a confidence region is a multi-dimensional generalization of a confidence interval. It is a set of points in an n-dimensional space, often represented as an ellipsoid around a point which is an estimated solution to a problem, although other shapes can occur.

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the given dataset and those predicted by the linear function of the independent variable.

In statistics, the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared estimate of errors (SSE), is the sum of the squares of residuals. It is a measure of the discrepancy between the data and an estimation model, such as a linear regression. A small RSS indicates a tight fit of the model to the data. It is used as an optimality criterion in parameter selection and model selection.

In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in a regression model. In these cases, ordinary least squares and weighted least squares can be statistically inefficient, or even give misleading inferences. GLS was first described by Alexander Aitken in 1936.

In linear regression, mean response and predicted response are values of the dependent variable calculated from the regression parameters and a given value of the independent variable. The values of these two responses are the same, but their calculated variances are different.

In statistics, a sum of squares due to lack of fit, or more tersely a lack-of-fit sum of squares, is one of the components of a partition of the sum of squares of residuals in an analysis of variance, used in the numerator in an F-test of the null hypothesis that says that a proposed model fits well. The other component is the pure-error sum of squares.

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in space, where is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted i.e., to be influential points. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the load of the main article with mathematics and improving its accessibility, while at the same time retaining the completeness of exposition.

In statistics, projection pursuit regression (PPR) is a statistical model developed by Jerome H. Friedman and Werner Stuetzle which is an extension of additive models. This model adapts the additive models in that it first projects the data matrix of explanatory variables in the optimal direction before applying smoothing functions to these explanatory variables.

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.

In statistics and in machine learning, a linear predictor function is a linear function of a set of coefficients and explanatory variables, whose value is used to predict the outcome of a dependent variable. This sort of function usually comes in linear regression, where the coefficients are called regression coefficients. However, they also occur in various types of linear classifiers, as well as in various other models, such as principal component analysis and factor analysis. In many of these models, the coefficients are referred to as "weights".

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

In statistics, particularly regression analysis, the Working–Hotelling procedure, named after Holbrook Working and Harold Hotelling, is a method of simultaneous estimation in linear regression models. One of the first developments in simultaneous inference, it was devised by Working and Hotelling for the simple linear regression model in 1929. It provides a confidence region for multiple mean responses, that is, it gives the upper and lower bounds of more than one value of a dependent variable at several levels of the independent variables at a certain confidence level. The resulting confidence bands are known as the Working–Hotelling–Scheffé confidence bands.

## References

1. "Sum of Squares - Definition, Formulas, Regression Analysis". Corporate Finance Institute. Retrieved 2020-10-16.