# Local regression

Last updated

Local regression or local polynomial regression, [1] also known as moving regression, [2] is a generalization of the moving average and polynomial regression. [3] Its most common methods, initially developed for scatterplot smoothing, are LOESS (locally estimated scatterplot smoothing) and LOWESS (locally weighted scatterplot smoothing), both pronounced . They are two strongly related non-parametric regression methods that combine multiple regression models in a k-nearest-neighbor-based meta-model. In some fields, LOESS is known and commonly referred to as Savitzky–Golay filter [4] [5] (proposed 15 years before LOESS).

## Contents

LOESS and LOWESS thus build on "classical" methods, such as linear and nonlinear least squares regression. They address situations in which the classical procedures do not perform well or cannot be effectively applied without undue labor. LOESS combines much of the simplicity of linear least squares regression with the flexibility of nonlinear regression. It does this by fitting simple models to localized subsets of the data to build up a function that describes the deterministic part of the variation in the data, point by point. In fact, one of the chief attractions of this method is that the data analyst is not required to specify a global function of any form to fit a model to the data, only to fit segments of the data.

The trade-off for these features is increased computation. Because it is so computationally intensive, LOESS would have been practically impossible to use in the era when least squares regression was being developed. Most other modern methods for process modeling are similar to LOESS in this respect. These methods have been consciously designed to use our current computational ability to the fullest possible advantage to achieve goals not easily achieved by traditional approaches.

A smooth curve through a set of data points obtained with this statistical technique is called a loess curve, particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of values of the y-axis scattergram criterion variable. When each smoothed value is given by a weighted linear least squares regression over the span, this is known as a lowess curve; however, some authorities treat lowess and loess as synonyms[ citation needed ].

## Model definition

In 1964, Savitsky and Golay proposed a method equivalent to LOESS, which is commonly referred to as Savitzky–Golay filter. William S. Cleveland rediscovered the method in 1979 and gave it a distinct name. The method was further developed by Cleveland and Susan J. Devlin (1988). LOWESS is also known as locally weighted polynomial regression.

At each point in the range of the data set a low-degree polynomial is fitted to a subset of the data, with explanatory variable values near the point whose response is being estimated. The polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The value of the regression function for the point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point. The LOESS fit is complete after regression function values have been computed for each of the ${\displaystyle n}$ data points. Many of the details of this method, such as the degree of the polynomial model and the weights, are flexible. The range of choices for each part of the method and typical defaults are briefly discussed next.

### Localized subsets of data

The subsets of data used for each weighted least squares fit in LOESS are determined by a nearest neighbors algorithm. A user-specified input to the procedure called the "bandwidth" or "smoothing parameter" determines how much of the data is used to fit each local polynomial. The smoothing parameter, ${\displaystyle \alpha }$, is the fraction of the total number n of data points that are used in each local fit. The subset of data used in each weighted least squares fit thus comprises the ${\displaystyle n\alpha }$ points (rounded to the next largest integer) whose explanatory variables' values are closest to the point at which the response is being estimated. [6]

Since a polynomial of degree k requires at least k + 1 points for a fit, the smoothing parameter ${\displaystyle \alpha }$ must be between ${\displaystyle \left(\lambda +1\right)/n}$ and 1, with ${\displaystyle \lambda }$ denoting the degree of the local polynomial.

${\displaystyle \alpha }$ is called the smoothing parameter because it controls the flexibility of the LOESS regression function. Large values of ${\displaystyle \alpha }$ produce the smoothest functions that wiggle the least in response to fluctuations in the data. The smaller ${\displaystyle \alpha }$ is, the closer the regression function will conform to the data. Using too small a value of the smoothing parameter is not desirable, however, since the regression function will eventually start to capture the random error in the data.

### Degree of local polynomials

The local polynomials fit to each subset of the data are almost always of first or second degree; that is, either locally linear (in the straight line sense) or locally quadratic. Using a zero degree polynomial turns LOESS into a weighted moving average. Higher-degree polynomials would work in theory, but yield models that are not really in the spirit of LOESS. LOESS is based on the ideas that any function can be well approximated in a small neighborhood by a low-order polynomial and that simple models can be fit to data easily. High-degree polynomials would tend to overfit the data in each subset and are numerically unstable, making accurate computations difficult.

### Weight function

As mentioned above, the weight function gives the most weight to the data points nearest the point of estimation and the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near each other in the explanatory variable space are more likely to be related to each other in a simple way than points that are further apart. Following this logic, points that are likely to follow the local model best influence the local model parameter estimates the most. Points that are less likely to actually conform to the local model have less influence on the local model parameter estimates.

The traditional weight function used for LOESS is the tri-cube weight function,

${\displaystyle w(x)=(1-|d|^{3})^{3}}$

where d is the distance of a given data point from the point on the curve being fitted, scaled to lie in the range from 0 to 1. [6]

However, any other weight function that satisfies the properties listed in Cleveland (1979) could also be used. The weight for a specific point in any localized subset of data is obtained by evaluating the weight function at the distance between that point and the point of estimation, after scaling the distance so that the maximum absolute distance over all of the points in the subset of data is exactly one.

Consider the following generalisation of the linear regression model with a metric ${\displaystyle w(x,z)}$ on the target space ${\displaystyle \mathbb {R} ^{m}}$ that depends on two parameters, ${\displaystyle x,z\in \mathbb {R} ^{n}}$. Assume that the linear hypothesis is based on ${\displaystyle n}$ input parameters and that, as customary in these cases, we embed the input space ${\displaystyle \mathbb {R} ^{n}}$ into ${\displaystyle \mathbb {R} ^{n+1}}$ as ${\displaystyle x\mapsto {\hat {x}}:=(1,x)}$, and consider the following loss function

${\displaystyle \operatorname {RSS} _{x}(A)=\sum _{i=1}^{N}(y_{i}-A{\hat {x}}_{i})^{T}w_{i}(x)(y_{i}-A{\hat {x}}_{i}).}$

Here, ${\displaystyle A}$ is an ${\displaystyle (n+1)\times (n+1)}$ real matrix of coefficients, ${\displaystyle w_{i}(x):=w(x_{i},x)}$ and the subscript i enumerates input and output vectors from a training set. Since ${\displaystyle w}$ is a metric, it is a symmetric, positive-definite matrix and, as such, there is another symmetric matrix ${\displaystyle h}$ such that ${\displaystyle w=h^{2}}$. The above loss function can be rearranged into a trace by observing that ${\displaystyle y^{T}wy=(hy)^{T}(hy)=\operatorname {Tr} (hyy^{T}h)=\operatorname {Tr} (wyy^{T})}$. By arranging the vectors ${\displaystyle y_{i}}$ and ${\displaystyle {\hat {x}}_{i}}$ into the columns of a ${\displaystyle m\times N}$ matrix ${\displaystyle Y}$ and an ${\displaystyle (n+1)\times N}$ matrix ${\displaystyle {\hat {X}}}$ respectively, the above loss function can then be written as

${\displaystyle \operatorname {Tr} (W(x)(Y-A{\hat {X}})^{T}(Y-A{\hat {X}}))}$

where ${\displaystyle W}$ is the square diagonal ${\displaystyle N\times N}$ matrix whose entries are the ${\displaystyle w_{i}(x)}$s. Differentiating with respect to ${\displaystyle A}$ and setting the result equal to 0 one finds the extremal matrix equation

${\displaystyle A{\hat {X}}W(x){\hat {X}}^{T}=YW(x){\hat {X}}^{T}.}$

Assuming further that the square matrix ${\displaystyle {\hat {X}}W(x){\hat {X}}^{T}}$ is non-singular, the loss function ${\displaystyle \operatorname {RSS} _{x}(A)}$ attains its minimum at

${\displaystyle A(x)=YW(x){\hat {X}}^{T}({\hat {X}}W(x){\hat {X}}^{T})^{-1}.}$

A typical choice for ${\displaystyle w(x,z)}$ is the Gaussian weight

${\displaystyle w(x,z)=\exp \left(-{\frac {(x-z)^{2}}{2\sigma ^{2}}}\right)}$

As discussed above, the biggest advantage LOESS has over many other methods is the process of fitting a model to the sample data does not begin with the specification of a function. Instead the analyst only has to provide a smoothing parameter value and the degree of the local polynomial. In addition, LOESS is very flexible, making it ideal for modeling complex processes for which no theoretical models exist. These two advantages, combined with the simplicity of the method, make LOESS one of the most attractive of the modern regression methods for applications that fit the general framework of least squares regression but which have a complex deterministic structure.

Although it is less obvious than for some of the other methods related to linear least squares regression, LOESS also accrues most of the benefits typically shared by those procedures. The most important of those is the theory for computing uncertainties for prediction and calibration. Many other tests and procedures used for validation of least squares models can also be extended to LOESS models [ citation needed ].

LOESS makes less efficient use of data than other least squares methods. It requires fairly large, densely sampled data sets in order to produce good models. This is because LOESS relies on the local data structure when performing the local fitting. Thus, LOESS provides less complex data analysis in exchange for greater experimental costs. [6]

Another disadvantage of LOESS is the fact that it does not produce a regression function that is easily represented by a mathematical formula. This can make it difficult to transfer the results of an analysis to other people. In order to transfer the regression function to another person, they would need the data set and software for LOESS calculations. In nonlinear regression, on the other hand, it is only necessary to write down a functional form in order to provide estimates of the unknown parameters and the estimated uncertainty. Depending on the application, this could be either a major or a minor drawback to using LOESS. In particular, the simple form of LOESS can not be used for mechanistic modelling where fitted parameters specify particular physical properties of a system.

Finally, as discussed above, LOESS is a computationally intensive method (with the exception of evenly spaced data, where the regression can then be phrased as a non-causal finite impulse response filter). LOESS is also prone to the effects of outliers in the data set, like other least squares methods. There is an iterative, robust version of LOESS [Cleveland (1979)] that can be used to reduce LOESS' sensitivity to outliers, but too many extreme outliers can still overcome even the robust method.

## Related Research Articles

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems by minimizing the sum of the squares of the residuals made in the results of each individual equation.

In statistics, Deming regression, named after W. Edwards Deming, is an errors-in-variables model which tries to find the line of best fit for a two-dimensional dataset. It differs from the simple linear regression in that it accounts for errors in observations on both the x- and the y- axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure.

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the given dataset and those predicted by the linear function of the independent variable.

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

Weighted least squares (WLS), also known as weighted linear regression, is a generalization of ordinary least squares and linear regression in which knowledge of the variance of observations is incorporated into the regression. WLS is also a specialization of generalized least squares.

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

In statistics, the projection matrix, sometimes also called the influence matrix or hat matrix, maps the vector of response values to the vector of fitted values. It describes the influence each response value has on each fitted value. The diagonal elements of the projection matrix are the leverages, which describe the influence each response value has on the fitted value for that same observation.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box-Cox transformed regressors.

A kernel smoother is a statistical technique to estimate a real valued function as the weighted average of neighboring observed data. The weight is defined by the kernel, such that closer points are given higher weights. The estimated function is smooth, and the level of smoothness is set by a single parameter.

In statistics, multivariate adaptive regression splines (MARS) is a form of regression analysis introduced by Jerome H. Friedman in 1991. It is a non-parametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions between variables.

Smoothing splines are function estimates, , obtained from a set of noisy observations of the target , in order to balance a measure of goodness of fit of to with a derivative based measure of the smoothness of . They provide a means for smoothing noisy data. The most familiar example is the cubic smoothing spline, but there are many other possibilities, including for the case where is a vector quantity.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

In statistics, projection pursuit regression (PPR) is a statistical model developed by Jerome H. Friedman and Werner Stuetzle which is an extension of additive models. This model adapts the additive models in that it first projects the data matrix of explanatory variables in the optimal direction before applying smoothing functions to these explanatory variables.

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.

In statistics, several scatterplot smoothing methods are available to fit a function through the points of a scatterplot to best represent the relationship between the variables.

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

## References

### Citations

1. Fox & Weisberg 2018, Appendix.
2. Harrell 2015, p. 29.
3. "Savitzky–Golay filtering – MATLAB sgolayfilt". Mathworks.com.
4. "scipy.signal.savgol_filter — SciPy v0.16.1 Reference Guide". Docs.scipy.org.
5. NIST, "LOESS (aka LOWESS)", section 4.1.4.4, NIST/SEMATECH e-Handbook of Statistical Methods, (accessed 14 April 2017)

### Implementations

This article incorporates  public domain material from the National Institute of Standards and Technology website https://www.nist.gov .