Smoothing splines are function estimates, $\hat f(x)$, obtained from a set of noisy observations $y_i$ of the target $f(x_i)$, in order to balance a measure of goodness of fit of $\hat f(x_i)$ to $y_i$ with a derivative-based measure of the smoothness of $\hat f(x)$. They provide a means for smoothing noisy $x_i, y_i$ data. The most familiar example is the cubic smoothing spline, but there are many other possibilities, including for the case where $x$ is a vector quantity.
Let $\{x_i, Y_i : i = 1, \dots, n\}$ be a set of observations, modeled by the relation $Y_i = f(x_i) + \epsilon_i$, where the $\epsilon_i$ are independent, zero-mean random variables (usually assumed to have constant variance). The cubic smoothing spline estimate $\hat f$ of the function $f$ is defined to be the minimizer (over the class of twice differentiable functions) of [1] [2]

$$\sum_{i=1}^n \{Y_i - \hat f(x_i)\}^2 + \lambda \int \hat f''(x)^2 \, dx.$$
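In practice this penalized criterion can be minimized by library routines. The sketch below uses SciPy's make_smoothing_spline (available in recent SciPy versions); the data and the choice of smoothing parameter are purely illustrative.

```python
# Minimal sketch: fit a cubic smoothing spline by minimizing the penalized
# least-squares criterion above.  Data and the smoothing parameter `lam`
# are illustrative.
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)   # noisy observations

spline = make_smoothing_spline(x, y, lam=1e-4)   # larger lam => smoother estimate
y_hat = spline(x)                                # fitted values at the data points
```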
Remarks:

- $\lambda \ge 0$ is a smoothing parameter, controlling the trade-off between fidelity to the data and roughness of the function estimate.
- As $\lambda \to 0$ (no smoothing), the smoothing spline converges to the interpolating spline.
- As $\lambda \to \infty$ (infinite smoothing), the roughness penalty becomes paramount and the estimate converges to a linear least squares fit.
It is useful to think of fitting a smoothing spline in two steps:

1. First, derive the values $\hat f(x_i)$, $i = 1, \dots, n$.
2. From these values, derive $\hat f(x)$ for all $x$.
Now, treat the second step first.
Given the vector $\hat m = (\hat f(x_1), \dots, \hat f(x_n))^{\mathrm T}$ of fitted values, the sum-of-squares part of the spline criterion is fixed. It remains only to minimize $\int \hat f''(x)^2 \, dx$, and the minimizer is a natural cubic spline that interpolates the points $(x_i, \hat f(x_i))$. This interpolating spline is a linear operator, and can be written in the form

$$\hat f(x) = \sum_{i=1}^n \hat f(x_i)\, f_i(x),$$
where $f_i(x)$ are a set of spline basis functions. As a result, the roughness penalty has the form

$$\int \hat f''(x)^2 \, dx = \hat m^{\mathrm T} A \hat m,$$
where the elements of $A$ are $\int f_i''(x) f_j''(x) \, dx$. The basis functions, and hence the matrix $A$, depend on the configuration of the predictor variables $x_i$, but not on the responses $Y_i$ or $\hat m$.
$A$ is an $n \times n$ matrix given by $A = \Delta^{\mathrm T} W^{-1} \Delta$.
$\Delta$ is an $(n-2) \times n$ matrix of second differences with elements:

$$\Delta_{ii} = \frac{1}{h_i}, \qquad \Delta_{i,i+1} = -\frac{1}{h_i} - \frac{1}{h_{i+1}}, \qquad \Delta_{i,i+2} = \frac{1}{h_{i+1}},$$
$W$ is an $(n-2) \times (n-2)$ symmetric tri-diagonal matrix with elements:

$$W_{i-1,i} = W_{i,i-1} = \frac{h_i}{6}, \qquad W_{ii} = \frac{h_i + h_{i+1}}{3},$$

and $h_i = x_{i+1} - x_i$, the distances between successive knots (or $x$ values).
Now back to the first step. The penalized sum-of-squares can be written as

$$\|Y - \hat m\|^2 + \lambda \hat m^{\mathrm T} A \hat m,$$
where $Y = (Y_1, \dots, Y_n)^{\mathrm T}$.
Minimizing over $\hat m$ by setting the derivative with respect to $\hat m$ to zero results in [6]

$$-2(Y - \hat m) + 2\lambda A \hat m = 0,$$

and hence

$$\hat m = (I + \lambda A)^{-1} Y.$$
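The following NumPy sketch assembles $\Delta$, $W$ and $A$ as defined above and computes the fitted values $\hat m = (I + \lambda A)^{-1} Y$ directly. The function name and example data are illustrative.

```python
# Sketch of the fitted-value computation: build the second-difference matrix
# Delta and the tri-diagonal matrix W from the knot spacings h_i, form
# A = Delta^T W^{-1} Delta, and solve m_hat = (I + lam * A)^{-1} Y.
import numpy as np

def smoothing_spline_fitted_values(x, y, lam):
    """Return the smoothing-spline fitted values at the data points x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    h = np.diff(x)                      # h_i = x_{i+1} - x_i, length n-1

    # Delta: (n-2) x n matrix of second differences
    Delta = np.zeros((n - 2, n))
    for i in range(n - 2):
        Delta[i, i]     = 1.0 / h[i]
        Delta[i, i + 1] = -1.0 / h[i] - 1.0 / h[i + 1]
        Delta[i, i + 2] = 1.0 / h[i + 1]

    # W: (n-2) x (n-2) symmetric tri-diagonal matrix
    W = np.zeros((n - 2, n - 2))
    for i in range(n - 2):
        W[i, i] = (h[i] + h[i + 1]) / 3.0
        if i > 0:
            W[i, i - 1] = W[i - 1, i] = h[i] / 6.0

    A = Delta.T @ np.linalg.solve(W, Delta)          # A = Delta^T W^{-1} Delta
    return np.linalg.solve(np.eye(n) + lam * A, y)   # (I + lam A)^{-1} Y

# Illustrative usage on noisy data from a smooth target:
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
m_hat = smoothing_spline_fitted_values(x, y, lam=1e-4)
```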
De Boor's approach exploits the same idea of finding a balance between having a smooth curve and being close to the given data: [7]

$$p \sum_{i=1}^n \left( \frac{Y_i - \hat f(x_i)}{\delta_i} \right)^2 + (1 - p) \int \left( \hat f^{(m)}(x) \right)^2 dx,$$
where $p$ is a parameter called the smooth factor and belongs to the interval $[0, 1]$, and $\delta_i$ are the quantities controlling the extent of smoothing (they represent the weight $\delta_i^{-2}$ of each point $Y_i$). In practice, since cubic splines are mostly used, $m$ is usually $2$. The solution for $m = 2$ was proposed by Christian Reinsch in 1967. [8] For $m = 2$, when $p$ approaches $1$, $\hat f$ converges to the "natural" spline interpolant to the given data. [7] As $p$ approaches $0$, $\hat f$ converges to a straight line (the smoothest curve). Since finding a suitable value of $p$ is a task of trial and error, a redundant constant $S$ was introduced for convenience. [8] $S$ is used to numerically determine the value of $p$ so that the function $\hat f$ meets the following condition:

$$\sum_{i=1}^n \left( \frac{Y_i - \hat f(x_i)}{\delta_i} \right)^2 \le S.$$
The algorithm described by de Boor starts with $p = 0$ and increases $p$ until the condition is met. [7] If $\delta_i$ is an estimate of the standard deviation of $Y_i$, the constant $S$ is recommended to be chosen in the interval $[\,n - \sqrt{2n},\; n + \sqrt{2n}\,]$. Having $S = 0$ means the solution is the "natural" spline interpolant. [8] Increasing $S$ means we obtain a smoother curve by getting farther from the given data.
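This style of control through a target $S$ on the weighted residual sum of squares is what FITPACK-based routines expose. The sketch below uses SciPy's UnivariateSpline, whose s argument plays a closely related role; the data, weights and choice of $S$ are illustrative.

```python
# Sketch: smoothing controlled by a target S on the weighted residual sum of
# squares, in the spirit of the condition above.  Data, noise level and S are
# illustrative.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
sigma = 0.2                                    # assumed noise standard deviation
y = np.exp(-x / 3.0) + sigma * rng.standard_normal(x.size)

n = x.size
S = n                                          # a value in [n - sqrt(2n), n + sqrt(2n)]
spl = UnivariateSpline(x, y, w=np.full(n, 1.0 / sigma), k=3, s=S)

y_smooth = spl(x)                              # smoothed values
# s=0 would reproduce an interpolating spline; larger S gives a smoother curve.
```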
There are two main classes of method for generalizing from smoothing with respect to a scalar $x$ to smoothing with respect to a vector $x$. The first approach simply generalizes the spline smoothing penalty to the multidimensional setting. For example, if trying to estimate $f(x, z)$ we might use the thin plate spline penalty and find the $\hat f(x, z)$ minimizing

$$\sum_{i=1}^n \{Y_i - \hat f(x_i, z_i)\}^2 + \lambda \int \left[ \left( \frac{\partial^2 \hat f}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 \hat f}{\partial x \, \partial z} \right)^2 + \left( \frac{\partial^2 \hat f}{\partial z^2} \right)^2 \right] dx \, dz.$$
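A hedged sketch of this kind of multidimensional smoothing follows, using SciPy's RBFInterpolator with a thin-plate-spline kernel and a positive smoothing parameter (a closely related penalized formulation); the data and parameter values are illustrative.

```python
# Sketch of smoothing a noisy surface f(x, z) with a thin-plate-spline kernel.
# Locations, noise level and the `smoothing` value are illustrative.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(2)
xz = rng.uniform(0, 1, size=(200, 2))                      # (x_i, z_i) locations
vals = (np.sin(2 * np.pi * xz[:, 0]) * np.cos(2 * np.pi * xz[:, 1])
        + 0.05 * rng.standard_normal(200))                 # noisy responses

f_hat = RBFInterpolator(xz, vals, kernel='thin_plate_spline', smoothing=1e-2)

grid = np.stack(np.meshgrid(np.linspace(0, 1, 50),
                            np.linspace(0, 1, 50)), axis=-1).reshape(-1, 2)
surface = f_hat(grid)                                      # smoothed surface estimate
```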
The thin plate spline approach can be generalized to smoothing with respect to more than two dimensions and to other orders of differentiation in the penalty. [1] As the dimension increases there are some restrictions on the smallest order of differential that can be used, [1] but Duchon's original paper [9] gives slightly more complicated penalties that avoid this restriction.
The thin plate splines are isotropic, meaning that if we rotate the co-ordinate system the estimate will not change, but also that we are assuming that the same level of smoothing is appropriate in all directions. This is often considered reasonable when smoothing with respect to spatial location, but in many other cases isotropy is not an appropriate assumption and can lead to sensitivity to apparently arbitrary choices of measurement units. For example, if smoothing with respect to distance and time, an isotropic smoother will give different results if distance is measured in metres and time in seconds than if the units are changed to centimetres and hours.
The second class of generalizations to multi-dimensional smoothing deals directly with this scale invariance issue using tensor product spline constructions. [10] [11] [12] Such splines have smoothing penalties with multiple smoothing parameters, which is the price that must be paid for not assuming that the same degree of smoothness is appropriate in all directions.
Smoothing splines are related to, but distinct from:
Source code for spline smoothing can be found in the examples from Carl de Boor's book A Practical Guide to Splines. The examples are written in the Fortran programming language. Updated sources are also available on Carl de Boor's official site.
In the mathematical subfield of numerical analysis, a B-spline or basis spline is a spline function that has minimal support with respect to a given degree, smoothness, and domain partition. Any spline function of given degree can be expressed as a linear combination of B-splines of that degree. Cardinal B-splines have knots that are equidistant from each other. B-splines can be used for curve-fitting and numerical differentiation of experimental data.
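As an illustration of a spline written as a linear combination of B-spline basis functions, the following sketch evaluates such a combination with SciPy's BSpline; the knot vector and coefficients are illustrative.

```python
# Sketch: a cubic spline as the linear combination sum_j c_j B_{j,k}(x) of
# B-spline basis functions.  Knots and coefficients are illustrative.
import numpy as np
from scipy.interpolate import BSpline

k = 3                                                            # cubic
t = np.array([0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4], dtype=float)     # clamped knot vector
c = np.array([1.0, 2.0, 0.5, -1.0, 0.0, 1.5, 2.0])               # basis coefficients
spl = BSpline(t, c, k)

x = np.linspace(0, 4, 9)
print(spl(x))    # values of the spline at x
```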
The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation (a residual being the difference between an observed value and the value fitted by the model).
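A minimal sketch of the idea, fitting a straight line by minimizing the sum of squared residuals with NumPy's lstsq; the data are illustrative.

```python
# Ordinary least squares: fit y ≈ a + b*x by minimizing the residual sum of squares.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
coef, residual_ss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
a, b = coef                                        # intercept and slope
```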
In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression estimates the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.
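A small sketch of the model itself: the log-odds are a linear combination of the predictors, and the logistic function converts them to probabilities. The coefficients below are illustrative, not fitted values.

```python
# Logistic model: linear predictor on the log-odds scale, mapped to a
# probability by the logistic function.  Coefficients are illustrative.
import numpy as np

beta0, beta1 = -1.0, 0.8                 # illustrative intercept and slope
x = np.array([0.0, 1.0, 2.0, 3.0])

log_odds = beta0 + beta1 * x             # linear predictor (logits)
prob = 1.0 / (1.0 + np.exp(-log_odds))   # probability of the outcome labeled "1"
```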
Simultaneous equations models are a type of statistical model in which the dependent variables are functions of other dependent variables, rather than just independent variables. This means some of the explanatory variables are jointly determined with the dependent variable, which in economics usually is the consequence of some underlying equilibrium mechanism. Take the typical supply and demand model: whilst typically one would determine the quantity supplied and demanded to be a function of the price set by the market, it is also possible for the reverse to be true, where producers observe the quantity that consumers demand and then set the price.
In statistics, Deming regression, named after W. Edwards Deming, is an errors-in-variables model that tries to find the line of best fit for a two-dimensional data set. It differs from the simple linear regression in that it accounts for errors in observations on both the x- and the y- axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure.
In mathematics, a spline is a function defined piecewise by polynomials. In interpolating problems, spline interpolation is often preferred to polynomial interpolation because it yields similar results, even when using low degree polynomials, while avoiding Runge's phenomenon for higher degrees.
In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that changes the answer of a problem to a simpler one. It is often used to obtain results for ill-posed problems or to prevent overfitting.
In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.
In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.
Thin plate splines (TPS) are a spline-based technique for data interpolation and smoothing (a spline being a function defined piecewise by polynomials). They were introduced to geometric design by Duchon. They are an important special case of a polyharmonic spline. Robust point matching (RPM) is a common extension, and the combination is known as the TPS-RPM algorithm.
Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, and (v) regression with Box–Cox transformed regressors.
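A brief sketch of non-linear least squares using SciPy's curve_fit, which refines the parameters iteratively from a starting guess; the exponential-decay model and data are illustrative.

```python
# Non-linear least squares: iteratively refine parameters of a model that is
# non-linear in its parameters.  Model form and data are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(-b * x)            # non-linear in the parameter b

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 40)
y = model(x, 2.0, 1.3) + 0.05 * rng.standard_normal(x.size)

params, cov = curve_fit(model, x, y, p0=[1.0, 1.0])   # p0: starting values for the iterations
a_hat, b_hat = params
```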
A kernel smoother is a statistical technique to estimate a real valued function as the weighted average of neighboring observed data. The weight is defined by the kernel, such that closer points are given higher weights. The estimated function is smooth, and the level of smoothness is set by a single parameter. Kernel smoothing is a type of weighted moving average.
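A minimal sketch of a kernel smoother in the weighted-average form described above, with an illustrative Gaussian kernel and bandwidth.

```python
# Kernel smoother (Nadaraya-Watson form): the estimate at each point is a
# kernel-weighted average of the observed responses.  Kernel choice, bandwidth
# and data are illustrative.
import numpy as np

def kernel_smooth(x_eval, x, y, bandwidth):
    # Gaussian kernel weights; closer observations receive larger weights.
    w = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)

x_grid = np.linspace(0, 10, 200)
y_smooth = kernel_smooth(x_grid, x, y, bandwidth=0.5)
```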
In statistics and machine learning, lasso is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. The lasso method assumes that the coefficients of the linear model are sparse, meaning that few of them are non-zero. It was originally introduced in geophysics, and later by Robert Tibshirani, who coined the term.
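A short sketch of lasso regression with scikit-learn, showing how the L1 penalty drives many coefficients to exactly zero; the data and penalty strength are illustrative.

```python
# Lasso: L1-penalized least squares that performs variable selection by
# shrinking many coefficients exactly to zero.  Data and alpha are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 10))
true_coef = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # sparse truth
y = X @ true_coef + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)        # most entries are driven to exactly zero
```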
In statistics, the backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method, an algorithm used for solving a certain linear system of equations.
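A compact sketch of backfitting for a two-term additive model, cycling Gauss–Seidel style through the components; the simple Gaussian kernel smoother and the data are illustrative.

```python
# Backfitting for y ≈ alpha + f1(x1) + f2(x2): repeatedly smooth the partial
# residuals against each predictor in turn.  Smoother choice and data are
# illustrative.
import numpy as np

def smooth(x, r, bandwidth=0.5):
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w * r).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(7)
n = 200
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
y = np.sin(x1) + 0.5 * x2 ** 2 + 0.1 * rng.standard_normal(n)

alpha = y.mean()
f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):                        # Gauss-Seidel style sweeps
    f1 = smooth(x1, y - alpha - f2)
    f1 -= f1.mean()                        # centre for identifiability
    f2 = smooth(x2, y - alpha - f1)
    f2 -= f2.mean()

fitted = alpha + f1 + f2
```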
In digital signal processing, multidimensional sampling is the process of converting a function of a multidimensional variable into a discrete collection of values of the function measured on a discrete set of points. This article presents the basic result due to Petersen and Middleton on conditions for perfectly reconstructing a wavenumber-limited function from its measurements on a discrete lattice of points. This result, also known as the Petersen–Middleton theorem, is a generalization of the Nyquist–Shannon sampling theorem for sampling one-dimensional band-limited functions to higher-dimensional Euclidean spaces.
Functional principal component analysis (FPCA) is a statistical method for investigating the dominant modes of variation of functional data. Using this method, a random function is represented in the eigenbasis, which is an orthonormal basis of the Hilbert space L2 that consists of the eigenfunctions of the autocovariance operator. FPCA represents functional data in the most parsimonious way, in the sense that when using a fixed number of basis functions, the eigenfunction basis explains more variation than any other basis expansion. FPCA can be applied for representing random functions, or in functional regression and classification.
The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic process. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in the case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function. Functional linear regression, functional Poisson regression and functional binomial regression, with functional logistic regression as an important special case, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.
Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.
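A minimal sketch of regularized least squares in its ridge form, solving the penalized normal equations directly; the data and penalty are illustrative.

```python
# Regularized (ridge) least squares: add an L2 penalty to the least-squares
# objective and solve (X^T X + lam I) w = X^T y.  Data and lam are illustrative.
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, 0.5, -0.5, 2.0, 0.0]) + 0.1 * rng.standard_normal(50)

lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```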
In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.
A partially linear model is a form of semiparametric model, since it contains both parametric and nonparametric elements. Least squares estimators can be applied to the partially linear model if the nonparametric element is assumed to be known. Partially linear equations were first used in the analysis of the relationship between temperature and electricity usage by Engle, Granger, Rice and Weiss (1986). A typical application of the partially linear model in microeconomics was presented by Tripathi (1997) in a study of the profitability of firms' production. The partially linear model has also been applied successfully in other fields: in 1994, Zeger and Diggle introduced it into biometrics, and in environmental science, Parda-Sanchez et al. (2000) used it to analyse collected data. The partially linear model has since been refined with many other statistical methods. In 1988, Robinson applied the Nadaraya–Watson kernel estimator to estimate the nonparametric element and build a least-squares estimator; later, in 1997, a local linear method was proposed by Truong.