Curve fitting

Last updated
Fitting of a noisy curve by an asymmetrical peak model, with an iterative process (Gauss-Newton algorithm with variable damping factor a). Regression pic assymetrique.gif
Fitting of a noisy curve by an asymmetrical peak model, with an iterative process (Gauss–Newton algorithm with variable damping factor α).

Curve fitting [1] [2] is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, [3] possibly subject to constraints. [4] [5] Curve fitting can involve either interpolation, [6] [7] where an exact fit to the data is required, or smoothing, [8] [9] in which a "smooth" function is constructed that approximately fits the data. A related topic is regression analysis, [10] [11] which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fitted to data observed with random errors. Fitted curves can be used as an aid for data visualization, [12] [13] to infer values of a function where no data are available, [14] and to summarize the relationships among two or more variables. [15] Extrapolation refers to the use of a fitted curve beyond the range of the observed data, [16] and is subject to a degree of uncertainty [17] since it may reflect the method used to construct the curve as much as it reflects the observed data.

Contents

For linear-algebraic analysis of data, "fitting" usually means trying to find the curve that minimizes the vertical (y-axis) displacement of a point from the curve (e.g., ordinary least squares). However, for graphical and image applications, geometric fitting seeks to provide the best visual fit; which usually means trying to minimize the orthogonal distance to the curve (e.g., total least squares), or to otherwise include both axes of displacement of a point from the curve. Geometric fits are not popular because they usually require non-linear and/or iterative calculations, although they have the advantage of a more aesthetic and geometrically accurate result. [18] [19] [20]

Algebraic fitting of functions to data points

Most commonly, one fits a function of the form y=f(x).

Fitting lines and polynomial functions to data points

Polynomial curves fitting points generated with a sine function. The black dotted line is the "true" data, the red line is a first degree polynomial, the green line is second degree, the orange line is third degree and the blue line is fourth degree. Curve fitting.svg
Polynomial curves fitting points generated with a sine function. The black dotted line is the "true" data, the red line is a first degree polynomial, the green line is second degree, the orange line is third degree and the blue line is fourth degree.

The first degree polynomial equation

is a line with slope a. A line will connect any two points, so a first degree polynomial equation is an exact fit through any two points with distinct x coordinates.

If the order of the equation is increased to a second degree polynomial, the following results:

This will exactly fit a simple curve to three points.

If the order of the equation is increased to a third degree polynomial, the following is obtained:

This will exactly fit four points.

A more general statement would be to say it will exactly fit four constraints. Each constraint can be a point, angle, or curvature (which is the reciprocal of the radius of an osculating circle). Angle and curvature constraints are most often added to the ends of a curve, and in such cases are called end conditions. Identical end conditions are frequently used to ensure a smooth transition between polynomial curves contained within a single spline. Higher-order constraints, such as "the change in the rate of curvature", could also be added. This, for example, would be useful in highway cloverleaf design to understand the rate of change of the forces applied to a car (see jerk), as it follows the cloverleaf, and to set reasonable speed limits, accordingly.

The first degree polynomial equation could also be an exact fit for a single point and an angle while the third degree polynomial equation could also be an exact fit for two points, an angle constraint, and a curvature constraint. Many other combinations of constraints are possible for these and for higher order polynomial equations.

If there are more than n + 1 constraints (n being the degree of the polynomial), the polynomial curve can still be run through those constraints. An exact fit to all constraints is not certain (but might happen, for example, in the case of a first degree polynomial exactly fitting three collinear points). In general, however, some method is then needed to evaluate each approximation. The least squares method is one way to compare the deviations.

There are several reasons given to get an approximate fit when it is possible to simply increase the degree of the polynomial equation and get an exact match.:

The degree of the polynomial curve being higher than needed for an exact fit is undesirable for all the reasons listed previously for high order polynomials, but also leads to a case where there are an infinite number of solutions. For example, a first degree polynomial (a line) constrained by only a single point, instead of the usual two, would give an infinite number of solutions. This brings up the problem of how to compare and choose just one solution, which can be a problem for software and for humans, as well. For this reason, it is usually best to choose as low a degree as possible for an exact match on all constraints, and perhaps an even lower degree, if an approximate fit is acceptable.

Relation between wheat yield and soil salinity Gohana inverted S-curve.png
Relation between wheat yield and soil salinity

Fitting other functions to data points

Other types of curves, such as trigonometric functions (such as sine and cosine), may also be used, in certain cases.

In spectroscopy, data may be fitted with Gaussian, Lorentzian, Voigt and related functions.

In biology, ecology, demography, epidemiology, and many other disciplines, the growth of a population, the spread of infectious disease, etc. can be fitted using the logistic function.

In agriculture the inverted logistic sigmoid function (S-curve) is used to describe the relation between crop yield and growth factors. The blue figure was made by a sigmoid regression of data measured in farm lands. It can be seen that initially, i.e. at low soil salinity, the crop yield reduces slowly at increasing soil salinity, while thereafter the decrease progresses faster.

Geometric fitting of plane curves to data points

If a function of the form cannot be postulated, one can still try to fit a plane curve.

Other types of curves, such as conic sections (circular, elliptical, parabolic, and hyperbolic arcs) or trigonometric functions (such as sine and cosine), may also be used, in certain cases. For example, trajectories of objects under the influence of gravity follow a parabolic path, when air resistance is ignored. Hence, matching trajectory data points to a parabolic curve would make sense. Tides follow sinusoidal patterns, hence tidal data points should be matched to a sine wave, or the sum of two sine waves of different periods, if the effects of the Moon and Sun are both considered.

For a parametric curve, it is effective to fit each of its coordinates as a separate function of arc length; assuming that data points can be ordered, the chord distance may be used. [22]

Fitting a circle by geometric fit

Circle fitting with the Coope method, the points describing a circle arc, centre (1 ; 1), radius 4. Regression circulaire coope arc de cercle.svg
Circle fitting with the Coope method, the points describing a circle arc, centre (1 ; 1), radius 4.
different models of ellipse fitting Wp ellfitting.png
different models of ellipse fitting
Ellipse fitting minimising the algebraic distance (Fitzgibbon method). Regression elliptique distance algebrique donnees gander.svg
Ellipse fitting minimising the algebraic distance (Fitzgibbon method).

Coope [23] approaches the problem of trying to find the best visual fit of circle to a set of 2D data points. The method elegantly transforms the ordinarily non-linear problem into a linear problem that can be solved without using iterative numerical methods, and is hence much faster than previous techniques.

Fitting an ellipse by geometric fit

The above technique is extended to general ellipses [24] by adding a non-linear step, resulting in a method that is fast, yet finds visually pleasing ellipses of arbitrary orientation and displacement.

Fitting surfaces

Note that while this discussion was in terms of 2D curves, much of this logic also extends to 3D surfaces, each patch of which is defined by a net of curves in two parametric directions, typically called u and v. A surface may be composed of one or more surface patches in each direction.

Software

Many statistical packages such as R and numerical software such as the gnuplot, GNU Scientific Library, Igor Pro, MLAB, Maple, MATLAB, TK Solver 6.0, Scilab, Mathematica, GNU Octave, and SciPy include commands for doing curve fitting in a variety of scenarios. There are also programs specifically written to do curve fitting; they can be found in the lists of statistical and numerical-analysis programs as well as in Category:Regression and curve fitting software.

See also

Related Research Articles

In mathematics, an equation is a mathematical formula that expresses the equality of two expressions, by connecting them with the equals sign =. The word equation and its cognates in other languages may have subtly different meanings; for example, in French an équation is defined as containing one or more variables, while in English, any well-formed formula consisting of two expressions related with an equals sign is an equation.

<span class="mw-page-title-main">Interpolation</span> Method for estimating new data within known data points

In the mathematical field of numerical analysis, interpolation is a type of estimation, a method of constructing (finding) new data points based on the range of a discrete set of known data points.

<span class="mw-page-title-main">B-spline</span> Spline function

In the mathematical subfield of numerical analysis, a B-spline or basis spline is a spline function that has minimal support with respect to a given degree, smoothness, and domain partition. Any spline function of given degree can be expressed as a linear combination of B-splines of that degree. Cardinal B-splines have knots that are equidistant from each other. B-splines can be used for curve-fitting and numerical differentiation of experimental data.

<span class="mw-page-title-main">Least squares</span> Approximation method in statistics

The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

<span class="mw-page-title-main">Time series</span> Sequence of data points over time

In mathematics, a time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

In optimization, line search is a basic iterative approach to find a local minimum of an objective function . It first finds a descent direction along which the objective function will be reduced, and then computes a step size that determines how far should move along that direction. The descent direction can be computed by various methods, such as gradient descent or quasi-Newton method. The step size can be determined either exactly or inexactly.

<span class="mw-page-title-main">Savitzky–Golay filter</span> Algorithm to smooth data points

A Savitzky–Golay filter is a digital filter that can be applied to a set of digital data points for the purpose of smoothing the data, that is, to increase the precision of the data without distorting the signal tendency. This is achieved, in a process known as convolution, by fitting successive sub-sets of adjacent data points with a low-degree polynomial by the method of linear least squares. When the data points are equally spaced, an analytical solution to the least-squares equations can be found, in the form of a single set of "convolution coefficients" that can be applied to all data sub-sets, to give estimates of the smoothed signal, at the central point of each sub-set. The method, based on established mathematical procedures, was popularized by Abraham Savitzky and Marcel J. E. Golay, who published tables of convolution coefficients for various polynomials and sub-set sizes in 1964. Some errors in the tables have been corrected. The method has been extended for the treatment of 2- and 3-dimensional data.

<span class="mw-page-title-main">Isotonic regression</span> Type of numerical analysis

In statistics and numerical analysis, isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing everywhere, and lies as close to the observations as possible.

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

<span class="mw-page-title-main">Local regression</span> Moving average and polynomial regression method for smoothing data

Local regression or local polynomial regression, also known as moving regression, is a generalization of the moving average and polynomial regression. Its most common methods, initially developed for scatterplot smoothing, are LOESS and LOWESS, both pronounced LOH-ess. They are two strongly related non-parametric regression methods that combine multiple regression models in a k-nearest-neighbor-based meta-model. In some fields, LOESS is known and commonly referred to as Savitzky–Golay filter.

<span class="mw-page-title-main">Cusp (singularity)</span> Point on a curve where motion must move backwards

In mathematics, a cusp, sometimes called spinode in old texts, is a point on a curve where a moving point must reverse direction. A typical example is given in the figure. A cusp is thus a type of singular point of a curve.

Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.

In statistical modeling, polynomial functions and rational functions are sometimes used as an empirical technique for curve fitting.

In statistics, multivariate adaptive regression splines (MARS) is a form of regression analysis introduced by Jerome H. Friedman in 1991. It is a non-parametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions between variables.

Least absolute deviations (LAD), also known as least absolute errors (LAE), least absolute residuals (LAR), or least absolute values (LAV), is a statistical optimality criterion and a statistical optimization technique based on minimizing the sum of absolute deviations or the L1 norm of such values. It is analogous to the least squares technique, except that it is based on absolute values instead of squared values. It attempts to find a function which closely approximates a set of data by minimizing residuals between points generated by the function and corresponding data points. The LAD estimate also arises as the maximum likelihood estimate if the errors have a Laplace distribution. It was introduced in 1757 by Roger Joseph Boscovich.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.

In statistics, several scatterplot smoothing methods are available to fit a function through the points of a scatterplot to best represent the relationship between the variables.

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

References

  1. Sandra Lach Arlinghaus, PHB Practical Handbook of Curve Fitting. CRC Press, 1994.
  2. William M. Kolb. Curve Fitting for Programmable Calculators. Syntec, Incorporated, 1984.
  3. S.S. Halli, K.V. Rao. 1992. Advanced Techniques of Population Analysis. ISBN   0306439972 Page 165 (cf. ... functions are fulfilled if we have a good to moderate fit for the observed data.)
  4. The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. By Nate Silver
  5. Data Preparation for Data Mining: Text. By Dorian Pyle.
  6. Numerical Methods in Engineering with MATLAB®. By Jaan Kiusalaas. Page 24.
  7. Numerical Methods in Engineering with Python 3. By Jaan Kiusalaas. Page 21.
  8. Numerical Methods of Curve Fitting. By P. G. Guest, Philip George Guest. Page 349.
  9. See also: Mollifier
  10. Fitting Models to Biological Data Using Linear and Nonlinear Regression. By Harvey Motulsky, Arthur Christopoulos.
  11. Regression Analysis By Rudolf J. Freund, William J. Wilson, Ping Sa. Page 269.
  12. Visual Informatics. Edited by Halimah Badioze Zaman, Peter Robinson, Maria Petrou, Patrick Olivier, Heiko Schröder. Page 689.
  13. Numerical Methods for Nonlinear Engineering Models. By John R. Hauser. Page 227.
  14. Methods of Experimental Physics: Spectroscopy, Volume 13, Part 1. By Claire Marton. Page 150.
  15. Encyclopedia of Research Design, Volume 1. Edited by Neil J. Salkind. Page 266.
  16. Community Analysis and Planning Techniques. By Richard E. Klosterman. Page 1.
  17. An Introduction to Risk and Uncertainty in the Evaluation of Environmental Investments. DIANE Publishing. Pg 69
  18. Ahn, Sung-Joon (December 2008), "Geometric Fitting of Parametric Curves and Surfaces" (PDF), Journal of Information Processing Systems, 4 (4): 153–158, doi:10.3745/JIPS.2008.4.4.153, archived from the original (PDF) on 2014-03-13
  19. Chernov, N.; Ma, H. (2011), "Least squares fitting of quadratic curves and surfaces", in Yoshida, Sota R. (ed.), Computer Vision, Nova Science Publishers, pp. 285–302, ISBN   9781612093994
  20. Liu, Yang; Wang, Wenping (2008), "A Revisit to Least Squares Orthogonal Distance Fitting of Parametric Curves and Surfaces", in Chen, F.; Juttler, B. (eds.), Advances in Geometric Modeling and Processing, Lecture Notes in Computer Science, vol. 4975, pp. 384–397, CiteSeerX   10.1.1.306.6085 , doi:10.1007/978-3-540-79246-8_29, ISBN   978-3-540-79245-1
  21. Calculator for sigmoid regression
  22. p.51 in Ahlberg & Nilson (1967) The theory of splines and their applications, Academic Press, 1967
  23. Coope, I.D. (1993). "Circle fitting by linear and nonlinear least squares". Journal of Optimization Theory and Applications. 76 (2): 381–388. doi:10.1007/BF00939613. hdl: 10092/11104 . S2CID   59583785.
  24. Paul Sheer, A software assistant for manual stereo photometrology, M.Sc. thesis, 1997

Further reading