Multivariate adaptive regression spline

Last updated

In statistics, multivariate adaptive regression splines (MARS) is a form of regression analysis introduced by Jerome H. Friedman in 1991. [1] It is a non-parametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions between variables.

Contents

The term "MARS" is trademarked and licensed to Salford Systems. In order to avoid trademark infringements, many open-source implementations of MARS are called "Earth". [2] [3]

The basics

This section introduces MARS using a few examples. We start with a set of data: a matrix of input variables x, and a vector of the observed responses y, with a response for each row in x. For example, the data could be:

xy
10.516.4
10.718.8
10.819.7
......
20.677.0

Here there is only one independent variable, so the x matrix is just a single column. Given these measurements, we would like to build a model which predicts the expected y for a given x.

A linear model Friedmans mars linear model.png
A linear model

A linear model for the above data is

The hat on the indicates that is estimated from the data. The figure on the right shows a plot of this function: a line giving the predicted versus x, with the original values of y shown as red dots.

The data at the extremes of x indicates that the relationship between y and x may be non-linear (look at the red dots relative to the regression line at low and high values of x). We thus turn to MARS to automatically build a model taking into account non-linearities. MARS software constructs a model from the given x and y as follows

A simple MARS model of the same data Friedmans mars simple model.png
A simple MARS model of the same data

The figure on the right shows a plot of this function: the predicted versus x, with the original values of y once again shown as red dots. The predicted response is now a better fit to the original y values.

MARS has automatically produced a kink in the predicted y to take into account non-linearity. The kink is produced by hinge functions. The hinge functions are the expressions starting with (where is if , else ). Hinge functions are described in more detail below.

In this simple example, we can easily see from the plot that y has a non-linear relationship with x (and might perhaps guess that y varies with the square of x). However, in general there will be multiple independent variables, and the relationship between y and these variables will be unclear and not easily visible by plotting. We can use MARS to discover that non-linear relationship.

An example MARS expression with multiple variables is

Variable interaction in a MARS model Friedmans mars ozone model.png
Variable interaction in a MARS model

This expression models air pollution (the ozone level) as a function of the temperature and a few other variables. Note that the last term in the formula (on the last line) incorporates an interaction between and .

The figure on the right plots the predicted as and vary, with the other variables fixed at their median values. The figure shows that wind does not affect the ozone level unless visibility is low. We see that MARS can build quite flexible regression surfaces by combining hinge functions.

To obtain the above expression, the MARS model building procedure automatically selects which variables to use (some variables are important, others not), the positions of the kinks in the hinge functions, and how the hinge functions are combined.

The MARS model

MARS builds models of the form

The model is a weighted sum of basis functions . Each is a constant coefficient. For example, each line in the formula for ozone above is one basis function multiplied by its coefficient.

Each basis function takes one of the following three forms:

1) a constant 1. There is just one such term, the intercept. In the ozone formula above, the intercept term is 5.2.

2) a hinge function. A hinge function has the form or . MARS automatically selects variables and values of those variables for knots of the hinge functions. Examples of such basis functions can be seen in the middle three lines of the ozone formula.

3) a product of two or more hinge functions. These basis functions can model interaction between two or more variables. An example is the last line of the ozone formula.

Hinge functions

A mirrored pair of hinge functions with a knot at x=3.1 Friedmans mars hinge functions.png
A mirrored pair of hinge functions with a knot at x=3.1

A key part of MARS models are hinge functions taking the form

or

where is a constant, called the knot. The figure on the right shows a mirrored pair of hinge functions with a knot at 3.1.

A hinge function is zero for part of its range, so can be used to partition the data into disjoint regions, each of which can be treated independently. Thus for example a mirrored pair of hinge functions in the expression

creates the piecewise linear graph shown for the simple MARS model in the previous section.

One might assume that only piecewise linear functions can be formed from hinge functions, but hinge functions can be multiplied together to form non-linear functions.

Hinge functions are also called ramp, hockey stick, or rectifier functions. Instead of the notation used in this article, hinge functions are often represented by where means take the positive part.

The model building process

MARS builds a model in two phases: the forward and the backward pass. This two-stage approach is the same as that used by recursive partitioning trees.

The forward pass

MARS starts with a model which consists of just the intercept term (which is the mean of the response values).

MARS then repeatedly adds basis function in pairs to the model. At each step it finds the pair of basis functions that gives the maximum reduction in sum-of-squares residual error (it is a greedy algorithm). The two basis functions in the pair are identical except that a different side of a mirrored hinge function is used for each function. Each new basis function consists of a term already in the model (which could perhaps be the intercept term) multiplied by a new hinge function. A hinge function is defined by a variable and a knot, so to add a new basis function, MARS must search over all combinations of the following:

1) existing terms (called parent terms in this context)

2) all variables (to select one for the new basis function)

3) all values of each variable (for the knot of the new hinge function).

To calculate the coefficient of each term, MARS applies a linear regression over the terms.

This process of adding terms continues until the change in residual error is too small to continue or until the maximum number of terms is reached. The maximum number of terms is specified by the user before model building starts.

The search at each step is usually done in a brute-force fashion, but a key aspect of MARS is that because of the nature of hinge functions, the search can be done quickly using a fast least-squares update technique. Brute-force search can be sped up by using a heuristic that reduces the number of parent terms considered at each step ("Fast MARS" [4] ).

The backward pass

The forward pass usually overfits the model. To build a model with better generalization ability, the backward pass prunes the model, deleting the least effective term at each step until it finds the best submodel. Model subsets are compared using the Generalized cross validation (GCV) criterion described below.

The backward pass has an advantage over the forward pass: at any step it can choose any term to delete, whereas the forward pass at each step can only see the next pair of terms.

The forward pass adds terms in pairs, but the backward pass typically discards one side of the pair and so terms are often not seen in pairs in the final model. A paired hinge can be seen in the equation for in the first MARS example above; there are no complete pairs retained in the ozone example.

Generalized cross validation

The backward pass compares the performance of different models using Generalized Cross-Validation (GCV), a minor variant on the Akaike information criterion that approximates the leave-one-out cross-validation score in the special case where errors are Gaussian, or where the squared error loss function is used. GCV was introduced by Craven and Wahba and extended by Friedman for MARS; lower values of GCV indicate better models. The formula for the GCV is

GCV = RSS / (N · (1 − (effective number of parameters) / N)2)

where RSS is the residual sum-of-squares measured on the training data and N is the number of observations (the number of rows in the x matrix).

The effective number of parameters is defined as

(effective number of parameters) = (number of mars terms) + (penalty) · ((number of Mars terms) − 1 ) / 2

where penalty is typically 2 (giving results equivalent to the Akaike information criterion) but can be increased by the user if they so desire.

Note that

(number of Mars terms − 1 ) / 2

is the number of hinge-function knots, so the formula penalizes the addition of knots. Thus the GCV formula adjusts (i.e. increases) the training RSS to penalize more complex models. We penalize flexibility because models that are too flexible will model the specific realization of noise in the data instead of just the systematic structure of the data.

Constraints

One constraint has already been mentioned: the user can specify the maximum number of terms in the forward pass.

A further constraint can be placed on the forward pass by specifying a maximum allowable degree of interaction. Typically only one or two degrees of interaction are allowed, but higher degrees can be used when the data warrants it. The maximum degree of interaction in the first MARS example above is one (i.e. no interactions or an additive model); in the ozone example it is two.

Other constraints on the forward pass are possible. For example, the user can specify that interactions are allowed only for certain input variables. Such constraints could make sense because of knowledge of the process that generated the data.

Pros and cons

No regression modeling technique is best for all situations. The guidelines below are intended to give an idea of the pros and cons of MARS, but there will be exceptions to the guidelines. It is useful to compare MARS to recursive partitioning and this is done below. (Recursive partitioning is also commonly called regression trees, decision trees, or CART; see the recursive partitioning article for details).

See also

Related Research Articles

<span class="mw-page-title-main">Logistic regression</span> Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression is estimating the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

<span class="mw-page-title-main">Regression analysis</span> Set of statistical processes for estimating the relationships among variables

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

<span class="mw-page-title-main">Coefficient of determination</span> Indicator for how well data points fit a line or curve

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term, in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

<span class="mw-page-title-main">Simple linear regression</span> Linear regression model with a single explanatory variable

In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in the regression model. Least squares and weighted least squares may need to be more statistically efficient and prevent misleading inferences. GLS was first described by Alexander Aitken in 1935.

Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. That is, no parametric form is assumed for the relationship between predictors and dependent variable. Nonparametric regression requires larger sample sizes than regression based on parametric models because the data must supply the model structure as well as the model estimates.

In statistics, the variance inflation factor (VIF) is the ratio (quotient) of the variance of estimating some parameter in a model that includes multiple other terms (parameters) by the variance of a model constructed using only one term. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity. Cuthbert Daniel claims to have invented the concept behind the variance inflation factor, but did not come up with the name.

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

Smoothing splines are function estimates, , obtained from a set of noisy observations of the target , in order to balance a measure of goodness of fit of to with a derivative based measure of the smoothness of . They provide a means for smoothing noisy data. The most familiar example is the cubic smoothing spline, but there are many other possibilities, including for the case where Failed to parse : x is a vector quantity.

In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

In statistics and in machine learning, a linear predictor function is a linear function of a set of coefficients and explanatory variables, whose value is used to predict the outcome of a dependent variable. This sort of function usually comes in linear regression, where the coefficients are called regression coefficients. However, they also occur in various types of linear classifiers, as well as in various other models, such as principal component analysis and factor analysis. In many of these models, the coefficients are referred to as "weights".

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

References

  1. Friedman, J. H. (1991). "Multivariate Adaptive Regression Splines". The Annals of Statistics. 19 (1): 1–67. CiteSeerX   10.1.1.382.970 . doi:10.1214/aos/1176347963. JSTOR   2241837. MR   1091842. Zbl   0765.62064.
  2. CRAN Package earth
  3. Earth – Multivariate adaptive regression splines in Orange (Python machine learning library)
  4. Friedman, J. H. (1993) Fast MARS, Stanford University Department of Statistics, Technical Report 110
  5. 1 2 3 Kuhn, Max; Johnson, Kjell (2013). Applied Predictive Modeling. New York, NY: Springer New York. doi:10.1007/978-1-4614-6849-3. ISBN   9781461468486.
  6. Friedman, Jerome H. (1993). "Estimating Functions of Mixed Ordinal and Categorical Variables Using Adaptive Splines". In Stephan Morgenthaler; Elvezio Ronchetti; Werner Stahel (eds.). New Directions in Statistical Data Analysis and Robustness. Birkhauser.
  7. Friedman, Jerome H. (1991-06-01). "Estimating Functions of Mixed Ordinal and Categorical Variables Using Adaptive Splines". DTIC. Archived from the original on April 11, 2022. Retrieved 2022-04-11.
  8. Denison, D. G. T.; Mallick, B. K.; Smith, A. F. M. (1 December 1998). "Bayesian MARS" (PDF). Statistics and Computing. 8 (4): 337–346. doi:10.1023/A:1008824606259. ISSN   1573-1375. S2CID   12570055.

Further reading

Several free and commercial software packages are available for fitting MARS-type models.

Free software
Commercial software
  1. Denison, D. G. T.; Holmes, C. C.; Mallick, B. K.; Smith, A. F. M. (2002). Bayesian methods for nonlinear classification and regression. Chichester, England: Wiley. ISBN   978-0-471-49036-4.