Projection pursuit regression

Last updated

In statistics, projection pursuit regression (PPR) is a statistical model developed by Jerome H. Friedman and Werner Stuetzle that extends additive models. This model adapts the additive models in that it first projects the data matrix of explanatory variables in the optimal direction before applying smoothing functions to these explanatory variables.

Contents

Model overview

The model consists of linear combinations of ridge functions: non-linear transformations of linear combinations of the explanatory variables. The basic model takes the form

where xi is a 1 × p row of the design matrix containing the explanatory variables for example i, yi is a 1 × 1 prediction, {βj} is a collection of r vectors (each a unit vector of length p) which contain the unknown parameters, {fj} is a collection of r initially unknown smooth functions that map from , and r is a hyperparameter. Good values for r can be determined through cross-validation or a forward stage-wise strategy which stops when the model fit cannot be significantly improved. As r approaches infinity and with an appropriate set of functions {fj}, the PPR model is a universal estimator, as it can approximate any continuous function in .

Model estimation

For a given set of data , the goal is to minimize the error function

over the functions and vectors . No method exists for solving over all variables at once, but it can be solved via alternating optimization. First, consider each pair individually: Let all other parameters be fixed, and find a "residual", the variance of the output not accounted for by those other parameters, given by

The task of minimizing the error function now reduces to solving

for each j in turn. Typically new pairs are added to the model in a forward stage-wise fashion.

Aside: Previously fitted pairs can be readjusted after new fit-pairs are determined by an algorithm known as backfitting, which entails reconsidering a previous pair, recalculating the residual given how other pairs have changed, refitting to account for that new information, and then cycling through all fit-pairs this way until parameters converge. This process typically results in a model that performs better with fewer fit-pairs, though it takes longer to train, and it is usually possible to achieve the same performance by skipping backfitting and simply adding more fits to the model (increasing r).

Solving the simplified error function to determine an pair can be done with alternating optimization, where first a random is used to project in to 1D space, and then the optimal is found to describe the relationship between that projection and the residuals via your favorite scatter plot regression method. Then if is held constant, assuming is once differentiable, the optimal updated weights can be found via the Gauss–Newton method a quasi-Newton method in which the part of the Hessian involving the second derivative is discarded. To derive this, first Taylor expand , then plug the expansion back in to the simplified error function and do some algebraic manipulation to put it in the form

This is a weighted least squares problem. If we solve for all weights and put them in a diagonal matrix , stack all the new targets in to a vector, and use the full data matrix instead of a single example , then the optimal is given by the closed-form

Use this updated to find a new projection of and refit to the new scatter plot. Then use that new to update by resolving the above, and continue this alternating process until converges.

It has been shown that the convergence rate, the bias and the variance are affected by the estimation of and .

Discussion

The PPR model takes the form of a basic additive model but with the additional component, so each fits a scatter plot of vs the residual (unexplained variance) during training rather than using the raw inputs themselves. This constrains the problem of finding each to low dimension, making it solvable with common least squares or spline fitting methods and sidestepping the curse of dimensionality during training. Because is taken of a projection of , the result looks like a "ridge" orthogonal to the projection dimension, so are often called "ridge functions". The directions are chosen to optimize the fit of their corresponding ridge functions.

Note that because PPR attempts to fit projections of the data, it can be difficult to interpret the fitted model as a whole, because each input variable has been accounted for in a complex and multifaceted way. This can make the model more useful for prediction than for understanding the data, though visualizing individual ridge functions and considering which projections the model is discovering can yield some insight.

Advantages of PPR estimation

Disadvantages of PPR estimation

Extensions of PPR

PPR vs neural networks (NN)

Both projection pursuit regression and fully connected neural networks with a single hidden layer project the input vector onto a one-dimensional hyperplane and then apply a nonlinear transformation of the input variables that are then added in a linear fashion. Thus both follow the same steps to overcome the curse of dimensionality. The main difference is that the functions being fitted in PPR can be different for each combination of input variables and are estimated one at a time and then updated with the weights, whereas in NN these are all specified upfront and estimated simultaneously.

Thus, in PPR estimation the transformations of variables in PPR are data driven whereas in a single-layer neural network these transformations are fixed.

See also

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> A paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

<span class="mw-page-title-main">Least squares</span> Approximation method in statistics

The method of least squares is a parameter estimation method in regression analysis based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

<span class="mw-page-title-main">Logistic regression</span> Statistical model for a binary dependent variable

In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression is estimating the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

<span class="mw-page-title-main">Nonlinear regression</span> Regression analysis

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations (iterations).

Partial least squares regression is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when the Y is categorical.

<span class="mw-page-title-main">Coefficient of determination</span> Indicator for how well data points fit a line or curve

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.


In statistics, the variance inflation factor (VIF) is the ratio (quotient) of the variance of a parameter estimate when fitting a full model that includes other parameters to the variance of the parameter estimate if the model is fit with only the parameter on its own. The VIF provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity.

Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m ≥ n). It is used in some forms of nonlinear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences. In economic theory, the non-linear least squares method is applied in (i) the probit regression, (ii) threshold regression, (iii) smooth regression, (iv) logistic link regression, (v) Box–Cox transformed regressors ().

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In statistics, an additive model (AM) is a nonparametric regression method. It was suggested by Jerome H. Friedman and Werner Stuetzle (1981) and is an essential part of the ACE algorithm. The AM uses a one-dimensional smoother to build a restricted class of nonparametric regression models. Because of this, it is less affected by the curse of dimensionality than e.g. a p-dimensional smoother. Furthermore, the AM is more flexible than a standard linear model, while being more interpretable than a general regression surface at the cost of approximation errors. Problems with AM, like many other machine-learning methods, include model selection, overfitting, and multicollinearity.

In statistics and machine learning, lasso is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. The lasso method assumes that the coefficients of the linear model are sparse, meaning that few of them are non-zero. It was originally introduced in geophysics, and later by Robert Tibshirani, who coined the term.

Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.

The Generalized Additive Model for Location, Scale and Shape (GAMLSS) is an approach to statistical modelling and learning. GAMLSS is a modern distribution-based approach to (semiparametric) regression. A parametric distribution is assumed for the response (target) variable but the parameters of this distribution can vary according to explanatory variables using linear, nonlinear or smooth functions. In machine learning parlance, GAMLSS is a form of supervised machine learning.

In statistics and in machine learning, a linear predictor function is a linear function of a set of coefficients and explanatory variables, whose value is used to predict the outcome of a dependent variable. This sort of function usually comes in linear regression, where the coefficients are called regression coefficients. However, they also occur in various types of linear classifiers, as well as in various other models, such as principal component analysis and factor analysis. In many of these models, the coefficients are referred to as "weights".

The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types on functional predictors, which are mostly random trajectories generated by a square-integrable stochastic processes. Similarly to GLM, a link function relates the expected value of the response variable to a linear predictor, which in case of GFLM is obtained by forming the scalar product of the random predictor function with a smooth parameter function . Functional Linear Regression, Functional Poisson Regression and Functional Binomial Regression, with the important Functional Logistic Regression included, are special cases of GFLM. Applications of GFLM include classification and discrimination of stochastic processes and functional data.

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

References