Linear predictor function

Last updated December 27, 2023

In statistics and in machine learning, a linear predictor function is a linear function (linear combination) of a set of coefficients and explanatory variables (independent variables), whose value is used to predict the outcome of a dependent variable.^[1] This sort of function usually comes in linear regression, where the coefficients are called regression coefficients. However, they also occur in various types of linear classifiers (e.g. logistic regression,^[2] perceptrons,^[3] support vector machines,^[4] and linear discriminant analysis ^[5]), as well as in various other models, such as principal component analysis ^[6] and factor analysis. In many of these models, the coefficients are referred to as "weights".

Definition

The basic form of a linear predictor function $f(i)$ for data point i (consisting of p explanatory variables), for i = 1, ..., n, is

f(i)=\beta _{0}+\beta _{1}x_{i1}+\cdots +\beta _{p}x_{ip},

where $x_{ik}$ , for k = 1, ..., p, is the value of the k-th explanatory variable for data point i, and $\beta _{0},\ldots ,\beta _{p}$ are the coefficients (regression coefficients, weights, etc.) indicating the relative effect of a particular explanatory variable on the outcome.

Notations

It is common to write the predictor function in a more compact form as follows:

The coefficients β₀, β₁, ..., β_p are grouped into a single vector β of size p + 1.
For each data point i, an additional explanatory pseudo-variable x_i0 is added, with a fixed value of 1, corresponding to the intercept coefficient β₀.
The resulting explanatory variables x_i0(= 1), x_i1, ..., x_ip are then grouped into a single vector x_i of size p + 1.

Vector Notation

This makes it possible to write the linear predictor function as follows:

f(i)={\boldsymbol {\beta }}\cdot \mathbf {x} _{i}

using the notation for a dot product between two vectors.

Matrix Notation

An equivalent form using matrix notation is as follows:

f(i)={\boldsymbol {\beta }}^{\mathrm {T} }\mathbf {x} _{i}=\mathbf {x} _{i}^{\mathrm {T} }{\boldsymbol {\beta }}

where ${\boldsymbol {\beta }}$ and $\mathbf {x} _{i}$ are assumed to be a (p+1)-by-1 column vectors, ${\boldsymbol {\beta }}^{\mathrm {T} }$ is the matrix transpose of ${\boldsymbol {\beta }}$ (so ${\boldsymbol {\beta }}^{\mathrm {T} }$ is a 1-by-(p+1) row vector), and ${\boldsymbol {\beta }}^{\mathrm {T} }\mathbf {x} _{i}$ indicates matrix multiplication between the 1-by-(p+1) row vector and the (p+1)-by-1 column vector, producing a 1-by-1 matrix that is taken to be a scalar.

Linear regression

An example of the usage of a linear predictor function is in linear regression, where each data point is associated with a continuous outcome y_i, and the relationship written

y_{i}=f(i)+\varepsilon _{i}={\boldsymbol {\beta }}^{\mathrm {T} }\mathbf {x} _{i}\ +\varepsilon _{i},

where $\varepsilon _{i}$ is a disturbance term or error variable — an unobserved random variable that adds noise to the linear relationship between the dependent variable and predictor function.

Stacking

In some models (standard linear regression, in particular), the equations for each of the data points i = 1, ..., n are stacked together and written in vector form as

\mathbf {y} =\mathbf {X} {\boldsymbol {\beta }}+{\boldsymbol {\varepsilon }},\,

where

\mathbf {y} ={\begin{pmatrix}y_{1}\\y_{2}\\\vdots \\y_{n}\end{pmatrix}},\quad \mathbf {X} ={\begin{pmatrix}\mathbf {x} '_{1}\\\mathbf {x} '_{2}\\\vdots \\\mathbf {x} '_{n}\end{pmatrix}}={\begin{pmatrix}x_{11}&\cdots &x_{1p}\\x_{21}&\cdots &x_{2p}\\\vdots &\ddots &\vdots \\x_{n1}&\cdots &x_{np}\end{pmatrix}},\quad {\boldsymbol {\beta }}={\begin{pmatrix}\beta _{1}\\\vdots \\\beta _{p}\end{pmatrix}},\quad {\boldsymbol {\varepsilon }}={\begin{pmatrix}\varepsilon _{1}\\\varepsilon _{2}\\\vdots \\\varepsilon _{n}\end{pmatrix}}.

The matrix X is known as the design matrix and encodes all known information about the independent variables. The variables $\varepsilon _{i}$ are random variables, which in standard linear regression are distributed according to a standard normal distribution; they express the influence of any unknown factors on the outcome.

This makes it possible to find optimal coefficients through the method of least squares using simple matrix operations. In particular, the optimal coefficients ${\boldsymbol {\hat {\beta }}}$ as estimated by least squares can be written as follows:

{\boldsymbol {\hat {\beta }}}=(X^{\mathrm {T} }X)^{-1}X^{\mathrm {T} }\mathbf {y} .

The matrix $(X^{\mathrm {T} }X)^{-1}X^{\mathrm {T} }$ is known as the Moore–Penrose pseudoinverse of X. The use of the matrix inverse in this formula requires that X is of full rank, i.e. there is not perfect multicollinearity among different explanatory variables (i.e. no explanatory variable can be perfectly predicted from the others). In such cases, the singular value decomposition can be used to compute the pseudoinverse.

Preprocessing of explanatory variables

When a fixed set of nonlinear functions are used to transform the value(s) of a data point, these functions are known as basis functions. An example is polynomial regression, which uses a linear predictor function to fit an arbitrary degree polynomial relationship (up to a given order) between two sets of data points (i.e. a single real-valued explanatory variable and a related real-valued dependent variable), by adding multiple explanatory variables corresponding to various powers of the existing explanatory variable. Mathematically, the form looks like this:

y_{i}=\beta _{0}+\beta _{1}x_{i}+\beta _{2}x_{i}^{2}+\cdots +\beta _{p}x_{i}^{p}.

In this case, for each data point i, a set of explanatory variables is created as follows:

(x_{i1}=x_{i},\quad x_{i2}=x_{i}^{2},\quad \ldots ,\quad x_{ip}=x_{i}^{p})

and then standard linear regression is run. The basis functions in this example would be

{\boldsymbol {\phi }}(x)=(\phi _{1}(x),\phi _{2}(x),\ldots ,\phi _{p}(x))=(x,x^{2},\ldots ,x^{p}).

This example shows that a linear predictor function can actually be much more powerful than it first appears: It only really needs to be linear in the coefficients. All sorts of non-linear functions of the explanatory variables can be fit by the model.

There is no particular need for the inputs to basis functions to be univariate or single-dimensional (or their outputs, for that matter, although in such a case, a K-dimensional output value is likely to be treated as K separate scalar-output basis functions). An example of this is radial basis functions (RBF's), which compute some transformed version of the distance to some fixed point:

\phi (\mathbf {x

An example is the Gaussian RBF, which has the same functional form as the normal distribution:

\phi (\mathbf {x

which drops off rapidly as the distance from c increases.

A possible usage of RBF's is to create one for every observed data point. This means that the result of an RBF applied to a new data point will be close to 0 unless the new point is near to the point around which the RBF was applied. That is, the application of the radial basis functions will pick out the nearest point, and its regression coefficient will dominate. The result will be a form of nearest neighbor interpolation, where predictions are made by simply using the prediction of the nearest observed data point, possibly interpolating between multiple nearby data points when they are all similar distances away. This type of nearest neighbor method for prediction is often considered diametrically opposed to the type of prediction used in standard linear regression: But in fact, the transformations that can be applied to the explanatory variables in a linear predictor function are so powerful that even the nearest neighbor method can be implemented as a type of linear regression.

It is even possible to fit some functions that appear non-linear in the coefficients by transforming the coefficients into new coefficients that do appear linear. For example, a function of the form $a+b^{2}x_{i1}+{\sqrt {c}}x_{i2}$ for coefficients $a,b,c$ could be transformed into the appropriate linear function by applying the substitutions $b'=b^{2},c'={\sqrt {c}},$ leading to $a+b'x_{i1}+c'x_{i2},$ which is linear. Linear regression and similar techniques could be applied and will often still find the optimal coefficients, but their error estimates and such will be wrong.

The explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables (e.g. income, age, blood pressure, etc.) and discrete variables (e.g. sex, race, political party, etc.). Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), i.e. separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have the given value". For example, a four-way discrete variable of blood type with the possible values "A, B, AB, O" would be converted to separate two-way dummy variables, "is-A, is-B, is-AB, is-O", where only one of them has the value 1 and all the rest have the value 0. This allows for separate regression coefficients to be matched for each possible value of the discrete variable.

Note that, for K categories, not all K dummy variables are independent of each other. For example, in the above blood type example, only three of the four dummy variables are independent, in the sense that once the values of three of the variables are known, the fourth is automatically determined. Thus, it's really only necessary to encode three of the four possibilities as dummy variables, and in fact if all four possibilities are encoded, the overall model becomes non-identifiable. This causes problems for a number of methods, such as the simple closed-form solution used in linear regression. The solution is either to avoid such cases by eliminating one of the dummy variables, and/or introduce a regularization constraint (which necessitates a more powerful, typically iterative, method for finding the optimal coefficients).^[7]

Related Research Articles

In statistics, the term linear model refers to any model which assumes linearity in the system. The most common occurrence is in connection with regression models and the term is often taken as synonymous with linear regression model. However, the term is also used in time series analysis with a different meaning. In each case, the designation "linear" is used to identify a subclass of models for which substantial reduction in the complexity of the related statistical theory is possible.

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal, nor do they need to be independent and identically distributed. The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator, ridge regression, or simply any degenerate estimator.

In statistics, the logistic model is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression is estimating the parameters of a logistic model. Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable or a continuous variable. The corresponding probability of the value labeled "1" can vary between 0 and 1, hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

In statistics, a confidence region is a multi-dimensional generalization of a confidence interval. It is a set of points in an n-dimensional space, often represented as an ellipsoid around a point which is an estimated solution to a problem, although other shapes can occur.

In statistics, a linear probability model (LPM) is a special case of a binary regression model. Here the dependent variable for each observation takes values which are either 0 or 1. The probability of observing a 0 or 1 in any one case is treated as depending on one or more explanatory variables. For the "linear probability model", this relationship is a particularly simple one, and allows the model to be fitted by linear regression.

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the input dataset and the output of the (linear) function of the independent variable.

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables.

In statistics, generalized least squares (GLS) is a method used to estimate the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in the regression model. GLS is employed to improve statistical efficiency and reduce the risk of drawing erroneous inferences compared to conventional least squares and weighted least squares methods. It was first described by Alexander Aitken in 1935.

In statistics, binomial regression is a regression analysis technique in which the response has a binomial distribution: it is the number of successes in a series of independent Bernoulli trials, where each trial has probability of success . In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressandconditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which $given is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.$

In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors, Eicker–Huber–White standard errors, to recognize the contributions of Friedhelm Eicker, Peter J. Huber, and Halbert White.

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in $space, where is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted i.e., to be influential points. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal elements of the hat matrix.$

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the load of the main article with mathematics and improving its accessibility, while at the same time retaining the completeness of exposition.

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

In statistics, the class of vector generalized linear models (VGLMs) was proposed to enlarge the scope of models catered for by generalized linear models (GLMs). In particular, VGLMs allow for response variables outside the classical exponential family and for more than one parameter. Each parameter can be transformed by a link function. The VGLM framework is also large enough to naturally accommodate multiple responses; these are several independent responses each coming from a particular statistical distribution with possibly different parameter values.

In statistics, particularly regression analysis, the Working–Hotelling procedure, named after Holbrook Working and Harold Hotelling, is a method of simultaneous estimation in linear regression models. One of the first developments in simultaneous inference, it was devised by Working and Hotelling for the simple linear regression model in 1929. It provides a confidence region for multiple mean responses, that is, it gives the upper and lower bounds of more than one value of a dependent variable at several levels of the independent variables at a certain confidence level. The resulting confidence bands are known as the Working–Hotelling–Scheffé confidence bands.

References

↑ Makhoul, J. (1975). "Linear prediction: A tutorial review". Proceedings of the IEEE. 63 (4): 561–580. Bibcode:1975IEEEP..63..561M. doi:10.1109/PROC.1975.9792. ISSN 0018-9219.
↑ David A. Freedman (2009). Statistical Models: Theory and Practice . Cambridge University Press. p. 26. ISBN 9780521743853. A simple regression equation has on the right hand side an intercept and an explanatory variable with a slope coefficient. A multiple regression equation has two or more explanatory variables on the right hand side, each with its own slope coefficient
↑ Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.
↑ Cortes, Corinna; Vapnik, Vladimir N. (1995). "Support-vector networks" (PDF). Machine Learning . 20 (3): 273–297. CiteSeerX 10.1.1.15.9362 . doi:10.1007/BF00994018.
↑ McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience. ISBN 978-0-471-69115-0. MR 1190469.
↑ Jolliffe I.T. Principal Component Analysis, Series: Springer Series in Statistics, 2nd ed., Springer, NY, 2002, XXIX, 487 p. 28 illus. ISBN 978-0-387-95442-4
↑ Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN 978-0-387-84884-6.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Makhoul, J. (1975). "Linear prediction: A tutorial review". Proceedings of the IEEE. 63 (4): 561–580. Bibcode:1975IEEEP..63..561M. doi:10.1109/PROC.1975.9792. ISSN 0018-9219.

[Freedman09-2] David A. Freedman (2009). Statistical Models: Theory and Practice . Cambridge University Press. p. 26. ISBN 9780521743853. A simple regression equation has on the right hand side an intercept and an explanatory variable with a slope coefficient. A multiple regression equation has two or more explanatory variables on the right hand side, each with its own slope coefficient

[3] Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.

[CorinnaCortes-4] Cortes, Corinna; Vapnik, Vladimir N. (1995). "Support-vector networks" (PDF). Machine Learning . 20 (3): 273–297. CiteSeerX 10.1.1.15.9362 . doi:10.1007/BF00994018.

[McLachlan:2004-5] McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience. ISBN 978-0-471-69115-0. MR 1190469.

[Principal_Component_Analysis-6] Jolliffe I.T. Principal Component Analysis, Series: Springer Series in Statistics, 2nd ed., Springer, NY, 2002, XXIX, 487 p. 28 illus. ISBN 978-0-387-95442-4

[7] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN 978-0-387-84884-6.

[1]

[2]

[3]

[4]

[5]

[6]

[7]