# Semiparametric regression

In statistics, semiparametric regression refers to regression models that combine parametric and nonparametric components. Such models are often used when a fully nonparametric model would perform poorly, or when the researcher wants a parametric model but does not know the functional form with respect to a subset of the regressors or the density of the errors. Semiparametric regression models are a particular type of semiparametric model; because they contain a parametric component, they rely on parametric assumptions and, like a fully parametric model, may be misspecified and inconsistent.

## Methods

Many different semiparametric regression methods have been proposed and developed. The most popular methods are the partially linear, index and varying coefficient models.

### Partially linear models

A partially linear model is given by

${\displaystyle Y_{i}=X'_{i}\beta +g\left(Z_{i}\right)+u_{i},\,\quad i=1,\ldots ,n,\,}$

where ${\displaystyle Y_{i}}$ is the dependent variable, ${\displaystyle X_{i}}$ is a ${\displaystyle p\times 1}$ vector of explanatory variables, ${\displaystyle \beta }$ is a ${\displaystyle p\times 1}$ vector of unknown parameters and ${\displaystyle Z_{i}\in \operatorname {R} ^{q}}$. The parametric part of the partially linear model is given by the parameter vector ${\displaystyle \beta }$, while the nonparametric part is the unknown function ${\displaystyle g\left(Z_{i}\right)}$. The data are assumed to be i.i.d. with ${\displaystyle E\left(u_{i}|X_{i},Z_{i}\right)=0}$, and the model allows for a conditionally heteroskedastic error process ${\displaystyle E\left(u_{i}^{2}|X_{i},Z_{i}\right)=\sigma ^{2}\left(X_{i},Z_{i}\right)}$ of unknown form. This type of model was proposed by Robinson (1988) and extended to handle categorical covariates by Racine and Li (2007).

This model is estimated in two steps: first a ${\displaystyle {\sqrt {n}}}$-consistent estimator of ${\displaystyle \beta }$ is obtained, and then an estimator of ${\displaystyle g\left(Z_{i}\right)}$ is derived from the nonparametric regression of ${\displaystyle Y_{i}-X'_{i}{\hat {\beta }}}$ on ${\displaystyle Z_{i}}$ using an appropriate nonparametric regression method. [1]
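The two-step procedure can be sketched on simulated data. The following is a minimal illustration of Robinson's double-residual idea (partial out ${\displaystyle Z}$ from both ${\displaystyle Y}$ and ${\displaystyle X}$ nonparametrically, then run OLS on the residuals), using a Nadaraya–Watson smoother with a fixed, arbitrarily chosen bandwidth; data-driven bandwidth selection and the trimming used in Robinson (1988) are omitted.

```python
import numpy as np

def nw_fit(z_train, t_train, z_eval, h):
    """Nadaraya-Watson kernel regression of t on z with a Gaussian kernel."""
    # (n_eval, n_train) matrix of normalized kernel weights
    w = np.exp(-0.5 * ((z_eval[:, None] - z_train[None, :]) / h) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ t_train

def robinson_plm(y, X, z, h=0.1):
    """Robinson's (1988) double-residual estimator for the partially linear model."""
    # Residualize Y and each column of X on Z nonparametrically
    ey = y - nw_fit(z, y, z, h)
    eX = X - nw_fit(z, X, z, h)
    # OLS of the residualized Y on the residualized X estimates beta
    beta_hat, *_ = np.linalg.lstsq(eX, ey, rcond=None)
    # Recover g by regressing Y - X'beta_hat on Z
    g_hat = nw_fit(z, y - X @ beta_hat, z, h)
    return beta_hat, g_hat

rng = np.random.default_rng(0)
n = 2000
z = rng.uniform(0, 1, n)
X = np.column_stack([z + rng.normal(0, 1, n)])   # X correlated with Z
beta_true = np.array([2.0])
y = X @ beta_true + np.sin(2 * np.pi * z) + rng.normal(0, 0.5, n)

beta_hat, g_hat = robinson_plm(y, X, z)
print(beta_hat)  # should be close to the true value 2.0
```

Because ${\displaystyle X}$ is correlated with ${\displaystyle Z}$ here, a naive OLS of ${\displaystyle Y}$ on ${\displaystyle X}$ alone would be biased; residualizing both variables on ${\displaystyle Z}$ first removes that confounding.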

### Index models

A single index model takes the form

${\displaystyle Y=g\left(X'\beta _{0}\right)+u,\,}$

where ${\displaystyle Y}$, ${\displaystyle X}$ and ${\displaystyle \beta _{0}}$ are defined as before and the error term ${\displaystyle u}$ satisfies ${\displaystyle E\left(u|X\right)=0}$. The single index model takes its name from the parametric part of the model, ${\displaystyle X'\beta _{0}}$, which is a scalar single index. The nonparametric part is the unknown function ${\displaystyle g\left(\cdot \right)}$.

#### Ichimura's method

The single index model method developed by Ichimura (1993) is as follows. Consider the situation in which ${\displaystyle y}$ is continuous. Given a known form for the function ${\displaystyle g\left(\cdot \right)}$, ${\displaystyle \beta _{0}}$ could be estimated using the nonlinear least squares method to minimize the function

${\displaystyle \sum _{i=1}^{n}\left(Y_{i}-g\left(X'_{i}\beta \right)\right)^{2}.}$

Since the functional form of ${\displaystyle g\left(\cdot \right)}$ is not known, it must be estimated. For a given value of ${\displaystyle \beta }$, an estimate of the function

${\displaystyle G\left(X'_{i}\beta \right)=E\left(Y_{i}|X'_{i}\beta \right)=E\left[g\left(X'_{i}\beta _{0}\right)|X'_{i}\beta \right]}$

can be obtained using kernel methods. Ichimura (1993) proposes estimating ${\displaystyle g\left(X'_{i}\beta \right)}$ with

${\displaystyle {\hat {G}}_{-i}\left(X'_{i}\beta \right),\,}$

the leave-one-out nonparametric kernel estimator of ${\displaystyle G\left(X'_{i}\beta \right)}$. The estimator of ${\displaystyle \beta _{0}}$ is then the value of ${\displaystyle \beta }$ that minimizes the resulting sum of squared residuals ${\displaystyle \sum _{i}\left(Y_{i}-{\hat {G}}_{-i}\left(X'_{i}\beta \right)\right)^{2}}$.
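The estimator can be sketched as follows on simulated data. This is a schematic sketch, not Ichimura's full procedure: the first coefficient is normalized to one for identification, the bandwidth is fixed rather than chosen jointly with ${\displaystyle \beta }$, a simple grid search over the free coefficient stands in for numerical minimization, and the trimming of low-density observations in Ichimura (1993) is omitted.

```python
import numpy as np

def loo_nw(index, y, h):
    """Leave-one-out Nadaraya-Watson estimate of E[Y | index] at each observation."""
    w = np.exp(-0.5 * ((index[:, None] - index[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)  # exclude each observation from its own fit
    return (w @ y) / w.sum(axis=1)

def sls_objective(theta, y, X, h):
    """Semiparametric least squares criterion with beta = (1, theta)',
    where the first coefficient is normalized to 1 for identification."""
    index = X @ np.array([1.0, theta])
    resid = y - loo_nw(index, y, h)
    return np.mean(resid ** 2)

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 2))
theta_true = -0.5
index = X @ np.array([1.0, theta_true])
y = np.sin(index) + index + 0.3 * rng.normal(size=n)  # g is treated as unknown

# Grid search over the free coefficient (a real implementation would
# use a numerical optimizer)
grid = np.linspace(-1.5, 0.5, 81)
theta_hat = grid[np.argmin([sls_objective(t, y, X, 0.2) for t in grid])]
print(theta_hat)  # should be near the true value -0.5
```

Note that the criterion never uses the true ${\displaystyle g}$: for each candidate ${\displaystyle \beta }$, the leave-one-out smoother supplies its best nonparametric stand-in, and the correct index direction is the one at which that stand-in fits ${\displaystyle Y}$ best.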

#### Klein and Spady's estimator

If the dependent variable ${\displaystyle y}$ is binary and ${\displaystyle X_{i}}$ and ${\displaystyle u_{i}}$ are assumed to be independent, Klein and Spady (1993) propose a technique for estimating ${\displaystyle \beta }$ using maximum likelihood methods. The log-likelihood function is given by

${\displaystyle L\left(\beta \right)=\sum _{i}\left(1-Y_{i}\right)\ln \left(1-{\hat {g}}_{-i}\left(X'_{i}\beta \right)\right)+\sum _{i}Y_{i}\ln \left({\hat {g}}_{-i}\left(X'_{i}\beta \right)\right),}$

where ${\displaystyle {\hat {g}}_{-i}\left(X'_{i}\beta \right)}$ is the leave-one-out estimator.
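A schematic sketch of this estimator on simulated binary data follows. As in the previous sketch, the first coefficient is normalized to one, the bandwidth is fixed, grid search replaces numerical maximization, and the fitted probabilities are clipped away from 0 and 1 to keep the logarithms finite; these are illustrative simplifications, not part of Klein and Spady's procedure.

```python
import numpy as np

def loo_nw(index, y, h):
    """Leave-one-out Nadaraya-Watson estimate of E[Y | index] at each observation."""
    w = np.exp(-0.5 * ((index[:, None] - index[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)
    return (w @ y) / w.sum(axis=1)

def klein_spady_loglik(theta, y, X, h, eps=1e-6):
    """Klein-Spady log-likelihood with beta = (1, theta)' normalization.
    For binary Y, the leave-one-out smoother estimates P(Y = 1 | index)."""
    p = loo_nw(X @ np.array([1.0, theta]), y, h)
    p = np.clip(p, eps, 1 - eps)  # keep the logs finite
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(2)
n = 1500
X = rng.normal(size=(n, 2))
theta_true = 1.0
index = X @ np.array([1.0, theta_true])
y = (index + rng.logistic(size=n) > 0).astype(float)  # link treated as unknown

grid = np.linspace(0.0, 2.0, 41)
theta_hat = grid[np.argmax([klein_spady_loglik(t, y, X, 0.3) for t in grid])]
print(theta_hat)  # should be near the true value 1.0
```

Although the data above are generated from a logit model, the estimator never assumes that link: the leave-one-out smoother estimates the response probability nonparametrically for each candidate index.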

### Smooth coefficient/varying coefficient models

Hastie and Tibshirani (1993) propose a smooth coefficient model given by

${\displaystyle Y_{i}=\alpha \left(Z_{i}\right)+X'_{i}\beta \left(Z_{i}\right)+u_{i}=\left(1,X'_{i}\right)\left({\begin{array}{c}\alpha \left(Z_{i}\right)\\\beta \left(Z_{i}\right)\end{array}}\right)+u_{i}=W'_{i}\gamma \left(Z_{i}\right)+u_{i},}$

where ${\displaystyle X_{i}}$ is a ${\displaystyle k\times 1}$ vector and ${\displaystyle \alpha \left(z\right)}$ and ${\displaystyle \beta \left(z\right)}$ are unspecified smooth functions of ${\displaystyle z}$.

Under the assumption ${\displaystyle E\left(u_{i}|W_{i},Z_{i}\right)=0}$, ${\displaystyle \gamma \left(\cdot \right)}$ may be expressed as

${\displaystyle \gamma \left(Z_{i}\right)=\left(E\left[W_{i}W'_{i}|Z_{i}\right]\right)^{-1}E\left[W_{i}Y_{i}|Z_{i}\right].}$
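This conditional-moment expression suggests an estimator: replace the two conditional expectations with kernel-weighted sample averages around each evaluation point, which amounts to a kernel-weighted (local constant) least squares fit at each ${\displaystyle z}$. A minimal sketch on simulated data, with a fixed, arbitrarily chosen bandwidth:

```python
import numpy as np

def smooth_coef(y, x, z, z_eval, h=0.15):
    """Local constant (kernel-weighted least squares) estimator of gamma(z)
    in Y = W' gamma(Z) + u with W = (1, X')'."""
    W = np.column_stack([np.ones_like(z), x])
    out = np.empty((len(z_eval), W.shape[1]))
    for j, z0 in enumerate(z_eval):
        k = np.exp(-0.5 * ((z - z0) / h) ** 2)  # Gaussian kernel weights
        A = (W * k[:, None]).T @ W              # kernel-weighted sum of W_i W_i'
        b = (W * k[:, None]).T @ y              # kernel-weighted sum of W_i Y_i
        out[j] = np.linalg.solve(A, b)          # sample analogue of gamma(z0)
    return out  # columns: alpha_hat(z), beta_hat(z)

rng = np.random.default_rng(3)
n = 2000
z = rng.uniform(0, 1, n)
x = rng.normal(size=n)
y = np.sin(2 * np.pi * z) + (1 + z ** 2) * x + 0.3 * rng.normal(size=n)

# Evaluate at z = 0.5, where alpha(z) = sin(pi) = 0 and beta(z) = 1.25
gamma_hat = smooth_coef(y, x, z, np.array([0.5]))
print(gamma_hat)
```

Each row of the output is the local analogue of the displayed formula: the kernel weights restrict the two expectations to observations with ${\displaystyle Z_{i}}$ near the evaluation point.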

## Notes

1. See Li and Racine (2007) for an in-depth look at nonparametric regression methods.

## References

• Robinson, P. M. (1988). "Root-n Consistent Semiparametric Regression". Econometrica. 56 (4): 931–954. doi:10.2307/1912705. JSTOR 1912705.
• Li, Qi; Racine, Jeffrey S. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press. ISBN 978-0-691-12161-1.
• Racine, J. S.; Li, Q. (2007). "A Partially Linear Kernel Estimator for Categorical Data". Unpublished manuscript, McMaster University.
• Ichimura, H. (1993). "Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single Index Models". Journal of Econometrics. 58 (1–2): 71–120. doi:10.1016/0304-4076(93)90114-K.
• Klein, R. W.; Spady, R. H. (1993). "An Efficient Semiparametric Estimator for Binary Response Models". Econometrica. 61 (2): 387–421. doi:10.2307/2951556. JSTOR 2951556.
• Hastie, T.; Tibshirani, R. (1993). "Varying-Coefficient Models". Journal of the Royal Statistical Society, Series B. 55: 757–796.