Control function (econometrics)

Last updated January 03, 2025

Control functions (also known as two-stage residual inclusion) are statistical methods to correct for endogeneity problems by modelling the endogeneity in the error term. The approach thereby differs in important ways from other models that try to account for the same econometric problem. Instrumental variables, for example, attempt to model the endogenous variable X as an often invertible model with respect to a relevant and exogenous instrument Z. Panel analysis uses special data properties to difference out unobserved heterogeneity that is assumed to be fixed over time.

Control functions were introduced by Heckman and Robb^[1] although the principle can be traced back to earlier papers.^[2] A particular reason why they are popular is because they work for non-invertible models (such as discrete choice models) and allow for heterogeneous effects, where effects at the individual level can differ from effects at the aggregate.^[3] A well-known example of the control function approach is the Heckman correction.

Formal definition

Assume we start from a standard endogenous variable setup with additive errors, where X is an endogenous variable, and Z is an exogenous variable that can serve as an instrument.

Y=g(X)+U

1

X=\pi (Z)+V

2

E[U\mid Z,V]=E[U\mid V]

3

E[V\mid Z]=0

4

A popular instrumental variable approach is to use a two-step procedure and estimate equation ( 2 ) first and then use the estimates of this first step to estimate equation ( 1 ) in a second step. The control function, however, uses that this model implies

E[Y\mid Z,V]=g(X)+E[U\mid Z,V]=g(X)+E[U\mid V]=g(X)+h(V)

5

The function h(V) is effectively the control function that models the endogeneity and where this econometric approach lends its name from.^[4]

In a Rubin causal model potential outcomes framework, where Y₁ is the outcome variable of people for who the participation indicator D equals 1, the control function approach leads to the following model

E[Y_{1}\mid X,Z,D=1]=\mu _{1}(X)+E[U\mid D=1]

6

as long as the potential outcomes Y₀ and Y₁ are independent of D conditional on X and Z.^[5]

Variance correction

Since the second-stage regression includes generated regressors, its variance-covariance matrix needs to be adjusted.^[6]^[7]

Examples

Endogeneity in Poisson regression

Wooldridge and Terza provide a methodology to both deal with and test for endogeneity within the exponential regression framework, which the following discussion follows closely.^[8] While the example focuses on a Poisson regression model, it is possible to generalize to other exponential regression models, although this may come at the cost of additional assumptions (e.g. for binary response or censored data models).

Assume the following exponential regression model, where $a_{i}$ is an unobserved term in the latent variable. We allow for correlation between $a_{i}$ and $x_{i}$ (implying $x_{i}$ is possibly endogenous), but allow for no such correlation between $a_{i}$ and $z_{i}$ .

\operatorname {E} [y_{i}\mid x_{i},z_{i},a_{i}]=\exp(x_{i}b_{0}+z_{i}c_{0}+a_{i})

The variables $z_{i}$ serve as instrumental variables for the potentially endogenous $x_{i}$ . One can assume a linear relationship between these two variables or alternatively project the endogenous variable $x_{i}$ onto the instruments to get the following reduced form equation:

x_{i}=z_{i}\Pi +v_{i}

1

The usual rank condition is needed to ensure identification. The endogeneity is then modeled in the following way, where $\rho$ determines the severity of endogeneity and $v_{i}$ is assumed to be independent of $e_{i}$ .

a_{i}=v_{i}\rho +e_{i}

Imposing these assumptions, assuming the models are correctly specified, and normalizing $\operatorname {E} [\exp(e_{i})]=1$ , we can rewrite the conditional mean as follows:

\operatorname {E} [y_{i}\mid x_{i},z_{i},v_{i}]=\exp(x_{i}b_{0}+z_{i}c_{0}+v_{i}\rho )

2

If $v_{i}$ were known at this point, it would be possible to estimate the relevant parameters by quasi-maximum likelihood estimation (QMLE). Following the two step procedure strategies, Wooldridge and Terza propose estimating equation ( 1 ) by ordinary least squares. The fitted residuals from this regression can then be plugged into the estimating equation ( 2 ) and QMLE methods will lead to consistent estimators of the parameters of interest. Significance tests on ${\hat {\rho }}$ can then be used to test for endogeneity within the model.

Extensions

The original Heckit procedure makes distributional assumptions about the error terms, however, more flexible estimation approaches with weaker distributional assumptions have been established.^[9] Furthermore, Blundell and Powell show how the control function approach can be particularly helpful in models with nonadditive errors, such as discrete choice models.^[10] This latter approach, however, does implicitly make strong distributional and functional form assumptions.^[5]

Related Research Articles

<span class="mw-page-title-main">Econometrics</span> Empirical statistical testing of economic theories

Econometrics is an application of statistical methods to economic data in order to give empirical content to economic relationships. More precisely, it is "the quantitative analysis of actual economic phenomena based on the concurrent development of theory and observation, related by appropriate methods of inference." An introductory economics textbook describes econometrics as allowing economists "to sift through mountains of data to extract simple relationships." Jan Tinbergen is one of the two founding fathers of econometrics. The other, Ragnar Frisch, also coined the term in the sense in which it is used today.

Simultaneous equations models are a type of statistical model in which the dependent variables are functions of other dependent variables, rather than just independent variables. This means some of the explanatory variables are jointly determined with the dependent variable, which in economics usually is the consequence of some underlying equilibrium mechanism. Take the typical supply and demand model: whilst typically one would determine the quantity supplied and demanded to be a function of the price set by the market, it is also possible for the reverse to be true, where producers observe the quantity that consumers demand and then set the price.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more error-free independent variables. The most common form of regression analysis is linear regression, in which one finds the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line that minimizes the sum of squared differences between the true data and that line. For specific mathematical reasons, this allows the researcher to estimate the conditional expectation of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters or estimate the conditional expectation across a broader collection of non-linear models.

In statistics, and particularly in econometrics, the reduced form of a system of equations is the result of solving the system for the endogenous variables. This gives the latter as functions of the exogenous variables, if any. In econometrics, the equations of a structural form model are estimated in their theoretically given form, while an alternative approach to estimation is to first solve the theoretical equations for the endogenous variables to obtain reduced form equations, and then to estimate the reduced form equations.

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias results in the model attributing the effect of the missing variables to those that were included.

In econometrics, endogeneity broadly refers to situations in which an explanatory variable is correlated with the error term. The distinction between endogenous and exogenous variables originated in simultaneous equations models, where one separates variables whose values are determined by the model from variables which are predetermined. Ignoring simultaneity in the estimation leads to biased estimates as it violates the exogeneity assumption of the Gauss–Markov theorem. The problem of endogeneity is often ignored by researchers conducting non-experimental research and doing so precludes making policy recommendations. Instrumental variable techniques are commonly used to mitigate this problem.

Panel (data) analysis is a statistical method, widely used in social science, epidemiology, and econometrics to analyze two-dimensional panel data. The data are usually collected over time and over the same individuals and then a regression is run over these two dimensions. Multidimensional analysis is an econometric method in which data are collected over more than two dimensions.

In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

In econometrics, the seemingly unrelated regressions (SUR) or seemingly unrelated regression equations (SURE) model, proposed by Arnold Zellner in (1962), is a generalization of a linear regression model that consists of several regression equations, each having its own dependent variable and potentially different sets of exogenous explanatory variables. Each equation is a valid linear regression on its own and can be estimated separately, which is why the system is called seemingly unrelated, although some authors suggest that the term seemingly related would be more appropriate, since the error terms are assumed to be correlated across the equations.

Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. That is, no parametric equation is assumed for the relationship between predictors and dependent variable. Nonparametric regression requires larger sample sizes than regression based on parametric models because the data must supply the model structure as well as the parameter estimates.

In statistics, semiparametric regression includes regression models that combine parametric and nonparametric models. They are often used in situations where the fully nonparametric model may not perform well or when the researcher wants to use a parametric model but the functional form with respect to a subset of the regressors or the density of the errors is not known. Semiparametric regression models are a particular type of semiparametric modelling and, since semiparametric models contain a parametric component, they rely on parametric assumptions and may be misspecified and inconsistent, just like a fully parametric model.

The Heckman correction is a statistical technique to correct bias from non-randomly selected samples or otherwise incidentally truncated dependent variables, a pervasive issue in quantitative social sciences when using observational data. Conceptually, this is achieved by explicitly modelling the individual sampling probability of each observation together with the conditional expectation of the dependent variable. The resulting likelihood function is mathematically similar to the tobit model for censored dependent variables, a connection first drawn by James Heckman in 1974. Heckman also developed a two-step control function approach to estimate this model, which avoids the computational burden of having to estimate both equations jointly, albeit at the cost of inefficiency. Heckman received the Nobel Memorial Prize in Economic Sciences in 2000 for his work in this field.

In statistics, an errors-in-variables model or a measurement error model is a regression model that accounts for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

The methodology of econometrics is the study of the range of differing approaches to undertaking econometric analysis.

In econometrics, the Arellano–Bond estimator is a generalized method of moments estimator used to estimate dynamic models of panel data. It was proposed in 1991 by Manuel Arellano and Stephen Bond, based on the earlier work by Alok Bhargava and John Denis Sargan in 1983, for addressing certain endogeneity problems. The GMM-SYS estimator is a system that contains both the levels and the first difference equations. It provides an alternative to the standard first difference GMM estimator.

In statistics, a fixed-effect Poisson model is a Poisson regression model used for static panel data when the outcome variable is count data. Hausman, Hall, and Griliches pioneered the method in the mid 1980s. Their outcome of interest was the number of patents filed by firms, where they wanted to develop methods to control for the firm fixed effects. Linear panel data models use the linear additivity of the fixed effects to difference them out and circumvent the incidental parameter problem. Even though Poisson models are inherently nonlinear, the use of the linear index and the exponential link function lead to multiplicative separability, more specifically

In applied statistics, fractional models are, to some extent, related to binary response models. However, instead of estimating the probability of being in one bin of a dichotomous variable, the fractional model typically deals with variables that take on all possible values in the unit interval. One can easily generalize this model to take on values on any other interval by appropriate transformations. Examples range from participation rates in 401(k) plans to television ratings of NBA games.

Issues of heterogeneity in duration models can take on different forms. On the one hand, unobserved heterogeneity can play a crucial role when it comes to different sampling methods, such as stock or flow sampling. On the other hand, duration models have also been extended to allow for different subpopulations, with a strong link to mixture models. Many of these models impose the assumptions that heterogeneity is independent of the observed covariates, it has a distribution that depends on a finite number of parameters only, and it enters the hazard function multiplicatively.

A dynamic unobserved effects model is a statistical model used in econometrics for panel analysis. It is characterized by the influence of previous values of the dependent variable on its present value, and by the presence of unobservable explanatory variables.

References

↑ Heckman, James J.; Robb, Richard (1985). "Alternative methods for evaluating the impact of interventions". Journal of Econometrics. 30 (1–2). Elsevier BV: 239–267. doi:10.1016/0304-4076(85)90139-3. ISSN 0304-4076.
↑ Telser, L. G. (1964). "Iterative Estimation of a Set of Linear Regression Equations". Journal of the American Statistical Association . 59 (307): 845–862. doi:10.1080/01621459.1964.10480731.
↑ Arellano, M. (2008). "Binary Models with Endogenous Explanatory Variables" (PDF). Class notes.
↑ Arellano, M. (2003): Endogeneity and Instruments in Nonparametric Models. Comments to papers by Darolles, Florens & Renault; and Blundell & Powell. Advances in Economics and Econometrics, Theory and Applications, Eight World Congress. Volume II, ed. by M. Dewatripont, L.P. Hansen, and S.J. Turnovsky. Cambridge University Press, Cambridge.
1 2 Heckman, J. J., and E. J. Vytlacil (2007): Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast the Effects in New Environments. Handbook of Econometrics, Vol 6, ed. by J. J. Heckman and E. E. Leamer. North Holland.
↑ Murphy, Kevin M.; Topel, Robert H. (1985). "Estimation and Inference in Two-Step Econometric Models". Journal of Business & Economic Statistics. 3 (4): 370–379. doi:10.1080/07350015.1985.10509471. JSTOR 1391724.
↑ Gauger, Jean (1989). "The Generated Regressor Correction: Impacts Upon Inferences in Hypothesis Testing". Journal of Macroeconomics . 11 (3): 383–395. doi:10.1016/0164-0704(89)90065-7.
↑ Wooldridge 1997, pp. 382–383; Terza 1998
↑ Matzkin, R. L. (2003). "Nonparametric Estimation of Nonadditive Random Functions" (PDF). Econometrica . 71 (5): 1339–1375. doi:10.1111/1468-0262.00452. hdl: 10908/409 .
↑ Blundell, R., and J. L. Powell (2003): Endogeneity in Nonparametric and Semiparametric Regression Models. Advances in Economics and Econometrics, Theory and Applications, Eight World Congress. Volume II, ed. by M. Dewatripont, L.P. Hansen, and S.J. Turnovsky. Cambridge University Press, Cambridge.