Generated regressor

In least squares estimation problems, sometimes one or more regressors specified in the model are not observable. One way to circumvent this issue is to estimate or generate regressors from observable data. [1] This generated regressor method is also applicable to unobserved instrumental variables. Under some regularity conditions, consistency and asymptotic normality of the least squares estimator are preserved, but the asymptotic variance has a different form in general.

Least squares: method in statistics

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the residuals made in the results of every single equation.
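
As a minimal sketch (using NumPy, with made-up numbers), the following solves an overdetermined three-equation, two-unknown system by minimizing the sum of squared residuals:

    import numpy as np

    # Overdetermined system: 3 equations, 2 unknowns, no exact solution.
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    b = np.array([1.1, 1.9, 3.2])

    # Least squares: choose x to minimize ||Ax - b||^2.
    x, residual_ss, rank, sv = np.linalg.lstsq(A, b, rcond=None)
    print(x)            # fitted coefficients
    print(residual_ss)  # sum of squared residuals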

Suppose the model of interest is the following:

    y = g(x, z; β) + u,   E[u ∣ x, z] = 0,

where g is a conditional mean function and its form is known up to the finite-dimensional parameter β. Here the regressor z is not observable, but we know that z = h(w; γ) for some function h known up to the parameter γ, and a random sample {(yᵢ, xᵢ, wᵢ) : i = 1, …, n} is available. Suppose we have a consistent estimator γ̂ of γ that uses the observations wᵢ. Then, β can be estimated by (non-linear) least squares using the generated regressor ẑᵢ = h(wᵢ; γ̂). Some examples of the above setup include Anderson et al. (1976) [2] and Barro (1977). [3]
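
A minimal linear sketch of this two-step procedure (the variable names and numbers are illustrative assumptions, not taken from the cited papers): suppose z = wγ and a noisy signal s = wγ + v is observed, so that γ̂ comes from a first-stage OLS regression of s on w, and the second stage regresses y on x and ẑ = wγ̂.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    # Simulated data (illustrative assumptions)
    w = rng.normal(size=n)
    x = rng.normal(size=n)
    gamma = 0.8
    z = gamma * w                   # unobserved regressor: z = h(w; gamma)
    s = z + rng.normal(size=n)      # observable noisy signal of z
    y = 1.0 * x + 2.0 * z + rng.normal(size=n)

    # Step 1: consistent estimator of gamma by OLS of s on w
    gamma_hat = (w @ s) / (w @ w)

    # Step 2: least squares with the generated regressor z_hat
    z_hat = gamma_hat * w
    X = np.column_stack([x, z_hat])
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(beta_hat)                 # close to (1.0, 2.0) in large samples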

This problem falls into the framework of two-step M-estimators, and thus consistency and asymptotic normality of the estimator can be verified using the general theory of two-step M-estimators. [4] As in the general two-step M-estimator problem, the asymptotic variance of a generated regressor estimator is usually different from that of the estimator with all regressors observed. Yet, in some special cases, the asymptotic variances of the two estimators are identical. To give one such example, consider the setting in which the regression function is linear in the parameter and the unobserved regressor is a scalar. Denoting the coefficient of the unobserved regressor by δ, if δ = 0 and E[u ∣ x, w] = 0, then the asymptotic variance is independent of whether the regressor is observed. [4]
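
Heuristically, in the linear specification y = x′β + δz + u with ẑᵢ = h(wᵢ; γ̂), a first-order expansion of the second-step estimator gives (a sketch; the notation below is assumed for illustration rather than taken from the sources):

    \sqrt{n}(\hat\theta - \theta_0)
      = A^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r_i u_i
        \;-\; \delta_0 A^{-1} F \sqrt{n}(\hat\gamma - \gamma_0) + o_p(1),

where θ = (β′, δ)′, rᵢ = (xᵢ′, zᵢ)′, A = E[rᵢrᵢ′], and F = E[rᵢ∇γh(wᵢ; γ₀)′]. The second term carries the first-stage sampling error and vanishes when δ₀ = 0, which is why the usual least squares variance applies in that special case.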

With minor modifications to the model, the above formulation is also applicable to instrumental variables (IV) estimation. Suppose the model of interest is linear in the parameter, the error term u is correlated with some of the regressors, and the model specifies instrumental variables z*, which are not observable but have the representation z* = h(w; γ). If a consistent estimator γ̂ of γ is available, the parameter of interest can be estimated by IV using the generated instruments ẑᵢ = h(wᵢ; γ̂). Similar to the above case, consistency and asymptotic normality follow under mild conditions, and the asymptotic variance generally has a different form than in the observed-instrument IV case. Yet, there are cases in which the two estimators have the same asymptotic variance. One such case occurs if E[u ∣ w] = 0. In this special case, inference on the estimated parameter can be conducted with the usual IV standard error estimator.
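
A minimal sketch of IV estimation with a generated instrument (again, names and numbers are purely illustrative): the instrument z* = wγ is unobserved, γ̂ comes from a first-stage regression of an observed noisy signal on w, and ẑ = wγ̂ then instruments an endogenous regressor x.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000

    # Simulated data (illustrative assumptions)
    w = rng.normal(size=n)
    gamma = 0.7
    z_star = gamma * w                 # unobserved instrument: z* = h(w; gamma)
    e = rng.normal(size=n)
    x = z_star + e                     # endogenous regressor
    u = 0.5 * e + rng.normal(size=n)   # error correlated with x, but E[u | w] = 0
    y = 1.5 * x + u

    # Step 1: estimate gamma from an observed noisy signal s of z*
    s = z_star + rng.normal(size=n)
    gamma_hat = (w @ s) / (w @ w)
    z_hat = gamma_hat * w              # generated instrument

    # Step 2: simple IV estimator using the generated instrument
    beta_iv = (z_hat @ y) / (z_hat @ x)
    print(beta_iv)                     # close to 1.5; OLS of y on x is biased upward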

Related Research Articles

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.
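
For instance (a toy computation with made-up numbers):

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])

    # MSE: average of the squared differences between estimates and targets.
    mse = np.mean((y_true - y_pred) ** 2)
    print(mse)   # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375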

Gamma distribution: probability distribution

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are three different parametrizations in common use:

  1. With a shape parameter k and a scale parameter θ.
  2. With a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter.
  3. With a shape parameter k and a mean parameter μ = kθ = α/β (conversions between the three parametrizations are sketched below).
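
A small sketch of how the three parametrizations relate, using SciPy's gamma distribution (which takes a shape k and a scale θ):

    from scipy import stats

    k, theta = 2.0, 3.0        # parametrization 1: shape and scale

    alpha = k                  # parametrization 2: shape and rate beta = 1/theta
    beta = 1.0 / theta

    mu = k * theta             # parametrization 3: shape and mean mu = k*theta = alpha/beta

    # All three describe the same distribution; SciPy uses (shape, scale).
    dist = stats.gamma(a=k, scale=theta)
    print(dist.mean(), mu)             # both 6.0
    print(dist.var(), k * theta**2)    # both 18.0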

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term, in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

Ordinary least squares: method for estimating the unknown parameters in a linear regression model

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable in the given dataset and those predicted by the linear function.
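
A minimal sketch of this principle in closed form, via the normal equations β̂ = (XᵀX)⁻¹Xᵀy (with made-up data):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    x = rng.normal(size=n)
    y = 0.5 + 2.0 * x + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x])          # design matrix with intercept
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves the normal equations
    print(beta_hat)                               # approximately (0.5, 2.0)

    # Equivalently, beta_hat minimizes the sum of squared residuals:
    resid = y - X @ beta_hat
    print(np.sum(resid ** 2))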

In econometrics, endogeneity broadly refers to situations in which an explanatory variable is correlated with the error term. The distinction between endogenous and exogenous variables originated in simultaneous equations models, where one separates variables whose values are determined by the model from variables which are predetermined; ignoring simultaneity in the estimation leads to biased estimates, as it violates the exogeneity assumption of the Gauss–Markov theorem. The problem of endogeneity is, unfortunately, often ignored by researchers conducting non-experimental research, which precludes making policy recommendations. Instrumental variable techniques are commonly used to address this problem.

In econometrics and statistics, the generalized method of moments (GMM) is a generic method for estimating parameters in statistical models. Usually it is applied in the context of semiparametric models, where the parameter of interest is finite-dimensional, whereas the full shape of the data's distribution function may not be known, and therefore maximum likelihood estimation is not applicable.
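
As a toy illustration of moment-based estimation (an exactly identified method-of-moments special case rather than full GMM weighting, with illustrative numbers), match the first two sample moments of data assumed to come from a normal model:

    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.normal(loc=2.0, scale=1.5, size=10000)

    # Moment conditions for (mu, sigma^2):
    #   E[x] - mu = 0  and  E[x^2] - (mu^2 + sigma^2) = 0
    mu_hat = np.mean(data)
    sigma2_hat = np.mean(data ** 2) - mu_hat ** 2
    print(mu_hat, sigma2_hat)   # approximately 2.0 and 2.25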

Simple linear regression: linear regression model with a single explanatory variable

In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
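
A sketch of the closed-form fit, slope = cov(x, y)/var(x) and intercept = ȳ − slope·x̄ (with made-up numbers):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    intercept = np.mean(y) - slope * np.mean(x)
    print(slope, intercept)   # 1.96 and 0.14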

In statistics, the Breusch–Pagan test, developed in 1979 by Trevor Breusch and Adrian Pagan, is used to test for heteroskedasticity in a linear regression model. It was independently suggested with some extension by R. Dennis Cook and Sanford Weisberg in 1983. It tests whether the variance of the errors from a regression is dependent on the values of the independent variables. In that case, heteroskedasticity is present.
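
A minimal sketch of the test's mechanics, in the common Lagrange multiplier form (regress squared OLS residuals on the regressors and compute n·R²; the data here are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 500
    x = rng.uniform(1.0, 3.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=x, size=n)   # error variance grows with x

    X = np.column_stack([np.ones(n), x])

    # OLS fit and squared residuals
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e2 = (y - X @ beta) ** 2

    # Auxiliary regression of squared residuals on X; LM statistic = n * R^2
    g = np.linalg.lstsq(X, e2, rcond=None)[0]
    fitted = X @ g
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - np.mean(e2)) ** 2)
    lm = n * r2
    p_value = stats.chi2.sf(lm, df=1)   # df = regressors excluding the constant
    print(lm, p_value)                  # small p-value: heteroskedasticity detected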

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

Fixed effects model: statistical model that represents the observed quantities in terms of explanatory variables that are treated as if the quantities were non-random

In statistics, a fixed effects model is a statistical model in which the model parameters are fixed or non-random quantities. This is in contrast to random effects models and mixed models in which all or some of the model parameters are considered as random variables. In many applications including econometrics and biostatistics a fixed effects model refers to a regression model in which the group means are fixed (non-random) as opposed to a random effects model in which the group means are a random sample from a population. Generally, data can be grouped according to several observed factors. The group means could be modeled as fixed or random effects for each grouping. In a fixed effects model each group mean is a group-specific fixed quantity.

Generalized least squares

In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters in a linear regression model when there is a certain degree of correlation between the residuals in a regression model. In these cases, ordinary least squares and weighted least squares can be statistically inefficient, or even give misleading inferences. GLS was first described by Alexander Aitken in 1934.
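
A sketch of the GLS estimator β̂ = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y for a known error covariance Ω (here an illustrative diagonal, heteroskedastic Ω):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 300
    x = rng.normal(size=n)
    sigma2 = np.exp(x)                 # known, non-constant error variances
    y = 1.0 + 2.0 * x + rng.normal(scale=np.sqrt(sigma2))

    X = np.column_stack([np.ones(n), x])
    Omega_inv = np.diag(1.0 / sigma2)  # inverse error covariance (diagonal case)

    beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
    print(beta_gls)                    # approximately (1.0, 2.0)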

Semiparametric regression

In statistics, semiparametric regression includes regression models that combine parametric and nonparametric models. They are often used in situations where the fully nonparametric model may not perform well or when the researcher wants to use a parametric model but the functional form with respect to a subset of the regressors or the density of the errors is not known. Semiparametric regression models are a particular type of semiparametric modelling and, since semiparametric models contain a parametric component, they rely on parametric assumptions and so may be misspecified, just like a fully parametric model.

Quantile regression

Quantile regression is a type of regression analysis used in statistics and econometrics. Whereas the method of least squares results in estimates of the conditional mean of the response variable given certain values of the predictor variables, quantile regression aims at estimating either the conditional median or other quantiles of the response variable. Essentially, quantile regression extends linear regression to settings where the assumptions of least squares regression are not met.
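
A minimal sketch of median regression by direct minimization of the check (pinball) loss with a general-purpose optimizer (illustrative data; dedicated linear programming solvers are normally used in practice):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(6)
    n = 400
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)   # heavy-tailed noise

    tau = 0.5   # quantile level; 0.5 targets the conditional median

    def pinball_loss(params):
        a, b = params
        resid = y - (a + b * x)
        return np.mean(np.where(resid >= 0, tau * resid, (tau - 1) * resid))

    res = minimize(pinball_loss, x0=[0.0, 0.0], method="Nelder-Mead")
    print(res.x)   # approximately (1.0, 2.0)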

The Heckman correction is any of a number of related statistical methods developed by James Heckman at the University of Chicago from 1976 to 1979 which allow the researcher to correct for selection bias. Selection bias problems are endemic to applied econometric problems, which make Heckman’s original technique, and subsequent refinements by both himself and others, indispensable to applied econometricians. Heckman received the Economics Nobel Prize in 2000 for his work in this field.

Principal component regression: statistical technique

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). Typically, it considers regressing the outcome on a set of covariates based on a standard linear regression model, but uses PCA for estimating the unknown regression coefficients in the model.
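
A sketch of the two PCR stages: PCA on the centered covariates, then regression on the leading components (the factor structure and numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(7)
    n, p, k = 500, 10, 2    # keep k principal components

    # Covariates driven by k latent factors, so variance concentrates in k directions
    F = rng.normal(size=(n, k))                  # latent factors
    L = rng.normal(size=(k, p))                  # factor loadings
    X = F @ L + 0.1 * rng.normal(size=(n, p))
    y = F @ np.array([1.0, -2.0]) + rng.normal(size=n)

    # Stage 1: PCA via SVD of the centered design matrix
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                       # leading principal component scores

    # Stage 2: OLS of (centered) y on the component scores
    gamma = np.linalg.lstsq(scores, y - y.mean(), rcond=None)[0]

    # Coefficients mapped back to the original covariates
    beta_pcr = Vt[:k].T @ gamma
    print(beta_pcr)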

Errors-in-variables models: regression models accounting for possible errors in independent variables

In statistics, errors-in-variables models or measurement error models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

In computer networks, self-similarity is a feature of network data transfer dynamics. When modeling network data dynamics, the traditional time series models, such as an autoregressive moving average model, are not appropriate. This is because these models only provide a finite number of parameters in the model and thus interaction in a finite time window, but network data usually have a long-range dependent temporal structure. A self-similar process is one way of modeling network data dynamics with such long-range correlation. This article defines and describes network data transfer dynamics in the context of a self-similar process. Properties of the process are shown and methods are given for graphing and estimating parameters modeling the self-similarity of network data.

The de-sparsified lasso is a method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in high-dimensional models.

A two-step M-estimator whose first step is a maximum likelihood estimator is a special case of the general two-step M-estimator. Thus, consistency and asymptotic normality of the estimator follow from the general result on two-step M-estimators.

Two-step M-estimators deal with M-estimation problems that require preliminary estimation to obtain the parameter of interest. Two-step M-estimation differs from the usual M-estimation problem because the asymptotic distribution of the second-step estimator generally depends on the first-step estimator. Accounting for this change in the asymptotic distribution is important for valid inference.

References

  1. Pagan, A., 1984, "Econometric Issues in the Analysis of Regressions with Generated Regressors", International Economic Review, 25 (1), 221-247.
  2. Anderson, G. J., I. F. Pearce and P. K. Trivedi, 1976, "Output, Expected Demand and Unplanned Stocks," in I. F. Pearce et al., eds., A Model of Output, Employment, Wages and Prices in the U.K., Cambridge University Press.
  3. Barro, R. J., 1977, "Unanticipated Money Growth and Unemployment in the United States," American Economic Review, 67, 101-115.
  4. Wooldridge, J. M., Econometric Analysis of Cross Section and Panel Data, MIT Press, Cambridge, Mass.