Errors-in-variables models


In statistics, errors-in-variables models or measurement error models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.


Illustration of regression dilution (or attenuation bias) by a range of regression estimates in errors-in-variables models. Two regression lines (red) bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable (or predictor) is on the x-axis. The steeper slope is obtained when the independent variable is on the y-axis. By convention, with the independent variable on the x-axis, the shallower slope is obtained. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.

In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the attenuation bias . In non-linear models the direction of the bias is likely to be more complicated. [1] [2] [3]

Motivating example

Consider a simple linear regression model of the form

$$ y_t = \alpha + \beta x_t^{*} + \varepsilon_t, \qquad t = 1, \ldots, T, $$

where x*t denotes the true but unobserved regressor. Instead we observe this value with an error:

$$ x_t = x_t^{*} + \eta_t, $$

where the measurement error ηt is assumed to be independent of the true value x*t.

If the yt′s are simply regressed on the xt′s (see simple linear regression), then the estimator for the slope coefficient is

$$ \hat\beta_x = \frac{\tfrac{1}{T}\sum_{t=1}^{T}(x_t-\bar x)(y_t-\bar y)}{\tfrac{1}{T}\sum_{t=1}^{T}(x_t-\bar x)^2}, $$

which converges as the sample size T increases without bound:

$$ \hat\beta_x \ \xrightarrow{p}\ \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]} = \frac{\beta\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_\eta} = \frac{\beta}{1 + \sigma^2_\eta/\sigma^2_{x^*}}. $$

This is in contrast to the "true" effect of β, estimated using the x*t:

$$ \hat\beta \ \xrightarrow{p}\ \frac{\operatorname{Cov}[x^*_t, y_t]}{\operatorname{Var}[x^*_t]} = \beta. $$

Variances are non-negative, so that in the limit the estimated β̂x is smaller than β, an effect which statisticians call attenuation or regression dilution. [4] Thus the 'naïve' least squares estimator β̂x is an inconsistent estimator for β. However, β̂x is a consistent estimator of the parameter required for the best linear predictor of y given the observed x: in some applications this may be what is required, rather than an estimate of the 'true' regression coefficient β, although that would assume that the variance of the errors in estimation and in prediction is identical. This follows directly from the result quoted immediately above, and from the fact that the regression coefficient relating the yt′s to the actually observed xt′s, in a simple linear regression, is given by

$$ \beta_x = \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]} = \frac{\beta}{1 + \sigma^2_\eta/\sigma^2_{x^*}}. $$

It is this coefficient, rather than β, that would be required for constructing a predictor of y based on an observed x which is subject to noise.
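The attenuation factor can be checked in a short simulation (all parameter values below are illustrative, chosen only to make the effect visible):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, beta = 1.0, 2.0
sigma_x, sigma_eta, sigma_eps = 1.0, 1.0, 0.5

x_star = rng.normal(0.0, sigma_x, n)         # true but unobserved regressor
x = x_star + rng.normal(0.0, sigma_eta, n)   # observed with measurement error
y = alpha + beta * x_star + rng.normal(0.0, sigma_eps, n)

# Naive OLS slope on the noisy regressor: Cov(x, y) / Var(x)
b_naive = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Theoretical attenuated limit: beta * sigma_x^2 / (sigma_x^2 + sigma_eta^2)
b_limit = beta * sigma_x**2 / (sigma_x**2 + sigma_eta**2)
```

Here the error variance equals the regressor variance, so the naive slope converges to β/2 rather than β.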

It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous [5] ). Jerry Hausman sees this as an iron law of econometrics: "The magnitude of the estimate is usually smaller than expected." [6]

Specification

Usually measurement error models are described using the latent variables approach. If y is the response variable and x are observed values of the regressors, then it is assumed there exist some latent variables y* and x* which follow the model's "true" functional relationship g(·), and such that the observed quantities are their noisy observations:

$$ y^*_t = g(x^*_t,\, w_t \mid \theta), \qquad y_t = y^*_t + \varepsilon_t, \qquad x_t = x^*_t + \eta_t, $$

where θ is the model's parameter and wt are those regressors which are assumed to be error-free (for example, when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no "measurement errors"). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that the corresponding entries in the variance matrix of the ηt′s are zero.

The variables yt, xt, wt are all observed, meaning that the statistician possesses a data set of T statistical units which follow the data generating process described above; the latent variables y*t, x*t, εt, and ηt are not observed, however.

This specification does not encompass all existing errors-in-variables models. For example, in some of them the function g(·) may be non-parametric or semi-parametric. Other approaches model the relationship between y* and x* as distributional instead of functional; that is, they assume that y*t, conditionally on x*t, follows a certain (usually parametric) distribution.
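As a concrete illustration of this specification, the sketch below simulates data from a latent-variable process; the function g, the parameter vector θ, and the noise scales are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
theta = (0.5, 2.0)  # hypothetical model parameter

def g(x_star, w, theta):
    """Hypothetical 'true' functional relationship y* = g(x*, w | theta)."""
    return theta[0] * w + theta[1] * np.sin(x_star)

x_star = rng.normal(size=n)             # latent regressor (never observed)
w = np.ones(n)                          # error-free regressor (intercept column)
y_star = g(x_star, w, theta)            # latent response (never observed)

# The statistician only sees noisy versions of the latent variables:
x = x_star + rng.normal(0.0, 0.3, n)    # x_t = x*_t + eta_t
y = y_star + rng.normal(0.0, 0.2, n)    # y_t = y*_t + eps_t
```

Only the arrays `x`, `y`, and `w` would be available to the analyst; `x_star` and `y_star` play the role of the latent variables in the specification above.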

Terminology and assumptions

The true regressor x* may be treated either as a random variable (the structural model) or as a sequence of unknown constants (the functional model). [7] [8] The measurement errors are called classical when the error η is independent of the true value x*, and of the Berkson type when instead the error is independent of the observed value x. [9]

Linear model

Linear errors-in-variables models were studied first, probably because linear models were so widely used and because they are easier than non-linear ones. Unlike standard least squares regression (OLS), extending errors-in-variables regression (EiV) from the simple to the multivariable case is not straightforward.

Simple linear model

The simple linear errors-in-variables model was already presented in the "Motivating example" section:

$$ y_t = \alpha + \beta x^*_t + \varepsilon_t, \qquad x_t = x^*_t + \eta_t, $$

where all variables are scalar. Here α and β are the parameters of interest, whereas σε and ση—standard deviations of the error terms—are the nuisance parameters. The "true" regressor x* is treated as a random variable (structural model), independent of the measurement error η (classic assumption).

This model is identifiable in two cases: (1) the latent regressor x* is not normally distributed, or (2) x* has a normal distribution, but neither εt nor ηt is divisible by a normal distribution. [10] That is, the parameters α, β can be consistently estimated from the data set without any additional information, provided one of these conditions holds.

Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to assume that some of the parameters of the model are known or can be estimated from an outside source. Such estimation methods include: [11] Deming regression, which assumes that the ratio δ = σ²ε/σ²η of the error variances is known; methods that assume the reliability ratio λ = σ²x*/(σ²η + σ²x*) is known; and methods that assume the error variance σ²η is known.

Estimation methods that do not assume knowledge of some of the parameters of the model, include

  • Method of moments — the GMM estimator based on the third- (or higher-) order joint cumulants of observable variables. The slope coefficient can be estimated from [12]

    $$ \hat\beta = \frac{\hat K(n_1,\, n_2+1)}{\hat K(n_1+1,\, n_2)}, $$

    where (n1,n2) are such that K(n1+1,n2) — the joint cumulant of (x,y) — is not zero. In the case when the third central moment of the latent regressor x* is non-zero, the formula reduces to

    $$ \hat\beta = \frac{\tfrac1T \sum_{t=1}^T (x_t-\bar x)(y_t-\bar y)^2}{\tfrac1T \sum_{t=1}^T (x_t-\bar x)^2 (y_t-\bar y)}. $$

  • Instrumental variables — a regression which requires that certain additional data variables z, called instruments, be available. These variables should be uncorrelated with the errors in the equation for the dependent (outcome) variable (valid), and they should also be correlated (relevant) with the true regressors x*. If such variables can be found, then the estimator takes the form

    $$ \hat\beta = \frac{\tfrac1T \sum_{t=1}^T (z_t-\bar z)(y_t-\bar y)}{\tfrac1T \sum_{t=1}^T (z_t-\bar z)(x_t-\bar x)}. $$
  • The geometric mean functional relationship. This treats both variables as having the same reliability. The resulting slope is the geometric mean of the ordinary least squares slope and the reverse least squares slope, i.e. the two red lines in the diagram. [13]
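The third-moment version of the method-of-moments estimator can be checked on simulated data with a skewed latent regressor (a chi-squared variable here; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, beta = 500_000, 1.0, 2.0

x_star = rng.chisquare(1, n)             # skewed latent regressor (non-zero 3rd moment)
x = x_star + rng.normal(0.0, 1.0, n)     # observed with symmetric measurement error
y = alpha + beta * x_star + rng.normal(0.0, 1.0, n)

dx, dy = x - x.mean(), y - y.mean()

# Naive OLS slope is attenuated: here Var(x*) = 2, Var(eta) = 1, so plim = 2*beta/3
b_ols = (dx * dy).mean() / (dx**2).mean()

# Third-moment estimator: E[dx dy^2] / E[dx^2 dy] converges to beta
b_mom = (dx * dy**2).mean() / (dx**2 * dy).mean()
```

Both the numerator and the denominator of `b_mom` converge to multiples of the third central moment of x* (β²μ₃ and βμ₃ respectively), so the ratio is consistent for β even though the naive slope is not.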

Multivariable linear model

The multivariable model looks exactly like the simple linear model, only this time β, ηt, xt and x*t are k×1 vectors.

In the case when (εt,ηt) is jointly normal, the parameter β is not identified if and only if there exists a non-singular k×k block matrix [a A], where a is a k×1 vector, such that a′x* is distributed normally and independently of A′x*. In the case when εt, ηt1,..., ηtk are mutually independent, the parameter β is not identified if and only if, in addition to the conditions above, some of the errors can be written as the sum of two independent variables, one of which is normal. [14]

Some of the estimation methods for multivariable linear models are

  • Total least squares is an extension of Deming regression to the multivariable setting. When all the k+1 components of the vector (ε,η) have equal variances and are independent, this is equivalent to running the orthogonal regression of y on the vector x — that is, the regression which minimizes the sum of squared distances between points (yt,xt) and the k-dimensional hyperplane of "best fit".
  • The method of moments estimator [15] can be constructed based on the moment conditions E[zt·(yt − α − β′xt)] = 0, where the (5k+3)-dimensional vector of instruments zt is defined as

    where ∘ designates the Hadamard product of matrices, and the variables xt, yt have been preliminarily de-meaned. The authors of the method suggest using Fuller's modified IV estimator. [16]

    This method can be extended to use moments higher than the third order, if necessary, and to accommodate variables measured without error. [17]
  • The instrumental variables approach requires us to find additional data variables zt that serve as instruments for the mismeasured regressors xt. This method is the simplest from the implementation point of view; however, its disadvantage is that it requires collecting additional data, which may be costly or even impossible. When the instruments can be found, the estimator takes the standard form

    $$ \hat\beta = \big(X' P_Z X\big)^{-1} X' P_Z y, \qquad P_Z = Z (Z'Z)^{-1} Z'. $$
  • The impartial fitting approach treats all variables in the same way by assuming equal reliability, and does not require any distinction between explanatory and response variables as the resulting equation can be rearranged. It is the simplest measurement error model, and is a generalization of the geometric mean functional relationship mentioned above for two variables. It only requires covariances to be computed, and so can be estimated using basic spreadsheet functions. [18]
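A total least squares fit can be sketched with the classical SVD construction: the fitted hyperplane is orthogonal to the right singular vector associated with the smallest singular value of the de-meaned data matrix [X | y]. The simulation below uses equal error variances on every coordinate, as the method assumes (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
beta = np.array([1.5, -0.5])

x_star = rng.normal(size=(n, 2))                   # latent regressors
x = x_star + rng.normal(0.0, 0.5, size=(n, 2))     # errors in the regressors
y = x_star @ beta + rng.normal(0.0, 0.5, size=n)   # error in the response, same variance

# Total least squares via SVD of the de-meaned stacked matrix [X | y]
z = np.column_stack([x, y])
z = z - z.mean(axis=0)
v = np.linalg.svd(z, full_matrices=False)[2][-1]   # smallest right singular vector
b_tls = -v[:-1] / v[-1]                            # hyperplane v'z = 0  =>  y = b'x

# Naive OLS for comparison (attenuated toward zero)
b_ols = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0][1:]
```

With error variance 0.25 against regressor variance 1, each naive OLS coefficient shrinks by the factor 1/1.25 = 0.8, while the TLS coefficients stay near β.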

Non-linear models

A generic non-linear measurement error model takes the form

$$ y_t = g(x^*_t) + \varepsilon_t, \qquad x_t = x^*_t + \eta_t. $$

Here the function g can be either parametric or non-parametric. When the function g is parametric it will be written as g(x*, β).

For a general vector-valued regressor x* the conditions for model identifiability are not known. However, in the case of scalar x* the model is identified unless the function g is of the "log-exponential" form [19]

$$ g(x^*) = a + b \ln\big(e^{c x^*} + d\big) $$

and the latent regressor x* has density

$$ f_{x^*}(x) = A\, e^{-B e^{Cx} + CDx} \big(e^{Cx} + E\big)^{-F}, $$

where the constants A, B, C, D, E, F may depend on a, b, c, d.

Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information. However there are several techniques which make use of some additional data: either the instrumental variables, or repeated observations.

Instrumental variables methods

  • Newey's simulated moments method [20] for parametric models — requires that there be an additional set of observed predictor variables zt, such that the true regressor can be expressed as

    $$ x^*_t = \pi_0' z_t + \sigma_0 \zeta_t, $$

    where π0 and σ0 are (unknown) constant matrices, and ζt ⊥ zt. The coefficient π0 can be estimated using standard least squares regression of x on z. The distribution of ζt is unknown; however, we can model it as belonging to a flexible parametric family — the Edgeworth series:

    where ϕ is the standard normal distribution.

    Simulated moments can be computed using the importance sampling algorithm: first we generate several random variables {vts ~ ϕ, s = 1,…,S, t = 1,…,T} from the standard normal distribution, then we compute the moments at t-th observation as

    where θ = (β, σ, γ), A is just some function of the instrumental variables z, and H is a two-component vector of moments

    With moment functions mt one can apply standard GMM technique to estimate the unknown parameter θ.

Repeated observations

In this approach two (or maybe more) repeated observations of the regressor x* are available. Both observations contain their own measurement errors; however, those errors are required to be independent:

$$ x_{1t} = x^*_t + \eta_{1t}, \qquad x_{2t} = x^*_t + \eta_{2t}, $$

where x* ⊥ η1 ⊥ η2. The variables η1, η2 need not be identically distributed (although if they are, the efficiency of the estimator can be slightly improved). With only these two observations it is possible to consistently estimate the density function of x* using Kotlarski's deconvolution technique. [21]
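While Kotlarski's technique recovers the whole density of x*, a simpler moment implication of the same independence assumptions is that Cov(x1, x2) = Var(x*), which already suffices to undo the attenuation bias in the linear model (illustrative simulation, not the deconvolution itself):

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha, beta = 100_000, 1.0, 2.0

x_star = rng.normal(0.0, 1.0, n)
x1 = x_star + rng.normal(0.0, 1.0, n)   # first measurement, error eta1
x2 = x_star + rng.normal(0.0, 0.8, n)   # second measurement, independent error eta2
y = alpha + beta * x_star + rng.normal(0.0, 0.5, n)

# Naive slope on x1 is attenuated by Var(x*)/(Var(x*)+Var(eta1)) = 1/2 here;
# replacing Var(x1) with Cov(x1, x2) = Var(x*) removes the bias.
b_naive = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
b_corrected = np.cov(x1, y)[0, 1] / np.cov(x1, x2)[0, 1]
```

The independence of η1 and η2 is exactly what makes Cov(x1, x2) an unbiased estimate of the latent variance.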

  • Li's conditional density method for parametric models. [22] The regression equation can be written in terms of the observable variables as

    $$ \operatorname{E}[\,y_t \mid x_t\,] = \int g(x^*_t, \beta)\, f_{x^*\mid x}(x^*_t \mid x_t)\, dx^*_t, $$

    where it would be possible to compute the integral if we knew the conditional density function ƒx*|x. If this function could be known or estimated, then the problem turns into standard non-linear regression, which can be estimated, for example, using the NLLS method.
    Assuming for simplicity that η1, η2 are identically distributed, this conditional density can be computed as

    where with slight abuse of notation xj denotes the j-th component of a vector.
    All densities in this formula can be estimated using inversion of the empirical characteristic functions. In particular,

    In order to invert these characteristic functions one has to apply the inverse Fourier transform, with a trimming parameter C needed to ensure numerical stability. For example:

  • Schennach's estimator for a parametric linear-in-parameters nonlinear-in-variables model. [23] This is a model of the form

    where wt represents variables measured without errors. The regressor x* here is scalar (the method can be extended to the case of vector x* as well).
    If not for the measurement errors, this would have been a standard linear model with the estimator

    where

    It turns out that all the expected values in this formula are estimable using the same deconvolution trick. In particular, for a generic observable wt (which could be 1, w1t, …, wℓt, or yt) and some function h (which could represent any gj or gigj) we have

    where φh is the Fourier transform of h(x*), but using the same convention as for the characteristic functions,


    and

    The resulting estimator is consistent and asymptotically normal.
  • Schennach's estimator for a nonparametric model. [24] The standard Nadaraya–Watson estimator for a nonparametric model takes the form

    $$ \hat g(x) = \frac{\hat{\operatorname{E}}\big[\,y_t K_h(x^*_t - x)\,\big]}{\hat{\operatorname{E}}\big[\,K_h(x^*_t - x)\,\big]} $$

    for a suitable choice of the kernel K and the bandwidth h. Both expectations here can be estimated using the same technique as in the previous method.
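The characteristic-function inversion used throughout this section can be sketched in a few lines for the simplest case: recovering the density of x* from x = x* + η when the error law is known (here Gaussian with known ση — all values illustrative), with a trimming bound C for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma_eta = 50_000, 0.5
x_star = rng.normal(0.0, 1.0, n)
x = x_star + rng.normal(0.0, sigma_eta, n)   # contaminated observations

# Empirical characteristic function of x over a trimmed frequency grid |s| <= C
C = 3.0
s = np.linspace(-C, C, 121)
phi_x = np.array([np.exp(1j * si * x).mean() for si in s])
phi_eta = np.exp(-0.5 * (sigma_eta * s) ** 2)   # known c.f. of the error
phi_xstar = phi_x / phi_eta                      # estimated c.f. of x*

# Inverse Fourier transform (Riemann sum) on a grid of evaluation points
grid = np.linspace(-4.0, 4.0, 81)
ds = s[1] - s[0]
f_hat = (np.exp(-1j * np.outer(grid, s)) @ phi_xstar).real * ds / (2 * np.pi)
```

Trimming at |s| ≤ C keeps the division by φη, which decays rapidly, from amplifying the sampling noise of the empirical characteristic function at high frequencies.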


References

  1. Griliches, Zvi; Ringstad, Vidar (1970). "Errors-in-the-variables bias in nonlinear contexts". Econometrica . 38 (2): 368–370. doi:10.2307/1913020. JSTOR   1913020.
  2. Chesher, Andrew (1991). "The effect of measurement error". Biometrika . 78 (3): 451–462. doi:10.1093/biomet/78.3.451. JSTOR   2337015.
  3. Carroll, Raymond J.; Ruppert, David; Stefanski, Leonard A.; Crainiceanu, Ciprian (2006). Measurement Error in Nonlinear Models: A Modern Perspective (Second ed.). ISBN   978-1-58488-633-4.
  4. Greene, William H. (2003). Econometric Analysis (5th ed.). New Jersey: Prentice Hall. Chapter 5.6.1. ISBN   978-0-13-066189-0.
  5. Wansbeek, T.; Meijer, E. (2000). "Measurement Error and Latent Variables". In Baltagi, B. H. (ed.). A Companion to Theoretical Econometrics. Blackwell. pp. 162–179. doi:10.1111/b.9781405106764.2003.00013.x. ISBN   9781405106764.
  6. Hausman, Jerry A. (2001). "Mismeasured variables in econometric analysis: problems from the right and problems from the left". Journal of Economic Perspectives . 15 (4): 57–67 [p. 58]. doi: 10.1257/jep.15.4.57 . JSTOR   2696516.
  7. Fuller, Wayne A. (1987). Measurement Error Models. John Wiley & Sons. p. 2. ISBN   978-0-471-86187-4.
  8. Hayashi, Fumio (2000). Econometrics. Princeton University Press. pp. 7–8. ISBN   978-1400823833.
  9. Koul, Hira; Song, Weixing (2008). "Regression model checking with Berkson measurement errors". Journal of Statistical Planning and Inference. 138 (6): 1615–1628. doi:10.1016/j.jspi.2007.05.048.
  10. Reiersøl, Olav (1950). "Identifiability of a linear relation between variables which are subject to error". Econometrica . 18 (4): 375–389 [p. 383]. doi:10.2307/1907835. JSTOR   1907835. A somewhat more restrictive result was established earlier by Geary, R. C. (1942). "Inherent relations between random variables". Proceedings of the Royal Irish Academy . 47: 63–76. JSTOR   20488436. He showed that under the additional assumption that (ε, η) are jointly normal, the model is not identified if and only if x*s are normal.
  11. Fuller, Wayne A. (1987). "A Single Explanatory Variable". Measurement Error Models. John Wiley & Sons. pp. 1–99. ISBN   978-0-471-86187-4.
  12. Pal, Manoranjan (1980). "Consistent moment estimators of regression coefficients in the presence of errors in variables". Journal of Econometrics . 14 (3): 349–364 (pp. 360–361). doi:10.1016/0304-4076(80)90032-9.
  13. Xu, Shaoji (2014-10-02). "A Property of Geometric Mean Regression". The American Statistician. 68 (4): 277–281. doi:10.1080/00031305.2014.962763. ISSN   0003-1305.
  14. Ben-Moshe, Dan (2020). "Identification of linear regressions with errors in all variables". Econometric Theory . 37 (4): 1–31. arXiv: 1404.1473 . doi:10.1017/S0266466620000250. S2CID   225653359.
  15. Dagenais, Marcel G.; Dagenais, Denyse L. (1997). "Higher moment estimators for linear regression models with errors in the variables". Journal of Econometrics . 76 (1–2): 193–221. CiteSeerX   10.1.1.669.8286 . doi:10.1016/0304-4076(95)01789-5. In the earlier paper Pal (1980) considered a simpler case when all components in vector (ε, η) are independent and symmetrically distributed.
  16. Fuller, Wayne A. (1987). Measurement Error Models. John Wiley & Sons. p. 184. ISBN   978-0-471-86187-4.
  17. Erickson, Timothy; Whited, Toni M. (2002). "Two-step GMM estimation of the errors-in-variables model using high-order moments". Econometric Theory . 18 (3): 776–799. doi:10.1017/s0266466602183101. JSTOR   3533649. S2CID   14729228.
  18. Tofallis, C. (2023). "Fitting an Equation to Data Impartially". Mathematics. 11 (18): 3957. doi:10.3390/math11183957. SSRN https://ssrn.com/abstract=4556739.
  19. Schennach, S.; Hu, Y.; Lewbel, A. (2007). "Nonparametric identification of the classical errors-in-variables model without side information". Working Paper.
  20. Newey, Whitney K. (2001). "Flexible simulated moment estimation of nonlinear errors-in-variables model". Review of Economics and Statistics . 83 (4): 616–627. doi:10.1162/003465301753237704. hdl: 1721.1/63613 . JSTOR   3211757. S2CID   57566922.
  21. Li, Tong; Vuong, Quang (1998). "Nonparametric estimation of the measurement error model using multiple indicators". Journal of Multivariate Analysis . 65 (2): 139–165. doi: 10.1006/jmva.1998.1741 .
  22. Li, Tong (2002). "Robust and consistent estimation of nonlinear errors-in-variables models". Journal of Econometrics . 110 (1): 1–26. doi:10.1016/S0304-4076(02)00120-3.
  23. Schennach, Susanne M. (2004). "Estimation of nonlinear models with measurement error". Econometrica . 72 (1): 33–75. doi:10.1111/j.1468-0262.2004.00477.x. JSTOR   3598849.
  24. Schennach, Susanne M. (2004). "Nonparametric regression in the presence of measurement error". Econometric Theory. 20 (6): 1046–1093. doi:10.1017/S0266466604206028. S2CID   123036368.
