Regression dilution

Illustration of regression dilution (or attenuation bias) by a range of regression estimates in errors-in-variables models. Two regression lines (red) bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable (or predictor) is on the abscissa (x-axis). The steeper slope is obtained when the independent variable is on the ordinate (y-axis). By convention, with the independent variable on the x-axis, the shallower slope is obtained. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.

Regression dilution, also known as regression attenuation, is the biasing of the linear regression slope towards zero (the underestimation of its absolute value), caused by errors in the independent variable.


Consider fitting a straight line for the relationship of an outcome variable y to a predictor variable x, and estimating the slope of the line. Statistical variability, measurement error or random noise in the y variable causes uncertainty in the estimated slope, but not bias: on average, the procedure calculates the right slope. However, variability, measurement error or random noise in the x variable causes bias in the estimated slope (as well as imprecision). The greater the variance in the x measurement, the more the estimated slope is pulled towards zero and away from the true value.
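
The attenuation is easy to reproduce in simulation. The following sketch (Python with NumPy; the sample size, true slope and noise levels are arbitrary choices for illustration) fits ordinary least squares once against the error-free predictor and once against a noisy copy of it: noise in y only adds scatter, while noise in x shrinks the fitted slope towards zero by roughly the factor var(x)/(var(x) + var(error)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_slope = 2.0

x = rng.normal(size=n)                   # true predictor values
y = true_slope * x + rng.normal(size=n)  # outcome with noise in y only
w = x + rng.normal(size=n)               # predictor observed with error (variance 1)

slope_xy = np.polyfit(x, y, 1)[0]  # close to 2.0: noise in y adds scatter, not bias
slope_wy = np.polyfit(w, y, 1)[0]  # close to 1.0: shrunk by var(x)/(var(x)+var(err)) = 0.5

print(f"slope of y on x: {slope_xy:.3f}")
print(f"slope of y on w: {slope_wy:.3f}")
```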

Suppose the green and blue data points capture the same data, but with errors (either +1 or -1 on the x-axis) for the green points. Minimizing error on the y-axis leads to a smaller slope for the green points, even though they are just a noisy version of the same data.

It may seem counter-intuitive that noise in the predictor variable x induces a bias, but noise in the outcome variable y does not. Recall that linear regression is not symmetric: the line of best fit for predicting y from x (the usual linear regression) is not the same as the line of best fit for predicting x from y. [1]

Slope correction

Regression slope and other regression coefficients can be disattenuated as follows.

The case of a fixed x variable

The case that x is fixed, but measured with noise, is known as the functional model or functional relationship. [2] It can be corrected using total least squares [3] and errors-in-variables models in general.
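
As a rough illustration, the sketch below (Python with NumPy; the data are simulated and the noise levels are arbitrary) fits a straight line by orthogonal regression, the total least squares solution for a two-variable fit when both variables carry errors of comparable size, using the principal axis of the centred point cloud, and compares it with the ordinary least squares slope. It illustrates the idea rather than any particular published algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x_true = rng.normal(size=n)
y_true = 2.0 * x_true                        # true line through the origin, slope 2
x = x_true + rng.normal(scale=0.5, size=n)   # both coordinates observed with error
y = y_true + rng.normal(scale=0.5, size=n)

# Ordinary least squares of y on x: attenuated towards zero.
ols_slope = np.polyfit(x, y, 1)[0]

# Orthogonal (total least squares) fit: direction of greatest variance of the
# centred data cloud, obtained from the SVD of the centred data matrix.
xc, yc = x - x.mean(), y - y.mean()
_, _, vt = np.linalg.svd(np.column_stack([xc, yc]), full_matrices=False)
tls_slope = vt[0, 1] / vt[0, 0]

print(f"OLS slope: {ols_slope:.3f}")  # noticeably below 2
print(f"TLS slope: {tls_slope:.3f}")  # close to 2 when the two error scales match
```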

The case of a randomly distributed x variable

The case that the x variable arises randomly is known as the structural model or structural relationship. For example, in a medical study patients are recruited as a sample from a population, and their characteristics such as blood pressure may be viewed as arising from a random sample.

Under certain assumptions (typically, normal distribution assumptions) there is a known ratio between the true slope and the expected estimated slope. Frost and Thompson (2000) review several methods for estimating this ratio and hence correcting the estimated slope. [4] The term regression dilution ratio, although not defined in quite the same way by all authors, is used for this general approach, in which the usual linear regression is fitted and a correction is then applied. The reply to Frost & Thompson by Longford (2001) refers the reader to other methods, which expand the regression model to acknowledge the variability in the x variable, so that no bias arises. [5] Fuller (1987) is one of the standard references for assessing and correcting for regression dilution. [6]

Hughes (1993) shows that the regression dilution ratio methods apply approximately in survival models. [7] Rosner (1992) shows that the ratio methods apply approximately to logistic regression models. [8] Carroll et al. (1995) give more detail on regression dilution in nonlinear models, presenting the regression dilution ratio methods as the simplest case of regression calibration methods, in which additional covariates may also be incorporated. [9]

In general, methods for the structural model require some estimate of the variability of the x variable. This will require repeated measurements of the x variable in the same individuals, either in a sub-study of the main data set, or in a separate data set. Without this information it will not be possible to make a correction.
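
To illustrate the general idea under the simplest assumptions (a hypothetical example: the numbers, variable names and the use of exactly two replicate measurements are assumptions made for this sketch, not a reproduction of any specific method reviewed by Frost and Thompson), duplicate measurements of x can be used to estimate the error variance, form a regression dilution ratio, and divide the naive slope by it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
true_slope = 0.8

x = rng.normal(loc=120, scale=15, size=n)   # true long-term value (e.g. blood pressure)
y = true_slope * x + rng.normal(scale=10, size=n)

err_sd = 12.0
w1 = x + rng.normal(scale=err_sd, size=n)   # first measurement of x
w2 = x + rng.normal(scale=err_sd, size=n)   # repeat measurement of x

naive_slope = np.polyfit(w1, y, 1)[0]       # attenuated slope of y on w1

# Within-person error variance from the duplicates, then the regression dilution
# ratio lambda = var(x) / (var(x) + var(error)), both estimated from the data.
error_var = np.var(w1 - w2, ddof=1) / 2.0
lam = (np.var(w1, ddof=1) - error_var) / np.var(w1, ddof=1)

corrected_slope = naive_slope / lam
print(f"naive slope:     {naive_slope:.3f}")
print(f"dilution ratio:  {lam:.3f}")
print(f"corrected slope: {corrected_slope:.3f}")  # close to the true value 0.8
```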

Multiple x variables

The case of multiple predictor variables subject to variability (possibly correlated) has been well-studied for linear regression, and for some non-linear regression models. [6] [9] Other non-linear models, such as proportional hazards models for survival analysis, have been considered only with a single predictor subject to variability. [7]

Correlation correction

In 1904, Charles Spearman developed a procedure for correcting correlations for regression dilution, [10] i.e., to "rid a correlation coefficient from the weakening effect of measurement error". [11]

In measurement and statistics, the procedure is also called correlation disattenuation or the disattenuation of correlation. [12] The correction assures that the Pearson correlation coefficient across data units (for example, people) between two sets of variables is estimated in a manner that accounts for error contained within the measurement of those variables. [13]

Formulation

Let $\beta$ and $\theta$ be the true values of two attributes of some person or statistical unit. These values are variables by virtue of the assumption that they differ for different statistical units in the population. Let $\hat{\beta}$ and $\hat{\theta}$ be estimates of $\beta$ and $\theta$ derived either directly by observation-with-error or from application of a measurement model, such as the Rasch model. Also, let

$$\hat{\beta} = \beta + \epsilon_\beta, \qquad \hat{\theta} = \theta + \epsilon_\theta,$$

where $\epsilon_\beta$ and $\epsilon_\theta$ are the measurement errors associated with the estimates $\hat{\beta}$ and $\hat{\theta}$.

The estimated correlation between two sets of estimates is

$$\operatorname{corr}(\hat{\beta}, \hat{\theta}) = \frac{\operatorname{cov}(\hat{\beta}, \hat{\theta})}{\sqrt{\operatorname{var}[\hat{\beta}]\,\operatorname{var}[\hat{\theta}]}} = \frac{\operatorname{cov}(\beta + \epsilon_\beta,\ \theta + \epsilon_\theta)}{\sqrt{\operatorname{var}[\beta + \epsilon_\beta]\,\operatorname{var}[\theta + \epsilon_\theta]}},$$

which, assuming the errors are uncorrelated with each other and with the true attribute values, gives

$$\operatorname{corr}(\hat{\beta}, \hat{\theta}) = \frac{\operatorname{cov}(\beta, \theta)}{\sqrt{(\operatorname{var}[\beta] + \operatorname{var}[\epsilon_\beta])(\operatorname{var}[\theta] + \operatorname{var}[\epsilon_\theta])}} = \operatorname{corr}(\beta, \theta)\sqrt{R_\beta R_\theta} = \rho \sqrt{R_\beta R_\theta},$$

where $\rho$ is the correlation between the true attribute values and $R_\beta$ is the separation index of the set of estimates of $\beta$, which is analogous to Cronbach's alpha; that is, in terms of classical test theory, $R_\beta$ is analogous to a reliability coefficient. Specifically, the separation index is given as follows:

$$R_\beta = \frac{\operatorname{var}[\beta]}{\operatorname{var}[\beta] + \operatorname{var}[\epsilon_\beta]} = \frac{\operatorname{var}[\hat{\beta}] - \operatorname{var}[\epsilon_\beta]}{\operatorname{var}[\hat{\beta}]},$$

where the mean squared standard error of the person estimates gives an estimate of the variance of the errors, $\operatorname{var}[\epsilon_\beta]$. The standard errors are normally produced as a by-product of the estimation process (see Rasch model estimation).

The disattenuated estimate of the correlation between the two sets of parameter estimates is

$$\rho = \frac{\operatorname{corr}(\hat{\beta}, \hat{\theta})}{\sqrt{R_\beta R_\theta}}.$$

That is, the disattenuated correlation estimate is obtained by dividing the correlation between the estimates by the geometric mean of the separation indices of the two sets of estimates. Expressed in terms of classical test theory, the correlation is divided by the geometric mean of the reliability coefficients of the two tests.

Given two random variables $X^{\prime}$ and $Y^{\prime}$ measured as $X$ and $Y$ with measured correlation $r_{xy}$ and a known reliability for each variable, $r_{xx}$ and $r_{yy}$, the estimated correlation between $X^{\prime}$ and $Y^{\prime}$ corrected for attenuation is

$$r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx} r_{yy}}}.$$

How well the variables are measured affects the correlation of X and Y. The correction for attenuation tells one what the estimated correlation is expected to be if one could measure X′ and Y′ with perfect reliability.

Thus if $X$ and $Y$ are taken to be imperfect measurements of underlying variables $X^{\prime}$ and $Y^{\prime}$ with independent errors, then $r_{x'y'}$ estimates the true correlation between $X^{\prime}$ and $Y^{\prime}$.
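
A numerical sketch of the disattenuation formula (Python with NumPy; the true correlation, error level and hence the reliabilities are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# True underlying attributes with correlation 0.7.
true = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=n)

# Observed scores: true value plus independent measurement error.
err_sd = 0.8
obs = true + rng.normal(scale=err_sd, size=(n, 2))

r_observed = np.corrcoef(obs[:, 0], obs[:, 1])[0, 1]

# Reliability of each observed score: var(true) / var(observed); equal here by construction.
reliability = 1.0 / (1.0 + err_sd**2)
r_disattenuated = r_observed / np.sqrt(reliability * reliability)

print(f"observed correlation:      {r_observed:.3f}")       # about 0.7 * 0.61 = 0.43
print(f"disattenuated correlation: {r_disattenuated:.3f}")  # back near 0.7
```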

Applicability

A correction for regression dilution is necessary in statistical inference based on regression coefficients. However, in predictive modelling applications, correction is neither necessary nor appropriate. In change detection, correction is necessary.

To understand this, consider the measurement error as follows. Let y be the outcome variable, x be the true predictor variable, and w be an approximate observation of x. Frost and Thompson suggest, for example, that x may be the true, long-term blood pressure of a patient, and w may be the blood pressure observed on one particular clinic visit. [4] Regression dilution arises if we are interested in the relationship between y and x, but estimate the relationship between y and w. Because w is measured with variability, the slope of a regression line of y on w is less than that of the regression line of y on x. Standard methods can fit a regression of y on w without bias. There is bias only if we then use the regression of y on w as an approximation to the regression of y on x. In the example, assuming that blood pressure measurements are similarly variable in future patients, our regression line of y on w (observed blood pressure) gives unbiased predictions.
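
The distinction can be made concrete with a small simulation (Python with NumPy; the blood-pressure figures and noise levels are invented). The regression of y on w has an attenuated slope, yet its predictions for new patients, whose blood pressure is measured in the same noisy way, are unbiased on average:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate(n):
    x = rng.normal(loc=120, scale=15, size=n)   # true long-term blood pressure
    w = x + rng.normal(scale=12, size=n)        # single-visit measurement of x
    y = 0.5 * x + rng.normal(scale=5, size=n)   # outcome depends on true x
    return x, w, y

_, w, y = simulate(10_000)
slope, intercept = np.polyfit(w, y, 1)          # regression of y on w: slope < 0.5

# New patients whose blood pressure is measured in the same noisy way:
_, w_new, y_new = simulate(10_000)
pred = slope * w_new + intercept

print(f"slope of y on w: {slope:.3f} (true slope on x is 0.5)")
print(f"mean prediction error: {np.mean(pred - y_new):+.3f}")  # close to zero
```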

An example of a circumstance in which correction is desired is prediction of change. Suppose the change in x is known under some new circumstance: to estimate the likely change in an outcome variable y, the slope of the regression of y on x is needed, not y on w. This arises in epidemiology. To continue the example in which x denotes blood pressure, perhaps a large clinical trial has provided an estimate of the change in blood pressure under a new treatment; then the possible effect on y, under the new treatment, should be estimated from the slope in the regression of y on x.

Another circumstance is predictive modelling in which future observations are also variable, but not (in the phrase used above) "similarly variable". For example, the current data set may include blood pressure measured with greater precision than is common in clinical practice. One specific example of this arose when developing a regression equation based on a clinical trial, in which blood pressure was the average of six measurements, for use in clinical practice, where blood pressure is usually a single measurement. [14]

All of these results can be shown mathematically, in the case of simple linear regression assuming normal distributions throughout (the framework of Frost & Thompson).

It has been argued that a poorly executed correction for regression dilution, in particular one performed without checking the underlying assumptions, may do more damage to an estimate than no correction at all. [15]

Further reading

Regression dilution was first mentioned, under the name attenuation, by Spearman (1904). [16] Those seeking a readable mathematical treatment might like to start with Frost and Thompson (2000). [4]

References

  1. Draper, N. R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. p. 19. ISBN 0-471-17082-8.
  2. Riggs, D. S.; Guarnieri, J. A.; et al. (1978). "Fitting straight lines when both variables are subject to error". Life Sciences. 22 (13–15): 1305–60. doi:10.1016/0024-3205(78)90098-x. PMID 661506.
  3. Golub, Gene H.; van Loan, Charles F. (1980). "An Analysis of the Total Least Squares Problem". SIAM Journal on Numerical Analysis. 17 (6): 883–893. doi:10.1137/0717073. hdl:1813/6251. ISSN 0036-1429.
  4. Frost, C.; Thompson, S. (2000). "Correcting for regression dilution bias: comparison of methods for a single predictor variable". Journal of the Royal Statistical Society, Series A. 163: 173–190.
  5. Longford, N. T. (2001). "Correspondence". Journal of the Royal Statistical Society, Series A. 164 (3): 565. doi:10.1111/1467-985x.00219. S2CID 247674444.
  6. Fuller, W. A. (1987). Measurement Error Models. New York: Wiley. ISBN 9780470317334.
  7. Hughes, M. D. (1993). "Regression dilution in the proportional hazards model". Biometrics. 49 (4): 1056–1066. doi:10.2307/2532247. JSTOR 2532247. PMID 8117900.
  8. Rosner, B.; Spiegelman, D.; et al. (1992). "Correction of Logistic Regression Relative Risk Estimates and Confidence Intervals for Random Within-Person Measurement Error". American Journal of Epidemiology. 136 (11): 1400–1403. doi:10.1093/oxfordjournals.aje.a116453. PMID 1488967.
  9. Carroll, R. J.; Ruppert, D.; Stefanski, L. A. (1995). Measurement Error in Nonlinear Models. New York: Wiley.
  10. Spearman, C. (1904). "The Proof and Measurement of Association between Two Things". The American Journal of Psychology. 15 (1): 72–101. doi:10.2307/1412159. ISSN 0002-9556. JSTOR 1412159.
  11. Jensen, A. R. (1998). The g Factor: The Science of Mental Ability. Human Evolution, Behavior, and Intelligence. Praeger. ISBN 978-0-275-96103-9.
  12. Osborne, Jason W. (2003). "Effect Sizes and the Disattenuation of Correlation and Regression Coefficients: Lessons from Educational Psychology". Practical Assessment, Research, and Evaluation. 8 (1). doi:10.7275/0k9h-tq64.
  13. Franks, Alexander; Airoldi, Edoardo; Slavov, Nikolai (2017). "Post-transcriptional regulation across human tissues". PLOS Computational Biology. 13 (5): e1005535. doi:10.1371/journal.pcbi.1005535. ISSN 1553-7358. PMC 5440056. PMID 28481885.
  14. Stevens, R. J.; Kothari, V.; Adler, A. I.; Stratton, I. M.; Holman, R. R. (2001). "Appendix to 'The UKPDS Risk Engine: a model for the risk of coronary heart disease in type 2 diabetes (UKPDS 56)'". Clinical Science. 101: 671–679. doi:10.1042/cs20000335.
  15. Davey Smith, G.; Phillips, A. N. (1996). "Inflation in epidemiology: 'The proof and measurement of association between two things' revisited". British Medical Journal. 312 (7047): 1659–1661. doi:10.1136/bmj.312.7047.1659. PMC 2351357. PMID 8664725.
  16. Spearman, C. (1904). "The proof and measurement of association between two things". American Journal of Psychology. 15 (1): 72–101. doi:10.2307/1412159. JSTOR 1412159.