Bad control

In statistics, bad controls are variables that introduce an unintended discrepancy between regression coefficients and the effects that those coefficients are supposed to measure. They are contrasted with confounders, the "good controls", which must be included to remove omitted-variable bias. [1] [2] [3] The issue arises when a bad control is an outcome variable (or a close proxy for one) in a causal model, so that adjusting for it eliminates part of the desired causal path; in other words, bad controls could just as well be dependent variables in the model under consideration. [3] Angrist and Pischke (2008) further distinguish two types of bad control: the simple bad-control scenario, and the proxy-control scenario, in which the included variable partially controls for omitted factors but is itself partially affected by the variable of interest. [3] Pearl (1995) provides a graphical method for determining good controls using causal diagrams together with the back-door criterion and the front-door criterion. [4]
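To make the contrast concrete, here is a minimal simulation sketch (not from the cited sources; all variable names and effect sizes are hypothetical) of a good control: a confounder C opens a back-door path between treatment X and outcome Y, and including C removes the omitted-variable bias.

```python
# Hypothetical good-control simulation: confounder C drives both X and Y.
# Omitting C biases the coefficient on X; including C removes the bias.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

C = rng.normal(0, 1, n)                      # confounder
X = 0.8 * C + rng.normal(0, 1, n)            # treatment, partly caused by C
Y = 0.5 * X + 1.0 * C + rng.normal(0, 1, n)  # outcome; true effect of X is 0.5

def ols(y, *regressors):
    """Least-squares slopes with an intercept; intercept is dropped."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

naive = ols(Y, X)[0]        # inflated by the open back-door X <- C -> Y
adjusted = ols(Y, X, C)[0]  # ~0.5: back-door blocked by controlling for C

print(f"Y ~ X     (biased): {naive:.2f}")
print(f"Y ~ X + C (good)  : {adjusted:.2f}")
```

With these numbers the naive coefficient absorbs much of the confounder's influence (roughly doubling the true effect of 0.5), while adjusting for C recovers it.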

Examples

Simple bad control

Causal diagram showing a type of bad control. If we control for work type T when performing regression from education E to wages W, we have disrupted the causal path E → T → W, and such a regression coefficient does not have a causal interpretation.

A simplified example studies the effect of education on wages. [3] In this thought experiment, two levels of education are possible (lower and higher) and two types of job are performed (white-collar and blue-collar work). When considering the causal effect of education on an individual's wages, it might be tempting to control for the work type; however, work type T is a mediator in the causal relationship between education and wages (see causal diagram), and controlling for it precludes causal inference from the regression coefficients.
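A small simulation can illustrate the point (a sketch with hypothetical effect sizes, not taken from Angrist and Pischke): controlling for the mediator T removes the indirect part of education's effect on wages.

```python
# Hypothetical mediator simulation: education E raises the chance of
# white-collar work T, and both E and T raise wages W. Regressing W on E
# alone recovers the total causal effect; adding T as a control strips
# out the part of the effect flowing E -> T -> W.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

E = rng.integers(0, 2, n)                           # education: 0 lower, 1 higher
T = (rng.random(n) < 0.2 + 0.6 * E).astype(float)   # work type, caused by E
W = 1.0 * E + 2.0 * T + rng.normal(0, 1, n)         # wages; total effect of E is 1 + 2*0.6 = 2.2

def ols(y, *regressors):
    """Least-squares slopes with an intercept; intercept is dropped."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

total = ols(W, E)[0]          # ~2.2: total causal effect of E on W
controlled = ols(W, E, T)[0]  # ~1.0: only the direct path survives

print(f"W ~ E          : {total:.2f}")
print(f"W ~ E + T (bad): {controlled:.2f}")
```

The regression with the bad control is not wrong as a description of the conditional mean; it simply no longer answers the causal question of what education does to wages overall.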

Bad proxy-control

Causal diagram showing bad proxy-control. If we control for late ability L when performing regression from education E to wages W, we have introduced a new non-causal path E → L ← I → W, and thus a collider bias.

Another example of bad control arises when attempting to control for innate ability while estimating the effect of education on wages. [3] Here, innate ability (think of it as, for example, IQ at pre-school age) is a variable that influences wages, but its value is unavailable to researchers at the time of estimation. Instead, they choose before-work IQ test scores, or late ability, as a proxy variable for innate ability, and perform the regression from education to wages adjusting for late ability. Unfortunately, late ability (in this thought experiment) is causally determined by education and innate ability, and by controlling for it the researchers introduce collider bias into their model, opening a back-door path that was previously not present. On the other hand, if the links from innate ability to late ability and to wages are both strong, one can expect a strong (non-causal) correlation between late ability and wages, and thus a large omitted-variable bias if late ability is not controlled for. This issue, however, is separate from the causality problem.
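The collider mechanism can likewise be sketched in a simulation (hypothetical effect sizes, not from the source): even when education and innate ability are independent, conditioning on late ability induces a spurious association that distorts the education coefficient.

```python
# Hypothetical collider simulation: innate ability I affects wages W and
# late ability L; education E affects W and L; E and I are independent.
# Conditioning on the collider L ("controlling for ability" via a proxy)
# opens the path E -> L <- I -> W and biases the coefficient on E.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

I = rng.normal(0, 1, n)                       # innate ability (unobserved)
E = rng.normal(0, 1, n)                       # education, independent of I here
L = E + I + rng.normal(0, 1, n)               # late ability: collider of E and I
W = 1.0 * E + 1.0 * I + rng.normal(0, 1, n)   # wages; true effect of E is 1.0

def ols(y, *regressors):
    """Least-squares slopes with an intercept; intercept is dropped."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

plain = ols(W, E)[0]       # ~1.0: unbiased, since E and I are independent
proxied = ols(W, E, L)[0]  # biased: conditioning on L links E and I

print(f"W ~ E          : {plain:.2f}")
print(f"W ~ E + L (bad): {proxied:.2f}")
```

In this setup the "controlled" coefficient lands well below the true effect of 1.0: among individuals with the same late-ability score, higher education implies lower innate ability, and that induced negative association leaks into the education coefficient.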

Related Research Articles

Econometrics is an application of statistical methods to economic data in order to give empirical content to economic relationships. More precisely, it is "the quantitative analysis of actual economic phenomena based on the concurrent development of theory and observation, related by appropriate methods of inference." An introductory economics textbook describes econometrics as allowing economists "to sift through mountains of data to extract simple relationships." Jan Tinbergen is one of the two founding fathers of econometrics. The other, Ragnar Frisch, also coined the term in the sense in which it is used today.

Causality is an influence by which one event, process, state, or object (a cause) contributes to the production of another event, process, state, or object (an effect), where the cause is partly responsible for the effect, and the effect is partly dependent on the cause. In general, a process has many causes, which are also said to be causal factors for it, and all lie in its past. An effect can in turn be a cause of, or causal factor for, many other effects, which all lie in its future. Some writers have held that causality is metaphysically prior to notions of time and space.

Spurious relationship: Apparent, but false, correlation between causally-independent variables

In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor.

Coefficient of determination: Indicator for how well data points fit a line or curve

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias results in the model attributing the effect of the missing variables to those that were included.

Granger causality: Statistical hypothesis test for forecasting

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969. Ordinarily, regressions reflect "mere" correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of "true causality" is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only "predictive causality". Using the term "causality" alone is a misnomer, as Granger-causality is better described as "precedence", or, as Granger himself later claimed in 1977, "temporally related". Rather than testing whether X causes Y, the Granger causality test tests whether X forecasts Y.

Structural equation modeling: Form of causal modeling that fits networks of constructs to data

Structural equation modeling (SEM) is a diverse set of methods used by scientists doing both observational and experimental research. SEM is used mostly in the social and behavioral sciences but it is also used in epidemiology, business, and other fields. A definition of SEM is difficult without reference to technical language, but a good starting place is the name itself.

Confounding: Variable or factor in causal inference

In causal inference, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations. The existence of confounders is an important quantitative explanation why correlation does not imply causation. Some notations are explicitly designed to identify the existence, possible existence, or non-existence of confounders in causal relationships between elements of a system.

Difference in differences is a statistical technique used in econometrics and quantitative research in the social sciences that attempts to mimic an experimental research design using observational study data, by studying the differential effect of a treatment on a 'treatment group' versus a 'control group' in a natural experiment. It calculates the effect of a treatment on an outcome by comparing the average change over time in the outcome variable for the treatment group to the average change over time for the control group. Although it is intended to mitigate the effects of extraneous factors and selection bias, depending on how the treatment group is chosen, this method may still be subject to certain biases.

Causal model: Conceptual model in philosophy of science

In metaphysics, a causal model is a conceptual model that describes the causal mechanisms of a system. Several types of causal notation may be used in the development of a causal model. Causal models can improve study designs by providing clear rules for deciding which independent variables need to be included/controlled for.

Mediation (statistics): Statistical model

In statistics, a mediation model seeks to identify and explain the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third hypothetical variable, known as a mediator variable. Rather than a direct causal relationship between the independent variable and the dependent variable, which is often false, a mediation model proposes that the independent variable influences the mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables.

Joshua Angrist: Israeli–American economist

Joshua David Angrist is an Israeli–American economist and Ford Professor of Economics at the Massachusetts Institute of Technology. Angrist, together with Guido Imbens, was awarded the Nobel Memorial Prize in Economics in 2021 "for their methodological contributions to the analysis of causal relationships".

In statistics, econometrics, political science, epidemiology, and related disciplines, a regression discontinuity design (RDD) is a quasi-experimental pretest–posttest design that aims to determine the causal effects of interventions by assigning a cutoff or threshold above or below which an intervention is assigned. By comparing observations lying closely on either side of the threshold, it is possible to estimate the average treatment effect in environments in which randomisation is unfeasible. However, it remains impossible to make true causal inference with this method alone, as it does not automatically reject causal effects by any potential confounding variable. First applied by Donald Thistlethwaite and Donald Campbell (1960) to the evaluation of scholarship programs, the RDD has become increasingly popular in recent years. Recent study comparisons of randomised controlled trials (RCTs) and RDDs have empirically demonstrated the internal validity of the design.

In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that could be found in an estimate of the treatment effect obtained from simply comparing outcomes among units that received the treatment versus those that did not. Paul R. Rosenbaum and Donald Rubin introduced the technique in 1983.

The methodology of econometrics is the study of the range of differing approaches to undertaking econometric analysis.

Causal inference is the process of determining the independent, actual effect of a particular phenomenon that is a component of a larger system. The main difference between causal inference and inference of association is that causal inference analyzes the response of an effect variable when a cause of the effect variable is changed. The study of why things occur is called etiology, and can be described using the language of scientific causal notation. Causal inference is said to provide the evidence of causality theorized by causal reasoning.

The experimentalist approach to econometrics is a way of doing econometrics that, according to Angrist and Krueger (1999), "… puts front and center the problem of identifying causal effects from specific events or situations. These events or situations are thought of as natural experiments that generate exogenous variations in variables that would otherwise be endogenous in the behavioral relationship of interest." An example from the economic study of education can be used to illustrate the approach. Here we might be interested in the effect of an additional year of education on earnings. Those working with an experimentalist approach to econometrics would argue that such a question is problematic to answer because, in their terminology, education is not randomly assigned: those with different education levels would tend to also have different levels of other variables, many of which are unobserved and also affect earnings. This renders the causal effect of extra years of schooling difficult to identify. The experimentalist approach looks for an instrumental variable that is correlated with X but uncorrelated with the unobservables.

In statistics, econometrics, epidemiology, genetics and related disciplines, causal graphs are probabilistic graphical models used to encode assumptions about the data-generating process.

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

References

  1. Cinelli C, Forney A, Pearl J (2020). "A crash course in good and bad controls" (PDF). Sociological Methods & Research. SAGE Publications.
  2. Angrist JD, Pischke JS (2014). Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press. ISBN 9780691152844.
  3. Angrist JD, Pischke JS (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. ISBN 0691120358.
  4. Pearl J (1995). "Causal diagrams for empirical research". Biometrika. 82 (4): 669–688. doi:10.1093/biomet/82.4.669. ISSN 0006-3444.