Causal graph

In statistics, econometrics, epidemiology, genetics and related disciplines, causal graphs (also known as path diagrams, causal Bayesian networks or DAGs) are probabilistic graphical models used to encode assumptions about the data-generating process.

Causal graphs can be used for communication and for inference. They are complementary to other forms of causal reasoning, for instance using causal equality notation. As communication devices, the graphs provide formal and transparent representation of the causal assumptions that researchers may wish to convey and defend. As inference tools, the graphs enable researchers to estimate effect sizes from non-experimental data, [1] [2] [3] [4] [5] derive testable implications of the assumptions encoded, [1] [6] [7] [8] test for external validity, [9] and manage missing data [10] and selection bias. [11]

Causal graphs were first used by the geneticist Sewall Wright [12] under the rubric "path diagrams". They were later adopted by social scientists [13] [14] [15] [16] [17] and, to a lesser extent, by economists. [18] These models were initially confined to linear equations with fixed parameters. Modern developments have extended graphical models to non-parametric analysis, and thus achieved a generality and flexibility that has transformed causal analysis in computer science, epidemiology, [19] and social science. [20] Recent advances include the development of large-scale causality graphs, such as CauseNet, which compiles over 11 million causal relations extracted from web sources to support causal question answering and reasoning. [21]

Construction and terminology

A causal graph can be drawn as follows. Each variable in the model has a corresponding vertex or node, and an arrow is drawn from a variable X to a variable Y whenever Y is judged to respond to changes in X when all other variables are held constant. Variables connected to Y through direct arrows are called parents of Y, or "direct causes of Y," and are denoted by Pa(Y).

Causal models often include "error terms" or "omitted factors" which represent all unmeasured factors that influence a variable Y when Pa(Y) are held constant. In most cases, error terms are excluded from the graph. However, if the graph author suspects that the error terms of any two variables are dependent (e.g. the two variables have an unobserved or latent common cause) then a bidirected arc is drawn between them. Thus, the presence of latent variables is taken into account through the correlations they induce between the error terms, as represented by bidirected arcs.

Fundamental tools

A fundamental tool in graphical analysis is d-separation, which allows researchers to determine, by inspection, whether the causal structure implies that two sets of variables are independent given a third set. In recursive models without correlated error terms (sometimes called Markovian), these conditional independences represent all of the model's testable implications. [22]
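One mechanical way to check d-separation is the ancestral moral graph method: restrict the DAG to the ancestors of the variables involved, "marry" co-parents and drop arrow directions, delete the conditioning set, and test connectivity. The sketch below is an illustrative implementation (not from the source), assuming the graph is given as a map from each node to its set of parents, Pa(node):

```python
from itertools import combinations

def d_separated(dag, xs, ys, zs):
    """Test whether node sets xs and ys are d-separated given zs in a DAG.

    `dag` maps each node to the set of its parents, Pa(node); nodes with
    no parents may be omitted from the map.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Ancestral subgraph: keep only ancestors of xs | ys | zs.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(dag.get(v, ()))

    # 2. Moralize: undirected edges between each node and its parents,
    #    plus edges between every pair of co-parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        parents = [p for p in dag.get(v, ()) if p in relevant]
        for p in parents:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(parents, 2):
            adj[p].add(q); adj[q].add(p)

    # 3. Delete the conditioning set and test reachability from xs to ys.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        v = stack.pop()
        if v in seen or v in zs:
            continue
        seen.add(v)
        stack.extend(adj[v] - zs)
    return not (seen & ys)
```

On a chain X → Z → Y, conditioning on Z separates X from Y; on a collider X → Z ← Y, it does the opposite, connecting the previously independent X and Y.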

Example

Suppose we wish to estimate the effect of attending an elite college on future earnings. Simply regressing earnings on college rating will not give an unbiased estimate of the target effect because elite colleges are highly selective, and students attending them are likely to have qualifications for high-earning jobs prior to attending the school. Assuming that the causal relationships are linear, this background knowledge can be expressed in the following structural equation model (SEM) specification.

Model 1

Q1 = U1
C = a·Q1 + U2
Q2 = c·C + d·Q1 + U3
S = b·Q2 + U4

where Q1 represents the individual's qualifications prior to college, Q2 represents qualifications after college, C contains attributes representing the quality of the college attended, and S the individual's salary; U1, ..., U4 are error terms representing unmeasured background factors.

Figure 1: Unidentified model with latent variables (Q1 and Q2) shown explicitly
Figure 2: Unidentified model with latent variables summarized

Figure 1 is a causal graph that represents this model specification. Each variable in the model has a corresponding node or vertex in the graph. Additionally, for each equation, arrows are drawn from the independent variables to the dependent variables. These arrows reflect the direction of causation. In some cases, we may label the arrow with its corresponding structural coefficient as in Figure 1.

If Q1 and Q2 are unobserved or latent variables, their influence on C and S can be attributed to their error terms. By removing them, we obtain the following model specification:

Model 2

C = U_C
S = β·C + U_S

The background information specified by Model 1 implies that the error term of S, U_S, is correlated with C's error term, U_C. As a result, we add a bidirected arc between S and C, as in Figure 2.

Figure 3: Identified model with latent variables (Q1 and Q2) shown explicitly
Figure 4: Identified model with latent variables summarized

Since U_S is correlated with U_C and, therefore, with C, C is endogenous and β is not identified in Model 2. However, if we include the strength of an individual's college application, A, as shown in Figure 3, we obtain the following model:

Model 3

Q1 = U1
A = a·Q1 + U2
C = b·A + U3
Q2 = e·Q1 + d·C + U4
S = c·Q2 + U5

By removing the latent variables from the model specification we obtain:

Model 4

A = U_A
C = b·A + U_C
S = β·C + U_S

with U_A correlated with U_S.

Now, β is identified and can be estimated using the regression of S on C and A. This can be verified using the single-door criterion, [1] [23] a necessary and sufficient graphical condition for the identification of structural coefficients, like β, using regression.
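The identification claim can be checked numerically. The simulation below is a sketch under assumed values (all structural coefficients set to 1, standard normal errors); the equation structure mirrors the qualitative description of the example, in which prior qualifications drive both the application and, eventually, the salary. With these choices, regressing salary on college quality alone converges to a biased slope, while adding the application strength as a regressor recovers the true total effect of C on S:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = b = c = d = e = 1.0  # structural coefficients (illustrative assumption)

# Structure of the example: Q1 -> A -> C -> Q2 -> S, plus Q1 -> Q2.
Q1 = rng.normal(size=n)                      # prior qualifications (latent)
A  = a * Q1 + rng.normal(size=n)             # strength of college application
C  = b * A + rng.normal(size=n)              # quality of college attended
Q2 = e * Q1 + d * C + rng.normal(size=n)     # post-college qualifications
S  = c * Q2 + rng.normal(size=n)             # salary

beta_true = c * d                            # total effect of C on S (= 1 here)

# Naive regression of S on C alone: biased, because C is endogenous.
beta_naive = np.linalg.lstsq(
    np.column_stack([np.ones(n), C]), S, rcond=None)[0][1]

# Single-door adjustment: regress S on C and A.
beta_adj = np.linalg.lstsq(
    np.column_stack([np.ones(n), C, A]), S, rcond=None)[0][1]
```

With these parameter values the naive slope converges to 4/3, while the adjusted coefficient on C converges to beta_true = 1.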

Related Research Articles

Simpson's paradox

Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. This result is often encountered in social-science and medical-science statistics, and is particularly problematic when frequency data are unduly given causal interpretations. The paradox can be resolved when confounding variables and causal relations are appropriately addressed in the statistical modeling.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.
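The diagnostic computation described above reduces, in the smallest case, to Bayes' rule on a two-node network Disease → Symptom. The numbers below are made-up illustrative probabilities, not from the source:

```python
# Assumed conditional probability tables for Disease -> Symptom.
p_disease = 0.01
p_symptom_given_disease = 0.90
p_symptom_given_healthy = 0.05

# Total probability of the observed effect ...
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# ... then Bayes' rule for the diagnostic direction: cause given effect.
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
```

Even with a 90% sensitive symptom, the posterior probability of disease stays modest (about 0.15 here) because the disease is rare, which is exactly the kind of inversion a Bayesian network automates at scale.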

Simultaneous equations models are a type of statistical model in which the dependent variables are functions of other dependent variables, rather than just independent variables. This means some of the explanatory variables are jointly determined with the dependent variable, which in economics usually is the consequence of some underlying equilibrium mechanism. Take the typical supply and demand model: whilst typically one would determine the quantity supplied and demanded to be a function of the price set by the market, it is also possible for the reverse to be true, where producers observe the quantity that consumers demand and then set the price.

A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics—particularly Bayesian statistics—and machine learning.

Spurious relationship

In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor.

In statistics, path analysis is used to describe the directed dependencies among a set of variables. This includes models equivalent to any form of multiple regression analysis, factor analysis, canonical correlation analysis, discriminant analysis, as well as more general families of models in the multivariate analysis of variance and covariance analyses.

Belief propagation, also known as sum–product message passing, is a message-passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory, and has demonstrated empirical success in numerous applications, including low-density parity-check codes, turbo codes, free energy approximation, and satisfiability.

Trygve Haavelmo

Trygve Magnus Haavelmo, born in Skedsmo, Norway, was an economist whose research interests centered on econometrics. He received the Nobel Memorial Prize in Economic Sciences in 1989.

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.
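The bias from an endogenous regressor, and its removal by a valid instrument, can be shown on synthetic data. The sketch below uses the simple Wald form of the IV estimator; variable names and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

z = rng.normal(size=n)                  # instrument: affects x, not y directly
u = rng.normal(size=n)                  # unobserved confounder of x and y
x = z + u + rng.normal(size=n)          # endogenous explanatory variable
y = 2.0 * x + u + rng.normal(size=n)    # true causal effect of x on y is 2

# OLS slope is biased because x is correlated with the error term (via u).
beta_ols = np.cov(x, y)[0, 1] / np.var(x)

# Wald/IV estimator: uses only the variation in x induced by z.
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
```

Here beta_ols converges to 7/3 rather than 2 under these parameter choices, while beta_iv converges to the true effect.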

Granger causality

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969. Ordinarily, regressions reflect "mere" correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of "true causality" is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only "predictive causality". Using the term "causality" alone is a misnomer, as Granger-causality is better described as "precedence", or, as Granger himself later claimed in 1977, "temporally related". Rather than testing whether X causes Y, the Granger causality test checks whether X forecasts Y.

Log–log plot

In science and engineering, a log–log graph or log–log plot is a two-dimensional graph of numerical data that uses logarithmic scales on both the horizontal and vertical axes. Power functions, i.e. relationships of the form y = ax^k, appear as straight lines in a log–log graph, with the exponent k corresponding to the slope and the coefficient a corresponding to the intercept. Thus these graphs are very useful for recognizing these relationships and estimating parameters. Any base can be used for the logarithm, though base 10 is most common.
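The slope-and-intercept reading translates directly into a fitting recipe: taking logarithms of a power law gives a linear relation, so an ordinary least-squares line fit recovers the exponent and coefficient. The values below are illustrative:

```python
import numpy as np

# Power law y = a * x**k appears linear on log-log axes:
# log10(y) = log10(a) + k * log10(x).
a_true, k_true = 2.5, 1.7
x = np.linspace(1.0, 100.0, 200)
y = a_true * x ** k_true

# Fit a straight line to the log-transformed data.
k_est, log_a_est = np.polyfit(np.log10(x), np.log10(y), 1)
a_est = 10 ** log_a_est
```

With noiseless data the fit recovers k and a essentially exactly; with noisy data the same recipe gives least-squares estimates.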

Bond graph

A bond graph is a graphical representation of a physical dynamic system. It allows the conversion of the system into a state-space representation. It is similar to a block diagram or signal-flow graph, with the major difference that the arcs in bond graphs represent bi-directional exchange of physical energy, while those in block diagrams and signal-flow graphs represent uni-directional flow of information. Bond graphs are multi-energy domain and domain neutral. This means a bond graph can incorporate multiple domains seamlessly.

Causal model

In metaphysics, a causal model is a conceptual model that describes the causal mechanisms of a system. Several types of causal notation may be used in the development of a causal model. Causal models can improve study designs by providing clear rules for deciding which independent variables need to be included/controlled for.

Mediation (statistics)

In statistics, a mediation model seeks to identify and explain the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third hypothetical variable, known as a mediator variable. Rather than a direct causal relationship between the independent variable and the dependent variable, which is often false, a mediation model proposes that the independent variable influences the mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables.

James Robins

James M. Robins is an epidemiologist and biostatistician best known for advancing methods for drawing causal inferences from complex observational studies and randomized trials, particularly those in which the treatment varies with time. He is the 2013 recipient of the Nathan Mantel Award for lifetime achievement in statistics and epidemiology, and a recipient of the 2022 Rousseeuw Prize in Statistics, jointly with Miguel Hernán, Eric Tchetgen-Tchetgen, Andrea Rotnitzky and Thomas Richardson.

Collider (statistics)

In statistics and causal graphs, a variable is a collider when it is causally influenced by two or more variables. The name "collider" reflects the fact that in graphical models, the arrow heads from variables that lead into the collider appear to "collide" on the node that is the collider. They are sometimes also referred to as inverted forks.
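Conditioning on a collider induces dependence between its otherwise independent causes. A short simulation (illustrative values, not from the source) makes this concrete: selecting on the common effect of two independent variables creates a negative correlation between them:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# X and Y are independent causes of the collider Z.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + 0.5 * rng.normal(size=n)

corr_marginal = np.corrcoef(x, y)[0, 1]           # ~0: x and y independent

sel = z > 1.0                                     # select on the collider
corr_selected = np.corrcoef(x[sel], y[sel])[0, 1] # clearly negative
```

Intuitively, among cases where z is large, a small x must be compensated by a large y, which is the mechanism behind selection bias in studies that condition on a common effect.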

Causal inference is the process of determining the independent, actual effect of a particular phenomenon that is a component of a larger system. The main difference between causal inference and inference of association is that causal inference analyzes the response of an effect variable when a cause of the effect variable is changed. The study of why things occur is called etiology, and can be described using the language of scientific causal notation. Causal inference is said to provide the evidence of causality theorized by causal reasoning.

A graphoid is a set of statements of the form, "X is irrelevant to Y given that we know Z" where X, Y and Z are sets of variables. The notion of "irrelevance" and "given that we know" may obtain different interpretations, including probabilistic, relational and correlational, depending on the application. These interpretations share common properties that can be captured by paths in graphs. The theory of graphoids characterizes these properties in a finite set of axioms that are common to informational irrelevance and its graphical representations.

In statistics, linear regression is a model that estimates the linear relationship between a scalar response and one or more explanatory variables. A model with exactly one explanatory variable is a simple linear regression; a model with two or more explanatory variables is a multiple linear regression. This term is distinct from multivariate linear regression, which predicts multiple correlated dependent variables rather than a single dependent variable.

Causality (book)

Causality: Models, Reasoning, and Inference is a book by Judea Pearl. It is an exposition and analysis of causality. It is considered to have been instrumental in laying the foundations of the modern debate on causal inference in several fields including statistics, computer science and epidemiology. In this book, Pearl espouses the Structural Causal Model (SCM) that uses structural equation modeling. This model is a competing viewpoint to the Rubin causal model. Some of the material from the book was later presented for a general audience in The Book of Why.

References

  1. Pearl, Judea (2000). Causality. New York: Cambridge University Press. ISBN 9780521773621.
  2. Tian, Jin; Pearl, Judea (2002). "A general identification condition for causal effects". Proceedings of the Eighteenth National Conference on Artificial Intelligence. ISBN   978-0-262-51129-2.
  3. Shpitser, Ilya; Pearl, Judea (2008). "Complete Identification Methods for the Causal Hierarchy" (PDF). Journal of Machine Learning Research. 9: 1941–1979.
  4. Huang, Y.; Valtorta, M. (2006). Identifiability in Causal Bayesian Networks: A Sound and Complete Algorithm (PDF).
  5. Bareinboim, Elias; Pearl, Judea (2012). "Causal Inference by Surrogate Experiments: z-Identifiability". Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence. arXiv: 1210.4842 . Bibcode:2012arXiv1210.4842B. ISBN   978-0-9749039-8-9.
  6. Tian, Jin; Pearl, Judea (2002). "On the Testable Implications of Causal Models with Hidden Variables". Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. pp. 519–27. arXiv: 1301.0608 . Bibcode:2013arXiv1301.0608T. ISBN   978-1-55860-897-9.
  7. Shpitser, Ilya; Pearl, Judea (2008). "Complete Identification Methods for the Causal Hierarchy" (PDF). Journal of Machine Learning Research. 9 (64): 1941–1979. ISSN   1533-7928 . Retrieved 2024-08-11.
  8. Chen, Bryant; Pearl, Judea (2014). "Testable Implications of Linear Structural Equation Models". Proceedings of the AAAI Conference on Artificial Intelligence. 28. doi: 10.1609/aaai.v28i1.9065 . S2CID   1612893.
  9. Bareinboim, Elias; Pearl, Judea (2014). "External Validity: From do-calculus to Transportability across Populations". Statistical Science. 29 (4): 579–595. arXiv: 1503.01603 . doi:10.1214/14-sts486. S2CID   5586184.
  10. Mohan, Karthika; Pearl, Judea; Tian, Jin (2013). "Graphical Models for Inference with Missing Data" (PDF). Advances in Neural Information Processing Systems.
  11. Bareinboim, Elias; Tian, Jin; Pearl, Judea (2014). "Recovering from Selection Bias in Causal and Statistical Inference". Proceedings of the AAAI Conference on Artificial Intelligence. 28. doi: 10.1609/aaai.v28i1.9074 .
  12. Wright, S. (1921). "Correlation and causation". Journal of Agricultural Research. 20: 557–585.
  13. Blalock, H. M. (1960). "Correlational analysis and causal inferences". American Anthropologist. 62 (4): 624–631. doi: 10.1525/aa.1960.62.4.02a00060 .
  14. Duncan, O. D. (1966). "Path analysis: Sociological examples". American Journal of Sociology. 72: 1–16. doi:10.1086/224256. S2CID   59428866.
  15. Duncan, O. D. (1976). "Introduction to structural equation models". American Journal of Sociology. 82 (3): 731–733. doi:10.1086/226377.
  16. Jöreskog, K. G. (1969). "A general approach to confirmatory maximum likelihood factor analysis". Psychometrika. 34 (2): 183–202. doi:10.1007/bf02289343. S2CID   186236320.
  17. Goldberger, A. S. (1972). "Structural equation models in the social sciences". Econometrica. 40 (6): 979–1001. doi:10.2307/1913851. JSTOR   1913851.
  18. White, Halbert; Chalak, Karim; Lu, Xun (2011). "Linking granger causality and the pearl causal model with settable systems" (PDF). Causality in Time Series Challenges in Machine Learning. 5.
  19. Rothman, Kenneth J.; Greenland, Sander; Lash, Timothy (2008). Modern epidemiology. Lippincott Williams & Wilkins. ISBN   978-0-7817-5564-1.
  20. Morgan, S. L.; Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. New York: Cambridge University Press. doi:10.1017/cbo9781107587991. ISBN   978-1-107-06507-9.
  21. Heindorf, Stefan; Scholten, Yan; Wachsmuth, Henning; Ngonga Ngomo, Axel-Cyrille; Potthast, Martin (2020). "CauseNet: Towards a Causality Graph Extracted from the Web". Proceedings of the 29th ACM International Conference on Information & Knowledge Management. CIKM. ACM.
  22. Geiger, Dan; Pearl, Judea (1993). "Logical and Algorithmic Properties of Conditional Independence". Annals of Statistics. 21 (4): 2001–2021. CiteSeerX   10.1.1.295.2043 . doi:10.1214/aos/1176349407.
  23. Chen, B.; Pearl, J (2014). "Graphical Tools for Linear Structural Equation Modeling" (PDF). Technical Report.