Matching (statistics)

Matching is a statistical technique that evaluates the effect of a treatment by comparing treated and non-treated units in an observational study or quasi-experiment (i.e. when the treatment is not randomly assigned). The goal of matching is to reduce bias in the estimated treatment effect by finding, for every treated unit, one (or more) non-treated unit(s) with similar observed covariates. Comparing outcomes between treated units and their matched non-treated counterparts then estimates the effect of the treatment while reducing bias due to confounding. [1] [2] [3] Propensity score matching, an early matching technique, was developed as part of the Rubin causal model, [4] but has been shown to increase imbalance, model dependence, inefficiency, and bias, and is no longer recommended relative to other matching methods. [5] A simple, easy-to-understand, and statistically powerful alternative is coarsened exact matching (CEM). [6]
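As an illustration of the basic idea, the following is a minimal sketch of 1:1 nearest-neighbor matching on observed covariates, matching without replacement on standardized Euclidean distance. It is not the implementation of any particular package, and the function and variable names are hypothetical.

```python
import numpy as np

def nearest_neighbor_match(X, treated, caliper=None):
    """1:1 nearest-neighbor matching without replacement on covariates X.

    X       : (n, p) array of observed covariates
    treated : (n,) boolean array, True for treated units
    caliper : optional maximum distance for an acceptable match
    Returns a list of (treated_index, control_index) pairs.
    """
    # Standardize covariates so each is on a comparable scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    treated_idx = np.flatnonzero(treated)
    control_idx = np.flatnonzero(~treated)
    available = set(control_idx.tolist())
    pairs = []
    for i in treated_idx:
        if not available:
            break
        candidates = np.array(sorted(available))
        dists = np.linalg.norm(Z[candidates] - Z[i], axis=1)
        best = int(np.argmin(dists))
        if caliper is None or dists[best] <= caliper:
            pairs.append((int(i), int(candidates[best])))
            available.remove(int(candidates[best]))  # match without replacement
    return pairs
```

Outcomes can then be compared within the resulting matched pairs, or the matched subsample can be passed to a downstream analysis.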

Matching has been promoted by Donald Rubin. [4] It was prominently criticized in economics by LaLonde (1986), [7] who compared estimates of treatment effects from an experiment to comparable estimates produced with matching methods and showed that the matching estimates were substantially biased. Dehejia and Wahba (1999) reevaluated LaLonde's critique and showed that matching can closely reproduce the experimental benchmark. [8] Similar critiques have been raised in political science [9] and sociology [10] journals.

Analysis

When the outcome of interest is binary, the most general tool for the analysis of matched data is conditional logistic regression, as it handles strata of arbitrary size and continuous or binary treatments (predictors) and can control for covariates. In particular cases, simpler tests such as the paired difference test, McNemar's test, and the Cochran–Mantel–Haenszel test are available.
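For 1:1 matched pairs with a binary outcome, McNemar's test depends only on the counts of discordant pairs. The following is a minimal sketch computing the uncorrected chi-square version directly; the counts in the example call are made up for illustration.

```python
from scipy.stats import chi2

def mcnemar_test(n_treated_only, n_control_only):
    """McNemar chi-square test for 1:1 matched pairs with a binary outcome.

    n_treated_only : discordant pairs where only the treated unit had the event
    n_control_only : discordant pairs where only the control unit had the event
    Concordant pairs do not contribute to the statistic.
    """
    b, c = n_treated_only, n_control_only
    statistic = (b - c) ** 2 / (b + c)         # chi-square with 1 degree of freedom
    return statistic, chi2.sf(statistic, df=1)

# Hypothetical counts: 25 discordant pairs favour treatment, 10 favour control.
stat, p_value = mcnemar_test(25, 10)
```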

When the outcome of interest is continuous, the average treatment effect is estimated, typically by comparing outcomes between treated units and their matched controls.
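For example, with 1:1 matched pairs the average treatment effect on the treated can be estimated as the mean within-pair difference in outcomes, and a paired difference test can be applied. A minimal sketch with made-up numbers:

```python
import numpy as np
from scipy.stats import ttest_rel

# Outcomes for each matched pair (hypothetical values).
y_treated = np.array([3.1, 2.4, 4.0, 3.6, 2.9, 3.8])
y_control = np.array([2.7, 2.5, 3.2, 3.0, 2.8, 3.1])

att_estimate = np.mean(y_treated - y_control)      # mean within-pair difference
t_stat, p_value = ttest_rel(y_treated, y_control)  # paired t-test on the differences
```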

Matching can also be used to "pre-process" a sample before analysis via another technique, such as regression analysis. [11]
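A minimal sketch of that workflow, assuming a matched subsample has already been constructed (for instance with the nearest-neighbor sketch above), fits an ordinary least squares regression of the outcome on a treatment indicator and covariates using only the matched units; the toy data here are synthetic and statsmodels is used purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Toy matched subsample: treatment indicator t, one covariate x, outcome y.
n = 40
t = np.repeat([1.0, 0.0], n // 2)                  # matched treated and control units
x = rng.normal(size=n)
y = 1.0 + 0.5 * t + 0.8 * x + rng.normal(scale=0.3, size=n)

design = sm.add_constant(np.column_stack([t, x]))  # intercept, treatment, covariate
fit = sm.OLS(y, design).fit()
treatment_effect = fit.params[1]                   # coefficient on the treatment indicator
```

Because the matched sample is already balanced on the covariates, the regression estimate depends less on the model's functional form than it would on the full sample. [11]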

Overmatching

Overmatching, or post-treatment bias, is matching on an apparent mediator that is actually a consequence of the exposure. [12] If the mediator itself is stratified, an obscured relation of the exposure to the disease is likely to be induced. [13] Overmatching thus causes statistical bias. [13]

For example, matching the control group by gestation length and/or the number of multiple births when estimating perinatal mortality and birthweight after in vitro fertilization (IVF) is overmatching, since IVF itself increases the risk of premature birth and multiple birth. [14]

Overmatching may also be regarded as a sampling bias that decreases the external validity of a study, because the controls become more similar to the cases with regard to exposure than the general population is.

Related Research Articles

Experiment: Scientific procedure performed to validate a hypothesis

An experiment is a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried. Experiments provide insight into cause-and-effect by demonstrating what outcome occurs when a particular factor is manipulated. Experiments vary greatly in goal and scale but always rely on repeatable procedure and logical analysis of the results. There also exist natural experimental studies.

Field experiment: Experiment conducted outside the laboratory

Field experiments are experiments carried out outside of laboratory settings.

In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.

Confounding: Variable or factor in causal inference

In causal inference, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations. The existence of confounders is an important quantitative explanation why correlation does not imply causation. Some notations are explicitly designed to identify the existence, possible existence, or non-existence of confounders in causal relationships between elements of a system.

The Rubin causal model (RCM), also known as the Neyman–Rubin causal model, is an approach to the statistical analysis of cause and effect based on the framework of potential outcomes, named after Donald Rubin. The name "Rubin causal model" was first coined by Paul W. Holland. The potential outcomes framework was first proposed by Jerzy Neyman in his 1923 Master's thesis, though he discussed it only in the context of completely randomized experiments. Rubin extended it into a general framework for thinking about causation in both observational and experimental studies.

In statistics, ignorability is a feature of an experiment design whereby the method of data collection does not depend on the missing data. A missing data mechanism such as a treatment assignment or survey sampling strategy is "ignorable" if the missing data matrix, which indicates which variables are observed or missing, is independent of the missing data conditional on the observed data.

Randomized experiment: Experiment using randomness in some aspect, usually to aid in removal of bias

In science, randomized experiments are experiments that allow the greatest reliability and validity of statistical estimates of treatment effects. Randomization-based inference is especially important in experimental design and in survey sampling.

Observational study: Study with uncontrolled variable of interest

In fields such as epidemiology, the social sciences, psychology, and statistics, an observational study draws inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints. A common type of observational study examines the possible effect of a treatment on subjects, where the assignment of subjects into a treated group versus a control group is outside the control of the investigator. This is in contrast with experiments, such as randomized controlled trials, where each subject is randomly assigned to a treated group or a control group. Because they lack an assignment mechanism, observational studies naturally present difficulties for inferential analysis.

Quasi-experiment: Empirical interventional study

A quasi-experiment is an empirical interventional study used to estimate the causal impact of an intervention on a target population without random assignment. Quasi-experimental research shares similarities with the traditional experimental design or randomized controlled trial, but it specifically lacks the element of random assignment to treatment or control. Instead, quasi-experimental designs typically allow the researcher to control the assignment to the treatment condition, but using some criterion other than random assignment.

The average treatment effect (ATE) is a measure used to compare treatments in randomized experiments, evaluation of policy interventions, and medical trials. The ATE measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control. In a randomized trial, the average treatment effect can be estimated from a sample using a comparison in mean outcomes for treated and untreated units. However, the ATE is generally understood as a causal parameter that a researcher desires to know, defined without reference to the study design or estimation procedure. Both observational studies and experimental study designs with random assignment may enable one to estimate an ATE in a variety of ways.

In statistics, econometrics, political science, epidemiology, and related disciplines, a regression discontinuity design (RDD) is a quasi-experimental pretest–posttest design that estimates the causal effects of interventions by exploiting a cutoff or threshold above or below which an intervention is assigned. By comparing observations lying closely on either side of the threshold, it is possible to estimate the average treatment effect in environments in which randomisation is infeasible. However, it remains impossible to make true causal inference with this method alone, as it does not automatically rule out the effects of potential confounding variables. First applied by Donald Thistlethwaite and Donald Campbell (1960) to the evaluation of scholarship programs, the RDD has become increasingly popular in recent years. Recent comparisons of randomised controlled trials (RCTs) and RDDs have empirically demonstrated the internal validity of the design.

In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that could be found in an estimate of the treatment effect obtained from simply comparing outcomes among units that received the treatment versus those that did not. Paul R. Rosenbaum and Donald Rubin introduced the technique in 1983.

Paired difference test is a type of location test that is used when comparing two sets of paired measurements to assess whether their population means differ. A paired difference test uses additional information about the sample that is not present in an ordinary unpaired testing situation, either to increase the statistical power, or to reduce the effects of confounders.

Causal inference is the process of determining the independent, actual effect of a particular phenomenon that is a component of a larger system. The main difference between causal inference and inference of association is that causal inference analyzes the response of an effect variable when a cause of the effect variable is changed. The study of why things occur is called etiology, and can be described using the language of scientific causal notation. Causal inference is said to provide the evidence of causality theorized by causal reasoning.

In statistics, linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.

Paul R. Rosenbaum is the Robert G. Putzel Professor Emeritus in the Department of Statistics and Data Science at Wharton School of the University of Pennsylvania, where he worked from 1986 through 2021. He has written extensively about causal inference in observational studies, including sensitivity analysis, optimal matching, design sensitivity, evidence factors, quasi-experimental devices, and the propensity score. With various coauthors, he has also written about health outcomes, racial disparities in health outcomes, instrumental variables, psychometrics and experimental design.

Experimental benchmarking allows researchers to learn about the accuracy of non-experimental research designs. Specifically, one can compare observational results to experimental findings to calibrate bias. Under ordinary conditions, carrying out an experiment gives the researchers an unbiased estimate of their parameter of interest. This estimate can then be compared to the findings of observational research. Note that benchmarking is an attempt to calibrate non-statistical uncertainty. When combined with meta-analysis this method can be used to understand the scope of bias associated with a specific area of research.

Roderick J. A. Little

Roderick Joseph Alexander Little is an academic statistician, whose main research contributions lie in the statistical analysis of data with missing values and the analysis of complex sample survey data. Little is Richard D. Remington Distinguished University Professor of Biostatistics in the Department of Biostatistics at the University of Michigan, where he also holds academic appointments in the Department of Statistics and the Institute for Social Research.

Differential effects play a special role in certain observational studies in which treatments are not assigned to subjects at random, where differing outcomes may reflect biased assignments rather than effects caused by the treatments.

Jasjeet "Jas" Singh Sekhon is a data scientist, political scientist, and statistician at Yale University. Sekhon is the Eugene Meyer Professor at Yale University, a fellow of the American Statistical Association, and a fellow of the Society for Political Methodology. Sekhon's primary research interests lie in causal inference, machine learning, and their intersection. He has also published research on their application in various fields including voting behavior, online experimentation, epidemiology, and medicine.

References

  1. Rubin, Donald B. (1973). "Matching to Remove Bias in Observational Studies". Biometrics. 29 (1): 159–183. doi:10.2307/2529684. JSTOR   2529684.
  2. Anderson, Dallas W.; Kish, Leslie; Cornell, Richard G. (1980). "On Stratification, Grouping and Matching". Scandinavian Journal of Statistics. 7 (2): 61–66. JSTOR   4615774.
  3. Kupper, Lawrence L.; Karon, John M.; Kleinbaum, David G.; Morgenstern, Hal; Lewis, Donald K. (1981). "Matching in Epidemiologic Studies: Validity and Efficiency Considerations". Biometrics. 37 (2): 271–291. CiteSeerX   10.1.1.154.1197 . doi:10.2307/2530417. JSTOR   2530417. PMID   7272415.
  4. Rosenbaum, Paul R.; Rubin, Donald B. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects". Biometrika. 70 (1): 41–55. doi:10.1093/biomet/70.1.41.
  5. King, Gary; Nielsen, Richard (October 2019). "Why Propensity Scores Should Not Be Used for Matching". Political Analysis. 27 (4): 435–454. doi: 10.1017/pan.2019.11 . hdl: 1721.1/128459 . ISSN   1047-1987.
  6. Iacus, Stefano M.; King, Gary; Porro, Giuseppe (2011). "Multivariate Matching Methods That Are Monotonic Imbalance Bounding". Journal of the American Statistical Association. 106 (493): 345–361. doi:10.1198/jasa.2011.tm09599. hdl: 2434/151476 . ISSN   0162-1459. S2CID   14790456.
  7. LaLonde, Robert J. (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data". American Economic Review . 76 (4): 604–620. JSTOR   1806062.
  8. Dehejia, R. H.; Wahba, S. (1999). "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs" (PDF). Journal of the American Statistical Association . 94 (448): 1053–1062. doi:10.1080/01621459.1999.10473858.
  9. Arceneaux, Kevin; Gerber, Alan S.; Green, Donald P. (2006). "Comparing Experimental and Matching Methods Using a Large-Scale Field Experiment on Voter Mobilization". Political Analysis. 14 (1): 37–62. doi:10.1093/pan/mpj001.
  10. Arceneaux, Kevin; Gerber, Alan S.; Green, Donald P. (2010). "A Cautionary Note on the Use of Matching to Estimate Causal Effects: An Empirical Example Comparing Matching Estimates to an Experimental Benchmark". Sociological Methods & Research. 39 (2): 256–282. doi:10.1177/0049124110378098. S2CID   37012563.
  11. Ho, Daniel E.; Imai, Kosuke; King, Gary; Stuart, Elizabeth A. (2007). "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference". Political Analysis. 15 (3): 199–236. doi: 10.1093/pan/mpl013 .
  12. King, Gary; Zeng, Langche (2007). "Detecting Model Dependence in Statistical Inference: A Response". International Studies Quarterly. 51 (1): 231–241. doi:10.1111/j.1468-2478.2007.00449.x. ISSN   0020-8833. JSTOR   4621711. S2CID   12669035.
  13. Marsh, J. L.; Hutton, J. L.; Binks, K. (2002). "Removal of radiation dose response effects: an example of over-matching". British Medical Journal. 325 (7359): 327–330. doi:10.1136/bmj.325.7359.327. PMC 1123834. PMID 12169512.
  14. Gissler, M.; Hemminki, E. (1996). "The danger of overmatching in studies of the perinatal mortality and birthweight of infants born after assisted conception". Eur J Obstet Gynecol Reprod Biol. 69 (2): 73–75. doi:10.1016/0301-2115(95)02517-0. PMID   8902436.
