
Simpson's paradox, which also goes by several other names, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. This result is often encountered in social-science and medical-science statistics [1] [2] [3] and is particularly problematic when frequency data is unduly given causal interpretations. [4] The paradox can be resolved when causal relations are appropriately addressed in the statistical modeling. [4] [5] It is also referred to as Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox. [6]


Edward H. Simpson first described this phenomenon in a technical paper in 1951, [7] but the statisticians Karl Pearson et al., in 1899, [8] and Udny Yule, in 1903, [9] had mentioned similar effects earlier. The name Simpson's paradox was introduced by Colin R. Blyth in 1972. [10] Simpson's paradox has been used as an exemplar to illustrate to the non-specialist or public audience the kind of misleading results mis-applied statistics can generate. [11] [12]

## Examples

### UC Berkeley gender bias

One of the best-known examples of Simpson's paradox comes from a study of gender bias in graduate school admissions to the University of California, Berkeley. The admission figures for the fall of 1973 showed that men applying were more likely than women to be admitted, and the difference was so large that it was unlikely to be due to chance. [13] [14]

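| | All: applicants | All: admitted | Men: applicants | Men: admitted | Women: applicants | Women: admitted |
|---|---|---|---|---|---|---|
| Total | 12,763 | 41% | 8442 | 44% | 4321 | 35% |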

However, when examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas four were significantly biased against women. In total, the pooled and corrected data showed a "small but statistically significant bias in favor of women". [14] The data from the six largest departments are listed below, the top two departments by number of applicants for each gender italicised.

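| Department | All: applicants | All: admitted | Men: applicants | Men: admitted | Women: applicants | Women: admitted |
|---|---|---|---|---|---|---|
| A | 933 | 64% | *825* | 62% | 108 | 82% |
| B | 585 | 63% | *560* | 63% | 25 | 68% |
| C | 918 | 35% | 325 | 37% | *593* | 34% |
| D | 792 | 34% | 417 | 33% | 375 | 35% |
| E | 584 | 25% | 191 | 28% | *393* | 24% |
| F | 714 | 6% | 373 | 6% | 341 | 7% |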

The research paper by Bickel et al. concluded that women tended to apply to more competitive departments with low rates of admission, even among qualified applicants (such as in the English department), whereas men tended to apply to less competitive departments with high rates of admission (such as in the engineering department). [14]
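The effect of department choice can be checked directly from the six-department table. The sketch below is illustrative only: the table gives rounded percentages, so the admitted counts it derives are approximations, and the names `departments`, `pooled_rate`, and `women_ahead` are introduced here for the example.

```python
# (applicants, admission rate in percent) per department, copied from the table.
departments = {
    "A": {"men": (825, 62), "women": (108, 82)},
    "B": {"men": (560, 63), "women": (25, 68)},
    "C": {"men": (325, 37), "women": (593, 34)},
    "D": {"men": (417, 33), "women": (375, 35)},
    "E": {"men": (191, 28), "women": (393, 24)},
    "F": {"men": (373, 6), "women": (341, 7)},
}


def pooled_rate(group):
    """Admission rate over all six departments, weighted by applicant counts."""
    applicants = sum(d[group][0] for d in departments.values())
    # Approximate admitted counts: the source percentages are rounded.
    admitted = sum(d[group][0] * d[group][1] / 100 for d in departments.values())
    return admitted / applicants


# Departments in which women were admitted at a higher rate than men.
women_ahead = [name for name, d in departments.items()
               if d["women"][1] > d["men"][1]]

men_rate = pooled_rate("men")
women_rate = pooled_rate("women")

print("Women ahead in departments:", women_ahead)
print(f"Pooled rate, men:   {men_rate:.3f}")
print(f"Pooled rate, women: {women_rate:.3f}")
```

Women have the higher rate in four of the six departments, yet the pooled rate favors men, because most women applied to the low-admission departments C through F.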

### Kidney stone treatment

Another example comes from a real-life medical study [15] comparing the success rates of two treatments for kidney stones. [16] The table below shows the success rates and numbers of treatments for treatments involving both small and large kidney stones, where Treatment A includes open surgical procedures and Treatment B includes closed surgical procedures. The numbers in parentheses indicate the number of success cases over the total size of the group.

| Stone size | Treatment A | Treatment B |
|---|---|---|
| Small stones | Group 1: 93% (81/87) | Group 2: 87% (234/270) |
| Large stones | Group 3: 73% (192/263) | Group 4: 69% (55/80) |
| Both | 78% (273/350) | 83% (289/350) |

The paradoxical conclusion is that treatment A is more effective when used on small stones, and also when used on large stones, yet treatment B appears to be more effective when considering both sizes at the same time. In this example, the "lurking" variable (or confounding variable) causing the paradox is the size of the stones, whose importance the researchers did not recognize until its effects were included in the analysis.

Which treatment is considered better is determined by which success ratio (successes/total) is larger. The reversal of the inequality between the two ratios when considering the combined data, which creates Simpson's paradox, happens because two effects occur together:

1. The sizes of the groups, which are combined when the lurking variable is ignored, are very different. Doctors tend to give cases with large stones the better treatment A, and the cases with small stones the inferior treatment B. Therefore, the totals are dominated by groups 3 and 2, and not by the two much smaller groups 1 and 4.
2. The lurking variable, stone size, has a large effect on the ratios; i.e., the success rate is more strongly influenced by the severity of the case than by the choice of treatment. Therefore, the group of patients with large stones using treatment A (group 3) does worse than the group with small stones, even if the latter used the inferior treatment B (group 2).

Based on these effects, the paradoxical result is seen to arise by suppression of the causal effect of the size of the stones on the chance of a successful treatment. In short, the less effective treatment B appeared to be more effective because it was applied more frequently to the small stones cases, which were easier to treat. [16]
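The two effects can be verified from the counts in the table. The following is a minimal sketch using exact fractions; the variable and function names are chosen for this example and do not come from the cited study.

```python
from fractions import Fraction

# (successes, patients) for each treatment and stone size, from the table.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rate(treatment, size):
    """Success rate of one treatment within one stone-size group."""
    successes, patients = results[treatment][size]
    return Fraction(successes, patients)


def pooled_rate(treatment):
    """Success rate of one treatment with stone size ignored."""
    successes = sum(s for s, _ in results[treatment].values())
    patients = sum(n for _, n in results[treatment].values())
    return Fraction(successes, patients)


def small_stone_share(treatment):
    """Fraction of a treatment's patients who had small stones."""
    patients = sum(n for _, n in results[treatment].values())
    return Fraction(results[treatment]["small"][1], patients)


a_better_in_each_group = all(
    stratum_rate("A", size) > stratum_rate("B", size)
    for size in ("small", "large")
)
b_better_when_pooled = pooled_rate("B") > pooled_rate("A")

print("A better within each stone size:", a_better_in_each_group)
print("B better when pooled:", b_better_when_pooled)
print("Share of small stones, A:", float(small_stone_share("A")))
print("Share of small stones, B:", float(small_stone_share("B")))
```

Roughly a quarter of treatment A's patients had small stones, against more than three quarters of treatment B's, which is the imbalance described in the first effect.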

### Batting averages

A common example of Simpson's paradox involves the batting averages of players in professional baseball. It is possible for one player to have a higher batting average than another player each year for a number of years, but to have a lower batting average across all of those years. This phenomenon can occur when there are large differences in the number of at bats between the years. Mathematician Ken Ross demonstrated this using the batting average of two baseball players, Derek Jeter and David Justice, during the years 1995 and 1996: [17] [18]

| Batter | 1995 | 1996 | Combined |
|---|---|---|---|
| Derek Jeter | 12/48 (.250) | 183/582 (.314) | 195/630 (**.310**) |
| David Justice | 104/411 (**.253**) | 45/140 (**.321**) | 149/551 (.270) |

In both 1995 and 1996, Justice had a higher batting average (in bold type) than Jeter did. However, when the two baseball seasons are combined, Jeter shows a higher batting average than Justice. According to Ross, this phenomenon would be observed about once per year among the possible pairs of players. [17]
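One way to see why the reversal happens is to write each combined average as a weighted average of the yearly averages, with weights equal to each year's share of at bats. The decomposition below is a worked illustration based only on the figures in the table:

$$
\text{Jeter: } \frac{195}{630} = \frac{48}{630}\cdot\frac{12}{48} + \frac{582}{630}\cdot\frac{183}{582} \approx 0.310,
\qquad
\text{Justice: } \frac{149}{551} = \frac{411}{551}\cdot\frac{104}{411} + \frac{140}{551}\cdot\frac{45}{140} \approx 0.270.
$$

About 92% of Jeter's at bats fall in 1996, his stronger year, whereas about 75% of Justice's fall in 1995, his weaker year, so the two combined averages are pulled in opposite directions.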

## Vector interpretation

Simpson's paradox can also be illustrated using a 2-dimensional vector space. [19] A success rate of $\frac{p}{q}$ (i.e., successes/attempts) can be represented by a vector $\vec{A} = (q, p)$, with a slope of $\frac{p}{q}$. A steeper vector then represents a greater success rate. If two rates $\frac{p_1}{q_1}$ and $\frac{p_2}{q_2}$ are combined, as in the examples given above, the result can be represented by the sum of the vectors $(q_1, p_1)$ and $(q_2, p_2)$, which according to the parallelogram rule is the vector $(q_1 + q_2, p_1 + p_2)$, with slope $\frac{p_1 + p_2}{q_1 + q_2}$.

Simpson's paradox says that even if a vector $\vec{L_1}$ (in orange in figure) has a smaller slope than another vector $\vec{B_1}$ (in blue), and $\vec{L_2}$ has a smaller slope than $\vec{B_2}$, the sum of the two vectors $\vec{L_1} + \vec{L_2}$ can potentially still have a larger slope than the sum of the two vectors $\vec{B_1} + \vec{B_2}$, as shown in the example. For this to occur one of the orange vectors must have a greater slope than one of the blue vectors (here $\vec{L_2}$ and $\vec{B_1}$), and these will generally be longer than the alternatively subscripted vectors, thereby dominating the overall comparison.
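The vector picture can be reproduced numerically. The sketch below reuses the batting data from the previous section; assigning Jeter to the $\vec{L}$ vectors and Justice to the $\vec{B}$ vectors is an assumption made for this example, not a description of the original figure, and the helper names are hypothetical.

```python
def add(u, v):
    """Component-wise sum of two (attempts, successes) vectors."""
    return (u[0] + v[0], u[1] + v[1])


def steeper(u, v):
    """True if u has a strictly greater slope p/q than v.

    Compares p_u/q_u with p_v/q_v by cross-multiplication,
    which is valid because all attempt counts q are positive.
    """
    return u[1] * v[0] > v[1] * u[0]


# Vectors are (q, p) = (at bats, hits).
L1, L2 = (48, 12), (582, 183)    # Jeter 1995, 1996
B1, B2 = (411, 104), (140, 45)   # Justice 1995, 1996

L_sum = add(L1, L2)
B_sum = add(B1, B2)

print("B1 steeper than L1:", steeper(B1, L1))
print("B2 steeper than L2:", steeper(B2, L2))
print("L1+L2 steeper than B1+B2:", steeper(L_sum, B_sum))
print("L2 steeper than B1:", steeper(L2, B1))
```

Here $\vec{L_2}$ and $\vec{B_1}$ are the long vectors, and $\vec{L_2}$ is steeper than $\vec{B_1}$, so the sums inherit that ordering even though each yearly comparison goes the other way.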

## Correlation between variables

Simpson's paradox can also arise in correlations, in which two variables appear to be (say) positively correlated when in fact they are negatively correlated, the reversal having been brought about by a "lurking" confounder. Berman et al. [20] give an example from economics, where a dataset suggests overall demand is positively correlated with price (that is, higher prices lead to more demand), in contradiction of expectation. Analysis reveals time to be the confounding variable: plotting both price and demand against time reveals the expected negative correlation over various periods, which then reverses to become positive if the influence of time is ignored by simply plotting demand against price.
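A small numerical illustration of such a reversal follows. The price and demand figures are invented for this sketch and are not taken from Berman et al.; they simply show two periods in which demand falls as price rises, while both price and demand are higher in the later period.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: (prices, demand) within each time period.
period_1 = ([1, 2, 3], [5, 4, 3])
period_2 = ([6, 7, 8], [10, 9, 8])

r_period_1 = pearson(*period_1)
r_period_2 = pearson(*period_2)

# Ignoring time: pool the observations from both periods.
pooled_prices = period_1[0] + period_2[0]
pooled_demand = period_1[1] + period_2[1]
r_pooled = pearson(pooled_prices, pooled_demand)

print(f"Within period 1: {r_period_1:+.2f}")
print(f"Within period 2: {r_period_2:+.2f}")
print(f"Pooled:          {r_pooled:+.2f}")
```

Within each period the correlation is negative, but the pooled correlation is positive because the shift between periods dominates the within-period trend.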

## Implications for decision making

The practical significance of Simpson's paradox surfaces in decision making situations where it poses the following dilemma: Which data should we consult in choosing an action, the aggregated or the partitioned? In the Kidney Stone example above, it is clear that if one is diagnosed with "Small Stones" or "Large Stones" the data for the respective subpopulation should be consulted and Treatment A would be preferred to Treatment B. But what if a patient is not diagnosed, and the size of the stone is not known; would it be appropriate to consult the aggregated data and administer Treatment B? This would stand contrary to common sense; a treatment that is preferred both under one condition and under its negation should also be preferred when the condition is unknown.

On the other hand, if the partitioned data is to be preferred a priori, what prevents one from partitioning the data into arbitrary sub-categories (say based on eye color or post-treatment pain) artificially constructed to yield wrong choices of treatments? Pearl [4] shows that, indeed, in many cases it is the aggregated, not the partitioned data that gives the correct choice of action. Worse yet, given the same table, one should sometimes follow the partitioned and sometimes the aggregated data, depending on the story behind the data, with each story dictating its own choice. Pearl [4] considers this to be the real paradox behind Simpson's reversal.

As to why and how a story, not data, should dictate choices, the answer is that it is the story which encodes the causal relationships among the variables. Once we explicate these relationships and represent them formally, we can test which partition gives the correct treatment preference. For example, if we represent causal relationships in a graph called a "causal diagram" (see Bayesian networks), we can test whether nodes that represent the proposed partition intercept spurious paths in the diagram. This test, called the "back-door criterion", reduces Simpson's paradox to an exercise in graph theory. [21]
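When a partitioning variable does satisfy the back-door criterion, the effect of an action can be estimated with the adjustment formula $P(y \mid do(t)) = \sum_z P(y \mid t, z)\,P(z)$. The sketch below applies it to the kidney stone counts, under the assumption (made here for illustration) that stone size is the only confounder of treatment and outcome; the function names are hypothetical.

```python
from fractions import Fraction

# (successes, patients) by treatment and stone size, from the kidney stone table.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}
SIZES = ("small", "large")
total_patients = sum(patients for _, patients in counts.values())


def size_probability(size):
    """P(z): share of all patients, across both treatments, with this stone size."""
    patients = sum(n for (_, s), (_, n) in counts.items() if s == size)
    return Fraction(patients, total_patients)


def adjusted_success(treatment):
    """Back-door adjustment: sum over z of P(success | treatment, z) * P(z).

    Assumes stone size blocks every back-door path between
    treatment and outcome, which the data alone cannot confirm.
    """
    return sum(
        Fraction(*counts[(treatment, size)]) * size_probability(size)
        for size in SIZES
    )


adjusted = {t: float(adjusted_success(t)) for t in ("A", "B")}
print(f"Adjusted success rate, A: {adjusted['A']:.3f}")
print(f"Adjusted success rate, B: {adjusted['B']:.3f}")
```

Under that assumption the adjusted rates favor Treatment A, in agreement with the partitioned data rather than the aggregated data.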

## Psychology

Psychological interest in Simpson's paradox seeks to explain why people deem sign reversal to be impossible at first, offended by the idea that an action preferred both under one condition and under its negation should be rejected when the condition is unknown. The question is where people get this strong intuition from, and how it is encoded in the mind.

Simpson's paradox demonstrates that this intuition cannot be derived from either classical logic or probability calculus alone, and thus led philosophers to speculate that it is supported by an innate causal logic that guides people in reasoning about actions and their consequences. Savage's sure-thing principle [10] is an example of what such logic may entail. A qualified version of Savage's sure-thing principle can indeed be derived from Pearl's do-calculus [4] and reads: "An action A that increases the probability of an event B in each subpopulation $C_i$ of C must also increase the probability of B in the population as a whole, provided that the action does not change the distribution of the subpopulations." This suggests that knowledge about actions and consequences is stored in a form resembling Causal Bayesian Networks.
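The qualified principle can be motivated by a short calculation. The following is a sketch rather than Pearl's own derivation; it writes $\neg A$ for not taking the action (notation introduced here) and uses the proviso in the form $P(C_i \mid do(A)) = P(C_i \mid do(\neg A)) = P(C_i)$:

$$
P(B \mid do(A)) = \sum_i P(B \mid do(A), C_i)\,P(C_i \mid do(A)) = \sum_i P(B \mid do(A), C_i)\,P(C_i)
$$

$$
> \sum_i P(B \mid do(\neg A), C_i)\,P(C_i) = \sum_i P(B \mid do(\neg A), C_i)\,P(C_i \mid do(\neg A)) = P(B \mid do(\neg A)).
$$

The inequality holds term by term because the action raises the probability of B in every subpopulation and both sums use the same weights $P(C_i)$. A reversal becomes possible only when the weights differ between the two sums, which is exactly what the proviso rules out.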

## Probability

A paper by Pavlides and Perlman presents a proof, due to Hadjicostas, that in a random 2 × 2 × 2 table with uniform distribution, Simpson's paradox will occur with a probability of exactly 1/60. [22] A study by Kock suggests that the probability that Simpson's paradox would occur at random in path models (i.e., models generated by path analysis) with two predictors and one criterion variable is approximately 12.8 percent, slightly higher than 1 occurrence per 8 path models. [23]
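The 2 × 2 × 2 figure can be explored with a Monte Carlo experiment. The sketch below rests on assumptions that are not spelled out in the text above: "uniform distribution" is taken to mean that the eight cell probabilities are drawn uniformly from the probability simplex, and a reversal in either direction is counted as an occurrence of the paradox. Function names and sample sizes are illustrative.

```python
import random


def is_simpson_reversal(p):
    """Check a 2x2x2 table p[i][j][k] for a reversal in either direction.

    The association between the first two factors is measured by the sign
    of the 2x2 determinant, within each level k and in the table summed over k.
    """
    d0 = p[0][0][0] * p[1][1][0] - p[0][1][0] * p[1][0][0]
    d1 = p[0][0][1] * p[1][1][1] - p[0][1][1] * p[1][0][1]
    m = [[p[i][j][0] + p[i][j][1] for j in range(2)] for i in range(2)]
    dm = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return (d0 > 0 and d1 > 0 and dm < 0) or (d0 < 0 and d1 < 0 and dm > 0)


def estimate_reversal_probability(samples, seed=0):
    """Estimate how often a randomly drawn table shows a reversal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Normalised exponential draws are uniform on the simplex.
        w = [rng.expovariate(1.0) for _ in range(8)]
        total = sum(w)
        cells = [x / total for x in w]
        p = [[[cells[4 * i + 2 * j + k] for k in range(2)]
              for j in range(2)] for i in range(2)]
        hits += is_simpson_reversal(p)
    return hits / samples


# Sanity check on the kidney stone counts: i = treatment (A, B),
# j = outcome (success, failure), k = stone size (small, large).
kidney = [[[81, 192], [6, 71]], [[234, 55], [36, 25]]]
print("Kidney stone table shows a reversal:", is_simpson_reversal(kidney))
print("Estimated probability:", estimate_reversal_probability(100_000))
```

The estimate can be compared with 1/60 ≈ 0.0167; agreement depends on the assumptions above matching the definition used in the paper.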

A "second", less well-known Simpson's paradox was discussed in Simpson's 1951 paper. It arises when the rational interpretation is not necessarily found in the separate tables but may instead reside in the combined table. Which form of the data should be used hinges on the background and the process giving rise to the data.

Norton and Divine give a hypothetical example of the second paradox. [24]

## References

1. Clifford H. Wagner (February 1982). "Simpson's Paradox in Real Life". The American Statistician. 36 (1): 46–48. doi:10.2307/2684093. JSTOR 2684093.
2. Holt, G. B. (2016). Potential Simpson's paradox in multicenter study of intraperitoneal chemotherapy for ovarian cancer. Journal of Clinical Oncology, 34(9), 1016–1016.
3. Franks, Alexander; Airoldi, Edoardo; Slavov, Nikolai (2017). "Post-transcriptional regulation across human tissues". PLOS Computational Biology. 13 (5): e1005535. doi:10.1371/journal.pcbi.1005535. ISSN 1553-7358. PMID 28481885.
4. Judea Pearl. Causality: Models, Reasoning, and Inference, Cambridge University Press (2000, 2nd edition 2009). ISBN 0-521-77362-8.
5. Kock, N., & Gaskins, L. (2016). Simpson's paradox, moderation and the emergence of quadratic relationships in path models: An information systems illustration. International Journal of Applied Nonlinear Science, 2(3), 200–234.
6. I. J. Good, Y. Mittal (June 1987). "The Amalgamation and Geometry of Two-by-Two Contingency Tables". The Annals of Statistics. 15 (2): 694–711. ISSN 0090-5364. JSTOR 2241334.
7. Simpson, Edward H. (1951). "The Interpretation of Interaction in Contingency Tables". Journal of the Royal Statistical Society, Series B. 13: 238–241.
8. Pearson, Karl; Lee, Alice; Bramley-Moore, Lesley (1899). "Genetic (reproductive) selection: Inheritance of fertility in man, and of fecundity in thoroughbred racehorses". Philosophical Transactions of the Royal Society A. 192: 257–330.
9. G. U. Yule (1903). "Notes on the Theory of Association of Attributes in Statistics". Biometrika. 2 (2): 121–134. doi:10.1093/biomet/2.2.121.
10. Colin R. Blyth (June 1972). "On Simpson's Paradox and the Sure-Thing Principle". Journal of the American Statistical Association. 67 (338): 364–366. doi:10.2307/2284382. JSTOR 2284382.
11. Robert L. Wardrop (February 1995). "Simpson's Paradox and the Hot Hand in Basketball". The American Statistician, 49 (1): pp. 24–28.
12. Alan Agresti (2002). "Categorical Data Analysis" (Second edition). John Wiley and Sons. ISBN 0-471-36093-7.
13. David Freedman, Robert Pisani, and Roger Purves (2007), Statistics (4th edition), W. W. Norton. ISBN 0-393-92972-8.
14. P.J. Bickel, E.A. Hammel and J.W. O'Connell (1975). "Sex Bias in Graduate Admissions: Data From Berkeley" (PDF). Science. 187 (4175): 398–404. doi:10.1126/science.187.4175.398. PMID 17835295.
15. C. R. Charig; D. R. Webb; S. R. Payne; J. E. Wickham (29 March 1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy". Br Med J (Clin Res Ed). 292 (6524): 879–882. doi:10.1136/bmj.292.6524.879. PMID 3083922.
16. Steven A. Julious; Mark A. Mullee (3 December 1994). "Confounding and Simpson's paradox". BMJ. 309 (6967): 1480–1481. doi:10.1136/bmj.309.6967.1480. PMID 7804052.
17. Ken Ross. "A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans (Paperback)" Pi Press, 2004. ISBN 0-13-147990-3. pp. 12–13.
18. Statistics available from Baseball-Reference.com: Data for Derek Jeter; Data for David Justice.
19. Kocik Jerzy (2001). "Proofs without Words: Simpson's Paradox" (PDF). Mathematics Magazine. 74 (5): 399. doi:10.2307/2691038. JSTOR 2691038.
20. Berman, S. DalleMule, L. Greene, M., Lucker, J. (2012), "Simpson's Paradox: A Cautionary Tale in Advanced Analytics", Significance.
21. Pearl, Judea (December 2013). "Understanding Simpson's paradox" (PDF). UCLA Cognitive Systems Laboratory, Technical Report R-414.
22. Marios G. Pavlides & Michael D. Perlman (August 2009). "How Likely is Simpson's Paradox?". The American Statistician. 63 (3): 226–233. doi:10.1198/tast.2009.09007.
23. Kock, N. (2015). How likely is Simpson's paradox in path models? International Journal of e-Collaboration, 11(1), 1–7.
24. Norton, H. James; Divine, George (August 2015). "Simpson's paradox ... and how to avoid it". Significance. 12 (4): 40–43.

## Bibliography

• Leila Schneps and Coralie Colmez, Math on trial. How numbers get used and abused in the courtroom, Basic Books, 2013. ISBN 978-0-465-03292-1. (Sixth chapter: "Math error number 6: Simpson's paradox. The Berkeley sex bias case: discrimination detection").