Why Most Published Research Findings Are False

"Why Most Published Research Findings Are False" is a 2005 essay written by John Ioannidis, a professor at the Stanford School of Medicine, and published in PLOS Medicine. [1] It is considered foundational to the field of metascience.


In the paper, Ioannidis argued that a large number, if not the majority, of published medical research papers contain results that cannot be replicated. In simple terms, the essay states that scientists use hypothesis testing to determine whether scientific discoveries are significant. Statistical significance is formalized in terms of probability, with its p-value measure being reported in the scientific literature as a screening mechanism. Ioannidis posited assumptions about the way people perform and report these tests; then he constructed a statistical model which indicates that most published findings are likely false positive results.

While the general arguments in the paper recommending reforms in scientific research methodology were well-received, Ioannidis received criticism over the validity of his model and his claim that the majority of scientific findings are false. Responses to the paper suggest lower false positive and false negative rates than those Ioannidis puts forth.

Argument

Suppose that in a given scientific field there is a known baseline probability that a result is true, denoted by P(T). When a study is conducted, the probability that a positive result is obtained is P(+). Given these two factors, we want to compute the conditional probability P(T | +), which is known as the positive predictive value (PPV). Bayes' theorem allows us to compute the PPV as:

PPV = P(T | +) = (1 − β) P(T) / [(1 − β) P(T) + α (1 − P(T))]

where α is the type I error rate (false positives) and β is the type II error rate (false negatives); the statistical power is 1 − β. It is customary in most scientific research to desire α ≤ 0.05 and β ≤ 0.2. If we assume P(T) = 0.1 for a given scientific field, then we may compute the PPV for different values of α and β:

          β=0.1  β=0.2  β=0.3  β=0.4  β=0.5  β=0.6  β=0.7  β=0.8  β=0.9
α=0.01    0.91   0.90   0.89   0.87   0.85   0.82   0.77   0.69   0.53
α=0.02    0.83   0.82   0.80   0.77   0.74   0.69   0.63   0.53   0.36
α=0.03    0.77   0.75   0.72   0.69   0.65   0.60   0.53   0.43   0.27
α=0.04    0.71   0.69   0.66   0.63   0.58   0.53   0.45   0.36   0.22
α=0.05    0.67   0.64   0.61   0.57   0.53   0.47   0.40   0.31   0.18
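The PPV values above follow directly from the Bayes formula with the assumed base rate P(T) = 0.1. A minimal sketch (illustrative, not code from the paper) that reproduces the table:

```python
def ppv(alpha: float, beta: float, p_true: float = 0.1) -> float:
    """Positive predictive value from Bayes' theorem.

    alpha: type I error rate; beta: type II error rate;
    p_true: assumed base rate of true hypotheses, P(T).
    """
    true_pos = (1 - beta) * p_true     # power times base rate of true results
    false_pos = alpha * (1 - p_true)   # type I errors on the false hypotheses
    return true_pos / (true_pos + false_pos)

# Print one row per alpha, one column per beta, matching the table above.
betas = [round(0.1 * i, 1) for i in range(1, 10)]
for alpha in (0.01, 0.02, 0.03, 0.04, 0.05):
    row = "  ".join(f"{ppv(alpha, b):.2f}" for b in betas)
    print(f"alpha={alpha:.2f}: {row}")
```

For example, at the benchmark values α = 0.05 and β = 0.2, `ppv(0.05, 0.2)` gives 0.64, i.e. a 36% chance that a reported positive result is false even under these favorable assumptions.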

However, the simple formula for PPV derived from Bayes' theorem does not account for bias in study design or reporting. Some published findings would not have been presented as research findings if not for researcher bias. Let u be the probability that an analysis was published only due to researcher bias. Then the PPV is given by the more general expression:

PPV = [(1 − β) P(T) + u β P(T)] / [(1 − β) P(T) + u β P(T) + α (1 − P(T)) + u (1 − α) (1 − P(T))]

The introduction of bias tends to depress the PPV; in the extreme case when the bias of a study is maximized (u = 1), PPV = P(T). Even if a study meets the benchmark requirements for α and β, and is free of bias, there is still a 36% probability that a paper reporting a positive result will be incorrect; if the base probability of a true result is lower, then this will push the PPV lower too. Furthermore, there is strong evidence that the average statistical power of a study in many scientific fields is well below the benchmark level of 0.8. [2] [3] [4]
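The effect of the bias term can be checked numerically. The sketch below (illustrative, not from the paper) implements the bias-adjusted PPV described above, where u is the probability that a finding was published only because of bias:

```python
def ppv_biased(alpha: float, beta: float, u: float, p_true: float = 0.1) -> float:
    """Bias-adjusted PPV; u = 0 recovers the plain Bayes formula."""
    # Reported positives: true positives plus biased reports of true hypotheses.
    true_pos = (1 - beta) * p_true + u * beta * p_true
    # False positives: type I errors plus biased reports of false hypotheses.
    false_pos = alpha * (1 - p_true) + u * (1 - alpha) * (1 - p_true)
    return true_pos / (true_pos + false_pos)

# Benchmark alpha = 0.05, beta = 0.2, no bias: PPV = 0.64 (36% false positives).
print(round(ppv_biased(0.05, 0.2, u=0.0), 2))
# Maximal bias (u = 1) drives the PPV all the way down to the base rate P(T).
print(round(ppv_biased(0.05, 0.2, u=1.0), 2))
```

Increasing u monotonically lowers the PPV between those two extremes, which is the paper's point: bias erodes the evidential value of a positive result even when α and β meet their benchmarks.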

Given the realities of bias, low statistical power, and a small number of true hypotheses, Ioannidis concludes that the majority of studies in a variety of scientific fields are likely to report results that are false.

Corollaries

In addition to the main result, Ioannidis lists six corollaries for factors that can influence the reliability of published research.

Research findings in a scientific field are less likely to be true:

  1. the smaller the studies conducted.
  2. the smaller the effect sizes.
  3. the greater the number and the lesser the selection of tested relationships.
  4. the greater the flexibility in designs, definitions, outcomes, and analytical modes.
  5. the greater the financial and other interests and prejudices.
  6. the hotter the scientific field (with more scientific teams involved).

Ioannidis has added to this work by contributing to a meta-epidemiological study which found that only 1 in 20 interventions tested in Cochrane Reviews have benefits that are supported by high-quality evidence. [5] He also contributed to research suggesting that the quality of this evidence does not seem to improve over time. [6]

Reception

Despite skepticism about extreme statements made in the paper, Ioannidis's broader argument and warnings have been accepted by a large number of researchers. [7] The growth of metascience and the recognition of a scientific replication crisis have bolstered the paper's credibility, and led to calls for methodological reforms in scientific research. [8] [9]

In commentaries and technical responses, statisticians Goodman and Greenland identified several weaknesses in Ioannidis's model. [10] [11] They rejected his dramatic and exaggerated language, such as the claims that he had "proved" that most research findings are false and that "most research findings are false for most research designs and for most fields" [italics added], yet they agreed with his paper's conclusions and recommendations.

Biostatisticians Jager and Leek criticized the model as being based on justifiable but arbitrary assumptions rather than empirical data, and conducted an investigation of their own, which estimated the false positive rate in biomedical studies at around 14%, not over 50% as Ioannidis asserted. [12] Their paper was published in a 2014 special edition of the journal Biostatistics along with extended, supporting critiques from other statisticians. Leek summarized the key points of agreement as: when talking about the science-wise false discovery rate one has to bring data; there are different frameworks for estimating the science-wise false discovery rate; and "it is pretty unlikely that most published research is false", though that probably varies by one's definition of "most" and "false". [13]

Statistician Ulrich Schimmack reinforced the importance of an empirical basis for such models by noting that the reported false discovery rate in some scientific fields is not the actual false discovery rate, because non-significant results are rarely reported. Ioannidis's theoretical model fails to account for that; when a statistical method ("z-curve") for estimating the number of unpublished non-significant results is applied to two examples, the false positive rate comes out between 8% and 17%, not greater than 50%. [14]

Causes of high false positive rate

Despite these weaknesses, there is nonetheless general agreement with the problem and recommendations Ioannidis discusses; yet his tone has been described as "dramatic" and "alarmingly misleading", which runs the risk of making people unnecessarily skeptical or cynical about science. [10] [15]

A lasting impact of this work has been awareness of the underlying drivers of the high false positive rate in clinical medicine and biomedical research, and efforts by journals and scientists to mitigate them. Ioannidis restated these drivers in 2016. [16]

References

  1. Ioannidis, John P. A. (2005). "Why Most Published Research Findings Are False". PLOS Medicine. 2 (8): e124. doi: 10.1371/journal.pmed.0020124 . ISSN   1549-1277. PMC   1182327 . PMID   16060722.
  2. Button, Katherine S.; Ioannidis, John P. A.; Mokrysz, Claire; Nosek, Brian A.; Flint, Jonathan; Robinson, Emma S. J.; Munafò, Marcus R. (2013). "Power failure: why small sample size undermines the reliability of neuroscience". Nature Reviews Neuroscience. 14 (5): 365–376. doi: 10.1038/nrn3475 . ISSN   1471-0048. PMID   23571845.
  3. Szucs, Denes; Ioannidis, John P. A. (2017-03-02). "Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature". PLOS Biology. 15 (3): e2000797. doi: 10.1371/journal.pbio.2000797 . ISSN   1545-7885. PMC   5333800 . PMID   28253258.
  4. Ioannidis, John P. A.; Stanley, T. D.; Doucouliagos, Hristos (2017). "The Power of Bias in Economics Research". The Economic Journal. 127 (605): F236–F265. doi: 10.1111/ecoj.12461 . ISSN   1468-0297. S2CID   158829482.
  5. Howick, Jeremy; Koletsi, Despina; Ioannidis, John P. A.; Madigan, Claire; Pandis, Nikolaos; Loef, Martin; Walach, Harald; Sauer, Sebastian; Kleijnen, Jos; Seehra, Jadbinder; Johnson, Tess; Schmidt, Stefan (August 1, 2022). "Most healthcare interventions tested in Cochrane Reviews are not effective according to high quality evidence: a systematic review and meta-analysis". Journal of Clinical Epidemiology. 148: 160–169. doi:10.1016/j.jclinepi.2022.04.017. PMID 35447356. S2CID 248250137.
  6. Howick, Jeremy; Koletsi, Despina; Pandis, Nikolaos; Fleming, Padhraig S.; Loef, Martin; Walach, Harald; Schmidt, Stefan; Ioannidis, John P. A. (October 1, 2020). "The quality of evidence for medical interventions does not improve or worsen: a metaepidemiological study of Cochrane reviews". Journal of Clinical Epidemiology. 126: 154–159. doi:10.1016/j.jclinepi.2020.08.005. PMID 32890636. S2CID 221512241.
  7. Belluz, Julia (2015-02-16). "John Ioannidis has dedicated his life to quantifying how science is broken". Vox. Retrieved 2020-03-28.
  8. "Low power and the replication crisis: What have we learned since 2004 (or 1984, or 1964)? « Statistical Modeling, Causal Inference, and Social Science". statmodeling.stat.columbia.edu. Retrieved 2020-03-28.
  9. Wasserstein, Ronald L.; Lazar, Nicole A. (2016-04-02). "The ASA Statement on p-Values: Context, Process, and Purpose". The American Statistician. 70 (2): 129–133. doi: 10.1080/00031305.2016.1154108 . ISSN   0003-1305.
  10. Goodman, Steven; Greenland, Sander (24 April 2007). "Why Most Published Research Findings Are False: Problems in the Analysis". PLOS Medicine. 4 (4): e168. doi:10.1371/journal.pmed.0040168. PMC 1855693. PMID 17456002.
  11. Goodman, Steven; Greenland, Sander. "Assessing the Unreliability of the Medical Literature: A Response to 'Why Most Published Research Findings Are False'". Collection of Biostatistics Research Archive. Johns Hopkins University, Dept. of Biostatistics Working Papers, Working Paper 135. Archived from the original on 2 November 2018.
  12. Jager, Leah R.; Leek, Jeffrey T. (1 January 2014). "An estimate of the science-wise false discovery rate and application to the top medical literature". Biostatistics. 15 (1). Oxford Academic: 1–12. doi: 10.1093/biostatistics/kxt007 . PMID   24068246. Archived from the original on 11 June 2020.
  13. Leek, Jeff. "Is most science false? The titans weigh in". simplystatistics.org. Archived from the original on 31 January 2017.
  14. Schimmack, Ulrich (16 January 2019). "Ioannidis (2005) was wrong: Most published research findings are not false". Replicability-Index. Archived from the original on 19 September 2020.
  15. Ingraham, Paul (15 September 2016). "Ioannidis: Making Science Look Bad Since 2005". www.PainScience.com. Archived from the original on 21 June 2020.
  16. Minikel, Eric V. (17 March 2016). "John Ioannidis: The state of research on research". www.cureffi.org. Archived from the original on 17 January 2020.
