Invalid science

Invalid science consists of scientific claims based on experiments that cannot be reproduced or that are contradicted by experiments that can be reproduced. Recent analyses indicate that the proportion of retracted claims in the scientific literature is steadily increasing. [1] The number of retractions has grown tenfold over the past decade, but they still make up approximately 0.2% of the 1.4m papers published annually in scholarly journals. [2]

The U.S. Office of Research Integrity (ORI) investigates scientific misconduct. [3]

Incidence

Science magazine ranked first for the number of articles retracted, at 70, just edging out PNAS, which retracted 69. Thirty-two of Science's retractions were due to fraud or suspected fraud, and 37 to error. A subsequent "retraction index" indicated that journals with relatively high impact factors, such as Science, Nature and Cell, had a higher rate of retractions. Fewer than 0.1% of the more than 25 million papers in PubMed, which go back to the 1940s, had been retracted. [3] [4]

The fraction of retracted papers due to scientific misconduct was estimated at two-thirds, according to a study of 2,047 papers published since 1977. Misconduct included fraud and plagiarism. Another one-fifth were retracted because of mistakes, and the rest were pulled for unknown or other reasons. [3]

A separate study analyzed 432 claims of genetic links for various health risks that vary between men and women. Only one of these claims proved to be consistently reproducible. Another meta-review found that of the 49 most-cited clinical research studies published between 1990 and 2003, more than 40 percent were later shown to be either totally wrong or significantly incorrect. [5] [6]

Biological sciences

In 2012 biotech firm Amgen was able to reproduce just six of 53 important studies in cancer research. Earlier, a group at Bayer, a drug company, successfully repeated only one fourth of 67 important papers. In 2000-10 roughly 80,000 patients took part in clinical trials based on research that was later retracted because of mistakes or improprieties. [1]

Paleontology

Nathan Myhrvold failed repeatedly to replicate the findings of several papers on dinosaur growth. Dinosaurs added a layer to their bones each year. Tyrannosaurus rex was thought to have increased in size by more than 700 kg a year, until Myhrvold showed that this was a factor of 2 too large. In four of the 12 papers he examined, the original data had been lost. In three, the statistics were correct, while three had serious errors that invalidated their conclusions. Two papers mistakenly relied on data from these three. He discovered that some of the papers' graphs did not reflect the data. In one case, he found that only four of nine points on the graph came from data cited in the paper. [7]

Major retractions

Torcetrapib was originally hyped as a drug that could block a protein that converts HDL cholesterol into LDL with the potential to "redefine cardiovascular treatment". One clinical trial showed that the drug could increase HDL and decrease LDL. Two days after Pfizer announced its plans for the drug, it ended the Phase III clinical trial due to higher rates of chest pain and heart failure and a 60 percent increase in overall mortality. Pfizer had invested more than $1 billion in developing the drug. [5]

An in-depth review of the most highly cited biomarkers (whose presence is used to infer illness and measure treatment effects) claimed that 83 percent of supposed correlations became significantly weaker in subsequent studies. Homocysteine is an amino acid whose levels correlated with heart disease. However, a 2010 study showed that lowering homocysteine by nearly 30 percent had no effect on heart attack or stroke. [5]

Priming

Priming studies claim that decisions can be influenced by apparently irrelevant events that a subject witnesses just before making a choice. Nobel Prize-winner Daniel Kahneman alleges that much of it is poorly founded. Researchers have been unable to replicate some of the more widely cited examples. A paper in PLoS ONE [8] reported that nine separate experiments could not reproduce a study purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan. [2] A further systematic replication involving 40 different labs around the world did not replicate the main finding. [9] However, this latter systematic replication showed that participants who did not think there was a relation between thinking about a hooligan or a professor were significantly more susceptible to the priming manipulation.

Potential causes

Competition

In the 1950s, when academic research accelerated during the Cold War, the total number of scientists was a few hundred thousand. In the new century 6m-7m researchers are active. The number of research jobs has not matched this increase. Every year six new PhDs compete for every academic post. Replicating other researchers' results is not perceived to be valuable. The struggle to compete encourages exaggeration of findings and biased data selection. A recent survey found that one in three researchers knows of a colleague who has at least somewhat distorted their results. [1]

Publication bias

Major journals reject more than 90% of submitted manuscripts and tend to favor the most dramatic claims. The statistical measures that researchers use to test their claims allow a fraction of false claims to appear valid. Invalid claims are more likely to be dramatic (because they are false). Without replication, such errors are less likely to be caught. [1]

Conversely, failures to prove a hypothesis are rarely even offered for publication. “Negative results” now account for only 14% of published papers, down from 30% in 1990. Knowledge of what is not true is as important as of what is true. [1]

Peer review

Peer review is the primary validation technique employed by scientific publications. However, a prominent medical journal tested the system and found major failings. It supplied research with induced errors and found that most reviewers failed to spot the mistakes, even after being told of the tests. [1]

A pseudonymous fabricated paper on the effects of a chemical derived from lichen on cancer cells was submitted to 304 journals for peer review. The paper was filled with errors of study design, analysis and interpretation. 157 lower-rated journals accepted it. Another study sent an article containing eight deliberate mistakes in study design, analysis and interpretation to more than 200 of the British Medical Journal’s regular reviewers. On average, they reported fewer than two of the problems. [2]

Peer reviewers typically do not re-analyse data from scratch, checking only that the authors’ analysis is properly conceived. [2]

Statistics

Type I and type II errors

Scientists divide errors into type I, incorrectly asserting the truth of a hypothesis (false positive), and type II, rejecting a correct hypothesis (false negative). Statistical checks assess the probability that data which seem to support a hypothesis come about simply by chance. If the probability is less than 5%, the evidence is rated “statistically significant”. One definitional consequence is a type I error rate of one in 20. [2]
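The one-in-20 figure can be checked with a small simulation. The sketch below is illustrative only: it assumes a two-sided z-test on normally distributed data with known unit variance, and the function names, sample sizes and seed are hypothetical choices rather than anything taken from the cited sources. Every simulated experiment has no real effect, so each "significant" result is a type I error.

```python
import math
import random

def two_sided_p_value(sample):
    """p-value of a z-test of 'mean is 0' for data with known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))  # equals mean * sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_experiments=5000, n_subjects=20, alpha=0.05, seed=0):
    """Share of null experiments (no real effect) that are rated significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Draw data from a population in which the tested effect is absent.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        if two_sided_p_value(sample) < alpha:
            hits += 1
    return hits / n_experiments

rate = false_positive_rate()
print(f"share of null experiments rated significant: {rate:.3f}")
```

Under these assumptions the printed share should sit close to 0.05, whatever the sample size, because the threshold itself fixes the type I error rate.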

Statistical power

In 2005 Stanford epidemiologist John Ioannidis showed that the idea that only one paper in 20 gives a false-positive result was incorrect. He claimed, “most published research findings are probably false.” He found three categories of problems: insufficient “statistical power” (the ability to avoid type II errors); the unlikeliness of the hypothesis; and publication bias favoring novel claims. [2]

A statistically powerful study can identify factors that have only small effects on the data. In general, studies that run the experiment more times on more subjects have greater power. A power of 0.8 means that of ten true hypotheses tested, the effects of two are missed. Ioannidis found that in neuroscience the typical statistical power is 0.21; another study found that psychology studies average 0.35. [2]
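How power grows with the number of subjects can be sketched for a simple case. The example assumes a two-sided, one-sample z-test with known unit variance and a hypothetical standardized effect size of 0.5; these choices, and the function name `z_test_power`, are illustrative assumptions and not the method used in the studies cited above.

```python
from statistics import NormalDist

def z_test_power(effect_size, n_subjects, alpha=0.05):
    """Probability that a two-sided z-test detects a real effect.

    effect_size is the true mean shift in units of the standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * n_subjects ** 0.5
    # Chance that the test statistic lands beyond either critical value.
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower

for n in (8, 16, 32, 64):
    print(f"n={n:3d}  power={z_test_power(0.5, n):.2f}")
```

In this sketch roughly 32 subjects are needed to reach a power near 0.8 for the assumed effect size; with 8 subjects most true effects of that size would be missed.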

Unlikeliness is a measure of the degree of surprise in a result. Scientists prefer surprising results, leading them to test hypotheses that are unlikely to very unlikely. Ioannidis claimed that in epidemiology, some one in ten hypotheses should be true. In exploratory disciplines like genomics, which rely on examining voluminous data about genes and proteins, only one in a thousand should prove correct. [2]

In a discipline in which 100 out of 1,000 hypotheses are true, studies with a power of 0.8 will find 80 and miss 20. Of the 900 incorrect hypotheses, 5% or 45 will be accepted because of type I errors. Adding the 45 false positives to the 80 true positives gives 125 positive results, or 36% specious. Dropping statistical power to 0.4, optimistic for many fields, would still produce 45 false positives but only 40 true positives, less than half. [2]
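The arithmetic in this paragraph can be reproduced in a few lines. The function name and parameter names below are hypothetical; the inputs (1,000 hypotheses, one in ten true, a 5% significance threshold, power of 0.8 or 0.4) are the ones given in the text.

```python
def positive_results(n_hypotheses, share_true, power, alpha=0.05):
    """Return (true positives, false positives, share of positives that are false)."""
    n_true = n_hypotheses * share_true
    n_false = n_hypotheses - n_true
    true_positives = n_true * power        # real effects that are detected
    false_positives = n_false * alpha      # type I errors among false hypotheses
    share_false = false_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_false

for power in (0.8, 0.4):
    tp, fp, share_false = positive_results(1000, 0.1, power)
    print(f"power={power}: {tp:.0f} true positives, {fp:.0f} false positives, "
          f"{share_false:.0%} of positive results false")
```

Lowering `share_true` toward the one-in-a-thousand level mentioned for exploratory fields makes the false share rise sharply, since the number of false positives stays near 5% of all hypotheses tested.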

Negative results are more reliable. Statistical power of 0.8 produces 875 negative results of which only 20 are false, giving an accuracy of over 97%. Negative results however account for a minority of published results, varying by discipline. A study of 4,600 papers found that the proportion of published negative results dropped from 30% to 14% between 1990 and 2007. [2]
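The figure of over 97% follows from the same scenario of 1,000 hypotheses, 100 of them true, tested with a power of 0.8 and a 5% threshold. A minimal sketch, with hypothetical function and parameter names:

```python
def negative_results(n_hypotheses, share_true, power, alpha=0.05):
    """Return (true negatives, false negatives, accuracy of negative results)."""
    n_true = n_hypotheses * share_true
    n_false = n_hypotheses - n_true
    false_negatives = n_true * (1 - power)    # real effects that are missed
    true_negatives = n_false * (1 - alpha)    # false hypotheses correctly rejected
    accuracy = true_negatives / (true_negatives + false_negatives)
    return true_negatives, false_negatives, accuracy

tn, fn, accuracy = negative_results(1000, 0.1, 0.8)
print(f"{tn + fn:.0f} negative results, {fn:.0f} of them false, accuracy {accuracy:.1%}")
```

Of the 875 negative results, 855 come from the 900 false hypotheses and 20 from true hypotheses that the studies failed to detect.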

Subatomic physics sets an acceptable false-positive rate of one in 3.5m (known as the five-sigma standard). However, even this does not provide perfect protection. According to one review, the problem invalidates some three-quarters of machine learning studies. [2]
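The one-in-3.5m figure matches the probability that a normally distributed measurement falls more than five standard deviations above its mean. The sketch below assumes the one-sided tail convention; the function name is hypothetical.

```python
import math

def one_sided_tail(sigmas):
    """Probability that a standard normal value exceeds `sigmas`."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2))

p = one_sided_tail(5.0)
print(f"p = {p:.3g}, about one in {1 / p / 1e6:.1f} million")
```

For comparison, the same function gives about 0.05 at 1.645 sigma, which corresponds to the 5% threshold when it is applied one-sidedly.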

Statistical significance

Statistical significance is a measure for testing statistical correlation. It was invented by English mathematician Ronald Fisher in the 1920s. It defines a “significant” result as any data point that would be produced by chance less than 5 (or more stringently, 1) percent of the time. A significant result is widely seen as an important indicator that the correlation is not random. [5]

While correlations track the relationship between truly independent measurements, such as smoking and cancer, they are much less effective when variables cannot be isolated, a common circumstance in biological systems. For example, statistics found a high correlation between lower back pain and abnormalities in spinal discs, although it was later discovered that serious abnormalities were present in two-thirds of pain-free patients. [5]

Minimum threshold publishers

Journals such as PLoS One use a “minimal-threshold” standard, seeking to publish as much science as possible, rather than to pick out the best work. Their peer reviewers assess only whether a paper is methodologically sound. Almost half of their submissions are still rejected on that basis. [2]

Unpublished research

Only 22% of the clinical trials financed by the National Institutes of Health (NIH) released summary results within one year of completion, even though the NIH requires it. Fewer than half published within 30 months; a third remained unpublished after 51 months. [2] When other scientists rely on invalid research, they may waste time on lines of research that are themselves invalid. The failure to report failures means that researchers waste money and effort exploring blind alleys already investigated by other scientists. [1]

Fraud

In 21 surveys of academics (mostly in the biomedical sciences but also in civil engineering, chemistry and economics) carried out between 1987 and 2008, 2% admitted fabricating data, but 28% claimed to know of colleagues who engaged in questionable research practices. [2]

Lack of access to data and software

Clinical trials are generally too costly to rerun. Access to trial data is the only practical approach to reassessment. A campaign to persuade pharmaceutical firms to make all trial data available won its first convert in February 2013 when GlaxoSmithKline became the first to agree. [2]

Software used in a trial is generally considered to be proprietary intellectual property and is not available to replicators, further complicating matters. Journals that insist on data-sharing tend not to do the same for software. [2]

Even well-written papers may not include sufficient detail and/or tacit knowledge (subtle skills and extemporisations not considered notable) for the replication to succeed. One cause of replication failure is insufficient control of the protocol, which can cause disputes between the original and replicating researchers. [2]

Reform

Statistics training

Geneticists have begun more careful reviews, particularly of the use of statistical techniques. The effect was to stop a flood of specious results from genome sequencing. [1]

Protocol registration

Registering research protocols in advance and monitoring them over the course of a study can prevent researchers from modifying the protocol midstream to highlight preferred results. Providing raw data for other researchers to inspect and test can also better hold researchers to account. [1]

Post-publication review

Replacing peer review with post-publication evaluations can encourage researchers to think more about the long-term consequences of excessive or unsubstantiated claims. That system was adopted in physics and mathematics with good results. [1]

Replication

Few researchers, especially junior workers, seek opportunities to replicate others' work, partly to protect relationships with senior researchers. [2]

Reproduction benefits from access to the original study's methods and data. More than half of 238 biomedical papers published in 84 journals failed to identify all the resources (such as chemical reagents) necessary to reproduce the results. In 2008 some 60% of researchers said they would share raw data; in 2013 just 45% did. Journals have begun to demand that at least some raw data be made available, although only 143 of 351 randomly selected papers covered by some data-sharing policy actually complied. [2]

The Reproducibility Initiative is a service allowing life scientists to pay to have their work validated by an independent lab. In October 2013 the initiative received funding to review 50 of the highest-impact cancer findings published between 2010 and 2012. Blog Syn is a website run by graduate students that is dedicated to reproducing chemical reactions reported in papers. [2]

In 2013 replication efforts received greater attention. In May, Nature and related publications introduced an 18-point checklist for life science authors [10] in an effort to ensure that their published research can be reproduced. Expanded "methods" sections and all data were to be available online. The Center for Open Science opened as an independent laboratory focused on replication. The journal Perspectives on Psychological Science announced a section devoted to replications. Another project announced plans to replicate 100 studies published in the first three months of 2008 in three leading psychology journals. [2]

Major funders, including the European Research Council, the US National Science Foundation and Research Councils UK, have not changed their preference for new work over replications. [2]

References

  1. "Problems with scientific research: How science goes wrong". The Economist. 2013-10-19. Retrieved 2013-10-19.
  2. "Unreliable research: Trouble at the lab". The Economist. 2013-10-19. Retrieved 2013-10-22.
  3. "Misconduct, Not Mistakes, Causes Most Retractions of Scientific Papers | Science/AAAS | News". News.sciencemag.org. 2012-10-01. Retrieved 2013-10-19.
  4. Fang, F. C.; Steen, R. G.; Casadevall, A. (2012). "Misconduct accounts for the majority of retracted scientific publications". Proceedings of the National Academy of Sciences. 109 (42): 17028–33. Bibcode:2012PNAS..10917028F. doi:10.1073/pnas.1212247109. PMC 3479492. PMID 23027971.
  5. Lehrer, Jonah (2011-12-16). "Trials and Errors: Why Science Is Failing Us". Wired. Retrieved 2013-10-22.
  6. "Highly cited studies often refuted". Medscape.com. Retrieved 2013-10-22.
  7. Anonymous (2013-12-21). "Palaeontology: A bone to pick". The Economist. Retrieved 2014-04-17.
  8. Shanks, David R.; Newell, Ben R.; Lee, Eun Hee; Balakrishnan, Divya; Ekelund, Lisa; Cenac, Zarus; Kavvadia, Fragkiski; Moore, Christopher (2013-04-24). "Priming Intelligent Behavior: An Elusive Phenomenon". PLOS ONE. 8 (4): e56515. Bibcode:2013PLoSO...856515S. doi:10.1371/journal.pone.0056515. ISSN 1932-6203. PMC 3634790. PMID 23637732.
  9. O’Donnell, Michael; Nelson, Leif D.; Ackermann, Evi; Aczel, Balazs; Akhtar, Athfah; Aldrovandi, Silvio; Alshaif, Nasseem; Andringa, Ronald; Aveyard, Mark; Babincak, Peter; Balatekin, Nursena (2018-02-21). "Registered Replication Report: Dijksterhuis and van Knippenberg (1998)" (PDF). Perspectives on Psychological Science. 13 (2): 268–294. doi:10.1177/1745691618755704. ISSN 1745-6916. PMID 29463182. S2CID 3423830.
  10. "Reporting Checklist For Life Sciences Articles".