Data dredging

A humorous example of a result produced by data dredging, showing a coincidental correlation between the number of letters in Scripps National Spelling Bee's winning word and the number of people in the United States killed by venomous spiders

Data dredging (also known as data snooping or p-hacking) [1] [note 1] is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing the risk of false positives while understating it. This is done by performing many statistical tests on the data and only reporting those that come back with significant results. [2]


The process of data dredging involves testing multiple hypotheses using a single data set by exhaustively searching—perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.

Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the null hypothesis). This level of risk is called the significance level. When large numbers of tests are performed, some produce false results of this type by chance alone: 5% of randomly chosen true null hypotheses might be (erroneously) reported as statistically significant at the 5% significance level, 1% at the 1% significance level, and so on. When enough hypotheses are tested, it is virtually certain that some will be reported as statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain (for example) some spurious correlations. If they are not cautious, researchers using data mining techniques can easily be misled by these results. The term p-hacking (in reference to p-values) was coined in a 2014 paper by the three researchers behind the blog Data Colada, which focuses on uncovering such problems in social sciences research. [3] [4] [5]
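This multiplicity effect is easy to demonstrate by simulation. The sketch below is an illustration only; the sample sizes, number of tests, and random seed are arbitrary assumptions. It runs many two-sample t-tests on pure noise and counts how many come out "significant" at the 5% level.

```python
# Hypothetical simulation: every null hypothesis is true, yet about 5% of tests
# will appear "significant" at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 1000
false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=30)   # both groups drawn from the same distribution,
    b = rng.normal(size=30)   # so any "significant" difference is spurious
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests were 'significant' by chance alone")
# Expect a count of roughly 50 (about 5%), even though no real effect exists.
```

Reporting only the tests that crossed the threshold, while staying silent about the other roughly 950, is exactly the selective reporting described above.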

Data dredging is an example of disregarding the multiple comparisons problem. One form is when subgroups are compared without alerting the reader to the total number of subgroup comparisons examined. [6]

Types

Drawing conclusions from data

The conventional statistical hypothesis testing procedure using frequentist probability is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, and finally carry out a statistical significance test to see how likely such results would be if chance alone were at work (also called testing against the null hypothesis).

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns.

For example, flipping a coin five times with a result of 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis about the tails probability, and then toss the coin a number of new times to see whether the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. It is important to realize that the statistical significance under the incorrect procedure is completely spurious; significance tests do not protect against data dredging.
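As a sketch of the correct procedure (the simulated coin, the number of new tosses, and the helper function below are illustrative assumptions), one would state the hypothesis first and then compute an exact binomial p-value from fresh tosses only:

```python
# Hypothetical sketch: test a pre-stated hypothesis about a coin on NEW tosses,
# not on the tosses that suggested it.
import random
from math import comb

def binomial_p_value(tails, n, p_tails=0.5):
    """Two-sided exact binomial p-value: probability, under a fair coin, of a
    tail count at least as far from n/2 as the one observed."""
    deviation = abs(tails - n * p_tails)
    return sum(comb(n, k) * p_tails**k * (1 - p_tails)**(n - k)
               for k in range(n + 1)
               if abs(k - n * p_tails) >= deviation)

# Hypothesis formed in advance: the coin favors tails.
# Now collect new data that played no part in forming that hypothesis.
random.seed(1)
new_tosses = [random.random() < 0.5 for _ in range(100)]  # simulated fair coin
tails = sum(new_tosses)
print(f"{tails} tails out of 100 new tosses, p = {binomial_p_value(tails, 100):.3f}")
# Testing the 3/5 hypothesis on the original five tosses would, by construction,
# always "confirm" it; only the fresh tosses provide a meaningful test.
```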

Hypothesis suggested by non-representative data

Suppose that a study of a random sample of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data dredging might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data dredging, could then be "people born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself taken out of context might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be reproducible; any attempt to check if others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.

Systematic bias

Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were at higher risk, so more of them had heart attacks. [6] This problem can be very severe, for example, in observational studies. [6] [2]

Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias. [6] When papers with significant p-values are selected for publication, negative studies are selected against; this is publication bias. It is also known as file-drawer bias, because results with less significant p-values are left in the file drawer and never published.

Multiple modelling

Another way in which statistical tests are conditioned by knowledge of the data arises in multiple modelling, for example when a linear regression model is built from the same data that are used to select its variables. A crucial step in the process is to decide which covariates to include in a relationship explaining one or more other variables. There are both statistical (see stepwise regression) and substantive considerations that lead the authors to favor some of their models over others, and statistical tests are used liberally along the way. However, to discard one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net, in the sense that their effects are bound to be bigger than those of the fish that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, but it may also introduce bias and alter the mean square error in estimation. [7] [8]
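The effect of such selection is easy to see in a small simulation. In the sketch below (all variable names, sizes, and the selection rule are illustrative assumptions), every candidate covariate is pure noise, yet the covariate that survives the preliminary screen usually looks "significant" when its p-value is reported naively:

```python
# Hypothetical sketch: selecting the best-looking covariate and then quoting its
# unadjusted p-value overstates the evidence, exactly as in multiple modelling.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_covariates = 50, 100
X = rng.normal(size=(n_obs, n_covariates))   # candidate covariates: pure noise
y = rng.normal(size=n_obs)                   # outcome: unrelated to every covariate

results = [stats.pearsonr(X[:, j], y) for j in range(n_covariates)]
best = min(range(n_covariates), key=lambda j: results[j][1])  # the "retained" variable
r, p = results[best]
print(f"retained covariate {best}: r = {r:.2f}, naive p = {p:.4f}")
# The reported p-value ignores the 99 discarded covariates, so it will very often
# fall below 0.05 even though no covariate has any real relationship with y.
```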

Examples

In meteorology and epidemiology

In meteorology, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that future data cannot have influenced, even subconsciously, the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process makes it impossible to accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for hundreds or thousands of different variables, mostly uncorrelated with one another. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm it. Note that a p-value of 0.01 means that, if the null hypothesis is true, a result at least that extreme would be obtained by chance 1% of the time; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is likely to obtain p-values less than 0.01 for many null hypotheses.
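A rough calculation, under the simplifying assumption that the candidate variables are independent of one another, shows how quickly at least one spurious correlation becomes near-certain. With 1,000 true null hypotheses each tested at the 1% level:

```latex
P(\text{at least one } p < 0.01) \;=\; 1 - (1 - 0.01)^{1000} \;\approx\; 0.99996,
\qquad
\mathbb{E}[\text{number of false positives}] \;=\; 1000 \times 0.01 \;=\; 10 .
```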

In sociology

Another way to flatten a p-curve is to control for gender. An analysis by Simonsohn et al. of a study by Bruns and Ioannidis (2016) demonstrates this: when Bruns and Ioannidis dropped the gender control, the reported t-value dropped from 9.29 to 0.88, showing a non-causal effect where a causal one had previously been recorded. [9] This matters because t-values and p-values are inversely related: larger t-values (above 2.8 here) correspond to smaller p-values. By controlling for gender, one can artificially inflate the t-value and thus artificially deflate the p-value as well.

Appearance in media

One example is the chocolate weight-loss hoax study conducted by journalist John Bohannon, who explained publicly in a Gizmodo article that the study had been deliberately conducted fraudulently as a social experiment. [10] The study was widely reported by media outlets around 2015, with many people believing the claim that eating a chocolate bar every day would help them lose weight. The study was issued under the name of the "Institute of Diet and Health", an institute invented for the hoax. According to Bohannon, measuring 18 different variables was crucial to obtaining a p-value below 0.05 for at least one of them.
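The arithmetic behind the hoax is straightforward. Treating the 18 measured outcomes as independent tests at the 5% level (a simplifying assumption; the outcomes in the actual study were surely correlated), the chance that at least one of them comes out "significant" by luck alone is already better than even:

```latex
P(\text{at least one } p < 0.05 \text{ among } 18 \text{ tests}) \;=\; 1 - (1 - 0.05)^{18} \;\approx\; 0.60 .
```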

Remedies

While looking for patterns in data is legitimate, applying a statistical test of significance or hypothesis test to the same data until a pattern emerges is prone to abuse. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset—say, subset A—is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid. (This is a simple type of cross-validation and is often termed training-test or split-half validation.)
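A minimal sketch of this split-half procedure is shown below (the column names, synthetic data, and the final t-test are illustrative assumptions; any confirmatory test appropriate to the hypothesis could stand in its place):

```python
# Hypothetical sketch: dredge subset A freely, then test the one surviving
# hypothesis exactly once on subset B.
import numpy as np
import pandas as pd
from scipy import stats

def split_half(df, seed=0):
    """Randomly partition a dataset into an exploratory half (A) and a confirmatory half (B)."""
    shuffled = df.sample(frac=1, random_state=seed)
    half = len(shuffled) // 2
    return shuffled.iloc[:half], shuffled.iloc[half:]

rng = np.random.default_rng(0)
df = pd.DataFrame({"group": rng.integers(0, 2, size=200),      # synthetic stand-in data
                   "outcome": rng.normal(size=200)})

subset_a, subset_b = split_half(df)
# ... explore subset_a however you like and settle on ONE hypothesis, e.g.
# "the two groups differ in outcome"; then run the single confirmatory test on B:
t, p = stats.ttest_ind(subset_b.loc[subset_b.group == 0, "outcome"],
                       subset_b.loc[subset_b.group == 1, "outcome"])
print(f"confirmatory test on subset B: t = {t:.2f}, p = {p:.3f}")
```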

Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance (alpha) by this number; this is the Bonferroni correction. However, it is a very conservative correction. A family-wise alpha of 0.05, divided in this way by 1,000 to account for 1,000 significance tests, yields a very stringent per-hypothesis alpha of 0.00005. Methods that are particularly useful in analysis of variance and in constructing simultaneous confidence bands for regressions involving basis functions include Scheffé's method and, if the researcher has in mind only pairwise comparisons, the Tukey method. To avoid the extreme conservativeness of the Bonferroni correction, more sophisticated selective inference methods are available. [11] The most common selective inference method is Benjamini and Hochberg's false discovery rate controlling procedure: a less conservative approach that has become a popular method for controlling multiple hypothesis tests.
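The sketch below implements both corrections directly (the p-values are made-up illustrations; in practice one would typically use a library routine such as those in statsmodels):

```python
# Hypothetical sketch: Bonferroni controls the family-wise error rate;
# Benjamini-Hochberg controls the false discovery rate and is less conservative.
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= (k/m) * alpha."""
    m = len(p_values)
    order = np.argsort(p_values)
    sorted_p = np.asarray(p_values)[order]
    passing = [k for k in range(1, m + 1) if sorted_p[k - 1] <= (k / m) * alpha]
    cutoff = max(passing) if passing else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:cutoff]] = True
    return reject.tolist()

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print("Bonferroni:        ", bonferroni(p_values))          # rejects only the smallest p-value
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))  # rejects the two smallest
```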

When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former. [8]

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of the data and of the method used to examine the data. Thus, if someone says that a certain event has a probability of 20% ± 2%, 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result will be between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.

Academic journals are increasingly shifting to the registered report format, which aims to counteract serious issues such as data dredging and HARKing, which have made theory-testing research very unreliable. For example, Nature Human Behaviour has adopted the registered report format, as it "shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them". [12] The European Journal of Personality defines this format as follows: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes." [13]

Methods and results can also be made publicly available, as in the open science approach, making it yet more difficult for data dredging to take place. [14]

Notes

  1. Other names are data butchery, data fishing, selective inference, significance chasing, and significance questing.

References

  1. Wasserstein, Ronald L.; Lazar, Nicole A. (2016-04-02). "The ASA Statement on p-Values: Context, Process, and Purpose". The American Statistician. 70 (2). Informa UK Limited: 129–133. doi:10.1080/00031305.2016.1154108. ISSN 0003-1305.
  2. Davey Smith, G.; Ebrahim, S. (2002). "Data dredging, bias, or confounding". BMJ. 325 (7378): 1437–1438. doi:10.1136/bmj.325.7378.1437. PMC 1124898. PMID 12493654.
  3. Lewis-Kraus, Gideon (2023-09-30). "They Studied Dishonesty. Was Their Work a Lie?". The New Yorker. ISSN 0028-792X. Retrieved 2023-10-01.
  4. Subbaraman, Nidhi (2023-09-24). "The Band of Debunkers Busting Bad Scientists". Wall Street Journal. Archived from the original on 2023-09-24. Retrieved 2023-10-08.
  5. "APA PsycNet". psycnet.apa.org. Retrieved 2023-10-08.
  6. Young, S. S.; Karr, A. (2011). "Deming, data and observational studies" (PDF). Significance. 8 (3): 116–120. doi:10.1111/j.1740-9713.2011.00506.x.
  7. Selvin, H. C.; Stuart, A. (1966). "Data-Dredging Procedures in Survey Analysis". The American Statistician. 20 (3): 20–23. doi:10.1080/00031305.1966.10480401. JSTOR 2681493.
  8. Berk, R.; Brown, L.; Zhao, L. (2009). "Statistical Inference After Model Selection". J Quant Criminol. 26 (2): 217–236. doi:10.1007/s10940-009-9077-7. S2CID 10350955.
  9. Simonsohn, Uri; Nelson, Leif D.; Simmons, Joseph P. (2019-03-11). "P-curve won't do your laundry, but it will distinguish replicable from non-replicable findings in observational research: Comment on Bruns & Ioannidis (2016)". PLOS ONE. 14 (3): e0213454. Bibcode:2019PLoSO..1413454S. doi:10.1371/journal.pone.0213454. PMC 6411107. PMID 30856227.
  10. Bohannon, John (2015-05-27). "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How". Gizmodo. Retrieved 2023-10-20.
  11. Taylor, J.; Tibshirani, R. (2015). "Statistical learning and selective inference". Proceedings of the National Academy of Sciences. 112 (25): 7629–7634. doi:10.1073/pnas.1507583112. PMC 4485109.
  12. "Promoting reproducibility with registered reports". Nature Human Behaviour. 1 (1): 0034. 10 January 2017. doi:10.1038/s41562-016-0034. S2CID 28976450.
  13. "Streamlined review and registered reports soon to be official at EJP". ejp-blog.com. 6 February 2018.
  14. Vyse, Stuart (2017). "P-Hacker Confessions: Daryl Bem and Me". Skeptical Inquirer. 41 (5): 25–27. Archived from the original on 2018-08-05. Retrieved 5 August 2018.
  15. Gelman, Andrew (2013). "The garden of forking paths" (PDF).
