Statcheck


Statcheck is an R package designed to detect statistical errors in peer-reviewed psychology articles [1] by searching papers for statistical results, recomputing the p-value from each reported test statistic, and comparing the recomputed value with the reported one to see if they match. [2] It takes advantage of the fact that psychological research papers tend to report their results in accordance with the guidelines published by the American Psychological Association (APA). [3] This reliance on APA formatting brings several limitations: Statcheck can only detect results that are reported completely and in exact accordance with the APA's guidelines, [4] and it cannot detect statistics that appear only in tables in the paper. [5] Another limitation is that Statcheck cannot deal with statistical corrections to test statistics, like Greenhouse–Geisser or Bonferroni corrections, which actually make tests more conservative. [6] Some journals have begun piloting Statcheck as part of their peer review process. Statcheck is free software published under the GNU GPL v3. [7]
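The extract, recompute, and compare idea described above can be illustrated with a minimal Python sketch. It is not Statcheck's own code (Statcheck is written in R): the regular expression `APA_Z`, the helper names, and the restriction to z-tests with two-tailed p-values are all assumptions made for illustration, and the sketch ignores rounding of the reported test statistic.

```python
import math
import re

# Hypothetical pattern for APA-style z-test reports such as "z = 2.05, p = .04".
# This is an illustrative assumption, not the pattern Statcheck itself uses.
APA_Z = re.compile(r"z\s*=\s*(-?\d*\.?\d+)\s*,\s*p\s*([<>=])\s*(\d?\.\d+)")


def two_tailed_p(z):
    """Two-tailed p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_results(text):
    """Return (z, comparator, reported p, recomputed p, consistent) per match."""
    findings = []
    for z_str, comparator, p_str in APA_Z.findall(text):
        z = float(z_str)
        reported = float(p_str)
        computed = two_tailed_p(z)
        # Compare at the precision with which the p-value was reported.
        decimals = len(p_str.split(".")[1])
        if comparator == "=":
            consistent = round(computed, decimals) == reported
        elif comparator == "<":
            consistent = computed < reported
        else:
            consistent = computed > reported
        findings.append((z, comparator, reported, computed, consistent))
    return findings


sample = "The first effect, z = 2.05, p = .04, and the second, z = 1.20, p = .04."
for z, comparator, reported, computed, consistent in check_results(sample):
    status = "consistent" if consistent else "inconsistent"
    print(f"z = {z}: reported p {comparator} {reported}, recomputed {computed:.4f}, {status}")
```

In this sample the first result is consistent with its reported p-value and the second is not. A result reported only in a table, or in a non-APA format, would not match the pattern at all, which mirrors the limitations described above.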


Validity

In 2017, Statcheck's developers published a preprint paper concluding that the program accurately identified statistical errors over 95% of the time. [8] This validity study comprised more than 1,000 hand-checked tests, of which 5.00% turned out to be inconsistent. [9] The study found that Statcheck recognized 60% of all statistical tests. A reanalysis of these data found that when the program flagged a test as inconsistent, it was correct in 60.4% of cases. Conversely, when a test was truly inconsistent, Statcheck flagged it in an estimated 51.8% of cases (this estimate included the undetected tests and assumed that they had the same rate of inconsistencies as the detected tests). Overall, Statcheck's accuracy was 95.9%, half a percentage point higher than the chance level of 95.4% expected when all tests are simply taken at face value. Statcheck was conservatively biased (by about one standard deviation) against flagging tests. [10]
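The three kinds of figure quoted above (how often a flag is correct, how often a true inconsistency is flagged, and overall accuracy against a face-value baseline) are standard confusion-matrix quantities. The following Python sketch shows how they relate; the counts are hypothetical round numbers chosen for illustration and are not the data from the validity study or its reanalysis.

```python
def flag_metrics(tp, fp, fn, tn):
    """Confusion-matrix summaries for an inconsistency checker.

    tp: inconsistent tests that were flagged
    fp: consistent tests that were flagged
    fn: inconsistent tests that were not flagged
    tn: consistent tests that were not flagged
    """
    total = tp + fp + fn + tn
    return {
        # Share of flags that are correct (positive predictive value).
        "flag_correct": tp / (tp + fp),
        # Share of truly inconsistent tests that get flagged (sensitivity).
        "inconsistency_flagged": tp / (tp + fn),
        # Share of all tests classified correctly.
        "accuracy": (tp + tn) / total,
        # Accuracy of flagging nothing, i.e. taking every test at face value.
        "face_value_baseline": (fp + tn) / total,
    }


# Hypothetical counts for 1,000 tests; NOT the figures from the study.
metrics = flag_metrics(tp=30, fp=20, fn=20, tn=930)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these made-up counts, 5% of tests are inconsistent, so flagging nothing is already right 95% of the time. This is why an overall accuracy in the mid-90s can sit only slightly above the face-value baseline even when the checker does flag real inconsistencies.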

More recent research has used Statcheck on a 30-year sample of papers published in Canadian psychology journals, finding rates of statistical reporting errors similar to those reported by the original authors. The same study also found many typographical errors in online versions of relatively old papers, and that correcting for these reduced the estimated percentage of tests that were erroneously reported. [11]

History

Statcheck was first developed in 2015 by Michele Nuijten of Tilburg University and Sacha Epskamp of the University of Amsterdam. [12] [8] Later that year, Nuijten and her colleagues published a paper using Statcheck on over 30,000 psychology papers and reported that "half of all published psychology papers [...] contained at least one p-value that was inconsistent with its test". [13] The study was subsequently written up favorably in Nature. [14] [15] In 2016, Nuijten and Epskamp both received the Leamer-Rosenthal Prize for Open Social Science from the Berkeley Initiative for Transparency in the Social Sciences for creating Statcheck. [16]

In 2016, Tilburg University researcher Chris Hartgerink used Statcheck to scan over 50,000 psychology papers and posted the results to PubPeer; they subsequently published the data they extracted from these papers in an article in the journal Data. [14] [17] Hartgerink told Motherboard that "We're checking how reliable is the actual science being presented by science". [18] They also told Vox that they intended to use Statcheck to perform a function similar to a spell checker software program. [12] Hartgerink's action also sent email alerts to every researcher who had authored or co-authored a paper that Statcheck had flagged. These flaggings, and their posting on a public forum, proved controversial, prompting the German Psychological Society to issue a statement condemning this use of Statcheck. [14] Psychologist Dorothy V.M. Bishop, who had two of her own papers flagged by Statcheck, criticized the program for publicly flagging many papers (including one of her own) in which it had not found any statistical errors. [19] Other critics alleged that Statcheck had reported the presence of errors in papers that did not actually contain them, due to the tool's failure to correctly read statistics from certain papers. [20]

Journals that have begun piloting the use of Statcheck as part of their peer review process include Psychological Science, [21] the Canadian Journal of Human Sexuality, [22] and the Journal of Experimental Social Psychology. [23] The open access publisher PsychOpen has also used it on all papers accepted for publication in their journals since 2017. [24]

References

  1. Nuijten, Michèle B. (2017-02-28). "BayesMed and statcheck". APS Observer. 30 (3). Retrieved 2018-10-18.
  2. Baker, Monya (2016-11-25). "Stat-checking software stirs up psychology". Nature. 540 (7631): 151–152. Bibcode:2016Natur.540..151B. doi:10.1038/540151a. ISSN 0028-0836. PMID 27905454.
  3. Wren, Jonathan D. (2018-06-15). "Algorithmically outsourcing the detection of statistical errors and other problems". The EMBO Journal. 37 (12): e99651. doi:10.15252/embj.201899651. ISSN 0261-4189. PMC 6003655. PMID 29794111.
  4. Colombo, Matteo; Duev, Georgi; Nuijten, Michèle B.; Sprenger, Jan (2018-04-12). "Statistical reporting inconsistencies in experimental philosophy". PLOS ONE. 13 (4): e0194360. Bibcode:2018PLoSO..1394360C. doi:10.1371/journal.pone.0194360. ISSN 1932-6203. PMC 5896892. PMID 29649220.
  5. van der Zee, Tim; Anaya, Jordan; Brown, Nicholas J. L. (2017-07-10). "Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab". BMC Nutrition. 3 (1): 54. doi:10.1186/s40795-017-0167-x. ISSN 2055-0928. PMC 7050813. PMID 32153834.
  6. Schmidt, Thomas (2016). "Sources of false positives and false negatives in the Statcheck algorithm". arXiv:1610.01010 [q-bio.QM].
  7. "Statcheck/DESCRIPTION at master · MicheleNuijten/Statcheck". GitHub.
  8. Chawla, Dalmeet Singh (2017-11-28). "Controversial software is proving surprisingly accurate at spotting errors in psychology papers". Science. Retrieved 2018-10-18.
  9. Nuijten, Michèle B. "The validity of the tool "Statcheck" in discovering statistical reporting inconsistencies". PsyArXiv.
  10. Schmidt, Thomas. "Statcheck does not work: All the numbers". PsyArXiv.
  11. Green, Christopher D.; Abbas, Sahir; Belliveau, Arlie; Beribisky, Nataly; Davidson, Ian J.; DiGiovanni, Julian; Heidari, Crystal; Martin, Shane M.; Oosenbrug, Eric (August 2018). "Statcheck in Canada: What proportion of CPA journal articles contain errors in the reporting of p-values?". Canadian Psychology. 59 (3): 203–210. doi:10.1037/cap0000139. ISSN 1878-7304. S2CID 149813772.
  12. Resnick, Brian (2016-09-30). "A bot crawled thousands of studies looking for simple math errors. The results are concerning". Vox. Retrieved 2018-10-18.
  13. Nuijten, Michèle B.; Hartgerink, Chris H. J.; van Assen, Marcel A. L. M.; Epskamp, Sacha; Wicherts, Jelte M. (2015-10-23). "The prevalence of statistical reporting errors in psychology (1985–2013)". Behavior Research Methods. 48 (4): 1205–1226. doi:10.3758/s13428-015-0664-2. ISSN 1554-3528. PMC 5101263. PMID 26497820.
  14. Buranyi, Stephen (2017-02-01). "The high-tech war on science fraud". The Guardian. Retrieved 2018-10-18.
  15. Baker, Monya (2015-10-28). "Smart software spots statistical errors in psychology papers". Nature. doi:10.1038/nature.2015.18657. ISSN 1476-4687. S2CID 187878096. Retrieved 2018-10-19.
  16. "Michèle Nuijten". Berkeley Initiative for Transparency in the Social Sciences. 2016-12-16. Retrieved 2018-10-19.
  17. Hartgerink, Chris (2016-09-23). "688,112 Statistical Results: Content Mining Psychology Articles for Statistical Test Results". Data. 1 (3): 14. doi:10.3390/data1030014.
  18. Buranyi, Stephen (2016-09-05). "Scientists Are Worried About 'Peer Review by Algorithm'". Motherboard. Retrieved 2018-10-18.
  19. "Here's why more than 50,000 psychology studies are about to have PubPeer entries". Retraction Watch. 2016-09-02. Retrieved 2018-10-18.
  20. Stokstad, Erik (2018-09-21). "The truth squad". Science. 361 (6408): 1189–1191. Bibcode:2018Sci...361.1189S. doi:10.1126/science.361.6408.1189. ISSN 0036-8075. PMID 30237339. S2CID 52309610.
  21. Freedman, Leonard P.; Venugopalan, Gautham; Wisman, Rosann (2017-05-02). "Reproducibility2020: Progress and priorities". F1000Research. 6: 604. doi:10.12688/f1000research.11334.1. ISSN 2046-1402. PMC 5461896. PMID 28620458.
  22. Sakaluk, John K.; Graham, Cynthia A. (2017-11-17). "Promoting Transparent Reporting of Conflicts of Interests and Statistical Analyses at The Journal of Sex Research". The Journal of Sex Research. 55 (1): 1–6. doi:10.1080/00224499.2017.1395387. ISSN 0022-4499. PMID 29148841.
  23. "JESP piloting the use of statcheck". Retrieved 2018-10-19.
  24. "PsychOpen uses Statcheck tool for quality check". PsychOpen. 2017-04-10. Retrieved 2018-10-23.