Misuse of statistics

Statistics, when used in a misleading fashion, can trick the casual observer into believing something other than what the data shows. That is, a misuse of statistics occurs when a statistical argument asserts a falsehood. In some cases, the misuse may be accidental. In others, it is purposeful and for the gain of the perpetrator. When the statistical reason involved is false or misapplied, this constitutes a statistical fallacy.

The false statistics trap can be quite damaging for the quest for knowledge. For example, in medical science, correcting a falsehood may take decades and cost lives.

Misuses can be easy to fall into. Professional scientists, mathematicians, and even professional statisticians can be fooled by quite simple methods, even when they are careful to check everything. Scientists have been known to fool themselves with statistics due to a lack of knowledge of probability theory and a lack of standardization of their tests.

Definition, limitations and context

One usable definition is: "Misuse of Statistics: Using numbers in such a manner that – either by intent or through ignorance or carelessness – the conclusions are unjustified or incorrect." [1] The "numbers" include misleading graphics discussed in other sources. The term is not commonly encountered in statistics texts and there is no single authoritative definition. It is a generalization of lying with statistics, which was richly described with examples by statisticians 60 years ago.

The definition confronts some problems (some are addressed by the source): [2]

  1. Statistics usually produces probabilities; conclusions are provisional
  2. The provisional conclusions have errors and error rates. Commonly 5% of the provisional conclusions of significance testing are wrong
  3. Statisticians are not in complete agreement on ideal methods
  4. Statistical methods are based on assumptions which are seldom fully met
  5. Data gathering is usually limited by ethical, practical and financial constraints.

How to Lie with Statistics acknowledges that statistics can legitimately take many forms. Whether the statistics show that a product is "light and economical" or "flimsy and cheap" can be debated whatever the numbers. Some object to the substitution of statistical correctness for moral leadership (for example) as an objective. Assigning blame for misuses is often difficult because scientists, pollsters, statisticians and reporters are often employees or consultants.

An insidious misuse of statistics is completed by the listener, observer, audience, or juror. The supplier provides the "statistics" as numbers or graphics (or before/after photographs), allowing the consumer to draw conclusions that may be unjustified or incorrect. The poor state of public statistical literacy and the non-statistical nature of human intuition make it possible to mislead without explicitly producing a faulty conclusion. The definition is weak on the responsibility of the consumer of statistics.

A historian listed over 100 fallacies in a dozen categories including those of generalization and those of causation. [3] A few of the fallacies are explicitly or potentially statistical including sampling, statistical nonsense, statistical probability, false extrapolation, false interpolation and insidious generalization. All of the technical/mathematical problems of applied probability would fit in the single listed fallacy of statistical probability. Many of the fallacies could be coupled to statistical analysis, allowing the possibility of a false conclusion flowing from a statistically sound analysis.

An example use of statistics is in the analysis of medical research. The process includes [4] [5] experimental planning, the conduct of the experiment, data analysis, drawing the logical conclusions and presentation/reporting. The report is summarized by the popular press and by advertisers. Misuses of statistics can result from problems at any step in the process. The statistical standards ideally imposed on the scientific report are much different than those imposed on the popular press and advertisers; however, cases exist of advertising disguised as science. The definition of the misuse of statistics is weak on the required completeness of statistical reporting. The opinion is expressed that newspapers must provide at least the source for the statistics reported.

Simple causes

Many misuses of statistics occur for simple reasons.

Types of misuse

Discarding unfavorable observations

To promote a neutral (useless) product, a company must find or conduct, for example, 40 studies with a confidence level of 95%. If the product is useless, this would on average produce one study showing the product was beneficial, one study showing it was harmful, and thirty-eight inconclusive studies (38 is 95% of 40). This tactic becomes more effective the more studies are available. Organizations that do not publish every study they carry out, such as tobacco companies denying a link between smoking and cancer, anti-smoking advocacy groups and media outlets trying to prove a link between smoking and various ailments, or miracle pill vendors, are likely to use this tactic.
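
As a rough illustration (not from the source), the following Python sketch simulates this situation: 40 trials of a treatment with no real effect, each tested at the 5% significance level. The group sizes, the zero effect, and the random seed are arbitrary assumptions for the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 40 independent trials of a useless treatment: treated and control groups
# are drawn from the same distribution, so any "significant" result is a
# false positive.
n_trials, n_per_arm, alpha = 40, 100, 0.05
favorable = unfavorable = inconclusive = 0
for _ in range(n_trials):
    treated = rng.normal(0.0, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    t, p = stats.ttest_ind(treated, control)
    if p < alpha and t > 0:
        favorable += 1      # looks beneficial purely by chance
    elif p < alpha:
        unfavorable += 1    # looks harmful purely by chance
    else:
        inconclusive += 1

print(favorable, unfavorable, inconclusive)  # typically about 1, 1, 38
```

Reporting only the favorable trial while discarding the rest is the misuse described above.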

Ronald Fisher considered this issue in his famous lady tasting tea example experiment (from his 1935 book, The Design of Experiments ). Regarding repeated experiments, he said, "It would be illegitimate and would rob our calculation of its basis if unsuccessful results were not all brought into the account."

Another term related to this concept is cherry picking.

Ignoring important features

Multivariable datasets have two or more features/dimensions. If too few of these features are chosen for analysis (for example, if just one feature is chosen and simple linear regression is performed instead of multiple linear regression), the results can be misleading. This leaves the analyst vulnerable to any of various statistical paradoxes, or in some (not all) cases to false causality, as discussed below.
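
A minimal sketch of how omitting a relevant feature can distort, and here even reverse, an estimated effect. The data, coefficients, and the correlation between the two features are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data: y truly depends on x1 (coefficient +1) and x2 (-3),
# and x1 and x2 are positively correlated.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Simple linear regression on x1 alone: the slope absorbs the ignored x2 effect.
slope_simple = np.polyfit(x1, y, 1)[0]

# Multiple linear regression on both features recovers the true coefficients.
X = np.column_stack([x1, x2, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"simple regression slope for x1: {slope_simple:+.2f} (misleading)")
print(f"multiple regression coefs (x1, x2): {coefs[0]:+.2f}, {coefs[1]:+.2f}")
```

With these assumed numbers the simple slope comes out close to -1.4 even though the true partial effect of x1 is +1.0, because the ignored feature x2 is correlated with x1.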

Loaded questions

The answers to surveys can often be manipulated by wording the question in such a way as to induce a prevalence towards a certain answer from the respondent. For example, in polling support for a war, a question that frames the action in approving terms and a question that frames it as unprovoked aggression will likely result in data skewed in different directions, although both are polling about support for the war. A better way of wording the question could be "Do you support the current US military action abroad?" A still more nearly neutral way to put that question is "What is your view about the current US military action abroad?" The point should be that the person being asked has no way of guessing from the wording what the questioner might want to hear.

Another way to do this is to precede the question by information that supports the "desired" answer. For example, more people will likely answer "yes" to the question "Given the increasing burden of taxes on middle-class families, do you support cuts in income tax?" than to the question "Considering the rising federal budget deficit and the desperate need for more revenue, do you support cuts in income tax?"

The proper formulation of questions can be very subtle. The responses to two questions can vary dramatically depending on the order in which they are asked. [15] "A survey that asked about 'ownership of stock' found that most Texas ranchers owned stock, though probably not the kind traded on the New York Stock Exchange." [16]

Overgeneralization

Overgeneralization is a fallacy occurring when a statistic about a particular population is asserted to hold among members of a group for which the original population is not a representative sample.

For example, suppose 100% of apples are observed to be red in summer. The assertion "All apples are red" would be an instance of overgeneralization because the original statistic was true only of a specific subset of apples (those in summer), which is not expected to be representative of the population of apples as a whole.

A real-world example of the overgeneralization fallacy can be observed as an artifact of modern polling techniques, which prohibit calling cell phones for over-the-phone political polls. As young people are more likely than other demographic groups to lack a conventional landline phone, a telephone poll that exclusively surveys landline phones may undersample the views of young people, if no other measures are taken to account for this skewing of the sampling. Thus, a poll examining the voting preferences of young people using this technique may not accurately represent young people's true voting preferences as a whole, because the sample excludes young people who carry only cell phones, whose voting preferences may or may not differ from those of the rest of the population.

Overgeneralization often occurs when information is passed through nontechnical sources, in particular mass media.

Biased samples

Scientists have learned at great cost that gathering good experimental data for statistical analysis is difficult. Example: the placebo effect (mind over body) is very powerful. 100% of subjects developed a rash when exposed to an inert substance that was falsely called poison ivy, while few developed a rash from a "harmless" object that really was poison ivy. [17] Researchers combat this effect by double-blind randomized comparative experiments. Statisticians typically worry more about the validity of the data than the analysis. This is reflected in a field of study within statistics known as the design of experiments.

Pollsters have learned at great cost that gathering good survey data for statistical analysis is difficult. The selective effect of cellular telephones on data collection (discussed in the Overgeneralization section) is one potential example; if young people with traditional telephones are not representative, the sample can be biased. Sample surveys have many pitfalls and require great care in execution. [18] One effort required almost 3000 telephone calls to get 1000 answers. The simple random sample of the population "isn't simple and may not be random." [19]

Misreporting or misunderstanding of estimated error

If a research team wants to know how 300 million people feel about a certain topic, it would be impractical to ask all of them. However, if the team picks a random sample of about 1000 people, they can be fairly certain that the results given by this group are representative of what the larger group would have said if they had all been asked.

This confidence can actually be quantified by the central limit theorem and other mathematical results. Confidence is expressed as a probability of the true result (for the larger group) being within a certain range of the estimate (the figure for the smaller group). This is the "plus or minus" figure often quoted for statistical surveys. The probability part of the confidence level is usually not mentioned; when it is omitted, it is assumed to be a standard number like 95%.

The two numbers are related. If a survey has an estimated error of ±5% at 95% confidence, it also has an estimated error of about ±6.6% at 99% confidence. For a normally distributed estimate, an error of ±x% at 95% confidence corresponds to roughly ±1.3x% at 99% confidence, because the two-sided critical values are in the ratio 2.58/1.96 ≈ 1.3.
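
The conversion follows from the ratio of the normal critical values; a short sketch (assuming SciPy is available):

```python
from scipy import stats

# Two-sided critical values of the standard normal distribution.
z95 = stats.norm.ppf(0.975)   # ~1.96
z99 = stats.norm.ppf(0.995)   # ~2.58

ratio = z99 / z95             # ~1.31
print(f"±5% at 95% confidence ≈ ±{5 * ratio:.1f}% at 99% confidence")
```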

The smaller the estimated error, the larger the required sample at a given confidence level. For example, at 95.4% confidence (about two standard errors), the worst-case margin of error for an estimated proportion is roughly 1/√n, which gives the sample sizes computed in the sketch below.
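
A minimal calculation of the required sample sizes under those assumptions (worst-case proportion p = 0.5, two-sided interval at two standard errors):

```python
import math

# Worst-case (p = 0.5) sample size for a given margin of error at ~95.4%
# confidence (z = 2), where margin ≈ z * sqrt(p*(1-p)/n) = 1/sqrt(n).
for margin in (0.01, 0.02, 0.05, 0.10, 0.25):
    n = math.ceil((2 * 0.5 / margin) ** 2)   # = 1 / margin**2
    print(f"±{margin:.0%} margin of error -> about {n} respondents")
```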

People may assume, because the confidence figure is omitted, that there is a 100% certainty that the true result is within the estimated error. This is not mathematically correct.

Many people may not realize that the randomness of the sample is very important. In practice, many opinion polls are conducted by phone, which distorts the sample in several ways, including exclusion of people who do not have phones, favoring the inclusion of people who have more than one phone, favoring the inclusion of people who are willing to participate in a phone survey over those who refuse, etc. Non-random sampling makes the estimated error unreliable.

On the other hand, people may consider that statistics are inherently unreliable because not everybody is called, or because they themselves are never polled. People may think that it is impossible to get data on the opinion of dozens of millions of people by just polling a few thousand. This is also inaccurate. [note 1] A poll with perfect unbiased sampling and truthful answers has a mathematically determined margin of error, which depends only on the number of people polled.

However, often only one margin of error is reported for a survey. When results are reported for population subgroups, a larger margin of error will apply, but this may not be made clear. For example, a survey of 1000 people may contain 100 people from a certain ethnic or economic group. The results focusing on that group will be much less reliable than results for the full population. If the margin of error for the full sample was 4%, say, then the margin of error for such a subgroup could be around 13%.
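
Since the margin of error scales roughly as 1/√n for a simple random sample, the subgroup figure can be checked directly; a small sketch using the 4% and 100-of-1000 numbers above:

```python
import math

# Margin of error scales with 1/sqrt(n), so a subgroup of 100 people drawn
# from a survey of 1000 has a margin sqrt(1000/100) ≈ 3.2 times larger.
full_margin, n_full, n_subgroup = 0.04, 1000, 100
subgroup_margin = full_margin * math.sqrt(n_full / n_subgroup)
print(f"subgroup margin of error ≈ ±{subgroup_margin:.0%}")   # ≈ ±13%
```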

There are also many other measurement problems in population surveys.

The problems mentioned above apply to all statistical experiments, not just population surveys.

False causality

When a statistical test shows a correlation between A and B, there are usually six possibilities:

  1. A causes B.
  2. B causes A.
  3. A and B both partly cause each other.
  4. A and B are both caused by a third factor, C.
  5. B is caused by C which is correlated to A.
  6. The observed correlation was due purely to chance.

The sixth possibility can be quantified by statistical tests that can calculate the probability that the correlation observed would be as large as it is just by chance if, in fact, there is no relationship between the variables. However, even if that possibility has a small probability, there are still the five others.

If the number of people buying ice cream at the beach is statistically related to the number of people who drown at the beach, then nobody would claim ice cream causes drowning because it's obvious that it isn't so. (In this case, both drowning and ice cream buying are clearly related by a third factor: the number of people at the beach).
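
A simulation sketch of this third-factor pattern, with invented daily figures for beach attendance, ice cream sales, and drownings (all numbers are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical daily data: beach attendance (C) drives both ice cream
# sales (A) and drownings (B); A does not cause B.
days = 365
attendance = rng.uniform(50, 2000, days)            # people at the beach
ice_cream_sales = 0.3 * attendance + rng.normal(0, 40, days)
drownings = rng.poisson(attendance / 500)           # risk scales with crowd size only

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"correlation between ice cream sales and drownings: r = {r:.2f}")
# A clear positive correlation appears although neither causes the other;
# controlling for attendance (the common cause) would make it disappear.
```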

This fallacy can be used, for example, to prove that exposure to a chemical causes cancer. Replace "number of people buying ice cream" with "number of people exposed to chemical X", and "number of people who drown" with "number of people who get cancer", and many people will believe you. In such a situation, there may be a statistical correlation even if there is no real effect. For example, if there is a perception that a chemical site is "dangerous" (even if it really isn't) property values in the area will decrease, which will entice more low-income families to move to that area. If low-income families are more likely to get cancer than high-income families (due to a poorer diet, for example, or less access to medical care) then rates of cancer will go up, even though the chemical itself is not dangerous. It is believed [22] that this is exactly what happened with some of the early studies showing a link between EMF (electromagnetic fields) from power lines and cancer. [23]

In well-designed studies, the effect of false causality can be eliminated by assigning some people into a "treatment group" and some people into a "control group" at random, and giving the treatment group the treatment and not giving the control group the treatment. In the above example, a researcher might expose one group of people to chemical X and leave a second group unexposed. If the first group had higher cancer rates, the researcher knows that there is no third factor that affected whether a person was exposed because he controlled who was exposed or not, and he assigned people to the exposed and non-exposed groups at random. However, in many applications, actually doing an experiment in this way is either prohibitively expensive, infeasible, unethical, illegal, or downright impossible. For example, it is highly unlikely that an IRB would accept an experiment that involved intentionally exposing people to a dangerous substance in order to test its toxicity. The obvious ethical implications of such types of experiments limit researchers' ability to empirically test causation.

Proof of the null hypothesis

In a statistical test, the null hypothesis (H₀) is considered valid until enough data proves it wrong. Then H₀ is rejected and the alternative hypothesis (H₁) is considered to be proven correct. By chance this can happen, although H₀ is true, with a probability denoted α (the significance level). This can be compared to the judicial process, where the accused is considered innocent (H₀) until proven guilty (H₁) beyond reasonable doubt (α).

But if the data does not give us enough proof to reject H₀, this does not automatically prove that H₀ is correct. If, for example, a tobacco producer wishes to demonstrate that its products are safe, it can easily conduct a test with a small sample of smokers versus a small sample of non-smokers. It is unlikely that any of them will develop lung cancer (and even if they do, the difference between the groups has to be very big in order to reject H₀). Therefore, it is likely, even when smoking is dangerous, that our test will not reject H₀. If H₀ is accepted, it does not automatically follow that smoking is proven harmless. The test has insufficient power to reject H₀, so the test is useless and the value of the "proof" of H₀ is also null.
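
A rough power calculation illustrates the point. The incidence rates, group sizes, and choice of test below are invented for the sketch; the harmful effect is assumed real, yet the tiny samples almost never reject H₀.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical incidence rates: smoking really is harmful in this sketch.
p_smokers, p_nonsmokers = 0.15, 0.05
n_per_group = 20                       # deliberately tiny samples
n_sim, rejections = 2000, 0
for _ in range(n_sim):
    sick_smokers = rng.binomial(1, p_smokers, n_per_group).sum()
    sick_nonsmokers = rng.binomial(1, p_nonsmokers, n_per_group).sum()
    table = [[sick_smokers, n_per_group - sick_smokers],
             [sick_nonsmokers, n_per_group - sick_nonsmokers]]
    _, p_value = stats.fisher_exact(table)
    rejections += p_value < 0.05

print(f"estimated power with {n_per_group} per group: {rejections / n_sim:.0%}")
# Failing to reject H0 with power this low says nothing about safety.
```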

This can—using the judicial analogue above—be compared with the truly guilty defendant who is released just because the proof is not enough for a guilty verdict. This does not prove the defendant's innocence, but only that there is not proof enough for a guilty verdict.

"...the null hypothesis is never proved or established, but it is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (Fisher in The Design of Experiments ) Many reasons for confusion exist including the use of double negative logic and terminology resulting from the merger of Fisher's "significance testing" (where the null hypothesis is never accepted) with "hypothesis testing" (where some hypothesis is always accepted).

Confusing statistical significance with practical significance

Statistical significance is a measure of probability; practical significance is a measure of effect. [24] A baldness cure is statistically significant if a sparse peach-fuzz usually covers the previously naked scalp. The cure is practically significant when a hat is no longer required in cold weather and the barber asks how much to take off the top. The bald want a cure that is both statistically and practically significant; it will probably work and, if it does, it will have a big hairy effect. Scientific publication often requires only statistical significance. This has led to complaints (for the last 50 years) that statistical significance testing is a misuse of statistics. [25]
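
A sketch of the distinction using made-up "hair count" data: with a huge sample, a trivially small real effect yields an extremely small p-value (statistical significance) while remaining practically negligible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical trial: the "cure" adds on average 2 hairs per square cm,
# a real but practically meaningless effect, measured on a huge sample.
n = 1_000_000
treated = rng.normal(loc=102, scale=30, size=n)
control = rng.normal(loc=100, scale=30, size=n)

t, p = stats.ttest_ind(treated, control)
effect = treated.mean() - control.mean()
print(f"p-value: {p:.1e}  (statistically significant)")
print(f"mean difference: {effect:.1f} hairs/cm^2 (practically negligible)")
```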

Data dredging

Data dredging is an abuse of data mining. In data dredging, large compilations of data are examined in order to find a correlation, without any pre-defined choice of a hypothesis to be tested. Since the threshold for declaring a relationship between two parameters is usually set at the 95% confidence level, there is a 5% chance of finding an apparently significant correlation between any two sets of completely random variables. Given that data dredging efforts typically examine large datasets with many variables, and hence even larger numbers of pairs of variables, spurious but apparently statistically significant results are almost certain to be found by any such study.
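
A minimal dredging simulation (the variable count, sample size, and seed are arbitrary assumptions): testing every pair among 20 unrelated random variables yields roughly 5% spuriously "significant" correlations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# 20 completely unrelated random variables measured on 100 "subjects".
n_vars, n_obs = 20, 100
data = rng.normal(size=(n_vars, n_obs))

spurious, pairs = 0, 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[i], data[j])
        pairs += 1
        spurious += p < 0.05

print(f"{spurious} of {pairs} pairs look 'significant' at the 5% level")
# With 190 pairs, roughly 5% (about 10) appear significant by chance alone.
```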

Note that data dredging is a valid way of finding a possible hypothesis but that hypothesis must then be tested with data not used in the original dredging. The misuse comes in when that hypothesis is stated as fact without further validation.

"You cannot legitimately test a hypothesis on the same data that first suggested that hypothesis. The remedy is clear. Once you have a hypothesis, design a study to search specifically for the effect you now think is there. If the result of this test is statistically significant, you have real evidence at last." [26]

Data manipulation

Informally called "fudging the data," this practice includes selective reporting (see also publication bias) and even simply making up false data.

Examples of selective reporting abound. The easiest and most common examples involve choosing a group of results that follow a pattern consistent with the preferred hypothesis while ignoring other results or "data runs" that contradict the hypothesis.

Scientists, in general, question the validity of study results that cannot be reproduced by other investigators. However, some scientists refuse to publish their data and methods. [27]

Data manipulation is a serious consideration even in the most honest of statistical analyses. Outliers, missing data and non-normality can all adversely affect the validity of statistical analysis. It is appropriate to study the data and repair real problems before analysis begins. "[I]n any scatter diagram there will be some points more or less detached from the main part of the cloud: these points should be rejected only for cause." [28]

Other fallacies

Pseudoreplication is a technical error associated with analysis of variance. Complexity hides the fact that statistical analysis is being attempted on a single sample (N = 1). For this degenerate case the variance cannot be calculated (division by zero). A sample of N = 1 will always give the researcher the highest statistical correlation between intent bias and actual findings.

The gambler's fallacy assumes that the probability assigned to a sequence of events before any of them has occurred still applies once part of the sequence has already happened. Thus, if someone has already tossed 9 coins and each has come up heads, people tend to assume that the likelihood of a tenth toss also being heads is 1023 to 1 against (the odds of ten heads in a row before the first coin was tossed), when in fact the chance of the tenth head is 50% (assuming the coin is unbiased).
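
A quick simulation confirms the conditional probability (fair coin and run length as assumed above):

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulate many runs of 10 fair coin tosses and keep only runs whose first
# 9 tosses were all heads: the 10th toss is still heads ~50% of the time.
tosses = rng.integers(0, 2, size=(500_000, 10))        # 1 = heads
first_nine_heads = tosses[:, :9].sum(axis=1) == 9
tenth_given_nine = tosses[first_nine_heads, 9].mean()
print(f"P(10th head | first 9 heads) ≈ {tenth_given_nine:.2f}")
```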

The prosecutor's fallacy [29] assumes that the probability of an apparently criminal event being random chance is equal to the probability that the suspect is innocent. A prominent example in the UK is the wrongful conviction of Sally Clark for killing her two sons, who appeared to have died of Sudden Infant Death Syndrome (SIDS). In his expert testimony, the now-discredited Professor Sir Roy Meadow claimed that, due to the rarity of SIDS, the probability of Clark being innocent was 1 in 73 million. This was later questioned by the Royal Statistical Society; [30] even taking Meadow's figure at face value, one has to weigh all the possible explanations against each other to conclude which most likely caused the unexplained death of the two children. Available data suggest that the odds would favour double SIDS over double homicide by a factor of nine. [31] The 1 in 73 million figure was also misleading, as it was reached by finding the probability of a baby from an affluent, non-smoking family dying from SIDS and squaring it: this erroneously treats the two deaths as statistically independent, assuming that there is no factor, such as genetics, that would make it more likely for two siblings to die from SIDS. [32] [33] It is also an example of the ecological fallacy, as it assumes the probability of SIDS in Clark's family was the same as the average for all affluent, non-smoking families; social class is a complex and multifaceted concept involving many other variables, such as education and line of work. Assuming that an individual will have the same attributes as the rest of a given group fails to account for the effects of these other variables, which in turn can be misleading. [33] The conviction of Sally Clark was eventually overturned and Meadow was struck off the medical register. [34]
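
A back-of-the-envelope sketch of the two numerical points above. The 1-in-8,543 single-death figure is the widely reported number that was squared and is included here as an assumption; the 9:1 odds are the Hill (2004) estimate cited above.

```python
# Illustrative numbers only (assumed): probability of a single SIDS death
# in an affluent, non-smoking family, as used in the testimony.
p_single_sids = 1 / 8543
p_double_if_independent = p_single_sids ** 2        # the erroneous "square it" step
print(f"naive figure: 1 in {1 / p_double_if_independent:,.0f}")   # ~1 in 73 million

# The fallacy: a tiny P(evidence | innocence) is not P(innocence).
# The rare event must be weighed against the rival explanation, which is
# also rare; available data put double SIDS about 9 times more likely
# than double homicide.
odds_sids_vs_homicide = 9
print(f"odds favour double SIDS over double homicide by about "
      f"{odds_sids_vs_homicide}:1")
```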

The ludic fallacy: probabilities are based on simple models that ignore real (if remote) possibilities. Poker players do not consider that an opponent may draw a gun rather than a card. The insured (and governments) assume that insurers will remain solvent, but see AIG and systemic risk.

Other types of misuse

Other misuses include comparing apples and oranges, using the wrong average, [35] regression toward the mean, [36] and the umbrella phrase garbage in, garbage out. [37] Some statistics are simply irrelevant to an issue. [38]

Certain advertising phrasing such as "[m]ore than 99 in 100," may be misinterpreted as 100%. [39]

Anscombe's quartet is a constructed dataset that exemplifies the shortcomings of simple descriptive statistics (and the value of plotting the data before numerical analysis).
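
A short sketch using the copy of Anscombe's quartet bundled with seaborn's example datasets (assuming seaborn is installed and the dataset can be fetched):

```python
import seaborn as sns

# Anscombe's quartet ships with seaborn's example datasets.
df = sns.load_dataset("anscombe")

for name, group in df.groupby("dataset"):
    print(name,
          f"mean x = {group.x.mean():.2f}",
          f"mean y = {group.y.mean():.2f}",
          f"corr = {group.x.corr(group.y):.2f}")
# All four datasets share nearly identical summary statistics, yet plotting
# them (e.g., sns.lmplot(data=df, x='x', y='y', col='dataset')) reveals
# radically different structures.
```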

See also

References

Notes

  1. Some data on accuracy of polls is available. Regarding one important poll by the U.S. government, "Relatively speaking, both sampling error and non-sampling [bias] error are tiny." [20] The difference between the votes predicted by one private poll and the actual tally for American presidential elections is available for comparison at "Election Year Presidential Preferences: Gallup Poll Accuracy Record: 1936–2012". The predictions were typically calculated on the basis of fewer than 5000 opinions from likely voters. [21]

Sources

  1. Spirer, Spirer & Jaffe 1998, p. 1.
  2. Gardenier, John; Resnik, David (2002). "The misuse of statistics: concepts, tools, and a research agenda". Accountability in Research: Policies and Quality Assurance. 9 (2): 65–74. doi:10.1080/08989620212968. PMID   12625352. S2CID   24167609.
  3. Fischer, David (1979). Historians' fallacies: toward a logic of historical thought. New York: Harper & Row. pp. 337–338. ISBN   978-0060904982.
  4. Strasak, Alexander M.; Qamruz Zaman; Karl P. Pfeiffer; Georg Göbel; Hanno Ulmer (2007). "Statistical errors in the medical research-a review of common pitfalls". Swiss Medical Weekly. 137 (3–4): 44–49. doi:10.4414/smw.2007.11587. PMID   17299669. In this article anything less than the best statistical practice is equated to the potential misuse of statistics. In a few pages 47 potential statistical errors are discussed; errors in study design, data analysis, documentation, presentation and interpretation. "[S]tatisticians should be involved early in study design, as mistakes at this point can have major repercussions, negatively affecting all subsequent stages of medical research."
  5. Indrayan, Abhaya (2007). "Statistical fallacies in orthopedic research". Indian Journal of Orthopaedics. 41 (1): 37–46. doi: 10.4103/0019-5413.30524 . PMC   2981893 . PMID   21124681. Contains a rich list of medical misuses of statistics of all types.
  6. Spirer, Spirer & Jaffe 1998, chapters 7 & 8.
  7. Spirer, Spirer & Jaffe 1998, chapter 3.
  8. Spirer, Spirer & Jaffe 1998, chapter 4.
  9. Adler, Robert; John Ewing; Peter Taylor (2009). "Citation statistics". Statistical Science. 24 (1): 1–14. doi: 10.1214/09-STS285 .
  10. Spirer, Spirer & Jaffe 1998, chapter title.
  11. Spirer, Spirer & Jaffe 1998, chapter 5.
  12. Weatherburn, Don (November 2011), "Uses and abuses of crime statistics" (PDF), Crime and Justice Bulletin: Contemporary Issues in Crime and Justice, 153, NSW Bureau of Crime Statistics and Research, ISBN 9781921824357, ISSN 1030-1046, archived from the original on June 21, 2014. This Australian report on crime statistics provides numerous examples of interpreting and misinterpreting the data. "The increase in media access to information about crime has not been matched by an increase in the quality of media reporting on crime. The misuse of crime statistics by the media has impeded rational debate about law and order." Among the alleged media abuses: selective use of data, selective reporting of facts, misleading commentary, misrepresentation of facts and misleading headlines. Police and politicians also abused the statistics.
  13. Krugman, Paul (1994). Peddling prosperity: economic sense and nonsense in the age of diminished expectations . New York: W.W. Norton. p.  111. ISBN   0-393-03602-2.
  14. Spirer, Spirer & Jaffe 1998.
  15. Kahneman 2013, p. 102.
  16. Moore & Notz 2006, p. 59.
  17. Moore & Notz 2006, p. 97.
  18. Moore & McCabe 2003, pp. 252–254.
  19. Moore & Notz 2006, p. 53, Sample surveys in the real world.
  20. Freedman, Pisani & Purves 1998, chapter 22: Measuring Employment and Unemployment, p. 405.
  21. Freedman, Pisani & Purves 1998, pp. 389–390.
  22. Farley, John W. (2003). Barrett, Stephen (ed.). "Power Lines and Cancer: Nothing to Fear". Quackwatch.
  23. Vince, Gaia (2005-06-03). "Large study links power lines to childhood cancer". New Scientist. Archived from the original on August 16, 2014. Cites: Draper, G. (2005). "Childhood cancer in relation to distance from high voltage power lines in England and Wales: a case-control study". BMJ. 330 (7503): 1290. doi:10.1136/bmj.330.7503.1290. PMC 558197. PMID 15933351.
  24. Moore & McCabe 2003, pp. 463.
  25. Rozeboom, William W. (1960). "The fallacy of the null-hypothesis significance test". Psychological Bulletin. 57 (5): 416–428. doi:10.1037/h0042040. PMID   13744252.
  26. Moore & McCabe 2003, p. 466.
  27. Neylon, C (2009). "Scientists lead the push for open data sharing". Research Information. 41. Europa Science: 22–23. ISSN 1744-8026. Archived from the original on December 3, 2013.
  28. Freedman, Pisani & Purves 1998 , chapter 9: More about correlations, §3: Some exceptional cases
  29. Seife, Charles (2011). Proofiness: how you're being fooled by the numbers. New York: Penguin. pp. 203–205 and Appendix C. ISBN   9780143120070. Discusses the notorious British case.
  30. Royal Statistical Society (23 October 2001). "Royal Statistical Society concerned by issues raised in Sally Clark case" (PDF). Archived from the original on 2011-08-24.
  31. Hill, R. (2004). "Multiple sudden infant deaths – coincidence or beyond coincidence?". Paediatric and Perinatal Epidemiology. 18 (5): 320–6. doi:10.1111/j.1365-3016.2004.00560.x. PMID   15367318.
  32. "Beyond reasonable doubt". Plus Maths. Retrieved 2022-04-01.
  33. Watkins, Stephen J. (2000-01-01). "Conviction by mathematical error?: Doctors and lawyers should get probability theory right". BMJ. 320 (7226): 2–3. doi:10.1136/bmj.320.7226.2. ISSN 0959-8138. PMC 1117305. PMID 10617504.
  34. Dyer, Clare (2005-07-21). "Professor Roy Meadow struck off". BMJ. 331 (7510): 177. doi:10.1136/bmj.331.7510.177. ISSN   0959-8138. PMC   1179752 . PMID   16037430.
  35. Huff 1954, chapter 2.
  36. Kahneman 2013, chapter 17.
  37. Hooke 1983, §50.
  38. Campbell 1974, chapter 3: Meaningless statistics.
  39. Mazer, Robert. "LABORATORY'S MARKETING MATERIALS MAY EXPOSE LAB TO LEGAL CLAIMS". LinkedIn. Retrieved 10 April 2024.

Further reading