Type III error

Last updated

In statistical hypothesis testing, there are various notions of so-called type III errors (or errors of the third kind), and sometimes type IV errors or higher, by analogy with the type I and type II errors of Jerzy Neyman and Egon Pearson. Fundamentally, type III errors occur when researchers provide the right answer to the wrong question, i.e. when the correct hypothesis is rejected but for the wrong reason.

Contents

Since the paired notions of type I errors (or "false positives") and type II errors (or "false negatives") that were introduced by Neyman and Pearson are now widely used, their choice of terminology ("errors of the first kind" and "errors of the second kind"), has led others to suppose that certain sorts of mistakes that they have identified might be an "error of the third kind", "fourth kind", etc. [lower-alpha 1]

None of these proposed categories have been widely accepted. The following is a brief account of some of these proposals.

Systems theory

In systems theory an additional type III error is often defined: type III (δ): asking the wrong question and using the wrong null hypothesis. [2]

David

Florence Nightingale David [3] a sometime colleague of both Neyman and Pearson at the University College London, making a humorous aside at the end of her 1947 paper, suggested that, in the case of her own research, perhaps Neyman and Pearson's "two sources of error" could be extended to a third:

I have been concerned here with trying to explain what I believe to be the basic ideas [of my "theory of the conditional power functions"], and to forestall possible criticism that I am falling into error (of the third kind) and am choosing the test falsely to suit the significance of the sample.

1947, p.339

Mosteller

In 1948, Frederick Mosteller [lower-alpha 2] argued that a "third kind of error" was required to describe circumstances he had observed, namely:

Kaiser

According to Henry F. Kaiser, in his 1966 paper extended Mosteller's classification such that an error of the third kind entailed an incorrect decision of direction following a rejected two-tailed test of hypothesis. In his discussion (1966, pp. 162–163), Kaiser also speaks of α errors, β errors, and γ errors for type I, type II and type III errors respectively (C.O. Dellomos).

Kimball

In 1957, Allyn W. Kimball, a statistician with the Oak Ridge National Laboratory, proposed a different kind of error to stand beside "the first and second types of error in the theory of testing hypotheses". Kimball defined this new "error of the third kind" as being "the error committed by giving the right answer to the wrong problem" (1957, p. 134).

Mathematician Richard Hamming expressed his view that "It is better to solve the right problem the wrong way than to solve the wrong problem the right way".

Harvard economist Howard Raiffa describes an occasion when he, too, "fell into the trap of working on the wrong problem" (1968, pp. 264–265). [lower-alpha 4]

Mitroff and Featheringham

In 1974, Ian Mitroff and Tom Featheringham extended Kimball's category, arguing that "one of the most important determinants of a problem's solution is how that problem has been represented or formulated in the first place".

They defined type III errors as either "the error ... of having solved the wrong problem ... when one should have solved the right problem" or "the error ... [of] choosing the wrong problem representation ... when one should have ... chosen the right problem representation" (1974), p. 383.

In the 2009 book Dirty rotten strategies by Ian I. Mitroff and Abraham Silvers described type III and type IV errors providing many examples of both developing good answers to the wrong questions (III) and deliberately selecting the wrong questions for intensive and skilled investigation (IV). Most of the examples have nothing to do with statistics, many being problems of public policy or business decisions. [4]

Raiffa

In 1969, the Harvard economist Howard Raiffa jokingly suggested "a candidate for the error of the fourth kind: solving the right problem too late" (1968, p. 264).

Marascuilo and Levin

In 1970, L. A. Marascuilo and J. R. Levin proposed a "fourth kind of error" – a "type IV error" – which they defined in a Mosteller-like manner as being the mistake of "the incorrect interpretation of a correctly rejected hypothesis"; which, they suggested, was the equivalent of "a physician's correct diagnosis of an ailment followed by the prescription of a wrong medicine" (1970, p. 398).

See also

Notes

  1. For example, Onwuegbuzie & Daniel claim to have identified an additional eight kinds of error. [1]
  2. The 1981 President of the American Association for the Advancement of Science
  3. Compare with Kantian ethics, where it is not enough to do the right thing, but it must be done for the right reason, and Gettier problems, where "justified true belief" is not equated with knowledge.
  4. Note that Raiffa, from his imperfect recollection, incorrectly attributed this "error of the third kind" to John Tukey (1915–2000).

Related Research Articles

<span class="mw-page-title-main">Statistics</span> Study of the collection, analysis, interpretation, and presentation of data

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters.

In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods, specifically one found by maximization over the entire parameter space and another found after imposing some constraint. If the constraint is supported by the observed data, the two likelihoods should not differ by more than sampling error. Thus the likelihood-ratio test tests whether this ratio is significantly different from one, or equivalently whether its natural logarithm is significantly different from zero.

In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. More precisely, a study's defined significance level, denoted by , is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true; and the p-value of a result, , is the probability of obtaining a result at least as extreme, given that the null hypothesis is true. The result is statistically significant, by the standards of the study, when . The significance level for a study is chosen before data collection, and is typically set to 5% or much lower—depending on the field of study.

Statistical bias is a systematic tendency which causes differences between results and facts. The bias exists in numbers of the process of data analysis, including the source of the data, the estimator chosen, and the ways the data was analyzed. Bias may have a serious impact on results, for example, to investigate people's buying habits. If the sample size is not large enough, the results may not be representative of the buying habits of all the people. That is, there may be discrepancies between the survey results and the actual results. Therefore, understanding the source of statistical bias can help to assess whether the observed results are close to the real results.

In inferential statistics, the null hypothesis is that two possibilities are the same. The null hypothesis is that the observed difference is due to chance alone. Using statistical tests, it is possible to calculate the likelihood that the null hypothesis is true.

In statistics, the power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true. It is commonly denoted by , and represents the chances of a true positive detection conditional on the actual existence of an effect to detect. Statistical power ranges from 0 to 1, and as the power of a test increases, the probability of making a type II error by wrongly failing to reject the null hypothesis decreases.

In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. Reporting p-values of statistical tests is common practice in academic publications of many quantitative fields. Since the precise meaning of p-value is hard to grasp, misuse is widespread and has been a major topic in metascience.

In statistical hypothesis testing, the alternative hypothesis is one of the proposed proposition in the hypothesis test. In general the goal of hypothesis test is to demonstrate that in the given condition, there is sufficient evidence supporting the credibility of alternative hypothesis instead of the exclusive proposition in the test. It is usually consistent with the research hypothesis because it is constructed from literature review, previous studies, etc. However, the research hypothesis is sometimes consistent with the null hypothesis.

In statistics, the Neyman–Pearson lemma was introduced by Jerzy Neyman and Egon Pearson in a paper in 1933. The Neyman-Pearson lemma is part of the Neyman-Pearson theory of statistical testing, which introduced concepts like errors of the second kind, power function, and inductive behavior. The previous Fisherian theory of significance testing postulated only one hypothesis. By introducing a competing hypothesis, the Neyman-Pearsonian flavor of statistical testing allows investigating the two types of errors. The trivial cases where one always rejects or accepts the null hypothesis are of little interest but it does prove that one must not relinquish control over one type of error while calibrating the other. Neyman and Pearson accordingly proceeded to restrict their attention to the class of all level tests while subsequently minimizing type II error, traditionally denoted by . Their seminal paper of 1933, including the Neyman-Pearson lemma, comes at the end of this endeavor, not only showing the existence of tests with the most power that retain a prespecified level of type I error, but also providing a way to construct such tests. The Karlin-Rubin theorem extends the Neyman-Pearson lemma to settings involving composite hypotheses with monotone likelihood ratios.

In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the expected proportion of "discoveries" that are false. Equivalently, the FDR is the expected ratio of the number of false positive classifications to the total number of positive classifications. The total number of rejections of the null include both the number of false positives (FP) and true positives (TP). Simply put, FDR = FP /. FDR-controlling procedures provide less stringent control of Type I errors compared to familywise error rate (FWER) controlling procedures, which control the probability of at least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.

In statistical hypothesis testing, a type I error is the mistaken rejection of an actually true null hypothesis, while a type II error is the failure to reject a null hypothesis that is actually false. Much of statistical theory revolves around the minimization of one or both of these errors, though the complete elimination of either is a statistical impossibility if the outcome is not determined by a known, observable causal process. By selecting a low threshold (cut-off) value and modifying the alpha (α) level, the quality of the hypothesis test can be increased. The knowledge of type I errors and type II errors is widely used in medical science, biometrics and computer science.

In statistics, the Bonferroni correction is a method to counteract the multiple comparisons problem. Bonferroni correction is the simplest method for counteracting this; however, it is a conservative method that gives greater risk of failure to reject a false null hypothesis than other methods, as it ignores potentially valuable information, such as the distribution of p-values across all comparisons.

Multiple comparisons problem Problem where one considers a set of inferences simultaneously based on the observed values

In statistics, the multiple comparisons, multiplicity or multiple testing problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values.

The foundations of statistics concern the epistemological debate in statistics over how one should conduct inductive inference from data. Among the issues considered in statistical inference are the question of Bayesian inference versus frequentist inference, the distinction between Fisher's "significance testing" and Neyman–Pearson "hypothesis testing", and whether the likelihood principle should be followed. Some of these issues have been debated for up to 200 years without resolution.

Frequentist inference is a type of statistical inference based in frequentist probability, which treats “probability” in equivalent terms to “frequency” and draws conclusions from sample-data by means of emphasizing the frequency or proportion of findings in the data. Frequentist-inference underlies frequentist statistics, in which the well-established methodologies of statistical hypothesis testing and confidence intervals are founded.

In statistics, when performing multiple comparisons, a false positive ratio is the probability of falsely rejecting the null hypothesis for a particular test. The false positive rate is calculated as the ratio between the number of negative events wrongly categorized as positive and the total number of actual negative events.

A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition, while a false negative is the opposite error, where the test result incorrectly indicates the absence of a condition when it is actually present. These are the two kinds of errors in a binary test, in contrast to the two kinds of correct result. They are also known in medicine as a false positivediagnosis, and in statistical classification as a false positiveerror.

Misuse of p-values is common in scientific research and scientific education. p-values are often used or interpreted incorrectly; the American Statistical Association states that p-values can indicate how incompatible the data are with a specified statistical model. From a Neyman–Pearson hypothesis testing approach to statistical inferences, the data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level. From a Fisherian statistical testing approach to statistical inferences, a low p-value means either that the null hypothesis is true and a highly improbable event has occurred or that the null hypothesis is false.

In statistical hypothesis testing, the error exponent of a hypothesis testing procedure is the rate at which the probabilities of Type I and Type II decay exponentially with the size of the sample used in the test. For example, if the probability of error of a test decays as , where is the sample size, the error exponent is .

References

  1. Onwuegbuzie, A.J.; Daniel, L. G. (19 February 2003). "Typology of analytical and interpretational errors in quantitative and qualitative educational research". Current Issues in Education. 6 (2).
  2. P.J. Boxer (1994). "Notes on Checkland's Soft Systems Methodology" (PDF). Archived from the original (PDF) on 2009-12-29.
  3. "Florence Nightingale David".
  4. Ian I. Mitroff and Abraham Silvers, Dirty rotten strategies: How We Trick Ourselves and Others into Solving the Wrong Problems Precisely, Stanford Business Press (2009), hardcover, 210 pages, ISBN   978-0-8047-5996-0