False positive rate

Last updated December 25, 2023

In statistics, when performing multiple comparisons, a false positive ratio (also known as fall-out or false alarm ratio) is the probability of falsely rejecting the null hypothesis for a particular test. The false positive rate is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events (regardless of classification).

Definition

The false positive rate is ${\boldsymbol {\mathrm {FPR} }}={\frac {\mathrm {FP} }{\mathrm {FP} +\mathrm {TN} }}$

where $\mathrm {FP}$ is the number of false positives, $\mathrm {TN}$ is the number of true negatives and $N=\mathrm {FP} +\mathrm {TN}$ is the total number of ground truth negatives.

The level of significance that is used to test each hypothesis is set based on the form of inference (simultaneous inference vs. selective inference) and its supporting criteria (for example FWER or FDR), that were pre-determined by the researcher.

When performing multiple comparisons in a statistical framework such as above, the false positive ratio (also known as the false alarm ratio, as opposed to false positive rate / false alarm rate ) usually refers to the probability of falsely rejecting the null hypothesis for a particular test. Using the terminology suggested here, it is simply $V/m_{0}$ .

Since V is a random variable and $m_{0}$ is a constant ( $V\leq m_{0}$ ), the false positive ratio is also a random variable, ranging between 0–1.
The false positive rate (or "false alarm rate") usually refers to the expectancy of the false positive ratio, expressed by $E(V/m_{0})$ .

It is worth noticing that the two definitions ("false positive ratio" / "false positive rate") are somewhat interchangeable. For example, in the referenced article^[1] $V/m_{0}$ serves as the false positive "rate" rather than as its "ratio".

Classification of multiple hypothesis tests

The following table defines the possible outcomes when testing multiple null hypotheses. Suppose we have a number m of null hypotheses, denoted by: $H 1, H 2, ..., H m .$ Using a statistical test, we reject the null hypothesis if the test is declared significant. We do not reject the null hypothesis if the test is non-significant. Summing each type of outcome over all H_i yields the following random variables:

	Null hypothesis is true (H₀)	Alternative hypothesis is true (H_A)	Total
Test is declared significant	$V$	$S$	$R$
Test is declared non-significant	$U$	$T$	$m-R$
Total	$m_{0}$	$m-m_{0}$	$m$

$m$ is the total number hypotheses tested
$m_{0}$ is the number of true null hypotheses, an unknown parameter
$m-m_{0}$ is the number of true alternative hypotheses
$V$ is the number of false positives (Type I error) (also called "false discoveries")
$S$ is the number of true positives (also called "true discoveries")
$T$ is the number of false negatives (Type II error)
$U$ is the number of true negatives
$R=V+S$ is the number of rejected null hypotheses (also called "discoveries", either true or false)

In $m$ hypothesis tests of which $m_{0}$ are true null hypotheses, $R$ is an observable random variable, and $S$ , $T$ , $U$ , and $V$ are unobservable random variables.

Comparison with other error rates

While the false positive rate is mathematically equal to the type I error rate, it is viewed as a separate term for the following reasons:^{[ citation needed ]}

The type I error rate is often associated with the a-priori setting of the significance level by the researcher: the significance level represents an acceptable error rate considering that all null hypotheses are true (the "global null" hypothesis). The choice of a significance level may thus be somewhat arbitrary (i.e. setting 10% (0.1), 5% (0.05), 1% (0.01) etc.)

As opposed to that, the false positive rate is associated with a post-prior result, which is the expected number of false positives divided by the total number of hypotheses under the real combination of true and non-true null hypotheses (disregarding the "global null" hypothesis). Since the false positive rate is a parameter that is not controlled by the researcher, it cannot be identified with the significance level.

Moreover, false positive rate is usually used regarding a medical test or diagnostic device (i.e. "the false positive rate of a certain diagnostic device is 1%"), while type I error is a term associated with statistical tests, where the meaning of the word "positive" is not as clear (i.e. "the type I error of a test is 1%").

The false positive rate should also not be confused with the family-wise error rate, which is defined as ${\boldsymbol {\mathrm {FWER} }}=\Pr(V\geq 1)\,$ . As the number of tests grows, the familywise error rate usually converges to 1 while the false positive rate remains fixed.

Lastly, it is important to note the profound difference between the false positive rate and the false discovery rate: while the first is defined as $E(V/m_{0})$ , the second is defined as $E(V/R)$ .

Related Research Articles

Biostatistics is a branch of statistics that applies statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experiments and the interpretation of the results.

A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. More generally, hypothesis testing allows us to make probabilistic statements about population parameters. More informally, hypothesis testing is the processes of making decisions under uncertainty. Typically, hypothesis testing procedures involve a user selected tradeoff between false positives and false negatives.

In statistics, the power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true. It is commonly denoted by $, and represents the chances of a true positive detection conditional on the actual existence of an effect to detect. Statistical power ranges from 0 to 1, and as the power of a test increases, the probability of making a type II error by wrongly failing to reject the null hypothesis decreases.$

In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. Even though reporting p-values of statistical tests is common practice in academic publications of many quantitative fields, misinterpretation and misuse of p-values is widespread and has been a major topic in mathematics and metascience. In 2016, the American Statistical Association (ASA) made a formal statement that "p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone" and that "a p-value, or statistical significance, does not measure the size of an effect or the importance of a result" or "evidence regarding a model or hypothesis." That said, a 2019 task force by ASA has issued a statement on statistical significance and replicability, concluding with: "p-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data."

In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the expected proportion of "discoveries" that are false. Equivalently, the FDR is the expected ratio of the number of false positive classifications to the total number of positive classifications. The total number of rejections of the null include both the number of false positives (FP) and true positives (TP). Simply put, FDR = FP /. FDR-controlling procedures provide less stringent control of Type I errors compared to family-wise error rate (FWER) controlling procedures, which control the probability of at least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.

In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests.

In statistics, the Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation at lag 1 in the residuals from a regression analysis. It is named after James Durbin and Geoffrey Watson. The small sample distribution of this ratio was derived by John von Neumann. Durbin and Watson applied this statistic to the residuals from least squares regressions, and developed bounds tests for the null hypothesis that the errors are serially uncorrelated against the alternative that they follow a first order autoregressive process. Note that the distribution of this test statistic does not depend on the estimated regression coefficients and the variance of the errors.

In statistical hypothesis testing, a type I error is the mistaken rejection of a null hypothesis that is actually true. A type I error is also known as a "false positive" finding or conclusion; example: "an innocent person is convicted". A type II error is the failure to reject a null hypothesis that is actually false. A type II error is also known as a "false negative" finding or conclusion; example: "a guilty person is not convicted". Much of statistical theory revolves around the minimization of one or both of these errors, though the complete elimination of either is a statistical impossibility if the outcome is not determined by a known, observable causal process. By selecting a low threshold (cut-off) value and modifying the alpha (α) level, the quality of the hypothesis test can be increased. The knowledge of type I errors and type II errors is widely used in medical science, biometrics and computer science.

<span class="mw-page-title-main">Fisher's method</span> Statistical method

In statistics, Fisher's method, also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" (analysis of analyses). It was developed by and named for Ronald Fisher. In its basic form, it is used to combine the results from several independence tests bearing upon the same overall hypothesis (H₀).

In statistics, the Bonferroni correction is a method to counteract the multiple comparisons problem.

In statistics, the multiple comparisons, multiplicity or multiple testing problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values.

In statistics, the Holm–Bonferroni method, also called the Holm method or Bonferroni–Holm method, is used to counteract the problem of multiple comparisons. It is intended to control the family-wise error rate (FWER) and offers a simple test uniformly more powerful than the Bonferroni correction. It is named after Sture Holm, who codified the method, and Carlo Emilio Bonferroni.

The Newman–Keuls or Student–Newman–Keuls (SNK)method is a stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other. It was named after Student (1927), D. Newman, and M. Keuls. This procedure is often used as a post-hoc test whenever a significant difference between three or more sample means has been revealed by an analysis of variance (ANOVA). The Newman–Keuls method is similar to Tukey's range test as both procedures use studentized range statistics. Unlike Tukey's range test, the Newman–Keuls method uses different critical values for different pairs of mean comparisons. Thus, the procedure is more likely to reveal significant differences between group means and to commit type I errors by incorrectly rejecting a null hypothesis when it is true. In other words, the Neuman-Keuls procedure is more powerful but less conservative than Tukey's range test.

In statistics Wilks' theorem states that the log-likelihood ratio is asymptotically normal. This can be used to produce confidence intervals for maximum-likelihood estimates or as a test statistic for performing the likelihood-ratio test.

In statistics, a false coverage rate (FCR) is the average rate of false coverage, i.e. not covering the true parameters, among the selected intervals.

In statistics, the Šidák correction, or Dunn–Šidák correction, is a method used to counteract the problem of multiple comparisons. It is a simple method to control the family-wise error rate. When all null hypotheses are true, the method provides familywise error control that is exact for tests that are stochastically independent, conservative for tests that are positively dependent, and liberal for tests that are negatively dependent. It is credited to a 1967 paper by the statistician and probabilist Zbyněk Šidák. The Šidák method can be used to determine the statistical significance, and evaluate adjusted P value and confidence intervals.

A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition, while a false negative is the opposite error, where the test result incorrectly indicates the absence of a condition when it is actually present. These are the two kinds of errors in a binary test, in contrast to the two kinds of correct result. They are also known in medicine as a false positivediagnosis, and in statistical classification as a false positiveerror.

Misuse of p-values is common in scientific research and scientific education. p-values are often used or interpreted incorrectly; the American Statistical Association states that p-values can indicate how incompatible the data are with a specified statistical model. From a Neyman–Pearson hypothesis testing approach to statistical inferences, the data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level. From a Fisherian statistical testing approach to statistical inferences, a low p-value means either that the null hypothesis is true and a highly improbable event has occurred or that the null hypothesis is false.

In statistical hypothesis testing, specifically multiple hypothesis testing, the q-value in the Storey-Tibshirani procedure provides a means to control the positive false discovery rate (pFDR). Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value.

In statistical hypothesis testing, the error exponent of a hypothesis testing procedure is the rate at which the probabilities of Type I and Type II decay exponentially with the size of the sample used in the test. For example, if the probability of error $of a test decays as, where is the sample size, the error exponent is .$

References

↑ Burke, Donald; Brundage, John; Redfield, Robert (1988). "Measurement of the False Positive Rate in a Screening Program for Human Immunodeficiency Virus Infections". The New England Journal of Medicine . 319 (15): 961–964. doi:10.1056/NEJM198810133191501. PMID 3419477.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Burke.at.all1988-1] Burke, Donald; Brundage, John; Redfield, Robert (1988). "Measurement of the False Positive Rate in a Screening Program for Human Immunodeficiency Virus Infections". The New England Journal of Medicine . 319 (15): 961–964. doi:10.1056/NEJM198810133191501. PMID 3419477.

[1]