Q-value (statistics)

In statistical hypothesis testing, specifically multiple hypothesis testing, the q-value in the Storey procedure provides a means to control the positive false discovery rate (pFDR). [1] Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value. [2]

[Figure: Schematic representation of the Storey–Tibshirani procedure for correcting for multiple hypothesis testing, assuming correctly calculated p-values; the y-axis shows frequency.]

History

In statistics, testing multiple hypotheses simultaneously using methods appropriate for testing single hypotheses tends to yield many false positives: the so-called multiple comparisons problem. [3] For example, assume that one were to test 1,000 null hypotheses, all of which are true, and (as is conventional in single hypothesis testing) to reject null hypotheses with a significance level of 0.05; due to random chance, one would expect 5% of the results to appear significant ( P < 0.05), yielding 50 false positives (rejections of the null hypothesis). [4] Since the 1950s, statisticians had been developing methods for multiple comparisons that reduced the number of false positives, such as controlling the family-wise error rate (FWER) using the Bonferroni correction, but these methods also increased the number of false negatives (i.e. reduced the statistical power). [3] In 1995, Yoav Benjamini and Yosef Hochberg proposed controlling the false discovery rate (FDR) as a more statistically powerful alternative to controlling the FWER in multiple hypothesis testing. [3] The pFDR and the q-value were introduced by John D. Storey in 2002 in order to improve upon a limitation of the FDR, namely that the FDR is not defined when there are no positive results. [1] [5]

Definition

Let there be a null hypothesis $H_0$ and an alternative hypothesis $H_1$. Perform $m$ hypothesis tests; let the test statistics be i.i.d. random variables $T_1, \ldots, T_m$ such that $T_i \mid H_i \sim (1 - H_i) F_0 + H_i F_1$. That is, if $H_i = 0$ is true for test $i$ ($i = 1, \ldots, m$), then $T_i$ follows the null distribution $F_0$; while if $H_i = 1$ is true, then $T_i$ follows the alternative distribution $F_1$. Let $H_i \sim \mathrm{Bernoulli}(\pi_1)$, that is, for each test, $H_1$ is true with probability $\pi_1$ and $H_0$ is true with probability $\pi_0 = 1 - \pi_1$. Denote the critical region (the values of $T_i$ for which $H_0$ is rejected) at significance level $\alpha$ by $\Gamma_\alpha$. Let an experiment yield a value $t$ for the test statistic. The q-value of $t$ is formally defined as

$$q(t) = \inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} \mathrm{pFDR}(\Gamma_\alpha)$$

That is, the q-value is the infimum of the pFDR if $H_0$ is rejected for test statistics with values in $\Gamma_\alpha$. Equivalently, the q-value equals

$$q(t) = \inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} \Pr(H = 0 \mid T \in \Gamma_\alpha)$$

which is the infimum of the probability that $H_0$ is true given that $H_0$ is rejected (the false discovery rate). [1]

Relationship to the p-value

The p-value is defined as

$$p(t) = \inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} \Pr(T \in \Gamma_\alpha \mid H = 0)$$

the infimum of the probability that $H_0$ is rejected given that $H_0$ is true (the false positive rate). Comparing the definitions of the p- and q-values, it can be seen that the q-value is the minimum posterior probability that $H_0$ is true. [1]
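
To make the relationship concrete: under the two-groups model above, with independent tests, the pFDR over a fixed critical region can be rewritten as a posterior probability via Bayes' theorem. This expansion is a standard step, spelled out here for clarity rather than quoted from the cited sources:

$$\mathrm{pFDR}(\Gamma_\alpha) = \Pr(H = 0 \mid T \in \Gamma_\alpha) = \frac{\pi_0 \Pr(T \in \Gamma_\alpha \mid H = 0)}{\pi_0 \Pr(T \in \Gamma_\alpha \mid H = 0) + \pi_1 \Pr(T \in \Gamma_\alpha \mid H = 1)}$$

The numerator contains exactly the false positive rate that the p-value minimizes, reweighted by the prior probability $\pi_0$ and normalized by the overall probability of rejection.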

Interpretation

The q-value can be interpreted as the false discovery rate (FDR): the proportion of false positives among all positive results. Given a set of test statistics and their associated q-values, rejecting the null hypothesis for all tests whose q-value is less than or equal to some threshold $\alpha$ ensures that the expected value of the false discovery rate is $\alpha$. [6]
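
In practice, q-values are estimated from the observed p-values. The following is a minimal R sketch of a Storey-style computation: it estimates $\pi_0$ from the proportion of p-values above a single tuning parameter $\lambda$ (fixed at 0.5 here for simplicity; the reference implementation [7] instead fits a smoother across many values of $\lambda$), then converts each p-value into an estimated pFDR and enforces monotonicity with running minima. The function name and the fixed $\lambda$ are choices of this sketch, not part of the cited sources.

    # Minimal Storey-style q-value sketch (assumes a fixed lambda = 0.5;
    # the Bioconductor qvalue package estimates pi0 more carefully).
    storey_qvalues <- function(p, lambda = 0.5) {
      m <- length(p)
      # Estimate pi0: p-values above lambda are assumed to come mostly
      # from true null hypotheses, whose p-values are uniform.
      pi0 <- min(1, mean(p > lambda) / (1 - lambda))
      o <- order(p)                      # sort p-values ascending
      q <- pi0 * m * p[o] / seq_len(m)   # estimated pFDR at each threshold
      q <- rev(cummin(rev(q)))           # running minima keep q-values monotone
      q[order(o)]                        # return in the original order
    }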

Applications

Biology

Gene expression

Genome-wide analyses of differential gene expression involve simultaneously testing the expression of thousands of genes. Controlling the FWER (usually to 0.05) avoids excessive false positives (i.e. detecting differential expression in a gene that is not differentially expressed) but imposes a strict threshold for the p-value that results in many false negatives (many differentially expressed genes are overlooked). However, controlling the pFDR by selecting genes with significant q-values lowers the number of false negatives (increases the statistical power) while ensuring that the expected value of the proportion of false positives among all positive results is low (e.g. 5%). [6]

For example, suppose that among 10,000 genes tested, 1,000 are actually differentially expressed and 9,000 are not:

  • If we consider every gene with a p-value of less than 0.05 to be differentially expressed, we expect that 450 (5%) of the 9,000 genes that are not differentially expressed will appear to be differentially expressed (450 false positives).
  • If we control the FWER to 0.05, there is only a 5% probability of obtaining at least one false positive. However, this very strict criterion will reduce the power such that few of the 1,000 genes that are actually differentially expressed will appear to be differentially expressed (many false negatives).
  • If we control the pFDR to 0.05 by considering all genes with a q-value of less than 0.05 to be differentially expressed, then we expect 5% of the positive results to be false positives (e.g. 900 true positives, 45 false positives, 100 false negatives, 8,955 true negatives). This strategy enables one to obtain relatively low numbers of both false positives and false negatives.
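
The third scenario can be illustrated with a small R simulation. The setup below is an assumption of this sketch: the 9,000 null genes receive uniform p-values, and the 1,000 differentially expressed genes receive p-values drawn from a Beta(0.1, 1) distribution, an arbitrary choice that concentrates them near zero, so the resulting counts approximate rather than reproduce the stylized numbers above. It reuses the storey_qvalues function sketched in the Interpretation section.

    set.seed(1)
    p <- c(runif(9000), rbeta(1000, 0.1, 1))      # null then alternative p-values
    truth <- rep(c(FALSE, TRUE), c(9000, 1000))   # TRUE = differentially expressed
    q <- storey_qvalues(p)                        # sketch from the section above
    sig <- q <= 0.05                              # reject at a q-value threshold of 0.05
    c(positives       = sum(sig),
      false_positives = sum(sig & !truth),
      observed_FDR    = sum(sig & !truth) / max(1, sum(sig)))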

Implementations

Note: the following is an incomplete list.

  • R: the qvalue package, available through Bioconductor [7]
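
A hypothetical usage sketch of the qvalue package follows; the components shown (qvalues, pi0) are part of the package's documented interface, but the input vector pvals is a placeholder for the user's own p-values:

    ## Install once from Bioconductor:
    ## install.packages("BiocManager"); BiocManager::install("qvalue")
    library(qvalue)

    qobj <- qvalue(p = pvals)    # pvals: a numeric vector of p-values
    head(qobj$qvalues)           # estimated q-values, one per test
    qobj$pi0                     # estimated proportion of true null hypotheses
    sum(qobj$qvalues <= 0.05)    # discoveries at an expected pFDR of 5%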


References

  1. Storey, John D. (2002). "A direct approach to false discovery rates". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 64 (3): 479–498. doi:10.1111/1467-9868.00346.
  2. Storey, John D. (2003). "The positive false discovery rate: a Bayesian interpretation and the q-value". The Annals of Statistics. 31 (6): 2013–2035. doi:10.1214/aos/1074290335.
  3. Benjamini, Yoav; Hochberg, Yosef (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing". Journal of the Royal Statistical Society, Series B (Methodological). 57 (1): 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x.
  4. Nuzzo, Regina (2014). "Scientific method: Statistical errors". Nature. Retrieved 5 March 2019.
  5. Storey, John D. (2002). "A direct approach to false discovery rates". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 64 (3): 479–498. doi:10.1111/1467-9868.00346.
  6. Storey, John D.; Tibshirani, Robert (2003). "Statistical significance for genomewide studies". PNAS. 100 (16): 9440–9445. doi:10.1073/pnas.1530509100. PMC 170937. PMID 12883005.
  7. Storey, John D.; Bass, Andrew J.; Dabney, Alan; Robinson, David; Warnes, Gregory (2019). "qvalue: Q-value estimation for false discovery rate control". Bioconductor.