Family-wise error rate

In statistics, the family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors, when performing multiple hypothesis tests.

Familywise and Experimentwise Error Rates

In 1953, John Tukey developed the concept of a familywise error rate as the probability of making a Type I error among a specified group, or "family," of tests. [1] Ryan (1959) proposed the related concept of an experimentwise error rate, which is the probability of making a Type I error in a given experiment. [2] Hence, an experimentwise error rate is a familywise error rate where the family includes all the tests that are conducted within an experiment.

As Ryan (1959, Footnote 3) explained, an experiment may contain two or more families of multiple comparisons, each of which relates to a particular statistical inference and each of which has its own separate familywise error rate. [2] Hence, familywise error rates are usually based on theoretically informative collections of multiple comparisons. In contrast, an experimentwise error rate may be based on a collection of simultaneous comparisons that refer to a diverse range of separate inferences. Some have argued that it may not be useful to control the experimentwise error rate in such cases. [3] Indeed, Tukey suggested that familywise control was preferable in such cases (Tukey, 1956, personal communication, in Ryan, 1962, p. 302). [4]

Background

Within the statistical framework, there are several definitions for the term "family". A set of inferences may be regarded as a family for either of the following reasons:

  1. To take into account the selection effect due to data dredging
  2. To ensure simultaneous correctness of a set of inferences as to guarantee a correct overall decision

To summarize, a family could best be defined by the potential selective inference that is being faced: A family is the smallest set of items of inference in an analysis, interchangeable about their meaning for the goal of research, from which selection of results for action, presentation or highlighting could be made (Yoav Benjamini).[ citation needed ]

Classification of multiple hypothesis tests

The following table defines the possible outcomes when testing multiple null hypotheses. Suppose we have a number $m$ of null hypotheses, denoted by $H_1, H_2, \ldots, H_m$. Using a statistical test, we reject the null hypothesis if the test is declared significant. We do not reject the null hypothesis if the test is non-significant. Summing each type of outcome over all $H_i$ yields the following random variables:

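| | Null hypothesis is true (H0) | Alternative hypothesis is true (HA) | Total |
|---|---|---|---|
| Test is declared significant | $V$ | $S$ | $R$ |
| Test is declared non-significant | $U$ | $T$ | $m - R$ |
| Total | $m_0$ | $m - m_0$ | $m$ |

Here $V$ is the number of false positives (type I errors), $S$ the number of true positives, $U$ the number of true negatives, $T$ the number of false negatives (type II errors), and $R = V + S$ the number of rejected null hypotheses. In $m$ hypothesis tests of which $m_0$ are true null hypotheses, $R$ is an observable random variable, and $S$, $T$, $U$, and $V$ are unobservable random variables.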

Definition

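The FWER is the probability of making at least one type I error in the family,

$$\mathrm{FWER} = \Pr(V \ge 1),$$

or equivalently,

$$\mathrm{FWER} = 1 - \Pr(V = 0).$$

Thus, by assuring $\mathrm{FWER} \le \alpha$, the probability of making one or more type I errors in the family is controlled at level $\alpha$.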
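
To see why uncorrected testing inflates the FWER, consider the special case in which all null hypotheses are true and the tests are independent, each run at level α. Then Pr(V = 0) = (1 − α)^m, so FWER = 1 − (1 − α)^m. The sketch below simply evaluates this expression; the function name `fwer_independent` is illustrative, and independence is an assumption of the example rather than part of the definition.

```python
def fwer_independent(m: int, alpha: float) -> float:
    """FWER for m independent level-alpha tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** m


# FWER grows quickly with the number of uncorrected tests.
for m in (1, 5, 10, 20):
    print(m, round(fwer_independent(m, 0.05), 4))
# 1 0.05
# 5 0.2262
# 10 0.4013
# 20 0.6415
```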

A procedure controls the FWER in the weak sense if the FWER control at level $\alpha$ is guaranteed only when all null hypotheses are true (i.e. when $m_0 = m$, meaning the "global null hypothesis" is true). [5]

A procedure controls the FWER in the strong sense if the FWER control at level $\alpha$ is guaranteed for any configuration of true and non-true null hypotheses (whether the global null hypothesis is true or not). [5]

Controlling procedures

Several classical procedures, as well as some newer ones, ensure strong level $\alpha$ FWER control.

The Bonferroni procedure
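
The Bonferroni procedure is commonly stated as: reject $H_i$ whenever its p-value satisfies $p_i \le \alpha/m$. The following is a minimal sketch of that rule; the function name `bonferroni_reject` and the example p-values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where H_i is rejected."""
    m = len(p_values)
    threshold = alpha / m  # every test uses the same corrected level
    return [p <= threshold for p in p_values]


# With m = 4 and alpha = 0.05 the per-test threshold is 0.0125.
print(bonferroni_reject([0.001, 0.012, 0.03, 0.2]))
# [True, True, False, False]
```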

The Šidák procedure
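
The Šidák procedure is commonly stated as: reject $H_i$ whenever $p_i \le 1 - (1 - \alpha)^{1/m}$. This threshold makes the FWER exactly $\alpha$ when all nulls are true and the tests are independent, and it is slightly larger than the Bonferroni threshold $\alpha/m$. A minimal sketch, with illustrative function names and made-up p-values:

```python
def sidak_threshold(m, alpha=0.05):
    """Per-test level that gives FWER = alpha for m independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def sidak_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where H_i is rejected."""
    threshold = sidak_threshold(len(p_values), alpha)
    return [p <= threshold for p in p_values]


# For m = 4, alpha = 0.05 the threshold is about 0.01274 (Bonferroni: 0.0125),
# so a p-value of 0.0126 is rejected here but not by Bonferroni.
print(sidak_reject([0.001, 0.0126, 0.03, 0.2]))
# [True, True, False, False]
```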

Tukey's procedure

Holm's step-down procedure (1979)

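Holm's procedure orders the p-values from lowest to highest, $P_{(1)} \le \ldots \le P_{(m)}$, and rejects $H_{(1)}, \ldots, H_{(k-1)}$, where $k$ is the smallest index such that $P_{(k)} > \alpha/(m + 1 - k)$. This procedure is uniformly more powerful than the Bonferroni procedure. [6] It controls the family-wise error rate for all the m hypotheses at level α in the strong sense because it is a closed testing procedure in which each intersection is tested using the simple Bonferroni test.[ citation needed ]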
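
A minimal sketch of the step-down rule follows: the sorted p-values are compared with $\alpha/m, \alpha/(m-1), \ldots, \alpha$ in turn, and testing stops at the first p-value that exceeds its threshold. The function name `holm_reject` and the example p-values are illustrative only.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: True where H_i is rejected (original order kept)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Sorted p-values 0.001, 0.015, 0.03, 0.2 face thresholds 0.0125, 0.0167, 0.025, 0.05.
print(holm_reject([0.03, 0.001, 0.2, 0.015]))
# [False, True, False, True]
```

In this example Bonferroni would reject only the hypothesis with p = 0.001, since 0.015 exceeds 0.05/4 = 0.0125, whereas Holm also rejects the one with p = 0.015.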

Hochberg's step-up procedure

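Hochberg's step-up procedure (1988) is performed using the following steps: [7]

  1. Order the p-values from lowest to highest, $P_{(1)} \le \ldots \le P_{(m)}$, and let the associated hypotheses be $H_{(1)}, \ldots, H_{(m)}$.
  2. For a given $\alpha$, let $R$ be the largest $k$ such that $P_{(k)} \le \alpha/(m - k + 1)$.
  3. Reject the null hypotheses $H_{(1)}, \ldots, H_{(R)}$.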

Hochberg's procedure is more powerful than Holm's. Nevertheless, while Holm’s is a closed testing procedure (and thus, like Bonferroni, has no restriction on the joint distribution of the test statistics), Hochberg’s is based on the Simes test, so it holds only under non-negative dependence.[ citation needed ] The Simes test is derived under the assumption of independent tests; [8] it is conservative for tests that are positively dependent in a certain sense [9] [10] and is anti-conservative for certain cases of negative dependence. [11] [12] However, it has been suggested that a modified version of the Hochberg procedure remains valid under general negative dependence. [13]
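
A minimal sketch of the step-up rule: starting from the largest p-value, find the largest rank $k$ with $P_{(k)} \le \alpha/(m - k + 1)$ and reject the $k$ hypotheses with the smallest p-values. The function name `hochberg_reject` and the example p-values are illustrative, and the sketch does not check the dependence conditions discussed above.

```python
def hochberg_reject(p_values, alpha=0.05):
    """Hochberg step-up: True where H_i is rejected (original order kept)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    n_reject = 0
    for k in range(m, 0, -1):  # k = m, m-1, ..., 1 (largest p-value first)
        if p_values[order[k - 1]] <= alpha / (m - k + 1):
            n_reject = k  # largest k meeting its threshold
            break
    reject = [False] * m
    for i in order[:n_reject]:
        reject[i] = True
    return reject


# The largest p-value, 0.04, already meets its threshold alpha/1 = 0.05,
# so all four hypotheses are rejected.
print(hochberg_reject([0.03, 0.001, 0.04, 0.015]))
# [True, True, True, True]
```

On the same p-values, Holm's step-down rule stops at 0.03 > 0.05/2 and rejects only two hypotheses, which illustrates the power difference.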

Dunnett's correction

Charles Dunnett (1955, 1966) described an alternative alpha error adjustment when k groups are compared to the same control group. Now known as Dunnett's test, this method is less conservative than the Bonferroni adjustment.[ citation needed ]

Scheffé's method

Resampling procedures

The procedures of Bonferroni and Holm control the FWER under any dependence structure of the p-values (or equivalently the individual test statistics). Essentially, this is achieved by accommodating a "worst-case" dependence structure (which is close to independence for most practical purposes). But such an approach is conservative if dependence is actually positive. To give an extreme example, under perfect positive dependence, there is effectively only one test and thus, the FWER is uninflated.

Accounting for the dependence structure of the p-values (or of the individual test statistics) produces more powerful procedures. This can be achieved by applying resampling methods, such as bootstrapping and permutation methods. The procedure of Westfall and Young (1993) requires a certain condition that does not always hold in practice (namely, subset pivotality). [14] The procedures of Romano and Wolf (2005a,b) dispense with this condition and are thus more generally valid. [15] [16]
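
To illustrate how resampling can use the dependence between tests, the sketch below computes single-step "max-T" adjusted p-values by permuting group labels in a two-sample comparison with several endpoints. It is a simplified illustration in the spirit of resampling-based adjustment, not the Westfall–Young or Romano–Wolf procedure as published. The function name, the two-group layout, the use of an unstandardized absolute difference in means as the test statistic, and the parameters `n_perm` and `seed` are all assumptions of this example.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def maxt_adjusted_p(group_a, group_b, n_perm=2000, seed=0):
    """Single-step max-T adjusted p-values from label permutations.

    group_a, group_b: lists of observations; each observation is a list
    holding one value per hypothesis (endpoint).
    """
    rng = random.Random(seed)
    data = group_a + group_b
    n_a = len(group_a)
    m = len(data[0])

    def stats(rows_a, rows_b):
        # Absolute difference in group means for each of the m endpoints.
        return [abs(mean([r[j] for r in rows_a]) - mean([r[j] for r in rows_b]))
                for j in range(m)]

    observed = stats(group_a, group_b)
    exceed = [0] * m
    for _ in range(n_perm):
        shuffled = data[:]
        rng.shuffle(shuffled)  # permute whole rows, so dependence is preserved
        max_stat = max(stats(shuffled[:n_a], shuffled[n_a:]))
        for j in range(m):
            if max_stat >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps each estimate strictly positive.
    return [(c + 1) / (n_perm + 1) for c in exceed]


# Endpoint 0 differs strongly between groups; endpoint 1 does not differ at all.
a = [[10.0 + 0.1 * i, 0.1 * i] for i in range(8)]
b = [[0.0 + 0.1 * i, 0.1 * i] for i in range(8)]
print(maxt_adjusted_p(a, b))
```

Because each hypothesis is compared with the permutation distribution of the maximum statistic, strongly correlated endpoints are penalized less than they would be under a Bonferroni-style correction.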

Harmonic mean p-value procedure

The harmonic mean p-value (HMP) procedure [17] [18] provides a multilevel test that improves on the power of Bonferroni correction by assessing the significance of groups of hypotheses while controlling the strong-sense family-wise error rate. The significance of any subset $\mathcal{R}$ of the $m$ tests is assessed by calculating the HMP for the subset, $\overset{\circ}{p}_{\mathcal{R}} = \frac{\sum_{i \in \mathcal{R}} w_i}{\sum_{i \in \mathcal{R}} w_i / p_i}$, where $w_1, \dots, w_m$ are weights that sum to one (i.e. $\sum_{i=1}^{m} w_i = 1$). An approximate procedure that controls the strong-sense family-wise error rate at level approximately $\alpha$ rejects the null hypothesis that none of the p-values in subset $\mathcal{R}$ are significant when $\overset{\circ}{p}_{\mathcal{R}} \le \alpha \, w_{\mathcal{R}}$ [19] (where $w_{\mathcal{R}} = \sum_{i \in \mathcal{R}} w_i$). This approximation is reasonable for small $\alpha$ (e.g. $\alpha < 0.05$) and becomes arbitrarily good as $\alpha$ approaches zero. An asymptotically exact test is also available (see main article).
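
The approximate rule can be sketched as follows: compute the weighted harmonic mean of the p-values in a subset and compare it with α times the total weight of that subset. The function names and example values are illustrative; the sketch assumes all p-values are strictly positive and implements only the approximate test, not the asymptotically exact one.

```python
def harmonic_mean_p(p_values, weights, subset):
    """Weighted harmonic mean p-value for the hypotheses indexed by `subset`."""
    w_r = sum(weights[i] for i in subset)
    return w_r / sum(weights[i] / p_values[i] for i in subset)


def hmp_reject_subset(p_values, weights, subset, alpha=0.05):
    """Approximate HMP test: reject when HMP <= alpha * (weight of the subset)."""
    w_r = sum(weights[i] for i in subset)
    return harmonic_mean_p(p_values, weights, subset) <= alpha * w_r


p = [0.001, 0.02, 0.5, 0.8]
w = [0.25, 0.25, 0.25, 0.25]  # equal weights summing to one

print(hmp_reject_subset(p, w, [0, 1, 2, 3]))  # True: HMP is about 0.0038 <= 0.05
print(hmp_reject_subset(p, w, [2, 3]))        # False: HMP is about 0.615 > 0.025
```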

Alternative approaches

FWER control exerts a more stringent control over false discovery compared to false discovery rate (FDR) procedures. FWER control limits the probability of at least one false discovery, whereas FDR control limits (in a loose sense) the expected proportion of false discoveries. Thus, FDR procedures have greater power at the cost of increased rates of type I errors, i.e., rejecting null hypotheses that are actually true. [20]

On the other hand, FWER control is less stringent than per-family error rate control, which limits the expected number of errors per family. Because FWER control is concerned with at least one false discovery, unlike per-family error rate control it does not treat multiple simultaneous false discoveries as any worse than one false discovery. The Bonferroni correction is often considered as merely controlling the FWER, but in fact also controls the per-family error rate. [21]
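
The relationship between the two error rates, and the claim about Bonferroni, can be made explicit with a short derivation. It assumes each true null has a valid p-value, i.e. $\Pr(p_i \le t) \le t$, and writes $I_0$ for the index set of the $m_0$ true null hypotheses (a symbol introduced here for the example). The first line is Markov's inequality applied to the count $V$; the second uses linearity of expectation, which requires no assumption about dependence.

```latex
\begin{align*}
\mathrm{FWER} &= \Pr(V \ge 1) \le \mathrm{E}[V] = \mathrm{PFER}, \\
\mathrm{PFER}_{\text{Bonferroni}} &= \mathrm{E}[V]
  = \sum_{i \in I_0} \Pr\!\left(p_i \le \frac{\alpha}{m}\right)
  \le m_0 \, \frac{\alpha}{m} \le \alpha .
\end{align*}
```

So any procedure that keeps the per-family error rate at or below $\alpha$ also keeps the FWER at or below $\alpha$, and the Bonferroni correction does both.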

References

  1. Tukey, J. W. (1953). The problem of multiple comparisons. Based on Tukey (1953),
  2. Ryan, Thomas A. (1959). "Multiple comparison in psychological research". Psychological Bulletin. 56 (1). American Psychological Association (APA): 26–47. doi:10.1037/h0042478. ISSN 1939-1455. PMID 13623958.
  3. Hochberg, Y.; Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley. p. 5. ISBN 978-0-471-82222-6.
  4. Ryan, T. A. (1962). "The experiment as the unit for computing rates of error". Psychological Bulletin. 59 (4): 301–305. doi:10.1037/h0040562. PMID 14495585.
  5. Dmitrienko, Alex; Tamhane, Ajit; Bretz, Frank (2009). Multiple Testing Problems in Pharmaceutical Statistics (1 ed.). CRC Press. p. 37. ISBN 9781584889847.
  6. Aickin, M; Gensler, H (1996). "Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods". American Journal of Public Health. 86 (5): 726–728. doi:10.2105/ajph.86.5.726. PMC 1380484. PMID 8629727.
  7. Hochberg, Yosef (1988). "A Sharper Bonferroni Procedure for Multiple Tests of Significance" (PDF). Biometrika. 75 (4): 800–802. doi:10.1093/biomet/75.4.800.
  8. Simes, R. J. (1986). "An improved Bonferroni procedure for multiple tests of significance". Biometrika. 73 (3): 751–754. doi:10.1093/biomet/73.3.751.
  9. Sarkar, Sanat K.; Chang, Chung-Kuei (1997). "The Simes method for multiple hypothesis testing with positively dependent test statistics". Journal of the American Statistical Association. 92 (440): 1601–1608. doi:10.1080/01621459.1997.10473682.
  10. Sarkar, Sanat K. (1998). "Some probability inequalities for ordered MTP2 random variables: a proof of the Simes conjecture". The Annals of Statistics. 26 (2): 494–504. doi:10.1214/aos/1028144846.
  11. Samuel-Cahn, Ester (1996). "Is the Simes improved Bonferroni procedure conservative?". Biometrika. 83 (4): 928–933. doi:10.1093/biomet/83.4.928.
  12. Block, Henry W.; Savits, Thomas H.; Wang, Jie (2008). "Negative dependence and the Simes inequality". Journal of Statistical Planning and Inference. 138 (12): 4107–4110. doi:10.1016/j.jspi.2008.03.026.
  13. Gou, Jiangtao; Tamhane, Ajit C. (2018). "Hochberg procedure under negative dependence" (PDF). Statistica Sinica. 28: 339–362. doi:10.5705/ss.202016.0306.
  14. Westfall, P. H.; Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: John Wiley. ISBN 978-0-471-55761-6.
  15. Romano, J.P.; Wolf, M. (2005a). "Exact and approximate stepdown methods for multiple hypothesis testing". Journal of the American Statistical Association. 100 (469): 94–108. doi:10.1198/016214504000000539. hdl:10230/576. S2CID 219594470.
  16. Romano, J.P.; Wolf, M. (2005b). "Stepwise multiple testing as formalized data snooping". Econometrica. 73 (4): 1237–1282. CiteSeerX 10.1.1.198.2473. doi:10.1111/j.1468-0262.2005.00615.x.
  17. Good, I J (1958). "Significance tests in parallel and in series". Journal of the American Statistical Association. 53 (284): 799–813. doi:10.1080/01621459.1958.10501480. JSTOR 2281953.
  18. Wilson, D J (2019). "The harmonic mean p-value for combining dependent tests". Proceedings of the National Academy of Sciences USA. 116 (4): 1195–1200. Bibcode:2019PNAS..116.1195W. doi:10.1073/pnas.1814092116. PMC 6347718. PMID 30610179.
  19. Sciences, National Academy of (2019-10-22). "Correction for Wilson, The harmonic mean p-value for combining dependent tests". Proceedings of the National Academy of Sciences. 116 (43): 21948. Bibcode:2019PNAS..11621948.. doi:10.1073/pnas.1914128116. PMC 6815184. PMID 31591234.
  20. Shaffer, J. P. (1995). "Multiple hypothesis testing". Annual Review of Psychology. 46: 561–584. doi:10.1146/annurev.ps.46.020195.003021. hdl:10338.dmlcz/142950.
  21. Frane, Andrew (2015). "Are per-family Type I error rates relevant in social and behavioral science?". Journal of Modern Applied Statistical Methods. 14 (1): 12–23. doi:10.22237/jmasm/1430453040.