Closed testing procedure

Last updated

In statistics, the closed testing procedure [1] is a general method for performing more than one hypothesis test simultaneously.

Contents

The closed testing principle

Suppose there are k hypotheses H1,..., Hk to be tested and the overall type I error rate is α. The closed testing principle allows the rejection of any one of these elementary hypotheses, say Hi, if all possible intersection hypotheses involving Hi can be rejected by using valid local level α tests; the adjusted p-value is the largest among those hypotheses. It controls the family-wise error rate for all the k hypotheses at level α in the strong sense.

Example

Suppose there are three hypotheses H1,H2, and H3 to be tested and the overall type I error rate is 0.05. Then H1 can be rejected at level α if H1H2H3, H1H2, H1H3 and H1 can all be rejected using valid tests with level α.

Special cases

The Holm–Bonferroni method is a special case of a closed test procedure for which each intersection null hypothesis is tested using the simple Bonferroni test. As such, it controls the family-wise error rate for all the k hypotheses at level α in the strong sense.

Multiple test procedures developed using the graphical approach for constructing and illustrating multiple test procedures [2] are a subclass of closed testing procedures.

See also

Related Research Articles

A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. More generally, hypothesis testing allows us to make probabilistic statements about population parameters. Thus, it is one way of making decisions under uncertainty. Typically, hypothesis testing procedures involve a user selected tradeoff between false positives and false negatives.

<span class="mw-page-title-main">Kruskal–Wallis one-way analysis of variance</span> Non-parametric method for testing whether samples originate from the same distribution

The Kruskal–Wallis test by ranks, Kruskal–Wallis H test, or one-way ANOVA on ranks is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test, which is used for comparing only two groups. The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA).

<span class="mw-page-title-main">Data dredging</span> Misuse of data analysis

Data dredging is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.

The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used either to test the location of a population based on a sample of data, or to compare the locations of two populations using two matched samples. The one-sample version serves a purpose similar to that of the one-sample Student's t-test. For two matched samples, it is a paired difference test like the paired Student's t-test. The Wilcoxon test can be a good alternative to the t-test when population means are not of interest; for example, when one wishes to test whether a population's median is nonzero, or whether there is a better than 50% chance that a sample from one population is greater than a sample from another population.

In statistics, Duncan's new multiple range test (MRT) is a multiple comparison procedure developed by David B. Duncan in 1955. Duncan's MRT belongs to the general class of multiple comparison procedures that use the studentized range statistic qr to compare sets of means.

In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the expected proportion of "discoveries" that are false. Equivalently, the FDR is the expected ratio of the number of false positive classifications to the total number of positive classifications. The total number of rejections of the null include both the number of false positives (FP) and true positives (TP). Simply put, FDR = FP /. FDR-controlling procedures provide less stringent control of Type I errors compared to family-wise error rate (FWER) controlling procedures, which control the probability of at least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.

In statistics, family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors when performing multiple hypotheses tests.

In statistical hypothesis testing, a type I error is the mistaken rejection of a null hypothesis that is actually true. A type I error is also known as a "false positive" finding or conclusion; example: "an innocent person is convicted". A type II error is the failure to reject a null hypothesis that is actually false. A type II error is also known as a "false negative" finding or conclusion; example: "a guilty person is not convicted". Much of statistical theory revolves around the minimization of one or both of these errors, though the complete elimination of either is a statistical impossibility if the outcome is not determined by a known, observable causal process. By selecting a low threshold (cut-off) value and modifying the alpha (α) level, the quality of the hypothesis test can be increased. The knowledge of type I errors and type II errors is widely used in medical science, biometrics and computer science.

Omnibus tests are a kind of statistical test. They test whether the explained variance in a set of data is significantly greater than the unexplained variance, overall. One example is the F-test in the analysis of variance. There can be legitimate significant effects within a model even if the omnibus test is not significant. For instance, in a model with two independent variables, if only one variable exerts a significant effect on the dependent variable and the other does not, then the omnibus test may be non-significant. This fact does not affect the conclusions that may be drawn from the one significant variable. In order to test effects within an omnibus test, researchers often use contrasts.

In statistics, the Bonferroni correction is a method to counteract the multiple comparisons problem.

<span class="mw-page-title-main">Multiple comparisons problem</span> Statistical interpretation with many tests

In statistics, the multiple comparisons, multiplicity or multiple testing problem occurs when one considers a set of statistical inferences simultaneously or estimates a subset of parameters selected based on the observed values.

In statistics, the Holm–Bonferroni method, also called the Holm method or Bonferroni–Holm method, is used to counteract the problem of multiple comparisons. It is intended to control the family-wise error rate (FWER) and offers a simple test uniformly more powerful than the Bonferroni correction. It is named after Sture Holm, who codified the method, and Carlo Emilio Bonferroni.

Tukey's range test, also known as Tukey's test, Tukey method, Tukey's honest significance test, or Tukey's HSDtest, is a single-step multiple comparison procedure and statistical test. It can be used to find means that are significantly different from each other.

In statistics, when performing multiple comparisons, a false positive ratio is the probability of falsely rejecting the null hypothesis for a particular test. The false positive rate is calculated as the ratio between the number of negative events wrongly categorized as positive and the total number of actual negative events.

The Newman–Keuls or Student–Newman–Keuls (SNK)method is a stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other. It was named after Student (1927), D. Newman, and M. Keuls. This procedure is often used as a post-hoc test whenever a significant difference between three or more sample means has been revealed by an analysis of variance (ANOVA). The Newman–Keuls method is similar to Tukey's range test as both procedures use studentized range statistics. Unlike Tukey's range test, the Newman–Keuls method uses different critical values for different pairs of mean comparisons. Thus, the procedure is more likely to reveal significant differences between group means and to commit type I errors by incorrectly rejecting a null hypothesis when it is true. In other words, the Neuman-Keuls procedure is more powerful but less conservative than Tukey's range test.

In statistics, a false coverage rate (FCR) is the average rate of false coverage, i.e. not covering the true parameters, among the selected intervals.

In statistics, the Šidák correction, or Dunn–Šidák correction, is a method used to counteract the problem of multiple comparisons. It is a simple method to control the family-wise error rate. When all null hypotheses are true, the method provides familywise error control that is exact for tests that are stochastically independent, conservative for tests that are positively dependent, and liberal for tests that are negatively dependent. It is credited to a 1967 paper by the statistician and probabilist Zbyněk Šidák. The Šidák method can be used to determine the statistical significance, and evaluate adjusted P value and confidence intervals.

One of the application of Student's t-test is to test the location of one sequence of independent and identically distributed random variables. If we want to test the locations of multiple sequences of such variables, Šidák correction should be applied in order to calibrate the level of the Student's t-test. Moreover, if we want to test the locations of nearly infinitely many sequences of variables, then Šidák correction should be used, but with caution. More specifically, the validity of Šidák correction depends on how fast the number of sequences goes to infinity.

<i>q</i>-value (statistics) Statistical hypothesis testing measure

In statistical hypothesis testing, specifically multiple hypothesis testing, the q-value in the Storey-Tibshirani procedure provides a means to control the positive false discovery rate (pFDR). Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value.

The harmonic mean p-value(HMP) is a statistical technique for addressing the multiple comparisons problem that controls the strong-sense family-wise error rate (this claim has been disputed). It improves on the power of Bonferroni correction by performing combined tests, i.e. by testing whether groups of p-values are statistically significant, like Fisher's method. However, it avoids the restrictive assumption that the p-values are independent, unlike Fisher's method. Consequently, it controls the false positive rate when tests are dependent, at the expense of less power (i.e. a higher false negative rate) when tests are independent. Besides providing an alternative to approaches such as Bonferroni correction that controls the stringent family-wise error rate, it also provides an alternative to the widely-used Benjamini-Hochberg procedure (BH) for controlling the less-stringent false discovery rate. This is because the power of the HMP to detect significant groups of hypotheses is greater than the power of BH to detect significant individual hypotheses.

References

  1. Marcus, R; Peritz, E; Gabriel, KR (1976). "On closed testing procedures with special reference to ordered analysis of variance". Biometrika. 63 (3): 655–660. doi:10.1093/biomet/63.3.655. JSTOR   2335748.
  2. Bretz, F; Maurer, W; Brannath, W; Posch, M (2009). "A graphical approach to sequentially rejective multiple test procedures". Stat Med . 28 (4): 586–604. doi:10.1002/sim.3495. S2CID   12068118.