Fisher's method

[Figure: Fisher-method-fused-P.svg] Under Fisher's method, two small p-values P1 and P2 combine to form a smaller p-value. The darkest boundary defines the region where the meta-analysis p-value is below 0.05. For example, if both p-values are around 0.10, or if one is around 0.04 and one is around 0.25, the meta-analysis p-value is around 0.05.

In statistics, Fisher's method,[1][2] also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" (analysis of analyses). It was developed by and named for Ronald Fisher. In its basic form, it is used to combine the results from several independent tests bearing upon the same overall hypothesis (H0).


Application to independent test statistics

Fisher's method combines extreme value probabilities from each test, commonly known as "p-values", into one test statistic (X^2) using the formula

X^2 = -2 \sum_{i=1}^{k} \ln(p_i),

where p_i is the p-value for the i-th hypothesis test. When the p-values tend to be small, the test statistic X^2 will be large, which suggests that the null hypotheses are not true for every test.

When all the null hypotheses are true, and the p_i (or their corresponding test statistics) are independent, X^2 has a chi-squared distribution with 2k degrees of freedom, where k is the number of tests being combined. This fact can be used to determine the p-value for X^2.
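As a concrete illustration, here is a minimal Python sketch of the computation, assuming numpy and scipy are available (scipy.stats.combine_pvalues implements the same test under its default method="fisher"):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method."""
    p = np.asarray(pvalues, dtype=float)
    x2 = -2.0 * np.sum(np.log(p))   # X^2 = -2 * sum(ln p_i)
    df = 2 * len(p)                 # 2k degrees of freedom
    return x2, chi2.sf(x2, df)      # sf(x, df) = 1 - CDF gives the p-value

# Example from the figure caption: p-values of about 0.04 and 0.25
# combine to a meta-analysis p-value of roughly 0.05.
print(fisher_combine([0.04, 0.25]))   # X^2 ~ 9.21, p ~ 0.056
```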

The distribution of X^2 is a chi-squared distribution for the following reason: under the null hypothesis for test i, the p-value p_i follows a uniform distribution on the interval [0,1]. The negative logarithm of a uniformly distributed value follows an exponential distribution. Scaling a value that follows an exponential distribution by a factor of two yields a quantity that follows a chi-squared distribution with two degrees of freedom. Finally, the sum of k independent chi-squared values, each with two degrees of freedom, follows a chi-squared distribution with 2k degrees of freedom.
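This chain of facts is easy to verify numerically; a quick Monte Carlo sketch (again assuming numpy and scipy):

```python
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(0)
k = 5                                  # number of tests combined
u = rng.uniform(size=(100_000, k))     # null p-values are U(0, 1)
x2 = -2.0 * np.log(u).sum(axis=1)      # each -2 ln(U) is chi2 with 2 df

# Kolmogorov-Smirnov comparison against chi2 with 2k df shows no
# detectable mismatch (large KS p-value).
print(kstest(x2, chi2(df=2 * k).cdf))
```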

Limitations of independence assumption

Dependence among statistical tests is generally positive, which means that the p-value of X^2 is too small (anti-conservative) if the dependency is not taken into account. Thus, if Fisher's method for independent tests is applied in a dependent setting, and the p-value is not small enough to reject the null hypothesis, then that conclusion will continue to hold even if the dependence is not properly accounted for. However, if positive dependence is not accounted for, and the meta-analysis p-value is found to be small, the evidence against the null hypothesis is generally overstated. The mean false discovery rate, α(k + 1)/(2k) (α reduced for k independent or positively correlated tests), may suffice to control alpha for useful comparison to an over-small p-value from Fisher's X^2.
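The anti-conservative behaviour can be seen by simulation: apply the independence version of Fisher's method to positively correlated test statistics and measure the realized rejection rate under the global null. A sketch, assuming equicorrelated normal statistics (the correlation 0.5 is purely illustrative):

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(1)
k, rho, n_sim = 5, 0.5, 100_000

# Equicorrelated standard normal test statistics under the global null.
cov = rho * np.ones((k, k)) + (1 - rho) * np.eye(k)
z = rng.multivariate_normal(np.zeros(k), cov, size=n_sim)
p = norm.sf(z)                          # one-sided p-values

x2 = -2.0 * np.log(p).sum(axis=1)
p_fisher = chi2.sf(x2, 2 * k)

# The nominal level is 0.05; with rho = 0.5 the realized rate is well above it.
print((p_fisher < 0.05).mean())
```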

Extension to dependent test statistics

In cases where the tests are not independent, the null distribution of X^2 is more complicated. A common strategy is to approximate the null distribution by a scaled χ²-distributed random variable. Different approaches may be used depending on whether or not the covariance between the different p-values is known.

Brown's method [3] can be used to combine dependent p-values whose underlying test statistics have a multivariate normal distribution with a known covariance matrix. Kost's method [4] extends Brown's to allow one to combine p-values when the covariance matrix is known only up to a scalar multiplicative factor.
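The scaled chi-squared approximation itself is straightforward once the covariances of the −2 ln(p_i) terms are in hand; Brown's and Kost's papers supply approximations for computing those covariances from the correlations of the underlying normal statistics. A minimal sketch of the final step, with the covariance matrix cov_log treated as an assumed input:

```python
import numpy as np
from scipy.stats import chi2

def scaled_chi2_combine(pvalues, cov_log):
    """Brown-style scaled chi-squared approximation for dependent p-values.

    cov_log: k x k covariance matrix of the -2*ln(p_i) terms
    (each diagonal entry is 4, the variance of a chi2 with 2 df).
    """
    p = np.asarray(pvalues, dtype=float)
    cov_log = np.asarray(cov_log, dtype=float)
    x2 = -2.0 * np.sum(np.log(p))
    mean = 2.0 * len(p)                # E[X^2] under the null
    var = cov_log.sum()                # Var[X^2], including all covariances
    c = var / (2.0 * mean)             # scale factor
    f = 2.0 * mean ** 2 / var          # effective degrees of freedom
    return chi2.sf(x2 / c, f)          # X^2 ~ c * chi2(f), approximately
```

With a diagonal cov_log (independent tests) this reduces to c = 1 and f = 2k, recovering Fisher's method exactly.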

The harmonic mean p-value offers an alternative to Fisher's method for combining p-values when the dependency structure is unknown but the tests cannot be assumed to be independent. [5] [6]
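The unweighted harmonic mean p-value statistic is simple to compute; a sketch (note that the harmonic mean itself is only approximately valid as a p-value and is slightly anti-conservative, so Wilson's procedure calibrates it via the Landau distribution, as implemented in the R package harmonicmeanp):

```python
import numpy as np

def harmonic_mean_p(pvalues):
    """Unweighted harmonic mean of p-values (Wilson 2019).

    Approximately valid for small significance levels; exact
    calibration uses the Landau distribution.
    """
    p = np.asarray(pvalues, dtype=float)
    return len(p) / np.sum(1.0 / p)

print(harmonic_mean_p([0.04, 0.25]))   # about 0.069
```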

Interpretation

Fisher's method is typically applied to a collection of independent test statistics, usually from separate studies having the same null hypothesis. The meta-analysis null hypothesis is that all of the separate null hypotheses are true. The meta-analysis alternative hypothesis is that at least one of the separate alternative hypotheses is true.

In some settings, it makes sense to consider the possibility of "heterogeneity," in which the null hypothesis holds in some studies but not in others, or where different alternative hypotheses may hold in different studies. A common reason for the latter form of heterogeneity is that effect sizes may differ among populations. For example, consider a collection of medical studies looking at the risk of a high glucose diet for developing type II diabetes. Due to genetic or environmental factors, the true risk associated with a given level of glucose consumption may be greater in some human populations than in others.

In other settings, the alternative hypothesis is either universally false or universally true; there is no possibility of it holding in some settings but not in others. For example, consider several experiments designed to test a particular physical law. Any discrepancies among the results from separate studies or experiments must be due to chance, possibly driven by differences in power.

In the case of a meta-analysis using two-sided tests, it is possible to reject the meta-analysis null hypothesis even when the individual studies show strong effects in differing directions. In this case, we are rejecting the hypothesis that the null hypothesis is true in every study, but this does not imply that there is a uniform alternative hypothesis that holds across all studies. Thus, two-sided meta-analysis is particularly sensitive to heterogeneity in the alternative hypotheses. One-sided meta-analysis can detect heterogeneity in the effect magnitudes, but focuses on a single, pre-specified effect direction.

Relation to Stouffer's Z-score method

[Figure: Zlogp.pdf] The relationship between Fisher's method and Stouffer's method can be understood from the relationship between z and −log(p).

A closely related approach to Fisher's method is Stouffer's Z, based on Z-scores rather than p-values, allowing incorporation of study weights. It is named for the sociologist Samuel A. Stouffer. [7] If we let Z_i = \Phi^{-1}(1 - p_i), where Φ is the standard normal cumulative distribution function, then

Z = \frac{\sum_{i=1}^{k} Z_i}{\sqrt{k}}

is a Z-score for the overall meta-analysis. This Z-score is appropriate for one-sided right-tailed p-values; minor modifications can be made if two-sided or left-tailed p-values are being analysed. Specifically, if two-sided p-values are being analysed, p_i/2 is used in place of p_i, or 1 - p_i if left-tailed p-values are used. [8]

Since Fisher's method is based on the average of the -\log(p_i) values, and the Z-score method is based on the average of the Z_i values, the relationship between these two approaches follows from the relationship between z and -\log(p) = -\log(1 - \Phi(z)). For the normal distribution, these two values are not perfectly linearly related, but they follow a highly linear relationship over the range of Z-values most often observed, from 1 to 5. As a result, the power of the Z-score method is nearly identical to the power of Fisher's method.
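This near-linearity is easy to check numerically; a quick sketch:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(1, 5, 200)
neg_log_p = -np.log(norm.sf(z))   # -log(p), where p = 1 - Phi(z)

# Correlation over the commonly observed range z in [1, 5].
print(np.corrcoef(z, neg_log_p)[0, 1])   # close to 1 (about 0.99)
```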

One advantage of the Z-score approach is that it is straightforward to introduce weights. [9] [10] If the i-th Z-score is weighted by w_i, then the meta-analysis Z-score is

Z = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}},

which follows a standard normal distribution under the null hypothesis. While weighted versions of Fisher's statistic can be derived, the null distribution becomes a weighted sum of independent chi-squared statistics, which is less convenient to work with.
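A minimal sketch of the weighted Stouffer combination for one-sided right-tailed p-values (with equal weights it reduces to the unweighted Z-score method; scipy.stats.combine_pvalues offers the same calculation under method="stouffer"):

```python
import numpy as np
from scipy.stats import norm

def stouffer(pvalues, weights=None):
    """Weighted Stouffer combination of one-sided right-tailed p-values."""
    p = np.asarray(pvalues, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    z = norm.isf(p)                               # Z_i = Phi^{-1}(1 - p_i)
    z_meta = np.sum(w * z) / np.sqrt(np.sum(w ** 2))
    return z_meta, norm.sf(z_meta)                # combined Z and its p-value

# Example: weight the first study twice as heavily as the second.
print(stouffer([0.04, 0.25], weights=[2.0, 1.0]))
```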


References

  1. Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd (Edinburgh). ISBN 0-05-002170-2.
  2. Fisher, R. A. (1948). "Questions and answers #14". The American Statistician. 2 (5): 30–31. doi:10.2307/2681650. JSTOR 2681650.
  3. Brown, M. (1975). "A method for combining non-independent, one-sided tests of significance". Biometrics. 31 (4): 987–992. doi:10.2307/2529826. JSTOR 2529826.
  4. Kost, J.; McDermott, M. (2002). "Combining dependent P-values". Statistics & Probability Letters. 60 (2): 183–190. doi:10.1016/S0167-7152(02)00310-3.
  5. Good, I. J. (1958). "Significance tests in parallel and in series". Journal of the American Statistical Association. 53 (284): 799–813. doi:10.1080/01621459.1958.10501480. JSTOR 2281953.
  6. Wilson, D. J. (2019). "The harmonic mean p-value for combining dependent tests". Proceedings of the National Academy of Sciences USA. 116 (4): 1195–1200. Bibcode:2019PNAS..116.1195W. doi:10.1073/pnas.1814092116. PMC 6347718. PMID 30610179.
  7. Stouffer, S.A.; Suchman, E.A.; DeVinney, L.C.; Star, S.A.; Williams, R.M. Jr. (1949). The American Soldier, Vol. 1: Adjustment during Army Life. Princeton University Press, Princeton.
  8. "Testing two-tailed p-values using Stouffer's approach". stats.stackexchange.com. Retrieved 2015-09-14.
  9. Mosteller, F.; Bush, R.R. (1954). "Selected quantitative techniques". In Lindzey, G. (ed.). Handbook of Social Psychology, Vol. 1. Addison-Wesley, Cambridge, Mass. pp. 289–334.
  10. Liptak, T. (1958). "On the combination of independent tests" (PDF). Magyar Tud. Akad. Mat. Kutato Int. Kozl. 3: 171–197.
