F-test

Last updated January 06, 2024

An F-test is any statistical test used to compare the variances of two samples or the ratio of variances between multiple samples. The test statistic, random variable F, is used to determine if the tested data has an F-distribution under the true null hypothesis, and true customary assumptions about the error term (ε).^[1] It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Ronald Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.^[2]

Common examples

Common examples of the use of F-tests include the study of the following cases:

[Figure: One-way ANOVA table for 3 random groups of 30 observations each. The F value is calculated in the second-to-last column.]
The hypothesis that the means of a given set of normally distributed populations, all having the same standard deviation, are equal. This is perhaps the best-known F-test, and plays an important role in the analysis of variance (ANOVA).
- The F-test of analysis of variance (ANOVA) rests on three assumptions: normality of the populations, homogeneity of variance, and independence of the observations

The hypothesis that a proposed regression model fits the data well. See Lack-of-fit sum of squares.
The hypothesis that a data set in a regression analysis follows the simpler of two proposed linear models that are nested within each other.
Multiple-comparison testing, which reuses data from an already completed F-test, is conducted when that F-test rejects the null hypothesis, indicating that the factor under study has an impact on the dependent variable.^[1]
- "a priori comparisons" / "planned comparisons": a particular set of comparisons
- "pairwise comparisons": all possible comparisons
  - e.g. Fisher's least significant difference (LSD) test, Tukey's honestly significant difference (HSD) test, the Newman–Keuls test, Duncan's test
- "a posteriori comparisons" / "post hoc comparisons" / "exploratory comparisons": comparisons chosen after examining the data
  - e.g. Scheffé's method

F-test of the equality of two variances

The F-test is sensitive to non-normality.^[3]^[4] In the analysis of variance (ANOVA), alternative tests include Levene's test, Bartlett's test, and the Brown–Forsythe test. However, when any of these tests are conducted to test the underlying assumption of homoscedasticity (i.e. homogeneity of variance), as a preliminary step to testing for mean effects, there is an increase in the experiment-wise Type I error rate.^[5]
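The statistic behind this test can be sketched in a few lines. The example below assumes the standard form, the ratio of the two unbiased sample variances, which is referred to an F-distribution with (n₁ − 1, n₂ − 1) degrees of freedom when both samples are normal with equal population variances. The function name `variance_ratio_f` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import variance


def variance_ratio_f(sample_a, sample_b):
    """Return (F, d1, d2) for the ratio of two unbiased sample variances."""
    f_stat = variance(sample_a) / variance(sample_b)
    return f_stat, len(sample_a) - 1, len(sample_b) - 1


# Hypothetical measurements, made up for illustration.
a = [4.1, 5.3, 6.8, 3.9, 7.2, 5.5]
b = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8]
f_stat, d1, d2 = variance_ratio_f(a, b)
print(f"F = {f_stat:.2f} with ({d1}, {d2}) degrees of freedom")
```

Because the ratio is built directly from squared deviations, a few heavy-tailed observations can inflate it, which is one way to see the sensitivity to non-normality noted above.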

Formula and calculation

Most F-tests arise by considering a decomposition of the variability in a collection of data in terms of sums of squares. The test statistic in an F-test is the ratio of two scaled sums of squares reflecting different sources of variability. These sums of squares are constructed so that the statistic tends to be greater when the null hypothesis is not true. In order for the statistic to follow the F-distribution under the null hypothesis, the sums of squares should be statistically independent, and each should follow a scaled χ²-distribution. The latter condition is guaranteed if the data values are independent and normally distributed with a common variance.

One-way analysis of variance

The formula for the one-way ANOVA F-test statistic is

$$F={\frac {\text{explained variance}}{\text{unexplained variance}}},$$

or

$$F={\frac {\text{between-group variability}}{\text{within-group variability}}}.$$

The "explained variance", or "between-group variability" is

$$\sum _{i=1}^{K}n_{i}({\bar {Y}}_{i\cdot }-{\bar {Y}})^{2}/(K-1)$$

where ${\bar {Y}}_{i\cdot }$ denotes the sample mean in the i-th group, $n_{i}$ is the number of observations in the i-th group, ${\bar {Y}}$ denotes the overall mean of the data, and $K$ denotes the number of groups.

The "unexplained variance", or "within-group variability" is

$$\sum _{i=1}^{K}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\bar {Y}}_{i\cdot }\right)^{2}/(N-K),$$

where $Y_{ij}$ is the $j$-th observation in the $i$-th of the $K$ groups and $N$ is the overall sample size. This F-statistic follows the F-distribution with degrees of freedom $d_{1}=K-1$ and $d_{2}=N-K$ under the null hypothesis. The statistic will be large if the between-group variability is large relative to the within-group variability, which is unlikely to happen if the population means of the groups all have the same value.
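The two formulas above translate directly into code. The following is a minimal sketch using only the Python standard library; the function name `one_way_anova_f` and the three small groups are hypothetical and serve only to show how the between-group and within-group terms are assembled.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, d1, d2) for a list of groups of observations."""
    k = len(groups)                                # K, the number of groups
    n_total = sum(len(g) for g in groups)          # N, the overall sample size
    grand_mean = fmean([y for g in groups for y in g])
    group_means = [fmean(g) for g in groups]

    # Between-group ("explained") variability, divided by K - 1.
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    ) / (k - 1)

    # Within-group ("unexplained") variability, divided by N - K.
    within = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    ) / (n_total - k)

    return between / within, k - 1, n_total - k


# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, d1, d2 = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}, d1 = {d1}, d2 = {d2}")
```

The function returns the degrees of freedom alongside the statistic because both are needed to look up a critical value or p-value.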

The result of the F-test is determined by comparing the calculated F value with the critical F value at a chosen significance level (e.g. 5%). An F table lists critical values of the F-distribution under the assumption that the null hypothesis is true: each entry is the threshold that the F statistic exceeds only a controlled percentage of the time (e.g. 5%) when the null hypothesis holds. To locate the critical value, find the row and column corresponding to the two degrees of freedom in the table for the significance level being tested (e.g. 5%).^[6]

How to use critical F values:

If the F statistic < the critical F value

Fail to reject the null hypothesis
The data do not support the alternative hypothesis
There are no significant differences among the sample averages
The observed differences among the sample averages could reasonably be caused by random chance alone
The result is not statistically significant

If the F statistic > the critical F value

Accept the alternative hypothesis
Reject the null hypothesis
There are significant differences among the sample averages
The observed differences among the sample averages could not reasonably be caused by random chance alone
The result is statistically significant
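The decision rule can be illustrated without an F table in one special case. When $d_{1}=2$ (three groups in a one-way ANOVA), the upper tail probability of the F-distribution has the closed form $P(F>x)=(1+2x/d_{2})^{-d_{2}/2}$, which can be inverted for the critical value. The sketch below relies on that special case only; for other degrees of freedom a table or a statistics library (for example `scipy.stats.f.ppf`) would be needed. The function names are hypothetical.

```python
def f_critical_d1_2(d2, alpha=0.05):
    """Critical F value for (2, d2) degrees of freedom.

    Valid only for d1 = 2, where P(F > x) = (1 + 2x/d2) ** (-d2/2);
    solving that expression for x at tail probability alpha gives the result.
    """
    return (d2 / 2) * (alpha ** (-2 / d2) - 1)


def decide(f_stat, f_crit):
    """Apply the comparison of the F statistic against the critical value."""
    return "reject H0" if f_stat > f_crit else "fail to reject H0"


# Three groups of 30 observations each: d1 = 3 - 1 = 2, d2 = 90 - 3 = 87.
f_crit = f_critical_d1_2(87, alpha=0.05)
print(f"critical F at the 5% level: {f_crit:.2f}")
print(decide(f_stat=4.2, f_crit=f_crit))   # 4.2 is a made-up F statistic
```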

Note that when there are only two groups for the one-way ANOVA F-test, $F=t^{2}$ where t is the Student's $t$ statistic.
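This identity is easy to check numerically. The sketch below assumes the equal-variance (pooled) form of Student's $t$ statistic, which is the version for which the identity holds; the helper names and the two small samples are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def pooled_t(a, b):
    """Two-sample Student's t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (fmean(a) - fmean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def anova_f_two_groups(a, b):
    """One-way ANOVA F statistic for exactly two groups (K - 1 = 1)."""
    mean_a, mean_b, grand = fmean(a), fmean(b), fmean(a + b)
    between = len(a) * (mean_a - grand) ** 2 + len(b) * (mean_b - grand) ** 2
    within_ss = sum((y - mean_a) ** 2 for y in a) + sum((y - mean_b) ** 2 for y in b)
    return between / (within_ss / (len(a) + len(b) - 2))


# Hypothetical samples of unequal size.
a = [1, 2, 3]
b = [4, 5, 6, 7]
t = pooled_t(a, b)
f = anova_f_two_groups(a, b)
print(f"t^2 = {t ** 2:.6f}, F = {f:.6f}")
```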

Advantages

Multi-group Comparison Efficiency: Facilitating simultaneous comparison of multiple groups, enhancing efficiency particularly in situations involving more than two groups.
Clarity in Variance Comparison: Offering a straightforward interpretation of variance differences among groups, contributing to a clear understanding of the observed data patterns.
Versatility Across Disciplines: Demonstrating broad applicability across diverse fields, including social sciences, natural sciences, and engineering.

Disadvantages

Sensitivity to Assumptions: The F-test is highly sensitive to certain assumptions, such as homogeneity of variance and normality, and violations of them can affect the accuracy of the test results.
Limited Scope to Group Comparisons: The F-test is tailored for comparing variances between groups, making it less suitable for analyses beyond this specific scope.
Interpretation Challenges: The F-test does not pinpoint specific group pairs with distinct variances. Careful interpretation is necessary, and additional post hoc tests are often essential for a more detailed understanding of group-wise differences.

Multiple-comparison ANOVA problems

The F-test in one-way analysis of variance (ANOVA) is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. For example, suppose that a medical trial compares four treatments. The ANOVA F-test can be used to assess whether any of the treatments are on average superior, or inferior, to the others versus the null hypothesis that all four treatments yield the same mean response. This is an example of an "omnibus" test, meaning that a single test is performed to detect any of several possible differences. Alternatively, we could carry out pairwise tests among the treatments (for instance, in the medical trial example with four treatments we could carry out six tests among pairs of treatments). The advantage of the ANOVA F-test is that we do not need to pre-specify which treatments are to be compared, and we do not need to adjust for making multiple comparisons. The disadvantage of the ANOVA F-test is that if we reject the null hypothesis, we do not know which treatments can be said to be significantly different from the others, nor, if the F-test is performed at level α, can we state that the treatment pair with the greatest mean difference is significantly different at level α.
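The count of six pairwise tests in the four-treatment example can be reproduced by enumerating the pairs. The sketch below also shows one common way of adjusting for multiple comparisons, the Bonferroni correction, which divides the overall level α by the number of tests. That correction is an illustrative choice, not something the text above prescribes, and the treatment labels and function name are hypothetical.

```python
from itertools import combinations


def pairwise_plan(treatments, alpha=0.05):
    """Return all treatment pairs and a Bonferroni-adjusted per-test level."""
    pairs = list(combinations(treatments, 2))
    # Bonferroni adjustment (an assumption for illustration): alpha / number of tests.
    return pairs, alpha / len(pairs)


pairs, per_test_alpha = pairwise_plan(["A", "B", "C", "D"], alpha=0.05)
print(len(pairs), "pairwise tests:", pairs)
print(f"per-test level under Bonferroni: {per_test_alpha:.4f}")
```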

Regression problems

Consider two models, 1 and 2, where model 1 is 'nested' within model 2. Model 1 is the restricted model, and model 2 is the unrestricted one. That is, model 1 has p₁ parameters, and model 2 has p₂ parameters, where p₁ < p₂, and for any choice of parameters in model 1, the same regression curve can be achieved by some choice of the parameters of model 2.

One common context in this regard is that of deciding whether a model fits the data significantly better than does a naive model, in which the only explanatory term is the intercept term, so that all predicted values for the dependent variable are set equal to that variable's sample mean. The naive model is the restricted model, since the coefficients of all potential explanatory variables are restricted to equal zero.

Another common context is deciding whether there is a structural break in the data: here the restricted model uses all data in one regression, while the unrestricted model uses separate regressions for two different subsets of the data. This use of the F-test is known as the Chow test.

The model with more parameters will always be able to fit the data at least as well as the model with fewer parameters. Thus typically model 2 will give a better (i.e. lower error) fit to the data than model 1. But one often wants to determine whether model 2 gives a significantly better fit to the data. One approach to this problem is to use an F-test.

If there are n data points to estimate parameters of both models from, then one can calculate the F statistic, given by

$$F={\frac {\left({\frac {{\text{RSS}}_{1}-{\text{RSS}}_{2}}{p_{2}-p_{1}}}\right)}{\left({\frac {{\text{RSS}}_{2}}{n-p_{2}}}\right)}}={\frac {{\text{RSS}}_{1}-{\text{RSS}}_{2}}{{\text{RSS}}_{2}}}\cdot {\frac {n-p_{2}}{p_{2}-p_{1}}},$$

where ${\text{RSS}}_{i}$ is the residual sum of squares of model i. If the regression model has been calculated with weights, then replace ${\text{RSS}}_{i}$ with χ², the weighted sum of squared residuals. Under the null hypothesis that model 2 does not provide a significantly better fit than model 1, F will have an F-distribution with (p₂−p₁, n−p₂) degrees of freedom. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the F-distribution for some desired false-rejection probability (e.g. 0.05). Since F is a monotone function of the likelihood ratio statistic, the F-test is a likelihood ratio test.
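As a concrete instance of this formula, the sketch below compares the naive intercept-only model (p₁ = 1) with a straight-line least-squares fit (p₂ = 2). It uses only the Python standard library; the function names and the data points are hypothetical, and the closed-form slope and intercept are the usual ordinary least squares estimates for a single explanatory variable.

```python
from statistics import fmean


def rss_intercept_only(y):
    """RSS of the restricted model: every prediction equals the sample mean."""
    y_bar = fmean(y)
    return sum((yi - y_bar) ** 2 for yi in y)


def rss_simple_linear(x, y):
    """RSS of the unrestricted model y = intercept + slope * x, fitted by OLS."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def nested_f(rss1, rss2, p1, p2, n):
    """F statistic for nested models with p1 < p2 parameters and n data points."""
    return ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))


# Hypothetical data, made up for illustration.
x = [0, 1, 2, 3]
y = [0, 1, 1, 2]
f_stat = nested_f(rss_intercept_only(y), rss_simple_linear(x, y), p1=1, p2=2, n=len(y))
print(f"F = {f_stat:.2f} with ({2 - 1}, {len(y) - 2}) degrees of freedom")
```

Keeping `nested_f` separate from the fitting code means the same function applies to any pair of nested models once their residual sums of squares are available.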

References

[1] Berger, Paul D.; Maurer, Robert E.; Celli, Giovana B. (2018). Experimental Design. Cham: Springer International Publishing. p. 108. doi:10.1007/978-3-319-64583-4. ISBN 978-3-319-64582-7.
[2] Lomax, Richard G. (2007). Statistical Concepts: A Second Course. p. 10. ISBN 978-0-8058-5850-1.
[3] Box, G. E. P. (1953). "Non-Normality and Tests on Variances". Biometrika. 40 (3/4): 318–335. doi:10.1093/biomet/40.3-4.318. JSTOR 2333350.
[4] Markowski, Carol A.; Markowski, Edward P. (1990). "Conditions for the Effectiveness of a Preliminary Test of Variance". The American Statistician. 44 (4): 322–326. doi:10.2307/2684360. JSTOR 2684360.
[5] Sawilowsky, S. (2002). "Fermat, Schubert, Einstein, and Behrens–Fisher: The Probable Difference Between Two Means When σ₁² ≠ σ₂²". Journal of Modern Applied Statistical Methods. 1 (2): 461–472. doi:10.22237/jmasm/1036109940. Archived from the original on 2015-04-03. Retrieved 2015-03-30.
[6] Siegel, Andrew F. (2016-01-01), Siegel, Andrew F. (ed.), "Chapter 15 - ANOVA: Testing for Differences Among Many Samples and Much More", Practical Business Statistics (Seventh Edition), Academic Press, pp. 469–492, doi:10.1016/b978-0-12-804250-2.00015-8, ISBN 978-0-12-804250-2, retrieved 2023-12-10
