ANOVA on ranks


In statistics, one purpose for the analysis of variance (ANOVA) is to analyze differences in means between groups. The test statistic, F, assumes independence of observations, homogeneous variances, and population normality. ANOVA on ranks is a procedure designed for situations in which the normality assumption has been violated.


Logic of the F test on means

The F statistic is a ratio of a numerator to a denominator. Consider randomly selected subjects that are subsequently randomly assigned to groups A, B, and C. If the null hypothesis is true, the variability (or sum of squares) of scores on some dependent variable will be the same within each group. Dividing the pooled within-group sum of squares by its degrees of freedom (the total number of subjects minus the number of groups) gives the denominator of the F ratio.

Treat the mean for each group as a score, and compute the variability (again, the sum of squares) of those three scores. Dividing this between-group sum of squares by its degrees of freedom (the number of groups minus one) gives the numerator of the F ratio.

If the null hypothesis is true, the sampling distribution of the F ratio depends only on the degrees of freedom of the numerator and the denominator.
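The two variance estimates described above can be computed directly. The following is an illustrative sketch with made-up group data; SciPy's f_oneway is used only to cross-check the hand computation.

```python
import numpy as np
from scipy import stats

def f_ratio(*groups):
    """One-way ANOVA F as a ratio of between- to within-group variance."""
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    k = len(groups)                  # number of groups
    n_total = all_scores.size

    # Denominator: within-group sum of squares / (N - k)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_within = n_total - k

    # Numerator: between-group sum of squares / (k - 1)
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    df_between = k - 1

    return (ss_between / df_between) / (ss_within / df_within)

# Made-up scores for three groups
a = np.array([4.0, 5.0, 6.0, 5.5])
b = np.array([5.0, 6.0, 7.0, 6.5])
c = np.array([8.0, 9.0, 10.0, 9.5])

f = f_ratio(a, b, c)
# Cross-check against SciPy's reference implementation
assert np.isclose(f, stats.f_oneway(a, b, c).statistic)
```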

Model a treatment applied to group A by increasing every score by X. (This model maintains the underlying assumption of homogeneous variances; in practice it is rare, if not impossible, for an increase of X in a group mean to occur via an increase of each member's score by X.) The treatment shifts group A's distribution X units in the positive direction, but has no impact on the variability within the group. The variability among the three group means, however, increases. If the treatment raises the F ratio far enough that it exceeds the threshold for a rare event (the alpha level), the ANOVA F test rejects the null hypothesis of equal means among the three groups in favor of the alternative hypothesis that at least one group has a larger mean (in this example, group A).
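This thought experiment can be sketched with a seeded simulation: adding a constant X to every score in group A leaves its within-group variance unchanged while inflating the F ratio. The sample sizes, distributions, and value of X below are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Three groups drawn from the same population (null hypothesis true)
a = rng.normal(50, 10, size=30)
b = rng.normal(50, 10, size=30)
c = rng.normal(50, 10, size=30)

f_null, _ = stats.f_oneway(a, b, c)

X = 15.0                 # hypothetical treatment effect added to every score in A
a_treated = a + X

f_shift, _ = stats.f_oneway(a_treated, b, c)

# Shifting every score leaves within-group variability unchanged ...
assert np.isclose(a.var(ddof=1), a_treated.var(ddof=1))
# ... but inflates the between-group variability, and hence F
assert f_shift > f_null
```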

Handling violation of population normality

Ranking is one of many procedures used to transform data that do not meet the assumptions of normality. Conover and Iman provided a review of the four main types of rank transformations (RT). [1] One method replaces each original data value by its rank (from 1 for the smallest to N for the largest). This rank-based procedure has been recommended as being robust to non-normal errors, resistant to outliers, and highly efficient for many distributions. It may result in a known statistic (e.g., in the two independent samples layout, ranking yields the Wilcoxon rank-sum / Mann–Whitney U test) and provides the robustness and increased statistical power that is sought. For example, Monte Carlo studies have shown that the rank transformation in the two independent samples t-test layout can be successfully extended to the one-way independent samples ANOVA, as well as to the two independent samples multivariate Hotelling's T2 layout. [2] Commercial statistical software packages (e.g., SAS) followed with recommendations to data analysts to run their data sets through a ranking procedure (e.g., PROC RANK) prior to conducting standard analyses using parametric procedures. [3] [4] [5]
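As a minimal sketch of the procedure (with made-up skewed samples): all N observations are ranked jointly, and the ordinary parametric ANOVA is then run on the ranks. In the one-way layout with no ties, the resulting F is an increasing function of the Kruskal–Wallis H, via F = H(N − k) / [(k − 1)(N − 1 − H)].

```python
import numpy as np
from scipy import stats

# Hypothetical skewed samples (exponential, so the normality assumption fails)
rng = np.random.default_rng(1)
a = rng.exponential(1.0, size=20)
b = rng.exponential(1.0, size=20)
c = rng.exponential(3.0, size=20)

# Rank all N observations together (1 = smallest, N = largest) ...
ranks = stats.rankdata(np.concatenate([a, b, c]))
ra, rb, rc = ranks[:20], ranks[20:40], ranks[40:]

# ... then run the ordinary parametric ANOVA on the ranks
f_ranks, p_ranks = stats.f_oneway(ra, rb, rc)

# In the one-way layout this is closely related to the Kruskal-Wallis test:
# with no ties, F = H * (N - k) / ((k - 1) * (N - 1 - H))
h, p_kw = stats.kruskal(a, b, c)
assert np.isclose(f_ranks, h * (60 - 3) / ((3 - 1) * (60 - 1 - h)))
```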

Failure of ranking in the factorial ANOVA and other complex layouts

ANOVA on ranks means that a standard analysis of variance is calculated on the rank-transformed data. Conducting factorial ANOVA on the ranks of original scores has also been suggested. [6] [7] [8] However, Monte Carlo studies [9] [10] [11] [12] and subsequent asymptotic studies [13] [14] found that the rank transformation is inappropriate for testing interaction effects in 4x3 and 2x2x2 factorial designs. As the number of non-null effects (main or interaction) and the magnitude of those effects increase, the Type I error rate rises, resulting in a complete failure of the statistic, with as high as a 100% probability of a false positive decision. Similarly, the rank transformation increasingly fails in the two dependent samples layout as the correlation between pretest and posttest scores increases. [15] The Type I error rate problem is exacerbated in the context of analysis of covariance, particularly as the correlation between the covariate and the dependent variable increases. [16]
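The mechanism of this failure can be illustrated with a hand-computed balanced 2x2 ANOVA (textbook sums-of-squares formulas; the effect sizes and sample size below are made up). The cell means are exactly additive, so there is no interaction in the raw data, but the nonlinear rank transformation destroys that additivity.

```python
import numpy as np
from scipy import stats

def twoway_f_interaction(y):
    """Interaction F for a balanced 2x2 layout.

    `y` has shape (2, 2, n): factor A level x factor B level x replicates.
    Standard textbook sums-of-squares formulas; illustrative sketch only.
    """
    n = y.shape[2]
    grand = y.mean()
    a_means = y.mean(axis=(1, 2))        # marginal means of factor A
    b_means = y.mean(axis=(0, 2))        # marginal means of factor B
    cell_means = y.mean(axis=2)

    # Interaction SS: deviation of cell means from additivity
    ss_ab = n * ((cell_means
                  - a_means[:, None] - b_means[None, :] + grand) ** 2).sum()
    ss_error = ((y - cell_means[:, :, None]) ** 2).sum()
    df_ab, df_error = 1, 2 * 2 * (n - 1)
    return (ss_ab / df_ab) / (ss_error / df_error)

# Hypothetical data: strong main effects of A and B, but NO interaction
rng = np.random.default_rng(2)
n = 10
effects = np.array([[0.0, 2.0], [2.0, 4.0]])   # exactly additive cell means
y = effects[:, :, None] + rng.normal(size=(2, 2, n))

f_raw = twoway_f_interaction(y)

# Rank-transform all scores jointly, then rerun the same factorial ANOVA
y_ranked = stats.rankdata(y.ravel()).reshape(y.shape)
f_rt = twoway_f_interaction(y_ranked)
```

Because ranking is a nonlinear, order-preserving compression of the scale, the ranked cell means are no longer additive even though the raw ones are; over many replications of a design like this, the interaction F computed on ranks rejects the (true) null interaction far more often than the nominal alpha, which is the inflation the cited Monte Carlo studies report.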

Transforming ranks

A variant of rank transformation is 'quantile normalization', in which a further transformation is applied to the ranks such that the resulting values have some defined distribution (often a normal distribution with a specified mean and variance). Further analyses of quantile-normalized data may then assume that distribution to compute significance values. However, two specific types of secondary transformations, the random normal scores and expected normal scores transformations, have been shown to greatly inflate Type I errors and severely reduce statistical power. [17]
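A sketch of one common convention follows: a van der Waerden-style normal-scores transformation that maps each rank r to the standard normal quantile at r/(n + 1). Other conventions exist (e.g., Blom's (r − 3/8)/(n + 1/4)), and this is illustrative rather than any particular package's implementation.

```python
import numpy as np
from scipy import stats

def quantile_normalize(x):
    """Map each value to the standard normal quantile of its rank.

    Uses the r/(n + 1) convention (van der Waerden-style normal scores).
    """
    r = stats.rankdata(x)
    return stats.norm.ppf(r / (len(x) + 1))

x = np.array([3.1, 47.0, 0.2, 8.8, 15.5, 1.9, 22.0])
z = quantile_normalize(x)

# The transformation preserves order and yields scores symmetric about zero
assert (np.argsort(z) == np.argsort(x)).all()
assert np.isclose(z.mean(), 0.0)
```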

Violating homoscedasticity

The ANOVA on ranks has never been recommended when the underlying assumption of homogeneous variances has been violated, either by itself or in conjunction with a violation of the assumption of population normality.[citation needed] In general, rank-based statistics become nonrobust with respect to Type I errors for departures from homoscedasticity even more quickly than the parametric counterparts that share the same assumption.[citation needed]
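A seeded Monte Carlo sketch of the kind of design used to study this: the null hypothesis of equal means is true, but variances and sample sizes differ. The distributions and sizes below are hypothetical, and the direction and magnitude of the distortion depend on how variances pair with sample sizes, so this illustrates the design of such a study rather than a general result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reps, alpha = 2000, 0.05
rej_rank = rej_t = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, size=10)   # small group, small variance
    b = rng.normal(0.0, 4.0, size=40)   # large group, large variance (means equal)
    if stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
        rej_rank += 1
    if stats.ttest_ind(a, b, equal_var=True).pvalue < alpha:
        rej_t += 1

rate_rank = rej_rank / reps   # empirical Type I error of the rank-sum test
rate_t = rej_t / reps         # empirical Type I error of the pooled-variance t-test
```

Under homoscedasticity both rates would sit near 0.05; in this configuration the pooled-variance t-test becomes conservative because the larger group carries the larger variance, and the rank-sum test's rejection rate also departs from the nominal level. Reversing which group has the larger variance reverses the direction of the distortion.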

Further information

Kepner and Wackerly summarized the literature in noting "by the late 1980s, the volume of literature on RT methods was rapidly expanding as new insights, both positive and negative, were gained regarding the utility of the method. Concerned that RT methods would be misused, Sawilowsky et al. (1989, p. 255) cautioned practitioners to avoid the use of these tests 'except in those specific situations where the characteristics of the tests are well understood'." [18] According to Hettmansperger and McKean, [19] "Sawilowsky (1990) [20] provides an excellent review of nonparametric approaches to testing for interaction" in ANOVA.

Notes

  1. Conover, W. J.; Iman, R. L. (1981). "Rank transformations as a bridge between parametric and nonparametric statistics". American Statistician. 35 (3): 124–129. doi:10.2307/2683975. JSTOR 2683975.
  2. Nanna, M. J. (2002). "Hotelling's T2 vs. the rank transformation with real Likert data". Journal of Modern Applied Statistical Methods. 1: 83–99. doi:10.22237/jmasm/1020255180.
  3. SAS Institute. (1985). SAS/STAT guide for personal computers (5th ed.). Cary, NC: Author.
  4. SAS Institute. (1987). SAS/STAT guide for personal computers (6th ed.). Cary, NC: Author.
  5. SAS Institute. (2008). SAS/STAT 9.2 user's guide: Introduction to nonparametric analysis. Cary, NC: Author.
  6. Conover, W. J.; Iman, R. L. (1976). "On some alternative procedures using ranks for the analysis of experimental designs". Communications in Statistics – Theory and Methods. A5 (14): 1349–1368. doi:10.1080/03610927608827447.
  7. Iman, R. L. (1974). "A power study of a rank transform for the two-way classification model when interactions may be present". Canadian Journal of Statistics. 2 (2): 227–239. doi:10.2307/3314695. JSTOR 3314695.
  8. Iman, R. L.; Conover, W. J. (1976). A comparison of several rank tests for the two-way layout (SAND76-0631). Albuquerque, NM: Sandia Laboratories.
  9. Sawilowsky, S. (1985). Robust and power analysis of the 2x2x2 ANOVA, rank transformation, random normal scores, and expected normal scores transformation tests. Unpublished doctoral dissertation, University of South Florida.
  10. Sawilowsky, S.; Blair, R. C.; Higgins, J. J. (1989). "An investigation of the type I error and power properties of the rank transform procedure in factorial ANOVA". Journal of Educational Statistics. 14 (3): 255–267. doi:10.2307/1165018. JSTOR 1165018.
  11. Blair, R. C.; Sawilowsky, S. S.; Higgins, J. J. (1987). "Limitations of the rank transform statistic in tests for interactions". Communications in Statistics – Simulation and Computation. B16 (4): 1133–1145. doi:10.1080/03610918708812642.
  12. Sawilowsky, S. (1990). "Nonparametric tests of interaction in experimental design". Review of Educational Research. 60 (1): 91–126. doi:10.3102/00346543060001091. S2CID 146336002.
  13. Thompson, G. L. (1991). "A note on the rank transform for interactions". Biometrika. 78 (3): 697–701. doi:10.1093/biomet/78.3.697.
  14. Thompson, G. L.; Ammann, L. P. (1989). "Efficiencies of the rank-transform in two-way models with no interaction". Journal of the American Statistical Association. 84 (405): 325–330. doi:10.1080/01621459.1989.10478773.
  15. Blair, R. C.; Higgins, J. J. (1985). "A comparison of the power of the paired samples rank transform statistic to that of Wilcoxon's signed ranks statistic". Journal of Educational and Behavioral Statistics. 10 (4): 368–383. doi:10.3102/10769986010004368. S2CID 121958144.
  16. Headrick, T. C. (1997). Type I error and power of the rank transform analysis of covariance (ANCOVA) in a 3 x 4 factorial layout. Unpublished doctoral dissertation, University of South Florida.
  17. Sawilowsky, S. (1985). "A comparison of random normal scores test under the F and Chi-square distributions to the 2x2x2 ANOVA test". Florida Journal of Educational Research. 27: 83–97.
  18. Kepner, J. L.; Wackerly, D. D. (1996). "On rank transformation techniques for balanced incomplete repeated-measures designs". Journal of the American Statistical Association. 91 (436): 1619–1625. doi:10.1080/01621459.1996.10476730. JSTOR 2291588.
  19. Hettmansperger, T. P.; McKean, J. W. (1998). Robust nonparametric statistical methods. Kendall's Library of Statistics. Vol. 5. London: Edward Arnold. ISBN 0-340-54937-8. MR 1604954.
  20. Sawilowsky, S. (1990). "Nonparametric tests of interaction in experimental design". Review of Educational Research. 60 (1): 91–126. doi:10.3102/00346543060001091. S2CID 146336002.
