Sign test

The sign test is a statistical method to test for consistent differences between pairs of observations, such as the weight of subjects before and after treatment. Given pairs of observations (such as weight pre- and post-treatment) for each subject, the sign test determines if one member of the pair (such as pre-treatment) tends to be greater than (or less than) the other member of the pair (such as post-treatment).

The paired observations may be designated x and y. For comparisons of paired observations (x,y), the sign test is most useful if comparisons can only be expressed as x > y, x = y, or x < y. If, instead, the observations can be expressed as numeric quantities (x = 7, y = 18), or as ranks (rank of x = 1st, rank of y = 8th), then the paired t-test [1] or the Wilcoxon signed-rank test [2] will usually have greater power than the sign test to detect consistent differences.

If X and Y are quantitative variables, the sign test can be used to test the hypothesis that the difference between X and Y has zero median, assuming continuous distributions of the two random variables X and Y, in the situation when we can draw paired samples from X and Y. [3]

The sign test can also test if the median of a collection of numbers is significantly greater than or less than a specified value. For example, given a list of student grades in a class, the sign test can determine if the median grade is significantly different from, say, 75 out of 100.

The sign test is a non-parametric test which makes very few assumptions about the nature of the distributions under test – this means that it has very general applicability but may lack the statistical power of the alternative tests.

The two conditions for the paired-sample sign test are that a sample must be randomly selected from each population, and the samples must be dependent, or paired. Independent samples cannot be meaningfully paired. Since the test is nonparametric, the samples need not come from normally distributed populations. Also, the test works for left-tailed, right-tailed, and two-tailed tests.

Method

Let p = Pr(Y > X), and then test the null hypothesis H0: p = 0.50. In other words, the null hypothesis states that, for a random pair of measurements (xi, yi), xi and yi are equally likely to be the larger of the two.

To test the null hypothesis, independent pairs of sample data are collected from the populations {(x1, y1), (x2, y2), . . ., (xn, yn)}. Pairs for which there is no difference are omitted, which may leave a reduced sample of m pairs. [4]

Then let W be the number of pairs for which yi − xi > 0. Assuming that H0 is true, W follows a binomial distribution W ~ b(m, 0.5).
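The computation of W can be sketched in a few lines of Python (an illustration of the method above; the function names are not from the original text):

```python
from math import comb

def sign_test_statistic(xs, ys):
    """Count the pairs with y > x after omitting ties; returns (W, m)."""
    diffs = [y - x for x, y in zip(xs, ys) if y != x]  # pairs with no difference are omitted
    m = len(diffs)                                     # possibly reduced sample of m pairs
    w = sum(d > 0 for d in diffs)                      # W = number of pairs with y_i - x_i > 0
    return w, m

def null_pmf(k, m):
    """Pr(W = k) under H0, where W ~ b(m, 0.5)."""
    return comb(m, k) * 0.5 ** m
```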

Assumptions

Let Zi = Yi − Xi for i = 1, ... , n.

  1. The differences Zi are assumed to be independent.
  2. Each Zi comes from the same continuous population.
  3. The values Xi and Yi are measured on at least an ordinal scale, so the comparisons "greater than", "less than", and "equal to" are meaningful.

Significance testing

Since the test statistic is expected to follow a binomial distribution, the standard binomial test is used to calculate significance. The normal approximation to the binomial distribution can be used for large sample sizes, m > 25. [4]
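The normal approximation can be sketched as follows (a rough Python illustration using a 0.5 continuity correction, which is a common but not universal convention; the function name is illustrative):

```python
from math import erf, sqrt

def sign_test_normal_approx(w, m):
    """Two-sided p-value for W ~ b(m, 0.5) via the normal approximation
    (mean m/2, standard deviation sqrt(m)/2), with a 0.5 continuity
    correction; intended for large samples, m > 25."""
    z = (abs(w - m / 2) - 0.5) / (sqrt(m) / 2)
    upper_tail = 0.5 * (1 - erf(z / sqrt(2)))  # Pr(Z >= z) for a standard normal Z
    return 2 * upper_tail
```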

The left-tail value is computed by Pr(W ≤ w), which is the p-value for the alternative H1: p < 0.50. This alternative means that the X measurements tend to be higher.

The right-tail value is computed by Pr(W ≥ w), which is the p-value for the alternative H1: p > 0.50. This alternative means that the Y measurements tend to be higher.

For a two-sided alternative H1 the p-value is twice the smaller tail-value.
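The three tail values just described can be computed exactly from the binomial distribution; a minimal Python sketch (function name illustrative):

```python
from math import comb

def sign_test_pvalues(w, m):
    """Exact tail values for W ~ b(m, 0.5): left tail Pr(W <= w), right
    tail Pr(W >= w), and the two-sided p-value, which is twice the
    smaller tail value (capped at 1)."""
    pmf = [comb(m, k) * 0.5 ** m for k in range(m + 1)]
    left = sum(pmf[: w + 1])   # Pr(W <= w)
    right = sum(pmf[w:])       # Pr(W >= w)
    return left, right, min(1.0, 2 * min(left, right))
```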

Example of two-sided sign test for matched pairs

Zar gives the following example of the sign test for matched pairs. Data are collected on the length of the left hind leg and left foreleg for 10 deer. [5]

Deer   Hind leg length (cm)   Foreleg length (cm)   Difference
1      142                    138                   +
2      140                    136                   +
3      144                    147                   −
4      144                    139                   +
5      142                    143                   −
6      146                    141                   +
7      149                    143                   +
8      150                    145                   +
9      142                    136                   +
10     148                    146                   +

The null hypothesis is that there is no difference between the hind leg and foreleg length in deer. The alternative hypothesis is that there is a difference between hind leg length and foreleg length. This is a two-tailed test, rather than a one-tailed test: the alternative hypothesis is that hind leg length may be either greater than or less than foreleg length. A one-sided alternative would be that hind leg length is greater than foreleg length, so that the difference can only be in one direction.

There are n=10 deer. There are 8 positive differences and 2 negative differences. If the null hypothesis is true, that there is no difference in hind leg and foreleg lengths, then the expected number of positive differences is 5 out of 10. What is the probability that the observed result of 8 positive differences, or a more extreme result, would occur if there is no difference in leg lengths?

Because the test is two-sided, a result as extreme or more extreme than 8 positive differences includes the results of 8, 9, or 10 positive differences, and the results of 0, 1, or 2 positive differences. The probability of 8 or more positives among 10 deer or 2 or fewer positives among 10 deer is the same as the probability of 8 or more heads or 2 or fewer heads in 10 flips of a fair coin. The probabilities can be calculated using the binomial test, with the probability of heads = probability of tails = 0.5.

The two-sided probability of a result as extreme as 8 of 10 positive differences is the sum of these probabilities:

0.00098 + 0.00977 + 0.04395 + 0.04395 + 0.00977 + 0.00098 = 0.109375.

Thus, the probability of observing a result as extreme as 8 of 10 positive differences in leg lengths, if there is no difference in leg lengths, is p = 0.109375. The null hypothesis is not rejected at a significance level of p = 0.05. With a larger sample size, the evidence might be sufficient to reject the null hypothesis.
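The two-sided calculation for the deer data can be reproduced directly from the binomial probabilities (a short Python sketch of the arithmetic above):

```python
from math import comb

# Deer example: 8 positive differences among n = 10 pairs.
n = 10
pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
# As extreme or more extreme than 8 pluses: 8, 9, or 10 pluses, or 0, 1, or 2 pluses.
p_two_sided = sum(pmf[8:]) + sum(pmf[:3])
print(p_two_sided)  # 0.109375
```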

Because the observations can be expressed as numeric quantities (actual leg length), the paired t-test or Wilcoxon signed rank test will usually have greater power than the sign test to detect consistent differences. For this example, the paired t-test for differences indicates that there is a significant difference between hind leg length and foreleg length (p = 0.007).

If the observed result was 9 positive differences in 10 comparisons, the sign test would be significant. Only coin flips with 0, 1, 9, or 10 heads would be as extreme as or more extreme than the observed result.

The probability of a result as extreme as 9 of 10 positive differences is the sum of these probabilities:

0.00098 + 0.00977 + 0.00977 + 0.00098 = 0.0215.

In general, 8 of 10 positive differences is not significant (p = 0.11), but 9 of 10 positive differences is significant (p = 0.0215).

Example of one-sided sign test for matched pairs

Conover [6] gives the following example using a one-sided sign test for matched pairs. A manufacturer produces two products, A and B. The manufacturer wishes to know if consumers prefer product B over product A. A sample of 10 consumers are each given product A and product B, and asked which product they prefer.

The null hypothesis is that consumers do not prefer product B over product A. The alternative hypothesis is that consumers prefer product B over product A. This is a one-sided (directional) test.

At the end of the study, 8 consumers preferred product B, 1 consumer preferred product A, and one reported no preference.

The tie is excluded from the analysis, giving n = number of +'s and –'s = 8 + 1 = 9.

What is the probability of a result as extreme as 8 positives in favor of B in 9 pairs, if the null hypothesis is true, that consumers have no preference for B over A? This is the probability of 8 or more heads in 9 flips of a fair coin, and can be calculated using the binomial distribution with p(heads) = p(tails) = 0.5.

P(8 or 9 heads in 9 flips of a fair coin) = 0.0195. The null hypothesis is rejected, and the manufacturer concludes that consumers prefer product B over product A.
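The one-sided probability can likewise be checked with a short Python sketch:

```python
from math import comb

# Consumer preference example: 8 of the 9 non-tied responses favor product B.
n = 9
p_one_sided = sum(comb(n, k) * 0.5 ** n for k in range(8, n + 1))  # Pr(8 or 9 successes)
print(round(p_one_sided, 5))  # 0.01953
```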

Example of sign test for median of a single sample

Sprent [7] gives the following example of a sign test for a median. In a clinical trial, survival time (weeks) is collected for 10 subjects with non-Hodgkin's lymphoma. The exact survival time was not known for one subject who was still alive after 362 weeks, when the study ended. The subjects' survival times were

49, 58, 75, 110, 112, 132, 151, 276, 281, 362+

The plus sign indicates the subject was still alive at the end of the study. The researcher wished to determine if the median survival time was less than or greater than 200 weeks.

The null hypothesis is that median survival is 200 weeks. The alternative hypothesis is that median survival is not 200 weeks. This is a two-sided test: the alternative median may be greater than or less than 200 weeks.

If the null hypothesis is true, that the median survival is 200 weeks, then, in a random sample approximately half the subjects should survive less than 200 weeks, and half should survive more than 200 weeks. Observations below 200 are assigned a minus (−); observations above 200 are assigned a plus (+). For the subject survival times, there are 7 observations below 200 weeks (−) and 3 observations above 200 weeks (+) for the n=10 subjects.

Because any one observation is equally likely to be above or below the population median, the number of plus scores will have a binomial distribution with p = 0.5. What is the probability of a result as extreme as 7 in 10 subjects being below the median? This is exactly the same as the probability of a result as extreme as 7 heads in 10 tosses of a fair coin. Because this is a two-sided test, an extreme result can be either three or fewer heads or seven or more heads.

The probability of observing k heads in 10 tosses of a fair coin, with p(heads) = 0.5, is given by the binomial formula:

Pr(Number of heads = k) = Choose(10, k) × 0.5^10

The probability for each value of k is given in the table below.

k      0       1       2       3       4       5       6       7       8       9       10
Pr     0.0010  0.0098  0.0439  0.1172  0.2051  0.2461  0.2051  0.1172  0.0439  0.0098  0.0010

The probability of 0, 1, 2, 3, 7, 8, 9, or 10 heads in 10 tosses is the sum of their individual probabilities:

0.0010 + 0.0098 + 0.0439 + 0.1172 + 0.1172 + 0.0439 + 0.0098 + 0.0010 = 0.3438.

Thus, the probability of observing 3 or fewer plus signs or 7 or more plus signs in the survival data, if the median survival is 200 weeks, is 0.3438. The expected number of plus signs is 5 if the null hypothesis is true. Observing 3 or fewer, or 7 or more pluses is not significantly different from 5. The null hypothesis is not rejected. Because of the extremely small sample size, this sample has low power to detect a difference.
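The probabilities in the table and the two-sided sum can be reproduced in a few lines of Python (a sketch of the calculation above):

```python
from math import comb

# Median example: 3 pluses (survival above 200 weeks) among n = 10 subjects.
n = 10
pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]  # Pr(k heads in 10 tosses)
# Two-sided: 3 or fewer pluses, or 7 or more pluses.
p_two_sided = sum(pmf[:4]) + sum(pmf[7:])
print(p_two_sided)  # 0.34375, i.e. 0.3438 to four decimal places
```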

Software implementations

The sign test is a special case of the binomial test where the probability of success under the null hypothesis is p=0.5. Thus, the sign test can be performed using the binomial test, which is provided in most statistical software programs. On-line calculators for the sign test can be found by searching for "sign test calculator". Many websites offer the binomial test, but generally offer only a two-sided version.

Excel software for the sign test

A template for the sign test using Excel is available at http://www.real-statistics.com/non-parametric-tests/sign-test/

R software for the sign test

In R, the binomial test can be performed using the function binom.test().

The syntax for the function is

binom.test(x,n,p=0.5,alternative=c("two.sided","less","greater"),conf.level=0.95)

where

  x is the number of successes (for the sign test, the number of pluses),
  n is the number of trials (the number of pairs with a nonzero difference),
  p is the hypothesized probability of success under the null hypothesis (0.5 for the sign test),
  alternative specifies a two-sided, left-tailed ("less"), or right-tailed ("greater") test, and
  conf.level is the confidence level for the reported confidence interval.

Examples of the sign test using the R function binom.test

The sign test example from Zar [5] compared the length of hind legs and forelegs of deer. The hind leg was longer than the foreleg in 8 of 10 deer. Thus, there are x=8 successes in n=10 trials. The hypothesized probability of success (defined as hind leg longer than foreleg) is p = 0.5 under the null hypothesis that hind legs and forelegs do not differ in length. The alternative hypothesis is that hind leg length may be either greater than or less than foreleg length, which is a two-sided test, specified as alternative="two.sided".

The R command binom.test(x=8,n=10,p=0.5,alternative="two.sided") gives p=0.1094, as in the example.

The sign test example in Conover [6] examined consumer preference for product A vs. product B. The null hypothesis was that consumers do not prefer product B over product A. The alternative hypothesis was that consumers prefer product B over product A, a one-sided test. In the study, 8 of 9 consumers who expressed a preference preferred product B over product A.

The R command binom.test(x=8,n=9,p=0.5,alternative="greater") gives p=0.01953, as in the example.

History

Conover [6] and Sprent [7] describe John Arbuthnot's use of the sign test in 1710. Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710. In every year, the number of males born in London exceeded the number of females. If the null hypothesis of equal numbers of male and female births is true, the probability of the observed outcome is (1/2)^82, leading Arbuthnot to conclude that the probabilities of male and female births were not exactly equal.

For his publications in 1692 and 1710, Arbuthnot is credited with "… the first use of significance tests …" [8] , the first example of reasoning about statistical significance and moral certainty, [9] and "… perhaps the first published report of a nonparametric test …". [6]

Hald [9] further describes the impact of Arbuthnot's research.

"Nicholas Bernoulli (1710–1713) completes the analysis of Arbuthnot's data by showing that the larger part of the variation of the yearly number of male births can be explained as binomial with p = 18/35. This is the first example of fitting a binomial to data. Hence we here have a test of significance rejecting the hypothesis p = 0.5 followed by an estimation of p and a discussion of the goodness of fit …"

Relationship to other statistical tests

Wilcoxon signed-rank test

The sign test requires only that the observations in a pair be ordered, for example x > y. In some cases, the observations for all subjects can be assigned a rank value (1, 2, 3, ...). If the observations can be ranked, and each observation in a pair is a random sample from a symmetric distribution, then the Wilcoxon signed-rank test is appropriate. The Wilcoxon test will generally have greater power to detect differences than the sign test. The asymptotic relative efficiency of the sign test to the Wilcoxon signed rank test, under these circumstances, is 0.67. [6]

Paired t-test

If the paired observations are numeric quantities (such as the actual length of the hind leg and foreleg in the Zar example), and the differences between paired observations are random samples from a single normal distribution, then the paired t-test is appropriate. The paired t-test will generally have greater power to detect differences than the sign test. The asymptotic relative efficiency of the sign test to the paired t-test, under these circumstances, is 0.637. However, if the distribution of the differences between pairs is not normal, but instead is heavy-tailed (leptokurtic), the sign test can have more power than the paired t-test, with asymptotic relative efficiency of 2.0 relative to the paired t-test and 1.3 relative to the Wilcoxon signed rank test. [6]

McNemar's test

In some applications, the observations within each pair can only take the values 0 or 1. For example, 0 may indicate failure and 1 may indicate success. There are 4 possible pairs: {0,0}, {0,1}, {1,0}, and {1,1}. In these cases, the same procedure as the sign test is used, but is known as McNemar's test. [6]

Friedman test

Instead of paired observations such as (Product A, Product B), the data may consist of three or more levels (Product A, Product B, Product C). If the individual observations can be ordered in the same way as for the sign test, for example B > C > A, then the Friedman test may be used. [5]

Trinomial test

Bian, McAleer and Wong [10] proposed in 2011 a non-parametric test for paired data when there are many ties. They showed that their trinomial test is superior to the sign test in the presence of ties.


References

  1. Baguley, Thomas (2012), Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences, Palgrave Macmillan, p. 281, ISBN 9780230363557.
  2. Corder, Gregory W.; Foreman, Dale I. (2014), "3.6 Statistical Power", Nonparametric Statistics: A Step-by-Step Approach (2nd ed.), John Wiley & Sons, ISBN 9781118840429.
  3. "The Sign Test for a Median", STAT 415: Intro Mathematical Statistics, Penn State University.
  4. Mendenhall, W.; Wackerly, D.D.; Scheaffer, R.L. (1989), "15: Nonparametric statistics", Mathematical Statistics with Applications (4th ed.), PWS-Kent, pp. 674–679, ISBN 0-534-92026-8.
  5. Zar, Jerold H. (1999), "Chapter 24: More on Dichotomous Variables", Biostatistical Analysis (4th ed.), Prentice-Hall, pp. 516–570, ISBN 0-13-081542-X.
  6. Conover, W.J. (1999), "Chapter 3.4: The Sign Test", Practical Nonparametric Statistics (3rd ed.), Wiley, pp. 157–176, ISBN 0-471-16068-7.
  7. Sprent, P. (1989), Applied Nonparametric Statistical Methods (2nd ed.), Chapman & Hall, ISBN 0-412-44980-3.
  8. Bellhouse, P. (2001), "John Arbuthnot", in C.C. Heyde and E. Seneta (eds.), Statisticians of the Centuries, Springer, pp. 39–42, ISBN 0-387-95329-9.
  9. Hald, Anders (1998), "Chapter 4. Chance or Design: Tests of Significance", A History of Mathematical Statistics from 1750 to 1930, Wiley, p. 65.
  10. Bian, G.; McAleer, M.; Wong, W.K. (2011), "A trinomial test for paired data when there are many ties", Mathematics and Computers in Simulation, 81(6), pp. 1153–1160.