Binomial test

In statistics, the binomial test is an exact test of the statistical significance of deviations from a theoretically expected distribution of observations into two categories using sample data.

Usage

The binomial test is useful to test hypotheses about the probability ($\pi$) of success:

$$H_0 \colon \pi = \pi_0$$

where $\pi_0$ is a user-defined value between 0 and 1.

If in a sample of size $n$ there are $k$ successes, while we expect $n\pi_0$, the formula of the binomial distribution gives the probability of finding this value:

$$\Pr(X = k) = \binom{n}{k} \pi_0^{k} (1 - \pi_0)^{n-k}$$

If the null hypothesis were correct, then the expected number of successes would be $n\pi_0$. We find our $p$-value for this test by considering the probability of seeing an outcome as, or more, extreme. For a one-tailed test, this is straightforward to compute. Suppose that we want to test if $\pi < \pi_0$. Then our $p$-value would be,

$$p = \sum_{i=0}^{k} \Pr(X = i) = \sum_{i=0}^{k} \binom{n}{i} \pi_0^{i} (1 - \pi_0)^{n-i}$$

An analogous computation can be done if we're testing if $\pi > \pi_0$, using the summation of the range from $k$ to $n$ instead.

Calculating a $p$-value for a two-tailed test is slightly more complicated, since a binomial distribution isn't symmetric if $\pi_0 \neq 0.5$. This means that we can't just double the $p$-value from the one-tailed test. Recall that we want to consider events that are as, or more, extreme than the one we've seen, so we should consider the probability that we would see an event that is as, or less, likely than $X = k$. Let $\mathcal{I} = \{\, i : \Pr(X = i) \le \Pr(X = k) \,\}$ denote all such events. Then the two-tailed $p$-value is calculated as,

$$p = \sum_{i \in \mathcal{I}} \Pr(X = i)$$
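
As a concrete illustration, here is a minimal sketch in Python of the three p-value computations above, assuming SciPy is available. The function name binomial_p_values is ours, not a standard API, and the two-tailed comparison of pmf values is kept deliberately simple (production implementations use a small relative tolerance for the floating-point comparison).

```python
import numpy as np
from scipy.stats import binom

def binomial_p_values(k, n, pi0):
    """Exact binomial-test p-values for k successes in n trials under H0: pi = pi0."""
    # One-tailed test of pi < pi0: sum Pr(X = i) for i = 0..k
    p_lower = binom.cdf(k, n, pi0)
    # One-tailed test of pi > pi0: sum Pr(X = i) for i = k..n
    p_upper = binom.sf(k - 1, n, pi0)
    # Two-tailed: sum Pr(X = i) over all outcomes no more likely than X = k
    pmf = binom.pmf(np.arange(n + 1), n, pi0)
    p_two = pmf[pmf <= binom.pmf(k, n, pi0)].sum()
    return p_lower, p_upper, p_two

print(binomial_p_values(7, 20, 0.25))  # e.g. 7 successes in 20 trials under pi0 = 0.25
```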

Common use

One common use of the binomial test is the case where the null hypothesis is that two categories occur with equal frequency ($\pi_0 = 0.5$), such as a coin toss. Tables are widely available giving the significance of the observed numbers of observations in the two categories for this case. However, as the example below shows, the binomial test is not restricted to this case.
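
For instance, testing a coin that lands heads 60 times in 100 tosses against $\pi_0 = 0.5$ (the counts here are invented for illustration) takes one call in Python with SciPy:

```python
from scipy.stats import binomtest

# 60 heads in 100 tosses of a supposedly fair coin; two-tailed by default
result = binomtest(60, n=100, p=0.5)
print(result.pvalue)  # about 0.057, so not significant at the 5% level
```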

When there are more than two categories and an exact test is required, the multinomial test, based on the multinomial distribution, must be used instead of the binomial test.[1]

Large samples

For large samples such as the example below, the binomial distribution is well approximated by convenient continuous distributions, and these are used as the basis for alternative tests that are much quicker to compute, such as Pearson's chi-squared test and the G-test. However, for small samples these approximations break down, and there is no alternative to the binomial test.

The most usual (and easiest) approximation is through the standard normal distribution, in which a z-test is performed of the test statistic $Z$, given by

$$Z = \frac{k - n\pi}{\sqrt{n\pi(1 - \pi)}}$$

where $k$ is the number of successes observed in a sample of size $n$ and $\pi$ is the probability of success according to the null hypothesis. An improvement on this approximation is possible by introducing a continuity correction:

$$Z = \frac{k - n\pi \pm \tfrac{1}{2}}{\sqrt{n\pi(1 - \pi)}}$$

where the sign of the half-unit correction is chosen so as to move $k$ toward the expected count $n\pi$.

For very large $n$, this continuity correction will be unimportant, but for intermediate values, where the exact binomial test is impractical to compute, it will yield a substantially more accurate result.
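
A sketch of this approximation in Python, assuming SciPy for the normal tail probability; the helper name z_test_counts is illustrative, not a library function.

```python
import math
from scipy.stats import norm

def z_test_counts(k, n, pi, continuity=True):
    """Normal approximation to the binomial test, optionally continuity-corrected."""
    mean = n * pi
    sd = math.sqrt(n * pi * (1 - pi))
    shift = 0.0
    if continuity and k != mean:
        shift = -0.5 if k > mean else 0.5  # move k half a unit toward n*pi
    z = (k - mean + shift) / sd
    p_two_tailed = 2 * norm.sf(abs(z))     # two-tailed p from the standard normal
    return z, p_two_tailed

print(z_test_counts(51, 235, 1/6))  # compare with the exact die example below
```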

In terms of a measured sample proportion $\hat{p}$, the null-hypothesis proportion $p_0$, and the sample size $n$, where $\hat{p} = k/n$ and $p_0 = \pi$, one may rearrange and write the z-test above as

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

by dividing both numerator and denominator by $n$, which is a form that may be more familiar to some readers.
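
The algebra can be checked mechanically: the proportion form below (a sketch; z_test_proportion is our name) returns the same statistic as the count form above without the continuity correction.

```python
import math

def z_test_proportion(k, n, p0):
    """z statistic in sample-proportion form: (p_hat - p0) / sqrt(p0 (1 - p0) / n)."""
    p_hat = k / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Equals (51 - 235/6) / sqrt(235 * (1/6) * (5/6)), the uncorrected count form
print(z_test_proportion(51, 235, 1/6))  # about 2.07
```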

Example

Suppose we have a board game that depends on the roll of one die and attaches special importance to rolling a 6. In a particular game, the die is rolled 235 times, and 6 comes up 51 times. If the die is fair, we would expect 6 to come up

$$235 \times \frac{1}{6} = 39.17$$

times. We have now observed that the number of 6s is higher than what we would expect on average by pure chance had the die been fair. But is the number sufficiently high for us to conclude anything about the fairness of the die? This question can be answered by the binomial test. Our null hypothesis is that the die is fair (the probability of each number coming up on the die is 1/6).

To find an answer to this question using the binomial test, we use the binomial distribution $B(n = 235, p = 1/6)$ with pmf

$$f(k; n, p) = \binom{n}{k} p^{k} (1 - p)^{n-k}.$$

As we have observed a value greater than the expected value, we could consider the probability of observing 51 or more 6s under the null hypothesis, which would constitute a one-tailed test (here we are essentially testing whether this die is biased towards generating more 6s than expected). In order to calculate the probability of 51 or more 6s in a sample of 235 under the null hypothesis, we add up the probabilities of getting exactly 51 6s, exactly 52 6s, and so on, up to the probability of getting exactly 235 6s:

$$\sum_{i=51}^{235} \binom{235}{i} \left(\frac{1}{6}\right)^{i} \left(\frac{5}{6}\right)^{235-i} = 0.02654$$

If we use a significance level of 5%, then this result (0.02654 < 0.05) gives us evidence significant enough to reject the null hypothesis that the die is fair.
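
This sum is tedious by hand but immediate with the survival function of the binomial distribution (a one-line check, assuming SciPy):

```python
from scipy.stats import binom

# P(X >= 51) = P(X > 50) for X ~ Binomial(235, 1/6)
print(binom.sf(50, 235, 1/6))  # about 0.02654
```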

Normally, when testing the fairness of a die, we are also interested in whether the die is biased towards generating fewer 6s than expected, and not only more 6s as in the one-tailed test above. In order to consider both biases, we use a two-tailed test. Note that to do this we cannot simply double the one-tailed p-value unless the probability of the event is 1/2, because the binomial distribution becomes asymmetric as that probability deviates from 1/2. There are two methods to define the two-tailed p-value. One method is to sum the probabilities of all outcomes whose deviation from the expected value, in either direction, is at least as large as the observed deviation. The probability of that occurring in our example is 0.0437. The second method involves computing the probability that the outcome is as unlikely as, or more unlikely than, the observed value, i.e. by comparing values of the probability mass function. This can create a subtle difference, but in this example both methods yield the same probability of 0.0437. In both cases, the two-tailed test reveals significance at the 5% level, indicating that the number of 6s observed was significantly different from the expected number for this die.
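
Both two-tailed constructions can be written out directly. The sketch below assumes SciPy and uses a plain <= comparison on pmf values, which suffices for this example; it should reproduce the quoted 0.0437 for each method.

```python
import numpy as np
from scipy.stats import binom

n, p, k = 235, 1/6, 51
i = np.arange(n + 1)
pmf = binom.pmf(i, n, p)
expected = n * p  # 39.17

# Method 1: outcomes whose deviation from the expected value is at least
# as large as the observed deviation |k - expected|
method1 = pmf[np.abs(i - expected) >= abs(k - expected)].sum()

# Method 2: outcomes no more likely than the observed one
method2 = pmf[pmf <= binom.pmf(k, n, p)].sum()

print(method1, method2)  # both about 0.0437
```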

In statistical software packages

Binomial tests are available in most software used for statistical purposes, e.g. the binom.test function in R and scipy.stats.binomtest in Python's SciPy.
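
For instance, the die example above can be reproduced in one call with SciPy (version 1.7 or later, where binomtest replaced the older binom_test); its default two-sided p-value uses the comparison-of-pmf-values method described above.

```python
from scipy.stats import binomtest

one_tailed = binomtest(51, n=235, p=1/6, alternative='greater')
two_tailed = binomtest(51, n=235, p=1/6, alternative='two-sided')
print(one_tailed.pvalue, two_tailed.pvalue)  # about 0.02654 and 0.0437
```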

References

  1. Howell, David C. (2007). Statistical Methods for Psychology (6th ed.). Belmont, CA: Thomson. ISBN 978-0495012870.