The two-proportion Z-test is a statistical method used to determine whether the difference between the proportions of two groups, each drawn from a binomial distribution, is statistically significant. [1] This approach relies on the assumption that the sample proportions are approximately normally distributed, as implied by the central limit theorem, which allows the construction of a z-test for hypothesis testing and of confidence intervals. It is used in various fields to compare success rates, response rates, or other proportions across different groups.
The z-test for comparing two proportions is a statistical hypothesis test for evaluating whether the proportion of a certain characteristic differs significantly between two independent samples. The test relies on the fact that each sample proportion is an average of observations from a Bernoulli distribution and is therefore asymptotically normal by the central limit theorem, which enables the construction of a z-test.
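The test involves two competing hypotheses:
- Null hypothesis ($H_0$): the proportions in the two populations are equal, $p_1 = p_2$.
- Alternative hypothesis ($H_1$): the proportions in the two populations are not equal, $p_1 \neq p_2$ (two-sided).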
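The z-statistic for comparing two proportions is computed using: [2]

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Where:
- $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$ are the sample proportions, with $x_i$ successes out of $n_i$ observations in group $i$
- $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion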
The pooled proportion is used to estimate the shared probability of success under the null hypothesis, and the standard error accounts for variability across the two samples.
The z-test determines statistical significance by comparing the calculated z-statistic to a critical value. For example, for a significance level of $\alpha = 0.05$ we reject the null hypothesis if $|z| > 1.96$ (for a two-tailed test). Alternatively, one can compute the p-value and reject the null hypothesis if $p < \alpha$.
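The test described above can be sketched in a few lines of Python using only the standard library. The function name `two_proportion_ztest` and its argument names are illustrative choices, not part of any library, and the counts (120 and 150 successes out of 1000 each) are example inputs.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 == p2 using the pooled SE."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled proportion: shared success probability estimated under H0
    p_pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pooled
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_ztest(120, 1000, 150, 1000)

# Decision rule: compare |z| with the two-sided critical value
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z) > z_crit
print(f"z = {z:.4f}, p = {p_value:.4f}, reject H0: {reject}")
```

Comparing $|z|$ with the critical value and comparing the p-value with $\alpha$ lead to the same decision.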
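The confidence interval for the difference between two proportions, based on the definitions above, is:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Where:
- $z_{1-\alpha/2}$ is the critical value of the standard normal distribution (e.g., 1.96 for a 95% confidence level).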
This interval provides a range of plausible values for the true difference between population proportions.
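A minimal sketch of this interval, assuming the unpooled standard error in which each sample contributes its own variance estimate. The helper name `diff_proportions_ci` is hypothetical and the counts are example inputs.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation CI for p1 - p2 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Each group's variance is estimated separately (no pooling)
    se_unpooled = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1_hat - p2_hat
    return diff - z_crit * se_unpooled, diff + z_crit * se_unpooled


lower, upper = diff_proportions_ci(120, 1000, 150, 1000)
print(f"95% CI for p1 - p2: ({lower:.6f}, {upper:.6f})")
```

An interval that excludes zero corresponds to rejecting equality of the proportions at the matching significance level, up to the difference in variance estimates between the test and the interval.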
The two-sided z-test with the pooled standard error gives the same result as the Pearson chi-squared test (without continuity correction) for a two-by-two contingency table, since the square of the z-statistic equals the chi-squared statistic. [3] : 216–7 [4] : 875 Fisher's exact test is more suitable when the sample sizes are small.
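The relationship with the chi-squared test can be checked numerically. The sketch below, with hypothetical helper names `pooled_z` and `pearson_chi2_2x2` and example counts, computes the pooled z-statistic and the Pearson chi-squared statistic (no continuity correction) for the same two-by-two table.

```python
from math import sqrt


def pooled_z(x1, n1, x2, n2):
    """z-statistic for H0: p1 == p2 with the pooled standard error."""
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se


def pearson_chi2_2x2(x1, n1, x2, n2):
    """Pearson chi-squared statistic for the 2x2 table of successes/failures."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [x1 + x2, total - (x1 + x2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of group and outcome
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


z = pooled_z(120, 1000, 150, 1000)
chi2 = pearson_chi2_2x2(120, 1000, 150, 1000)
print(f"z^2 = {z ** 2:.4f}, chi-squared = {chi2:.4f}")
```

The two printed values agree up to floating-point rounding, which is why software that reports a chi-squared statistic with one degree of freedom can serve as a two-sided two-proportion z-test.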
Notice how the variance estimation is different between the hypothesis testing and the confidence intervals. The first uses a pooled variance (based on the null hypothesis), while the second has to estimate the variance using each sample separately (so as to allow for the confidence interval to accommodate a range of differences in proportions). This difference may lead to slightly different results if using the confidence interval as an alternative to the hypothesis testing method.
The minimum detectable effect (MDE) is the smallest difference between two proportions ($p_1$ and $p_2$) that a statistical test can detect for a chosen Type I error level ($\alpha$), statistical power ($1-\beta$), and sample sizes ($n_1$ and $n_2$). It is commonly used in study design to determine whether the sample sizes allow for a test with sufficient sensitivity to detect meaningful differences.
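The MDE for the (two-sided) z-test for comparing two proportions incorporates the critical values for $\alpha$ and $1-\beta$ and the standard errors of the proportions: [5] [6]

$$\text{MDE} = |p_1 - p_2| = z_{1-\alpha/2}\sqrt{p_0(1-p_0)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} + z_{1-\beta}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

Where:
- $z_{1-\alpha/2}$ is the critical value for the significance level
- $z_{1-\beta}$ is the quantile for the desired power
- $p_0$ is the common proportion assumed under the null hypothesis ($p_0 = p_1 = p_2$)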
The MDE depends on the sample sizes, the baseline proportions ($p_1$, $p_2$), and the test parameters. When the baseline proportions are not known, they need to be assumed or roughly estimated from a small study. Larger samples or smaller power requirements lead to a smaller MDE, making the test more sensitive to smaller differences. Researchers may use the MDE to assess the feasibility of detecting meaningful differences before conducting a study.
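As a rough sketch of how the MDE might be evaluated, the function below adds the critical distance under the null hypothesis to the power margin under the alternative. The name `minimum_detectable_effect` is hypothetical, the baseline proportions are example inputs supplied by the user, and taking the common null proportion as their sample-size-weighted average is an assumption of this sketch rather than something the text prescribes.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(p1, p2, n1, n2, alpha=0.05, power=0.80):
    """Two-sided MDE: z_{1-alpha/2} * SE_null + z_{1-beta} * SE_alt."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)  # power = 1 - beta
    # Assumption: common proportion under H0 is the weighted average of p1, p2
    p0 = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z_alpha * se_null + z_beta * se_alt


mde = minimum_detectable_effect(0.12, 0.15, 1000, 1000)
# Quadrupling both sample sizes halves every standard error, and so the MDE
mde_larger_samples = minimum_detectable_effect(0.12, 0.15, 4000, 4000)
print(f"MDE (n=1000 each): {mde:.4f}")
print(f"MDE (n=4000 each): {mde_larger_samples:.4f}")
```

With these example inputs the MDE is about 0.043, which is larger than the assumed difference of 0.03, so samples of 1000 per group would not be expected to detect that difference with 80% power.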
The minimal detectable effect (MDE) is the smallest difference, denoted as $\Delta = |p_1 - p_2|$, that satisfies two essential criteria in hypothesis testing:
- The null hypothesis ($H_0: p_1 = p_2$) is rejected at the specified significance level ($\alpha$).
- Statistical power ($1-\beta$) is achieved under the alternative hypothesis ($H_1: p_1 \neq p_2$).
Given that the distribution of the estimated difference is approximately normal under both the null and the alternative hypothesis, both criteria hold when $\Delta$ places the critical value for rejecting the null at the point where the probability of exceeding it in absolute value, under the null, is $\alpha$, and the probability of exceeding it, under the alternative, is $1-\beta$.
The first criterion establishes the critical value required to reject the null hypothesis. The second criterion specifies how far the center of the alternative distribution must be from that critical value to ensure that the probability of exceeding it under the alternative hypothesis is at least $1-\beta$. [7] [8]
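Condition 1: Rejecting $H_0$
Under the null hypothesis, the test statistic is based on the pooled standard error ($SE_{\text{pooled}}$):

$$SE_{\text{pooled}} = \sqrt{p_0(1-p_0)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$

$p_0$ might be estimated (as described above).
To reject $H_0$, the observed difference must exceed the critical threshold ($z_{1-\alpha/2}$) after scaling it by the standard error:

$$|\hat{p}_1 - \hat{p}_2| > z_{1-\alpha/2}\, SE_{\text{pooled}}$$

If the MDE were defined solely as $z_{1-\alpha/2}\, SE_{\text{pooled}}$, the statistical power would be only 50%, because the alternative distribution would be centered on the threshold and is symmetric about it. To achieve a higher power level, an additional component is required in the MDE calculation.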
Condition 2: Achieving power $1-\beta$
Under the alternative hypothesis, the standard error is $SE_{\text{alternative}} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$. This means that if the alternative distribution is centered at $\Delta$, then $\Delta$ must exceed the critical threshold for rejection by at least $z_{1-\beta}\, SE_{\text{alternative}}$ to ensure that the probability of detecting the difference under the alternative hypothesis is at least $1-\beta$.
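Combining conditions
To meet both conditions, the total detectable difference incorporates components from both the null and alternative distributions. The MDE is defined as:

$$\text{MDE} = z_{1-\alpha/2}\, SE_{\text{pooled}} + z_{1-\beta}\, SE_{\text{alternative}}$$

Here $SE_{\text{pooled}}$ and $SE_{\text{alternative}}$ are the standard errors under the null and alternative hypotheses, respectively. By adding the relevant quantile of the alternative distribution to the critical threshold from the null distribution, the MDE ensures the test satisfies the dual requirements of rejecting $H_0$ at significance level $\alpha$ and achieving statistical power of at least $1-\beta$.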
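The two conditions can be illustrated numerically. The sketch below treats both standard errors as fixed, hypothetical numbers and ignores the negligible chance of rejecting in the opposite tail. The helper name `approx_power` is illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()


def approx_power(delta, se_null, se_alt, alpha=0.05):
    """Approximate probability that the estimated difference exceeds the
    upper critical value when the true difference is delta."""
    critical_value = std_normal.inv_cdf(1 - alpha / 2) * se_null
    return 1 - std_normal.cdf((critical_value - delta) / se_alt)


alpha, target_power = 0.05, 0.80
se_null, se_alt = 0.0153, 0.0153  # hypothetical standard errors
z_alpha = std_normal.inv_cdf(1 - alpha / 2)
z_beta = std_normal.inv_cdf(target_power)

# Condition 1 only: the true difference sits exactly on the critical threshold
power_threshold_only = approx_power(z_alpha * se_null, se_null, se_alt, alpha)

# Both conditions: add the power margin from the alternative distribution
full_mde = z_alpha * se_null + z_beta * se_alt
power_full_mde = approx_power(full_mde, se_null, se_alt, alpha)

print(f"power at threshold only: {power_threshold_only:.3f}")
print(f"power at full MDE:       {power_full_mde:.3f}")
```

Under these assumptions the first value is 0.5 and the second equals the target power, matching the argument that the critical threshold alone yields only 50% power.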
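To ensure valid results, the following assumptions must be met:
- The two samples are independent of each other.
- Within each sample, the observations are independent Bernoulli trials with a common probability of success.
- The sample sizes are large enough for the normal approximation to hold; a common rule of thumb requires at least 5 to 10 expected successes and failures in each group.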
The z-test is most reliable when sample sizes are large, and all assumptions are satisfied.
In R, use prop.test() with continuity correction disabled:

    prop.test(x = c(120, 150), n = c(1000, 1000), correct = FALSE)
Output includes z-test equivalent results: chi-squared statistic, p-value, and confidence interval:
    2-sample test for equality of proportions without continuity correction

    data:  c(120, 150) out of c(1000, 1000)
    X-squared = 3.8536, df = 1, p-value = 0.04964
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -5.992397e-02 -7.602882e-05
    sample estimates:
    prop 1 prop 2
      0.12   0.15
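In Python, use proportions_ztest from statsmodels:

    from statsmodels.stats.proportion import proportions_ztest

    # Two-sided test of H0: p1 - p2 = 0 using the pooled standard error
    z, p = proportions_ztest([120, 150], [1000, 1000], value=0)
    # For a CI of the difference:
    # from statsmodels.stats.proportion import confint_proportions_2indep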