The two-proportion Z-test or two-sample proportion Z-test is a statistical method used to determine whether the difference between the proportions of two groups, each drawn from a binomial distribution, is statistically significant. [1] This approach relies on the assumption that the sample proportions follow a normal distribution under the Central Limit Theorem, allowing the construction of a z-test for hypothesis testing and confidence interval estimation. [2] It is used in various fields to compare success rates, response rates, or other proportions across different groups.
The z-test for comparing two proportions is a frequentist statistical hypothesis test used to evaluate whether two independent samples have different population proportions for a binary outcome. Under mild regularity conditions (sufficiently large sample sizes and independent sampling), the sample proportions (each the average of observations from a Bernoulli distribution) are approximately normally distributed under the central limit theorem, which permits using a z-statistic constructed from the difference of sample proportions and an estimated standard error. [2]
The test involves two competing hypotheses: the null hypothesis $H_0\colon p_1 = p_2$ (the two population proportions are equal) and the alternative hypothesis $H_1\colon p_1 \neq p_2$ (they differ; one-sided alternatives $p_1 > p_2$ or $p_1 < p_2$ may also be used).
The z-statistic for comparing two proportions is computed using: [3] [4] [5] [2]: 10.6.2

$$ Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions in the first and second sample, $n_1$ and $n_2$ are the sizes of the first and second sample, respectively, and $\hat{p}$ is the pooled proportion, calculated as $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$, where $x_1$ and $x_2$ are the counts of successes in the two samples. The pooled proportion is used to estimate the shared probability of success under the null hypothesis, and the standard error accounts for variability across the two samples.
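As a concrete illustration, the following minimal Python sketch computes the z-statistic and its two-sided p-value directly from the counts (the function name and variables are illustrative, not taken from any particular library):

import math

def two_proportion_z(x1, n1, x2, n2):
    """Z-statistic for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # shared success probability under H0
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))  # pooled standard error
    return (p1 - p2) / se

z = two_proportion_z(120, 1000, 150, 1000)   # about -1.96 for the example used later
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value, about 0.0497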
The z-test determines statistical significance by comparing the calculated z-statistic to a critical value. E.g., for a significance level of $\alpha$ we reject the null hypothesis if $|Z| > z_{1-\alpha/2}$ (for a two-tailed test; e.g., 1.96 when $\alpha = 0.05$). Or, alternatively, by computing the p-value and rejecting the null hypothesis if $p\text{-value} < \alpha$.
The confidence interval for the difference between two proportions, based on the definitions above, is: [5] [2]: 10.6.3

$$ (\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} $$

where $z_{1-\alpha/2}$ is the critical value of the standard normal distribution (e.g., 1.96 for a 95% confidence level).
This interval provides a range of plausible values for the true difference between population proportions.
Notice how the variance estimation differs between the hypothesis test and the confidence interval. The first uses a pooled variance (based on the null hypothesis), while the second estimates the variance from each sample separately (so that the confidence interval can accommodate a range of differences in proportions). This difference may lead to slightly different results if the confidence interval is used as an alternative to the hypothesis testing method.
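The contrast between the two standard errors can be sketched in a few lines of Python (illustrative code, using the same counts as the worked example below):

import math

def two_proportion_test_and_ci(x1, n1, x2, n2, z_crit=1.96):
    """Pooled SE for the z-test, unpooled SE for the confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))  # used by the test
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)      # used by the CI
    z = (p1 - p2) / se_pooled
    ci = ((p1 - p2) - z_crit * se_unpooled, (p1 - p2) + z_crit * se_unpooled)
    return z, ci

z, ci = two_proportion_test_and_ci(120, 1000, 150, 1000)
# z is about -1.96 and ci is about (-0.0599, -0.0001); the two SEs are close here but not identical.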
Sample size determination is the act of choosing the number of observations to include in each group for running the statistical test. For the two-proportion Z-test, this is closely related to deciding on the minimum detectable effect.
For finding the required sample size (given some effect size $|p_1 - p_2|$, power $1-\beta$, and type I error $\alpha$), we define $\kappa = n_1 / n_2$ (when κ = 1, equal sample size is assumed for each group), then: [6] [7]

$$ n_1 = \kappa n_2, \qquad n_2 = \left(\frac{p_1(1-p_1)}{\kappa} + p_2(1-p_2)\right)\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{p_1 - p_2}\right)^2 $$
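For example, a small Python sketch of this calculation (the function name is illustrative; scipy is assumed for the normal quantiles):

from math import ceil
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80, kappa=1.0):
    """Approximate group sizes needed to detect |p1 - p2| with a two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)            # quantile for the desired power
    n2 = (p1 * (1 - p1) / kappa + p2 * (1 - p2)) * ((z_alpha + z_beta) / (p1 - p2)) ** 2
    return ceil(kappa * n2), ceil(n2)   # (n1, n2)

n1, n2 = sample_size_two_proportions(0.12, 0.15)   # about 2033 observations per group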
The minimum detectable effect or MDE is the smallest difference between two proportions ($p_1$ and $p_2$) that a statistical test can detect for a chosen type I error level ($\alpha$), statistical power ($1-\beta$), and sample sizes ($n_1$ and $n_2$). It is commonly used in study design to determine whether the sample sizes allow for a test with sufficient sensitivity to detect meaningful differences.
The MDE when using the (two-sided) z-test formula for comparing two proportions, incorporating critical values for $\alpha$ and $1-\beta$, and the standard errors of the proportions, is: [8] [9]

$$ \mathrm{MDE} = z_{1-\alpha/2}\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} + z_{1-\beta}\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}} $$

where $z_{1-\alpha/2}$ is the critical value for the significance level, $z_{1-\beta}$ is the quantile for the desired power, and $\bar{p}$ is the pooled proportion when assuming the null is correct.
The MDE depends on the sample sizes, baseline proportions ($p_1$, $p_2$), and test parameters. When the baseline proportions are not known, they need to be assumed or roughly estimated from a small study. Larger samples or smaller power requirements lead to a smaller MDE, making the test more sensitive to smaller differences. Researchers may use the MDE to assess the feasibility of detecting meaningful differences before conducting a study.
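A minimal Python sketch of the MDE calculation (illustrative names; scipy is assumed for the normal quantiles):

from math import sqrt
from scipy.stats import norm

def minimum_detectable_effect(p1, p2, n1, n2, alpha=0.05, power=0.80):
    """MDE of the two-proportion z-test for the given baseline proportions and sample sizes."""
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)                    # pooled proportion under H0
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))    # SE under the null
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)     # SE under the alternative
    return norm.ppf(1 - alpha / 2) * se_null + norm.ppf(power) * se_alt

mde = minimum_detectable_effect(0.12, 0.15, 1000, 1000)   # about 0.043, as in the SQL example below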
The Minimal Detectable Effect (MDE) is the smallest difference that satisfies two essential criteria in hypothesis testing:
1. It is large enough that the null hypothesis $H_0$ is rejected at the chosen significance level $\alpha$.
2. It is detected with probability of at least the desired statistical power $1-\beta$ under the alternative hypothesis.
Given that the distribution is normal under the null and the alternative hypothesis, for the two criteria to hold, the MDE must be placed so that the critical value for rejecting the null ($z_{1-\alpha/2}\,SE_0$) sits exactly at the location where the probability of exceeding it under the null is $\alpha$ (two-sided), while the probability of exceeding that same value under the alternative is $1-\beta$.
The first criterion establishes the critical value required to reject the null hypothesis. The second criterion specifies how far the alternative distribution must be from this critical value to ensure that the probability of exceeding it under the alternative hypothesis is at least $1-\beta$. [10] [11]
Condition 1: Rejecting $H_0$
Under the null hypothesis, the test statistic is based on the pooled standard error ($SE_0$):

$$ SE_0 = \sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} $$
$\bar{p}$ might be estimated (as described above).
To reject $H_0$, the observed difference must exceed the critical threshold ($z_{1-\alpha/2}$) after scaling it by the standard error:

$$ \hat{p}_1 - \hat{p}_2 > z_{1-\alpha/2}\, SE_0 $$
If the MDE is defined solely as $z_{1-\alpha/2}\, SE_0$, the statistical power would be only 50% because the alternative distribution is symmetric about this threshold. To achieve a higher power level, an additional component is required in the MDE calculation.
Condition 2: Achieving power of at least $1-\beta$
Under the alternative hypothesis, the standard error is

$$ SE_1 = \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}. $$

This means that if the alternative distribution were centered around some value (e.g., the MDE), then the minimal MDE must be at least $z_{1-\beta}\, SE_1$ larger than $z_{1-\alpha/2}\, SE_0$ to ensure that the probability of detecting the difference under the alternative hypothesis is at least $1-\beta$.
Combining conditions
To meet both conditions, the total detectable difference incorporates components from both the null and alternative distributions. The MDE is defined as:

$$ \mathrm{MDE} = z_{1-\alpha/2}\, SE_0 + z_{1-\beta}\, SE_1 $$
By summing the critical threshold from the null distribution and the relevant quantile from the alternative distribution, the MDE ensures the test satisfies the dual requirements of rejecting $H_0$ at significance level $\alpha$ and achieving statistical power of at least $1-\beta$.
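As a quick numerical check of this derivation (a sketch with illustrative values; scipy is assumed), one can compute the power of the test when the true difference equals the MDE; under the normal approximation it comes out close to the requested $1-\beta$:

from math import sqrt
from scipy.stats import norm

alpha, beta = 0.05, 0.20
p1, p2, n1, n2 = 0.12, 0.15, 1000, 1000

p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)
se0 = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))     # SE under the null
se1 = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)     # SE under the alternative
mde = norm.ppf(1 - alpha / 2) * se0 + norm.ppf(1 - beta) * se1

# Power when the true difference equals the MDE: probability of exceeding the critical threshold
power = norm.sf((norm.ppf(1 - alpha / 2) * se0 - mde) / se1)
print(round(power, 3))   # approximately 0.8 (ignoring the negligible opposite tail)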
To ensure valid results, the following assumptions must be met:
- The two samples are independent of each other, and observations within each sample are independent (e.g., obtained by random sampling).
- Each observation is a binary outcome (success/failure), so the number of successes in each group follows a binomial distribution.
- The sample sizes are large enough for the normal approximation to the binomial to hold (a common rule of thumb is at least 5–10 expected successes and failures in each group).
The z-test is most reliable when sample sizes are large, and all assumptions are satisfied.
Using the two-proportion z-test for hypothesis testing gives the same results as the chi-squared test for a two-by-two contingency table, since the squared z-statistic equals the chi-squared statistic. [13]: 216–7 [14]: 875 Fisher's exact test is more suitable when the sample sizes are small.
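For instance, this equivalence can be checked with a few lines of Python (a sketch using scipy and statsmodels, with the counts from the example below; the continuity correction is disabled to match the z-test):

from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

table = [[120, 1000 - 120],    # successes and failures in group 1
         [150, 1000 - 150]]    # successes and failures in group 2
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

z, p_z = proportions_ztest([120, 150], [1000, 1000])
print(round(z ** 2, 4), round(chi2, 4))   # both are about 3.8536
print(round(p_z, 5), round(p_chi2, 5))    # identical two-sided p-values, about 0.04964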
The treatment of the 2-by-2 contingency table was investigated as early as the 19th century, [15] with further work during the 20th century. [16]
Notice that:
Suppose group 1 has 120 successes out of 1000 trials ($\hat{p}_1 = 0.12$) and group 2 has 150 successes out of 1000 trials ($\hat{p}_2 = 0.15$). The pooled proportion is $\hat{p} = \frac{120 + 150}{1000 + 1000} = 0.135$. The pooled standard error is

$$ SE_0 = \sqrt{0.135 \times 0.865 \times \left(\tfrac{1}{1000} + \tfrac{1}{1000}\right)} \approx 0.01528. $$
The z-statistic is

$$ Z = \frac{0.12 - 0.15}{0.01528} \approx -1.96, $$

giving a two-sided p-value of about 0.0497 (just under 0.05). An approximate 95% confidence interval for the difference, using the unpooled standard error, is

$$ -0.03 \pm 1.96 \times 0.01527 \approx (-0.0599,\ -0.0001). $$

Because the 95% CI (just barely) excludes 0 and the p-value is ≈0.0497, the difference is statistically significant at the 5% level by the usual large-sample criteria (but it is borderline; conclusions should account for study context and multiple testing if applicable).
Implementations are available in many statistical environments. See below for implementation details in some popular languages. Other implementations also exist for SPSS, [17] SAS, [18] and Minitab. [5]
Use prop.test() with continuity correction disabled:

prop.test(x = c(120, 150), n = c(1000, 1000), correct = FALSE)
Output includes z-test equivalent results: chi-squared statistic, p-value, and confidence interval:
	2-sample test for equality of proportions without continuity correction

data:  c(120, 150) out of c(1000, 1000)
X-squared = 3.8536, df = 1, p-value = 0.04964
alternative hypothesis: two.sided
95 percent confidence interval:
 -5.992397e-02 -7.602882e-05
sample estimates:
prop 1 prop 2
  0.12   0.15
Use proportions_ztest from statsmodels:

from statsmodels.stats.proportion import proportions_ztest

z, p = proportions_ztest([120, 150], [1000, 1000], 0)
# For a confidence interval, see statsmodels.stats.proportion.confint_proportions_2indep
(This example uses the Presto flavor of SQL.)
-- Calculate group sizes and accuracy per group
WITH group_stats AS (
    SELECT
        -- Number of samples in each group
        COUNT_IF(ab_group = 'test') AS n_test,
        COUNT_IF(ab_group = 'control') AS n_control,
        -- Proportion correct (accuracy) in each group
        COUNT_IF(ab_group = 'test' AND is_success = 1) * 1.0 / NULLIF(COUNT_IF(ab_group = 'test'), 0) AS p_test,
        COUNT_IF(ab_group = 'control' AND is_success = 1) * 1.0 / NULLIF(COUNT_IF(ab_group = 'control'), 0) AS p_control
    FROM some_table
)
SELECT
    -- Sample sizes
    n_test,
    n_control,
    -- Accuracy (proportion correct) per group
    ROUND(p_test, 3) AS p_test,
    ROUND(p_control, 3) AS p_control,
    -- Difference in accuracy between groups
    ROUND(p_test - p_control, 3) AS diff_p,
    -- Standard error of the difference in proportions
    ROUND(SQRT((p_test * (1 - p_test) / n_test) + (p_control * (1 - p_control) / n_control)), 3) AS se_diff,
    -- Total sample size
    n_test + n_control AS n_total,
    -- Pooled proportion (overall accuracy)
    ROUND((n_test * p_test + n_control * p_control) / NULLIF(n_test + n_control, 0), 3) AS p_pooled,
    -- 95% Confidence Interval for the difference in proportions
    ROUND((p_test - p_control) - 1.96 * SQRT((p_test * (1 - p_test) / n_test) + (p_control * (1 - p_control) / n_control)), 3) AS diff_p_ci_lower,
    ROUND((p_test - p_control) + 1.96 * SQRT((p_test * (1 - p_test) / n_test) + (p_control * (1 - p_control) / n_control)), 3) AS diff_p_ci_upper,
    -- Ratio of test to control accuracy
    ROUND(p_test / NULLIF(p_control, 0), 3) AS accuracy_ratio,
    -- Two-sided p-value for difference in proportions (z-test)
    ROUND(2 * (1 - NORMAL_CDF(0, 1, ABS(p_test - p_control) / SQRT((1 - ((n_test * p_test + n_control * p_control) / NULLIF(n_test + n_control, 0))) * ((n_test * p_test + n_control * p_control) / NULLIF(n_test + n_control, 0)) * (1.0 / n_test + 1.0 / n_control)))), 3) AS diff_p_p_value,
    -- Statistical significance at 95% confidence
    ABS(p_test - p_control) / SQRT((1 - ((n_test * p_test + n_control * p_control) / NULLIF(n_test + n_control, 0))) * ((n_test * p_test + n_control * p_control) / NULLIF(n_test + n_control, 0)) * (1.0 / n_test + 1.0 / n_control)) > 1.96 AS is_diff_stat_sig,
    -- Minimal Detectable Effect (MDE) at 95% confidence, 80% power
    ROUND((1.96 * SQRT(((n_test * p_test + n_control * p_control) / NULLIF(n_test + n_control, 0)) * (1 - ((n_test * p_test + n_control * p_control) / NULLIF(n_test + n_control, 0))) * (1.0 / n_test + 1.0 / n_control)) + 0.84 * SQRT((p_test * (1 - p_test) / n_test) + (p_control * (1 - p_control) / n_control))), 3) AS minimal_detectable_effect
FROM group_stats
Output would be something like this:
| n_test | n_control | p_test | p_control | diff_p | se_diff | p_pooled | diff_p_ci_lower | diff_p_ci_upper | p_value | is_stat_sig | accuracy_ratio | mde |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 1000 | 0.120 | 0.150 | -0.030 | 0.015 | 0.135 | -0.0599 | -0.0001 | 0.050 | TRUE | 0.800 | 0.043 |