McNemar's test


In statistics, McNemar's test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity"). It is named after Quinn McNemar, who introduced it in 1947. [1] An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium. [2]


The parameters commonly used to assess a diagnostic test in the medical sciences are sensitivity and specificity. Sensitivity (or recall) is the ability of a test to correctly identify people with the disease. Specificity (the true negative rate) is the ability of a test to correctly identify people without the disease.

Now suppose two tests are performed on the same group of patients, and suppose that they have identical sensitivity and specificity. It is tempting to conclude from this that the two tests are equivalent, but this may not be the case. To check, one has to study the patients with the disease and the patients without it (as determined by a reference test) and find out where the two tests disagree with each other. This is precisely the basis of McNemar's test, which compares the sensitivity and specificity of two diagnostic tests on the same group of patients. [3]


The test is applied to a 2 × 2 contingency table, which tabulates the outcomes of two tests on a sample of N subjects, as follows.

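|                 | Test 2 positive | Test 2 negative | Row total |
|-----------------|-----------------|-----------------|-----------|
| Test 1 positive | a               | b               | a + b     |
| Test 1 negative | c               | d               | c + d     |
| Column total    | a + c           | b + d           | N         |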

The null hypothesis of marginal homogeneity states that the two marginal probabilities for each outcome are the same, i.e. $p_a + p_b = p_a + p_c$ and $p_c + p_d = p_b + p_d$.

Thus the null and alternative hypotheses are [1]

$$H_0: p_b = p_c$$
$$H_1: p_b \neq p_c$$

Here $p_a$, etc., denote the theoretical probability of occurrences in cells with the corresponding label.

The McNemar test statistic is:

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

Under the null hypothesis, with a sufficiently large number of discordants (cells b and c), $\chi^2$ has a chi-squared distribution with 1 degree of freedom. If the result is significant, this provides sufficient evidence to reject the null hypothesis in favour of the alternative hypothesis that $p_b \neq p_c$, which would mean that the marginal proportions are significantly different from each other.
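The asymptotic test can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `mcnemar_asymptotic` is made up for this example, and the P-value relies on the identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def mcnemar_asymptotic(b: int, c: int) -> tuple[float, float]:
    """Return (chi-squared statistic, P-value) from the discordant counts b and c.

    Hypothetical helper for illustration; only cells b and c enter the formula.
    """
    if b + c == 0:
        raise ValueError("the statistic is undefined when b + c == 0")
    stat = (b - c) ** 2 / (b + c)
    # Chi-squared with 1 degree of freedom is the square of a standard normal,
    # so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Example call with made-up discordant counts.
stat, p_value = mcnemar_asymptotic(10, 20)
print(f"chi2 = {stat:.3f}, P = {p_value:.4f}")
```

Using `math.erfc` keeps the sketch free of third-party dependencies; a statistics library's chi-squared survival function would serve the same purpose.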


If either b or c is small (b + c < 25) then $\chi^2$ is not well-approximated by the chi-squared distribution. [ citation needed ] An exact binomial test can then be used, where b is compared to a binomial distribution with size parameter n = b + c and p = 0.5. Effectively, the exact binomial test evaluates the imbalance in the discordants b and c. To achieve a two-sided P-value, the P-value of the extreme tail should be multiplied by 2. For b ≥ c:

$$\text{exact-P-value} = 2 \sum_{i=b}^{n} \binom{n}{i} 0.5^{i} (1 - 0.5)^{n-i}$$

which is simply twice the upper tail of the binomial distribution with p = 0.5 and n = b + c (equivalently, by symmetry, twice its cumulative distribution function evaluated at c).
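The exact binomial version can be sketched as follows. The function name `mcnemar_exact` is hypothetical, and two details are assumptions of this sketch rather than part of the text: the larger of b and c is used as the tail start so the function works for either ordering, and the doubled tail is capped at 1.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial P-value for the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence against the null
    k = max(b, c)  # start the tail at the larger count (assumption: symmetric in b, c)
    # With p = 0.5 every outcome has weight 0.5**n, so the tail is a sum of
    # binomial coefficients divided by 2**n.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)  # cap at 1 (assumption of this sketch)


# Made-up counts: the most extreme split possible with five discordant pairs.
print(mcnemar_exact(5, 0))
```

Integer arithmetic with `math.comb` avoids the rounding error that would accumulate from summing many small floating-point terms.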

Edwards [4] proposed the following continuity corrected version of the McNemar test to approximate the binomial exact-P-value:

$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$$

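The mid-P McNemar test (mid-p binomial test) is calculated by subtracting half the probability of the observed b from the exact one-sided P-value, then doubling the result to obtain the two-sided mid-P-value: [5] [6]

$$\text{mid-P-value} = 2 \left( \sum_{i=b}^{n} \binom{n}{i} 0.5^{i} (1 - 0.5)^{n-i} - 0.5 \binom{n}{b} 0.5^{b} (1 - 0.5)^{n-b} \right)$$

This is equivalent to:

$$\text{mid-P-value} = \text{exact-P-value} - \binom{n}{b} 0.5^{b} (1 - 0.5)^{n-b}$$

where the second term is the binomial distribution probability mass function and n = b + c. Binomial distribution functions are readily available in common software packages and the McNemar mid-P test can easily be calculated. [6]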
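The continuity-corrected and mid-P variants can be sketched in the same way. The names `mcnemar_edwards` and `mcnemar_midp` are hypothetical; as an assumption of this sketch, the mid-P function starts its tail at the larger of b and c so that either ordering is accepted.

```python
import math
from math import comb


def mcnemar_edwards(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected statistic and its chi-squared (1 df) P-value."""
    if b + c == 0:
        raise ValueError("the statistic is undefined when b + c == 0")
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of chi-squared with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


def mcnemar_midp(b: int, c: int) -> float:
    """Two-sided mid-P value: exact P-value minus the probability of the observed count."""
    n = b + c
    if n == 0:
        return 1.0
    k = max(b, c)  # assumption: treat b and c symmetrically
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # one-sided exact P
    point = comb(n, k) / 2 ** n  # binomial probability mass at the observed count
    return min(1.0, 2 * (tail - 0.5 * point))


# Made-up discordant counts for illustration.
print(mcnemar_edwards(3, 12))
print(mcnemar_midp(3, 12))
```

Because the mid-P value removes half of the observed point probability from each tail, it is never larger than the exact P-value for the same counts.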

The traditional advice has been to use the exact binomial test when b + c < 25. However, simulations have shown both the exact binomial test and the McNemar test with continuity correction to be overly conservative. [6] When b + c < 6, the exact-P-value always exceeds the common significance level 0.05. The original McNemar test was most powerful, but often slightly liberal. The mid-P version was almost as powerful as the asymptotic McNemar test and was not found to exceed the nominal significance level.


In the first example, a researcher attempts to determine if a drug has an effect on a particular disease. Counts of individuals are given in the table, with the diagnosis (disease: present or absent) before treatment given in the rows, and the diagnosis after treatment in the columns. The test requires the same subjects to be included in the before-and-after measurements (matched pairs).

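|                 | After: present | After: absent | Row total |
|-----------------|----------------|---------------|-----------|
| Before: present | 101            | 121           | 222       |
| Before: absent  | 59             | 33            | 92        |
| Column total    | 160            | 154           | 314       |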

In this example, the null hypothesis of "marginal homogeneity" would mean there was no effect of the treatment. From the above data, the McNemar test statistic:

$$\chi^2 = \frac{(121 - 59)^2}{121 + 59}$$

has the value 21.35, which is extremely unlikely to arise from the distribution implied by the null hypothesis (P < 0.001). Thus the test provides strong evidence to reject the null hypothesis of no treatment effect.

A second example illustrates differences between the asymptotic McNemar test and alternatives. [6] The data table is formatted as before, with different numbers in the cells:

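|                 | After: present | After: absent | Row total |
|-----------------|----------------|---------------|-----------|
| Before: present | 59             | 6             | 65        |
| Before: absent  | 16             | 80            | 96        |
| Column total    | 75             | 86            | 161       |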

With these data, the sample size (161 patients) is not small; however, the results from the McNemar test and its other versions differ. The exact binomial test gives P = 0.053 and McNemar's test with continuity correction gives $\chi^2$ = 3.68 and P = 0.055. The asymptotic McNemar's test gives $\chi^2$ = 4.55 and P = 0.033 and the mid-P McNemar's test gives P = 0.035. Both the asymptotic McNemar's test and the mid-P version provide stronger evidence for a statistically significant treatment effect in this second example.
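The four versions can be compared on the second example's table with a short script. This is only a sketch: `mcnemar_all` and `chi2_sf_1df` are names invented here, and the results should agree with the figures quoted above to within rounding (the exact P-value comes out near 0.0525).

```python
import math
from math import comb


def chi2_sf_1df(x: float) -> float:
    """Upper tail of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


def mcnemar_all(table: list[list[int]]) -> dict:
    """Run the asymptotic, continuity-corrected, exact and mid-P versions on a 2x2 table."""
    (_, b), (c, _) = table  # the diagonal cells a and d are not used
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs")
    k = max(b, c)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # one-sided exact P
    point = comb(n, k) / 2 ** n  # probability of the observed count
    asym = (b - c) ** 2 / n
    corrected = (abs(b - c) - 1) ** 2 / n
    return {
        "asymptotic": (asym, chi2_sf_1df(asym)),
        "continuity_corrected": (corrected, chi2_sf_1df(corrected)),
        "exact": min(1.0, 2 * tail),
        "mid_p": min(1.0, 2 * tail - point),
    }


# Rows: before (present, absent); columns: after (present, absent).
results = mcnemar_all([[59, 6], [16, 80]])
for name, value in results.items():
    print(name, value)
```

Replacing the diagonal entries 59 and 80 with any other counts leaves every result unchanged, since only the discordant cells enter the calculations.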


An interesting observation when interpreting McNemar's test is that the elements of the main diagonal do not contribute to the decision about whether (in the above example) pre- or post-treatment condition is more favourable. Thus, the sum b + c can be small and statistical power of the tests described above can be low even though the number of pairs a + b + c + d is large (see second example above).

An extension of McNemar's test exists in situations where independence does not necessarily hold between the pairs; instead, there are clusters of paired data where the pairs in a cluster may not be independent, but independence holds between different clusters. [7] An example is analyzing the effectiveness of a dental procedure; in this case, a pair corresponds to the treatment of an individual tooth in patients who might have multiple teeth treated; the effectiveness of treatment of two teeth in the same patient is not likely to be independent, but the treatment of two teeth in different patients is more likely to be independent. [8]

Information in the pairings

In the 1970s, it was conjectured that retaining one's tonsils might protect against Hodgkin's lymphoma. John Rice wrote: [9]

85 Hodgkin's patients [...] had a sibling of the same sex who was free of the disease and whose age was within 5 years of the patient's. These investigators presented the following table:

They calculated a chi-squared statistic [...] [they] had made an error in their analysis by ignoring the pairings.[...] [their] samples were not independent, because the siblings were paired [...] we set up a table that exhibits the pairings:

It is to the second table that McNemar's test can be applied. Notice that the sum of the numbers in the second table is 85—the number of pairs of siblings—whereas the sum of the numbers in the first table is twice as big, 170—the number of individuals. The second table gives more information than the first. The numbers in the first table can be found by using the numbers in the second table, but not vice versa. The numbers in the first table give only the marginal totals of the numbers in the second table.
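The one-way relationship between the two kinds of table can be illustrated with a small sketch. The counts below are made up for illustration and are not the tonsillectomy data; `marginal_table` is a hypothetical helper that collapses a table of pairs into a table of individuals.

```python
def marginal_table(pairs: list[list[int]]) -> list[list[int]]:
    """Collapse a 2x2 table of pairs into a 2x2 table of individuals.

    pairs[i][j] counts pairs whose first member has status i and whose
    second member has status j (0 = trait present, 1 = trait absent).
    """
    (a, b), (c, d) = pairs
    return [
        [a + b, c + d],  # first members: present, absent
        [a + c, b + d],  # second members: present, absent
    ]


# Two different, made-up paired tables with 50 pairs each.
pairs_1 = [[20, 10], [5, 15]]
pairs_2 = [[15, 15], [10, 10]]

print(marginal_table(pairs_1))
print(marginal_table(pairs_2))
```

Both paired tables collapse to the same table of individuals even though their discordant cells differ (10 and 5 versus 15 and 10), so the paired table, and with it the McNemar statistic, cannot be recovered from the marginal totals alone.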


References

  1. McNemar, Quinn (June 18, 1947). "Note on the sampling error of the difference between correlated proportions or percentages". Psychometrika. 12 (2): 153–157. doi:10.1007/BF02295996. PMID 20254758.
  2. Spielman RS; McGinnis RE; Ewens WJ (Mar 1993). "Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM)". Am J Hum Genet. 52 (3): 506–16. PMC 1682161. PMID 8447318.
  3. Hawass, N E (April 1997). "Comparing the sensitivities and specificities of two diagnostic procedures performed on the same group of patients". The British Journal of Radiology. 70 (832): 360–366. doi:10.1259/bjr.70.832.9166071. ISSN 0007-1285. PMID 9166071.
  4. Edwards, A (1948). "Note on the "correction for continuity" in testing the significance of the difference between correlated proportions". Psychometrika. 13 (3): 185–187. doi:10.1007/bf02289261. PMID 18885738.
  5. Lancaster, H.O. (1961). "Significance tests in discrete distributions". J Am Stat Assoc. 56 (294): 223–234. doi:10.1080/01621459.1961.10482105.
  6. Fagerland, M.W.; Lydersen, S.; Laake, P. (2013). "The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional". BMC Medical Research Methodology. 13: 91. doi:10.1186/1471-2288-13-91. PMC 3716987. PMID 23848987.
  7. Yang, Z.; Sun, X.; Hardin, J.W. (2010). "A note on the tests for clustered matched-pair binary data". Biometrical Journal. 52 (5): 638–652. doi:10.1002/bimj.201000035. PMID 20976694.
  8. Durkalski, V.L.; Palesch, Y.Y.; Lipsitz, S.R.; Rust, P.F. (2003). "Analysis of clustered matched-pair data". Statistics in Medicine. 22 (15): 2417–28. doi:10.1002/sim.1438. PMID 12872299. Archived from the original on January 5, 2013. Retrieved April 1, 2009.
  9. Rice, John (1995). Mathematical Statistics and Data Analysis (Second ed.). Belmont, California: Duxbury Press. pp. 492–494. ISBN 978-0-534-20934-6.
  10. Liddell, D. (1976). "Practical Tests of 2 × 2 Contingency Tables". Journal of the Royal Statistical Society. 25 (4): 295–304. JSTOR 2988087.
  11. "Maxwell's test, McNemar's test, Kappa test". Retrieved 2012-11-22.
  12. Sun, Xuezheng; Yang, Zhao (2008). "Generalized McNemar's Test for Homogeneity of the Marginal Distributions" (PDF). SAS Global Forum.
  13. Stuart, Alan (1955). "A Test for Homogeneity of the Marginal Distributions in a Two-Way Classification". Biometrika. 42 (3/4): 412–416. doi:10.1093/biomet/42.3-4.412. JSTOR 2333387.
  14. Maxwell, A.E. (1970). "Comparing the Classification of Subjects by Two Independent Judges". The British Journal of Psychiatry. 116 (535): 651–655. doi:10.1192/bjp.116.535.651. PMID 5452368.
  15. "McNemar Tests of Marginal Homogeneity". 2006-08-30. Retrieved 2012-11-22.
  16. Bhapkar, V.P. (1966). "A Note on the Equivalence of Two Test Criteria for Hypotheses in Categorical Data". Journal of the American Statistical Association. 61 (313): 228–235. doi:10.1080/01621459.1966.10502021. JSTOR 2283057.
  17. Yang, Z.; Sun, X.; Hardin, J.W. (2012). "Testing Marginal Homogeneity in Matched-Pair Polytomous Data". Therapeutic Innovation & Regulatory Science. 46 (4): 434–438. doi:10.1177/0092861512442021.
  18. Agresti, Alan (2002). Categorical Data Analysis (PDF). Hoboken, New Jersey: John Wiley & Sons, Inc. p. 413. ISBN 978-0-471-36093-3.