Cohen's kappa

Last updated

Cohen's kappa coefficient (κ) is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. [1] It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance. There is controversy surrounding Cohen's kappa due to the difficulty in interpreting indices of agreement. Some researchers have suggested that it is conceptually simpler to evaluate disagreement between items. [2]

Contents

History

The first mention of a kappa-like statistic is attributed to Galton (1892); [3] see Smeeton (1985). [4]

The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960. [5]

Definition

Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The definition of is:

where po is the relative observed agreement among raters, and pe is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement then . If there is no agreement among the raters other than what would be expected by chance (as given by pe), . It is possible for the statistic to be negative, [6] which implies that there is no effective agreement between the two raters or the agreement is worse than random.

For k categories, N observations to categorize and the number of times rater i predicted category k:

This is derived from the following construction:

Where is the estimated probability that both rater 1 and rater 2 will classify the same item as k, while is the estimated probability that rater 1 will classify an item as k (and similarly for rater 2). The relation is based on using the assumption that the rating of the two raters are independent. The term is estimated by using the number of items classified as k by rater 1 () divided by the total items to classify (): (and similarly for rater 2).

Binary classification confusion matrix

In the traditional 2 × 2 confusion matrix employed in machine learning and statistics to evaluate binary classifications, the Cohen's Kappa formula can be written as [7] :

where TP are the true positives, FP are the false positives, TN are the true negatives, and FN are the false negatives.

Examples

Simple example

Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers and each reader either said "Yes" or "No" to the proposal. Suppose the disagreement count data were as follows, where A and B are readers, data on the main diagonal of the matrix (a and d) count the number of agreements and off-diagonal data (b and c) count the number of disagreements:

B
A
YesNo
Yesab
Nocd

e.g.

B
A
YesNo
Yes205
No1015

The observed proportionate agreement is:

To calculate pe (the probability of random agreement) we note that:

So the expected probability that both would say yes at random is:

Similarly:

Overall random agreement probability is the probability that they agreed on either Yes or No, i.e.:

So now applying our formula for Cohen's Kappa we get:

Same percentages but different numbers

A case sometimes considered to be a problem with Cohen's Kappa occurs when comparing the Kappa calculated for two pairs of raters with the two raters in each pair having the same percentage agreement but one pair give a similar number of ratings in each class while the other pair give a very different number of ratings in each class. [8] (In the cases below, notice B has 70 yeses and 30 nos, in the first case, but those numbers are reversed in the second.) For instance, in the following two cases there is equal agreement between A and B (60 out of 100 in both cases) in terms of agreement in each class, so we would expect the relative values of Cohen's Kappa to reflect this. However, calculating Cohen's Kappa for each:

B
A
YesNo
Yes4515
No2515
B
A
YesNo
Yes2535
No535

we find that it shows greater similarity between A and B in the second case, compared to the first. This is because while the percentage agreement is the same, the percentage agreement that would occur 'by chance' is significantly higher in the first case (0.54 compared to 0.46).

Properties

Hypothesis testing and confidence interval

P-value for kappa is rarely reported, probably because even relatively low values of kappa can nonetheless be significantly different from zero but not of sufficient magnitude to satisfy investigators. [9] :66 Still, its standard error has been described [10] and is computed by various computer programs. [11]

Confidence intervals for Kappa may be constructed, for the expected Kappa values if we had infinite number of items checked, using the following formula: [1]

Where is the standard normal percentile when , and

This is calculated by ignoring that pe is estimated from the data, and by treating po as an estimated probability of a binomial distribution while using asymptotic normality (i.e.: assuming that the number of items is large and that po is not close to either 0 or 1). (and the CI in general) may also be estimated using bootstrap methods.

Interpreting magnitude

Kappa (vertical axis) and Accuracy (horizontal axis) calculated from the same simulated binary data. Each point on the graph is calculated from a pairs of judges randomly rating 10 subjects for having a diagnosis of X or not. Note in this example a Kappa=0 is approximately equivalent to an accuracy=0.5 Kappa vs accuracy.png
Kappa (vertical axis) and Accuracy (horizontal axis) calculated from the same simulated binary data. Each point on the graph is calculated from a pairs of judges randomly rating 10 subjects for having a diagnosis of X or not. Note in this example a Kappa=0 is approximately equivalent to an accuracy=0.5

If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence its magnitude, which makes interpretation of a given magnitude problematic. As Sim and Wright noted, two important factors are prevalence (are the codes equiprobable or do their probabilities vary) and bias (are the marginal probabilities for the two observers similar or different). Other things being equal, kappas are higher when codes are equiprobable. On the other hand, Kappas are higher when codes are distributed asymmetrically by the two observers. In contrast to probability variations, the effect of bias is greater when Kappa is small than when it is large. [12] :261–262

Another factor is the number of codes. As number of codes increases, kappas become higher. Based on a simulation study, Bakeman and colleagues concluded that for fallible observers, values for kappa were lower when codes were fewer. And, in agreement with Sim & Wrights's statement concerning prevalence, kappas were higher when codes were roughly equiprobable. Thus Bakeman et al. concluded that "no one value of kappa can be regarded as universally acceptable." [13] :357 They also provide a computer program that lets users compute values for kappa specifying number of codes, their probability, and observer accuracy. For example, given equiprobable codes and observers who are 85% accurate, value of kappa are 0.49, 0.60, 0.66, and 0.69 when number of codes is 2, 3, 5, and 10, respectively.

Nonetheless, magnitude guidelines have appeared in the literature. Perhaps the first was Landis and Koch, [14] who characterized values < 0 as indicating no agreement and 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement. This set of guidelines is however by no means universally accepted; Landis and Koch supplied no evidence to support it, basing it instead on personal opinion. It has been noted that these guidelines may be more harmful than helpful. [15] Fleiss's [16] :218 equally arbitrary guidelines characterize kappas over 0.75 as excellent, 0.40 to 0.75 as fair to good, and below 0.40 as poor.

Kappa maximum

Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. The equation for κ maximum is: [17]

where , as usual, ,

k = number of codes, are the row probabilities, and are the column probabilities.

Limitations

Kappa is an index that considers observed agreement with respect to a baseline agreement. However, investigators must consider carefully whether Kappa's baseline agreement is relevant for the particular research question. Kappa's baseline is frequently described as the agreement due to chance, which is only partially correct. Kappa's baseline agreement is the agreement that would be expected due to random allocation, given the quantities specified by the marginal totals of square contingency table. Thus, κ = 0 when the observed allocation is apparently random, regardless of the quantity disagreement as constrained by the marginal totals. However, for many applications, investigators should be more interested in the quantity disagreement in the marginal totals than in the allocation disagreement as described by the additional information on the diagonal of the square contingency table. Thus for many applications, Kappa's baseline is more distracting than enlightening. Consider the following example:

Kappa example Kappa example.png
Kappa example
Comparison 1
Reference
GR
ComparisonG114
R01

The disagreement proportion is 14/16 or 0.875. The disagreement is due to quantity because allocation is optimal. κ is 0.01.

Comparison 2
Reference
GR
ComparisonG01
R114

The disagreement proportion is 2/16 or 0.125. The disagreement is due to allocation because quantities are identical. Kappa is −0.07.

Here, reporting quantity and allocation disagreement is informative while Kappa obscures information. Furthermore, Kappa introduces some challenges in calculation and interpretation because Kappa is a ratio. It is possible for Kappa's ratio to return an undefined value due to zero in the denominator. Furthermore, a ratio does not reveal its numerator nor its denominator. It is more informative for researchers to report disagreement in two components, quantity and allocation. These two components describe the relationship between the categories more clearly than a single summary statistic. When predictive accuracy is the goal, researchers can more easily begin to think about ways to improve a prediction by using two components of quantity and allocation, rather than one ratio of Kappa. [2]

Some researchers have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can make it unreliable for measuring agreement in situations such as the diagnosis of rare diseases. In these situations, κ tends to underestimate the agreement on the rare category. [18] For this reason, κ is considered an overly conservative measure of agreement. [19] Others [20] [ citation needed ] contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess—a very unrealistic scenario.

Scott's Pi

A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in terms of how pe is calculated.

Fleiss' kappa

Note that Cohen's kappa measures agreement between two raters only. For a similar measure of agreement (Fleiss' kappa) used when there are more than two raters, see Fleiss (1971). The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic, not Cohen's kappa. Kappa is also used to compare performance in machine learning, but the directional version known as Informedness or Youden's J statistic is argued to be more appropriate for supervised learning. [21]

Weighted kappa

The weighted kappa allows disagreements to be weighted differently [22] and is especially useful when codes are ordered. [9] :66 Three matrices are involved, the matrix of observed scores, the matrix of expected scores based on chance agreement, and the weight matrix. Weight matrix cells located on the diagonal (upper-left to bottom-right) represent agreement and thus contain zeros. Off-diagonal cells contain weights indicating the seriousness of that disagreement. Often, cells one off the diagonal are weighted 1, those two off 2, etc.

The equation for weighted κ is:

where k=number of codes and , , and are elements in the weight, observed, and expected matrices, respectively. When diagonal cells contain weights of 0 and all off-diagonal cells weights of 1, this formula produces the same value of kappa as the calculation given above.

See also

Related Research Articles

Binomial distribution Probability distribution

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success or failure. A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

Exponential distribution Probability distribution

In probability theory and statistics, the exponential distribution is the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.

Geometric distribution Probability distribution

In probability theory and statistics, the geometric distribution is either one of two discrete probability distributions:

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

Weibull distribution Continuous probability distribution

In probability theory and statistics, the Weibull distribution is a continuous probability distribution. It is named after Swedish mathematician Waloddi Weibull, who described it in detail in 1951, although it was first identified by Fréchet (1927) and first applied by Rosin & Rammler (1933) to describe a particle size distribution.

In probability theory and statistics, the cumulantsκn of a probability distribution are a set of quantities that provide an alternative to the moments of the distribution. The moments determine the cumulants in the sense that any two probability distributions whose moments are identical will have identical cumulants as well, and similarly the cumulants determine the moments.

The Gram–Charlier A series, and the Edgeworth series are series that approximate a probability distribution in terms of its cumulants. The series are the same; but, the arrangement of terms differ. The key idea of these expansions is to write the characteristic function of the distribution whose probability density function f is to be approximated in terms of the characteristic function of a distribution with known and suitable properties, and to recover f through the inverse Fourier transform.

In probability and statistics, a circular distribution or polar distribution is a probability distribution of a random variable whose values are angles, usually taken to be in the range [0, 2π). A circular distribution is often a continuous probability distribution, and hence has a probability density, but such distributions can also be discrete, in which case they are called circular lattice distributions. Circular distributions can be used even when the variables concerned are not explicitly angles: the main consideration is that there is not usually any real distinction between events occurring at the lower or upper end of the range, and the division of the range could notionally be made at any point.

von Mises distribution Probability distribution on the circle

In probability theory and directional statistics, the von Mises distribution is a continuous probability distribution on the circle. It is a close approximation to the wrapped normal distribution, which is the circular analogue of the normal distribution. A freely diffusing angle on a circle is a wrapped normally distributed random variable with an unwrapped variance that grows linearly in time. On the other hand, the von Mises distribution is the stationary distribution of a drift and diffusion process on the circle in a harmonic potential, i.e. with a preferred orientation. The von Mises distribution is the maximum entropy distribution for circular data when the real and imaginary parts of the first circular moment are specified. The von Mises distribution is a special case of the von Mises–Fisher distribution on the N-dimensional sphere.

A quasiprobability distribution is a mathematical object similar to a probability distribution but which relaxes some of Kolmogorov's axioms of probability theory. Although quasiprobabilities share several of general features with ordinary probabilities, such as, crucially, the ability to yield expectation values with respect to the weights of the distribution, they all violate the σ-additivity axiom, because regions integrated under them do not represent probabilities of mutually exclusive states. To compensate, some quasiprobability distributions also counterintuitively have regions of negative probability density, contradicting the first axiom. Quasiprobability distributions arise naturally in the study of quantum mechanics when treated in phase space formulation, commonly used in quantum optics, time-frequency analysis, and elsewhere.

In directional statistics, the von Mises–Fisher distribution, is a probability distribution on the -sphere in . If the distribution reduces to the von Mises distribution on the circle.

Fleiss' kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters when assigning categorical ratings to a number of items or classifying items. This contrasts with other kappas such as Cohen's kappa, which only work when assessing the agreement between not more than two raters or the intra-rater reliability. The measure calculates the degree of agreement in classification over that which would be expected by chance.

In statistics, inter-rater reliability is the degree of agreement among raters. It is a score of how much homogeneity or consensus exists in the ratings given by various judges.

Scott's pi is a statistic for measuring inter-rater reliability for nominal data in communication studies. Textual entities are annotated with categories by different annotators, and various measures are used to assess the extent of agreement between the annotators, one of which is Scott's pi. Since automatically annotating text is a popular problem in natural language processing, and the goal is to get the computer program that is being developed to agree with the humans in the annotations it creates, assessing the extent to which humans agree with each other is important for establishing a reasonable upper limit on computer performance.

Intraclass correlation Descriptive statistic

In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC), is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures it operates on data structured as groups, rather than data structured as paired observations.

Youden's J statistic is a single statistic that captures the performance of a dichotomous diagnostic test. Informedness is its generalization to the multiclass case and estimates the probability of an informed decision.

In probability and statistics, the Tweedie distributions are a family of probability distributions which include the purely continuous normal, gamma and Inverse Gaussian distributions, the purely discrete scaled Poisson distribution, and the class of compound Poisson–gamma distributions which have positive mass at zero, but are otherwise continuous. Tweedie distributions are a special case of exponential dispersion models and are often used as distributions for generalized linear models.

Krippendorff's alpha coefficient, named after academic Klaus Krippendorff, is a statistical measure of the agreement achieved when coding a set of units of analysis. Since the 1970s, alpha has been used in content analysis where textual units are categorized by trained readers, in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychological testing where alternative tests of the same phenomena need to be compared, or in observational studies where unstructured happenings are recorded for subsequent analysis.

Asymmetric Laplace distribution

In probability theory and statistics, the asymmetric Laplace distribution (ALD) is a continuous probability distribution which is a generalization of the Laplace distribution. Just as the Laplace distribution consists of two exponential distributions of equal scale back-to-back about x = m, the asymmetric Laplace consists of two exponential distributions of unequal scale back to back about x = m, adjusted to assure continuity and normalization. The difference of two variates exponentially distributed with different means and rate parameters will be distributed according to the ALD. When the two rate parameters are equal, the difference will be distributed according to the Laplace distribution.

Wrapped asymmetric Laplace distribution

In probability theory and directional statistics, a wrapped asymmetric Laplace distribution is a wrapped probability distribution that results from the "wrapping" of the asymmetric Laplace distribution around the unit circle. For the symmetric case, the distribution becomes a wrapped Laplace distribution. The distribution of the ratio of two circular variates (Z) from two different wrapped exponential distributions will have a wrapped asymmetric Laplace distribution. These distributions find application in stochastic modelling of financial data.

References

  1. 1 2 McHugh, Mary L. (2012). "Interrater reliability: The kappa statistic". Biochemia Medica. 22 (3): 276–282. doi:10.11613/bm.2012.031. PMC   3900052 . PMID   23092060.
  2. 1 2 Pontius, Robert; Millones, Marco (2011). "Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment". International Journal of Remote Sensing. 32 (15): 4407–4429. Bibcode:2011IJRS...32.4407P. doi:10.1080/01431161.2011.552923. S2CID   62883674.
  3. Galton, F. (1892) Finger Prints Macmillan, London.
  4. Smeeton, N.C. (1985). "Early History of the Kappa Statistic". Biometrics. 41 (3): 795. JSTOR   2531300.
  5. Cohen, Jacob (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement. 20 (1): 37–46. doi:10.1177/001316446002000104. hdl: 1942/28116 . S2CID   15926286.
  6. Sim, Julius; Wright, Chris C. (2005). "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements". Physical Therapy. 85 (3): 257–268. doi: 10.1093/ptj/85.3.257 . ISSN   1538-6724. PMID   15733050.
  7. Chicco D., Warrens M.J., Jurman G. (June 2021). "The Matthews correlation coefficient (MCC) is more informative than Cohen's Kappa and Brier score in binary classification assessment". IEEE Access. 9: 78368 - 78381. doi: 10.1109/ACCESS.2021.3084050 .CS1 maint: uses authors parameter (link)
  8. Kilem Gwet (May 2002). "Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity" (PDF). Statistical Methods for Inter-Rater Reliability Assessment. 2: 1–10. Archived from the original (PDF) on 2011-07-07. Retrieved 2011-02-02.
  9. 1 2 Bakeman, R.; Gottman, J.M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.). Cambridge, UK: Cambridge University Press. ISBN   978-0-521-27593-4.
  10. Fleiss, J.L.; Cohen, J.; Everitt, B.S. (1969). "Large sample standard errors of kappa and weighted kappa". Psychological Bulletin. 72 (5): 323–327. doi:10.1037/h0028106.
  11. Robinson, B.F; Bakeman, R. (1998). "ComKappa: A Windows 95 program for calculating kappa and related statistics". Behavior Research Methods, Instruments, and Computers. 30 (4): 731–732. doi: 10.3758/BF03209495 .
  12. Sim, J; Wright, C. C (2005). "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements". Physical Therapy. 85 (3): 257–268. doi: 10.1093/ptj/85.3.257 . PMID   15733050.
  13. Bakeman, R.; Quera, V.; McArthur, D.; Robinson, B. F. (1997). "Detecting sequential patterns and determining their reliability with fallible observers". Psychological Methods. 2 (4): 357–370. doi:10.1037/1082-989X.2.4.357.
  14. Landis, J.R.; Koch, G.G. (1977). "The measurement of observer agreement for categorical data". Biometrics. 33 (1): 159–174. doi:10.2307/2529310. JSTOR   2529310. PMID   843571.
  15. Gwet, K. (2010). "Handbook of Inter-Rater Reliability (Second Edition)" ISBN   978-0-9708062-2-2 [ page needed ]
  16. Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. ISBN   978-0-471-26370-8.
  17. Umesh, U. N.; Peterson, R.A.; Sauber M. H. (1989). "Interjudge agreement and the maximum value of kappa". Educational and Psychological Measurement. 49 (4): 835–850. doi:10.1177/001316448904900407. S2CID   123306239.
  18. Viera, Anthony J.; Garrett, Joanne M. (2005). "Understanding interobserver agreement: the kappa statistic". Family Medicine. 37 (5): 360–363. PMID   15883903.
  19. Strijbos, J.; Martens, R.; Prins, F.; Jochems, W. (2006). "Content analysis: What are they talking about?". Computers & Education. 46: 29–48. CiteSeerX   10.1.1.397.5780 . doi:10.1016/j.compedu.2005.04.002.
  20. Uebersax, JS. (1987). "Diversity of decision-making models and the measurement of interrater agreement" (PDF). Psychological Bulletin. 101: 140–146. CiteSeerX   10.1.1.498.4965 . doi:10.1037/0033-2909.101.1.140. Archived from the original (PDF) on 2016-03-03. Retrieved 2010-10-16.
  21. Powers, David M. W. (2012). "The Problem with Kappa" (PDF). Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop. Archived from the original (PDF) on 2016-05-18. Retrieved 2012-07-20.
  22. Cohen, J. (1968). "Weighed kappa: Nominal scale agreement with provision for scaled disagreement or partial credit". Psychological Bulletin. 70 (4): 213–220. doi:10.1037/h0026256. PMID   19673146.

Further reading