Krippendorff's alpha

Krippendorff's alpha coefficient, [1] named after academic Klaus Krippendorff, is a statistical measure of the agreement achieved when coding a set of units of analysis. Since the 1970s, alpha has been used in content analysis where textual units are categorized by trained readers, in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychological testing where alternative tests of the same phenomena need to be compared, or in observational studies where unstructured happenings are recorded for subsequent analysis.

Krippendorff's alpha generalizes several known statistics, often called measures of inter-coder agreement, inter-rater reliability, or reliability of coding given sets of units (as distinct from unitizing), but it also distinguishes itself from statistics that are called reliability coefficients yet are unsuitable to the particulars of coding data generated for subsequent analysis.

Krippendorff's alpha is applicable to any number of coders, each assigning one value to one unit of analysis, to incomplete (missing) data, to any number of values available for coding a variable, to binary, nominal, ordinal, interval, ratio, polar, and circular metrics (note that this is not a metric in the mathematical sense, but often the square of a mathematical metric, see levels of measurement), and it adjusts itself to small sample sizes of the reliability data. The virtue of a single coefficient with these variations is that computed reliabilities are comparable across any numbers of coders, values, different metrics, and unequal sample sizes.

Software for calculating Krippendorff's alpha is available. [2] [3] [4] [5] [6] [7] [8] [9] [10]

Reliability data

Reliability data are generated in a situation in which m ≥ 2 jointly instructed (e.g., by a code book) but independently working coders assign any one of a set of values 1,...,V to a common set of N units of analysis. In their canonical form, reliability data are tabulated in an m-by-N matrix containing the values v_ij that coder c_i has assigned to unit u_j. Define m_j as the number of values assigned to unit j across all coders c. When data are incomplete, m_j may be less than m. Reliability data require that values be pairable, i.e., m_j ≥ 2. The total number of pairable values is n ≤ mN.

To help clarify, here is what the canonical form looks like, in the abstract:

        u1     u2     u3     ...    uN
c1      v11    v12    v13    ...    v1N
c2      v21    v22    v23    ...    v2N
c3      v31    v32    v33    ...    v3N
...     ...    ...    ...    ...    ...
cm      vm1    vm2    vm3    ...    vmN
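As a small illustration, the canonical form can be held in an ordinary nested list with a placeholder for missing values. The following Python sketch uses hypothetical data and names; it is not taken from any of the packages cited above:

```python
# Hypothetical canonical-form reliability data: m = 3 coders (rows) by N = 4 units (columns).
# None marks a missing value; a unit is pairable only when at least two coders coded it.
reliability_data = [
    [1,    None, 2,    1   ],  # coder c1
    [1,    2,    2,    None],  # coder c2
    [None, 2,    2,    1   ],  # coder c3
]

# m_j: the number of values assigned to unit j across all coders
m_j = [sum(v is not None for v in column) for column in zip(*reliability_data)]

# n: the total number of pairable values (only units with m_j >= 2 contribute)
n = sum(m for m in m_j if m >= 2)

print(m_j)  # [2, 2, 3, 2]
print(n)    # 9, which is at most m * N = 12
```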

General form of alpha

We denote by R the set of all possible responses an observer can give. The responses of all observers for an example are called a unit (it forms a multiset). We denote the multiset with these units as the items, U.

Alpha is given by:

\alpha = 1 - \frac{D_o}{D_e}

where D_o is the disagreement observed and D_e is the disagreement expected by chance.

D_o = \frac{1}{n} \sum_{u \in U} \frac{1}{m_u - 1} \sum_{(c,k) \in P(u,2)} \delta(c,k)

where \delta(c,k) is a metric function (note that this is not a metric in the mathematical sense, but often the square of a mathematical metric, see below), n is the total number of pairable elements, m_u is the number of items in unit u, P(u,2) is the set of ordered pairs (2-permutations) of the values in unit u, so that |P(u,2)| = m_u(m_u - 1) is the number of pairs in unit u, and P is the permutation function. Rearranging terms, the sum can be interpreted in a conceptual way as the weighted average of the disagreements of the individual units, weighted by the number of coders assigned to unit u:

D_o = \frac{1}{n} \sum_{u \in U} m_u \bar{\delta}_u

where \bar{\delta}_u is the mean of the numbers \delta(c,k) (here c and k define pairable elements within unit u). Note that in the case m_u = m for all u, D_o is just the average of all the numbers \delta(c,k) with c \neq k. There is also an interpretation of D_o as the (weighted) average observed distance from the diagonal.

D_e = \frac{1}{n(n-1)} \sum_{c \in R} \sum_{k \in R} \delta(c,k)\, P_{ck}

where P_{ck} is the number of ways the (ordered) pair (c,k) can be made from the multiset of all n pairable responses. This can be seen to be the average distance from the diagonal of all possible pairs of responses that could be derived from the multiset of all observations.

The above is equivalent to the usual form of α once it has been simplified algebraically. [11]
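To make the general form concrete, here is a minimal Python sketch that computes D_o, D_e, and alpha directly from the ordered pairs of pairable values, given a difference function delta. It is written for the formulas above and is not one of the published implementations cited earlier:

```python
from itertools import permutations

def krippendorff_alpha(data, delta):
    """data: m-coders-by-N-units nested list with None for missing values.
    delta: difference function delta(c, k) for the chosen metric."""
    # Keep only pairable units (at least two values per unit).
    units = [[v for v in column if v is not None] for column in zip(*data)]
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable values

    # Observed disagreement D_o: per-unit sums over ordered pairs,
    # each unit divided by m_u - 1, averaged over all n pairable values.
    D_o = sum(sum(delta(c, k) for c, k in permutations(u, 2)) / (len(u) - 1)
              for u in units) / n

    # Expected disagreement D_e: average difference over all ordered pairs
    # drawn from the multiset of all n pairable values.
    pooled = [v for u in units for v in u]
    D_e = sum(delta(c, k) for c, k in permutations(pooled, 2)) / (n * (n - 1))

    return 1.0 - D_o / D_e

# Example difference function for nominal data.
nominal = lambda c, k: 0.0 if c == k else 1.0
```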

One interpretation of Krippendorff's alpha is:

α = 1 indicates perfect reliability.
α = 0 indicates the complete absence of reliability. Units and the values assigned to them are statistically unrelated.
α < 0 when disagreements are systematic and exceed what can be expected by chance.

In this general form, the disagreements D_o and D_e may be conceptually transparent but are computationally inefficient. They can be simplified algebraically, especially when expressed in terms of the visually more instructive coincidence matrix representation of the reliability data.

Coincidence matrices

A coincidence matrix cross tabulates the n pairable values from the canonical form of the reliability data into a v-by-v square matrix, where v is the number of values available in a variable. Unlike contingency matrices, familiar in association and correlation statistics, which tabulate pairs of values (cross tabulation), a coincidence matrix tabulates all pairable values. A coincidence matrix omits references to coders and is symmetrical around its diagonal, which contains all perfect matches, v_iu = v_i'u for two coders i and i', across all units u. The matrix of observed coincidences contains frequencies:

o_{ck} = \sum_u \frac{\text{number of ordered } c\text{-}k \text{ pairs in unit } u}{m_u - 1} = \sum_u \sum_{i \neq i'} \frac{I(v_{iu} = c)\, I(v_{i'u} = k)}{m_u - 1},

omitting unpaired values, where I(∘) = 1 if ∘ is true, and 0 otherwise.

Because a coincidence matrix tabulates all pairable values and its contents sum to the total n, when three or more coders are involved, o_ck may be fractions.

The matrix of expected coincidences contains frequencies:

e_{ck} = \frac{n_c\,(n_k - I(c = k))}{n - 1} = \frac{1}{n - 1} \begin{cases} n_c (n_c - 1) & \text{if } c = k \\ n_c\, n_k & \text{if } c \neq k \end{cases}

which sum to the same n_c, n_k, and n as does o_ck. In terms of these coincidences, Krippendorff's alpha becomes:

\alpha = 1 - \frac{D_o}{D_e} = 1 - \frac{\sum_c \sum_k \delta(c,k)\, o_{ck}}{\sum_c \sum_k \delta(c,k)\, e_{ck}}
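The same quantities can be tabulated once into coincidence matrices and reused for any difference function. A sketch under the same assumptions as the earlier one (the helper names are illustrative):

```python
from collections import Counter
from itertools import permutations

def coincidence_matrices(data):
    """Return observed coincidences o_ck, expected coincidences e_ck,
    marginal frequencies n_v, and n from an m-by-N reliability matrix."""
    units = [[v for v in column if v is not None] for column in zip(*data)]
    units = [u for u in units if len(u) >= 2]   # pairable units only
    n = sum(len(u) for u in units)

    o = Counter()                               # observed coincidences o_ck
    for u in units:
        for c, k in permutations(u, 2):         # all ordered pairs within the unit
            o[(c, k)] += 1.0 / (len(u) - 1)

    n_v = Counter(v for u in units for v in u)  # marginal frequencies n_c
    e = {(c, k): n_c * (n_k - (c == k)) / (n - 1)   # expected coincidences e_ck
         for c, n_c in n_v.items() for k, n_k in n_v.items()}
    return o, e, n_v, n

def alpha_from_coincidences(o, e, delta):
    # The common factor 1/n cancels, so the raw coincidence sums suffice.
    observed = sum(delta(c, k) * w for (c, k), w in o.items())
    expected = sum(delta(c, k) * w for (c, k), w in e.items())
    return 1.0 - observed / expected
```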

Difference functions

Difference functions [12] between values v and v' reflect the metric properties (levels of measurement) of their variable.

In general:

\delta(v, v) = 0 \quad \text{and} \quad \delta(v, v') = \delta(v', v) \ge 0

In particular:

For nominal data \delta_\text{nominal}(v, v')^2 = 0 if v = v' and 1 if v \neq v', where v and v' serve as names.
For ordinal data \delta_\text{ordinal}(v, v')^2 = \left( \sum_{g=v}^{v'} n_g - \frac{n_v + n_{v'}}{2} \right)^2, where v and v' are ranks.
For interval data \delta_\text{interval}(v, v')^2 = (v - v')^2, where v and v' are interval scale values.
For ratio data \delta_\text{ratio}(v, v')^2 = \left( \frac{v - v'}{v + v'} \right)^2, where v and v' are absolute values.
For polar data \delta_\text{polar}(v, v')^2 = \frac{(v - v')^2}{(v + v' - 2 v_\text{min})(2 v_\text{max} - v - v')}, where v_min and v_max define the end points of the polar scale.
For circular data \delta_\text{circular}(v, v')^2 = \left( \sin \left( 180^\circ \frac{v - v'}{U} \right) \right)^2, where the sine function is expressed in degrees and U is the circumference or the range of values in a circle or loop before they repeat. For equal interval circular metrics, the smallest and largest integer values of this metric are adjacent to each other and U = v_largest − v_smallest + 1.
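These difference functions translate directly into code. A Python sketch of several of them follows; the ordinal function needs the marginal frequencies n_v from the coincidence matrix, which are passed in explicitly here, and all names are illustrative:

```python
import math

def delta_nominal(v, w):
    return 0.0 if v == w else 1.0

def delta_interval(v, w):
    return float(v - w) ** 2

def delta_ratio(v, w):
    return 0.0 if v == w else ((v - w) / (v + w)) ** 2

def delta_ordinal(v, w, n_freq):
    """n_freq maps each (integer) rank to its marginal frequency n_g."""
    lo, hi = min(v, w), max(v, w)
    between = sum(n_freq.get(g, 0) for g in range(lo, hi + 1))
    return (between - (n_freq.get(v, 0) + n_freq.get(w, 0)) / 2.0) ** 2

def delta_polar(v, w, v_min, v_max):
    if v == w:
        return 0.0
    return (v - w) ** 2 / ((v + w - 2 * v_min) * (2 * v_max - v - w))

def delta_circular(v, w, U):
    """Sine argument in degrees; U is the range of values before they repeat."""
    return math.sin(math.radians(180.0 * (v - w) / U)) ** 2
```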

Significance

Inasmuch as mathematical statements of the statistical distribution of alpha are always only approximations, it is preferable to obtain alpha's distribution by bootstrapping. [13] [14] Alpha's distribution gives rise to two indices:

the confidence intervals of a computed alpha at various levels of statistical significance, and
the probability of failing to reach a chosen minimum alpha required for the data to be taken as sufficiently reliable.

The minimum acceptable alpha coefficient should be chosen according to the importance of the conclusions to be drawn from imperfect data. When the costs of mistaken conclusions are high, the minimum alpha needs to be set high as well. In the absence of knowledge of the risks of drawing false conclusions from unreliable data, social scientists commonly rely on data with reliabilities α ≥ 0.800, consider data with 0.800 > α ≥ 0.667 only to draw tentative conclusions, and discard data whose agreement measures α < 0.667. [15]
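A minimal bootstrap sketch, reusing the krippendorff_alpha function from the earlier sketch: it resamples units (columns) with replacement, which is a generic bootstrap rather than the specialised algorithm described by Hayes & Krippendorff (2007). The 95% interval and the quantity q (the estimated probability of falling below a chosen minimum alpha) are read off the resampled distribution:

```python
import random

def bootstrap_alpha(data, delta, n_boot=1000, alpha_min=0.667, seed=1):
    rng = random.Random(seed)
    columns = list(zip(*data))            # units as tuples of coder values
    estimates = []
    for _ in range(n_boot):
        sample = [columns[rng.randrange(len(columns))] for _ in columns]
        resampled = [list(row) for row in zip(*sample)]   # back to coders-by-units
        try:
            estimates.append(krippendorff_alpha(resampled, delta))
        except ZeroDivisionError:
            continue                      # skip degenerate resamples with no expected disagreement
    estimates.sort()
    k = len(estimates)
    ci = (estimates[int(0.025 * k)], estimates[int(0.975 * k)])   # ~95% interval
    q = sum(a < alpha_min for a in estimates) / k                 # P(alpha < chosen minimum)
    return ci, q
```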

A computational example

Let the canonical form of reliability data be a 3-coder-by-15 unit matrix with 45 cells:

Units u:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Coder A:  *  *  *  *  *  3  4  1  2  1  1  3  3  *  3
Coder B:  1  *  2  1  3  3  4  3  *  *  *  *  *  *  *
Coder C:  *  *  2  1  3  4  4  *  2  1  1  3  3  *  4

Suppose “*” indicates a default category like “cannot code,” “no answer,” or “lacking an observation.” Then, * provides no information about the reliability of data in the four values that matter. Note that units 2 and 14 contain no information and unit 1 contains only one value, which is not pairable within that unit. Thus, these reliability data consist not of mN = 45 but of n = 26 pairable values, located not in N = 15 but in 12 multiply coded units.
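These counts can be checked mechanically. A small Python sketch, encoding “*” as None (the encoding is not part of the original example):

```python
# The 3-coder-by-15-unit example matrix; None encodes the "*" entries.
A = [None, None, None, None, None, 3, 4, 1, 2, 1, 1, 3, 3, None, 3]
B = [1, None, 2, 1, 3, 3, 4, 3, None, None, None, None, None, None, None]
C = [None, None, 2, 1, 3, 4, 4, None, 2, 1, 1, 3, 3, None, 4]

units = [[v for v in column if v is not None] for column in zip(A, B, C)]
multiply_coded = [u for u in units if len(u) >= 2]

print(len(multiply_coded))                   # 12 multiply coded units
print(sum(len(u) for u in multiply_coded))   # n = 26 pairable values
```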

The coincidence matrix for these data would be constructed as follows:

o_11 = {in u=4}: 2/(2−1) + {in u=10}: 2/(2−1) + {in u=11}: 2/(2−1) = 6
o_13 = {in u=8}: 1/(2−1) = 1 = o_31
o_22 = {in u=3}: 2/(2−1) + {in u=9}: 2/(2−1) = 4
o_33 = {in u=5}: 2/(2−1) + {in u=6}: 2/(3−1) + {in u=12}: 2/(2−1) + {in u=13}: 2/(2−1) = 7
o_34 = {in u=6}: 2/(3−1) + {in u=15}: 1/(2−1) = 2 = o_43
o_44 = {in u=7}: 6/(3−1) = 3
Values v or v′:    1    2    3    4   n_v
Value 1            6    .    1    .     7
Value 2            .    4    .    .     4
Value 3            1    .    7    2    10
Value 4            .    .    2    3     5
Frequency n_v′     7    4   10    5    26

In terms of the entries in this coincidence matrix, Krippendorff's alpha may be calculated from:

\alpha = 1 - \frac{(n - 1) \sum_c \sum_{k > c} \delta(c,k)\, o_{ck}}{\sum_c \sum_{k > c} \delta(c,k)\, n_c n_k}

For convenience, because products with \delta(v, v) = 0 vanish and \delta(c,k) = \delta(k,c), only the entries in one of the off-diagonal triangles of the coincidence matrix enter the sums; for these data the only non-zero such entries are o_13 = 1 and o_34 = 2.

Considering that all \delta(c,k) = 1 when c \neq k for nominal data, the above expression yields:

\alpha_\text{nominal} = 1 - \frac{(26 - 1)(o_{13} + o_{34})}{n_1 n_2 + n_1 n_3 + n_1 n_4 + n_2 n_3 + n_2 n_4 + n_3 n_4} = 1 - \frac{25 \times 3}{243} \approx 0.691

With \delta(c,k) = (c - k)^2 for interval data the above expression yields:

\alpha_\text{interval} = 1 - \frac{(26 - 1)\,(2^2 o_{13} + 1^2 o_{34})}{1^2 n_1 n_2 + 2^2 n_1 n_3 + 3^2 n_1 n_4 + 1^2 n_2 n_3 + 2^2 n_2 n_4 + 1^2 n_3 n_4} = 1 - \frac{150}{793} \approx 0.811

Here, because disagreements happen to occur largely among neighboring values, visualized by their occurring close to the diagonal of the coincidence matrix, α_interval takes this condition into account while α_nominal does not. When the observed frequencies o_{v≠v'} are on average proportional to the expected frequencies e_{v≠v'}, α_interval = α_nominal.
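For a self-contained check of these two coefficients, the coincidence-matrix form of alpha given above can be evaluated directly from the non-zero off-diagonal coincidences and the marginal frequencies of this example (variable names are illustrative):

```python
# Non-zero entries in one off-diagonal triangle of the coincidence matrix,
# plus the marginal frequencies, as tabulated in the example above.
o = {(1, 3): 1.0, (3, 4): 2.0}
n_v = {1: 7, 2: 4, 3: 10, 4: 5}
n = 26

def alpha(delta):
    numerator = (n - 1) * sum(delta(c, k) * w for (c, k), w in o.items())
    denominator = sum(delta(c, k) * n_v[c] * n_v[k]
                      for c in n_v for k in n_v if c < k)
    return 1.0 - numerator / denominator

print(round(alpha(lambda c, k: 1.0), 3))            # nominal: 0.691
print(round(alpha(lambda c, k: (c - k) ** 2), 3))   # interval: 0.811
```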

Comparing alpha coefficients across different metrics can provide clues to how coders conceptualize the metric of a variable.

Alpha's embrace of other statistics

Krippendorff's alpha brings several known statistics under a common umbrella; each of them has its own limitations but no additional virtues.

Krippendorff's alpha is more general than any of these special purpose coefficients. It adjusts to varying sample sizes and affords comparisons across a wide variety of reliability data, mostly ignored by the familiar measures.

Coefficients incompatible with alpha and the reliability of coding

Semantically, reliability is the ability to rely on something, here on coded data for subsequent analysis. When a sufficiently large number of coders agree perfectly on what they have read or observed, relying on their descriptions is a safe bet. Judgments of this kind hinge on the number of coders duplicating the process and how representative the coded units are of the population of interest. Problems of interpretation arise when agreement is less than perfect, especially when reliability is absent.

Naming a statistic as one of agreement, reproducibility, or reliability does not make it a valid index of whether one can rely on coded data in subsequent decisions. Its mathematical structure must fit the process of coding units into a system of analyzable terms.

Notes

  1. Krippendorff, K. (2013) pp. 221–250 describes the mathematics of alpha and its use in content analysis since 1969.
  2. Hayes, A. F. & Krippendorff, K. (2007) describe and provide SPSS and SAS macros for computing alpha, its confidence limits and the probability of failing to reach a chosen minimum.
  3. Reference manual of the irr package containing the kripp.alpha() function for the platform-independent statistics package R
  4. The Alpha resources page.
  5. Matlab code to compute Krippendorff's alpha.
  6. Python code to compute Krippendorff's alpha.
  7. Python code for Krippendorff's alpha fast computation.
  8. Several user-written additions to the commercial software Stata are available.
  9. Open Source Python implementation supporting Dataframes
  10. Marzi, Giacomo; Balzano, Marco; Marchiori, Davide (2024). "K-Alpha Calculator–Krippendorff's Alpha Calculator: A user-friendly tool for computing Krippendorff's Alpha inter-rater reliability coefficient". MethodsX. 12: 102545. doi:10.1016/j.mex.2023.102545. hdl:10278/5046412. ISSN 2215-0161.
  11. Honour, David. "Understanding Krippendorff's Alpha" (PDF).
  12. Krippendorff, K. "Computing Krippendorff's Alpha Reliability." http://repository.upenn.edu/asc_papers/43/
  13. Krippendorff, K. (2004) pp. 237–238
  14. Hayes, A. F. & Krippendorff, K. (2007) Answering the Call for a Standard Reliability Measure for Coding Data
  15. Krippendorff, K. (2004) pp. 241–243
  16. Scott, W. A. (1955)
  17. Fleiss, J. L. (1971)
  18. Cohen, J. (1960)
  19. Siegel, S. & Castellan, N. J. (1988), pp. 284–291.
  20. Spearman, C. E. (1904)
  21. Pearson, K. (1901), Tildesley, M. L. (1921)
  22. Krippendorff, K. (1970)
  23. Cohen, J. (1960)
  24. Krippendorff, K. (1978) raised this issue with Joseph Fleiss
  25. Zwick, R. (1988), Brennan, R. L. & Prediger, D. J. (1981), Krippendorff (1978, 2004).
  26. Nunnally, J. C. & Bernstein, I. H. (1994)
  27. Cronbach, L. J. (1951)
  28. Bennett, E. M., Alpert, R. & Goldstein, A. C. (1954)
  29. Goodman, L. A. & Kruskal, W. H. (1954) p. 758
  30. Lin, L. I. (1989)
