Inter-rater reliability

In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, inter-coder reliability, and so on) is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.

Assessment tools that rely on ratings must exhibit good inter-rater reliability; otherwise, they are not valid tests.

There are a number of statistics that can be used to determine inter-rater reliability, and different statistics are appropriate for different types of measurement. Some options are the joint probability of agreement, Cohen's kappa, Scott's pi and Fleiss' kappa; or inter-rater correlation, the concordance correlation coefficient, the intra-class correlation, and Krippendorff's alpha.

Concept

There are several operational definitions of "inter-rater reliability," reflecting different viewpoints about what constitutes reliable agreement between raters. [1] There are three operational definitions of agreement:

  1. Reliable raters agree with the "official" rating of a performance.
  2. Reliable raters agree with each other about the exact ratings to be awarded.
  3. Reliable raters agree about which performance is better and which is worse.

These combine with two operational definitions of behavior:

  1. Reliable raters are automatons, behaving like "rating machines". This category includes the rating of essays by computer. [2] This behavior can be evaluated by generalizability theory.
  2. Reliable raters behave like independent witnesses. They demonstrate their independence by disagreeing slightly. This behavior can be evaluated by the Rasch model.

Statistics

Joint probability of agreement

The joint probability of agreement is the simplest and the least robust measure. It is estimated as the percentage of the time the raters agree in a nominal or categorical rating system. It does not take into account the fact that agreement may happen solely based on chance. There is some question whether there is a need to 'correct' for chance agreement; some suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions. [3]
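
As an illustration, the joint probability of agreement for two raters can be computed directly from their paired codes. The following is a minimal Python sketch using made-up ratings; it simply counts the matching pairs:

```python
# A minimal sketch of the joint (percentage) agreement between two raters.
# `rater_a` and `rater_b` are hypothetical lists of nominal codes, one per item.
rater_a = ["cat", "dog", "dog", "bird", "cat", "dog"]
rater_b = ["cat", "dog", "cat", "bird", "cat", "dog"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = agreements / len(rater_a)
print(f"Joint probability of agreement: {percent_agreement:.2f}")  # 0.83
```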

When the number of categories being used is small (e.g. 2 or 3), the likelihood that two raters agree by pure chance increases dramatically. This is because both raters must confine themselves to the limited number of options available, which affects the overall agreement rate, but not necessarily their propensity for "intrinsic" agreement (an agreement is considered "intrinsic" if it is not due to chance).

Therefore, the joint probability of agreement will remain high even in the absence of any "intrinsic" agreement among raters. A useful inter-rater reliability coefficient is expected (a) to be close to 0 when there is no "intrinsic" agreement and (b) to increase as the "intrinsic" agreement rate improves. Most chance-corrected agreement coefficients achieve the first objective. However, the second objective is not achieved by many known chance-corrected measures. [4]

Kappa statistics

Figure: Comparison of four sets of recommendations for interpreting the level of inter-rater agreement (kappa and intra-class correlation coefficients).

Kappa is a way of measuring agreement or reliability while correcting for how often ratings might agree by chance. Cohen's kappa, [5] which works for two raters, and Fleiss' kappa, [6] an adaptation that works for any fixed number of raters, improve upon the joint probability in that they take into account the amount of agreement that could be expected to occur through chance. The original versions had the same problem as the joint probability in that they treated the data as nominal and assumed the ratings had no natural ordering; if the data actually have a rank (ordinal level of measurement), that information is not fully used by the measurements.
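
As a sketch of the chance correction, Cohen's kappa for two raters can be computed from the observed proportion of agreement and the agreement expected from each rater's marginal frequencies. The Python example below uses made-up ratings:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal codes to the same items."""
    n = len(rater_a)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Two raters screening ten cases as "yes"/"no" (illustrative data).
a = ["yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
b = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "no", "no"]
print(round(cohens_kappa(a, b), 3))  # raw agreement is 0.80, but kappa is 0.583
```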

Later extensions of the approach included versions that could handle "partial credit" and ordinal scales. [7] These extensions converge with the family of intra-class correlations (ICCs), so there is a conceptually related way of estimating reliability for each level of measurement: nominal (kappa), ordinal (an ordinal kappa, or an ICC with stretched assumptions), interval (an ICC, or an ordinal kappa treating the interval scale as ordinal), and ratio (an ICC). There are also variants that can look at agreement by raters across a set of items (e.g., do two interviewers agree about the depression scores for all of the items on the same semi-structured interview for one case?) as well as raters × cases (e.g., how well do two or more raters agree about whether 30 cases have a depression diagnosis, a yes/no nominal variable?).
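
One common form of the "partial credit" idea is a weighted kappa, in which disagreements between distant ordinal categories are penalized more heavily than near misses. The following is a minimal sketch (not necessarily the specific extension of reference [7]); the category list, ratings, and choice of linear weights are illustrative assumptions:

```python
import numpy as np

def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted kappa for ordinal ratings: far-apart disagreements count more."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}   # category -> ordinal position
    observed = np.zeros((k, k))
    for a, b in zip(rater_a, rater_b):
        observed[index[a], index[b]] += 1
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    i, j = np.indices((k, k))
    distance = np.abs(i - j) / (k - 1)
    w = distance if weights == "linear" else distance ** 2   # disagreement weights
    return 1 - (w * observed).sum() / (w * expected).sum()

# Ordinal severity ratings from two hypothetical clinicians.
scale = ["none", "mild", "moderate", "severe"]
a = ["none", "mild", "mild", "moderate", "severe", "moderate"]
b = ["none", "mild", "moderate", "moderate", "severe", "mild"]
print(round(weighted_kappa(a, b, scale), 3))
```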

Like a correlation coefficient, kappa cannot go above +1.0 or below −1.0. Because it is used as a measure of agreement, only positive values would be expected in most situations; negative values would indicate systematic disagreement. Kappa can only achieve very high values when agreement is good and the rate of the target condition is near 50% (because it includes the base rate in the calculation of joint probabilities). Several authorities have offered "rules of thumb" for interpreting the level of agreement, many of which agree in substance even though the wording differs. [8] [9] [10] [11]

Correlation coefficients

Either Pearson's r, Kendall's τ, or Spearman's ρ can be used to measure pairwise correlation among raters using a scale that is ordered. Pearson assumes the rating scale is continuous; the Kendall and Spearman statistics assume only that it is ordinal. If more than two raters are observed, an average level of agreement for the group can be calculated as the mean of the r, τ, or ρ values from each possible pair of raters.
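
For example, an average pairwise correlation for a group of raters can be computed by correlating every pair and taking the mean. The sketch below uses Spearman's ρ from SciPy on made-up ordinal scores; Pearson's r or Kendall's τ could be substituted in the same way:

```python
from itertools import combinations
import numpy as np
from scipy import stats

# Ratings of the same eight items by three raters (illustrative ordinal scores).
ratings = {
    "rater_1": [3, 4, 2, 5, 1, 4, 3, 2],
    "rater_2": [2, 4, 2, 5, 1, 3, 3, 2],
    "rater_3": [3, 5, 1, 4, 2, 4, 3, 1],
}

# Average pairwise Spearman correlation over all rater pairs.
rhos = [stats.spearmanr(ratings[a], ratings[b])[0]
        for a, b in combinations(ratings, 2)]
print(f"Mean pairwise Spearman rho: {np.mean(rhos):.3f}")
```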

Intra-class correlation coefficient

Another way of performing reliability testing is to use the intra-class correlation coefficient (ICC). [12] There are several types of ICC; one is defined as "the proportion of variance of an observation due to between-subject variability in the true scores". [13] The range of the ICC may be between 0.0 and 1.0 (an early definition of the ICC could range between −1 and +1). The ICC will be high when there is little variation between the scores given to each item by the raters, e.g. if all raters give the same or similar scores to each of the items. The ICC is an improvement over Pearson's r and Spearman's ρ, as it takes into account the differences in ratings for individual segments, along with the correlation between raters.
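
As a sketch, a one-way random-effects ICC (ICC(1,1) in the Shrout and Fleiss taxonomy [12]) can be computed from the between-subject and within-subject mean squares of a one-way ANOVA. The data below are illustrative:

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC, often written ICC(1,1): the proportion of
    total variance attributable to differences between subjects rather than
    to rater/measurement error. `scores` is (n_subjects, k_raters)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    subject_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Six subjects each rated by three raters (illustrative data).
data = [[9, 8, 9],
        [6, 5, 6],
        [8, 9, 7],
        [7, 6, 6],
        [10, 9, 10],
        [4, 5, 4]]
print(round(icc_oneway(data), 3))
```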

Limits of agreement

Figure: A Bland–Altman plot.

Another approach to agreement (useful when there are only two raters and the scale is continuous) is to calculate the differences between each pair of the two raters' observations. The mean of these differences is termed bias and the reference interval (mean ± 1.96 × standard deviation) is termed limits of agreement. The limits of agreement provide insight into how much random variation may be influencing the ratings.

If the raters tend to agree, the differences between the raters' observations will be near zero. If one rater is usually higher or lower than the other by a consistent amount, the bias will be different from zero. If the raters tend to disagree, but without a consistent pattern of one rating higher than the other, the mean will be near zero. Confidence limits (usually 95%) can be calculated for both the bias and each of the limits of agreement.

There are several formulae that can be used to calculate limits of agreement. The simple formula, which was given in the previous paragraph and works well for sample sizes greater than 60, [14] is

    limits of agreement = mean difference ± 1.96 × s,

where s is the standard deviation of the differences.

For smaller sample sizes, another common simplification [15] is

    limits of agreement = mean difference ± 2 × s.

However, the most accurate formula (which is applicable for all sample sizes) [14] is

    limits of agreement = mean difference ± t(0.975, n−1) × √(1 + 1/n) × s,

where t(0.975, n−1) is the 97.5th percentile of the t-distribution with n − 1 degrees of freedom and n is the number of paired observations.
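
A minimal Python sketch of the calculation, using made-up paired measurements, computes the bias together with the simple and the t-based limits of agreement described above:

```python
import numpy as np
from scipy import stats

# Paired measurements of the same items by two raters (illustrative data).
rater_a = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.3, 9.5, 10.8])
rater_b = np.array([10.5, 11.2, 10.1, 12.4, 10.6, 11.8, 9.9, 10.6])

diffs = rater_a - rater_b
n = len(diffs)
bias = diffs.mean()
sd = diffs.std(ddof=1)

# Simple limits of agreement (adequate for large samples).
loa_simple = (bias - 1.96 * sd, bias + 1.96 * sd)

# t-based limits that widen appropriately for small samples.
t = stats.t.ppf(0.975, df=n - 1)
half_width = t * sd * np.sqrt(1 + 1 / n)
loa_t = (bias - half_width, bias + half_width)

print(f"bias = {bias:.3f}")
print(f"simple 95% limits of agreement: {loa_simple}")
print(f"t-based 95% limits of agreement: {loa_t}")
```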

Bland and Altman [15] have expanded on this idea by graphing the difference of each point, the mean difference, and the limits of agreement on the vertical axis against the average of the two ratings on the horizontal axis. The resulting Bland–Altman plot demonstrates not only the overall degree of agreement, but also whether the agreement is related to the underlying value of the item. For instance, two raters might agree closely in estimating the size of small items, but disagree about larger items.
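
A plot of this kind can be produced by plotting the pairwise differences against the pairwise means and adding horizontal lines at the bias and the limits of agreement. The following matplotlib sketch uses the same made-up data as above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Paired measurements of the same items by two raters (illustrative data).
rater_a = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.3, 9.5, 10.8])
rater_b = np.array([10.5, 11.2, 10.1, 12.4, 10.6, 11.8, 9.9, 10.6])

means = (rater_a + rater_b) / 2
diffs = rater_a - rater_b
bias = diffs.mean()
sd = diffs.std(ddof=1)

# Differences on the vertical axis against means on the horizontal axis.
plt.scatter(means, diffs)
plt.axhline(bias, linestyle="-", label="bias")
plt.axhline(bias + 1.96 * sd, linestyle="--", label="upper limit of agreement")
plt.axhline(bias - 1.96 * sd, linestyle="--", label="lower limit of agreement")
plt.xlabel("Mean of the two ratings")
plt.ylabel("Difference between ratings")
plt.legend()
plt.show()
```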

When comparing two methods of measurement, it is not only of interest to estimate both bias and limits of agreement between the two methods (inter-rater agreement), but also to assess these characteristics for each method within itself. It might very well be that the agreement between two methods is poor simply because one of the methods has wide limits of agreement while the other has narrow. In this case, the method with the narrow limits of agreement would be superior from a statistical point of view, while practical or other considerations might change this appreciation. What constitutes narrow or wide limits of agreement or large or small bias is a matter of a practical assessment in each case.

Krippendorff's alpha

Krippendorff's alpha [16] [17] is a versatile statistic that assesses the agreement achieved among observers who categorize, evaluate, or measure a given set of objects in terms of the values of a variable. It generalizes several specialized agreement coefficients by accepting any number of observers, being applicable to nominal, ordinal, interval, and ratio levels of measurement, being able to handle missing data, and being corrected for small sample sizes.
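
A minimal sketch of Krippendorff's alpha for nominal data, built from the coincidence matrix of pairable values within each unit, is shown below. It handles missing ratings by skipping them, and the data are illustrative; for real analyses an established implementation is preferable:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of units; each unit is the list of values assigned
    to it by the observers (use None for a missing rating)."""
    # Keep only non-missing values; units with fewer than two ratings
    # carry no information about agreement and are skipped.
    units = [[v for v in unit if v is not None] for unit in units]
    units = [u for u in units if len(u) >= 2]

    # Coincidence matrix o[(c, k)]: each ordered pair of values within a
    # unit contributes 1 / (m_u - 1), where m_u is that unit's value count.
    o = Counter()
    for unit in units:
        m_u = len(unit)
        for a, b in permutations(unit, 2):
            o[(a, b)] += 1.0 / (m_u - 1)

    n_c = Counter()            # marginal totals per category
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())      # total number of pairable values

    # Nominal metric: disagreement is 1 whenever the two values differ.
    d_o = sum(w for (a, b), w in o.items() if a != b) / n
    d_e = sum(n_c[a] * n_c[b]
              for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e > 0 else 1.0

# Example: three coders, five units, one missing rating.
data = [["yes", "yes", "yes"],
        ["yes", "no",  "yes"],
        ["no",  "no",  None],
        ["no",  "no",  "no"],
        ["yes", "no",  "no"]]
print(round(krippendorff_alpha_nominal(data), 3))
```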

Alpha emerged in content analysis where textual units are categorized by trained coders and is used in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychometrics where individual attributes are tested by multiple methods, in observational studies where unstructured happenings are recorded for subsequent analysis, and in computational linguistics where texts are annotated for various syntactic and semantic qualities.

Disagreement

For any task in which multiple raters are useful, raters are expected to disagree about the observed target. By contrast, situations involving unambiguous measurement, such as simple counting tasks (e.g. number of potential customers entering a store), often do not require more than one person performing the measurement.

Measurements involving ambiguity in the characteristics of interest are generally improved by using multiple trained raters. Such measurement tasks often involve subjective judgment of quality. Examples include ratings of a physician's 'bedside manner', evaluation of witness credibility by a jury, and the presentation skill of a speaker.

Variation across raters in the measurement procedures and variability in interpretation of measurement results are two examples of sources of error variance in rating measurements. Clearly stated guidelines for rendering ratings are necessary for reliability in ambiguous or challenging measurement scenarios.

Without scoring guidelines, ratings are increasingly affected by experimenter's bias, that is, a tendency of rating values to drift towards what is expected by the rater. During processes involving repeated measurements, correction of rater drift can be addressed through periodic retraining to ensure that raters understand guidelines and measurement goals.

See also

Related Research Articles

Psychological statistics

Psychological statistics is the application of formulas, theorems, numbers, and laws to psychology. Statistical methods for psychology include the development and application of statistical theory and methods for modeling psychological data. These methods include psychometrics, factor analysis, experimental design, and Bayesian statistics.

In statistics and psychometrics, reliability is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions:

"It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 and 1.00, are usually used to indicate the amount of error in the scores."

In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.

In statistics, a contingency table is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904.

Repeatability or test–retest reliability is the closeness of the agreement between the results of successive measurements of the same measure, when carried out under the same conditions of measurement. In other words, the measurements are taken by a single person or instrument on the same item, under the same conditions, and in a short period of time. A less-than-perfect test–retest reliability causes test–retest variability. Such variability can be caused by, for example, intra-individual variability and inter-observer variability. A measurement may be said to be repeatable when this variation is smaller than a pre-determined acceptance criterion.

Cohen's kappa coefficient is a statistic that is used to measure inter-rater reliability for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance. There is controversy surrounding Cohen's kappa due to the difficulty in interpreting indices of agreement. Some researchers have suggested that it is conceptually simpler to evaluate disagreement between items.

Joseph L. Fleiss (American mathematician)

Joseph L. Fleiss was an American professor of biostatistics at the Columbia University Mailman School of Public Health, where he also served as head of the Division of Biostatistics from 1975 to 1992. He is known for his work in mental health statistics, particularly assessing the reliability of diagnostic classifications, and the measures, models, and control of errors in categorization.

A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative attribute. In the social sciences, particularly psychology, common examples are the Likert response scale and 1-10 rating scales in which a person selects the number that is considered to reflect the perceived quality of a product.

Fleiss' kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters when assigning categorical ratings to a number of items or classifying items. This contrasts with other kappas such as Cohen's kappa, which only work when assessing the agreement between not more than two raters or the intra-rater reliability. The measure calculates the degree of agreement in classification over that which would be expected by chance.

Generalizability theory, or G theory, is a statistical framework for conceptualizing, investigating, and designing reliable observations. It is used to determine the reliability of measurements under specific conditions. It is particularly useful for assessing the reliability of performance assessments. It was originally introduced in Cronbach, L.J., Rajaratnam, N., & Gleser, G.C. (1963).

Differential item functioning (DIF) is a statistical characteristic of an item that shows the extent to which the item might be measuring different abilities for members of separate subgroups. Average item scores for subgroups having the same overall score on the test are compared to determine whether the item is measuring in essentially the same way for all subgroups. The presence of DIF requires review and judgment, and it does not necessarily indicate the presence of bias. DIF analysis provides an indication of unexpected behavior of items on a test. An item does not display DIF merely because people from different groups have different probabilities of giving a certain response; it displays DIF if and only if people from different groups with the same underlying true ability have different probabilities of giving a certain response. Common procedures for assessing DIF are Mantel–Haenszel, item response theory (IRT) based methods, and logistic regression.

Scott's pi is a statistic for measuring inter-rater reliability for nominal data in communication studies. Textual entities are annotated with categories by different annotators, and various measures are used to assess the extent of agreement between the annotators, one of which is Scott's pi. Since automatically annotating text is a popular problem in natural language processing, and the goal is to get the computer program that is being developed to agree with the humans in the annotations it creates, assessing the extent to which humans agree with each other is important for establishing a reasonable upper limit on computer performance.

Intraclass correlation (descriptive statistic)

In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC), is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures, it operates on data structured as groups rather than data structured as paired observations.

In statistics, the concordance correlation coefficient measures the agreement between two variables, e.g., to evaluate reproducibility or for inter-rater reliability.

Youden's J statistic is a single statistic that captures the performance of a dichotomous diagnostic test. (Bookmaker) Informedness is its generalization to the multiclass case and estimates the probability of an informed decision.

Krippendorff's alpha coefficient, named after academic Klaus Krippendorff, is a statistical measure of the agreement achieved when coding a set of units of analysis. Since the 1970s, alpha has been used in content analysis where textual units are categorized by trained readers, in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychological testing where alternative tests of the same phenomena need to be compared, or in observational studies where unstructured happenings are recorded for subsequent analysis.

Cultural consensus theory is an approach to information pooling which supports a framework for the measurement and evaluation of beliefs as cultural; shared to some extent by a group of individuals. Cultural consensus models guide the aggregation of responses from individuals to estimate (1) the culturally appropriate answers to a series of related questions and (2) individual competence in answering those questions. The theory is applicable when there is sufficient agreement across people to assume that a single set of answers exists. The agreement between pairs of individuals is used to estimate individual cultural competence. Answers are estimated by weighting responses of individuals by their competence and then combining responses.

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from the interval scale and ratio scale by not having category widths that represent equal increments of the underlying attribute.

References

  1. Saal, F.E.; Downey, R.G.; Lahey, M.A. (1980). "Rating the ratings: Assessing the psychometric quality of rating data". Psychological Bulletin. 88 (2): 413. doi:10.1037/0033-2909.88.2.413.
  2. Page, E.B.; Petersen, N.S. (1995). "The computer moves into essay grading: Updating the ancient test". Phi Delta Kappan. 76 (7): 561.
  3. Uebersax, J.S. (1987). "Diversity of decision-making models and the measurement of interrater agreement". Psychological Bulletin. 101 (1): 140–146. doi:10.1037/0033-2909.101.1.140. S2CID   39240770.
  4. "Correcting Inter-Rater Reliability for Chance Agreement: Why?". www.agreestat.com. Retrieved 2018-12-26.
  5. Cohen, J. (1960). "A coefficient of agreement for nominal scales" (PDF). Educational and Psychological Measurement. 20 (1): 37–46. doi:10.1177/001316446002000104. S2CID   15926286.
  6. Fleiss, J.L. (1971). "Measuring nominal scale agreement among many raters". Psychological Bulletin. 76 (5): 378–382. doi:10.1037/h0031619.
  7. Landis, J. Richard; Koch, Gary G. (1977). "The Measurement of Observer Agreement for Categorical Data". Biometrics. 33 (1): 159–74. doi:10.2307/2529310. JSTOR   2529310. PMID   843571. S2CID   11077516.
  8. Landis, J. Richard; Koch, Gary G. (1977). "An Application of Hierarchical Kappa-type Statistics in the Assessment of Majority Agreement among Multiple Observers". Biometrics. 33 (2): 363–74. doi:10.2307/2529786. JSTOR   2529786. PMID   884196.
  9. Cicchetti, D. V.; Sparrow, S. A. (1981). "Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior". American Journal of Mental Deficiency. 86 (2): 127–137. PMID   7315877.
  10. Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). Wiley. ISBN   0-471-06428-9. OCLC   926949980.
  11. Regier, Darrel A.; Narrow, William E.; Clarke, Diana E.; Kraemer, Helena C.; Kuramoto, S. Janet; Kuhl, Emily A.; Kupfer, David J. (2013). "DSM-5 Field Trials in the United States and Canada, Part II: Test-Retest Reliability of Selected Categorical Diagnoses". American Journal of Psychiatry. 170 (1): 59–70. doi:10.1176/appi.ajp.2012.12070999. ISSN   0002-953X. PMID   23111466.
  12. Shrout, P.E.; Fleiss, J.L. (1979). "Intraclass correlations: uses in assessing rater reliability". Psychological Bulletin. 86 (2): 420–428. doi:10.1037/0033-2909.86.2.420. PMID   18839484. S2CID   13168820.
  13. Everitt, B.S. (1996). Making sense of statistics in psychology: A second-level course. Oxford University Press. ISBN   978-0-19-852365-9.
  14. Ludbrook, J. (2010). "Confidence in Altman–Bland plots: A critical review of the method of differences". Clinical and Experimental Pharmacology and Physiology. 37 (2): 143–149.
  15. Bland, J. M.; Altman, D. (1986). "Statistical methods for assessing agreement between two methods of clinical measurement". The Lancet. 327 (8476): 307–310.
  16. Krippendorff, Klaus (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). Los Angeles. ISBN   9781506395661. OCLC   1019840156.
  17. Hayes, A.F.; Krippendorff, K. (2007). "Answering the call for a standard reliability measure for coding data". Communication Methods and Measures. 1 (1): 77–89. doi:10.1080/19312450709336664. S2CID   15408575.

Further reading