Intraclass correlation

In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC) [1], is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures it operates on data structured as groups rather than data structured as paired observations.

[Figure: a dot plot showing a dataset with high intraclass correlation; values from the same group tend to be similar.]
[Figure: a dot plot showing a dataset with low intraclass correlation; there is very little tendency for values from the same group to be similar.]

The intraclass correlation is commonly used to quantify the degree to which individuals with a fixed degree of relatedness (e.g. full siblings) resemble each other in terms of a quantitative trait (see heritability). Another prominent application is the assessment of consistency or reproducibility of quantitative measurements made by different observers measuring the same quantity.

Early ICC definition: unbiased but complex formula

The earliest work on intraclass correlations focused on the case of paired measurements, and the first intraclass correlation (ICC) statistics to be proposed were modifications of the interclass correlation (Pearson correlation).

Consider a data set consisting of N paired data values (x_{n,1}, x_{n,2}), for n = 1, ..., N. The intraclass correlation r originally proposed [2] by Ronald Fisher [3] is

r = (1 / (N s²)) Σ_{n=1..N} (x_{n,1} − x̄)(x_{n,2} − x̄),

where

x̄ = (1 / 2N) Σ_{n=1..N} (x_{n,1} + x_{n,2}),
s² = (1 / 2N) [ Σ_{n=1..N} (x_{n,1} − x̄)² + Σ_{n=1..N} (x_{n,2} − x̄)² ].

Later versions of this statistic [3] used the degrees of freedom 2N − 1 in the denominator for calculating s² and N − 1 in the denominator for calculating r, so that s² becomes unbiased, and r becomes unbiased if s is known.

The key difference between this ICC and the interclass (Pearson) correlation is that the data are pooled to estimate the mean and variance. The reason for this is that in the setting where an intraclass correlation is desired, the pairs are considered to be unordered. For example, if we are studying the resemblance of twins, there is usually no meaningful way to order the values for the two individuals within a twin pair. Like the interclass correlation, the intraclass correlation for paired data will be confined to the interval  [−1, +1].
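As a concrete sketch, the pooled-moment computation can be written in a few lines of Python; the twin-pair numbers below are hypothetical, for illustration only:

```python
# Fisher's early intraclass correlation for N unordered pairs.
# Unlike Pearson's r, the mean and variance are pooled over all 2N values.

def fisher_paired_icc(pairs):
    """ICC for paired data (x_n1, x_n2), n = 1..N, using pooled moments."""
    n = len(pairs)
    values = [v for pair in pairs for v in pair]          # all 2N values
    xbar = sum(values) / (2 * n)                          # pooled mean
    s2 = sum((v - xbar) ** 2 for v in values) / (2 * n)   # pooled variance
    # average cross-product of within-pair deviations from the pooled mean
    return sum((a - xbar) * (b - xbar) for a, b in pairs) / (n * s2)

# hypothetical twin-pair measurements: high within-pair resemblance
pairs = [(10.1, 10.3), (12.0, 11.8), (9.5, 9.9), (13.2, 13.0), (11.1, 11.4)]
print(fisher_paired_icc(pairs))   # close to 1: pairs resemble each other
```

Because the pairs are unordered, swapping the two values within any pair leaves the result unchanged, which is exactly the symmetry the interclass (Pearson) correlation lacks.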

The intraclass correlation is also defined for data sets with groups having more than 2 values. For groups consisting of three values, it is defined as [3]

r = (1 / (3N s²)) Σ_{n=1..N} [ (x_{n,1} − x̄)(x_{n,2} − x̄) + (x_{n,1} − x̄)(x_{n,3} − x̄) + (x_{n,2} − x̄)(x_{n,3} − x̄) ],

where

x̄ = (1 / 3N) Σ_{n=1..N} (x_{n,1} + x_{n,2} + x_{n,3}),
s² = (1 / 3N) Σ_{n=1..N} [ (x_{n,1} − x̄)² + (x_{n,2} − x̄)² + (x_{n,3} − x̄)² ].

As the number of items per group grows, so does the number of cross-product terms in this expression. The following equivalent form is simpler to calculate:

r = (K / (K − 1)) · (N⁻¹ Σ_{n=1..N} (x̄_n − x̄)²) / s² − 1 / (K − 1),

where K is the number of data values per group, and x̄_n is the sample mean of the nth group. [3] This form is usually attributed to Harris. [4] The left term is non-negative; consequently the intraclass correlation must satisfy

−1 / (K − 1) ≤ r ≤ 1.

For large K, this ICC is nearly equal to

(N⁻¹ Σ_{n=1..N} (x̄_n − x̄)²) / s²,

which can be interpreted as the fraction of the total variance that is due to variation between groups. Ronald Fisher devotes an entire chapter to the intraclass correlation in his classic book Statistical Methods for Research Workers. [3]
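The equivalence of the cross-product definition and Harris's simpler form can be checked numerically. The Python sketch below implements both for balanced groups (the group values are hypothetical):

```python
# Fisher's ICC for N groups of K values each, computed two ways:
# directly from the pairwise cross-products, and via Harris's form,
# which avoids the O(K^2) cross-product sum.

from itertools import combinations

def icc_cross_products(groups):
    """Direct definition: average pairwise cross-product within groups."""
    n, k = len(groups), len(groups[0])
    values = [v for g in groups for v in g]
    xbar = sum(values) / (n * k)                        # pooled mean
    s2 = sum((v - xbar) ** 2 for v in values) / (n * k)  # pooled variance
    pairs_per_group = k * (k - 1) // 2
    total = sum((a - xbar) * (b - xbar)
                for g in groups for a, b in combinations(g, 2))
    return total / (n * pairs_per_group * s2)

def icc_harris(groups):
    """Harris's equivalent form, built from the between-group variance."""
    n, k = len(groups), len(groups[0])
    values = [v for g in groups for v in g]
    xbar = sum(values) / (n * k)
    s2 = sum((v - xbar) ** 2 for v in values) / (n * k)
    between = sum((sum(g) / k - xbar) ** 2 for g in groups) / n
    return k / (k - 1) * between / s2 - 1 / (k - 1)

# hypothetical data: three groups of three measurements
groups = [[9.8, 10.2, 10.0], [12.1, 11.9, 12.3], [8.7, 9.1, 8.9]]
assert abs(icc_cross_products(groups) - icc_harris(groups)) < 1e-9
```

Since the between-group term in Harris's form is non-negative, the smallest attainable value is −1/(K − 1), reached when every group mean equals the pooled mean.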

For data drawn from a population in which the true ICC is 0 (pure noise), Fisher's formula produces ICC estimates that are distributed around 0, i.e. sometimes negative. This is because Fisher designed the formula to be unbiased: its estimates are sometimes overestimates and sometimes underestimates. When the underlying population value is small or 0, the ICC calculated from a sample may be negative.

Modern ICC definitions: simpler formula but positive bias

Beginning with Ronald Fisher, the intraclass correlation has been regarded within the framework of analysis of variance (ANOVA), and more recently in the framework of random effects models. A number of ICC estimators have been proposed. Most of the estimators can be defined in terms of the random effects model

Y_ij = μ + α_j + ε_ij,

where Y_ij is the ith observation in the jth group, μ is an unobserved overall mean, α_j is an unobserved random effect shared by all values in group j, and ε_ij is an unobserved noise term. [5] For the model to be identified, the α_j and ε_ij are assumed to have expected value zero and to be uncorrelated with each other. Also, the α_j are assumed to be identically distributed, and the ε_ij are assumed to be identically distributed. The variance of α_j is denoted σ²_α and the variance of ε_ij is denoted σ²_ε.

The population ICC in this framework is [6]

ICC = σ²_α / (σ²_α + σ²_ε).

With this framework, the ICC is the correlation of two observations from the same group.

Proof

For the one-way random effects model

Y_ij = μ + α_j + ε_ij,

with α_j ~ (0, σ²_α) and ε_ij ~ (0, σ²_ε), where the α_j s are independent of each other, the ε_ij s are independent of each other, and the α_j s are independent of the ε_ij s:

The variance of any observation is

Var(Y_ij) = Var(α_j) + Var(ε_ij) = σ²_α + σ²_ε.

The covariance of two observations from the same group (for i ≠ i′) is [7]

Cov(Y_ij, Y_i′j) = Cov(μ + α_j + ε_ij, μ + α_j + ε_i′j) = Var(α_j) = σ²_α.

Here we have used the bilinearity of covariance: the constant μ contributes nothing, and every cross term involving the independent ε terms vanishes.

Put together we get:

Corr(Y_ij, Y_i′j) = σ²_α / (σ²_α + σ²_ε).
An advantage of this ANOVA framework is that different groups can have different numbers of data values, which is difficult to handle using the earlier ICC statistics. This ICC is always non-negative, allowing it to be interpreted as the proportion of total variance that is "between groups." This ICC can be generalized to allow for covariate effects, in which case the ICC is interpreted as capturing the within-class similarity of the covariate-adjusted data values. [8]

This expression can never be negative (unlike Fisher's original formula), and therefore, in samples from a population with an ICC of 0, the sample ICCs will tend to be higher than the population ICC: the estimator is positively biased.
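A small simulation illustrates this positive bias. The sketch below estimates the variance components from the one-way ANOVA mean squares, truncating the between-group component at zero (one common convention; the truncation is what makes the estimate non-negative), and applies the estimator to null data whose true ICC is 0:

```python
# ANOVA-based ICC for the one-way random effects model Y_ij = mu + a_j + e_ij,
# estimated from the between- and within-group mean squares of a balanced design.

import random

def anova_icc(groups):
    n = len(groups)
    k = len(groups[0])                       # balanced design assumed
    grand = sum(sum(g) for g in groups) / (n * k)
    msb = k * sum((sum(g) / k - grand) ** 2 for g in groups) / (n - 1)
    msw = sum((v - sum(g) / k) ** 2 for g in groups for v in g) / (n * (k - 1))
    var_a = max(0.0, (msb - msw) / k)        # between-group variance component
    var_e = msw                              # within-group (residual) variance
    return var_a / (var_a + var_e)

# Null population: group labels carry no information, so the true ICC is 0,
# yet every sample estimate is >= 0 -- hence the estimator is positively biased.
random.seed(0)
estimates = [anova_icc([[random.gauss(0, 1) for _ in range(4)]
                        for _ in range(20)])
             for _ in range(200)]
print(sum(estimates) / len(estimates))       # mean estimate exceeds the true 0
```

Note that groups need not all be the same size for ANOVA-based estimators in general; this sketch assumes a balanced design only to keep the mean-square formulas short.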

A number of different ICC statistics have been proposed, not all of which estimate the same population parameter. There has been considerable debate about which ICC statistics are appropriate for a given use, since they may produce markedly different results for the same data. [9] [10]

Relationship to Pearson's correlation coefficient

In terms of its algebraic form, Fisher's original ICC is the ICC that most resembles the Pearson correlation coefficient. One key difference between the two statistics is that in the ICC, the data are centered and scaled using a pooled mean and standard deviation, whereas in the Pearson correlation, each variable is centered and scaled by its own mean and standard deviation. This pooled scaling for the ICC makes sense because all measurements are of the same quantity (albeit on units in different groups). For example, in a paired data set where each "pair" is a single measurement made for each of two units (e.g., weighing each twin in a pair of identical twins) rather than two different measurements for a single unit (e.g., measuring height and weight for each individual), the ICC is a more natural measure of association than Pearson's correlation.

An important property of the Pearson correlation is that it is invariant to application of separate linear transformations to the two variables being compared. Thus, if we are correlating X and Y, where, say, Y = 2X + 1, the Pearson correlation between X and Y is 1, a perfect correlation. This property does not make sense for the ICC, since there is no basis for deciding which transformation is applied to each value in a group. However, if all the data in all groups are subjected to the same linear transformation, the ICC does not change.
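The two invariance properties can be checked directly. In this Python sketch (with hypothetical paired data), Pearson's r survives a linear transformation of one variable alone, while the pooled-moment ICC only survives a transformation applied uniformly to all values:

```python
# Invariance check: Pearson's r is unchanged by separate linear transforms of
# the two variables, while the pooled-moment ICC is only unchanged when the
# SAME linear transform is applied to every value in every group.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def paired_icc(pairs):
    n = len(pairs)
    values = [v for p in pairs for v in p]
    xbar = sum(values) / (2 * n)                        # pooled mean
    s2 = sum((v - xbar) ** 2 for v in values) / (2 * n)  # pooled variance
    return sum((a - xbar) * (b - xbar) for a, b in pairs) / (n * s2)

pairs = [(1.0, 1.2), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2)]   # hypothetical data
xs, ys = zip(*pairs)

# Pearson survives transforming only the second variable...
ys2 = [2 * y + 1 for y in ys]
print(pearson(xs, ys), pearson(xs, ys2))      # identical values

# ...while the ICC requires a common transform of all values.
shifted = [(2 * a + 1, 2 * b + 1) for a, b in pairs]
print(paired_icc(pairs), paired_icc(shifted))  # identical values
```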

Use in assessing conformity among observers

The ICC is used to assess the consistency, or conformity, of measurements made by multiple observers measuring the same quantity. [11] For example, if several physicians are asked to score the results of a CT scan for signs of cancer progression, we can ask how consistent the scores are with each other. If the truth is known (for example, if the CT scans were on patients who subsequently underwent exploratory surgery), then the focus would generally be on how well the physicians' scores matched the truth. If the truth is not known, we can only consider the similarity among the scores. An important aspect of this problem is that there is both inter-observer and intra-observer variability. Inter-observer variability refers to systematic differences among the observers; for example, one physician may consistently score patients at a higher risk level than other physicians. Intra-observer variability refers to deviations of a particular observer's score on a particular patient that are not part of a systematic difference.

The ICC is constructed to be applied to exchangeable measurements, that is, grouped data in which there is no meaningful way to order the measurements within a group. In assessing conformity among observers, if the same observers rate each element being studied, then systematic differences among observers are likely to exist, which conflicts with the notion of exchangeability. If the ICC is used in a situation where systematic differences exist, the result is a composite measure of intra-observer and inter-observer variability. One situation where exchangeability might reasonably be presumed to hold would be where a specimen to be scored, say a blood specimen, is divided into multiple aliquots, and the aliquots are measured separately on the same instrument. In this case, exchangeability would hold as long as no effect due to the sequence of running the samples was present.

Since the intraclass correlation coefficient gives a composite of intra-observer and inter-observer variability, its results are sometimes considered difficult to interpret when the observers are not exchangeable. Alternative measures such as Cohen's kappa statistic, the Fleiss kappa, and the concordance correlation coefficient [12] have been proposed as more suitable measures of agreement among non-exchangeable observers.

Calculation in software packages

[Figure: different intraclass correlation coefficient definitions applied to three scenarios of inter-observer concordance.]

ICC is supported in the open-source software package R (using the function icc with the packages psy or irr, or via the function ICC in the package psych). The rptR package [13] provides methods for the estimation of ICC and repeatabilities for Gaussian, binomial and Poisson distributed data in a mixed-model framework. Notably, the package allows estimation of adjusted ICC (i.e. controlling for other variables) and computes confidence intervals based on parametric bootstrapping and significances based on the permutation of residuals. Commercial software also supports ICC, for instance Stata or SPSS. [14]

Different types of ICC

Shrout and Fleiss convention | McGraw and Wong convention [15]          | Name in SPSS and Stata [16] [17]
ICC(1,1)                     | One-way random, single score ICC(1)      | One-way random, single measures
ICC(2,1)                     | Two-way random, single score ICC(A,1)    | Two-way random, single measures, absolute agreement
ICC(3,1)                     | Two-way mixed, single score ICC(C,1)     | Two-way mixed, single measures, consistency
undefined                    | Two-way random, single score ICC(C,1)    | Two-way random, single measures, consistency
undefined                    | Two-way mixed, single score ICC(A,1)     | Two-way mixed, single measures, absolute agreement
ICC(1,k)                     | One-way random, average score ICC(k)     | One-way random, average measures
ICC(2,k)                     | Two-way random, average score ICC(A,k)   | Two-way random, average measures, absolute agreement
ICC(3,k)                     | Two-way mixed, average score ICC(C,k)    | Two-way mixed, average measures, consistency
undefined                    | Two-way random, average score ICC(C,k)   | Two-way random, average measures, consistency
undefined                    | Two-way mixed, average score ICC(A,k)    | Two-way mixed, average measures, absolute agreement

The three models are:

- One-way random effects: each subject is measured by a different set of k randomly selected raters;
- Two-way random: k raters are randomly selected, and each subject is measured by the same set of k raters;
- Two-way mixed: k fixed raters are defined, and each subject is measured by those k raters.

Number of measurements:

- Single measures: even though more than one measure is taken in the experiment, reliability is applied to a context where a single measure by a single rater will be used in practice;
- Average measures: the reliability is applied to a context where the measures of k raters will be averaged for each subject.

Consistency or absolute agreement:

- Absolute agreement: the agreement between raters is of interest, including systematic errors of the raters as well as random residual errors;
- Consistency: systematic differences between raters are disregarded, and only the random residual error is kept.

The consistency ICC cannot be estimated in the one-way random effects model, as there is no way to separate the inter-rater and residual variances.
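The single-score estimators in the table can be computed from the two-way ANOVA mean squares of an n-subjects-by-k-raters table. The sketch below follows the Shrout and Fleiss (1979) formulas for a fully crossed design; the ratings matrix is hypothetical:

```python
# Shrout-Fleiss single-score ICCs for n subjects each rated by the same k raters
# (matrix rows = subjects, columns = raters). ICC(1,1) uses the one-way model;
# ICC(2,1) treats raters as random (absolute agreement); ICC(3,1) treats raters
# as fixed (consistency).

def shrout_fleiss(ratings):
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    subj = [sum(row) / k for row in ratings]                      # subject means
    rater = [sum(row[j] for row in ratings) / n for j in range(k)]  # rater means
    bms = k * sum((s - grand) ** 2 for s in subj) / (n - 1)       # between subjects
    jms = n * sum((r - grand) ** 2 for r in rater) / (k - 1)      # between raters
    wms = sum((ratings[i][j] - subj[i]) ** 2
              for i in range(n) for j in range(k)) / (n * (k - 1))  # within subjects
    ems = sum((ratings[i][j] - subj[i] - rater[j] + grand) ** 2
              for i in range(n) for j in range(k)) / ((n - 1) * (k - 1))  # residual
    icc11 = (bms - wms) / (bms + (k - 1) * wms)
    icc21 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    icc31 = (bms - ems) / (bms + (k - 1) * ems)
    return icc11, icc21, icc31

# hypothetical ratings: 5 subjects x 3 raters, rater 3 scoring systematically high
ratings = [[7, 8, 9], [5, 5, 7], [8, 8, 10], [4, 5, 6], [6, 7, 8]]
print(shrout_fleiss(ratings))
```

Because rater 3 carries a systematic offset, the consistency coefficient ICC(3,1), which discounts rater differences, comes out higher than the absolute-agreement coefficient ICC(2,1), which penalizes them.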

An overview and re-analysis of the three models for the single measures ICC, with an alternative recipe for their use, has also been presented by Liljequist et al. (2019). [18]

Interpretation

Cicchetti (1994) [19] gives the following often-quoted guidelines for interpretation for kappa or ICC inter-rater agreement measures:

- Less than 0.40: poor
- Between 0.40 and 0.59: fair
- Between 0.60 and 0.74: good
- Between 0.75 and 1.00: excellent

A different guideline is given by Koo and Li (2016): [20]

- Below 0.50: poor
- Between 0.50 and 0.75: moderate
- Between 0.75 and 0.90: good
- Above 0.90: excellent


References

  1. Koch GG (1982). "Intraclass correlation coefficient". In Samuel Kotz and Norman L. Johnson (ed.). Encyclopedia of Statistical Sciences . Vol. 4. New York: John Wiley & Sons. pp. 213–217.
  2. Bartko JJ (August 1966). "The intraclass correlation coefficient as a measure of reliability". Psychological Reports. 19 (1): 3–11. doi:10.2466/pr0.1966.19.1.3. PMID   5942109. S2CID   145480729.
  3. Fisher RA (1954). Statistical Methods for Research Workers (Twelfth ed.). Edinburgh: Oliver and Boyd. ISBN 978-0-05-002170-5.
  4. Harris JA (October 1913). "On the Calculation of Intra-Class and Inter-Class Coefficients of Correlation from Class Moments when the Number of Possible Combinations is Large". Biometrika . 9 (3/4): 446–472. doi:10.1093/biomet/9.3-4.446. JSTOR   2331901.
  5. Donner A, Koval JJ (March 1980). "The estimation of intraclass correlation in the analysis of family data". Biometrics. 36 (1): 19–25. doi:10.2307/2530491. JSTOR   2530491. PMID   7370372.
  6. ocram, Understanding the intra-class correlation coefficient, URL (version: 2012-12-05). Proof that the ICC in the ANOVA model is the correlation of two items.
  7. dsaxton (https://stats.stackexchange.com/users/78861/dsaxton), Random effects model: Observations from the same level have covariance $\sigma^2$?, URL (version: 2016-03-22).
  8. Stanish W, Taylor N (1983). "Estimation of the Intraclass Correlation Coefficient for the Analysis of Covariance Model". The American Statistician. 37 (3): 221–224. doi:10.2307/2683375. JSTOR   2683375.
  9. Müller R, Büttner P (December 1994). "A critical discussion of intraclass correlation coefficients". Statistics in Medicine. 13 (23–24): 2465–76. doi:10.1002/sim.4780132310. PMID 7701147.
  10. McGraw KO, Wong SP (1996). "Forming inferences about some intraclass correlation coefficients". Psychological Methods. 1: 30–46. doi:10.1037/1082-989X.1.1.30.
  11. Shrout PE, Fleiss JL (March 1979). "Intraclass correlations: uses in assessing rater reliability". Psychological Bulletin. 86 (2): 420–8. doi:10.1037/0033-2909.86.2.420. PMID   18839484.
  12. Nickerson CA (December 1997). "A Note on 'A Concordance Correlation Coefficient to Evaluate Reproducibility'". Biometrics . 53 (4): 1503–1507. doi:10.2307/2533516. JSTOR   2533516.
  13. Stoffel MA, Nakagawa S, Schielzeth J (2017). "rptR: repeatability estimation and variance decomposition by generalized linear mixed-effects models". Methods in Ecology and Evolution. 8 (11): 1639–1644. doi: 10.1111/2041-210x.12797 . ISSN   2041-210X.
  14. MacLennan RN (November 1993). "Interrater Reliability with SPSS for Windows 5.0". The American Statistician . 47 (4): 292–296. doi:10.2307/2685289. JSTOR   2685289.
  15. McGraw KO, Wong SP (1996). "Forming Inferences About Some Intraclass Correlation Coefficients". Psychological Methods . 1 (1): 30–40. doi:10.1037/1082-989X.1.1.30.
  16. Stata user's guide release 15 (PDF). College Station, Texas: Stata Press. 2017. pp. 1101–1123. ISBN   978-1-59718-249-2.
  17. Howell DC. "Intra-class correlation coefficients" (PDF).
  18. Liljequist D, Elfving B, Skavberg Roaldsen K (2019). "Intraclass correlation - A discussion and demonstration of basic features". PLOS ONE. 14 (7): e0219854. doi: 10.1371/journal.pone.0219854 . PMC   6645485 . PMID   31329615.
  19. Cicchetti DV (1994). "Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology". Psychological Assessment. 6 (4): 284–290. doi:10.1037/1040-3590.6.4.284.
  20. Koo TK, Li MY (June 2016). "A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research". Journal of Chiropractic Medicine. 15 (2): 155–63. doi:10.1016/j.jcm.2016.02.012. PMC   4913118 . PMID   27330520.
