Intra-class Correlation

In statistics, the intraclass correlation, or the intraclass correlation coefficient (ICC), is a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups. It describes how strongly units in the same group resemble each other. While it is viewed as a type of correlation, unlike most other correlation measures it operates on data structured as groups, rather than data structured as paired observations.

The ''intraclass correlation'' is commonly used to quantify the degree to which individuals with a fixed degree of relatedness (e.g. full siblings) resemble each other in terms of a quantitative trait (see heritability). Another prominent application is the assessment of consistency or reproducibility of quantitative measurements made by different observers measuring the same quantity.


Early ICC definition: unbiased but complex formula

The earliest work on intraclass correlations focused on the case of paired measurements, and the first intraclass correlation (ICC) statistics to be proposed were modifications of the interclass correlation (Pearson correlation).

Consider a data set consisting of ''N'' paired data values (''x''_{n,1}, ''x''_{n,2}), for ''n'' = 1, ..., ''N''. The intraclass correlation ''r'' originally proposed by Ronald Fisher is

: r = \frac{1}{N s^2} \sum_{n=1}^N (x_{n,1} - \bar{x})(x_{n,2} - \bar{x}),

where

: \bar{x} = \frac{1}{2N} \sum_{n=1}^N (x_{n,1} + x_{n,2}),

: s^2 = \frac{1}{2N} \left\{ \sum_{n=1}^N (x_{n,1} - \bar{x})^2 + \sum_{n=1}^N (x_{n,2} - \bar{x})^2 \right\}.

Later versions of this statistic used the degrees of freedom 2''N'' − 1 in the denominator for calculating ''s''² and ''N'' − 1 in the denominator for calculating ''r'', so that ''s''² becomes unbiased, and ''r'' becomes unbiased if ''s'' is known.

The key difference between this ICC and the interclass (Pearson) correlation is that the data are pooled to estimate the mean and variance. The reason for this is that in the setting where an intraclass correlation is desired, the pairs are considered to be unordered. For example, if we are studying the resemblance of twins, there is usually no meaningful way to order the values for the two individuals within a twin pair. Like the interclass correlation, the intraclass correlation for paired data will be confined to the interval [−1, +1].

The intraclass correlation is also defined for data sets with groups having more than 2 values. For groups consisting of three values, it is defined as

: r = \frac{1}{3N s^2} \sum_{n=1}^N \left\{ (x_{n,1} - \bar{x})(x_{n,2} - \bar{x}) + (x_{n,1} - \bar{x})(x_{n,3} - \bar{x}) + (x_{n,2} - \bar{x})(x_{n,3} - \bar{x}) \right\},

where

: \bar{x} = \frac{1}{3N} \sum_{n=1}^N (x_{n,1} + x_{n,2} + x_{n,3}),

: s^2 = \frac{1}{3N} \left\{ \sum_{n=1}^N (x_{n,1} - \bar{x})^2 + \sum_{n=1}^N (x_{n,2} - \bar{x})^2 + \sum_{n=1}^N (x_{n,3} - \bar{x})^2 \right\}.

As the number of items per group grows, the number of cross-product terms in this expression grows as well. The following equivalent form is simpler to calculate:

: r = \frac{K}{K-1} \cdot \frac{\frac{1}{N} \sum_{n=1}^N (\bar{x}_n - \bar{x})^2}{s^2} - \frac{1}{K-1},

where ''K'' is the number of data values per group, and \bar{x}_n is the sample mean of the ''n''th group. This form is usually attributed to Harris. The left term is non-negative; consequently the intraclass correlation must satisfy

: r \geq \frac{-1}{K-1}.

For large ''K'', this ICC is nearly equal to

: \frac{\frac{1}{N} \sum_{n=1}^N (\bar{x}_n - \bar{x})^2}{s^2},

which can be interpreted as the fraction of the total variance that is due to variation between groups.
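The simplified form lends itself to direct computation. The following is a minimal NumPy sketch, not part of the original article; the function name icc_simple and the toy data are purely illustrative:

```python
import numpy as np

def icc_simple(groups):
    """Early-style ICC via the simplified (Harris) form above.

    groups: an N x K array with N groups of K values each.
    Uses the pooled mean and the 1/(NK) pooled variance, matching the
    "complex formula" version of the statistic described above.
    """
    n, k = groups.shape
    xbar = groups.mean()                        # pooled mean of all N*K values
    s2 = ((groups - xbar) ** 2).mean()          # pooled variance, 1/(NK) form
    between = ((groups.mean(axis=1) - xbar) ** 2).mean()  # (1/N) sum (xbar_n - xbar)^2
    return k / (k - 1) * between / s2 - 1 / (k - 1)

# Three groups of three values whose members closely resemble each other:
data = np.array([[1.0, 2.0, 1.5],
                 [4.0, 4.5, 4.2],
                 [7.0, 6.5, 7.2]])
print(icc_simple(data))   # close to 1, since most variance is between groups
```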
Ronald Fisher devotes an entire chapter to intraclass correlation in his classic book ''Statistical Methods for Research Workers''. For data from a population that is completely noise, Fisher's formula produces ICC values that are distributed about 0, i.e. sometimes being negative. This is because Fisher designed the formula to be unbiased, and therefore its estimates are sometimes overestimates and sometimes underestimates. For small or zero underlying values in the population, the ICC calculated from a sample may be negative.
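A quick simulation, a sketch rather than part of the article, illustrates this behavior: applying the paired formula to pure noise yields estimates scattered around zero, with roughly half of them negative.

```python
import numpy as np

def fisher_paired_icc(pairs):
    """Fisher-style paired ICC with pooled mean and variance."""
    xbar = pairs.mean()
    s2 = ((pairs - xbar) ** 2).mean()
    return ((pairs[:, 0] - xbar) * (pairs[:, 1] - xbar)).mean() / s2

rng = np.random.default_rng(0)

# 1000 replicate samples, each with N = 20 pure-noise pairs
estimates = np.array([fisher_paired_icc(rng.normal(size=(20, 2)))
                      for _ in range(1000)])
print(estimates.mean())          # near 0
print((estimates < 0).mean())    # roughly half are negative
```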


Modern ICC definitions: simpler formula but positive bias

Beginning with Ronald Fisher, the intraclass correlation has been regarded within the framework of analysis of variance (ANOVA), and more recently in the framework of random effects models. A number of ICC estimators have been proposed. Most of the estimators can be defined in terms of the random effects model

: Y_{ij} = \mu + \alpha_j + \varepsilon_{ij},

where ''Y''_{ij} is the ''i''th observation in the ''j''th group, ''μ'' is an unobserved overall mean, ''α''_{j} is an unobserved random effect shared by all values in group ''j'', and ''ε''_{ij} is an unobserved noise term. For the model to be identified, the ''α''_{j} and ''ε''_{ij} are assumed to have expected value zero and to be uncorrelated with each other. Also, the ''α''_{j} are assumed to be identically distributed, and the ''ε''_{ij} are assumed to be identically distributed. The variance of ''α''_{j} is denoted σ²_α and the variance of ''ε''_{ij} is denoted σ²_ε. The population ICC in this framework is

: \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}.

With this framework, the ICC is the correlation of two observations from the same group. To see this, assume additionally that \alpha_j \sim N(0, \sigma_\alpha^2) and \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2), with the \alpha_j and \varepsilon_{ij} mutually independent. The variance of any observation is

: \operatorname{Var}(Y_{ij}) = \sigma_\alpha^2 + \sigma_\varepsilon^2.

The covariance of two different observations ''i'' ≠ ''i''′ from the same group ''j'' is

: \begin{align} \operatorname{Cov}(Y_{ij}, Y_{i'j}) &= \operatorname{Cov}(\mu + \alpha_j + \varepsilon_{ij},\ \mu + \alpha_j + \varepsilon_{i'j}) \\ &= \operatorname{Cov}(\alpha_j, \alpha_j) + \operatorname{Cov}(\alpha_j, \varepsilon_{i'j}) + \operatorname{Cov}(\varepsilon_{ij}, \alpha_j) + \operatorname{Cov}(\varepsilon_{ij}, \varepsilon_{i'j}) \\ &= \operatorname{Var}(\alpha_j) = \sigma_\alpha^2, \end{align}

using the bilinearity of covariance; the three cross terms vanish by independence. Putting these together,

: \operatorname{Corr}(Y_{ij}, Y_{i'j}) = \frac{\operatorname{Cov}(Y_{ij}, Y_{i'j})}{\operatorname{Var}(Y_{ij})} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}.

An advantage of this ANOVA framework is that different groups can have different numbers of data values, which is difficult to handle using the earlier ICC statistics. Unlike Fisher's original formula, this ICC can never be negative, allowing it to be interpreted as the proportion of total variance that is "between groups"; as a consequence, in samples from a population whose ICC is 0, the sample ICCs will tend to be higher than the ICC of the population. This ICC can be generalized to allow for covariate effects, in which case it is interpreted as capturing the within-class similarity of the covariate-adjusted data values.

A number of different ICC statistics have been proposed, not all of which estimate the same population parameter. There has been considerable debate about which ICC statistics are appropriate for a given use, since they may produce markedly different results for the same data.
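To make the one-way model concrete, here is a NumPy sketch, not part of the article, of a common method-of-moments (ANOVA) estimator of this ICC; the helper name icc_oneway, the toy data, and the choice to truncate negative variance-component estimates at zero are assumptions of the sketch:

```python
import numpy as np

def icc_oneway(groups):
    """Method-of-moments ICC for the one-way random effects model.

    groups: a list of 1-D arrays, one per group; sizes may differ.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    sizes = np.array([len(g) for g in groups])
    a, n = len(groups), sizes.sum()
    grand = np.concatenate(groups).mean()

    ms_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (a - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - a)

    # Effective group size; reduces to the common K for balanced data.
    k0 = (n - (sizes ** 2).sum() / n) / (a - 1)

    # Truncate at zero, consistent with the non-negative population ICC.
    sigma2_alpha = max(0.0, (ms_between - ms_within) / k0)
    return sigma2_alpha / (sigma2_alpha + ms_within)

# Unequal group sizes are handled naturally in this framework:
print(icc_oneway([[9.0, 10.0, 8.5], [6.0, 5.5], [2.0, 3.0, 2.5, 2.0]]))
```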


Relationship to Pearson's correlation coefficient

In terms of its algebraic form, Fisher's original ICC is the ICC that most resembles the Pearson correlation coefficient. One key difference between the two statistics is that in the ICC, the data are centered and scaled using a pooled mean and standard deviation, whereas in the Pearson correlation, each variable is centered and scaled by its own mean and standard deviation. This pooled scaling for the ICC makes sense because all measurements are of the same quantity (albeit on units in different groups). For example, in a paired data set where each "pair" is a single measurement made for each of two units (e.g., weighing each twin in a pair of identical twins) rather than two different measurements for a single unit (e.g., measuring height and weight for each individual), the ICC is a more natural measure of association than Pearson's correlation.

An important property of the Pearson correlation is that it is invariant to application of separate linear transformations to the two variables being compared. Thus, if we are correlating ''X'' and ''Y'', where, say, ''Y'' = 2''X'' + 1, the Pearson correlation between ''X'' and ''Y'' is 1, a perfect correlation. This property does not make sense for the ICC, since there is no basis for deciding which transformation is applied to each value in a group. However, if all the data in all groups are subjected to the same linear transformation, the ICC does not change.
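Both invariance claims are easy to verify numerically. The following is an illustrative Python sketch, not part of the article; paired_icc implements the pooled paired form from the earlier section:

```python
import numpy as np

def paired_icc(pairs):
    """Fisher-style paired ICC using a pooled mean and variance."""
    xbar = pairs.mean()
    s2 = ((pairs - xbar) ** 2).mean()
    return ((pairs[:, 0] - xbar) * (pairs[:, 1] - xbar)).mean() / s2

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

# Pearson correlation is invariant to separate linear maps of each variable:
print(np.corrcoef(x, y)[0, 1])     # exactly 1

# The ICC pools the centering and scaling, so it is far below 1 here:
pairs = np.column_stack([x, y])
print(paired_icc(pairs))

# Applying the SAME linear map to all values leaves the ICC unchanged:
print(paired_icc(3 * pairs - 2))   # equal to the previous value
```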


Use in assessing conformity among observers

The ICC is used to assess the consistency, or conformity, of measurements made by multiple observers measuring the same quantity. For example, if several physicians are asked to score the results of a CT scan for signs of cancer progression, we can ask how consistent the scores are to each other. If the truth is known (for example, if the CT scans were on patients who subsequently underwent exploratory surgery), then the focus would generally be on how well the physicians' scores matched the truth. If the truth is not known, we can only consider the similarity among the scores.

An important aspect of this problem is that there is both inter-observer and intra-observer variability. Inter-observer variability refers to systematic differences among the observers: for example, one physician may consistently score patients at a higher risk level than other physicians. Intra-observer variability refers to deviations of a particular observer's score on a particular patient that are not part of a systematic difference.

The ICC is constructed to be applied to exchangeable measurements, that is, grouped data in which there is no meaningful way to order the measurements within a group. In assessing conformity among observers, if the same observers rate each element being studied, then systematic differences among observers are likely to exist, which conflicts with the notion of exchangeability. If the ICC is used in a situation where systematic differences exist, the result is a composite measure of intra-observer and inter-observer variability. One situation where exchangeability might reasonably be presumed to hold would be where a specimen to be scored, say a blood specimen, is divided into multiple aliquots, and the aliquots are measured separately on the same instrument. In this case, exchangeability would hold as long as no effect due to the sequence of running the samples was present.

Since the ''intraclass correlation coefficient'' gives a composite of intra-observer and inter-observer variability, its results are sometimes considered difficult to interpret when the observers are not exchangeable. Alternative measures such as Cohen's kappa statistic, the Fleiss kappa, and the concordance correlation coefficient have been proposed as more suitable measures of agreement among non-exchangeable observers.


Calculation in software packages

ICC is supported in the open source software package R (using the function "icc" with the packages ''psy'' or ''irr'', or via the function "ICC" in the package ''psych''). The ''rptR'' package provides methods for the estimation of ICC and repeatabilities for Gaussian, binomial and Poisson distributed data in a mixed-model framework. Notably, the package allows estimation of adjusted ICC (i.e. controlling for other variables) and computes confidence intervals based on parametric bootstrapping and significances based on the permutation of residuals. Commercial software also supports ICC, for instance Stata or SPSS.

The three models are:
* One-way random effects: each subject is measured by a different set of ''k'' randomly selected raters;
* Two-way random: ''k'' raters are randomly selected; each subject is measured by the same set of ''k'' raters;
* Two-way mixed: ''k'' fixed raters are defined; each subject is measured by the ''k'' raters.

Number of measurements:
* Single measures: even though more than one measure is taken in the experiment, reliability is applied to a context where a single measure of a single rater will be performed;
* Average measures: the reliability is applied to a context where measures of ''k'' raters will be averaged for each subject.

Consistency or absolute agreement:
* Absolute agreement: the agreement between two raters is of interest, including systematic errors of both raters and random residual errors;
* Consistency: in the context of repeated measurements by the same rater, systematic errors of the rater are canceled and only the random residual error is kept.

The consistency ICC cannot be estimated in the one-way random effects model, as there is no way to separate the inter-rater and residual variances.

An overview and re-analysis of the three models for the single measures ICC, with an alternative recipe for their use, has also been presented by Liljequist et al. (2019).
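To make these distinctions concrete, the following Python sketch, not from the article, computes the three single-measures variants from a complete subjects × raters table using the widely cited Shrout–Fleiss / McGraw–Wong mean-square recipes; the function name and the toy scores are illustrative:

```python
import numpy as np

def single_measure_iccs(X):
    """Single-measures ICCs from an n-subjects x k-raters matrix (no missing data)."""
    n, k = X.shape
    grand = X.mean()

    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_total = ((X - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols                 # two-way residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    msw = (ss_total - ss_rows) / (n * (k - 1))            # one-way within

    return {
        "one-way random": (msr - msw) / (msr + (k - 1) * msw),
        "two-way consistency": (msr - mse) / (msr + (k - 1) * mse),
        "two-way absolute agreement":
            (msr - mse) / (msr + (k - 1) * mse + k / n * (msc - mse)),
    }

# Six subjects scored by three raters; rater 3 scores systematically higher,
# so consistency comes out high while absolute agreement is lower.
scores = np.array([[7.0, 7.5, 9.0],
                   [5.0, 5.5, 7.5],
                   [8.0, 8.5, 10.0],
                   [4.0, 4.0, 6.0],
                   [6.0, 6.5, 8.0],
                   [9.0, 9.0, 11.0]])
for name, value in single_measure_iccs(scores).items():
    print(name, round(value, 3))
```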


Interpretation

Cicchetti (1994) gives the following often-quoted guidelines for interpretation of kappa or ICC inter-rater agreement measures:
* Less than 0.40: poor.
* Between 0.40 and 0.59: fair.
* Between 0.60 and 0.74: good.
* Between 0.75 and 1.00: excellent.

A different guideline is given by Koo and Li (2016):
* below 0.50: poor
* between 0.50 and 0.75: moderate
* between 0.75 and 0.90: good
* above 0.90: excellent
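Purely as an illustrative convenience, these cutoffs can be encoded as a small lookup; the function interpret_icc and its interface are invented for this sketch:

```python
def interpret_icc(value, guideline="cicchetti"):
    """Label an ICC estimate using the guidelines quoted above."""
    if guideline == "cicchetti":      # Cicchetti (1994)
        bands = [(0.75, "excellent"), (0.60, "good"), (0.40, "fair")]
    else:                             # Koo and Li (2016)
        bands = [(0.90, "excellent"), (0.75, "good"), (0.50, "moderate")]
    for cutoff, label in bands:
        if value >= cutoff:
            return label
    return "poor"

print(interpret_icc(0.65))                      # "good"
print(interpret_icc(0.65, guideline="koo-li"))  # "moderate"
```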


See also

* Correlation ratio
* Design effect
* Effect_size#Eta-squared_(η2)



External links


AgreeStat 360: cloud-based inter-rater reliability analysis, Cohen's kappa, Gwet's AC1/AC2, Krippendorff's alpha, Brennan-Prediger, Fleiss generalized kappa, intraclass correlation coefficients

A useful online tool that allows calculation of the different types of ICC