Classical test theory (CTT) is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers. It is a theory of testing based on the idea that a person's observed or obtained score on a test is the sum of a true score (error-free score) and an error score. Generally speaking, the aim of classical test theory is to understand and improve the reliability of psychological tests.

''Classical test theory'' may be regarded as roughly synonymous with ''true score theory''. The term "classical" refers not only to the chronology of these models but also contrasts with the more recent psychometric theories, generally referred to collectively as item response theory, which sometimes bear the appellation "modern" as in "modern latent trait theory".

Classical test theory as we know it today was codified and described in a handful of classic texts, and the description of classical test theory below follows these seminal publications.

History

Classical test theory was born only after the following three achievements or ideas were conceptualized: (1) a recognition of the presence of errors in measurements, (2) a conception of that error as a random variable, and (3) a conception of correlation and how to index it. In 1904, Charles Spearman worked out how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction. Some regard Spearman's finding as the beginning of classical test theory. Others who influenced the framework of classical test theory include George Udny Yule, Truman Lee Kelley, Fritz Kuder and Marion Richardson (who developed the Kuder–Richardson formulas), Louis Guttman, and, most recently, Melvin Novick, along with others over the quarter century following Spearman's initial findings.

Definitions

Classical test theory assumes that each person has a ''true score'', ''T'', that would be obtained if there were no errors in measurement. A person's true score is defined as the expected number-correct score over an infinite number of independent administrations of the test. Unfortunately, test users never observe a person's true score, only an ''observed score'', ''X''. It is assumed that ''observed score'' = ''true score'' plus some ''error'':

$$X = T + E$$

where ''X'' is the observed score, ''T'' the true score, and ''E'' the error.

Classical test theory is concerned with the relations between the three variables ''X'', ''T'', and ''E'' in the population. These relations are used to say something about the quality of test scores. In this regard, the most important concept is that of ''reliability''. The reliability of the observed test scores ''X'', which is denoted as $\rho^2_{XT}$, is defined as the ratio of true score variance $\sigma^2_T$ to the observed score variance $\sigma^2_X$:

$$\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_X}$$

Because the variance of the observed scores can be shown to equal the sum of the variance of true scores and the variance of error scores, this is equivalent to

$$\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$$

This equation, which expresses reliability in signal-to-noise terms (true score variance as signal, error variance as noise), has intuitive appeal: the reliability of test scores becomes higher as the proportion of error variance in the test scores becomes lower, and vice versa. The reliability is equal to the proportion of the variance in the test scores that we could explain if we knew the true scores. The square root of the reliability is the absolute value of the correlation between true and observed scores.
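The variance-ratio definition of reliability can be checked numerically. The following is a minimal simulation sketch, not part of the theory itself: the normal distributions, the mean of 50, the standard deviations of 10 and 5, and the function names are all illustrative assumptions. With those values the theoretical reliability is 100 / (100 + 25) = 0.8, and the squared correlation between true and observed scores should land close to the same figure.

```python
import random
import statistics


def simulate_scores(n_persons=20000, true_sd=10.0, error_sd=5.0, seed=0):
    """Draw true scores T and independent errors E; return (T, X) with X = T + E.

    The normal distributions and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_persons)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, observed


def pearson(a, b):
    """Pearson correlation computed from population moments."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))


true_scores, observed = simulate_scores()

# Reliability as the ratio of true score variance to observed score variance.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)

# Correlation between true and observed scores; its square should be close
# to the variance ratio above (both near 0.8 for these assumed parameters).
corr_tx = pearson(true_scores, observed)

print(round(reliability, 3), round(corr_tx ** 2, 3))
```

In real testing the true scores are never observed, so this calculation is only possible in simulation, which is why the estimation methods in the next section are needed.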

Evaluating tests and scores: Reliability

Reliability cannot be estimated directly since that would require one to know the true scores, which according to classical test theory is impossible. However, estimates of reliability can be obtained by various means. One way of estimating reliability is by constructing a so-called ''parallel test''. The fundamental property of a parallel test is that it yields the same true score and the same observed score variance as the original test for every individual. If we have parallel tests $X$ and $X'$, then this means that

$$\mathbb{E}[X_i] = \mathbb{E}[X'_i]$$

and

$$\sigma^2_{E_i} = \sigma^2_{E'_i}$$

Under these assumptions, it follows that the correlation between parallel test scores is equal to reliability:

$$\rho_{XX'} = \frac{\sigma_{XX'}}{\sigma_X \sigma_{X'}} = \frac{\sigma^2_T}{\sigma^2_X} = \rho^2_{XT}$$
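The step from the covariance to the true score variance relies on the errors of the two forms being uncorrelated with the true scores and with each other, so that the only thing the two forms share is the true score. A small simulation sketch illustrates the result. The distributions, parameter values, and variable names are assumptions chosen for illustration, with a theoretical reliability of 0.8.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation computed from population moments."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))


rng = random.Random(1)
n_persons = 20000
true_sd, error_sd = 10.0, 5.0  # assumed values; theoretical reliability 100 / 125 = 0.8

true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_persons)]

# Two parallel forms: same true score per person, independent errors
# drawn with the same error variance.
form_a = [t + rng.gauss(0.0, error_sd) for t in true_scores]
form_b = [t + rng.gauss(0.0, error_sd) for t in true_scores]

parallel_corr = pearson(form_a, form_b)
reliability = statistics.pvariance(true_scores) / statistics.pvariance(form_a)

print(round(parallel_corr, 3), round(reliability, 3))
```

Unlike the variance ratio, the correlation between forms uses only observed scores, which is what makes it usable as an estimate when parallel forms actually exist.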

Using parallel tests to estimate reliability is cumbersome because parallel tests are very hard to come by. In practice the method is rarely used. Instead, researchers use a measure of internal consistency known as Cronbach's $\alpha$. Consider a test consisting of $k$ items $u_j$, $j = 1, \ldots, k$. The total test score is defined as the sum of the individual item scores, so that for individual $i$

$$X_i = \sum_{j=1}^k U_{ij}$$

where $U_{ij}$ is the score of individual $i$ on item $j$. Then Cronbach's alpha equals

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^k \sigma^2_{U_j}}{\sigma^2_X}\right)$$
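The formula translates directly into a few lines of code. The sketch below is one possible implementation using only the Python standard library. The function name `cronbach_alpha` and the small response matrix are made up for illustration. Population variances are used throughout; using sample variances for both the items and the total gives the same value, since the correction factor cancels in the ratio.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a persons-by-items score matrix.

    item_scores: list of rows, one per person, each a list of k item scores.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Variance of each item across persons (one column at a time).
    item_variances = [
        statistics.pvariance([row[j] for row in item_scores]) for j in range(k)
    ]
    # Variance of the total score X_i, the sum of each person's item scores.
    total_variance = statistics.pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 persons, 4 dichotomous items (1 = keyed response).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))  # prints 0.8
```

For this matrix the item variances sum to 0.8 and the total score variance is 2, so the result is (4/3)(1 - 0.4) = 0.8.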

Cronbach's $\alpha$ can be shown to provide a lower bound for reliability under rather mild assumptions. Thus, under those assumptions, the reliability of test scores in a population is at least as high as the value of Cronbach's $\alpha$ in that population. Because it requires only a single test administration, this method is empirically feasible and, as a result, very popular among researchers. Calculation of Cronbach's $\alpha$ is included in many standard statistical packages such as SPSS and SAS.

As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of reliability. Reliability is supposed to say something about the general quality of the test scores in question. The general idea is that the higher the reliability, the better. Classical test theory does not say how high reliability is supposed to be. Too high a value for $\alpha$, say over .9, may indicate redundancy of items. Around .8 is recommended for personality research, while .9+ is desirable for individual high-stakes testing. These 'criteria' are not based on formal arguments, but rather are the result of convention and professional practice. The extent to which they can be mapped to formal principles of statistical inference is unclear.

Evaluating items: P and item-total correlations

Reliability provides a convenient single-number index of test quality. However, it does not provide any information for evaluating single items. Item analysis within the classical approach often relies on two statistics: the P-value (proportion) and the item-total correlation (point-biserial correlation coefficient). The P-value represents the proportion of examinees responding in the keyed direction, and is typically referred to as ''item difficulty''. The item-total correlation provides an index of the discrimination or differentiating power of the item, and is typically referred to as ''item discrimination''. In addition, these statistics are calculated for each response option of the oft-used multiple choice item, and are used to evaluate items and diagnose possible issues, such as a confusing distractor. Such analysis is provided by specially designed psychometric software.
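Both item statistics are simple to compute for dichotomously scored items. The following is a rough sketch rather than a substitute for dedicated software: the function names and the response matrix are hypothetical, and the item-total correlation here is the uncorrected version, in which the item itself is included in the total score. Some analyses instead correlate each item with the total of the remaining items.

```python
import statistics


def pearson(a, b):
    """Pearson correlation from population moments; NaN if either variable is constant."""
    sd_a, sd_b = statistics.pstdev(a), statistics.pstdev(b)
    if sd_a == 0 or sd_b == 0:
        return float("nan")
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    return cov / (sd_a * sd_b)


def item_statistics(item_scores):
    """Return a list of (p_value, item_total_correlation), one pair per item.

    item_scores: list of rows, one per person, each a list of 0/1 item scores.
    """
    totals = [sum(row) for row in item_scores]
    results = []
    for j in range(len(item_scores[0])):
        item = [row[j] for row in item_scores]
        p_value = statistics.fmean(item)    # proportion answering in the keyed direction
        item_total = pearson(item, totals)  # point-biserial with the total score
        results.append((p_value, item_total))
    return results


# Hypothetical data: 5 persons, 4 dichotomous items (1 = keyed response).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

results = item_statistics(responses)
for index, (p_value, item_total) in enumerate(results, start=1):
    print(f"item {index}: P = {p_value:.2f}, item-total r = {item_total:.3f}")
```

Because the Pearson correlation between a 0/1 variable and a continuous variable is the point-biserial correlation, no separate formula is needed for the discrimination index.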

Alternatives

Classical test theory is an influential theory of test scores in the social sciences. In psychometrics, the theory has been superseded by the more sophisticated models in item response theory (IRT) and generalizability theory (G-theory). However, IRT is not included in standard statistical packages like SPSS, although SAS can estimate IRT models via PROC IRT and PROC MCMC, and there are packages for the open source statistical programming language R for IRT as well as for classical analyses (e.g., CTT). While commercial packages routinely provide estimates of Cronbach's $\alpha$, specialized psychometric software may be preferred for IRT or G-theory. However, general statistical packages often do not provide a complete classical analysis (Cronbach's $\alpha$ is only one of many important statistics), and in many cases, specialized software for classical analysis is also necessary.

Shortcomings

One of the best-known shortcomings of classical test theory is that examinee characteristics and test characteristics cannot be separated: each can only be interpreted in the context of the other. Another shortcoming lies in the definition of reliability used in classical test theory, which states that reliability is "the correlation between test scores on parallel forms of a test". The problem is that opinions differ on what parallel tests are. Various reliability coefficients provide either lower bound estimates of reliability or reliability estimates with unknown biases. A third shortcoming involves the standard error of measurement, which classical test theory assumes to be the same for all examinees. However, as Hambleton explains in his book, scores on any test are unequally precise measures for examinees of different ability, which makes the assumption of equal errors of measurement for all examinees implausible. A fourth and final shortcoming of classical test theory is that it is test oriented rather than item oriented. In other words, classical test theory cannot help us predict how well an individual or even a group of examinees might do on a test item.
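The third shortcoming is easier to see with the usual classical formula for the standard error of measurement, which is not stated in the text above but is the standard result: the observed score standard deviation times the square root of one minus the reliability. The sketch below uses made-up values (a standard deviation of 15 and a reliability of 0.91) and a hypothetical function name. The point is that the resulting value is a single number for the whole population, so every examinee gets the same error band regardless of score level.

```python
import math


def standard_error_of_measurement(observed_sd, reliability):
    """Classical SEM: observed score SD times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return observed_sd * math.sqrt(1.0 - reliability)


# Illustrative values only: observed score SD of 15 and reliability of 0.91.
sem = standard_error_of_measurement(observed_sd=15.0, reliability=0.91)

# The same SEM is applied to a low, a middle, and a high observed score.
# The +/- 1.96 band additionally assumes normally distributed errors.
for observed_score in (70, 100, 130):
    lower = observed_score - 1.96 * sem
    upper = observed_score + 1.96 * sem
    print(f"score {observed_score}: SEM = {sem:.2f}, band = ({lower:.1f}, {upper:.1f})")
```

All three bands have the same width, which is exactly the equal-precision assumption that the paragraph above describes as implausible.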


External links

International Test Commission article on Classical Test Theory

TAP: free software for Classical Test Theory

Iteman: software for visual reporting with Classical Test Theory

CITAS: Excel-based software for Classical Test Theory (https://assess.com/citas/)

jMetrik: software for Classical Test Theory