In psychometrics, item response theory (IRT), also known as latent trait theory, strong true score theory, or modern mental test theory, is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". [A. van Alphen, R. Halfens, A. Hasman and T. Imbos. (1994). Likert or Rasch? Nothing is more applicable than good theory. ''Journal of Advanced Nursing''. 20, 196-201] By contrast, item response theory treats the difficulty of each item (the item characteristic curves, or ICCs) as information to be incorporated in scaling items.
It is based on the application of related mathematical models to testing data. Because it is often regarded as superior to classical test theory, it is the preferred method for developing scales in the United States, especially when optimal decisions are demanded, as in so-called high-stakes tests, e.g., the Graduate Record Examination (GRE) and Graduate Management Admission Test (GMAT).
The name ''item response theory'' is due to the focus of the theory on the item, as opposed to the test-level focus of classical test theory. Thus IRT models the response of each examinee of a given ability to each item in the test. The term ''item'' is generic, covering all kinds of informative items. They might be multiple choice questions that have incorrect and correct responses, but are also commonly statements on questionnaires that allow respondents to indicate level of agreement (a rating or Likert scale), or patient symptoms scored as present/absent, or diagnostic information in complex systems.
IRT is based on the idea that the probability of a correct/keyed response to an item is a mathematical function of person and item parameters. (The expression “a mathematical function of person and item parameters” is analogous to Kurt Lewin’s equation ''B = f(P, E)'', which asserts that behavior is a function of the person in their environment.) The person parameter is construed as (usually) a single latent trait or dimension. Examples include general intelligence or the strength of an attitude. Parameters on which items are characterized include their difficulty (known as "location" for their location on the difficulty range); discrimination (slope or correlation), representing how steeply the rate of success of individuals varies with their ability; and a pseudoguessing parameter, characterising the (lower) asymptote at which even the least able persons will score due to guessing (for instance, 25% for pure chance on a multiple choice item with four possible responses).
In the same manner, IRT can be used to measure human behavior in online social networks. The views expressed by different people can be aggregated to be studied using IRT. Its use in classifying information as misinformation or true information has also been evaluated.
Overview
The concept of the item response function was around before 1950. The pioneering work of IRT as a theory occurred during the 1950s and 1960s. Three of the pioneers were the Educational Testing Service psychometrician Frederic M. Lord, the Danish mathematician Georg Rasch, and Austrian sociologist Paul Lazarsfeld, who pursued parallel research independently. Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich. IRT did not become widely used until the late 1970s and 1980s, when practitioners were told the "usefulness" and "advantages" of IRT on the one hand, and personal computers gave many researchers access to the computing power necessary for IRT on the other.
Among other things, the purpose of IRT is to provide a framework for evaluating how well assessments work, and how well individual items on assessments work. The most common application of IRT is in education, where psychometricians use it for developing and designing exams, maintaining banks of items for exams, and equating the difficulties of items for successive versions of exams (for example, to allow comparisons between results over time).
IRT models are often referred to as ''latent trait models''. The term ''latent'' is used to emphasize that discrete item responses are taken to be ''observable manifestations'' of hypothesized traits, constructs, or attributes, not directly observed, but which must be inferred from the manifest responses. Latent trait models were developed in the field of sociology, but are virtually identical to IRT models.
IRT is generally claimed as an improvement over classical test theory (CTT). For tasks that can be accomplished using CTT, IRT generally brings greater flexibility and provides more sophisticated information. Some applications, such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical test theory. Another advantage of IRT over CTT is that the more sophisticated information IRT provides allows a researcher to improve the reliability of an assessment.
IRT entails three assumptions:
# A unidimensional trait denoted by ''θ'';
# Local independence of items;
# The response of a person to an item can be modeled by a mathematical ''item response function'' (IRF).
The trait is further assumed to be measurable on a scale (the mere existence of a test assumes this), typically set to a standard scale with a mean of 0.0 and a standard deviation of 1.0. Unidimensionality should be interpreted as homogeneity, a quality that should be defined or empirically demonstrated in relation to a given purpose or use, but not a quantity that can be measured. 'Local independence' means (a) that the chance of one item being used is not related to any other item(s) being used and (b) that response to an item is each and every test-taker's independent decision, that is, there is no cheating or pair or group work. The topic of dimensionality is often investigated with factor analysis, while the IRF is the basic building block of IRT and is the center of much of the research and literature.
The item response function
The IRF gives the probability that a person with a given ability level will answer correctly. Persons with lower ability have less of a chance, while persons with high ability are very likely to answer correctly; for example, students with higher math ability are more likely to get a math item correct. The exact value of the probability depends, in addition to ability, on a set of ''item parameters'' for the IRF.
Three parameter logistic model
For example, in the three parameter logistic model (3PL), the probability of a correct response to a dichotomous item ''i'', usually a multiple-choice question, is:
:<math>p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}</math>
where ''θ'' is the person's ability; for the purpose of estimating the item parameters, the abilities are modeled as a sample from a normal distribution. After the item parameters have been estimated, the abilities of individual people are estimated for reporting purposes. <math>a_i</math>, <math>b_i</math>, and <math>c_i</math> are the item parameters. The item parameters determine the shape of the IRF. Figure 1 depicts an ideal 3PL ICC.
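As an illustration, the 3PL item response function can be sketched in a few lines of Python, assuming the usual form c + (1 − c) / (1 + e^(−a(θ − b))). The function name `irf_3pl` and the item parameter values below are hypothetical and chosen only for demonstration; they do not come from any particular test.

```python
import math


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: person ability; a: discrimination; b: difficulty;
    c: pseudo-guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: a four-option multiple-choice question, so the
# guessing floor is assumed to be 0.25.
item = {"a": 1.2, "b": 0.0, "c": 0.25}

# Evaluate the curve at a few ability levels on the standard scale.
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta={theta:+.1f}  p={irf_3pl(theta, **item):.3f}")
```

The printed probabilities rise monotonically with ability and never fall below the guessing floor. For extreme arguments `math.exp` can overflow, so a production implementation would likely use a numerically stable logistic function instead.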
The item parameters can be interpreted as changing the shape of the standard logistic function:
:<math>P(t) = \frac{1}{1 + e^{-t}}.</math>
In brief, the parameters are interpreted as follows (dropping subscripts for legibility); ''b'' is most basic, hence listed first:
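* ''b'' – difficulty, item location: <math>p(b) = (1 + c)/2</math>, the half-way point between <math>c</math> (min) and 1 (max), also where the slope is maximized.
* ''a'' – discrimination, scale, slope: the maximum slope <math>p'(b) = a \cdot (1 - c)/4</math>.
* ''c'' – pseudo-guessing, chance, asymptotic minimum <math>p(-\infty) = c</math>.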
If <math>c = 0</math>, then these simplify to <math>p(b) = 1/2</math> and <math>p'(b) = a/4</math>, meaning that ''b'' equals the 50% success level (difficulty), and ''a'' (divided by four) is the maximum slope (discrimination), which occurs at the 50% success level. Further, the logit (log odds) of a correct response is <math>a(\theta - b)</math> (assuming <math>c = 0</math>): in particular, if ability ''θ'' equals difficulty ''b'', there are even odds (1:1, so logit 0) of a correct answer; the greater the ability is above (or below) the difficulty, the more (or less) likely a correct response, with discrimination ''a'' determining how rapidly the odds increase or decrease with ability.
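These interpretations can be checked numerically. The following is a minimal sketch, assuming the usual 3PL form c + (1 − c) / (1 + e^(−a(θ − b))); the helper names (`irf_3pl`, `slope`, `logit`) and the parameter values are hypothetical, and the slope is approximated by a central finite difference rather than computed analytically.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL item response function (assumed standard form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def slope(theta, a, b, c, h=1e-6):
    """Central-difference approximation to the derivative of the IRF."""
    return (irf_3pl(theta + h, a, b, c) - irf_3pl(theta - h, a, b, c)) / (2 * h)


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


a, b, c = 1.5, 0.4, 0.2  # hypothetical item parameters

# b: the curve passes through the half-way point between c and 1 at theta = b.
midpoint_ok = math.isclose(irf_3pl(b, a, b, c), (1 + c) / 2)

# a: the slope at theta = b equals a * (1 - c) / 4.
slope_ok = math.isclose(slope(b, a, b, c), a * (1 - c) / 4, rel_tol=1e-6)

# c: the curve approaches c for very low ability.
floor_ok = math.isclose(irf_3pl(-40.0, a, b, c), c, abs_tol=1e-9)

# With c = 0, the log odds of a correct response equal a * (theta - b).
logit_ok = all(
    math.isclose(logit(irf_3pl(t, a, b, 0.0)), a * (t - b), abs_tol=1e-9)
    for t in (-2.0, 0.4, 1.0, 3.0)
)

print(midpoint_ok, slope_ok, floor_ok, logit_ok)
```

The logit check sets the guessing parameter to zero because the linear log-odds relationship holds only in that case.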
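In other words, the standard logistic function has an asymptotic minimum of 0 (<math>c = 0</math>), is centered around 0 (<math>b = 0</math>, <math>P(0) = 1/2</math>), and has maximum slope <math>P'(0) = 1/4</math>. The <math>a</math> parameter stretches the horizontal scale, the <math>b</math> parameter shifts the horizontal scale, and the <math>c</math> parameter compresses the vertical scale from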