Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level. For this reason, it has also been called tailored testing. In other words, it is a form of computer-administered test in which the next item or set of items selected to be administered depends on the correctness of the test taker's responses to the most recent items administered.


How it works

CAT successively selects questions with the goal of maximizing the precision of the exam based on what is known about the examinee from previous questions. From the examinee's perspective, the difficulty of the exam seems to tailor itself to their level of ability. For example, if an examinee performs well on an item of intermediate difficulty, they will then be presented with a more difficult question; if they perform poorly, they will be presented with a simpler one. Compared to static multiple-choice tests that nearly everyone has experienced, with a fixed set of items administered to all examinees, computer-adaptive tests require fewer test items to arrive at equally accurate scores. (Nothing about the CAT methodology requires the items to be multiple-choice, but just as most exams are multiple-choice, most CAT exams also use this format.) The basic computer-adaptive testing method is an iterative algorithm with the following steps [Thissen, D., & Mislevy, R.J. (2000). Testing algorithms. In Wainer, H. (Ed.), ''Computerized Adaptive Testing: A Primer''. Mahwah, NJ: Lawrence Erlbaum Associates]:

1. The pool of available items is searched for the optimal item, based on the current estimate of the examinee's ability.
2. The chosen item is presented to the examinee, who answers it correctly or incorrectly.
3. The ability estimate is updated, based on all prior answers.
4. Steps 1–3 are repeated until a termination criterion is met.

Nothing is known about the examinee prior to the administration of the first item, so the algorithm generally starts by selecting an item of medium, or medium-easy, difficulty. As a result of adaptive administration, different examinees receive quite different tests [Green, B.F. (2000). System design and operation. In Wainer, H. (Ed.), ''Computerized Adaptive Testing: A Primer''. Mahwah, NJ: Lawrence Erlbaum Associates]. Although examinees are typically administered different tests, their ability scores are comparable to one another (i.e., as if they had received the same test, as is common in tests designed using classical test theory). The psychometric technology that allows equitable scores to be computed across different sets of items is
item response theory (IRT). IRT is also the preferred methodology for selecting optimal items, which are typically selected on the basis of ''information'' rather than difficulty per se. A related methodology called multistage testing (MST), or CAST, is used in the Uniform Certified Public Accountant Examination. MST avoids or reduces some of the disadvantages of CAT described below. See the 2006 special issue of Applied Measurement in Education on Computerized Multistage Testing for more information on MST.
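The iterative loop above can be sketched in a few lines of Python. This is an illustrative toy, not a production algorithm: it assumes a hypothetical pool described only by Rasch-style difficulty values, selects the item whose difficulty is closest to the current estimate, and updates the estimate with a simple shrinking step rather than the IRT-based estimators described later.

```python
def run_cat(difficulties, answer_fn, n_items=10):
    """Toy CAT loop illustrating the four steps: select, administer,
    update, and check termination (here, a fixed test length)."""
    theta = 0.0                      # nothing known yet: assume average ability
    remaining = list(difficulties)
    step = 1.0
    for _ in range(n_items):
        # Step 1: search the pool for the best item at the current estimate
        item = min(remaining, key=lambda b: abs(b - theta))
        remaining.remove(item)
        # Step 2: present the item; answer_fn returns True if answered correctly
        correct = answer_fn(item)
        # Step 3: update the ability estimate from the response
        theta += step if correct else -step
        step = max(0.9 * step, 0.2)  # damp updates as evidence accumulates
    # Step 4 (termination) is the fixed length of the loop above
    return theta

# Simulated examinee who answers correctly whenever the item is easier
# than their true ability of 1.0 (a deterministic stand-in for a person)
estimate = run_cat([b / 2 for b in range(-6, 7)], lambda b: b < 1.0)
```

Even with this crude update rule, the administered items cluster around the examinee's true ability, which is the behavior the steps above describe.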


Examples

CAT has existed since the 1970s, and there are now many assessments that utilize it:

*Graduate Management Admission Test (GMAT)
*MAP test from NWEA
*The SAT, which has announced that it will become multistage-adaptive in 2023
*National Council Licensure Examination (NCLEX)
*Armed Services Vocational Aptitude Battery (ASVAB)

Additionally, a list of active CAT exams is found at the International Association for Computerized Adaptive Testing, along with a list of current CAT research programs and a near-inclusive bibliography of all published CAT research.


Advantages

Adaptive tests can provide uniformly precise scores for most test-takers. In contrast, standard fixed tests almost always provide the best precision for test-takers of medium ability and increasingly poorer precision for test-takers with more extreme test scores. An adaptive test can typically be shortened by 50% and still maintain a higher level of precision than a fixed version. This translates into time savings for the test-taker, who does not waste time attempting items that are too hard or trivially easy. The testing organization also benefits from the time savings, since the cost of examinee seat time is substantially reduced. However, because developing a CAT involves much more expense than a standard fixed-form test, a large examinee population is necessary for a CAT testing program to be financially viable. Large target populations are common in scientific and research-based fields, where CAT may be used, for example, to detect the early onset of disabilities or diseases. The use of CAT in these fields has grown greatly in the past decade; although once not accepted in medical facilities and laboratories, CAT is now encouraged in the scope of diagnostics.

Like any computer-based test, adaptive tests may show results immediately after testing. Adaptive testing, depending on the item selection algorithm, may reduce the exposure of some items, because examinees typically receive different sets of items rather than the whole population being administered a single set. However, it may increase the exposure of others (namely the medium or medium-easy items presented to most examinees at the beginning of the test).


Disadvantages

The first issue encountered in CAT is the calibration of the item pool. In order to model the characteristics of the items (e.g., to pick the optimal item), all the items of the test must be pre-administered to a sizable sample and then analyzed. To achieve this, new items must be mixed into the operational items of an exam; their responses are recorded but do not contribute to the test-takers' scores. This is called "pilot testing," "pre-testing," or "seeding," and it presents logistical, ethical, and security issues. For example, it is impossible to field an operational adaptive test with brand-new, unseen items; all items must be pretested with a large enough sample to obtain stable item statistics. This sample may need to be as large as 1,000 examinees. Each program must decide what percentage of the test can reasonably be composed of unscored pilot items.

Although adaptive tests have ''exposure control'' algorithms to prevent overuse of a few items, the exposure conditioned upon ability is often not controlled and can easily approach 1. That is, it is common for some items to appear on nearly every test administered to people of a given ability level. This is a serious security concern because groups sharing items may well have a similar functional ability level. In fact, a completely randomized exam is the most secure (but also the least efficient).

Review of past items is generally disallowed. Adaptive tests tend to administer easier items after a person answers incorrectly, so an astute test-taker could, in principle, use such clues to detect incorrect answers and correct them. Test-takers could also be coached to deliberately pick wrong answers, leading to an increasingly easier test; after tricking the adaptive test into building a maximally easy exam, they could then review the items and answer them correctly, possibly achieving a very high score. Test-takers frequently complain about the inability to review.

Because of this sophistication, the development of a CAT has a number of prerequisites. The large sample sizes (typically hundreds of examinees) required by IRT calibrations must be present. Items must be scorable in real time if the next item is to be selected instantaneously. Psychometricians experienced with IRT calibration and CAT simulation research are necessary to provide validity documentation. Finally, a software system capable of true IRT-based CAT must be available.

In a CAT with a time limit, it is impossible for the examinee to accurately budget the time they can spend on each test item and to determine whether they are on pace to complete a timed section. Test-takers may thus be penalized for spending too much time on a difficult question presented early in a section and then failing to complete enough questions to accurately gauge their proficiency in areas left untested when time expires. While untimed CATs are excellent tools for formative assessments that guide subsequent instruction, timed CATs are unsuitable for high-stakes summative assessments used to measure aptitude for jobs and educational programs.


Components

There are five technical components in building a CAT (the following is adapted from Weiss & Kingsbury, 1984). This list does not include practical issues, such as item pretesting or live field release.

1. Calibrated item pool
2. Starting point or entry level
3. Item selection algorithm
4. Scoring procedure
5. Termination criterion


Calibrated item pool

A pool of items must be available for the CAT to choose from. Such items can be created in the traditional way (i.e., manually) or through automatic item generation. The pool must be calibrated with a psychometric model, which is used as a basis for the remaining four components. Typically, item response theory is employed as the psychometric model. One reason item response theory is popular is that it places persons and items on the same metric (denoted by the Greek letter theta), which is helpful for item selection (see below).
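For instance, under the widely used three-parameter logistic (3PL) IRT model, the probability that an examinee with ability theta answers item i correctly is

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

where a_i is the item's discrimination, b_i its difficulty (expressed on the same theta metric as examinee ability), and c_i its pseudo-guessing parameter. Calibration means estimating these parameters for every item in the pool from pretest data.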


Starting point

In CAT, items are selected based on the examinee's performance up to a given point in the test. However, the CAT obviously cannot make any specific estimate of examinee ability before any items have been administered, so some other initial estimate of examinee ability is necessary. If previous information regarding the examinee is known, it can be used, but often the CAT simply assumes that the examinee is of average ability, hence the first item often being of medium difficulty.


Item selection algorithm

As mentioned previously, item response theory places examinees and items on the same metric. Therefore, if the CAT has an estimate of examinee ability, it is able to select the item that is most appropriate for that estimate. Technically, this is done by selecting the item with the greatest ''information'' at that point. Information is a function of the discrimination parameter of the item, as well as the conditional variance and the pseudo-guessing parameter (if used).
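As a sketch, assuming the 3PL model, item information can be computed with Birnbaum's formula and used to pick the next item. The two-item pool and its parameter values here are made up purely for illustration:

```python
import math

def p3pl(theta, a, b, c=0.0):
    """3PL item response function: probability of a correct answer."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def information(theta, a, b, c=0.0):
    """Fisher information of a 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

# Hypothetical two-item pool: selection picks whichever item is more
# informative at the current ability estimate.
pool = [dict(a=1.0, b=-1.0, c=0.2), dict(a=1.0, b=1.5, c=0.2)]
theta_hat = 1.4
best = max(pool, key=lambda item: information(theta_hat, **item))
# an item is most informative near its own difficulty, so the b=1.5
# item is chosen for an examinee estimated at theta = 1.4
```

Note how information rises with the discrimination parameter a and peaks near the item's difficulty b, which is why "select the most informative item" tends to match item difficulty to examinee ability.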


Scoring procedure

After an item is administered, the CAT updates its estimate of the examinee's ability level. If the examinee answered the item correctly, the CAT will likely estimate their ability to be somewhat higher, and vice versa. This is done by using the item response function from item response theory to obtain a likelihood function of the examinee's ability. Two methods for this are ''maximum likelihood estimation'' and ''Bayesian estimation''. The latter assumes an ''a priori'' distribution of examinee ability and has two commonly used estimators: ''expectation a posteriori'' (EAP) and ''maximum a posteriori'' (MAP). Maximum likelihood is equivalent to a Bayesian maximum a posteriori estimate if a uniform (f(x)=1) prior is assumed. Maximum likelihood is asymptotically unbiased, but it cannot provide a theta estimate for a nonmixed (all correct or all incorrect) response vector, in which case a Bayesian method may have to be used temporarily.
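A sketch of Bayesian scoring, assuming a Rasch (one-parameter) model for simplicity: the likelihood of the response pattern is combined with a standard normal prior, and the posterior mean (EAP) is taken by numerical quadrature on a theta grid. Unlike maximum likelihood, the EAP estimate stays finite even for an all-correct response vector:

```python
import math

def p_rasch(theta, b):
    """Rasch item response function."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def likelihood(theta, difficulties, responses):
    """Likelihood of a scored response pattern (1 = correct) at theta."""
    L = 1.0
    for b, u in zip(difficulties, responses):
        p = p_rasch(theta, b)
        L *= p if u else (1.0 - p)
    return L

def eap(difficulties, responses, lo=-4.0, hi=4.0, n=801):
    """Expectation a posteriori: posterior mean of theta on a grid,
    using a standard normal prior density exp(-theta^2 / 2)."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    weights = [likelihood(t, difficulties, responses) * math.exp(-t * t / 2)
               for t in grid]
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

# EAP remains finite even when every answer is correct,
# where the maximum likelihood estimate would diverge to +infinity
theta_all_right = eap([-1.0, 0.0, 1.0], [1, 1, 1])
```

The prior pulls the estimate toward the population mean, which is exactly the bias that makes Bayesian estimators usable for nonmixed response vectors early in the test.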


Termination criterion

The CAT algorithm is designed to repeatedly administer items and update the estimate of examinee ability. This continues until the item pool is exhausted, unless a termination criterion is incorporated into the CAT. Often, the test is terminated when the examinee's standard error of measurement falls below a certain user-specified value, hence the statement above that an advantage of adaptive testing is that examinee scores will be uniformly precise, or "equiprecise." Other termination criteria exist for different purposes of the test, such as when the test is designed only to determine whether the examinee should "pass" or "fail," rather than to obtain a precise estimate of their ability [Lin, C.-J., & Spray, J.A. (2000). Effects of item-selection criteria on classification testing with the sequential probability ratio test (Research Report 2000-8). Iowa City, IA: ACT, Inc.].
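The equiprecision stopping rule can be sketched as follows, again assuming a Rasch model (where item information at theta is p(1 - p)); the target SE and maximum length are illustrative values a program would choose for itself:

```python
import math

def p_rasch(theta, b):
    """Rasch item response function."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standard_error(theta, administered):
    """SE of measurement: reciprocal square root of total test information,
    where each Rasch item contributes p * (1 - p) at the current theta."""
    info = sum(p_rasch(theta, b) * (1.0 - p_rasch(theta, b))
               for b in administered)
    return 1.0 / math.sqrt(info)

def should_stop(theta, administered, se_target=0.40, max_items=50):
    """Terminate once the SE falls below target, or at the maximum length."""
    return (len(administered) >= max_items
            or standard_error(theta, administered) <= se_target)
```

Because every examinee is tested until the same SE target is reached, scores come out equally precise across the ability range, at the cost of variable test lengths.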


Other issues


Pass-fail

In many situations, the purpose of the test is to classify examinees into two or more mutually exclusive and exhaustive categories. This includes the common "mastery test," where the two classifications are "pass" and "fail," but it also includes situations with three or more classifications, such as "Insufficient," "Basic," and "Advanced" levels of knowledge or competency. The kind of item-level adaptive CAT described in this article is most appropriate for tests that are not pass/fail, or for pass/fail tests where providing good feedback is extremely important. Some modifications are necessary for a pass/fail CAT, also known as a computerized classification test (CCT). For example, a new termination criterion and scoring algorithm must be applied that classifies the examinee into a category rather than providing a point estimate of ability. Examinees with true scores very close to the passing score will receive long tests, while those with true scores far above or below the passing score will receive the shortest exams.

There are two primary methodologies available for this. The more prominent of the two is the sequential probability ratio test (SPRT) [Wald, A. (1947). ''Sequential Analysis''. New York: Wiley; Reckase, M. D. (1983). A procedure for decision making using tailored testing. In D. J. Weiss (Ed.), ''New Horizons in Testing: Latent Trait Theory and Computerized Adaptive Testing'' (pp. 237-254). New York: Academic Press]. This formulates the examinee classification problem as a hypothesis test that the examinee's ability is equal to either some specified point above the cutscore or another specified point below it. Note that this is a point hypothesis formulation rather than the composite hypothesis formulation that is more conceptually appropriate; a composite formulation would be that the examinee's ability is in the region above the cutscore or the region below it.

A confidence interval approach is also used: after each item is administered, the algorithm determines the probability that the examinee's true score is above or below the passing score [Kingsbury, G.G., & Weiss, D.J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), ''New Horizons in Testing: Latent Trait Theory and Computerized Adaptive Testing'' (pp. 237-254). New York: Academic Press]. For example, the algorithm may continue until the 95% confidence interval for the true score no longer contains the passing score. At that point, no further items are needed because the pass/fail decision is already 95% accurate, assuming that the psychometric models underlying the adaptive testing fit the examinee and test. This approach was originally called "adaptive mastery testing," but it can be applied to non-adaptive item selection and to classification situations with two or more cutscores (the typical mastery test has a single cutscore).

As a practical matter, the algorithm is generally programmed to have a minimum and a maximum test length (or a minimum and maximum administration time). Otherwise, it would be possible for an examinee with ability very close to the cutscore to be administered every item in the bank without the algorithm making a decision.

The item selection algorithm utilized depends on the termination criterion. Maximizing information at the cutscore is more appropriate for the SPRT because it maximizes the difference in the probabilities used in the likelihood ratio [Spray, J. A., & Reckase, M. D. (1994). The selection of test items for decision making with a computerized adaptive test. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA, April 5-7, 1994]. Maximizing information at the ability estimate is more appropriate for the confidence interval approach because it minimizes the conditional standard error of measurement, which decreases the width of the confidence interval needed to make a classification.
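The SPRT decision rule can be sketched under an assumed Rasch model. The indifference-region half-width ''delta'' and the error rates alpha and beta are illustrative values that a test designer would set; the two hypothesized ability points are cut + delta and cut - delta, as described above:

```python
import math

def p_rasch(theta, b):
    """Rasch item response function."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def sprt_classify(difficulties, responses, cut=0.0, delta=0.5,
                  alpha=0.05, beta=0.05):
    """Wald SPRT for pass/fail: accumulate the log-likelihood ratio of the
    responses at cut + delta versus cut - delta, and decide once it crosses
    a bound derived from the tolerated error rates alpha and beta."""
    hi, lo = cut + delta, cut - delta
    llr = 0.0
    for b, u in zip(difficulties, responses):
        p_hi, p_lo = p_rasch(hi, b), p_rasch(lo, b)
        llr += math.log(p_hi / p_lo) if u else math.log((1 - p_hi) / (1 - p_lo))
    if llr >= math.log((1 - beta) / alpha):
        return "pass"
    if llr <= math.log(beta / (1 - alpha)):
        return "fail"
    return "continue"   # evidence insufficient: administer another item
```

Items at the cutscore move the log-likelihood ratio fastest, which is why SPRT-based tests select items that maximize information at the cutscore rather than at the examinee's ability estimate.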


Practical constraints of adaptivity

ETS researcher Martha Stocking has quipped that most adaptive tests are actually ''barely adaptive tests'' (BATs) because, in practice, many constraints are imposed upon item choice. For example, CAT exams must usually meet content specifications; a verbal exam may need to be composed of equal numbers of analogies, fill-in-the-blank, and synonym item types. CATs typically have some form of item exposure constraints, to prevent the most informative items from being over-exposed. On some tests, an attempt is also made to balance surface characteristics of the items, such as the gender of the people in the items or the ethnicities implied by their names. Thus CAT exams are frequently constrained in which items they may choose, and for some exams the constraints may be substantial and require complex search strategies (e.g., linear programming) to find suitable items.

A simple method for controlling item exposure is the "randomesque" or strata method: rather than selecting the most informative item at each point in the test, the algorithm randomly selects the next item from the five or ten most informative items. This can be used throughout the test or only at the beginning. Another method is the Sympson-Hetter method [Sympson, B.J., & Hetter, R.D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the annual conference of the Military Testing Association, San Diego], in which a random number is drawn from U(0,1) and compared to a ''k_i'' parameter determined for each item by the test user. If the random number is greater than ''k_i'', the next most informative item is considered instead.

Wim van der Linden and colleagues have advanced an alternative approach called ''shadow testing'', which involves creating entire ''shadow tests'' as part of selecting items. Selecting items from shadow tests helps adaptive tests meet selection criteria by focusing on globally optimal choices (as opposed to choices that are optimal for a given item).
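The randomesque method is simple to sketch. The item pool (plain difficulty values) and the information function here are hypothetical stand-ins; any real CAT would plug in its own IRT information function:

```python
import random

def randomesque_select(pool, theta, info_fn, k=5, rng=random):
    """Randomesque exposure control: choose at random among the k most
    informative remaining items rather than always taking the single best."""
    top_k = sorted(pool, key=lambda item: info_fn(theta, item), reverse=True)[:k]
    return rng.choice(top_k)

# Stand-in information function: items whose difficulty is closer to
# theta are treated as more informative.
closeness = lambda theta, b: -abs(b - theta)
pick = randomesque_select(list(range(-5, 6)), 0, closeness, k=5)
```

Spreading the choice over the top k items caps any single item's exposure rate near 1/k at that point in the test, at a small cost in measurement efficiency.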


Multidimensional

Given a set of items, a multidimensional computer adaptive test (MCAT) selects items from the bank according to the estimated abilities of the student, resulting in an individualized test. Unlike a standard CAT, which evaluates a single ability, an MCAT seeks to maximize the test's accuracy over multiple abilities examined simultaneously, using the sequence of items previously answered (Piton-Gonçalves and Aluisio, 2012).






Additional sources

*Drasgow, F., & Olson-Buchanan, J. B. (Eds.). (1999). ''Innovations in Computerized Assessment''. Hillsdale, NJ: Erlbaum.
*Piton-Gonçalves, J., & Aluísio, S. M. (2012). An architecture for multidimensional computer adaptive test with educational purposes. ACM, New York, NY, USA, 17-24.
*Piton-Gonçalves, J. (2020). Testes adaptativos para o Enade: uma aplicação metodológica. ''Meta: Avaliação'', 12(36), 665-688.
*Van der Linden, W. J., & Glas, C.A.W. (Eds.). (2000). ''Computerized Adaptive Testing: Theory and Practice''. Boston, MA: Kluwer.
*Wainer, H. (Ed.). (2000). ''Computerized Adaptive Testing: A Primer'' (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
*Weiss, D.J. (Ed.). (1983). ''New Horizons in Testing: Latent Trait Theory and Computerized Adaptive Testing''. New York: Academic Press.


Further reading


*"First Adaptive Test: Binet's IQ Test," ''International Association for Computerized Adaptive Testing (IACAT)''.
*"Adaptive Testing and Performance Analysis," ''Procedia Computer Science'', International Conference on Advanced Computing Technologies and Applications (ICACTA).
*Sands, W. A., Waters, B. K., & McBride, J. R. (Eds.). (1997). ''Computerized Adaptive Testing: From Inquiry to Operation''. Washington, DC: American Psychological Association.


External links


International Association for Computerized Adaptive Testing

Concerto: Open-source CAT Platform

FastTest: CAT platform with free version available

CAT Central, by David J. Weiss

''Applied Measurement in Education'', 19(3): 2006 special issue on computerized multistage testing.