Test validity is the extent to which a test (such as a chemical, physical, or scholastic test) accurately measures what it is supposed to measure. In the fields of psychological testing and educational testing, "validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests".
[American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). ''Standards for educational and psychological testing''. Washington, DC: American Educational Research Association.] Although classical models divided the concept into various "validities" (such as content validity, criterion validity, and construct validity),[Guion, R. M. (1980). On trinitarian doctrines of validity. ''Professional Psychology, 11'', 385-398.] the currently dominant view is that validity is a single unitary construct.
[Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. ''American Psychologist, 50'', 741-749.]
Validity is generally considered the most important issue in psychological and educational testing
[Popham, W. J. (2008). All About Assessment / A Misunderstood Grail. ''Educational Leadership, 66''(1), 82-83.] because it concerns the meaning placed on test results.
Though many textbooks present validity as a static construct, various models of validity have evolved since the first published recommendations for constructing psychological and educational tests.
[American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1954). ''Technical recommendations for psychological tests and diagnostic techniques''. Washington, DC: The Association.] These models can be categorized into two primary groups: classical models, which include several types of validity, and modern models, which present validity as a single construct. The modern models reorganize classical "validities" into either "aspects" of validity
or "types" of validity-supporting evidence.
Test validity is often confused with reliability, which refers to the consistency of a measure. Adequate reliability is a prerequisite of validity, but high reliability does not in any way guarantee that a measure is valid.
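The relationship between the two can be made concrete with a minimal sketch. In classical test theory, an observed test-criterion correlation is bounded by the square root of the product of the two measures' reliabilities, so low reliability caps validity while high reliability guarantees nothing. The function name and numbers below are illustrative, not from the sources cited here.

```python
import math

# Classical test theory bounds an observed validity coefficient r_xy by
# the reliabilities of the test (r_xx) and the criterion (r_yy):
#   r_xy <= sqrt(r_xx * r_yy)
# Low reliability therefore limits validity, but a perfectly reliable
# test may still measure the wrong thing entirely.

def max_validity(r_xx: float, r_yy: float = 1.0) -> float:
    """Upper bound on the observed test-criterion correlation."""
    return math.sqrt(r_xx * r_yy)

# A test with reliability .64 can correlate at most .80 with a
# perfectly measured criterion.
print(round(max_validity(0.64), 2))  # 0.8
```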
Historical background
Although psychologists and educators were aware of several facets of validity before World War II, their methods for establishing validity were commonly restricted to correlations of test scores with some known criterion.
[Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. Braun (Eds.), ''Test Validity'' (pp. 19-32). Hillsdale, NJ: Lawrence Erlbaum.] Under the direction of Lee Cronbach, the 1954 ''Technical Recommendations for Psychological Tests and Diagnostic Techniques'' attempted to clarify and broaden the scope of validity by dividing it into four parts: (a) concurrent validity, (b) predictive validity, (c) content validity, and (d) construct validity. Cronbach and Meehl's subsequent publication[Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. ''Psychological Bulletin, 52'', 281-302.] grouped predictive and concurrent validity into a "criterion-orientation", which eventually became criterion validity.
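The criterion-oriented approach of that era can be illustrated with a minimal sketch: the validity coefficient is simply the Pearson correlation between test scores and an external criterion measure. All data and variable names below are invented for illustration.

```python
# Sketch of classical criterion-related validation: correlate test
# scores with an external criterion (e.g., later performance ratings).
# The data are hypothetical.

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

test_scores = [85, 90, 78, 92, 70, 88]        # hypothetical selection-test scores
criterion = [3.9, 4.2, 3.1, 4.5, 2.8, 4.0]    # hypothetical performance ratings

print(f"validity coefficient: {pearson_r(test_scores, criterion):.2f}")
```

A high coefficient was, under this model, taken as the evidence of validity; later theorists argued such a correlation supports only one kind of inference from scores.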
Over the next four decades, many theorists, including Cronbach himself, voiced their dissatisfaction with this three-in-one model of validity.
[Guion, R. M. (1977). Content validity–The source of my discontent. ''Applied Psychological Measurement, 1'', 1-10.] Their arguments culminated in
Samuel Messick's 1995 article that described validity as a single construct, composed of six "aspects".
In his view, various inferences made from test scores may require different types of evidence, but not different validities.
The 1999 ''Standards for Educational and Psychological Testing''
largely codified Messick's model. They describe five types of validity-supporting evidence that incorporate each of Messick's aspects, and make no mention of the classical models’ content, criterion, and construct validities.
Validation process
According to the ''1999 Standards'',
validation is the process of gathering evidence to provide “a sound scientific basis” for interpreting the scores as proposed by the test developer and/or the test user. Validation therefore begins with a framework that defines the scope and aspects (in the case of multi-dimensional scales) of the proposed interpretation. The framework also includes a rational justification linking the interpretation to the test in question.
Validity researchers then list a series of propositions that must be met if the interpretation is to be valid. Or, conversely, they may compile a list of issues that may threaten the validity of the interpretations. In either case, the researchers proceed by gathering evidence – be it original empirical research, meta-analysis or review of existing literature, or logical analysis of the issues – to support or to question the interpretation's propositions (or the threats to the interpretation's validity). Emphasis is placed on quality, rather than quantity, of the evidence.
A single interpretation of any test result may require several propositions to be true (or may be questioned by any one of a set of threats to its validity). Strong evidence in support of a single proposition does not lessen the requirement to support the other propositions.
Evidence to support (or question) the validity of an interpretation can be categorized into one of five categories:
# Evidence based on test content
# Evidence based on response processes
# Evidence based on internal structure
# Evidence based on relations to other variables
# Evidence based on consequences of testing
Techniques to gather each type of evidence should only be employed when they yield information that would support or question the propositions required for the interpretation in question.
Each piece of evidence is finally integrated into a validity argument. The argument may call for a revision to the test, its administration protocol, or the theoretical constructs underlying the interpretations. If the test, and/or the interpretations of the test's results are revised in any way, a new validation process must gather evidence to support the new version.
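The logic of the process above can be sketched in a toy model (the propositions and evidence assignments are hypothetical): an interpretation stands only if every required proposition is backed by evidence from at least one of the five categories, and strong support for one proposition cannot compensate for another that lacks support.

```python
# Toy sketch of the validation logic described in the ''1999 Standards'':
# each required proposition must be supported by evidence from at least
# one recognized category before the interpretation is defensible.

EVIDENCE_CATEGORIES = {
    "test content",
    "response processes",
    "internal structure",
    "relations to other variables",
    "consequences of testing",
}

def interpretation_supported(propositions):
    """propositions maps each required proposition to the set of
    evidence categories in which support has been gathered."""
    return all(
        supported & EVIDENCE_CATEGORIES  # at least one recognized category
        for supported in propositions.values()
    )

# Hypothetical propositions for interpreting a reading-comprehension test:
claims = {
    "scores reflect reading comprehension": {"test content", "internal structure"},
    "scores predict course grades": {"relations to other variables"},
    "no adverse unintended consequences": set(),  # no evidence gathered yet
}
print(interpretation_supported(claims))  # False: one proposition unsupported
```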
See also
*Validity scale