Test validity

Test validity is the extent to which a test (such as a chemical, physical, or scholastic test) accurately measures what it is supposed to measure. In the fields of psychological testing and educational testing, "validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests". [1] Although classical models divided the concept into various "validities" (such as content validity, criterion validity, and construct validity), [2] the currently dominant view is that validity is a single unitary construct. [3]

Validity is generally considered the most important issue in psychological and educational testing [4] because it concerns the meaning placed on test results. [3] Though many textbooks present validity as a static construct, [5] various models of validity have evolved since the first published recommendations for constructing psychological and educational tests. [6] These models can be categorized into two primary groups: classical models, which include several distinct types of validity, and modern models, which present validity as a single construct. The modern models reorganize the classical "validities" into either "aspects" of validity [3] or "types" of validity-supporting evidence. [1]

Test validity is often confused with reliability, which refers to the consistency of a measure. Adequate reliability is a prerequisite of validity, but high reliability does not guarantee that a measure is valid.
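The distinction can be illustrated with simulated data: a test that consistently measures the wrong trait is highly reliable yet has essentially no validity. The variable names and data below are purely illustrative, not drawn from the article.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 1000
construct = [random.gauss(0, 1) for _ in range(n)]  # trait the test is supposed to measure
nuisance = [random.gauss(0, 1) for _ in range(n)]   # unrelated trait the test actually taps

# Two administrations of a test that consistently measures the WRONG trait:
# little measurement error, so the scores are highly consistent across occasions.
test_1 = [t + random.gauss(0, 0.1) for t in nuisance]
test_2 = [t + random.gauss(0, 0.1) for t in nuisance]

reliability = pearson(test_1, test_2)     # test-retest consistency: near 1.0
validity = pearson(test_1, construct)     # relation to the intended construct: near 0.0
```

The test-retest correlation is close to 1.0 while the correlation with the intended construct hovers near zero, which is exactly the sense in which reliability is necessary but not sufficient for validity.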

Historical background

Although psychologists and educators were aware of several facets of validity before World War II, their methods for establishing validity were commonly restricted to correlations of test scores with some known criterion. [7] Under the direction of Lee Cronbach, the 1954 Technical Recommendations for Psychological Tests and Diagnostic Techniques [6] attempted to clarify and broaden the scope of validity by dividing it into four parts: (a) concurrent validity, (b) predictive validity, (c) content validity, and (d) construct validity. Cronbach and Meehl's subsequent publication [8] grouped predictive and concurrent validity into a "criterion-orientation", which eventually became criterion validity.
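The pre-1954 practice described above can be sketched numerically: a validity coefficient is simply the correlation between test scores and a criterion, and the concurrent/predictive distinction concerns only when the criterion is observed, not how the statistic is computed. The data and names below are simulated for illustration.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n = 500
aptitude = [random.gauss(0, 1) for _ in range(n)]  # trait the test is meant to capture

# Hypothetical scores: a test, a criterion available now, a criterion observed later
test_scores = [a + random.gauss(0, 0.6) for a in aptitude]
current_grades = [a + random.gauss(0, 0.8) for a in aptitude]  # same-time criterion
later_outcome = [a + random.gauss(0, 0.8) for a in aptitude]   # follow-up criterion

# Concurrent and predictive validity are the same statistic;
# only the timing of the criterion measurement differs.
concurrent_validity = pearson(test_scores, current_grades)
predictive_validity = pearson(test_scores, later_outcome)
```

Both coefficients estimate how strongly the test relates to its criterion; the conceptual work of the 1954 Technical Recommendations was to recognize that such correlations are only part of the evidence needed to support a test's interpretation.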

Over the next four decades, many theorists, including Cronbach himself, [9] voiced their dissatisfaction with this three-in-one model of validity. [10] [11] [12] Their arguments culminated in Samuel Messick's 1995 article that described validity as a single construct, composed of six "aspects". [3] In his view, various inferences made from test scores may require different types of evidence, but not different validities.

The 1999 Standards for Educational and Psychological Testing [1] largely codified Messick's model. They describe five types of validity-supporting evidence that incorporate each of Messick's aspects, and make no mention of the classical models’ content, criterion, and construct validities.

Validation process

According to the 1999 Standards, [1] validation is the process of gathering evidence to provide "a sound scientific basis" for interpreting the scores as proposed by the test developer and/or the test user. Validation therefore begins with a framework that defines the scope and aspects (in the case of multi-dimensional scales) of the proposed interpretation. The framework also includes a rational justification linking the interpretation to the test in question.

Validity researchers then list a series of propositions that must be met if the interpretation is to be valid. Or, conversely, they may compile a list of issues that may threaten the validity of the interpretations. In either case, the researchers proceed by gathering evidence – be it original empirical research, meta-analysis or review of existing literature, or logical analysis of the issues – to support or to question the interpretation's propositions (or the threats to the interpretation's validity). Emphasis is placed on quality, rather than quantity, of the evidence.

A single interpretation of any test result may require several propositions to be true (or may be questioned by any one of a set of threats to its validity). Strong evidence in support of a single proposition does not lessen the requirement to support the other propositions.

Evidence to support (or question) the validity of an interpretation can be categorized into one of five categories:

  1. Evidence based on test content
  2. Evidence based on response processes
  3. Evidence based on internal structure
  4. Evidence based on relations to other variables
  5. Evidence based on consequences of testing

Techniques to gather each type of evidence should only be employed when they yield information that would support or question the propositions required for the interpretation in question.

Each piece of evidence is finally integrated into a validity argument. The argument may call for a revision to the test, its administration protocol, or the theoretical constructs underlying the interpretations. If the test, and/or the interpretations of the test's results are revised in any way, a new validation process must gather evidence to support the new version.

See also

  - Psychological statistics
  - Psychometrics
  - Reliability (statistics)
  - Validity (statistics)
  - Scale (social sciences)
  - Quantitative marketing research
  - Cronbach's alpha
  - Educational assessment
  - Construct validity
  - Predictive validity
  - Criterion validity
  - Discriminant validity
  - Personnel selection
  - Nomological network
  - Constructive realism
  - Construct (philosophy)
  - Paul E. Meehl
  - Lee Cronbach
  - Anne Anastasi
  - Anthony Gregorc

References

  1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  2. Guion, R. M. (1980). On trinitarian doctrines of validity. Professional Psychology, 11, 385-398.
  3. Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
  4. Popham, W. J. (2008). All About Assessment / A Misunderstood Grail. Educational Leadership, 66(1), 82-83.
  5. Nitko, A. J., & Brookhart, S. M. (2004). Educational assessment of students. Upper Saddle River, NJ: Merrill-Prentice Hall.
  6. American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1954). Technical recommendations for psychological tests and diagnostic techniques. Washington, DC: The Association.
  7. Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. Braun (Eds.), Test Validity (pp. 19-32). Hillsdale, NJ: Lawrence Erlbaum.
  8. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
  9. Cronbach, L. J. (1969). Validation of educational measures. Proceedings of the 1969 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service, 35-52.
  10. Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 634-694.
  11. Tenopyr, M. L. (1977). Content-construct confusion. Personnel Psychology, 30, 47-54.
  12. Guion, R. M. (1977). Content validity–The source of my discontent. Applied Psychological Measurement, 1, 1-10.