Construct validity concerns how well a set of indicators represents or reflects a concept that is not directly measurable. [1] [2] [3] Construct validation is the accumulation of evidence to support the interpretation of what a measure reflects. [1] [4] [5] [6] Modern validity theory defines construct validity as the overarching concern of validity research, subsuming all other types of validity evidence, [7] [8] such as content validity and criterion validity. [9] [10]
Construct validity is the appropriateness of inferences made on the basis of observations or measurements (often test scores), specifically whether a test can reasonably be considered to reflect the intended construct. Constructs are abstractions that are deliberately created by researchers in order to conceptualize the latent variable, which is correlated with scores on a given measure (although it is not directly observable). Construct validity examines the question: Does the measure behave like the theory says a measure of that construct should behave?
Construct validity is essential to the perceived overall validity of the test. Construct validity is particularly important in the social sciences, psychology, psychometrics and language studies.
Psychologists such as Samuel Messick (1998) have pushed for a unified view of construct validity "...as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores..." [11] While Messick's view is popular in educational measurement and originated in a career spent explaining validity in the context of the testing industry, Borsboom et al. (2004) offered a definition more in line with foundational psychological research, one supported by data-driven empirical studies and emphasizing statistical and causal reasoning. [12]
Key to construct validity are the theoretical ideas behind the trait under consideration, i.e. the concepts that organize how aspects of personality, intelligence, etc. are viewed. [13] Paul Meehl states that, "The best construct is the one around which we can build the greatest number of inferences, in the most direct fashion." [1]
Scale purification, i.e. "the process of eliminating items from multi-item scales" (Wieland et al., 2017), can influence construct validity. A framework presented by Wieland et al. (2017) highlights that both statistical and judgmental criteria need to be taken into consideration when making scale purification decisions. [14]
Throughout the 1940s, scientists sought ways to validate experiments before publishing them. The result was a plethora of different validities (intrinsic validity, face validity, logical validity, empirical validity, etc.), which made it difficult to tell which were actually the same and which were not useful at all. Until the mid-1950s, there were very few universally accepted methods for validating psychological experiments, mainly because no one had determined exactly which qualities of an experiment should be examined before publication. Between 1950 and 1954, the APA Committee on Psychological Tests met and discussed the issues surrounding the validation of psychological experiments. [1]
Around this time, the term construct validity was first coined by Paul Meehl and Lee Cronbach in their seminal article "Construct Validity in Psychological Tests". They noted that the idea of construct validity was not new at that point; rather, it combined many different types of validity dealing with theoretical concepts. They proposed three steps to evaluate construct validity: articulating a set of theoretical concepts and their interrelations, developing ways to measure the hypothetical constructs proposed by the theory, and empirically testing the hypothesized relations.
Many psychologists noted that an important role of construct validation in psychometrics was its greater emphasis on theory. This emphasis addressed a core requirement: validation should include some demonstration that the test measures the theoretical construct it purports to measure. Construct validity has three aspects or components: the substantive component, the structural component, and the external component. [15] These correspond closely to three stages in the test construction process: constituting the pool of items, analyzing and selecting the internal structure of the pool of items, and correlating test scores with criteria and other variables.
In the 1970s there was growing debate between theorists who began to see construct validity as the dominant model, pushing towards a more unified theory of validity, and those who continued to work from multiple validity frameworks. [16] Many psychologists and education researchers saw "predictive, concurrent, and content validities as essentially ad hoc; construct validity was the whole of validity from a scientific point of view". [15] The 1974 version of the Standards for Educational and Psychological Testing recognized the interrelatedness of the three different aspects of validity: "These aspects of validity can be discussed independently, but only for convenience. They are interrelated operationally and logically; only rarely is one of them alone important in a particular situation".
In 1989 Messick presented a new conceptualization of construct validity as a unified and multifaceted concept. [17] Under this framework, all forms of validity are connected to, and dependent on, the quality of the construct. He noted that a unified theory was not his own idea, but rather the culmination of debate and discussion within the scientific community over the preceding decades. Messick's unified theory identifies six aspects of construct validity: the content, substantive, structural, generalizability, external, and consequential aspects. [18]
How construct validity should properly be viewed is still a subject of debate for validity theorists. The core of the difference lies in an epistemological difference between positivist and postpositivist theorists.
Evaluation of construct validity requires examining the correlations of the measure with variables that are known to be related to the construct, or for which there are theoretical grounds to expect a relationship. This is consistent with the multitrait-multimethod matrix (MTMM) approach to examining construct validity described in Campbell and Fiske's landmark paper (1959). [19] There are other methods besides MTMM: construct validity can also be evaluated through different forms of factor analysis, structural equation modeling (SEM), and other statistical techniques. [20] [21] A single study does not prove construct validity; rather, validation is a continuous process of evaluation, reevaluation, refinement, and development. Correlations that fit the expected pattern contribute evidence of construct validity, and the overall judgment rests on the accumulation of such correlations from numerous studies using the instrument being evaluated. [22]
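A minimal sketch of this correlation-based reasoning is given below. All data, variable names, and the scenario (a new anxiety scale compared with an established measure and an unrelated variable) are hypothetical; real evaluations would draw on many studies and methods such as factor analysis or SEM rather than a single pair of correlations.

```python
from statistics import mean

def pearson(x, y):
    # Pearson product-moment correlation coefficient
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: a new anxiety scale, an established anxiety
# measure (theoretically related), and shoe size (unrelated).
new_scale   = [12, 18, 9, 22, 15, 7, 19, 14]
established = [10, 20, 8, 21, 16, 6, 18, 13]
shoe_size   = [42, 42, 40, 40, 42, 40, 40, 42]

r_related   = pearson(new_scale, established)   # expected: strong
r_unrelated = pearson(new_scale, shoe_size)     # expected: near zero
```

A pattern in which `r_related` is high and `r_unrelated` is near zero is one piece of evidence consistent with the construct interpretation; the reverse pattern would count against it.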
Most researchers attempt to test construct validity before the main research, often using pilot studies: small-scale preliminary studies aimed at testing the feasibility of a full-scale study. Pilot studies help researchers gauge the strength of their research and make any necessary adjustments. Another method is the known-groups technique, which involves administering the measurement instrument to groups expected to differ on the construct because of known characteristics. Hypothesized-relationship testing involves logical analysis based on theory or prior research. [6] Intervention studies are yet another method of evaluating construct validity: a group that scores low on the construct is tested, taught the construct, and then re-measured. A statistically significant difference between pre-test and post-test scores can demonstrate good construct validity. [23]
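The known-groups logic can be sketched as follows. The scenario, group labels, and scores are invented for illustration: if a test-anxiety scale measures what it claims, a group with diagnosed anxiety should score well above a comparison group, which can be summarized with a standardized effect size.

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    # Standardized mean difference using a pooled standard deviation
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical test-anxiety scores for a group with diagnosed
# anxiety and a same-sized comparison group.
clinical = [28, 31, 25, 34, 29, 27, 32, 30]
control  = [15, 19, 12, 21, 17, 14, 18, 16]

d = cohens_d(clinical, control)  # a large d supports the known-groups prediction
```

In practice the group difference would also be tested for statistical significance, and a failure to find the expected difference would count as evidence against the construct interpretation of the scores.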
Convergent and discriminant validity are the two subtypes of validity that make up construct validity. Convergent validity refers to the degree to which two measures of constructs that theoretically should be related are in fact related. In contrast, discriminant validity tests whether concepts or measurements that are supposed to be unrelated are in fact unrelated. [19] Take, for example, a construct of general happiness. If a measure of general happiness has convergent validity, then constructs similar to happiness (satisfaction, contentment, cheerfulness, etc.) should relate positively to the measure. If it has discriminant validity, then constructs that are not supposed to relate positively to general happiness (sadness, depression, despair, etc.) should not relate to the measure. Measures can have one subtype of construct validity and not the other. Continuing the example, a researcher could create an inventory in which general happiness correlates very highly with contentment; but if happiness also correlates significantly and positively with depression, the measure's construct validity is called into question: the test has convergent validity but not discriminant validity.
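The happiness example can be made concrete with a short sketch. The respondent scores below are invented; the point is only the expected pattern: a strong positive correlation with a similar construct (convergent evidence) and a negative or weak correlation with a dissimilar one (discriminant evidence).

```python
from statistics import mean

def pearson(x, y):
    # Pearson product-moment correlation coefficient
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical questionnaire scores for the same respondents
happiness   = [7, 4, 8, 3, 6, 9, 2, 5]
contentment = [6, 4, 7, 3, 6, 8, 2, 5]  # similar construct
depression  = [2, 6, 1, 7, 3, 1, 8, 4]  # dissimilar construct

r_convergent   = pearson(happiness, contentment)  # expected: strongly positive
r_discriminant = pearson(happiness, depression)   # expected: negative or weak
```

A significant positive `r_discriminant` here would be the problematic pattern described above: convergent validity without discriminant validity.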
Lee Cronbach and Paul Meehl (1955) [1] proposed that the development of a nomological net was essential to the measurement of a test's construct validity. A nomological network defines a construct by illustrating its relation to other constructs and behaviors. It is a representation of the concepts (constructs) of interest in a study, their observable manifestations, and the interrelationships among them, and it examines whether the relationships between constructs are consistent with the relationships between the observed measures of those constructs. Thorough observation of how constructs relate to one another can also generate new constructs. For example, intelligence and working memory are considered highly related constructs; through observation of their underlying components, psychologists developed new theoretical constructs such as controlled attention [24] and short-term loading. [25] Creating a nomological net can also make the observation and measurement of existing constructs more efficient by pinpointing errors. [1] Researchers have found that bumps on the human skull (phrenology) are not indicators of intelligence, but that brain volume is. Removing phrenology from the nomological net of intelligence and adding findings on brain volume makes constructs of intelligence more efficient and more powerful. The weaving together of all of these interrelated concepts and their observable traits creates a "net" that supports the theoretical construct. For example, in the nomological network for academic achievement, we would expect observable traits of academic achievement (e.g. GPA, SAT, and ACT scores) to relate to observable traits of studiousness (hours spent studying, attentiveness in class, detail of notes). If they do not, then there is a problem with the measurement (of academic achievement or studiousness) or with the purported theory of achievement.
If they are indicators of one another, then the nomological network, and therefore the constructed theory, of academic achievement is strengthened. Although the nomological network proposed a theory of how to strengthen constructs, it does not tell us how to assess construct validity in a study.
The multitrait-multimethod matrix (MTMM) is an approach to examining construct validity developed by Campbell and Fiske (1959). [19] This model examines convergence (evidence that different measurement methods of a construct give similar results) and discriminability (the ability to differentiate the construct from other, related constructs). Campbell and Fiske identified six major considerations: the evaluation of convergent validity, the evaluation of discriminant (divergent) validity, trait-method units, multitrait-multimethods, truly different methodologies, and trait characteristics. This design allows investigators to test for "convergence across different measures...of the same 'thing'...and for divergence between measures...of related but conceptually distinct 'things'". [2] [26]
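The core comparison in an MTMM analysis can be sketched with a toy matrix. The traits (extraversion, anxiety), methods (self-report, peer rating), and every correlation value below are hypothetical; the sketch checks Campbell and Fiske's rule of thumb that same-trait, different-method correlations (the validity diagonal) should exceed the different-trait correlations.

```python
# Hypothetical correlations among four measures: two traits
# (extraversion "E", anxiety "A"), each measured by two methods
# (self-report "S", peer rating "P").
corr = {
    frozenset([("E", "S"), ("E", "P")]): 0.62,  # monotrait-heteromethod
    frozenset([("A", "S"), ("A", "P")]): 0.58,  # monotrait-heteromethod
    frozenset([("E", "S"), ("A", "S")]): 0.25,  # heterotrait-monomethod
    frozenset([("E", "P"), ("A", "P")]): 0.22,  # heterotrait-monomethod
    frozenset([("E", "S"), ("A", "P")]): 0.11,  # heterotrait-heteromethod
    frozenset([("A", "S"), ("E", "P")]): 0.09,  # heterotrait-heteromethod
}

def is_validity_diagonal(pair):
    # Same trait measured by different methods -> convergent evidence
    (t1, _), (t2, _) = tuple(pair)
    return t1 == t2

convergent   = [r for p, r in corr.items() if is_validity_diagonal(p)]
discriminant = [r for p, r in corr.items() if not is_validity_diagonal(p)]

# Every validity-diagonal value should exceed the different-trait
# correlations for the pattern to support construct validity.
passes = min(convergent) > max(discriminant)
```

In a real MTMM study the full matrix also includes same-method reliability values, and the comparison is made cell by cell rather than with a single minimum/maximum check.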
Apparent construct validity can be misleading due to a range of problems in hypothesis formulation and experimental design.
An in-depth exploration of the threats to construct validity is presented in Trochim. [31]
Psychological statistics is the application of formulas, theorems, numbers, and laws to psychology. Statistical methods for psychology include the development and application of statistical theory and methods for modeling psychological data. These methods include psychometrics, factor analysis, experimental design, and Bayesian statistics.
Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.
Validity is the extent to which a concept, conclusion, or measurement is well-founded and likely corresponds accurately to the real world. The word "valid" is derived from the Latin validus, meaning strong. The validity of a measurement tool is the degree to which the tool measures what it claims to measure. Validity is based on the strength of a collection of different types of evidence.
In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.
In psychometrics, predictive validity is the extent to which a score on a scale or test predicts scores on some criterion measure.
Personality Assessment Inventory (PAI), developed by Leslie Morey, is a self-report 344-item personality test that assesses a respondent's personality and psychopathology. Each item is a statement about the respondent that the respondent rates with a 4-point scale. It is used in various contexts, including psychotherapy, crisis/evaluation, forensic, personnel selection, pain/medical, and child custody assessment. The test construction strategy for the PAI was primarily deductive and rational. It shows good convergent validity with other personality tests, such as the Minnesota Multiphasic Personality Inventory and the Revised NEO Personality Inventory.
In psychometrics, criterion validity, or criterion-related validity, is the extent to which an operationalization of a construct, such as a test, relates to, or predicts, a theoretically related behaviour or outcome: the criterion. Criterion validity is often divided into concurrent and predictive validity based on the timing of measurement for the "predictor" and outcome. Concurrent validity compares the measure in question with an outcome assessed at the same time; the Standards for Educational & Psychological Tests states that "concurrent validity reflects only the status quo at a particular time." Predictive validity, by contrast, compares the measure with an outcome assessed at a later time. Although concurrent and predictive validity are similar, researchers are cautioned to keep the terms and findings separate: "Concurrent validity should not be used as a substitute for predictive validity without an appropriate supporting rationale." Criterion validity is typically assessed by comparison with a gold-standard test.
A nomological network is a representation of the concepts (constructs) of interest in a study, their observable manifestations, and the interrelationships between these. The term "nomological" derives from the Greek, meaning "lawful", or in philosophy of science terms, "law-like". It was Cronbach and Meehl's view of construct validity that in order to provide evidence that a measure has construct validity, a nomological network must be developed for its measure.
Donald Thomas Campbell was an American social scientist. He is noted for his work in methodology. He coined the term evolutionary epistemology and developed a selectionist theory of human creativity. A Review of General Psychology survey, published in 2002, ranked Campbell as the 33rd most cited psychologist of the 20th century.
Paul Everett Meehl was an American clinical psychologist. He was the Hathaway and Regents' Professor of Psychology at the University of Minnesota, and past president of the American Psychological Association. A Review of General Psychology survey, published in 2002, ranked Meehl as the 74th most cited psychologist of the 20th century, in a tie with Eleanor J. Gibson. Throughout his nearly 60-year career, Meehl made seminal contributions to psychology, including empirical studies and theoretical accounts of construct validity, schizophrenia etiology, psychological assessment, behavioral prediction, metascience, and philosophy of science.
Lee Joseph Cronbach was an American educational psychologist who made contributions to psychological testing and measurement.
Convergent validity in the behavioral sciences refers to the degree to which two measures that theoretically should be related, are in fact related. Convergent validity, along with discriminant validity, is a subtype of construct validity. Convergent validity can be established if two similar constructs correspond with one another, while discriminant validity applies to two dissimilar constructs that are easily differentiated.
In psychology, discriminant validity tests whether concepts or measurements that are not supposed to be related are actually unrelated.
In philosophy, a construct is an object which is ideal, that is, an object of the mind or of thought, meaning that its existence may be said to depend upon a subject's mind. This contrasts with any possibly mind-independent objects, the existence of which purportedly does not depend on the existence of a conscious observing subject. Thus, the distinction between these two terms may be compared to that between phenomenon and noumenon in other philosophical contexts and to many of the typical definitions of the terms realism and idealism also. In the correspondence theory of truth, ideas, such as constructs, are to be judged and checked according to how well they correspond with their referents, often conceived as part of a mind-independent reality.
The multitrait-multimethod (MTMM) matrix is an approach to examining construct validity developed by Campbell and Fiske (1959). It organizes convergent and discriminant validity evidence for comparison of how a measure relates to other measures. The conceptual approach has influenced experimental design and measurement theory in psychology, including applications in structural equation models.
Anne Anastasi was an American psychologist best known for her pioneering development of psychometrics. Her generative work, Psychological Testing, remains a classic text in which she drew attention to the individual being tested and therefore to the responsibilities of the testers. She called for them to go beyond test scores, to search the assessed individual's history to help them to better understand their own results and themselves.
Test validity is the extent to which a test accurately measures what it is supposed to measure. In the fields of psychological testing and educational testing, "validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests". Although classical models divided the concept into various "validities", the currently dominant view is that validity is a single unitary construct.
Affect measures are used in the study of human affect, and refer to measures obtained from self-report studies asking participants to quantify their current feelings or average feelings over a longer period of time. Even though some affect measures contain variations that allow assessment of basic predispositions to experience a certain emotion, tests for such stable traits are usually considered to be personality tests.
The person–situation debate in personality psychology refers to the controversy concerning whether the person or the situation is more influential in determining a person's behavior. Personality trait psychologists believe that a person's personality is relatively consistent across situations. Situationists, opponents of the trait approach, argue that people are not consistent enough from situation to situation to be characterized by broad personality traits. The debate is also an important discussion when studying social psychology, as both topics address the various ways a person could react to a given situation.
Measurement invariance or measurement equivalence is a statistical property of measurement that indicates that the same construct is being measured across some specified groups. For example, measurement invariance can be used to study whether a given measure is interpreted in a conceptually similar manner by respondents representing different genders or cultural backgrounds. Violations of measurement invariance may preclude meaningful interpretation of measurement data. Tests of measurement invariance are increasingly used in fields such as psychology to supplement evaluation of measurement quality rooted in classical test theory.