Consensus-based assessment


Consensus-based assessment expands on the common practice of consensus decision-making and the theoretical observation that expertise can be closely approximated by large numbers of novices or journeymen. It creates a method for determining measurement standards for very ambiguous domains of knowledge, such as emotional intelligence, politics, religion, values and culture in general. From this perspective, the shared knowledge that forms cultural consensus can be assessed in much the same way as expertise or general intelligence.


Measurement standards for general intelligence

Consensus-based assessment is based on a simple finding: samples of individuals with differing competence (e.g., experts and apprentices) rate relevant scenarios on Likert scales with similar mean ratings. Thus, within a CBA framework, cultural standards for scoring keys can be derived from the population that is being assessed. Peter Legree and Joseph Psotka, working together over the past decades, proposed that psychometric g could be measured unobtrusively through survey-like scales requiring judgments. Each person can be scored either by the deviation of their judgments from the group or expert mean, or by the Pearson correlation between their judgments and the group mean. When ratings are standardized, the two techniques are perfectly correlated. Legree and Psotka subsequently created scales that asked individuals to estimate word frequency, judge binary probabilities of good continuation, identify knowledge implications, and approximate employment distributions. The items were carefully chosen to avoid objective referents, so the scales required respondents to provide judgments that were scored against broadly developed, consensual standards. Performance on this judgment battery correlated approximately 0.80 with conventional measures of psychometric g. The response keys were consensually derived. Unlike with mathematics or physics questions, the selection of items, scenarios, and options to assess psychometric g was guided only roughly by a theory that emphasized complex judgment, and the explicit keys were unknown until the assessments had been made: they were determined by the average of everyone's responses, using deviation scores, correlations, or factor scores.
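The two scoring rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by Legree and Psotka: the function name `cba_scores`, the toy ratings matrix, and the choice of mean absolute deviation (negated so that higher is better) are all hypothetical, and the consensus key here includes each respondent's own ratings.

```python
import numpy as np

def cba_scores(ratings):
    """Score each respondent against a consensus key.

    ratings: (n_people, n_items) array of Likert ratings (hypothetical data).
    Returns the key, a correlation score and a deviation score per person.
    """
    ratings = np.asarray(ratings, dtype=float)
    # Consensus key: the mean rating of the whole sample on each item.
    key = ratings.mean(axis=0)
    # Correlation score: Pearson r between a person's ratings and the key.
    # (A respondent who gives the same rating to every item would yield NaN.)
    corr = np.array([np.corrcoef(row, key)[0, 1] for row in ratings])
    # Deviation score: mean absolute distance from the key, negated so that
    # higher values mean closer agreement with the consensus.
    dev = -np.abs(ratings - key).mean(axis=1)
    return key, corr, dev

# Toy example: three respondents broadly agree, the fourth reverses the pattern.
ratings = np.array([
    [5, 4, 2, 1, 3],
    [4, 4, 1, 2, 3],
    [5, 3, 2, 1, 4],
    [1, 2, 5, 4, 3],
])
key, corr, dev = cba_scores(ratings)
```

In this toy sample the fourth respondent receives the lowest score under both rules, because their ratings run against the pattern shared by the other three.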

Measurement standards for cultural knowledge

One way to understand the connection between expertise and consensus is to consider that, for many performance domains, expertise largely reflects knowledge derived from experience. Since novices tend to have fewer experiences, their opinions err in various inconsistent directions. As experience is acquired, however, the opinions of journeymen through to experts become more consistent. According to this view, errors are random. Ratings data collected from large samples of respondents of varying expertise can thus be used to approximate the average ratings that a substantial number of experts would provide, were many experts available. Because the standard error of a mean approaches zero as the number of observations becomes very large, estimates based on groups of varying competence will converge on the best performance standards. The means of these groups' responses can be used to create effective scoring rubrics, or measurement standards, to evaluate performance. This approach is particularly relevant to scoring subjective areas of knowledge that are scaled using Likert response scales, and it has been applied to develop scoring standards for several domains where experts are scarce.
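The convergence argument can be illustrated with a small simulation. It is a sketch under strong assumptions: the "true" standard `true_key`, the noise level, and the function `consensus_error` are hypothetical, errors are modeled as independent zero-mean Gaussian noise (the random-error view described above), and ratings are treated as continuous rather than rounded to Likert categories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standard that a large panel of experts would agree on.
true_key = np.array([4.5, 1.5, 3.0, 2.0, 4.0, 3.5])

def consensus_error(n_raters, noise_sd):
    """Largest gap between the sample-mean key and the true standard."""
    # Each rater reports the true standard plus independent random error.
    noise = rng.normal(0.0, noise_sd, size=(n_raters, true_key.size))
    ratings = true_key + noise
    return np.abs(ratings.mean(axis=0) - true_key).max()

# Same noisy raters, different sample sizes.
small = consensus_error(10, noise_sd=1.5)
large = consensus_error(10_000, noise_sd=1.5)
print(f"max error with 10 raters:     {small:.3f}")
print(f"max error with 10,000 raters: {large:.3f}")
```

The standard error of each item mean is `noise_sd / sqrt(n_raters)`, so the key derived from the larger sample sits much closer to the assumed standard. If errors were systematically biased rather than random, the mean would not converge in this way.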

Experimental results

In practice, analyses have demonstrated high levels of convergence between expert and CBA standards: the values quantifying those standards are highly correlated (Pearson rs ranging from .72 to .95), and scores based on those standards are also highly correlated (rs ranging from .88 to .99), provided the samples in both groups are large (Legree, Psotka, Tremble & Bourne, 2005). This convergence between CBA and expert-referenced scores, together with the associated validity data, indicates that CBA and expert-based scoring can be used interchangeably, provided that the ratings data are collected using large samples of experts and novices or journeymen.

Factor analysis

CBA is often computed as the Pearson correlation between each person's Likert scale judgments across a set of items and the mean of all people's judgments on those same items. The correlation is then a measure of that person's proximity to the consensus. It is also sometimes computed as a standardized deviation score from the consensus means of the groups. When ratings are standardized, these two procedures are mathematically isomorphic. If culture is considered to be shared knowledge, and the mean of the group's ratings on a focused domain of knowledge is considered a measure of the cultural consensus in that domain, then both procedures assess CBA as a measure of an individual person's cultural understanding.
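One way to see how the two scores relate is the identity below. It assumes that both the person's ratings and the consensus means have been standardized across the n items (zero mean and unit variance, using the population formula). The symbols x_i (the person's standardized rating on item i), m_i (the standardized consensus mean on item i), n and r (their Pearson correlation) are notation introduced here for illustration.

```latex
\sum_{i=1}^{n} (x_i - m_i)^2
  = \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} m_i^2 - 2 \sum_{i=1}^{n} x_i m_i
  = n + n - 2nr
  = 2n(1 - r)
```

Under this assumption the squared deviation is a decreasing linear function of the correlation, so the two scores order respondents identically. With unstandardized ratings, the deviation score also reflects differences in each person's mean rating and spread, and the two scores need not agree exactly.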

However, it may be that the consensus is not evenly distributed over all subordinate items about a topic. Perhaps the knowledge content of the items is distributed over domains with differing consensus. For instance, conservatives who are libertarians may feel differently about invasion of privacy than conservatives who feel strongly about law and order. In fact, standard factor analysis brings this issue to the fore.

In either centroid or principal components analysis (PCA), the first factor scores are created by multiplying each rating by the correlation of the factor (usually the mean of all standardized ratings for each person) against each item's ratings. This multiplication weights each item by the correlation between ratings on that item and the overall pattern of individual differences (the component scores). If consensus is unevenly distributed over these items, some items may be more focused on the overall issues of the common factor. If an item correlates highly with the pattern of overall individual differences, then it is weighted more strongly in the overall factor scores. This weighting implicitly also weights the CBA score, since the items that share a common CBA pattern of consensus are the ones weighted more heavily in factor analysis.
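The item weighting described above can be sketched with a PCA of simulated ratings. Everything in the data-generating step is an assumption made for illustration: a single latent dimension `latent`, five items with the hypothetical loadings in `true_loadings`, and continuous rather than Likert-valued ratings. The sketch uses NumPy's eigendecomposition of the item correlation matrix rather than a dedicated factor analysis routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people = 500

# Hypothetical data: three items track the common factor closely, two barely do.
latent = rng.normal(size=n_people)
true_loadings = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
noise = rng.normal(size=(n_people, true_loadings.size))
ratings = latent[:, None] * true_loadings + noise * np.sqrt(1 - true_loadings**2)

# Standardize each item, then take the first principal component of the
# item-by-item correlation matrix.
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)
item_corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(item_corr)  # eigenvalues in ascending order
first = eigvecs[:, -1]
if first.sum() < 0:  # the sign of an eigenvector is arbitrary; fix it
    first = -first

# Loadings: correlation of each item with the first component.
loadings = first * np.sqrt(eigvals[-1])
# Factor scores: each standardized rating weighted by its item's loading.
scores = z @ loadings

print(np.round(loadings, 2))
```

Items that share the common pattern receive larger loadings and therefore contribute more to each person's score, while the two weakly related items contribute little.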

The transposed or Q methodology factor analysis, created by the psychologist William Stephenson, brings this relationship out explicitly. CBA scores are statistically isomorphic to the component scores in PCA for a Q factor analysis: they are the loading of each person's responses on the mean of all people's responses. Q factor analysis may therefore provide a superior CBA measure if it is used first to select the people who represent the dominant dimension, over items that best represent a subordinate attribute dimension of a domain (such as liberalism in a political domain). Factor analysis can then provide the CBA of individuals along that particular axis of the domain.
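The relationship between Q-type loadings and CBA scores can be checked numerically. The sketch below is an illustration under assumptions, not Stephenson's procedure: the ratings are simulated from a hypothetical shared profile `consensus`, each person tracks it with a hypothetical strength `competence`, and the Q analysis is approximated by a PCA of the person-by-person correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_items = 40, 60

# Hypothetical data: one shared item profile, tracked more or less closely.
consensus = rng.normal(size=n_items)
competence = rng.uniform(0.2, 0.9, size=n_people)
noise = rng.normal(size=(n_people, n_items))
ratings = (competence[:, None] * consensus
           + np.sqrt(1 - competence[:, None] ** 2) * noise)

# CBA score: correlation of each person's ratings with the mean profile.
key = ratings.mean(axis=0)
cba = np.array([np.corrcoef(row, key)[0, 1] for row in ratings])

# Q-type analysis: correlate people (rows) across items, then take the
# first principal component of the person-by-person correlation matrix.
person_corr = np.corrcoef(ratings)
eigvals, eigvecs = np.linalg.eigh(person_corr)  # ascending eigenvalues
first = eigvecs[:, -1]
if first.sum() < 0:  # the sign of an eigenvector is arbitrary; fix it
    first = -first
q_loadings = first * np.sqrt(eigvals[-1])

# How closely do the two sets of person scores agree?
agreement = np.corrcoef(cba, q_loadings)[0, 1]
print(f"correlation between CBA scores and Q loadings: {agreement:.3f}")
```

In this simulation the two sets of scores agree closely because a single dominant dimension underlies the ratings. When several competing dimensions of consensus are present, the first Q component and the simple mean profile can diverge.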

In practice, when items are not easily created and arrayed to provide a highly reliable scale, the Q factor analysis is not necessary, since the original factor analysis should also select those items that have a common consensus. For instance, in a scale of items for political attitudes, the items may ask about attitudes toward big government, law and order, economic issues, labor issues, or libertarian issues. Which of these items bear most strongly on the political attitudes of the groups polled may be difficult to determine a priori. However, since factor analysis is a symmetric computation on the matrix of items and people, the original factor analysis of items (when these are Likert scales) selects not just those items that are in a similar domain but, more generally, those items that have a similar consensus. The added advantage of this factor analytic technique is that items are automatically arranged along a factor so that the highest Likert ratings are also the highest CBA standard scores. Once selected, that factor determines the CBA (component) scores.

Critiques

The most common critique of CBA standards questions how an average could possibly be a maximal standard. This critique argues that CBA is unsuitable for maximum-performance tests of psychological attributes, especially intelligence. Even so, CBA techniques are routinely employed in various measures of non-traditional intelligences (e.g., practical, emotional, and social). Detailed critiques are presented in Gottfredson (2003) and MacCann, Roberts, Matthews, & Zeidner (2004), as well as elsewhere in the scientific literature.
