Standard-setting study

Last updated January 16, 2022

Standard-setting study is an official research study conducted by an organization that sponsors tests to determine a cutscore for the test. To be legally defensible in the US, in particular for high-stakes assessments, and meet the Standards for Educational and Psychological Testing, a cutscore cannot be arbitrarily determined; it must be empirically justified. For example, the organization cannot merely decide that the cutscore will be 70% correct. Instead, a study is conducted to determine what score best differentiates the classifications of examinees, such as competent vs. incompetent. Such studies require quite an amount of resources, involving a number of professionals, in particular with psychometric background. Standard-setting studies are for that reason impractical for regular class room situations, yet in every layer of education, standard setting is performed and multiple methods exist.

Standard-setting studies are typically performed using focus groups of 5-15 subject matter experts that represent key stakeholders for the test. For example, in setting cut scores for educational testing, experts might be instructors familiar with the capabilities of the student population for the test.

Types of standard-setting studies

Standard-setting studies fall into two categories, item-centered and person-centered. Examples of item-centered methods include the Angoff, Ebel, Nedelsky,^[1] Bookmark, and ID Matching methods, while examples of person-centered methods include the Borderline Survey and Contrasting Groups approaches. These are so categorized by the focus of the analysis; in item-centered studies, the organization evaluates items with respect to a given population of persons, and vice versa for person-centered studies.

Item-centered studies are related to criterion-referenced tests and to norm-referenced tests.

Item-centered studies

Angoff Method^[2] (item-centered): This method requires the assembly of a group of subject matter experts (SMEs), who are asked to evaluate each item and estimate the proportion of minimally competent examinees that would correctly answer the item. The ratings are averaged across raters for each item and then summed to obtain a panel-recommended raw cutscore. This cutscore then represents the score which the panel estimates a minimally competent candidate would get. This is of course subject to decision biases such as the overconfidence bias. Calibration with other, more objective, sources of data is preferable. Several variants of the method exist.
Modified Angoff Method (item-centered): Subject matter experts are generally briefed on the Angoff method and allowed to take the test with the performance levels in mind. SMEs are then asked to provide estimates for each question of the proportion of borderline or "minimally acceptable" participants that they would expect to get the question correct. The estimates are generally in p-value type form (e.g., 0.6 for item 1: 60% of borderline passing participants would get this question correct). Several rounds are generally conducted with SMEs allowed to modify their estimates given different types of information (e.g., actual participant performance information on each question, other SME estimates, etc.). The final determination of the cut score is then made (e.g., by averaging estimates or taking the median), which is often documented in a report along with secondary results such as the inter-rater reliability or the Beuk compromise. Software programs are typically used to calculate these.^[3] This method is generally used with multiple-choice questions.
Dichotomous Modified Angoff Method (item-centered): In the dichotomous modified Angoff approach, instead of using difficulty level type statistics (typically p-values), SMEs are asked to simply provide a 0/1 for each question ("0" if a borderline acceptable participant would get the question wrong and "1" if a borderline acceptable participant would get the item right)
Nedelsky Method (item-centered): SMEs make decisions on a question-by-question basis regarding which of the question distracters they feel borderline participants would be able to eliminate as incorrect. This method is generally used with multiple-choice questions only.
Bookmark Method (item-centered): Items in a test (or a representative subset of items) are ordered by difficulty (e.g., IRT response probability value) from easiest to hardest. SMEs place a "bookmark" in the "ordered item booklet" such that a student at the threshold of a performance level would be expected to respond successfully to the items prior to the bookmark with a likelihood equal to or greater than the specified response probability value (and with a likelihood less than that value for items after the bookmark). For example, for a response probability of .67 (RP67) SMEs would place a bookmark such that an examinee at the threshold of the performance level would have at least a 2/3 likelihood of success on items prior to the bookmark and less than a 2/3 likelihood of success on the items after the bookmark" This method is considered efficient with respect to setting multiple cut scores on a single test and can be used with tests composed of multiple item types (e.g., multiple-choice, construct response, etc.).^[4]^[5]^[6]
Item-Descriptor (ID) Matching^[7] (item-centered): ID Matching (a) combines the advantages of the Bookmark method; that is, the ordered item book and the information about empirical item difficulty conveyed in that ordering; and (b) hypothesized lower cognitive complexity and cognitive load of other methods; that is no error-prone probability judgments are required;^[8] matching the features of items to features of achievement level descriptions, which is well suited to people in general,^[9] and particularly to the knowledge and expertise of educators; and no need to hold a borderline examinee in mind while making the cut score judgment.

Person-centered studies

Rather than the items that distinguish competent candidates, person-centered studies evaluate the examinees themselves. While this might seem more appropriate, it is often more difficult because examinees are not a captive population, as is a list of items. For example, if a new test comes out regarding new content (as often happens in information technology tests), the test could be given to an initial sample called a beta sample, along with a survey of professional characteristics. The testing organization could then analyze and evaluate the relationship between the test scores and important statistics, such as skills, education, and experience. The cutscore could be set as the score that best differentiates between those examinees characterized as "passing" and those as "failing."

Borderline groups method (person-centered): A description is prepared for each performance category. SMEs are asked to submit a list of participants whose performance on the test should be close to the performance standard (borderline). The test is administered to these borderline groups and the median test score is used as the cut score. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
Contrasting groups method (person-centered): SMEs are asked to categorize the participants in their classes according to the performance category descriptions. The test is administered to all of the categorized participants and the test score distributions for each of the categorized groups are compared. Where the distributions of the contrasting groups intersect is where the cut score would be located. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).

Related Research Articles

Psychometrics theory and technique of psychological measurement

Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally refers to specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.

A meta-analysis is a statistical analysis that combines the results of multiple scientific studies. Meta-analyses can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting measurements that are expected to have some degree of error. The aim then is to use approaches from statistics to derive a pooled estimate closest to the unknown common truth based on how this error is perceived. Meta-analytic results are considered the most trustworthy source of evidence by the evidence-based medicine literature.

Validity is the main extent to which a concept, conclusion or measurement is well-founded and likely corresponds accurately to the real world. The word "valid" is derived from the Latin validus, meaning strong. The validity of a measurement tool is the degree to which the tool measures what it claims to measure. Validity is based on the strength of a collection of different types of evidence described in greater detail below.

Survey methodology is "the study of survey methods". As a field of applied statistics concentrating on human-research surveys, survey methodology studies the sampling of individual units from a population and associated techniques of survey data collection, such as questionnaire construction and methods for improving the number and accuracy of responses to surveys. Survey methodology targets instruments or procedures that ask one or more questions that may or may not be answered.

A think-aloudprotocol is a method used to gather data in usability testing in product design and development, in psychology and a range of social sciences.

In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments" (p. 197). By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.

Psychophysics Branch of knowledge relating physical stimuli and psychological perception

Psychophysics quantitatively investigates the relationship between physical stimuli and the sensations and perceptions they produce. Psychophysics has been described as "the scientific study of the relation between stimulus and sensation" or, more completely, as "the analysis of perceptual processes by studying the effect on a subject's experience or behaviour of systematically varying the properties of a stimulus along one or more physical dimensions".

Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level. For this reason, it has also been called tailored testing. In other words, it is a form of computer-administered test in which the next item or set of items selected to be administered depends on the correctness of the test taker's responses to the most recent items administered.

The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between (a) the respondent's abilities, attitudes, or personality traits and (b) the item difficulty. For example, they may be used to estimate a student's reading ability or the extremity of a person's attitude to capital punishment from responses on a questionnaire. In addition to psychometrics and educational research, the Rasch model and its extensions are used in other areas, including the health profession, agriculture, and market research because of their general applicability.

In psychometrics, content validity refers to the extent to which a measure represents all facets of a given construct. For example, a depression scale may lack content validity if it only assesses the affective dimension of depression but fails to take into account the behavioral dimension. An element of subjectivity exists in relation to determining content validity, which requires a degree of agreement about what a particular personality trait such as extraversion represents. A disagreement about a personality trait will prevent the gain of a high content validity.

A computerized classification test (CCT) refers to, as its name would suggest, a test that is administered by computer for the purpose of classifying examinees. The most common CCT is a mastery test where the test classifies examinees as "Pass" or "Fail," but the term also includes tests that classify examinees into more than two categories. While the term may generally be considered to refer to all computer-administered tests for classification, it is usually used to refer to tests that are interactively administered or of variable-length, similar to computerized adaptive testing (CAT). Like CAT, variable-length CCTs can accomplish the goal of the test with a fraction of the number of items used in a conventional fixed-form test.

Differential item functioning (DIF) is a statistical characteristic of an item that shows the extent to which the item might be measuring different abilities for members of separate subgroups. Average item scores for subgroups having the same overall score on the test are compared to determine whether the item is measuring in essentially the same way for all subgroups. The presence of DIF requires review and judgment, and it does not necessarily indicate the presence of bias. DIF analysis provides an indication of unexpected behavior of items on a test. An item does not display DIF if people from different groups have a different probability to give a certain response; it displays DIF if and only if people from different groups with the same underlying true ability have a different probability of giving a certain response. Common procedures for assessing DIF are Mantel-Haenszel, item response theory (IRT) based methods, and logistic regression.

A situational judgement test (SJT), or situational stress test (SStT) or inventory (SSI) is a type of psychological test which presents the test-taker with realistic, hypothetical scenarios and ask them to identify the most appropriate response or to rank the responses in the order they feel is most effective. SJTs can be presented to test-takers through a variety of modalities, such as booklets, films, or audio recordings. SJTs represent a distinct psychometric approach from the common knowledge-based multiple choice item. They are often used in industrial-organizational psychology applications such as personnel selection. Situational judgement tests tend to determine behavioral tendencies, assessing how an individual will behave in a certain situation, and knowledge instruction, which evaluates the effectiveness of possible responses. Situational judgement tests could also reinforce the status quo with an organization.

In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. The effect of a moderating variable is characterized statistically as an interaction; that is, a categorical or quantitative variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables, or the value of the slope of the dependent variable on the independent variable. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation.

Test validity is the extent to which a test accurately measures what it is supposed to measure. In the fields of psychological testing and educational testing, "validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests". Although classical models divided the concept into various "validities", the currently dominant view is that validity is a single unitary construct.

The Child Behavior Checklist (CBCL) is a widely used caregiver report form identifying problem behavior in children. It is widely used in both research and clinical practice with youths. It has been translated into more than 90 languages, and normative data are available integrating information from multiple societies. Because a core set of the items have been included in every version of the CBCL since the 1980s, it provides a meter stick for measuring whether amounts of behavior problems have changed over time or across societies. This is a helpful complement to other approaches for looking at rates of mental-health issues, as the definitions of disorders have changed repeatedly over the same time frame.

Test (assessment) Procedure for measuring a subjects knowledge, skill, aptitude, physical fitness, or other characteristics

A test or examination is an educational assessment intended to measure a test-taker's knowledge, skill, aptitude, physical fitness, or classification in many other topics. A test may be administered verbally, on paper, on a computer, or in a predetermined area that requires a test taker to demonstrate or perform a set of skills.

Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification.

Automatic Item Generation (AIG), or Automated Item Generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items that are the basic building blocks of a psychological test. The method was first described by John R. Bormuth in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items. So, instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models.

References

↑ Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3–19.
↑ Zieky, M.J. (2001). So much has changed: how the setting of cutscores has evolved since the 1980s. In Cizek, G.J. (Ed.), Setting Performance Standards, p. 19-52. Mahwah, NJ: Lawrence Erlbaum Associates.
↑ Assessment Systems Corporation: Angoff Analysis Tool (free software). https://assess.com/angoff-analysis-tool/
↑ Lewis, D. M., Mitzel, H. C., Green, D. R. (June, 1996). Standard Setting: A Bookmark Approach. In D. R. Green (Chair), IRT-Based Standard-Setting Procedures Utilizing Behavioral Anchoring. Paper presented at the 1996 Council of Chief State School Officers National Conference on Large Scale Assessment, Phoenix, AZ.
↑ Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2000). The Bookmark Procedure: Cognitive Perspectives on Standard Setting. Chapter in Setting Performance Standards: Concepts, Methods, and Perspectives (G. J. Cizek, ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
↑ Lewis, D. M., Mitzel, H. C., Mercado, R. L., & Schulz, E. M. (2012). The Bookmark Standard Setting Procedure. Chapter in Setting Performance Standards: Foundations, Methods, and Innovations Second Edition (G. J. Cizek, ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
↑ Ferrara, S., & Lewis, D. (2012). The Item-Descriptor (ID) Matching method. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 255-282).
↑ Nickerson, R. S. (2005). Cognition and chance: The psychology of probabilistic reasoning. Mahwah, NJ: Lawrence Erlbaum Associates.
↑ Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: The MIT Press

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3–19.

[2] Zieky, M.J. (2001). So much has changed: how the setting of cutscores has evolved since the 1980s. In Cizek, G.J. (Ed.), Setting Performance Standards, p. 19-52. Mahwah, NJ: Lawrence Erlbaum Associates.

[3] Assessment Systems Corporation: Angoff Analysis Tool (free software). https://assess.com/angoff-analysis-tool/

[4] Lewis, D. M., Mitzel, H. C., Green, D. R. (June, 1996). Standard Setting: A Bookmark Approach. In D. R. Green (Chair), IRT-Based Standard-Setting Procedures Utilizing Behavioral Anchoring. Paper presented at the 1996 Council of Chief State School Officers National Conference on Large Scale Assessment, Phoenix, AZ.

[5] Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2000). The Bookmark Procedure: Cognitive Perspectives on Standard Setting. Chapter in Setting Performance Standards: Concepts, Methods, and Perspectives (G. J. Cizek, ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

[6] Lewis, D. M., Mitzel, H. C., Mercado, R. L., & Schulz, E. M. (2012). The Bookmark Standard Setting Procedure. Chapter in Setting Performance Standards: Foundations, Methods, and Innovations Second Edition (G. J. Cizek, ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

[7] Ferrara, S., & Lewis, D. (2012). The Item-Descriptor (ID) Matching method. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 255-282).

[8] Nickerson, R. S. (2005). Cognition and chance: The psychology of probabilistic reasoning. Mahwah, NJ: Lawrence Erlbaum Associates.

[9] Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: The MIT Press

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]