Multistage testing is an algorithm-based approach to administering tests. It is similar to computer-adaptive testing in that the algorithm interactively selects items for each examinee, but rather than selecting individual items, it selects groups of items, building the test in stages. These groups are called testlets or panels.[1]
While multistage tests could in principle be administered by a human, the extensive computations required (often using item response theory) mean that in practice they are administered by computer.
The number of stages or testlets can vary. If the testlets are relatively small, such as five items, ten or more could easily be used in a test. Some multistage tests are designed with a minimum of two stages (a single stage would simply be a conventional fixed-form test).[2]
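To make the staged structure concrete, the sketch below shows a hypothetical two-stage design with number-correct routing: the examinee's score on a routing testlet determines whether an easier or harder second-stage testlet is administered. The testlet contents and cutoff are invented for the example; operational programs typically route using IRT-based scoring instead.

```python
# Hypothetical two-stage design: a routing testlet followed by an easier
# or harder second-stage testlet chosen from the number-correct score.
ROUTING_TESTLET = ["item_1", "item_2", "item_3", "item_4", "item_5"]
SECOND_STAGE = {
    "easy": ["item_6e", "item_7e", "item_8e", "item_9e", "item_10e"],
    "hard": ["item_6h", "item_7h", "item_8h", "item_9h", "item_10h"],
}
CUTOFF = 3  # stage-1 score at or above which the harder testlet is given

def route(stage1_correct: int) -> list[str]:
    """Select the second-stage testlet from the stage-1 number-correct score."""
    return SECOND_STAGE["hard" if stage1_correct >= CUTOFF else "easy"]
```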
In response to the increasing use of multistage testing, the scholarly journal Applied Measurement in Education published a special issue on the topic in 2006.[3]
Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. It generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed, such as intelligence, introversion, mental disorders, and educational achievement. Individuals' levels on these nonobservable latent variables are inferred through mathematical modeling of their observed responses to items on tests and scales.
The Graduate Record Examinations (GRE) is a standardized test that is part of the admissions process for many graduate schools in the United States and Canada and a few other countries. The GRE is owned and administered by Educational Testing Service (ETS). The test was established in 1936 by the Carnegie Foundation for the Advancement of Teaching.
Educational Testing Service (ETS), founded in 1947, is the world's largest private educational testing and assessment organization. It is headquartered in Lawrence Township, New Jersey, but has a Princeton address.
In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between an individual's performance on a test item and that individual's level on the overall ability the item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, IRT does not assume that each item is equally difficult. This distinguishes it from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.
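As one concrete example of such a statistical model, the widely used two-parameter logistic (2PL) model expresses the probability of a correct response as a function of the examinee's ability and the item's difficulty and discrimination. The sketch below is a minimal rendering of that function; the parameter values in the demonstration are arbitrary.

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL model: probability that an examinee of ability theta answers
    correctly an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# For the same examinee (theta = 0), an easy item (b = -1) is far more
# likely to be answered correctly than a hard one (b = +1):
print(round(p_correct_2pl(0.0, 1.0, -1.0), 2))  # 0.73
print(round(p_correct_2pl(0.0, 1.0, 1.0), 2))   # 0.27
```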
Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level; for this reason, it has also been called tailored testing. In a CAT, the next item or set of items administered depends on the correctness of the test taker's responses to the most recent items.
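The sketch below illustrates that adaptive loop in a deliberately simplified form, assuming an item bank of difficulty values: it repeatedly selects the unadministered item closest in difficulty to the current ability estimate and nudges the estimate after each response. Operational CATs instead use maximum-information item selection and maximum-likelihood or Bayesian ability estimation; the names `next_item` and `run_cat` are invented for the example.

```python
# Minimal sketch of a CAT loop (illustrative; operational CATs use
# maximum-information item selection and maximum-likelihood or Bayesian
# ability estimation rather than the fixed step used here).

def next_item(theta: float, bank: dict[str, float], used: set[str]) -> str:
    """Pick the unadministered item whose difficulty is closest to the
    current ability estimate theta."""
    candidates = {name: b for name, b in bank.items() if name not in used}
    return min(candidates, key=lambda name: abs(candidates[name] - theta))

def run_cat(bank: dict[str, float], answer_fn, n_items: int = 5,
            step: float = 0.5) -> float:
    """Administer n_items adaptively; answer_fn(item) -> True if correct."""
    theta, used = 0.0, set()
    for _ in range(n_items):
        item = next_item(theta, bank, used)
        used.add(item)
        # Raise the estimate after a correct answer, lower it otherwise.
        theta += step if answer_fn(item) else -step
    return theta
```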
Estimation of a Rasch model concerns the techniques used to estimate the model's parameters from matrices of response data. The most common approaches are types of maximum likelihood estimation, such as joint and conditional maximum likelihood estimation. Joint maximum likelihood (JML) equations are efficient, but inconsistent for a finite number of items, whereas conditional maximum likelihood (CML) equations give consistent and unbiased item estimates. Person estimates are generally thought to carry some bias, although weighted likelihood estimation methods for person parameters reduce it.
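A minimal sketch of one JML sweep for the dichotomous Rasch model appears below: each person parameter is updated by a Newton-Raphson step that moves the expected score toward the observed raw score, and item parameters are then updated the same way. Real implementations add parameter centering, convergence checks, and handling of extreme (all-correct or all-incorrect) scores, which this sketch omits.

```python
import math

def p(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response for a person with
    ability theta on an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def jml_sweep(data: list[list[int]], thetas: list[float], bs: list[float]):
    """One alternating Newton-Raphson sweep of joint maximum likelihood.
    Simplified: no centering, no handling of extreme (0% or 100%) scores."""
    for n, row in enumerate(data):
        expected = sum(p(thetas[n], b) for b in bs)
        info = sum(p(thetas[n], b) * (1.0 - p(thetas[n], b)) for b in bs)
        thetas[n] += (sum(row) - expected) / info  # Newton step for person n
    for i in range(len(bs)):
        expected = sum(p(t, bs[i]) for t in thetas)
        info = sum(p(t, bs[i]) * (1.0 - p(t, bs[i])) for t in thetas)
        bs[i] -= (sum(row[i] for row in data) - expected) / info  # item i
    return thetas, bs
```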
A computerized classification test (CCT) is, as its name suggests, a test administered by computer for the purpose of classifying examinees. The most common CCT is a mastery test, which classifies examinees as "Pass" or "Fail," but the term also covers tests that classify examinees into more than two categories. While the term may generally be taken to refer to all computer-administered tests for classification, it usually denotes tests that are interactively administered or of variable length, similar to computerized adaptive testing (CAT). Like CAT, variable-length CCTs can accomplish the goal of the test with a fraction of the number of items used in a conventional fixed-form test.
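Variable-length CCTs commonly use a sequential decision rule such as Wald's sequential probability ratio test (SPRT), stopping as soon as the evidence clearly favors one classification. The sketch below shows the general idea with illustrative 2PL item parameters and two ability points bracketing the cut score; the thresholds and function name are assumptions for the example, not a specific operational algorithm.

```python
import math

def sprt_classify(responses, items, theta_fail=-0.5, theta_pass=0.5,
                  alpha=0.05, beta=0.05):
    """Sequential probability ratio test for a pass/fail CCT (sketch).
    responses: list of 0/1 answers; items: matching (a, b) 2PL parameters.
    Compares the likelihood of the data at two ability points bracketing
    the cut score, stopping as soon as a decision boundary is crossed."""
    upper = math.log((1.0 - beta) / alpha)   # cross above -> classify "Pass"
    lower = math.log(beta / (1.0 - alpha))   # cross below -> classify "Fail"
    log_ratio = 0.0
    for x, (a, b) in zip(responses, items):
        p_pass = 1.0 / (1.0 + math.exp(-a * (theta_pass - b)))
        p_fail = 1.0 / (1.0 + math.exp(-a * (theta_fail - b)))
        log_ratio += math.log(p_pass / p_fail) if x else \
                     math.log((1.0 - p_pass) / (1.0 - p_fail))
        if log_ratio >= upper:
            return "Pass"
        if log_ratio <= lower:
            return "Fail"
    return "Continue testing"  # no decision yet: administer another item
```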
Computer-adaptive sequential testing (CAST) is another term for multistage testing. A CAST test is a type of computer-adaptive test or computerized classification test that uses pre-defined groups of items called testlets rather than operating at the level of individual items. CAST is a term introduced by psychometricians working for the National Board of Medical Examiners. In CAST, the testlets are referred to as panels.
Psychometric software refers to specialized programs used for the psychometric analysis of data obtained from tests, questionnaires, polls or inventories that measure latent psychoeducational variables. Although some psychometric analyses can be performed using general statistical software such as SPSS, most require specialized tools designed specifically for psychometric purposes.
In statistics and combinatorial mathematics, group testing is any procedure that breaks up the task of identifying certain objects into tests on groups of items, rather than on individual ones. First studied by Robert Dorfman in 1943, group testing is a relatively new field of applied mathematics that can be applied to a wide range of practical applications and is an active area of research today.
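Dorfman's original two-stage procedure pools k items into one test: a negative pool clears all k items with a single test, while a positive pool is followed by individual tests of each member. A short sketch of the resulting expected cost, with an illustrative defect rate, follows.

```python
def tests_per_item(p: float, k: int) -> float:
    """Expected tests per item under Dorfman's two-stage scheme: one pooled
    test per group of k, plus k individual tests whenever the pool is
    positive (each item independently defective with probability p)."""
    return 1.0 / k + (1.0 - (1.0 - p) ** k)

# With a 1% defect rate, groups of about 11 need only ~0.2 tests per item,
# roughly a five-fold saving over testing everyone individually:
p = 0.01
best_k = min(range(2, 51), key=lambda k: tests_per_item(p, k))
print(best_k, round(tests_per_item(p, best_k), 3))  # 11 0.196
```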
Howard Charles Wainer is an American statistician, past principal research scientist at the Educational Testing Service, adjunct professor of statistics at the Wharton School of the University of Pennsylvania, and author, known for his contributions in the fields of statistics, psychometrics, and statistical graphics.
Klaus D. Kubinger is a psychologist and statistician who was, until his retirement, professor of psychological assessment at the University of Vienna, Faculty of Psychology. His main research focuses on fundamental research into assessment processes and on the application and advancement of item response theory models. He is also known as the author of textbooks on psychological assessment and on statistics.
The NIH Toolbox, for the assessment of neurological and behavioral function, is a multidimensional set of brief royalty-free measures that researchers and clinicians can use to assess cognitive, sensory, motor, and emotional function in people ages 3–85. This suite of measures can be administered to study participants in two hours or less, in a variety of settings, with a particular emphasis on measuring outcomes in longitudinal epidemiologic studies and prevention or intervention trials. The battery has been normed and validated across the lifespan in subjects ages 3–85, and its use ensures that assessment methods and results can be compared across existing and future studies. The NIH Toolbox is capable of monitoring neurological and behavioral function over time and of measuring key constructs across developmental stages.
Computational psychometrics is an interdisciplinary field fusing theory-based psychometrics, learning and cognitive sciences, and data-driven AI-based computational models as applied to large-scale/high-dimensional learning, assessment, biometric, or psychological data. Computational psychometrics is frequently concerned with providing actionable and meaningful feedback to individuals based on measurement and analysis of individual differences as they pertain to specific areas of enquiry.
Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items, the basic building blocks of a psychological test. The method was first described by John R. Bormuth in the 1960s but was not developed further until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate items from that model. Thus, instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models. More recently, neural networks, including large language models such as the GPT family, have been used successfully to generate items automatically.
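The sketch below illustrates the two-step idea in miniature: a parent item model with placeholders, plus lists of admissible values, yields a family of items whose answer keys are computed from the same parameters. The template and values are invented for the example and do not come from any operational testing program.

```python
import itertools

# A parent "item model" with placeholders; the generator fills in values
# and computes the key from the same parameters, yielding a family of items.
ITEM_MODEL = "A train travels {speed} km/h for {hours} hours. How far does it travel?"

def generate_items(speeds, hours_values):
    for speed, hours in itertools.product(speeds, hours_values):
        stem = ITEM_MODEL.format(speed=speed, hours=hours)
        key = speed * hours  # correct answer in km
        yield stem, key

for stem, key in generate_items([60, 80], [2, 3]):
    print(stem, "->", key, "km")
```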
Alina Anca von Davier is a psychometrician and researcher in computational psychometrics, machine learning, and education. Von Davier is a researcher, innovator, and executive leader with over 20 years of experience in EdTech and the assessment industry. She is the Chief of Assessment at Duolingo, where she leads the Duolingo English Test research and development area. She is also the founder and CEO of EdAstra Tech, a service-oriented EdTech company. In 2022, she joined the University of Oxford as an Honorary Research Fellow and Carnegie Mellon University as a Senior Research Fellow.
Randy Elliot Bennett is an American educational researcher who specializes in educational assessment. He is currently the Norman O. Frederiksen Chair in Assessment Innovation at Educational Testing Service in Princeton, NJ. His research and writing focus on bringing together advances in cognitive science, technology, and measurement to improve teaching and learning. He received the ETS Senior Scientist Award in 1996, the ETS Career Achievement Award in 2005, the Teachers College, Columbia University Distinguished Alumni Award in 2016, Fellow status in the American Educational Research Association (AERA) in 2017, the National Council on Measurement in Education's (NCME) Bradley Hanson Award for Contributions to Educational Measurement in 2019, the E. F. Lindquist Award from AERA and ACT in 2020, election to the National Academy of Education in 2022, and the AERA Cognition and Assessment Special Interest Group Outstanding Contribution to Research in Cognition and Assessment Award in 2024. Randy Bennett was elected President of both the International Association for Educational Assessment (IAEA), a worldwide organization primarily constituted of governmental and NGO measurement organizations, and the National Council on Measurement in Education (NCME), whose members are employed in universities, testing organizations, state and federal education departments, and school districts.
Mark Daniel Reckase is an educational psychologist and expert on quantitative methods and measurement who is known for his work on computerized adaptive testing, multidimensional item response theory, and standard setting in educational and psychological tests. Reckase is University Distinguished Professor Emeritus in the College of Education at Michigan State University.
Matthias von Davier is a psychometrician, academic, inventor, and author. He is the executive director of the TIMSS & PIRLS International Study Center at the Lynch School of Education and Human Development and the J. Donald Monan, S.J., University Professor in Education at Boston College.
Fumiko Samejima (1930 – c. 2021) was a prominent Japanese-born psychometrician best known for her development of the graded response model (GRM), a fundamental approach in item response theory (IRT). Her innovative methods became influential in psychological and educational measurement, particularly in improving the accuracy of tests involving Likert-scale questions and other graded responses. She published her seminal paper "Estimation of Latent Ability Using a Response Pattern of Graded Scores" in 1969. This publication became a foundational reference in psychometric literature, significantly advancing the analysis of ordered categorical data.