Multistage testing

Multistage testing is an algorithm-based approach to administering tests. It is very similar to computer-adaptive testing in that the algorithm interactively selects items for each examinee, but rather than selecting individual items, it selects groups of items, building the test in stages. These groups are called testlets or panels. [1]

While multistage tests could theoretically be administered by a human, the extensive computations required (often using item response theory) mean that in practice they are administered by computer.

The number of stages or testlets can vary. If the testlets are relatively small, such as five items each, a test could easily use ten or more of them. Some multistage tests are designed with the minimum of two stages; a test with only one stage would be a conventional fixed-form test. [2]
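The stage-by-stage selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an operational design: the `ItemGroup` class, the `route` function, the answer keys, and the cut points are all hypothetical, and routing here uses a simple number-correct score rather than the item response theory computations that operational tests often rely on.

```python
from dataclasses import dataclass


@dataclass
class ItemGroup:
    """A pre-assembled group of items (a testlet); hypothetical structure."""
    name: str
    answer_key: list[str]


def number_correct(group: ItemGroup, responses: list[str]) -> int:
    """Count the responses that match the answer key."""
    return sum(r == k for r, k in zip(responses, group.answer_key))


def route(score: int, n_items: int) -> str:
    """Choose the second-stage testlet from the first-stage score.

    The cut points (40% and 80% correct) are illustrative assumptions.
    """
    proportion = score / n_items
    if proportion < 0.4:
        return "easy"
    if proportion < 0.8:
        return "medium"
    return "hard"


# Stage 1: every examinee takes the same routing testlet.
routing = ItemGroup("routing", ["A", "C", "B", "D", "A"])

# Stage 2: one of three testlets, chosen by stage-1 performance.
second_stage = {
    "easy": ItemGroup("easy", ["B", "B", "A", "C", "D"]),
    "medium": ItemGroup("medium", ["D", "A", "C", "C", "B"]),
    "hard": ItemGroup("hard", ["C", "D", "D", "A", "B"]),
}

responses = ["A", "C", "B", "A", "B"]  # 3 of 5 correct
stage1_score = number_correct(routing, responses)
next_group = second_stage[route(stage1_score, len(routing.answer_key))]
print(next_group.name)  # medium
```

A design with more stages would repeat the routing step after each testlet, using the cumulative score (or an ability estimate) to pick the next group.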

In response to the increasing use of multistage testing, the scholarly journal Applied Measurement in Education published a special issue on the topic in 2006. [3]

Related Research Articles

Psychometrics is a field of study concerned with the theory and technique of psychological measurement. As defined by the US National Council on Measurement in Education (NCME), psychometrics refers to psychological measurement. Generally, it refers to the field in psychology and education that is devoted to testing, measurement, assessment, and related activities.

Reliability in statistics and psychometrics is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions. "It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 and 1.00, are usually used to indicate the amount of error in the scores." For example, measurements of people's height and weight are often extremely reliable.

The Graduate Record Examinations (GRE) is a standardized test that is an admissions requirement for many graduate schools in the United States and Canada. The GRE is owned and administered by Educational Testing Service (ETS). The test was established in 1936 by the Carnegie Foundation for the Advancement of Teaching.

Educational Testing Service: educational testing and assessment organization

Educational Testing Service (ETS), founded in 1947, is the world's largest private nonprofit educational testing and assessment organization. It is headquartered in Lawrence Township, New Jersey, but has a Princeton address.

In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments" (p. 197). By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.

Questionnaire: research instrument consisting of a series of questions and other prompts for the purpose of gathering information from respondents

A questionnaire is a research instrument consisting of a series of questions for the purpose of gathering information from respondents. The questionnaire was invented by the Statistical Society of London in 1838.

Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level. For this reason, it has also been called tailored testing. In other words, it is a form of computer-administered test in which the next item or set of items selected to be administered depends on the correctness of the test taker's responses to the most recent items administered.

The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between (a) the respondent's abilities, attitudes, or personality traits and (b) the item difficulty. For example, it may be used to estimate a student's reading ability or the extremity of a person's attitude to capital punishment from responses on a questionnaire. In addition to psychometrics and educational research, the Rasch model and its extensions are used in other areas, including the health profession and market research, because of their general applicability.
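The trade-off between ability and difficulty can be made concrete with the model's usual dichotomous form, in which the probability of a correct response depends only on the difference between the two. The sketch below assumes that standard form; the function and parameter names are illustrative.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    P = exp(ability - difficulty) / (1 + exp(ability - difficulty)),
    written here in the equivalent logistic form.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# When ability equals difficulty, success and failure are equally likely.
print(rasch_probability(0.0, 0.0))  # 0.5

# A more able respondent is more likely to answer the same item correctly.
print(rasch_probability(1.0, 0.0) > rasch_probability(0.0, 0.0))  # True
```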

Uniform Certified Public Accountant Examination: exam

The Uniform Certified Public Accountant Examination is the examination administered to people who wish to become U.S. Certified Public Accountants. The CPA Exam is used by the regulatory bodies of all fifty states plus the District of Columbia, Guam, Puerto Rico, the U.S. Virgin Islands and the Northern Mariana Islands.

A computerized classification test (CCT) is, as its name suggests, a test that is administered by computer for the purpose of classifying examinees. The most common CCT is a mastery test where the test classifies examinees as "Pass" or "Fail," but the term also includes tests that classify examinees into more than two categories. While the term may generally be considered to refer to all computer-administered tests for classification, it is usually used to refer to tests that are interactively administered or of variable length, similar to computerized adaptive testing (CAT). Like CAT, variable-length CCTs can accomplish the goal of the test with a fraction of the number of items used in a conventional fixed-form test.

Computer-adaptive sequential testing (CAST) is another term for multistage testing. A CAST test is a type of computer-adaptive test or computerized classification test that uses pre-defined groups of items called testlets rather than operating at the level of individual items. CAST is a term introduced by psychometricians working for the National Board of Medical Examiners. In CAST, the testlets are referred to as panels.

Psychometric software is software that is used for psychometric analysis of data from tests, questionnaires, or inventories reflecting latent psychoeducational variables. While some psychometric analyses can be performed with standard statistical software like SPSS, most analyses require specialized tools.

Group testing: a procedure that breaks up the task of identifying certain objects into tests on groups of items

In statistics and combinatorial mathematics, group testing is any procedure that breaks up the task of identifying certain objects into tests on groups of items, rather than on individual ones. First studied by Robert Dorfman in 1943, group testing is a relatively new field of applied mathematics that can be applied to a wide range of practical applications and is an active area of research today.
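One way to picture the idea is the classic two-stage pooling scheme associated with Dorfman: test each pool once, and retest members individually only when their pool comes back positive. The sketch below is a simplified illustration that assumes error-free tests; the function name, the `pool_size` parameter, and the sample data are hypothetical.

```python
def two_stage_group_test(samples: list[bool], pool_size: int) -> tuple[list[int], int]:
    """Return the indices of positive samples and the number of tests used."""
    positives = []
    tests_used = 0
    for start in range(0, len(samples), pool_size):
        pool = samples[start:start + pool_size]
        tests_used += 1  # one test on the pooled group
        if any(pool):  # pool is positive, so retest each member
            for offset, is_positive in enumerate(pool):
                tests_used += 1
                if is_positive:
                    positives.append(start + offset)
    return positives, tests_used


samples = [False] * 20
samples[7] = True  # a single positive among 20 samples
found, n_tests = two_stage_group_test(samples, pool_size=5)
print(found)    # [7]
print(n_tests)  # 9, compared with 20 individual tests
```

The saving shrinks as positives become more common, because more pools need individual retesting.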

Howard Wainer: American statistician

Howard Wainer is an American statistician, past principal research scientist at the Educational Testing Service, adjunct professor of statistics at the Wharton School of the University of Pennsylvania, and author, known for his contributions in the fields of statistics, psychometrics, and statistical graphics.

Klaus Kubinger: Austrian psychologist

Klaus D. Kubinger is a psychologist and statistician, and professor of psychological assessment at the Faculty of Psychology, University of Vienna. His main research focuses on fundamental research into assessment processes and on the application and advancement of item response theory models. He is also known as a textbook author on both psychological assessment and statistics.

The NIH Toolbox® for the Assessment of Neurological and Behavioral Function® is a multidimensional set of brief royalty-free measures that researchers and clinicians can use to assess cognitive, sensory, motor and emotional function in people aged 3–85. This suite of measures can be administered to study participants in two hours or less, in a variety of settings, with a particular emphasis on measuring outcomes in longitudinal epidemiologic studies and prevention or intervention trials. The battery has been normed and validated across the lifespan in subjects aged 3–85, and its use ensures that assessment methods and results can be used for comparisons across existing and future studies. The NIH Toolbox is capable of monitoring neurological and behavioral function over time, and measuring key constructs across developmental stages.

Bruno D. Zumbo is an applied mathematician working primarily in the psychological, social and health sciences. He is currently Professor and Distinguished University Scholar, and Paragon UBC Professor of Psychometrics & Measurement at the University of British Columbia. He is known for his contributions in the fields of statistics, psychometrics, validity theory, and studies of the mathematical basis of classical test theory and measurement error models.

Automatic Item Generation (AIG), or Automated Item Generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items that are the basic building blocks of a psychological test. The method was first described by John R. Bormuth in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items. So, instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models.
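The two-step process can be illustrated with a toy example: an item model holds a stem with placeholders, and a short routine expands it into a family of items. Everything here is a hypothetical sketch; the item model's structure, the `generate_items` function, and the sample stem are assumptions made for illustration, not a description of any particular AIG system.

```python
from itertools import product

# Step 1: a hypothetical item model with a stem containing placeholders,
# the allowed values for each placeholder, and a rule for the correct answer.
item_model = {
    "stem": "A train travels at {speed} km/h for {hours} hours. "
            "How many kilometres does it cover?",
    "variables": {"speed": [40, 60, 80], "hours": [2, 3]},
    "key": lambda speed, hours: speed * hours,
}


# Step 2: an algorithm that expands the model into concrete items.
def generate_items(model: dict) -> list[dict]:
    """Expand one item model into every item its variables allow."""
    names = list(model["variables"])
    items = []
    for values in product(*(model["variables"][n] for n in names)):
        binding = dict(zip(names, values))
        items.append({
            "stem": model["stem"].format(**binding),
            "key": model["key"](**binding),
        })
    return items


items = generate_items(item_model)
print(len(items))       # 6
print(items[0]["stem"])
print(items[0]["key"])  # 80
```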

Alina von Davier: Romanian-American psychometrician

Alina Anca von Davier is a psychometrician and researcher in computational psychometrics, machine learning, and education. Von Davier serves on the technical advisory board of Duolingo and on the board of directors of Smart Sparrow. She is an adjunct professor at Fordham University.

References

  1. Luecht, R. M. & Nungester, R. J. (1998). "Some practical examples of computer-adaptive sequential testing." Journal of Educational Measurement, 35, 229-249.
  2. Castle, R. A. (1997). "The Relative Efficiency of Two-Stage Testing Versus Traditional Multiple Choice Testing Using Item Response Theory in Licensure." Unpublished doctoral dissertation.
  3. Applied Measurement in Education edition on multistage testing

Further reading