Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level; for this reason, it has also been called tailored testing. In other words, it is a form of computer-administered test in which the next item, or set of items, to be administered is selected based on the correctness of the test taker's responses to the most recent items administered. [1]
CAT successively selects questions for the purpose of maximizing the precision of the exam based on what is known about the examinee from previous questions. [2] From the examinee's perspective, the difficulty of the exam seems to tailor itself to their level of ability. For example, if an examinee performs well on an item of intermediate difficulty, they will then be presented with a more difficult question; if they perform poorly, they will be presented with a simpler question. Compared to static tests that nearly everyone has experienced, with a fixed set of items administered to all examinees, computer-adaptive tests require fewer test items to arrive at equally accurate scores. [2]
The basic computer-adaptive testing method is an iterative algorithm with the following steps: [3]
1. The pool of available items is searched for the optimal item, based on the current estimate of the examinee's ability.
2. The chosen item is presented to the examinee, who answers it correctly or incorrectly.
3. The ability estimate is updated, based on all prior answers.
4. Steps 1 through 3 are repeated until a termination criterion is met.
Nothing is known about the examinee prior to the administration of the first item, so the algorithm is generally started by selecting an item of medium, or medium-easy, difficulty as the first item.[ citation needed ]
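A minimal Python sketch may make this loop concrete. Everything here is illustrative: the item bank and the select_item, administer, update_estimate, and stop callables are hypothetical placeholders whose realistic implementations are discussed component by component later in this article.

    # Minimal sketch of the iterative CAT loop (illustrative only).
    # select_item, administer, update_estimate, and stop are hypothetical
    # callables standing in for IRT-based selection, delivery, scoring,
    # and termination logic.
    def run_cat(item_bank, administer, select_item, update_estimate, stop):
        theta = 0.0        # no prior information: start at average ability
        used = []          # items already administered
        responses = []     # 1 = correct, 0 = incorrect
        while not stop(theta, used, responses):
            item = select_item(item_bank, theta, used)   # pick the optimal next item
            correct = administer(item)                   # present it, record the answer
            used.append(item)
            responses.append(correct)
            theta = update_estimate(used, responses)     # re-estimate ability
        return theta, used, responses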
As a result of adaptive administration, different examinees receive quite different tests. [4] Although examinees are typically administered different tests, their ability scores are comparable to one another (i.e., as if they had received the same test, as is common in tests designed using classical test theory). The psychometric technology that allows equitable scores to be computed across different sets of items is item response theory (IRT). IRT is also the preferred methodology for selecting optimal items, which are typically chosen on the basis of information rather than difficulty per se. [3]
A related methodology called multistage testing (MST) or CAST is used in the Uniform Certified Public Accountant Examination. MST avoids or reduces some of the disadvantages of CAT as described below. [5]
CAT has existed since the 1970s, and there are now many assessments that utilize it.
Additionally, a list of active CAT exams is maintained by the International Association for Computerized Adaptive Testing, [7] along with a list of current CAT research programs and a near-comprehensive bibliography of published CAT research.
Adaptive tests can provide uniformly precise scores for most test-takers. [3] In contrast, standard fixed tests almost always provide the best precision for test-takers of medium ability and increasingly poorer precision for test-takers with more extreme test scores.[ citation needed ]
An adaptive test can typically be shortened by 50% and still maintain a higher level of precision than a fixed version. [2] This translates into time savings for the test-taker. Test-takers do not waste their time attempting items that are too hard or trivially easy. Additionally, the testing organization benefits from the time savings; the cost of examinee seat time is substantially reduced. However, because the development of a CAT involves much more expense than a standard fixed-form test, a large population is necessary for a CAT testing program to be financially fruitful.[ citation needed ]
Large target populations are generally found in scientific and research-based fields, where CAT may be used to detect the early onset of disabilities or diseases. CAT use in these fields has grown greatly over the past 10 years. Once not accepted in medical facilities and laboratories, CAT is now encouraged within the scope of diagnostics.[ citation needed ]
Like any computer-based test, adaptive tests may show results immediately after testing.[ citation needed ]
Adaptive testing, depending on the item selection algorithm, may reduce exposure of some items because examinees typically receive different sets of items rather than the whole population being administered a single set. However, it may increase the exposure of others (namely the medium or medium/easy items presented to most examinees at the beginning of the test). [3]
The first issue encountered in CAT is the calibration of the item pool. In order to model the characteristics of the items (e.g., to pick the optimal item), all the items of the test must be pre-administered to a sizable sample and then analyzed. To achieve this, new items must be mixed into the operational items of an exam (the responses are recorded but do not contribute to the test-takers' scores), called "pilot testing", "pre-testing", or "seeding". [3] This presents logistical, ethical, and security issues. For example, it is impossible to field an operational adaptive test with brand-new, unseen items; [8] all items must be pretested with a large enough sample to obtain stable item statistics. This sample may be required to be as large as 1,000 examinees. [8] Each program must decide what percentage of the test can reasonably be composed of unscored pilot test items.[ citation needed ]
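As a rough illustration of seeding, the Python sketch below mixes a handful of unscored pilot items into an examinee's form at random positions and excludes them from scoring. The function names, the number of pilot items, and the (item, scored) representation are assumptions made for the example, not a description of any particular program's practice.

    import random

    # Illustrative seeding of unscored pilot items into an operational form.
    def assemble_with_pilots(operational_items, pilot_pool, n_pilots=3, seed=None):
        rng = random.Random(seed)
        form = [(item, True) for item in operational_items]    # (item, scored?)
        for pilot in rng.sample(pilot_pool, n_pilots):
            form.insert(rng.randrange(len(form) + 1), (pilot, False))
        return form

    # Scoring later ignores the unscored (pilot) positions.
    def scored_responses(form, responses):
        return [r for (item, scored), r in zip(form, responses) if scored]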
Although adaptive tests have exposure control algorithms to prevent overuse of a few items, [3] the exposure conditioned upon ability is often not controlled and can easily become close to 1. That is, some items can appear on nearly every test administered to examinees of the same ability level. This is a serious security concern because groups sharing items may well have a similar functional ability level. In fact, a completely randomized exam is the most secure (but also least efficient).[ citation needed ]
Review of past items is generally disallowed. Adaptive tests tend to administer easier items after a person answers incorrectly. Supposedly, an astute test-taker could use such clues to detect incorrect answers and correct them. Or, test-takers could be coached to deliberately pick wrong answers, leading to an increasingly easier test. After tricking the adaptive test into building a maximally easy exam, they could then review the items and answer them correctly—possibly achieving a very high score. Test-takers frequently complain about the inability to review. [9]
Because of this sophistication, the development of a CAT has a number of prerequisites. [10] The large sample sizes (typically hundreds of examinees) required by IRT calibration must be available. Items must be scorable in real time if a new item is to be selected instantaneously. Psychometricians experienced with IRT calibration and CAT simulation research are necessary to provide validity documentation. Finally, a software system capable of true IRT-based CAT must be available.[ citation needed ]
In a CAT with a time limit it is impossible for the examinee to accurately budget the time they can spend on each test item and to determine if they are on pace to complete a timed test section. Test takers may thus be penalized for spending too much time on a difficult question which is presented early in a section and then failing to complete enough questions to accurately gauge their proficiency in areas which are left untested when time expires. [11] While untimed CATs are excellent tools for formative assessments which guide subsequent instruction, timed CATs are unsuitable for high-stakes summative assessments used to measure aptitude for jobs and educational programs.[ citation needed ]
There are five technical components in building a CAT (the following is adapted from Weiss & Kingsbury, 1984 [2] ). This list does not include practical issues, such as item pretesting or live field release.
A pool of items must be available for the CAT to choose from. [2] Such items can be created in the traditional way (i.e., manually) or through automatic item generation. The pool must be calibrated with a psychometric model, which is used as a basis for the remaining four components. Typically, item response theory is employed as the psychometric model. [2] One reason item response theory is popular is that it places persons and items on the same metric (denoted by the Greek letter theta), which is helpful for issues in item selection (see below).[ citation needed ]
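For concreteness, the Python sketch below evaluates the three-parameter logistic (3PL) response function, one commonly used IRT model, giving the probability of a correct answer as a function of ability (theta) and the item's discrimination (a), difficulty (b), and pseudo-guessing (c) parameters. The parameter values in the example are made up.

    import math

    def prob_3pl(theta, a, b, c=0.0):
        """Probability of a correct response under the 3PL IRT model."""
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    # A made-up item: moderate discrimination, medium difficulty, some guessing.
    print(prob_3pl(theta=0.0, a=1.2, b=0.0, c=0.2))   # 0.6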
In CAT, items are selected based on the examinee's performance up to a given point in the test. However, the CAT is obviously not able to make any specific estimate of examinee ability when no items have been administered. So some other initial estimate of examinee ability is necessary. If some previous information regarding the examinee is known, it can be used, [2] but often the CAT just assumes that the examinee is of average ability – hence the first item often being of medium difficulty level.[ citation needed ]
As mentioned previously, item response theory places examinees and items on the same metric. Therefore, if the CAT has an estimate of examinee ability, it is able to select an item that is most appropriate for that estimate. [8] Technically, this is done by selecting the item with the greatest information at that point. [2] Information is a function of the discrimination parameter of the item, as well as the conditional variance and pseudo-guessing parameter (if used).[ citation needed ]
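A common concrete instance of "greatest information" is Fisher information under the 3PL model. The sketch below, under the assumption that the item bank is a list of dictionaries with id, a, b, and c fields, computes that information and returns the unused item that maximizes it at the current ability estimate.

    import math

    def prob_3pl(theta, a, b, c=0.0):
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    def info_3pl(theta, a, b, c=0.0):
        """Fisher information of a 3PL item at ability theta."""
        p = prob_3pl(theta, a, b, c)
        return (a ** 2) * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

    def most_informative(item_bank, theta, used_ids):
        """Return the unused item with maximum information at theta."""
        candidates = [it for it in item_bank if it["id"] not in used_ids]
        return max(candidates, key=lambda it: info_3pl(theta, it["a"], it["b"], it["c"]))

    # Tiny made-up bank: at theta = 1.0 the harder, more discriminating item wins.
    bank = [{"id": 1, "a": 0.8, "b": -1.0, "c": 0.2},
            {"id": 2, "a": 1.5, "b": 1.0, "c": 0.2}]
    print(most_informative(bank, theta=1.0, used_ids=set())["id"])   # 2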
After an item is administered, the CAT updates its estimate of the examinee's ability level. If the examinee answered the item correctly, the CAT will likely estimate their ability to be somewhat higher, and vice versa. This is done by using the item response function from item response theory to obtain a likelihood function of the examinee's ability. Two methods for this are called maximum likelihood estimation and Bayesian estimation. The latter assumes an a priori distribution of examinee ability, and has two commonly used estimators: expectation a posteriori and maximum a posteriori. Maximum likelihood is equivalent to a Bayes maximum a posteriori estimate if a uniform (f(x)=1) prior is assumed. [8] Maximum likelihood is asymptotically unbiased, but cannot provide a theta estimate for an unmixed (all correct or incorrect) response vector, in which case a Bayesian method may have to be used temporarily. [2]
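One commonly used Bayesian variant is expectation a posteriori (EAP) estimation. The sketch below evaluates the posterior on a quadrature grid under a standard normal prior; the 3PL item parameters, grid range, and number of points are illustrative choices, and the approach also works for the unmixed response vectors that defeat maximum likelihood.

    import math

    def prob_3pl(theta, a, b, c=0.0):
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    def eap_estimate(items, responses, grid_min=-4.0, grid_max=4.0, n_points=81):
        """EAP ability estimate: posterior mean on a grid with a N(0,1) prior."""
        step = (grid_max - grid_min) / (n_points - 1)
        grid = [grid_min + i * step for i in range(n_points)]
        posterior = []
        for theta in grid:
            weight = math.exp(-0.5 * theta ** 2)     # unnormalized standard normal prior
            for item, r in zip(items, responses):
                p = prob_3pl(theta, item["a"], item["b"], item["c"])
                weight *= p if r == 1 else (1.0 - p)
            posterior.append(weight)
        total = sum(posterior)
        return sum(t * w for t, w in zip(grid, posterior)) / total

    # Two made-up items: one answered correctly, one incorrectly.
    items = [{"a": 1.0, "b": 0.0, "c": 0.2}, {"a": 1.2, "b": 0.5, "c": 0.2}]
    print(round(eap_estimate(items, [1, 0]), 2))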
The CAT algorithm is designed to repeatedly administer items and update the estimate of examinee ability. This will continue until the item pool is exhausted unless a termination criterion is incorporated into the CAT. Often, the test is terminated when the examinee's standard error of measurement falls below a certain user-specified value, hence the statement above that an advantage is that examinee scores will be uniformly precise or "equiprecise." [2] Other termination criteria exist for different purposes of the test, such as if the test is designed only to determine if the examinee should "Pass" or "Fail" the test, rather than obtaining a precise estimate of their ability. [2] [12]
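A sketch of the standard-error stopping rule, reusing the 3PL information function from above: under IRT, the standard error of measurement is the reciprocal of the square root of the accumulated test information, and the test stops once that error falls below a target value. The 0.30 target is an arbitrary illustration.

    import math

    def prob_3pl(theta, a, b, c=0.0):
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    def info_3pl(theta, a, b, c=0.0):
        p = prob_3pl(theta, a, b, c)
        return (a ** 2) * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

    def should_stop(administered_items, theta, se_target=0.30):
        """Stop once the standard error of measurement at theta drops below se_target."""
        test_info = sum(info_3pl(theta, it["a"], it["b"], it["c"])
                        for it in administered_items)
        return test_info > 0.0 and 1.0 / math.sqrt(test_info) < se_target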
In many situations, the purpose of the test is to classify examinees into two or more mutually exclusive and exhaustive categories. This includes the common "mastery test" where the two classifications are "pass" and "fail", but also includes situations where there are three or more classifications, such as "Insufficient", "Basic", and "Advanced" levels of knowledge or competency. The kind of "item-level adaptive" CAT described in this article is most appropriate for tests that are not "pass/fail" or for pass/fail tests where providing good feedback is extremely important. Some modifications are necessary for a pass/fail CAT, also known as a computerized classification test (CCT). [12] For examinees with true scores very close to the passing score, computerized classification tests will result in long tests, while those with true scores far above or below the passing score will have the shortest exams.[ citation needed ]
For example, a new termination criterion and scoring algorithm must be applied that classifies the examinee into a category rather than providing a point estimate of ability. There are two primary methodologies available for this. The more prominent of the two is the sequential probability ratio test (SPRT). [13] [14] This formulates the examinee classification problem as a hypothesis test that the examinee's ability is equal to either some specified point above the cutscore or another specified point below the cutscore. Note that this is a point hypothesis formulation rather than the composite hypothesis formulation [15] that is more conceptually appropriate: a composite formulation would be that the examinee's ability is in the region above the cutscore or the region below the cutscore.[ citation needed ]
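A bare-bones version of the SPRT decision rule for a single cutscore is sketched below: it compares the log-likelihood of the responses at a point just above the cutscore with the log-likelihood at a point just below it, and classifies the examinee once the ratio crosses Wald's bounds. The choice of the two points and of the alpha and beta error rates is illustrative.

    import math

    def prob_3pl(theta, a, b, c=0.0):
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    def log_likelihood(theta, items, responses):
        total = 0.0
        for item, r in zip(items, responses):
            p = prob_3pl(theta, item["a"], item["b"], item["c"])
            total += math.log(p if r == 1 else 1.0 - p)
        return total

    def sprt_decision(items, responses, theta_below, theta_above, alpha=0.05, beta=0.05):
        """Wald's SPRT: classify, or keep testing, based on the likelihood ratio."""
        log_ratio = (log_likelihood(theta_above, items, responses)
                     - log_likelihood(theta_below, items, responses))
        if log_ratio >= math.log((1.0 - beta) / alpha):
            return "pass"
        if log_ratio <= math.log(beta / (1.0 - alpha)):
            return "fail"
        return "continue testing"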
A confidence interval approach is also used, where after each item is administered, the algorithm determines the probability that the examinee's true-score is above or below the passing score. [16] [17] For example, the algorithm may continue until the 95% confidence interval for the true score no longer contains the passing score. At that point, no further items are needed because the pass-fail decision is already 95% accurate, assuming that the psychometric models underlying the adaptive testing fit the examinee and test. This approach was originally called "adaptive mastery testing" [16] but it can be applied to non-adaptive item selection and classification situations of two or more cutscores (the typical mastery test has a single cutscore). [17]
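A minimal version of that rule, assuming a normal approximation around the current ability estimate with its IRT standard error: stop and classify as soon as the confidence interval excludes the cutscore. The 95% level (z = 1.96) matches the example above; the numbers in the usage line are made up.

    def ci_decision(theta_hat, standard_error, cutscore, z=1.96):
        """Classify once the confidence interval around theta_hat excludes the cutscore."""
        lower = theta_hat - z * standard_error
        upper = theta_hat + z * standard_error
        if lower > cutscore:
            return "pass"
        if upper < cutscore:
            return "fail"
        return "continue testing"

    print(ci_decision(theta_hat=0.8, standard_error=0.25, cutscore=0.0))   # pass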
As a practical matter, the algorithm is generally programmed to have a minimum and a maximum test length (or a minimum and maximum administration time). Otherwise, it would be possible for an examinee with ability very close to the cutscore to be administered every item in the bank without the algorithm making a decision.[ citation needed ]
The item selection algorithm utilized depends on the termination criterion. Maximizing information at the cutscore is more appropriate for the SPRT because it maximizes the difference in the probabilities used in the likelihood ratio. [18] Maximizing information at the ability estimate is more appropriate for the confidence interval approach because it minimizes the conditional standard error of measurement, which decreases the width of the confidence interval needed to make a classification. [17]
ETS researcher Martha Stocking has quipped that most adaptive tests are actually barely adaptive tests (BATs) because, in practice, many constraints are imposed upon item choice. For example, CAT exams must usually meet content specifications; [3] a verbal exam may need to be composed of equal numbers of analogies, fill-in-the-blank, and synonym item types. CATs typically have some form of item exposure constraints, [3] to prevent the most informative items from being over-exposed. Also, on some tests, an attempt is made to balance surface characteristics of the items, such as the gender of the people in the items or the ethnicities implied by their names. Thus CAT exams are frequently constrained in which items they may choose, and for some exams the constraints may be substantial and require complex search strategies (e.g., linear programming) to find suitable items.[ citation needed ]
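Operational programs use formal constraint methods such as linear programming, but a much simpler heuristic conveys the idea: track how far each content area is below its target proportion and draw the next item from the most deficient area. The target proportions and counts below are made up for illustration.

    def pick_content_area(targets, counts_so_far):
        """Choose the content area currently furthest below its target proportion."""
        total = sum(counts_so_far.values()) or 1
        def deficit(area):
            return targets[area] - counts_so_far.get(area, 0) / total
        return max(targets, key=deficit)

    # Equal thirds of analogy, fill-in-the-blank, and synonym items, as in the
    # verbal-exam example above; counts administered so far are made up.
    targets = {"analogy": 1 / 3, "fill_in": 1 / 3, "synonym": 1 / 3}
    counts = {"analogy": 4, "fill_in": 1, "synonym": 3}
    print(pick_content_area(targets, counts))   # fill_in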
A simple method for controlling item exposure is the "randomesque" or strata method. Rather than selecting the most informative item at each point in the test, the algorithm randomly selects the next item from the next five or ten most informative items. This can be used throughout the test, or only at the beginning. [3] Another method is the Sympson-Hetter method, [19] in which a random number is drawn from U(0,1), and compared to a ki parameter determined for each item by the test user. If the random number is greater than ki, the next most informative item is considered. [3]
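Both exposure-control methods are easy to sketch. The first function randomly administers one of the several most informative candidates (the randomesque method); the second walks down the informativeness ranking and administers an item only if a uniform draw does not exceed its exposure parameter k, in the spirit of Sympson-Hetter. The item dictionaries and k values are illustrative assumptions.

    import random

    def randomesque(candidates_by_info, n_best=5, rng=random):
        """Administer a random choice among the n_best most informative candidates."""
        return rng.choice(candidates_by_info[:n_best])

    def sympson_hetter(candidates_by_info, rng=random):
        """Consider items from most to least informative; administer an item only
        if a U(0,1) draw does not exceed its exposure parameter k."""
        for item in candidates_by_info:
            if rng.random() <= item["k"]:
                return item
        return candidates_by_info[-1]   # fallback if every item is skipped

    # Candidates sorted from most to least informative, with made-up k values.
    candidates = [{"id": 7, "k": 0.4}, {"id": 3, "k": 0.8}, {"id": 9, "k": 1.0}]
    print(randomesque(candidates, n_best=2)["id"])
    print(sympson_hetter(candidates)["id"])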
Wim van der Linden and colleagues [20] have advanced an alternative approach called shadow testing which involves creating entire shadow tests as part of selecting items. Selecting items from shadow tests helps adaptive tests meet selection criteria by focusing on globally optimal choices (as opposed to choices that are optimal for a given item).[ citation needed ]
Given a set of items, a multidimensional computer adaptive test (MCAT) selects items from the bank according to the student's estimated abilities, resulting in an individualized test. Unlike a standard CAT, which evaluates a single ability, an MCAT seeks to maximize the test's accuracy across multiple abilities measured simultaneously, using the sequence of items previously answered (Piton-Gonçalves & Aluisio, 2012).[ citation needed ]
In statistics and psychometrics, reliability is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions:
"It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 and 1.00, are usually used to indicate the amount of error in the scores."
The Graduate Management Admission Test is a computer adaptive test (CAT) intended to assess certain analytical, writing, quantitative, verbal, and reading skills in written English for use in admission to a graduate management program, such as a Master of Business Administration (MBA) program. Answering the test questions requires knowledge of English grammatical rules, reading comprehension, and mathematical skills such as arithmetic, algebra, and geometry. The Graduate Management Admission Council (GMAC) owns and operates the test, and states that the GMAT assesses analytical writing and problem-solving abilities while also addressing data sufficiency, logic, and critical reasoning skills that it believes to be vital to real-world business and management success. It can be taken up to five times a year but no more than eight times total. Attempts must be at least 16 days apart.
The Graduate Record Examinations (GRE) is a standardized test that is part of the admissions process for many graduate schools in the United States and Canada and a few other countries. The GRE is owned and administered by Educational Testing Service (ETS). The test was established in 1936 by the Carnegie Foundation for the Advancement of Teaching.
Educational Testing Service (ETS), founded in 1947, is the world's largest private educational testing and assessment organization. It is headquartered in Lawrence Township, New Jersey, but has a Princeton address.
Classical test theory (CTT) is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers. It is a theory of testing based on the idea that a person's observed or obtained score on a test is the sum of a true score (error-free score) and an error score. Generally speaking, the aim of classical test theory is to understand and improve the reliability of psychological tests.
In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.
The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between the respondent's abilities, attitudes, or personality traits, and the item difficulty. For example, they may be used to estimate a student's reading ability or the extremity of a person's attitude to capital punishment from responses on a questionnaire. In addition to psychometrics and educational research, the Rasch model and its extensions are used in other areas, including the health profession, agriculture, and market research.
A computerized classification test (CCT) refers to, as its name would suggest, a test administered by computer for the purpose of classifying examinees. The most common CCT is a mastery test where the test classifies examinees as "Pass" or "Fail," but the term also includes tests that classify examinees into more than two categories. While the term may generally be considered to refer to all computer-administered tests for classification, it is usually used to refer to tests that are interactively administered or of variable length, similar to computerized adaptive testing (CAT). Like CAT, variable-length CCTs can accomplish the goal of the test with a fraction of the number of items used in a conventional fixed-form test.
The sequential probability ratio test (SPRT) is a specific sequential hypothesis test, developed by Abraham Wald and later proven to be optimal by Wald and Jacob Wolfowitz. Neyman and Pearson's 1933 result inspired Wald to reformulate hypothesis testing as a sequential analysis problem. The Neyman-Pearson lemma, by contrast, applies to the case in which all the data have already been collected.
Test equating traditionally refers to the statistical process of determining comparable scores on different forms of an exam. It can be accomplished using either classical test theory or item response theory.
Differential item functioning (DIF) is a statistical property of a test item that indicates how likely it is for individuals from distinct groups, possessing similar abilities, to respond differently to the item. It manifests when individuals from different groups, with comparable skill levels, do not have an equal likelihood of answering a question correctly. There are two primary types of DIF: uniform DIF, where one group consistently has an advantage over the other, and nonuniform DIF, where the advantage varies with the individual's ability level. The presence of DIF requires review and judgment; it does not always signify bias. DIF analysis provides an indication of unexpected behavior of items on a test. The DIF characteristic of an item is not determined simply by differing probabilities of selecting a specific response among individuals from different groups; rather, DIF is present when individuals from different groups who possess the same underlying true ability have differing probabilities of giving a certain response. Even when uniform bias is present, test developers sometimes fall back on assumptions, such as that DIF biases may offset each other, because of the extensive work required to address DIF, which compromises test ethics and perpetuates systemic biases. Common procedures for assessing DIF are the Mantel-Haenszel procedure, logistic regression, item response theory (IRT) based methods, and confirmatory factor analysis (CFA) based methods.
The Principles and Practice of Engineering exam is the examination required for one to become a Professional Engineer (PE) in the United States. It is the second exam required, coming after the Fundamentals of Engineering exam.
Linear-on-the-fly testing, often referred to as LOFT, is a method of delivering educational or professional examinations. Competing methods include traditional linear fixed-form delivery and computerized adaptive testing. LOFT is a compromise between the two, in an effort to maintain the equivalence of the set of items that each examinee sees, which is found in fixed-form delivery, while attempting to reduce item exposure and enhance test security.
Multistage testing is an algorithm-based approach to administering tests. It is very similar to computer-adaptive testing in that items are interactively selected for each examinee by the algorithm, but rather than selecting individual items, groups of items are selected, building the test in stages. These groups are called testlets or panels.
The attribute hierarchy method (AHM) is a cognitively based psychometric procedure developed by Jacqueline Leighton, Mark Gierl, and Steve Hunka at the Centre for Research in Applied Measurement and Evaluation (CRAME) at the University of Alberta. The AHM is one form of cognitive diagnostic assessment that aims to integrate cognitive psychology with educational measurement for the purposes of enhancing instruction and student learning. A cognitive diagnostic assessment (CDA) is designed to measure specific knowledge states and cognitive processing skills in a given domain. The results of a CDA yield a profile of scores with detailed information about a student’s cognitive strengths and weaknesses. This cognitive diagnostic feedback has the potential to guide instructors, parents and students in their teaching and learning processes.
Psychometric software is software that is used for psychometric analysis of data from tests, questionnaires, or inventories reflecting latent psychoeducational variables. While some psychometric analyses can be performed with standard statistical software like SPSS, most analyses require specialized tools.
Howard Charles Wainer is an American statistician, past principal research scientist at the Educational Testing Service, adjunct professor of statistics at the Wharton School of the University of Pennsylvania, and author, known for his contributions in the fields of statistics, psychometrics, and statistical graphics.
An examination or test is an educational assessment intended to measure a test-taker's knowledge, skill, aptitude, physical fitness, or classification in many other topics. A test may be administered verbally, on paper, on a computer, or in a predetermined area that requires a test taker to demonstrate or perform a set of skills.
Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items that are the basic building blocks of a psychological test. The method was first described by John R. Bormuth in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items. So, instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models.
Mark Daniel Reckase is an educational psychologist and expert on quantitative methods and measurement who is known for his work on computerized adaptive testing, multidimensional item response theory, and standard setting in educational and psychological tests. Reckase is University Distinguished Professor Emeritus in the College of Education at Michigan State University.