Test equating traditionally refers to the statistical process of determining comparable scores on different forms of an exam. [1] It can be accomplished using either classical test theory or item response theory.
In item response theory, equating [2] is the process of placing scores from two or more parallel test forms onto a common score scale, so that scores from different forms can be compared directly or treated as though they came from the same form. When the tests are not parallel, the more general process is called linking: equating the units and origins of the scales on which students' abilities have been estimated from results on different tests. The process is analogous to equating degrees Fahrenheit with degrees Celsius by converting measurements from one scale to the other. The determination of comparable scores is then a by-product of equating the scales obtained from the test results.
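The temperature analogy can be made concrete: converting between the two scales is a linear transformation that adjusts both the unit (the slope) and the origin (the intercept), just as scale linking adjusts the unit and origin of ability scales. A minimal sketch:

```python
# Analogy for scale linking: Celsius-to-Fahrenheit conversion is a linear
# transformation that rescales the unit (factor 9/5) and shifts the
# origin (offset 32), just as linking rescales and shifts an ability scale.
def celsius_to_fahrenheit(c):
    """Place a Celsius measurement on the Fahrenheit scale."""
    return 9 / 5 * c + 32
```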
Suppose that Dick and Jane both take a test to become licensed in a certain profession. Because the high stakes (you get to practice the profession if you pass the test) may create a temptation to cheat, the organization that oversees the test creates two forms. If we know that Dick scored 60% on form A and Jane scored 70% on form B, do we know for sure which one has a better grasp of the material? What if form A is composed of very difficult items, while form B is relatively easy? Equating analyses are performed to address this very issue, so that scores are as fair as possible.
In item response theory, person "locations" (measures of some quality being assessed by a test) are estimated on an interval scale; i.e., locations are estimated in relation to a unit and origin. It is common in educational assessment to employ tests in order to assess different groups of students with the intention of establishing a common scale by equating the origins, and when appropriate also the units, of the scales obtained from response data from the different tests. The process is referred to as equating or test equating.
In item response theory, two different kinds of equating are horizontal and vertical equating. [3] Vertical equating refers to the process of equating tests administered to groups of students with different abilities, such as students in different grades (years of schooling). [4] Horizontal equating refers to the equating of tests administered to groups with similar abilities; for example, two tests administered to students in the same grade in two consecutive calendar years. Different tests are used to avoid practice effects.
In terms of item response theory, equating is just a special case of the more general process of scaling, applicable when more than one test is used. In practice, though, scaling is often implemented separately for different tests and then the scales subsequently equated.
A distinction is often made between two methods of equating: common person and common item equating. Common person equating involves the administration of two tests to a common group of persons. The mean and standard deviation of the scale locations of the group on the two tests are equated using a linear transformation. Common item equating involves the use of a set of common items, referred to as the anchor test, embedded in two different tests. The mean item location of the common items is equated.
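As an illustrative sketch of common person equating (the data below are hypothetical scale locations, not from any real administration), the linear transformation matches the common group's mean and standard deviation across the two forms:

```python
# Common person equating: one group takes both forms; rescale Form B
# locations so the group's mean and standard deviation match Form A.
import statistics

form_a = [-1.2, -0.4, 0.1, 0.8, 1.5]   # hypothetical locations on Form A
form_b = [-2.0, -0.9, -0.2, 0.7, 1.6]  # same persons' locations on Form B

mean_a, sd_a = statistics.mean(form_a), statistics.pstdev(form_a)
mean_b, sd_b = statistics.mean(form_b), statistics.pstdev(form_b)

def b_to_a(theta_b):
    """Linear transformation placing a Form B location on the Form A scale."""
    return mean_a + (sd_a / sd_b) * (theta_b - mean_b)

# After the transformation, the group's mean and SD agree on both scales.
equated = [b_to_a(t) for t in form_b]
```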
In classical test theory, mean equating simply adjusts the distribution of scores so that the mean of one form is comparable to the mean of the other form. While mean equating is attractive because of its simplicity, it lacks flexibility: in particular, it cannot account for the possibility that the standard deviations of the forms differ. [1]
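A minimal sketch of mean equating, using hypothetical raw scores: every Form B score is shifted by the difference between the two form means.

```python
# Mean equating: shift Form B raw scores by the difference in form means
# so the two forms have the same mean. (Hypothetical score data.)
import statistics

form_a_scores = [52, 60, 61, 70, 77]
form_b_scores = [58, 66, 70, 74, 82]

shift = statistics.mean(form_a_scores) - statistics.mean(form_b_scores)

def equate_b_to_a(score_b):
    """Mean-equated Form A equivalent of a Form B raw score."""
    return score_b + shift
```

Note that the spread of scores is untouched: this is exactly the inflexibility described above when the forms' standard deviations differ.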
Linear equating adjusts so that the two forms have a comparable mean and standard deviation. There are several types of linear equating that differ in the assumptions and mathematics used to estimate parameters. The Tucker and Levine Observed Score methods estimate the relationship between observed scores on the two forms, while the Levine True Score method estimates the relationship between true scores on the two forms. [1]
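The general form shared by these linear methods can be sketched as follows; the slope and intercept here are the simple observed-score choices that match both mean and standard deviation (the Tucker and Levine methods estimate these quantities differently, which is not modeled here). Scores are hypothetical.

```python
# Linear equating: choose a slope and intercept so that equated Form B
# scores take on the mean and standard deviation of Form A.
import statistics

form_a_scores = [48, 55, 63, 70, 84]
form_b_scores = [40, 52, 60, 71, 77]

slope = statistics.pstdev(form_a_scores) / statistics.pstdev(form_b_scores)
intercept = statistics.mean(form_a_scores) - slope * statistics.mean(form_b_scores)

def linear_equate(score_b):
    """Form A equivalent of a Form B raw score under linear equating."""
    return slope * score_b + intercept
```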
Equipercentile equating determines the equating relationship as one where a score could have an equivalent percentile on either form. This relationship can be nonlinear.
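A rough sketch of the equipercentile idea, using hypothetical score lists and a deliberately simple percentile-rank matching rule (operational implementations smooth the distributions and interpolate):

```python
# Equipercentile equating sketch: a Form B score maps to the Form A score
# occupying the same percentile rank in its own distribution.
def percentile_of(score, scores):
    """Fraction of scores at or below the given score."""
    return sum(1 for x in scores if x <= score) / len(scores)

def equipercentile(score_b, form_a, form_b):
    """Form A score whose percentile rank first reaches that of score_b."""
    p = percentile_of(score_b, form_b)
    ordered = sorted(form_a)
    for x in ordered:
        if percentile_of(x, form_a) >= p:
            return x
    return ordered[-1]
```

Because the mapping is built score by score from the two distributions, the resulting relationship need not be linear.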
Unlike with item response theory, equating based on classical test theory is somewhat distinct from scaling. Equating is a raw-to-raw transformation in that it estimates a raw score on Form B that is equivalent to each raw score on the base Form A. Any scaling transformation used is then applied on top of, or with, the equating.
Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.
In statistics, the standard score is the number of standard deviations by which the value of a raw score is above or below the mean value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores.
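The definition above translates directly into a one-line computation; the data here are an arbitrary illustrative sample.

```python
# Standard (z) score: signed number of standard deviations a raw score
# lies above or below the mean of the observed values.
import statistics

def z_score(raw, values):
    """Standard score of `raw` relative to the distribution of `values`."""
    return (raw - statistics.mean(values)) / statistics.pstdev(values)
```

A score above the mean yields a positive z, one below the mean a negative z, matching the sign convention described above.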
The Stanford–Binet Intelligence Scales is an individually administered intelligence test that was revised from the original Binet–Simon Scale by Alfred Binet and Théodore Simon. It is in its fifth edition (SB5), which was released in 2003.
In statistics and psychometrics, reliability is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions:
"It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 and 1.00, are usually used to indicate the amount of error in the scores."
In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.
In psychometrics, item response theory is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.
In statistics, an effect size is a value measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the value of one parameter for a hypothetical population, or to the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event happening. Effect sizes are a complementary tool for statistical hypothesis testing, and play an important role in power analyses to assess the sample size required for new experiments. Effect sizes are fundamental in meta-analyses, which aim to provide a combined effect size based on data from multiple studies. The cluster of data-analysis methods concerning effect sizes is referred to as estimation statistics.
Level of measurement or scale of measure is a classification that describes the nature of information within the values assigned to variables. Psychologist Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement: nominal, ordinal, interval, and ratio. This framework of distinguishing levels of measurement originated in psychology and has since had a complex history, being adopted and extended in some disciplines and by some scholars, and criticized or rejected by others. Other classifications include those by Mosteller and Tukey, and by Chrisman.
In probability theory and statistics, the coefficient of variation (CV), also known as normalized root-mean-square deviation (NRMSD), percent RMS, and relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation σ to the mean μ, and often expressed as a percentage ("%RSD"). The CV or RSD is widely used in analytical chemistry to express the precision and repeatability of an assay. It is also commonly used in fields such as engineering or physics when doing quality assurance studies and ANOVA gauge R&R, by economists and investors in economic models, and in psychology/neuroscience.
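The ratio-of-dispersion-to-mean definition can be sketched in a few lines (sample data here are arbitrary):

```python
# Coefficient of variation as a percentage (%RSD): the standard deviation
# divided by the mean, scaled by 100.
import statistics

def cv_percent(values):
    """Coefficient of variation of `values`, expressed as a percentage."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)
```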
The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between the respondent's abilities, attitudes, or personality traits, and the item difficulty. For example, they may be used to estimate a student's reading ability or the extremity of a person's attitude to capital punishment from responses on a questionnaire. In addition to psychometrics and educational research, the Rasch model and its extensions are used in other areas, including the health profession, agriculture, and market research.
The Wechsler Preschool and Primary Scale of Intelligence (WPPSI) is an intelligence test designed for children ages 2 years 6 months to 7 years 7 months developed by David Wechsler in 1967. It is a descendant of the earlier Wechsler Adult Intelligence Scale and the Wechsler Intelligence Scale for Children tests. Since its original publication the WPPSI has been revised three times in 1989, 2002, and 2012. The latest version, WPPSI–IV, published by Pearson Education, is a revision of the WPPSI-R and the WPPSI-III. It provides subtest and composite scores that represent intellectual functioning in verbal and performance cognitive domains, as well as providing a composite score that represents a child's general intellectual ability.
Consensus-based assessment expands on the common practice of consensus decision-making and the theoretical observation that expertise can be closely approximated by large numbers of novices or journeymen. It creates a method for determining measurement standards for very ambiguous domains of knowledge, such as emotional intelligence, politics, religion, values and culture in general. From this perspective, the shared knowledge that forms cultural consensus can be assessed in much the same way as expertise or general intelligence.
The Millon Clinical Multiaxial Inventory – Fourth Edition (MCMI-IV) is the most recent edition of the Millon Clinical Multiaxial Inventory. The MCMI is a psychological assessment tool intended to provide information on personality traits and psychopathology, including specific mental disorders outlined in the DSM-5. It is intended for adults with at least a 5th grade reading level who are currently seeking mental health services. The MCMI was developed and standardized specifically on clinical populations, and the authors are very specific that it should not be used with the general population or adolescents. However, there is an evidence base showing that it may retain validity on non-clinical populations, and so psychologists will sometimes administer the test to members of the general population, with caution. The concepts involved in the questions and their presentation make it unsuitable for those with below average intelligence or reading ability.
In mathematics and statistics, deviation serves as a measure to quantify the disparity between an observed value of a variable and another designated value, frequently the mean of that variable. Deviations with respect to the sample mean and the population mean are called errors and residuals, respectively. The sign of the deviation reports the direction of that difference: the deviation is positive when the observed value exceeds the reference value. The absolute value of the deviation indicates the size or magnitude of the difference. In a given sample, there are as many deviations as sample points. Summary statistics can be derived from a set of deviations, such as the standard deviation and the mean absolute deviation, measures of dispersion, and the mean signed deviation, a measure of bias.
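The summary statistics named above can be computed directly from the set of deviations; the values below are an arbitrary illustrative sample.

```python
# Deviations from the mean, and two summary statistics built from them:
# the mean absolute deviation (dispersion) and the standard deviation.
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(values)

# One deviation per sample point; signs report direction, magnitudes size.
deviations = [x - mean for x in values]

# Mean absolute deviation: average magnitude of the deviations.
mad = statistics.mean(abs(d) for d in deviations)
```

By construction, deviations from the sample mean sum to zero, which is why the mean signed deviation is computed against some other reference value when used as a measure of bias.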
A test score is a piece of information, usually a number, that conveys the performance of an examinee on a test. One formal definition is that it is "a summary of the evidence contained in an examinee's responses to the items of a test that are related to the construct or constructs being measured."
Psychometric software refers to specialized programs used for the psychometric analysis of data obtained from tests, questionnaires, polls or inventories that measure latent psychoeducational variables. Although some psychometric analyses can be performed using general statistical software such as SPSS, most require specialized tools designed specifically for psychometric purposes.
In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range. For instance, when the variance of data in a set is large, the data is widely scattered. On the other hand, when the variance is small, the data in the set is clustered.
Educational measurement refers to the use of educational assessments and the analysis of data such as scores obtained from educational assessments to infer the abilities and proficiencies of students. The approaches overlap with those in psychometrics. Educational measurement is the assigning of numerals to traits such as achievement, interest, attitudes, aptitudes, intelligence and performance.