Criterion-referenced test

A criterion-referenced test is a style of test that uses test scores to generate a statement about the behavior that can be expected of a person with that score. Most tests and quizzes that are written by school teachers can be considered criterion-referenced tests. In this case, the objective is simply to see whether the student has learned the material. Criterion-referenced assessment can be contrasted with norm-referenced assessment and ipsative assessment.

Criterion-referenced testing was a major focus of psychometric research in the 1970s. [1]

Definition of criterion

A common misunderstanding regarding the term is the meaning of criterion. Many, if not most, criterion-referenced tests involve a cutscore, where the examinee passes if their score meets or exceeds the cutscore and fails if it does not (often called a mastery test). The criterion is not the cutscore; the criterion is the domain of subject matter that the test is designed to assess. For example, the criterion may be "Students should be able to correctly add two single-digit numbers," and the cutscore may be that students should correctly answer a minimum of 80% of the questions to pass.
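The distinction between the criterion (the subject-matter domain) and the cutscore (the passing threshold) can be sketched in a few lines of Python. This is only an illustration: the 80% threshold, the item count, and the function name are assumptions for the example, not part of any testing standard.

```python
# Illustrative sketch of a mastery decision on a criterion-referenced test.
# The criterion is the skill domain (e.g., single-digit addition);
# the cutscore is only the passing threshold applied to the raw score.

def mastery_result(num_correct: int, num_items: int, cutscore: float = 0.80) -> str:
    """Classify an examinee as pass/fail against the cutscore."""
    proportion_correct = num_correct / num_items
    return "pass" if proportion_correct >= cutscore else "fail"

# An examinee answering 17 of 20 items (85%) clears an 80% cutscore;
# one answering 15 of 20 (75%) does not.
print(mastery_result(17, 20))  # pass
print(mastery_result(15, 20))  # fail
```

Note that changing the cutscore changes who passes, but not what the test measures; the criterion stays the same.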

The criterion-referenced interpretation of a test score identifies the relationship to the subject matter. In the case of a mastery test, this means identifying whether the examinee has "mastered" a specified level of the subject matter by comparing their score to the cutscore. However, not all criterion-referenced tests have a cutscore, and the score can simply refer to a person's standing on the subject domain. [2] The ACT is an example of this; there is no cutscore, and it is simply an assessment of the student's knowledge of high-school level subject matter.

Because of this common misunderstanding, criterion-referenced tests have also been called standards-based assessments by some education agencies, [3] as students are assessed with regard to standards that specify what they "should" know, as defined by the state. [4]

Comparison of criterion-referenced, domain-referenced and norm-referenced tests

Sample scoring for the history question: What caused World War II?

Student #1: World War II was caused by Hitler and Germany invading Poland.
  Criterion-referenced assessment: This answer is correct.
  Norm-referenced assessment: This answer is worse than Student #2's answer, but better than Student #3's answer.

Student #2: World War II was caused by multiple factors, including the Great Depression and the general economic situation, the rise of nationalism, fascism, and imperialist expansionism, and unresolved resentments related to World War I. The war in Europe began with the German invasion of Poland.
  Criterion-referenced assessment: This answer is correct.
  Norm-referenced assessment: This answer is better than Student #1's and Student #3's answers.

Student #3: World War II was caused by the assassination of Archduke Ferdinand.
  Criterion-referenced assessment: This answer is wrong.
  Norm-referenced assessment: This answer is worse than Student #1's and Student #2's answers.

Both terms criterion-referenced and norm-referenced were originally coined by Robert Glaser. [5] Unlike a criterion-referenced test, a norm-referenced test indicates whether the test-taker did better or worse than other people who took the test. For example, if the criterion is "Students should be able to correctly add two single-digit numbers," then reasonable test questions might look like "2 + 3 = ?" or "7 + 8 = ?". A criterion-referenced test would report the student's performance strictly according to whether the individual student correctly answered these questions. A norm-referenced test would report primarily whether this student correctly answered more questions than other students in the group.

Even when testing similar topics, a test designed to accurately assess mastery may use different questions than one intended to show relative ranking. This is because some questions are better at reflecting actual achievement, and some are better at differentiating between the strongest and weakest students. (Many questions do both.) A criterion-referenced test will use questions that were correctly answered by students who know the specific material. A norm-referenced test will use questions that were correctly answered by the "best" students and incorrectly answered by the "worst" students (e.g. Cambridge University's pre-entry 'S' paper).

Some tests can provide useful information about both actual achievement and relative ranking. The ACT provides both a ranking and an indication of what level is considered necessary for likely success in college. [6] Some argue that the term "criterion-referenced test" is a misnomer, since it can refer to the interpretation of the score as well as to the test itself. [7] In the previous example, the same ACT score can be interpreted in a norm-referenced or criterion-referenced manner.
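The two interpretations can be sketched side by side in Python. The numbers here (a 20-item test, an 80% mastery threshold, three examinees) and the function names are assumptions chosen for illustration; a criterion-referenced reading compares each score to the standard, while a norm-referenced reading compares it to the rest of the group.

```python
# Illustrative sketch: the same raw scores read two ways.
# Criterion-referenced: each score is compared with a subject-matter
# standard (here, at least 80% of 20 items correct counts as mastery).
# Norm-referenced: each score is compared with the other test takers,
# reported as the percentage of the group scoring below it.

def criterion_interpretation(num_correct: int, num_items: int = 20,
                             cutscore: float = 0.80) -> str:
    """Read a score against the subject-matter standard."""
    return "mastered" if num_correct / num_items >= cutscore else "not mastered"

def percentile_rank(num_correct: int, group_scores: list) -> float:
    """Read a score against the group: percent scoring strictly below it."""
    below = sum(1 for s in group_scores if s < num_correct)
    return 100.0 * below / len(group_scores)

group = [18, 14, 9]  # raw scores out of 20 for three examinees
for score in group:
    print(score,
          criterion_interpretation(score),
          round(percentile_rank(score, group), 1))
```

A score of 14 out of 20 illustrates the contrast: it ranks above a third of this (tiny) group, yet falls short of the 80% mastery standard.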

A domain-referenced test is similar to a criterion-referenced test: it is an assessment that covers a specific area of study such that the score reveals how much of that area has been mastered. Thus, if an individual answered 90% of the items correctly on a domain-referenced or criterion-referenced test, this would be a high score indicating deep knowledge and understanding of the content covered by the test. These kinds of tests contrast with norm-referenced tests, in which scores indicate how well a test taker performed relative to others who took the test. [8] [9]

Relationship to high-stakes testing

Many high-profile criterion-referenced tests are also high-stakes tests, where the results of the test have important implications for the individual examinee. Examples include high school graduation examinations and licensure tests that must be passed to work in a profession, such as becoming a physician or attorney. However, being high-stakes is not specifically a feature of a criterion-referenced test; it is instead a feature of how an educational or government agency chooses to use the test's results.


References

  1. Weiss, D.J.; Davison, M.L. (1981). "Test Theory and Methods". Annual Review of Psychology. 32: 1. doi:10.1146/annurev.ps.32.020181.003213.
  2. QuestionMark Glossary. Archived 2008-10-08 at the Wayback Machine.
  3. Venter, Malcolm. "Assessing the Assessment of Outcomes Based Education". Cape Town, South Africa. Archived 2006-08-29 at the Wayback Machine. "OBE advocates a criterion-based system, which means getting rid of the bell curve, phasing out grade point averages and comparative grading."
  4. "The Education Standards Movement Spells Trouble for Private and Home Schools". Homeschool World. Archived 2006-09-06 at the Wayback Machine.
  5. Glaser, R. (1963). "Instructional technology and the measurement of learning outcomes". American Psychologist. 18 (8): 519–522. doi:10.1037/h0049294.
  6. Cronbach, L. J. (1970). Essentials of Psychological Testing (3rd ed.). New York: Harper & Row.
  7. Haertel, E. (1985). "Construct validity and criterion-referenced testing". Review of Educational Research. 55 (1): 23–46. doi:10.3102/00346543055001023. S2CID 145124784.
  8. "Domain-referenced test". APA Dictionary of Psychology. Washington, DC: American Psychological Association. n.d. Retrieved 2021-02-19.
  9. Denham, Carolyn H. (1975). "Criterion-Referenced, Domain-Referenced and Norm-Referenced Measurement: A Parallax View". Educational Technology. 15 (12): 9–13. ISSN 0013-1962. JSTOR 44418878.