Norm-referenced test

Last updated November 13, 2025

A norm-referenced test (NRT) is a type of test, assessment, or evaluation which yields an estimate of the position of the tested individual in a predefined population, with respect to the trait being measured. The estimate is derived from the analysis of test scores and possibly other relevant data from a sample drawn from the population. That is, this type of test identifies whether the test taker performed better or worse than other test takers, not whether the test taker knows either more or less material than is necessary for a given purpose. The term normative assessment is used when the reference population consists of the peers of the test taker.

Assigning scores on such tests may be described as relative grading, marking on a curve (BrE) or grading on a curve (AmE, CanE) (also referred to as curved grading, bell curving, or using grading curves). It is a method of assigning grades to the students in a class in such a way as to obtain or approach a pre-specified distribution of these grades having a specific mean and deviation properties, such as a normal distribution (also called Gaussian distribution).^[1] The term "curve" refers to the bell curve, the graphical representation of the probability density of the normal distribution, but this method can be used to achieve any desired distribution of the grades, for example a uniform distribution.

Norm-referenced assessment can be contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-referenced assessment, the score shows whether test takers performed well or poorly on a given task, not how their performance compares to that of other test takers; in an ipsative system, test takers are compared to their own previous performance. Each method can be used to grade the same test paper.^[2]

Robert Glaser originally coined the terms norm-referenced test and criterion-referenced test.^[3]

Common uses

Many college entrance exams and nationally used school tests are norm-referenced. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale for Children (WISC) compare individual student performance to the performance of a normative sample. Test takers cannot "fail" a norm-referenced test, as each test taker receives a score, usually given as a percentile, that compares the individual to others who have taken the test. This is useful when there is a wide range of acceptable scores, and the goal is to find out who performs better.
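As a rough illustration of a percentile score, the sketch below computes a percentile rank against a normative sample. The function name `percentile_rank`, the sample values, and the convention of counting half of any tied scores are assumptions made for illustration; published tests use their own norming procedures.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_sample):
    """Percent of the normative sample scoring below `score`,
    counting half of any scores exactly equal to it (one common convention)."""
    ordered = sorted(norm_sample)
    below = bisect_left(ordered, score)           # scores strictly lower
    equal = bisect_right(ordered, score) - below  # scores tied with `score`
    return 100.0 * (below + 0.5 * equal) / len(ordered)


# Hypothetical normative sample of raw scores.
norm_sample = [12, 15, 15, 18, 20, 22, 25, 27, 30, 34]

print(percentile_rank(25, norm_sample))  # 65.0
print(percentile_rank(40, norm_sample))  # 100.0
```

The result says nothing about whether a raw score of 25 is "enough" for any purpose; it only locates the score within the sample.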

IQ tests are norm-referenced tests, because their goal is to rank test takers' intelligence. The median IQ is set to 100, and all test takers are ranked up or down in comparison to that level.

Other types

As alternatives to normative testing, tests can be ipsative assessments or criterion-referenced assessments.

Ipsative

In an ipsative assessment, an individual's performance is compared only to that individual's previous performances.^[4]^[5] For example, a person on a weight-loss diet is judged by how their current weight compares to their own previous weight, rather than to an ideal weight or to another person's weight.

Criterion-referenced

A test is criterion-referenced when the performance is judged according to the expected or desired behavior. Tests that judge the test taker based on a set standard (e.g., everyone should be able to run one kilometre in less than five minutes) are criterion-referenced tests. The goal of a criterion-referenced test is to find out whether the individual can run as fast as the test giver wants, not to find out whether the individual is faster or slower than the other runners. Standards-based education reform focuses on criterion-referenced testing.^[6]^[7] Most everyday tests and quizzes taken in school, as well as most state achievement tests and high school graduation examinations, are criterion-referenced. In this model, it is possible for all test takers to pass or for all test takers to fail.

Comparison of criterion-referenced, domain-referenced and norm-referenced tests

Sample scoring for the history question: What caused World War II?
Student answers	Criterion-referenced assessment	Norm-referenced assessment
Student #1: World War II was caused by Hitler and Germany invading Poland.	This answer is correct.	This answer is worse than Student #2's answer, but better than Student #3's answer.
Student #2: World War II was caused by multiple factors, including the Great Depression and the general economic situation, the rise of nationalism, fascism, and imperialist expansionism, and unresolved resentments related to World War I. The war in Europe began with the German invasion of Poland.	This answer is correct.	This answer is better than Student #1's and Student #3's answers.
Student #3: World War II was caused by the assassination of Archduke Ferdinand.	This answer is wrong.	This answer is worse than Student #1's and Student #2's answers.

Both terms, criterion-referenced and norm-referenced, were originally coined by Robert Glaser.^[8] Unlike a criterion-referenced test, a norm-referenced test indicates whether the test taker did better or worse than other people who took the test. For example, if the criterion is "Students should be able to correctly add two single-digit numbers," then reasonable test questions might look like " $2+3=?$ " or " $9+5=?$ " A criterion-referenced test would report the student's performance strictly according to whether the individual student correctly answered these questions. A norm-referenced test would report primarily whether this student correctly answered more questions than other students in the group.

Even when testing similar topics, a test designed to accurately assess mastery may use different questions than one intended to show relative ranking. This is because some questions are better at reflecting the actual achievement of students, and some are better at differentiating between the best students and the worst students. (Many questions will do both.) A criterion-referenced test will use questions which were correctly answered by students who know the specific material. A norm-referenced test will use questions which were correctly answered by the "best" students and not correctly answered by the "worst" students (e.g. Cambridge University's pre-entry 'S' paper).

Some tests can provide useful information about both actual achievement and relative ranking. The ACT provides both a ranking and an indication of what level is considered necessary for likely success in college.^[9] Some argue that the term "criterion-referenced test" is a misnomer, since it can refer to the interpretation of the score as well as the test itself.^[10] In the previous example, the same score on the ACT can be interpreted in a norm-referenced or criterion-referenced manner.
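One common way to quantify how well a question separates the "best" students from the "worst" is an upper-lower discrimination index: the proportion of a top-scoring group who answered the item correctly minus the proportion of a bottom-scoring group who did. The article does not prescribe this statistic; the function name, the 27 % group size, and the response data below are assumptions for illustration only.

```python
def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower discrimination index for one item.

    `responses` is a list of per-student lists of 0/1 item scores.
    Returns (proportion correct in the top group) minus
    (proportion correct in the bottom group), ranked by total score.
    """
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    k = max(1, round(len(ranked) * fraction))          # size of each group
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(r[item] for r in upper) / k
    p_lower = sum(r[item] for r in lower) / k
    return p_upper - p_lower


# Hypothetical 0/1 answers from six students on three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
]

for item in range(3):
    print(item, discrimination_index(responses, item))
```

In this made-up data every student answers item 0 correctly, so its index is 0: it may show mastery of the material but contributes nothing to a ranking. Item 1 is answered only by the higher-scoring students, which is the kind of question a norm-referenced test favours.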

A domain-referenced test is similar to a criterion-referenced test: it is an assessment that covers a specific area of study such that a score reveals how much of this area has been mastered. Thus, if an individual got 90% of the items correct in a domain-referenced or criterion-referenced test, this would be a high score indicative of deep knowledge and understanding of the content covered in the test. These kinds of tests are contrasted with norm-referenced tests, in which scores indicate how well a test taker performed on the items relative to others who took the test.^[11]^[12]

Methods

One method of grading on a curve uses three steps:

1. Numeric scores (or possibly scores on a sufficiently fine-grained ordinal scale) are assigned to the students. The absolute values are less relevant, provided that the order of the scores corresponds to the relative performance of each student within the course.
2. These scores are converted to percentiles (or some other system of quantiles).
3. The percentile values are transformed to grades according to a division of the percentile scale into intervals, where the interval width of each grade indicates the desired relative frequency for that grade.

For example, if there are five grades in a particular university course, A, B, C, D, and F, where A is reserved for the top 20 % of students, B for the next 30 %, C for the next 30–40 %, and D or F for the remaining 10–20 %, then scores in the percentile interval from 0 % to 10–20 % will receive a grade of D or F, scores from 11–21 % to 50 % will receive a grade of C, scores from 51 % to 80 % receive a grade of B, and scores from 81 % to 100 % will achieve a grade of A.
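The three steps and the example above can be sketched in a few lines. This is a minimal illustration, not a description of any institution's actual procedure: the function name `curve_grades`, the student labels, and the scores are hypothetical, and the band boundaries pick one value (20 % for D/F, 30 % for C) from within the ranges the example allows.

```python
def curve_grades(scores, bands):
    """Assign grades by rank rather than by absolute score.

    `scores` maps student -> numeric score.
    `bands` lists (grade, upper percentile bound), lowest grade first.
    Ties are broken by input order here; a real scheme needs a tie rule.
    """
    ranked = sorted(scores, key=scores.get)  # step 1: order students by score
    n = len(ranked)
    grades = {}
    for position, student in enumerate(ranked, start=1):
        percentile = 100 * position / n      # step 2: rank -> percentile
        # step 3: first band whose upper bound covers this percentile
        grades[student] = next(g for g, upper in bands if percentile <= upper)
    return grades


# Assumed bands: bottom 20 % D/F, next 30 % C, next 30 % B, top 20 % A.
BANDS = [("D/F", 20), ("C", 50), ("B", 80), ("A", 100)]

scores = {
    "s01": 48, "s02": 52, "s03": 57, "s04": 61, "s05": 64,
    "s06": 68, "s07": 71, "s08": 75, "s09": 83, "s10": 90,
}

grades = curve_grades(scores, BANDS)
print(grades)

# Adding 10 points to every score changes nothing: only the order matters.
shifted = {student: score + 10 for student, score in scores.items()}
print(curve_grades(shifted, BANDS) == grades)  # True
```

With ten students, two receive D/F, three receive C, three receive B, and two receive A, whatever the raw scores happen to be.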

Consistent with the example illustrated above, a grading curve allows academic institutions to ensure the distribution of students across certain grade point average (GPA) thresholds. As many professors establish the curve to target a course average of a C, the corresponding grade point average equivalent would be a 2.0 on a standard 4.0 scale employed at most North American universities.^[1] Similarly, a grade point average of 3.0 on a 4.0 scale would indicate that the student is within the top 20 % of the class. Grading curves serve to attach additional significance to these figures, and the specific distribution employed may vary between academic institutions.^[13]

Advantages and limitations

The primary advantage of norm-referenced tests is that they can provide information on how an individual's performance on the test compares to others in the reference group.

A serious limitation of norm-referenced tests is that the reference group may not represent the current population of interest. As noted by the Oregon Research Institute's International Personality Item Pool website, "One should be very wary of using canned 'norms' because it isn't obvious that one could ever find a population of which one's present sample is a representative subset. Most 'norms' are misleading, and therefore they should not be used. Far more defensible are local norms, which one develops oneself. For example, if one wants to give feedback to members of a class of students, one should relate the score of each individual to the means and standard deviations derived from the class itself. To maximize informativeness, one can provide the students with the frequency distribution for each scale, based on these local norms, and the individuals can then find (and circle) their own scores on these relevant distributions."^[14]
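The local norms described in the quotation can be illustrated with a short sketch that relates each score to the mean and standard deviation of the class itself and tabulates the frequency distribution. The function name `local_norms` and the class scores are hypothetical, and using the population standard deviation (treating the class as the whole reference group) is an assumption rather than something the quoted source specifies.

```python
from collections import Counter
from statistics import mean, pstdev


def local_norms(class_scores):
    """Return (z_scores, frequency_distribution) based only on this class.

    Raises ZeroDivisionError if every score is identical (zero spread).
    """
    mu = mean(class_scores)
    sigma = pstdev(class_scores)  # population SD: the class is the reference group
    z_scores = [(x - mu) / sigma for x in class_scores]
    frequencies = Counter(class_scores)  # how many students got each score
    return z_scores, frequencies


# Hypothetical scores for a class of eight students.
class_scores = [2, 4, 4, 4, 5, 5, 7, 9]

z_scores, frequencies = local_norms(class_scores)
print(z_scores)      # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
for score in sorted(frequencies):
    print(score, "#" * frequencies[score])  # simple text histogram
```

A student who scored 9 can see from the z-score and the histogram that this is two standard deviations above the class mean, without reference to any external norm group.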

Norm-referencing does not ensure that a test is valid (i.e. that it measures the construct it is intended to measure).

Another disadvantage of norm-referenced tests is that they cannot measure progress of the population as a whole, only where individuals fall within the whole. To measure, for instance, the success of an educational reform program that seeks to raise the achievement of all students, one must instead measure against a fixed goal.

With a norm-referenced test, grade level was traditionally defined by the middle 50 percent of scores.^[15] By contrast, the National Children's Reading Foundation believes that it is essential to ensure that virtually all children read at or above grade level by third grade, a goal which cannot be achieved with a norm-referenced definition of grade level.^[16]

Norms do not automatically imply a standard. A norm-referenced test does not seek to enforce any expectation of what test takers should know or be able to do. It measures the test takers' current level by comparing the test takers to their peers. A rank-based system produces only data that tell which students perform at an average level, which students do better, and which students do worse. It does not identify which test takers are able to correctly perform the tasks at a level that would be acceptable for employment or further education.

The ultimate objective of grading curves is to minimize or eliminate the influence of variation between different instructors of the same course, ensuring that the students in any given class are assessed relative to their peers. This also circumvents problems associated with utilizing multiple versions of a particular examination, a method often employed where test administration dates vary between class sections. Regardless of any difference in the level of difficulty, real or perceived, the grading curve ensures a balanced distribution of academic results.

However, curved grading can increase competitiveness between students and affect their sense of the faculty's fairness in a class. Students are generally most upset when the curve lowers their grade compared to what they would have received had no curve been used. To prevent this, teachers who intend to use a grading curve usually make the test hard enough that the average student is expected to get a raw score below the score designated as the average on the curve, so that all students benefit from the curve. Curved grades therefore cannot be used blindly and must be weighed carefully against alternatives such as criterion-referenced grading. Furthermore, curved grading can be misused to routinely adjust grades on poorly designed tests, whereas assessments should be designed to accurately reflect the learning objectives set by the instructor.^[17]

References

[1] Roell, Kelly. "What is Grading on a Curve?". About.com. Retrieved November 13, 2013.
[2] Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York: Harper & Row.
[3] Glaser, R. (1963). "Instructional technology and the measurement of learning outcomes". American Psychologist. 18: 510–522. doi:10.1037/h0049294.
[4] Assessment
[5] "PDF presentation" (PDF). Archived from the original (PDF) on 2015-09-24. Retrieved 2006-07-21.
[6] stories 5-01.html (permanent dead link) Fairtest.org: Times on Testing "criterion referenced" tests measure students against a fixed yardstick, not against each other.
[7] "Illinois State Board of Education - Illinois Learning Standards". Archived from the original on 2010-04-14. Retrieved 2010-04-14.
[8] Glaser, R. (1963). "Instructional technology and the measurement of learning outcomes". American Psychologist. 18 (8): 519–522. doi:10.1037/h0049294.
[9] Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York: Harper & Row.
[10] Haertel, E. (1985). "Construct validity and criterion-referenced testing". Review of Educational Research. 55 (1): 23–46. doi:10.3102/00346543055001023. S2CID 145124784.
[11] "Domain-referenced test". APA Dictionary of Psychology. Washington, DC: American Psychological Association. n.d. Retrieved 2021-02-19.
[12] Denham, Carolyn H. (1975). "Criterion-Referenced, Domain-Referenced and Norm-Referenced Measurement: A Parallax View". Educational Technology. 15 (12): 9–13. ISSN 0013-1962. JSTOR 44418878.
[13] Volokh, Eugene (February 9, 2015). "In praise of grading on a curve". Washington Post. Retrieved 18 May 2017. Like democracy, grading on a curve may be the worst possible system — except for all the alternatives.
[14] Oregon Research Institute, IPIP website, http://ipip.ori.org/newNorms.htm
[15] NCTM: News & Media: Assessment Issues (Newsbulletin April 2004) "by definition, half of the nation's students are below grade level at any particular moment"
[16] National Children's Reading Foundation website. Archived 2007-03-11 at the Wayback Machine.
[17] Reese, Michael (May 13, 2013). "To Curve or Not to Curve". The Innovative Instructor Blog. Johns Hopkins University. Retrieved May 13, 2013.

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
