Norm-referenced test

A norm-referenced test (NRT) is a type of test, assessment, or evaluation that yields an estimate of the position of the tested individual in a predefined population, with respect to the trait being measured. The estimate is derived from the analysis of test scores and possibly other relevant data from a sample drawn from the population. That is, this type of test identifies whether the test taker performed better or worse than other test takers, not whether the test taker knows either more or less material than is necessary for a given purpose. Assigning scores on such tests may be described as relative grading, marking on a curve (BrE) or grading on a curve (AmE, CanE) (also referred to as curved grading, bell curving, or using grading curves). It is a method of assigning grades to the students in a class in such a way as to obtain or approach a pre-specified distribution of these grades having a specific mean and deviation, such as a normal distribution (also called Gaussian distribution). [1] The term "curve" refers to the bell curve, the graphical representation of the probability density of the normal distribution, but the method can be used to achieve any desired distribution of the grades, for example a uniform distribution. The term normative assessment is used when the reference population is the peers of the test taker.

Norm-referenced assessment can be contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-referenced assessment, the score shows whether or not test takers performed well or poorly on a given task, not how that compares to other test takers; in an ipsative system, test takers are compared to previous performance. Each method can be used to grade the same test paper. [2]
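
The point that one test paper can be graded in all three ways can be illustrated with a minimal Python sketch. The function names, the made-up peer scores, the cutoff of 80, and the previous score of 65 are all hypothetical choices for illustration, not part of any standard scoring procedure.

```python
from bisect import bisect_left

def norm_referenced(score, peer_scores):
    """Percentage of peers who scored strictly below `score`."""
    ranked = sorted(peer_scores)
    return 100.0 * bisect_left(ranked, score) / len(ranked)

def criterion_referenced(score, cutoff):
    """True if the score meets a fixed standard, regardless of peers."""
    return score >= cutoff

def ipsative(score, previous_score):
    """Change relative to the same test taker's earlier result."""
    return score - previous_score

# Made-up data: one raw score read three different ways.
peers = [52, 60, 61, 67, 70, 74, 78, 81, 88, 93]
score = 74
print(norm_referenced(score, peers))    # 50.0  (better than half the class)
print(criterion_referenced(score, 80))  # False (below the assumed cutoff)
print(ipsative(score, 65))              # 9     (improved on own earlier score)
```

The same raw score of 74 is thus "average" against peers, "not yet passing" against the assumed standard, and "improving" against the test taker's own history.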

Robert Glaser originally coined the terms norm-referenced test and criterion-referenced test. [3]

Common uses

Many college entrance exams and nationally used school tests use norm-referenced tests. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale for Children (WISC) compare individual student performance to the performance of a normative sample. Test takers cannot "fail" a norm-referenced test, as each test taker receives a score that compares the individual to others that have taken the test, usually given by a percentile. This is useful when there is a wide range of acceptable scores, and the goal is to find out who performs better.

IQ tests are norm-referenced tests, because their goal is to rank test takers' intelligence. The median IQ is set to 100, and all test takers are ranked up or down in comparison to that level.

Other types

As alternatives to normative testing, tests can be ipsative assessments or criterion-referenced assessments.

Ipsative

In an ipsative assessment, an individual's performance is compared only to that individual's previous performances. [4] [5] For example, a person on a weight-loss diet is judged by how his current weight compares to his own previous weight, rather than how his weight compares to an ideal or how it compares to another person.

Criterion-referenced

A test is criterion-referenced when the performance is judged according to the expected or desired behavior. Tests that judge the test taker based on a set standard (e.g., everyone should be able to run one kilometre in less than five minutes) are criterion-referenced tests. The goal of a criterion-referenced test is to find out whether the individual can run as fast as the test giver wants, not to find out whether the individual is faster or slower than the other runners. Standards-based education reform focuses on criterion-referenced testing. [6] [7] Most everyday tests and quizzes taken in school, as well as most state achievement tests and high school graduation examinations, are criterion-referenced. In this model, it is possible for all test takers to pass or for all test takers to fail.
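
The running example can be written as a short Python sketch. The five-minute limit comes from the example above; the function names and the two groups of finishing times are made up for illustration.

```python
def meets_standard(time_seconds, limit_seconds=5 * 60):
    """Criterion: one kilometre in less than five minutes."""
    return time_seconds < limit_seconds

def pass_rate(times):
    """Fraction of runners who meet the fixed standard."""
    return sum(meets_standard(t) for t in times) / len(times)

fast_group = [240, 265, 290]   # hypothetical times in seconds
slow_group = [310, 330, 400]
print(pass_rate(fast_group))   # 1.0: everyone passes
print(pass_rate(slow_group))   # 0.0: everyone fails
```

Because each runner is compared only with the fixed limit, the pass rate is not constrained by how the other runners perform.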

Methods

One method of grading on a curve uses three steps:

  1. Numeric scores (or possibly scores on a sufficiently fine-grained ordinal scale) are assigned to the students. The absolute values are less relevant, provided that the order of the scores corresponds to the relative performance of each student within the course.
  2. These scores are converted to percentiles (or some other system of quantiles).
  3. The percentile values are transformed to grades according to a division of the percentile scale into intervals, where the interval width of each grade indicates the desired relative frequency for that grade.

For example, suppose a university course uses five grades, A, B, C, D, and F, with A reserved for the top 20 % of students, B for the next 30 %, C for the next 30–40 %, and D or F for the remaining 10–20 %. Scores in the percentile interval from 0 % up to 10–20 % then receive a grade of D or F, scores from 11–21 % to 50 % receive a C, scores from 51 % to 80 % receive a B, and scores from 81 % to 100 % receive an A.
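
The three steps can be sketched in Python as follows. This is a minimal illustration rather than a standard algorithm: the function name and student names are hypothetical, the bottom band is fixed at 20 % and lumped together as "D/F" (one choice within the 10–20 % range above), and tied scores are simply split by sort order, which a real grading policy would need to handle deliberately.

```python
def curve_grades(scores, bands=(("D/F", 20), ("C", 50), ("B", 80), ("A", 100))):
    """Assign grades by rank.

    `scores` maps student name -> numeric score.
    `bands` lists (grade, upper percentile bound) in ascending order.
    """
    n = len(scores)
    ordered = sorted(scores, key=scores.get)  # names, lowest score first
    grades = {}
    for position, name in enumerate(ordered, start=1):
        percentile = 100.0 * position / n     # step 2: score -> percentile
        for grade, upper in bands:            # step 3: percentile -> grade
            if percentile <= upper:
                grades[name] = grade
                break
    return grades

# Step 1: made-up numeric scores for ten students.
scores = {"s01": 41, "s02": 48, "s03": 55, "s04": 59, "s05": 63,
          "s06": 68, "s07": 72, "s08": 77, "s09": 85, "s10": 92}
print(curve_grades(scores))
# The two lowest scores get D/F, the next three C, the next three B, the top two A.
```

Only the order of the scores matters here: adding the same number of points to every score leaves every grade unchanged.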

Consistent with the example illustrated above, a grading curve allows academic institutions to ensure the distribution of students across certain grade point average (GPA) thresholds. As many professors establish the curve to target a course average of a C, the corresponding grade point average equivalent would be a 2.0 on a standard 4.0 scale employed at most North American universities. [1] Similarly, a grade point average of 3.0 on a 4.0 scale would indicate that the student is within the top 20 % of the class. Grading curves serve to attach additional significance to these figures, and the specific distribution employed may vary between academic institutions. [8]
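
The link between a grade distribution and the course average can be checked with a short calculation. The sketch below assumes the usual 4.0 scale (A = 4, B = 3, C = 2, D = 1, F = 0); both distributions are hypothetical, and the second splits the "D or F" band of the earlier example evenly, which is an assumption.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def mean_gpa(percent_by_grade):
    """Course average on a 4.0 scale for a given grade distribution (in percent)."""
    if sum(percent_by_grade.values()) != 100:
        raise ValueError("percentages must sum to 100")
    weighted = sum(GRADE_POINTS[g] * pct for g, pct in percent_by_grade.items())
    return weighted / 100

# A symmetric curve centered on C averages exactly 2.0.
print(mean_gpa({"A": 10, "B": 20, "C": 40, "D": 20, "F": 10}))  # 2.0
# The 20/30/30/20 split, with D and F assumed at 10 % each, averages higher.
print(mean_gpa({"A": 20, "B": 30, "C": 30, "D": 10, "F": 10}))  # 2.4
```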

Advantages and limitations

The primary advantage of norm-referenced tests is that they can provide information on how an individual's performance on the test compares to others in the reference group.

A serious limitation of norm-referenced tests is that the reference group may not represent the current population of interest. As noted by the Oregon Research Institute's International Personality Item Pool website, "One should be very wary of using canned 'norms' because it isn't obvious that one could ever find a population of which one's present sample is a representative subset. Most 'norms' are misleading, and therefore they should not be used. Far more defensible are local norms, which one develops oneself. For example, if one wants to give feedback to members of a class of students, one should relate the score of each individual to the means and standard deviations derived from the class itself. To maximize informativeness, one can provide the students with the frequency distribution for each scale, based on these local norms, and the individuals can then find (and circle) their own scores on these relevant distributions." [9]
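
The local-norms procedure described in the quotation can be sketched in Python. The function name and class scores are made up, and the use of the population standard deviation (treating the class as the entire reference group) is an assumption; the quotation does not say which formula to use.

```python
from collections import Counter
from statistics import mean, pstdev

def local_norms(scores):
    """Return class mean, class SD, each score's z-score, and the frequency table."""
    m = mean(scores)
    s = pstdev(scores)  # population SD: the class itself is the reference group
    if s == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    z_scores = [(x - m) / s for x in scores]
    frequencies = Counter(scores)  # score -> number of students with that score
    return m, s, z_scores, frequencies

class_scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scale scores
m, s, z, freq = local_norms(class_scores)
print(m, s)                  # class mean 5 and SD 2.0
print(z[-1])                 # 2.0: the top score is two SDs above the class mean
print(sorted(freq.items()))  # frequency distribution students can locate themselves on
```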

Norm-referencing does not ensure that a test is valid (i.e. that it measures the construct it is intended to measure).

Another disadvantage of norm-referenced tests is that they cannot measure the progress of the population as a whole, only where individuals fall within it. To measure the success of an educational reform program that seeks to raise the achievement of all students, for instance, one must instead measure against a fixed goal.
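
A small Python sketch makes the limitation concrete. The scores, the uniform 10-point gain, and the fixed goal of 70 are all hypothetical.

```python
def percentile_ranks(scores):
    """Percentage of the group scoring strictly below each score."""
    n = len(scores)
    return [100.0 * sum(other < s for other in scores) / n for s in scores]

def share_meeting_goal(scores, goal):
    """Fraction of the group at or above a fixed goal."""
    return sum(s >= goal for s in scores) / len(scores)

before = [55, 62, 70, 81]            # made-up scores before a reform
after = [s + 10 for s in before]     # every student gains 10 points

# Norm-referenced view: nothing appears to have changed.
print(percentile_ranks(before) == percentile_ranks(after))  # True

# Fixed-goal view: the improvement is visible.
print(share_meeting_goal(before, 70))  # 0.5
print(share_meeting_goal(after, 70))   # 0.75
```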

With a norm-referenced test, grade level was traditionally defined by the middle 50 percent of scores. [10] By contrast, the National Children's Reading Foundation believes that it is essential to ensure that virtually all children read at or above grade level by third grade, a goal which cannot be achieved with a norm-referenced definition of grade level. [11]

Norms do not automatically imply a standard. A norm-referenced test does not seek to enforce any expectation of what test takers should know or be able to do. It measures the test takers' current level by comparing the test takers to their peers. A rank-based system produces only data that tell which students perform at an average level, which students do better, and which students do worse. It does not identify which test takers are able to correctly perform the tasks at a level that would be acceptable for employment or further education.

The ultimate objective of grading curves is to minimize or eliminate the influence of variation between different instructors of the same course, ensuring that the students in any given class are assessed relative to their peers. This also circumvents problems associated with utilizing multiple versions of a particular examination, a method often employed where test administration dates vary between class sections. Regardless of any difference in the level of difficulty, real or perceived, the grading curve ensures a balanced distribution of academic results.

However, curved grading can increase competitiveness between students and affect their sense of faculty fairness in a class. Students are generally most upset when the curve lowers their grade compared to what they would have received had no curve been used. To prevent this, teachers who intend to use a grading curve usually try to make the test hard enough that the average student is expected to earn a raw score below the score assigned to the average on the curve, so that all students benefit from the curve. Curved grades therefore cannot be used blindly and must be weighed carefully against alternatives such as criterion-referenced grading. Furthermore, curved grading can be misused to adjust grades on poorly designed tests, whereas assessments should be designed to accurately reflect the learning objectives set by the instructor. [12]
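
Whether a curve helps or hurts depends on how hard the test was, which the following Python sketch illustrates. Both grading schemes are hypothetical: fixed cutoffs at 90/80/70/60, and a curve giving A to the top 20 %, B to the next 30 %, C to the next 30 %, and D to the rest.

```python
GRADE_ORDER = "FDCBA"  # worst to best, used to compare two letter grades

def fixed_grade(score):
    """Criterion-style grade from assumed 90/80/70/60 cutoffs."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return grade
    return "F"

def curved_grade(score, all_scores):
    """Rank-based grade under the assumed 20/30/30/20 curve."""
    share_at_or_below = sum(s <= score for s in all_scores) / len(all_scores)
    if share_at_or_below > 0.8:
        return "A"
    if share_at_or_below > 0.5:
        return "B"
    if share_at_or_below > 0.2:
        return "C"
    return "D"

def lowered_by_curve(scores):
    """Scores whose curved grade is worse than their fixed-cutoff grade."""
    return [s for s in scores
            if GRADE_ORDER.index(curved_grade(s, scores))
            < GRADE_ORDER.index(fixed_grade(s))]

easy_exam = [95, 94, 93, 92, 91, 90, 89, 88, 87, 86]  # made-up raw scores
hard_exam = [55, 50, 45, 40, 35, 30, 25, 20, 15, 10]
print(lowered_by_curve(easy_exam))  # [93, 92, 91, 90, 89, 88, 87, 86]
print(lowered_by_curve(hard_exam))  # []
```

On the easy exam most students would have earned an A or B under fixed cutoffs, so the curve pulls eight of the ten down; on the hard exam every student would have failed outright, so nobody is worse off under the curve.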

References

  1. Roell, Kelly. "What is Grading on a Curve?". About.com. Retrieved November 13, 2013.
  2. Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York: Harper & Row.
  3. Glaser, R. (1963). "Instructional technology and the measurement of learning outcomes". American Psychologist. 18: 510–522. doi:10.1037/h0049294.
  4. Assessment
  5. "PDF presentation" (PDF). Archived from the original (PDF) on 2015-09-24. Retrieved 2006-07-21.
  6. stories 5-01.html. Fairtest.org: Times on Testing. "'criterion referenced' tests measure students against a fixed yardstick, not against each other."
  7. "Archived copy". Illinois Learning Standards. Archived from the original on 2010-04-14. Retrieved 2010-04-14.
  8. Volokh, Eugene (February 9, 2015). "In praise of grading on a curve". Washington Post. Retrieved 18 May 2017. Like democracy, grading on a curve may be the worst possible system — except for all the alternatives.
  9. Oregon Research Institute, IPIP website, http://ipip.ori.org/newNorms.htm
  10. NCTM: News & Media: Assessment Issues (Newsbulletin April 2004) "by definition, half of the nation's students are below grade level at any particular moment"
  11. National Children's Reading Foundation website. Archived 2007-03-11 at the Wayback Machine.
  12. Reese, Michael (May 13, 2013). "To Curve or Not to Curve". The Innovative Instructor Blog. Johns Hopkins University. Retrieved May 13, 2013.