Test score

A test score is a piece of information, usually a number, that conveys the performance of an examinee on a test. One formal definition is that it is "a summary of the evidence contained in an examinee's responses to the items of a test that are related to the construct or constructs being measured." [1]

Test scores are interpreted with a norm-referenced or criterion-referenced interpretation, or occasionally both. A norm-referenced interpretation means that the score conveys meaning about the examinee with regard to their standing among other examinees. A criterion-referenced interpretation means that the score conveys information about the examinee with regard to a specific subject matter, regardless of other examinees' scores. [2]
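
The distinction can be shown numerically. A minimal sketch in Python, with an invented set of class scores and an invented 70% mastery cutoff:

```python
# Norm-referenced vs. criterion-referenced interpretation of one score.
scores = [52, 61, 64, 70, 73, 78, 81, 86, 90, 95]  # hypothetical class, percent correct
examinee = 73

# Norm-referenced: standing relative to the other examinees (percentile rank).
percentile = 100 * sum(s < examinee for s in scores) / len(scores)
print(f"Percentile rank: {percentile:.0f}")  # 40 -- above 40% of this group

# Criterion-referenced: comparison to a fixed standard, regardless of the group.
cutoff = 70  # hypothetical mastery cutoff
print("Mastery" if examinee >= cutoff else "Non-mastery")  # Mastery
```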

Types

There are two types of test scores: raw scores and scaled scores. A raw score is a score without any sort of adjustment or transformation, such as the simple number of questions answered correctly. A scaled score is the result of some transformation(s) applied to the raw score, such as in relative grading.

The purpose of scaled scores is to report scores for all examinees on a consistent scale. Suppose that a test has two forms, and one is more difficult than the other. Equating has determined that a score of 65% on form 1 is equivalent to a score of 68% on form 2. Scores on both forms can be converted to a common scale so that these two equivalent scores receive the same reported score. For example, both could be reported as 350 on a scale of 100 to 500.
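
In code, such a conversion is just a per-form mapping from raw scores onto the common reported scale. A minimal sketch, assuming linear interpolation between equated anchor points (the anchor values below are invented, apart from the 65%/68%/350 example above):

```python
import numpy as np

# Equating has established that certain raw percent-correct scores on each
# form correspond to the same points on the common 100-500 reported scale.
anchors = {
    "form1": ([0, 65, 100], [100, 350, 500]),  # 65% on form 1 -> 350
    "form2": ([0, 68, 100], [100, 350, 500]),  # 68% on form 2 -> 350
}

def scaled_score(form: str, raw_percent: float) -> float:
    """Convert a raw percent-correct score to the common reported scale."""
    raw_points, scale_points = anchors[form]
    return float(np.interp(raw_percent, raw_points, scale_points))

print(scaled_score("form1", 65.0))  # 350.0
print(scaled_score("form2", 68.0))  # 350.0 -- the same reported score
```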

Two well-known tests in the United States that have scaled scores are the ACT and the SAT. The ACT's scale ranges from 1 to 36 and the SAT's from 200 to 800 (per section). Ostensibly, these scales were selected to represent a mean and standard deviation of 18 and 6 (ACT), and 500 and 100 (SAT). The upper and lower bounds were chosen because an interval of plus or minus three standard deviations contains more than 99% of a population. Scores outside that range are difficult to measure and provide little practical value.
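
A sketch of that kind of linear scaling for the SAT-style case, assuming invented raw-score statistics: standardize the raw score, place it on a scale with mean 500 and standard deviation 100, and clip at the 200-800 bounds.

```python
# Place raw scores on a reporting scale with a chosen mean and standard
# deviation, truncating at roughly +/- 3 SD as described above.
raw_mean, raw_sd = 31.4, 8.2      # hypothetical raw-score mean and SD
scale_mean, scale_sd = 500, 100   # SAT-style section scale
lo, hi = 200, 800                 # mean +/- 3 standard deviations

def to_scale(raw: float) -> int:
    z = (raw - raw_mean) / raw_sd               # standardize the raw score
    scaled = scale_mean + scale_sd * z          # move to the reporting scale
    clipped = min(max(scaled, lo), hi)          # enforce the scale bounds
    return round(clipped / 10) * 10             # report in 10-point steps

print(to_scale(31.4))  # 500 (exactly average)
print(to_scale(58.0))  # 800 (more than 3 SD above the mean, clipped)
```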

Note that scaling does not affect the psychometric properties of a test; it is something that occurs after the assessment process (and equating, if present) is completed. Therefore, it is not an issue of psychometrics, per se, but an issue of interpretability.

Scoring information loss

A test question might require a student to calculate the area of a triangle. Compare the information provided in these two answers (each originally accompanied by a figure of a simple triangle with its height marked):

First answer:

Area = 7.5 cm²

Second answer:

Base = 5 cm; Height = 3 cm
Area = 1/2(Base × Height)
     = 1/2(5 cm × 3 cm)
     = 7.5 cm²

The first answer shows scoring information loss. The teacher knows whether the student got the right answer, but does not know how the student arrived at it. If the answer is wrong, the teacher does not know whether the student was guessing, made a simple error, or fundamentally misunderstands the subject.

When tests are scored right-wrong, an important assumption has been made about learning. The number of right answers or the sum of item scores (where partial credit is given) is assumed to be the appropriate and sufficient measure of current performance status. In addition, a secondary assumption is made that there is no meaningful information in the wrong answers.
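
The assumption is easy to state in code. A minimal sketch of conventional right-wrong scoring, with an invented answer key and response string; every response is collapsed to 0 or 1 before anything else happens:

```python
# Conventional right-wrong scoring: each response is reduced to 0/1 and summed.
key       = ["B", "D", "A", "C", "B"]   # hypothetical answer key
responses = ["B", "A", "A", "C", "D"]   # one student's hypothetical answers

item_scores = [int(r == k) for r, k in zip(responses, key)]
raw_score = sum(item_scores)
print(item_scores, raw_score)  # [1, 0, 1, 1, 0] 3
# Only the 3 survives; that the student chose "A" on item 2 and "D" on
# item 5 plays no further part in the analysis.
```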

In the first place, a correct answer can be achieved using memorization without any profound understanding of the underlying content or conceptual structure of the problem posed. Further, when more than one step is required for a solution, there are often a variety of approaches to answering that will lead to a correct result. The fact that the answer is correct does not indicate which of the several possible procedures was used. When the student supplies the answer (or shows the work), this information is readily available from the original documents.

Second, if the wrong answers were blind guesses, there would be no information to be found among them. If, on the other hand, wrong answers reflect departures in interpretation from the expected one, these answers should show an ordered relationship to whatever the overall test is measuring. This departure should depend on the level of psycholinguistic maturity of the student choosing or giving the answer in the vernacular in which the test is written.

In this second case, it should be possible to extract this ordering from the responses to the test items. [3] Such extraction processes, the Rasch model for instance, are standard practice in item development among professionals. However, because the wrong answers are discarded during the scoring process, these answers are seldom analyzed for the information they might contain.
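
As an illustration of the standard analysis referred to here (not of the RSE procedure discussed below), the following is a minimal, statistically naive sketch of fitting the dichotomous Rasch model by joint maximum likelihood to an invented right-wrong response matrix:

```python
import numpy as np

# Dichotomous Rasch model: P(correct) = 1 / (1 + exp(-(theta_person - b_item))).
rng = np.random.default_rng(0)
true_theta = rng.normal(0.0, 1.0, size=200)             # invented person abilities
true_b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])          # invented item difficulties
p_true = 1 / (1 + np.exp(-(true_theta[:, None] - true_b)))
X = (rng.random(p_true.shape) < p_true).astype(float)   # simulated 0/1 responses

# Persons with all-right or all-wrong scores have no finite estimate; drop them.
keep = (X.sum(axis=1) > 0) & (X.sum(axis=1) < X.shape[1])
X = X[keep]

theta = np.zeros(X.shape[0])   # person ability estimates
b = np.zeros(X.shape[1])       # item difficulty estimates
for _ in range(100):           # alternating Newton-Raphson updates
    p = 1 / (1 + np.exp(-(theta[:, None] - b)))
    theta += (X - p).sum(axis=1) / (p * (1 - p)).sum(axis=1)
    p = 1 / (1 + np.exp(-(theta[:, None] - b)))
    b -= (X - p).sum(axis=0) / (p * (1 - p)).sum(axis=0)
    b -= b.mean()              # fix the scale by centering item difficulties

print(np.round(b, 2))          # recovers the ordering of true_b
```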

Third, although topic-based subtest scores are sometimes provided, the more common practice is to report the total score or a rescaled version of it. This rescaling is intended to compare these scores to a standard of some sort. This further collapse of the test results systematically removes all the information about which particular items were missed.

Thus, scoring a test right–wrong loses (1) how students achieved their correct answers, (2) what led them astray toward unacceptable answers, and (3) where within the body of the test these departures from expectation occurred.
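
What retaining the discarded information looks like can be sketched with a simple distractor tally (this is ordinary distractor analysis, not the RSE procedure described below); the key and class responses are invented:

```python
from collections import Counter

# Keep the identity of each wrong answer instead of discarding it.
key = ["B", "D", "A", "C", "B"]        # hypothetical answer key
class_responses = [                    # three students' hypothetical answers
    ["B", "A", "A", "C", "D"],
    ["B", "A", "B", "C", "B"],
    ["A", "A", "A", "C", "B"],
]

wrong = Counter(
    (item, choice)
    for resp in class_responses
    for item, (choice, correct) in enumerate(zip(resp, key), start=1)
    if choice != correct
)
for (item, choice), n in wrong.most_common():
    print(f"item {item}: distractor {choice!r} chosen {n} time(s)")
# Every student chose "A" on item 2 -- a shared misinterpretation that a
# total score, and even a subtest score, would hide.
```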

This commentary suggests that the current scoring procedure conceals the dynamics of the test-taking process and obscures the capabilities of the students being assessed. Current scoring practice oversimplifies these data in the initial scoring step; the result of this procedural error is to obscure diagnostic information that could help teachers serve their students better. It also prevents those who diligently prepare these tests from seeing the information that would otherwise have alerted them to the presence of this error.

A solution to this problem, known as Response Spectrum Evaluation (RSE), [4] is currently being developed that appears to be capable of recovering all three of these forms of information loss, while still providing a numerical scale to establish current performance status and to track performance change.

This RSE approach provides an interpretation of every answer, whether right or wrong, that indicates the likely thought processes used by the test taker. [5] Among other findings, the cited chapter reports that the recoverable information explains between two and three times more of the test variability than considering the right answers alone. This large loss of information occurs because the "wrong" answers are removed from the data during the scoring process and are therefore no longer available to reveal the procedural error inherent in right-wrong scoring. The procedure also bypasses the limitations produced by the linear dependencies inherent in test data.

Related Research Articles

Psychometrics

Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally refers to specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.

The Japanese-Language Proficiency Test, or JLPT, is a standardized criterion-referenced test to evaluate and certify Japanese language proficiency for non-native speakers, covering language knowledge, reading ability, and listening ability. The test is held twice a year in Japan and selected countries, and once a year in other regions. The JLPT is conducted by the Japan Foundation for tests overseas, and Japan Educational Exchanges and Services for tests in Japan.

Graduate Management Admission Test

The Graduate Management Admission Test is a computer adaptive test (CAT) intended to assess certain analytical, writing, quantitative, verbal, and reading skills in written English for use in admission to a graduate management program, such as a Master of Business Administration (MBA) program. Answering the test questions requires knowledge of English grammatical rules, reading comprehension, and mathematical skills such as arithmetic, algebra, and geometry. The Graduate Management Admission Council (GMAC) owns and operates the test, and states that the GMAT assesses analytical writing and problem-solving abilities while also addressing data sufficiency, logic, and critical reasoning skills that it believes to be vital to real-world business and management success. It can be taken up to five times a year but no more than eight times total. Attempts must be at least 16 days apart.

Cronbach's alpha, also known as tau-equivalent reliability or coefficient alpha, is a reliability coefficient that provides a method of measuring the internal consistency of tests and measures. Numerous studies warn against using it unconditionally, and note that reliability coefficients based on structural equation modeling (SEM) or generalizability theory are in many cases a suitable alternative.

Law School Admission Test

The Law School Admission Test is a standardized test administered by the Law School Admission Council (LSAC) for prospective law school candidates. It is designed to assess reading comprehension as well as logical and verbal reasoning proficiency. The test is an integral part of the law school admission process in the United States, Canada, the University of Melbourne, Australia, and a growing number of other countries.

Classical test theory (CTT) is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers. It is a theory of testing based on the idea that a person's observed or obtained score on a test is the sum of a true score (error-free score) and an error score. Generally speaking, the aim of classical test theory is to understand and improve the reliability of psychological tests.

In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.

The Minnesota Multiphasic Personality Inventory (MMPI) is a standardized psychometric test of adult personality and psychopathology. Psychologists and other mental health professionals use various versions of the MMPI to help develop treatment plans, assist with differential diagnosis, help answer legal questions, screen job candidates during the personnel selection process, or as part of a therapeutic assessment procedure.

Likert scale

A Likert scale is a psychometric scale commonly involved in research that employs questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, although there are other types of rating scales.

Thematic apperception test (TAT) is a projective psychological test developed during the 1930s by Henry A. Murray and Christiana D. Morgan at Harvard University. Proponents of the technique assert that subjects' responses, in the narratives they make up about ambiguous pictures of people, reveal their underlying motives, concerns, and the way they see the social world. Historically, the test has been among the most widely researched, taught, and used of such techniques.

Multiple choice

Multiple choice (MC), objective response, or MCQ is a form of objective assessment in which respondents are asked to select only the correct answers from the choices offered as a list. The multiple choice format is most frequently used in educational testing, in market research, and in elections, when a person chooses between multiple candidates, parties, or policies.

In psychology, a projective test is a personality test designed to let a person respond to ambiguous stimuli, presumably revealing hidden emotions and internal conflicts projected by the person into the test. This is sometimes contrasted with a so-called "objective test" / "self-report test", which adopts a "structured" approach, as responses are analyzed according to a presumed universal standard and are limited to the content of the test. The responses to projective tests are content-analyzed for meaning rather than being based on presuppositions about meaning, as is the case with objective tests. Projective tests have their origins in psychoanalysis, which argues that humans have conscious and unconscious attitudes and motivations that are beyond or hidden from conscious awareness.

Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level. For this reason, it has also been called tailored testing. In other words, it is a form of computer-administered test in which the next item or set of items selected to be administered depends on the correctness of the test taker's responses to the most recent items administered.

The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between the respondent's abilities, attitudes, or personality traits and the item difficulty. For example, it may be used to estimate a student's reading ability or the extremity of a person's attitude to capital punishment from responses on a questionnaire. In addition to psychometrics and educational research, the Rasch model and its extensions are used in other areas, including the health professions, agriculture, and market research.

A criterion-referenced test is a style of test which uses test scores to generate a statement about the behavior that can be expected of a person with that score. Most tests and quizzes that are written by school teachers can be considered criterion-referenced tests. In this case, the objective is simply to see whether the student has learned the material. Criterion-referenced assessment can be contrasted with norm-referenced assessment and ipsative assessment.

A computerized classification test (CCT) refers to, as its name would suggest, a test that is administered by computer for the purpose of classifying examinees. The most common CCT is a mastery test where the test classifies examinees as "Pass" or "Fail," but the term also includes tests that classify examinees into more than two categories. While the term may generally be considered to refer to all computer-administered tests for classification, it is usually used to refer to tests that are interactively administered or of variable-length, similar to computerized adaptive testing (CAT). Like CAT, variable-length CCTs can accomplish the goal of the test with a fraction of the number of items used in a conventional fixed-form test.

The Millon Clinical Multiaxial Inventory – Fourth Edition (MCMI-IV) is the most recent edition of the Millon Clinical Multiaxial Inventory. The MCMI is a psychological assessment tool intended to provide information on personality traits and psychopathology, including specific mental disorders outlined in the DSM-5. It is intended for adults with at least a 5th grade reading level who are currently seeking mental health services. The MCMI was developed and standardized specifically on clinical populations, and the authors are very specific that it should not be used with the general population or adolescents. However, there is an evidence base showing that it may still retain validity in non-clinical populations, and so psychologists will sometimes administer the test to members of the general population, with caution. The concepts involved in the questions and their presentation make it unsuitable for those with below average intelligence or reading ability.

A standard-setting study is an official research study conducted by an organization that sponsors tests to determine a cutscore for the test. To be legally defensible in the US, in particular for high-stakes assessments, and to meet the Standards for Educational and Psychological Testing, a cutscore cannot be arbitrarily determined; it must be empirically justified. For example, the organization cannot merely decide that the cutscore will be 70% correct. Instead, a study is conducted to determine what score best differentiates the classifications of examinees, such as competent vs. incompetent. Such studies require substantial resources, involving a number of professionals, in particular those with a psychometric background. Standard-setting studies are for that reason impractical for regular classroom situations, yet standard setting is performed at every level of education, and multiple methods exist.

The attribute hierarchy method (AHM) is a cognitively based psychometric procedure developed by Jacqueline Leighton, Mark Gierl, and Steve Hunka at the Centre for Research in Applied Measurement and Evaluation (CRAME) at the University of Alberta. The AHM is one form of cognitive diagnostic assessment that aims to integrate cognitive psychology with educational measurement for the purposes of enhancing instruction and student learning. A cognitive diagnostic assessment (CDA) is designed to measure specific knowledge states and cognitive processing skills in a given domain. The results of a CDA yield a profile of scores with detailed information about a student's cognitive strengths and weaknesses. This cognitive diagnostic feedback has the potential to guide instructors, parents and students in their teaching and learning processes.

Psychometric software is software that is used for psychometric analysis of data from tests, questionnaires, or inventories reflecting latent psychoeducational variables. While some psychometric analyses can be performed with standard statistical software like SPSS, most analyses require specialized tools.

References

  1. Thissen, D., & Wainer, H. (2001). Test Scoring. Mahwah, NJ: Erlbaum. Page 1, sentence 1.
  2. Iowa Testing Programs guide for interpreting test scores Archived 2008-02-12 at the Wayback Machine
  3. Powell, J. C., & Shklov, N. (1992). Educational and Psychological Measurement, 52, 847–865.
  4. "Welcome to the Frontpage". Archived from the original on 30 April 2015. Retrieved 2 May 2015.
  5. Powell, Jay C. (2010). Testing as Feedback to Inform Teaching. Chapter 3 in: Learning and Instruction in the Digital Age, Part 1: Cognitive Approaches to Learning and Instruction (J. Michael Spector, Dirk Ifenthaler, Pedro Isaias, Kinshuk, & Demetrios Sampson, Eds.). New York: Springer. ISBN 978-1-4419-1551-1, doi:10.1007/978-1-4419-1551-1.