Educational measurement

Last updated

Educational measurement refers to the use of educational assessments and the analysis of data such as scores obtained from educational assessments to infer the abilities and proficiencies of students. The approaches overlap with those in psychometrics. Educational measurement is the assigning of numerals to traits such as achievement, interest, attitudes, aptitudes, intelligence and performance.

Overview

The aim of theory and practice in educational measurement is typically to measure abilities and levels of attainment by students in areas such as reading, writing, mathematics, science and so forth. Traditionally, attention focuses on whether assessments are reliable and valid. In practice, educational measurement is largely concerned with the analysis of data from educational assessments or tests. Typically, this means using total scores on assessments, whether they are multiple choice or open-ended and marked using marking rubrics or guides.

In technical terms, the pattern of scores by individual students to individual items is used to infer so-called scale locations of students, the "measurements". This process is one form of scaling. Essentially, higher total scores give higher scale locations, consistent with the traditional and everyday use of total scores. [1] If certain theory is used, though, there is not a strict correspondence between the ordering of total scores and the ordering of scale locations. The Rasch model provides a strict correspondence provided all students attempt the same test items, or their performances are marked using the same marking rubrics.

In terms of the broad body of purely mathematical theory drawn on, there is substantial overlap between educational measurement and psychometrics. However, certain approaches considered to be a part of psychometrics, including Classical test theory, Item Response Theory and the Rasch model, were originally developed more specifically for the analysis of data from educational assessments. [2] [3]

One of the aims of applying theory and techniques in educational measurement is to try to place the results of different tests administered to different groups of students on a single or common scale through processes known as test equating. The rationale is that because different assessments usually have different difficulties, the total scores cannot be directly compared. The aim of trying to place results on a common scale is to allow comparison of the scale locations inferred from the totals via scaling processes.[ citation needed ]

Related Research Articles

Psychological statistics is application of formulas, theorems, numbers and laws to psychology. Statistical methods for psychology include development and application statistical theory and methods for modeling psychological data. These methods include psychometrics, factor analysis, experimental designs, and Bayesian statistics. The article also discusses journals in the same field.

Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.

In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.

<span class="mw-page-title-main">Educational Testing Service</span> Educational testing and assessment organization

Educational Testing Service (ETS), founded in 1947, is the world's largest private educational testing and assessment organization. It is headquartered in Lawrence Township, New Jersey, but has a Princeton address.

In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.

Louis Leon Thurstone was an American pioneer in the fields of psychometrics and psychophysics. He conceived the approach to measurement known as the law of comparative judgment, and is well known for his contributions to factor analysis. A Review of General Psychology survey, published in 2002, ranked Thurstone as the 88th most cited psychologist of the 20th century, tied with John Garcia, James J. Gibson, David Rumelhart, Margaret Floy Washburn, and Robert S. Woodworth.

<span class="mw-page-title-main">Likert scale</span> Psychometric measurement scale

A Likert scale is a psychometric scale named after its inventor, American social psychologist Rensis Likert, which is commonly used in research questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, although there are other types of rating scales.

Educational assessment or educational evaluation is the systematic process of documenting and using empirical data on the knowledge, skill, attitudes, aptitude and beliefs to refine programs and improve student learning. Assessment data can be obtained by examining student work directly to assess the achievement of learning outcomes or it is based on data from which one can make inferences about learning. Assessment is often used interchangeably with test but is not limited to tests. Assessment can focus on the individual learner, the learning community, a course, an academic program, the institution, or the educational system as a whole. The word "assessment" came into use in an educational context after the Second World War.

The Rasch model, named after Georg Rasch, is a psychometric model for analyzing categorical data, such as answers to questions on a reading assessment or questionnaire responses, as a function of the trade-off between the respondent's abilities, attitudes, or personality traits, and the item difficulty. For example, they may be used to estimate a student's reading ability or the extremity of a person's attitude to capital punishment from responses on a questionnaire. In addition to psychometrics and educational research, the Rasch model and its extensions are used in other areas, including the health profession, agriculture, and market research.

The polytomous Rasch model is generalization of the dichotomous Rasch model. It is a measurement model that has potential application in any context in which the objective is to measure a trait or ability through a process in which responses to items are scored with successive integers. For example, the model is applicable to the use of Likert scales, rating scales, and to educational assessment items for which successively higher integer scores are intended to indicate increasing levels of competence or attainment.

The law of comparative judgment was conceived by L. L. Thurstone. In modern-day terminology, it is more aptly described as a model that is used to obtain measurements from any process of pairwise comparison. Examples of such processes are the comparisons of perceived intensity of physical stimuli, such as the weights of objects, and comparisons of the extremity of an attitude expressed within statements, such as statements about capital punishment. The measurements represent how we perceive entities, rather than measurements of actual physical properties. This kind of measurement is the focus of psychometrics and psychophysics.

Quantitative psychology is a field of scientific study that focuses on the mathematical modeling, research design and methodology, and statistical analysis of psychological processes. It includes tests and other devices for measuring cognitive abilities. Quantitative psychologists develop and analyze a wide variety of research methods, including those of psychometrics, a field concerned with the theory and technique of psychological measurement.

Georg William Rasch was a Danish mathematician, statistician, and psychometrician, most famous for the development of a class of measurement models known as Rasch models. He studied with R.A. Fisher and also briefly with Ragnar Frisch, and was elected a member of the International Statistical Institute in 1948.

Test equating traditionally refers to the statistical process of determining comparable scores on different forms of an exam. It can be accomplished using either classical test theory or item response theory.

Psychometric software refers to specialized programs used for the psychometric analysis of data obtained from tests, questionnaires, polls or inventories that measure latent psychoeducational variables. Although some psychometric analyses can be performed using general statistical software such as SPSS, most require specialized tools designed specifically for psychometric purposes.

<span class="mw-page-title-main">Benjamin Drake Wright</span> American psychometrician (1926–2015)

Benjamin Drake Wright was an American psychometrician. He is largely responsible for the widespread adoption of Georg Rasch's measurement principles and models. In the wake of what Rasch referred to as Wright's “almost unbelievable activity in this field” in the period from 1960 to 1972, Rasch's ideas entered the mainstream in high-stakes testing, professional certification and licensure examinations, and in research employing tests, and surveys and assessments across a range of fields. Wright's seminal contributions to measurement continued until 2001, and included articulation of philosophical principles, production of practical results and applications, software development, development of estimation methods and model fit statistics, vigorous support for students and colleagues, and the founding of professional societies and new publications.

<span class="mw-page-title-main">Klaus Kubinger</span>

Klaus D. Kubinger, is a psychologist as well as a statistician, professor for psychological assessment at the University of Vienna, Faculty of Psychology. His main research work focuses on fundamental research of assessment processes and on application and advancement of Item response theory models. He is also known as a textbook author of psychological assessment on the one hand and on statistics on the other hand.

The Mokken scale is a psychometric method of data reduction. A Mokken scale is a unidimensional scale that consists of hierarchically-ordered items that measure the same underlying, latent concept. This method is named after the political scientist Rob Mokken who suggested it in 1971.

Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items that are the basic building blocks of a psychological test. The method was first described by John R. Bormuth in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items. So, instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models. More recently, neural networks, including Large Language Models, such as the GPT family, have been used successfully for generating items automatically.

Matthias von Davier is a psychometrician, academic, inventor, and author. He is the executive director of the TIMSS & PIRLS International Study Center in Lynch School of Education and Human Development and the J. Donald Monan, S.J., University Professor in Education at Boston College.

References

  1. Baker, F (2001). The Basics of Item Response Theory. University of Maryland, College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.
  2. Rasch, Georg (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
  3. Lord, Fred (1980). Applications of item response theory to practical testing problems. CMahwah, NJ: Lawrence Erlbaum Associates, Inc.