Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification.
Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology. Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards. The advance of information technology promises to measure educational achievement at reduced cost.
The use of AES for high-stakes testing in education has generated significant backlash, with opponents pointing to research that computers cannot yet grade writing accurately and arguing that their use for such purposes promotes teaching writing in reductive ways (i.e. teaching to the test).
Most historical summaries of AES trace the origins of the field to the work of Ellis Batten Page. [1] In 1966, he argued [2] for the possibility of scoring essays by computer, and in 1968 he published [3] his successful work with a program called Project Essay Grade (PEG). Using the technology of that time, computerized essay scoring would not have been cost-effective, [4] so Page suspended his efforts for about two decades. Eventually, Page sold PEG to Measurement Incorporated.
By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility. As early as 1982, a UNIX program called Writer's Workbench was able to offer punctuation, spelling and grammar advice. [5] In collaboration with several companies (notably Educational Testing Service), Page updated PEG and ran some successful trials in the early 1990s. [6]
Peter Foltz and Thomas Landauer developed a system using a scoring engine called the Intelligent Essay Assessor (IEA). IEA was first used to score essays in 1997 for their undergraduate courses. [7] It is now a product from Pearson Educational Technologies and used for scoring within a number of commercial products and state and national exams.
IntelliMetric is Vantage Learning's AES engine. Its development began in 1996. [8] It was first used commercially to score essays in 1998. [9]
Educational Testing Service offers "e-rater", an automated essay scoring program. It was first used commercially in February 1999. [10] Jill Burstein was the team leader in its development. ETS's Criterion Online Writing Evaluation Service uses the e-rater engine to provide both scores and targeted feedback.
Lawrence Rudner has done some work with Bayesian scoring, and developed a system called BETSY (Bayesian Essay Test Scoring sYstem). [11] Some of his results have been published in print or online, but no commercial system incorporates BETSY as yet.
Under the leadership of Howard Mitzel and Sue Lottridge, Pacific Metrics developed a constructed response automated scoring engine, CRASE. Currently utilized by several state departments of education and in a U.S. Department of Education-funded Enhanced Assessment Grant, Pacific Metrics’ technology has been used in large-scale formative and summative assessment environments since 2007.
Measurement Inc. acquired the rights to PEG in 2002 and has continued to develop it. [12]
In 2012, the Hewlett Foundation sponsored a competition on Kaggle called the Automated Student Assessment Prize (ASAP). [13] 201 challenge participants attempted to predict, using AES, the scores that human raters would give to thousands of essays written to eight different prompts. The intent was to demonstrate that AES can be as reliable as human raters, or more so. The competition also hosted a separate demonstration among nine AES vendors on a subset of the ASAP data. Although the investigators reported that the automated essay scoring was as reliable as human scoring, [14] this claim was not substantiated by any statistical tests, because some of the vendors required that no such tests be performed as a precondition for their participation. [15] Moreover, the claim that the Hewlett Study demonstrated that AES can be as reliable as human raters has since been strongly contested, [16] [17] including by Randy E. Bennett, the Norman O. Frederiksen Chair in Assessment Innovation at the Educational Testing Service. [18] Major criticisms of the study have been that five of the eight datasets consisted of paragraphs rather than essays, and that four of the eight datasets were graded by human readers for content only rather than for writing ability. In addition, rather than measuring the human readers and the AES machines against the "true score" (the average of the two readers' scores), the study employed an artificial construct, the "resolved score", which in four datasets was the higher of the two human scores whenever they disagreed. This last practice, in particular, gave the machines an unfair advantage by allowing them to round up for these datasets. [16]
In 1966, Page hypothesized that, in the future, a computer-based judge would correlate more closely with each human judge than the human judges do with one another. [2] Although the applicability of this approach to essay marking in general has been criticized, the hypothesis was supported for the marking of free-text answers to short questions, such as those typical of the British GCSE system. [19] Results of supervised learning demonstrate that automatic systems perform well when marking by different human teachers is in good agreement. Unsupervised clustering of answers showed that excellent papers and weak papers formed well-defined clusters for which the automated marking rule worked well, whereas the marks given by human teachers for a third, 'mixed' cluster were often contentious, and the reliability of any assessment of works from that cluster, whether human or computer-based, can often be questioned. [19]
According to a recent survey, [20] modern AES systems attempt to score multiple distinct dimensions of an essay's quality in order to provide feedback to users.
From the beginning, the basic procedure for AES has been to start with a training set of essays that have been carefully hand-scored. [21] The program evaluates surface features of the text of each essay, such as the total number of words, the number of subordinate clauses, or the ratio of uppercase to lowercase letters—quantities that can be measured without any human insight. It then constructs a mathematical model that relates these quantities to the scores that the essays received. The same model is then applied to calculate scores of new essays.
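For illustration, a minimal sketch of this procedure in Python might look as follows; the particular surface features, the toy essays and scores, and the choice of ordinary linear regression are assumptions for the example, not any vendor's actual system.

```python
# A hedged sketch of the classic AES training procedure: compute a few surface
# features of each hand-scored essay, fit a linear model, and reuse the model
# to score new essays. Features and data below are illustrative assumptions.
import re
import numpy as np
from sklearn.linear_model import LinearRegression

def surface_features(essay: str) -> list:
    words = essay.split()
    n_words = len(words)
    # crude proxy for subordinate clauses: count common subordinating words
    n_subordinate = len(re.findall(r"\b(because|although|while|which|that)\b", essay.lower()))
    upper = sum(c.isupper() for c in essay)
    lower = sum(c.islower() for c in essay)
    case_ratio = upper / max(lower, 1)  # ratio of uppercase to lowercase letters
    return [n_words, n_subordinate, case_ratio]

# toy training set: essays paired with human-assigned scores (1-6 scale)
train_essays = ["A short essay that makes one simple point.",
                "A longer essay, which argues a claim because the evidence supports it."]
train_scores = [2, 4]

X = np.array([surface_features(e) for e in train_essays])
y = np.array(train_scores)
model = LinearRegression().fit(X, y)  # relate surface features to human scores

# apply the same model to a new, unseen essay
new_essay = "An unseen essay that should receive a score."
print(round(float(model.predict(np.array([surface_features(new_essay)]))[0]), 2))
```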
Recently, one such mathematical model was created by Isaac Persing and Vincent Ng, [22] which evaluates essays not only on the above features but also on their argument strength. It assesses various aspects of the essay, such as the author's level of agreement with the prompt and the reasons given for it, adherence to the prompt's topic, the locations of argument components (major claim, claim, premise), errors in the arguments, and cohesion among the arguments, among various other features. In contrast to the other models mentioned above, this model comes closer to duplicating human insight when grading essays. Owing to the growing popularity of deep neural networks, deep learning approaches have been adopted for automated essay scoring, generally obtaining superior results that often surpass inter-human agreement levels. [23]
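As a hedged illustration of a deep-learning approach of this general kind (not Persing and Ng's system or any published model), the following sketch regresses a score from token embeddings passed through a recurrent network; the vocabulary size, dimensions, and random data are placeholder assumptions.

```python
# A minimal neural AES regressor sketch: token embeddings -> LSTM -> score.
# All sizes and the training batch are illustrative assumptions.
import torch
import torch.nn as nn

class EssayScorer(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)   # map final state to a score

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(-1)    # predicted scores: (batch,)

model = EssayScorer()
fake_batch = torch.randint(1, 5000, (8, 200))    # 8 essays of 200 token ids each
human_scores = torch.rand(8) * 5 + 1             # placeholder scores on a 1-6 scale
loss = nn.MSELoss()(model(fake_batch), human_scores)
loss.backward()                                  # one illustrative training step
```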
The various AES programs differ in what specific surface features they measure, how many essays are required in the training set, and, most significantly, in the mathematical modeling technique. Early attempts used linear regression. Modern systems may use linear regression or other machine learning techniques, often in combination with other statistical techniques such as latent semantic analysis [24] and Bayesian inference. [11]
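To illustrate how latent semantic analysis can be combined with regression, the sketch below maps essays into a low-rank "semantic" space via a truncated SVD of TF-IDF vectors and then relates that space to human scores; the tiny corpus, component count, and ridge regression are assumptions for the example.

```python
# A hedged LSA-plus-regression sketch: TF-IDF -> truncated SVD -> ridge model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

train_essays = ["an essay about the prompt topic",
                "another graded essay with more development",
                "a third training essay on the prompt",
                "a fourth training essay with weak support"]
train_scores = [3, 5, 2, 4]          # placeholder human scores

lsa_scorer = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # latent semantic dimensions
    Ridge(alpha=1.0),
)
lsa_scorer.fit(train_essays, train_scores)
print(lsa_scorer.predict(["a new essay about the prompt to be scored"]))
```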
The automated essay scoring task has also been studied in the cross-domain setting using machine learning models, where the models are trained on essays written for one prompt (topic) and tested on essays written for another prompt. Successful approaches in the cross-domain scenario are based on deep neural networks [25] or models that combine deep and shallow features. [26]
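A minimal sketch of the cross-prompt evaluation protocol follows: the scorer is trained only on essays written for one prompt and evaluated on essays written for a different prompt. The records, the feature extractor, and the ridge model are placeholder assumptions, not the deep or hybrid systems cited above.

```python
# Cross-domain (cross-prompt) evaluation sketch with placeholder data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# each record: (prompt_id, essay_text, human_score) -- illustrative only
records = [
    (1, "an essay responding to the first prompt", 4),
    (1, "another essay responding to the first prompt", 2),
    (2, "an essay written for the second prompt", 5),
    (2, "one more essay written for the second prompt", 3),
]

train = [(text, score) for pid, text, score in records if pid == 1]  # source prompt
test  = [(text, score) for pid, text, score in records if pid == 2]  # target prompt

scorer = make_pipeline(TfidfVectorizer(), Ridge())
scorer.fit([t for t, _ in train], [s for _, s in train])

predicted = scorer.predict([t for t, _ in test])
print(list(zip(predicted, [s for _, s in test])))  # predicted vs. human scores
```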
Any method of assessment must be judged on validity, fairness, and reliability. [27] An instrument is valid if it actually measures the trait that it purports to measure. It is fair if it does not, in effect, penalize or privilege any one class of people. It is reliable if its outcome is repeatable, even when irrelevant external factors are altered.
Before computers entered the picture, high-stakes essays were typically given scores by two trained human raters. If the scores differed by more than one point, a more experienced third rater would settle the disagreement. In this system, there is an easy way to measure reliability: by inter-rater agreement. If raters do not consistently agree within one point, their training may be at fault. If a rater consistently disagrees with how other raters look at the same essays, that rater probably needs extra training.
Various statistics have been proposed to measure inter-rater agreement. Among them are percent agreement, Scott's π, Cohen's κ, Krippendorff's α, Pearson's correlation coefficient r, Spearman's rank correlation coefficient ρ, and Lin's concordance correlation coefficient.
Percent agreement is a simple statistic applicable to grading scales with scores from 1 to n, where usually 4 ≤ n ≤ 6. It is reported as three figures, each a percent of the total number of essays scored: exact agreement (the two raters gave the essay the same score), adjacent agreement (the raters differed by at most one point; this includes exact agreement), and extreme disagreement (the raters differed by more than two points). Expert human graders were found to achieve exact agreement on 53% to 81% of all essays, and adjacent agreement on 97% to 100%. [28]
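For illustration, the three percent-agreement figures described above, together with Cohen's κ for comparison, can be computed as in the following sketch; the two raters' scores are made-up example data.

```python
# Inter-rater agreement sketch: exact agreement, adjacent agreement (within one
# point, includes exact), extreme disagreement (more than two points apart),
# plus Cohen's kappa via scikit-learn. Scores below are illustrative only.
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 3, 5, 2, 4, 6, 3]
rater_b = [4, 4, 5, 4, 3, 6, 6]

diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
n = len(diffs)
exact    = sum(d == 0 for d in diffs) / n   # same score
adjacent = sum(d <= 1 for d in diffs) / n   # within one point (includes exact)
extreme  = sum(d > 2 for d in diffs) / n    # more than two points apart

print(f"exact {exact:.0%}, adjacent {adjacent:.0%}, extreme {extreme:.0%}")
print("Cohen's kappa:", round(cohen_kappa_score(rater_a, rater_b), 3))
```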
Inter-rater agreement can now be applied to measuring the computer's performance. A set of essays is given to two human raters and an AES program. If the computer-assigned scores agree with one of the human raters as well as the raters agree with each other, the AES program is considered reliable. Alternatively, each essay is given a "true score" by taking the average of the two human raters' scores, and the two humans and the computer are compared on the basis of their agreement with the true score.
Some researchers have reported that their AES systems can, in fact, do better than a human. Page made this claim for PEG in 1994. [6] Scott Elliot said in 2003 that IntelliMetric typically outperformed human scorers. [8] AES machines, however, appear to be less reliable than human readers for any kind of complex writing test. [29]
In current practice, high-stakes assessments such as the GMAT are always scored by at least one human. AES is used in place of a second rater. A human rater resolves any disagreements of more than one point. [30]
AES has been criticized on various grounds. Yang et al. mention "the over-reliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies." [30] Several critics are concerned that students' motivation will be diminished if they know that no human will read their writing. [31] Among the most telling critiques are reports of intentionally gibberish essays being given high scores. [32]
On 12 March 2013, HumanReaders.Org launched an online petition, "Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment". Within weeks, the petition gained thousands of signatures, including Noam Chomsky, [33] and was cited in a number of newspapers, including The New York Times, [34] and on a number of education and technology blogs. [35]
The petition describes the use of AES for high-stakes testing as "trivial", "reductive", "inaccurate", "undiagnostic", "unfair" and "secretive". [36]
In a detailed summary of research on AES, the petition site notes, "RESEARCH FINDINGS SHOW THAT no one—students, parents, teachers, employers, administrators, legislators—can rely on machine scoring of essays ... AND THAT machine scoring does not measure, and therefore does not promote, authentic acts of writing." [37]
The petition specifically addresses the use of AES for high-stakes testing and says nothing about other possible uses.
Most resources for automated essay scoring are proprietary.
A standardized test is a test that is administered and scored in a consistent, or "standard", manner. Standardized tests are designed so that the questions and their interpretations are consistent, and so that they are administered and scored in a predetermined, standard manner.
Educational assessment or educational evaluation is the systematic process of documenting and using empirical data on knowledge, skills, attitudes, aptitude, and beliefs to refine programs and improve student learning. Assessment data can be obtained by directly examining student work to assess the achievement of learning outcomes, or from data from which one can make inferences about learning. Assessment is often used interchangeably with test but is not limited to tests. Assessment can focus on the individual learner, the learning community, a course, an academic program, the institution, or the educational system as a whole. The word "assessment" came into use in an educational context after the Second World War.
Electronic assessment, also known as digital assessment, e-assessment, online assessment or computer-based assessment, is the use of information technology in assessment such as educational assessment, health assessment, psychiatric assessment, and psychological assessment. This covers a wide range of activities ranging from the use of a word processor for assignments to on-screen testing. Specific types of e-assessment include multiple choice, online/electronic submission, computerized adaptive testing such as the Frankfurt Adaptive Concentration Test, and computerized classification testing.
The Washington Assessment of Student Learning (WASL) was a standardized educational assessment system given as the primary assessment in the state of Washington from spring 1997 to summer 2009. The WASL was also used as a high school graduation examination beginning in the spring of 2006 and ending in 2009. It has been replaced by the High School Proficiency Exam (HSPE), the Measurements of Student Progress (MSP) for grades 3–8, and later the Smarter Balanced Assessment (SBAC). The WASL assessment consisted of examinations over four subjects with four different types of questions. It was given to students from third through eighth grades and tenth grade. Third and sixth graders were tested in reading and math; fourth and seventh graders in math, reading and writing. Fifth and eighth graders were tested in reading, math and science. The high school assessment, given during a student's tenth grade year, contained all four subjects.
Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level. For this reason, it has also been called tailored testing. In other words, it is a form of computer-administered test in which the next item or set of items selected to be administered depends on the correctness of the test taker's responses to the most recent items administered.
STAR Reading, STAR Early Literacy and STAR Math are standardized, computer-adaptive assessments created by Renaissance Learning, Inc., for use in K–12 education. Each is a "Tier 2" assessment of a skill (reading practice, early literacy, and math practice, respectively) that can be used any number of times due to item-bank technology. These assessments fall somewhere between progress monitoring tools and high-stakes tests.
In an educational setting, standards-based assessment is assessment that relies on the evaluation of student understanding with respect to agreed-upon standards, also known as "outcomes". The standards set the criteria for the successful demonstration of the understanding of a concept or skill.
Holistic grading or holistic scoring, in standards-based education, is an approach to scoring essays using a simple grading structure that bases a grade on a paper's overall quality. This type of grading, which is also described as nonreductionist grading, contrasts with analytic grading, which takes more factors into account when assigning a grade. Holistic grading can also be used to assess classroom-based work. Rather than counting errors, a paper is judged holistically and often compared to an anchor paper to evaluate if it meets a writing standard. It differs from other methods of scoring written discourse in two basic ways. It treats the composition as a whole, not assigning separate values to different parts of the writing. And it uses two or more raters, with the final score derived from their independent scores. Holistic scoring has gone by other names: "non-analytic," "overall quality," "general merit," "general impression," "rapid impression." Although the value and validation of the system are a matter of debate, holistic scoring of writing is still in wide application.
Summative assessment, summative evaluation, or assessment of learning is the assessment of participants in an educational program. Summative assessments are designed both to assess the effectiveness of the program and the learning of the participants. This contrasts with formative assessment, which monitors participants' development on an ongoing basis to inform instructors of student learning progress.
Marlene Scardamalia is an education researcher and professor at the Ontario Institute for Studies in Education, University of Toronto.
Electracy is a theory by Gregory Ulmer that describes the skills necessary to exploit the full communicative potential of new electronic media such as multimedia, hypermedia, social software, and virtual worlds. According to Ulmer, electracy "is to digital media what literacy is to print". It encompasses the broader cultural, institutional, pedagogical, and ideological implications inherent in the major societal transition from print to electronic media. Electracy is a portmanteau of "electricity" and Jacques Derrida's term "trace".
A standard-setting study is an official research study conducted by an organization that sponsors tests to determine a cutscore for the test. To be legally defensible in the US, in particular for high-stakes assessments, and to meet the Standards for Educational and Psychological Testing, a cutscore cannot be arbitrarily determined; it must be empirically justified. For example, the organization cannot merely decide that the cutscore will be 70% correct. Instead, a study is conducted to determine what score best differentiates classifications of examinees, such as competent versus incompetent. Such studies require considerable resources, involving a number of professionals, in particular those with a psychometric background. Standard-setting studies are for that reason impractical for regular classroom situations, yet standard setting is performed, and multiple methods exist, at every level of education.
Adaptive learning, also known as adaptive teaching, is an educational method which uses computer algorithms as well as artificial intelligence to orchestrate the interaction with the learner and deliver customized resources and learning activities to address the unique needs of each learner. In professional learning contexts, individuals may "test out" of some training to ensure they engage with novel instruction. Computers adapt the presentation of educational material according to students' learning needs, as indicated by their responses to questions, tasks and experiences. The technology encompasses aspects derived from various fields of study including computer science, AI, psychometrics, education, psychology, and brain science.
An examination or test is an educational assessment intended to measure a test-taker's knowledge, skill, aptitude, physical fitness, or classification in many other topics. A test may be administered verbally, on paper, on a computer, or in a predetermined area that requires a test taker to demonstrate or perform a set of skills.
Writing assessment refers to an area of study that contains theories and practices that guide the evaluation of a writer's performance or potential through a writing task. Writing assessment can be considered a combination of scholarship from composition studies and measurement theory within educational assessment. Writing assessment can also refer to the technologies and practices used to evaluate student writing and learning. An important consequence of writing assessment is that the type and manner of assessment may impact writing instruction, with consequences for the character and quality of that instruction.
The Smarter Balanced Assessment Consortium (SBAC) is a standardized test consortium. It creates Common Core State Standards-aligned tests to be used in several states. It uses automated essay scoring. Its counterpart in the effort to become a leading multi-state test provider is the Partnership for the Assessment of Readiness for College and Careers (PARCC).
Leslie Cooper Perelman is an American scholar and authority on writing assessment. He is a critic of automated essay scoring (AES), and influenced the College Board's decision to terminate the Writing Section of the SAT.
Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items that are the basic building blocks of a psychological test. The method was first described by John R. Bormuth in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items. So, instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models. More recently, neural networks, including Large Language Models, such as the GPT family, have been used successfully for generating items automatically.
Randy Elliot Bennett is an American educational researcher who specializes in educational assessment. He is currently the Norman O. Frederiksen Chair in Assessment Innovation at Educational Testing Service in Princeton, NJ. His research and writing focus on bringing together advances in cognitive science, technology, and measurement to improve teaching and learning. He received the ETS Senior Scientist Award in 1996, the ETS Career Achievement Award in 2005, the Teachers College, Columbia University Distinguished Alumni Award in 2016, Fellow status in the American Educational Research Association (AERA) in 2017, the National Council on Measurement in Education's (NCME) Bradley Hanson Award for Contributions to Educational Measurement in 2019, the E. F. Lindquist Award from AERA and ACT in 2020, elected membership in the National Academy of Education in 2022, and the AERA Cognition and Assessment Special Interest Group Outstanding Contribution to Research in Cognition and Assessment Award in 2024. Randy Bennett was elected President of both the International Association for Educational Assessment (IAEA), a worldwide organization primarily constituted of governmental and NGO measurement organizations, and the National Council on Measurement in Education (NCME), whose members are employed in universities, testing organizations, state and federal education departments, and school districts.
Lawrence M. Rudner is a research statistician and consultant whose work spans domains including statistical analysis, computer programming, web development, and oyster farming. He is the owner and president of Oyster Girl Oysters, and is an instructor at the Chesapeake Forum and the Chesapeake Bay Maritime Museum. He is the founder and former editor of the journal Practical Assessment, Research, and Evaluation.