Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items, the basic building blocks of a psychological test. The method was first described by John R. Bormuth [1] in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items from that model. [2] Instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models. [3] [4] [5] More recently, neural networks, including large language models such as the GPT family, have been used successfully to generate items automatically. [6] [7]
In psychological testing, the responses of the test taker to test items provide objective measurement data for a variety of human characteristics. [8] Characteristics measured by psychological and educational tests include academic abilities, school performance, intelligence, and motivation. These tests are frequently used to make decisions that have significant consequences for individuals or groups of individuals. Achieving measurement quality standards, such as test validity, is one of the most important objectives for psychologists and educators. [9] AIG is an approach to test development that can be used to maintain and improve test quality economically in the contemporary environment, where computerized testing has increased the need for large numbers of test items. [5]
AIG reduces the cost of producing standardized tests, [10] as algorithms can generate many more items in a given amount of time than a human test specialist. It can quickly and easily create parallel test forms, which expose different test takers to different groups of test items with the same level of complexity or difficulty, thus enhancing test security. [3] When combined with computerized adaptive testing, AIG can generate new items, or select which already-generated items should be administered next, based on the test taker's ability during the administration of the test. AIG is also expected to produce items with a wide range of difficulty and fewer construction errors, and to permit higher comparability of items because the prototypical item model is defined more systematically. [3] [11] [12]
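The adaptive selection step can be illustrated with a minimal Python sketch. It assumes a Rasch model and a maximum-information selection rule, neither of which is prescribed by the sources cited above. The pool structure and the function names (`rasch_probability`, `item_information`, `select_next_item`) are hypothetical.

```python
import math

def rasch_probability(theta, beta):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def item_information(theta, beta):
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = rasch_probability(theta, beta)
    return p * (1.0 - p)

def select_next_item(theta, pool, administered):
    """Pick the unused item that is most informative at the current ability estimate.

    pool: list of dicts with hypothetical keys "id" and "beta" (predicted difficulty).
    administered: set of item ids already shown to the test taker.
    Returns None when every item in the pool has been used.
    """
    candidates = [item for item in pool if item["id"] not in administered]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item_information(theta, item["beta"]))

# Illustrative pool of already-generated items with made-up difficulties.
pool = [
    {"id": "a", "beta": -1.0},
    {"id": "b", "beta": 0.0},
    {"id": "c", "beta": 1.5},
]
next_item = select_next_item(theta=1.2, pool=pool, administered=set())
```

Under the Rasch model, information peaks where item difficulty equals ability, so this rule amounts to choosing the unused item whose predicted difficulty lies closest to the current ability estimate.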
Test development (including AIG) can be enriched if it is based on a cognitive theory. Cognitive processes taken from a given theory are often matched with item features during item construction. The purpose of this is to predetermine a given psychometric parameter, such as item difficulty (from now on: β). Radicals [11] are the structural elements that significantly affect item parameters and give the item certain cognitive requirements. One or more radicals of the item model can be manipulated to produce parent item models with different parameter levels (e.g., different β). Each parent can then grow its own family by manipulating other elements that Irvine [11] called incidentals. Incidentals are surface features that vary randomly from item to item within the same family. Items that have the same structure of radicals and differ only in incidentals are usually labeled isomorphs [13] or clones. [14] [15] There are two kinds of item cloning. In the first, the item model consists of an item with one or more open places, and cloning is done by filling each place with an element selected from a list of possibilities. In the second, the item model is an intact item that is cloned by introducing transformations, for example changing the angle of an object in a spatial ability test. [16] Variation in these surface characteristics should not significantly influence the test taker's responses, which is why incidentals are believed to produce only slight differences among the item parameters of the isomorphs. [3]
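The first kind of cloning, filling the open places of an item model, can be sketched in Python as follows. This is only an illustration: the item model, the choice of number magnitude as the radical, and the names `ITEM_MODEL`, `RADICAL_LEVELS`, and `generate_family` are assumptions made for the example and are not taken from any of the generators cited in this article.

```python
import random
from string import Template

# Hypothetical item model: an arithmetic word problem with open places.
ITEM_MODEL = Template(
    "$name buys $n boxes with $k $object each. "
    "How many $object does $name have?"
)

# Radical (assumed to drive difficulty): the magnitude of the numbers.
RADICAL_LEVELS = {
    "easy": range(2, 6),    # single-digit factors
    "hard": range(12, 20),  # two-digit factors
}

# Incidentals (assumed not to affect difficulty): surface features.
NAMES = ["Ana", "Ben", "Chloe"]
OBJECTS = ["pencils", "apples", "marbles"]

def generate_family(level, size, seed=0):
    """Generate `size` isomorphs that share one radical level."""
    rng = random.Random(seed)  # seeded so the family is reproducible
    numbers = list(RADICAL_LEVELS[level])
    items = []
    for _ in range(size):
        n, k = rng.choice(numbers), rng.choice(numbers)
        stem = ITEM_MODEL.substitute(
            name=rng.choice(NAMES), n=n, k=k, object=rng.choice(OBJECTS)
        )
        items.append({"stem": stem, "key": n * k, "level": level})
    return items

easy_family = generate_family("easy", size=5)
hard_family = generate_family("hard", size=5)
```

Each call produces one family: the radical level is fixed by the `level` argument, while the name, object, and exact numbers within the level vary from clone to clone. Whether such variation really leaves item parameters unchanged is an empirical question that has to be checked with response data.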
A number of item generators have been subjected to objective validation testing.
MathGen is a program that generates items to test mathematical achievement. In a 2018 article for the Journal of Educational Measurement, authors Embretson and Kingston conducted an extensive qualitative review and empirical try-outs to evaluate the qualitative and psychometric properties of generated items, concluding that the items were successful and that items generated from the same item structure had predictable psychometric properties. [17] [18]
A test of melodic discrimination developed with the aid of the computational model Rachman-Jun 2015 [19] was administered to participants in a 2017 trial. According to the data collected by P.M. Harrison et al., results demonstrate strong validity and reliability. [20]
Ferreyra and Backhoff-Escudero [21] generated two parallel versions of the Basic Competences Exam (Excoba), a general test of educational skills, using a program they developed called GenerEx. They then studied the internal structure as well as the psychometric equivalence of the created tests. Empirical results of psychometric quality are favorable overall, and the tests and items are consistent as measured by multiple psychometric indices.
Gierl and his colleagues [22] [23] [24] [25] used an AIG program called the Item Generator (IGOR) [26] to create multiple-choice items that test medical knowledge. IGOR-generated items, even when compared to manually designed items, showed good psychometric properties.
Arendasy, Sommer, and Mayr [27] used AIG to create verbal items to test verbal fluency in German and English, administering them to German- and English-speaking participants respectively. The computer-generated items showed acceptable psychometric properties. The sets of items administered to these two groups were based on a common set of interlanguage anchor items, which facilitated cross-lingual comparisons of performance.
Holling, Bertling, and Zeuch [28] used probability theory to automatically generate mathematical word problems with expected difficulties. The items fit the Rasch [29] model, and item difficulties could be explained by the linear logistic test model (LLTM) [30] as well as by the Random-Effects LLTM. Holling, Blank, Kuchenbäcker, and Kuhn [31] conducted a similar study with statistical word problems but without using AIG. Arendasy and his colleagues [32] [33] presented studies on automatically generated algebra word problems and examined how a quality control framework of AIG can affect the measurement quality of items.
The Item Maker (IMak) is a program written in the R programming language for plotting figural analogy items. The psychometric properties of 23 IMak-generated items were found to be satisfactory, and item difficulty based on rule generation could be predicted by means of the linear logistic test model (LLTM). [3]
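How the LLTM ties item difficulty to generation rules can be shown with a short Python sketch. In the LLTM, the Rasch difficulty of an item is modeled as a weighted sum of basic parameters, one per rule or cognitive operation, plus a normalization constant. The rule names and weights below are invented for illustration and do not come from IMak or any other study cited here; the function names are likewise hypothetical.

```python
import math

# Hypothetical basic parameters (eta): difficulty contributed by each rule.
ETA = {"rotation": 0.8, "addition": 0.5, "negation": 1.2}

def lltm_difficulty(rule_counts, eta=ETA, intercept=0.0):
    """Predicted item difficulty: intercept + sum over rules of count * eta.

    rule_counts: dict mapping a rule name to how often the item uses it
                 (one row of the LLTM weight matrix).
    intercept:   normalization constant of the model (assumed 0 here).
    """
    return intercept + sum(count * eta[rule] for rule, count in rule_counts.items())

def rasch_probability(theta, beta):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# An item built from one rotation and two additions.
beta = lltm_difficulty({"rotation": 1, "addition": 2})
p_correct = rasch_probability(theta=1.0, beta=beta)
```

In an empirical study the basic parameters are estimated from response data rather than fixed in advance. The sketch only shows the direction of prediction, from rule structure to difficulty to response probability.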
MazeGen is another program written in R; it generates mazes automatically. The psychometric properties of 18 such mazes were found to be optimal, including Rasch model fit and the LLTM prediction of maze difficulty. [34]
GeomGen is a program that generates figural matrices. [35] A study which identified sources of measurement bias related to response elimination strategies for figural matrix items concluded that distractor salience favors the pursuit of response elimination strategies and that this knowledge could be incorporated into AIG to improve the construct validity of such items. [36] The same group used AIG to study differential item functioning (DIF) and gender differences associated with mental rotation. They manipulated item design features that have exhibited gender DIF in previous studies, and they showed that the estimates of the effect size of gender differences were compromised by the presence of different kinds of gender DIF that could be related to specific item design features. [37] [38]
Arendasy also studied possible violations of psychometric quality, identified using item response theory (IRT), in automatically generated visuospatial reasoning items. For this purpose, he presented two programs: the already-mentioned GeomGen [35] and the Endless Loop Generator (EsGen). He concluded that GeomGen was more suitable for AIG because IRT principles can be incorporated during item generation. [39] In a parallel research project using GeomGen, Arendasy and Sommer [40] found that varying the perceptual organization of items could influence the performance of respondents depending on their ability levels, and that it affected several psychometric quality indices. With these results, they questioned the unidimensionality assumption of figural matrix items in general.
MatrixDeveloper [41] was used to generate twenty-five 4x4 square matrix items automatically. These items were administered to 169 individuals. According to research results, the items show a good Rasch model fit, and rule-based generation can explain the item difficulty. [42]
The first known item matrix generator was designed by Embretson, [43] [14] and her automatically generated items demonstrated good psychometric properties, as shown by Embretson and Reise. [44] She also proposed a model for adequate online item generation.