Automatic item generation

Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items, the basic building blocks of a psychological test. The method was first described by John R. Bormuth [1] in the 1960s but was not developed further until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items from that model. [2] Instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models. [3] [4] [5] More recently, neural networks, including large language models such as the GPT family, have been used successfully to generate items automatically. [6] [7]
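A minimal sketch of this two-step idea in Python is given below: a hand-written item model (a template with open places) plus a small routine that fills it to produce items. The stem, slot names, value lists, and scoring key are hypothetical and purely illustrative; they are not taken from any published generator.

```python
import itertools

# Hypothetical item model: a stem with open places ("slots") and the value
# lists used to fill them. Every combination of values yields one test item.
item_model = {
    "stem": "A box holds {a} red balls and {b} blue balls. How many balls are in the box?",
    "slots": {"a": [3, 5, 8], "b": [4, 6, 9]},
}

def generate_items(model):
    """Fill every slot combination of the model and compute the answer key."""
    names = list(model["slots"])
    value_lists = [model["slots"][n] for n in names]
    for values in itertools.product(*value_lists):
        bindings = dict(zip(names, values))
        yield {"stem": model["stem"].format(**bindings), "key": sum(bindings.values())}

for item in generate_items(item_model):
    print(item["stem"], "->", item["key"])
```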

Context

In psychological testing, the responses of the test taker to test items provide objective measurement data for a variety of human characteristics. [8] Characteristics measured by psychological and educational tests include academic abilities, school performance, intelligence, and motivation, among others, and these tests are frequently used to make decisions with significant consequences for individuals or groups. Achieving measurement quality standards, such as test validity, is one of the most important objectives for psychologists and educators. [9] AIG is an approach to test development that can be used to maintain and improve test quality economically in a contemporary environment where computerized testing has increased the demand for large numbers of test items. [5]

Benefits

AIG reduces the cost of producing standardized tests, [10] as algorithms can generate many more items in a given amount of time than a human test specialist. It can quickly and easily create parallel test forms, which allow different test takers to be exposed to different groups of test items of the same complexity or difficulty, thus enhancing test security. [3] When combined with computerized adaptive testing, AIG can generate new items, or select which already-generated items should be administered next, based on the test taker's ability during the administration of the test. AIG is also expected to produce items with a wide range of difficulty and fewer construction errors, and to permit greater comparability of items because the prototypical item model is defined more systematically. [3] [11] [12]
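As a simplified illustration of adaptive selection from a pool of generated items, the following Python sketch chooses the next item by maximizing Fisher information under the Rasch model at the current ability estimate. This is a standard textbook selection rule, not the procedure of any particular AIG or CAT system, and the item pool shown is hypothetical.

```python
import math

def rasch_information(theta, b):
    """Fisher information of a Rasch item with difficulty b at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def next_item(theta_hat, pool, administered):
    """Return the not-yet-administered item that is most informative at theta_hat."""
    candidates = [item for item in pool if item["id"] not in administered]
    return max(candidates, key=lambda item: rasch_information(theta_hat, item["b"]))

# Hypothetical pool of five generated items with calibrated difficulties.
pool = [{"id": k, "b": b} for k, b in enumerate([-1.5, -0.5, 0.0, 0.7, 1.4])]
print(next_item(theta_hat=0.6, pool=pool, administered={0, 1}))
```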

Radicals, incidentals and isomorphs

Test development (including AIG) can be enriched if it is based on a cognitive theory. Cognitive processes taken from a given theory are often matched with item features during item construction, with the aim of predetermining a given psychometric parameter, such as item difficulty (hereafter β). Radicals [11] are the structural elements that significantly affect item parameters and give the item its cognitive requirements. One or more radicals of the item model can be manipulated to produce parent item models with different parameter levels (e.g., different β). Each parent can then grow its own family by manipulating other elements that Irvine [11] called incidentals. Incidentals are surface features that vary randomly from item to item within the same family. Items that share the same structure of radicals and differ only in incidentals are usually labeled isomorphs [13] or clones. [14] [15] There are two kinds of item cloning: in the first, the item model consists of an item with one or more open places, and cloning is done by filling each place with an element selected from a list of possibilities; in the second, the item model is an intact item that is cloned by applying transformations, for example changing the angle of an object in a spatial ability test. [16] Varying these surface characteristics should not significantly influence test takers' responses, which is why incidentals are believed to produce only slight differences among the item parameters of the isomorphs. [3]
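A schematic Python sketch of the radical/incidental distinction follows, under the assumptions of this paragraph: the number of arithmetic steps acts as the radical (assumed to change β), while names and objects act as incidentals (assumed to leave β essentially unchanged). The templates, value ranges, and labels are hypothetical.

```python
import random

# Radical: the number of arithmetic steps, assumed here to drive difficulty (beta).
# Incidentals: names and objects, assumed to vary the surface without changing beta.
RADICAL_LEVELS = {1: "easy parent", 2: "harder parent"}
INCIDENTALS = {"name": ["Ana", "Luis", "Mia"], "object": ["apples", "stamps", "coins"]}

def make_isomorph(steps, rng):
    """Generate one isomorph of the parent item model defined by the radical 'steps'."""
    name = rng.choice(INCIDENTALS["name"])
    obj = rng.choice(INCIDENTALS["object"])
    a, b, c = rng.randint(2, 9), rng.randint(2, 9), rng.randint(2, 9)
    if steps == 1:
        return f"{name} has {a} {obj} and buys {b} more. How many {obj} now?", a + b
    return f"{name} has {a} {obj}, buys {b} more, then gives away {c}. How many {obj} now?", a + b - c

rng = random.Random(0)
for steps, label in RADICAL_LEVELS.items():
    print(label, "->", make_isomorph(steps, rng))
```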

Current developments

A number of item generators have been subjected to objective validation testing.

MathGen is a program that generates items to test mathematical achievement. In a 2018 article in the Journal of Educational Measurement, Embretson and Kingston conducted an extensive qualitative review and empirical try-outs to evaluate the qualitative and psychometric properties of generated items, concluding that the items were successful and that items generated from the same item structure had predictable psychometric properties. [17] [18]

A test of melodic discrimination developed with the aid of the computational model Rachman-Jun 2015 [19] was administered to participants in a 2017 trial. According to the data collected by P.M. Harrison et al., results demonstrate strong validity and reliability. [20]

Ferreyra and Backhoff-Escudero [21] generated two parallel versions of the Basic Competences Exam (Excoba), a general test of educational skills, using a program they developed called GenerEx. They then studied the internal structure and psychometric equivalence of the generated tests. The empirical results on psychometric quality were favorable overall, and the tests and items were consistent across multiple psychometric indices.

Gierl and his colleagues [22] [23] [24] [25] used an AIG program called the Item Generator (IGOR [26] ) to create multiple-choice items that test medical knowledge. IGOR-generated items, even when compared to manually-designed items, showed good psychometric properties.

Arendasy, Sommer, and Mayr [27] used AIG to create verbal items to test verbal fluency in German and English, administering them to German- and English-speaking participants respectively. The computer-generated items showed acceptable psychometric properties. The sets of items administered to these two groups were based on a common set of interlanguage anchor items, which facilitated cross-lingual comparisons of performance.

Holling, Bertling, and Zeuch [28] used probability theory to automatically generate mathematical word problems with expected difficulties. The items achieved Rasch [29] model fit, and item difficulties could be explained by the linear logistic test model (LLTM [30]) as well as by the random-effects LLTM. Holling, Blank, Kuchenbäcker, and Kuhn [31] conducted a similar study with statistical word problems but without using AIG. Arendasy and his colleagues [32] [33] presented studies on automatically generated algebra word problems and examined how a quality control framework for AIG can affect the measurement quality of items.
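For reference, the Rasch model and the LLTM mentioned in this section can be written as follows; these are the standard formulations [29] [30], not results specific to the studies above.

```latex
% Rasch model: probability of a correct response by person j to item i
P(X_{ij} = 1) = \frac{\exp(\theta_j - \beta_i)}{1 + \exp(\theta_j - \beta_i)}

% LLTM: item difficulty decomposed into cognitive-operation (radical) weights
\beta_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + c
```

Here θ_j is the ability of person j, β_i the difficulty of item i, q_ik the known weight (e.g., how often operation or radical k is required by item i), η_k the estimated difficulty contribution of that operation, and c a normalization constant. Rule-based AIG lends itself to this model because the q_ik weights follow directly from the radicals used to build each item.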

Automatic generation of figural items

[Figure: Four-rule-based figural analogy stem automatically generated with the IMak package (for more information, see Blum & Holling, 2018).]

The Item Maker (IMak) is a program written in the R programming language for plotting figural analogy items. The psychometric properties of 23 IMak-generated items were found to be satisfactory, and item difficulty based on rule generation could be predicted by means of the linear logistic test model (LLTM). [3]

MazeGen is another program written in R that generates mazes automatically. The psychometric properties of 18 such mazes were found to be optimal, including Rasch model fit and LLTM prediction of maze difficulty. [34]

GeomGen is a program that generates figural matrices. [35] A study which identified sources of measurement bias related to response elimination strategies for figural matrix items concluded that distractor salience favors the pursuit of response elimination strategies and that this knowledge could be incorporated into AIG to improve the construct validity of such items. [36] The same group used AIG to study differential item functioning (DIF) and gender differences associated with mental rotation. They manipulated item design features that have exhibited gender DIF in previous studies, and they showed that the estimates of the effect size of gender differences were compromised by the presence of different kinds of gender DIF that could be related to specific item design features. [37] [38]

Arendasy also studied possible violations of psychometric quality, identified using item response theory (IRT), in automatically generated visuospatial reasoning items. For this purpose, he presented two programs: the already-mentioned GeomGen [35] and the Endless Loop Generator (EsGen). He concluded that GeomGen was more suitable for AIG because IRT principles can be incorporated during item generation. [39] In a parallel research project using GeomGen, Arendasy and Sommer [40] found that varying the perceptual organization of items could influence respondents' performance depending on their ability levels and that it affected several psychometric quality indices. With these results, they questioned the unidimensionality assumption of figural matrix items in general.

MatrixDeveloper [41] was used to generate twenty-five 4×4 matrix items automatically. These items were administered to 169 individuals. According to the results, the items showed good Rasch model fit, and rule-based generation explained item difficulty. [42]

The first known matrix-item generator was designed by Embretson, [43] [14] and her automatically generated items demonstrated good psychometric properties, as shown by Embretson and Reise. [44] She also proposed a model for adequate online item generation.

References

  1. Bormuth, J. (1969). On a theory of achievement test items. Chicago, IL: University of Chicago Press.
  2. Gierl, M.J., & Haladyna, T.M. (2012). Automatic item generation, theory and practice. New York, NY: Routledge Chapman & Hall.
  3. Blum, Diego; Holling, Heinz (6 August 2018). "Automatic Generation of Figural Analogies With the IMak Package". Frontiers in Psychology, 9, 1286. DOI: 10.3389/fpsyg.2018.01286. PMC 6087760. PMID 30127757. The material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.
  4. Glas, C.A.W., van der Linden, W.J., & Geerlings, H. (2010). Estimation of the parameters in an item-cloning model for adaptive testing. In W.J. van der Linden, & C.A.W. Glas (Eds.). Elements of adaptive testing (pp. 289–314). DOI: 10.1007/978-0-387-85461-8_15.
  5. Gierl, M.J., & Lai, H. (2012). The role of item models in automatic item generation. International Journal of Testing, 12(3), 273–298. DOI: 10.1080/15305058.2011.635830.
  6. von Davier, M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83, 847–857. https://doi.org/10.1007/s11336-018-9608-y
  7. Yaneva, V., & von Davier, M. (Eds.). (2023). Advancing Natural Language Processing in Educational Assessment (1st ed.). Routledge. https://doi.org/10.4324/9781003278658
  8. Van der Linden, W.J., & Hambleton, R.K. (1997). Item Response Theory: a brief history, common models, and extensions. In R.K. Hambleton, & W.J. van der Linden (Eds.). Handbook of modern Item Response Theory (pp. 1–31). New York: Springer.
  9. Embretson, S.E. (1999). Issues in the measurement of cognitive abilities. In S.E. Embretson, & S.L. Hershberger (Eds.). The new rules of measurement (pp. 1–15). Mahwah: Lawrence Erlbaum Associates.
  10. Rudner, L. (2010). Implementing the graduate management admission test computerized adaptive test. In W.J. van der Linden, and C.A.W. Glas (Eds.). Elements of adaptive testing (pp. 151–165). DOI: 10.1007/978-0-387-85461-8_15.
  11. Irvine, S. (2002). The foundations of item generation for mass testing. In S.H. Irvine, & P.C. Kyllonen (Eds.). Item generation for test development (pp. 3–34). Mahwah: Lawrence Erlbaum Associates.
  12. Lai, H., Alves, C., & Gierl, M.J. (2009). Using automatic item generation to address item demands for CAT. In D.J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. Web: www.psych.umn.edu/psylabs/CATCentral.
  13. Bejar, I. I. (2002). Generative testing: from conception to implementation in Item Generation for Test Development, eds. S. H. Irvine and P. C. Kyllonen (Mahwah, NJ: Lawrence Erlbaum Associates), 199–217.
  14. Embretson, S.E. (1999). Generating items during testing: psychometric issues and models. Psychometrika, 64(4), 407–433.
  15. Arendasy, M. E., and Sommer, M. (2012). Using automatic item generation to meet the increasing item demands of the high-stakes educational and occupational assessment. Learning and individual differences, 22, 112–117. doi: 10.1016/j.lindif.2011.11.005.
  16. Glas, C. A. W., and van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied psychological measurement, 27, 247–261. doi: 10.1177/0146621603027004001.
  17. Embretson, S.E., & Kingston, N.M. (2018). Automatic item generation: a more efficient process for developing mathematics achievement items? Journal of educational measurement, 55(1), 112–131. DOI: 10.1111/jedm.12166
  18. Willson, J., Morrison, K., & Embretson, S.E. (2014). Automatic item generator for mathematical achievement items: MathGen3.0. Technical report IES1005A-2014 for the Institute of Educational Sciences Grant R305A100234. Atlanta, GA: Cognitive Measurement Laboratory, Georgia, Institute of Technology.
  19. Collins, T., Laney, R., Willis, A., & Garthwaite, P.H. (2016). Developing and evaluating computational models of music style. Artificial intelligence for engineering design, analysis, and manufacturing, 30, 16–43. DOI: 10.1017/S0890060414000687.
  20. Harrison, P.M., Collins, T., & Müllensiefen, D. (2017). Applying modern psychometric techniques to melodic discrimination testing: item response theory, computerized adaptive testing, and automatic item generation. Scientific reports, 7(3618), 1–18.
  21. Ferreyra, M.F., & Backhoff-Escudero, E. (2016). Validez del Generador Automático de Ítems del Examen de Competencias Básicas (Excoba). Relieve, 22(1), art. 2, 1–16. DOI: 10.7203/relieve.22.1.8048.
  22. Gierl, M.J., Lai, H., Pugh, D., Touchie, C., Boulais, A.P., & De Champlain, A. (2016). Evaluating the psychometric characteristics of generated multiple-choice test items. Applied measurement in education, 29(3), 196–210. DOI: 10.1080/08957347.2016.1171768.
  23. Lai, H., Gierl, M.J., Byrne, B.E., Spielman, A.I., & Waldschmidt, D.M. (2016). Three modeling applications to promote automatic item generation for examinations in dentistry. Journal of dental education, 80(3), 339–347.
  24. Gierl, M.J., & Lai, H. (2013). Evaluating the quality of medical multiple-choice items created with automated processes. Medical education, 47, 726–733. DOI: 10.1111/medu.12202.
  25. Gierl, M.J., Lai, H., & Turner, S.R. (2012). Using automatic item generation to create multiple-choice test items. Medical education, 46(8), 757–765. DOI: 10.1111/j.1365-2923.2012.04289.x.
  26. Gierl, M.J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote assessment engineering. Journal of Technology, Learning, and Assessment, 7(2), 1–51.
  27. Arendasy, M.E., Sommer, M., & Mayr, F. (2011). Using automatic item generation to simultaneously construct German and English versions of a Word Fluency Test. Journal of cross-cultural psychology, 43(3), 464–479. DOI: 10.1177/0022022110397360.
  28. Holling, H., Bertling, J.P., & Zeuch, N. (2009). Automatic item generation of probability word problems. Studies in educational evaluation, 35(2–3), 71–76.
  29. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.
  30. Fischer, G.H. (1973). The linear logistic test model as an instrument of educational research. Acta Psychologica, 37, 359–374. DOI: 10.1016/0001-6918(73)90003-6.
  31. Holling, H., Blank, H., Kuchenbäcker, K., & Kuhn, J.T. (2008). Rule-based item design of statistical word problems: a review and first implementation. Psychology science quarterly, 50(3), 363–378.
  32. Arendasy, M.E., Sommer, M., Gittler, G., & Hergovich, A. (2006). Automatic generation of quantitative reasoning items. A pilot study. Journal of individual differences, 27(1), 2–14. DOI: 10.1027/1614-0001.27.1.2.
  33. Arendasy, M.E., & Sommer, M. (2007). Using psychometric technology in educational assessment: the case of a schema-based isomorphic approach to the automatic generation of quantitative reasoning items. Learning and individual differences, 17(4), 366–383. DOI: 10.1016/j.lindif.2007.03.005.
  34. Loe, B.S., & Rust, J. (2017). The perceptual maze test revisited: evaluating the difficulty of automatically generated mazes. Assessment, 1–16. DOI: 10.1177/1073191117746501.
  35. Arendasy, M. (2002). Geom-Gen - Ein Itemgenerator für Matrizentestaufgaben [Geom-Gen: an item generator for matrix test items]. Vienna: Eigenverlag.
  36. Arendasy, M.E., & Sommer, M. (2013). Reducing response elimination strategies enhances the construct validity of figural matrices. Intelligence, 41, 234–243. DOI: 10.1016/j.intell.2013.03.006.
  37. Arendasy, M.E., & Sommer, M. (2010). Evaluating the contribution of different item features to the effect size of the gender difference in three-dimensional mental rotation using automatic item generation. Intelligence, 38(6), 574–581. DOI:10.1016/j.intell.2010.06.004.
  38. Arendasy, M.E., Sommer, M., & Gittler, G. (2010). Combining automatic item generation and experimental designs to investigate the contribution of cognitive components to the gender difference in mental rotation. Intelligence, 38(5), 506–512. DOI:10.1016/j.intell.2010.06.006.
  39. Arendasy, M. (2005). Automatic generation of Rasch-calibrated items: figural matrices test GEOM and Endless-Loops Test EC. International Journal of Testing, 5(3), 197–224.
  40. Arendasy, M.E., & Sommer, M. (2005). The effect of different types of perceptual manipulations on the dimensionality of automatic generated figural matrices. Intelligence, 33(3), 307–324. DOI: 10.1016/j.intell.2005.02.002.
  41. Hofer, S. (2004). MatrixDeveloper. Münster, Germany: Psychological Institute IV. Westfälische Wilhelms-Universität.
  42. Freund, P.A., Hofer, S., & Holling, H. (2008). Explaining and controlling for the psychometric properties of computer-generated figural matrix items. Applied psychological measurement, 32(3), 195–210. DOI: 10.1177/0146621607306972.
  43. Embretson, S.E. (1998). A cognitive design system approach to generating valid tests: application to abstract reasoning. Psychological methods, 3(3), 380–396.
  44. Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for psychologists. Mahwah: Lawrence Erlbaum Associates.