Pronunciation assessment does not recognize unknown speech (as in dictation or automatic transcription). Instead, knowing the expected word(s) in advance or from a prior transcription, it attempts to verify the correctness of the learner's pronunciation and, ideally, their intelligibility to listeners,[5][6] sometimes along with prosodic features, often of little consequence, such as intonation, pitch, tempo, rhythm, and syllable and word stress.[7] Pronunciation assessment is also used in reading tutoring, for example in products such as Microsoft Teams[8] and from Amira Learning.[9] Automatic pronunciation assessment can also help diagnose and treat speech disorders such as apraxia.[10]
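The verification step can be illustrated with a minimal sketch: the phoneme sequence expected for a known prompt (for example, from a pronouncing dictionary) is compared against the phonemes recognized from the learner's audio, and the mismatch is turned into a score. This is an illustrative assumption-laden toy, not any particular product's method; the phoneme recognizer and dictionary lookup are presumed to exist elsewhere.

    def edit_distance(ref, hyp):
        """Levenshtein distance between two phoneme sequences."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                d[i][j] = min(d[i - 1][j] + 1,   # deletion
                              d[i][j - 1] + 1,   # insertion
                              d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
        return d[-1][-1]

    def pronunciation_score(expected_phonemes, recognized_phonemes):
        """Crude per-utterance score in [0, 1]; 1.0 is a perfect match."""
        errors = edit_distance(expected_phonemes, recognized_phonemes)
        return max(0.0, 1.0 - errors / len(expected_phonemes))

    # ARPAbet phonemes for the prompt "four" (CMUDICT: F AO R), with the
    # learner's vowel recognized as OW instead of AO:
    print(pronunciation_score(["F", "AO", "R"], ["F", "OW", "R"]))  # ~0.67

Real systems typically replace the hard edit distance with acoustic-likelihood measures such as goodness of pronunciation, but the overall shape of the computation, comparing against a known expected pronunciation, is similar.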
In 2022, researchers found that some newer speech-to-text systems, based on end-to-end deep learning that maps audio signals directly to words, produce word and phrase confidence scores closely correlated with genuine listener intelligibility.[26] In 2023, others assessed intelligibility using a dynamic time warping-based distance from Wav2Vec2 representations of good speech.[27] Further work through 2025 has focused specifically on measuring intelligibility.[28][29]
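A hedged sketch of the Wav2Vec2-plus-DTW idea follows: both utterances are embedded with a pretrained model from the Hugging Face transformers library and aligned with librosa's dynamic time warping, taking the length-normalized alignment cost as a lower-is-better intelligibility proxy. The model choice, cosine metric, and normalization here are illustrative assumptions, not necessarily the cited paper's exact recipe.

    import torch
    import librosa
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

    def embed(path):
        """Frame-level Wav2Vec2 representations, shape (dim, frames)."""
        wave, _ = librosa.load(path, sr=16000)
        inputs = extractor(wave, sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (frames, dim)
        return hidden.numpy().T

    def dtw_distance(learner_path, reference_path):
        """Length-normalized DTW cost between learner and good speech."""
        X, Y = embed(learner_path), embed(reference_path)
        D, warp_path = librosa.sequence.dtw(X=X, Y=Y, metric="cosine")
        return D[-1, -1] / len(warp_path)

    # print(dtw_distance("learner.wav", "reference.wav"))

Normalizing by the warping-path length keeps the score comparable across utterances of different durations, one plausible design choice among several.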
Evaluation
Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use to improve assessment quality.[30][31][32][33] Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of genuine intelligibility as evident from blinded listener transcriptions.[6] As of mid-2025, state-of-the-art approaches to automatic phoneme transcription typically achieve an error rate of about 10% on known good speech.[34][35][36][37]
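Figures like the roughly 10% above are conventionally phoneme error rates: the total edit distance between reference and hypothesized phoneme sequences divided by the total number of reference phonemes. Reusing the edit_distance helper sketched earlier, with made-up data:

    def phoneme_error_rate(pairs):
        """PER over (reference, hypothesis) phoneme-sequence pairs."""
        errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
        total = sum(len(ref) for ref, _ in pairs)
        return errors / total

    corpus = [(["F", "AO", "R"], ["F", "AO", "R"]),
              (["R", "OW"], ["R", "AO"])]
    print(phoneme_error_rate(corpus))  # 1 error over 5 phonemes = 0.2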
Ethical issues in pronunciation assessment are present in both human and automatic methods. Validity, fairness, and bias mitigation are all crucial in evaluation. Automatic pronunciation assessment models should be trained on diverse speech data, and combining human judgment with automated feedback can improve both accuracy and fairness.[38]
Second language learners benefit substantially from using common speech recognition systems for dictation, virtual assistants, and AI chatbots.[39] With such systems, users naturally try to correct their own errors when they notice them in the recognition results, a practice that improves their grammar and vocabulary development along with their pronunciation skills. The extent to which explicit pronunciation assessment and remediation improve on such self-directed interactions remains an open question.[39]
Recent developments
During 2021–22, a smartphone-based CAPT system was used to sense articulation through both audible and inaudible signals, providing feedback at the phoneme level.[40][41]
In 2024, audio-capable multimodal large language models were first described performing pronunciation assessment.[48] That work has since been carried forward by other researchers, who report positive results.[49][50]
In 2025, the authors of the Duolingo English Test published a description of their pronunciation assessment method, which they describe as built to measure intelligibility rather than accent imitation.[51] While it achieves a correlation of 0.82 with expert human ratings, close to inter-rater agreement and outperforming alternative methods, the method is nonetheless trained on experts' scores along the six-point CEFR common reference levels scale rather than on actual blinded listener transcriptions.[51]
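For concreteness, agreement figures like the 0.82 above are Pearson correlations between machine scores and expert ratings. A minimal sketch of how such agreement is measured, with made-up scores rather than any published data:

    import numpy as np

    expert = np.array([2, 3, 3, 4, 5, 6, 1, 4])   # CEFR-aligned ratings (1..6)
    machine = np.array([2.2, 2.8, 3.4, 4.1, 4.6, 5.8, 1.3, 3.7])

    r = np.corrcoef(expert, machine)[0, 1]
    print(f"Pearson r = {r:.2f}")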
Further promising work in 2025 includes assessment feedback that aligns learner speech to synthetic utterances using interpretable features, identifying contiguous spans of words for remediation feedback;[52] synthesizing corrected speech matching learners' self-perceived voices, which learners prefer and imitate more accurately as corrections;[53] and streaming such interactions.[54]
↑ El Kheir, Yassine; et al. (October 2023), Automatic Pronunciation Assessment — A Review, Conference on Empirical Methods in Natural Language Processing, arXiv:2310.13974, S2CID 264426545.
↑ O'Brien, Mary Grantham; et al. (December 2018). "Directions for the future of technology in pronunciation research and teaching". Journal of Second Language Pronunciation. 4 (2): 182–207. doi:10.1075/jslp.17001.obr. hdl:2066/199273. ISSN 2215-1931. S2CID 86440885. "pronunciation researchers are primarily interested in improving L2 learners' intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not. These data are essential to train ASR algorithms to assess L2 learners' intelligibility."
↑ Bernstein, Jared; et al. (November 1990), "Automatic Evaluation and Training in English Pronunciation" (PDF), First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan: International Speech Communication Association, pp. 1185–1188, retrieved 11 February 2023. "listeners differ considerably in their ability to predict unintelligible words.... Thus, it seems the quality rating is a more desirable... automatic-grading score." (Section 2.2.2.)
↑ Bonk, Bill (August 2020). "New innovations in assessment: Versant's Intelligibility Index score". Resources for English Language Learners and Teachers. Pearson English. Archived from the original on 2023-01-27. Retrieved 11 February 2023. "you don't need a perfect accent, grammar, or vocabulary to be understandable. In reality, you just need to be understandable with little effort by listeners."
↑ Gao, Yuan; et al. (May 2018). "Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art". 2nd IEEE Advanced Information Management, Communication, Electronic and Automation Control Conference (IMCEC 2018). pp. 924–927. arXiv:1709.01713. doi:10.1109/IMCEC.2018.8469649. ISBN 978-1-5386-1803-5. S2CID 31125681.
↑ Alnafisah, Mutleb (September 2022). "Technology Review: Speechace". Proceedings of the 12th Pronunciation in Second Language Learning and Teaching Conference (Virtual PSLLT), no. 40, vol. 12. St. Catharines, Ontario. ISSN 2380-9566. Retrieved 14 February 2023.
↑ E.g., CMUDICT, "The CMU Pronouncing Dictionary". www.speech.cs.cmu.edu. Retrieved 15 February 2023. Compare "four" given as "F AO R" with the vowel AO as in "caught," to "row" given as "R OW" with the vowel OW as in "oat." This mistake is due to the "horse–hoarse merger," often called the "north–force merger."
↑ Menzel, Wolfgang; et al. (May 2000). "The ISLE Corpus of Non-Native Spoken English". Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00). Athens, Greece: European Language Resources Association (ELRA). Retrieved 13 August 2025.
↑ Yeo, Eunjung (October 2022). "wav2vec2-large-english-TIMIT-phoneme_v3". huggingface.co. Seoul National University Spoken Language Processing Lab. Retrieved 19 August 2025.
↑ Mallela, Jhansi; Aluru, Sai Harshitha; Yarra, Chiranjeevi (February 2024). Exploring the Use of Self-Supervised Representations for Automatic Syllable Stress Detection. National Conference on Communications. Chennai, India. pp. 1–6. doi:10.1109/NCC60321.2024.10486028.
↑ Fu, Kaiqi; et al. (July 2024). "Pronunciation Assessment with Multi-modal Large Language Models". arXiv:2407.09209 [cs.CL]. Note that Speak.com produced an earlier commercial system that they had not described in technical detail.
↑ Ma, Rao; et al. (May 2025). "Assessment of L2 Oral Proficiency using Speech Large Language Models". arXiv:2505.21148 [cs.CL].