Pronunciation assessment

Automatic pronunciation assessment is the use of speech recognition to verify the correctness of pronounced speech, [1] as distinguished from manual assessment by an instructor or proctor. [2] Also called speech verification, pronunciation evaluation, and pronunciation scoring, its main application is computer-aided pronunciation teaching (CAPT) when combined with computer-aided instruction for computer-assisted language learning (CALL), speech remediation, or accent reduction. Unlike dictation or automatic transcription, pronunciation assessment does not determine unknown speech; instead, knowing the expected word(s) in advance, it attempts to verify the correctness of the learner's pronunciation and, ideally, their intelligibility to listeners, [3] [4] sometimes along with prosodic features such as intonation, pitch, tempo, rhythm, and stress, which are often inconsequential. [5] Pronunciation assessment is also used in reading tutoring, for example in Microsoft Teams [6] and in products from Amira Learning. [7] It can also be used to help diagnose and treat speech disorders such as apraxia. [8]
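
At its core, the verification step compares the words a recognizer heard against the words the learner was expected to say. The following is a minimal sketch of that idea in Python; the word-level granularity, alignment strategy, and scoring are illustrative simplifications, and the transcript is assumed to come from some external speech recognizer rather than any particular product's API.

    # Minimal word-verification sketch: align an ASR transcript against the
    # prompt the learner was asked to read, and flag words that were not
    # recognized as expected (a rough proxy for mispronunciation).
    from difflib import SequenceMatcher

    def verify_pronunciation(expected: str, recognized: str) -> dict:
        """Mark each expected word as matched or flagged, given an ASR transcript."""
        exp = expected.lower().split()
        rec = recognized.lower().split()
        flagged = set(range(len(exp)))          # start with every word flagged
        for block in SequenceMatcher(a=exp, b=rec).get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                flagged.discard(i)              # word was recognized as expected
        return {
            "score": 1.0 - len(flagged) / max(len(exp), 1),
            "flagged_words": [exp[i] for i in sorted(flagged)],
        }

    # Example: the learner read "The quick brown fox" but the recognizer
    # heard "box" instead of "fox".
    print(verify_pronunciation("The quick brown fox", "the quick brown box"))
    # {'score': 0.75, 'flagged_words': ['fox']}

Real systems typically work at the phoneme rather than the word level, using forced alignment against the expected pronunciation, [31] but the overall shape of the task is the same: score the match between what was expected and what was heard.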

The earliest work on pronunciation assessment avoided measuring genuine listener intelligibility, [9] a shortcoming corrected in 2011 at the Toyohashi University of Technology. [10] Intelligibility measurement has since been included in the Versant high-stakes English fluency assessment from Pearson [11] and in mobile apps from 17zuoye Education & Technology, [12] but as of 2023 it was still missing from products by Google Search, [13] Microsoft, [14] Educational Testing Service, [15] Speechace, [16] and ELSA. [17] Assessing authentic listener intelligibility is essential for avoiding inaccuracies arising from accent bias, especially in high-stakes assessments; [18] [19] [20] from words with multiple correct pronunciations; [21] and from phoneme coding errors in machine-readable pronunciation dictionaries. [22] In 2022, researchers found that some newer speech-to-text systems, based on end-to-end deep learning that maps audio signals directly into words, produce word and phrase confidence scores that correlate closely with genuine listener intelligibility. [23] In the Common European Framework of Reference for Languages (CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels. [24]
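
The confidence-score finding can be illustrated with a short sketch: if a recognizer emits a confidence value for each word it decodes, a simple aggregate of those values can serve as an utterance-level intelligibility estimate. The per-word confidences and the use of a plain mean below are assumptions for illustration, not the published method of [23].

    # Illustrative intelligibility proxy: aggregate per-word ASR confidence
    # scores into one utterance-level estimate. The confidence values here
    # are invented; a real system would take them from recognizer output.

    def intelligibility_proxy(word_confidences: list[float]) -> float:
        """Mean word confidence as a rough stand-in for listener intelligibility."""
        if not word_confidences:
            return 0.0
        return sum(word_confidences) / len(word_confidences)

    # Hypothetical confidences for a four-word utterance: the recognizer
    # was unsure about the third word, which drags the estimate down.
    print(round(intelligibility_proxy([0.97, 0.94, 0.41, 0.92]), 2))  # 0.81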

Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use in improving assessment quality. [25] [26] Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of the genuine intelligibility evident from blinded listener transcriptions. [4] Promising areas of improvement under development in 2023 include articulatory feature extraction [27] and transfer learning to suppress unnecessary corrections. [28] Other advances under development include "augmented reality" interfaces for mobile devices that use optical character recognition to provide pronunciation training on text found in the user's environment. [29] [30]
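
Lacking a standard benchmark, a common evaluation recipe with corpora such as these is to correlate a system's scores with human ratings of the same utterances. A minimal sketch, with both score lists invented for illustration:

    # Sketch of a common evaluation recipe: Pearson correlation between
    # machine pronunciation scores and human ratings of the same utterances.
    from math import sqrt

    def pearson(xs: list[float], ys: list[float]) -> float:
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    machine_scores = [0.81, 0.42, 0.95, 0.60, 0.73]  # system output (invented)
    human_ratings = [0.85, 0.50, 0.90, 0.55, 0.70]   # listener ratings (invented)
    print(f"Pearson r = {pearson(machine_scores, human_ratings):.3f}")

A high correlation indicates that the system's scores track human judgments; corpora such as speechocean762 ship with per-utterance human ratings for exactly this purpose. [26]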

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

In sociolinguistics, an accent is a way of pronouncing a language that is distinctive to a country, area, social class, or individual. An accent may be identified with the locality in which its speakers reside, the socioeconomic status of its speakers, their ethnicity, their caste or social class, or influence from their first language.

Dysarthria is a speech sound disorder resulting from neurological injury of the motor component of the motor–speech system and is characterized by poor articulation of phonemes. In other words, it is a condition in which the muscles that help produce speech do not work effectively, often making it very difficult to pronounce words. It is unrelated to problems with understanding language, although a person can have both. Any of the speech subsystems can be affected, leading to impairments in intelligibility, audibility, naturalness, and efficiency of vocal communication. Dysarthria that has progressed to a total loss of speech is referred to as anarthria. The term dysarthria is from Neo-Latin, dys- "dysfunctional, impaired" and arthr- "joint, vocal articulation".


Augmentative and alternative communication (AAC) encompasses the communication methods used to supplement or replace speech or writing for those with impairments in the production or comprehension of spoken or written language. AAC is used by those with a wide range of speech and language impairments, including congenital impairments such as cerebral palsy, intellectual impairment and autism, and acquired conditions such as amyotrophic lateral sclerosis and Parkinson's disease. AAC can be a permanent addition to a person's communication or a temporary aid. Stephen Hawking, probably the best-known user of AAC, had amyotrophic lateral sclerosis, and communicated through a speech-generating device.

In speech communication, intelligibility is a measure of how comprehensible speech is in given conditions. Intelligibility is affected by the level and quality of the speech signal, the type and level of background noise, reverberation, and, for speech over communication devices, the properties of the communication system. A common standard measurement for the quality of the intelligibility of speech is the Speech Transmission Index (STI). The concept of speech intelligibility is relevant to several fields, including phonetics, human factors, acoustical engineering, and audiometry.
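
A simple word-level operationalization of intelligibility (a deliberate simplification, unrelated to the STI) scores an utterance by the fraction of its intended words that listeners transcribe correctly:

    # Illustrative word-level intelligibility measure: the fraction of the
    # intended words that listeners wrote down correctly, averaged across
    # listeners. Transcripts are invented; position-by-position comparison
    # is a simplification of proper transcript alignment.

    def word_intelligibility(intended: str, transcripts: list[str]) -> float:
        target = intended.lower().split()
        rates = []
        for t in transcripts:
            heard = t.lower().split()
            hits = sum(1 for a, b in zip(target, heard) if a == b)
            rates.append(hits / len(target))
        return sum(rates) / len(rates)

    listeners = ["the cat sat on the mat", "the cat sat on a mat"]
    print(round(word_intelligibility("the cat sat on the mat", listeners), 2))  # 0.92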

The Versant suite of tests comprises computerized tests of spoken language available from Pearson PLC. Versant tests were the first fully automated tests of spoken language to use advanced speech processing technology to assess the spoken language skills of non-native speakers. The suite includes tests of English, Spanish, Dutch, French, and Arabic. Versant technology has also been applied to the assessment of Aviation English, children's oral reading assessment, and adult literacy assessment.

TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Each transcribed element has been delineated in time.

A non-native speech database is a speech database of non-native pronunciations of English. Such databases are used in the development of: multilingual automatic speech recognition systems, text to speech systems, pronunciation trainers, and second language learning systems.

Fluency refers to continuity, smoothness, rate, and effort in speech production. It is also used to characterize language production, language ability or language proficiency.

RWTH ASR is a proprietary speech recognition toolkit.


Deep learning is a subset of machine learning methods based on artificial neural networks (ANNs) with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods can be supervised, semi-supervised, or unsupervised.
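
The role of "multiple layers" can be made concrete with a tiny forward pass, sketched below with random (untrained) weights; the layer sizes are arbitrary.

    # "Deep" means stacked layers: a tiny two-hidden-layer forward pass.
    # Weights are random for illustration; training would adjust them.
    import numpy as np

    rng = np.random.default_rng(0)

    def layer(inp: np.ndarray, n_out: int) -> np.ndarray:
        W = rng.normal(size=(n_out, inp.size))   # random weight matrix
        b = rng.normal(size=n_out)               # random bias vector
        return np.maximum(0.0, W @ inp + b)      # ReLU activation

    x = rng.normal(size=4)   # input features
    h1 = layer(x, 8)         # first hidden layer
    h2 = layer(h1, 8)        # second hidden layer: depth comes from stacking
    y = layer(h2, 1)         # output layer
    print(y)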

Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing.

The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.


Peter John Roach is a British retired phonetician. He taught at the Universities of Leeds and Reading, and is best known for his work on the pronunciation of British English.

Beijing Unisound Information Technology Co., Ltd., often shortened to Unisound, is a Chinese technology company based in Beijing. It is a unicorn startup specialising in speech recognition and artificial intelligence services applicable to a variety of industries.

An audio deepfake is a type of artificial intelligence used to create convincing synthetic speech that sounds like specific people saying things they did not say. The technology was initially developed for applications intended to improve human life; for example, it can be used to produce audiobooks and to help people who have lost their voices regain the ability to speak. Commercially, it has opened the door to several opportunities, including more personalized digital assistants, natural-sounding text-to-speech, and speech translation services.

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.
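
A minimal transcription call with the open-source whisper Python package is shown below; the model size and audio file name are placeholders.

    # Minimal use of the open-source whisper package for transcription.
    # "base" is one of several model sizes; "recording.mp3" is a placeholder.
    import whisper

    model = whisper.load_model("base")           # downloads weights on first use
    result = model.transcribe("recording.mp3")   # language is auto-detected
    print(result["text"])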

References

  1. Ehsani, Farzad; Knodt, Eva (July 1998). "Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm". Language Learning & Technology. University of Hawaii National Foreign Language Resource Center; Michigan State University Center for Language Education and Research. 2 (1): 54–73. Retrieved 11 February 2023.
  2. Isaacs, Talia; Harding, Luke (July 2017). "Pronunciation assessment". Language Teaching. 50 (3): 347–366. doi: 10.1017/S0261444817000118 . ISSN   0261-4448. S2CID   209353525.
  3. Loukina, Anastassia; et al. (September 6, 2015), "Pronunciation accuracy and intelligibility of non-native speech" (PDF), INTERSPEECH 2015, Dresden, Germany: International Speech Communication Association, pp. 1917–1921, only 16% of the variability in word-level intelligibility can be explained by the presence of obvious mispronunciations.
  4. O’Brien, Mary Grantham; et al. (31 December 2018). "Directions for the future of technology in pronunciation research and teaching". Journal of Second Language Pronunciation. 4 (2): 182–207. doi:10.1075/jslp.17001.obr. hdl:2066/199273. ISSN   2215-1931. S2CID   86440885. pronunciation researchers are primarily interested in improving L2 learners' intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not. These data are essential to train ASR algorithms to assess L2 learners' intelligibility.
  5. Eskenazi, Maxine (January 1999). "Using automatic speech processing for foreign language pronunciation tutoring: Some issues and a prototype". Language Learning & Technology. 2 (2): 62–76. Retrieved 11 February 2023.
  6. Tholfsen, Mike (9 February 2023). "Reading Coach in Immersive Reader plus new features coming to Reading Progress in Microsoft Teams". Techcommunity Education Blog. Microsoft. Retrieved 12 February 2023.
  7. Banerji, Olina (7 March 2023). "Schools Are Using Voice Technology to Teach Reading. Is It Helping?". EdSurge News. Retrieved 7 March 2023.
  8. Hair, Adam; et al. (19 June 2018). "Apraxia world: A speech therapy game for children with speech sound disorders". Proceedings of the 17th ACM Conference on Interaction Design and Children (PDF). pp. 119–131. doi:10.1145/3202185.3202733. ISBN   9781450351522. S2CID   13790002.
  9. Bernstein, Jared; et al. (November 18, 1990), "Automatic Evaluation and Training in English Pronunciation" (PDF), First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan: International Speech Communication Association, pp. 1185–1188, retrieved 11 February 2023, listeners differ considerably in their ability to predict unintelligible words.... Thus, it seems the quality rating is a more desirable... automatic-grading score. (Section 2.2.2.)
  10. Hiroshi, Kibishi; Nakagawa, Seiichi (August 28, 2011), "New feature parameters for pronunciation evaluation in English presentations at international conferences" (PDF), INTERSPEECH 2011, Florence, Italy: International Speech Communication Association, pp. 1149–1152, retrieved 11 February 2023, we investigated the relationship between pronunciation score / intelligibility and various acoustic measures, and then combined these measures.... As far as we know, the automatic estimation of intelligibility has not yet been studied.
  11. Bonk, Bill (25 August 2020). "New innovations in assessment: Versant's Intelligibility Index score". Resources for English Language Learners and Teachers. Pearson English. Archived from the original on 2023-01-27. Retrieved 11 February 2023. you don't need a perfect accent, grammar, or vocabulary to be understandable. In reality, you just need to be understandable with little effort by listeners.
  12. Gao, Yuan; et al. (May 25, 2018), "Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art", 2nd IEEE Advanced Information Management, Communication, Electronic and Automation Control Conference (IMCEC 2018), arXiv: 1709.01713 , doi:10.1109/IMCEC.2018.8469649, S2CID   31125681
  13. Snir, Tal (14 November 2019). "How do you pronounce quokka? Practice with Search". The Keyword. Google. Retrieved 11 February 2023.
  14. "Pronunciation assessment tool". Azure Cognitive Services Speech Studio. Microsoft. Retrieved 11 February 2023.
  15. Chen, Lei; et al. (December 2018). Automated Scoring of Nonnative Speech: Using the SpeechRater v. 5.0 Engine. ETS Research Report Series. Vol. 2018. Princeton, NJ: Educational Testing Service. pp. 1–31. doi:10.1002/ets2.12198. ISSN   2330-8516. S2CID   69925114 . Retrieved 11 February 2023.
  16. Alnafisah, Mutleb (September 2022), "Technology Review: Speechace", Proceedings of the 12th Pronunciation in Second Language Learning and Teaching Conference (Virtual PSLLT), no. 40, vol. 12, St. Catharines, Ontario, ISSN   2380-9566, retrieved 14 February 2023
  17. Gorham, Jon; et al. (March 10, 2022). Speech Recognition for English Language Learning (video). Technology in Language Teaching and Learning. Education Solutions. Retrieved 2023-02-14.
  18. "Computer says no: Irish vet fails oral English test needed to stay in Australia". The Guardian. Australian Associated Press. 8 August 2017. Retrieved 12 February 2023.
  19. Ferrier, Tracey (9 August 2017). "Australian ex-news reader with English degree fails robot's English test". The Sydney Morning Herald. Retrieved 12 February 2023.
  20. Main, Ed; Watson, Richard (9 February 2022). "The English test that ruined thousands of lives". BBC News. Retrieved 12 February 2023.
  21. Joyce, Katy Spratte (January 24, 2023). "13 Words That Can Be Pronounced Two Ways". Reader's Digest. Retrieved 23 February 2023.
  22. E.g., CMUDICT, "The CMU Pronouncing Dictionary". www.speech.cs.cmu.edu. Retrieved 15 February 2023. Compare "four" given as "F AO R" with the vowel AO as in "caught," to "row" given as "R OW" with the vowel OW as in "oat."
  23. Tu, Zehai; Ma, Ning; Barker, Jon (2022). "Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction" (PDF). Proc. Interspeech 2022. INTERSPEECH 2022. ISCA. pp. 3493–3497. doi:10.21437/Interspeech.2022-10408 . Retrieved 17 December 2023.
  24. Common European framework of reference for languages learning, teaching, assessment: Companion volume with new descriptors. Language Policy Programme, Education Policy Division, Education Department, Council of Europe. February 2018. p. 136. OCLC   1090351600.
  25. Vidal, Jazmín; et al. (15 September 2019), "EpaDB: A Database for Development of Pronunciation Assessment Systems" (PDF), Interspeech 2019, pp. 589–593, doi:10.21437/Interspeech.2019-1839, hdl: 11336/161618 , S2CID   202742421 , retrieved 19 February 2023; database .zip file.
  26. Zhang, Junbo; et al. (30 August 2021), "speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment" (PDF), Interspeech 2021, pp. 3710–3714, arXiv: 2104.01378 , doi:10.21437/Interspeech.2021-1259, S2CID   233025050 , retrieved 19 February 2023; GitHub corpus repository.
  27. Wu, Peter; et al. (14 February 2023), "Speaker-Independent Acoustic-to-Articulatory Speech Inversion", arXiv: 2302.06774 [eess.AS]
  28. Sancinetti, Marcelo; et al. (23 May 2022). "A Transfer Learning Approach for Pronunciation Scoring". ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6812–6816. arXiv: 2111.00976 . doi:10.1109/ICASSP43922.2022.9747727. ISBN   978-1-6654-0540-9. S2CID   249437375.
  29. Che Dalim, Che Samihah; et al. (February 2020). "Using augmented reality with speech input for non-native children's language learning" (PDF). International Journal of Human-Computer Studies. 134: 44–64. doi:10.1016/j.ijhcs.2019.10.002. S2CID   208098513 . Retrieved 28 February 2023.
  30. Tolba, Rahma M.; et al. (2023). "Mobile Augmented Reality for Learning Phonetics: A Review (2012–2022)". Extended Reality and Metaverse. Springer Proceedings in Business and Economics. Springer International Publishing: 87–98. doi:10.1007/978-3-031-25390-4_7. ISBN   978-3-031-25389-8 . Retrieved 28 February 2023.
  31. Mathad, Vikram C.; et al. (2021). "The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation" (PDF). 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH 2021). International Speech Communication Association. pp. 176–180. doi:10.21437/interspeech.2021-1403. ISBN   9781713836902. S2CID   239694157 . Retrieved 10 March 2023.