Non-native speech database

A non-native speech database is a speech database of non-native pronunciations of a target language, most commonly English. Such databases are used in the development of multilingual automatic speech recognition systems, text-to-speech systems, pronunciation trainers, and second-language learning systems. [1]

List

Table 1: Abbreviations for languages used in Table 2
Arabic       A      Japanese     J
Chinese      C      Korean       K
Czech        Cze    Malaysian    M
Danish       D      Norwegian    N
Dutch        Dut    Portuguese   P
English      E      Russian      R
French       F      Spanish      S
German       G      Swedish      Swe
Greek        Gre    Thai         T
Indonesian   Ind    Vietnamese   V
Italian      I


Table 2 gives the information about the individual databases.

Table 2: Overview of non-native databases
Corpus | Author | Available at | Languages | #Speakers | Native language | #Utt. | Duration | Date | Remarks
AMI [2] |  | EU | E |  | Dut and other |  | 100h |  | meeting recordings
ATR-Gruhn [3] | Gruhn | ATR | E | 96 | C G F J Ind | 15000 |  | 2004 | proficiency rating
BAS Strange Corpus 1+10 [4] |  | ELRA | G | 139 | 50 countries | 7500 |  | 1998 |
Berkeley Restaurant [5] |  | ICSI | E | 55 | G I H C F S J | 2500 |  | 1994 |
Broadcast News [6] |  | LDC | E |  |  |  |  | 1997 |
Cambridge-Witt [7] | Witt | U. Cambridge | E | 10 | J I K S | 1200 |  | 1999 |
Cambridge-Ye [8] | Ye | U. Cambridge | E | 20 | C | 1600 |  | 2005 |
Children News [9] | Tomokiyo | CMU | E | 62 | J C | 7500 |  | 2000 | partly spontaneous
CLIPS-IMAG [10] | Tan | CLIPS-IMAG | F | 15 | C V |  | 6h | 2006 |
CLSU [11] |  | LDC | E |  | 22 countries | 5000 |  | 2007 | telephone, spontaneous
CMU [12] |  | CMU | E | 64 | G | 452 | 0.9h |  | not available
Cross Towns [13] | Schaden | U. Bochum | E F G I Cze Dut | 161 | E F G I S | 72000 | 133h | 2006 | city names
Duke-Arslan [14] | Arslan | Duke University | E | 93 | 15 countries | 2200 |  | 1995 | partly telephone speech
ERJ [15] | Minematsu | U. Tokyo | E | 200 | J | 68000 |  | 2002 | proficiency rating
Fischer [16] |  | LDC | E |  | many |  | 200h |  | telephone speech
Fitt [17] | Fitt | U. Edinburgh | F I N Gre | 10 | E | 700 |  | 1995 | city names
Fraenki [18] |  | U. Erlangen | E | 19 | G | 2148 |  |  |
Hispanic [19] | Byrne |  | E | 22 | S |  | 20h | 1998 | partly spontaneous
HLTC [20] |  | HKUST | E | 44 | C |  | 3h | 2010 | available on request
IBM-Fischer [21] |  | IBM | E | 40 | S F G I | 2000 |  | 2002 | digits
iCALL [22][23] | Chen | I2R, A*STAR | C | 305 | 24 countries | 90841 | 142h | 2015 | phonetic and tonal transcriptions (in Pinyin), proficiency ratings
ISLE [24] | Atwell | EU/ELDA | E | 46 | G I | 4000 | 18h | 2000 |
Jupiter [25] | Zue | MIT | E | unknown | unknown | 5146 |  | 1999 | telephone speech
K-SEC [26] | Rhee | SiTEC | E | unknown | K |  |  | 2004 |
LDC WSJ1 [27] |  | LDC |  | 10 |  | 800 | 1h | 1994 |
LeaP [28] | Gut | University of Münster | E G | 127 | 41 different ones | 73,941 words | 12h | 2003 |
MIST [29] |  | ELRA | E F G | 75 | Dut | 2200 |  | 1996 |
NATO HIWIRE [30] |  | NATO | E | 81 | F Gre I S | 8100 |  | 2007 | clean speech
NATO M-ATC [31] | Pigeon | NATO | E | 622 | F G I S | 9833 | 17h | 2007 | heavy background noise
NATO N4 [32] |  | NATO | E | 115 | unknown |  | 7.5h | 2006 | heavy background noise
Onomastica [33] |  |  | D Dut E F G Gre I N P S Swe |  |  | (121000) |  | 1995 | only lexicon
PF-STAR [34] |  | U. Erlangen | E | 57 | G | 4627 | 3.4h | 2005 | children's speech
Sunstar [35] |  | EU | E | 100 | G S I P D | 40000 |  | 1992 | parliament speech
TC-STAR [36] | Heuvel | ELDA | E S | unknown | EU countries |  | 13h | 2006 | multiple data sets
TED [37] | Lamel | ELDA | E | 40 (188) | many |  | 10h (47h) | 1994 | Eurospeech 93
TLTS [38] |  | DARPA | A E |  |  |  | 1h | 2004 |
Tokyo-Kikuko [39] |  | U. Tokyo | J | 140 | 10 countries | 35000 |  | 2004 | proficiency rating
Verbmobil [40] |  | U. Munich | E | 44 | G |  | 1.5h | 1994 | very spontaneous
VODIS [41] |  | EU | F G | 178 | F G | 2500 |  | 1998 | about car navigation
WP Arabic [42] | Rocca | LDC | A | 35 | E | 800 | 1h | 2002 |
WP Russian [43] | Rocca | LDC | R | 26 | E | 2500 | 2h | 2003 |
WP Spanish [44] | Morgan | LDC | S |  | E |  |  | 2006 |
WSJ Spoke [45] |  |  | E | 10 | unknown | 800 |  | 1993 |


Legend

The table of non-native databases uses some abbreviations for language names; they are listed in Table 1. Table 2 gives the following information about each corpus: the name of the corpus, the institution from which the corpus can be obtained (or at least where further information should be available), the language actually spoken by the speakers, the number of speakers, the native language of the speakers, the total number of non-native utterances the corpus contains, the duration in hours of the non-native part, the date of the first public reference to the corpus, free-text remarks highlighting special aspects of the database, and a reference to a publication. The reference in the last field is in most cases to the paper by the original collectors that is specifically devoted to describing the corpus. Where no such paper could be identified, a paper that uses the corpus is referenced instead.

Some entries are left blank, while others are marked "unknown". The difference is that blank entries refer to attributes whose value is simply not known, whereas "unknown" entries indicate that no information about the attribute is available in the database itself. For example, the Jupiter weather database [46] gives no information about the origin of its speakers; such data is therefore less useful for experiments on accent detection or similar tasks.
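To make the blank/"unknown" distinction concrete, one way to represent a Table 2 row in code is to distinguish an attribute that is merely not filled in here (None) from one that the corpus itself leaves undocumented (an explicit sentinel). This is a minimal, hypothetical Python sketch; the field names simply mirror the column headers and are not part of any released corpus format.

```python
from dataclasses import dataclass
from typing import Optional

# Sentinel: the corpus itself documents nothing about this attribute
# (distinct from None, which only means the value is not known here).
UNKNOWN = "unknown"

@dataclass
class CorpusEntry:
    """One row of Table 2; all field names are illustrative."""
    name: str
    available_at: Optional[str] = None
    languages: Optional[str] = None
    n_speakers: Optional[str] = None      # a count, or UNKNOWN
    native_language: Optional[str] = None
    n_utterances: Optional[int] = None
    duration_hours: Optional[float] = None
    date: Optional[int] = None
    remarks: Optional[str] = None

# Jupiter documents nothing about its speakers' origin, so those fields
# are UNKNOWN rather than simply left blank:
jupiter = CorpusEntry(name="Jupiter", available_at="MIT", languages="E",
                      n_speakers=UNKNOWN, native_language=UNKNOWN,
                      n_utterances=5146, date=1999,
                      remarks="telephone speech")
```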

Where possible, the name given is the standard name of the corpus. For some of the smaller corpora, however, no established name existed, and an identifier had to be created; in such cases, a combination of the institution and the collector of the database is used.

Where a database contains both native and non-native speech, only the attributes of the non-native part are listed. Most of the corpora are collections of read speech. If a corpus instead consists partly or completely of spontaneous utterances, this is mentioned in the Remarks column.

Related Research Articles

Received Pronunciation (RP) is the accent traditionally regarded as standard for British English. For over a century there has been argument over such questions as the definition of RP, whether it is geographically neutral, how many speakers there are, whether sub-varieties exist, how appropriate a choice it is as a standard and how the accent has changed over time. The name itself is controversial. RP is an accent, so the study of RP is concerned only with matters of pronunciation; other areas relevant to the study of language standards such as vocabulary, grammar and style are not considered.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling computers to recognize and translate spoken language into text, with the main benefit of searchability. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT). It incorporates knowledge and research from the computer science, linguistics, and computer engineering fields. The reverse process is speech synthesis.

Spoken English shows great variation across regions where it is the predominant language. For example, the United Kingdom has the largest variation of accents of any country in the world, so no single "British accent" exists. This article provides an overview of the numerous identifiable variations in pronunciation; such distinctions usually derive from the phonetic inventory of local dialects, as well as from broader differences in the Standard English of different primarily English-speaking populations.

Jamaican English, including Jamaican Standard English, is a variety of English native to Jamaica and is the official language of the country. A distinction exists between Jamaican English and Jamaican Patois, though not entirely a sharp distinction so much as a gradual continuum between two extremes. Jamaican English tends to follow British English spelling conventions.

Non-native pronunciations of English result from the common linguistic phenomenon in which non-native users of any language tend to carry the intonation, phonological processes and pronunciation rules from their first language or first languages into their English speech. They may also create innovative pronunciations for English sounds not found in the speaker's first language.

In sociolinguistics, an accent is a manner of pronunciation peculiar to a particular individual, location, or nation. An accent may be identified with the locality in which its speakers reside, the socioeconomic status of its speakers, their ethnicity, their caste or social class, or influence from their first language.

Scottish English is the set of varieties of the English language spoken in Scotland. The transregional, standardised variety is called Scottish Standard English or Standard Scottish English (SSE). Scottish Standard English may be defined as "the characteristic speech of the professional class [in Scotland] and the accepted norm in schools". The IETF language tag for Scottish Standard English is en-scotland.

The English language in Northern England has been shaped by the region's history of settlement and migration, and today encompasses a group of related dialects known as Northern England English. Historically, the strongest influence on the varieties of the English language spoken in Northern England was the Northumbrian dialect of Old English, but contact with Old Norse during the Viking Age and with Irish English following the Great Famine have produced new and distinctive styles of speech. Some "Northern" traits can be found further south than others: only conservative Northumbrian dialects retain the pre-Great Vowel Shift pronunciation of words such as town, but all northern accents lack the FOOT-STRUT split, and this trait extends a significant distance into the Midlands.

Australian English (AuE) is a non-rhotic variety of English spoken by most native-born Australians. Phonologically, it is one of the most regionally homogeneous language varieties in the world. As with most dialects of English, it is distinguished primarily by its vowel phonology.

TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Each transcribed element has been delineated in time.
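TIMIT's time alignments are stored as plain text: each line of a phone-level label file (.PHN) holds a start sample, an end sample, and a label, with the audio sampled at 16 kHz. A minimal reader might look like the following sketch; the path in the usage comment is purely illustrative.

```python
from typing import List, Tuple

SAMPLE_RATE = 16000  # TIMIT audio is sampled at 16 kHz

def read_phn(path: str) -> List[Tuple[float, float, str]]:
    """Parse a TIMIT-style label file with lines of the form
    'start_sample end_sample label'; return (start_sec, end_sec, label)."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, label = line.split()
            segments.append((int(start) / SAMPLE_RATE,
                             int(end) / SAMPLE_RATE,
                             label))
    return segments

# Hypothetical usage:
# for start, end, phone in read_phn("TIMIT/TRAIN/DR1/FCJF0/SA1.PHN"):
#     print(f"{start:.3f}-{end:.3f}s {phone}")
```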

A speech corpus is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models. In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

The Buckeye Corpus of conversational speech is a speech corpus created by a team of linguists and psychologists at Ohio State University led by Prof. Mark Pitt. It contains high-quality recordings from 40 speakers in Columbus, Ohio conversing freely with an interviewer. The interviewer's voice is heard only faintly in the background of these recordings. The sessions were conducted as sociolinguistic interviews, and are essentially monologues. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software. Software for searching the transcription files is also available at the project web site. The corpus is available to researchers in academia and industry.

Speaker adaptation is an important technology for fine-tuning either features or speech models to compensate for mismatch due to inter-speaker variation. Over the last decade, eigenvoice (EV) speaker adaptation has been developed. It makes use of prior knowledge about the training speakers to provide a fast adaptation algorithm. Inspired by the kernel eigenface idea in face recognition, the kernel eigenvoice (KEV) has been proposed as a non-linear generalization of EV. It incorporates kernel principal component analysis, a non-linear version of principal component analysis, to capture higher-order correlations and thereby further explore the speaker space and enhance recognition performance.
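As a rough illustration of the linear eigenvoice idea (not of any specific published system), the sketch below builds eigenvoices by PCA over training-speaker supervectors and adapts to a new speaker by estimating a small set of weights. Real systems derive supervectors from HMM/GMM means and estimate the weights by maximum likelihood rather than plain least squares, and KEV would replace the PCA step with kernel PCA; all names and the toy data here are hypothetical.

```python
import numpy as np

def eigenvoices(train_supervectors: np.ndarray, k: int):
    """PCA over training-speaker supervectors (one per row):
    return the mean voice and the top-k eigenvoices."""
    mean = train_supervectors.mean(axis=0)
    # Right-singular vectors of the centered matrix are the eigenvoices.
    _, _, vt = np.linalg.svd(train_supervectors - mean, full_matrices=False)
    return mean, vt[:k]

def adapt(mean: np.ndarray, voices: np.ndarray, observed: np.ndarray):
    """Constrain the new speaker to the span of the eigenvoices:
    solve 'observed ~ mean + w @ voices' for the weights w by least
    squares, then return the reconstructed (adapted) supervector."""
    w, *_ = np.linalg.lstsq(voices.T, observed - mean, rcond=None)
    return mean + w @ voices

# Toy usage: 20 training speakers, 50-dimensional supervectors.
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 50))
mean, voices = eigenvoices(train, k=5)
new_speaker_stats = rng.normal(size=50)   # stands in for adaptation data
adapted = adapt(mean, voices, new_speaker_stats)
```

Because only k weights are estimated instead of the full model, adaptation can work from very little speech from the new speaker, which is the main appeal of the eigenvoice approach.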

Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. This differs from phrase translation, which is where the system only translates a fixed and finite set of phrases that have been manually entered into the system. Speech translation technology enables speakers of different languages to communicate. It thus is of tremendous value for humankind in terms of science, cross-cultural exchange and global business.

Rhoticity in English is the pronunciation of the historical rhotic consonant /r/ by English speakers. The presence or absence of rhoticity is one of the most prominent distinctions by which varieties of English can be classified. In rhotic varieties, the historical English /r/ sound is preserved in all pronunciation contexts. In non-rhotic varieties, speakers no longer pronounce /r/ in postvocalic environments, that is, when it is immediately after a vowel and not followed by another vowel. For example, in isolation, a rhotic English speaker pronounces the words hard and butter as /ˈhɑːrd/ and /ˈbʌtər/, whereas a non-rhotic speaker "drops" or "deletes" the /r/ sound, pronouncing them as /ˈhɑːd/ and /ˈbʌtə/. When an r is at the end of a word but the next word begins with a vowel, as in the phrase "better apples", most non-rhotic speakers will pronounce the /r/ in that position, since it is followed by a vowel in this case.

Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing.

The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.

The Persian Speech Corpus is a Modern Persian speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of about 2.5 hours of Persian speech aligned with recorded speech on the phoneme level, including annotations of word boundaries. Previous spoken corpora of Persian include FARSDAT, which consists of read aloud speech from newspaper texts from 100 Persian speakers and the Telephone FARsi Spoken language DATabase (TFARSDAT) which comprises seven hours of read and spontaneous speech produced by 60 native speakers of Persian from ten regions of Iran.

Speechmatics

Speechmatics is a technology company based in Cambridge, England, which develops automatic speech recognition (ASR) software based on recurrent neural networks and statistical language modelling. Speechmatics was originally named Cantab Research Ltd when founded in 2006 by speech recognition specialist Dr. Tony Robinson.

References

  1. M. Raab, R. Gruhn and E. Noeth, Non-Native speech databases, in Proc. ASRU, Kyoto, Japan, 2007.
  2. AMI Project, "AMI Meeting Corpus" .
  3. R. Gruhn, T. Cincarek, and S. Nakamura, "A multi-accent non-native English database", in ASJ, 2004.
  4. University of Munich, "Bavarian Archive for Speech Signals, Strange Corpus".
  5. Jurafsky et al., "The Berkeley Restaurant Project", Proc. ICSLP 1994.
  6. L. Tomokiyo, Recognizing Non-native Speech: Characterizing and Adapting to Non-native Usage in Speech Recognition, Ph.D. thesis, Carnegie Mellon University, Pennsylvania, 2001.
  7. S. Witt, Use of Speech Recognition in Computer-Assisted Language Learning, Ph.D. thesis, Cambridge University Engineering Department, UK, 1999.
  8. H. Ye and S. Young, Improving the speech recognition performance of beginners in spoken conversational interaction for language learning, in Proc. Interspeech, Lisbon, Portugal, 2005.
  9. L. Tomokiyo, Recognizing Non-native Speech: Characterizing and Adapting to Non-native Usage in Speech Recognition, Ph.D. thesis, Carnegie Mellon University, Pennsylvania, 2001.
  10. T. P. Tan and L. Besacier, A French non-native corpus for automatic speech recognition, in LREC, Genoa, Italy, 2006.
  11. T. Lander, CSLU: Foreign accented English release 1.2, Tech. Rep., LDC, Philadelphia, Pennsylvania, 2007.
  12. Z. Wang, T. Schultz, and A. Waibel, Comparison of acoustic model adaptation techniques on non-native speech, in Proc. ICASSP, 2003.
  13. S. Schaden, Regelbasierte Modellierung fremdsprachlich akzentbehafteter Aussprachevarianten, Ph.D. thesis, University Duisburg-Essen, 2006.
  14. L. M. Arslan and J. H. Hansen, Frequency characteristics of foreign accented speech, in Proc. of ICASSP, Munich, Germany, 1997, pp. 1123-1126.
  15. N. Minematsu et al., Development of English speech database read by Japanese to support CALL research, in ICA, Kyoto, Japan, 2004, pp. 577-560.
  16. Christopher Cieri, David Miller, Kevin Walker, The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text, Proc. LREC 2004
  17. S. Fitt, The pronunciation of unfamiliar native and non-native town names, in Proc. of Eurospeech, 1995, pp. 2227-2230.
  18. G. Stemmer, E. Noeth, and H. Niemann, Acoustic modeling of foreign words in a German speech recognition system, in Proc. Eurospeech, P. Dalsgaard, B. Lindberg, and H. Benner, Eds., 2001, vol. 4, pp. 2745-2748.
  19. W. Byrne, E. Knodt, S. Khudanpur, and J. Bernstein, Is automatic speech recognition ready for non-native speech? A data-collection effort and initial experiments in modeling conversational Hispanic English, in STiLL, Marholmen, Sweden, 1998, pp. 37-40.
  20. Y. Li, P. Fung, P. Xu, and Y. Liu, Asymmetric acoustic modeling for mixed language speech recognition, in ICASSP, Prague, Czech Republic, 2011, pp. 37-40.
  21. V. Fischer, E. Janke, and S. Kunzmann, Recent progress in the decoding of non-native speech with multilingual acoustic models, in Proc. of Eurospeech, 2003, pp. 3105-3108.
  22. Nancy F. Chen, Rong Tong, Darren Wee, Peixuan Lee, Bin Ma, Haizhou Li, iCALL Corpus: Mandarin Chinese Spoken by Non-Native Speakers of European Descent, in Proc. of Interspeech, 2015.
  23. Nancy F. Chen, Vivaek Shivakumar, Mahesh Harikumar, Bin Ma, Haizhou Li, Large-Scale Characterization of Mandarin Pronunciation Errors Made by Native Speakers of European Languages, in Proc. of Interspeech, 2013.
  24. W. Menzel, E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton, and C. Souter, The ISLE corpus of non-native spoken English, in LREC, Athens, Greece, 2000, pp. 957-963.
  25. K. Livescu, Analysis and modeling of non-native speech for automatic speech recognition, M.S. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1999.
  26. S-C. Rhee and S-H. Lee and S-K. Kang and Y-J. Lee, Design and Construction of Korean-Spoken English Corpus (K-SEC), Proc. ICSLP 2004
  27. L. Tomokiyo, Recognizing Non-native Speech: Characterizing and Adapting to Non-native Usage in Speech Recognition, Ph.D. thesis, Carnegie Mellon University, Pennsylvania, 2001.
  28. Gut, U., Non-native Speech. A Corpus-based Analysis of Phonological and Phonetic Properties of L2 English and German, Frankfurt am Main: Peter Lang, 2009.
  29. TNO Human Factors Research Institute, Mist multi-lingual interoperability in speech technology database, Tech. Rep., ELRA, Paris, France, 2007, ELRA Catalog Reference S0238.
  30. J.C. Segura et al., The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication, 2007.
  31. S. Pigeon, W. Shen, and D. van Leeuwen, Design and characterization of the non-native military air traffic communications database, in ICSLP, Antwerp, Belgium, 2007.
  32. L. Benarousse et al., The NATO native and non-native (n4) speech corpus, in Proc. of the MIST workshop (ESCA-NATO), Leusden, Sep 1999.
  33. Onomastica Consortium, The ONOMASTICA interlanguage pronunciation lexicon, in Proc. Eurospeech, Madrid, Spain, 1995, pp. 829-832.
  34. C. Hacker, T. Cincarek, A. Maier, A. Hessler, and E. Noeth, Boosting of prosodic and pronunciation features to detect mispronunciations of non-native children, in Proc. of ICASSP, Honolulu, Hawaii, 2007, pp. 197-200.
  35. C. Teixeira, I. Trancoso, and A. Serralheiro, Recognition of non-native accents, in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 2375-2378.
  36. H. Heuvel, K. Choukri, C. Gollan, A. Moreno, and D. Mostefa, TC-STAR: New language resources for ASR and SLT purposes, in LREC, Genoa, 2006, pp. 2570-2573.
  37. L.F. Lamel, F. Schiel, A. Fourcin, J. Mariani, and H. Tillmann, The translanguage English database TED, in ICSLP, Yokohama, Japan, Sep 1994.
  38. N. Mote, L. Johnson, A. Sethy, J. Silva, and S. Narayanan, Tactical language detection and modeling of learner speech errors: The case of Arabic tactical language training for American English speakers, in Proc. of InSTIL, June 2004.
  39. K. Nishina, Development of Japanese speech database read by non-native speakers for constructing CALL system, in ICA, Kyoto, Japan, 2004, pp. 561-564.
  40. University of Munich, "The Verbmobil Project".
  41. I. Trancoso, C. Viana, I. Mascarenhas, and C. Teixeira, On deriving rules for nativised pronunciation in navigation queries, in Proc. Eurospeech, 1999.
  42. A. LaRocca and R. Chouairi, West point Arabic speech corpus, Tech. Rep., LDC, Philadelphia, Pennsylvania, 2002.
  43. A. LaRocca and C. Tomei, West point Russian speech corpus, Tech. Rep., LDC, Philadelphia, Pennsylvania, 2003.
  44. J. Morgan, West point heroico Spanish speech, Tech. Rep., LDC, Philadelphia, Pennsylvania, 2006.
  45. I. Amdal, F. Korkmazskiy, and A. C. Surendran, Joint pronunciation modelling of non-native speakers using data-driven methods, in ICSLP, Beijing, China, 2000, pp. 622-625.
  46. K. Livescu, Analysis and modeling of non-native speech for automatic speech recognition, M.S. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1999.