List of children's speech corpora

A child speech corpus is a speech corpus documenting first-language acquisition. Such databases are used in the development of computer-assisted language learning systems and in the characterization of children's speech at different ages. [1] Children's speech varies not only by language but also by region within a language. It can also differ for specific groups such as autistic children, especially when emotion is considered. Different databases are therefore needed for different populations. Corpora are available for American and British English as well as for many other European languages. [1] [2] [3]

Overview of Children's Speech Corpora

In the table below, the age range may be described in terms of school grades. "K" denotes "kindergarten" while "G" denotes "grade". For example, an age range of "K - G10" refers to speakers ranging from kindergarten age to grade 10.

This table is based on a paper presented at the Interspeech conference in 2016. [4] This online article is intended to provide an interactive table for readers and a place where information about children's speech corpora can be updated continuously by the speech research community.

| Corpus | Author | Languages | # Speakers | # Utt. | Duration | Age Range | Date | Remarks |
|---|---|---|---|---|---|---|---|---|
| Boulder Learning - MyST Corpus (v0.4.0) [5] | Cole et al. [6] | English | 1371 | 228,874 | ~393h | G3 - G5 | 2019 | dialog interaction between a student and a virtual tutor on science topics; sessions typically last 20-40 minutes (wall clock); roughly 49% of the utterances have been transcribed, with more in progress; volunteers encouraged; available free for research; flat $10K fee for commercial use |
| CMU Kids Corpus [7] | Eskenazi | English | 24M, 52F | 5180 | | 6 - 11 | 1997 | |
| CSLU Kids' Speech Corpus [8] | Shobaki | English | 1100 | 1017 | | K - G10 | 2007 | |
| PF-STAR Children's Speech Corpus [9] [10] | Russell | English | 158 | | ~14.5h | 4 - 14 | 2006 | word-level transcriptions |
| CALL-SLT [11] | Rayner | German | | 5000 | | | 2014 | |
| TBALL [12] | Kazemzadeh | English | 256 | 5000 | 40h | K - G4 | 2005 | partially non-native speech |
| CASS_CHILD [13] | Gao | Mandarin | 23 | | | 1 - 4 | 2012 | phonetic transcriptions |
| CU Children's Read and Prompted Speech Corpus [14] | Hagen | English | 663 | ~100 | | K - G5 | 2001 | consists of isolated words, sentences and short spontaneous storytelling; word-level transcriptions |
| CU Story Corpus [14] | Hagen | English | 106 | 5000 | 40h | G3 - G5 | 2003 | consists of story prompts and spontaneous spoken summaries of the material; word-level transcriptions |
| Providence Corpus [15] | Demuth | English | 6 | | 363h | 1 - 3 | 2006 | mother-child spontaneous speech interactions; broad phonetic transcription |
| Lyon Corpus [16] | Demuth | French | 4 | | 185h | 1 - 3 | 2007 | mother-child spontaneous speech interactions; broad phonetic transcription |
| Demuth Sesotho Corpus [17] | Demuth | Sesotho | 4 | ~13250 | 98h | 2 - 4 | 1992 | family/peer spontaneous speech interactions; morphologically tagged |
| CHIEDE [18] | Garrote | Spanish | 59 | 15444 | ~8h | | 2008 | spontaneous conversation, personal interviews, adult-child interaction; orthographic transcriptions; automatic phonological transcription |
| TIDIGITS [19] | Leonard | English | 326 (101 children) | | | 6 - 15 | 1993 | mix of adult and child speakers |
| FAU Aibo Emotion Corpus | Steidl | German | 51 | | 9h | 10 - 13 | | human-annotated with 11 emotion categories |
| Swedish NICE Corpus [20] | Bell | Swedish | | 5580 | | 8 - 15 | 2005 | consists of child-machine and adult-child interactions; orthographic transcriptions |
| SingaKids-Mandarin [4] | Chen | Mandarin | 255 | 79,843 | 125h | 7 - 12 | 2016 | word- and phone-level transcriptions; human-annotated proficiency ratings |
| CFSC [21] | Pascual | Filipino | 57 | | ~8h | 6 - 11 | 2012 | consists of children's read speech; contains both good pronunciations and reading miscues; partially transcribed at word and phoneme levels |
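
The age-range notation used in the table mixes school grades ("K", "G10") with ages in years ("6 - 11"). As a minimal sketch of how such cells might be normalized for filtering or sorting, the hypothetical helper `parse_age_range` below separates the two cases. Encoding kindergarten as grade 0 is an assumption of this sketch, not part of the source table, and no grade-to-age conversion is attempted because that mapping varies by country.

```python
import re


def parse_age_range(text):
    """Parse an age-range cell such as 'K - G10' or '6 - 11'.

    Returns a tuple (unit, low, high), where unit is 'grade' or 'age'.
    Kindergarten ('K') is encoded as grade 0 (an assumption of this sketch).
    """

    def parse_bound(token):
        token = token.strip().upper()
        if token == "K":
            return "grade", 0
        match = re.fullmatch(r"G(\d+)", token)
        if match:
            return "grade", int(match.group(1))
        if token.isdigit():
            return "age", int(token)
        raise ValueError(f"unrecognized bound: {token!r}")

    # Split on the hyphen, tolerating optional surrounding spaces.
    parts = re.split(r"\s*-\s*", text.strip())
    if len(parts) != 2:
        raise ValueError(f"expected 'low - high', got: {text!r}")

    bounds = [parse_bound(part) for part in parts]
    (low_unit, low), (high_unit, high) = bounds
    if low_unit != high_unit:
        raise ValueError(f"mixed units in range: {text!r}")
    return low_unit, low, high
```

For example, `parse_age_range("K - G10")` yields a grade range from 0 to 10, while `parse_age_range("6-11")` yields an age range from 6 to 11. Keeping the unit explicit avoids silently comparing grades with ages.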

References

  1. Habernal, Ivan; Vaclav, Matousek (2013). Text, Speech, and Dialogue: 16th International Conference, TSD 2013, Pilsen, Czech Republic, September 1-5, 2013, Proceedings. Springer. p. 545. ISBN 9783642405853. Retrieved 11 December 2015.
  2. Neustein, Amy (2014). Speech and Automata in Health Care. Walter de Gruyter. pp. 225–226. ISBN 9781614515159. Retrieved 11 December 2015.
  3. Ronzhin, Andrey; Potapova, Rodmonga; Fakotakis, Nikos (2015). Speech and Computer: 17th International Conference, SPECOM 2015, Athens, Greece, September 20-24, 2015, Proceedings. Springer. pp. 144–145. ISBN 9783319231327. Retrieved 11 December 2015.
  4. Nancy F. Chen, Rong Tong, Darren Wee, Peixuan Lee, Bin Ma and Haizhou Li. SingaKids-Mandarin: Speech Corpus of Singaporean Children Speaking Mandarin Chinese, in Proc. of Interspeech, 2016.
  5. "MyST Corpus | Boulder Learning inc". Retrieved 2019-07-17.
  6. "My Science Tutor and the MyST Corpus". ResearchGate. Retrieved 2019-07-17.
  7. Maxine Eskenazi, Jack Mostow, and David Graff. The CMU Kids Corpus LDC97S63. Web Download. Philadelphia: Linguistic Data Consortium, 1997.
  8. Khaldoun Shobaki, John-Paul Hosom, and Ronald Cole. CSLU: Kids' Speech Version 1.1 LDC2007S18. Web Download. Philadelphia: Linguistic Data Consortium, 2007.
  9. Martin Russell. The PF-STAR British English Children's Speech Corpus. The Speech Ark Limited. 2006.
  10. Anton Batliner, Mats Blomberg, Shona D'Arcy, Daniel Elenius, Diego Giuliani, Matteo Gerosa, Christian Hacker, Martin Russell, Stefan Steidl, Michael Wong. The PF STAR Children’s Speech Corpus. In Proc. of Interspeech, 2005.
  11. Manny Rayner, Nikos Tsourakis, Claudia Baur, Pierrette Bouillon, Johanna Gerlach. CALL-SLT: A Spoken CALL System based on grammar and speech recognition. In Linguistic Issues in Language Technology, vol. 10, issue 2. 2014.
  12. Abe Kazemzadeh, Hong You, Markus Iseli, Barbara Jones, Xiaodong Cui, Margaret Heritage, Patti Price, Elaine Anderson, Shrikanth Narayanan and Abeer Alwan. TBALL Data Collection: The Making of a Young Children's Speech Corpus, in Proc. of Interspeech, 2005.
  13. Jun Gao, Aijun Li and Ziyu Xiong. Mandarin Multimedia Child Speech Corpus: CASS_CHILD in International Conference on Speech Database and Assessments (Oriental COCOSDA), 2012.
  14. Andreas Hagen, Bryan Pellom and Ronald Cole. Children's Speech Recognition with Application to Interactive Books and Tutors, in IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.
  15. Demuth, K., Culbertson, J. & Alter, J. 2006. Word-minimality, epenthesis, and coda licensing in the acquisition of English. Language & Speech, 49, 137-174.
  16. Demuth, K. & A. Tremblay. 2007. Prosodically-conditioned variability in children's production of French determiners. Journal of Child Language, 34, 1-29.
  17. Demuth, K. 1992. Acquisition of Sesotho. In D. Slobin (ed.), The Cross-Linguistic Study of Language Acquisition, vol 3, 557-638. Hillsdale, N.J.: Lawrence Erlbaum Associates.
  18. Marta Garrote. CHIEDE: A Spontaneous Child Language Corpus of Spanish. Ph.D. thesis, Universidad Autónoma de Madrid, Spain. 2008.
  19. R. Gary Leonard, and George Doddington. TIDIGITS LDC93S10. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
  20. Linda Bell, Johan Boyce, Joakim Gustafson, Mattias Heldner, Anders Lindström and Mats Wirén. The Swedish NICE Corpus - Spoken Dialogues between Children and Embodied Characters in a Computer Game Scenario, in Proc. of Eurospeech, 2005.
  21. Pascual, R. M.; Guevara, R. C. L. (November 2012). "Developing a children's Filipino speech corpus for application in automatic detection of reading miscues and disfluencies". TENCON 2012 IEEE Region 10 Conference. pp. 1–6. doi:10.1109/TENCON.2012.6412235. ISBN 978-1-4673-4824-9. S2CID 8795591.