A child speech corpus is a speech corpus documenting first-language acquisition. Such databases are used in the development of computer-assisted language learning systems and in the characterization of children's speech at different ages. [1] Children's speech varies not only by language, but also by region within a language. It can also differ for specific groups such as autistic children, especially when emotion is considered. Thus, different databases are needed for different populations. Corpora are available for American and British English as well as for many other European languages. [1] [2] [3]
In the table below, the age range may be described in terms of school grades. "K" denotes "kindergarten" while "G" denotes "grade". For example, an age range of "K - G10" refers to speakers ranging from kindergarten age to grade 10.
This table is based on a paper from the Interspeech conference, 2016. [4] This online article is intended to provide an interactive table for readers and a place where information about children's speech corpora can be updated continuously by the speech research community.
Corpus | Author | Languages | # Speakers | # Utt. | Duration | Age Range | Date | Remarks |
---|---|---|---|---|---|---|---|---|
Boulder Learning—MyST Corpus (v0.4.0) [5] | Cole et al. [6] | English | 1371 | 228,874 | ~393h | G3 - G5 | 2019 | dialog interaction between a student and a virtual tutor on science topics; sessions typically last 20-40 minutes (wall clock); roughly 49% of the utterances have been transcribed, with more in progress; volunteers encouraged; available free for research, flat $10K for commercial use |
CMU Kids Corpus [7] | Eskenazi | English | 24M, 52F | 5180 | | 6 - 11 | 1997 | |
CSLU Kids' Speech Corpus [8] | Shobaki | English | 1100 | 1017 | | K - G10 | 2007 | |
PF-STAR Children's Speech Corpus [9] [10] | Russell | English | 158 | | ~14.5h | 4 - 14 | 2006 | word-level transcriptions |
CALL-SLT [11] | Rayner | German | | 5000 | | | 2014 | |
TBALL [12] | Kazemzadeh | English | 256 | 5000 | 40h | K - G4 | 2005 | partially non-native speech |
CASS_CHILD [13] | Gao | Mandarin | 23 | | | 1 - 4 | 2012 | phonetic transcriptions |
CU Children's Read and Prompted Speech Corpus [14] | Hagen | English | 663 | ~100 | | K - G5 | 2001 | consists of isolated words, sentences and short spontaneous storytelling; word-level transcriptions |
CU Story Corpus [14] | Hagen | English | 106 | 5000 | 40h | G3 - G5 | 2003 | consists of story prompts and spontaneous spoken summaries of the material; word-level transcriptions |
Providence Corpus [15] | Demuth | English | 6 | | 363h | 1 - 3 | 2006 | mother-child spontaneous speech interactions; broad phonetic transcription |
Lyon Corpus [16] | Demuth | French | 4 | | 185h | 1 - 3 | 2007 | mother-child spontaneous speech interactions; broad phonetic transcription |
Demuth Sesotho Corpus [17] | Demuth | Sesotho | 4 | ~13250 | 98h | 2 - 4 | 1992 | family/peer spontaneous speech interactions; morphologically tagged |
CHIEDE [18] | Garrote | Spanish | 59 | 15444 | ~8h | | 2008 | spontaneous conversation, personal interviews, adult-child interaction; orthographic transcriptions; automatic phonological transcription |
TIDIGITS [19] | Leonard | English | 326 (101 children) | | | 6 - 15 | 1993 | mix of adult and child speakers |
FAU Aibo Emotion Corpus | Steidl | German | 51 | | 9h | 10 - 13 | | human-annotated with 11 emotion categories |
Swedish NICE Corpus [20] | Bell | | | 5580 | | 8 - 15 | 2005 | consists of child-machine and adult-child interactions; orthographic transcriptions |
SingaKids-Mandarin [4] | Chen | Mandarin | 255 | 79,843 | 125h | 7 - 12 | 2016 | word and phone-level transcriptions; human-annotated proficiency ratings |
CFSC [21] | Pascual | Filipino | 57 | | ~8h | 6 - 11 | 2012 | consists of children's read speech; contains both good pronunciations and reading miscues; partially transcribed at word and phoneme levels |
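
For readers who want to process the table programmatically, the age-range notation described above ("K" for kindergarten, "G" plus a number for a grade, plain numbers for ages in years) can be parsed with a short routine. The following is a minimal sketch, not part of any corpus distribution: the names `AgeBound`, `parse_bound` and `parse_age_range` are hypothetical, and encoding kindergarten as grade 0 is an assumption made here for convenience.

```python
import re
from typing import NamedTuple, Optional, Tuple


class AgeBound(NamedTuple):
    """One end of an age range (hypothetical helper type)."""
    kind: str   # "grade" for school grades, "age" for ages in years
    value: int  # grade number (kindergarten encoded as 0, an assumption) or age


def parse_bound(token: str) -> AgeBound:
    """Parse a single bound such as 'K', 'G10' or '6'."""
    token = token.strip()
    if token.upper() == "K":
        return AgeBound("grade", 0)
    match = re.fullmatch(r"[Gg](\d+)", token)
    if match:
        return AgeBound("grade", int(match.group(1)))
    if token.isdigit():
        return AgeBound("age", int(token))
    raise ValueError(f"unrecognized age bound: {token!r}")


def parse_age_range(cell: str) -> Optional[Tuple[AgeBound, AgeBound]]:
    """Parse an 'Age Range' cell; return None when the cell is empty."""
    cell = cell.strip()
    if not cell:
        return None
    low, sep, high = cell.partition("-")
    if not sep:
        raise ValueError(f"expected 'low - high', got: {cell!r}")
    return parse_bound(low), parse_bound(high)


if __name__ == "__main__":
    for cell in ["K - G10", "G3 - G5", "6 - 11", ""]:
        print(repr(cell), "->", parse_age_range(cell))
```

Keeping grades and ages as separate kinds avoids guessing a grade-to-age conversion, which differs between school systems.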