BABEL Speech Corpus

The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.

Development of the BABEL Project

Following the SAM project's creation of a speech corpus of European Union languages, the European Union granted funding for a corpus of Central and Eastern European languages built along similar lines, named BABEL.

The initial impetus came from the SAM (Speech Assessment Methods) project, funded by the European Union as ESPRIT Project #1541 in 1987–89. [1] This project was conducted by an international group of phoneticians, and was applied in the first instance to the European Communities languages Danish, Dutch, English, French, German, and Italian (by 1989). SAM produced many speech research tools (including the SAMPA computer-readable phonetic transcription system, which was also used for the BABEL project) and a corpus of recorded speech material distributed on CD-ROM. [2] A proposal was made to the European Union under the Copernicus initiative in 1994, with the objective of creating a corpus of spoken Bulgarian, Estonian, Hungarian, Polish and Romanian, and Grant #1304 was awarded for this. A pilot project to create a small corpus of spoken Bulgarian was carried out jointly by the Universities of Sofia (Bulgaria) and Reading (U.K.). [3] The initial meeting of the whole project team took place at the University of Reading in 1995.

Recorded material

Since the objective was to produce material suitable for use in speech technology applications, the digital recordings were made in strictly controlled conditions in recording studios. For each language the material had the following composition:

Membership of the BABEL Project

Project Director: Peter Roach (University of Reading)

Project leaders in Central and Eastern Europe

Bulgaria: initially, A. Misheva until her death in 1995, then S. Dimitrova (University of Sofia).
Estonia: E. Meister (University of Tallinn)
Hungary: K. Vicsi (Technical University of Budapest)
Poland: R. Gubrynowicz (Polish Academy of Sciences) and W. Gonet (University of Lublin)
Romania: M. Boldea (University of Timișoara)

Project members in Western Europe

France: L. Lamel (LIMSI, Paris); A. Marchal (CNRS)
Germany: W. Barry (Saarland University); K. Marasek (University of Stuttgart)
United Kingdom: J. Wells (University College London); P. Roach (University of Reading)

Project outcomes

An intermediate project assessment meeting was held in Lublin, Poland, in 1996. Work then continued until a final assessment and presentation of outcomes in Granada, Spain, at the First International Conference on Language Resources and Evaluation, in 1998. [4] The project was completed in December 1998. The resulting set of corpora was then supplied to the European Language Resources Association (ELRA), which is exclusively responsible for distributing the material to users via its website. [5]

At the time of its completion, BABEL was the largest high-quality speech database available for research purposes in languages such as Hungarian [6] and Estonian. [7] It has been used for research into topics such as pronunciation modeling [6] and automatic speech recognition. [8] The project was also part of what has been called the most significant recent development in corpus linguistics – the increasing range of languages covered by corpus data, which promises to bring to a wider range of languages the benefits that corpus linguistics has brought to the study of Western European languages. [9]

References

  1. D. Chan, A. Fourcin, D. Gibbon, B. Granstrom, M. Huckvale, G. Kokkinakis, K. Kvale, L. Lamel, B. Lindberg, A. Moreno, J. Mouropoulos, F. Senia, I. Trancoso, C. Veld & J. Zeiliger, "EUROM – A Spoken Language Resource for the EU", in Eurospeech '95, Proceedings of the 4th European Conference on Speech Communication and Speech Technology. Madrid, Spain, 18–21 September 1995. Vol. 1, pp. 867–870.
  2. "EUROM1 – Multilingual Speech Corpus". University College London. Retrieved 2015-01-19.
  3. Misheva, A., Dimitrova, S., Filipov, V., Grigorova, E., Nikov, M., Roach, P. and Arnfield, S. (1995). "Bulgarian Speech Database: a pilot study", Proceedings of Eurospeech '95, Madrid, Vol. 1, pp. 859–862.
  4. Roach, P., S. Arnfield, W. Barry, S. Dimitrova, M. Boldea, A. Fourcin, W. Gonet, R. Gubrynowicz, E. Hallum, L. Lamel, K. Marasek, A. Marchal, E. Meister, K. Vicsi (1998). "BABEL: A Database Of Central And Eastern European Languages", Proceedings of the First International Conference on Language Resources and Evaluation, eds. A. Rubio et al., Granada, Vol. 1, pp. 371–374.
  5. "Search results for: babel". European Language Resources Association. Retrieved 2015-01-18.
  6. Fegyó, Tibor; Péter Mihajlik; Péter Tatai; Géza Gordos (2001). "Pronunciation modeling in Hungarian number recognition." In INTERSPEECH, pp. 1465–1468.
  7. Alumae, Tanel (2004). Large vocabulary continuous speech recognition for Estonian using morpheme classes. INTERSPEECH, Jeju, Korea. pp. 389–392.
  8. Mihajlik, Péter; Révész, Tibor; Tatai, Péter (2002-11-01). "Phonetic transcription in automatic speech recognition" (PDF). Acta Linguistica Hungarica. 49 (3): 407–425. doi:10.1556/ALing.49.2002.3-4.9.
  9. McEnery, Tony (2001). Corpus Linguistics: An Introduction. Oxford University Press. p. 188. ISBN 9780748611652.