TIMIT

TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Each transcribed element is time-aligned with the corresponding audio.
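
The time alignments are stored in plain-text annotation files distributed alongside the audio. The following minimal sketch assumes the conventional TIMIT convention of one phone-level .PHN and one word-level .WRD file per utterance, with each line holding "begin_sample end_sample label" and audio sampled at 16 kHz; the example path is hypothetical.

from dataclasses import dataclass

SAMPLE_RATE = 16_000  # TIMIT audio is sampled at 16 kHz

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # phone or word label

def read_alignment(path: str) -> list[Segment]:
    # Parse a TIMIT-style .PHN or .WRD file: one "begin end label" triple
    # per line, with begin/end given as sample indices.
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip empty or malformed lines
            begin, end, label = parts
            segments.append(Segment(int(begin) / SAMPLE_RATE,
                                    int(end) / SAMPLE_RATE,
                                    label))
    return segments

# Hypothetical example path into a local copy of the corpus:
# for seg in read_alignment("TIMIT/TRAIN/DR1/FCJF0/SA1.PHN"):
#     print(f"{seg.start:6.3f}-{seg.end:6.3f}  {seg.label}")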

TIMIT was designed to further acoustic-phonetic knowledge and automatic speech recognition systems. It was commissioned by DARPA, and corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International, and Texas Instruments (TI). The speech was recorded at TI, transcribed at MIT, and verified and prepared for publishing by the National Institute of Standards and Technology (NIST). [1] There is also a telephone-bandwidth version called NTIMIT (Network TIMIT).

TIMIT and NTIMIT are not freely available: either membership in the Linguistic Data Consortium or a monetary payment is required to access the dataset.

History

The TIMIT corpus was an early attempt to create a database of transcribed speech samples. [2] It was published on CD-ROM in 1988 and contains only 10 sentences per speaker: each speaker read two 'dialect' sentences as well as another eight sentences selected from a larger set. [3] Each sentence averages about 3 seconds in length, and the corpus comprises 630 different speakers. [4] It was the first notable attempt at creating and distributing a speech corpus, and the overall project cost about US$1.5 million. [5]
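
As a rough consistency check of this design, a local copy of the corpus could be scanned as sketched below. The directory layout assumed here, <root>/<TRAIN|TEST>/<dialect region>/<speaker>/<sentence>.WAV, follows the common distribution convention and is not taken from the article itself.

from collections import Counter
from pathlib import Path

def sentences_per_speaker(root: str) -> Counter:
    # Assumes the layout <root>/<TRAIN|TEST>/<DRn>/<speaker>/<sentence>.WAV,
    # where sentence IDs begin with SA (dialect), SX or SI.
    counts = Counter()
    for wav in Path(root).glob("*/*/*/*.WAV"):
        counts[wav.parent.name] += 1  # the parent directory names the speaker
    return counts

if __name__ == "__main__":
    counts = sentences_per_speaker("TIMIT")                # hypothetical local path
    print(len(counts), "speakers")                         # expected: 630
    print("sentences per speaker:", set(counts.values()))  # expected: {10}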

The full name of the project is the DARPA-TIMIT Acoustic-Phonetic Continuous Speech Corpus, [6] and the acronym TIMIT stands for Texas Instruments/Massachusetts Institute of Technology. The corpus was created mainly to train and evaluate speech recognition software, and it has also served as a standardized baseline in evaluation campaigns such as the Blizzard Challenge, in which competing systems convert text into synthesized speech. [7]

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

The glottal stop or glottal plosive is a type of consonantal sound used in many spoken languages, produced by obstructing airflow in the vocal tract or, more precisely, the glottis. The symbol in the International Phonetic Alphabet that represents this sound is ⟨ʔ⟩.

Phonetic transcription is the visual representation of speech sounds by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the International Phonetic Alphabet.

Speaker recognition is the identification of a person from characteristics of voices. It is used to answer the question "Who is speaking?" The term voice recognition can refer to speaker recognition or speech recognition. Speaker verification contrasts with identification, and speaker recognition differs from speaker diarisation.

A speech corpus is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models. In linguistics, spoken corpora are used to do research into phonetics, conversation analysis, dialectology and other fields.

A non-native speech database is a speech database of non-native pronunciations of English. Such databases are used in the development of: multilingual automatic speech recognition systems, text to speech systems, pronunciation trainers, and second language learning systems.

The Buckeye Corpus of conversational speech is a speech corpus created by a team of linguists and psychologists at Ohio State University led by Prof. Mark Pitt. It contains high-quality recordings from 40 speakers in Columbus, Ohio conversing freely with an interviewer. The interviewer's voice is heard only faintly in the background of these recordings. The sessions were conducted as sociolinguistic interviews, and are essentially monologues. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software. Software for searching the transcription files is also available at the project web site. The corpus is available to researchers in academia and industry.

Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. This differs from phrase translation, which is where the system only translates a fixed and finite set of phrases that have been manually entered into the system. Speech translation technology enables speakers of different languages to communicate. It thus is of tremendous value for humankind in terms of science, cross-cultural exchange and global business.

Lessac Technologies, Inc. (LTI) is an American firm which develops voice synthesis software, licenses technology and sells synthesized novels as MP3 files. The firm currently has seven patents granted and three more pending for its automated methods of converting digital text into human-sounding speech, more accurately recognizing human speech and outputting the text representing the words and phrases of said speech, along with recognizing the speaker's emotional state.

Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing.

In signal processing, Feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that is typically applied in a speaker-adaptive way: fMLLR maps acoustic features to speaker-adapted features by multiplication with a transformation matrix. In some literature, fMLLR is also known as Constrained Maximum Likelihood Linear Regression (cMLLR).
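
A minimal numerical sketch of the transform described above, assuming the common convention of collecting the linear part and bias into a single matrix W = [A | b]; the identity transform used in the example is stand-in data, not an estimated fMLLR matrix.

import numpy as np

def apply_fmllr(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    # features: (T, d) acoustic feature vectors, one row per frame
    # W:        (d, d + 1) speaker transform [A | b]
    # Returns speaker-adapted features A @ x + b for every frame x.
    A, b = W[:, :-1], W[:, -1]
    return features @ A.T + b

# Stand-in example: 100 frames of 39-dimensional features passed through
# an identity transform (i.e. no adaptation), so the output equals the input.
feats = np.random.randn(100, 39)
W = np.hstack([np.eye(39), np.zeros((39, 1))])
assert np.allclose(apply_fmllr(feats, W), feats)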

The BABEL speech corpus is a corpus of recorded speech materials from five Central and Eastern European languages. Intended for use in speech technology applications, it was funded by a grant from the European Union and completed in 1998. It is distributed by the European Language Resources Association.

Victor Waito Zue is a Chinese American computer scientist and professor at Massachusetts Institute of Technology.

Larry Paul Heck is the Rhesa Screven Farmer, Jr., Advanced Computing Concepts Chair, Georgia Research Alliance Eminent Scholar, Chief Scientist of the AI Hub, Executive Director of the Machine Learning Center, and Professor at the Georgia Institute of Technology. His career spans many of the sub-disciplines of artificial intelligence, including conversational AI, speech recognition and speaker recognition, natural language processing, web search, online advertising and acoustics. He is best known for his role as a co-founder of the Microsoft Cortana Personal Assistant and his early work in deep learning for speech processing.

The Persian Speech Corpus is a Modern Persian speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of about 2.5 hours of Persian speech aligned with recorded speech on the phoneme level, including annotations of word boundaries. Previous spoken corpora of Persian include FARSDAT, which consists of read aloud speech from newspaper texts from 100 Persian speakers and the Telephone FARsi Spoken language DATabase (TFARSDAT) which comprises seven hours of read and spontaneous speech produced by 60 native speakers of Persian from ten regions of Iran.

openSMILE is source-available software for automatic extraction of features from audio signals and for classification of speech and music signals. "SMILE" stands for "Speech & Music Interpretation by Large-space Extraction". The software is mainly applied in the area of automatic emotion recognition and is widely used in the affective computing research community. The openSMILE project has existed since 2008 and has been maintained by the German company audEERING GmbH since 2013. openSMILE is provided free of charge for research purposes and personal use under a source-available license. For commercial use of the tool, the company audEERING offers custom license options.

An audio deepfake is a product of artificial intelligence used to create convincing speech sentences that sound like specific people saying things they did not say. This technology was initially developed for various applications to improve human life. For example, it can be used to produce audiobooks, and also to help people who have lost their voices to get them back. Commercially, it has opened the door to several opportunities. This technology can also create more personalized digital assistants and natural-sounding text-to-speech as well as speech translation services.

Lori Faith Lamel is a speech processing researcher known for her work with the TIMIT corpus of American English speech and for her work on voice activity detection, speaker recognition, and other non-linguistic inferences from speech signals. She works for the French National Centre for Scientific Research (CNRS) as a senior research scientist in the Spoken Language Processing Group of the Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur.

The Switchboard Telephone Speech Corpus is a corpus of spoken English consisting of almost 260 hours of speech. It was created in 1990 by Texas Instruments via a DARPA grant, and released in 1992 by NIST. The corpus contains 2,400 telephone conversations among 543 US speakers. Participants did not know each other, and conversations were held on topics from a predetermined list.

References

  1. Fisher, William M.; Doddington, George R.; Goudie-Marshall, Kathleen M. (1986). "The DARPA Speech Recognition Research Database: Specifications and Status". Proceedings of DARPA Workshop on Speech Recognition. pp. 93–99.
  2. Morales, Nicolas; Tejedor, Javier; Garrido, Javier; Colas, Jose; Toledano, Doroteo T. (2008). "STC-TIMIT: Generation of a single-channel telephone corpus". Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08): 391–395.
  3. Lamel, Lori F.; Kassel, Robert H.; Seneff, Stephanie (1986). Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus (Technical report). DARPA (SAIC-86/1546).
  4. Garofolo, John S.; Lamel, Lori F.; Fisher, William M.; Fiscus, Jonathan G.; Pallett, David S.; Dahlgren, Nancy L. (1993). DARPA TIMIT (Technical report). National Institute of Standards and Technology. doi:10.6028/nist.ir.4930.
  5. Chanchaochai, Nattanun; Cieri, Christopher; Debrah, Japhet; Ding, Hongwei; Jiang, Yue; Liao, Sishi; Liberman, Mark; Wright, Jonathan; Yuan, Jiahong; Zhan, Juhong; Zhan, Yuqing (2018). "GlobalTIMIT: Acoustic-Phonetic Datasets for the World's Languages". Interspeech 2018. ISCA. doi:10.21437/interspeech.2018-1185.
  6. Bauer, Patrick; Scheler, David; Fingscheidt, Tim (2010). "WTIMIT: The TIMIT Speech Corpus Transmitted Over the 3G AMR Wideband Mobile Network". LREC.
  7. Sawada, Kei; Asai, Chiaki; Hashimoto, Kei; Oura, Keiichiro; Tokuda, Keiichi (2016). "The NITech text-to-speech system for the Blizzard Challenge 2016". Blizzard Challenge 2016 Workshop.