CMU Pronouncing Dictionary

Developer(s): Carnegie Mellon University
Stable release: 0.7b / November 19, 2014
Available in: English
License: BSD
Website: www.speech.cs.cmu.edu/cgi-bin/cmudict

The CMU Pronouncing Dictionary (also known as CMUdict) is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.


CMUdict provides a mapping from orthographic to phonetic representations of English words in their North American pronunciations. It is commonly used to generate pronunciations for speech recognition (ASR) systems such as CMU Sphinx and for speech synthesis (TTS) systems such as Festival. CMUdict can also serve as a training corpus for statistical grapheme-to-phoneme (g2p) models [1] that generate pronunciations for words not yet included in the dictionary.

The most recent release is 0.7b; it contains over 134,000 entries. An interactive lookup version is available. [2]

Database format

The database is distributed as a plain text file with one entry per line in the format "WORD  <pronunciation>", with a two-space separator between the parts. If a word has multiple pronunciations, the variants are identified with numbered versions (e.g. WORD(1)). The pronunciation is encoded in a modified form of the ARPABET system, with stress marks of levels 0, 1, and 2 added to the vowels. A line-initial ;;; token indicates a comment. A derived format, directly suitable for speech recognition engines, is also available as part of the distribution; it collapses the stress distinctions, which are typically not used in ASR.
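As an illustration of the format described above (this parser is not part of the distribution, and the sample entries are hypothetical excerpts), a minimal Python sketch might read the file like this:

```python
import re

def parse_cmudict(lines):
    """Parse CMUdict-format lines into {word: [pronunciations]}.

    Each entry line is "WORD  PH0 PH1 ..." with a two-space separator;
    alternate pronunciations use numbered variants like "WORD(1)";
    lines starting with ";;;" are comments.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip comments and blank lines
        word, _, phones = line.partition("  ")
        # Fold "WORD(1)", "WORD(2)" variants onto the base word.
        word = re.sub(r"\(\d+\)$", "", word)
        entries.setdefault(word, []).append(phones.split())
    return entries

sample = [
    ";;; a comment line",
    "READ  R IY1 D",
    "READ(1)  R EH1 D",
]
d = parse_cmudict(sample)
# d["READ"] == [["R", "IY1", "D"], ["R", "EH1", "D"]]
```

Collapsing the numbered variants into a list per word mirrors how the dictionary is usually consumed: a lookup returns every attested pronunciation of the word.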

The following table lists the phonemes used by the CMU Pronouncing Dictionary. [2]

Vowels

ARPABET  Respelling  IPA   Example
AA       ah          ɑ     odd
AE       a           æ     at
AH0      ə           ə     about
AH       uh          ʌ     hut
AO       aw          ɔ     ought, story
AW       ow          aʊ    cow
AY       eye         aɪ    hide
EH       eh          ɛ     Ed
ER       ur, ər      ɝ, ɚ  hurt
EY       ay          eɪ    ate
IH       i, ih       ɪ     it
IY       ee          i     eat
OW       oh          oʊ    oat
OY       oy          ɔɪ    toy
UH       uu          ʊ     hood
UW       oo          u     two

Stress

Digit  Description
0      No stress
1      Primary stress
2      Secondary stress

Consonants

ARPABET  Respelling  IPA  Example
B        b           b    be
CH       ch, tch     tʃ   cheese
D        d           d    dee
DH       dh          ð    thee
F        f           f    fee
G        g           ɡ    green
HH       h           h    he
JH       j           dʒ   gee
K        k           k    key
L        l           l    lee
M        m           m    me
N        n           n    knee
NG       ng          ŋ    ping
P        p           p    pee
R        r           r    read
S        s, ss       s    sea
SH       sh          ʃ    she
T        t           t    tea
TH       th          θ    theta
V        v           v    vee
W        w, wh       w    we
Y        y           j    yield
Z        z           z    zee
ZH       zh          ʒ    seizure
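Because the stress digits 0, 1, and 2 attach only to vowel phones, they offer a quick way to count syllables in a transcription, and stripping them yields the stress-collapsed form used in the derived ASR-oriented format. A small illustrative helper (not part of CMUdict itself):

```python
def syllable_count(phones):
    """Count syllables by counting phones that carry a stress digit,
    since in CMUdict every vowel phone ends in 0, 1, or 2."""
    return sum(1 for phone in phones if phone[-1].isdigit())

def strip_stress(phones):
    """Collapse stress distinctions, as in the derived ASR-oriented
    format: AH0 -> AH, IY1 -> IY, and so on."""
    return [phone.rstrip("012") for phone in phones]

# "dictionary": D IH1 K SH AH0 N EH2 R IY0
phones = ["D", "IH1", "K", "SH", "AH0", "N", "EH2", "R", "IY0"]
syllable_count(phones)  # 4
strip_stress(phones)    # ['D', 'IH', 'K', 'SH', 'AH', 'N', 'EH', 'R', 'IY']
```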

History

Version               Release date [3]      License
0.1                   16 September 1993     Public domain
0.2                   10 March 1994         Public domain
0.3                   28 September 1994     Public domain
0.4                   8 November 1995       Public domain
0.5                   No public release     Public domain
0.6                   11 August 1998        Public domain
0.7                   No public release     Public domain
0.7a                  18 February 2008      2-clause BSD
0.7b                  19 November 2014 [4]  2-clause BSD
GitHub (unversioned)  26 May 2021           2-clause BSD



References

  1. "Sequitur G2P - A trainable Grapheme-to-Phoneme converter".
  2. "The CMU Pronouncing Dictionary". 2015-07-16. Archived from the original on 2022-06-03. Retrieved 2022-06-04.
  3. ftp://ftp.cs.cmu.edu/project/speech/dict/
  4. http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/00README_FIRST.txt
  5. "Cmusphinx - Revision 10973: /Trunk/Logios". Archived from the original on 2011-05-20. Retrieved 2009-12-19.