Developer(s) | Carnegie Mellon University |
---|---|
Stable release | 0.7b / November 19, 2014 |
Available in | English |
License | BSD |
The CMU Pronouncing Dictionary (also known as CMUdict) is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.
CMUdict provides a mapping from the orthographic forms of English words to their North American pronunciations. It is commonly used to generate pronunciation representations for speech recognition (ASR), e.g. in the CMU Sphinx system, and for speech synthesis (TTS), e.g. in the Festival system. CMUdict can also be used as a training corpus for building statistical grapheme-to-phoneme (g2p) models [1] that generate pronunciations for words not yet included in the dictionary.
The most recent release is 0.7b; it contains over 134,000 entries. An interactive lookup version is available. [2]
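For programmatic access, the Python NLTK toolkit ships the dictionary as its cmudict corpus. The sketch below is a minimal lookup example; it assumes NLTK is installed and that the corpus has been fetched once with nltk.download("cmudict").

```python
# Minimal CMUdict lookup via NLTK's cmudict corpus.
# Assumes: pip install nltk, then nltk.download("cmudict") once.
from nltk.corpus import cmudict

pronunciations = cmudict.dict()  # lowercase word -> list of pronunciations

# Each pronunciation is a list of ARPABET phones; vowels carry stress digits.
for variant in pronunciations.get("tomato", []):
    print(" ".join(variant))
# Typical output (two variants):
#   T AH0 M EY1 T OW2
#   T AH0 M AA1 T OW2
```

Words absent from the dictionary simply return no entries here; as noted above, a g2p model trained on CMUdict is the usual fallback for such out-of-vocabulary words.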
The database is distributed as a plain-text file with one entry per line in the format "WORD  <pronunciation>", with a two-space separator between the parts. If multiple pronunciations are available for a word, the variants are identified with numbered suffixes (e.g. "WORD(1)"). The pronunciation is encoded using a modified form of the ARPABET system, with the addition of stress marks (levels 0, 1, and 2) on vowels. A line-initial ";;;" token indicates a comment. A derived format, directly suitable for speech recognition engines, is also available as part of the distribution; it collapses the stress distinctions, which are typically not used in ASR.
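Because the format is line-oriented, the raw file can also be parsed directly. The following is a small sketch under the format rules just described; the filename cmudict-0.7b and the Latin-1 encoding are assumptions about the distributed file rather than part of the format itself.

```python
# Minimal parser for the raw CMUdict entry format described above.
# Assumptions: file named "cmudict-0.7b", Latin-1 encoded.
import re
from collections import defaultdict

VARIANT = re.compile(r"\(\d+\)$")  # numbered variant suffix, e.g. WORD(1)

def load_cmudict(path="cmudict-0.7b"):
    lexicon = defaultdict(list)
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;"):  # line-initial ;;; marks a comment
                continue
            word, _, pron = line.rstrip().partition("  ")  # two-space separator
            word = VARIANT.sub("", word)  # fold WORD(1) back onto WORD
            lexicon[word].append(pron.split())
    return lexicon

lex = load_cmudict()
print(lex["TOMATO"])  # e.g. [['T', 'AH0', 'M', 'EY1', 'T', 'OW2'], ...]
```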
The following tables list the phonemes used by the CMU Pronouncing Dictionary, with their pronunciation respellings and IPA equivalents. [2]
ARPABET | Respelling | IPA | Example |
---|---|---|---|
AA | ah | ɑ | odd |
AE | a | æ | at |
AH0 | ə | ə | about |
AH | uh | ʌ | hut |
AO | aw | ɔ | ought, story |
AW | ow | aʊ | cow |
AY | eye | aɪ | hide |
EH | eh | ɛ | Ed |
ER | ur, ər | ɝ, ɚ | hurt |
EY | ay | eɪ | ate |
IH | i, ih | ɪ | it |
IY | ee | i | eat |
OW | oh | oʊ | oat |
OY | oy | ɔɪ | toy |
UH | uu | ʊ | hood |
UW | oo | u | two |
ARPABET | Description |
---|---|
0 | No stress |
1 | Primary stress |
2 | Secondary stress |
ARPABET | Respelling | IPA | Example |
---|---|---|---|
B | b | b | be |
CH | ch, tch | tʃ | cheese |
D | d | d | dee |
DH | dh | ð | thee |
F | f | f | fee |
G | g | ɡ | green |
HH | h | h | he |
JH | j | dʒ | gee |
K | k | k | key |
L | l | l | lee |
M | m | m | me |
N | n | n | knee |
NG | ng | ŋ | ping |
P | p | p | pee |
R | r | r | read |
S | s, ss | s | sea |
SH | sh | ʃ | she |
T | t | t | tea |
TH | th | θ | theta |
V | v | v | vee |
W | w, wh | w | we |
Y | y | j | yield |
Z | z | z | zee |
ZH | zh | ʒ | seizure |
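Since the stress digits 0, 1, and 2 attach only to vowel phonemes, some simple analyses fall out of the notation directly. The sketch below counts syllables and locates the primary stress by looking for digit-suffixed phones; the sample pronunciation is one a lookup like those above would return.

```python
# Stress digits (0, 1, 2) appear only on vowel phones, so counting
# digit-suffixed phones yields a syllable count, and the digit 1
# marks the primary-stressed syllable.
def syllable_count(pron):
    return sum(1 for phone in pron if phone[-1].isdigit())

def primary_stress_syllable(pron):
    vowels = [p for p in pron if p[-1].isdigit()]
    for i, phone in enumerate(vowels):
        if phone.endswith("1"):
            return i  # zero-based index of the primary-stressed syllable
    return None  # no primary stress marked

# A pronunciation as produced by either lookup sketch above.
pron = ["T", "AH0", "M", "EY1", "T", "OW2"]  # TOMATO
print(syllable_count(pron))           # 3
print(primary_stress_syllable(pron))  # 1 (second syllable)
```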
Version | Release date [3] | License |
---|---|---|
0.1 | 16 September 1993 | Public Domain |
0.2 | 10 March 1994 | Public Domain |
0.3 | 28 September 1994 | Public Domain |
0.4 | 8 November 1995 | Public Domain |
0.5 | No public release | Public Domain |
0.6 | 11 August 1998 | Public Domain |
0.7 | No public release | Public Domain |
0.7a | 18 February 2008 | 2-clause BSD |
0.7b | 19 November 2014 [4] | 2-clause BSD |
GitHub (unversioned) | 26 May 2021 | 2-clause BSD |
The International Phonetic Alphabet (IPA) is an alphabetic system of phonetic notation based primarily on the Latin script. It was devised by the International Phonetic Association in the late 19th century as a standard written representation for the sounds of speech. The IPA is used by lexicographers, foreign language students and teachers, linguists, speech–language pathologists, singers, actors, constructed language creators, and translators.
An orthography is a set of conventions for writing a language, including norms of spelling, punctuation, word boundaries, capitalization, hyphenation, and emphasis.
A phoneme is any set of similar speech sounds that is perceptually regarded by the speakers of a language as a single basic sound, the smallest phonetic unit that helps distinguish one word from another. All languages contain phonemes, and all spoken languages include both consonant and vowel phonemes. Phonemes are primarily studied under the branch of linguistics known as phonology.
Received Pronunciation (RP) is the accent regarded as the standard and most prestigious form of spoken British English, since at least the early 20th century. Language scholars have long disagreed on questions such as: the exact definition of RP, how geographically neutral it is, how many speakers there are, the nature and classification of its sub-varieties, how appropriate a choice it is as a standard, how the accent has changed over time, and even its name. RP is an accent, so the study of RP is concerned only with matters of pronunciation, while other features of Standard British English, such as vocabulary, grammar, and style, are not considered. The accent has changed, or its traditional users have changed their accents, to such a degree over the last century that many of its early 20th-century traditions of transcription and analysis have become outdated and are therefore no longer considered evidence-based by linguists. Still, in language education these traditions continue to be commonly taught and used, and the use of RP as a convenient umbrella term remains popular.
The Speech Assessment Methods Phonetic Alphabet (SAMPA) is a computer-readable phonetic script using 7-bit printable ASCII characters, based on the International Phonetic Alphabet (IPA). It was originally developed in the late 1980s for six European languages by the EEC ESPRIT information technology research and development program. As many symbols as possible have been taken over from the IPA; where this is not possible, other available signs are used, e.g. [@] for schwa, [2] for the vowel sound of French deux 'two', and [9] for the vowel sound of French neuf 'nine'.
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.
Pronunciation is the way in which a word or a language is spoken. This may refer to generally agreed-upon sequences of sounds used in speaking a given word or language in a specific dialect or simply the way a particular individual speaks a word or language.
Phonetic transcription is the visual representation of speech sounds by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the International Phonetic Alphabet.
A phonemic orthography is an orthography in which the graphemes correspond consistently to the language's phonemes, or more generally to the language's diaphonemes. Natural languages rarely have perfectly phonemic orthographies; a high degree of grapheme–phoneme correspondence can be expected in orthographies based on alphabetic writing systems, but they differ in how complete this correspondence is. English orthography, for example, is alphabetic but highly nonphonemic.
The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black, Paul Taylor and Richard Caley at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Substantial contributions have also been provided by Carnegie Mellon University and other sites. It is distributed under a free software license similar to the BSD License.
The Andrew Project was a distributed computing environment developed at Carnegie Mellon University beginning in 1982. It was an ambitious project for its time and resulted in an unprecedentedly vast and accessible university computing infrastructure. The project was named after Andrew Carnegie and Andrew Mellon, the founders of the institutions that eventually became Carnegie Mellon University.
A pronunciation respelling for English is a notation used to convey the pronunciation of words in the English language, which does not have a phonemic orthography.
Automatic pronunciation assessment is the use of speech recognition to verify the correctness of pronounced speech, as distinguished from manual assessment by an instructor or proctor. Also called speech verification, pronunciation evaluation, and pronunciation scoring, the main application of this technology is computer-aided pronunciation teaching (CAPT) when combined with computer-aided instruction for computer-assisted language learning (CALL), speech remediation, or accent reduction. Pronunciation assessment does not determine unknown speech but instead, knowing the expected word(s) in advance, it attempts to verify the correctness of the learner's pronunciation and ideally their intelligibility to listeners, sometimes along with often inconsequential prosody such as intonation, pitch, tempo, rhythm, and syllable and word stress. Pronunciation assessment is also used in reading tutoring, for example in products such as Microsoft Teams and from Amira Learning. Automatic pronunciation assessment can also be used to help diagnose and treat speech disorders such as apraxia.
CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers and an acoustic model trainer (SphinxTrain).
ARPABET is a set of phonetic transcription codes developed by the Advanced Research Projects Agency (ARPA) as part of its Speech Understanding Research project in the 1970s. It represents phonemes and allophones of General American English with distinct sequences of ASCII characters. Two systems were devised, one representing each segment with one character and the other with one or two characters (case-insensitive); the latter has been far more widely adopted.
The Pronunciation Lexicon Specification (PLS) is a W3C Recommendation, which is designed to enable interoperable specification of pronunciation information for both speech recognition and speech synthesis engines within voice browsing applications. The language is intended to be easy to use by developers while supporting the accurate specification of pronunciation information for international use.
Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.
The English Pronouncing Dictionary (EPD) was created by the British phonetician Daniel Jones and was first published in 1917. It originally comprised over 50,000 headwords listed in their spelling form, each of which was given one or more pronunciations transcribed using a set of phonemic symbols based on a standard accent. The dictionary is now in its 18th edition. John C. Wells has written of it: "EPD has set the standard against which other dictionaries must inevitably be judged."