| Original author(s) | Jonathan Duddington |
|---|---|
| Developer(s) | Alexander Epaneshnikov et al. |
| Initial release | February 2006 |
| Stable release | |
| Repository | GitHub |
| Written in | C |
| Operating system | Linux, Windows, macOS, FreeBSD |
| Type | Speech synthesizer |
| License | GPLv3 |
| Website | GitHub |
eSpeak is a free and open-source, cross-platform, compact software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG (Next Generation) is a continuation of the original developer's project with more feedback from native speakers.
Because of its small size and wide language support, eSpeakNG is included in the NVDA [2] open-source screen reader for Windows, as well as on Android, [3] Ubuntu [4] and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016 [5] and was used by Google Translate for 27 languages in 2010; [6] 17 of these were subsequently replaced by proprietary voices. [7]
The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia. [8] Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.
In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English. [9] On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007. [10] Development on Speak continued until version 1.14, when it was renamed to eSpeak.
Development of eSpeak continued from 1.16 (there was no 1.15 release) [10] with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. The 1.24.02 version of eSpeak was the first to be version-controlled using Subversion, [11] with separate source and binary downloads made available on SourceForge. [10] From eSpeak 1.27, eSpeak was updated to use the GPLv3 license. [11] The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS. [12] The last development release of eSpeak was 1.48.15 on 16 April 2015. [13]
eSpeak uses the Usenet scheme to represent phonemes with ASCII characters. [14]
On 25 June 2010, [15] Reece Dunn started a fork of eSpeak on GitHub using the 1.43.46 release. This started off as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.
On 4 October 2015 (6 months after the 1.48.15 release of eSpeak), this fork started diverging more significantly from the original eSpeak. [16] [17]
On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the eight months since the last eSpeak development release. These evolved into discussions about continuing development of eSpeak in Jonathan's absence. [18] [19] The result was the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.
On 11 December 2015, the espeak-ng fork was started. [20] The first release of espeak-ng was 1.49.0 on 10 September 2016, [21] containing significant code cleanup, bug fixes, and language updates.
eSpeakNG can be used as a command-line program, or as a shared library.
It supports Speech Synthesis Markup Language (SSML).
Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
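The voice naming convention above can be sketched as a tiny parser. This is an illustrative helper, not eSpeakNG code; the function name `split_voice` is invented for the example.

```python
# Illustrative sketch (not part of eSpeakNG): splitting a voice name such as
# "af+f2" into the base language voice and the optional variant, the way the
# "-v" argument is structured on the espeak-ng command line.

def split_voice(name):
    """Return (base_voice, variant) for a voice string like 'af' or 'af+f2'."""
    base, sep, variant = name.partition("+")
    return base, (variant if sep else None)
```

For example, `split_voice("af+f2")` returns `("af", "f2")`, while `split_voice("af")` returns `("af", None)`, i.e. the unmodified Afrikaans voice.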
eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.
Phonetic representations can be included within text input by enclosing them in double square brackets. For example, espeak-ng -v en "Hello [[w3:ld]]" will say "Hello world" in English, with the second word rendered from the phoneme string.
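The double-square-bracket convention can be sketched as a small tokenizer. This is an illustrative helper, not part of eSpeakNG; the function name `tokenize` is invented for the example.

```python
import re

# Illustrative sketch: splitting mixed input such as 'Hello [[w3:ld]]' into
# plain-text chunks and literal phoneme strings, mirroring eSpeakNG's
# double-square-bracket convention for embedding phonemes in input text.

TOKEN = re.compile(r"\[\[(.+?)\]\]")

def tokenize(text):
    """Yield ('text', s) or ('phonemes', s) chunks in input order."""
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        yield ("phonemes", m.group(1))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])
```

With this sketch, `tokenize("Hello [[w3:ld]]")` yields a text chunk `"Hello "` followed by a phoneme chunk `"w3:ld"`; the text chunks would go through normal text-to-phoneme translation, while phoneme chunks bypass it.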
eSpeakNG can be used as a text-to-speech translator in different ways, depending on which stage of the text-to-speech pipeline the user wants to use.
There are many languages (notably English) which do not have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.
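The need for more than per-letter rules can be seen in a toy sketch. Both the letter rules and the exception list below are invented for illustration; eSpeakNG's real behaviour comes from its per-language rules and exception-list data files, not from anything this simple.

```python
# Toy sketch of why English text-to-phoneme translation needs an exception
# dictionary on top of spelling rules: 'ough' alone has several pronunciations.
# All mappings here are invented for illustration.

LETTER_RULES = {"c": "k", "a": "a", "t": "t"}  # naive one-letter rules

EXCEPTIONS = {
    "cough": "k0f",   # 'ough' as in "off"
    "though": "D@U",  # 'ough' as in "dough"
}

def to_phonemes(word):
    """Exception lookup first, then naive letter-by-letter rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return "".join(LETTER_RULES.get(ch, ch) for ch in word)
```

Here `to_phonemes("cat")` works from the letter rules alone, but "cough" and "though" only come out right via the exception list, which is why real systems carry large pronunciation dictionaries alongside their rules.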
To add intonation, prosody data are necessary: syllable stress, rising or falling pitch of the fundamental frequency, pauses, and so on. This information allows the synthesizer to produce more human-sounding, less monotonous speech. In eSpeakNG notation, for example, a stressed syllable is marked with an apostrophe, as in z'i@r0ks, which yields more natural speech:
For comparison, two samples with and without prosody data:
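The apostrophe stress notation can be sketched as follows. The vowel set and the function name are assumptions made for the example, and the rule "stress the first syllable" is hard-coded; this is not how eSpeakNG's stress assignment actually works.

```python
# Illustrative sketch: inserting eSpeakNG's apostrophe stress marker before
# the vowel of the stressed syllable, e.g. turning "zi@r0ks" into "z'i@r0ks".
# The ASCII "vowel" set below is a rough assumption for the example only.

VOWELS = set("aeiou0@3AEIOU")

def stress_first_syllable(phonemes):
    """Insert ' before the first vowel, marking the first syllable stressed."""
    for i, ch in enumerate(phonemes):
        if ch in VOWELS:
            return phonemes[:i] + "'" + phonemes[i:]
    return phonemes  # no vowel found: leave unchanged
```

For instance, `stress_first_syllable("zi@r0ks")` produces `"z'i@r0ks"`, the stressed form shown above.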
If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.
eSpeakNG provides two types of formant speech synthesis, using two different approaches: its own eSpeakNG synthesizer and a Klatt synthesizer. [22]
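The core idea behind formant synthesis can be illustrated with a minimal sketch: an excitation pulse train passed through a two-pole digital resonator tuned to one formant frequency. This is not eSpeakNG's actual code; real synthesizers cascade several resonators and vary their frequencies over time.

```python
import math

# Minimal formant-synthesis illustration (not eSpeakNG's implementation):
# a glottal pulse train filtered by one two-pole resonator, the building
# block used (in cascade or parallel) by Klatt-style synthesizers.

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator coefficients for a given formant and bandwidth."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b1 = 2 * r * math.cos(2 * math.pi * freq_hz / sample_rate)
    b2 = -r * r
    a0 = 1 - b1 - b2  # normalize gain at DC
    return a0, b1, b2

def synthesize(f0=120, formant=700, bandwidth=100, rate=16000, n=1600):
    """Return n samples of a pulse train shaped by one formant resonator."""
    a0, b1, b2 = resonator_coeffs(formant, bandwidth, rate)
    period = rate // f0          # samples between glottal pulses
    y1 = y2 = 0.0                # filter state (previous two outputs)
    out = []
    for i in range(n):
        x = 1.0 if i % period == 0 else 0.0  # excitation: pulse train
        y = a0 * x + b1 * y1 + b2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out
```

Changing `formant` moves the resonant peak and therefore the perceived vowel colour, which is essentially what a formant synthesizer does continuously as it speaks; this compactness of control data is why the approach yields such small voice files.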
For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours, and passes these to the MBROLA program using the PHO file format, capturing the audio that MBROLA produces. That audio is then handled by eSpeakNG.
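The PHO exchange format is line-oriented: each line carries a phoneme symbol, a duration in milliseconds, and optional (position %, pitch Hz) pairs describing the pitch contour. A minimal writer can be sketched as below; the helper name and the sample phonemes are invented for the example.

```python
# Illustrative sketch of writing MBROLA .pho data, the format eSpeakNG uses
# to hand phonemes, durations, and pitch contours to the MBROLA program.
# The phoneme symbols and numbers in the usage example are made up.

def to_pho(phonemes):
    """phonemes: list of (symbol, duration_ms, [(percent, pitch_hz), ...])."""
    lines = []
    for symbol, dur, contour in phonemes:
        parts = [symbol, str(dur)]
        for percent, hz in contour:
            parts += [str(percent), str(hz)]
        lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"
```

For example, `to_pho([("h", 60, []), ("@", 110, [(50, 120)])])` yields two lines: `h 60` and `@ 110 50 120`, the latter saying the schwa lasts 110 ms and reaches 120 Hz halfway through.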
eSpeakNG performs text-to-speech synthesis for the following languages: [24]