Chinese speech synthesis

Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Chinese). It poses additional difficulties for three reasons: Chinese characters frequently have different pronunciations in different contexts; the complex prosody is essential to conveying the meaning of words; and native speakers sometimes disagree about the correct pronunciation of certain phonemes.
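The context-dependence of character readings can be illustrated with a minimal sketch. It assumes a tiny, hypothetical lexicon (`WORD_READINGS`, `CHAR_DEFAULTS`) and uses greedy longest-match lookup, so that a multi-character word overrides the default reading of a character such as 行. This is only an illustration of the problem, not the method of any synthesiser described in this article.

```python
# Hypothetical mini-lexicon: multi-character words override per-character defaults.
WORD_READINGS = {
    "银行": ["yin2", "hang2"],   # "bank": 行 is read hang2 here
    "行走": ["xing2", "zou3"],   # "to walk": 行 is read xing2 here
}
# Hypothetical default reading for each character when no word matches.
CHAR_DEFAULTS = {"银": "yin2", "行": "xing2", "走": "zou3", "我": "wo3", "去": "qu4"}


def to_pinyin(text, max_word_len=4):
    """Greedy longest-match lookup: try the longest known word first,
    then fall back to the per-character default reading."""
    result = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in WORD_READINGS:
                result.extend(WORD_READINGS[word])
                i += length
                break
        else:
            # No multi-character word starts here; use the single-character default
            # (unknown characters are passed through unchanged).
            result.append(CHAR_DEFAULTS.get(text[i], text[i]))
            i += 1
    return result


print(to_pinyin("我去银行"))
```

Real systems need far larger lexicons and often statistical disambiguation, since greedy matching can pick the wrong word boundary.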

Concatenation (Ekho and KeyTip)

Recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; these synthesizers are also inflexible in terms of speed and expression. However, because these synthesizers do not rely on a corpus, there is no noticeable degradation in performance when they are given more unusual or awkward phrases.
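Simple concatenation can be sketched as follows. The "recordings" here are synthetic sine tones standing in for sampled syllables, and the names `RECORDINGS`, `fake_recording` and `concatenate` are hypothetical; the sketch is not taken from Ekho or KeyTip. It shows why any combination of syllables can be produced, and also why the joins are abrupt: the samples are simply placed end to end with no smoothing.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def fake_recording(freq_hz, duration_ms=200):
    """Stand-in for a recorded syllable: a plain sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


# Hypothetical store of one recording per tone-numbered pinyin syllable.
RECORDINGS = {
    "ni3": fake_recording(220.0),
    "hao3": fake_recording(330.0),
}


def concatenate(syllables, gap_ms=0):
    """Join the recordings end to end, optionally with silence between them."""
    gap = [0.0] * (SAMPLE_RATE * gap_ms // 1000)
    out = []
    for index, syllable in enumerate(syllables):
        if index > 0:
            out.extend(gap)
        out.extend(RECORDINGS[syllable])  # raises KeyError for an unknown syllable
    return out


samples = concatenate(["ni3", "hao3"], gap_ms=50)
print(len(samples) / SAMPLE_RATE, "seconds of audio")
```

Because each syllable is stored as a fixed waveform, changing speed or expression would require signal processing beyond this kind of lookup and join.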

Ekho is an open-source TTS system that simply concatenates sampled syllables. [1] It currently supports Cantonese and Mandarin, with experimental support for Korean. Some of the Mandarin syllables have been pitch-normalised in Praat. A modified version of these is used in Gradint's "synthesis from partials". [2]

cjkware.com used to ship a product called KeyTip Putonghua Reader which worked similarly; it contained 120 Megabytes of sound recordings (GSM-compressed to 40 Megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in 6 different prosodies (4 tones, neutral tone, and an extra third-tone recording for use at the end of a phrase).
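The idea of keeping six prosodic variants per syllable can be sketched as a lookup that picks a recording by tone and by position in the phrase. Everything here is an assumption made for illustration: the key naming (`_final`), the use of tone number 5 for the neutral tone, and the function names. It is not KeyTip's actual selection logic, which is not documented in this article.

```python
def recording_key(syllable, is_phrase_final):
    """Map a tone-numbered pinyin syllable to the name of a recording variant.

    Tones 1-4 are the four tones; 5 is assumed to denote the neutral tone.
    A third-tone syllable at the end of a phrase uses a separate recording.
    """
    base, tone = syllable[:-1], syllable[-1]
    if tone not in "12345":
        raise ValueError(f"expected a tone number at the end of {syllable!r}")
    if tone == "3" and is_phrase_final:
        return f"{base}3_final"
    return f"{base}{tone}"


def plan_phrase(syllables):
    """Choose a recording variant for every syllable in one phrase."""
    last = len(syllables) - 1
    return [recording_key(s, i == last) for i, s in enumerate(syllables)]


print(plan_phrase(["ni3", "hao3"]))
```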

Lightweight synthesizers (eSpeak and Yuet)

The lightweight open-source speech project eSpeak, which has its own approach to synthesis, has experimented with Mandarin and Cantonese. eSpeak was used by Google Translate from May 2010 [3] until December 2010. [4]

The commercial product "Yuet" is also lightweight (it is intended to be suitable for resource-constrained environments such as embedded systems); it was written from scratch in ANSI C, starting in 2013. Yuet claims a built-in NLP model that does not require a separate dictionary, and its synthesised speech is claimed to have clear word boundaries and emphasis on appropriate words. A copy can be obtained only by contacting its author. [5]

Both eSpeak and Yuet can synthesise speech for Cantonese and Mandarin from the same input text, and can output the corresponding romanisation (for Cantonese, Yuet uses Yale and eSpeak uses Jyutping; both use Pinyin for Mandarin). eSpeak does not concern itself with word boundaries when they do not affect which syllable should be spoken.

Corpus-based

A "corpus-based" approach can sound very natural in most cases but can err in dealing with unusual phrases if they can't be matched with the corpus. [6] The synthesiser engine is typically very large (hundreds or even thousands of megabytes) due to the size of the corpus.

iFlyTek

Anhui USTC iFlyTek Co., Ltd (iFlyTek) published a W3C paper in which they adapted Speech Synthesis Markup Language to produce a mark-up language called Chinese Speech Synthesis Markup Language (CSSML), which can include additional markup to clarify the pronunciation of characters and to add some prosody information. [7] iFlyTek does not disclose the amount of data involved, but it can be inferred from the commercial products that license iFlyTek's technology; for example, Bider's SpeechPlus is a 1.3 Gigabyte download, 1.2 Gigabytes of which is the highly compressed data for a single Chinese voice. iFlyTek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".

The iFlyTek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin alone. It is sometimes possible by means of CSSML to add pinyin to the characters to disambiguate between multiple possible pronunciations, but this does not always work.

NeoSpeech

There is an online interactive demonstration for NeoSpeech speech synthesis, [8] which accepts Chinese characters and also pinyin if it's enclosed in their proprietary "VTML" markup. [9]
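As a small illustration, the following sketch wraps a tone-numbered pinyin string in the `vtml_phoneme` element quoted in reference [9]. Only the element name and the `alphabet="x-pinyin"` and `ph` attributes come from that reference; the helper name `vtml_pinyin` is hypothetical, and whether a given NeoSpeech engine accepts the result depends on the engine and its version.

```python
from xml.sax.saxutils import quoteattr


def vtml_pinyin(pinyin):
    """Wrap tone-numbered pinyin in the vtml_phoneme markup quoted in reference [9].

    quoteattr() adds the surrounding quotes and escapes any special characters.
    """
    return f'<vtml_phoneme alphabet="x-pinyin" ph={quoteattr(pinyin)}></vtml_phoneme>'


print(vtml_pinyin("ni3hao3"))
```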

Mac OS

Mac OS had Chinese speech synthesizers available up to version 9. They were removed in 10.0 and reinstated in 10.7 (Lion). [10]

Historical corpus-based synthesizers (no longer available)

A corpus-based approach was taken by Tsinghua University in SinoSonic, with the Harbin dialect voice data taking 800 Megabytes. This was planned to be offered as a download, but the link was never activated. Only references to it can now be found on the Internet Archive. [11]

Bell Labs' approach, which was demonstrated online in 1997 but subsequently removed, was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN 978-0-7923-8027-6). The former employee responsible for the project, Chilin Shih (who subsequently worked at the University of Illinois), put some notes about her methods on her website. [12]

References

  1. Ekho
  2. Gradint
  3. "Giving a voice to more languages on Google Translate".
  4. "Listen to us now!".
  5. "Yuet, the tiny Cantonese TTS engine, Cantonese speech synthesis engine for offline embedded system".
  6. "Chinese mistakes in commercial speech synthesizers".
  7. http://www.w3.org/2005/08/SSML/Papers/iFLYTech.pdf
  8. "Home". neospeech.com.
  9. for example <vtml_phoneme alphabet="x-pinyin" ph="ni3hao3"></vtml_phoneme>; see pages 7 and 25-27 of https://ondemand.neospeech.com/vt_eng-Engine-VTML-v3.9.0-3.pdf
  10. Voice packs are automatically downloaded as needed when selected in System Preferences, Speech Settings, Text to Speech, System Voice, Customize. Three Chinese female voices are available in the system, one each for Mainland China, Hong Kong and Taiwan.
  11. "Research Group of Human Computer Speech Interaction". hcsi.cs.tsinghua.edu.cn. Archived from the original on 13 August 2004. Retrieved 12 January 2022.
  12. Home Page: Chilin Shih (Internet Archive link)