Chinese speech synthesis


Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Chinese). It poses additional difficulties because Chinese characters frequently have different pronunciations in different contexts, because the complex prosody is essential to conveying the meaning of words, and because native speakers sometimes disagree about the correct pronunciation of certain phonemes.
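The context-dependence of character readings can be illustrated with a minimal sketch. It assumes a tiny, hypothetical lexicon (the names `WORD_READINGS`, `CHAR_DEFAULTS` and `to_pinyin` are invented for illustration) and uses greedy longest-match lookup, which is only one of several possible disambiguation strategies and is not taken from any of the synthesizers described here.

```python
# Hypothetical mini-lexicon: multi-character words override per-character defaults.
WORD_READINGS = {
    "银行": ["yin2", "hang2"],  # "bank": the character 行 is read hang2 here
    "行走": ["xing2", "zou3"],  # "to walk": the same character is read xing2
}
# Assumed default reading for each character when no word matches.
CHAR_DEFAULTS = {"银": "yin2", "行": "xing2", "走": "zou3", "我": "wo3", "去": "qu4"}


def to_pinyin(text, max_word_len=4):
    """Greedy longest-match lookup, falling back to per-character defaults."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for n in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + n]
            if word in WORD_READINGS:
                out.extend(WORD_READINGS[word])
                i += n
                break
        else:
            # No multi-character word matched at this position.
            out.append(CHAR_DEFAULTS.get(text[i], "?"))
            i += 1
    return out
```

With this lexicon, the character 行 comes out as `hang2` inside 银行 but as `xing2` on its own, which is the kind of ambiguity a real system must resolve with a far larger dictionary or a statistical model.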


Concatenation (Ekho and KeyTip)

Recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; these synthesizers are also inflexible in terms of speed and expression. However, because these synthesizers do not rely on a corpus, there is no noticeable degradation in performance when they are given more unusual or awkward phrases.
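Simple concatenation can be sketched as follows. This is an illustration under stated assumptions, not the code of any particular synthesizer: the "recordings" are synthetic sine tones standing in for sampled syllables, and the names `SAMPLE_RATE`, `fake_recording`, `RECORDINGS` and `concatenate` are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz


def fake_recording(freq_hz, duration_s=0.2):
    """Stand-in for a sampled syllable: a short sine tone as a list of floats."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


# Hypothetical store of one recording per pinyin syllable.
RECORDINGS = {"ni3": fake_recording(220.0), "hao3": fake_recording(330.0)}


def concatenate(syllables, gap_s=0.0):
    """Join the stored recordings end to end, with optional silence between them."""
    gap = [0.0] * int(SAMPLE_RATE * gap_s)
    out = []
    for k, syl in enumerate(syllables):
        if k > 0:
            out.extend(gap)
        out.extend(RECORDINGS[syl])
    return out
```

Because each recording is used unchanged, nothing smooths the pitch or amplitude across the join, which is why the joins sound forced and prosody suffers.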

Ekho is an open-source TTS engine which simply concatenates sampled syllables. [1] It currently supports Cantonese, Mandarin, and experimentally Korean. Some of the Mandarin syllables have been pitch-normalised in Praat. A modified version of these is used in Gradint's "synthesis from partials". [2]

cjkware.com used to ship a product called KeyTip Putonghua Reader which worked similarly; it contained 120 Megabytes of sound recordings (GSM-compressed to 40 Megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in 6 different prosodies (4 tones, neutral tone, and an extra third-tone recording for use at the end of a phrase).
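Choosing among six recordings per syllable might look like the sketch below. It is not KeyTip's actual logic, which is not documented here: the function name `pick_variants`, the variant labels, and the convention of writing the neutral tone as `5` are all assumptions, and tone sandhi is ignored.

```python
def pick_variants(syllables):
    """Map tone-numbered pinyin syllables to (base, variant) recording keys.

    Assumed variants: "1".."4" for the four tones, "neutral" for tone 5,
    and "3-final" for a third tone at the end of the phrase.
    """
    chosen = []
    last = len(syllables) - 1
    for k, syl in enumerate(syllables):
        base, tone = syl[:-1], syl[-1]
        if tone == "3" and k == last:
            variant = "3-final"  # extra third-tone recording for phrase ends
        elif tone == "5":
            variant = "neutral"
        else:
            variant = tone
        chosen.append((base, variant))
    return chosen
```

A fuller system would first look for a matching multi-syllable dictionary word and fall back to single-syllable recordings only when none is found.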


Lightweight synthesizers (eSpeak and Yuet)

The lightweight open-source speech project eSpeak, which has its own approach to synthesis, has experimented with Mandarin and Cantonese. eSpeak was used by Google Translate from May 2010 [3] until December 2010. [4]

The commercial product "Yuet" is also lightweight (it is intended to be suitable for resource-constrained environments like embedded systems); it was written from scratch in ANSI C starting in 2013. Yuet claims a built-in NLP model that does not require a separate dictionary; the synthesised speech is claimed to have clear word boundaries and emphasis on appropriate words. Communication with its author is required to obtain a copy. [5]

Both eSpeak and Yuet can synthesise speech for Cantonese and Mandarin from the same input text, and can output the corresponding romanisation (for Cantonese, Yuet uses Yale and eSpeak uses Jyutping; both use Pinyin for Mandarin). eSpeak ignores word boundaries when they do not affect which syllable should be spoken.

Corpus-based

A "corpus-based" approach can sound very natural in most cases but can err in dealing with unusual phrases if they can't be matched with the corpus. [6] The synthesiser engine is typically very large (hundreds or even thousands of megabytes) due to the size of the corpus.

iFlyTek

Anhui USTC iFlyTek Co., Ltd (iFlyTek) published a W3C paper in which they adapted Speech Synthesis Markup Language to produce a mark-up language called Chinese Speech Synthesis Markup Language (CSSML), which can include additional markup to clarify the pronunciation of characters and to add some prosody information. [7] The amount of data involved is not disclosed by iFlyTek but can be inferred from the commercial products to which iFlyTek have licensed their technology; for example, Bider's SpeechPlus is a 1.3 Gigabyte download, 1.2 Gigabytes of which is used for the highly compressed data for a single Chinese voice. iFlyTek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".

The iFlyTek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin alone. It is sometimes possible by means of CSSML to add pinyin to the characters to disambiguate between multiple possible pronunciations, but this does not always work.

NeoSpeech

There is an online interactive demonstration for NeoSpeech speech synthesis, [8] which accepts Chinese characters and also pinyin if it's enclosed in their proprietary "VTML" markup. [9]
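Pinyin input of this kind can be wrapped programmatically. The sketch below only reproduces the element form quoted in the reference for this section (`vtml_phoneme` with `alphabet="x-pinyin"` and a `ph` attribute); the helper name `vtml_pinyin` is hypothetical, and any further VTML attributes or element content should be checked against NeoSpeech's own documentation.

```python
from xml.sax.saxutils import quoteattr


def vtml_pinyin(syllables):
    """Wrap tone-numbered pinyin syllables in a vtml_phoneme element."""
    ph = "".join(syllables)
    # quoteattr adds the surrounding quotes and escapes special characters.
    return f'<vtml_phoneme alphabet="x-pinyin" ph={quoteattr(ph)}></vtml_phoneme>'


markup = vtml_pinyin(["ni3", "hao3"])
```

Here `markup` holds the same string as the example given in the references, built from a list of syllables rather than typed by hand.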

Mac OS

Mac OS had Chinese speech synthesizers available up to version 9. They were removed in 10.0 and reinstated in 10.7 (Lion). [10]

Historical corpus-based synthesizers (no longer available)

A corpus-based approach was taken by Tsinghua University in SinoSonic, with the Harbin dialect voice data taking 800 Megabytes. This was planned to be offered as a download, but the link was never activated. Only references to it can now be found, on the Internet Archive. [11]

Bell Labs' approach, which was demonstrated online in 1997 but subsequently removed, was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN 978-0-7923-8027-6). The former employee responsible for the project, Chilin Shih (who subsequently worked at the University of Illinois), put some notes about her methods on her website. [12]

References

  1. Ekho
  2. Gradint
  3. "Giving a voice to more languages on Google Translate".
  4. "Listen to us now!".
  5. "Yuet, the tiny Cantonese TTS engine, Cantonese speech synthesis engine for offline embedded system".
  6. "Chinese mistakes in commercial speech synthesizers".
  7. http://www.w3.org/2005/08/SSML/Papers/iFLYTech.pdf
  8. "Home". neospeech.com.
  9. For example, <vtml_phoneme alphabet="x-pinyin" ph="ni3hao3"></vtml_phoneme>; see pages 7 and 25-27 of https://ondemand.neospeech.com/vt_eng-Engine-VTML-v3.9.0-3.pdf
  10. Voice packs are automatically downloaded as needed when selected in System Preferences, Speech Settings, Text to Speech, System Voice, Customize. Three Chinese female voices are available, one each for Mainland China, Hong Kong and Taiwan.
  11. "Research Group of Human Computer Speech Interaction". hcsi.cs.tsinghua.edu.cn. Archived from the original on 13 August 2004. Retrieved 12 January 2022.
  12. Home Page: Chilin Shih (Internet Archive link)