Gnuspeech

Gnuspeech
Developer(s) Trillium Sound Research
Initial release 2002
Stable release
0.9 [1] / 14 October 2015
Repository
Platform Cross-platform
Type Text-to-speech
License GNU General Public License
Website www.gnu.org/software/gnuspeech/

Gnuspeech is an extensible text-to-speech software package that produces artificial speech output based on real-time articulatory speech synthesis by rules. That is, it converts text strings into phonetic descriptions, aided by a pronouncing dictionary, letter-to-sound rules, and rhythm and intonation models; transforms the phonetic descriptions into parameters for a low-level articulatory speech synthesizer; uses these to drive an articulatory model of the human vocal tract, producing output suitable for the normal sound output devices of the host operating system; and, for adult speech, does all of this in real time, that is, at the same rate as, or faster than, the speech would be spoken.
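
Conceptually, this is a pipeline of three stages. The Python sketch below is purely illustrative of that flow; all of the function and object names (dictionary, letter_to_sound, prosody_model, rules, tube_model) are hypothetical and do not correspond to Gnuspeech's actual modules or APIs.

    # Hypothetical sketch of a Gnuspeech-style processing chain; names invented.
    def text_to_phonetics(text, dictionary, letter_to_sound, prosody_model):
        """Text -> phonetic description, using a pronouncing dictionary,
        letter-to-sound rules for unknown words, and rhythm/intonation models."""
        words = text.split()
        phones = [dictionary.get(w.lower()) or letter_to_sound(w) for w in words]
        return prosody_model(phones)

    def phonetics_to_parameters(phonetic_description, rules):
        """Phonetic description -> frame-by-frame control parameters for the
        low-level articulatory synthesizer (synthesis by rules)."""
        return [rules(segment) for segment in phonetic_description]

    def synthesize(parameters, tube_model, sample_rate=44100):
        """Control parameters -> audio samples from the vocal-tract model,
        ready for the operating system's normal sound output."""
        return tube_model.render(parameters, sample_rate)

    def text_to_speech(text, dictionary, letter_to_sound, prosody_model, rules, tube_model):
        phonetics = text_to_phonetics(text, dictionary, letter_to_sound, prosody_model)
        parameters = phonetics_to_parameters(phonetics, rules)
        return synthesize(parameters, tube_model)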

Design

The synthesizer is a tube resonance, or waveguide, model that models the behavior of the real vocal tract directly, and reasonably accurately, unlike formant synthesizers, which model the speech spectrum only indirectly. [2] The control problem is solved by using René Carré's Distinctive Region Model, [3] which relates changes in the radii of eight longitudinal divisions of the vocal tract to the corresponding changes in the three formant frequencies of the speech spectrum that convey much of the information in speech. The regions are, in turn, based on work by the Stockholm Speech Technology Laboratory [4] of the Royal Institute of Technology (KTH) on "formant sensitivity analysis", that is, on how formant frequencies are affected by small changes in the radius of the vocal tract at various places along its length. [5]
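
As an illustration of the waveguide (acoustic tube) idea, the sketch below simulates a chain of cylindrical tube sections joined by Kelly-Lochbaum scattering junctions, where each junction's reflection coefficient depends on the ratio of the adjacent cross-sectional areas. This is a generic textbook construction under simplified assumptions (one sample of delay per section, lumped reflection at the ends), not Gnuspeech's actual Tube Resonance Model, and the Distinctive Region Model's radius-to-formant control mapping is not shown.

    import numpy as np

    def reflection_coefficients(areas):
        """k[i] = (A[i] - A[i+1]) / (A[i] + A[i+1]) at each internal junction."""
        a = np.asarray(areas, dtype=float)
        return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

    def simulate_tube(excitation, areas, r_glottis=0.75, r_lips=-0.85):
        """One-sample-per-section Kelly-Lochbaum simulation of a piecewise
        cylindrical tube driven at the glottis end; returns the signal
        transmitted past the lip end."""
        k = reflection_coefficients(areas)
        n = len(areas)
        fwd = np.zeros(n)          # rightward wave at the right end of each section
        bwd = np.zeros(n)          # leftward wave at the left end of each section
        out = np.zeros(len(excitation))
        for t, x in enumerate(excitation):
            fwd_in = np.zeros(n)   # rightward wave entering each section's left end
            bwd_in = np.zeros(n)   # leftward wave entering each section's right end
            # glottis: source plus partial reflection of the returning wave
            fwd_in[0] = x + r_glottis * bwd[0]
            # internal junctions: scattering between sections i and i+1
            for i in range(n - 1):
                fwd_in[i + 1] = (1 + k[i]) * fwd[i] - k[i] * bwd[i + 1]
                bwd_in[i] = k[i] * fwd[i] + (1 - k[i]) * bwd[i + 1]
            # lips: partial reflection back in, the remainder radiates out
            bwd_in[n - 1] = r_lips * fwd[n - 1]
            out[t] = (1 + r_lips) * fwd[n - 1]
            # one sample of propagation delay per section, per direction
            fwd, bwd = fwd_in, bwd_in
        return out

Feeding such a model an impulse or glottal pulse train with an area profile roughly like a neutral vowel yields an output whose spectrum shows formant-like peaks, and changing the areas moves those peaks; that radius-to-formant relationship is what the Distinctive Region Model exploits to control the tube.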

History

Gnuspeech was originally commercial software produced by the now-defunct Trillium Sound Research for the NeXT computer, as various grades of "TextToSpeech" kit. Trillium Sound Research was a technology transfer spin-off company formed at the University of Calgary, Alberta, Canada, building on long-standing research in the computer science department on computer-human interaction using speech; papers and manuals relevant to the system are maintained there. [6] The initial version in 1992 used a formant-based speech synthesizer. When NeXT ceased manufacturing hardware, the synthesizer software was completely rewritten [7] and ported to NSFIP (NextStep For Intel Processors), using the waveguide approach to acoustic tube modeling based on research at Stanford University's Center for Computer Research in Music and Acoustics (CCRMA), especially the Music Kit. The synthesis approach is explained in more detail in a paper presented to the American Voice I/O Society in 1995. [8] The waveguide (also known as the tube model) ran on the NeXT computer's onboard 56001 Digital Signal Processor (DSP) and, in the NSFIP version, on a Turtle Beach add-on board carrying the same DSP. Speed limitations meant that the shortest vocal tract length that could be synthesized in real time (that is, generated at the same or faster rate than it was "spoken") was around 15 centimeters, because the sample rate for the waveguide computations increases as the vocal tract length decreases. Faster processors are progressively removing this restriction, an important advance for producing children's speech in real time.
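
The real-time limit can be made concrete with a rough, heavily simplified calculation (an illustration only, not the actual arithmetic inside the Trillium/Gnuspeech tube model): if the tube is divided into a fixed number of sections and each section must correspond to one sample of propagation delay, the required sample rate is roughly the number of sections times the speed of sound divided by the tract length, so halving the tract length doubles the computation rate.

    # Back-of-the-envelope only; the section count (borrowing the DRM's eight
    # regions) and the speed of sound are assumptions made for illustration.
    SPEED_OF_SOUND = 350.0  # m/s, approximate value for warm, moist air

    def required_sample_rate(tract_length_m, n_sections=8):
        """fs such that each section spans one sample of propagation time:
        fs = n_sections * c / L."""
        return n_sections * SPEED_OF_SOUND / tract_length_m

    for length_m in (0.17, 0.15, 0.10):   # adult male, the ~15 cm limit, a child
        print(f"{length_m * 100:.0f} cm -> {required_sample_rate(length_m):.0f} Hz")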

Because NeXTSTEP has been discontinued and NeXT computers are rare, one option for running the original code is a virtual machine or emulator. The Previous emulator, for example, can emulate the NeXT computer's DSP, which the Trillium software can then use.

MONET (Gnuspeech) in NeXTSTEP 3.3 running inside Previous.

Trillium ceased trading in the late 1990s, and the Gnuspeech project was first entered into the GNU Savannah repository under the terms of the GNU General Public License in 2002 as official GNU software.

Because its free and open-source license allows the code to be customized, Gnuspeech has been used in academic research. [9] [10]

Related Research Articles

Additive synthesis is a sound synthesis technique that creates timbre by adding sine waves together.
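
A minimal NumPy sketch of the idea, summing a handful of harmonically related sine partials to shape a timbre (illustrative only):

    import numpy as np

    def additive_tone(freq, partial_amps, duration=1.0, sr=44100):
        """Sum sine-wave partials at integer multiples of freq, with the
        given amplitudes, to shape the resulting timbre."""
        t = np.arange(int(duration * sr)) / sr
        return sum(a * np.sin(2 * np.pi * (i + 1) * freq * t)
                   for i, a in enumerate(partial_amps))

    tone = additive_tone(220.0, [1.0, 0.5, 0.25, 0.125])  # 220 Hz, four partials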

Formant

In speech science and phonetics, a formant is the broad spectral maximum that results from an acoustic resonance of the human vocal tract. In acoustics, a formant is usually defined as a broad peak, or local maximum, in the spectrum. For harmonic sounds, with this definition, the formant frequency is sometimes taken as that of the harmonic that is most augmented by a resonance. The difference between these two definitions resides in whether "formants" characterise the production mechanisms of a sound or the produced sound itself. In practice, the frequency of a spectral peak differs slightly from the associated resonance frequency, except when, by luck, harmonics are aligned with the resonance frequency.

Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.
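
A bare-bones illustration of the autocorrelation method with the Levinson-Durbin recursion is sketched below (omitting the windowing, pre-emphasis, and quantization that practical speech coders apply; it assumes a non-silent analysis frame):

    import numpy as np

    def lpc_coefficients(frame, order):
        """Return [1, a1, ..., a_order] such that the all-pole filter
        1 / (1 + a1 z^-1 + ... + a_order z^-order) approximates the
        frame's spectral envelope."""
        frame = np.asarray(frame, dtype=float)
        n = len(frame)
        # autocorrelation at lags 0..order
        r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                      # reflection coefficient
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k * a_prev[i - j]
            a[i] = k
            err *= (1.0 - k * k)                # remaining prediction error
        return a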

Physical modelling synthesis refers to sound synthesis methods in which the waveform of the sound to be generated is computed using a mathematical model, a set of equations and algorithms to simulate a physical source of sound, usually a musical instrument.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

Music technology (electronic and digital)

Digital music technology encompasses the use of digital instruments, computers, electronic effects units, software, and digital audio equipment by a performer, composer, sound engineer, DJ, or record producer to produce, perform, or record music. The term refers to electronic devices, instruments, computer hardware, and software used in the performance, playback, recording, composition, mixing, analysis, and editing of music.

A software synthesizer or softsynth is a computer program that generates digital audio, usually for music. Computer software that can create sounds or music is not new, but advances in processing speed now allow softsynths to accomplish the same tasks that previously required the dedicated hardware of a conventional synthesizer. Softsynths may be readily interfaced with other music software such as music sequencers typically in the context of a digital audio workstation. Softsynths are usually less expensive and can be more portable than dedicated hardware.

Wavetable synthesis is a sound synthesis technique used to create periodic waveforms. Often used in the production of musical tones or notes, it was first written about by Hal Chamberlin in Byte's September 1977 issue. Wolfgang Palm of Palm Products GmbH (PPG) developed it in the late 1970s and published it in 1979. The technique has since been used as the primary synthesis method in synthesizers built by PPG and Waldorf Music and as an auxiliary synthesis method by Ensoniq and Access. It is currently used in hardware synthesizers from Waldorf Music and in software-based synthesizers for PCs and tablets, including apps offered by PPG and Waldorf, among others.
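
The core mechanism, stepping through one stored cycle with a phase accumulator whose increment sets the pitch, can be sketched as follows (nearest-sample lookup only; real wavetable synthesizers interpolate between samples and crossfade between tables):

    import numpy as np

    def wavetable_oscillator(table, freq, n_samples, sr=44100):
        """Read repeatedly through one stored cycle; the phase increment
        (freq * table_length / sr) determines the output pitch."""
        table = np.asarray(table, dtype=float)
        phase = 0.0
        inc = freq * len(table) / sr
        out = np.empty(n_samples)
        for n in range(n_samples):
            out[n] = table[int(phase) % len(table)]   # nearest-sample lookup
            phase += inc
        return out

    one_cycle = np.sin(2 * np.pi * np.arange(2048) / 2048)  # a single sine cycle
    tone = wavetable_oscillator(one_cycle, freq=440.0, n_samples=44100)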

Digital waveguide synthesis is the synthesis of audio using a digital waveguide. Digital waveguides are efficient computational models for physical media through which acoustic waves propagate. For this reason, digital waveguides constitute a major part of most modern physical modeling synthesizers.

ISPW

The IRCAM Signal Processing Workstation (ISPW) was a hardware DSP platform developed by IRCAM and the Ariel Corporation in the late 1980s. In French, the ISPW is referred to as the SIM. Eric Lindemann was the principal designer of the ISPW hardware as well as manager of the overall hardware/software effort.

Gunnar Fant

Carl Gunnar Michael Fant was a leading researcher in speech science in general and speech synthesis in particular who spent most of his career as a professor at the Swedish Royal Institute of Technology (KTH) in Stockholm. He was a first cousin of George Fant, the actor and director.

Signal processing is an electrical engineering subfield that focuses on analysing, modifying, and synthesizing signals such as sound, images, and scientific measurements. For example, with a filter g, an inverse filter h is one such that the sequence of applying g then h to a signal results in the original signal. Software or electronic inverse filters are often used to compensate for the effect of unwanted environmental filtering of signals.
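
A small illustration of that inverse-filter relationship, using SciPy's lfilter with an arbitrary first-order example filter g and its inverse h (the clean recovery assumes g is minimum-phase; the coefficients are invented for the example):

    import numpy as np
    from scipy.signal import lfilter

    x = np.random.randn(1000)          # original signal
    b_g = [1.0, -0.5]                  # filter g: y[n] = x[n] - 0.5 x[n-1]
    y = lfilter(b_g, [1.0], x)         # apply g
    x_rec = lfilter([1.0], b_g, y)     # apply the inverse filter h
    print(np.max(np.abs(x - x_rec)))   # ~1e-16: g followed by h recovers x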

Kenneth N. Stevens

Kenneth Noble Stevens was the Clarence J. LeBel Professor of Electrical Engineering and Computer Science, and Professor of Health Sciences and Technology at the Research Laboratory of Electronics at MIT. Stevens was head of the Speech Communication Group in MIT's Research Laboratory of Electronics (RLE), and was one of the world's leading scientists in acoustic phonetics.

Haskins Laboratories

Haskins Laboratories, Inc. is an independent 501(c) non-profit corporation, founded in 1935 and located in New Haven, Connecticut, since 1970. Upon moving to New Haven, Haskins entered into formal affiliation agreements with both Yale University and the University of Connecticut; it remains fully independent, administratively and financially, of both Yale and UConn. Haskins is a multidisciplinary and international community of researchers which conducts basic research on spoken and written language. A guiding perspective of their research is to view speech and language as emerging from biological processes, including those of adaptation, response to stimuli, and conspecific interaction. The Laboratories has a long history of technological and theoretical innovation, from creating systems of rules for speech synthesis and development of an early working prototype of a reading machine for the blind to developing the landmark concept of phonemic awareness as the critical preparation for learning to read an alphabetic writing system.

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The shape of the vocal tract can be controlled in a number of ways which usually involves modifying the position of the speech articulators, such as the tongue, jaw, and lips. Speech is created by digitally simulating the flow of air through the representation of the vocal tract.

Louis M. Goldstein is an American linguist and cognitive scientist. He was previously a professor and chair of the Department of Linguistics and a professor of psychology at Yale University and is now a professor in the Department of Linguistics at the University of Southern California. He is a senior scientist at Haskins Laboratories in New Haven, Connecticut, and a founding member of the Association for Laboratory Phonology.

The Yamaha FS1R is a sound synthesizer module, manufactured by the Yamaha Corporation from 1998 to 2000. Based on Formant synthesis, it also has FM synthesis capabilities similar to the DX range. Its editing involves 2,000+ parameters in any one 'performance', prompting the creation of a number of third party freeware programming applications. These applications provide the tools needed to program the synth which were missing when it was in production by Yamaha. The synth was discontinued after two years, probably in part due to its complexity, poor front-panel controls, brief manual and limited polyphony.

Cantor (music software)

Cantor was a vocal singing synthesizer software released four months after the original release of Vocaloid by the company VirSyn, and was based on the same idea of synthesizing the human voice. VirSyn released English and German versions of this software. Cantor 2 boasted a variety of voices from near-realistic sounding ones to highly expressive vocals and robotic voices.

The Music Kit was a software package for the NeXT Computer system. First developed by David A. Jaffe and Julius O. Smith, it supported the Motorola 56001 DSP that was included on the NeXT Computer's motherboard. It was also the first architecture to unify the Music-N and MIDI paradigms, combining the generality of the former with the interactivity and performance capabilities of the latter. The Music Kit was integrated with the Sound Kit.

Raimo Olavi Toivonen

Raimo Olavi Toivonen is a Finnish developer of speech analysis, speech synthesis, speech technology, psychoacoustics and digital signal processing.

References

  1. https://directory.fsf.org/wiki/gnuspeech.
  2. COOK, P.R. (1989) Synthesis of the singing voice using a physically parameterized model of the human vocal tract. International Computer Music Conference, Columbus Ohio
  3. CARRE, R. (1992) Distinctive regions in acoustic tubes. Speech production modelling. Journal d'Acoustique, 5, 141-159
  4. Now Department for Speech, Music and Hearing
  5. FANT, G. & PAULI, S. (1974) Spatial characteristics of vocal tract resonance models. Proceedings of the Stockholm Speech Communication Seminar, KTH, Stockholm, Sweden
  6. Relevant U of Calgary website
  7. The Tube Resonance Model Speech Synthesizer
  8. HILL, D.R., MANZARA, L. & TAUBE-SCHOCK, C-R. (1995) Real-time articulatory speech-synthesis-by-rules. Proc. AVIOS '95 14th Annual International Voice Technologies Conf, San Jose, 12-14 September 1995, 27-44
  9. D'ESTE, F. (2010) Articulatory Speech Synthesis with Parallel Multi-Objective Genetic Algorithm. Master's thesis, Leiden Institute of Advanced Computer Science.
  10. XIONG, F. & BARKER, J. (2018) Deep Learning of Articulatory-Based Representations and Applications for Improving Dysarthric Speech Recognition. ITG Conference on Speech Communication, Germany.