Articulatory synthesis

Figure: A 3D vocal tract model for articulatory synthesis. Based on consonant-vowel coarticulation modeling, the German sentence "Lea und Doreen mögen Bananen." was reproduced from a naturally spoken sentence in terms of its fundamental frequency and phone durations. [1]

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The shape of the vocal tract can be controlled in a number of ways, usually by modifying the positions of the speech articulators, such as the tongue, jaw, and lips. Speech is created by digitally simulating the flow of air through the representation of the vocal tract.
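
The core computation can be illustrated with a one-dimensional tube model of the kind introduced by Kelly and Lochbaum (see below): the vocal tract is approximated as a chain of short cylindrical sections, and travelling pressure waves are partially reflected wherever the cross-sectional area changes. The following Python sketch is only a minimal illustration of that idea; the area profile, boundary reflection coefficients, and impulse-train excitation are invented for the example, and real articulatory synthesizers add calibrated glottal source models, energy losses, and a nasal branch.

```python
import numpy as np

def kelly_lochbaum_tube(areas, source, lip_reflection=-0.85, glottal_reflection=0.75):
    """Propagate an excitation through a lossless piecewise-cylindrical tube.

    areas  : cross-sectional area of each tube section (arbitrary units)
    source : excitation samples injected at the glottal end
    Returns the pressure signal radiated at the lips.
    """
    n = len(areas)
    # Reflection coefficient at each junction between adjacent sections.
    k = [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1]) for i in range(n - 1)]
    fwd = np.zeros(n)   # right-travelling (towards the lips) wave components
    bwd = np.zeros(n)   # left-travelling (towards the glottis) wave components
    out = np.zeros(len(source))
    for t, x in enumerate(source):
        new_fwd = np.zeros(n)
        new_bwd = np.zeros(n)
        # Glottal end: inject the source plus a partial reflection of the
        # wave returning from the tract (a nearly closed end).
        new_fwd[0] = x + glottal_reflection * bwd[0]
        # Scattering at each area discontinuity (Kelly-Lochbaum junction).
        for i in range(n - 1):
            new_fwd[i + 1] = (1 + k[i]) * fwd[i] - k[i] * bwd[i + 1]
            new_bwd[i] = k[i] * fwd[i] + (1 - k[i]) * bwd[i + 1]
        # Lip end: part of the wave is radiated, the rest reflected back inverted.
        out[t] = (1 + lip_reflection) * fwd[-1]
        new_bwd[-1] = lip_reflection * fwd[-1]
        fwd, bwd = new_fwd, new_bwd
    return out

# Example: a crude glottal pulse train through an invented two-cavity area profile.
fs = 44100
areas = np.concatenate([np.full(11, 1.0),    # narrow back cavity
                        np.full(11, 7.0)])   # wide front cavity, a rough /a/-like shape
excitation = np.zeros(fs // 2)
excitation[:: fs // 120] = 1.0               # ~120 Hz impulse train as a stand-in glottal source
audio = kelly_lochbaum_tube(areas, excitation)
```

With one sample of wave travel per section, the 22 sections at 44.1 kHz correspond to roughly 17 cm of tract, close to an adult vocal tract length; that calibration between section count, sampling rate, and tract length is otherwise ignored in the sketch.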

Mechanical talking heads

There is a long history of attempts to build mechanical "talking heads".[2] Gerbert (d. 1003), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294) are all said to have built speaking heads (Wheatstone 1837). However, historically confirmed speech synthesis begins with Wolfgang von Kempelen (1734–1804), who published an account of his research in 1791 (see also Dudley & Tarnoczy 1950).

Electrical vocal tract analogs

The first electrical vocal tract analogs were static, like those of Dunn (1950), Ken Stevens and colleagues (1953), and Gunnar Fant (1960). Rosen (1958) built a dynamic vocal tract analog (DAVO), which Dennis (1963) later attempted to control by computer. Dennis et al. (1964), Hiki et al. (1968), and Baxter and Strong (1969) also described hardware vocal-tract analogs. Kelly and Lochbaum (1962) made the first computer simulation; later digital computer simulations were made, e.g. by Nakata and Mitsuoka (1965), Matsui (1968), and Paul Mermelstein (1971). Honda et al. (1968) made an analog computer simulation.
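
The acoustic behaviour these analogs were designed to reproduce is often summarized with the uniform-tube approximation: a tube closed at the glottis and open at the lips resonates at odd multiples of c/4L. The short calculation below is a textbook-style illustration, not taken from any of the systems cited above, of why a roughly 17.5 cm vocal tract in a neutral configuration has formants near 500, 1500, and 2500 Hz.

```python
# Resonances of a uniform tube closed at one end (glottis) and open at the
# other (lips): F_n = (2n - 1) * c / (4 * L).
SPEED_OF_SOUND = 35000.0   # cm/s, approximate value in warm, moist air
TRACT_LENGTH = 17.5        # cm, an assumed typical adult vocal tract length

def uniform_tube_formants(length_cm, n_formants=3, c=SPEED_OF_SOUND):
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, n_formants + 1)]

print(uniform_tube_formants(TRACT_LENGTH))   # -> [500.0, 1500.0, 2500.0]
```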

Haskins and Maeda models

The first software articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY,[3] was a computational model of speech production based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues. Another widely used model is that of Shinji Maeda, which uses a factor-based approach to control tongue shape.
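
Maeda's model describes the tongue and other articulators as a mean shape plus a weighted sum of a small number of linear factors derived statistically from X-ray tracings, so a handful of parameters controls an entire midsagittal contour. The sketch below illustrates that general factor-based idea; the basis shapes, dimensions, and weights are invented for the example and do not reproduce Maeda's actual factors or parameter ranges.

```python
import numpy as np

# Hypothetical illustration of factor-based articulator control: a contour is
# reconstructed as a mean shape plus a weighted sum of basis deformations.
# (In Maeda's model the factors come from a statistical analysis of X-ray
# data; the numbers below are invented for illustration only.)
N_POINTS = 20                                  # points along the tongue contour
mean_shape = np.linspace(1.0, 2.0, N_POINTS)   # placeholder mean contour (cm)

# Two invented basis deformations, e.g. a "front-back" and an "arch" pattern.
factors = np.stack([
    np.linspace(-1.0, 1.0, N_POINTS),          # tilts the contour from front to back
    np.sin(np.linspace(0.0, np.pi, N_POINTS)), # bows the contour in the middle
])

def tongue_contour(weights):
    """Return a contour for a given vector of factor weights."""
    return mean_shape + np.asarray(weights) @ factors

# A small weight change moves the whole contour coherently, which is what
# makes factor parameterizations attractive as synthesis control parameters.
neutral = tongue_contour([0.0, 0.0])
fronted_arched = tongue_contour([0.5, 0.8])
```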

Modern models

Recent progress in speech production imaging, articulatory control modeling, and tongue biomechanics modeling has changed the way articulatory synthesis is performed. Examples include the Haskins CASY model (Configurable Articulatory Synthesis),[4] designed by Philip Rubin, Mark Tiede, and Louis Goldstein, which matches midsagittal vocal tracts to actual magnetic resonance imaging (MRI) data and uses MRI data to construct a 3D model of the vocal tract. A full 3D articulatory synthesis model has been described by Olov Engwall. A geometrically based 3D articulatory speech synthesizer has been developed by Peter Birkholz (VocalTractLab[5]). The Directions Into Velocities of Articulators (DIVA) model, a feedforward control approach that takes the neural computations underlying speech production into consideration, was developed by Frank H. Guenther at Boston University. The ArtiSynth project,[6] headed by Sidney Fels at the University of British Columbia, is a 3D biomechanical modeling toolkit for the human vocal tract and upper airway. Biomechanical modeling of articulators such as the tongue has been pioneered by a number of scientists, including Reiner Wilhelms-Tricarico, Yohan Payan, Jean-Michel Gerard, Jianwu Dang, and Kiyoshi Honda.
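
The control idea behind the DIVA model's name can be caricatured in a few lines: desired changes in an acoustic (or somatosensory) target space are converted into articulator velocities, for instance through a pseudoinverse of a forward model's Jacobian. The fragment below is only a schematic illustration of such a "directions into velocities" mapping with an invented two-articulator forward model; it does not reproduce the neural architecture, learned mappings, or feedback circuitry of the actual DIVA model.

```python
import numpy as np

def jacobian(forward, articulators, eps=1e-4):
    """Numerically estimate how each articulator affects each acoustic dimension."""
    base = forward(articulators)
    J = np.zeros((len(base), len(articulators)))
    for j in range(len(articulators)):
        perturbed = articulators.copy()
        perturbed[j] += eps
        J[:, j] = (forward(perturbed) - base) / eps
    return J

def articulator_velocities(forward, articulators, acoustic_target, gain=1.0):
    """Map a desired change in acoustic space into articulator velocities."""
    error = acoustic_target - forward(articulators)   # direction in acoustic space
    J = jacobian(forward, articulators)
    return gain * np.linalg.pinv(J) @ error           # velocities in articulator space

# Invented two-articulator, two-formant forward model, for illustration only.
def toy_forward(a):
    jaw, tongue = a
    return np.array([500.0 + 300.0 * jaw, 1500.0 - 700.0 * tongue])

arts = np.array([0.2, 0.1])
target = np.array([650.0, 1200.0])
print(articulator_velocities(toy_forward, arts, target))
```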

Commercial models

One of the few commercial articulatory speech synthesis systems is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by René Carré's "distinctive region model".[7]

Footnotes

  1. Birkholz, Peter (2013). "Modeling Consonant-Vowel Coarticulation for Articulatory Speech Synthesis". PLOS ONE. 8 (4): e60603. Bibcode:2013PLoSO...860603B. doi:10.1371/journal.pone.0060603. PMC 3628899. PMID 23613734.
  2. "Talking Heads". Archived from the original on 2006-12-07. Retrieved 2006-12-06.
  3. ASY
  4. "CASY". Archived from the original on 2006-08-28. Retrieved 2006-12-06.
  5. VocalTractLab
  6. Artisynth
  7. Real-time articulatory speech-synthesis-by-rules

Bibliography

Related Research Articles

Phonetics is a branch of linguistics that studies how humans produce and perceive sounds, or in the case of sign languages, the equivalent aspects of sign. Linguists who specialize in studying the physical properties of speech are phoneticians. The field of phonetics is traditionally divided into three sub-disciplines based on the research questions involved: how humans plan and execute movements to produce speech, how various movements affect the properties of the resulting sound, and how humans convert sound waves to linguistic information. Traditionally, the minimal linguistic unit of phonetics is the phone, a speech sound in a language, which differs from the phonological unit, the phoneme; the phoneme is an abstract categorization of phones and is also defined as the smallest unit that discerns meaning between sounds in any given language.

The human voice consists of sound made by a human being using the vocal tract, including talking, singing, laughing, crying, screaming, shouting, humming or yelling. The human voice frequency is specifically a part of human sound production in which the vocal folds are the primary sound source.

Physical modelling synthesis refers to sound synthesis methods in which the waveform of the sound to be generated is computed using a mathematical model, a set of equations and algorithms to simulate a physical source of sound, usually a musical instrument.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

Digital music technology encompasses the use of digital instruments, computers, electronic effects units, software, and digital audio equipment by a performer, composer, sound engineer, DJ, or record producer to produce, perform, or record music. The term refers to electronic devices, instruments, computer hardware, and software used in the performance, playback, recording, composition, mixing, analysis, and editing of music.

Kenneth Noble Stevens was the Clarence J. LeBel Professor of Electrical Engineering and Computer Science, and professor of health sciences and technology, at the Massachusetts Institute of Technology (MIT). Stevens was head of the Speech Communication Group in MIT's Research Laboratory of Electronics (RLE), and was one of the world's leading scientists in acoustic phonetics.

Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.

Haskins Laboratories, Inc. is an independent 501(c) non-profit corporation, founded in 1935 and located in New Haven, Connecticut, since 1970. Haskins has formal affiliation agreements with both Yale University and the University of Connecticut; it remains fully independent, administratively and financially, of both Yale and UConn. Haskins is a multidisciplinary and international community of researchers that conducts basic research on spoken and written language. A guiding perspective of their research is to view speech and language as emerging from biological processes, including those of adaptation, response to stimuli, and conspecific interaction. Haskins Laboratories has a long history of technological and theoretical innovation, from creating systems of rules for speech synthesis and development of an early working prototype of a reading machine for the blind to developing the landmark concept of phonemic awareness as the critical preparation for learning to read an alphabetic writing system.

Philip E. Rubin is an American cognitive scientist, technologist, and science administrator known for raising the visibility of behavioral and cognitive science, neuroscience, and ethical issues related to science, technology, and medicine, at a national level. His research career is noted for his theoretical contributions and pioneering technological developments, starting in the 1970s, related to speech synthesis and speech production, including articulatory synthesis and sinewave synthesis, and their use in studying complex temporal events, particularly understanding the biological bases of speech and language.

Articulatory phonology is a linguistic theory originally proposed in 1986 by Catherine Browman of Haskins Laboratories and Louis Goldstein of University of Southern California and Haskins. The theory identifies theoretical discrepancies between phonetics and phonology and aims to unify the two by treating them as low- and high-dimensional descriptions of a single system.

Katherine Safford Harris is a noted psychologist and speech scientist. She is Distinguished Professor Emerita in Speech and Hearing at the CUNY Graduate Center and a member of the Board of Directors of Haskins Laboratories. She is also the former President of the Acoustical Society of America and Vice President of Haskins Laboratories.

Louis M. Goldstein is an American linguist and cognitive scientist. He was previously a professor and chair of the Department of Linguistics and a professor of psychology at Yale University and is now a professor in the Department of Linguistics at the University of Southern California. He is a senior scientist at Haskins Laboratories in New Haven, Connecticut, and a founding member of the Association for Laboratory Phonology. Notable students of Goldstein include Douglas Whalen and Elizabeth Zsiga.

Catherine Phebe Browman was an American linguist and speech scientist. She received her Ph.D. in linguistics from the University of California, Los Angeles (UCLA) in 1978. Browman was a research scientist at Bell Laboratories in New Jersey (1967–1972). While at Bell Laboratories, she was known for her work on speech synthesis using demisyllables. She later worked as a researcher at Haskins Laboratories in New Haven, Connecticut (1982–1998). She was best known for developing, with Louis Goldstein, the theory of articulatory phonology, a gesture-based approach to phonological and phonetic structure. The theoretical approach is incorporated in a computational model that generates speech from a gesturally specified lexicon. Browman was made an honorary member of the Association for Laboratory Phonology.

Elliot Saltzman is an American psychologist and speech scientist. He is a professor in the Department of Physical Therapy at Boston University and a Senior Scientist at Haskins Laboratories in New Haven, Connecticut. He is best known for his development, with J. A. Scott Kelso, of "task dynamics." He is also known for his contributions to the development of a gestural-computational model at Haskins Laboratories that combines task dynamics with articulatory phonology and articulatory synthesis. His research interests include the application of theories and methods of nonlinear dynamics and complexity theory to understanding the dynamical and biological bases of sensorimotor coordination and control. He is the co-founder, with Philip Rubin, of the IS group.

Wolfgang von Kempelen's speaking machine is a manually operated speech synthesizer that Austro-Hungarian author and inventor Wolfgang von Kempelen began developing in 1769. In that same year he completed his far more famous contribution to history: The Turk, a chess-playing automaton later revealed to be an elaborate hoax, operated by a human chess player concealed inside it. But while the Turk's construction was completed in six months, Kempelen's speaking machine occupied the next twenty years of his life. After two conceptual dead ends over the first five years of research, Kempelen's third direction ultimately led him to the design he felt comfortable deeming "final": a functional representational model of the human vocal tract.

Gnuspeech is an extensible text-to-speech computer software package that produces artificial speech output based on real-time articulatory speech synthesis by rules. That is, it converts text strings into phonetic descriptions, aided by a pronouncing dictionary, letter-to-sound rules, and rhythm and intonation models; transforms the phonetic descriptions into parameters for a low-level articulatory speech synthesizer; uses these to drive an articulatory model of the human vocal tract, producing output suitable for the normal sound output devices used by various computer operating systems; and does this at the same rate as, or faster than, adult speech is spoken.

Neurocomputational speech processing is the computer simulation of speech production and speech perception with reference to the natural neuronal processes of speech production and speech perception as they occur in the human nervous system. The topic is grounded in neuroscience and computational neuroscience.

Bernd J. Kröger is a German phonetician and professor at RWTH Aachen University. He is known for his contributions in the field of neurocomputational speech processing, in particular the ACT model.

The articulatory approach to teaching pronunciation treats learning to pronounce a second language as a motor skill, one that most students cannot develop through self-evaluation of their own production alone. The role of the teacher is therefore to provide feedback on students' performance as part of coaching them in the movements of the vocal tract articulators that create speech sounds.

Laura L. Koenig is an American linguist and speech scientist.