Signal processing is an electrical engineering subfield that focuses on analyzing, modifying, and synthesizing signals such as sound, images, and scientific measurements. [1] For example, given a filter g, an inverse filter h is one such that the sequence of applying g then h to a signal results in the original signal. Software or electronic inverse filters are often used to compensate for the effect of unwanted environmental filtering of signals.
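As a minimal sketch of this relationship, the example below (assuming NumPy and SciPy, with an arbitrary minimum-phase FIR filter standing in for g) applies g to a test signal and then recovers the original with the inverse filter h, obtained by swapping g's coefficients between the filter's numerator and denominator:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)          # arbitrary test signal

g = np.array([1.0, -0.6, 0.2])         # minimum-phase FIR filter g

y = lfilter(g, [1.0], x)               # apply g
x_rec = lfilter([1.0], g, y)           # inverse h: all-pole filter 1/G(z)

print(np.max(np.abs(x - x_rec)))       # ~1e-15: original signal recovered
```

The inverse is stable here only because g is minimum phase (all zeros inside the unit circle); inverting a filter with zeros outside the unit circle would yield an unstable all-pole inverse.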
In all proposed models for the production of human speech, an important variable is the waveform of the airflow, or volume velocity, at the glottis. The glottal volume velocity waveform provides the link between movements of the vocal folds and the acoustical results of such movements, in that the glottis acts approximately as a source of volume velocity. That is, the impedance of the glottis is usually much higher than that of the vocal tract, and so glottal airflow is controlled mostly (but not entirely) by glottal area and subglottal pressure, and not by vocal-tract acoustics. This view of voiced speech production is often referred to as the source-filter model.
A technique for obtaining an estimate of the glottal volume velocity waveform during voiced speech is the “inverse-filtering” of either the radiated acoustic waveform, as measured by a microphone having a good low-frequency response, or the volume velocity at the mouth, as measured by a pneumotachograph having a linear response, little speech distortion, and a response time of under approximately 1/2 ms. A pneumotachograph having these properties was first described by Rothenberg [2] and termed by him a circumferentially vented mask or CV mask.
As practiced, inverse-filtering is usually limited to non-nasalized or slightly nasalized vowels, and the recorded waveform is passed through an “inverse-filter” having a transfer characteristic that is the inverse of the transfer characteristic of the supraglottal vocal tract configuration at that moment. The transfer characteristic of the supraglottal vocal tract is defined with the input to the vocal tract considered to be the volume velocity at the glottis. For non-nasalized vowels, assuming a high-impedance volume velocity source at the glottis, the transfer function of the vocal tract below about 3000 Hz contains a number of pairs of complex-conjugate poles, more commonly referred to as resonances or formants. Thus, an inverse-filter would have a pair of complex-conjugate zeroes, more commonly referred to as an anti-resonance, for every vocal tract formant in the frequency range of interest.
If the input is from a microphone, and not a CV mask or its equivalent, the inverse filter must also have a pole at zero frequency (an integration operation) to account for the radiation characteristic that relates volume velocity to acoustic pressure. Inverse filtering the output of a CV mask retains the level of zero flow,[2] while inverse filtering a microphone signal does not.
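A hedged sketch of such an inverse filter follows, assuming the source-filter view described above and using NumPy and SciPy; the sampling rate and the formant frequency/bandwidth pairs are illustrative placeholders, not measured values. Each formant's complex-conjugate pole pair is cancelled by a matching zero pair, and a running sum serves as the integration pole needed for microphone input:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000.0                                   # sampling rate (Hz), assumed
formants = [(700.0, 80.0), (1200.0, 100.0), (2600.0, 120.0)]  # (F, B) pairs

def antiresonance(F, B, fs):
    """FIR section placing a zero pair at a formant's pole location."""
    r = np.exp(-np.pi * B / fs)                # zero radius from bandwidth B
    theta = 2 * np.pi * F / fs                 # zero angle from frequency F
    return np.array([1.0, -2 * r * np.cos(theta), r * r])

def inverse_filter(p_mic, formants, fs):
    # Integration pole at zero frequency: undoes the radiation
    # characteristic relating volume velocity to radiated pressure
    # (the level of zero flow is not recovered, as noted above).
    u = np.cumsum(p_mic)
    for F, B in formants:                      # cancel each formant
        u = lfilter(antiresonance(F, B, fs), [1.0], u)
    return u                                   # estimated glottal flow
```

In practice the formant frequencies and bandwidths must be estimated from the recorded waveform itself (for example via linear prediction, discussed below) and tuned per utterance; the fixed values here merely illustrate the filter structure.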
Inverse filtering depends on the source-filter model and on the vocal tract filter being a linear system; however, the source and filter need not be independent.
In speech science and phonetics, a formant is the broad spectral maximum that results from an acoustic resonance of the human vocal tract. In acoustics, a formant is usually defined as a broad peak, or local maximum, in the spectrum. For harmonic sounds, with this definition, the formant frequency is sometimes taken as that of the harmonic partial that is most augmented by a resonance. The difference between these two definitions resides in whether "formants" characterise the production mechanisms of a sound or the produced sound itself. In practice, the frequency of a spectral peak differs from the associated resonance frequency, except when, by coincidence, a harmonic is aligned with the resonance frequency.
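The tiny example below illustrates that distinction under assumed values: with a fundamental of 120 Hz and a resonance at 500 Hz, the nearest harmonic, and hence the observed spectral peak, lies at 480 Hz rather than at the resonance itself:

```python
f0 = 120.0                 # fundamental frequency (Hz), illustrative
resonance = 500.0          # vocal-tract resonance (Hz), illustrative

harmonics = [f0 * k for k in range(1, 10)]
peak = min(harmonics, key=lambda f: abs(f - resonance))
print(peak)                # 480.0 -> "formant" as a spectral peak
print(resonance)           # 500.0 -> "formant" as a resonance
```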
In articulatory phonetics, the manner of articulation is the configuration and interaction of the articulators when making a speech sound. One parameter of manner is stricture, that is, how closely the speech organs approach one another. Others include those involved in the r-like sounds, and the sibilancy of fricatives.
Phonetics is a branch of linguistics that studies how humans produce and perceive sounds, or in the case of sign languages, the equivalent aspects of sign. Phoneticians—linguists who specialize in phonetics—study the physical properties of speech. The field of phonetics is traditionally divided into three sub-disciplines based on the research questions involved: how humans plan and execute movements to produce speech, how different movements affect the properties of the resulting sound, and how humans convert sound waves to linguistic information. Traditionally, the minimal linguistic unit of phonetics is the phone—a speech sound in a language—which differs from the phonological unit of phoneme; the phoneme is an abstract categorization of phones.
The term phonation has slightly different meanings depending on the subfield of phonetics. Among some phoneticians, phonation is the process by which the vocal folds produce certain sounds through quasi-periodic vibration. This is the definition used among those who study laryngeal anatomy and physiology and speech production in general. Phoneticians in other subfields, such as linguistic phonetics, call this process voicing, and use the term phonation to refer to any oscillatory state of any part of the larynx that modifies the airstream, of which voicing is just one example. Voiceless and supra-glottal phonations are included under this definition.
In phonetics, a plosive, also known as an occlusive or simply a stop, is a pulmonic consonant in which the vocal tract is blocked so that all airflow ceases.
A vocoder is a category of voice codec that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.
Voice analysis is the study of speech sounds for purposes other than extracting linguistic content. Such studies include mostly medical analysis of the voice (phoniatrics), but also speaker identification. More controversially, some believe that the truthfulness or emotional state of speakers can be determined using voice stress analysis or layered voice analysis.
Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.
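A minimal sketch of LPC by the autocorrelation method is shown below, assuming NumPy and SciPy; the frame and predictor order are illustrative. The predictor coefficients solve a Toeplitz system built from the signal's autocorrelation, and the resulting all-pole model describes the spectral envelope:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Return coefficients a such that x[n] ~ sum_k a[k] * x[n-1-k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Normal equations R a = r, with R a symmetric Toeplitz matrix.
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

rng = np.random.default_rng(0)
x = rng.standard_normal(400)       # stand-in for a windowed speech frame
a = lpc(x, order=10)
# Spectral envelope: gain / |1 - sum_k a[k] e^{-j w (k+1)}| on a frequency
# grid; the roots of that denominator polynomial give formant estimates.
```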
The human voice consists of sound made by a human being using the vocal tract, including talking, singing, laughing, crying, screaming, shouting, or yelling. The human voice frequency is specifically a part of human sound production in which the vocal folds are the primary sound source.
Distortion is the alteration of the original shape of something. In communications and electronics it means the alteration of the waveform of an information-bearing signal, such as an audio signal representing sound or a video signal representing images, in an electronic device or communication channel.
A microphone, colloquially called a mic or mike, is a device – a transducer – that converts sound into an electrical signal. Microphones are used in many applications such as telephones, hearing aids, public address systems for concert halls and public events, motion picture production, live and recorded audio engineering, sound recording, two-way radios, megaphones, radio and television broadcasting. They are also used in computers for recording voice, speech recognition, VoIP, and for non-acoustic purposes such as ultrasonic sensors or knock sensors.
The field of articulatory phonetics is a subfield of phonetics that studies articulation and ways that humans produce speech. Articulatory phoneticians explain how humans produce speech sounds via the interaction of different physiological structures. Generally, articulatory phonetics is concerned with the transformation of aerodynamic energy into acoustic energy. Aerodynamic energy refers to the airflow through the vocal tract. Its potential form is air pressure; its kinetic form is the actual dynamic airflow. Acoustic energy is variation in the air pressure that can be represented as sound waves, which are then perceived by the human auditory system as sound.
The glottal plosive or stop is a type of consonantal sound used in many spoken languages, produced by obstructing airflow in the vocal tract or, more precisely, the glottis. The symbol in the International Phonetic Alphabet that represents this sound is ⟨ʔ⟩.
In electronics, impedance matching is the practice of designing the input impedance of an electrical load or the output impedance of its corresponding signal source to maximize the power transfer or minimize signal reflection from the load. A source of electric power such as a generator, amplifier or radio transmitter has a source impedance equivalent to an electrical resistance in series with a frequency-dependent reactance. Likewise, an electrical load such as a light bulb, transmission line or antenna has an impedance equivalent to a resistance in series with a reactance.
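The short example below works through the conjugate-matching consequence of this picture, with illustrative source values: average power delivered to the load peaks when the load impedance is the complex conjugate of the source impedance, so that the reactances cancel and the resistances are equal:

```python
V = 10.0                    # source EMF (volts), illustrative
Zs = 50 + 30j               # source impedance: resistance in series with reactance

def load_power(Zl):
    I = V / (Zs + Zl)       # current in the series source-load circuit
    return 0.5 * abs(I) ** 2 * Zl.real   # average power dissipated in the load

print(load_power(Zs.conjugate()))   # conjugate match (50 - 30j): 0.25 W, the maximum
print(load_power(50 + 0j))          # reactance unmatched: less power delivered
```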
In phonetics, the airstream mechanism is the method by which airflow is created in the vocal tract. Along with phonation and articulation, it is one of three main components of speech production. The airstream mechanism is mandatory for sound production and constitutes the first part of this process, which is called initiation.
Acoustic phonetics is a subfield of phonetics which deals with the acoustic aspects of speech sounds. Acoustic phonetics investigates time-domain features such as the mean squared amplitude of a waveform, its duration, and its fundamental frequency; frequency-domain features such as the frequency spectrum; and combined spectrotemporal features. It also studies the relationship of these properties to other branches of phonetics and to abstract linguistic concepts such as phonemes, phrases, or utterances.
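As a sketch of the time-domain features just named, the following assumes NumPy, a mono signal sampled at fs Hz, and a plain autocorrelation-peak estimator for fundamental frequency (illustrative rather than robust):

```python
import numpy as np

def features(x, fs, f0_range=(50.0, 400.0)):
    ms_amp = np.mean(x ** 2)                   # mean squared amplitude
    duration = len(x) / fs                     # duration in seconds
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation
    lo, hi = int(fs / f0_range[1]), int(fs / f0_range[0])
    lag = lo + np.argmax(r[lo:hi])             # strongest periodicity lag
    return ms_amp, duration, fs / lag          # f0 estimate in Hz

fs = 16000
t = np.arange(4000) / fs                       # a 0.25 s frame
x = np.sin(2 * np.pi * 120 * t)                # 120 Hz test tone
print(features(x, fs))                         # f0 estimate near 120 Hz
```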
The source–filter model represents speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract. While only an approximation, the model is widely used in a number of applications such as speech synthesis and speech analysis because of its relative simplicity. It is also related to linear prediction. The development of the model is due, in large part, to the early work of Gunnar Fant, although others, notably Ken Stevens, have also contributed substantially to the models underlying acoustic analysis of speech and speech synthesis. Fant built on the work of Tsutomu Chiba and Masato Kajiyama, who first showed the relationship between a vowel's acoustic properties and the shape of the vocal tract.
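A minimal source-filter synthesis sketch in code, assuming NumPy and SciPy: an impulse train stands in for the glottal source, and a cascade of two-pole resonators (with illustrative formant frequencies and bandwidths, roughly vowel-like) stands in for the vocal-tract filter. Note that each resonator here is the pole-pair counterpart of the zero-pair anti-resonance used in the inverse-filter sketch above:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                       # source: pitch in Hz
source = np.zeros(fs)                          # one second of signal
source[::fs // f0] = 1.0                       # glottal impulse train

def resonator(F, B, fs):
    """Two-pole filter implementing one formant (frequency F, bandwidth B)."""
    r = np.exp(-np.pi * B / fs)
    a = [1.0, -2 * r * np.cos(2 * np.pi * F / fs), r * r]
    return [1.0], a

speech = source
for F, B in [(700, 80), (1200, 100), (2600, 120)]:   # filter: formants
    b, a = resonator(F, B, fs)
    speech = lfilter(b, a, speech)             # cascade the resonances
```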
Kenneth Noble Stevens was the Clarence J. LeBel Professor of Electrical Engineering and Computer Science, and Professor of Health Sciences and Technology at the Research Laboratory of Electronics at MIT. Stevens was head of the Speech Communication Group in MIT's Research Laboratory of Electronics (RLE), and was one of the world's leading scientists in acoustic phonetics.
Vocal resonance may be defined as "the process by which the basic product of phonation is enhanced in timbre and/or intensity by the air-filled cavities through which it passes on its way to the outside air." Throughout the vocal literature, various terms related to resonation are used, including: amplification, filtering, enrichment, enlargement, improvement, intensification, and prolongation. Acoustic authorities would question many of these terms from a strictly scientific perspective. However, the main point to be drawn from these terms by a singer or speaker is that the result of resonation is to make a "better" sound, or at least one suited to a certain aesthetic and practical domain.
Wolfgang von Kempelen's speaking machine is a manually operated speech synthesizer whose development began in 1769 under the Austro-Hungarian author and inventor Wolfgang von Kempelen. It was in this same year that he completed his far more infamous contribution to history: The Turk, a chess-playing automaton later revealed to be an elaborate and far-reaching hoax, with a human chess player concealed within its innards.[4] But while the Turk's construction was completed in six months, Kempelen's speaking machine occupied the next twenty years of his life.[2] After two conceptual "dead ends" over the first five years of research, Kempelen's third direction ultimately led him to the design he felt comfortable deeming "final": a functional representational model of the human vocal tract.[3]