Formant

Figure: Spectrogram of American English vowels [i, u, ɑ] showing the formants F1 and F2

In speech science and phonetics, a formant is the broad spectral maximum that results from an acoustic resonance of the human vocal tract. [1] [2] In acoustics, a formant is usually defined as a broad peak, or local maximum, in the spectrum. [3] [4] For harmonic sounds, under this definition, the formant frequency is sometimes taken as that of the harmonic most augmented by a resonance. The difference between the two definitions lies in whether "formants" characterise the production mechanism of a sound or the produced sound itself. In practice, the frequency of a spectral peak differs slightly from the associated resonance frequency, except when, by coincidence, a harmonic is aligned with the resonance frequency, or when the sound source is mostly non-harmonic, as in whispering and vocal fry.


A room can be said to have formants characteristic of that particular room, due to its resonances, i.e., to the way sound reflects from its walls and objects. Room formants of this kind reinforce certain frequencies and absorb others, as exploited, for example, by Alvin Lucier in his piece I Am Sitting in a Room. In acoustic digital signal processing, the way a collection of formants (such as a room) affects a signal can be represented by an impulse response.
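As a minimal sketch of that last point, the filtering effect of a room (or any fixed set of resonances) can be applied to a recording by convolving the signal with the room's impulse response. The arrays and values below are illustrative placeholders, not measured data.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 16000                              # sample rate (Hz)
dry = np.random.randn(fs)               # stand-in for a dry (anechoic) recording
impulse_response = np.zeros(fs // 2)    # stand-in for a measured room response
impulse_response[0] = 1.0               # direct sound
impulse_response[2000] = 0.5            # one echo 125 ms later, for illustration

# Convolution applies the room's resonances (its "formants") to the dry signal.
wet = fftconvolve(dry, impulse_response)
```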

In both speech and room acoustics, formants are characteristic features of the resonances of the system. They are excited by acoustic sources such as the voice, and they shape (filter) the sources' sounds, but they are not sources themselves.

History

From an acoustic point of view, phonetics faced a serious problem with the observation that the effective length of the vocal tract changes the acoustics of vowels. [5] When the length of the vocal tract changes, all the acoustic resonators formed by the mouth cavities are scaled, and so are their resonance frequencies. It was therefore unclear how vowel identity could depend on frequencies when talkers with different vocal tract lengths, for instance bass and soprano singers, produce sounds that are perceived as belonging to the same phonetic category. There had to be some way to normalize the spectral information underpinning vowel identity. Hermann suggested a solution to this problem in 1894, coining the term “formant”. A vowel, according to him, is a special acoustic phenomenon, depending on the intermittent production of a special partial, or “formant”, or “characteristique” feature. The frequency of the “formant” may vary a little without altering the character of the vowel. For “long e” (ee or iy), for example, the lowest-frequency “formant” may vary from 350 to 440 Hz even in the same person. [6]

Phonetics

Average vowel formants for a male voice (in Hz) [7]

Vowel (IPA)    F1     F2     F2 − F1
i              240    2400   2160
y              235    2100   1865
e              390    2300   1910
ø              370    1900   1530
ɛ              610    1900   1290
æ              585    1710   1125
a              850    1610    760
ɶ              820    1530    710
ɑ              750     940    190
ɒ              700     760     60
ʌ              600    1170    570
ɔ              500     700    200
ɤ              460    1310    850
o              360     640    280
ɯ              300    1390   1090
u              250     595    345

Formants are distinctive frequency components of the acoustic signal produced by speech, musical instruments [8] or singing. The information that humans require to distinguish between speech sounds can be represented purely quantitatively by specifying peaks in the frequency spectrum. Most of these formants are produced by tube and chamber resonance, but a few whistle tones derive from periodic collapse of Venturi effect low-pressure zones. [9]

The formant with the lowest frequency is called F1, the second F2, the third F3, and so forth. The fundamental frequency or pitch of the voice is sometimes referred to as F0, but it is not a formant. Most often the first two formants, F1 and F2, are sufficient to identify the vowel. The relationship between the perceived vowel quality and the first two formant frequencies can be appreciated by listening to "artificial vowels" that are generated by passing a click train (to simulate the glottal pulse train) through a pair of bandpass filters (to simulate vocal tract resonances). Front vowels have higher F2, while low vowels have higher F1. Lip rounding tends to lower F1 and F2 in back vowels and F2 and F3 in front vowels. [10]
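The "artificial vowel" demonstration described above can be sketched in a few lines of code: an impulse (click) train stands in for the glottal source, and two resonant band-pass filters stand in for F1 and F2. The sample rate, fundamental, Q factor, and the choice of [a]-like formant values are illustrative assumptions.

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

fs = 16000                       # sample rate (Hz)
f0 = 120                         # glottal pulse rate (Hz)
f1, f2 = 850, 1610               # rough F1/F2 of [a] from the table above
dur = 0.5                        # duration in seconds

# Click train: one impulse every 1/f0 seconds, simulating the glottal pulses.
source = np.zeros(int(dur * fs))
source[::fs // f0] = 1.0

# Two cascaded resonators, one per formant; Q controls the formant bandwidth.
vowel = source
for fc in (f1, f2):
    b, a = iirpeak(fc, Q=8, fs=fs)
    vowel = lfilter(b, a, vowel)

vowel /= np.max(np.abs(vowel))   # normalize before saving or playback
```

Changing f1 and f2 to the values for [i] or [u] in the table above changes the perceived vowel, while the click train, and hence the pitch, stays the same.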

Nasal consonants usually have an additional formant around 2500 Hz. The liquid [l] usually has an extra formant at 1500 Hz, whereas the English "r" sound ([ɹ]) is distinguished by a very low third formant (well below 2000 Hz).

Plosives (and, to some degree, fricatives) modify the placement of formants in the surrounding vowels. Bilabial sounds (such as /b/ and /p/ in "ball" or "sap") cause a lowering of the formants; on spectrograms, velar sounds (/k/ and /ɡ/ in English) almost always show F2 and F3 coming together in a 'velar pinch' before the velar and separating from the same 'pinch' as the velar is released; alveolar sounds (English /t/ and /d/) cause fewer systematic changes in neighbouring vowel formants, depending partially on exactly which vowel is present. The time course of these changes in vowel formant frequencies is referred to as 'formant transitions'.

In normal voiced speech, the underlying vibration produced by the vocal folds resembles a sawtooth wave, rich in harmonic overtones. If the fundamental frequency is higher than a resonance frequency of the system, or if none of the harmonics happens to fall near that resonance, then the resonance will be only weakly excited and the formant it usually imparts will be mostly lost. This is most apparent in the case of soprano opera singers, who sing at pitches high enough that their vowels become very hard to distinguish.
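This effect can be illustrated with a small numerical experiment, assuming a single resonance at 500 Hz modelled as a narrow peaking filter; the pitches and Q value below are illustrative choices, not measurements of any real voice.

```python
import numpy as np
from scipy.signal import sawtooth, iirpeak, lfilter

fs = 16000
t = np.arange(fs) / fs                       # one second of signal
b, a = iirpeak(500, Q=10, fs=fs)             # a single resonance at 500 Hz

for f0 in (100, 700):                        # low pitch vs. soprano-range pitch
    source = sawtooth(2 * np.pi * f0 * t)    # harmonic-rich, glottal-like source
    out = lfilter(b, a, source)
    print(f"f0 = {f0} Hz -> output RMS {np.sqrt(np.mean(out ** 2)):.3f}")

# At f0 = 100 Hz the fifth harmonic sits on the 500 Hz resonance, so the output
# is strong; at f0 = 700 Hz no harmonic falls near it, so the resonance (and the
# formant it would impart) is only weakly excited.
```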

Control of resonances is an essential component of the vocal technique known as overtone singing, in which the performer sings a low fundamental tone, and creates sharp resonances to select upper harmonics, giving the impression of several tones being sung at once.

Spectrograms may be used to visualise formants. In spectrograms, it can be hard to distinguish formants from naturally occurring harmonics when one sings. However, one can hear the natural formants in a vowel shape through atonal techniques such as vocal fry.
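As a sketch of how such a visualisation might be produced, the snippet below computes and displays a spectrogram with scipy and matplotlib; the file name `speech.wav` and the analysis parameters are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("speech.wav")           # assumed mono recording
f, t, Sxx = spectrogram(x.astype(float), fs=fs, nperseg=512, noverlap=448)

# Formants appear as dark horizontal bands of energy that move with the vowels.
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto", cmap="gray_r")
plt.ylim(0, 4000)                            # the first few formants lie below ~4 kHz
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```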

Formant estimation

Formants, whether viewed as acoustic resonances of the vocal tract or as local maxima in the speech spectrum, are defined, like band-pass filters, by their frequency and by their spectral width (bandwidth).

Different methods exist to obtain this information. Formant frequencies, in their acoustic definition, can be estimated from the frequency spectrum of the sound, using a spectrogram (as in the figure above) or a spectrum analyzer. To estimate the acoustic resonances of the vocal tract (i.e. the speech-production definition of formants) from a speech recording, however, one can use linear predictive coding. An intermediate approach consists of extracting the spectral envelope by neutralizing the fundamental frequency, [11] and only then looking for local maxima in that envelope.
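A minimal sketch of the linear-predictive approach is shown below: LPC coefficients are fitted to one voiced frame, and the complex roots of the prediction polynomial give candidate resonance frequencies and bandwidths. The function name, the model-order rule of thumb, and the filtering thresholds are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def estimate_formants(frame, fs, order=None):
    """Return candidate formant frequencies (Hz) for one voiced frame."""
    if order is None:
        order = 2 + fs // 1000                  # common rule of thumb for LPC order
    x = lfilter([1.0, -0.97], [1.0], frame)     # pre-emphasis flattens spectral tilt
    x = x * np.hamming(len(x))
    # Autocorrelation (Yule-Walker) solution for the LPC coefficients.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    # Roots of the prediction-error polynomial A(z) = 1 - sum(a_k z^-k)
    # correspond to the vocal-tract resonances.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]           # keep one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bandwidths = -fs / np.pi * np.log(np.abs(roots))
    # Keep plausible formants: not too low in frequency, not too broad.
    keep = (freqs > 90) & (bandwidths < 400)
    return np.sort(freqs[keep])
```

Applied frame by frame, such an analysis yields formant tracks comparable to those drawn over a spectrogram.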

Formant plots

Figure: A plot of the average formants listed in the table above

The first two formants are important in determining the quality of vowels, and are frequently said to correspond to the open/close (or low/high) and front/back dimensions (which have traditionally been associated with the shape and position of the tongue). Thus the first formant F1 has a higher frequency for an open or low vowel such as [a] and a lower frequency for a close or high vowel such as [i] or [u]; and the second formant F2 has a higher frequency for a front vowel such as [i] and a lower frequency for a back vowel such as [u]. [12] [13]

Vowels will almost always have four or more distinguishable formants, and sometimes more than six. However, the first two formants are the most important in determining vowel quality and are often plotted against each other in vowel diagrams, [14] though this simplification fails to capture some aspects of vowel quality such as rounding. [15]

Many writers have addressed the problem of finding an optimal alignment of the positions of vowels on formant plots with those on the conventional vowel quadrilateral. The pioneering work of Ladefoged [16] used the Mel scale, on the grounds that it corresponds more closely to the auditory scale of pitch than does a linear frequency scale in hertz. Two alternatives to the Mel scale are the Bark scale and the ERB-rate scale. [17] Another widely adopted strategy is to plot the difference F2 − F1, rather than F2 itself, on the horizontal axis. [citation needed]
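A sketch of such a plot, using a few of the male-voice averages from the table above, is shown below; the axes are inverted so that the layout mirrors the conventional vowel quadrilateral (close vowels at the top, front vowels on the left).

```python
import matplotlib.pyplot as plt

# (F1, F2) in Hz, taken from the table of male-voice averages above.
formants = {
    "i": (240, 2400), "e": (390, 2300), "a": (850, 1610),
    "ɑ": (750, 940), "o": (360, 640), "u": (250, 595),
}

fig, ax = plt.subplots()
for vowel, (f1, f2) in formants.items():
    ax.scatter(f2, f1, color="black")
    ax.annotate(vowel, (f2, f1), textcoords="offset points", xytext=(5, 5))

ax.invert_xaxis()                 # front vowels (high F2) on the left
ax.invert_yaxis()                 # close vowels (low F1) at the top
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
plt.show()
```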

Singer's formant

Studies of the frequency spectra of trained speakers and classical singers, especially male singers, indicate a clear formant around 3000 Hz (between 2800 and 3400 Hz) that is absent from ordinary speech and from the spectra of untrained speakers or singers. It is thought to be associated with one or more of the higher resonances of the vocal tract. [18] [19] It is this increase in energy at 3000 Hz that allows singers to be heard and understood over an orchestra. This formant is actively developed through vocal training, for instance through so-called voce di strega or "witch's voice" exercises, [20] and is caused by a part of the vocal tract acting as a resonator. [21] In classical music and vocal pedagogy, this phenomenon is also known as squillo.


References

  1. Titze, I. R. (1994). Principles of Voice Production. Prentice Hall. ISBN 978-0-13-717893-3.
  2. Titze, I. R., Baken, R. J., Bozeman, K. W., Granqvist, S., Henrich, N., Herbst, C. T., Howard, D. M., Hunter, E. J., Kaelin, D., Kent, R. D., Löfqvist, A., McCoy, S., Miller, D. G., Noé, H., Scherer, R. C., Smith, J. R., Story, B. H., Švec, J. G., Ternström, S. and Wolfe, J. (2015). "Toward a consensus on symbolic notation of harmonics, resonances, and formants in vocalization". J. Acoust. Soc. America. 137, 3005–3007.
  3. Jeans, J. H. (1938). Science & Music. Reprinted by Dover, 1968.
  4. Standards Secretariat, Acoustical Society of America (1994). ANSI S1.1-1994 (R2004) American National Standard Acoustical Terminology, (12.41). Acoustical Society of America, Melville, NY.
  5. Hermann, Ludimar (1894). Phonophotographische Untersuchungen [Phonophotographical Studies] (in German) (5th ed.).
  6. McKendrick, J. G. (1903). Experimental phonetics. In Annual Report of the Board of Regents of the Smithsonian Institution for the year ending June 30, 1902 (pp. 241–259). Smithsonian Institution.
  7. Catford, J. C. (1988). A Practical Introduction to Phonetics. Oxford: Clarendon. p. 161. ISBN 978-0-19-824217-8.
  8. Reuter, Christoph (2009). "The role of formant positions and micro-modulations in blending and partial masking of musical instruments". Journal of the Acoustical Society of America (JASA), Vol. 126, No. 4, p. 2237.
  9. Flanagan, James L. (1972). Speech Analysis Synthesis and Perception. doi:10.1007/978-3-662-01562-9. ISBN 978-3-662-01564-3.
  10. Thomas, Erik R. (2011). Sociophonetics: An Introduction. Palgrave Macmillan. p. 145. ISBN 978-0-230-22455-1.
  11. Kawahara, Hideki; Masuda-Katsuse, Ikuyo; de Cheveigné, Alain (April 1999). "Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds". Speech Communication. 27 (3–4): 187–207. doi:10.1016/S0167-6393(98)00085-5.
  12. Ladefoged, Peter (2006). A Course in Phonetics (5th ed.). Boston, MA: Thomson Wadsworth. p. 188. ISBN 1-4130-2079-8.
  13. Ladefoged, Peter (2001). Vowels and Consonants: An Introduction to the Sounds of Language. Malden, MA: Blackwell. p. 40. ISBN 0-631-21412-7.
  14. Deterding, David (1997). "The Formants of Monophthong Vowels in Standard Southern British English Pronunciation". Journal of the International Phonetic Association, 27, pp. 47–55.
  15. Hayward, Katrina (2000). Experimental Phonetics. Harlow, UK: Pearson. p. 149. ISBN 0-582-29137-2.
  16. Ladefoged, P. (1967). Three Areas of Experimental Phonetics. Oxford. p. 87.
  17. Hayward, K. (2000). Experimental Phonetics. Longman. ISBN 0-582-29137-2.
  18. Sundberg, J. (1974). "Articulatory interpretation of the 'singing formant'". Journal of the Acoustical Society of America, 55, 838–844.
  19. Bele, Irene Velsvik (December 2006). "The Speaker's Formant". J. Voice. 20 (4): 555–578. doi:10.1016/j.jvoice.2005.07.001. PMID 16325374.
  20. Frisell, Anthony (2007). Baritone Voice. Boston: Branden Books. p. 84. ISBN 978-0-8283-2181-5.
  21. Sundberg, Johan (1987). The Science of the Singing Voice. DeKalb, Ill.: Northern Illinois University Press. ISBN 0-87580-542-6.