Auditory scene analysis

Albert Bregman, 2011

In perception and psychophysics, auditory scene analysis (ASA) is a proposed model for the basis of auditory perception. This is understood as the process by which the human auditory system organizes sound into perceptually meaningful elements. The term was coined by psychologist Albert Bregman. [1] The related concept in machine perception is computational auditory scene analysis (CASA), which is closely related to source separation and blind signal separation.

The three key aspects of Bregman's ASA model are: segmentation, integration, and segregation.

Background

Sound reaches the ear, and the eardrum vibrates as a whole; the auditory system must then analyze this single signal. Bregman's ASA model proposes that sounds are either heard as "integrated" (perceived as a whole, much like harmony in music) or "segregated" into individual components (which leads to counterpoint). For example, a bell can be heard as a single sound (integrated), while some listeners can hear and segregate its individual components. The same applies to chords, which can be heard as a single 'color' or as individual notes. Natural sounds, such as the human voice, musical instruments, or cars passing in the street, are made up of many frequencies, which contribute to the perceived quality (such as timbre) of the sounds. When two or more natural sounds occur at once, all the components of the simultaneously active sounds are received at the same time, or overlapping in time, by the ears of listeners. This presents the auditory system with a problem: which parts of the sound should be grouped together and treated as parts of the same source or object? Grouping them incorrectly can cause the listener to hear non-existent sounds built from the wrong combinations of the original components.

In many circumstances the segregated elements can be linked together in time, producing an auditory stream. This ability of auditory streaming can be demonstrated by the so-called cocktail party effect. Up to a point, with a number of voices speaking at the same time or with background sounds, one is able to follow a particular voice even though other voices and background sounds are present. [2] In this example, the ear segregates this voice from the other sounds (which remain integrated), and the mind "streams" the segregated sounds into an auditory stream. This skill is highly developed in musicians, notably conductors, who can listen to several instruments at once, segregating them and following each as an independent line through auditory streaming[ citation needed ].

Grouping and streams

A number of grouping principles appear to underlie ASA, many of which are related to principles of perceptual organization discovered by the school of Gestalt psychology. These can be broadly categorized into sequential grouping mechanisms (those that operate across time), which exploit cues such as proximity in frequency and time and similarity of timbre, and simultaneous grouping mechanisms (those that operate across frequency), which exploit cues such as harmonicity and common onset and offset.

Segregation can be based primarily on perceptual cues or rely on the recognition of learned patterns ("schema-based").
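As a toy illustration (not a model from the literature), sequential grouping by frequency proximity can be sketched as a greedy assignment of each incoming tone to the existing stream whose last tone is closest in pitch. The 4-semitone threshold and the tone frequencies below are arbitrary illustrative choices:

```python
import math

def semitones(f1, f2):
    """Pitch distance between two frequencies in semitones."""
    return abs(12.0 * math.log2(f1 / f2))

def segregate_by_proximity(freqs, max_gap=4.0):
    """Assign each tone to the nearest stream within max_gap semitones
    of that stream's last tone; otherwise start a new stream."""
    streams = []  # each stream is a list of frequencies
    for f in freqs:
        best = None
        for s in streams:
            d = semitones(f, s[-1])
            if d <= max_gap and (best is None or d < semitones(f, best[-1])):
                best = s
        if best is None:
            streams.append([f])
        else:
            best.append(f)
    return streams

# Alternating tones an octave apart (12 semitones) split into two streams;
# tones only ~2 semitones apart stay in a single stream.
streams = segregate_by_proximity([400, 800, 400, 800, 400, 800])
```

Real sequential grouping also depends on tempo and on cumulative context, which this sketch ignores.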

The job of ASA is to group incoming sensory information to form an accurate mental representation of the individual sounds. When sounds are grouped by the auditory system into a perceived sequence, distinct from other co-occurring sequences, each of these perceived sequences is called an "auditory stream". In the real world, if the ASA is successful, a stream corresponds to a distinct environmental sound source producing a pattern that persists over time, such as a person talking, a piano playing, or a dog barking. However, in the lab, by manipulating the acoustic parameters of the sounds, it is possible to induce the perception of one or more auditory streams.

Streaming in Auditory Scene Analysis

One example of this is the phenomenon of streaming, also called "stream segregation". [6] If two sounds, A and B, are rapidly alternated in time, after a few seconds the perception may "split" so that the listener hears two streams of sound rather than one, each stream corresponding to the repetitions of one of the two sounds: A-A-A-A-, etc. accompanied by B-B-B-B-, etc. The tendency towards segregation into separate streams is favored by differences in the acoustical properties of sounds A and B. Among the differences classically shown to promote segregation are those of frequency (for pure tones), fundamental frequency (for complex tones), frequency composition, and source location. It has been suggested, however, that almost any systematic perceptual difference between two sequences can elicit streaming, [7] provided the speed of the sequence is sufficient.
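The alternating-tone paradigm is easy to synthesize. The sketch below (frequencies, durations, and ramp length are arbitrary illustrative choices, following van Noorden's A-B-A "gallop" layout) generates a repeating triplet of pure tones; listeners typically hear it split into two streams when the A-B frequency separation is large and the tempo is fast:

```python
import numpy as np

SR = 44100  # sample rate in Hz

def tone(freq, dur=0.1, sr=SR):
    """Pure tone with 5 ms raised-cosine on/off ramps to avoid clicks."""
    t = np.arange(int(sr * dur)) / sr
    x = np.sin(2 * np.pi * freq * t)
    ramp = int(0.005 * sr)
    env = np.ones_like(x)
    env[:ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(ramp) / ramp))
    env[-ramp:] = env[:ramp][::-1]
    return x * env

def aba_sequence(f_a=500.0, f_b=1000.0, repeats=10):
    """A-B-A-(silence) triplets: one galloping stream at small
    A-B separations, two isochronous streams at large ones."""
    silence = np.zeros(int(SR * 0.1))
    triplet = np.concatenate([tone(f_a), tone(f_b), tone(f_a), silence])
    return np.concatenate([triplet] * repeats)

seq = aba_sequence()
```

Writing `seq` to a WAV file and varying `f_b` relative to `f_a` reproduces the frequency-separation effect described above.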


Andranik Tangian argues that the grouping phenomenon is observed not only in dynamics but also in statics. For instance, the sensation of a chord is an effect of acoustical data representation rather than of physical causality: a single physical body, such as a loudspeaker membrane, can produce the effect of several tones, while several physical bodies, such as organ pipes tuned to a chord, can produce the effect of a single tone. From the viewpoint of musical acoustics, a chord is a special kind of sound whose spectrum (the set of partial tones, i.e. sinusoidal oscillations) can be regarded as generated by displacements of a single tone spectrum along the frequency axis. In other words, the chord's interval structure is an acoustical contour drawn by a tone (in dynamics, polyphonic voices are trajectories of tone spectra). This is justified by information theory: if the generative tone is harmonic (i.e. has a salient pitch), such a representation is provably unique and requires the least amount of memory, i.e. it is the least complex in the sense of Kolmogorov. Since this representation is simpler than all others, including the one in which the chord is regarded as a single complex sound, the chord is perceived as a compound. If the generative tone is inharmonic, like a bell-like sound, the interval structure is still recognizable as displacements of a tone spectrum, even if the tone's pitch is undetectable. This optimal-representation-based definition of a chord explains, among other things, the predominance of interval hearing over absolute pitch hearing. [8] [9]
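The idea of a chord spectrum as displaced copies of one tone spectrum can be illustrated numerically. In this sketch the fundamentals (a 4:5:6 just-intonation triad on 200 Hz) and the number of partials are hypothetical example values:

```python
import numpy as np

def harmonic_spectrum(f0, n_partials=6):
    """Partial frequencies of a harmonic tone with fundamental f0."""
    return f0 * np.arange(1, n_partials + 1)

def chord_spectrum(f0s, n_partials=6):
    """A chord's spectrum as the union of displaced copies of a single
    tone spectrum: every note contributes the same harmonic pattern.
    On a log-frequency axis each copy is a constant shift, since
    log(f0 * h) = log(f0) + log(h)."""
    return np.unique(
        np.concatenate([harmonic_spectrum(f, n_partials) for f in f0s])
    )

# Major triad in just intonation (ratios 4:5:6 on a 200 Hz root).
triad = chord_spectrum([200.0, 250.0, 300.0])
```

By construction, every note's partials appear in the chord spectrum as a transposed copy of the same pattern, which is the representational redundancy Tangian's argument exploits.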

Experimental basis

Many experiments have studied the segregation of more complex patterns of sound, such as a sequence of high notes of different pitches, interleaved with low ones. In such sequences, the segregation of co-occurring sounds into distinct streams has a profound effect on the way they are heard. Perception of a melody is formed more easily if all its notes fall in the same auditory stream. We tend to hear the rhythms among notes that are in the same stream, excluding those that are in other streams. Judgments of timing are more precise between notes in the same stream than between notes in separate streams. Even perceived spatial location and perceived loudness can be affected by sequential grouping. While the initial research on this topic was done on human adults, recent studies have shown that some ASA capabilities are present in newborn infants, showing that they are built-in, rather than learned through experience. Other research has shown that non-human animals also display ASA. Currently, scientists are studying the activity of neurons in the auditory regions of the cerebral cortex to discover the mechanisms underlying ASA.

Related Research Articles

Harmony

In music, harmony is the process by which individual sounds are joined or composed into whole units or compositions. Often, the term harmony refers to simultaneously occurring frequencies, pitches, or chords. However, harmony is generally understood to involve both vertical harmony (chords) and horizontal harmony (melody).

Timbre

In music, timbre, also known as tone color or tone quality, is the perceived sound quality of a musical note, sound or tone. Timbre distinguishes different types of sound production, such as choir voices and musical instruments. It also enables listeners to distinguish different instruments in the same category.

Pitch (music)

Pitch is a perceptual property of sounds that allows their ordering on a frequency-related scale, or more commonly, pitch is the quality that makes it possible to judge sounds as "higher" and "lower" in the sense associated with musical melodies. Pitch is a major auditory attribute of musical tones, along with duration, loudness, and timbre.

Auditory illusions are false perceptions of a real sound or outside stimulus. These false perceptions are the equivalent of an optical illusion: the listener hears either sounds which are not present in the stimulus, or sounds that should not be possible given the circumstances in which they were created.

Missing fundamental

A harmonic sound is said to have a missing fundamental, suppressed fundamental, or phantom fundamental when its overtones suggest a fundamental frequency but the sound lacks a component at the fundamental frequency itself. The brain perceives the pitch of a tone not only by its fundamental frequency, but also by the periodicity implied by the relationship between the higher harmonics; we may perceive the same pitch even if the fundamental frequency is missing from a tone.
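The effect is easy to reproduce in synthesis. This sketch (sample rate and frequencies are arbitrary example values) builds a signal from harmonics 2 through 5 of 200 Hz only; no energy is present at 200 Hz itself, yet the waveform still repeats every 1/200 s, which is the periodicity on which the pitch percept is based:

```python
import numpy as np

SR = 16000   # sample rate (Hz)
F0 = 200.0   # implied fundamental, absent from the signal

t = np.arange(int(SR * 0.5)) / SR
# Sum harmonics 2..5 only; the F0 component itself is missing.
x = sum(np.sin(2 * np.pi * F0 * h * t) for h in range(2, 6))

# The waveform nevertheless repeats with period 1/F0
# (here 5 ms, i.e. 80 samples at this sample rate).
period = int(SR / F0)
```

Checking the spectrum confirms there is essentially no energy in the 200 Hz bin even though the waveform's repetition rate is 200 Hz.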

The octave illusion is an auditory illusion discovered by Diana Deutsch in 1973. It is produced when two tones that are an octave apart are repeatedly played in alternation ("high-low-high-low") through stereo headphones. The same sequence is played to both ears simultaneously; however when the right ear receives the high tone, the left ear receives the low tone, and conversely. Instead of hearing two alternating pitches, most subjects instead hear a single tone that alternates between ears while at the same time its pitch alternates between high and low.

Illusory continuity of tones

The illusory continuity of tones is the auditory illusion that arises when a tone is interrupted for a short time, during which a narrow band of noise is played. The noise has to be of a sufficiently high level to effectively mask the gap, unless it is a gap transfer illusion. Whether the tone is of constant, rising or decreasing pitch, the ear perceives it as continuous if the discontinuity is masked by noise. Because the human ear is very sensitive to sudden changes, however, the illusion succeeds only if the amplitude of the tone around the discontinuity does not decrease or increase too abruptly. While the inner mechanisms of this illusion are not well understood, there is evidence that it involves activation primarily in the auditory cortex.

Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means. In essence, CASA systems are "machine listening" systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.

Computer audition (CA) or machine listening is the general field of study of algorithms and systems for audio interpretation by machines. Since the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind. The engineer Paris Smaragdis, interviewed in Technology Review, talks about these systems — "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."

In audio signal processing, auditory masking occurs when the perception of one sound is affected by the presence of another sound.

Sound

In physics, sound is a vibration that propagates as an acoustic wave, through a transmission medium such as a gas, liquid or solid. In human physiology and psychology, sound is the reception of such waves and their perception by the brain. Only acoustic waves that have frequencies lying between about 20 Hz and 20 kHz, the audio frequency range, elicit an auditory percept in humans. In air at atmospheric pressure, these represent sound waves with wavelengths of 17 meters (56 ft) to 1.7 centimeters (0.67 in). Sound waves above 20 kHz are known as ultrasound and are not audible to humans. Sound waves below 20 Hz are known as infrasound. Different animal species have varying hearing ranges.
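The wavelength figures quoted above follow directly from the speed of sound; a quick check, assuming 343 m/s (dry air at about 20 °C):

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def wavelength(freq_hz):
    """Wavelength in meters for a given frequency."""
    return SPEED_OF_SOUND / freq_hz

low = wavelength(20.0)       # ~17 m at 20 Hz
high = wavelength(20_000.0)  # ~1.7 cm at 20 kHz
```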

Albert Bregman

Albert Stanley Bregman was a Canadian academic and researcher in experimental psychology, cognitive science, and Gestalt psychology, primarily in the perceptual organization of sound.

Cognitive musicology is a branch of cognitive science concerned with computationally modeling musical knowledge with the goal of understanding both music and cognition.

Psychoacoustics is the branch of psychophysics involving the scientific study of sound perception and audiology—how the human auditory system perceives various sounds. More specifically, it is the branch of science studying the psychological responses associated with sound. Psychoacoustics is an interdisciplinary field spanning psychology, acoustics, electronic engineering, physics, biology, physiology, and computer science.

In music cognition, melodic fission is a phenomenon in which one line of pitches is heard as two or more separate melodic lines. This occurs when a phrase contains groups of pitches at two or more distinct registers or with two or more distinct timbres.

Multistable auditory perception is a cognitive phenomenon in which certain auditory stimuli can be perceived in multiple ways. While multistable perception has been most commonly studied in the visual domain, it has also been observed in the auditory and olfactory modalities. In the olfactory domain, different scents are piped to the two nostrils, while in the auditory domain, researchers often examine the effects of binaural sequences of pure tones. Generally speaking, multistable perception has three main characteristics: exclusivity, implying that the multiple perceptions cannot occur simultaneously; randomness, indicating that the duration of perceptual phases follows a random law; and inevitability, meaning that subjects are unable to completely block out one percept indefinitely.

Ernst Terhardt is a German engineer and psychoacoustician who made significant contributions in diverse areas of audio communication including pitch perception, music cognition, and Fourier transformation. He was professor in the area of acoustic communication at the Institute of Electroacoustics, Technical University of Munich, Germany.

Temporal envelope (ENV) and temporal fine structure (TFS) are changes in the amplitude and frequency of sound perceived by humans over time. These temporal changes are responsible for several aspects of auditory perception, including loudness, pitch and timbre perception and spatial hearing.

The speech-to-song illusion is an auditory illusion discovered by Diana Deutsch in 1995. A spoken phrase is repeated several times, without altering it in any way, and without providing any context. This repetition causes the phrase to transform perceptually from speech into song.

References

  1. Bregman, A. S. (1990). Auditory scene analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press. ISBN   9780262022972.
  2. Miller, G. A. (1947). "The masking of speech". Psychological Bulletin. 44 (2): 105–129. doi:10.1037/h0055960. PMID   20288932.
  3. Assmann, P. F.; Summerfield, Q. (August 1990). "Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies". The Journal of the Acoustical Society of America. 88 (2): 680–697. Bibcode:1990ASAJ...88..680A. doi:10.1121/1.399772. PMID   2212292.
  4. Gaudrain, E.; Grimault, N.; Healy, E. W.; Béra, J.-C. (2007). "Effect of spectral smearing on the perceptual segregation of vowel sequences". Hearing Research. 231 (1–2): 32–41. doi:10.1016/j.heares.2007.05.001. PMC   2128787 . PMID   17597319.
  5. Billig, A. J.; Davis, M. H.; Deeks, J. M.; Monstrey, J.; Carlyon, R. P. (2013). "Lexical Influences on Auditory Streaming". Current Biology. 23 (16): 1585–1589. doi:10.1016/j.cub.2013.06.042. PMC   3748342 . PMID   23891107.
  6. van Noorden, L. P. A. S. (1975). Temporal coherence in the perception of tones sequences (PDF) (PhD). The Netherlands: Eindhoven University of Technology. Retrieved 10 March 2018.
  7. Moore, B. C. J.; Gockel, H. E. (2012). "Properties of auditory stream formation". Philosophical Transactions of the Royal Society B: Biological Sciences. 367 (1591): 919–931. doi:10.1098/rstb.2011.0355. PMC   3282308 . PMID   22371614.
  8. Tanguiane (Tangian), Andranick (1993). Artificial Perception and Music Recognition. Lecture Notes in Artificial Intelligence. Vol. 746. Berlin-Heidelberg: Springer. ISBN   978-3-540-57394-4.
  9. Tanguiane (Tangian), Andranick (1994). "A principle of correlativity of perception and its application to music recognition". Music Perception. 11 (4): 465–502. doi:10.2307/40285634.