Computational auditory scene analysis

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means. [1] In essence, CASA systems are "machine listening" systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is (at least to some extent) based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.

Principles

Since CASA serves to model functional parts of the auditory system, it is necessary to view those parts of the biological auditory system in terms of known physical models. Consisting of three areas, the outer, middle and inner ear, the auditory periphery acts as a complex transducer that converts sound vibrations into action potentials in the auditory nerve. The outer ear consists of the external ear (pinna), the ear canal and the ear drum. Acting like an acoustic funnel, the outer ear helps to locate the sound source. [2] The ear canal acts as a resonant tube (like an organ pipe) that amplifies frequencies between 2 and 5.5 kHz, with a maximum amplification of about 11 dB occurring around 4 kHz. [3] As the organ of hearing, the cochlea contains two membranes, Reissner's membrane and the basilar membrane. The basilar membrane moves in response to an audio stimulus when the stimulus frequency matches the resonant frequency of a particular region of the membrane. The movement of the basilar membrane displaces the inner hair cells in one direction, which is encoded as a half-wave rectified signal of action potentials in the spiral ganglion cells. The axons of these cells make up the auditory nerve, encoding the rectified stimulus. Auditory nerve responses are selective for certain frequencies, similar to the basilar membrane. For lower frequencies, the fibers exhibit "phase locking". Neurons in higher auditory pathway centers are tuned to specific stimulus features, such as periodicity, sound intensity, and amplitude and frequency modulation. [1] There are also neuroanatomical associations of ASA with posterior cortical areas, including the posterior superior temporal lobes and the posterior cingulate. Studies have found that ASA, and its segregation and grouping operations, are impaired in patients with Alzheimer's disease. [4]

System Architecture

Cochleagram

As the first stage of CASA processing, the cochleagram creates a time-frequency representation of the input signal. Mimicking the outer and middle ear and the frequency analysis performed by the cochlea and hair cells, the signal is decomposed into the frequency bands that the cochlea naturally resolves. Because of the frequency selectivity of the basilar membrane, a filter bank is used to model the membrane, with each filter associated with a specific point on the basilar membrane. [1]
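
The following is a minimal sketch of such a filter-bank front end, not the implementation of any particular CASA system: fourth-order gammatone kernels with ERB-spaced centre frequencies are applied by FIR convolution, and all parameter values (channel count, frequency range, kernel length) are illustrative assumptions.

```python
# Sketch of a gammatone filter bank producing a time-frequency decomposition.
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_kernel(fc, fs, duration=0.025, order=4, b=1.019):
    """Impulse response: product of a gamma envelope and a tone at fc."""
    t = np.arange(int(duration * fs)) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))  # unit-energy normalisation

def gammatone_filterbank(x, fs, n_channels=32, f_lo=80.0, f_hi=5000.0):
    """Return (n_channels, len(x)) band-pass signals and their centre frequencies."""
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)       # Hz -> ERB-rate
    inv_erb_rate = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3   # ERB-rate -> Hz
    centres = inv_erb_rate(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_channels))
    bands = np.stack([np.convolve(x, gammatone_kernel(fc, fs), mode="same")
                      for fc in centres])
    return bands, centres

# Example: decompose a 440 Hz tone in noise.
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs) + 0.1 * np.random.randn(fs)
bands, centres = gammatone_filterbank(x, fs)
```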

Since the hair cells produce spike patterns, each filter of the model should also produce an impulse response resembling such a spike pattern. The gammatone filter provides an impulse response that is the product of a gamma function and a tone. The output of a gammatone filter can be regarded as a measurement of basilar membrane displacement. Most CASA systems represent the firing rate in the auditory nerve rather than using a spike-based representation. To obtain this, the filter bank outputs are half-wave rectified and then passed through a square root (other models, such as automatic gain control, have also been implemented). The half-wave rectified signal approximates the displacement characteristic of the hair cells. Additional hair cell models include the Meddis hair cell model, which is paired with a gammatone filter bank and models hair cell transduction. [5] It is based on the assumption that there are three reservoirs of transmitter substance within each hair cell and that transmitter is released in proportion to the degree of displacement of the basilar membrane; the release is equated with the probability of a spike being generated in the nerve fiber. This model replicates many auditory nerve responses in CASA systems, such as rectification, compression, spontaneous firing, and adaptation. [1]
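
A minimal sketch of the firing-rate conversion described above, assuming band signals from a gammatone filter bank such as the one sketched earlier: half-wave rectification, square-root compression, and frame averaging stand in for a full hair-cell model (the Meddis model is not implemented here), and the frame length is an illustrative assumption.

```python
# Sketch of a simple hair-cell stage: rectify, compress, and frame-average.
import numpy as np

def haircell_rate(bands, fs, frame_ms=10.0):
    """bands: (n_channels, n_samples) filter-bank output -> (n_channels, n_frames)."""
    rectified = np.maximum(bands, 0.0)   # half-wave rectification
    compressed = np.sqrt(rectified)      # square-root compression
    hop = int(fs * frame_ms / 1000.0)
    n_frames = compressed.shape[1] // hop
    frames = compressed[:, :n_frames * hop].reshape(compressed.shape[0], n_frames, hop)
    return frames.mean(axis=2)           # crude firing-rate estimate per frame

# Example with a dummy filter-bank output (32 channels, 1 s at 16 kHz).
fs = 16000
bands = np.random.randn(32, fs)
cochleagram = haircell_rate(bands, fs)   # shape (32, 100)
```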

Correlogram

The correlogram is an important model of pitch perception, unifying the two schools of pitch theory, place theory and temporal theory. [1]

The correlogram is generally computed in the time domain by autocorrelating the simulated auditory nerve firing activity at the output of each filter channel. [1] When the autocorrelation functions are pooled across frequency, the position of the peak in the summary correlogram corresponds to the perceived pitch. [1]
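
A minimal sketch of a correlogram-based pitch estimate, assuming firing-rate channels from a cochleagram-style front end; the pitch search range and the brute-force autocorrelation are illustrative simplifications rather than any specific published implementation.

```python
# Sketch: per-channel autocorrelation, pooled into a summary correlogram.
import numpy as np

def correlogram(channels, max_lag):
    """channels: (n_channels, n_samples) -> (n_channels, max_lag + 1) autocorrelations."""
    n_ch, n = channels.shape
    acf = np.zeros((n_ch, max_lag + 1))
    for c in range(n_ch):
        x = channels[c] - channels[c].mean()
        for lag in range(max_lag + 1):
            acf[c, lag] = np.dot(x[:n - lag], x[lag:])
    return acf

def pitch_from_summary(channels, fs, f_min=80.0, f_max=400.0):
    max_lag = int(fs / f_min)
    summary = correlogram(channels, max_lag).sum(axis=0)  # pool across channels
    lo = int(fs / f_max)
    lag = lo + np.argmax(summary[lo:])                    # strongest common periodicity
    return fs / lag

# Example: simulated rectified channels driven by a 200 Hz periodic source.
fs = 8000
t = np.arange(fs) / fs
channels = np.maximum(np.sin(2 * np.pi * 200 * t)[None, :] * np.ones((8, 1)), 0.0)
print(pitch_from_summary(channels, fs))   # close to 200.0
```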

Cross-Correlogram

Because the two ears receive audio signals at different times, the direction of a sound source can be determined from the delays between the two ears. [6] By cross-correlating the left and right channels (of the model), coincident peaks can be grouped as belonging to the same localized sound, regardless of their temporal position in the input signal. [1] The use of an interaural cross-correlation mechanism has been supported by physiological studies, which show a parallel arrangement of neurons in the auditory midbrain. [7]
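
A minimal sketch of interaural cross-correlation for a single channel, not a full binaural CASA model: the lag of the cross-correlation peak between left and right signals estimates the interaural time difference (ITD), and the maximum delay considered is an illustrative assumption.

```python
# Sketch: estimate ITD as the lag of the left/right cross-correlation peak.
import numpy as np

def itd_estimate(left, right, fs, max_itd_s=0.001):
    """Return the estimated ITD in seconds (positive: right lags left)."""
    max_lag = int(max_itd_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    seg = slice(max_lag, len(left) - max_lag)             # trim edges affected by shifting
    xcorr = np.array([np.dot(left[seg], np.roll(right, -lag)[seg]) for lag in lags])
    return lags[np.argmax(xcorr)] / fs

# Example: a source delayed by 0.5 ms in the right ear.
fs = 16000
src = np.random.randn(fs)
delay = int(0.0005 * fs)                                  # 8 samples
left = src
right = np.concatenate([np.zeros(delay), src[:-delay]])
print(itd_estimate(left, right, fs))                      # approximately 0.0005
```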

Time-Frequency Masks

To segregate the sound source, CASA systems mask the cochleagram. The mask, sometimes a Wiener filter, weights the regions belonging to the target source and suppresses the rest. [1] The physiological motivation for the mask comes from auditory masking, in which a sound is rendered inaudible by a louder sound. [8]
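
A minimal sketch of two common time-frequency masks, assuming the target and interference energies in each cochleagram cell are known (an "ideal" mask, as used for evaluation); estimating these energies from the mixture is the harder problem that a full CASA system must solve.

```python
# Sketch: Wiener-like ratio mask and ideal binary mask over (channels x frames) cells.
import numpy as np

def ratio_mask(target_energy, noise_energy):
    """Wiener-like mask in [0, 1] for each time-frequency cell."""
    return target_energy / (target_energy + noise_energy + 1e-12)

def binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Ideal binary mask: 1 where the target exceeds the noise by lc_db."""
    snr_db = 10 * np.log10((target_energy + 1e-12) / (noise_energy + 1e-12))
    return (snr_db > lc_db).astype(float)

# Example on random (channels x frames) energy maps.
target = np.abs(np.random.randn(32, 100)) ** 2
noise = np.abs(np.random.randn(32, 100)) ** 2
masked = ratio_mask(target, noise) * (target + noise)   # weight the mixture energy
```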

Resynthesis

A resynthesis pathway reconstructs an audio signal from a group of segments. This is achieved by inverting the cochleagram, from which high-quality resynthesized speech signals can be obtained. [1]
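
A simplified resynthesis sketch, assuming gammatone band signals and a frame-rate mask: each band is weighted by its mask value (held constant over a frame) and the bands are summed. A full cochleagram inversion also compensates for the per-channel filter delay and phase (for example by time-reversed filtering), which is omitted here.

```python
# Sketch: mask-weighted sum of filter-bank channels as a crude resynthesis.
import numpy as np

def resynthesize(bands, mask, hop):
    """bands: (n_channels, n_samples); mask: (n_channels, n_frames); hop: samples/frame."""
    n_ch, n_samples = bands.shape
    out = np.zeros(n_samples)
    for c in range(n_ch):
        # Hold each frame's mask value over the samples of that frame.
        gains = np.repeat(mask[c], hop)[:n_samples]
        gains = np.pad(gains, (0, n_samples - len(gains)), mode="edge")
        out += bands[c] * gains
    return out

# Example with dummy band signals and an all-pass (unity) mask.
fs, hop = 16000, 160
bands = np.random.randn(32, fs)
mask = np.ones((32, fs // hop))
y = resynthesize(bands, mask, hop)
```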

Applications

Monaural CASA

Monaural sound separation began with systems that separated voices on the basis of frequency, and many early developments segmented different speech signals by frequency. [1] Later models extended this approach by adding adaptation through state-space models, batch processing, and prediction-driven architectures. [9] The use of CASA has improved the robustness of automatic speech recognition (ASR) and speech separation systems. [10]

Binaural CASA

Since CASA models the human auditory pathway, binaural CASA systems follow the human model more closely by including two spatially separated microphones, providing sound localization, auditory grouping and robustness to reverberation. With methods similar to cross-correlation, such systems are able to extract the target signal from the two microphone inputs. [11] [12]

Neural CASA Models

Since the biological auditory system is deeply connected with the actions of neurons, CASA systems have also incorporated neural models into their designs. Two different models provide the basis for this area. Von der Malsburg and Schneider proposed a neural network model in which oscillators represent the features of different streams: oscillators within a stream are synchronized, while oscillators belonging to different streams are desynchronized. [13] Wang presented a model using a network of excitatory units with a global inhibitor and delay lines to represent the auditory scene in the time-frequency plane. [14] [15]
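
The following toy illustration of oscillatory correlation is neither von der Malsburg and Schneider's network nor Wang's model; it is a generic Kuramoto-style phase-oscillator sketch in which units assigned to the same stream are coupled excitatorily and units in different streams inhibitorily, so that synchrony within a stream and desynchrony between streams emerge.

```python
# Toy sketch of oscillatory correlation with coupled phase oscillators.
import numpy as np

def simulate(n_per_stream=4, steps=2000, dt=0.01, k_in=2.0, k_out=-1.0):
    n = 2 * n_per_stream
    stream = np.repeat([0, 1], n_per_stream)               # stream label per unit
    # Coupling matrix: excitatory within a stream, inhibitory across streams.
    coupling = np.where(stream[:, None] == stream[None, :], k_in, k_out)
    np.fill_diagonal(coupling, 0.0)
    phase = 2 * np.pi * np.random.rand(n)
    omega = 2 * np.pi * 1.0                                # common natural frequency
    for _ in range(steps):
        diff = phase[None, :] - phase[:, None]             # phase differences
        phase += dt * (omega + (coupling * np.sin(diff)).sum(axis=1) / n)
    return np.mod(phase, 2 * np.pi), stream

phase, stream = simulate()
# Oscillators sharing a stream end up with nearly equal phases.
print(np.round(phase, 2), stream)
```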

Analysis of Musical Audio Signals

Typical CASA systems start by segmenting sound sources into individual constituents, in an attempt to mimic the physical auditory system. However, there is evidence that the brain does not necessarily process audio input separately, but rather as a mixture. [16] Instead of breaking the audio signal down into individual constituents, the input is described by higher-level descriptors, such as chords, bass and melody, beat structure, and chorus and phrase repetitions. These descriptors run into difficulties in real-world scenarios, with both monaural and binaural signals. [1] The estimation of these descriptors is also highly dependent on the cultural context of the musical input. For example, within Western music, the melody and bass influence the identity of the piece, with the core formed by the melody. By distinguishing the frequency responses of melody and bass, a fundamental frequency can be estimated for each and tracked separately. [17] Chord detection can be implemented through pattern recognition, by extracting low-level features describing the harmonic content. [18] The techniques used in music scene analysis can also be applied to speech recognition and other environmental sounds. [19] Future work includes top-down integration of audio signal processing, such as real-time beat tracking, and expansion beyond the signal-processing realm through the incorporation of auditory psychology and physiology. [20]
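
A minimal sketch of chord detection by chroma template matching, a generic approach rather than the specific method of the cited work: spectral energy is mapped to twelve pitch classes and the resulting chroma vector is correlated with major and minor triad templates. All parameter values are illustrative.

```python
# Sketch: chroma extraction and triad-template chord detection.
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma(x, fs, n_fft=8192, f_min=55.0, f_max=2000.0):
    spec = np.abs(np.fft.rfft(x, n_fft))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    chroma_vec = np.zeros(12)
    for f, m in zip(freqs, spec):
        if f_min <= f <= f_max:
            pitch_class = int(round(12 * np.log2(f / 440.0))) % 12   # 0 = A
            chroma_vec[(pitch_class + 9) % 12] += m                  # shift so 0 = C
    return chroma_vec / (np.linalg.norm(chroma_vec) + 1e-12)

def detect_chord(chroma_vec):
    best, best_score = None, -np.inf
    for root in range(12):
        for name, third in (("maj", 4), ("min", 3)):
            template = np.zeros(12)
            template[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            score = np.dot(chroma_vec, template / np.linalg.norm(template))
            if score > best_score:
                best, best_score = f"{NOTE_NAMES[root]}{name}", score
    return best

# Example: a synthetic C major triad (C4, E4, G4).
fs = 16000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * f * t) for f in (261.63, 329.63, 392.00))
print(detect_chord(chroma(x, fs)))   # typically prints "Cmaj"
```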

Neural Perceptual Modeling

While many models treat the audio signal as a complex combination of different frequencies, modeling the auditory system may also require consideration of its neural components. Taking a holistic view, in which a stream (of feature-based sounds) corresponds to neuronal activity distributed across many brain areas, the perception of the sound can be mapped and modeled. Two different solutions have been proposed for how auditory perception is bound to activity in the brain. Hierarchical coding proposes that many cells encode all possible combinations of features and objects in the auditory scene. [21] [22] Temporal or oscillatory correlation addresses the binding problem by focusing on synchrony and desynchrony between neural oscillations to encode the state of binding among auditory features. [1] These two solutions are closely analogous to the debate between place coding and temporal coding. Modeling neural components also raises another issue for CASA systems: the extent to which neural mechanisms should be modeled. Studies of CASA systems have modeled some known mechanisms, such as the bandpass nature of cochlear filtering and random auditory nerve firing patterns; however, these models may not lead to the discovery of new mechanisms, but rather give an understanding of the purpose of the known mechanisms. [23]

Further reading

D. F. Rosenthal and H. G. Okuno (1998). Computational Auditory Scene Analysis. Mahwah, NJ: Lawrence Erlbaum.

References

  1. Wang, D. L. and Brown, G. J. (Eds.) (2006). Computational auditory scene analysis: Principles, algorithms and applications. IEEE Press/Wiley-Interscience.
  2. Warren, R. (1999). Auditory Perception: A New Analysis and Synthesis. New York: Cambridge University Press.
  3. Wiener, F. (1947). "On the diffraction of a progressive wave by the human head". Journal of the Acoustical Society of America, 19, 143–146.
  4. Goll, J., Kim, L. (2012). "Impairments of auditory scene analysis in Alzheimer's disease". Brain, 135(1), 190–200.
  5. Meddis, R., Hewitt, M., Shackleton, T. (1990). "Implementation details of a computational model of the inner hair-cell/auditory nerve synapse". Journal of the Acoustical Society of America, 87(4), 1813–1816.
  6. Jeffress, L. A. (1948). "A place theory of sound localization". Journal of Comparative and Physiological Psychology, 41, 35–39.
  7. Yin, T., Chan, J. (1990). "Interaural time sensitivity in medial superior olive of cat". Journal of Neurophysiology, 64(2), 465–488.
  8. Moore, B. (2003). An Introduction to the Psychology of Hearing (5th ed.). London: Academic Press.
  9. Ellis, D. (1996). "Prediction-Driven Computational Auditory Scene Analysis". PhD thesis, MIT Department of Electrical Engineering and Computer Science.
  10. Li, P., Guan, Y. (2010). "Monaural speech separation based on MASVQ and CASA for robust speech recognition". Computer Speech and Language, 24, 30–44.
  11. Bodden, M. (1993). "Modeling human sound-source localization and the cocktail-party effect". Acta Acustica, 1, 43–55.
  12. Lyon, R. (1983). "A computational model of binaural localization and separation". Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1148–1151.
  13. Von der Malsburg, C., Schneider, W. (1986). "A neural cocktail-party processor". Biological Cybernetics, 54, 29–40.
  14. Wang, D. (1994). "Auditory stream segregation based on oscillatory correlation". Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, 624–632.
  15. Wang, D. (1996). "Primitive auditory segregation based on oscillatory correlation". Cognitive Science, 20, 409–456.
  16. Bregman, A. (1995). "Constraints on computational models of auditory scene analysis as derived from human perception". The Journal of the Acoustical Society of Japan (E), 16(3), 133–136.
  17. Goto, M. (2004). "A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals". Speech Communication, 43, 311–329.
  18. Raś, Z., Wieczorkowska, A. (2010). "Advances in Music Information Retrieval". Studies in Computational Intelligence, 274, 119–142.
  19. Masuda-Katsuse, I. (2001). "A new method for speech recognition in the presence of non-stationary, unpredictable and high-level noise". Proceedings of Eurospeech, 1119–1122.
  20. Goto, M. (2001). "An audio-based real-time beat tracking system for music with or without drum sounds". Journal of New Music Research, 30(2), 159–171.
  21. deCharms, R., Merzenich, M. (1996). "Primary cortical representation of sounds by the coordination of action-potential timing". Nature, 381, 610–613.
  22. Wang, D. (2005). "The time dimension of scene analysis". IEEE Transactions on Neural Networks, 16(6), 1401–1426.
  23. Bregman, A. (1990). Auditory Scene Analysis. Cambridge, MA: MIT Press.