Temporal envelope (ENV) and temporal fine structure (TFS) are changes in the amplitude and frequency of sound perceived by humans over time. These temporal changes are responsible for several aspects of auditory perception, including loudness, pitch and timbre perception and spatial hearing.
Complex sounds such as speech or music are decomposed by the peripheral auditory system of humans into narrow frequency bands. The resulting narrow-band signals convey information at different time scales ranging from less than one millisecond to hundreds of milliseconds. A dichotomy between slow "temporal envelope" cues and faster "temporal fine structure" cues has been proposed to study several aspects of auditory perception (e.g., loudness, pitch and timbre perception, auditory scene analysis, sound localization) at two distinct time scales in each frequency band. [1] [2] [3] [4] [5] [6] [7] Over the last decades, a wealth of psychophysical, electrophysiological and computational studies based on this envelope/fine-structure dichotomy have examined the role of these temporal cues in sound identification and communication, how these temporal cues are processed by the peripheral and central auditory system, and the effects of aging and cochlear damage on temporal auditory processing. Although the envelope/fine-structure dichotomy has been debated and questions remain as to how temporal fine structure cues are actually encoded in the auditory system, these studies have led to a range of applications in various fields including speech and audio processing, clinical audiology and rehabilitation of sensorineural hearing loss via hearing aids or cochlear implants.
Notions of temporal envelope and temporal fine structure may have different meanings in many studies. An important distinction to make is between the physical (i.e., acoustical) and the biological (or perceptual) description of these ENV and TFS cues.
Any sound whose frequency components cover a narrow range (called a narrowband signal) can be considered as an envelope (ENVp, where p denotes the physical signal) superimposed on a more rapidly oscillating carrier, the temporal fine structure (TFSp). [8]
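One common way to compute this decomposition for a narrowband signal is the analytic signal obtained with the Hilbert transform: its magnitude gives ENVp and the cosine of its phase gives TFSp. The following is a minimal sketch under stated assumptions, not a reference implementation. The test signal (a 1000 Hz carrier with a 62.5 Hz modulator), the sampling rate and the function name `analytic_signal` are illustrative choices, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a naive O(N^2) DFT (short signals only)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(1, n):
        if n % 2 == 0 and k == n // 2:
            continue                 # Nyquist bin is kept unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2.0       # double the positive frequencies
        else:
            spectrum[k] = 0.0        # discard the negative frequencies
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


# Hypothetical narrowband test signal: 1000 Hz carrier, 62.5 Hz modulator, depth 0.5.
fs, n = 8000, 256
true_env = [1.0 + 0.5 * math.cos(2 * math.pi * 62.5 * t / fs) for t in range(n)]
carrier = [math.cos(2 * math.pi * 1000.0 * t / fs) for t in range(n)]
x = [e * c for e, c in zip(true_env, carrier)]

z = analytic_signal(x)
env = [abs(v) for v in z]                    # ENVp estimate
tfs = [math.cos(cmath.phase(v)) for v in z]  # TFSp estimate (unit-amplitude carrier)
```

For this idealized signal the recovered envelope and carrier match the ones used to build it. For broadband signals the decomposition is not well defined, as noted in the next paragraph.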
Many sounds in everyday life, including speech and music, are broadband; the frequency components spread over a wide range and there is no well-defined way to represent the signal in terms of ENVp and TFSp. However, in a normally functioning cochlea, complex broadband signals are decomposed by the filtering on the basilar membrane (BM) within the cochlea into a series of narrowband signals. [9] Therefore, the waveform at each place on the BM can be considered as an envelope (ENVBM) superimposed on a more rapidly oscillating carrier, the temporal fine structure (TFSBM). [10] The ENVBM and TFSBM depend on the place along the BM. At the apical end, which is tuned to low (audio) frequencies, ENVBM and TFSBM vary relatively slowly with time, while at the basal end, which is tuned to high frequencies, both ENVBM and TFSBM vary more rapidly with time. [10]
Both ENVBM and TFSBM are represented in the time patterns of action potentials in the auditory nerve; [11] these are denoted ENVn and TFSn. TFSn is represented most prominently in neurons tuned to low frequencies, while ENVn is represented most prominently in neurons tuned to high (audio) frequencies. [11] [12] For a broadband signal, it is not possible to manipulate TFSp without affecting ENVBM and ENVn, and it is not possible to manipulate ENVp without affecting TFSBM and TFSn. [13] [14]
The neural representation of stimulus envelope, ENVn, has typically been studied using well-controlled ENVp modulations, that is sinusoidally amplitude-modulated (AM) sounds. Cochlear filtering limits the range of AM rates encoded in individual auditory-nerve fibers. In the auditory nerve, the strength of the neural representation of AM decreases with increasing modulation rate. At the level of the cochlear nucleus, several cell types show an enhancement of ENVn information. Multipolar cells can show band-pass tuning to AM tones with AM rates between 50 and 1000 Hz. [15] [16] Some of these cells show an excellent response to the ENVn and provide inhibitory sideband inputs to other cells in the cochlear nucleus giving a physiological correlate of comodulation masking release, a phenomenon whereby the detection of a signal in a masker is improved when the masker has correlated envelope fluctuations across frequency (see section below). [17] [18]
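A sinusoidally amplitude-modulated tone can equivalently be written as a carrier plus two spectral sidebands at the carrier frequency plus and minus the modulation rate. This suggests one way to see why cochlear filtering limits the AM rates a fiber can encode: at high rates the sidebands fall outside the passband of the filter. The sketch below checks the identity numerically. The carrier frequency, modulation rate, modulation depth and function names are illustrative assumptions.

```python
import math


def sam_tone(t, fc, fm, m):
    """SAM tone: envelope (1 + m*cos) multiplied by a sinusoidal carrier."""
    return (1.0 + m * math.cos(2 * math.pi * fm * t)) * math.sin(2 * math.pi * fc * t)


def carrier_plus_sidebands(t, fc, fm, m):
    """Same waveform written as three sinusoids at fc, fc + fm and fc - fm."""
    return (
        math.sin(2 * math.pi * fc * t)
        + 0.5 * m * math.sin(2 * math.pi * (fc + fm) * t)
        + 0.5 * m * math.sin(2 * math.pi * (fc - fm) * t)
    )


# Illustrative values: 2000 Hz carrier, 100 Hz modulation, depth 0.8, 100 ms at 16 kHz.
fs = 16000
times = [k / fs for k in range(1600)]
max_diff = max(
    abs(sam_tone(t, 2000.0, 100.0, 0.8) - carrier_plus_sidebands(t, 2000.0, 100.0, 0.8))
    for t in times
)
```

Here `max_diff` is at the level of floating-point rounding error, since the two expressions are the same waveform.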
Responses to the temporal-envelope cues of speech or other complex sounds persist up the auditory pathway, eventually to the various fields of the auditory cortex in many animals. In the primary auditory cortex (A1), responses can encode AM rates by phase-locking up to about 20–30 Hz, [19] [20] [21] [22] while faster rates induce sustained and often tuned responses. [23] [24] A topographical representation of AM rate has been demonstrated in the primary auditory cortex of awake macaques. [25] This representation is approximately perpendicular to the axis of the tonotopic gradient, consistent with an orthogonal organization of spectral and temporal features in the auditory cortex. Combining these temporal responses with the spectral selectivity of A1 neurons gives rise to the spectro-temporal receptive fields that often capture cortical responses to complex modulated sounds well. [26] [27] In secondary auditory cortical fields, responses become temporally more sluggish and spectrally broader, but are still able to phase-lock to the salient features of speech and musical sounds. [28] [29] [30] [31] Tuning to AM rates below about 64 Hz is also found in the human auditory cortex [32] [33] [34] [35] as revealed by brain-imaging techniques (fMRI) and cortical recordings in epileptic patients (electrocorticography). This is consistent with neuropsychological studies of brain-damaged patients [36] and with the notion that the central auditory system performs some form of spectral decomposition of the ENVp of incoming sounds. The ranges over which cortical responses accurately encode the temporal-envelope cues of speech have been shown to be predictive of the human ability to understand speech.
In the human superior temporal gyrus (STG), an anterior-posterior spatial organization of spectro-temporal modulation tuning has been found in response to speech sounds, the posterior STG being tuned for temporally fast varying speech sounds with low spectral modulations and the anterior STG being tuned for temporally slow varying speech sounds with high spectral modulations. [37]
One unexpected aspect of phase locking in the auditory cortex has been observed in the responses elicited by complex acoustic stimuli with spectrograms that exhibit relatively slow envelopes (< 20 Hz), but that are carried by fast modulations that are as high as hundreds of hertz. Speech and music, as well as various modulated noise stimuli, have such temporal structure. [38] For these stimuli, cortical responses phase-lock to both the envelope and fine-structure induced by interactions between unresolved harmonics of the sound, thus reflecting the pitch of the sound, and exceeding the typical lower limits of cortical phase-locking to the envelopes of a few tens of hertz. This paradoxical relation [38] [39] between the slow and fast cortical phase-locking to the carrier “fine structure” has been demonstrated both in the auditory [38] and visual [40] cortices. It has also been shown to be amply manifested in measurements of the spectro-temporal receptive fields of the primary auditory cortex, giving them unexpectedly fine temporal accuracy and selectivity bordering on a 5–10 ms resolution. [38] [40] The underlying causes of this phenomenon have been attributed to several possible origins, including nonlinear synaptic depression and facilitation, and/or a cortical network of thalamic excitation and cortical inhibition. [38] [41] [42] [43] There are many functionally significant and perceptually relevant reasons for the coexistence of these two complementary dynamic response modes. They include the ability to accurately encode onsets and other rapid ‘events’ in the ENVp of complex acoustic and other sensory signals, features that are critical for the perception of consonants (speech) and percussive sounds (music), as well as the texture of complex sounds. [38] [44]
The perception of ENVp depends on which AM rates are contained in the signal. Low rates of AM, in the 1–8 Hz range, are perceived as changes in perceived intensity, that is, loudness fluctuations (a percept that can also be evoked by frequency modulation, FM); at higher rates, AM is perceived as roughness, with the greatest roughness sensation occurring at around 70 Hz; [45] at even higher rates, AM can evoke a weak pitch percept corresponding to the modulation rate. [46] Rainstorms, crackling fire, chirping crickets or galloping horses produce "sound textures" (the collective result of many similar acoustic events), whose perception is mediated by ENVn statistics. [47] [48]
The auditory detection threshold for AM as a function of AM rate, referred to as the temporal modulation transfer function (TMTF), [49] is best for AM rates in the range from 4 to 150 Hz and worsens outside that range. [49] [50] [51] The cutoff frequency of the TMTF gives an estimate of temporal acuity (temporal resolution) for the auditory system. This cutoff frequency corresponds to a time constant of about 1–3 ms for the auditory system of normal-hearing humans.
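The link between a cutoff frequency and a time constant can be made explicit if the upper limb of the TMTF is approximated by a first-order lowpass filter, for which the time constant is 1/(2π × cutoff frequency). This approximation is an assumption of the sketch below rather than a statement about the cited measurements, and the function names are illustrative.

```python
import math


def lowpass_cutoff_hz(time_constant_s):
    """-3 dB cutoff of a first-order lowpass filter with the given time constant."""
    return 1.0 / (2.0 * math.pi * time_constant_s)


def lowpass_time_constant_s(cutoff_hz):
    """Time constant of a first-order lowpass filter with the given -3 dB cutoff."""
    return 1.0 / (2.0 * math.pi * cutoff_hz)


cutoff_for_1ms = lowpass_cutoff_hz(0.001)  # about 159 Hz
cutoff_for_3ms = lowpass_cutoff_hz(0.003)  # about 53 Hz
```

Under this assumption, time constants of 1–3 ms correspond to cutoff frequencies of roughly 53–159 Hz.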
Correlated envelope fluctuations across frequency in a masker can aid detection of a pure tone signal, an effect known as comodulation masking release. [18]
AM applied to a given carrier can perceptually interfere with the detection of a target AM imposed on the same carrier, an effect termed modulation masking. [52] [53] Modulation-masking patterns are tuned (greater masking occurs for masking and target AMs close in modulation rate), suggesting that the human auditory system is equipped with frequency-selective channels for AM. Moreover, AM applied to spectrally remote carriers can perceptually interfere with the detection of AM on a target sound, an effect termed modulation detection interference. [54] The notion of modulation channels is also supported by the demonstration of selective adaptation effects in the modulation domain. [55] [56] [57] These studies show that AM detection thresholds are selectively elevated above pre-exposure thresholds when the carrier frequency and the AM rate of the adaptor are similar to those of the test tone.
Human listeners are sensitive to relatively slow "second-order" AM cues, which correspond to fluctuations in the strength of AM. These cues arise from the interaction of different modulation rates, previously described as "beating" in the envelope-frequency domain. Perception of second-order AM has been interpreted as resulting from nonlinear mechanisms in the auditory pathway that produce an audible distortion component at the envelope beat frequency in the internal modulation spectrum of the sounds. [58] [59] [60]
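The role of a nonlinearity can be illustrated numerically. An envelope containing two modulation rates (here 50 and 55 Hz) has no spectral component at the 5 Hz beat rate, but a nonlinearity applied to that envelope creates one. The sketch below uses squaring as a stand-in nonlinearity. This choice, the modulation rates and depths, and the helper `amplitude_at` are assumptions made for illustration and are not taken from the cited studies.

```python
import cmath
import math


def amplitude_at(x, freq_hz, fs):
    """Amplitude of the sinusoidal component of x at freq_hz (single-bin DFT)."""
    proj = sum(v * cmath.exp(-2j * math.pi * freq_hz * k / fs) for k, v in enumerate(x))
    return 2.0 * abs(proj) / len(x)


fs, n = 1000, 1000          # 1 s of envelope sampled at 1 kHz
f1, f2 = 50.0, 55.0         # two first-order modulation rates
beat = f2 - f1              # 5 Hz second-order (beat) rate

envelope = [
    1.0
    + 0.25 * math.cos(2 * math.pi * f1 * k / fs)
    + 0.25 * math.cos(2 * math.pi * f2 * k / fs)
    for k in range(n)
]
squared = [e * e for e in envelope]  # illustrative nonlinearity

linear_beat_amp = amplitude_at(envelope, beat, fs)
distorted_beat_amp = amplitude_at(squared, beat, fs)
```

The cross term of the squared envelope is 2 × 0.25 × 0.25 × cos(a)cos(b) = 0.0625[cos(a − b) + cos(a + b)], so the component at the beat rate has amplitude 0.0625 after squaring and is absent before it.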
Interaural time differences in the envelope provide binaural cues even at high frequencies where TFSn cannot be used. [61]
The most basic computer model of ENV processing is the leaky integrator model. [62] [49] This model extracts the temporal envelope of the sound (ENVp) via bandpass filtering, half-wave rectification (which may be followed by fast-acting amplitude compression), and lowpass filtering with a cutoff frequency between about 60 and 150 Hz. The leaky integrator is often used with a decision statistic based on either the resulting envelope power, the max/min ratio, or the crest factor. This model accounts for the loss of auditory sensitivity for AM rates higher than about 60–150 Hz for broadband noise carriers. [49] Based on the concept of frequency selectivity for AM, [53] the perception model of Torsten Dau [63] incorporates broadly tuned bandpass modulation filters (with a Q value around 1) to account for data from a broad variety of psychoacoustic tasks and particularly AM detection for noise carriers with different bandwidths, taking into account their intrinsic envelope fluctuations. This model has been extended to account for comodulation masking release (see sections above). [64] The shapes of the modulation filters have been estimated [65] and an “envelope power spectrum model” (EPSM) based on these filters can account for AM masking patterns and AM depth discrimination. [66] The EPSM has been extended to the prediction of speech intelligibility [67] and to account for data from a broad variety of psychoacoustic tasks. [68] A physiologically based processing model simulating brainstem responses has also been developed to account for AM detection and AM masking patterns. [69]
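The processing chain of the leaky integrator can be sketched in a few lines. The example below is a simplified illustration, not the published model. The initial bandpass stage is omitted because the test signal is already narrowband, the lowpass stage is a one-pole filter with an assumed 100 Hz cutoff, and the decision statistic is the max/min ratio of the extracted envelope. The stimulus parameters and function names are likewise assumptions.

```python
import math


def half_wave_rectify(x):
    """Keep positive half-cycles and set negative samples to zero."""
    return [max(v, 0.0) for v in x]


def one_pole_lowpass(x, cutoff_hz, fs):
    """First-order recursive lowpass filter (the 'leaky integrator')."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, state = [], 0.0
    for v in x:
        state += alpha * (v - state)
        y.append(state)
    return y


def max_min_ratio(x, fs, m, fc=4000.0, fm=10.0, dur_s=0.5, cutoff_hz=100.0):
    """Max/min ratio of the extracted envelope of a SAM tone with depth m."""
    n = int(dur_s * fs)
    tone = [
        (1.0 + m * math.cos(2 * math.pi * fm * k / fs)) * math.sin(2 * math.pi * fc * k / fs)
        for k in range(n)
    ]
    env = one_pole_lowpass(half_wave_rectify(tone), cutoff_hz, fs)
    steady = env[int(0.05 * fs):]  # drop the first 50 ms (filter onset transient)
    return max(steady) / min(steady)


fs = 32000
ratio_modulated = max_min_ratio(None, fs, m=0.5)    # 10 Hz AM, depth 0.5
ratio_unmodulated = max_min_ratio(None, fs, m=0.0)  # no AM
```

The statistic is clearly larger for the modulated tone than for the unmodulated one. The residual carrier ripple left by the one-pole filter keeps the unmodulated ratio slightly above 1.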
The neural representation of temporal fine structure, TFSn, has been studied using stimuli with well-controlled TFSp: pure tones, harmonic complex tones, and frequency-modulated (FM) tones.
Auditory-nerve fibres are able to represent low-frequency sounds via their phase-locked discharges (i.e., TFSn information). The upper frequency limit for phase locking is species dependent. It is about 5 kHz in the cat, 9 kHz in the barn owl and just 4 kHz in the guinea pig. We do not know the upper limit of phase locking in humans but current, indirect, estimates suggest it is about 4–5 kHz. [70] Phase locking is a direct consequence of the transduction process with an increase in probability of transduction channel opening occurring with a stretching of the stereocilia and decrease in channel opening occurring when pushed in the opposite direction. This has led some to suggest that phase locking is an epiphenomenon. The upper limit appears to be determined by a cascade of low pass filters at the level of the inner hair cell and auditory-nerve synapse. [71] [72]
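The strength of phase locking is commonly quantified with a synchronization index, often called vector strength: each spike is treated as a unit vector at the stimulus phase at which it occurred, and the length of the mean vector is taken. This metric is not named in the text above. The sketch below, including the synthetic spike trains, is an illustrative assumption about how such a measurement could be made.

```python
import cmath
import math


def vector_strength(spike_times_s, freq_hz):
    """Length of the mean unit vector of spike phases (1 = perfect locking, 0 = none)."""
    if not spike_times_s:
        raise ValueError("at least one spike is required")
    resultant = sum(cmath.exp(2j * math.pi * freq_hz * t) for t in spike_times_s)
    return abs(resultant) / len(spike_times_s)


freq = 500.0  # stimulus frequency in Hz

# Synthetic spike train 1: one spike per cycle, always at the same phase.
locked = [k / freq + 0.0003 for k in range(80)]

# Synthetic spike train 2: spikes spread evenly over eight phases of the cycle.
spread = [k / freq + (k % 8) / (8 * freq) for k in range(80)]

vs_locked = vector_strength(locked, freq)
vs_spread = vector_strength(spread, freq)
```

The first spike train gives a vector strength of 1 and the second a value of 0, up to rounding error. Real auditory-nerve data fall between these extremes, with values decreasing as stimulus frequency approaches the upper limit of phase locking.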
TFSn information in the auditory nerve may be used to encode the (audio) frequency of low-frequency sounds, including single tones and more complex stimuli such as frequency-modulated tones or steady-state vowels (see role and applications to speech and music).
The auditory system goes to some length to preserve this TFSn information with the presence of giant synapses (endbulbs of Held) in the ventral cochlear nucleus. These synapses contact bushy cells (spherical and globular) and faithfully transmit (or enhance) the temporal information present in the auditory nerve fibers to higher structures in the brainstem. [73] The bushy cells project to the medial superior olive and the globular cells project to the medial nucleus of the trapezoid body (MNTB). The MNTB is also characterized by giant synapses (calyces of Held) and provides precisely timed inhibition to the lateral superior olive. The medial and lateral superior olive and MNTB are involved in the encoding of interaural time and intensity differences. There is general acceptance that the temporal information is crucial in sound localization, but it is still contentious as to whether the same temporal information is used to encode the frequency of complex sounds.
Several problems remain with the idea that the TFSn is important in the representation of the frequency components of complex sounds. The first problem is that the temporal information deteriorates as it passes through successive stages of the auditory pathway (presumably due to the low pass dendritic filtering). Therefore, the second problem is that the temporal information must be extracted at an early stage of the auditory pathway. No such stage has currently been identified although there are theories about how temporal information can be converted into rate information (see section Models of normal processing: Limitations).
It is often assumed that many perceptual capacities rely on the ability of the monaural and binaural auditory system to encode and use TFSn cues evoked by components in sounds with frequencies below about 1–4 kHz. These capacities include discrimination of frequency, [74] [4] [75] [76] discrimination of the fundamental frequency of harmonic sounds, [75] [4] [76] detection of FM at rates below 5 Hz, [77] melody recognition for sequences of pure tones and complex tones, [74] [4] lateralization and localization of pure tones and complex tones, [78] and segregation of concurrent harmonic sounds (such as speech sounds). [79] It appears that TFSn cues require correct tonotopic (place) representation to be processed optimally by the auditory system. [80] Moreover, musical pitch perception has been demonstrated for complex tones with all harmonics above 6 kHz, demonstrating that it is not entirely dependent on neural phase locking to TFSBM (i.e., TFSn) cues. [81]
As for FM detection, the current view assumes that in the normal auditory system, FM is encoded via TFSn cues when the FM rate is low (<5 Hz) and when the carrier frequency is below about 4 kHz, [77] [82] [83] [84] and via ENVn cues when the FM is fast or when the carrier frequency is higher than 4 kHz. [77] [85] [86] [87] [84] This is supported by single-unit recordings in the low brainstem. [73] According to this view, TFSn cues are not used to detect FM with rates above about 10 Hz because the mechanism decoding the TFSn information is “sluggish” and cannot track rapid changes in frequency. [77] Several studies have shown that auditory sensitivity to slow FM at low carrier frequency is associated with speech identification for both normal-hearing and hearing-impaired individuals when speech reception is limited by acoustic degradations (e.g., filtering) or concurrent speech sounds. [88] [89] [90] [91] [92] This suggests that robust speech intelligibility is determined by accurate processing of TFSn cues.
The separation of a sound into ENVp and TFSp appears inspired partly by how sounds are synthesized and by the availability of a convenient way to separate an existing sound into ENV and TFS, namely the Hilbert transform. There is a risk that this view of auditory processing [93] is dominated by these physical/technical concepts, similarly to how cochlear frequency-to-place mapping was for a long time conceptualized in terms of the Fourier transform. Physiologically, there is no indication of a separation of ENV and TFS in the auditory system for stages up to the cochlear nucleus. Only at that stage does it appear that parallel pathways, potentially enhancing ENVn or TFSn information (or something akin to it), may be implemented through the temporal response characteristics of different cochlear nucleus cell types. [73] It may therefore be useful to better simulate cochlear nucleus cell types to understand the true concepts for parallel processing created at the level of the cochlear nucleus. These concepts may be related to separating ENV and TFS but are unlikely to be realized in the same way as the Hilbert transform.
A computational model of the peripheral auditory system [94] [95] may be used to simulate auditory-nerve fiber responses to complex sounds such as speech, and quantify the transmission (i.e., internal representation) of ENVn and TFSn cues. In two simulation studies, [96] [97] the mean-rate and spike-timing information was quantified at the output of such a model to characterize, respectively, the short-term rate of neural firing (ENVn) and the level of synchronization due to phase locking (TFSn) in response to speech sounds degraded by vocoders. [98] [99] The best model predictions of vocoded-speech intelligibility were found when both ENVn and TFSn cues were included, providing evidence that TFSn cues are important for intelligibility when the speech ENVp cues are degraded.
At a more fundamental level, similar computational modeling was used to demonstrate that the functional dependence of human just-noticeable frequency differences on pure-tone frequency was not accounted for unless temporal information was included (most notably for mid-to-high frequencies, even above the nominal cutoff in physiological phase locking). [100] [101] However, a caveat of most TFS models is that optimal model performance with temporal information typically over-estimates human performance.
An alternative view is to assume that TFSn information at the level of the auditory nerve is converted into rate-place (ENVn) information at a later stage of the auditory system (e.g., the low brainstem). Several modelling studies proposed that the neural mechanisms for decoding TFSn are based on correlation of the outputs of adjacent places. [102] [103] [104] [105] [106]
The ENVp plays a critical role in many aspects of auditory perception, including in the perception of speech and music. [2] [7] [108] [109] Speech recognition is possible using cues related to the ENVp, even in situations where the original spectral information and TFSp are highly degraded. [110] Indeed, when the spectrally local TFSp from one sentence is combined with the ENVp from a second sentence, only the words of the second sentence are heard. [111] The ENVp rates most important for speech are those below about 16 Hz, corresponding to fluctuations at the rate of syllables. [112] [107] [113] On the other hand, the fundamental frequency (“pitch”) contour of speech sounds is primarily conveyed via TFSp cues, [107] although some information on the contour can be perceived via rapid envelope fluctuations corresponding to the fundamental frequency. [2] For music, slow ENVp rates convey rhythm and tempo information, whereas more rapid rates convey the onset and offset properties of sound (attack and decay, respectively) that are important for timbre perception. [114]
The ability to accurately process TFSp information is thought to play a role in our perception of pitch (i.e., the perceived height of sounds), an important sensation for music perception, as well as our ability to understand speech, especially in the presence of background noise. [4]
Although pitch retrieval mechanisms in the auditory system are still a matter of debate, [76] [115] TFSn information may be used to retrieve the pitch of low-frequency pure tones [75] and estimate the individual frequencies of the low-numbered (ca. 1st-8th) harmonics of a complex sound, [116] frequencies from which the fundamental frequency of the sound can be retrieved according to, e.g., pattern-matching models of pitch perception. [117] A role of TFSn information in pitch perception of complex sounds containing intermediate harmonics (ca. 7th-16th) has also been suggested [118] and may be accounted for by temporal or spectrotemporal [119] models of pitch perception. The degraded TFSn cues conveyed by cochlear implant devices may also be partly responsible for impaired music perception of cochlear implant recipients. [120]
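The idea of retrieving a fundamental frequency from the estimated frequencies of individual harmonics can be illustrated with a toy "harmonic sieve". This is a deliberately simplified sketch and does not reproduce any of the cited pattern-matching models. The candidate range, the error measure and the tie-breaking rule (prefer the highest candidate that fits equally well) are all assumptions.

```python
def estimate_f0(component_freqs_hz, f0_min=50, f0_max=500):
    """Return the integer candidate f0 (Hz) whose harmonics best match the components.

    Candidates are scanned from high to low and replaced only on a strict
    improvement, so among equally good fits the highest f0 is kept
    (this avoids choosing a subharmonic such as f0 / 2).
    """
    best_f0, best_err = None, float("inf")
    for f0 in range(f0_max, f0_min - 1, -1):
        err = 0.0
        for f in component_freqs_hz:
            harmonic_number = f / f0
            err += abs(harmonic_number - round(harmonic_number))
        if err < best_err - 1e-9:
            best_f0, best_err = f0, err
    return best_f0


# Harmonics 3, 4 and 5 of 200 Hz: the fundamental itself is absent from the input.
f0_missing_fundamental = estimate_f0([600.0, 800.0, 1000.0])

# Harmonics 1, 2 and 3 of 440 Hz.
f0_with_fundamental = estimate_f0([440.0, 880.0, 1320.0])
```

In the first case the sketch returns 200 Hz even though no component is present at that frequency, which mirrors the way a fundamental can be derived from resolved harmonics alone.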
TFSp cues are thought to be important for the identification of speakers and for tone identification in tonal languages. [121] In addition, several vocoder studies have suggested that TFSp cues contribute to the intelligibility of speech in quiet and noise. [98] Although it is difficult to isolate TFSp from ENVp cues, [109] [122] there is evidence from studies in hearing-impaired listeners that speech perception in the presence of background noise can be partly accounted for by the ability to accurately process TFSp, [92] [99] although the ability to “listen in the dips” of fluctuating maskers does not seem to depend on periodic TFSp cues. [123]
Environmental sounds can be broadly defined as nonspeech and nonmusical sounds in the listener's environment that can convey meaningful information about surrounding objects and events. [124] Environmental sounds are highly heterogeneous in terms of their acoustic characteristics and source types, and may include human and animal vocalizations, water- and weather-related events, and mechanical and electronic signaling sounds. Given the great variety of sound sources that give rise to environmental sounds, both ENVp and TFSp play an important role in their perception. However, the relative contributions of ENVp and TFSp can differ considerably for specific environmental sounds. This is reflected in the variety of acoustic measures that correlate with different perceptual characteristics of objects and events. [125] [126] [127]
Early studies highlighted the importance of envelope-based temporal patterning in perception of environmental events. For instance, Warren & Verbrugge demonstrated that constructed sounds of a glass bottle dropped on the floor were perceived as bouncing when high-energy regions in four different frequency bands were temporally aligned, producing amplitude peaks in the envelope. [128] In contrast, when the same spectral energy was distributed randomly across bands, the sounds were heard as breaking. More recent studies using vocoder simulations of cochlear implant processing demonstrated that many temporally patterned sounds can be perceived with little original spectral information, based primarily on temporal cues. [126] [127] Sounds such as footsteps, horse galloping, helicopter flying, ping-pong playing, clapping and typing were identified with a high accuracy of 70% or more with a single channel of envelope-modulated broadband noise or with only two frequency channels. In these studies, envelope-based acoustic measures such as number of bursts and peaks in the envelope were predictive of listeners’ abilities to identify sounds based primarily on ENVp cues. On the other hand, identification of brief environmental sounds without strong temporal patterning in ENVp may require a much larger number of frequency channels. Sounds such as a car horn or a train whistle were poorly identified even with as many as 32 frequency channels. [126] Listeners with cochlear implants, which transmit envelope information for specific frequency bands but do not transmit TFSp, have considerably reduced abilities in identification of common environmental sounds. [129] [130] [131]
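A single channel of envelope-modulated broadband noise, as used in such vocoder simulations, can be sketched as follows: the envelope of the input is extracted and imposed on a noise carrier, which discards the original TFSp and spectral detail while keeping the temporal pattern. The example is a minimal illustration and does not reproduce the processing of the cited studies. The envelope extractor (full-wave rectification and a one-pole 50 Hz lowpass filter), the uniform noise carrier, the burst-and-gap test signal and the function names are assumptions.

```python
import math
import random


def extract_envelope(x, fs, cutoff_hz=50.0):
    """Full-wave rectification followed by a one-pole lowpass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, state = [], 0.0
    for v in x:
        state += alpha * (abs(v) - state)
        env.append(state)
    return env


def single_channel_noise_vocoder(x, fs, seed=0):
    """Impose the envelope of x on a broadband (white, uniform) noise carrier."""
    rng = random.Random(seed)  # seeded so that the example is reproducible
    return [e * rng.uniform(-1.0, 1.0) for e in extract_envelope(x, fs)]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


# Hypothetical temporally patterned input: three 100 ms tone bursts separated by 100 ms gaps.
fs = 8000
burst = [math.sin(2 * math.pi * 440.0 * k / fs) for k in range(800)]
gap = [0.0] * 800
signal = (burst + gap) * 3

vocoded = single_channel_noise_vocoder(signal, fs)
burst_rms = rms(vocoded[400:800])   # second half of the first burst
gap_rms = rms(vocoded[1200:1600])   # second half of the first gap
```

The output is noise-like within each burst but keeps the on/off temporal pattern of the input, which is the kind of cue that remains available for identifying strongly patterned sounds.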
In addition, individual environmental sounds are typically heard within the context of larger auditory scenes where sounds from multiple sources may overlap in time and frequency. When heard within an auditory scene, accurate identification of individual environmental sounds is contingent on the ability to segregate them from other sound sources or auditory streams in the auditory scene, which involves further reliance on ENVp and TFSp cues (see Role in auditory scene analysis).
Auditory scene analysis refers to the ability to perceive separately sounds coming from different sources. Any acoustical difference can potentially lead to auditory segregation, [132] and so any cues based either on ENVp or TFSp are likely to assist in segregating competing sound sources. [133] Such cues involve percepts such as pitch. [134] [135] [136] [137] Binaural TFSp cues producing interaural time differences have not always resulted in clear source segregation, particularly with simultaneously presented sources, although successful segregation of sequential sounds, such as noise or speech, has been reported. [138]
In infancy, behavioral AM detection thresholds [139] and forward or backward masking thresholds [139] [140] [141] observed in 3-month-olds are similar to those observed in adults. Electrophysiological studies conducted in 1-month-old infants using 2000 Hz AM pure tones indicate some immaturity in envelope following response (EFR). Although sleeping infants and sedated adults show the same effect of modulation rate on EFR, infants’ estimates were generally poorer than adults’. [142] [143] This is consistent with behavioral studies conducted with school-age children showing differences in AM detection thresholds compared to adults. Children systematically show worse AM detection thresholds than adults until 10–11 years. However, the shape of the TMTF (the cutoff) is similar to adults’ for younger children of 5 years. [144] [145] Sensory versus non-sensory factors for this long maturation are still debated, [146] but the results generally appear to be more dependent on the task or on sound complexity for infants and children than for adults. [147] Regarding the development of speech ENVp processing, vocoder studies suggest that infants as young as 3 months are able to discriminate a change in consonants when the faster ENVp information of the syllables is preserved (< 256 Hz) but less so when only the slowest ENVp is available (< 8 Hz). [148] Older children of 5 years show abilities similar to those of adults in discriminating consonant changes based on ENVp cues (< 64 Hz). [149]
The effects of hearing loss and age on neural coding are generally believed to be smaller for slowly varying envelope responses (i.e., ENVn) than for rapidly varying temporal fine structure (i.e., TFSn). [150] [151] Enhanced ENVn coding following noise-induced hearing loss has been observed in peripheral auditory responses from single neurons [152] and in central evoked responses from the auditory midbrain. [153] The enhancement in ENVn coding of narrowband sounds occurs across the full range of modulation frequencies encoded by single neurons. [154] For broadband sounds, the range of modulation frequencies encoded in impaired responses is broader than normal (extending to higher frequencies), as expected from reduced frequency selectivity associated with outer-hair-cell dysfunction. [155] The enhancement observed in neural envelope responses is consistent with enhanced auditory perception of modulations following cochlear damage, which is commonly believed to result from loss of cochlear compression that occurs with outer-hair-cell dysfunction due to age or noise overexposure. [156] However, the influence of inner-hair-cell dysfunction (e.g., shallower response growth for mild-moderate damage and steeper growth for severe damage) can confound the effects of outer-hair-cell dysfunction on overall response growth and thus ENVn coding. [152] [157] Thus, not surprisingly the relative effects of outer-hair-cell and inner-hair-cell dysfunction have been predicted with modeling to create individual differences in speech intelligibility based on the strength of envelope coding of speech relative to noise.
For sinusoidal carriers, which have no intrinsic envelope (ENVp) fluctuations, the TMTF is roughly flat for AM rates from 10 to 120 Hz, but increases (i.e., threshold worsens) for higher AM rates, [51] [158] provided that spectral sidebands are not audible. The shape of the TMTF for sinusoidal carriers is similar for young and older people with normal audiometric thresholds, but older people tend to have higher detection thresholds overall, suggesting poorer “detection efficiency” for ENVn cues in older people. [159] [160] Provided that the carrier is fully audible, the ability to detect AM is usually not adversely affected by cochlear hearing loss and may sometimes be better than normal, for both noise carriers [161] [162] and sinusoidal carriers, [158] [163] perhaps because loudness recruitment (an abnormally rapid growth of loudness with increasing sound level) “magnifies” the perceived amount of AM (i.e., ENVn cues). Consistent with this, when the AM is clearly audible, a sound with a fixed AM depth appears to fluctuate more for an impaired ear than for a normal ear. However, the ability to detect changes in AM depth can be impaired by cochlear hearing loss. [163] Speech processed with a noise vocoder, such that mainly envelope information is delivered in multiple spectral channels, has also been used to investigate envelope processing in hearing impairment. Here, hearing-impaired individuals could not make use of such envelope information as well as normal-hearing individuals, even after audibility factors were taken into account. [164] Additional experiments suggest that age negatively affects the binaural processing of ENVp, at least at low audio frequencies. [165]
The perception model of ENV processing [63] that incorporates selective (bandpass) AM filters accounts for many perceptual consequences of cochlear dysfunction including enhanced sensitivity to AM for sinusoidal and noise carriers, [166] [167] abnormal forward masking (the rate of recovery from forward masking being generally slower than normal for impaired listeners), [168] stronger interference effects between AM and FM [82] and enhanced temporal integration of AM. [167] The model of Torsten Dau [63] has been extended to account for the discrimination of complex AM patterns by hearing-impaired individuals and the effects of noise-reduction systems. [169] The performance of the hearing-impaired individuals was best captured when the model combined the loss of peripheral amplitude compression resulting from the loss of the active mechanism in the cochlea [166] [167] [168] with an increase in internal noise in the ENVn domain. [166] [167] [82] Phenomenological models simulating the response of the peripheral auditory system showed that impaired AM sensitivity in individuals experiencing chronic tinnitus with clinically normal audiograms could be predicted by substantial loss of auditory-nerve fibers with low spontaneous rates and some loss of auditory-nerve fibers with high-spontaneous rates. [170]
Very few studies have systematically assessed TFS processing in infants and children. The frequency-following response (FFR), thought to reflect phase-locked neural activity, appears to be adult-like in 1-month-old infants when using a pure tone (centered at 500, 1000 or 2000 Hz) modulated at 80 Hz with 100% modulation depth. [142]
As for behavioral data, six-month-old infants require larger frequency transitions to detect an FM change in a 1-kHz tone compared to adults. [171] However, 4-month-old infants are able to discriminate two different FM sweeps, [172] and they are more sensitive to FM cues swept from 150 Hz to 550 Hz than at lower frequencies. [173] In school-age children, performance in detecting FM change improves between 6 and 10 years and sensitivity to low modulation rate (2 Hz) is poor until 9 years. [174]
For speech sounds, only one vocoder study has explored the ability of school-age children to rely on TFSp cues to detect consonant changes, showing that 5-year-olds perform as well as adults. [149]
Psychophysical studies have suggested that degraded TFS processing due to age and hearing loss may underlie some suprathreshold deficits, such as impaired speech perception; [10] however, debate remains about the underlying neural correlates. [150] [151] The strength of phase locking to the temporal fine structure of signals (TFSn) in quiet listening conditions remains normal in peripheral single-neuron responses following cochlear hearing loss. [152] Although these data suggest that the fundamental ability of auditory-nerve fibers to follow the rapid fluctuations of sound remains intact following cochlear hearing loss, deficits in phase-locking strength do emerge in background noise. [175] This finding, which is consistent with the common observation that listeners with cochlear hearing loss have more difficulty in noisy conditions, results from reduced cochlear frequency selectivity associated with outer-hair-cell dysfunction. [156] Although only limited effects of age and hearing loss have been observed in terms of TFSn coding strength of narrowband sounds, more dramatic deficits have been observed in TFSn coding quality in response to broadband sounds, which are more relevant for everyday listening. A dramatic loss of tonotopicity can occur following noise-induced hearing loss, where auditory-nerve fibers that should be responding to mid frequencies (e.g., 2–4 kHz) have dominant TFS responses to lower frequencies (e.g., 700 Hz). [176] Notably, the loss of tonotopicity generally occurs only for TFSn coding but not for ENVn coding, which is consistent with greater perceptual deficits in TFS processing. [10] This tonotopic degradation is likely to have important implications for speech perception, and can account for degraded coding of vowels following noise-induced hearing loss in which most of the cochlea responds to only the first formant, eliminating the normal tonotopic representation of the second and third formants.
Several psychophysical studies have shown that older people with normal hearing and people with sensorineural hearing loss often show impaired performance for auditory tasks that are assumed to rely on the ability of the monaural and binaural auditory system to encode and use TFSn cues, such as: discrimination of sound frequency, [76] [177] [178] discrimination of the fundamental frequency of harmonic sounds, [76] [177] [178] [179] detection of FM at rates below 5 Hz, [180] [181] [91] melody recognition for sequences of pure tones and complex sounds, [182] lateralization and localization of pure tones and complex tones, [78] [183] [165] and segregation of concurrent harmonic sounds (such as speech sounds). [79] However, it remains unclear to what extent deficits associated with hearing loss reflect poorer TFSn processing or reduced cochlear frequency selectivity. [182]
The quality of the representation of a sound in the auditory nerve is limited by refractoriness, adaptation, saturation, and reduced synchronization (phase locking) at high frequencies, as well as by the stochastic nature of action potentials. [184] However, the auditory nerve contains thousands of fibers. Hence, despite these limiting factors, the properties of sounds are reasonably well represented in the population nerve response over a wide range of levels [185] and audio frequencies (see Volley Theory).
The coding of temporal information in the auditory nerve can be disrupted by two main mechanisms: reduced synchrony and loss of synapses and/or auditory nerve fibers. [186] The impact of disrupted temporal coding on human auditory perception has been explored using physiologically inspired signal-processing tools. The reduction in neural synchrony has been simulated by jittering the phases of the multiple frequency components in speech, [187] although this has undesired effects in the spectral domain. The loss of auditory nerve fibers or synapses has been simulated by assuming (i) that each afferent fiber operates as a stochastic sampler of the sound waveform, with greater probability of firing for higher-intensity and sustained sound features than for lower-intensity or transient features, and (ii) that deafferentation can be modeled by reducing the number of samplers. [184] However, this also has undesired effects in the spectral domain. Both jittering and stochastic undersampling degrade the representation of the TFSn more than the representation of the ENVn. Both also impair the recognition of speech in noisy backgrounds without degrading recognition in silence, supporting the argument that TFSn is important for recognizing speech in noise, [3] and both mimic the effects of aging on speech perception. [188]
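The idea of modeling deafferentation as a reduction in the number of stochastic samplers can be sketched loosely as follows. This is not the published algorithm: the firing rule (probability proportional to instantaneous magnitude), the averaging across samplers and the function name are all simplifying assumptions, and the preference for sustained over transient features is ignored. The sketch only shows that the aggregate output becomes noisier as samplers are removed.

```python
import numpy as np

def stochastic_undersample(x, n_samplers, rng):
    """Hypothetical sampler pool: each sampler keeps a sample with probability |x|/max|x|."""
    p_fire = np.abs(x) / np.max(np.abs(x))               # more likely to fire at high intensity
    fired = rng.random((n_samplers, len(x))) < p_fire    # one row of Bernoulli draws per sampler
    return fired.mean(axis=0) * x                        # aggregate across the pool

def rms(v):
    return float(np.sqrt(np.mean(v ** 2)))

rng = np.random.default_rng(0)
fs = 16000                                  # assumed sample rate
t = np.arange(fs // 10) / fs
x = np.sin(2 * np.pi * 500.0 * t)           # assumed 500-Hz test tone

# Output expected from an infinitely large pool of samplers.
ideal = np.abs(x) / np.max(np.abs(x)) * x

err_many = rms(stochastic_undersample(x, 500, rng) - ideal)   # "healthy" pool
err_few = rms(stochastic_undersample(x, 5, rng) - ideal)      # "deafferented" pool
print(err_many, err_few)
```

With few samplers, the sample-to-sample noise is large relative to the waveform, which gives an intuition for why rapid fluctuations suffer more than slow ones in this kind of simulation.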
Individuals with cochlear hearing loss usually have a smaller than normal dynamic range between the level of the weakest detectable sound and the level at which sounds become uncomfortably loud. [189] [190] To compress the large range of sound levels encountered in everyday life into the small dynamic range of the hearing-impaired person, hearing aids apply amplitude compression, which is also called automatic gain control (AGC). The basic principle of such compression is that the amount of amplification applied to the incoming sound progressively decreases as the input level increases. Usually, the sound is split into several frequency “channels”, and AGC is applied independently in each channel. As a result of compressing the level, AGC reduces the amount of envelope fluctuation in the input signal (ENVp) by an amount that depends on the rate of fluctuation and the speed with which the amplification changes in response to changes in input sound level. [191] [192] AGC can also change the shape of the envelope of the signal. [193] Cochlear implants are devices that electrically stimulate the auditory nerve, thereby creating the sensation of sound in a person who would otherwise be profoundly or totally deaf. The electrical dynamic range is very small, [194] so cochlear implants usually incorporate AGC prior to the signal being filtered into multiple frequency channels. [195] The channel signals are then subjected to instantaneous compression to map them into the limited dynamic range for each channel. [196]
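The dependence of envelope flattening on compression speed can be illustrated with a minimal single-channel compressor. This is a sketch under stated assumptions, not the processing of any particular hearing aid or implant: the compression ratio, threshold, time constants, test signal and function names are all illustrative. The modulation depth of a 4-Hz SAM tone is estimated before and after compression, for a fast-acting and a slow-acting setting.

```python
import math
import numpy as np

def agc(x, fs, ratio=3.0, threshold_db=-30.0, attack_ms=2.0, release_ms=20.0):
    """Illustrative feed-forward compressor; all parameter values are assumptions."""
    a_att = math.exp(-1000.0 / (fs * attack_ms))
    a_rel = math.exp(-1000.0 / (fs * release_ms))
    level = 0.0
    y = np.empty(len(x))
    for n, s in enumerate(x):
        mag = abs(float(s))
        a = a_att if mag > level else a_rel
        level = a * level + (1.0 - a) * mag            # smoothed level estimate
        level_db = 20.0 * math.log10(max(level, 1e-6))
        excess_db = max(level_db - threshold_db, 0.0)
        gain_db = -excess_db * (1.0 - 1.0 / ratio)     # gain falls as input level rises
        y[n] = s * 10.0 ** (gain_db / 20.0)
    return y

def modulation_depth(x, fs, fm):
    """Depth of the fm component in the rectified signal, relative to its mean."""
    t = np.arange(len(x)) / fs
    mag = np.abs(x)
    return 2.0 * abs(np.mean(mag * np.exp(-2j * np.pi * fm * t))) / np.mean(mag)

fs, fm = 16000, 4.0
t = np.arange(2 * fs) / fs
x = (1.0 + 0.5 * np.sin(2 * np.pi * fm * t)) * np.sin(2 * np.pi * 1000.0 * t)

tail = slice(fs, None)   # analyse the last second, after the gain has started to settle
m_in = modulation_depth(x[tail], fs, fm)
m_fast = modulation_depth(agc(x, fs, attack_ms=1.0, release_ms=10.0)[tail], fs, fm)
m_slow = modulation_depth(agc(x, fs, attack_ms=500.0, release_ms=2000.0)[tail], fs, fm)
print(m_in, m_fast, m_slow)
```

In this sketch the fast-acting setting follows the 4-Hz fluctuation and reduces its depth, whereas the slow-acting setting leaves the depth close to that of the input.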
Cochlear implants differ from hearing aids in that acoustic hearing is entirely replaced with direct electric stimulation of the auditory nerve, achieved via an electrode array placed inside the cochlea. Hence, factors other than device signal processing also contribute strongly to overall hearing, such as etiology, nerve health, electrode configuration and proximity to the nerve, and the overall process of adaptation to an entirely new mode of hearing. [197] [198] [199] [200] Almost all information in cochlear implants is conveyed by the envelope fluctuations in the different channels. This is sufficient to give reasonable perception of speech in quiet, but not in noisy or reverberant conditions. [201] [202] [203] [204] [121] [110] [205] [206] [207] [208] The processing in cochlear implants is such that the TFSp is discarded in favor of fixed-rate pulse trains amplitude-modulated by the ENVp within each frequency band. Implant users are sensitive to these ENVp modulations, but performance varies across stimulation site, stimulation level, and across individuals. [209] [210] The TMTF shows a low-pass filter shape similar to that observed in normal-hearing listeners. [210] [211] [212] Voice pitch or musical pitch information, conveyed primarily via weak periodicity cues in the ENVp, results in a pitch sensation that is not salient enough to support music perception, [213] [214] talker sex identification, [215] [216] lexical tones, [217] [218] or prosodic cues. [219] [220] [221] Listeners with cochlear implants are susceptible to interference in the modulation domain, [222] [223] which likely contributes to difficulties listening in noise.
Hearing aids usually process sounds by filtering them into multiple frequency channels and applying AGC in each channel. Other signal processing in hearing aids, such as noise reduction, also involves filtering the input into multiple channels. [224] The filtering into channels can affect the TFSp of sounds depending on characteristics such as the phase response and group delay of the filters. However, such effects are usually small. Cochlear implants also filter the input signal into frequency channels. Usually, the ENVp of the signal in each channel is transmitted to the implanted electrodes in the form of electrical pulses of fixed rate that are modulated in amplitude or duration. Information about TFSp is discarded. This is justified by the observation that people with cochlear implants have a very limited ability to process TFSp information, even if it is transmitted to the electrodes, [225] perhaps because of a mismatch between the temporal information and the place in the cochlea to which it is delivered. [76] Reducing this mismatch may improve the ability to use TFSp information and hence lead to better pitch perception. [226] Some cochlear implant systems transmit information about TFSp in the channels of the cochlear implants that are tuned to low audio frequencies, and this may improve the pitch perception of low-frequency sounds. [227]
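The envelope-only strategy described above can be sketched as follows. The input is split into frequency channels, the ENVp of each channel is taken as the magnitude of the band-limited analytic signal, and the envelope is then read out at a fixed pulse rate so that the timing of the pulses carries no TFSp information. The brick-wall FFT filterbank, the channel edges, the pulse rate and the function names are assumptions made for brevity and do not correspond to any particular device.

```python
import numpy as np

def band_envelopes(x, fs, edges):
    """ENVp per channel from band-limited analytic signals (hypothetical FFT filterbank)."""
    spec = np.fft.fft(x)
    freqs = np.fft.fftfreq(len(x), d=1.0 / fs)
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)          # keep positive frequencies in the band
        analytic = np.fft.ifft(2.0 * spec * mask)    # band-limited analytic signal
        envs.append(np.abs(analytic))                # magnitude = envelope; phase (TFSp) dropped
    return np.array(envs)

def pulse_amplitudes(env, fs, pulse_rate):
    """Sample the envelope at a fixed pulse rate; pulse timing ignores the carrier."""
    step = int(round(fs / pulse_rate))
    return env[::step]

fs = 16000
t = np.arange(fs) / fs
# Assumed test input: a 1-kHz tone with 10-Hz AM plus a steady 3-kHz tone.
x = (1.0 + 0.8 * np.sin(2 * np.pi * 10.0 * t)) * np.sin(2 * np.pi * 1000.0 * t)
x = x + 0.5 * np.sin(2 * np.pi * 3000.0 * t)

envs = band_envelopes(x, fs, edges=[500.0, 2000.0, 4000.0])
pulses = pulse_amplitudes(envs[0], fs, pulse_rate=1000.0)
print(envs.shape, pulses.shape)
```

The first channel recovers the 10-Hz envelope and the second a constant envelope; in both cases the carrier frequency itself is no longer represented in the pulse amplitudes.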
Perceptual learning resulting from training has been reported for various auditory AM detection or discrimination tasks, [228] [229] [230] suggesting that the responses of central auditory neurons to ENVp cues are plastic and that practice may modify the circuitry of ENVn processing. [230] [231]
The plasticity of ENVn processing has been demonstrated in several ways. For instance, the ability of auditory-cortex neurons to discriminate voice-onset time cues for phonemes is degraded following moderate hearing loss (20-40 dB HL) induced by acoustic trauma. [232] Developmental hearing loss reduces cortical responses to slow, but not fast (100 Hz) AM stimuli, in parallel with behavioral performance. [233] Indeed, a transient hearing loss (15 days) occurring during the "critical period" is sufficient to elevate AM thresholds in adult gerbils. [234] Even non-traumatic noise exposure reduces the phase-locking ability of cortical neurons as well as the animals' behavioral capacity to discriminate between different AM sounds. [235] Behavioral training or pairing protocols involving neuromodulators also alter the ability of cortical neurons to phase lock to AM sounds. [236] [237] In humans, hearing loss may result in an unbalanced representation of speech cues: ENVn cues are enhanced at the cost of TFSn cues (see: Effects of age and hearing loss on temporal envelope processing). Auditory training may reduce the representation of speech ENVn cues for elderly listeners with hearing loss, who may then reach levels comparable to those observed for normal-hearing elderly listeners. [238] Finally, intensive musical training induces both behavioral effects, such as higher sensitivity to pitch variations (for Mandarin linguistic pitch), and a better synchronization of brainstem responses to the f0-contour of lexical tones for musicians compared with non-musicians. [239]
Fast, easy-to-administer psychophysical tests have been developed to assist clinicians in the screening of TFS-processing abilities and diagnosis of suprathreshold temporal auditory processing deficits associated with cochlear damage and ageing. These tests may also be useful for audiologists and hearing-aid manufacturers to explain and/or predict the outcome of hearing-aid fitting in terms of perceived quality, speech intelligibility or spatial hearing. [240] [241] These tests may eventually be used to recommend the most appropriate compression speed in hearing aids [242] or the use of directional microphones. The need for such tests is corroborated by strong correlations between slow-FM or spectro-temporal modulation detection thresholds and aided speech intelligibility in competing backgrounds for hearing-impaired persons. [90] [243] Clinical tests can be divided into two groups: those assessing monaural TFS processing capacities (TFS1 test) and those assessing binaural capacities (binaural pitch, TFS-LF, TFS-AF).
TFS1: this test assesses the ability to discriminate between a harmonic complex tone and its frequency-transposed (and thus, inharmonic) version. [244] [245] [246] [159]
Binaural pitch: these tests evaluate the ability to detect and discriminate binaural pitch, and melody recognition using different types of binaural pitch. [182] [247]
TFS-LF: this test assesses the ability to discriminate low-frequency pure tones that are identical at the two ears from the same tones differing in interaural phase. [248] [249]
TFS-AF: this test assesses the highest audio frequency of a pure tone up to which a change in interaural phase can be discriminated. [250]
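The logic of the monaural test can be illustrated with a simplified sketch of its two stimuli. A harmonic complex tone and a version with every component shifted upward by the same number of hertz have the same physical envelope, so any ability to tell them apart must rest on fine-structure differences (or on their spectral consequences). The fundamental frequency, component numbers, shift and function names below are assumptions; the published test includes further controls that are omitted here.

```python
import numpy as np

def complex_tone(f0, harmonic_numbers, shift_hz, fs=16000, dur=1.0):
    """Sum of equal-amplitude sine-phase components at k*f0 + shift_hz (illustrative)."""
    t = np.arange(int(fs * dur)) / fs
    freqs = np.array([k * f0 + shift_hz for k in harmonic_numbers])
    return freqs, np.sum(np.sin(2 * np.pi * freqs[:, None] * t), axis=0)

def hilbert_envelope(x):
    """Envelope as the magnitude of the analytic signal, computed with the FFT."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    weights[1:(n + 1) // 2] = 2.0      # double the positive frequencies
    if n % 2 == 0:
        weights[n // 2] = 1.0          # Nyquist bin for even lengths
    return np.abs(np.fft.ifft(np.fft.fft(x) * weights))

f0 = 100.0                  # assumed fundamental frequency
components = range(9, 14)   # assumed component numbers
freqs_h, tone_h = complex_tone(f0, components, shift_hz=0.0)    # harmonic tone
freqs_i, tone_i = complex_tone(f0, components, shift_hz=25.0)   # shifted, inharmonic tone

same_envelope = np.allclose(hilbert_envelope(tone_h), hilbert_envelope(tone_i), atol=1e-6)
same_waveform = np.allclose(tone_h, tone_i)
print(freqs_h, freqs_i, same_envelope, same_waveform)
```

The component spacing, and hence the envelope repetition rate, is unchanged by the shift, while the component frequencies are no longer integer multiples of the fundamental.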
Signal distortion, additive noise, reverberation, and audio processing strategies such as noise suppression and dynamic-range compression can all impact speech intelligibility and speech and music quality. [251] [252] [253] [254] [255] These changes in the perception of the signal can often be predicted by measuring the associated changes in the signal envelope and/or temporal fine structure (TFS). Objective measures of the signal changes, when combined with procedures that associate the signal changes with differences in auditory perception, give rise to auditory performance metrics for predicting speech intelligibility and speech quality.
Changes in the TFS can be estimated by passing the signals through a filterbank and computing the coherence [256] between the system input and output in each band. Intelligibility predicted from the coherence is accurate for some forms of additive noise and nonlinear distortion, [251] [255] but works poorly for ideal binary mask (IBM) noise suppression. [253] Speech and music quality for signals subjected to noise and clipping distortion have also been modeled using the coherence [257] or using the coherence averaged across short signal segments. [258]
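A coherence computation of this kind can be sketched with a Welch-style estimate of the magnitude-squared coherence, |Pxy|^2 / (Pxx Pyy), followed by averaging within a frequency band. This is a minimal illustration and not the procedure of any cited metric: the segment length, the Hann window, the band edges, the test signals and the function names are assumptions.

```python
import numpy as np

def msc(x, y, nseg=256):
    """Magnitude-squared coherence from Hann-windowed, 50%-overlapping segments."""
    win = np.hanning(nseg)
    starts = range(0, len(x) - nseg + 1, nseg // 2)
    X = np.array([np.fft.rfft(win * x[s:s + nseg]) for s in starts])
    Y = np.array([np.fft.rfft(win * y[s:s + nseg]) for s in starts])
    pxy = np.mean(np.conj(X) * Y, axis=0)     # averaged cross-spectrum
    pxx = np.mean(np.abs(X) ** 2, axis=0)     # averaged input power spectrum
    pyy = np.mean(np.abs(Y) ** 2, axis=0)     # averaged output power spectrum
    return np.abs(pxy) ** 2 / (pxx * pyy)

def band_average(values, freqs, lo, hi):
    """Mean of a per-bin quantity inside one analysis band."""
    sel = (freqs >= lo) & (freqs < hi)
    return float(np.mean(values[sel]))

rng = np.random.default_rng(1)
fs = 16000                                    # assumed sample rate
x = rng.standard_normal(fs)                   # stand-in for the system input
freqs = np.fft.rfftfreq(256, d=1.0 / fs)

c_linear = msc(x, 0.5 * x)                            # pure gain change
c_noisy = msc(x, x + rng.standard_normal(fs))         # additive noise at 0 dB SNR
c_clipped = msc(x, np.clip(x, -0.5, 0.5))             # peak clipping

print(band_average(c_linear, freqs, 1000.0, 2000.0),
      band_average(c_noisy, freqs, 1000.0, 2000.0),
      band_average(c_clipped, freqs, 1000.0, 2000.0))
```

A linear gain change leaves the coherence at one, whereas additive noise and clipping both reduce it, which is why coherence can serve as an index of signal degradation in each band.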
Changes in the signal envelope can be measured using several different procedures. The presence of noise or reverberation will reduce the modulation depth of a signal, and multiband measurement of the envelope modulation depth of the system output is used in the speech transmission index (STI) to estimate intelligibility. [259] While accurate for noise and reverberation applications, the STI works poorly for nonlinear processing such as dynamic-range compression. [260] An extension to the STI estimates the change in modulation by cross-correlating the envelopes of the speech input and output signals. [261] [262] A related procedure, also using envelope cross-correlations, is the short-time objective intelligibility (STOI) measure, [253] which works well for its intended application in evaluating noise suppression, but which is less accurate for nonlinear distortion. [263] Envelope-based intelligibility metrics have also been derived using modulation filterbanks [67] and using envelope time-frequency modulation patterns. [264] Envelope cross-correlation is also used for estimating speech and music quality. [265] [266]
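The principle shared by envelope cross-correlation measures can be illustrated with a deliberately simplified, single-band sketch: extract a smoothed envelope from the input and output signals and compute their correlation coefficient. This is not an implementation of the STI, its extensions or STOI; the rectify-and-average envelope extractor, the 20-ms window, the test signal and the function names are assumptions.

```python
import numpy as np

def smoothed_envelope(x, fs, window_ms=20.0):
    """Rectify and smooth with a moving average (a crude, assumed envelope extractor)."""
    w = int(fs * window_ms / 1000.0)
    return np.convolve(np.abs(x), np.ones(w) / w, mode="valid")

def envelope_correlation(x, y, fs):
    """Correlation coefficient between the smoothed envelopes of two signals."""
    return float(np.corrcoef(smoothed_envelope(x, fs), smoothed_envelope(y, fs))[0, 1])

rng = np.random.default_rng(2)
fs = 16000
t = np.arange(fs) / fs
# Assumed clean input: a 1-kHz tone with 4-Hz AM, a speech-like modulation rate.
clean = (1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)) * np.sin(2 * np.pi * 1000.0 * t)

c_mild = envelope_correlation(clean, clean + 0.05 * rng.standard_normal(fs), fs)
c_severe = envelope_correlation(clean, clean + 2.0 * rng.standard_normal(fs), fs)
print(c_mild, c_severe)
```

Strong additive noise fills in the envelope valleys and adds its own fluctuations, so the output envelope resembles the input envelope less and the correlation falls.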
Envelope and TFS measurements can also be combined to form intelligibility and quality metrics. A family of metrics for speech intelligibility, [263] speech quality, [267] [268] and music quality [269] has been derived using a shared model of the auditory periphery [270] that can represent hearing loss. Using a model of the impaired periphery leads to more accurate predictions for hearing-impaired listeners than using a normal-hearing model, and the combined envelope/TFS metric is generally more accurate than a metric that uses envelope modulation alone. [263] [267]