Auditory spatial attention is a specific form of attention, involving the focusing of auditory perception on a location in space.
Although the properties of visuospatial attention have been the subject of detailed study, comparatively little work has been done to elucidate the mechanisms of audiospatial attention. Spence and Driver [1] note that while early researchers investigating auditory spatial attention failed to find the types of effects seen in other modalities such as vision, these null effects may be due to the adaptation of visual paradigms to the auditory domain, which has lower spatial acuity than vision.
Recent neuroimaging research has provided insight into the processes behind audiospatial attention, suggesting functional overlap with portions of the brain previously shown to be responsible for visual attention. [2] [3]
Several studies have explored the properties of audiospatial attention using the behavioral tools of cognitive science, either in isolation or as part of a larger neuroimaging study.
Rhodes [4] sought to identify whether audiospatial attention is represented analogically, that is, whether the mental representation of auditory space is arranged in the same fashion as physical space. If this is the case, then the time to move the focus of auditory attention should be related to the distance to be moved in physical space. Rhodes notes that previous work by Posner, [5] among others, had not found behavioral differences in auditory attention tasks that merely require stimulus detection, possibly because low-level auditory receptors are mapped tonotopically rather than spatially, as in vision. For this reason, Rhodes utilized an auditory localization task, finding that the time to shift attention increases with greater angular separation between the current focus of attention and the target, although this effect reached asymptote at locations more than 90° from the forward direction.
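The shape of this relationship can be illustrated with a simple saturating model; the sketch below is hypothetical (it is not Rhodes's actual fit, and the baseline of 400 ms and slope of 1.5 ms per degree are invented values), but it shows shift time growing with angular separation and levelling off beyond roughly 90°.

```python
# Illustrative model of the pattern Rhodes reported: attention-shift time
# grows with angular separation but levels off beyond ~90 degrees.
# The baseline (400 ms) and slope (1.5 ms/deg) are invented for illustration.

def shift_time_ms(separation_deg, base_ms=400.0, slope_ms_per_deg=1.5, asymptote_deg=90.0):
    """Hypothetical attention-shift latency as a function of angular separation."""
    effective = min(separation_deg, asymptote_deg)  # no further cost past the asymptote
    return base_ms + slope_ms_per_deg * effective

for angle in (0, 30, 60, 90, 120, 150):
    print(f"{angle:3d} deg -> {shift_time_ms(angle):.0f} ms")
```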
Spence and Driver, [1] noting that previous findings of audiospatial attentional effects, including the aforementioned study by Rhodes, could be confounded with response priming, instead utilized several cuing paradigms, both exogenous and endogenous, over the course of eight experiments. Both endogenous (informative) and exogenous (uninformative) cues improved performance in an auditory spatial localization task, consistent with the results previously found by Rhodes. However, only endogenous spatial cues improved performance on an auditory pitch discrimination task; exogenous spatial cues had no effect on this non-spatial pitch judgement. In light of these findings, Spence and Driver suggest that exogenous and endogenous audiospatial orienting may involve different mechanisms, with the colliculus possibly playing a role in both auditory and visual exogenous orienting, and the frontal and parietal cortices playing a similar part in endogenous orienting. The lack of exogenous cuing effects on the pitch task may be due to the connectivity of these structures: Spence and Driver note that while frontal and parietal cortical areas receive inputs from cells coding both pitch and sound location, the colliculus is only thought to be sensitive to pitches above 10 kHz, well above the ~350 Hz tones used in their study.
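The logic distinguishing the two cue types can be sketched as a simple trial generator. This is a simplified illustration rather than the authors' actual procedure, and the 75% cue validity is an assumed value: an endogenous cue predicts the target side above chance, whereas an exogenous cue coincides with the target side only by chance.

```python
import random

# Simplified sketch of the two cue types used in spatial-cuing designs:
# an endogenous (informative) cue predicts the target side above chance,
# while an exogenous (uninformative) cue matches the target side only by chance.
# The 75% validity value is assumed for illustration.

def make_trial(cue_type, validity=0.75):
    target_side = random.choice(["left", "right"])
    if cue_type == "endogenous":
        valid = random.random() < validity
        cue_side = target_side if valid else ("left" if target_side == "right" else "right")
    else:  # exogenous: cue side is unrelated to the target side
        cue_side = random.choice(["left", "right"])
    return {"cue_type": cue_type, "cue_side": cue_side,
            "target_side": target_side, "valid": cue_side == target_side}

trials = [make_trial(random.choice(["endogenous", "exogenous"])) for _ in range(10)]
for t in trials:
    print(t)
```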
Diaconescu et al. [6] found that participants in their cross-modal cuing experiment responded faster to the spatial (location of a visual or auditory stimulus) than to the non-spatial (shape or pitch) properties of target stimuli. While this occurred for both visual and auditory targets, the effect was greater for targets in the visual domain, which the researchers suggest may reflect a subordination of the audiospatial to the visuospatial attentional system.
Neuroimaging tools of modern cognitive neuroscience, such as functional magnetic resonance imaging (fMRI) and event-related potential (ERP) techniques, have provided insight beyond behavioral research into the functional basis of audiospatial attention. Current research suggests that auditory spatial attention overlaps functionally with many areas previously shown to be associated with visual attention.
Although there exists substantial neuroimaging research on attention in the visual domain, comparatively few studies have investigated attentional processes in the auditory domain. For auditory research utilizing fMRI, extra steps must be taken to reduce or avoid scanner noise impinging on the auditory stimuli. [7] Often, a sparse temporal sampling design is used to reduce the impact of scanner noise, taking advantage of the haemodynamic delay by scanning only after stimuli have been presented. [8]
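The idea behind sparse temporal sampling can be illustrated as a simple timing schedule. The sketch below is only an illustration of the principle, and all timing values (stimulus length, a roughly 5 s haemodynamic delay, acquisition time, and inter-trial gap) are assumed: the stimulus is presented in silence, and a single volume is acquired only after the delayed haemodynamic response has developed.

```python
# Illustrative sparse-temporal-sampling schedule: present each auditory stimulus
# in silence and acquire a single volume after the haemodynamic delay, so that
# scanner noise does not overlap the stimulus. All timing values are assumed.

STIM_DURATION_S = 1.0       # assumed stimulus length
HAEMODYNAMIC_DELAY_S = 5.0  # assumed delay to a near-peak BOLD response
ACQUISITION_S = 2.0         # assumed time to acquire one volume
TRIAL_GAP_S = 2.0           # assumed silent gap before the next trial

def sparse_schedule(n_trials):
    """Return (stimulus_onset, acquisition_onset) times for each trial, in seconds."""
    schedule, t = [], 0.0
    for _ in range(n_trials):
        stim_onset = t
        acq_onset = stim_onset + HAEMODYNAMIC_DELAY_S  # scan after the stimulus, not during it
        t = acq_onset + ACQUISITION_S + TRIAL_GAP_S
        schedule.append((stim_onset, acq_onset))
    return schedule

for i, (stim, acq) in enumerate(sparse_schedule(3), start=1):
    print(f"trial {i}: stimulus at {stim:.1f} s, volume acquired at {acq:.1f} s")
```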
Analogous to the 'what' (ventral) and 'where' (dorsal) streams of visual processing (see the two-streams hypothesis), there is evidence to suggest that audition is also split into identification and localization pathways.
Alain et al. [9] utilized a delayed match-to-sample task in which participants held an initial tone in memory, comparing it to a second tone presented 500 ms later. Although the set of stimulus tones remained the same throughout the experiment, task blocks alternated between pitch and spatial comparisons. For example, during pitch comparison blocks, participants were instructed to report whether the second stimulus was higher, lower, or equal in pitch relative to the first, regardless of the two tones' spatial locations. Conversely, during spatial comparison blocks, participants were instructed to report whether the second tone was leftward, rightward, or equal in space relative to the first tone, regardless of tone pitch. This task was used in two experiments, one utilizing fMRI and one ERP, to gauge the spatial and temporal properties, respectively, of 'what' and 'where' auditory processing. Comparing the pitch and spatial judgements revealed increased activation in primary auditory cortices and the right inferior frontal gyrus during the pitch task, and increased activation in bilateral posterior temporal areas and inferior and superior parietal cortices during the spatial task. The ERP results revealed divergence between the pitch and spatial tasks at 300-500 ms following the onset of the first stimulus, in the form of increased positivity over inferior frontotemporal regions during the pitch task and increased positivity over centroparietal regions during the spatial task. This suggested that, similar to what is thought to occur in vision, elements of an auditory scene are split into separate 'what' (ventral) and 'where' (dorsal) pathways; however, it remained unclear whether this similarity is the result of a supramodal division of feature and spatial processes.
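The key design feature, identical stimuli with the required judgement set by the block rather than by the tones themselves, can be sketched as follows. This is a simplified illustration of a delayed match-to-sample trial of the kind described above, not the authors' actual code, and the tone frequencies and azimuths are invented values.

```python
import random

# Sketch of a delayed match-to-sample trial: two tones 500 ms apart, with the
# required judgement (pitch vs. location) set by the block, not by the stimuli.
# Tone frequencies and azimuths are invented for illustration.

FREQS_HZ = [400, 500, 600]
AZIMUTHS_DEG = [-45, 0, 45]

def make_tone():
    return {"freq": random.choice(FREQS_HZ), "azimuth": random.choice(AZIMUTHS_DEG)}

def judge(first, second, block):
    if block == "pitch":            # compare frequencies, ignore location
        if second["freq"] > first["freq"]:
            return "higher"
        if second["freq"] < first["freq"]:
            return "lower"
        return "same"
    else:                           # "spatial": compare azimuths, ignore pitch
        if second["azimuth"] > first["azimuth"]:
            return "rightward"
        if second["azimuth"] < first["azimuth"]:
            return "leftward"
        return "same"

block = random.choice(["pitch", "spatial"])
first, second = make_tone(), make_tone()  # second tone follows after a 500 ms delay
print(block, first, second, "->", judge(first, second, block))
```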
Further evidence as to the modality specificity of the 'what' and 'where' pathways has been provided in a recent study by Diaconescu et al., [6] who suggest that while 'what' processes have discrete pathways for vision and audition, the 'where' pathway may be supramodal, shared by both modalities. Participants were asked in randomly alternating trials to respond to either the feature or the spatial elements of stimuli, which varied between the auditory and visual domains in set blocks. Between two experiments, the modality of the cue was also varied: the first experiment used auditory cues indicating which element (feature or spatial) of the stimuli to respond to, while the second experiment used visual cues. During the period between cue and target, when participants were presumably attending to the cued element of the upcoming stimulus, both the auditory and the visual spatial attention conditions elicited greater positivity in source space at a centro-medial location 600-1200 ms following cue onset, which the authors propose may reflect a supramodal pathway for spatial information. Conversely, source-space activity for feature attention was not consistent between modalities, with auditory feature attention associated with greater positivity at the right auditory radial dipole around 300-600 ms, and visual feature attention associated with greater negativity at the left visual central-inferior dipole at 700-1050 ms, suggested as evidence for separate feature, or 'what', pathways for vision and audition.
Several studies investigating the functional structures of audiospatial attention have revealed functional areas which overlap with visuospatial attention, suggesting the existence of a supramodal spatial attentional network.
Smith et al. [2] contrasted the cortical activation during audiospatial attention with both visuospatial attention and auditory feature attention in two separate experiments.
The first experiment used an endogenous, or top-down, orthogonal cuing paradigm to investigate the cortical regions involved in audiospatial versus visuospatial attention. The orthogonal cuing paradigm refers to the information provided by the cue stimuli: participants were asked to make an up/down elevation judgement about targets that could appear either centrally or laterally, to the left or right side. While cues provided information as to the lateralization of the upcoming target, they contained no information as to the correct elevation judgement. Such a procedure was used to dissociate the functional effects of spatial attention from those of motor-response priming. The same task was used for visual and auditory targets, in alternating blocks. Crucially, the primary focus of analysis was on "catch trials", in which cued targets were not presented. This allowed for investigation of functional activation related to attending to a specific location, free of contamination from target-related activity. In the auditory domain, comparing activation following peripheral (left or right) cues to that following central cues revealed significant activation in the posterior parietal cortex (PPC), frontal eye fields (FEF), and supplementary motor area (SMA). These areas overlap with those that were significantly active during the visuospatial attention condition; a comparison of the activation during the auditory and visual spatial attention conditions found no significant difference between the two.
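The two defining features of this design, an elevation judgement that is orthogonal to the cued side and catch trials on which no target appears, can be sketched as a trial generator. This is a simplified illustration rather than the study's actual procedure, and the 20% catch-trial rate is an assumed value.

```python
import random

# Simplified sketch of an orthogonal cuing design: the cue indicates the side
# (left/right/central) where a target may appear, but the response is an
# up/down elevation judgement that the cue cannot predict. On catch trials
# the cued target never appears. The 20% catch-trial rate is assumed.

def make_trial(catch_rate=0.2):
    cue_side = random.choice(["left", "right", "central"])
    if random.random() < catch_rate:
        return {"cue_side": cue_side, "catch": True, "target": None}
    target = {"side": cue_side,                            # cue is valid for side...
              "elevation": random.choice(["up", "down"])}  # ...but not for elevation
    return {"cue_side": cue_side, "catch": False, "target": target}

for t in [make_trial() for _ in range(8)]:
    print(t)
```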
During the second experiment participants were presented with a pair of distinguishable auditory stimuli. Although the same pair of stimuli was used throughout the experiment, different blocks of the task required participants to respond to either the temporal order (which sound came first) or the spatial location (which sound was farther from the midline) of the stimuli. Participants were instructed which feature to attend to at the onset of each block, allowing for comparisons of activation due to auditory spatial attention and auditory non-spatial attention to the same set of stimuli. The comparison of the spatial location task to the temporal order task showed greater activation in areas previously found to be associated with attention in the visual domain, including the bilateral temporoparietal junction, bilateral superior frontal areas near the FEF, bilateral intraparietal sulcus, and bilateral occipitotemporal junction, suggesting an attentional network that operates supramodally across vision and audition.
The anatomical locus of the executive control of endogenous audiospatial attention was investigated using fMRI by Wu et al. [3] Participants received auditory cues to attend to either their left or right in anticipation of an auditory stimulus. A third cue, instructing participants to attend to neither left nor right, served as a non-spatial control condition. Comparing activation in the spatial versus non-spatial attention conditions showed increased activation in several areas implicated in the executive control of visual attention, including the prefrontal cortex, FEF, anterior cingulate cortex (ACC), and superior parietal lobe, again supporting the notion of these structures as supramodal attentional areas. The spatial attention versus control comparison further revealed increased activity in auditory cortex contralateral to the attended side, which may reflect top-down biasing of early sensory areas, as has been seen with visual attention.
Wu et al. additionally observed that audiospatial attention was associated with increased activation in areas thought to process visual information, namely the cuneus and lingual gyrus, despite participants having completed the task with their eyes closed. As this activity was not contralateral to the locus of attention, the authors contend that the effect is likely not spatially specific, suggesting that it may instead reflect a general spread of attentional activity, possibly playing a role in multimodal sensory integration.
Although comparatively little research exists on the functional underpinnings of audiospatial as opposed to visuospatial attention, current findings suggest that many of the anatomical structures implicated in visuospatial attention function supramodally and are involved in audiospatial attention as well. The cognitive consequences of this connection, which may relate to multimodal processing, have yet to be fully explored.
Attention is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether considered subjective or objective, while ignoring other perceivable information. William James (1890) wrote that "Attention is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought. Focalization, concentration, of consciousness are of its essence." Attention has also been described as the allocation of limited cognitive processing resources. Attention is manifested by an attentional bottleneck, in terms of the amount of data the brain can process each second; for example, in human vision, less than 1% of the visual input data can enter the bottleneck, leading to inattentional blindness.
Multisensory integration, also known as multimodal integration, is the study of how information from the different sensory modalities may be integrated by the nervous system. A coherent representation of objects combining modalities enables animals to have meaningful perceptual experiences. Indeed, multisensory integration is central to adaptive behavior because it allows animals to perceive a world of coherent perceptual entities. Multisensory integration also deals with how different sensory modalities interact with one another and alter each other's processing.
Inhibition of return (IOR) refers to an orientation mechanism that briefly enhances the speed and accuracy with which an object is detected after that object is attended, but then impairs detection speed and accuracy. IOR is usually measured with a cue-response paradigm, in which a person presses a button when they detect a target stimulus following the presentation of a cue indicating the location in which the target will appear. The cue can be exogenous or endogenous. Inhibition of return results from oculomotor activation, regardless of whether that activation was produced exogenously or endogenously. Although IOR occurs for both visual and auditory stimuli, it is greater for visual stimuli and is studied more often with visual than with auditory stimuli.
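How an IOR effect is quantified in a cue-response paradigm can be illustrated with a small worked example: the mean reaction time at the cued location minus the mean reaction time at an uncued location, at a short and a long cue-target interval. All reaction times below are invented for illustration; only the direction of the differences reflects the typical pattern of early facilitation followed by later inhibition.

```python
# Illustrative calculation of an inhibition-of-return (IOR) effect from a
# cue-response paradigm. All reaction times are invented for illustration.

rts_ms = {
    # (cue_target_interval, target_location): reaction times in ms
    ("short", "cued"):   [310, 305, 298],   # early facilitation at the cued location
    ("short", "uncued"): [330, 335, 328],
    ("long", "cued"):    [360, 355, 365],   # later slowing at the cued location (IOR)
    ("long", "uncued"):  [330, 332, 329],
}

def mean(xs):
    return sum(xs) / len(xs)

for interval in ("short", "long"):
    effect = mean(rts_ms[(interval, "cued")]) - mean(rts_ms[(interval, "uncued")])
    label = "facilitation" if effect < 0 else "inhibition of return"
    print(f"{interval} interval: cued - uncued = {effect:+.0f} ms ({label})")
```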
The two-streams hypothesis is a model of the neural processing of vision as well as hearing. The hypothesis, given its initial characterisation in a paper by David Milner and Melvyn A. Goodale in 1992, argues that humans possess two distinct visual systems. Recently there seems to be evidence of two distinct auditory systems as well. As visual information exits the occipital lobe, and as sound leaves the phonological network, it follows two main pathways, or "streams". The ventral stream leads to the temporal lobe, which is involved with object and visual identification and recognition. The dorsal stream leads to the parietal lobe, which is involved with processing the object's spatial location relative to the viewer and with speech repetition.
Visual search is a type of perceptual task requiring attention that typically involves an active scan of the visual environment for a particular object or feature among other objects or features. Visual search can take place with or without eye movements. The ability to consciously locate an object or target amongst a complex array of stimuli has been extensively studied over the past 40 years. Practical examples of using visual search can be seen in everyday life, such as when one is picking out a product on a supermarket shelf, when animals are searching for food among piles of leaves, when trying to find a friend in a large crowd of people, or simply when playing visual search games such as Where's Wally?
Attentional shift occurs when directing attention to a point increases the efficiency of processing at that point, and includes inhibition that withdraws attentional resources from unwanted or irrelevant inputs. Shifting of attention is needed to allocate attentional resources so that information from a stimulus can be processed more efficiently. Research has shown that when an object or area is attended, processing operates more efficiently. Task-switching costs occur when performance on a task suffers due to the increased effort involved in shifting attention. There are competing theories that attempt to explain why and how attention is shifted, as well as how attention is moved through space.
Negative priming is an implicit memory effect in which prior exposure to a stimulus unfavorably influences the response to the same stimulus. It falls under the category of priming, which refers to the change in the response towards a stimulus due to a subconscious memory effect. Negative priming describes the slow and error-prone reaction to a stimulus that was previously ignored. For example, imagine a subject trying to pick a red pen from a pen holder. The red pen becomes the target of attention, so the subject responds by moving their hand towards it. At this time, they mentally block out all other pens as distractors to aid in closing in on just the red pen. After repeatedly picking the red pen over the others, switching to the blue pen results in a momentary delay in picking the pen out. The slow reaction due to the change of the distractor stimulus into a target stimulus is called the negative priming effect.
Echoic memory is the sensory memory register specific to auditory information (sounds). Once an auditory stimulus is heard, it is stored in memory so that it can be processed and understood. Unlike most visual memory, in which a person can choose how long to view the stimulus and can reassess it repeatedly, auditory stimuli are usually transient and cannot be reassessed. Since echoic memories are heard only once, they are stored for slightly longer periods of time than iconic memories. Auditory stimuli are received by the ear one at a time before they can be processed and understood.
Extinction is a neurological disorder that impairs the ability to perceive multiple stimuli of the same type simultaneously. Extinction is usually caused by damage resulting in lesions on one side of the brain. Those who are affected by extinction have a lack of awareness in the contralesional side of space and a loss of exploratory search and other actions normally directed toward that side.
Cross modal plasticity is the adaptive reorganization of neurons to integrate the function of two or more sensory systems. Cross modal plasticity is a type of neuroplasticity and often occurs after sensory deprivation due to disease or brain damage. The reorganization of the neural network is greatest following long-term sensory deprivation, such as congenital blindness or pre-lingual deafness. In these instances, cross modal plasticity can strengthen other sensory systems to compensate for the lack of vision or hearing. This strengthening is due to new connections that are formed to brain cortices that no longer receive sensory input.
The P3a, or novelty P3, is a component of time-locked (EEG) signals known as event-related potentials (ERP). The P3a is a positive-going scalp-recorded brain potential that has a maximum amplitude over frontal/central electrode sites with a peak latency falling in the range of 250–280 ms. The P3a has been associated with brain activity related to the engagement of attention and the processing of novelty.
In neuroscience, the visual P200 or P2 is a waveform component or feature of the event-related potential (ERP) measured at the human scalp. Like other potential changes measurable from the scalp, this effect is believed to reflect the post-synaptic activity of a specific neural process. The P2 component, also known as the P200, is so named because it is a positive going electrical potential that peaks at about 200 milliseconds after the onset of some external stimulus. This component is often distributed around the centro-frontal and the parieto-occipital areas of the scalp. It is generally found to be maximal around the vertex of the scalp, however there have been some topographical differences noted in ERP studies of the P2 in different experimental conditions.
The visual N1 is a visual evoked potential, a type of event-related electrical potential (ERP), that is produced in the brain and recorded on the scalp. The N1 is so named to reflect the polarity and typical timing of the component. The "N" indicates that the polarity of the component is negative with respect to an average mastoid reference. The "1" originally indicated that it was the first negative-going component, but it now better indexes the typical peak of this component, which is around 150 to 200 milliseconds post-stimulus. The N1 deflection may be detected at most recording sites, including the occipital, parietal, central, and frontal electrode sites. Although the visual N1 is widely distributed over the entire scalp, it peaks earlier over frontal than posterior regions of the scalp, suggestive of distinct neural and/or cognitive correlates. The N1 is elicited by visual stimuli, and is part of the visual evoked potential – a series of voltage deflections observed in response to visual onsets, offsets, and changes. Both the right and left hemispheres generate an N1, but the laterality of the N1 depends on whether a stimulus is presented centrally, laterally, or bilaterally. When a stimulus is presented centrally, the N1 is bilateral. When presented laterally, the N1 is larger, earlier, and contralateral to the visual field of the stimulus. When two visual stimuli are presented, one in each visual field, the N1 is bilateral. In the latter case, the N1's asymmetrical skewedness is modulated by attention. Additionally, its amplitude is influenced by selective attention, and thus it has been used to study a variety of attentional processes.
The C1 and P1 are two human scalp-recorded event-related brain potential components, collected by means of a technique called electroencephalography (EEG). The C1 is so named because it was the first component in a series of components found to respond to visual stimuli when it was first discovered. It can be a negative-going or a positive-going component, with its peak normally observed in the 65–90 ms range post-stimulus onset. The P1 is called the P1 because it is the first positive-going component, and its peak is normally observed at around 100 ms. Both components are related to the processing of visual stimuli and fall under the category of potentials called visually evoked potentials (VEPs). Both components are theorized to be evoked within the visual cortices of the brain, with the C1 being linked to the primary visual cortex of the human brain and the P1 being linked to other visual areas. One of the primary distinctions between these two components is that, whereas the P1 can be modulated by attention, the C1 has typically been found to be invariant to different levels of attention.
The P3b is a subcomponent of the P300, an event-related potential (ERP) component that can be observed in human scalp recordings of brain electrical activity. The P3b is a positive-going amplitude peaking at around 300 ms, though the peak will vary in latency from 250 to 500 ms or later depending upon the task and on the individual subject response. Amplitudes are typically highest on the scalp over parietal brain areas.
Object-based attention refers to the relationship between an ‘object’ representation and a person’s visually stimulated, selective attention, as opposed to a relationship involving either a spatial or a feature representation; although these types of selective attention are not necessarily mutually exclusive. Research into object-based attention suggests that attention improves the quality of the sensory representation of a selected object, and results in the enhanced processing of that object’s features.
The Colavita visual dominance effect refers to the phenomenon in which study participants, when presented with bimodal audiovisual stimuli, respond more often to the visual component of the stimulus.
Crossmodal attention refers to the distribution of attention to different senses. Attention is the cognitive process of selectively emphasizing and ignoring sensory stimuli. According to the crossmodal attention perspective, attention often occurs simultaneously through multiple sensory modalities. These modalities process information from different sensory fields, such as the visual, auditory, spatial, and tactile fields. While each of these is designed to process a specific type of sensory information, there is considerable overlap between them, which has led researchers to question whether attention is modality-specific or the result of shared "cross-modal" resources. Cross-modal attention is considered to be the overlap between modalities that can both enhance and limit attentional processing. The most common example given of crossmodal attention is the cocktail party effect, which is when a person is able to focus on and attend to one important stimulus instead of other, less important stimuli. This phenomenon allows deeper levels of processing to occur for one stimulus while others are ignored.
The Posner cueing task, also known as the Posner paradigm, is a neuropsychological test often used to assess attention. Formulated by Michael Posner, it assesses a person's ability to perform an attentional shift. It has been used and modified to assess disorders, focal brain injury, and the effects of both on spatial attention.
Visual spatial attention is a form of visual attention that involves directing attention to a location in space. Similar to its temporal counterpart visual temporal attention, these attention modules have been widely implemented in video analytics in computer vision to provide enhanced performance and human interpretable explanation of deep learning models.