Natural scene perception

Natural scene perception refers to the process by which an agent (such as a human being) visually takes in and interprets scenes that it typically encounters in natural modes of operation (e.g. busy streets, meadows, living rooms). [1] This process has been modeled in several different ways that are guided by different concepts.

In the field of perception, a scene is information that can flow from a physical environment into a perceptual system via sensory transduction.

Debate over role of attention

One major dividing line between theories that explain natural scene perception is the role of attention. Some theories maintain the need for focused attention, while others claim that focused attention is not involved.

Attention

Attention is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information. It is a state of arousal. It is the taking possession by the mind in clear and vivid form of one out of what seem several simultaneous objects or trains of thought. Focalization, the concentration of consciousness, is of its essence. Attention has also been described as the allocation of limited cognitive processing resources.

Focused attention played a partial role in early models of natural scene perception. Such models involved two stages of visual processing. [2] According to these models, the first stage is attention-free and registers low-level features such as brightness gradients, motion and orientation in a parallel manner. The second stage requires focused attention: it registers high-level object descriptions, has limited capacity and operates serially. These models have been empirically informed by studies demonstrating change blindness, inattentional blindness and attentional blink. Such studies show that when one's focused visual attention is engaged by a task, significant changes in one's environment that are not directly pertinent to the task can escape awareness. It was generally thought that natural scene perception was similarly susceptible to change blindness, inattentional blindness and attentional blink, and that these phenomena occurred because engaging in a task diverts attentional resources that would otherwise be used for natural scene perception.

Brightness

Brightness is an attribute of visual perception in which a source appears to be radiating or reflecting light. In other words, brightness is the perception elicited by the luminance of a visual target. It is not necessarily proportional to luminance. This is a subjective attribute of an object being observed and one of the color appearance parameters of color appearance models. Brightness is an absolute term and should not be confused with lightness.

Change blindness

Change blindness is a perceptual phenomenon that occurs when a change in a visual stimulus is introduced and the observer does not notice it. For example, observers often fail to notice major differences introduced into an image while it flickers off and on again. People's poor ability to detect changes has been argued to reflect fundamental limitations of human attention. Change blindness has become a highly researched topic and some have argued that it may have important practical implications in areas such as eyewitness testimony and distractions while driving.

Evidence against the need for focused attention

The attention-free hypothesis soon emerged to challenge early models. Its initial basis was the finding that in visual search, basic visual features of objects immediately and automatically pop out to the person doing the search. [3] Further experiments seemed to support this: Potter (as cited by Evans & Treisman, 2005) showed that high-order representations can be accessed rapidly from natural scenes presented at rates of up to 10 per second. Additionally, Thorpe, Fize & Marlot (as cited by Evans & Treisman) discovered that humans and other primates can categorize natural images (e.g. of animals in everyday indoor and outdoor scenes) rapidly and accurately even after brief exposures. [3] The basic idea in these studies is that exposure to each individual scene is too brief for attentional processes to occur, yet human beings are able to interpret and categorize these scenes.

Visual search is a type of perceptual task requiring attention that typically involves an active scan of the visual environment for a particular object or feature (the target) among other objects or features (the distractors). Visual search can take place with or without eye movements. The ability to consciously locate an object or target amongst a complex array of stimuli has been extensively studied over the past 40 years. Practical examples of visual search occur in everyday life, such as picking out a product on a supermarket shelf, an animal searching for food amongst piles of leaves, trying to find a friend in a large crowd of people, or simply playing visual search games such as Where's Wally?

Much previous literature on visual search used reaction time to measure how long it takes to detect the target amongst its distractors (for example, a green square amongst a set of red circles). However, reaction time measurements do not always distinguish the role of attention from other factors: a long reaction time might result from difficulty directing attention to the target, or from slowed decision-making or motor responses after attention has already been directed to the target and the target has been detected. Many visual search paradigms have therefore used eye movements as a means to measure the degree of attention given to stimuli. However, research to date suggests that eye movements can occur independently of attention, so eye movement measures do not completely capture the role of attention.

Weaker versions of the attention-free hypothesis have also been targeted at specific components of the natural scene perception process instead of the process as a whole. Kihara & Takeda (2012) limit their claim to saying that it is the integration of spatial frequency-based information in natural scenes (a sub-process of natural scene perception) that is attention-free. [4] This claim is based on their study, which used attention-demanding tasks to examine participants' abilities to accurately categorize images that were filtered to have a wide range of spatial frequencies. The logic behind this experiment was that if integration of visual information across spatial frequencies (measured by the categorization task) is preattentive, then attention-demanding tasks should not affect performance in the categorization task. This was indeed found to be the case.

Spatial frequency

In mathematics, physics, and engineering, spatial frequency is a characteristic of any structure that is periodic across position in space. The spatial frequency is a measure of how often sinusoidal components of the structure repeat per unit of distance. The SI unit of spatial frequency is cycles per meter. In image-processing applications, spatial frequency is often expressed in units of cycles per millimeter or equivalently line pairs per millimeter.
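As an illustrative sketch (not part of the studies discussed here), the dominant spatial frequency of a one-dimensional luminance pattern can be estimated with a discrete Fourier transform. The example below assumes a sinusoidal grating sampled at a known spacing and reports its frequency in cycles per metre:

```python
import numpy as np

# Illustrative sketch: estimate the dominant spatial frequency of a
# sinusoidal luminance grating in cycles per unit distance.
dx = 0.001            # sample spacing in metres (1 mm)
n = 1000              # number of samples -> 1 m of "space"
x = np.arange(n) * dx
true_freq = 5.0       # the grating repeats 5 times per metre
grating = np.sin(2 * np.pi * true_freq * x)

spectrum = np.abs(np.fft.rfft(grating))
freqs = np.fft.rfftfreq(n, d=dx)               # frequencies in cycles/metre
dominant = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC component
print(dominant)  # -> 5.0
```

The same idea extends to two dimensions with `np.fft.fft2`, which is how band-pass-filtered stimuli of the kind used by Kihara & Takeda are typically constructed.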

More recent evidence reasserting the need for focused attention

A recent study by Cohen, Alvarez & Nakayama (2011) calls into question the validity of evidence supporting the attention-free hypothesis. They found that participants did display inattentional blindness while doing certain kinds of multiple-object tracking (MOT) and rapid serial visual presentation (RSVP) tasks. [5] Furthermore, Cohen et al. found that participants' natural scene perception was impaired under dual-task conditions, but that this dual-task impairment happened only when participants' primary task was sufficiently demanding. The authors concluded that previous studies showing the absence of a need for focused attention did not use tasks that were demanding enough to fully engage attention.

Rapid serial visual presentation (RSVP) is a paradigm used by psychologists to study an attentional phenomenon known as the "attentional blink". An RSVP task asks a participant to observe a continuous presentation of multiple separate visual images or objects that appear in rapid succession, each with a very short exposure duration. Two targets are shown during each presentation. In the classic setup, the first target is shown and the second follows either two or eight items later. The stimuli shown between the first and second targets are known as distractors. Examples of distractors include a color change in the entire display, a post-mask, or letters presented among the digits.
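A classic RSVP trial of the kind described above can be sketched as follows. This is a hypothetical helper (the function name and parameters are illustrative, not taken from any specific study): a stream of letter distractors with two digit targets, the second appearing a fixed number of items after the first.

```python
import random

# Illustrative sketch of one RSVP trial: letter distractors with two
# digit targets, the second target `lag` items after the first.
def make_rsvp_trial(stream_len=20, t1_pos=5, lag=2, seed=None):
    rng = random.Random(seed)
    letters = list("BCDFGHJKLMNPQRSTVWXZ")
    stream = [rng.choice(letters) for _ in range(stream_len)]
    t1, t2 = rng.sample(range(10), 2)   # two distinct digit targets
    stream[t1_pos] = str(t1)
    stream[t1_pos + lag] = str(t2)      # classic lags: 2 or 8 items
    return stream

trial = make_rsvp_trial(seed=1)
```

At roughly 10 items per second (the presentation rate mentioned earlier), a lag of two items places the second target about 200 ms after the first, inside the window where the attentional blink is typically observed.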

A dual-task paradigm is a procedure in experimental (neuro)psychology that requires an individual to perform two tasks simultaneously, in order to compare performance with single-task conditions. When performance scores on one and/or both tasks are lower when they are done simultaneously compared to separately, these two tasks interfere with each other, and it is assumed that both tasks compete for the same class of information processing resources in the brain.

In the Cohen et al. study, the MOT task involved viewing eight black moving discs presented against a changing background of randomly colored checkerboard masks. Four of these discs were picked out, and participants were instructed to track them. The RSVP task involved viewing a stream of letters and digits presented against a series of changing checkerboards and counting the number of times a digit was presented. In both experiments, the critical trial involved a natural scene suddenly replacing the second-to-last checkerboard; participants were immediately afterwards asked whether they had noticed anything different and were then given six questions to determine whether they had categorized the scene. The dual-task condition simply involved participants performing the MOT task mentioned above and a scene-classification task simultaneously. The authors varied how demanding the task was by increasing or decreasing the speed of the moving discs.

Models

These are some of the models that have been proposed for the purpose of explaining natural scene perception.

Evans & Treisman's hypothesis

Evans & Treisman (2005) proposed a hypothesis that humans rapidly detect disjunctive sets of unbound features of target categories in a parallel manner, and then use these features to discriminate between scenes that do or do not contain the target without necessarily fully identifying it. [3] An example of such a feature would be outstretched wings that can be used to tell whether or not a bird is in a picture, even before the system has identified an object as a bird. Evans & Treisman propose that natural scene perception involves a first pass through the visual processing hierarchy up to the nodes in a visual identification network, and then optional revisiting of earlier levels for more detailed analysis. During the 'first pass' stage, the system forms a global representation of the natural scene that includes the layout of global boundaries and potential objects. During the 'revisiting' stage, focused attention is employed to select local objects of interest in a serial manner, and then bind their features to their representations.

This hypothesis is consistent with the results of their study, in which participants were instructed to detect animal targets in RSVP sequences and then report their identities and locations. While participants were able to detect the targets in most trials, they were often subsequently unable to identify or localize them. Furthermore, when two targets were presented in quick succession, participants displayed a significant attentional blink when required to identify the targets, but the attentional blink was mostly eliminated among participants required only to detect them. [3] Evans & Treisman explain these results with the hypothesis that the attentional blink occurs because the identification stage requires attentional resources, while the detection stage does not.

Ultra-rapid visual categorization

Ultra-rapid visual categorization is a model proposing an automatic feedforward mechanism that forms high-level object representations in parallel without focused attention. In this model, the mechanism cannot be sped up by training. Evidence for a feedforward mechanism can be found in studies showing that many neurons are already highly selective at the beginning of a visual response, suggesting that feedback mechanisms are not required for response selectivity to increase. [6] Furthermore, fMRI and ERP studies have shown that masked visual stimuli that participants do not consciously perceive can significantly modulate activity in the motor system, suggesting somewhat sophisticated visual processing. [7] VanRullen (2007) ran simulations showing that the feedforward propagation of one wave of spikes through high-level neurons, generated in response to a stimulus, could be enough for crude recognition and categorization that occurs in 150 ms or less. [8]
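The structural claim of the model, that a single forward sweep with no recurrent feedback suffices for a crude category decision, can be illustrated with a minimal sketch. This is not VanRullen's actual spiking simulation; the weights below are random placeholders, and the point is only that classification here proceeds in one pass of feedforward computation:

```python
import numpy as np

# Minimal illustration of a purely feedforward sweep: one pass of matrix
# multiplications and thresholds from a "retinal" input vector to a
# two-way category response. Weights are untrained random placeholders.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((64, 16))   # input -> intermediate feature stage
w2 = rng.standard_normal((16, 2))    # features -> two category units

def feedforward_categorize(image):
    hidden = np.maximum(image @ w1, 0.0)   # one thresholded feature stage
    scores = hidden @ w2                   # category evidence, no recurrence
    return ["animal", "no animal"][int(np.argmax(scores))]

label = feedforward_categorize(rng.standard_normal(64))
```

Nothing in `feedforward_categorize` revisits an earlier stage, which is the sense in which the computation is "feedforward"; a trained network of this shape is what feedforward simulations of rapid categorization typically use.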

Neural-object file theory

Xu & Chun (2009) propose the neural-object file theory, which posits that the human visual system initially selects a fixed number of roughly four objects from a crowded scene based on their spatial information (object individuation) before encoding their details (object identification). [9] Under this framework, object individuation is generally controlled by the inferior intra-parietal sulcus (IPS), while object identification involves the superior IPS and higher-level visual areas. At the object individuation stage, object representations are coarse and contain minimal feature information. However, once these object representations (or object-files, to use the theory's language) have been 'set up' during the object individuation stage they can be elaborated on over time during the object identification stage, during which additional featural and identity information is received.

The neural-object file theory deals with the issue of attention by proposing two different processing systems. One of them tracks the overall hierarchical structure of the visual display and is attention-free, while the other processes current objects of attentional selection. The current hypothesis is that the parahippocampal place area (PPA) plays a role in shifting visual attention to different parts of a scene and incorporating information from multiple frames in order to form an integrated representation of the scene.

The separation between object individuation and identification in the neural object-file theory is supported by evidence such as that from Xu and Chun's fMRI study (as cited in Xu & Chun, 2009). In this study, they examined posterior brain mechanisms supporting visual short-term memory (VSTM). The fMRI results showed that representations in the inferior IPS were fixed at roughly four objects regardless of object complexity, while representations in the superior IPS and lateral occipital complex (LOC) varied according to complexity. [10]

References

  1. Geisler, W.S., Perry, J.S. and Ing, A.D. (2008) Natural systems analysis. In: B. Rogowitz and T. Pappas (Eds.), Human Vision and Electronic Imaging. Proceedings SPIE, Vol 6806, 68060M
  2. Evans, K. & Treisman, A. (2005). Perception of Objects in Natural Scenes: Is it really attention free? Journal of Experimental Psychology: Human Perception and Performance, 31(6), 1476-1492.
  3. See 2.
  4. Kihara, K. & Takeda, Y. (2012). Attention-free integration of spatial frequency-based information in natural scenes. Vision Research, 65, 38-44.
  5. Cohen, M.A., Alvarez, G.A., & Nakayama, K. (2011). Natural-scene perception requires attention. Psychological Science, 22(9), 1165-1172.
  6. Fabre-Thorpe, M., Delorme, A., Marlot, C., & Thorpe, S. (2001). A limit to the speed of processing in ultra-rapid visual categorization of novel natural scenes. Journal of Cognitive Neuroscience, 13(2), pp. 171-180.
  7. See 9.
  8. VanRullen, R. (2007). The power of the feed-forward sweep. Advances in Cognitive Psychology, 3(1), 167-176.
  9. Xu, Y. & Chun, M.M. (2009). Selecting and perceiving multiple visual objects. Trends in Cognitive Sciences, 13(4), 167-173.
  10. See 9.