The recognition-by-components theory, or RBC theory, [1] is an account of object recognition proposed by Irving Biederman in 1987. According to RBC theory, we recognize objects by separating them into geons, the objects' main component parts. Biederman suggested that geons are based on basic 3-dimensional shapes (cylinders, cones, etc.) that can be assembled in various arrangements to form a virtually unlimited number of objects. [2]
The recognition-by-components theory suggests that there are fewer than 36 geons which are combined to create the objects we see in day-to-day life. [3] For example, when looking at a mug we break it down into two components – "cylinder" and "handle". This also works for more complex objects, which are in turn made up of larger numbers of geons. Perceived geons are then compared with objects in our stored memory to identify what it is we are looking at. The theory proposes that when we view objects we look for two important components: edges, and the concavities where two edges meet.
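The decomposition the theory describes can be pictured as a small structural record: a list of geons plus the relations between them. Below is a minimal Python sketch of that idea; the geon names, aspect labels, and relation labels are illustrative stand-ins, not Biederman's actual inventory.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeonPart:
    geon: str    # e.g. "cylinder", "curved_cylinder" (illustrative names)
    aspect: str  # coarse, viewpoint-stable attribute, e.g. "stubby"

# A structural description: component geons plus the relations between them.
mug = {
    "parts": [GeonPart("cylinder", "stubby"), GeonPart("curved_cylinder", "thin")],
    "relations": [("curved_cylinder", "side_attached_to", "cylinder")],
}

def same_structure(a, b):
    """Two descriptions match when their parts and relations coincide."""
    return (set(a["parts"]) == set(b["parts"])
            and set(a["relations"]) == set(b["relations"]))

print(same_structure(mug, mug))  # True
```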
In his proposal of RBC, Biederman drew an analogy between the composition of speech and the composition of objects to support his theory. The idea is that roughly 44 individual phonemes, or "units of sound", are needed to make up every word in the English language, and only about 55 are needed to make up every word in all languages. Though small differences exist between these phonemes, a small, discrete set suffices to compose every language.
A similar system may describe how objects are perceived. Biederman suggests that, just as speech is composed of phonemes, objects are composed of geons, and just as a small set of phonemes yields an enormous variety of words, a small set of geons yields an enormous variety of objects. It becomes easier to accept that roughly 36 geons could compose the sum of all objects we see when all human language and speech is built from only about 55 phonemes.
One of the defining features of the recognition-by-components theory is that it enables us to recognize objects regardless of viewing angle; this is known as viewpoint invariance. The proposed explanation for this effect is the invariant edge properties of geons. [4]
The invariant edge properties are as follows:
- Curvature – points along a curve
- Parallelism – sets of parallel edges
- Cotermination – edges terminating at a common point
- Symmetry and asymmetry
- Collinearity – points along a straight line
Our knowledge of these properties means that when viewing an object or geon, we can perceive it from almost any angle. For example, when viewing a brick we see horizontal and vertical sets of parallel edges, and by noting where those edges meet (cotermination) we are able to perceive the object.
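As an illustration of how such edge properties could be checked computationally, here is a minimal Python sketch that tests two of them, parallelism and cotermination, on 2D line segments; the tolerances and coordinates are invented for the example.

```python
import math

def angle(seg):
    """Undirected orientation of a segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def parallel(a, b, tol=math.radians(5)):
    """True if the two segments are (nearly) parallel."""
    d = abs(angle(a) - angle(b))
    return min(d, math.pi - d) < tol

def coterminate(a, b, tol=2.0):
    """True if the segments share (approximately) an endpoint."""
    return any(math.dist(p, q) < tol for p in a for q in b)

# Edges of a brick seen in rough perspective: parallelism and
# cotermination survive the change of viewpoint.
top    = ((0, 0), (10, 0))
bottom = ((0, 5), (10, 5.4))   # slightly skewed by projection
left   = ((0, 0), (0, 5))

print(parallel(top, bottom))   # True: still roughly parallel
print(coterminate(top, left))  # True: shared corner at (0, 0)
```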
Using geons as structural primitives results in two key advantages. Because geons are based on object properties that are stable across viewpoint ("viewpoint invariant"), and all geons are discriminable from one another, a single geon description is sufficient to describe an object from all possible viewpoints. The second advantage is considerable economy of representation: a relatively small set of geons forms a simple "alphabet" that can combine into complex objects. For example, with only 24 geons, there are 306 billion possible combinations of 3 geons, enough in principle to represent the enormous variety of objects we encounter.
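The arithmetic behind this economy argument is simple multiplication: every additional geon slot or spatial relation multiplies the number of describable objects. The sketch below illustrates the explosion under stated assumptions; the relation count is an invented placeholder, so it does not reproduce the 306 billion figure, only the principle.

```python
# Rough combinatorics behind the "alphabet" economy argument. The exact
# relation inventory behind the 306-billion figure isn't given here, so
# the relation count R below is an illustrative assumption.
G = 24   # distinct geons
R = 100  # assumed number of distinct pairwise spatial relations

# A 3-geon object: choose a geon for each of 3 slots, and a relation
# for each of the 2 attachments in a simple chain-like structure.
three_geon_objects = G**3 * R**2
print(f"{three_geon_objects:,}")  # 138,240,000 under these assumptions
```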
In addition, some research suggests that the ability to recognize geons and compound structures of geons may develop in the brain as early as four months of age, making it one of the fundamental skills that infants use to perceive the world. [5]
RBC theory is not in itself capable of starting with a photograph of a real object and producing a geons-and-relations description of the object; the theory does not attempt to provide a mechanism to reduce the complexities of real scenes to simple geon shapes. RBC theory is also incomplete in that geons and the relations between them will fail to distinguish many real objects. For example, a pear and an apple are easily distinguished by humans, but lack the corners and edges needed for RBC theory to recognize they are different. However, Irving Biederman has argued that RBC theory is the "preferred" mode of human object recognition, with a secondary process handling objects that are not distinguishable by their geons. He further states that this distinction explains research suggesting that objects may or may not be recognized equally well with changes in viewpoint.
Perception is the organization, identification, and interpretation of sensory information in order to represent and understand the presented information or environment. All perception involves signals that go through the nervous system, which in turn result from physical or chemical stimulation of the sensory system. Vision involves light striking the retina of the eye; smell is mediated by odor molecules; and hearing involves pressure waves.
Gestalt psychology, gestaltism, or configurationism is a school of psychology that emerged in the early twentieth century in Austria and Germany as a theory of perception that was a rejection of basic principles of Wilhelm Wundt's and Edward Titchener's elementalist and structuralist psychology.
James Jerome Gibson was an American psychologist and is considered to be one of the most important contributors to the field of visual perception. Gibson challenged the idea that the nervous system actively constructs conscious visual perception, and instead promoted ecological psychology, in which the mind directly perceives environmental stimuli without additional cognitive construction or processing. A Review of General Psychology survey, published in 2002, ranked him as the 88th most cited psychologist of the 20th century, tied with John Garcia, David Rumelhart, Louis Leon Thurstone, Margaret Floy Washburn, and Robert S. Woodworth.
The memory-prediction framework is a theory of brain function created by Jeff Hawkins and described in his 2004 book On Intelligence. This theory concerns the role of the mammalian neocortex and its associations with the hippocampi and the thalamus in matching sensory inputs to stored memory patterns and how this process leads to predictions of what will happen in the future.
The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.
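As a concrete illustration, the following sketch matches SIFT features between two images using OpenCV's implementation (cv2.SIFT_create, available in OpenCV 4.4 and later); the file names are placeholders, and 0.75 is a commonly used threshold for Lowe's ratio test.

```python
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)  # template (placeholder path)
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # cluttered scene (placeholder path)

# Detect keypoints and compute 128-dimensional SIFT descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only matches that pass Lowe's ratio test,
# which discards ambiguous correspondences.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} confident matches")
```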
Ambiguous images or reversible figures are visual forms that create ambiguity by exploiting graphical similarities and other properties of visual system interpretation between two or more distinct image forms. They are famous for inducing the phenomenon of multistable perception, in which a single image gives rise to multiple interpretations, each stable on its own.
Geons are the simple 2D or 3D forms such as cylinders, bricks, wedges, cones, circles and rectangles corresponding to the simple parts of an object in Biederman's recognition-by-components theory. The theory proposes that the visual input is matched against structural representations of objects in the brain. These structural representations consist of geons and their relations. Only a modest number of geons are assumed. When combined in different relations to each other, together with coarse metric variation such as aspect ratio and 2D orientation, billions of possible 2- and 3-geon objects can be generated. Two classes of shape-based visual identification are not handled through geon representations: a) distinguishing between similar faces, and b) classifying shapes without definite boundaries, such as bushes or a crumpled garment. Typically, such identifications are not viewpoint-invariant.
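The matching step, comparing a viewed description against stored structural representations, can be sketched as a lookup keyed on parts and relations. In the hypothetical entries below, the same two geons yield a mug or a bucket depending on where the curved cylinder attaches, echoing Biederman's often-cited cup-versus-pail example.

```python
# Minimal sketch of matching visual input against stored structural
# representations (geons + relations). The entries are invented examples.
def signature(parts, relations):
    """Order-insensitive key for a structural description."""
    return (frozenset(parts), frozenset(relations))

memory = {
    signature({"cylinder", "curved_cylinder"},
              {("curved_cylinder", "side_attached", "cylinder")}): "mug",
    signature({"cylinder", "curved_cylinder"},
              {("curved_cylinder", "top_attached", "cylinder")}): "bucket",
}

# Same geons, different relation: the relation decides the object.
seen = signature({"cylinder", "curved_cylinder"},
                 {("curved_cylinder", "side_attached", "cylinder")})
print(memory.get(seen, "unknown"))  # -> mug
```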
Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.
Irving Biederman was an American vision scientist specializing in the study of brain processes underlying humans' ability to quickly recognize and interpret what they see. While best known for his recognition-by-components theory, which focuses on volumetric object recognition, his later work tended to examine the recognition of human faces. Biederman argued that face recognition is separate and distinct from the recognition of objects.
Concept learning, also known as category learning, concept attainment, and concept formation, is defined by Bruner, Goodnow, & Austin (1967) as "the search for and listing of attributes that can be used to distinguish exemplars from non-exemplars of various categories". More simply put, concepts are the mental categories that help us classify objects, events, or ideas, building on the understanding that each object, event, or idea has a set of common relevant features. Thus, concept learning is a strategy which requires a learner to compare and contrast groups or categories that contain concept-relevant features with groups or categories that do not contain concept-relevant features.
In psychology and cognitive neuroscience, pattern recognition describes a cognitive process that matches information from a stimulus with information retrieved from memory.
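One simple way to model such matching is nearest-neighbour comparison between a stimulus encoding and stored memory traces; the following sketch uses invented feature vectors purely for illustration.

```python
# Sketch of matching a stimulus against stored memory traces by
# nearest-neighbour comparison. All vectors are invented values.
stimulus = (0.9, 0.1, 0.4)
memory = {
    "letter A": (1.0, 0.0, 0.5),
    "letter H": (0.2, 0.9, 0.4),
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# The stimulus is recognized as the closest stored trace.
best = min(memory, key=lambda name: distance(stimulus, memory[name]))
print(best)  # -> letter A
```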
The Kadir–Brady saliency detector extracts features of objects in images that are distinct and representative. It was introduced by Timor Kadir and J. Michael Brady in 2001; an affine-invariant version followed from Kadir and Brady in 2004, and a robust version was designed by Shao et al. in 2007.
Object recognition – technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, even though the image of an object may vary with viewpoint, size, and scale, or when the object is translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades.
In computer vision, 3D object recognition involves recognizing and determining 3D information, such as the pose, volume, or shape, of user-chosen 3D objects in a photograph or range scan. Typically, an example of the object to be recognized is presented to a vision system in a controlled environment, and then for an arbitrary input such as a video stream, the system locates the previously presented object. This can be done either off-line, or in real-time. The algorithms for solving this problem are specialized for locating a single pre-identified object, and can be contrasted with algorithms which operate on general classes of objects, such as face recognition systems or 3D generic object recognition. Due to the low cost and ease of acquiring photographs, a significant amount of research has been devoted to 3D object recognition in photographs.
Visual object recognition refers to the ability to identify the objects in view based on visual input. One important signature of visual object recognition is "object invariance", or the ability to identify objects across changes in the detailed context in which objects are viewed, including changes in illumination, object pose, and background context.
The neuronal recycling hypothesis was proposed by Stanislas Dehaene in the field of cognitive neuroscience in an attempt to explain the underlying neural processes which allow humans to acquire recently invented cognitive capacities. This hypothesis was formulated in response to the 'reading paradox', which states that these cognitive processes are cultural inventions too modern to be the products of evolution. The paradox lies in the fact that cross-cultural evidence suggests specific brain areas are associated with these functions. The concept of neuronal recycling resolves this paradox by suggesting that novel functions actually utilize and 'recycle' existing brain circuitry. Once these cognitive functions find a cortical area devoted to a similar purpose, they can invade the existing circuit. Through plasticity, the cortex can adapt in order to accommodate these novel functions.
In machine learning and computer vision, M-theory is a learning framework inspired by feed-forward processing in the ventral stream of visual cortex and originally developed for recognition and classification of objects in visual scenes. M-theory was later applied to other areas, such as speech recognition. On certain image recognition tasks, algorithms based on a specific instantiation of M-theory, HMAX, achieved human-level performance.
In psychology, the face superiority effect refers to the phenomenon whereby individuals perceive and encode human faces in memory not as collections of single features but as one holistic, unified element. This phenomenon aids our visual system in the recognition of thousands of faces, a task that would be difficult if it were necessary to recognize sets of individual features and characteristics. However, this effect is limited to perceiving upright faces and does not occur when a face is at an unusual angle, such as when faces are upside-down or contorted, as in the Thatcher effect and pareidolia.
Ensemble coding, also known as ensemble perception or summary representation, is a theory in cognitive neuroscience about the internal representation of groups of objects in the human mind. Ensemble coding proposes that such information is recorded via summary statistics, particularly the average or variance. Experimental evidence tends to support the theory for low-level visual information, such as shapes and sizes, as well as some high-level features such as face gender. Nonetheless, the extent to which ensemble coding applies to high-level or non-visual stimuli remains unclear, and the theory is the subject of active research.
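The summary-statistics idea is easy to make concrete: instead of retaining each item, the representation keeps only an average and a variance. A minimal sketch with invented size values:

```python
# Sketch of ensemble coding: a set of items reduced to summary
# statistics (mean and variance). The sizes are invented values.
sizes = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0]  # e.g. perceived circle radii

n = len(sizes)
mean = sum(sizes) / n
variance = sum((s - mean) ** 2 for s in sizes) / n  # population variance

# The ensemble percept retains (mean, variance), not the six items.
print(f"mean={mean:.2f}, variance={variance:.3f}")
```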
An accidental viewpoint is a singular position from which an image can be perceived, creating either an ambiguous image or an illusion. The image perceived at this angle is viewpoint-specific, meaning it cannot be perceived from any other position; all other positions are known as generic or non-accidental viewpoints. These view-specific angles are involved in object recognition. In its uses in art and other visual illusions, the accidental viewpoint often creates the perception of depth on a two-dimensional surface with the assistance of monocular cues.