Part-based models

Part-based models are a broad class of detection algorithms used on images, in which various parts of the image are analyzed separately in order to determine whether, and where, an object of interest exists. Among these methods, a very popular one is the constellation model, which covers schemes that detect a small number of features and their relative positions, and from these determine whether or not the object of interest is present.

These models build on the original idea of Fischler and Elschlager [1] of using the relative positions of a few template matches, and grow in complexity in the work of Perona and others. [2] Such schemes are described in more detail in the constellation model article. An example illustrates what is meant by a constellation model. Suppose we are trying to detect faces. A constellation model would use smaller part detectors, for instance mouth, nose and eye detectors, and make a judgment about whether an image contains a face based on the relative positions at which those components fire, as sketched below.
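
The following is a minimal sketch of the scoring step only, not an implementation of any cited system. It assumes hypothetical part detectors have already returned one (x, y) location per part; the mean offsets, tolerance and threshold are illustrative values, not learned parameters.

```python
import numpy as np

# Illustrative mean offsets of each part from a reference part (say, the
# nose bridge), with an isotropic positional tolerance in pixels.
MEAN_OFFSETS = {"left_eye": (-15.0, -20.0),
                "right_eye": (15.0, -20.0),
                "mouth": (0.0, 25.0)}
SIGMA = 8.0

def constellation_score(reference_xy, detections):
    """Gaussian log-likelihood (up to a constant) of the part layout."""
    rx, ry = reference_xy
    score = 0.0
    for part, (mx, my) in MEAN_OFFSETS.items():
        px, py = detections[part]
        dx, dy = px - rx - mx, py - ry - my
        score += -(dx * dx + dy * dy) / (2.0 * SIGMA ** 2)
    return score

# Declare a face present when the configuration scores above a threshold
# tuned on validation data (the threshold here is arbitrary).
detections = {"left_eye": (85.0, 60.0), "right_eye": (115.0, 61.0), "mouth": (101.0, 104.0)}
print(constellation_score((100.0, 80.0), detections) > -10.0)  # True
```

A full constellation model would additionally search over candidate reference positions, model part appearance as well as geometry, and handle missing parts.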

Non-constellation models

Many overlapping ideas are included under the title of part-based models even after those of the constellation variety are excluded. The uniting thread is the use of small parts to build up an algorithm that can detect or recognize an item (a face, a car, etc.). Early efforts, such as those by Yuille, Hallinan and Cohen, [3] sought to detect facial features and fit deformable templates to them. These templates were mathematically defined outlines that sought to capture the position and shape of the feature. Their algorithm has trouble finding the global minimum fit for a given model, so templates occasionally become mismatched.
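
As a toy stand-in for deformable-template fitting, the sketch below matches a circle template (cx, cy, r) to a synthetic edge map by minimizing an energy that rewards edge strength along the template boundary. The grid search replaces the paper's gradient-based minimization of image potentials; both can settle into local minima, which is the mismatch problem noted above.

```python
import numpy as np

H, W = 64, 64
yy, xx = np.mgrid[0:H, 0:W]
# Synthetic edge map containing a circular edge at center (30, 34), radius 10.
edge = (np.abs(np.hypot(xx - 30, yy - 34) - 10) < 1.0).astype(float)

def energy(cx, cy, r, n=64):
    """Negative mean edge strength along the circle: lower is a better fit."""
    t = np.linspace(0, 2 * np.pi, n, endpoint=False)
    px = np.clip(np.rint(cx + r * np.cos(t)).astype(int), 0, W - 1)
    py = np.clip(np.rint(cy + r * np.sin(t)).astype(int), 0, H - 1)
    return -edge[py, px].mean()

best = min(((energy(cx, cy, r), cx, cy, r)
            for cx in range(25, 36)
            for cy in range(29, 40)
            for r in range(7, 14)),
           key=lambda s: s[0])
print(best)  # energy -1.0 at (30, 34, 10): the template locks onto the edge
```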

Later efforts, such as those by Brunelli and Poggio, [4] focus on building specific detectors for each feature. They use successive detectors to estimate scale, position, and so on, narrowing the search field for the next detector. As such theirs is a part-based model, although they seek to recognize specific faces rather than merely to detect the presence of one. They do so by using each detector to build a 35-element vector of characteristics of a given face. These characteristics can then be compared to recognize specific faces, although cut-offs can also be used to detect whether a face is present at all. [5]
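
A schematic of this recognize-then-threshold idea, in the spirit of (but not taken from) Brunelli and Poggio: known faces are stored as 35-element vectors, a query is matched by nearest neighbor, and a distance cut-off turns the recognizer into a detector. The vectors and cut-off below are synthetic stand-ins; real entries would be geometric measurements produced by the feature detectors.

```python
import numpy as np

gallery = {  # known identities -> stored 35-element characteristic vectors
    "alice": np.random.default_rng(0).normal(size=35),
    "bob": np.random.default_rng(1).normal(size=35),
}

def identify(query, cutoff=4.0):
    """Return the closest stored identity, or None if nothing is close enough."""
    name, dist = min(((n, np.linalg.norm(query - v)) for n, v in gallery.items()),
                     key=lambda p: p[1])
    return name if dist < cutoff else None

# A noisy re-measurement of "alice" is still recognized; an unrelated vector
# typically fails the cut-off, i.e. no known face is present.
probe = gallery["alice"] + 0.1 * np.random.default_rng(2).normal(size=35)
print(identify(probe))                                     # alice
print(identify(np.random.default_rng(3).normal(size=35)))  # None (usually)
```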

Cootes, Lanitis and Taylor [6] build on this work, constructing a 100-element representation of the primary features of a face. The model is more detailed and robust, although given the additional complexity (100 elements compared to 35) this might be expected. The model essentially computes deviations from a mean face in terms of shape, orientation and gray level, and is matched by minimizing an error function. These three classes of algorithms fall naturally within the scope of template matching. [7]
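
The matching computation can be sketched as a least-squares fit: the face is encoded as a mean vector plus a linear combination of deviation modes, and fitting minimizes the squared error between model and observation. The mean and modes here are random orthonormal stand-ins, showing only the shape of the computation, not the published model.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_face = rng.normal(size=100)                     # 100-element mean representation
modes, _ = np.linalg.qr(rng.normal(size=(100, 10)))  # 10 orthonormal deviation modes

def fit(observed):
    """Parameters b minimizing ||mean_face + modes @ b - observed||^2."""
    b = modes.T @ (observed - mean_face)  # closed form when modes are orthonormal
    residual = np.linalg.norm(mean_face + modes @ b - observed)
    return b, residual

# A synthetic face generated by the model itself is matched with zero error.
observed = mean_face + modes @ rng.normal(size=10)
b, err = fit(observed)
print(round(err, 12))  # 0.0
```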

Of the non-constellation models, perhaps the most successful is that of Leibe and Schiele. [8] [9] Their algorithm finds templates associated with positive examples and records both the template (an average of the feature over all positive examples in which it is present) and the position of the center of the item (a face, for instance) relative to the template. The algorithm then takes a test image and runs an interest point locator (ideally one of the scale-invariant variety). These interest points are compared to each template and the probability of a match is computed. Each template then casts votes for the center of the detected object, weighted by the probability of the match and the probability with which the template predicts the center. The votes are summed, and if there are enough of them, clustered closely enough together, the presence of the object in question (i.e. a face or a car) is predicted.
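
The voting stage amounts to a generalized Hough transform, sketched compactly below. Match probabilities are assumed to be given, and the three-entry codebook is hypothetical; Leibe and Schiele's implicit shape model additionally smooths the votes and locates maxima with mean shift rather than taking a single argmax.

```python
import numpy as np

H, W = 120, 160
codebook = [  # offset from the matched feature to the object center, plus a weight
    {"offset": (-20, 0), "weight": 0.9},
    {"offset": (20, 0), "weight": 0.8},
    {"offset": (0, 25), "weight": 0.7},
]

def vote(matches):
    """Accumulate center votes. matches: list of ((x, y), codebook_index, p_match)."""
    acc = np.zeros((H, W))
    for (fx, fy), idx, p in matches:
        ox, oy = codebook[idx]["offset"]
        cx, cy = fx + ox, fy + oy
        if 0 <= cy < H and 0 <= cx < W:
            acc[cy, cx] += p * codebook[idx]["weight"]
    return acc

# Three interest points that agree on a common center produce one strong peak;
# the object is declared present if the peak clears a threshold.
acc = vote([((80, 60), 0, 0.9), ((40, 60), 1, 0.8), ((60, 35), 2, 0.95)])
peak = np.unravel_index(acc.argmax(), acc.shape)
print(peak, acc[peak])  # (60, 60) with summed vote mass of about 2.1
```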

The algorithm is effective because it imposes much less rigidity than the constellation model does. Admittedly, the constellation model can be modified to allow for occlusions and other large abnormalities, but this model is naturally suited to them. It must also be said that the more rigid structure of a constellation is sometimes desirable.

Related Research Articles

Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches include the use of machine learning, owing to the increased availability of big data and a new abundance of processing power. Pattern recognition and machine learning can be viewed as two facets of the same field, and both have undergone substantial development over the past few decades.

<span class="mw-page-title-main">Image segmentation</span> Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

Template matching is a technique in digital image processing for finding small parts of an image which match a template image. It can be used for quality control in manufacturing, navigation of mobile robots, or edge detection in images.

The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.

<span class="mw-page-title-main">Gesture recognition</span> Topic in computer science and language technology

Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. It is a subdiscipline of computer vision. Gestures can originate from any bodily motion or state, but commonly originate from the face or hands. Focuses in the field include emotion recognition from the face and hand gesture recognition. Users can make simple gestures to control or interact with devices without physically touching them. Many approaches use cameras and computer vision algorithms to interpret sign language, although the identification and recognition of posture, gait, proxemics, and human behaviors are also subjects of gesture recognition techniques. Gesture recognition can be seen as a way for computers to begin to understand human body language, building a richer bridge between machines and humans than older text-based user interfaces or even GUIs, which still limit the majority of input to keyboard and mouse, and allowing users to interact naturally without any mechanical devices.

The condensation algorithm is a computer vision algorithm. The principal application is to detect and track the contour of objects moving in a cluttered environment. Object tracking is one of the more basic and difficult aspects of computer vision and is generally a prerequisite to object recognition. Being able to identify which pixels in an image make up the contour of an object is a non-trivial problem. Condensation is a probabilistic algorithm that attempts to solve this problem.

In computer vision and image processing, a feature is a piece of information about the content of an image; typically about whether a certain region of the image has certain properties. Features may be specific structures in the image such as points, edges or objects. Features may also be the result of a general neighborhood operation or feature detection applied to the image. Other examples of features are related to motion in image sequences, or to shapes defined in terms of curves or boundaries between different image regions.

An active appearance model (AAM) is a computer vision algorithm for matching a statistical model of object shape and appearance to a new image. The model is built during a training phase, in which a set of images, together with coordinates of landmarks that appear in all of the images, is provided to the training supervisor.

Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, consisting of joints and rigid parts, from image-based observations. It is one of the longest-standing problems in computer vision because of the complexity of the models that relate observations to pose, and because of the variety of situations in which it would be useful.

The Kadir–Brady saliency detector extracts features of objects in images that are distinct and representative. It was introduced by Timor Kadir and J. Michael Brady in 2001; an affine-invariant version was introduced by Kadir and Brady in 2004, and a robust version was designed by Shao et al. in 2007.

In computer vision, the bag-of-words model, sometimes called the bag-of-visual-words model, can be applied to image classification or retrieval by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words, that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

Object recognition is technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, despite the fact that the image of an object may vary somewhat across viewpoints, sizes and scales, or even when the object is translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades.

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and is most applicable to techniques involving image recognition, classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of the objects in each image, along with a Matlab script for viewing them.

One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

<span class="mw-page-title-main">Object detection</span>

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.

<span class="mw-page-title-main">Pedestrian detection</span>

Pedestrian detection is an essential and significant task in any intelligent video surveillance system, as it provides fundamental information for the semantic understanding of video footage. It has an obvious extension to automotive applications due to its potential for improving safety systems; as of 2017, many car manufacturers offer it as an ADAS option.

In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories in a video sequence into coherent subsets of space and time, with the subsets corresponding to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Whereas image segmentation labels pixels that share certain characteristics at a particular time, here pixels are segmented according to their relative movement over a period of time, i.e. over the video sequence.

<span class="mw-page-title-main">Alan Yuille</span> English academic

Alan Yuille is a Bloomberg Distinguished Professor of Computational Cognitive Science with appointments in the departments of Cognitive Science and Computer Science at Johns Hopkins University. Yuille develops models of vision and cognition for computers, intended for creating artificial vision systems. He completed a PhD in theoretical physics under Stephen Hawking at the University of Cambridge in 1981.

Bernt Schiele is a German computer scientist. He is Max Planck Director at the Max Planck Institute for Informatics and professor at Saarland University. He is known for his work in the field of computer vision and perceptual computing.

References

  1. Fischler, M.A.; Elschlager, R.A. (1973). "The Representation and Matching of Pictorial Structures". IEEE Transactions on Computers. C-22: 67–92. doi:10.1109/T-C.1973.223602.
  2. Fergus, R.; Perona, P.; Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Vol. 2. pp. II–264. doi:10.1109/CVPR.2003.1211479. ISBN 0-7695-1900-8.
  3. Yuille, Alan L.; Hallinan, Peter W.; Cohen, David S. (1992). "Feature extraction from faces using deformable templates". International Journal of Computer Vision. 8 (2): 99. doi:10.1007/BF00127169.
  4. Brunelli, R.; Poggio, T. (1993). "Face recognition: Features versus templates". IEEE Transactions on Pattern Analysis and Machine Intelligence. 15 (10): 1042. doi:10.1109/34.254061.
  5. Simonite, Tom. "Photo Algorithms ID White Men Fine—Black Women, Not So Much". Wired. ISSN 1059-1028. Retrieved 2023-04-17.
  6. Lanitis, A.; Taylor, C.J.; Cootes, T.F. (1995). A unified approach to coding and interpreting face images. IEEE International Conference on Computer Vision. p. 368. doi:10.1109/ICCV.1995.466919. ISBN 0-8186-7042-8.
  7. Brunelli, R. (2009). Template Matching Techniques in Computer Vision: Theory and Practice. Wiley. ISBN 978-0-470-51706-2.
  8. Leibe, Bastian; Leonardis, Aleš; Schiele, Bernt (2007). "Robust Object Detection with Interleaved Categorization and Segmentation". International Journal of Computer Vision. 77 (1–3): 259–289. CiteSeerX 10.1.1.111.464. doi:10.1007/s11263-007-0095-3.
  9. Leibe, Bastian; Leonardis, Ales; Schiele, Bernt (2006). "An Implicit Shape Model for Combined Object Categorization and Segmentation". Toward Category-Level Object Recognition. Lecture Notes in Computer Science. Vol. 4170. p. 508. CiteSeerX 10.1.1.5.6272. doi:10.1007/11957959_26. ISBN 978-3-540-68794-8.