3D object recognition

Last updated March 31, 2020

In computer vision, 3D object recognition involves recognizing and determining 3D information, such as the pose, volume, or shape, of user-chosen 3D objects in a photograph or range scan. Typically, an example of the object to be recognized is presented to a vision system in a controlled environment, and then for an arbitrary input such as a video stream, the system locates the previously presented object. This can be done either off-line, or in real-time. The algorithms for solving this problem are specialized for locating a single pre-identified object, and can be contrasted with algorithms which operate on general classes of objects, such as face recognition systems or 3D generic object recognition. Due to the low cost and ease of acquiring photographs, a significant amount of research has been devoted to 3D object recognition in photographs.

3D single-object recognition in photographs

The method of recognizing a 3D object depends on the properties of an object. For simplicity, many existing algorithms have focused on recognizing rigid objects consisting of a single part, that is, objects whose spatial transformation is a Euclidean motion. Two general approaches have been taken to the problem: pattern recognition approaches use low-level image appearance information to locate an object, while feature-based geometric approaches construct a model for the object to be recognized, and match the model against the photograph.

Pattern recognition approaches

These methods use appearance information gathered from pre-captured or pre-computed projections of an object to match the object in the potentially cluttered scene. However, they do not take the 3D geometric constraints of the object into consideration during matching, and typically also do not handle occlusion as well as feature-based approaches. See [Murase and Nayar 1995] and [Selinger and Nelson 1999].

Feature-based geometric approaches

An example of a detected feature in an image. Blue indicates the center of the feature, the red ellipse indicates the characteristic scale identified by the feature detector, and the green parallelogram is constructed from the coordinates of the ellipse as per [Lowe 2004]. Feature example.png — An example of a detected feature in an image. Blue indicates the center of the feature, the red ellipse indicates the characteristic scale identified by the feature detector, and the green parallelogram is constructed from the coordinates of the ellipse as per [Lowe 2004].

Feature-based approaches work well for objects which have distinctive features. Thus far, objects which have good edge features or blob features have been successfully recognized; for example detection algorithms, see Harris affine region detector and SIFT, respectively. Due to lack of the appropriate feature detectors, objects without textured, smooth surfaces cannot currently be handled by this approach.

Feature-based object recognizers generally work by pre-capturing a number of fixed views of the object to be recognized, extracting features from these views, and then in the recognition process, matching these features to the scene and enforcing geometric constraints.

As an example of a prototypical system taking this approach, we will present an outline of the method used by [Rothganger et al. 2004], with some detail elided. The method starts by assuming that objects undergo globally rigid transformations. Because smooth surfaces are locally planar, affine invariant features are appropriate for matching: the paper detects ellipse-shaped regions of interest using both edge-like and blob-like features, and as per [Lowe 2004], finds the dominant gradient direction of the ellipse, converts the ellipse into a parallelogram, and takes a SIFT descriptor on the resulting parallelogram. Color information is used also to improve discrimination over SIFT features alone.

Next, given a number of camera views of the object (24 in the paper), the method constructs a 3D model for the object, containing the 3D spatial position and orientation of each feature. Because the number of views of the object is large, typically each feature is present in several adjacent views. The center points of such matching features correspond, and detected features are aligned along the dominant gradient direction, so the points at (1, 0) in the local coordinate system of the feature parallelogram also correspond, as do the points (0, 1) in the parallelogram's local coordinates. Thus for every pair of matching features in nearby views, three point pair correspondences are known. Given at least two matching features, a multi-view affine structure from motion algorithm (see [Tomasi and Kanade 1992]) can be used to construct an estimate of points positions (up to an arbitrary affine transformation). The paper of Rothganger et al. therefore selects two adjacent views, uses a RANSAC-like method to select two corresponding pairs of features, and adds new features to the partial model built by RANSAC so long as they are under an error term. Thus for any given pair of adjacent views, the algorithm creates a partial model of all features visible in both views.

Final merged model of features for the teddy bear, after Euclidean upgrade. For recognition, this model is matched against a photograph of the scene using RANSAC. Taken from [Rothganger et al. 2004]. Features full 3d.png — Final merged model of features for the teddy bear, after Euclidean upgrade. For recognition, this model is matched against a photograph of the scene using RANSAC. Taken from [Rothganger et al. 2004].

To produce a unified model, the paper takes the largest partial model, and incrementally aligns all smaller partial models to it. Global minimization is used to reduce the error, then a Euclidean upgrade is used to change the model's feature positions from 3D coordinates unique up to affine transformation to 3D coordinates that are unique up to Euclidean motion. At the end of this step, one has a model of the target object, consisting of features projected into a common 3D space.

To recognize an object in an arbitrary input image, the paper detects features, and then uses RANSAC to find the affine projection matrix which best fits the unified object model to the 2D scene. If this RANSAC approach has sufficiently low error, then on success, the algorithm both recognizes the object and gives the object's pose in terms of an affine projection. Under the assumed conditions, the method typically achieves recognition rates of around 95%.

Related Research Articles

Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the human visual system can do.

In computer science, geometric hashing is originally a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation, though extensions exist to some other object representations and transformations. In an off-line step, the objects are encoded by treating each pair of points as a geometric basis. The remaining points can be represented in an invariant fashion with respect to this basis using two parameters. For each point, its quantized transformed coordinates are stored in the hash table as a key, and indices of the basis points as a value. Then a new pair of basis points is selected, and the process is repeated. In the on-line (recognition) step, randomly selected pairs of data points are considered as candidate bases. For each candidate basis, the remaining data points are encoded according to the basis and possible correspondences from the object are found in the previously constructed table. The candidate basis is accepted if a sufficiently large number of the data points index a consistent object basis.

The scale-invariant feature transform (SIFT) is a feature detection algorithm in computer vision to detect and describe local features in images. It was patented in Canada by the University of British Columbia and published by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.

In computer vision and image processing feature detection includes methods for computing abstractions of image information and making local decisions at every image point whether there is an image feature of a given type at that point or not. The resulting features will be subsets of the image domain, often in the form of isolated points, continuous curves or connected regions.

Motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another; usually from adjacent frames in a video sequence. It is an ill-posed problem as the motion is in three dimensions but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel. The motion vectors may be represented by a translational model or many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.

Image stitching or photo stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image. Commonly performed through the use of computer software, most approaches to image stitching require nearly exact overlaps between images and identical exposures to produce seamless results, although some stitching algorithms actually benefit from differently exposed images by doing high-dynamic-range imaging in regions of overlap. Some digital cameras can stitch their photos internally.

In computer vision and image processing, a feature is a piece of information which is relevant for solving the computational task related to a certain application. This is the same sense as feature in machine learning and pattern recognition generally, though image processing has a very sophisticated collection of features. Features may be specific structures in the image such as points, edges or objects. Features may also be the result of a general neighborhood operation or feature detection applied to the image.

The following outline is provided as an overview of and topical guide to computer vision:

Structure from Motion (SfM) is a photogrammetric range imaging technique for estimating three-dimensional structures from two-dimensional image sequences that may be coupled with local motion signals. It is studied in the fields of computer vision and visual perception. In biological vision, SfM refers to the phenomenon by which humans can recover 3D structure from the projected 2D (retinal) motion field of a moving object or scene.

In computer vision, blob detection methods are aimed at detecting regions in a digital image that differ in properties, such as brightness or color, compared to surrounding regions. Informally, a blob is a region of an image in which some properties are constant or approximately constant; all the points in a blob can be considered in some sense to be similar to each other. The most common method for blob detection is convolution.

Interest point detection is a recent terminology in computer vision that refers to the detection of interest points for subsequent processing. An interest point is a point in the image which in general can be characterized as follows:

In computer vision, speeded up robust features (SURF) is a patented local feature detector and descriptor. It can be used for tasks such as object recognition, image registration, classification, or 3D reconstruction. It is partly inspired by the scale-invariant feature transform (SIFT) descriptor. The standard version of SURF is several times faster than SIFT and claimed by its authors to be more robust against different image transformations than SIFT.

The Kadir–Brady saliency detector extracts features of objects in images that are distinct and representative. It was invented by Timor Kadir and J. Michael Brady in 2001 and an affine invariant version was introduced by Kadir and Brady in 2004 and a robust version was designed by Shao et al. in 2007.

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

In computer vision, the bag-of-words model can be applied to image classification, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

The following outline is provided as an overview of and topical guide to object recognition:

In the fields of computer vision and image analysis, the Harris affine region detector belongs to the category of feature detection. Feature detection is a preprocessing step of several algorithms that rely on identifying characteristic points or interest points so to make correspondences between images, recognize textures, categorize objects or build panoramas.

In computer vision, maximally stable extremal regions (MSER) are used as a method of blob detection in images. This technique was proposed by Matas et al. to find correspondences between image elements from two images with different viewpoints. This method of extracting a comprehensive number of corresponding image elements contributes to the wide-baseline matching, and it has led to better stereo matching and object recognition algorithms.

The principal curvature-based region detector, also called PCBR is a feature detector used in the fields of computer vision and image analysis. Specifically the PCBR detector is designed for object recognition applications.

In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques labels the pixels to be a part of pixels with certain characteristics at a particular time. Here, the pixels are segmented depending on its relative movement over a period of time i.e. the time of the video sequence.

References

Murase, H. and S. K. Nayar: 1995, Visual Learning and Recognition of 3-D Objects from Appearance. International Journal of Computer Vision 14, 5–24.
Selinger, A. and R. Nelson: 1999, A Perceptual Grouping Hierarchy for Appearance-Based 3D Object Recognition. Computer Vision and Image Understanding 76(1), 83–92.
Rothganger, F; S. Lazebnik, C. Schmid, and J. Ponce: 2004. 3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints, ICCV.
Lowe, D.: 2004, Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. In press.
Tomasi, C. and T. Kanade: 1992, Shape and Motion from Image Streams: a Factorization Method. International Journal of Computer Vision 9(2), 137–154.