Shape context

Shape context is a feature descriptor used in object recognition. Serge Belongie and Jitendra Malik proposed the term in their paper "Matching with Shape Contexts" in 2000. [1]

Theory

The shape context is intended to be a way of describing shapes that allows for measuring shape similarity and the recovering of point correspondences. [1] The basic idea is to pick n points on the contours of a shape. For each point p_i on the shape, consider the n − 1 vectors obtained by connecting p_i to all other points. The set of all these vectors is a rich description of the shape localized at that point but is far too detailed. The key idea is that the distribution over relative positions is a robust, compact, and highly discriminative descriptor. So, for the point p_i, the coarse histogram of the relative coordinates of the remaining n − 1 points,

h_i(k) = \#\{ q \neq p_i : (q - p_i) \in \mathrm{bin}(k) \},

is defined to be the shape context of p_i. The bins are normally taken to be uniform in log-polar space. The fact that the shape context is a rich and discriminative descriptor can be seen in the figure below, in which the shape contexts of two different versions of the letter "A" are shown.

[Figure: shape context computation for two versions of the letter "A"]

(a) and (b) are the sampled edge points of the two shapes. (c) is the diagram of the log-polar bins used to compute the shape context. (d) is the shape context for the point marked with a circle in (a), (e) is that for the point marked as a diamond in (b), and (f) is that for the triangle. As can be seen, since (d) and (e) are the shape contexts for two closely related points, they are quite similar, while the shape context in (f) is very different.

For a feature descriptor to be useful, it needs to have certain invariances. In particular, it needs to be invariant to translation, scaling, small perturbations, and, depending on the application, rotation. Translational invariance comes naturally to the shape context. Scale invariance is obtained by normalizing all radial distances by the mean distance between all point pairs in the shape, [2] [3] although the median distance can also be used. [1] [4] Shape contexts have been shown empirically to be robust to deformations, noise, and outliers [4] in synthetic point set matching experiments. [5]
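As a rough illustration, a log-polar shape context descriptor with mean-distance normalization might be computed as in the following Python/NumPy sketch. The function name shape_contexts, the bin counts (5 radial × 12 angular) and the radial limits are illustrative choices, not values prescribed by the references.

```python
import numpy as np

def shape_contexts(points, n_radial=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Compute a log-polar shape context histogram for each sampled point.

    points: (n, 2) array of contour points.
    Returns an (n, n_radial * n_theta) array of normalized histograms.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise offsets, distances and angles: diff[i, j] = p_j - p_i.
    diff = points[None, :, :] - points[:, None, :]
    dists = np.hypot(diff[..., 0], diff[..., 1])
    angles = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Scale invariance: normalize by the mean pairwise distance.
    mean_dist = dists[~np.eye(n, dtype=bool)].mean()
    dists = dists / mean_dist

    # Log-spaced radial edges and uniform angular edges (log-polar bins).
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_radial + 1)
    t_edges = np.linspace(0.0, 2 * np.pi, n_theta + 1)

    histograms = np.zeros((n, n_radial * n_theta))
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False                      # exclude the point itself
        h, _, _ = np.histogram2d(dists[i, mask], angles[i, mask],
                                 bins=[r_edges, t_edges])
        hist = h.ravel()
        if hist.sum() > 0:
            hist = hist / hist.sum()         # normalized K-bin histogram
        histograms[i] = hist
    return histograms
```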

One can provide complete rotational invariance in shape contexts. One way is to measure angles at each point relative to the direction of the tangent at that point (since the points are chosen on edges). This results in a completely rotationally invariant descriptor. But of course this is not always desired, since some local features lose their discriminative power if not measured relative to the same frame. Many applications in fact rule out rotational invariance, e.g. distinguishing a "6" from a "9".

Use in shape matching

A complete system that uses shape contexts for shape matching consists of the following steps (which will be covered in more detail in the Details of Implementation section):

  1. Randomly select a set of points that lie on the edges of a known shape and another set of points on an unknown shape.
  2. Compute the shape context of each point found in step 1.
  3. Match each point from the known shape to a point on an unknown shape. To minimize the cost of matching, first choose a transformation (e.g. affine, thin plate spline, etc.) that warps the edges of the known shape to the unknown (essentially aligning the two shapes). Then select the point on the unknown shape that most closely corresponds to each warped point on the known shape.
  4. Calculate the "shape distance" between each pair of points on the two shapes. Use a weighted sum of the shape context distance, the image appearance distance, and the bending energy (a measure of how much transformation is required to bring the two shapes into alignment).
  5. To identify the unknown shape, use a nearest-neighbor classifier to compare its shape distance to shape distances of known objects.

Details of implementation

Step 1: Finding a list of points on shape edges

The approach assumes that the shape of an object is essentially captured by a finite subset of the points on its internal or external contours. These can be obtained simply by running the Canny edge detector and picking a random set of points from the edges. Note that these points need not, and in general do not, correspond to key points such as maxima of curvature or inflection points. It is preferable to sample the shape with roughly uniform spacing, though this is not critical. [2]
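A minimal sketch of this sampling step, assuming OpenCV's Canny detector and a grayscale uint8 input image; the thresholds, the sample size and the function name sample_edge_points are illustrative:

```python
import cv2
import numpy as np

def sample_edge_points(gray_image, n_points=100, low=50, high=150, seed=0):
    """Pick roughly n_points random points from the Canny edges of an image."""
    edges = cv2.Canny(gray_image, low, high)            # binary edge map
    ys, xs = np.nonzero(edges)                          # edge pixel coordinates
    coords = np.stack([xs, ys], axis=1).astype(float)   # as (x, y) pairs

    rng = np.random.default_rng(seed)
    if len(coords) > n_points:
        idx = rng.choice(len(coords), size=n_points, replace=False)
        coords = coords[idx]
    return coords
```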

Step 2: Computing the shape context

This step is described in detail in the Theory section.

Step 3: Computing the cost matrix

Consider two points p and q that have normalized K-bin histograms (i.e. shape contexts) g(k) and h(k). As shape contexts are distributions represented as histograms, it is natural to use the χ² test statistic as the "shape context cost" of matching the two points:

C_S = \frac{1}{2} \sum_{k=1}^{K} \frac{\left[ g(k) - h(k) \right]^2}{g(k) + h(k)}

The values of this range from 0 to 1. [1] In addition to the shape context cost, an extra cost based on the appearance can be added. For instance, it could be a measure of tangent angle dissimilarity (particularly useful in digit recognition):

C_A = \frac{1}{2} \left\| \begin{pmatrix} \cos\theta_1 \\ \sin\theta_1 \end{pmatrix} - \begin{pmatrix} \cos\theta_2 \\ \sin\theta_2 \end{pmatrix} \right\|

This is half the length of the chord in the unit circle between the unit vectors with angles θ_1 and θ_2 (the tangent angles at p and q). Its values also range from 0 to 1. Now the total cost of matching the two points could be a weighted sum of the two costs, for some weight β in [0, 1]:

C = (1 - \beta)\, C_S + \beta\, C_A

Now for each point p_i on the first shape and each point q_j on the second shape, calculate the cost as described and call it C_{i,j}. This is the cost matrix.
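A sketch of the χ² cost-matrix computation for two sets of shape context histograms; chi2_cost_matrix and the small eps constant (which guards against empty bins) are illustrative names and choices:

```python
import numpy as np

def chi2_cost_matrix(G, H, eps=1e-10):
    """Chi-squared shape context cost between every pair of histograms.

    G: (n, K) histograms for shape 1; H: (m, K) histograms for shape 2.
    Returns the (n, m) cost matrix with entries in [0, 1].
    """
    g = G[:, None, :]              # shape (n, 1, K)
    h = H[None, :, :]              # shape (1, m, K)
    return 0.5 * np.sum((g - h) ** 2 / (g + h + eps), axis=2)
```

An appearance term, weighted by β, could be added to each entry before the matching step.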

Step 4: Finding the matching that minimizes total cost

[Figure: example matching results obtained with the shape context descriptor]

Now a one-to-one matching π(i) that matches each point p_i on shape 1 to a point q_{π(i)} on shape 2 and minimizes the total cost of matching,

H(\pi) = \sum_{i} C\left( p_i, q_{\pi(i)} \right),

is needed. This can be done in O(N^3) time using the Hungarian method, although there are more efficient algorithms. [6] To have robust handling of outliers, one can add "dummy" nodes with a constant but reasonably large matching cost to the cost matrix. This causes the matching algorithm to match an outlier to a "dummy" whenever there is no real match.
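The assignment itself can be delegated to an off-the-shelf solver. The sketch below pads the cost matrix with dummy nodes and uses SciPy's linear_sum_assignment; the dummy cost value is an arbitrary illustration of the constant cost mentioned above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_dummies(cost, dummy_cost=0.25):
    """Minimum-cost one-to-one matching with dummy nodes for outlier handling.

    cost: (n, m) cost matrix between the two point sets.  The matrix is padded
    to (n + m) x (n + m); real-to-dummy entries get the constant dummy_cost and
    dummy-to-dummy entries cost nothing, so unmatched points go to dummies.
    Returns the list of (i, j) pairs where both points are real.
    """
    n, m = cost.shape
    padded = np.full((n + m, n + m), dummy_cost)
    padded[:n, :m] = cost          # real-to-real costs
    padded[n:, m:] = 0.0           # dummy-to-dummy costs
    rows, cols = linear_sum_assignment(padded)
    return [(i, j) for i, j in zip(rows, cols) if i < n and j < m]
```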

Step 5: Modeling transformation

Given the set of correspondences between a finite set of points on the two shapes, a transformation can be estimated to map any point from one shape to the other. There are several choices for this transformation, described below.

Affine

The affine model is a standard choice: T(p) = Ap + o. The least squares solution for the matrix A and the translational offset vector o is obtained by:

o = \frac{1}{n} \sum_{i=1}^{n} \left( q_i - p_i \right), \qquad A = \left( Q^{+} P \right)^{\mathrm{T}},

where P and Q contain the homogeneous coordinates of the sampled points of the two shapes, i.e.

P = \begin{pmatrix} 1 & p_{11} & p_{12} \\ \vdots & \vdots & \vdots \\ 1 & p_{n1} & p_{n2} \end{pmatrix},

with a similar expression for Q. Q^{+} is the pseudoinverse of Q.
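A least-squares affine fit can also be written compactly in homogeneous coordinates, as in the sketch below; this is a standard formulation of the same idea rather than a literal transcription of the closed-form expression above, and fit_affine is a name introduced here:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map T(p) = A p + o taking src points toward dst.

    src, dst: (n, 2) arrays of corresponding points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coordinates
    # Solve X @ M ~= dst for the (3, 2) parameter matrix M in the least-squares sense.
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)
    A = M[:2, :].T                                 # 2x2 linear part
    o = M[2, :]                                    # translation offset
    return A, o
```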

Thin plate spline

The thin plate spline (TPS) model is the most widely used model for transformations when working with shape contexts. A 2D transformation can be separated into two TPS functions to model a coordinate transform:

T(x, y) = \left( f_x(x, y),\ f_y(x, y) \right)

where each of f_x and f_y has the form:

f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} \omega_i\, U\!\left( \left\| (x_i, y_i) - (x, y) \right\| \right),

and the kernel function U(r) is defined by U(r) = r^2 \log r^2. The exact details of how to solve for the parameters can be found elsewhere, [7] [8] but it essentially involves solving a linear system of equations. The bending energy (a measure of how much transformation is needed to align the points) is also easily obtained from this solution.
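The sketch below assembles and solves the standard TPS linear system for one coordinate function and reads off the bending energy; fit_tps, tps_kernel and the reg argument (which anticipates the regularized variant described in the next subsection) are names and choices introduced here for illustration:

```python
import numpy as np

def tps_kernel(r):
    """TPS radial kernel U(r) = r^2 log r^2, with U(0) = 0."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = (r[nz] ** 2) * np.log(r[nz] ** 2)
    return out

def fit_tps(control, values, reg=0.0):
    """Fit one TPS function f with f(x_i, y_i) ~= values[i].

    control: (n, 2) control points; values: (n,) target coordinates.
    reg is the regularization parameter (0 gives exact interpolation).
    Returns the kernel weights w, the affine coefficients a, and the
    bending energy (up to a constant factor).
    """
    control = np.asarray(control, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(control)

    dists = np.linalg.norm(control[:, None, :] - control[None, :, :], axis=2)
    K = tps_kernel(dists)                          # (n, n) kernel matrix
    P = np.hstack([np.ones((n, 1)), control])      # (n, 3) affine basis [1, x, y]

    # Standard TPS system: [[K + reg*I, P], [P^T, 0]] [w; a] = [v; 0].
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K + reg * np.eye(n)
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.concatenate([values, np.zeros(3)])
    sol = np.linalg.solve(A, b)
    w, a = sol[:n], sol[n:]

    bending_energy = float(w @ K @ w)
    return w, a, bending_energy
```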

Regularized TPS

The TPS formulation above has an exact matching requirement for the pairs of points on the two shapes. For noisy data, it is best to relax this exact requirement. If we let v_i denote the target function values at the corresponding locations (x_i, y_i) (note that for f_x, v_i would be the x-coordinate of the point corresponding to (x_i, y_i), and for f_y it would be the y-coordinate), relaxing the requirement amounts to minimizing

H[f] = \sum_{i=1}^{n} \left( v_i - f(x_i, y_i) \right)^2 + \lambda I_f,

where I_f is the bending energy and λ is called the regularization parameter. The f that minimizes H[f] can be found in a fairly straightforward way. [9] If one uses normalized coordinates for the points, then scale invariance is kept. However, if one uses the original non-normalized coordinates, then the regularization parameter needs to be normalized as well.
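In terms of the fit_tps sketch given earlier, the regularized variant changes only the kernel block of the linear system (K becomes K + λI), which that sketch exposes through its reg argument. A hypothetical usage, with placeholder inputs and an arbitrary λ, might look like:

```python
# control_points: (n, 2) points on the first shape; target_x, target_y: the
# coordinates of their estimated correspondences (placeholder variables).
w_x, a_x, be_x = fit_tps(control_points, target_x, reg=0.1)  # lambda = 0.1 is arbitrary
w_y, a_y, be_y = fit_tps(control_points, target_y, reg=0.1)
bending_energy = 0.5 * (be_x + be_y)                         # combined bending term
```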

Note that in many cases, regardless of the transformation used, the initial estimate of the correspondences contains some errors which could reduce the quality of the transformation. If we iterate the steps of finding correspondences and estimating transformations (i.e. repeating steps 2–5 with the newly transformed shape), we can overcome this problem. Typically, three iterations are all that is needed to obtain reasonable results.

Step 6: Computing the shape distance

Now a shape distance D(P, Q) between two shapes P and Q can be computed. This distance is a weighted sum of three potential terms:

Shape context distance: this is the symmetric sum of shape context matching costs over best matching points:

D_{sc}(P, Q) = \frac{1}{n} \sum_{p \in P} \min_{q \in Q} C\left( p, T(q) \right) + \frac{1}{m} \sum_{q \in Q} \min_{p \in P} C\left( p, T(q) \right),

where T(·) is the estimated TPS transform that maps the points in Q to those in P.

Appearance cost: After establishing image correspondences and properly warping one image to match the other, one can define an appearance cost as the sum of squared brightness differences in Gaussian windows around corresponding image points:

D_{ac}(P, Q) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\Delta \in \mathbb{Z}^2} G(\Delta) \left[ I_P\left( p_i + \Delta \right) - I_Q\left( T(q_{\pi(i)}) + \Delta \right) \right]^2,

where I_P and I_Q are the gray-level images (I_Q is the image after warping) and G is a Gaussian windowing function.

Transformation cost: The final cost measures how much transformation is necessary to bring the two images into alignment. In the case of TPS, it is assigned to be the bending energy.

Now that we have a way of calculating the distance between two shapes, we can use a nearest neighbor classifier (k-NN) with the distance defined as the shape distance calculated here. The results of applying this in different settings are given in the following section.
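As a rough sketch of this final step, the three terms might be combined and used for nearest-neighbor classification as follows; the weights, the function names, and the assumption that the cost matrix, appearance cost and bending energy have already been computed are all illustrative:

```python
import numpy as np

def shape_distance(C_warped, bending_energy, appearance_cost=0.0,
                   w_ac=1.0, w_be=0.3):
    """Weighted sum of the three terms described above.

    C_warped: (n, m) chi-squared cost matrix between the shape contexts of P
    and of the warped Q; the symmetric shape context term takes, for each
    point, the cost of its best match in the other shape.
    """
    d_sc = C_warped.min(axis=1).mean() + C_warped.min(axis=0).mean()
    return d_sc + w_ac * appearance_cost + w_be * bending_energy

def knn_classify(query_to_train_dists, train_labels, k=1):
    """Label a query shape by majority vote among its k nearest training shapes."""
    order = np.argsort(query_to_train_dists)[:k]
    votes = [train_labels[i] for i in order]
    return max(set(votes), key=votes.count)
```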

Results

Digit recognition

The authors Serge Belongie and Jitendra Malik tested their approach on the MNIST database. Currently, more than 50 algorithms have been tested on the database. The database has a training set of 60,000 examples, and a test set of 10,000 examples. The error rate for this approach was 0.63% using 20,000 training examples and 3-NN. At the time of publication, this error rate was the lowest. Currently, the lowest error rate is 0.18%. [10]

Silhouette similarity-based retrieval

The authors experimented with the MPEG-7 shape silhouette database, performing Core Experiment CE-Shape-1 part B, which measures performance of similarity-based retrieval. [11] The database has 70 shape categories and 20 images per shape category. Performance of a retrieval scheme is tested by using each image as a query and counting the number of correct images in the top 40 matches. For this experiment, the authors increased the number of points sampled from each shape. Also, since the shapes in the database are sometimes rotated or flipped, the authors defined the distance between a reference shape and a query shape to be the minimum shape distance between the query shape and either the unchanged, the vertically flipped, or the horizontally flipped reference. [1] [2] [3] [4] With these changes, they obtained a retrieval rate of 76.45%, which in 2002 was the best reported.

3D object recognition

The next experiment performed on shape contexts involved the 20 common household objects in the Columbia Object Image Library (COIL-20). Each object has 72 views in the database. In the experiment, the method was trained on a number of equally spaced views for each object and the remaining views were used for testing. A 1-NN classifier was used. The authors also developed an editing algorithm based on shape context similarity and k-medoid clustering that improved on their performance. [4]

Trademark retrieval

Shape contexts were used to retrieve the closest matching trademarks from a database to a query trademark (useful in detecting trademark infringement). No visually similar trademark was missed by the algorithm (verified manually by the authors). [2]

References

  1. S. Belongie & J. Malik (2000). "Matching with Shape Contexts". IEEE Workshop on Content-based Access of Image and Video Libraries (CBAIVL-2000). doi:10.1109/IVL.2000.853834.
  2. S. Belongie; J. Malik & J. Puzicha (April 2002). "Shape Matching and Object Recognition Using Shape Contexts" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 24 (4): 509–521. doi:10.1109/34.993558. S2CID 129468.
  3. S. Belongie; J. Malik & J. Puzicha (July 2001). "Matching Shapes" (PDF). Eighth IEEE International Conference on Computer Vision (July 2001).
  4. S. Belongie; J. Malik & J. Puzicha (2000). "Shape Context: A new descriptor for shape matching and object recognition" (PDF). NIPS 2000.
  5. H. Chui & A. Rangarajan (June 2000). "A new algorithm for non-rigid point matching". CVPR. Vol. 2. pp. 44–51. doi:10.1109/CVPR.2000.854733.
  6. R. Jonker & A. Volgenant (1987). "A Shortest Augmenting Path Algorithm for Dense and Sparse Linear Assignment Problems". Computing. 38 (4): 325–340. doi:10.1007/BF02278710. S2CID 7806079.
  7. M.J.D. Powell (1995). "A Thin Plate Spline Method for Mapping Curves into Curves in Two Dimensions". Computational Techniques and Applications (CTAC '95). doi:10.1142/9789814530651.
  8. J. Duchon (1977). "Splines Minimizing Rotation-Invariant Semi-Norms in Sobolev Spaces". Constructive Theory of Functions of Several Variables. Lecture Notes in Mathematics. 571: 85–100. doi:10.1007/BFb0086566. ISBN 978-3-540-08069-5.
  9. G. Wahba (1990). Spline Models for Observational Data. Soc. Industrial and Applied Math. ISBN 9780898712445.
  10. Kowsari, Kamran; Heidarysafa, Mojtaba; Brown, Donald E.; Meimandi, Kiana Jafari; Barnes, Laura E. (2018-05-03). "RMDL: Random Multimodel Deep Learning for Classification". Proceedings of the 2018 International Conference on Information System and Data Mining. arXiv:1805.01890. Bibcode:2018arXiv180501890K. doi:10.1145/3206098.3206111. S2CID 19208611.
  11. S. Jeannin & M. Bober (March 1999). "Description of core experiments for MPEG-7 motion/shape". Technical Report ISO/IEC JTC 1/SC 29/WG 11 MPEG99/N2690, MPEG-7, Seoul.