Constellation model

The constellation model is a probabilistic, generative model for category-level object recognition in computer vision. Like other part-based models, the constellation model attempts to represent an object class by a set of N parts under mutual geometric constraints. Because it considers the geometric relationship between different parts, the constellation model differs significantly from appearance-only, or "bag-of-words" representation models, which explicitly disregard the location of image features.

The problem of defining a generative model for object recognition is difficult. The task becomes significantly complicated by factors such as background clutter, occlusion, and variations in viewpoint, illumination, and scale. Ideally, we would like the particular representation we choose to be robust to as many of these factors as possible.

In category-level recognition, the problem is even more challenging because of the fundamental problem of intra-class variation. Even if two objects belong to the same visual category, their appearances may be significantly different. However, for structured objects such as cars, bicycles, and people, separate instances of objects from the same category are subject to similar geometric constraints. For this reason, particular parts of an object such as the headlights or tires of a car still have consistent appearances and relative positions. The Constellation Model takes advantage of this fact by explicitly modeling the relative location, relative scale, and appearance of these parts for a particular object category. Model parameters are estimated using an unsupervised learning algorithm, meaning that the visual concept of an object class can be extracted from an unlabeled set of training images, even if that set contains "junk" images or instances of objects from multiple categories. It can also account for the absence of model parts due to appearance variability, occlusion, clutter, or detector error.

History

The idea for a "parts and structure" model was originally introduced by Fischler and Elschlager in 1973. [1] This model has since been built upon and extended in many directions. The Constellation Model, as introduced by Dr. Perona and his colleagues, was a probabilistic adaptation of this approach.

In the late '90s, Burl et al. [2] [3] [4] [5] revisited the Fischler and Elschlager model for the purpose of face recognition. In their work, Burl et al. used manual selection of constellation parts in training images to construct a statistical model for a set of detectors and the relative locations at which they should be applied. In 2000, Weber et al. [6] [7] [8] [9] made the significant step of training the model using a more unsupervised learning process, which precluded the necessity for tedious hand-labeling of parts. Their algorithm was particularly remarkable because it performed well even on cluttered and occluded image data. Fergus et al. [10] [11] then improved upon this model by making the learning step fully unsupervised, having both shape and appearance learned simultaneously, and accounting explicitly for the relative scale of parts.

The method of Weber & Welling

In the first step, a standard interest point detection method, such as Harris corner detection, is used to generate interest points. Image features generated from the vicinity of these points are then clustered using k-means or another appropriate algorithm. In this process of vector quantization, one can think of the centroids of these clusters as being representative of the appearance of distinctive object parts. Appropriate feature detectors are then trained using these clusters, which can be used to obtain a set of candidate parts from images. [9]
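A minimal sketch of this preprocessing pipeline, assuming OpenCV and scikit-learn, with Harris corners and raw pixel patches standing in for whichever interest operator and descriptor are actually used:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def candidate_parts(gray_images, patch=11, n_types=100):
    """Detect interest points, cut patches around them, and cluster the
    patches with k-means; cluster centroids act as part-appearance types."""
    patches, coords = [], []
    for idx, img in enumerate(gray_images):
        response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)
        ys, xs = np.where(response > 0.01 * response.max())
        r = patch // 2
        for y, x in zip(ys, xs):
            if r <= y < img.shape[0] - r and r <= x < img.shape[1] - r:
                patches.append(img[y - r:y + r + 1, x - r:x + r + 1].ravel())
                coords.append((idx, x, y))
    patches = np.asarray(patches, dtype=np.float64)
    km = KMeans(n_clusters=n_types, n_init=10).fit(patches)
    # km.cluster_centers_ are the prototype appearances ("part detectors");
    # km.labels_ assigns every detection a part type.
    return km, np.asarray(coords), km.labels_
```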


As a result of this process, each image can now be represented as a set of parts. Each part has a type, corresponding to one of the aforementioned appearance clusters, as well as a location in the image space.

Basic generative model

Weber & Welling here introduce the concept of foreground and background. Foreground parts correspond to an instance of a target object class, whereas background parts correspond to background clutter or false detections.

Let T be the number of different types of parts. The positions of all parts extracted from an image can then be represented in the following "matrix,"

$$X^o = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1N_1} \\ x_{21} & x_{22} & \cdots & x_{2N_2} \\ \vdots & & & \vdots \\ x_{T1} & x_{T2} & \cdots & x_{TN_T} \end{pmatrix}$$

where $N_i$ represents the number of parts of type $i \in \{1,\dots,T\}$ observed in the image. The superscript $o$ indicates that these positions are observable, as opposed to missing. The positions of unobserved object parts can be represented by the vector $x^m$. Suppose that the object will be composed of $F$ distinct foreground parts. For notational simplicity, we assume here that $F = T$, though the model can be generalized to $F > T$. A hypothesis $h$ is then defined as a set of indices, with $h_i = j$ indicating that point $x_{ij}$ is a foreground point in $X^o$. The generative probabilistic model is defined through the joint probability density $p(X^o, x^m, h)$.
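In code, the ragged "matrix" $X^o$ and a hypothesis $h$ can be represented very simply; the coordinates below are made up and serve only to illustrate the notation:

```python
# X_o[i] holds the (x, y) positions of all detections of part type i;
# rows may have different lengths, which is why X^o is only loosely a "matrix".
X_o = [
    [(12, 40), (85, 33), (60, 71)],   # type 0: N_0 = 3 detections
    [(47, 52)],                       # type 1: N_1 = 1 detection
    [(30, 19), (90, 88)],             # type 2: N_2 = 2 detections
]

# A hypothesis assigns one detection index per part type (1-based in the
# paper's notation; 0 means "part missing"). h = (3, 1, 2) says the third
# detection of type 0, the first of type 1, and the second of type 2 are
# the foreground parts.
h = (3, 1, 2)
foreground = [X_o[i][j - 1] for i, j in enumerate(h) if j > 0]
```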

Model details

The rest of this section summarizes the details of Weber & Welling's model for a single component model. The formulas for multiple component models [8] are extensions of those described here.

To parametrize the joint probability density, Weber & Welling introduce the auxiliary variables $b$ and $n$, where $b$ is a binary vector encoding the presence/absence of parts in detection ($b_i = 1$ if $h_i > 0$, $b_i = 0$ otherwise), and $n$ is a vector where $n_i$ denotes the number of background candidates included in the $i$th row of $X^o$. Since $b$ and $n$ are completely determined by $h$ and the size of $X^o$, we have $p(X^o, x^m, h) = p(X^o, x^m, h, n, b)$. By decomposition,

$$p(X^o, x^m, h, n, b) = p(X^o, x^m \mid h, n, b)\, p(h \mid n, b)\, p(n)\, p(b)$$

The probability density over the number of background detections can be modeled by a Poisson distribution,

$$p(n) = \prod_{i=1}^{T} \frac{1}{n_i!}\, (M_i)^{n_i} e^{-M_i}$$

where $M_i$ is the average number of background detections of type $i$ per image.
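As a small illustration, the background term is a product of independent Poisson probabilities, one per part type (the counts and means below are made up):

```python
from math import exp, factorial

def p_background(n, M):
    """p(n) = prod_i Poisson(n_i; M_i), with M_i the mean number of
    background detections of type i per image."""
    prob = 1.0
    for n_i, M_i in zip(n, M):
        prob *= (M_i ** n_i) * exp(-M_i) / factorial(n_i)
    return prob

print(p_background(n=[2, 0, 1], M=[1.5, 0.5, 1.0]))
```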

Depending on the number of parts $F$, the probability $p(b)$ can be modeled either as an explicit table of length $2^F$ or, if $F$ is large, as $F$ independent probabilities, each governing the presence of an individual part.
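Both parameterizations of $p(b)$ can be sketched in a few lines (the probabilities below are illustrative):

```python
import itertools
import numpy as np

F = 3

# Option 1: explicit table over all 2^F presence/absence patterns.
patterns = list(itertools.product([0, 1], repeat=F))
table = np.full(len(patterns), 1.0 / len(patterns))   # 2^F entries, here uniform
p_b_table = dict(zip(patterns, table))

# Option 2: F independent Bernoulli probabilities (tractable for large F).
q = np.array([0.9, 0.8, 0.7])                          # P(part f present)

def p_b_independent(b):
    b = np.asarray(b)
    return float(np.prod(q ** b * (1 - q) ** (1 - b)))

print(p_b_table[(1, 0, 1)], p_b_independent((1, 0, 1)))
```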

The density $p(h \mid n, b)$ is modeled by

$$p(h \mid n, b) = \begin{cases} \dfrac{1}{\prod_{f=1}^{F} N_f^{\,b_f}} & \text{if } h \in H(b, n) \\[1ex] 0 & \text{otherwise} \end{cases}$$

where $H(b, n)$ denotes the set of all hypotheses consistent with $b$ and $n$, and $N_f$ denotes the total number of detections of parts of type $f$. This expresses the fact that all consistent hypotheses, of which there are $\prod_{f=1}^{F} N_f^{\,b_f}$, are equally likely in the absence of information on part locations.
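Because all consistent hypotheses are equally likely, evaluating $p(h \mid n, b)$ amounts to counting them; a minimal sketch:

```python
import numpy as np

def p_h_given_nb(b, N):
    """Uniform density over hypotheses consistent with b: each present part f
    can be matched to any of its N_f detections, so there are
    prod_f N_f^{b_f} consistent hypotheses."""
    b, N = np.asarray(b), np.asarray(N)
    n_consistent = np.prod(N.astype(float) ** b)
    return 1.0 / n_consistent

# Three part types with 3, 1 and 2 detections; all parts present.
print(p_h_given_nb(b=[1, 1, 1], N=[3, 1, 2]))   # 1/6
```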

And finally,

$$p(X^o, x^m \mid h, n) = p_{\text{fg}}(z)\, p_{\text{bg}}(x_{\text{bg}})$$

where $z = (x^o\ x^m)$ are the coordinates of all foreground detections, observed and missing, and $x_{\text{bg}}$ represents the coordinates of the background detections. Note that foreground detections are assumed to be independent of the background. $p_{\text{fg}}(z)$ is modeled as a joint Gaussian with mean $\mu$ and covariance $\Sigma$.
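The foreground density is then a single joint Gaussian over the stacked part coordinates; a SciPy sketch with arbitrary parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

F = 3                                   # number of foreground parts
mu = np.zeros(2 * F)                    # mean of the stacked (x, y) coordinates
Sigma = np.eye(2 * F) * 25.0            # covariance (here: independent, std 5 px)

def p_foreground(z):
    """z: length-2F vector of foreground coordinates, observed and missing."""
    return multivariate_normal.pdf(z, mean=mu, cov=Sigma)

print(p_foreground(np.array([1.0, -2.0, 0.5, 3.0, -1.0, 0.0])))
```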

Classification

The ultimate objective of this model is to classify images into classes "object present" (class $C_1$) and "object absent" (class $C_0$) given the observation $X^o$. To accomplish this, Weber & Welling run part detectors from the learning step exhaustively over the image, examining different combinations of detections. If occlusion is considered, then combinations with missing detections are also permitted. The goal is then to select the class with maximum a posteriori probability, by considering the ratio

$$\frac{p(C_1 \mid X^o)}{p(C_0 \mid X^o)} \propto \frac{\sum_{h} p(X^o, h \mid C_1)}{p(X^o, h_0 \mid C_0)}$$

where $h_0$ denotes the null hypothesis, which explains all parts as background noise. In the numerator, the sum includes all hypotheses, including the null hypothesis, whereas in the denominator, the only hypothesis consistent with the absence of an object is the null hypothesis. In practice, some threshold can be defined such that, if the ratio exceeds that threshold, we then consider an instance of an object to be detected.
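Schematically, detection sums the joint density over every hypothesis and compares the result against the null-hypothesis explanation alone; the sketch below assumes a hypothetical `joint_density(X_o, h)` function implementing the generative model of the previous section:

```python
import itertools

def detect_object(X_o, joint_density, threshold=1.0):
    """Likelihood-ratio test: sum the joint density over all hypotheses
    (including the null one) and divide by the null hypothesis alone."""
    N = [len(row) for row in X_o]                 # detections per part type
    # Every part is either missing (0) or matched to one of its detections.
    all_h = itertools.product(*[range(n + 1) for n in N])
    numerator = sum(joint_density(X_o, h) for h in all_h)
    h_null = tuple(0 for _ in N)                  # everything is background
    ratio = numerator / joint_density(X_o, h_null)
    return ratio > threshold, ratio
```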

Model learning

After the preliminary step of interest point detection, feature generation and clustering, we have a large set of candidate parts over the training images. To learn the model, Weber & Welling first perform a greedy search over possible model configurations, or equivalently, over potential subsets of the candidate parts. This is done in an iterative fashion, starting with random selection. At subsequent iterations, parts in the model are randomly substituted, the model parameters are estimated, and the performance is assessed. The process is complete when further model performance improvements are no longer possible.
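A minimal sketch of this greedy search, where `fit_em` (parameter estimation) and `validation_score` (performance assessment) are hypothetical stand-ins for the corresponding steps in Weber & Welling's procedure:

```python
import random

def greedy_part_selection(candidates, F, fit_em, validation_score, max_iter=100):
    """Start from a random subset of F candidate parts, then repeatedly swap
    one part at random, refit with EM, and keep the swap only if it helps."""
    model_parts = random.sample(candidates, F)
    best_score = validation_score(fit_em(model_parts))
    for _ in range(max_iter):
        trial = list(model_parts)
        trial[random.randrange(F)] = random.choice(candidates)   # substitute a part
        score = validation_score(fit_em(trial))
        if score > best_score:                                    # keep improvements only
            model_parts, best_score = trial, score
    return model_parts, best_score
```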

At each iteration, the model parameters

$$\Theta = \{\mu, \Sigma, p(b), M\}$$

are estimated using expectation maximization. $\mu$ and $\Sigma$, we recall, are the mean and covariance of the joint Gaussian $p_{\text{fg}}(z)$, $p(b)$ is the probability distribution governing the binary presence/absence of parts, and $M$ is the mean number of background detections over part types.

M-step

EM proceeds by maximizing the likelihood of the observed data,

$$L(X^o \mid \Theta) = \sum_{i=1}^{I} \log \sum_{h_i} \int p(X_i^o, x_i^m, h_i \mid \Theta)\, dx_i^m$$

with respect to the model parameters $\Theta$. Since this is difficult to achieve analytically, EM iteratively maximizes a sequence of cost functions,

$$Q(\tilde{\Theta} \mid \Theta) = \sum_{i=1}^{I} E\!\left[\log p(X_i^o, x_i^m, h_i \mid \tilde{\Theta})\right]$$

where the expectation is taken with respect to the posterior $p(h_i, x_i^m \mid X_i^o, \Theta)$ under the current parameters. Taking the derivative of this with respect to the parameters and equating to zero produces the update rules:

$$\tilde{\mu} = \frac{1}{I} \sum_{i=1}^{I} E[z_i]$$

$$\tilde{\Sigma} = \frac{1}{I} \sum_{i=1}^{I} E[z_i z_i^T] - \tilde{\mu}\tilde{\mu}^T$$

$$\tilde{p}(\bar{b}) = \frac{1}{I} \sum_{i=1}^{I} E[\delta_{b,\bar{b}}]$$

$$\tilde{M} = \frac{1}{I} \sum_{i=1}^{I} E[n_i]$$
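In code, these updates are simple averages of the expected sufficient statistics over the $I$ training images; a NumPy sketch, assuming the E-step has already produced the per-image expectations:

```python
import numpy as np

def m_step(E_z, E_zzT, E_delta_b, E_n):
    """E_z: (I, 2F) expected foreground coordinates per image.
    E_zzT: (I, 2F, 2F) expected second moments.
    E_delta_b: (I, 2**F) expected indicator of each presence pattern.
    E_n: (I, T) expected background counts per part type."""
    mu = E_z.mean(axis=0)
    Sigma = E_zzT.mean(axis=0) - np.outer(mu, mu)
    p_b = E_delta_b.mean(axis=0)     # table over presence/absence patterns
    M = E_n.mean(axis=0)             # mean background detections per type
    return mu, Sigma, p_b, M
```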

E-step

The update rules in the M-step are expressed in terms of the sufficient statistics $E[z]$, $E[zz^T]$, $E[\delta_{b,\bar{b}}]$, and $E[n]$, which are calculated in the E-step by considering the posterior density:

$$p(h_i, x_i^m \mid X_i^o, \Theta) = \frac{p(h_i, x_i^m, X_i^o \mid \Theta)}{\sum_{h_i \in H_b} \int p(h_i, x_i^m, X_i^o \mid \Theta)\, dx_i^m}$$
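A simplified sketch of the E-step (ignoring missing parts, so the integral over $x^m$ drops out), again assuming a hypothetical `joint_density` function evaluated at the current parameters:

```python
import itertools
import numpy as np

def e_step_weights(X_o, joint_density):
    """Posterior p(h | X^o, Theta) over all hypotheses, obtained by
    normalizing the joint density of each hypothesis."""
    N = [len(row) for row in X_o]
    hypotheses = list(itertools.product(*[range(n + 1) for n in N]))
    weights = np.array([joint_density(X_o, h) for h in hypotheses])
    return hypotheses, weights / weights.sum()
```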

The method of Fergus et al.

In Weber et al., shape and appearance models are constructed separately. Once the set of candidate parts has been selected, shape is learned independently of appearance. The innovation of Fergus et al. is to learn not only two, but three model parameters simultaneously: shape, appearance, and relative scale. Each of these parameters is represented by a Gaussian density. [10]

Feature representation

Whereas the preliminary step in the Weber et al. method is to search for the locations of interest points, Fergus et al. use the detector of Kadir and Brady [12] to find salient regions in the image over both location (center) and scale (radius). Thus, in addition to location information $X$, this method also extracts associated scale information $S$. Fergus et al. then normalize the squares bounding these circular regions to 11 x 11 pixel patches, or equivalently, 121-dimensional vectors in the appearance space. These are then reduced to 10-15 dimensions by principal component analysis, giving the appearance information $A$.
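A rough sketch of this feature representation step, assuming the salient regions are already available as (x, y, r) triples and using OpenCV and scikit-learn for resizing and PCA:

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def appearance_vectors(gray_img, regions, n_components=15):
    """regions: iterable of (x, y, r) salient circles.
    Crops the bounding square of each circle, rescales it to 11 x 11, and
    projects the 121-dimensional patches onto their first principal
    components to obtain the appearance descriptors A."""
    patches = []
    H, W = gray_img.shape
    for x, y, r in regions:
        x0, x1 = max(0, x - r), min(W, x + r + 1)
        y0, y1 = max(0, y - r), min(H, y + r + 1)
        patch = cv2.resize(gray_img[y0:y1, x0:x1], (11, 11))
        patches.append(patch.astype(np.float64).ravel())   # 121-dim vector
    patches = np.asarray(patches)
    return PCA(n_components=n_components).fit_transform(patches)
```

In the actual method the PCA basis would be estimated once from all training patches and then reused, rather than refit per image as in this sketch.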

Model structure

Given a particular object class model with parameters $\Theta$, we must decide whether or not a new image contains an instance of that class. This is accomplished by making a Bayesian decision,

$$R = \frac{p(\text{Object} \mid X, S, A)}{p(\text{No object} \mid X, S, A)} = \frac{p(X, S, A \mid \text{Object})\, p(\text{Object})}{p(X, S, A \mid \text{No object})\, p(\text{No object})} \approx \frac{p(X, S, A \mid \Theta)\, p(\text{Object})}{p(X, S, A \mid \Theta_{\text{bg}})\, p(\text{No object})}$$

where $\Theta_{\text{bg}}$ is the background model. This ratio $R$ is compared to a threshold to determine object presence/absence.

The likelihoods are factored as follows:

$$p(X, S, A \mid \Theta) = \sum_{h \in H} p(X, S, A, h \mid \Theta) = \sum_{h \in H} \underbrace{p(A \mid X, S, h, \Theta)}_{\text{Appearance}}\, \underbrace{p(X \mid S, h, \Theta)}_{\text{Shape}}\, \underbrace{p(S \mid h, \Theta)}_{\text{Rel. scale}}\, \underbrace{p(h \mid \Theta)}_{\text{Other}}$$

Appearance

Each part $p$ has an appearance modeled by a Gaussian density in the appearance space, with mean and covariance parameters $\Theta_p^{\text{app}} = \{c_p, V_p\}$, independent of other parts' densities. The background model has parameters $\Theta_{\text{bg}}^{\text{app}} = \{c_{\text{bg}}, V_{\text{bg}}\}$. Fergus et al. assume that, given detected features, the position and appearance of those features are independent. Thus, $p(A \mid X, S, h, \Theta) = p(A \mid h, \Theta)$. The ratio of the appearance terms reduces to

$$\frac{p(A \mid X, S, h, \Theta)}{p(A \mid X, S, h_0, \Theta_{\text{bg}})} = \frac{p(A \mid h, \Theta)}{p(A \mid h_0, \Theta_{\text{bg}})} = \prod_{p=1}^{P} \left( \frac{G(A(h_p) \mid c_p, V_p)}{G(A(h_p) \mid c_{\text{bg}}, V_{\text{bg}})} \right)^{d_p}$$

Recall from Weber et al. that $h$ is the hypothesis for the indices of foreground parts, and $d$ is the binary vector giving the occlusion state of each part in the hypothesis.
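As a sketch, the appearance ratio is a product of per-part Gaussian ratios over the non-occluded parts; the parameter layout below is illustrative:

```python
from scipy.stats import multivariate_normal as mvn

def appearance_ratio(A, h, d, c, V, c_bg, V_bg):
    """A: (N, k) appearance vectors of all detections.
    h: per-part detection indices; d: per-part presence (1) / occlusion (0).
    c[p], V[p]: foreground Gaussian for part p; c_bg, V_bg: background model."""
    ratio = 1.0
    for p, (idx, present) in enumerate(zip(h, d)):
        if present:
            a = A[idx]
            ratio *= mvn.pdf(a, mean=c[p], cov=V[p]) / mvn.pdf(a, mean=c_bg, cov=V_bg)
    return ratio
```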

Shape

Shape is represented by a joint Gaussian density of part locations within a particular hypothesis, after those parts have been transformed into a scale-invariant space. This transformation precludes the need to perform an exhaustive search over scale. The Gaussian density has parameters $\Theta^{\text{shape}} = \{\mu, \Sigma\}$. The background model is assumed to be a uniform distribution over the image, which has area $\alpha$. Letting $f$ be the number of foreground parts,

$$\frac{p(X \mid S, h, \Theta)}{p(X \mid S, h_0, \Theta_{\text{bg}})} = G(X(h) \mid \mu, \Sigma)\, \alpha^{f}$$
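A corresponding sketch of the shape term, assuming the hypothesized part coordinates have already been mapped into the scale-invariant space:

```python
from scipy.stats import multivariate_normal as mvn

def shape_ratio(X_h, mu, Sigma, image_area, f):
    """X_h: stacked scale-normalized coordinates of the f hypothesized parts.
    Foreground: joint Gaussian G(X(h) | mu, Sigma); background: uniform over
    the image, hence the area^f factor."""
    return mvn.pdf(X_h, mean=mu, cov=Sigma) * image_area ** f
```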

Relative scale

The scale of each part $p$ relative to a reference frame is modeled by a Gaussian density with parameters $\Theta^{\text{scale}} = \{t_p, U_p\}$. Each part is assumed to be independent of other parts. The background model assumes a uniform distribution over scale, within a range $r$. The ratio of the scale terms is then

$$\frac{p(S \mid h, \Theta)}{p(S \mid h_0, \Theta_{\text{bg}})} = \prod_{p=1}^{P} G(S(h_p) \mid t_p, U_p)^{d_p}\, r^{f}$$
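And a sketch of the relative-scale term, treating each part's relative scale as a one-dimensional Gaussian (parameters illustrative):

```python
from scipy.stats import norm

def scale_ratio(S, h, d, t, U, scale_range):
    """S: detected scales; t[p], U[p]: mean and variance of part p's relative
    scale; the background is uniform over a range of width scale_range."""
    ratio, f = 1.0, 0
    for p, (idx, present) in enumerate(zip(h, d)):
        if present:
            ratio *= norm.pdf(S[idx], loc=t[p], scale=U[p] ** 0.5)
            f += 1
    return ratio * scale_range ** f
```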

Occlusion and statistics of feature detection

The ratio of the remaining terms factors as

$$\frac{p(h \mid \Theta)}{p(h_0 \mid \Theta_{\text{bg}})} = \frac{p_{\text{Poiss}}(n \mid M)}{p_{\text{Poiss}}(N \mid M)}\, \frac{1}{\binom{N}{f}}\, p(d \mid \Theta)$$

The first factor models the number of features detected using a Poisson distribution, which has mean $M$. The second factor serves as a "book-keeping" factor for the hypothesis variable. The last factor is a probability table for all possible occlusion patterns.
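A sketch combining these three factors; the occlusion table `p_d_table` and the other arguments are hypothetical placeholders:

```python
from math import comb, exp, factorial

def poisson_pmf(k, mean):
    return (mean ** k) * exp(-mean) / factorial(k)

def other_ratio(n_bg, N, f, d, p_d_table, M):
    """n_bg: background detections under the hypothesis; N: total detections;
    f: number of foreground parts matched; p_d_table: probability of the
    occlusion pattern d; M: Poisson mean for the detection counts."""
    poisson_term = poisson_pmf(n_bg, M) / poisson_pmf(N, M)
    bookkeeping = 1.0 / comb(N, f)          # equally likely assignments
    return poisson_term * bookkeeping * p_d_table[tuple(d)]
```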

Learning

The task of learning the model parameters is accomplished by expectation maximization. This is carried out in a spirit similar to that of Weber et al. Details and formulas for the E-step and M-step can be seen in the literature. [11]

Performance

The Constellation Model as conceived by Fergus et al. achieves successful categorization rates consistently above 90% on large datasets of motorbikes, faces, airplanes, and spotted cats. [13] For each of these datasets, the Constellation Model is able to capture the "essence" of the object class in terms of appearance and/or shape. For example, face and motorbike datasets generate very tight shape models because objects in those categories have very well-defined structure, whereas spotted cats vary significantly in pose, but have a very distinctive spotted appearance. Thus, the model succeeds in both cases. It is important to note that the Constellation Model does not generally account for significant changes in orientation. Thus, if the model is trained on images of horizontal airplanes, it will not perform well on, for instance, images of vertically oriented planes unless the model is extended to account for this sort of rotation explicitly.

In terms of computational complexity, the Constellation Model is very expensive. If $N$ is the number of feature detections in the image, and $P$ the number of parts in the object model, then the hypothesis space $H$ is $O(N^P)$. Because the computation of sufficient statistics in the E-step of expectation maximization necessitates evaluating the likelihood for every hypothesis, learning becomes a major bottleneck operation. For this reason, only values of $P \leq 6$ have been used in practical applications, and the number of feature detections $N$ is usually kept within the range of about 20-30 per image.

Variations

One variation that attempts to reduce complexity is the star model proposed by Fergus et al. [14] The reduced dependencies of this model allow for learning in $O(N^2 P)$ time instead of $O(N^P)$. This allows a greater number of model parts and image features to be used in training. Because the star model has fewer parameters, it is also better at avoiding the problem of over-fitting when trained on fewer images.

References

  1. Fischler, M.A.; Elschlager, R.A. (1973). "The Representation and Matching of Pictorial Structures". IEEE Transactions on Computers (1): 67–92. doi:10.1109/T-C.1973.223602. S2CID 14554383.
  2. M. Burl, T. Leung, and P. Perona. Face Localization via Shape Statistics. (1995)
  3. T. Leung, M. Burl, and P. Perona. Finding Faces in Cluttered Scenes Using Random Labeled Graph Matching. (1995)
  4. M. Burl and P. Perona. Recognition of Planar Object Classes. (1996)
  5. M. Burl, M. Weber, and P. Perona. A Probabilistic Approach to Object Recognition Using Local Photometry and Global Geometry. (1998)
  6. M. Weber. Unsupervised Learning of Models for Object Recognition. PhD Thesis. (2000)
  7. M. Weber, W. Einhaeuser, M. Welling, and P. Perona. Viewpoint-Invariant Learning and Detection of Human Heads. (2000)
  8. M. Weber, M. Welling, and P. Perona. Towards Automatic Discovery of Object Categories. (2000)
  9. M. Weber, M. Welling, and P. Perona. Unsupervised Learning of Models for Recognition. (2000)
  10. R. Fergus, P. Perona, and A. Zisserman. Object Class Recognition by Unsupervised Scale-Invariant Learning. (2003)
  11. R. Fergus. Visual Object Category Recognition. PhD Thesis. (2005)
  12. Kadir, Timor; Brady, Michael (2001). "Saliency, Scale and Image Description". International Journal of Computer Vision. 45 (2): 83–105. doi:10.1023/A:1012460413855. S2CID 825395.
  13. R. Fergus and P. Perona. Caltech Object Category Datasets. http://www.vision.caltech.edu/html-files/archive.html (2003)
  14. R. Fergus, P. Perona, and A. Zisserman. A Sparse Object Category Model for Efficient Learning and Exhaustive Recognition. (2005)
