One-shot learning (computer vision)

One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.

Motivation

The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans. [1] [2] It is estimated that a child learns almost all of the 10,000 to 30,000 object categories in the world by age six. [3] This is due not only to the human mind's computational power, but also to its ability to synthesize and learn new object categories from existing information about different, previously learned categories. Given two examples from two object categories, one an unknown object composed of familiar shapes and the other an unknown, amorphous shape, it is much easier for humans to recognize the former than the latter, suggesting that humans make use of previously learned categories when learning new ones. The key motivation for solving one-shot learning is that systems, like humans, can use knowledge about object categories to classify new objects. [4] [5]

Background

As with most classification schemes, one-shot learning involves three main challenges:

Representation: how should objects and categories be described?
Learning: how may such descriptions be acquired?
Recognition: how can a known object be identified despite clutter and changes in viewpoint and lighting?

One-shot learning differs from single object recognition and standard category recognition algorithms in its emphasis on knowledge transfer, which makes use of previously learned categories.

Theory

The Bayesian one-shot learning algorithm represents the foreground and background of images as parametrized mixtures of constellation models. [12] During the learning phase, the parameters of these models are learned using a conjugate-density parameter posterior and variational Bayesian expectation–maximization (VBEM). [13] In this stage the previously learned object categories inform the choice of model parameters via transfer by contextual information. For object recognition on new images, the posterior obtained during the learning phase is used in a Bayesian decision framework to estimate the ratio of p(object | test, train) to p(background clutter | test, train), where p denotes a probability. [14]

Bayesian framework

Given the task of finding a particular object in a query image, the overall objective of the Bayesian one-shot learning algorithm is to compare the probability that the object is present with the probability that only background clutter is present. If the former probability is higher, the algorithm reports the object's presence; otherwise it reports the object's absence. To compute these probabilities, the object class must be modeled from a set of (1–5) training images containing examples.

To formalize these ideas, let I be the query image, which contains either an example of the foreground category O_fg or only background clutter of a generic background category O_bg. Also let I_t be the set of training images used as the foreground category. The decision of whether I contains an object from the foreground category, or only clutter from the background category, is:

R = p(O_fg | I, I_t) / p(O_bg | I, I_t) = [ p(I | I_t, O_fg) p(O_fg) ] / [ p(I | I_t, O_bg) p(O_bg) ]

where the class posteriors p(O_fg | I, I_t) and p(O_bg | I, I_t) have been expanded by Bayes' Theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image contains an object from the foreground class if R exceeds a certain threshold T. We next introduce parametric models for the foreground and background categories with parameters θ and θ_bg respectively. This foreground parametric model is learned during the learning stage from I_t, as well as prior information from learned categories. The background model we assume to be uniform across images. Omitting the constant ratio of category priors, p(O_fg) / p(O_bg), and parametrizing over θ and θ_bg yields

R ∝ [ ∫ p(I | θ) p(θ | I_t, O_fg) dθ ] / [ ∫ p(I | θ_bg) p(θ_bg | I_t, O_bg) dθ_bg ],

having simplified p(I | I_t, O_fg) and p(I | I_t, O_bg) to integrals over the foreground and background model parameters.

The posterior distribution of model parameters given the training images, p(θ | I_t, O_fg), is estimated in the learning phase. In this estimation, one-shot learning differs sharply from more traditional Bayesian estimation models that approximate the integral as p(I | θ*), where θ* is a single point estimate of the parameters. Instead, it uses a variational approach incorporating prior information from previously learned categories. However, the traditional maximum likelihood estimation of the model parameters is used for the background model and for the categories learned in advance through training. [15]
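
The decision framework above can be sketched in a few lines. This toy version collapses each likelihood to a single 1-D Gaussian rather than integrating over a parameter posterior, so all function names, parameters, and numbers here are illustrative assumptions, not the algorithm's actual models:

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-density of a 1-D Gaussian; a stand-in for the real image likelihoods."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def detect_object(feature, fg_mean, fg_var, bg_mean, bg_var, threshold=1.0):
    """Report the object's presence if the foreground/background likelihood
    ratio R exceeds the threshold T (compared in log space for stability)."""
    log_r = (gaussian_loglik(feature, fg_mean, fg_var)
             - gaussian_loglik(feature, bg_mean, bg_var))
    return log_r > math.log(threshold)

# A feature near the foreground model is accepted; one far from it is rejected.
print(detect_object(0.1, fg_mean=0.0, fg_var=1.0, bg_mean=5.0, bg_var=4.0))  # True
print(detect_object(5.2, fg_mean=0.0, fg_var=1.0, bg_mean=5.0, bg_var=4.0))  # False
```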

Object category model

For each query image I and set of training images I_t, a constellation model is used for representation. [12] [16] [17] To obtain this model for a given image I, a set of N interesting regions is first detected in the image using the Kadir–Brady saliency detector. [18] Each region selected is represented by a location in the image, X_i, and a description of its appearance, A_i. Letting X = {X_1, ..., X_N} and A = {A_1, ..., A_N}, with X_t and A_t the analogous representations for the training images, the expression for R becomes:

R ∝ [ ∫ p(X, A | θ, O_fg) p(θ | X_t, A_t, O_fg) dθ ] / [ ∫ p(X, A | θ_bg, O_bg) p(θ_bg | X_t, A_t, O_bg) dθ_bg ]
The likelihoods p(X, A | θ) and p(X, A | θ_bg) are represented as mixtures of constellation models. A typical constellation model has P (3–7) parts, while an image has N (~100) interest regions. A P-dimensional vector h assigns one region of interest (out of the N regions) to each model part (for all P parts); thus h denotes a hypothesis, an assignment of interest regions to model parts, and a full constellation model is represented by summing over all possible hypotheses h in the hypothesis space H. Finally the likelihood is written

p(X, A | θ) = Σ_ω Σ_{h ∈ H} p(X, A, h, ω | θ)

The different ω's represent different configurations of parts, whereas the different hypotheses h represent different assignations of regions to parts, given a part model ω. The assumption that the shape of the model (as represented by X, the collection of part locations) and appearance are independent allows one to consider the likelihood expression as two separate likelihoods of appearance and shape. [19]

Appearance

The appearance of each feature is represented by a point in appearance space (discussed below under implementation). "Each part in the constellation model has a Gaussian density within this space with mean and precision parameters." From these the appearance likelihood described above is computed as a product of Gaussians over the model parts for a given hypothesis h and mixture component ω. [20]

Shape

The shape of the model for a given mixture component ω and hypothesis h is represented as a joint Gaussian density of the locations of features. These features are transformed into a scale- and translation-invariant space before the relative location of the parts is modelled by a 2(P − 1)-dimensional Gaussian. From this, we obtain the shape likelihood, completing our representation of p(X, A, h, ω | θ). In order to reduce the number of hypotheses in the hypothesis space H, only those hypotheses that satisfy the ordering constraint that the x-coordinate of each part is monotonically increasing are considered. This reduces the size of the hypothesis space by a factor of P!. [20]
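
Under the ordering constraint, each surviving hypothesis corresponds to a choice of P regions read off in increasing x order, which is why the constraint removes a factor of P!. A small sketch, with toy region coordinates invented for illustration:

```python
from itertools import combinations, permutations

def hypotheses(regions, P):
    """Enumerate hypotheses satisfying the ordering constraint: assignments of
    P of the N regions to model parts with monotonically increasing
    x-coordinate. Each region is an (x, y) location."""
    ordered = sorted(regions, key=lambda r: r[0])
    return list(combinations(ordered, P))

regions = [(3, 1), (1, 2), (4, 0), (2, 5), (5, 3)]   # N = 5 toy interest regions
P = 3
H = hypotheses(regions, P)
unconstrained = list(permutations(regions, P))       # all assignments, no constraint
print(len(H), len(unconstrained))                    # 10 vs 60: a P! = 6 reduction
```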

Conjugate densities

In order to compute R, the integral ∫ p(X, A | θ, O_fg) p(θ | X_t, A_t, O_fg) dθ must be evaluated, but it is analytically intractable. The object category model above gives information about p(X, A | θ, O_fg), so what remains is to examine p(θ | X_t, A_t, O_fg), the posterior of θ, and find a sufficient approximation to render the integral tractable. Previous work approximates the posterior by a δ (delta) function centered at a point estimate θ*, collapsing the integral in question into p(X, A | θ*, O_fg). This θ* is normally estimated using a maximum likelihood (θ* = θ_ML) or maximum a posteriori (θ* = θ_MAP) procedure. However, because in one-shot learning few training examples are used, the distribution will not be well-peaked, as is assumed in a δ-function approximation. Thus, instead of this traditional approximation, the Bayesian one-shot learning algorithm seeks to "find a parametric form of p(θ) such that the learning of p(θ | X_t, A_t, O_fg) is feasible". The algorithm employs a Normal-Wishart distribution as the conjugate prior of p(θ), and in the learning phase, variational Bayesian methods with the same computational complexity as maximum likelihood methods are used to learn the hyperparameters of the distribution. Then, since p(X, A | θ) is a product of Gaussians, as chosen in the object category model, the integral reduces to a multivariate Student's t distribution, which can be evaluated. [21]
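
As an illustration of why conjugacy makes the integral tractable, the following 1-D analog uses a Normal-Gamma prior (the scalar counterpart of the Normal-Wishart prior named above): the posterior update is closed-form and the posterior predictive density is a Student's t, just as in the multivariate case. The hyperparameter names are the conventional ones for this conjugate family, not taken from the source:

```python
import math

def posterior_update(data, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Conjugate Normal-Gamma update (1-D analog of the Normal-Wishart prior).
    Returns posterior hyperparameters given observed feature values."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def predictive_density(x, mu_n, kappa_n, alpha_n, beta_n):
    """Posterior predictive: a Student's t density, evaluated in closed form."""
    nu = 2 * alpha_n                                    # degrees of freedom
    scale2 = beta_n * (kappa_n + 1) / (alpha_n * kappa_n)
    z = (x - mu_n) ** 2 / (nu * scale2)
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi * scale2))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(z))

# Even with only three training values, the predictive density peaks near them.
mu_n, kappa_n, alpha_n, beta_n = posterior_update([0.9, 1.1, 1.0])
print(predictive_density(1.0, mu_n, kappa_n, alpha_n, beta_n) >
      predictive_density(4.0, mu_n, kappa_n, alpha_n, beta_n))  # True
```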

Implementation

Feature detection and representation

To detect features in an image so that it can be represented by a constellation model, the Kadir–Brady feature detector is used on grey-scale images, finding salient regions of the image. These regions are then clustered, yielding a number of features (the clusters) and the shape parameter X, composed of the cluster centers. The Kadir–Brady detector was chosen because it produces fewer, more salient regions, as opposed to feature detectors such as the multiscale Harris detector, which produces numerous regions of lesser significance.

The regions are then taken from the image and rescaled to small patches of 11 × 11 pixels, allowing each patch to be represented in a 121-dimensional space. This dimensionality is reduced using principal component analysis, and A, the appearance parameter, is then formed from the first 10 principal components of each patch. [22]
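
This patch pipeline can be sketched with random stand-in data (in the real system the inputs would be the rescaled 11 × 11 patches around detected salient regions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for rescaled 11 x 11 salient patches: 121-dimensional vectors.
patches = rng.normal(size=(50, 121))

# Principal component analysis via SVD of the mean-centred patch matrix.
centred = patches - patches.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)

# The appearance descriptor keeps only the first 10 principal components.
appearance = centred @ vt[:10].T
print(appearance.shape)  # (50, 10)
```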

Learning

To obtain shape and appearance priors, three categories (spotted cats, faces, and airplanes) are learned using maximum likelihood estimation. These object category model parameters are then used to estimate the hyper-parameters of the desired priors.

Given a set of training examples, the algorithm runs the feature detector on these images, and determines model parameters from the salient regions. The hypothesis index h assigning features to parts prevents a closed-form solution of the linear model, so the posterior is estimated by variational Bayesian expectation-maximization, which is run until parameter convergence after ~ 100 iterations. Learning a category in this fashion takes under a minute on a 2.8 GHz machine with a 4-part model and < 10 training images. [23]

Experimental results

Motorbike example

To learn the motorbike category:

Shared densities on transforms

Another algorithm uses knowledge transfer by model parameters to learn a new object category that is similar in appearance to previously learned categories. An image is represented either as a texture and shape, or as a latent image I_L that has been transformed, denoted by I = T(I_L).

A Siamese neural network works in tandem on two different input vectors to compute comparable output vectors. [24]

Congealing

In this context, congealing is "the simultaneous vectorization of each of a set of images to each other". For a set of training images of a certain category, congealing iteratively transforms each image to minimize the images' joint pixelwise entropies E, where

E = Σ_{p ∈ P} H(v(p))

"where v(p) is the binary random variable defined by the values of a particular pixel p across all of the images, H(·) is the discrete entropy function of that variable, and P is the set of pixel indices for the image."

The congealing algorithm begins with a set of images I_i and a corresponding transform matrix U_i for each, which at the end of the algorithm will relate each I_i to its latent I_i^L. These latents minimize the joint pixelwise entropies. Thus the task of the congealing algorithm is to estimate the transformations U_i.

Sketch of algorithm:

At the end of the algorithm, I_i = U_i(I_i^L), and U_i transforms the latent image back into the originally observed image. [25]
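
A minimal sketch of the congealing loop, assuming binary images and integer shifts as the transform family (an assumption for brevity: the original algorithm uses a richer affine family and more careful optimization):

```python
import numpy as np

def joint_pixelwise_entropy(stack):
    """E = sum over pixels p of the entropy H of that pixel's binary values
    across all images in the stack."""
    p1 = stack.mean(axis=0)            # fraction of images with the pixel on
    ent = np.zeros_like(p1, dtype=float)
    for p in (p1, 1 - p1):
        mask = p > 0
        ent[mask] -= p[mask] * np.log2(p[mask])
    return float(ent.sum())

def congeal(images, n_iters=10):
    """Greedy congealing: for each image, try small shift perturbations of its
    transform and keep any change that lowers the joint entropy E."""
    latents = [img.copy() for img in images]
    shifts = [(0, 0)] * len(images)    # the estimated transforms U_i
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for _ in range(n_iters):
        for i, img in enumerate(images):
            best = joint_pixelwise_entropy(np.stack(latents))
            for dy, dx in moves:
                sy, sx = shifts[i][0] + dy, shifts[i][1] + dx
                latents[i] = np.roll(img, (sy, sx), axis=(0, 1))
                if joint_pixelwise_entropy(np.stack(latents)) < best:
                    best, shifts[i] = joint_pixelwise_entropy(np.stack(latents)), (sy, sx)
                else:  # revert to the current best transform for this image
                    latents[i] = np.roll(img, shifts[i], axis=(0, 1))
    return latents, shifts

# Two copies of the same bar, one shifted; congealing re-aligns them.
base = np.zeros((8, 8), dtype=int)
base[3, 2:6] = 1
images = [base, np.roll(base, (0, 2), axis=(0, 1))]
latents, shifts = congeal(images)
print(joint_pixelwise_entropy(np.stack(latents)))  # 0.0 once aligned
```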

Classification

To use this model for classification, the class c with the maximum posterior probability given an observed test image I must be estimated. Applying Bayes' rule to p(c | I) and parametrizing by the transformation T gives a difficult integral that must be approximated; then the best transform T (that which maps the test image to its latent image) must be found. Once this transformation is found, the test image can be transformed into its latent, and a nearest neighbor classifier based on Hausdorff distance between images can classify the latent (and thus the test image) as belonging to a particular class c.

To find T, the test image I is inserted into the training ensemble for the congealing process. Since the test image is drawn from one of the categories c, congealing provides a corresponding transform T that maps I to its latent. The latent can then be classified. [26]
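
The nearest-neighbor step can be sketched as follows, treating each latent image as a set of "on" pixel coordinates; the point sets and labels below are invented for illustration:

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two sets of 2-D points."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def classify(test_points, training_sets):
    """Nearest-neighbour classification by Hausdorff distance to each
    (label, point set) pair, applied here to already-congealed latents."""
    return min(training_sets, key=lambda item: hausdorff(test_points, item[1]))[0]

train = [
    ("line", [(0, 0), (1, 0), (2, 0), (3, 0)]),
    ("corner", [(0, 0), (0, 1), (0, 2), (1, 0), (2, 0)]),
]
print(classify([(0, 0), (1, 0), (2, 0), (3.1, 0)], train))  # line
```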

Single-example classification

Given the set of transformations U_i obtained from congealing many images of a certain category, the classifier can be extended to the case where only one training example of a new category is allowed. Applying all the transformations sequentially to that single example creates an artificial training set for the new category. This artificial data set can be made larger by borrowing transformations from many already-known categories. Once this data set is obtained, a test instance of the new category can be classified as in the normal classification procedure. The key assumption is that categories are similar enough that the transforms from one can be applied to another. [27]
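
Building the artificial training set can be sketched as below, again assuming integer shifts as the borrowed transform family (the real system would reuse the transforms estimated by congealing the known categories):

```python
import numpy as np

# Transforms borrowed from congealing already-learned categories; integer
# shifts stand in for the learned transforms here.
borrowed_shifts = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]

# The single available example of the new category.
example = np.zeros((8, 8), dtype=int)
example[2:5, 3] = 1

# Applying every borrowed transform to the one example yields an artificial
# training set for the new category.
artificial_set = [np.roll(example, s, axis=(0, 1)) for s in borrowed_shifts]
print(len(artificial_set))  # 5
```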

Citations

  1. Li, Fergus & Perona 2002.
  2. Thorpe, Fize & Marlot 1996.
  3. Biederman 1987.
  4. Li, Fergus & Perona 2006, Section 1.
  5. Li 2006, Section 1.
  6. Li, Fergus & Perona 2006, Section 2.
  7. Fink 2004.
  8. Bart & Ullman 2005.
  9. Murphy et al. 2004.
  10. Hoiem, Efros & Hebert 2005.
  11. Li 2006, Section 2.
  12. Burl et al. 1996.
  13. Attias 1999.
  14. Li et al. 2006.
  15. Li, Fergus & Perona 2006, Section 3.1.
  16. Weber, Welling & Perona 2000.
  17. Fergus, Perona & Zisserman 2003.
  18. Kadir & Brady 2001.
  19. Li, Fergus & Perona 2006, Section 3.2.
  20. Li, Fergus & Perona 2006, Section 3.2.1.
  21. Li, Fergus & Perona 2006, Section 3.4.3.
  22. Li, Fergus & Perona 2006, Section 5.1.
  23. Li, Fergus & Perona 2006, Sections 4, 5.2.
  24. Few-Shot Learning (2/3): Siamese Networks. YouTube . Archived from the original on 2021-12-10.
  25. Miller, Matsakis & Viola 2000.
  26. Miller, Matsakis & Viola 2000, Section 4.
  27. Miller, Matsakis & Viola 2000, Section 7.
