One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.
The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans. [1] [2] It is estimated that a child learns almost all of the 10,000 to 30,000 object categories in the world by age six. [3] This is due not only to the human mind's computational power, but also to its ability to synthesize and learn new object categories from existing information about different, previously learned categories. Given two examples from two object categories, one an unknown object composed of familiar shapes and the other an unknown, amorphous shape, it is much easier for humans to recognize the former than the latter, suggesting that humans make use of previously learned categories when learning new ones. The key motivation for solving one-shot learning is that systems, like humans, can use knowledge about object categories to classify new objects. [4] [5]
As with most classification schemes, one-shot learning involves three main challenges:
Representation: How should objects and categories be described?
Learning: How may such descriptions be created?
Recognition: How may a known object be filtered from enveloping clutter, irrespective of occlusion, viewpoint, and lighting?
One-shot learning differs from single object recognition and standard category recognition algorithms in its emphasis on knowledge transfer, which makes use of previously learned categories.
The Bayesian one-shot learning algorithm represents the foreground and background of images as parametrized by a mixture of constellation models. [12] During the learning phase, the parameters of these models are learned using a conjugate density parameter posterior and Variational Bayesian Expectation–Maximization (VBEM). [13] In this stage the previously learned object categories inform the choice of model parameters via transfer by contextual information. For object recognition on new images, the posterior obtained during the learning phase is used in a Bayesian decision framework to estimate the ratio of p(object | test, train) to p(background clutter | test, train) where p is the probability of the outcome. [14]
Given the task of finding a particular object in a query image, the overall objective of the Bayesian one-shot learning algorithm is to compare the probability that the object is present with the probability that only background clutter is present. If the former probability is higher, the algorithm reports the object's presence; otherwise it reports the object's absence. To compute these probabilities, the object class must be modeled from a small set (1 to 5) of training images containing examples of the object.
To formalize these ideas, let $I$ be the query image, which contains either an example of the foreground category $O_{fg}$ or only background clutter of a generic background category $O_{bg}$. Also let $I_t$ be the set of training images used as the foreground category. The decision of whether $I$ contains an object from the foreground category, or only clutter from the background category, is:

$$R = \frac{p(O_{fg} \mid I, I_t)}{p(O_{bg} \mid I, I_t)} = \frac{p(I \mid I_t, O_{fg})\, p(O_{fg})}{p(I \mid I_t, O_{bg})\, p(O_{bg})}$$
where the class posteriors $p(O_{fg} \mid I, I_t)$ and $p(O_{bg} \mid I, I_t)$ have been expanded by Bayes' theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image contains an object from the foreground class if $R$ exceeds a certain threshold $T$. We next introduce parametric models for the foreground and background categories with parameters $\theta$ and $\theta_{bg}$ respectively. The foreground parametric model is learned during the learning stage from the training images $I_t$, as well as from prior information about previously learned categories. The background model is assumed to be uniform across images. Omitting the constant ratio of category priors, $\frac{p(O_{fg})}{p(O_{bg})}$, and parametrizing over $\theta$ and $\theta_{bg}$ yields

$$R \propto \frac{\int p(I \mid \theta, O_{fg})\, p(\theta \mid I_t, O_{fg})\, d\theta}{\int p(I \mid \theta_{bg}, O_{bg})\, p(\theta_{bg} \mid I_t, O_{bg})\, d\theta_{bg}}$$
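The decision rule can be illustrated with a minimal numerical sketch. The following toy Python example is not the paper's model: it uses a one-dimensional Gaussian stand-in for $p(I \mid \theta)$, represents the learned posterior by samples, and approximates the foreground integral by Monte Carlo against a uniform background likelihood. All densities and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stand-in: p(I | theta) is a Gaussian density over a scalar image
# feature, with theta = (mu, sigma).
def likelihood(feature, theta):
    mu, sigma = theta
    return np.exp(-0.5 * ((feature - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical posterior samples for the foreground parameters, standing in
# for what the learning phase would produce from a handful of training images.
theta_samples = [(rng.normal(1.0, 0.3), 0.5) for _ in range(1000)]

def decide(feature, p_background=0.05, threshold=1.0):
    # Monte Carlo estimate of  int p(I|theta, O_fg) p(theta|I_t, O_fg) dtheta.
    p_fg = np.mean([likelihood(feature, th) for th in theta_samples])
    R = p_fg / p_background            # uniform background likelihood
    return "object" if R > threshold else "clutter"

print(decide(1.1))   # near the learned mean: R > 1, object reported
print(decide(4.0))   # far from the mean: R < 1, clutter reported
```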
The posterior distribution of model parameters given the training images, $p(\theta \mid I_t, O_{fg})$, is estimated in the learning phase. In this estimation, one-shot learning differs sharply from more traditional Bayesian estimation models, which approximate the integral as $p(I \mid \theta^*)$, collapsing the posterior to a point estimate $\theta^*$. Instead, it uses a variational approach that draws on prior information from previously learned categories. For the background model and for the categories learned in advance, however, the traditional maximum likelihood estimation of the model parameters is used. [15]
For each query image $I$ and set of training images $I_t$, a constellation model is used for representation. [12] [16] [17] To obtain this model for a given image $I$, a set of N interesting regions is first detected in the image using the Kadir–Brady saliency detector. [18] Each selected region is represented by a location in the image, $X_i$, and a description of its appearance, $A_i$. Letting $X = \{X_i\}_{i=1}^N$, $A = \{A_i\}_{i=1}^N$, and $X_t$, $A_t$ the analogous representations for the training images, the expression for R becomes:

$$R \propto \frac{\int p(X, A \mid \theta, O_{fg})\, p(\theta \mid X_t, A_t, O_{fg})\, d\theta}{\int p(X, A \mid \theta_{bg}, O_{bg})\, p(\theta_{bg} \mid X_t, A_t, O_{bg})\, d\theta_{bg}}$$
The likelihoods $p(X, A \mid \theta, O_{fg})$ and $p(X, A \mid \theta_{bg}, O_{bg})$ are represented as mixtures of constellation models. A typical constellation model has P (3 to 7) parts and N (about 100) interest regions. Thus a P-dimensional vector $\mathbf{h}$ assigns one region of interest (out of N regions) to each model part (for P parts). Thus $\mathbf{h}$ denotes a hypothesis (an assignment of interest regions to model parts) for the model, and a full constellation model is represented by summing over all possible hypotheses $\mathbf{h}$ in the hypothesis space $H$. Finally the likelihood is written

$$p(X, A \mid \theta) = \sum_{\omega} \sum_{\mathbf{h} \in H} p(X, A, \mathbf{h}, \omega \mid \theta)$$
The different $\omega$'s represent different configurations of parts, whereas the different hypotheses $\mathbf{h}$ represent different assignations of regions to parts, given a part model $\omega$. The assumption that the shape of the model (as represented by $X$, the collection of part locations) and the appearance are independent allows one to factor the likelihood expression into two separate likelihoods of appearance and shape. [19]
The appearance of each feature is represented by a point in appearance space (discussed below under implementation). "Each part $p$ in the constellation model has a Gaussian density within this space with mean and precision parameters $\mu^A_{p,\omega}$ and $\Gamma^A_{p,\omega}$." From these, the appearance likelihood described above is computed as a product of Gaussians over the model parts for a given hypothesis $\mathbf{h}$ and mixture component $\omega$. [20]
The shape of the model for a given mixture component $\omega$ and hypothesis $\mathbf{h}$ is represented as a joint Gaussian density of the locations of features. These features are transformed into a scale- and translation-invariant space before modelling the relative location of the parts by a $2(P-1)$-dimensional Gaussian. From this, we obtain the shape likelihood, completing our representation of $p(X, A \mid \theta)$. In order to reduce the number of hypotheses in the hypothesis space $H$, only those hypotheses that satisfy the ordering constraint that the x-coordinate of each part is monotonically increasing are considered. This reduces the size of the hypothesis space by a factor of $P!$. [20]
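To make the sum over hypotheses concrete, here is a hedged sketch with a single mixture component and random placeholder parameters. It enumerates the ordering-constrained hypothesis space and sums, over hypotheses, the product of per-part appearance Gaussians and one joint shape Gaussian over relative part locations (translation invariance only; the full model also removes scale).

```python
import numpy as np
from itertools import permutations
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, P, D = 6, 3, 10                              # regions, parts, descriptor dim
locs = rng.uniform(0, 100, size=(N, 2))         # X: region locations
apps = rng.normal(size=(N, D))                  # A: appearance descriptors

# Illustrative per-part appearance means and one joint shape Gaussian
# over the 2(P-1) relative part locations.
app_means = rng.normal(size=(P, D))
shape_mean = np.zeros(2 * (P - 1))
shape_cov = np.eye(2 * (P - 1)) * 100.0

def hypothesis_likelihood(h):
    # Appearance: product of per-part Gaussians over the assigned regions.
    app = np.prod([multivariate_normal.pdf(apps[h[p]], app_means[p], np.eye(D))
                   for p in range(P)])
    # Shape: part locations relative to the first part, one joint Gaussian.
    rel = (locs[list(h[1:])] - locs[h[0]]).ravel()
    return app * multivariate_normal.pdf(rel, shape_mean, shape_cov)

# Hypothesis space: assignments of regions to parts whose x-coordinates
# increase monotonically (the ordering constraint).
H = [h for h in permutations(range(N), P)
     if all(locs[h[p], 0] < locs[h[p + 1], 0] for p in range(P - 1))]
likelihood = sum(hypothesis_likelihood(h) for h in H)
print(len(H), likelihood)
```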
In order to compute $R$, the integral $\int p(X, A \mid \theta, O_{fg})\, p(\theta \mid X_t, A_t, O_{fg})\, d\theta$ must be evaluated, but it is analytically intractable. The object category model above gives information about $p(X, A \mid \theta, O_{fg})$, so what remains is to examine $p(\theta \mid X_t, A_t, O_{fg})$, the posterior of $\theta$, and find an approximation sufficient to render the integral tractable. Previous work approximates the posterior by a $\delta$ function centered at $\theta^*$, collapsing the integral in question into $p(X, A \mid \theta^*)$. This $\theta^*$ is normally estimated using a maximum likelihood ($\theta^* = \theta^{ML}$) or maximum a posteriori ($\theta^* = \theta^{MAP}$) procedure. However, because one-shot learning uses few training examples, the posterior will not be well-peaked, as a $\delta$ function approximation assumes. Thus, instead of this traditional approximation, the Bayesian one-shot learning algorithm seeks to "find a parametric form of $p(\theta)$ such that the learning of $p(\theta \mid X_t, A_t, O_{fg})$ is feasible". The algorithm employs a normal-Wishart distribution as the conjugate prior over $\theta$, and in the learning phase, variational Bayesian methods with the same computational complexity as maximum likelihood methods are used to learn the hyperparameters of the distribution. Then, since $p(X, A \mid \theta)$ is a product of Gaussians, as chosen in the object category model, the integral reduces to a multivariate Student's t-distribution, which can be evaluated. [21]
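The closed-form predictive that makes the integral tractable can be sketched as follows. This uses the standard normal-Wishart predictive result for a single Gaussian; the hyperparameter names follow the usual textbook conjugate parametrization (e.g. Bishop's PRML), not the paper's notation, and the numbers are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_t

# Predictive density under a normal-Wishart posterior NW(m, beta, W, nu)
# over a Gaussian's mean and precision: a multivariate Student's t.
def predictive_density(x, m, beta, W, nu):
    d = len(m)
    df = nu + 1 - d                                   # degrees of freedom
    scale = np.linalg.inv(W) * (1 + beta) / (beta * df)
    return multivariate_t(loc=m, shape=scale, df=df).pdf(x)

m = np.zeros(2)
print(predictive_density(np.array([0.1, -0.2]), m, beta=2.0, W=np.eye(2), nu=5.0))
```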
To detect features in an image so that it can be represented by a constellation model, the Kadir–Brady saliency detector is used on grey-scale images to find salient regions of the image. These regions are then clustered, yielding a number of features (the clusters) and the shape parameter $X$, composed of the cluster centers. The Kadir–Brady detector was chosen because it produces fewer, more salient regions, as opposed to feature detectors such as multiscale Harris, which produce numerous, less significant regions.
The regions are then taken from the image and rescaled to a small patch of 11 × 11 pixels, allowing each patch to be represented in 121-dimensional space. This dimensionality is reduced using principal component analysis, and $A$, the appearance parameter, is then formed from the first 10 principal components of each patch. [22]
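This patch-and-PCA step is easy to sketch. In the following example, the Kadir–Brady detector is replaced by random region centres (an assumption for illustration), since only the rescaling and dimensionality-reduction steps are being shown.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
image = rng.random((200, 200))                 # stand-in for a grey-scale image

# Region centres would come from the Kadir-Brady detector; random here.
centers = rng.integers(20, 180, size=(50, 2))
patches = np.stack([image[r - 5:r + 6, c - 5:c + 6] for r, c in centers])
flat = patches.reshape(len(patches), -1)       # 50 patches, 121 dims each

pca = PCA(n_components=10)                     # keep the first 10 components
A = pca.fit_transform(flat)                    # the appearance parameters
print(A.shape)                                 # (50, 10)
```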
To obtain shape and appearance priors, three categories (spotted cats, faces, and airplanes) are learned using maximum likelihood estimation. These object category model parameters are then used to estimate the hyper-parameters of the desired priors.
Given a set of training examples, the algorithm runs the feature detector on these images and determines model parameters from the salient regions. The hypothesis index $\mathbf{h}$ assigning features to parts prevents a closed-form solution of the linear model, so the posterior $p(\theta \mid X_t, A_t, O_{fg})$ is estimated by the variational Bayesian expectation–maximization (VBEM) algorithm, which is run until parameter convergence after roughly 100 iterations. Learning a category in this fashion takes under a minute on a 2.8 GHz machine with a 4-part model and fewer than 10 training images. [23]
To learn the motorbike category:
1. A handful of training images of the motorbike category are selected, and the feature detector and PCA are applied to obtain $X_t$ and $A_t$.
2. The prior hyperparameters are estimated from the models of the three previously learned categories: spotted cats, faces, and airplanes.
3. VBEM is run on the training data to estimate the posterior over the foreground model parameters, which is then used with the uniform background model to evaluate $R$ on query images.
Another algorithm uses knowledge transfer by model parameters to learn a new object category that is similar in appearance to previously learned categories. An image is represented as either a texture and shape, or as a latent image $I_L$ that has been transformed, denoted by $I = T(I_L)$.
A Siamese neural network works in tandem on two different input vectors to compute comparable output vectors; this architecture has also been applied to one-shot image recognition. [24]
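As a rough illustration of the idea (a single random linear-tanh layer standing in for a real convolutional embedding, and untrained weights, so the scores are arbitrary until the network is trained on labelled pairs):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(64, 121))   # embedding weights, shared by twins
w_out = rng.normal(scale=0.1, size=64)      # weights on the component distances

def embed(x):
    # Both inputs pass through the same network (identical weights).
    return np.tanh(W @ x)

def similarity(x1, x2):
    d = np.abs(embed(x1) - embed(x2))        # component-wise L1 distance
    return 1.0 / (1.0 + np.exp(-w_out @ d))  # sigmoid: same-class score

a, b = rng.random(121), rng.random(121)
print(similarity(a, b))
```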
In the latter approach, congealing is "the simultaneous vectorization of each of a set of images to each other". For a set of training images of a certain category, congealing iteratively transforms each image to minimize the images' joint pixelwise entropy E, where

$$E = \sum_{p \in P} H(\nu(p))$$

where $\nu(p)$ is the binary random variable defined by the values of a particular pixel p across all of the images, $H(\cdot)$ is the discrete entropy function of that variable, and $P$ is the set of pixel indices for the image.
The congealing algorithm begins with a set of images $I_i$ and a corresponding set of transform matrices $U_i$, which at the end of the algorithm will represent the transformation of $I_i$ into its latent image $I_{L_i}$. These latent images minimize the joint pixelwise entropy. Thus the task of the congealing algorithm is to estimate the transformations $U_i$.
Sketch of the algorithm:
1. Initialize each transform $U_i$ to the identity.
2. Compute the joint pixelwise entropy of the currently transformed image set.
3. For each image $I_i$, perturb the affine transform $U_i$ (rotation, translation, scaling, shear) and accept any perturbation that decreases the joint pixelwise entropy.
4. Repeat until convergence.
At the end of the algorithm, $U_i$ maps $I_i$ to its latent image $I_{L_i}$, and $U_i^{-1}$ transforms the latent image back into the originally observed image. [25]
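A minimal toy version of this entropy-minimization loop might look like the following. Unit translations stand in for full affine transforms, and random binary images stand in for a real image category; both are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
images = (rng.random((20, 32, 32)) > 0.7).astype(float)   # binary image stack

def joint_pixelwise_entropy(stack):
    # E = sum_p H(nu(p)): entropy of each pixel's value across the stack.
    q = stack.mean(axis=0).clip(1e-9, 1 - 1e-9)           # P(pixel = 1)
    return float(-(q * np.log(q) + (1 - q) * np.log(1 - q)).sum())

def congeal_step(stack):
    # Greedy hill-climbing: keep any unit translation that lowers E.
    for i in range(len(stack)):
        for shift in [(0, 1), (0, -1), (1, 0), (-1, 0)]:
            trial = stack.copy()
            trial[i] = np.roll(stack[i], shift, axis=(0, 1))
            if joint_pixelwise_entropy(trial) < joint_pixelwise_entropy(stack):
                stack = trial
    return stack

before = joint_pixelwise_entropy(images)
after = joint_pixelwise_entropy(congeal_step(images))
print(before, after)                                      # after <= before
```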
To use this model for classification, the class with the maximum posterior probability given an observed image $I$ must be found. Applying Bayes' rule to $p(c_j \mid I)$ and parametrizing by the transformation $T$ gives a difficult integral that must be approximated; then the best transform $T$ (the one that maps the test image to its latent image) must be found. Once this transformation is found, the test image can be transformed into its latent image, and a nearest-neighbor classifier based on the Hausdorff distance between images can classify the latent image (and thus the test image) as belonging to a particular class $c_j$.
To find $T$, the test image $I$ is inserted into the training ensemble for the congealing process. Since the test image is drawn from one of the categories $c_j$, congealing provides a corresponding transform that maps $I$ to its latent image. The latent image can then be classified. [26]
Given a set of transformations obtained by congealing many images of a certain category, the classifier can be extended to the case where only one training example $I_t$ of a new category is allowed. Applying all of the transformations sequentially to $I_t$ creates an artificial training set for the new category. This artificial data set can be made larger by borrowing transformations from many already-known categories. Once this data set is obtained, a test instance $I$ of the new category can be classified as in the normal classification procedure. The key assumption is that categories are similar enough that the transforms from one can be applied to another. [27]
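The artificial-training-set idea can be sketched in a few lines. Here unit translations stand in for the borrowed congealing transforms and Euclidean distance stands in for the Hausdorff distance; all names and parameters are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)

# One observed example of a new category, plus transformations "borrowed"
# from known categories (unit translations here).
example = (rng.random((32, 32)) > 0.7).astype(float)
shifts = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]
artificial_set = [np.roll(example, s, axis=(0, 1)) for s in shifts]

def classify(test, training_sets):
    # Nearest neighbour over all examples in all (artificial) training sets.
    dist, label = min((np.linalg.norm(test - ex), label)
                      for label, exs in training_sets.items() for ex in exs)
    return label

other = [(rng.random((32, 32)) > 0.3).astype(float)]
sets = {"new_category": artificial_set, "other": other}
print(classify(np.roll(example, (0, 1), axis=(0, 1)), sets))  # "new_category"
```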