M-theory (learning framework)

In machine learning and computer vision, M-theory is a learning framework inspired by feed-forward processing in the ventral stream of visual cortex and originally developed for recognition and classification of objects in visual scenes. M-theory was later applied to other areas, such as speech recognition. On certain image recognition tasks, algorithms based on a specific instantiation of M-theory, HMAX, achieved human-level performance. [1]

The core principle of M-theory is extracting representations that are invariant under various transformations of images (translation, scale, 2D and 3D rotation, and others). In contrast with other approaches using invariant representations, in M-theory they are not hardcoded into the algorithms but learned. M-theory also shares some principles with compressed sensing. The theory proposes a multilayered hierarchical learning architecture, similar to that of the visual cortex.

Intuition

Invariant representations

A great challenge in visual recognition tasks is that the same object can be seen under a variety of conditions: from different distances, from different viewpoints, under different lighting, partially occluded, etc. In addition, for particular classes of objects, such as faces, highly complex specific transformations may be relevant, such as changing facial expressions. For learning to recognize images, it is greatly beneficial to factor out these variations. This results in a much simpler classification problem and, consequently, in a great reduction of the sample complexity of the model.

A simple computational experiment illustrates this idea. Two instances of a classifier were trained to distinguish images of planes from images of cars. For training and testing the first instance, images with arbitrary viewpoints were used. The other instance received only images seen from a particular viewpoint, which was equivalent to training and testing the system on an invariant representation of the images. The second classifier performed quite well even after receiving a single example from each category, while the performance of the first classifier was close to random guessing even after seeing 20 examples.
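
To make this concrete, here is a toy simulation in the same spirit (a hypothetical sketch with synthetic feature vectors, not the original experiment's code or data): a one-nearest-neighbor classifier is trained on examples with a random "viewpoint" nuisance, and again on viewpoint-normalized examples.

```python
# Toy simulation: classification with and without a viewpoint nuisance.
# Each "image" is a class-specific vector plus a viewpoint-dependent
# distortion; the "fixed viewpoint" condition factors the nuisance out.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TEST = 50, 200

def make_image(class_vec, fixed_viewpoint=False):
    if fixed_viewpoint:
        return class_vec + 0.1 * rng.normal(size=DIM)   # nuisance factored out
    q, _ = np.linalg.qr(rng.normal(size=(DIM, DIM)))    # random "viewpoint"
    return q @ class_vec + 0.1 * rng.normal(size=DIM)

def nearest_class_accuracy(train, test):
    # 1-nearest-neighbor classification against per-class training sets.
    correct = 0
    for x, label in test:
        d = [min(np.linalg.norm(x - t) for t in train[c]) for c in (0, 1)]
        correct += int(np.argmin(d) == label)
    return correct / len(test)

classes = [rng.normal(size=DIM) for _ in range(2)]
for fixed in (False, True):
    train = {c: [make_image(classes[c], fixed)] for c in (0, 1)}  # ONE example each
    test = [(make_image(classes[c], fixed), c)
            for _ in range(N_TEST // 2) for c in (0, 1)]
    print("fixed viewpoint" if fixed else "arbitrary viewpoints",
          "->", nearest_class_accuracy(train, test))
```

With arbitrary viewpoints, one example per class leaves the classifier near chance; with the nuisance removed, a single example per class already suffices.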

Invariant representations have been incorporated into several learning architectures, such as the neocognitron. Most of these architectures, however, provided invariance through custom-designed features or properties of the architecture itself. While this helps to take into account some sorts of transformations, such as translations, it is very nontrivial to accommodate other sorts, such as 3D rotations and changing facial expressions. M-theory provides a framework for how such transformations can be learned. In addition to higher flexibility, the theory also suggests how the human brain may achieve similar capabilities.

Templates

Another core idea of M-theory is close in spirit to ideas from the field of compressed sensing. An implication of the Johnson–Lindenstrauss lemma says that a particular number of images can be embedded into a low-dimensional feature space that approximately preserves the distances between images, by using random projections. This result suggests that the dot product between the observed image and some other image stored in memory, called a template, can be used as a feature that helps to distinguish the image from other images. The template need not be related to the image in any way; it could be chosen randomly.
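
The following sketch illustrates the random-projection idea behind templates (my illustration of the Johnson–Lindenstrauss phenomenon, not code from the M-theory literature): dot products with a few hundred random "templates" approximately preserve pairwise distances between high-dimensional "images".

```python
# Random templates as a distance-preserving embedding (Johnson-Lindenstrauss).
import numpy as np

rng = np.random.default_rng(1)
n_images, pixels, n_templates = 20, 4096, 256

images = rng.normal(size=(n_images, pixels))
templates = rng.normal(size=(n_templates, pixels)) / np.sqrt(n_templates)

features = images @ templates.T        # each image -> vector of dot products

def pdist(x):
    # All pairwise Euclidean distances.
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

ratio = pdist(features) / np.where(pdist(images) == 0, 1, pdist(images))
off_diag = ratio[~np.eye(n_images, dtype=bool)]
print("distance distortion: %.2f to %.2f" % (off_diag.min(), off_diag.max()))
```

The printed distortion ratios stay close to 1, even though the feature space is far smaller than the pixel space.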

Combining templates and invariant representations

The two ideas outlined in the previous sections can be brought together to construct a framework for learning invariant representations. The key observation is how the dot product between an image $I$ and a template $t$ behaves when the image is transformed (by such transformations as translations, rotations, and scalings). If the transformation $g$ is a member of a unitary group of transformations, then the following holds:

$$\langle gI, t \rangle = \langle I, g^{-1} t \rangle \qquad (1)$$

In other words, the dot product of the transformed image and a template is equal to the dot product of the original image and the inversely transformed template. For instance, for an image rotated by 90 degrees, the inversely transformed template would be rotated by −90 degrees.

Consider the set of dot products of an image $I$ with all possible transformations of a template $t$: $\{\langle I, gt \rangle \mid g \in G\}$. If one applies a transformation $g'$ to $I$, the set becomes $\{\langle g'I, gt \rangle \mid g \in G\}$. But because of property (1), this is equal to $\{\langle I, g'^{-1}gt \rangle \mid g \in G\}$. The set $\{g'^{-1}g \mid g \in G\}$ is equal to just the set of all elements in $G$. To see this, note that every $g'^{-1}g$ is in $G$ due to the closure property of groups, and for every $\hat g$ in $G$ there exists a prototype $g$ such that $g'^{-1}g = \hat g$ (namely, $g = g'\hat g$). Thus, $\{\langle g'I, gt \rangle \mid g \in G\} = \{\langle I, gt \rangle \mid g \in G\}$. One can see that the set of dot products remains the same despite a transformation being applied to the image! This set by itself may serve as a (very cumbersome) invariant representation of an image. More practical representations can be derived from it.
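
This argument can be checked numerically. In the toy example below (my own sketch; the group is the unitary group of circular shifts of a one-dimensional "image"), the set of dot products with all shifted copies of a template is unchanged when the image itself is shifted.

```python
# Numerical check: the SET of dot products of an image with all shifted
# copies of a template does not change when the image is shifted.
import numpy as np

rng = np.random.default_rng(2)
N = 16
image = rng.normal(size=N)
template = rng.normal(size=N)

def all_shift_dots(x, t):
    # Dot products of x with every cyclic shift of the template t, as a
    # sorted list (the set ignores ordering).
    return sorted(float(np.dot(x, np.roll(t, s))) for s in range(N))

shifted_image = np.roll(image, 5)            # apply a group element to the image
a = all_shift_dots(image, template)
b = all_shift_dots(shifted_image, template)
print("sets of dot products equal:", np.allclose(a, b))   # True
```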

In the introductory section, it was claimed that M-theory allows invariant representations to be learned. This is because templates and their transformed versions can be learned from visual experience, by exposing the system to sequences of transformations of objects. It is plausible that similar visual experiences occur in the early period of human life, for instance when infants twiddle toys in their hands. Because the templates may be totally unrelated to the images that the system will later try to classify, memories of these visual experiences may serve as a basis for recognizing many different kinds of objects in later life. However, as shown later, for some kinds of transformations specific templates are needed.

Theoretical aspects

From orbits to distribution measures

To implement the ideas described in the previous sections, one needs to know how to derive a computationally efficient invariant representation of an image. Such a unique representation for each image can be characterized by a set of one-dimensional probability distributions (empirical distributions of the dot products between the image and a set of templates stored during unsupervised learning). These probability distributions, in turn, can be described by either histograms or a set of statistical moments, as shown below.

An orbit $O_I$ is the set of images $gI$ generated from a single image $I$ under the action of the group $G$.

In other words, images of an object and of its transformations correspond to an orbit $O_I$. If two orbits have a point in common, they are identical everywhere, [2] i.e. an orbit is an invariant and unique representation of an image. So, two images are called equivalent when they belong to the same orbit: $I \sim I'$ if $\exists g \in G$ such that $I' = gI$. Conversely, two orbits are different if none of the images in one orbit coincides with any image in the other. [3]

A natural question arises: how can one compare two orbits? There are several possible approaches. One of them employs the fact that, intuitively, two orbits are the same irrespective of the ordering of their points. Thus, one can consider the probability distribution $P_I$ induced by the group's action on images ($gI$ can be seen as a realization of a random variable).

This probability distribution can be almost uniquely characterized by $K$ one-dimensional probability distributions $P_{\langle I, t^k \rangle}$ induced by the (one-dimensional) results of the projections $\langle I, t^k \rangle$, where $t^k$, $k = 1, \dots, K$, are a set of templates (randomly chosen images). This is based on the Cramér–Wold theorem [4] and concentration of measure.

Consider $n$ images $\mathcal{X}_n$ in $\mathcal{X}$. Let $K \geq \frac{2}{c\varepsilon^2}\log\frac{n}{\delta}$, where $c$ is a universal constant. Then

$$|d(P_I, P_{I'}) - \hat{d}_K(I, I')| \leq \varepsilon,$$

with probability $1 - \delta^2$, for all $I, I' \in \mathcal{X}_n$ (here $d$ is a metric on probability distributions and $\hat{d}_K$ is its empirical estimate based on the $K$ projections).

This result (informally) says that an approximately invariant and unique representation of an image $I$ can be obtained from the estimates of the 1-D probability distributions $P_{\langle I, t^k \rangle}$ for $k = 1, \dots, K$. The number of projections $K$ needed to discriminate $n$ orbits, induced by $n$ images, up to precision $\varepsilon$ (and with confidence $1 - \delta^2$) is $K \geq \frac{2}{c\varepsilon^2}\log\frac{n}{\delta}$, where $c$ is a universal constant.
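
As a worked instance of this bound (illustrative only: the theory leaves the universal constant unspecified, so $c = 1$ below is an arbitrary assumption), the required number of templates can be computed directly:

```python
# Worked instance of K >= (2 / (c * eps^2)) * log(n / delta).
# The universal constant c is unspecified by the theory; c = 1 is assumed.
import math

def num_templates(n, eps, delta, c=1.0):
    return math.ceil(2.0 / (c * eps * eps) * math.log(n / delta))

# Templates needed to discriminate 1000 orbits to precision 0.1 with delta 0.01.
print(num_templates(n=1000, eps=0.1, delta=0.01))
```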

To classify an image, the following "recipe" can be used:

  1. Memorize a set of images/objects called templates;
  2. Memorize observed transformations for each template;
  3. Compute the dot products of the image with the stored transformations of each template;
  4. Compute the histogram of the resulting values, called the signature of the image;
  5. Compare the obtained histogram with signatures stored in memory.

Estimates of such one-dimensional probability density functions (PDFs) can be written in terms of histograms as $\mu_n^k(I) = \frac{1}{|G|} \sum_{g \in G} \eta_n\left(\langle I, g t^k \rangle\right)$, $n = 1, \dots, N$, where $\eta_n$ is a set of nonlinear functions. These 1-D probability distributions can be characterized with $N$-bin histograms or a set of statistical moments. For example, HMAX represents an architecture in which pooling is done with a max operation.
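
The whole recipe can be sketched in a few lines. The toy implementation below (hypothetical code, with circular shifts as the transformation group and histogram pooling; it is not the reference HMAX implementation) computes signatures and shows that an image and its transformed version get matching signatures, while an unrelated image does not.

```python
# Toy signature computation following the "recipe" above.
import numpy as np

rng = np.random.default_rng(3)
N, K, BINS = 32, 8, 10

templates = rng.normal(size=(K, N))                 # step 1: memorize templates
stored = np.array([[np.roll(t, s) for s in range(N)] for t in templates])
                                                    # step 2: their transformations

def signature(image):
    # Steps 3-4: dot products with stored transformations, then histograms.
    sig = []
    for k in range(K):
        dots = stored[k] @ image                    # all <I, g t^k>, g in G
        hist, _ = np.histogram(dots, bins=BINS, range=(-25, 25), density=True)
        sig.append(hist)
    return np.concatenate(sig)

image = rng.normal(size=N)
same_orbit = np.roll(image, 7)                      # a transformed version
other = rng.normal(size=N)

# Step 5: compare signatures; the shifted image matches, the other does not.
print(np.linalg.norm(signature(image) - signature(same_orbit)))  # ~0
print(np.linalg.norm(signature(image) - signature(other)))       # larger
```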

Non-compact groups of transformations

In the "recipe" for image classification, the group of transformations is approximated with a finite number of transformations. Such an approximation is possible only when the group is compact.

Groups such as all translations and all scalings of the image are not compact, as they allow arbitrarily large transformations. However, they are locally compact. For locally compact groups, invariance is achievable within a certain range of transformations. [2]

Assume that $G_0$ is the subset of transformations from $G$ for which the transformed templates exist in memory. For an image $I$ and template $t$, assume that $\langle gI, t \rangle$ is equal to zero everywhere except on some subset of $G_0$. This subset is called the support of $\langle gI, t \rangle$ and denoted $\operatorname{supp}(\langle gI, t \rangle)$. It can be proven that if, for a transformation $g'$, the support set of $\langle g g'I, t \rangle$ also lies within $G_0$, then the signature of $g'I$ is invariant with respect to $g'$. [2] This theorem determines the range of transformations for which invariance is guaranteed to hold.

One can see that the smaller $\operatorname{supp}(\langle gI, t \rangle)$ is, the larger the range of transformations for which invariance is guaranteed to hold. This means that for a group that is only locally compact, not all templates work equally well anymore. Preferable templates are those with a reasonably small $\operatorname{supp}(\langle gI, t \rangle)$ for a generic image. This property is called localization: templates are sensitive only to images within a small range of transformations. Although minimizing the support is not absolutely necessary for the system to work, it improves the approximation of invariance. Requiring localization simultaneously for translation and scale yields a very specific kind of template: Gabor functions. [2]
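
For illustration, a one-dimensional Gabor template can be written as a sinusoid under a Gaussian envelope (the textbook form; the particular $\sigma$ and wavelength below are arbitrary choices):

```python
# A 1-D Gabor template: sinusoidal carrier under a Gaussian envelope.
# The envelope localizes the template in space, the carrier in frequency.
import numpy as np

def gabor(x, sigma=2.0, wavelength=4.0, phase=0.0):
    envelope = np.exp(-x**2 / (2.0 * sigma**2))              # spatial localization
    carrier = np.cos(2.0 * np.pi * x / wavelength + phase)   # frequency localization
    return envelope * carrier

x = np.linspace(-10, 10, 256)
template = gabor(x)
print("effective support (|value| > 1%% of peak): %d of %d samples"
      % (int(np.sum(np.abs(template) > 0.01 * np.abs(template).max())), x.size))
```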

The desirability of custom templates for non-compact groups is in conflict with the principle of learning invariant representations. However, for certain kinds of regularly encountered image transformations, templates may be the result of evolutionary adaptations. Neurobiological data suggest that there is Gabor-like tuning in the first layer of the visual cortex. [5] The optimality of Gabor templates for translations and scales is a possible explanation of this phenomenon.

Non-group transformations

Many interesting transformations of images do not form groups. For instance, transformations of images associated with 3D rotation of the corresponding 3D object do not form a group, because it is impossible to define an inverse transformation (two objects may look the same from one angle but different from another). However, approximate invariance is still achievable even for non-group transformations, if the localization condition for templates holds and the transformation can be locally linearized.

As was said in the previous section, for the specific case of translations and scalings, the localization condition can be satisfied by use of generic Gabor templates. However, for general (non-group) transformations, the localization condition can be satisfied only for a specific class of objects. [2] More specifically, in order to satisfy the condition, templates must be similar to the objects one would like to recognize. For instance, if one would like to build a system to recognize 3D-rotated faces, one needs to use other 3D-rotated faces as templates. This may explain the existence of specialized modules in the brain, such as the one responsible for face recognition. [2] Even with custom templates, a noise-like encoding of images and templates is necessary for localization. This can be achieved naturally if the non-group transformation is processed on any layer other than the first in a hierarchical recognition architecture.

Hierarchical architectures

The previous section suggests one motivation for hierarchical image recognition architectures. However, they have other benefits as well.

Firstly, hierarchical architectures best accomplish the goal of ‘parsing’ a complex visual scene with many objects consisting of many parts, whose relative positions may vary greatly. In this case, different elements of the system must react to different objects and parts. In hierarchical architectures, representations of parts at different levels of the embedding hierarchy can be stored at different layers of the hierarchy.

Secondly, hierarchical architectures that have invariant representations for parts of objects may facilitate the learning of complex compositional concepts. This facilitation may happen through the reuse of learned representations of parts that were constructed earlier, in the process of learning other concepts. As a result, the sample complexity of learning compositional concepts may be greatly reduced.

Finally, hierarchical architectures have better tolerance to clutter. The clutter problem arises when the target object is in front of a non-uniform background, which functions as a distractor for the visual task. A hierarchical architecture provides signatures for parts of target objects, which do not include parts of the background and are not affected by background variations. [6]

In hierarchical architectures, one layer is not necessarily invariant to all transformations that are handled by the hierarchy as a whole. Some transformations may pass through that layer to upper layers, as in the case of non-group transformations described in the previous section. For other transformations, an element of the layer may produce invariant representations only within a small range of transformations. For instance, elements of the lower layers in the hierarchy have a small visual field and thus can handle only a small range of translations. For such transformations, the layer should provide covariant, rather than invariant, signatures. The property of covariance can be written as $\operatorname{distr}_g(\langle \mu_l(gI), \mu_l(t) \rangle) = \operatorname{distr}_g(\langle \mu_l(I), \mu_l(gt) \rangle)$, where $l$ is a layer, $\mu_l(I)$ is the signature of the image on that layer, and $\operatorname{distr}_g$ stands for "the distribution of values of the expression for all $g \in G$".
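
The covariance property can be checked numerically for a toy layer (my own construction, assuming circular shifts as the group and a shift-covariant local-averaging "layer" standing in for $\mu_l$):

```python
# Numerical check of the covariance property for a toy shift-covariant layer:
# the distribution over g of <mu(gI), mu(t)> matches that of <mu(I), mu(gt)>
# as a multiset.
import numpy as np

rng = np.random.default_rng(5)
N = 16

def layer(x):
    # Local average pooling over a window of 3; commutes with circular shifts.
    return (x + np.roll(x, 1) + np.roll(x, -1)) / 3.0

image = rng.normal(size=N)
template = rng.normal(size=N)

lhs = sorted(float(np.dot(layer(np.roll(image, g)), layer(template)))
             for g in range(N))
rhs = sorted(float(np.dot(layer(image), layer(np.roll(template, g))))
             for g in range(N))
print("covariant:", np.allclose(lhs, rhs))   # True
```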

Relation to biology

M-theory is based on a quantitative theory of the ventral stream of visual cortex. [7] [8] Understanding how the visual cortex works in object recognition is still a challenging task for neuroscience. Humans and primates are able to memorize and recognize objects after seeing just a couple of examples, unlike any state-of-the-art machine vision systems, which usually require a lot of data in order to recognize objects. So far, the use of visual neuroscience in computer vision has been limited to early vision: to deriving stereo algorithms (e.g., [9] ) and to justifying the use of DoG (derivative-of-Gaussian) filters and, more recently, of Gabor filters. [10] [11] No real attention has been given to biologically plausible features of higher complexity. While mainstream computer vision has always been inspired and challenged by human vision, it seems never to have advanced past the very first stages of processing in the simple cells of V1 and V2. Although some of the systems inspired, to various degrees, by neuroscience have been tested on at least some natural images, neurobiological models of object recognition in cortex had not been extended to deal with real-world image databases. [12]

The M-theory learning framework employs a novel hypothesis about the main computational function of the ventral stream: the representation of new objects/images in terms of a signature that is invariant to transformations learned during visual experience. This allows recognition from very few labeled examples; in the limit, just one.

Neuroscience suggests that the natural functional for a neuron to compute is a high-dimensional dot product between an "image patch" and another image patch (called a template), which is stored in terms of synaptic weights (synapses per neuron). The standard computational model of a neuron is based on a dot product and a threshold. Another important feature of the visual cortex is that it consists of simple and complex cells. This idea was originally proposed by Hubel and Wiesel. [9] M-theory employs it: simple cells compute dot products of an image with transformations of templates, $\langle I, g_i t^k \rangle$ for $i = 1, \dots, |G|$ (where $|G|$ is the number of simple cells per template). Complex cells are responsible for pooling: computing empirical histograms of these values or their statistical moments. The following formula for constructing a histogram can be computed by neurons:

$$\mu_n^k(I) = \frac{1}{|G|} \sum_{i=1}^{|G|} \sigma\left( \langle I, g_i t^k \rangle + n\Delta \right),$$

where $\sigma$ is a smooth version of the step function, $\Delta$ is the width of a histogram bin, and $n$ is the number of the bin.
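
A toy version of this computation (an illustration only; the sigmoid steepness and bin width below are arbitrary choices) looks as follows:

```python
# Toy neural histogram: a smooth step sigma, shifted by n*Delta, turns pooled
# simple-cell responses into one histogram-like component mu_n per bin n.
import numpy as np

def smooth_step(x, steepness=10.0):
    return 1.0 / (1.0 + np.exp(-steepness * x))    # smooth version of a step

def mu(dots, n, delta):
    # mu_n = (1/|G|) * sum_i sigma(<I, g_i t> + n*delta)
    return np.mean(smooth_step(dots + n * delta))

rng = np.random.default_rng(4)
dots = rng.normal(size=64)                          # simple-cell responses <I, g_i t>
delta = 0.5
components = [mu(dots, n, delta) for n in range(-4, 5)]
print(np.round(components, 3))                      # cumulative-histogram-like values
```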

Applications

Applications to computer vision

In [13] [14] the authors applied M-theory to unconstrained face recognition in natural photographs. Unlike the DAR (detection, alignment, and recognition) method, which handles clutter by detecting objects and cropping closely around them so that very little background remains, this approach accomplishes detection and alignment implicitly, by storing transformations of training images (templates) rather than explicitly detecting and aligning or cropping faces at test time. This system is built according to the principles of a recent theory of invariance in hierarchical networks and can evade the clutter problem, which is generally problematic for feedforward systems. The resulting end-to-end system achieves a drastic improvement in the state of the art on this task, reaching the same level of performance as the best systems operating on aligned, closely cropped images (with no outside training data). It also performs well on two newer datasets that are similar to LFW but more difficult: a significantly jittered (misaligned) version of LFW, and SUFR-W. For example, the model's accuracy in the LFW "unaligned & no outside data used" category is 87.55±1.41%, compared to the state-of-the-art APEM (adaptive probabilistic elastic matching) at 81.70±1.78%.

The theory was also applied to a range of recognition tasks: from invariant single-object recognition in clutter to multiclass categorization problems on publicly available data sets (CalTech5, CalTech101, MIT-CBCL) and complex (street) scene understanding tasks that require the recognition of both shape-based and texture-based objects (on the StreetScenes data set). [12] The approach performs well: it is capable of learning from only a few training examples and was shown to outperform several more complex state-of-the-art systems, such as constellation models and the hierarchical SVM-based face-detection system. A key element in the approach is a new set of scale- and position-tolerant feature detectors, which are biologically plausible and agree quantitatively with the tuning properties of cells along the ventral stream of visual cortex. These features are adaptive to the training set, though the authors also show that a universal feature set, learned from a set of natural images unrelated to any categorization task, likewise achieves good performance.

Applications to speech recognition

The theory can also be extended to the speech recognition domain. As an example, in [15] an extension of the theory for unsupervised learning of invariant visual representations to the auditory domain was proposed and empirically evaluated for voiced speech sound classification. The authors demonstrated that a single-layer, phone-level representation, extracted from base speech features, improves segment classification accuracy and decreases the number of training examples required, in comparison with standard spectral and cepstral features, for an acoustic classification task on the TIMIT dataset. [16]

References

  1. Serre T., Oliva A., Poggio T. (2007) A feedforward architecture accounts for rapid categorization. PNAS, vol. 104, no. 15, pp. 6424–6429.
  2. F. Anselmi, J.Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio (2014) Unsupervised learning of invariant representations in hierarchical architectures. arXiv preprint arXiv:1311.4158.
  3. H. Schulz-Mirbach (1994) Constructing invariant features by averaging techniques. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 2 – Conference B: Computer Vision & Image Processing, volume 2, pp. 387–390.
  4. H. Cramér and H. Wold (1936) Some theorems on distribution functions. J. London Math. Soc., 4:290–294.
  5. F. Anselmi, J.Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio (2013) Magic Materials: a theory of deep hierarchical architectures for learning sensory representations. CBCL paper, Massachusetts Institute of Technology, Cambridge, MA.
  6. Liao Q., Leibo J., Mroueh Y., Poggio T. (2014) Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? CBMM Memo No. 003, Massachusetts Institute of Technology, Cambridge, MA.
  7. M. Riesenhuber and T. Poggio (1999) Hierarchical Models of Object Recognition in Cortex. Nature Neuroscience, vol. 2, no. 11, pp. 1019–1025.
  8. T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, T. Poggio (2005) A Theory of Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex. AI Memo 2005-036/CBCL Memo 259, Massachusetts Institute of Technology, Cambridge, MA.
  9. D.H. Hubel and T.N. Wiesel (1962) Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160.
  10. D. Gabor (1946) Theory of Communication. J. IEE, vol. 93, pp. 429–459.
  11. J.P. Jones and L.A. Palmer (1987) An Evaluation of the Two-Dimensional Gabor Filter Model of Simple Receptive Fields in Cat Striate Cortex. J. Neurophysiol., vol. 58, pp. 1233–1258.
  12. Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, Tomaso Poggio (2007) Robust Object Recognition with Cortex-Like Mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3.
  13. Qianli Liao, Joel Z. Leibo, Youssef Mroueh, Tomaso Poggio (2014) Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? CBMM Memo No. 003.
  14. Qianli Liao, Joel Z. Leibo, Tomaso Poggio (2014) Learning invariant representations and applications to face verification. NIPS 2014.
  15. Georgios Evangelopoulos, Stephen Voinea, Chiyuan Zhang, Lorenzo Rosasco, Tomaso Poggio (2014) Learning An Invariant Speech Representation. CBMM Memo No. 022.
  16. "TIMIT Acoustic-Phonetic Continuous Speech Corpus – Linguistic Data Consortium".