Capsule neural network

A capsule neural network (CapsNet) is a machine learning system that is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization. [1]

The idea is to add structures called "capsules" to a convolutional neural network (CNN), and to reuse output from several of those capsules to form more stable (with respect to various perturbations) representations for higher capsules. [2] The output is a vector consisting of the probability of an observation and a pose for that observation. This vector is similar to what is done, for example, when performing classification with localization in CNNs.

Among other benefits, capsnets address the "Picasso problem" in image recognition: images that have all the right parts but that are not in the correct spatial relationship (e.g., in a "face", the positions of the mouth and one eye are switched). For image recognition, capsnets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level. [3] This can be compared to inverting the rendering of an object of multiple parts. [4]

History

In 2000, Geoffrey Hinton et al. described an imaging system that combined segmentation and recognition into a single inference process using parse trees. So-called credibility networks described the joint distribution over the latent variables and over the possible parse trees. That system proved useful on the MNIST handwritten digit database. [4]

A dynamic routing mechanism for capsule networks was introduced by Hinton and his team in 2017. The approach was claimed to reduce error rates on MNIST and to reduce training set sizes. Results were claimed to be considerably better than a CNN on highly overlapped digits. [1]

In Hinton's original idea one minicolumn would represent and detect one multidimensional entity. [5] [note 1]

Transformations

An invariant is an object property that does not change as a result of some transformation. For example, the area of a circle does not change if the circle is shifted to the left.

Informally, an equivariant is a property that changes predictably under transformation. For example, the center of a circle moves by the same amount as the circle when shifted. [6]

A nonequivariant is a property whose value does not change predictably under a transformation. For example, transforming a circle into an ellipse means that its perimeter can no longer be computed as π times the diameter.

In computer vision, the class of an object is expected to be an invariant over many transformations. I.e., a cat is still a cat if it is shifted, turned upside down or shrunken in size. However, many other properties are instead equivariant. The volume of a cat changes when it is scaled.
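
As an illustrative formalization of these two notions (an assumption added here, not taken from the cited sources): under a translation T_t(x) = x + t of the input, an invariant property f and an equivariant property g behave as

    f(T_t(x)) = f(x), \qquad g(T_t(x)) = g(x) + t

so the class label of a shifted cat is unchanged, while its position shifts by exactly t.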

Equivariant properties such as a spatial relationship are captured in a pose, data that describes an object's translation, rotation, scale and reflection. Translation is a change in location in one or more dimensions. Rotation is a change in orientation. Scale is a change in size. Reflection is a mirror image. [1]

Unsupervised capsnets learn a global linear manifold between an object and its pose as a matrix of weights. In other words, capsnets can identify an object independent of its pose, rather than having to learn to recognize the object while including its spatial relationships as part of the object. In capsnets, the pose can incorporate properties other than spatial relationships, e.g., color (cats can be of various colors).

Multiplying the object's representation by this manifold (the weight matrix) poses the object in space. [7]
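
As a hedged illustration (the specific matrix form is an assumption, not taken from the cited sources), a 2-D pose combining rotation by θ, scale s and translation (t_x, t_y) can be written as a matrix acting on a part's homogeneous coordinates:

    \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} =
    \begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}

Multiplying a part's coordinates by such a matrix places (poses) the part relative to the whole; as noted above, viewpoint changes act linearly at this level even though they are nonlinear at the pixel level.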

Pooling

Capsnets reject the pooling layer strategy of conventional CNNs that reduces the amount of detail to be processed at the next higher layer. Pooling allows a degree of translational invariance (it can recognize the same object in a somewhat different location) and allows a larger number of feature types to be represented. Capsnet proponents argue that pooling discards precise positional information, providing invariance (throwing that information away) where equivariance (disentangling it) would be more useful. [1]

Capsules

A capsule is a set of neurons that individually activate for various properties of a type of object, such as position, size and hue. Formally, a capsule is a set of neurons that collectively produce an activity vector with one element for each neuron to hold that neuron's instantiation value (e.g., hue). [1] Graphics programs use instantiation values to draw an object. Capsnets attempt to derive these values from their input. The probability of the entity's presence in a specific input is the vector's length, while the vector's orientation quantifies the capsule's properties. [1] [3]
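
A minimal NumPy sketch of reading out a single capsule's activity vector; the vector values and dimensionality are hypothetical, chosen only for illustration:

    import numpy as np

    # Hypothetical 8-dimensional activity vector of one capsule,
    # assumed to be post-squashing so that its length is below 1.
    v = np.array([0.05, -0.32, 0.41, 0.10, -0.07, 0.22, -0.15, 0.30])

    presence_probability = np.linalg.norm(v)             # length: probability the entity is present
    instantiation_direction = v / presence_probability   # orientation: the instantiation parameters

    print(presence_probability)
    print(instantiation_direction)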

Artificial neurons traditionally output a scalar, real-valued activation that loosely represents the probability of an observation. Capsnets replace scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement. [1]

Because capsules are independent, when multiple capsules agree, the probability of correct detection is much higher. A minimal cluster of two capsules considering a six-dimensional entity would agree within 10% by chance only once in a million trials. As the number of dimensions increases, the likelihood of a chance agreement across a larger cluster with higher dimensions decreases exponentially. [1]
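
A hedged back-of-the-envelope reading of that figure: if each of the six dimensions independently matches to within 10% with probability 0.1, the chance that two capsules agree on all six is

    0.1^6 = 10^{-6}

that is, about once in a million trials, consistent with the claim above.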

Capsules in higher layers take outputs from capsules at lower layers, and accept those whose outputs cluster. A cluster causes the higher capsule to output a high probability that an entity is present, and also to output a high-dimensional (20-50+) pose. [1]

Higher-level capsules ignore outliers, concentrating on clusters. This is similar to the Hough transform, the RHT and RANSAC from classic digital image processing. [1]

Routing by agreement

The outputs from one capsule (child) are routed to capsules in the next layer (parent) according to the child's ability to predict the parents' outputs. Over the course of a few iterations, each parent's outputs may converge with the predictions of some children and diverge from those of others, indicating whether that parent is present in or absent from the scene. [1]

For each possible parent, each child computes a prediction vector by multiplying its output by a weight matrix (trained by backpropagation). [3] Next, the output of the parent is computed as the scalar product of a prediction with a coefficient representing the probability that this child belongs to that parent. A child whose predictions are relatively close to the resulting output successively increases the coefficient between that parent and child, and decreases it for parents that it matches less well. This increases the contribution that the child makes to that parent, thus increasing the scalar product of the capsule's prediction with the parent's output. After a few iterations, the coefficients strongly connect a parent to its most likely children, indicating that the presence of the children implies the presence of the parent in the scene. [1] The more children whose predictions are close to a parent's output, the more quickly the coefficients grow, driving convergence. The pose of the parent (reflected in its output) progressively becomes compatible with that of its children. [3]
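
In symbols (notation as in the Math/code section below), the routing logit b_ij between child i and parent j grows with the agreement between the child's prediction û_{j|i} and the parent's current output v_j: [1]

    b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j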

The coefficients' initial logits are the log prior probabilities that a child belongs to a parent. The priors can be trained discriminatively along with the weights. The priors depend on the location and type of the child and parent capsules, but not on the current input. At each iteration, the coefficients are adjusted via a "routing" softmax so that they continue to sum to 1 (to express the probability that a given capsule is the parent of a given child). Softmax amplifies larger values and diminishes smaller values beyond their proportion of the total. Similarly, the probability that a feature is present in the input is exaggerated by a nonlinear "squashing" function, which shrinks small vectors drastically and limits large ones to a length just below 1. [3]

This dynamic routing mechanism provides the necessary deprecation of alternatives ("explaining away") that is needed for segmenting overlapped objects.

This learned routing of signals has no clear biological equivalent. Some operations can be found in cortical layers, but they do not seem to relate to this technique.

Math/code

The pose vector u_i is rotated and translated by a weight matrix W_ij into a vector û_{j|i} that predicts the output of the parent capsule:

    \hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij} \mathbf{u}_i

Capsules in the next higher level are fed the sum of the predictions from all capsules in the lower layer, each weighted by a coupling coefficient c_ij:

    \mathbf{s}_j = \sum_i c_{ij} \hat{\mathbf{u}}_{j|i}

Procedure softmax

The coupling coefficients c_ij from capsule i in layer l to all capsules in layer l+1 sum to one, and are defined by a "routing softmax". The initial logits b_ij are the prior log probabilities for the routing, that is, the prior probability that capsule i in layer l should connect to capsule j in layer l+1. Normalization of the coupling coefficients: [1]

    c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}

For this procedure to be optimal, it would have to memorize several values and reset those values on each iteration: if the logit vector changes, then the memorized values must be updated. How this should be done is not shown, nor is how the divisor would be memorized. [1]

Procedure squash

Because the length of a capsule's output vector represents a probability, it should lie between zero and one; to ensure this, a squashing function is applied: [1]

    \mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2} \frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}

A vector squashed to zero has a vanishing gradient.
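
A minimal NumPy sketch of this squashing function; the small epsilon is an assumption added here to avoid division by zero at the zero vector (where, as noted, the gradient vanishes), and is not part of the original formulation:

    import numpy as np

    def squash(s, eps=1e-8):
        # Shrinks short vectors toward zero and limits long vectors to length < 1,
        # preserving direction.
        norm_sq = np.dot(s, s)
        return (norm_sq / (1.0 + norm_sq)) * (s / np.sqrt(norm_sq + eps))

    print(np.linalg.norm(squash(np.array([0.1, 0.2]))))   # small input -> length well below 1
    print(np.linalg.norm(squash(np.array([10.0, 0.0]))))  # large input -> length just under 1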

Procedure routing

One approach to routing is the following [1]
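
A minimal NumPy sketch of the routing procedure, reconstructed from the description in [1]; the array shapes, variable names and the default iteration count are illustrative assumptions. The comments mark the statements corresponding to the line numbers discussed below.

    import numpy as np

    def squash(s, axis=-1, eps=1e-8):
        norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
        return (norm_sq / (1.0 + norm_sq)) * (s / np.sqrt(norm_sq + eps))

    def routing(u_hat, num_iterations=4):
        # u_hat[i, j, :] is child capsule i's prediction for parent capsule j,
        # i.e. û_{j|i} = W_ij u_i, computed outside this routine.
        num_child, num_parent, _ = u_hat.shape
        b = np.zeros((num_child, num_parent))                      # initial routing logits
        for _ in range(num_iterations):
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # routing softmax ("line 8")
            s = np.einsum('ij,ijd->jd', c, u_hat)                  # coupled sum of predictions ("line 9")
            v = squash(s)                                          # parent outputs ("line 10")
            b = b + np.einsum('ijd,jd->ij', u_hat, v)              # agreement update ("line 11")
        return v

    # Example: 3 child capsules, 2 parent capsules, 4-dimensional parent vectors.
    predictions = np.random.randn(3, 2, 4)
    print(routing(predictions).shape)   # (2, 4)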

At line 8, the softmax function can be replaced by any type of winner-take-all network. Biologically this somewhat resembles chandelier cells, although such cells could also be involved in the calculation of coupling coefficients (line 9) or of agreements (line 11).

At line 9, the weight matrix for the coupling coefficients and the hidden prediction matrix are used. The structure in layers I and II is somewhat similar to the cerebral cortex if stellate cells are assumed to be involved in transposing input vectors. Whether both types of stellate cells have the same function is not clear, as layer I has excitatory spiny cells and layer II has inhibitory aspiny cells; the latter indicates a very different network.

At line 10, the squash function can be replaced by other functions and network topologies that retain the vector direction.

The procedure conducts r iterations, usually 4-5, with l the index of the source (primary) capsule layer that the routing goes from, and l+1 the next higher capsule layer.

Training

Learning is supervised. [3] The network is trained by minimizing the Euclidean distance between the image and the output of a CNN that reconstructs the input from the output of the terminal capsules. [1]

The network is discriminatively trained, using iterative routing-by-agreement. [1]

The activity vectors of all but the correct parent are masked. [1]

Margin loss

The length of the instantiation vector represents the probability that a capsule's entity is present in the scene. A top-level capsule has a long vector if and only if its associated entity is present. To allow for multiple entities, a separate margin loss is computed for each capsule. Downweighting the loss for absent entities stops the learning from shrinking activity vector lengths for all entities. The total loss is the sum of the losses of all entities. [1] In Hinton's example the loss function is: [1]

    L_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^-)^2

where T_k = 1 if and only if an entity of class k is present.

This type of loss function is common in ANNs. The parameters m^+ and m^- are set so that the vector length neither maxes out nor collapses, with m^+ = 0.9 and m^- = 0.1. Down-weighting of the loss for absent classes is controlled by λ, with λ = 0.5 as a reasonable choice. [1]
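
A minimal NumPy sketch of this margin loss using the parameter values quoted above; the target encoding and example numbers are assumptions for illustration:

    import numpy as np

    def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
        # v_lengths: lengths of the top-level capsule vectors, shape (num_classes,)
        # targets:   1.0 for classes present in the input, 0.0 otherwise
        present = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
        absent = lam * (1.0 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
        return np.sum(present + absent)

    lengths = np.array([0.95, 0.20, 0.05])   # hypothetical capsule lengths
    labels = np.array([1.0, 0.0, 0.0])       # only class 0 is present
    print(margin_loss(lengths, labels))      # small loss: prediction matches the label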

Reconstruction loss

An additional reconstruction loss encourages entities to encode their inputs' instantiation parameters. The final activity vector is then used to reconstruct the input image via a CNN decoder consisting of 3 fully connected layers. The reconstruction minimizes the sum of squared differences between the outputs of the logistic units and the pixel intensities. This reconstruction loss is scaled down by 0.0005 so that it does not dominate the margin loss during training. [1]
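
A minimal NumPy sketch of the scaled reconstruction term described above; the decoder output and input image are random placeholders:

    import numpy as np

    def reconstruction_loss(reconstruction, image, scale=0.0005):
        # Sum of squared differences between decoder outputs and pixel intensities,
        # scaled down so it does not dominate the margin loss.
        return scale * np.sum((reconstruction - image) ** 2)

    image = np.random.rand(28, 28)            # placeholder MNIST-sized input
    reconstruction = np.random.rand(28, 28)   # placeholder decoder output
    print(reconstruction_loss(reconstruction, image))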

Example configuration

The first convolutional layers perform feature extraction. For the 28x28 pixel MNIST image test, an initial layer of 256 convolutional kernels of 9x9 pixels (using stride 1 and rectified linear unit (ReLU) activation, defining 20x20 receptive fields) converts the pixel input into 1D feature activations and introduces nonlinearity. [1]

The primary (lowest) capsule layer divides the 256 kernels into 32 capsules of eight 9x9 kernels each (using stride 2, defining 6x6 receptive fields). Capsule activations effectively invert the graphics rendering process, going from pixels to features. A single weight matrix is used by each capsule across all receptive fields. Each primary capsule sees all of the lower-layer outputs whose fields overlap with the center of the field in the primary layer. Each primary capsule output (for a particular field) is an 8-dimensional vector. [1] [3]

A second, digit capsule layer has one 16-dimensional capsule for each digit (0-9). Dynamic routing connects (only) primary and digit capsule layers. A [32x6x6] x 10 weight matrix controls the mapping between layers. [1]
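
A hedged arithmetic sketch of the tensor shapes implied by this configuration; the variable names are illustrative, and the 8x16 size of the per-pair transformation matrices follows the description in [1]:

    # 28x28 MNIST input
    conv1_size = (28 - 9) // 1 + 1                 # 20 -> 20x20x256 feature maps
    primary_size = (20 - 9) // 2 + 1               # 6  -> 6x6 spatial grid
    num_primary_capsules = 32 * primary_size ** 2  # 32 capsule types x 6x6 locations = 1152
    primary_capsule_dim = 8                        # each primary capsule outputs an 8-D vector
    num_digit_capsules = 10                        # one 16-D capsule per digit
    digit_capsule_dim = 16

    # One 8x16 transformation matrix per (primary capsule, digit capsule) pair.
    num_transform_matrices = num_primary_capsules * num_digit_capsules  # 11520

    print(conv1_size, primary_size, num_primary_capsules, num_transform_matrices)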

Capsnets are hierarchical, in that each lower-level capsule contributes significantly to only one higher-level capsule. [1]

However, replicating learned knowledge remains valuable. To achieve this, a capsnet's lower layers are convolutional, including hidden capsule layers. Higher layers thus cover larger regions, while retaining information about the precise position of each object within the region. For low level capsules, location information is "place-coded" according to which capsule is active. Higher up, more and more of the positional information is rate-coded in the capsule's output vector. This shift from place-coding to rate-coding, combined with the fact that higher-level capsules represent more complex objects with more degrees of freedom, suggests that capsule dimensionality increases with level. [1]

Human vision

Human vision examines a sequence of focal points (directed by saccades), processing only a fraction of the scene at its highest resolution. Capsnets build on inspirations from cortical minicolumns (also called cortical microcolumns) in the cerebral cortex. A minicolumn is a structure containing 80-120 neurons, with a diameter of about 28-40 µm, spanning all layers in the cerebral cortex. All neurons in the larger minicolumns have the same receptive field, and they output their activations as action potentials or spikes. [1] Neurons within the microcolumn receive common inputs, have common outputs, are interconnected and may constitute a fundamental computational unit of the cerebral cortex. [8]

Capsnets explore the intuition that the human visual system creates a tree-like structure for each focal point and coordinates these trees to recognize objects. However, with capsnets each tree is "carved" from a fixed network (by adjusting coefficients) rather than assembled on the fly. [1]

Alternatives

CapsNets are claimed to have major conceptual advantages over convolutional neural networks (CNNs).

Purely convolutional nets cannot generalize to unlearned viewpoints (other than translation). For other affine transformations, either feature detectors must be repeated on a grid that grows exponentially with the number of transformation dimensions, or the size of the labelled training set must (exponentially) expand to encompass those viewpoints. These exponential explosions make them unsuitable for larger problems. [1]

Capsnet's transformation matrices learn the (viewpoint independent) spatial relationship between a part and a whole, allowing the latter to be recognized based on such relationships. However, capsnets assume that each location displays at most one instance of a capsule's object. This assumption allows a capsule to use a distributed representation (its activity vector) of an object to represent that object at that location. [1]

Capsnets use neural activities that vary with viewpoint. They do not have to normalize objects (as in spatial transformer networks) and can even recognize multiply transformed objects. Capsnets can also process segmented objects. [1]

Notes

  1. In Hinton's own words this is "wild speculation".

References

  1. Sabour, Sara; Frosst, Nicholas; Hinton, Geoffrey E. (2017-10-26). "Dynamic Routing Between Capsules". arXiv:1710.09829 [cs.CV].
  2. Hinton, Geoffrey E.; Krizhevsky, Alex; Wang, Sida D. (2011-06-14). "Transforming Auto-Encoders". Artificial Neural Networks and Machine Learning – ICANN 2011. Lecture Notes in Computer Science. Vol. 6791. Springer, Berlin, Heidelberg. pp. 44–51. CiteSeerX 10.1.1.220.5099. doi:10.1007/978-3-642-21735-7_6. ISBN 9783642217340. S2CID 6138085.
  3. Srihari, Sargur. "Capsule Nets" (PDF). University of Buffalo. Retrieved 2017-12-07.
  4. Hinton, Geoffrey E.; Ghahramani, Zoubin; Teh, Yee Whye (2000). Solla, S. A.; Leen, T. K.; Müller, K. (eds.). Advances in Neural Information Processing Systems 12 (PDF). MIT Press. pp. 463–469.
  5. Meher Vamsi (2017-11-15). Geoffrey Hinton Capsule theory. Retrieved 2017-12-06.
  6. "Understanding Matrix capsules with EM Routing (Based on Hinton's Capsule Networks)". jhui.github.io. Retrieved 2017-12-31.
  7. Tan, Kendrick (November 10, 2017). "Capsule Networks Explained". kndrck.co. Retrieved 2017-12-26.
  8. "Microcolumns in the Brain". www.physics.drexel.edu. Archived from the original on 2018-05-27. Retrieved 2017-12-31.