Dilution (neural networks)

[Figure: On the left is a fully connected neural network with two hidden layers; on the right is the same network after applying dropout.]

Dilution and dropout (also called DropConnect [1] ) are regularization techniques for reducing overfitting in artificial neural networks by preventing complex co-adaptations on training data. They are an efficient way of performing model averaging with neural networks. [2] Dilution refers to thinning weights, [3] while dropout refers to randomly "dropping out", or omitting, units (both hidden and visible) during the training process of a neural network. [4] [5] [2] Both trigger the same type of regularization.


Types and uses

Dilution is usually split into weak dilution and strong dilution. Weak dilution describes the process in which the finite fraction of removed connections is small, while strong dilution refers to when this fraction is large. There is no clear boundary between the two; the distinction often depends on the precedent of a specific use case and has implications for how exact solutions are obtained.

Sometimes dilution is used for adding damping noise to the inputs. In that case, weak dilution refers to adding a small amount of damping noise, while strong dilution refers to adding a greater amount of damping noise. Both can be rewritten as variants of weight dilution.
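A minimal NumPy sketch of this equivalence, assuming the damping noise takes the form of multiplicative attenuation factors on the inputs (the array names and the noise range are illustrative choices, not standard values):

    import numpy as np

    rng = np.random.default_rng(0)

    W = rng.normal(size=(3, 5))   # weight matrix of a linear layer
    x = rng.normal(size=5)        # input vector

    # Damping noise applied to the inputs (factors close to 1 ~ weak dilution).
    noise = rng.uniform(0.8, 1.0, size=x.shape)
    y_input_noise = W @ (noise * x)

    # The same noise rewritten as a dilution (scaling) of the weight columns.
    y_weight_dilution = (W * noise) @ x

    assert np.allclose(y_input_noise, y_weight_dilution)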

These techniques are also sometimes referred to as random pruning of weights, but pruning is usually a non-recurring, one-way operation: the network is pruned, and the result is kept if it improves on the previous model. Dilution and dropout, by contrast, are iterative. Pruning of weights typically does not imply that the network continues learning, whereas in dilution/dropout the network continues to learn after the technique is applied.

Generalized linear network

The output from a layer of linear nodes in an artificial neural network can be described as

    y_i = \sum_j w_{ij} x_j    (1)

  • y_i – output from node i
  • w_{ij} – real weight before dilution, also called the Hebb connection strength
  • x_j – input from node j

This can be written in vector notation as

    \mathbf{y} = \mathbf{W} \mathbf{x}    (2)

  • \mathbf{y} – output vector
  • \mathbf{W} – weight matrix
  • \mathbf{x} – input vector

Equations (1) and (2) are used in the subsequent sections.
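A minimal NumPy sketch of the generalized linear network in equations (1) and (2); the dimensions are arbitrary example values:

    import numpy as np

    rng = np.random.default_rng(0)

    W = rng.normal(size=(3, 5))   # weight matrix: w_{ij} connects input j to output node i
    x = rng.normal(size=5)        # input vector

    # Equations (1)/(2): each output node is a weighted sum of the inputs.
    y = W @ x                     # equivalent to y_i = sum_j w_{ij} * x_j
    print(y.shape)                # (3,)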

Weak dilution

During weak dilution, the finite fraction of removed connections (the weights) is small, giving rise to only a small uncertainty. This edge case can be solved exactly with mean field theory. In weak dilution the impact on the weights can be described as

    \hat{w}_{ij} = \begin{cases} w_{ij} & \text{with probability } P(c) \\ 0 & \text{with probability } 1 - P(c) \end{cases}    (3)

  • \hat{w}_{ij} – diluted weight
  • w_{ij} – real weight before dilution
  • P(c) – the probability of c, the probability of keeping a weight

The interpretation of the probability P(c) can also be changed from keeping a weight to pruning a weight.

In vector notation this can be written as

    \mathbf{y} = g(\mathbf{W}) \mathbf{x}    (4)

where the function g(\cdot) imposes the previous dilution, applying equation (3) to every weight of \mathbf{W}.
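A minimal NumPy sketch of the weight dilution in equation (3); the helper name dilute_weights and the keep probability are illustrative, not standard terminology:

    import numpy as np

    rng = np.random.default_rng(0)

    def dilute_weights(W, p_keep):
        # Equation (3): keep each weight with probability P(c) = p_keep, else set it to zero.
        mask = rng.random(W.shape) < p_keep
        return W * mask

    W = rng.normal(size=(3, 5))
    x = rng.normal(size=5)

    # Weak dilution: only a small fraction of the weights is removed.
    y = dilute_weights(W, p_keep=0.9) @ x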

In weak dilution only a small and fixed fraction of the weights are diluted. When the number of terms in the sum (the weights for each node) goes to infinity, it remains infinite because the fraction is fixed, and thus mean field theory can be applied. In the notation from Hertz et al. [3] this would be written as

    \langle S_i \rangle = \tanh\left( P(c)\, \beta \sum_j w_{ij} \langle S_j \rangle \right)

  • \beta – the mean field temperature
  • P(c) – a scaling factor for the temperature from the probability of keeping the weight
  • w_{ij} – real weight before dilution, also called the Hebb connection strength
  • \langle S_i \rangle – the mean stable equilibrium states

There are some assumptions for this to hold, which are not listed here. [6] [7]

Strong dilution

When the dilution is strong, the finite fraction of removed connections (the weights) is large, giving rise to a large uncertainty.

Dropout

Dropout is a special case of the previous weight equation (3), adjusted to remove a whole row of the weight matrix rather than only individual random weights

    \hat{\mathbf{w}}_i = \begin{cases} \mathbf{w}_i & \text{with probability } P(c) \\ \mathbf{0} & \text{with probability } 1 - P(c) \end{cases}

  • P(c) – the probability to keep a row in the weight matrix
  • \mathbf{w}_i – real row in the weight matrix before dropout
  • \hat{\mathbf{w}}_i – diluted row in the weight matrix

Because dropout removes a whole row from the weight matrix, the previous (unlisted) assumptions for weak dilution and the use of mean field theory are not applicable.

The process by which the node is driven to zero, whether by setting the weights to zero, by “removing the node”, or by some other means, does not impact the end result and does not create a new and unique case. If the neural net is processed by a high-performance digital array multiplier, then it is likely more effective to drive the value to zero late in the process graph. If the net is processed by a constrained processor, perhaps even an analog neuromorphic processor, then it is likely more power-efficient to drive the value to zero early in the process graph.
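A minimal NumPy sketch of dropout as row-wise dilution of the weight matrix; zeroing row i is the same as dropping output node i for one training pass (the function name dropout_rows and the keep probability are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_rows(W, p_keep):
        # Keep each row of the weight matrix with probability P(c) = p_keep,
        # otherwise zero the whole row (i.e. drop the corresponding output node).
        keep = rng.random(W.shape[0]) < p_keep
        return W * keep[:, None]

    W = rng.normal(size=(3, 5))
    x = rng.normal(size=5)

    # During training, a fresh mask is drawn for every example or minibatch.
    y = dropout_rows(W, p_keep=0.5) @ x   # dropped nodes output exactly zero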

Google's patent

Although there have been examples of randomly removing connections between neurons in a neural network to improve models, [3] this technique was first introduced under the name dropout by Geoffrey Hinton et al. in 2012. [2] Google currently holds the patent for the dropout technique. [8] [note 1]

See also

Notes

  1. The patent is most likely not valid due to prior art. “Dropout” has been described as “dilution” in previous publications. It is described by Hertz, Krogh, and Palmer in Introduction to the Theory of Neural Computation (1991), ISBN 0-201-51560-1, p. 45, "Weak Dilution". The text references Sompolinsky, The Theory of Neural Networks: The Hebb Rules and Beyond, in Heidelberg Colloquium on Glassy Dynamics (1987), and Canning and Gardner, Partially Connected Models of Neural Networks, in Journal of Physics (1988). It goes on to describe strong dilution. This predates Hinton's paper.

References

  1. Wan, Li; Zeiler, Matthew; Zhang, Sixin; Le Cun, Yann; Fergus, Rob (2013). "Regularization of Neural Networks using DropConnect". Proceedings of the 30th International Conference on Machine Learning, PMLR. 28 (3): 1058–1066 via PMLR.
  2. Hinton, Geoffrey E.; Srivastava, Nitish; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan R. (2012). "Improving neural networks by preventing co-adaptation of feature detectors". arXiv:1207.0580 [cs.NE].
  3. Hertz, John; Krogh, Anders; Palmer, Richard (1991). Introduction to the Theory of Neural Computation. Redwood City, California: Addison-Wesley Pub. Co. pp. 45–46. ISBN 0-201-51560-1.
  4. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Jmlr.org. Retrieved July 26, 2015.
  5. Warde-Farley, David; Goodfellow, Ian J.; Courville, Aaron; Bengio, Yoshua (2013-12-20). "An empirical analysis of dropout in piecewise linear networks". arXiv: 1312.6197 [stat.ML].
  6. Sompolinsky, H. (1987), "The theory of neural networks: The Hebb rule and beyond", Heidelberg Colloquium on Glassy Dynamics, Lecture Notes in Physics, vol. 275, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 485–527, Bibcode:1987LNP...275..485S, doi:10.1007/bfb0057531, ISBN   978-3-540-17777-7
  7. Canning, A; Gardner, E (1988-08-07). "Partially connected models of neural networks". Journal of Physics A: Mathematical and General. 21 (15): 3275–3284. Bibcode:1988JPhA...21.3275C. doi:10.1088/0305-4470/21/15/016. ISSN   0305-4470.
  8. US 9406017B2, Hinton, Geoffrey E., "System and method for addressing overfitting in a neural network", issued 2016-08-02.