Vanishing gradient problem

In machine learning, the vanishing gradient problem is encountered when training recurrent neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. [1] The problem is that as the sequence length increases, the gradient magnitude is typically expected to decrease (or grow uncontrollably), slowing the training process. [1] In the worst case, this may completely stop the neural network from further training. [1] As one example of the problem's cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the early layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n while the early layers train very slowly.
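
As a numerical illustration (a sketch, not from the cited sources; the width, depth, and Gaussian initialization are arbitrary choices), the following code pushes a signal through a stack of tanh layers and then backpropagates a gradient through them, printing how its norm shrinks:

```python
# Minimal illustration of vanishing gradients in a deep tanh network.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 50, 30

# Forward pass through `depth` tanh layers, caching pre-activations.
weights = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]
h, pre_acts = rng.standard_normal(n), []
for W in weights:
    z = W @ h
    pre_acts.append(z)
    h = np.tanh(z)

# Backward pass: each layer multiplies the gradient by W^T diag(tanh'(z)),
# per the chain rule; tanh' <= 1, so the norm tends to shrink layer by layer.
grad = rng.standard_normal(n)  # stand-in for dL/dh at the output
norms = []
for W, z in zip(reversed(weights), reversed(pre_acts)):
    grad = W.T @ (grad * (1.0 - np.tanh(z) ** 2))
    norms.append(np.linalg.norm(grad))

print(f"gradient norm after 1 layer    : {norms[0]:.3e}")
print(f"gradient norm after {depth} layers  : {norms[-1]:.3e}")  # many orders smaller
```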

Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's Diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", [2] [3] which not only affects many-layered feedforward networks, [4] but also recurrent networks. [5] The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. (The combination of unfolding and backpropagation is termed backpropagation through time.)

When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.

Prototypical models

This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio. [5]

Recurrent network model

A generic recurrent network has hidden states $h_1, h_2, \ldots$, inputs $u_1, u_2, \ldots$, and outputs $x_1, x_2, \ldots$. Let it be parametrized by $\theta$, so that the system evolves as

$$(h_t, x_t) = F(h_{t-1}, u_t, \theta)$$

Often, the output $x_t$ is a function of $h_t$, as some $x_t = G(h_t)$. The vanishing gradient problem already presents itself clearly when $x_t = h_t$, so we simplify our notation to the special case with:

$$x_t = F(x_{t-1}, u_t, \theta)$$

Now, take its differential:

$$\begin{aligned}
dx_t &= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\, dx_{t-1} \\
     &= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\left( \nabla_\theta F(x_{t-2}, u_{t-1}, \theta)\, d\theta + \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, dx_{t-2} \right) \\
     &= \cdots \\
     &= \left( \nabla_\theta F(x_{t-1}, u_t, \theta) + \nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_\theta F(x_{t-2}, u_{t-1}, \theta) + \cdots \right) d\theta
\end{aligned}$$

Training the network requires us to define a loss function to be minimized. Let it be $L(x_T, u_1, \ldots, u_T)$, [note 1] then minimizing it by gradient descent gives

$$dL = \nabla_x L(x_T, u_1, \ldots, u_T)\left( \nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right) d\theta \tag{loss differential}$$

$$\Delta\theta = -\eta \left[ \nabla_x L(x_T, u_1, \ldots, u_T)\left( \nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right) \right]^T$$

where $\eta$ is the learning rate. The vanishing/exploding gradient problem appears because there are repeated multiplications of the form

$$\nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, \nabla_x F(x_{t-3}, u_{t-2}, \theta) \cdots$$

Example: recurrent network with sigmoid activation

For a concrete example, consider a typical recurrent network defined by

$$x_t = F(x_{t-1}, u_t, \theta) = W_{\text{rec}}\, \sigma(x_{t-1}) + W_{\text{in}}\, u_t + b$$

where $\theta = (W_{\text{rec}}, W_{\text{in}})$ is the network parameter, $\sigma$ is the sigmoid activation function, [note 2] applied to each vector coordinate separately, and $b$ is the bias vector. Then $\nabla_x F(x_{t-1}, u_t, \theta) = W_{\text{rec}} \operatorname{diag}(\sigma'(x_{t-1}))$, and so

$$\nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_x F(x_{t-2}, u_{t-1}, \theta) \cdots \nabla_x F(x_{t-k}, u_{t-k+1}, \theta) = W_{\text{rec}} \operatorname{diag}(\sigma'(x_{t-1}))\, W_{\text{rec}} \operatorname{diag}(\sigma'(x_{t-2})) \cdots W_{\text{rec}} \operatorname{diag}(\sigma'(x_{t-k}))$$

Since $|\sigma'| \le 1$, the operator norm of the above multiplication is bounded above by $\|W_{\text{rec}}\|^k$. So if the spectral radius of $W_{\text{rec}}$ is $\gamma < 1$, then at large $k$, the above multiplication has operator norm bounded above by $\gamma^k \to 0$. This is the prototypical vanishing gradient problem. The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential):

$$\Delta\theta = -\eta \left[ \nabla_x L(x_T, u_1, \ldots, u_T)\left( \nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right) \right]^T$$

The components of $\nabla_\theta F(x_{t-1}, u_t, \theta)$ are just components of $\sigma(x_{t-1})$ and $u_t$, so if $u_t$ and $\sigma$ are bounded, then $\|\nabla_\theta F(x_{t-1}, u_t, \theta)\|$ is also bounded by some $M > 0$, and so the terms in $\Delta\theta$ decay as $M\gamma^k$. This means that, effectively, $\Delta\theta$ is affected only by roughly the first $\ln(1/\gamma)^{-1}$ terms in the sum.
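
The bound is easy to check numerically. The following sketch (an illustration; the dimension, horizon, and the choice $\gamma = 0.9$ are arbitrary) accumulates the $k$-step Jacobian product for the network above and prints its operator norm:

```python
# Numerical check: with the spectral radius of W_rec at gamma = 0.9 < 1, the
# k-step product W_rec diag(σ'(x_{t-1})) ... W_rec diag(σ'(x_{t-k})) decays
# in operator norm, as the analysis above predicts.
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 50
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = rng.standard_normal((n, n))
W_rec = 0.9 * W / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9

x = rng.standard_normal(n)
J = np.eye(n)
for _ in range(k):
    d = sigmoid(x) * (1.0 - sigmoid(x))  # σ'(x), at most 1/4 coordinate-wise
    J = J @ (W_rec * d)                  # multiply in W_rec @ diag(σ'(x))
    x = W_rec @ sigmoid(x)               # advance the recurrence (zero input)

print("operator norm of the k-step Jacobian product:", np.linalg.norm(J, 2))
```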

If $\gamma \ge 1$, the above analysis does not quite work. [note 3] For the prototypical exploding gradient problem, the next model is clearer.

Dynamical systems model

[Figure: bifurcation diagram of the one-neuron recurrent network. The horizontal axis is $b$ and the vertical axis is $x$. The black curve is the set of stable and unstable equilibria. Notice that the system exhibits hysteresis, and can be used as a one-bit memory.]

Following (Doya, 1992), [6] consider this one-neuron recurrent network with sigmoid activation:

$$x_{t+1} = (1 - \epsilon)\, x_t + \epsilon\, \sigma(w x_t + b) + \epsilon\, w' u_t$$

In the small-$\epsilon$ limit, the dynamics of the network becomes

$$\frac{dx(t)}{dt} = -x(t) + \sigma(w x(t) + b) + w' u(t)$$

Consider first the autonomous case, with $u = 0$. Set $w = 5.0$, and vary $b$ in $[-3, -2]$. As $b$ decreases, the system has one stable point, then has two stable points and one unstable point, and finally has one stable point again. Explicitly, the equilibria are the points $(b, x)$ satisfying $x = \sigma(5x + b)$, that is, $b = \ln\frac{x}{1-x} - 5x$; an equilibrium is stable exactly when $5x(1-x) < 1$.
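
The equilibrium curve can be tabulated directly by solving for $b$ in terms of $x$, as in the following sketch (an illustration; the grid of $x$ values is arbitrary):

```python
# Tabulating the equilibrium curve of dx/dt = -x + σ(w x + b) for w = 5:
# solving x = σ(5x + b) for b gives b = ln(x/(1-x)) - 5x, and an equilibrium
# is stable exactly when 5 x (1 - x) < 1.
import numpy as np

w = 5.0
for x in np.linspace(0.05, 0.95, 10):
    b = np.log(x / (1.0 - x)) - w * x    # the b that makes x an equilibrium
    stable = w * x * (1.0 - x) < 1.0     # stability of the equilibrium
    print(f"x = {x:.2f}   b = {b:+.3f}   {'stable' if stable else 'unstable'}")
```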

Now consider $\frac{\Delta x(T)}{\Delta w}$ and $\frac{\Delta x(T)}{\Delta b}$, where $T$ is large enough that the system has settled into one of the stable points.

If $x(0)$ puts the system very close to an unstable point, then a tiny variation in $x(0)$, $w$, or $b$ would make $x(T)$ move from one stable point to the other. This makes $\frac{\Delta x(T)}{\Delta w}$ and $\frac{\Delta x(T)}{\Delta b}$ both very large, a case of the exploding gradient.

If $x(0)$ puts the system far from an unstable point, then a small variation in $x(0)$ would have no effect on $x(T)$, making $\frac{\Delta x(T)}{\Delta x(0)} = 0$, a case of the vanishing gradient.

Note that in this case, $\frac{\Delta x(T)}{\Delta b}$ neither decays to zero nor blows up to infinity. Indeed, it is the only well-behaved gradient, which explains why early research focused on learning or designing recurrent network systems that could perform long-range computations (such as outputting the first input seen at the very end of an episode) by shaping their stable attractors. [7]
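
All three regimes can be observed with finite differences, as in the following sketch (an illustration, not Doya's original code; $\epsilon$, $T$, the probe size, and the starting points are arbitrary choices):

```python
# Finite-difference view of the three regimes described above: near the
# unstable point both sensitivities explode; far from it, the sensitivity to
# x(0) vanishes while the sensitivity to b stays of order one.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def x_T(x0, w=5.0, b=-2.5, eps=0.1, T=1000):
    x = x0
    for _ in range(T):
        x = (1.0 - eps) * x + eps * sigmoid(w * x + b)
    return x

d = 1e-6
for x0, label in [(0.5 + 1e-7, "near the unstable point"),
                  (0.9,        "far from it")]:
    dx_db = (x_T(x0, b=-2.5 + d) - x_T(x0, b=-2.5 - d)) / (2 * d)
    dx_dx0 = (x_T(x0 + d) - x_T(x0 - d)) / (2 * d)
    print(f"{label:24s} Δx(T)/Δb ≈ {dx_db:12.3f}   Δx(T)/Δx(0) ≈ {dx_dx0:.3e}")
```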

For the general case, the intuition still holds (see [5], Figures 3, 4, and 5).

Geometric model

Continue using the above one-neuron network, fixing $b = -2.5$, and consider a loss function defined by $L(x(T)) = (0.855 - x(T))^2$. This produces a rather pathological loss landscape: as $w$ approaches $5$ from above, the loss approaches zero, but as soon as $w$ crosses $5$, the attractor basin changes, and the loss jumps to 0.50. [note 4]

Consequently, attempting to train $w$ by gradient descent would "hit a wall in the loss landscape" and cause an exploding gradient. A slightly more complex situation is plotted in [5], Figure 6.
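
The wall can be reproduced numerically, as in the following sketch (an illustration; the starting point $x(0) = 0.501$, step size, and horizon are arbitrary choices):

```python
# Reproducing the loss wall: with b = -2.5 and L = (0.855 - x(T))^2, the
# long-run loss falls toward zero as w approaches 5 from above, then jumps
# to about 0.50 once w crosses the basin boundary.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(w, b=-2.5, x0=0.501, eps=0.1, T=2000):
    x = x0
    for _ in range(T):
        x = (1.0 - eps) * x + eps * sigmoid(w * x + b)
    return (0.855 - x) ** 2

for w in [5.2, 5.1, 5.05, 5.0, 4.99, 4.95, 4.9]:
    print(f"w = {w:5.2f}   loss = {loss(w):.4f}")
```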

Solutions

To overcome this problem, several methods have been proposed.

Batch normalization

Batch normalization is a standard method for solving both the exploding and the vanishing gradient problems. [8] [9]
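
For reference, the following sketch shows the standard batch-normalization forward transform of Ioffe and Szegedy (the code itself is an illustration, not from the paper):

```python
# Batch normalization: re-center and re-scale each feature over the batch.
# Keeping pre-activations out of the saturated regions of the nonlinearity,
# where derivatives vanish, is one way this mitigates vanishing gradients.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: learnable per-feature parameters."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta  # learned re-scaling and re-shifting

rng = np.random.default_rng(0)
x = 10.0 * rng.standard_normal((32, 4)) + 5.0  # badly scaled activations
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1
```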

Multi-level hierarchy

One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation. [10] Here each level learns a compressed representation of the observations that is fed to the next level.

Similar ideas have been used in feed-forward neural networks for unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised backpropagation to classify labeled data. The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations. [11] Hinton reports that his models are effective feature extractors over high-dimensional, structured data. [12]

Long short-term memory

Another technique particularly used for recurrent neural networks is the long short-term memory (LSTM) network of 1997 by Hochreiter & Schmidhuber. [13] In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers, by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned. [14] [15]
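
The key mechanism is the additive cell-state update, through which gradients can flow without repeated squashing. The following sketch shows a single step of a standard LSTM cell (the variant with a forget gate, added by Gers et al. after the 1997 paper; the code and weight layout are an illustration, not the authors' implementation):

```python
# One step of a standard LSTM cell. The additive update c = f*c + i*g is the
# key: the cell state is not repeatedly squashed by a nonlinearity, so the
# gradient along it need not vanish over long time spans.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """x: input (m,); h, c: hidden and cell states (n,); W: (4n, m); U: (4n, n)."""
    i, f, o, g = np.split(W @ x + U @ h + b, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    c = f * c + i * np.tanh(g)                    # additive cell-state path
    h = o * np.tanh(c)
    return h, c

m, n = 3, 8
rng = np.random.default_rng(0)
W, U, b = rng.standard_normal((4 * n, m)), rng.standard_normal((4 * n, n)), np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.standard_normal(m), h, c, W, U, b)
print(h.round(3))
```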

Faster hardware

Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way" [16] since the original models tackling the vanishing gradient problem by Hinton and others were trained on a Xeon processor, not GPUs. [11]

Residual networks

One of the newest and most effective ways to resolve the vanishing gradient problem is with residual neural networks, [17] or ResNets (not to be confused with recurrent neural networks). ResNets refer to neural networks where skip connections or residual connections are part of the network architecture. These skip connections allow gradient information to pass through the layers, by creating "highways" of information, where the output of a previous layer/activation is added to the output of a deeper layer. This allows information from the earlier parts of the network to be passed to the deeper parts of the network, helping maintain signal propagation even in deeper networks. Skip connections are a critical component of what allowed successful training of deeper neural networks.
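
A residual block computes $y = x + F(x)$, so its local Jacobian is $I + \partial F/\partial x$ and the backward signal passes through even when $\partial F/\partial x$ is small. A minimal sketch in that spirit (an illustration, not the detailed architecture of He et al.):

```python
# A minimal residual block: the skip connection adds the input back to the
# transformed output, so the block starts close to the identity mapping.
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x), with F a two-layer ReLU transform."""
    h = np.maximum(0.0, W1 @ x)  # inner layer with ReLU
    return x + W2 @ h            # skip connection adds the input back

n = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
W1 = 0.01 * rng.standard_normal((n, n))
W2 = 0.01 * rng.standard_normal((n, n))
y = residual_block(x, W1, W2)
print(np.linalg.norm(y - x))     # small: signal is preserved through the block
```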

ResNets yielded lower training error (and test error) than their shallower counterparts simply by reintroducing outputs from shallower layers in the network to compensate for the vanishing signal. [17] Note that ResNets are an ensemble of relatively shallow nets and do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network; rather, they avoid the problem simply by constructing ensembles of many short networks together (an "ensemble by construction" [18] ).

Other activation functions

Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction. [19]
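
The contrast is visible directly in the derivatives, as in this small comparison (an illustration):

```python
# The logistic sigmoid's derivative is at most 0.25 and vanishes in both
# tails, while ReLU's derivative is exactly 1 on the whole positive
# half-line, so products of ReLU derivatives along active paths do not shrink.
import numpy as np

z = np.linspace(-6.0, 6.0, 7)
sig = 1.0 / (1.0 + np.exp(-z))
print("sigmoid':", (sig * (1.0 - sig)).round(4))  # <= 0.25, -> 0 in both tails
print("ReLU'   :", (z > 0).astype(float))         # 1 wherever the unit is active
```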

Weight initialization

Weight initialization is another approach that has been proposed to reduce the vanishing gradient problem in deep networks.

Kumar suggested that the distribution of initial weights should vary according to activation function used and proposed to initialize the weights in networks with the logistic activation function using a Gaussian distribution with a zero mean and a standard deviation of 3.6/sqrt(N), where N is the number of neurons in a layer. [20]

Recently, Yilmaz and Poli [21] performed a theoretical analysis of how gradients are affected by the mean of the initial weights in deep neural networks using the logistic activation function, and found that gradients do not vanish if the mean of the initial weights is set according to the formula max(−1, −8/N), where N is the number of neurons in a layer. This simple strategy allows networks with 10 or 15 hidden layers to be trained very efficiently and effectively using standard backpropagation.
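
For illustration, both initialization rules can be written as follows (the formulas come from the cited papers; the code, the interpretation of N as the layer's fan-in, and the standard deviation used in the second initializer are assumptions of this sketch, since only the mean is specified above):

```python
# Sketch of the two logistic-activation initialization rules cited above.
import numpy as np

def kumar_init(n_in, n_out, rng):
    # Zero-mean Gaussian with standard deviation 3.6/sqrt(N) (Kumar 2017).
    return rng.normal(0.0, 3.6 / np.sqrt(n_in), size=(n_out, n_in))

def yilmaz_poli_init(n_in, n_out, rng):
    # Gaussian whose mean is max(-1, -8/N) (Yilmaz & Poli 2022); the spread
    # here is an arbitrary choice, as only the mean is prescribed.
    return rng.normal(max(-1.0, -8.0 / n_in), 1.0 / np.sqrt(n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
print(kumar_init(256, 128, rng).std().round(3))         # ≈ 3.6/sqrt(256) = 0.225
print(yilmaz_poli_init(256, 128, rng).mean().round(4))  # ≈ -8/256 = -0.0313
```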

Other

Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid [22] to solve problems like image reconstruction and face localization. [citation needed]

Neural networks can also be optimized by using a universal search algorithm on the space of neural network weights, e.g., random guessing or, more systematically, a genetic algorithm. This approach is not based on gradients and avoids the vanishing gradient problem. [23]
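
A minimal version of this idea is pure random search over the weight vector, as in the following sketch (an illustration; the toy task and network shape are arbitrary):

```python
# Gradient-free baseline: random search over the weights of a tiny network.
# No backpropagation is involved, so gradients cannot vanish, at the cost of
# far worse sample efficiency than gradient-based training.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # toy XOR-like target

def loss(w):
    h = np.tanh(X @ w[:4].reshape(2, 2))                # 2-2-1 network
    p = 1.0 / (1.0 + np.exp(-(h @ w[4:6] + w[6])))
    return np.mean((p - y) ** 2)

best_w, best_loss = None, np.inf
for _ in range(5000):                       # random guessing of all weights
    w = rng.standard_normal(7)
    l = loss(w)
    if l < best_loss:
        best_w, best_loss = w, l
print("best loss found by random search:", round(best_loss, 4))
```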

Notes

  1. A more general loss function could depend on the entire sequence of outputs, as $L(x_1, \ldots, x_T, u_1, \ldots, u_T)$, for which the problem is the same, just with more complex notations.
  2. Any activation function works, as long as it is differentiable with bounded derivative.
  3. Consider $W_{\text{rec}} = \gamma I$ and $b = 0$, with $\gamma > 1$ and $\sigma$ the logistic function. Then $W_{\text{rec}}$ has spectral radius $\gamma > 1$, and the $k$-step product is $\gamma^k \prod_{j=1}^{k} \operatorname{diag}(\sigma'(x_{t-j}))$, which might go to infinity or zero depending on the choice of $\gamma$ and of the trajectory, since $|\sigma'| \le 1/4$.
  4. This is because at $w = 5, b = -2.5$, the two stable attractors are $x \approx 0.145$ and $x \approx 0.855$, and the unstable attractor is $x = 0.5$.


References

    1. Basodi, Sunitha; Ji, Chunyan; Zhang, Haiping; Pan, Yi (September 2020). "Gradient amplification: An efficient way to train deep neural networks". Big Data Mining and Analytics. 3 (3): 198. arXiv: 2006.10560 . doi: 10.26599/BDMA.2020.9020004 . ISSN   2096-0654. S2CID   219792172.
    2. Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (Diplom thesis). Institut f. Informatik, Technische Univ. Munich.
    3. Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. (2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kremer, S. C.; Kolen, J. F. (eds.). A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press. ISBN   0-7803-5369-2.
    4. Goh, Garrett B.; Hodas, Nathan O.; Vishnu, Abhinav (15 June 2017). "Deep learning for computational chemistry". Journal of Computational Chemistry. 38 (16): 1291–1307. arXiv: 1701.04503 . Bibcode:2017arXiv170104503G. doi:10.1002/jcc.24764. PMID   28272810. S2CID   6831636.
    5. Pascanu, Razvan; Mikolov, Tomas; Bengio, Yoshua (21 November 2012). "On the difficulty of training Recurrent Neural Networks". arXiv: 1211.5063 [cs.LG].
    6. Doya, K. (1992). "Bifurcations in the learning of recurrent neural networks". [Proceedings] 1992 IEEE International Symposium on Circuits and Systems. Vol. 6. IEEE. pp. 2777–2780. doi:10.1109/iscas.1992.230622. ISBN   0-7803-0593-0. S2CID   15069221.
    7. Bengio, Y.; Simard, P.; Frasconi, P. (March 1994). "Learning long-term dependencies with gradient descent is difficult". IEEE Transactions on Neural Networks. 5 (2): 157–166. doi:10.1109/72.279181. ISSN   1941-0093. PMID   18267787. S2CID   206457500.
    8. Ioffe, Sergey; Szegedy, Christian (1 June 2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". International Conference on Machine Learning. PMLR: 448–456. arXiv: 1502.03167 .
    9. Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (2018). "How Does Batch Normalization Help Optimization?". Advances in Neural Information Processing Systems. Curran Associates, Inc. 31.
    10. J. Schmidhuber., "Learning complex, extended sequences using the principle of history compression," Neural Computation, 4, pp. 234–242, 1992.
    11. Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation . 18 (7): 1527–1554. CiteSeerX   10.1.1.76.1541 . doi:10.1162/neco.2006.18.7.1527. PMID   16764513. S2CID   2309950.
    12. Hinton, G. (2009). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi: 10.4249/scholarpedia.5947 .
    13. Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID   9377276. S2CID   1915014.
    14. Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552
    15. Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (2009). "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868. CiteSeerX   10.1.1.139.4502 . doi:10.1109/tpami.2008.137. PMID   19299860. S2CID   14635907.
    16. Schmidhuber, Jürgen (2015). "Deep learning in neural networks: An overview". Neural Networks. 61: 85–117. arXiv: 1404.7828 . doi:10.1016/j.neunet.2014.09.003. PMID   25462637. S2CID   11715509.
    17. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv: 1512.03385 . doi:10.1109/CVPR.2016.90. ISBN   978-1-4673-8851-1.
    18. Veit, Andreas; Wilber, Michael; Belongie, Serge (20 May 2016). "Residual Networks Behave Like Ensembles of Relatively Shallow Networks". arXiv: 1605.06431 [cs.CV].
    19. Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (14 June 2011). "Deep Sparse Rectifier Neural Networks". Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR: 315–323.
    20. Kumar, Siddharth Krishna. "On weight initialization in deep neural networks." arXiv preprint arXiv:1704.08863 (2017).
    21. Yilmaz, Ahmet; Poli, Riccardo (1 September 2022). "Successfully and efficiently training deep multi-layer perceptrons with logistic activation function simply requires initializing the weights with an appropriate negative mean". Neural Networks. 153: 87–103. doi:10.1016/j.neunet.2022.05.030. ISSN   0893-6080. PMID   35714424. S2CID   249487697.
    22. Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.
    23. "Sepp Hochreiter's Fundamental Deep Learning Problem (1991)". people.idsia.ch. Retrieved 7 January 2017.