In machine learning, the vanishing gradient problem is encountered when training neural networks with gradient-based learning methods and backpropagation. In such methods, during each training iteration, each neural network weight receives an update proportional to the partial derivative of the loss function with respect to the current weight. [1] The problem is that as the network depth or sequence length increases, the gradient magnitude typically decreases, slowing the training process (in the related exploding gradient problem it instead grows uncontrollably). [1] In the worst case, this may completely stop the neural network from further learning. [1] As one example of this problem, traditional activation functions such as the hyperbolic tangent function have derivatives in the range (0, 1], and backpropagation computes gradients using the chain rule. This has the effect of multiplying n of these numbers, most of them smaller than 1, to compute gradients of the early layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the early layers train very slowly.
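The exponential decay can be illustrated with a minimal Python sketch. It multiplies tanh derivatives along a single path and deliberately ignores the weight matrices; the helper names and the fixed pre-activation value of 1.0 are assumptions chosen for illustration, not part of any cited method.

```python
import math

def tanh_derivative(z):
    """Derivative of tanh at z: 1 - tanh(z)^2, which lies in (0, 1]."""
    return 1.0 - math.tanh(z) ** 2

def chained_gradient(pre_activations):
    """Product of activation derivatives along one path (weights ignored)."""
    grad = 1.0
    for z in pre_activations:
        grad *= tanh_derivative(z)
    return grad

# Hypothetical setup: every layer sees the same pre-activation z = 1.0.
for depth in (1, 5, 10, 20):
    print(depth, chained_gradient([1.0] * depth))
```

Each factor here is about 0.42, so the printed product shrinks geometrically with depth. In a real network the weights also enter the product and can slow or reverse this decay.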
Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's 1991 Diplom thesis formally identified the reason for this failure as the "vanishing gradient problem", [2] [3] which affects not only many-layered feedforward networks, [4] but also recurrent networks. [5] [6] The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time).
When activation functions are used whose derivatives can take on larger values, one risk is encountering the related exploding gradient problem.
This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio. [6]
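A generic recurrent network has hidden states $h_1, h_2, \dots$, inputs $u_1, u_2, \dots$, and outputs $x_1, x_2, \dots$. Let it be parametrized by $\theta$, so that the system evolves as
$$(h_t, x_t) = F(h_{t-1}, u_t, \theta)$$
Often, the output $x_t$ is a function of $h_t$, as some $x_t = G(h_t)$. The vanishing gradient problem already presents itself clearly when $x_t = h_t$, so we simplify our notation to the special case with:
$$x_t = F(x_{t-1}, u_t, \theta)$$
Now, take its differential:
$$\begin{aligned} dx_t &= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta)\, dx_{t-1} \\ &= \nabla_\theta F(x_{t-1}, u_t, \theta)\, d\theta + \nabla_x F(x_{t-1}, u_t, \theta) \left( \nabla_\theta F(x_{t-2}, u_{t-1}, \theta)\, d\theta + \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, dx_{t-2} \right) \\ &= \cdots \\ &= \left( \nabla_\theta F(x_{t-1}, u_t, \theta) + \nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_\theta F(x_{t-2}, u_{t-1}, \theta) + \cdots \right) d\theta \end{aligned}$$
Training the network requires us to define a loss function to be minimized. Let it be $L(x_T, u_1, \dots, u_T)$ [note 1], then minimizing it by gradient descent gives
$$dL = \nabla_x L(x_T, u_1, \dots, u_T) \left( \nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right) d\theta$$ (loss differential)
$$\Delta\theta = -\eta \left[ \nabla_x L(x_T, u_1, \dots, u_T) \left( \nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right) \right]^{\mathsf T}$$
where $\eta$ is the learning rate.
The vanishing/exploding gradient problem appears because there are repeated multiplications, of the form
$$\nabla_x F(x_{t-1}, u_t, \theta)\, \nabla_x F(x_{t-2}, u_{t-1}, \theta)\, \nabla_x F(x_{t-3}, u_{t-2}, \theta) \cdots$$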
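For a concrete example, consider a typical recurrent network defined by
$$x_t = F(x_{t-1}, u_t, \theta) = W_{rec}\, \sigma(x_{t-1}) + W_{in} u_t + b$$
where $\theta = (W_{rec}, W_{in})$ is the network parameter, $\sigma$ is the sigmoid activation function [note 2], applied to each vector coordinate separately, and $b$ is the bias vector.
Then, $\nabla_x F(x_{t-1}, u_t, \theta) = W_{rec} \operatorname{diag}(\sigma'(x_{t-1}))$, and so
$$\nabla_x F(x_{t-1}, u_t, \theta) \cdots \nabla_x F(x_{t-k}, u_{t-k+1}, \theta) = W_{rec} \operatorname{diag}(\sigma'(x_{t-1}))\, W_{rec} \operatorname{diag}(\sigma'(x_{t-2})) \cdots W_{rec} \operatorname{diag}(\sigma'(x_{t-k}))$$
Since $|\sigma'| \le 1$, the operator norm of the above multiplication is bounded above by $\|W_{rec}\|^k$. So if the spectral radius of $W_{rec}$ is $\gamma < 1$, then at large $k$, the above multiplication has operator norm bounded above by $\gamma^k \to 0$. This is the prototypical vanishing gradient problem.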
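The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential):
$$\nabla_\theta L = \nabla_x L(x_T, u_1, \dots, u_T) \left( \nabla_\theta F(x_{T-1}, u_T, \theta) + \nabla_x F(x_{T-1}, u_T, \theta)\, \nabla_\theta F(x_{T-2}, u_{T-1}, \theta) + \cdots \right)$$
The components of $\nabla_\theta F(x, u, \theta)$ are just components of $\sigma(x)$ and $u$, so if $u_T, u_{T-1}, \dots$ are bounded, then $\|\nabla_\theta F(x_{T-k-1}, u_{T-k}, \theta)\|$ is also bounded by some $M > 0$, and so the terms in $\nabla_\theta L$ decay as $M \gamma^k$. This means that, effectively, $\nabla_\theta L$ is affected only by the first $O(\gamma^{-1})$ terms in the sum.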
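A scalar version of this recurrent network makes the decay easy to observe. The sketch below assumes the update x_t = w_rec * sigmoid(x_{t-1}) + w_in * u_t + b with the logistic sigmoid, whose derivative never exceeds 1/4, and tracks the magnitude of the derivative of the final state with respect to the state k steps earlier. The function names and parameter values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def jacobian_products(w_rec, w_in, b, inputs, x0=0.0):
    """Run the scalar recurrence and return |d x_T / d x_{T-k}| for k = 1..T."""
    xs = [x0]
    for u in inputs:
        xs.append(w_rec * sigmoid(xs[-1]) + w_in * u + b)
    products = []
    prod = 1.0
    # Walk backwards from x_{T-1} to x_0, multiplying one Jacobian per step.
    for x_prev in reversed(xs[:-1]):
        prod *= w_rec * sigmoid_prime(x_prev)
        products.append(abs(prod))
    return products

# Hypothetical parameters: w_rec = 0.9, zero input, zero bias, 30 time steps.
decay = jacobian_products(w_rec=0.9, w_in=1.0, b=0.0, inputs=[0.0] * 30)
for k in (1, 5, 10, 30):
    print(k, decay[k - 1])
```

With these values each factor is at most 0.9 * 0.25, so the contribution of a state 30 steps in the past is negligible, matching the observation that only the most recent terms of the sum matter.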
If $\gamma \ge 1$, the above analysis does not quite work. [note 3] For the prototypical exploding gradient problem, the next model is clearer.
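Following (Doya, 1993), [7] consider this one-neuron recurrent network with sigmoid activation:
$$x_{t+1} = (1 - \epsilon)\, x_t + \epsilon\, \sigma(w x_t + b) + \epsilon\, w' u_t$$
At the small $\epsilon$ limit, the dynamics of the network becomes
$$\frac{dx}{dt} = -x(t) + \sigma(w x(t) + b) + w' u(t)$$
Consider first the autonomous case, with $u = 0$. Set $w = 5.0$, and vary $b$ in $[-3, -2]$. As $b$ decreases, the system has 1 stable point, then has 2 stable points and 1 unstable point, and finally has 1 stable point again. Explicitly, the fixed points (stable and unstable) are $(x, b) = \left( x, \ln\left( \frac{x}{1 - x} \right) - 5x \right)$.
Now consider $\frac{\Delta x(T)}{\Delta x(0)}$ and $\frac{\Delta x(T)}{\Delta b}$, where $T$ is large enough that the system has settled into one of the stable points.
If $(x(0), b)$ puts the system very close to an unstable point, then a tiny variation in $x(0)$ or $b$ would make $x(T)$ move from one stable point to the other. This makes $\frac{\Delta x(T)}{\Delta x(0)}$ and $\frac{\Delta x(T)}{\Delta b}$ both very large, a case of the exploding gradient.
If $(x(0), b)$ puts the system far from an unstable point, then a small variation in $x(0)$ would have no effect on $x(T)$, making $\frac{\Delta x(T)}{\Delta x(0)} = 0$, a case of the vanishing gradient.
Note that in this case, $\frac{\Delta x(T)}{\Delta b} \approx \frac{\partial x(T)}{\partial b} = \left( \frac{1}{x(T)(1 - x(T))} - 5 \right)^{-1}$ neither decays to zero nor blows up to infinity. Indeed, it is the only well-behaved gradient, which explains why early research focused on learning or designing recurrent network systems that could perform long-range computations (such as outputting the first input seen at the very end of an episode) by shaping their stable attractors. [8]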
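Both regimes can be reproduced numerically with a short Python sketch. It assumes the discrete one-neuron update x <- (1 - eps) * x + eps * sigmoid(w * x + b) with no input, the values w = 5 and b = -2.5 (for which x = 0.5 is an unstable fixed point), and a central finite difference in place of an analytic gradient. The step size, step count, and function names are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(x0, b, w=5.0, eps=0.1, steps=2000):
    """Iterate x <- (1 - eps) * x + eps * sigmoid(w * x + b) with zero input."""
    x = x0
    for _ in range(steps):
        x = (1.0 - eps) * x + eps * sigmoid(w * x + b)
    return x

def sensitivity_to_x0(x0, b, delta=1e-6):
    """Central finite-difference estimate of d x(T) / d x(0)."""
    hi = simulate(x0 + delta, b)
    lo = simulate(x0 - delta, b)
    return (hi - lo) / (2.0 * delta)

# Starting on the unstable point: a tiny perturbation picks the attractor.
print("near unstable point:", sensitivity_to_x0(0.5, -2.5))
# Starting well inside one basin: the perturbation is forgotten.
print("far from unstable point:", sensitivity_to_x0(0.9, -2.5))
```

The first estimate is very large because the two perturbed trajectories end at different stable points, while the second is essentially zero because both trajectories converge to the same one.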
For the general case, the intuition still holds ( [6] Figures 3, 4, and 5).
Continue using the above one-neuron network, fixing $w = 5$, $x(0) = 0.5$, $u(t) = 0$, and consider a loss function defined by $L(x(T)) = (0.855 - x(T))^2$. This produces a rather pathological loss landscape: as $b$ approaches $-2.5$ from above, the loss approaches zero, but as soon as $b$ crosses $-2.5$, the attractor basin changes, and the loss jumps to 0.50. [note 4]
Consequently, attempting to train by gradient descent would "hit a wall in the loss landscape" and cause an exploding gradient. A slightly more complex situation is plotted in [6], Figure 6.
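The wall in the loss landscape can be seen by sweeping the bias across the critical value. The sketch below assumes the discrete one-neuron update x <- (1 - eps) * x + eps * sigmoid(w * x + b) with w = 5, x(0) = 0.5, no input, and the loss (0.855 - x(T))^2. The step size, number of steps, and sampled bias values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def final_state(b, x0=0.5, w=5.0, eps=0.1, steps=5000):
    """Settle the one-neuron network from x0 and return the final state."""
    x = x0
    for _ in range(steps):
        x = (1.0 - eps) * x + eps * sigmoid(w * x + b)
    return x

def loss(b, target=0.855):
    """Squared distance between the settled state and the target."""
    return (target - final_state(b)) ** 2

# Sweep the bias from above the critical value -2.5 to below it.
for b in (-2.40, -2.45, -2.49, -2.51, -2.55, -2.60):
    print(b, loss(b))
```

On the upper side of the threshold the loss stays small, and on the lower side it sits near 0.5, so a finite-difference slope taken across the threshold grows without bound as the two sample points move closer together.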
To overcome this problem, several methods were proposed.
For recurrent neural networks, the long short-term memory (LSTM) network was designed to solve the problem (Hochreiter & Schmidhuber, 1997). [9]
For the exploding gradient problem, (Pascanu et al, 2012) [6] recommended gradient clipping, meaning dividing the gradient vector $g$ by $\|g\| / g_{max}$ if $\|g\| > g_{max}$. This restricts the gradient vectors within a ball of radius $g_{max}$.
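Clipping by norm takes only a few lines. The following Python sketch works on a plain list of floats; the function name `clip_gradient` and the parameter name `max_norm` are assumptions for illustration and do not refer to any particular library.

```python
import math

def clip_gradient(grad, max_norm):
    """Rescale grad so its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        # Dividing by norm / max_norm keeps the direction, shrinks the length.
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)

print(clip_gradient([3.0, 4.0], max_norm=1.0))   # norm 5 is rescaled to norm 1
print(clip_gradient([0.3, 0.4], max_norm=1.0))   # norm 0.5 is left unchanged
```

Because only the length changes, the update still points in the direction of steepest descent; the threshold itself is a hyperparameter that has to be chosen for the task.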
Batch normalization is a standard method for solving both the exploding and the vanishing gradient problems. [10] [11]
One approach is a multi-level hierarchy of networks (Schmidhuber, 1992), pre-trained one level at a time through unsupervised learning and then fine-tuned through backpropagation. [12] Here each level learns a compressed representation of the observations that is fed to the next level.
Similar ideas have been used in feed-forward neural networks for unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised backpropagation to classify labeled data. The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher level features. Each new layer guarantees an increase on the lower-bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations. [13] Hinton reports that his models are effective feature extractors over high-dimensional, structured data. [14]
Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) has increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way" [15] since the original models tackling the vanishing gradient problem by Hinton and others were trained on a Xeon processor, not GPUs. [13]
Residual connections, or skip connections, refer to the architectural motif of $x \mapsto f(x) + x$, where $f$ is an arbitrary neural network module. This gives the gradient $\nabla f + I$, where the identity matrix does not suffer from the vanishing or exploding gradient. During backpropagation, part of the gradient flows through the residual connections. [16]
Concretely, let the neural network (without residual connections) be $f_n \circ f_{n-1} \circ \cdots \circ f_1$, then with residual connections, the gradient of the output with respect to the activations at layer $l$ is $(I + \nabla f_n)(I + \nabla f_{n-1}) \cdots (I + \nabla f_{l+1})$, which expands to $I + \sum_{i > l} \nabla f_i + \cdots$ and so always contains the identity term. The gradient thus does not vanish in arbitrarily deep networks.
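The effect of the identity term can be checked on a scalar toy network. The sketch below assumes layers of the form f(x) = w * tanh(x) with a positive weight, and compares the input gradient of a plain stack with that of a residual stack; the layer form, weight value, and depth are hypothetical choices for illustration.

```python
import math

def layer(x, w):
    return w * math.tanh(x)

def layer_grad(x, w):
    return w * (1.0 - math.tanh(x) ** 2)

def input_gradient(x, weights, residual):
    """Derivative of the network output with respect to its scalar input."""
    grad = 1.0
    for w in weights:
        local = layer_grad(x, w)
        if residual:
            grad *= 1.0 + local      # derivative of x + f(x)
            x = x + layer(x, w)
        else:
            grad *= local            # derivative of f(x)
            x = layer(x, w)
    return grad

weights = [0.5] * 30  # hypothetical: 30 layers, each with weight 0.5
print("plain   :", input_gradient(0.3, weights, residual=False))
print("residual:", input_gradient(0.3, weights, residual=True))
```

With positive layer derivatives every residual factor is at least 1, so the product cannot shrink, whereas the plain product is bounded by 0.5 raised to the depth. Negative layer derivatives could still reduce individual residual factors, so this is an illustration rather than a general guarantee.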
Feedforward networks with residual connections can be regarded as an ensemble of relatively shallow nets. In this perspective, they resolve the vanishing gradient problem by being equivalent to ensembles of many shallow networks, for which there is no vanishing gradient problem. [17]
Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction. [18]
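The one-sided saturation can be made concrete by comparing derivatives at large positive and negative pre-activations. This is a minimal sketch; the logistic sigmoid is used as the comparison activation and the sample points are arbitrary assumptions.

```python
import math

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 for negative inputs."""
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid, at most 0.25 and tiny for large |z|."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

for z in (-10.0, -1.0, 1.0, 10.0):
    print(z, relu_grad(z), sigmoid_grad(z))
```

For positive inputs the ReLU derivative stays at 1 however large the input becomes, while the sigmoid derivative shrinks towards zero on both sides.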
Weight initialization is another approach that has been proposed to reduce the vanishing gradient problem in deep networks.
Kumar suggested that the distribution of initial weights should vary according to the activation function used, and proposed to initialize the weights in networks with the logistic activation function using a Gaussian distribution with a zero mean and a standard deviation of 3.6/sqrt(N), where N is the number of neurons in a layer. [19]
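As a minimal sketch of this rule, the following Python code draws a weight matrix from a zero-mean Gaussian with standard deviation 3.6/sqrt(N). It assumes that N is the number of inputs feeding the layer, which is one reading of "the number of neurons in a layer"; the function names and the fixed seed are illustrative.

```python
import math
import random

def logistic_init_std(n):
    """Standard deviation 3.6 / sqrt(N) for logistic-activation layers."""
    return 3.6 / math.sqrt(n)

def init_weights(n_in, n_out, seed=0):
    """Return an n_out x n_in list of Gaussian weights (N assumed to be n_in)."""
    rng = random.Random(seed)
    std = logistic_init_std(n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

weights = init_weights(n_in=100, n_out=50)
print(logistic_init_std(100), len(weights), len(weights[0]))
```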
Recently, Yilmaz and Poli [20] performed a theoretical analysis on how gradients are affected by the mean of the initial weights in deep neural networks using the logistic activation function and found that gradients do not vanish if the mean of the initial weights is set according to the formula max(−1, −8/N). This simple strategy allows networks with 10 or 15 hidden layers to be trained very efficiently and effectively using standard backpropagation.
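The formula for the mean is easy to tabulate. The short sketch below only evaluates max(−1, −8/N) for a few layer sizes; the spread of the distribution around this mean is not specified here, so it is left out, and the function name is an assumption.

```python
def initial_weight_mean(n):
    """Mean of the initial weights, max(-1, -8/N), for a layer of N neurons."""
    return max(-1.0, -8.0 / n)

for n in (4, 8, 16, 100):
    print(n, initial_weight_mean(n))
```

For layers with eight or fewer neurons the mean is held at −1, and for wider layers it moves towards zero in proportion to 1/N.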
Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid [21] to solve problems like image reconstruction and face localization.
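A sign-based update ignores the gradient magnitude entirely, so a vanishingly small gradient moves a weight just as far as a huge one. The following Python sketch is a simplified Rprop variant with per-weight step sizes that grow while the gradient sign is stable and shrink when it flips. It is not taken from Behnke's work; the constants (1.2, 0.5, the step bounds) and the toy objective are assumptions for illustration.

```python
def sign(v):
    return (v > 0) - (v < 0)

def rprop_minimize(grad_fn, weights, steps=200, init_step=0.1,
                   eta_plus=1.2, eta_minus=0.5, step_min=1e-9, step_max=1.0):
    """Minimize using only gradient signs, with adaptive per-weight steps."""
    weights = list(weights)
    step_sizes = [init_step] * len(weights)
    prev_signs = [0] * len(weights)
    for _ in range(steps):
        grad = grad_fn(weights)
        for i, g in enumerate(grad):
            s = sign(g)
            if s * prev_signs[i] > 0:      # same direction: take larger steps
                step_sizes[i] = min(step_sizes[i] * eta_plus, step_max)
            elif s * prev_signs[i] < 0:    # overshoot: take smaller steps
                step_sizes[i] = max(step_sizes[i] * eta_minus, step_min)
            weights[i] -= s * step_sizes[i]
            prev_signs[i] = s
    return weights

# Toy objective whose two gradient components differ by 16 orders of magnitude.
def toy_grad(w):
    return [1e-8 * 2.0 * (w[0] - 3.0), 1e8 * 2.0 * (w[1] + 2.0)]

print(rprop_minimize(toy_grad, [0.0, 0.0]))
```

Both coordinates converge at the same rate even though one gradient component is tiny and the other is enormous, which is the property that makes sign-based updates insensitive to vanishing gradient magnitudes.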
Neural networks can also be optimized by using a universal search algorithm over the space of the network's weights, e.g., random guessing or, more systematically, a genetic algorithm. This approach is not based on gradients and so avoids the vanishing gradient problem. [22]
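The simplest gradient-free search is random guessing: sample weight vectors, evaluate the loss, and keep the best one. The sketch below applies this to a toy quadratic loss standing in for a network's training loss; the search range, trial count, seed, and the loss itself are assumptions for illustration.

```python
import random

def random_guess_search(loss_fn, dim, trials=2000, low=-2.0, high=2.0, seed=0):
    """Sample weight vectors uniformly at random and keep the best one."""
    rng = random.Random(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(trials):
        candidate = [rng.uniform(low, high) for _ in range(dim)]
        value = loss_fn(candidate)
        if value < best_loss:
            best_w, best_loss = candidate, value
    return best_w, best_loss

def toy_loss(w):
    """Stand-in for a training loss, minimized at w = (1.0, -0.5)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 0.5) ** 2

best_w, best_loss = random_guess_search(toy_loss, dim=2)
print(best_w, best_loss)
```

No derivative is ever computed, so there is nothing to vanish, but the number of trials needed grows quickly with the number of weights.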