Weight initialization

In deep learning, weight initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training: weight initialization is the pre-training step of assigning initial values to these parameters.

The choice of weight initialization method affects the speed of convergence, the scale of neural activation within the network, the scale of gradient signals during backpropagation, and the quality of the final model. Proper initialization is necessary for avoiding issues such as vanishing and exploding gradients and activation function saturation.

Note that even though this article is titled "weight initialization", both weights and biases are used in a neural network as trainable parameters, so this article describes how both of these are initialized. Similarly, trainable parameters in convolutional neural networks (CNNs) are called kernels and biases, and this article also describes these.

Constant initialization

We discuss the main methods of initialization in the context of a multilayer perceptron (MLP). Specific strategies for initializing other network architectures are discussed in later sections.

For an MLP, there are only two kinds of trainable parameters, called weights and biases. Each layer $l$ contains a weight matrix $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$ and a bias vector $b^{(l)} \in \mathbb{R}^{n_l}$, where $n_l$ is the number of neurons in that layer. A weight initialization method is an algorithm for setting the initial values of $W^{(l)}$ and $b^{(l)}$ for each layer $l$.

The simplest form is zero initialization: $W^{(l)} = 0,\; b^{(l)} = 0$ for every layer $l$. Zero initialization is usually used for initializing biases, but it is not used for initializing weights, as it leads to symmetry in the network, causing all neurons in a layer to learn the same features.

In this page, we assume $b^{(l)} = 0$ unless otherwise stated.

Recurrent neural networks typically use activation functions with bounded range, such as sigmoid and tanh, since unbounded activations may cause exploding values. (Le, Jaitly, and Hinton, 2015) [1] suggested initializing the weights in the recurrent parts of the network to the identity matrix and the biases to zero.
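
A minimal NumPy sketch of this identity initialization for a vanilla ReLU recurrent cell; the function name, the small scale for the input-to-hidden weights, and the seed are illustrative assumptions rather than details from the cited paper:

```python
import numpy as np

def init_identity_rnn(n_input, n_hidden, seed=0):
    """Identity initialization for the recurrent part of a ReLU RNN (Le et al., 2015)."""
    rng = np.random.default_rng(seed)
    W_hh = np.eye(n_hidden)                              # recurrent weights: identity matrix
    b_h = np.zeros(n_hidden)                             # recurrent bias: zero
    W_xh = rng.normal(0.0, 0.001, (n_hidden, n_input))   # small random input weights (assumed scale)
    return W_xh, W_hh, b_h
```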

In most cases, the biases are initialized to zero, though some situations can use a nonzero initialization. For example, in multiplicative units, such as the forget gate of LSTM, the bias can be initialized to 1 to allow good gradient signal through the gate. [2] For neurons with ReLU activation, one can initialize the bias to a small positive value like 0.1, so that the gradient is likely nonzero at initialization, avoiding the dying ReLU problem. [3] :305 [4]
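
The bias conventions above can be summarized in a short NumPy sketch; the 0.1 constant follows the cited heuristic, while the layer sizes and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128                 # illustrative layer sizes

b_default = np.zeros(n_out)            # usual choice: zero biases
b_lstm_forget = np.ones(n_out)         # LSTM forget gate: bias 1 keeps the gate open early in training
b_relu = np.full(n_out, 0.1)           # ReLU units: small positive bias to avoid dying ReLU at init
```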

Random initialization

Random initialization means sampling the weights from a normal distribution or a uniform distribution, usually independently.

LeCun initialization

LeCun initialization, popularized in (LeCun et al., 1998), [5] is designed to preserve the variance of neural activations during the forward pass.

It samples each entry of $W$ independently from a distribution with mean 0 and variance $1/n_{\text{in}}$, where $n_{\text{in}}$ is the number of inputs to the layer. For example, if the distribution is a continuous uniform distribution, then it is $\mathcal{U}\!\left[-\sqrt{3/n_{\text{in}}},\ \sqrt{3/n_{\text{in}}}\right]$.
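
A sketch of LeCun initialization in NumPy, covering both the normal and uniform variants (the function name and seed are illustrative):

```python
import numpy as np

def lecun_init(n_in, n_out, dist="normal", seed=0):
    """Each weight has mean 0 and variance 1 / n_in."""
    rng = np.random.default_rng(seed)
    if dist == "normal":
        return rng.normal(0.0, np.sqrt(1.0 / n_in), (n_out, n_in))
    # A uniform distribution U[-a, a] has variance a**2 / 3, so a = sqrt(3 / n_in).
    a = np.sqrt(3.0 / n_in)
    return rng.uniform(-a, a, (n_out, n_in))
```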

Glorot initialization

Glorot initialization (or Xavier initialization) was proposed by Xavier Glorot and Yoshua Bengio. [6] It was designed as a compromise between two goals: to preserve activation variance during the forward pass and to preserve gradient variance during the backward pass.

For uniform initialization, it samples each entry of $W$ independently and identically from $\mathcal{U}\!\left[-\sqrt{6/(n_{\text{in}}+n_{\text{out}})},\ \sqrt{6/(n_{\text{in}}+n_{\text{out}})}\right]$. In this context, $n_{\text{in}}$ is also called the "fan-in", and $n_{\text{out}}$ the "fan-out". When the fan-in and fan-out are equal, Glorot initialization coincides with LeCun initialization.
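
A sketch of the uniform variant in NumPy (names are illustrative); the bound gives each weight variance $2/(n_{\text{in}}+n_{\text{out}})$:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=0):
    """Glorot (Xavier) uniform initialization."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(6.0 / (fan_in + fan_out))   # U[-a, a] has variance a**2 / 3 = 2 / (fan_in + fan_out)
    return rng.uniform(-a, a, (fan_out, fan_in))
```

When fan_in equals fan_out, the bound reduces to sqrt(3 / fan_in), matching the LeCun uniform case above.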

He initialization

As Glorot initialization performs poorly for ReLU activation, [7] He initialization (or Kaiming initialization) was proposed by Kaiming He et al. [8] for networks with ReLU activation. It samples each entry of $W$ from $\mathcal{N}\!\left(0,\ 2/n_{\text{in}}\right)$.
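
A corresponding NumPy sketch (names are illustrative):

```python
import numpy as np

def he_normal(fan_in, fan_out, seed=0):
    """He (Kaiming) initialization for ReLU layers: zero mean, variance 2 / fan_in."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_out, fan_in))
```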

Orthogonal initialization

(Saxe et al., 2013) [9] proposed orthogonal initialization: initializing weight matrices as uniformly random (according to the Haar measure) semi-orthogonal matrices, multiplied by a factor that depends on the activation function of the layer. It was designed so that if one initializes a deep linear network this way, its training time until convergence is independent of depth. [10]

Sampling a uniformly random semi-orthogonal matrix can be done by filling a tall matrix $X$ with IID samples from a standard normal distribution, then computing the orthogonal factor $X\,(X^{\top}X)^{-1/2}$ or its transpose, depending on whether the target weight matrix is tall or wide. [11]
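
The sketch below samples such a matrix with the QR-based construction commonly used in deep learning libraries, an alternative to the polar-factor formula above; the gain factor is left as a parameter since it depends on the layer's activation function:

```python
import numpy as np

def semi_orthogonal(n_rows, n_cols, gain=1.0, seed=0):
    """Sample a (scaled) uniformly random semi-orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    tall = n_rows >= n_cols
    x = rng.normal(size=(n_rows, n_cols) if tall else (n_cols, n_rows))
    q, r = np.linalg.qr(x)
    q = q * np.sign(np.diag(r))     # fix column signs so the distribution is Haar-uniform
    return gain * (q if tall else q.T)
```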

For CNN kernels with odd widths and heights, orthogonal initialization is done this way: fill the central spatial position of the kernel with a random semi-orthogonal matrix, and fill the other entries with zero. As an illustration, a kernel of shape $(c_{\text{out}}, c_{\text{in}}, k, k)$ with odd $k$ is initialized by filling the slice at the central spatial location with the entries of a random semi-orthogonal matrix of shape $(c_{\text{out}}, c_{\text{in}})$, and setting all other entries to zero. (Balduzzi et al., 2017) [12] used it with stride 1 and zero-padding. This is sometimes called the Orthogonal Delta initialization. [11] [13]
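
A sketch of this kernel initialization in NumPy, assuming a square kernel with odd side length k (names are illustrative):

```python
import numpy as np

def delta_orthogonal(c_out, c_in, k, seed=0):
    """Orthogonal ("delta") initialization of a (c_out, c_in, k, k) convolution kernel, k odd."""
    rng = np.random.default_rng(seed)
    kernel = np.zeros((c_out, c_in, k, k))
    # random semi-orthogonal (c_out, c_in) block, via QR of a Gaussian matrix
    q, r = np.linalg.qr(rng.normal(size=(max(c_out, c_in), min(c_out, c_in))))
    q = q * np.sign(np.diag(r))
    kernel[:, :, k // 2, k // 2] = q if c_out >= c_in else q.T
    return kernel
```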

Related to this approach, unitary initialization proposes to parameterize the weight matrices to be unitary matrices, with the result that at initialization they are random unitary matrices (and throughout training, they remain unitary). This is found to improve long-sequence modelling in LSTM. [14] [15]

Orthogonal initialization has been generalized to layer-sequential unit-variance (LSUV) initialization. It is a data-dependent initialization method that can be used in convolutional neural networks. It first initializes the weights of each convolutional or fully connected layer with an orthonormal matrix. Then, proceeding from the first to the last layer, it runs a forward pass on a random minibatch and divides each layer's weights by the standard deviation of that layer's output, so that the output has variance approximately 1. [16] [17]
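
The per-layer rescaling step can be sketched as follows for a fully connected layer; in the full method this is applied layer by layer while propagating the same minibatch forward through the network (names and the tolerance are illustrative):

```python
import numpy as np

def lsuv_rescale(W, x_batch, tol=0.01, max_iters=10):
    """Rescale an orthogonally pre-initialized weight matrix W (shape n_out x n_in)
    until the layer's outputs on a minibatch x_batch (shape batch x n_in) have unit variance."""
    for _ in range(max_iters):
        std = (x_batch @ W.T).std()
        if abs(std - 1.0) < tol:
            break
        W = W / std
    return W
```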

Normalization-free initialization

In 2015, the introduction of residual connections allowed very deep neural networks to be trained, much deeper than the ~20 layers of the previous state of the art (such as the VGG-19). Residual connections gave rise to their own weight initialization problems and strategies.

Fixup initialization is designed specifically for networks with residual connections and without batch normalization, as follows: [18]

  1. Initialize the classification layer and the last layer of each residual branch to 0.
  2. Initialize every other layer using a standard method (such as He initialization), and scale only the weight layers inside residual branches by $L^{-\frac{1}{2m-2}}$, where $L$ is the number of residual branches and $m$ is the number of layers in each branch.
  3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
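
A minimal sketch of the scaling in step 2 above, in NumPy; the layer sizes and block counts are illustrative:

```python
import numpy as np

def fixup_branch_scale(L, m):
    """Fixup scaling factor L ** (-1 / (2m - 2)) for weight layers inside residual branches."""
    return L ** (-1.0 / (2 * m - 2))

# Example: He-initialized weights inside one branch of a network with 16 residual
# branches and 2 layers per branch.
rng = np.random.default_rng(0)
n_in, n_out = 64, 64
W = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in)) * fixup_branch_scale(16, 2)
```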

Similarly, T-Fixup initialization is designed for Transformers without layer normalization. [19] :9

Others

Instead of initializing all weights with random values on the order of $1/\sqrt{n}$, sparse initialization initializes only a small subset of the weights with larger random values and sets the others to zero, so that the total input variance to each unit remains of the same order. [20]
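
A rough NumPy sketch of sparse initialization; the choice of 15 nonzero incoming weights per unit follows the cited work, while the unit scale of the nonzero weights is an assumption:

```python
import numpy as np

def sparse_init(n_in, n_out, n_nonzero=15, scale=1.0, seed=0):
    """Give each unit only a few nonzero incoming weights; all other weights are zero."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_out, n_in))
    for i in range(n_out):
        idx = rng.choice(n_in, size=min(n_nonzero, n_in), replace=False)
        W[i, idx] = rng.normal(0.0, scale, size=len(idx))
    return W
```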

Random walk initialization was designed for MLP so that during backpropagation, the L2 norm of gradient at each layer performs an unbiased random walk as one moves from the last layer to the first. [21]

Looks-linear initialization was designed to allow the neural network to behave like a deep linear network at initialization, since $\operatorname{ReLU}(x) - \operatorname{ReLU}(-x) = x$. It initializes a matrix $W$ of shape $\tfrac{n_{\text{out}}}{2} \times n_{\text{in}}$ by any method, such as orthogonal initialization, then lets the weight matrix be the vertical concatenation $[W;\, -W]$. [22]
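
A sketch of the construction for a ReLU layer; the base initializer here is He initialization, though the cited work also uses orthogonal matrices (names are illustrative):

```python
import numpy as np

def looks_linear_init(n_in, n_half_out, seed=0):
    """Return a (2 * n_half_out, n_in) weight matrix [W; -W]. The ReLU outputs of the layer
    then contain both ReLU(W x) and ReLU(-W x), whose difference recovers the linear signal W x."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_half_out, n_in))   # any base initializer
    return np.concatenate([W, -W], axis=0)
```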

Miscellaneous

For the hyperbolic tangent activation function, a particular scaling is sometimes used: $f(x) = 1.7159 \tanh\!\left(\tfrac{2}{3}x\right)$. This is sometimes called "LeCun's tanh". It was designed so that it maps the interval $[-1, +1]$ to itself, thus ensuring that the overall gain is around 1 in "normal operating conditions", and that $|f''(x)|$ is at a maximum at $x = \pm 1$, which improves convergence at the end of training. [23] [5]
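
A one-line NumPy version of this scaled activation, showing that it approximately fixes the points ±1:

```python
import numpy as np

def lecun_tanh(x):
    """Scaled tanh: f(x) = 1.7159 * tanh(2x / 3), with f(1) ≈ 1 and f(-1) ≈ -1."""
    return 1.7159 * np.tanh(2.0 * x / 3.0)

print(lecun_tanh(np.array([-1.0, 0.0, 1.0])))   # approximately [-1., 0., 1.]
```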

In self-normalizing neural networks, the SELU activation function with parameters $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ makes the mean and variance of each layer's output have $(0, 1)$ as an attracting fixed point. This makes initialization less important, though the authors recommend initializing weights randomly with variance $1/n_{\text{in}}$. [24]
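
A sketch of the SELU activation together with the recommended companion weight initialization; the constants are the standard reported SELU values, and the function names are illustrative:

```python
import numpy as np

ALPHA = 1.6732632423543772    # SELU alpha
LAMBDA = 1.0507009873554805   # SELU lambda (scale)

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

def selu_weight_init(n_in, n_out, seed=0):
    """Zero-mean weights with variance 1 / n_in, as recommended for self-normalizing networks."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), (n_out, n_in))
```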

History

Random weight initialization has been used since Frank Rosenblatt's perceptrons. An early work that described weight initialization specifically was (LeCun et al., 1998). [5]

Before the 2010s era of deep learning, it was common to initialize models by "pre-training" using an unsupervised learning algorithm that is not backpropagation, as it was difficult to directly train deep neural networks by backpropagation. [25] [26] For example, a deep belief network was trained by using contrastive divergence layer by layer, starting from the bottom. [27]

(Martens, 2010) [20] proposed a quasi-Newton method to directly train deep networks. The work generated considerable excitement that initializing networks without a pre-training phase was possible. [28] However, a 2013 paper demonstrated that, with well-chosen hyperparameters, momentum gradient descent combined with careful weight initialization was sufficient for training deep networks, a combination that is still in use as of 2024. [29]

Since then, the impact of initialization on the scale of activations and gradients has become less important, as methods have been developed to tune the variance automatically, such as batch normalization tuning the variance of the forward pass [30] and momentum-based optimizers tuning the variance of the backward pass. [31]

There is a tension between using careful weight initialization to decrease the need for normalization, and using normalization to decrease the need for careful weight initialization, with each approach having its tradeoffs. For example, batch normalization causes training examples in the minibatch to become dependent, an undesirable trait, while weight initialization is architecture-dependent. [32]


References

  1. Le, Quoc V.; Jaitly, Navdeep; Hinton, Geoffrey E. (2015). "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units". arXiv: 1504.00941 [cs.NE].
  2. Jozefowicz, Rafal; Zaremba, Wojciech; Sutskever, Ilya (2015-06-01). "An Empirical Exploration of Recurrent Network Architectures". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2342–2350.
  3. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep learning. Adaptive computation and machine learning. Cambridge, Massachusetts: The MIT Press. ISBN   978-0-262-03561-3.
  4. Lu, Lu; Shin, Yeonjong; Su, Yanhui; Karniadakis, George Em (2019). "Dying ReLU and Initialization: Theory and Numerical Examples". Communications in Computational Physics. 28 (5): 1671–1706. arXiv: 1903.06733 . doi:10.4208/cicp.OA-2020-0165.
  5. LeCun, Yann; Bottou, Leon; Orr, Genevieve B.; Müller, Klaus-Robert (1998), Orr, Genevieve B.; Müller, Klaus-Robert (eds.), "Efficient BackProp", Neural Networks: Tricks of the Trade, Berlin, Heidelberg: Springer, pp. 9–50, doi:10.1007/3-540-49430-8_2, ISBN 978-3-540-49430-0, retrieved 2024-10-05
  6. Glorot, Xavier; Bengio, Yoshua (2010-03-31). "Understanding the difficulty of training deep feedforward neural networks". Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 249–256.
  7. Kumar, Siddharth Krishna (2017). "On weight initialization in deep neural networks". arXiv: 1704.08863 [cs.LG].
  8. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv: 1502.01852 [cs.CV].
  9. Saxe, Andrew M.; McClelland, James L.; Ganguli, Surya (2013). "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks". arXiv: 1312.6120 [cs.NE].
  10. Hu, Wei; Xiao, Lechao; Pennington, Jeffrey (2020). "Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks". arXiv: 2001.05992 [cs.LG].
  11. Martens, James; Ballard, Andy; Desjardins, Guillaume; Swirszcz, Grzegorz; Dalibard, Valentin; Sohl-Dickstein, Jascha; Schoenholz, Samuel S. (2021). "Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping". arXiv: 2110.01765 [cs.LG].
  12. Balduzzi, David; Frean, Marcus; Leary, Lennox; Lewis, J. P.; Ma, Kurt Wan-Duo; McWilliams, Brian (2017-07-17). "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". Proceedings of the 34th International Conference on Machine Learning. PMLR: 342–350.
  13. Xiao, Lechao; Bahri, Yasaman; Sohl-Dickstein, Jascha; Schoenholz, Samuel; Pennington, Jeffrey (2018-07-03). "Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks". Proceedings of the 35th International Conference on Machine Learning. PMLR: 5393–5402. arXiv: 1806.05393 .
  14. Arjovsky, Martin; Shah, Amar; Bengio, Yoshua (2016-06-11). "Unitary Evolution Recurrent Neural Networks". Proceedings of the 33rd International Conference on Machine Learning. PMLR: 1120–1128. arXiv: 1511.06464 .
  15. Henaff, Mikael; Szlam, Arthur; LeCun, Yann (2017-03-15). "Recurrent Orthogonal Networks and Long-Memory Tasks". arXiv: 1602.06662 [cs.NE].
  16. Mishkin, Dmytro; Matas, Jiri (2016-02-19), All you need is a good init, arXiv: 1511.06422
  17. Xie, Di; Xiong, Jiang; Pu, Shiliang (2017). All You Need Is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks With Orthonormality and Modulation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6176–6185.
  18. Zhang, Hongyi; Dauphin, Yann N.; Ma, Tengyu (2019). "Fixup Initialization: Residual Learning Without Normalization". arXiv: 1901.09321 [cs.LG].
  19. Huang, Xiao Shi; Perez, Felipe; Ba, Jimmy; Volkovs, Maksims (2020-11-21). "Improving Transformer Optimization Through Better Initialization". Proceedings of the 37th International Conference on Machine Learning. PMLR: 4475–4483.
  20. Martens, James (2010-06-21). "Deep learning via Hessian-free optimization". Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML'10. Madison, WI, USA: Omnipress: 735–742. ISBN 978-1-60558-907-7.
  21. Sussillo, David; Abbott, L. F. (2014). "Random Walk Initialization for Training Very Deep Feedforward Networks". arXiv: 1412.6558 [cs.NE].
  22. Balduzzi, David; Frean, Marcus; Leary, Lennox; Lewis, JP; Kurt Wan-Duo Ma; McWilliams, Brian (2017). "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". arXiv: 1702.08591 [cs.NE].
  23. Y. LeCun. Generalization and network design strategies . In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Amsterdam, 1989. Elsevier. Proceedings of the International Conference Connectionism in Perspective, University of Zurich, 10. -- 13. October 1988.
  24. Klambauer, Günter; Unterthiner, Thomas; Mayr, Andreas; Hochreiter, Sepp (2017). "Self-Normalizing Neural Networks". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  25. Bengio, Y. (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2: 1–127. CiteSeerX   10.1.1.701.9550 . doi:10.1561/2200000006.
  26. Erhan, Dumitru; Courville, Aaron; Bengio, Yoshua; Vincent, Pascal (2010-03-31). "Why Does Unsupervised Pre-training Help Deep Learning?". Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 201–208.
  27. Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2006). "Greedy Layer-Wise Training of Deep Networks". Advances in Neural Information Processing Systems. 19. MIT Press.
  28. Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011-06-14). "Deep Sparse Rectifier Neural Networks". Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 315–323.
  29. Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey (2013-05-26). "On the importance of initialization and momentum in deep learning" (PDF). Proceedings of the 30th International Conference on Machine Learning. PMLR: 1139–1147.
  30. Bjorck, Nils; Gomes, Carla P; Selman, Bart; Weinberger, Kilian Q (2018). "Understanding Batch Normalization". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc. arXiv: 1806.02375 .
  31. Balles, Lukas; Hennig, Philipp (2018-07-03). "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients". Proceedings of the 35th International Conference on Machine Learning. PMLR: 404–413. arXiv: 1705.07774 .
  32. Brock, Andrew; De, Soham; Smith, Samuel L.; Simonyan, Karen (2021). "High-Performance Large-Scale Image Recognition Without Normalization". arXiv: 2102.06171 [cs.CV].

Further reading