Large width limits of neural networks

Last updated February 06, 2024

Behavior of a neural network simplifies as it becomes infinitely wide. Left: a Bayesian neural network with two hidden layers, transforming a 3-dimensional input (bottom) into a two-dimensional output $(y_{1},y_{2})$ (top). Right: output probability density function $p(y_{1},y_{2})$ induced by the random weights of the network. Video: as the width of the network increases, the output distribution simplifies, ultimately converging to a neural network Gaussian process in the infinite width limit.

Artificial neural networks are a class of models used in machine learning, inspired by biological neural networks. They are the core component of modern deep learning algorithms. Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. Theoretical analysis of artificial neural networks sometimes considers the limiting case in which layer width becomes large or infinite. This limit enables simple analytic statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. The wide layer limit is also of practical interest, since finite width neural networks often perform strictly better as layer width is increased.^[1]^[2]^[3]^[4]^[5]^[6]
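The simplification at large width can be illustrated numerically. The following is a minimal sketch, not taken from the cited works: it assumes a one-hidden-layer tanh network with Gaussian weights, a fixed 3-dimensional input, and readout weights with standard deviation $1/{\sqrt {h}}$ for width $h$. The function names `sample_outputs` and `excess_kurtosis` are hypothetical. Excess kurtosis is zero for a Gaussian, so a value that shrinks toward zero as the width grows is consistent with the output distribution approaching a Gaussian.

```python
import math
import random


def sample_outputs(width, n_networks=4000, seed=0):
    """Outputs of many randomly initialized one-hidden-layer tanh networks
    evaluated at one fixed input (illustrative setup, chosen for this sketch)."""
    rng = random.Random(seed)
    x = (1.0, -0.5, 0.25)                  # fixed 3-dimensional input
    in_std = 1.0 / math.sqrt(len(x))       # first-layer weight std
    out_std = 1.0 / math.sqrt(width)       # readout weight std, ~ 1/sqrt(h)
    outputs = []
    for _network in range(n_networks):
        y = 0.0
        for _unit in range(width):
            pre_activation = sum(rng.gauss(0.0, in_std) * xi for xi in x)
            y += rng.gauss(0.0, out_std) * math.tanh(pre_activation)
        outputs.append(y)
    return outputs


def excess_kurtosis(samples):
    """Sample excess kurtosis: fourth central moment / variance^2 - 3."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    fourth = sum((s - mean) ** 4 for s in samples) / n
    return fourth / var ** 2 - 3.0


if __name__ == "__main__":
    for h in (1, 4, 32):
        print(h, excess_kurtosis(sample_outputs(h)))
```

Because the hidden units are independent and identically distributed at initialization in this setup, the excess kurtosis of the output is expected to fall roughly in proportion to $1/h$, up to sampling noise.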

Theoretical approaches based on a large width limit

The Neural Network Gaussian Process (NNGP) corresponds to the infinite width limit of Bayesian neural networks, and to the distribution over functions realized by non-Bayesian neural networks after random initialization.^[7]^[8]^[9]^[10]
The same underlying computations that are used to derive the NNGP kernel are also used in deep information propagation to characterize the propagation of information about gradients and inputs through a deep network.^[11] This characterization is used to predict how model trainability depends on architecture and initialization hyperparameters.
The Neural Tangent Kernel (NTK) describes the evolution of neural network predictions during gradient descent training. In the infinite width limit the NTK usually becomes constant, often allowing closed form expressions for the function computed by a wide neural network throughout gradient descent training.^[12] The training dynamics essentially become linearized.^[13]
Mean-field limit analysis, applied to neural networks with weight scaling of $\sim 1/h$ instead of $\sim 1/{\sqrt {h}}$ and large enough learning rates, predicts nonlinear training dynamics that are qualitatively distinct from the static linear behavior described by the fixed neural tangent kernel, suggesting alternative pathways for understanding infinite-width networks.^[14]^[15]
Catapult dynamics describe neural network training dynamics in the case that logits diverge to infinity as the layer width is taken to infinity, and describe qualitative properties of early training dynamics.^[16]
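As an illustration of the first approach listed above, the NNGP kernel has a well-known closed form in one simple case. The sketch below assumes a one-hidden-layer ReLU network without biases, first-layer weights drawn from $N(0,\sigma_w^2/d)$ for input dimension $d$, and readout weights drawn from $N(0,1/h)$; under those assumptions the kernel is the degree-1 arc-cosine kernel. The function names and the choice of inputs are hypothetical, and the Monte Carlo check only estimates the output covariance of randomly initialized networks.

```python
import math
import random


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def relu_nngp_kernel(x1, x2, sigma_w=1.0):
    """Closed-form E[f(x1) f(x2)] for the assumed one-hidden-layer ReLU network
    (degree-1 arc-cosine kernel applied to the pre-activation covariance)."""
    d = len(x1)
    k11 = sigma_w ** 2 * _dot(x1, x1) / d
    k22 = sigma_w ** 2 * _dot(x2, x2) / d
    k12 = sigma_w ** 2 * _dot(x1, x2) / d
    norm = math.sqrt(k11 * k22)
    cos_theta = max(-1.0, min(1.0, k12 / norm))   # guard against rounding
    theta = math.acos(cos_theta)
    return norm * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)


def empirical_output_covariance(x1, x2, width=32, n_networks=4000,
                                sigma_w=1.0, seed=0):
    """Monte Carlo estimate of E[f(x1) f(x2)] over random initializations."""
    rng = random.Random(seed)
    d = len(x1)
    w_std = sigma_w / math.sqrt(d)
    v_std = 1.0 / math.sqrt(width)
    total = 0.0
    for _network in range(n_networks):
        f1 = 0.0
        f2 = 0.0
        for _unit in range(width):
            w = [rng.gauss(0.0, w_std) for _ in range(d)]
            v = rng.gauss(0.0, v_std)
            f1 += v * max(0.0, _dot(w, x1))
            f2 += v * max(0.0, _dot(w, x2))
        total += f1 * f2
    return total / n_networks


if __name__ == "__main__":
    a, b = (1.0, 0.5, -0.5), (0.5, 1.0, 0.0)
    print("closed form:", relu_nngp_kernel(a, b))
    print("Monte Carlo:", empirical_output_covariance(a, b))
```

For a single hidden layer the output covariance matches the kernel at any width; what changes with width is how close the joint output distribution is to Gaussian. Deeper networks require applying the same expectation layer by layer, which this sketch does not attempt.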


References

[1] Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018-02-15). "Sensitivity and Generalization in Neural Networks: an Empirical Study". International Conference on Learning Representations. arXiv:1802.08760. Bibcode:2018arXiv180208760N.
[2] Canziani, Alfredo; Paszke, Adam; Culurciello, Eugenio (2016-11-04). "An Analysis of Deep Neural Network Models for Practical Applications". arXiv:1605.07678. Bibcode:2016arXiv160507678C.
[3] Novak, Roman; Xiao, Lechao; Lee, Jaehoon; Bahri, Yasaman; Yang, Greg; Abolafia, Dan; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018). "Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes". International Conference on Learning Representations. arXiv:1810.05148. Bibcode:2018arXiv181005148N.
[4] Neyshabur, Behnam; Li, Zhiyuan; Bhojanapalli, Srinadh; LeCun, Yann; Srebro, Nathan (2019). "Towards understanding the role of over-parametrization in generalization of neural networks". International Conference on Learning Representations. arXiv:1805.12076. Bibcode:2018arXiv180512076N.
[5] Lawrence, Steve; Giles, C. Lee; Tsoi, Ah Chung (1996). "What size neural network gives optimal generalization? convergence properties of backpropagation". CiteSeerX 10.1.1.125.6019.
[6] Bartlett, P.L. (1998). "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network". IEEE Transactions on Information Theory. 44 (2): 525–536. doi:10.1109/18.661502. ISSN 1557-9654.
[7] Neal, Radford M. (1996), "Priors for Infinite Networks", Bayesian Learning for Neural Networks, Lecture Notes in Statistics, vol. 118, Springer New York, pp. 29–53, doi:10.1007/978-1-4612-0745-0_2, ISBN 978-0-387-94724-2
[8] Lee, Jaehoon; Bahri, Yasaman; Novak, Roman; Schoenholz, Samuel S.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2017). "Deep Neural Networks as Gaussian Processes". International Conference on Learning Representations. arXiv:1711.00165. Bibcode:2017arXiv171100165L.
[9] G. de G. Matthews, Alexander; Rowland, Mark; Hron, Jiri; Turner, Richard E.; Ghahramani, Zoubin (2017). "Gaussian Process Behaviour in Wide Deep Neural Networks". International Conference on Learning Representations. arXiv:1804.11271. Bibcode:2018arXiv180411271M.
[10] Hron, Jiri; Bahri, Yasaman; Novak, Roman; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2020). "Exact posterior distributions of wide Bayesian neural networks". ICML 2020 Workshop on Uncertainty & Robustness in Deep Learning. arXiv:2006.10541.
[11] Schoenholz, Samuel S.; Gilmer, Justin; Ganguli, Surya; Sohl-Dickstein, Jascha (2016). "Deep information propagation". International Conference on Learning Representations. arXiv:1611.01232.
[12] Jacot, Arthur; Gabriel, Franck; Hongler, Clement (2018). "Neural tangent kernel: Convergence and generalization in neural networks". Advances in Neural Information Processing Systems. arXiv:1806.07572.
[13] Lee, Jaehoon; Xiao, Lechao; Schoenholz, Samuel S.; Bahri, Yasaman; Novak, Roman; Sohl-Dickstein, Jascha; Pennington, Jeffrey (2020). "Wide neural networks of any depth evolve as linear models under gradient descent". Journal of Statistical Mechanics: Theory and Experiment. 2020 (12): 124002. arXiv:1902.06720. Bibcode:2020JSMTE2020l4002L. doi:10.1088/1742-5468/abc62b. S2CID 62841516.
[14] Mei, Song; Montanari, Andrea; Nguyen, Phan-Minh (2018-04-18). A Mean Field View of the Landscape of Two-Layers Neural Networks. OCLC 1106295873.
[15] Nguyen, Phan-Minh; Pham, Huy Tuan (2020). "A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks". arXiv:2001.11443 [cs.LG].
[16] Lewkowycz, Aitor; Bahri, Yasaman; Dyer, Ethan; Sohl-Dickstein, Jascha; Gur-Ari, Guy (2020). "The large learning rate phase of deep learning: the catapult mechanism". arXiv:2003.02218 [stat.ML].

