A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a certain type of sequence of neural networks. Specifically, a wide variety of network architectures converges to a GP in the infinitely wide limit, in the sense of convergence in distribution. [1] [2] [3] [4] [5] [6] [7] [8] The concept constitutes an intensional definition, i.e., an NNGP is just a GP, but distinguished by how it is obtained.
Bayesian networks are a modeling tool for assigning probabilities to events, and thereby characterizing the uncertainty in a model's predictions. Deep learning and artificial neural networks are approaches used in machine learning to build computational models which learn from training examples. Bayesian neural networks merge these fields. They are a type of neural network whose parameters and predictions are both probabilistic. [9] [10] While standard neural networks often assign high confidence even to incorrect predictions, [11] Bayesian neural networks can more accurately evaluate how likely their predictions are to be correct.
Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. When we consider a sequence of Bayesian neural networks with increasingly wide layers (see figure), they converge in distribution to an NNGP. This large-width limit is of practical interest, since finite-width networks often perform better as their layers get wider. [12] [13] [4] [14] The limit also provides a closed-form way to evaluate such networks.
The NNGP also appears in several other contexts: it describes the distribution over predictions made by wide non-Bayesian artificial neural networks after random initialization of their parameters, but before training; it appears as a term in neural tangent kernel prediction equations; it is used in deep information propagation to characterize whether hyperparameters and architectures will be trainable. [15] It is related to other large-width limits of neural networks.
The first correspondence result was established in the 1995 PhD thesis of Radford M. Neal, [16] then supervised by Geoffrey Hinton at the University of Toronto. Neal cites David J. C. MacKay, who worked in Bayesian learning, as his inspiration.
Today the correspondence is proven for: single-hidden-layer Bayesian neural networks; [16] deep [2] [3] fully connected networks as the number of units per layer is taken to infinity; convolutional neural networks as the number of channels is taken to infinity; [4] [5] [6] transformer networks as the number of attention heads is taken to infinity; [17] and recurrent networks as the number of units is taken to infinity. [8] In fact, this NNGP correspondence holds for almost any architecture: generally, if an architecture can be expressed solely via matrix multiplication and coordinatewise nonlinearities (i.e., a tensor program), then it has an infinite-width GP. [8] This in particular includes all feedforward or recurrent neural networks composed of multilayer perceptrons, recurrent neural networks (e.g., LSTMs, GRUs), (nD or graph) convolution, pooling, skip connections, attention, batch normalization, and/or layer normalization.
Every setting of a neural network's parameters corresponds to a specific function computed by the neural network. A prior distribution over neural network parameters therefore corresponds to a prior distribution over functions computed by the network. As neural networks are made infinitely wide, this distribution over functions converges to a Gaussian process for many architectures.
The notation used in this section is the same as the notation used below to derive the correspondence between NNGPs and fully connected networks, and more details can be found there.
The figure to the right plots the one-dimensional outputs $z^L(x;\theta)$ of a neural network for two separate inputs $x$ and $x^*$ against each other. The black dots show the function computed by the neural network on these inputs for random draws of the parameters from $p(\theta)$. The red lines are iso-probability contours for the joint distribution over the network outputs $z^L(x;\theta)$ and $z^L(x^*;\theta)$ induced by $p(\theta)$. This is the distribution in function space corresponding to the distribution $p(\theta)$ in parameter space, and the black dots are samples from this distribution. For infinitely wide neural networks, since the distribution over functions computed by the neural network is a Gaussian process, the joint distribution over network outputs is a multivariate Gaussian for any finite set of network inputs.
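To make the construction behind the figure concrete, the following minimal sketch (not code from the source; the architecture, width, and variances are illustrative choices) draws the parameters of a one-hidden-layer tanh network from the prior $p(\theta)$ and records the scalar outputs for two fixed inputs. For large width, the scatter of output pairs is approximately a zero-mean bivariate Gaussian, as in the figure.

```python
# Sketch: sample wide random networks and record outputs at two inputs.
import numpy as np

rng = np.random.default_rng(0)
d_in, width, n_draws = 3, 1024, 2000
sigma_w, sigma_b = 1.5, 0.1

x = rng.normal(size=d_in)
x_star = rng.normal(size=d_in)

outputs = np.zeros((n_draws, 2))
for t in range(n_draws):
    # Prior: isotropic Gaussians, weight variance scaled by 1 / fan_in.
    W0 = rng.normal(scale=sigma_w / np.sqrt(d_in), size=(width, d_in))
    b0 = rng.normal(scale=sigma_b, size=width)
    W1 = rng.normal(scale=sigma_w / np.sqrt(width), size=width)
    b1 = rng.normal(scale=sigma_b)
    for k, inp in enumerate((x, x_star)):
        outputs[t, k] = W1 @ np.tanh(W0 @ inp + b0) + b1

# For large width the joint distribution of the two outputs is close to a
# zero-mean bivariate Gaussian; its empirical covariance estimates the NNGP kernel.
print(np.cov(outputs.T))
```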
This section expands on the correspondence between infinitely wide neural networks and Gaussian processes for the specific case of a fully connected architecture. It provides a proof sketch outlining why the correspondence holds, and introduces the specific functional form of the NNGP for fully connected networks. The proof sketch closely follows the approach by Novak and coauthors. [4]
Consider a fully connected artificial neural network with inputs $x$, parameters $\theta$ consisting of weights $W^l$ and biases $b^l$ for each layer $l$ in the network, pre-activations (pre-nonlinearity) $z^l$, activations (post-nonlinearity) $y^l$, pointwise nonlinearity $\phi(\cdot)$, and layer widths $n^l$. For simplicity, the width of the readout vector $z^L$ is taken to be 1. The parameters of this network have a prior distribution $p(\theta)$, which consists of an isotropic Gaussian for each weight and bias, with the variance of the weights scaled inversely with layer width. This network is illustrated in the figure to the right, and described by the following set of equations:

$$
\begin{aligned}
y^0(x) &= x \\
z^l_i(x) &= b^l_i + \sum_{j=1}^{n^l} W^l_{ij}\, y^l_j(x) \\
y^{l+1}_i(x) &= \phi\!\left(z^l_i(x)\right) \\
W^l_{ij} &\sim \mathcal N\!\left(0, \tfrac{\sigma_w^2}{n^l}\right), \qquad b^l_i \sim \mathcal N\!\left(0, \sigma_b^2\right).
\end{aligned}
$$
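A minimal sketch of these equations (assumed, not code from the source; the function names, widths, and variance values are illustrative) that samples parameters from the prior and evaluates the forward pass:

```python
import numpy as np

def sample_prior(widths, sigma_w, sigma_b, rng):
    """widths = [n^0, n^1, ..., n^L, 1]; weight variance is sigma_w^2 / n^l."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(scale=sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(scale=sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(x, params, phi=np.tanh):
    y = x                          # y^0(x) = x
    for W, b in params[:-1]:
        z = W @ y + b              # z^l = b^l + W^l y^l
        y = phi(z)                 # y^{l+1} = phi(z^l)
    W, b = params[-1]
    return W @ y + b               # readout z^L (width 1)

rng = np.random.default_rng(0)
params = sample_prior([3, 256, 256, 1], sigma_w=1.5, sigma_b=0.1, rng=rng)
print(forward(rng.normal(size=3), params))
```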
We first observe that the pre-activations $z^l$ are described by a Gaussian process conditioned on the preceding activations $y^l$. This result holds even at finite width. Each pre-activation $z^l_i$ is a weighted sum of Gaussian random variables, corresponding to the weights $W^l_{ij}$ and biases $b^l_i$, where the coefficients for each of those Gaussian variables are the preceding activations $y^l_j$. Because they are a weighted sum of zero-mean Gaussians, the $z^l_i$ are themselves zero-mean Gaussians (conditioned on the coefficients $y^l_j$). Since the $z^l$ are jointly Gaussian for any set of $y^l$, they are described by a Gaussian process conditioned on the preceding activations $y^l$. The covariance or kernel of this Gaussian process depends on the weight and bias variances $\sigma_w^2$ and $\sigma_b^2$, as well as the second moment matrix $K^l$ of the preceding activations $y^l$:

$$
z^l_i \mid y^l \sim \mathcal{GP}\!\left(0,\ \sigma_w^2 K^l + \sigma_b^2\right), \qquad
K^l(x, x') = \frac{1}{n^l}\sum_{i=1}^{n^l} y^l_i(x)\, y^l_i(x').
$$
The effect of the weight scale $\sigma_w$ is to rescale the contribution of $K^l$ to the covariance matrix, while the bias is shared across all inputs, and so $\sigma_b$ makes the $z^l_i$ for different datapoints more similar and makes the covariance matrix more like a constant matrix.
The pre-activations $z^l$ only depend on $y^l$ through its second moment matrix $K^l$. Because of this, we can say that $z^l$ is a Gaussian process conditioned on $K^l$, rather than conditioned on $y^l$:

$$
z^l_i \mid K^l \sim \mathcal{GP}\!\left(0,\ \sigma_w^2 K^l + \sigma_b^2\right).
$$
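The following short numerical check (a sketch under assumed widths and variances, not from the source) verifies that, with the activations $y^l$ held fixed, the pre-activations evaluated at two inputs have covariance $\sigma_w^2 K^l + \sigma_b^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, sigma_w, sigma_b = 200, 1.2, 0.3

# Fixed preceding activations y^l for two inputs x and x'.
y = rng.normal(size=(2, n_prev))
K = y @ y.T / n_prev                       # second moment matrix K^l
target_cov = sigma_w**2 * K + sigma_b**2   # predicted covariance of z^l

# Draw many weight/bias samples and evaluate one pre-activation unit at both inputs.
n_draws = 100_000
W = rng.normal(scale=sigma_w / np.sqrt(n_prev), size=(n_draws, n_prev))
b = rng.normal(scale=sigma_b, size=(n_draws, 1))
z = W @ y.T + b

print(np.cov(z.T))      # empirical covariance, approximately equal to...
print(target_cov)       # ...the predicted covariance sigma_w^2 K^l + sigma_b^2
```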
As previously defined, $K^l$ is the second moment matrix of $y^l$. Since $y^l$ is the activation vector after applying the nonlinearity $\phi$, it can be replaced by $\phi\!\left(z^{l-1}\right)$, resulting in a modified equation expressing $K^l$ for $l > 0$ in terms of $z^{l-1}$:

$$
K^l(x, x') = \frac{1}{n^l}\sum_{i=1}^{n^l} \phi\!\left(z^{l-1}_i(x)\right)\phi\!\left(z^{l-1}_i(x')\right).
$$
We have already determined that $z^{l-1} \mid K^{l-1}$ is a Gaussian process. This means that the sum defining $K^l$ is an average over $n^l$ samples from a Gaussian process which is a function of $K^{l-1}$:

$$
K^l(x, x') = \frac{1}{n^l}\sum_{i=1}^{n^l} \phi\!\left(z^{l-1}_i(x)\right)\phi\!\left(z^{l-1}_i(x')\right), \qquad
z^{l-1}_i \mid K^{l-1} \sim \mathcal{GP}\!\left(0,\ \sigma_w^2 K^{l-1} + \sigma_b^2\right).
$$
As the layer width $n^l$ goes to infinity, this average over $n^l$ samples from the Gaussian process can be replaced with an integral over the Gaussian process:

$$
K^l(x, x') = \lim_{n^l \to \infty} \frac{1}{n^l}\sum_{i=1}^{n^l} \phi\!\left(z^{l-1}_i(x)\right)\phi\!\left(z^{l-1}_i(x')\right)
= \int \phi(z)\,\phi(z')\;\mathcal N\!\left(z, z';\ 0,\ \sigma_w^2 K^{l-1} + \sigma_b^2\right) dz\, dz',
$$

where the 2d Gaussian is over the pair $\left(z^{l-1}(x),\, z^{l-1}(x')\right)$, with covariance given by restricting $\sigma_w^2 K^{l-1} + \sigma_b^2$ to the inputs $x$ and $x'$.
So, in the infinite width limit the second moment matrix $K^l$ for each pair of inputs $x$ and $x'$ can be expressed as an integral over a 2d Gaussian, of the product of $\phi(z)$ and $\phi(z')$. There are a number of situations where this has been solved analytically, such as when $\phi(\cdot)$ is a ReLU, [18] ELU, GELU, [19] or error function [1] nonlinearity. Even when it cannot be solved analytically, since it is a 2d integral it can generally be computed efficiently by numerical methods. [2] This integral is deterministic, so $K^l$ is deterministic.
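As an illustration (a sketch with arbitrary numbers, not from the source), the 2d integral for a ReLU nonlinearity can be estimated by Monte Carlo and compared against its known closed-form ("arc-cosine") expression:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda t: np.maximum(t, 0.0)

# Covariance of (z^{l-1}(x), z^{l-1}(x')), i.e. sigma_w^2 K^{l-1} + sigma_b^2
# restricted to the two inputs (illustrative values).
cov = np.array([[1.0, 0.6],
                [0.6, 0.8]])

# Monte Carlo estimate of the 2d Gaussian integral of phi(z) phi(z').
z = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=1_000_000)
mc = np.mean(relu(z[:, 0]) * relu(z[:, 1]))

# Closed form for ReLU: (s1 s2 / 2pi) * (sin(theta) + (pi - theta) cos(theta)).
s1, s2 = np.sqrt(cov[0, 0]), np.sqrt(cov[1, 1])
theta = np.arccos(cov[0, 1] / (s1 * s2))
exact = s1 * s2 / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

print(mc, exact)   # the two values agree up to Monte Carlo error
```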
For shorthand, we define a functional $F$, which corresponds to computing this 2d integral for all pairs of inputs, and which maps $K^{l-1}$ into $K^l$:

$$
K^l = F\!\left(K^{l-1}\right).
$$
By recursively applying the observation that $K^l$ is deterministic as $n^l \to \infty$, $K^L$ can be written as a deterministic function of $K^0$:

$$
K^L = \underbrace{F \circ F \circ \cdots \circ F}_{L\ \text{times}}\!\left(K^0\right) = F^L\!\left(K^0\right),
$$
where $F^L$ indicates applying the functional $F$ sequentially $L$ times. By combining this expression with the further observations that the input-layer second moment matrix $K^0(x, x') = \frac{x \cdot x'}{n^0}$ is a deterministic function of the input $x$, and that $z^L \mid K^L$ is a Gaussian process, the output of the neural network can be expressed as a Gaussian process in terms of its input:

$$
z^L_i \mid x \sim \mathcal{GP}\!\left(0,\ \sigma_w^2\, F^L\!\left(K^0\right) + \sigma_b^2\right).
$$
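Putting the pieces together, here is a sketch (not code from the source; the hyperparameters are illustrative and the ReLU closed form from the previous example stands in for the general integral) of the deterministic recursion $K^L = F^L(K^0)$ and the resulting NNGP covariance of the network outputs:

```python
import numpy as np

def relu_F(K, sigma_w, sigma_b):
    """Apply the functional F once: map K^{l-1} -> K^l for a ReLU nonlinearity."""
    cov = sigma_w**2 * K + sigma_b**2          # covariance of z^{l-1} given K^{l-1}
    d = np.sqrt(np.diag(cov))
    corr = np.clip(cov / np.outer(d, d), -1.0, 1.0)
    theta = np.arccos(corr)
    return np.outer(d, d) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def nngp_kernel(X, depth, sigma_w=1.5, sigma_b=0.1):
    """NNGP covariance of the scalar outputs z^L for inputs X (one row per input)."""
    K = X @ X.T / X.shape[1]                   # K^0(x, x') = x . x' / n^0
    for _ in range(depth):
        K = relu_F(K, sigma_w, sigma_b)        # apply F once per hidden layer
    return sigma_w**2 * K + sigma_b**2         # covariance of the readout z^L

X = np.random.default_rng(3).normal(size=(4, 10))
print(nngp_kernel(X, depth=3))
```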
Neural Tangents is a free and open-source Python library for computing and performing inference with the NNGP and neural tangent kernel corresponding to various common ANN architectures. [20]
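A brief usage sketch of Neural Tangents, following its documented stax API (exact signatures may differ between library versions; the architecture and inputs below are illustrative):

```python
import jax.numpy as jnp
from neural_tangents import stax

# A 3-layer fully connected ReLU architecture; the finite width 512 only affects
# finite-width functions, not the infinite-width kernels returned by kernel_fn.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x1 = jnp.ones((3, 8))   # 3 inputs of dimension 8 (illustrative)
x2 = jnp.ones((5, 8))   # 5 inputs of dimension 8

nngp_kernel = kernel_fn(x1, x2, 'nngp')   # infinite-width NNGP covariance
ntk_kernel = kernel_fn(x1, x2, 'ntk')     # neural tangent kernel
```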