In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely data normalization and activation normalization. Data normalization (or feature scaling) includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is min-max normalization, where each feature is transformed to have the same range (typically $[0, 1]$ or $[-1, 1]$). This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers.
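As an illustration of min-max normalization, the following is a minimal NumPy sketch. The function name `min_max_normalize`, its `low`/`high` parameters, and the guard for constant features are choices made for this example, not part of any particular library.

```python
import numpy as np

def min_max_normalize(X, low=0.0, high=1.0):
    """Rescale each column (feature) of X to the range [low, high]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant features, whose range would be zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return low + (X - x_min) / span * (high - low)

# Two features on very different scales (e.g. kilometers vs. nanometers).
X = np.array([[1.0, 2.0e9],
              [2.0, 4.0e9],
              [3.0, 8.0e9]])
X_scaled = min_max_normalize(X)  # every column now spans [0, 1]
```

After rescaling, both columns occupy the same range, so neither dominates distance computations or gradient magnitudes purely because of its unit.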
Activation normalization, on the other hand, is specific to deep learning, and includes methods that rescale the activation of hidden neurons inside neural networks.
Normalization is often used to:
Normalization techniques are often theoretically justified as reducing covariate shift, smoothing optimization landscapes, and increasing regularization, though they are mainly justified by empirical success. [1]
Batch normalization (BatchNorm) [2] operates on the activations of a layer for each mini-batch.
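Consider a simple feedforward network, defined by chaining together modules:

$$x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots$$

where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. $x^{(0)}$ is the input vector, $x^{(1)}$ is the output vector from the first module, etc.

BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after $x^{(l)}$, then the network would operate accordingly:

$$\cdots \mapsto x^{(l)} \mapsto \mathrm{BN}(x^{(l)}) \mapsto x^{(l+1)} \mapsto \cdots$$

The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time.

Concretely, suppose we have a batch of inputs $x^{(0)}_{(1)}, x^{(0)}_{(2)}, \dots, x^{(0)}_{(B)}$, fed all at once into the network. We would obtain in the middle of the network some vectors:

$$x^{(l)}_{(1)}, x^{(l)}_{(2)}, \dots, x^{(l)}_{(B)}$$

The BatchNorm module computes the coordinate-wise mean and variance of these vectors:

$$\mu^{(l)}_i = \frac{1}{B} \sum_{b=1}^{B} x^{(l)}_{(b),i}, \qquad \left(\sigma^{(l)}_i\right)^2 = \frac{1}{B} \sum_{b=1}^{B} \left(x^{(l)}_{(b),i} - \mu^{(l)}_i\right)^2$$

where $i$ indexes the coordinates of the vectors, and $b$ indexes the elements of the batch. In other words, we are considering the $i$-th coordinate of each vector in the batch, and computing the mean and variance of these numbers.

It then normalizes each coordinate to have zero mean and unit variance:

$$\hat{x}^{(l)}_{(b),i} = \frac{x^{(l)}_{(b),i} - \mu^{(l)}_i}{\sqrt{\left(\sigma^{(l)}_i\right)^2 + \epsilon}}$$

The $\epsilon$ is a small positive constant such as $10^{-9}$ added to the variance for numerical stability, to avoid division by zero.

Finally, it applies a linear transformation:

$$y^{(l)}_{(b),i} = \gamma_i \hat{x}^{(l)}_{(b),i} + \beta_i$$

Here, $\gamma$ and $\beta$ are parameters inside the BatchNorm module. They are learnable parameters, typically trained by gradient descent.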
The following is a Python implementation of BatchNorm:

```python
import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    # x has shape (B, N): a batch of B activation vectors with N features each
    # Mean and variance of each feature
    mu = np.mean(x, axis=0)  # shape (N,)
    var = np.var(x, axis=0)  # shape (N,)

    # Normalize the activations
    x_hat = (x - mu) / np.sqrt(var + epsilon)  # shape (B, N)

    # Apply the linear transform
    y = gamma * x_hat + beta  # shape (B, N)

    return y
```
$\gamma$ and $\beta$ allow the network to learn to undo the normalization, if this is beneficial. [3] BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top. [4] [3]
It is claimed in the original publication that BatchNorm works by reducing internal covariate shift, though the claim has both supporters [5] [6] and detractors. [7] [8]
The original paper [2] recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, $\phi(\mathrm{BN}(Wx + b))$, not $\mathrm{BN}(\phi(Wx + b))$, where $\phi$ is the activation function, $W$ the weight matrix, and $b$ the bias. Also, the bias $b$ does not matter, since it would be canceled by the subsequent mean subtraction, so it is of the form $\mathrm{BN}(Wx)$. That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to zero. [2]
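The cancellation of the bias can be checked numerically. The sketch below uses a per-feature BatchNorm and compares the output with and without a bias in the preceding linear transform. The shapes, random seed, and variable names are arbitrary choices for this illustration.

```python
import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    mu = np.mean(x, axis=0)
    var = np.var(x, axis=0)
    return gamma * (x - mu) / np.sqrt(var + epsilon) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))   # batch of B=8 inputs with 4 features
W = rng.normal(size=(4, 3))   # linear transform to 3 features
b = rng.normal(size=3)        # bias of the linear transform
gamma, beta = np.ones(3), np.zeros(3)

with_bias = batchnorm(x @ W + b, gamma, beta)
without_bias = batchnorm(x @ W, gamma, beta)

# Adding b shifts every row by the same vector, which the mean subtraction removes.
bias_is_cancelled = np.allclose(with_bias, without_bias)
```

The shift role of the dropped bias is taken over by the BatchNorm parameter `beta`.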
For convolutional neural networks (CNNs), BatchNorm must preserve the translation-invariance of these models, meaning that it must treat all outputs of the same kernel as if they are different data points within a batch. [2] This is sometimes called Spatial BatchNorm, or BatchNorm2D, or per-channel BatchNorm. [9] [10]
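Concretely, suppose we have a 2-dimensional convolutional layer defined by:

$$x^{(l)}_{h,w,c} = \sum_{h',w',c'} K^{(l)}_{h'-h,\,w'-w,\,c,\,c'}\, x^{(l-1)}_{h',w',c'} + b^{(l)}_c$$

where $x^{(l)}_{h,w,c}$ is the activation of the neuron at position $(h, w)$ in the $c$-th channel of the $l$-th layer, $K^{(l)}_{\Delta h, \Delta w, c, c'}$ is a kernel tensor (each output channel $c$ corresponds to one kernel, indexed by $\Delta h, \Delta w, c'$), and $b^{(l)}_c$ is the bias term for the $c$-th channel of the $l$-th layer.

In order to preserve the translational invariance, BatchNorm treats all outputs from the same kernel in the same batch as more data in a batch. That is, it is applied once per kernel (equivalently, once per channel $c$), not per activation $x^{(l)}_{h,w,c}$:

$$\mu^{(l)}_c = \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} x^{(l)}_{(b),h,w,c}, \qquad \left(\sigma^{(l)}_c\right)^2 = \frac{1}{BHW} \sum_{b=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x^{(l)}_{(b),h,w,c} - \mu^{(l)}_c\right)^2$$

where $B$ is the batch size, $H$ is the height of the feature map, and $W$ is the width of the feature map.

That is, even though there are only $B$ data points in a batch, all $BHW$ outputs from the kernel in this batch are treated equally. [2]

Subsequently, normalization and the linear transform are also done per kernel:

$$\hat{x}^{(l)}_{(b),h,w,c} = \frac{x^{(l)}_{(b),h,w,c} - \mu^{(l)}_c}{\sqrt{\left(\sigma^{(l)}_c\right)^2 + \epsilon}}, \qquad y^{(l)}_{(b),h,w,c} = \gamma_c \hat{x}^{(l)}_{(b),h,w,c} + \beta_c$$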
Similar considerations apply for BatchNorm for n-dimensional convolutions.
The following is a Python implementation of BatchNorm for 2D convolutions:

```python
import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-9):
    # x has shape (B, H, W, C); statistics are shared over batch, height and width
    # Calculate the mean and variance for each channel.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)
    var = np.var(x, axis=(0, 1, 2), keepdims=True)

    # Normalize the input tensor.
    x_hat = (x - mean) / np.sqrt(var + epsilon)

    # Scale and shift the normalized tensor.
    y = gamma * x_hat + beta

    return y
```
BatchNorm has been very popular, and many improvements to it have been attempted. [11]
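A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, while a running estimate of them is usually maintained as an exponential moving average; during inference, the mean and variance are frozen at the values accumulated during training. This train-test disparity degrades performance. The disparity can be decreased by simulating the moving average during inference: [11] : Eq. 3

$$\mu = \alpha E[x] + (1 - \alpha)\, \mu_{x,\text{train}}, \qquad \sigma^2 = \left(\alpha E[x^2] + (1 - \alpha)\, \mu_{x^2,\text{train}}\right) - \mu^2$$

where $E[x]$ and $E[x^2]$ are moments computed from the input at inference time, $\mu_{x,\text{train}}$ and $\mu_{x^2,\text{train}}$ are the moving averages accumulated during training, and $\alpha$ is a hyperparameter to be optimized on a validation set.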
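The difference between training and inference behaviour can be sketched as follows. This is a minimal illustration, not the implementation of any specific framework: the class name `BatchNorm1D`, the `momentum` value, and the `training` flag are assumptions made for the example.

```python
import numpy as np

class BatchNorm1D:
    """Per-feature BatchNorm that tracks running statistics (illustrative sketch)."""

    def __init__(self, num_features, momentum=0.1, epsilon=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.epsilon = epsilon

    def __call__(self, x, training):
        if training:
            # Use the statistics of the current batch ...
            mu, var = x.mean(axis=0), x.var(axis=0)
            # ... and fold them into an exponential moving average.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the frozen running statistics.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.epsilon)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=2)
for _ in range(200):
    batch = rng.normal(loc=3.0, scale=2.0, size=(64, 2))
    bn(batch, training=True)

# At inference a single example can be processed, since no batch statistics are needed.
y_single = bn(np.array([[3.0, 5.0]]), training=False)
```

Because the inference path never looks at batch statistics, its output for a given input is deterministic, but it can differ from what the network saw for the same input during training.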
Other works attempt to eliminate BatchNorm, such as the Normalizer-Free ResNet. [12]
Layer normalization (LayerNorm) [13] is a popular alternative to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of transformer models.
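For a given data input and layer, LayerNorm computes the mean $\mu$ and variance $\sigma^2$ over all the neurons in the layer. Similar to BatchNorm, learnable parameters $\gamma$ (scale) and $\beta$ (shift) are applied. It is defined by:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma_i \hat{x}_i + \beta_i$$

where:

$$\mu = \frac{1}{D} \sum_{i=1}^{D} x_i, \qquad \sigma^2 = \frac{1}{D} \sum_{i=1}^{D} (x_i - \mu)^2$$

and the index $i$ ranges over the $D$ neurons in that layer.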
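A minimal NumPy sketch of LayerNorm, written in the same style as the BatchNorm code, is given below. The function name and the default `epsilon` are choices for this illustration. The only structural difference from BatchNorm is the axis over which statistics are taken.

```python
import numpy as np

def layernorm(x, gamma, beta, epsilon=1e-5):
    """x has shape (..., D); each sample is normalized over its own D features."""
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 16))
gamma, beta = np.ones(16), np.zeros(16)

y = layernorm(x, gamma, beta)
# Normalizing one sample alone gives the same result as normalizing it in a batch.
y_first_alone = layernorm(x[:1], gamma, beta)
```

Since no statistic is shared between samples, the output for a sample is independent of the batch it arrives in, which is why batch size does not affect LayerNorm.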
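For example, in a CNN, a LayerNorm applies to all activations in a layer. In the previous notation, with $H$, $W$ and $C$ the height, width and number of channels of the feature map, we have:

$$\mu^{(l)} = \frac{1}{HWC} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} x^{(l)}_{h,w,c}, \qquad \left(\sigma^{(l)}\right)^2 = \frac{1}{HWC} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \left(x^{(l)}_{h,w,c} - \mu^{(l)}\right)^2$$

$$\hat{x}^{(l)}_{h,w,c} = \frac{x^{(l)}_{h,w,c} - \mu^{(l)}}{\sqrt{\left(\sigma^{(l)}\right)^2 + \epsilon}}, \qquad y^{(l)}_{h,w,c} = \gamma^{(l)} \hat{x}^{(l)}_{h,w,c} + \beta^{(l)}$$

Notice that the batch index $b$ is removed, while the channel index $c$ is added.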
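In recurrent neural networks [13] and transformers, [14] LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep $t$ is $x^{(t)} \in \mathbb{R}^{D}$, where $D$ is the dimension of the hidden vector, then LayerNorm will be applied with:

$$\hat{x}^{(t)}_i = \frac{x^{(t)}_i - \mu^{(t)}}{\sqrt{\left(\sigma^{(t)}\right)^2 + \epsilon}}, \qquad y^{(t)}_i = \gamma_i \hat{x}^{(t)}_i + \beta_i$$

where:

$$\mu^{(t)} = \frac{1}{D} \sum_{i=1}^{D} x^{(t)}_i, \qquad \left(\sigma^{(t)}\right)^2 = \frac{1}{D} \sum_{i=1}^{D} \left(x^{(t)}_i - \mu^{(t)}\right)^2$$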
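Root mean square layer normalization (RMSNorm) [15] changes LayerNorm by:

$$\hat{x}_i = \frac{x_i}{\sqrt{\frac{1}{D} \sum_{j=1}^{D} x_j^2}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Essentially, it is LayerNorm where we enforce $\mu, \epsilon = 0$.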
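A minimal sketch of RMSNorm in NumPy follows. The `epsilon` argument defaults to zero to match the definition in the text, although practical implementations commonly use a small positive value for numerical safety. The function name and signature are assumptions for this example.

```python
import numpy as np

def rmsnorm(x, gamma, beta=0.0, epsilon=0.0):
    """Divide each vector by its root mean square; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + epsilon)
    return gamma * (x / rms) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(3, 8))
y = rmsnorm(x, gamma=np.ones(8))

# Each output row has unit root mean square, but generally a non-zero mean.
row_rms = np.sqrt(np.mean(y ** 2, axis=-1))
```

Skipping the mean subtraction removes one reduction over the feature axis compared with LayerNorm.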
Adaptive layer norm (adaLN) computes the $\gamma, \beta$ in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs, [16] and has been used effectively in diffusion transformers (DiTs). [17] For example, in a DiT, the conditioning information (such as a text encoding vector) is processed by a multilayer perceptron into $\gamma, \beta$, which are then applied in the LayerNorm module of a transformer.
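The idea can be sketched with a small two-layer perceptron that maps a conditioning vector to the scale and shift. This is a simplified illustration and not the exact DiT block: the ReLU hidden layer, the weight names `W1, b1, W2, b2`, and all shapes are assumptions made for the example.

```python
import numpy as np

def layernorm_no_affine(x, epsilon=1e-5):
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + epsilon)

def adaln(x, cond, W1, b1, W2, b2):
    """x: (T, D) token activations; cond: (E,) conditioning vector."""
    hidden = np.maximum(cond @ W1 + b1, 0.0)      # hypothetical ReLU MLP layer
    gamma_beta = hidden @ W2 + b2                 # shape (2 * D,)
    gamma, beta = np.split(gamma_beta, 2, axis=-1)
    # Scale and shift come from the conditioning, not from learned constants.
    return gamma * layernorm_no_affine(x) + beta

rng = np.random.default_rng(0)
T, D, E, H = 5, 8, 4, 16
x = rng.normal(size=(T, D))
cond = rng.normal(size=E)
W1, b1 = rng.normal(size=(E, H)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(H, 2 * D)), np.concatenate([np.ones(D), np.zeros(D)])
y = adaln(x, cond, W1, b1, W2, b2)
```

Different conditioning vectors yield different $\gamma, \beta$ for the same activations, which is how the conditioning steers the layer.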
Weight normalization (WeightNorm) [18] is a technique inspired by BatchNorm that normalizes weight matrices in a neural network, rather than its activations.
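One example is spectral normalization, which divides weight matrices by their spectral norm. Spectral normalization is used in generative adversarial networks (GANs) such as the Wasserstein GAN. [19] The spectral radius can be efficiently computed by the following algorithm:

INPUT matrix $W$ and initial guess $x$

Iterate $x \mapsto \frac{1}{\|Wx\|_2} Wx$ to convergence $x^*$. This is the eigenvector of $W$ with eigenvalue $\|W\|_s$.

RETURN $x^*, \|Wx^*\|_2$

The spectral radius coincides with the spectral norm $\|W\|_s$ when $W$ is symmetric; for a general matrix, the same iteration is applied to $W^\top W$, whose largest eigenvalue is $\|W\|_s^2$.

Let $W_i$ be the weight matrix of the $i$-th layer of the discriminator $D$. By reassigning $W_i \leftarrow \frac{W_i}{\|W_i\|_s}$ after each update of the discriminator, we can upper-bound $\|W_i\|_s \leq 1$, and thus upper-bound the Lipschitz norm $\|D\|_L$.

The algorithm can be further accelerated by memoization: at step $t$, store $x_i^*(t)$. Then, at step $t+1$, use $x_i^*(t)$ as the initial guess for the algorithm. Since $W_i(t+1)$ is very close to $W_i(t)$, so is $x_i^*(t)$ to $x_i^*(t+1)$, thus allowing rapid convergence.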
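The procedure can be sketched in NumPy as below. Note the swapped-in detail: so that it works for non-symmetric and non-square matrices, this sketch runs the power iteration on $W^\top W$ (alternating multiplication by $W$ and $W^\top$) rather than on $W$ directly. The function name, the iteration counts, and the example matrix are assumptions for the illustration.

```python
import numpy as np

def spectral_norm(W, x=None, n_iters=100):
    """Estimate the largest singular value of W by power iteration on W^T W.

    Returns the estimate and the final vector, which can be reused as a warm start.
    """
    if x is None:
        x = np.random.default_rng(0).normal(size=W.shape[1])
    x = x / np.linalg.norm(x)
    for _ in range(n_iters):
        x = W.T @ (W @ x)
        x = x / np.linalg.norm(x)
    return np.linalg.norm(W @ x), x

W = np.array([[2.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])            # W^T W has eigenvalues 6 and 1
sigma, x_star = spectral_norm(W)
W_normalized = W / sigma              # spectral norm is now (approximately) 1

# Memoization: after a small weight update, one iteration from x_star suffices.
W_next = W + 0.01 * np.ones_like(W)
sigma_next, x_next = spectral_norm(W_next, x=x_star, n_iters=1)
```

The warm start works because a small change in the matrix only slightly rotates its leading singular vector.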
There are some activation normalization techniques that are only used for CNNs.
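Local response normalization [20] was used in AlexNet. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by:

$$b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(N-1,\, i + n/2)} \left(a^{j}_{x,y}\right)^2\right)^{\beta}}$$

where $a^{i}_{x,y}$ is the activation of the neuron at location $(x, y)$ and channel $i$, and $N$ is the number of channels. I.e., each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels.

$k, n, \alpha, \beta$ are hyperparameters picked by using a validation set.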
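A direct, unoptimized NumPy sketch of local response normalization is shown below. The default hyperparameters are the values reported for AlexNet ($k = 2$, $n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$); the channels-last layout and the function name are assumptions for this example.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a has shape (H, W, C): post-activation feature map, channels last."""
    a = np.asarray(a, dtype=float)
    num_channels = a.shape[-1]
    squared = a ** 2
    b = np.empty_like(a)
    for i in range(num_channels):
        # Window of adjacent channels, clipped at the first and last channel.
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        denom = (k + alpha * squared[..., lo:hi + 1].sum(axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b

a = np.ones((2, 2, 7))
b = local_response_norm(a)
```

Channels near the edge sum over fewer neighbours, so with constant input they are suppressed slightly less than interior channels.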
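It was a variant of the earlier local contrast normalization. [21]

$$b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(N-1,\, i + n/2)} \left(a^{j}_{x,y} - \bar{a}^{j}_{x,y}\right)^2\right)^{\beta}}$$

where $\bar{a}^{j}_{x,y}$ is the average activation in a small window centered on location $(x, y)$ and channel $j$. The hyperparameters $k, n, \alpha, \beta$, and the size of the small window, are picked by using a validation set.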
Similar methods were called divisive normalization, as they divide activations by a number depending on the activations. They were originally inspired by biology, where they were used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception. [22]
Both kinds of local normalization were obviated by batch normalization, which is a more global form of normalization. [23]
Response normalization reappeared in ConvNeXt V2 as global response normalization. [24]
Group normalization (GroupNorm) [25] is a technique also solely used for CNNs. It can be understood as the LayerNorm for CNN applied once per channel group.
Suppose at a layer $l$, there are channels $1, 2, \dots, C$, then they are partitioned into groups $g_1, g_2, \dots, g_G$. Then, LayerNorm is applied to each group.
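A minimal NumPy sketch of GroupNorm is given below. It assumes a channels-last layout `(B, H, W, C)`, groups made of contiguous channels, and per-channel scale and shift; these, along with the function name, are choices made for the illustration.

```python
import numpy as np

def groupnorm(x, num_groups, gamma, beta, epsilon=1e-5):
    """x has shape (B, H, W, C); C must be divisible by num_groups."""
    B, H, W, C = x.shape
    assert C % num_groups == 0, "channels must split evenly into groups"
    # Split the channel axis into (group, channel-within-group).
    xg = x.reshape(B, H, W, num_groups, C // num_groups)
    # Statistics per sample and per group, over height, width and group channels.
    mu = xg.mean(axis=(1, 2, 4), keepdims=True)
    var = xg.var(axis=(1, 2, 4), keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + epsilon)).reshape(B, H, W, C)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=(2, 4, 4, 6))
y = groupnorm(x, num_groups=3, gamma=np.ones(6), beta=np.zeros(6))
```

With `num_groups=1` this reduces to LayerNorm over all activations of a sample, and with `num_groups=C` each channel is normalized on its own.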
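Instance normalization (InstanceNorm), or contrast normalization, is a technique first developed for neural style transfer, and is also only used for CNNs. [26] It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel:

$$\mu^{(l)}_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x^{(l)}_{h,w,c}, \qquad \left(\sigma^{(l)}_c\right)^2 = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(x^{(l)}_{h,w,c} - \mu^{(l)}_c\right)^2$$

$$\hat{x}^{(l)}_{h,w,c} = \frac{x^{(l)}_{h,w,c} - \mu^{(l)}_c}{\sqrt{\left(\sigma^{(l)}_c\right)^2 + \epsilon}}, \qquad y^{(l)}_{h,w,c} = \gamma^{(l)}_c \hat{x}^{(l)}_{h,w,c} + \beta^{(l)}_c$$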
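In code, InstanceNorm differs from the per-channel BatchNorm for CNNs only in that the batch axis is not reduced. The following NumPy sketch assumes a channels-last layout `(B, H, W, C)`; the function name is a choice for this example.

```python
import numpy as np

def instancenorm(x, gamma, beta, epsilon=1e-5):
    """x has shape (B, H, W, C); each (sample, channel) pair is normalized alone."""
    mu = x.mean(axis=(1, 2), keepdims=True)   # shape (B, 1, 1, C)
    var = x.var(axis=(1, 2), keepdims=True)   # shape (B, 1, 1, C)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=-2.0, scale=3.0, size=(2, 5, 5, 3))
y = instancenorm(x, gamma=np.ones(3), beta=np.zeros(3))
```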
Adaptive instance normalization (AdaIN) is a variant of instance normalization, designed specifically for neural style transfer with CNNs, rather than just CNNs in general. [27]
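In the AdaIN method of style transfer, we take a CNN and two input images, one for content and one for style. Each image is processed through the same CNN, and at a certain layer $l$, AdaIN is applied.

Let $x^{(l),\text{content}}$ be the activation in the content image, and $x^{(l),\text{style}}$ be the activation in the style image. Then, AdaIN first computes the per-channel mean and standard deviation of the activations of the style image $x^{(l),\text{style}}$, then uses those as the $\beta, \gamma$ for InstanceNorm on $x^{(l),\text{content}}$. Note that $x^{(l),\text{style}}$ itself remains unchanged. Explicitly, we have:

$$y^{(l),\text{content}}_{h,w,c} = \sigma^{(l),\text{style}}_c \left(\frac{x^{(l),\text{content}}_{h,w,c} - \mu^{(l),\text{content}}_c}{\sqrt{\left(\sigma^{(l),\text{content}}_c\right)^2 + \epsilon}}\right) + \mu^{(l),\text{style}}_c$$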
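The AdaIN operation can be sketched in NumPy as follows, using the standard formulation in which the style features supply the target mean and standard deviation and the content features are the ones being normalized. The function name and the `(H, W, C)` layout are assumptions for this example.

```python
import numpy as np

def adain(content, style, epsilon=1e-5):
    """content, style: feature maps of shape (H, W, C); spatial sizes may differ."""
    mu_c = content.mean(axis=(0, 1), keepdims=True)
    var_c = content.var(axis=(0, 1), keepdims=True)
    mu_s = style.mean(axis=(0, 1), keepdims=True)
    sigma_s = style.std(axis=(0, 1), keepdims=True)
    # Instance-normalize the content, then re-scale and shift with style statistics.
    return sigma_s * (content - mu_c) / np.sqrt(var_c + epsilon) + mu_s

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(8, 8, 3))
style = rng.normal(loc=4.0, scale=0.5, size=(5, 7, 3))
out = adain(content, style)
```

The output keeps the spatial layout of the content features while its per-channel mean and standard deviation match those of the style features.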
Some normalization methods were designed for use in transformers.
The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018, [28] was found to be easier to train, requiring no warm-up, leading to faster convergence. [29]
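The two conventions differ only in where the LayerNorm sits relative to the residual connection: post-LN normalizes the sum of the input and the sublayer output, while pre-LN normalizes the input to the sublayer and leaves the residual path untouched. The sketch below is schematic; `sublayer` stands for any attention or feedforward module, and the learned scale and shift are omitted for brevity.

```python
import numpy as np

def layernorm(x, epsilon=1e-5):
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + epsilon)

def post_ln_block(x, sublayer):
    # Post-LN: normalize after adding the residual.
    return layernorm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path is an identity.
    return x + sublayer(layernorm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = 0.1 * rng.normal(size=(8, 8))
sublayer = lambda h: np.tanh(h @ W)   # hypothetical stand-in for a real sublayer

y_post = post_ln_block(x, sublayer)
y_pre = pre_ln_block(x, sublayer)
```

In the pre-LN form the input reaches the output through an unnormalized identity path, which is commonly cited as the reason gradients behave better at initialization.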
FixNorm [30] and ScaleNorm [31] both normalize activation vectors in a transformer. The FixNorm method divides the output vectors from a transformer by their L2 norms, then multiplies by a learned parameter $g$. ScaleNorm replaces all LayerNorms inside a transformer with division by the L2 norm followed by multiplication by a learned parameter $g'$ (shared by all ScaleNorm modules of a transformer). Query-Key normalization (QKNorm) [32] normalizes query and key vectors to have unit L2 norm.
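These L2-based normalizations are simple to express in code. The sketch below shows a ScaleNorm-style operation and the normalization of query and key vectors. The function names, the small `epsilon` guard, and the choice of $\sqrt{d}$ as the value of the scalar are assumptions made for this illustration.

```python
import numpy as np

def l2_normalize(x, epsilon=1e-12):
    """Scale each vector along the last axis to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, epsilon)

def scalenorm(x, g):
    """Project each vector onto the sphere of radius g (a single scalar)."""
    return g * l2_normalize(x)

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
y = scalenorm(x, g=np.sqrt(d))

# Query-key normalization: with unit-norm q and k, dot products are cosine similarities.
q = l2_normalize(rng.normal(size=(3, d)))
k = l2_normalize(rng.normal(size=(5, d)))
scores = q @ k.T   # every entry lies in [-1, 1]
```

Bounding the attention logits in this way prevents any single query-key pair from producing an arbitrarily large score before the softmax.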
In nGPT, many vectors are normalized to have unit L2 norm: [33] hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
Gradient normalization (GradNorm) [34] normalizes gradient vectors during backpropagation.