In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information that is dispersed among many vectors into fewer vectors. [1] It has several uses: it removes redundant information, reducing the amount of computation and memory required; it makes the model more robust to small variations in the input; and it increases the receptive field of neurons in later layers of the network.
Pooling is most commonly used in convolutional neural networks (CNN). Below is a description of pooling in 2-dimensional CNNs. The generalization to n-dimensions is immediate.
As notation, we consider a tensor $x \in \mathbb{R}^{H \times W \times C}$, where $H$ is height, $W$ is width, and $C$ is the number of channels. A pooling layer outputs a tensor $y \in \mathbb{R}^{H' \times W' \times C}$.
We define two variables $f$ and $s$, called "filter size" (aka "kernel size") and "stride". Sometimes, it is necessary to use a different filter size and stride for horizontal and vertical directions. In such cases, we define 4 variables $f_H, f_W, s_H, s_W$.
The receptive field of an entry in the output tensor $y$ is the set of all entries in $x$ that can affect that entry.
Max Pooling (MaxPool) is commonly used in CNNs to reduce the spatial dimensions of feature maps.
Define
$$y_{0,0,c} = \max\left(x_{0:f,\ 0:f,\ c}\right)$$
where $0\!:\!f$ means the range $0, 1, \dots, f-1$. Note that we need to avoid the off-by-one error. The next input is
$$y_{0,1,c} = \max\left(x_{0:f,\ s:s+f,\ c}\right)$$
and so on. The receptive field of $y_{0,1,c}$ is $x_{0:f,\ s:s+f,\ c}$, so in general,
$$y_{m,n,c} = \max\left(x_{ms:ms+f,\ ns:ns+f,\ c}\right)$$
If the horizontal and vertical filter sizes and strides differ, then in general,
$$y_{m,n,c} = \max\left(x_{m s_H : m s_H + f_H,\ n s_W : n s_W + f_W,\ c}\right)$$
More succinctly, we can write $y = \operatorname{MaxPool}_{f,s}(x)$.
If $W$ (or $H$) is not expressible as $f + s(n-1)$ where $n$ is an integer, then computing the entries of the output tensor on the boundary would require max pooling to take as inputs entries that lie outside the tensor. How those non-existent entries are handled depends on the padding conditions.
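To make the indexing concrete, here is a minimal NumPy sketch of strided 2D max pooling implementing the formula above (the name `max_pool_2d` is illustrative, not a library API); it assumes the input height and width are compatible with the filter size and stride, so no padding is needed.

```python
import numpy as np

def max_pool_2d(x, f, s):
    """Strided 2D max pooling over an (H, W, C) array.

    Assumes (H - f) and (W - f) are divisible by s, so no padding is needed.
    """
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for m in range(H_out):
        for n in range(W_out):
            # Receptive field of y[m, n, c] is x[m*s : m*s+f, n*s : n*s+f, c].
            window = x[m * s : m * s + f, n * s : n * s + f, :]
            y[m, n, :] = window.max(axis=(0, 1))
    return y

x = np.arange(6 * 6 * 1).reshape(6, 6, 1).astype(float)
print(max_pool_2d(x, f=2, s=2).shape)  # (3, 3, 1)
```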
Global Max Pooling (GMP) is a specific kind of max pooling where the output tensor has shape $1 \times 1 \times C$ and the receptive field of $y_{0,0,c}$ is all of $x_{:,\,:,\,c}$. That is, it takes the maximum over each entire channel. It is often used just before the final fully connected layers in a CNN classification head.
Average pooling (AvgPool) is similarly defined:
$$y_{m,n,c} = \frac{1}{f^2} \sum_{ms \le i < ms+f,\ ns \le j < ns+f} x_{i,j,c}$$
Global Average Pooling (GAP) is defined similarly to GMP. It was first proposed in Network-in-Network. [2] Similarly to GMP, it is often used just before the final fully connected layers in a CNN classification head.
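As an illustration, global max pooling and global average pooling amount to per-channel reductions over the spatial dimensions; a minimal NumPy sketch:

```python
import numpy as np

def global_max_pool(x):
    # x has shape (H, W, C); output has shape (1, 1, C).
    return x.max(axis=(0, 1), keepdims=True)

def global_avg_pool(x):
    # Per-channel mean over all spatial positions.
    return x.mean(axis=(0, 1), keepdims=True)

x = np.random.rand(7, 5, 3)
print(global_max_pool(x).shape, global_avg_pool(x).shape)  # (1, 1, 3) (1, 1, 3)
```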
There are several pooling methods that interpolate between max pooling and average pooling.
Mixed Pooling is a linear combination of max pooling and average pooling. [3] That is,
$$y = \lambda \operatorname{MaxPool}(x) + (1 - \lambda) \operatorname{AvgPool}(x)$$
where $\lambda \in [0, 1]$ is either a hyperparameter, a learnable parameter, or randomly sampled anew every time.
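A minimal sketch of mixed pooling over a single receptive field, treating the mixing coefficient as a fixed hyperparameter (it could equally be learned or resampled each time, as noted above):

```python
import numpy as np

def mixed_pool(window, lam=0.5):
    """Mixed pooling of one receptive field: lam * max + (1 - lam) * average."""
    return lam * window.max() + (1.0 - lam) * window.mean()

window = np.array([[1.0, 2.0], [3.0, 4.0]])
print(mixed_pool(window, lam=0.5))  # 0.5 * 4.0 + 0.5 * 2.5 = 3.25
```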
Lp Pooling is like average pooling, but uses the Lp norm average instead of the plain average:
$$y = \left( \frac{1}{N} \sum_{i} |x_i|^p \right)^{1/p}$$
where $N$ is the size of the receptive field and $p$ is a hyperparameter. If all activations are non-negative, then average pooling is the case of $p = 1$, and max pooling is the limit $p \to \infty$. Square-root pooling is the case of $p = 2$. [4]
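The following sketch computes the Lp pool of one receptive field of non-negative activations; $p = 1$ recovers average pooling, and large $p$ approaches the maximum:

```python
import numpy as np

def lp_pool(window, p):
    """Lp pooling of one receptive field: (mean of |x|^p) ** (1/p)."""
    return (np.mean(np.abs(window) ** p)) ** (1.0 / p)

window = np.array([[1.0, 2.0], [3.0, 4.0]])
print(lp_pool(window, 1))    # 2.5, same as average pooling
print(lp_pool(window, 100))  # close to 4.0, approaching max pooling
```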
Stochastic pooling samples a random activation $x_i$ from the receptive field with probability $\frac{x_i}{\sum_j x_j}$, i.e. proportional to its value; in expectation, this yields a weighted average that emphasizes larger activations. [5]
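A sketch of stochastic pooling over one receptive field, sampling an activation with probability proportional to its value (assuming non-negative activations):

```python
import numpy as np

def stochastic_pool(window, rng):
    """Stochastic pooling of one receptive field of non-negative activations:
    sample one activation with probability proportional to its value."""
    x = window.ravel()
    p = x / x.sum()                      # sampling probabilities
    return rng.choice(x, p=p)

rng = np.random.default_rng(0)
window = np.array([[1.0, 2.0], [3.0, 4.0]])
print(stochastic_pool(window, rng))      # one of 1, 2, 3, 4; larger values more likely
```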
Softmax pooling is like max pooling, but uses a softmax-weighted average, i.e.
$$y = \sum_i \frac{e^{\beta x_i}}{\sum_j e^{\beta x_j}}\, x_i$$
where $\beta > 0$. Average pooling is the case of $\beta \downarrow 0$, and max pooling is the case of $\beta \uparrow \infty$. [4]
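A sketch of softmax pooling over one receptive field; $\beta = 0$ gives equal weights (average pooling), while large $\beta$ concentrates the weight on the maximum:

```python
import numpy as np

def softmax_pool(window, beta):
    """Softmax pooling: a weighted average with weights softmax(beta * x)."""
    x = window.ravel()
    w = np.exp(beta * (x - x.max()))  # subtract max for numerical stability
    w /= w.sum()
    return np.dot(w, x)

window = np.array([[1.0, 2.0], [3.0, 4.0]])
print(softmax_pool(window, 0.0))    # 2.5, average pooling
print(softmax_pool(window, 100.0))  # ~4.0, approaching max pooling
```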
Local Importance-based Pooling generalizes softmax pooling by
$$y = \frac{\sum_i e^{g(x_i)}\, x_i}{\sum_j e^{g(x_j)}}$$
where $g$ is a learnable function. [6]
Spatial pyramidal pooling applies max pooling (or any other form of pooling) in a pyramid structure. That is, it applies global max pooling, then applies max pooling to the image divided into 4 equal parts, then 16, etc. The results are then concatenated. It is a hierarchical form of global pooling, and similar to global pooling, it is often used just before a classification head. [7]
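A sketch of a two-level spatial pyramid (1×1 and 2×2 grids) built from max pooling, assuming for simplicity that the spatial dimensions divide evenly; the pooled vectors from all cells are concatenated:

```python
import numpy as np

def spatial_pyramid_pool(x, levels=(1, 2)):
    """Concatenate max pools over successively finer grids of an (H, W, C) map."""
    H, W, C = x.shape
    pooled = []
    for k in levels:  # k x k grid of equal cells
        hs, ws = H // k, W // k  # assumes H and W are divisible by k
        for i in range(k):
            for j in range(k):
                cell = x[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws, :]
                pooled.append(cell.max(axis=(0, 1)))
    return np.concatenate(pooled)  # length C * sum(k*k for k in levels)

x = np.random.rand(8, 8, 3)
print(spatial_pyramid_pool(x).shape)  # (15,) = 3 * (1 + 4)
```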
Region of Interest Pooling (also known as RoI pooling) is a variant of max pooling used in R-CNNs for object detection. [8] It is designed to take an arbitrarily-sized input matrix, and output a fixed-sized output matrix.
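A simplified sketch of RoI max pooling: the region of interest is split into a fixed grid of roughly equal bins and each bin is max-pooled, so any region size maps to the same output size (the bin rounding here is a simplification of the original method):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Crop an (H, W, C) feature map to a region of interest and max-pool it
    onto a fixed (out_h, out_w, C) grid, regardless of the region's size."""
    r0, c0, r1, c1 = roi                       # region boundaries (half-open)
    region = feature_map[r0:r1, c0:c1, :]
    H, W, C = region.shape
    # Bin boundaries splitting the region into out_h x out_w roughly equal cells.
    rows = np.linspace(0, H, out_h + 1).astype(int)
    cols = np.linspace(0, W, out_w + 1).astype(int)
    out = np.empty((out_h, out_w, C), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[rows[i]:rows[i + 1], cols[j]:cols[j + 1], :]
            out[i, j, :] = cell.max(axis=(0, 1))
    return out

fm = np.random.rand(32, 32, 8)
print(roi_max_pool(fm, roi=(3, 5, 20, 29), out_h=7, out_w=7).shape)  # (7, 7, 8)
```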
Covariance pooling computes the covariance matrix of the vectors $\{x_{i,j,:}\}_{i,j}$, which is then flattened to a $C^2$-dimensional vector $y$. Global covariance pooling is used similarly to global max pooling. As average pooling computes the average, which is a first-degree statistic, and covariance is a second-degree statistic, covariance pooling is also called "second-order pooling". It can be generalized to higher-order poolings. [9] [10]
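A sketch of global covariance (second-order) pooling: the spatial positions are treated as a set of C-dimensional vectors whose covariance matrix is computed and flattened:

```python
import numpy as np

def global_covariance_pool(x):
    """Flattened C x C covariance of the spatial positions of an (H, W, C) map."""
    H, W, C = x.shape
    v = x.reshape(H * W, C)            # one C-dimensional vector per position
    cov = np.cov(v, rowvar=False)      # C x C covariance matrix
    return cov.ravel()                 # C^2-dimensional pooled descriptor

x = np.random.rand(8, 8, 4)
print(global_covariance_pool(x).shape)  # (16,)
```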
Blur Pooling means applying a blurring method before downsampling. For example, Rect-2 blur pooling means taking an average pooling with $f = 2, s = 1$, then taking every second pixel (identity with $s = 2$). [11]
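A sketch of Rect-2 blur pooling on a single channel: a 2×2 box blur at stride 1, followed by subsampling every second pixel:

```python
import numpy as np

def rect2_blur_pool(x):
    """Rect-2 blur pooling on a 2D array: 2x2 average at stride 1, then stride-2 subsampling."""
    # 2x2 box blur at stride 1 (output is (H-1) x (W-1)).
    blurred = (x[:-1, :-1] + x[:-1, 1:] + x[1:, :-1] + x[1:, 1:]) / 4.0
    # Keep every second pixel.
    return blurred[::2, ::2]

x = np.arange(36, dtype=float).reshape(6, 6)
print(rect2_blur_pool(x).shape)  # (3, 3)
```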
In Vision Transformers (ViT), the following kinds of pooling are commonly used.
BERT-like pooling uses a dummy [CLS] token ("classification"). For classification, the output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution, which is the network's prediction of the class probability distribution. This is the pooling used by the original ViT [12] and the Masked Autoencoder. [13]
Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good. [12]
Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors $x_1, \dots, x_n$, which might be thought of as the output vectors of a layer of a ViT. It then applies a feedforward layer $\operatorname{FFN}$ on each vector, resulting in a matrix $V = [\operatorname{FFN}(x_1), \dots, \operatorname{FFN}(x_n)]$. This is then sent to a multiheaded attention, resulting in $\operatorname{MultiheadAttention}(Q, V, V)$, where $Q$ is a matrix of trainable parameters. [14] This was first proposed in the Set Transformer architecture. [15]
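For intuition, here is a single-head simplification of attention pooling with one learnable query vector (the actual MAP block uses multi-head attention and a feedforward layer; the parameter names below are illustrative):

```python
import numpy as np

def attention_pool(X, q, Wk, Wv):
    """Pool a set of token vectors X (n, d) into one vector via a learnable query q (d,)."""
    K = X @ Wk                      # keys,   shape (n, d)
    V = X @ Wv                      # values, shape (n, d)
    scores = K @ q / np.sqrt(q.size)
    w = np.exp(scores - scores.max())
    w /= w.sum()                    # attention weights over the n tokens
    return w @ V                    # pooled vector, shape (d,)

rng = np.random.default_rng(0)
d, n = 16, 10
X = rng.normal(size=(n, d))
q, Wk, Wv = rng.normal(size=d), rng.normal(size=(d, d)), rng.normal(size=(d, d))
print(attention_pool(X, q, Wk, Wv).shape)  # (16,)
```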
Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling. [14] [16]
In graph neural networks (GNN), there are also two forms of pooling: global and local. Global pooling can be reduced to a local pooling where the receptive field is the entire graph.
Local pooling layers coarsen the graph via downsampling. We present here several learnable local pooling strategies that have been proposed. [19] In each case, the input is the initial graph, represented by a matrix $\mathbf{X}$ of node features and the graph adjacency matrix $\mathbf{A}$. The output is the new matrix $\mathbf{X}'$ of node features and the new graph adjacency matrix $\mathbf{A}'$.
For top-k pooling, we first set
$$\mathbf{y} = \frac{\mathbf{X}\mathbf{p}}{\lVert\mathbf{p}\rVert}$$
where $\mathbf{p}$ is a learnable projection vector. The projection vector $\mathbf{p}$ computes a scalar projection value for each graph node.
The top-k pooling layer [17] can then be formalised as follows:
$$\mathbf{X}' = (\mathbf{X} \odot \operatorname{sigmoid}(\mathbf{y}))_{\mathbf{i}}, \qquad \mathbf{A}' = \mathbf{A}_{\mathbf{i},\mathbf{i}}$$
where $\mathbf{i} = \operatorname{top}_k(\mathbf{y})$ is the subset of nodes with the top-k highest projection scores, $\odot$ denotes element-wise matrix multiplication, and $\operatorname{sigmoid}(\cdot)$ is the sigmoid function. In other words, the nodes with the top-k highest projection scores are retained in the new adjacency matrix $\mathbf{A}'$. The $\operatorname{sigmoid}(\cdot)$ operation makes the projection vector $\mathbf{p}$ trainable by backpropagation, which would otherwise produce discrete outputs. [17]
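A minimal NumPy sketch of top-k pooling as formalised above (names such as `topk_pool` are illustrative, and the random adjacency matrix is only for demonstration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def topk_pool(X, A, p, k):
    """Top-k graph pooling: keep the k nodes with the largest projection scores."""
    y = X @ p / np.linalg.norm(p)              # scalar projection score per node
    idx = np.argsort(y)[-k:]                   # indices of the top-k nodes
    X_new = X[idx] * sigmoid(y[idx])[:, None]  # gate kept features by their scores
    A_new = A[np.ix_(idx, idx)]                # induced adjacency on the kept nodes
    return X_new, A_new

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                    # 6 nodes, 4 features each
A = (rng.random((6, 6)) < 0.4).astype(float)   # random adjacency for illustration
p = rng.normal(size=4)
X_new, A_new = topk_pool(X, A, p, k=3)
print(X_new.shape, A_new.shape)                # (3, 4) (3, 3)
```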
For self-attention pooling, we first set
$$\mathbf{y} = \operatorname{GNN}(\mathbf{X}, \mathbf{A})$$
where $\operatorname{GNN}$ is a generic permutation-equivariant GNN layer (e.g., GCN, GAT, MPNN).
The Self-attention pooling layer [18] can then be formalised as follows:
$$\mathbf{X}' = (\mathbf{X} \odot \mathbf{y})_{\mathbf{i}}, \qquad \mathbf{A}' = \mathbf{A}_{\mathbf{i},\mathbf{i}}$$
where $\mathbf{i} = \operatorname{top}_k(\mathbf{y})$ is the subset of nodes with the top-k highest attention scores, and $\odot$ denotes element-wise matrix multiplication.
The self-attention pooling layer can be seen as an extension of the top-k pooling layer. Unlike top-k pooling, the self-attention scores computed in self-attention pooling account for both the graph features and the graph topology.
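A minimal sketch of self-attention pooling in the same style, where the per-node score comes from a one-unit GCN-style layer (this particular scoring layer is an illustrative choice; any permutation-equivariant GNN layer could be used):

```python
import numpy as np

def sa_pool(X, A, W, k):
    """Self-attention pooling: score nodes with a one-unit GCN-style layer, keep the top k."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))  # symmetric normalization
    y = np.tanh(A_norm @ X @ W).ravel()       # one attention score per node
    idx = np.argsort(y)[-k:]                  # indices of the top-k nodes
    X_new = X[idx] * y[idx][:, None]          # gate kept features by their scores
    A_new = A[np.ix_(idx, idx)]               # induced adjacency on the kept nodes
    return X_new, A_new

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
A = (rng.random((6, 6)) < 0.4).astype(float)
W = rng.normal(size=(4, 1))
X_new, A_new = sa_pool(X, A, W, k=3)
print(X_new.shape, A_new.shape)               # (3, 4) (3, 3)
```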
In the early 20th century, neuroanatomists noticed a certain motif where multiple neurons synapse onto the same neuron. This was given a functional explanation as "local pooling", which makes vision translation-invariant. (Hartline, 1940) [20] gave supporting evidence for the theory with electrophysiological experiments on the receptive fields of retinal ganglion cells. The Hubel and Wiesel experiments showed that the visual system in cats is similar to a convolutional neural network, with some cells summing over inputs from the lower layer. [21] : Fig. 19, 20 See (Westheimer, 1965) [22] for citations to this early literature.
During the 1970s, to explain the effects of depth perception, some researchers such as (Julesz and Chang, 1976) [23] proposed that the vision system implements a disparity-selective mechanism by global pooling, where the outputs from matching pairs of retinal regions in the two eyes are pooled in higher-order cells. See [24] for citations to this early literature.
In artificial neural networks, max pooling was used in 1990 for speech processing (1-dimensional convolution). [25]