Convolutional layer

Last updated

In artificial neural networks, a convolutional layer is a type of network layer that applies a convolution operation to the input. Convolutional layers are some of the primary building blocks of convolutional neural networks (CNNs), a class of neural network most commonly applied to images, video, audio, and other data that have the property of uniform translational symmetry. [1]

Contents

The convolution operation in a convolutional layer involves sliding a small window (called a kernel or filter) across the input data and computing the dot product between the values in the kernel and the input at each position. This process creates a feature map that represents detected features in the input. [2]

Concepts

Kernel

Kernels, also known as filters, are small matrices of weights that are learned during the training process. Each kernel is responsible for detecting a specific feature in the input data. The size of the kernel is a hyperparameter that affects the network's behavior.

Convolution

For a 2D input and a 2D kernel , the 2D convolution operation can be expressed as:where and are the height and width of the kernel, respectively.

This generalizes immediately to nD convolutions. Commonly used convolutions are 1D (for audio and text), 2D (for images), and 3D (for spatial objects, and videos).

Stride

Stride determines how the kernel moves across the input data. A stride of 1 means the kernel shifts by one pixel at a time, while a larger stride (e.g., 2 or 3) results in less overlap between convolutions and produces smaller output feature maps.

Padding

Padding involves adding extra pixels around the edges of the input data. It serves two main purposes:

Common padding strategies include:

Common padding algorithms include:

The exact numbers used in convolutions is complicated, for which we refer to (Dumoulin and Visin, 2018) [3] for details.

Variants

Standard

The basic form of convolution as described above, where each kernel is applied to the entire input volume.

Depthwise separable

Depthwise separable convolution separates the standard convolution into two steps: depthwise convolution and pointwise convolution. The depthwise separable convolution decomposes a single standard convolution into two convolutions: a depthwise convolution that filters each input channel independently and a pointwise convolution ( convolution) that combines the outputs of the depthwise convolution. This factorization significantly reduces computational cost. [4]

It was first developed by Laurent Sifre during an internship at Google Brain in 2013 as an architectural variation on AlexNet to improve convergence speed and model size. [4]

Dilated

Dilated convolution, or atrous convolution, introduces gaps between kernel elements, allowing the network to capture a larger receptive field without increasing the kernel size. [5] [6]

Transposed

Transposed convolution, also known as deconvolution or fractionally strided convolution, is a convolution where the output tensor is larger than its input tensor. It's often used in encoder-decoder architectures for upsampling.

History

The concept of convolution in neural networks was inspired by the visual cortex in biological brains. Early work by Hubel and Wiesel in the 1960s on the cat's visual system laid the groundwork for artificial convolution networks. [7]

An early convolution neural network was developed by Kunihiko Fukushima in 1969. It had mostly hand-designed kernels inspired by convolutions in mammalian vision. [8] In 1979 he improved it to the Neocognitron, which learns all convolutional kernels by unsupervised learning (in his terminology, "self-organized by 'learning without a teacher'"). [9] [10]

In 1998, Yann LeCun et al. introduced LeNet-5, an early influential CNN architecture for handwritten digit recognition, trained on the MNIST dataset. [11]

(Olshausen & Field, 1996) [12] discovered that simple cells in the mammalian primary visual cortex implement localized, oriented, bandpass receptive fields, which could be recreated by fitting sparse linear codes for natural scenes. This was later found to also occur in the lowest-level kernels of trained CNNs. [13] :Fig 3

The field saw a resurgence in the 2010s with the development of deeper architectures and the availability of large datasets and powerful GPUs. AlexNet, developed by Alex Krizhevsky et al. in 2012, was a catalytic event in modern deep learning. [13]

See also

Related Research Articles

<span class="mw-page-title-main">Neural network (machine learning)</span> Computational model used in machine learning, based on connected, hierarchical functions

In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.

The neocognitron is a hierarchical, multilayered artificial neural network proposed by Kunihiko Fukushima in 1979. It has been used for Japanese handwritten character recognition and other pattern recognition tasks, and served as the inspiration for convolutional neural networks.

<span class="mw-page-title-main">Activation function</span> Artificial neural network node function

The activation function of a node in an artificial neural network is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear. Modern activation functions include the smooth version of the ReLU, the GELU, which was used in the 2018 BERT model, the logistic (sigmoid) function used in the 2012 speech recognition model developed by Hinton et al, the ReLU used in the 2012 AlexNet computer vision model and in the 2015 ResNet model.

In the mathematical theory of artificial neural networks, universal approximation theorems are theorems of the following form: Given a family of neural networks, for each function from a certain function space, there exists a sequence of neural networks from the family, such that according to some criterion. That is, the family of neural networks is dense in the function space.

Kunihiko Fukushima is a Japanese computer scientist, most noted for his work on artificial neural networks and deep learning. He is currently working part-time as a senior research scientist at the Fuzzy Logic Systems Institute in Fukuoka, Japan.

<span class="mw-page-title-main">Time delay neural network</span>

Time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.

There are many types of artificial neural networks (ANN).

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. Convolution-based networks are the de-facto standard in deep learning-based approaches to computer vision and image processing, and have only recently have been replaced -- in some cases -- by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.

<span class="mw-page-title-main">AlexNet</span> An influential convolutional neural network published in 2012

AlexNet is the name of a convolutional neural network (CNN) architecture, designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor at the University of Toronto. It had 60 million parameters and 650,000 neurons.

<span class="mw-page-title-main">Neural style transfer</span> Type of software algorithm for image manipulation

Neural style transfer (NST) refers to a class of software algorithms that manipulate digital images, or videos, in order to adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for the sake of image transformation. Common uses for NST are the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. This method has been used by artists and designers around the globe to develop new artwork based on existent style(s).

Inception is a family of convolutional neural network (CNN) for computer vision, introduced by researchers at Google in 2014 as GoogLeNet. The series was historically important as an early CNN that separates the stem, body, and head (prediction), an architectural design that persists in all modern CNN.

Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their creation was inspired by biological neural circuitry. While some of the computational implementations ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist Frank Rosenblatt, who developed the perceptron. Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling this period an "AI winter".

<span class="mw-page-title-main">LeNet</span> Convolutional neural network structure

LeNet is a series of convolutional neural network structure proposed by LeCun et al.. The earliest version, LeNet-1, was trained in 1989. In general, when "LeNet" is referred to without a number, it refers to LeNet-5 (1998), the most well-known version.

A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a certain type of sequence of neural networks. Specifically, a wide variety of network architectures converges to a GP in the infinitely wide limit, in the sense of distribution. The concept constitutes an intensional definition, i.e., a NNGP is just a GP, but distinguished by how it is obtained.

A graph neural network (GNN) belongs to a class of artificial neural networks for processing data that can be represented as graphs.

<span class="mw-page-title-main">Vision transformer</span> Variant of Transformer designed for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

In machine learning, the term tensor informally refers to two different concepts for organizing and representing data. Data may be organized in a multidimensional array (M-way array), informally referred to as a "data tensor"; however, in the strict mathematical sense, a tensor is a multilinear mapping over a set of domain vector spaces to a range vector space. Observations, such as images, movies, volumes, sounds, and relationships among words and concepts, stored in an M-way array ("data tensor"), may be analyzed either by artificial neural networks or tensor methods.

<span class="mw-page-title-main">VGGNet</span> Series of convolutional neural networks for image classification

The VGGNets are a series of convolutional neural networks (CNNs) developed by the Visual Geometry Group (VGG) at the University of Oxford.

In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information that is dispersed among many vectors into fewer vectors. It has several uses. It removes redundant information, reducing the amount of computation and memory required, makes the model more robust to small variations in the input, and increases the receptive field of neurons in later layers in the network.

MobileNet is a family of convolutional neural network (CNN) architectures designed for image classification, object detection, and other computer vision tasks. They are designed for small size, low latency, and low power consumption, making them suitable for on-device inference and edge computing on resource-constrained devices like mobile phones and embedded systems. They were originally designed to be run efficiently on mobile devices with TensorFlow Lite.

References

  1. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. Cambridge, MA: MIT Press. pp. 326–366. ISBN   978-0262035613.
  2. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "7.2. Convolutions for Images". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN   978-1-009-38943-3.
  3. Dumoulin, Vincent; Visin, Francesco (2016). "A guide to convolution arithmetic for deep learning". arXiv preprint arXiv:1603.07285. arXiv: 1603.07285 .
  4. 1 2 Chollet, François (2017). "Xception: Deep Learning with Depthwise Separable Convolutions". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1800–1807. arXiv: 1610.02357 . doi:10.1109/CVPR.2017.195. ISBN   978-1-5386-0457-1.
  5. Yu, Fisher; Koltun, Vladlen (2016). "Multi-Scale Context Aggregation by Dilated Convolutions". Iclr 2016. arXiv: 1511.07122 .
  6. Chen, Liang-Chieh; Papandreou, George; Kokkinos, Iasonas; Murphy, Kevin; Yuille, Alan L. (2018-04-01). "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 40 (4): 834–848. arXiv: 1606.00915 . doi:10.1109/TPAMI.2017.2699184. ISSN   0162-8828. PMID   28463186.
  7. Hubel, D. H.; Wiesel, T. N. (1968). "Receptive fields and functional architecture of monkey striate cortex". The Journal of Physiology. 195 (1): 215–243. doi:10.1113/jphysiol.1968.sp008455. PMC   1557912 . PMID   4966457.
  8. Fukushima, Kunihiko (1969). "Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225. ISSN   0536-1567.
  9. Fukushima, Kunihiko (October 1979). "位置ずれに影響されないパターン認識機構の神経回路のモデル --- ネオコグニトロン ---" [Neural network model for a mechanism of pattern recognition unaffected by shift in position — Neocognitron —]. Trans. IECE (in Japanese). J62-A (10): 658–665.
  10. Fukushima, Kunihiko (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID   7370364.
  11. LeCun, Yann; Bottou, Léon; Bengio, Yoshua; Haffner, Patrick (1998). "Gradient-based learning applied to document recognition". Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791.
  12. Olshausen, Bruno A.; Field, David J. (June 1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature. 381 (6583): 607–609. Bibcode:1996Natur.381..607O. doi:10.1038/381607a0. ISSN   1476-4687. PMID   8637596.
  13. 1 2 Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). "ImageNet Classification with Deep Convolutional Neural Networks". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc.