Convolutional layer

Last updated March 22, 2025 • 4 min readFrom Wikipedia, The Free Encyclopedia

In artificial neural networks, a convolutional layer is a type of network layer that applies a convolution operation to the input. Convolutional layers are some of the primary building blocks of convolutional neural networks (CNNs), a class of neural network most commonly applied to images, video, audio, and other data that have the property of uniform translational symmetry.^[1]

The convolution operation in a convolutional layer involves sliding a small window (called a kernel or filter) across the input data and computing the dot product between the values in the kernel and the input at each position. This process creates a feature map that represents detected features in the input.^[2]

Concepts

Kernel

Kernels, also known as filters, are small matrices of weights that are learned during the training process. Each kernel is responsible for detecting a specific feature in the input data. The size of the kernel is a hyperparameter that affects the network's behavior.

Convolution

For a 2D input $x$ and a 2D kernel $w$ , the 2D convolution operation can be expressed as: $y[i,j]=\sum _{m=0}^{k_{h}-1}\sum _{n=0}^{k_{w}-1}x[i+m,j+n]\cdot w[m,n]$ where $k_{h}$ and $k_{w}$ are the height and width of the kernel, respectively.

This generalizes immediately to nD convolutions. Commonly used convolutions are 1D (for audio and text), 2D (for images), and 3D (for spatial objects, and videos).

Stride

Stride determines how the kernel moves across the input data. A stride of 1 means the kernel shifts by one pixel at a time, while a larger stride (e.g., 2 or 3) results in less overlap between convolutions and produces smaller output feature maps.

Padding

Padding involves adding extra pixels around the edges of the input data. It serves two main purposes:

Preserving spatial dimensions: Without padding, each convolution reduces the size of the feature map.
Handling border pixels: Padding ensures that border pixels are given equal importance in the convolution process.

Common padding strategies include:

No padding/valid padding. This strategy typically causes the output to shrink.
Same padding: Any method that ensures the output size same as input size is a same padding strategy.
Full padding: Any method that ensures each input entry is convolved over for the same number of times is a full padding strategy.

Common padding algorithms include:

Zero padding: Add zero entries to the borders of input.
Mirror/reflect/symmetric padding: Reflect the input array on the border.
Circular padding: Cycle the input array back to the opposite border, like a torus.

The exact numbers used in convolutions is complicated, for which we refer to (Dumoulin and Visin, 2018)^[3] for details.

Variants

Standard

The basic form of convolution as described above, where each kernel is applied to the entire input volume.

Depthwise separable

Depthwise separable convolution separates the standard convolution into two steps: depthwise convolution and pointwise convolution. The depthwise separable convolution decomposes a single standard convolution into two convolutions: a depthwise convolution that filters each input channel independently and a pointwise convolution ( $1\times 1$ convolution) that combines the outputs of the depthwise convolution. This factorization significantly reduces computational cost.^[4]

It was first developed by Laurent Sifre during an internship at Google Brain in 2013 as an architectural variation on AlexNet to improve convergence speed and model size.^[4]

Dilated

Dilated convolution, or atrous convolution, introduces gaps between kernel elements, allowing the network to capture a larger receptive field without increasing the kernel size.^[5]^[6]

Transposed

Transposed convolution, also known as deconvolution, fractionally strided convolution, and upsampling convolution, is a convolution where the output tensor is larger than its input tensor. It's often used in encoder-decoder architectures for upsampling. It's used in image generation, semantic segmentation, and super-resolution tasks.

History

The concept of convolution in neural networks was inspired by the visual cortex in biological brains. Early work by Hubel and Wiesel in the 1960s on the cat's visual system laid the groundwork for artificial convolution networks.^[7]

An early convolution neural network was developed by Kunihiko Fukushima in 1969. It had mostly hand-designed kernels inspired by convolutions in mammalian vision.^[8] In 1979 he improved it to the Neocognitron, which learns all convolutional kernels by unsupervised learning (in his terminology, "self-organized by 'learning without a teacher'").^[9]^[10]

During the 1988 to 1998 period, a series of CNN were introduced by Yann LeCun et al., ending with LeNet-5 in 1998. It was an early influential CNN architecture for handwritten digit recognition, trained on the MNIST dataset, and was used in ATM.^[11]

(Olshausen & Field, 1996)^[12] discovered that simple cells in the mammalian primary visual cortex implement localized, oriented, bandpass receptive fields, which could be recreated by fitting sparse linear codes for natural scenes. This was later found to also occur in the lowest-level kernels of trained CNNs.^[13]^{: Fig 3}

The field saw a resurgence in the 2010s with the development of deeper architectures and the availability of large datasets and powerful GPUs. AlexNet, developed by Alex Krizhevsky et al. in 2012, was a catalytic event in modern deep learning.^[13]^[14] In that year’s ImageNet competition, the AlexNet model achieved a 16% top-five error rate, significantly outperforming the next best entry, which had a 26% error rate. The network used eight trainable layers, approximately 650,000 neurons, and around 60 million parameters, highlighting the impact of deeper architectures and GPU acceleration on image recognition performance.^[14]

From the 2013 ImageNet competition, most entries adopted deep convolutional neural networks, building on the success of AlexNet. Over the following years, performance steadily improved, with the top-five error rate falling from 16% in 2012 and 12% in 2013 to below 3% by 2017, as networks grew increasingly deep.^[14]

References

↑ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. Cambridge, MA: MIT Press. pp. 326–366. ISBN 978-0262035613.
↑ Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "7.2. Convolutions for Images". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
↑ Dumoulin, Vincent; Visin, Francesco (2016). "A guide to convolution arithmetic for deep learning". arXiv preprint arXiv:1603.07285. arXiv: 1603.07285 .
1 2 Chollet, François (2017). "Xception: Deep Learning with Depthwise Separable Convolutions". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1800–1807. arXiv: 1610.02357 . doi:10.1109/CVPR.2017.195. ISBN 978-1-5386-0457-1.
↑ Yu, Fisher; Koltun, Vladlen (2016). "Multi-Scale Context Aggregation by Dilated Convolutions". Iclr 2016. arXiv: 1511.07122 .
↑ Chen, Liang-Chieh; Papandreou, George; Kokkinos, Iasonas; Murphy, Kevin; Yuille, Alan L. (2018-04-01). "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 40 (4): 834–848. arXiv: 1606.00915 . doi:10.1109/TPAMI.2017.2699184. ISSN 0162-8828. PMID 28463186.
↑ Hubel, D. H.; Wiesel, T. N. (1968). "Receptive fields and functional architecture of monkey striate cortex". The Journal of Physiology. 195 (1): 215–243. doi:10.1113/jphysiol.1968.sp008455. PMC 1557912 . PMID 4966457.
↑ Fukushima, Kunihiko (1969). "Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225. ISSN 0536-1567.
↑ Fukushima, Kunihiko (October 1979). "位置ずれに影響されないパターン認識機構の神経回路のモデル --- ネオコグニトロン ---" [Neural network model for a mechanism of pattern recognition unaffected by shift in position — Neocognitron —]. Trans. IECE (in Japanese). J62-A (10): 658–665.
↑ Fukushima, Kunihiko (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364.
↑ LeCun, Yann; Bottou, Léon; Bengio, Yoshua; Haffner, Patrick (1998). "Gradient-based learning applied to document recognition". Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791.
↑ Olshausen, Bruno A.; Field, David J. (June 1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature. 381 (6583): 607–609. Bibcode:1996Natur.381..607O. doi:10.1038/381607a0. ISSN 1476-4687. PMID 8637596.
1 2 Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). "ImageNet Classification with Deep Convolutional Neural Networks". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc.
1 2 3 "How computers got shockingly good at recognizing images". Ars Technica. 18 December 2018. Retrieved 21 March 2025.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Goodfellow-1] Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. Cambridge, MA: MIT Press. pp. 326–366. ISBN 978-0262035613.

[Zhang-2] Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "7.2. Convolutions for Images". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.

[3] Dumoulin, Vincent; Visin, Francesco (2016). "A guide to convolution arithmetic for deep learning". arXiv preprint arXiv:1603.07285. arXiv: 1603.07285 .

[:1-4] 1 2 Chollet, François (2017). "Xception: Deep Learning with Depthwise Separable Convolutions". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1800–1807. arXiv: 1610.02357 . doi:10.1109/CVPR.2017.195. ISBN 978-1-5386-0457-1.

[5] Yu, Fisher; Koltun, Vladlen (2016). "Multi-Scale Context Aggregation by Dilated Convolutions". Iclr 2016. arXiv: 1511.07122 .

[6] Chen, Liang-Chieh; Papandreou, George; Kokkinos, Iasonas; Murphy, Kevin; Yuille, Alan L. (2018-04-01). "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 40 (4): 834–848. arXiv: 1606.00915 . doi:10.1109/TPAMI.2017.2699184. ISSN 0162-8828. PMID 28463186.

[7] Hubel, D. H.; Wiesel, T. N. (1968). "Receptive fields and functional architecture of monkey striate cortex". The Journal of Physiology. 195 (1): 215–243. doi:10.1113/jphysiol.1968.sp008455. PMC 1557912 . PMID 4966457.

[8] Fukushima, Kunihiko (1969). "Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225. ISSN 0536-1567.

[9] Fukushima, Kunihiko (October 1979). "位置ずれに影響されないパターン認識機構の神経回路のモデル --- ネオコグニトロン ---" [Neural network model for a mechanism of pattern recognition unaffected by shift in position — Neocognitron —]. Trans. IECE (in Japanese). J62-A (10): 658–665.

[10] Fukushima, Kunihiko (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364.

[11] LeCun, Yann; Bottou, Léon; Bengio, Yoshua; Haffner, Patrick (1998). "Gradient-based learning applied to document recognition". Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791.

[12] Olshausen, Bruno A.; Field, David J. (June 1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature. 381 (6583): 607–609. Bibcode:1996Natur.381..607O. doi:10.1038/381607a0. ISSN 1476-4687. PMID 8637596.

[:0-13] 1 2 Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). "ImageNet Classification with Deep Convolutional Neural Networks". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc.

[:2-14] 1 2 3 "How computers got shockingly good at recognizing images". Ars Technica. 18 December 2018. Retrieved 21 March 2025.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

v t e Differentiable computing
General	Differentiable programming Information geometry Statistical manifold Automatic differentiation Neuromorphic computing Pattern recognition Ricci calculus Computational learning theory Inductive bias
Hardware	IPU TPU VPU Memristor SpiNNaker
Software libraries	TensorFlow PyTorch Keras scikit-learn Theano JAX Flux.jl MindSpore
Portals Computer programming Technology