VGGNet

Developer(s)	Visual Geometry Group
Initial release	September 4, 2014
Written in	Caffe
Type	Convolutional neural network; deep neural network
License	CC BY 4.0
Website	www.robots.ox.ac.uk/~vgg/research/very_deep/

Last updated October 11, 2024

The VGGNets are a series of convolutional neural networks (CNNs) developed by the Visual Geometry Group (VGG) at the University of Oxford.

The VGG family includes various configurations with different depths, denoted by the letters "VGG" followed by the number of weight layers. The most common ones are VGG-16 (13 convolutional layers + 3 fully connected layers, 138M parameters) and VGG-19 (16 + 3, 144M parameters).^[1]

The VGG family was widely applied in various areas of computer vision.^[2] An ensemble of VGGNets achieved state-of-the-art results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014.^[1]^[3] VGG was used as a baseline in the ResNet paper for image classification,^[4] as the network in the Fast Region-based CNN for object detection, and as a base network in neural style transfer.^[5]

The series was historically important as an early influential model designed by composing generic modules, whereas AlexNet (2012) was designed "from scratch". It was also instrumental in changing the standard convolutional kernels in CNNs from large (up to 11-by-11 in AlexNet) to just 3-by-3, a decision that was only revised in ConvNeXt (2022).^[6]^[7]

VGGNets were largely superseded by Inception, ResNet, and DenseNet. RepVGG (2021) is an updated version of the architecture.^[8]

Architecture

The key architectural principle of VGG models is the consistent use of small $3\times 3$ convolutional filters throughout the network. This contrasts with earlier CNN architectures that employed larger filters, such as $11\times 11$ in AlexNet.^[7]

For example, two stacked $3\times 3$ convolutions have the same receptive field as a single $5\times 5$ convolution, but the latter uses $25\cdot c^{2}$ parameters, while the former uses only $18\cdot c^{2}$ parameters (where $c$ is the number of input and output channels, and biases are ignored). The original publication showed that deep and narrow CNNs significantly outperform their shallow and wide counterparts.^[7]
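The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch in plain Python; the helper names and the channel count `c = 64` are illustrative assumptions, not taken from the original publication.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field (in pixels per side) of a stack of stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        # Each stride-1 layer with a k-by-k kernel widens the field by k - 1.
        field += k - 1
    return field


def conv_weights(kernel_size, channels):
    """Weight count of a k-by-k convolution with `channels` inputs and outputs (no biases)."""
    return kernel_size * kernel_size * channels * channels


c = 64  # illustrative channel count (an assumption for this example)
two_3x3 = 2 * conv_weights(3, c)  # 18 * c**2
one_5x5 = conv_weights(5, c)      # 25 * c**2

print(stacked_receptive_field([3, 3]), stacked_receptive_field([5]))  # 5 5
print(two_3x3, one_5x5)  # 73728 102400
```

The two stacked layers cover the same $5\times 5$ window with 28% fewer weights, and they also apply a nonlinearity twice rather than once.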

The VGG series of models are deep neural networks composed of generic modules:

Convolutional modules: $3\times 3$ convolutional layers with stride 1, followed by ReLU activations.
Max-pooling layers: After some convolutional modules, a max-pooling layer with a $2\times 2$ filter and a stride of 2 downsamples the feature maps. It halves both width and height, but keeps the number of channels.
Fully connected layers: Three fully connected layers at the end of the network, with sizes 4096-4096-1000. The last one has 1000 outputs, corresponding to the 1000 classes in ImageNet.
Softmax layer: A softmax layer outputs the probability distribution over the classes.
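How these modules interact can be illustrated by tracing feature-map shapes through the convolutional part of the network. The sketch below is plain Python and rests on several assumptions not stated in the list above: a $224\times 224$ input, a padding of 1 pixel for each $3\times 3$ convolution, and a VGG-16-style split of the 13 convolutional layers into five stages of 2, 2, 3, 3 and 3 layers with 64, 128, 256, 512 and 512 channels. The function names are hypothetical.

```python
def conv3x3_out(size, padding=1, stride=1):
    """Output width/height of a 3x3 convolution (padding of 1 is an assumption)."""
    return (size + 2 * padding - 3) // stride + 1


def maxpool2x2_out(size):
    """Output width/height of a 2x2 max pooling with stride 2."""
    return (size - 2) // 2 + 1


def trace_shapes(input_size, stage_channels, convs_per_stage):
    """Return (channels, height, width) after each stage's max-pooling layer."""
    size = input_size
    shapes = []
    for channels, n_convs in zip(stage_channels, convs_per_stage):
        for _ in range(n_convs):
            size = conv3x3_out(size)  # stride 1 with padding 1 keeps the size
        size = maxpool2x2_out(size)   # halves width and height
        shapes.append((channels, size, size))
    return shapes


# Assumed VGG-16-style layout: 2 + 2 + 3 + 3 + 3 = 13 convolutional layers.
shapes = trace_shapes(224, [64, 128, 256, 512, 512], [2, 2, 3, 3, 3])
for shape in shapes:
    print(shape)

channels, height, width = shapes[-1]
print(channels * height * width)  # 25088
```

Under these assumptions the last feature map is $512\times 7\times 7$, so the first fully connected layer receives a flattened vector of 25088 values.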

The VGG family includes various configurations with different depths, denoted by the letters "VGG" followed by the number of weight layers. The most common ones are VGG-16 (13 convolutional layers + 3 fully connected layers) and VGG-19 (16 + 3), denoted as configurations D and E in the original paper.^[10]

As an example, the 16 convolutional layers of VGG-19 are structured as follows:

$$\begin{aligned}
&3\to 64\to 64 &&\xrightarrow{\text{downsample}} \\
&64\to 128\to 128 &&\xrightarrow{\text{downsample}} \\
&128\to 256\to 256\to 256\to 256 &&\xrightarrow{\text{downsample}} \\
&256\to 512\to 512\to 512\to 512 &&\xrightarrow{\text{downsample}} \\
&512\to 512\to 512\to 512\to 512 &&\xrightarrow{\text{downsample}}
\end{aligned}$$

where the arrow $c_{1}\to c_{2}$ means a $3\times 3$ convolution with $c_{1}$ input channels, $c_{2}$ output channels and stride 1, followed by ReLU activation. The arrow $\xrightarrow{\text{downsample}}$ means a down-sampling layer: $2\times 2$ max pooling with stride 2.
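The channel layout above is enough to reproduce the parameter count of roughly 144M quoted for VGG-19. The following is a rough sketch in plain Python. It assumes that every layer has a bias term and that the input is $224\times 224$, so the feature map entering the first fully connected layer is $512\times 7\times 7$. The names `VGG19_STAGES`, `conv_params` and `fc_params` are hypothetical.

```python
# Channel widths per stage, copied from the layout above.
VGG19_STAGES = [
    [3, 64, 64],
    [64, 128, 128],
    [128, 256, 256, 256, 256],
    [256, 512, 512, 512, 512],
    [512, 512, 512, 512, 512],
]


def conv_params(stages, kernel=3):
    """Count convolutional layers and their parameters (weights plus biases)."""
    n_layers = 0
    total = 0
    for widths in stages:
        for c_in, c_out in zip(widths, widths[1:]):
            total += kernel * kernel * c_in * c_out + c_out
            n_layers += 1
    return n_layers, total


def fc_params(sizes):
    """Parameters of a chain of fully connected layers (weights plus biases)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


n_conv, conv_total = conv_params(VGG19_STAGES)
# Assumption: a 224x224 input leaves a 512 x 7 x 7 map after five poolings.
fc_total = fc_params([512 * 7 * 7, 4096, 4096, 1000])
total = conv_total + fc_total

print(n_conv, conv_total, fc_total, total)
# 16 20024384 123642856 143667240
```

Under these assumptions, about 86% of the parameters sit in the three fully connected layers rather than in the 16 convolutional layers.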

Table of VGG models
Name	Number of convolutional layers	Number of fully connected layers	Parameter count
VGG-16	13	3	138M
VGG-19	16	3	144M

References

[1] Simonyan, Karen; Zisserman, Andrew (2015-04-10). "Very Deep Convolutional Networks for Large-Scale Image Recognition". arXiv:1409.1556.
[2] Dhillon, Anamika; Verma, Gyanendra K. (2020-06-01). "Convolutional neural network: a review of models, methodologies and applications to object detection". Progress in Artificial Intelligence. 9 (2): 85–112. doi:10.1007/s13748-019-00203-0. ISSN 2192-6360.
[3] "ILSVRC2014 Results". image-net.org. Retrieved 2024-09-06.
[4] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. arXiv:1512.03385. Bibcode:2016cvpr.confE...1H. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
[5] Gatys, Leon A.; Ecker, Alexander S.; Bethge, Matthias (2016). "Image Style Transfer Using Convolutional Neural Networks". IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2414–2423.
[6] Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 11976–11986. arXiv:2201.03545.
[7] Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "8.2. Networks Using Blocks (VGG)". Dive into Deep Learning. Cambridge University Press. ISBN 978-1-009-38943-3.
[8] Ding, Xiaohan; Zhang, Xiangyu; Ma, Ningning; Han, Jungong; Ding, Guiguang; Sun, Jian (2021). "RepVGG: Making VGG-Style ConvNets Great Again". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 13733–13742. arXiv:2101.03697.
[9] Lin, Min; Chen, Qiang; Yan, Shuicheng (2013). "Network In Network". arXiv:1312.4400 [cs.NE].
[10] "Very Deep Convolutional Networks for Large-Scale Visual Recognition". Computer Vision group from the University of Oxford. Retrieved 2024-09-06.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
