Inception (deep learning architecture)

Last updated
Inception
Original author(s) Google AI
Initial release2014
Stable release
v4 / 2017
Repository github.com/tensorflow/models/blob/master/research/slim/README.md
Type
License Apache 2.0

Inception [1] is a family of convolutional neural network (CNN) for computer vision, introduced by researchers at Google in 2014 as GoogLeNet (later renamed Inception v1). The series was historically important as an early CNN that separates the stem (data ingest), body (data processing), and head (prediction), an architectural design that persists in all modern CNN. [2]

Contents

Inception-v3 model. Inception-v3 model.png
Inception-v3 model.

Version history

Inception v1

GoogLeNet architecture. GoogLeNet architecture.svg
GoogLeNet architecture.

In 2014, a team at Google developed the GoogLeNet architecture, an instance of which won the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). [1] [3]

The name came from the LeNet of 1998, since both LeNet and GoogLeNet are CNNs. They also called it "Inception" after a "we need to go deeper" internet meme, a phrase from Inception (2010) the film. [1] Because later, more versions were released, the original Inception architecture was renamed again as "Inception v1".

The models and the code were released under Apache 2.0 license on GitHub. [4]

An individual Inception module. On the left is a standard module, and on the right is a dimension-reduced module. Inception-v3 model module.png
An individual Inception module. On the left is a standard module, and on the right is a dimension-reduced module.
A single Inception dimension-reduced module. Inception dimension-reduced module.svg
A single Inception dimension-reduced module.

The Inception v1 architecture is a deep CNN composed of 22 layers. Most of these layers were "Inception modules". The original paper stated that Inception modules are a "logical culmination" of Network in Network [5] and (Arora et al, 2014). [6]

Since Inception v1 is deep, it suffered from the vanishing gradient problem. The team solved it by using two "auxiliary classifiers", which are linear-softmax classifiers inserted at 1/3-deep and 2/3-deep within the network, and the loss function is a weighted sum of all three:

These were removed after training was complete. This was later solved by the ResNet architecture.

The architecture consists of three parts stacked on top of one another: [2]

This structure is used in most modern CNN architectures.

Inception v2

Inception v2 was released in 2015, in a paper that is more famous for proposing batch normalization. [7] [8] It had 13.6 million parameters.

It improves on Inception v1 by adding batch normalization, and removing dropout and local response normalization which they found became unnecessary when batch normalization is used.

Inception v3

Inception v3 was released in 2016. [7] [9] It improves on Inception v2 by using factorized convolutions.

As an example, a single 5×5 convolution can be factored into 3×3 stacked on top of another 3×3. Both has a receptive field of size 5×5. The 5×5 convolution kernel has 25 parameters, compared to just 18 in the factorized version. Thus, the 5×5 convolution is strictly more powerful than the factorized version. However, this power is not necessarily needed. Empirically, the research team found that factorized convolutions help.

It also uses a form of dimension-reduction by concatenating the output from a convolutional layer and a pooling layer. As an example, a tensor of size can be downscaled by a convolution with stride 2 to , and by maxpooling with pool size to . These are then concatenated to .

Other than this, it also removed the lowest auxiliary classifier during training. They found that the auxiliary head worked as a form of regularization.

They also proposed label-smoothing regularization in classification. For an image with label , instead of making the model to predict the probability distribution , they made the model predict the smoothed distribution where is the total number of classes.

Inception v4

In 2017, the team released Inception v4, Inception ResNet v1, and Inception ResNet v2. [10]

Inception v4 is an incremental update with even more factorized convolutions, and other complications that were empirically found to improve benchmarks.

Inception ResNet v1 and v2 are both modifications of Inception v4, where residual connections are added to each Inception module, inspired by the ResNet architecture. [11]

Xception

Xception ("Extreme Inception") was published in 2017. [12] It is a linear stack of depthwise separable convolution layers with residual connections. The design was proposed on the hypothesis that in a CNN, the cross-channels correlations and spatial correlations in the feature maps can be entirely decoupled.

Related Research Articles

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is a subset of machine learning that focuses on utilizing neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. Convolution-based networks are the de-facto standard in deep learning-based approaches to computer vision and image processing, and have only recently have been replaced -- in some cases -- by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.

<span class="mw-page-title-main">DeepDream</span> Software program

DeepDream is a computer vision program created by Google engineer Alexander Mordvintsev that uses a convolutional neural network to find and enhance patterns in images via algorithmic pareidolia, thus creating a dream-like appearance reminiscent of a psychedelic experience in the deliberately overprocessed images.

The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images. The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet. Since 2010, the ImageNet project runs an annual software contest, the ImageNet Large Scale Visual Recognition Challenge, where software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes.

<span class="mw-page-title-main">AlexNet</span> An influential convolutional neural network published in 2012

AlexNet is the name of a convolutional neural network (CNN) architecture, designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor at the University of Toronto. It had 60 million parameters and 650,000 neurons.

In machine learning, the Highway Network was the first working very deep feedforward neural network with hundreds of layers, much deeper than previous neural networks. It uses skip connections modulated by learned gating mechanisms to regulate information flow, inspired by long short-term memory (LSTM) recurrent neural networks. The advantage of the Highway Network over other deep learning architectures is its ability to overcome or partially prevent the vanishing gradient problem, thus improving its optimization. Gating mechanisms are used to facilitate information flow across the many layers.

<span class="mw-page-title-main">Residual neural network</span> Deep learning method

A residual neural network is a deep learning architecture in which the weight layers learn residual functions with reference to the layer inputs. It was developed in 2015 for image recognition and won that year's ImageNet Large Scale Visual Recognition Challenge.

The CIFAR-10 dataset is a collection of images that are commonly used to train machine learning and computer vision algorithms. It is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class.

In artificial neural networks, a convolutional layer is a type of network layer that applies a convolution operation to the input. Convolutional layers are some of the primary building blocks of convolutional neural networks (CNNs), a class of neural network most commonly applied to images, video, audio, and other data that have the property of uniform translational symmetry.

<span class="mw-page-title-main">Neural style transfer</span> Type of software algorithm for image manipulation

Neural style transfer (NST) refers to a class of software algorithms that manipulate digital images, or videos, in order to adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for the sake of image transformation. Common uses for NST are the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. This method has been used by artists and designers around the globe to develop new artwork based on existent style(s).

<span class="mw-page-title-main">LeNet</span> Convolutional neural network structure

LeNet is a series of convolutional neural network structure proposed by LeCun et al.. The earliest version, LeNet-1, was trained in 1989. In general, when "LeNet" is referred to without a number, it refers to LeNet-5 (1998), the most well-known version.

A graph neural network (GNN) belongs to a class of artificial neural networks for processing data that can be represented as graphs.

<span class="mw-page-title-main">Vision transformer</span> Variant of Transformer designed for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

<span class="mw-page-title-main">François Chollet</span> Machine learning researcher

François Chollet is a French software engineer and artificial intelligence researcher currently working at Google. Chollet is the creator of the Keras deep-learning library, released in 2015. His research focuses on computer vision, the application of machine learning to formal reasoning, abstraction, and how to achieve greater generality in artificial intelligence.

In machine learning, the term tensor informally refers to two different concepts for organizing and representing data. Data may be organized in a multidimensional array (M-way array), informally referred to as a "data tensor"; however, in the strict mathematical sense, a tensor is a multilinear mapping over a set of domain vector spaces to a range vector space. Observations, such as images, movies, volumes, sounds, and relationships among words and concepts, stored in an M-way array ("data tensor"), may be analyzed either by artificial neural networks or tensor methods.

In machine learning, normalization is a statistical technique with various applications. There are mainly two forms of normalization, data normalization and activation normalization. Data normalization, or feature scaling, is a general technique in statistics, and it includes methods that rescale input data so that they have well-behaved range, mean, variance, and other statistical properties. Activation normalization is specific to deep learning, and it includes methods that rescale the activation of hidden neurons inside a neural network.

<span class="mw-page-title-main">VGGNet</span> Series of convolutional neural networks for image classification

The VGGNets are a series of convolutional neural networks (CNNs) developed by the Visual Geometry Group (VGG) at the University of Oxford.

The Latent Diffusion Model (LDM) is a diffusion model architecture developed by the CompVis group at LMU Munich.

In deep learning, weight initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training: weight initalization is the pre-training step of assigning initial values to these parameters.

MobileNet is a family of convolutional neural network (CNN) architectures designed for image classification, object detection, and other computer vision tasks. They are designed for small size, low latency, and low power consumption, making them suitable for on-device inference and edge computing on resource-constrained devices like mobile phones and embedded systems. They were originally designed to be run efficiently on mobile devices with TensorFlow Lite.

References

  1. 1 2 3 Szegedy, Christian; Wei Liu; Yangqing Jia; Sermanet, Pierre; Reed, Scott; Anguelov, Dragomir; Erhan, Dumitru; Vanhoucke, Vincent; Rabinovich, Andrew (June 2015). "Going deeper with convolutions". 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 1–9. arXiv: 1409.4842 . doi:10.1109/CVPR.2015.7298594. ISBN   978-1-4673-6964-0.
  2. 1 2 Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "8.4. Multi-Branch Networks (GoogLeNet)". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN   978-1-009-38943-3.
  3. Official repo of Inception V1 on Kaggle, published by Google.
  4. "google/inception". Google. 2024-08-19. Retrieved 2024-08-19.
  5. Lin, Min; Chen, Qiang; Yan, Shuicheng (2014-03-04). "Network In Network". arXiv: 1312.4400 [cs.NE].
  6. Arora, Sanjeev; Bhaskara, Aditya; Ge, Rong; Ma, Tengyu (2014-01-27). "Provable Bounds for Learning Some Deep Representations". Proceedings of the 31st International Conference on Machine Learning. PMLR: 584–592.
  7. 1 2 Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jon; Wojna, Zbigniew (2016). "Rethinking the Inception Architecture for Computer Vision". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 2818–2826.
  8. Official repo of Inception V2 on Kaggle, published by Google.
  9. Official repo of Inception V3 on Kaggle, published by Google.
  10. Szegedy, Christian; Ioffe, Sergey; Vanhoucke, Vincent; Alemi, Alexander (2017-02-12). "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". Proceedings of the AAAI Conference on Artificial Intelligence. 31 (1). arXiv: 1602.07261 . doi:10.1609/aaai.v31i1.11231. ISSN   2374-3468.
  11. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 Dec 2015). Deep Residual Learning for Image Recognition. arXiv: 1512.03385 .
  12. Chollet, Francois (2017). "Xception: Deep Learning With Depthwise Separable Convolutions". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 1251–1258.