AlexNet

Developer(s) Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
Initial release Jun 28, 2011
Repository code.google.com/archive/p/cuda-convnet/
Written in CUDA, C++
Type Convolutional neural network
License New BSD License
AlexNet architecture and a possible modification. On the top is half of the original AlexNet (which is split into two halves, one per GPU). On the bottom is the same architecture but with the last "projection" layer replaced by another one that projects to fewer outputs. If one freezes the rest of the model and fine-tunes only the last layer, one can obtain another vision model at a much lower cost than training one from scratch.
AlexNet block diagram

AlexNet is a convolutional neural network (CNN) architecture designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who in 2012 was Krizhevsky's Ph.D. advisor at the University of Toronto. It had 60 million parameters and 650,000 neurons. [1]

The original paper's primary result was that the model's depth was essential for its high performance. Training such a deep model was computationally expensive, but was made feasible by using graphics processing units (GPUs). [1]

The three formed team SuperVision and submitted AlexNet in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012. [2] The network achieved a top-5 error of 15.3%, more than 10.8 percentage points better than that of the runner-up.
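The top-5 error counts an image as misclassified only when the correct label is absent from the model's five highest-scoring classes. A minimal sketch of that metric follows; the function name `top5_error` and the toy scores are illustrative assumptions, not part of the ILSVRC tooling.

```python
def top5_error(scores_per_image, true_labels):
    """Fraction of images whose true label is not among the 5 highest-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices ordered from highest to lowest score; keep the best five.
        top5 = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:5]
        if label not in top5:
            misses += 1
    return misses / len(true_labels)


# Toy example with 6 classes and 4 images (made-up scores).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.05, 0.10],  # true label 1 has the highest score
    [0.30, 0.10, 0.20, 0.15, 0.05, 0.20],  # true label 4 has the lowest score: miss
    [0.00, 0.10, 0.20, 0.30, 0.40, 0.00],  # true label 2 has the third-highest score
    [0.60, 0.10, 0.10, 0.10, 0.07, 0.03],  # true label 5 has the lowest score: miss
]
labels = [1, 4, 2, 5]
error = top5_error(scores, labels)
```

With 1,000 ImageNet classes the same function applies unchanged; only the length of each score list grows.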

The architecture influenced a large body of subsequent work in deep learning, especially in applying neural networks to computer vision.

Architecture

AlexNet contains eight layers: the first five are convolutional layers, some of them followed by max-pooling layers, and the last three are fully connected layers. The network, except the last layer, is split into two copies, each run on one GPU. [1] The entire structure can be written as

(CNN → RN → MP)² → (CNN³ → MP) → (FC → DO)² → Linear → softmax

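where CNN denotes a convolutional layer (with ReLU activation), RN local response normalization, MP max-pooling, FC a fully connected layer (with ReLU activation), DO dropout, and Linear a fully connected layer without activation.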

It used the non-saturating ReLU activation function, which trained better than tanh and sigmoid. [1]
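The compact structure formula above can be expanded mechanically into a flat layer sequence. The sketch below does so with plain Python lists; the helper name `expand` is hypothetical, and the abbreviations are read as in the formula (a superscript means repetition).

```python
def expand(block, times):
    """Repeat a block of layer names, mirroring the superscripts in the formula."""
    return list(block) * times


# (CNN -> RN -> MP)^2 -> (CNN^3 -> MP) -> (FC -> DO)^2 -> Linear -> softmax
layers = (
    expand(["CNN", "RN", "MP"], 2)
    + expand(["CNN"], 3) + ["MP"]
    + expand(["FC", "DO"], 2)
    + ["Linear", "softmax"]
)

# Layers that carry learned weights: convolutional and fully connected ones.
conv_count = layers.count("CNN")
weight_layers = sum(layers.count(kind) for kind in ("CNN", "FC", "Linear"))
```

Counting only the weight-carrying entries recovers the eight layers described above: five convolutional and three fully connected.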

Because the network did not fit onto a single NVIDIA GTX 580 3GB GPU, it was split into two halves, one on each GPU. [1] :Section 3.2
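The parameter count of roughly 60 million can be checked from the layer shapes. The sketch below assumes the kernel sizes and channel counts given in the original paper (they are not listed in this article), including the detail that, because of the two-GPU split, the second, fourth, and fifth convolutional layers see only half of the previous layer's channels. The helper names are illustrative.

```python
# (name, kernel_height, kernel_width, input_channels_seen_by_each_kernel, num_kernels)
# Shapes assumed from the original paper; conv2, conv4 and conv5 are connected
# only to the feature maps on the same GPU, hence 48 and 192 input channels.
CONV_LAYERS = [
    ("conv1", 11, 11, 3, 96),
    ("conv2", 5, 5, 48, 256),
    ("conv3", 3, 3, 256, 384),
    ("conv4", 3, 3, 192, 384),
    ("conv5", 3, 3, 192, 256),
]

# (name, inputs, outputs); the first fully connected layer reads a 6x6x256 volume.
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def conv_params(kh, kw, c_in, c_out):
    """Weights plus one bias per kernel."""
    return kh * kw * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


conv_total = sum(conv_params(kh, kw, ci, co) for _, kh, kw, ci, co in CONV_LAYERS)
fc_total = sum(fc_params(n_in, n_out) for _, n_in, n_out in FC_LAYERS)
total_params = conv_total + fc_total
```

Under these assumptions the fully connected layers account for the large majority of the parameters, and the total comes to about 61 million, consistent with the rounded figure of 60 million.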

Training

The ImageNet training set had 1.2 million images. The model was trained for 90 epochs, which took five to six days on two NVIDIA GTX 580 3GB GPUs, [1] each of which has a theoretical performance of 1.581 TFLOPS in float32 and had a release price of 500 USD. [3] One forward pass of AlexNet takes about 4 GFLOPs. [4]
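The figures above imply an average training throughput that is easy to work out. The sketch below uses only the numbers stated in this section (1.2 million images, 90 epochs, five to six days); the variable and function names are illustrative.

```python
TRAIN_IMAGES = 1_200_000
EPOCHS = 90
SECONDS_PER_DAY = 24 * 60 * 60

# Every epoch visits each training image once.
images_seen = TRAIN_IMAGES * EPOCHS


def throughput(days):
    """Average images processed per second if training takes `days` days."""
    return images_seen / (days * SECONDS_PER_DAY)


fast = throughput(5)  # images per second for a five-day run
slow = throughput(6)  # images per second for a six-day run
```

This gives roughly 210 to 250 training images per second across the two GPUs, counting both the forward and backward pass for each image.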

It was trained with momentum gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The learning rate started at 0.01 and was manually decreased 10-fold whenever validation error appeared to stop decreasing. It was reduced three times during training, ending at 0.00001.
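The update rule can be sketched in a few lines. The version below follows the form given in the original paper, where weight decay is scaled by the learning rate and folded into the velocity; the initial learning rate of 0.01 is likewise taken from the paper. The function names and the toy one-parameter problem are assumptions for illustration.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad, then w <- w + v."""
    new_v = [
        momentum * vi - weight_decay * lr * wi - lr * gi
        for wi, vi, gi in zip(w, v, grad)
    ]
    new_w = [wi + nvi for wi, nvi in zip(w, new_v)]
    return new_w, new_v


def learning_rates(initial=0.01, reductions=3, factor=10.0):
    """Learning rate before any reduction and after each 10-fold reduction."""
    return [initial / factor ** k for k in range(reductions + 1)]


# Toy problem: minimise f(w) = 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, v, grad=list(w), lr=0.01)
schedule = learning_rates()
```

In the actual training run the reductions were triggered by hand when validation error plateaued, so the schedule here lists only the values the learning rate took, not when it changed.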

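It used two forms of data augmentation, both computed on the fly on the CPU, thus "computationally free": random 224×224 crops of the 256×256 training images together with their horizontal reflections, and random perturbation of the RGB channel intensities along the principal components of the training set's pixel values. [1]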
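The original paper describes the first form of augmentation as taking random 224×224 patches, and their horizontal reflections, from 256×256 training images. A minimal sketch of that step on a nested-list image follows; the function name and the use of a seeded `random.Random` are illustrative assumptions, and the colour perturbation is not shown.

```python
import random


def random_crop_and_flip(image, crop=224, rng=random):
    """Return a random crop of size crop x crop, mirrored left-right half the time."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # inclusive bounds
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch


# A 256x256 stand-in image whose "pixels" record their own (row, column).
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_and_flip(image, rng=random.Random(0))
```

Because each call draws a fresh offset and flip, the network rarely sees exactly the same input twice, which is what makes the augmentation act as a regularizer.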

It used local response normalization, and dropout regularization with drop probability 0.5.
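Both operations are simple to state in code. The sketch below applies local response normalization across the channels at a single spatial position, using the constants reported in the original paper (k = 2, n = 5, α = 0.0001, β = 0.75), and implements dropout as the paper describes it: units are zeroed with probability 0.5 during training, and outputs are multiplied by 0.5 at test time. Function names are illustrative.

```python
import random


def local_response_norm(activations, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Divide each channel by a power of the summed squares of its n nearest channels."""
    num_channels = len(activations)
    half = n // 2
    normalized = []
    for i, a in enumerate(activations):
        lo, hi = max(0, i - half), min(num_channels - 1, i + half)
        energy = sum(x * x for x in activations[lo:hi + 1])
        normalized.append(a / (k + alpha * energy) ** beta)
    return normalized


def dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p in training; scale by (1 - p) at test time."""
    if training:
        return [0.0 if rng.random() < p else v for v in values]
    return [v * (1.0 - p) for v in values]


channels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
normalized = local_response_norm(channels)
train_out = dropout(channels, rng=random.Random(0))
test_out = dropout(channels, training=False)
```

Many later frameworks use "inverted" dropout, which rescales during training instead of at test time; the version above keeps the paper's convention.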

All weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. Biases in convolutional layers 2, 4, and 5, and in all fully connected layers, were initialized to the constant 1 to avoid the dying ReLU problem.
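As a small illustration of this initialization scheme, the sketch below draws weights from a zero-mean Gaussian with standard deviation 0.01 and picks the bias constant per convolutional layer. The helper names are hypothetical, and the bias of 0 for the remaining convolutional layers (1 and 3) is taken from the original paper rather than from the text above.

```python
import random

# Convolutional layers whose biases start at 1, as listed above.
CONV_LAYERS_WITH_UNIT_BIAS = {2, 4, 5}


def init_weights(count, rng, std=0.01):
    """Draw `count` weights from a Gaussian with mean 0 and the given standard deviation."""
    return [rng.gauss(0.0, std) for _ in range(count)]


def conv_bias_value(layer_index):
    """Initial bias constant for a convolutional layer (1-based index)."""
    return 1.0 if layer_index in CONV_LAYERS_WITH_UNIT_BIAS else 0.0


rng = random.Random(0)
weights = init_weights(10_000, rng)
biases = [conv_bias_value(i) for i in range(1, 6)]
```

A positive initial bias means the ReLU units in those layers start with positive inputs, so they receive non-zero gradients early in training.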

History

Previous work

Comparison of the LeNet and AlexNet convolution, pooling, and dense layers
(The AlexNet input size should be 227×227×3 rather than 224×224×3 for the arithmetic to work out. The original paper gives 224×224×3, but Andrej Karpathy, the former head of computer vision at Tesla, said it should be 227×227×3 and noted that Krizhevsky did not explain the 224×224×3 figure. The first convolution, 11×11 with stride 4, then yields 55×55×96 instead of 54×54×96. It is calculated as: [(input width 227 - kernel width 11) / stride 4] + 1 = [(227 - 11) / 4] + 1 = 55. Because the input and the kernel are both square, the output height equals its width, giving 55×55.)
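The output-size arithmetic in the caption generalizes to every layer. The sketch below traces the spatial size through the network starting from 227. The 3×3 pooling windows with stride 2 come from the original paper; the padding values for the later convolutions are assumptions taken from common reimplementations, and the layer names are illustrative.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding); padding values are assumed, not from the article.
SPATIAL_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

size = 227
trace = []
for name, kernel, stride, padding in SPATIAL_LAYERS:
    size = out_size(size, kernel, stride, padding)
    trace.append((name, size))

# With a 224-wide input, (224 - 11) is not divisible by 4 and the result is 54.
size_from_224 = out_size(224, 11, stride=4)
```

Under these assumptions the final pooled feature map is 6×6, which is the volume that feeds the first fully connected layer.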

AlexNet is a convolutional neural network. In 1980, Kunihiko Fukushima proposed an early CNN named the neocognitron. [5] [6] It was trained by an unsupervised learning algorithm. LeNet (Yann LeCun et al., 1989–1998) [7] [8] was trained by supervised learning with the backpropagation algorithm, with an architecture that is essentially the same as AlexNet on a small scale. Weng et al. (1993) added max-pooling. [9] [10]

During the 2000s, as GPU hardware improved, some researchers adapted GPUs for general-purpose computing, including neural network training. Chellapilla et al. (2006) trained a CNN on a GPU that was 4 times faster than an equivalent CPU implementation. [11] A deep CNN by Dan Cireșan et al. (2011) at IDSIA was 60 times faster than an equivalent CPU implementation. [12] Between May 15, 2011, and September 10, 2012, their CNN won four image competitions and achieved state-of-the-art results on multiple image databases. [13] [14] [15] According to the AlexNet paper, [1] Cireșan's earlier net is "somewhat similar." Both were written with CUDA to run on GPUs.

Computer vision

During the 1990–2010 period, neural networks were not better than other machine learning methods such as kernel regression, support vector machines, AdaBoost, and structured estimation, [16] among others. For computer vision in particular, much progress came from manual feature engineering, such as SIFT features, SURF features, HoG features, and bags of visual words. It was a minority position in computer vision that features can be learned directly from data, a position which became dominant after AlexNet. [17]

In 2011, Geoffrey Hinton began asking colleagues, "What do I have to do to convince you that neural networks are the future?" Jitendra Malik, a sceptic of neural networks, recommended the PASCAL Visual Object Classes challenge. Hinton said its dataset was too small, so Malik recommended the ImageNet challenge instead. [18]

While AlexNet and LeNet share essentially the same design and algorithm, AlexNet is much larger than LeNet and was trained on a much larger dataset on much faster hardware. Over the period of 20 years, both data and compute became cheaply available. [17]

Subsequent work

AlexNet has been highly influential, prompting much subsequent work on using CNNs for computer vision and using GPUs to accelerate deep learning. As of mid 2024, the AlexNet paper has been cited over 157,000 times according to Google Scholar. [19]

At the time of publication, there was no framework available for GPU-based neural network training and inference. The codebase for AlexNet was released under a BSD license, and had been commonly used in neural network research for several subsequent years. [20] [17]

In one direction, subsequent works aimed to train increasingly deep CNNs that achieve increasingly higher performance on ImageNet. In this line of research are GoogLeNet (2014), VGGNet (2014), Highway network (2015), and ResNet (2015). Another direction aimed to reproduce the performance of AlexNet at a lower cost. In this line of research are SqueezeNet (2016), MobileNet (2017), and EfficientNet (2019).

References

  1. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017-05-24). "ImageNet classification with deep convolutional neural networks" (PDF). Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782. S2CID 195908774.
  2. "ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012)". image-net.org.
  3. "NVIDIA GeForce GTX 580 Specs". TechPowerUp. 2024-11-12. Retrieved 2024-11-12.
  4. pypi.org. https://pypi.org/project/calflops/. Retrieved 2024-12-10.
  5. Fukushima, K. (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717.
  6. Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved 16 November 2013.
  7. LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; Jackel, L. D. (1989). "Backpropagation Applied to Handwritten Zip Code Recognition" (PDF). Neural Computation. 1 (4). MIT Press - Journals: 541–551. doi:10.1162/neco.1989.1.4.541. ISSN 0899-7667. OCLC 364746139.
  8. LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. CiteSeerX 10.1.1.32.9552. doi:10.1109/5.726791. S2CID 14542261. Retrieved October 7, 2016.
  9. Weng, J; Ahuja, N; Huang, TS (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". Proc. 4th International Conf. Computer Vision: 121–128.
  10. Schmidhuber, Jürgen (2015). "Deep Learning". Scholarpedia. 10 (11): 1527–54. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
  11. Kumar Chellapilla; Sidd Puri; Patrice Simard (2006). "High Performance Convolutional Neural Networks for Document Processing". In Lorette, Guy (ed.). Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft.
  12. Cireșan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gambardella; Jurgen Schmidhuber (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Two. 2: 1237–1242. Retrieved 17 November 2013.
  13. "IJCNN 2011 Competition result table". OFFICIAL IJCNN2011 COMPETITION. 2010. Retrieved 2019-01-14.
  14. Schmidhuber, Jürgen (17 March 2017). "History of computer vision contests won by deep CNNs on GPU". Retrieved 14 January 2019.
  15. Cireșan, Dan; Meier, Ueli; Schmidhuber, Jürgen (June 2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition. New York, NY: Institute of Electrical and Electronics Engineers (IEEE). pp. 3642–3649. arXiv:1202.2745. CiteSeerX 10.1.1.300.3283. doi:10.1109/CVPR.2012.6248110. ISBN 978-1-4673-1226-4. OCLC 812295155. S2CID 2161592.
  16. Taskar, Ben; Guestrin, Carlos; Koller, Daphne (2003). "Max-Margin Markov Networks". Advances in Neural Information Processing Systems. 16. MIT Press.
  17. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "8.1. Deep Convolutional Neural Networks (AlexNet)". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
  18. Li, Fei Fei (2023). The worlds I see: curiosity, exploration, and discovery at the dawn of AI (First ed.). New York: Moment of Lift Books; Flatiron Books. ISBN 978-1-250-89793-0.
  19. AlexNet paper on Google Scholar
  20. Krizhevsky, Alex (July 18, 2014). "cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks". Google Code Archive. Retrieved 2024-10-20.