MobileNet

MobileNet
Developer(s)	Google
Initial release	April 2017
Stable release	v4 / September 2024
Repository	github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
Written in	Python
License	Apache License 2.0

Last updated November 06, 2024

MobileNet is a family of convolutional neural network (CNN) architectures for image classification, object detection, and other computer vision tasks. The models are designed for small size, low latency, and low power consumption, which makes them suitable for on-device inference and edge computing on resource-constrained devices such as mobile phones and embedded systems. They were originally designed to run efficiently on mobile devices with TensorFlow Lite.

Features

V1

MobileNetV1 was published in April 2017.^[1]^[2] Its main architectural innovation was the incorporation of depthwise separable convolutions. The depthwise separable convolution was first developed by Laurent Sifre during an internship at Google Brain in 2013 as an architectural variation on AlexNet to improve convergence speed and model size.^[3]

The depthwise separable convolution decomposes a single standard convolution into two convolutions: a depthwise convolution that filters each input channel independently and a pointwise convolution ( $1\times 1$ convolution) that combines the outputs of the depthwise convolution. This factorization significantly reduces computational cost.
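The two steps can be illustrated with a minimal pure-Python sketch. It assumes stride 1, no padding, and nested lists in channel-first layout; the function names are hypothetical and are not taken from any MobileNet code base.

```python
def depthwise_conv(x, dw_filters):
    """Filter each input channel independently.

    x: [C][H][W] input, dw_filters: [C][k][k], one k x k filter per channel.
    Assumes stride 1 and no padding.
    """
    k = len(dw_filters[0])
    h_out = len(x[0]) - k + 1
    w_out = len(x[0][0]) - k + 1
    return [
        [
            [
                sum(
                    x[c][i + di][j + dj] * dw_filters[c][di][dj]
                    for di in range(k)
                    for dj in range(k)
                )
                for j in range(w_out)
            ]
            for i in range(h_out)
        ]
        for c in range(len(x))
    ]


def pointwise_conv(x, pw_weights):
    """1 x 1 convolution: mix channels at every spatial position.

    x: [C][H][W], pw_weights: [N][C] for N output channels.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [sum(row[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
            for i in range(h)
        ]
        for row in pw_weights
    ]


def depthwise_separable_conv(x, dw_filters, pw_weights):
    """Depthwise filtering followed by pointwise channel mixing."""
    return pointwise_conv(depthwise_conv(x, dw_filters), pw_weights)


# Two 3 x 3 input channels, all-ones 3 x 3 depthwise filters, three output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
y = depthwise_separable_conv(x, [ones, ones], [[1, 1], [1, -1], [0, 2]])
print(y)  # [[[54]], [[36]], [[18]]]
```

The depthwise step never mixes channels, and the pointwise step never looks at neighbouring pixels. A standard convolution does both at once, which is why it needs a separate $k\times k$ filter for every pair of input and output channels.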

MobileNetV1 has two hyperparameters. The width multiplier $\alpha$ controls the number of channels in each layer: smaller values of $\alpha$ give smaller and faster models at the cost of reduced accuracy. The resolution multiplier $\rho$ controls the input resolution of the images: lower resolutions result in faster processing but potentially lower accuracy.
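The effect of the two multipliers on a single depthwise separable layer can be estimated by counting multiply-accumulate operations (MACs). The sketch below follows the cost expression in the MobileNetV1 paper, $D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$. The function names and the layer sizes are illustrative assumptions, and real implementations may round channel counts differently.

```python
def standard_layer_macs(k, m, n, d_f):
    """MACs of a standard k x k convolution: m input channels, n output
    channels, d_f x d_f output feature map."""
    return k * k * m * n * d_f * d_f


def separable_layer_macs(k, m, n, d_f, alpha=1.0, rho=1.0):
    """MACs of a depthwise separable layer scaled by width multiplier alpha
    and resolution multiplier rho. Truncating with int() is an assumption."""
    m_scaled = int(alpha * m)
    n_scaled = int(alpha * n)
    d_scaled = int(rho * d_f)
    depthwise = k * k * m_scaled * d_scaled * d_scaled
    pointwise = m_scaled * n_scaled * d_scaled * d_scaled
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 feature map.
standard = standard_layer_macs(3, 64, 128, 56)
base = separable_layer_macs(3, 64, 128, 56)
half_width = separable_layer_macs(3, 64, 128, 56, alpha=0.5)
half_res = separable_layer_macs(3, 64, 128, 56, rho=0.5)

print(standard)    # 231211008
print(base)        # 27496448
print(half_width)  # 7325696
print(half_res)    # 6874112
```

In this example the separable layer needs roughly 12% of the MACs of the standard convolution. Halving $\rho$ divides the cost by exactly four, and halving $\alpha$ reduces it by slightly less than four because the depthwise term scales only linearly with $\alpha$.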

V2

MobileNetV2 was first published in January 2018, and the paper was last revised in March 2019.^[4]^[5] It uses inverted residual layers and linear bottlenecks.

Inverted residuals modify the traditional residual block structure. Instead of compressing the input channels before the spatial convolution, the block first expands them with a $1\times 1$ convolution. The expansion is followed by a $3\times 3$ depthwise convolution and then a $1\times 1$ projection layer that reduces the number of channels back down. Because the depthwise convolution operates on a higher-dimensional feature space, the block preserves more information, while the skip connection joins the narrow bottleneck layers rather than the wide ones.

Linear bottlenecks remove the ReLU activation function that would normally follow the projection layer. The rationale is that a nonlinear activation loses information in low-dimensional spaces, which is especially problematic when the number of channels is already small.
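The layer sequence of an inverted residual block with a linear bottleneck can be summarized as a short plan. This is a descriptive sketch rather than a trainable implementation: the function name and the tuple layout are hypothetical, and the default expansion factor of 6 and the ReLU6 activations follow the MobileNetV2 paper.

```python
def inverted_residual_plan(c_in, c_out, stride=1, expansion=6):
    """Describe one inverted residual block.

    Returns (layers, use_skip), where each layer is a tuple
    (operation, input_channels, output_channels, activation).
    """
    hidden = c_in * expansion
    layers = [
        ("1x1 expand", c_in, hidden, "ReLU6"),       # widen the narrow input
        ("3x3 depthwise", hidden, hidden, "ReLU6"),  # spatial filtering on wide features
        ("1x1 project", hidden, c_out, "linear"),    # linear bottleneck: no activation
    ]
    # The skip connection joins the narrow ends, and only when shapes match.
    use_skip = stride == 1 and c_in == c_out
    return layers, use_skip


layers, use_skip = inverted_residual_plan(24, 24)
for op, cin, cout, act in layers:
    print(f"{op:14s} {cin:4d} -> {cout:4d}  {act}")
print("skip connection:", use_skip)
```

For a 24-channel input the depthwise convolution runs on 144 channels, and the block output returns to 24 channels without passing through a nonlinearity.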

V3

MobileNetV3 was published in 2019.^[6]^[7] The publication included MobileNetV3-Small, MobileNetV3-Large, and MobileNetEdgeTPU (optimized for Pixel 4). The architectures were found by a form of neural architecture search (NAS) that takes mobile latency into account in order to achieve a good trade-off between accuracy and latency.^[8]^[9] They use piecewise-linear approximations of the swish and sigmoid activation functions (called "h-swish" and "h-sigmoid" in the paper), squeeze-and-excitation modules,^[10] and the inverted bottlenecks of MobileNetV2.
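The two piecewise-linear activations can be written in a few lines. The sketch below uses the definitions from the MobileNetV3 paper, h-sigmoid(x) = ReLU6(x + 3) / 6 and h-swish(x) = x · ReLU6(x + 3) / 6, on scalar inputs; the function names are chosen here for illustration.

```python
import math


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def h_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid function."""
    return relu6(x + 3.0) / 6.0


def h_swish(x):
    """Piecewise-linear approximation of swish, x * sigmoid(x)."""
    return x * h_sigmoid(x)


def swish(x):
    """Reference swish for comparison."""
    return x / (1.0 + math.exp(-x))


for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={value:5.1f}  h_swish={h_swish(value):7.4f}  swish={swish(value):7.4f}")
```

The hard versions need only additions, multiplications, and clipping, which avoids evaluating an exponential on the device.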

V4

MobileNetV4 was first published in April 2024, and the paper was last revised in September 2024.^[11]^[12] The publication included a large number of architectures found by NAS. Compared to the architectural modules used in V3, the V4 series added the "universal inverted bottleneck", which includes both the inverted residual and the inverted bottleneck as special cases, and attention modules with multi-query attention.^[13]
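The core idea of multi-query attention is that every head has its own queries while all heads share a single set of keys and values. The following pure-Python sketch shows only that sharing. It omits the learned projections and any mobile-specific changes made in MobileNetV4, and the function names and tensor layout are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_attention(queries_per_head, keys, values):
    """Scaled dot-product attention with keys and values shared by all heads.

    queries_per_head: [H][T][d], keys: [S][d], values: [S][d_v].
    Returns [H][T][d_v].
    """
    d = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for head_queries in queries_per_head:  # each head has its own queries
        head_out = []
        for q in head_queries:
            scores = [
                sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
            ]
            weights = softmax(scores)
            head_out.append(
                [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
            )
        outputs.append(head_out)
    return outputs


# Two heads, one query each, attending over the same two key/value pairs.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
queries = [[[0.0, 0.0]], [[10.0, 0.0]]]
out = multi_query_attention(queries, keys, values)
print(out[0][0])  # [0.5, 0.5]
print(out[1][0])
```

Because the keys and values are stored once instead of once per head, the memory traffic for reading them is reduced by a factor equal to the number of heads.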

External links

"models/research/slim/nets/mobilenet at master · tensorflow/models". GitHub. Retrieved 2024-10-18.
"Keras documentation: MobileNet, MobileNetV2, and MobileNetV3". Keras. Retrieved 2024-10-18.

References

[1] Howard, Andrew G.; Zhu, Menglong; Chen, Bo; Kalenichenko, Dmitry; Wang, Weijun; Weyand, Tobias; Andreetto, Marco; Adam, Hartwig (2017-04-16), MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv:1704.04861
[2] "MobileNets: Open-Source Models for Efficient On-Device Vision". research.google. June 14, 2017. Retrieved 2024-10-18.
[3] Chollet, François (2017). "Xception: Deep Learning with Depthwise Separable Convolutions". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1800–1807. arXiv:1610.02357. doi:10.1109/CVPR.2017.195. ISBN 978-1-5386-0457-1.
[4] Sandler, Mark; Howard, Andrew; Zhu, Menglong; Zhmoginov, Andrey; Chen, Liang-Chieh (2019-03-21), MobileNetV2: Inverted Residuals and Linear Bottlenecks, arXiv:1801.04381
[5] "MobileNetV2: The Next Generation of On-Device Computer Vision Networks". research.google. April 3, 2018. Retrieved 2024-10-18.
[6] "Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and Mobi". research.google. November 13, 2019. Retrieved 2024-10-18.
[7] Howard, Andrew; Sandler, Mark; Chu, Grace; Chen, Liang-Chieh; Chen, Bo; Tan, Mingxing; Wang, Weijun; Zhu, Yukun; Pang, Ruoming; Vasudevan, Vijay; Le, Quoc V.; Adam, Hartwig (2019). "Searching for MobileNetV3": 1314–1324. arXiv:1905.02244.
[8] Tan, Mingxing; Chen, Bo; Pang, Ruoming; Vasudevan, Vijay; Sandler, Mark; Howard, Andrew; Le, Quoc V. (June 2019). "MnasNet: Platform-Aware Neural Architecture Search for Mobile". 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 2815–2823. arXiv:1807.11626. doi:10.1109/CVPR.2019.00293. ISBN 978-1-7281-3293-8.
[9] Yang, Tien-Ju; Howard, Andrew; Chen, Bo; Zhang, Xiao; Go, Alec; Sandler, Mark; Sze, Vivienne; Adam, Hartwig (2018). "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications": 285–300. arXiv:1804.03230.
[10] Hu, Jie; Shen, Li; Sun, Gang (2018). "Squeeze-and-Excitation Networks": 7132–7141.
[11] Qin, Danfeng; Leichner, Chas; Delakis, Manolis; Fornoni, Marco; Luo, Shixin; Yang, Fan; Wang, Weijun; Banbury, Colby; Ye, Chengxi (2024-09-29), MobileNetV4 -- Universal Models for the Mobile Ecosystem, arXiv:2404.10518
[12] Wightman, Ross. "MobileNet-V4 (now in timm)". huggingface.co. Retrieved 2024-10-18.
[13] Shazeer, Noam (2019-11-05), Fast Transformer Decoding: One Write-Head is All You Need, arXiv:1911.02150

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
