MobileNet is a family of convolutional neural network (CNN) architectures designed for image classification, object detection, and other computer vision tasks. They are designed for small size, low latency, and low power consumption, making them suitable for on-device inference and edge computing on resource-constrained devices like mobile phones and embedded systems. They were originally designed to run efficiently on mobile devices with TensorFlow Lite.
The need for efficient deep learning models on mobile devices led researchers at Google to develop MobileNet. The family has four versions, each improving upon the previous one in terms of performance and efficiency.
Features
V1
MobileNetV1 was published in April 2017. Its main architectural innovation was the incorporation of depthwise separable convolutions. The depthwise separable convolution was first developed by Laurent Sifre during an internship at Google Brain in 2013, as an architectural variation on AlexNet to improve convergence speed and model size.
The depthwise separable convolution decomposes a single standard convolution into two convolutions: a depthwise convolution that filters each input channel independently and a pointwise (1×1) convolution that combines the outputs of the depthwise convolution. This factorization significantly reduces computational cost.
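The cost savings can be made concrete with a back-of-the-envelope count of multiply-accumulate operations. The sketch below uses the kernel-size/channel notation of the MobileNetV1 paper; the layer sizes in the example are illustrative, not taken from the paper.

```python
# Multiply-accumulate cost of a standard convolution versus a depthwise
# separable one. Notation: kernel size Dk, input channels M, output
# channels N, square output feature map of side Df.
def standard_conv_cost(Dk, M, N, Df):
    return Dk * Dk * M * N * Df * Df

def separable_conv_cost(Dk, M, N, Df):
    depthwise = Dk * Dk * M * Df * Df  # one Dk x Dk filter per input channel
    pointwise = M * N * Df * Df        # 1x1 convolution mixing channels
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
std = standard_conv_cost(3, 64, 128, 56)
sep = separable_conv_cost(3, 64, 128, 56)
print(round(std / sep, 2))  # 8.41
```

The separable cost works out to 1/N + 1/Dk² of the standard cost, which is why 3×3 depthwise separable layers are roughly 8-9 times cheaper.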
MobileNetV1 has two hyperparameters: a width multiplier α, which controls the number of channels in each layer (smaller values of α lead to smaller and faster models, at the cost of reduced accuracy), and a resolution multiplier ρ, which controls the input resolution of the images. Lower resolutions result in faster processing but potentially lower accuracy.
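To illustrate how the two multipliers trade accuracy for compute, here is a small sketch with hypothetical layer sizes (the helper `separable_cost` is defined here for illustration, not taken from any MobileNet implementation):

```python
def separable_cost(Dk, M, N, Df, alpha=1.0, rho=1.0):
    """Cost of one depthwise separable layer under width multiplier
    alpha (thins every layer's channels) and resolution multiplier
    rho (shrinks the feature map)."""
    M, N = int(alpha * M), int(alpha * N)
    Df = int(rho * Df)
    return Dk * Dk * M * Df * Df + M * N * Df * Df

full = separable_cost(3, 64, 128, 56)
print(separable_cost(3, 64, 128, 56, alpha=0.5) / full)  # ~0.27: roughly alpha^2
print(separable_cost(3, 64, 128, 56, rho=0.5) / full)    # 0.25: exactly rho^2
```

Both multipliers reduce cost roughly quadratically, since α thins both the input and output channels of each layer and ρ shrinks both spatial dimensions of the feature map.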
V2
MobileNetV2 was published in 2018. It uses inverted residual layers and linear bottlenecks.
Inverted residuals modify the traditional residual block structure. Instead of compressing the input channels before the depthwise convolution, they ''expand'' them. This expansion is followed by a depthwise convolution and then a projection layer that reduces the number of channels back down. This inverted structure maintains representational capacity by letting the depthwise convolution operate in a higher-dimensional feature space, preserving more information as it flows through the block.
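The channel bookkeeping of such a block can be traced in a few lines (a toy sketch assuming the paper's default expansion factor of 6; the function name is invented for illustration):

```python
def inverted_residual_channels(c_in, expansion=6, c_out=None):
    """Trace channel counts through expand -> depthwise -> project."""
    c_out = c_in if c_out is None else c_out
    expanded = c_in * expansion  # 1x1 conv widens the representation
    after_dw = expanded          # depthwise conv keeps the channel count
    projected = c_out            # linear 1x1 conv narrows it back down
    return [c_in, expanded, after_dw, projected]

print(inverted_residual_channels(24))  # [24, 144, 144, 24]
```

Where a classic residual block is wide-narrow-wide, the inverted block is narrow-wide-narrow, with the skip connection joining the two narrow ends.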
Linear bottlenecks remove the typical ReLU activation function from the projection layers. The rationale is that nonlinear activations lose information in low-dimensional spaces, which is problematic when the number of channels is already small.
V3
MobileNetV3 was published in 2019. The publication included MobileNetV3-Small, MobileNetV3-Large, and MobileNetEdgeTPU (optimized for the Pixel 4). They were found by a form of neural architecture search (NAS) that takes mobile latency into account, to achieve a good trade-off between accuracy and latency. The architecture uses piecewise-linear approximations of the swish and sigmoid activation functions (which the authors called "h-swish" and "h-sigmoid"), squeeze-and-excitation modules, and the inverted bottlenecks of MobileNetV2.
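The two "hard" activations are simple to state: h-sigmoid(x) = ReLU6(x + 3)/6 and h-swish(x) = x · h-sigmoid(x). A NumPy sketch:

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_sigmoid(x):
    # piecewise-linear stand-in for the sigmoid: 0 below -3, 1 above +3
    return relu6(x + 3.0) / 6.0

def h_swish(x):
    # piecewise-linear stand-in for swish (x * sigmoid(x))
    return x * h_sigmoid(x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(h_sigmoid(x))  # values: 0, 1/3, 0.5, 2/3, 1
```

Because they avoid the exponential, these approximations are cheap on mobile hardware and quantize well.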
V4
MobileNetV4 was published in September 2024. The publication included a large number of architectures found by NAS.
Inspired by Vision Transformers, the V4 series included multi-query attention.
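Multi-query attention shares a single key and value projection across all query heads, shrinking the key/value memory compared with standard multi-head attention. A minimal NumPy sketch (shapes and names are illustrative, not taken from the MobileNetV4 code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv, num_heads):
    """x: (seq, d_model). Wq: (d_model, num_heads * d_head).
    Wk, Wv: (d_model, d_head) -- one shared key/value head."""
    seq, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(seq, num_heads, d_head)  # per-head queries
    k = x @ Wk                                    # single shared key
    v = x @ Wv                                    # single shared value
    out = []
    for h in range(num_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)
        out.append(softmax(scores) @ v)
    return np.concatenate(out, axis=-1)           # (seq, num_heads * d_head)
```

Each head still has its own query projection, but all heads attend over the same keys and values, which is what makes the scheme cheap on memory-bound mobile accelerators.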
It also unified both inverted residual and inverted bottleneck from the V3 series with the "universal inverted bottleneck", which includes these two as special cases.
See also
* Convolutional neural network
* Deep learning
* TensorFlow Lite