An Overview of MobileNet Architectures for Efficient Computer Vision
MobileNet is a family of convolutional neural networks designed for efficient inference on mobile and embedded devices. The core innovation across versions is the use of depthwise separable convolutions to drastically reduce computational cost and model size.
MobileNet V1
The primary contribution of MobileNet V1 is the introduction of the depthwise separable convolution. This operation factorizes a standard convolution into a depthwise convolution (applying a single filter per input channel) followed by a pointwise convolution (a 1x1 convolution to combine channel outputs). This factorization reduces computation and parameters significantly compared to standard convolutions.
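To make the savings concrete, here is a back-of-the-envelope comparison using the cost formulas from the MobileNet V1 paper; the layer sizes are illustrative, not taken from the paper:

# Rough cost comparison for one 3x3 conv layer (illustrative sizes).
# Standard conv: K*K*M*N multiply-adds per output position;
# depthwise separable: K*K*M + M*N.
K, M, N = 3, 128, 256   # kernel size, input channels, output channels
H = W = 56              # spatial size of the output feature map (assumed)

standard_macs = K * K * M * N * H * W
separable_macs = (K * K * M + M * N) * H * W

print(f"standard:  {standard_macs:,} multiply-adds")
print(f"separable: {separable_macs:,} multiply-adds")
print(f"reduction: {standard_macs / separable_macs:.1f}x")  # ~ 1/(1/N + 1/K^2)

For a 3x3 kernel the reduction factor is roughly 1/(1/N + 1/9), which approaches 8-9x for typical channel counts.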
Two hyperparameters are introduced for further model tuning:
- Width Multiplier (α): A multiplier applied to the number of channels in each layer, controlling the model's width and directly reducing parameters.
- Resolution Multiplier (ρ): A multiplier applied to the input image dimensions, reducing computational cost without changing the parameter count (see the sketch after this list).
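The sketch below shows how the two multipliers scale the cost of a depthwise separable layer; the helper function and the numbers are illustrative assumptions, not from the paper:

# Cost of a depthwise separable layer under alpha (width) and rho (resolution):
# K*K*(alpha*M)*(rho*H)*(rho*W) + (alpha*M)*(alpha*N)*(rho*H)*(rho*W)
def separable_macs(M, N, H, W, K=3, alpha=1.0, rho=1.0):
    M, N = int(alpha * M), int(alpha * N)
    H, W = int(rho * H), int(rho * W)
    return K * K * M * H * W + M * N * H * W

base  = separable_macs(128, 256, 56, 56)
slim  = separable_macs(128, 256, 56, 56, alpha=0.5)  # ~alpha^2 cost, fewer params
small = separable_macs(128, 256, 56, 56, rho=0.5)    # ~rho^2 cost, same params
print(base / slim, base / small)  # both roughly 4x

Both multipliers cut cost roughly quadratically, but only α shrinks the parameter count, which matches the distinction drawn in the list above.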
Experimental results on ImageNet demonstrated that the depthwise separable model achieved accuracy close to its full-convolution counterpart while using roughly 8 to 9 times fewer multiply-add operations and about 7 times fewer parameters.
PyTorch Implementation of MobileNet V1
import torch
import torch.nn as nn

class LightweightNetV1(nn.Module):
    def __init__(self, num_classes=1000):
        super(LightweightNetV1, self).__init__()

        def standard_block(in_c, out_c, stride_val):
            return nn.Sequential(
                nn.Conv2d(in_c, out_c, 3, stride_val, padding=1, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(inplace=True)
            )

        def depthwise_block(in_c, out_c, stride_val):
            return nn.Sequential(
                # Depthwise convolution
                nn.Conv2d(in_c, in_c, 3, stride_val, padding=1, groups=in_c, bias=False),
                nn.BatchNorm2d(in_c),
                nn.ReLU(inplace=True),
                # Pointwise convolution
                nn.Conv2d(in_c, out_c, 1, 1, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(inplace=True)
            )

        self.features = nn.Sequential(
            standard_block(3, 32, 2),
            depthwise_block(32, 64, 1),
            depthwise_block(64, 128, 2),
            depthwise_block(128, 128, 1),
            depthwise_block(128, 256, 2),
            depthwise_block(256, 256, 1),
            depthwise_block(256, 512, 2),
            depthwise_block(512, 512, 1),
            depthwise_block(512, 512, 1),
            depthwise_block(512, 512, 1),
            depthwise_block(512, 512, 1),
            depthwise_block(512, 512, 1),
            depthwise_block(512, 1024, 2),
            depthwise_block(1024, 1024, 1),
            nn.AdaptiveAvgPool2d(1)
        )
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
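A quick smoke test of the class above, using the 224x224 input size from the paper:

model = LightweightNetV1()
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 1000])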
MobileNet V2
MobileNet V2 introduces two key architectural improvements: Linear Bottlenecks and Inverted Residual blocks.
Linear Bottlenecks: Analysis shows that non-linear activation functions like ReLU can cause significant information loss when applied to low-dimensional representations. MobileNet V2 replaces the final ReLU in the bottleneck with a linear activation to preserve information in these compressed layers.
Inverted Residuals: Unlike traditional residual bottlenecks that first reduce and then expand channels, MobileNet V2 uses an inverted pattern: expand channels with a 1x1 convolution, apply a depthwise convolution, then project back to a lower dimension with another 1x1 convolution. This structure ensures that the non-linear transformations happen in a higher-dimensional space where information loss is minimized. A shortcut connection is added only when the stride is 1 and the input and output have the same number of channels.
A basic building block can be represented as:
# Conceptual structure of an Inverted Residual block
Input -> 1x1 Conv (Expand) -> ReLU6 -> 3x3 Depthwise Conv -> ReLU6 -> 1x1 Conv (Project, linear) -> Output
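A minimal PyTorch sketch of such a block, assuming the default expansion factor of 6 from the V2 paper; the class and parameter names are my own, and this is not the torchvision implementation:

import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_c, out_c, stride, expand_ratio=6):
        super().__init__()
        hidden = in_c * expand_ratio
        # Shortcut only when stride is 1 and input/output shapes match
        self.use_shortcut = stride == 1 and in_c == out_c
        self.block = nn.Sequential(
            # 1x1 expand to the higher-dimensional space
            nn.Conv2d(in_c, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution
            nn.Conv2d(hidden, hidden, 3, stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 project: the linear bottleneck, deliberately no activation
            nn.Conv2d(hidden, out_c, 1, bias=False),
            nn.BatchNorm2d(out_c),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

Note that the projection deliberately omits ReLU6: that is the linear bottleneck described above.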
MobileNet V3
MobileNet V3 builds upon V2 by incorporating neural architecture search (NAS) for layer optimization and adding Squeeze-and-Excitation (SE) attention modules to enhance feature representation. It also uses a new activation function, h-swish, which is a computationally efficient approximation of the swish function.
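A compact sketch of these two ingredients is given below. The h-swish formula, the hard-sigmoid gate, and the squeeze-and-excitation reduction factor of 4 follow the V3 paper; the class name and other details are simplified assumptions:

import torch
import torch.nn as nn

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a piecewise-linear stand-in for swish
    return x * nn.functional.relu6(x + 3) / 6

class SqueezeExcite(nn.Module):
    # Simplified SE module; V3 gates with a hard-sigmoid and reduces by 4
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)          # squeeze: global average pool
        s = nn.functional.relu(self.fc1(s))           # excite: bottleneck MLP
        s = nn.functional.relu6(self.fc2(s) + 3) / 6  # hard-sigmoid gate
        return x * s                                  # reweight channels

se = SqueezeExcite(64)
y = se(torch.randn(2, 64, 28, 28))  # output has the same shape, channels reweighted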