Section 20.4: Efficient Designs: MobileNet, ShuffleNet & EfficientNet

"Accuracy is easy when someone else pays the electricity bill. Run me on a phone in someone's pocket and suddenly every multiply-add is a moral decision."
A Network Learning to Live Within Its Means

Big Picture

Once networks were accurate, the bottleneck moved from "can it learn?" to "can it run on a phone, a drone, or a doorbell?", and the efficient designs of this section attack the cost of the convolution itself rather than the depth or width around it. The central trick is factorization: a standard convolution mixes space and channels at once, which is wasteful, so depthwise-separable convolution splits it into a per-channel spatial filter and a $1 \times 1$ channel mix, cutting cost by roughly the kernel area. ShuffleNet pushes grouping further and shuffles channels to restore information flow; EfficientNet steps back and asks how to scale depth, width, and resolution together for a fixed budget. This is the section where architecture becomes an explicit cost-accuracy negotiation.

ResNet (Section 20.3) made depth cheap to train but not cheap to run; a ResNet-50 needs about four billion multiply-adds per image, fine on a server, painful on a battery. From roughly 2017 onward the frontier of architecture research split: one branch chased ever-higher accuracy, the other asked how little compute could buy a usable model. This section follows the efficiency branch, and its ideas now sit inside almost every model that ships to a real device, the topic of Chapter 28.

1. Depthwise-Separable Convolution Intermediate

A standard convolution with kernel size $K$, $C_{in}$ input channels, and $C_{out}$ output channels does two jobs in one shot: it filters spatially (the $K \times K$ window) and it combines channels (summing across all $C_{in}$ inputs for each output). Its cost per output pixel is $K^2 \cdot C_{in} \cdot C_{out}$ multiply-adds. MobileNet's insight is that these two jobs can be separated. First a depthwise convolution applies one $K \times K$ filter to each input channel independently, doing only spatial work, at cost $K^2 \cdot C_{in}$. Then a pointwise convolution (a $1 \times 1$ convolution, the channel mixer of Section 20.2) combines channels, at cost $C_{in} \cdot C_{out}$. Figure 20.4.1 contrasts the two, and the illustration below tells the same story as two relaxed specialists splitting one overworked job.

Illustration: on the left, one sweating, overworked robot juggles many colored balls at once; on the right, two relaxed specialist robots split the work, one stamping a pattern onto each color separately and one blending the colors at a mixing board, picturing how depthwise-separable convolution factors a costly dense convolution into a per-channel spatial filter plus a one-by-one channel mix. Caption: split the one expensive job into two cheap specialists, a per-channel filter and a channel mixer, and you delete about nine of every ten multiply-adds with barely a dent in accuracy.

Figure 20.4.1: Depthwise-separable convolution. A standard convolution (orange) filters space and mixes channels at once. The separable form does the spatial filtering per channel (green) and then mixes channels with a $1 \times 1$ convolution (blue). For a $3 \times 3$ kernel the cost falls by roughly eight to nine times with little accuracy loss.

The cost ratio of separable to standard is $\frac{K^2 C_{in} + C_{in} C_{out}}{K^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{K^2}$. With a typical $K = 3$ and a few hundred output channels, the $1/K^2$ term dominates and the separable form costs a little over $1/9$ of the dense one, close to an order-of-magnitude saving for a small accuracy cost. In PyTorch, a depthwise convolution is simply a Conv2d with groups equal to the channel count.

# MobileNet's core unit splits a standard convolution into two cheap steps:
# a depthwise 3x3 (one filter per channel, groups=in_ch) does the spatial work,
# then a pointwise 1x1 mixes channels. Together they cost about 1/9 of the dense op.
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """MobileNet's core unit: depthwise 3x3 then pointwise 1x1, each with BN+ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # depthwise: groups=in_ch means one KxK filter per channel
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            # pointwise: 1x1 conv mixes channels and changes their count
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

dw = DepthwiseSeparable(64, 128)
std = nn.Conv2d(64, 128, 3, padding=1)
n_dw = sum(p.numel() for p in dw.parameters())
n_std = sum(p.numel() for p in std.parameters())
print(f"separable: {n_dw:,} params, standard: {n_std:,} params")

separable: 9,152 params, standard: 73,856 params

Code Fragment 1: Depthwise-separable convolution via groups=in_ch in the depthwise Conv2d, followed by a $1 \times 1$ pointwise conv. The roughly eightfold drop from 73,856 to 9,152 parameters against a standard $3 \times 3$ convolution on the same 64-to-128 channel change is the entire reason MobileNet runs on phones.
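The cost ratio derived above can be checked with a few lines of arithmetic. The sketch below is only a calculator for the per-output-pixel formulas in the text; the function names are hypothetical helpers, not part of any library.

```python
# Multiply-adds per output pixel, following the cost formulas in the text.
def standard_macs(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_macs(k, c_in, c_out):
    depthwise = k * k * c_in   # one KxK filter per input channel
    pointwise = c_in * c_out   # 1x1 channel mix
    return depthwise + pointwise

def cost_ratio(k, c_in, c_out):
    return separable_macs(k, c_in, c_out) / standard_macs(k, c_in, c_out)

for c_out in (32, 128, 512):
    ratio = cost_ratio(3, 64, c_out)
    closed_form = 1 / c_out + 1 / 9
    print(f"C_out={c_out:4d}  ratio={ratio:.4f}  1/C_out + 1/K^2={closed_form:.4f}")
```

The two columns agree for every channel count, and the ratio approaches $1/9$ from above as $C_{out}$ grows, which is the floor set by the kernel area.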

Try This: Sweep the Grouping

The groups argument of Conv2d is a single dial that moves you from a dense convolution all the way to a depthwise one. On a fixed $3 \times 3$ layer with 64 input and 64 output channels, build it with bias=False and groups set to $1$, then $2$, $4$, $8$, $16$, $32$, and finally $64$, and print sum(p.numel() for p in conv.parameters()) each time. Watch the parameter count fall by exactly the factor you set groups to: $1$ group is the full dense cost, and $64$ groups (one filter per channel) is the depthwise extreme that is $64$ times cheaper. (With the default bias=True, each layer carries 64 extra bias parameters that grouping does not shrink, so the totals no longer divide exactly.) Seeing the weights drop step by step makes "factorize the expensive operation" something you have felt, not just read, and it shows that grouping is a graduated knob, not an all-or-nothing switch.
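If you want numbers to compare your printout against, the count can be predicted without building any layers. This is a sketch that assumes the usual grouped-convolution weight layout (each output channel sees in_ch / groups input channels); conv2d_param_count is a hypothetical helper, not a PyTorch function.

```python
def conv2d_param_count(in_ch, out_ch, k, groups=1, bias=True):
    """Predicted parameter count of a 2-D convolution with a square k x k kernel."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("groups must divide both channel counts")
    weights = out_ch * (in_ch // groups) * k * k
    return weights + (out_ch if bias else 0)

dense = conv2d_param_count(64, 64, 3, groups=1, bias=False)
for g in (1, 2, 4, 8, 16, 32, 64):
    w = conv2d_param_count(64, 64, 3, groups=g, bias=False)
    wb = conv2d_param_count(64, 64, 3, groups=g, bias=True)
    print(f"groups={g:2d}  weights={w:6d}  (dense / {dense // w})  with bias={wb:6d}")
```

The weight column divides cleanly by the group count, while the bias column carries a constant 64 extra parameters at every setting.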

Common Misconception: "Depthwise-separable is just a standard convolution, factored exactly"

The word "factorize" in Figure 20.4.1 suggests an exact algebraic identity, as if the depthwise step and the pointwise step simply multiply back out into the original dense convolution. They do not. A standard $K \times K$ convolution learns a separate $K \times K$ spatial pattern for every (input channel, output channel) pair; the separable form forces every output channel to reuse the same per-input-channel spatial filter and only re-weights it through the $1 \times 1$ mix. That is a strictly smaller, rank-constrained subset of all the functions a dense convolution can represent, which is precisely why it has roughly nine times fewer parameters. It is a cheaper approximation that happens to lose little accuracy on natural images, not a lossless rewrite. The accuracy gap is small in practice because real spatial filters are highly redundant across output channels, not because nothing was given up; a problem whose channels genuinely needed independent spatial patterns would suffer.
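The rank constraint can be made concrete with a small sketch. It assumes a purely linear composition (no BatchNorm or ReLU between the two steps, unlike the DepthwiseSeparable class) and uses random integer weights so the check is exact; all names are illustrative.

```python
import random

random.seed(0)
K, C_IN, C_OUT = 3, 4, 5

# Depthwise filters: one KxK integer pattern per input channel.
depthwise = [[[random.randint(-3, 3) for _ in range(K)] for _ in range(K)]
             for _ in range(C_IN)]
# Pointwise weights: one scalar per (output channel, input channel) pair.
pointwise = [[random.randint(-3, 3) for _ in range(C_IN)] for _ in range(C_OUT)]

# Dense kernel equivalent to depthwise followed by pointwise.
dense_equiv = [[[[pointwise[o][c] * depthwise[c][i][j] for j in range(K)]
                 for i in range(K)] for c in range(C_IN)] for o in range(C_OUT)]

def rank_at_most_one(rows):
    """True if every 2x2 minor of the matrix vanishes, i.e. its rank is at most 1."""
    n, m = len(rows), len(rows[0])
    return all(rows[a][x] * rows[b][y] == rows[a][y] * rows[b][x]
               for a in range(n) for b in range(a + 1, n)
               for x in range(m) for y in range(x + 1, m))

for c in range(C_IN):
    # Stack the flattened spatial pattern of every output channel for input channel c.
    stacked = [[dense_equiv[o][c][i][j] for i in range(K) for j in range(K)]
               for o in range(C_OUT)]
    print(f"input channel {c}: patterns are multiples of one filter -> "
          f"{rank_at_most_one(stacked)}")

print("separable parameters:", K * K * C_IN + C_IN * C_OUT)
print("dense parameters:    ", K * K * C_IN * C_OUT)
```

For each input channel, the spatial patterns seen by the different output channels are scalar multiples of a single filter. A freely learned dense kernel has no such restriction, which is where the extra parameters go.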

MobileNetV2 added a twist called the inverted residual. A normal ResNet bottleneck squeezes channels, works, then expands. The inverted residual does the opposite: it expands to a wide intermediate representation, applies the cheap depthwise convolution there, then projects back down, and the residual skip connects the two narrow ends. Doing the spatial work in the wide space preserves information, while keeping the skip-connected tensors thin saves memory, a clever rearrangement of the same parts. Figure 20.4.2 places the two side by side so you can read the inversion directly: where the channel width pinches, and which ends the skip connects.

Figure 20.4.2: The bottleneck, inverted. The ResNet bottleneck (left) pinches the channel width in the middle, doing its $3 \times 3$ spatial work on the narrow representation, and the residual skip joins the two wide ends. MobileNetV2's inverted residual (right) flips this: it expands to a wide middle, runs the cheap depthwise convolution where information is richest, then projects back down, and the skip now joins the two narrow ends, so the tensors carried along the skip path stay thin and memory-cheap.
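The inversion can also be read off as channel widths and per-pixel multiply-adds. The sketch below is plain accounting, not a trainable block; the reduction factor of 4 and the expansion factor of 6 are common defaults from the original ResNet and MobileNetV2 papers rather than values stated in this section, and stride 1 is assumed throughout.

```python
def resnet_bottleneck(wide, reduction=4, k=3):
    """Widths and per-pixel multiply-adds of a ResNet-style bottleneck."""
    narrow = wide // reduction
    macs = wide * narrow               # 1x1 squeeze
    macs += k * k * narrow * narrow    # dense KxK on the narrow middle
    macs += narrow * wide              # 1x1 expand
    return {"skip_width": wide, "middle_width": narrow, "macs": macs}

def inverted_residual(narrow, expansion=6, k=3):
    """Widths and per-pixel multiply-adds of a MobileNetV2-style inverted residual."""
    wide = narrow * expansion
    macs = narrow * wide               # 1x1 expand
    macs += k * k * wide               # depthwise KxK on the wide middle
    macs += wide * narrow              # 1x1 project
    return {"skip_width": narrow, "middle_width": wide, "macs": macs}

print("ResNet bottleneck:", resnet_bottleneck(256))
print("inverted residual:", inverted_residual(24))
```

In the first block the skip carries the wide tensor and the middle is narrow; in the second the roles swap, and the depthwise step contributes only a small share of the block's multiply-adds even though it runs at the widest point.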

Fun Fact

Depthwise-separable convolution is the rare optimization that looks too good to be true and mostly is not: it deletes about nine out of every ten multiply-adds and barely dents accuracy. The catch is that a GPU does not always run it nine times faster. Each tiny per-channel filter reads and writes memory but does almost no arithmetic, so the chip spends its time shuffling bytes rather than multiplying, the memory-bandwidth wall the research-frontier callout returns to. You saved the FLOPs; the wall clock did not always get the memo.

2. ShuffleNet: Grouping, Then Mixing Advanced

ShuffleNet pushes the cost-cutting further with grouped convolutions, where the input channels are split into $g$ groups and each group is convolved independently, cutting the $1 \times 1$ pointwise cost by a factor of $g$. The danger is that if every layer groups the same way, information never crosses between groups and the network fragments into $g$ independent sub-networks. ShuffleNet's fix is a parameter-free channel shuffle: after a grouped convolution, permute the channels so that the next group draws from all previous groups. Information flows across the whole width at zero extra cost, the same "cheap operation that restores a property an optimization broke" pattern you saw in ResNet's skip.

# Channel shuffle is a parameter-free reshape-transpose-reshape that interleaves
# the channels of g groups. After a grouped convolution it lets the next grouped
# conv draw from every previous group, restoring cross-group information flow.
import torch

def channel_shuffle(x, groups: int):
    """Permute channels so the next grouped conv mixes across all groups."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and per-group axes
    return x.view(b, c, h, w)                 # flatten back, now interleaved

x = torch.arange(8).view(1, 8, 1, 1).float()
print(x.flatten().tolist())
print(channel_shuffle(x, groups=4).flatten().tolist())

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
[0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0, 7.0]

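Code Fragment 2: Channel shuffle as a view, transpose, and view back. The output ordering [0, 2, 4, 6, 1, 3, 5, 7] interleaves the four groups, so each group of a following grouped convolution receives channels from more than one original group. With only two channels per group in this toy example, each new group spans two of the four original groups; once every group holds at least $g$ channels, each new group draws a channel from every original group, restoring the cross-group information flow that grouping otherwise breaks.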
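To see exactly which original groups feed each group of the next layer, the permutation can be traced on channel indices alone. This sketch assumes the next grouped convolution uses the same number of groups; both helper names are hypothetical.

```python
def shuffle_indices(num_channels, groups):
    """Channel order after a shuffle: reshape to (groups, n), transpose, flatten."""
    n = num_channels // groups
    return [g * n + j for j in range(n) for g in range(groups)]

def source_groups(num_channels, groups):
    """For each group of the next grouped conv, the original groups that feed it."""
    n = num_channels // groups
    order = shuffle_indices(num_channels, groups)
    return [sorted({ch // n for ch in order[i * n:(i + 1) * n]})
            for i in range(groups)]

print(shuffle_indices(8, 4))   # same ordering as channel_shuffle on 8 channels
print(source_groups(8, 4))     # 2 channels per group: each new group spans 2 old groups
print(source_groups(12, 3))    # 4 channels per group: each new group spans all 3
```

Coverage of every original group requires at least as many channels per group as there are groups, which holds comfortably at realistic channel counts.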

3. Squeeze-and-Excitation: Cheap Attention on Channels Advanced

A third efficiency idea, used inside MobileNetV3 and EfficientNet, is the squeeze-and-excitation (SE) block. It learns a per-channel importance weight in three steps. First, squeeze each feature map to a single number with global average pooling; averaging over the whole map is what gives each gate global context, so a channel is judged by how strongly it fires across the entire image rather than at one location. Second, pass the resulting channel vector through a tiny two-layer network ending in a sigmoid, which squashes each output into a gate between $0$ and $1$ per channel. Third, multiply each channel by its gate. For almost no compute, the network learns to amplify informative channels and suppress noisy ones, a lightweight form of the attention you will study fully in Chapter 22. SE is one of the highest accuracy-per-FLOP additions known, which is why it appears in nearly every efficient design after 2018.
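The squeeze, excite, and scale steps fit in a few lines of plain Python. This is a minimal sketch on nested lists with hand-picked, hypothetical weights and no bias terms; a real SE block would be a learned module operating on tensors.

```python
import math

def squeeze_excite(x, w1, w2):
    """x[c][h][w] is the feature map; w1 is (C/r) x C and w2 is C x (C/r)."""
    # Squeeze: global average pool reduces each channel to one number.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excite: a bottleneck layer with ReLU, then a layer ending in a sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: every value in channel c is multiplied by gate c.
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
    return scaled, gates

x = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0 fires strongly everywhere
     [[0.0, 0.2], [0.0, 0.0]],   # channel 1 is nearly silent
     [[0.5, 0.5], [0.5, 0.5]]]   # channel 2 sits in between
w1 = [[1.0, -1.0, 0.0]]          # illustrative weights: 3 channels -> 1 hidden unit
w2 = [[2.0], [-2.0], [0.0]]      # illustrative weights: 1 hidden unit -> 3 gates
scaled, gates = squeeze_excite(x, w1, w2)
print([round(g, 3) for g in gates])
```

With these weights the strongly firing channel receives a gate well above one half and the silent one a gate well below it, while the third channel, whose second-layer weight is zero, gets exactly one half.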

Key Insight: A Gate Per Channel Buys More Than a Whole Extra Stage

Squeeze-and-excitation looks too small to matter: it adds only two tiny fully connected layers acting on a length-$C$ vector, which add well under 1% to a backbone's multiply-adds. Yet SE-augmented residual networks won the final ImageNet challenge in 2017, cutting the previous year's winning top-5 error by roughly a quarter. Compare that to the alternative the previous sections taught: buying a comparable accuracy gain by adding depth or width costs a whole extra stage of $3 \times 3$ convolutions, orders of magnitude more compute. A sliver of extra compute that re-weights the channels you already computed outperforms a far larger pile of compute spent producing new ones. The lesson is that telling the network which features to trust can be cheaper, and more effective, than giving it more features.
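A rough tally shows how lopsided the comparison is. The sketch assumes a reduction ratio of 16 inside the gate (the default in the original SE paper, not a value given in this section) and a $56 \times 56$ feature map with 256 channels chosen purely for illustration.

```python
def se_gate_macs(channels, reduction=16):
    """Multiply-adds of the two fully connected layers on the length-C vector."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels

def se_total_macs(channels, height, width, reduction=16):
    # Rescaling costs one multiply per activation. The pooling step is
    # additions only and is left out of this multiply-add tally.
    return se_gate_macs(channels, reduction) + channels * height * width

def conv3x3_macs(channels, height, width):
    """Dense 3x3 convolution that keeps the channel count, stride 1."""
    return 9 * channels * channels * height * width

c, h, w = 256, 56, 56
se = se_total_macs(c, h, w)
conv = conv3x3_macs(c, h, w)
print(f"SE block: {se:,} MACs; one 3x3 conv: {conv:,} MACs; ratio {se / conv:.5f}")
```

Under these assumptions the whole SE block costs a small fraction of one percent of a single dense $3 \times 3$ layer on the same tensor.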

Key Insight: Factorize the Expensive Operation

Every design in this section is a variation on one move: take the costly dense mixing of space and channels and factor it into cheaper pieces that, recombined, do nearly the same job. Depthwise-separable splits space from channels; grouped convolution splits channels into independent groups; the $1 \times 1$ bottleneck reshapes cheaply between expensive steps. When you need to make any network cheaper, ask first "what is the single most expensive operation, and can I factor it?" That question, not a memorized block diagram, is the transferable skill. A four-word handle for the whole section: split, group, gate, scale. Depthwise-separable splits space from channels, ShuffleNet groups channels (then shuffles to reconnect them), squeeze-and-excitation gates each channel by importance, and EfficientNet scales depth, width, and resolution together. Four verbs, four ways to spend a smaller compute budget well.

4. EfficientNet: Scale Everything Together Intermediate

The earlier designs cut the cost of a fixed network. EfficientNet asks the complementary question: given more compute, how should you grow a network? You can make it deeper (more layers), wider (more channels), or feed it higher-resolution images, and the field had historically scaled one of these at a time, by intuition. EfficientNet's compound scaling argues that the three should grow together in a fixed ratio, governed by a single coefficient $\phi$:

$$\text{depth} = \alpha^\phi, \quad \text{width} = \beta^\phi, \quad \text{resolution} = \gamma^\phi, \quad \text{subject to } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$

Here $\alpha$, $\beta$, and $\gamma$ are constants greater than one (the paper found roughly $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ by a small grid search), and $\phi$ is the single knob you turn to grow the network; setting $\phi = 0$ recovers the base network, and each step up multiplies depth, width, and resolution by those fixed per-step factors. The constraint keeps the total compute growing by roughly $2^\phi$ as $\phi$ increases, so each unit of $\phi$ doubles the budget and spreads it across all three dimensions in the empirically best proportion.

The squared exponents on width and resolution are not arbitrary; they fall straight out of how convolution counts cost. Doubling the depth stacks twice as many layers, so cost doubles, a factor of $2$. Doubling the width doubles both the input and the output channel count of every layer, and a convolution's cost scales with their product, so cost quadruples, a factor of $4$. Doubling the input resolution doubles both the height and the width of every feature map, so there are four times as many pixels to convolve, again a factor of $4$. Width and resolution each cost like an area (two dimensions growing at once) while depth costs like a length, which is exactly why they sit under squares in $\alpha \cdot \beta^2 \cdot \gamma^2$ and depth does not. The constraint is just the statement "whatever combination you pick, let the total compute roughly double per step of $\phi$".
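Plugging the quoted constants into the rule makes the constraint tangible. The sketch below computes multipliers relative to the base network only; it says nothing about the rounding or hand-tuning that any released model may apply, and compound_scale is a hypothetical helper.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution bases quoted in the text

def compound_scale(phi):
    """Multipliers relative to the base network for a given compound coefficient."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Cost is linear in depth and quadratic in width and in resolution.
    compute = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, compute

per_step = ALPHA * BETA ** 2 * GAMMA ** 2
print(f"alpha * beta^2 * gamma^2 = {per_step:.3f}")
for phi in range(5):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
          f"compute x{f:.2f} (2^phi = {2 ** phi})")
```

The per-step product comes out slightly below 2, so the compute multiplier tracks $2^\phi$ closely but falls a little behind it as $\phi$ grows, which is why the constraint is written with an approximate sign.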

With the scaling rule fixed, the remaining question is what to scale from. The base network is EfficientNet-B0, built from inverted-residual SE blocks and found by neural architecture search, an automated procedure that searches over candidate layer configurations to optimize accuracy under a cost budget rather than designing them by hand. Increasing $\phi$ from that base generates the B1 through B7 family, which traced an accuracy-per-FLOP frontier that dominated for years. The lesson reframes the depth-versus-width debate of Section 20.2: the answer is not depth or width but a principled mixture, plus resolution.

Library Shortcut: Efficient Backbones, Pretrained

MobileNet, ShuffleNet, and the EfficientNet family all ship pretrained, so the blocks above become a single call:

# Load the efficient backbones pretrained on ImageNet, picking by their
# accuracy-latency tradeoff. The depthwise grouping, inverted-residual and SE
# blocks, and channel-multiplier settings of this section are all inside timm.
import timm
mbv3 = timm.create_model("mobilenetv3_large_100", pretrained=True)  # ~5.5M params
effb0 = timm.create_model("efficientnet_b0", pretrained=True)       # ~5.3M params
# Count EfficientNet-B0's parameters directly from the module:
print(sum(p.numel() for p in effb0.parameters()))

5288548

The library handles the depthwise grouping, the inverted-residual and SE blocks, the exact channel-multiplier and resolution settings, and the pretrained weights, replacing a few hundred lines of careful block plumbing. timm also publishes a benchmarked results table so you can pick a model by measured accuracy and latency rather than guessing, exactly the selection process of Section 20.6.

Code Fragment 3: MobileNetV3 and EfficientNet-B0 in one timm.create_model call each, replacing the hand-built DepthwiseSeparable and shuffle blocks above. The printed count of 5,288,548 parameters confirms EfficientNet-B0's compact size, leaving you free to focus on choosing a backbone by its cost-accuracy tradeoff.

Practical Example: The Doorbell That Could Not Afford a ResNet

Who: a consumer hardware team adding person detection to a battery-powered video doorbell, 2025. Situation: their prototype ran a ResNet-50 backbone in the cloud, but round-trip latency and privacy concerns demanded on-device inference, and the doorbell's chip offered a small fraction of a server's compute with a strict power envelope. Problem: moved onto the device unchanged, the ResNet model drained the battery in hours and missed the latency target by an order of magnitude. Decision: they swapped to a MobileNetV3-Small backbone with squeeze-and-excitation, accepting a two-point accuracy drop, and quantized it to 8-bit integers (the deployment techniques of Chapter 28). Result: inference fit the latency budget with room to spare, battery life met the year-long target, and the small accuracy loss was invisible in the field because the detection threshold dominated real-world performance anyway. Lesson: on a constrained device the right architecture is the one that fits the power and latency budget while clearing the accuracy bar, not the one that tops the ImageNet leaderboard. Efficiency is a first-class design goal, not a consolation prize.

Research Frontier: Latency Is the Real Target

A theme sharpening from 2022 to 2026 is that FLOP count is a poor proxy for actual speed: depthwise convolutions are FLOP-cheap but memory-bandwidth-bound, so a "more efficient" model on paper can run slower on real hardware. Designs like MobileOne (CVPR 2023) and FastViT (ICCV 2023, arXiv:2303.14189) optimize directly for measured on-device latency, using structural reparameterization to train with multi-branch blocks and then fold them into a single fast convolution at inference. The broader 2024 to 2026 trend is hardware-aware neural architecture search, where the cost term in the search is the latency measured on the exact target chip. The factorization ideas of this section remain the building blocks; what has changed is that the objective being minimized is now wall-clock time on silicon, the metric Section 20.6 urges you to measure.
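The folding step behind structural reparameterization rests on the linearity of convolution, which a toy example can show. The sketch below is a deliberately simplified single-channel version with no BatchNorm (published designs fold normalization layers as well); it is not the block of any specific paper, and all names are illustrative.

```python
def conv2d_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding (same-size output)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * image[yy][xx]
            out[y][x] = acc
    return out

def multi_branch(image, k3, scale_1x1):
    """Training-time block: 3x3 branch + 1x1 branch + identity skip, summed."""
    a = conv2d_same(image, k3)
    return [[a[y][x] + scale_1x1 * image[y][x] + image[y][x]
             for x in range(len(image[0]))] for y in range(len(image))]

def fold_branches(k3, scale_1x1):
    """Inference-time kernel: add the 1x1 weight and the identity to the center tap."""
    merged = [row[:] for row in k3]
    merged[1][1] += scale_1x1 + 1.0
    return merged

image = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0], [2.0, 1.0, 1.0]]
k3 = [[0.5, 0.0, -0.5], [1.0, 0.25, -1.0], [0.5, 0.0, -0.5]]
branched = multi_branch(image, k3, scale_1x1=0.75)
folded = conv2d_same(image, fold_branches(k3, scale_1x1=0.75))
print(all(abs(b - f) < 1e-9 for rb, rf in zip(branched, folded) for b, f in zip(rb, rf)))
```

Three branches at training time collapse into one kernel at inference time with identical outputs, so the deployed model pays for a single convolution and a single pass over memory.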

Exercise 20.4.1: Derive the Separable Saving Conceptual

For a convolution with kernel $K = 3$, $C_{in} = 256$, and $C_{out} = 256$ on an $H \times W$ feature map, compute the multiply-add cost of (a) the standard convolution and (b) the depthwise-separable equivalent, both per output pixel. Express the ratio and confirm it matches $\frac{1}{C_{out}} + \frac{1}{K^2}$. Then explain why the saving grows as $C_{out}$ increases while the cost ratio stays bounded below by $1/K^2$.

Exercise 20.4.2: Build and Time a Separable Block Coding

Using the DepthwiseSeparable class above, build a small network for CIFAR-10 and a twin that replaces each separable block with a standard Conv2d of the same input and output channels. Train both to convergence and report (a) parameter counts, (b) wall-clock time per training epoch on your hardware, and (c) final test accuracy. Discuss the accuracy-versus-cost tradeoff you observe, and check whether the wall-clock speedup matches the FLOP speedup or falls short (the memory-bandwidth point from the research-frontier callout).

Exercise 20.4.3: Read the EfficientNet Frontier Analysis

From the timm or torchvision documentation, collect parameters, FLOPs, and ImageNet top-1 accuracy for EfficientNet-B0 through B4 and for ResNet-50 and ResNet-101. Plot accuracy against FLOPs (log scale) with both families on one chart. Identify which models lie on the upper-left frontier (more accuracy per FLOP) and which are dominated. Using the compound-scaling formula of subsection 4, explain why the EfficientNet points form a smooth curve while the ResNet points scaled by depth alone do not.