Part III: Deep Learning for Computer Vision
Chapter 20: CNN Architectures: From LeNet to ConvNeXt

VGG & Inception: Depth vs Width

"One of us stacked the same small filter until the lights flickered. The other ran four filter sizes at once and let the gradient pick a favorite. We both won the same year. Architecture is not a religion."

Two Networks That Disagreed Productively in 2014
Big Picture

By 2014 the bottleneck was no longer "can we train a deep CNN at all" but "how should we spend a fixed compute budget: on depth or on width?", and the two winning answers, VGG and Inception, are the cleanest statement of that tradeoff in the whole chapter. VGG bets everything on depth with a single repeated $3 \times 3$ block, proving that two small kernels stacked beat one large kernel at lower cost. Inception bets on width, running several filter sizes in parallel inside one module, and introduces the $1 \times 1$ convolution as a cheap channel bottleneck that keeps the width affordable. Both ideas, the small-kernel stack and the $1 \times 1$ bottleneck, survive into every later architecture, so this section is less about two old networks than about two design primitives you will reuse constantly.

In Section 20.1 we saw AlexNet remove the compute and gradient ceilings, and we noticed that its parameters piled up in the dense layers, not the convolutions. With training now feasible, the 2014 question became architectural: given that you can stack convolutions, how should you arrange them? VGG and Inception gave opposite-flavored answers in the same ImageNet competition, and both finished near the top. Studying them side by side teaches the two most reusable structural ideas in convolutional design.

1. VGG: The Power of Uniform Depth (Beginner)

VGG, from Oxford's Visual Geometry Group, is almost aggressively simple. Every convolution is $3 \times 3$ with stride $1$ and padding $1$ (so spatial size is preserved), every pooling is a $2 \times 2$ max-pool, and the only design knobs are how many convolutions to put between pools and how many channels each stage gets. VGG-16 stacks thirteen such convolutions and three fully connected layers; the depth, not any clever module, is the whole idea. The network is organized into stages: within a stage both the channel count and the spatial size stay constant, and the pooling between stages halves the spatial size while the channel count doubles (64, 128, 256, 512, then held at 512 for the last stage), the same spatial-down, channels-up rhythm as LeNet.

The central VGG argument is the substitution of small kernels for large ones. Two stacked $3 \times 3$ convolutions see the same $5 \times 5$ input region as one $5 \times 5$ convolution does, because the receptive field of Chapter 19 grows additively with depth. But the two small layers are cheaper and add an extra nonlinearity. Count the weights for a layer with $C$ input and $C$ output channels (each filter has kernel-area $\times\, C$ weights and there are $C$ of them, so the count is kernel-area $\times\, C^2$): one $5 \times 5$ convolution costs $25 C^2$ parameters, while two $3 \times 3$ convolutions cost $2 \times 9 C^2 = 18 C^2$, about 28% fewer, and three $3 \times 3$ layers cost $27 C^2$ for the same receptive field as one $7 \times 7$ layer at $49 C^2$. Figure 20.2.1 shows this equivalence.

[Figure 20.2.1 diagram: one $5 \times 5$ convolution ($25 C^2$ parameters, 1 nonlinearity) beside two stacked $3 \times 3$ convolutions ($18 C^2$ parameters, 2 nonlinearities), both covering the same $5 \times 5$ field.]
Figure 20.2.1: The VGG substitution. A single $5 \times 5$ filter and two stacked $3 \times 3$ filters cover the same $5 \times 5$ receptive field (blue), but the stacked version uses 28% fewer parameters and inserts an extra nonlinearity between the two layers, giving more expressive power for less cost.
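
The counts in this argument can be checked mechanically. The sketch below is a minimal illustration, not code from VGG itself: the helper names `receptive_field` and `conv_weights` are hypothetical, biases are ignored to match the kernel-area $\times\, C^2$ formula, and the recurrence is assumed to be the standard one (the form used in Chapter 19 may be written differently).

```python
# Check the VGG substitution: same receptive field, fewer weights.
# Helper names are illustrative; biases are ignored to match k*k*C*C.

def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of convolutions.

    Each layer widens the field by (k - 1) * jump, where jump is the
    product of the strides of all earlier layers.
    """
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

def conv_weights(kernel_sizes, channels):
    """Total weights for a stack of C-to-C convolutions, biases excluded."""
    return sum(k * k * channels * channels for k in kernel_sizes)

C = 64  # any channel count gives the same ratios
for stack, single in [([3, 3], [5]), ([3, 3, 3], [7])]:
    # The stack and the single large kernel must cover the same input region.
    assert receptive_field(stack) == receptive_field(single)
    small, large = conv_weights(stack, C), conv_weights(single, C)
    saving = 100 * (1 - small / large)
    print(f"{len(stack)} stacked 3x3 vs one {single[0]}x{single[0]}: "
          f"field {receptive_field(stack)}, "
          f"{small // C**2} C^2 vs {large // C**2} C^2 weights, "
          f"{saving:.0f}% fewer")
```

The optional `strides` argument is there only to show why the stride-1 case is additive: with every stride equal to 1 the jump never grows, so each $3 \times 3$ layer adds exactly 2 to the field.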
Common Misconception: "Two 3x3 convolutions equal one 5x5"

It is tempting to read the substitution as an equality: that a stack of two $3 \times 3$ convolutions is one $5 \times 5$ convolution, just cheaper. It is not. What the two share is only the receptive field, the input footprint each output pixel depends on. As functions they are different. With the ReLU between the two layers the stack is strictly more expressive, which is the whole point. Even with the nonlinearity removed, two $3 \times 3$ layers cannot reproduce an arbitrary $5 \times 5$ filter: in the single-channel case the pair carries only $18$ weights, so composing them yields a constrained subset of all possible $5 \times 5$ kernels, not the full $25$-weight space. The takeaway VGG actually defends is "same receptive field, fewer parameters, more nonlinearity", not "the same operation". A quick check on yourself: if the two were truly equivalent, the extra nonlinearity could add nothing, yet that nonlinearity is exactly what makes the deep stack stronger.
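
The linear half of this claim can be tested numerically. The following NumPy sketch covers only the single-channel case with the nonlinearity removed, and its helpers (`full_conv2d`, `correlate_valid`) are hypothetical names written for this illustration. It first confirms that two stacked $3 \times 3$ layers act like one $5 \times 5$ layer whose kernel is the full convolution of the two small kernels, then measures how many independent directions that composed kernel can move in.

```python
import numpy as np

def full_conv2d(a, b):
    """Full 2D convolution of two small kernels (3x3 with 3x3 gives 5x5)."""
    out = np.zeros((a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i:i + b.shape[0], j:j + b.shape[1]] += a[i, j] * b
    return out

def correlate_valid(x, k):
    """What a conv layer computes: cross-correlation, no padding, stride 1."""
    h, w = x.shape[0] - k.shape[0] + 1, x.shape[1] - k.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

rng = np.random.default_rng(0)
a, b = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
x = rng.standard_normal((8, 8))

# 1) Without a nonlinearity, the stack collapses to one composed 5x5 kernel.
composed = full_conv2d(a, b)
stacked = correlate_valid(correlate_valid(x, a), b)
single = correlate_valid(x, composed)
assert np.allclose(stacked, single)

# 2) Jacobian of (a, b) -> composed kernel. The map is bilinear, so each
#    column is the composed kernel with one weight replaced by a unit impulse.
cols = []
for idx in range(9):
    e = np.zeros(9)
    e[idx] = 1.0
    cols.append(full_conv2d(e.reshape(3, 3), b).ravel())  # derivative w.r.t. a
for idx in range(9):
    e = np.zeros(9)
    e[idx] = 1.0
    cols.append(full_conv2d(a, e.reshape(3, 3)).ravel())  # derivative w.r.t. b
J = np.stack(cols, axis=1)  # shape (25, 18)
rank = np.linalg.matrix_rank(J)
print(f"composed 5x5 kernel can move in {rank} of 25 directions")
```

The rank is at most 17, not 18, because scaling one kernel up and the other down by the same factor leaves the composed kernel unchanged. Either way it falls well short of 25, which is the sense in which the linear stack reaches only a subset of $5 \times 5$ kernels.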

Because every kernel is the same size and every stage follows the same rule, the whole VGG trunk can be generated from a short configuration list rather than written out layer by layer. The code below builds the thirteen-convolution trunk by mapping one block-builder over a list of stage shapes.

# Generate the whole VGG-16 trunk from one repeated rule: a stage is
# n_convs of (3x3 conv, ReLU) followed by a 2x2 max-pool. The uniformity
# is the design, so the network is a short loop over a config list.
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """One VGG stage: n_convs of 3x3 conv-ReLU, then a 2x2 max-pool."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))           # halve spatial size between stages
    return nn.Sequential(*layers)

# VGG-16 convolutional stages: (in, out, n_convs)
cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]
features = nn.Sequential(*[vgg_block(i, o, n) for i, o, n in cfg])
n_convs = sum(n for _, _, n in cfg)
print(f"{len(cfg)} stages, {n_convs} convolutions in total")
5 stages, 13 convolutions in total
Code Fragment 1: The entire VGG-16 convolutional trunk built by mapping vgg_block over the cfg list of (in, out, n_convs) tuples. The uniformity that makes the network easy to describe (every kernel is $3 \times 3$, every pool $2 \times 2$) also makes the thirteen-convolution trunk easy to generate programmatically, the deeper point of the VGG design.
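
The spatial-down, channels-up rhythm can also be traced with arithmetic alone. This is a small sketch under stated assumptions: the input is taken to be $224 \times 224$ (the usual ImageNet crop, not stated in the fragment above), and `conv_out` and `trace_shapes` are hypothetical helpers applying the standard output-size formula.

```python
# Trace feature-map shapes through the VGG-16 trunk without building it.
# Assumes a 224x224 input; helper names are illustrative.

def conv_out(size, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# Same (in, out, n_convs) stage list as the VGG-16 trunk.
cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]

def trace_shapes(cfg, size=224):
    """Return the (channels, height, width) shape after each stage."""
    shapes = []
    for _, out_ch, n_convs in cfg:
        for _ in range(n_convs):
            # 3x3, stride 1, padding 1: spatial size is unchanged
            size = conv_out(size, kernel=3, stride=1, padding=1)
        # 2x2 max-pool with stride 2: spatial size is halved
        size = conv_out(size, kernel=2, stride=2)
        shapes.append((out_ch, size, size))
    return shapes

for stage, shape in enumerate(trace_shapes(cfg), start=1):
    print(f"stage {stage}: {shape}")
```

Under these assumptions the trunk ends at a $512 \times 7 \times 7$ map, which is the tensor the dense head has to flatten.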
Key Insight: Stack Small, Not Big

Replacing one large kernel with a stack of $3 \times 3$ kernels is a free lunch that the field never gave back: equal receptive field, fewer parameters, more nonlinearity. This is why nearly every architecture after VGG, including ResNet and the efficient designs of Section 20.4, is built almost entirely from $3 \times 3$ (and $1 \times 1$) convolutions. When you see a network full of small kernels, you are looking at VGG's lesson, absorbed.

VGG's weakness is the one AlexNet already exposed: its three fully connected layers hold roughly 124 million of its 138 million parameters, about 103 million of them in the first dense layer alone. The convolutional trunk is elegant and cheap; the classifier head is a parameter sink. Inception attacks exactly this.
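
The imbalance can be tallied by hand from the layer shapes. The sketch below assumes the standard VGG-16 head (a $512 \times 7 \times 7$ trunk output from a $224 \times 224$ input, two 4096-unit dense layers, and 1000 classes); `conv_params` and `linear_params` are hypothetical helpers that count weights plus biases.

```python
# Tally VGG-16 parameters by layer type from shapes alone.
# Assumes the standard head: 512*7*7 -> 4096 -> 4096 -> 1000.

def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of one k x k convolution."""
    return k * k * in_ch * out_ch + out_ch

def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]

conv_total = 0
for in_ch, out_ch, n_convs in cfg:
    for i in range(n_convs):
        # Only the first conv of a stage changes the channel count.
        conv_total += conv_params(in_ch if i == 0 else out_ch, out_ch)

fc_shapes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_each = [linear_params(i, o) for i, o in fc_shapes]
fc_total = sum(fc_each)
total = conv_total + fc_total

print(f"convolutions: {conv_total:>12,} ({conv_total / total:.1%})")
print(f"dense head:   {fc_total:>12,} ({fc_total / total:.1%})")
print(f"first dense:  {fc_each[0]:>12,} ({fc_each[0] / total:.1%})")
```

The first dense layer is large because it connects every one of the $512 \times 7 \times 7 = 25{,}088$ trunk outputs to each of 4096 units, so its size is set by the spatial map it flattens rather than by anything the classifier needs.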

2. Inception: Several Scales at Once (Intermediate)

The Inception (GoogLeNet) team made the opposite bet. Rather than commit to one kernel size and go deep, they asked: why choose a kernel size at all? An object can occupy a small or a large fraction of the image, so let one module compute $1 \times 1$, $3 \times 3$, and $5 \times 5$ convolutions in parallel, plus a pooling branch, and concatenate the results along the channel axis. The next layer then has access to features at several scales simultaneously and learns which to weight. This is the multi-scale idea of the image pyramids in Chapter 4, now computed inside a single learnable block. The workshop illustration below makes the parallel-branch idea concrete (the funnels are the cheap channel bottlenecks introduced just below).

[Illustration: a cutaway workshop with four parallel benches, where workers inspect the same image through lenses of different sizes; funnels narrow the wide token streams before the larger benches, and one shared conveyor merges all outputs. It pictures an Inception module computing several filter scales in parallel behind cheap $1 \times 1$ channel bottlenecks.]
Why pick a kernel size when you can run several at once? Inception looks at every scale in parallel and lets a cheap one-by-one bottleneck keep the bill down.

The obvious problem is cost: running a $5 \times 5$ convolution on a wide feature map is expensive, and concatenating branches makes the next layer's input very wide. The fix is the section's second reusable primitive. A $1 \times 1$ convolution mixes channels at each spatial location without touching the spatial extent, so it can reduce the channel count cheaply before an expensive $3 \times 3$ or $5 \times 5$ branch. Placing a $1 \times 1$ reduction in front of each expensive branch is the "Inception with dimension reduction" module, shown in Figure 20.2.2.

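[Figure 20.2.2 diagram: the input feature map feeds four parallel branches: a $1 \times 1$ conv; a $1 \times 1$ reduce followed by a $3 \times 3$ conv; a $1 \times 1$ reduce followed by a $5 \times 5$ conv; and a $3 \times 3$ max-pool followed by a $1 \times 1$ projection. The branch outputs are concatenated along channels. Blue $1 \times 1$ = cheap channel bottleneck (reduce before expensive branch).]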
Figure 20.2.2: An Inception module with dimension reduction. Four branches see the same input at different scales and concatenate their outputs. The blue $1 \times 1$ convolutions reduce the channel count before each expensive $3 \times 3$ or $5 \times 5$ branch, the key trick that keeps a wide module affordable.
# One Inception module runs four branches on the same input and concatenates
# them on the channel axis. Each expensive 3x3 and 5x5 branch is fronted by a
# 1x1 reduction so the wide module stays affordable, exactly as in Figure 20.2.2.
import torch
import torch.nn as nn

class Inception(nn.Module):
    """One Inception module: four parallel branches concatenated on channels."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        relu = nn.ReLU(inplace=True)
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), relu)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), relu,        # 1x1 reduce
                                nn.Conv2d(c3_red, c3, 3, padding=1), relu)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), relu,        # 1x1 reduce
                                nn.Conv2d(c5_red, c5, 5, padding=2), relu)
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), relu)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# The classic "inception_3a" config from GoogLeNet:
mod = Inception(192, 64, 96, 128, 16, 32, 32)
out = mod(torch.randn(1, 192, 28, 28))
print("output channels:", out.shape[1])  # 64 + 128 + 32 + 32
output channels: 256
Code Fragment 2: A full Inception module built from branches b1 through b4. The four branch outputs (64, 128, 32, 32 channels) concatenate in forward to 256, and the c3_red and c5_red arguments size the $1 \times 1$ reductions that front each expensive branch exactly as in Figure 20.2.2.
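
How much the $1 \times 1$ reduction saves on the $5 \times 5$ branch can be estimated by hand. The sketch below uses the inception_3a numbers from the fragment above (192 input channels, `c5_red` = 16, `c5` = 32, a $28 \times 28$ map); the helper `conv_cost` is hypothetical, and it counts weights and multiply-adds only, ignoring biases and the ReLU.

```python
# Cost of the 5x5 branch with and without the 1x1 reduction in front of it.
# Biases and activations are ignored; conv_cost is an illustrative helper.

def conv_cost(in_ch, out_ch, k, h, w):
    """Return (weights, multiply-adds) for a k x k conv on an h x w map."""
    weights = in_ch * out_ch * k * k
    return weights, weights * h * w  # every weight is used once per position

in_ch, c5_red, c5, H, W = 192, 16, 32, 28, 28

# Direct: 5x5 convolution applied to all 192 input channels.
direct_w, direct_macs = conv_cost(in_ch, c5, 5, H, W)

# Reduced: 1x1 down to 16 channels, then the 5x5 on those 16.
red1_w, red1_macs = conv_cost(in_ch, c5_red, 1, H, W)
red2_w, red2_macs = conv_cost(c5_red, c5, 5, H, W)
reduced_w, reduced_macs = red1_w + red2_w, red1_macs + red2_macs

print(f"direct : {direct_w:>8,} weights, {direct_macs:>12,} multiply-adds")
print(f"reduced: {reduced_w:>8,} weights, {reduced_macs:>12,} multiply-adds")
print(f"saving factor: {direct_w / reduced_w:.1f}x")
```

The factor lands a little below the 12x channel reduction (192 to 16) because the $1 \times 1$ layer itself is not free.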

GoogLeNet also replaced VGG's parameter-heavy dense head with global average pooling: average each final feature map down to a single number, then apply one small linear layer. This collapses the spatial map to a length-$C$ vector with zero parameters, eliminating the 100-million-weight dense layers entirely. GoogLeNet reached accuracy comparable to VGG with only about 6.8 million parameters, roughly twenty times fewer than VGG-16 (and about twelve times fewer than AlexNet, the comparison the original paper highlights), a direct payoff of fixing the dense-head bottleneck this chapter has flagged since AlexNet.
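
A global-average-pooling head is only a few lines in PyTorch. This is a minimal sketch, not GoogLeNet's exact head: it assumes a final feature map of 1024 channels at $7 \times 7$ and 1000 classes, and the names `gap_head` and `n_params` are introduced here for illustration.

```python
import torch
import torch.nn as nn

def n_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Assumed shapes: 1024 final channels on a 7x7 map, 1000 output classes.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (B, 1024, 7, 7) -> (B, 1024, 1, 1), no parameters
    nn.Flatten(),              # (B, 1024, 1, 1) -> (B, 1024)
    nn.Linear(1024, 1000),     # the only learned layer in the head
)

features = torch.randn(2, 1024, 7, 7)
logits = gap_head(features)

# The pooling step is nothing more than a mean over the spatial axes.
pooled = gap_head[1](gap_head[0](features))
assert torch.allclose(pooled, features.mean(dim=(2, 3)), atol=1e-6)

vgg_first_dense = 512 * 7 * 7 * 4096  # weights in VGG-16's first dense layer
print("logits shape:", tuple(logits.shape))
print(f"GAP head parameters:    {n_params(gap_head):,}")
print(f"VGG first dense layer:  {vgg_first_dense:,} weights")
```

The published network also applies dropout before the linear layer, which is omitted here because it adds no parameters.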

Key Insight: The 1x1 Convolution Is a Channel Mixer

A $1 \times 1$ convolution does no spatial work; at each pixel it is a small fully connected layer across channels. That makes it the cheapest possible tool to grow or shrink channel count, to mix information across feature maps, and to insert a nonlinearity. You will meet it again as the projection in the ResNet bottleneck (Section 20.3) and as the pointwise step of depthwise-separable convolution (Section 20.4). Of all the primitives in this chapter, the $1 \times 1$ convolution is the one you will type most often.
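
The "small fully connected layer at each pixel" reading can be verified directly. The sketch below copies the weights of a $1 \times 1$ convolution into an `nn.Linear` and applies the linear layer independently at every spatial location; the channel counts (192 to 16) and map size are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(192, 16, kernel_size=1)   # weight shape (16, 192, 1, 1)
linear = nn.Linear(192, 16)                # weight shape (16, 192)

# Give the linear layer the same weights as the 1x1 convolution.
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(16, 192))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 192, 28, 28)

y_conv = conv(x)  # (2, 16, 28, 28)

# Move channels last so the linear layer acts on each pixel's channel vector,
# then move channels back to the second axis.
y_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(y_conv, y_lin, atol=1e-4)
print("1x1 conv matches per-pixel linear layer:", same)
```

Because the same $16 \times 192$ matrix is applied at every location, the parameter count does not depend on the spatial size, which is why the operation is so cheap as a channel bottleneck.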

Fun Fact

The name "Inception" is a nod to the 2010 film, by way of the internet meme "We need to go deeper." The GoogLeNet authors cited the meme in their paper, deadpan, as motivation for stacking modules. So the architecture that taught the field to make width affordable with a $1 \times 1$ bottleneck is also, officially, the one named after a dream-within-a-dream joke. The $1 \times 1$ convolution kept the joke from running out of memory.

3. Depth versus Width as a Living Tradeoff (Intermediate)

VGG and Inception are the two poles of a tradeoff that never goes away. Depth (more sequential layers) builds abstraction and grows the receptive field, but plain depth eventually stops helping, the very wall that ResNet confronts and removes in Section 20.3. Width (more parallel branches or channels) increases representational capacity at each level and parallelizes well on hardware, but costs memory and can waste capacity if the extra width is redundant. Modern designs do not pick a side; they tune both, and EfficientNet (Section 20.4) makes the joint tuning explicit by scaling depth, width, and resolution together with a single coefficient.

Practical Example: VGG as a Perceptual Loss

Who: a small studio building a photo-upscaling feature for a mobile app, 2024. Situation: their super-resolution network trained with pixel-wise mean-squared error produced blurry, over-smoothed results that scored well on PSNR (the peak signal-to-noise ratio image-quality metric from Chapter 1, higher is closer to the reference) but looked soft to users. Problem: pixel loss rewards averaging, which blurs texture. Decision: they added a perceptual loss, comparing the activations of a frozen pretrained VGG-16 on the output and the target rather than comparing raw pixels, the standard trick from Johnson et al.'s style-transfer work. Result: textures (hair, fabric, foliage) sharpened visibly while the network trained on the same data, and user-rated quality rose even though PSNR dropped slightly, the metric decoupling the team had to learn to trust. Lesson: a network's intermediate features are reusable assets, not just a means to a classification score. VGG's simple, uniform features turned out to be such a good general-purpose perceptual space that the network outlived its own benchmark relevance, and you will see this same VGG-feature loss reappear in the generative evaluation of Chapter 37.
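
A perceptual loss of this kind fits in a short module. What follows is a sketch under assumptions, not the studio's implementation: the class name `PerceptualLoss` is hypothetical, inputs are assumed to be RGB in $[0, 1]$, the feature distance is plain mean-squared error, and a small random convolutional stack stands in for the extractor so the block runs without downloading weights. In practice the extractor would be a truncated pretrained VGG-16 such as `vgg.features[:16]` from the Library Shortcut.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptualLoss(nn.Module):
    """MSE between frozen-extractor features of an output and a target."""

    def __init__(self, extractor):
        super().__init__()
        self.extractor = extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)  # the extractor is never trained
        # ImageNet channel statistics expected by torchvision's pretrained models.
        self.register_buffer(
            "mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer(
            "std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, output, target):
        f_out = self.extractor((output - self.mean) / self.std)
        with torch.no_grad():  # no gradient is needed through the target
            f_tgt = self.extractor((target - self.mean) / self.std)
        return F.mse_loss(f_out, f_tgt)

# Stand-in extractor (an assumption for this demo only); swap in a truncated
# pretrained VGG-16 feature stack for real use.
stand_in = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
)
loss_fn = PerceptualLoss(stand_in)

output = torch.rand(1, 3, 32, 32, requires_grad=True)  # e.g. the upscaler's output
target = torch.rand(1, 3, 32, 32)                      # the reference image
loss = loss_fn(output, target)
loss.backward()  # gradients reach the output image, not the extractor
print("perceptual loss:", loss.item())
```

The gradient flows through the frozen extractor back to the image that produced the features, which is what lets the super-resolution network learn from feature differences while the VGG weights stay fixed.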

You Could Build This: A Neural Style Transfer Tool

The perceptual-loss idea above is one short step from a portfolio-worthy weekend build. Project (intermediate, about 2 to 4 hours): turn a content photo into a painting in the style of any reference image, using nothing but a frozen pretrained VGG-16 and gradient descent on the pixels. Load vgg16 and slice vgg.features as in the shortcut below, then optimize a starting image so that its activations at one mid-depth VGG layer match those of the content photo (the content loss) while the channel correlations of its activations at several earlier layers match those of the style image (the style loss, computed on the Gram matrix of the features). No training set and no network to train: the only thing that learns is the image itself. You will reuse the exact lesson of this section, that VGG's uniform features are a general-purpose perceptual space, and the same Gram-matrix machinery returns in the generative evaluation of Chapter 37. Ship it as a small command-line script or a one-page web demo and it reads as a genuine vision project, not a homework exercise.
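
The Gram-matrix style loss is the least familiar piece of that project, so here is a minimal sketch of it. The function names `gram_matrix` and `style_loss` are hypothetical, the division by $C \cdot H \cdot W$ is one common normalization choice rather than the only one, and random tensors stand in for the VGG activations you would extract at each chosen layer.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats):
    """Channel-by-channel correlations of a feature map.

    feats: (B, C, H, W) activations from one layer.
    Returns (B, C, C); entry [i, j] is the normalized inner product of
    channels i and j over all spatial positions.
    """
    b, c, h, w = feats.shape
    flat = feats.reshape(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def style_loss(gen_feats, style_feats):
    """Sum of Gram-matrix MSEs over a list of layers."""
    return sum(F.mse_loss(gram_matrix(g), gram_matrix(s))
               for g, s in zip(gen_feats, style_feats))

torch.manual_seed(0)
# Stand-ins for activations at two layers (64 and 128 channels).
gen = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
# The style image may have a different resolution: the Gram matrix
# discards spatial layout, so only the channel counts must match.
sty = [torch.randn(1, 64, 40, 40), torch.randn(1, 128, 20, 20)]

loss = style_loss(gen, sty)
print("gram shape:", tuple(gram_matrix(gen[0]).shape))
print("style loss:", loss.item())
```

Summing over spatial positions is what makes this a style statistic: it records which channels fire together but not where, so texture is matched without copying the style image's layout.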

Library Shortcut: VGG and Inception, Pretrained

Both networks ship in torchvision with ImageNet weights, so the hand-built blocks above become a one-liner each:

# Load VGG-16 and GoogLeNet with ImageNet weights, no block definitions needed.
# Slicing vgg.features to a layer index gives the frozen feature extractor used
# for the perceptual loss of subsection 3.
from torchvision.models import vgg16, VGG16_Weights, googlenet, GoogLeNet_Weights
vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval()
goog = googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1).eval()
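# To use VGG as a feature extractor (perceptual loss, subsection 3):
feat = vgg.features[:16]  # truncate to a chosen layer (here, through relu3_3)
for p in feat.parameters():
    p.requires_grad_(False)  # freeze: the extractor is never trained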

This replaces the dozens of lines of block definitions with a single call each. The library handles weight loading and, by default when loading pretrained weights, drops the auxiliary-classifier branches that GoogLeNet uses only during training. Input preprocessing (resize, crop, ImageNet normalization) is still your job; each weights enum exposes the matching pipeline through its transforms() method. Truncating vgg.features to a layer index is the core of the perceptual-loss recipe in the example above.

Code Fragment 3: The same VGG-16 and GoogLeNet loaded with a single torchvision call each instead of the hand-built vgg_block and Inception above. The vgg.features[:16] slice is the perceptual-loss feature extractor, letting you focus on the loss rather than the backbone.
Research Frontier: Are Patches All You Need?

The depth-versus-width debate took a surprising turn in 2022 to 2024. Trockman and Kolter's ConvMixer (TMLR 2023, arXiv:2201.09792) showed that an almost trivially simple all-convolutional network, operating on image patches with large depthwise kernels, rivals far more complex designs, suggesting that the patch embedding (which you will meet in vision transformers, Chapter 22) may matter more than the specific mixing operation. In parallel, RepLKNet (CVPR 2022, arXiv:2203.06717) revived very large kernels, the exact thing VGG argued against, but made them affordable with depthwise convolution and showed they grow the effective receptive field faster than deep small-kernel stacks. The VGG-versus-Inception axis of this section is, a decade later, still the frame researchers reach for when they ask how to spend a compute budget.

Exercise 20.2.1: Prove the Substitution (Conceptual)

(a) Show algebraically that three stacked $3 \times 3$ convolutions and one $7 \times 7$ convolution have the same receptive field, using the receptive-field recurrence from Chapter 19. (b) For a layer with $C$ input and $C$ output channels, compute the parameter count of each and express the saving as a percentage. (c) State, in one sentence, the property of the data that makes "more nonlinearities for the same field" a genuine advantage rather than a wash.

Exercise 20.2.2: Measure the 1x1 Bottleneck Saving (Coding)

Write two versions of an Inception-style module on a $192$-channel input: one where the $5 \times 5$ branch operates directly on all 192 channels, and one (as coded above) where a $1 \times 1$ layer first reduces to 16 channels. Use sum(p.numel() ...) and a FLOP counter (torchinfo.summary or fvcore) to report parameters and multiply-adds for each branch on a $28 \times 28$ feature map. Report the reduction factor and explain why it is roughly the channel-reduction ratio.

Exercise 20.2.3: Where Are VGG's Parameters? (Analysis)

Load vgg16 from torchvision and iterate over its named parameters, summing weights separately for features (convolutions) and classifier (dense layers). Report the split as a fraction of the ~138M total, and identify the single layer holding the most parameters. Then explain, referencing GoogLeNet's global average pooling in subsection 2, how that one layer could be removed and what accuracy or robustness tradeoff you would expect.