"Every year they told me I was too shallow, too wide, too slow, too heavy. So every year I removed the one thing holding me back. A decade later I am still made of the same nine-number filter, just arranged by people who finally read the gradient."
A Convolutional Network on Its Tenth Redesign
This chapter is the story of one operation, the learnable convolution of Chapter 19, redesigned a dozen times over a decade, where each redesign removes the single bottleneck that capped the previous one. Read in sequence, the great architectures are not a museum of unrelated tricks; they are a chain of root-cause fixes. AlexNet removed the compute ceiling with GPUs and ReLU. VGG removed the kernel-size guesswork by stacking small filters. Inception removed the scale-selection problem by computing several scales at once. ResNet removed the depth ceiling with a skip connection that let gradients flow. MobileNet and EfficientNet removed the cost ceiling for phones and edge devices. ConvNeXt removed the assumption that you needed a transformer at all. Learn the bottleneck each design attacked and you will never again memorize an architecture; you will be able to reconstruct it from the problem it solved.
One sentence to carry the whole chapter: every architecture is the answer to "what capped the last one?" If you remember only that question, the chapter rebuilds itself.
| Architecture (year) | The ceiling it hit | The one idea that removed it |
|---|---|---|
| LeNet (1998) | none yet; just no data or compute to scale | the conv, subsample, classify template |
| AlexNet (2012) | compute and dead gradients | GPUs and ReLU |
| VGG (2014) | kernel-size guesswork | stack small $3 \times 3$ filters |
| Inception (2014) | which scale to pick | several scales at once, $1 \times 1$ bottleneck |
| ResNet (2015) | depth makes it worse | the identity skip connection |
| MobileNet / EfficientNet (2017-19) | cost on a phone | factorize the convolution; scale all dims together |
| ConvNeXt (2022) | "attention replaced you" | the recipe, not the mechanism, was the gap |
Table 20.0.1 is the chapter on one page: a ladder where each rung names a ceiling and the single idea that broke it. Skim it now, then return to it after each section; by Section 20.6 you should be able to reproduce the right-hand column from the middle one, which is the test that you have learned the architectures rather than memorized them.
Chapter Overview
In Chapter 19 you built a convolutional network from its parts: the conv-BN-ReLU block, pooling, the receptive field, and a full training loop on CIFAR-10. You now hold every building unit the field uses. This chapter answers the next question, which is also the question that organized a decade of research: given these units, how should you arrange them? The honest answer in 2012 was "nobody is sure", and the answer in 2026 is "it depends on your budget, but here are the designs that won and exactly why". This chapter walks that arc as a detective story, where each architecture is a fix for a named, measurable bottleneck in the one before it.
Section 20.1 opens in 1998 with LeNet-5, the classic early convolutional network trained end to end by backpropagation, then jumps to 2012 and AlexNet, the network whose ImageNet win started the deep-learning era. The bottleneck AlexNet removed was raw compute and dead gradients: two GPUs and the ReLU nonlinearity turned a network that would have taken months into one that trained in days. Section 20.2 covers the two 2014 answers to "how deep, how wide?". VGG argued that a uniform stack of $3 \times 3$ convolutions is all you need, buying elegance at the price of a very large parameter count, while Inception argued that you should compute several filter sizes in parallel and let the network choose, introducing the $1 \times 1$ bottleneck that makes width affordable. Section 20.3 is the turning point of the entire chapter. ResNet identified that very deep plain networks get worse, not just harder to train, and fixed it with the residual connection, the single most consequential architectural idea in modern vision.
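The two 2014 arguments are, at bottom, arithmetic on weight counts. The sketch below is a minimal illustration and is not taken from either paper: the helper `conv_params` and every channel width in it are hypothetical values chosen for round numbers. It compares one $7 \times 7$ filter against three stacked $3 \times 3$ filters covering the same window, then a direct $5 \times 5$ convolution against one preceded by a $1 \times 1$ channel reduction.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out


C = 64  # hypothetical channel width, held constant through the stack

# VGG's argument: three stacked 3x3 convs at stride 1 see a 7x7 window
# (the receptive field grows 3 -> 5 -> 7) with fewer weights than one 7x7.
one_7x7 = conv_params(C, C, 7)
three_3x3 = 3 * conv_params(C, C, 3)

# Inception's argument: a 1x1 conv shrinks the channel count before the
# expensive 5x5, so the wide branch costs far less.
c_in, c_mid, c_out = 256, 64, 256  # hypothetical widths
direct_5x5 = conv_params(c_in, c_out, 5)
bottlenecked_5x5 = conv_params(c_in, c_mid, 1) + conv_params(c_mid, c_out, 5)

print(one_7x7, three_3x3)            # 200704 110592
print(direct_5x5, bottlenecked_5x5)  # 1638400 425984
```

The stacked version also inserts two extra nonlinearities, which this count does not capture; Section 20.2 treats that part of the argument.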
Section 20.4 turns from accuracy to efficiency. Once networks worked, the bottleneck moved to cost: how do you run a strong model on a phone, a drone, or a doorbell camera? MobileNet's depthwise-separable convolution, ShuffleNet's grouped convolutions with channel shuffle, and EfficientNet's principled compound scaling each attack a different facet of the parameters-versus-latency-versus-accuracy triangle. Section 20.5 closes the historical loop. After vision transformers (which you will meet in Chapter 22) appeared to dethrone the CNN in 2020, ConvNeXt asked a sharp question: was it the attention, or was it the training recipe? By modernizing a pure ResNet one change at a time, ConvNeXt matched transformers with no attention at all, strong evidence that much of the transformer's edge was the recipe, not the mechanism. Section 20.6 is the practitioner's section: given all these choices, how do you actually pick one for a real project, read a model's stats sheet, and start from a pretrained backbone rather than from scratch?
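To make the cost argument concrete, here is a minimal sketch of the weight count for a depthwise-separable convolution next to a standard one. The function names and the layer sizes are hypothetical, chosen only for illustration; biases and normalization parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Every output channel has its own k x k filter over every input channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Factorized form: spatial filtering and channel mixing are separate steps."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # a 1x1 conv that mixes channels
    return depthwise + pointwise


c_in, c_out, k = 128, 128, 3  # hypothetical layer
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std

print(std, sep)        # 147456 17536
print(f"{ratio:.3f}")  # 0.119, which equals 1/c_out + 1/k**2
```

The ratio $1/c_{\text{out}} + 1/k^2$ shows where the saving comes from: with $3 \times 3$ kernels and wide layers, the factorized layer needs roughly one ninth of the weights.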
A note on why this chapter repays close reading. The vocabulary set here (residual block, bottleneck, depthwise-separable convolution, compound scaling, stage, stem) is the vocabulary of every backbone you will use in detection (Chapter 23), segmentation (Chapter 24), and self-supervised pretraining (Chapter 25). The residual connection in particular reappears inside the transformer block of Chapter 22 and inside the U-Net denoiser of Chapter 33. When you later read a paper that says "a ResNet-50 backbone with an FPN neck", every word of that phrase will be something you built here.
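Why the residual connection travels so well can be seen in a scalar toy, offered here as a rough intuition rather than a model of a real network. Assume each layer has a small local derivative $g_l$ (the values below are hypothetical). A plain stack multiplies the $g_l$ together; a residual stack $y_{l+1} = y_l + f(y_l)$ multiplies the factors $1 + g_l$ instead.

```python
import math

depth = 20
# Hypothetical local derivatives f'(y_l): small, alternating in sign.
local_grads = [0.05 if i % 2 == 0 else -0.05 for i in range(depth)]

# Plain stack y_{l+1} = f(y_l): the end-to-end derivative is the product
# of the local derivatives, which collapses toward zero.
plain = math.prod(local_grads)

# Residual stack y_{l+1} = y_l + f(y_l): each factor becomes 1 + f'(y_l),
# so the identity path keeps the product near 1.
residual = math.prod(1.0 + g for g in local_grads)

print(f"plain:    {plain:.1e}")     # on the order of 1e-26
print(f"residual: {residual:.3f}")  # close to 1
```

Real layers have Jacobian matrices rather than scalars, and the factors can also grow, so this only conveys the direction of the effect that Section 20.3 develops properly.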
Prerequisites
You should have read Chapter 19: Convolutional Neural Networks, which provides the convolution layer, pooling, the receptive field, and batch normalization that every architecture here assembles. The PyTorch training loop, optimizers, and tensor mechanics from Chapter 18: Neural Networks & PyTorch for Vision are used without re-derivation, especially in the from-scratch implementations. The first-layer filter intuition from Chapter 3: Spatial Filtering & Convolution and the multi-scale image-pyramid idea from Chapter 4: The Frequency Domain & Multi-Scale Analysis make the Inception and feature-hierarchy discussions concrete. Comfort with reading parameter counts and FLOP estimates (just multiplication and addition) is assumed in Section 20.6.
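As a quick self-check on the "multiplication and addition" assumed above, the following minimal sketch counts the parameters and multiply-accumulate operations (MACs) of a single convolution layer. The function name and the example layer are hypothetical. Note that conventions differ: some sources report one MAC as one FLOP and others as two, so compare numbers only within a single convention.

```python
def conv_layer_cost(c_in, c_out, k, h_out, w_out):
    """Return (parameters, MACs) for a k x k convolution with bias.

    h_out and w_out are the spatial size of the OUTPUT feature map,
    since each output position requires one full filter application.
    """
    weights = k * k * c_in * c_out
    params = weights + c_out          # one bias per output channel
    macs = weights * h_out * w_out    # one filter application per output pixel
    return params, macs


# Hypothetical first layer of a CIFAR-sized network:
# 3 input channels, 64 output channels, 3x3 kernel, 32x32 output map.
params, macs = conv_layer_cost(c_in=3, c_out=64, k=3, h_out=32, w_out=32)
print(params, macs)  # 1792 1769472
```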
Chapter Roadmap
- 20.1 LeNet & AlexNet: The Breakthrough Years The 1998 network that started it all and the 2012 network that restarted it: LeNet-5's design, AlexNet's removal of the compute and dead-gradient bottlenecks with GPUs, ReLU, dropout, and overlapping pooling.
- 20.2 VGG & Inception: Depth vs Width The two 2014 answers: VGG's uniform stack of small kernels and why stacking beats one large filter, against Inception's parallel multi-scale branches and the $1 \times 1$ bottleneck that makes width cheap.
- 20.3 ResNet: Residual Learning Changes Everything The degradation problem in plain deep networks, the residual connection that fixes it, why skip connections keep gradients alive, the bottleneck block, and why this one idea reappears everywhere in the rest of the book.
- 20.4 Efficient Designs: MobileNet, ShuffleNet & EfficientNet Architecture for a cost budget: depthwise-separable convolution, inverted residuals, grouped convolution with channel shuffle, squeeze-and-excitation, and EfficientNet's compound scaling of depth, width, and resolution together.
- 20.5 ConvNeXt: The CNN, Modernized After the transformer scare: modernizing a ResNet one change at a time, the patchify stem, large depthwise kernels, fewer normalizations and activations, and the lesson that much of the transformer edge was the training recipe.
- 20.6 Choosing an Architecture in Practice Reading the stats sheet (parameters, FLOPs, latency, accuracy), the accuracy-versus-cost frontier, picking a backbone with timm, transfer learning from pretrained weights, and a decision checklist for real projects.
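The compound scaling named in the 20.4 entry above can be previewed with a few lines of arithmetic. This is a sketch under stated assumptions: the coefficients are the ones reported for the EfficientNet-B0 baseline, the function name and base values are hypothetical, and the published model variants round these numbers and adjust resolutions by hand, so the output will not match any released model exactly.

```python
# Coefficients reported in the EfficientNet paper, chosen there so that
# ALPHA * BETA**2 * GAMMA**2 is approximately 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Scale depth, width, and input resolution together by one knob, phi."""
    depth = base_depth * ALPHA**phi
    width = base_width * BETA**phi
    resolution = base_resolution * GAMMA**phi
    # Conv cost grows linearly with depth and quadratically with both
    # width and resolution, hence the squared terms.
    flops_multiplier = (ALPHA * BETA**2 * GAMMA**2) ** phi
    return depth, width, resolution, flops_multiplier


for phi in (1, 2, 3):
    depth, width, resolution, flops = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution {resolution:.0f}px, cost x{flops:.2f}")
```

Each unit step in `phi` roughly doubles the compute budget, which is the sense in which the three dimensions are scaled "together" rather than one at a time.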
What's Next?
This chapter gives you the great convolutional designs and, more importantly, the habit of reading an architecture as the answer to a bottleneck. The next chapter gives you the other half of the equation: even the best architecture is only as good as how you feed and train it. In Chapter 21: Training Recipes: Data, Augmentation & Transfer, you will learn the data pipelines, augmentation policies, learning-rate schedules, regularization, and transfer-learning workflows that turn the backbones of this chapter into accurate, robust models. That chapter makes explicit something Section 20.5 only hinted at: ConvNeXt matched transformers largely by importing their training recipe, so the recipe is now a first-class part of the architecture. After that, Chapter 22: Vision Transformers introduces the attention-based competitor that forced this chapter's final act, and you will be able to compare it head to head with everything you learned here.
Bibliography & Further Reading
Foundational Papers
Recent Research (2022-2026)
Books & Courses
Tools & Libraries
timm (PyTorch Image Models). github.com/huggingface/pytorch-image-models