Part III: Deep Learning for Computer Vision
Chapter 20: CNN Architectures: From LeNet to ConvNeXt

"Every year they told me I was too shallow, too wide, too slow, too heavy. So every year I removed the one thing holding me back. A decade later I am still made of the same nine-number filter, just arranged by people who finally read the gradient."

A Convolutional Network on Its Tenth Redesign

Big Picture

This chapter is the story of one operation, the learnable convolution of Chapter 19, redesigned a dozen times over a decade, where each redesign removes the single bottleneck that capped the previous one. Read in sequence, the great architectures are not a museum of unrelated tricks; they are a chain of root-cause fixes. AlexNet removed the compute ceiling with GPUs and ReLU. VGG removed the kernel-size guesswork by stacking small filters. Inception removed the scale-selection problem by computing several scales at once. ResNet removed the depth ceiling with a skip connection that let gradients flow. MobileNet and EfficientNet removed the cost ceiling for phones and edge devices. ConvNeXt removed the assumption that you needed a transformer at all. Learn the bottleneck each design attacked and you will never again memorize an architecture; you will be able to reconstruct it from the problem it solved.

One sentence to carry the whole chapter: every architecture is the answer to "what capped the last one?" If you remember only that question, the chapter rebuilds itself.

Table 20.0.1: The bottleneck ladder. Read top to bottom, each row is one architecture and the single ceiling it removed; this is the chapter compressed to a recall card.
Architecture (year) | The ceiling it hit | The one idea that removed it
LeNet (1998) | none yet; just no data or compute to scale | the conv, subsample, classify template
AlexNet (2012) | compute and dead gradients | GPUs and ReLU
VGG (2014) | kernel-size guesswork | stack small $3 \times 3$ filters
Inception (2014) | which scale to pick | several scales at once, $1 \times 1$ bottleneck
ResNet (2015) | depth makes it worse | the identity skip connection
MobileNet / EfficientNet (2017-19) | cost on a phone | factorize the convolution; scale all dims together
ConvNeXt (2022) | "attention replaced you" | the recipe, not the mechanism, was the gap

Table 20.0.1 is the chapter on one page: a ladder where each rung names a ceiling and the single idea that broke it. Skim it now, then return to it after each section; by Section 20.6 you should be able to reproduce the right-hand column from the middle one, which is the test that you have learned the architectures rather than memorized them.

Chapter Overview

In Chapter 19 you built a convolutional network from its parts: the conv-BN-ReLU block, pooling, the receptive field, and a full training loop on CIFAR-10. You now hold every building unit the field uses. This chapter answers the next question, which is also the question that organized a decade of research: given these units, how should you arrange them? The honest answer in 2012 was "nobody is sure", and the answer in 2026 is "it depends on your budget, but here are the designs that won and exactly why". This chapter walks that arc as a detective story, where each architecture is a fix for a named, measurable bottleneck in the one before it.

Section 20.1 opens in 1998 with LeNet-5, the classic early convolutional network trained end to end by backpropagation, then jumps to 2012 and AlexNet, the network whose ImageNet win started the deep-learning era. The bottleneck AlexNet removed was raw compute and dead gradients: two GPUs and the ReLU nonlinearity turned a network that would have taken months into one that trained in days. Section 20.2 covers the two 2014 answers to "how deep, how wide?". VGG argued that a uniform stack of $3 \times 3$ convolutions is all you need, buying simplicity at the price of a large parameter count, while Inception argued that you should compute several filter sizes in parallel and let the network choose, introducing the $1 \times 1$ bottleneck that makes width affordable. Section 20.3 is the turning point of the entire chapter. ResNet identified that very deep plain networks get worse, not just harder to train, and fixed it with the residual connection, the single most consequential architectural idea in modern vision.
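
The two 2014 arguments in this overview come down to counting weights, and a few lines of plain Python make the counts concrete. The sketch below is illustrative only: the channel widths (256, 64, 32) and the helper names `conv_weights` and `stacked_receptive_field` are assumptions chosen for the example, not values taken from VGG or Inception, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in one kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stride-1 convolutions with the same kernel."""
    return 1 + layers * (kernel - 1)


# VGG-style argument: three stacked 3x3 layers versus one 7x7 layer.
C = 256  # hypothetical channel width, kept constant through the stack
three_3x3 = 3 * conv_weights(3, C, C)   # 27 * C^2 weights
one_7x7 = conv_weights(7, C, C)         # 49 * C^2 weights
same_window = stacked_receptive_field(3, 3)  # both designs see a 7x7 window

# Inception-style argument: a 5x5 branch with and without a 1x1 reduction.
direct_5x5 = conv_weights(5, 256, 64)
bottlenecked_5x5 = conv_weights(1, 256, 32) + conv_weights(5, 32, 64)

print(f"receptive field of three 3x3 layers: {same_window}")
print(f"three 3x3: {three_3x3:,} weights, one 7x7: {one_7x7:,} weights")
print(f"5x5 direct: {direct_5x5:,} weights, with 1x1 reduction: {bottlenecked_5x5:,}")
```

Under these assumed widths, three stacked $3 \times 3$ layers cover the same $7 \times 7$ window with 27/49 (about 55%) of the weights, and the $1 \times 1$ reduction brings the $5 \times 5$ branch down to roughly 15% of its direct cost.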

Section 20.4 turns from accuracy to efficiency. Once networks worked, the bottleneck moved to cost: how do you run a strong model on a phone, a drone, or a doorbell camera? MobileNet's depthwise-separable convolution, ShuffleNet's grouped convolutions with channel shuffle, and EfficientNet's principled compound scaling each attack a different facet of the parameters-versus-latency-versus-accuracy triangle. Section 20.5 closes the historical loop. After vision transformers (which you will meet in Chapter 22) appeared to dethrone the CNN in 2020, ConvNeXt asked a sharp question: was it the attention, or was it the training recipe? By modernizing a pure ResNet one change at a time, ConvNeXt matched transformers with no attention at all, showing that much of the transformer's edge was the recipe, not the mechanism. Section 20.6 is the practitioner's section: given all these choices, how do you actually pick one for a real project, read a model's stats sheet, and start from a pretrained backbone rather than from scratch?
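
Two of the efficiency ideas named above are simple enough to check by arithmetic. The sketch below counts multiply-accumulates for a standard versus a depthwise-separable convolution, then applies compound scaling with a single coefficient. The layer shape (a $3 \times 3$ kernel, 128 to 128 channels, on a $56 \times 56$ map) is a hypothetical example, and the constants 1.2, 1.1, and 1.15 are taken here to be the depth, width, and resolution bases reported in the EfficientNet paper; treat them as an assumption and check the paper before relying on them.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates for a k x k convolution producing an h x w map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """One k x k filter per input channel, followed by a 1 x 1 channel mix."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
k, c_in, c_out, h, w = 3, 128, 128, 56, 56
std_macs = standard_conv_macs(k, c_in, c_out, h, w)
sep_macs = depthwise_separable_macs(k, c_in, c_out, h, w)
cost_ratio = sep_macs / std_macs  # algebraically 1 / c_out + 1 / k**2

# Assumed scaling bases (see the lead-in); phi is the single user-chosen coefficient.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, relative cost) multipliers for phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Convolution cost grows linearly with depth and quadratically with
    # width and input resolution.
    relative_cost = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, relative_cost


print(f"separable / standard cost: {cost_ratio:.3f}")
for phi in range(4):
    d, wd, r, cost = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{wd:.2f}, res x{r:.2f}, cost x{cost:.2f}")
```

With these numbers the separable layer needs about 12% of the standard layer's multiply-accumulates, and each unit increase in the coefficient roughly doubles the cost, since $1.2 \times 1.1^2 \times 1.15^2 \approx 1.92$.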

A note on why this chapter repays close reading. The vocabulary set here (residual block, bottleneck, depthwise-separable convolution, compound scaling, stage, stem) is the vocabulary of every backbone you will use in detection (Chapter 23), segmentation (Chapter 24), and self-supervised pretraining (Chapter 25). The residual connection in particular reappears inside the transformer block of Chapter 22 and inside the U-Net denoiser of Chapter 33. When you later read a paper that says "a ResNet-50 backbone with an FPN neck", every word of that phrase will be something you built here.
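
Because the residual connection recurs so often, it is worth seeing in miniature why the identity path matters for gradients. The sketch below is a toy scalar model, assuming each layer's learned branch has a small fixed slope (the alternating values of +0.1 and -0.1 are hypothetical). A plain layer multiplies the incoming gradient by the branch slope, while a residual layer $y = x + F(x)$ multiplies it by $1 + F'(x)$. It illustrates the chain rule only, not the behavior of any real trained network.

```python
def gradient_through_stack(branch_slopes, residual):
    """Chain-rule product of per-layer derivatives for a stack of scalar layers.

    Plain layer:    y = F(x)      -> derivative F'(x)
    Residual layer: y = x + F(x)  -> derivative 1 + F'(x)
    """
    grad = 1.0
    for slope in branch_slopes:
        grad *= (1.0 + slope) if residual else slope
    return grad


depth = 50  # hypothetical number of layers
# Hypothetical branch slopes: small and alternating in sign.
branch_slopes = [0.1 if i % 2 == 0 else -0.1 for i in range(depth)]

plain_grad = gradient_through_stack(branch_slopes, residual=False)
skip_grad = gradient_through_stack(branch_slopes, residual=True)

print(f"plain stack gradient:    {plain_grad:.3e}")
print(f"residual stack gradient: {skip_grad:.3f}")
```

In this toy setting the plain product collapses toward zero after fifty layers, while the residual product stays of order one because every factor sits near 1 rather than near 0.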

Prerequisites

You should have read Chapter 19: Convolutional Neural Networks, which provides the convolution layer, pooling, the receptive field, and batch normalization that every architecture here assembles. The PyTorch training loop, optimizers, and tensor mechanics from Chapter 18: Neural Networks & PyTorch for Vision are used without re-derivation, especially in the from-scratch implementations. The first-layer filter intuition from Chapter 3: Spatial Filtering & Convolution and the multi-scale image-pyramid idea from Chapter 4: The Frequency Domain & Multi-Scale Analysis make the Inception and feature-hierarchy discussions concrete. Comfort with reading parameter counts and FLOP estimates (just multiplication and addition) is assumed in Section 20.6.
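
As a quick self-check on that last prerequisite, the sketch below does the multiplication and addition for a single convolution layer. The layer (a $7 \times 7$ convolution, 3 to 64 channels, stride 2, padding 3, on a $224 \times 224$ input) is a hypothetical example, the helper names are made up for illustration, and counting two FLOPs per multiply-accumulate is a convention that varies between papers.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(kernel, c_in, c_out, bias=True):
    """Learnable parameters: one kernel per (input, output) channel pair, plus biases."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def conv_macs(kernel, c_in, c_out, out_h, out_w):
    """Multiply-accumulates: every output position applies every weight once."""
    return kernel * kernel * c_in * c_out * out_h * out_w


# Hypothetical first layer of an ImageNet-style network.
out = conv_output_size(224, kernel=7, stride=2, padding=3)
params = conv_params(7, 3, 64)
macs = conv_macs(7, 3, 64, out, out)
flops = 2 * macs  # assumed convention: one multiply plus one add per MAC

print(f"output map: {out} x {out}")
print(f"parameters: {params:,}")
print(f"MACs: {macs:,}  (about {flops / 1e9:.2f} GFLOPs under the 2x convention)")
```

If these three numbers are easy to reproduce by hand, the stats-sheet reading in Section 20.6 should pose no difficulty.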

What's Next?

This chapter gives you the great convolutional designs and, more importantly, the habit of reading an architecture as the answer to a bottleneck. The next chapter gives you the other half of the equation: even the best architecture is only as good as how you feed and train it. In Chapter 21: Training Recipes: Data, Augmentation & Transfer, you will learn the data pipelines, augmentation policies, learning-rate schedules, regularization, and transfer-learning workflows that turn the backbones of this chapter into accurate, robust models. That chapter makes explicit something Section 20.5 only hinted at: ConvNeXt matched transformers largely by importing their training recipe, so the recipe is now a first-class part of the architecture. After that, Chapter 22: Vision Transformers introduces the attention-based competitor that forced this chapter's final act, and you will be able to compare it head to head with everything you learned here.

Bibliography & Further Reading

Foundational Papers

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. "Gradient-Based Learning Applied to Document Recognition." Proceedings of the IEEE (1998). lecun.com/exdb/publis
The LeNet-5 paper of Section 20.1. It established the convolution-subsample-classify template that every architecture in this chapter inherits, and trained it on handwritten digits decades before the data and hardware existed to scale it.
Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS (2012). papers.nips.cc
AlexNet, the 2012 ImageNet winner of Section 20.1. ReLU, dropout, GPU training, and overlapping pooling, the combination that removed the compute and gradient bottlenecks and started the modern era.
Simonyan, K. and Zisserman, A. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR (2015). arXiv:1409.1556
VGG of Section 20.2. The argument that a uniform stack of $3 \times 3$ convolutions matches large kernels at lower cost, and the network whose features became a standard transfer-learning backbone.
Szegedy, C. et al. "Going Deeper with Convolutions." CVPR (2015). arXiv:1409.4842
GoogLeNet / Inception of Section 20.2. The parallel multi-scale module and the $1 \times 1$ bottleneck that let the network compute several filter sizes at once without exploding the cost.
He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition." CVPR (2016). arXiv:1512.03385
ResNet of Section 20.3, the pivot of the chapter. The residual connection that solved the degradation problem and made hundred-layer (and thousand-layer) networks trainable. Among the most cited papers in modern computer vision.
Howard, A. et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications." arXiv (2017). arXiv:1704.04861
MobileNet of Section 20.4. Depthwise-separable convolution factorizes a standard convolution into a per-channel spatial filter plus a $1 \times 1$ mix, cutting cost by roughly a factor of the kernel area (8 to 9 times for a $3 \times 3$ kernel) with little accuracy loss.
Tan, M. and Le, Q. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." ICML (2019). arXiv:1905.11946
EfficientNet of Section 20.4. Compound scaling grows depth, width, and input resolution together by a single coefficient, defining an accuracy-per-FLOP frontier that dominated for years.
Liu, Z. et al. "A ConvNet for the 2020s." CVPR (2022). arXiv:2201.03545
ConvNeXt of Section 20.5. A pure convolutional network modernized step by step until it matched the Swin transformer, isolating which design and training choices actually mattered.

Recent Research (2022-2026)

Woo, S. et al. "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders." CVPR (2023). arXiv:2301.00808
ConvNeXt V2 adds a global response normalization layer and pairs the architecture with masked-autoencoder self-supervised pretraining (a preview of Chapter 25), pushing pure CNNs further still.
Wightman, R., Touvron, H., and Jégou, H. "ResNet strikes back: An improved training procedure in timm." NeurIPS Workshop (2021). arXiv:2110.00476
Demonstrates that a plain ResNet-50, trained with a modern recipe, jumps several accuracy points, reinforcing Section 20.5's lesson that recipe and architecture are entangled, and that old backbones are far from obsolete.

Books & Courses

Prince, S. J. D. Understanding Deep Learning (2023). udlbook.github.io
Free, clearly illustrated treatment of convolutional architectures and residual networks, with figures that complement the bottleneck-by-bottleneck narrative of this chapter.
Zhang, A. et al. Dive into Deep Learning (2023). d2l.ai
Interactive textbook with runnable PyTorch implementations of LeNet, AlexNet, VGG, GoogLeNet, ResNet, and more, an ideal companion for re-implementing the architectures of this chapter.

Tools & Libraries

Wightman, R. timm (PyTorch Image Models). github.com/huggingface/pytorch-image-models
The de facto library of pretrained vision backbones used throughout Section 20.6, with hundreds of architectures (ResNet, EfficientNet, ConvNeXt, and more), benchmark tables, and a one-line model factory.
torchvision. Models, weights, and transforms reference. pytorch.org/vision/models
The official PyTorch catalogue of architectures and pretrained weights (with documented ImageNet accuracies) for AlexNet, VGG, ResNet, MobileNet, EfficientNet, and ConvNeXt, the source of the backbones loaded in code throughout this chapter.

Datasets & Benchmarks

Russakovsky, O. et al. "ImageNet Large Scale Visual Recognition Challenge." IJCV (2015). arXiv:1409.0575
The ImageNet benchmark (1000 classes, 1.2M training images) on which every architecture in this chapter was measured. The annual challenge from 2010 to 2017 is the scoreboard against which the whole story is told.