Part III: Deep Learning for Computer Vision
Chapter 19: Convolutional Neural Networks

"In Chapter 3 you slid a kernel I designed by hand across an image. This chapter, you stop designing the kernel and let me learn it. You will be amazed, and slightly worried, by what I decide an edge is."

A Convolution Operator, Finally Off Its Leash
Big Picture

This chapter takes the convolution you built by hand in Chapter 3 and lets gradient descent choose the weights. The machinery is unchanged: a small grid of numbers slides across the image and computes a weighted sum at each stop. What changes is that the numbers are no longer designed; they are learned. That single shift, plus two structural priors named locality and weight sharing, turned the humble filter into the dominant architecture for images for a decade, and it explains why a network with a few million weights can outperform one with a few billion. Everything else in this chapter (pooling, batch normalization, feature hierarchies) exists to make that learnable convolution train well and see far.
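
To make locality and weight sharing concrete, here is a minimal NumPy sketch. It slides a kernel across a small image (as cross-correlation, which is what deep learning frameworks compute under the name convolution) and then builds the fully connected weight matrix that produces the same output. The function names `slide_kernel` and `as_dense_matrix`, the 6x6 image, and the random kernel are illustrative assumptions, not code from Chapter 3.

```python
import numpy as np

def slide_kernel(image, kernel):
    """Weighted sum at each stop of the kernel (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def as_dense_matrix(kernel, image_shape):
    """Fully connected weight matrix that computes the same output."""
    kh, kw = kernel.shape
    h, w = image_shape
    out_h, out_w = h - kh + 1, w - kw + 1
    dense = np.zeros((out_h * out_w, h * w))
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j  # one row per output position
            for a in range(kh):
                for b in range(kw):
                    # Locality: only the pixels under the kernel get a weight.
                    # Weight sharing: every row reuses the same kernel values.
                    dense[row, (i + a) * w + (j + b)] = kernel[a, b]
    return dense

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))

conv_out = slide_kernel(image, kernel)
dense = as_dense_matrix(kernel, image.shape)
dense_out = (dense @ image.ravel()).reshape(conv_out.shape)

print(np.allclose(conv_out, dense_out))                       # True
print(dense.size, np.count_nonzero(dense), kernel.size)       # 576 144 9
```

The dense matrix has 576 slots, of which 144 are nonzero (locality) and only 9 are distinct values (weight sharing). Those 9 numbers are everything gradient descent has to learn for this layer.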

Chapter Overview

In Chapter 18 you built fully connected networks and trained them with backpropagation and PyTorch. Those networks treat an image as a flat vector of pixels, which throws away the one fact every photograph obeys: nearby pixels are related, and a pattern that appears in one corner can appear in any other. A fully connected layer on a modest $224 \times 224$ color image needs over 150 million weights just to reach a 1000-unit hidden layer, and it must relearn the concept of an edge separately for every location. This is the problem convolutional neural networks solve, and they solve it not by adding capacity but by removing it intelligently.
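
The weight count quoted above is simple arithmetic, and it is worth checking once by hand. The short sketch below does so and contrasts it with a convolutional layer; the choice of 64 filters of size 3x3 is a hypothetical example, not a layer defined in this chapter. The two layers produce different kinds of output, so the comparison is about scale only.

```python
# Fully connected: every input value connects to every hidden unit.
height, width, channels = 224, 224, 3
hidden_units = 1000
fc_weights = height * width * channels * hidden_units

# Convolutional (hypothetical layer for comparison): 64 filters of size 3x3
# over the same 3 input channels, shared across every image position.
kernel_size, out_channels = 3, 64
conv_weights = kernel_size * kernel_size * channels * out_channels

print(f"fully connected weights: {fc_weights:,}")    # 150,528,000
print(f"3x3 conv weights:        {conv_weights:,}")  # 1,728
```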

Section 19.1 makes the argument from first principles. It shows that a convolutional layer is exactly a fully connected layer with two constraints bolted on: each output looks only at a local patch (locality), and the same weights are reused at every position (weight sharing). These constraints are an inductive bias, a built-in assumption about what images are like, and because the assumption is true, the constrained network generalizes from far less data. Section 19.2 turns the idea into the real layer practitioners use, with multiple input and output channels, stride, padding, and dilation, and shows the exact tensor shapes that PyTorch's Conv2d expects. Section 19.3 introduces pooling and the concept of the receptive field, the window of input pixels that influences a given activation, and explains how stacking layers grows that window into a feature hierarchy that climbs from edges to textures to object parts.
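
Two pieces of arithmetic come up repeatedly in Sections 19.2 and 19.3, and a rough sketch of both is given here in plain Python. The first function follows the output-size formula documented for torch.nn.Conv2d; the second uses the standard layer-by-layer recursion for the theoretical receptive field. The helper names `conv_output_size` and `receptive_field` and the example layer stacks are assumptions made for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def receptive_field(layers):
    """Theoretical receptive field of a stack of layers.

    layers: list of (kernel, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# A 3x3 convolution with padding 1 preserves a 32-pixel side length.
print(conv_output_size(32, kernel=3, padding=1))             # 32
# Stride 2 halves it.
print(conv_output_size(32, kernel=3, stride=2, padding=1))   # 16
# A 7x7, stride-2, padding-3 convolution on a 224-pixel side.
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112

# Three stacked 3x3 convolutions see a 7x7 window of the input.
print(receptive_field([(3, 1, 1)] * 3))                      # 7
# conv 3x3, then 2x2 pooling with stride 2, then conv 3x3.
print(receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)]))    # 8
```

The second pair of results shows why pooling and stride matter for seeing far: once the jump doubles, every later layer widens the window twice as fast.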

Section 19.4 confronts the practical reason early deep CNNs were hard to train: as signals propagate through many layers, their statistics drift, and learning stalls. Batch normalization and its relatives (layer, group, and instance normalization) fix this by re-centering and re-scaling activations, and they remain in nearly every architecture you will meet in Chapter 20. Section 19.5 is the payoff: a complete convolutional network trained end to end on CIFAR-10, with the full PyTorch training loop, data loading, augmentation, and evaluation, reaching competitive accuracy in minutes on a single GPU. Section 19.6 closes the loop on understanding. It opens the trained network and shows what it learned, visualizing first-layer filters that rediscover the Sobel and Gabor kernels of Part I, feature maps that light up on textures, and saliency and Grad-CAM maps that reveal which pixels drove a decision.
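
The re-centering and re-scaling step can be sketched in a few lines of NumPy. The version below is a simplified training-mode forward pass over an (N, C, H, W) array; it omits the running statistics used at inference and is not the chapter's implementation. The function name `batch_norm_2d` and the synthetic input are assumptions for illustration.

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for an (N, C, H, W) array."""
    # One mean and variance per channel, pooled over batch and spatial axes.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-channel scale (gamma) and shift (beta).
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
# Synthetic activations with a drifted mean (5) and spread (3).
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 16, 16))
y = batch_norm_2d(x, gamma=np.ones(4), beta=np.zeros(4))

print(np.round(y.mean(axis=(0, 2, 3)), 3))  # approximately 0 per channel
print(np.round(y.std(axis=(0, 2, 3)), 3))   # approximately 1 per channel
```

The relatives differ mainly in which axes the statistics are pooled over: layer normalization uses (C, H, W) within each sample, instance normalization uses (H, W) within each sample and channel, and group normalization uses (H, W) plus a group of channels.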

A word on why this chapter is load-bearing. The convolution is the book's signature recurring character. You met it as a designed filter in Chapter 3; here it becomes learnable, and in Chapter 33 the same operation, stacked into a U-Net, will turn noise into photographs. The vocabulary set here (channels, stride, receptive field, normalization, feature map) is the vocabulary of every architecture diagram in the rest of the book. Read this chapter while the kernels are still small enough to inspect by eye, because by Chapter 22 they will be replaced, and you will want to know exactly what they were doing.

Prerequisites

You should have read Chapter 18: Neural Networks & PyTorch for Vision, which introduces tensors, automatic differentiation, the training loop, and the optimizers this chapter uses without re-deriving them. The conceptual heart of the chapter rests on Chapter 3: Spatial Filtering & Convolution, where convolution, correlation, kernels, and border handling were built from scratch; this chapter assumes you can read a $3 \times 3$ kernel and predict roughly what it does. The first-layer filter visualizations in Section 19.6 connect directly to the Sobel derivative kernels of Chapter 3, the oriented Gabor filters of Section 4.6, and the gradient operators of Chapter 9. Basic linear algebra (dot products, matrix shapes) and comfort with NumPy arrays from Chapter 0 are assumed throughout.

What's Next?

This chapter gives you the layer and the network; the next gives you the great designs built from them. In Chapter 20: CNN Architectures: From LeNet to ConvNeXt, the conv-BN-ReLU block you assembled here becomes a building unit in the architectures that defined the field: LeNet's first demonstration, AlexNet's 2012 breakthrough, VGG's uniform depth, the residual connection of ResNet that finally made hundred-layer networks trainable, the efficiency of MobileNet and EfficientNet, and ConvNeXt's modern answer to the vision transformer. You will see how each design trades off accuracy, parameters, and compute, and why the residual connection, together with the batch normalization you learned here, turns out to be one of the two ideas that made all the depth possible.

Bibliography & Further Reading

Foundational Papers

LeCun, Y. et al. "Gradient-Based Learning Applied to Document Recognition." Proceedings of the IEEE (1998). lecun.com/exdb/publis
The LeNet-5 paper that introduced the modern convolutional network: convolution, subsampling (pooling), and a learnable feature hierarchy trained by backpropagation. Every section of this chapter descends from it.
Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS (2012). papers.nips.cc
AlexNet: the network whose 2012 ImageNet win launched the deep-learning era of computer vision. Popularized ReLU, dropout, and GPU training, all built on the convolutional layer of Section 19.2.
Ioffe, S. and Szegedy, C. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML (2015). arXiv:1502.03167
The batch-normalization paper of Section 19.4. The "internal covariate shift" explanation has been challenged, but the technique it introduced remains in nearly every CNN you will train.
Zeiler, M. and Fergus, R. "Visualizing and Understanding Convolutional Networks." ECCV (2014). arXiv:1311.2901
The deconvolution-based feature visualizations behind Section 19.6, showing the edges-to-parts-to-objects hierarchy directly and explaining why opening a trained CNN is worth the effort.
Luo, W. et al. "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks." NeurIPS (2016). arXiv:1701.04128
Shows that the theoretical receptive field of Section 19.3 overstates the truth: influence falls off like a Gaussian, so the effective receptive field is much smaller than the nominal one. Essential reading after computing receptive fields by hand.
Selvaraju, R. et al. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization." ICCV (2017). arXiv:1610.02391
The class-discriminative localization method used in Section 19.6 to show which pixels drove a prediction. Works on any CNN without retraining and is a standard debugging tool in production.

Recent Research (2021-2026)

Liu, Z. et al. "A ConvNet for the 2020s." CVPR (2022). arXiv:2201.03545
ConvNeXt: a modernized pure-convolutional network that matches vision transformers by importing their training recipe and macro design, while keeping the learnable convolution at its core. The bridge from this chapter to Chapter 20.
Ding, X. et al. "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs." CVPR (2022). arXiv:2203.06717
RepLKNet shows that very large depthwise kernels enlarge the effective receptive field directly, an alternative to the deep-stack route of Section 19.3 for capturing long-range context.
Brock, A. et al. "High-Performance Large-Scale Image Recognition Without Normalization." ICML (2021). arXiv:2102.06171
NFNets: remove batch normalization (Section 19.4) entirely, using adaptive gradient clipping and scaled weight standardization, and reach state-of-the-art ImageNet accuracy at the time of publication. The strongest evidence that normalization is a means, not an end.

Books & Courses

Prince, S. J. D. Understanding Deep Learning (2023). udlbook.github.io
Free, exceptionally clear treatment of convolutional networks, receptive fields, and normalization, with figures that complement this chapter. The convolution chapter is a recommended parallel read.
Zhang, A. et al. Dive into Deep Learning (2023). d2l.ai
Interactive textbook with runnable PyTorch code for every CNN concept in this chapter, including a from-scratch convolution and a batch-norm implementation matching Sections 19.2 and 19.4.

Tools & Libraries

PyTorch. torch.nn.Conv2d and torch.nn.BatchNorm2d documentation. pytorch.org/docs
The reference for the exact arguments, tensor shapes, and default behaviors of the layers used throughout this chapter, including padding modes, groups, and the running-statistics machinery of batch norm.
torchvision. Datasets, transforms, and models reference. pytorch.org/vision
Provides the CIFAR-10 dataset loader, augmentation transforms, and pretrained backbones used in Sections 19.5 and 19.6.
Gildenblat, J. et al. pytorch-grad-cam library. github.com/jacobgil/pytorch-grad-cam
A maintained implementation of Grad-CAM and a dozen related class-activation methods used in Section 19.6, with a few-line API that replaces the from-scratch hook code.

Datasets & Benchmarks

Krizhevsky, A. "Learning Multiple Layers of Features from Tiny Images" (CIFAR-10/100 technical report, 2009). cs.toronto.edu/~kriz/cifar
The CIFAR-10 dataset trained in Section 19.5: 60,000 32x32 color images in 10 classes, the standard small-scale benchmark for prototyping convolutional architectures.