"In Chapter 3 you slid a kernel I designed by hand across an image. This chapter, you stop designing the kernel and let me learn it. You will be amazed, and slightly worried, by what I decide an edge is."
A Convolution Operator, Finally Off Its Leash
This chapter takes the convolution you built by hand in Chapter 3 and lets gradient descent choose the weights. The machinery is unchanged: a small grid of numbers slides across the image and computes a weighted sum at each stop. What changes is that the numbers are no longer designed; they are learned. That single shift, plus two structural priors named locality and weight sharing, turned the humble filter into the dominant architecture for images for a decade, and it explains why a network with a few million weights can outperform one with a few billion. Everything else in this chapter (pooling, batch normalization, feature hierarchies) exists to make that learnable convolution train well and see far.
Chapter Overview
In Chapter 18 you built fully connected networks and trained them with backpropagation and PyTorch. Those networks treat an image as a flat vector of pixels, which throws away the one fact every photograph obeys: nearby pixels are related, and a pattern that appears in one corner can appear in any other. A modest $224 \times 224$ color image flattens to $224 \times 224 \times 3 = 150{,}528$ inputs, so a fully connected layer needs over 150 million weights just to reach a 1000-unit hidden layer, and it must relearn the concept of an edge separately for every location. This is the problem convolutional neural networks solve, and they solve it not by adding capacity but by removing it intelligently.
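The arithmetic behind that figure is easy to check. The sketch below counts weights only (biases ignored) for the fully connected layer described above, and compares it with a convolutional layer whose size is an illustrative assumption, 64 filters of $3 \times 3$, not a configuration taken from this chapter.

```python
# Weight counts for the two layer types, biases ignored.
H, W, C = 224, 224, 3      # modest color image, as in the text
hidden_units = 1000        # width of the fully connected hidden layer

# Every hidden unit connects to every input value.
fc_weights = H * W * C * hidden_units

# Hypothetical comparison layer (an assumption for illustration):
# 64 filters, each 3x3 across the 3 input channels, shared over all positions.
kernel_size, out_channels = 3, 64
conv_weights = kernel_size * kernel_size * C * out_channels

print(f"fully connected: {fc_weights:,} weights")
print(f"3x3 conv, 64 filters: {conv_weights:,} weights")
```

The convolutional count does not depend on the image size at all, which is the practical meaning of weight sharing.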
Section 19.1 makes the argument from first principles. It shows that a convolutional layer is exactly a fully connected layer with two constraints bolted on: each output looks only at a local patch (locality), and the same weights are reused at every position (weight sharing). These constraints are an inductive bias, a built-in assumption about what images are like, and because the assumption is true, the constrained network generalizes from far less data. Section 19.2 turns the idea into the real layer practitioners use, with multiple input and output channels, stride, padding, and dilation, and shows the exact tensor shapes that PyTorch's Conv2d expects. Section 19.3 introduces pooling and the concept of the receptive field, the window of input pixels that influences a given activation, and explains how stacking layers grows that window into a feature hierarchy that climbs from edges to textures to object parts.
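The claim that a convolutional layer is a constrained fully connected layer can be previewed in one dimension. The following is a minimal sketch in plain Python, assuming no padding and stride 1; the helper names `correlate1d` and `as_dense_matrix` are hypothetical and exist only for this illustration.

```python
def correlate1d(x, w):
    """Slide kernel w across x (no padding, stride 1) and take a weighted sum at each stop."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

def as_dense_matrix(w, n):
    """Build the fully connected weight matrix that computes the same outputs.

    Locality: every entry outside a k-wide band is zero.
    Weight sharing: each row holds the same k values, shifted one step right.
    """
    k = len(w)
    rows = []
    for i in range(n - k + 1):
        row = [0.0] * n
        row[i:i + k] = w
        rows.append(row)
    return rows

x = [1.0, 2.0, 4.0, 7.0, 11.0]
w = [-1.0, 0.0, 1.0]          # a small difference kernel

dense = as_dense_matrix(w, len(x))
fc_out = [sum(row[j] * x[j] for j in range(len(x))) for row in dense]
conv_out = correlate1d(x, w)

print(conv_out)               # same values as fc_out
print(len(dense) * len(x), "dense entries,", len(w), "free parameters")
```

The dense matrix has 15 entries but only 3 distinct learnable values; the two constraints account for the difference.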
Section 19.4 confronts the practical reason early deep CNNs were hard to train: as signals propagate through many layers, their statistics drift, and learning stalls. Batch normalization and its relatives (layer, group, and instance normalization) fix this by re-centering and re-scaling activations, and they remain in nearly every architecture you will meet in Chapter 20. Section 19.5 is the payoff: a complete convolutional network trained end to end on CIFAR-10, with the full PyTorch training loop, data loading, augmentation, and evaluation, reaching competitive accuracy in minutes on a single GPU. Section 19.6 closes the loop on understanding. It opens the trained network and shows what it learned, visualizing first-layer filters that rediscover the Sobel and Gabor kernels of Part I, feature maps that light up on textures, and saliency and Grad-CAM maps that reveal which pixels drove a decision.
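Re-centering and re-scaling can be made concrete with a few lines of plain Python. This is a sketch of the training-mode forward pass only, under simplifying assumptions: the activations of a single channel are flattened into one list, `gamma` and `beta` stand in for the learnable scale and shift, and the running statistics used at evaluation time are omitted. The function name `batch_norm_forward` is hypothetical.

```python
from statistics import fmean, pvariance

def batch_norm_forward(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations to zero mean and unit variance,
    then apply a scale (gamma) and a shift (beta)."""
    mu = fmean(values)                 # batch mean
    var = pvariance(values, mu)        # biased (population) batch variance
    return [gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in values]

activations = [2.0, 4.0, 6.0, 8.0]     # toy values for one channel
normalized = batch_norm_forward(activations)

print(normalized)
print("mean:", fmean(normalized), "variance:", pvariance(normalized))
```

In a convolutional network the mean and variance are computed per channel, over the batch and both spatial dimensions; the flat list here stands in for that pooled set of values.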
A word on why this chapter is load-bearing. The convolution is the book's signature recurring character. You met it as a designed filter in Chapter 3; here it becomes learnable, and in Chapter 33 the same operation, stacked into a U-Net, will turn noise into photographs. The vocabulary set here (channels, stride, receptive field, normalization, feature map) is the vocabulary of every architecture diagram in the rest of the book. Read this chapter while the kernels are still small enough to inspect by eye, because by Chapter 22 they will be replaced, and you will want to know exactly what they were doing.
Prerequisites
You should have read Chapter 18: Neural Networks & PyTorch for Vision, which introduces tensors, automatic differentiation, the training loop, and the optimizers this chapter uses without re-deriving them. The conceptual heart of the chapter rests on Chapter 3: Spatial Filtering & Convolution, where convolution, correlation, kernels, and border handling were built from scratch; this chapter assumes you can read a $3 \times 3$ kernel and predict roughly what it does. The first-layer filter visualizations in Section 19.6 connect directly to the Sobel derivative kernels of Chapter 3, the oriented Gabor filters of Section 4.6, and the gradient operators of Chapter 9. Basic linear algebra (dot products, matrix shapes) and comfort with NumPy arrays from Chapter 0 are assumed throughout.
Chapter Roadmap
- 19.1 Why Convolution? Locality, Weight Sharing & Inductive Bias: A convolutional layer derived as a constrained fully connected layer: local connectivity and weight sharing as priors, the parameter-count argument, and equivariance to translation.
- 19.2 Convolution Layers: Channels, Stride, Padding & Dilation: The real layer: multi-channel filters, the output-size formula, stride for downsampling, padding to control shape, dilation for cheap receptive-field growth, and the exact PyTorch tensor shapes.
- 19.3 Pooling, Receptive Fields & Feature Hierarchies: Max and average pooling, the receptive-field calculation, why deep stacks see globally, and how the edges-to-parts-to-objects hierarchy emerges from repeated convolution and downsampling.
- 19.4 Batch Normalization & Friends: Internal covariate shift and the smoother loss landscape, the batch-norm forward and backward pass, train-versus-eval behavior, and when to reach for layer, group, or instance normalization instead.
- 19.5 A CNN from Scratch: CIFAR-10 End to End: A complete network and training loop: data loading and augmentation, the conv-BN-ReLU block, optimizer and schedule, overfitting diagnostics, and reaching competitive accuracy on a single GPU.
- 19.6 Visualizing What CNNs Learn: Opening the trained network: first-layer filters that rediscover edge and Gabor kernels, feature-map activations, maximally activating patches, saliency maps, and Grad-CAM class evidence.
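Two of the calculations named in the roadmap, the output-size formula and the receptive-field calculation, fit in a few lines. The sketch below uses the standard one-dimensional forms of both; the function names are hypothetical, the receptive-field helper assumes undilated layers given as (kernel, stride) pairs, and the sections themselves may present the material differently.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis for an input of length n."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: (kernel, stride) pairs in input-to-output order, dilation 1.
    """
    size, jump = 1, 1                  # jump = input-pixel distance between adjacent units
    for kernel, stride in layers:
        size += (kernel - 1) * jump    # each layer widens the window by (k - 1) jumps
        jump *= stride                 # striding spreads later units further apart
    return size

# A 32-pixel axis, the side length of a CIFAR-10 image.
print(conv_output_size(32, kernel=3, padding=1))            # shape-preserving 3x3
print(conv_output_size(32, kernel=3, stride=2, padding=1))  # strided downsampling
print(conv_output_size(32, kernel=3, dilation=2))           # dilated, no padding

# Two stacked 3x3 convolutions, then the same pair with a 2x2 stride-2 pool between.
print(receptive_field([(3, 1), (3, 1)]))
print(receptive_field([(3, 1), (2, 2), (3, 1)]))
```

The last two lines show why downsampling matters for seeing far: inserting one pooling step makes every later kernel cover twice as many input pixels per tap.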
What's Next?
This chapter gives you the layer and the network; the next gives you the great designs built from them. In Chapter 20: CNN Architectures: From LeNet to ConvNeXt, the conv-BN-ReLU block you assembled here becomes a building unit in the architectures that defined the field: LeNet's first demonstration, AlexNet's 2012 breakthrough, VGG's uniform depth, the residual connection of ResNet that finally made hundred-layer networks trainable, the efficiency of MobileNet and EfficientNet, and ConvNeXt's modern answer to the vision transformer. You will see how each design trades off accuracy, parameters, and compute, and why the residual connection and batch normalization you learned here turn out to be the two ideas that made all the depth possible.
Bibliography & Further Reading
Tools & Libraries
pytorch-grad-cam library. github.com/jacobgil/pytorch-grad-cam