"I summarize. You hand me a four-by-four neighborhood and ask what is there; I hand back the single loudest pixel and walk off. The details you lost were details you did not need."
A Max-Pooling Layer With Strong Opinions
A single convolution sees only its kernel; a deep stack sees an entire scene, because each layer's window opens onto windows of the layer below, and those windows compound. This section makes that compounding precise with the receptive-field recurrence, introduces pooling as a complementary way to summarize and shrink feature maps, and shows how repeated convolution and downsampling assemble the edges-to-textures-to-parts-to-objects hierarchy that is the reason CNNs work. By the end you will be able to compute, for any layer, exactly how much of the input it can possibly respond to.
Section 19.2 gave you the convolutional layer and the output-size formula, but left one question dangling: a $3 \times 3$ kernel touches only nine input pixels, so how can a CNN ever recognize an object that spans hundreds of pixels? The answer is depth. Each layer reads from the layer below, whose every activation already summarized a patch of the layer below that, and so on down to the pixels. Tracking how far back into the input a given activation reaches is the job of the receptive field, the central quantity of this section. Along the way we add pooling, the older sibling of strided convolution, and end with the feature hierarchy that ties the whole chapter together.
1. Pooling: Summarize and Shrink (Beginner)
A pooling layer replaces each small window of a feature map with a single summary value, with no learnable weights. The two standard summaries are the maximum (max pooling) and the mean (average pooling). A $2 \times 2$ max pool with stride 2 takes each non-overlapping $2 \times 2$ block and keeps only its largest activation, halving the resolution along each axis and discarding three quarters of the values. Pooling has two purposes: it downsamples, reducing the compute and memory of later layers, and it builds a small amount of local translation invariance, because the maximum over a window does not change if the peak shifts by one pixel within that window. Figure 19.3.1 shows a $2 \times 2$ max pool side by side with average pooling on the same input.
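As a minimal sketch of the two summaries described above, the following pure-Python loop pools a small feature map with non-overlapping $2 \times 2$ windows. The helper names `pool2d` and `mean` and the sample values are illustrative assumptions, not part of any library; a real network would use a framework layer instead.

```python
def pool2d(fmap, size=2, stride=2, reduce=max):
    """Slide a size x size window over a 2-D list and summarize each window."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for top in range(0, rows - size + 1, stride):
        out_row = []
        for left in range(0, cols - size + 1, stride):
            # Collect the values inside the current window.
            window = [fmap[top + dy][left + dx]
                      for dy in range(size) for dx in range(size)]
            out_row.append(reduce(window))
        out.append(out_row)
    return out


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map, chosen only for illustration.
fmap = [[1, 3, 2, 0],
        [5, 6, 1, 2],
        [0, 2, 9, 4],
        [3, 1, 4, 8]]

max_pooled = pool2d(fmap, reduce=max)
avg_pooled = pool2d(fmap, reduce=mean)
print(max_pooled)  # [[6, 2], [3, 9]]
print(avg_pooled)  # [[3.75, 1.25], [1.5, 6.25]]
```

Both outputs are $2 \times 2$: sixteen values went in and four came out, which is the "discarding three quarters of the values" in the text. Neither function has any weights to learn.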
Max pooling dominated early architectures because the strongest response in a window is usually the most informative ("is the edge present somewhere here?"), while average pooling blurs and can dilute a sharp detection. Modern networks frequently skip pooling entirely in favor of strided convolution from Section 19.2, which downsamples and learns its summary jointly. But pooling never disappeared: global average pooling, which collapses an entire $H \times W$ feature map to one number per channel, is the standard final step before a classifier head, because it produces a fixed-length vector regardless of input size and gives the whole-image translation invariance that Section 19.1 distinguished from equivariance.
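The fixed-length property of global average pooling is easy to check by hand. The sketch below assumes feature maps stored as nested Python lists (channel, row, column); the function name `global_average_pool` and both inputs are hypothetical.

```python
def global_average_pool(fmaps):
    """fmaps: list of C channels, each an H x W list of lists. Returns C means."""
    pooled = []
    for channel in fmaps:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled


# Two hypothetical inputs with the same channel count (C = 2) but different sizes.
small = [[[1, 2], [3, 4]],
         [[0, 0], [0, 8]]]             # 2 channels, each 2x2
large = [[[1] * 5 for _ in range(3)],
         [[2] * 5 for _ in range(3)]]  # 2 channels, each 3x5

gap_small = global_average_pool(small)
gap_large = global_average_pool(large)
print(gap_small)  # [2.5, 2.0]
print(gap_large)  # [1.0, 2.0]
```

The output length depends only on the number of channels, so a classifier head that follows it never needs to know the spatial size of the input.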
A widespread belief is that pooling (or striding) makes a CNN immune to shifts, so a recognizer trained on centered objects will work on shifted ones for free. It does not. A $2 \times 2$ max pool is invariant only to a shift that keeps the peak inside the same window; shift the input by one pixel and the window boundaries fall on different pixels, so the downsampled output can change. In fact, strided downsampling violates the Nyquist sampling rule you met in Chapter 4: it subsamples a feature map without first removing the high frequencies, so those frequencies alias, and a one-pixel input shift can flip a prediction. True robustness to translation comes mostly from the weight sharing of Section 19.1 (equivariance) plus translation augmentation in Chapter 21, not from pooling. The fix that largely restores shift-invariance, anti-aliased downsampling (blur, then subsample), is exactly the low-pass-before-decimate idea of Chapter 4.
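A one-dimensional toy makes the failure concrete. This is a sketch under simple assumptions (a size-2, stride-2 max pool and zero-filled shifts); `max_pool_1d`, `shift_right`, and the signals are hypothetical names and values chosen for illustration.

```python
def max_pool_1d(signal, size=2, stride=2):
    """Non-overlapping max pool over a 1-D list."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, stride)]


def shift_right(signal, steps=1):
    """Shift a list right by `steps`, filling the vacated slots with zeros."""
    return [0] * steps + signal[:len(signal) - steps]


# Case 1: a spike that stays inside its window (indices 2 and 3 share a window).
spike = [0, 0, 9, 0, 0, 0, 0, 0]
spike_pooled = max_pool_1d(spike)
spike_shifted_pooled = max_pool_1d(shift_right(spike))
print(spike_pooled, spike_shifted_pooled)  # [0, 9, 0, 0] [0, 9, 0, 0]

# Case 2: a two-pixel bump that straddles a window boundary before the shift.
bump = [0, 5, 5, 0, 0, 0, 0, 0]
bump_pooled = max_pool_1d(bump)
bump_shifted_pooled = max_pool_1d(shift_right(bump))
print(bump_pooled, bump_shifted_pooled)    # [5, 5, 0, 0] [0, 5, 0, 0]
```

In the second case the same pattern produces two active outputs in one alignment and one in the other. The pooled map is not merely moved; its content differs, which is the kind of change that can propagate to a flipped prediction.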
2. The Receptive Field (Intermediate)
The receptive field of an activation is the set of input pixels that can influence its value. For the first convolutional layer it is just the kernel size; a $3 \times 3$ filter has a $3 \times 3$ receptive field. The interesting case is deeper layers, where the receptive field grows because each activation reads a window of the previous layer, and each of those activations already had its own receptive field below. The growth follows a clean recurrence. Writing $r_{\ell}$ for the receptive field of layer $\ell$ (its side length in input pixels), $k_{\ell}$ for its kernel size, $s_{\ell}$ for its stride, and $j_{\ell-1}$ for the jump (the input-pixel spacing between adjacent activations entering layer $\ell$, equal to the product of all strides up to layer $\ell - 1$):
$$ r_{\ell} \;=\; r_{\ell-1} \;+\; (k_{\ell} - 1)\, j_{\ell-1}, \qquad j_{\ell} \;=\; j_{\ell-1}\, s_{\ell}, $$
starting from $r_0 = 1$ and $j_0 = 1$. To see why the jump matters, picture two side-by-side activations in some deep layer: if every layer below used stride 1 they map to adjacent input pixels (jump 1), but each stride-2 layer doubles the input-pixel gap between them, so after two stride-2 layers neighbors are 4 pixels apart (jump 4). A wider gap means each step of the kernel reaches across more input, which is why the jump multiplies by each layer's stride, and the receptive field grows by $(k-1)$ times the current jump at each layer. The crucial consequence: stride accelerates receptive-field growth, because once the jump is large, each new layer reaches across more input pixels. This is why striding (or pooling) is not just about saving compute; it is the main lever for seeing far.
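As a worked instance of the recurrence, take a hypothetical three-layer stack of $3 \times 3$ convolutions with strides 2, 2, and 1 (the strides are chosen only for illustration). Starting from $r_0 = 1$ and $j_0 = 1$:

$$
\begin{aligned}
r_1 &= 1 + (3-1)\cdot 1 = 3, & j_1 &= 1 \cdot 2 = 2,\\
r_2 &= 3 + (3-1)\cdot 2 = 7, & j_2 &= 2 \cdot 2 = 4,\\
r_3 &= 7 + (3-1)\cdot 4 = 15, & j_3 &= 4 \cdot 1 = 4.
\end{aligned}
$$

With stride 1 everywhere the same three layers would give $r_1 = 3$, $r_2 = 5$, $r_3 = 7$. The kernels are identical in both cases; the difference between 15 and 7 comes entirely from the jump.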
The code below walks the recurrence over a small stack and prints the receptive field after each layer, so you can see it balloon once stride enters.
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns (RF, jump) after each layer."""
    r, j = 1, 1  # start: one pixel sees itself, jump 1
    out = []
    for k, s in layers:
        r = r + (k - 1) * j  # grow RF by (k-1)*current_jump
        j = j * s            # jump scales by this layer's stride
        out.append((r, j))
    return out

# A small VGG-style stack: 3x3 convs with a stride-2 downsample every few layers.
stack = [(3, 1), (3, 1), (3, 2),   # block 1, then halve
         (3, 1), (3, 1), (3, 2),   # block 2, then halve
         (3, 1), (3, 1), (3, 2)]   # block 3, then halve

for i, (r, j) in enumerate(receptive_field(stack), 1):
    print(f"after layer {i}: receptive field {r}x{r}, jump {j}")
# after layer 1: receptive field 3x3, jump 1
# after layer 2: receptive field 5x5, jump 1
# after layer 3: receptive field 7x7, jump 2
# after layer 4: receptive field 11x11, jump 2
# after layer 5: receptive field 15x15, jump 2
# after layer 6: receptive field 19x19, jump 4
# after layer 7: receptive field 27x27, jump 4
# after layer 8: receptive field 35x35, jump 4
# after layer 9: receptive field 43x43, jump 8
The recurrence shows that two stacked $3 \times 3$ convolutions have a $5 \times 5$ receptive field, and three have a $7 \times 7$ one. This is the design principle behind VGG (Chapter 20): a stack of small kernels matches the reach of one large kernel while using fewer parameters ($2 \times 9 = 18$ versus $25$ for a single $5 \times 5$) and inserting a nonlinearity between them, which makes the function richer. The small-kernel stack was the default for years, and it is exactly the receptive-field arithmetic of this section that justifies it.
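The parameter comparison generalizes beyond the single-channel count of 18 versus 25. The sketch below assumes square kernels, equal input and output channel counts, and no biases; the function names and the width of 64 channels are hypothetical.

```python
def conv_weights(kernel_size, channels_in, channels_out):
    """Weight count of one square convolution layer, ignoring biases."""
    return kernel_size * kernel_size * channels_in * channels_out


def stacked_3x3_weights(num_layers, channels):
    """Total weights of num_layers stacked 3x3 convs at constant width."""
    return num_layers * conv_weights(3, channels, channels)


channels = 64  # hypothetical layer width, for illustration only
for n in (2, 3):
    big = 2 * n + 1  # n stacked 3x3 stride-1 convs reach (2n+1) x (2n+1)
    stacked = stacked_3x3_weights(n, channels)
    single = conv_weights(big, channels, channels)
    print(f"{n} x (3x3): {stacked} weights   vs   one {big}x{big}: {single} weights")
# 2 x (3x3): 73728 weights   vs   one 5x5: 102400 weights
# 3 x (3x3): 110592 weights   vs   one 7x7: 200704 weights
```

The saving grows with reach: the stack costs $9n$ weights per channel pair while the single kernel costs $(2n+1)^2$, so the ratio keeps improving as $n$ increases.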
Turn the recurrence into a small interactive tool you can show off. Take any torchvision backbone (start with resnet18), walk its layers, and plot the theoretical receptive field growing layer by layer, then overlay the box on a real image so a viewer can see exactly how much of the picture the final feature responds to. Difficulty: beginner, about one hour, no GPU and no training needed since you only read kernel sizes and strides. The build is genuinely useful: it answers the "can this neuron even see the whole tile?" question (exactly the one the satellite-imagery team will face in this section's practical example) before any retraining, and it makes the gap between the nominal box and the smaller effective field of the next subsection something you can demonstrate rather than just assert. Stretch it by adding a dilation slider and watching the box jump, the cheap reach this section promised.
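For the dilation stretch goal, one possible starting point is to extend the recurrence with a per-layer dilation. This sketch assumes the standard effective-kernel-size relation $k_{\text{eff}} = d\,(k - 1) + 1$ for a kernel of size $k$ with dilation $d$; the function name `receptive_field_dilated` and both stacks are hypothetical.

```python
def receptive_field_dilated(layers):
    """layers: list of (kernel_size, stride, dilation). Returns (RF, jump) per layer."""
    r, j = 1, 1
    out = []
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1   # span covered by the dilated kernel
        r = r + (k_eff - 1) * j   # same recurrence, with the effective kernel size
        j = j * s
        out.append((r, j))
    return out


# Four 3x3 stride-1 layers, without and with growing dilation (hypothetical stacks).
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4), (3, 1, 8)]

plain_rf = [r for r, _ in receptive_field_dilated(plain)]
dilated_rf = [r for r, _ in receptive_field_dilated(dilated)]
print(plain_rf)    # [3, 5, 7, 9]
print(dilated_rf)  # [3, 7, 15, 31]
```

Both stacks have the same number of weights and the same output resolution, yet the dilated one reaches $31 \times 31$ instead of $9 \times 9$. Setting every dilation to 1 recovers the original recurrence.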
3. Effective Versus Theoretical Receptive Field (Advanced)
The recurrence computes the theoretical receptive field, the pixels that could influence an activation. The pixels that actually influence it meaningfully form a much smaller region. Luo et al. showed that influence falls off roughly like a Gaussian from the center of the receptive field, so the effective receptive field is far smaller than the nominal one, often only a fraction of its diameter, and it grows only as the square root of the number of layers rather than linearly. The reason is statistical: the center pixel reaches the output through exponentially many paths, the corner pixels through very few, so the corners contribute little even though they are technically connected.
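The path-counting argument can be checked with a toy model. The sketch below is a deliberate simplification, not the analysis of Luo et al.: it assumes a one-dimensional stack of stride-1 layers with uniform weights and no nonlinearity, and it uses the number of paths from each input pixel to one output as a stand-in for influence. The function names are hypothetical.

```python
import math


def path_counts(num_layers, kernel_size=3):
    """Number of paths from each input pixel to a single output activation."""
    counts = [1]
    for _ in range(num_layers):
        new = [0] * (len(counts) + kernel_size - 1)
        for i, c in enumerate(counts):
            for t in range(kernel_size):
                new[i + t] += c   # each tap of the kernel adds one route
        counts = new
    return counts


def spread(counts):
    """Standard deviation of pixel position, weighted by path count."""
    total = sum(counts)
    center = (len(counts) - 1) / 2
    variance = sum(c * (i - center) ** 2 for i, c in enumerate(counts)) / total
    return math.sqrt(variance)


print(path_counts(4))  # [1, 4, 10, 16, 19, 16, 10, 4, 1]

for n in (4, 16):
    counts = path_counts(n)
    print(f"{n} layers: theoretical width {len(counts)}, spread {spread(counts):.3f}")
# 4 layers: theoretical width 9, spread 1.633
# 16 layers: theoretical width 33, spread 3.266
```

Quadrupling the depth widens the theoretical field from 9 to 33 pixels but only doubles the spread of the path-count profile, which is the square-root growth in miniature. The edge pixels each have exactly one path, while the center has the most.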
The practical lesson is sobering and useful. A network whose theoretical receptive field covers the whole image may, in effect, attend to only a central blob, which is one reason segmentation and detection architectures deliberately enlarge the effective field with dilation (Section 19.2), with explicit large kernels, or with the global context of attention in Chapter 22. When a CNN inexplicably ignores context that is clearly present in the input, a too-small effective receptive field is a prime suspect.
The receptive field is the network bragging about how far it could see; the effective receptive field is how far it actually bothers to look. The corners of a big receptive field are like the back row of a lecture hall: technically attending, contributing almost nothing. So when you compute a glorious $483 \times 483$ theoretical reach and the model still cannot tell forest from farmland, do not be shocked. Keep this slogan on a sticky note: nominal reach is a promise; the Gaussian decides what it keeps. The illustration below captures the gap.
The gap between theoretical and effective receptive field motivated a wave of 2022-2026 designs that enlarge reach without simply stacking more layers. RepLKNet (Ding et al., CVPR 2022, arXiv:2203.06717) uses $31 \times 31$ depthwise kernels to expand the effective field in a single layer, reporting that effective receptive field, not raw depth, tracks downstream accuracy. PeLK (Chen et al., CVPR 2024, arXiv:2403.07589) pushes to $101 \times 101$ "peripheral" kernels with parameter sharing modeled on human peripheral vision. State-space vision models such as VMamba (Liu et al., 2024, arXiv:2401.10166) achieve a global receptive field in linear time, a third route alongside large kernels and attention. The receptive-field arithmetic you computed by hand in this section is precisely the quantity these architectures compete to enlarge efficiently.
4. The Feature Hierarchy (Intermediate)
Putting convolution, nonlinearity, and downsampling together produces the structure that gives the chapter its theme: a feature hierarchy. The first layer, with a small receptive field, can only respond to local patterns, and it reliably learns oriented edges and color blobs, the same primitives the Sobel kernels of Chapter 3 and the oriented Gabor filters of Section 4.6 compute by hand. The second layer reads combinations of first-layer features over a slightly larger field and learns corners, junctions, and simple textures. Deeper layers, with receptive fields spanning a large fraction of the image, compose those into object parts (an eye, a wheel, a doorknob), and the deepest layers respond to whole objects and scenes. Figure 19.3.2 sketches this climb.
This hierarchy is not imposed; it emerges from training. Nobody tells layer one to learn edges, yet it almost always does, because edges are the most useful local primitive and the receptive field at that depth permits nothing more global. The hierarchy is also the foundation of transfer learning in Chapter 21: because early-layer features (edges, textures) are generic across tasks, a network trained on one large dataset can be reused on another by keeping its early layers and retraining only the later, task-specific ones. We will see the learned edge detectors directly in Section 19.6.
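To make the "oriented edge" primitive concrete, here is a minimal sketch of a hand-built first-layer feature: the horizontal-gradient Sobel kernel applied to two tiny synthetic images. The images and the helper name `correlate2d_valid` are illustrative assumptions. A trained first layer learns its own kernels; this only shows what an edge-selective response looks like.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def correlate2d_valid(image, kernel):
    """Cross-correlate image with kernel, keeping only fully covered positions."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for top in range(len(image) - kh + 1):
        row = []
        for left in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[dy][dx] * image[top + dy][left + dx]
                           for dy in range(kh) for dx in range(kw)))
        out.append(row)
    return out


# Vertical edge: dark left half, bright right half.
vertical_edge = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
# Horizontal edge: dark top half, bright bottom half.
horizontal_edge = [[0] * 6, [0] * 6, [1] * 6, [1] * 6]

vertical_response = correlate2d_valid(vertical_edge, SOBEL_X)
horizontal_response = correlate2d_valid(horizontal_edge, SOBEL_X)
print(vertical_response)    # [[0, 4, 4, 0], [0, 4, 4, 0]]
print(horizontal_response)  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The kernel fires only where its $3 \times 3$ window straddles the vertical edge and is silent on the horizontal one. That is all a receptive field of this size can express, so anything more global has to be composed by the layers above.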
Who: A remote-sensing team classifying satellite tiles as forest, farmland, urban, or water for a land-use monitoring service.
Situation: Their compact CNN scored well on farmland and water but confused fragmented forest with farmland on large $512 \times 512$ tiles, despite the two looking obviously different to a human eye scanning the whole tile.
Problem: The network was shallow and used only stride-1 convolutions with one early pool. Its theoretical receptive field was about $40 \times 40$ pixels, and the effective field smaller still, so each output decision was made from a tiny patch where fragmented forest and tilled farmland are genuinely ambiguous. The global texture that distinguishes them was outside what any neuron could see.
Decision: Rather than naively stacking more layers, the team added two stride-2 downsamples early and a pair of dilation-2 convolutions deeper, enlarging the receptive field to roughly the full tile while keeping the parameter count nearly flat.
Result: Forest-versus-farmland confusion dropped by 60 percent, and overall tile accuracy rose four points. A receptive-field calculation done on a napkin, the recurrence of this section, predicted the fix before any retraining.
Lesson: When a model fails on a distinction that requires global context, compute the receptive field before adding capacity. The problem is often not that the network is too small but that it cannot see far enough, and stride or dilation fixes that more cheaply than depth.
The four-line max-pool loop you might write from Figure 19.3.1 is nn.MaxPool2d(kernel_size=2, stride=2), and average pooling is nn.AvgPool2d(2, 2). The classifier-head global pool that turns any $C \times H \times W$ map into a $C$-vector regardless of resolution is one call, nn.AdaptiveAvgPool2d(1), which internally chooses the window to hit the requested output size. Computing the receptive field of a real torchvision model by hand is also unnecessary: libraries such as torchinfo print per-layer output shapes, and the pytorch-receptive-field utility walks the recurrence of this section automatically for an arbitrary module.
You can now reason about a CNN's geometry end to end: how big each layer's output is (the formula of Section 19.2) and how far back into the image it reaches (the recurrence here). One obstacle still stands between this understanding and a working network: deep stacks are hard to optimize, because activation statistics drift as signals pass through many layers. Section 19.4 introduces batch normalization, the technique that tamed that drift and made the deep hierarchies of this section trainable in practice.
Using the recurrence in this section, compute the receptive field of the final activation in this stack, layer by layer: a $5 \times 5$ stride-1 convolution, then a $2 \times 2$ stride-2 max pool, then two $3 \times 3$ stride-1 convolutions. Show the jump after each step. Then state how the answer would change if you swapped the $5 \times 5$ first layer for two stacked $3 \times 3$ layers, and explain which choice has fewer parameters.
Create a $1 \times 1 \times 8 \times 8$ tensor that is all zeros except for a single large value (a "spike"). Apply both nn.MaxPool2d(2, 2) and nn.AvgPool2d(2, 2) and print both outputs. Shift the spike by one pixel within its pooling window and show that neither output changes: max pooling keeps the spike's full value, while average pooling dilutes it to one quarter. Then shift the spike across a window boundary and show that both outputs do change. Write one sentence explaining why the first observation makes max pooling locally translation-invariant and why that is sometimes undesirable for dense prediction.
A classifier of $256 \times 256$ medical images uses eight stride-1 $3 \times 3$ convolutions and no downsampling, ending in global average pooling. It performs well on small lesions but poorly on diffuse, large-scale findings. Compute the theoretical receptive field of the last convolutional layer, argue why the effective receptive field is even smaller (cite the Gaussian-falloff result), and propose two distinct architectural changes that would let the network perceive large-scale structure without a large increase in parameters.