"A fully connected layer asked me where the cat was. I said: which cat, in which corner, on which day? It had to memorize all of them. I just learn what a cat looks like, once, and check everywhere."
A Smugly Parameter-Efficient Convolutional Layer
A convolutional layer is nothing more exotic than a fully connected layer with two restrictions imposed: each output unit may look only at a small local patch of the input, and every output unit must reuse the same weights. Those two restrictions encode an assumption about the world, that useful visual patterns are local and can appear anywhere, and because the assumption matches how images actually behave, the restricted network learns more from less data, generalizes better, and uses a tiny fraction of the parameters. This section is the argument for why convolution is the right prior for images, before Section 19.2 turns it into the layer you will actually use.
In Chapter 18 a network treated its input as a flat vector. For a vector of stock prices or a row of survey responses, that is exactly right: there is no meaningful "neighbor" relationship, and every input feature deserves its own independent weight. For an image, flattening is a small catastrophe. It discards the grid structure that makes a pixel's neighbors informative, and it forces the network to learn the appearance of every pattern separately at every possible location. This section explains why that fails, and how two simple constraints, motivated by the convolution you already built by hand in Chapter 3, fix it. The illustration below previews the whole idea in one picture.
1. The Fully Connected Layer Does Not Scale to Images (Beginner)
Consider a single modestly sized color image, $224 \times 224$ pixels with 3 channels. Flattened, that is $224 \times 224 \times 3 = 150{,}528$ input features. Connect it to a first hidden layer of just 1000 units, the kind of width that is unremarkable in Chapter 18, and the weight matrix alone has $150{,}528 \times 1000 \approx 1.5 \times 10^{8}$ parameters. One layer, 150 million weights, before you have learned anything. A network deep enough to be useful would have billions of parameters in its first few layers, would need an enormous dataset to constrain them, and would overfit ferociously on anything smaller.
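The arithmetic above is easy to reproduce. The short sketch below recomputes the count for the same image size and hidden width; the variable names are chosen here for illustration only.

```python
# Parameter count for a dense first layer on a flattened 224x224 RGB image.
height, width, channels = 224, 224, 3
hidden_units = 1000

input_features = height * width * channels       # 150,528 flattened inputs
dense_weights = input_features * hidden_units    # one weight per (input, unit) pair
dense_biases = hidden_units                      # one bias per hidden unit

print(f"input features:   {input_features:,}")
print(f"weights:          {dense_weights:,}")
print(f"weights + biases: {dense_weights + dense_biases:,}")
```

Doubling the image side length quadruples `input_features`, and the weight count with it, which is the scaling problem in miniature.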
The parameter count is only the visible symptom. The deeper problem is that a fully connected layer has no notion of where a feature is, so it cannot transfer knowledge across positions. Suppose the network learns, in the weights feeding one hidden unit, to detect a vertical edge in the top-left corner. Those weights are useless for detecting the same vertical edge in the bottom-right corner; an entirely separate set of weights, feeding a different hidden unit, must learn the identical pattern from scratch. The network spends its capacity re-deriving the same handful of visual primitives once per location. Figure 19.1.1 contrasts the dense wiring of a fully connected layer with the sparse, repeated wiring of a convolutional one.
2. Constraint One: Local Connectivity (Beginner)
The first constraint is local connectivity: an output unit connects only to a small spatial neighborhood of the input, not to all of it. This is justified by a fact about images: the pixels that tell you whether there is an edge at position $(x, y)$ are the pixels near $(x, y)$. A pixel a hundred rows away contributes almost nothing to the local question "is there a vertical edge here?". So a hidden unit responsible for a local question needs only local inputs. If the input has spatial size $H \times W$ and each output looks at a $k \times k$ patch, the fan-in of an output drops from $H \times W$ to $k \times k$, which for a typical $3 \times 3$ patch is just 9 connections regardless of how large the image is.
Local connectivity alone is the idea behind a locally connected layer: each output position still has its own private $k \times k$ weights, but it ignores everything outside its patch. This already slashes the connection count, but it does not yet share anything across positions, so the network still relearns the vertical-edge detector separately at every location. We need the second constraint to fix that.
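One way to make the locally connected layer concrete is the sketch below. It assumes a single-channel $32 \times 32$ input with zero padding so that every pixel gets a patch, and it uses `F.unfold` to gather the patches; the tensor names are illustrative, and this is a minimal construction, not a library layer.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = W = 32
k = 3
x = torch.randn(1, 1, H, W)

# Extract every k x k patch (zero-padded so there is one patch per pixel).
# patches has shape (1, k*k, H*W): one column of 9 values per output position.
patches = F.unfold(x, kernel_size=k, padding=k // 2)

# Private weights: a separate k*k vector for each of the H*W output positions.
local_weights = torch.randn(k * k, H * W)

# Each output is the dot product of its own patch with its own weights.
y = (patches[0] * local_weights).sum(dim=0).view(1, 1, H, W)

print(y.shape)                # torch.Size([1, 1, 32, 32])
print(local_weights.numel())  # 9216
```

Every output still has a fan-in of 9, but the weight tensor grows with the image because nothing is shared across positions.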
Local connectivity bakes in the belief that the information needed to compute a low-level feature is spatially concentrated. This is overwhelmingly true for natural images and is exactly why the small kernels of Chapter 3 work. It is not true for every signal: a layer that needs to relate the far-left and far-right of the input in one step (global reasoning) is poorly served by locality, which is one reason vision transformers in Chapter 22 reintroduce all-pairs attention. The art of architecture is choosing where to assume locality and where to break it.
3. Constraint Two: Weight Sharing (Beginner)
The second constraint is weight sharing: instead of giving each output position its own $k \times k$ weights, every position reuses the same $k \times k$ weights. The justification is again a fact about images, called stationarity: the statistics of natural images are roughly the same everywhere. A vertical edge looks like a vertical edge whether it is in the sky or the pavement, so a detector that works in one place should work in all places. Sharing one set of weights across all positions encodes that belief directly.
The moment we share one $k \times k$ filter across all positions, the operation becomes exactly the cross-correlation you implemented in Section 3.1: slide the kernel, multiply, sum. The layer's entire parameter set is the $k \times k$ kernel (plus one bias), no matter how large the image. The combination of constraints is the layer's name: locally connected with shared weights equals convolution. Table 19.1.1 makes the parameter savings concrete for one $3 \times 3$ filter on a small $32 \times 32$ single-channel image (1024 inputs and 1024 outputs).
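The "slide, multiply, sum" description can be checked against PyTorch directly. The sketch below is a deliberately naive loop written for this comparison (the helper name `cross_correlate2d` is hypothetical, and no padding is used), compared with `F.conv2d` on the same data.

```python
import torch
import torch.nn.functional as F

def cross_correlate2d(image, kernel):
    """Naive 'valid' cross-correlation: slide, multiply, sum (no flip, no padding)."""
    H, W = image.shape
    k = kernel.shape[0]
    out = torch.empty(H - k + 1, W - k + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same kernel is reused at every position (i, j).
            out[i, j] = (image[i:i + k, j:j + k] * kernel).sum()
    return out

torch.manual_seed(0)
image = torch.randn(8, 8)
kernel = torch.randn(3, 3)

manual = cross_correlate2d(image, kernel)
builtin = F.conv2d(image[None, None], kernel[None, None])[0, 0]
print(torch.allclose(manual, builtin, atol=1e-4))
```

The loop is far slower than the built-in, but it computes the same numbers, which is the point: the layer adds no machinery beyond the shared sliding window.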
It is natural to assume the "convolution" in a convolutional layer is the flip-and-slide convolution of Section 3.1, where the kernel is mirrored before the dot product. It is not. PyTorch's Conv2d, like every mainstream framework, computes cross-correlation: it slides the kernel without flipping it. The two operations differ only by a $180^{\circ}$ rotation of the kernel, and because the kernel here is learned, the network simply learns the already-flipped weights, so the distinction never affects accuracy. The term "convolution" survived for historical reasons. The practical consequence: if you ever load a hand-designed kernel from Part I (a Sobel operator, say) into a Conv2d and expect the textbook convolution result, you must flip it first, or your edges will point the wrong way.
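The flip can be seen on a toy step edge. The sketch below assumes one common sign convention for the horizontal Sobel kernel and uses two illustrative helpers (`conv2d_as_is` and `conv2d_flipped` are names invented here); it shows that the unflipped and flipped kernels give responses of opposite sign.

```python
import torch
import torch.nn.functional as F

# Horizontal-derivative Sobel kernel in one common sign convention (an assumption).
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

# A step edge: dark on the left, bright on the right.
image = torch.zeros(1, 1, 5, 6)
image[..., 3:] = 1.0

def conv2d_as_is(img, kernel):
    # What F.conv2d computes: cross-correlation, kernel used unflipped.
    return F.conv2d(img, kernel[None, None])

def conv2d_flipped(img, kernel):
    # Textbook convolution: rotate the kernel 180 degrees, then cross-correlate.
    return F.conv2d(img, torch.flip(kernel, dims=(0, 1))[None, None])

as_is = conv2d_as_is(image, sobel_x)
flipped = conv2d_flipped(image, sobel_x)

print(as_is[0, 0, 0])                  # positive response at the edge
print(flipped[0, 0, 0])                # same magnitude, opposite sign
print(torch.allclose(as_is, -flipped))
```

For a learned kernel the sign is absorbed into the weights during training; it only matters when a fixed, hand-designed kernel is loaded.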
| Layer type | Weights per output | Shared? | Total weights |
|---|---|---|---|
| Fully connected | $1024$ (all inputs) | No | $1024 \times 1024 = 1{,}048{,}576$ |
| Locally connected ($3\times3$) | $9$ | No | $9 \times 1024 = 9{,}216$ |
| Convolutional ($3\times3$) | $9$ | Yes | $9$ (one kernel) |
The contrast in the final column is the whole story of why CNNs work where dense networks cannot. The fully connected layer needs over a million weights; the convolution needs nine. The convolution will need more in practice, one kernel per output channel as we will see in Section 19.2, but the count is governed by the number of distinct features the layer learns, not by the size of the image. A 4-megapixel photograph and a thumbnail use the same kernel weights.
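The size independence is easy to observe. The sketch below, assuming a single-channel layer without bias and two arbitrary image sizes, applies one `nn.Conv2d` to both and then recomputes the three totals from Table 19.1.1.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

thumbnail = torch.randn(1, 1, 32, 32)
large = torch.randn(1, 1, 512, 512)

# The same nine weights process both images; only the output size changes.
print(conv.weight.numel())     # 9
print(conv(thumbnail).shape)   # torch.Size([1, 1, 32, 32])
print(conv(large).shape)       # torch.Size([1, 1, 512, 512])

# Totals from Table 19.1.1 for a 32x32 input and a 32x32 output.
n = 32 * 32
fully_connected, locally_connected, convolutional = n * n, 9 * n, 9
print(fully_connected, locally_connected, convolutional)  # 1048576 9216 9
```

A dense layer built for 1024 inputs would reject the larger image outright, because its weight matrix is tied to the input size.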
Who: A machine-vision engineer at a printed-circuit-board manufacturer building an automated solder-joint inspector.
Situation: The first prototype used a fully connected classifier on flattened crops of each board. It hit 94 percent accuracy in the lab and was scheduled to ship.
Problem: On the line, accuracy collapsed to the low seventies. The boards were not always seated in the exact same jig position; a one-millimeter shift moved every joint a few pixels in the frame. The dense network had memorized defect appearance at the training positions and could not recognize the same defect a few pixels over, the textbook failure of a position-dependent model.
Decision: Replace the dense classifier with a small convolutional network, trusting weight sharing to make a defect detector that works at any position by construction, and add modest translation augmentation on top.
Result: On-line accuracy rose to 96 percent and, more importantly, stopped depending on jig alignment. The total parameter count dropped by a factor of forty, which let the model run on the existing edge device without a hardware upgrade.
Lesson: The right inductive bias does not just save parameters; it removes an entire class of failure (position dependence) that no amount of dense-network tuning would have cured. Weight sharing turned "recognize this defect here" into "recognize this defect anywhere," which was the actual requirement.
4. Equivariance: The Property That Falls Out for Free (Intermediate)
Weight sharing gives convolution a clean mathematical property called translation equivariance. A function $f$ is equivariant to a transformation $T$ if transforming the input transforms the output in the same way: $f(T(x)) = T(f(x))$. For convolution, if you slide the input image two pixels to the right, every activation in the output slides two pixels to the right, unchanged in value. Formally, writing $S_{\delta}$ for a spatial shift by $\delta$ and $\ast$ for convolution with kernel $w$:
$$ (S_{\delta} x) \ast w \;=\; S_{\delta}(x \ast w). $$
This is not the same as invariance, where the output would be unchanged by the shift. Convolution is equivariant: the features move with the object. Invariance is something you build later, by pooling or by global averaging, when you want a whole-image label that does not care where the object sits. The distinction matters: for classification you eventually want invariance (a cat anywhere is a cat), but for detection in Chapter 23 and segmentation in Chapter 24 you want equivariance preserved, because you need to report where the object is. Convolution gives you equivariance by default and lets you trade it for invariance deliberately.
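The equivariance identity above can be verified in a few lines of index algebra. The sketch assumes an unbounded grid, so borders play no role, and introduces index notation for illustration: the layer's operation is written as the cross-correlation $(x \ast w)[i, j] = \sum_{a, b} w[a, b]\, x[i + a, j + b]$ and the shift as $(S_{\delta} x)[i, j] = x[i - \delta_1, j - \delta_2]$.

$$
\begin{aligned}
\big((S_{\delta} x) \ast w\big)[i, j]
  &= \sum_{a, b} w[a, b]\,(S_{\delta} x)[i + a,\; j + b] \\
  &= \sum_{a, b} w[a, b]\, x[i + a - \delta_1,\; j + b - \delta_2] \\
  &= (x \ast w)[i - \delta_1,\; j - \delta_2] \\
  &= \big(S_{\delta}(x \ast w)\big)[i, j].
\end{aligned}
$$

The kernel values never enter the argument; the only ingredient is that the same $w$ is applied at every position.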
The code below demonstrates equivariance empirically. We run a fixed convolution on an image, then on a shifted copy of the same image, and confirm that the second output is the shifted first output (away from the borders, where padding breaks the symmetry).
```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)  # (batch, channel, H, W)
w = torch.randn(1, 1, 3, 3)    # one 3x3 filter

# Shift the input 2 columns to the right (roll, then zero the wrapped edge).
x_shifted = torch.roll(x, shifts=2, dims=3)
x_shifted[..., :2] = 0.0

y = F.conv2d(x, w, padding=1)  # cross-correlation, same size
y_shifted = F.conv2d(x_shifted, w, padding=1)

# The output of the shifted input equals the shifted output, in the interior.
expected = torch.roll(y, shifts=2, dims=3)
print(torch.allclose(y_shifted[..., 3:-1], expected[..., 3:-1], atol=1e-5))
# Expected output: True (equivariance holds away from the border)
```
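Run the code above, then change shifts=2 to 1, then to 4, in both the input torch.roll and the expected roll, and widen the zeroed strip x_shifted[..., :2] to match. For a shift of 1 the interior comparison still prints True as written. For a shift of 4, start the comparison slice at column 4 or later (for example [..., 4:-1]), because the spoiled band on the left is as wide as the shift; with the original [..., 3:-1] it prints False. Then widen the comparison slice from [..., 3:-1] toward the full width [..., :] and watch it flip to False. The lesson lands in 30 seconds: equivariance is exact in the interior and is broken only at the edges, by the zero padding on the right and by the columns that torch.roll wraps around on the left. That edge effect is exactly why the zero border discussed in Section 19.2 lets a network read a little absolute position. Vary the shift and the width of the spoiled band grows with it.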
This experiment is the runtime echo of the shift-invariance discussion in Section 3.1, where you saw that a designed filter applies identical weights at every location. The only thing Section 19.2 adds is making those weights learnable; the equivariance is inherited from the sliding-window structure, not from any property of the weight values.
A fully connected layer is the student who memorizes every worked example in the textbook and panics when the exam reorders the numbers. The convolution is the student who learned the one rule and applies it anywhere. The punchline is that the lazy-looking one wins: by refusing to learn the same edge detector 50,000 times, the convolution frees up its capacity for the parts that actually vary. Remember the layer's motto, learn it once, look everywhere, and you have the whole section in five words, sketched in the illustration below.
5. Inductive Bias: Why the Right Prior Beats Raw Capacity (Intermediate)
The deep reason convolution wins is best stated in the language of inductive bias. Every learning algorithm needs assumptions to generalize from finite data to unseen inputs; without any assumptions, fitting the training set tells you nothing about new examples. The set of assumptions a model builds in is its inductive bias. A fully connected network has a very weak bias: it can in principle represent any function, including the convolution, but it has no preference for the position-independent, local functions that images actually demand, so it must discover that structure from data, which takes enormous amounts of data.
A convolutional network bakes the structure in. Its hypothesis space contains only local, weight-shared functions, which is a tiny corner of the space of all functions, but it is the corner that contains the good image models. Restricting the hypothesis space is a double-edged tool: if the assumption is right, you generalize far better from the same data because you are not wasting capacity on functions that cannot be the answer; if the assumption is wrong, you cannot fit the truth at all. For natural images, the convolutional assumption is right often enough that for a decade it was the only game in town.
A fully connected layer is strictly more expressive than a convolutional one (it can represent every convolution and more) yet performs far worse on images from the same data budget. Expressiveness is not the goal; matching the data's structure is. Convolution is a deliberate reduction in expressiveness that pays for itself many times over by sample efficiency. This is the single most important idea in the chapter, and it reappears every time you choose an architecture: you are choosing a prior, and the best prior is the one that is true.
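The claim that a dense layer can represent every convolution can be made tangible. The sketch below, assuming a small $8 \times 8$ single-channel input to keep the matrix readable, builds the dense weight matrix equivalent to one $3 \times 3$ convolution by pushing one-hot images through it; the construction is illustrative, not how any framework stores a convolution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = W = 8
n = H * W
w = torch.randn(1, 1, 3, 3)

# Feed each one-hot "basis image" through the convolution. Because the
# convolution is linear, its responses are the columns of an equivalent
# dense weight matrix acting on flattened images.
basis = torch.eye(n).view(n, 1, H, W)
dense = F.conv2d(basis, w, padding=1).view(n, n).T  # shape (outputs, inputs)

x = torch.randn(1, 1, H, W)
y_conv = F.conv2d(x, w, padding=1).flatten()
y_dense = dense @ x.flatten()

print(torch.allclose(y_conv, y_dense, atol=1e-4))

# Each output row touches at most 9 of the 64 inputs; the rest are zero.
nonzeros_per_row = (dense.abs() > 1e-6).sum(dim=1)
print(int(nonzeros_per_row.max()), "of", n)
```

The dense matrix has $64 \times 64 = 4096$ entries, almost all zero and the rest copies of the same nine numbers. A fully connected layer is free to fill in every entry independently, which is the extra expressiveness the convolution gives up.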
Everything above, local connectivity and weight sharing, is delivered by one line: torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1). Compare it to the dense equivalent torch.nn.Linear(1024, 1024), which on a $32 \times 32$ input holds over a million parameters to the convolution's ten (nine kernel weights plus one bias). PyTorch's Conv2d also handles the sliding window in optimized C++/CUDA, supports multiple channels, stride, padding, dilation, and grouped convolution out of the box, and exposes the learnable kernel as .weight for inspection in Section 19.6. The single constructor encodes the inductive bias that this entire section argued for.
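The two constructors can be compared by counting their parameters directly. In the sketch below, `count_params` is a small helper defined here for illustration; both layers are left at their default settings, so each includes its bias.

```python
import torch
import torch.nn as nn

def count_params(module):
    # Sum the element counts of every learnable tensor in the module.
    return sum(p.numel() for p in module.parameters())

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
dense = nn.Linear(1024, 1024)

print(count_params(conv))   # 10       (9 kernel weights + 1 bias)
print(count_params(dense))  # 1049600  (1,048,576 weights + 1,024 biases)
print(conv.weight.shape)    # torch.Size([1, 1, 3, 3])

# Both map a 32x32 image to 1024 output values; only the wiring differs.
x = torch.randn(1, 1, 32, 32)
print(conv(x).shape, dense(x.flatten(1)).shape)
```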
6. When the Bias Is Wrong, and What Came After (Advanced)
The convolutional prior has real limits, and the rest of Part III is partly a story of relaxing them. Locality means a single convolutional layer cannot relate distant parts of an image; you reach long range only by stacking many layers, which Section 19.3 will quantify with the receptive field. Translation equivariance is the only symmetry convolution provides for free; it is not equivariant to rotation or scale, so a network must learn those variations from data or from augmentation, the topic of Chapter 21. And stationarity is only approximately true: the top of a photograph is statistically more likely to be sky than the bottom, a position-dependent fact that pure weight sharing cannot exploit.
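The missing rotation symmetry can be observed with the same kind of experiment used for translation. The sketch below assumes a random $3 \times 3$ kernel and a 90-degree rotation (the `rot90` helper is a local convenience wrapper around `torch.rot90`), and compares the output of a rotated image with the rotated output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)

def rot90(t):
    # Rotate the two spatial dimensions by 90 degrees.
    return torch.rot90(t, k=1, dims=(2, 3))

y = F.conv2d(x, w, padding=1)

# Same kernel on a rotated image: the output is NOT the rotated output.
same_kernel = torch.allclose(
    F.conv2d(rot90(x), w, padding=1), rot90(y), atol=1e-4)

# Rotating the kernel along with the image restores the match, which shows
# that the mismatch comes from the fixed orientation of the weights.
rotated_kernel = torch.allclose(
    F.conv2d(rot90(x), rot90(w), padding=1), rot90(y), atol=1e-4)

print(same_kernel, rotated_kernel)  # False True
```

A network that must recognize rotated objects therefore needs either differently oriented kernels learned from data or rotated training examples, since the layer itself supplies neither.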
The 2020s reopened the question of how much the convolutional inductive bias really buys. Vision transformers (Dosovitskiy et al., ICLR 2021, arXiv:2010.11929), covered in Chapter 22, replaced the convolutional locality prior with global attention over image patches and, given enough data, matched or beat CNNs, suggesting the bias is helpful mainly in the low-data regime. The ConvMixer (Trockman and Kolter, TMLR 2023, arXiv:2201.09792) struck back with a near-trivial all-convolutional network on image patches, arguing that much of the transformer's gain was the patch embedding, not attention. ConvNeXt (Liu et al., CVPR 2022, arXiv:2201.03545) showed a pure CNN could match transformers by importing their training recipe and several of their design choices, and large-kernel work like RepLKNet (Ding et al., CVPR 2022, arXiv:2203.06717) attacks the locality limit head-on with $31 \times 31$ kernels. As of 2026 the consensus is nuanced: the convolutional bias is a genuine sample-efficiency advantage when data is limited, a mild constraint when data is abundant, and the two families increasingly borrow each other's ideas.
With the argument for convolution settled, the next section makes it real. Section 19.2 takes the single-channel, single-filter operation of this section and equips it with the dimensions a working network needs: many input channels, many output channels, stride, padding, and dilation, along with the exact tensor shapes PyTorch expects.
A grayscale input of size $64 \times 64$ feeds a layer that produces a $64 \times 64$ output. Compute the number of trainable weights (ignore biases) for (a) a fully connected layer, (b) a locally connected layer with $5 \times 5$ patches, and (c) a convolutional layer with a single $5 \times 5$ kernel. Then state in one sentence which of the three counts changes if the input grows to $128 \times 128$, and why that difference is the central reason CNNs scale to large images.
Extend the equivariance experiment in this section. First confirm that a single F.conv2d is equivariant to a vertical shift as well as a horizontal one. Then follow the convolution with F.adaptive_avg_pool2d(y, 1) (global average pooling) and show numerically that the pooled scalar is now approximately invariant to the shift. Write one sentence explaining why pooling converts equivariance into invariance, and connect this to why classification heads end with global pooling.
Describe a vision task where the convolutional locality assumption is a poor fit, that is, where the correct output at one location depends strongly on a distant region of the image. Explain why stacking more convolutional layers is an inefficient remedy (relate your answer to the receptive-field growth you will study in Section 19.3), and name one architectural mechanism from later in Part III that addresses the limitation more directly.