Part III: Deep Learning for Computer Vision
Chapter 19: Convolutional Neural Networks

Why Convolution? Locality, Weight Sharing & Inductive Bias

"A fully connected layer asked me where the cat was. I said: which cat, in which corner, on which day? It had to memorize all of them. I just learn what a cat looks like, once, and check everywhere."

A Smugly Parameter-Efficient Convolutional Layer

Big Picture

A convolutional layer is nothing more exotic than a fully connected layer with two restrictions imposed: each output unit may look only at a small local patch of the input, and every output unit must reuse the same weights. Those two restrictions encode an assumption about the world, that useful visual patterns are local and can appear anywhere, and because the assumption matches how images actually behave, the restricted network learns more from less data, generalizes better, and uses a tiny fraction of the parameters. This section is the argument for why convolution is the right prior for images, before Section 19.2 turns it into the layer you will actually use.

In Chapter 18 a network treated its input as a flat vector. For a vector of stock prices or a row of survey responses, that is exactly right: there is no meaningful "neighbor" relationship, and every input feature deserves its own independent weight. For an image, flattening is a small catastrophe. It discards the grid structure that makes a pixel's neighbors informative, and it forces the network to learn the appearance of every pattern separately at every possible location. This section explains why that fails, and how two simple constraints, motivated by the convolution you already built by hand in Chapter 3, fix it. The illustration below previews the whole idea in one picture.

Illustration: A cheerful rubber-stamp character presses one small three-by-three grid pattern repeatedly across an entire landscape photo, tiling identical marks everywhere, while a sweating filing cabinet stuffed with thousands of separate drawers looks overwhelmed beside it, contrasting one reusable convolution kernel against a fully connected layer's many per-position weights.
The whole chapter in one image: learn a pattern detector once, then stamp it everywhere, instead of memorizing a separate detector for every position.

1. The Fully Connected Layer Does Not Scale to Images (Beginner)

Consider a single modestly sized color image, $224 \times 224$ pixels with 3 channels. Flattened, that is $224 \times 224 \times 3 = 150{,}528$ input features. Connect it to a first hidden layer of just 1000 units, the kind of width that is unremarkable in Chapter 18, and the weight matrix alone has $150{,}528 \times 1000 \approx 1.5 \times 10^{8}$ parameters. One layer, 150 million weights, before you have learned anything. Raise the resolution or widen the layer and the first layer alone climbs into the billions; every one of those weights needs data to constrain it, so the network would require an enormous dataset and would overfit ferociously on anything smaller.
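The arithmetic above is easy to reproduce. The sketch below only multiplies the sizes quoted in the text; the variable names are illustrative, and the memory figure rests on an assumption the text does not make, namely float32 storage at 4 bytes per weight.

```python
# Parameter count of the first dense layer described above.
height, width, channels = 224, 224, 3
hidden_units = 1000

input_features = height * width * channels      # length of the flattened image
weights = input_features * hidden_units         # one weight per (input, unit) pair

# Assumption for illustration only: float32 storage, 4 bytes per weight.
weight_megabytes = weights * 4 / 1e6

print(f"input features: {input_features:,}")
print(f"weights:        {weights:,}")
print(f"weight storage: {weight_megabytes:.0f} MB (float32, assumed)")
```

Under that storage assumption the single weight matrix occupies roughly 600 MB before any training has happened.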

The parameter count is only the visible symptom. The deeper problem is that a fully connected layer has no notion of where a feature is, so it cannot transfer knowledge across positions. Suppose the network learns, in the weights feeding one hidden unit, to detect a vertical edge in the top-left corner. Those weights are useless for detecting the same vertical edge in the bottom-right corner; an entirely separate set of weights, feeding a different hidden unit, must learn the identical pattern from scratch. The network spends its capacity re-deriving the same handful of visual primitives once per location. Figure 19.1.1 contrasts the dense wiring of a fully connected layer with the sparse, repeated wiring of a convolutional one.

[Figure 19.1.1 panels. Left, "Fully connected": every output sees every input; all weights distinct. Right, "Convolutional": local patch; the same 3 weights reused at every output. Matching colors = same weight value, shared across positions.]
Figure 19.1.1 Two layers, two philosophies. On the left, each output of a fully connected layer connects to all five inputs through distinct weights, so capacity grows with input size and nothing is shared across positions. On the right, each convolutional output reads only a local 3-input window, and the three weights (green, red, blue) are identical for every output, so a feature learned once applies everywhere.
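A toy version of the position problem can be built by hand. The sketch below is an illustration, not a trained network: it assumes an $8 \times 8$ image, a hypothetical helper `place`, and a single dense unit whose weights are set to a vertical-edge pattern in the top-left corner, as if training had put them there.

```python
import numpy as np

H = W = 8
# A 3x3 vertical-edge pattern: negative column on the left, positive on the right.
edge = np.array([[-1.0, 0.0, 1.0]] * 3)

def place(pattern, top, left):
    """Return an H x W image that is zero except for `pattern` at (top, left)."""
    img = np.zeros((H, W))
    img[top:top + 3, left:left + 3] = pattern
    return img

# One dense hidden unit whose weights hold the edge pattern in the top-left corner.
dense_weights = place(edge, 0, 0).ravel()

resp_same = dense_weights @ place(edge, 0, 0).ravel()           # edge where it was learned
resp_moved = dense_weights @ place(edge, H - 3, W - 3).ravel()  # same edge, bottom-right

print(resp_same)    # strong response: the pattern lines up with the weights
print(resp_moved)   # zero response: same pattern, but it touches different weights
```

The moved edge lands on weights that are all zero, so this unit is blind to it; a separate unit would have to learn the same pattern again for that corner.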

2. Constraint One: Local Connectivity (Beginner)

The first constraint is local connectivity: an output unit connects only to a small spatial neighborhood of the input, not to all of it. This is justified by a fact about images: the pixels that tell you whether there is an edge at position $(x, y)$ are the pixels near $(x, y)$. A pixel a hundred rows away contributes almost nothing to the local question "is there a vertical edge here?". So a hidden unit responsible for a local question needs only local inputs. If the input has spatial size $H \times W$ and each output looks at a $k \times k$ patch, the fan-in of an output drops from $H \times W$ to $k \times k$, which for a typical $3 \times 3$ patch is just 9 connections regardless of how large the image is.

Local connectivity alone is the idea behind a locally connected layer: each output position still has its own private $k \times k$ weights, but it ignores everything outside its patch. This already slashes the connection count, but it does not yet share anything across positions, so the network still relearns the vertical-edge detector separately at every location. We need the second constraint to fix that.
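A locally connected layer is short enough to write out in full. The following is a minimal NumPy sketch under stated assumptions: a single-channel input, zero padding so the output keeps the input's size, and a hypothetical function name `locally_connected`. It is meant to show where the weights live, not to be efficient.

```python
import numpy as np

def locally_connected(x, weights):
    """Locally connected layer on a single-channel image.

    x:       (H, W) input
    weights: (H, W, k, k) array, a private k x k set of weights per output position
    Zero padding keeps the output the same size as the input.
    """
    H, W = x.shape
    k = weights.shape[-1]
    xp = np.pad(x, k // 2)                    # zero border of width k // 2
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]      # the k x k neighborhood of (i, j)
            out[i, j] = np.sum(patch * weights[i, j])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
weights = rng.standard_normal((32, 32, 3, 3))  # nothing is shared across positions

y = locally_connected(x, weights)
print(y.shape)        # output has the same spatial size as the input
print(weights.size)   # fan-in of 9 at each of the 1024 positions
```

Each output reads only nine inputs, yet the weight array still grows with the image because every position keeps its own copy.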

Key Insight: Locality Is an Assumption, Not a Law

Local connectivity bakes in the belief that the information needed to compute a low-level feature is spatially concentrated. This is overwhelmingly true for natural images and is exactly why the small kernels of Chapter 3 work. It is not true for every signal: a layer that needs to relate the far-left and far-right of the input in one step (global reasoning) is poorly served by locality, which is one reason vision transformers in Chapter 22 reintroduce all-pairs attention. The art of architecture is choosing where to assume locality and where to break it.

3. Constraint Two: Weight Sharing (Beginner)

The second constraint is weight sharing: instead of giving each output position its own $k \times k$ weights, every position reuses the same $k \times k$ weights. The justification is again a fact about images, called stationarity: the statistics of natural images are roughly the same everywhere. A vertical edge looks like a vertical edge whether it is in the sky or the pavement, so a detector that works in one place should work in all places. Sharing one set of weights across all positions encodes that belief directly.

The moment we share one $k \times k$ filter across all positions, the operation becomes exactly the cross-correlation you implemented in Section 3.1: slide the kernel, multiply, sum. The layer's entire parameter set is the $k \times k$ kernel (plus one bias), no matter how large the image. The combination of constraints is the layer's name: locally connected with shared weights equals convolution. Table 19.1.1 makes the parameter savings concrete for one $3 \times 3$ filter on a small image.
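The claim that "locally connected with shared weights equals convolution" can be checked numerically. The sketch below assumes a single channel, zero padding, and NumPy's `sliding_window_view` to gather every $k \times k$ neighborhood; the variable names are illustrative. It hands every position of a locally connected layer the same kernel and compares the result with a plain slide-multiply-sum.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
H, W, k = 32, 32, 3
x = rng.standard_normal((H, W))
kernel = rng.standard_normal((k, k))

# All k x k neighborhoods of the zero-padded input, shape (H, W, k, k).
patches = sliding_window_view(np.pad(x, k // 2), (k, k))

# Locally connected layer: one k x k weight set per position, shape (H, W, k, k).
# Here every position is handed the same kernel (weights tied across positions).
tied_weights = np.broadcast_to(kernel, (H, W, k, k))
locally_connected_out = np.einsum("ijab,ijab->ij", patches, tied_weights)

# Cross-correlation: slide one kernel, multiply, sum.
cross_correlation_out = np.einsum("ijab,ab->ij", patches, kernel)

same = np.allclose(locally_connected_out, cross_correlation_out)
print(same)                                   # the two layers coincide
print(tied_weights.size, "->", kernel.size)   # weight slots versus free parameters
```

Tying the weights leaves the computation untouched and shrinks only the number of values that training is free to choose.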

Common Misconception: A CNN Does Not Perform True Convolution

It is natural to assume the "convolution" in a convolutional layer is the flip-and-slide convolution of Section 3.1, where the kernel is mirrored before the dot product. It is not. PyTorch's Conv2d, like every mainstream framework, computes cross-correlation: it slides the kernel without flipping it. The two operations differ only by a $180^{\circ}$ rotation of the kernel, and because the kernel here is learned, the network simply learns the already-flipped weights, so the distinction never affects accuracy. The term "convolution" survived for historical reasons. The practical consequence: if you ever load a hand-designed kernel from Part I (a Sobel operator, say) into a Conv2d and expect the textbook convolution result, you must flip it first, or your edges will point the wrong way.
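The flip is easy to see on a step edge. The sketch below is a hand-rolled NumPy illustration rather than PyTorch's implementation: `cross_correlate` and `convolve` are hypothetical helper names, the output covers only the valid region, and the Sobel kernel uses one common sign convention that may differ from the one in Part I.

```python
import numpy as np

def cross_correlate(x, kernel):
    """Slide `kernel` over `x` without flipping it (valid region only)."""
    k = kernel.shape[0]
    H, W = x.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * kernel)
    return out

def convolve(x, kernel):
    """Textbook convolution: rotate the kernel by 180 degrees, then slide."""
    return cross_correlate(x, np.flip(kernel))

# Horizontal-gradient Sobel kernel (sign convention assumed for illustration).
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# A step edge: dark on the left half, bright on the right half.
x = np.zeros((5, 6))
x[:, 3:] = 1.0

corr_row = cross_correlate(x, sobel_x)[0]   # what an unflipped sliding window gives
conv_row = convolve(x, sobel_x)[0]          # same magnitudes, opposite sign
print(corr_row)
print(conv_row)
```

For this antisymmetric kernel the two results differ exactly by sign, which is the "edges point the wrong way" effect; loading `np.flip(kernel)` into a cross-correlating layer recovers the textbook convolution.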

Table 19.1.1 Trainable weights to map a $32 \times 32$ single-channel input to a $32 \times 32$ single-channel output, under three connectivity schemes. The convolutional layer uses three orders of magnitude fewer weights than the locally connected layer and five fewer than the fully connected one, and it is the only one whose count is independent of image size.
| Layer type | Weights per output | Shared? | Total weights |
|---|---|---|---|
| Fully connected | $1024$ (all inputs) | No | $1024 \times 1024 = 1{,}048{,}576$ |
| Locally connected ($3\times3$) | $9$ | No | $9 \times 1024 = 9{,}216$ |
| Convolutional ($3\times3$) | $9$ | Yes | $9$ (one kernel) |

The contrast in the final column is the whole story of why CNNs work where dense networks cannot. The fully connected layer needs over a million weights; the convolution needs nine. The convolution will need more in practice, one kernel per output channel as we will see in Section 19.2, but the count is governed by the number of distinct features the layer learns, not by the size of the image. A 4-megapixel photograph and a thumbnail use the same kernel weights.
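The three totals in Table 19.1.1 come from three one-line formulas, which makes it easy to see how each scales. The sketch below uses a hypothetical helper `weight_counts`, ignores biases as the table does, and takes a $2000 \times 2000$ image as a stand-in for the 4-megapixel photograph mentioned above.

```python
def weight_counts(height, width, k):
    """Trainable weights (biases ignored) for a single-channel, same-size layer."""
    positions = height * width
    return {
        "fully connected": positions * positions,   # every output sees every input
        "locally connected": k * k * positions,     # private k x k patch per output
        "convolutional": k * k,                     # one shared k x k kernel
    }

table = weight_counts(32, 32, 3)        # the setting of Table 19.1.1
large = weight_counts(2000, 2000, 3)    # roughly a 4-megapixel image

for name in table:
    print(f"{name:>17}: {table[name]:>9,}  ->  {large[name]:,}")
```

Only the last row is unchanged between the two image sizes, which is the property the paragraph above singles out.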

Practical Example: The Defect That Moved

Who: A machine-vision engineer at a printed-circuit-board manufacturer building an automated solder-joint inspector.

Situation: The first prototype used a fully connected classifier on flattened crops of each board. It hit 94 percent accuracy in the lab and was scheduled to ship.

Problem: On the line, accuracy collapsed to the low seventies. The boards were not always seated in the exact same jig position; a one-millimeter shift moved every joint a few pixels in the frame. The dense network had memorized defect appearance at the training positions and could not recognize the same defect a few pixels over, the textbook failure of a position-dependent model.

Decision: Replace the dense classifier with a small convolutional network, trusting weight sharing to make a defect detector that works at any position by construction, and add modest translation augmentation on top.

Result: On-line accuracy rose to 96 percent and, more importantly, stopped depending on jig alignment. The total parameter count dropped by a factor of forty, which let the model run on the existing edge device without a hardware upgrade.

Lesson: The right inductive bias does not just save parameters; it removes an entire class of failure (position dependence) that no amount of dense-network tuning would have cured. Weight sharing turned "recognize this defect here" into "recognize this defect anywhere," which was the actual requirement.

4. Equivariance: The Property That Falls Out for Free (Intermediate)

Weight sharing gives convolution a clean mathematical property called translation equivariance. A function $f$ is equivariant to a transformation $T$ if transforming the input transforms the output the same way: $f(T(x)) = T(f(x))$. For convolution the transformation is a spatial shift: if you slide the input image two pixels to the right, every activation in the output slides two pixels to the right, unchanged in value. Formally, writing $S_{\delta}$ for a spatial shift by $\delta$ and $\ast$ for convolution with kernel $w$:

$$ (S_{\delta} x) \ast w \;=\; S_{\delta}(x \ast w). $$

This is not the same as invariance, where the output would be unchanged by the shift. Convolution is equivariant: the features move with the object. Invariance is something you build later, by pooling or by global averaging, when you want a whole-image label that does not care where the object sits. The distinction matters: for classification you eventually want invariance (a cat anywhere is a cat), but for detection in Chapter 23 and segmentation in Chapter 24 you want equivariance preserved, because you need to report where the object is. Convolution gives you equivariance by default and lets you trade it for invariance deliberately.

The code below demonstrates equivariance empirically. We run a fixed convolution on an image, then on a shifted copy of the same image, and confirm that the second output is the shifted first output (away from the borders, where padding breaks the symmetry).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)                 # (batch, channel, H, W)
w = torch.randn(1, 1, 3, 3)                   # one 3x3 filter

# Shift the input 2 columns to the right (roll, then zero the wrapped edge).
x_shifted = torch.roll(x, shifts=2, dims=3)
x_shifted[..., :2] = 0.0

y         = F.conv2d(x,         w, padding=1)  # cross-correlation, same size
y_shifted = F.conv2d(x_shifted, w, padding=1)

# The output of the shifted input equals the shifted output, in the interior.
expected = torch.roll(y, shifts=2, dims=3)
print(torch.allclose(y_shifted[..., 3:-1], expected[..., 3:-1], atol=1e-5))
# Expected output: True   (equivariance holds away from the border)
```

Code Fragment 1: An empirical check of translation equivariance: convolving a shifted image produces the shifted convolution of the original, confirming the feature map tracks the object's position rather than memorizing absolute coordinates.

Try This: Watch the Border Break the Symmetry

Run the code above, then change shifts=2 to 1, then to 4, in both the input torch.roll and the expected roll, and widen the zeroed strip x_shifted[..., :2] to match. As long as the comparison slice starts at or beyond the shift (the original [..., 3:-1] covers shifts up to 3; for shifts=4 use [..., 4:-1]), the interior comparison still prints True. Now widen the slice toward the full width [..., :] and watch it flip to False. The lesson lands in 30 seconds: equivariance is exact in the interior and breaks only at the edges, where content is shifted in or out of the frame and zero padding takes its place, which is exactly why the zero border discussed in Section 19.2 lets a network read a little absolute position. Vary the shift and the width of the spoiled band grows with it.

This experiment is the runtime echo of the shift-invariance discussion in Section 3.1, where you saw that a designed filter applies identical weights at every location. The only thing Section 19.2 adds is making those weights learnable; the equivariance is inherited from the sliding-window structure, not from any property of the weight values.
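One way to confirm that the border, and not the sliding window, is what spoils the symmetry is to remove the border altogether. The NumPy sketch below assumes wrap-around (circular) boundaries, which is not what the experiment above uses, and a hypothetical helper `circular_cross_correlate`. With wrap-around the comparison can be made over the full width. PyTorch's Conv2d has a comparable padding_mode="circular" option, but the sketch stays in plain NumPy.

```python
import numpy as np

def circular_cross_correlate(x, kernel):
    """3x3 cross-correlation with wrap-around (circular) borders, same-size output."""
    out = np.zeros_like(x)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            # Rolling by (-di, -dj) brings x[i + di, j + dj] to position (i, j).
            out += kernel[di + 1, dj + 1] * np.roll(x, shift=(-di, -dj), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))

shift = 2
shifted_then_filtered = circular_cross_correlate(np.roll(x, shift, axis=1), kernel)
filtered_then_shifted = np.roll(circular_cross_correlate(x, kernel), shift, axis=1)

# Compared over the full width, with no interior slice.
exact_everywhere = np.allclose(shifted_then_filtered, filtered_then_shifted)
print(exact_everywhere)
```

With no zero border to mark where the image ends, shifting and filtering commute at every pixel.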

Fun Note: The Lazy Genius of Sharing

A fully connected layer is the student who memorizes every worked example in the textbook and panics when the exam reorders the numbers. The convolution is the student who learned the one rule and applies it anywhere. The punchline is that the lazy-looking one wins: by refusing to learn the same edge detector 50,000 times, the convolution frees up its capacity for the parts that actually vary. Remember the layer's motto, learn it once, look everywhere, and you have the whole section in five words, sketched in the illustration below.

Illustration: Two cartoon students at exam desks: one is exhausted under a towering stack of flashcards each showing the same cat in a different corner, while the other calmly holds a single cat card and a magnifying glass to scan the whole page, illustrating a fully connected layer memorizing every location versus a convolution learning one detector and applying it everywhere.
Weight sharing's motto in five words: learn it once, look everywhere, which is exactly why the lazy-looking layer wins.

5. Inductive Bias: Why the Right Prior Beats Raw Capacity (Intermediate)

The deep reason convolution wins is best stated in the language of inductive bias. Every learning algorithm needs assumptions to generalize from finite data to unseen inputs; without any assumptions, fitting the training set tells you nothing about new examples. The set of assumptions a model builds in is its inductive bias. A fully connected network has a very weak bias: given enough hidden units it can in principle approximate any function, including the convolution, but it has no preference for the position-independent, local functions that images actually demand, so it must discover that structure from data, which takes enormous amounts of data.

A convolutional network bakes the structure in. Its hypothesis space contains only local, weight-shared functions, which is a tiny corner of the space of all functions, but it is the corner that contains the good image models. Restricting the hypothesis space is a double-edged tool: if the assumption is right, you generalize far better from the same data because you are not wasting capacity on functions that cannot be the answer; if the assumption is wrong, you cannot fit the truth at all. For natural images, the convolutional assumption is right often enough that for a decade it was the only game in town.

Key Insight: The Bias-Capacity Trade

A fully connected layer is strictly more expressive than a convolutional one (it can represent every convolution and more) yet performs far worse on images from the same data budget. Expressiveness is not the goal; matching the data's structure is. Convolution is a deliberate reduction in expressiveness that pays for itself many times over by sample efficiency. This is the single most important idea in the chapter, and it reappears every time you choose an architecture: you are choosing a prior, and the best prior is the one that is true.
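The statement that a fully connected layer "can represent every convolution" can be made concrete by writing the convolution as a dense weight matrix. The sketch below works in one dimension to keep the matrix small; the signal length, kernel size, and zero padding are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
x = rng.standard_normal(n)
kernel = rng.standard_normal(k)

# Build the n x n dense weight matrix a fully connected layer would need
# in order to behave exactly like 1D cross-correlation with zero padding.
dense = np.zeros((n, n))
for i in range(n):                 # output position
    for d in range(k):             # tap within the kernel
        j = i + d - k // 2         # input position read by this tap
        if 0 <= j < n:
            dense[i, j] = kernel[d]

# Reference: slide the kernel over the zero-padded signal.
xp = np.pad(x, k // 2)
sliding = np.array([xp[i:i + k] @ kernel for i in range(n)])

matches = np.allclose(dense @ x, sliding)
print(matches)
print(np.count_nonzero(dense), "nonzero entries,", kernel.size, "distinct values")
```

The matrix is banded because of local connectivity and constant along each diagonal because of weight sharing. A dense layer may fill all 64 entries freely; the convolution is the special case in which most entries are pinned to zero and the remainder to three shared values.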

Library Shortcut: The Whole Argument in One Constructor

Everything above, local connectivity and weight sharing, is delivered by one line: torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1). Compare it to the dense equivalent torch.nn.Linear(1024, 1024), which on a flattened $32 \times 32$ input holds over a million parameters to the convolution's ten (nine kernel weights plus one bias). PyTorch's Conv2d also handles the sliding window in optimized C++/CUDA, supports multiple channels, stride, padding, dilation, and grouped convolution out of the box, and exposes the learnable kernel as .weight for inspection in Section 19.6. The single constructor encodes the inductive bias that this entire section argued for.
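The two constructors can be compared directly by counting their parameters. The sketch below assumes PyTorch is installed and uses a hypothetical helper `count_parameters`; unlike Table 19.1.1, the counts include the bias terms.

```python
import torch

conv = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
dense = torch.nn.Linear(1024, 1024)

def count_parameters(module):
    """Total number of trainable scalars, weights and biases together."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

conv_params = count_parameters(conv)     # 9 kernel weights + 1 bias
dense_params = count_parameters(dense)   # 1024 * 1024 weights + 1024 biases
print(conv_params, dense_params)
print(tuple(conv.weight.shape))          # (out_channels, in_channels, kH, kW)

# Both layers map a 32 x 32 image to 1024 output values; only the layout differs.
x = torch.randn(1, 1, 32, 32)
conv_out_shape = tuple(conv(x).shape)
dense_out_shape = tuple(dense(x.flatten(1)).shape)
print(conv_out_shape, dense_out_shape)
```

The dense layer needs the image flattened first, which is the step that discards the grid structure.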

6. When the Bias Is Wrong, and What Came After (Advanced)

The convolutional prior has real limits, and the rest of Part III is partly a story of relaxing them. Locality means a single convolutional layer cannot relate distant parts of an image; you reach long range only by stacking many layers, which Section 19.3 will quantify with the receptive field. Translation equivariance is the only symmetry convolution provides for free; it is not equivariant to rotation or scale, so a network must learn those variations from data or from augmentation, the topic of Chapter 21. And stationarity is only approximately true: the top of a photograph is statistically more likely to be sky than the bottom, a position-dependent fact that pure weight sharing cannot exploit.
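The missing rotation symmetry can be observed with the same kind of check used for translation. The NumPy sketch below is an illustration under stated assumptions: wrap-around borders so that edge effects cannot be blamed, a random kernel with no symmetry of its own, a 90-degree rotation via `np.rot90`, and a hypothetical helper `circular_cross_correlate`.

```python
import numpy as np

def circular_cross_correlate(x, kernel):
    """3x3 cross-correlation with wrap-around borders, so edges play no role."""
    out = np.zeros_like(x)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += kernel[di + 1, dj + 1] * np.roll(x, shift=(-di, -dj), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))      # a generic kernel with no symmetry

rotated_input_out = circular_cross_correlate(np.rot90(x), kernel)
rotated_output = np.rot90(circular_cross_correlate(x, kernel))

# Rotating the input does not simply rotate the feature map...
rotation_equivariant = np.allclose(rotated_input_out, rotated_output)

# ...unless the kernel is rotated to compensate, which a fixed layer cannot do.
compensated = np.rot90(circular_cross_correlate(x, np.rot90(kernel, -1)))
matches_with_rotated_kernel = np.allclose(rotated_input_out, compensated)

print(rotation_equivariant, matches_with_rotated_kernel)
```

The first comparison fails and the second succeeds: a rotated input is handled correctly only by a rotated copy of the kernel, so a network with ordinary convolutions has to learn such copies from data or augmentation.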

Research Frontier: Testing the Convolutional Prior

The 2020s reopened the question of how much the convolutional inductive bias really buys. Vision transformers (Dosovitskiy et al., ICLR 2021, arXiv:2010.11929), covered in Chapter 22, dropped convolution's built-in locality in favor of global attention over image patches and, given enough data, matched or beat CNNs, suggesting the bias is helpful mainly in the low-data regime. The ConvMixer (Trockman and Kolter, TMLR 2023, arXiv:2201.09792) struck back with a near-trivial all-convolutional network on image patches, arguing that much of the transformer's gain was the patch embedding, not attention. ConvNeXt (Liu et al., CVPR 2022, arXiv:2201.03545) showed a pure CNN could match transformers by importing their training recipe, and large-kernel work like RepLKNet (CVPR 2022, arXiv:2203.06717) attacks the locality limit head-on with $31 \times 31$ kernels. As of 2026 the consensus is nuanced: the convolutional bias is a genuine sample-efficiency advantage when data is limited, a mild constraint when data is abundant, and the two families increasingly borrow each other's ideas.

With the argument for convolution settled, the next section makes it real. Section 19.2 takes the single-channel, single-filter operation of this section and equips it with the dimensions a working network needs: many input channels, many output channels, stride, padding, and dilation, along with the exact tensor shapes PyTorch expects.

Exercise 19.1.1: Count the Savings (Conceptual)

A grayscale input of size $64 \times 64$ feeds a layer that produces a $64 \times 64$ output. Compute the number of trainable weights (ignore biases) for (a) a fully connected layer, (b) a locally connected layer with $5 \times 5$ patches, and (c) a convolutional layer with a single $5 \times 5$ kernel. Then state in one sentence which of the three counts changes if the input grows to $128 \times 128$, and why that difference is the central reason CNNs scale to large images.

Exercise 19.1.2: Equivariance Versus Invariance (Coding)

Extend the equivariance experiment in this section. First confirm that a single F.conv2d is equivariant to a vertical shift as well as a horizontal one. Then follow the convolution with F.adaptive_avg_pool2d(y, 1) (global average pooling) and show numerically that the pooled scalar is now approximately invariant to the shift. Write one sentence explaining why pooling converts equivariance into invariance, and connect this to why classification heads end with global pooling.

Exercise 19.1.3: When Locality Hurts (Analysis)

Describe a vision task where the convolutional locality assumption is a poor fit, that is, where the correct output at one location depends strongly on a distant region of the image. Explain why stacking more convolutional layers is an inefficient remedy (relate your answer to the receptive-field growth you will study in Section 19.3), and name one architectural mechanism from later in Part III that addresses the limitation more directly.