"I take in 64 channels and put out 128. People think that means I doubled something. What I actually did was learn 128 different opinions about the same patch, each one a weighted vote over all 64 inputs."
A Deeply Multi-Channel Convolutional Layer
A real convolutional layer is the single-filter operation of Section 19.1 stacked along three new axes: many input channels per filter, many filters per layer, and a set of geometric controls (stride, padding, and dilation) that decide the output's spatial size and reach. Master the tensor shapes and the one output-size formula in this section and you can read any architecture table in the rest of the book, because every convolutional layer you will ever meet is fully described by a handful of these numbers.
In Section 19.1 a convolution was one $k \times k$ kernel sliding over one grayscale image. That is enough to argue for the inductive bias, but no useful network looks like that. Color images have three input channels; intermediate feature maps have dozens or hundreds. A layer learns many filters at once, not one. And you constantly need to control the spatial resolution, shrinking it to summarize, holding it to preserve detail, or expanding a filter's reach without paying for more weights. This section adds all of that, ending at the exact PyTorch Conv2d arguments and the shapes that flow through them, the foundation for the network you will train in Section 19.5.
1. Channels: A Filter Spans the Full Input Depth Beginner
The first generalization is depth. A color image is not one $H \times W$ grid but three, stacked: red, green, and blue. In tensor terms it is a $3 \times H \times W$ array, where 3 is the number of input channels. A convolutional filter does not slide three separate $k \times k$ kernels; it slides one $k \times k \times 3$ kernel that reaches across all input channels at once. At each spatial position it multiplies its $k \times k \times 3$ weights against the $k \times k \times 3$ patch beneath it, sums every product into a single number, and that number is one pixel of one output channel. Crucially, the depth of a filter always equals the number of input channels, so you never specify it directly; it is inferred.
A layer learns many such filters, and the count of filters is the number of output channels. Each filter produces its own $H' \times W'$ output map, and these maps stack to form the layer's output of shape $C_{\text{out}} \times H' \times W'$. So a layer mapping a $3 \times 32 \times 32$ color image to a $16 \times 32 \times 32$ feature volume holds $16$ filters, each $3 \times 3 \times 3$, for $16 \times 3 \times 3 \times 3 = 432$ weights plus $16$ biases. Figure 19.2.1 shows the full three-axis structure: depth of each filter set by the input, count of filters set by the desired output channels.
This is the shape you decoded at the end of Section 3.1, where the PyTorch weight tensor was (out_channels, in_channels, kH, kW). Now you know what each axis means: the layer holds out_channels filters, each of depth in_channels, each a $kH \times kW$ spatial grid.
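The shape bookkeeping above can be checked directly. The sketch below, assuming PyTorch is installed, inspects the weight tensor of a 3-to-16-channel layer and recomputes one output pixel by hand. The tensor sizes and the output position `(i, j)` are arbitrary choices for illustration, and padding is left at zero so the patch indices stay simple.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)  # stride 1, no padding
x = torch.randn(1, 3, 32, 32)
y = conv(x)

print(conv.weight.shape)  # (out_channels, in_channels, kH, kW)
print(conv.bias.shape)    # one bias per filter

# Recompute one pixel of output channel 0 by hand: multiply the 3x3x3 filter
# against the 3x3x3 patch beneath it, sum every product, add the bias.
i, j = 5, 7                                  # arbitrary output position
patch = x[0, :, i:i + 3, j:j + 3]            # spans all 3 input channels
manual = (patch * conv.weight[0]).sum() + conv.bias[0]
print(torch.allclose(manual, y[0, 0, i, j], atol=1e-4))
```

Note that the filter depth (3) never appears as an argument of its own; it is inferred from `in_channels`.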
The most persistent beginner misreading of Conv2d(64, 128, ...) is "it doubles the channels," as if the layer copied something. It copies nothing. It learns 128 fresh opinions about every patch, each opinion a weighted vote across all 64 inputs, and then files those 128 verdicts as the new channels. The number 128 is a design choice about how many distinct things this layer is allowed to notice, not a transformation of the 64. The memory hook: output channels are opinions, not copies.
2. The Output-Size Formula Beginner
The spatial output size is governed by one formula you will use constantly. For an input of spatial size $H$ along one axis, a kernel of size $k$, padding $p$ on each side, stride $s$, and dilation $d$, the output size is:
$$ H_{\text{out}} \;=\; \left\lfloor \frac{H + 2p - d\,(k - 1) - 1}{s} \right\rfloor + 1. $$
Each term has a plain meaning. The kernel cannot center on the outermost pixels, so the output loses $k - 1$ positions along each axis (the $d(k-1)$ form generalizes this to dilation, below). Padding $p$ adds $p$ rows or columns of border on each side, $2p$ in total per axis, often chosen precisely to cancel that loss. Stride $s$ subsamples the output positions, dividing their count by roughly $s$. The floor handles the case where the kernel does not fit a whole number of times. We will dissect stride, padding, and dilation one at a time, but commit the formula to memory; it predicts nearly every shape mismatch you will ever debug.
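The formula translates into a one-line function. The sketch below is plain Python with no dependencies; the helper name `conv_out_size` is made up for this example and is not part of any library.

```python
def conv_out_size(h, k, s=1, p=0, d=1):
    """Output size along one axis for input size h, kernel size k, stride s,
    padding p (per side), and dilation d. Hypothetical helper name."""
    return (h + 2 * p - d * (k - 1) - 1) // s + 1

print(conv_out_size(32, k=3, s=1, p=1))        # 32: padding cancels the k-1 loss
print(conv_out_size(32, k=3, s=1, p=0))        # 30: no padding, lose k-1 = 2
print(conv_out_size(32, k=3, s=2, p=1))        # 16: stride 2 halves the size
print(conv_out_size(32, k=3, s=1, p=2, d=2))   # 32: dilation 2 needs p=2 to keep size
```

Integer division `//` plays the role of the floor, which is valid here because the numerator is non-negative for any layer whose kernel fits inside the padded input.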
3. Stride: Downsampling Inside the Convolution Beginner
Stride is the step size of the sliding window. A stride of 1 visits every pixel; a stride of 2 visits every other pixel, halving the output resolution along each axis. Striding is the convolution's built-in way to downsample: instead of computing a full-resolution map and then shrinking it, you skip positions during the convolution itself, which is cheaper and is the standard way modern architectures reduce spatial size (older networks used pooling, covered in Section 19.3, but strided convolution has largely taken over). With $k = 3$, $p = 1$, $d = 1$, a stride of $s = 2$ on a $32 \times 32$ input gives $\lfloor (32 + 2 - 2 - 1)/2 \rfloor + 1 = 16$, an exact halving.
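One way to see that striding is "skipping positions" is to compare a stride-2 convolution against a stride-1 convolution whose output is subsampled afterwards. The sketch below assumes PyTorch and uses random weights purely for illustration; both paths share the same kernel and bias.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 32, 32)
w = torch.randn(16, 3, 3, 3)   # 16 filters, each 3x3 across 3 channels
b = torch.randn(16)

full = F.conv2d(x, w, b, stride=1, padding=1)      # visits every position: 32x32
strided = F.conv2d(x, w, b, stride=2, padding=1)   # visits every other position: 16x16

subsampled = full[:, :, ::2, ::2]                  # compute everything, then discard
print(strided.shape, subsampled.shape)
print(torch.allclose(strided, subsampled, atol=1e-4))
```

The two results agree, but the strided version computes only a quarter of the output values, which is where the saving comes from.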
4. Padding: Controlling Shape and Borders Beginner
Padding adds a border of extra values around the input so the kernel can center on the edge pixels. Without it, every convolution shrinks the map by $k - 1$, and a deep stack would erode a $32 \times 32$ image to nothing after sixteen $3 \times 3$ layers. The common choice is "same" padding, $p = (k-1)/2$ for odd $k$ and stride 1, which makes the output exactly the input size: for $k = 3$ use $p = 1$, for $k = 5$ use $p = 2$. The values placed in the border are usually zeros, though PyTorch also offers reflect and replicate modes, exactly the border strategies you studied in Section 3.6. Zero padding is the default and is rarely worth changing, but it does introduce a faint artifact: pixels near the border see artificial zeros, so the network can in principle read absolute position from the border, partially breaking the position-independence that Section 19.1 prized.
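The effect of the border strategy can be isolated by padding explicitly and then convolving without padding. The sketch below assumes PyTorch and uses `torch.nn.functional.pad`, whose `"constant"` mode pads with zeros by default; the input size and random kernel are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, 3, 3)

outs = {}
for mode in ("constant", "reflect", "replicate"):   # "constant" means zero padding here
    padded = F.pad(x, (1, 1, 1, 1), mode=mode)      # p = 1 on left, right, top, bottom
    outs[mode] = F.conv2d(padded, w)                # no further padding inside the conv

for mode, y in outs.items():
    print(mode, tuple(y.shape))                     # every mode keeps the 8x8 size

# Interior pixels never touch the border, so they agree across modes;
# only the outermost ring of outputs depends on the padding strategy.
interior_same = torch.allclose(outs["constant"][..., 1:-1, 1:-1],
                               outs["reflect"][..., 1:-1, 1:-1])
border_same = torch.allclose(outs["constant"], outs["reflect"])
print(interior_same, border_same)
```

In `nn.Conv2d` the same choice is made through the `padding_mode` argument rather than an explicit pad.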
The code below ties the formula to the API. It builds three layers that differ only in stride and padding, and prints the resulting shapes so you can match each to the formula.
import torch
import torch.nn as nn
x = torch.randn(8, 3, 32, 32) # (batch=8, channels=3, H=32, W=32)
same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1) # keep 32x32
shrink = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=0) # lose 2 -> 30
halve = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1) # downsample -> 16
print(same(x).shape) # torch.Size([8, 16, 32, 32])
print(shrink(x).shape) # torch.Size([8, 16, 30, 30])
print(halve(x).shape) # torch.Size([8, 16, 16, 16])
# Trainable parameters of one layer: out*in*kH*kW weights + out biases.
print(sum(p.numel() for p in same.parameters())) # 16*3*3*3 + 16 = 448
The 448 parameters in that layer (432 weights plus 16 biases) are the same whether the input is $32 \times 32$ or $4096 \times 4096$. This is the practical face of weight sharing from Section 19.1: the layer's memory cost for weights is fixed by channels and kernel size, while its compute cost scales with the number of output positions. The two costs are independent, which is why you can fine-tune the convolutional layers of an ImageNet model on higher-resolution images without adding a single parameter; only the compute and activation memory grow.
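The independence of parameter count and input resolution is easy to confirm. The sketch below, assuming PyTorch, feeds one layer three input sizes (chosen arbitrarily) and reports the fixed parameter count next to the growing number of multiplications.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
n_params = sum(p.numel() for p in conv.parameters())

shapes = []
for size in (32, 128, 512):
    x = torch.randn(1, 3, size, size)
    y = conv(x)
    shapes.append(tuple(y.shape))
    # Each output value costs in_channels * kH * kW = 27 multiplications.
    mults = y.numel() * 27
    print(f"{size}x{size}: params={n_params}, output={tuple(y.shape)}, multiplies={mults:,}")
```

The parameter column never changes, while the multiply count grows with the number of output positions.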
5. Dilation: Reach Without Cost Intermediate
Dilation spreads the kernel's sampling points apart, inserting gaps between the weights so a $3 \times 3$ kernel covers a $5 \times 5$ region (dilation 2) or larger, while still using only nine weights. The output-size formula's $d(k-1)$ term captures the enlarged footprint. Dilation buys a larger receptive field (the subject of Section 19.3) at no extra parameters and no extra compute, which is why it became central to semantic segmentation, where you need to see broad context but cannot afford to lose resolution by striding. Figure 19.2.2 contrasts a standard $3 \times 3$ kernel with its dilated counterpart.
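In PyTorch, dilation is a single constructor argument. The sketch below assumes PyTorch and an arbitrary 8-channel input; it compares a plain and a dilated $3 \times 3$ layer and shows that the parameter count is unchanged while the padding needed to preserve size grows with the footprint.

```python
import torch
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

x = torch.randn(1, 8, 32, 32)
plain = nn.Conv2d(8, 8, kernel_size=3, padding=1, dilation=1)    # footprint 3x3
dilated = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)  # footprint 5x5, so p=2

print(count_params(plain), count_params(dilated))   # identical: still nine weights per filter slice
print(plain(x).shape, dilated(x).shape)             # both keep 32x32
```

The footprint of a dilated kernel is $d(k-1) + 1$ along each axis, so size-preserving padding for odd $k$ at stride 1 becomes $p = d(k-1)/2$.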
Here is the number that explains why dilation took over segmentation. To make a plain stack of $3 \times 3$ convolutions cover a $15 \times 15$ window, you need seven layers, because each ordinary $3 \times 3$ layer adds only $2$ pixels of reach. Now dilate them instead: layers at dilation $1$, $2$, then $4$ cover a $15 \times 15$ window in just three layers, still nine weights each, still nine multiplies per output pixel. Reach grows geometrically with the dilation rate while parameters and compute stay flat, which is exactly what a segmentation network wants: it must see broad context around every pixel yet cannot afford to throw away resolution by striding. The cost is the gaps you saw in Figure 19.2.2, so production designs (the DeepLab atrous-spatial-pyramid family) mix several dilation rates in parallel to fill them in.
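The reach arithmetic in this paragraph can be reproduced in a few lines of plain Python. The helper name `stack_reach` is invented for this sketch, and it assumes every layer has stride 1.

```python
def stack_reach(dilations, k=3):
    """Side length of the input window seen by one output pixel of a stack of
    stride-1 k x k convolutions with the given dilation rates (hypothetical helper)."""
    reach = 1
    for d in dilations:
        reach += d * (k - 1)   # each layer widens the window by d * (k - 1)
    return reach

print(stack_reach([1] * 7))     # 15: seven plain 3x3 layers
print(stack_reach([1, 2, 4]))   # 15: three dilated layers
```

Both stacks reach a $15 \times 15$ window, but the dilated one does it with three layers' worth of weights instead of seven.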
6. Two Special Cases: 1x1 and Depthwise Convolution Intermediate
Two degenerate-looking convolutions are workhorses of modern design. The $1 \times 1$ convolution uses a single-pixel kernel, so it does no spatial mixing at all; it is a per-pixel linear combination across channels, exactly a small fully connected layer applied independently at every location. It is the cheapest way to change the channel count, mix information across channels, or (paired with an activation) add nonlinearity between spatial layers, and it is the backbone of the bottleneck blocks in Chapter 20. The depthwise convolution goes the other way: it convolves each input channel with its own single-channel kernel and does no cross-channel mixing, set in PyTorch by groups=in_channels. The groups argument splits the channels into independent bundles that never see each other; setting it equal to the channel count puts every channel in its own bundle, which is exactly "one private kernel per channel." Pairing a depthwise convolution (spatial mixing, cheap) with a $1 \times 1$ convolution (channel mixing, cheap) gives the depthwise-separable convolution that makes MobileNet efficient enough for phones, a topic that returns in Chapter 28.
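Both claims can be checked numerically. The sketch below assumes PyTorch; it copies the weights of a $1 \times 1$ convolution into an `nn.Linear` layer to show that the two compute the same thing at every pixel, and then inspects the weight shape of a depthwise layer. The channel counts are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 8, 5, 5)                  # (batch, channels, H, W)

# A 1x1 convolution versus a Linear layer sharing the same weights.
pointwise = nn.Conv2d(8, 4, kernel_size=1)
linear = nn.Linear(8, 4)
with torch.no_grad():
    linear.weight.copy_(pointwise.weight.view(4, 8))   # drop the 1x1 spatial axes
    linear.bias.copy_(pointwise.bias)

y_conv = pointwise(x)
# Linear acts on the last axis, so move channels last and back again.
y_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(y_conv, y_lin, atol=1e-5))

# Depthwise convolution: one private 3x3 kernel per channel.
depthwise = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=8)
print(depthwise.weight.shape)   # each filter has depth 1, not 8
```

The depthwise weight tensor has shape `(8, 1, 3, 3)`: the second axis is `in_channels / groups`, which is why no cross-channel mixing can occur.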
Who: An embedded-vision team shipping a person-detection model on a battery-powered smart doorbell with a tiny microcontroller-class accelerator.
Situation: Their accurate prototype used ordinary $3 \times 3$ convolutions throughout. It ran at 2 frames per second and drained the battery in a day, both non-starters.
Problem: The standard $3 \times 3$ convolutions dominated the compute budget. A layer mapping 64 channels to 64 with a $3 \times 3$ kernel costs $64 \times 64 \times 3 \times 3$ multiplications per output pixel, and there were dozens of such layers.
Decision: Replace every standard convolution with a depthwise-separable pair: a $3 \times 3$ depthwise convolution (spatial, $64 \times 3 \times 3$ per pixel) followed by a $1 \times 1$ convolution (channel mixing, $64 \times 64$ per pixel). The combined cost is $64 \times 9 + 64 \times 64 = 4{,}672$ multiplications per output pixel versus $64 \times 64 \times 9 = 36{,}864$, roughly an eightfold reduction.
Result: Inference rose to 15 frames per second and battery life to roughly a week, with a 1.5-point accuracy drop that augmentation and a slightly wider network recovered. The product shipped on the original hardware.
Lesson: Factoring one convolution into a spatial part and a channel part costs little accuracy and saves a great deal of compute, which is why most efficient mobile architectures since 2017 are built from these two special cases rather than from plain $3 \times 3$ convolutions.
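The cost comparison in the case study is simple enough to recompute. The sketch below is plain Python and counts multiplications per output pixel for the 64-to-64-channel layer described above; biases and activations are ignored.

```python
c_in, c_out, k = 64, 64, 3

standard = c_in * c_out * k * k      # one 3x3 convolution, 64 -> 64
depthwise = c_in * k * k             # 3x3 depthwise, one kernel per channel
pointwise = c_in * c_out             # 1x1 convolution, 64 -> 64
separable = depthwise + pointwise

print(standard, separable, round(standard / separable, 2))   # 36864 4672 7.89
```

In general the ratio is $1 / (1/C_{\text{out}} + 1/k^2)$, so for $3 \times 3$ kernels it approaches nine only as the output channel count grows large.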
With stride, $1 \times 1$, and depthwise-separable convolutions in hand, you can build a tiny real-time classifier that turns a hand sign in front of your webcam (open palm, fist, thumbs-up) into an action on your machine, all running live on a laptop CPU. Capture a few hundred frames per gesture with OpenCV, stack five or six depthwise-separable blocks ending in global average pooling, train for a few minutes, then read frames in a loop and trigger a key press on the top prediction. Difficulty: intermediate, about two to three hours. The portfolio hook is that it forces the cost reasoning of this section: the doorbell example's roughly eightfold multiply reduction is exactly what keeps the loop above 20 frames per second on a CPU with no GPU. Stretch it by adding a "no gesture" class so the switch stays quiet when your hands are at the keyboard, the same background-rejection problem real on-device detectors must solve.
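As a starting point for the project, here is one possible shape for the network, offered as a sketch rather than a tuned design. The class names `DSBlock` and `TinyGestureNet`, the channel widths, and the $96 \times 96$ input size are all assumptions made for this example; real designs usually add a normalization layer after each convolution, which is omitted here to stay within this section's vocabulary.

```python
import torch
import torch.nn as nn

class DSBlock(nn.Module):
    """Depthwise-separable block: 3x3 depthwise, then 1x1 pointwise (hypothetical name)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in),  # spatial mixing
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1),                                        # channel mixing
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class TinyGestureNet(nn.Module):
    def __init__(self, num_classes=3):        # open palm, fist, thumbs-up
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1),   # 96 -> 48
            nn.ReLU(inplace=True),
            DSBlock(16, 32, stride=2),                  # 48 -> 24
            DSBlock(32, 32),
            DSBlock(32, 64, stride=2),                  # 24 -> 12
            DSBlock(64, 64),
            DSBlock(64, 128, stride=2),                 # 12 -> 6
        )
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.head(x)

model = TinyGestureNet().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 96, 96))
print(logits.shape)
```

For the stretch goal, passing `num_classes=4` adds the "no gesture" output without any other change.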
7. The PyTorch Conv2d API, End to End Intermediate
Everything in this section is one constructor and one forward call. The snippet below builds a three-convolution stack that takes a batch of color images and produces a 32-channel feature volume at half resolution, exercising channels, stride, padding, and a $1 \times 1$ projection, and it prints the shape after each convolution so you can verify the formula at every step.
import torch
import torch.nn as nn
x = torch.randn(4, 3, 64, 64) # 4 color images, 64x64
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),   # 3 -> 16, keep 64x64
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 16 -> 32, halve to 32x32
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=1),                       # 1x1 channel mix, keep shape
)
# Trace shapes layer by layer.
h = x
for layer in block:
    h = layer(h)
    if isinstance(layer, nn.Conv2d):
        print(type(layer).__name__, tuple(h.shape))
# Conv2d (4, 16, 64, 64)
# Conv2d (4, 32, 32, 32)
# Conv2d (4, 32, 32, 32)
print("total params:", sum(p.numel() for p in block.parameters()))
# total params: 6144  (448 + 4640 + 1056)
Conv2d stack with shape tracing: a size-preserving layer, a stride-2 downsampling layer, and a $1 \times 1$ channel-mixing layer. The printed shapes confirm the output-size formula, and the 6,144 total parameters are independent of the $64 \times 64$ input resolution.
8. The Backward Pass: Gradients Through a Convolution Advanced
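The forward pass is only half of a layer. Training needs the gradients of the loss $L$ with respect to the kernel $W$ (to update it) and with respect to the input $x$ (to pass back to the previous layer). Both fall out of the chain rule applied to the same sum that defines the forward convolution, $y = x * W$, and both turn out to be convolutions themselves, which is why a convolutional layer is so efficient to train. Writing $\delta = \partial L / \partial y$ for the gradient arriving from above, the two gradients are, for a single channel with stride 1 and no padding:
$$ \frac{\partial L}{\partial W} \;=\; x * \delta, \qquad \frac{\partial L}{\partial x} \;=\; \operatorname{pad}_{k-1}(\delta) * \operatorname{rot180}(W). $$
Here $*$ is the same sliding-window operation as in the forward pass, $\operatorname{pad}_{k-1}$ zero-pads $\delta$ by $k - 1$ on every side, and $\operatorname{rot180}$ flips the kernel along both spatial axes.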
The weight gradient is the input cross-correlated with the upstream gradient: each kernel weight touched every spatial position on the forward pass, so its gradient accumulates $\delta$ over all of them. The input gradient is a full (zero-padded) convolution of the upstream gradient with the kernel flipped $180^\circ$, the discrete adjoint of the forward operator, which is exactly the "transposed convolution" used for learnable upsampling in Chapter 24. Both operations reuse the forward machinery, so a framework implements the backward pass with the same fast algorithms (im2col, Winograd, or whatever cuDNN selects) it uses for the forward pass. You never write either by hand, but knowing they are convolutions explains why a training step costs only a small constant factor more than inference per layer: roughly two extra convolutions of similar size on top of the forward pass.
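Both statements can be verified against autograd. The sketch below assumes PyTorch and deliberately restricts itself to one image, one input channel, one output channel, stride 1, and no padding, so that the two gradients reduce to single `F.conv2d` calls; the multi-channel case needs extra axis bookkeeping that is not shown here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 6, 6, requires_grad=True)
W = torch.randn(1, 1, 3, 3, requires_grad=True)

y = F.conv2d(x, W)                 # forward pass, output is 4x4
delta = torch.randn_like(y)        # stand-in for the upstream gradient dL/dy
y.backward(delta)                  # autograd fills x.grad and W.grad

# Weight gradient: slide delta over the input as if delta were the kernel.
grad_W = F.conv2d(x.detach(), delta)

# Input gradient: zero-pad delta by k - 1 = 2 and slide the 180-degree-flipped kernel.
W_flipped = torch.flip(W.detach(), dims=(2, 3))
grad_x = F.conv2d(delta, W_flipped, padding=2)

# The same input gradient via PyTorch's transposed convolution.
grad_x_transposed = F.conv_transpose2d(delta, W.detach())

print(torch.allclose(grad_W, W.grad, atol=1e-4))
print(torch.allclose(grad_x, x.grad, atol=1e-4))
print(torch.allclose(grad_x_transposed, x.grad, atol=1e-4))
```

All three comparisons agree with autograd, which is the sense in which the backward pass "is" a pair of convolutions.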
A naive convolution is the seven-nested-loop monster you would write from Section 3.1's definition: over batch, output channel, output row, output column, then the kernel's two axes and input channels. PyTorch's nn.Conv2d instead dispatches to an optimized backend algorithm (an im2col reshape followed by a single matrix multiply, a Winograd or FFT method, or a fused cuDNN kernel when the tensors live on a GPU). You write one line; the library handles the algorithm selection, the memory layout, mixed-precision accumulation, and the backward pass for you. The from-scratch educational version is dozens of lines and thousands of times slower; never ship it.
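For reference, here is what the loop version looks like next to the library call. This is an educational sketch assuming PyTorch, restricted to stride 1 and no padding; the function name `naive_conv2d` is made up, and the tensors are kept tiny because the loops run in pure Python.

```python
import torch
import torch.nn.functional as F

def naive_conv2d(x, w, bias):
    """Stride-1, no-padding convolution as explicit loops (educational only)."""
    n, c_in, h, width = x.shape
    c_out, _, kh, kw = w.shape
    h_out, w_out = h - kh + 1, width - kw + 1
    y = torch.zeros(n, c_out, h_out, w_out)
    for batch in range(n):                       # 1: batch
        for oc in range(c_out):                  # 2: output channel
            for i in range(h_out):               # 3: output row
                for j in range(w_out):           # 4: output column
                    acc = bias[oc].item()
                    for u in range(kh):          # 5: kernel row
                        for v in range(kw):      # 6: kernel column
                            for ic in range(c_in):   # 7: input channel
                                acc += (x[batch, ic, i + u, j + v] * w[oc, ic, u, v]).item()
                    y[batch, oc, i, j] = acc
    return y

torch.manual_seed(0)
x = torch.randn(1, 2, 5, 5)
w = torch.randn(3, 2, 3, 3)
bias = torch.randn(3)

y_naive = naive_conv2d(x, w, bias)
y_fast = F.conv2d(x, w, bias)
print(torch.allclose(y_naive, y_fast, atol=1e-4))
```

The two agree on this toy input; the difference is that the library version scales to real image sizes and comes with a backward pass.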
A 2021-2024 line of work decouples a convolution's training structure from its inference structure. RepVGG (Ding et al., CVPR 2021, arXiv:2101.03697) trains a layer as a sum of $3 \times 3$, $1 \times 1$, and identity branches, then algebraically fuses them into a single $3 \times 3$ convolution for deployment, getting multi-branch training benefits with single-branch inference speed. The large-kernel networks RepLKNet (CVPR 2022) and UniRepLKNet (CVPR 2024, arXiv:2311.15599) extend the idea to reparameterize small kernels into a large one. These methods exploit the linearity you proved for convolution back in Section 3.1: a sum of convolutions is itself a convolution, so branches can be merged exactly. The trend is to make the convolution you deploy as cheap as possible while letting training use a richer structure.
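The merging step rests on nothing more than linearity, which a short experiment can confirm. The sketch below assumes PyTorch and fuses a $3 \times 3$ branch, a $1 \times 1$ branch, and an identity branch into one $3 \times 3$ kernel. It is a simplified illustration of the idea, not the RepVGG implementation: the published method also folds a batch-normalization layer from each branch into the fused weights, which is left out here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
C = 4                                             # arbitrary channel count
conv3 = nn.Conv2d(C, C, kernel_size=3, padding=1)
conv1 = nn.Conv2d(C, C, kernel_size=1)
x = torch.randn(2, C, 8, 8)

with torch.no_grad():
    y_branches = conv3(x) + conv1(x) + x          # training-time structure: three branches

    # Fold everything into one 3x3 kernel. The 1x1 branch and the identity
    # both act only at the centre tap (row 1, column 1) of a 3x3 kernel.
    W = conv3.weight.clone()
    W[:, :, 1, 1] += conv1.weight[:, :, 0, 0]
    W[:, :, 1, 1] += torch.eye(C)                 # identity maps channel i to channel i
    b = conv3.bias + conv1.bias

    y_fused = F.conv2d(x, W, b, padding=1)        # deployment-time structure: one convolution

print(torch.allclose(y_branches, y_fused, atol=1e-4))
```

The fused layer reproduces the three-branch output to within floating-point error while running a single convolution.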
You now have the complete vocabulary of a convolutional layer: channels in and out, kernel size, stride, padding, and dilation, plus the special $1 \times 1$ and depthwise cases. The one thing the formula hints at but does not explain is how far a deep network can ultimately see. Section 19.3 answers that with the receptive field, and adds pooling, the other classic way to summarize and downsample a feature map.
An input is $1 \times 224 \times 224$. Using the output-size formula, compute the output spatial size for each layer: (a) $k=7, s=2, p=3$; (b) $k=3, s=1, p=1, d=2$ (dilation 2); (c) $k=5, s=2, p=0$. For each, also state the number of trainable weights if the layer maps to 64 output channels. Then explain why the dilated layer (b) has the same parameter count as a plain $3 \times 3$ layer despite covering a larger area.
Build two PyTorch modules that both map a $32 \times 64 \times 64$ input to $64$ channels at the same resolution: (a) a single standard $3 \times 3$ Conv2d, and (b) a depthwise-separable pair (a $3 \times 3$ depthwise Conv2d with groups=32 from 32 to 32 channels, then a $1 \times 1$ Conv2d from 32 to 64). Print the parameter count of each with sum(p.numel() for p in m.parameters()) and report the ratio. Verify both produce the same output shape.
A colleague's network fails with a runtime error: a layer expects an input of $16 \times 16$ but receives $15 \times 15$. They are using a $3 \times 3$ convolution with stride 2 and padding 0 on a $32 \times 32$ input earlier in the stack. Use the output-size formula to find where the off-by-one entered, and state the smallest single change (a padding value) that makes the chain produce clean powers-of-two resolutions. Explain why "same" padding with even-strided convolutions requires care.