Part III: Deep Learning for Computer Vision
Chapter 19: Convolutional Neural Networks

Convolution Layers: Channels, Stride, Padding & Dilation

"I take in 64 channels and put out 128. People think that means I doubled something. What I actually did was learn 128 different opinions about the same patch, each one a weighted vote over all 64 inputs."

A Deeply Multi-Channel Convolutional Layer
Big Picture

A real convolutional layer is the single-filter operation of Section 19.1 extended in three directions: many input channels per filter, many filters per layer, and a set of geometric controls (stride, padding, and dilation) that decide the output's spatial size and reach. Master the tensor shapes and the one output-size formula in this section and you can read any architecture table in the rest of the book, because every convolutional layer you will meet is fully described by a handful of these numbers.

In Section 19.1 a convolution was one $k \times k$ kernel sliding over one grayscale image. That is enough to argue for the inductive bias, but no useful network looks like that. Color images have three input channels; intermediate feature maps have dozens or hundreds. A layer learns many filters at once, not one. And you constantly need to control the spatial resolution, shrinking it to summarize, holding it to preserve detail, or expanding a filter's reach without paying for more weights. This section adds all of that, ending at the exact PyTorch Conv2d arguments and the shapes that flow through them, the foundation for the network you will train in Section 19.5.

1. Channels: A Filter Spans the Full Input Depth Beginner

The first generalization is depth. A color image is not one $H \times W$ grid but three, stacked: red, green, and blue. In tensor terms it is a $3 \times H \times W$ array, where 3 is the number of input channels. A convolutional filter does not slide three separate $k \times k$ kernels; it slides one $k \times k \times 3$ kernel that reaches across all input channels at once. At each spatial position it multiplies its $k \times k \times 3$ weights against the $k \times k \times 3$ patch beneath it, sums every product into a single number, and that number is one pixel of one output channel. Crucially, the depth of a filter always equals the number of input channels, so you never specify it directly; it is inferred.

A layer learns many such filters, and the count of filters is the number of output channels. Each filter produces its own $H' \times W'$ output map, and these maps stack to form the layer's output of shape $C_{\text{out}} \times H' \times W'$. So a layer mapping a $3 \times 32 \times 32$ color image to a $16 \times 32 \times 32$ feature volume holds $16$ filters, each $3 \times 3 \times 3$, for $16 \times 3 \times 3 \times 3 = 432$ weights plus $16$ biases. Figure 19.2.1 shows the full three-axis structure: depth of each filter set by the input, count of filters set by the desired output channels.

[Figure 19.2.1 diagram: an input volume of $3 \times 32 \times 32$ passes through 16 filters, each $3 \times 3 \times 3$ (every filter reads all 3 channels), to give an output volume of $16 \times 32 \times 32$ (16 stacked feature maps).]
Figure 19.2.1 The three axes of a convolutional layer. The input has 3 channels; each of the 16 filters is therefore $3 \times 3 \times 3$ and reads across all input channels at once, collapsing each patch to a single number. The 16 filters yield 16 output channels. The filter's depth is dictated by the input; the number of filters is the layer's design choice.

This is the shape you decoded at the end of Section 3.1, where the PyTorch weight tensor was (out_channels, in_channels, kH, kW). Now you know what each axis means: the layer holds out_channels filters, each of depth in_channels, each a $kH \times kW$ spatial grid.
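As a quick check of that reading, the sketch below recomputes one output pixel by hand: it slices the $3 \times 3 \times 3$ patch under one filter, multiplies it by that filter's weights, sums, and adds the bias. The filter index `f` and the position `(i, j)` are arbitrary choices made for illustration, not values from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)

# (out_channels, in_channels, kH, kW)
print(tuple(conv.weight.shape))   # (16, 3, 3, 3)

# Recompute one output pixel by hand (f, i, j are arbitrary illustrative choices).
f, i, j = 5, 10, 20
x_padded = F.pad(x, (1, 1, 1, 1))           # the same zero border the layer adds
patch = x_padded[0, :, i:i + 3, j:j + 3]    # 3 x 3 x 3 patch spanning all channels
by_hand = (patch * conv.weight[f]).sum() + conv.bias[f]

y = conv(x)
print(torch.allclose(by_hand, y[0, f, i, j], atol=1e-5))
```

The second print should confirm that the layer's output pixel and the hand-computed weighted sum agree up to floating-point rounding.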

Fun Note: It Does Not Double Anything

The most persistent beginner misreading of Conv2d(64, 128, ...) is "it doubles the channels," as if the layer copied something. It copies nothing. It learns 128 fresh opinions about every patch, each opinion a weighted vote across all 64 inputs, and then files those 128 verdicts as the new channels. The number 128 is a design choice about how many distinct things this layer is allowed to notice, not a transformation of the 64. The memory hook: output channels are opinions, not copies.

2. The Output-Size Formula Beginner

The spatial output size is governed by one formula you will use constantly. For an input of spatial size $H$ along one axis, a kernel of size $k$, padding $p$ on each side, stride $s$, and dilation $d$, the output size is:

$$ H_{\text{out}} \;=\; \left\lfloor \frac{H + 2p - d\,(k - 1) - 1}{s} \right\rfloor + 1. $$

Each term has a plain meaning. The kernel must fit entirely inside the input, so it cannot center on pixels too close to the border, and the output loses $k - 1$ positions along each axis (the $d(k-1)$ form generalizes this to dilation, below). Padding $p$ adds $p$ rows or columns of border on each side, $2p$ in total, often chosen precisely to cancel that loss. Stride $s$ keeps only every $s$-th output position, dividing the count by $s$. The floor handles the case where the kernel does not fit a whole number of times. We will dissect stride, padding, and dilation one at a time, but commit the formula to memory; it predicts nearly every shape mismatch you will debug.
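The formula is short enough to keep as a helper. The sketch below is one way to write it in plain Python; the function name `conv_out_size` is a hypothetical choice, and integer floor division stands in for the floor brackets.

```python
def conv_out_size(h, k, s=1, p=0, d=1):
    """Output size along one axis: input h, kernel k, stride s, padding p, dilation d."""
    return (h + 2 * p - d * (k - 1) - 1) // s + 1

print(conv_out_size(32, k=3, s=1, p=1))        # 32  (size-preserving)
print(conv_out_size(32, k=3, s=1, p=0))        # 30  (loses k - 1 = 2)
print(conv_out_size(32, k=3, s=2, p=1))        # 16  (stride 2 halves)
print(conv_out_size(32, k=3, s=1, p=0, d=2))   # 28  (dilated footprint is 5 wide)
```

Calling it once per spatial axis gives the full output shape, which is handy when a stack of layers produces an unexpected size.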

3. Stride: Downsampling Inside the Convolution Beginner

Stride is the step size of the sliding window. A stride of 1 visits every pixel; a stride of 2 visits every other pixel, halving the output resolution along each axis. Striding is the convolution's built-in way to downsample: instead of computing a full-resolution map and then shrinking it, you skip positions during the convolution itself, which is cheaper and is the standard way modern architectures reduce spatial size (older networks used pooling, covered in Section 19.3, but strided convolution has largely taken over). With $k = 3$, $p = 1$, $d = 1$, a stride of $s = 2$ on a $32 \times 32$ input gives $\lfloor (32 + 2 - 2 - 1)/2 \rfloor + 1 = 16$, an exact halving.
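The "skip positions instead of shrinking afterwards" claim can be checked numerically. The sketch below, using arbitrary random weights, compares a stride-2 convolution against a stride-1 convolution whose output is subsampled at every other position.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 32, 32, dtype=torch.float64)
w = torch.randn(16, 3, 3, 3, dtype=torch.float64)   # arbitrary illustrative weights

full = F.conv2d(x, w, stride=1, padding=1)      # every position: 32 x 32
strided = F.conv2d(x, w, stride=2, padding=1)   # every other position: 16 x 16

print(tuple(full.shape[-2:]), tuple(strided.shape[-2:]))   # (32, 32) (16, 16)
print(torch.allclose(strided, full[:, :, ::2, ::2]))
```

The two results match, but the strided version evaluates only a quarter of the output positions, which is where the saving comes from.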

4. Padding: Controlling Shape and Borders Beginner

Padding adds a border of extra values around the input so the kernel can center on the edge pixels. Without it, every convolution shrinks the map by $k - 1$, and a deep stack would erode a $32 \times 32$ image to nothing after sixteen $3 \times 3$ layers. The common choice is "same" padding, $p = (k-1)/2$ for odd $k$ and stride 1, which makes the output exactly the input size: for $k = 3$ use $p = 1$, for $k = 5$ use $p = 2$. The values placed in the border are usually zeros, though PyTorch also offers reflect and replicate modes, exactly the border strategies you studied in Section 3.6. Zero padding is the default and is rarely worth changing, but it does introduce a faint artifact: pixels near the border see artificial zeros, so the network can in principle read absolute position from the border, partially breaking the position-independence that Section 19.1 prized.

The code below ties the formula to the API. It builds three layers that differ only in stride and padding, and prints the resulting shapes so you can match each to the formula.

import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)        # (batch=8, channels=3, H=32, W=32)

same   = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)  # keep 32x32
shrink = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=0)  # lose 2 -> 30
halve  = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)  # downsample -> 16

print(same(x).shape)    # torch.Size([8, 16, 32, 32])
print(shrink(x).shape)  # torch.Size([8, 16, 30, 30])
print(halve(x).shape)   # torch.Size([8, 16, 16, 16])

# Trainable parameters of one layer: out*in*kH*kW weights + out biases.
print(sum(p.numel() for p in same.parameters()))  # 16*3*3*3 + 16 = 448
Code Fragment 1: Stride and padding control output shape: padding 1 keeps a $3 \times 3$ convolution size-preserving, padding 0 shrinks by 2, and stride 2 halves resolution. The parameter count (448) depends only on channels and kernel size, never on image size.
Key Insight: Parameters Are Decoupled from Resolution

The 448 parameters in that layer (432 weights plus 16 biases) do not change whether the input is $32 \times 32$ or $4096 \times 4096$. This is the practical face of weight sharing from Section 19.1: the layer's cost in memory for weights is fixed by channels and kernel size, while its cost in compute scales with the number of output positions. The two costs are independent, which is why you can fine-tune an ImageNet model on higher-resolution images without changing the parameter count of its convolutional layers; only the compute and the activation memory grow.
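A back-of-the-envelope calculation makes the decoupling concrete. The helper below is a hypothetical sketch (the name `conv_cost` is not from any library) that counts parameters and multiply-accumulates for the size-preserving $3 \to 16$ layer at two resolutions.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Return (parameters, multiply-accumulates per image) for one conv layer."""
    params = c_out * c_in * k * k + c_out          # weights + biases
    macs = c_out * c_in * k * k * h_out * w_out    # one kernel application per output value
    return params, macs

for side in (32, 4096):
    params, macs = conv_cost(3, 16, 3, side, side)
    print(side, params, macs)
# 32 448 442368
# 4096 448 7247757312
```

The parameter column stays at 448 while the compute column grows by a factor of $128^2 = 16{,}384$, the ratio of output positions.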

5. Dilation: Reach Without Cost Intermediate

Dilation spreads the kernel's sampling points apart, inserting gaps between the weights so a $3 \times 3$ kernel covers a $5 \times 5$ region (dilation 2) or larger, while still using only nine weights. The output-size formula's $d(k-1)$ term captures the enlarged footprint. Dilation buys a larger receptive field (the subject of Section 19.3) at no extra parameters and no extra compute, which is why it became central to semantic segmentation, where you need to see broad context but cannot afford to lose resolution by striding. Figure 19.2.2 contrasts a standard $3 \times 3$ kernel with its dilated counterpart.

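[Figure 19.2.2 diagram, two panels. Left: standard $3 \times 3$ kernel (dilation 1), 9 weights, $3 \times 3$ footprint. Right: dilated $3 \times 3$ kernel (dilation 2), still 9 weights, $5 \times 5$ footprint.]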
Figure 19.2.2 Dilation trades nothing for reach. The standard kernel (blue) samples a contiguous $3 \times 3$ block; the dilation-2 kernel (red) samples the same nine weights spread over a $5 \times 5$ region, widening its footprint from 3 to 5 pixels per side with no extra parameters and no extra multiplications. Gaps between sampled points are the cost: fine detail between them is skipped.
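One way to see that a dilated kernel is "the same nine weights spread out" is to build the equivalent $5 \times 5$ kernel explicitly, with zeros in the gaps, and compare outputs. The sketch below does this for a single channel with arbitrary random weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16, dtype=torch.float64)
w3 = torch.randn(1, 1, 3, 3, dtype=torch.float64)   # nine illustrative weights

# Place the nine weights at rows/columns 0, 2, 4 of a 5x5 kernel; gaps stay zero.
w5 = torch.zeros(1, 1, 5, 5, dtype=torch.float64)
w5[:, :, ::2, ::2] = w3

dilated = F.conv2d(x, w3, dilation=2, padding=2)   # 9 multiplies per output pixel
sparse = F.conv2d(x, w5, padding=2)                # 25 multiplies, 16 of them by zero

print(tuple(dilated.shape), torch.allclose(dilated, sparse))
```

The outputs agree, which is the sense in which dilation gives a $5 \times 5$ footprint while storing and multiplying only nine weights.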
Key Insight: Stack Three Dilated Layers, See a 15x15 Window for the Price of a 3x3

Here is the number that explains why dilation took over segmentation. To make a plain stack of $3 \times 3$ convolutions cover a $15 \times 15$ window, you need seven layers, because each ordinary $3 \times 3$ layer adds only $2$ pixels of reach. Now dilate them instead: layers at dilation $1$, $2$, then $4$ add $2$, $4$, and $8$ pixels of reach, covering a $15 \times 15$ window in just three layers, still nine weights each, still nine multiplies per output pixel. Each layer's reach grows linearly with its dilation rate, so when the rate doubles from layer to layer the total reach grows geometrically with depth while parameters and compute per layer stay flat. That is exactly what a segmentation network wants: it must see broad context around every pixel yet cannot afford to throw away resolution by striding. The cost is the gaps you saw in Figure 19.2.2, so production designs (the DeepLab atrous-spatial-pyramid family) mix several dilation rates in parallel to fill them in.
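The window sizes quoted above follow from a one-line recurrence: each stride-1 layer adds $d\,(k-1)$ to the receptive field. The helper below is a minimal sketch of that bookkeeping, assuming stride 1 throughout; the name `receptive_field` is a hypothetical choice.

```python
def receptive_field(dilations, k=3):
    """Side length of the input window seen by a stack of stride-1 k x k convolutions."""
    rf = 1
    for d in dilations:
        rf += d * (k - 1)   # each layer extends the window by d * (k - 1)
    return rf

print(receptive_field([1] * 7))      # 15  (seven plain 3x3 layers)
print(receptive_field([1, 2, 4]))    # 15  (three layers with doubling dilation)
```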

6. Two Special Cases: 1x1 and Depthwise Convolution Intermediate

Two degenerate-looking convolutions are workhorses of modern design. The $1 \times 1$ convolution uses a single-pixel kernel, so it does no spatial mixing at all; it is a per-pixel linear combination across channels, exactly a small fully connected layer applied independently at every location. It is the cheapest way to change the channel count or mix information across channels and, when followed by an activation, to add nonlinearity between spatial layers; it is the backbone of the bottleneck blocks in Chapter 20. The depthwise convolution goes the other way: it convolves each input channel with its own single-channel kernel and does no cross-channel mixing, set in PyTorch by groups=in_channels (with out_channels equal to, or a multiple of, in_channels). The groups argument splits the channels into independent bundles that never see each other; setting it equal to the channel count puts every channel in its own bundle, which is exactly "one private kernel per channel." Pairing a depthwise convolution (spatial mixing, cheap) with a $1 \times 1$ convolution (channel mixing, cheap) is the depthwise-separable convolution that makes MobileNet efficient enough for phones, a topic that returns in Chapter 28.
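Both special cases can be probed directly. The sketch below, on a small random input with illustrative channel counts, first reproduces a $1 \times 1$ convolution as a plain matrix product over channels, then checks that a depthwise layer keeps channels isolated: perturbing input channel 0 should change output channel 0 only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 8, 5, 5)

# 1x1 convolution: the same (out x in) linear map applied at every pixel.
pointwise = nn.Conv2d(8, 4, kernel_size=1)
w = pointwise.weight.view(4, 8)
by_hand = torch.einsum("oc,nchw->nohw", w, x) + pointwise.bias.view(1, 4, 1, 1)
print(torch.allclose(pointwise(x), by_hand, atol=1e-5))

# Depthwise convolution: one private 3x3 kernel per channel.
depthwise = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=8)
print(tuple(depthwise.weight.shape))   # (8, 1, 3, 3)

# Perturb input channel 0 and see which output channels react.
x_shifted = x.clone()
x_shifted[:, 0] += 1.0
with torch.no_grad():
    diff = (depthwise(x_shifted) - depthwise(x)).abs().amax(dim=(0, 2, 3))
changed = diff > 1e-6
print(changed.tolist())
```

Note the depthwise weight shape: each filter has depth 1 rather than 8, which is what `groups=8` means in practice.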

Practical Example: Shrinking a Backbone for the Doorbell

Who: An embedded-vision team shipping a person-detection model on a battery-powered smart doorbell with a tiny microcontroller-class accelerator.

Situation: Their accurate prototype used ordinary $3 \times 3$ convolutions throughout. It ran at 2 frames per second and drained the battery in a day, both non-starters.

Problem: The standard $3 \times 3$ convolutions dominated the compute budget. A layer mapping 64 channels to 64 with a $3 \times 3$ kernel costs $64 \times 64 \times 3 \times 3$ multiplications per output pixel, and there were dozens of such layers.

Decision: Replace every standard convolution with a depthwise-separable pair: a $3 \times 3$ depthwise convolution (spatial, $64 \times 3 \times 3$ per pixel) followed by a $1 \times 1$ convolution (channel mixing, $64 \times 64$ per pixel). The combined cost is $64 \times 9 + 64 \times 64 = 4{,}672$ multiplications per output pixel versus $64 \times 64 \times 9 = 36{,}864$, a roughly eightfold reduction.

Result: Inference rose to 15 frames per second and battery life to roughly a week, with a 1.5-point accuracy drop that augmentation and a slightly wider network recovered. The product shipped on the original hardware.

Lesson: Factoring one convolution into a spatial part and a channel part costs little in accuracy and saves a great deal of compute, which is why so many efficient architectures since MobileNet (2017) are built from these two special cases rather than from plain $3 \times 3$ convolutions.
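The per-pixel arithmetic in the Decision step can be checked in a few lines. The two helper functions below are hypothetical names used only for this sketch; they count multiplications per output pixel and ignore biases and activations.

```python
def standard_macs(c_in, c_out, k):
    """Multiplies per output pixel for a standard k x k convolution."""
    return c_in * c_out * k * k

def separable_macs(c_in, c_out, k):
    """Multiplies per output pixel for a depthwise k x k plus pointwise 1x1 pair."""
    return c_in * k * k + c_in * c_out

std = standard_macs(64, 64, 3)
sep = separable_macs(64, 64, 3)
print(std, sep, round(std / sep, 2))   # 36864 4672 7.89
```

In general the ratio is $1 / (1/C_{\text{out}} + 1/k^2)$, so for $3 \times 3$ kernels it approaches 9 as the output channel count grows and sits a little under 8 at 64 channels.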

You Could Build This: A Webcam Gesture Switch

With stride, $1 \times 1$, and depthwise-separable convolutions in hand, you can build a tiny real-time classifier that turns a hand sign in front of your webcam (open palm, fist, thumbs-up) into an action on your machine, all running live on a laptop CPU. Capture a few hundred frames per gesture with OpenCV, stack five or six depthwise-separable blocks ending in global average pooling, train for a few minutes, then read frames in a loop and trigger a key press on the top prediction. Difficulty: intermediate, about two to three hours. The portfolio hook is that it forces the cost reasoning of this section: the doorbell example's roughly eightfold multiply reduction is exactly what keeps the loop above 20 frames per second on a CPU with no GPU. Stretch it by adding a "no gesture" class so the switch stays quiet when your hands are at the keyboard, the same background-rejection problem real on-device detectors must solve.
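As a starting point for that project, here is a minimal sketch of the model only. Everything specific in it is an assumption made for illustration: the names `ds_block` and `TinyGestureNet`, the channel widths, the $96 \times 96$ input size, and the three gesture classes. Normalization layers are left out to stay within this section's vocabulary; a practical version would likely add them.

```python
import torch
import torch.nn as nn

def ds_block(c_in, c_out, stride):
    """Depthwise 3x3 (spatial mixing) followed by pointwise 1x1 (channel mixing)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1),
        nn.ReLU(inplace=True),
    )

class TinyGestureNet(nn.Module):   # hypothetical name and layer widths
    def __init__(self, num_classes=3):
        super().__init__()
        # An ordinary strided convolution lifts RGB to 16 channels.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True)
        )
        # Five depthwise-separable blocks; stride 2 halves the resolution.
        self.blocks = nn.Sequential(
            ds_block(16, 32, 2), ds_block(32, 32, 1),
            ds_block(32, 64, 2), ds_block(64, 64, 1),
            ds_block(64, 128, 2),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.blocks(self.stem(x))
        x = x.mean(dim=(2, 3))      # global average pooling over H and W
        return self.head(x)

model = TinyGestureNet()
logits = model(torch.randn(1, 3, 96, 96))   # one fake 96x96 webcam frame
print(tuple(logits.shape))   # (1, 3)
```

Capture, training, and the key-press loop are deliberately omitted; the point is only that the whole backbone is built from the strided, depthwise, and $1 \times 1$ layers of this section.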

7. The PyTorch Conv2d API, End to End Intermediate

Everything in this section is one constructor and one forward call. The snippet below builds a three-convolution stack that takes a batch of color images and produces a 32-channel feature volume at half resolution, exercising channels, stride, padding, and a $1 \times 1$ projection, and it prints the shape after each convolution so you can verify the formula at every step.

import torch
import torch.nn as nn

x = torch.randn(4, 3, 64, 64)        # 4 color images, 64x64

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),   # 3 -> 16, keep 64x64
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 16 -> 32, halve to 32x32
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=1),                       # 1x1 channel mix, keep shape
)

# Trace shapes layer by layer.
h = x
for layer in block:
    h = layer(h)
    if isinstance(layer, nn.Conv2d):
        print(type(layer).__name__, tuple(h.shape))
# Conv2d (4, 16, 64, 64)
# Conv2d (4, 32, 32, 32)
# Conv2d (4, 32, 32, 32)

print("total params:", sum(p.numel() for p in block.parameters()))
# total params: 6144   (448 + 4640 + 1056)
Code Fragment 2: A complete Conv2d stack with shape tracing: a size-preserving layer, a stride-2 downsampling layer, and a $1 \times 1$ channel-mixing layer. The printed shapes confirm the output-size formula, and the 6,144 total parameters are independent of the $64 \times 64$ input resolution.

8. The Backward Pass: Gradients Through a Convolution Advanced

The forward pass is only half of a layer. Training needs the gradients of the loss $L$ with respect to the kernel $W$ (to update it) and with respect to the input $x$ (to pass back to the previous layer). Both fall out of the chain rule applied to the same sum that defines the forward convolution, $y = x * W$, and both turn out to be convolutions themselves, which is why a convolutional layer is so efficient to train. Writing $\delta = \partial L / \partial y$ for the gradient arriving from above, the two gradients are:

$$\frac{\partial L}{\partial W} = x * \delta, \qquad \frac{\partial L}{\partial x} = \delta *_{\text{full}} \operatorname{rot180}(W).$$

The weight gradient is the input cross-correlated with the upstream gradient: each kernel weight touched every spatial position on the forward pass, so its gradient accumulates $\delta$ over all of them. The input gradient is a full (zero-padded) convolution of the upstream gradient with the kernel flipped $180^\circ$, the discrete adjoint of the forward operator, which is exactly the "transposed convolution" used for learnable upsampling in Chapter 24. Both operations reuse the forward machinery, so a framework implements the backward pass with the same kind of fast kernels (im2col or Winograd routines, typically through cuDNN) it uses for the forward pass. You never write either by hand, but knowing they are convolutions explains the cost of training: the backward pass runs two convolutions of roughly the forward pass's size, so a training step through a layer costs about three times its inference cost.
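Both identities can be verified against autograd in the simplest setting. The sketch below assumes one image, one input channel, one output channel, stride 1, and no padding, and uses the loss $L = \sum y \cdot \delta$ so that $\partial L / \partial y$ is exactly the chosen $\delta$. As in the rest of the chapter, `F.conv2d` computes cross-correlation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 6, 6, dtype=torch.float64, requires_grad=True)
W = torch.randn(1, 1, 3, 3, dtype=torch.float64, requires_grad=True)
delta = torch.randn(1, 1, 4, 4, dtype=torch.float64)   # stand-in for dL/dy

y = F.conv2d(x, W)              # forward pass: 6x6 -> 4x4
(y * delta).sum().backward()    # autograd fills x.grad and W.grad

# dL/dW: slide delta over x (cross-correlation), giving a 3x3 result.
grad_W = F.conv2d(x.detach(), delta)

# dL/dx: pad delta by k - 1 = 2 on every side ("full") and slide the
# 180-degree-rotated kernel over it, giving a 6x6 result.
W_rot = torch.flip(W.detach(), dims=[2, 3])
grad_x = F.conv2d(delta, W_rot, padding=2)

print(torch.allclose(grad_W, W.grad), torch.allclose(grad_x, x.grad))
```

With several channels or a batch, the same identities hold after summing over the batch and swapping the channel axes appropriately, which is the bookkeeping the framework does for you.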

Library Shortcut: The im2col Trick You Do Not Have to Write

A naive convolution is the seven-nested-loop monster you would write from Section 3.1's definition: over batch, output channel, output row, output column, then the kernel's two axes and input channels. PyTorch's nn.Conv2d instead dispatches to an optimized backend routine (im2col followed by a single matrix multiply, a Winograd or FFT algorithm, or a fused cuDNN kernel) on the CPU or GPU. You write one line; the library handles the algorithm selection, the memory layout, mixed-precision accumulation, and the backward pass for free. The from-scratch educational version is dozens of lines and orders of magnitude slower; never ship it.
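For reference, this is roughly what the loop version looks like. It is an educational sketch assuming stride 1, no padding, and no dilation; the function name `naive_conv2d` is hypothetical, and the tensor sizes are kept tiny because the loops run in pure Python.

```python
import torch
import torch.nn.functional as F

def naive_conv2d(x, w, b):
    """Stride-1, unpadded convolution written as seven nested loops."""
    n_batch, c_in, h, w_in = x.shape
    c_out, _, kh, kw = w.shape
    h_out, w_out = h - kh + 1, w_in - kw + 1
    y = torch.zeros(n_batch, c_out, h_out, w_out, dtype=x.dtype)
    for n in range(n_batch):                      # 1: batch
        for co in range(c_out):                   # 2: output channel
            for i in range(h_out):                # 3: output row
                for j in range(w_out):            # 4: output column
                    acc = b[co].item()
                    for ci in range(c_in):        # 5: input channel
                        for a in range(kh):       # 6: kernel row
                            for c in range(kw):   # 7: kernel column
                                acc += (x[n, ci, i + a, j + c] * w[co, ci, a, c]).item()
                    y[n, co, i, j] = acc
    return y

torch.manual_seed(0)
x = torch.randn(1, 2, 5, 5, dtype=torch.float64)
w = torch.randn(3, 2, 3, 3, dtype=torch.float64)
b = torch.randn(3, dtype=torch.float64)

slow = naive_conv2d(x, w, b)
fast = F.conv2d(x, w, b)
print(tuple(slow.shape), torch.allclose(slow, fast))
```

Even at this toy size the loop body runs 486 times; the library call produces the same numbers in a single optimized routine.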

Research Frontier: Reparameterizing Convolutions

A 2021-2024 line of work decouples a convolution's training structure from its inference structure. RepVGG (Ding et al., CVPR 2021, arXiv:2101.03697) trains a layer as a sum of $3 \times 3$, $1 \times 1$, and identity branches, then algebraically fuses them into a single $3 \times 3$ convolution for deployment, getting multi-branch training benefits with single-branch inference speed. The large-kernel networks RepLKNet (CVPR 2022) and UniRepLKNet (CVPR 2024, arXiv:2311.15599) extend the idea to reparameterize small kernels into a large one. These methods exploit the linearity you proved for convolution back in Section 3.1: a sum of convolutions is itself a convolution, so branches can be merged exactly. The trend is to make the convolution you deploy as cheap as possible while letting training use a richer structure.
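The exact-merging claim is easy to test. The sketch below fuses a $3 \times 3$ branch, a $1 \times 1$ branch, and an identity branch into one $3 \times 3$ kernel by zero-padding the $1 \times 1$ weights and writing the identity as a kernel with a single 1 at the center. It is a simplification rather than RepVGG's full procedure: it omits biases and the batch-normalization layers that the paper folds into each branch before fusing, and the channel count is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = 4   # illustrative channel count; the identity branch needs C_in == C_out
x = torch.randn(2, C, 8, 8, dtype=torch.float64)
w3 = torch.randn(C, C, 3, 3, dtype=torch.float64)
w1 = torch.randn(C, C, 1, 1, dtype=torch.float64)

# Training-time structure: three parallel branches, summed.
y_branches = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1) + x

# Deployment-time structure: a single fused 3x3 kernel.
w1_as_3x3 = F.pad(w1, (1, 1, 1, 1))          # 1x1 weight sits at the kernel center
w_identity = torch.zeros(C, C, 3, 3, dtype=torch.float64)
for c in range(C):
    w_identity[c, c, 1, 1] = 1.0             # passes channel c through unchanged
w_fused = w3 + w1_as_3x3 + w_identity
y_fused = F.conv2d(x, w_fused, padding=1)

print(torch.allclose(y_branches, y_fused))
```

The fused layer has the parameter count and latency of one plain $3 \times 3$ convolution, which is the deployment benefit the reparameterization work is after.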

You now have the complete vocabulary of a convolutional layer: channels in and out, kernel size, stride, padding, and dilation, plus the special $1 \times 1$ and depthwise cases. The one thing the formula hints at but does not explain is how far a deep network can ultimately see. Section 19.3 answers that with the receptive field, and adds pooling, the other classic way to summarize and downsample a feature map.

Exercise 19.2.1: Apply the Formula Conceptual

An input is $1 \times 224 \times 224$. Using the output-size formula, compute the output spatial size for each layer: (a) $k=7, s=2, p=3$; (b) $k=3, s=1, p=1, d=2$ (dilation 2); (c) $k=5, s=2, p=0$. For each, also state the number of trainable weights if the layer maps to 64 output channels. Then explain why the dilated layer (b) has the same parameter count as a plain $3 \times 3$ layer despite covering a larger area.

Exercise 19.2.2: Count the Cost of Separability Coding

Build two PyTorch modules that both map a $32 \times 64 \times 64$ input to $64$ channels at the same resolution: (a) a single standard $3 \times 3$ Conv2d, and (b) a depthwise-separable pair (a $3 \times 3$ depthwise Conv2d with groups=32 from 32 to 32 channels, then a $1 \times 1$ Conv2d from 32 to 64). Print the parameter count of each with sum(p.numel() for p in m.parameters()) and report the ratio. Verify both produce the same output shape.

Exercise 19.2.3: Diagnose the Shape Mismatch Analysis

A colleague's network fails with a runtime error: a layer expects an input of $16 \times 16$ but receives $15 \times 15$. They are using a $3 \times 3$ convolution with stride 2 and padding 0 on a $32 \times 32$ input earlier in the stack. Use the output-size formula to find where the off-by-one entered, and state the smallest single change (a padding value) that makes the chain produce clean powers-of-two resolutions. Explain why "same" padding with even-strided convolutions requires care.