Part III: Deep Learning for Computer Vision
Chapter 19: Convolutional Neural Networks

A CNN from Scratch: CIFAR-10 End to End

"They gave me sixty thousand tiny pictures and ten boxes to sort them into. I cried for three epochs, found the edges, found the wings, and by epoch thirty I knew a frog from a truck. Mostly."

A Recently Converged CIFAR-10 Network
Big Picture

This section assembles every idea in the chapter (learnable convolution, pooling, batch normalization) into one complete network and trains it end to end on CIFAR-10, reaching roughly 85 percent test accuracy in a few minutes on a single GPU. The conv-BN-ReLU block is the atom; the training loop, data augmentation, optimizer, and learning-rate schedule are the machinery that turns a randomly initialized stack into a working classifier. Nothing here is a toy abstraction; this is the actual code you would run, and the same skeleton scales to the architectures of Chapter 20.

The four preceding sections built the parts. Section 19.1 argued for convolution, Section 19.2 gave the layer, Section 19.3 the receptive field and pooling, Section 19.4 the normalization that makes depth trainable. This section spends them. We train on CIFAR-10, sixty thousand $32 \times 32$ color images in ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), the standard small benchmark for prototyping CNNs. The training loop is the same one introduced in Chapter 18; the new content is the convolutional architecture and the practical recipe that makes it generalize.

1. The Data: Loading and Augmenting CIFAR-10 (Beginner)

torchvision provides CIFAR-10 as a downloadable dataset, and the transforms pipeline handles normalization and augmentation. Two ideas from earlier in the book appear here. First, we normalize each channel by its dataset mean and standard deviation, the per-channel statistics whose computation traces back to the histograms of Chapter 2; this centers the input so the first layer (and its batch norm) starts well-conditioned. Second, we augment the training set with random crops and horizontal flips, the geometric transforms of Chapter 5 repurposed as a regularizer, and we deliberately apply augmentation only to the training split, never to the test split.

Why does a random shift or mirror regularize? Each transform produces a fresh image that keeps the same label, so the network sees a cat that is two pixels left, then mirrored, then cropped differently every epoch and can never memorize one exact pixel arrangement; it is forced instead toward features that survive these nuisances, which is precisely the position and orientation robustness a real classifier needs. The test split is left untouched because augmentation is a training-time device to expand the effective dataset, not a property of the input you want to evaluate on, so reporting accuracy on clean images measures what the model will actually face at deployment.
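To make the random-shift argument concrete, here is a framework-free sketch of pad-then-crop plus mirroring on a tiny 4x4 "image". The helper names pad_crop_flip and count_views are hypothetical (they are not torchvision functions), and the sketch assumes the crop offset is drawn uniformly from every position that fits inside the zero-padded image, which is how a padded random crop is usually described.

```python
import random

def pad_crop_flip(img, pad, rng):
    """Zero-pad a square 2D image, crop back to the original size at a random
    offset, then mirror left-right with probability 0.5."""
    n = len(img)
    width = n + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in img]
    padded += [[0] * width for _ in range(pad)]
    top = rng.randint(0, 2 * pad)    # inclusive on both ends
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + n] for row in padded[top:top + n]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop

def count_views(pad):
    """Number of (offset, flip) combinations for a given padding."""
    offsets = 2 * pad + 1
    return offsets * offsets * 2

rng = random.Random(0)
img = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]   # pixels 1..16
views = {tuple(map(tuple, pad_crop_flip(img, 1, rng))) for _ in range(2000)}

print(len(views), count_views(1))   # 18 18
print(count_views(4))               # 162 views for a 32x32 image with padding=4
```

Under these assumptions a single CIFAR-10 image with padding 4 can appear in 9 x 9 x 2 = 162 geometric variants, all carrying the same label, which is why the network rarely sees the exact same pixel arrangement twice.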

import torch
import torchvision
import torchvision.transforms as T

# CIFAR-10 per-channel mean and std (precomputed over the training set).
MEAN, STD = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),          # random shift: translation augmentation
    T.RandomHorizontalFlip(),             # mirror left-right (cats face both ways)
    T.ToTensor(),                         # [0,255] HWC uint8 -> [0,1] CHW float
    T.Normalize(MEAN, STD),               # center and scale per channel
])
test_tf = T.Compose([T.ToTensor(), T.Normalize(MEAN, STD)])  # NO augmentation

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,  download=True, transform=train_tf)
test_set  = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=test_tf)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True,  num_workers=2)
test_loader  = torch.utils.data.DataLoader(test_set,  batch_size=256, shuffle=False, num_workers=2)
print(len(train_set), len(test_set))   # Expected output: 50000 10000
Code Fragment 1: The CIFAR-10 data pipeline: per-channel normalization on both splits, plus random crop and horizontal flip on the training split only. Augmentation never touches the test set, so the reported accuracy reflects clean images.
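The MEAN and STD constants in Code Fragment 1 are given as precomputed. As a rough illustration of where such numbers come from, the sketch below computes per-channel statistics over a toy set of pixels and then applies the same (x - mean) / std rule that T.Normalize applies. The toy data and the helper names channel_stats and normalize are made up for this example; on the real dataset the same computation would run over every pixel of the 50,000 training images.

```python
import statistics

def channel_stats(images):
    """images: list of images, each a list of (r, g, b) pixels scaled to [0, 1].
    Returns per-channel mean and population std over every pixel of every image."""
    means, stds = [], []
    for c in range(3):
        values = [px[c] for img in images for px in img]
        means.append(statistics.fmean(values))
        stds.append(statistics.pstdev(values))
    return means, stds

def normalize(img, means, stds):
    """Center and scale each channel, mirroring what T.Normalize does."""
    return [tuple((px[c] - means[c]) / stds[c] for c in range(3)) for px in img]

# Two toy "images" of two pixels each (hypothetical values).
toy = [
    [(0.2, 0.4, 0.6), (0.4, 0.4, 0.2)],
    [(0.6, 0.8, 0.2), (0.8, 0.0, 0.6)],
]
means, stds = channel_stats(toy)
normed = [normalize(img, means, stds) for img in toy]

for c, name in enumerate("RGB"):
    values = [px[c] for img in normed for px in img]
    print(name, round(statistics.fmean(values), 6), round(statistics.pstdev(values), 6))
```

After normalization each channel has mean close to 0 and standard deviation close to 1 over the data the statistics were computed on. The statistics come from the training split only and are then reused unchanged for the test split.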

2. The Architecture: A Stack of Conv-BN-ReLU Blocks (Beginner)

The network is three stages. Each stage applies two convolutional blocks, and the second block halves the resolution with a stride-2 convolution. A block is the canonical trio: a convolution (with bias=False, since the following batch norm has its own shift, as Exercise 19.4.1 explained), then batch normalization from Section 19.4, then a ReLU nonlinearity. ReLU is the rectified linear unit from Section 18.1: it zeros every negative activation and passes positives unchanged, supplying the nonlinearity without which a stack of convolutions would collapse to a single linear map. Channels double as resolution halves, the standard pattern that keeps compute roughly constant per stage while letting deeper layers hold more feature types. The balance is no accident: halving each spatial side quarters the number of output positions, while doubling both the input and output channels roughly quadruples the work per position, so the two effects cancel and each stage costs about the same (the first stage is somewhat cheaper, because its first convolution reads only three input channels). The head is global average pooling from Section 19.3 followed by a single linear classifier. Figure 19.5.1 shows the full data flow with shapes.

[Figure 19.5.1 diagram: Input 3x32x32 -> Stage 1 (32x16x16) -> Stage 2 (64x8x8) -> Stage 3 (128x4x4) -> Global avg pool (128) -> Linear (10 logits). Channels double as resolution halves; global pooling makes the head size-independent. Each stage is two conv-BN-ReLU blocks with a stride-2 downsample; three stages take 32x32 down to 4x4.]
Figure 19.5.1 The CIFAR-10 network's data flow. Three stages of conv-BN-ReLU blocks take the $3 \times 32 \times 32$ input through $32 \times 16 \times 16$, $64 \times 8 \times 8$, and $128 \times 4 \times 4$ feature volumes. Global average pooling collapses the final map to a 128-vector, and a linear layer produces the 10 class logits.
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    """The chapter's atom: convolution (no bias), batch norm, ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),           # from Section 19.4; supplies the shift
        nn.ReLU(inplace=True),
    )

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3,   32), conv_block(32,  32, stride=2),   # 32x32 -> 16x16
            conv_block(32,  64), conv_block(64,  64, stride=2),   # 16x16 ->  8x8
            conv_block(64, 128), conv_block(128, 128, stride=2),  #  8x8  ->  4x4
        )
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pool -> 128x1x1
        self.classifier = nn.Linear(128, num_classes)  # 128 -> 10 logits

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)                    # (N, 128, 1, 1) -> (N, 128)
        return self.classifier(x)

model = SmallCNN()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")   # Expected output: 288,746 parameters
Code Fragment 2: The full architecture in two helpers: a reusable conv-BN-ReLU block and a three-stage SmallCNN that downsamples with strided convolutions and ends in global pooling. At about 289K parameters it is two orders of magnitude smaller than the dense network of Section 19.1 and far more accurate.
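The shapes, the parameter count, and the cost-balance argument above can all be checked with plain arithmetic, without instantiating the model. The sketch below assumes the block list of Code Fragment 2 (3x3 kernels, padding 1, stride 2 on the second block of each stage) and counts one multiply-accumulate (MAC) per weight per output position; the helper name conv_out and the variable names are illustrative only.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# (in_channels, out_channels, stride) for the six blocks of SmallCNN.
blocks = [(3, 32, 1), (32, 32, 2), (32, 64, 1), (64, 64, 2), (64, 128, 1), (128, 128, 2)]

size, params, stage_macs = 32, 0, []
for i, (cin, cout, stride) in enumerate(blocks):
    size = conv_out(size, stride=stride)
    weights = cin * cout * 3 * 3          # conv weights (bias=False)
    params += weights + 2 * cout          # plus batch-norm scale and shift
    macs = weights * size * size          # one MAC per weight per output position
    if i % 2 == 0:
        stage_macs.append(macs)           # first block of a stage
    else:
        stage_macs[-1] += macs            # second (strided) block of the stage

params += 128 * 10 + 10                   # linear classifier: weights and biases

print(size)         # 4
print(params)       # 288746
print(stage_macs)   # [3244032, 7077888, 7077888]
```

Stages 2 and 3 cost exactly the same number of multiply-accumulates, as the quarter-the-positions, quadruple-the-work argument predicts. Stage 1 is cheaper because its first convolution reads a 3-channel image rather than a 16-channel feature map, which is why the balance holds only roughly.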

3. The Training Loop (Intermediate)

The loop is the standard supervised recipe from Chapter 18: for each batch, run a forward pass, compute the cross-entropy loss, backpropagate, and step the optimizer. The choices that matter for a CNN are the optimizer and schedule. We use SGD with momentum and weight decay, the workhorse for CNNs, and a cosine learning-rate schedule that smoothly decays the rate to near zero, which reliably squeezes out the last few accuracy points. Note the disciplined use of model.train() and model.eval() from Section 19.4, without which batch norm misbehaves at evaluation.
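The cosine schedule has a simple closed form, and it helps to see the numbers before running the loop. The sketch below evaluates the standard cosine-annealing formula with a minimum rate of zero, using the base rate 0.1 and the 30-epoch horizon of this section. The function name cosine_lr is hypothetical; it is meant to reproduce the per-epoch values a cosine-annealing scheduler would set, not to replace the scheduler.

```python
import math

def cosine_lr(epoch, base_lr=0.1, total_epochs=30, eta_min=0.0):
    """Cosine annealing: eta_min + 0.5 * (base - eta_min) * (1 + cos(pi * t / T))."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / total_epochs))

schedule = [cosine_lr(e) for e in range(31)]

# Rate at the start, the midpoint, the final training epoch, and after the last step.
print(" ".join(f"{schedule[e]:.5f}" for e in (0, 15, 29, 30)))
# 0.10000 0.05000 0.00027 0.00000
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, so the final epochs take very small steps. Those small steps are what let the weights settle and recover the last few accuracy points.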

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = SmallCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
EPOCHS = 30
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

@torch.no_grad()
def evaluate(loader):
    model.eval()                                   # batch norm uses running stats; dropout (if any) is off
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        preds = model(x).argmax(dim=1)
        correct += (preds == y).sum().item()
        total   += y.size(0)
    return 100.0 * correct / total

for epoch in range(EPOCHS):
    model.train()                                  # batch norm uses batch stats here
    running = 0.0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running += loss.item()
    scheduler.step()
    acc = evaluate(test_loader)
    print(f"epoch {epoch+1:2d}  loss {running/len(train_loader):.3f}  test acc {acc:.2f}%")

# Representative tail of the run on a single modern GPU (a few minutes total):
# epoch 28  loss 0.281  test acc 84.91%
# epoch 29  loss 0.270  test acc 85.33%
# epoch 30  loss 0.262  test acc 85.46%
Code Fragment 3: The complete training loop: SGD with momentum and weight decay, a cosine learning-rate schedule, and explicit train/eval mode switching for batch norm. Thirty epochs reach roughly 85 percent CIFAR-10 test accuracy in a few minutes on a single GPU.
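What optimizer.step() does with momentum and weight decay can be written out for a single scalar weight. The sketch below follows the usual formulation of SGD with momentum (weight decay folded into the gradient, a velocity buffer that accumulates past gradients, no dampening or Nesterov correction), applied to the toy loss 0.5 * (w - 3)^2. The function sgd_step and the toy loss are assumptions for illustration; the hyperparameters match Code Fragment 3.

```python
def sgd_step(w, grad, buf, lr=0.1, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum and L2 weight decay on a scalar weight."""
    g = grad + weight_decay * w      # weight decay adds a pull toward zero
    buf = momentum * buf + g         # velocity: decayed sum of past gradients
    return w - lr * buf, buf

w, buf = 0.0, 0.0
for _ in range(500):
    grad = w - 3.0                   # gradient of the toy loss 0.5 * (w - 3)^2
    w, buf = sgd_step(w, grad, buf)

print(round(w, 4))                   # settles just below 3 because of weight decay
```

The weight converges to 3 / 1.0005 rather than exactly 3: weight decay trades a tiny amount of training loss for smaller weights, which is the regularizing effect the recipe relies on.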
Key Insight: Remember the Block as Three Verbs, Mix, Normalize, Activate

The conv-BN-ReLU block is the chapter's atom, and it is worth memorizing as three verbs in order: mix, normalize, activate. The convolution mixes a local patch across channels into new features (Section 19.2), batch norm normalizes those features to a well-conditioned scale (Section 19.4), and ReLU activates them with the nonlinearity that keeps the stack from collapsing to one linear map. Every stage of SmallCNN, and almost every convolutional network you will meet from Chapter 20 onward, is this three-verb block repeated with the channel count rising as resolution falls. When you read an architecture diagram, you are reading mix-normalize-activate over and over, the three-station assembly line in the illustration below.

[Illustration: a cozy three-station assembly line where one robot blends colorful patch-cubes into a new feature, a second robot levels everything to the same neat height with a spirit level, and a third robot flips an ON switch that lights up the result, illustrating the conv-BN-ReLU block as mix, normalize, then activate.]
Every convolutional network is this three-verb block on repeat: mix a patch into new features, normalize them to a steady scale, then activate.
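The three verbs can be traced by hand on a handful of numbers. The sketch below is a deliberately reduced version of the block: the "convolution" uses a 1x1 kernel so that each patch is a single pixel, and the normalization is the core of batch norm without the learned scale and shift. All names and values are made up for illustration.

```python
import statistics

def mix(pixels, weights):
    """1x1 convolution: each output channel is a weighted sum of the input channels."""
    return [[sum(w * x for w, x in zip(row, px)) for row in weights] for px in pixels]

def normalize(features, eps=1e-5):
    """Batch-norm core: per-channel zero mean and unit variance over the batch."""
    cols = list(zip(*features))
    means = [statistics.fmean(c) for c in cols]
    stds = [(statistics.pvariance(c) + eps) ** 0.5 for c in cols]
    return [[(v - m) / s for v, m, s in zip(f, means, stds)] for f in features]

def activate(features):
    """ReLU: zero the negatives, pass the positives."""
    return [[max(0.0, v) for v in f] for f in features]

pixels = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # 4 pixels, 2 channels
weights = [[1.0, -1.0], [0.5, 0.5]]                          # 2 output channels

mixed = mix(pixels, weights)
normed = normalize(mixed)
out = activate(normed)
for row in out:
    print([round(v, 3) for v in row])
```

The first output channel is a difference of the two inputs and the second is their average; after normalization roughly half of each channel is negative, so ReLU keeps only the above-average responses.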
Key Insight: The Recipe Is as Important as the Architecture

The same SmallCNN trained with a poor recipe (a too-large constant learning rate, no weight decay, no augmentation) might reach only the low seventies and overfit badly. The augmentation, the weight decay, and the learning-rate schedule are not optional polish; they are responsible for several accuracy points each. This is why Chapter 21 is devoted entirely to training recipes: in modern practice the gap between a mediocre and a strong result on the same architecture is usually the recipe, not the layers.

Fun Note: The Network Cried for Three Epochs

The epigraph is closer to the truth than it has any right to be. A freshly initialized network really does flail for the first few epochs (loss high, predictions essentially random), then discovers edges, then wings and wheels, then sorts a frog from a truck. The temptation when the early loss looks bad is to panic and add layers. Resist it. Most of the time the architecture was fine and the recipe was hungry: a schedule, some augmentation, a little weight decay. The mantra for this section: before you make the network bigger, make the training better.

4. Diagnosing Overfitting (Intermediate)

The most useful single plot in supervised learning shows training and validation accuracy (or loss) versus epoch. When training loss keeps falling while validation accuracy plateaus or declines, the network is overfitting: memorizing the training set rather than learning generalizable features. The remedies are regularizers, those already in the recipe and a few more: more augmentation, more weight decay, dropout, or a smaller network, plus early stopping on the validation metric. Early stopping needs a validation split held out from the training data; Code Fragment 3 prints test accuracy each epoch only to monitor the run, and selecting a checkpoint on that number would make the reported test accuracy optimistic. Figure 19.5.2 shows the canonical signatures of underfitting, a healthy fit, and overfitting.

[Figure 19.5.2 panels: Underfitting | Good fit | Overfitting. Curves in each panel: training accuracy, validation accuracy.]
Figure 19.5.2 Reading the learning curves. Underfitting (left): both curves are low and close, so the model lacks capacity or training time. Good fit (center): both are high with a small, stable gap. Overfitting (right): training accuracy climbs while validation accuracy peaks and then declines, the classic memorization signature that augmentation, weight decay, and early stopping are designed to combat.
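Early stopping, one of the remedies named above, reduces to a small piece of bookkeeping on the validation curve. The sketch below is one common patience-based variant, not the only one; the function name early_stop, the patience of three epochs, and the accuracy values are all made up for illustration.

```python
def early_stop(val_acc, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.
    Training stops once validation accuracy has failed to improve
    for `patience` consecutive epochs."""
    best, best_epoch, waited = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best:
            best, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_acc)

# A made-up validation curve with the overfitting signature: peak, then decline.
val_acc = [52.0, 63.5, 70.1, 74.8, 77.9, 78.4, 78.1, 77.6, 77.0, 76.2]
best_epoch, stop_epoch = early_stop(val_acc)
print(best_epoch, stop_epoch)   # 6 9
```

In a real run you would save a checkpoint whenever the validation accuracy improves and restore the one from best_epoch after stopping.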
Practical Example: The Leaderboard Model That Failed in the Field

Who: A startup building a plant-disease classifier from phone photos taken by farmers.

Situation: Their CNN reached 98 percent on a held-out split of their collected dataset and looked ready to ship.

Problem: In a field pilot, accuracy fell to the low sixties. Inspection revealed the training photos for each disease had been collected on the same few days with the same lighting and backgrounds, so the network had partly learned background and color-cast cues rather than the lesions. Its own validation split shared those spurious cues, so the validation accuracy was optimistic: a disguised case of the overfitting signature in Figure 19.5.2, hidden by a leaky split.

Decision: Rebuild the validation split to hold out entire collection sessions (so background and lighting could not leak), then aggressively augment with color jitter, random crops, and the random erasing of Chapter 21 to force reliance on lesion structure.

Result: Reported validation accuracy dropped to a believable 88 percent, while field accuracy rose to 86 percent, finally in line with the lab number. The honest split and stronger augmentation closed the gap between benchmark and reality.

Lesson: A high validation number means nothing if the split leaks the spurious cues the test will not contain. Augmentation that attacks the spurious cue, and a split that mirrors deployment, are what make a CNN generalize, exactly the regularization themes this section's recipe embodies.
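The fix in the Decision step, holding out entire collection sessions, is easy to get wrong with an ordinary random split. A minimal sketch of a session-level split follows; the (image_id, session_id) record layout, the function name split_by_session, and the 25 percent validation fraction are assumptions for illustration, not details from the case.

```python
import random

def split_by_session(samples, val_fraction=0.25, seed=0):
    """samples: list of (image_id, session_id) pairs.
    Holds out whole sessions, so no session contributes to both splits."""
    sessions = sorted({session for _, session in samples})
    rng = random.Random(seed)
    rng.shuffle(sessions)
    n_val = max(1, round(len(sessions) * val_fraction))
    val_sessions = set(sessions[:n_val])
    train = [s for s in samples if s[1] not in val_sessions]
    val = [s for s in samples if s[1] in val_sessions]
    return train, val

# 40 hypothetical photos spread evenly over 8 collection sessions.
samples = [(f"img{i:03d}", f"session{i % 8}") for i in range(40)]
train, val = split_by_session(samples)

train_sessions = {s for _, s in train}
val_sessions = {s for _, s in val}
print(len(train), len(val), train_sessions & val_sessions)   # 30 10 set()
```

Because the shuffle operates on session identifiers rather than on individual photos, background and lighting cues tied to a session cannot appear on both sides of the split.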

Library Shortcut: A Strong Baseline in Three Lines

If your goal is a working classifier rather than a teaching exercise, skip the from-scratch architecture entirely. torchvision.models.resnet18(weights=None, num_classes=10) gives a stronger network in one line. Loading weights="IMAGENET1K_V1" and fine-tuning (the transfer learning of Chapter 21) reaches well over 95 percent on CIFAR-10; the pretrained checkpoint comes with a 1000-class head, so build the model without num_classes=10 and replace its final fc layer with a 10-way linear layer before fine-tuning. The training-loop boilerplate (the loop itself, AMP mixed precision, checkpointing, logging) also has library answers: PyTorch Lightning or the Hugging Face Trainer replace roughly a hundred lines of loop code with a configured object. Build the loop once by hand to own it, as this section does, then graduate to the framework for real projects.

Research Frontier: How Far Can a Small CNN Go?

CIFAR-10 remains a live benchmark for training efficiency rather than peak accuracy. The "CIFAR-10 speedrun" community, anchored by Keller Jordan's airbench and tracked publicly through 2024-2025, trains small ResNet-style CNNs to 94 percent in under ten seconds on a single GPU using aggressive techniques: whitening the input with a fixed first layer, label smoothing, lookahead-style optimizers, and test-time augmentation. The lesson is that the architecture in this section is near the efficient frontier for the parameter budget, and most remaining gains come from the optimization and data recipe of Chapter 21 rather than from more layers, a striking confirmation that on small data the recipe dominates.

You have now trained a real convolutional network from random weights to competitive accuracy, exercising every concept in the chapter. The natural next question is what those 289 thousand learned numbers actually became. Section 19.6 opens the trained model and answers it, visualizing the filters, feature maps, and class evidence and confirming that the network rediscovered the edge detectors of Chapter 3 on its own.

Exercise 19.5.1: Read the Curves (Conceptual)

You train SmallCNN and observe: training accuracy 99 percent, test accuracy 78 percent, and a test-accuracy curve that peaked at epoch 20 and declined thereafter. Diagnose the condition using Figure 19.5.2, then list three distinct changes (one to the data, one to the optimizer, one to the architecture or training duration) that would each be expected to raise the test accuracy, and predict the direction each would push the train-test gap.

Exercise 19.5.2: Ablate the Block (Coding)

Train two variants of SmallCNN for 15 epochs each: one with the batch-norm layers removed from conv_block, and one with them kept. Plot or print test accuracy per epoch for both. Report the difference in final accuracy and in how many epochs each takes to first exceed 70 percent, and connect your observation to the claims about training speed and stability in Section 19.4.

Exercise 19.5.3: Where Do the Errors Live? (Analysis)

After training, build the $10 \times 10$ confusion matrix on the CIFAR-10 test set (rows true class, columns predicted). Identify the two class pairs the network confuses most often, look at a handful of the misclassified images, and explain in terms of the feature hierarchy of Section 19.3 why those particular classes (for example cat and dog) are harder to separate than others (for example ship and frog). Propose one targeted change that would most help the confused pair.
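As a starting point for Exercise 19.5.3, the bookkeeping can be sketched without any framework. The code below builds a confusion matrix from lists of integer labels and ranks class pairs by the number of errors in both directions; the helper names and the three-class toy labels are hypothetical, and collecting the true and predicted labels from the trained model (and the analysis itself) is left to you.

```python
def confusion_matrix(true, pred, num_classes):
    """cm[t][p] counts examples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true, pred):
        cm[t][p] += 1
    return cm

def most_confused_pairs(cm, top=2):
    """Rank unordered class pairs (i, j) by errors in both directions."""
    n = len(cm)
    pairs = [(cm[i][j] + cm[j][i], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    return [(i, j, count) for count, i, j in pairs[:top]]

# Toy labels for a three-class problem (made up).
true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
pred = [0, 1, 1, 1, 0, 1, 2, 2, 1, 2]
cm = confusion_matrix(true, pred, 3)

print(cm)                        # [[1, 2, 0], [1, 2, 0], [0, 1, 3]]
print(most_confused_pairs(cm))   # [(0, 1, 3), (1, 2, 1)]
```

For CIFAR-10 the same functions apply with num_classes=10 and the labels gathered in one pass over the test loader.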