"I do not learn rules. I learn a million small corrections, each one a little less wrong than the last, until the loss stops complaining."
A Gradient That Refuses to Vanish
Chapter Overview
For seventeen chapters, every transformation we applied to an image was something we wrote down by hand. A Sobel kernel, a homography, a SIFT descriptor, a histogram of oriented gradients: humans designed the operation, and the computer executed it faithfully. Part III inverts that relationship. From here on, we specify only the shape of the computation and a measure of how wrong its output is, and an optimization procedure discovers the operation itself from data. This is the deep learning bargain, and Chapter 18 is where we sign the contract: not by training a state-of-the-art network, but by building, from tensors upward, a training loop simple enough that nothing in it is magic.
The chapter has a deliberate arc. Section 18.1 starts where your linear-algebra intuition already lives, with a linear classifier, and shows precisely why stacking linear layers gains nothing until a nonlinearity is inserted between them, turning a single matrix multiply into a multi-layer perceptron that can carve curved decision boundaries. Section 18.2 opens the optimization engine: backpropagation as the chain rule applied mechanically to a computation graph, and gradient descent with its modern refinements (momentum, Adam, learning-rate schedules) as the procedure that turns gradients into learning. These two sections are framework-agnostic; they would be true in any language.
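The collapse argument can be checked numerically. The sketch below uses NumPy and hand-picked values (W1, W2, and x are illustrative, not taken from the chapter) to show that two stacked linear maps equal a single matrix, and that a ReLU between them breaks that equivalence for this input.

```python
import numpy as np

# Illustrative weights and input, chosen by hand so the arithmetic is easy to follow.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # first linear layer: 2 -> 2
W2 = np.array([[1.0, 1.0]])    # second linear layer: 2 -> 1
x = np.array([2.0, 1.0])

# Two linear layers applied in sequence...
stacked = W2 @ (W1 @ x)

# ...match one linear layer whose matrix is the product W2 @ W1.
collapsed = (W2 @ W1) @ x

# With a ReLU between the layers, the collapsed matrix no longer reproduces the output.
def relu(z):
    return np.maximum(z, 0.0)

with_relu = W2 @ relu(W1 @ x)

print("stacked:  ", stacked)    # [0.]
print("collapsed:", collapsed)  # [0.]
print("with ReLU:", with_relu)  # [1.]
```

Here W1 @ x is [1, -1]; the ReLU zeroes the negative entry, so the second layer sees [1, 0] instead and the output changes from 0 to 1.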
The remaining four sections make it real in PyTorch, the framework that dominates vision research and a large share of production. Section 18.3 introduces the three primitives every later chapter assumes you know cold: the Tensor (a GPU-aware array), autograd (the engine that records operations and replays them backward), and nn.Module (the container that organizes parameters and forward passes). Section 18.4 builds the input side, Dataset and DataLoader, the unglamorous plumbing that decides whether your expensive accelerator sits idle or stays fed. Section 18.5 assembles the canonical training loop with losses, metrics, validation, and checkpointing, the loop you will recognize, lightly disguised, inside every training script in this book. Section 18.6 closes with the practical machinery that separates a notebook demo from a real run: moving work to the GPU, mixed-precision training, and the reproducibility discipline that lets you trust a number twice.
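As a small preview of the autograd primitive, the sketch below differentiates a one-parameter function and compares the result with the chain rule worked by hand. The function and the values of w and x are illustrative choices, not the chapter's own example.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a parameter autograd should track
x = torch.tensor(3.0)                       # plain data, no gradient needed

y = (w * x - 1.0) ** 2                      # the forward pass records the graph
y.backward()                                # autograd replays it backward

# Chain rule by hand: dy/dw = 2 * (w*x - 1) * x = 2 * 5 * 3 = 30
manual = 2.0 * (2.0 * 3.0 - 1.0) * 3.0

print(w.grad.item(), manual)                # 30.0 30.0
print(y.grad_fn)                            # the last recorded operation
print(y.grad_fn.next_functions)             # its inputs in the backward graph
```

Walking `grad_fn` and `next_functions` is one way to inspect what autograd recorded when a backward pass misbehaves.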
Nothing here is computer-vision-specific yet; the classifier in this chapter eats flattened pixel vectors, which is exactly the wrong way to treat an image. That is intentional. Chapter 19 will fix the architecture by reintroducing the convolution you already met as a hand-designed filter in Chapter 3, this time with learnable kernels. But a convolutional network is trained by the very same loop you build here. Master this chapter and the rest of Part III becomes a story about better architectures and better data, told over an optimization machine you already understand end to end.
The chapter closes with a single capstone activity. The Hands-On Lab has you assemble every piece into one runnable Fashion-MNIST classifier with a saved best checkpoint and a reproducible accuracy you can defend: the MLP of Section 18.1 written as an nn.Module (Section 18.3), the optimizer of Section 18.2 driven by autograd, the Dataset and DataLoader of Section 18.4, the five-beat training loop of Section 18.5, and the device, mixed-precision, and seeding discipline of Section 18.6.
Deep learning is not a new kind of mathematics; it is a single, reusable loop: feed data forward through a parameterized function, measure the error with a loss, propagate gradients backward, and nudge every parameter downhill, repeated until the loss stops falling. Everything in Part III, every CNN, transformer, detector, and diffusion model, is that loop wrapped around a different function and a different dataset. Chapter 18 builds the loop once, transparently, so that for the next twenty chapters you can take it for granted.
If you carry one thing out of this chapter, carry the five-beat loop that Section 18.3 names and Section 18.5 assembles in full: zero, forward, loss, backward, step. Zero the accumulated gradients, run the forward pass, measure the loss, backpropagate, and let the optimizer step downhill. Every network in the next twenty chapters, convolutional, transformer, or diffusion, is trained by exactly this loop, with a different model dropped into the forward beat and a different loss into the loss beat. The architectures change; the heartbeat does not.
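The five beats fit in a handful of lines. The following is a minimal sketch on a toy regression problem; the data (a noiseless line y = 3x + 0.5), the one-parameter-pair model, and the learning rate are illustrative assumptions, not the chapter's lab.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: points on the line y = 3x + 0.5 (an illustrative target).
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3.0 * x + 0.5

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()          # beat 1: zero the accumulated gradients
    pred = model(x)                # beat 2: forward
    loss = criterion(pred, y)      # beat 3: loss
    loss.backward()                # beat 4: backward
    optimizer.step()               # beat 5: step
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.2e}")
print(f"learned weight {model.weight.item():.3f}, bias {model.bias.item():.3f}")
```

Swapping in a different model and criterion leaves the five lines inside the loop untouched, which is the point of the heartbeat metaphor.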
Learning Objectives
- Explain why a stack of linear layers collapses to a single linear layer, and how a nonlinear activation rescues expressivity, turning a perceptron into a universal-approximating multi-layer perceptron.
- Derive backpropagation as the chain rule on a computation graph, and state what stochastic gradient descent, momentum, and Adam each contribute.
- Manipulate PyTorch tensors fluently: shapes, broadcasting, device placement, and the difference between an in-place and an out-of-place operation.
- Use autograd to compute gradients automatically, and read a .grad_fn chain to debug a backward pass.
- Define models with nn.Module, feed them with Dataset and DataLoader, and write a training loop with losses, metrics, validation, and checkpointing from scratch.
- Move training to a GPU, enable automatic mixed precision for a speed and memory win, and seed a run so that its results are reproducible.
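To make the tensor objective concrete, here is a brief sketch of broadcasting and of the in-place versus out-of-place distinction. The shapes and values are arbitrary illustrations.

```python
import torch

a = torch.ones(2, 3)
row = torch.tensor([10.0, 20.0, 30.0])   # shape (3,)

# Broadcasting: the (3,) row is stretched across both rows of the (2, 3) tensor.
b = a + row                              # out-of-place: a new tensor, `a` is unchanged
print(b.shape)                           # torch.Size([2, 3])
print(b[0].tolist())                     # [11.0, 21.0, 31.0]

# In-place operations end in an underscore and reuse the same storage.
storage_before = a.data_ptr()
a.add_(1.0)
print(a.data_ptr() == storage_before)    # True
print(a.sum().item())                    # 12.0
```

In-place operations save memory but can overwrite values autograd needs for the backward pass, which is why the distinction matters once gradients are involved.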
Prerequisites
This chapter assumes the linear-algebra comfort the book has required throughout: vectors, matrices, the dot product, and matrix multiplication. It leans on three earlier ideas in particular. The convolution and kernel intuition from Chapter 3 is the hand-designed ancestor of the learnable layers ahead. The classifier-and-evaluation thinking from the classical recognition pipelines of Chapter 16 sets up what a learned classifier replaces. And the histogram-and-statistics view from Chapter 2 returns as the normalization statistics that every input pipeline computes. No prior PyTorch or neural-network experience is needed; we build from the tensor up. A machine with a GPU helps for Section 18.6 but is not required to follow the code.
Chapter Roadmap
- 18.1 From Linear Models to Multi-Layer Perceptrons Why a stack of linear layers must collapse to one, how a single nonlinearity restores expressivity, and the multi-layer perceptron as the simplest network that learns curved decision boundaries.
- 18.2 Backpropagation & Optimization in a Nutshell The chain rule applied mechanically to a computation graph, and the gradient-descent family from plain SGD through momentum and Adam to the learning-rate schedules that decide convergence.
- 18.3 PyTorch Essentials: Tensors, Autograd & nn.Module The three abstractions every later chapter assumes: the GPU-aware tensor, the autograd engine that replays operations backward, and the nn.Module container, all built and exercised by hand.
- 18.4 Datasets, DataLoaders & Input Pipelines Feeding the accelerator: map-style datasets, batching and shuffling, worker parallelism, transforms and augmentation, and the normalization statistics that keep training stable.
- 18.5 The Training Loop: Losses, Metrics & Checkpointing The canonical loop with train and validation phases, the loss-versus-metric distinction, early stopping, and saving the best model, the code that recurs in every later chapter.
- 18.6 GPUs, Mixed Precision & Reproducibility Device placement, automatic mixed precision with loss scaling for a speed and memory win, and the seeding discipline that makes a reported number trustworthy.
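To give the gradient-descent family of Section 18.2 some texture before the details arrive, the sketch below compares plain gradient descent with the momentum variant on a toy quadratic. It is pure Python; the loss, learning rate, momentum coefficient, and step count are illustrative assumptions, and Adam is left out for brevity.

```python
def loss(w):
    # Toy loss with one shallow and one steep direction: 0.5 * (w0^2 + 25 * w1^2).
    return 0.5 * (1.0 * w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [1.0 * w[0], 25.0 * w[1]]

def run(lr, momentum, steps=100):
    w = [1.0, 1.0]          # starting point
    v = [0.0, 0.0]          # velocity buffer (stays equal to the gradient when momentum=0)
    for _ in range(steps):
        g = grad(w)
        v = [momentum * vi + gi for vi, gi in zip(v, g)]   # accumulate past gradients
        w = [wi - lr * vi for wi, vi in zip(w, v)]         # step along the velocity
    return loss(w)

plain = run(lr=0.03, momentum=0.0)
heavy = run(lr=0.03, momentum=0.9)
print(f"plain GD loss after 100 steps: {plain:.2e}")
print(f"momentum loss after 100 steps: {heavy:.2e}")
```

On this particular problem the learning rate is capped by the steep direction, so plain descent crawls along the shallow one; the velocity term lets momentum make faster progress there at the same learning rate.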
Hands-On Lab: Build the Training Loop That Trains Everything
Objective
Assemble, from the tensor up, a single self-contained script that trains a multi-layer perceptron to classify Fashion-MNIST, using your own nn.Module, your own five-beat training loop, automatic device placement and mixed precision, a saved best checkpoint, and a seed so the reported accuracy is reproducible. This is the optimization machine the rest of Part III reuses, and the artifact you will recognize, lightly disguised, inside every later training script in this book.
What You'll Practice
- Defining a model with nn.Module and seeing why a nonlinearity between linear layers is what makes it an MLP rather than a collapsed single layer (Section 18.1).
- Driving autograd and an optimizer through the five-beat loop, zero, forward, loss, backward, step (Sections 18.2 and 18.5).
- Feeding the model with a torchvision Dataset and a batched, shuffled DataLoader (Section 18.4).
- Separating training from validation, tracking a loss versus an accuracy metric, and checkpointing the best model (Section 18.5).
- Adding device placement, automatic mixed precision, and reproducible seeding to turn a notebook demo into a real run (Section 18.6).
Setup
Two libraries; torchvision downloads Fashion-MNIST (about 30 MB) on first run, then caches it. Install with:
pip install torch torchvision
The lab runs end to end on a CPU in a few minutes for three epochs; a GPU (CUDA or Apple mps) makes it near-instant and exercises the mixed-precision path. No notebook is required; the script is a single file.
Steps
Step 1: Seed every random source
Before anything else, pin the randomness so two runs of this script agree, the discipline of Section 18.6. Seed Python, NumPy, and PyTorch, and detect the best available device once.
import random

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import v2


def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices


set_seed(42)

# TODO: detect the best device with the cuda / mps / cpu idiom from Section 18.6
# and store it in a variable named `device`. Print it.
Hint
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu". Everything later, the model and each batch, gets sent here with .to(device).
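A quick way to convince yourself the seeding works is to draw from each random source, reseed, and draw again. The sketch below does that; the `draw` helper is a hypothetical name introduced only for this check.

```python
import random

import numpy as np
import torch


def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def draw():
    # Hypothetical helper: one sample from each of the three sources the lab seeds.
    return random.random(), float(np.random.rand()), torch.rand(3)


set_seed(42)
first = draw()
set_seed(42)
second = draw()

same = (
    first[0] == second[0]
    and first[1] == second[1]
    and torch.equal(first[2], second[2])
)
print("identical after reseeding:", same)   # True
```

This only checks the three seeded generators on one machine; GPU kernels can add their own nondeterminism, which is the subject of the fuller discipline in Section 18.6.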
Step 2: Build the input pipeline
Load Fashion-MNIST as two datasets (train and test), normalized with its known channel statistics, and wrap each in a DataLoader. The transform converts each image to a float tensor and standardizes it, the normalization step of Section 18.4.
tf = v2.Compose([
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),       # uint8 [0,255] -> float [0,1]
    v2.Normalize(mean=[0.2860], std=[0.3530]),   # Fashion-MNIST grayscale stats
])

train_set = datasets.FashionMNIST(root="data", train=True, download=True, transform=tf)
test_set = datasets.FashionMNIST(root="data", train=False, download=True, transform=tf)

# TODO: build train_loader (batch_size=128, shuffle=True) and
# test_loader (batch_size=256, shuffle=False) over these two datasets.
Hint
train_loader = DataLoader(train_set, batch_size=128, shuffle=True) and test_loader = DataLoader(test_set, batch_size=256, shuffle=False). Shuffle the training data so batches differ each epoch; never shuffle the evaluation data.
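The same pipeline can be rehearsed without downloading anything. The sketch below runs on SYNTHETIC random images (an assumption made so it runs offline; these are not Fashion-MNIST pixels) and shows how normalization statistics of the kind used above can be computed, then checks what a batched, shuffled DataLoader yields.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# SYNTHETIC stand-in for the dataset: 512 random uint8 "images" with random labels.
raw = torch.randint(0, 256, (512, 1, 28, 28), dtype=torch.uint8)
labels = torch.randint(0, 10, (512,))

scaled = raw.float() / 255.0              # uint8 [0, 255] -> float [0, 1]
mean, std = scaled.mean(), scaled.std()   # one statistic pair for the single gray channel
normalized = (scaled - mean) / std        # what a Normalize step does per pixel

loader = DataLoader(TensorDataset(normalized, labels), batch_size=128, shuffle=True)
images, targets = next(iter(loader))

print(images.shape, targets.shape)        # torch.Size([128, 1, 28, 28]) torch.Size([128])
print(len(loader))                        # 4 batches of 128 cover the 512 samples
print(f"mean {normalized.mean().item():.4f}, std {normalized.std().item():.4f}")
```

After normalization the data has mean close to 0 and standard deviation close to 1, which is the property the fixed Fashion-MNIST constants are meant to give the real images.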
Step 3: Define the MLP
Subclass nn.Module to build a two-layer perceptron: flatten the 28 by 28 image to a 784-vector, map it to a hidden layer, apply a nonlinearity, then map to the ten class logits. The nonlinearity is what stops the two linear layers from collapsing into one (Section 18.1).
class MLP(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, hidden)
        # TODO: add the activation (nn.ReLU) and the output layer
        # self.fc2 mapping `hidden` -> 10 class logits.

    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        # TODO: apply the activation, then the output layer, and return the logits.
        return x


model = MLP().to(device)  # move ALL parameters to the device
Hint
In __init__: self.act = nn.ReLU() and self.fc2 = nn.Linear(hidden, 10). In forward: x = self.act(self.fc1(x)); return self.fc2(x). Return raw logits, not probabilities; cross_entropy applies the softmax internally.
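Before training, it is worth a sanity check that the finished model produces the right shapes. The sketch below assumes the completed MLP from the hint and feeds it a random batch (a stand-in for real images) on the CPU.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(hidden, 10)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(self.flatten(x))))


model = MLP()
batch = torch.randn(4, 1, 28, 28)         # random stand-in for four grayscale images
logits = model(batch)
n_params = sum(p.numel() for p in model.parameters())
probs = logits.softmax(dim=1)             # only for inspection; the loss takes raw logits

print(logits.shape)                       # torch.Size([4, 10])
print(n_params)                           # 203530 = (784*256 + 256) + (256*10 + 10)
print(probs.sum(dim=1))                   # each row sums to 1
```

A shape mismatch caught here costs seconds; the same mismatch discovered inside the training loop usually costs a confusing traceback.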
Step 4: Write the five-beat training step
Define one epoch of training. For each batch: move it to the device, run the forward pass inside autocast for mixed precision, compute the loss, and run the scaler-wrapped backward and optimizer step, the five beats of Section 18.5 with the AMP wrapper of Section 18.6.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
use_amp = device != "cpu"  # mixed precision only on an accelerator; CPU runs plain FP32
scaler = torch.amp.GradScaler(device, enabled=use_amp)  # dynamic loss scaling for FP16


def train_one_epoch():
    model.train()
    running = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()                                  # beat 1: zero
        with torch.amp.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            logits = model(images)                             # beat 2: forward
            loss = criterion(logits, labels)                   # beat 3: loss
        # TODO: run the scaled backward (beat 4) and the scaler step + update
        # (beat 5) using `scaler`, then accumulate loss.item() * images.size(0).
    return running / len(train_loader.dataset)
Hint
scaler.scale(loss).backward(), then scaler.step(optimizer), then scaler.update(); and running += loss.item() * images.size(0). The autocast context plus the three scaler calls are the only difference from a plain FP32 loop.
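Why the scaler is needed can be seen without a GPU. The sketch below casts a very small value to FP16 with and without scaling; the value 1e-8 and the scale 65536 (2**16) are illustrative choices, not numbers taken from the lab.

```python
import torch

tiny_grad = torch.tensor(1e-8)     # FP32 value below FP16's smallest subnormal (about 6e-8)
scale = 65536.0                    # illustrative loss scale, 2**16

naive = tiny_grad.to(torch.float16)               # underflows to zero
scaled = (tiny_grad * scale).to(torch.float16)    # large enough to survive the cast
recovered = scaled.to(torch.float32) / scale      # unscale again in FP32

print(naive.item())                # 0.0
print(recovered.item())            # close to 1e-8
```

Scaling the loss scales every gradient by the same factor through the chain rule, so small gradients land in FP16's representable range and are divided back down before the optimizer uses them.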
Step 5: Write the validation pass
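Evaluate accuracy on the held-out test set with gradients disabled and the model in eval mode. Fashion-MNIST ships with only a train and a test split, so this lab lets the test set stand in for a validation set; because Step 6 selects the checkpoint on it, the number you report is a validation accuracy, not an untouched test score. Accuracy is the metric you care about; the loss is what the optimizer minimizes, the loss-versus-metric distinction of Section 18.5.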
@torch.no_grad()
def evaluate():
    model.eval()
    correct = total = 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)
        # TODO: count correct predictions (logits.argmax(1) == labels)
        # and accumulate `total`. Return correct / total.
    return correct / total
Hint
correct += (logits.argmax(1) == labels).sum().item() and total += labels.size(0). The @torch.no_grad() decorator and model.eval() together turn off gradient tracking and switch any dropout or batch-norm layers to inference behavior.
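The counting logic is easy to verify on a batch small enough to check by eye. The logits and labels below are made-up values for illustration only.

```python
import torch
import torch.nn.functional as F

# Four hand-written examples over three classes (illustrative values).
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 1.5],
                       [0.0, 3.0, 0.5],
                       [1.0, 0.9, 0.8]])
labels = torch.tensor([0, 2, 1, 2])

preds = logits.argmax(1)                      # index of the largest logit per row
correct = (preds == labels).sum().item()
accuracy = correct / labels.size(0)
loss = F.cross_entropy(logits, labels).item()

print(preds.tolist())                         # [0, 2, 1, 0]
print(correct, accuracy)                      # 3 0.75
print(f"cross-entropy: {loss:.4f}")
```

The last example is wrong by a narrow margin: accuracy counts it as a plain miss, while the cross-entropy loss also reflects how close the logits were. That difference is the loss-versus-metric distinction in miniature.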
Step 6: Run the loop and checkpoint the best model
Train for a few epochs, evaluate after each, print progress, and save the weights whenever validation accuracy improves, so the file on disk is always your best model, not your last one (Section 18.5).
best_acc = 0.0
for epoch in range(1, 4):
    train_loss = train_one_epoch()
    val_acc = evaluate()
    print(f"epoch {epoch}: train_loss={train_loss:.4f} val_acc={val_acc:.4f}")
    # TODO: if val_acc beats best_acc, update best_acc and
    # torch.save(model.state_dict(), "best_mlp.pt").
print(f"best validation accuracy: {best_acc:.4f}")
Hint
if val_acc > best_acc: best_acc = val_acc; torch.save(model.state_dict(), "best_mlp.pt"). Save the state_dict (the parameter tensors), not the whole model object; reload later with model.load_state_dict(torch.load("best_mlp.pt")).
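A save-and-reload round trip can be sketched on a tiny model. Everything here is a stand-in: the nn.Linear layer replaces the lab's MLP and the file name best_demo.pt is hypothetical. The map_location and weights_only arguments to torch.load are optional extras that make loading on a CPU-only machine and loading tensors-only explicit.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                     # tiny stand-in for the lab's MLP
x = torch.randn(3, 4)
before = model(x)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_demo.pt")   # hypothetical checkpoint name
    torch.save(model.state_dict(), path)       # parameters only, not the object

    restored = nn.Linear(4, 2)                 # fresh model with different initial weights
    state = torch.load(path, map_location="cpu", weights_only=True)
    restored.load_state_dict(state)

after = restored(x)
print(torch.equal(before, after))           # True
```

Because only the state_dict is stored, the reloading code must rebuild the same architecture first; that is the trade-off for a checkpoint that does not depend on pickled class definitions.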
Expected Output
Three lines of per-epoch progress followed by a best-accuracy summary, for example:
epoch 1: train_loss=0.5189 val_acc=0.8456
epoch 2: train_loss=0.3812 val_acc=0.8643
epoch 3: train_loss=0.3401 val_acc=0.8721
best validation accuracy: 0.8721
A plain MLP on flattened Fashion-MNIST lands in roughly the mid-to-high 80s percent after three epochs (your exact numbers will shift a little with hardware and library version, but seeding makes them stable across reruns on the same machine). A best_mlp.pt file appears in the working directory. The ceiling here is the architecture, not the loop: flattening discards the spatial structure, which is exactly the limitation Chapter 19 removes by swapping the MLP for a convolutional network inside this identical loop.
Stretch Goals
- Run the five-seed honesty check from Section 18.6: wrap the whole script in a loop over five seeds, record the best validation accuracy each time, and report the mean and standard deviation. Then widen the hidden layer (for example 256 to 512) and decide whether the gain exceeds your measured seed-to-seed spread before calling it a real improvement.
- Measure the mixed-precision payoff: time three epochs with and without the autocast plus scaler path (toggle it behind a flag) on a GPU, and report wall-clock and peak memory (torch.cuda.max_memory_allocated()). Confirm the accuracy is unchanged within noise, the verification the chapter insists on.
- Library shortcut, the Right Tool principle in action: reload best_mlp.pt and confirm it still scores your reported accuracy, then rewrite the entire train-and-validate loop using pytorch_lightning (a LightningModule plus a one-line Trainer(precision="16-mixed")) and check you reproduce the same result with the boilerplate of Steps 4 to 6 absorbed by the framework.
Complete Solution
import random

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import v2


def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices


set_seed(42)

device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print("training on:", device)

# Input pipeline: uint8 image -> float [0, 1] -> standardized with dataset statistics.
tf = v2.Compose([
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.2860], std=[0.3530]),
])
train_set = datasets.FashionMNIST(root="data", train=True, download=True, transform=tf)
test_set = datasets.FashionMNIST(root="data", train=False, download=True, transform=tf)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False)


class MLP(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(hidden, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.act(self.fc1(x))
        return self.fc2(x)  # raw logits; the loss applies softmax internally


model = MLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
use_amp = device != "cpu"  # mixed precision only on an accelerator; CPU runs plain FP32
scaler = torch.amp.GradScaler(device, enabled=use_amp)


def train_one_epoch():
    model.train()
    running = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()                                  # beat 1: zero
        with torch.amp.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            logits = model(images)                             # beat 2: forward
            loss = criterion(logits, labels)                   # beat 3: loss
        scaler.scale(loss).backward()                          # beat 4: backward
        scaler.step(optimizer)                                 # beat 5: step
        scaler.update()
        running += loss.item() * images.size(0)
    return running / len(train_loader.dataset)


@torch.no_grad()
def evaluate():
    model.eval()
    correct = total = 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)
        correct += (logits.argmax(1) == labels).sum().item()
        total += labels.size(0)
    return correct / total


best_acc = 0.0
for epoch in range(1, 4):
    train_loss = train_one_epoch()
    val_acc = evaluate()
    print(f"epoch {epoch}: train_loss={train_loss:.4f} val_acc={val_acc:.4f}")
    if val_acc > best_acc:  # keep the best checkpoint, not the last one
        best_acc = val_acc
        torch.save(model.state_dict(), "best_mlp.pt")
print(f"best validation accuracy: {best_acc:.4f}")
What's Next
You will leave this chapter with a working training loop and an MLP that classifies flattened images, badly, because flattening throws away the very spatial structure that makes an image an image. Chapter 19: Convolutional Neural Networks repairs the architecture without touching the loop. It reintroduces the convolution from Chapter 3 as a layer whose kernels are learned rather than hand-tuned, adds pooling and the receptive-field idea, and shows the same loop you built here driving a network that respects the geometry of pixels. From there, Part III is a tour of better functions to plug into this machine: deeper and smarter architectures in Chapter 20, the training recipes and transfer learning of Chapter 21, and the attention-based vision transformers of Chapter 22. Every one of them is trained by the loop in Section 18.5.