Section 18.6: GPUs, Mixed Precision & Reproducibility

"Run me twice and get two different answers, and you do not have a result; you have a rumor. Seed me properly and the rumor becomes evidence."
A Random Number Generator With a Strong Sense of Accountability

Big Picture

Three practical disciplines separate a notebook demo from a real training run: putting the model and data on the GPU so the matrix multiplies run on the right hardware, using automatic mixed precision to roughly double speed and sharply cut activation memory at almost no cost, and seeding every random source so a result you report can be reproduced. This section turns each into a few lines you add to the training loop of Section 18.5, and explains the one subtlety, loss scaling, that FP16 mixed precision requires.

The training loop of the previous section is correct but, on a CPU, slow. Deep learning is fast because the GPU executes the same matrix operations on thousands of cores at once, and the convolutions of Chapter 19 only become practical there. Three concerns dominate once you leave the toy regime: getting the computation onto the GPU at all, getting more of it per second and per gigabyte through reduced precision, and getting the same result twice so a number means something. None is conceptually deep, but each has a sharp edge that bites the unprepared, and getting all three right is what makes the rest of Part III reproducible on your own hardware. The same matrix-multiply hardware that accelerates a network was already lurking under the optimized filtering of Chapter 3; here we drive it deliberately, and Chapter 21 folds these knobs into a full training recipe.

1. Device Placement Beginner

A tensor lives on a device, cpu or cuda (an NVIDIA GPU; Apple silicon exposes mps). An operation requires all its inputs on the same device, and the model's parameters and the input batch must agree, or PyTorch raises a device-mismatch error. The pattern is to detect the device once, move the model there at the start, and move each batch there inside the loop, exactly the two .to(device) calls already present in Section 18.5. The code shows the canonical detection-and-placement idiom.

# The device idiom: detect the best available accelerator once, move the model's
# parameters and each input batch there so the matrix multiplies run on the GPU,
# and pull only scalars back to the CPU for logging.
import torch

# detect the best available device once, at program start
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print("training on:", device)

model = torch.nn.Linear(784, 10).to(device)     # move ALL parameters to the device

x = torch.randn(64, 784).to(device)             # the batch must live there too
out = model(x)                                   # both on device -> runs on GPU
print(out.device)                                # cuda:0  (or mps / cpu)

# moving a result back to the CPU for logging or numpy (random labels, demo only):
y = torch.randint(0, 10, (64,), device=device)
acc_value = (out.argmax(1) == y).float().mean().cpu().item()

Code Fragment 1: The device idiom: the chained cuda/mps/cpu detection picks the best accelerator once, model.to(device) and x.to(device) put both operands on the same device so the matmul runs there, and .cpu().item() pulls a scalar back for logging. The printed out.device confirms the result stayed on the GPU.

The most common beginner error is forgetting to move the batch, which gives a clear "expected all tensors to be on the same device" message. The second is unnecessary transfers: moving tensors back and forth between CPU and GPU inside the inner loop is slow because the transfer crosses the PCIe bus. Keep data on the GPU for the whole forward and backward pass and only pull scalars (losses, metrics) back to the CPU, with .item(), for logging.
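One way to follow that advice is sketched below. The helper name run_epoch, the SGD optimizer, and the random toy batches are illustrative assumptions, not part of the Section 18.5 loop. The loss is accumulated as a tensor on the device, and .item() is called once per epoch rather than once per step.

```python
import torch

def run_epoch(model, batches, optimizer, device):
    """Train for one pass over `batches`; return the mean loss as a Python float."""
    model.train()
    running = torch.zeros((), device=device)   # accumulator lives on the device
    n = 0
    for x, y in batches:
        x, y = x.to(device), y.to(device)      # one host-to-device copy per batch
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        running += loss.detach()               # stays on the device, no transfer
        n += 1
    return (running / max(n, 1)).item()        # single device-to-host transfer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# toy stand-in for a DataLoader: five random batches created on the CPU
batches = [(torch.randn(64, 784), torch.randint(0, 10, (64,))) for _ in range(5)]
mean_loss = run_epoch(model, batches, optimizer, device)
print(f"mean loss over epoch: {mean_loss:.4f}")
```

Logging every step with loss.item() also works; it simply forces the CPU to wait for the GPU each time, which is the cost this pattern avoids.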

2. Mixed Precision: Speed and Memory for Almost Free Intermediate

By default tensors are 32-bit floats (FP32). Modern GPUs compute far faster in 16-bit (FP16 or the more robust BF16) and store half the bytes, so using 16-bit where it is safe roughly doubles throughput and halves activation memory. Automatic mixed precision (AMP) does this selectively: it runs the matrix multiplies and convolutions in 16-bit (where the speedup lives and the precision loss is tolerable) while keeping numerically sensitive operations like the loss reduction and normalization statistics in FP32.

The one wrinkle is that FP16 has a narrow range, and small gradients can underflow to zero, silently stalling learning. The fix is loss scaling: multiply the loss by a large factor before backward so the gradients land in FP16's representable range, then unscale them before the optimizer step. This is exact rather than a hack, because the gradient is linear in the loss. Scaling the loss by a constant $s$ scales every gradient by the same $s$ (the chain rule just carries the constant through), so dividing the gradients back by $s$ before the step recovers precisely the true update. The only effect is that the in-between numbers were large enough for FP16 to represent. PyTorch automates the entire dance with two objects.
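Both halves of that argument can be checked in a few lines on the CPU. This is a small sketch, with the scale factor s = 2**16 and the toy quadratic loss chosen purely for illustration: a value too small for FP16 survives the round trip once it is scaled, and the gradient of a scaled loss divided by the scale matches the gradient of the plain loss.

```python
import torch

s = 2.0 ** 16                                   # hypothetical scale factor (a power of two)

# 1) Underflow: a gradient-sized value below FP16's smallest subnormal (about 6e-8).
tiny = torch.tensor(1e-8)
lost = tiny.half()                              # stored directly in FP16: rounds to zero
kept = (tiny * s).half().float() / s            # scale, store in FP16, unscale in FP32
print("direct FP16:", lost.item(), " scaled round trip:", kept.item())

# 2) Linearity: scaling the loss scales every gradient by the same constant.
def grad_of(scale):
    w = torch.tensor([0.5, -1.5, 2.0], requires_grad=True)
    loss = (w ** 2).sum() * scale               # toy loss, assumed for the demo
    loss.backward()
    return w.grad

g_plain = grad_of(1.0)
g_unscaled = grad_of(s) / s                     # divide the scale back out
print("gradients match:", torch.allclose(g_plain, g_unscaled))
```

GradScaler performs the same scale and unscale steps, and additionally adjusts s at runtime.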

# Automatic mixed precision in one training step: autocast runs the heavy
# matrix work in FP16 while keeping the loss reduction in FP32, and GradScaler
# applies dynamic loss scaling so small FP16 gradients do not underflow to zero.
import torch

device = "cuda"
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.amp.GradScaler(device)          # manages dynamic loss scaling

x = torch.randn(64, 784, device=device)
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.amp.autocast(device_type=device, dtype=torch.float16):
    logits = model(x)                          # matmuls run in FP16
    loss = torch.nn.functional.cross_entropy(logits, y)   # reduction stays FP32

scaler.scale(loss).backward()                  # scale loss -> gradients in FP16 range
scaler.step(optimizer)                         # unscale, then optimizer.step()
scaler.update()                                # adapt the scale factor for next step
print(round(loss.item(), 4))

Code Fragment 2: Automatic mixed precision in the training step: the autocast context runs the model(x) matmuls in FP16 while cross_entropy reduces in FP32, and the three scaler calls (scale, step, update) apply dynamic loss scaling around the backward and optimizer step. These four changes replace the plain FP32 forward and backward of Code Fragment 1 in Section 18.5.

Folding these four lines into the Section 18.5 loop is the entire change needed to train in mixed precision: wrap the forward pass in autocast, and route the backward and step through the scaler. On a recent GPU the speedup is often 1.5 to 3 times with no measurable accuracy loss, and the halved memory lets you fit larger batches or models. BF16, available on newer hardware, has the same range as FP32 and so does not need the scaler at all, which is why it is increasingly the default for training large models. Figure 18.6.1 shows where the precisions differ.
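For comparison, a BF16 version of the same step might look like the sketch below. The helper name bf16_step is hypothetical, and the explicit logits.float() cast is a conservative choice made here so the loss is computed in FP32 on any backend; on CUDA, autocast already handles that for cross_entropy. The sketch falls back to the CPU, where BF16 autocast is also available, when no BF16-capable GPU is present.

```python
import torch

def bf16_step(model, optimizer, x, y, device_type):
    """One BF16 autocast training step; note there is no GradScaler."""
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits = model(x)                      # matmul runs in BF16
    # cast explicitly so the loss is computed in FP32 on every backend
    loss = torch.nn.functional.cross_entropy(logits.float(), y)
    loss.backward()                            # plain backward, no loss scaling
    optimizer.step()
    return loss.item()

use_cuda = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
device_type = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(784, 10).to(device_type)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(64, 784, device=device_type)
y = torch.randint(0, 10, (64,), device=device_type)
loss_value = bf16_step(model, optimizer, x, y, device_type)
print(f"bf16 step loss: {loss_value:.4f}")
```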

Common Misconception: "Mixed Precision Throws Away Accuracy" or "Trains the Whole Model in 16 Bits"

Two opposite errors cluster around AMP. The first is fear: students assume that because FP16 has fewer bits, mixed precision must noticeably degrade a vision model's accuracy, so they leave the speedup on the table. In practice the final accuracy is within run-to-run noise, because AMP is selective, not blanket: it runs only the matmuls and convolutions in 16-bit while keeping the loss reduction, the softmax, and normalization statistics in FP32. The second error is the mirror image: students believe AMP stores the model in 16-bit. It does not. With torch.amp the parameters and the optimizer state stay in FP32; autocast casts to 16-bit on the fly, so the 16-bit values are used only for the forward and backward math and every update is applied to full-precision weights, which is exactly why the accuracy loss the bit count suggests never materializes. "Mixed" is the load-bearing word: the sensitive parts stay full precision. Verify by switching a working FP32 run to AMP and confirming the loss curve is unchanged, then keep the speedup.
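The second misconception is easy to test directly by inspecting dtypes. The sketch below assumes FP16 autocast on a GPU and BF16 autocast on the CPU (a choice made here so it runs anywhere); the layer sizes are arbitrary.

```python
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# assumption for portability: FP16 on a GPU, BF16 on the CPU
amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

model = torch.nn.Linear(784, 10).to(device_type)
x = torch.randn(64, 784, device=device_type)

with torch.autocast(device_type=device_type, dtype=amp_dtype):
    out = model(x)                             # computed in 16-bit
out.float().sum().backward()                   # toy scalar so backward can run

print("parameter dtype: ", model.weight.dtype)        # the stored weights
print("activation dtype:", out.dtype)                 # the autocast output
print("gradient dtype:  ", model.weight.grad.dtype)   # what the optimizer sees
```

The activation comes out in the 16-bit type while the stored weights and their gradients remain torch.float32.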

Figure 18.6.1 The three precisions, by bit layout. FP16 trades exponent bits for a narrow range, which is why its small gradients underflow and a loss scaler is needed. BF16 keeps FP32's 8-bit exponent (so the same range, no scaler required) at the cost of mantissa precision, which is why modern large-model training increasingly prefers it. AMP mixes these with FP32 automatically.
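The numeric consequences of those bit layouts can be read straight from torch.finfo. The short sketch below prints, for each format, the largest value, the smallest normal value, and the machine epsilon (the spacing just above 1.0).

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  max={info.max:.3e}  "
          f"smallest_normal={info.tiny:.3e}  eps={info.eps:.3e}")

# keep the three records around for a direct comparison
fp32, fp16, bf16 = (torch.finfo(d) for d in (torch.float32, torch.float16, torch.bfloat16))
print("BF16 reaches as low as FP32:", bf16.tiny <= fp32.tiny * 1.000001)
print("BF16 is coarser than FP16:  ", bf16.eps > fp16.eps)
```

The smallest_normal column is where FP16 gradients run out of room, and the eps column is the mantissa precision BF16 gives up in exchange.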

Key Insight: Precision Is a Lever, Not a Default

The choice of numeric precision is one of the cheapest large wins in deep learning, and one of the most overlooked. Switching a working FP32 run to AMP typically costs four lines and gives back roughly a 1.5x to 3x speedup and about half the activation memory, with accuracy within noise. The reason it is not simply the silent default is the underflow subtlety: get the loss scaling wrong, or use FP16 where you needed FP32, and training can diverge or quietly fail to learn. Treat precision as a knob you turn deliberately, verify the loss curve is unchanged after switching, and you get the speedup as a near-free lunch.

3. Reproducibility: Making a Number Mean Something Intermediate

Randomness enters training from many sources: weight initialization, data shuffling in the DataLoader (Section 18.4), dropout masks, and augmentation. Run the same script twice without controlling these and you get two different results, which makes it impossible to tell whether a change you made helped or whether you just got luckier on the second run. Reproducibility means pinning every random source so the run is repeatable. The minimum is seeding Python's random, NumPy, and PyTorch (CPU and CUDA); full determinism additionally requires telling cuDNN to use deterministic algorithms and disabling its autotuner. The code packages this into a reusable utility.

# A reusable seeding utility: set_seed pins every random source (Python, NumPy,
# PyTorch CPU and CUDA) and optionally forces deterministic algorithms, while
# seed_worker gives each DataLoader worker its own repeatable seed.
import os, random
import numpy as np
import torch

def set_seed(seed=42, deterministic=True):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                 # seeds CPU and all CUDA devices
    if deterministic:
        # set this before any CUDA matmul runs so cuBLAS picks it up
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # needed for determinism
        torch.use_deterministic_algorithms(True)   # error on nondeterministic ops
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False      # disable autotuner (it varies)

def seed_worker(worker_id):
    # give each DataLoader worker a derived, deterministic seed
    s = torch.initial_seed() % 2**32
    np.random.seed(s); random.seed(s)

set_seed(42)
g = torch.Generator(); g.manual_seed(42)
# pass worker_init_fn=seed_worker and generator=g to the training DataLoader
print("seeded; runs are now repeatable")

Code Fragment 3: A reusable seeding utility: set_seed seeds random, np.random, and torch.manual_seed, then optionally sets use_deterministic_algorithms, the cuDNN flags, and the CUBLAS_WORKSPACE_CONFIG env var for full determinism. The passed Generator fixes the DataLoader's shuffle order and seed_worker fixes each worker's NumPy and random state, the piece most ad-hoc seeding scripts forget.
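To make the last comment of the fragment concrete, here is a sketch of the DataLoader wiring. The toy dataset and the helper name epoch_order are assumptions for illustration. It defaults to num_workers=0 so it runs anywhere; worker_init_fn is only invoked when num_workers is greater than zero.

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # same recipe as the utility above: derive NumPy/random seeds from torch's
    s = torch.initial_seed() % 2**32
    np.random.seed(s); random.seed(s)

def epoch_order(seed, num_workers=0):
    """Return the sample order of one shuffled epoch for a given seed."""
    data = TensorDataset(torch.arange(32))     # toy dataset: items are their own indices
    g = torch.Generator()
    g.manual_seed(seed)                        # fixes the shuffle order
    loader = DataLoader(data, batch_size=8, shuffle=True, num_workers=num_workers,
                        worker_init_fn=seed_worker, generator=g)
    return [int(i) for (batch,) in loader for i in batch]

first, second = epoch_order(42), epoch_order(42)
print("same seed, same order:", first == second)
```

In a real training script the same two keyword arguments go on the training DataLoader, with num_workers set to whatever the machine supports.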

Two honest caveats. Full determinism can be slightly slower (the cuDNN autotuner, disabled above, normally picks the fastest convolution algorithm per shape) and a handful of GPU operations have no deterministic implementation, in which case use_deterministic_algorithms(True) will tell you by raising an error rather than failing silently. The pragmatic stance for research is to always seed (so a result is repeatable on the same machine) and to report results across several seeds (so you can distinguish a real improvement from seed luck), the very practice the experiment-tracking culture of modern vision papers expects. Reproducibility across different GPUs or library versions is harder still and rarely bit-exact; the goal is repeatability on your own setup plus seed-averaged honesty in what you report.
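The two habits, seed for repeatability and report across seeds, fit in one small harness. The sketch below uses a deliberately tiny synthetic regression task on the CPU; seed_all is a hypothetical minimal stand-in for set_seed (seeding only, no determinism flags), and toy_run stands in for a real training run.

```python
import random
import statistics
import numpy as np
import torch

def seed_all(seed):
    # minimal stand-in for set_seed (hypothetical name; seeding only)
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)

def toy_run(seed, steps=50):
    """Train a tiny regression model on synthetic data; return the final loss."""
    seed_all(seed)
    x = torch.randn(256, 8)                                  # synthetic inputs
    y = x @ torch.randn(8, 1) + 0.1 * torch.randn(256, 1)    # noisy linear targets
    model = torch.nn.Linear(8, 1)                            # init depends on the seed
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

same_a, same_b = toy_run(0), toy_run(0)        # same seed twice
losses = [toy_run(s) for s in range(5)]        # five different seeds
print("repeatable:", same_a == same_b)
print(f"mean={statistics.mean(losses):.4f}  std={statistics.stdev(losses):.4f}")
```

Swapping toy_run for a function that trains and evaluates the real model gives the mean and spread to report.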

Fun Note

Every machine learning team eventually lives through the same tragedy in two acts. Act one: a new idea beats the baseline by 0.8 points, champagne is opened, the slide deck is written. Act two: someone reruns it with a different seed, the ranking flips, and the champagne goes flat. The villain was never the method; it was an unmeasured variance hiding behind a single lucky run. One run is an anecdote; the mean over five seeds is a result. Seed it, repeat it, then believe it.

Library Shortcut: One Call for Seeding, One Flag for AMP

Both disciplines in this section have one-liner front ends in the higher-level ecosystem. pytorch_lightning.seed_everything(42, workers=True) covers the seeding that set_seed plus seed_worker do, including the per-worker DataLoader seeding, in a single call; the deterministic-algorithm flags are a separate switch, Trainer(deterministic=True). And mixed precision, the four-line scaler-and-autocast dance, becomes the single argument Trainer(precision="16-mixed") (or "bf16-mixed") in Lightning, or Accelerator(mixed_precision="fp16") followed by accelerator.prepare(...) in Hugging Face Accelerate. As in Section 18.5, the value of writing the explicit version once is that when the abstraction misbehaves (an underflow that the scaler should have caught, a nondeterministic op the seeder could not pin), you know exactly which gear slipped.

Practical Example: The Improvement That Was Pure Seed Luck

Who: A small applied-research group comparing a new augmentation policy against their baseline classifier for a wildlife-camera-trap project.

Situation: The new policy scored 91.4 percent against the baseline's 90.6 percent in a single run each, and the team prepared to ship it as a clear win.

Problem: Neither run was seeded, and nobody had measured run-to-run variance. When a skeptical reviewer asked for a repeat, the baseline scored 91.1 percent and the "improved" policy scored 90.5 percent, so the ranking flipped. The 0.8-point gap was smaller than the noise between seeds.

Decision: The team adopted the set_seed utility, then ran each configuration across five seeds and compared the mean and spread rather than single runs. They reported the augmentation result only after the five-seed mean showed a gap larger than the measured standard deviation.

Result: Across five seeds the new policy genuinely helped, by a smaller and honest margin with non-overlapping spreads, and the team had a number they could defend. The original single-run "win" had been within the noise the whole time.

Lesson: An unseeded single-run comparison is not evidence; it is one sample of a noisy process. Seed for repeatability, then report across several seeds so the spread is visible. A difference smaller than the seed-to-seed variance is not a result, and reproducibility discipline is what tells the two apart.
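The arithmetic behind the lesson can be worked with the four accuracies quoted in the example. Two runs per configuration is far too few for a real comparison, so this is only a sketch of the check; using the larger of the two spreads as the noise yardstick is a simplifying assumption made here.

```python
import statistics

baseline = [90.6, 91.1]      # the two baseline runs quoted in the example
new_policy = [91.4, 90.5]    # the two new-policy runs quoted in the example

def summarize(scores):
    """Mean and sample standard deviation of a list of accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)

(b_mean, b_std), (n_mean, n_std) = summarize(baseline), summarize(new_policy)
gap = n_mean - b_mean
noise = max(b_std, n_std)    # crude yardstick: the larger of the two spreads

print(f"baseline   {b_mean:.2f} +/- {b_std:.2f}")
print(f"new policy {n_mean:.2f} +/- {n_std:.2f}")
print(f"gap {gap:.2f} vs noise {noise:.2f} -> believable: {gap > noise}")
```

Once both runs are counted, the gap in means shrinks to about 0.1 points against spreads several times larger, which is the situation the team's five-seed protocol was designed to expose.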

Research Frontier: FP8, torch.compile, and the Precision Frontier

The precision lever keeps moving down. NVIDIA's Hopper and Blackwell GPUs expose FP8 (8-bit floats) for training, and the Transformer Engine library plus native PyTorch FP8 support (maturing through 2024 to 2026) push large-model training to 8-bit matmuls with per-tensor scaling, another roughly 2x over BF16 where the hardware allows. On the reproducibility and speed side, torch.compile from Section 18.3 composes with AMP to fuse kernels and cut overhead further, and deterministic-mode coverage in PyTorch widens each release so fewer operations raise an error under use_deterministic_algorithms(True). The throughline from this section holds: choose precision deliberately, verify the loss curve is unchanged, and seed so the verification is trustworthy. The numbers in the formats get smaller; the discipline does not change.

Exercise 18.6.1: Why FP16 Needs a Scaler but BF16 Does Not Conceptual

Using the bit layouts of Figure 18.6.1, explain why FP16 gradients are prone to underflowing to zero while BF16 gradients are not, despite BF16 having fewer mantissa bits. Define what loss scaling does to the gradient distribution and why multiplying the loss by a constant before backward, then dividing the gradients by the same constant before the step, is mathematically equivalent to the unscaled update. State the one thing dynamic loss scaling adapts at runtime and what event triggers it to lower the scale.

Exercise 18.6.2: Measure the AMP Speedup Coding

Take the training loop from Section 18.5 and add the AMP changes from subsection 2 behind a flag. On a GPU, train the chapter's MLP (or a small CNN if available) for a fixed number of steps with and without AMP, and report wall-clock time, peak GPU memory (torch.cuda.max_memory_allocated), and final training loss for both. Confirm the loss curves overlap within noise while time and memory drop, and report your measured speedup and memory ratio. If you have a BF16-capable GPU, repeat with dtype=torch.bfloat16 and no scaler.

Exercise 18.6.3: How Much Does the Seed Matter Analysis

Train the chapter's MLP on Fashion-MNIST five times with five different seeds (using set_seed) and record the final validation accuracy each time. Report the mean and standard deviation. Then design a hypothetical "improvement" to the model (for example a wider hidden layer) and state how large its measured gain would have to be, relative to your measured seed-to-seed standard deviation, before you would believe it is real. Connect your reasoning to the seed-luck practical example and to the reporting discipline the chapter recommends.