Section 21.4: Regularization, Schedules & the Modern Training Recipe

"For the first five hundred steps they let me walk slowly so I would not trip. Then they let me run. Then, gently, over thousands of steps, they asked me to slow to a stop right at the finish. I have never trained any other way and I cannot imagine why I once did."
An Optimizer Describing Its Cosine Schedule

Big Picture

Every competitive modern vision result rests on the same small bundle of training techniques used together, and learning them as a single recipe rather than a list of unrelated tricks is what lets you reproduce strong numbers reliably. The recipe has six staples: a learning-rate schedule with warmup and cosine decay, weight decay applied correctly through AdamW, label smoothing to stop overconfidence, stochastic depth to regularize deep stacks, an exponential moving average of the weights for a smoother final model, and mixed-precision training to make it all fast. None of these is individually dramatic; together they are the difference between a baseline and a leaderboard entry. This section assembles them into one coherent, reproducible recipe.

In Section 21.3 you decided what to fine-tune. This section decides how to run the optimization itself. The motivating fact, established in Chapter 20, is that "ResNet strikes back" showed a 2015 ResNet-50 gaining several accuracy points purely from a modern recipe, no architecture change at all. That result is this section's thesis made measurable: the recipe is a first-class part of the model. We will take each ingredient in turn, explain the problem it solves, and then assemble them, because the ingredients interact and the bundle is what works. The illustration below sums up that point: no single layer of the cake is impressive alone, but stacked together they become the recipe.

Illustration: a proud cartoon robot cook stands beside a multi-layered cake, each pastel layer representing one training-recipe ingredient such as the schedule, weight decay, or label smoothing. No single ingredient is dramatic; stacked together they are the difference between a baseline and a leaderboard entry.

1. The Learning-Rate Schedule: Warmup and Cosine Decay Beginner

A constant learning rate is almost never optimal. Early in training the weights are far from any good solution and large steps help, but a freshly-initialized network (especially with the noisy gradients of a new head from Section 21.3) can diverge if the rate starts high. Late in training, large steps cause the loss to bounce around the minimum instead of settling into it. The modern schedule solves both ends. Warmup ramps the learning rate linearly from near zero up to its peak $\eta_{\max}$ over the first few hundred steps, so the network stabilizes before taking big steps. Cosine decay then smoothly lowers the rate along a half-cosine curve from the peak down to near zero over the rest of training, $\eta_t = \frac{1}{2}\eta_{\max}\left(1 + \cos\frac{\pi t}{T}\right)$, where $t$ counts the steps taken since warmup ended and $T$ is the number of decay steps. The rate therefore equals $\eta_{\max}$ at $t = 0$ and reaches zero at $t = T$, so the model settles gently into the minimum.

Warmup looks like a fussy detail until you watch a run try to skip it. Launch a vision transformer straight at its peak learning rate with no warmup and the loss often does not fail gently; it can explode to NaN within the first handful of steps. The freshly-initialized attention layers can produce very large gradients, one full-size step pushes the weights past any usable range, and the run is dead before it has seen a single full epoch. The same network with a few hundred steps of linear warmup trains to a strong result. The fix and the catastrophe are separated by one short ramp at the very start, which is why warmup, once an obscure trick, is now treated as non-negotiable for transformers and is increasingly standard for CNNs too.

Figure 21.4.1 shows the combined warmup-then-cosine curve that is now the default in essentially every vision recipe.

Figure 21.4.1: The standard warmup-then-cosine learning-rate schedule. A short linear warmup ramps from near zero to the peak so a fresh network does not diverge, then a half-cosine curve decays smoothly to near zero so the model settles cleanly into the minimum. This single schedule replaced the hand-tuned step-decay drops of earlier eras.
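The curve in Figure 21.4.1 can be written as one small function. The following is a minimal sketch in plain Python, not any library's implementation; the function name lr_at_step, the min_lr floor, and the example values (10,000 total steps, 500 warmup steps, peak rate 1e-3) are illustrative assumptions.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate for a linear-warmup, half-cosine-decay schedule."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # t counts steps since warmup ended; T is the number of decay steps.
    t = step - warmup_steps
    T = max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / T))
    return min_lr + (peak_lr - min_lr) * cosine

# Hypothetical run: 10,000 steps, 500 of them warmup, peak rate 1e-3.
schedule = [lr_at_step(s, total_steps=10_000, warmup_steps=500, peak_lr=1e-3)
            for s in range(10_000)]
```

Plotting schedule against the step index should reproduce the shape in the figure: a straight ramp, a flat shoulder at the peak, and a smooth descent whose midpoint sits at half the peak rate.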

2. Weight Decay Done Right: AdamW Intermediate

Weight decay shrinks weights toward zero each step, a regularizer that discourages large weights and improves generalization. For plain stochastic gradient descent (SGD), the gradient-following optimizer introduced in Section 18.2, adding an L2 penalty to the loss (an extra term proportional to the sum of squared weights, which the gradient then shrinks) and applying weight decay are mathematically the same operation. For adaptive optimizers like Adam, which scale each parameter's step by a running estimate of its own gradient magnitude (also from Section 18.2), they are not, and this surprised the field for years. If the L2 penalty rides inside the gradient, Adam's per-parameter scaling shrinks it too, so parameters with large gradients are decayed less than intended. The fix, AdamW, decouples the decay: it applies the adaptive gradient step and then separately shrinks the weights by a fixed factor, $\theta \leftarrow \theta - \eta\,\lambda\,\theta$, independent of the gradient scaling. AdamW is now the default optimizer for training the vision transformers of Chapter 22 and modern CNNs alike.
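The uneven shrinkage is easy to see numerically. The sketch below is a deliberate simplification: it ignores momentum and bias correction and treats Adam's per-parameter scale as a given number, grad_rms (a hypothetical stand-in for the square root of the second-moment estimate). Both function names are illustrative.

```python
def shrink_l2_inside_adam(theta, lr, lam, grad_rms, eps=1e-8):
    """Shrinkage from an in-loss L2 term: lam * theta is divided by Adam's scale."""
    return lr * lam * theta / (grad_rms + eps)

def shrink_adamw(theta, lr, lam):
    """Decoupled decay: theta <- theta - lr * lam * theta, no gradient scaling."""
    return lr * lam * theta

theta, lr, lam = 1.0, 1e-3, 0.05
for grad_rms in (0.01, 1.0, 100.0):
    l2 = shrink_l2_inside_adam(theta, lr, lam, grad_rms)
    dw = shrink_adamw(theta, lr, lam)
    print(f"grad_rms={grad_rms:>6}: L2-in-Adam shrink={l2:.2e}, AdamW shrink={dw:.2e}")
```

Under these assumptions the in-loss penalty shrinks a small-gradient weight far more than a large-gradient one, while the decoupled term is the same for every weight of equal size.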

Key Insight: Do Not Decay Norms and Biases

A subtle but important detail: weight decay should be applied to the convolution and linear weight matrices, but generally not to bias terms or to the scale and shift parameters of normalization layers (the batch-norm and layer-norm parameters from Chapter 19). Decaying a normalization layer's learned scale toward zero fights against the very calibration the layer exists to provide, and decaying biases offers no real regularization benefit. Production recipes split parameters into two groups, one with weight decay and one without, a five-line distinction that is easy to miss and quietly costs accuracy when omitted.

3. Label Smoothing, Stochastic Depth, and EMA Intermediate

Three more regularizers round out the bundle. Label smoothing replaces the hard one-hot target with a softened one: instead of asking the model to predict probability $1.0$ for the true class and $0.0$ elsewhere, it asks for $1 - \epsilon$ on the true class and $\epsilon / (K-1)$ on each of the other $K - 1$ classes, with $\epsilon$ around $0.1$. (PyTorch's label_smoothing argument implements a close variant that mixes the one-hot target with a uniform distribution over all $K$ classes, giving $1 - \epsilon + \epsilon/K$ on the true class and $\epsilon/K$ elsewhere.) This stops the model from driving its logits (the raw pre-softmax scores from the classifier head, introduced with the cross-entropy loss in Section 18.5) toward infinity in pursuit of perfect confidence, which improves both generalization and the calibration we discussed for MixUp in Section 21.2. Stochastic depth randomly drops entire residual blocks during training (replacing them with the identity skip connection from Chapter 20), which regularizes very deep networks and, as a bonus, speeds up training since dropped blocks are not computed.
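Both ideas fit in a few lines of plain Python. This is a sketch under stated assumptions rather than a library implementation: scalars stand in for tensors, the function names are hypothetical, the uniform_over_all flag selects the variant that spreads $\epsilon$ over all $K$ classes, and the rescaling of surviving branches by $1/(1 - p)$ is one common convention (the alternative is to scale the branch at evaluation time instead).

```python
import random

def smooth_targets(true_class, num_classes, eps=0.1, uniform_over_all=False):
    """Return the softened target distribution as a list of probabilities."""
    K = num_classes
    if uniform_over_all:
        # Variant: mix the one-hot target with a uniform distribution over all K.
        off, on = eps / K, 1.0 - eps + eps / K
    else:
        # Convention in the text: 1 - eps on the true class, eps / (K - 1) elsewhere.
        off, on = eps / (K - 1), 1.0 - eps
    return [on if k == true_class else off for k in range(K)]

def stochastic_depth(x, branch, drop_prob, training, rng=random):
    """Residual block output x + branch(x), with the branch randomly dropped."""
    if not training or drop_prob == 0.0:
        return x + branch(x)
    if rng.random() < drop_prob:
        return x                       # block skipped: identity only, no compute
    # Assumption: rescale surviving branches so the expected output is unchanged.
    return x + branch(x) / (1.0 - drop_prob)

print(smooth_targets(true_class=2, num_classes=5))
```

Either smoothing convention yields a valid probability distribution; they differ only in whether the true class also receives a share of $\epsilon$. Real stochastic-depth layers usually draw the keep-or-drop decision per sample in the batch rather than once per call.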

Exponential moving average (EMA) of the weights maintains a slowly-updated shadow copy, $\theta_{\text{EMA}} \leftarrow \alpha\,\theta_{\text{EMA}} + (1-\alpha)\,\theta$ with $\alpha$ near $0.9999$, and uses that averaged copy for evaluation. Because the optimizer's iterates bounce around the minimum, the average sits closer to the center of the basin than any single iterate, typically buying a fraction of a point of accuracy for almost no cost. Both techniques are short in PyTorch: label smoothing is a one-argument change, and a minimal EMA is a small helper class.

import torch
import torch.nn as nn

# Label smoothing is a single argument to the standard loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# A minimal model-EMA: keep a shadow copy, update it after each optimizer step.
class ModelEMA:
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = {k: v.detach().clone()
                       for k, v in model.state_dict().items()}

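    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                # shadow <- decay * shadow + (1 - decay) * current
                self.shadow[k].mul_(self.decay).add_(v, alpha=1 - self.decay)
            else:
                # Integer buffers (e.g. BatchNorm's num_batches_tracked) are copied.
                self.shadow[k].copy_(v)

# Usage: opt.step(); ema.update(model)
# To evaluate, load the averaged weights into a copy of the model (needs `import copy`):
#   eval_model = copy.deepcopy(model); eval_model.load_state_dict(ema.shadow)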

Code Fragment 1: Label smoothing as a one-argument loss change and a compact model-EMA. Passing label_smoothing=0.1 to CrossEntropyLoss softens the targets, while ModelEMA.update blends each weight into a decay=0.9999 shadow copy after every step. The EMA shadow weights, used only at evaluation, sit nearer the center of the loss basin than any single training iterate.

Common Misconception: "Every Recipe Ingredient Is Free Accuracy, So Stack Them All At Maximum"

Because this bundle reliably lifts large-scale ImageNet results, learners often treat each ingredient as a one-way accuracy switch and crank weight decay, label smoothing, and stochastic depth as high as they go on every project. In fact every one of these is a regularizer, and a regularizer only helps when the model is overfitting; on a small dataset that is already underfitting, the same heavy label smoothing and high drop-path rate make the model fit worse, the mirror image of the augmentation-strength tradeoff in Section 21.2. These values are tuned to the data scale they were published on: the $\epsilon = 0.1$ smoothing and $0.1$ drop-path that help a 300-epoch ImageNet run can quietly hurt a few-hundred-image transfer task. Treat the recipe as a starting point whose strengths you reduce when you see underfitting, not a fixed list to apply at full magnitude everywhere.

Fun Fact

An EMA decay of $0.9999$ sounds like rounding error, but it means each new set of weights contributes only one ten-thousandth of the average, so the shadow copy effectively remembers roughly the last ten thousand training steps. The model you finally ship is therefore never a model that actually existed during training; it is a committee average of ten thousand slightly different networks that briefly flickered into being and were never evaluated on their own. The single best-performing weights of your run usually belong to a network nobody ever trained directly.
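The "roughly ten thousand steps" figure comes from simple arithmetic on the EMA weights, sketched below. The helper names are illustrative; the only facts used are that the iterate from $k$ steps ago carries weight $(1-\alpha)\alpha^k$ and that these weights form a geometric series.

```python
def ema_weight(steps_ago, decay):
    """Weight the EMA gives to the iterate from `steps_ago` steps back."""
    return (1.0 - decay) * decay ** steps_ago

def ema_mass_in_last(n_steps, decay):
    """Total weight carried by the most recent n_steps iterates."""
    return 1.0 - decay ** n_steps

decay = 0.9999
horizon = round(1.0 / (1.0 - decay))          # the 1 / (1 - decay) rule of thumb
print(horizon, ema_mass_in_last(horizon, decay))
```

The most recent ten thousand iterates carry a little under two thirds of the total weight, with the remainder trailing off further back. The same arithmetic is a caution for short runs: after only 2,000 steps, a shadow copy that started from the initial weights still gives them about 80% of the weight, so a smaller decay may be worth considering there.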

4. Mixed Precision and the Assembled Recipe Advanced

The last ingredient is about speed, not accuracy. Mixed-precision training runs most operations in 16-bit floating point while keeping a 32-bit master copy of the weights and a loss scaler that prevents tiny gradients from underflowing to zero. Sixteen-bit tensors take half the memory, and on modern GPUs with tensor cores (the specialized hardware units that multiply low-precision matrices very fast) throughput can roughly double. In practice this substantially reduces activation memory and noticeably speeds up training, usually with no measurable accuracy loss, which is why it is standard; it is the same mechanism first met in Section 18.6. PyTorch 2.x provides it through the device-generic torch.amp.autocast and a gradient scaler. With every ingredient introduced, the assembled recipe below is close to what timm and other production frameworks run by default.

import torch
from torch.amp import autocast, GradScaler         # PyTorch 2.x device-generic AMP
from torch.optim.lr_scheduler import OneCycleLR   # warmup + cosine-like decay

model = model.cuda()   # `model` and `loader` are assumed to be defined earlier
# Split params: decay weights, no-decay for norms/biases (subsection 2 insight).
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if p.ndim <= 1 or name.endswith("bias") else decay).append(p)
opt = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.05},
     {"params": no_decay, "weight_decay": 0.0}], lr=1e-3)

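steps_per_epoch, epochs = len(loader), 100
# OneCycleLR defaults: a cosine-shaped ramp from max_lr/25 up to max_lr, then cosine
# decay to near zero. cycle_momentum=False keeps AdamW's betas fixed.
sched = OneCycleLR(opt, max_lr=1e-3, total_steps=steps_per_epoch * epochs,
                   pct_start=0.05, cycle_momentum=False)  # 5% of steps as warmup
scaler = GradScaler("cuda")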
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

model.train()
for epoch in range(epochs):
    for images, labels in loader:               # loader already does RandAugment
        images, labels = images.cuda(), labels.cuda()
        opt.zero_grad()
        with autocast("cuda"):                   # run forward in fp16
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()            # scaled backward to avoid underflow
        scaler.step(opt)                         # unscales grads; skips the step on inf/NaN
        scaler.update()                          # adjusts the loss scale for the next step
        sched.step()                             # advance the schedule once per batch

Code Fragment 2: The modern recipe assembled in one loop. The named_parameters split sends norms and biases (p.ndim <= 1) to a zero-decay group, OneCycleLR provides the warmup-plus-decay schedule, label_smoothing=0.1 softens targets, and autocast with GradScaler runs the forward pass in fp16. Adding MixUp from Section 21.2 and the model EMA from Code Fragment 1 completes the timm-style recipe.
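Why the loss scaler matters can be shown without a GPU. The sketch below uses Python's standard struct module to round numbers to IEEE half precision; the gradient value 1e-8 and the scale of 2^16 are illustrative assumptions, and to_fp16 is a hypothetical helper, not part of PyTorch.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8                      # hypothetical tiny gradient value
scale = 2.0 ** 16                # illustrative loss scale

unscaled = to_fp16(grad)                    # underflows to zero in fp16
recovered = to_fp16(grad * scale) / scale   # survives fp16, unscaled in full precision
print(unscaled, recovered)
```

Multiplying the loss by the scale multiplies every gradient by the same factor, which lifts small values into the range fp16 can represent; dividing by the scale afterwards, in higher precision, recovers the original magnitude to within fp16 rounding error.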

Key Insight: The Recipe Is Five Regularizers and One Speed Trick

A compact way to remember the bundle is to sort it by what each ingredient is for. Five of the six fight overfitting or instability: the warmup-plus-cosine schedule stabilizes the start and settles the end, AdamW decay shrinks the weights correctly, label smoothing caps overconfidence, stochastic depth regularizes the deep stack, and EMA averages toward the center of the basin. Only the sixth, mixed precision, does not touch accuracy at all; it buys speed and memory. So the one-line mental model is "five regularizers and one speed trick, used together," and the practical example below shows the cost of dropping even a couple of the five.

Practical Example: The Six-Point Gap That Was Pure Recipe

Who: a research engineer reproducing a published ConvNeXt-Tiny ImageNet result, 2024. Situation: their from-scratch training plateaued about six points below the paper's reported top-1 accuracy, using the identical architecture and dataset. Problem: the architecture was byte-for-byte correct, so the gap had to be elsewhere. Decision: they diffed their training script against the paper's recipe line by line and found three omissions: no label smoothing, plain Adam instead of AdamW with split decay groups, and a step-decay schedule instead of warmup-plus-cosine. They added all three and re-ran. Result: accuracy climbed to within a few tenths of a point of the published number, no architecture change whatsoever. Lesson: when a reproduction misses by several points, suspect the recipe before the architecture. The bundle in this section is not optional polish; it is load-bearing, and missing even a couple of ingredients can account for the entire gap.

Library Shortcut: The Entire Recipe via the timm Training Script

The loop above is correct but still leaves you wiring EMA, MixUp, and schedulers together. timm's reference training script accepts the whole recipe as command-line flags, turning hundreds of lines into one invocation:

# timm's train.py reproduces the full modern recipe from flags:
#   python train.py /imagenet \
#     --model convnext_tiny --opt adamw --weight-decay 0.05 \
#     --sched cosine --warmup-epochs 5 --epochs 300 \
#     --smoothing 0.1 --mixup 0.8 --cutmix 1.0 \
#     --aa rand-m9-mstd0.5 --drop-path 0.1 --model-ema --amp

# Programmatic equivalents of the key pieces:
from timm.scheduler import CosineLRScheduler
from timm.loss import SoftTargetCrossEntropy   # for MixUp/CutMix soft labels
from timm.utils import ModelEmaV2
from timm.data import Mixup

Code Fragment 3: The entire recipe expressed as timm train.py flags. The commented invocation maps each ingredient of this section to a flag (--opt adamw, --sched cosine, --smoothing, --mixup, --drop-path, --model-ema, --amp), and the imports below name the programmatic equivalents (CosineLRScheduler, SoftTargetCrossEntropy, ModelEmaV2, Mixup) for wiring them into your own loop.

timm bundles every ingredient of this section, validated together: AdamW with the no-decay split, the cosine scheduler with warmup, label smoothing, MixUp and CutMix, RandAugment, stochastic depth (--drop-path), model EMA, and mixed precision (--amp). For reproducing or fine-tuning a published result, matching its timm flags is the single most reliable path.

Research Frontier: Optimizers Beyond AdamW

AdamW has been the default for years, but the 2020s brought serious challengers. The Lion optimizer (discovered by symbolic program search) uses only the sign of a momentum term, halving optimizer memory while matching or beating AdamW on large vision and language models. Sharpness-Aware Minimization (SAM) explicitly seeks flat minima by first taking a trial step toward the local worst case and then updating with the gradient measured there, improving generalization at roughly double the compute. Newer second-order-inspired methods like Sophia and the matrix-preconditioned Muon push training efficiency further still. The schedule story is also evolving: schedule-free optimizers aim to remove the learning-rate decay schedule entirely. The recipe in this section is the robust 2026 default, but the optimizer slot is the most actively contested ingredient, and it is worth re-checking the leaderboards when you start a large training run.
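To make "only the sign of a momentum term" concrete, here is a scalar sketch of a Lion-style step as the update rule is commonly described. It is an illustration, not a reference implementation: the function names, the scalar parameter, and the default hyperparameter values are assumptions for the example.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def lion_step(theta, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One Lion-style update on a scalar parameter; returns (theta, momentum)."""
    # The update direction is only the sign of an interpolated momentum.
    direction = sign(beta1 * momentum + (1.0 - beta1) * grad)
    # Decoupled weight decay, applied the same way as in AdamW.
    theta = theta - lr * (direction + weight_decay * theta)
    # The single state buffer: an exponential moving average of gradients.
    momentum = beta2 * momentum + (1.0 - beta2) * grad
    return theta, momentum

theta, m = lion_step(theta=1.0, grad=0.003, momentum=0.0)
```

Two properties follow from the sketch: every parameter moves by exactly the learning rate regardless of gradient magnitude, and the optimizer stores one buffer per parameter where Adam stores two, which is the source of the memory saving.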

Exercise 21.4.1: Why Decouple the Decay? Conceptual

Explain, with the update equations, why L2 regularization and weight decay coincide for plain SGD but diverge for Adam. Specifically, show how Adam's per-parameter gradient scaling causes an in-loss L2 penalty to be applied unevenly across parameters, and how AdamW's decoupled decay restores the intended uniform shrinkage. Then state in one sentence why this matters more for parameters with consistently large gradients.

Exercise 21.4.2: Ablate the Recipe Coding

Train a ResNet-18 on CIFAR-100 with the full recipe from subsection 4, then re-train it five more times, each time removing exactly one ingredient (warmup, cosine decay, label smoothing, the no-decay parameter split, mixed precision). Report final test accuracy and wall-clock time for each run. Rank the ingredients by how much accuracy each one contributed, and note which one only affected speed. Relate your ranking to the practical example's six-point gap.

Exercise 21.4.3: Plot the Schedule You Are Running Analysis

Instantiate the OneCycleLR scheduler from subsection 4 (or timm's CosineLRScheduler) and, without training anything, step it through all total_steps iterations while recording opt.param_groups[0]['lr'] at each step. Plot the learning rate against step number and confirm it matches the warmup-then-cosine shape of Figure 21.4.1. Then change the warmup fraction from 5% to 0% and re-plot, and write two sentences predicting what would go wrong in early training without the warmup, referring to the fresh-head argument of subsection 1.