Part III: Deep Learning for Computer Vision
Chapter 20: CNN Architectures: From LeNet to ConvNeXt

Choosing an Architecture in Practice

"The leaderboard crowns the network with the most accuracy. The pager at 3 a.m. crowns the network that fit in memory and answered in time. Choose the one that will not wake you up."

A Backbone That Has Been On Call
Big Picture

After a decade of designs, the practical bottleneck is no longer inventing an architecture but choosing one. The right choice is almost never the most accurate network on a leaderboard; it is the cheapest network that clears your accuracy bar under your real latency, memory, and data constraints. This section turns the chapter's history into a decision procedure: read the four numbers that describe any model (parameters, FLOPs, latency, accuracy), find the accuracy-versus-cost frontier, start from a pretrained backbone in almost every case, and reach for transfer learning before training from scratch. The deliverable is a checklist you can apply to a real project on Monday.

You now know the great architectures and, more usefully, the bottleneck each removed. Section 20.5 left you with the entangled-recipe lesson: architecture and training are inseparable, and the modern workflow rarely starts from a blank network. This closing section is deliberately practical. It assumes you have a task, a hardware target, and a deadline, and it shows how to pick a backbone without re-deriving the field.

1. The Four Numbers on Every Stats Sheet Beginner

Every architecture is summarized by four quantities, and confusing them is the most common selection error. Parameters count the weights; they determine model file size and a rough lower bound on memory, but not speed. FLOPs (floating-point operations, usually reported as multiply-adds per image) estimate compute, but as Section 20.4 warned, they correlate only loosely with wall-clock time because memory bandwidth, not arithmetic, often dominates. Latency is the measured time for one forward pass on your target hardware, the number that actually matters for a real-time system, and it must be measured, not predicted. Accuracy (top-1 on ImageNet, or your task's metric) is the headline, but it is meaningless without the cost it was bought at. Table 20.6.1 grounds these with representative figures.
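To make the first of the four numbers concrete, the sketch below converts a parameter count into an approximate weight-file size. It assumes the file is dominated by the weights themselves and ignores buffers and container metadata. The helper name weights_megabytes is illustrative, and the parameter counts are the approximate figures from Table 20.6.1.

```python
# Rough weight-file size from a parameter count.
# Assumption: the stored file is dominated by the weights; buffers, optimizer
# state, and container metadata are ignored.

BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}

def weights_megabytes(params_millions: float, dtype: str = "fp32") -> float:
    """Approximate size in MB (10**6 bytes) of the stored weights."""
    return params_millions * 1e6 * BYTES_PER_WEIGHT[dtype] / 1e6

# Approximate parameter counts in millions, taken from Table 20.6.1.
PARAMS_M = {"MobileNetV3-Small": 2.5, "ResNet-50": 25.6, "ConvNeXt-Base": 88.6}

for name, p in PARAMS_M.items():
    sizes = ", ".join(
        f"{dtype}: {weights_megabytes(p, dtype):6.1f} MB" for dtype in BYTES_PER_WEIGHT
    )
    print(f"{name:18s} {sizes}")
```

The estimate says nothing about speed: it is a statement about storage and a floor on memory, which is exactly the limited role the parameter count plays in selection.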

Common Misconception: "Parameters, FLOPs, and latency all measure the same 'size'"

A frequent selection error is treating the first three numbers as one axis, so that "more parameters" is assumed to mean "more FLOPs" and "more latency". They can move independently, and vision architectures are full of counterexamples you have already met. VGG-16 (Section 20.2) carries about 138 million parameters, most of them in a single dense layer that costs almost nothing per image, so its parameter count badly overstates its compute. MobileNetV3 has very few parameters and very few FLOPs, yet its depthwise convolutions are memory-bandwidth-bound, so its latency does not fall as far as its FLOP count promises (Section 20.4). Parameters bound model file size, FLOPs estimate arithmetic, and latency is the wall-clock time you must measure on the real chip; reading any one of them as a stand-in for the others is how a model that looks cheap on the stats sheet blows your latency budget in production.
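The VGG-16 counterexample can be checked with a few lines of arithmetic. The sketch below compares one early 3x3 convolution with the first dense layer, using the standard VGG-16 layer shapes at 224x224 input and ignoring biases. The helper names conv_layer and dense_layer are illustrative, not part of any library.

```python
# Parameters versus multiply-adds (MACs) for two VGG-16 layers.
# Assumptions: standard VGG-16 shapes, 224x224 input, biases ignored.

def conv_layer(c_in, c_out, k, out_h, out_w):
    """Return (weights, MACs) for a k x k convolution."""
    weights = k * k * c_in * c_out
    macs = weights * out_h * out_w   # every weight is reused at each output position
    return weights, macs

def dense_layer(n_in, n_out):
    """Return (weights, MACs) for a fully connected layer."""
    weights = n_in * n_out
    return weights, weights          # each weight is used exactly once per image

conv_w, conv_macs = conv_layer(64, 64, 3, 224, 224)   # second conv of the first block
fc_w, fc_macs = dense_layer(7 * 7 * 512, 4096)        # first dense layer

print(f"conv 3x3, 64->64  : {conv_w:>11,} weights  {conv_macs:>13,} MACs")
print(f"dense 25088->4096 : {fc_w:>11,} weights  {fc_macs:>13,} MACs")
print(f"dense/conv weight ratio: {fc_w / conv_w:,.0f}x")
print(f"conv/dense MAC ratio   : {conv_macs / fc_macs:.0f}x")
```

The dense layer holds roughly 2,800 times more weights than the convolution, yet the convolution costs 18 times more multiply-adds, because a convolution reuses each weight at every spatial position while a dense layer uses each weight once.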

Table 20.6.1: Representative ImageNet backbones (approximate figures, 224x224 input). Use as relative guidance; exact numbers depend on weights and hardware.
Model               Params (M)   FLOPs (G)   Top-1 (%)   Typical use
MobileNetV3-Small   2.5          0.06        67.7        phones, embedded, edge cameras
EfficientNet-B0     5.3          0.39        77.7        strong accuracy per FLOP, mobile
ResNet-50           25.6         4.1         80.9        the default server backbone
ConvNeXt-Tiny       28.6         4.5         82.1        modern accuracy, server
ConvNeXt-Base       88.6         15.4        83.8        accuracy-first, ample compute

Read the table as a frontier, not a ranking. MobileNetV3-Small uses on the order of a couple hundred times fewer FLOPs than ConvNeXt-Base for about sixteen fewer accuracy points, an excellent trade on a phone and a terrible one on a server with idle GPUs. The ResNet-50 row is bolded in your mind for a reason: it remains the field's reference point, well understood, widely supported, and with the modern weights of Section 20.3 still competitive. Figure 20.6.1 plots the frontier these rows trace.

[Figure: top-1 accuracy (%, axis from 66 to 84) plotted against compute (GFLOPs, log scale). Points on the efficient frontier: MobileNetV3-S (67.7%), EfficientNet-B0 (77.7%), ResNet-50 (80.9%), ConvNeXt-T (82.1%), ConvNeXt-B (83.8%); one gray point marks a dominated model.]
Figure 20.6.1: The accuracy-versus-compute frontier. Models on the green frontier give the most accuracy for their compute; anything below and to the right (the gray point) is dominated, beaten on both axes by a frontier model. Architecture selection is choosing the frontier point that meets your accuracy bar at the lowest cost your budget allows.
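The dominance test in the caption is simple enough to state as code. The sketch below is one possible formulation, not a library routine: the Model tuple and the dominates function are illustrative names, the two real rows come from Table 20.6.1, and the third model is a made-up stand-in for the gray point.

```python
# Dominance test behind the accuracy-versus-compute frontier.
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    gflops: float
    top1: float

def dominates(a: Model, b: Model) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.gflops <= b.gflops and a.top1 >= b.top1
    strictly_better = a.gflops < b.gflops or a.top1 > b.top1
    return no_worse and strictly_better

resnet50 = Model("ResNet-50", 4.1, 80.9)             # Table 20.6.1
convnext_t = Model("ConvNeXt-Tiny", 4.5, 82.1)       # Table 20.6.1
gray_point = Model("hypothetical-model", 8.0, 79.0)  # made-up stand-in for the gray point

print(dominates(resnet50, gray_point))   # True: cheaper and more accurate
print(dominates(resnet50, convnext_t))   # False: cheaper but less accurate, a trade
```

Two frontier models never dominate each other; moving between them always trades cost for accuracy. A model is worth discarding outright only when some other candidate dominates it.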

2. Picking a Backbone with timm Beginner

In practice you do not implement any of this chapter's networks; you select one from a library and load pretrained weights. The timm library is the standard catalogue, with hundreds of backbones, benchmarked accuracy and throughput tables, and a one-line model factory. The workflow is: filter the catalogue to candidates that fit your cost budget, load each pretrained, measure latency on your hardware, and pick the cheapest that clears your accuracy target after fine-tuning.

# Benchmark candidate backbones on YOUR hardware, the number that actually
# decides a deployment. Warm-up passes absorb one-time compile and allocation
# costs so the timed loop reports steady-state latency, not first-call overhead.
import timm, torch, time

# 1. Shortlist candidate backbones by their timm model names:
candidates = ["mobilenetv3_large_100", "efficientnet_b0", "resnet50", "convnext_tiny"]

# 2. Load each pretrained and measure latency on YOUR hardware (the real metric):
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 3, 224, 224, device=device)
for name in candidates:
    model = timm.create_model(name, pretrained=True).eval().to(device)
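    with torch.no_grad():
        for _ in range(5):           # warm up (kernels compile, caches fill)
            model(x)
        if device == "cuda":
            torch.cuda.synchronize() # finish warm-up work before starting the clock
        t0 = time.perf_counter()
        for _ in range(50):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize() # wait for queued GPU kernels before stopping the clock
    # Stop the clock before any bookkeeping so only forward passes are timed.
    ms = (time.perf_counter() - t0) / 50 * 1000
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name:24s} {n_params:6.1f}M params  {ms:6.2f} ms/image")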
mobilenetv3_large_100       5.5M params    6.41 ms/image
efficientnet_b0             5.3M params    9.12 ms/image
resnet50                   25.6M params   12.83 ms/image
convnext_tiny              28.6M params   15.07 ms/image
Code Fragment 1: A backbone bake-off in one loop, measuring ms/image over 50 timed forward passes after 5 warm-up runs. The warm-up runs matter: the first forward pass pays one-time compilation and allocation costs, so timing it would badly overstate latency. This is how you turn the abstract frontier of Figure 20.6.1 into numbers for your actual deployment target (the latencies shown are illustrative and vary by hardware).
Library Shortcut: Re-Head a Backbone for Your Classes

The single most common production task, taking an ImageNet backbone and adapting it to your own label set, is one argument in timm:

# Adapt a pretrained ImageNet backbone to your own label set in one argument,
# then freeze the trunk and unfreeze only the new head for linear probing. This
# is the most common production task, done without manual layer surgery.
import timm
# Load a pretrained backbone but replace the 1000-class head with your own:
model = timm.create_model("convnext_tiny", pretrained=True, num_classes=37)
# timm swaps the classifier, keeps the pretrained trunk, ready to fine-tune.
# Freeze the trunk for linear probing, or leave it trainable for full fine-tuning:
for p in model.parameters():
    p.requires_grad = False
for p in model.get_classifier().parameters():
    p.requires_grad = True

The library handles re-initializing the head to the right output dimension, leaving the pretrained trunk intact, and exposing get_classifier() so you can freeze or train it. This replaces the error-prone manual surgery of locating the final linear layer and reshaping it, and it is the launching point for the transfer-learning workflow of Chapter 21.

Code Fragment 2: Re-heading a backbone for transfer learning with the single num_classes=37 argument, then setting requires_grad to freeze the trunk and train only get_classifier() for linear probing. The library re-initializes the head and keeps the pretrained trunk intact internally, letting you choose linear probing or full fine-tuning rather than hand-locating the final linear layer.

3. Transfer Learning Is the Default Intermediate

The most important practical fact in this section is that you should almost never train an architecture from scratch. A backbone pretrained on a large dataset has already learned the general visual features (the edge, texture, and part detectors you visualized in Chapter 19), and those features transfer to nearly any image task. Fine-tuning such a backbone on a few thousand of your own images routinely beats training the same architecture from random initialization on the same data, often by a wide margin, and trains in a fraction of the time. The transfer-learning arc of this book runs from here through the foundation backbones of Chapter 25 to the fine-tuned generators of Chapter 34; this is its first concrete payoff.

There are two transfer styles. Linear probing freezes the trunk and trains only a new head; it is fast and data-thrifty, and works best when your data is scarce or very similar to the pretraining domain. Full fine-tuning updates the whole network at a small learning rate; it is stronger when you have more data or a domain shift (medical scans, satellite imagery, microscopy). The decision and its details (learning-rate schedules, layer-wise rates, when to unfreeze) belong to Chapter 21, but the default posture is set here: start pretrained, fine-tune, and consider training from scratch only if a strong reason demands it.
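The difference between the two styles comes down to which parameters receive gradients. The sketch below illustrates this on a tiny stand-in network rather than a real pretrained backbone. TinyNet and set_transfer_style are hypothetical names introduced only for this example, and the layer sizes are arbitrary.

```python
import torch.nn as nn

# Hypothetical stand-in for a pretrained backbone: a small "trunk" plus a "head".
class TinyNet(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.head(self.trunk(x))

def set_transfer_style(model: TinyNet, style: str) -> int:
    """Set requires_grad for the chosen style; return the trainable parameter count."""
    if style not in ("linear_probe", "full_finetune"):
        raise ValueError(f"unknown style: {style}")
    for p in model.parameters():
        p.requires_grad = (style == "full_finetune")
    for p in model.head.parameters():
        p.requires_grad = True       # the new head trains in both styles
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = TinyNet()
print("linear probe :", set_transfer_style(model, "linear_probe"), "trainable parameters")
print("full finetune:", set_transfer_style(model, "full_finetune"), "trainable parameters")
```

On a real backbone the gap is far larger: the head is a sliver of the network, which is why linear probing is so cheap and so resistant to overfitting on small datasets.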

Key Insight: The Cheapest Sufficient Model Wins

Leaderboards optimize a single axis: accuracy. Real systems optimize accuracy subject to latency, memory, power, cost, and maintainability constraints, and the winner is the cheapest model that satisfies all of them, not the most accurate one overall. A two-point accuracy gain that doubles latency and breaks your real-time budget is a loss. Train your instinct to ask "what is the least expensive model that clears the bar?" rather than "what is the most accurate model?", and most architecture decisions answer themselves.

Fun Fact

There is an unofficial law of applied vision: the model you end up shipping is a ResNet-50, no matter which paper you started the project intending to use. It is the off-white paint of backbones, never the most exciting choice in the room, supported everywhere, understood by everyone, and quietly correct far more often than the trend would predict. Reach for the shiny thing if you must, but do not be surprised when the bake-off sends you home with the ResNet.

4. A Decision Checklist Beginner

Putting the chapter together, here is a procedure for choosing an architecture for a new vision project. It is ordered so the cheapest, highest-leverage decisions come first.

  1. State the hard constraints first. Target hardware, maximum latency, memory budget, and the minimum acceptable accuracy. These prune the candidate set before you look at any model.
  2. Default to a pretrained backbone. Begin with a ResNet-50 or ConvNeXt-Tiny on a server, an EfficientNet-B0 or MobileNetV3 on a device. Only deviate with a reason.
  3. Match the backbone to the budget using the frontier. Pick the frontier model (Figure 20.6.1) whose accuracy clears your bar at the lowest cost your hardware allows; do not overshoot.
  4. Re-head and fine-tune, do not train from scratch. Use linear probing if data is scarce or in-domain, full fine-tuning if data is ample or out-of-domain.
  5. Measure latency on the real target, not FLOPs. Run the bake-off above on the actual chip; FLOPs can mislead by an order of magnitude.
  6. Match the recipe before comparing models. Per Section 20.5, never compare two architectures trained with different recipes; hold the recipe fixed or your comparison is meaningless.
  7. Only then consider exotic options. Vision transformers (Chapter 22), quantization and pruning (Chapter 28), or a custom design, when a strong, pretrained CNN provably cannot meet the constraints.
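Items 1 and 3 of the checklist amount to a filter followed by a minimum. The sketch below is one way to express that, under loud assumptions: the Candidate class and cheapest_sufficient function are illustrative names, the parameter counts and accuracies are the approximate figures from Table 20.6.1, and the latencies are hypothetical placeholders that you would replace with your own measurements.

```python
# Checklist items 1 and 3: prune by hard constraints, then take the cheapest survivor.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Candidate:
    name: str
    params_m: float     # Table 20.6.1 (approximate)
    top1: float         # Table 20.6.1 (approximate)
    latency_ms: float   # HYPOTHETICAL placeholder; measure on your own target

CANDIDATES = [
    Candidate("MobileNetV3-Small", 2.5, 67.7, 3.0),
    Candidate("EfficientNet-B0", 5.3, 77.7, 9.0),
    Candidate("ResNet-50", 25.6, 80.9, 13.0),
    Candidate("ConvNeXt-Tiny", 28.6, 82.1, 15.0),
]

def cheapest_sufficient(cands, min_top1, max_latency_ms, max_params_m) -> Optional[Candidate]:
    """Return the lowest-latency candidate that clears every constraint, or None."""
    feasible = [
        c for c in cands
        if c.top1 >= min_top1
        and c.latency_ms <= max_latency_ms
        and c.params_m <= max_params_m
    ]
    return min(feasible, key=lambda c: c.latency_ms, default=None)

# A relaxed latency budget leaves several survivors; the cheapest one wins.
print(cheapest_sufficient(CANDIDATES, min_top1=75.0, max_latency_ms=20.0, max_params_m=50.0))
# A 5 ms budget leaves only a model below the accuracy floor, so nothing qualifies.
print(cheapest_sufficient(CANDIDATES, min_top1=75.0, max_latency_ms=5.0, max_params_m=50.0))
```

Returning None when nothing qualifies is deliberate: an empty feasible set is a signal to renegotiate a constraint or move to item 7, not to ship the least-bad model silently.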

This checklist is exactly what the Hands-On Lab below turns into a runnable tool. You will profile the chapter's backbones on the four numbers, draw the frontier from your own measurements, and then fine-tune the model the checklist chooses, ending with a single decision report you could paste into a project ticket.

Practical Example: The Defect Detector That Did Not Need a Transformer

Who: a manufacturing-vision team building a surface-defect classifier for an inline inspection camera, 2025. Situation: a vendor proposal recommended a large vision transformer, citing its ImageNet supremacy. Problem: the team had only about four thousand labeled defect images, a strict 15-millisecond-per-frame latency budget on an industrial edge box, and no GPU at the line. Decision: they followed the checklist: hard constraints first (CPU edge box, 15 ms, small dataset), so they took a pretrained EfficientNet-B0, re-headed it to their six defect classes, and full-fine-tuned it. Result: 96% accuracy after fine-tuning, 9 ms per frame on the edge CPU, trained in under an hour on a single workstation GPU. The transformer, tested for due diligence, needed far more data to match the accuracy and missed the latency budget badly. Lesson: the leaderboard champion is rarely the project champion. A pretrained efficient CNN, fine-tuned, is the correct first answer for the large majority of applied vision tasks, and the checklist gets you there without a month of experiments.

Research Frontier: The Backbone Is Becoming a Foundation Model

The selection story is shifting under the field's feet from 2024 to 2026. Increasingly the strongest starting point is not an ImageNet-supervised backbone but a foundation model, either self-supervised (DINOv2 and successors) or language-supervised (the SigLIP and CLIP image encoders), pretrained on hundreds of millions of unlabeled or weakly labeled images, whose frozen features rival fully fine-tuned task-specific networks (Chapter 25). Libraries like timm and Hugging Face now serve these alongside the classic CNNs, so the practical question is evolving from "which architecture?" to "which pretrained representation, and do I even need to fine-tune it?". The frontier-and-checklist discipline of this section still applies; what changes is that the cheapest sufficient model is more and more often a frozen foundation backbone with a small trained head. The architectures of this chapter remain the bodies of those foundation models, which is why understanding them is the durable skill.

Hands-On Lab: Build a Backbone Selection Bake-Off
Difficulty: Intermediate Duration: 60 to 90 minutes

Build one self-contained script, backbone_bakeoff.py, that loads several of this chapter's architectures from pretrained weights, measures all four stats-sheet numbers (parameters, FLOPs, measured latency, and published accuracy), plots the accuracy-versus-latency frontier from your own timings, then fine-tunes the model your decision checklist selects on a small dataset and prints a one-paragraph decision report. This is the whole chapter, from LeNet's template to ConvNeXt, turned into the selection tool the section's checklist describes.

What You'll Practice

  • Loading pretrained backbones (ResNet, EfficientNet, MobileNet, ConvNeXt) with a single factory call (subsection 2).
  • Separating the four numbers, parameters, FLOPs, measured latency, and accuracy, and seeing where they disagree (subsection 1, Section 20.4).
  • Benchmarking forward-pass latency correctly, with warm-up and synchronization, and drawing the efficient frontier (Figure 20.6.1).
  • Re-heading a frozen backbone and fine-tuning it on a small custom set instead of training from scratch (subsection 3).
  • Turning measurements into a decision via the section's checklist rather than a leaderboard.

Setup

pip install torch torchvision timm matplotlib scipy

A GPU makes the fine-tuning step faster but is not required; every step runs on CPU, and the timing step deliberately measures whatever hardware you are on. The fine-tuning step uses torchvision.datasets.Flowers102, which downloads automatically on first run (about 350 MB) and relies on scipy to read its label files. FLOP counting in Step 2 uses fvcore if it is installed (pip install fvcore) and is skipped otherwise.

Put the section's selection procedure into practice below. Work the steps in order; each prints a checkpoint so you can confirm progress before moving on. A complete reference solution is folded at the end.

Step 1: Load the candidate backbones from pretrained weights

Assemble the shortlist the checklist would consider on a server with a phone fallback. Loading pretrained weights is the chapter's central practical lesson: you almost never start from scratch.

import timm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
NAMES = ["resnet50", "efficientnet_b0", "mobilenetv3_small_100", "convnext_tiny"]

# TODO: build a dict {name: model} of pretrained backbones in eval mode on `device`.
# Hint: timm.create_model(name, pretrained=True), then .eval().to(device)
models = {}
for name in NAMES:
    ...
print({n: type(m).__name__ for n, m in models.items()})
Hint

timm.create_model(name, pretrained=True) returns a ready model; chain .eval().to(device). Wrap the loop body in a try/except so one missing weight file does not abort the whole shortlist.

Step 2: Count parameters and estimate FLOPs

Two of the four numbers come from the model itself, no hardware needed. Counting them first shows why parameters and FLOPs are not interchangeable, the warning from subsection 1.

def param_count_m(model):
    # TODO: return the total parameter count in millions.
    # Hint: sum p.numel() for p in model.parameters(), divide by 1e6
    ...

# FLOPs via fvcore if available, else skip gracefully.
def gflops(model):
    try:
        from fvcore.nn import FlopCountAnalysis
        x = torch.randn(1, 3, 224, 224, device=next(model.parameters()).device)
        return FlopCountAnalysis(model, x).total() / 1e9
    except Exception:
        return float("nan")

for n, m in models.items():
    print(f"{n:24s} params={param_count_m(m):6.1f}M  GFLOPs={gflops(m):6.2f}")
Hint

sum(p.numel() for p in model.parameters()) / 1e6 gives millions of parameters. FLOP counting needs a dummy input of the model's expected size; fvcore is optional, so degrade to nan rather than crashing if it is not installed.

Step 3: Measure latency the right way

Latency is the only number that must be measured on your hardware. The discipline is warm-up first (to trigger lazy CUDA kernels and cache allocation), then synchronize, then time many runs and take the median.

import time

@torch.no_grad()
def latency_ms(model, runs=30, warmup=5):
    x = torch.randn(1, 3, 224, 224, device=device)
    for _ in range(warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        # TODO: append the elapsed milliseconds for this run to `times`
        ...
    times.sort()
    return times[len(times) // 2]  # median

for n, m in models.items():
    print(f"{n:24s} {latency_ms(m):7.2f} ms on {device}")
Hint

Elapsed milliseconds for one run is (time.perf_counter() - t0) * 1000. The warm-up loop and the synchronize() calls are what separate an honest measurement from a misleading one; without them the first timed run includes one-time setup cost.
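Step 3 reports the median rather than the mean, and a small numeric example shows why. The timings below are hypothetical, chosen so that one run hits a one-off stall such as a background process or a garbage-collection pause.

```python
import statistics

# HYPOTHETICAL per-run latencies in milliseconds; the last run hit a one-off stall.
times_ms = [10.1, 10.0, 9.9, 10.2, 55.0]

mean_ms = statistics.mean(times_ms)
median_ms = statistics.median(times_ms)

print(f"mean   = {mean_ms:.2f} ms")    # pulled upward by the single outlier
print(f"median = {median_ms:.2f} ms")  # reflects the typical run
```

One stalled run nearly doubles the mean while leaving the median untouched, so the median is the more faithful summary of steady-state latency. If your system has a hard per-frame deadline, the slow runs matter too and are worth reporting separately.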

Step 4: Plot the frontier from your own numbers

Pair each model's measured latency with its published top-1 accuracy and draw the scatter. This reproduces Figure 20.6.1 with your hardware on the x-axis, the only frontier that matters for your project.

import matplotlib.pyplot as plt

# Published ImageNet top-1 (%), from torchvision/timm model cards (Table 20.6.1).
ACC = {"resnet50": 80.9, "efficientnet_b0": 77.7,
       "mobilenetv3_small_100": 67.7, "convnext_tiny": 82.1}

lat = {n: latency_ms(m) for n, m in models.items()}

fig, ax = plt.subplots(figsize=(6, 4))
for n in models:
    # TODO: scatter point (lat[n], ACC[n]) and annotate it with the model name
    ...
ax.set_xlabel(f"Measured latency (ms, {device})"); ax.set_ylabel("ImageNet top-1 (%)")
ax.set_title("Accuracy vs measured latency"); plt.tight_layout(); plt.show()
Hint

ax.scatter(lat[n], ACC[n]) plots the point; ax.annotate(n, (lat[n], ACC[n])) labels it. A model is on the frontier if no other model is both faster and more accurate; eyeball it from the plot, or compute it in the stretch goal.

Step 5: Apply the checklist and pick one model

Encode the section's checklist as code: given an accuracy floor and a latency budget, return the cheapest model that clears both. This is the step that converts four numbers into a decision.

def choose(acc, lat, min_acc, max_lat_ms):
    # Keep only models that clear BOTH constraints, then pick the fastest survivor.
    feasible = {n: lat[n] for n in acc if acc[n] >= min_acc and lat[n] <= max_lat_ms}
    # TODO: return the name with the smallest latency among `feasible`,
    #       or None if nothing qualifies. Hint: min(feasible, key=feasible.get)
    ...

pick = choose(ACC, lat, min_acc=75.0, max_lat_ms=max(lat.values()) + 1)
print("Checklist selects:", pick)
Hint

min(feasible, key=feasible.get) returns the key with the smallest value. Guard the empty case with if not feasible: return None so a too-strict budget reports "no model qualifies" instead of raising.

Step 6: Re-head and fine-tune the chosen backbone

The checklist says fine-tune, never train from scratch. Replace the 1000-class ImageNet head with one sized to the new task, freeze the body for a quick linear probe, and train only the head for a few dozen steps.

import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import Flowers102

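# Normalize with the ImageNet mean/std that the pretrained weights expect.
tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])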
train = Flowers102(root="./data", split="train", download=True, transform=tf)
loader = DataLoader(train, batch_size=32, shuffle=True, num_workers=0)

net = timm.create_model(pick, pretrained=True, num_classes=102).to(device)
# TODO: freeze every parameter, then unfreeze ONLY the classifier head
#       (timm exposes it via net.get_classifier()) so this is a linear probe.
...

opt = torch.optim.Adam((p for p in net.parameters() if p.requires_grad), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
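net.train()
step, last_step = 0, 60
# The Flowers102 train split holds only about a thousand images (roughly 32
# batches of 32), so cycle through it for several epochs until step 60.
while step <= last_step:
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad(); loss = loss_fn(net(xb), yb); loss.backward(); opt.step()
        if step % 10 == 0: print(f"step {step:3d}  loss {loss.item():.3f}")
        step += 1
        if step > last_step: break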
Hint

Set p.requires_grad = False for all parameters, then set it back to True for the parameters of net.get_classifier(). Passing only the trainable parameters to the optimizer is what makes this a fast linear probe rather than a full fine-tune; the falling loss confirms the frozen features are already useful, the feature-reuse lesson of subsection 3.

Step 7: Print the decision report

Close the loop with a short, paste-ready summary that justifies the choice with measurements rather than hype, exactly what a reviewer or a project ticket wants.

print("=== Backbone decision report ===")
print(f"Shortlist : {', '.join(models)}")
for n in models:
    print(f"  {n:24s} {param_count_m(models[n]):5.1f}M  "
          f"{lat[n]:6.2f} ms  {ACC[n]:.1f}% top-1")
# TODO: print the selected model, why it won (cheapest that cleared the bar),
#       and the constraints it was chosen under.
...
Hint

One f-string is enough: name the pick, restate the accuracy floor and latency budget it satisfied, and note that it was the cheapest survivor. The point of the report is that the decision is reproducible from the numbers above it.

Expected Output

The script prints a four-row table whose columns disagree in instructive ways: MobileNetV3-Small has the fewest parameters and FLOPs, yet its latency advantage is usually far smaller than its FLOP advantage because its depthwise convolutions are memory-bandwidth-bound (Section 20.4), and at batch size 1 on a GPU it may not even be the fastest model. ResNet-50 and ConvNeXt-Tiny sit near the top for accuracy at higher cost. The frontier plot shows accuracy rising with latency, with any dominated models (slower and less accurate than a competitor) falling below the curve the rest trace. With a 75% accuracy floor and a generous latency budget the checklist typically selects EfficientNet-B0 or ResNet-50; tightening the latency budget flips the choice toward the smaller models. The fine-tuning loop shows cross-entropy loss falling from roughly 4.6 (random over 102 classes) toward 2 or below within a few dozen steps, and the decision report prints a single justified recommendation.
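The starting loss of roughly 4.6 is not arbitrary. A freshly initialized head predicts close to a uniform distribution over the classes, and the cross-entropy of a uniform prediction over K classes is log K. A quick check, assuming only that near-uniform starting point:

```python
import math

num_classes = 102  # Flowers102

# Cross-entropy of a uniform prediction: -log(1 / K) = log(K)
initial_loss = math.log(num_classes)
print(f"expected starting loss ~ {initial_loss:.3f}")
```

If your first printed loss is far above this value, suspect a bug such as mislabeled targets or a wrong number of output classes before suspecting the model.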

Stretch Goals

  • Compute the efficient frontier programmatically: a model is on it when no other model is both faster and more accurate. Mark frontier points and dominated points in different colors on the Step 4 plot.
  • Replace the linear probe in Step 6 with a full fine-tune (unfreeze everything, drop the learning rate to 1e-4), run both for the same step budget on the same data, and report the accuracy gap. This is a useful companion to Exercise 20.6.3's transfer-versus-scratch comparison.
  • Add a vision transformer (vit_tiny_patch16_224) to the shortlist and place it on your frontier, previewing the head-to-head with Chapter 22 and testing whether the checklist still prefers a CNN under a tight latency budget.
Complete Solution
import time
import torch
import torch.nn as nn
import timm
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import Flowers102

device = "cuda" if torch.cuda.is_available() else "cpu"
NAMES = ["resnet50", "efficientnet_b0", "mobilenetv3_small_100", "convnext_tiny"]
ACC = {"resnet50": 80.9, "efficientnet_b0": 77.7,
       "mobilenetv3_small_100": 67.7, "convnext_tiny": 82.1}

# Step 1: load pretrained backbones
models = {}
for name in NAMES:
    try:
        models[name] = timm.create_model(name, pretrained=True).eval().to(device)
    except Exception as e:
        print(f"skip {name}: {e}")

# Step 2: parameters and FLOPs
def param_count_m(model):
    return sum(p.numel() for p in model.parameters()) / 1e6

def gflops(model):
    try:
        from fvcore.nn import FlopCountAnalysis
        x = torch.randn(1, 3, 224, 224, device=next(model.parameters()).device)
        return FlopCountAnalysis(model, x).total() / 1e9
    except Exception:
        return float("nan")

# Step 3: latency
@torch.no_grad()
def latency_ms(model, runs=30, warmup=5):
    x = torch.randn(1, 3, 224, 224, device=device)
    for _ in range(warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]

lat = {n: latency_ms(m) for n, m in models.items()}

# Step 4: frontier plot
fig, ax = plt.subplots(figsize=(6, 4))
for n in models:
    ax.scatter(lat[n], ACC[n])
    ax.annotate(n, (lat[n], ACC[n]), fontsize=8, xytext=(4, 4),
                textcoords="offset points")
ax.set_xlabel(f"Measured latency (ms, {device})")
ax.set_ylabel("ImageNet top-1 (%)")
ax.set_title("Accuracy vs measured latency")
plt.tight_layout(); plt.savefig("frontier.png", dpi=120)

# Step 5: checklist selection
def choose(acc, lat, min_acc, max_lat_ms):
    feasible = {n: lat[n] for n in acc if acc[n] >= min_acc and lat[n] <= max_lat_ms}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)

pick = choose(ACC, lat, min_acc=75.0, max_lat_ms=max(lat.values()) + 1)
print("Checklist selects:", pick)

# Step 6: re-head and linear-probe fine-tune on Flowers102
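# Normalize with the ImageNet mean/std that the pretrained weights expect.
tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])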
train = Flowers102(root="./data", split="train", download=True, transform=tf)
loader = DataLoader(train, batch_size=32, shuffle=True, num_workers=0)

net = timm.create_model(pick, pretrained=True, num_classes=102).to(device)
for p in net.parameters():
    p.requires_grad = False
for p in net.get_classifier().parameters():
    p.requires_grad = True

opt = torch.optim.Adam((p for p in net.parameters() if p.requires_grad), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
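net.train()
step, last_step = 0, 60
# The Flowers102 train split holds only about a thousand images (roughly 32
# batches of 32), so cycle through it for several epochs until step 60.
while step <= last_step:
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = loss_fn(net(xb), yb)
        loss.backward()
        opt.step()
        if step % 10 == 0:
            print(f"step {step:3d}  loss {loss.item():.3f}")
        step += 1
        if step > last_step:
            break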

# Step 7: decision report
print("=== Backbone decision report ===")
print(f"Shortlist : {', '.join(models)}")
for n in models:
    print(f"  {n:24s} {param_count_m(models[n]):5.1f}M  "
          f"GFLOPs={gflops(models[n]):5.2f}  {lat[n]:6.2f} ms  {ACC[n]:.1f}% top-1")
print(f"Selected  : {pick}")
print("Why       : cheapest model that cleared the 75.0% accuracy floor within "
      "the latency budget; fine-tuned, not trained from scratch.")
Exercise 20.6.1: Read a Frontier Conceptual

From Table 20.6.1, identify which models lie on the efficient frontier and which (if any) are dominated. For a hypothetical project requiring at least 80% top-1 accuracy with a strict server FLOP budget of 5 GFLOPs, state which single model you would choose and justify it in two sentences. Then state how your choice changes if the deployment target becomes a battery-powered phone with the same accuracy requirement, and explain why the answer flips.

Exercise 20.6.2: Run the Bake-Off Coding

Run the timm latency bake-off above on your own machine for the four candidate backbones (add a warm-up as shown). Produce a table of parameters, your measured latency, and the published top-1 accuracy. Then plot accuracy against your measured latency (not FLOPs) and draw the frontier. Note any case where the FLOP ranking and your latency ranking disagree, and propose an explanation in terms of memory bandwidth and depthwise convolutions from Section 20.4.

Exercise 20.6.3: Transfer versus Scratch Analysis

On a small dataset (Oxford-IIIT Pets or Flowers-102), train a resnet50 two ways: (a) re-headed from ImageNet-pretrained weights and fine-tuned, and (b) from random initialization, with the identical recipe and epoch budget for both. Plot validation accuracy versus epoch for each, and report the final gap and the epoch at which the pretrained model first exceeds the from-scratch model's best score. Quantify how much data and time transfer learning saved, and connect the result to the feature-reuse argument of subsection 3.