Chapter 29: Tools of the Trade: The Deep Vision Stack

"Twelve chapters ago I was a pile of random weights. Now I am a checkpoint on a hub, downloaded forty thousand times, fine-tuned by strangers, and quietly judged by a leaderboard I never agreed to enter. Tooling, it turns out, is destiny."
A Pretrained Backbone With a Public Download Counter

Chapter Overview

Part III taught you to build deep vision systems from the ground up: tensors and autograd in Chapter 18, convolutions in Chapter 19, the architecture lineage in Chapter 20, training recipes in Chapter 21, transformers in Chapter 22, detection and segmentation in Chapters 23 and 24, self-supervised foundation models in Chapter 25, video and 3D in Chapters 26 and 27, and deployment in Chapter 28. Along the way, almost every chapter reached past the from-scratch implementation for a one-line library equivalent: timm.create_model for a backbone, an Ultralytics call for a detector, a Hugging Face pipeline for a segmenter. This chapter is the pause where we finally name those tools, compare them, and decide when to reach for which.

It is built as a reference, not a narrative. The deep vision stack is broader than the classical one because it has three layers a classical pipeline never needed: a place to get pretrained weights (a model hub), heavyweight frameworks for the hardest tasks (detection and segmentation), and a parallel ecosystem of data and experiment tooling that exists because deep learning is bottlenecked on labeled data and reproducibility, not on algorithms. Each layer has a handful of dominant tools with overlapping but distinct sweet spots, and choosing well is most of what separates a smooth project from a stalled one.

Section 29.1 maps the model hubs and core libraries: torchvision, timm, Hugging Face, and Ultralytics, the four front doors to pretrained vision models, with a decision guide for which to use when. Section 29.2 covers the two heavyweight research frameworks, Detectron2 and MMDetection, that you reach for when an Ultralytics one-liner is not enough and you need to compose your own detector or segmenter from a configurable zoo. Section 29.3 turns to data tooling: annotation platforms, dataset versioning, and the visual debugging tools FiftyOne and Roboflow that find the label errors quietly capping your accuracy. Section 29.4 closes with experiment tracking (Weights & Biases, MLflow, TensorBoard) and a curated, annotated reading list for the whole of Part III.

Read Section 29.1 first; it will stop you from training a backbone you could have downloaded. Keep the rest bookmarked and return when a project needs a custom detector, a label audit, or a way to remember which of last week's forty runs actually worked. This is the third of the book's four "Tools of the Trade" chapters: Chapter 8 consolidated the image-processing stack, Chapter 17 the classical-vision stack, and Chapter 38 will close Part IV with the generative stack.

Big Picture

In deep vision you rarely build a model from random weights; you download one, adapt it, track the experiment, and audit the data, and choosing the right tool for each of those four steps decides whether a project ships in a week or stalls for a month. Four verbs name the whole stack: download, adapt, audit, track. Each section of this chapter equips one of them, and the four together are the deep vision workflow that the from-scratch chapters of Part III earned the right to skip. The theory of Chapters 18 through 28 does not change. The tooling decides whether applying it means writing a training loop or writing a config file, and whether your accuracy ceiling is the model or the mislabeled examples you never looked at.

Figure 29.0.1 lays out that organizing scheme on one page: each of the four verbs maps to one section of this chapter and to the representative tool that section equips. Read it as a map of the whole stack before diving into any single front door, and notice that the four verbs are not a pipeline you run once but a set of records you keep, the four legs of the reproducibility contract that Section 29.4 returns to at the end.

Figure 29.0.1: The chapter on one page. Each of the four verbs that name the deep vision workflow (download, adapt, audit, track) maps to one section and to the representative tool that section equips. The four are not run once and discarded; kept as records (a hub-versioned model, a config, a dataset version, a logged run) they become the four legs of the reproducibility contract that Section 29.4 closes on.

Learning Objectives

Choose among torchvision, timm, Hugging Face, and Ultralytics deliberately, based on task, model coverage, license, and how much control you need.
Load a pretrained backbone and its correct preprocessing transform in a few lines, and recognize the preprocessing-mismatch bug that silently halves accuracy.
Decide when an Ultralytics one-liner suffices and when a configurable framework (Detectron2 or MMDetection) is the right tool, and read their config systems.
Use FiftyOne and Roboflow to visualize a dataset, find label errors and hard cases, and version a dataset across experiments.
Instrument a training run with an experiment tracker so that runs are comparable, reproducible, and recoverable months later.
Build a personal further-reading map for Part III: which hub, paper, or documentation trail to consult for each deep vision topic.

Prerequisites

This chapter consolidates all of Part III, so any of Chapter 18 through Chapter 28 enriches it, but four matter most. Chapter 18: Neural Networks & PyTorch for Vision established the PyTorch nn.Module, tensors, and training loop that every tool here either wraps or expects. Chapter 21: Training Recipes introduced transfer learning, the workflow that makes a model hub useful at all. The framework discussion in Section 29.2 builds directly on the detector and segmenter anatomy from Chapter 23: Object Detection and Chapter 24: Segmentation, and the data-tooling section assumes the metrics (mAP, mIoU) defined there.

Chapter Roadmap

29.1 Model Hubs & Libraries: torchvision, timm, Hugging Face & Ultralytics. The four front doors to pretrained vision models: their model coverage, APIs, licenses, and the preprocessing conventions that bite, with a decision guide.
29.2 Detection & Segmentation Frameworks: Detectron2 & MMDetection. When a one-liner is not enough: the two configurable research frameworks, their model zoos and config systems, and how to assemble a custom detector or segmenter.
29.3 Data Tooling: Annotation, Versioning, FiftyOne & Roboflow. Where accuracy is really won: annotation platforms, dataset versioning, and the visual debugging tools that surface the label errors capping your metrics.
29.4 Experiment Tracking, Curated References & Further Reading. Weights & Biases, MLflow, and TensorBoard for reproducible runs, plus an annotated reading map of the books, courses, and papers behind Part III.

Fun Fact

The model-hub idea is younger than most readers assume. As recently as 2018, getting a pretrained ResNet often meant downloading a Caffe .caffemodel from a lab's web page, converting it with a community script, and hoping the BatchNorm statistics survived. torchvision's model_zoo, timm's first release, and the Hugging Face Hub's pivot to vision all landed within a few years of each other, and the entire "just download the weights" workflow this chapter takes for granted is barely older than the transformer.

The four verbs of this chapter, download, adapt, audit, and track, only become a single skill when you run them once, end to end, on a real dataset. The Hands-On Lab below does exactly that: it builds one small but complete deep vision project that touches every tool in the chapter, so the four sections stop being a reference and become a workflow you have executed.

Hands-On Lab: A Reproducible Deep Vision Pipeline End to End

Duration: about 60 to 90 minutes
Difficulty: Intermediate

Objective

Build one small but complete deep vision project that exercises all four verbs of this chapter on a single real dataset: download a pretrained backbone from a hub (Section 29.1), adapt it by fine-tuning a new classification head on the Oxford-IIIT Pets dataset, track the run so every epoch is logged and recoverable (Section 29.4), and audit the trained model's predictions visually to find the images it gets wrong (Section 29.3). You finish with three artifacts: a fine-tuned classifier, a logged run with a training curve, and a ranked list of the model's hardest mistakes, the same three deliverables a real project hands off.

What You'll Practice

Loading a pretrained backbone and its matching preprocessing transform from the same source, the one discipline of Section 29.1 that prevents the silent accuracy-killing mismatch bug.
Adapting a hub model to a new label set by replacing the classification head and fine-tuning, the transfer-learning workflow of Chapter 21.
Instrumenting a training loop with an experiment tracker so the run logs its config, per-epoch metrics, and a checkpoint artifact (Section 29.4).
Auditing predictions visually with FiftyOne to surface the highest-confidence mistakes that an aggregate accuracy number hides (Section 29.3).
Assembling the four-legged reproducibility contract (model, config, data, run) into one runnable script.

Setup

You need one deep-learning stack plus the two tools the chapter introduces. The Oxford-IIIT Pets dataset (37 cat and dog breeds, about 7,400 images) downloads automatically through torchvision on first run, so there is no manual data step. PyTorch ships the TensorBoard writer (torch.utils.tensorboard), while the tensorboard package that serves the dashboard installs with pip and needs no account; FiftyOne installs the same way. A GPU finishes the suggested three epochs in a few minutes, but the code runs on CPU too (slower). Install with:

pip install torch torchvision timm fiftyone tensorboard

Everything below lives in one script. The model is a timm-loaded backbone fine-tuned only on its head, so it reaches a meaningful accuracy on Pets in a few epochs rather than the hours a full fine-tune would need.

Steps

Step 1: Download a backbone and its matching transform

Load a pretrained backbone from timm and, in the same breath, resolve the exact preprocessing it expects, the inseparable model-and-transform pair of Section 29.1. Resolving the transform from the model (rather than hand-copying constants) is what keeps the input distribution aligned with what the weights were trained on.

import timm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a small, fast backbone pretrained on ImageNet. num_classes=37 swaps in a
# fresh head sized for the 37 Pets breeds; the backbone weights stay pretrained.
model = timm.create_model("resnet18", pretrained=True, num_classes=37).to(device)

# TODO: resolve THIS model's preprocessing config, then build train and eval
# transforms from it. Use timm.data.resolve_model_data_config(model) and
# timm.data.create_transform(**cfg, is_training=True) for training (adds
# augmentation) and is_training=False for evaluation.
cfg = ...
train_tf = ...
eval_tf = ...

Hint

cfg = timm.data.resolve_model_data_config(model), then train_tf = timm.data.create_transform(**cfg, is_training=True) and eval_tf = timm.data.create_transform(**cfg, is_training=False). Passing num_classes=37 to create_model is the head swap: timm discards the 1000-way ImageNet classifier and attaches a randomly initialized 37-way one.
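To see why the transform has to come from the model, here is a minimal pure-Python sketch of the normalization step alone. The ImageNet mean and standard deviation are the widely used constants; the 0.5/0.5 pair stands in for a different convention that some checkpoints use. The helper name normalize and the sample pixel are hypothetical, chosen for illustration; this is not timm code.

```python
# Per-channel normalization, typically the last step of a pretrained model's transform.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
HALF_MEAN = (0.5, 0.5, 0.5)   # an alternative convention (assumed here for contrast)
HALF_STD = (0.5, 0.5, 0.5)


def normalize(rgb_uint8, mean, std):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb_uint8, mean, std))


pixel = (200, 120, 40)  # an arbitrary illustrative pixel
expected_by_model = normalize(pixel, IMAGENET_MEAN, IMAGENET_STD)
fed_by_mistake = normalize(pixel, HALF_MEAN, HALF_STD)

# Same pixel, two different network inputs: the weights were trained on only one.
gap = [abs(a - b) for a, b in zip(expected_by_model, fed_by_mistake)]
print("expected:", [round(v, 3) for v in expected_by_model])
print("mistaken:", [round(v, 3) for v in fed_by_mistake])
print("per-channel gap:", [round(v, 3) for v in gap])
```

Nothing crashes when the constants are wrong, which is why the mismatch is silent: the model simply receives inputs shifted away from the distribution it was trained on. Resolving cfg from the model removes the chance to hand-copy the wrong pair.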

Step 2: Load the Pets dataset with the resolved transforms

Point torchvision at Oxford-IIIT Pets, which downloads on first use, and wrap each split in the transform from Step 1. The training split gets the augmenting transform; the test split gets the plain evaluation transform, so you never leak augmentation into the metric.

from torchvision.datasets import OxfordIIITPet
from torch.utils.data import DataLoader

# download=True fetches ~7,400 images on first run, then caches them.
train_ds = OxfordIIITPet(root="data", split="trainval",
                         transform=train_tf, download=True)
test_ds = OxfordIIITPet(root="data", split="test",
                        transform=eval_tf, download=True)

# TODO: build a training DataLoader (batch_size=64, shuffle=True) and a test
# DataLoader (batch_size=64, shuffle=False). Set num_workers to taste.
train_dl = ...
test_dl = ...

Hint

train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=2) and test_dl = DataLoader(test_ds, batch_size=64, shuffle=False, num_workers=2). Shuffle the training loader so batches are decorrelated; leave the test loader unshuffled so evaluation is deterministic and predictions stay in dataset order, the same order Step 5 walks when it pairs each image with its label.

Step 3: Start a tracked run and log the config

Open a tracker before training so the run records what it is about to do. TensorBoard is the zero-setup option of Section 29.4: a single writer object captures the config and will receive per-epoch metrics. Logging the config up front is what makes the run reproducible later.

from torch.utils.tensorboard import SummaryWriter

config = {"backbone": "resnet18", "lr": 1e-3, "epochs": 3,
          "dataset": "oxford-iiit-pet", "head_only": True}

# TODO: create a SummaryWriter and record the config. Use
# writer = SummaryWriter(comment="pets-resnet18") and log the hyperparameters
# once with writer.add_hparams(config, {}) so the run is self-describing.
writer = ...

Hint

writer = SummaryWriter(comment="pets-resnet18"), then writer.add_hparams(config, {}). Run tensorboard --logdir runs in a second terminal to watch the curve appear live. If you prefer the hosted dashboard, the same loop logs to Weights & Biases by swapping in the wandb.init(config=config) and wandb.log(...) calls from Code Fragment 1 of Section 29.4.
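A logged config is most useful when two runs with the same settings can be recognized as such. One tracker-independent way to do that is to serialize the config canonically and derive a short fingerprint from it. The sketch below uses only the standard library; config_fingerprint and the manifest layout are hypothetical illustrations, not part of TensorBoard or Weights & Biases.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a short, key-order-independent hash of a JSON-serializable config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


config = {"backbone": "resnet18", "lr": 1e-3, "epochs": 3,
          "dataset": "oxford-iiit-pet", "head_only": True}

# The same settings in a different key order yield the same fingerprint.
reordered = dict(reversed(list(config.items())))
# Changing any value yields a different one.
changed = {**config, "lr": 1e-4}

# A small record that could be saved next to the checkpoint.
manifest = {"config": config, "fingerprint": config_fingerprint(config)}
print(json.dumps(manifest, indent=2))
```

Storing the fingerprint beside the checkpoint is one way to tie an artifact back to the settings that produced it, under the assumption that the config dictionary really does capture every setting that matters.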

Step 4: Fine-tune the head and log each epoch

Freeze the backbone and train only the new head for a few epochs, the cheap, fast transfer-learning move of Chapter 21. After each epoch, compute accuracy on the test split (logged as val/acc, because this lab uses the test split as its validation set) and log both the loss and the accuracy to the tracker so you get a curve, not just a final number.

import torch.nn as nn

# Freeze every parameter, then re-enable only the classifier head.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_classifier().parameters():
    p.requires_grad = True

opt = torch.optim.Adam(model.get_classifier().parameters(), lr=config["lr"])
loss_fn = nn.CrossEntropyLoss()

for epoch in range(config["epochs"]):
    model.train()
    for x, y in train_dl:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    # Evaluate and log. TODO: compute test accuracy over test_dl with the model
    # in eval() mode and torch.no_grad(), then log it and the last loss to the
    # tracker: writer.add_scalar("val/acc", acc, epoch) and
    # writer.add_scalar("train/loss", loss.item(), epoch).
    acc = ...
    print(f"epoch {epoch}: val acc {acc:.3f}")

torch.save(model.state_dict(), "pets_resnet18.pt")   # the run's artifact
writer.close()

Hint

For accuracy: set model.eval(), loop under with torch.no_grad():, accumulate (model(x).argmax(1) == y).sum().item() over the test set, and divide by len(test_ds). Then call writer.add_scalar("val/acc", acc, epoch). Head-only fine-tuning of ResNet-18 should clear 80 percent on Pets within three epochs because the frozen backbone already encodes generic animal features.

Step 5: Audit the mistakes in FiftyOne

An accuracy number tells you the model is wrong 15 percent of the time but not which images or why. Load the test set and the model's predictions into FiftyOne, the visual auditing tool of Section 29.3, and sort to the highest-confidence wrong predictions, the most informative failures.

import fiftyone as fo
from fiftyone import ViewField as F
from PIL import Image

classes = test_ds.classes
dataset = fo.Dataset("pets-audit")
model.eval()

# Attach each test image plus its ground truth and the model's top prediction.
# test_ds._images holds the file paths and test_ds._labels the integer breeds
# (both are private torchvision attributes, so they may change between releases).
for img_path, label in zip(test_ds._images, test_ds._labels):
    sample = fo.Sample(filepath=str(img_path))
    sample["truth"] = fo.Classification(label=classes[label])
    x = eval_tf(Image.open(img_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        probs = model(x).softmax(1)[0]   # class probabilities, not raw logits
    conf, pred = probs.max(0)
    sample["pred"] = fo.Classification(label=classes[int(pred)],
                                       confidence=float(conf))
    dataset.add_sample(sample)

# TODO: build a view of confident mistakes: samples where pred != truth,
# sorted by descending prediction confidence. Filter with
# F("pred.label") != F("truth.label") and sort_by("pred.confidence",
# reverse=True), then fo.launch_app(view=...).
mistakes = ...
session = fo.launch_app(view=mistakes)

Hint

mistakes = dataset.match(F("pred.label") != F("truth.label")).sort_by("pred.confidence", reverse=True), then fo.launch_app(view=mistakes). The top of this view is where the model is both wrong and sure, often a mislabeled image, an ambiguous breed, or two cats in one frame, exactly the systematic failures Section 29.3 argues aggregate metrics hide.
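The filter-then-sort logic of that view can be sketched without FiftyOne, which makes clear what the app is ranking. The records below are invented toy data (file names, labels, and confidences are made up), and confident_mistakes is a hypothetical helper, not a FiftyOne function.

```python
# Toy audit records; file names, labels, and confidences are invented.
records = [
    {"image": "img_01.jpg", "truth": "Russian Blue", "pred": "British Shorthair", "confidence": 0.97},
    {"image": "img_02.jpg", "truth": "Beagle", "pred": "Beagle", "confidence": 0.99},
    {"image": "img_03.jpg", "truth": "Bengal", "pred": "Egyptian Mau", "confidence": 0.58},
    {"image": "img_04.jpg", "truth": "Pug", "pred": "Boxer", "confidence": 0.81},
]


def confident_mistakes(records):
    """Keep only wrong predictions, most confident first."""
    wrong = [r for r in records if r["pred"] != r["truth"]]
    return sorted(wrong, key=lambda r: r["confidence"], reverse=True)


ranked = confident_mistakes(records)
for r in ranked:
    print(f'{r["confidence"]:.2f}  {r["image"]}: predicted {r["pred"]}, labeled {r["truth"]}')
```

The correct prediction drops out regardless of its confidence, and the remaining rows are ordered so that the cases worth a human look come first. The low-confidence mistake at the bottom is one the model already signaled doubt about.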

Expected Output

The training loop prints a rising validation accuracy that reaches roughly 0.80 to 0.88 after three head-only epochs, with the matching curve visible in TensorBoard at http://localhost:6006. Step 4 leaves a checkpoint file pets_resnet18.pt on disk as the run's artifact. Step 5 opens the FiftyOne app in your browser showing the confident mistakes ranked by confidence; the first few are typically visually confusable breeds (for example British Shorthair versus Russian Blue) or images with an occluding object, the hard cases that explain most of the residual error. Together these are the three deliverables of a real project: a fine-tuned model, a logged and reproducible run, and a prioritized error list.

Stretch Goals

Unfreeze the backbone for a final epoch, giving it a learning rate 10x lower than the head's (discriminative fine-tuning), and log it as a second tracked run, then compare the two runs' curves in the tracker side by side, the cross-run comparison that motivates a tracker in the first place.
Swap the backbone for a stronger one in a single string (timm.create_model("convnext_tiny", pretrained=True, num_classes=37)) and confirm the accuracy gain, demonstrating timm's one-line architecture swap from Section 29.1.
Use the FiftyOne view to find and correct any genuinely mislabeled test images, export the cleaned split as a new dataset version (the versioning idea of Section 29.3), and re-evaluate to measure how much of the error was label noise rather than model error.
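For the last stretch goal, a dataset version needs an identifier that changes whenever a label changes. A minimal standard-library sketch, assuming the split can be summarized as a filename-to-label mapping; dataset_fingerprint and the toy file names are hypothetical and do not come from FiftyOne or Roboflow.

```python
import hashlib


def dataset_fingerprint(labels):
    """Hash a {filename: label} mapping into a short version identifier."""
    digest = hashlib.sha256()
    for name in sorted(labels):          # sort so insertion order does not matter
        digest.update(f"{name}\t{labels[name]}\n".encode("utf-8"))
    return digest.hexdigest()[:12]


# Toy label maps; the file names and labels are invented.
v1 = {"img_001.jpg": "Beagle", "img_002.jpg": "Basset Hound", "img_003.jpg": "Bengal"}
v2 = {**v1, "img_002.jpg": "Beagle"}     # one corrected label

changed = sorted(name for name in v1 if v1[name] != v2[name])
print("v1:", dataset_fingerprint(v1))
print("v2:", dataset_fingerprint(v2))
print("relabeled:", changed)
```

This version hashes file names and labels only, so it detects relabeling but not edits to the pixels themselves; hashing file contents as well would close that gap at the cost of reading every image.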

Complete Solution

# Full reproducible deep vision pipeline: download, adapt, track, audit.
import timm, torch, torch.nn as nn
from torchvision.datasets import OxfordIIITPet
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from PIL import Image
import fiftyone as fo
from fiftyone import ViewField as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Step 1: download backbone + matching transform ---
model = timm.create_model("resnet18", pretrained=True, num_classes=37).to(device)
cfg = timm.data.resolve_model_data_config(model)
train_tf = timm.data.create_transform(**cfg, is_training=True)
eval_tf = timm.data.create_transform(**cfg, is_training=False)

# --- Step 2: data ---
train_ds = OxfordIIITPet("data", "trainval", transform=train_tf, download=True)
test_ds = OxfordIIITPet("data", "test", transform=eval_tf, download=True)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=2)
test_dl = DataLoader(test_ds, batch_size=64, shuffle=False, num_workers=2)

# --- Step 3: tracker ---
config = {"backbone": "resnet18", "lr": 1e-3, "epochs": 3,
          "dataset": "oxford-iiit-pet", "head_only": True}
writer = SummaryWriter(comment="pets-resnet18")
writer.add_hparams(config, {})

# --- Step 4: head-only fine-tune, logging each epoch ---
for p in model.parameters():
    p.requires_grad = False
for p in model.get_classifier().parameters():
    p.requires_grad = True
opt = torch.optim.Adam(model.get_classifier().parameters(), lr=config["lr"])
loss_fn = nn.CrossEntropyLoss()

for epoch in range(config["epochs"]):
    model.train()
    last_loss = 0.0
    for x, y in train_dl:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        last_loss = loss.item()
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test_dl:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(1) == y).sum().item()
    acc = correct / len(test_ds)
    writer.add_scalar("train/loss", last_loss, epoch)
    writer.add_scalar("val/acc", acc, epoch)
    print(f"epoch {epoch}: val acc {acc:.3f}")

torch.save(model.state_dict(), "pets_resnet18.pt")
writer.close()

# --- Step 5: audit mistakes in FiftyOne ---
classes = test_ds.classes
dataset = fo.Dataset("pets-audit")
model.eval()
for img_path, label in zip(test_ds._images, test_ds._labels):
    sample = fo.Sample(filepath=str(img_path))
    sample["truth"] = fo.Classification(label=classes[label])
    x = eval_tf(Image.open(img_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        probs = model(x).softmax(1)[0]
    conf, pred = probs.max(0)
    sample["pred"] = fo.Classification(label=classes[int(pred)],
                                       confidence=float(conf))
    dataset.add_sample(sample)

mistakes = (dataset
            .match(F("pred.label") != F("truth.label"))
            .sort_by("pred.confidence", reverse=True))
session = fo.launch_app(view=mistakes)
session.wait()   # keep the app open until you close the script

What's Next?

With the deep vision workshop organized, the book turns to its final act. Chapter 30: Foundations of Generative Modeling opens Part IV: Generative Vision Models, where the question stops being "what is in this image?" and becomes "how do I make a new one?". The backbones, hubs, and training tooling of this chapter do not retire; they reappear as the encoders, the feature extractors behind FID, and the pretrained text encoders that condition every modern image generator. The latent spaces hinted at by autoencoders return as the working medium of diffusion, and the convolution you built by hand in Chapter 3 returns one last time as the U-Net denoiser at the heart of Chapter 33.

Bibliography & Further Reading

Foundational Papers

He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition." CVPR (2016). arXiv:1512.03385

ResNet, the backbone that every hub still ships and that Section 29.1 uses as the canonical "just download the weights" example.

Dosovitskiy, A. et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)." ICLR (2021). arXiv:2010.11929

The Vision Transformer, the architecture that made timm and the Hugging Face Hub the default homes for state-of-the-art vision weights.

Wightman, R., Touvron, H., and Jegou, H. "ResNet strikes back: An improved training procedure in timm." NeurIPS Workshop (2021). arXiv:2110.00476

Shows that much of the apparent gap between old and new architectures is a training-recipe gap, the empirical argument behind timm's role as a benchmarking library.

Ren, S., He, K., Girshick, R., and Sun, J. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NeurIPS (2015). arXiv:1506.01497

The two-stage detector whose reference implementation Detectron2 descends from; Section 29.2 reads its config as the canonical framework example.

Books

Howard, J. and Gugger, S. "Deep Learning for Coders with fastai and PyTorch." O'Reilly (2020). Free notebooks on GitHub

A code-first deep learning text whose transfer-learning and data-block chapters are the gentlest on-ramp to the workflows this chapter's tools automate.

Szeliski, R. "Computer Vision: Algorithms and Applications." Springer, 2nd ed. (2022). Free online edition

Free, encyclopedic, current; its deep learning chapters give the conceptual map under the tooling, with full references for everything in Part III.

Tools & Libraries

torchvision documentation. pytorch.org/vision

The official PyTorch vision library: models, transforms (v2), datasets, and ops, the baseline against which Section 29.1 compares everything else.

Wightman, R. "timm (PyTorch Image Models)." GitHub repository

The largest curated collection of pretrained image backbones with a uniform API and per-model preprocessing config; the backbone library of Section 29.1.

Hugging Face Transformers documentation. huggingface.co/docs/transformers

The cross-modal model library and Hub whose AutoModel and pipeline abstractions standardize loading detection, segmentation, and vision-language models.

Ultralytics YOLO documentation. docs.ultralytics.com

The detection, segmentation, and pose framework behind the famous YOLO one-liner; Section 29.1 contrasts its convenience with the frameworks of Section 29.2.

Wu, Y., Kirillov, A., Massa, F., Lo, W., and Girshick, R. "Detectron2." Meta AI Research (2019). GitHub repository

Meta's modular detection and segmentation framework; Section 29.2 walks its registry and config system as the archetype of a research framework.

Chen, K. et al. "MMDetection: Open MMLab Detection Toolbox and Benchmark." (2019). arXiv:1906.07155

The OpenMMLab detection toolbox with the broadest published model zoo; Section 29.2 contrasts its inheritance-based configs with Detectron2's.

Moore, B. and Corso, J. "FiftyOne: open-source tool for dataset curation and model analysis." Voxel51. GitHub repository

The visual dataset-and-prediction explorer of Section 29.3, built to surface label errors and hard cases that aggregate metrics hide.

Biewald, L. "Experiment Tracking with Weights & Biases." (2020). docs.wandb.ai

The hosted experiment tracker of Section 29.4, one of the most widely used of the run-logging tools that make deep vision reproducible.

Datasets & Benchmarks

Deng, J. et al. "ImageNet: A Large-Scale Hierarchical Image Database." CVPR (2009). image-net.org

The classification benchmark whose pretrained weights seed almost every model in this chapter's hubs; the source of the "ImageNet-pretrained" backbone.

Lin, T. et al. "Microsoft COCO: Common Objects in Context." ECCV (2014). cocodataset.org

The detection and segmentation benchmark and annotation format that Detectron2, MMDetection, and Roboflow all speak natively.