Part III: Deep Learning for Computer Vision
Chapter 23: Object Detection

Training & Deploying a Detector on Custom Data

"The theory was lovely. Then they handed me four hundred photos of their specific kind of bolt, half of them blurry, and said: by Friday. That is when I learned that detection is ten percent architecture and ninety percent labels, learning rate, and remembering which folder the validation set lives in."

A Fine-Tuned Detector With a Deadline
Big Picture

In practice you almost never design a detector or train one from scratch; you fine-tune a pretrained modern detector on a few hundred to a few thousand of your own labeled images, validate its mAP honestly, and export it to a runtime your deployment target can run. The five prior sections gave you the conceptual map; this one is the route you actually drive. The workflow is the same regardless of architecture: collect and label data in a standard format, pick a pretrained model sized for your speed budget, fine-tune with sensible augmentation and a transfer-learning schedule, read validation mAP while guarding against the data leaks and easy-set illusions that inflate it, and export to ONNX or TensorRT for the edge. We run it concretely with Ultralytics YOLO, one of the most widely used production toolkits, and flag the failure modes that bite real teams.

Everything from Section 23.1 to Section 23.5 was about how detectors work. This section is about getting one to work for you. The honest truth of applied detection is that the architecture choice, the subject of four sections, is usually the smallest decision; far more of your time goes to labeling, to the transfer-learning recipe from Chapter 21, to reading the metrics of Section 23.1 without fooling yourself, and to exporting for the hardware of Chapter 28. We walk that whole route once, end to end.

1. Collecting and Labeling Data (Beginner)

A detector learns from images paired with box annotations, so the first task is to produce that pairing for your domain. Two practical rules dominate. First, label what you will deploy on: photograph the objects in the lighting, angles, backgrounds, and resolution your camera will actually see, because a detector trained on clean catalogue photos fails on a noisy warehouse feed. Second, label consistently: agree on exactly how tight a box should be, how to handle occluded or truncated objects, and which borderline cases count, then apply the rules uniformly, since inconsistent labels put a noise floor under your mAP that no architecture can lift.

Annotation tools (Label Studio, CVAT, Roboflow, and others) export to a handful of standard formats. The two you will meet most are COCO JSON (one file listing all images and annotations, each box stored in pixels as top-left corner plus width and height, used by torchvision and the DETR family) and the YOLO text format (one .txt per image, one line per object: class index and a center-format box normalized to $[0, 1]$, the format and conventions from Section 23.1). Knowing both, and being able to convert between them, saves more debugging hours than any modeling trick. Figure 23.6.1 shows the whole workflow this section follows.
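The conversion between the two formats is only a few lines of arithmetic. The sketch below assumes the usual COCO convention of [x_min, y_min, width, height] in pixels and the YOLO convention of a normalized (cx, cy, w, h) box; the function names are illustrative, not part of any library.

```python
def coco_to_yolo(box, img_w, img_h):
    """COCO [x_min, y_min, width, height] in pixels -> YOLO (cx, cy, w, h) in [0, 1]."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_coco(box, img_w, img_h):
    """YOLO normalized (cx, cy, w, h) -> COCO [x_min, y_min, width, height] in pixels."""
    cx, cy, w, h = box
    return [(cx - w / 2) * img_w, (cy - h / 2) * img_h, w * img_w, h * img_h]


# A 100x50 px box whose top-left corner sits at (200, 100) in a 640x480 image.
yolo_box = coco_to_yolo([200, 100, 100, 50], img_w=640, img_h=480)
coco_box = yolo_to_coco(yolo_box, img_w=640, img_h=480)
print(yolo_box)   # normalized center-format box
print(coco_box)   # round trip back to pixels (up to float rounding)
```

A round trip like this is a cheap sanity test to run on a few real annotations whenever you switch tools or formats.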

[Figure: five boxes in sequence: collect + label, split train/val/test, fine-tune pretrained, validate (read mAP), export + deploy; the same five stages regardless of which detector family you chose.]
Figure 23.6.1: The custom-detector workflow. Collect and label data in your deployment conditions, split it without leakage, fine-tune a pretrained model, validate mAP honestly, then export and deploy. The architecture you picked in Sections 23.2 to 23.5 changes only the "fine-tune" box; the rest is identical.

2. Splitting Data Without Fooling Yourself (Intermediate)

Before training, split your images into training, validation, and test sets. The cardinal rule, the same one from Chapter 21, is that no information may leak from validation or test into training. In detection this rule has a sharp, easy-to-miss edge: if your images come from video or from burst captures, near-identical consecutive frames must all go to the same split. Put frame 100 in training and frame 101 in validation and your validation mAP will be gloriously high and completely meaningless, because the model has effectively seen the validation images. Split by source clip, by scene, or by capture session, never by random per-frame shuffle, whenever frames are correlated.
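A group-aware split can be sketched in a few lines. The version below assumes a hypothetical naming scheme in which every frame is called `<clip>_<frame>.jpg`, so the clip identifier can be read from the filename; `clip_id` and `split_by_group` are illustrative helpers, and with a different naming scheme only `clip_id` would change.

```python
import random
from collections import defaultdict


def clip_id(filename):
    """Hypothetical naming scheme: 'clipA_000123.jpg' -> 'clipA'."""
    return filename.rsplit("_", 1)[0]


def split_by_group(filenames, val_fraction=0.2, seed=0, group_of=clip_id):
    """Shuffle whole groups (clips), not frames, then assign each group to one split."""
    groups = defaultdict(list)
    for name in filenames:
        groups[group_of(name)].append(name)
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(val_fraction * len(ids)))
    val_ids = set(ids[:n_val])
    train = [f for g in ids if g not in val_ids for f in groups[g]]
    val = [f for g in ids if g in val_ids for f in groups[g]]
    return train, val


# Five clips of four frames each.
frames = [f"clip{c}_{i:06d}.jpg" for c in "ABCDE" for i in range(4)]
train, val = split_by_group(frames, val_fraction=0.2, seed=0)
# No clip may appear on both sides of the split.
assert {clip_id(f) for f in train}.isdisjoint({clip_id(f) for f in val})
print(len(train), "train frames,", len(val), "val frames")
```

Because the unit of shuffling is the clip, the validation fraction is only approximate when clips differ in length; that is the price of a leak-free split.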

Key Insight: A High mAP Is Guilty Until Proven Innocent

The single most common way a custom-detector project goes wrong is a validation mAP that looks wonderful and collapses in deployment. Three culprits account for nearly all cases: (1) leakage from correlated frames split across train and val, as above; (2) an easy validation set that does not contain the hard cases (occlusion, small objects, unusual lighting) the deployment will face, so mAP measures the easy slice only; (3) label noise in validation that either flatters a model that learned the same mistakes or unfairly penalizes a correct one. Before you trust any mAP, visualize the model's predictions on the validation images by eye, check the size-stratified APs (Section 23.1) so small objects are not hiding a failure, and confirm your split has no correlated-frame leakage. Treat a suspiciously high number as a bug to investigate, not a result to celebrate.
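One quick way to test for the easy-set culprit is to count how many validation boxes fall into each size bucket before looking at any AP. The sketch below assumes YOLO-format label rows and uses the COCO area thresholds of 32² and 96² pixels; the helper names are illustrative, and for simplicity it assumes every image has the same dimensions.

```python
from collections import Counter


def size_bucket(w_px, h_px):
    """COCO-style size bucket from a box's pixel area."""
    area = w_px * h_px
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def bucket_counts(labels, img_w, img_h):
    """labels: iterable of YOLO rows (class, cx, cy, w, h) with normalized w, h."""
    return Counter(size_bucket(w * img_w, h * img_h) for _, _, _, w, h in labels)


val_labels = [
    (0, 0.5, 0.5, 0.04, 0.05),   # 25.6 x 24 px in a 640x480 image
    (0, 0.3, 0.3, 0.10, 0.10),   # 64 x 48 px
    (1, 0.7, 0.6, 0.40, 0.50),   # 256 x 240 px
]
counts = bucket_counts(val_labels, img_w=640, img_h=480)
print(counts)
```

If deployment involves small objects and the "small" bucket of your validation set is nearly empty, the headline mAP says little about the cases that matter.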

3. Fine-Tuning a Modern Detector (Intermediate)

With clean, split data you fine-tune. We use Ultralytics YOLO because it is the most widely deployed toolkit and reduces the whole training loop to a few lines, but the principles transfer to torchvision and the DETR family. The data is described by a small YAML file naming the train and validation image folders and the class names; the model loads pretrained COCO weights (transfer learning from Chapter 21, since COCO features are an excellent starting point for almost any natural-image domain); and a single train call runs the schedule with sensible default augmentation. The code below is a complete, runnable fine-tuning script.

# Fine-tune a COCO-pretrained YOLO on a custom dataset described by a YAML file.
# Transfer learning from COCO means the backbone already sees natural-image
# features, so a short schedule on your own classes is usually enough.
from ultralytics import YOLO

# data.yaml lists: path, train: images/train, val: images/val, names: {0: bolt, 1: nut}
model = YOLO("yolo11n.pt")          # nano model, COCO-pretrained; start small

results = model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,                      # train/infer resolution; match your deployment
    batch=16,
    optimizer="SGD",                # set explicitly; the default "auto" picks its own lr0
    lr0=0.01,                       # initial LR, decayed over the run
    cos_lr=True,                    # cosine decay (the default schedule is linear)
    patience=20,                    # early-stop if val fitness (mostly mAP) stalls for 20 epochs
    close_mosaic=10,                # turn mosaic off for the last 10 epochs
)                                   # flips and HSV jitter stay at their defaults (Chapter 21 policies)
metrics = model.val()               # evaluates on the val set
print(metrics.box.map)              # COCO mAP@[0.50:0.95]
print(metrics.box.map50)            # mAP@0.50
Code Fragment 1: A complete custom fine-tuning run with Ultralytics YOLO. The single model.train call drives the whole schedule (lr0 decayed on a cosine curve via cos_lr, patience early-stopping, default augmentation with mosaic switched off for the final epochs), and the yolo11n.pt nano model is the right place to start: it trains fast and tells you quickly whether your data and labels are sound before you spend time on a larger model.

Two recipe choices matter most. Start with the smallest model (the "n" nano variant) to get a fast signal that your pipeline and labels are correct; only scale up to "s", "m", or larger once you have confirmed the data is sound and you need more accuracy. And match the training image size to your deployment resolution: train at $640$ and deploy at $1280$ and the model sees objects at unfamiliar scales. The augmentation defaults (mosaic, horizontal flip, HSV jitter) implement the policies of Chapter 21 and are usually a good starting point, though mosaic is often disabled for the final few epochs so the model finishes on whole, un-mosaicked images.
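The scale mismatch is easy to quantify. The sketch below assumes an aspect-preserving resize in which the frame's longer side is scaled to imgsz (the letterbox-style preprocessing YOLO toolkits typically use); `apparent_size` is an illustrative helper, and the 40-pixel object and 1920x1080 frame are made-up numbers.

```python
def apparent_size(obj_px, frame_w, frame_h, imgsz):
    """Object side length in pixels after the frame's longer side is resized to imgsz."""
    scale = imgsz / max(frame_w, frame_h)
    return obj_px * scale


# A bolt 40 px wide in a 1920x1080 camera frame, seen at two network resolutions.
for imgsz in (640, 1280):
    print(imgsz, round(apparent_size(40, 1920, 1080, imgsz), 1))
# 640  -> 13.3 px
# 1280 -> 26.7 px
```

The same object is twice as large at 1280 as at 640, which is both why a resolution mismatch between training and deployment hurts and why raising imgsz is the first remedy for missed small objects.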

Practical Example: The Detector That Learned the Watermark

Who: a retail-analytics startup training a detector to find specific product packages on store shelves from phone photos, 2024. Situation: their training images came from the brand's marketing team and their validation images from a separate collection effort. The first model scored an excellent validation mAP and they prepared to ship. Problem: in a pre-launch field test the detector failed badly on ordinary store photos. Investigating, they found the marketing training images all carried a faint corner watermark and a consistent studio background, and the model had partly keyed on those cues; the validation set happened to share the studio look, so validation mAP was inflated. Decision: they re-collected training data in real stores, re-split by store location to prevent any single store's shelf from spanning train and val, and added heavy background and color augmentation to break the spurious cues. Result: validation mAP dropped to a lower but honest number that matched field performance, and the shipped model worked. Lesson: a detector will exploit any shortcut your data offers, including watermarks and backgrounds; an mAP is only as trustworthy as the realism and independence of the set it is measured on, exactly the leakage and easy-set warnings of subsection 2. The illustration below shows the failure: a proud detector boxing the watermark instead of the product.

[Illustration: a proud cartoon robot draws its bounding box around a faint corner watermark and the studio backdrop instead of the actual product beside it, while a floating report card shows a high score.]
A detector will gleefully cheat on any shortcut your data leaves lying around; a beautiful validation mAP that quietly learned the watermark collapses the moment the watermark is gone.

4. Diagnosing Training Failures (Advanced)

When a custom detector trains poorly, the cause is almost never the architecture and almost always one of a short list of practical faults. Table 23.6.1 is the checklist experienced practitioners run through; the row order roughly matches how often each cause appears. The most common by far is a data or label problem: a class-index off-by-one, boxes in the wrong format (center versus corner, normalized versus pixel, the conversions from Section 23.1), or annotations that are simply inconsistent. The single most useful debugging action is to render a handful of training images with their loaded labels drawn on top, before any training, and confirm the boxes sit on the objects; this five-minute check catches the majority of "the model will not learn" reports.
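Some of these label faults can also be caught mechanically before anyone looks at an image. The sketch below is a minimal linter for YOLO detection label files, assuming the five-field "class cx cy w h" line format; `lint_yolo_label` is an illustrative helper, not a library function, and it complements the visual check rather than replacing it.

```python
def lint_yolo_label(text, num_classes):
    """Return a list of (line_number, problem) for the contents of one label file."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # blank lines are harmless
        if len(parts) != 5:
            problems.append((n, f"expected 5 fields, got {len(parts)}"))
            continue
        try:
            cls = int(parts[0])
            coords = [float(p) for p in parts[1:]]
        except ValueError:
            problems.append((n, "non-numeric field or non-integer class index"))
            continue
        if not 0 <= cls < num_classes:
            problems.append((n, f"class index {cls} outside 0..{num_classes - 1}"))
        if any(not 0.0 <= v <= 1.0 for v in coords):
            problems.append((n, "coordinate outside [0, 1]; pixel units?"))
    return problems


sample = "0 0.5 0.5 0.2 0.3\n2 0.5 0.5 0.2 0.3\n1 320 240 64 48\n"
problems = lint_yolo_label(sample, num_classes=2)
for line_no, problem in problems:
    print(f"line {line_no}: {problem}")
```

On the sample, line 2 is flagged for an out-of-range class index (the classic one-based export) and line 3 for pixel coordinates where normalized ones were expected.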

Table 23.6.1: Common custom-detector training failures and their fixes.
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss never drops, mAP near 0 | Wrong box format or class indices | Render loaded labels on images; verify format and index base |
| Train mAP high, val mAP low | Overfitting or leakage | More data and augmentation; re-split by source to remove leakage |
| Val mAP high, deployment poor | Easy or non-representative val set | Re-collect val in deployment conditions; check size-stratified AP |
| Small objects missed | Train resolution too low | Raise imgsz; use a model with finer pyramid levels |
| Loss explodes to NaN | Learning rate too high | Lower lr0; confirm warmup is enabled |

Library Shortcut: Predict and Export in Two Lines

Once trained, running inference and exporting to a deployment runtime are each a single call. Ultralytics wraps the entire ONNX and TensorRT export toolchain, which by hand involves tracing the model, pinning opset versions, and wiring the NMS into the graph:

# Load the best checkpoint from training, run inference, and export the model
# to deployment runtimes. Each of predict and export is a single call that
# hides the tracing and graph surgery you would otherwise write by hand.
from ultralytics import YOLO

# Adjust the run folder if this was not your first run (train2, train3, ...).
model = YOLO("runs/detect/train/weights/best.pt")

# Inference on new images: returns boxes, classes, scores per image.
preds = model.predict("new_shelf.jpg", conf=0.4, imgsz=640)
preds[0].show()                       # draw the boxes

# Export for deployment. One line each:
model.export(format="onnx")           # portable ONNX graph for ONNX Runtime
model.export(format="engine")         # TensorRT engine; needs an NVIDIA GPU with TensorRT installed
Code Fragment 2: Inference and deployment export in two calls, picking up the best.pt checkpoint that Code Fragment 1 produced. model.predict returns boxes, classes, and scores, while model.export writes a portable ONNX graph or a TensorRT engine with the box decoding inside the graph; whether NMS is embedded as well depends on the export options and library version, so inspect the exported model's outputs before relying on it. Doing either by hand is a few hundred lines of fragile tracing and graph surgery, the techniques Chapter 28 treats in depth.

The ONNX export produces a portable graph that runs under ONNX Runtime on CPU, mobile, or the browser; the TensorRT export produces a hardware-optimized engine for NVIDIA Jetson and server GPUs, often two to five times faster than the PyTorch model, depending on precision and hardware. The library does the tracing in one call and puts the box decoding into the exported graph. NMS is embedded only when the export options and library version support it; otherwise it remains a small post-processing step that the deployment runtime must perform.
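Whether NMS ends up inside an exported graph depends on the export settings and toolkit version, so it helps to know what the runtime would have to do when it is not. The sketch below is a plain-Python greedy NMS over corner-format boxes, written for clarity rather than speed; it is class-agnostic, whereas production pipelines usually apply NMS per class, and `iou` and `nms` are illustrative names.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: visit boxes by descending score, keep one only if it does
    not overlap an already-kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)   # the second box duplicates the first and is suppressed
```

The first two boxes overlap with an IoU of about 0.92, so only the higher-scoring one survives, while the distant third box is untouched.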

Research Frontier: Label Less, Detect More

The expensive, error-prone labeling stage of subsection 1 is the part of this workflow that 2024 to 2026 research is most aggressively automating. Open-vocabulary detectors such as Grounding DINO and YOLO-World (Section 23.5) can detect objects from a text prompt with zero domain labels, and teams increasingly use them to auto-label a first pass that human annotators only correct, cutting labeling cost by large factors; the Segment Anything Model of Chapter 24 is used the same way for masks. Self-training and active learning loops let a partially trained detector propose labels for the unlabeled pool, surfacing only its most uncertain images for human review. And the foundation backbones of Chapter 25, such as DINOv2, give such strong pretrained features that fine-tuning a detector now needs far fewer labels than the COCO-era recipes assumed. The frontier of applied detection is shifting from "design the model" to "spend your human attention only where the model is genuinely unsure."
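The "surface only the most uncertain images" step of an active-learning loop can be sketched with a simple ranking. The heuristic below is a hypothetical one chosen for illustration (an image is as uncertain as its detection whose confidence lies closest to 0.5); real systems use a variety of uncertainty measures, and both function names are assumptions.

```python
def image_uncertainty(confidences):
    """Hypothetical heuristic: 1.0 when some detection sits exactly at 0.5
    confidence, falling to 0.0 as all detections approach 0 or 1."""
    if not confidences:
        return 0.0  # no detections at all; treated as uninformative in this sketch
    return max(1.0 - abs(2.0 * c - 1.0) for c in confidences)


def pick_for_review(preds, budget):
    """preds: {image_name: [confidence, ...]}; return the `budget` most uncertain names."""
    ranked = sorted(preds, key=lambda name: image_uncertainty(preds[name]), reverse=True)
    return ranked[:budget]


preds = {
    "a.jpg": [0.95, 0.91],   # confident detections
    "b.jpg": [0.52],         # one borderline detection
    "c.jpg": [0.90, 0.35],   # one confident, one doubtful
    "d.jpg": [],             # nothing detected
}
to_review = pick_for_review(preds, budget=2)
print(to_review)
```

One design caveat: scoring empty images as zero means completely missed objects never reach a human, so a practical loop usually adds a random sample of such images to the review queue.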

Fun Fact

The Ultralytics YOLO("yolo11n.pt") call that fits in one line will, on first run, quietly download the pretrained weights, and the one-line train call that follows it sets up the augmentation pipeline, detects your GPU, chooses an optimizer, and configures a learning-rate schedule with warmup, all the machinery that a 2016-era detection paper would have devoted half its experimental section to describing. A practitioner today can fine-tune a competitive detector before lunch on hardware that costs less than the GPUs the original papers ran on, which is the quiet, cumulative payoff of the decade of research this chapter traced.

5. The Whole Chapter, in One Pipeline (Advanced)

Step back and notice how this section reuses everything before it. The boxes and mAP of Section 23.1 are the labels you create and the metric you read. The architecture you fine-tune is a member of one of the families from Sections 23.2 to 23.5, chosen by the speed-accuracy trade-off those sections laid out: a YOLO for the edge, a Faster R-CNN for small-object recall, a DINO-DETR for top accuracy. The augmentation and transfer learning come from Chapter 21, the export targets are the edge devices of Chapter 28, and the moment you need pixel-precise outlines instead of boxes you will extend this exact workflow into the segmentation of Chapter 24. Detection is the hub, and you have now driven the full loop from raw images to a deployable model. Put the whole chapter into practice in the Hands-On Lab at the end of this section, which walks the entire pipeline once and leaves you with a trained, validated, and exported detector you can show.

Exercise 23.6.1: Diagnose the Inflated mAP (Conceptual)

A colleague reports a 0.92 validation mAP@0.50 on a custom traffic-sign detector trained from dashcam video, but the model performs poorly on a new drive. List the three most likely causes from subsection 2 and Table 23.6.1, and for each describe the one concrete check or fix you would apply. Which cause is most probable given that the data is dashcam video, and why?

Exercise 23.6.2: Fine-Tune on a Public Dataset (Coding)

Download a small public detection dataset (for example a Roboflow Universe dataset or the Oxford-IIIT Pet boxes), write its data.yaml, and fine-tune yolo11n.pt for a modest number of epochs using the script in subsection 3. Report the validation mAP@0.50 and mAP@[0.50:0.95], then render the model's predictions on five validation images and inspect them by eye. Note any class or size for which the model struggles and relate it to the size-stratified AP discussion of Section 23.1.

Exercise 23.6.3: Measure the Export Speedup (Analysis)

Take your trained model from Exercise 23.6.2 and benchmark inference latency three ways: the native PyTorch .pt model, the exported ONNX model under ONNX Runtime, and (if you have an NVIDIA GPU) the TensorRT engine. Average over many runs after warmup, and report frames per second for each at your deployment imgsz. Write a short analysis of the speedup each export gives, the accuracy (if any) it costs, and which one you would ship for a CPU-only server versus an NVIDIA Jetson, connecting your reasoning to the deployment considerations of Chapter 28.
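A minimal timing harness for this kind of comparison might look like the sketch below. It assumes each model is wrapped in a zero-argument callable that runs one inference and blocks until the result is ready; `benchmark` is an illustrative helper, and the lambda at the end is only a stand-in workload so the snippet runs without a model.

```python
import time


def benchmark(fn, warmup=10, runs=100):
    """Return (mean latency in ms, frames per second) for a zero-argument callable."""
    for _ in range(warmup):          # warmup runs are executed but not timed
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    mean_s = elapsed / runs
    return mean_s * 1000.0, 1.0 / mean_s


# Stand-in workload; replace with a call that runs one inference on a fixed image.
latency_ms, fps = benchmark(lambda: sum(range(10_000)), warmup=5, runs=50)
print(f"{latency_ms:.3f} ms per call, {fps:.1f} calls per second")
```

Use the same image and the same imgsz for every backend so the numbers are comparable, and note that on a GPU the callable must wait for the result, or the timer measures only how fast work was queued.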

Hands-On Lab: Train, Evaluate, and Export a Detector on a Small Custom Set

Duration: ~75 minutes (Intermediate)

Objective

Take a small custom detection dataset from raw labels to a deployable model: fine-tune a COCO-pretrained YOLO on roughly a hundred of your own images, read its mean average precision (mAP) without fooling yourself, and export the trained detector to a portable ONNX graph you can run anywhere. The artifact you finish with is a working detector for a domain of your choosing plus an honest one-paragraph evaluation report, exactly the deliverable an applied team ships.

What You'll Practice

  • Writing a YOLO-format dataset and its data.yaml, the labeling conventions of Section 23.1
  • Splitting correlated images without leakage, the cardinal rule of subsection 2
  • Fine-tuning a pretrained one-stage detector (Section 23.3) with the transfer-learning recipe from Chapter 21
  • Reading validation mAP@0.50 and mAP@[0.50:0.95] and verifying it by eye
  • Exporting to ONNX for the edge runtimes of Chapter 28

Setup

A machine with Python 3.9 or newer (a free Colab GPU runtime is ideal but a CPU works for the nano model on a small set). Install the toolkit and the ONNX export backend:

pip install ultralytics onnx onnxruntime

For data, either label about a hundred of your own photos of a single object class with a tool such as Label Studio or CVAT exporting to YOLO format, or download a tiny public set (for example a Roboflow Universe export already in YOLO format). Keep it small on purpose: the goal is to drive the whole pipeline, not to chase a high number.

Steps

Step 1: Lay out the dataset and write data.yaml

Arrange images and label files in the folder structure Ultralytics expects, then describe it with a YAML file naming the splits and class names. Getting this exactly right is most of the battle; the rest is one training call.

# Expected layout:
#   dataset/images/train/*.jpg   dataset/labels/train/*.txt
#   dataset/images/val/*.jpg     dataset/labels/val/*.txt
# Each label .txt has one line per object: "class_idx cx cy w h"
# with cx, cy, w, h normalized to [0, 1] (the YOLO center-format of Section 23.1).
from pathlib import Path

root = Path("dataset")
# TODO: write dataset/data.yaml as text with these keys:
#   path: ./dataset
#   train: images/train
#   val: images/val
#   names: {0: your_class_name}
# Hint: build a short multi-line string and Path("dataset/data.yaml").write_text(...)
yaml_text = ...
Hint

The class index in every label line must match the key in names. Off-by-one or wrong-base indices are the number-one cause of a model that never learns (Table 23.6.1), so double-check that a file with class 0 corresponds to names: {0: ...}.

Step 2: Split without leakage and visualize the loaded labels

Before training, confirm two things: that no correlated images straddle the train and val split (subsection 2), and that your boxes actually sit on the objects. The five-minute label-render check catches the majority of "the model will not learn" reports.

import cv2, matplotlib.pyplot as plt

def draw_yolo_labels(img_path, label_path):
    img = cv2.cvtColor(cv2.imread(str(img_path)), cv2.COLOR_BGR2RGB)
    h, w = img.shape[:2]
    for line in open(label_path):
        c, cx, cy, bw, bh = map(float, line.split())
        # TODO: convert the normalized center-format box (cx, cy, bw, bh)
        # to pixel corners (x1, y1, x2, y2) and draw it with cv2.rectangle.
        # Hint: x1 = (cx - bw/2) * w ; y1 = (cy - bh/2) * h ; and so on.
        ...
    return img

plt.imshow(draw_yolo_labels("dataset/images/train/ex.jpg",
                            "dataset/labels/train/ex.txt"))
plt.axis("off"); plt.show()
Hint

If frames come from video or burst captures, split by source clip or capture session, never by random per-frame shuffle. Putting frame 100 in train and frame 101 in val gives a gloriously high and meaningless val mAP.

Step 3: Fine-tune a COCO-pretrained nano model

Start with the smallest model so you get a fast signal that your data and labels are sound. One train call drives the whole schedule (warmup, learning-rate decay, early stopping) with default augmentation, the policies from Chapter 21.

from ultralytics import YOLO

model = YOLO("yolo11n.pt")   # nano, COCO-pretrained; start small on purpose
# TODO: call model.train(...) with data="dataset/data.yaml", a modest epochs
# count (50 is plenty for a tiny set), imgsz=640, and patience=20 early-stopping.
results = ...
Hint

Match imgsz to the resolution you will deploy at. If training stalls at mAP near zero, stop and return to Step 2: the cause is almost always the labels, not the model.

Step 4: Read the validation mAP honestly

Evaluate on the held-out val set and record both metrics. Then treat the number as guilty until proven innocent: render predictions on several val images and look at them.

metrics = model.val()
# TODO: print metrics.box.map (COCO mAP@[0.50:0.95]) and metrics.box.map50.
# Then run model.predict on five val images at conf=0.4 and view the boxes.
...
Hint

A suspiciously high mAP on a tiny set usually means leakage or an easy val slice. Check the predictions by eye and, if your toolkit reports them, the size-stratified APs from Section 23.1.

Step 5: Export the trained detector to ONNX

Pick up the best.pt checkpoint and export it to a portable graph. One line replaces the few hundred lines of fragile tracing and NMS-graph surgery you would otherwise write, the techniques Chapter 28 covers in depth.

best = YOLO("runs/detect/train/weights/best.pt")
# TODO: export to ONNX with best.export(format="onnx"), then confirm the file
# exists and re-load it with YOLO(...) to run one inference, proving it works.
...
Hint

The export puts the box decoding into the graph (and NMS as well, when your export options and library version support it), so the runtime needs little or no Python post-processing. YOLO("best.onnx").predict(...) loads the exported model back through ONNX Runtime to verify it.

Expected Output

After Step 3 you should see a per-epoch training log and a runs/detect/train/ folder containing weights/best.pt plus result plots. Step 4 prints two numbers, for example map50 = 0.78 and map = 0.52 on a clean small set (your values will differ by domain and label quality), and shows boxes drawn on val images that visibly sit on the objects. Step 5 produces a best.onnx file that reloads and runs inference with results matching the PyTorch model. The finished artifact is a trained detector, its two mAP numbers, and a portable ONNX graph, with a one-paragraph note on whether you trust the mAP and why.

Stretch Goals

  • Scale up from yolo11n.pt to yolo11s.pt and quantify how much mAP the larger model buys against its slower inference, the speed-accuracy trade-off of Section 23.3.
  • Benchmark inference latency of the PyTorch model versus the ONNX export under ONNX Runtime, averaging over many runs after warmup, and connect the speedup to the deployment choices of Chapter 28.
  • Library Shortcut: auto-label a fresh batch of unlabeled images with an open-vocabulary detector (Grounding DINO or YOLO-World from Section 23.5) using a text prompt, correct only the mistakes by hand, and retrain. Measure how much labeling time the auto-label pass saved.
Complete Solution
# Complete custom-detector pipeline: data.yaml, label check, train, eval, export.
from pathlib import Path
import cv2, matplotlib.pyplot as plt
from ultralytics import YOLO

# --- Step 1: write data.yaml ---
yaml_text = """\
path: ./dataset
train: images/train
val: images/val
names:
  0: widget
"""
Path("dataset/data.yaml").write_text(yaml_text)

# --- Step 2: render loaded labels to verify format before training ---
def draw_yolo_labels(img_path, label_path):
    bgr = cv2.imread(str(img_path))
    if bgr is None:                                   # imread returns None on a bad path
        raise FileNotFoundError(img_path)
    img = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    h, w = img.shape[:2]
    for line in Path(label_path).read_text().splitlines():
        if not line.strip():
            continue                                  # skip blank lines
        c, cx, cy, bw, bh = map(float, line.split())
        x1 = int((cx - bw / 2) * w); y1 = int((cy - bh / 2) * h)
        x2 = int((cx + bw / 2) * w); y2 = int((cy + bh / 2) * h)
        cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 0), 2)
    return img

plt.imshow(draw_yolo_labels("dataset/images/train/ex.jpg",
                            "dataset/labels/train/ex.txt"))
plt.axis("off"); plt.show()
# (Splitting rule: if images are correlated frames, assign whole clips to one split.)

# --- Step 3: fine-tune the COCO-pretrained nano model ---
model = YOLO("yolo11n.pt")
model.train(
    data="dataset/data.yaml",
    epochs=50,
    imgsz=640,
    batch=16,
    patience=20,      # early-stop if val mAP stalls
)

# --- Step 4: read mAP honestly, then check predictions by eye ---
metrics = model.val()
print("mAP@0.50        :", metrics.box.map50)
print("mAP@[0.50:0.95] :", metrics.box.map)
val_imgs = sorted(Path("dataset/images/val").glob("*.jpg"))[:5]
for r in model.predict(val_imgs, conf=0.4, imgsz=640):
    r.show()

# --- Step 5: export to ONNX and verify the exported model reloads and runs ---
best = YOLO("runs/detect/train/weights/best.pt")   # adjust if your run folder is train2, train3, ...
onnx_path = best.export(format="onnx")     # one line; returns the path of the exported file
print("exported:", onnx_path)
reloaded = YOLO(onnx_path)                  # runs through ONNX Runtime
reloaded.predict(val_imgs[0], conf=0.4)[0].show()