Chapter 23: Object Detection | Building Vision AI

"Classification asked me one easy question: what is this? Detection asks me a thousand questions at once, and they all start with 'where', and they all expect a tidy rectangle for an answer. I have learned to draw boxes in my sleep, and to apologize for the ones that overlap."
An Anchor Box With Attachment Problems

Big Picture

Object detection answers two questions for every image at once: where is each object, and what is it. A classifier produces a single label for a whole image; a detector produces a variable-length list of (box, class, confidence) triples, one per object, and it must do this for crowded scenes it has never seen. That change of output, from one label to an unknown number of localized labels, forces every design decision in this chapter: how to score an imperfect box against the truth (IoU), how to summarize a detector's quality across all confidence thresholds (mean average precision), how to turn a fixed-size network into a variable-length predictor (anchors, then anchor-free centers, then learned object queries), and how to suppress the duplicate boxes that every dense predictor emits. Detection is also where deep vision became a product: face unlock, autonomous driving perception, retail shelf audits, medical lesion finding, and sports analytics all run a detector in their inner loop. By the end of the chapter you will understand the four architectural families that have defined the field, you will know which to reach for, and you will have trained and exported one on your own data.

Chapter Overview

For the last four chapters you built networks that consume an image and emit a single decision: a class, in Chapter 20, or a sequence of patch tokens fused into one in Chapter 22. Detection breaks that contract. The output is no longer one thing; it is a set whose size you do not know in advance, and every element of that set carries a spatial location. A photo of a street may contain three cars, eight pedestrians, and a dog, or it may contain none of those; the network must commit to a count and a position for each. This single requirement, predicting a variable-length set of localized labels, is the source of nearly every idea in the chapter, and it is why detection architectures look so different from the classifiers that feed them.

We begin with the rules of the game. Section 23.1 defines the bounding box, the intersection-over-union (IoU) that measures how well a predicted box overlaps the truth, and mean average precision (mAP), the precision-recall-derived score that ranks every detector you will ever read about. These metrics are not bookkeeping; they shape the loss functions and the post-processing of every model that follows, so we build them carefully and from scratch.
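As a preview of the overlap measure, here is a minimal sketch of IoU for two axis-aligned boxes. It assumes the (x1, y1, x2, y2) corner convention with x2 > x1 and y2 > y1, and it uses plain Python rather than NumPy to stay dependency-free; the function name `box_iou` is illustrative, not taken from the chapter's own code.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle; clamp to zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate zero-area inputs.
    return inter / union if union > 0 else 0.0


truth = (0, 0, 10, 10)
pred = (5, 0, 15, 10)       # same size, shifted right by half a box width
print(box_iou(truth, pred))  # intersection 50, union 150, so IoU is one third
```

A box shifted by half its width already drops to an IoU of one third, which is below the 0.50 threshold used by the VOC-style metric, so a prediction that looks roughly right to the eye can still count as a miss.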

Then we walk the four architectural families in the order history discovered them. Section 23.2 covers the two-stage R-CNN family, which first proposes regions that might contain objects and then classifies each one, the accurate-but-slower lineage that culminates in Faster R-CNN and its region proposal network. Section 23.3 covers the one-stage detectors, YOLO, SSD, and RetinaNet, which skip the proposal step and predict boxes and classes directly on a grid, trading a little accuracy for the real-time speed that put detection in phones and cameras; RetinaNet's focal loss is the idea that finally let one-stage models match two-stage accuracy. Section 23.4 shows how the field shed the hand-designed anchor box entirely, predicting object centers and sizes directly (FCOS, CenterNet) and even casting detection as keypoint estimation. Section 23.5 arrives at DETR, which reframes detection as direct set prediction with a transformer decoder and bipartite matching, eliminating both anchors and the non-maximum suppression step that every previous family needed.
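To make the focal-loss idea concrete before Section 23.3, the sketch below evaluates the binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for a single prediction. The defaults alpha = 0.25 and gamma = 2 are the values commonly quoted from the RetinaNet paper; the function names and the two example probabilities are illustrative assumptions, and a real implementation would work on logits and tensors rather than Python floats.

```python
import math


def binary_cross_entropy(p, y):
    """Plain cross-entropy for one prediction; p is the foreground probability."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction: cross-entropy scaled by alpha_t * (1 - p_t)^gamma."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background location: the model is already 99% sure it is background.
easy_bg = (0.01, 0)
# A hard foreground location: the model gives the true object only 10%.
hard_fg = (0.10, 1)

for name, (p, y) in [("easy background", easy_bg), ("hard foreground", hard_fg)]:
    ce = binary_cross_entropy(p, y)
    fl = binary_focal_loss(p, y)
    print(f"{name}: CE={ce:.5f}  FL={fl:.7f}  FL/CE={fl / ce:.5f}")
```

The modulating factor (1 - p_t)^gamma shrinks the easy example's loss by several orders of magnitude while leaving a sizeable fraction of the hard example's loss, which is how a sea of easy background locations stops swamping the gradient.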

Finally, Section 23.6 is the hands-on payoff: you will label a small custom dataset, fine-tune a modern detector on it with the augmentation and transfer-learning practices from Chapter 21, read its mAP honestly, and export it to a deployable format for the edge devices of Chapter 28. This is the workflow you will actually run in industry, distilled from the five sections of theory that precede it.

A thread runs through the whole chapter and onward. Detection localizes objects to rectangles; the moment you want pixel-precise outlines instead of rectangles you are doing segmentation, and Mask R-CNN (Faster R-CNN with one added mask-prediction head) is the bridge to Chapter 24. The attention you built in Chapter 22 returns as the engine of DETR. And the feature-pyramid fusion that detectors live on is the same multi-scale idea you first met as the image pyramid in Chapter 4. Detection is not a side quest; it is the hub where most of applied computer vision connects.

The Detection Schema: Propose, Predict, Point, Match

The whole chapter is one moving target, drawing a clean box around each of a variable number of objects, attacked four ways, each removing a piece of hand-designed machinery the last one needed. Propose: the two-stage R-CNN family of Section 23.2 first proposes regions, then classifies them. Predict: the one-stage detectors of Section 23.3 drop the proposal and predict densely on a grid. Point: the anchor-free detectors of Section 23.4 drop the anchor catalogue and predict from bare feature-map points. Match: DETR in Section 23.5 drops non-maximum suppression (NMS) and lets a matching loss produce a clean set directly. The one-line summary of the arc: steadily less hand-designed structure, steadily more learned. And the single thread that explains the last two steps is assignment cardinality: every detector that uses one-to-many assignment (many predictions per object) must clean up with NMS; the one detector that uses one-to-one assignment (exactly one prediction per object) needs no cleanup at all. Keep "propose, predict, point, match" and "one-to-many needs NMS, one-to-one is NMS-free" in mind and the chapter's four families fall into a single line.
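The cleanup step that one-to-many detectors depend on can be sketched as greedy non-maximum suppression: visit boxes from highest to lowest score and keep a box only if it does not overlap an already kept box too much. This is a minimal pure-Python sketch, assuming (x1, y1, x2, y2) boxes of a single class and an illustrative IoU threshold of 0.5; the names `greedy_nms` and `iou` are hypothetical, and production code would use a vectorized library routine instead.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, highest score first."""
    # Visit candidates from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if no already-kept box overlaps it above the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate predictions of one object, plus a separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]: the lower-scored duplicate is suppressed
```

The threshold is exactly the kind of hand-tuned knob the chapter's arc removes: set it too low and touching objects get merged, set it too high and duplicates survive.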

Prerequisites

You should have read Chapter 19: Convolutional Neural Networks and Chapter 20: CNN Architectures, because every detector in this chapter sits on top of a convolutional or transformer backbone and reuses its feature maps. Chapter 21: Training Recipes supplies the transfer learning, augmentation, and learning-rate schedules that the training section depends on. Chapter 22: Vision Transformers is the direct prerequisite for the DETR section, whose decoder is the attention block you built there. From the classical part, the box-overlap and grouping intuitions of Chapter 16: Classical Recognition Pipelines and the multi-scale pyramids of Chapter 4 give useful background, but are not strictly required. You should be comfortable reading and writing PyTorch nn.Module code.

Chapter Roadmap

23.1 The Detection Problem: Boxes, IoU & mAP What detection outputs and why it is hard: the bounding box and its coordinate conventions, intersection-over-union as the overlap measure, precision-recall curves, average precision per class, and mean average precision across classes. All built and verified from scratch in NumPy.
23.2 Two-Stage Detectors: The R-CNN Family The propose-then-classify lineage: R-CNN, the shared-backbone speedups of Fast R-CNN with RoI pooling, and Faster R-CNN's region proposal network that makes proposals learnable and the whole detector end-to-end trainable. The accurate baseline the rest of the field is measured against.
23.3 One-Stage Detectors: YOLO, SSD & RetinaNet Detection as dense grid prediction with no proposal step: YOLO's single forward pass, SSD's multi-scale default boxes, and RetinaNet's focal loss, the idea that solved the extreme foreground-background imbalance and let one-stage detectors finally match two-stage accuracy at real-time speed.
23.4 Anchor-Free & Keypoint-Based Detection Dropping the hand-tuned anchor box: FCOS predicting boxes per feature-map location with a center-ness branch, CenterNet detecting objects as heatmap peaks, and the keypoint view of detection. Why anchor-free designs simplified the pipeline and became the basis of modern real-time detectors.
23.5 DETR: Detection as Set Prediction Detection reframed as predicting a fixed-size set with a transformer decoder and a learned set of object queries. The Hungarian bipartite matching loss that removes non-maximum suppression, why the original DETR trained slowly, and how Deformable DETR and the DINO family fixed it.
23.6 Training & Deploying a Detector on Custom Data The end-to-end practitioner workflow: labeling a small dataset, choosing and fine-tuning a modern detector with Ultralytics YOLO, reading validation mAP without fooling yourself, common training failures and their fixes, and exporting to ONNX and TensorRT for deployment.

What's Next?

A detector tells you that an object sits inside a rectangle, but a rectangle is a coarse summary of a thing with an actual outline. The instant you need the precise silhouette, the count of touching instances, or a mask you can composite, you have crossed into segmentation. Chapter 24: Segmentation: Semantic, Instance & Promptable picks up exactly where this chapter ends: Mask R-CNN adds a single mask-prediction head to the Faster R-CNN of Section 23.2 and turns boxes into instance masks, the mask transformers generalize the DETR decoder of Section 23.5 to per-pixel prediction, and the Segment Anything Model makes masks promptable. Detection and segmentation are two readings of the same localized-recognition problem; learn detection well here and segmentation will feel like a refinement rather than a new subject. The deployment skills from Section 23.6 carry directly into the edge-efficiency techniques of Chapter 28.

Bibliography & Further Reading

Foundational Papers

Girshick, R. et al. "Rich feature hierarchies for accurate object detection and semantic segmentation (R-CNN)." CVPR (2014). arXiv:1311.2524

The paper that brought deep features to detection: region proposals classified by a CNN. Slow, but it shattered the previous state of the art and launched the two-stage family of Section 23.2.

Ren, S. et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NeurIPS (2015). arXiv:1506.01497

The region proposal network made proposals learnable and the detector end-to-end trainable, the design that defines two-stage detection in Section 23.2 to this day.

Redmon, J. et al. "You Only Look Once: Unified, Real-Time Object Detection." CVPR (2016). arXiv:1506.02640

YOLO, the founding one-stage detector of Section 23.3. A single network predicts all boxes and classes in one pass, trading some accuracy for the speed that put detection on live video.

Liu, W. et al. "SSD: Single Shot MultiBox Detector." ECCV (2016). arXiv:1512.02325

SSD of Section 23.3 predicts default boxes at multiple feature-map scales in one shot, the multi-scale dense-prediction template that later detectors refined.

Lin, T.-Y. et al. "Focal Loss for Dense Object Detection (RetinaNet)." ICCV (2017). arXiv:1708.02002

Focal loss, the idea of Section 23.3 that down-weights easy background examples so the rare foreground dominates training, letting a one-stage detector match two-stage accuracy.

Lin, T.-Y. et al. "Feature Pyramid Networks for Object Detection." CVPR (2017). arXiv:1612.03144

The FPN that fuses coarse-semantic and fine-spatial feature maps, the multi-scale neck nearly every detector in this chapter sits on, and the learned descendant of the image pyramids of Chapter 4.

Anchor-Free, Keypoint & Set-Prediction Detectors

Carion, N. et al. "End-to-End Object Detection with Transformers (DETR)." ECCV (2020). arXiv:2005.12872

DETR of Section 23.5 reframes detection as set prediction with a transformer decoder and Hungarian matching, removing anchors and non-maximum suppression entirely.

Tian, Z. et al. "FCOS: Fully Convolutional One-Stage Object Detection." ICCV (2019). arXiv:1904.01355

The anchor-free detector of Section 23.4: per-location box regression with a center-ness branch, no anchor hyperparameters at all, and the basis of many modern real-time models.

Zhou, X. et al. "Objects as Points (CenterNet)." arXiv (2019). arXiv:1904.07850

The keypoint view of detection in Section 23.4: represent each object by its center as a peak in a class heatmap and regress size at the peak, an NMS-free design that also seeds pose estimation and tracking.

Zhang, H. et al. "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection." ICLR (2023). arXiv:2203.03605

The DETR-family detector of Section 23.5 that finally beat the best convolutional models on COCO, combining query denoising, mixed query selection, and deformable attention.

Liu, S. et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." ECCV (2024). arXiv:2303.05499

The open-vocabulary frontier of Section 23.5: detect objects named by an arbitrary text prompt, not a fixed label set, by fusing the DINO detector with a text encoder.

Tools & Libraries

Jocher, G. et al. Ultralytics YOLO (YOLO11). docs.ultralytics.com

The most widely used training and deployment toolkit for modern YOLO detectors, the library behind the custom-training and export workflow of Section 23.6.

torchvision detection models and reference scripts. pytorch.org/vision

Pretrained Faster R-CNN, RetinaNet, FCOS, and SSD with a uniform API, the library shortcut used throughout Sections 23.2 to 23.4.

Hugging Face Transformers object-detection models (DETR, Deformable DETR, RT-DETR). huggingface.co/docs/transformers

High-level loaders and a post-processor for the DETR family, the shortcut used in Section 23.5 to run set-prediction detection in a handful of lines.

Datasets & Benchmarks

Lin, T.-Y. et al. "Microsoft COCO: Common Objects in Context." ECCV (2014). cocodataset.org

The 80-class benchmark and the COCO mAP protocol (average precision averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05) that every detector in this chapter reports against. The standard yardstick of Section 23.1.

Everingham, M. et al. "The PASCAL Visual Object Classes (VOC) Challenge." IJCV (2010). host.robots.ox.ac.uk/pascal/VOC

The 20-class predecessor to COCO and the source of the original VOC mAP (a single IoU threshold of 0.50), the simpler metric we build first in Section 23.1.