"Classification asked me one easy question: what is this? Detection asks me a thousand questions at once, and they all start with 'where', and they all expect a tidy rectangle for an answer. I have learned to draw boxes in my sleep, and to apologize for the ones that overlap."
An Anchor Box With Attachment Problems
Object detection answers two questions for every image at once: where is each object, and what is it. A classifier produces a single label for a whole image; a detector produces a variable-length list of (box, class, confidence) triples, one per object, and it must do this for crowded scenes it has never seen. That change of output, from one label to an unknown number of localized labels, forces every design decision in this chapter: how to score an imperfect box against the truth (IoU), how to summarize a detector's quality across all confidence thresholds (mean average precision), how to turn a fixed-size network into a variable-length predictor (anchors, then anchor-free centers, then learned object queries), and how to suppress the duplicate boxes that every dense predictor emits. Detection is also where deep vision became a product: face unlock, autonomous driving perception, retail shelf audits, medical lesion finding, and sports analytics all run a detector in their inner loop. By the end of the chapter you will understand the four architectural families that have defined the field, you will know which to reach for, and you will have trained and exported one on your own data.
Chapter Overview
For the last four chapters you built networks that consume an image and emit a single decision: a class, in Chapter 20, or a sequence of patch tokens fused into one in Chapter 22. Detection breaks that contract. The output is no longer one thing; it is a set whose size you do not know in advance, and every element of that set carries a spatial location. A photo of a street may contain three cars, eight pedestrians, and a dog, or it may contain none of those; the network must commit to a count and a position for each. This single requirement, predicting a variable-length set of localized labels, is the source of nearly every idea in the chapter, and it is why detection architectures look so different from the classifiers that feed them.
We begin with the rules of the game. Section 23.1 defines the bounding box, the intersection-over-union (IoU) that measures how well a predicted box overlaps the truth, and mean average precision (mAP), the precision-recall-derived score that ranks every detector you will ever read about. These metrics are not bookkeeping; they shape the loss functions and the post-processing of every model that follows, so we build them carefully and from scratch.
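As a preview of the overlap measure, here is a minimal sketch of IoU for two axis-aligned boxes. It assumes corner-format boxes `(x1, y1, x2, y2)` in continuous coordinates; the function name `iou` and the sample boxes are illustrative choices, not the code Section 23.1 builds.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 (corner format, continuous coordinates).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped to zero
    # when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


truth = (0, 0, 10, 10)
pred = (5, 0, 15, 10)   # same size, shifted right by half a width
print(iou(truth, pred))  # about 0.333: intersection 50, union 150
```

A prediction shifted by half the object's width already falls to an IoU of one third, which is why the commonly used 0.5 threshold is stricter than it sounds.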
Then we walk the four architectural families in the order history discovered them. Section 23.2 covers the two-stage R-CNN family, which first proposes regions that might contain objects and then classifies each one, the accurate-but-slower lineage that culminates in Faster R-CNN and its region proposal network. Section 23.3 covers the one-stage detectors, YOLO, SSD, and RetinaNet, which skip the proposal step and predict boxes and classes directly on a grid, trading a little accuracy for the real-time speed that put detection in phones and cameras; RetinaNet's focal loss is the idea that finally let one-stage models match two-stage accuracy. Section 23.4 shows how the field shed the hand-designed anchor box entirely, predicting object centers and sizes directly (FCOS, CenterNet) and even casting detection as keypoint estimation. Section 23.5 arrives at DETR, which reframes detection as direct set prediction with a transformer decoder and bipartite matching, eliminating both anchors and the non-maximum suppression step that every previous family needed.
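To make the bipartite matching behind DETR concrete, the sketch below assigns each ground-truth object exactly one prediction so that the total cost is minimal. It is a toy under stated assumptions: the cost matrix is hypothetical, the helper `match_one_to_one` is an illustrative name, and the search is brute force over permutations rather than the Hungarian algorithm a real implementation would use (for example through SciPy's `linear_sum_assignment`).

```python
from itertools import permutations


def match_one_to_one(cost):
    """Find the minimum-cost one-to-one assignment of predictions to objects.

    cost[g][p] is the cost of explaining ground-truth object g with
    prediction p. Requires at least as many predictions as objects.
    Brute force: fine for a toy example, exponential in general.
    """
    num_gt, num_pred = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Each permutation picks a distinct prediction for every object.
    for assignment in permutations(range(num_pred), num_gt):
        total = sum(cost[g][p] for g, p in enumerate(assignment))
        if total < best_total:
            best_total, best_assignment = total, assignment
    return list(best_assignment), best_total


# Hypothetical costs: 2 ground-truth objects (rows), 4 predictions (columns).
# Both objects individually prefer prediction 1, but only one may have it.
cost = [
    [0.90, 0.10, 0.80, 0.20],
    [0.70, 0.15, 0.60, 0.20],
]
assignment, total = match_one_to_one(cost)
print(assignment)  # [1, 3]: object 0 gets prediction 1, object 1 gets prediction 3
```

Predictions 0 and 2 are left unmatched; in DETR such predictions are trained toward a "no object" class, which is how a fixed number of queries represents a variable number of objects.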
Finally, Section 23.6 is the hands-on payoff: you will label a small custom dataset, fine-tune a modern detector on it with the augmentation and transfer-learning practices from Chapter 21, read its mAP honestly, and export it to a deployable format for the edge devices of Chapter 28. This is the workflow you will actually run in industry, distilled from the five sections of theory that precede it.
A thread runs through the whole chapter and onward. Detection localizes objects to rectangles; the moment you want pixel-precise outlines instead of rectangles you are doing segmentation, and Mask R-CNN (Faster R-CNN plus a single mask-prediction head) is the bridge to Chapter 24. The attention you built in Chapter 22 returns as the engine of DETR. And the feature-pyramid fusion that detectors live on is the same multi-scale idea you first met as the image pyramid in Chapter 4. Detection is not a side quest; it is the hub where most of applied computer vision connects.
The whole chapter is one moving target, drawing a clean box around a variable number of objects, attacked four ways, each removing a piece of hand-designed machinery the last one needed. Propose: the two-stage R-CNN family of Section 23.2 first proposes regions, then classifies them. Predict: the one-stage detectors of Section 23.3 drop the proposal and predict densely on a grid. Point: the anchor-free detectors of Section 23.4 drop the anchor catalogue and predict from bare feature-map points. Match: DETR in Section 23.5 drops non-maximum suppression and lets a matching loss produce a clean set directly. The one-line summary of the arc: steadily less hand-designed structure, steadily more learned. And the single thread that explains the last two steps is assignment cardinality: every detector that uses one-to-many assignment (many predictions per object) must clean up with NMS; the one detector that uses one-to-one assignment (exactly one prediction per object) needs no cleanup at all. Keep "propose, predict, point, match" and "one-to-many needs NMS, one-to-one needs none" in mind and the chapter's four families fall into a single line.
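The cleanup step that one-to-many detectors depend on can be sketched in a few lines. This is a minimal greedy, class-agnostic NMS under stated assumptions: boxes are in `(x1, y1, x2, y2)` corner format, the 0.5 threshold is an illustrative default, and the detections are made up. Production pipelines usually run NMS per class and use a vectorized library routine instead.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression over (box, score) pairs.

    Repeatedly keeps the highest-scoring remaining detection and discards
    every other detection that overlaps it by more than iou_threshold.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


# Hypothetical dense-predictor output: three near-duplicate boxes on one
# object and a single box on a second, distant object.
detections = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.8),
    ((11, 9, 49, 51), 0.7),
    ((100, 100, 140, 140), 0.6),
]
kept = nms(detections)
print([score for _, score in kept])  # [0.9, 0.6]: one box survives per object
```

The two suppressed boxes are not wrong predictions; they are the extra answers that one-to-many assignment trained the network to give, which is why removing that assignment scheme also removes the need for this step.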
Prerequisites
You should have read Chapter 19: Convolutional Neural Networks and Chapter 20: CNN Architectures, because every detector in this chapter sits on top of a convolutional or transformer backbone and reuses its feature maps. Chapter 21: Training Recipes supplies the transfer learning, augmentation, and learning-rate schedules that the training section depends on. Chapter 22: Vision Transformers is the direct prerequisite for the DETR section, whose decoder is the attention block you built there. From the classical part, the box-overlap and grouping intuitions of Chapter 16: Classical Recognition Pipelines and the multi-scale pyramids of Chapter 4 give useful background, but are not strictly required. You should be comfortable reading and writing PyTorch nn.Module code.
Chapter Roadmap
- 23.1 The Detection Problem: Boxes, IoU & mAP. What detection outputs and why it is hard: the bounding box and its coordinate conventions, intersection-over-union as the overlap measure, precision-recall curves, average precision per class, and mean average precision across classes. All built and verified from scratch in NumPy.
- 23.2 Two-Stage Detectors: The R-CNN Family. The propose-then-classify lineage: R-CNN, the shared-backbone speedups of Fast R-CNN with RoI pooling, and Faster R-CNN's region proposal network that makes proposals learnable and the whole detector end-to-end trainable. The accurate baseline the rest of the field is measured against.
- 23.3 One-Stage Detectors: YOLO, SSD & RetinaNet. Detection as dense grid prediction with no proposal step: YOLO's single forward pass, SSD's multi-scale default boxes, and RetinaNet's focal loss, the idea that solved the extreme foreground-background imbalance and let one-stage detectors finally match two-stage accuracy while keeping one-stage speed.
- 23.4 Anchor-Free & Keypoint-Based Detection. Dropping the hand-tuned anchor box: FCOS predicting boxes per feature-map location with a center-ness branch, CenterNet detecting objects as heatmap peaks, and the keypoint view of detection. Why anchor-free designs simplified the pipeline and became the basis of modern real-time detectors.
- 23.5 DETR: Detection as Set Prediction. Detection reframed as predicting a fixed-size set with a transformer decoder and a learned set of object queries. The Hungarian bipartite matching loss that removes non-maximum suppression, why the original DETR trained slowly, and how Deformable DETR and the DINO family fixed it.
- 23.6 Training & Deploying a Detector on Custom Data. The end-to-end practitioner workflow: labeling a small dataset, choosing and fine-tuning a modern detector with Ultralytics YOLO, reading validation mAP without fooling yourself, common training failures and their fixes, and exporting to ONNX and TensorRT for deployment.
What's Next?
A detector tells you that an object sits inside a rectangle, but a rectangle is a coarse summary of a thing with an actual outline. The instant you need the precise silhouette, the count of touching instances, or a mask you can composite, you have crossed into segmentation. Chapter 24: Segmentation: Semantic, Instance & Promptable picks up exactly where this chapter ends: Mask R-CNN adds a single mask-prediction head to the Faster R-CNN of Section 23.2 and turns boxes into instance masks, the mask transformers generalize the DETR decoder of Section 23.5 to per-pixel prediction, and the Segment Anything Model makes masks promptable. Detection and segmentation are two readings of the same localized-recognition problem; learn detection well here and segmentation will feel like a refinement rather than a new subject. The deployment skills from Section 23.6 carry directly into the edge-efficiency techniques of Chapter 28.
Bibliography & Further Reading
Foundational Papers
Girshick, R. et al. "Rich feature hierarchies for accurate object detection and semantic segmentation (R-CNN)." CVPR (2014). arXiv:1311.2524
Ren, S. et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NeurIPS (2015). arXiv:1506.01497
Redmon, J. et al. "You Only Look Once: Unified, Real-Time Object Detection." CVPR (2016). arXiv:1506.02640
Liu, W. et al. "SSD: Single Shot MultiBox Detector." ECCV (2016). arXiv:1512.02325
Lin, T.-Y. et al. "Focal Loss for Dense Object Detection (RetinaNet)." ICCV (2017). arXiv:1708.02002
Lin, T.-Y. et al. "Feature Pyramid Networks for Object Detection." CVPR (2017). arXiv:1612.03144
Anchor-Free, Keypoint & Set-Prediction Detectors
Carion, N. et al. "End-to-End Object Detection with Transformers (DETR)." ECCV (2020). arXiv:2005.12872
Tian, Z. et al. "FCOS: Fully Convolutional One-Stage Object Detection." ICCV (2019). arXiv:1904.01355
Zhou, X. et al. "Objects as Points (CenterNet)." arXiv (2019). arXiv:1904.07850
Zhang, H. et al. "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection." ICLR (2023). arXiv:2203.03605
Liu, S. et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." ECCV (2024). arXiv:2303.05499
Tools & Libraries
Jocher, G. et al. Ultralytics YOLO (YOLO11). docs.ultralytics.com
torchvision detection models and reference scripts. pytorch.org/vision
Hugging Face Transformers object-detection models (DETR, Deformable DETR, RT-DETR). huggingface.co/docs/transformers
Datasets & Benchmarks
Lin, T.-Y. et al. "Microsoft COCO: Common Objects in Context." ECCV (2014). cocodataset.org
Everingham, M. et al. "The PASCAL Visual Object Classes (VOC) Challenge." IJCV (2010). host.robots.ox.ac.uk/pascal/VOC