Section 23.3: One-Stage Detectors: YOLO, SSD & RetinaNet

"Why propose regions and then judge them, like some indecisive committee? I look once. The whole grid speaks at the same instant, every cell shouting what it sees and where. It is chaos, but it is fast chaos, and I have learned to ignore the ten thousand cells yelling 'background' at me."
A YOLO Grid Cell That Made Up Its Mind

Big Picture

One-stage detectors delete the proposal step entirely: a single network looks at the image once and predicts a class and a box directly at every location of a dense grid, trading a little accuracy for the real-time speed that two-stage detectors cannot reach. This is the family that put detection on phones, drones, and live video. The price of skipping proposals is a brutal data imbalance, since the vast majority of grid locations are background, and for years that imbalance kept one-stage models a step behind in accuracy. RetinaNet's focal loss diagnosed and fixed the imbalance, letting a one-stage detector match the best two-stage models for the first time. This section builds the dense-prediction idea through YOLO and SSD, derives the focal loss carefully, and connects it to the feature pyramid that makes multi-scale detection work.

The Faster R-CNN of Section 23.2 is accurate but spends a per-region head on hundreds of proposals, capping it at a handful of frames per second. For a self-driving perception stack or a phone camera that wants to draw boxes on a live preview, that is far too slow. The one-stage family asks: what if we skip the proposal stage and predict everything in a single forward pass, treating detection as a dense regression problem over a fixed grid? The answer is a class of detectors an order of magnitude faster, and the engineering question that dominates them is how to train such a dense predictor when almost every location is background.

1. YOLO: You Only Look Once (Beginner)

The original YOLO (Redmon et al., 2016) reframed detection as a single regression. Divide the image into an $S \times S$ grid; each cell predicts a small fixed number of boxes (center, size, and an objectness confidence) plus one class distribution for the cell. A whole image's worth of detections, all boxes and all classes, emerges from one forward pass of one network, hence "you only look once." The training target assigns each ground-truth object to the single grid cell containing its center, and the loss is a weighted sum of localization error on the responsible box, objectness error everywhere, and classification error on the responsible cell.
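The target-assignment rule above (each object belongs to the one cell containing its center) can be sketched in a few lines. This is a minimal illustration, not the original YOLO code: the function name `responsible_cell` and the normalized `(cx, cy, w, h)` box format are assumptions made for the example, and `S=7` is used only as a representative grid size.

```python
def responsible_cell(box, S=7):
    """Return (row, col) of the grid cell containing the box center,
    plus the center's offset inside that cell.

    box: (cx, cy, w, h), all normalized to [0, 1] relative to the image.
    """
    cx, cy, w, h = box
    # Clamp so a center exactly on the right/bottom edge stays in the last cell.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    # Offsets in [0, 1]: where the center sits inside its cell.
    dx = cx * S - col
    dy = cy * S - row
    return row, col, (dx, dy)

boxes = [(0.50, 0.50, 0.20, 0.30),   # center in the middle cell
         (0.52, 0.55, 0.10, 0.10),   # a different object, same cell
         (0.05, 0.95, 0.08, 0.08)]   # near the bottom-left corner
cells = [responsible_cell(b)[:2] for b in boxes]
print(cells)   # [(3, 3), (3, 3), (6, 0)]
```

The first two boxes land in the same cell, which is exactly the collision that a one-class-distribution-per-cell design cannot represent.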

That single-pass design made YOLO dramatically faster than anything before it, running at real-time frame rates while staying respectably accurate. Its weaknesses were equally instructive: the original coarse grid struggled with small objects and with multiple objects whose centers fell in the same cell, and its single-scale prediction missed the multi-scale structure that detection needs. Every YOLO version since (and there have been many, through YOLO11 in 2024 and the YOLO12 line in 2025) has been a steady accumulation of the fixes the rest of this chapter describes: anchors, then anchor-free heads, multi-scale feature pyramids, better losses, and stronger augmentation. The YOLO name now denotes a family of fast, well-engineered one-stage detectors rather than a single architecture, and it is the family you will fine-tune in Section 23.6.

Common Misconception: "Looks Once" Means a Shallow Network, and One-Stage Means Less Accurate

Two confusions cluster around the name. First, "you only look once" does not mean the network is shallow or sees each pixel a single time: like every detector here, YOLO runs a deep backbone whose stacked convolutions process each pixel through many layers. "Once" refers to the absence of a separate region-proposal pass, not to the depth of computation. Second, students assume one-stage detectors are inherently less accurate than two-stage ones. That was historically true, but the main cause was the foreground-background imbalance of subsection 3, not any information lost by skipping proposals. Once focal loss fixed the imbalance (subsection 4), one-stage RetinaNet matched the best two-stage models, and modern YOLOs now sit at the front of the speed-accuracy frontier. The one-stage versus two-stage choice is a speed-and-recall trade-off, not an accuracy ceiling.

2. SSD: Multi-Scale Default Boxes (Intermediate)

SSD (Single Shot MultiBox Detector, Liu et al., 2016) addressed YOLO's single-scale weakness head-on. Instead of predicting from one final feature map, SSD attaches detection heads to several feature maps of decreasing resolution within the backbone. Early, high-resolution maps have small receptive fields and detect small objects; later, low-resolution maps have large receptive fields and detect large objects. At each location of each map, SSD places a set of default boxes (its name for anchors) of a few aspect ratios, and predicts a class score and a box offset per default box, exactly the anchor parameterization of Section 23.2. The result is a dense, multi-scale, single-pass detector that handled the range of object sizes far better than the original YOLO.
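To make the default-box layout concrete, here is a rough sketch of how such boxes could be generated. It loosely follows the scheme described in the SSD paper (one scale per feature map, a few aspect ratios, plus one extra square box per location), but the function name `default_boxes`, the feature-map sizes, and the ratio sets below are illustrative assumptions resembling a 300x300-input configuration; real implementations differ in details such as clipping and per-layer scales.

```python
import math

def default_boxes(map_sizes, ratios_per_map, s_min=0.2, s_max=0.9):
    """Generate (cx, cy, w, h) default boxes, normalized to the image.
    Boxes are not clipped, so large ones may extend past the image border."""
    m = len(map_sizes)
    # One scale per feature map, growing linearly from s_min to s_max.
    scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
    scales.append(1.0)  # only used for the extra box on the last map
    boxes = []
    for k, (size, ratios) in enumerate(zip(map_sizes, ratios_per_map)):
        # Extra square box at a scale between this map and the next.
        s_extra = math.sqrt(scales[k] * scales[k + 1])
        for i in range(size):
            for j in range(size):
                cx, cy = (j + 0.5) / size, (i + 0.5) / size
                for r in ratios:
                    boxes.append((cx, cy,
                                  scales[k] * math.sqrt(r),
                                  scales[k] / math.sqrt(r)))
                boxes.append((cx, cy, s_extra, s_extra))
    return boxes

map_sizes = [38, 19, 10, 5, 3, 1]          # fine to coarse (assumed sizes)
narrow = (1, 2, 0.5)
wide = (1, 2, 0.5, 3, 1 / 3)
ratios = [narrow, wide, wide, wide, narrow, narrow]

boxes = default_boxes(map_sizes, ratios)
per_map = [s * s * (len(r) + 1) for s, r in zip(map_sizes, ratios)]
print(per_map)      # [5776, 2166, 600, 150, 36, 4]
print(len(boxes))   # 8732
```

Under these assumptions about two thirds of all boxes come from the finest map alone, which hints at why a dense detector ends up with so many candidate locations per image.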

SSD's multi-scale insight, that different feature-map resolutions should detect different object sizes, is the direct ancestor of the feature pyramid network (FPN) that nearly every modern detector uses. FPN improves on SSD's naive use of backbone maps by adding a top-down pathway: it takes the semantically rich but spatially coarse deep feature map, upsamples it, and adds it through a lateral connection to the spatially fine but semantically weak shallow maps, so that every level of the pyramid is semantically strong at its own resolution, including the high-resolution levels that small objects need. This is the learned, end-to-end descendant of the Gaussian and Laplacian image pyramids you built in Chapter 4, and the feature-hierarchy fusion of Chapter 20. Figure 23.3.1 contrasts the single-scale, plain multi-scale, and top-down-fused designs.

Figure 23.3.1: From single-scale to feature pyramid. SSD predicts on plain backbone maps of different resolutions; FPN adds a top-down pathway (purple) that injects deep semantic features into the high-resolution shallow maps, so every pyramid level, including the finest, is semantically rich. Modern detectors predict on the FPN levels.
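The top-down fusion can be sketched as a small PyTorch module. This is a simplified illustration under stated assumptions, not a production FPN: the class name `TinyTopDown`, the channel widths, and the three-level input are hypothetical, while the 1x1 lateral and 3x3 smoothing convolutions follow the general recipe of the FPN paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTopDown(nn.Module):
    """Illustrative FPN-style top-down pathway (hypothetical name)."""

    def __init__(self, in_channels=(64, 128, 256), out_channels=32):
        super().__init__()
        # 1x1 lateral convs bring every backbone map to a common width.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, feats):
        # feats: backbone maps ordered fine (shallow) to coarse (deep).
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        merged = [laterals[-1]]                  # start from the deepest map
        for lat in reversed(laterals[:-1]):
            # Upsample the coarser merged map and add it to the finer lateral.
            up = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + up)
        return [s(m) for s, m in zip(self.smooth, merged)]

feats = [torch.rand(1, 64, 64, 64),     # shallow: fine but semantically weak
         torch.rand(1, 128, 32, 32),
         torch.rand(1, 256, 16, 16)]    # deep: coarse but semantically rich
pyramid = TinyTopDown()(feats)
print([tuple(p.shape) for p in pyramid])
```

Each output level keeps its own spatial resolution but now shares a channel width and carries information from every deeper level. torchvision also provides `torchvision.ops.FeaturePyramidNetwork` for real use.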

3. The Foreground-Background Imbalance (Intermediate)

Here is the structural problem that kept one-stage detectors behind. A dense detector evaluates a class score at every anchor on every pyramid level, perhaps a hundred thousand anchors per image. Of those, only a few dozen overlap a real object; the rest, more than 99.9 percent, are background. When you sum a standard cross-entropy loss over all of them, the gradient is dominated by the enormous mass of easy background examples, each contributing a tiny but relentless signal that collectively drowns out the few foreground examples that actually teach the model what objects look like. Two-stage detectors dodged this because the RPN and a sampling step rebalanced the data before the second-stage loss; a one-stage detector has no such filter. The illustration below previews the fix: turning the easy crowd's volume down rather than discarding it.

Illustration: A teacher robot turns a volume knob that fades a vast crowd of pale, identical, sleepy background blobs into near silence while a few bright, distinct little object characters stay vivid and loud. Focal loss does not throw the easy background away; it simply turns its volume down, so the hundred thousand bored 'background' voices stop drowning out the few objects that actually have something to teach.

The standard cross-entropy for a single example with predicted probability $p_t$ of the correct class is $\text{CE}(p_t) = -\log(p_t)$. The trouble is that even a well-classified background example with, say, $p_t = 0.9$ still contributes a loss of $-\log(0.9) \approx 0.105$, and multiplied by a hundred thousand such examples this is a large, persistent gradient that swamps the foreground. The detector learns to predict "background" confidently and stops improving on the rare objects.
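A back-of-the-envelope calculation shows the scale of the problem. The setup below is an assumption chosen for illustration (an 800x800 input, pyramid strides 8 through 128, 9 anchors per location, 40 anchors matched to objects, and a hard foreground probability of $p_t = 0.1$); only the $p_t = 0.9$ background value comes from the text above.

```python
import math

# Assumed setup: 800x800 input, pyramid strides 8..128, 9 anchors per location.
image_size, strides, anchors_per_loc = 800, (8, 16, 32, 64, 128), 9
locations = [math.ceil(image_size / s) ** 2 for s in strides]
total_anchors = anchors_per_loc * sum(locations)

# Assumed split: 40 anchors matched to objects, everything else background.
n_fg = 40
n_bg = total_anchors - n_fg

def cross_entropy(p_t):
    """Cross-entropy of one example whose true class gets probability p_t."""
    return -math.log(p_t)

bg_loss = n_bg * cross_entropy(0.9)   # easy, well-classified background
fg_loss = n_fg * cross_entropy(0.1)   # hard, still-misclassified foreground

print(total_anchors)                        # 120087
print(f"background fraction: {n_bg / total_anchors:.4%}")
print(f"summed CE, background: {bg_loss:.0f}")
print(f"summed CE, foreground: {fg_loss:.0f}")
print(f"background / foreground: {bg_loss / fg_loss:.0f}x")
```

Even though each background anchor is already classified correctly and each foreground anchor is badly wrong, the background total is more than a hundred times larger under these assumptions.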

Key Insight: Imbalance Is a Loss Problem, Not a Data Problem

The natural first instinct is to fix imbalance by resampling: throw away most background examples and keep only the hardest negatives (hard negative mining), as SSD did with a cap of 3:1 negatives to positives. RetinaNet's deeper insight was that you do not need to discard any examples if you instead reshape the loss so that easy examples contribute almost nothing. This keeps the full signal from the hard examples of every class while automatically silencing the easy background, and it requires no sampling heuristics or ratio tuning. Reshaping the loss rather than the dataset is a recurring move in deep learning; you will see its cousin in the class-balanced losses of long-tailed recognition.
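For contrast with the loss-reshaping approach, the resampling alternative can be sketched as follows. This is a simplified, framework-free illustration of SSD-style hard negative mining; the function name `hard_negative_mining` and the toy per-anchor losses are made up for the example.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return the indices kept for the loss: every positive, plus the
    highest-loss negatives, capped at neg_pos_ratio per positive."""
    pos = [i for i, p in enumerate(is_positive) if p]
    neg = [i for i, p in enumerate(is_positive) if not p]
    # Hardest negatives first: the ones the model is most wrong about.
    neg.sort(key=lambda i: losses[i], reverse=True)
    keep_neg = neg[: neg_pos_ratio * len(pos)]
    return sorted(pos + keep_neg)

# Toy per-anchor classification losses: one positive, seven negatives.
losses      = [2.1, 0.05, 0.9, 0.02, 0.01, 1.4, 0.03, 0.6]
is_positive = [True, False, False, False, False, False, False, False]

kept = hard_negative_mining(losses, is_positive)
print(kept)   # [0, 2, 5, 7]
```

The four easiest negatives are dropped outright, and the ratio is a hyperparameter someone has to choose; those are the two costs that reshaping the loss avoids.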

4. Focal Loss: Down-Weighting the Easy (Advanced)

The focal loss (Lin et al., 2017) is cross-entropy multiplied by a modulating factor that shrinks toward zero as an example becomes easy:

\text{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)

The factor $(1 - p_t)^{\gamma}$ is the whole idea. For a well-classified example $p_t$ is near $1$, so $(1 - p_t)^{\gamma}$ is near $0$ and the example's loss is suppressed; for a misclassified example $p_t$ is small, $(1 - p_t)^{\gamma}$ is near $1$, and the loss is essentially unchanged. The focusing parameter $\gamma$ (the paper uses $\gamma = 2$) controls how aggressively easy examples are down-weighted: at $\gamma = 0$ focal loss is plain cross-entropy, and larger $\gamma$ silences the easy examples harder. The $\alpha_t$ is an ordinary class-weighting term that additionally balances foreground against background. Figure 23.3.2 plots how the modulating factor collapses the loss of easy examples.

Figure 23.3.2: Cross-entropy versus focal loss. For easy examples (high $p_t$, right side) the modulating factor $(1 - p_t)^{\gamma}$ drives the focal loss toward zero, so the hundred thousand easy background anchors barely move the gradient, while hard examples (left side) are nearly unchanged. This rebalancing is what let RetinaNet train a dense detector to two-stage accuracy.
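A short worked example makes the suppression concrete. Assume $\alpha_t = 1$, $\gamma = 2$, and natural logarithms; the two probabilities are chosen purely for illustration (one very easy example, one badly misclassified one).

```latex
\begin{aligned}
p_t = 0.99:&\quad \text{CE} = -\log(0.99) \approx 0.01005, \quad (1 - 0.99)^{2} = 10^{-4}, \quad \text{FL} \approx 1.0 \times 10^{-6} \\
p_t = 0.10:&\quad \text{CE} = -\log(0.10) \approx 2.303, \quad (1 - 0.10)^{2} = 0.81, \quad \text{FL} \approx 1.865
\end{aligned}
```

Under cross-entropy the hard example outweighs the easy one by a factor of about 229. Under focal loss that factor is multiplied by $0.81 / 10^{-4} = 8100$, reaching roughly $1.9 \times 10^{6}$. The hard example loses about a fifth of its loss, while the easy example loses all but one part in ten thousand.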

The implementation is a few lines on top of the binary-cross-entropy logits loss, and reading it makes the modulating factor concrete.

# Focal loss: down-weight the easy, well-classified examples so the hundred
# thousand background anchors stop drowning out the rare foreground ones.
# The (1 - p_t) ** gamma factor is what does the rebalancing.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: same shape (e.g. (N, num_classes)) with 0/1 targets.
       Returns the summed focal loss over all entries."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true label
    modulating = (1 - p_t) ** gamma                    # collapses for easy examples
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * modulating * ce).sum()

logits = torch.tensor([[3.0], [0.1]])     # one easy positive, one hard positive
targets = torch.tensor([[1.0], [1.0]])
print(focal_loss(logits, targets).item())  # the easy example contributes far less

Code Fragment 1: Focal loss in five lines on top of binary_cross_entropy_with_logits. The easy positive (logit 3.0, so $p_t \approx 0.95$) is multiplied by a tiny modulating factor $(1 - 0.95)^2 \approx 0.0025$ and nearly vanishes, while the hard positive (logit 0.1, so $p_t \approx 0.52$) keeps a modulating factor of about $0.23$, roughly a hundred times larger, which is the rebalancing Figure 23.3.2 shows.

Try This: Turn the Focusing Knob

Re-run the snippet above three times with gamma=0.0, then 2.0, then 5.0 (leave everything else fixed), and watch how the gap between the easy and hard examples changes. At gamma=0 focal loss is plain cross-entropy (scaled by alpha) and the easy positive still contributes a sizeable loss; raise gamma and the easy example's term collapses toward zero while the hard one shrinks far more slowly. Print the two examples' losses separately (drop the .sum() and inspect alpha_t * modulating * ce) and you will see the ratio of hard-to-easy emphasis grow with gamma, which is the whole reason a one-stage detector can train despite a hundred thousand easy background anchors per image. This thirty-second sweep makes the modulating factor of Figure 23.3.2 something you have felt, not just read.

RetinaNet is the architecture that paired this loss with a ResNet-plus-FPN backbone and a simple dense head. With the imbalance solved by the loss rather than by sampling, RetinaNet became the first one-stage detector to match the accuracy of the best two-stage models, while keeping the one-stage speed advantage. Focal loss has since spread far beyond RetinaNet; it appears in FCOS (Section 23.4), in many YOLO heads, and in the classification branch of several DETR variants.

Library Shortcut: RetinaNet and Focal Loss, Prebuilt

torchvision ships RetinaNet with COCO-pretrained weights, and its focal-loss helper is a single import, so you never need the from-scratch version above in production:

# Load a COCO-pretrained RetinaNet and run a forward pass.
# The same focal loss from Code Fragment 1 is available prebuilt as
# torchvision.ops.sigmoid_focal_loss, so nothing here is hand-written.
import torch
from torchvision.models.detection import (
    retinanet_resnet50_fpn_v2, RetinaNet_ResNet50_FPN_V2_Weights)
from torchvision.ops import sigmoid_focal_loss     # the same loss, prebuilt (not called below)

weights = RetinaNet_ResNet50_FPN_V2_Weights.DEFAULT
model = retinanet_resnet50_fpn_v2(weights=weights).eval()
with torch.no_grad():
    out = model([torch.rand(3, 800, 800)])[0]       # boxes, labels, scores
print({k: v.shape for k, v in out.items()})

Code Fragment 2: The whole RetinaNet in one constructor, with sigmoid_focal_loss replacing Code Fragment 1's hand-written loss. The library bundles the ResNet-50 backbone, FPN neck, classification and box subnets, anchor generation, the focal-loss objective, and the batched inference NMS, and it guarantees the anchor configuration matches the pretrained weights, so the from-scratch version is for understanding rather than production.

The standalone sigmoid_focal_loss replaces the function above with a tested library implementation; its default reduction is "none", so pass reduction="sum" to match Code Fragment 1. Roughly 800 lines of detector plus the focal-loss math reduce to two imports.

Practical Example: Real-Time Detection on a Drone's Tiny Chip

Who: An agritech company counting and locating crop rows and weeds from a low-flying drone, 2024.
Situation: The drone carried a small embedded accelerator with a few watts of power budget, and detection had to keep up with the video feed so the sprayer could act in flight.
Problem: A Faster R-CNN gave the best weed-versus-crop accuracy in the lab but ran at 4 frames per second on the embedded chip, far too slow for the 30-frame feed, and dropped frames meant missed weeds.
Decision: They moved to a compact one-stage YOLO model, quantized it to int8 (a preview of the efficiency techniques of Chapter 28), and accepted a small mAP drop in exchange for a tenfold speedup. They recovered most of the lost accuracy with aggressive augmentation and by training at the drone's actual altitude and resolution.
Result: The detector ran at video rate on-device, the sprayer acted in real time, and the small-object weed recall stayed within their tolerance.
Lesson: The one-stage family exists for exactly this regime; when the deployment target is a low-power chip on a moving platform, the speed of a single forward pass is the constraint that picks the architecture, and the lab-best accurate model is the wrong choice.

Fun Fact

"You Only Look Once" was a deliberate jab at the two-stage detectors that look thousands of times, and the acronym YOLO landed in 2016 just as the same four letters were peaking as internet slang for "you only live once." Joseph Redmon leaned all the way into the joke: the early YOLO papers are sprinkled with deadpan asides, and his resume was once formatted as a My Little Pony fan page. The field's most-deployed detector family is named after a meme, which is a useful reminder that the catchiest name, not always the highest mAP, is what makes an architecture famous.

Research Frontier: The YOLO Arms Race, 2024 to 2026

The one-stage line is the most actively developed corner of detection. YOLOv8 and YOLO11 (Ultralytics, 2023 to 2024) are anchor-free, use a decoupled detection head, and pair task-aligned label assignment with distribution-focal-loss box regression, blurring the line between this section and the anchor-free Section 23.4. YOLOv9 introduced programmable gradient information. YOLOv10 (2024) removed NMS at inference with a dual-assignment training trick, getting one-stage detectors most of the way to DETR's NMS-free promise without a transformer. The YOLO12 line (Feb 2025) brings area-attention modules into the backbone, and YOLO26 (Jan 2026) goes NMS-free at inference using ProgLoss training and STAL label assignment. RT-DETR, a real-time DETR (Section 23.5), now competes directly on the speed-accuracy frontier these YOLOs defined. The headline trend is convergence: anchor-free, NMS-free, attention-augmented one-stage detectors that are fast enough for the edge and accurate enough to rival anything two-stage.

5. Choosing Among the One-Stage Family (Advanced)

The one-stage detectors share a profile: a single forward pass, dense prediction on FPN levels, a focal or task-aligned classification loss to handle imbalance, and an NMS post-step (which the newest YOLOs are learning to drop). They differ mainly in the backbone size, the head design, and the training recipe, and the practical choice is a speed-accuracy trade-off rather than a deep architectural decision. A modern compact YOLO is the default for edge and real-time work; RetinaNet remains a clean, well-understood research baseline; SSD is largely of historical interest now, superseded by its descendants. Table 23.3.1 summarizes the trade-offs against the two-stage family of the previous section.

Table 23.3.1: Two-stage versus one-stage detectors, the trade-offs that pick one.

Property	Two-stage (Faster R-CNN)	One-stage (YOLO, RetinaNet)
Proposal step	Yes (RPN)	No (dense grid)
Imbalance handling	Proposal sampling	Focal / task-aligned loss
Speed	Lower (per-region head)	Higher (single pass)
Small-object recall	Often higher	Improving, historically lower
Typical use	Offline, accuracy-critical	Real-time, edge, video

The classic detectors of both families (Faster R-CNN, SSD, RetinaNet) still rely on the anchor box, the hand-designed reference shapes whose scales and aspect ratios you must tune to your data, as Exercise 23.2.2 showed. That tuning is a real burden, and it is the next thing the field shed, as the anchor-free YOLOs of the Research Frontier box already hint. Section 23.4 shows how detectors learned to predict boxes directly from feature-map locations with no anchors at all.

Exercise 23.3.1: When Does Focal Loss Help? (Conceptual)

Focal loss helped one-stage detection enormously but is rarely used for ordinary image classification. In two or three sentences, explain the difference in class balance between the two settings that makes focal loss valuable in one and unnecessary in the other. Then predict what would happen to a balanced classification task (1,000 examples per class) if you trained it with $\gamma = 2$ focal loss: would it help, hurt, or roughly match plain cross-entropy, and why?

Exercise 23.3.2: Visualizing the Modulating Factor (Coding)

Plot the per-example focal loss as a function of $p_t$ for $\gamma \in \{0, 0.5, 1, 2, 5\}$ on one figure (set $\alpha = 1$ so you isolate the focusing effect). Mark the loss value at $p_t = 0.9$ (a typical easy background) for each $\gamma$ and compute, for $\gamma = 2$, the ratio of the loss at $p_t = 0.5$ (hard) to the loss at $p_t = 0.9$ (easy). This ratio is the relative emphasis focal loss places on hard examples; explain in one sentence how it changes as $\gamma$ grows.

Exercise 23.3.3: Speed-Accuracy on Your Hardware (Analysis)

Using the torchvision library shortcuts, run the pretrained Faster R-CNN (Section 23.2) and the pretrained RetinaNet on the same batch of images and time both end-to-end (warm up first, then average over 20 runs). Report frames per second for each on your machine, and on a busy multi-object image compare the number of detections above a 0.5 confidence threshold. Write a short paragraph deciding which you would deploy for a 30-frame-per-second video application and which for an offline archival-tagging job, justifying each choice from your measured numbers and Table 23.3.1.