Part II: Classical Computer Vision
Chapter 16: Classical Recognition Pipelines

HOG + SVM: The Pedestrian Detection Era

"Forget the pixels; tell me which way the brightness leans, and I will tell you whether that is a person. I have been doing it since 2005, one sliding window at a time."

A Histogram of Oriented Gradients, Quietly Confident
Big Picture

Replace pixels with the local distribution of gradient orientations, slide a single linear classifier over the image at every position and scale, and you get the detector that defined an entire decade. The Dalal-Triggs pipeline is the most complete worked example of classical recognition in this chapter, and every stage teaches a transferable lesson. The HOG descriptor captures shape while discarding texture and lighting. The linear support vector machine finds the maximum-margin boundary between "person" and "not person." The sliding window plus image pyramid turns a classifier into a detector. Hard negative mining fixes the data imbalance that detection always creates. Non-maximum suppression cleans the redundant detections, and the miss-rate curve measures the result honestly. Hold these pieces in mind: Chapter 23's deep detectors keep every one of them and only replace the features.

The previous section turned an image into a single histogram for whole-image classification. Detection is harder: we must find where an object is, not just whether one is present, and the same object can appear at any position and any size. This section follows the canonical solution, Dalal and Triggs's human detector (CVPR 2005), from the descriptor that made it work to the evaluation that proved it. It is the natural sequel to the gradient machinery of Chapter 9 and Chapter 3: HOG is built entirely from image gradients.

1. The HOG Descriptor: Shape as a Gradient Distribution Intermediate

The insight behind histograms of oriented gradients is that an object's shape is captured well by the local distribution of edge directions, even when its exact pixel values vary wildly. A pedestrian has roughly vertical gradients along the legs and torso and a characteristic head-and-shoulders outline, and those orientation statistics are remarkably stable across clothing, lighting, and skin tone, the things templates could not survive. HOG throws away absolute brightness and keeps only which way the brightness changes and how strongly.

The construction proceeds in four steps. First, compute the gradient at every pixel using a simple $[-1, 0, 1]$ filter (Dalal and Triggs found this plain kernel beat fancier ones), giving magnitude $m = \sqrt{g_x^2 + g_y^2}$ and orientation $\theta = \arctan(g_y / g_x)$ at each pixel. Second, divide the window into small cells (typically $8 \times 8$ pixels) and, within each cell, build a histogram of orientations: each pixel votes for its orientation bin (usually $9$ bins over $0$ to $180$ degrees, unsigned) weighted by its gradient magnitude. Third, group cells into overlapping blocks (typically $2 \times 2$ cells) and normalize each block's concatenated histogram to unit length, which is what grants invariance to local contrast and illumination. Fourth, concatenate all block descriptors into one long feature vector for the window. Figure 16.3.1 lays out the cell-and-block geometry and shows the orientation histogram inside one cell.

Window in cells 2x2 block 8x8 cells, 2x2 blocks One cell's 9-bin histogram orientation bin (0 to 180 deg) vote Dominant edge in cell strong vertical gradients (a leg)
Figure 16.3.1: The HOG descriptor. The window is tiled into $8 \times 8$ cells (left); each cell accumulates a $9$-bin orientation histogram weighted by gradient magnitude (center); cells are grouped into overlapping $2 \times 2$ blocks that are normalized together (the blue block). A cell dominated by vertical gradients, as on a pedestrian's leg, produces a histogram peaked at the vertical bin (right).

The block overlap is not an accident. Each cell appears in multiple blocks and is normalized differently in each, so the descriptor encodes the same cell relative to several different local contexts. That redundancy is what makes HOG robust to the local lighting variation that defeated Section 16.1's templates. The code below computes a HOG descriptor for a detection window using scikit-image, which exposes the cell and block geometry directly so you can see the parameters that Figure 16.3.1 names.

# Compute the HOG descriptor of a detection window with the canonical
# Dalal-Triggs parameters (9 orientations, 8x8 cells, 2x2 blocks), turning
# a 64x128 image into a fixed-length gradient-orientation feature vector.
import cv2
import numpy as np
from skimage.feature import hog

# A pedestrian detection window is canonically 64 wide x 128 tall.
window = cv2.imread("person_crop.png", cv2.IMREAD_GRAYSCALE)
window = cv2.resize(window, (64, 128))

features, hog_image = hog(
    window,
    orientations=9,                 # 9 unsigned orientation bins
    pixels_per_cell=(8, 8),         # 8x8 pixel cells
    cells_per_block=(2, 2),         # 2x2 cell blocks, normalized together
    block_norm="L2-Hys",            # L2 norm with clipping, the Dalal-Triggs choice
    visualize=True,
    feature_vector=True)

print(f"HOG feature length: {features.shape[0]}")
# 7 blocks wide x 15 tall x (2*2 cells) x 9 bins = 3780
# HOG feature length: 3780
Computing the $3780$-dimensional HOG descriptor of a $64 \times 128$ window with the canonical Dalal-Triggs parameters; L2-Hys is L2 normalization followed by clipping large values and renormalizing, which suppresses the influence of very strong gradients.

2. The Linear SVM: Maximum-Margin Classification Intermediate

HOG turns each window into a $3780$-dimensional vector; now we need a classifier that decides "person" or "not person." Dalal and Triggs used a linear support vector machine, and the choice was deliberate. A linear SVM learns a weight vector $\mathbf{w}$ and bias $b$ so the decision is simply the sign of $\mathbf{w} \cdot \mathbf{x} + b$, and among all hyperplanes that separate the training classes it picks the one with the largest margin, the widest gap to the nearest examples of either class. Maximizing the margin is a principled defense against overfitting: the boundary depends only on the borderline examples (the support vectors) and ignores the easy interior ones. Formally the soft-margin SVM minimizes

$$ \frac{1}{2}\lVert \mathbf{w} \rVert^2 \;+\; C \sum_i \max\!\big(0,\; 1 - y_i(\mathbf{w}\cdot \mathbf{x}_i + b)\big), $$

where $y_i \in \{-1, +1\}$ are the labels and $C$ trades margin width against training errors. The first term widens the margin; the hinge-loss sum penalizes points on the wrong side. The $1$ inside the loss is the margin itself: a point earns zero penalty only once $y_i(\mathbf{w}\cdot \mathbf{x}_i + b) \ge 1$, that is, once it sits not merely on the correct side of the boundary but a full margin's width clear of it.

Two properties made the linear SVM ideal for detection. It is extremely fast at test time, one dot product per window, which matters enormously when you evaluate millions of windows per image. And the learned weight vector $\mathbf{w}$ is itself a HOG-shaped template you can visualize: positive weights mark "person-like gradient here," negative weights mark "background-like gradient here," and rendering $\mathbf{w}$ produces a ghostly average pedestrian.

Key Insight: A Linear SVM on HOG Is a Learned, Soft Template

Section 16.1's template matching correlated a fixed pixel patch against the image. A linear SVM on HOG correlates a learned weight vector against the HOG feature image, and that is the whole conceptual upgrade in one sentence. The template is no longer a single photographed example; it is the maximum-margin combination of thousands of examples, expressed in a gradient-orientation space that already discards lighting and texture. When you reach the first convolutional layer of a CNN in Chapter 19 and find filters that look like oriented edge detectors, you are seeing this same idea, learned rather than designed, and stacked many layers deep.

# Train a linear SVM on HOG vectors so the learned weight vector w becomes a
# soft, maximum-margin pedestrian template; classifying a window is then a
# single dot product with w, the cheap test-time cost detection relies on.
import numpy as np
from sklearn.svm import LinearSVC

# X: (n_windows, 3780) HOG vectors; y: +1 for person crops, -1 for background.
svm = LinearSVC(C=0.01, max_iter=20000)
svm.fit(X_train, y_train)

# The learned weight vector is itself a HOG-shaped detector you can score with.
w = svm.coef_.ravel()           # shape (3780,)
score = w @ hog_window + svm.intercept_[0]   # > 0 means "person"
print(f"decision score: {score:.3f}  ({'person' if score > 0 else 'background'})")
# decision score: 1.842  (person)
Training the linear SVM on HOG vectors: coef_ is the learned weight vector $\mathbf{w}$, and classifying any window reduces to a single dot product with it, the one-operation test-time cost that makes dense sliding-window search feasible.

3. From Classifier to Detector: Sliding Windows Over a Pyramid Intermediate

A classifier answers "is this window a person?" A detector answers "where are the people?", and the bridge is exhaustive search. Slide the $64 \times 128$ window across the image in small steps, scoring each placement with the SVM; then resize the image down and repeat, so a single fixed-size window finds people at every scale. This is the same image pyramid from Chapter 4 that powered the multi-scale template search in Section 16.1, now feeding a learned classifier instead of a correlation. The result is a dense field of scores; thresholding it yields candidate detections, typically thousands per image before cleanup.

# Turn the HOG+SVM classifier into a detector: slide the window at every
# position and pyramid scale, HOG-encode and score each placement, and keep
# above-threshold boxes mapped back to original-image coordinates.
import cv2
import numpy as np
from skimage.feature import hog

def detect(image, svm_w, svm_b, win=(128, 64), stride=8, scales=10, downscale=1.2):
    """Sliding-window HOG+SVM detection across an image pyramid."""
    detections = []
    for s in range(scales):
        scale = downscale ** s
        resized = cv2.resize(image, None, fx=1 / scale, fy=1 / scale)
        if resized.shape[0] < win[0] or resized.shape[1] < win[1]:
            break
        for y in range(0, resized.shape[0] - win[0], stride):
            for x in range(0, resized.shape[1] - win[1], stride):
                patch = resized[y:y + win[0], x:x + win[1]]
                f = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                        cells_per_block=(2, 2), block_norm="L2-Hys")  # match the training descriptor exactly
                score = svm_w @ f + svm_b
                if score > 0.5:                      # detection threshold
                    detections.append((int(x * scale), int(y * scale),
                                       int(win[1] * scale), int(win[0] * scale), score))
    return detections

raw = detect(cv2.imread("street.png", cv2.IMREAD_GRAYSCALE), svm.coef_.ravel(),
             svm.intercept_[0])
print(f"raw detections before NMS: {len(raw)}")
# raw detections before NMS: 1473
The detection loop: at each pyramid scale the window slides with an $8$-pixel stride, each placement is HOG-encoded and scored by the SVM, and above-threshold windows are kept with their box mapped back to original-image coordinates. Note the $1473$ raw detections, mostly redundant overlaps that the next stage removes.

4. Hard Negative Mining and Non-Maximum Suppression Advanced

Two practical problems break the naive pipeline, and both fixes became permanent fixtures of detection. The first is data imbalance: a training image contains a handful of person windows and literally millions of background windows, so a classifier trained on a random sample of negatives never sees the hard negatives (a lamppost, a tree trunk, a doorway) that actually trigger false positives. Hard negative mining solves this by bootstrapping: train on an initial set, run the detector on negative images, collect every high-scoring false positive (the hard negatives), add them to the training set, and retrain. Iterate until false positives stop appearing. This focuses the SVM's capacity exactly where it matters, on the boundary cases, and it typically halves the miss rate.

The second problem is redundancy. A real pedestrian triggers the detector at many nearby positions and scales, producing a cluster of overlapping boxes (the $1473$ in the code above). Non-maximum suppression (NMS) collapses each cluster to one box: sort detections by score, greedily keep the highest, remove every remaining box that overlaps it by more than an intersection-over-union threshold, and repeat. Figure 16.3.2 shows both stages, the bootstrap loop that hardens the classifier and the greedy suppression that cleans its output.

Hard negative mining loop train SVM run on negatives collect false +ves add & retrain repeat until no new false positives Non-maximum suppression before after keep highest score, drop the overlaps
Figure 16.3.2: The two fixes detection cannot live without. Left: hard negative mining is a bootstrap loop that repeatedly mines the detector's own false positives and retrains on them, concentrating SVM capacity on the confusing background. Right: non-maximum suppression replaces a cluster of overlapping boxes around one person with the single highest-scoring box.

NMS is so fundamental that it survived the deep-learning transition almost unchanged; the box-cleanup stage of the detectors in Chapter 23 is the same greedy algorithm. The implementation below is the classic vectorized version.

# Collapse the cluster of overlapping detection boxes around each object
# into one: process boxes highest-score-first and discard any later box whose
# intersection-over-union with a kept box exceeds the threshold.
import numpy as np

def nms(boxes, scores, iou_thresh=0.4):
    """Greedy non-maximum suppression; boxes as (x, y, w, h)."""
    if len(boxes) == 0:
        return []
    b = np.array(boxes, dtype=np.float32)
    x1, y1 = b[:, 0], b[:, 1]
    x2, y2 = b[:, 0] + b[:, 2], b[:, 1] + b[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]               # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])     # overlap with the rest
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou < iou_thresh]        # drop boxes that overlap the kept one
    return keep

kept = nms([d[:4] for d in raw], [d[4] for d in raw])
print(f"after NMS: {len(kept)} detections")
# after NMS: 6 detections
Greedy non-maximum suppression: detections are processed highest-score-first, and every box overlapping a kept box by more than the IoU threshold is discarded, collapsing the $1473$ raw windows of Section 3 to $6$ clean detections.

5. Measuring Detection Honestly: Miss Rate Versus FPPI Intermediate

Detection cannot be scored by accuracy, because the overwhelming majority of windows are background and a detector that says "no person" everywhere would score near-perfect accuracy while finding nobody. The pedestrian-detection community settled on a curve instead: miss rate versus false positives per image (FPPI), traced out by sweeping the detection threshold. Lowering the threshold finds more true pedestrians (lower miss rate) but also raises false alarms (higher FPPI); the curve shows the full trade-off, and detectors are summarized by their log-average miss rate over a standard FPPI range. The Caltech Pedestrian study (Dollár et al., IEEE TPAMI 2012) made this protocol the field standard and benchmarked sixteen detectors under it, which is why HOG-era results are quoted as miss rates, not accuracies. The mean average precision (mAP) of Chapter 23 is the same idea generalized: integrate precision over recall instead of miss rate over FPPI. The illustration below makes the absurdity vivid: a detector can win a near-perfect accuracy trophy while finding nobody at all.

A smug robot detector with its eyes closed proudly holds a near-perfect-score trophy while having stamped every window of a crowded street, including all the obvious pedestrians, with a red cross meaning no person, so it found nobody, illustrating why high accuracy is meaningless when background windows vastly outnumber objects.
Answer no person to every window and you score almost perfectly while finding nobody, which is exactly why detection counts the misses, not the hits.
Fun Fact: The Detector and the Weather Forecaster

There is a classic forecaster who predicts "no rain tomorrow" every single day in the desert and retires with a 95 percent accuracy record and zero useful forecasts. The "no person everywhere" detector is that forecaster with a camera: accuracy is trivially won by always betting on the majority class, so the number certifies your grasp of the base rate, not your skill at the task. Miss-rate-versus-FPPI exists precisely to take the bet away, scoring you only on the rare windows that actually matter. The keeper phrase for the whole detector is slide, score, suppress, then count the misses, not the hits.

Practical Example: A Parking Garage That Counted Cars but Missed People

Who & situation: a building-security integrator deployed a HOG+SVM pedestrian detector across a chain of parking garages to trigger lighting and alert guards to people in low-traffic zones at night. Problem: the detector worked beautifully in daytime tests but in production it missed roughly a third of real pedestrians and fired constantly on pillars and parked-car edges. Decision: the team realized the public Dalal-Triggs model had been trained on upright, well-lit, unoccluded pedestrians, nothing like a crouching person between cars under sodium lighting. They ran hard negative mining directly on a week of the garages' own background footage to learn the pillar and bumper false positives, and they retrained the positive set with garage-specific crops including partially occluded and bent poses. Result: the log-average miss rate on their own validation footage fell by more than half, and the false-alarm rate dropped to a level the guards stopped ignoring. Lesson: a classical detector is exactly as good as the negatives it was hardened against and the positives it was shown; hard negative mining on the deployment distribution, not the benchmark distribution, is what separates a demo from a system.

Research Frontier: HOG's Ideas, Still Running

HOG+SVM is no longer state of the art, but its structure is everywhere in modern detection. The sliding window became the dense prediction head; the image pyramid became the feature pyramid network; hard negative mining became focal loss and online hard example mining; non-maximum suppression remains a literal final stage in most detectors. The 2024-2026 frontier even revisits HOG's gradient-orientation idea: edge and gradient maps are now standard conditioning inputs for controllable generation (the ControlNet edge condition of Chapter 35 is a soft HOG), and real-time detectors in the YOLO line (which Chapter 23 covers) keep the slide-score-suppress skeleton while replacing the HOG features and the linear SVM with a single learned network. The 2005 pipeline was not discarded; its feature extractor and classifier were fused and made learnable, and everything around them stayed.

Library Shortcut: The Pretrained Detector in Three Lines

The pipeline above, descriptor plus SVM plus pyramid plus NMS, is well over a hundred lines to assemble from scratch. OpenCV ships a complete, pretrained Dalal-Triggs pedestrian detector behind HOGDescriptor with a built-in people SVM, the sliding-window search, the pyramid, and NMS-style grouping all handled internally; you write three lines. It returns boxes and confidences directly, so the entire Sections 1 through 4 reduce to one call to detectMultiScale.

# OpenCV's batteries-included pedestrian detector: load the pretrained
# people SVM into HOGDescriptor and let detectMultiScale run the full
# descriptor, pyramid, sliding window, and grouping internally.
import cv2
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())  # pretrained SVM
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
# boxes: (n, 4) pedestrian rectangles; weights: their SVM scores
OpenCV's batteries-included pedestrian detector: detectMultiScale runs the full HOG+SVM sliding-window pyramid and grouping internally, collapsing the hundred-plus lines of Sections 1 through 4 into a single call returning boxes and scores.

The five stages of this section are easiest to internalize once you have wired them together yourself and watched the numbers move. The Hands-On Lab at the end of this section builds the complete Dalal-Triggs detector end to end, descriptor, support vector machine, sliding-window pyramid, hard negative mining, and non-maximum suppression, then measures it with the miss-rate curve Exercises 16.3.1 through 16.3.3 ask you to reason about.

Exercise 16.3.1: Why a Linear Classifier? Conceptual

Dalal and Triggs reported that a linear SVM on HOG nearly matched a much slower kernel SVM, and they chose the linear one. Explain the two reasons a linear classifier is preferable for sliding-window detection specifically, referring to the per-window test-time cost and to the visualizability of the weight vector. Then argue why the same speed argument does not hold for the whole-image classification of Section 16.2, where kernel SVMs were the norm.

Exercise 16.3.2: Implementing and Tuning NMS Coding

Run the Section 3 detector on a few street images, then apply the Section 4 NMS with IoU thresholds of $0.2$, $0.4$, and $0.7$. Visualize the kept boxes for each threshold over the same image and count detections. Describe what goes wrong at the extremes: which threshold merges two close pedestrians into one box, and which leaves duplicate boxes on a single pedestrian? Then implement soft-NMS (decay overlapping scores instead of deleting the boxes) and report whether it resolves the two-close-pedestrians failure.

Exercise 16.3.3: The Value of Hard Negatives Analysis

Train two HOG+SVM detectors on the same positive set: one with negatives sampled randomly from background images, and one with two rounds of hard negative mining. Evaluate both with a miss-rate-versus-FPPI curve on a held-out test set. Quantify the improvement at a fixed FPPI of $1$ false positive per image, and inspect the hard negatives the mining collected; describe the three most common false-positive structures it found and explain, in terms of gradient orientations, why each one fooled the initial detector.

Hands-On Lab: Build the Dalal-Triggs Pedestrian Detector End to End
Difficulty: Intermediate Duration: 60 to 90 minutes

Build one self-contained script, hog_detector.py, that trains a complete HOG plus linear-SVM pedestrian detector on a standard public dataset, hardens it with one round of hard negative mining, runs it as a sliding-window detector over an image pyramid, cleans the output with non-maximum suppression, and scores it honestly with the miss-rate-versus-false-positives-per-image curve of Section 5. The output is a working detector you trained yourself plus the curve that proves how good it is, the exact five-step recognition recipe of this chapter assembled into one runnable artifact.

What You'll Practice

  • Computing HOG descriptors with the canonical Dalal-Triggs parameters and training a maximum-margin linear SVM on them (Sections 1 and 2).
  • Turning a window classifier into a detector with a sliding window over an image pyramid (Section 3, reusing the pyramid of Chapter 4).
  • Implementing the hard negative mining bootstrap loop and greedy non-maximum suppression that detection cannot live without (Section 4).
  • Scoring a detector with the miss-rate-versus-FPPI curve rather than accuracy (Section 5).

Setup

pip install scikit-image scikit-learn opencv-python numpy matplotlib

Use the INRIA Person dataset (or any small set of cropped person windows plus person-free background images). Resize positive crops to the canonical $64 \times 128$ window. No GPU is required; everything here runs on a CPU in a few minutes for a few thousand windows. The same pipeline works on any single-object class: swap the positive crops and you have a detector for that class instead.

Put the section's detection pipeline into practice below. Work through the steps in order; each one prints a checkpoint so you can confirm progress before moving on. A complete reference solution is folded at the end.

Step 1: Encode the training windows as HOG vectors

Load the positive person crops and an initial random sample of background windows, resize each to $64 \times 128$, and turn every window into the $3780$-dimensional HOG vector of Section 1. This is the Represent step of the recognition recipe.

import numpy as np, cv2
from skimage.feature import hog

WIN = (128, 64)  # (height, width): the canonical 64x128 pedestrian window

def to_hog(gray):
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def load_features(paths, label):
    X = []
    for p in paths:
        g = cv2.resize(cv2.imread(p, cv2.IMREAD_GRAYSCALE), (WIN[1], WIN[0]))
        # TODO: append the HOG vector of this window to X.
        # Hint: use the to_hog helper above.
        ...
    return np.array(X), np.full(len(X), label)

Xpos, ypos = load_features(pos_paths, +1)
Xneg, yneg = load_features(neg_paths, -1)
X = np.vstack([Xpos, Xneg]); y = np.concatenate([ypos, yneg])
print(f"Training set: {len(Xpos)} positives, {len(Xneg)} negatives, dim {X.shape[1]}")
Hint

Inside the loop, call X.append(to_hog(g)). Every window must be encoded with the identical HOG parameters you will use at detection time, otherwise the SVM scores a feature it was never trained on. Expect dim 3780.

Step 2: Train the linear SVM

Fit a linear support vector machine on the HOG vectors. The learned weight vector is the soft, maximum-margin pedestrian template of Section 2; scoring a window will be a single dot product with it. This is the Score step.

from sklearn.svm import LinearSVC

# TODO: train a LinearSVC on (X, y) and pull out the weight vector and bias.
# Hint: small C (around 0.01) favors a wide margin; coef_ and intercept_ hold w and b.
svm = ...
w = ...
b = ...
print(f"Trained SVM; weight vector length {w.shape[0]}, bias {b:.3f}")
Hint

svm = LinearSVC(C=0.01, max_iter=20000).fit(X, y), then w = svm.coef_.ravel() and b = svm.intercept_[0]. Reshaping w back onto the window and visualizing it produces the ghostly average pedestrian described in Section 2.

Step 3: Slide the detector over an image pyramid

Write the detection loop: at each pyramid scale, slide the $64 \times 128$ window with an $8$-pixel stride, HOG-encode and score each placement, and keep above-threshold boxes mapped back to original coordinates. This is the Search step that turns a classifier into a detector.

def detect(image, w, b, stride=8, scales=10, downscale=1.2, thresh=0.5):
    dets = []
    for s in range(scales):
        scale = downscale ** s
        resized = cv2.resize(image, None, fx=1 / scale, fy=1 / scale)
        if resized.shape[0] < WIN[0] or resized.shape[1] < WIN[1]:
            break
        for y0 in range(0, resized.shape[0] - WIN[0], stride):
            for x0 in range(0, resized.shape[1] - WIN[1], stride):
                patch = resized[y0:y0 + WIN[0], x0:x0 + WIN[1]]
                # TODO: HOG-encode the patch, score it with w and b, and if the
                # score exceeds thresh append (x, y, w_box, h_box, score) in
                # ORIGINAL-image coordinates (multiply positions by scale).
                ...
    return dets

raw = detect(cv2.imread("street.png", cv2.IMREAD_GRAYSCALE), w, b)
print(f"raw detections before NMS: {len(raw)}")
Hint

Compute score = w @ to_hog(patch) + b; when score > thresh, append (int(x0*scale), int(y0*scale), int(WIN[1]*scale), int(WIN[0]*scale), score). Expect on the order of a thousand raw, overlapping detections, exactly the redundancy the next step removes.

Step 4: Clean the output with non-maximum suppression

Collapse each cluster of overlapping boxes to the single highest-scoring one with greedy NMS, the Suppress step. Process boxes highest-score-first and discard any later box whose intersection-over-union with a kept box exceeds the threshold.

def nms(boxes, scores, iou_thresh=0.4):
    if len(boxes) == 0:
        return []
    b = np.array(boxes, dtype=np.float32)
    x1, y1 = b[:, 0], b[:, 1]
    x2, y2 = b[:, 0] + b[:, 2], b[:, 1] + b[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]; keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]]); yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]]); yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # TODO: keep only the boxes whose IoU with box i is below iou_thresh.
        order = ...
    return keep

kept = nms([d[:4] for d in raw], [d[4] for d in raw])
print(f"after NMS: {len(kept)} detections")
Hint

The surviving boxes are those that do not overlap the kept one: order = order[1:][iou < iou_thresh]. This is the identical greedy algorithm that survives into the deep detectors of Chapter 23.

Step 5: Harden the detector with hard negative mining

Run the current detector over person-free background images, collect every above-threshold box as a hard negative, add those windows to the training set, and retrain. This bootstrap loop concentrates the SVM's capacity on the confusing background and typically halves the miss rate.

hard = []
for bg_path in background_image_paths:        # images known to contain no people
    img = cv2.imread(bg_path, cv2.IMREAD_GRAYSCALE)
    for (x0, y0, bw, bh, score) in detect(img, w, b, thresh=0.0):
        crop = cv2.resize(img[y0:y0 + bh, x0:x0 + bw], (WIN[1], WIN[0]))
        hard.append(to_hog(crop))             # a false positive, now a labeled negative

# TODO: add the hard negatives to the training set with label -1 and retrain.
# Hint: vstack hard onto X, concatenate -1 labels onto y, refit the LinearSVC.
X2, y2 = ..., ...
svm2 = LinearSVC(C=0.01, max_iter=20000).fit(X2, y2)
w, b = svm2.coef_.ravel(), svm2.intercept_[0]
print(f"Mined {len(hard)} hard negatives; retrained detector")
Hint

X2 = np.vstack([X, np.array(hard)]) and y2 = np.concatenate([y, np.full(len(hard), -1)]). Detecting with thresh=0.0 on background-only images surfaces every confident false positive; those lampposts and doorways are exactly the windows the first SVM never saw.

Step 6: Score the detector with a miss-rate curve

Sweep the detection threshold over a test set, matching kept boxes to ground-truth boxes by IoU, and plot miss rate against false positives per image, the Settle step. The curve, not a single accuracy number, is the honest summary of a detector.

import matplotlib.pyplot as plt

def evaluate(test_images, gt_boxes, w, b, thresholds=np.linspace(-1, 2, 25)):
    miss_rates, fppi = [], []
    for t in thresholds:
        misses = false_pos = total_gt = 0
        for img, gts in zip(test_images, gt_boxes):
            dets = nms_filter(detect(img, w, b, thresh=t))   # detect + NMS
            matched, fp = match_to_gt(dets, gts, iou_thresh=0.5)
            misses += len(gts) - matched; false_pos += fp; total_gt += len(gts)
        # TODO: append miss rate (misses / total_gt) and FPPI (false_pos / #images).
        ...
    return fppi, miss_rates

x, ymr = evaluate(test_images, gt_boxes, w, b)
plt.loglog(x, ymr, marker="o"); plt.xlabel("false positives per image")
plt.ylabel("miss rate"); plt.title("HOG+SVM pedestrian detector"); plt.savefig("miss_rate.png")
print("Wrote miss_rate.png")
Hint

Append miss_rates.append(misses / max(total_gt, 1)) and fppi.append(false_pos / len(test_images)). A detection counts as a hit when its IoU with an unmatched ground-truth box clears $0.5$; every unmatched detection is a false positive. Lowering the threshold pushes you down-and-right along the curve, the trade-off Section 5 describes.

Step 7: The Right Tool, the pretrained detector in three lines

You built the pipeline to understand it; in production OpenCV ships the whole thing pretrained. Run its built-in Dalal-Triggs people detector and compare its boxes against your own, mirroring this chapter's Right Tool principle.

# TODO: load OpenCV's pretrained people SVM into a HOGDescriptor and run
# detectMultiScale on a test image; it does descriptor, pyramid, slide, and
# grouping internally. Hint: cv2.HOGDescriptor_getDefaultPeopleDetector().
hog_cv = cv2.HOGDescriptor()
...
boxes, weights = ...
print(f"OpenCV pretrained detector found {len(boxes)} pedestrians")
Hint

hog_cv.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector()), then boxes, weights = hog_cv.detectMultiScale(image, winStride=(8, 8), scale=1.05). The roughly hundred lines of Steps 1 through 6 collapse to two calls; you wrote the long version so you can diagnose the short one when it misses.

Expected Output

Step 1 reports a training set of a few hundred positives, a few thousand negatives, and dimension $3780$. Step 2 confirms a $3780$-length weight vector. Step 3 prints on the order of a thousand raw detections; Step 4 collapses them to a handful (single digits on a typical street image). Step 5 mines tens to hundreds of hard negatives and retrains. Step 6 writes miss_rate.png: a downward-sloping log-log curve where lowering the threshold trades a lower miss rate for a higher false-positive rate, and the post-mining detector's curve sits visibly below the pre-mining one. Step 7's pretrained detector should find roughly the same pedestrians your hand-built one does. If your miss rate stays high everywhere, the usual cause is too few or too-easy negatives; another round of hard negative mining is the fix, exactly as Section 4 predicts.

Stretch Goals

  • Add a second round of hard negative mining and overlay all three miss-rate curves (random negatives, one mining round, two mining rounds) on one plot to see the bootstrap converge, the experiment Exercise 16.3.3 describes.
  • Replace the deletion in your NMS with soft-NMS (decay overlapping scores instead of removing the boxes) and check whether it recovers two pedestrians standing close together, the failure case of Exercise 16.3.2.
  • Retrain the whole pipeline for a different rigid class (a stop sign, a car rear) by swapping only the positive crops, then compare your hand-crafted detector against a small CNN from Chapter 19 on the same data, previewing the plateau argument of Section 16.6.
Complete Solution
import numpy as np, cv2
from skimage.feature import hog
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt

WIN = (128, 64)  # (height, width)

def to_hog(gray):
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# Step 1: encode training windows as HOG vectors
def load_features(paths, label):
    X = [to_hog(cv2.resize(cv2.imread(p, cv2.IMREAD_GRAYSCALE), (WIN[1], WIN[0])))
         for p in paths]
    return np.array(X), np.full(len(X), label)

Xpos, ypos = load_features(pos_paths, +1)
Xneg, yneg = load_features(neg_paths, -1)
X = np.vstack([Xpos, Xneg]); y = np.concatenate([ypos, yneg])

# Step 2: train the linear SVM
svm = LinearSVC(C=0.01, max_iter=20000).fit(X, y)
w, b = svm.coef_.ravel(), svm.intercept_[0]

# Step 3: sliding-window detection over a pyramid
def detect(image, w, b, stride=8, scales=10, downscale=1.2, thresh=0.5):
    dets = []
    for s in range(scales):
        scale = downscale ** s
        resized = cv2.resize(image, None, fx=1 / scale, fy=1 / scale)
        if resized.shape[0] < WIN[0] or resized.shape[1] < WIN[1]:
            break
        for y0 in range(0, resized.shape[0] - WIN[0], stride):
            for x0 in range(0, resized.shape[1] - WIN[1], stride):
                patch = resized[y0:y0 + WIN[0], x0:x0 + WIN[1]]
                score = w @ to_hog(patch) + b
                if score > thresh:
                    dets.append((int(x0 * scale), int(y0 * scale),
                                 int(WIN[1] * scale), int(WIN[0] * scale), float(score)))
    return dets

# Step 4: greedy non-maximum suppression
def nms(boxes, scores, iou_thresh=0.4):
    if len(boxes) == 0:
        return []
    b = np.array(boxes, dtype=np.float32)
    x1, y1 = b[:, 0], b[:, 1]
    x2, y2 = b[:, 0] + b[:, 2], b[:, 1] + b[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]; keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]]); yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]]); yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou < iou_thresh]
    return keep

def nms_filter(dets):
    if not dets:
        return []
    keep = nms([d[:4] for d in dets], [d[4] for d in dets])
    return [dets[i] for i in keep]

# Step 5: hard negative mining
hard = []
for bg_path in background_image_paths:
    img = cv2.imread(bg_path, cv2.IMREAD_GRAYSCALE)
    for (x0, y0, bw, bh, score) in detect(img, w, b, thresh=0.0):
        crop = cv2.resize(img[y0:y0 + bh, x0:x0 + bw], (WIN[1], WIN[0]))
        hard.append(to_hog(crop))
X2 = np.vstack([X, np.array(hard)])
y2 = np.concatenate([y, np.full(len(hard), -1)])
svm2 = LinearSVC(C=0.01, max_iter=20000).fit(X2, y2)
w, b = svm2.coef_.ravel(), svm2.intercept_[0]

# Step 6: miss-rate-versus-FPPI evaluation
def iou(a, g):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    gx1, gy1, gx2, gy2 = g[0], g[1], g[0] + g[2], g[1] + g[3]
    ix1, iy1 = max(ax1, gx1), max(ay1, gy1)
    ix2, iy2 = min(ax2, gx2), min(ay2, gy2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (gx2 - gx1) * (gy2 - gy1) - inter
    return inter / union if union > 0 else 0.0

def match_to_gt(dets, gts, iou_thresh=0.5):
    used = set(); matched = 0; fp = 0
    for d in sorted(dets, key=lambda z: -z[4]):
        hit = None
        for j, g in enumerate(gts):
            if j not in used and iou(d[:4], g) >= iou_thresh:
                hit = j; break
        if hit is None:
            fp += 1
        else:
            used.add(hit); matched += 1
    return matched, fp

def evaluate(test_images, gt_boxes, w, b, thresholds=np.linspace(-1, 2, 25)):
    miss_rates, fppi = [], []
    for t in thresholds:
        misses = false_pos = total_gt = 0
        for img, gts in zip(test_images, gt_boxes):
            dets = nms_filter(detect(img, w, b, thresh=t))
            matched, fp = match_to_gt(dets, gts, iou_thresh=0.5)
            misses += len(gts) - matched; false_pos += fp; total_gt += len(gts)
        miss_rates.append(misses / max(total_gt, 1))
        fppi.append(false_pos / len(test_images))
    return fppi, miss_rates

x, ymr = evaluate(test_images, gt_boxes, w, b)
plt.loglog(np.clip(x, 1e-3, None), np.clip(ymr, 1e-3, None), marker="o")
plt.xlabel("false positives per image"); plt.ylabel("miss rate")
plt.title("HOG+SVM pedestrian detector"); plt.savefig("miss_rate.png")

# Step 7: the Right Tool, OpenCV's pretrained detector
hog_cv = cv2.HOGDescriptor()
hog_cv.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
test = cv2.imread("street.png", cv2.IMREAD_GRAYSCALE)
boxes, weights = hog_cv.detectMultiScale(test, winStride=(8, 8), scale=1.05)
print(f"OpenCV pretrained detector found {len(boxes)} pedestrians")