"Forget the pixels; tell me which way the brightness leans, and I will tell you whether that is a person. I have been doing it since 2005, one sliding window at a time."
A Histogram of Oriented Gradients, Quietly Confident
Replace pixels with the local distribution of gradient orientations, slide a single linear classifier over the image at every position and scale, and you get the detector that defined an entire decade. The Dalal-Triggs pipeline is the most complete worked example of classical recognition in this chapter, and every stage teaches a transferable lesson. The HOG descriptor captures shape while discarding texture and lighting. The linear support vector machine finds the maximum-margin boundary between "person" and "not person." The sliding window plus image pyramid turns a classifier into a detector. Hard negative mining fixes the data imbalance that detection always creates. Non-maximum suppression cleans the redundant detections, and the miss-rate curve measures the result honestly. Hold these pieces in mind: Chapter 23's deep detectors keep every one of them and only replace the features.
The previous section turned an image into a single histogram for whole-image classification. Detection is harder: we must find where an object is, not just whether one is present, and the same object can appear at any position and any size. This section follows the canonical solution, Dalal and Triggs's human detector (CVPR 2005), from the descriptor that made it work to the evaluation that proved it. It is the natural sequel to the gradient machinery of Chapter 9 and Chapter 3: HOG is built entirely from image gradients.
1. The HOG Descriptor: Shape as a Gradient Distribution Intermediate
The insight behind histograms of oriented gradients is that an object's shape is captured well by the local distribution of edge directions, even when its exact pixel values vary wildly. A pedestrian has roughly vertical gradients along the legs and torso and a characteristic head-and-shoulders outline, and those orientation statistics are remarkably stable across clothing, lighting, and skin tone, the things templates could not survive. HOG throws away absolute brightness and keeps only which way the brightness changes and how strongly.
The construction proceeds in four steps. First, compute the gradient at every pixel using a simple $[-1, 0, 1]$ filter (Dalal and Triggs found this plain kernel beat fancier ones), giving magnitude $m = \sqrt{g_x^2 + g_y^2}$ and orientation $\theta = \arctan(g_y / g_x)$ at each pixel. Second, divide the window into small cells (typically $8 \times 8$ pixels) and, within each cell, build a histogram of orientations: each pixel votes for its orientation bin (usually $9$ bins over $0$ to $180$ degrees, unsigned) weighted by its gradient magnitude. Third, group cells into overlapping blocks (typically $2 \times 2$ cells) and normalize each block's concatenated histogram to unit length, which is what grants invariance to local contrast and illumination. Fourth, concatenate all block descriptors into one long feature vector for the window. Figure 16.3.1 lays out the cell-and-block geometry and shows the orientation histogram inside one cell.
The block overlap is not an accident. Each cell appears in multiple blocks and is normalized differently in each, so the descriptor encodes the same cell relative to several different local contexts. That redundancy is what makes HOG robust to the local lighting variation that defeated Section 16.1's templates. The code below computes a HOG descriptor for a detection window using scikit-image, which exposes the cell and block geometry directly so you can see the parameters that Figure 16.3.1 names.
# Compute the HOG descriptor of a detection window with the canonical
# Dalal-Triggs parameters (9 orientations, 8x8 cells, 2x2 blocks), turning
# a 64x128 image into a fixed-length gradient-orientation feature vector.
import cv2
import numpy as np
from skimage.feature import hog
# A pedestrian detection window is canonically 64 wide x 128 tall.
window = cv2.imread("person_crop.png", cv2.IMREAD_GRAYSCALE)
window = cv2.resize(window, (64, 128))
features, hog_image = hog(
window,
orientations=9, # 9 unsigned orientation bins
pixels_per_cell=(8, 8), # 8x8 pixel cells
cells_per_block=(2, 2), # 2x2 cell blocks, normalized together
block_norm="L2-Hys", # L2 norm with clipping, the Dalal-Triggs choice
visualize=True,
feature_vector=True)
print(f"HOG feature length: {features.shape[0]}")
# 7 blocks wide x 15 tall x (2*2 cells) x 9 bins = 3780
# HOG feature length: 3780
L2-Hys is L2 normalization followed by clipping large values and renormalizing, which suppresses the influence of very strong gradients.2. The Linear SVM: Maximum-Margin Classification Intermediate
HOG turns each window into a $3780$-dimensional vector; now we need a classifier that decides "person" or "not person." Dalal and Triggs used a linear support vector machine, and the choice was deliberate. A linear SVM learns a weight vector $\mathbf{w}$ and bias $b$ so the decision is simply the sign of $\mathbf{w} \cdot \mathbf{x} + b$, and among all hyperplanes that separate the training classes it picks the one with the largest margin, the widest gap to the nearest examples of either class. Maximizing the margin is a principled defense against overfitting: the boundary depends only on the borderline examples (the support vectors) and ignores the easy interior ones. Formally the soft-margin SVM minimizes
$$ \frac{1}{2}\lVert \mathbf{w} \rVert^2 \;+\; C \sum_i \max\!\big(0,\; 1 - y_i(\mathbf{w}\cdot \mathbf{x}_i + b)\big), $$
where $y_i \in \{-1, +1\}$ are the labels and $C$ trades margin width against training errors. The first term widens the margin; the hinge-loss sum penalizes points on the wrong side. The $1$ inside the loss is the margin itself: a point earns zero penalty only once $y_i(\mathbf{w}\cdot \mathbf{x}_i + b) \ge 1$, that is, once it sits not merely on the correct side of the boundary but a full margin's width clear of it.
Two properties made the linear SVM ideal for detection. It is extremely fast at test time, one dot product per window, which matters enormously when you evaluate millions of windows per image. And the learned weight vector $\mathbf{w}$ is itself a HOG-shaped template you can visualize: positive weights mark "person-like gradient here," negative weights mark "background-like gradient here," and rendering $\mathbf{w}$ produces a ghostly average pedestrian.
Section 16.1's template matching correlated a fixed pixel patch against the image. A linear SVM on HOG correlates a learned weight vector against the HOG feature image, and that is the whole conceptual upgrade in one sentence. The template is no longer a single photographed example; it is the maximum-margin combination of thousands of examples, expressed in a gradient-orientation space that already discards lighting and texture. When you reach the first convolutional layer of a CNN in Chapter 19 and find filters that look like oriented edge detectors, you are seeing this same idea, learned rather than designed, and stacked many layers deep.
# Train a linear SVM on HOG vectors so the learned weight vector w becomes a
# soft, maximum-margin pedestrian template; classifying a window is then a
# single dot product with w, the cheap test-time cost detection relies on.
import numpy as np
from sklearn.svm import LinearSVC
# X: (n_windows, 3780) HOG vectors; y: +1 for person crops, -1 for background.
svm = LinearSVC(C=0.01, max_iter=20000)
svm.fit(X_train, y_train)
# The learned weight vector is itself a HOG-shaped detector you can score with.
w = svm.coef_.ravel() # shape (3780,)
score = w @ hog_window + svm.intercept_[0] # > 0 means "person"
print(f"decision score: {score:.3f} ({'person' if score > 0 else 'background'})")
# decision score: 1.842 (person)
coef_ is the learned weight vector $\mathbf{w}$, and classifying any window reduces to a single dot product with it, the one-operation test-time cost that makes dense sliding-window search feasible.3. From Classifier to Detector: Sliding Windows Over a Pyramid Intermediate
A classifier answers "is this window a person?" A detector answers "where are the people?", and the bridge is exhaustive search. Slide the $64 \times 128$ window across the image in small steps, scoring each placement with the SVM; then resize the image down and repeat, so a single fixed-size window finds people at every scale. This is the same image pyramid from Chapter 4 that powered the multi-scale template search in Section 16.1, now feeding a learned classifier instead of a correlation. The result is a dense field of scores; thresholding it yields candidate detections, typically thousands per image before cleanup.
# Turn the HOG+SVM classifier into a detector: slide the window at every
# position and pyramid scale, HOG-encode and score each placement, and keep
# above-threshold boxes mapped back to original-image coordinates.
import cv2
import numpy as np
from skimage.feature import hog
def detect(image, svm_w, svm_b, win=(128, 64), stride=8, scales=10, downscale=1.2):
"""Sliding-window HOG+SVM detection across an image pyramid."""
detections = []
for s in range(scales):
scale = downscale ** s
resized = cv2.resize(image, None, fx=1 / scale, fy=1 / scale)
if resized.shape[0] < win[0] or resized.shape[1] < win[1]:
break
for y in range(0, resized.shape[0] - win[0], stride):
for x in range(0, resized.shape[1] - win[1], stride):
patch = resized[y:y + win[0], x:x + win[1]]
f = hog(patch, orientations=9, pixels_per_cell=(8, 8),
cells_per_block=(2, 2), block_norm="L2-Hys") # match the training descriptor exactly
score = svm_w @ f + svm_b
if score > 0.5: # detection threshold
detections.append((int(x * scale), int(y * scale),
int(win[1] * scale), int(win[0] * scale), score))
return detections
raw = detect(cv2.imread("street.png", cv2.IMREAD_GRAYSCALE), svm.coef_.ravel(),
svm.intercept_[0])
print(f"raw detections before NMS: {len(raw)}")
# raw detections before NMS: 1473
4. Hard Negative Mining and Non-Maximum Suppression Advanced
Two practical problems break the naive pipeline, and both fixes became permanent fixtures of detection. The first is data imbalance: a training image contains a handful of person windows and literally millions of background windows, so a classifier trained on a random sample of negatives never sees the hard negatives (a lamppost, a tree trunk, a doorway) that actually trigger false positives. Hard negative mining solves this by bootstrapping: train on an initial set, run the detector on negative images, collect every high-scoring false positive (the hard negatives), add them to the training set, and retrain. Iterate until false positives stop appearing. This focuses the SVM's capacity exactly where it matters, on the boundary cases, and it typically halves the miss rate.
The second problem is redundancy. A real pedestrian triggers the detector at many nearby positions and scales, producing a cluster of overlapping boxes (the $1473$ in the code above). Non-maximum suppression (NMS) collapses each cluster to one box: sort detections by score, greedily keep the highest, remove every remaining box that overlaps it by more than an intersection-over-union threshold, and repeat. Figure 16.3.2 shows both stages, the bootstrap loop that hardens the classifier and the greedy suppression that cleans its output.
NMS is so fundamental that it survived the deep-learning transition almost unchanged; the box-cleanup stage of the detectors in Chapter 23 is the same greedy algorithm. The implementation below is the classic vectorized version.
# Collapse the cluster of overlapping detection boxes around each object
# into one: process boxes highest-score-first and discard any later box whose
# intersection-over-union with a kept box exceeds the threshold.
import numpy as np
def nms(boxes, scores, iou_thresh=0.4):
"""Greedy non-maximum suppression; boxes as (x, y, w, h)."""
if len(boxes) == 0:
return []
b = np.array(boxes, dtype=np.float32)
x1, y1 = b[:, 0], b[:, 1]
x2, y2 = b[:, 0] + b[:, 2], b[:, 1] + b[:, 3]
areas = (x2 - x1) * (y2 - y1)
order = np.argsort(scores)[::-1] # highest score first
keep = []
while order.size > 0:
i = order[0]
keep.append(i)
xx1 = np.maximum(x1[i], x1[order[1:]]) # overlap with the rest
yy1 = np.maximum(y1[i], y1[order[1:]])
xx2 = np.minimum(x2[i], x2[order[1:]])
yy2 = np.minimum(y2[i], y2[order[1:]])
inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
iou = inter / (areas[i] + areas[order[1:]] - inter)
order = order[1:][iou < iou_thresh] # drop boxes that overlap the kept one
return keep
kept = nms([d[:4] for d in raw], [d[4] for d in raw])
print(f"after NMS: {len(kept)} detections")
# after NMS: 6 detections
5. Measuring Detection Honestly: Miss Rate Versus FPPI Intermediate
Detection cannot be scored by accuracy, because the overwhelming majority of windows are background and a detector that says "no person" everywhere would score near-perfect accuracy while finding nobody. The pedestrian-detection community settled on a curve instead: miss rate versus false positives per image (FPPI), traced out by sweeping the detection threshold. Lowering the threshold finds more true pedestrians (lower miss rate) but also raises false alarms (higher FPPI); the curve shows the full trade-off, and detectors are summarized by their log-average miss rate over a standard FPPI range. The Caltech Pedestrian study (Dollár et al., IEEE TPAMI 2012) made this protocol the field standard and benchmarked sixteen detectors under it, which is why HOG-era results are quoted as miss rates, not accuracies. The mean average precision (mAP) of Chapter 23 is the same idea generalized: integrate precision over recall instead of miss rate over FPPI. The illustration below makes the absurdity vivid: a detector can win a near-perfect accuracy trophy while finding nobody at all.
There is a classic forecaster who predicts "no rain tomorrow" every single day in the desert and retires with a 95 percent accuracy record and zero useful forecasts. The "no person everywhere" detector is that forecaster with a camera: accuracy is trivially won by always betting on the majority class, so the number certifies your grasp of the base rate, not your skill at the task. Miss-rate-versus-FPPI exists precisely to take the bet away, scoring you only on the rare windows that actually matter. The keeper phrase for the whole detector is slide, score, suppress, then count the misses, not the hits.
Who & situation: a building-security integrator deployed a HOG+SVM pedestrian detector across a chain of parking garages to trigger lighting and alert guards to people in low-traffic zones at night. Problem: the detector worked beautifully in daytime tests but in production it missed roughly a third of real pedestrians and fired constantly on pillars and parked-car edges. Decision: the team realized the public Dalal-Triggs model had been trained on upright, well-lit, unoccluded pedestrians, nothing like a crouching person between cars under sodium lighting. They ran hard negative mining directly on a week of the garages' own background footage to learn the pillar and bumper false positives, and they retrained the positive set with garage-specific crops including partially occluded and bent poses. Result: the log-average miss rate on their own validation footage fell by more than half, and the false-alarm rate dropped to a level the guards stopped ignoring. Lesson: a classical detector is exactly as good as the negatives it was hardened against and the positives it was shown; hard negative mining on the deployment distribution, not the benchmark distribution, is what separates a demo from a system.
HOG+SVM is no longer state of the art, but its structure is everywhere in modern detection. The sliding window became the dense prediction head; the image pyramid became the feature pyramid network; hard negative mining became focal loss and online hard example mining; non-maximum suppression remains a literal final stage in most detectors. The 2024-2026 frontier even revisits HOG's gradient-orientation idea: edge and gradient maps are now standard conditioning inputs for controllable generation (the ControlNet edge condition of Chapter 35 is a soft HOG), and real-time detectors in the YOLO line (which Chapter 23 covers) keep the slide-score-suppress skeleton while replacing the HOG features and the linear SVM with a single learned network. The 2005 pipeline was not discarded; its feature extractor and classifier were fused and made learnable, and everything around them stayed.
The pipeline above, descriptor plus SVM plus pyramid plus NMS, is well over a hundred lines to assemble from scratch. OpenCV ships a complete, pretrained Dalal-Triggs pedestrian detector behind HOGDescriptor with a built-in people SVM, the sliding-window search, the pyramid, and NMS-style grouping all handled internally; you write three lines. It returns boxes and confidences directly, so the entire Sections 1 through 4 reduce to one call to detectMultiScale.
# OpenCV's batteries-included pedestrian detector: load the pretrained
# people SVM into HOGDescriptor and let detectMultiScale run the full
# descriptor, pyramid, sliding window, and grouping internally.
import cv2
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector()) # pretrained SVM
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
# boxes: (n, 4) pedestrian rectangles; weights: their SVM scores
detectMultiScale runs the full HOG+SVM sliding-window pyramid and grouping internally, collapsing the hundred-plus lines of Sections 1 through 4 into a single call returning boxes and scores.The five stages of this section are easiest to internalize once you have wired them together yourself and watched the numbers move. The Hands-On Lab at the end of this section builds the complete Dalal-Triggs detector end to end, descriptor, support vector machine, sliding-window pyramid, hard negative mining, and non-maximum suppression, then measures it with the miss-rate curve Exercises 16.3.1 through 16.3.3 ask you to reason about.
Dalal and Triggs reported that a linear SVM on HOG nearly matched a much slower kernel SVM, and they chose the linear one. Explain the two reasons a linear classifier is preferable for sliding-window detection specifically, referring to the per-window test-time cost and to the visualizability of the weight vector. Then argue why the same speed argument does not hold for the whole-image classification of Section 16.2, where kernel SVMs were the norm.
Run the Section 3 detector on a few street images, then apply the Section 4 NMS with IoU thresholds of $0.2$, $0.4$, and $0.7$. Visualize the kept boxes for each threshold over the same image and count detections. Describe what goes wrong at the extremes: which threshold merges two close pedestrians into one box, and which leaves duplicate boxes on a single pedestrian? Then implement soft-NMS (decay overlapping scores instead of deleting the boxes) and report whether it resolves the two-close-pedestrians failure.
Train two HOG+SVM detectors on the same positive set: one with negatives sampled randomly from background images, and one with two rounds of hard negative mining. Evaluate both with a miss-rate-versus-FPPI curve on a held-out test set. Quantify the improvement at a fixed FPPI of $1$ false positive per image, and inspect the hard negatives the mining collected; describe the three most common false-positive structures it found and explain, in terms of gradient orientations, why each one fooled the initial detector.
Build one self-contained script, hog_detector.py, that trains a complete HOG plus linear-SVM pedestrian detector on a standard public dataset, hardens it with one round of hard negative mining, runs it as a sliding-window detector over an image pyramid, cleans the output with non-maximum suppression, and scores it honestly with the miss-rate-versus-false-positives-per-image curve of Section 5. The output is a working detector you trained yourself plus the curve that proves how good it is, the exact five-step recognition recipe of this chapter assembled into one runnable artifact.
What You'll Practice
- Computing HOG descriptors with the canonical Dalal-Triggs parameters and training a maximum-margin linear SVM on them (Sections 1 and 2).
- Turning a window classifier into a detector with a sliding window over an image pyramid (Section 3, reusing the pyramid of Chapter 4).
- Implementing the hard negative mining bootstrap loop and greedy non-maximum suppression that detection cannot live without (Section 4).
- Scoring a detector with the miss-rate-versus-FPPI curve rather than accuracy (Section 5).
Setup
pip install scikit-image scikit-learn opencv-python numpy matplotlib
Use the INRIA Person dataset (or any small set of cropped person windows plus person-free background images). Resize positive crops to the canonical $64 \times 128$ window. No GPU is required; everything here runs on a CPU in a few minutes for a few thousand windows. The same pipeline works on any single-object class: swap the positive crops and you have a detector for that class instead.
Put the section's detection pipeline into practice below. Work through the steps in order; each one prints a checkpoint so you can confirm progress before moving on. A complete reference solution is folded at the end.
Step 1: Encode the training windows as HOG vectors
Load the positive person crops and an initial random sample of background windows, resize each to $64 \times 128$, and turn every window into the $3780$-dimensional HOG vector of Section 1. This is the Represent step of the recognition recipe.
import numpy as np, cv2
from skimage.feature import hog
WIN = (128, 64) # (height, width): the canonical 64x128 pedestrian window
def to_hog(gray):
return hog(gray, orientations=9, pixels_per_cell=(8, 8),
cells_per_block=(2, 2), block_norm="L2-Hys")
def load_features(paths, label):
X = []
for p in paths:
g = cv2.resize(cv2.imread(p, cv2.IMREAD_GRAYSCALE), (WIN[1], WIN[0]))
# TODO: append the HOG vector of this window to X.
# Hint: use the to_hog helper above.
...
return np.array(X), np.full(len(X), label)
Xpos, ypos = load_features(pos_paths, +1)
Xneg, yneg = load_features(neg_paths, -1)
X = np.vstack([Xpos, Xneg]); y = np.concatenate([ypos, yneg])
print(f"Training set: {len(Xpos)} positives, {len(Xneg)} negatives, dim {X.shape[1]}")
Hint
Inside the loop, call X.append(to_hog(g)). Every window must be encoded with the identical HOG parameters you will use at detection time, otherwise the SVM scores a feature it was never trained on. Expect dim 3780.
Step 2: Train the linear SVM
Fit a linear support vector machine on the HOG vectors. The learned weight vector is the soft, maximum-margin pedestrian template of Section 2; scoring a window will be a single dot product with it. This is the Score step.
from sklearn.svm import LinearSVC
# TODO: train a LinearSVC on (X, y) and pull out the weight vector and bias.
# Hint: small C (around 0.01) favors a wide margin; coef_ and intercept_ hold w and b.
svm = ...
w = ...
b = ...
print(f"Trained SVM; weight vector length {w.shape[0]}, bias {b:.3f}")
Hint
svm = LinearSVC(C=0.01, max_iter=20000).fit(X, y), then w = svm.coef_.ravel() and b = svm.intercept_[0]. Reshaping w back onto the window and visualizing it produces the ghostly average pedestrian described in Section 2.
Step 3: Slide the detector over an image pyramid
Write the detection loop: at each pyramid scale, slide the $64 \times 128$ window with an $8$-pixel stride, HOG-encode and score each placement, and keep above-threshold boxes mapped back to original coordinates. This is the Search step that turns a classifier into a detector.
def detect(image, w, b, stride=8, scales=10, downscale=1.2, thresh=0.5):
dets = []
for s in range(scales):
scale = downscale ** s
resized = cv2.resize(image, None, fx=1 / scale, fy=1 / scale)
if resized.shape[0] < WIN[0] or resized.shape[1] < WIN[1]:
break
for y0 in range(0, resized.shape[0] - WIN[0], stride):
for x0 in range(0, resized.shape[1] - WIN[1], stride):
patch = resized[y0:y0 + WIN[0], x0:x0 + WIN[1]]
# TODO: HOG-encode the patch, score it with w and b, and if the
# score exceeds thresh append (x, y, w_box, h_box, score) in
# ORIGINAL-image coordinates (multiply positions by scale).
...
return dets
raw = detect(cv2.imread("street.png", cv2.IMREAD_GRAYSCALE), w, b)
print(f"raw detections before NMS: {len(raw)}")
Hint
Compute score = w @ to_hog(patch) + b; when score > thresh, append (int(x0*scale), int(y0*scale), int(WIN[1]*scale), int(WIN[0]*scale), score). Expect on the order of a thousand raw, overlapping detections, exactly the redundancy the next step removes.
Step 4: Clean the output with non-maximum suppression
Collapse each cluster of overlapping boxes to the single highest-scoring one with greedy NMS, the Suppress step. Process boxes highest-score-first and discard any later box whose intersection-over-union with a kept box exceeds the threshold.
def nms(boxes, scores, iou_thresh=0.4):
if len(boxes) == 0:
return []
b = np.array(boxes, dtype=np.float32)
x1, y1 = b[:, 0], b[:, 1]
x2, y2 = b[:, 0] + b[:, 2], b[:, 1] + b[:, 3]
areas = (x2 - x1) * (y2 - y1)
order = np.argsort(scores)[::-1]
keep = []
while order.size > 0:
i = order[0]; keep.append(i)
xx1 = np.maximum(x1[i], x1[order[1:]]); yy1 = np.maximum(y1[i], y1[order[1:]])
xx2 = np.minimum(x2[i], x2[order[1:]]); yy2 = np.minimum(y2[i], y2[order[1:]])
inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
iou = inter / (areas[i] + areas[order[1:]] - inter)
# TODO: keep only the boxes whose IoU with box i is below iou_thresh.
order = ...
return keep
kept = nms([d[:4] for d in raw], [d[4] for d in raw])
print(f"after NMS: {len(kept)} detections")
Hint
The surviving boxes are those that do not overlap the kept one: order = order[1:][iou < iou_thresh]. This is the identical greedy algorithm that survives into the deep detectors of Chapter 23.
Step 5: Harden the detector with hard negative mining
Run the current detector over person-free background images, collect every above-threshold box as a hard negative, add those windows to the training set, and retrain. This bootstrap loop concentrates the SVM's capacity on the confusing background and typically halves the miss rate.
hard = []
for bg_path in background_image_paths: # images known to contain no people
img = cv2.imread(bg_path, cv2.IMREAD_GRAYSCALE)
for (x0, y0, bw, bh, score) in detect(img, w, b, thresh=0.0):
crop = cv2.resize(img[y0:y0 + bh, x0:x0 + bw], (WIN[1], WIN[0]))
hard.append(to_hog(crop)) # a false positive, now a labeled negative
# TODO: add the hard negatives to the training set with label -1 and retrain.
# Hint: vstack hard onto X, concatenate -1 labels onto y, refit the LinearSVC.
X2, y2 = ..., ...
svm2 = LinearSVC(C=0.01, max_iter=20000).fit(X2, y2)
w, b = svm2.coef_.ravel(), svm2.intercept_[0]
print(f"Mined {len(hard)} hard negatives; retrained detector")
Hint
X2 = np.vstack([X, np.array(hard)]) and y2 = np.concatenate([y, np.full(len(hard), -1)]). Detecting with thresh=0.0 on background-only images surfaces every confident false positive; those lampposts and doorways are exactly the windows the first SVM never saw.
Step 6: Score the detector with a miss-rate curve
Sweep the detection threshold over a test set, matching kept boxes to ground-truth boxes by IoU, and plot miss rate against false positives per image, the Settle step. The curve, not a single accuracy number, is the honest summary of a detector.
import matplotlib.pyplot as plt
def evaluate(test_images, gt_boxes, w, b, thresholds=np.linspace(-1, 2, 25)):
miss_rates, fppi = [], []
for t in thresholds:
misses = false_pos = total_gt = 0
for img, gts in zip(test_images, gt_boxes):
dets = nms_filter(detect(img, w, b, thresh=t)) # detect + NMS
matched, fp = match_to_gt(dets, gts, iou_thresh=0.5)
misses += len(gts) - matched; false_pos += fp; total_gt += len(gts)
# TODO: append miss rate (misses / total_gt) and FPPI (false_pos / #images).
...
return fppi, miss_rates
x, ymr = evaluate(test_images, gt_boxes, w, b)
plt.loglog(x, ymr, marker="o"); plt.xlabel("false positives per image")
plt.ylabel("miss rate"); plt.title("HOG+SVM pedestrian detector"); plt.savefig("miss_rate.png")
print("Wrote miss_rate.png")
Hint
Append miss_rates.append(misses / max(total_gt, 1)) and fppi.append(false_pos / len(test_images)). A detection counts as a hit when its IoU with an unmatched ground-truth box clears $0.5$; every unmatched detection is a false positive. Lowering the threshold pushes you down-and-right along the curve, the trade-off Section 5 describes.
Step 7: The Right Tool, the pretrained detector in three lines
You built the pipeline to understand it; in production OpenCV ships the whole thing pretrained. Run its built-in Dalal-Triggs people detector and compare its boxes against your own, mirroring this chapter's Right Tool principle.
# TODO: load OpenCV's pretrained people SVM into a HOGDescriptor and run
# detectMultiScale on a test image; it does descriptor, pyramid, slide, and
# grouping internally. Hint: cv2.HOGDescriptor_getDefaultPeopleDetector().
hog_cv = cv2.HOGDescriptor()
...
boxes, weights = ...
print(f"OpenCV pretrained detector found {len(boxes)} pedestrians")
Hint
hog_cv.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector()), then boxes, weights = hog_cv.detectMultiScale(image, winStride=(8, 8), scale=1.05). The roughly hundred lines of Steps 1 through 6 collapse to two calls; you wrote the long version so you can diagnose the short one when it misses.
Expected Output
Step 1 reports a training set of a few hundred positives, a few thousand negatives, and dimension $3780$. Step 2 confirms a $3780$-length weight vector. Step 3 prints on the order of a thousand raw detections; Step 4 collapses them to a handful (single digits on a typical street image). Step 5 mines tens to hundreds of hard negatives and retrains. Step 6 writes miss_rate.png: a downward-sloping log-log curve where lowering the threshold trades a lower miss rate for a higher false-positive rate, and the post-mining detector's curve sits visibly below the pre-mining one. Step 7's pretrained detector should find roughly the same pedestrians your hand-built one does. If your miss rate stays high everywhere, the usual cause is too few or too-easy negatives; another round of hard negative mining is the fix, exactly as Section 4 predicts.
Stretch Goals
- Add a second round of hard negative mining and overlay all three miss-rate curves (random negatives, one mining round, two mining rounds) on one plot to see the bootstrap converge, the experiment Exercise 16.3.3 describes.
- Replace the deletion in your NMS with soft-NMS (decay overlapping scores instead of removing the boxes) and check whether it recovers two pedestrians standing close together, the failure case of Exercise 16.3.2.
- Retrain the whole pipeline for a different rigid class (a stop sign, a car rear) by swapping only the positive crops, then compare your hand-crafted detector against a small CNN from Chapter 19 on the same data, previewing the plateau argument of Section 16.6.
Complete Solution
import numpy as np, cv2
from skimage.feature import hog
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
WIN = (128, 64) # (height, width)
def to_hog(gray):
return hog(gray, orientations=9, pixels_per_cell=(8, 8),
cells_per_block=(2, 2), block_norm="L2-Hys")
# Step 1: encode training windows as HOG vectors
def load_features(paths, label):
X = [to_hog(cv2.resize(cv2.imread(p, cv2.IMREAD_GRAYSCALE), (WIN[1], WIN[0])))
for p in paths]
return np.array(X), np.full(len(X), label)
Xpos, ypos = load_features(pos_paths, +1)
Xneg, yneg = load_features(neg_paths, -1)
X = np.vstack([Xpos, Xneg]); y = np.concatenate([ypos, yneg])
# Step 2: train the linear SVM
svm = LinearSVC(C=0.01, max_iter=20000).fit(X, y)
w, b = svm.coef_.ravel(), svm.intercept_[0]
# Step 3: sliding-window detection over a pyramid
def detect(image, w, b, stride=8, scales=10, downscale=1.2, thresh=0.5):
dets = []
for s in range(scales):
scale = downscale ** s
resized = cv2.resize(image, None, fx=1 / scale, fy=1 / scale)
if resized.shape[0] < WIN[0] or resized.shape[1] < WIN[1]:
break
for y0 in range(0, resized.shape[0] - WIN[0], stride):
for x0 in range(0, resized.shape[1] - WIN[1], stride):
patch = resized[y0:y0 + WIN[0], x0:x0 + WIN[1]]
score = w @ to_hog(patch) + b
if score > thresh:
dets.append((int(x0 * scale), int(y0 * scale),
int(WIN[1] * scale), int(WIN[0] * scale), float(score)))
return dets
# Step 4: greedy non-maximum suppression
def nms(boxes, scores, iou_thresh=0.4):
if len(boxes) == 0:
return []
b = np.array(boxes, dtype=np.float32)
x1, y1 = b[:, 0], b[:, 1]
x2, y2 = b[:, 0] + b[:, 2], b[:, 1] + b[:, 3]
areas = (x2 - x1) * (y2 - y1)
order = np.argsort(scores)[::-1]
keep = []
while order.size > 0:
i = order[0]; keep.append(i)
xx1 = np.maximum(x1[i], x1[order[1:]]); yy1 = np.maximum(y1[i], y1[order[1:]])
xx2 = np.minimum(x2[i], x2[order[1:]]); yy2 = np.minimum(y2[i], y2[order[1:]])
inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
iou = inter / (areas[i] + areas[order[1:]] - inter)
order = order[1:][iou < iou_thresh]
return keep
def nms_filter(dets):
if not dets:
return []
keep = nms([d[:4] for d in dets], [d[4] for d in dets])
return [dets[i] for i in keep]
# Step 5: hard negative mining
hard = []
for bg_path in background_image_paths:
img = cv2.imread(bg_path, cv2.IMREAD_GRAYSCALE)
for (x0, y0, bw, bh, score) in detect(img, w, b, thresh=0.0):
crop = cv2.resize(img[y0:y0 + bh, x0:x0 + bw], (WIN[1], WIN[0]))
hard.append(to_hog(crop))
X2 = np.vstack([X, np.array(hard)])
y2 = np.concatenate([y, np.full(len(hard), -1)])
svm2 = LinearSVC(C=0.01, max_iter=20000).fit(X2, y2)
w, b = svm2.coef_.ravel(), svm2.intercept_[0]
# Step 6: miss-rate-versus-FPPI evaluation
def iou(a, g):
ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
gx1, gy1, gx2, gy2 = g[0], g[1], g[0] + g[2], g[1] + g[3]
ix1, iy1 = max(ax1, gx1), max(ay1, gy1)
ix2, iy2 = min(ax2, gx2), min(ay2, gy2)
inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
union = (ax2 - ax1) * (ay2 - ay1) + (gx2 - gx1) * (gy2 - gy1) - inter
return inter / union if union > 0 else 0.0
def match_to_gt(dets, gts, iou_thresh=0.5):
used = set(); matched = 0; fp = 0
for d in sorted(dets, key=lambda z: -z[4]):
hit = None
for j, g in enumerate(gts):
if j not in used and iou(d[:4], g) >= iou_thresh:
hit = j; break
if hit is None:
fp += 1
else:
used.add(hit); matched += 1
return matched, fp
def evaluate(test_images, gt_boxes, w, b, thresholds=np.linspace(-1, 2, 25)):
miss_rates, fppi = [], []
for t in thresholds:
misses = false_pos = total_gt = 0
for img, gts in zip(test_images, gt_boxes):
dets = nms_filter(detect(img, w, b, thresh=t))
matched, fp = match_to_gt(dets, gts, iou_thresh=0.5)
misses += len(gts) - matched; false_pos += fp; total_gt += len(gts)
miss_rates.append(misses / max(total_gt, 1))
fppi.append(false_pos / len(test_images))
return fppi, miss_rates
x, ymr = evaluate(test_images, gt_boxes, w, b)
plt.loglog(np.clip(x, 1e-3, None), np.clip(ymr, 1e-3, None), marker="o")
plt.xlabel("false positives per image"); plt.ylabel("miss rate")
plt.title("HOG+SVM pedestrian detector"); plt.savefig("miss_rate.png")
# Step 7: the Right Tool, OpenCV's pretrained detector
hog_cv = cv2.HOGDescriptor()
hog_cv.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
test = cv2.imread("street.png", cv2.IMREAD_GRAYSCALE)
boxes, weights = hog_cv.detectMultiScale(test, winStride=(8, 8), scale=1.05)
print(f"OpenCV pretrained detector found {len(boxes)} pedestrians")