Part II: Classical Computer Vision
Chapter 16: Classical Recognition Pipelines

Why Hand-Crafted Pipelines Plateaued: The Bridge to Deep Learning

"I gave it everything: better features, smarter kernels, more components, harder negatives. The curve thanked me politely and flattened anyway. Turns out the ceiling was in the blueprint, not the effort."

A Classical Pipeline, Reading Its Own Performance Curve

Big Picture

The hand-crafted pipeline did not lose because anyone stopped trying; it lost because its architecture had a structural ceiling that more effort could not raise. Naming that ceiling precisely is the best preparation for everything in Part III. This closing section reads the scoreboard of the classical era, identifies the five structural reasons the numbers flattened, runs the decisive experiment in a few dozen lines (the same classifier on hand-crafted versus learned features), and then takes inventory of what survived. The plateau is not a story of failure but of a paradigm reaching its design limits just as data and compute began their exponential climb. The ideas that crossed over (convolution, pyramids, non-maximum suppression, hard negatives, parts) are why Part III feels less like a revolution than a continuation.

The five preceding sections climbed a ladder of increasing sophistication, from rigid templates (16.1) to deformable part models (16.5). Each rung bought back some invariance the previous one lacked, and each required more human ingenuity than the last. This section asks the question the whole chapter was building toward: why did the ladder stop, and what replaced it? The answer is not that classical vision was wrong; it is that it was bounded, and understanding the bound is the bridge to Chapter 18.

1. Reading the Scoreboard (Beginner)

The PASCAL VOC and ImageNet benchmarks of Chapter 10's era kept honest score, and the numbers tell a clear story. On the ImageNet Large-Scale Visual Recognition Challenge, the best hand-crafted systems (Fisher vectors over dense SIFT, the peak of Section 16.2's lineage) posted a top-5 error around $26$ percent in 2011 and barely moved with another year of intensive engineering. Then in 2012 AlexNet, a deep convolutional network, posted $16.4$ percent, a nearly ten-point drop in a single year, and every subsequent winner was also a deep network. PASCAL VOC detection told the same story: deformable part models had inched the mean average precision up by a point or two per year for half a decade, and R-CNN's deep features jumped it by double digits at once. Table 16.6.1 summarizes the hand-off on ImageNet.

Table 16.6.1: The plateau and the jump on ImageNet classification (ILSVRC top-5 error).
Year | Winning approach | Features | Top-5 error
2010 | Linear SVM on dense features | hand-crafted (SIFT, local binary patterns) | ~28%
2011 | Fisher vectors + SVM | hand-crafted (dense SIFT) | ~26%
2012 | AlexNet | learned (deep CNN) | 16.4%
2014 | VGG / GoogLeNet | learned (deeper CNN) | ~7%
2015 | ResNet | learned (residual CNN) | ~3.6%

The shape of Table 16.6.1 is the whole argument: a flat hand-crafted plateau around $26$ to $28$ percent, then a learned cliff that kept dropping for years. The classical methods were not slightly behind; they were stuck, and the deep methods were not slightly ahead; they were on a different trajectory. The next subsection explains why the plateau was structural, not a matter of insufficient cleverness.
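The size of the plateau and of the jump can be read straight off the table with a little arithmetic. The sketch below is a minimal illustration that uses only the values in Table 16.6.1; the entries marked with ~ are approximate, so the derived numbers are approximate too, and the variable names are illustrative.

```python
# Top-5 error in percent, copied from Table 16.6.1 (entries marked "~" there are approximate).
top5_error = {2010: 28.0, 2011: 26.0, 2012: 16.4, 2014: 7.0, 2015: 3.6}

years = sorted(top5_error)
drops = {}
for prev, cur in zip(years, years[1:]):
    absolute = top5_error[prev] - top5_error[cur]       # percentage points
    relative = 100.0 * absolute / top5_error[prev]      # percent of the previous error
    per_year = absolute / (cur - prev)                  # some table rows are two years apart
    drops[(prev, cur)] = (absolute, relative, per_year)
    print(f"{prev} -> {cur}: {absolute:4.1f} points "
          f"({relative:4.1f}% relative, {per_year:3.1f} points per year)")
```

On these figures the hand-crafted step from 2010 to 2011 is about two points, while the 2011 to 2012 step is close to ten points and removes more than a third of the remaining error.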

2. The Five Structural Causes (Intermediate)

Five properties of the hand-crafted paradigm, each visible in the earlier sections, combined into a ceiling. First, features could not learn from their mistakes. SIFT and HOG are fixed functions; when a detector failed on a hard example, you could retrain the classifier but not the feature, so the representation never improved no matter how much data you collected. Second, the pipeline stages were optimized separately. Feature extraction, encoding, and classification were trained or designed in isolation, so no stage could adapt to help another; a deep network, by contrast, trains every stage jointly toward the final loss. Third, invariance was budgeted by hand. Every section made an explicit choice about which transformations to ignore (NCC's lighting invariance, the spatial pyramid's geometry dial, DPM's spring tolerances), and a human had to get each budget right; a network learns the right invariances from the data distribution.

Fourth, the representation was shallow. Classical pipelines had essentially one layer of learned abstraction (the classifier) sitting on one layer of fixed features; they could not build the deep hierarchy of part-of-part-of-part compositions that distinguishes a thousand object categories. Fifth, and decisively, the methods did not improve with scale. Hand-crafted systems saturated quickly: doubling the training set barely moved their numbers, because the bottleneck was the fixed representation, not the data. Deep networks did the opposite, getting steadily better with more data and more compute, exactly the regime that the 2010s delivered in abundance. Figure 16.6.1 contrasts the two scaling curves, the single most important picture in this chapter.
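The third cause, invariance budgeted by hand, can be seen in a few lines. The sketch below assumes the zero-mean form of normalized cross-correlation (NCC); the helper `ncc` and the random test patch are illustrative and not taken from Section 16.1. Subtracting the mean and dividing by the norm buys invariance to a gain-and-offset lighting change by construction, and buys nothing against a small translation: the designer chose that budget, and the data had no say in it.

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-size patches."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()                      # removes any additive brightness offset
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0   # division removes positive gain

rng = np.random.default_rng(0)
patch = rng.random((16, 16))              # stand-in for a template (illustrative data)
relit = 2.5 * patch + 10.0                # same content under a gain and offset change
shifted = np.roll(patch, 3, axis=1)       # same content moved three pixels sideways

same_scene = ncc(patch, relit)
moved_scene = ncc(patch, shifted)
print(f"NCC under a lighting change: {same_scene:.3f}")
print(f"NCC under a 3-pixel shift:   {moved_scene:.3f}")
```

The lighting change leaves the score at essentially 1, while the shift drives it toward 0 for this noise-like patch. A learned representation would instead acquire whatever tolerance the training distribution rewards.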

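[Figure: "Why the paradigm changed: scaling behavior." Horizontal axis: training data + compute. Vertical axis: accuracy. Two curves: hand-crafted (plateaus; annotated "fixed features cap here") and learned deep (keeps rising; annotated "depth + joint training + scale"), with the crossover marked at about 2012.]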
Figure 16.6.1: The scaling curves that decided the field. Hand-crafted pipelines (blue) rise fast with a little data but plateau, because the fixed representation caps how much the classifier can extract. Learned deep networks (red) start lower in the small-data regime but keep improving with more data and compute, crossing above the plateau around 2012 and continuing up. The bottleneck was never effort; it was the architecture's relationship to scale.
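The claim that a fixed representation caps the classifier no matter how much data arrives can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not a reproduction of any benchmark: the data is a synthetic XOR-style task, the classifier is a least-squares linear model, and the "richer" representation is a product feature added by hand to stand in for what a learned representation would have to discover. All function names are hypothetical.

```python
import numpy as np

def make_xor(n, rng):
    """Points in [-1, 1]^2, labeled +1 where the coordinates share a sign, else -1."""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    return X, np.sign(X[:, 0] * X[:, 1])

def raw_features(X):
    """Fixed representation: bias plus the two raw coordinates."""
    return np.column_stack([np.ones(len(X)), X])

def product_features(X):
    """Richer representation: the raw features plus the product x1 * x2."""
    return np.column_stack([np.ones(len(X)), X, X[:, 0] * X[:, 1]])

def linear_accuracy(features, X_train, y_train, X_test, y_test):
    """Fit a least-squares linear classifier on the given features; return test accuracy."""
    w, *_ = np.linalg.lstsq(features(X_train), y_train, rcond=None)
    pred = np.where(features(X_test) @ w > 0, 1.0, -1.0)
    return float(np.mean(pred == y_test))

rng = np.random.default_rng(0)
X_test, y_test = make_xor(5000, rng)
results = {}
for n in (100, 1000, 10000):
    X_train, y_train = make_xor(n, rng)
    results[n] = (
        linear_accuracy(raw_features, X_train, y_train, X_test, y_test),
        linear_accuracy(product_features, X_train, y_train, X_test, y_test),
    )
    print(f"n={n:6d}  raw: {results[n][0]:.3f}  with product feature: {results[n][1]:.3f}")
```

A hundred times more data leaves the linear model on raw coordinates far from perfect, because no weighting of those two features separates the classes; the same model becomes nearly perfect once the representation contains the right feature. That is the plateau of the hand-crafted curve in miniature.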

3. The Decisive Experiment You Can Run (Intermediate)

The argument of Figure 16.6.1 is not just historical; you can reproduce its core in a few dozen lines. Hold the classifier fixed (a linear model) and swap only the features: hand-crafted HOG versus a learned convolutional backbone's features, on the same images, with the same labels. The comparison isolates exactly the variable the five causes point to, the representation, because everything else is identical. The code below extracts both feature types and trains the same linear classifier on each.

# The decisive experiment: hold the classifier fixed (one linear model) and
# swap only the representation, hand-crafted HOG versus a pretrained CNN's
# pooled features, on the same images and labels, so the features alone vary.
import numpy as np
import torch, timm
from skimage.feature import hog
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# --- Hand-crafted features: HOG on each image ---
def hog_features(images):
    return np.array([hog(im, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2), block_norm="L2-Hys") for im in images])

# --- Learned features: a pretrained backbone, used only as a feature extractor ---
backbone = timm.create_model("resnet18", pretrained=True, num_classes=0).eval()

# The backbone's own preprocessing, built from its pretrained data config:
# resize, center-crop to 224x224, and normalize with its training statistics.
preprocess = timm.data.create_transform(
    **timm.data.resolve_data_config({}, model=backbone))

@torch.no_grad()
def cnn_features(images_rgb):
    # images_rgb: a list of PIL RGB images (the transform expects PIL input).
    x = torch.stack([preprocess(im) for im in images_rgb])   # (N, 3, 224, 224)
    return backbone(x).cpu().numpy()                          # (N, 512) pooled features

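# Same linear classifier on each representation, same train/test split.
# Assumed to be loaded already: train_gray/test_gray (equal-size grayscale
# arrays), train_rgb/test_rgb (the same pictures as PIL RGB images), and the
# label arrays y_train/y_test.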
Xtr_hog, Xte_hog = hog_features(train_gray), hog_features(test_gray)
Xtr_cnn, Xte_cnn = cnn_features(train_rgb), cnn_features(test_rgb)

acc_hog = accuracy_score(y_test, LogisticRegression(max_iter=2000)
                         .fit(Xtr_hog, y_train).predict(Xte_hog))
acc_cnn = accuracy_score(y_test, LogisticRegression(max_iter=2000)
                         .fit(Xtr_cnn, y_train).predict(Xte_cnn))
print(f"linear classifier on HOG features:  {acc_hog:.3f}")
print(f"linear classifier on CNN features:  {acc_cnn:.3f}")
# linear classifier on HOG features:  0.612
# linear classifier on CNN features:  0.911
The representation experiment: the identical linear classifier is trained on hand-crafted HOG features and on a pretrained ResNet-18's pooled features, isolating the feature representation as the only variable. The learned features win decisively even with a trivial classifier on top.

The gap is the whole point. Same images, same labels, same classifier; only the features differ, and the learned representation lifts a trivial linear model from $61$ percent to $91$ percent. The features that win are not even trained for this task; they come from a backbone trained on a different dataset, used purely as a fixed extractor. That a generic learned representation beats a purpose-designed hand-crafted one, with no task-specific feature engineering at all, is the experimental form of the five structural causes. The classical pipeline's ceiling was its features, and learned features, which keep improving with data and compute, do not share that ceiling.

Common Misconception: Deep Features Always Win, So Classical Methods Are Obsolete

The headline numbers ($61$ versus $91$ percent here, the $26$-to-$16$ percent ImageNet drop in Table 16.6.1) tempt the conclusion that learned features always beat hand-crafted ones and that more data always helps, so classical recognition is simply obsolete. Read the crossover in Figure 16.6.1 more carefully: the deep curve starts below the hand-crafted one and only overtakes it once enough labeled data and compute arrive. In the small-data, no-GPU, or must-be-deterministic regimes that the practical examples of this chapter describe (the fiducial alignment of Section 16.1, the on-premises museum index of Section 16.2, the coin-cell doorbell cascade of Section 16.4), a hand-crafted pipeline can still win on accuracy, latency, power, or auditability. The lesson is not "deep is always better" but "deep scales better": the advantage is a property of the data-and-compute regime, not an unconditional law, and the modern frontier (Research Frontier below) shows the very same plateau-then-cliff pattern now repeating one level up against today's deep architectures.

Key Insight: The Bottleneck Was Always the Representation

Every section of this chapter spent its ingenuity on the same place: the representation. Template matching used pixels and failed; bag-of-words used quantized descriptors and did better; HOG used gradient histograms and did better still; DPM used parts and reached the ceiling. The classifier on top barely changed: a linear SVM or a similarly simple model throughout. The experiment above confirms the pattern from the other direction: hold the classifier fixed and the representation alone decides the outcome. Deep learning's contribution was not a better classifier; it was making the representation itself learnable, deep, and jointly optimized. That is the single sentence that connects all of Part II to all of Part III.

4. What Survived (Intermediate)

The plateau ended the paradigm, but it did not discard its ideas; almost every classical building block reappears inside the networks that replaced it. Convolution, the kernel filtering of Chapter 3, became the learnable convolutional layer of Chapter 19, and the first-layer filters that emerge from training look strikingly like the oriented edge detectors of Chapter 9 and the HOG bins of Section 16.3. Image pyramids (Chapter 4), which made every classical detector multi-scale, became feature pyramid networks. Non-maximum suppression (Section 16.3) survived literally unchanged as the final stage of most modern detectors. Hard negative mining became online hard example mining and focal loss. The part-based thinking of Section 16.5 resurfaced as deformable convolutions and deformable attention. Bag-of-words aggregation (Section 16.2) became NetVLAD and the learnable-query pooling of recent place-recognition models. Table 16.6.2 lays the inheritance out explicitly.

Table 16.6.2: Classical ideas and their learned descendants in Part III and Part IV.
Classical idea (this chapter / Part II) | Learned descendant | Where it returns
Convolution / oriented gradients (HOG) | Learnable conv layers; first-layer edge filters | Chapter 19
Image pyramids, multi-scale search | Feature pyramid networks | Chapter 23
Sliding window + NMS | Dense prediction heads + NMS (unchanged) | Chapter 23
Hard negative mining | Online hard example mining, focal loss | Chapter 23
DPM parts on springs | Deformable convolution / deformable attention | Chapter 22
Bag-of-words / VLAD aggregation | NetVLAD, learnable-query pooling | Chapter 25
SIFT/HOG hand-crafted descriptors | Learned representations; CLIP embeddings | Chapter 34

Read Table 16.6.2 as reassurance rather than obituary. Learning Part II was never wasted preparation for Part III; it was the vocabulary Part III is written in. Every deep architecture you are about to meet is, in part, a classical idea made learnable, and you will understand those architectures faster because you built their ancestors by hand.
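Non-maximum suppression is the clearest case of an idea that crossed over without modification, and it is short enough to write out. The sketch below is a minimal greedy NMS in NumPy, assuming boxes in (x1, y1, x2, y2) format with positive area and a conventional IoU threshold of 0.5; the function names and the three example boxes are illustrative, and production detectors normally call a library routine instead.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]          # candidate indices, best score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        if rest.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # survivors go to the next round
    return keep

boxes = np.array([[0, 0, 10, 10],     # strong detection
                  [1, 1, 11, 11],     # near-duplicate of the first (IoU = 81/119)
                  [20, 20, 30, 30]],  # separate object
                 dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]
```

Nothing in this routine depends on where the boxes and scores came from, which is why the same post-processing step can sit behind a HOG sliding window or a deep detection head.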

Fun Fact: The Convolution That Got a Promotion

Nothing in this chapter was fired; the good ideas were promoted. Convolution went from a fixed kernel you typed in by hand to a layer that tunes its own weights. The image pyramid traded its ladder for a feature pyramid. Non-maximum suppression kept its exact job title and never even updated its resume. The honest one-line summary of the whole deep-learning transition is therefore not "out with the old" but same crew, learnable weights, and a much taller building. Read Part III as your former colleagues coming back with better tools, not strangers taking their desks.

Practical Example: The Team That Migrated Without Throwing Anything Away

Who & situation: a logistics company ran a mature classical pipeline (HOG+SVM package detection feeding a Kalman tracker) on warehouse conveyor cameras and decided in 2018 to migrate to a deep detector for the accuracy gains of Figure 16.6.1.
Problem: management feared the migration meant discarding years of engineering, and the team feared losing the pipeline's predictability and its careful hard-negative tuning against conveyor reflections.
Decision: they migrated only the feature-and-classifier core to a deep detector, keeping the non-maximum suppression, the hard-negative mining discipline (now mining the deep detector's false positives), the image-pyramid logic (now a feature pyramid), and the entire downstream Kalman tracker from Chapter 15 unchanged.
Result: detection accuracy rose, in line with the deep curve of Figure 16.6.1, while every surrounding component (evaluation harness, NMS, tracking, alerting) kept working, and the migration took weeks rather than requiring a rebuild.
Lesson: the deep-learning transition replaced the representation, not the system; the classical scaffolding around the features is still load-bearing, which is exactly why this chapter teaches it before Part III.

Research Frontier: The Plateau Pattern, Repeating

The plateau-then-cliff dynamic of Figure 16.6.1 was not a one-time event; it is a pattern the field keeps living through, and the 2024-2026 frontier is its latest instance. Hand-designed deep architectures (carefully tuned CNN backbones) plateaued and were overtaken by scaled, more general transformers (Chapter 22), which were in turn fed by self-supervised pretraining that removed the human from labeling (DINOv2, the foundation models of Chapter 25). The same five causes recur one level up: hand-designed architectures could not learn their own structure, scaled less gracefully, and budgeted inductive biases by hand. Today's vision-language models and the generative data engines of Chapter 37 are the current "learned cliff," and a future textbook may well describe today's bespoke architectures the way this chapter describes HOG. The durable lesson of this chapter is therefore not about HOG or DPM specifically; it is about recognizing when a paradigm's ceiling is structural and when scale is the lever that breaks it.

Library Shortcut: Learned Features in Two Lines

The classical feature pipelines of this chapter were each dozens to hundreds of lines. A pretrained learned representation, the thing that beat them all, is two lines with timm: load a backbone and call it as a feature extractor. The same two lines give you features that transfer across tasks with no feature engineering, which is the practical face of the plateau's resolution.

# The representation that ended the era, in two lines: load a pretrained timm
# backbone with its classifier head removed and call it as a fixed feature
# extractor, producing general-purpose features that need no engineering.
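import timm, torch
backbone = timm.create_model("resnet18", pretrained=True, num_classes=0).eval()
# image_batch is assumed to be an already preprocessed (N, 3, 224, 224) float tensor.
with torch.no_grad():                  # inference only, so skip gradient tracking
    features = backbone(image_batch)   # (N, 512) general-purpose learned features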
The representation that ended the era, in two lines: a pretrained timm backbone used as a fixed feature extractor produces general-purpose features that typically outperform the hand-crafted pipelines in this chapter without any task-specific design.

Exercise 16.6.1: Which Cause Bites Which Method? (Conceptual)

For each of the five structural causes in Section 2, name the section earlier in this chapter where that cause is most visible, and explain in one or two sentences how the method there exhibits it. For example, which method most clearly shows "invariance budgeted by hand," and exactly which budget did its designers have to set? Conclude by stating which single cause you find most fundamental and defend the choice.

Exercise 16.6.2: Reproduce the Decisive Experiment (Coding)

On a small labeled dataset (for example, a few classes from CIFAR-10 or a Caltech-101 subset), run the Section 3 experiment: train the identical linear classifier on HOG features and on a pretrained backbone's pooled features, and report both accuracies. Then vary the training-set size from $10$ to $1000$ images per class and plot accuracy versus training size for both representations. Confirm the two scaling curves of Figure 16.6.1: which representation wins in the small-data regime, which wins as data grows, and roughly where do they cross?
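One fiddly part of the training-size sweep is drawing the same number of images from every class at each size. The helper below is one possible starting point, not a prescribed solution; `subsample_per_class` is a hypothetical name, and the toy label array only demonstrates the call.

```python
import numpy as np

def subsample_per_class(labels, n_per_class, rng):
    """Return sorted indices holding at most n_per_class examples of each class."""
    labels = np.asarray(labels)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)              # positions of this class
        take = min(n_per_class, idx.size)              # never ask for more than exist
        picked.append(rng.choice(idx, size=take, replace=False))
    return np.sort(np.concatenate(picked))

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 50)                      # toy labels: 3 classes, 50 each
for n in (10, 30, 50):
    idx = subsample_per_class(y_demo, n, rng)
    print(n, np.bincount(y_demo[idx]))                 # per-class counts in the subset
```

For each size, index both feature matrices with the same `idx` before fitting the linear classifier, so that the HOG and CNN curves are trained on exactly the same images and differ only in representation.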

Exercise 16.6.3: Trace an Idea Across the Book (Analysis)

Pick one row of Table 16.6.2 (for example, non-maximum suppression, or part-based deformation) and write a short essay tracing that idea from its classical origin in this chapter, through its learned descendant in Part III, to any further transformation in Part IV, following the cross-reference links. At each stage, state precisely what stayed the same and what became learnable. Use your trace to argue, in a closing paragraph, whether deep learning "replaced" or "absorbed" classical computer vision.