Part II: Classical Computer Vision
Chapter 16: Classical Recognition Pipelines

Classical Recognition Pipelines

"I spent twenty years learning to see: I memorized templates, counted gradients, boosted weak opinions into strong ones, and held my parts together with springs. Then a neural network looked at a million photos and politely took my job."

A Classical Recognition Pipeline, Updating Its Resume
Big Picture

For two decades, recognition meant a pipeline: hand-crafted features feeding a shallow classifier, and the story of how that pipeline was built, perfected, and finally outgrown is the single best explanation of why deep learning works. This chapter walks the full arc. Template matching shows the naive starting point and its brittleness. Bag of visual words turns the local features of Chapter 10 into whole-image classification. HOG plus SVM and Viola-Jones show two opposite engineering philosophies producing two landmark detectors. Deformable part models push the paradigm to its expressive ceiling. And the closing section reads the scoreboard: where the numbers flattened, why no amount of cleverness un-flattened them, and exactly which classical ideas survived inside the networks of Chapter 19 and beyond.

Chapter Overview

Everything in Part II so far has been about geometry and correspondence: where edges are, which keypoints match, how cameras relate, what moved between frames. This chapter asks the question all of that was secretly preparing for: what is in the image? Is this window a face, a pedestrian, a bicycle? Recognition is the problem that made computer vision famous, and from roughly 1990 to 2012 the field answered it with one architectural pattern repeated at increasing sophistication: extract features you designed by hand, then feed them to a classifier you trained from data. The features carried the visual insight; the classifier carried the statistics; a human carried the responsibility for deciding what mattered.

The chapter climbs the ladder of that paradigm rung by rung. Section 16.1 starts at the bottom with template matching: comparing pixels directly, no abstraction at all, and discovering within a page why pixels are the wrong currency for recognition. Section 16.2 makes the first leap of abstraction: the bag of visual words quantizes the local descriptors of Chapter 10 into a vocabulary and describes a whole image as a histogram, a trick borrowed wholesale from text retrieval, then patches its blindness to layout with spatial pyramids. Section 16.3 visits the workhorse of the detection era: histograms of oriented gradients feeding a linear support vector machine, the Dalal-Triggs pedestrian detector that defined sliding-window detection. Section 16.4 tells the speed story instead: Viola and Jones made face detection run in real time on 2001 hardware with three inventions (integral images, boosted feature selection, attentional cascades) that are each worth knowing on their own. Section 16.5 reaches the paradigm's high-water mark with deformable part models, which describe objects as parts connected by springs and won the PASCAL VOC detection challenge for half a decade.

Then comes the reckoning. Section 16.6 examines why the whole edifice plateaued: features that could not learn from their mistakes, pipeline stages optimized separately, invariance budgeted by hand, and performance curves that flattened just as datasets and compute began their exponential climb. The section runs the decisive experiment in a few dozen lines of code (same classifier, hand-crafted versus learned features) and inventories what survived: convolution, pyramids, non-maximum suppression, hard negative mining, and the part-based thinking that resurfaced inside deformable convolutions and attention. The plateau was not a failure of effort; it was a structural property of the paradigm, and seeing it clearly is the best preparation for Part III.

A reading note: this chapter is deliberately historical, but it is not a museum. Template matching still aligns circuit boards on every assembly line; cascades still run in low-power firmware; bag-of-words machinery still powers place recognition inside the SLAM systems of Chapter 14; and every design conversation about modern detectors in Chapter 23 uses vocabulary coined here. You are learning the ancestors because the descendants still speak their language.

If you carry one thing out of this chapter, carry its skeleton. Every recognizer in these six sections, and every deep detector in Chapter 23 after them, is the same five-step recipe with different parts swapped in. The schema below names those five steps so you can watch each section fill them, and so you can read Section 16.6's "what survived" inventory as a checklist of which steps the networks kept.

The Recognition Recipe: Represent, Score, Search, Suppress, Settle

Every detector in this chapter is the same five-step pipeline; only the contents of each step change. Represent the window as something better than raw pixels (a template, a word histogram, HOG, Haar sums, parts on springs). Score that representation with a classifier (correlation, an SVM, a boosted vote, a match-minus-spring sum). Search position and scale by sliding over an image pyramid. Suppress the redundant overlapping hits with non-maximum suppression. Settle the score honestly with the right metric (miss-rate-versus-FPPI, mean average precision). The five-word handle is represent, score, search, suppress, settle; the entire deep-learning transition of Section 16.6 changes only the first two steps and leaves the last three almost untouched.

Prerequisites

This chapter leans on the local features of Chapter 10: Keypoints, Descriptors & Matching (SIFT descriptors feed Section 16.2 directly) and on the gradient machinery of Chapter 9: Edges, Lines & Curves and Chapter 3: Spatial Filtering & Convolution, which Sections 16.3 and 16.5 reuse at every step. Histograms as image statistics come from Chapter 2: Point Operations, Histograms & Thresholding, and the image pyramids that make every detector multi-scale come from Chapter 4: The Frequency Domain & Multi-Scale Analysis. No machine learning background is assumed beyond comfort with the idea of training a classifier from labeled examples; the chapter introduces support vector machines and boosting at working depth as it needs them.

Chapter Roadmap

What's Next?

This chapter closes the conceptual story of Part II; Chapter 17: Tools of the Trade: The Classical CV Stack closes its practical one. There you will find the consolidated map of the classical toolbox: the OpenCV modules behind everything from Chapter 9 through this chapter, the reconstruction tooling around COLMAP, and the datasets and benchmarks that measured this era. After that, Part III begins with Chapter 18, and the bridge built in Section 16.6 carries you across: the same recognition problems, attacked by features that learn.

Bibliography & Further Reading

Foundational Papers

Viola, P. and Jones, M. "Rapid Object Detection using a Boosted Cascade of Simple Features." CVPR (2001). doi:10.1109/CVPR.2001.990517

The real-time face detector: integral images, AdaBoost feature selection, and the attentional cascade, all in one paper. Section 16.4 is a guided tour of it.

Dalal, N. and Triggs, B. "Histograms of Oriented Gradients for Human Detection." CVPR (2005). doi:10.1109/CVPR.2005.177

The pedestrian detector that defined a decade of sliding-window detection, famous for its exhaustive ablations. Section 16.3 rebuilds it choice by choice.

Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. "Object Detection with Discriminatively Trained Part-Based Models." IEEE TPAMI 32(9) (2010). doi:10.1109/TPAMI.2009.167

The definitive deformable part models paper: latent SVM, mixtures, and the spring-cost inference of Section 16.5. The peak of hand-crafted detection.

Lazebnik, S., Schmid, C., and Ponce, J. "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories." CVPR (2006). doi:10.1109/CVPR.2006.68

The fix for bag-of-words blindness to layout: histograms over a pyramid of spatial grids. Section 16.2 implements its weighting scheme directly.

Sivic, J. and Zisserman, A. "Video Google: A Text Retrieval Approach to Object Matching in Videos." ICCV (2003). doi:10.1109/ICCV.2003.1238663

The paper that imported visual words, inverted files, and TF-IDF from text retrieval into vision; the origin of Section 16.2's whole vocabulary.

Fischler, M. and Elschlager, R. "The Representation and Matching of Pictorial Structures." IEEE Transactions on Computers C-22(1) (1973). doi:10.1109/T-C.1973.223602

Parts connected by springs, proposed thirty-five years before DPM made it win benchmarks. Section 16.5's opening idea, from the field's earliest days.

Everingham, M., Van Gool, L., Williams, C., Winn, J., and Zisserman, A. "The PASCAL Visual Object Classes (VOC) Challenge." IJCV 88 (2010). doi:10.1007/s11263-009-0275-4

The benchmark that kept score for the entire era this chapter covers; its mean average precision (mAP) metric and evaluation protocol are still the template for detection benchmarks.

Dollár, P., Wojek, C., Schiele, B., and Perona, P. "Pedestrian Detection: An Evaluation of the State of the Art." IEEE TPAMI 34(4) (2012). doi:10.1109/TPAMI.2011.155

The Caltech Pedestrian benchmark study: sixteen detectors, one protocol, and the miss-rate-versus-FPPI methodology Section 16.3 explains.

Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS (2012). papers.nips.cc

AlexNet: the result that ended the era. Section 16.6 reads its ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012 margin (16.4 percent versus 26.2 percent top-5 error) as the plateau's closing argument.

Recent Research (2024-2026)

Girshick, R., Iandola, F., Darrell, T., and Malik, J. "Deformable Part Models are Convolutional Neural Networks." CVPR (2015). arXiv:1409.5403

The formal bridge: DPM inference rewritten as a CNN with distance-transform pooling layers. The hinge between Sections 16.5 and 16.6.

Jiang, Q., Li, F., Zeng, Z., Ren, T., Liu, S., and Zhang, L. "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy." ECCV (2024). arXiv:2403.14610

Detection prompted by visual examples: template matching reborn on foundation-model features, discussed in Section 16.1's research frontier.

Xiong, Y., Li, Z., Chen, Y., Wang, F., et al. "Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications." CVPR (2024). arXiv:2401.06197

DCNv4: the modern descendant of DPM's deformation idea, now a fast learned operator inside backbones. Section 16.5's frontier callout traces the lineage.

Ali-bey, A., Chaib-draa, B., and Giguère, P. "BoQ: A Place is Worth a Bag of Learnable Queries." CVPR (2024). arXiv:2405.07364

Bag-of-words thinking, fully learned: a 2024 place-recognition model whose aggregation layer is a direct intellectual descendant of Section 16.2.

Books

Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book

Chapter 6 covers recognition with classical features and classifiers in depth, with the historical context this chapter compresses; free online.

Tools & Libraries

OpenCV. "Object Detection (objdetect) module" documentation. docs.opencv.org

The living museum: CascadeClassifier, HOGDescriptor, and the modern FaceDetectorYN replacement, all called from this chapter's code.