Part II: Classical Computer Vision
Chapter 10: Keypoints, Descriptors & Matching

Keypoints, Descriptors & Matching

"I have spent my whole career looking for the same point in two photographs. The trick is to be memorable enough that the second photograph recognizes you."

A Quietly Distinctive Keypoint
Big Picture

This chapter solves the correspondence problem: find the same physical point in two different photographs, and most of geometric vision follows as a consequence. Panorama stitching, camera calibration, stereo depth, structure from motion, visual SLAM, and image retrieval all begin with the same four-stage pipeline built here: detect repeatable points, describe their neighborhoods as vectors, match the vectors between images, and verify the matches with a geometric model. The pipeline is classical, but it is no museum piece: it runs today inside Chapter 14's COLMAP reconstructions that feed the neural radiance fields of Chapter 27, and its design choices echo through the learned representations of Chapter 25.

Chapter Overview

Chapter 9 extracted structure from single images: edges, lines, and curves. This chapter asks a question that involves two images at once. Given a photograph of a building taken from the street and another taken a few steps to the left, can a program point at a window corner in the first image and find the very same window corner in the second? A human does this instantly. For a machine, it is a genuinely hard problem: the second image differs in viewpoint, scale, rotation, lighting, and sensor noise, and the search space is every pixel. Solving it unlocks an astonishing amount of geometry, because two views of the same point constrain where the cameras were, which is the seed of Chapter 13's stereo and Chapter 14's structure from motion.

The classical answer decomposes the problem into stages, and the chapter walks through them in order. Section 10.1 asks which points are worth finding at all and answers with corners: locations where the image gradient varies in two directions, detected by the Harris and Shi-Tomasi operators and, at production speed, by FAST. Section 10.2 confronts the fact that a corner detector tuned to one zoom level fails at another, and builds the scale-space machinery (Gaussian pyramids, the Difference of Gaussians, canonical orientation) that makes detection survive zoom and rotation. Section 10.3 assembles these parts into SIFT, the 128-dimensional gradient-histogram descriptor that dominated computer vision for a decade and still ships inside reconstruction pipelines today.

The second half of the chapter is about engineering and trust. Section 10.4 trades SIFT's floating-point precision for binary descriptors (BRIEF, ORB, AKAZE) that match hundreds of times faster and fit real-time budgets on phones and robots. Section 10.5 turns piles of descriptors into actual correspondences, introducing brute-force and approximate nearest-neighbor search and the deceptively simple ratio test that rejects most false matches before they cause damage. Section 10.6 deals with the false matches that survive anyway: RANSAC, the hypothesize-and-verify algorithm that fits geometric models to contaminated data and, in doing so, separates inliers from outliers. RANSAC is so generally useful that it long ago escaped this chapter's topic and became a standard tool across all of engineering.

One theme deserves flagging before you begin. Every design decision in this chapter is a trade between invariance (ignoring changes you do not care about) and distinctiveness (preserving differences you do). A descriptor invariant to everything describes nothing; a descriptor sensitive to everything matches nothing. SIFT's gradient histograms, BRIEF's intensity comparisons, and the ratio test's relative threshold are all answers to this one tension. The same tension returns, with weights learned from data instead of designed by hand, when Chapter 25 trains networks to produce embeddings, and when Chapter 34's CLIP vectors act as universal descriptors for entire images. Hand-crafted features lost the recognition war to deep learning, but the vocabulary they established (detect, describe, match, verify) remains the grammar of geometric vision.

Prerequisites

This chapter leans directly on image gradients and the Sobel operator from Chapter 3: Spatial Filtering & Convolution, and on Gaussian pyramids and the Difference of Gaussians from Chapter 4: The Frequency Domain & Multi-Scale Analysis. Section 10.6 fits homographies, so the geometric transformations of Chapter 5: Geometric Transformations & Image Warping should be fresh. The discussion of robust fitting continues a thread started in Chapter 9: Edges, Lines & Curves, whose least-squares and Hough machinery is the contrast against which RANSAC makes sense. Comfort with NumPy arrays and basic linear algebra (eigenvalues of a 2x2 matrix) is assumed throughout.

Chapter Roadmap

What's Next?

Keypoints answer "where is the same point?" but stay silent about regions: which pixels belong together as one object or surface? Chapter 11: Classical Segmentation & Grouping takes up that question with clustering, region growing, watersheds, graph cuts, and superpixels: the classical toolkit for carving an image into meaningful pieces. The two chapters are complementary halves of classical scene understanding: this one finds sparse, precise anchors for geometry, the next finds dense, coherent groupings for content. Both threads converge later, when Chapter 12 begins the geometric arc that consumes this chapter's correspondences.

Bibliography & Further Reading

Foundational Papers

Harris, C. and Stephens, M. "A Combined Corner and Edge Detector." Alvey Vision Conference (1988). bmva.org (PDF)
The four-page paper behind Section 10.1: the structure tensor, the eigenvalue picture of flat/edge/corner, and the famous response formula with its mysterious constant k.
Lowe, D. "Distinctive Image Features from Scale-Invariant Keypoints." IJCV (2004). doi:10.1023/B:VISI.0000029664.99615.94
The SIFT paper: scale-space detection, orientation assignment, the 128-dimensional descriptor, and the ratio test, all in one document. Sections 10.2, 10.3, and 10.5 are essentially a guided tour of it.
Lindeberg, T. "Feature Detection with Automatic Scale Selection." IJCV (1998). doi:10.1023/A:1008045108935
The theoretical foundation of Section 10.2: scale-normalized derivatives and the principle that extrema over scale select an object's intrinsic scale.
Rosten, E. and Drummond, T. "Machine Learning for High-Speed Corner Detection." ECCV (2006). doi:10.1007/11744023_34
FAST: the segment test, and the decision-tree compilation that made corner detection nearly free. The reason Section 10.1's third detector runs at video rate on a microcontroller.
Calonder, M., Lepetit, V., Strecha, C., and Fua, P. "BRIEF: Binary Robust Independent Elementary Features." ECCV (2010). doi:10.1007/978-3-642-15561-1_56
The paper that replaced 128 floats with 256 bits and made Section 10.4 possible; remarkably readable, including the empirical comparison of sampling strategies.
Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. "ORB: An Efficient Alternative to SIFT or SURF." ICCV (2011). doi:10.1109/ICCV.2011.6126544
Oriented FAST plus rotated, de-correlated BRIEF: the descriptor that powers ORB-SLAM and most real-time matching a decade later. Section 10.4's centerpiece.
Fischler, M. and Bolles, R. "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography." Communications of the ACM (1981). doi:10.1145/358669.358692
RANSAC's birth certificate, written for camera pose estimation in cartography. Section 10.6 follows its hypothesize-and-verify logic almost line by line.

Recent Research (2023-2026)

Lindenberger, P., Sarlin, P.-E., and Pollefeys, M. "LightGlue: Local Feature Matching at Light Speed." ICCV (2023). arXiv:2306.13643
The learned matcher that replaced Section 10.5's nearest-neighbor search in many modern pipelines: adaptive depth and width make it fast enough for real-time use.
Jiang, H. et al. "OmniGlue: Generalizable Feature Matching with Foundation Model Guidance." CVPR (2024). arXiv:2405.12979
Feature matching guided by a vision foundation model, aimed at generalizing across image domains that break classical and learned matchers alike.
Potje, G. et al. "XFeat: Accelerated Features for Lightweight Image Matching." CVPR (2024). arXiv:2404.19174
A learned detector-descriptor designed for the same CPU-and-embedded niche ORB occupies, showing the efficiency race of Section 10.4 continues in the deep era.
Leroy, V., Cabon, Y., and Revaud, J. "Grounding Image Matching in 3D with MASt3R." ECCV (2024). arXiv:2406.09756
Recasts two-view matching as direct 3D reconstruction with a transformer, challenging the detect-describe-match-verify decomposition this whole chapter teaches.

Books

Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book
Chapter 7 covers feature detection, description, and matching with full mathematical depth and exhaustive references; free online.

Tools & Libraries

OpenCV. "Feature Detection and Description" tutorial series. docs.opencv.org
The official OpenCV 4.x walkthroughs of Harris, Shi-Tomasi, FAST, SIFT, ORB, matching, and homography estimation used throughout this chapter's code.
Kornia. kornia.feature documentation. kornia.readthedocs.io
Differentiable PyTorch implementations of Harris, DoG, SIFT, and modern learned features (DISK, LoFTR adapters), bridging this chapter to Part III.
Schönberger, J. L. COLMAP: Structure-from-Motion and Multi-View Stereo. colmap.github.io
The reference reconstruction pipeline: SIFT keypoints, ratio-test matching, and RANSAC geometric verification at industrial scale. Where this chapter's pipeline lives in production.