"I have spent my whole career looking for the same point in two photographs. The trick is to be memorable enough that the second photograph recognizes you."
A Quietly Distinctive Keypoint
This chapter solves the correspondence problem: find the same physical point in two different photographs, and most of geometric vision follows as a consequence. Panorama stitching, camera calibration, stereo depth, structure from motion, visual SLAM, and image retrieval all begin with the same four-stage pipeline built here: detect repeatable points, describe their neighborhoods as vectors, match the vectors between images, and verify the matches with a geometric model. The pipeline is classical, but it is no museum piece: it runs today inside Chapter 14's COLMAP reconstructions that feed the neural radiance fields of Chapter 27, and its design choices echo through the learned representations of Chapter 25.
Chapter Overview
Chapter 9 extracted structure from single images: edges, lines, and curves. This chapter asks a question that involves two images at once. Given a photograph of a building taken from the street and another taken a few steps to the left, can a program point at a window corner in the first image and find the very same window corner in the second? A human does this instantly. For a machine, it is a genuinely hard problem: the second image differs in viewpoint, scale, rotation, lighting, and sensor noise, and the search space is every pixel. Solving it unlocks an astonishing amount of geometry, because two views of the same point constrain where the cameras were, which is the seed of Chapter 13's stereo and Chapter 14's structure from motion.
The classical answer decomposes the problem into stages, and the chapter walks through them in order. Section 10.1 asks which points are worth finding at all and answers with corners: locations where the image gradient varies in two directions, detected by the Harris and Shi-Tomasi operators and, at production speed, by FAST. Section 10.2 confronts the fact that a corner detector tuned to one zoom level fails at another, and builds the scale-space machinery (Gaussian pyramids, the Difference of Gaussians, canonical orientation) that makes detection survive zoom and rotation. Section 10.3 assembles these parts into SIFT, the 128-dimensional gradient-histogram descriptor that dominated computer vision for a decade and still ships inside reconstruction pipelines today.
The second half of the chapter is about engineering and trust. Section 10.4 trades SIFT's floating-point precision for binary descriptors (BRIEF, ORB, AKAZE) that match hundreds of times faster and fit real-time budgets on phones and robots. Section 10.5 turns piles of descriptors into actual correspondences, introducing brute-force and approximate nearest-neighbor search and the deceptively simple ratio test that rejects most false matches before they cause damage. Section 10.6 deals with the false matches that survive anyway: RANSAC, the hypothesize-and-verify algorithm that fits geometric models to contaminated data and, in doing so, separates inliers from outliers. RANSAC is so generally useful that it long ago escaped this chapter's topic and became a standard tool across all of engineering.
One theme deserves flagging before you begin. Every design decision in this chapter is a trade between invariance (ignoring changes you do not care about) and distinctiveness (preserving differences you do). A descriptor invariant to everything describes nothing; a descriptor sensitive to everything matches nothing. SIFT's gradient histograms, BRIEF's intensity comparisons, and the ratio test's relative threshold are all answers to this one tension. The same tension returns, with weights learned from data instead of designed by hand, when Chapter 25 trains networks to produce embeddings, and when Chapter 34's CLIP vectors act as universal descriptors for entire images. Hand-crafted features lost the recognition war to deep learning, but the vocabulary they established (detect, describe, match, verify) remains the grammar of geometric vision.
Prerequisites
This chapter leans directly on image gradients and the Sobel operator from Chapter 3: Spatial Filtering & Convolution, and on Gaussian pyramids and the Difference of Gaussians from Chapter 4: The Frequency Domain & Multi-Scale Analysis. Section 10.6 fits homographies, so the geometric transformations of Chapter 5: Geometric Transformations & Image Warping should be fresh. The discussion of robust fitting continues a thread started in Chapter 9: Edges, Lines & Curves, whose least-squares and Hough machinery is the contrast against which RANSAC makes sense. Comfort with NumPy arrays and basic linear algebra (eigenvalues of a 2x2 matrix) is assumed throughout.
Chapter Roadmap
- 10.1 Corner Detection: Harris, Shi-Tomasi & FAST Why corners are the findable points: the structure tensor, eigenvalue analysis, the Harris response, Shi-Tomasi's refinement, and the FAST segment test that runs at video rate.
- 10.2 Scale & Rotation Invariance: Scale Space Making detection survive zoom and rotation: Gaussian scale space, octaves, Difference-of-Gaussians extrema, sub-pixel refinement, and canonical orientation assignment.
- 10.3 SIFT: The Descriptor That Defined a Decade The 128-dimensional gradient-histogram descriptor: its 4x4x8 anatomy, the normalization tricks behind its illumination robustness, RootSIFT, and its legacy.
- 10.4 Fast Binary Alternatives: BRIEF, ORB & AKAZE Descriptors as bit strings: BRIEF's intensity comparisons, Hamming distance at hardware speed, ORB's oriented and de-correlated upgrade, and AKAZE's nonlinear scale space.
- 10.5 Descriptor Matching & the Ratio Test From descriptor piles to correspondences: brute-force and FLANN nearest-neighbor search, cross-checking, and Lowe's ratio test that compares best to second-best.
- 10.6 RANSAC & Robust Model Fitting Trusting matches by geometric consensus: why least squares breaks, the hypothesize-and-verify loop, fitting homographies to contaminated matches, and modern MAGSAC++ variants.
What's Next?
Keypoints answer "where is the same point?" but stay silent about regions: which pixels belong together as one object or surface? Chapter 11: Classical Segmentation & Grouping takes up that question with clustering, region growing, watersheds, graph cuts, and superpixels: the classical toolkit for carving an image into meaningful pieces. The two chapters are complementary halves of classical scene understanding: this one finds sparse, precise anchors for geometry, the next finds dense, coherent groupings for content. Both threads converge later, when Chapter 12 begins the geometric arc that consumes this chapter's correspondences.
Bibliography & Further Reading
Foundational Papers
Recent Research (2023-2026)
Books
Tools & Libraries
kornia.feature documentation. kornia.readthedocs.io