Chapter 10: Keypoints, Descriptors & Matching

"I have spent my whole career looking for the same point in two photographs. The trick is to be memorable enough that the second photograph recognizes you."
A Quietly Distinctive Keypoint

Big Picture

This chapter solves the correspondence problem: find the same physical point in two different photographs, and most of geometric vision follows as a consequence. Panorama stitching, camera calibration, stereo depth, structure from motion, visual SLAM, and image retrieval all begin with the same four-stage pipeline built here: detect repeatable points, describe their neighborhoods as vectors, match the vectors between images, and verify the matches with a geometric model. The pipeline is classical, but it is no museum piece: it runs today inside Chapter 14's COLMAP reconstructions that feed the neural radiance fields of Chapter 27, and its design choices echo through the learned representations of Chapter 25.

Remember Four Words: Detect, Describe, Match, Verify

If you forget everything else in this chapter, keep these four words in order. Detect (Section 10.1 to 10.2) finds points you can locate again; Describe (Sections 10.3 to 10.4) turns each neighborhood into a comparable vector; Match (Section 10.5) pairs the vectors across images; Verify (Section 10.6) keeps only the pairs a single geometry endorses. Each stage exists to cover the previous one's blind spot, and the same four words name the grammar of every geometric-vision system in the rest of Part II. The sections that follow are each one word of this sentence, expanded.

Chapter Overview

Chapter 9 extracted structure from single images: edges, lines, and curves. This chapter asks a question that involves two images at once. Given a photograph of a building taken from the street and another taken a few steps to the left, can a program point at a window corner in the first image and find the very same window corner in the second? A human does this instantly. For a machine, it is a genuinely hard problem: the second image differs in viewpoint, scale, rotation, lighting, and sensor noise, and the search space is every pixel. Solving it unlocks an astonishing amount of geometry, because two views of the same point constrain where the cameras were, which is the seed of Chapter 13's stereo and Chapter 14's structure from motion.

The classical answer decomposes the problem into stages, and the chapter walks through them in order. Section 10.1 asks which points are worth finding at all and answers with corners: locations where the image gradient varies in two directions, detected by the Harris and Shi-Tomasi operators and, at production speed, by FAST. Section 10.2 confronts the fact that a corner detector tuned to one zoom level fails at another, and builds the scale-space machinery (Gaussian pyramids, the Difference of Gaussians, canonical orientation) that makes detection survive zoom and rotation. Section 10.3 assembles these parts into SIFT, the 128-dimensional gradient-histogram descriptor that dominated computer vision for a decade and still ships inside reconstruction pipelines today.

The second half of the chapter is about engineering and trust. Section 10.4 trades SIFT's floating-point precision for binary descriptors (BRIEF, ORB, AKAZE) that match hundreds of times faster and fit real-time budgets on phones and robots. Section 10.5 turns piles of descriptors into actual correspondences, introducing brute-force and approximate nearest-neighbor search and the deceptively simple ratio test that rejects most false matches before they cause damage. Section 10.6 deals with the false matches that survive anyway: RANSAC, the hypothesize-and-verify algorithm that fits geometric models to contaminated data and, in doing so, separates inliers from outliers. RANSAC is so generally useful that it long ago escaped this chapter's topic and became a standard tool across all of engineering.

One theme deserves flagging before you begin. Every design decision in this chapter is a trade between invariance (ignoring changes you do not care about) and distinctiveness (preserving differences you do). A descriptor invariant to everything describes nothing; a descriptor sensitive to everything matches nothing. SIFT's gradient histograms, BRIEF's intensity comparisons, and the ratio test's relative threshold are all answers to this one tension. The same tension returns, with weights learned from data instead of designed by hand, when Chapter 25 trains networks to produce embeddings, and when Chapter 34's CLIP vectors act as universal descriptors for entire images. Hand-crafted features lost the recognition war to deep learning, but the vocabulary they established (detect, describe, match, verify) remains the grammar of geometric vision.

Prerequisites

This chapter leans directly on image gradients and the Sobel operator from Chapter 3: Spatial Filtering & Convolution, and on Gaussian pyramids and the Difference of Gaussians from Chapter 4: The Frequency Domain & Multi-Scale Analysis. Section 10.6 fits homographies, so the geometric transformations of Chapter 5: Geometric Transformations & Image Warping should be fresh. The discussion of robust fitting continues a thread started in Chapter 9: Edges, Lines & Curves, whose least-squares and Hough machinery is the contrast against which RANSAC makes sense. Comfort with NumPy arrays and basic linear algebra (eigenvalues of a 2x2 matrix) is assumed throughout.

Chapter Roadmap

10.1 Corner Detection: Harris, Shi-Tomasi & FAST Why corners are the findable points: the structure tensor, eigenvalue analysis, the Harris response, Shi-Tomasi's refinement, and the FAST segment test that runs at video rate.
10.2 Scale & Rotation Invariance: Scale Space Making detection survive zoom and rotation: Gaussian scale space, octaves, Difference-of-Gaussians extrema, sub-pixel refinement, and canonical orientation assignment.
10.3 SIFT: The Descriptor That Defined a Decade The 128-dimensional gradient-histogram descriptor: its 4x4x8 anatomy, the normalization tricks behind its illumination robustness, RootSIFT, and its legacy.
10.4 Fast Binary Alternatives: BRIEF, ORB & AKAZE Descriptors as bit strings: BRIEF's intensity comparisons, Hamming distance at hardware speed, ORB's oriented and de-correlated upgrade, and AKAZE's nonlinear scale space.
10.5 Descriptor Matching & the Ratio Test From descriptor piles to correspondences: brute-force and FLANN nearest-neighbor search, cross-checking, and Lowe's ratio test that compares best to second-best.
10.6 RANSAC & Robust Model Fitting Trusting matches by geometric consensus: why least squares breaks, the hypothesize-and-verify loop, fitting homographies to contaminated matches, and modern MAGSAC++ variants.

What's Next?

Keypoints answer "where is the same point?" but stay silent about regions: which pixels belong together as one object or surface? Chapter 11: Classical Segmentation & Grouping takes up that question with clustering, region growing, watersheds, graph cuts, and superpixels: the classical toolkit for carving an image into meaningful pieces. The two chapters are complementary halves of classical scene understanding: this one finds sparse, precise anchors for geometry, the next finds dense, coherent groupings for content. Both threads converge later, when Chapter 12 begins the geometric arc that consumes this chapter's correspondences. Before moving on, assemble all four verbs into one runnable tool in the Hands-On Lab below, where detect, describe, match, and verify combine into a two-photo panorama stitcher.

Hands-On Lab: Build a Two-Photo Panorama Stitcher

Duration: about 60 to 90 minutes Difficulty: Intermediate

Objective

Assemble the four verbs of this chapter (detect, describe, match, verify) into one runnable program: a panorama stitcher that takes two overlapping photographs, finds keypoints in each, matches their descriptors, filters the matches with the ratio test, verifies them with a RANSAC homography, and warps one image onto the other to produce a single wide composite. The script synthesizes its own overlapping pair by cropping two windows from any image, so it always produces a panorama even without a curated dataset.

What You'll Practice

Detecting and describing keypoints with SIFT, the gradient-histogram descriptor of Section 10.3.
Matching descriptors with brute-force k-nearest-neighbor search and Lowe's ratio test from Section 10.5.
Estimating a homography from contaminated matches with RANSAC and reading its inlier mask, the robust fitting of Section 10.6.
Gating on the inlier count so the tool refuses pairs that do not actually overlap.
Warping and compositing two views into one panorama, the geometric payoff that the full detect-describe-match-verify pipeline unlocks.

Setup

Two libraries and no curated dataset required; the script splits any single image into an overlapping left and right pair if you do not supply two of your own. Install with:

pip install opencv-python numpy

To stitch your own photos, drop two overlapping shots named left.jpg and right.jpg beside the script; the loader falls back to splitting a built-in test image when those files are absent.

Steps

Step 1: Load an overlapping pair, or synthesize one

Build a loader that returns two overlapping color images. When left.jpg and right.jpg are present it uses them; otherwise it crops two horizontally shifted windows from OpenCV's bundled test image so the two crops share a middle strip to match on.

import cv2
import numpy as np

def load_pair():
    """Return (left, right) overlapping BGR images."""
    a = cv2.imread("left.jpg")
    b = cv2.imread("right.jpg")
    if a is not None and b is not None:
        return a, b
    # Fallback: crop two overlapping windows from a built-in sample.
    full = cv2.imread(cv2.samples.findFile("lena.jpg"))
    if full is None:                                 # last resort: synthetic texture
        full = np.random.default_rng(0).integers(0, 256, (512, 512, 3), np.uint8)
    h, w = full.shape[:2]
    # TODO: return two crops that share an overlap. Take the left crop as
    # columns 0 to int(0.65*w) and the right crop as columns int(0.35*w) to w,
    # so the two windows share the middle 30 percent of the image.
    ...

Hint

return full[:, :int(0.65*w)].copy(), full[:, int(0.35*w):].copy(). The shared middle band is what the matcher latches onto; widen the overlap if too few inliers survive Step 4.

Step 2: Detect and describe keypoints in each image

Run SIFT on the grayscale version of each image to obtain keypoints and their 128-dimensional descriptors. SIFT detection survives the moderate scale and viewpoint change between two handheld shots, which is exactly why Section 10.3 built it.

def detect_describe(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nfeatures=4000)
    # TODO: call sift.detectAndCompute(gray, None) and return the
    # (keypoints, descriptors) tuple it produces.
    ...

Hint

return sift.detectAndCompute(gray, None). If you hit an attribute error, your OpenCV is old: pip install --upgrade opencv-python brings SIFT into the main package (it left the contrib-only days after the patent expired in 2020).

Step 3: Match descriptors and apply the ratio test

Use a brute-force L2 matcher to find the two nearest neighbors of every left descriptor among the right descriptors, then keep a match only when the best neighbor is clearly closer than the second-best. This is Lowe's ratio test from Section 10.5, the cheapest high-value filter in the pipeline.

def ratio_match(d1, d2, ratio=0.75):
    bf = cv2.BFMatcher(cv2.NORM_L2)
    pairs = bf.knnMatch(d1, d2, k=2)                 # best two neighbors each
    # TODO: keep match m from each (m, n) pair only when
    # m.distance < ratio * n.distance. Return the list of survivors.
    ...

Hint

return [m for m, n in pairs if m.distance < ratio * n.distance]. Tightening ratio toward 0.6 yields fewer but cleaner matches; loosening toward 0.8 keeps more candidates for RANSAC to sort out in Step 4.

Step 4: Verify with a RANSAC homography

Collect the matched point coordinates and fit a homography with cv2.findHomography under the cv2.RANSAC flag. The returned inlier mask is half the product: it labels each ratio-test survivor as geometrically consistent or not, the verification step of Section 10.6. Gate on the inlier count so a non-overlapping pair is refused rather than warped into garbage.

def estimate_homography(k1, k2, good, min_inliers=20):
    if len(good) < 4:                                # need 4 pairs for a homography
        return None, None
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # TODO: call cv2.findHomography(src, dst, cv2.RANSAC,
    # ransacReprojThreshold=3.0). If the inlier mask sums to fewer than
    # min_inliers, return (None, None) to refuse the pair; else return (H, mask).
    ...

Hint

H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0); then if H is None or int(mask.sum()) < min_inliers: return None, None else return H, mask. The gate is the lesson of Section 10.6's orthomosaic story: a robust estimator that can say "no" beats one that always returns a matrix.

Step 5: Warp and composite into one panorama

Map the right image into the left image's coordinate frame with the homography, allocate a canvas wide enough for both, and paste the left image on top of the warped right. The seam will be visible; that is fine, the point is that the geometry lines up.

def stitch(left, right, H):
    h1, w1 = left.shape[:2]
    h2, w2 = right.shape[:2]
    canvas_w = w1 + w2                               # room for both side by side
    warped = cv2.warpPerspective(right, H, (canvas_w, max(h1, h2)))
    # TODO: copy `left` into the top-left region of `warped`
    # (rows 0:h1, cols 0:w1), then return the composited canvas.
    ...

Hint

warped[0:h1, 0:w1] = left then return warped. Pasting the un-warped left image last lets it overwrite the warped right in the overlap region, hiding the empty border the warp introduces.

Step 6: Run the full pipeline and report the attrition

Chain the five stages and print the match attrition at each step, the same shrinking funnel Section 10.6 traced: raw keypoints, ratio-test survivors, geometric inliers. Write the panorama as the artifact you keep.

left, right = load_pair()
(k1, d1) = detect_describe(left)
(k2, d2) = detect_describe(right)
good = ratio_match(d1, d2)
H, mask = estimate_homography(k1, k2, good)

if H is None:
    print(f"Refused: only {len(good)} ratio survivors, too few inliers to trust.")
else:
    n_in = int(mask.sum())
    print(f"keypoints {len(k1)}/{len(k2)} -> ratio {len(good)} -> inliers {n_in}")
    pano = stitch(left, right, H)
    cv2.imwrite("panorama.jpg", pano)
    print("wrote panorama.jpg")

Hint

If the stitch looks torn, your overlap is too small or the wrong image is being warped. Confirm the inlier count is comfortably above the gate (a healthy overlapping pair gives well over a hundred), and remember the homography maps right into left's frame, matching the src/dst order in Step 4.

Expected Output

One image file, panorama.jpg, wider than either input, with the two views aligned across their shared region so straight edges (a window frame, a horizon, the bookshelf in the test image) continue unbroken across the seam. The console prints the attrition funnel, for example keypoints 3200/3150 -> ratio 540 -> inliers 410, the chapter's whole story in one line: thousands of keypoints, hundreds of ratio survivors, a geometrically consistent core. If you feed it two non-overlapping photos, it prints the Refused line and writes nothing, which is the correct behavior, not a bug.

Stretch Goals

Swap SIFT for ORB (cv2.ORB_create) and the matcher norm to cv2.NORM_HAMMING, the binary-descriptor path of Section 10.4; time both and compare inlier counts to feel the speed-versus-distinctiveness trade.
Replace cv2.RANSAC with cv2.USAC_MAGSAC in Step 4 and stitch the same pair both ways; report how the inlier count and seam alignment change, exercising the modern estimator family of Section 10.6.
Blend the seam instead of overwriting it: feather the overlap with a linear alpha ramp so the join is invisible, the first step toward the multi-band blending real panorama tools use.

Library Shortcut: cv2.Stitcher Does the Whole Lab in Three Lines

The pipeline above is roughly seventy lines and exposes every stage on purpose. OpenCV bundles the same detect-describe-match-verify-warp chain, plus exposure compensation and multi-band blending the lab skips, behind one class: stitcher = cv2.Stitcher_create() then status, pano = stitcher.stitch([left, right]) and if status == cv2.Stitcher_OK: cv2.imwrite("panorama.jpg", pano). That is a 70-to-3 reduction, and the result has seamless blending the from-scratch version lacks. Build it once by hand to understand what the three lines hide; reach for the library every time after.

Complete Solution

import cv2
import numpy as np

def load_pair():
    a = cv2.imread("left.jpg")
    b = cv2.imread("right.jpg")
    if a is not None and b is not None:
        return a, b
    full = cv2.imread(cv2.samples.findFile("lena.jpg"))
    if full is None:
        full = np.random.default_rng(0).integers(0, 256, (512, 512, 3), np.uint8)
    h, w = full.shape[:2]
    return full[:, :int(0.65 * w)].copy(), full[:, int(0.35 * w):].copy()

def detect_describe(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nfeatures=4000)
    return sift.detectAndCompute(gray, None)

def ratio_match(d1, d2, ratio=0.75):
    bf = cv2.BFMatcher(cv2.NORM_L2)
    pairs = bf.knnMatch(d1, d2, k=2)
    return [m for m, n in pairs if m.distance < ratio * n.distance]

def estimate_homography(k1, k2, good, min_inliers=20):
    if len(good) < 4:
        return None, None
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None or int(mask.sum()) < min_inliers:
        return None, None
    return H, mask

def stitch(left, right, H):
    h1, w1 = left.shape[:2]
    h2, w2 = right.shape[:2]
    canvas_w = w1 + w2
    warped = cv2.warpPerspective(right, H, (canvas_w, max(h1, h2)))
    warped[0:h1, 0:w1] = left
    return warped

if __name__ == "__main__":
    left, right = load_pair()
    k1, d1 = detect_describe(left)
    k2, d2 = detect_describe(right)
    good = ratio_match(d1, d2)
    H, mask = estimate_homography(k1, k2, good)
    if H is None:
        print(f"Refused: only {len(good)} ratio survivors, too few inliers to trust.")
    else:
        n_in = int(mask.sum())
        print(f"keypoints {len(k1)}/{len(k2)} -> ratio {len(good)} -> inliers {n_in}")
        cv2.imwrite("panorama.jpg", stitch(left, right, H))
        print("wrote panorama.jpg")

Bibliography & Further Reading

Foundational Papers

Harris, C. and Stephens, M. "A Combined Corner and Edge Detector." Alvey Vision Conference (1988). bmva.org (PDF)

The four-page paper behind Section 10.1: the structure tensor, the eigenvalue picture of flat/edge/corner, and the famous response formula with its mysterious constant k.

Lowe, D. "Distinctive Image Features from Scale-Invariant Keypoints." IJCV (2004). doi:10.1023/B:VISI.0000029664.99615.94

The SIFT paper: scale-space detection, orientation assignment, the 128-dimensional descriptor, and the ratio test, all in one document. Sections 10.2, 10.3, and 10.5 are essentially a guided tour of it.

Lindeberg, T. "Feature Detection with Automatic Scale Selection." IJCV (1998). doi:10.1023/A:1008045108935

The theoretical foundation of Section 10.2: scale-normalized derivatives and the principle that extrema over scale select an object's intrinsic scale.

Rosten, E. and Drummond, T. "Machine Learning for High-Speed Corner Detection." ECCV (2006). doi:10.1007/11744023_34

FAST: the segment test, and the decision-tree compilation that made corner detection nearly free. The reason Section 10.1's third detector runs at video rate on a microcontroller.

Calonder, M., Lepetit, V., Strecha, C., and Fua, P. "BRIEF: Binary Robust Independent Elementary Features." ECCV (2010). doi:10.1007/978-3-642-15561-1_56

The paper that replaced 128 floats with 256 bits and made Section 10.4 possible; remarkably readable, including the empirical comparison of sampling strategies.

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. "ORB: An Efficient Alternative to SIFT or SURF." ICCV (2011). doi:10.1109/ICCV.2011.6126544

Oriented FAST plus rotated, de-correlated BRIEF: the descriptor that powers ORB-SLAM and most real-time matching a decade later. Section 10.4's centerpiece.

Fischler, M. and Bolles, R. "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography." Communications of the ACM (1981). doi:10.1145/358669.358692

RANSAC's birth certificate, written for camera pose estimation in cartography. Section 10.6 follows its hypothesize-and-verify logic almost line by line.

Recent Research (2023-2026)

Lindenberger, P., Sarlin, P.-E., and Pollefeys, M. "LightGlue: Local Feature Matching at Light Speed." ICCV (2023). arXiv:2306.13643

The learned matcher that replaced Section 10.5's nearest-neighbor search in many modern pipelines: adaptive depth and width make it fast enough for real-time use.

Jiang, H. et al. "OmniGlue: Generalizable Feature Matching with Foundation Model Guidance." CVPR (2024). arXiv:2405.12979

Feature matching guided by a vision foundation model, aimed at generalizing across image domains that break classical and learned matchers alike.

Potje, G. et al. "XFeat: Accelerated Features for Lightweight Image Matching." CVPR (2024). arXiv:2404.19174

A learned detector-descriptor designed for the same CPU-and-embedded niche ORB occupies, showing the efficiency race of Section 10.4 continues in the deep era.

Leroy, V., Cabon, Y., and Revaud, J. "Grounding Image Matching in 3D with MASt3R." ECCV (2024). arXiv:2406.09756

Recasts two-view matching as direct 3D reconstruction with a transformer, challenging the detect-describe-match-verify decomposition this whole chapter teaches.

Books

Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book

Chapter 7 covers feature detection, description, and matching with full mathematical depth and exhaustive references; free online.

Tools & Libraries

OpenCV. "Feature Detection and Description" tutorial series. docs.opencv.org

The official OpenCV 4.x walkthroughs of Harris, Shi-Tomasi, FAST, SIFT, ORB, matching, and homography estimation used throughout this chapter's code.

Kornia. kornia.feature documentation. kornia.readthedocs.io

Differentiable PyTorch implementations of Harris, DoG, SIFT, and modern learned features (DISK, LoFTR adapters), bridging this chapter to Part III.

Schönberger, J. L. COLMAP: Structure-from-Motion and Multi-View Stereo. colmap.github.io

The reference reconstruction pipeline: SIFT keypoints, ratio-test matching, and RANSAC geometric verification at industrial scale. Where this chapter's pipeline lives in production.