Part II: Classical Computer Vision
Chapter 13: Two-View Geometry, Stereo & Depth

"Close one eye and the world goes politely flat. Open both, and geometry quietly hands the third dimension back. I do exactly the same thing, just with more linear algebra and fewer eyelashes."

A Binocular Rig With a Modest Baseline
Big Picture

Projection destroys depth; a second view restores it, and this chapter is the complete account of how. One photograph collapses every point along a viewing ray onto a single pixel, so distance is unrecoverable. Add a second photograph from a known (or recoverable) position and each pixel pair becomes an intersection problem: two rays, one 3D point. Everything in between (epipolar constraints, the essential and fundamental matrices, rectification, disparity, triangulation) is the machinery that turns that intersection idea into working code. The same machinery scales from two views to thousands in Chapter 14 and supplies the camera poses that neural scene representations in Chapter 27 quietly depend on.
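To make the collapse concrete, here is a minimal sketch, assuming an ideal pinhole camera with no lens distortion. The focal length and principal point are made-up illustrative values, and `project` is a hypothetical helper rather than a library call. Three points at very different depths along one viewing ray land on the same pixel.

```python
def project(point, f=800.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame point (X, Y, Z) to a pixel (u, v).

    f, cx, cy are illustrative intrinsics, not values from the chapter.
    """
    X, Y, Z = point
    return (f * X / Z + cx, f * Y / Z + cy)

# Direction of one viewing ray through the camera center.
ray = (0.1, -0.05, 1.0)

# Points at 1 m, 5 m and 50 m along that same ray.
points = [tuple(s * c for c in ray) for s in (1.0, 5.0, 50.0)]
pixels = [project(p) for p in points]

for p, (u, v) in zip(points, pixels):
    print(f"depth {p[2]:5.1f} m -> pixel ({u:.1f}, {v:.1f})")
```

Dividing by $Z$ is the step that discards distance: scaling a point along its ray scales the numerator and denominator together, so the pixel cannot change.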

Chapter Overview

Chapter 12 ended with a fully characterized camera: intrinsics that map rays to pixels, extrinsics that place the camera in the world, and distortion coefficients that straighten what the lens bent. But a single calibrated camera still cannot answer the most natural question about a photograph: how far away is that? The information was destroyed at exposure time, when every point along a 3D ray landed on the same pixel. This chapter adds the one ingredient that makes depth recoverable: a second view. Two views taken from different positions see the world along different rays, and where pairs of rays intersect, 3D structure lives.
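As a rough illustration of "where pairs of rays intersect", the sketch below finds the point closest to two rays and reports the gap between them. This is the simple midpoint method, used here as a stand-in and not the reprojection-error-optimal triangulation the chapter develops later. The camera centers, the 3D point, and the small perturbation that mimics measurement noise are all assumed values, and `midpoint_triangulate` is a hypothetical helper.

```python
import math

def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def scale(a, s): return tuple(s * x for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))

def midpoint_triangulate(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2.

    Returns (point, gap), where gap is the distance between the two rays.
    """
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b              # vanishes only for parallel rays
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = add(c1, scale(d1, s))         # closest point on ray 1
    p2 = add(c2, scale(d2, t))         # closest point on ray 2
    return scale(add(p1, p2), 0.5), math.dist(p1, p2)

# Assumed setup: two camera centers 0.2 m apart and one true 3D point.
c1, c2 = (0.0, 0.0, 0.0), (0.2, 0.0, 0.0)
X_true = (0.3, -0.1, 4.0)

# Noise-free rays meet exactly at the true point.
X_clean, gap_clean = midpoint_triangulate(c1, sub(X_true, c1), c2, sub(X_true, c2))

# Perturb the second ray slightly: the rays become skew and no longer meet.
d2_noisy = add(sub(X_true, c2), (0.0, 0.002, 0.0))
X_noisy, gap_noisy = midpoint_triangulate(c1, sub(X_true, c1), c2, d2_noisy)

print("clean gap:", gap_clean, "noisy gap:", gap_noisy)
print("estimate with noise:", X_noisy)
```

With perfect rays the gap is zero up to rounding. With even a tiny perturbation the rays miss each other, which is why any practical method has to choose what "the intersection" means.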

The chapter opens with geometry before algorithms. Section 13.1 establishes the epipolar constraint, the surprising fact that a point in one image confines its match in the other image to a single line, collapsing correspondence from a 2D search into a 1D one. Section 13.2 packages that constraint into two famous $3 \times 3$ matrices, the essential matrix (calibrated) and the fundamental matrix (uncalibrated), and shows how to estimate them from the point matches that Chapter 10 taught you to produce, including the normalization trick that rescued the eight-point algorithm from numerical infamy. Section 13.3 takes a deliberate detour into the special case where two views are related point-to-point rather than point-to-line: planar scenes and rotating cameras, the realm of the homography, and the reason your phone can stitch panoramas.
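The epipolar constraint can be checked numerically in a few lines. The sketch below assumes the convention $X_2 = R X_1 + t$ for the relative pose and builds the essential matrix as $E = [t]_\times R$. It works in normalized image coordinates (pixels with $K$ already removed), and the pose and 3D point are arbitrary illustrative values. The helper names are hypothetical.

```python
import math

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(X):
    """Normalized homogeneous image coordinates of a camera-frame point."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

# Assumed relative pose: a 5 degree rotation about the y axis plus a translation.
th = math.radians(5.0)
R = [[math.cos(th), 0.0, math.sin(th)],
     [0.0, 1.0, 0.0],
     [-math.sin(th), 0.0, math.cos(th)]]
t = [-0.2, 0.0, 0.01]
E = matmul(skew(t), R)

# One 3D point expressed in both camera frames.
X1 = [0.3, -0.1, 4.0]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1, x2 = normalize(X1), normalize(X2)

# Epipolar line in image 2 induced by x1: all (x, y) with a*x + b*y + c = 0.
line = matvec(E, x1)
residual = dot(x2, line)                   # x2^T E x1, zero for a true match

# A wrong candidate shifted off the line violates the constraint.
x2_wrong = [x2[0], x2[1] + 0.05, 1.0]
residual_wrong = dot(x2_wrong, line)

print("true match residual: ", residual)
print("wrong match residual:", residual_wrong)
```

The vector `line` is what makes the search one-dimensional: a candidate match in the second image only needs to be considered if it lies on, or near, that line.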

The second half is about density and metric depth. Section 13.4 rectifies a stereo pair so that epipolar lines become horizontal scanlines, then estimates disparity for every pixel with block matching and semi-global matching. Section 13.5 converts disparity into metric depth through one elegant formula, $Z = fB/d$, and confronts its consequences: depth error grows quadratically with distance, which dictates how every stereo product is engineered. Section 13.6 closes the loop with triangulation, recovering individual 3D points from matched pairs the proper way, including why "intersect the two rays" is subtler than it sounds when the rays, thanks to noise, never actually meet.
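The quadratic error growth follows from differentiating $Z = fB/d$: since $dZ/dd = -fB/d^2 = -Z^2/(fB)$, a fixed disparity error $\delta d$ produces a depth error of about $Z^2 \delta d / (fB)$. The sketch below tabulates this for an assumed rig. The focal length, baseline, and quarter-pixel matching error are illustrative values rather than numbers from the chapter, and both functions are hypothetical helpers.

```python
def depth_from_disparity(d, f, B):
    """Metric depth Z = f * B / d for a rectified pair (f, d in pixels; B in metres)."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return f * B / d

def depth_error(Z, f, B, disparity_error):
    """First-order depth uncertainty: |dZ/dd| * delta_d = Z**2 / (f * B) * delta_d."""
    return Z * Z / (f * B) * disparity_error

f, B = 800.0, 0.12          # assumed focal length (px) and baseline (m)
delta_d = 0.25              # assumed disparity error (px)

depths = [1.0, 2.0, 4.0, 8.0]
errors = [depth_error(Z, f, B, delta_d) for Z in depths]

for Z, err in zip(depths, errors):
    d = f * B / Z           # disparity that a point at depth Z would produce
    print(f"Z = {Z:4.1f} m  disparity = {d:6.2f} px  depth error ~ {err * 100:6.2f} cm")
```

Each doubling of distance quadruples the error, while doubling either the baseline or the focal length only halves it. That asymmetry is what drives the engineering trade-offs mentioned above.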

A theme worth tracking: this chapter is where the camera matrix $K$ from Chapter 12 earns its keep, where the matched keypoints and RANSAC verification of Chapter 10 become inputs rather than outputs, and where the homographies first met in Chapter 5 reappear with a physical interpretation. Classical two-view geometry is also remarkably alive: the disparity estimators of Section 13.4 are now often recurrent neural networks, and 2024-2026 systems like DUSt3R and VGGT regress 3D structure directly from two or more images. These systems did not make the geometry obsolete; they made fluency in it the entry ticket for understanding what they predict and how they are evaluated.

Prerequisites

This chapter builds directly on Chapter 12: Camera Models & Calibration. The pinhole model, the intrinsic matrix $K$, homogeneous coordinates, and rotation-translation extrinsics are used on nearly every page. The point correspondences that feed every estimator come from the detect-describe-match-verify pipeline of Chapter 10: Keypoints, Descriptors & Matching, and RANSAC from Section 10.6 returns here as the standard wrapper around every geometric fit. Section 13.3 assumes the warping and interpolation machinery of Chapter 5: Geometric Transformations & Image Warping, and its blending discussion echoes the pyramids of Chapter 4. Comfort with the SVD as a linear algebra tool (least-squares null spaces, rank constraints) is assumed; each use is explained in context.

What's Next?

Two views recover depth along a single baseline; the obvious next question is what happens with two hundred views, or with a video stream from a camera that never stops moving. Chapter 14: Structure from Motion & Visual SLAM generalizes everything built here: pairwise relative poses chain into global camera trajectories, triangulated points merge into a single 3D model, and bundle adjustment polishes both simultaneously. The two-view estimators of this chapter literally run inside those systems, thousands of times per reconstruction, which is why getting them right here pays off there.

Bibliography & Further Reading

Foundational Papers

Longuet-Higgins, H. C. "A Computer Algorithm for Reconstructing a Scene from Two Projections." Nature 293 (1981). doi:10.1038/293133a0
The paper that started two-view geometry as a computational subject: the essential matrix and the eight-point algorithm, published in Nature of all places. Section 13.2 is its direct descendant.
Hartley, R. "In Defense of the Eight-Point Algorithm." IEEE TPAMI 19(6) (1997). doi:10.1109/34.601246
The famous rescue: a simple coordinate-normalization step (translate and scale the points before solving) that turned a numerically disgraced algorithm into the standard initializer. Section 13.2 implements it line by line.
Nistér, D. "An Efficient Solution to the Five-Point Relative Pose Problem." IEEE TPAMI 26(6) (2004). doi:10.1109/TPAMI.2004.17
The minimal solver for the calibrated case, used inside RANSAC loops everywhere; the engine behind OpenCV's findEssentialMat discussed in Section 13.2.
Hartley, R. and Sturm, P. "Triangulation." Computer Vision and Image Understanding 68(2) (1997). doi:10.1006/cviu.1997.0547
Why "intersect two rays" is not as easy as it sounds, and the optimal answer that minimizes reprojection error. The backbone of Section 13.6.
Scharstein, D. and Szeliski, R. "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms." IJCV 47 (2002). doi:10.1023/A:1014573219977
The paper that organized stereo matching into the cost-aggregation-optimization-refinement template Section 13.4 follows, and that launched the Middlebury benchmark.
Hirschmüller, H. "Stereo Processing by Semiglobal Matching and Mutual Information." IEEE TPAMI 30(2) (2008). doi:10.1109/TPAMI.2007.1166
Semi-global matching: the dynamic-programming compromise between local block matching and intractable global optimization. Still shipping in products two decades later as Section 13.4's SGBM.
Brown, M. and Lowe, D. "Automatic Panoramic Image Stitching using Invariant Features." IJCV 74 (2007). doi:10.1007/s11263-006-0002-3
The complete panorama recipe (SIFT, RANSAC homographies, bundle adjustment, multi-band blending) that Section 13.3 walks through and that OpenCV's Stitcher implements.

Recent Research (2024-2026)

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., and Revaud, J. "DUSt3R: Geometric 3D Vision Made Easy." CVPR (2024). arXiv:2312.14132
Two uncalibrated images in, dense 3D pointmaps out, no explicit epipolar pipeline anywhere. The strongest current challenge to this chapter's decomposition, discussed in Sections 13.1 and 13.6.
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. "VGGT: Visual Geometry Grounded Transformer." CVPR (2025). arXiv:2503.11651
A feed-forward transformer that predicts cameras, depth maps, and 3D points for many views in one pass; the 2025 state of the "geometry as regression" research line.
Wen, B., Trepte, M., Aribido, J., Kautz, J., Gallo, O., and Birchfield, S. "FoundationStereo: Zero-Shot Stereo Matching." CVPR (2025). arXiv:2501.09898
A stereo foundation model that generalizes across domains without fine-tuning; where Section 13.4's matching problem stands in 2025.
Lipson, L., Teed, Z., and Deng, J. "RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching." 3DV (2021). arXiv:2109.07547
The recurrent refinement architecture that made learned stereo robust enough for products, and the design most 2024-2026 stereo networks still build on.
Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., and Zhao, H. "Depth Anything V2." NeurIPS (2024). arXiv:2406.09414
Monocular depth from a foundation model: the single-camera rival that Section 13.5 weighs against stereo's metric guarantees.

Books

Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, 2nd edition. Cambridge University Press (2004). robots.ox.ac.uk/~vgg/hzbook
The definitive reference for everything in this chapter, with full proofs. When this chapter says "it can be shown", the showing is in here.
Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book
Chapters 12 and 13 cover depth estimation (including stereo correspondence) and 3D reconstruction with an engineering eye and exhaustive references; free online.

Tools & Libraries

OpenCV. "Camera Calibration and 3D Reconstruction" (calib3d) documentation. docs.opencv.org
The reference for every function this chapter calls: findFundamentalMat, findEssentialMat, recoverPose, stereoRectify, StereoSGBM, triangulatePoints, reprojectImageTo3D.
Middlebury Stereo Vision Page. vision.middlebury.edu/stereo
The canonical stereo datasets and leaderboard since 2002: ground-truth disparities for testing everything Section 13.4 builds.
KITTI Stereo / Scene Flow Benchmark. cvlibs.net/datasets/kitti
Real driving imagery with LiDAR ground truth: the benchmark that measures how the methods of Sections 13.4 and 13.5 behave outside the lab.