"Close one eye and the world goes politely flat. Open both, and geometry quietly hands the third dimension back. I do exactly the same thing, just with more linear algebra and fewer eyelashes."
A Binocular Rig With a Modest Baseline
Projection destroys depth; a second view restores it, and this chapter is the complete account of how. One photograph collapses every point along a viewing ray onto a single pixel, so distance is unrecoverable. Add a second photograph from a known (or recoverable) position and each pixel pair becomes an intersection problem: two rays, one 3D point. Everything in between (epipolar constraints, the essential and fundamental matrices, rectification, disparity, triangulation) is the machinery that turns that intersection idea into working code. The same machinery scales from two views to thousands in Chapter 14 and supplies the camera poses that neural scene representations in Chapter 27 quietly depend on.
Chapter Overview
Chapter 12 ended with a fully characterized camera: intrinsics that map rays to pixels, extrinsics that place the camera in the world, and distortion coefficients that straighten what the lens bent. But a single calibrated camera still cannot answer the most natural question about a photograph: how far away is that? The information was destroyed at exposure time, when every point along a 3D ray landed on the same pixel. This chapter adds the one ingredient that makes depth recoverable: a second view. Two views taken from different positions see the world along different rays, and where pairs of rays intersect, 3D structure lives.
The chapter opens with geometry before algorithms. Section 13.1 establishes the epipolar constraint, the surprising fact that a point in one image confines its match in the other image to a single line, collapsing correspondence from a 2D search into a 1D one. Section 13.2 packages that constraint into two famous $3 \times 3$ matrices, the essential matrix (calibrated) and the fundamental matrix (uncalibrated), and shows how to estimate them from the point matches that Chapter 10 taught you to produce, including the normalization trick that rescued the eight-point algorithm from numerical infamy. Section 13.3 takes a deliberate detour into the special case where two views are related point-to-point rather than point-to-line: planar scenes and rotating cameras, the realm of the homography, and the reason your phone can stitch panoramas.
The second half is about density and metric depth. Section 13.4 rectifies a stereo pair so that epipolar lines become horizontal scanlines, then estimates disparity for every pixel with block matching and semi-global matching. Section 13.5 converts disparity into metric depth through one elegant formula, $Z = fB/d$, and confronts its consequences: depth error grows quadratically with distance, which dictates how every stereo product is engineered. Section 13.6 closes the loop with triangulation, recovering individual 3D points from matched pairs the proper way, including why "intersect the two rays" is subtler than it sounds when the rays, thanks to noise, never actually meet.
A theme worth tracking: this chapter is where the camera matrix $K$ from Chapter 12 earns its keep, where the matched keypoints and RANSAC verification of Chapter 10 become inputs rather than outputs, and where the homographies first met in Chapter 5 reappear with a physical interpretation. Classical two-view geometry is also remarkably alive: the disparity estimators of Section 13.4 are now recurrent neural networks, and 2024-2026 systems like DUSt3R and VGGT regress 3D structure directly from image pairs. They did not make the geometry obsolete; they made fluency in it the entry ticket for understanding what those models predict and how they are evaluated.
Prerequisites
This chapter builds directly on Chapter 12: Camera Models & Calibration: the pinhole model, the intrinsic matrix $K$, homogeneous coordinates, and rotation-translation extrinsics are used on nearly every page. The point correspondences that feed every estimator come from the detect-describe-match-verify pipeline of Chapter 10: Keypoints, Descriptors & Matching, and RANSAC from Section 10.6 returns here as the standard wrapper around every geometric fit. Section 13.3 assumes the warping and interpolation machinery of Chapter 5: Geometric Transformations & Image Warping, and its blending discussion echoes the pyramids of Chapter 4. Comfort with the SVD as a linear algebra tool (least-squares null spaces, rank constraints) is assumed; each use is explained in context.
Chapter Roadmap
- 13.1 Epipolar Geometry: The Geometry of Two Views Baselines, epipoles, epipolar planes and lines: why a point in one image pins its match in the other to a single line, and what that buys every algorithm downstream.
- 13.2 Essential & Fundamental Matrices The epipolar constraint as algebra: $E = [t]_\times R$, the fundamental matrix for uncalibrated cameras, the normalized eight-point algorithm, robust estimation, and recovering camera motion from $E$.
- 13.3 Homographies & Panorama Stitching When two views map point-to-point: planar scenes and rotating cameras, DLT estimation, homography-versus-fundamental model selection, and the full panorama pipeline with its parallax failure modes.
- 13.4 Stereo Rectification & Disparity Estimation Warping a stereo pair so epipolar lines become scanlines, then matching every pixel: block matching, semi-global matching, SGBM parameters that matter, and disparity hygiene.
- 13.5 From Disparity to Depth Maps The formula $Z = fB/d$ and its consequences: quadratic depth error, baseline design trade-offs, converting disparity maps to metric point clouds, and stereo against ToF, LiDAR, and learned monocular depth.
- 13.6 Triangulation & 3D Point Recovery Recovering 3D points from matched pairs when noisy rays never quite intersect: the midpoint method, linear DLT triangulation, reprojection error, cheirality, and a complete two-view reconstruction pipeline.
What's Next?
Two views recover depth along a single baseline; the obvious next question is what happens with two hundred views, or with a video stream from a camera that never stops moving. Chapter 14: Structure from Motion & Visual SLAM generalizes everything built here: pairwise relative poses chain into global camera trajectories, triangulated points merge into a single 3D model, and bundle adjustment polishes both simultaneously. The two-view estimators of this chapter literally run inside those systems, thousands of times per reconstruction, which is why getting them right here pays off there.