Part II: Classical Computer Vision
Chapter 13: Two-View Geometry, Stereo & Depth


"Close one eye and the world goes politely flat. Open both, and geometry quietly hands the third dimension back. I do exactly the same thing, just with more linear algebra and fewer eyelashes."

A Binocular Rig With a Modest Baseline
Big Picture

Projection destroys depth; a second view restores it, and this chapter is the complete account of how. One photograph collapses every point along a viewing ray onto a single pixel, so distance is unrecoverable. Add a second photograph from a known (or recoverable) position and each pixel pair becomes an intersection problem: two rays, one 3D point. Everything in between (epipolar constraints, the essential and fundamental matrices, rectification, disparity, triangulation) is the machinery that turns that intersection idea into working code. The same machinery scales from two views to thousands in Chapter 14 and supplies the camera poses that neural scene representations in Chapter 27 quietly depend on.

Chapter Overview

Chapter 12 ended with a fully characterized camera: intrinsics that map rays to pixels, extrinsics that place the camera in the world, and distortion coefficients that straighten what the lens bent. But a single calibrated camera still cannot answer the most natural question about a photograph: how far away is that? The information was destroyed at exposure time, when every point along a 3D ray landed on the same pixel. This chapter adds the one ingredient that makes depth recoverable: a second view. Two views taken from different positions see the world along different rays, and where pairs of rays intersect, 3D structure lives.
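
A minimal numeric sketch of that loss and its repair, assuming illustrative intrinsics (800 px focal length, principal point at 320, 240) and a hypothetical 0.2 m sideways shift for the second camera; none of these values come from a real rig. Four points at very different depths along one viewing ray share a pixel in the first view, while the second view separates them along a single image row.

```python
import numpy as np

# Assumed intrinsics, chosen only for illustration.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(points, K, t):
    """Pinhole projection for an unrotated camera: x ~ K (X + t)."""
    h = (points + t) @ K.T
    return h[:, :2] / h[:, 2:3]

# Four points at very different depths along one viewing ray of camera 1.
ray = np.array([0.1, -0.05, 1.0])
depths = np.array([1.0, 2.0, 5.0, 50.0])
points = depths[:, None] * ray

# Camera 1 at the origin: every point lands on the same pixel.
pix1 = project(points, K, np.zeros(3))

# Camera 2 shifted 0.2 m along +x (camera centre C = -t, so t = [-0.2, 0, 0]).
baseline = 0.2
pix2 = project(points, K, np.array([-baseline, 0.0, 0.0]))

print(pix1)                     # four identical rows
print(pix1[:, 0] - pix2[:, 0])  # horizontal shift, smaller for deeper points
```

The horizontal shift printed on the last line is the disparity, and it falls off as one over depth, which is the relationship Section 13.5 turns into a formula.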

The chapter opens with geometry before algorithms. Section 13.1 establishes the epipolar constraint, the surprising fact that a point in one image confines its match in the other image to a single line, collapsing correspondence from a 2D search into a 1D one. Section 13.2 packages that constraint into two famous $3 \times 3$ matrices, the essential matrix (calibrated) and the fundamental matrix (uncalibrated), and shows how to estimate them from the point matches that Chapter 10 taught you to produce, including the normalization trick that rescued the eight-point algorithm from numerical infamy. Section 13.3 takes a deliberate detour into the special case where two views are related point-to-point rather than point-to-line: planar scenes and rotating cameras, the realm of the homography, and the reason your phone can stitch panoramas.
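
As a preview of the normalization trick, here is a hedged sketch of the normalized eight-point algorithm in plain NumPy. It is not the chapter's own listing: the helper names (normalize_points, eight_point) and the synthetic scene are assumptions made for illustration, and the scene is noiseless so the result can be checked directly.

```python
import numpy as np

def normalize_points(pts):
    """Hartley normalization: zero mean, mean distance sqrt(2) from the origin."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    return pts_h @ T.T, T

def eight_point(pts1, pts2):
    """Estimate F with x2^T F x1 = 0 from N >= 8 pixel matches."""
    x1, T1 = normalize_points(pts1)
    x2, T2 = normalize_points(pts2)
    # One row per match: the 9 products x2_i * x1_j, flattened row-major.
    A = np.einsum('ni,nj->nij', x2, x1).reshape(len(x1), 9)
    # The least-squares null vector of A is the last right singular vector.
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization so F applies to raw pixel coordinates.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)

def project(X, K, R, t):
    Xc = X @ R.T + t
    x = Xc @ K.T
    return x[:, :2] / x[:, 2:3]

# Synthetic, noiseless test scene (all values are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (20, 3)) + [0.0, 0.0, 6.0]
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
a = -0.15  # small rotation about the y axis, in radians
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([1.0, 0.0, 0.0])
pts1 = project(X, K, np.eye(3), np.zeros(3))
pts2 = project(X, K, R, t)

F = eight_point(pts1, pts2)
h1 = np.hstack([pts1, np.ones((len(pts1), 1))])
h2 = np.hstack([pts2, np.ones((len(pts2), 1))])
residual = np.abs(np.einsum('ni,ij,nj->n', h2, F, h1)).max()
print("max |x2^T F x1| =", residual)
```

Skipping the two normalize_points calls leaves the algebra unchanged but builds the design matrix from raw pixel values whose magnitudes differ by several orders, which is the conditioning problem the normalization addresses.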

The second half is about density and metric depth. Section 13.4 rectifies a stereo pair so that epipolar lines become horizontal scanlines, then estimates disparity for every pixel with block matching and semi-global matching. Section 13.5 converts disparity into metric depth through one elegant formula, $Z = fB/d$, and confronts its consequences: depth error grows quadratically with distance, which dictates how every stereo product is engineered. Section 13.6 closes the loop with triangulation, recovering individual 3D points from matched pairs the proper way, including why "intersect the two rays" is subtler than it sounds when the rays, thanks to noise, never actually meet.
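
The quadratic growth follows from differentiating $Z = fB/d$: $|\partial Z / \partial d| = fB/d^2 = Z^2/(fB)$, so a fixed disparity error costs four times as much depth error at twice the distance. A small sketch under assumed numbers (800 px focal length, 12 cm baseline, a quarter-pixel disparity uncertainty), none of which describe a specific product:

```python
import numpy as np

f_px = 800.0     # focal length in pixels (assumed)
B_m = 0.12       # baseline in metres (assumed)
sigma_d = 0.25   # disparity uncertainty in pixels (assumed)

def depth_from_disparity(d_px):
    """Z = f B / d for a rectified pair."""
    return f_px * B_m / d_px

def depth_sigma(Z_m):
    """First-order error propagation: sigma_Z = Z^2 / (f B) * sigma_d."""
    return Z_m ** 2 / (f_px * B_m) * sigma_d

for Z in (1.0, 2.0, 4.0, 8.0):
    d = f_px * B_m / Z
    print(f"Z = {Z:4.1f} m   d = {d:6.2f} px   sigma_Z = {100 * depth_sigma(Z):7.3f} cm")
```

Doubling either the focal length or the baseline halves the depth uncertainty at every range, which is why the two are the main design levers in a stereo rig.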

A theme worth tracking: this chapter is where the camera matrix $K$ from Chapter 12 earns its keep, where the matched keypoints and RANSAC verification of Chapter 10 become inputs rather than outputs, and where the homographies first met in Chapter 5 reappear with a physical interpretation. Classical two-view geometry is also remarkably alive: the disparity estimators of Section 13.4 are now recurrent neural networks, and 2024-2026 systems like DUSt3R and VGGT regress 3D structure directly from image pairs. They did not make the geometry obsolete; they made fluency in it the entry ticket for understanding what those models predict and how they are evaluated.

Prerequisites

This chapter builds directly on Chapter 12: Camera Models & Calibration, whose pinhole model, intrinsic matrix $K$, homogeneous coordinates, and rotation-translation extrinsics are used on nearly every page. The point correspondences that feed every estimator come from the detect-describe-match-verify pipeline of Chapter 10: Keypoints, Descriptors & Matching, and RANSAC from Section 10.6 returns here as the standard wrapper around every geometric fit. Section 13.3 assumes the warping and interpolation machinery of Chapter 5: Geometric Transformations & Image Warping, and its blending discussion echoes the pyramids of Chapter 4. Comfort with the singular value decomposition (SVD) as a linear algebra tool (least-squares null spaces, rank constraints) is assumed; each use is explained in context, and Appendix A: Mathematical Foundations gives a compact refresher covering the null-space and rank facts every estimator in this chapter relies on.

Chapter Roadmap

Key Insight: The Whole Chapter in One Mental Model

If only one schema survives the week, make it the chapter's spine, a five-link chain that runs left to right: constrain, encode, estimate, match, intersect. A second view constrains each match to an epipolar line (13.1); that constraint encodes into one $3 \times 3$ matrix, $E$ or $F$ (13.2), with the homography $H$ as its point-to-point special case (13.3); robust fitting estimates the matrix and the camera motion hiding inside it (13.2); rectification turns the lines into scanlines so a dense matcher can match every pixel into a disparity (13.4); and triangulation intersects the rays, via $Z = fB/d$ for rectified pairs (13.5) or the DLT for arbitrary ones (13.6), to recover 3D points. Two phrases anchor the consequences: depth from images is recoverable only up to scale until something metric pins it, and depth precision degrades with the square of distance. Every later 3D chapter, Chapter 14's structure from motion and Chapter 27's neural scenes, runs this same five-link chain inside its inner loop. The Hands-On Lab at the end of this chapter walks all five links in one runnable program against a synthetic cube that lets you grade every step.

Hands-On Lab: A Two-View Reconstruction From Scratch

Duration: about 75 to 90 minutes
Difficulty: Intermediate

Objective

Run the chapter's whole five-link chain (constrain, encode, estimate, match, intersect) inside one runnable program. You will start from two views of a known synthetic cube, estimate the essential matrix from point matches, recover the camera motion hiding inside it, triangulate the matched points into 3D, then check your reconstruction against ground truth with reprojection error. Because the scene is synthesized with exact camera poses, the lab grades itself: you can compare your recovered rotation, translation direction, and 3D points to the values that generated the images, something no real dataset lets you do.

What You'll Practice

  • Projecting known 3D points through two calibrated cameras to build a self-checking correspondence set (the pinhole model of Chapter 12, run in the forward direction).
  • Estimating the essential matrix from matches and recovering relative pose with the cheirality test (Section 13.2).
  • Verifying the epipolar constraint $x'^\top E x = 0$ numerically on your own matches (Section 13.1).
  • Triangulating matched pairs into 3D points with the linear DLT method and resolving the up-to-scale ambiguity (Section 13.6).
  • Auditing a reconstruction with reprojection error, the same metric bundle adjustment minimizes in Chapter 14.

Setup

Two libraries and no dataset; the script generates its own cube and cameras, so it always runs to completion. Install with:

pip install opencv-python numpy

Everything runs on the CPU in well under a second. Matplotlib is optional and only used by the final stretch goal for a 3D scatter of the recovered points.

Steps

Step 1: Build a synthetic scene and two cameras

Define a small 3D point set, a 3 × 3 × 3 grid of 27 points filling a unit cube, and two calibrated cameras with the same intrinsics $K$ but different poses. The second camera is the first translated sideways (a baseline) and rotated slightly toward the scene, exactly the rig of Section 13.1.

import cv2
import numpy as np

np.random.seed(0)

# Shared intrinsics: 800 px focal length, principal point at image center.
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0,   0,   1]], float)

# A cube of 3D points in world coordinates, sitting in front of the cameras.
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]

# Camera 1 is the world origin (R = I, t = 0).
R1, t1 = np.eye(3), np.zeros(3)

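# TODO: define camera 2's pose. Build a small rotation R2 about the y axis
# (use cv2.Rodrigues on the vector [0, -0.15, 0]) and a translation
# t2 = [1.0, 0, 0] (a 1-metre baseline). Keep them as (3,3) and (3,).
# Note: with X_cam = R X + t the camera centre is -R2.T @ t2, so camera 2
# sits about 1 m to the left of camera 1 and is turned toward the cube.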
Hint

R2, _ = cv2.Rodrigues(np.array([0.0, -0.15, 0.0])) gives the rotation matrix; t2 = np.array([1.0, 0.0, 0.0]) is the baseline. A camera pose maps a world point $X$ to camera coordinates as $R X + t$, so these two numbers fully place camera 2.
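
To see where that pose actually puts camera 2, the sketch below computes its centre and optical axis in world coordinates. It assumes the convention $X_{cam} = R X + t$ stated above, and it uses a hand-written y-axis rotation (the hypothetical helper rot_y) in place of cv2.Rodrigues so that it runs with NumPy alone.

```python
import numpy as np

def rot_y(angle):
    """Rotation about the y axis; equals cv2.Rodrigues of the vector [0, angle, 0]."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

R2 = rot_y(-0.15)
t2 = np.array([1.0, 0.0, 0.0])

C2 = -R2.T @ t2                            # camera 2 centre in world coordinates
axis2 = R2.T @ np.array([0.0, 0.0, 1.0])   # camera 2 optical axis in world coordinates
baseline = np.linalg.norm(C2)              # camera 1 sits at the world origin

print("camera 2 centre:", C2)
print("optical axis:   ", axis2)
print("baseline length:", baseline)
```

The centre has a negative x coordinate and the optical axis a positive one: a translation vector of [1, 0, 0] moves the scene to the right in camera 2's frame, which means the camera itself sits to the left of camera 1 and looks back toward the cube.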

Step 2: Project the cube into both images

Push every 3D point through each camera with the projection $x \sim K (R X + t)$ to get two sets of pixel coordinates. These perfectly corresponding pixel pairs are the matches that Chapter 10 would have produced from a real image pair, with zero outliers so you can isolate the geometry.

def project(X, K, R, t):
    Xc = X @ R.T + t                 # world -> camera coordinates
    x = Xc @ K.T                     # camera -> homogeneous pixels
    return x[:, :2] / x[:, 2:3]      # perspective divide

pts1 = project(X, K, R1, t1)
# TODO: produce pts2 the same way using R2 and t2 from Step 1.
Hint

pts2 = project(X, K, R2, t2). Each row of pts1 and the matching row of pts2 are the same 3D grid point seen from the two views, the noiseless ideal of a verified match.

Step 3: Estimate the essential matrix and check the epipolar constraint

Estimate $E$ from the matches with OpenCV's RANSAC-wrapped solver (Section 13.2), then verify it: every match should satisfy $x'^\top E x \approx 0$ when the points are expressed as normalized homogeneous rays $K^{-1}[x,y,1]^\top$ (Section 13.1).

E, mask = cv2.findEssentialMat(pts1, pts2, K,
                               method=cv2.RANSAC, prob=0.999, threshold=1.0)

def to_rays(pts, K):
    h = np.hstack([pts, np.ones((len(pts), 1))])
    return h @ np.linalg.inv(K).T    # normalized image-plane rays

r1, r2 = to_rays(pts1, K), to_rays(pts2, K)
# TODO: compute the per-match epipolar residual r2[i] @ E @ r1[i] for all i
# and print its maximum absolute value. It should be tiny (about 1e-12).
Hint

res = np.einsum('ij,jk,ik->i', r2, E, r1) gives one residual per match; print(np.abs(res).max()). A near-zero maximum confirms your $E$ encodes the same epipolar geometry that generated the points.
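
Because the lab knows the true pose, the estimate can also be compared against the closed-form essential matrix $E = [t]_\times R$. The sketch below builds that ground-truth matrix with NumPy alone and checks its two signature properties; the helper names (rot_y, skew, essential_distance) are illustrative, not OpenCV functions.

```python
import numpy as np

def rot_y(angle):
    """Rotation about the y axis; stands in for cv2.Rodrigues([0, angle, 0])."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def skew(v):
    """Cross-product matrix: skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def essential_distance(Ea, Eb):
    """Frobenius distance after removing the arbitrary scale and sign."""
    Ea = Ea / np.linalg.norm(Ea)
    Eb = Eb / np.linalg.norm(Eb)
    return min(np.linalg.norm(Ea - Eb), np.linalg.norm(Ea + Eb))

# Ground-truth rig from Step 1 (camera 1 is the world frame).
R2 = rot_y(-0.15)
t2 = np.array([1.0, 0.0, 0.0])
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]

E_gt = skew(t2) @ R2

# Normalized rays: camera-frame points divided by their depth.
rays1 = X / X[:, 2:3]
Xc2 = X @ R2.T + t2
rays2 = Xc2 / Xc2[:, 2:3]

res = np.einsum('ij,jk,ik->i', rays2, E_gt, rays1)
sv = np.linalg.svd(E_gt, compute_uv=False)
print("max epipolar residual:", np.abs(res).max())
print("singular values:", sv)   # two equal values and a zero
```

Calling essential_distance(E, E_gt) on the matrix returned by cv2.findEssentialMat should then give a value near zero, since an essential matrix is only defined up to scale and sign.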

Step 4: Recover the camera motion from E

Decompose $E$ into a rotation and a unit translation direction with cv2.recoverPose, which runs the cheirality test of Section 13.2 to pick the one physically valid solution out of the four that $E$ admits. Compare the result to the ground-truth pose from Step 1.

n_in, R_est, t_est, _ = cv2.recoverPose(E, pts1, pts2, K)

# Ground-truth relative pose (camera 1 -> camera 2).
R_gt = R2
t_gt = t2 / np.linalg.norm(t2)       # recoverPose returns t up to scale

# TODO: print the rotation error in degrees and the angle between the
# estimated and ground-truth translation directions. Both should be ~0.
Hint

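Rotation error: ang = np.degrees(np.arccos(np.clip((np.trace(R_est @ R_gt.T) - 1) / 2, -1, 1))). Translation direction error: np.degrees(np.arccos(np.clip(abs(t_est.ravel() @ t_gt), -1, 1))). The clip guards against round-off pushing a cosine just past 1, which would make arccos return NaN on near-perfect input. The abs makes the translation check insensitive to sign: $E$ alone fixes the baseline direction only up to sign, and although the cheirality test inside recoverPose should already have chosen the right one, dropping the abs lets you confirm that it did.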
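
For intuition about what recoverPose is choosing between, the following NumPy-only sketch enumerates the four candidate poses of an essential matrix and applies a cheirality test by solving for the two depths in $\lambda_2 x_2 = \lambda_1 R x_1 + t$. It illustrates the idea on the ground-truth $E$ and is not a description of OpenCV's internal implementation; the helper names are assumptions.

```python
import numpy as np

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def skew(v):
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by E, with t of unit length."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:    # keep both factors proper rotations
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    Ra, Rb, t = U @ W @ Vt, U @ W.T @ Vt, U[:, 2]
    return [(Ra, t), (Ra, -t), (Rb, t), (Rb, -t)]

def count_in_front(R, t, rays1, rays2):
    """Count matches whose depths in both cameras come out positive."""
    n = 0
    for x1, x2 in zip(rays1, rays2):
        A = np.stack([R @ x1, -x2], axis=1)            # 3x2 system in (lam1, lam2)
        lam, *_ = np.linalg.lstsq(A, -t, rcond=None)
        n += int(lam[0] > 0 and lam[1] > 0)
    return n

# Ground-truth rig and normalized rays, as in Steps 1 to 3.
R2, t2 = rot_y(-0.15), np.array([1.0, 0.0, 0.0])
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]
rays1 = X / X[:, 2:3]
Xc2 = X @ R2.T + t2
rays2 = Xc2 / Xc2[:, 2:3]

cands = decompose_essential(skew(t2) @ R2)
counts = [count_in_front(R, t, rays1, rays2) for R, t in cands]
R_best, t_best = cands[int(np.argmax(counts))]
print("points in front of both cameras, per candidate:", counts)
```

Only one candidate places every point in front of both cameras, and it is the pose that generated the data.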

Step 5: Triangulate the matches into 3D

Build the two camera projection matrices $P_1 = K[I \mid 0]$ and $P_2 = K[R_{est} \mid t_{est}]$, then call cv2.triangulatePoints, the linear DLT of Section 13.6. The output is homogeneous; divide through to get Euclidean 3D points.

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R_est, t_est])     # t_est is already a column vector

Xh = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
# TODO: convert the 4xN homogeneous output Xh into an Nx3 array of
# Euclidean points by dividing the first three rows by the fourth.
Hint

X_est = (Xh[:3] / Xh[3]).T. These points live in camera 1's coordinate frame, which here coincides with the world frame because R1 = I and t1 = 0, and they are correct only up to a single global scale, because $t$ was recovered as a unit vector.
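
What the library call does per point can be sketched in a few lines of NumPy. The version below is a plain linear DLT written for illustration (the function name triangulate_dlt is hypothetical) and, to stay self-contained, it uses the ground-truth pose rather than the estimated one.

```python
import numpy as np

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(X, K, R, t):
    Xc = X @ R.T + t
    x = Xc @ K.T
    return x[:, :2] / x[:, 2:3]

def triangulate_dlt(P1, P2, pts1, pts2):
    """Linear triangulation: each view contributes two rows of A X_h = 0."""
    out = np.empty((len(pts1), 3))
    for i, ((u1, v1), (u2, v2)) in enumerate(zip(pts1, pts2)):
        A = np.stack([u1 * P1[2] - P1[0],
                      v1 * P1[2] - P1[1],
                      u2 * P2[2] - P2[0],
                      v2 * P2[2] - P2[1]])
        Xh = np.linalg.svd(A)[2][-1]     # null vector = homogeneous 3D point
        out[i] = Xh[:3] / Xh[3]
    return out

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]
R2, t2 = rot_y(-0.15), np.array([1.0, 0.0, 0.0])

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R2, t2[:, None]])
pts1 = project(X, K, np.eye(3), np.zeros(3))
pts2 = project(X, K, R2, t2)

X_dlt = triangulate_dlt(P1, P2, pts1, pts2)
print("max abs 3D error:", np.abs(X_dlt - X).max())
```

With noisy matches the four rows no longer share an exact null vector, and the smallest singular vector becomes a least-squares compromise rather than a true intersection, which is the subtlety Section 13.6 discusses.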

Step 6: Fix the scale and measure 3D error

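Depth from two views is recoverable only up to scale (the chapter's recurring caveat), so in general the recovered cube is the right shape but the wrong size. Estimate the single scale factor that best aligns your points to ground truth, apply it, and report the residual 3D error. In this lab the true baseline happens to be exactly 1 metre, the same length as the unit translation that recoverPose returns, so expect a scale factor very close to 1; choose a different baseline length in Step 1 to watch the ambiguity bite.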

# Centre both clouds, then find the scalar that best matches their sizes.
Xc_gt = X - X.mean(0)
Xc_est = X_est - X_est.mean(0)
scale = (Xc_gt * Xc_est).sum() / (Xc_est * Xc_est).sum()

# TODO: build X_aligned = scale * Xc_est + X.mean(0) and print the mean
# Euclidean distance between X_aligned and the ground-truth X.
Hint

X_aligned = scale * Xc_est + X.mean(0); err = np.linalg.norm(X_aligned - X, axis=1).mean(). On noiseless input this should be tiny (well below a millimetre), confirming the cube was reconstructed up to the expected scale freedom.

Step 7: Audit with reprojection error

The honest final check, and the quantity Chapter 14's bundle adjustment minimizes: project your triangulated points back into both images and measure how far they land from the original matches. Low reprojection error is the certificate that estimation, pose recovery, and triangulation all agree.

def reproj_error(X3d, pts, P):
    h = np.hstack([X3d, np.ones((len(X3d), 1))]) @ P.T
    proj = h[:, :2] / h[:, 2:3]
    return np.linalg.norm(proj - pts, axis=1).mean()

# TODO: print the mean reprojection error in both views using X_est
# (the unscaled triangulation), P1 with pts1, and P2 with pts2.
print("inliers used by recoverPose:", n_in)
Hint

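print(reproj_error(X_est, pts1, P1), reproj_error(X_est, pts2, P2)). Use the unscaled X_est here, not X_aligned: X_est was triangulated against P2, which carries the unit-length t_est, so the points and the cameras share one scale. Reprojection error is unchanged only when the points and the translation are rescaled together; rescaling the points alone shifts their projections in view 2. Both numbers should be a small fraction of a pixel.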
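
The scale argument can be checked numerically. The sketch below uses the ground-truth rig and an arbitrary, assumed scale factor of 2.5: scaling the points and the translation together leaves the view-2 reprojection error at zero, while scaling the points alone does not.

```python
import numpy as np

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pixels(X3d, P):
    """Project Nx3 points through a 3x4 camera matrix."""
    h = np.hstack([X3d, np.ones((len(X3d), 1))]) @ P.T
    return h[:, :2] / h[:, 2:3]

def reproj_error(X3d, pts, P):
    return np.linalg.norm(pixels(X3d, P) - pts, axis=1).mean()

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]
R2, t2 = rot_y(-0.15), np.array([1.0, 0.0, 0.0])

s = 2.5                                              # arbitrary global scale
P2 = K @ np.hstack([R2, t2[:, None]])
P2_scaled = K @ np.hstack([R2, (s * t2)[:, None]])   # same rotation, longer baseline
pts2 = pixels(X, P2)

e_same = reproj_error(X, pts2, P2)                   # original scene and camera
e_joint = reproj_error(s * X, pts2, P2_scaled)       # both rescaled together
e_points_only = reproj_error(s * X, pts2, P2)        # points rescaled, camera not

print(e_same, e_joint, e_points_only)
```

View 1 would not notice either change, because a camera at the origin projects a point and any positive multiple of it to the same pixel; the inconsistency only shows up in the second view.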

Expected Output

The script prints, in order: an epipolar residual maximum near 1e-12 (Step 3); a rotation error and a translation-direction error both within a small fraction of a degree (Step 4); a mean 3D alignment error well below a millimetre after scale fixing (Step 6); and two reprojection errors, one per view, each a tiny fraction of a pixel (Step 7), alongside the inlier count (all matches, since the input is noiseless). The takeaway is concrete: from nothing but two pixel-coordinate lists and $K$, you recovered the camera motion and the 3D cube up to scale, and every internal consistency check passed.

Stretch Goals

  • Add Gaussian pixel noise to pts1 and pts2 (for example pts + np.random.normal(0, 0.5, pts.shape)) and plot how rotation error, translation-direction error, and reprojection error grow as the noise standard deviation rises from 0 to 2 pixels. This reproduces in miniature why real pipelines wrap every estimator in RANSAC and a refinement step.
  • Make depth metric instead of up-to-scale: assume you know the true baseline length and rescale the reconstruction by the ratio of the known baseline to the length of the recovered unit translation, then verify the cube edge length comes out at 1 metre. With the default t2 the baseline is already 1 metre and the ratio is trivially 1, so first change t2 to, say, [0.5, 0, 0]. This is the metric-pinning move of Section 13.5.
  • Library shortcut, the Right Tool principle in action: replace your hand-rolled scale alignment in Step 6 with a single call to cv2.estimateAffine3D (or a Procrustes/Umeyama similarity fit) to register the two point clouds, and visualize both with a Matplotlib 3D scatter. Note how much of the manual centring and scaling the library absorbs.
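
For the Procrustes/Umeyama option in the last stretch goal, a closed-form similarity fit takes only a few lines of NumPy. The sketch below is one common formulation, tested on made-up data with a known transform; the function name umeyama and the test values are assumptions for illustration.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~= s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d
    cov = D.T @ S / len(src)                  # cross-covariance of the centred clouds
    U, sig, Vt = np.linalg.svd(cov)
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0                          # forbid reflections
    R = U @ np.diag(d) @ Vt
    var_s = (S ** 2).sum() / len(src)         # variance of the source cloud
    s = (sig * d).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: apply a known similarity, then recover it.
rng = np.random.default_rng(0)
src = rng.normal(size=(27, 3))
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 2.0, np.array([0.5, -1.0, 3.0])
dst = s_true * src @ R_true.T + t_true

s, R, t = umeyama(src, dst)
aligned = s * src @ R.T + t
print("recovered scale:", s)
print("max alignment error:", np.abs(aligned - dst).max())
```

Unlike the scale-only fit of Step 6, this also absorbs a rotation and translation between the two clouds, so it still works when the reconstruction frame and the ground-truth frame differ.
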
Complete Solution
import cv2
import numpy as np

np.random.seed(0)

K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0,   0,   1]], float)

g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]

R1, t1 = np.eye(3), np.zeros(3)
R2, _ = cv2.Rodrigues(np.array([0.0, -0.15, 0.0]))
t2 = np.array([1.0, 0.0, 0.0])

def project(X, K, R, t):
    Xc = X @ R.T + t
    x = Xc @ K.T
    return x[:, :2] / x[:, 2:3]

pts1 = project(X, K, R1, t1)
pts2 = project(X, K, R2, t2)

E, mask = cv2.findEssentialMat(pts1, pts2, K,
                               method=cv2.RANSAC, prob=0.999, threshold=1.0)

def to_rays(pts, K):
    h = np.hstack([pts, np.ones((len(pts), 1))])
    return h @ np.linalg.inv(K).T

r1, r2 = to_rays(pts1, K), to_rays(pts2, K)
res = np.einsum('ij,jk,ik->i', r2, E, r1)
print("max epipolar residual:", np.abs(res).max())

n_in, R_est, t_est, _ = cv2.recoverPose(E, pts1, pts2, K)
R_gt = R2
t_gt = t2 / np.linalg.norm(t2)
rot_cos = np.clip((np.trace(R_est @ R_gt.T) - 1) / 2, -1, 1)  # guard arccos against round-off
rot_err = np.degrees(np.arccos(rot_cos))
trans_err = np.degrees(np.arccos(np.clip(abs(t_est.ravel() @ t_gt), -1, 1)))
print("rotation error (deg):", rot_err)
print("translation direction error (deg):", trans_err)

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R_est, t_est])
Xh = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
X_est = (Xh[:3] / Xh[3]).T

Xc_gt = X - X.mean(0)
Xc_est = X_est - X_est.mean(0)
scale = (Xc_gt * Xc_est).sum() / (Xc_est * Xc_est).sum()
X_aligned = scale * Xc_est + X.mean(0)
print("mean 3D error after scale fix:",
      np.linalg.norm(X_aligned - X, axis=1).mean())

def reproj_error(X3d, pts, P):
    h = np.hstack([X3d, np.ones((len(X3d), 1))]) @ P.T
    proj = h[:, :2] / h[:, 2:3]
    return np.linalg.norm(proj - pts, axis=1).mean()

print("reprojection error view 1:", reproj_error(X_est, pts1, P1))
print("reprojection error view 2:", reproj_error(X_est, pts2, P2))
print("inliers used by recoverPose:", n_in)

What's Next?

Two views recover depth along a single baseline; the obvious next question is what happens with two hundred views, or with a video stream from a camera that never stops moving. Chapter 14: Structure from Motion & Visual SLAM generalizes everything built here: pairwise relative poses chain into global camera trajectories, triangulated points merge into a single 3D model, and bundle adjustment polishes both simultaneously. The two-view estimators of this chapter literally run inside those systems, thousands of times per reconstruction, which is why getting them right here pays off there. The Hands-On Lab above is a one-pair instance of exactly that inner loop, reprojection-error audit and all.

Bibliography & Further Reading

Foundational Papers

Longuet-Higgins, H. C. "A Computer Algorithm for Reconstructing a Scene from Two Projections." Nature 293 (1981). doi:10.1038/293133a0
The paper that started two-view geometry as a computational subject: the essential matrix and the eight-point algorithm, published in Nature of all places. Section 13.2 is its direct descendant.
Hartley, R. "In Defense of the Eight-Point Algorithm." IEEE TPAMI 19(6) (1997). doi:10.1109/34.601246
The famous rescue: a one-page normalization fix that turned a numerically disgraced algorithm into the standard initializer. Section 13.2 implements it line by line.
Nistér, D. "An Efficient Solution to the Five-Point Relative Pose Problem." IEEE TPAMI 26(6) (2004). doi:10.1109/TPAMI.2004.17
The minimal solver for the calibrated case, used inside RANSAC loops everywhere; the engine behind OpenCV's findEssentialMat discussed in Section 13.2.
Hartley, R. and Sturm, P. "Triangulation." Computer Vision and Image Understanding 68(2) (1997). doi:10.1006/cviu.1997.0547
Why "intersect two rays" is not as easy as it sounds, and the optimal answer that minimizes reprojection error. The backbone of Section 13.6.
Scharstein, D. and Szeliski, R. "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms." IJCV 47 (2002). doi:10.1023/A:1014573219977
The paper that organized stereo matching into the cost-aggregation-optimization-refinement template Section 13.4 follows, and that launched the Middlebury benchmark.
Hirschmüller, H. "Stereo Processing by Semiglobal Matching and Mutual Information." IEEE TPAMI 30(2) (2008). doi:10.1109/TPAMI.2007.1166
Semi-global matching: the dynamic-programming compromise between local block matching and intractable global optimization. Still shipping in products two decades later as Section 13.4's SGBM.
Brown, M. and Lowe, D. "Automatic Panoramic Image Stitching using Invariant Features." IJCV 74 (2007). doi:10.1007/s11263-006-0002-3
The complete panorama recipe (SIFT, RANSAC homographies, bundle adjustment, multi-band blending) that Section 13.3 walks through and that OpenCV's Stitcher implements.

Recent Research (2024-2026)

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., and Revaud, J. "DUSt3R: Geometric 3D Vision Made Easy." CVPR (2024). arXiv:2312.14132
Two uncalibrated images in, dense 3D pointmaps out, no explicit epipolar pipeline anywhere. The strongest current challenge to this chapter's decomposition, discussed in Sections 13.1 and 13.6.
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. "VGGT: Visual Geometry Grounded Transformer." CVPR (2025). arXiv:2503.11651
A feed-forward transformer that predicts cameras, depth maps, and 3D points for many views in one pass; the 2025 state of the "geometry as regression" research line.
Wen, B., Trepte, M., Aribido, J., Kautz, J., Gallo, O., and Birchfield, S. "FoundationStereo: Zero-Shot Stereo Matching." CVPR (2025). arXiv:2501.09898
A stereo foundation model that generalizes across domains without fine-tuning; where Section 13.4's matching problem stands in 2025.
Lipson, L., Teed, Z., and Deng, J. "RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching." 3DV (2021). arXiv:2109.07547
The recurrent refinement architecture that made learned stereo robust enough for products, and the design most 2024-2026 stereo networks still build on.
Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., and Zhao, H. "Depth Anything V2." NeurIPS (2024). arXiv:2406.09414
Monocular depth from a foundation model: the single-camera rival that Section 13.5 weighs against stereo's metric guarantees.

Books

Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, 2nd edition. Cambridge University Press (2004). robots.ox.ac.uk/~vgg/hzbook
The definitive reference for everything in this chapter, with full proofs. When this chapter says "it can be shown", the showing is in here.
Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book
Chapters 11 to 13 cover structure from motion, depth estimation (including stereo correspondence), and 3D reconstruction with an engineering eye and exhaustive references; free online.

Tools & Libraries

OpenCV. "Camera Calibration and 3D Reconstruction" (calib3d) documentation. docs.opencv.org
The reference for every function this chapter calls: findFundamentalMat, findEssentialMat, recoverPose, stereoRectify, StereoSGBM, triangulatePoints, reprojectImageTo3D.
Middlebury Stereo Vision Page. vision.middlebury.edu/stereo
The canonical stereo datasets and leaderboard since 2002: ground-truth disparities for testing everything Section 13.4 builds.
KITTI Stereo / Scene Flow Benchmark. cvlibs.net/datasets/kitti
Real driving imagery with LiDAR ground truth: the benchmark that measures how Section 13.4 and 13.5 methods behave outside the lab.