"Close one eye and the world goes politely flat. Open both, and geometry quietly hands the third dimension back. I do exactly the same thing, just with more linear algebra and fewer eyelashes."
A Binocular Rig With a Modest Baseline
Projection destroys depth; a second view restores it, and this chapter is the complete account of how. One photograph collapses every point along a viewing ray onto a single pixel, so distance is unrecoverable. Add a second photograph from a known (or recoverable) position and each pixel pair becomes an intersection problem: two rays, one 3D point. Everything in between (epipolar constraints, the essential and fundamental matrices, rectification, disparity, triangulation) is the machinery that turns that intersection idea into working code. The same machinery scales from two views to thousands in Chapter 14 and supplies the camera poses that neural scene representations in Chapter 27 quietly depend on.
Chapter Overview
Chapter 12 ended with a fully characterized camera: intrinsics that map rays to pixels, extrinsics that place the camera in the world, and distortion coefficients that straighten what the lens bent. But a single calibrated camera still cannot answer the most natural question about a photograph: how far away is that? The information was destroyed at exposure time, when every point along a 3D ray landed on the same pixel. This chapter adds the one ingredient that makes depth recoverable: a second view. Two views taken from different positions see the world along different rays, and where pairs of rays intersect, 3D structure lives.
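The collapse of depth described above can be checked in a few lines. The sketch below assumes an illustrative intrinsic matrix `K` (placeholder values, not taken from Chapter 12) and a hypothetical helper `project`; it samples one viewing ray at three depths and shows that all three points land on the same pixel.

```python
import numpy as np

# Illustrative intrinsics (assumed values): 800 px focal length, centre (320, 240).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(X, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixels."""
    x = X @ K.T
    return x[:, :2] / x[:, 2:3]

# One viewing ray through the camera centre, sampled at three depths.
direction = np.array([0.1, -0.05, 1.0])
depths = np.array([2.0, 5.0, 50.0])
points_on_ray = depths[:, None] * direction

pixels = project(points_on_ray, K)
print(pixels)  # the same pixel (400, 200) three times, up to round-off
```

Nothing in `pixels` distinguishes the point at 2 units from the one at 50, which is the ambiguity a second view removes.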
The chapter opens with geometry before algorithms. Section 13.1 establishes the epipolar constraint, the surprising fact that a point in one image confines its match in the other image to a single line, collapsing correspondence from a 2D search into a 1D one. Section 13.2 packages that constraint into two famous $3 \times 3$ matrices, the essential matrix (calibrated) and the fundamental matrix (uncalibrated), and shows how to estimate them from the point matches that Chapter 10 taught you to produce, including the normalization trick that rescued the eight-point algorithm from numerical infamy. Section 13.3 takes a deliberate detour into the special case where two views are related point-to-point rather than point-to-line: planar scenes and rotating cameras, the realm of the homography, and the reason your phone can stitch panoramas.
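As a preview of the constraint that Sections 13.1 and 13.2 develop, the sketch below builds $E = [t]_\times R$ from an assumed relative pose, views a handful of random points from both cameras, and checks that $x'^\top E x$ vanishes on normalized rays. The pose values, the random scene, and the helper name `skew` are illustrative assumptions.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

rng = np.random.default_rng(0)

# Assumed relative pose, with the convention X2 = R @ X1 + t.
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R

# Random 3D points in front of camera 1, expressed in both camera frames.
X1 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(10, 3))
X2 = X1 @ R.T + t

# Normalized rays: divide each point by its own depth.
x1 = X1 / X1[:, 2:3]
x2 = X2 / X2[:, 2:3]

residuals = np.einsum('ij,jk,ik->i', x2, E, x1)  # x'^T E x, one value per match
print(np.abs(residuals).max())                   # numerically zero
```

For a fixed `x1[i]`, the product `E @ x1[i]` gives the coefficients of the line in the second image on which `x2[i]` must lie, which is the 2D-to-1D collapse of the correspondence search.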
The second half is about density and metric depth. Section 13.4 rectifies a stereo pair so that epipolar lines become horizontal scanlines, then estimates disparity for every pixel with block matching and semi-global matching. Section 13.5 converts disparity into metric depth through one elegant formula, $Z = fB/d$, and confronts its consequences: depth error grows quadratically with distance, which dictates how every stereo product is engineered. Section 13.6 closes the loop with triangulation, recovering individual 3D points from matched pairs the proper way, including why "intersect the two rays" is subtler than it sounds when the rays, thanks to noise, never actually meet.
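The quadratic growth follows from differentiating $Z = fB/d$: $|\partial Z / \partial d| = fB/d^2 = Z^2/(fB)$, so a fixed disparity uncertainty $\delta d$ produces a depth uncertainty of roughly $Z^2 \, \delta d / (fB)$. The sketch below tabulates that first-order estimate for an assumed rig; the focal length, baseline, and disparity uncertainty are made-up illustration values.

```python
# Assumed rig parameters, for illustration only.
f_px = 800.0         # focal length in pixels
baseline_m = 0.12    # baseline in metres
disp_err_px = 0.25   # disparity uncertainty in pixels

def depth_from_disparity(d_px):
    """Z = f * B / d for a rectified pair."""
    return f_px * baseline_m / d_px

def depth_error(Z_m):
    """First-order error: |dZ/dd| * delta_d = Z^2 / (f * B) * delta_d."""
    return Z_m ** 2 / (f_px * baseline_m) * disp_err_px

for Z in (1.0, 2.0, 4.0, 8.0):
    d = f_px * baseline_m / Z
    print(f"Z = {Z:4.1f} m   disparity = {d:6.2f} px   "
          f"depth error ~ {depth_error(Z) * 100:6.2f} cm")
```

Each doubling of distance halves the disparity and quadruples the estimated depth error, which is why widening the baseline or lengthening the focal length is the usual lever for long-range stereo.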
A theme worth tracking: this chapter is where the camera matrix $K$ from Chapter 12 earns its keep, where the matched keypoints and RANSAC verification of Chapter 10 become inputs rather than outputs, and where the homographies first met in Chapter 5 reappear with a physical interpretation. Classical two-view geometry is also remarkably alive: the disparity estimators of Section 13.4 are now recurrent neural networks, and 2024-2026 systems like DUSt3R and VGGT regress 3D structure directly from image pairs. They did not make the geometry obsolete; they made fluency in it the entry ticket for understanding what those models predict and how they are evaluated.
Prerequisites
This chapter builds directly on Chapter 12: Camera Models & Calibration. The pinhole model, the intrinsic matrix $K$, homogeneous coordinates, and rotation-translation extrinsics are used on nearly every page. The point correspondences that feed every estimator come from the detect-describe-match-verify pipeline of Chapter 10: Keypoints, Descriptors & Matching, and RANSAC from Section 10.6 returns here as the standard wrapper around every geometric fit. Section 13.3 assumes the warping and interpolation machinery of Chapter 5: Geometric Transformations & Image Warping, and its blending discussion echoes the pyramids of Chapter 4. Comfort with the SVD as a linear algebra tool (least-squares null spaces, rank constraints) is assumed; each use is explained in context, and Appendix A: Mathematical Foundations gives a compact refresher on the null-space and rank facts every estimator in this chapter relies on.
Chapter Roadmap
- 13.1 Epipolar Geometry: The Geometry of Two Views. Baselines, epipoles, epipolar planes and lines: why a point in one image pins its match in the other to a single line, and what that buys every algorithm downstream.
- 13.2 Essential & Fundamental Matrices. The epipolar constraint as algebra: $E = [t]_\times R$, the fundamental matrix for uncalibrated cameras, the normalized eight-point algorithm, robust estimation, and recovering camera motion from $E$.
- 13.3 Homographies & Panorama Stitching. When two views map point-to-point: planar scenes and rotating cameras, DLT estimation, homography-versus-fundamental model selection, and the full panorama pipeline with its parallax failure modes.
- 13.4 Stereo Rectification & Disparity Estimation. Warping a stereo pair so epipolar lines become scanlines, then matching every pixel: block matching, semi-global matching, SGBM parameters that matter, and disparity hygiene.
- 13.5 From Disparity to Depth Maps. The formula $Z = fB/d$ and its consequences: quadratic depth error, baseline design trade-offs, converting disparity maps to metric point clouds, and stereo against ToF, LiDAR, and learned monocular depth.
- 13.6 Triangulation & 3D Point Recovery. Recovering 3D points from matched pairs when noisy rays never quite intersect: the midpoint method, linear DLT triangulation, reprojection error, cheirality, and a complete two-view reconstruction pipeline.
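To make the "normalized eight-point algorithm" entry of the roadmap concrete, here is a minimal NumPy sketch of the idea: shift and scale each point set, solve a homogeneous linear system by SVD, enforce rank 2, then undo the normalization. The function names, the synthetic scene, and the pose are assumptions made for illustration; the sketch has no outlier handling, which is the job of the robust wrappers discussed in Section 13.2.

```python
import numpy as np

def normalize_points(pts):
    """Hartley normalization: zero mean, average distance sqrt(2) from the origin."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return homog @ T.T, T

def eight_point(pts1, pts2):
    """Estimate F with x2^T F x1 = 0 from at least 8 outlier-free matches."""
    x1, T1 = normalize_points(pts1)
    x2, T2 = normalize_points(pts2)
    # Each match contributes one row of the linear system A f = 0 (f is F row-major).
    A = np.column_stack([x2[:, 0:1] * x1, x2[:, 1:2] * x1, x1])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization and fix the arbitrary scale.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)

def project(X, K, R, t):
    x = (X @ R.T + t) @ K.T
    return x[:, :2] / x[:, 2:3]

# Synthetic, noise-free test scene (all values assumed for illustration).
rng = np.random.default_rng(1)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.1, 0.0])
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))
pts1 = project(X, K, np.eye(3), np.zeros(3))
pts2 = project(X, K, R, t)

F = eight_point(pts1, pts2)
h1 = np.column_stack([pts1, np.ones(len(pts1))])
h2 = np.column_stack([pts2, np.ones(len(pts2))])
residuals = np.einsum('ij,jk,ik->i', h2, F, h1)
print(np.abs(residuals).max())  # close to zero on noise-free matches
```

The normalization step matters because raw pixel coordinates mix values near 1 with values in the hundreds, which leaves the linear system badly conditioned; after normalization all entries of `A` have comparable magnitude.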
If only one schema survives the week, make it the chapter's spine, a five-link chain that runs left to right: constrain, encode, estimate, match, intersect. A second view constrains each match to an epipolar line (13.1); that constraint encodes into one $3 \times 3$ matrix, $E$ or $F$ (13.2), with the homography $H$ as its point-to-point special case (13.3); robust fitting estimates the matrix and the camera motion hiding inside it (13.2); rectification turns the lines into scanlines so a dense matcher can match every pixel into a disparity (13.4); and triangulation intersects the rays, via $Z = fB/d$ for rectified pairs (13.5) or the DLT for arbitrary ones (13.6), to recover 3D points. Two phrases anchor the consequences: depth from images is recoverable only up to scale until something metric pins it, and depth precision degrades with the square of distance. Every later 3D chapter, Chapter 14's structure from motion and Chapter 27's neural scenes, runs this same five-link chain inside its inner loop. The Hands-On Lab at the end of this chapter walks all five links in one runnable program against a synthetic cube that lets you grade every step.
Hands-On Lab: A Two-View Reconstruction From Scratch
Objective
Run the chapter's whole five-link chain (constrain, encode, estimate, match, intersect) inside one runnable program. You will start from two views of a known synthetic cube, estimate the essential matrix from point matches, recover the camera motion hiding inside it, triangulate the matched points into 3D, then check your reconstruction against ground truth with reprojection error. Because the scene is synthesized with exact camera poses, the lab grades itself: you can compare your recovered rotation, translation direction, and 3D points to the values that generated the images, something no real dataset lets you do.
What You'll Practice
- Projecting known 3D points through two calibrated cameras to build a self-checking correspondence set (the pinhole model of Chapter 12 run forward, from 3D points to pixels).
- Estimating the essential matrix from matches and recovering relative pose with the cheirality test (Section 13.2).
- Verifying the epipolar constraint $x'^\top E x = 0$ numerically on your own matches (Section 13.1).
- Triangulating matched pairs into 3D points with the linear DLT method and resolving the up-to-scale ambiguity (Section 13.6).
- Auditing a reconstruction with reprojection error, the same metric bundle adjustment minimizes in Chapter 14.
Setup
Two libraries and no dataset; the script generates its own cube and cameras, so it always runs to completion. Install with:
pip install opencv-python numpy
Everything runs on the CPU in well under a second. Matplotlib is optional and only used by the final stretch goal for a 3D scatter of the recovered points.
Steps
Step 1: Build a synthetic scene and two cameras
Define a $3 \times 3 \times 3$ grid of 27 points filling a unit cube (its eight corners plus the edge, face, and body centres) and two calibrated cameras with the same intrinsics $K$ but different poses. The second camera is the first translated sideways (a baseline) and rotated slightly toward the scene, exactly the rig of Section 13.1.
import cv2
import numpy as np
np.random.seed(0)
# Shared intrinsics: 800 px focal length, principal point at image center.
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]], float)
# A cube of 3D points in world coordinates, sitting in front of the cameras.
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]
# Camera 1 is the world origin (R = I, t = 0).
R1, t1 = np.eye(3), np.zeros(3)
# TODO: define camera 2's pose. Build a small rotation R2 about the y axis
# (use cv2.Rodrigues on the vector [0, -0.15, 0]) and a translation
# t2 = [1.0, 0, 0] (a 1-metre sideways baseline). Keep them as (3,3) and (3,).
Hint
R2, _ = cv2.Rodrigues(np.array([0.0, -0.15, 0.0])) gives the rotation matrix; t2 = np.array([1.0, 0.0, 0.0]) supplies the 1-metre baseline. A camera pose maps a world point $X$ to camera coordinates as $R X + t$, so these two quantities fully place camera 2.
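A quick sanity check on the pose convention, offered as a sketch: under $X_{cam} = R X + t$ the camera centre in world coordinates is $C = -R^\top t$, and the optical axis in world coordinates is $R^\top [0, 0, 1]^\top$. The snippet evaluates both for the Step 1 pose to show where camera 2 sits and which way it looks. The rotation is written out with sines and cosines instead of cv2.Rodrigues so that only NumPy is needed; the names `C2` and `axis2` are introduced here for illustration.

```python
import numpy as np

theta = -0.15  # rotation angle about the y axis, as in Step 1
R2 = np.array([[np.cos(theta), 0.0, np.sin(theta)],
               [0.0, 1.0, 0.0],
               [-np.sin(theta), 0.0, np.cos(theta)]])
t2 = np.array([1.0, 0.0, 0.0])

C2 = -R2.T @ t2                           # camera 2 centre in world coordinates
axis2 = R2.T @ np.array([0.0, 0.0, 1.0])  # camera 2 optical axis in world coordinates

print("camera 2 centre:", C2)
print("camera 2 optical axis:", axis2)
print("baseline length:", np.linalg.norm(C2))
```

The centre has a negative x coordinate while the optical axis has a positive one: with this convention a positive x translation places camera 2 on the negative-x side of camera 1, turned back toward the cube at $z = 6$. The baseline length equals $\lVert t \rVert$ because $R$ is a rotation.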
Step 2: Project the cube into both images
Push every 3D point through each camera with the projection $x \sim K (R X + t)$ to get two sets of pixel coordinates. These perfectly corresponding pixel pairs are the matches that Chapter 10 would have produced from a real image pair, with zero outliers so you can isolate the geometry.
def project(X, K, R, t):
    Xc = X @ R.T + t             # world -> camera coordinates
    x = Xc @ K.T                 # camera -> homogeneous pixels
    return x[:, :2] / x[:, 2:3]  # perspective divide
pts1 = project(X, K, R1, t1)
# TODO: produce pts2 the same way using R2 and t2 from Step 1.
Hint
pts2 = project(X, K, R2, t2). Each row of pts1 and the matching row of pts2 show the same cube point seen from the two views, the noiseless ideal of a verified match.
Step 3: Estimate the essential matrix and check the epipolar constraint
Estimate $E$ from the matches with OpenCV's RANSAC-wrapped solver (Section 13.2), then verify it: every match should satisfy $x'^\top E x \approx 0$ when the points are expressed as normalized homogeneous rays $K^{-1}[x,y,1]^\top$ (Section 13.1).
E, mask = cv2.findEssentialMat(pts1, pts2, K,
                               method=cv2.RANSAC, prob=0.999, threshold=1.0)
def to_rays(pts, K):
    h = np.hstack([pts, np.ones((len(pts), 1))])
    return h @ np.linalg.inv(K).T  # normalized image-plane rays
r1, r2 = to_rays(pts1, K), to_rays(pts2, K)
# TODO: compute the per-match epipolar residual r2[i] @ E @ r1[i] for all i
# and print its maximum absolute value. It should be tiny (about 1e-12).
Hint
res = np.einsum('ij,jk,ik->i', r2, E, r1) gives one residual per match; print(np.abs(res).max()). A near-zero maximum confirms your $E$ encodes the same epipolar geometry that generated the points.
Step 4: Recover the camera motion from E
Decompose $E$ into a rotation and a unit translation direction with cv2.recoverPose, which runs the cheirality test of Section 13.2 to pick the one physically valid solution out of the four that $E$ admits. Compare the result to the ground-truth pose from Step 1.
n_in, R_est, t_est, _ = cv2.recoverPose(E, pts1, pts2, K)
# Ground-truth relative pose (camera 1 -> camera 2).
R_gt = R2
t_gt = t2 / np.linalg.norm(t2) # recoverPose returns t up to scale
# TODO: print the rotation error in degrees and the angle between the
# estimated and ground-truth translation directions. Both should be ~0.
Hint
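Rotation error: ang = np.degrees(np.arccos(np.clip((np.trace(R_est @ R_gt.T) - 1) / 2, -1, 1))); the clip keeps round-off from pushing the argument just past 1 and turning a perfect estimate into NaN. Translation direction error: np.degrees(np.arccos(np.clip(abs(t_est.ravel() @ t_gt), -1, 1))). The abs is a safeguard against a sign flip: $E$ alone fixes the baseline direction only up to sign, and it is the cheirality test inside cv2.recoverPose that normally settles it.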
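To see where the "four solutions" come from, the sketch below decomposes an essential matrix by SVD into its four $(R, t)$ candidates and then applies a cheirality test by solving for the depth of each point along its two rays. It is a NumPy illustration of the idea, not the code inside cv2.recoverPose; the helper names, the random scene, and the use of a least-squares depth solve are assumptions of this sketch.

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs if needed so that the candidates are proper rotations.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    Ra, Rb = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]
    return [(Ra, t), (Ra, -t), (Rb, t), (Rb, -t)]

def depths(x1, x2, R, t):
    """Solve lam2 * x2 = lam1 * R @ x1 + t for the two depths (least squares)."""
    A = np.column_stack([R @ x1, -x2])
    lam, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return lam

def count_in_front(x1s, x2s, R, t):
    """Number of matches whose depth is positive in both cameras."""
    return sum(bool(np.all(depths(x1, x2, R, t) > 0)) for x1, x2 in zip(x1s, x2s))

# Ground-truth pose of Step 1 and a small random scene (assumed for illustration).
theta = -0.15
R_true = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
t_true = np.array([1.0, 0.0, 0.0])
rng = np.random.default_rng(0)
X = rng.uniform([-0.5, -0.5, 5.5], [0.5, 0.5, 6.5], size=(12, 3))
X2 = X @ R_true.T + t_true
x1s, x2s = X / X[:, 2:3], X2 / X2[:, 2:3]   # normalized rays in each view

E = skew(t_true) @ R_true
candidates = decompose_essential(E)
counts = [count_in_front(x1s, x2s, R, t) for R, t in candidates]
R_best, t_best = candidates[int(np.argmax(counts))]
print("points in front of both cameras, per candidate:", counts)
print("best candidate matches ground truth:",
      np.allclose(R_best, R_true) and np.allclose(t_best, t_true))
```

All four candidates satisfy the epipolar constraint equally well; only the requirement that the scene lie in front of both cameras separates them.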
Step 5: Triangulate the matches into 3D
Build the two camera projection matrices $P_1 = K[I \mid 0]$ and $P_2 = K[R_{est} \mid t_{est}]$, then call cv2.triangulatePoints, the linear DLT of Section 13.6. The output is homogeneous; divide through to get Euclidean 3D points.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R_est, t_est]) # t_est is already a column vector
Xh = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
# TODO: convert the 4xN homogeneous output Xh into an Nx3 array of
# Euclidean points by dividing the first three rows by the fourth.
Hint
X_est = (Xh[:3] / Xh[3]).T. These points live in the reconstruction's own coordinate frame and are correct only up to a single global scale, because $t$ was recovered as a unit vector.
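For readers who want to see inside the call, the following is a minimal NumPy sketch of linear DLT triangulation: each view contributes two linear equations in the homogeneous point, and the solution is the right singular vector with the smallest singular value. The function name `triangulate_dlt` is hypothetical, and the demo uses the ground-truth pose of Step 1 (rather than the estimated one), so here the points come back at true scale.

```python
import numpy as np

def triangulate_dlt(P1, P2, pts1, pts2):
    """Linear triangulation: one 4x4 homogeneous system per match, solved by SVD."""
    X = np.empty((len(pts1), 3))
    for i, ((u1, v1), (u2, v2)) in enumerate(zip(pts1, pts2)):
        # From x ~ P X: u * (p3 . X) - (p1 . X) = 0 and v * (p3 . X) - (p2 . X) = 0.
        A = np.stack([u1 * P1[2] - P1[0],
                      v1 * P1[2] - P1[1],
                      u2 * P2[2] - P2[0],
                      v2 * P2[2] - P2[1]])
        Xh = np.linalg.svd(A)[2][-1]   # null vector of A (homogeneous 3D point)
        X[i] = Xh[:3] / Xh[3]
    return X

def project(X, P):
    h = np.column_stack([X, np.ones(len(X))]) @ P.T
    return h[:, :2] / h[:, 2:3]

# Scene and cameras as in Step 1, with the rotation written out in NumPy.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = -0.15
R2 = np.array([[np.cos(theta), 0.0, np.sin(theta)],
               [0.0, 1.0, 0.0],
               [-np.sin(theta), 0.0, np.cos(theta)]])
t2 = np.array([[1.0], [0.0], [0.0]])
g = np.linspace(-0.5, 0.5, 3)
X_true = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R2, t2])
pts1, pts2 = project(X_true, P1), project(X_true, P2)

X_rec = triangulate_dlt(P1, P2, pts1, pts2)
print(np.abs(X_rec - X_true).max())  # close to zero on noise-free matches
```

With noisy matches the four equations no longer share an exact solution, and the SVD returns the point that minimizes an algebraic residual rather than the reprojection error, which is one reason a refinement step usually follows.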
Step 6: Fix the scale and measure 3D error
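Depth from two views is recoverable only up to scale (the chapter's recurring caveat), so in general the recovered cube is the right shape but the wrong size. In this lab the true baseline happens to be exactly one unit long, matching the unit-length translation that cv2.recoverPose returns, so expect a scale factor close to 1; change the length of t2 in Step 1 to see it move. Estimate the single scale factor that best aligns your points to ground truth, apply it, and report the residual 3D error.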
# Centre both clouds, then find the scalar that best matches their sizes.
Xc_gt = X - X.mean(0)
Xc_est = X_est - X_est.mean(0)
scale = (Xc_gt * Xc_est).sum() / (Xc_est * Xc_est).sum()
# TODO: build X_aligned = scale * Xc_est + X.mean(0) and print the mean
# Euclidean distance between X_aligned and the ground-truth X.
Hint
X_aligned = scale * Xc_est + X.mean(0); err = np.linalg.norm(X_aligned - X, axis=1).mean(). On noiseless input this should be tiny (well below a millimetre), confirming the cube was reconstructed up to the expected scale freedom.
Step 7: Audit with reprojection error
The honest final check, and the quantity Chapter 14's bundle adjustment minimizes: project your triangulated points back into both images and measure how far they land from the original matches. Low reprojection error is the certificate that estimation, pose recovery, and triangulation all agree.
def reproj_error(X3d, pts, P):
    h = np.hstack([X3d, np.ones((len(X3d), 1))]) @ P.T
    proj = h[:, :2] / h[:, 2:3]
    return np.linalg.norm(proj - pts, axis=1).mean()
# TODO: print the mean reprojection error in both views using X_est
# (the unscaled triangulation), P1 with pts1, and P2 with pts2.
print("inliers used by recoverPose:", n_in)
Hint
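print(reproj_error(X_est, pts1, P1), reproj_error(X_est, pts2, P2)). Use the unscaled X_est here, not X_aligned: X_est was triangulated in the same frame and at the same scale as P1 and P2, so it is the point set those matrices should reproject. Reprojection error cannot see the global scale, because multiplying the points and the baseline by the same factor leaves every pixel unchanged; rescaling the points alone would break their agreement with P2. Both numbers should be a small fraction of a pixel.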
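The scale freedom behind this check can be demonstrated directly. In the sketch below, which reuses the Step 1 scene with the rotation written out in NumPy and an arbitrary assumed factor `s`, scaling the points and the translation together leaves every pixel where it was, while scaling the points alone moves them.

```python
import numpy as np

def project(X, K, R, t):
    x = (X @ R.T + t) @ K.T
    return x[:, :2] / x[:, 2:3]

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = -0.15
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]

s = 3.7  # arbitrary global scale factor (assumed for illustration)
pix = project(X, K, R, t)
pix_joint = project(s * X, K, R, s * t)    # scene and baseline scaled together
pix_points_only = project(s * X, K, R, t)  # points scaled, baseline left alone

shift_joint = np.abs(pix_joint - pix).max()
shift_points_only = np.abs(pix_points_only - pix).max()
print("joint scaling, max pixel shift:", shift_joint)
print("points-only scaling, max pixel shift:", shift_points_only)
```

The first shift is numerically zero because $R(sX) + st = s(RX + t)$ and the factor $s$ cancels in the perspective divide. That is why images alone cannot fix the scale, and why some metric measurement has to.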
Expected Output
The script prints, in order: an epipolar residual maximum near 1e-12 (Step 3); a rotation error and a translation-direction error both within a small fraction of a degree (Step 4); a mean 3D alignment error well below a millimetre after scale fixing (Step 6); and two reprojection errors, one per view, each a tiny fraction of a pixel (Step 7), alongside the inlier count (all matches, since the input is noiseless). The takeaway is concrete: from nothing but two pixel-coordinate lists and $K$, you recovered the camera motion and the 3D cube up to scale, and every internal consistency check passed.
Stretch Goals
- Add Gaussian pixel noise to pts1 and pts2 (for example pts + np.random.normal(0, 0.5, pts.shape)) and plot how rotation error, translation-direction error, and reprojection error grow as the noise standard deviation rises from 0 to 2 pixels. This reproduces in miniature why real pipelines wrap every estimator in RANSAC and a refinement step.
- Make depth metric instead of up-to-scale: assume you know the true baseline length (1 metre) and rescale the reconstruction by the ratio of the known baseline to the recovered unit translation, then verify the cube edge length comes out at 1 metre. This is the metric-pinning move of Section 13.5.
- Library shortcut, the Right Tool principle in action: replace your hand-rolled scale alignment in Step 6 with a single call to cv2.estimateAffine3D (or a Procrustes/Umeyama similarity fit) to register the two point clouds, and visualize both with a Matplotlib 3D scatter. Note how much of the manual centring and scaling the library absorbs.
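For the similarity-fit option in the last stretch goal, the sketch below is a small NumPy version of the Umeyama alignment: it estimates a scale, a rotation, and a translation that map one point cloud onto another in the least-squares sense. It is an illustration of the technique rather than a stand-in for the OpenCV call; the function name `umeyama` and the synthetic test transform are assumptions.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d
    cov = D.T @ S / len(src)              # 3x3 cross-covariance of the two clouds
    U, sig, Vt = np.linalg.svd(cov)
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0                      # guard against returning a reflection
    R = U @ np.diag(d) @ Vt
    var_s = (S ** 2).sum() / len(src)     # variance of the source cloud
    s = (sig * d).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: apply a known similarity, then recover it.
rng = np.random.default_rng(0)
src = rng.normal(size=(27, 3))
a, b = 0.3, -0.2
Ry = np.array([[np.cos(a), 0.0, np.sin(a)],
               [0.0, 1.0, 0.0],
               [-np.sin(a), 0.0, np.cos(a)]])
Rx = np.array([[1.0, 0.0, 0.0],
               [0.0, np.cos(b), -np.sin(b)],
               [0.0, np.sin(b), np.cos(b)]])
R_true, s_true, t_true = Ry @ Rx, 2.5, np.array([0.4, -1.0, 3.0])
dst = s_true * src @ R_true.T + t_true

s_est, R_est, t_est = umeyama(src, dst)
aligned = s_est * src @ R_est.T + t_est
print("scale:", s_est)
print("max alignment error:", np.abs(aligned - dst).max())
```

Compared with the scalar fit of Step 6, this also absorbs any rotation and translation between the two frames, which matters once the reconstruction is no longer expressed in camera 1's coordinates.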
Complete Solution
import cv2
import numpy as np
np.random.seed(0)
# Step 1: intrinsics, a 3x3x3 grid of cube points, and the two camera poses.
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]], float)
g = np.linspace(-0.5, 0.5, 3)
X = np.array([[x, y, z] for x in g for y in g for z in g]) + [0, 0, 6]
R1, t1 = np.eye(3), np.zeros(3)
R2, _ = cv2.Rodrigues(np.array([0.0, -0.15, 0.0]))
t2 = np.array([1.0, 0.0, 0.0])
# Step 2: project the points into both views.
def project(X, K, R, t):
    Xc = X @ R.T + t             # world -> camera coordinates
    x = Xc @ K.T                 # camera -> homogeneous pixels
    return x[:, :2] / x[:, 2:3]  # perspective divide
pts1 = project(X, K, R1, t1)
pts2 = project(X, K, R2, t2)
# Step 3: estimate E and check the epipolar constraint on normalized rays.
E, mask = cv2.findEssentialMat(pts1, pts2, K,
                               method=cv2.RANSAC, prob=0.999, threshold=1.0)
def to_rays(pts, K):
    h = np.hstack([pts, np.ones((len(pts), 1))])
    return h @ np.linalg.inv(K).T
r1, r2 = to_rays(pts1, K), to_rays(pts2, K)
res = np.einsum('ij,jk,ik->i', r2, E, r1)
print("max epipolar residual:", np.abs(res).max())
# Step 4: recover the relative pose and compare it with ground truth.
n_in, R_est, t_est, _ = cv2.recoverPose(E, pts1, pts2, K)
R_gt = R2
t_gt = t2 / np.linalg.norm(t2)
# Clip before arccos so round-off cannot push the argument outside [-1, 1].
rot_err = np.degrees(np.arccos(np.clip((np.trace(R_est @ R_gt.T) - 1) / 2, -1, 1)))
trans_err = np.degrees(np.arccos(np.clip(abs(t_est.ravel() @ t_gt), -1, 1)))
print("rotation error (deg):", rot_err)
print("translation direction error (deg):", trans_err)
# Step 5: triangulate the matches with the estimated pose.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R_est, t_est])
Xh = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
X_est = (Xh[:3] / Xh[3]).T
# Step 6: fit a single scale factor between the centred clouds.
Xc_gt = X - X.mean(0)
Xc_est = X_est - X_est.mean(0)
scale = (Xc_gt * Xc_est).sum() / (Xc_est * Xc_est).sum()
X_aligned = scale * Xc_est + X.mean(0)
print("mean 3D error after scale fix:",
      np.linalg.norm(X_aligned - X, axis=1).mean())
# Step 7: reprojection error of the unscaled triangulation in both views.
def reproj_error(X3d, pts, P):
    h = np.hstack([X3d, np.ones((len(X3d), 1))]) @ P.T
    proj = h[:, :2] / h[:, 2:3]
    return np.linalg.norm(proj - pts, axis=1).mean()
print("reprojection error view 1:", reproj_error(X_est, pts1, P1))
print("reprojection error view 2:", reproj_error(X_est, pts2, P2))
print("inliers used by recoverPose:", n_in)
What's Next?
Two views recover depth along a single baseline; the obvious next question is what happens with two hundred views, or with a video stream from a camera that never stops moving. Chapter 14: Structure from Motion & Visual SLAM generalizes everything built here: pairwise relative poses chain into global camera trajectories, triangulated points merge into a single 3D model, and bundle adjustment polishes both simultaneously. The two-view estimators of this chapter literally run inside those systems, thousands of times per reconstruction, which is why getting them right here pays off there. The Hands-On Lab above is a one-pair instance of exactly that inner loop, reprojection-error audit and all.