Part II: Classical Computer Vision
Chapter 13: Two-View Geometry, Stereo & Depth

Essential & Fundamental Matrices

"Nine numbers, rank two, defined up to scale, and the entire relationship between two photographs is my business. I'd say I carry this chapter, but technically I am only seven degrees of freedom."

A Quietly Rank-Deficient Fundamental Matrix
Big Picture

The entire epipolar geometry of Section 13.1 compresses into one $3 \times 3$ matrix, and that matrix can be estimated from eight point matches and then unpacked into the relative rotation and translation of the two cameras. This section does all three jobs. First the derivation: the essential matrix $E = [\mathbf{t}]_\times R$ for calibrated cameras, and its uncalibrated cousin the fundamental matrix $F$. Then estimation: the normalized eight-point algorithm built from scratch, wrapped in the RANSAC discipline of Chapter 10. Finally extraction: decomposing $E$ into the four candidate camera motions and using a depth test to pick the real one. These three steps run, in exactly this order, inside every structure-from-motion system you will meet in Chapter 14.

The previous section ended by treating the point-to-line map as a black box. Opening the box requires nothing beyond the pinhole algebra of Chapter 12 and one well-chosen cross product, and the payoff is large: once the constraint is a matrix equation, estimating it becomes linear algebra, and the camera motion you never measured becomes something you can read out of an SVD.

1. Deriving the Essential Matrix Advanced

Work in normalized camera coordinates: $\hat{\mathbf{x}} = K^{-1} \mathbf{x}$, the pixel coordinates with the intrinsics of Chapter 12 divided out, so each $\hat{\mathbf{x}}$ is a direction vector from the optical center. Let the left camera sit at the origin and the right camera be displaced by rotation $R$ and translation $\mathbf{t}$, meaning a point $\mathbf{X}_L$ in left-camera coordinates appears in right-camera coordinates as $\mathbf{X}_R = R\,\mathbf{X}_L + \mathbf{t}$.

The epipolar constraint says three vectors are coplanar: the left ray direction, the right ray direction, and the baseline. Expressed in the right camera's frame, those are $R\hat{\mathbf{x}}_L$, $\hat{\mathbf{x}}_R$, and $\mathbf{t}$. Coplanarity of three vectors means their scalar triple product vanishes:

$$ \hat{\mathbf{x}}_R \cdot \big( \mathbf{t} \times R\,\hat{\mathbf{x}}_L \big) = 0. $$

The cross product with $\mathbf{t}$ is a linear map, so write it as the skew-symmetric matrix $[\mathbf{t}]_\times$, where $[\mathbf{t}]_\times \mathbf{v} = \mathbf{t} \times \mathbf{v}$:

$$ [\mathbf{t}]_\times = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix}, \qquad \hat{\mathbf{x}}_R^\top \underbrace{[\mathbf{t}]_\times R}_{E} \, \hat{\mathbf{x}}_L = 0. $$

That is the whole derivation. The essential matrix $E = [\mathbf{t}]_\times R$ packages the unknown camera motion into one matrix, and the constraint $\hat{\mathbf{x}}_R^\top E \,\hat{\mathbf{x}}_L = 0$ must hold for every correspondence. Its structure mirrors its construction: $E$ has rank 2 (because $[\mathbf{t}]_\times$ does; $\mathbf{t}$ itself is its null vector), its two nonzero singular values are equal, and it has five degrees of freedom: three for rotation, two for the direction of translation. The missing sixth is the length of $\mathbf{t}$: from images alone, a small scene with a small baseline is indistinguishable from a large scene with a large baseline. Two-view geometry is reconstructible only up to scale, a fact that will echo through Chapter 14's structure from motion (and that stereo rigs sidestep by measuring their baseline with a ruler, as Section 13.5 exploits).

2. The Fundamental Matrix: The Same Idea, Uncalibrated Intermediate

The essential matrix lives in normalized coordinates, which you can only compute if you know $K$. Substituting $\hat{\mathbf{x}} = K^{-1}\mathbf{x}$ into the constraint gives the pixel-coordinate version,

$$ \mathbf{x}_R^\top \underbrace{K_R^{-\top} E \, K_L^{-1}}_{F} \, \mathbf{x}_L = 0, $$

defining the fundamental matrix $F$. It is the same coplanarity statement wearing pixel units, which is what makes it practical: $F$ can be estimated from raw pixel matches with no calibration whatsoever. The price of generality is structure lost: $F$ keeps rank 2 (the epipoles are its null vectors: $F\mathbf{e}_L = 0$ and $F^\top \mathbf{e}_R = 0$), but its singular values are no longer constrained to be equal, leaving seven degrees of freedom (nine entries, minus one for overall scale, minus one for the rank-2 constraint). The geometric reading of Section 13.1 is built in: $\boldsymbol{\ell}_R = F\mathbf{x}_L$ is the right-image epipolar line of $\mathbf{x}_L$, and $\boldsymbol{\ell}_L = F^\top \mathbf{x}_R$ its mate.

Key Insight: One Equation Per Match, Linear in the Unknowns

The constraint $\mathbf{x}_R^\top F \mathbf{x}_L = 0$ is bilinear in the points but linear in the nine entries of $F$. Each correspondence therefore contributes one linear equation in nine unknowns, and since $F$ is only defined up to scale, eight matches suffice to solve for it as a null-space problem. This is the deepest practical fact in the section: a quantity that encodes camera rotation, translation direction, and both intrinsic matrices can be recovered by ordinary linear least squares on point coordinates. Whenever a vision problem can be rearranged so its unknowns enter linearly, an SVD will solve it; you saw the same move in homography fitting in Chapter 10 and will see it again in triangulation in Section 13.6.

3. The Normalized Eight-Point Algorithm, From Scratch Advanced

Expanding $\mathbf{x}_R^\top F \mathbf{x}_L = 0$ for a match $(x, y) \leftrightarrow (x', y')$ (primes for the right image) gives one row of a design matrix $A$:

$$ \big(x'x,\; x'y,\; x',\; y'x,\; y'y,\; y',\; x,\; y,\; 1\big) \cdot \mathbf{f} = 0, $$

where $\mathbf{f}$ stacks the entries of $F$ row by row. With $n \ge 8$ matches, the least-squares solution is the right singular vector of $A$ with the smallest singular value, after which the rank-2 constraint is enforced by zeroing the smallest singular value of the reshaped $F$. But there is a trap, and it is famous. The columns of $A$ mix products of pixel coordinates (magnitude $\sim 10^5$ for a 1000-pixel image) with raw coordinates ($\sim 10^3$) and ones. The resulting matrix is so badly conditioned that the original eight-point algorithm earned a reputation for uselessness, until Hartley's 1997 paper "In Defense of the Eight-Point Algorithm" showed the fix is one preprocessing step: translate the points of each image to their centroid and scale them so the average distance from the origin is $\sqrt{2}$, solve in normalized coordinates, then transform the result back. The code below implements the whole thing, normalization included.

import numpy as np

def normalize_points(pts):
    """Hartley normalization: centroid to origin, mean distance sqrt(2)."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0, -s * centroid[0]],
                  [0, s, -s * centroid[1]],
                  [0, 0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])   # to homogeneous
    return (T @ pts_h.T).T, T

def eight_point(ptsL, ptsR):
    """Normalized 8-point algorithm: F such that x_R^T F x_L = 0."""
    pL, TL = normalize_points(ptsL)
    pR, TR = normalize_points(ptsR)
    # one row of A per correspondence, linear in the 9 entries of F
    A = np.column_stack([
        pR[:, 0] * pL[:, 0], pR[:, 0] * pL[:, 1], pR[:, 0],
        pR[:, 1] * pL[:, 0], pR[:, 1] * pL[:, 1], pR[:, 1],
        pL[:, 0],            pL[:, 1],            np.ones(len(pL))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)            # null-space solution
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt    # enforce rank 2
    F = TR.T @ F @ TL                   # undo the normalizations
    return F / F[2, 2]                  # fix the free scale

F_ours = eight_point(ptsL[inl], ptsR[inl])   # inlier matches from Section 13.1
print(np.round(F_ours / np.linalg.norm(F_ours), 6))
print("rank:", np.linalg.matrix_rank(F_ours))   # rank: 2
The normalized eight-point algorithm in 30 lines: Hartley normalization conditions the design matrix, the SVD null space solves the linear system, a second SVD projects onto rank 2, and the similarity transforms are folded back so the returned $F$ operates on raw pixel coordinates.

Two implementation notes repay attention. First, the rank-2 projection must come before denormalization conceptually, as written, and skipping it leaves epipolar lines that fail to meet at an epipole. Second, the quality of the result should be checked not by eyeballing matrix entries but by the residuals that matter: the epipolar distances of Section 13.1, evaluated on held-out matches. With normalization, the from-scratch $F$ typically agrees with OpenCV's to within a fraction of a pixel of epipolar distance; without it, errors of tens of pixels are routine on the same data.

Fun Note: The Best-Titled Rescue in Vision

Hartley's paper really is titled "In Defense of the Eight-Point Algorithm", and it reads like a courtroom drama: the defendant (a 1981 algorithm by Longuet-Higgins) stood accused of hopeless noise sensitivity, sophisticated nonlinear methods were the prosecution, and the defense's entire case was "your honor, nobody scaled the coordinates." Few papers have improved so many downstream systems with one similarity transform.

4. Estimation in the Real World: RANSAC and Degeneracies Intermediate

Real match sets contain outliers, so the eight-point solver never runs alone: it runs as the model-fitter inside the hypothesize-and-verify loop of Chapter 10's RANSAC, with epipolar distance as the inlier test. Sample seven or eight matches, fit, count inliers, repeat, refit on the winners. OpenCV exposes this whole stack through one call, and its modern USAC backends (notably cv2.USAC_MAGSAC, implementing MAGSAC++) replace the hard inlier threshold with a marginalization over thresholds, which in practice removes the most fragile parameter. For calibrated cameras the better tool is cv2.findEssentialMat, which runs Nistér's five-point minimal solver inside the same robust loop; fewer points per sample means exponentially fewer iterations at the same outlier rate.

Even with robustness, two degeneracies produce confident nonsense, and both are everyday occurrences rather than textbook curiosities. If all sampled matches lie on a single 3D plane (a wall, a floor, a building facade filling the frame), the data satisfies a homography, and a one-parameter family of fundamental matrices fits it perfectly; the estimate is arbitrary within that family. If the camera motion is (nearly) pure rotation, there is no baseline, no epipolar geometry exists at all, and whatever $F$ the solver returns is fiction. The practical symptom is identical in both cases: inlier counts look excellent, downstream triangulation explodes. The standard defense, used by COLMAP and ORB-SLAM alike, is to fit both a fundamental matrix and a homography to every pair and compare inlier support; when $H$ explains nearly as many matches as $F$, the pair is treated as planar or rotational, exactly the regime Section 13.3 covers.

Library Shortcut: findEssentialMat and recoverPose

The from-scratch stack of this section (normalization, eight-point solve, rank projection, RANSAC loop, decomposition, cheirality test) is about 200 lines. OpenCV compresses it to three: E, mask = cv2.findEssentialMat(ptsL, ptsR, K, method=cv2.USAC_MAGSAC, prob=0.999, threshold=1.0), then n, R, t, mask = cv2.recoverPose(E, ptsL, ptsR, K), a roughly 70-to-1 reduction. Internally the library runs the five-point solver on minimal samples (not eight-point, so planar scenes are far less degenerate), MAGSAC++ inlier scoring, the SVD decomposition of subsection 5, and the cheirality vote over all four motion candidates. What it does not do is decide whether your scene was degenerate to begin with; the $H$-versus-$F$ comparison remains your job.

5. From E to Camera Motion: Four Candidates, One Survivor Advanced

The essential matrix was built as $E = [\mathbf{t}]_\times R$, so it should be possible to take it apart again. The SVD does it: writing $E = U \operatorname{diag}(1, 1, 0) V^\top$ and defining the 90-degree rotation $W$ about the $z$-axis, the two candidate rotations are $R_1 = U W V^\top$ and $R_2 = U W^\top V^\top$ (sign-corrected to determinant $+1$), and the translation direction is $\mathbf{t} = \pm \mathbf{u}_3$, the last column of $U$. Two rotations times two translation signs gives four candidate motions, and all four reproduce $E$ exactly. Figure 13.2.1 shows what distinguishes them: in only one configuration does the scene sit in front of both cameras.

(a) R₁, +t in front of both ✓ (b) R₁, −t behind both ✗ (c) R₂, +t behind one ✗ (d) R₂, −t behind one ✗ cheirality test: triangulate matches under each candidate, keep the one with positive depths in both views
Figure 13.2.1 The four motions hiding in one essential matrix. Decomposing $E$ yields two rotations and two translation signs; all four candidate configurations satisfy the epipolar constraint, but triangulated points (red) land in front of both cameras in exactly one of them (a). The dashed rays mark cameras that would have to see through the back of their own image plane.

The disambiguation in Figure 13.2.1 is called the cheirality test: triangulate a handful of matches (using Section 13.6's machinery) under each of the four candidates and count points with positive depth in both cameras. The true motion wins by a landslide; the impostors place points behind at least one camera. cv2.recoverPose performs this vote internally, which is why it asks for the point matches and not just $E$. The recovered $\mathbf{t}$ is, as always, a unit direction: the scale ambiguity of subsection 1 is not an implementation detail but a law.

Practical Example: The Drone Mapper That Died on Flat Farmland

Who: A two-engineer team at an agritech startup building photogrammetry from fixed-wing drone surveys.

Situation: Their pipeline estimated pairwise essential matrices from SIFT matches, chained relative poses, and triangulated crop-field models. Over orchards and farm buildings it worked beautifully.

Problem: Over flat wheat fields, reconstructions intermittently shattered: camera poses jumped, triangulated terrain folded into impossible ridges. Inlier ratios looked better on the failing flights than on the successful ones, which confused the team for a week.

Decision: A flat field photographed from altitude is one big plane: the planar degeneracy of subsection 4, where a homography explains the matches and the essential-matrix fit is unconstrained along a family of solutions. Following the COLMAP playbook, they added a model-selection step: fit both $H$ and $E$ per pair, and when the homography's inlier count exceeded 85 percent of the essential matrix's, switch to homography-based pose recovery (cv2.decomposeHomographyMat, with plane-normal disambiguation across pairs).

Result: Failure rate on flat-terrain flights dropped from roughly one in three to below one in fifty, with no change to the matcher or the flight plan. The "too good" inlier ratios had been the planar tell all along.

Lesson: High inlier support means the model fits the data, not that the model is identifiable. Degenerate data fits many models perfectly; production pipelines must test for the degeneracy, not just the fit.

6. E or F? Choosing Your Matrix Beginner

A short decision rule closes the tour. Use the essential matrix whenever calibration is available: the five-point solver needs fewer RANSAC samples, the equal-singular-value structure regularizes the fit, and the output decomposes directly into $R$ and $\mathbf{t}$ for the reconstruction work of Section 13.6 and Chapter 14. Use the fundamental matrix when intrinsics are unknown or untrusted: stitching matches across photos from the internet, validating correspondences from an arbitrary camera, or the uncalibrated rectification of Section 13.4. And before either, undistort: both matrices assume the straight-ray pinhole world that Chapter 12's distortion model restores.

Research Frontier: Solvers, Learned and Otherwise

The estimation problem of this section remains an active 2024-2026 battleground on three fronts. The robust-loop front: MAGSAC++ and the broader USAC framework (both in OpenCV since 4.5) are being joined by learned inlier scorers, descendants of differentiable RANSAC (DSAC), that weight correspondences with a network before a classical solver runs; libraries like PoseLib have become the standard home for fast minimal solvers. The correspondence front: dense matchers such as RoMa (CVPR 2024) and the matching head of MASt3R (ECCV 2024) feed thousands of high-quality matches to exactly the estimators taught here, and the 2023-2025 Image Matching Challenge leaderboards are dominated by such hybrid stacks. The end-to-end front: VGGT (CVPR 2025) skips $E$ and $F$ entirely, regressing camera poses directly from pixels with a transformer, and reporting pose accuracy competitive with optimization pipelines on standard benchmarks. Notably, even these end-to-end systems are evaluated by comparing against poses computed with the classical machinery of this section; the eight-point algorithm has become the measuring stick for its own would-be replacements.

Exercise 13.2.1: Counting Degrees of Freedom Conceptual

Account for every degree of freedom: starting from nine entries, derive why $F$ has seven degrees of freedom and $E$ has five. Then explain (a) why seven matches suffice in principle to estimate $F$ (the seven-point algorithm) and what goes wrong with the linear null-space approach at exactly seven points, and (b) which two constraints distinguish a valid $E$ from a generic rank-2 matrix, and how cv2.findEssentialMat's five-point solver exploits them.

Exercise 13.2.2: Normalization Ablation Coding

Take this section's eight_point and produce an un-normalized variant by deleting the two normalize_points calls and the denormalization line. On a real image pair with at least 200 inlier matches, run both variants on 100 random subsets of 20 matches each, and for every fit compute the mean symmetric epipolar distance on the held-out matches. Report the median and the 90th percentile error for each variant, and the condition number of the design matrix $A$ in both cases. How large is Hartley's one-transform improvement on your data?

Exercise 13.2.3: Watching the Cheirality Vote Analysis

Estimate $E$ for a calibrated image pair, decompose it by hand with the SVD recipe of subsection 5 into all four $(R, \mathbf{t})$ candidates, and triangulate all inlier matches under each candidate using cv2.triangulatePoints. For each candidate report the fraction of points with positive depth in both cameras, and verify the winner matches cv2.recoverPose's output. Then degrade the geometry: repeat the experiment on a pair with a very short baseline (two frames of a nearly still camera). How decisive is the vote now, and what does that imply for structure-from-motion systems choosing which frame pairs to initialize from?