Section 14.2: Incremental Structure from Motion

"Choose your first two friends carefully. Everyone who joins the party afterwards will be positioned, for all eternity, relative to them."
An Incremental Mapper With Strong Opinions About Seating

Big Picture

Incremental SfM grows a reconstruction the way you assemble a jigsaw puzzle: start from the two pieces that fit together most convincingly, then repeatedly add the piece that connects best to what is already on the table, fixing mistakes as you go. The two pieces are a seed image pair reconstructed with the tools of Chapter 13; each new piece is a camera registered against existing 3D points via the Perspective-n-Point (PnP) problem; and after every addition, triangulation mints new points from the tracks of Section 14.1 while filtering and local bundle adjustment (the subject of Section 14.3) keep errors from compounding. This loop is what runs inside COLMAP when you press reconstruct. The illustration below carries the puzzle metaphor.

A cartoon robot builds a jigsaw puzzle starting from two glowing pieces locked together on a small pedestal, then adds the next piece that connects best, illustrating incremental structure from motion growing a reconstruction one camera at a time from a carefully chosen seed pair. — Incremental SfM is two views that recruited a thousand more, so the first pair you snap together gets all the scrutiny.

With tracks in hand, the reconstruction problem becomes concrete: estimate a rotation $R_i$ and translation $\mathbf{t}_i$ for every image and a 3D position $\mathbf{X}_j$ for every track, such that each track's observations land where the cameras say they should. Solving for everything at once from scratch is possible (global SfM, which we will meet at the end of this section), but the workhorse strategy of the last two decades is incremental: establish a small, high-quality core, then expand it one camera at a time. The order of operations matters enormously, and the algorithm's intelligence lives mostly in its choices of what to add next and what to throw away.

1. The Two-View Seed Intermediate

Everything starts from one image pair, and the choice is consequential because the seed defines the coordinate frame and the (arbitrary) global scale that every later camera inherits. A good seed pair has three properties. First, many verified inliers, so the relative pose is reliable. Second, a wide baseline: a large triangulation angle between the viewing rays makes the initial points accurate, whereas a narrow baseline puts every point's depth in doubt, exactly the quadratic depth-error effect quantified in Chapter 13. Third, the pair must not be homography-dominated: if a single plane or a pure rotation explains the matches, as flagged on the match-graph edges of Section 14.1, the pair carries almost no parallax and triangulating from it builds the whole model on sand. COLMAP scores candidate seeds by exactly these criteria.

Once chosen, the seed reconstruction is pure Chapter 13: essential matrix, pose recovery, triangulation, plus the quality filters that incremental SfM applies relentlessly from the first moment. The listing below bootstraps a map from images 0 and 1, keeping only points that are in front of both cameras, triangulated with a sufficient ray angle, and consistent with their observations.

# Bootstrap the map from the seed pair (images 0 and 1): recover relative pose,
# triangulate, then keep only points that pass three quality gates, positive depth,
# a wide-enough triangulation angle, and a low reprojection error.
import cv2
import numpy as np

def reproj_error(P, X, x):
    """Pixel distance between projection of 3D points X and observations x."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:]
    return np.linalg.norm(proj - x, axis=1)

# p0, p1: Nx2 float32 arrays of verified matches between images 0 and 1
E, _ = cv2.findEssentialMat(p0, p1, cameraMatrix=K,
                            method=cv2.USAC_MAGSAC, threshold=1.5)
n_inl, R, t, mask = cv2.recoverPose(E, p0, p1, K)     # keeps only points in front of both cameras
keep = mask.ravel().astype(bool)

P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])     # camera 0 = world origin
P1 = K @ np.hstack([R, t])                            # |t| = 1: scale is arbitrary
Xh = cv2.triangulatePoints(P0, P1, p0[keep].T, p1[keep].T)
X = (Xh[:3] / Xh[3]).T                                # Nx3 points, world frame

# quality gates: ray angle and reprojection error
rays0 = X / np.linalg.norm(X, axis=1, keepdims=True)
C1 = (-R.T @ t).ravel()                               # center of camera 1
rays1 = X - C1
rays1 /= np.linalg.norm(rays1, axis=1, keepdims=True)
angles = np.degrees(np.arccos(np.clip((rays0 * rays1).sum(1), -1, 1)))
err = np.maximum(reproj_error(P0, X, p0[keep]), reproj_error(P1, X, p1[keep]))
good = (angles > 2.0) & (err < 2.0) & (X[:, 2] > 0)

print(f"seed: {good.sum()} points kept of {len(X)} "
      f"(median angle {np.median(angles):.1f} deg, median err {np.median(err):.2f}px)")
# seed: 1376 points kept of 1671 (median angle 7.4 deg, median err 0.41px)

Code Fragment 1: Bootstrapping the map from the seed pair: essential-matrix pose recovery and triangulation from Chapter 13, followed by the three quality gates every incremental mapper applies, positive depth, a minimum triangulation angle (here 2 degrees), and a reprojection-error ceiling.

Two details deserve emphasis. The translation from recoverPose has unit norm: monocular reconstruction has no access to metric scale, so the seed baseline defines the map unit, a gauge freedom we will meet formally in Section 14.3. And the angle gate is not cosmetic: a point triangulated from rays 0.5 degrees apart can move tens of map units along the ray while changing its reprojection by less than a pixel, so admitting it pollutes the map with pseudo-structure that later PnP steps will trust to their detriment. Figure 14.2.1 makes the geometry of this trade-off visible: a wide baseline crosses the two viewing rays at a steep angle and pins the point tightly, while a narrow baseline grazes them nearly parallel, smearing the intersection into a long depth-uncertain wedge.

Figure 14.2.1 Why the seed pair needs a wide baseline. Left: two cameras far apart see a point along rays that cross at a steep angle, so their intersection is a small region and the triangulated depth is well pinned. Right: two cameras close together send nearly parallel rays, so a small pixel error lets the point slide far along the line of sight (the elongated red wedge), exactly the near-parallel case the angle gate in Code Fragment 1 rejects. The same geometry that ruins a narrow seed is why a tiny triangulation angle pollutes the map with depth-uncertain pseudo-structure.

Try This: Sweep the Triangulation-Angle Gate

In Code Fragment 1, change the minimum angle in good = (angles > 2.0) & ... and rerun the seed step at a few values: 0.5, 2.0, and 5.0 degrees. Watch two printed numbers move in opposite directions. As you lower the gate the kept-point count climbs (more rays admitted), but the median reprojection error and the depth scatter of the resulting cloud grow too, because near-parallel rays place points almost arbitrarily far along the line of sight. As you raise it, far fewer points survive, yet the ones that do are crisp. There is no universal best value; the sweep itself is the lesson that point count and point quality trade off through this single knob, which is why every incremental mapper exposes it.

The Reconstruction Is Born With No Sense of Scale

The seed pair is the model's original sin: it sets a coordinate frame nobody voted for and a unit of length equal to whatever the first baseline happened to be. Every camera that joins later is positioned relative to those two, forever, which is why the seed gets all the scrutiny. Remember it as one line: incremental SfM is two views that recruited a thousand more. The map measures the world in seed-baselines until something external, GPS, a known object, a tape measure, tells it what a meter is.

Common Misconception: "Enough Photos (or Bundle Adjustment) Will Eventually Recover True Scale"

Learners often assume that adding more views, or running the bundle adjustment of Section 14.3, gradually pins down the real-world size of a monocular reconstruction. It never can. A single moving camera leaves the whole model free under a 7-degree-of-freedom similarity: 3 for global rotation, 3 for global translation, and 1 for overall scale. Multiply every 3D point and every camera position by two and shrink the focal-implied depths to match, and every reprojection is pixel-for-pixel identical, so no image-only objective can prefer one scaling over another. This is gauge freedom, not estimator weakness: the metric scale is simply not present in the pixels. A reconstruction can therefore be flawless and still measure the world in arbitrary seed-baselines. Recovering meters requires an external metric anchor (a stereo baseline as in Chapter 13, an IMU, GPS, or one known distance in the scene); for the same reason, monocular SLAM in Section 14.4 suffers slow scale drift that local accuracy alone cannot cure.

2. Registering the Next Camera: PnP Intermediate

Here is the move that makes incremental SfM work. Once the seed map exists, some unregistered images observe tracks that already have 3D positions. For such an image we know a set of 2D-3D correspondences: keypoint $\mathbf{x}_k$ in the new image belongs to a track whose point $\mathbf{X}_k$ is already triangulated. Estimating a camera's pose from $n$ such correspondences, given intrinsics $K$ from Chapter 12, is the Perspective-n-Point problem: find $R$, $\mathbf{t}$ minimizing

$$ \sum_{k=1}^{n} \left\lVert \pi\!\left(K \left[ R \,|\, \mathbf{t} \right] \tilde{\mathbf{X}}_k\right) - \mathbf{x}_k \right\rVert^2 , $$

where $\pi$ is the perspective division. The minimal case $n = 3$ (P3P) fixes the camera center from the three known inter-point distances and the angles their rays subtend at the lens, a system that reduces to a quartic and therefore admits up to four algebraic solutions; a fourth correspondence selects the physically correct one. This minimal solver is the engine inside RANSAC, where the residual four-way ambiguity is harmless because the consensus step keeps only the hypothesis that the remaining points also vote for; larger $n$ is solved in closed form by EPnP or SQPnP and refined iteratively. Because some 2D-3D correspondences are wrong (contaminated tracks from Section 14.1, or map points that drifted), registration is always RANSAC-wrapped, just as every estimator has been since Chapter 10:

# X_map: Mx3 triangulated points visible in the new image (via track lookup)
# x_new: Mx2 corresponding keypoint locations in the new image
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    X_map, x_new, K, distCoeffs=None,
    reprojectionError=4.0,           # pixel tolerance for an inlier vote
    iterationsCount=5000,
    confidence=0.999,
    flags=cv2.SOLVEPNP_EPNP)         # minimal/efficient solver inside RANSAC

assert ok and len(inliers) > 50      # refuse weakly-supported registrations
R_new, _ = cv2.Rodrigues(rvec)       # axis-angle to 3x3 rotation
print(f"registered with {len(inliers)}/{len(X_map)} PnP inliers")
# registered with 412/588 PnP inliers

# refine on inliers only, Levenberg-Marquardt on reprojection error
rvec, tvec = cv2.solvePnPRefineLM(X_map[inliers.ravel()],
                                  x_new[inliers.ravel()], K, None, rvec, tvec)

Code Fragment 2: Camera registration by RANSAC-wrapped PnP: solvePnPRansac finds the pose explaining the most 2D-3D correspondences within 4 pixels, the inlier threshold guards against contaminated tracks, and solvePnPRefineLM polishes the winning pose on inliers only.

Which image should be registered next? Greedy intuition says the one seeing the most triangulated points, but COLMAP's next-best-view scoring refines this: it prefers images whose visible map points are spatially well distributed across the frame, because fifty correspondences concentrated in one corner constrain pose far worse than fifty spread evenly. Poor next-view choices were a chief cause of pre-2016 SfM failures: register a barely-connected image early, and its noisy pose seeds bad triangulations that cascade.

3. Growing Structure and the Audit Loop Intermediate

A newly registered camera does two things. It adds observations to existing points (its inlier correspondences extend their tracks), and it makes new triangulations possible: every track observed by the new camera and at least one other registered camera, but not yet triangulated, is now a candidate 3D point, gated by the same angle and reprojection filters as the seed. Then comes the step that separates a working mapper from a toy: the audit. After each registration (or each few), the mapper re-examines the model: observations whose reprojection error has crept above threshold are removed from their tracks; points left with fewer than two observations, or with degenerate angles, are deleted; previously failed triangulations are retried now that more views exist (re-triangulation); and a local bundle adjustment refines the most recently touched cameras and points. Figure 14.2.2 lays out the full cycle.

Figure 14.2.2 The incremental SfM loop. A seed pair initializes the map; then the mapper repeatedly selects the best next view, registers it with PnP plus RANSAC, triangulates newly observable tracks, and runs the audit (outlier filtering, re-triangulation, local bundle adjustment) before looping back. When no remaining image can be registered, a final global bundle adjustment polishes everything.

Key Insight: Robustness Comes from the Loop, Not the Steps

No single step of incremental SfM is especially accurate: PnP poses wobble, triangulations scatter, and a few percent of track observations are simply wrong. The system works because estimation is interleaved with audit at every iteration: each new camera both consumes the map and stress-tests it, and anything that stops reprojecting consistently is evicted before it can recruit further errors. When you build any multi-stage estimation system, this pattern, grow a little, verify everything touched, repeat, transfers directly.

4. Failure Modes: Drift, Degeneracy, and Doppelgangers Advanced

Incremental SfM inherits three systematic weaknesses worth knowing before you trust its output. Drift: errors compound along the registration order, so a long chain of cameras (a street captured end to end) slowly bends; the reconstruction is locally crisp but globally warped. Only closing a loop in the capture, or external constraints like GPS, can anchor long chains, a theme that returns with full force in the SLAM setting of Section 14.4. Degenerate motion: a segment of near-pure rotation provides no baseline, so nothing new can be triangulated and registration eventually starves; this is the capture-time failure that no algorithm fixes after the fact. Doppelgangers: the repeated-structure merges of Section 14.1, which at reconstruction time manifest as a camera confidently registered to the wrong instance of a repeated facade. The audit loop catches many; the rest require the graph-level priors discussed there.

Practical Example: Mapping 30 Kilometers of Railway Corridor

Who: The perception lead at an infrastructure-inspection company under contract to model a regional railway corridor from a camera-equipped maintenance train.

Situation: 80,000 forward-facing frames along 30 km of track, to be reconstructed for clearance analysis (does vegetation or equipment intrude into the train envelope?).

Problem: Incremental SfM on the raw sequence drifted badly: by kilometer 8, the reconstructed track bed was 14 meters off the surveyed centerline and visibly bowed. A pure chain of forward motion is the worst case for drift, with no loops to close and a narrow effective baseline between consecutive frames.

Decision: Three interventions: subsample frames to enforce a minimum baseline (one frame per 2 meters of GPS-estimated motion); inject the train's GPS track as a soft prior on camera positions during bundle adjustment; and reconstruct in overlapping 2 km blocks, rigidly aligned via surveyed control points, rather than one 30 km chain.

Result: Maximum deviation from the survey dropped from 14 m to 0.21 m, comfortably inside the 0.5 m contract tolerance, and per-block reconstruction parallelized across machines, cutting wall-clock time by a factor of six.

Lesson: Drift is not a bug to fix downstream; it is a property of chain-shaped capture. Break the chain (blocks), anchor it (GPS, control points), or close it (loops): a capture plan decides reconstructability before any algorithm runs.

5. Incremental versus Global SfM Advanced

The incremental strategy is not the only one. Global SfM estimates all camera rotations at once by averaging the pairwise relative rotations on the match graph (rotation averaging: find one global orientation per camera that is as consistent as possible with every measured pairwise rotation, the way you would reconcile many noisy compass bearings into a single set of headings), then solves all translations and triangulates everything in a second pass, finishing with one global bundle adjustment. It is dramatically faster, since there are no thousands of intermediate local adjustments, and free of registration-order drift; historically it was also far more brittle, because a few bad edges in the match graph could corrupt the averaged solution where the incremental audit loop would have caught them. That trade-off was substantially rebalanced in 2024: GLOMAP, from the COLMAP lineage, made global SfM robust enough to match incremental quality on standard benchmarks at one to two orders of magnitude less compute, mainly by skipping the fragile separate translation-averaging step and solving camera positions and 3D structure jointly. For very large, well-connected collections, global methods are now often the default; incremental mapping retains the edge on sparse, weakly connected, or sequentially captured sets where its caution pays.

Library Shortcut: One Call Replaces the Whole Loop

The full incremental machine of this section, seed scoring, next-best-view selection, PnP registration, triangulation with angle gates, outlier eviction, re-triangulation, and interleaved local plus periodic global bundle adjustment, is a single call on the database produced in Section 14.1: maps = pycolmap.incremental_mapping(db, image_dir, out_dir). Our didactic seed-plus-one-camera snippets total about 60 lines and cover perhaps a tenth of the real logic; the production loop is thousands of lines you do not have to write or tune, with two decades of failure-mode handling baked in. Use the from-scratch version to understand the loop; use COLMAP's to ship.

Research Frontier: Reconstruction as a Forward Pass

The 2024-2026 research wave asks whether the loop itself is necessary. VGGSfM (CVPR 2024) made the entire incremental pipeline differentiable and trained it end to end, removing the hand-tuned scheduling. MASt3R-SfM (2024) replaces two-view estimation with a learned reconstruction prior, then aligns the resulting pointmaps globally, making reconstruction usable from a handful of uncalibrated phone photos. VGGT (CVPR 2025) is the strongest statement: a transformer that ingests up to hundreds of images and directly outputs cameras, depth maps, and 3D points in seconds, no PnP, no triangulation, no adjustment loop, with accuracy competitive with optimization pipelines on many benchmarks. The pattern to watch is hybridization: feed-forward models provide initialization and robustness on sparse views, while classical bundle adjustment (Section 14.3) still supplies the final metric polish wherever accuracy is contractual, and the neural scene representations of Chapter 27 consume either source of poses indifferently.

Exercise 14.2.1: Anatomy of a Bad Seed Conceptual

Explain, in terms of triangulation angles and the gauge freedom of monocular reconstruction, what goes wrong downstream if the seed pair is (a) two nearly identical frames from a video, (b) two views of a flat poster filling the frame, (c) a wide-baseline pair with 40 inliers concentrated in one image corner. For each case, state which of the three seed criteria it violates and which later stage (PnP registration, triangulation, or bundle adjustment) first exposes the damage.

Exercise 14.2.2: A Three-Camera Reconstruction by Hand Coding

Take three overlapping photos of a textured scene with a calibrated camera (intrinsics from Chapter 12). Implement the full path of this section: seed reconstruction from the best pair, track lookup to collect 2D-3D correspondences for the third image, solvePnPRansac registration, and triangulation of tracks newly visible from the third view. Report PnP inlier counts and the median reprojection error over all observations before and after solvePnPRefineLM. Then deliberately initialize from the worst pair instead and compare.

Exercise 14.2.3: Registration Order as an Experiment Analysis

Using COLMAP or pycolmap on a 30-to-60 image dataset, run incremental mapping three times: default settings, initialization forced to a deliberately poor pair (init_image_id1/2), and mapping with local bundle adjustment effectively disabled (set its interval very high). Compare registered-image counts, mean track lengths, and mean reprojection errors. Which intervention hurts more, and does the damage appear immediately or accumulate? Relate your observations to the audit-loop insight of this section.