"Choose your first two friends carefully. Everyone who joins the party afterwards will be positioned, for all eternity, relative to them."
An Incremental Mapper With Strong Opinions About Seating
Incremental SfM grows a reconstruction the way you assemble a jigsaw puzzle: start from the two pieces that fit together most convincingly, then repeatedly add the piece that connects best to what is already on the table, fixing mistakes as you go. The two pieces are a seed image pair reconstructed with the tools of Chapter 13; each new piece is a camera registered against existing 3D points via the Perspective-n-Point (PnP) problem; and after every addition, triangulation mints new points from the tracks of Section 14.1 while filtering and local bundle adjustment (the subject of Section 14.3) keep errors from compounding. This loop is what runs inside COLMAP when you press reconstruct.
With tracks in hand, the reconstruction problem becomes concrete: estimate a rotation $R_i$ and translation $\mathbf{t}_i$ for every image and a 3D position $\mathbf{X}_j$ for every track, such that each track's observations land where the cameras say they should. Solving for everything at once from scratch is possible (global SfM, which we will meet at the end of this section), but the workhorse strategy of the last two decades is incremental: establish a small, high-quality core, then expand it one camera at a time. The order of operations matters enormously, and the algorithm's intelligence lives mostly in its choices of what to add next and what to throw away.
1. The Two-View Seed Intermediate
Everything starts from one image pair, and the choice is consequential because the seed defines the coordinate frame and the (arbitrary) global scale that every later camera inherits. A good seed pair has three properties. First, many verified inliers, so the relative pose is reliable. Second, a wide baseline: a large triangulation angle between the viewing rays makes the initial points accurate, whereas a narrow baseline puts every point's depth in doubt, exactly the quadratic depth-error effect quantified in Chapter 13. Third, the pair must not be homography-dominated: if a single plane or a pure rotation explains the matches, as flagged on the match-graph edges of Section 14.1, the pair carries almost no parallax and triangulating from it builds the whole model on sand. COLMAP scores candidate seeds by exactly these criteria.
Once chosen, the seed reconstruction is pure Chapter 13: essential matrix, pose recovery, triangulation, plus the quality filters that incremental SfM applies relentlessly from the first moment. The listing below bootstraps a map from images 0 and 1, keeping only points that are in front of both cameras, triangulated with a sufficient ray angle, and consistent with their observations.
import cv2
import numpy as np
def reproj_error(P, X, x):
"""Pixel distance between projection of 3D points X and observations x."""
Xh = np.hstack([X, np.ones((len(X), 1))])
proj = (P @ Xh.T).T
proj = proj[:, :2] / proj[:, 2:]
return np.linalg.norm(proj - x, axis=1)
# p0, p1: Nx2 float32 arrays of verified matches between images 0 and 1
E, _ = cv2.findEssentialMat(p0, p1, cameraMatrix=K,
method=cv2.USAC_MAGSAC, threshold=1.5)
n_inl, R, t, mask = cv2.recoverPose(E, p0, p1, K) # cheirality check inside
keep = mask.ravel().astype(bool)
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))]) # camera 0 = world origin
P1 = K @ np.hstack([R, t]) # |t| = 1: scale is arbitrary
Xh = cv2.triangulatePoints(P0, P1, p0[keep].T, p1[keep].T)
X = (Xh[:3] / Xh[3]).T # Nx3 points, world frame
# quality gates: ray angle and reprojection error
rays0 = X / np.linalg.norm(X, axis=1, keepdims=True)
C1 = (-R.T @ t).ravel() # center of camera 1
rays1 = X - C1
rays1 /= np.linalg.norm(rays1, axis=1, keepdims=True)
angles = np.degrees(np.arccos(np.clip((rays0 * rays1).sum(1), -1, 1)))
err = np.maximum(reproj_error(P0, X, p0[keep]), reproj_error(P1, X, p1[keep]))
good = (angles > 2.0) & (err < 2.0) & (X[:, 2] > 0)
print(f"seed: {good.sum()} points kept of {len(X)} "
f"(median angle {np.median(angles):.1f} deg, median err {np.median(err):.2f}px)")
# seed: 1376 points kept of 1671 (median angle 7.4 deg, median err 0.41px)
Two details deserve emphasis. The translation from recoverPose has unit norm: monocular reconstruction has no access to metric scale, so the seed baseline defines the map unit, a gauge freedom we will meet formally in Section 14.3. And the angle gate is not cosmetic: a point triangulated from rays 0.5 degrees apart can move tens of map units along the ray while changing its reprojection by less than a pixel, so admitting it pollutes the map with pseudo-structure that later PnP steps will trust to their detriment.
2. Registering the Next Camera: PnP Intermediate
Here is the move that makes incremental SfM work. Once the seed map exists, some unregistered images observe tracks that already have 3D positions. For such an image we know a set of 2D-3D correspondences: keypoint $\mathbf{x}_k$ in the new image belongs to a track whose point $\mathbf{X}_k$ is already triangulated. Estimating a camera's pose from $n$ such correspondences, given intrinsics $K$ from Chapter 12, is the Perspective-n-Point problem: find $R$, $\mathbf{t}$ minimizing
$$ \sum_{k=1}^{n} \left\lVert \pi\!\left(K \left[ R \,|\, \mathbf{t} \right] \tilde{\mathbf{X}}_k\right) - \mathbf{x}_k \right\rVert^2 , $$
where $\pi$ is the perspective division. The minimal case $n = 3$ (P3P) has up to four algebraic solutions and is the engine inside RANSAC; larger $n$ is solved by EPnP or SQPnP and refined iteratively. Because some 2D-3D correspondences are wrong (contaminated tracks from Section 14.1, or map points that drifted), registration is always RANSAC-wrapped, just as every estimator has been since Chapter 10:
# X_map: Mx3 triangulated points visible in the new image (via track lookup)
# x_new: Mx2 corresponding keypoint locations in the new image
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
X_map, x_new, K, distCoeffs=None,
reprojectionError=4.0, # pixel tolerance for an inlier vote
iterationsCount=5000,
confidence=0.999,
flags=cv2.SOLVEPNP_EPNP) # minimal/efficient solver inside RANSAC
assert ok and len(inliers) > 50 # refuse weakly-supported registrations
R_new, _ = cv2.Rodrigues(rvec) # axis-angle to 3x3 rotation
print(f"registered with {len(inliers)}/{len(X_map)} PnP inliers")
# registered with 412/588 PnP inliers
# refine on inliers only, Levenberg-Marquardt on reprojection error
rvec, tvec = cv2.solvePnPRefineLM(X_map[inliers.ravel()],
x_new[inliers.ravel()], K, None, rvec, tvec)
solvePnPRansac finds the pose explaining the most 2D-3D correspondences within 4 pixels, the inlier threshold guards against contaminated tracks, and solvePnPRefineLM polishes the winning pose on inliers only.Which image should be registered next? Greedy intuition says the one seeing the most triangulated points, but COLMAP's next-best-view scoring refines this: it prefers images whose visible map points are spatially well distributed across the frame, because fifty correspondences concentrated in one corner constrain pose far worse than fifty spread evenly. Poor next-view choices were a chief cause of pre-2016 SfM failures: register a barely-connected image early, and its noisy pose seeds bad triangulations that cascade.
3. Growing Structure and the Audit Loop Intermediate
A newly registered camera does two things. It adds observations to existing points (its inlier correspondences extend their tracks), and it makes new triangulations possible: every track observed by the new camera and at least one other registered camera, but not yet triangulated, is now a candidate 3D point, gated by the same angle and reprojection filters as the seed. Then comes the step that separates a working mapper from a toy: the audit. After each registration (or each few), the mapper re-examines the model: observations whose reprojection error has crept above threshold are removed from their tracks; points left with fewer than two observations, or with degenerate angles, are deleted; previously failed triangulations are retried now that more views exist (re-triangulation); and a local bundle adjustment refines the most recently touched cameras and points. Figure 14.2.1 lays out the full cycle.
No single step of incremental SfM is especially accurate: PnP poses wobble, triangulations scatter, and a few percent of track observations are simply wrong. The system works because estimation is interleaved with audit at every iteration: each new camera both consumes the map and stress-tests it, and anything that stops reprojecting consistently is evicted before it can recruit further errors. When you build any multi-stage estimation system, this pattern, grow a little, verify everything touched, repeat, transfers directly.
4. Failure Modes: Drift, Degeneracy, and Doppelgangers Advanced
Incremental SfM inherits three systematic weaknesses worth knowing before you trust its output. Drift: errors compound along the registration order, so a long chain of cameras (a street captured end to end) slowly bends; the reconstruction is locally crisp but globally warped. Only closing a loop in the capture, or external constraints like GPS, can anchor long chains, a theme that returns with full force in the SLAM setting of Section 14.4. Degenerate motion: a segment of near-pure rotation provides no baseline, so nothing new can be triangulated and registration eventually starves; this is the capture-time failure that no algorithm fixes after the fact. Doppelgangers: the repeated-structure merges of Section 14.1, which at reconstruction time manifest as a camera confidently registered to the wrong instance of a repeated facade. The audit loop catches many; the rest require the graph-level priors discussed there.
Who: The perception lead at an infrastructure-inspection company under contract to model a regional railway corridor from a camera-equipped maintenance train.
Situation: 80,000 forward-facing frames along 30 km of track, to be reconstructed for clearance analysis (does vegetation or equipment intrude into the train envelope?).
Problem: Incremental SfM on the raw sequence drifted badly: by kilometer 8, the reconstructed track bed was 14 meters off the surveyed centerline and visibly bowed. A pure chain of forward motion is the worst case for drift, with no loops to close and a narrow effective baseline between consecutive frames.
Decision: Three interventions: subsample frames to enforce a minimum baseline (one frame per 2 meters of GPS-estimated motion); inject the train's GPS track as a soft prior on camera positions during bundle adjustment; and reconstruct in overlapping 2 km blocks, rigidly aligned via surveyed control points, rather than one 30 km chain.
Result: Maximum deviation from the survey dropped from 14 m to 0.21 m, comfortably inside the 0.5 m contract tolerance, and per-block reconstruction parallelized across machines, cutting wall-clock time by a factor of six.
Lesson: Drift is not a bug to fix downstream; it is a property of chain-shaped capture. Break the chain (blocks), anchor it (GPS, control points), or close it (loops): a capture plan decides reconstructability before any algorithm runs.
5. Incremental versus Global SfM Advanced
The incremental strategy is not the only one. Global SfM estimates all camera rotations at once by averaging the pairwise relative rotations on the match graph (rotation averaging), then solves all translations and triangulates everything in a second pass, finishing with one global bundle adjustment. It is dramatically faster, since there are no thousands of intermediate local adjustments, and free of registration-order drift; historically it was also far more brittle, because a few bad edges in the match graph could corrupt the averaged solution where the incremental audit loop would have caught them. That trade-off was substantially rebalanced in 2024: GLOMAP, from the COLMAP lineage, made global SfM robust enough to match incremental quality on standard benchmarks at one to two orders of magnitude less compute, mainly by skipping the fragile separate translation-averaging step and solving camera positions and 3D structure jointly. For very large, well-connected collections, global methods are now often the default; incremental mapping retains the edge on sparse, weakly connected, or sequentially captured sets where its caution pays.
The full incremental machine of this section, seed scoring, next-best-view selection, PnP registration, triangulation with angle gates, outlier eviction, re-triangulation, and interleaved local plus periodic global bundle adjustment, is a single call on the database produced in Section 14.1: maps = pycolmap.incremental_mapping(db, image_dir, out_dir). Our didactic seed-plus-one-camera snippets total about 60 lines and cover perhaps a tenth of the real logic; the production loop is thousands of lines you do not have to write or tune, with two decades of failure-mode handling baked in. Use the from-scratch version to understand the loop; use COLMAP's to ship.
The 2024-2026 research wave asks whether the loop itself is necessary. VGGSfM (CVPR 2024) made the entire incremental pipeline differentiable and trained it end to end, removing the hand-tuned scheduling. MASt3R-SfM (2024) replaces two-view estimation with a learned reconstruction prior, then aligns the resulting pointmaps globally, making reconstruction usable from a handful of uncalibrated phone photos. VGGT (CVPR 2025) is the strongest statement: a transformer that ingests up to hundreds of images and directly outputs cameras, depth maps, and 3D points in seconds, no PnP, no triangulation, no adjustment loop, with accuracy competitive with optimization pipelines on many benchmarks. The pattern to watch is hybridization: feed-forward models provide initialization and robustness on sparse views, while classical bundle adjustment (Section 14.3) still supplies the final metric polish wherever accuracy is contractual, and the neural scene representations of Chapter 27 consume either source of poses indifferently.
Explain, in terms of triangulation angles and the gauge freedom of monocular reconstruction, what goes wrong downstream if the seed pair is (a) two nearly identical frames from a video, (b) two views of a flat poster filling the frame, (c) a wide-baseline pair with 40 inliers concentrated in one image corner. For each case, state which of the three seed criteria it violates and which later stage (PnP registration, triangulation, or bundle adjustment) first exposes the damage.
Take three overlapping photos of a textured scene with a calibrated camera (intrinsics from Chapter 12). Implement the full path of this section: seed reconstruction from the best pair, track lookup to collect 2D-3D correspondences for the third image, solvePnPRansac registration, and triangulation of tracks newly visible from the third view. Report PnP inlier counts and the median reprojection error over all observations before and after solvePnPRefineLM. Then deliberately initialize from the worst pair instead and compare.
Using COLMAP or pycolmap on a 30-to-60 image dataset, run incremental mapping three times: default settings, initialization forced to a deliberately poor pair (init_image_id1/2), and mapping with local bundle adjustment effectively disabled (set its interval very high). Compare registered-image counts, mean track lengths, and mean reprojection errors. Which intervention hurts more, and does the damage appear immediately or accumulate? Relate your observations to the audit-loop insight of this section.