Section 14.3: Bundle Adjustment: Polishing the Reconstruction

"Every camera is a little wrong and every point is a little wrong, and at two in the morning I redistribute the blame across nineteen thousand parameters until nobody can point fingers."
A Sleep-Deprived Bundle Adjuster

Big Picture

Bundle adjustment is where reconstruction accuracy actually comes from: a joint nonlinear least-squares refinement of every camera pose and every 3D point at once, minimizing the total reprojection error over all observations; everything the previous sections built is, from its perspective, just a good initial guess. The method looks computationally hopeless, hundreds of thousands of coupled unknowns, until one structural observation rescues it: each observation links exactly one camera to one point, making the problem profoundly sparse. Exploiting that sparsity, via the Schur complement, is what lets COLMAP polish city-scale models and what the real-time SLAM systems of Section 14.4 run dozens of times per second on a laptop.

The incremental loop of Section 14.2 makes locally reasonable decisions: each pose from PnP is the best fit to the map as it stood, each triangulation the best fit to the cameras as they stood. But "as it stood" is the problem: the map used to register camera 41 already contained the small errors of cameras 1 through 40, so errors do not merely persist, they compound. What the reconstruction needs is a stage where no estimate is privileged: cameras and points all become variables simultaneously, and the data, the observed pixel locations of the tracks from Section 14.1, arbitrates among them. That stage is bundle adjustment, named for the bundles of light rays connecting each 3D point to the cameras that observe it; the name predates computer vision by decades, coming from the photogrammetrists who adjusted aerial-survey bundles by hand calculation in the 1950s.

1. The Objective: Total Reprojection Error Intermediate

Let camera $i$ carry parameters $\boldsymbol{\theta}_i$ (rotation and translation; optionally intrinsics and distortion, as calibrated in Chapter 12), let $\mathbf{X}_j$ be the position of track $j$, and let $\mathcal{O}$ be the set of observations: $(i, j) \in \mathcal{O}$ when camera $i$ observes track $j$ at pixel $\mathbf{x}_{ij}$. With $\pi$ the projection function from Chapter 12, bundle adjustment solves

$$ \min_{\{\boldsymbol{\theta}_i\}, \{\mathbf{X}_j\}} \; \sum_{(i,j) \in \mathcal{O}} \rho\!\left( \left\lVert \pi(\boldsymbol{\theta}_i, \mathbf{X}_j) - \mathbf{x}_{ij} \right\rVert^2 \right), $$

where $\rho$ is a robust loss we return to in Section 5 (for now, read it as the identity). The objective is exactly the reprojection error that Chapter 13 used to evaluate triangulations, summed over the entire model. Two properties shape everything that follows. First, the problem is nonlinear: $\pi$ involves a rotation and a perspective division, so no closed-form solution exists and we must iterate from the initialization that incremental SfM supplies. Second, it is not fully determined: moving every camera and every point by a common rigid transform, or scaling the whole scene, changes nothing about reprojection. This seven-dimensional family of equivalent solutions (3 rotation, 3 translation, 1 scale) is the gauge freedom; solvers handle it by fixing one camera and one distance, or by letting the normal equations' damping absorb it.

Common Misconception: "Low Reprojection Error Means the Reconstruction Is Correct"

The most seductive trap in this chapter is reading a sub-pixel mean reprojection error as a certificate of a correct model. Reprojection error measures only self-consistency: whether the estimated cameras and points reproject onto the pixels they were fit to. A reconstruction can drive that residual near zero while being geometrically wrong, because bundle adjustment minimizes exactly what you ask over exactly the parameters you free. The bowl-effect example in Section 5 is precisely this: mis-modeled lens distortion lets the optimizer curve a flat parking lot into a dish and still reproject every observation to a fraction of a pixel. Reprojection error is necessary but not sufficient. Trusting an output demands three further checks: gauge anchoring (the residual says nothing about absolute scale or pose, by the gauge freedom above), which parameters were left fixed versus free, and independent ground truth such as the surveyed control points of Section 5. A model that fits the images perfectly can still describe a world that was never there.

2. The Solver: Gauss-Newton and Levenberg-Marquardt Intermediate

Stack all residuals $\mathbf{r}(\mathbf{p}) = \left( \pi(\boldsymbol{\theta}_i, \mathbf{X}_j) - \mathbf{x}_{ij} \right)_{(i,j)\in\mathcal{O}}$ into one vector, where $\mathbf{p}$ collects every parameter. Gauss-Newton linearizes the residuals around the current estimate via the Jacobian $J = \partial \mathbf{r} / \partial \mathbf{p}$ and solves the normal equations for an update $\boldsymbol{\delta}$:

$$ (J^\top J)\, \boldsymbol{\delta} = -J^\top \mathbf{r}. $$

Far from the solution the linearization can overshoot, so Levenberg-Marquardt (LM) adds adaptive damping,

$$ \left(J^\top J + \lambda \,\mathrm{diag}(J^\top J)\right) \boldsymbol{\delta} = -J^\top \mathbf{r}, $$

shrinking $\lambda$ when a step reduces the cost (behaving like fast Gauss-Newton) and growing it when a step fails (retreating toward cautious gradient descent). LM is the default optimizer of every serious bundle adjuster, and its character matters for intuition: unlike the first-order stochastic optimizers you will meet training networks in Chapter 18, LM uses curvature, converging in typically 10 to 50 iterations to sub-pixel consistency, and it is deterministic: the same input yields the same polished model.

3. The Sparsity Miracle and the Schur Complement Advanced

Count parameters for a modest reconstruction: 500 cameras at 6 parameters and 200,000 points at 3 parameters give $n = 603{,}000$ unknowns. The matrix $J^\top J$ is $n \times n$; storing it densely needs about 2.9 terabytes, and solving it densely costs $O(n^3)$. Bundle adjustment is feasible only because of the structure visible in the objective: each residual involves exactly one camera and exactly one point. The Jacobian row for observation $(i,j)$ is zero everywhere except in the 6 columns of camera $i$ and the 3 columns of point $j$, so $J^\top J$ assembles into the famous arrowhead form

$$ J^\top J = \begin{bmatrix} B & E \\ E^\top & C \end{bmatrix}, $$

where $B$ (cameras by cameras) is small, $C$ (points by points) is enormous but block diagonal with independent $3 \times 3$ blocks (block diagonal because no two distinct 3D points share a parameter, so $C$ has no off-diagonal coupling between different points), and $E$ encodes which camera sees which point. Block-diagonal matrices invert block by block, which enables the Schur complement trick: eliminate all point parameters analytically and solve the small reduced camera system

$$ \left( B - E\, C^{-1} E^\top \right) \boldsymbol{\delta}_{\text{cam}} = -\mathbf{g}_{\text{cam}} + E\, C^{-1} \mathbf{g}_{\text{pt}}, $$

then back-substitute each point's update independently (and in parallel). The expensive linear solve shrinks from 603,000 unknowns to 3,000, a workload a laptop handles in milliseconds per iteration. Figure 14.3.1 makes the structure visual; once you have seen it, you cannot unsee why bundle adjustment scales. The mental model below recasts the same elimination as settling a restaurant bill, for readers who would rather picture the trick than read its algebra.

Figure 14.3.1 Why bundle adjustment scales. Left: the normal matrix $J^\top J$ has arrowhead structure, a small dense camera block $B$, a huge but block-diagonal point block $C$, and sparse couplings $E$ wherever a camera observes a point. Right: because $C$ inverts block by block, the Schur complement reduces each LM iteration to a linear solve over cameras alone, shrinking the core problem by orders of magnitude.

💡 Mental Model: The Schur Complement Is Settling the Group Bill by Table

Think of the Schur complement as splitting a restaurant bill where the cheap items vastly outnumber the expensive ones. Imagine a few big shared platters (the cameras) and two hundred thousand individually ordered drinks (the points), where each drink was ordered by exactly one person at exactly one platter, so no drink couples two platters: that is precisely the block-diagonal $C$. Settling the whole bill at once means one $603{,}000$-line invoice. Instead you note that every drink's cost depends only on which platter it belongs to, so you fold each drink's share back onto its own platter (this folding is the $E\,C^{-1}E^\top$ term), settle the tiny platter-only bill among the few diners (the reduced camera system), and then read each person's drink total straight off, independently and in parallel (the back-substitution). The map of who-ordered-what at which platter is exactly $E$. You never lose a cent, because the points were coupled to cameras but never to each other.

Where this model breaks down: a drink shared across two platters would couple them and destroy the block-diagonal structure; in bundle adjustment that never happens (a 3D point's parameters are its own), which is exactly why the trick is always available.

Fun Fact: Adjusted by Hand Since the 1950s

Photogrammetrists were solving bundle adjustment for aerial surveys decades before computer vision existed, originally with mechanical calculators and rooms of human computers; the "block adjustment" of a national mapping survey could take months. The 2000 survey paper that consolidated the theory for vision, Triggs et al.'s "Bundle Adjustment: A Modern Synthesis," is affectionately known as the field's longest apology for reinventing photogrammetry.

The same reduced system also hands you the solution's uncertainty for free, which is what a downstream SLAM or fusion stage needs. At the optimum the parameter covariance is the inverse of the Gauss-Newton approximation to the Hessian, $\operatorname{Cov} \approx (J^\top J)^{-1}$ (scaled by the residual variance). Inverting the full arrowhead matrix is as expensive as it looked before Schur, but the same elimination helps: the inverse of the reduced camera matrix $S = B - E C^{-1} E^\top$ gives the camera-block covariance $\operatorname{Cov}(\text{cameras}) = S^{-1}$ directly, and back-substitution then recovers each point's covariance from its own $3 \times 3$ block plus a camera-coupling correction. So the camera poses come with error ellipsoids, and a point triangulated from a thin baseline reports the large covariance along the viewing direction that the depth-uncertainty law of Section 13.5 predicts.

4. Bundle Adjustment in Code Intermediate

SciPy's least_squares will not win speed records against Ceres, but it exposes every concept of this section honestly, and on problems up to a few hundred cameras it is perfectly serviceable. The implementation needs three ingredients: a residual function (project every observed point through its camera, subtract the observation), a parameter vector packing camera 6-vectors (axis-angle rotation plus translation, the three-number Rodrigues encoding of a rotation from Section 12.4) and point 3-vectors, and, crucially, the Jacobian sparsity pattern, without which SciPy would attempt dense finite differences over the full parameter count and never finish. The arrays cam_idx, pt_idx, and obs are exactly the observation list of the objective: row $k$ says camera cam_idx[k] saw point pt_idx[k] at pixel obs[k], data that falls straight out of the track table built in Section 14.1. One argument below, loss="huber", is the robust loss $\rho$ promised in Section 1; treat it for now as "downweight gross outliers," with the full justification arriving in Section 5.

# A sparse bundle adjuster on top of SciPy least_squares: a residual function that
# reprojects every observation through its camera, plus the Jacobian sparsity mask
# that lets the solver exploit the arrowhead structure instead of densely differencing.
import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

def rotate(pts, rvecs):
    """Apply batched axis-angle rotations (Rodrigues formula)."""
    theta = np.linalg.norm(rvecs, axis=1, keepdims=True)
    with np.errstate(invalid="ignore"):
        k = np.nan_to_num(rvecs / theta)              # unit axes (0 where theta=0)
    cos, sin = np.cos(theta), np.sin(theta)
    return (cos * pts + sin * np.cross(k, pts)
            + (pts * k).sum(1, keepdims=True) * (1 - cos) * k)

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs, K):
    cams = params[:n_cams * 6].reshape(n_cams, 6)     # [rvec | tvec] per camera
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    Xc = rotate(pts[pt_idx], cams[cam_idx, :3]) + cams[cam_idx, 3:6]
    proj = Xc @ K.T
    proj = proj[:, :2] / proj[:, 2:]                  # perspective division
    return (proj - obs).ravel()

def jac_sparsity(n_cams, n_pts, cam_idx, pt_idx):
    """Mark which parameters each residual can possibly touch."""
    m = cam_idx.size * 2                              # two residuals per observation
    A = lil_matrix((m, n_cams * 6 + n_pts * 3), dtype=int)
    row = np.arange(cam_idx.size)
    for s in range(6):                                # this residual's camera...
        A[2 * row, cam_idx * 6 + s] = 1
        A[2 * row + 1, cam_idx * 6 + s] = 1
    for s in range(3):                                # ...and its point, nothing else
        A[2 * row, n_cams * 6 + pt_idx * 3 + s] = 1
        A[2 * row + 1, n_cams * 6 + pt_idx * 3 + s] = 1
    return A

x0 = np.hstack([cams_init.ravel(), pts_init.ravel()])
A = jac_sparsity(n_cams, n_pts, cam_idx, pt_idx)
res = least_squares(residuals, x0, jac_sparsity=A, method="trf",
                    loss="huber", f_scale=1.0, x_scale="jac",
                    args=(n_cams, n_pts, cam_idx, pt_idx, obs, K))

r0 = residuals(x0, n_cams, n_pts, cam_idx, pt_idx, obs, K)
print(f"reprojection RMSE: {np.sqrt((r0**2).mean()):.2f}px "
      f"-> {np.sqrt((res.fun**2).mean()):.2f}px in {res.nfev} evaluations")
# reprojection RMSE: 3.84px -> 0.47px in 23 evaluations

Code Fragment 1: A complete sparse bundle adjuster in NumPy and SciPy: batched Rodrigues rotation, a residual function projecting every observation through its camera, the sparsity mask that tells the solver each residual touches only 9 of the hundreds of thousands of parameters, and a Huber loss for robustness. On the author's 60-camera test scene the RMSE drops roughly eightfold in two dozen evaluations.

Watch what the printed before-and-after line says: the optimizer did not move any single camera dramatically, yet the collective consistency improved by nearly an order of magnitude. That is the signature of bundle adjustment: it harvests hundreds of small, correlated corrections that no per-camera or per-point estimator could see.

Try This: Toggle the Robust Loss and Plant an Outlier

In Code Fragment 1, switch loss="huber" to loss="linear" (the plain squared loss) and rerun, then add one gross outlier by hand: pick a single row of obs and shift it 40 pixels off. With the squared loss, watch the final RMSE balloon and, if you plot the cameras, see nearby poses visibly tugged toward the bad observation; one contaminated point now contributes as much gradient as roughly 1,600 good ones. Flip back to loss="huber" (or try "cauchy") and rerun unchanged: the same outlier barely moves the solution. Varying just this one argument makes concrete why production bundle adjusters never use a bare squared loss on real, outlier-laced tracks.

Key Insight: Joint Refinement Beats Sequential Refinement Because Errors Are Correlated

Refining each camera against fixed points, then each point against fixed cameras, and alternating, looks like it should converge to the same place. It usually does not, and when it does, it crawls: an error in camera 12's pose is partially absorbed by the points it observes, which then mislead camera 31 that shares those points. Bundle adjustment moves both sides of every such bargain simultaneously, following the true joint curvature of the problem; this is precisely the coupling the Schur complement preserves (rather than ignores) while eliminating the point variables. The general lesson, that jointly optimizing coupled estimates outperforms coordinate-wise fixing, recurs throughout vision, from calibration in Chapter 12 to end-to-end network training in Chapter 18.

5. Robustness in Practice: Losses, Outliers, Scheduling Intermediate

Three practical levers separate a textbook adjuster from a production one: the loss, the eviction schedule, and the bundle-adjustment cadence.

A plain squared loss hands control of the model to its worst data: one contaminated track observation, 40 pixels off, contributes as much gradient as 1,600 observations that are one pixel off. Hence the robust loss $\rho$. The Huber loss is quadratic inside a threshold and linear beyond, so outliers retain influence but no longer dominate; the Cauchy loss goes further, asymptotically ignoring gross errors. Production systems pair a robust kernel with an eviction schedule: adjust, delete observations whose residuals remain large, adjust again, exactly the audit rhythm of Section 14.2 applied at global scale.

Scheduling is the other practical lever. Running global bundle adjustment (BA) after every new camera would be ruinously slow, so incremental mappers interleave local BA (the new camera plus its covisible neighborhood, a few dozen cameras) after every registration with global BA at exponentially spaced milestones, when the model has grown by some percentage. And when many images share one physical camera, refining shared intrinsics inside the final global BA (self-calibration) recovers focal length and distortion better than any single image could; this is why COLMAP often reports better intrinsics than the EXIF data claimed.

Practical Example: The Bowl-Shaped Parking Lot

Who: A photogrammetry engineer at a construction-tech company producing weekly drone surveys of building sites for volume and progress tracking.

Situation: Flat sites kept reconstructing with a subtle dish shape: corners of a level parking lot curved up to half a meter relative to its center, corrupting cut-and-fill volume estimates that fed invoices (the illustration below caricatures the effect).

A cartoon drone photographs a parking lot whose reconstruction has curved up at the edges into a shallow bowl while a puzzled engineer robot holds a flat ruler against it, illustrating the bundle-adjustment bowl effect where misestimated lens distortion lets the optimizer curve a flat world while still reprojecting every observation perfectly. — Bundle adjustment minimizes exactly what you ask: leave distortion unmodeled and it will happily bend a flat lot into a dish, sub-pixel error and all.

Problem: The dreaded "bowl effect," a classic systematic failure: slightly misestimated radial distortion bends every camera's geometry coherently, and bundle adjustment, given freedom only over poses and points, accommodates the bias by curving the world. Nadir-only imagery makes distortion and terrain curvature nearly indistinguishable to the optimizer.

Decision: Three changes to the BA configuration rather than the capture hardware: model and jointly refine a full radial-tangential distortion (shared across all images from the one camera) in global BA instead of trusting the lens profile; add oblique imagery (15 degrees off-nadir on the perimeter) to break the distortion-curvature ambiguity; and pin five surveyed ground control points as fixed 3D anchors with observation residuals in the objective.

Result: Vertical deviation on the flat-lot acceptance test dropped from 0.48 m to 0.019 m. Volume disputes with contractors, previously a monthly support burden, effectively ended.

Lesson: Bundle adjustment minimizes exactly what you ask, over exactly the parameters you free. Systematic shape errors usually mean the model is missing a parameter (here, distortion) or the data cannot disambiguate one; the fix is modeling and capture geometry, not more iterations.

Library Shortcut: Ceres via pycolmap, Three Lines for the Industrial Version

Our SciPy adjuster is about 50 lines and fine for hundreds of cameras; the production tool is Google's Ceres Solver, which COLMAP wraps so that refining an entire reconstruction is pycolmap.bundle_adjustment(rec, pycolmap.BundleAdjustmentOptions()) on the model from Section 14.2. Internally Ceres supplies what our version lacks: analytic or automatic derivatives (no finite differencing), specialized Schur-complement linear solvers with camera-ordering heuristics, multithreaded and optionally CUDA-accelerated factorization, robust kernels, and parameter blocks that stay fixed on demand (ground control points, calibrated intrinsics). The line-count reduction is roughly 50 to 3; the speedup on large problems is two to three orders of magnitude.

Research Frontier: Bundle Adjustment Meets the GPU and the Gradient

Bundle adjustment is having a second youth. One thread makes the classical solver differentiable and embeds it in networks: DROID-SLAM (NeurIPS 2021) iterates a dense BA layer inside a recurrent architecture, and its descendants (DPV-SLAM, ECCV 2024) keep the pattern at higher speed. A second thread re-platforms the solver: Ceres gained CUDA solvers, and dedicated GPU adjusters like MegBA (ECCV 2022) handle millions of cameras by distributing the Schur complement. A third thread sidesteps BA entirely, then sneaks it back in: feed-forward systems such as VGGT (CVPR 2025) regress poses directly, yet their best reported configurations still apply a final BA polish, and pose refinement inside 3D Gaussian splatting pipelines (taken up in Chapter 27) is bundle adjustment by another name, with photometric rather than reprojection residuals. Half a century in, the normal equations remain stubbornly load-bearing.

Exercise 14.3.1: Gauge Freedom Made Concrete Conceptual

The BA objective is invariant to 7 degrees of freedom. Enumerate them, then answer: what does this invariance do to the matrix $J^\top J$ (think eigenvalues), and why does LM's damping term let the solver proceed anyway? If a client requires the model in metric units with a specific site origin, which 7 numbers must come from outside the images, and name two realistic sources for them.

Exercise 14.3.2: Free the Focal Length Coding

Extend the SciPy adjuster to refine a shared focal length: make $f$ parameter zero of the vector, update residuals and one column of the sparsity pattern (every residual now touches $f$). Initialize $f$ 10 percent wrong on a synthetic scene with known ground truth and report: does BA recover it? Then re-run with all cameras facing nearly the same direction and compare; relate the failure to the bowl-effect story.

Exercise 14.3.3: Measure the Sparsity Dividend Analysis

For a reconstruction with $C$ cameras, $P$ points, and $O$ observations, derive expressions for (a) the nonzero count of the Jacobian, (b) its dense element count, and (c) the size of the reduced camera system. Evaluate for the railway scenario of Section 14.2: $C = 40{,}000$, $P = 12$ million, $O = 90$ million. Then time the SciPy adjuster on a synthetic problem at $C \in \{10, 30, 100\}$ with and without jac_sparsity and plot both curves. Where would the dense version exhaust your machine's memory?