"You call it a camera roll. I call it a crime scene: every photo is a witness statement, and I cross-examine them until the whole afternoon confesses where everything stood."
An Overzealous Reconstruction Pipeline
Every photograph is a measurement of the same rigid world, and with enough photographs both the world and every camera that looked at it can be recovered jointly, from pixels alone. Chapter 13 showed that two calibrated views determine relative pose and sparse depth, up to an unknown global scale. This chapter scales that result from two views to two thousand: structure from motion (SfM) turns an unordered pile of photos into camera poses plus a 3D point cloud, and visual SLAM re-engineers the same mathematics so a robot or headset can do it live, while moving. The output of this chapter, calibrated cameras locked to sparse geometry, is exactly what the neural scene representations of Chapter 27 consume as input.
Chapter Overview
Two-view geometry answers a local question: given this pair of images, where was the second camera relative to the first, and where in 3D are the matched points? But nobody photographs a building twice and stops. A real capture is dozens to thousands of images, and a real robot produces thirty frames per second indefinitely. The moment more than two views exist, new questions appear that no pairwise machinery can answer alone. Which images even overlap? How do hundreds of pairwise relative poses, each with its own arbitrary scale, merge into one consistent set of camera positions? And when small errors pile up across a long chain of views, what pulls the reconstruction back into shape? This chapter is the classical answer, and it remains the production answer in 2026.
The chapter builds the offline pipeline first. Section 14.1 extends the matching machinery of Chapter 10 from one image pair to a whole collection: geometrically verified match graphs, and feature tracks, the multi-view equivalence classes that say "these seventeen detections, across seventeen photos, are the same physical point." Section 14.2 assembles tracks into a reconstruction with incremental SfM: bootstrap from one well-chosen image pair, then repeatedly register the next camera with PnP and triangulate fresh structure, the algorithm running inside COLMAP and most photogrammetry products. Section 14.3 introduces the method that makes any of this accurate: bundle adjustment, the joint nonlinear least-squares refinement of every camera and every point at once, made tractable by one of the great sparsity tricks in applied mathematics.
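As a preview of the track idea, here is a minimal sketch of merging pairwise matches into multi-view tracks with union-find. It assumes the matches have already been geometrically verified and arrive as pairs of `(image_id, keypoint_id)` observations; the function name `build_tracks` and the input layout are hypothetical, chosen for illustration, and are not COLMAP's API. Discarding tracks that contain two keypoints from the same image is one common consistency filter, not the only possible policy.

```python
from collections import defaultdict


def build_tracks(matches):
    """Group pairwise matches into feature tracks using union-find.

    matches: iterable of ((img_a, kp_a), (img_b, kp_b)) pairs (assumed verified).
    Returns a sorted list of tracks; each track is a sorted list of (img, kp).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in matches:
        union(a, b)

    # Collect observations by the root of their connected component.
    groups = defaultdict(list)
    for obs in list(parent):
        groups[find(obs)].append(obs)

    tracks = []
    for members in groups.values():
        images = [img for img, _ in members]
        # A physical point appears at most once per image; drop conflicting tracks.
        if len(set(images)) == len(images):
            tracks.append(sorted(members))
    return sorted(tracks)


matches = [
    ((0, 5), (1, 7)), ((1, 7), (2, 3)),   # one point seen in images 0, 1, 2
    ((0, 9), (2, 4)),                     # one point seen in images 0 and 2
    ((0, 1), (1, 2)), ((1, 2), (0, 8)),   # conflict: two keypoints in image 0
]
tracks = build_tracks(matches)
```

In this toy input the first two groups survive as tracks of length three and two, while the third is rejected because it would claim two different detections in image 0 are the same point.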
Then the camera starts moving. Section 14.4 rebuilds SfM under real-time constraints as visual SLAM: keyframes, a tracking thread that localizes every frame in milliseconds, a mapping thread that grows and refines the map, and loop closure, the act of recognizing a previously visited place and snapping accumulated drift out of the trajectory. Section 14.5 closes with practice: COLMAP and pycolmap end to end, dense multi-view stereo on top of the sparse model, capture technique that decides success before any algorithm runs, and the 2024-2026 landscape in which learned matchers and feed-forward geometry transformers plug into, and increasingly compete with, the classical pipeline.
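To make "snapping accumulated drift out of the trajectory" concrete, here is a deliberately tiny sketch: a one-dimensional pose graph solved by linear least squares with NumPy. Real systems optimize poses in SE(3) or Sim(3) with nonlinear solvers; the 1D setting, the function name `close_loop_1d`, and the equal weighting of odometry and loop-closure constraints are all simplifying assumptions made for illustration.

```python
import numpy as np


def close_loop_1d(odometry, loop_weight=1.0):
    """Least-squares 1D pose graph.

    x_0 is fixed at 0. Consecutive poses are tied by odometry measurements,
    and the final pose is tied back to x_0 by a single loop-closure constraint.
    """
    n = len(odometry)                 # unknowns are x_1 .. x_n
    A = np.zeros((n + 1, n))
    b = np.zeros(n + 1)
    for i, step in enumerate(odometry):
        A[i, i] = 1.0                 # +x_{i+1}
        if i > 0:
            A[i, i - 1] = -1.0        # -x_i (x_0 = 0 drops out of the first row)
        b[i] = step                   # constraint: x_{i+1} - x_i = step
    A[n, n - 1] = loop_weight         # loop closure: x_n - x_0 = 0
    b[n] = 0.0
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.concatenate([[0.0], x])


# The robot truly moves +1, +1, -1, -1 and returns to its start,
# but each odometry reading is biased by +0.1.
odometry = [1.1, 1.1, -0.9, -0.9]
dead_reckoned = np.concatenate([[0.0], np.cumsum(odometry)])
optimized = close_loop_1d(odometry)
```

Dead reckoning ends about 0.4 away from the start, while the optimized trajectory spreads that error over all five constraints and ends about 0.08 away. Raising `loop_weight` expresses more trust in the place recognition and pulls the end pose closer to the start.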
A thread to watch: this chapter is where Part II's machinery converges. The keypoints, descriptors, and RANSAC of Chapter 10 become the front end; the calibration of Chapter 12 and the essential matrices and triangulation of Chapter 13 become inner loops, executed thousands of times per reconstruction. And the chapter's output outlives the classical era: NeRFs and Gaussian splats in Chapter 27 and the 3D generation of Chapter 36 are trained on camera poses that, in nearly every published paper, came out of the pipeline taught here.
Prerequisites
This chapter assumes the full detect-describe-match-verify pipeline of Chapter 10: Keypoints, Descriptors & Matching, including RANSAC, which runs inside every estimator here. From Chapter 12: Camera Models & Calibration you need the pinhole model, the intrinsic matrix $K$, homogeneous coordinates, and rotation-translation extrinsics. From Chapter 13: Two-View Geometry, Stereo & Depth you need the essential matrix, pose recovery, and triangulation; they are called as subroutines on nearly every page. Comfort with nonlinear least squares helps in Section 14.3 but is developed from scratch there.
Chapter Roadmap
- 14.1 Feature Tracks & Correspondence Across Many Views: From pairwise matches to a geometrically verified match graph, and from the graph to feature tracks: the multi-view correspondences that are the atoms of reconstruction, plus pair selection strategies that keep matching affordable at scale.
- 14.2 Incremental Structure from Motion: Growing a reconstruction one camera at a time: choosing the two-view seed, registering new views with PnP and RANSAC, triangulating fresh structure, and the filtering loop that keeps the model healthy.
- 14.3 Bundle Adjustment: Polishing the Reconstruction: The joint refinement of all cameras and all points by minimizing total reprojection error: Levenberg-Marquardt, the sparse Jacobian, the Schur complement trick, robust losses, and a working implementation in SciPy.
- 14.4 Visual SLAM: Mapping While Moving: Structure from motion under a deadline: tracking, local mapping, and loop-closing threads, keyframes and covisibility, pose-graph optimization that cancels drift, and the monocular-stereo-inertial sensor spectrum.
- 14.5 COLMAP & Modern Reconstruction Pipelines: The standard tool end to end: COLMAP and pycolmap in practice, dense multi-view stereo, capture technique that makes or breaks reconstructions, and how learned matchers and feed-forward models like VGGT reshape the pipeline.
In 2009, the "Building Rome in a Day" project reconstructed landmark-scale 3D models from roughly 150,000 Flickr photos of Rome in under 24 hours on about 500 cores, with no information beyond the pixels and the occasional lying EXIF tag. It was the moment structure from motion stopped being a lab demo and became internet-scale infrastructure, and its core loop is the same one you will implement in Sections 14.1 through 14.3.
What's Next?
SfM and SLAM treat the world as rigid and motion as something cameras do. Chapter 15: Motion, Optical Flow & Tracking drops the rigidity assumption and asks what moves inside the frame: dense per-pixel optical flow, sparse Lucas-Kanade tracking (the front end many SLAM systems quietly use), and the Kalman filters that keep track of objects over time. Together, Chapters 14 and 15 cover the two halves of dynamic vision: how the camera moves through the scene, and how the scene moves in front of the camera.