Chapter 14: Structure from Motion & Visual SLAM

"You call it a camera roll. I call it a crime scene: every photo is a witness statement, and I cross-examine them until the whole afternoon confesses where everything stood."
An Overzealous Reconstruction Pipeline

Big Picture

Every photograph is a measurement of the same rigid world, and with enough photographs both the world and every camera that looked at it can be recovered jointly, from pixels alone, up to an overall scale and a choice of coordinate frame. Chapter 13 showed that two views determine relative pose and sparse depth. This chapter scales that result from two views to two thousand: structure from motion (SfM) turns an unordered pile of photos into camera poses plus a 3D point cloud, and visual SLAM re-engineers the same mathematics so a robot or headset can do it live, while moving. The output of this chapter, calibrated cameras locked to sparse geometry, is exactly what the neural scene representations of Chapter 27 consume as input.

Chapter Overview

Two-view geometry answers a local question: given this pair of images, where was the second camera relative to the first, and where in 3D are the matched points? But nobody photographs a building twice and stops. A real capture is dozens to thousands of images, and a real robot produces thirty frames per second indefinitely. The moment more than two views exist, new questions appear that no pairwise machinery can answer alone. Which images even overlap? How do hundreds of pairwise relative poses, each with its own arbitrary scale, merge into one consistent set of camera positions? And when small errors pile up across a long chain of views, what pulls the reconstruction back into shape? This chapter is the classical answer, and it remains the production answer in 2026.
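The "arbitrary scale" in that second question is easy to verify numerically. The sketch below is illustrative only: the intrinsics, the pose, and the scale factor are made-up values, and `project` is a hypothetical helper implementing the pinhole model of Chapter 12. It shows that multiplying every 3D point and the camera translation by the same factor leaves every pixel unchanged, which is why a pairwise reconstruction cannot know its own size.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of N x 3 world points X into pixel coordinates."""
    Xc = X @ R.T + t              # world frame -> camera frame
    x = Xc @ K.T                  # apply the intrinsic matrix
    return x[:, :2] / x[:, 2:3]   # divide by depth

rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])   # assumed intrinsics, not from the text

# Camera 1 sits at the origin; camera 2 is rotated slightly about y
# and shifted along x (values chosen only for illustration).
theta = 0.1
R2 = np.array([[np.cos(theta), 0.0, np.sin(theta)],
               [0.0, 1.0, 0.0],
               [-np.sin(theta), 0.0, np.cos(theta)]])
t2 = np.array([-0.5, 0.0, 0.0])
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))  # points in front of both cameras

s = 3.7  # any positive scale factor
pixels_cam1 = project(K, np.eye(3), np.zeros(3), X)
pixels_cam2 = project(K, R2, t2, X)
pixels_cam1_scaled = project(K, np.eye(3), np.zeros(3), s * X)
pixels_cam2_scaled = project(K, R2, s * t2, s * X)

# Largest pixel difference between the original and the scaled world.
max_diff = max(np.abs(pixels_cam1 - pixels_cam1_scaled).max(),
               np.abs(pixels_cam2 - pixels_cam2_scaled).max())
print(f"largest pixel change after scaling the world by {s}: {max_diff:.2e}")
```

The printed difference is at the level of floating-point round-off. Merging many pairwise results therefore requires fixing one shared scale for the whole model, not one per pair.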

The chapter builds the offline pipeline first. Section 14.1 extends the matching machinery of Chapter 10 from one image pair to a whole collection: geometrically verified match graphs, and feature tracks, the multi-view equivalence classes that say "these seventeen detections, across seventeen photos, are the same physical point." Section 14.2 assembles tracks into a reconstruction with incremental SfM: bootstrap from one well-chosen image pair, then repeatedly register the next camera with PnP and triangulate fresh structure, the algorithm running inside COLMAP and most photogrammetry products. Section 14.3 introduces the method that makes any of this accurate: bundle adjustment, the joint nonlinear least-squares refinement of every camera and every point at once, made tractable by one of the great sparsity tricks in applied mathematics.
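To make "equivalence classes" concrete before Section 14.1 develops them, here is a minimal sketch of track building with union-find. The function name `build_tracks`, the `(image_id, keypoint_id)` encoding, and the policy of discarding tracks that touch the same image twice are assumptions made for illustration; production systems differ in how they handle such conflicts.

```python
from collections import defaultdict

def build_tracks(pairwise_matches):
    """Merge pairwise matches into feature tracks with union-find.

    pairwise_matches: iterable of pairs of (image_id, keypoint_id) observations.
    Returns a sorted list of tracks, each a sorted list of observations.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:      # walk up to the representative
            root = parent[root]
        while parent[node] != root:      # path compression
            parent[node], node = root, parent[node]
        return root

    for a, b in pairwise_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b      # merge the two classes

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)

    tracks = []
    for members in groups.values():
        images = [image_id for image_id, _ in members]
        # Illustrative policy: a physical point appears at most once per image,
        # so a track with two keypoints in one image is treated as a mismatch.
        if len(members) >= 2 and len(images) == len(set(images)):
            tracks.append(sorted(members))
    return sorted(tracks)

matches = [
    ((0, 5), (1, 9)), ((1, 9), (2, 3)),                     # consistent 3-view track
    ((0, 7), (1, 2)), ((1, 2), (2, 8)), ((2, 8), (0, 6)),   # revisits image 0: conflict
]
tracks = build_tracks(matches)
print(tracks)  # [[(0, 5), (1, 9), (2, 3)]]
```

The second chain links keypoints 7 and 6 of image 0 into one class, so it is rejected: one bad pairwise match can poison an entire track, which is why the match graph is geometrically verified first.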

Then the camera starts moving. Section 14.4 rebuilds SfM under real-time constraints as visual SLAM: keyframes, a tracking thread that localizes every frame in milliseconds, a mapping thread that grows and refines the map, and loop closure, the act of recognizing a previously visited place and snapping accumulated drift out of the trajectory. Section 14.5 closes with practice: COLMAP and pycolmap end to end, dense multi-view stereo on top of the sparse model, capture technique that decides success before any algorithm runs, and the 2024-2026 landscape in which learned matchers and feed-forward geometry transformers plug into, and increasingly compete with, the classical pipeline.
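Drift and loop closure can be seen in a deliberately tiny setting. The sketch below is a one-dimensional toy, not a SLAM system: the step values, the constant odometry bias, and `loop_weight` are invented for illustration, and real systems optimize over 3D poses, not scalars. A robot walks out five steps and back five steps, its odometry overestimates each step slightly, and a single loop-closure constraint ("the last pose equals the first") is added to a linear least-squares pose graph.

```python
import numpy as np

n_steps = 10
true_steps = np.array([1.0] * 5 + [-1.0] * 5)   # out and back: ends where it began
measured = true_steps + 0.05                     # odometry with a small constant bias

# Dead reckoning: integrate odometry starting from x_0 = 0.
dead_reckoned = np.concatenate([[0.0], np.cumsum(measured)])

# Pose graph with x_0 fixed at 0; the unknowns are x_1 .. x_n.
loop_weight = 10.0                   # how strongly the loop closure is trusted
A = np.zeros((n_steps + 1, n_steps))
b = np.zeros(n_steps + 1)
for i in range(n_steps):
    A[i, i] = 1.0                    # +x_{i+1}
    if i > 0:
        A[i, i - 1] = -1.0           # -x_i (x_0 is fixed, so it has no column)
    b[i] = measured[i]               # odometry constraint x_{i+1} - x_i = m_i
A[n_steps, n_steps - 1] = loop_weight   # loop closure: x_n - x_0 = 0
b[n_steps] = 0.0

solution, *_ = np.linalg.lstsq(A, b, rcond=None)
optimized = np.concatenate([[0.0], solution])

drift_before = abs(dead_reckoned[-1])
drift_after = abs(optimized[-1])
print(f"end-point error without loop closure: {drift_before:.3f}")
print(f"end-point error with loop closure:    {drift_after:.4f}")
```

The correction is not applied only at the final pose: the least-squares solution spreads it across every step of the trajectory, which is the behavior pose-graph optimization provides at scale.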

A thread to watch: this chapter is where Part II's machinery converges. The keypoints, descriptors, and RANSAC of Chapter 10 become the front end; the calibration of Chapter 12 and the essential matrices and triangulation of Chapter 13 become inner loops, executed thousands of times per reconstruction. And the chapter's output outlives the classical era: NeRFs and Gaussian splats in Chapter 27 and the 3D generation of Chapter 36 are trained on camera poses that, in nearly every published paper, came out of the pipeline taught here.

Prerequisites

This chapter assumes the full detect-describe-match-verify pipeline of Chapter 10: Keypoints, Descriptors & Matching, including RANSAC, which runs inside every estimator here. From Chapter 12: Camera Models & Calibration you need the pinhole model, the intrinsic matrix $K$, homogeneous coordinates, and rotation-translation extrinsics. From Chapter 13: Two-View Geometry, Stereo & Depth you need the essential matrix, pose recovery, and triangulation; they are called as subroutines on nearly every page. Comfort with nonlinear least squares helps in Section 14.3 but is developed from scratch there.

Chapter Roadmap

14.1 Feature Tracks & Correspondence Across Many Views. From pairwise matches to a geometrically verified match graph, and from the graph to feature tracks: the multi-view correspondences that are the atoms of reconstruction, plus pair selection strategies that keep matching affordable at scale.
14.2 Incremental Structure from Motion. Growing a reconstruction one camera at a time: choosing the two-view seed, registering new views with PnP and RANSAC, triangulating fresh structure, and the filtering loop that keeps the model healthy.
14.3 Bundle Adjustment: Polishing the Reconstruction. The joint refinement of all cameras and all points by minimizing total reprojection error: Levenberg-Marquardt, the sparse Jacobian, the Schur complement trick, robust losses, and a working implementation in SciPy.
14.4 Visual SLAM: Mapping While Moving. Structure from motion under a deadline: tracking, local mapping, and loop-closing threads, keyframes and covisibility, pose-graph optimization that cancels drift, and the monocular-stereo-inertial sensor spectrum.
14.5 COLMAP & Modern Reconstruction Pipelines. The standard tool end to end: COLMAP and pycolmap in practice, dense multi-view stereo, capture technique that makes or breaks reconstructions, and how learned matchers and feed-forward models like VGGT reshape the pipeline.
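
As a preview of the "total reprojection error" named in the Section 14.3 entry, the bundle adjustment objective can be written in one line. The notation here is an assumption for this overview and may differ from the symbols Section 14.3 adopts: $R_i, \mathbf{t}_i$ are the extrinsics of camera $i$, $\mathbf{X}_j$ is 3D point $j$, $\mathbf{x}_{ij}$ is the observed pixel of point $j$ in image $i$, $\mathcal{O}$ is the set of (camera, point) pairs that actually have an observation, $\pi$ divides a homogeneous 3-vector by its last coordinate, and $\rho$ is a robust loss. A single shared intrinsic matrix $K$ is assumed for simplicity.

$$
\min_{\{R_i,\,\mathbf{t}_i\},\,\{\mathbf{X}_j\}} \;\sum_{(i,j)\in\mathcal{O}} \rho\!\left( \left\| \mathbf{x}_{ij} - \pi\!\big( K\,(R_i \mathbf{X}_j + \mathbf{t}_i) \big) \right\|^2 \right)
$$

Each term involves exactly one camera and one point. That structure is what makes the Jacobian sparse and the Schur complement trick possible.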

Key Insight: The Whole Chapter Is One Five-Verb Pipeline

Every reconstruction in this chapter, offline or live, is the same ordered recipe, and five verbs hold it in memory: track, seed, register, triangulate, adjust. Section 14.1 tracks a point across many views; Section 14.2 seeds from one good pair and registers each new camera, triangulating fresh structure as it goes; Section 14.3 adjusts everything jointly so that residual error is spread across the whole model instead of accumulating along the chain of views. Visual SLAM in Section 14.4 runs the identical five verbs against a deadline, and COLMAP in Section 14.5 is those five verbs wearing a command-line coat. When a reconstruction fails, ask which verb broke: the answer is always one of these five.

Fun Fact: Rome in a Day

In 2009, the "Building Rome in a Day" project reconstructed landmark-scale 3D models from roughly 150,000 Flickr photos of Rome in under 24 hours on about 500 cores, with no information beyond the pixels and the occasional lying EXIF tag. It was the moment structure from motion stopped being a lab demo and became internet-scale infrastructure, and its core loop is the same one you will implement in Sections 14.1 through 14.3.

What's Next?

SfM and SLAM treat the world as rigid and motion as something cameras do. Chapter 15: Motion, Optical Flow & Tracking drops the rigidity assumption and asks what moves inside the frame: dense per-pixel optical flow, sparse Lucas-Kanade tracking (the front end many SLAM systems quietly use), and the Kalman filters that keep track of objects over time. Together, Chapters 14 and 15 cover the two halves of dynamic vision: how the camera moves through the scene, and how the scene moves in front of the camera.

Bibliography & Further Reading

Foundational Papers

Snavely, N., Seitz, S. M., and Szeliski, R. "Photo Tourism: Exploring Photo Collections in 3D." ACM SIGGRAPH (2006). doi:10.1145/1141911.1141964

The paper that proved SfM works on uncontrolled internet photos: SIFT, pairwise matching, incremental reconstruction, bundle adjustment. The pipeline of Sections 14.1 and 14.2 in its original form, later released as the Bundler toolkit.

Agarwal, S., Snavely, N., Simon, I., Seitz, S. M., and Szeliski, R. "Building Rome in a Day." ICCV (2009). doi:10.1109/ICCV.2009.5459148

City-scale SfM from 150,000 internet photos: vocabulary-tree pair selection, distributed matching, and large-scale bundle adjustment. The scalability arguments of Section 14.1 come from here.

Schönberger, J. L. and Frahm, J.-M. "Structure-from-Motion Revisited." CVPR (2016). openaccess.thecvf.com

The COLMAP paper: next-best-view selection, robust triangulation, and the filtering-and-retriangulation schedule that made incremental SfM reliable enough to become the field's default tool. Sections 14.2 and 14.5 lean on it heavily.

Triggs, B., McLauchlan, P., Hartley, R., and Fitzgibbon, A. "Bundle Adjustment: A Modern Synthesis." Vision Algorithms: Theory and Practice, Springer LNCS 1883 (2000). doi:10.1007/3-540-44480-7_21

The definitive survey of bundle adjustment: the objective, the sparse normal equations, the Schur complement, gauge freedom, robust kernels. Section 14.3 is a guided tour of its core chapters.

Klein, G. and Murray, D. "Parallel Tracking and Mapping for Small AR Workspaces." ISMAR (2007). doi:10.1109/ISMAR.2007.4538852

PTAM: the architectural insight that tracking and mapping should run as parallel threads at different rates, with keyframes in between. Every system in Section 14.4 descends from this design.

Mur-Artal, R., Montiel, J. M. M., and Tardós, J. D. "ORB-SLAM: A Versatile and Accurate Monocular SLAM System." IEEE Transactions on Robotics 31(5) (2015). arXiv:1502.00956

The reference feature-based SLAM system: ORB everywhere, covisibility graphs, DBoW2 loop closure, and pose-graph plus bundle-adjustment back end. Its successor ORB-SLAM3 (arXiv:2007.11898) adds visual-inertial fusion and multi-map operation. Section 14.4's architecture walk-through follows it.

Gálvez-López, D. and Tardós, J. D. "Bags of Binary Words for Fast Place Recognition in Image Sequences." IEEE Transactions on Robotics 28(5) (2012). doi:10.1109/TRO.2012.2197158

DBoW2: binary bag-of-words place recognition, the loop-closure detector in ORB-SLAM and many production systems, and a preview of the retrieval ideas formalized in Chapter 16.

Recent Research (2021-2026)

Pan, L., Bárath, D., Pollefeys, M., and Schönberger, J. L. "Global Structure-from-Motion Revisited." ECCV (2024). arXiv:2407.20219

GLOMAP: a global SfM system that solves all camera poses at once instead of incrementally, reaching COLMAP-level robustness at order-of-magnitude speedups. The strongest classical challenger to Section 14.2's incremental recipe.

Teed, Z. and Deng, J. "DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras." NeurIPS (2021). arXiv:2108.10869

The learned SLAM landmark: a recurrent network predicts dense correspondences while a differentiable bundle-adjustment layer enforces geometry. The bridge between Section 14.3's optimizer and deep learning.

Murai, R., Dexheimer, E., and Davison, A. J. "MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors." CVPR (2025). arXiv:2412.12392

Real-time dense SLAM built on the MASt3R two-view reconstruction prior, requiring no calibration at runtime. Where the feed-forward geometry wave meets Section 14.4's problem statement.

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. "VGGT: Visual Geometry Grounded Transformer." CVPR (2025). arXiv:2503.11651

A feed-forward transformer that predicts cameras, depths, and 3D points for up to hundreds of views in a single pass, no matching or bundle adjustment in the loop. Section 14.5 weighs it against the classical pipeline.

Keetha, N., Karhade, J., Jatavallabhula, K. M., et al. "SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM." CVPR (2024). arXiv:2312.02126

SLAM with a 3D Gaussian splat as the map representation: tracking and mapping by differentiable rendering. One of the 2024 papers that re-merged SLAM with the neural rendering of Chapter 27.

Books

Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, 2nd edition. Cambridge University Press (2004). robots.ox.ac.uk/~vgg/hzbook

Part IV covers N-view geometry and the full theory behind this chapter, including the proofs of everything Section 14.3 states about the normal equations.

Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book

Chapter 11 treats structure from motion and SLAM with an engineer's perspective and a bibliography deep enough to follow any thread; free online.

Tools & Libraries

COLMAP documentation. colmap.github.io

Installation, CLI reference, camera models, and the database schema for the tool Section 14.5 drives end to end, including the pycolmap Python bindings.

Sarlin, P.-E. et al. "hloc: Hierarchical Localization toolbox." github.com/cvg/Hierarchical-Localization

The standard recipe for swapping learned features (SuperPoint, LightGlue, NetVLAD retrieval) into a COLMAP reconstruction; Section 14.5's library shortcut for difficult scenes.

Agarwal, S., Mierle, K., et al. "Ceres Solver." ceres-solver.org

The industrial-strength nonlinear least-squares library behind COLMAP's bundle adjustment: automatic differentiation, Schur-complement solvers, robust loss functions. Section 14.3 explains exactly what it does for you.