Chapter 12: Camera Models & Calibration

"I turn the entire world into pixels using five numbers and a small polynomial. Lose those numbers, and I am no longer a measuring instrument. I am modern art."
A Camera That Lost Its Calibration

Big Picture

This chapter is where the image stops being a grid of colors and becomes a measurement of the 3D world: the pinhole model says how space projects to pixels, and calibration recovers the exact parameters of that projection for your specific camera. Every geometric capability that follows in this book, depth from stereo in Chapter 13, structure from motion in Chapter 14, the pose-hungry neural scene representations of Chapter 27, stands on the few numbers this chapter teaches you to estimate, doubt, and maintain.

Chapter Overview

Part II has so far stayed inside the image: Chapter 9 found edges and lines, Chapter 10 matched points between images, Chapter 11 grouped pixels into regions. None of that answers the questions a robot, a drone, or an AR headset actually asks: how far away is that object, where am I standing, which way am I looking? Those are questions about the 3D world, and a 2D image can answer them only if you know precisely how the world was flattened into it. That flattening is the camera model, and measuring its parameters is calibration.

The model itself is almost embarrassingly simple. A pinhole camera projects every 3D point along a straight ray through one center onto an image plane; with homogeneous coordinates the whole operation is one matrix, the intrinsic matrix $K$, holding the focal lengths in pixels, the principal point, and little else. Real lenses complicate the picture in exactly one well-behaved way, a radially symmetric distortion that a low-order polynomial captures, and correcting it restores the pinhole's pristine geometry. What elevates this from algebra to engineering is that the parameters can be measured cheaply and precisely: Zhang's method, the subject of the chapter's centerpiece section, recovers everything from a dozen photographs of a printed checkerboard, an idea that demolished the machined-target calibration rigs that came before it and that now lives behind a single OpenCV call.

With intrinsics in hand, the chapter turns to the question asked most often in production: where is the camera right now? The Perspective-n-Point problem recovers the rotation and translation from a handful of known 3D points and their pixel observations, in microseconds, robustly even when a quarter of the correspondences are wrong, thanks to the same RANSAC logic that rescued matching in Chapter 10. PnP is the workhorse of augmented reality, marker-based robotics, and the tracking loops of visual SLAM; we treat its solver families, its failure modes (including the notorious planar flip), and its fiducial-marker ecosystem. The closing section is the field manual the rest of the chapter earns: which targets to print, how to capture views that constrain every parameter, how to read residual plots like an engineer, and how to keep a calibration trustworthy through months of heat, vibration, and helpful colleagues.

A single thread stitches the five sections together: a calibrated camera converts pixels into rays, and rays are geometry you can compute with. One ray gives a direction; the depth along it is gone, lost to the projection. Recovering that lost dimension, by a second camera, by motion, by learned priors, is the business of the next two chapters and a recurring theme all the way to the generative world models of Part IV. This chapter buys the ticket.

The Chapter in Three Words: Project, Calibrate, Locate

If you remember nothing else from this chapter, remember its spine. Project (Section 12.1): the pinhole model and the intrinsic matrix $K$ flatten the 3D world into pixels, throwing away depth. Calibrate (Sections 12.2 and 12.3): Zhang's checkerboard method recovers $K$ and the distortion polynomial for your specific camera, the slow, once-per-hardware step. Locate (Section 12.4): with $K$ in hand, the PnP problem finds where the camera is every frame, the fast, per-image step. Section 12.5 is the craft that keeps all three honest. Every later geometry chapter reuses this same three-beat structure, so the phrase doubles as a map of the rest of Part II.

Prerequisites

The geometry here leans directly on Chapter 5: Geometric Transformations & Image Warping: homogeneous coordinates, homographies and their DLT estimation, and the remap machinery reappear constantly. From Chapter 10: Keypoints, Descriptors & Matching we reuse subpixel corner detection and the RANSAC robust-fitting pattern. The plumb-line diagnostics of Section 12.2 borrow line detection from Chapter 9: Edges, Lines & Curves, and the discussion of sensors, pixel pitch, and image sharpness assumes the imaging-pipeline vocabulary of Chapter 1: Digital Image Fundamentals. Mathematically, comfort with matrix multiplication, matrix inverses, and the idea of least-squares fitting is assumed; rotation matrices, the singular value decomposition, and homogeneous least squares appear here, and Appendix A: Mathematical Foundations is the refresher for any of them. The nonlinear optimizer that polishes calibrations and poses (Levenberg-Marquardt) is introduced where it is first used.

Chapter Roadmap

12.1 The Pinhole Camera & Intrinsic Parameters Perspective projection from similar triangles, homogeneous coordinates that make it linear, the anatomy of the intrinsic matrix $K$, field of view versus focal length, and why every pixel is really a ray with its depth missing.
12.2 Lens Distortion & Its Correction Barrel and pincushion distortion, the Brown-Conrady radial and tangential model in normalized coordinates, undistortion as a precomputed remap with the alpha trade-off, and the fisheye models that take over past 120 degrees.
12.3 Camera Calibration: Zhang's Method in Practice Why a flat checkerboard suffices: homographies constrain the intrinsics linearly, Levenberg-Marquardt polishes everything including distortion, and cv2.calibrateCamera runs the whole pipeline; reading RMS, per-view errors, and parameter uncertainties.
12.4 Extrinsics & Pose Estimation: The PnP Problem Rotation and translation between world and camera, the Perspective-n-Point problem and its solver families (P3P, EPnP, iterative, IPPE), RANSAC against bad correspondences, and ArUco marker pose with the planar ambiguity.
12.5 Calibration Workflows, Targets & Quality Checks Checkerboard versus circle grid versus ChArUco, a capture protocol that pays for every parameter, residual forensics beyond the RMS number, and running calibration as a versioned, monitored operational asset.

What's Next?

A calibrated camera turns every pixel into a ray, and one ray cannot tell you depth. Chapter 13: Two-View Geometry, Stereo & Depth adds the second camera: epipolar geometry constrains where matches can live, rectification (built from this chapter's intrinsics and distortion models) aligns the search to scanlines, and triangulation finally intersects the rays to recover the dimension that projection destroyed. From there, Chapter 14: Structure from Motion & Visual SLAM scales the same geometry to thousands of views, with the PnP solver from this chapter running in its tracking loop. The arc continues into the learned era: the neural scene representations of Chapter 27 consume exactly the poses and intrinsics taught here, and the 3D-aware generation of Chapter 36 inherits the camera as its interface to space.

Bibliography & Further Reading

Foundational Papers

Zhang, Z. "A Flexible New Technique for Camera Calibration." IEEE TPAMI (2000). doi:10.1109/34.888718

The planar-target calibration method that Section 12.3 teaches and that cv2.calibrateCamera implements: homography constraints on the image of the absolute conic, closed-form initialization, and maximum-likelihood refinement.

Tsai, R. "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses." IEEE Journal of Robotics and Automation (1987). doi:10.1109/JRA.1987.1087109

The pre-Zhang standard, built around radial alignment with a 3D target. Worth reading to appreciate what the move to planar targets simplified, and still cited for its careful error analysis.

Lepetit, V., Moreno-Noguer, F., and Fua, P. "EPnP: An Accurate O(n) Solution to the PnP Problem." IJCV (2009). doi:10.1007/s11263-008-0152-6

The control-point reformulation that made many-point pose estimation linear-time; the solver behind SOLVEPNP_EPNP in Section 12.4.

Gao, X.-S., Hou, X., Tang, J., and Cheng, H. "Complete Solution Classification for the Perspective-Three-Point Problem." IEEE TPAMI (2003). doi:10.1109/TPAMI.2003.1217599

The definitive enumeration of P3P's up-to-four solutions, explaining the multiple-personality epigraph of Section 12.4 and underpinning the minimal solver inside every PnP RANSAC loop.

Fischler, M. and Bolles, R. "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography." Communications of the ACM (1981). doi:10.1145/358669.358692

The RANSAC paper, which also happens to be the paper that named the Perspective-n-Point problem. Both halves of Section 12.4's robust pose pipeline trace to this single reference.

Collins, T. and Bartoli, A. "Infinitesimal Plane-Based Pose Estimation." IJCV (2014). doi:10.1007/s11263-014-0725-5

IPPE, the planar-target pose solver behind SOLVEPNP_IPPE_SQUARE, with the cleanest published treatment of the two-fold planar ambiguity that Section 12.4's museum story dramatizes.

Recent Research (2024-2026)

Veicht, A., Sarlin, P.-E., Lindenberger, P., and Pollefeys, M. "GeoCalib: Learning Single-Image Calibration with Geometric Optimization." ECCV (2024). arXiv:2409.06704

Recovers focal length, gravity, and distortion from one uncontrolled image by differentiating through a geometric optimizer; the strongest current answer to "what if I cannot photograph a checkerboard?"

Wang, S. et al. "DUSt3R: Geometric 3D Vision Made Easy." CVPR (2024). arXiv:2312.14132

Feed-forward pointmaps from unposed, uncalibrated image pairs, with intrinsics recovered afterward from the predictions; the manifesto of the calibration-free reconstruction trend discussed in this chapter's research-frontier callouts.

Wang, J. et al. "VGGT: Visual Geometry Grounded Transformer." CVPR (2025). arXiv:2503.11651

A single transformer pass predicting cameras, depth, and 3D points for hundreds of views; the feed-forward heir to the classical pipeline whose components this chapter and the next two build piece by piece.

Wen, B. et al. "FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects." CVPR (2024). arXiv:2312.08344

Object pose for instances never seen in training, model-based or model-free; the deep-era counterpart of Section 12.4's marker-based pose, and a bridge to the recognition material of Part III.

Books

Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, 2nd edition (2004). robots.ox.ac.uk/~vgg/hzbook

The bible of projective vision. Chapters 6 and 7 cover camera models and calibration at full mathematical depth; the notation of this chapter is deliberately compatible with it.

Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book

Chapter 11 (Structure from Motion and SLAM) and the camera-model material in Chapter 2 give a modern, implementation-aware treatment with generous references; free online.

Tools & Libraries

OpenCV. Camera Calibration and 3D Reconstruction (calib3d module) documentation. docs.opencv.org

The reference for every function this chapter calls: calibrateCamera, solvePnP and its flags, undistort, the fisheye namespace, and the ArUco/ChArUco detector objects. The official calibration tutorial pairs well with Section 12.3.

Kogan, D. mrcal: a camera calibration toolkit. mrcal.secretsauce.net

Splined lens models far richer than the Brown-Conrady polynomial, plus rigorous uncertainty propagation and the best residual-diagnostic tooling available; the professional-grade extension of Section 12.5's quality checks.

ETH ASL. Kalibr: camera and camera-IMU calibration toolbox. github.com/ethz-asl/kalibr

The standard tool when calibration must extend beyond a lone camera: multi-camera rigs, camera-to-IMU transforms, and temporal offsets, the inputs that the SLAM systems of Chapter 14 depend on.