"For years I flattened the world into a grid of brightness and called it sight. Then someone asked me how far away the chair was, and I realized I had been reading a novel by looking only at the shape of the ink."
A Pixel That Finally Learned to Reach Out and Touch the Scene
This chapter is where the deep networks of Part III collide with the projective geometry of Part II, and the prize is the third dimension that a camera throws away. A photograph is a projection: the moment light hits the sensor, depth is collapsed and the scene's geometry is lost. Part II recovered that geometry with two cameras and triangulation. This chapter recovers it with learned priors, sometimes from a single image, and then goes further, asking a network not merely to estimate depth but to represent the entire scene as a function you can render from any new viewpoint. We travel from a depth map (one number per pixel) through explicit 3D structures (points, voxels, meshes), to networks that operate directly on irregular point sets, and finally to the neural and point-based scene representations, NeRF and 3D Gaussian splatting, that redefined novel-view synthesis between 2020 and 2024. The connecting thread is geometry as a learnable object.
Chapter Overview
Every model in this book so far has lived in the image plane. Even the video networks of Chapter 26 stacked images and reasoned about how the plane changed over time. But the world is not flat, and a great deal of what we want vision systems to do (drive a car, grasp an object, build a map, place a virtual sofa in a living room) requires knowing where things are in three dimensions, not just what they are in two. Recovering that third dimension is among the oldest problems in computer vision and one of the hardest, because projection is a lossy, many-to-one mapping: infinitely many 3D scenes produce the same 2D image. Part II solved the problem with explicit geometry: two or more views, calibrated cameras, and triangulation. This chapter shows what changes when you let a network learn the priors that disambiguate depth, and what becomes possible when geometry itself becomes the thing the network outputs.
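The many-to-one nature of projection is easy to check numerically. The following is a minimal sketch, assuming an ideal pinhole camera with made-up intrinsics (`focal`, `cx`, `cy`) and a hypothetical three-point scene. It scales the scene by ten about the camera centre and confirms that the pixels do not move. It is written in plain Python so it runs without dependencies.

```python
def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame point (X, Y, Z) to a pixel (u, v).

    The intrinsics are illustrative values, not taken from any real camera.
    """
    X, Y, Z = point
    return (focal * X / Z + cx, focal * Y / Z + cy)


# A hypothetical scene: three points in front of the camera, in metres.
scene = [(0.2, -0.1, 2.0), (-0.5, 0.3, 4.0), (1.0, 0.0, 5.0)]

# The same scene made ten times larger and ten times farther away.
scaled = [(10 * X, 10 * Y, 10 * Z) for X, Y, Z in scene]

pixels = [project(p) for p in scene]
pixels_scaled = [project(p) for p in scaled]

# Every point lands on the same pixel, up to floating-point rounding.
same_image = all(
    abs(u1 - u2) < 1e-6 and abs(v1 - v2) < 1e-6
    for (u1, v1), (u2, v2) in zip(pixels, pixels_scaled)
)
print(same_image)
```

Because X/Z and Y/Z are unchanged by a common scale factor, no amount of staring at one image can tell the two scenes apart. That missing factor is what a second view or a learned prior has to supply.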
We open with the most striking case: estimating depth from a single image. Classically this is impossible, because a single projection cannot determine scale, yet humans do it effortlessly using learned cues (perspective, texture gradients, familiar object sizes, shading). Section 27.1 shows how a network learns those same cues, why the right loss is scale-invariant, and how the 2024 foundation models for depth (Depth Anything, Marigold, Depth Pro) turned monocular depth from a fragile research demo into a reliable off-the-shelf tool. Once we can produce depth, we need somewhere to put it. Section 27.2 surveys the three explicit ways to store 3D structure (point clouds, voxel grids, and meshes), each with its own memory profile, its own natural operations, and its own pathologies, and shows how to convert a depth map into each one.
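To make "scale-invariant" concrete, here is a small sketch of one well-known form of such a loss, the log-space loss introduced by Eigen et al. (2014). The chapter's own loss in Section 27.1 may differ. The function name, the `lam` parameter, and the sample depths are illustrative assumptions.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Scale-invariant log loss over positive depths.

    With d_i = log(pred_i) - log(target_i), returns
    mean(d_i^2) - lam * mean(d_i)^2. With lam = 1 this is the variance of
    the log error, which ignores any global scale factor on the prediction.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * (sum(d) ** 2) / (n * n)


# Hypothetical ground-truth depths in metres.
target = [1.0, 2.0, 4.0, 8.0]

# Correct relative structure, wrong global scale (everything 3x too far).
pred_scaled = [3.0 * t for t in target]

# Wrong structure: a flat wall where the scene actually recedes.
pred_flat = [1.0, 1.0, 1.0, 1.0]

loss_scaled = scale_invariant_log_loss(pred_scaled, target)          # close to 0
loss_flat = scale_invariant_log_loss(pred_flat, target)              # clearly positive
loss_scaled_plain = scale_invariant_log_loss(pred_scaled, target, lam=0.0)

print(loss_scaled, loss_flat, loss_scaled_plain)
```

Setting `lam=0.0` recovers an ordinary squared error in log space, which punishes the rescaled prediction even though its relative depths are perfect. The subtracted term removes that penalty.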
Explicit structures raise a deep learning problem the rest of the book sidestepped. A point cloud has no grid, no canonical ordering, no fixed neighbor relationships, so the convolution of Chapter 19 simply does not apply. Section 27.3 builds the architecture that solved this, PointNet, from its central insight (a symmetric function of per-point features gives permutation invariance) and follows it to the hierarchical and graph-based successors that brought back local structure. Then we make the conceptual leap that defines modern 3D vision. Instead of storing geometry explicitly, Section 27.4 represents a whole scene as a small neural network, a neural radiance field, that maps a 3D point and a viewing direction to color and density, and renders new views by marching rays through that field and integrating. NeRF is the place where the volume rendering of graphics meets the gradient descent of deep learning, and it leans directly on the camera calibration and pose estimation of Chapter 12 and Chapter 14.
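The central insight named above, a symmetric function of per-point features, can be sketched in a few lines. This is a toy illustration, not the PointNet architecture of Section 27.3. The weights are hypothetical fixed numbers rather than trained parameters, and a single linear layer with ReLU stands in for the shared per-point network.

```python
import random

# Hypothetical fixed weights of a shared per-point layer: 3 inputs -> 4 features.
WEIGHTS = [
    [0.5, -1.0, 0.2],
    [1.5, 0.3, -0.7],
    [-0.4, 0.8, 1.1],
    [0.9, -0.2, -0.6],
]


def point_feature(point):
    """Shared per-point map: the same linear layer plus ReLU for every point."""
    return [max(0.0, sum(w * x for w, x in zip(row, point))) for row in WEIGHTS]


def global_feature(cloud):
    """Symmetric aggregation: channel-wise max over all per-point features."""
    features = [point_feature(p) for p in cloud]
    return [max(channel) for channel in zip(*features)]


rng = random.Random(0)
cloud = [
    (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(16)
]

# The same points presented in a different order.
shuffled = cloud[:]
rng.shuffle(shuffled)

feature_a = global_feature(cloud)
feature_b = global_feature(shuffled)
print(feature_a == feature_b)
```

The result is identical because each point passes through the same function and the maximum does not care about order. The cost of that convenience is that a single global max discards local neighborhoods, which is the gap the hierarchical and graph-based successors address.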
NeRF is beautiful and slow. Section 27.5 covers the representation that kept its photorealism while throwing out the network that NeRF queries at every sample along every ray: 3D Gaussian splatting, which models the scene as millions of colored, oriented, translucent blobs and renders them by rasterization at real-time rates, often above a hundred frames per second on a desktop GPU. Splatting is, in a sense, the explicit point cloud of Section 27.2 made differentiable and renderable, and its arrival in 2023 reshaped the field within a year. Finally, Section 27.6 steps back from any single method to the practical pipeline that real captures flow through: take photos or video, recover camera poses with structure from motion, train a radiance field or splat, clean the result, and export to a game engine or web viewer. This is where the geometry of Part II and the learning of Part III become one workflow, and where the failure modes that no paper advertises actually live.
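Both renderers mentioned here, the ray marcher of a radiance field and the rasterizer of a splat, end in the same operation: front-to-back alpha compositing of samples ordered from near to far. The sketch below is a simplified illustration of that shared step for one pixel. The function names, densities, step size, and colors are assumptions chosen for the example.

```python
import math


def alphas_from_density(densities, step):
    """Opacity of each ray segment of length `step`: 1 - exp(-sigma * step)."""
    return [1.0 - math.exp(-sigma * step) for sigma in densities]


def composite(colors, alphas):
    """Front-to-back alpha compositing of samples ordered near to far.

    Returns the accumulated RGB color and the transmittance left over,
    i.e. the fraction of light that passed through every sample.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha          # contribution of this sample
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha            # light surviving past it
    return pixel, transmittance


# One hypothetical ray: empty space, then a dense red surface, then blue behind it.
colors = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
densities = [0.0, 50.0, 50.0]

alphas = alphas_from_density(densities, step=0.1)
pixel, remaining = composite(colors, alphas)
print(pixel, remaining)
```

The pixel comes out almost pure red: the empty sample contributes nothing, and the red surface blocks nearly all the light that would have reached the blue one. In a radiance field the alphas come from densities sampled along the ray, as here. In splatting they come from each Gaussian's opacity and footprint after depth sorting, but the accumulation loop is the same idea.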
Throughout, the recurring lesson is that 3D vision is a negotiation between two kinds of knowledge: the hard constraints of projective geometry, which are exact once the cameras are correctly calibrated and posed, and the soft priors of a trained network, which fill in everything geometry cannot determine. The best systems use both. Keep that tension in mind, because it returns transformed in Chapter 36, where 3D structure is no longer recovered from a real scene but generated from scratch.
If you remember one shape for the whole chapter, make it the ladder below: each rung describes geometry more richly than the last, and every method here is a rung or a jump between rungs.
The chapter climbs a single ladder, from the thinnest description of geometry to the richest, and the through-line is one phrase: flat, explicit, implicit, splat.
- Flat (Section 27.1): a depth map, one number per pixel, still glued to the image plane.
- Explicit (Sections 27.2 to 27.3): points, voxels, and meshes that live in real 3D space, and the PointNet that learns on them.
- Implicit (Section 27.4): NeRF, where the whole scene hides inside a network's weights and appears only when you query it.
- Splat (Section 27.5): millions of explicit 3D Gaussians, the implicit field's realism made fast and rasterizable.
Figure 27.0.1 draws that climb as a single shape. Section 27.6 is the ladder put to work: the capture-to-render pipeline that climbs from photographs to a rung you can ship.
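The first jump on the ladder, from the flat rung to the explicit one, is a direct inversion of the pinhole model. The following is a minimal sketch, assuming a metric depth map stored as nested lists and illustrative intrinsics (`fx`, `fy`, `cx`, `cy`). A real pipeline would vectorize this and carry color along with each point.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map to a list of camera-frame 3D points.

    depth[v][u] is the depth Z at pixel column u, row v. Pixels with
    non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A tiny hypothetical 2x3 depth map in metres; 0.0 marks a missing measurement.
depth = [
    [2.0, 2.0, 2.0],
    [2.0, 0.0, 4.0],
]

points = backproject(depth, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
print(len(points), points[0], points[-1])
```

Each pixel keeps its depth as Z and gains X and Y from its offset to the principal point, so the grid structure of the image is gone and only an unordered set of points remains.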
Prerequisites
This chapter sits at the meeting point of the book's two halves, so it draws on both. From Part II you should be comfortable with Chapter 12: Camera Models & Calibration (the pinhole model, intrinsics, and the projection that maps 3D points to pixels), Chapter 13: Two-View Geometry, Stereo & Depth (disparity, triangulation, and why depth is recoverable from two views), and Chapter 14: Structure from Motion & Visual SLAM (recovering camera poses and sparse 3D structure from many images, which is exactly what NeRF and splatting consume as input). From Part III you need the network and training fundamentals of Chapter 18: Neural Networks & PyTorch for Vision, the convolutional encoder-decoder of Chapter 19: Convolutional Neural Networks (the backbone of the classic depth networks), and the dense-prediction ideas of Chapter 24: Segmentation, since monocular depth is a per-pixel regression problem with the same encoder-decoder shape as semantic segmentation. A nodding acquaintance with the self-supervised foundation models of Chapter 25 helps for the 2024 depth models, which are built on those backbones.
Chapter Roadmap
- 27.1 Monocular Depth Estimation Recovering depth from one image: the cues a network learns, why the loss must be scale-invariant, the encoder-decoder architecture, self-supervised training from video, and the 2024 foundation models (Depth Anything, Marigold, Depth Pro) that made monocular depth reliable off the shelf.
- 27.2 3D Representations: Point Clouds, Voxels & Meshes The three explicit ways to store geometry: unordered point clouds, regular voxel grids, and triangle meshes. Memory profiles, natural operations, and conversions, including lifting a depth map into a colored point cloud with the camera intrinsics of Chapter 12.
- 27.3 Learning on Point Clouds: PointNet & Successors Why a convolution cannot consume a point cloud, and how PointNet's symmetric max-pooling solves permutation invariance. The architecture built from scratch, then the hierarchical (PointNet++) and graph (DGCNN) successors, and the modern point transformers.
- 27.4 NeRF: Neural Radiance Fields Representing a scene as a small network mapping position and direction to color and density, and rendering new views by integrating along rays. The volume rendering equation, positional encoding, the training loop, and why poses from structure from motion are the unglamorous prerequisite.
- 27.5 3D Gaussian Splatting Modeling a scene as millions of colored anisotropic 3D Gaussians and rendering them by differentiable rasterization at real-time rates. The Gaussian parameters, alpha-blended splatting, adaptive densification, and why this point-based representation overtook NeRF for many tasks within a year.
- 27.6 Capture-to-Render Pipelines in Practice The end-to-end workflow a real capture flows through: shooting good images, recovering poses with COLMAP, training a radiance field or splat in Nerfstudio, and the failure modes (bad poses, reflective surfaces, floaters) that no paper advertises. Where Part II geometry and Part III learning become one workflow.
What's Next?
Once a scene can be represented as a renderable 3D function, two roads open. One leads to deployment: the radiance fields and splats of this chapter are computationally heavy, and getting them to run on a phone or an embedded device is exactly the problem of Chapter 28: Efficient Vision & Edge Deployment, where quantization, pruning, and mobile inference make these representations practical outside a workstation. The other road leads to generation. Every method in this chapter reconstructs a real scene from real photographs; in Chapter 36: Video, 3D Generation & World Models, the same radiance-field and splat representations become the output of generative models that hallucinate plausible 3D scenes from text or a single image, with no real capture at all. The volume rendering you learn here is the differentiable bridge that lets a 2D image-generation prior sculpt a 3D object. Understanding how to recover geometry is the prerequisite for learning how to invent it.