Chapter 27: Depth, 3D Vision & Neural Scene Representations

"For years I flattened the world into a grid of brightness and called it sight. Then someone asked me how far away the chair was, and I realized I had been reading a novel by looking only at the shape of the ink."
A Pixel That Finally Learned to Reach Out and Touch the Scene

Big Picture

This chapter is where the deep networks of Part III collide with the projective geometry of Part II, and the prize is the third dimension that a camera throws away. A photograph is a projection: the moment light hits the sensor, depth is collapsed and the scene's geometry is lost. Part II recovered that geometry with two cameras and triangulation. This chapter recovers it with learned priors, sometimes from a single image, and then goes further, asking a network not merely to estimate depth but to represent the entire scene as a function you can render from any new viewpoint. We travel from a depth map (one number per pixel) through explicit 3D structures (points, voxels, meshes), to networks that operate directly on irregular point sets, and finally to the neural and point-based scene representations, NeRF and 3D Gaussian splatting, that redefined novel-view synthesis between 2020 and 2024. The connecting thread is geometry as a learnable object.

Chapter Overview

Every model in this book so far has lived in the image plane. Even the video networks of Chapter 26 stacked images and reasoned about how the plane changed over time. But the world is not flat, and a great deal of what we want vision systems to do (drive a car, grasp an object, build a map, place a virtual sofa in a living room) requires knowing where things are in three dimensions, not just what they are in two. Recovering that third dimension is the oldest problem in computer vision and one of the hardest, because projection is a lossy, many-to-one mapping: infinitely many 3D scenes produce the same 2D image. Part II solved the problem with explicit geometry: two or more views, calibrated cameras, and triangulation. This chapter shows what changes when you let a network learn the priors that disambiguate depth, and what becomes possible when geometry itself becomes the thing the network outputs.
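The many-to-one claim is easy to check numerically. The sketch below is a minimal pinhole projection in plain Python; the function name `project` and the intrinsics (a focal length of 500 pixels and a principal point at (320, 240)) are illustrative assumptions, not values used elsewhere in the book. It projects three points that lie on the same ray through the camera center and shows that they land on the same pixel.

```python
def project(point, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame point (X, Y, Z) to a pixel (u, v).

    The intrinsics are hypothetical defaults chosen only for this example.
    """
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (fx * X / Z + cx, fy * Y / Z + cy)


# Three points on one ray through the camera center, at depths 1, 2, and 10.
base = (0.2, -0.1, 1.0)
points = [tuple(s * c for c in base) for s in (1.0, 2.0, 10.0)]
pixels = [project(p) for p in points]

for p, (u, v) in zip(points, pixels):
    print(f"3D point {p} -> pixel ({u:.1f}, {v:.1f})")
```

Dividing by Z is what discards depth: scaling a point along its ray scales X, Y, and Z together, so the ratios, and therefore the pixel, do not change.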

We open with the most striking case: estimating depth from a single image. Classically the problem is ill-posed, because a single projection cannot determine absolute scale, yet humans do it effortlessly using learned cues (perspective, texture gradients, familiar object sizes, shading). Section 27.1 shows how a network learns those same cues, why the right loss is scale-invariant, and how the 2024 foundation models for depth (Depth Anything, Marigold, Depth Pro) turned monocular depth from a fragile research demo into a reliable off-the-shelf tool. Once we can produce depth, we need somewhere to put it. Section 27.2 surveys the three explicit ways to store 3D structure (point clouds, voxel grids, and meshes), each with its own memory profile, its own natural operations, and its own pathologies, and shows how to convert a depth map into each one.
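To make the scale-invariance requirement concrete before Section 27.1 develops it, here is a minimal sketch of a scale-invariant log-depth loss in the style of Eigen et al. (2014), written in plain Python. The function name, the `lam` parameter name, and the toy depth values are assumptions for illustration; a real training loop would compute the same quantity on tensors and mask invalid pixels.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Mean squared log-depth error minus lam times the squared mean log error.

    pred, target: equal-length sequences of positive depths.
    With lam=1 a global rescaling of pred leaves the loss unchanged;
    with lam=0 the loss is ordinary mean squared error in log space.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    mean_of_squares = sum(x * x for x in d) / n
    square_of_mean = (sum(d) / n) ** 2
    return mean_of_squares - lam * square_of_mean


target = [1.0, 2.0, 4.0, 8.0]            # toy ground-truth depths
pred = [1.1, 1.9, 4.5, 7.0]              # toy prediction
rescaled = [3.0 * p for p in pred]       # same prediction, wrong global scale

print(scale_invariant_log_loss(pred, target))
print(scale_invariant_log_loss(rescaled, target))         # matches the line above
print(scale_invariant_log_loss(rescaled, target, lam=0))  # penalizes the scale error
```

Multiplying every predicted depth by a constant adds the same offset to every log difference, and the subtracted term removes exactly that offset. The loss therefore scores the shape of the depth map rather than its absolute scale, which a single image cannot supply.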

Explicit structures raise a deep learning problem the rest of the book sidestepped. A point cloud has no grid, no canonical ordering, no fixed neighbor relationships, so the convolution of Chapter 19 simply does not apply. Section 27.3 builds the architecture that solved this, PointNet, from its central insight (a symmetric function of per-point features gives permutation invariance) and follows it to the hierarchical and graph-based successors that brought back local structure. Then we make the conceptual leap that defines modern 3D vision. Instead of storing geometry explicitly, Section 27.4 represents a whole scene as a small neural network, a neural radiance field, that maps a 3D point and a viewing direction to color and density, and renders new views by marching rays through that field and integrating. NeRF is the place where the volume rendering of graphics meets the gradient descent of deep learning, and it leans directly on the camera calibration and pose estimation of Chapter 12 and Chapter 14.
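The permutation-invariance insight can be demonstrated without any deep learning library. The following sketch applies one shared linear-plus-ReLU layer to every point and max-pools each feature channel over the set. The feature width, the random weights, and the function names are illustrative assumptions; this is not the PointNet architecture itself, which stacks several shared layers and adds input and feature transforms.

```python
import random


def shared_mlp(point, weights, biases):
    """Apply the same one-layer MLP (linear + ReLU) to a single 3D point."""
    return [max(0.0, sum(w * x for w, x in zip(row, point)) + b)
            for row, b in zip(weights, biases)]


def global_feature(points, weights, biases):
    """Per-point features followed by a channel-wise max over all points."""
    feats = [shared_mlp(p, weights, biases) for p in points]
    return [max(channel) for channel in zip(*feats)]


rng = random.Random(0)
FEATURE_DIM = 8  # hypothetical width, chosen only for the example
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(FEATURE_DIM)]
biases = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]

shuffled = cloud[:]
rng.shuffle(shuffled)  # same set of points, different order

original = global_feature(cloud, weights, biases)
permuted = global_feature(shuffled, weights, biases)
print(original == permuted)
```

Because the per-point function never sees a point's index and the maximum does not depend on the order of its arguments, any reordering of the cloud yields the same global feature. Replacing the max with a sum or a mean would preserve that property; concatenating the per-point features would not.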

NeRF is beautiful and slow. Section 27.5 covers the representation that kept its photorealism while throwing out the per-ray network: 3D Gaussian splatting, which models the scene as millions of colored, oriented translucent blobs and renders them by rasterization at real-time rates, often well over a hundred frames per second on a modern GPU. Splatting is, in a sense, the explicit point cloud of Section 27.2 made differentiable and renderable, and its arrival in 2023 reshaped the field within a year. Finally, Section 27.6 steps back from any single method to the practical pipeline that real captures flow through: take photos or video, recover camera poses with structure from motion, train a radiance field or splat, clean the result, and export to a game engine or web viewer. This is where the geometry of Part II and the learning of Part III become one workflow, and where the failure modes that no paper advertises actually live.
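NeRF and splatting share more than photorealism: both form a pixel by front-to-back alpha compositing, and they differ mainly in where the per-sample opacities come from. The sketch below shows that shared step in plain Python. The function names and the toy colors, densities, and opacities are assumptions for illustration; in an actual splatting renderer each alpha would come from a projected Gaussian's opacity and its 2D footprint at the pixel.

```python
import math


def composite(colors, alphas):
    """Front-to-back compositing: C = sum_i T_i * alpha_i * c_i,
    where the transmittance T_i is the product of (1 - alpha_j) for j < i."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha
        out = [o + weight * c for o, c in zip(out, color)]
        transmittance *= 1.0 - alpha
    return out, transmittance


def alphas_from_density(sigmas, deltas):
    """NeRF-style opacity of a ray segment: alpha_i = 1 - exp(-sigma_i * delta_i)."""
    return [1.0 - math.exp(-s * d) for s, d in zip(sigmas, deltas)]


colors = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]   # red, green, blue samples, near to far

# NeRF-style ray: empty space, then thin fog, then a nearly opaque surface.
nerf_pixel, nerf_T = composite(
    colors, alphas_from_density([0.0, 5.0, 50.0], [0.1, 0.1, 0.1]))

# Splat-style pixel: depth-sorted primitives with directly specified opacities.
splat_pixel, splat_T = composite(colors, [0.0, 0.5, 1.0])

print(nerf_pixel, nerf_T)
print(splat_pixel, splat_T)
```

The blending weights plus the leftover transmittance always sum to one, which is why an opaque sample hides everything behind it, and why the whole operation stays differentiable with respect to colors and opacities.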

Throughout, the recurring lesson is that 3D vision is a negotiation between two kinds of knowledge: the hard constraints of projective geometry, which are exact whenever the calibration and correspondences feeding them are right, and the soft priors of a trained network, which fill in everything geometry cannot determine. The best systems use both. Keep that tension in mind, because it returns transformed in Chapter 36, where 3D structure is no longer recovered from a real scene but generated from scratch.

If you remember one shape for the whole chapter, make it the ladder below: each rung holds the scene's geometry more richly than the last, and every method here is a rung or a jump between rungs.

Mental Model: The Ladder of Representations

The chapter climbs a single ladder, from the thinnest description of geometry to the richest, and the through-line is one phrase: flat, explicit, implicit, splat.

Flat (Section 27.1): a depth map, one number per pixel, still glued to the image plane.
Explicit (Sections 27.2 to 27.3): points, voxels, and meshes that live in real 3D space, and the PointNet that learns on them.
Implicit (Section 27.4): NeRF, where the whole scene hides inside a network's weights and appears only when you query it.
Splat (Section 27.5): millions of explicit 3D Gaussians, the implicit field's realism made fast and rasterizable.

Figure 27.0.1: The ladder of representations the chapter climbs. Each rung holds the scene's geometry more richly than the one below: a flat per-pixel depth map (Section 27.1), explicit points, voxels, and meshes in real 3D space (Sections 27.2 to 27.3), an implicit neural radiance field stored in a network's weights (Section 27.4), and an explicit cloud of 3D Gaussians that makes that realism fast to render (Section 27.5). Section 27.6 is the ladder put to work as a capture-to-render pipeline.

Figure 27.0.1 draws that climb as a single shape. Section 27.6 then runs the whole ladder in one pass, as a capture-to-render pipeline that goes from photographs to a rung you can ship.

Prerequisites

This chapter sits at the meeting point of the book's two halves, so it draws on both. From Part II you should be comfortable with Chapter 12: Camera Models & Calibration (the pinhole model, intrinsics, and the projection that maps 3D points to pixels), Chapter 13: Two-View Geometry, Stereo & Depth (disparity, triangulation, and why depth is recoverable from two views), and Chapter 14: Structure from Motion & Visual SLAM (recovering camera poses and sparse 3D structure from many images, which is exactly what NeRF and splatting consume as input). From Part III you need the network and training fundamentals of Chapter 18: Neural Networks & PyTorch for Vision, the convolutional encoder-decoder of Chapter 19: Convolutional Neural Networks (the backbone of every depth network), and the dense-prediction ideas of Chapter 24: Segmentation, since monocular depth is a per-pixel regression problem with the same encoder-decoder shape as semantic segmentation. A nodding acquaintance with the self-supervised foundation models of Chapter 25 helps for the 2024 depth models, which are built on those backbones.

Chapter Roadmap

27.1 Monocular Depth Estimation. Recovering depth from one image: the cues a network learns, why the loss must be scale-invariant, the encoder-decoder architecture, self-supervised training from video, and the 2024 foundation models (Depth Anything, Marigold, Depth Pro) that made monocular depth reliable off the shelf.
27.2 3D Representations: Point Clouds, Voxels & Meshes. The three explicit ways to store geometry: unordered point clouds, regular voxel grids, and triangle meshes. Memory profiles, natural operations, and conversions, including lifting a depth map into a colored point cloud with the camera intrinsics of Chapter 12.
27.3 Learning on Point Clouds: PointNet & Successors. Why a convolution cannot consume a point cloud, and how PointNet's symmetric max-pooling solves permutation invariance. The architecture built from scratch, then the hierarchical (PointNet++) and graph (DGCNN) successors, and the modern point transformers.
27.4 NeRF: Neural Radiance Fields. Representing a scene as a small network mapping position and direction to color and density, and rendering new views by integrating along rays. The volume rendering equation, positional encoding, the training loop, and why poses from structure from motion are the unglamorous prerequisite.
27.5 3D Gaussian Splatting. Modeling a scene as millions of colored anisotropic 3D Gaussians and rendering them by differentiable rasterization at real-time rates. The Gaussian parameters, alpha-blended splatting, adaptive densification, and why this point-based representation overtook NeRF for many tasks within a year.
27.6 Capture-to-Render Pipelines in Practice. The end-to-end workflow a real capture flows through: shooting good images, recovering poses with COLMAP, training a radiance field or splat in Nerfstudio, and the failure modes (bad poses, reflective surfaces, floaters) that no paper advertises. Where Part II geometry and Part III learning become one workflow.
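The depth-map lifting named in the 27.2 entry is the pinhole projection of Chapter 12 run in reverse, and it is short enough to sketch here. The version below uses plain Python loops so it runs anywhere; a real pipeline would vectorize it with NumPy or PyTorch. The function name `backproject`, the tiny 2 by 3 depth map, and the intrinsics are hypothetical, and the sketch assumes metric depth with non-positive values marking invalid pixels.

```python
def backproject(depth, fx, fy, cx, cy, colors=None):
    """Lift a depth map to camera-frame points (X, Y, Z), optionally with color.

    depth:  rows of depth values; values <= 0 are treated as invalid and skipped.
    colors: optional rows of (r, g, b) tuples with the same layout as depth.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx   # invert u = fx * X / Z + cx
            y = (v - cy) * z / fy   # invert v = fy * Y / Z + cy
            point = (x, y, z)
            if colors is not None:
                point += tuple(colors[v][u])
            points.append(point)
    return points


# Toy 2 x 3 depth map with one invalid pixel and made-up intrinsics.
depth = [[2.0, 2.0, 0.0],
         [2.0, 4.0, 4.0]]
fx = fy = 2.0
cx, cy = 1.0, 0.5

cloud = backproject(depth, fx, fy, cx, cy)
for p in cloud:
    print(p)
```

Each valid pixel becomes one point, so the resulting cloud inherits the image's sampling: dense where the camera looked, empty where it was occluded. A relative-depth model's output would first need a scale (and possibly a shift) before this step produces metrically meaningful geometry.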

What's Next?

Once a scene can be represented as a renderable 3D function, two roads open. One leads to deployment: the radiance fields and splats of this chapter are computationally heavy, and getting them to run on a phone or an embedded device is exactly the problem of Chapter 28: Efficient Vision & Edge Deployment, where quantization, pruning, and mobile inference make these representations practical outside a workstation. The other road leads to generation. Every method in this chapter reconstructs a real scene from real photographs; in Chapter 36: Video, 3D Generation & World Models, the same radiance-field and splat representations become the output of generative models that hallucinate plausible 3D scenes from text or a single image, with no real capture at all. The volume rendering you learn here is the differentiable bridge that lets a 2D image-generation prior sculpt a 3D object. Understanding how to recover geometry is the prerequisite for learning how to invent it.

Bibliography & Further Reading

Foundational Papers: Monocular Depth

Eigen, D., Puhrsch, C. & Fergus, R. "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network." NeurIPS (2014). arXiv:1406.2283

The first deep monocular depth network and the source of the scale-invariant loss in Section 27.1. It established the multi-scale coarse-to-fine architecture and the framing of depth as a per-pixel regression problem.

Godard, C., Mac Aodha, O. & Brostow, G. "Unsupervised Monocular Depth Estimation with Left-Right Consistency (Monodepth)." CVPR (2017). arXiv:1609.03677

The self-supervised training paradigm of Section 27.1: supervise depth with a photometric reconstruction loss instead of ground-truth depth, by warping one view of a stereo pair into the other. Monodepth2 (2019) refined it and extended it to monocular video.

Ranftl, R. et al. "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer (MiDaS)." TPAMI (2020). arXiv:1907.01341

MiDaS, the model that made monocular depth generalize across scenes by training on a mixture of datasets with a scale-and-shift-invariant loss. The direct ancestor of the 2024 foundation depth models in Section 27.1.

Yang, L. et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR (2024), arXiv:2401.10891; "Depth Anything V2" (2024), arXiv:2406.09414

The 2024 foundation model for monocular depth featured in Section 27.1. V2 is trained on roughly 595,000 synthetic labeled images plus 62 million real unlabeled images via a teacher-student loop on a DINOv2 backbone, and is the widely used default for robust off-the-shelf relative depth.

Ke, B. et al. "Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (Marigold)." CVPR (2024). arXiv:2312.02145

Marigold of Section 27.1, which fine-tunes a Stable Diffusion latent model to emit depth, showing that a generative image prior transfers to crisp, detailed depth with little labeled data. The bridge to the generative models of Part IV.

Foundational Papers: 3D Representations & Learning

Qi, C. R. et al. "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation." CVPR (2017). arXiv:1612.00593

The architecture of Section 27.3. A shared per-point MLP followed by a symmetric max-pool gives permutation invariance, which made this the first network to consume raw point clouds directly. The conceptual foundation for all point-based learning.

Qi, C. R. et al. "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space." NeurIPS (2017). arXiv:1706.02413

PointNet++ of Section 27.3, which reintroduces local structure by applying PointNet recursively on nested neighborhoods, recovering the spatial hierarchy that the original flat PointNet discarded.

Wang, Y. et al. "Dynamic Graph CNN for Learning on Point Clouds (DGCNN)." ACM TOG (2019). arXiv:1801.07829

DGCNN of Section 27.3, which builds a k-nearest-neighbor graph in feature space at each layer and applies edge convolutions, capturing relationships between points rather than treating each in isolation.

Foundational Papers: Neural Scene Representations

Mildenhall, B. et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV (2020), Best Paper Honorable Mention. arXiv:2003.08934

The original NeRF, the subject of Section 27.4. A small MLP maps position and view direction to color and density, rendered by volume integration along rays. The paper that launched the neural-scene-representation field.

Müller, T. et al. "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (Instant-NGP)." ACM TOG / SIGGRAPH (2022). arXiv:2201.05989

Instant-NGP of Section 27.4, which replaced NeRF's fixed frequency encoding and large MLP with a learned multi-resolution hash grid feeding a much smaller network, cutting training from hours to seconds or a few minutes. The optimization that made radiance fields practical.

Kerbl, B. et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM TOG / SIGGRAPH (2023), Best Paper. arXiv:2308.04079

The 3D Gaussian splatting paper of Section 27.5. Millions of optimized anisotropic Gaussians rendered by differentiable rasterization match NeRF quality while rendering in real time. The representation that reshaped the field in a year.

Feed-Forward 3D Reconstruction (2024-2026)

Wang, J. et al. "VGGT: Visual Geometry Grounded Transformer." CVPR (2025), Best Paper Award. arXiv:2503.11651

The feed-forward 3D transformer of the frontier callouts in Sections 27.4 and 27.6. From one to hundreds of images it predicts camera parameters, depth, dense point maps, and 3D point tracks in a single pass in under a second, dissolving the COLMAP-then-train pipeline that the chapter is organized around.

Lin, H. et al. "Depth Anything 3: Recovering the Visual Space from Any Views." (2025). arXiv:2511.10647

The late-2025 successor to Depth Anything featured in Sections 27.1, 27.4, and 27.6. A single plain transformer unifies monocular depth, camera pose, and multi-view geometry behind one depth-ray target. On its own benchmark the authors report roughly a 35 percent gain in camera-pose accuracy and a 24 percent gain in geometric accuracy over VGGT.

Keetha, N. et al. "MapAnything: Universal Feed-Forward Metric 3D Reconstruction." (2025). arXiv:2509.13414

The universal feed-forward reconstruction model in the Section 27.6 frontier callout. One transformer ingests one or many images, plus optional intrinsics, poses, or depth, and regresses metric scene geometry and cameras in a single pass, unifying uncalibrated structure from motion, multi-view stereo, and depth completion.

Tools, Libraries & Benchmarks

Tancik, M. et al. "Nerfstudio: A Modular Framework for Neural Radiance Field Development." SIGGRAPH (2023). docs.nerf.studio · arXiv:2302.04264

The pipeline framework of Section 27.6: a unified, modular toolkit for processing captures, training NeRFs and splats (including Splatfacto), and viewing results interactively. The reference codebase for the capture-to-render workflow.

Schönberger, J. L. & Frahm, J.-M. "Structure-from-Motion Revisited (COLMAP)." CVPR (2016). colmap.github.io

COLMAP, the structure-from-motion engine of Sections 27.4 and 27.6 that recovers the camera poses every radiance field and splat needs. The unglamorous but essential first stage of the pipeline, built on the geometry of Chapter 14.

Open3D, a modern library for 3D data processing. open3d.org

The point-cloud and mesh library of Section 27.2: loading, visualization, downsampling, normal estimation, registration, and surface reconstruction. The library shortcut behind most of this chapter's explicit-geometry code.

PyTorch3D, a library for deep learning with 3D data. pytorch3d.org

The differentiable 3D library behind Sections 27.2 and 27.3: batched point clouds and meshes, differentiable rendering, chamfer distance, and the operators that make point and mesh learning practical in PyTorch.