Section 27.1: Monocular Depth Estimation

"Give me two eyes and I will measure the world. Give me one and I will guess it, confidently, from the way the railroad tracks pretend to meet at the horizon."
A Single Lens With Surprisingly Strong Opinions About Distance

Big Picture

Recovering depth from a single image is geometrically impossible and practically routine, because a network can learn the pictorial cues that disambiguate what projection threw away. A lone photograph cannot determine absolute scale: a dollhouse and a real house can produce identical pixels. But it is dense with relative cues (perspective convergence, texture gradients, occlusion, shading, and the familiar sizes of known objects), and a network trained on enough images learns to read them just as a one-eyed person does. This section explains why the supervision must be scale-invariant, builds the encoder-decoder that predicts a depth map, shows how to train it without ground truth by warping video frames, and ends with the 2024 foundation models that turned monocular depth into a reliable off-the-shelf tool.

In the previous chapter we added the time axis to vision; now we add the depth axis, and we begin with the hardest possible version of the problem. Chapter 13 recovered depth from two views by triangulation, a clean geometric procedure: find the same point in both images, measure its disparity, and invert. With a single image there is no disparity to measure and no triangulation to perform. Yet monocular depth estimation works, and works well, because the missing geometric constraint is replaced by a learned statistical prior over how real scenes look. This section is about how a network acquires that prior and what it can and cannot deliver. The whole chapter is one climb, sketched in the illustration below: from a flat depth map up through explicit structures, implicit neural fields, and Gaussian splats.

A cheerful cartoon robot climbs a four-rung ladder rising out of a flat photograph, each rung holding a richer 3D form: a depth-gradient plane, a cluster of dots and a cube, a glowing translucent sphere, and a cloud of blobs forming a house, illustrating the chapter's progression from a flat depth map to explicit structures to implicit neural fields to Gaussian splats. — The whole chapter is one climb: flat, explicit, implicit, splat, with each rung holding the scene's geometry more richly than the last.

1. Why a Single Image Has No Absolute Scale Beginner

Recall the pinhole projection from Chapter 12. A 3D point $(X, Y, Z)$ in camera coordinates lands at the pixel

$$u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y$$

Notice that only the ratios $X/Z$ and $Y/Z$ appear. If you scale the entire scene, multiplying every $X$, $Y$, and $Z$ by some constant $\alpha > 0$, the ratios are unchanged and every pixel lands in exactly the same place. The image is invariant to the scale of the scene. This is the depth-scale ambiguity, and it is not a limitation of any particular method; it is a property of projection itself. A single image can, at best, recover depth up to an unknown global scale. To pin down the absolute meters, you need extra information: a known object size, a calibrated stereo baseline, an inertial sensor, or the camera's focal length combined with a prior on real-world object sizes. (Intrinsics alone are not enough, because the argument above holds for any fixed $f_x$, $f_y$, $c_x$, $c_y$.)
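The invariance is easy to check numerically. The sketch below projects a few points with the pinhole equations and then projects the same points shrunk by a constant factor; the intrinsics and the point coordinates are made-up values chosen only for illustration.

```python
import numpy as np

def project(points_xyz, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    X, Y, Z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return np.stack([u, v], axis=1)

# Hypothetical intrinsics and scene points (illustration only).
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
house = np.array([[1.0, 0.5, 4.0],
                  [-2.0, 1.0, 8.0],
                  [0.5, -0.3, 2.0]])
dollhouse = 0.05 * house          # the same scene, 20 times smaller

pixels_house = project(house, fx, fy, cx, cy)
pixels_dollhouse = project(dollhouse, fx, fy, cx, cy)
same_pixels = np.allclose(pixels_house, pixels_dollhouse)
print(same_pixels)
```

Because $\alpha$ cancels in every ratio, the two pixel arrays agree for any positive scale factor, not just the one tried here.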

This has a direct consequence for how we train. If we naively penalize the squared error between predicted and true depth, the network is punished for getting the scale wrong even when its relative structure is perfect, an unfair penalty given that scale is genuinely unrecoverable from the pixels. The fix, introduced by Eigen and colleagues in 2014, is a scale-invariant loss: compare predicted and true depth only after factoring out the global scale.

Key Insight: Predict Log-Depth, Penalize Only Relative Structure

Working in log-depth turns a global multiplicative scale into a global additive shift, which is far easier to remove. Let $d_i = \log \hat{z}_i - \log z_i$ be the log-error at pixel $i$. The scale-invariant loss subtracts the mean log-error before squaring, so a uniform offset (the scale ambiguity) costs nothing: $$\mathcal{L}_{\text{SI}} = \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\Big(\sum_i d_i\Big)^2$$ With $\lambda = 1$ the loss measures only the variance of the log-error, that is, how well the shape of the depth map matches, independent of overall scale. This single idea is what lets one network generalize across indoor rooms, outdoor streets, and close-up objects whose absolute depths differ by orders of magnitude. Watch what happens to a prediction whose shape is perfect but whose scale is wrong by a factor of two: every $d_i$ equals the same constant $\log 2 \approx 0.69$, so the first term is $0.69^2 \approx 0.48$ but the second term subtracts exactly $0.69^2$, and the loss collapses to $0$. A plain squared-log loss would have charged that same prediction a flat $0.48$ per pixel for a "mistake" the camera made unrecoverable.
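The factor-of-two arithmetic can be reproduced in a few lines. This is a NumPy sketch with $\lambda = 1$ and a made-up four-pixel depth map; the function names are illustrative and not from any library.

```python
import numpy as np

def squared_log_error(pred, target):
    """Plain mean squared error in log-depth."""
    d = np.log(pred) - np.log(target)
    return np.mean(d ** 2)

def scale_invariant_error(pred, target, lam=1.0):
    """Mean squared log-error minus lam times the squared mean log-error."""
    d = np.log(pred) - np.log(target)
    n = d.size
    return np.mean(d ** 2) - lam * d.sum() ** 2 / n ** 2

target = np.array([1.0, 2.0, 4.0, 8.0])   # hypothetical true depths
pred = 2.0 * target                        # perfect shape, scale off by 2x

plain = squared_log_error(pred, target)    # (log 2)^2, about 0.48
si = scale_invariant_error(pred, target)   # zero up to rounding
print(round(plain, 2), abs(si) < 1e-12)
```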

Remember This: Shape for Free, Scale Never

The five-word summary of monocular depth is shape for free, scale never. A single image hands you the relative structure of a scene at no cost, but the absolute meters are genuinely unrecoverable from the pixels alone. This is not a weakness of any one model; it is the depth-scale ambiguity of subsection 1 baked into projection itself. Every practical decision in this section follows from it: train with a scale-invariant loss, read raw foundation-model output as relative, not metric, and recover absolute scale only from an outside anchor (a known object size, a stereo baseline, or a metric model). When a monocular depth result disappoints, ask first whether you demanded scale it could never give.

Fun Note: The Dollhouse Defense

This is exactly the trick behind every miniature special effect ever filmed. A model train set photographed at the right angle is indistinguishable from a real railway, because the camera, like a monocular depth network, has no way to know the scale. Film crews exploit the ambiguity on purpose; depth networks suffer from it by accident. The difference between Hollywood and a confused neural net is that Hollywood knew the scale all along and simply declined to tell you. The illustration below makes the trap literal: one lens, two scenes, identical pixels.

A one-eyed cartoon camera photographs a tiny model train set and a real full-size railway, and both produce an identical framed image, illustrating the depth-scale ambiguity that makes a single image unable to recover absolute scale, the trick behind miniature special effects. — A single lens gives you shape for free but scale never: a dollhouse and a real house can land identical pixels, which is exactly how miniature effects fool the eye on purpose.

2. The Cues a Network Learns Beginner

What does the network actually key on? The same monocular cues that let a person with one eye closed still reach for a coffee cup. Figure 27.1.1 names the four that dominate. Perspective: parallel lines converge with distance, so the angle of convergence encodes depth. Relative size: a familiar object (a car, a person, a door) that appears small is far away. Texture gradient: a regular texture like grass or brick compresses as it recedes. Occlusion: if object A hides part of object B, then A is nearer. None of these is geometrically exact, but together, over millions of training images, they form a powerful prior. The network is not doing geometry; it is doing learned pattern completion constrained by how real 3D scenes project.

Figure 27.1.1: The four dominant monocular depth cues a network learns to read. None determines absolute distance on its own, but jointly they constrain the relative depth ordering of a scene tightly enough that a trained model can produce a dense, coherent depth map from a single photograph.
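The relative-size cue has a simple geometric core. Under the pinhole model, an object of real height $H$ at distance $Z$ spans about $f H / Z$ pixels, so a known $H$ and a measured pixel height give $Z$. The sketch below assumes a fronto-parallel object, a focal length known in pixels, and a typical adult height; all three are illustrative assumptions, and a trained network only learns a soft statistical version of this relation.

```python
def depth_from_known_height(focal_px, real_height_m, pixel_height):
    """Distance in meters to an object of known height under the pinhole model."""
    return focal_px * real_height_m / pixel_height

focal_px = 1000.0          # hypothetical focal length in pixels
person_height_m = 1.7      # assumed typical adult height

near = depth_from_known_height(focal_px, person_height_m, pixel_height=340.0)
far = depth_from_known_height(focal_px, person_height_m, pixel_height=85.0)
print(near, far)           # the person who looks 4x smaller is 4x farther away
```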

Practical Example: A Real-Estate Startup Discovers the Scale Trap

Who: a three-person team at a property-tech startup, 2023, building a tool that estimates room dimensions from listing photos so buyers can check if their furniture fits. Situation: they wired up a pretrained monocular depth network, fed it apartment photos, and got beautiful, plausible depth maps. Problem: when they read off "the sofa wall is 4.2 meters", the number was wrong by a factor that varied from photo to photo, sometimes 0.7x, sometimes 1.5x. The relative structure was perfect, but the absolute meters drifted. Decision: instead of fighting the model, they accepted the scale ambiguity of subsection 1 and added a single anchor per photo: they detected a standard object of known size (an electrical outlet, a doorway at a regulated height) and rescaled the whole depth map so that object measured correctly. Result: measurements came within roughly 5 percent, good enough to answer "will the couch fit". Lesson: a monocular depth model gives you shape for free and scale never; design the product around recovering scale from one known reference rather than trusting the raw meters.
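The anchoring step can be sketched as follows. This is not the team's code. It assumes that the raw map is off only by a global scale, that the anchor object is roughly fronto-parallel, and that the focal length in pixels is known. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def scale_from_known_object(depth_map, object_mask, pixel_height, focal_px, known_height_m):
    """Rescale a depth map so the anchor object measures its known height."""
    object_depth = np.median(depth_map[object_mask])           # up-to-scale depth of the anchor
    measured_height = pixel_height * object_depth / focal_px   # height implied by the raw map
    scale = known_height_m / measured_height
    return depth_map * scale, scale

# Synthetic stand-in for a network output: true depths times an unknown 0.7.
true_depth = np.full((4, 6), 3.0)
true_depth[:, 3:] = 5.0
raw_depth = 0.7 * true_depth

door_mask = np.zeros(raw_depth.shape, dtype=bool)
door_mask[:, 3:] = True                    # hypothetical doorway region at 5 m
focal_px = 800.0                           # assumed known focal length in pixels
door_pixel_height = 2.0 * focal_px / 5.0   # a 2.0 m door at 5 m spans 320 px

metric_depth, scale = scale_from_known_object(
    raw_depth, door_mask, door_pixel_height, focal_px, known_height_m=2.0)
print(round(scale, 3))
```

Using the median over the anchor region rather than a single pixel is a small robustness choice: it tolerates a few bad depth values inside the mask.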

3. The Architecture: An Encoder-Decoder for Dense Regression Intermediate

Structurally, monocular depth estimation is a dense per-pixel regression problem, and that makes it a close cousin of the semantic segmentation you built in Chapter 24. The architecture is the same encoder-decoder shape: a convolutional or transformer encoder compresses the image into a low-resolution, high-channel feature map that captures global context (essential, because depth cues like perspective are global), and a decoder upsamples back to full resolution, fusing skip connections from the encoder so fine boundaries survive. The only differences from segmentation are the head (one continuous output channel instead of per-class logits) and the loss (the scale-invariant regression loss of subsection 1 instead of cross-entropy).

The code below sketches a compact encoder-decoder depth network on a ResNet backbone of the kind you met in Chapter 20. It is deliberately small enough to read; production models are deeper and often transformer-based, but the skeleton is identical.

# Monocular depth as dense regression: a pretrained ResNet-18 encoder
# compresses the image, an upsampling decoder restores resolution, and
# encoder skip connections feed fine detail back to keep depth edges sharp.
import torch
import torch.nn as nn
import torchvision

class DepthNet(nn.Module):
    """Encoder-decoder for monocular depth: ResNet encoder, conv decoder with skips."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        # Encoder stages, kept separately so we can grab skip features.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)  # 1/2
        self.pool = backbone.maxpool                                            # 1/4
        self.enc1, self.enc2 = backbone.layer1, backbone.layer2                 # 1/4, 1/8
        self.enc3, self.enc4 = backbone.layer3, backbone.layer4                 # 1/16, 1/32

        def up(cin, cout):  # upsample-then-conv decoder block
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))
        self.dec4, self.dec3 = up(512, 256), up(256 + 256, 128)
        self.dec2, self.dec1 = up(128 + 128, 64), up(64 + 64, 32)
        self.head = nn.Sequential(up(32, 16), nn.Conv2d(16, 1, 1))  # 1 depth channel

    def forward(self, x):
        s0 = self.stem(x); s1 = self.enc1(self.pool(s0))
        s2 = self.enc2(s1); s3 = self.enc3(s2); s4 = self.enc4(s3)
        d = self.dec4(s4)
        d = self.dec3(torch.cat([d, s3], 1))   # fuse encoder skip for sharp edges
        d = self.dec2(torch.cat([d, s2], 1))
        d = self.dec1(torch.cat([d, s1], 1))
        return torch.sigmoid(self.head(d))      # in (0,1); rescale to a depth range

net = DepthNet()
print("output:", net(torch.randn(1, 3, 256, 256)).shape)  # output: torch.Size([1, 1, 256, 256])

Code Fragment 1: A minimal monocular depth network: a pretrained ResNet-18 encoder feeds an upsampling decoder whose skip connections (the torch.cat calls) restore the fine spatial detail the encoder discarded, ending in a single depth channel at full input resolution.

The scale-invariant loss from subsection 1 is just as compact. The implementation below operates on log-depth and matches the formula exactly; it is the supervision signal you pair with the network above when ground-truth depth (from a laser scanner or RGB-D sensor) is available.

# Scale-invariant log-depth loss (Eigen et al. 2014): penalize the variance
# of the log-error so a uniform scale offset costs nothing, which is what lets
# one network train across scenes whose absolute depths differ wildly.
import torch

def scale_invariant_loss(pred, target, mask, lam=0.85):
    """pred, target: depth maps (>0). mask: valid-pixel boolean. Eigen et al. 2014."""
    d = torch.log(pred[mask]) - torch.log(target[mask])   # per-pixel log error
    n = d.numel()
    return (d ** 2).mean() - lam * (d.sum() ** 2) / (n ** 2)

# With lam = 1 the loss is fully scale-invariant; lam < 1 keeps a little
# absolute-scale pressure, which helps when some weak metric supervision exists.

Code Fragment 2: The scale-invariant log-depth loss. The first term penalizes per-pixel error; the second subtracts the squared mean error, removing the global scale the camera can never recover. The mask excludes pixels with no valid ground-truth depth.

4. Training Without Ground Truth: Self-Supervision from Video Intermediate

Ground-truth depth is expensive: it needs a LiDAR rig or an RGB-D camera, and the resulting data is sparse (LiDAR), noisy, or biased toward short-range indoor scenes (RGB-D). The breakthrough that scaled monocular depth was learning without any depth labels, using only ordinary video or stereo pairs. The idea, from Godard and colleagues, exploits a geometric identity you already know: if you knew the depth of every pixel in frame $t$ and the camera's motion to frame $t+1$, you could compute where each pixel of frame $t$ lands in frame $t+1$, and so rebuild frame $t$ by sampling frame $t+1$ at those locations. The network is trained to make that reconstruction match the real frame $t$. Depth is never supervised directly; it is supervised through the photometric reconstruction error it produces.

Concretely, the network predicts depth $\hat{z}$ for the source frame and a small pose network predicts the relative camera motion $T$ between frames. Each source pixel $p_s$ is back-projected to 3D using $\hat{z}$ and the intrinsics $K$ (from Chapter 12), transformed by $T$, and reprojected into the target frame:

$$p_t \sim K\, T\, \hat{z}(p_s)\, K^{-1} p_s$$

Sampling the target image at those reprojected coordinates gives a reconstruction of the source, and the photometric loss (a blend of pixel difference and structural similarity) drives both networks. The depth that minimizes the reconstruction error is, by the geometry, the correct relative depth. This is the same warping you would use in stereo, turned into a self-supervised objective, and it inherits the depth-scale ambiguity of subsection 1: monocular self-supervision recovers depth only up to scale, while stereo self-supervision (with a known baseline) recovers it metrically.
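To make the reprojection concrete, here is a single-pixel NumPy sketch of back-project, transform, and project. It is not the batched, differentiable version a training loop needs; the intrinsics, the motion, and the function name are illustrative assumptions.

```python
import numpy as np

def reproject(pixel_uv, depth, K, T):
    """Map a source pixel with known depth into the target image.

    T is the 4x4 rigid transform taking source-camera points to target-camera points.
    """
    p_s = np.array([pixel_uv[0], pixel_uv[1], 1.0])
    point_src = depth * (np.linalg.inv(K) @ p_s)    # back-project to 3D
    point_tgt = T[:3, :3] @ point_src + T[:3, 3]    # move into the target camera frame
    p_t = K @ point_tgt                             # project
    return p_t[:2] / p_t[2]                         # divide out the homogeneous scale

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = -0.1    # hypothetical motion: points shift 0.1 m along -x in the target frame

near = reproject((320.0, 240.0), depth=2.0, K=K, T=T)    # moves 25 px
far = reproject((320.0, 240.0), depth=20.0, K=K, T=T)    # moves 2.5 px

# Doubling both the depth and the translation lands on the same pixel as `near`.
T_scaled = T.copy()
T_scaled[0, 3] *= 2.0
same = reproject((320.0, 240.0), depth=4.0, K=K, T=T_scaled)
print(near, far, same)
```

The last call shows the inherited ambiguity directly: depth and translation can be rescaled together without changing any reprojected coordinate, so the photometric loss cannot tell the two solutions apart.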

Library Shortcut: Differentiable Warping in One Call

The back-project, transform, reproject, and sample pipeline is fiddly to write correctly by hand (it is roughly 40 lines with easy-to-flip sign conventions). The Kornia library implements the entire warp as differentiable building blocks:

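import kornia
# warp_frame_depth samples dst_img at the locations where src pixels reproject,
# using the predicted src depth and pose, which yields a reconstruction of src.
# K: (B,3,3) intrinsics; T_src_to_dst: (B,4,4); depth_src: (B,1,H,W)
# Kornia names its arguments from the sampled image's point of view, so our dst
# image is its image_src and our src depth is its depth_dst.
warped_src = kornia.geometry.depth.warp_frame_depth(
    image_src=dst_img, depth_dst=depth_src,
    src_trans_dst=T_src_to_dst, camera_matrix=K)
photometric = (warped_src - src_img).abs().mean()  # the self-supervision signal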

Code Fragment 3: The same depth-pose warp in three lines using Kornia. The single warp_frame_depth call replaces the hand-written back-project, transform, reproject, and grid-sample pipeline, handling the homogeneous coordinates and gradient flow internally so the photometric difference becomes the self-supervision signal.

Kornia handles the homogeneous coordinates, the grid sampling, and the gradient flow, so you supply only the predicted depth and pose and read off the reconstruction. The warp itself is the core of self-supervised depth codebases such as Monodepth2 and ManyDepth; calling a library routine replaces dozens of error-prone hand-written lines with a single, tested call.

5. The 2024 Foundation Models for Depth Advanced

For years, monocular depth was a per-dataset affair: a model trained on indoor scenes failed outdoors, and vice versa. The shift that changed this, beginning with MiDaS in 2020 and reaching maturity in 2024, combined sheer data scale with a scale-and-shift-invariant loss that let a single model train on a mixture of many heterogeneous datasets at once. (The scale-invariant loss of subsection 1 removes a global multiplier; the scale-and-shift-invariant version also removes a global additive offset, which matters because datasets disagree not only on units but on what zero depth means, so both degrees of freedom must be factored out before comparing.) The result is zero-shot relative depth: a model that produces a sensible, sharp relative depth map on essentially any image it is shown, including domains it never saw in training. The widely used default, Depth Anything V2 (2024, arXiv:2406.09414), was trained on roughly 595,000 high-quality synthetic labeled images plus 62 million real unlabeled images through a teacher-student loop on a DINOv2 self-supervised backbone, exactly the foundation-model recipe of Chapter 25 applied to geometry. Its late-2025 successor, Depth Anything 3 (Lin et al., 2025, arXiv:2511.10647), keeps the same plain-transformer philosophy but unifies depth, camera pose, and multi-view geometry behind a single depth-ray prediction target. On its own benchmark it reports roughly a 35 percent gain in camera-pose accuracy and a 24 percent gain in geometric accuracy over the prior feed-forward state of the art (VGGT), while still improving on Depth Anything V2 for monocular depth.
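Factoring out both degrees of freedom amounts to a small least-squares fit: find the scale $s$ and shift $t$ that best map the prediction onto the reference, then measure what error remains. The NumPy sketch below shows only that alignment step on synthetic data. It is not the published MiDaS or Depth Anything loss, which, as far as I know, works in inverse-depth space and adds robustness and gradient terms omitted here. The function names are illustrative.

```python
import numpy as np

def align_scale_shift(pred, target, mask):
    """Least-squares scale s and shift t so that s * pred + t best matches target."""
    p = pred[mask].ravel()
    g = target[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s, t

def scale_shift_invariant_error(pred, target, mask):
    """Mean absolute error that remains after the best scale-and-shift alignment."""
    s, t = align_scale_shift(pred, target, mask)
    return np.mean(np.abs(s * pred[mask] + t - target[mask]))

rng = np.random.default_rng(0)
target = rng.uniform(0.1, 1.0, size=(8, 8))    # synthetic stand-in for ground truth
mask = np.ones(target.shape, dtype=bool)
pred = 3.0 * target - 0.2                       # same structure, different scale and shift

s, t = align_scale_shift(pred, target, mask)
err = scale_shift_invariant_error(pred, target, mask)
print(round(s, 4), round(t, 4), err < 1e-8)
```

A prediction that differs from the reference only by scale and shift scores essentially zero, which is what allows datasets with incompatible depth conventions to be mixed in one training run.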

Running one of these foundation models is now a few lines. The Hugging Face transformers pipeline below loads Depth Anything V2 and produces a relative depth map for any image, the off-the-shelf tool that the real-estate team of subsection 2 wished they had.

# Zero-shot monocular depth with a 2024 foundation model: the Hugging Face
# pipeline downloads the weights, normalizes the input, and returns a relative
# depth map for any image, with no training or per-scene tuning.
from transformers import pipeline
from PIL import Image

# Loads a 2024 foundation depth model; "Small" runs comfortably on a CPU,
# while "Large" is practical mainly on a GPU.
depth = pipeline(task="depth-estimation",
                 model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("street.jpg")
result = depth(image)
result["depth"].save("street_depth.png")   # a relative-depth visualization (PIL Image)
# result["predicted_depth"] is the raw tensor; larger = nearer in this model's convention.
# Its exact shape depends on the image size and the transformers version.
print(result["predicted_depth"].shape)      # e.g. torch.Size([1, 518, 518])

Code Fragment 4: Running Depth Anything V2 in five lines. The pipeline(task="depth-estimation", ...) call downloads weights, normalizes the input, and returns both a visualization and the raw predicted_depth tensor; no training, no per-scene tuning, and it generalizes to images far outside any single dataset.

You Could Build This: A Portrait-Mode Background Blur From One Photo

With the relative depth map above and the Gaussian blur of Chapter 3, you have everything needed to recreate a phone's "portrait mode" on an ordinary single-lens photo. Run Depth Anything V2 to get a depth map, pick a focus depth (say the nearest large object), and blend each pixel between the sharp original and a blurred copy according to how far its depth sits from that focus plane: the farther from the plane, the more blur. The result is synthetic depth-of-field, exactly the fake bokeh that computational-photography teams ship on budget phones that lack a second camera. Difficulty: beginner. Time: about 45 minutes. It is portfolio-ready: a side-by-side of the flat input and your depth-aware blur shows you can turn a foundation model's raw output into a finished visual effect, and it makes the "shape for free, scale never" lesson concrete, since you need only relative depth, never metric meters.
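The blend itself fits in a few lines. The sketch below uses NumPy only, swaps the Gaussian blur for a simple box blur to stay dependency-free, and runs on a synthetic image and depth map instead of real model output. The function names and the linear blend weight are illustrative choices.

```python
import numpy as np

def box_blur(img, radius):
    """Average each pixel with its (2r+1) x (2r+1) neighbourhood, using edge padding."""
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    k = 2 * radius + 1
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def portrait_blur(img, depth, focus_depth, radius=3):
    """Blend sharp and blurred copies by normalized distance from the focus plane."""
    distance = np.abs(depth - focus_depth)
    weight = distance / (distance.max() + 1e-8)    # 0 at the focus plane, about 1 farthest away
    blurred = box_blur(img, radius)
    return (1.0 - weight[..., None]) * img + weight[..., None] * blurred

rng = np.random.default_rng(0)
img = rng.random((16, 16, 3))        # synthetic stand-in for a photo, floats in [0, 1)
depth = np.zeros((16, 16))
depth[:, :8] = 1.0                   # left half is the in-focus subject (relative depth)

out = portrait_blur(img, depth, focus_depth=1.0)
print(out.shape)
```

Only differences in relative depth enter the weight, so the effect works on raw foundation-model output with no metric calibration. A real implementation would likely vary the blur radius with depth rather than blending two fixed copies.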

Common Misconception: The predicted_depth Numbers Are Distances in Meters

A natural first reading of the raw tensor is that a value of 7.0 means a surface seven meters away. It does not. A zero-shot model like Depth Anything V2 is trained to output relative depth, so its values are not metric; they carry no fixed unit and are only meaningful up to an unknown global scale and shift, comparable within one image but not across images. Worse, the raw output is often inverse depth (larger means nearer, the opposite of distance) and is nonlinear in distance, so subtracting two values does not give meters of separation. To get real meters you must either anchor the map to known references (the real-estate fix in subsection 2; one known size fixes the scale, and a second reference is needed when the shift is also unknown) or use a genuinely metric model such as Depth Pro or Metric3D from the research-frontier callout. Treating relative-depth output as calibrated distance is among the most common and most damaging mistakes practitioners make with these models.
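The nonlinearity is easy to see with numbers. The sketch below takes the best case, in which the raw values happen to be true inverse depth in units of 1/m (scale 1, shift 0, an assumption made only for illustration), and compares two pairs of pixels whose raw values differ by the same amount.

```python
def metres_between(inv_near, inv_far):
    """Metric separation of two surfaces given their inverse depths in 1/m."""
    return 1.0 / inv_far - 1.0 / inv_near

# Two pairs with the same raw difference of 0.1 (hypothetical values).
raw_gap_front = 1.0 - 0.9
raw_gap_back = 0.2 - 0.1

sep_front = metres_between(1.0, 0.9)   # surfaces at 1.00 m and about 1.11 m
sep_back = metres_between(0.2, 0.1)    # surfaces at 5 m and 10 m
print(round(sep_front, 3), round(sep_back, 3))
```

The same raw gap corresponds to about 11 centimeters near the camera and 5 meters farther away, so differences in the raw tensor cannot be read as distances even before the unknown scale and shift are considered.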

Research Frontier: Metric Depth and Diffusion Depth (2024-2026)

Two fronts are active. The first is closing the scale gap. Apple's Depth Pro (2024, arXiv:2410.02073) produces sharp metric depth (in meters, not just relative) from a single image, generating a 2.25-megapixel map in about 0.3 seconds on a GPU by estimating focal length internally, attacking the very ambiguity of subsection 1 with a learned camera prior. Metric3D and UniDepth push the same goal by conditioning on or predicting intrinsics, and Microsoft's MoGe (CVPR 2025, arXiv:2410.19115) regresses a full 3D point map and camera field of view from one image at once (affine-invariant in the original release, with metric scale added in the follow-up MoGe-2). The second front borrows the generative prior of Part IV: Marigold (2024, arXiv:2312.02145) fine-tunes a Stable Diffusion latent model to emit depth, treating depth estimation as conditional image generation and inheriting the diffusion model's crisp detail and strong prior from very little labeled data. The convergence is telling: the same denoising machinery you will build for image generation in Chapter 33 turns out to be an excellent geometry estimator, because predicting plausible depth is, like generation, a problem of completing structure under a strong prior.

With a depth map in hand, by whatever route, the natural next question is where to put it. A depth map is still a 2D array; to reason about the scene in 3D, to rotate it, measure it, or feed it to a 3D network, we must lift it into an explicit three-dimensional representation. That is the subject of Section 27.2.

Exercise 27.1.1: Why Scale Cannot Be Recovered Conceptual

Using the pinhole projection equations of subsection 1, prove algebraically that scaling a scene by a factor $\alpha$ (multiplying every $X$, $Y$, $Z$ by $\alpha$) leaves every pixel coordinate $(u, v)$ unchanged, provided the camera intrinsics are fixed. Then explain in two or three sentences why this means a single uncalibrated image cannot recover absolute depth, and name one piece of additional information that would break the ambiguity. Connect your answer to the stereo triangulation of Chapter 13: what does the second camera provide that resolves the scale?

Exercise 27.1.2: Run and Probe a Foundation Depth Model Coding

Install transformers and run the Depth Anything V2 pipeline of subsection 5 on three of your own photos: one indoor scene, one outdoor scene with strong perspective (a road or hallway), and one close-up of a single object on a plain background. For each, visualize the depth map and write one sentence identifying which of the four cues from Figure 27.1.1 the model appears to be using. Then take the perspective photo and digitally remove the converging lines (crop to a flat wall), re-run, and report how the depth map degrades. What does this tell you about the model's reliance on global context?

Exercise 27.1.3: Scale-Invariant versus Plain L2 Analysis

Take a single ground-truth depth map and a predicted depth map that has the correct relative structure but is uniformly 1.5 times too large. Compute both the plain mean-squared log error and the scale-invariant loss of subsection 1 (with $\lambda = 1$) for this pair. Show numerically that the scale-invariant loss is zero up to rounding while the plain loss is clearly nonzero, and explain in a short paragraph why training with the plain loss would push the network to memorize dataset-specific average scales rather than learn transferable relative structure. Relate this to why MiDaS and Depth Anything use a scale-and-shift-invariant loss to train on mixed datasets.