Section 36.4: Generative Neural Rendering: From Splats to Scenes

"For years they fed me fifty photos of a real room and I learned to render it from new angles. Then one day they fed me no photos and a sentence, and asked for a room that never existed. The renderer did not change. Only my imagination got a prior."
A 3D Gaussian That Stopped Memorizing and Started Dreaming

Big Picture

Neural rendering and generative modeling are two halves of one idea: a differentiable renderer turns 3D parameters into images, and a generative prior over those parameters turns the renderer from a memory of one captured scene into a machine that imagines new ones. This section takes the 3D Gaussian splatting of Chapter 27 and makes it generative, by diffusing over splat parameters, by amortizing scene generation into a feed-forward network, and by using the video and 3D generators of the previous sections as scene-building blocks.

Section 36.3 generated individual objects: a chair, an animal, a prop. This section scales up to scenes and dwells on the representation that has come to dominate, 3D Gaussian splatting. In Chapter 27 you fit a cloud of millions of anisotropic 3D Gaussians, each with a position, covariance, opacity, and color, to a set of photographs, and rasterized them into novel views in real time. That was reconstruction: the Gaussians memorized one real scene. Generative neural rendering instead asks a model to produce Gaussians for a scene that was never photographed, conditioned on text, an image, or nothing at all, and hands them to the same renderer.

1. The Renderer Is Already Generative-Ready Intermediate

The key realization is that the splatting renderer of Chapter 27 is differentiable and fast, which makes it a perfect generative-model output head. Recall the rendering equation for a pixel as a depth-ordered alpha composite over the Gaussians that project onto it:

$$ C \;=\; \sum_{i} c_i\, \alpha_i \prod_{j<i} (1 - \alpha_j), $$

where $c_i$ is the color of Gaussian $i$ and $\alpha_i$ is its effective opacity at this pixel: its stored opacity modulated by the projected Gaussian footprint. That footprint is the 2D screen-space Gaussian obtained by the EWA projection $\boldsymbol{\Sigma}' = J W \boldsymbol{\Sigma} W^\top J^\top$ derived in Section 27.5, with $W$ the view transform and $J$ the Jacobian of the projection linearized at the Gaussian center. The product $\prod_{j<i}(1-\alpha_j)$ is the transmittance, the fraction of light that survives the trip through all the Gaussians in front of $i$ without being blocked, so each Gaussian contributes its color only in proportion to how much of the view is still unobstructed when the ray reaches it. A three-Gaussian trace makes the accumulation concrete: with three front-to-back Gaussians of opacity $\alpha = [0.5, 0.5, 0.5]$, the first contributes weight $0.5$, the second $0.5 \times (1 - 0.5) = 0.25$, and the third $0.5 \times (1 - 0.5)(1 - 0.5) = 0.125$. The layers therefore paint $50\%$, then $25\%$, then $12.5\%$ of the pixel, each taking half of what its predecessors left unpainted, and the final $12.5\%$ is left for whatever lies behind; like coats of translucent paint, every coat covers half of whatever is still showing. Because $C$ is differentiable in every Gaussian parameter, gradients flow from a rendered image back to positions, covariances, colors, and opacities. A generative model only has to produce those parameters; the renderer turns them into viewable, view-consistent images for free, and the same gradient path supports the score distillation of Section 36.3.
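The opacity trace above is small enough to check by hand, but writing it as a loop shows how the running transmittance is carried from one Gaussian to the next. The sketch below is plain Python with illustrative function names (`compositing_weights` and `composite` are not from Chapter 27), and it handles a single pixel with scalar colors only.

```python
def compositing_weights(alphas):
    """Per-Gaussian blend weights alpha_i * prod_{j<i}(1 - alpha_j) for
    opacities listed front to back, plus the leftover transmittance."""
    weights = []
    transmittance = 1.0              # fraction of the pixel still unobstructed
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= (1.0 - alpha)
    return weights, transmittance


def composite(colors, alphas):
    """Blend scalar (single-channel) colors using the weights above."""
    weights, _ = compositing_weights(alphas)
    return sum(w * c for w, c in zip(weights, colors))


weights, leftover = compositing_weights([0.5, 0.5, 0.5])
print(weights, leftover)   # [0.5, 0.25, 0.125] 0.125
```

The weights and the leftover transmittance always sum to one, which is why the pixel color is a proper blend of the Gaussian colors and the background.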

This means generative neural rendering needs no new rendering theory. It needs a prior, a way to sample plausible sets of Gaussians, plugged into the front of the renderer you already have. Three priors are in use, in increasing order of abstraction.

2. Three Ways to Make Splatting Generative Advanced

Distillation prior (per-scene optimization). The most direct route, used by DreamGaussian (Tang et al., 2024), is the score-distillation loop of Section 36.3 with Gaussians as the optimized parameters instead of a NeRF. The 2D diffusion model critiques rendered views and the gradient sculpts the Gaussian cloud. This needs no 3D training data but optimizes each scene from scratch.
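To make the distillation loop tangible, here is a deliberately tiny sketch rather than DreamGaussian itself. It assumes a one-pixel, one-channel "renderer" that composites three Gaussian colors with the fixed weights $0.5, 0.25, 0.125$ from Section 1, and it replaces the trained 2D diffusion model with the exact noise predictor for a toy prior in which the pixel value is distributed as $\mathcal{N}(\mu, \sigma^2)$. The names `MU`, `SIGMA`, `predict_noise`, and `sds_optimize` are hypothetical, and the usual timestep weighting $w(t)$ is set to 1.

```python
import math
import random

WEIGHTS = [0.5, 0.25, 0.125]   # fixed compositing weights (alpha = 0.5 each)
MU, SIGMA = 0.8, 0.05          # toy "image prior": pixel ~ N(MU, SIGMA^2)


def render(colors):
    """One-pixel renderer: the alpha-composite sum of w_i * c_i."""
    return sum(w * c for w, c in zip(WEIGHTS, colors))


def predict_noise(x_t, alpha_bar):
    """Exact noise prediction under the Gaussian toy prior; a stand-in
    (assumption) for a trained 2D diffusion model."""
    var_t = alpha_bar * SIGMA ** 2 + (1.0 - alpha_bar)
    return math.sqrt(1.0 - alpha_bar) * (x_t - math.sqrt(alpha_bar) * MU) / var_t


def sds_optimize(colors, steps=1000, lr=0.2, seed=0):
    """Score-distillation loop: noise the render, ask the critic for the
    noise, and push (predicted - true) noise back through the renderer."""
    rng = random.Random(seed)
    colors = list(colors)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.1, 0.9)      # random noise level
        eps = rng.gauss(0.0, 1.0)
        x = render(colors)
        x_t = math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * eps
        residual = predict_noise(x_t, alpha_bar) - eps
        # Chain rule through the renderer: d(render) / d(c_i) = w_i.
        colors = [c - lr * residual * w for c, w in zip(colors, WEIGHTS)]
    return colors


final = sds_optimize([0.0, 0.0, 0.0])
print(render(final))   # should land close to MU
```

Nothing in the loop touches 3D training data: the only supervision is the critic's opinion of the rendered pixel, and the price is that the whole loop must be rerun for every new scene.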

Diffusion prior (generate the parameters directly). Treat the set of Gaussian parameters as the data and train a diffusion model to denoise them, the same DDPM machinery from Chapter 33 applied to splat tensors rather than pixels. The challenge is that a Gaussian set is unordered and variable-size, so these models often impose structure (a fixed grid of Gaussians, or a latent that decodes to Gaussians) to make the data a regular tensor a U-Net or transformer can denoise. This gives genuine sampling diversity without per-scene optimization.
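One way to impose the structure described above is to pad every scene to a fixed number of Gaussians and keep a validity mask, after which the standard forward-noising step applies elementwise. The sketch below assumes a hypothetical 14-number layout per Gaussian (3 position, 3 log-scale, 4 quaternion, 1 opacity logit, 3 color) and uses plain Python lists so it runs without dependencies; a real model would hold these as tensors.

```python
import math
import random

PARAM_DIM = 14   # assumed layout: 3 pos + 3 log-scale + 4 quat + 1 opacity + 3 color


def pack_scene(gaussians, max_gaussians):
    """Pad (or truncate) a variable-size list of Gaussians to a fixed
    (max_gaussians, PARAM_DIM) grid plus a validity mask."""
    kept = gaussians[:max_gaussians]
    pad = max_gaussians - len(kept)
    grid = [list(g) for g in kept] + [[0.0] * PARAM_DIM for _ in range(pad)]
    mask = [1] * len(kept) + [0] * pad
    return grid, mask


def forward_noise(grid, alpha_bar, rng):
    """DDPM forward process applied elementwise:
    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in grid]
    x_t = [[a * x + b * e for x, e in zip(row, erow)]
           for row, erow in zip(grid, eps)]
    return x_t, eps


# Three placeholder Gaussians packed into a grid of eight slots.
scene = [[float(i)] * PARAM_DIM for i in range(1, 4)]
grid, mask = pack_scene(scene, max_gaussians=8)
x_t, eps = forward_noise(grid, alpha_bar=0.5, rng=random.Random(0))
print(len(x_t), len(x_t[0]), mask)
```

A denoiser trained on such grids would predict `eps` from `x_t`, with the loss restricted to slots where `mask` is 1 so that padding does not count as data.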

Amortized prior (feed-forward scene generation). Train a network that maps a conditioning signal (one image, a few images, or a text embedding) directly to a Gaussian cloud in one forward pass, the LRM-style amortization of Section 36.3 extended to splats and to whole scenes. The pixelSplat and MVSplat lines (2024) predict Gaussians from sparse views in a fraction of a second; the latent-LRM successors generate room-scale scenes from text.
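A common output design for such feed-forward predictors, assumed here for illustration, is one Gaussian per input pixel, with the mean obtained by back-projecting a predicted depth through the camera intrinsics. The sketch below shows only that final step: the depth map and colors are supplied by hand where a trained network would predict them, and the function name, the isotropic scale rule, and the fixed opacity are all hypothetical.

```python
def unproject_to_gaussians(depth, colors, fx, fy, cx, cy, opacity=0.9):
    """Turn an H x W depth map and per-pixel RGB into one Gaussian per pixel.
    Means come from pinhole back-projection in the camera frame."""
    gaussians = []
    for v, (depth_row, color_row) in enumerate(zip(depth, colors)):
        for u, (d, rgb) in enumerate(zip(depth_row, color_row)):
            mean = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            scale = d / fx           # rough size of one pixel at depth d
            gaussians.append({"mean": mean, "scale": scale,
                              "opacity": opacity, "color": tuple(rgb)})
    return gaussians


# A flat 3 x 3 "prediction" two units in front of the camera.
depth = [[2.0] * 3 for _ in range(3)]
colors = [[(0.5, 0.5, 0.5)] * 3 for _ in range(3)]
cloud = unproject_to_gaussians(depth, colors, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(len(cloud), cloud[4]["mean"])   # 9 (0.0, 0.0, 2.0)
```

Because the mapping from network outputs to Gaussians is a fixed formula, the whole scene appears in one forward pass, and the cost of generation has been moved into training.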

Figure 36.4.1: Three generative priors share one differentiable splatting renderer. The distillation prior optimizes Gaussians per scene against a 2D diffusion critic; the diffusion prior denoises Gaussian parameters directly; the amortized prior maps a conditioning signal to Gaussians in a single forward pass. All three emit a Gaussian cloud, and the shared real-time renderer of Chapter 27 turns it into view-consistent images. The renderer is the constant; the prior is the variable.

Figure 36.4.1 makes the architecture explicit: the renderer is fixed and the three priors are interchangeable front ends. The code below shows the minimal differentiable splat-rendering forward pass that every one of these priors targets, so you can see exactly what a generative model must produce.

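# Minimal differentiable splat renderer: composite depth-sorted Gaussians
# front-to-back with running transmittance. This fixed forward model is the
# target every generative prior must produce parameters for.
import torch

def gaussian_footprint(mean2d, image_size, sigma=2.0):
    """Isotropic 2D Gaussian weight over the pixel grid, centered at (x, y).
    A simplified stand-in for the projected covariance of Chapter 27."""
    H, W = image_size
    ys = torch.arange(H, dtype=mean2d.dtype, device=mean2d.device)
    xs = torch.arange(W, dtype=mean2d.dtype, device=mean2d.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    d2 = (xx - mean2d[0]) ** 2 + (yy - mean2d[1]) ** 2
    return torch.exp(-0.5 * d2 / sigma**2)       # (H, W), peak value 1

def render_gaussians(means2d, colors, opacities, depths, image_size):
    """Front-to-back alpha compositing of 2D-projected Gaussians (simplified).
    A generative prior produces means/colors/opacities; this renderer is fixed."""
    H, W = image_size
    out = torch.zeros(H, W, 3, dtype=colors.dtype, device=colors.device)
    transmittance = torch.ones(H, W, dtype=colors.dtype, device=colors.device)
    order = torch.argsort(depths)                # composite near-to-far
    for i in order:
        # splat this Gaussian's footprint as a soft weight over pixels (simplified)
        alpha = opacities[i] * gaussian_footprint(means2d[i], image_size)  # (H, W)
        out = out + transmittance.unsqueeze(-1) * alpha.unsqueeze(-1) * colors[i]
        transmittance = transmittance * (1.0 - alpha)   # attenuate behind
    return out

# Everything is differentiable: backprop reaches means2d, colors, opacities,
# so any prior (distillation, diffusion, amortized) trains through this renderer.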

Code Fragment 1: A simplified differentiable Gaussian-splat renderer. The render_gaussians loop performs front-to-back alpha compositing with a running transmittance, exactly the operation from Chapter 27. Because every step is differentiable in means2d, colors, and opacities, all three generative priors of Figure 36.4.1 can train through it without modification.

Key Insight: The Prior Is the Only New Thing

Across this section the renderer never changes; only the source of its inputs does. This is the cleanest illustration of a theme that runs through all of Part IV: a generative model is a prior plus a likelihood, and in neural rendering the likelihood (the renderer mapping parameters to images) is borrowed wholesale from the reconstruction era. Reconstruction taught us a differentiable, fast, view-consistent renderer; generation just learns to feed it. Whenever you can write a differentiable forward model of how parameters become observations, you can turn a fitting problem into a generation problem by putting a learned prior in front of it.

Right Tool: Real Splat Rendering with gsplat

The naive Python loop above is for understanding only; it would be hopelessly slow for the millions of Gaussians a real scene needs. The gsplat library provides the optimized, fully differentiable CUDA rasterizer:

# Production splat rasterization: gsplat's CUDA kernel replaces the Python loop,
# rendering millions of Gaussians in milliseconds while staying differentiable
# end-to-end, so a generative prior can train through it at real-scene scale.
from gsplat import rasterization
# means: (N,3), quats: (N,4), scales: (N,3), opacities: (N,), colors: (N,3)
# viewmats: (C,4,4) world-to-camera, Ks: (C,3,3) intrinsics, for C cameras
rendered, alpha, info = rasterization(
    means, quats, scales, opacities, colors,
    viewmats, Ks, width=512, height=512)   # tile-based CUDA, fully differentiable

Code Fragment 2: Real splat rendering with gsplat: one rasterization call, a tile-based differentiable CUDA kernel, rasterizes millions of Gaussians in milliseconds, replacing the hand-written compositing loop of Code Fragment 1 while still backpropagating to every parameter.

The single call covers projection, tiling, and depth sorting as well as compositing, work that would otherwise take hundreds of lines of hand-written CUDA. It handles millions of Gaussians in milliseconds and still backpropagates to every parameter, which is exactly what an amortized or distillation prior needs in its training loop.

3. Composing Scenes from Generated Pieces Intermediate

A practical route to whole scenes, rather than training one giant scene generator, is composition: generate or retrieve individual objects (Section 36.3), place them in a layout, and let a generative model harmonize lighting and fill background. Because splats are an explicit point-like representation, two Gaussian clouds can be merged by concatenating their parameter lists and re-rendering, an editability that NeRF's implicit field lacks. This explicitness is why the field has largely migrated to splats for generative and editable 3D, and it connects forward to world models: a world model that must place and move objects benefits from a representation where objects are separable, addressable things rather than entangled in a single MLP.
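The merge described above can be sketched in a few lines. This version stores each cloud as a dictionary of per-Gaussian lists, a hypothetical layout chosen so the example runs in plain Python; with tensors, the same merge is a concatenation along the Gaussian dimension. Placement is limited to translation here, since a rotation would also have to rotate each Gaussian's covariance.

```python
def translate(cloud, offset):
    """Return a copy of the cloud with every mean shifted by offset."""
    moved = dict(cloud)
    moved["means"] = [tuple(m + o for m, o in zip(mean, offset))
                      for mean in cloud["means"]]
    return moved


def merge(*clouds):
    """Concatenate per-Gaussian parameter lists field by field."""
    keys = clouds[0].keys()
    return {k: [item for cloud in clouds for item in cloud[k]] for k in keys}


room = {"means": [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
        "opacities": [0.9, 0.9],
        "colors": [(0.8, 0.8, 0.8), (0.8, 0.8, 0.8)]}
sofa = {"means": [(0.0, 0.0, 0.0)],
        "opacities": [0.95],
        "colors": [(0.6, 0.1, 0.1)]}

scene = merge(room, translate(sofa, (1.0, 0.0, 2.0)))
print(len(scene["means"]), scene["means"][-1])   # 3 (1.0, 0.0, 2.0)
```

The sofa keeps its own index range inside the merged lists, so it can be moved or deleted later without touching the room, which is the separability the paragraph above attributes to explicit representations.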

Fun Note

Adding a sofa to a splat scene is, gloriously, torch.cat. The whole room and the new couch are just two lists of fuzzy little ellipsoids, and merging them is concatenation, not surgery. Try that with a NeRF, whose entire scene is baked into one tangled MLP, and "put a lamp in the corner" becomes "please retrain the universe." Splats won the generative-3D popularity contest partly on physics and speed, and partly because they let you rearrange the furniture without a lawsuit from the renderer.

From the Field: Real Estate Walkthroughs from a Few Phone Photos

A property-tech company let agents capture a handful of phone photos of an empty apartment and wanted to deliver an interactive 3D walkthrough plus virtually staged furniture. Their v1 used classical photogrammetry (the structure-from-motion of Chapter 14) and produced holey, low-quality meshes that looked worse than the photos. The v2 pipeline used a feed-forward sparse-view Gaussian generator to reconstruct the room from as few as five photos in seconds, then composed in furniture generated by a text-to-3D model (Section 36.3), merging the Gaussian clouds and re-rendering so the lighting matched. Because splats render in real time in a browser, the walkthrough was smooth on a phone. The lesson the CTO drew: the move from meshes to generative splats turned a fragile, capture-heavy reconstruction pipeline into a forgiving, few-shot generative one, and the explicit splat representation is what made virtual staging a concatenation rather than a re-optimization.

4. From Static Scenes to Worlds Beginner

Generative neural rendering, as covered so far, produces a static 3D scene: beautiful, view-consistent, but frozen in time. The natural next axis is the one this whole chapter is building toward, dynamics. A scene that changes over time, with objects that move and respond, is a world, and a generative model of such a thing is a world model. The bridge is conceptual but short: take the latent-space view (a scene is a set of parameters), add a learned transition function that predicts the next parameters from the current ones and an action, and you have moved from generative rendering to generative simulation.
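The step from scene to world can be sketched as a function from (state, action) to the next state, applied repeatedly. In the sketch below the state is a set of Gaussian means with a per-Gaussian `movable` flag, and the transition is a hand-written rule that treats the action as a velocity; both are assumptions standing in for the learned dynamics that Section 36.5 develops.

```python
def transition(state, action, dt=0.1):
    """Hand-written stand-in for a learned dynamics model: the action is a
    velocity applied to the Gaussians tagged as movable."""
    next_means = []
    for mean, movable in zip(state["means"], state["movable"]):
        if movable:
            mean = tuple(m + dt * a for m, a in zip(mean, action))
        next_means.append(mean)
    return {"means": next_means, "movable": state["movable"]}


def rollout(state, actions):
    """Apply the transition once per action; every state in the returned
    list is a full set of scene parameters that could be rendered."""
    states = [state]
    for action in actions:
        states.append(transition(states[-1], action))
    return states


start = {"means": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
         "movable": [False, True]}       # a wall Gaussian and a chair Gaussian
states = rollout(start, [(1.0, 0.0, 0.0)] * 10)
print(len(states), states[-1]["means"][0])   # 11 (0.0, 0.0, 0.0)
```

Swapping the hand-written rule for a trained network changes nothing else in the loop: the states remain sets of splat parameters, and the renderer of this section turns each one into an image.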

That transition function, the learned dynamics, is the subject of the rest of the chapter. Section 36.5 builds it in its most explicit and classical form, a recurrent state-space model that predicts the next latent from the current latent and an action, and trains an agent inside the resulting dream. The renderer of this section becomes the decoder that turns the world model's latent state back into pixels, closing the loop from imagined dynamics to viewable, controllable scenes.

Research Frontier: Generative Splatting and 4D Worlds (2024-2026)

The frontier here is moving fast on three fronts. Feed-forward splat generators (the Splatter Image, LGM, and GS-LRM lines, 2024) emit dense Gaussian clouds from a single image or a few views in one pass, and from text when paired with a multi-view diffusion front end, pushing scene generation toward interactive speed. Dynamic (4D) Gaussians add per-Gaussian motion trajectories so a generated scene can move, the spatial substrate for the world simulators of Section 36.6. And video-conditioned reconstruction closes the loop with the first half of this chapter: a video diffusion model (Sections 36.1 to 36.2) generates a turntable of consistent views, and a splat generator lifts those views into an explicit, editable, real-time 3D scene, a pipeline that several 2025 systems ship end to end. The signal is unmistakable: pixels, geometry, and dynamics are collapsing into one generative stack, and the next sections give that stack a controller.

Exercise 36.4.1: One Renderer, Two Jobs Conceptual

The section claims the splatting renderer is identical for reconstruction and generation, and only the prior differs. Defend this claim by listing what changes and what stays the same when you move from fitting Gaussians to one real scene (Chapter 27) to sampling Gaussians for a new scene. Why is a differentiable renderer the precondition for both?

Exercise 36.4.2: Editing as a Tensor Operation Coding

Using gsplat (or the simplified renderer if no GPU is available), create two small Gaussian clouds, a red cube and a blue sphere, by hand. Merge them by concatenating their parameter tensors and render the combined scene from several viewpoints. Verify view consistency, then perturb one cloud's positions and re-render to confirm that object-level editing is a tensor operation, not a re-optimization.

Exercise 36.4.3: Choosing a Generative Prior Analysis

Compare the three generative priors (distillation, diffusion, amortized) on three axes: training-data requirement, per-scene generation time, and sample diversity. For a product that must generate one bespoke scene per user request with no 3D training corpus, which prior fits, and what is the cost you accept? For a product generating millions of scenes from a fixed catalog of styles, which fits instead?