"I spent a week deriving a variational bound with a dozen KL terms, and at the end it told me to just predict the noise and square the error. I am told this is what physicists call beauty."
A Loss Function That Simplified Beyond Recognition
The denoising diffusion probabilistic model derives a rigorous variational bound on the data log-likelihood, exactly as the VAE did, but that intimidating bound collapses to a simple mean-squared error between true and predicted noise, which is the loss everyone actually trains. Three design choices turn this principle into a working model: the noise schedule that decides how fast corruption proceeds, the parameterization that decides whether the network predicts the noise, the clean image, or a velocity blend of the two, and the weighting of the loss across timesteps. This section works through all three. The payoff is that you will understand both why the loss is mathematically justified and why, in practice, it is one line of code.
In Section 33.1 you built the forward and reverse processes and trained a denoiser with a noise-prediction loss that we asked you to take on faith. This section pays that debt. We are about to write down a likelihood bound with a dozen KL terms that looks like a graduate exam question, and then watch almost all of it evaporate until what remains is the one-line squared error you already trained against. Stay with the derivation, because the moment it collapses is the moment you understand why diffusion training is both principled and trivial to code. We first derive where the loss comes from (the variational lower bound that DDPM shares with the VAE of Chapter 31) and show how each term of that bound reduces to a tractable quantity. Then we turn to the engineering knobs that separate a model that trains from one that trains well: the schedule, the parameterization, and the loss weighting. By the end you will be able to read the configuration of any modern diffusion model and know what every field means.
1. The Variational Bound, and How It Collapses (Advanced)
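We want to maximize the likelihood the model assigns to real data, $\log p_\theta(x_0)$, but the marginal requires integrating over all the latent noisy states $x_1, \dots, x_T$, which is intractable. The same problem appeared with the VAE, and the same tool solves it: the evidence lower bound (ELBO). Treating the entire forward trajectory as the "encoder" and the reverse process as the "decoder," and writing the result as an upper bound on the negative log-likelihood, the bound is

$$-\log p_\theta(x_0) \;\le\; \mathbb{E}_q\Big[ -\log p_\theta(x_0 \mid x_1) \;+\; \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) \;+\; D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) \Big].$$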
This looks forbidding, but every piece is benign. The last term compares the fully corrupted $x_T$ to a standard Gaussian; since the forward process is designed to reach exactly that, the term is essentially zero and has no learnable parameters. The first term is a reconstruction at the final denoising step. The interesting work is in the middle sum: each term compares the model's reverse step $p_\theta(x_{t-1} \mid x_t)$ against the true posterior $q(x_{t-1} \mid x_t, x_0)$, which, crucially, is a Gaussian we can write in closed form because we condition on the clean $x_0$. With $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t$ their cumulative product, its mean is

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 \;+\; \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\,x_t .$$
A KL divergence between two Gaussians with the same variance reduces to the squared distance between their means, divided by twice that shared variance. So each middle term becomes $\frac{1}{2\sigma_t^2}\| \tilde\mu_t - \mu_\theta \|^2$ plus a constant that does not depend on $\theta$. Substituting the noise-prediction expression for $\mu_\theta$ from Section 33.1 and simplifying (the algebra is mechanical but lengthy) turns that mean-distance into a distance between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$, with a per-timestep weight in front. Ho et al. found that simply dropping that weight, setting it to one for every $t$, trained better and gave the now-canonical loss:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\; t\big) \big\|^2 \Big].$$
The variational derivation tells you the principled objective is a weighted sum of denoising errors. The empirical discovery of DDPM is that the unweighted version, plain mean-squared error on the noise, trains better, because the principled weights overemphasize the tiny-noise steps where the task is trivial. This is a recurring pattern in deep learning: the theory tells you the shape of the objective, and an empirical simplification of it works best. You can train a strong diffusion model knowing only $\mathcal{L}_{\text{simple}}$, but knowing the bound tells you why it is a sound objective and where the freedom to reweight comes from.
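To make "overemphasize the tiny-noise steps" concrete, the sketch below evaluates the weight that multiplies $\|\epsilon - \epsilon_\theta\|^2$ in the bound at each timestep. It rests on two assumptions: the linear schedule of the original DDPM ($\beta_t$ from $10^{-4}$ to $0.02$) and the fixed reverse variance $\sigma_t^2 = \beta_t$, under which the weight is $\beta_t^2 / \big(2\sigma_t^2\,\alpha_t\,(1 - \bar\alpha_t)\big)$ with $\alpha_t = 1 - \beta_t$. The function name elbo_weights is ours, and the code is plain Python rather than torch so the numbers are easy to inspect.

```python
# Per-timestep weight on ||eps - eps_theta||^2 in the variational bound,
# assuming a linear beta schedule and reverse variance sigma_t^2 = beta_t.

def elbo_weights(T=1000, b0=1e-4, b1=0.02):
    """Return [w_1, ..., w_T] with w_t = beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t))."""
    weights = []
    alpha_bar = 1.0
    for i in range(T):
        beta = b0 + (b1 - b0) * i / (T - 1)  # linear schedule
        alpha = 1.0 - beta
        alpha_bar *= alpha                    # running cumulative product
        sigma2 = beta                         # assumed reverse-process variance
        weights.append(beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - alpha_bar)))
    return weights


w = elbo_weights()
for t in (1, 10, 100, 500, 1000):
    print(f"t={t:4d}  bound weight={w[t - 1]:.4f}")
print("w_1 / w_T:", w[0] / w[-1])
```

Under these assumptions the weight is close to 0.5 at $t = 1$ and on the order of 0.01 for most of the rest of the chain, so the exact bound spends a disproportionate share of the gradient on steps where the input is almost clean. Setting every weight to one removes that imbalance.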
Because the loss was derived from a likelihood bound, it is tempting to read $\mathcal{L}_{\text{simple}}$ as the model's data likelihood and to assume that a lower loss means crisper, better images. Both readings mislead. Once the per-timestep weights are dropped, $\mathcal{L}_{\text{simple}}$ is no longer the variational bound at all; it is a reweighted denoising score-matching objective, so its numerical value is not a likelihood and cannot be compared across noise schedules or parameterizations. And it correlates only loosely with perceptual sample quality: two models can reach nearly identical loss yet differ sharply in Frechet Inception Distance (FID), the sample-quality metric of Chapter 37, because the loss averages denoising error over all noise levels while FID is dominated by the structure the sampler assembles at a few critical ones. Judge a diffusion model by generated samples and FID, never by the training loss curve alone.
2. Noise Schedules: Linear vs Cosine (Intermediate)
The schedule $\{\beta_t\}$, the variance schedule introduced in Section 33.1 and also called the noise schedule, decides how quickly the forward process destroys the image, and through $\bar\alpha_t$ it sets the signal-to-noise ratio at every step. The original DDPM used a linear schedule, $\beta_t$ rising linearly from $10^{-4}$ to $0.02$. This works but has a flaw at high resolution: it destroys information too quickly near the end, so the last fraction of timesteps carries almost no signal and is wasted. Nichol and Dhariwal proposed a cosine schedule that keeps more signal in the middle of the process by defining $\bar\alpha_t$ directly:

$$\bar\alpha_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right),$$
where $s$ is a small offset (about $0.008$) that prevents $\beta_t$ from being too small near $t = 0$. The cosine schedule spends more steps at intermediate noise levels, which is where the model does most of its useful learning, and it noticeably improves sample quality on datasets like ImageNet. Figure 33.2.1 contrasts how $\bar\alpha_t$ decays under the two schedules.
The two schedules are a few lines apart in code. The functions below return the $\bar\alpha_t$ arrays for both, and you can drop either into the q_sample of Section 33.1.

```python
# Two noise schedules expressed through their alpha_bar arrays.
# Linear builds beta_t first then takes the cumulative product; cosine
# defines alpha_bar directly so signal decays gradually at high resolution.
import math

import torch


def linear_alpha_bars(T=1000, b0=1e-4, b1=0.02):
    betas = torch.linspace(b0, b1, T)
    return torch.cumprod(1.0 - betas, dim=0)


def cosine_alpha_bars(T=1000, s=0.008):
    steps = torch.arange(T + 1) / T
    f = torch.cos((steps + s) / (1 + s) * math.pi / 2) ** 2
    ab = f / f[0]
    return ab[1:]  # length T, alpha_bar at each step


lin = linear_alpha_bars()
cos = cosine_alpha_bars()
print("linear alpha_bar at t=900:", lin[900].item())  # ~0.0003, nearly all noise
print("cosine alpha_bar at t=900:", cos[900].item())  # ~0.024: far more signal survives
```
The linear_alpha_bars and cosine_alpha_bars functions side by side, each returning the $\bar\alpha_t$ array you can drop into q_sample. The printed values show that at $t=900$ the cosine schedule retains far more signal (about 0.024) than the linear one (about 0.0003), the quantitative version of the gap in Figure 33.2.1.

Print lin[t] and cos[t] from Code Fragment 1 as you sweep t across 0, 100, 300, 500, 700, 900, and watch the ratio between them. Near $t=0$ the two are almost identical; somewhere past the midpoint the cosine value pulls dramatically ahead, and by $t=900$ it holds roughly ninety times more signal than the linear one. The takeaway is concrete: the schedules barely differ early, so the cosine schedule's whole advantage lives in the second half of the chain, exactly the high-noise steps the linear schedule wastes. That is the number behind Figure 33.2.1, felt rather than read.
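The sweep described above can also be run without torch. The sketch below is a standard-library mirror of the two schedule functions, using the same constants ($T = 1000$, $\beta_t$ from $10^{-4}$ to $0.02$, $s = 0.008$); the list-based helpers and the ratios dictionary are our own illustration, not part of the chapter's code.

```python
import math


def linear_alpha_bars(T=1000, b0=1e-4, b1=0.02):
    """Pure-Python mirror of the torch version: cumulative product of 1 - beta_t."""
    out, ab = [], 1.0
    for i in range(T):
        ab *= 1.0 - (b0 + (b1 - b0) * i / (T - 1))
        out.append(ab)
    return out


def cosine_alpha_bars(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    f = [math.cos((k / T + s) / (1 + s) * math.pi / 2) ** 2 for k in range(T + 1)]
    return [value / f[0] for value in f[1:]]


lin, cos = linear_alpha_bars(), cosine_alpha_bars()
ratios = {}
for t in (0, 100, 300, 500, 700, 900):
    ratios[t] = cos[t] / lin[t]  # how much more signal the cosine schedule keeps
    print(f"t={t:3d}  linear={lin[t]:.5f}  cosine={cos[t]:.5f}  ratio={ratios[t]:.1f}")
```

The ratio column is the quantity to watch: it starts near one and grows at every sampled step, which is the second-half advantage of the cosine schedule expressed as a single number per timestep.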
3. Three Parameterizations: Noise, Image, and Velocity (Advanced)
The network has to output something from which the reverse step can be computed, but that something is not unique. Because $x_t$, $x_0$, and $\epsilon$ are linked by the single closed-form equation, predicting any one of them determines the others. There are three standard choices, and each is the right one in a different regime.
- Noise prediction ($\epsilon$-prediction): the network outputs the noise $\epsilon_\theta(x_t, t)$. This is the DDPM default and works well across most of the schedule.
- Image prediction ($x_0$-prediction): the network outputs the clean image directly. This is more stable at very high noise levels: when $x_t$ is almost pure noise, predicting $\epsilon$ amounts to echoing back the input (the target $\epsilon$ is most of what the network already sees), so a tiny error in the prediction barely changes the loss yet badly corrupts the recovered image, whereas predicting the faint underlying $x_0$ gives the network a target that actually carries signal.
- Velocity prediction ($v$-prediction): the network predicts $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$, a blend that interpolates between the two. It behaves like image prediction at high noise and noise prediction at low noise, giving the best of both, and is the choice used in progressive distillation and many modern systems.
The three are exact algebraic re-encodings of each other; given one and the noisy $x_t$, you can recover the others. The reason the choice matters is the implied loss weighting: predicting $\epsilon$, $x_0$, or $v$ puts different emphasis on different noise levels, which changes what the network spends its capacity on. The conversion utilities are short.
```python
# Convert among the noise, image, and velocity parameterizations.
# All three are exact re-encodings linked by the closed-form equation,
# so a velocity prediction must round-trip back to the original noise.
import torch


def to_x0(xt, eps, ab):  # ab is alpha_bar_t, shape (B,1,1,1)
    """Recover the clean image from x_t and predicted noise."""
    return (xt - (1 - ab).sqrt() * eps) / ab.sqrt()


def v_target(x0, eps, ab):
    """The velocity target used by v-prediction."""
    return ab.sqrt() * eps - (1 - ab).sqrt() * x0


def eps_from_v(xt, v, ab):
    """Recover noise from a velocity prediction (for the reverse step)."""
    return ab.sqrt() * v + (1 - ab).sqrt() * xt


# Sanity check: round-trip consistency on random tensors
x0 = torch.randn(2, 1, 4, 4)
eps = torch.randn_like(x0)
ab = torch.tensor([0.3, 0.7]).view(-1, 1, 1, 1)
xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
v = v_target(x0, eps, ab)
assert torch.allclose(eps_from_v(xt, v, ab), eps, atol=1e-5)
print("parameterizations are mutually consistent")
```
The to_x0, v_target, and eps_from_v helpers convert between the noise, image, and velocity parameterizations. The assert confirms they are exact re-encodings: eps_from_v recovers the same noise the closed form used. Choosing among them is a training-stability decision, not a modeling one.

Take a clean x0, build xt from it with the closed form (the same xt line used in the sanity check), then perturb the noise estimate by a tiny fixed amount (eps_hat = eps + 0.05 * torch.randn_like(eps)) and recover the image with to_x0(xt, eps_hat, ab). Compare the recovered image against the clean x0 for ab near one (low noise, try $0.99$) and near zero (high noise, try $0.01$). The same small noise error barely moves the recovered image at low noise but is amplified by the factor $\sqrt{(1-\bar\alpha_t)/\bar\alpha_t}$ at high noise, which explodes as $\bar\alpha_t \to 0$. Watching that one number grow is the whole argument for why $x_0$-prediction and $v$-prediction are steadier at the noisy end, made tangible in a few lines.
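A scalar version of this experiment fits in a few lines. The sketch below is an illustration under simplifying assumptions: it uses single numbers instead of image tensors, a deterministic noise error delta in place of the random perturbation, and a standard-library reimplementation of to_x0. The values of x0, eps, and delta are arbitrary.

```python
import math


def to_x0(xt, eps, ab):
    """Scalar to_x0: invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    return (xt - math.sqrt(1 - ab) * eps) / math.sqrt(ab)


x0, eps, delta = 0.5, -1.2, 0.05  # clean value, noise draw, error in the noise estimate
errors = {}
for ab in (0.99, 0.5, 0.01):
    xt = math.sqrt(ab) * x0 + math.sqrt(1 - ab) * eps  # closed-form corruption
    x0_hat = to_x0(xt, eps + delta, ab)                 # recover with slightly wrong noise
    errors[ab] = abs(x0_hat - x0)
    amplification = math.sqrt((1 - ab) / ab)
    print(f"alpha_bar={ab:4.2f}  image error={errors[ab]:.4f}  "
          f"delta * amplification={delta * amplification:.4f}")
```

The two printed columns agree at every noise level, confirming that the image error is the noise error multiplied by $\sqrt{(1-\bar\alpha_t)/\bar\alpha_t}$: about a tenth of delta at $\bar\alpha_t = 0.99$ and about ten times delta at $\bar\alpha_t = 0.01$.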
Velocity prediction got its name and its formula from a 2022 paper on progressive distillation, where the authors needed a target that stayed well-behaved as they repeatedly halved the number of sampling steps. The name evokes physics: $v$ is the instantaneous rate of change along the path from data to noise, and predicting it is like predicting velocity instead of position. Stable Diffusion 2 quietly adopted it for its 768-pixel checkpoints, which is why models fine-tuned for one parameterization can misbehave if you load them with a scheduler configured for another.
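The "implied loss weighting" from subsection 3 can be checked by hand. For a fixed $x_t$, one estimate of $x_0$ determines the matching estimates of $\epsilon$ and $v$, and the three squared errors differ only by factors of the signal-to-noise ratio $\mathrm{SNR}_t = \bar\alpha_t / (1 - \bar\alpha_t)$: the $\epsilon$ error is $\mathrm{SNR}_t$ times the $x_0$ error, and the $v$ error is $(1 + \mathrm{SNR}_t)$ times the $x_0$ error. The sketch below verifies both identities on scalars; implied_errors is a hypothetical helper written for this illustration.

```python
import math


def implied_errors(x0, eps, x0_hat, ab):
    """Squared errors in x0, eps, and v implied by one x0 estimate at a fixed x_t."""
    a, b = math.sqrt(ab), math.sqrt(1 - ab)
    xt = a * x0 + b * eps            # the noisy input the network sees
    eps_hat = (xt - a * x0_hat) / b  # the noise consistent with x0_hat and xt
    v = a * eps - b * x0
    v_hat = a * eps_hat - b * x0_hat
    return (x0_hat - x0) ** 2, (eps_hat - eps) ** 2, (v_hat - v) ** 2


for ab in (0.99, 0.5, 0.01):
    snr = ab / (1 - ab)
    e_x0, e_eps, e_v = implied_errors(x0=0.5, eps=-1.2, x0_hat=0.6, ab=ab)
    print(f"alpha_bar={ab:4.2f}  SNR={snr:8.3f}  "
          f"eps/x0 error ratio={e_eps / e_x0:8.3f}  v/x0 error ratio={e_v / e_x0:8.3f}")
```

Read as weights, an unweighted $\epsilon$ loss is an $x_0$ loss weighted by $\mathrm{SNR}_t$, which vanishes at the noisy end, while an unweighted $v$ loss carries the weight $1 + \mathrm{SNR}_t$, which never drops below one. That is one way to see why $v$-prediction keeps a useful training signal at high noise.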
4. The Network: A Time-Conditioned U-Net (Intermediate)
For images, the denoiser $\epsilon_\theta$ is almost always a U-Net: an encoder that downsamples the image through convolutional blocks, a bottleneck, and a decoder that upsamples back to full resolution, with skip connections joining matching levels. This is the same architecture you met for segmentation in Chapter 24, repurposed: instead of predicting a per-pixel class, it predicts per-pixel noise. Two ingredients make it diffusion-specific. First, the timestep $t$ is embedded (with the sinusoidal encoding of Section 33.1) and injected into every block, usually by adding it to the feature maps, so the same weights behave differently at different noise levels. Second, self-attention layers are inserted at the lower-resolution levels so the network can relate distant regions, which a pure convolution cannot do in one layer. The skip connections matter especially here: they carry the high-frequency detail from the encoder straight to the decoder, exactly the detail that the bottleneck would otherwise lose, and that the denoiser must restore at low noise levels.
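The time-conditioning idea can be sketched independently of any framework. The code below assumes the standard transformer-style sinusoidal form for the timestep embedding (Section 33.1 may use a slightly different variant) and skips the small learned projection that a real block applies before the addition, so the embedding dimension is simply set equal to the channel count. Both function names are hypothetical.

```python
import math


def sinusoidal_embedding(t, dim=8, max_period=10000.0):
    """Timestep embedding: sines then cosines at geometrically spaced frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]


def add_time_bias(feature_map, channel_bias):
    """Add one scalar per channel to every spatial position of a C x H x W map."""
    return [[[value + bias for value in row] for row in channel]
            for channel, bias in zip(feature_map, channel_bias)]


features = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(8)]  # toy 8-channel 2x2 feature map
early = add_time_bias(features, sinusoidal_embedding(10))
late = add_time_bias(features, sinusoidal_embedding(900))
print("channel 0 at t=10: ", early[0][0][0])
print("channel 0 at t=900:", late[0][0][0])
```

The same feature map receives a different per-channel offset at each timestep, which is the mechanism that lets one set of convolution weights behave differently at different noise levels.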
A correct, attention-augmented, time-conditioned U-Net is several hundred lines of careful PyTorch. The diffusers library gives it to you as a configured class: UNet2DModel(sample_size=64, in_channels=3, out_channels=3, block_out_channels=(128,128,256,256,512,512), down_block_types=(...)) builds the full encoder-decoder with residual blocks, the timestep embedding wired into every block, and attention at the resolutions you specify. That is roughly a 300-line reduction over a from-scratch implementation, and the library handles the group-normalization placement, the SiLU activations, and the skip-connection concatenation that are easy to misorder by hand. The forward call unet(noisy_x, timestep).sample returns the network output in one line; whether that output is read as noise, $x_0$, or $v$ is decided by the target you train against and by the scheduler's prediction_type setting, not by the U-Net itself.
5. Putting It Together: The DDPM Training Step (Intermediate)
With the schedule chosen, the parameterization fixed, and the U-Net in hand, the training step is the same three-line idea from Section 33.1, now on real images. The function below is the complete DDPM loss for a batch, and it is what every image-diffusion codebase runs at its core.
```python
# The complete DDPM training objective L_simple for a batch of images.
# Draw a random timestep, corrupt with the closed form, predict the noise,
# and return the mean squared error; this is the core of every codebase.
import torch


def ddpm_loss(unet, x0, alpha_bars, T=1000):
    """One DDPM training step on a batch of images x0 in [-1, 1]."""
    B = x0.size(0)
    t = torch.randint(0, T, (B,), device=x0.device)  # random step per image
    # Index on x0's device so a CPU-built schedule also works with GPU batches.
    ab = alpha_bars.to(x0.device)[t].view(B, 1, 1, 1)  # signal level
    noise = torch.randn_like(x0)  # the target
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * noise  # closed-form corruption
    pred = unet(xt, t).sample  # predicted noise
    return ((pred - noise) ** 2).mean()  # L_simple


# Inside a normal training loop:
#   optimizer.zero_grad()
#   loss = ddpm_loss(unet, batch, alpha_bars)
#   loss.backward(); optimizer.step()
```
The ddpm_loss function, the full image-diffusion training step, identical in spirit to the 2D toy loop of Section 33.1. The only differences from the toy version are that x0 is a batch of images and the denoiser is a U-Net called as unet(xt, t).sample; the diffusion logic, corrupt and predict the noise, is unchanged.

Who: a solo researcher fine-tuning a diffusion model on a dataset of 256x256 satellite tiles, 2023.

Situation: training with the default linear schedule from a tutorial, samples looked muddy and low-contrast even after a week of compute.

Problem: at 256x256 the linear schedule destroyed the low-frequency structure too early, so the high-noise steps carried no learnable signal and the model never learned the coarse layout that satellite imagery depends on.

Decision: following the Nichol-Dhariwal finding, she swapped to the cosine schedule (a five-line change to the $\bar\alpha_t$ array) and switched the network from $\epsilon$-prediction to $v$-prediction for stability at the noisy end.

Result: within two days of further training the samples gained coherent large-scale structure (fields, roads, coastlines) that the linear run never produced, and the Frechet Inception Distance (FID), the standard sample-quality metric of Chapter 37, dropped substantially.

Lesson: the schedule and parameterization are not incidental defaults; at higher resolution they decide whether the model spends its capacity on signal or on already-destroyed noise. When samples look structurally wrong rather than merely blurry, suspect the schedule before the architecture.
The linear-versus-cosine choice is the tip of a larger design question that the EDM line of work (Karras et al., 2022, arXiv:2206.00364) reframed cleanly: rather than tuning $\beta_t$, parameterize everything in terms of a noise standard deviation $\sigma$ and choose its distribution, the network preconditioning, and the loss weighting to keep the effective target unit-scale at every noise level. EDM's continuous-$\sigma$ formulation and its log-normal sampling of training noise levels became a de facto standard for high-quality models. The 2024 follow-up EDM2 (Karras et al., arXiv:2312.02696) went further, fixing uncontrolled magnitude growth in the network weights and setting the ImageNet generation record at the time. The lesson for a practitioner is that the schedule and loss weighting are first-class hyperparameters; the cosine schedule here is a strong, simple default, but frontier results come from treating the noise-level distribution as something to optimize.
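As one concrete reading of "keep the effective target unit-scale," the sketch below evaluates the input scale, skip scale, output scale, and loss weight as functions of $\sigma$, together with log-normal sampling of training noise levels. The formulas and the constants ($\sigma_{\text{data}} = 0.5$, log-normal mean $-1.2$ and standard deviation $1.2$) are written from our recollection of the EDM paper and should be treated as assumptions to check against it; the function names are ours.

```python
import math
import random

SIGMA_DATA = 0.5           # assumed data standard deviation
P_MEAN, P_STD = -1.2, 1.2  # assumed log-normal parameters for training noise levels


def sample_sigma(rng):
    """Draw a training noise level with ln(sigma) ~ Normal(P_MEAN, P_STD^2)."""
    return math.exp(rng.gauss(P_MEAN, P_STD))


def edm_coefficients(sigma, sigma_data=SIGMA_DATA):
    """Input scale, skip scale, output scale, and loss weight at noise level sigma."""
    total = sigma ** 2 + sigma_data ** 2
    c_in = 1.0 / math.sqrt(total)
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    loss_weight = total / (sigma * sigma_data) ** 2
    return c_in, c_skip, c_out, loss_weight


rng = random.Random(0)
for sigma in sorted(sample_sigma(rng) for _ in range(5)):
    c_in, c_skip, c_out, lam = edm_coefficients(sigma)
    input_var = c_in ** 2 * (sigma ** 2 + SIGMA_DATA ** 2)  # variance of the scaled input
    print(f"sigma={sigma:7.3f}  c_in={c_in:.3f}  c_skip={c_skip:.3f}  c_out={c_out:.3f}  "
          f"input var={input_var:.3f}  weight*c_out^2={lam * c_out ** 2:.3f}")
```

The last two columns are identically one at every $\sigma$: the scaled network input has unit variance, and the loss weight exactly cancels the output scale. Under this parameterization the network sees a same-sized problem whether the noise is tiny or overwhelming.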
The variational bound in subsection 1 has three groups of terms. Explain in a short paragraph why the final term $D_{\mathrm{KL}}(q(x_T \mid x_0) \| p(x_T))$ contributes essentially nothing to the loss and has no learnable parameters, and why the middle KL terms reduce to a squared difference of Gaussian means. Then state, in one sentence, what empirical change DDPM made to the per-timestep weights and why it helped.
Train two copies of the 2D moons denoiser from Section 33.1, one with $\epsilon$-prediction and one with $x_0$-prediction (change the target and adjust the reverse step using the to_x0 helper from subsection 3). Sample 512 points from each and overlay them on the true data. Report whether the two produce visibly different sample quality on this easy dataset, and explain why the difference would be larger at high image resolution than on 2D toys.
Using linear_alpha_bars and cosine_alpha_bars from subsection 2, plot the signal-to-noise ratio $\bar\alpha_t / (1 - \bar\alpha_t)$ on a log scale against $t$ for both schedules with $T = 1000$. Identify the timestep range where the two schedules differ most, and write one paragraph arguing, from the curve, why the cosine schedule allocates more "useful" steps. Connect this to the multi-scale coarse-to-fine argument from Section 33.1 and to the image-pyramid idea of Chapter 4.