"A thousand tiny steps will get you home, but so will twenty long strides if you know which way to lean. I learned to lean."
A Sampler That Used to Take All Day
The thousand-step sampling that makes diffusion slow is not fundamental; it is the cost of solving the generative ODE with the crudest possible integrator, and three families of methods cut it dramatically: DDIM rewrites the sampler as a deterministic ODE solver that takes large accurate steps, higher-order solvers like DPM-Solver use curvature information to take even fewer, and distillation trains a student network to reproduce many teacher steps in one. The first two need no extra training and turn a thousand steps into roughly twenty; the third needs training but pushes toward one to four steps and near-real-time generation. This section derives DDIM, surveys the solver landscape, and explains distillation.
In Section 33.3 you learned that every diffusion model has a deterministic probability-flow ODE whose trajectories carry noise to data. The stochastic ancestral sampler of Section 33.1 is, in effect, a noisy Euler solver of the reverse SDE, and Euler with a thousand steps is reliable but wasteful. Recognizing sampling as numerical ODE solving immediately suggests the fix: use a better solver. This section makes that concrete. We derive the DDIM sampler as the Euler discretization of the probability-flow ODE, see why it can take twenty steps where the stochastic sampler needs a thousand, then move to the higher-order solvers and the distillation methods that go further still.
1. DDIM: Sampling as Deterministic ODE Solving (Intermediate)
The denoising diffusion implicit model (DDIM) starts from a clever observation: the DDPM training objective from Section 33.2 only constrains the marginals $q(x_t \mid x_0)$, not the joint Markov chain. So you are free to define a different reverse process, even a non-Markovian one, as long as it has the same marginals, and reuse the exact same trained network. DDIM chooses a reverse process that is deterministic. Its update predicts the clean image from the current noisy image and the network's noise estimate, then re-noises it to the next, lower noise level along a straight line:
$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}, \qquad x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t).$$
There is no random term. The two pieces are the predicted clean image, scaled to the next noise level, plus the predicted noise direction, also scaled to the next level. It may not look like an ODE step (there is no visible $dt$), but it is one in disguise: estimating $x_0$ and re-projecting it onto the noise level of $t-1$ is algebraically the same move as taking one integration step of the probability-flow ODE, which is why the two formulations agree exactly. Because the trajectory is deterministic and smooth, you do not have to visit every integer timestep; you can pick any decreasing subsequence of steps, say 50 of the 1000, and apply the update between consecutive chosen steps. This is exactly the probability-flow ODE integrator you wrote in Section 33.3, now named and motivated. Figure 33.4.1 contrasts the wandering stochastic path with the smooth deterministic one.
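The "ODE step in disguise" claim can be made concrete with a short derivation sketch. The rescaled variables $\bar{x}_t$ and $\sigma_t$ below are notation introduced only for this illustration, and the sketch assumes the usual $\epsilon$-prediction form of the deterministic update.

```latex
\begin{aligned}
x_{t-1} &= \sqrt{\bar\alpha_{t-1}}\;\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}
          + \sqrt{1-\bar\alpha_{t-1}}\;\epsilon_\theta(x_t,t) \\[4pt]
\frac{x_{t-1}}{\sqrt{\bar\alpha_{t-1}}} &= \frac{x_t}{\sqrt{\bar\alpha_t}}
          + \left(\sqrt{\frac{1-\bar\alpha_{t-1}}{\bar\alpha_{t-1}}} - \sqrt{\frac{1-\bar\alpha_t}{\bar\alpha_t}}\right)\epsilon_\theta(x_t,t)
          \qquad \text{(divide by } \sqrt{\bar\alpha_{t-1}}\text{)} \\[4pt]
\bar{x}_{t-1} &= \bar{x}_t + (\sigma_{t-1} - \sigma_t)\,\epsilon_\theta(x_t,t),
          \qquad \bar{x}_t := \frac{x_t}{\sqrt{\bar\alpha_t}}, \quad \sigma_t := \sqrt{\frac{1-\bar\alpha_t}{\bar\alpha_t}}
\end{aligned}
```

The last line is one explicit Euler step, of size $\sigma_{t-1} - \sigma_t$, for the ODE $d\bar{x}/d\sigma = \epsilon_\theta$. Nothing in it requires the two timesteps to be adjacent, which is why the same update can be applied across a sparse subsequence.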
The sampler below implements the deterministic DDIM update directly: it walks a chosen subset of timesteps, predicting the clean image and re-noising to the next level without drawing any fresh noise.
```python
# Deterministic DDIM sampling on a sparse subset of timesteps.
# Reuses a network trained with the ordinary DDPM loss; at each step we
# predict the clean image and re-noise it to the next lower noise level.
import torch


@torch.no_grad()
def ddim_sample(unet, alpha_bars, shape, steps=50, T=1000, device="cpu"):
    """Deterministic DDIM sampling on a chosen subset of timesteps."""
    alpha_bars = alpha_bars.to(device)  # the lookup table must live where the indices do
    x = torch.randn(shape, device=device)
    ts = torch.linspace(T - 1, 0, steps).long().to(device)  # e.g. 50 of 1000 steps
    for i in range(len(ts)):
        t = ts[i]
        ab = alpha_bars[t]
        eps = unet(x, t.repeat(shape[0])).sample  # one network call (diffusers-style output)
        x0_pred = (x - (1 - ab).sqrt() * eps) / ab.sqrt()  # predicted clean image
        if i < len(ts) - 1:
            ab_next = alpha_bars[ts[i + 1]]
            x = ab_next.sqrt() * x0_pred + (1 - ab_next).sqrt() * eps  # to next level
        else:
            x = x0_pred  # final step: return x0
    return x.clamp(-1, 1)


# 50 network calls instead of 1000: a 20x speedup with the same weights.
print("DDIM uses the same trained unet, just a smarter update rule")
```
The ddim_sample function is a complete DDIM sampler. It reuses a network trained with the ordinary DDPM loss of Section 33.2 and makes only steps calls to it, predicting x0_pred and re-noising to the next level without drawing fresh noise. Setting steps=50 gives roughly a twentyfold speedup over the 1000-step ancestral sampler with little visible quality loss.
The most important practical fact about DDIM is that it requires no retraining. A model trained with the plain DDPM objective can be sampled with the stochastic ancestral sampler, with DDIM at 50 steps, or with a high-order solver at 20 steps, all from the identical weights. The number of sampling steps is a deployment-time dial, not a training-time commitment. This is why a single Stable Diffusion checkpoint ships with a dozen interchangeable schedulers: they are all different ODE/SDE solvers walking the same learned vector field. The determinism of the DDIM walk also makes it invertible, which is the property the inversion-based editing of Chapter 35 relies on to map a real image back to the noise that would regenerate it.
2. Higher-Order Solvers: DPM-Solver and Friends (Advanced)
DDIM is a first-order (Euler) solver of the probability-flow ODE. Numerical analysis offers a long menu of more accurate solvers that use information about the curvature of the trajectory to take larger steps for the same error. The diffusion ODE has a special structure: its drift is linear in $x$ plus a nonlinear score term, a so-called semi-linear ODE, and exploiting that structure with an exponential integrator yields DPM-Solver and its improved variant DPM-Solver++; the related multistep predictor-corrector UniPC builds on the same idea. These solvers reach DDPM-quality samples in roughly 10 to 20 network calls, about half of what DDIM needs at the same quality.
What buys the extra accuracy is curvature. A second-order solver evaluates the network twice per step (or reuses the previous step's evaluation) to estimate how the trajectory bends, then takes a corrected step. The EDM sampler of Section 33.2 uses Heun's method, a classic second-order scheme, for the same reason. The practical guidance is simple: for a given budget of network evaluations, a higher-order solver almost always beats Euler/DDIM, and for the multistep variants, which reuse earlier evaluations, the cost is only a little extra bookkeeping.
Implementing DPM-Solver++ or UniPC correctly, with the exponential-integrator coefficients and the multistep history, is a hundred-plus lines of subtle numerics that are easy to get wrong at the schedule endpoints. In diffusers it is a one-line swap: pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) replaces whatever sampler a pipeline shipped with. The same checkpoint then generates in 20 steps instead of 50. The library exposes DDIMScheduler, DPMSolverMultistepScheduler, UniPCMultistepScheduler, EulerDiscreteScheduler, and HeunDiscreteScheduler, all interchangeable, and handles the VP/VE and $\epsilon$/$v$ conversions of Sections 33.2 and 33.3 internally so you do not have to reconcile them by hand. Read one solver from scratch to understand the idea; choose among the library's schedulers in practice.
3. Distillation: Training Away the Steps (Advanced)
Better solvers reduce steps without retraining, but they hit a floor around 10 to 20 steps because below that the ODE trajectory is too curved for any fixed-order solver to track in a few jumps. To go lower you change the model, not the solver. Progressive distillation (Salimans and Ho, 2022) trains a student network to take one step that matches two teacher steps, then distills the student into a new student that takes one step for its two, and so on, halving the step count each round: 1024 to 512 to 256 down to 4 or even 1. Each round is a short fine-tune with a regression loss against the teacher's two-step output. The $v$-prediction parameterization of Section 33.2 was invented precisely to keep this distillation numerically stable. A newer and now-dominant family, the consistency models of Section 33.5, trains directly for the property that any point on a trajectory maps to its endpoint, achieving one to four step generation in a single training run rather than many distillation rounds. Distillation is itself a form of the knowledge-transfer training of Chapter 21: a small or few-step student is supervised by a slow, high-quality teacher, the same teacher-student pattern, here aimed at sampling speed rather than label efficiency.
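The "one student step matches two teacher steps" idea can be sketched on scalars. The example below is a minimal illustration, not the published training code: it assumes a cosine schedule $\alpha_t = \cos(\pi t/2)$, $\sigma_t = \sin(\pi t/2)$ and an $x_0$-predicting model, and `toy_teacher_x0` is a hypothetical stand-in for a trained network. It computes the clean-image target that a single student DDIM step would need in order to land exactly where two teacher steps land.

```python
import math


def alpha_sigma(t):
    """Cosine schedule (an assumption for this sketch): signal and noise scales at time t."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(z, t, t_next, x0_pred):
    """One deterministic DDIM step from time t to t_next, given a clean-image prediction."""
    a, s = alpha_sigma(t)
    a_next, s_next = alpha_sigma(t_next)
    eps = (z - a * x0_pred) / s           # noise implied by the prediction
    return a_next * x0_pred + s_next * eps


def toy_teacher_x0(z, t):
    """Hypothetical teacher: any smooth function works for checking the algebra."""
    return z / (1.0 + t)


def distillation_target(z, t, t_mid, t_end, teacher):
    """Run two teacher steps, then solve for the x0 a one-step student must predict."""
    z_mid = ddim_step(z, t, t_mid, teacher(z, t))
    z_end = ddim_step(z_mid, t_mid, t_end, teacher(z_mid, t_mid))
    a, s = alpha_sigma(t)
    a_end, s_end = alpha_sigma(t_end)
    ratio = s_end / s
    x0_target = (z_end - ratio * z) / (a_end - ratio * a)
    return x0_target, z_end


x0_target, z_end = distillation_target(1.3, 0.8, 0.7, 0.6, toy_teacher_x0)
one_step = ddim_step(1.3, 0.8, 0.6, x0_target)
print(f"two teacher steps: {z_end:.6f}   one student step: {one_step:.6f}")
```

In an actual distillation round the student network would be regressed toward `x0_target` over many sampled `z` and `t`; the sketch only shows how the target is constructed.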
Who: A product team at a marketing-creative startup, shipping a browser tool where designers type a prompt and watch the image update as they type.
Situation: Early users abandoned the tool during the wait. A single 50-step generation on the shared L4 GPUs took four to six seconds, so every prompt edit felt like a page reload, and the median session ended before the third generation.
Problem: Dropping to a fixed low step count to feel responsive wrecked the final image quality that the deliverable needed, while keeping the high step count kept the tool unusable for exploration. One step budget could not serve both interactive editing and final rendering.
Decision: The team split sampling into two tiers driven by the exact same checkpoint, treating step count as a deployment dial. A four-step Latent Consistency Model path redrew a draft on every keystroke for instant feedback; a dedicated "render" button then ran a twenty-step DPMSolverMultistepScheduler pass for the final asset. No new model was trained for the live tier beyond loading an off-the-shelf LCM adapter.
Result: Live previews dropped from roughly five seconds to under 300 milliseconds, the final render stayed at full quality, and median generations per session rose more than fourfold. The change touched the scheduler and adapter loading only; the weights were shared across both tiers.
Lesson: When one workload has two different speed-quality needs, do not compromise on a single step count. The same learned vector field can be walked fast for exploration and carefully for delivery, which is the whole payoff of treating sampling as solver choice rather than a fixed property of the model.
With the DDIM sampler of subsection 1 and the distilled few-step models of subsection 3, you can build the two-tier generator from the practical example yourself, an interview-ready portfolio piece. Load a Latent Consistency Model adapter for the live tier and a standard diffusers pipeline with DPMSolverMultistepScheduler for the final tier, then wire a small Gradio or Streamlit front end: the four-step LCM path regenerates a draft on every keystroke of the prompt, and a "render" button runs the 20-step solver for the deliverable. Log the wall-clock of each tier on your GPU and show the speed gap as a chart. Difficulty: intermediate, about 60 to 90 minutes. It demonstrates that you understand step count as a deployment dial, the central lesson of this section, rather than a fixed property of the model.
The reason naive step-skipping (just running the ancestral sampler on 50 evenly spaced steps out of the 1000) produces garbage, while DDIM on the same 50 steps produces clean images, confused many practitioners early on. The difference is that the stochastic sampler's update assumes the small-step Gaussian approximation, which breaks across large gaps, whereas DDIM's update is a discretization of the smooth deterministic ODE, so its error grows only gradually as the steps get larger. The lesson the field learned the hard way: you cannot skip steps of a method designed for small steps; you must switch to a method designed for large ones.
The race to few-step generation defined diffusion engineering from 2023 to 2025. Latent Consistency Models (LCM; Luo et al., 2023) brought consistency distillation to Stable Diffusion latents for four-step generation. Adversarial distillation, SDXL-Turbo and SD3-Turbo (Sauer et al., 2023 to 2024), added a GAN-style discriminator from Chapter 32 to the distillation loss, reaching convincing single-step samples. Distribution-matching distillation (DMD, 2024, arXiv:2311.18828, and DMD2, arXiv:2405.14867) matched the student's output distribution to the teacher's score and closed much of the remaining quality gap to multi-step sampling. By 2025 to 2026 the practical state of the art was that a well-distilled model gives roughly 90 percent or more of multi-step quality in one to four steps, which is why real-time, type-and-see image and video tools became feasible, and why few-step diffusion is increasingly a candidate for the on-device and edge deployment of Chapter 28 rather than a server-only workload. The newest open-weight families ship a few-step distilled variant alongside the full model (for example Stable Diffusion 3.5 Large Turbo, a four-step distilled model) so the speed tier is a first-class deliverable, not an afterthought. The open question is whether one-step models can fully match the best multi-step quality or whether a small residual gap is inherent.
4. The EDM Design Space: One Framework for All the Knobs (Advanced)
The solvers above all assume the noise schedule, the network parameterization, and the loss weighting are fixed; they only change how you walk the trajectory. But these design choices interact, and a sampler tuned for one parameterization can be mediocre for another. The practical pain is concrete: a team porting a VP-trained checkpoint to a VE schedule, or swapping $\epsilon$-prediction for $x_0$-prediction, often finds the carefully tuned step count no longer produces clean samples, and there is no principled way to know which knob to turn. Karras et al. (2022), in "Elucidating the Design Space of Diffusion-Based Generative Models," untangle these coupled choices into independent, separately tunable components. The reframing is so clean that the resulting recipe, known as EDM, became a default starting point for new diffusion models.
The first simplification is to treat the noise level $\sigma$ itself as time. EDM uses a variance-exploding forward process in which a clean sample is simply corrupted by additive Gaussian noise of standard deviation $\sigma$:
$$x = x_0 + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
There is no shrinking $\sqrt{\bar\alpha_t}$ factor on the signal here, which is what makes the algebra below so much simpler than the VP bookkeeping of Section 33.2. The marginal at noise level $\sigma$ is the data distribution blurred by a Gaussian of width $\sigma$, written $p(x; \sigma)$. The probability-flow ODE that transports samples from high noise to low noise is
$$dx = -\dot\sigma(t)\,\sigma(t)\,\nabla_x \log p\big(x; \sigma(t)\big)\,dt,$$
where $\nabla_x \log p(x; \sigma)$ is the score (the gradient of the log-density at noise level $\sigma$) and $\dot\sigma(t)$ is the time-derivative of the noise schedule. EDM then makes the cleanest possible choice of schedule, $\sigma(t) = t$, so that noise level and time are literally the same variable and $\dot\sigma = 1$. The ODE collapses to $dx = -\sigma\,\nabla_x \log p(x;\sigma)\,dt$, and because the score relates to a denoiser by $\nabla_x \log p(x;\sigma) = (D_\theta(x;\sigma) - x)/\sigma^2$, the drift is just $(x - D_\theta(x;\sigma))/\sigma$. That single quantity, the normalized difference between the current point and the denoised estimate, is the slope the sampler follows; everything in the Heun algorithm below is built from it.
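The score-to-denoiser relation is easy to check in a case where both sides are known in closed form. The sketch below assumes one-dimensional Gaussian data with standard deviation `sigma_data`, for which the ideal denoiser is a simple rescaling of the input; the function names are illustrative and not part of any library.

```python
def gaussian_denoiser(x, sigma, sigma_data=0.5):
    """Ideal denoiser E[x0 | x] when x0 ~ N(0, sigma_data^2) and x = x0 + sigma * eps."""
    return sigma_data**2 / (sigma**2 + sigma_data**2) * x


def score_from_denoiser(x, sigma, denoiser):
    """Score recovered from a denoiser: (D(x; sigma) - x) / sigma^2."""
    return (denoiser(x, sigma) - x) / sigma**2


def ode_slope(x, sigma, denoiser):
    """Slope followed by the sampler when sigma(t) = t: (x - D(x; sigma)) / sigma."""
    return (x - denoiser(x, sigma)) / sigma


x, sigma, sigma_data = 1.5, 2.0, 0.5
# For Gaussian data the noisy marginal is N(0, sigma^2 + sigma_data^2),
# so its score is known exactly and can be compared with the denoiser route.
exact_score = -x / (sigma**2 + sigma_data**2)
print("score via denoiser:", score_from_denoiser(x, sigma, gaussian_denoiser))
print("exact score       :", exact_score)
print("ODE slope         :", ode_slope(x, sigma, gaussian_denoiser))
```

The slope printed last equals $-\sigma$ times the score, which is the collapsed form of the ODE drift.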
The conceptual core of EDM is to stop asking the neural network to predict noise or clean images directly at every noise level, and instead wrap it in noise-level-dependent scaling so that the network always sees a well-conditioned, unit-variance input and produces a well-conditioned target. The network $F_\theta$ does the learning; a set of closed-form coefficients $c_{\text{skip}}, c_{\text{out}}, c_{\text{in}}, c_{\text{noise}}$ do the bookkeeping. Because the coefficients are derived from first principles (the requirement that inputs and training targets have unit variance) rather than chosen by trial, the same network behaves consistently from $\sigma_{\min} \approx 0.002$ to $\sigma_{\max} \approx 80$, a range spanning more than four orders of magnitude.
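Concretely, EDM never uses the raw network output as the denoiser. It defines a preconditioned denoiser that combines a skip connection from the noisy input with a scaled network evaluation:
$$D_\theta(x; \sigma) = c_{\text{skip}}(\sigma)\,x + c_{\text{out}}(\sigma)\,F_\theta\big(c_{\text{in}}(\sigma)\,x;\ c_{\text{noise}}(\sigma)\big).$$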
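Read this left to right. The term $c_{\text{skip}}(\sigma)\,x$ passes the noisy input through directly, so at low noise the denoiser can be nearly the identity (the input is already almost clean) and the network only has to predict a small correction. The term $c_{\text{in}}(\sigma)\,x$ rescales the input so the network always sees unit variance regardless of $\sigma$. The factor $c_{\text{out}}(\sigma)$ rescales the network's output to the magnitude actually needed, and $c_{\text{noise}}(\sigma)$ maps the noise level to a conditioning input that varies gently rather than over four orders of magnitude. The four coefficients have closed forms,
$$c_{\text{skip}}(\sigma) = \frac{\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2}, \quad c_{\text{out}}(\sigma) = \frac{\sigma\,\sigma_{\text{data}}}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}, \quad c_{\text{in}}(\sigma) = \frac{1}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}, \quad c_{\text{noise}}(\sigma) = \tfrac{1}{4}\ln\sigma,$$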
where $\sigma_{\text{data}}$ is the standard deviation of the data (EDM uses $\sigma_{\text{data}} = 0.5$ for images scaled to $[-1, 1]$). These are not tuned constants; they follow from requiring that both the network's input and its regression target have unit variance at every $\sigma$, with $c_{\text{skip}}$ chosen so that errors in the network output are amplified as little as possible. That is exactly the well-conditioning that makes training stable across the whole noise range. Deriving them is the subject of Exercise 33.4.4. Sanity-check the limits: as $\sigma \to 0$, $c_{\text{skip}} \to 1$ and $c_{\text{out}} \to 0$, so the denoiser returns the input essentially unchanged (correct, since a nearly clean image needs no denoising); as $\sigma \to \infty$, $c_{\text{skip}} \to 0$ and the denoiser leans entirely on the network output, since the input is almost pure noise and carries no usable signal.
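The unit-variance property can be verified numerically without any network. The sketch below uses the closed-form coefficients from Karras et al. (2022) and assumes, as the derivation does, that the clean data has variance $\sigma_{\text{data}}^2$ and is independent of the noise; `edm_coeffs` and `input_and_target_variance` are names chosen for this illustration.

```python
import math


def edm_coeffs(sigma, sigma_data=0.5):
    """Closed-form EDM preconditioning coefficients at noise level sigma."""
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = 0.25 * math.log(sigma)
    return c_skip, c_out, c_in, c_noise


def input_and_target_variance(sigma, sigma_data=0.5):
    """Variance of the network input c_in * x and of the target (x0 - c_skip * x) / c_out."""
    c_skip, c_out, c_in, _ = edm_coeffs(sigma, sigma_data)
    var_input = c_in**2 * (sigma**2 + sigma_data**2)
    # With x = x0 + sigma * eps, the target numerator is (1 - c_skip) * x0 - c_skip * sigma * eps.
    var_target = ((1 - c_skip) ** 2 * sigma_data**2 + c_skip**2 * sigma**2) / c_out**2
    return var_input, var_target


for sigma in (0.002, 0.5, 80.0):
    var_in, var_tgt = input_and_target_variance(sigma)
    print(f"sigma={sigma:<6} input variance={var_in:.6f} target variance={var_tgt:.6f}")
```

Both variances come out as one at every noise level tried, from the low end of the range to the high end.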
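The loss weighting and the noise distribution used during training follow the same conditioning logic. The per-sample loss is weighted by
$$\lambda(\sigma) = \frac{\sigma^2 + \sigma_{\text{data}}^2}{(\sigma\,\sigma_{\text{data}})^2} = \frac{1}{c_{\text{out}}(\sigma)^2},$$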
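which exactly cancels the $c_{\text{out}}$ scaling so that every noise level contributes a loss of comparable magnitude, and no single $\sigma$ band dominates the gradient. The noise levels seen during training are drawn so that $\ln\sigma$ is Gaussian,
$$\ln\sigma \sim \mathcal{N}\!\left(P_{\text{mean}},\ P_{\text{std}}^2\right), \qquad P_{\text{mean}} = -1.2, \quad P_{\text{std}} = 1.2,$$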
which concentrates training on the intermediate noise levels that matter most for sample quality while still covering the tails. Plugging in, $P_{\text{mean}} = -1.2$ places the median noise level at $\sigma = e^{-1.2} \approx 0.30$, comfortably near $\sigma_{\text{data}} = 0.5$ where the denoising problem is hardest and most informative.
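A few lines of standard-library Python make the weighting and the noise distribution tangible. This is a sketch under stated assumptions: the weight is taken to be $1/c_{\text{out}}^2$, and the spread $P_{\text{std}} = 1.2$ is the value from Karras et al. (2022) rather than something this section derives; the function names are illustrative.

```python
import math
import random


def edm_loss_weight(sigma, sigma_data=0.5):
    """Loss weight lambda(sigma) = 1 / c_out(sigma)^2."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def sample_training_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw one training noise level with ln(sigma) ~ N(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))


rng = random.Random(0)  # fixed seed so the run is repeatable
sigmas = sorted(sample_training_sigma(rng) for _ in range(10001))
median = sigmas[len(sigmas) // 2]
print(f"empirical median sigma: {median:.3f} (theory: {math.exp(-1.2):.3f})")
print(f"smallest / largest draw: {sigmas[0]:.4f} / {sigmas[-1]:.2f}")
print(f"loss weight at sigma=0.05, 0.5, 5: "
      f"{edm_loss_weight(0.05):.1f}, {edm_loss_weight(0.5):.1f}, {edm_loss_weight(5.0):.2f}")
```

The weight is large at low noise, where $c_{\text{out}}$ is small, which is what keeps the effective loss on the network output comparable across noise levels.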
5. The EDM Sampling Schedule and Heun's Method (Advanced)
Choosing how to place the discrete noise levels between $\sigma_{\max}$ and $\sigma_{\min}$ is a sampling decision, fully decoupled from training in the EDM framework. A uniform spacing wastes steps: the trajectory bends most sharply at low noise, so EDM warps the spacing to put more steps there. The schedule for $N$ steps is
$$\sigma_i = \left(\sigma_{\max}^{1/\rho} + \frac{i}{N-1}\left(\sigma_{\min}^{1/\rho} - \sigma_{\max}^{1/\rho}\right)\right)^{\rho}, \qquad i = 0, \dots, N-1,$$
with $\rho = 7$, $\sigma_{\min} \approx 0.002$, and $\sigma_{\max} \approx 80$, and a final step to $\sigma_N = 0$. The exponent $\rho$ controls the warp: $\rho = 1$ recovers uniform spacing in $\sigma$, and larger $\rho$ packs more levels near $\sigma_{\min}$. The value $\rho = 7$ was chosen empirically: Karras et al. report that $\rho \approx 3$ roughly equalizes the truncation error across steps, but that values between 5 and 10 give better samples, which suggests that errors near $\sigma_{\min}$ matter most. Linear interpolation happens in $\sigma^{1/\rho}$ space, then the result is raised back to the $\rho$ power, which is why low-noise levels (small $\sigma^{1/\rho}$) end up densely sampled.
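The warped schedule is short enough to write out directly. The sketch below follows the interpolate-in-$\sigma^{1/\rho}$ recipe with the default values quoted above; `edm_sigmas` is an illustrative name, and it returns the $N$ warped levels followed by the final zero.

```python
def edm_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Return [sigma_0, ..., sigma_{n-1}, 0.0]: n rho-warped levels plus the final zero."""
    if n < 2:
        raise ValueError("need at least two noise levels")
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    # Interpolate linearly in sigma^(1/rho) space, then raise back to the rho power.
    levels = [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]
    return levels + [0.0]


warped = edm_sigmas(10)             # rho = 7
uniform = edm_sigmas(10, rho=1.0)   # plain uniform spacing in sigma
print("rho=7:", [round(s, 4) for s in warped])
print("rho=1:", [round(s, 4) for s in uniform])
print("levels below sigma=1:",
      sum(s < 1.0 for s in warped[:-1]), "warped vs",
      sum(s < 1.0 for s in uniform[:-1]), "uniform")
```

With ten levels, the warped schedule places several of them below $\sigma = 1$, while uniform spacing reaches that region only at its last level.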
With the schedule fixed, EDM solves the ODE with Heun's method, the second-order solver foreshadowed in subsection 2. Heun takes a provisional Euler step, evaluates the slope again at the landing point, then re-steps using the average of the two slopes. This trapezoidal correction raises the solver from first to second order (the local error per step falls from $O(h^2)$ to $O(h^3)$ in the step size $h$) at the cost of one extra network evaluation per step, and it is the dominant reason EDM reaches high quality in roughly 30 to 60 network evaluations where DDIM needs many more.
Input: denoiser $D_\theta$, schedule $\sigma_0 = \sigma_{\max} > \sigma_1 > \dots > \sigma_{N-1} > \sigma_N = 0$.
1. Sample initial point $x_0 \sim \mathcal{N}(0,\ \sigma_0^2 I)$.
2. For $i = 0, \dots, N-1$:
   (a) Evaluate the slope: $d_i = \dfrac{x_i - D_\theta(x_i; \sigma_i)}{\sigma_i}$.
   (b) Euler step: $x_{i+1} = x_i + (\sigma_{i+1} - \sigma_i)\,d_i$.
   (c) If $\sigma_{i+1} \ne 0$, apply the second-order correction: $d_i' = \dfrac{x_{i+1} - D_\theta(x_{i+1}; \sigma_{i+1})}{\sigma_{i+1}}$, then $x_{i+1} = x_i + (\sigma_{i+1} - \sigma_i)\,\tfrac{1}{2}(d_i + d_i')$.
Output: $x_N$, the generated sample at $\sigma = 0$.
The conditional in step (c) matters: at the last step the target noise level is exactly zero, so $d_i'$ would divide by $\sigma_{i+1} = 0$. EDM therefore skips the correction on that final Euler step and lands directly on the clean manifold. Heun's method is purely deterministic here, but EDM also offers an optional stochastic churn: at each step you may add a small amount of fresh noise to nudge $\sigma$ slightly upward before the deterministic step, which can correct accumulated error and improve sample quality for some models, at the cost of reintroducing the randomness that pure ODE sampling removed. Churn is a tunable extra, off by default; the deterministic Heun sampler is the workhorse.
The code below implements the EDM preconditioning and a single Heun step from scratch, so the closed-form coefficients and the two-slope correction are visible side by side rather than hidden inside a scheduler.
```python
# EDM preconditioning (Karras et al. 2022) and one deterministic Heun step.
import torch

SIGMA_DATA = 0.5


def edm_denoise(net, x, sigma):
    """Preconditioned denoiser D(x; sigma) wrapping a raw network F_theta.

    sigma is a tensor holding the noise level (0-d when one level is shared
    by the whole batch, as in heun_step below).
    """
    s = sigma.view(-1, 1, 1, 1)                       # broadcast over images
    c_in = 1.0 / (s**2 + SIGMA_DATA**2).sqrt()
    c_skip = SIGMA_DATA**2 / (s**2 + SIGMA_DATA**2)
    c_out = s * SIGMA_DATA / (s**2 + SIGMA_DATA**2).sqrt()
    c_noise = 0.25 * sigma.log()                      # gentle conditioning input
    F = net(c_in * x, c_noise)                        # the only network call
    return c_skip * x + c_out * F                     # D(x; sigma)


@torch.no_grad()
def heun_step(net, x, sigma_cur, sigma_next):
    """One 2nd-order Heun step from sigma_cur down to sigma_next.

    Both noise levels are 0-d tensors shared by the whole batch; the scalar
    divisions and the zero test below rely on that.
    """
    d = (x - edm_denoise(net, x, sigma_cur)) / sigma_cur      # slope at start
    x_euler = x + (sigma_next - sigma_cur) * d                # provisional Euler
    if sigma_next == 0:                                       # final step: no correction
        return x_euler
    d_next = (x_euler - edm_denoise(net, x_euler, sigma_next)) / sigma_next
    return x + (sigma_next - sigma_cur) * 0.5 * (d + d_next)  # averaged slope
```
The edm_denoise function applies the four closed-form coefficients so the raw network net always sees a unit-variance input, and heun_step evaluates the slope twice (once at the start, once at the provisional Euler landing point) and steps with their average, skipping the correction only on the final step into $\sigma = 0$.
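The benefit of the averaged slope can be checked on a toy problem where everything is known in closed form. The sketch below assumes one-dimensional Gaussian data with standard deviation 0.5, for which the ideal denoiser is a simple rescaling and the ODE has an exact solution, and compares plain Euler against Heun at a matched evaluation budget. The helper names (`edm_sigmas`, `slope`, `sample`, `relative_error`) are illustrative and standard-library only; they are not part of the torch code above.

```python
import math

SIGMA_DATA = 0.5


def edm_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """n rho-warped noise levels from sigma_max down to sigma_min, then a final 0."""
    hi, lo = sigma_max ** (1.0 / rho), sigma_min ** (1.0 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]


def slope(x, sigma):
    """(x - D(x; sigma)) / sigma with the ideal denoiser for N(0, SIGMA_DATA^2) data."""
    denoised = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2) * x
    return (x - denoised) / sigma


def sample(x, sigmas, use_heun):
    """Integrate from sigmas[0] to 0; return the endpoint and the number of slope evaluations."""
    nfe = 0
    for cur, nxt in zip(sigmas[:-1], sigmas[1:]):
        d = slope(x, cur)
        nfe += 1
        x_euler = x + (nxt - cur) * d
        if use_heun and nxt > 0:          # skip the correction on the final step to sigma = 0
            d_next = slope(x_euler, nxt)
            nfe += 1
            x = x + (nxt - cur) * 0.5 * (d + d_next)
        else:
            x = x_euler
    return x, nfe


def relative_error(n, use_heun, x_start=80.0):
    """Relative endpoint error against the exact solution of this linear toy ODE."""
    sigmas = edm_sigmas(n)
    x_end, nfe = sample(x_start, sigmas, use_heun)
    exact = x_start * SIGMA_DATA / math.sqrt(sigmas[0] ** 2 + SIGMA_DATA**2)
    return abs(x_end - exact) / abs(exact), nfe


for n, use_heun in ((64, False), (32, True)):
    err, nfe = relative_error(n, use_heun)
    print(f"{'Heun ' if use_heun else 'Euler'} levels={n:3d} NFE={nfe:3d} relative error={err:.4f}")
```

On this toy problem Heun with 32 levels (63 evaluations) should land closer to the exact endpoint than Euler with 64 levels (64 evaluations). A real network adds approximation error that the toy does not model, so treat the gap as illustrative rather than as a prediction for image models.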
EDM does not contradict the DDPM and DDIM formulations of Sections 33.1 to 33.3; it re-expresses them. The preconditioning coefficients reduce, for particular choices, to the familiar $\epsilon$-prediction and VP schedules, so an EDM denoiser can be converted to the $\epsilon_\theta$ form the diffusers schedulers expect. The payoff of the reframing is that each design choice (the schedule $\sigma_i$, the parameterization via the $c$-coefficients, the loss weighting $\lambda$, the sampler) is now an independent dial you can tune in isolation rather than a tangled bundle, which is exactly why EDM became a strong default baseline for training new diffusion models from scratch.
Explain in a short paragraph why the deterministic DDIM update can take a large jump from step $t$ to a much smaller step $t'$ with little error, while the stochastic ancestral update of Section 33.1 cannot. Reference the smoothness of the probability-flow ODE from Section 33.3 and the role of the injected noise term. Then state what quantity a higher-order solver estimates that Euler/DDIM ignores.
Take a pretrained Stable Diffusion or DDPM checkpoint from diffusers. Generate the same prompt and seed at 5, 10, 20, 50, and 100 steps with both DDIMScheduler and DPMSolverMultistepScheduler, keeping everything else fixed. Arrange the images in a grid and, for each scheduler, note the smallest step count at which you stop seeing improvement. Confirm the claim in subsection 2 that the higher-order solver reaches good quality at fewer steps.
Compare the three speedup strategies of this section along several axes: extra training cost, achievable minimum steps, and whether the original checkpoint is reusable. Summarize in a small table (DDIM, DPM-Solver, progressive/consistency distillation) and write one paragraph advising a team that needs sub-second generation but cannot afford a long distillation run. Connect your recommendation to the two-tier deployment in the practical example, and to the consistency models you will study in Section 33.5.
The EDM coefficients are not tuned; they follow from a unit-variance requirement. Assume the clean data has variance $\sigma_{\text{data}}^2$ and the added noise has variance $\sigma^2$, independent of the data, so the noisy input $x = x_0 + \sigma\epsilon$ has variance $\sigma^2 + \sigma_{\text{data}}^2$. (a) Derive $c_{\text{in}}(\sigma) = 1/\sqrt{\sigma^2 + \sigma_{\text{data}}^2}$ from the requirement that the network input $c_{\text{in}}(\sigma)\,x$ has unit variance. (b) The network is asked to predict the target $\big(x_0 - c_{\text{skip}}(\sigma)\,x\big)/c_{\text{out}}(\sigma)$. Require this target to have unit variance and require $c_{\text{skip}}$ to be chosen so that the coefficient of the noise $\epsilon$ in the effective error is minimized; show that these two conditions yield $c_{\text{skip}}(\sigma) = \sigma_{\text{data}}^2/(\sigma^2 + \sigma_{\text{data}}^2)$ and $c_{\text{out}}(\sigma) = \sigma\,\sigma_{\text{data}}/\sqrt{\sigma^2 + \sigma_{\text{data}}^2}$. (c) Verify the $\sigma \to 0$ and $\sigma \to \infty$ limits and explain in one sentence each why they are the behavior you want.
Using the edm_denoise and heun_step code of subsection 5 (or the EDM reference implementation), build two samplers from the same trained checkpoint: a plain Euler sampler (drop the correction in step (c)) and the full Heun sampler. Generate samples at matched numbers of function evaluations (NFE), recalling that one Heun step costs two network calls and one Euler step costs one, so an $N$-step Heun run and a $2N$-step Euler run use the same NFE. Sweep NFE over roughly 10, 20, 40, and 80, compute FID against a reference set at each point, and plot FID versus NFE for both samplers on one axis. Confirm that Heun reaches a given FID at fewer function evaluations, and identify the NFE region where the second-order solver's advantage is largest.
Karras et al. (2022), "Elucidating the Design Space of Diffusion-Based Generative Models." Disentangles the noise schedule, network preconditioning, loss weighting, and sampler into independently tunable components, and introduces the $c_{\text{skip}}/c_{\text{out}}/c_{\text{in}}/c_{\text{noise}}$ denoiser, the $\rho = 7$ noise schedule, and the second-order Heun sampler. The EDM recipe is a common default starting point for training new diffusion models.