Part IV: Generative Vision Models
Chapter 33: Diffusion Models

Guidance: Classifier & Classifier-Free

"Left to my own devices I will draw you something plausible. Whisper a label in my ear and turn the dial up, and I will draw you something unmistakable, occasionally at the cost of subtlety."

A Conditional Generator With a Volume Knob
Big Picture

Guidance is the mechanism that lets you steer a diffusion model toward a condition (a class, a text prompt) and dial how strongly it obeys. Classifier guidance nudges each sampling step along the gradient of a separately trained classifier. Classifier-free guidance, which now dominates, trains one network to predict noise both with and without the condition and then extrapolates away from the unconditional prediction at sampling time. The single scalar guidance scale is the "prompt strength" or "CFG" slider found in image tools: turn it up for tighter adherence and higher fidelity, down for more diversity and fewer artifacts. This section derives both methods from the score view of Section 33.3 and shows the few lines of code that implement the version everyone uses.

So far the chapter has built unconditional generators that sample some image from the data distribution. Useful generation is almost always conditional: you want a photo of a specific class, or an image matching a text prompt. Section 33.2 showed how to condition the network by feeding it an extra input: the network simply takes the class or text embedding alongside the noisy image. But conditioning alone is often too weak; the model treats the condition as a gentle suggestion and produces only loosely related images. Guidance is the technique that amplifies the condition's influence. We start with the original classifier guidance to build intuition from the score, then derive classifier-free guidance, the method that powers modern systems, and discuss the fidelity-diversity trade-off the guidance scale controls.

1. Conditioning vs Guidance Intermediate

To condition a diffusion model, you give the denoiser the condition $c$ as an additional input: $\epsilon_\theta(x_t, t, c)$. For a class label, $c$ is an embedding added to the timestep embedding; for text, $c$ is a sequence of token embeddings consumed by the cross-attention layers of the U-Net, the same attention mechanism from Chapter 22, where the image features form the queries and the text tokens form the keys and values. This is genuine conditioning, and a well-trained conditional model does respond to $c$. The problem is calibration: maximum-likelihood training spreads probability over everything consistent with $c$, which is a lot, so samples often capture the gist of the condition but not its full specificity. Guidance sharpens the conditional distribution, biasing samples toward regions where the condition is strongly satisfied, at some cost to diversity. The next two subsections give two ways to do it.

2. Classifier Guidance Advanced

The score view of Section 33.3 makes guidance almost obvious. We want to sample from the conditional distribution $p(x \mid c)$, whose score is $\nabla_x \log p(x_t \mid c)$. By Bayes' rule, $\log p(x_t \mid c) = \log p(x_t) + \log p(c \mid x_t) - \log p(c)$, and taking the gradient (the last term vanishes since it does not depend on $x$):

$$\nabla_x \log p(x_t \mid c) = \underbrace{\nabla_x \log p(x_t)}_{\text{unconditional score}} + \underbrace{\nabla_x \log p(c \mid x_t)}_{\text{classifier gradient}}.$$

The first term is the score the diffusion model already estimates (via its noise prediction, per the identity in Section 33.3), the same gradient-of-log-density object you met in the energy-based models of Chapter 30. The second term is the gradient of a classifier that predicts the condition from the noisy image. So Dhariwal and Nichol's classifier guidance trains a separate classifier on noisy images and, at each sampling step, adds a multiple $s$ of its gradient to the score:

$$\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t) - s\,\sqrt{1 - \bar\alpha_t}\ \nabla_x \log p_\phi(c \mid x_t).$$

The formula is the score equation above, just rewritten in noise terms: the minus sign and the $\sqrt{1-\bar\alpha_t}$ factor are exactly the score-to-noise conversion $s_\theta = -\epsilon_\theta/\sqrt{1-\bar\alpha_t}$ from Section 33.3 applied to both terms, so "add the classifier gradient to the score" becomes "subtract a scaled gradient from the predicted noise." The guidance scale $s$ controls how hard the classifier pushes. This works and was how the first diffusion models beat GANs on conditional ImageNet, but it has two real drawbacks: you must train and maintain a separate classifier, and that classifier has to be trained on noisy images at every noise level, which off-the-shelf classifiers are not. That friction motivated the alternative that replaced it.
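The formula above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the reference implementation: `eps_model` and `classifier` are hypothetical callables (the classifier is assumed to return logits and to have been trained on noisy inputs), and `alpha_bar_t` is assumed to be a plain float for the current step.

```python
import torch

def classifier_guided_eps(eps_model, classifier, x_t, t, y, alpha_bar_t, scale=1.0):
    """Return eps - scale * sqrt(1 - alpha_bar_t) * grad_x log p_phi(y | x_t).

    Assumptions (hypothetical interfaces): eps_model(x_t, t) returns the
    unconditional noise prediction; classifier(x_t, t) returns logits of shape
    (batch, num_classes); y holds integer class indices of shape (batch,).
    """
    # Gradient of the selected class log-probability w.r.t. the noisy input.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs.gather(1, y.unsqueeze(1)).sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    # Unconditional noise prediction; no gradient is needed for this pass.
    with torch.no_grad():
        eps = eps_model(x_t, t)
    # Adding the gradient to the score means subtracting it from the noise.
    return eps - scale * (1.0 - alpha_bar_t) ** 0.5 * grad
```

The returned tensor would take the place of the plain noise prediction in a sampler step; with `scale=0` it reduces to the unconditional prediction.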

Fun Fact

Classifier guidance is closely related to the adversarial-example phenomenon from the security literature: you are following the gradient of a classifier to push an image toward a target class. The difference is intent. An adversarial attack nudges an image just enough to fool a classifier while looking unchanged to humans; classifier guidance nudges a noisy image so the generator produces something the classifier confidently labels. Same gradient, opposite goals, and the same reason both need a classifier that is robust to the input distribution it is shown.

3. Classifier-Free Guidance Intermediate

Classifier-free guidance (CFG) removes the external classifier entirely with a beautiful trick. During training, the model is shown the condition most of the time but, with some probability (say 10 to 20 percent), the condition is dropped and replaced by a special null token $\varnothing$. The single network thus learns both the conditional noise $\epsilon_\theta(x_t, t, c)$ and the unconditional noise $\epsilon_\theta(x_t, t, \varnothing)$. At sampling time, you compute both and extrapolate: push away from the unconditional prediction in the direction of the conditional one.

$$\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big).$$

Here $w$ is the guidance scale (in the diffusers and Stable Diffusion world it is the guidance_scale parameter, typically 5 to 12 for text-to-image). At $w = 1$ you recover plain conditional sampling; $w = 0$ is unconditional; larger $w$ exaggerates the difference between conditional and unconditional, sharpening adherence to the prompt. The reason this works is that the difference $\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)$ is, up to the score-to-noise factor $-\sqrt{1 - \bar\alpha_t}$ from subsection 2, an implicit estimate of the classifier gradient $\nabla_x \log p(c \mid x_t)$, obtained without any classifier. Figure 33.6.1 shows the extrapolation geometrically.

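[Diagram: three vectors drawn from $x_t$, labeled uncond $\epsilon(\cdot, \varnothing)$, cond $\epsilon(\cdot, c)$, and guided ($w > 1$), with the guided tip extrapolated beyond the cond tip.]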
Figure 33.6.1: Classifier-free guidance as vector extrapolation. From the current point $x_t$, the network gives an unconditional noise estimate (gray) and a conditional one (blue), both drawn as vectors from $x_t$. Their tip-to-tip difference (dashed) is the implicit classifier direction. The guided estimate (green) is the unconditional vector plus $w$ times that difference, so the three tips are collinear and for $w > 1$ the guided tip overshoots beyond the conditional tip, exaggerating the condition. The guidance scale $w$ sets how far past the conditional tip the green arrow reaches.

The function below implements that extrapolation: it runs the conditional and unconditional predictions in one batched pass and combines them with the guidance scale, the drop-in replacement for the plain noise estimate in any sampler.

# Classifier-free guidance at sampling time.
# Run the unconditional and conditional noise predictions in one batched
# pass, then extrapolate beyond the conditional by guidance_scale.
import torch

@torch.no_grad()
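def cfg_predict(unet, x_t, t, cond_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: one network, two conditionings, extrapolate.

    Assumes a diffusers-style UNet whose output exposes .sample, and that x_t, t,
    cond_emb, and null_emb share the same leading batch dimension (t is 1-D).
    """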
    # Batch the conditional and unconditional inputs together for one forward pass
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([null_emb, cond_emb], dim=0)        # uncond first, cond second
    eps_uncond, eps_cond = unet(x_in, t_in, c_in).sample.chunk(2)
    # Extrapolate away from the unconditional toward the conditional
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Training-side change: randomly replace cond_emb with null_emb so the same network
# also learns the unconditional prediction (as written, this drops a whole batch at once):
# if torch.rand(1) < 0.1: cond_emb = null_emb     # ~10% null-conditioning dropout
Code Fragment 1: The cfg_predict function for classifier-free guidance at sampling time. Conditional and unconditional predictions are computed in a single batched forward pass, then extrapolated by guidance_scale; the commented line shows the only training change needed, randomly dropping the condition to null_emb. Its output replaces the plain eps in any sampler from Section 33.4.
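The commented training line in Code Fragment 1 drops the condition for an entire batch at once. A per-sample variant could look like the sketch below; the helper name `drop_condition` and its arguments are illustrative rather than taken from any library, and `null_emb` is assumed to broadcast against `cond_emb`.

```python
import torch

def drop_condition(cond_emb, null_emb, p_uncond=0.1, generator=None):
    """Replace each sample's condition with the null embedding with prob. p_uncond.

    cond_emb: (batch, ...) conditioning tensor.
    null_emb: tensor broadcastable to cond_emb, e.g. shape (1, ...).
    """
    batch = cond_emb.shape[0]
    # One Bernoulli draw per sample, not one per batch.
    drop = torch.rand(batch, generator=generator, device=cond_emb.device) < p_uncond
    # Reshape the mask so it broadcasts over the non-batch dimensions.
    mask = drop.view(batch, *([1] * (cond_emb.dim() - 1)))
    return torch.where(mask, null_emb.expand_as(cond_emb), cond_emb)
```

Calling this on the conditioning tensor just before the training forward pass leaves the rest of the training loop unchanged.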
Key Insight: One Slider Trades Fidelity for Diversity

The guidance scale $w$ is the most consequential single knob in conditional generation. Low $w$ (near 1) gives diverse, sometimes loosely-related samples that better match the true data distribution. High $w$ (8 to 15) gives samples that adhere tightly to the prompt with saturated, high-contrast, "obviously generated" looks, and at very high $w$ the images become oversaturated and artifact-ridden because the extrapolation leaves the data manifold. Every "prompt strength," "CFG scale," or "guidance" slider you have ever seen in an image tool is this $w$. Understanding that it is an extrapolation, not a probability, explains why turning it to maximum does not give "maximum quality" but rather maximum distortion.

You Could Build This: A Guidance-Scale Explorer

You do not need to train anything to build a small interactive tool that makes the fidelity-diversity trade-off of the Key Insight tangible. Load any text-to-image pipeline from diffusers, fix the seed, and sweep guidance_scale across a grid (say 1, 3, 5, 7.5, 12, 20) for one prompt, then tile the results into a single strip so the climb from loose-but-diverse to tight-but-oversaturated is visible at a glance. Wrap it in a tiny Gradio slider that regenerates the strip on demand. Difficulty: beginner, about 30 minutes, since it reuses a pretrained checkpoint and only varies one argument. The finished explorer is a clean portfolio demo that shows you can explain, not just invoke, the single most-tuned knob in every image tool, and it doubles as the visual evidence Exercise 33.6.2 asks for.

The Right Tool: guidance_scale in diffusers Pipelines

The standard text-to-image pipelines in diffusers implement classifier-free guidance internally and expose it as the single argument guidance_scale: pipe(prompt, guidance_scale=7.5). The batched two-pass forward, the null-embedding handling, and the per-step combination of subsection 3 are all done for you, roughly the 15 lines above plus the sampler integration, reduced to one keyword. The null embedding for text models is the encoding of the empty string, which the pipeline computes automatically. Newer pipelines add refinements like guidance rescaling (to counter the oversaturation at high $w$) and per-step guidance schedules; these too are arguments or config flags rather than code you write. Implement cfg_predict once to see the mechanism; set guidance_scale in practice.
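As a rough sketch of what setting guidance_scale in practice looks like, the helpers below sweep the scale at a fixed seed. The names `load_pipe` and `sweep_guidance` are hypothetical, the checkpoint id is deliberately left to the caller, and the code assumes a pipeline that accepts `guidance_scale` and `generator` and returns an object with an `images` list, as `StableDiffusionPipeline` does.

```python
import torch

def load_pipe(model_id, device="cuda"):
    """Load a Stable Diffusion pipeline; model_id is supplied by the caller."""
    from diffusers import StableDiffusionPipeline  # imported lazily; needs diffusers installed
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    return pipe.to(device)

def sweep_guidance(pipe, prompt, scales=(1, 3, 5, 7.5, 12, 20), seed=0):
    """Generate one image per guidance scale, all from the same starting noise."""
    images = []
    for scale in scales:
        # Re-seed for every scale so only guidance_scale changes between images.
        generator = torch.Generator(device="cpu").manual_seed(seed)
        result = pipe(prompt, guidance_scale=scale, generator=generator)
        images.append(result.images[0])
    return images
```

Re-creating the generator inside the loop is the important design choice: reusing one generator across calls would change the initial noise from image to image and confound the comparison.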

Practical Example: Tuning Guidance for a Product-Photo Generator

Who: an e-commerce team generating on-brand product backgrounds with a fine-tuned latent diffusion model, 2024. Situation: they exposed the model to non-technical merchandisers who typed a short prompt and expected a usable image. Problem: at the default guidance scale of 7.5 many outputs were oversaturated and had the tell-tale over-contrast look that made products appear unnatural, while merchandisers who lowered it too far got images that ignored the requested style. Decision: the team measured prompt adherence (via a CLIPScore, previewed in Chapter 37) and human preference across a sweep of $w$ from 3 to 12, found the sweet spot near 5, and enabled guidance rescaling to tame the high-$w$ saturation. They then hid the slider and shipped the tuned value as a fixed default. Result: output quality became consistent, support tickets about "weird colors" dropped, and merchandisers stopped needing to understand a parameter they had no intuition for. Lesson: the guidance scale needs tuning per model and per use case, and for non-expert users the right move is often to measure the optimum once and hide the knob, rather than exposing a parameter whose behavior surprises everyone who turns it up.

Research Frontier: Fixing Guidance's Side Effects

Classifier-free guidance is indispensable but its oversaturation and reduced-diversity side effects at high scales spurred a wave of 2023 to 2025 refinements. Guidance rescaling (Lin et al., 2023, arXiv:2305.08891) renormalizes the guided prediction to fix the over-bright outputs. Dynamic and interval guidance apply CFG only during the middle sampling steps, where it helps, and skip it at the ends, where it mostly adds artifacts; the finding that guidance is harmful at the highest noise levels and largely unnecessary at the lowest led to the "guidance interval" of Kynkäänniemi et al. (2024, arXiv:2404.07724), which improved the ImageNet-512 record FID at no extra cost. Autoguidance (Karras et al., 2024, arXiv:2406.02507) replaces the unconditional model with a weaker, under-trained version of the model itself, so the guiding signal no longer depends on dropping the condition and the method applies even to models with no class condition at all. And a parallel line removes the need for guidance during sampling by distilling a guided teacher into a student that bakes the guidance in, which is essential for the few-step models of Section 33.5 since running two forward passes per step defeats the speedup. Guidance is mature, but how to get its benefits without its distortions is still actively researched.
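To make the first of these refinements concrete, here is a small NumPy sketch of guidance rescaling in the spirit of Lin et al.: the guided prediction is rescaled so its per-sample standard deviation matches that of the conditional prediction, then blended with the unrescaled version. The function name and the default blend weight `phi=0.7` are assumptions for illustration; consult the paper or your library's implementation for the exact variant it uses.

```python
import numpy as np

def rescale_guided(eps_cfg, eps_cond, phi=0.7):
    """Pull the std of the guided prediction back toward the conditional one.

    eps_cfg:  guided prediction, shape (batch, ...).
    eps_cond: conditional prediction, same shape.
    phi:      blend weight in [0, 1]; 0 leaves eps_cfg unchanged.
    """
    axes = tuple(range(1, eps_cfg.ndim))          # all non-batch axes
    std_cond = eps_cond.std(axis=axes, keepdims=True)
    std_cfg = eps_cfg.std(axis=axes, keepdims=True)
    rescaled = eps_cfg * (std_cond / std_cfg)     # match per-sample std
    return phi * rescaled + (1.0 - phi) * eps_cfg
```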

Exercise 33.6.1: From Bayes to the CFG Formula Conceptual

Starting from the Bayes decomposition in subsection 2 and the score-noise identity from Section 33.3, argue in a short paragraph why the difference $\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)$ plays the role of the classifier gradient $\nabla_x \log p(c \mid x_t)$. Then explain why $w = 1$ in the CFG formula corresponds to ordinary conditional sampling and what $w > 1$ does to the implied conditional distribution.

Exercise 33.6.2: Sweep the Guidance Scale Coding

Using a text-to-image pipeline from diffusers, generate the same prompt and seed at guidance_scale in {1, 3, 5, 7.5, 12, 20}. Arrange the images in a row and describe how prompt adherence, contrast/saturation, and diversity (run a few seeds at each scale) change as the scale rises. Identify the scale at which oversaturation becomes objectionable, and confirm the Key Insight's claim that maximum scale is not maximum quality.

Exercise 33.6.3: The Cost of Two Forward Passes Analysis

Classifier-free guidance requires a conditional and an unconditional network evaluation at every sampling step, doubling the compute per step relative to unguided sampling. For a 20-step DDIM sampler from Section 33.4, compute the total number of network evaluations with and without CFG. Then explain in one paragraph why this doubling is a serious problem for the one- to four-step models of Section 33.5, and what guidance distillation does about it. Relate the trade-off to the two-tier deployment idea from Section 33.4.

4. Classifier Guidance, Derived Rigorously Advanced

Subsection 2 stated the classifier-guidance formula and gave its intuition. Here we derive it carefully, because the derivation exposes exactly what the guidance scale $s$ does to the sampled distribution and why $s = 1$ is special. The setup is a deliberately reshaped target. Instead of sampling the true posterior $p(x \mid y)$ for a condition $y$, classifier guidance samples from a tempered posterior

$$p_s(x \mid y) \;\propto\; p(x)\, p(y \mid x)^{\,s},$$

where the classifier likelihood $p(y \mid x)$ is raised to a power $s \ge 0$ before being combined with the unconditional prior $p(x)$. The exponent $s$ is the only free parameter. When $s = 1$ this is exactly Bayes' rule, $p(x)\,p(y \mid x) \propto p(x \mid y)$, so guidance with $s = 1$ samples the true Bayesian posterior and nothing more. When $s > 1$ the likelihood term is sharpened: regions where the classifier is confident about $y$ get their probability raised to a power greater than one and so dominate even more, which concentrates samples onto the most unambiguous exemplars of $y$ at the cost of diversity. When $s < 1$ the condition is softened toward the unconditional prior. The single scalar is therefore a temperature on the classifier, not a probability, which is the reason turning it up does not give "more correct" samples but rather "more stereotypical, less varied" ones.
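A one-dimensional toy makes the temperature reading tangible. The sketch below assumes a standard normal prior and a unit-variance Gaussian likelihood centered at an illustrative observation of 2 (both choices are ours, not the chapter's), then computes the mean and variance of the tempered posterior on a grid for several exponents.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
log_prior = -0.5 * x**2             # p(x) = N(0, 1), up to a constant
log_lik = -0.5 * (x - 2.0) ** 2     # p(y | x) with y = 2, unit variance (toy choice)

def tempered_moments(s):
    """Mean and variance of p_s(x | y), proportional to p(x) * p(y | x)**s."""
    log_p = log_prior + s * log_lik
    p = np.exp(log_p - log_p.max())
    p /= p.sum() * dx               # normalize on the grid
    mean = (x * p).sum() * dx
    var = ((x - mean) ** 2 * p).sum() * dx
    return mean, var

for s in (0.0, 0.5, 1.0, 3.0, 10.0):
    mean, var = tempered_moments(s)
    print(f"s={s:>4}: mean={mean:.3f}  var={var:.3f}")
```

For this Gaussian pair the tempered posterior has mean $2s/(1+s)$ and variance $1/(1+s)$ in closed form, so raising $s$ drags samples toward the condition while shrinking their spread, and $s = 0$ returns the prior.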

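Take the log and the gradient with respect to $x$ of the tempered posterior, writing $p_t$ for each density at noise level $t$, so that the left side below denotes the score of the tempered posterior $p_s$ at that noise level. The normalizing constant does not depend on $x$ and drops, leaving the guided score in its cleanest form: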

$$\nabla_x \log p_t(x \mid y) \;=\; \nabla_x \log p_t(x) \;+\; s\,\nabla_x \log p_t(y \mid x).$$

Each term is now unambiguous. The first, $\nabla_x \log p_t(x)$, is the unconditional score the diffusion model already supplies. The second, $\nabla_x \log p_t(y \mid x)$, is the gradient of the log-likelihood of a classifier $p_\phi$ evaluated at the noisy sample $x_t$, and $s$ scales how hard that gradient pushes. Setting $s = 1$ recovers the plain Bayes posterior of subsection 2; the extra factor $s$ is precisely the exponent on the likelihood. This is the score-domain statement. To use it inside an $\epsilon$-prediction sampler we convert via the identity from Section 33.3, $\nabla_x \log p_t(x) = -\,\epsilon_\theta(x_t, t)/\sqrt{1 - \bar\alpha_t}$, which says the predicted noise is the negative score scaled by $\sqrt{1 - \bar\alpha_t}$. Multiplying the whole guided score by $-\sqrt{1 - \bar\alpha_t}$ to express it as an effective noise prediction gives the form you implement:

$$\hat\epsilon(x_t, t) \;=\; \epsilon_\theta(x_t, t) \;-\; \sqrt{1 - \bar\alpha_t}\; s\; \nabla_{x_t} \log p_\phi(y \mid x_t).$$

Read term by term: start from the network's noise prediction $\epsilon_\theta(x_t, t)$, then subtract the classifier gradient $\nabla_{x_t} \log p_\phi(y \mid x_t)$, weighted by the guidance scale $s$ and by $\sqrt{1 - \bar\alpha_t}$. The minus sign is the score-to-noise sign flip; the $\sqrt{1 - \bar\alpha_t}$ factor is the same conversion factor applied to the classifier term so that both pieces live in noise units and can be added. Plug $\hat\epsilon$ into any DDPM or DDIM update from Section 33.4 in place of $\epsilon_\theta$ and the sampler now walks toward the tempered posterior.

Why the Classifier Must See Noisy Inputs

The gradient $\nabla_{x_t} \log p_\phi(y \mid x_t)$ is evaluated at the noisy state $x_t$, not at a clean image. An off-the-shelf classifier trained on clean photographs produces meaningless gradients on a sample that is mostly noise at large $t$, which is why Dhariwal and Nichol train a dedicated classifier on inputs corrupted to every noise level $t$. This requirement, a separate model trained across the full noise schedule, is the practical friction that classifier-free guidance removes.
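One training step for such a noise-aware classifier might look like the following sketch. It assumes the variance-preserving forward process of this chapter, a hypothetical `classifier(x_t, t)` that returns logits, and a 1-D tensor `alpha_bar` of cumulative signal levels indexed by timestep; none of these names come from the original paper's code.

```python
import torch
import torch.nn.functional as F

def noisy_classifier_step(classifier, optimizer, x0, labels, alpha_bar):
    """One optimizer step of cross-entropy training on randomly noised inputs."""
    batch = x0.shape[0]
    # Sample an independent noise level for every example in the batch.
    t = torch.randint(0, alpha_bar.shape[0], (batch,), device=x0.device)
    a = alpha_bar[t].view(batch, *([1] * (x0.dim() - 1)))
    # Forward-diffuse the clean inputs: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise.
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)
    loss = F.cross_entropy(classifier(x_t, t), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```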

5. Classifier-Free Guidance: Two Conventions and the Implicit Classifier Advanced

Subsection 3 introduced classifier-free guidance (CFG) and gave one form of its update. There are two widely used algebraic forms of the same operation, and they assign different meanings to the guidance scalar. Confusing them is the single most common source of "my guidance scale of 7.5 behaves like someone else's 6.5" bugs, so we state both explicitly and pin down the conversion. Ho and Salimans train one network jointly on the conditional objective and, with some dropout probability, on the unconditional objective by replacing the condition with a null token $\varnothing$. The same network thus estimates both $\epsilon_\theta(x_t, c)$ and $\epsilon_\theta(x_t, \varnothing)$. Their sampling rule, in the paper's own notation, is

$$\tilde\epsilon_\theta(x_t, c) \;=\; (1 + w)\,\epsilon_\theta(x_t, c) \;-\; w\,\epsilon_\theta(x_t, \varnothing).$$

Here $w = 0$ means no guidance (you get the plain conditional prediction), and $w > 0$ extrapolates. Most implementations, including the diffusers library and Stable Diffusion, instead expose a "guidance scale" $s$ (the quantity that subsection 3 wrote as $w$) and write the algebraically rearranged form

$$\tilde\epsilon_\theta(x_t, c) \;=\; \epsilon_\theta(x_t, \varnothing) \;+\; s\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big).$$

In this form $s = 1$ means no guidance (the unconditional plus one times the difference collapses to the plain conditional), and $s > 1$ extrapolates. The two are the same equation: writing $\epsilon_c$ and $\epsilon_\varnothing$ for the conditional and unconditional predictions, expanding the second gives $\epsilon_\varnothing + s\,\epsilon_c - s\,\epsilon_\varnothing = s\,\epsilon_c - (s-1)\,\epsilon_\varnothing$, which matches the first under the relation $s = 1 + w$. The familiar default $s = 7.5$ corresponds to $w = 6.5$.

Warning: Two Guidance-Scale Conventions Differ by One

The Ho and Salimans scalar $w$ and the implementation "guidance scale" $s$ are not the same number. They are related by $s = 1 + w$. No guidance is $w = 0$ in the paper's convention but $s = 1$ in code. The popular default $s = 7.5$ in diffusers is $w = 6.5$ in the paper. When you read a method that reports a guidance value, check which form it uses: the paper form $\tilde\epsilon = (1+w)\epsilon_c - w\,\epsilon_\varnothing$ or the implementation form $\tilde\epsilon = \epsilon_\varnothing + s(\epsilon_c - \epsilon_\varnothing)$. Reporting a bare number without the convention is ambiguous by exactly one unit.
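A quick numeric check of the conversion, using made-up prediction vectors, can help when porting a reported guidance value between codebases. The function names here are illustrative only.

```python
import numpy as np

def paper_to_impl(w):
    """Ho and Salimans scalar w -> implementation guidance scale s."""
    return 1.0 + w

def impl_to_paper(s):
    """Implementation guidance scale s -> Ho and Salimans scalar w."""
    return s - 1.0

def cfg_paper(eps_c, eps_u, w):
    return (1.0 + w) * eps_c - w * eps_u

def cfg_impl(eps_c, eps_u, s):
    return eps_u + s * (eps_c - eps_u)

rng = np.random.default_rng(0)
eps_c, eps_u = rng.normal(size=5), rng.normal(size=5)   # stand-in predictions
s = 7.5
same = np.allclose(cfg_impl(eps_c, eps_u, s), cfg_paper(eps_c, eps_u, impl_to_paper(s)))
print(impl_to_paper(s), same)
```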

Why does extrapolating between two noise predictions do anything coherent? Convert both to scores and read the result as a probability statement. Using $\epsilon \propto -\nabla \log p$, the paper form (with $w = s - 1$) becomes, in score terms,

$$\nabla_x \log \tilde p(x \mid c) \;=\; \nabla_x \log p(x \mid c) \;+\; w\,\big[\nabla_x \log p(x \mid c) - \nabla_x \log p(x)\big].$$

The bracketed difference is the gradient of $\log p(x \mid c) - \log p(x) = \log\frac{p(x \mid c)}{p(x)} = \log p(c \mid x) + \text{const}$, by Bayes' rule. So the difference between the conditional and unconditional scores is the gradient of an implicit classifier $p(c \mid x) \propto p(x \mid c)/p(x)$, obtained without ever training one. CFG is therefore classifier guidance with the classifier replaced by the difference of the two predictions the single network already makes. Regrouping the right-hand side as $\nabla_x \log p(x) + (1 + w)\,\nabla_x \log p(c \mid x)$ shows that the implementation scale $s = 1 + w$ is exactly the exponent on that implicit classifier, matching the role of $s$ for the explicit classifier in subsection 4. This is the precise sense in which CFG "needs no separate classifier."
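The implicit-classifier identity can be checked numerically on a toy model where everything is available in closed form. The sketch below assumes a 1-D mixture of two unit-variance Gaussians with equal class priors (an illustrative setup, not one used elsewhere in the chapter) and compares the score difference against a finite-difference gradient of the log posterior.

```python
import numpy as np

mu = np.array([-1.0, 2.0])           # class means of the toy two-class model
x = np.linspace(-3.0, 4.0, 29)

def log_posterior_c1(z):
    """log p(c=1 | z) for equal priors and unit variances, via Bayes' rule."""
    log_joint = -0.5 * (z[:, None] - mu[None, :]) ** 2
    return log_joint[:, 1] - np.logaddexp(log_joint[:, 0], log_joint[:, 1])

p1 = np.exp(log_posterior_c1(x))
score_cond = -(x - mu[1])                                    # grad log p(x | c=1)
score_uncond = -(1.0 - p1) * (x - mu[0]) - p1 * (x - mu[1])  # grad log p(x), mixture
score_diff = score_cond - score_uncond

h = 1e-5                                                     # central differences
classifier_grad = (log_posterior_c1(x + h) - log_posterior_c1(x - h)) / (2 * h)
print(np.max(np.abs(score_diff - classifier_grad)))
```

The printed maximum discrepancy should be at the level of finite-difference error, confirming that conditional minus unconditional score equals the classifier gradient for this model.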

Algorithm: Classifier-Free Guidance Sampling

Given a network $\epsilon_\theta$ trained jointly on the condition $c$ and the null token $\varnothing$, a guidance scale $s$ (implementation convention, $s = 1$ is off), a decreasing time grid $t_N > t_{N-1} > \cdots > t_0$, and a base sampler update (DDPM or DDIM) from Section 33.4:

  1. Initialize $x_{t_N} \sim \mathcal N(0, I)$.
  2. For $i = N, N-1, \dots, 1$:
    1. Compute the unconditional prediction $\epsilon_\varnothing = \epsilon_\theta(x_{t_i}, t_i, \varnothing)$ and the conditional prediction $\epsilon_c = \epsilon_\theta(x_{t_i}, t_i, c)$, ideally in one batched forward pass.
    2. Combine: $\hat\epsilon = \epsilon_\varnothing + s\,(\epsilon_c - \epsilon_\varnothing)$.
    3. Take one sampler step using $\hat\epsilon$ in place of the plain noise prediction to produce $x_{t_{i-1}}$.
  3. Return $x_{t_0}$.

Cost: two network evaluations per step instead of one. The batched pass in the first sub-step of step 2 keeps the wall-clock overhead near a single larger forward rather than two sequential ones.
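The algorithm above can be sketched end to end with a deterministic DDIM update as the base sampler. This is a minimal NumPy illustration under stated assumptions: `eps_model(x, i, c)` is a hypothetical callable, `alpha_bar[0] == 1` marks the clean end of the grid, and the toy model at the bottom uses closed-form noise predictions (data exactly equal to the condition vector for the conditional case, standard normal data for the unconditional case) purely so the result can be checked.

```python
import numpy as np

def cfg_ddim_sample(eps_model, cond, null, alpha_bar, guidance_scale, shape, seed=0):
    """CFG sampling with a deterministic DDIM step.

    alpha_bar[i] is the cumulative signal level at grid point i, with
    alpha_bar[0] == 1 (clean) and alpha_bar[-1] close to 0 (almost pure noise).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                           # step 1: start from noise
    for i in range(len(alpha_bar) - 1, 0, -1):               # step 2: walk the grid down
        a_t, a_prev = alpha_bar[i], alpha_bar[i - 1]
        eps_u = eps_model(x, i, null)                        # unconditional prediction
        eps_c = eps_model(x, i, cond)                        # conditional prediction
        eps_hat = eps_u + guidance_scale * (eps_c - eps_u)   # combine
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x                                                 # step 3

# Toy closed-form model, for illustration only.
alpha_bar = np.linspace(1.0, 0.01, 21)

def toy_eps(x, i, c):
    a = alpha_bar[i]
    if c is None:                                  # unconditional: data ~ N(0, I)
        return np.sqrt(1.0 - a) * x
    return (x - np.sqrt(a) * c) / np.sqrt(1.0 - a)  # conditional: data equals c

target = np.array([2.0, -1.0])
sample = cfg_ddim_sample(toy_eps, target, None, alpha_bar, guidance_scale=1.0, shape=(2,))
print(sample)
```

With `guidance_scale=1.0` the toy conditional model is followed exactly, so the sample lands on the target; a real network would be evaluated in one batched pass rather than two separate calls.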

6. Diffusion for Inverse Problems Advanced

Guidance steers generation toward a label or a prompt. A closely related and practically enormous problem is steering generation to be consistent with a measurement: given a blurred, masked, downsampled, or otherwise degraded observation, reconstruct a plausible clean image. This is an inverse problem, and a pretrained unconditional diffusion model is a powerful image prior for it, no retraining required. Model the measurement as

$$y \;=\; \mathcal A(x_0) \;+\; n, \qquad n \sim \mathcal N(0, \sigma_y^2 I),$$

where $x_0$ is the unknown clean image, $\mathcal A$ is a known forward operator (a blur kernel, a masking operator for inpainting, a downsampling operator for super-resolution), and $n$ is measurement noise of variance $\sigma_y^2$. We want to sample the posterior $p(x_0 \mid y)$. As in classifier guidance, the posterior score splits by Bayes' rule into the prior score the diffusion model gives us plus a measurement-likelihood term:

$$\nabla_{x_t} \log p_t(x_t \mid y) \;=\; \nabla_{x_t} \log p_t(x_t) \;+\; \nabla_{x_t} \log p_t(y \mid x_t).$$

The first term is supplied by the unconditional model. The hard term is the second, the likelihood of the measurement given the noisy intermediate $x_t$, because $y$ is a function of the clean $x_0$, not of $x_t$. There is no closed form for $p_t(y \mid x_t)$, since it requires marginalizing over all clean images consistent with $x_t$. Diffusion Posterior Sampling (DPS) makes this tractable with one well-chosen approximation.

The key tool is Tweedie's formula, which gives the posterior mean of the clean image given the noisy one directly from the score. For the variance-preserving forward process, the minimum-mean-square estimate of $x_0$ given $x_t$ is

$$\hat x_0(x_t) \;=\; \frac{1}{\sqrt{\bar\alpha_t}}\,\Big(x_t + (1 - \bar\alpha_t)\,s_\theta(x_t, t)\Big),$$

where $s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)$ is the learned score. Intuitively, $\hat x_0$ is the model's single best guess at the clean image hidden inside $x_t$: it takes the noisy sample, nudges it along the score toward higher density, and rescales by $1/\sqrt{\bar\alpha_t}$ to undo the forward shrinkage. DPS now approximates the intractable $p_t(y \mid x_t)$ by collapsing the distribution over clean images to this point estimate, $p_t(y \mid x_t) \approx p(y \mid \hat x_0(x_t))$. Under the Gaussian measurement model that likelihood is $p(y \mid \hat x_0) \propto \exp\!\big(-\frac{1}{2\sigma_y^2}\|y - \mathcal A(\hat x_0)\|^2\big)$, so its log-gradient is

$$\nabla_{x_t} \log p_t(y \mid x_t) \;\approx\; -\,\frac{1}{2\sigma_y^2}\,\nabla_{x_t}\big\|y - \mathcal A(\hat x_0(x_t))\big\|^2.$$

Note that $\hat x_0$ itself depends on $x_t$ through the network, so this gradient is taken through the score network by automatic differentiation, a vector-Jacobian product, not a gradient of a fixed quantity. In practice DPS folds the $1/(2\sigma_y^2)$ and the inevitable approximation error into a single tunable step size $\zeta_i$ and applies the correction as an extra term after each ordinary reverse step. If $x_{i-1}'$ is the sample produced by one unconditional reverse-diffusion update, the DPS update is

$$x_{i-1} \;\leftarrow\; x_{i-1}' \;-\; \zeta_i\,\nabla_{x_i}\big\|y - \mathcal A(\hat x_0(x_i))\big\|^2.$$

This is remarkably general: any forward operator $\mathcal A$ you can differentiate, linear or nonlinear, plugs in directly, which is why DPS handles inpainting, deblurring, super-resolution, and phase retrieval with the same code and a single pretrained prior.
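Tweedie's formula, the building block of the update above, can be sanity-checked where the posterior mean is known in closed form. The sketch assumes 1-D Gaussian data $x_0 \sim \mathcal N(\mu, \sigma_0^2)$ with illustrative parameter values, for which both the score of $p_t$ and $\mathbb E[x_0 \mid x_t]$ are available analytically.

```python
import numpy as np

mu, sigma0, a_bar = 1.5, 0.7, 0.3        # toy data N(mu, sigma0^2) and a signal level
x_t = np.linspace(-3.0, 3.0, 13)

# Marginal of x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise is Gaussian.
var_t = a_bar * sigma0**2 + (1.0 - a_bar)
score = -(x_t - np.sqrt(a_bar) * mu) / var_t

# Tweedie estimate from the score.
tweedie = (x_t + (1.0 - a_bar) * score) / np.sqrt(a_bar)

# Exact posterior mean from Gaussian conditioning.
posterior_mean = mu + np.sqrt(a_bar) * sigma0**2 / var_t * (x_t - np.sqrt(a_bar) * mu)
print(np.max(np.abs(tweedie - posterior_mean)))
```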

Algorithm: Diffusion Posterior Sampling (DPS)

Given a pretrained score network $s_\theta(x_t, t)$, a differentiable forward operator $\mathcal A$, a measurement $y$ with noise level $\sigma_y$, step sizes $\{\zeta_i\}$, and a decreasing time grid $t_N > \cdots > t_0$:

  1. Initialize $x_N \sim \mathcal N(0, I)$.
  2. For $i = N, N-1, \dots, 1$:
    1. Estimate the noise / score at $x_i$ and form the Tweedie posterior mean $\hat x_0 = \tfrac{1}{\sqrt{\bar\alpha_i}}\big(x_i + (1 - \bar\alpha_i)\,s_\theta(x_i, t_i)\big)$.
    2. Take one ordinary unconditional reverse step (DDPM or DDIM from Section 33.4) to get the prior-only proposal $x_{i-1}'$.
    3. Compute the measurement residual $\|y - \mathcal A(\hat x_0)\|^2$ and its gradient with respect to $x_i$ by backpropagating through $s_\theta$.
    4. Apply the likelihood correction: $x_{i-1} \leftarrow x_{i-1}' - \zeta_i\,\nabla_{x_i}\|y - \mathcal A(\hat x_0)\|^2$.
  3. Return $x_0$.
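The steps above can be sketched in PyTorch as follows. This is a simplified illustration, not the authors' implementation: it uses a deterministic DDIM reverse step, a constant step size `zeta` in place of the per-step $\zeta_i$, and hypothetical callables `score_net(x, i)` and `forward_op(x0)`; `alpha_bar` is assumed to be a 1-D tensor with `alpha_bar[0] == 1`.

```python
import torch

def dps_sample(score_net, forward_op, y, alpha_bar, zeta, shape, seed=0):
    """Diffusion posterior sampling with a DDIM base step and constant step size."""
    gen = torch.Generator().manual_seed(seed)
    x = torch.randn(shape, generator=gen)                       # step 1
    for i in range(len(alpha_bar) - 1, 0, -1):                  # step 2
        a_t, a_prev = alpha_bar[i], alpha_bar[i - 1]
        x = x.detach().requires_grad_(True)
        score = score_net(x, i)
        # Tweedie posterior mean of the clean image.
        x0_hat = (x + (1.0 - a_t) * score) / a_t.sqrt()
        # Measurement residual, differentiated through the score network.
        residual = ((y - forward_op(x0_hat)) ** 2).sum()
        grad = torch.autograd.grad(residual, x)[0]
        with torch.no_grad():
            # Ordinary (prior-only) DDIM step, then the likelihood correction.
            eps = -(1.0 - a_t).sqrt() * score
            x_prior = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps
            x = x_prior - zeta * grad
    return x.detach()                                           # step 3
```

A toy usage, assuming standard normal data so that the exact score is `-x`, and a binary mask as the inpainting operator:

```python
alpha_bar = torch.linspace(1.0, 0.01, 51)
mask = torch.tensor([1.0, 1.0, 0.0, 0.0])
y = torch.tensor([1.5, -2.0, 0.0, 0.0])
restored = dps_sample(lambda x, i: -x, lambda x0: mask * x0, y, alpha_bar, zeta=0.4, shape=(4,))
```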
Key Insight: One Prior, Every Inverse Problem

The same unconditional diffusion model, trained once, solves inpainting, deblurring, super-resolution, and more, because the only thing that changes between tasks is the forward operator $\mathcal A$ in the data-fidelity term. The diffusion model supplies a generic image prior; the measurement supplies the constraint; DPS glues them by adding the gradient of the measurement residual, evaluated at the Tweedie estimate of the clean image, to each reverse step. This decoupling is the reason a single foundation diffusion model can serve as a universal plug-and-play prior for restoration tasks it was never trained on.

DPS's point-estimate approximation, replacing the full clean-image posterior by its mean $\hat x_0$, is its main weakness, and it ignores how measurement noise interacts with the operator. The follow-up $\Pi$GDM (pseudoinverse-guided diffusion models) sharpens the likelihood for linear operators $H$ by accounting for both the diffusion uncertainty at time $t$ (a variance $r_t^2$) and the measurement noise $\sigma_y^2$, replacing the isotropic $1/\sigma_y^2$ weighting with the noise-aware matrix $(r_t^2 H H^\top + \sigma_y^2 I)^{-1}$. This weights each measurement direction by how much of its uncertainty comes from the prior versus the sensor, which improves reconstruction quality on linear problems at the cost of needing the operator's structure $H$ explicitly rather than just a black-box $\mathcal A$.

Exercise 33.6.4: Derive the Classifier-Guided Score from Bayes Conceptual

Start from the tempered posterior $p_s(x \mid y) \propto p(x)\,p(y \mid x)^s$ of subsection 4. Take the logarithm and the gradient with respect to $x$ to derive the guided score $\nabla_x \log p_t(x \mid y) = \nabla_x \log p_t(x) + s\,\nabla_x \log p_t(y \mid x)$, stating clearly why the normalizing constant drops out. Then explain what is special about $s = 1$ (it recovers the exact Bayesian posterior) and describe, in one or two sentences each, what happens to the sampled distribution as $s$ grows beyond 1 and as $s$ falls below 1. Finally, convert your score to the $\epsilon$-prediction form using the score-noise identity from Section 33.3 and confirm the $-\sqrt{1 - \bar\alpha_t}$ factor on the classifier term.

Exercise 33.6.5: Show CFG Needs No Separate Classifier Analysis

Using the score identity $\epsilon \propto -\nabla \log p$, rewrite the implementation-form CFG update $\tilde\epsilon = \epsilon_\varnothing + s(\epsilon_c - \epsilon_\varnothing)$ as a statement about scores, and show that the difference $\nabla_x \log p(x \mid c) - \nabla_x \log p(x)$ equals $\nabla_x \log p(c \mid x)$ up to an additive constant by applying Bayes' rule. Conclude that classifier-free guidance is classifier guidance with the explicit classifier gradient replaced by the difference of the conditional and unconditional predictions the single network already produces, so no separate classifier is ever trained. As a check, verify algebraically that $s = 1 + w$ converts the implementation form to the Ho and Salimans form $\tilde\epsilon = (1+w)\epsilon_c - w\,\epsilon_\varnothing$.

Exercise 33.6.6: Implement DPS for Linear Inpainting on Toy Data Coding

Take a small pretrained (or quickly trained) unconditional diffusion model on toy data, for example the 2D points or small images from earlier sections. Define a linear inpainting operator $\mathcal A$ as a binary mask that keeps a subset of coordinates or pixels and zeros the rest, and synthesize a measurement $y = \mathcal A(x_0) + n$ with small $\sigma_y$. Implement the DPS algorithm of subsection 6: at each reverse step form the Tweedie estimate $\hat x_0$ from the network output, take one ordinary reverse step, then backpropagate $\|y - \mathcal A(\hat x_0)\|^2$ through the network and apply the correction $x_{i-1} \leftarrow x_{i-1}' - \zeta_i\,\nabla_{x_i}\|y - \mathcal A(\hat x_0)\|^2$. Compare the reconstruction against the masked input and against an unconditional sample, sweep the step size $\zeta_i$, and report how reconstruction fidelity to the observed coordinates trades off against plausibility of the inpainted ones.

Bibliography Advanced

Dhariwal, P., Nichol, A. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS (2021). arXiv:2105.05233
The origin of classifier guidance derived in subsection 4. It trained a noise-aware classifier and added a scaled multiple of its gradient to the diffusion score, sharpening the conditional distribution and letting diffusion models surpass GANs on conditional ImageNet for the first time.
Ho, J., Salimans, T. "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications (2021). arXiv:2207.12598 (posted 2022)
The source of the classifier-free method of subsections 3 and 5. By jointly training one network on the conditional and (with dropout) unconditional objectives, it replaced the external classifier gradient with the difference of the network's own two predictions, the form every modern text-to-image system uses.
Chung, H., Kim, J., McCann, M. T., Klasky, M. L., Ye, J. C. "Diffusion Posterior Sampling for General Noisy Inverse Problems." ICLR (2023). arXiv:2209.14687
The DPS method of subsection 6. It approximated the intractable measurement likelihood $p_t(y \mid x_t)$ by evaluating it at the Tweedie posterior mean $\hat x_0(x_t)$, turning any pretrained unconditional diffusion model into a plug-and-play prior for general noisy inverse problems with a differentiable forward operator.
Song, J., Vahdat, A., Mardani, M., Kautz, J. "Pseudoinverse-Guided Diffusion Models for Inverse Problems." ICLR (2023). arXiv:2210.06164
The $\Pi$GDM refinement noted at the end of subsection 6. For linear forward operators $H$ it replaces the isotropic likelihood weighting with the measurement-noise-aware matrix $(r_t^2 H H^\top + \sigma_y^2 I)^{-1}$, accounting for both diffusion-time and sensor uncertainty to improve reconstruction on linear inverse problems.