Part IV: Generative Vision Models
Chapter 30: Foundations of Generative Modeling

Evaluating Generators: A First Look

"They could not agree whether my pictures were good, so they asked a third network what it thought, summarized its opinion as one number, and have been arguing about that number ever since. I find this more dignified than asking me. I would only have said they were all my favorites."

A Generator Awaiting Its Frechet Distance
Big Picture

When a model invents images, there is no ground-truth picture to compare against, so the pixel metrics from the start of the book do not apply. The field's answer is to compare distributions in the feature space of a pretrained network: measure how close the cloud of generated images is to the cloud of real images, and decompose that closeness into fidelity and coverage. This section explains why PSNR and SSIM cannot score generation, builds the intuition behind the two classic metrics, Inception Score and Frechet Inception Distance, introduces precision and recall as the distribution-level analogues of the trilemma's fidelity and diversity axes, and is honest about what every automatic metric still misses. It is the on-ramp to the full evaluation treatment in Chapter 37; here we establish the vocabulary and the warnings.

The trilemma of Section 30.5 told you what to care about: fidelity, diversity, and speed. Speed is easy to measure (count the network passes), while fidelity and diversity are not. This final section of the chapter tackles the measurement problem for those two. We start by ruling out the obvious tools, explain the feature-space move that makes generative evaluation possible at all, walk through the metrics you will see in every paper of the part, and close on their genuine limitations. The metrics arc of the book, PSNR and SSIM in Chapter 1, IoU and mAP in Chapter 23, reaches its generative form here: distribution distances in feature space.

1. Why Pixel Metrics Fail for Generation (Beginner)

PSNR and SSIM, the metrics you learned in Chapter 1, both compare a candidate image against a specific reference image. That works for restoration, where you denoise or super-resolve a known target and ask how close you got. It is meaningless for generation, because a generated image has no reference: a model asked to produce "a face" can produce any of billions of valid faces, none of which is the right answer. There is no target to subtract. Even worse, a per-pixel comparison would punish a perfectly plausible generated face for not matching some arbitrary real face pixel for pixel, while rewarding a blurry average of many faces (which minimizes pixel error but looks terrible). The mismatch is fundamental, as the snippet below illustrates: two equally valid generated samples have a large pixel distance from each other, so pixel distance cannot be the score.

# Show why pixel metrics cannot score generation: two equally valid faces are far
# apart in pixel space (terrible PSNR), and the blurry average of many faces wins
# on pixel error while looking like no real sample at all.
import torch

# Two equally valid "generated faces" (here, two different real samples stand in).
face_a = torch.rand(3, 64, 64)
face_b = torch.rand(3, 64, 64)            # an entirely different but equally valid face

# Pixel metrics demand a reference and punish any difference from it.
mse = ((face_a - face_b) ** 2).mean()
psnr = 10 * torch.log10(1.0 / mse)
print(f"PSNR between two valid faces: {psnr:.1f} dB")   # PSNR between two valid faces: ~7-8 dB (terrible)

# The blurry AVERAGE of many faces minimizes pixel error but is not a face at all.
mean_face = torch.stack([torch.rand(3, 64, 64) for _ in range(50)]).mean(0)
print("mean-face std (lower = blurrier):", round(mean_face.std().item(), 3))
# A low-variance, washed-out image "wins" on pixel error while looking nothing like a sample.
Code Fragment 1: Why pixel metrics break for generation. The psnr between face_a and face_b is terrible (around 7 to 8 dB) because there is no shared reference, and the low-variance mean_face, which no one would accept as a sample, actually minimizes pixel error. Generation must be scored at the level of the distribution, not the individual pixel.
Fun Note

The blurry mean-of-fifty-faces wins on pixel error and would lose any beauty contest ever held. This is the metric equivalent of answering "what is the average of all songs?" with thirty seconds of beige hum: technically minimal in distance to everything, recognizable as nothing. Pixel error is happy. Everyone else has left the room.
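The claim that the blurry average "wins" on pixel error can be checked directly. The sketch below is a minimal illustration, not part of the chapter's numbered fragments: it uses NumPy instead of torch, random arrays stand in for faces (an assumption carried over from the fragment above), and `avg_mse` is a hypothetical helper name. It scores one sharp sample and the average image against held-out samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "faces": random images, as in the fragment above (an assumption, not real data).
faces = rng.random((200, 3, 16, 16))
train, held_out = faces[:100], faces[100:]

def avg_mse(candidate, references):
    """Mean squared pixel error of one candidate image against a set of references."""
    return float(((references - candidate) ** 2).mean())

mean_face = train.mean(axis=0)   # the blurry average of 100 samples
single_face = train[0]           # one sharp, equally valid sample

mse_mean = avg_mse(mean_face, held_out)
mse_single = avg_mse(single_face, held_out)
print(f"average image vs held-out: {mse_mean:.3f}")
print(f"single sample vs held-out: {mse_single:.3f}")
```

For independent uniform pixels the single sample lands near 1/6 and the average image near 1/12, so the washed-out average earns about half the error of any genuine sample. A metric that prefers it is measuring the wrong thing.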

2. The Feature-Space Move (Intermediate)

The breakthrough that made generative evaluation tractable is to stop comparing individual images and start comparing distributions, in the feature space of a pretrained classifier rather than in raw pixels. Run every real image and every generated image through a fixed network (historically Inception-v3 trained on ImageNet, the same backbone family from Chapter 20) and collect the high-level feature vectors from a late layer. These features capture semantic content, what is in the image, while discarding the pixel-exact arrangement that pixel metrics fixate on. Now you have two clouds of feature vectors, one from real images and one from generated images, and a good generator is one whose cloud sits on top of the real cloud. Figure 30.6.1 shows the idea: evaluation becomes a question about the overlap of two point clouds in feature space.

[Figure 30.6.1 diagram: real and generated images pass through a frozen feature network into a shared feature space, where the two clouds are compared. Green = real features, orange = generated features; overlap indicates a good generator.]
Figure 30.6.1: Generative evaluation in feature space. Both real and generated images pass through a frozen pretrained feature network; the result is two clouds of feature vectors. A good generator's cloud (orange) sits on top of the real cloud (green). Every classic metric in this section is a different way of quantifying how well those two clouds match, in mean and covariance (FID), in confidence and variety (Inception Score), or in per-point overlap (precision and recall).

3. Inception Score and Frechet Inception Distance (Intermediate)

The Inception Score (IS), the first widely adopted metric, uses only the generated images. It runs each generated image through the Inception classifier and rewards two properties at once: each image should be confidently classified (sharp, recognizable content gives a peaked label distribution $p(y \mid \mathbf{x})$), and the set of images should span many classes (high diversity gives a spread-out marginal $p(y)$).

It combines those two wishes into one number by taking the KL divergence between each image's label distribution and the marginal, then exponentiating. The KL divergence (the Kullback-Leibler divergence) is a standard asymmetric measure of how much one probability distribution differs from another, zero only when they are identical. That single divergence captures both wishes at once: it is large only when each image's label distribution is peaked (confident) yet the average over all images is spread across classes (varied), so a confident-but-monotonous generator and a varied-but-blurry one both score low. Its fatal flaw is that it never looks at the real data at all, so it cannot detect that your faces look like ImageNet dogs, only that they are confident and varied.
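The recipe just described (per-image KL against the marginal, averaged, then exponentiated) fits in a few lines. The sketch below is an illustration under stated assumptions: the function name `inception_score` and the toy probability tables are hypothetical, and real use would feed in the softmax outputs of the Inception classifier instead of hand-built arrays.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp(mean over images of KL(p(y|x) || p(y))).

    probs: array of shape (n_images, n_classes), each row a label distribution p(y|x).
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)                      # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

n_classes = 10
# Confident AND varied: one-hot predictions cycling through every class.
confident_varied = np.eye(n_classes)[np.arange(1000) % n_classes]
# Confident but monotonous: every image is classified as class 0.
confident_monotone = np.eye(n_classes)[np.zeros(1000, dtype=int)]
# Varied but blurry: every image gets a flat, uncertain label distribution.
varied_blurry = np.full((1000, n_classes), 1.0 / n_classes)

is_good = inception_score(confident_varied)
is_monotone = inception_score(confident_monotone)
is_blurry = inception_score(varied_blurry)
print(f"confident and varied:     {is_good:.2f}")      # 10.00
print(f"confident but monotonous: {is_monotone:.2f}")  # 1.00
print(f"varied but blurry:        {is_blurry:.2f}")    # 1.00
```

The score tops out at the number of classes and bottoms out at 1, and both failure cases collapse to the minimum. Note that no real image appears anywhere in the computation, which is the flaw the paragraph above points out.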

The Frechet Inception Distance (FID) fixed that by comparing generated features to real features directly. It models each cloud of feature vectors as a multivariate Gaussian, estimates the mean and covariance of the real features $(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and of the generated features $(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, and computes the Frechet distance between those two Gaussians. The Frechet distance here is the Wasserstein-2 distance, an optimal-transport measure of how far apart two distributions are, intuitively the minimum total cost to reshape one cloud of probability mass into the other, and for two Gaussians it has the closed form below:

$$\mathrm{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \operatorname{tr}\!\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2\big(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g\big)^{1/2}\right).$$

Lower is better; FID is zero only when the two feature distributions have identical mean and covariance. The first term penalizes a shift in average content; the second penalizes a mismatch in spread and correlation, which is sensitive to both blur (collapses covariance) and mode collapse (shrinks the generated cloud). FID has been the default generative-image metric since 2017, and you will see it in nearly every paper in the rest of the part. The code below computes it from two sets of feature vectors, exactly the computation behind the libraries.

# Compute FID directly from two clouds of feature vectors: model each as a Gaussian
# and take the Frechet (Wasserstein-2) distance between them. The mean term catches a
# content shift; the covariance term catches a mismatch in spread and correlation.
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_gen):
    """FID between two arrays of feature vectors, shape (n, d) each."""
    mu_r, mu_g = feat_real.mean(0), feat_gen.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    diff = mu_r - mu_g
    covmean = linalg.sqrtm(cov_r @ cov_g)              # matrix square root of the product
    if np.iscomplexobj(covmean):                       # numerical cleanup
        covmean = covmean.real
    return diff @ diff + np.trace(cov_r + cov_g - 2 * covmean)

# Toy demonstration: closer clouds -> lower FID.
rng = np.random.default_rng(0)
real = rng.normal(0, 1, size=(2000, 64))
close = rng.normal(0.1, 1, size=(2000, 64))           # slightly shifted: small FID
far   = rng.normal(2.0, 1.5, size=(2000, 64))         # shifted and wider: large FID
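print("FID (close):", round(frechet_distance(real, close), 2))   # small: the mean shift gives 64 * 0.1^2 = 0.64, plus finite-sample covariance noise
print("FID (far):  ", round(frechet_distance(real, far), 2))     # large, about 270: 64 * 2^2 = 256 from the mean plus 64 * (1.5 - 1)^2 = 16 from the spread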
Code Fragment 2: Computing FID from two clouds of feature vectors. In frechet_distance, the diff @ diff term captures a shift in average content while the trace term, built from the linalg.sqrtm matrix square root, captures mismatch in spread and correlation. The close cloud scores near zero against real; the shifted, wider far cloud scores high. In practice the features come from Inception-v3, not random Gaussians, but the arithmetic is exactly this.
Common Misconception: FID Does Not Score an Image, and Lower Is Not Automatically Prettier

FID is routinely described as measuring "image quality", which invites two wrong beliefs. The first is that FID scores an individual image: it does not, and cannot. Look at the formula: it consumes the mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ of a whole set of feature vectors, so FID is undefined for a single sample and only ever compares a cloud of generated images against a cloud of real ones. You cannot ask "what is the FID of this face?". The second wrong belief is that a lower FID always means more beautiful samples. Because FID folds fidelity and coverage into one scalar, a model can lower its FID purely by covering more modes (broader spread) even as its individual samples get blurrier, or vice versa, which is exactly the ambiguity the Key Insight of Section 30.5 warned about. A blurry-but-diverse generator can beat a sharp-but-mode-collapsed one on FID. This is why the next section splits the score into precision and recall, and why you should never read a single FID number as a verdict on how good the pictures look.

4. Precision and Recall: Splitting Fidelity From Coverage (Advanced)

FID is a single number, which means it inherits the ambiguity of Section 30.5: a model can earn a mediocre FID either by producing low-fidelity samples or by missing modes, and FID alone cannot tell you which. The fix is to borrow precision and recall from classification and lift them to distributions. Precision is the fraction of generated samples that fall within the support of the real data, the fidelity axis: are your samples realistic? Recall is the fraction of real data modes that the generated samples cover, the diversity axis: do you produce the full variety? A mode-collapsed GAN has high precision (its few outputs are realistic) but low recall (it misses most of the data); an over-smoothed VAE may have high recall (it spreads everywhere) but lower precision (its samples drift outside the real support). This is the distribution-level version of the fidelity-versus-coverage split, and it is precisely why the medical-imaging team in Section 30.5 needed recall, not the eye test. Figure 30.6.2 makes the two failure modes concrete: the same real support drawn twice, with a mode-collapsed generator on the left and an over-smoothed one on the right. We will formalize the $k$-nearest-neighbor manifold estimators that compute these in Chapter 37.

[Figure 30.6.2 diagram: the same real-data support shown for two generators. Left panel, mode-collapsed generator: high precision, low recall, one mode never covered. Right panel, over-smoothed generator: high recall, lower precision, some samples outside the real support.]
Figure 30.6.2: Precision and recall split the single FID number into two independent axes. Both panels share the same real-data support (the two green dashed blobs). On the left, a mode-collapsed generator places every orange sample inside one blob: each sample is realistic (high precision) but a whole mode is never produced (low recall). On the right, an over-smoothed generator reaches both modes (high recall) yet scatters some samples into empty space outside the support (lower precision). Reading the two axes separately tells you which failure you have; the single FID could not.
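To make the two panels concrete before the formal treatment, here is a minimal sketch of one common k-nearest-neighbor formulation of precision and recall. It is an assumption about the estimator's general shape, not the version Chapter 37 will formalize. The helper names (`knn_radii`, `fraction_inside`), the choice `k=3`, and the 2-D toy clouds are all illustrative; in practice the points would be feature vectors from a pretrained network. A point counts as "inside the support" of a set if it falls within the k-th-neighbor radius of at least one member of that set.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor within the same set."""
    d = pairwise_dist(points, points)
    return np.sort(d, axis=1)[:, k]          # column 0 is the zero self-distance

def fraction_inside(queries, support, k):
    """Fraction of queries lying inside at least one k-NN ball around a support point."""
    radii = knn_radii(support, k)
    d = pairwise_dist(queries, support)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real, gen, k=3):
    precision = fraction_inside(gen, real, k)    # generated samples inside the real support
    recall = fraction_inside(real, gen, k)       # real samples inside the generated support
    return precision, recall

rng = np.random.default_rng(0)
left, right = np.array([-3.0, 0.0]), np.array([3.0, 0.0])
real = np.concatenate([left + 0.5 * rng.standard_normal((300, 2)),
                       right + 0.5 * rng.standard_normal((300, 2))])
collapsed = left + 0.5 * rng.standard_normal((600, 2))   # only ever produces the left mode
smoothed = 2.5 * rng.standard_normal((600, 2))           # one wide blob over everything

p_col, r_col = precision_recall(real, collapsed)
p_smo, r_smo = precision_recall(real, smoothed)
print(f"mode-collapsed: precision={p_col:.2f} recall={r_col:.2f}")
print(f"over-smoothed:  precision={p_smo:.2f} recall={r_smo:.2f}")
```

The collapsed generator keeps its precision but loses roughly the half of the real data it never visits, while the wide blob reaches both modes at the cost of precision, matching the two panels of Figure 30.6.2.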
Key Insight: One Number Hides the Trade You Care About

The recurring lesson of this chapter, that fidelity and diversity are independent axes, dictates how to evaluate. Any single scalar (likelihood, Inception Score, FID) compresses two genuinely different properties into one and can therefore be gamed or misread: a model can trade fidelity for coverage, or the reverse, while its FID barely moves. Reporting precision and recall (or a fidelity metric alongside a coverage metric) keeps the two axes separate, which is the only way to know whether a low FID came from sharper samples or from broader coverage. When you read or write a generative-model evaluation, insist on at least two numbers, one per axis. This is the same discipline that made you report both precision and recall, never accuracy alone, back in the detection metrics of Chapter 23.

Library Shortcut: FID, IS, and KID in One Call

The from-scratch FID above omits the Inception forward pass, the preprocessing, and the numerical safeguards that real evaluation needs. The torch-fidelity library packages all of it, computing Inception Score, FID, and KID (Kernel Inception Distance, a kernel-based relative of FID) from two image folders in a single call:

# Production-grade evaluation in one call: point torch-fidelity at a folder of
# generated images and a folder of real ones. It runs the Inception forward pass,
# handles resizing and value ranges, and returns Inception Score, FID, and KID.
import torch_fidelity

metrics = torch_fidelity.calculate_metrics(
    input1="generated_images/",      # folder of generated samples
    input2="real_images/",           # folder of real reference images
    isc=True, fid=True, kid=True,    # Inception Score, FID, and KID at once
    cuda=True,
)
print(metrics["frechet_inception_distance"])   # the FID, computed with proper Inception features
Code Fragment 3: Inception Score, FID, and KID in one torch_fidelity.calculate_metrics call. Pointed at the input1 folder of generated images and the input2 folder of real ones with the isc, fid, and kid flags set, the library runs the Inception forward pass, handles resizing and value ranges, and computes all three metrics, the careful preprocessing the from-scratch frechet_distance deliberately omitted.

The library handles in one call everything the from-scratch code skipped or only sketched: loading Inception-v3, extracting the correct feature layer, resizing and rescaling the inputs, and computing the matrix square root robustly. That is roughly a hundred lines of careful code you do not want to reimplement. The catch, discussed next, is that the result is only as trustworthy as the preprocessing the library standardizes.

5. The Honest Limits of Automatic Metrics (Intermediate)

Every metric in this section is useful and every one is flawed, and a competent practitioner knows the flaws. FID and IS depend on an ImageNet-trained Inception network, so they are biased toward ImageNet-like content and can misjudge domains far from it (medical scans, satellite imagery, line art). FID is sensitive to sample size (it is biased upward at small $n$) and, notoriously, to image preprocessing: the resizing filter and the exact value range can shift reported FID enough to flip a comparison, which is why the clean-fid project (Parmar et al., 2022) exists specifically to standardize these steps. None of these feature-based metrics measures whether an image is offensive, copyright-infringing, factually consistent with a text prompt, or aesthetically pleasing to a human; for text-to-image you additionally need a text-alignment metric such as CLIPScore (built on the CLIP embeddings from Chapter 25), and for the qualities no automatic metric captures you still need human evaluation. The metric is a proxy; the eye and the human study remain the ground truth, a theme Chapter 37 develops in full alongside safety and governance.
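The sample-size bias is easy to see on toy data. The sketch below is an illustration under assumptions: it draws both feature sets from the same Gaussian, so the true FID is exactly zero, and any positive value is pure estimation bias. To stay NumPy-only it swaps `linalg.sqrtm` for an equivalent eigenvalue route to the trace of the matrix square root. The names `fid_from_features` and `fid_same_distribution`, the dimension, and the sample sizes are illustrative choices.

```python
import numpy as np

def fid_from_features(feat_a, feat_b):
    """FID between two (n, d) feature arrays, using eigenvalues for the sqrt-trace term."""
    mu_a, mu_b = feat_a.mean(0), feat_b.mean(0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    # tr((cov_a cov_b)^{1/2}) is the sum of square roots of the eigenvalues of cov_a @ cov_b,
    # which are real and nonnegative for covariance matrices (up to round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
dim = 16

def fid_same_distribution(n):
    """FID between two independent samples of size n from the SAME Gaussian."""
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=(n, dim))
    return fid_from_features(a, b)

# Average over repeats so the comparison is not a single lucky draw.
fid_small = float(np.mean([fid_same_distribution(50) for _ in range(20)]))
fid_large = float(np.mean([fid_same_distribution(5000) for _ in range(20)]))
print(f"true FID is 0; estimated with n=50:   {fid_small:.3f}")
print(f"true FID is 0; estimated with n=5000: {fid_large:.3f}")
```

Two identical distributions look measurably different when sampled thinly, and the gap shrinks as n grows. That is why FID numbers computed from different sample counts are not comparable.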

Practical Example: The Leaderboard That Lied

Who: a research team comparing their new image generator against published baselines for a paper. Situation: their model reported a markedly better FID than a strong published competitor, and they were ready to claim state of the art. Problem: a reviewer noted that the team had resized images with a different filter than the competitor's published pipeline used. Decision: rather than argue, the team recomputed every model's FID through the standardized clean-fid pipeline, identical resizing and value range for all entries, so the only thing that varied was the generator. Result: their advantage shrank dramatically and, on two of the four datasets, reversed. The original gap had been mostly an artifact of preprocessing, not of model quality. They reported the clean-fid numbers, a more honest and smaller claim, and added precision and recall so readers could see the fidelity-coverage breakdown. Lesson: a generative metric is a measurement instrument, and like any instrument it must be calibrated identically across the things it compares. An apples-to-oranges FID is worse than no FID, because it looks authoritative. Standardize the pipeline, report more than one number, and treat the metric as a proxy that human judgment must still check.

Research Frontier: Beyond Inception Features

The 2023 to 2026 frontier in generative evaluation is the move away from the aging ImageNet-Inception backbone toward richer feature spaces and better-aligned judges. FID computed on self-supervised DINOv2 features (the backbone from Chapter 25) correlates better with human judgment than Inception-FID and is less biased toward ImageNet classes (Stein et al., 2023); CMMD (Jayasumana et al., 2024) replaces the Gaussian assumption and Inception features with a CLIP-based maximum mean discrepancy that is unbiased at small sample sizes, directly addressing FID's two worst flaws. For text-to-image, the field increasingly uses vision-language models as automatic judges of prompt alignment and quality, and human-preference reward models (trained on millions of pairwise human votes) now stand in for the human study during development. The arc that began with PSNR comparing two images pixel by pixel and matured into FID comparing two feature clouds is still moving, toward features that match human perception and judges that understand the prompt, the full subject of Chapter 37.

You Could Explore: Re-rank Generators Under a Different Encoder

This frontier is unusually open to a motivated student because the tooling is public. The dgm-eval library released with Stein et al. (2023), at github.com/layer6ai-labs/dgm-eval, computes 15 generative-evaluation metrics across 8 different feature encoders (Inception, DINOv2, CLIP, MAE, and more) from one interface. Take a handful of pretrained generators, score them once with Inception-FID and once with DINOv2-FID, and check whether the ranking changes. When two encoders disagree about which generator is best, you have reproduced the paper's central claim that the metric, not only the model, decides the leaderboard, and you can ask which ranking your own eye agrees with. It is a self-contained weekend project that lands you directly on the question Chapter 37 takes up in full.

6. Unified View: One Family of Generative Models (Advanced)

We have spent this chapter naming five families, VAEs, energy-based models, score-based models, diffusion models, and normalizing flows, plus GANs and autoregressive models, and treating them as separate boxes on a map. That separation is how the field is taught and how the rest of Part IV is organized, but it hides something a graduate student should see plainly: most of these families are the same idea wearing different clothes. They all solve one problem, and the deep derivations spread across this chapter and the next three (energy and score matching in Section 30.4, the VAE evidence lower bound in Section 31.3, the diffusion training bound in Section 33.2, the score SDEs and probability-flow ODE in Section 33.3, and flow matching in Section 33.5) turn out to be three views, variational, score, and flow, of one object. This section is the synthesis. It will not introduce new machinery; it will show that the machinery you are about to learn, chapter by chapter, all fits in one frame. Once you see the frame, choosing a family for a real application stops being memorization and becomes reasoning.

6.1 The Common Goal: Transport Noise to Data

Strip away the names and every family in this chapter is trying to do the same thing: take a simple distribution you can sample trivially, a unit Gaussian $p_0 = \mathcal{N}(\mathbf{0}, \mathbf{I})$, and transport it onto the complicated data distribution $p_{\text{data}}(\mathbf{x})$ you cannot write down. Sampling is always the same ritual: draw cheap noise $\mathbf{z} \sim p_0$, then push it through some learned map until it lands on the data manifold. The families differ in exactly one place: how they represent that map and how they train it.

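Hold that sentence fixed, "everyone transports noise to data, they differ in how the transport is represented and trained," and the rest of this section is just two shared lenses that make the differences precise: the variational view, and the vector-field view, which can itself be read through the score or through a flow.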

6.2 Shared Lens 1: The Variational / ELBO View

The first lens is the evidence lower bound (ELBO). A VAE cannot maximize the data likelihood $\log p_\theta(\mathbf{x})$ directly because the integral over the latent is intractable, so it introduces an inference distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ and maximizes a tractable lower bound instead (the full derivation is Section 31.3):

$$\log p_\theta(\mathbf{x}) \;\ge\; \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\big[\log p_\theta(\mathbf{x}\mid\mathbf{z})\big] \;-\; D_{\mathrm{KL}}\!\big(q_\phi(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big).$$

The first term rewards reconstruction; the second pulls the inference posterior toward the prior. Here is the synthesis that surprises most students the first time they see it: a diffusion model is a deeply hierarchical VAE with a fixed inference chain. Take the single latent $\mathbf{z}$ and replace it with a long ladder of latents $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$, one per noise level. The "encoder" is no longer learned: it is the fixed Gaussian forward process $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$ that simply adds a little noise at each step. The "decoder" is the learned reverse process $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$. Maximizing the diffusion likelihood gives exactly the same kind of bound, now summed over all $T$ levels (derived in Section 33.2):

$$\log p_\theta(\mathbf{x}_0) \;\ge\; \mathbb{E}_q\Big[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\Big] \;-\; \sum_{t=2}^{T} \mathbb{E}_q\Big[D_{\mathrm{KL}}\!\big(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\big)\Big] - D_{\mathrm{KL}}\!\big(q(\mathbf{x}_T\mid\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big).$$

Term by term this is the VAE ELBO unrolled $T$ times: one reconstruction term at the bottom, a stack of KL terms (one per noise level) in the middle, and a prior-matching term at the top that is effectively zero and has no trainable parameters, because the forward process is fixed and designed so that $\mathbf{x}_T$ is essentially pure noise. The single architectural difference, learned encoder versus fixed Gaussian encoder, is what makes diffusion training so stable: there is no posterior to learn and no posterior collapse to fear, because the inference path is frozen by construction. A VAE and a diffusion model are not cousins; they are the same variational object at two extremes of latent depth.
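The fixed encoder and the near-zero prior-matching term can both be checked numerically. The sketch below is an illustration under stated assumptions: the linear $\beta_t$ schedule from $10^{-4}$ to $0.02$ over $T=1000$ steps is a commonly used choice picked here for concreteness (the text above does not fix a schedule), and the 1-D two-point "dataset" is a toy. It walks $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ step by step and evaluates the last KL term of the bound in closed form.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule, for illustration only
alpha_bar = np.cumprod(1.0 - betas)    # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.choice([-1.0, 1.0], size=20000)   # toy 1-D "data": two point masses

# The fixed "encoder": apply q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t) T times.
x = x0.copy()
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# Prior-matching term: KL(q(x_T | x_0) || N(0, 1)) for a 1-D Gaussian with
# mean sqrt(abar_T) * x0 and variance 1 - abar_T, via KL = 0.5 (mu^2 + var - 1 - log var).
abar_T = alpha_bar[-1]
kl_prior = 0.5 * (abar_T * x0**2 + (1.0 - abar_T) - 1.0 - np.log(1.0 - abar_T))

print(f"alpha_bar_T = {abar_T:.2e}")
print(f"x_T mean = {x.mean():.3f}, std = {x.std():.3f}")
print(f"largest prior-matching KL over the data = {kl_prior.max():.2e}")
```

Nothing in the loop is learned, which is the point: the inference chain is frozen, the end of the ladder is statistically indistinguishable from unit Gaussian noise, and the top KL term contributes essentially nothing to the bound.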

6.3 Shared Lens 2: The Score / Vector-Field View

The second lens looks at the transport not as a likelihood bound but as a vector field that flows noise to data over time. Define a family of distributions $p_t(\mathbf{x})$ that interpolates from data at $t=0$ to noise at $t=1$ (the forward diffusion does exactly this). Three families all learn a time-indexed vector field over this interpolation; they just learn different fields.

Score-based and diffusion models learn the score, the gradient of the log density of the noised data, $\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$. The score points "uphill" toward higher-density regions, so following it (with the right noise schedule) walks a noise sample back onto the data manifold. Denoising score matching, introduced in Section 30.4, shows that learning this score is equivalent to learning to denoise: the optimal denoiser and the score are the same object up to a known scale. That is the bridge to the diffusion loss. DDPM trains a network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ to predict the Gaussian noise that was added, and noise prediction and score prediction are related by one exact equation:

$$\mathbf{s}_\theta(\mathbf{x}_t, t) \;=\; -\,\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar\alpha_t}} \;\approx\; \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t).$$

The intuition for the minus sign and the scale: on a Gaussian noising path $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$, the density of $\mathbf{x}_t$ given $\mathbf{x}_0$ is Gaussian with mean $\sqrt{\bar\alpha_t}\,\mathbf{x}_0$ and variance $1-\bar\alpha_t$. The score of a Gaussian is the negative displacement from its mean divided by its variance, and the displacement here is $\sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$, so the score is $-\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$. So predicting the noise is predicting the (scaled, negated) score; the two training targets in the literature are one target in disguise. This is why a DDPM and a score-based SDE model, derived by completely different routes in Section 33.2 and Section 33.3, produce interchangeable networks.
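The identity between noise and score can be verified without any network. The sketch below is an illustration with assumed toy values (a 5-dimensional vector, $\bar\alpha_t = 0.3$, and a hypothetical helper `log_q`). It differentiates the log density of $q(\mathbf{x}_t \mid \mathbf{x}_0)$ numerically and compares the result with $-\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$. This is the conditional score for one known $\mathbf{x}_0$, the regression target that denoising score matching averages over the data.

```python
import numpy as np

def log_q(x_t, x0, abar):
    """Log density of q(x_t | x_0) = N(sqrt(abar) x0, (1 - abar) I), up to an additive constant."""
    return -np.sum((x_t - np.sqrt(abar) * x0) ** 2) / (2.0 * (1.0 - abar))

rng = np.random.default_rng(0)
abar = 0.3                                  # an arbitrary noise level, chosen for illustration
x0 = rng.normal(size=5)                     # a toy "clean" sample
eps = rng.normal(size=5)                    # the Gaussian noise that gets added
x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

# Score by central finite differences of the log density, one coordinate at a time.
h = 1e-5
score_fd = np.array([
    (log_q(x_t + h * e, x0, abar) - log_q(x_t - h * e, x0, abar)) / (2.0 * h)
    for e in np.eye(5)
])

# Score predicted from the noise alone.
score_from_noise = -eps / np.sqrt(1.0 - abar)

print("finite-difference score:", np.round(score_fd, 4))
print("-eps / sqrt(1 - abar):  ", np.round(score_from_noise, 4))
```

The two vectors agree to numerical precision, so a network trained to output $\boldsymbol{\epsilon}$ can be turned into a score model by a single rescaling.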

The probability-flow ODE closes the loop with flows. Section 33.3 shows that the stochastic reverse diffusion has a deterministic twin, an ordinary differential equation whose marginal distributions $p_t$ match the SDE's at every time:

$$\frac{d\mathbf{x}}{dt} \;=\; \mathbf{f}(\mathbf{x}, t) \;-\; \tfrac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}).$$

This is a deterministic, invertible map from noise to data driven entirely by the learned score, which is the exact definition of a continuous normalizing flow. So diffusion, viewed through the probability-flow ODE, literally is a continuous normalizing flow, and that is why diffusion models can report exact likelihoods (by integrating the instantaneous change of variables along the ODE) even though they were trained as denoisers. The flow family and the diffusion family meet here.
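A toy case where the score is known in closed form shows the deterministic transport at work. The sketch below is an illustration under assumptions, not the derivation of Section 33.3: the data are 1-D Gaussian $\mathcal{N}(m, s^2)$, the forward SDE uses a constant noise rate `beta` so that $\mathbf{f}(\mathbf{x},t) = -\tfrac12\beta\,\mathbf{x}$ and $g(t)^2 = \beta$, the analytic score replaces a learned network, and plain Euler steps replace a proper ODE solver.

```python
import numpy as np

beta = 10.0        # constant noise rate (assumed): f(x, t) = -0.5 * beta * x, g(t)^2 = beta
m, s = 2.0, 0.5    # toy 1-D data distribution N(m, s^2)

def score(x, t):
    """Exact score of p_t for Gaussian data under this forward SDE."""
    abar = np.exp(-beta * t)
    mean_t = np.sqrt(abar) * m
    var_t = abar * s**2 + (1.0 - abar)
    return -(x - mean_t) / var_t

def pf_ode_drift(x, t):
    """Right-hand side of the probability-flow ODE: f(x, t) - 0.5 * g(t)^2 * score."""
    return -0.5 * beta * x - 0.5 * beta * score(x, t)

rng = np.random.default_rng(0)
x = rng.standard_normal(20000)     # start from unit Gaussian noise at t = 1
n_steps = 2000
dt = 1.0 / n_steps
for i in range(n_steps):
    t = 1.0 - i * dt
    x = x - dt * pf_ode_drift(x, t)   # Euler step backward in time, from t to t - dt

print(f"target:  mean={m:.3f} std={s:.3f}")
print(f"samples: mean={x.mean():.3f} std={x.std():.3f}")
```

No randomness enters after the initial noise draw: each starting point follows one fixed trajectory, and rerunning the loop forward in time would carry the samples back to where they started. That invertibility is what the continuous-normalizing-flow reading relies on.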

Flow matching learns a velocity instead of a score, but on Gaussian paths the two are the same information. Flow matching (Section 33.5) trains a velocity field $v_\theta(\mathbf{x}, t)$ to regress the velocity of a chosen probability path, typically the straight-line interpolation $\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$. For Gaussian probability paths the marginal velocity is an affine function of the score, so a flow-matching velocity and a diffusion score carry the same content and can be converted into each other; the practical difference is that the straight-line path is easier to integrate, which is why flow matching often samples in fewer steps. Score, noise, and velocity are three coordinates for the same vector field.
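For the straight-line path above, the conversion can be written out explicitly: working through the conditional expectations gives $v(\mathbf{x}, t) = -\big(\mathbf{x} + t\,\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\big)/(1-t)$ for $t < 1$. This formula is a derivation added here for illustration, not one quoted from Section 33.5. The sketch below checks it on toy 1-D Gaussian data (an assumption that makes every quantity available in closed form), comparing it with the velocity computed directly as $\mathbb{E}[\boldsymbol{\epsilon} - \mathbf{x}_0 \mid \mathbf{x}_t = \mathbf{x}]$.

```python
import numpy as np

m, s = 2.0, 0.5    # toy 1-D data x0 ~ N(m, s^2); noise eps ~ N(0, 1)

def marginal_stats(t):
    """Mean and variance of x_t = (1 - t) x0 + t eps."""
    return (1.0 - t) * m, (1.0 - t) ** 2 * s**2 + t**2

def score(x, t):
    mean_t, var_t = marginal_stats(t)
    return -(x - mean_t) / var_t

def velocity_from_score(x, t):
    """Velocity recovered from the score alone (valid for 0 <= t < 1)."""
    return -(x + t * score(x, t)) / (1.0 - t)

def velocity_direct(x, t):
    """E[eps - x0 | x_t = x] by Gaussian conditioning; eps - x0 is d x_t / dt on this path."""
    mean_t, var_t = marginal_stats(t)
    cov = t - (1.0 - t) * s**2        # Cov(eps - x0, x_t)
    return -m + cov / var_t * (x - mean_t)

xs = np.linspace(-3.0, 5.0, 9)
for t in (0.1, 0.5, 0.9):
    gap = np.max(np.abs(velocity_from_score(xs, t) - velocity_direct(xs, t)))
    print(f"t={t}: max |v_from_score - v_direct| = {gap:.2e}")
```

The two velocities coincide at every tested point and time, which is the sense in which a score network and a velocity network hold the same information on a Gaussian path.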

6.4 Energy-Based Models: The Root of the Score View

Why does the score, a gradient of a log density, sit at the center of so many families? Because it is what you get when you refuse to compute the one quantity that makes explicit densities intractable. An energy-based model (Section 30.4) writes the density as a Boltzmann distribution over a learned energy $E_\theta$:

$$p_\theta(\mathbf{x}) \;=\; \frac{e^{-E_\theta(\mathbf{x})}}{Z_\theta}, \qquad Z_\theta = \int e^{-E_\theta(\mathbf{x})}\, d\mathbf{x}.$$

The partition function $Z_\theta$ is an integral over the entire data space, hopelessly intractable for images, and it is the reason you cannot just maximize $\log p_\theta$. But take the gradient of the log density with respect to $\mathbf{x}$ and the obstacle disappears:

$$\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \;=\; -\,\nabla_{\mathbf{x}} E_\theta(\mathbf{x}),$$

because $Z_\theta$ does not depend on $\mathbf{x}$, so its gradient is zero and the partition function vanishes. The score is the energy gradient with the intractable constant removed. This is the conceptual root of the entire score view: score matching learns $-\nabla E$ without ever computing $Z$, and Langevin dynamics samples by descending the energy (equivalently, following the score uphill in log density) with injected noise. Every score-based and diffusion model is, in this sense, an energy-based model that learns the gradient field directly instead of the energy itself, dodging the partition function the same way. The EBM is the root; the score is its tractable shadow.
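Langevin sampling makes the "no partition function needed" point tangible. The sketch below is a minimal illustration under assumptions: a hand-written quadratic energy stands in for a learned $E_\theta$ (chosen so the correct answer, a Gaussian with known mean and standard deviation, is available for checking), and the step size and iteration count are arbitrary small-scale choices. The sampler only ever calls the energy gradient.

```python
import numpy as np

mu, sigma = 1.5, 0.7    # the energy below corresponds to N(mu, sigma^2); Z is never used

def grad_energy(x):
    """Gradient of E(x) = (x - mu)^2 / (2 sigma^2). The score is its negative."""
    return (x - mu) / sigma**2

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal(5000)   # arbitrary starting points, far from the target
step = 1e-3

for _ in range(3000):
    noise = rng.standard_normal(x.shape)
    # Langevin update: step down the energy gradient, then inject Gaussian noise.
    x = x - step * grad_energy(x) + np.sqrt(2.0 * step) * noise

print(f"target:  mean={mu:.3f} std={sigma:.3f}")
print(f"samples: mean={x.mean():.3f} std={x.std():.3f}")
```

Adding any constant to the energy would leave `grad_energy`, and therefore every sample, unchanged. That is the same invariance that lets score matching ignore $Z_\theta$.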

6.5 GANs: The Odd One Out

Every family above is organized around a density: a likelihood bound, an exact change of variables, or a score that is the gradient of a log density. GANs (Chapter 32) abandon density entirely. A GAN never writes down, bounds, or differentiates $p_\theta(\mathbf{x})$. Instead it trains a generator against a learned discriminator in a minimax game, and at the optimum the generator's implicit distribution matches the data because the discriminator can no longer tell them apart. There is no ELBO, no score, no tractable likelihood, and no way to ask "what is the probability of this image" under a GAN. That is the trade: by replacing the explicit density with an adversarial signal, GANs sidestep every intractability above and can produce extremely sharp samples in a single forward pass, but they give up likelihood, they offer no built-in coverage guarantee (hence mode collapse), and the game can be unstable to train. The GAN is the one family that buys sample quality and speed by walking away from the probabilistic frame the others share.

6.6 The Comparison, In One Table

The table below places the families side by side along the axes that actually drive a design choice: what objective trains them, what object they learn, whether they give an exact likelihood, how sampling works, and where each shines.

| Family | Training objective | What is learned | Exact likelihood? | Sampling | Typical use |
|---|---|---|---|---|---|
| VAE | ELBO (variational lower bound) | Encoder $q_\phi$ + decoder $p_\theta$ over a latent | No (lower bound only) | One latent draw, one decoder pass (fast) | Structured latent space, representation learning, fast rough samples |
| Normalizing flow | Exact log-likelihood (change of variables) | Invertible map $f_\theta$ with tractable Jacobian | Yes (exact) | One pass through the inverse map (fast) | Density estimation, exact likelihood, invertible encodings |
| Energy-based model | Score matching / contrastive divergence | Energy $E_\theta$ (score $=-\nabla E$) | No (unknown $Z_\theta$) | Iterative MCMC / Langevin (slow) | Flexible unnormalized densities, composing constraints |
| Score-based / diffusion | Denoising score matching / ELBO on noise levels | Score $\mathbf{s}_\theta$ (equiv. noise $\boldsymbol{\epsilon}_\theta$) at every $t$ | Yes via probability-flow ODE | Many denoising steps (slow, fewer with solvers) | State-of-the-art fidelity and coverage; text-to-image |
| Flow matching | Regress a probability-path velocity | Velocity field $v_\theta$ on a (straight) path | Yes via the ODE | Integrate an ODE (few steps) | High quality with faster sampling than DDPM |
| GAN | Adversarial minimax game | Generator + discriminator (no density) | No (implicit density) | One generator pass (very fast) | Sharp single-step samples; real-time generation |
| Autoregressive | Exact log-likelihood (factorized) | Conditionals $p_\theta(x_i \mid x_{<i})$ | Yes (exact) | Sequential, one element at a time (slow) | Discrete data, exact likelihood, strong text-to-image with tokens |

Key Insight: Score, ELBO, and Flow Are Three Views of One Transport

Do not memorize seven unrelated algorithms. Memorize one transport problem, move noise to data, and three lenses on it. The ELBO lens sees the transport as a likelihood bound, and through it a diffusion model is just a deeply hierarchical VAE with a fixed Gaussian inference chain. The score lens sees the transport as a vector field $\nabla_{\mathbf{x}} \log p_t$, and through it diffusion, score matching, and (on Gaussian paths) flow matching are learning the same field, with noise prediction $\boldsymbol{\epsilon}_\theta$, score $\mathbf{s}_\theta = -\boldsymbol{\epsilon}_\theta/\sqrt{1-\bar\alpha_t}$, and velocity $v_\theta$ as three coordinates for it. The flow lens sees the transport as a deterministic ODE, and through the probability-flow ODE diffusion literally is a continuous normalizing flow, which is why a model trained as a denoiser can still report an exact likelihood. Underneath all three sits the energy-based model: the score is just $-\nabla E$ with the partition function removed. GANs are the lone exception, the one family that drops the density and learns the transport through a game instead.
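
The "coordinates" claim for the first two of those quantities can be checked numerically. The sketch below assumes the DDPM forward process $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$ and an arbitrary value $\bar\alpha_t = 0.7$; the function names are illustrative. It converts the true noise into a score with the formula above and compares it with the score of the Gaussian transition $q(\mathbf{x}_t \mid \mathbf{x}_0)$ computed directly from its mean and variance.

```python
import numpy as np

def eps_to_score(eps, alpha_bar_t):
    # Change of coordinates: noise -> score, as in the text.
    return -eps / np.sqrt(1.0 - alpha_bar_t)

def gaussian_transition_score(x_t, x_0, alpha_bar_t):
    # Gradient of log N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
    # with respect to x_t.
    return -(x_t - np.sqrt(alpha_bar_t) * x_0) / (1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.7                       # assumed noise level for the demo
x_0 = rng.normal(size=(4, 3))           # stand-in "clean data"
eps = rng.normal(size=(4, 3))           # the noise that was actually added
x_t = np.sqrt(alpha_bar_t) * x_0 + np.sqrt(1.0 - alpha_bar_t) * eps

score_from_eps = eps_to_score(eps, alpha_bar_t)
score_exact = gaussian_transition_score(x_t, x_0, alpha_bar_t)
print("max abs difference:", np.abs(score_from_eps - score_exact).max())
```

The velocity coordinate is left out on purpose, because its conversion depends on the path convention chosen for flow matching.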

6.7 Choosing a Family: A Selection Guide

The unified view is not only elegant; it tells you how to choose. The decision reduces to four questions, each tracing back to the quality-diversity-speed trilemma of Section 30.5 plus one new axis: do you need the likelihood number itself?

Read against the trilemma, the pattern is clean: flows and autoregressive models buy likelihood; diffusion buys fidelity and coverage by spending speed; flow matching rebalances that spend toward speed; GANs buy speed and sharpness by spending coverage and training stability; VAEs buy a latent space and speed by spending fidelity. No family wins on every axis, which is exactly the trilemma's promise, and the unified view tells you which axis each family chose to sacrifice and why.

Note: The Map Has Borders, Not Walls

The boundaries in the table are softer than they look, which is itself a consequence of the unified view. Latent diffusion runs a diffusion model inside a VAE's latent space, combining a VAE encoder with a diffusion transport. Consistency models distill a multi-step diffusion ODE into a one-step generator, importing GAN-like speed into the diffusion family. Diffusion-GAN hybrids add an adversarial term to a diffusion loss. Because the families share one transport problem and a small set of lenses, they recombine freely; the modern frontier of Chapter 33 and beyond is mostly built from these recombinations rather than from brand-new families.

Exercise 30.6.4: DDPM Loss as Weighted Denoising Score Matching Analysis

The DDPM training objective minimizes $\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}, t}\big[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2\big]$, where $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$. Using the relation $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)/\sqrt{1-\bar\alpha_t}$ from Section 6.3, show that this loss is, up to a per-timestep weight, a denoising-score-matching objective $\mathbb{E}\big[\lambda(t)\,\|\mathbf{s}_\theta(\mathbf{x}_t,t) - \nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t\mid\mathbf{x}_0)\|^2\big]$. Two steps suffice: (a) write the score of the Gaussian transition $q(\mathbf{x}_t \mid \mathbf{x}_0)$ in terms of $\boldsymbol{\epsilon}$ and show that it equals $-\boldsymbol{\epsilon}/\sqrt{1-\bar\alpha_t}$; (b) substitute both scores into the score-matching loss and read off the weight $\lambda(t)$ that turns it back into the noise-prediction loss. State the value of $\lambda(t)$ that DDPM implicitly uses by weighting every timestep of the noise-prediction loss equally (all weights set to one), and explain in one sentence why a different $\lambda(t)$ corresponds to a different (likelihood-weighted) objective.

Exercise 30.6.5: Pick a Family, Justify the Choice Conceptual

For each of the three applications below, choose one generative family from the table in Section 6.6 and justify it in two or three sentences, naming the axes of the trilemma (Section 30.5) and the likelihood axis you are trading. (1) An on-device photo filter that must stylize a live camera feed at 30 frames per second on a phone. (2) A scientific anomaly detector for telescope images that must flag inputs with unusually low probability under the learned distribution. (3) A text-to-image system for a design tool where users wait a few seconds per image but demand the sharpest, most varied results. For each, also name one family you would not choose and state the specific axis on which it would fail the requirement.

7. Closing the Foundations Beginner

This section completes the foundations of the part. You can now define what it means to model $p(\mathbf{x})$, name and place the seven families that do it, work in a latent space, reason about energies and scores, judge any generator on the quality-diversity-speed triangle, and measure two of those three axes with feature-space metrics while knowing exactly where those metrics lie to you. The unified view of the previous section gave you the deeper payoff: those families are not seven disconnected algorithms but one transport problem seen through the ELBO, score, and flow lenses, with the energy-based model at the root and the GAN as the lone density-free exception. Every remaining chapter of Part IV is the detailed construction of one box on the map, and you now carry the vocabulary, the warnings, and the unifying frame to read each one critically. The next chapter takes the very first box, the latent-variable model of Section 30.3, and turns it into a working, trainable system: the autoencoder and the variational autoencoder.

Exercise 30.6.1: Why Not PSNR? Conceptual

In one paragraph, explain to a colleague who knows PSNR and SSIM (from Chapter 1) but not generation why those metrics cannot score an unconditional image generator. Address both halves of the problem: the absence of a reference image, and the way a per-pixel metric rewards blurry averages. Then state in one sentence what the feature-space metrics replace the reference with.

Exercise 30.6.2: Compute FID on a Real Dataset Coding

Using torch-fidelity (or the from-scratch frechet_distance with your own Inception feature extractor), compute FID in three settings on a small dataset such as CIFAR-10: (a) real-versus-real using two disjoint halves of the real data (this is the achievable floor; it will be small but nonzero), (b) real-versus-real with one half corrupted by Gaussian blur, and (c) real-versus-noise using random images. Report the three numbers and write two sentences explaining their ordering and what the nonzero floor in (a) tells you about FID's sample-size sensitivity.
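
Before spending compute on Inception features, the shape of this experiment can be rehearsed on synthetic data. The sketch below is a warm-up under stated assumptions: random 8-dimensional Gaussian vectors stand in for Inception features, and `frechet_distance_sketch` is a hypothetical NumPy-only helper, not the chapter's `frechet_distance`. It uses the fact that the trace of the matrix square root of a product of two covariance matrices equals the sum of the square roots of that product's eigenvalues.

```python
import numpy as np

def frechet_distance_sketch(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Eigenvalues of cov_a @ cov_b are real and non-negative in exact
    # arithmetic; clip guards against tiny negative round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
dim = 8  # assumed toy feature dimension (real Inception features are much larger)

def real_vs_real(n):
    # Two disjoint halves drawn from the same distribution, as in setting (a).
    feats = rng.normal(size=(2 * n, dim))
    return frechet_distance_sketch(feats[:n], feats[n:])

floor_small = real_vs_real(200)
floor_large = real_vs_real(5000)
shifted = frechet_distance_sketch(
    rng.normal(size=(5000, dim)),
    rng.normal(loc=1.0, size=(5000, dim)),  # a genuinely different distribution
)

print("floor, n=200: ", floor_small)
print("floor, n=5000:", floor_large)
print("shifted mean: ", shifted)
```

Comparing the two floor values is a useful preview of what to look for when you run setting (a) on real features.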

Exercise 30.6.3: Diagnose From the Numbers Analysis

Three generators report the following: Generator X has low FID, high precision, low recall; Generator Y has low FID, lower precision, high recall; Generator Z has high FID but high precision and high recall. Using the definitions from Sections 3 and 4 and the trilemma from Section 30.5, describe in words what each generator's samples most likely look like as a set, which one you would choose for synthetic data augmentation and which for a single hero image, and explain how Generator Z can have high FID despite good precision and recall (hint: consider what a feature-space mean-and-covariance shift can capture that per-point support overlap cannot).