Chapter 30: Foundations of Generative Modeling

"For thirty chapters they asked me what is in this picture. Then one morning they handed me a blank canvas and asked me to imagine a picture that has never existed but could have. I have not slept since. Neither, it turns out, has the distribution of all natural images."
A Latent Vector Looking for Meaning

Big Picture

Everything before this chapter taught a model to answer a question about an image that already exists. Generative modeling reverses the arrow: it asks the model to learn the probability distribution of natural images so well that it can draw new samples from it. That single shift, from estimating a label given an image to estimating the image itself, is the conceptual hinge of the entire fourth part of this book. Once you can model $p(\mathbf{x})$ over images, you can sample novel pictures, fill in missing regions, denoise, super-resolve, edit by manipulating a latent code, and compress. This chapter is the map and the vocabulary. It defines what $p(\mathbf{x})$ even means for a million-dimensional pixel space, it lays out the five families of models that attack the problem differently, it introduces the idea of a latent space that organizes that high-dimensional chaos, it develops the energy and score view that unifies several families, and it confronts the trilemma that no single generator escapes: you can rarely have high quality, high diversity, and fast sampling all at once. By the end you will be able to read any generative paper and place it on the map.

Chapter Overview

For the first three parts of this book a model was a function from images to answers. A classifier from Chapter 20 mapped a photo to a label; a detector from Chapter 23 mapped it to boxes; a depth network from Chapter 27 mapped it to a range map. In every case the image was the given, the input we conditioned on. This chapter is where the image becomes the unknown. We want a model that has internalized what natural images look like, so thoroughly that it can produce a new one on demand: a face that belongs to no one, a street scene that was never photographed, a texture that tiles seamlessly. The technical name for that internalized knowledge is the probability distribution over images, written $p(\mathbf{x})$, and learning it is the project of generative modeling.

The chapter opens by making the distinction precise. Section 30.1 contrasts discriminative models, which learn $p(y \mid \mathbf{x})$, with generative models, which learn $p(\mathbf{x})$ or $p(\mathbf{x}, y)$, and confronts the staggering scale of the problem: a modest color image lives in a space of hundreds of thousands of dimensions, and almost every point in that space looks like noise rather than a picture. The set of plausible images is a vanishingly thin manifold inside that space, and a generative model is a machine for describing where that manifold lies and how probability mass spreads across it.
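To make the scale concrete, here is a small back-of-the-envelope sketch. It assumes a 224 x 224 RGB image with 8 bits per channel, a common classifier input size that this overview does not itself fix, and counts both the pixel dimensions and the number of distinct images such a grid can hold.

```python
import math

# Assumed image format (an illustrative choice, not fixed by the chapter):
# 224 x 224 pixels, 3 color channels, 256 intensity levels per channel.
height, width, channels = 224, 224, 3
levels = 256

dims = height * width * channels            # number of pixel dimensions
log10_images = dims * math.log10(levels)    # log10 of the number of distinct images

print(f"dimensions: {dims}")
print(f"distinct images: about 10^{log10_images:,.0f}")
```

Nearly all of those images are indistinguishable from static, which is why the text speaks of a thin manifold rather than a filled-in space.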

Section 30.2 is the field map. Five families have learned $p(\mathbf{x})$ in fundamentally different ways: variational autoencoders compress to a latent and decode, generative adversarial networks pit a generator against a critic, normalizing flows build an exactly invertible transformation, autoregressive models predict pixels one at a time, and diffusion models learn to reverse a gradual noising process. Each makes a different trade among tractable likelihood, sample quality, and sampling speed, and the rest of Part IV expands each box on this map into its own chapter. Section 30.3 develops the single most reused idea in the part, the latent variable: a low-dimensional code $\mathbf{z}$ whose smooth, structured space the model decodes into images, and which makes interpolation, editing, and conditioning possible.
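The latent-variable idea rests on a marginalization, $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$. A minimal sketch of what that integral computes, assuming a toy one-dimensional linear-Gaussian model (the decoder weight `a`, the noise level `s`, and the query point are hypothetical values chosen for illustration): draw many codes from the prior, average the decoder likelihood, and compare with the closed form, which for this toy case is a Gaussian with variance $a^2 + s^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Toy latent-variable model (hypothetical values):
#   prior    z ~ N(0, 1)
#   decoder  x | z ~ N(a * z, s^2)
a, s = 2.0, 0.5
x_query = 1.3

# Monte Carlo marginalization: p(x) is approximated by the average of
# p(x | z) over codes z drawn from the prior.
z = rng.standard_normal(200_000)
p_mc = normal_pdf(x_query, a * z, s).mean()

# Closed form for this linear-Gaussian case: x ~ N(0, a^2 + s^2).
p_exact = normal_pdf(x_query, 0.0, np.sqrt(a ** 2 + s ** 2))

print(f"Monte Carlo estimate: {p_mc:.4f}")
print(f"closed form:          {p_exact:.4f}")
```

In one dimension the average converges quickly. For images the same integral has no closed form and naive sampling is hopeless, which is the difficulty Chapter 31 takes up with the variational autoencoder.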

Section 30.4 takes the energy view. Instead of a normalized probability we model an unnormalized energy $E(\mathbf{x})$, learn the gradient of the log-density (the score, which equals the negative gradient of the energy) by score matching, and sample by following that gradient with noise injected, a procedure called Langevin dynamics. This is not a historical detour; it is the direct mathematical ancestor of the diffusion models in Chapter 33, and the section is written so that the score-SDE view there will feel inevitable. Section 30.5 formalizes the three things we want from any generator, fidelity, coverage, and speed, shows why they are in tension, and gives the practitioner's reading of where each family sits. Section 30.6 closes with the measurement problem: if a model invents images, how do we score them? It introduces Inception Score, the feature-distance idea behind the Fréchet Inception Distance (FID), and the honest limits of every automatic metric, setting up the full treatment in Chapter 37.
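For reference, the three objects named in the energy view can be written compactly. The notation below is a sketch of the standard definitions; the normalizer $Z$ and the step size $\alpha$ are symbols introduced here for illustration, and Section 30.4 may use different ones.

```latex
p(\mathbf{x}) = \frac{\exp\bigl(-E(\mathbf{x})\bigr)}{Z},
\qquad
\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\nabla_{\mathbf{x}} E(\mathbf{x}),
\qquad
\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\alpha}{2}\, \mathbf{s}(\mathbf{x}_t) + \sqrt{\alpha}\, \boldsymbol{\epsilon}_t,
\quad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
```

The constant $Z$ vanishes under the gradient, which is why the score can be learned without ever computing it. The last expression is the Langevin update that Step 5 of the Hands-On Lab asks you to implement.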

Two threads that run through the rest of Part IV start here. The latent-space idea you will meet in Section 30.3 is the same compression-and-decode logic that Chapter 31 turns into the variational autoencoder. And the score-and-energy machinery of Section 30.4 is, almost line for line, the gradient field that diffusion will learn in Chapter 33. This chapter plants both seeds deliberately, so that the chapters which follow read as the natural unfolding of ideas you already hold.

Prerequisites

This chapter is mostly conceptual and probabilistic, so the deepest prerequisite is comfort with probability: marginal and conditional distributions, expectation, and the idea of a density over a continuous space. You should have read Chapter 18: Neural Networks & PyTorch for Vision, because every generator here is a neural network trained by gradient descent, and the code uses PyTorch tensors and autograd. The latent-space discussion builds on the representation learning of Chapter 25, where you first saw that a learned vector can encode the meaning of an image. The energy and score section uses gradients and the chain rule from the same calculus you used to train networks, plus the change-of-variables idea that also underlies the geometric warps of Chapter 5. Finally, the evaluation section connects to the pixel-level metrics PSNR and SSIM from Chapter 1, which it argues are necessary but not sufficient for generative quality.

Chapter Roadmap

30.1 Generative vs Discriminative: What Does It Mean to Model p(x)?
The defining distinction of the part: discriminative models learn the boundary $p(y \mid \mathbf{x})$, generative models learn the data itself, $p(\mathbf{x})$. Why modeling a distribution over a 150,000-dimensional pixel space is hard, the manifold of natural images, and the three things a generative model lets you do that a classifier never could.

30.2 A Map of Generative Families: VAE, GAN, Flow, Autoregressive & Diffusion
A field guide to the five ways the community has learned $p(\mathbf{x})$: variational autoencoders, generative adversarial networks, normalizing flows, autoregressive models, and diffusion models. What each optimizes, whether it gives a tractable likelihood, and the trade among quality, diversity, and speed that locates each on the map.

30.3 Latent Variables & the Idea of a Latent Space
The most reused idea in generative vision: a low-dimensional code $\mathbf{z}$ that a decoder turns into an image. The marginalization that defines a latent-variable model, why a smooth structured latent space makes interpolation and editing possible, and the disentanglement question that the rest of the part keeps returning to.

30.4 Energy-Based Models, Score Matching & Langevin Dynamics
Modeling an unnormalized energy $E(\mathbf{x})$ instead of a normalized probability, the intractable partition function that motivates score matching, learning the score (the gradient of log-density) directly, and sampling by Langevin dynamics. The mathematical ancestor of diffusion, built deliberately so Chapter 33 feels inevitable.

30.5 Sampling, Likelihood & the Quality-Diversity-Speed Trilemma
What sampling actually computes, why likelihood and sample quality are not the same thing, and the trilemma that no generator escapes: high fidelity, full coverage of the data distribution, and fast sampling rarely come together. A practitioner's reading of where each family sits and why diffusion's speed problem launched a research wave.

30.6 Evaluating Generators: A First Look
If a model invents images, how do you score them? Why pixel metrics like PSNR fail for generation, the feature-space idea behind Inception Score and Fréchet Inception Distance, what precision and recall mean for a distribution, and the honest limits of every automatic metric. The on-ramp to the full evaluation treatment in Chapter 37.
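The feature-distance idea behind FID can be sketched in a few lines. FID fits one Gaussian to the features of real images and another to the features of generated images, then reports the Fréchet distance between the two Gaussians. The sketch below assumes the features are already extracted and substitutes random vectors for Inception features; the function name `frechet_distance` is hypothetical, and real evaluations should rely on a maintained implementation such as clean-fid or torch-fidelity from the bibliography.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative when
    # both factors are covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((5000, 8))            # stand-in for real-image features
same = rng.standard_normal((5000, 8))            # fresh draw from the same distribution
shifted = rng.standard_normal((5000, 8)) + 1.0   # every feature mean shifted by 1

d_same = frechet_distance(real, same)
d_shifted = frechet_distance(real, shifted)
print(f"same distribution:    {d_same:.3f}")
print(f"shifted distribution: {d_shifted:.3f}")
```

Two draws from the same distribution score close to zero, while the shifted set scores close to the squared mean shift summed over the eight features. The number says nothing about which individual samples are good, one of the limits Section 30.6 discusses.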

Remember the Chapter in One Card

If you carry three things out of this chapter, carry these. First, the hinge: generation moves the image from the right of the conditioning bar to the left, from $p(y \mid \mathbf{x})$ to $p(\mathbf{x})$, which is why it is the harder half of the joint. Second, the five families, recalled by the route each one takes to a sample: compress (VAE), contest (GAN), invert (flow), chain (autoregressive), denoise (diffusion). Third, the trilemma: of fidelity, diversity, and speed, you usually keep two. Hinge, five families, trilemma: that triplet is the scaffold every later chapter hangs on.

What's Next?

With the map in hand, the rest of Part IV walks the territory one family at a time. Chapter 31: Autoencoders & Variational Autoencoders is the immediate sequel and takes the latent-variable idea of Section 30.3 and turns it into a trainable model: the autoencoder that compresses and reconstructs, then the variational autoencoder that makes the latent space a proper probability distribution you can sample from. It is the cleanest place to make the abstract notion of "decode a latent into an image" concrete and runnable. From there the part follows the map: Chapter 32 builds the adversarial game, and Chapter 33 cashes in the score and Langevin machinery of Section 30.4 to build the diffusion models that dominate image generation today. The trilemma of Section 30.5 and the metrics of Section 30.6 will be your companions through every one of those chapters, the lens through which you judge each new method. Before moving on, make the whole chapter concrete in the Hands-On Lab below, where energy, score, denoising score matching, Langevin sampling, and a coverage metric come together as one small generative model you train and sample from yourself.

Hands-On Lab: A Score-Based Generator You Can Train and Sample

Duration: about 60 to 90 minutes
Difficulty: Intermediate

Objective

Build the smallest complete generative model that still uses every load-bearing idea of this chapter: take a two-dimensional dataset whose true distribution you can see, learn its score field with a tiny network trained by denoising score matching, sample new points by annealed Langevin dynamics, and then measure how well the generated cloud covers the real one. The two-dimensional data keeps every quantity plottable, so you watch $p(\mathbf{x})$, the score arrows of Section 30.4, the sampling walk, and the quality-versus-coverage tension of Section 30.5 all on one screen, with no image-sized U-Net or GPU in the way.

What You'll Practice

Treating generation as learning a distribution $p(\mathbf{x})$ over data rather than a label given data, the hinge of Section 30.1.
Training a score network with the denoising score matching objective, where the regression target is a scaling of the noise that was added (Section 30.4).
Conditioning the network on the noise level so one network covers many scales, the multi-scale idea behind annealed sampling (Section 30.4).
Turning a learned score field into samples with annealed Langevin dynamics, the score term plus calibrated noise of Section 30.4.
Reading the quality-diversity-speed trade by varying the step count and a coverage metric, the practitioner's lens of Section 30.5 and the measurement problem of Section 30.6.

Setup

One scientific-Python stack and no download; scikit-learn generates the toy dataset in memory. Everything runs on CPU in a couple of minutes. Install with:

pip install torch scikit-learn matplotlib

The whole lab is one short script. The model is a four-layer multilayer perceptron, deliberately tiny, so the focus stays on the score-and-sample loop rather than on architecture.

Steps

Step 1: Make a distribution you can see

Generate the classic two-moons dataset and standardize it. Two interleaving crescents are an honest stand-in for the manifold picture of Section 30.1: the data occupies a thin, curved region of the plane while most of the plane is empty, so a generator that ignores the shape is instantly visible.

import torch
import numpy as np
from sklearn.datasets import make_moons

def get_data(n=4000):
    x, _ = make_moons(n_samples=n, noise=0.05, random_state=0)
    x = (x - x.mean(0)) / x.std(0)                 # standardize to roughly unit scale
    return torch.tensor(x, dtype=torch.float32)

data = get_data()
# TODO: print data.shape and data.std(0) to confirm a (4000, 2) tensor
# whose two columns each have standard deviation near 1.0.

Hint

print(data.shape, data.std(0)) should report torch.Size([4000, 2]) and two numbers close to 1.0. Standardizing matters because the noise scales in Step 4 are chosen relative to unit-scale data.

Step 2: Build a noise-conditioned score network

The network takes a point and a noise level and returns a two-dimensional score vector, the estimate of $\nabla_{\mathbf{x}} \log p_\sigma(\mathbf{x})$ from Section 30.4. Feeding the noise level $\sigma$ in as an extra input is what lets a single network represent the score at every scale, the trick that makes the annealed sampling of Step 5 possible.

import torch.nn as nn

class ScoreNet(nn.Module):
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        # Input is the point (dim) plus one channel for log(sigma).
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, sigma):                    # x: (B, 2), sigma: (B, 1)
        # TODO: concatenate x with log(sigma) along dim=1 and pass through
        # self.net. Conditioning on log(sigma) (not sigma) keeps the very
        # small and very large scales numerically comparable.
        h = ...
        return self.net(h)

score_net = ScoreNet()

Hint

h = torch.cat([x, torch.log(sigma)], dim=1) then return self.net(h). Passing log(sigma) rather than sigma spreads a geometric range of scales evenly across the input, which the network learns from far more easily.
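To see why the logarithm helps, a short check (written in NumPy, using the same ten-level ladder that Step 4 defines) compares the gaps between neighbouring noise levels before and after taking the log.

```python
import numpy as np

# The geometric ladder of Step 4: ten levels from 1.0 down to 0.01.
sigmas = np.exp(np.linspace(np.log(1.0), np.log(0.01), 10))

raw_gaps = -np.diff(sigmas)            # spacing the network would see if fed sigma
log_gaps = -np.diff(np.log(sigmas))    # spacing it sees when fed log(sigma)

print("sigma gaps:     ", np.round(raw_gaps, 4))
print("log(sigma) gaps:", np.round(log_gaps, 4))
```

The raw gaps shrink by a constant factor at every rung, so the finest levels crowd together near zero, while the log gaps are all equal.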

Step 3: Write the denoising score matching loss

This is the objective of Section 30.4 made runnable. Add Gaussian noise of scale $\sigma$ to a clean point, $\mathbf{x}_\text{noisy} = \mathbf{x} + \sigma \boldsymbol{\epsilon}$; the score of the Gaussian corruption kernel at the noised point has the closed form $-(\mathbf{x}_\text{noisy} - \mathbf{x}) / \sigma^2 = -\boldsymbol{\epsilon} / \sigma$, so the network's regression target is simply a scaling of the noise that was added. Weighting the per-sample loss by $\sigma^2$ keeps the contribution of every scale on a comparable footing.

def dsm_loss(net, x, sigmas):
    sigma = sigmas[torch.randint(len(sigmas), (x.size(0),))].unsqueeze(1)  # (B,1)
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps                       # corrupt each point
    # TODO: the closed-form target score is -eps / sigma. Predict the score
    # with net(x_noisy, sigma), then return the sigma^2-weighted mean squared
    # error between prediction and target: (sigma**2 * (pred - target)**2).mean().
    target = ...
    pred = ...
    return ...

Hint

target = -eps / sigma, pred = net(x_noisy, sigma), and return (sigma ** 2 * (pred - target) ** 2).mean(). The target grows like 1 / sigma, so its squared error grows like 1 / sigma ** 2; the sigma ** 2 weight cancels that growth, so no single scale dominates training. Up to a reparameterization of the network output, this is the loss a diffusion U-Net minimizes in Chapter 33, just in two dimensions.
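The closed-form target can be checked numerically rather than taken on faith. This sketch, with arbitrary illustrative values for the clean point and the noise level, asks autograd for the gradient of the Gaussian log-density at the noised point and compares it with -eps / sigma.

```python
import torch

torch.manual_seed(0)

sigma = 0.3                                    # illustrative noise level
x_clean = torch.tensor([[0.5, -1.2]])          # illustrative clean point
eps = torch.randn_like(x_clean)
x_noisy = (x_clean + sigma * eps).requires_grad_(True)

# log N(x_noisy; x_clean, sigma^2 I), dropping the additive constant that
# does not depend on x_noisy and so does not affect the gradient.
log_density = -((x_noisy - x_clean) ** 2).sum() / (2 * sigma ** 2)
(autograd_score,) = torch.autograd.grad(log_density, x_noisy)

closed_form = -eps / sigma
print("autograd score:", autograd_score)
print("closed form:   ", closed_form)
```

The two vectors agree to floating-point precision, which confirms that the regression target needs no differentiation at training time: it is available as soon as the noise is drawn.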

Step 4: Choose a geometric ladder of noise scales and train

Pick noise levels spaced geometrically from large to small, the annealing schedule that Section 30.4 introduces. The large scales blur the two moons into one broad blob whose score is easy to learn from anywhere; the small scales sharpen them back into the true shape. Train the network to match the score at all of them at once.

sigmas = torch.exp(torch.linspace(np.log(1.0), np.log(0.01), 10))  # 10 scales, large to small
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

for step in range(3000):
    idx = torch.randint(len(data), (256,))
    batch = data[idx]
    loss = dsm_loss(score_net, batch, sigmas)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}")

Hint

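The loss should fall from its initial value of roughly 1.0 and then level off within 3000 steps on CPU. It will not approach zero: the target is the added noise, which can only be partly predicted from the noisy point, so the objective has an irreducible floor. If the loss is erratic or dominated by huge values, confirm Step 3 returns the sigma ** 2-weighted error; an unweighted loss lets the smallest scale swamp the gradient and the network never learns the broad structure.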

Step 5: Sample by annealed Langevin dynamics

Now turn the learned arrows into points, the procedure of Section 30.4. Start from broad Gaussian noise, and for each scale from large to small take several Langevin steps: move along the score, then add calibrated Gaussian noise. Walking from coarse to fine scales is what lets the sampler find both moons instead of collapsing onto one, the coverage concern of Section 30.5.

@torch.no_grad()
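def sample(net, sigmas, n=2000, steps_per_scale=20, eta=2e-5):
    x = torch.randn(n, 2)                            # start from broad noise
    for sigma in sigmas:                            # large to small
        s = torch.full((n, 1), float(sigma))
        # eta is the step size at the smallest scale. With sigmas[-1] = 0.01 this
        # gives step_size = 0.2 * sigma**2; eta=0.1 would give 1000 at sigma=1 and diverge.
        step_size = eta * (sigma / sigmas[-1]) ** 2 # bigger steps at coarse scales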
        for _ in range(steps_per_scale):
            score = net(x, s)
            # TODO: one Langevin update. Move along the score by
            # 0.5 * step_size * score, then add sqrt(step_size) * randn_like(x).
            # Drop the noise term and the cloud collapses onto the modes.
            x = ...
    return x

samples = sample(score_net, sigmas)

Hint

x = x + 0.5 * step_size * score + step_size ** 0.5 * torch.randn_like(x). Here step_size is already a zero-dimensional tensor, so wrapping it in torch.tensor is unnecessary and would raise a copy-construct warning. The score term pulls toward the moons; the noise term is what makes this sampling rather than optimization, exactly the distinction Exercise 30.4.2 asks you to feel. Remove it and every sample piles onto the nearest crescent.
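The sampling-versus-optimization distinction is easy to see when the score is known exactly. The sketch below assumes a one-dimensional target, an equal mixture of two Gaussians at -2 and +2 with standard deviation 0.5 (hypothetical values, unrelated to the two-moons data), writes down its analytic score, and runs the same update with and without the noise term.

```python
import numpy as np

rng = np.random.default_rng(0)
MEANS = np.array([-2.0, 2.0])   # hypothetical mode locations
STD = 0.5                       # hypothetical within-mode standard deviation

def score(x):
    """Analytic d/dx log p(x) for an equal-weight mixture of two Gaussians."""
    # Posterior weight of each component, computed stably in log space.
    logits = -0.5 * ((x[:, None] - MEANS) / STD) ** 2
    resp = np.exp(logits - logits.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp * (MEANS - x[:, None])).sum(axis=1) / STD ** 2

def langevin(n=2000, steps=1000, step_size=0.01, noise=True):
    x = 3.0 * rng.standard_normal(n)            # broad initial cloud
    for _ in range(steps):
        x = x + 0.5 * step_size * score(x)      # drift along the score
        if noise:
            x = x + np.sqrt(step_size) * rng.standard_normal(n)
    return x

def within_mode_std(x):
    """Spread of the samples around whichever mode they sit nearest."""
    return float((x - 2.0 * np.sign(x)).std())

std_with = within_mode_std(langevin(noise=True))
std_without = within_mode_std(langevin(noise=False))
print(f"spread around the modes, with noise:    {std_with:.3f}")
print(f"spread around the modes, without noise: {std_without:.3f}")
```

With noise the samples keep roughly the spread of the target around each mode. Without it they slide to the two peaks and stay there, which is optimization of $\log p$ rather than sampling from $p$.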

Step 6: Plot the result and measure coverage

Overlay real and generated points to judge fidelity by eye, then put a number on coverage with a simple nearest-neighbour recall: the fraction of real points that have a generated point nearby. This is a toy stand-in for the precision-and-recall view of generative evaluation in Section 30.6, where precision asks whether samples are realistic and recall asks whether they cover the data.

import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

plt.scatter(data[:, 0], data[:, 1], s=4, alpha=0.3, label="real")
plt.scatter(samples[:, 0], samples[:, 1], s=4, alpha=0.3, label="generated")
plt.legend(); plt.axis("equal"); plt.title("score-based samples vs data")
plt.savefig("score_samples.png", dpi=120)

# Coverage (toy recall): fraction of real points within radius r of a sample.
# Named nn_model so it does not shadow torch.nn, imported as nn in Step 2.
nn_model = NearestNeighbors(n_neighbors=1).fit(samples.numpy())
dist, _ = nn_model.kneighbors(data.numpy())
# TODO: set coverage to the fraction of entries in dist that are below 0.15,
# then print it. A higher number means the generated cloud reaches more of
# the real data (better recall / coverage).
coverage = ...
print(f"coverage (recall proxy): {coverage:.2f}")

Hint

coverage = (dist < 0.15).mean(). A well-trained model with both moons populated reaches roughly 0.85 or higher; a sampler that collapsed onto one crescent scores near 0.5, the numerical signature of the mode-dropping that Section 30.5 calls a coverage failure.
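Coverage is only half of the precision-and-recall picture. A minimal sketch of the other half, assuming the same radius-based proxy (the helper name `precision_recall_proxy` and the synthetic test clouds are hypothetical): swap the roles of the two point sets, so the question becomes whether each generated point lies near some real point.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def precision_recall_proxy(real, generated, radius=0.15):
    """Radius-based toy proxies; both arguments are (N, D) NumPy arrays."""
    # Recall: fraction of real points with a generated point within `radius`.
    d_real, _ = NearestNeighbors(n_neighbors=1).fit(generated).kneighbors(real)
    # Precision: fraction of generated points with a real point within `radius`.
    d_gen, _ = NearestNeighbors(n_neighbors=1).fit(real).kneighbors(generated)
    return float((d_gen < radius).mean()), float((d_real < radius).mean())

rng = np.random.default_rng(0)
# Synthetic stand-in data: two well-separated blobs play the role of two modes.
real = np.concatenate([rng.normal(-2.0, 0.3, (1000, 2)),
                       rng.normal(2.0, 0.3, (1000, 2))])
one_mode = rng.normal(-2.0, 0.3, (2000, 2))     # a "generator" that dropped a mode

precision, recall = precision_recall_proxy(real, one_mode)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

A generator that drops a mode keeps high precision while recall falls toward one half, the same signature the hint describes for a sampler that collapsed onto one crescent.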

Expected Output

Two artifacts. First, score_samples.png, where the orange generated cloud traces both crescents of the blue real data rather than smearing across the empty plane or piling onto one moon. Second, a printed coverage near 0.85 or above when both moons are populated. The training loss should drop from its initial value and then level off well above zero, because the added noise can never be predicted perfectly from the noisy point. Now rerun Step 5 with steps_per_scale=2: sampling is much faster but the cloud is noisier and coverage drops, the quality-and-coverage versus speed trade of Section 30.5 made measurable in your own run. Exact numbers vary with seed; what should hold is a visibly two-moon sample and coverage far above the one-moon baseline of 0.5.

Stretch Goals

Disable the noise term in Step 5 (keep only the score step) and replot; watch the samples collapse onto the moon centres and coverage fall, the diversity loss that the noise term prevents, exactly as Exercise 30.4.2 predicts.
Swap make_moons for sklearn.datasets.make_circles or a three-blob Gaussian mixture and retrain; a sampler that handles disconnected modes without dropping any is passing the coverage test of Section 30.5 on a harder distribution.
Add a likelihood-based competitor: fit a two-component Gaussian mixture to the same data with sklearn.mixture.GaussianMixture, sample from it, and compare its coverage and its by-eye fidelity against your score model, a hands-on version of the family comparison on the map of Section 30.2.
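The Gaussian-mixture comparison in the last stretch goal needs only a few calls. This is a sketch under the lab's own settings (the two-moons recipe of Step 1 and the radius of Step 6), using scikit-learn's GaussianMixture; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

# Same data recipe as Step 1, kept in NumPy.
x, _ = make_moons(n_samples=4000, noise=0.05, random_state=0)
x = (x - x.mean(0)) / x.std(0)

# Fit two full-covariance Gaussians by maximum likelihood (EM), then sample.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
gmm_samples, _ = gmm.sample(2000)               # returns (points, component labels)

# Same toy recall as Step 6: fraction of real points with a sample within 0.15.
dist, _ = NearestNeighbors(n_neighbors=1).fit(gmm_samples).kneighbors(x)
gmm_coverage = float((dist < 0.15).mean())
print(f"GMM coverage (recall proxy): {gmm_coverage:.2f}")
```

Each component is an ellipse, so expect the mixture to place some of its mass off the crescents. A scatter plot of gmm_samples over the data shows that more clearly than the coverage number does, because coverage measures recall and is blind to misplaced samples.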

Library Shortcut: Diffusers Scales the Same Loop to Images

The script above is roughly seventy lines and exposes the score, the schedule, and every Langevin step on purpose. The Hugging Face diffusers library (the reference source in the bibliography) wraps the identical denoise-and-sample loop, scaled to images and wired to a trained U-Net, behind a handful of calls: a scheduler object holds the noise ladder of Step 4, and a pipeline runs the sampling of Step 5 with one pipe(). That is a seventy-to-a-handful reduction, and the math you just wrote by hand is exactly what those calls hide. Build the score loop once at two dimensions to understand it; reach for diffusers when the data becomes pictures in Chapter 33.

Complete Solution

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

torch.manual_seed(0)

# Step 1: a distribution you can see.
def get_data(n=4000):
    x, _ = make_moons(n_samples=n, noise=0.05, random_state=0)
    x = (x - x.mean(0)) / x.std(0)
    return torch.tensor(x, dtype=torch.float32)

data = get_data()

# Step 2: noise-conditioned score network.
class ScoreNet(nn.Module):
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, sigma):
        h = torch.cat([x, torch.log(sigma)], dim=1)
        return self.net(h)

score_net = ScoreNet()

# Step 3: denoising score matching loss.
def dsm_loss(net, x, sigmas):
    sigma = sigmas[torch.randint(len(sigmas), (x.size(0),))].unsqueeze(1)
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps
    target = -eps / sigma
    pred = net(x_noisy, sigma)
    return (sigma ** 2 * (pred - target) ** 2).mean()

# Step 4: geometric noise ladder and training.
sigmas = torch.exp(torch.linspace(np.log(1.0), np.log(0.01), 10))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
for step in range(3000):
    idx = torch.randint(len(data), (256,))
    loss = dsm_loss(score_net, data[idx], sigmas)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}")

# Step 5: annealed Langevin sampling.
@torch.no_grad()
def sample(net, sigmas, n=2000, steps_per_scale=20, eta=2e-5):
    x = torch.randn(n, 2)
    for sigma in sigmas:
        s = torch.full((n, 1), float(sigma))
        # eta is the step size at the smallest scale, so step_size = 0.2 * sigma**2 here.
        step_size = eta * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_scale):
            score = net(x, s)
            x = x + 0.5 * step_size * score \
                + step_size ** 0.5 * torch.randn_like(x)
    return x

samples = sample(score_net, sigmas)

# Step 6: plot and measure coverage.
plt.scatter(data[:, 0], data[:, 1], s=4, alpha=0.3, label="real")
plt.scatter(samples[:, 0], samples[:, 1], s=4, alpha=0.3, label="generated")
plt.legend(); plt.axis("equal"); plt.title("score-based samples vs data")
plt.savefig("score_samples.png", dpi=120)

nn_model = NearestNeighbors(n_neighbors=1).fit(samples.numpy())
dist, _ = nn_model.kneighbors(data.numpy())
coverage = (dist < 0.15).mean()
print(f"coverage (recall proxy): {coverage:.2f}")

Bibliography & Further Reading

Foundational Papers

Kingma, D. P., Welling, M. "Auto-Encoding Variational Bayes." ICLR (2014). arXiv:1312.6114

The variational autoencoder, the cleanest worked example of the latent-variable model of Section 30.3 and the subject of Chapter 31. Introduces the reparameterization trick that makes the latent differentiable.

Goodfellow, I. et al. "Generative Adversarial Networks." NeurIPS (2014). arXiv:1406.2661

The GAN box on the Section 30.2 map: a generator and discriminator trained as adversaries, the likelihood-free family known for sharp samples and mode-collapse risk. Expanded in Chapter 32.

Rezende, D. J., Mohamed, S. "Variational Inference with Normalizing Flows." ICML (2015). arXiv:1505.05770

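The flow family of Section 30.2. Builds a density by composing invertible transformations, the family that obtains an exact, tractable likelihood through the change-of-variables formula (autoregressive models reach an exact likelihood by a different route, the chain rule of probability).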

van den Oord, A. et al. "Pixel Recurrent Neural Networks (PixelRNN/PixelCNN)." ICML (2016). arXiv:1601.06759

The autoregressive family of Section 30.2: factorize an image as a product of per-pixel conditionals and predict each pixel from the ones before it. Exact likelihood, slow sequential sampling.

Ho, J., Jain, A., Abbeel, P. "Denoising Diffusion Probabilistic Models (DDPM)." NeurIPS (2020). arXiv:2006.11239

The diffusion family of Section 30.2 and the destination of the score machinery in Section 30.4. Learns to reverse a gradual noising process; the foundation of Chapter 33 and modern image generation.

Hyvärinen, A. "Estimation of Non-Normalized Statistical Models by Score Matching." JMLR (2005). jmlr.org/papers/v6/hyvarinen05a

The original score-matching objective of Section 30.4: fit the gradient of log-density and sidestep the intractable partition function of an energy-based model entirely.

Song, Y., Ermon, S. "Generative Modeling by Estimating Gradients of the Data Distribution (NCSN)." NeurIPS (2019). arXiv:1907.05600

The bridge from score matching to generation of Section 30.4. Estimates the score at many noise levels and samples by annealed Langevin dynamics, the immediate precursor of the diffusion view in Chapter 33.

Evaluation Metrics

Heusel, M. et al. "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID)." NeurIPS (2017). arXiv:1706.08500

Introduces the Fréchet Inception Distance of Section 30.6: compare the feature statistics of real and generated images under an Inception network. Still the default generative-image metric.

Salimans, T. et al. "Improved Techniques for Training GANs (Inception Score)." NeurIPS (2016). arXiv:1606.03498

The Inception Score of Section 30.6 and several still-used GAN training stabilizers. The first widely adopted automatic generative-image metric, with the diversity-and-confidence intuition this section unpacks.

Books

Prince, S. J. D. "Understanding Deep Learning." MIT Press (2023). udlbook.github.io/udlbook

Open-access text whose generative chapters (VAEs, GANs, normalizing flows, diffusion) give an exceptionally clear, figure-rich treatment of every family on the Section 30.2 map. The companion text for this part.

Murphy, K. P. "Probabilistic Machine Learning: Advanced Topics." MIT Press (2023). probml.github.io/pml-book

The rigorous reference for the probability behind this chapter: latent-variable models, variational inference, energy-based models, and score-based generative modeling, all in one open-access volume.

Tools & Libraries

Hugging Face Diffusers. huggingface.co/docs/diffusers

The de facto library for diffusion and modern generative pipelines used throughout Part IV. A few lines load a pretrained generator and sample from it, the library shortcut against which the from-scratch code in these sections is contrasted.

clean-fid (Parmar et al., "On Aliased Resizing and Surprising Subtleties in GAN Evaluation," CVPR 2022). github.com/GaParmar/clean-fid

A reference FID implementation that fixes the resizing and preprocessing inconsistencies that quietly corrupt reported FID. The tool behind the Section 30.6 warning that FID is sensitive to preprocessing.

torch-fidelity. github.com/toshas/torch-fidelity

A single-call implementation of Inception Score, FID, and KID in PyTorch, the library shortcut for the evaluation code in Section 30.6.

Tutorials & Explainers

Weng, L. "What are Diffusion Models?" Lil'Log (2021, updated). lilianweng.github.io

A careful, equation-complete walk from score matching and Langevin dynamics through DDPM, the perfect companion to Section 30.4 and the on-ramp to Chapter 33.

Song, Y. "Generative Modeling by Estimating Gradients of the Data Distribution." Author's blog. yang-song.net/blog/2021/score

The clearest first-person explanation of the score perspective of Section 30.4, with interactive figures showing Langevin sampling following the learned gradient field.