Part IV: Generative Vision Models
Chapter 30: Foundations of Generative Modeling

A Map of Generative Families: VAE, GAN, Flow, Autoregressive & Diffusion

"Five of us were asked to draw the same distribution. The flow demanded an exactly reversible road. The autoregressive one wrote a left-to-right diary. The GAN hired a forger and a critic. The VAE squeezed everything through a keyhole. The diffusion model just kept erasing noise until a picture confessed. We all arrived. We took wildly different routes."

A Diffusion Model, Halfway Through Denoising
Big Picture

There is no single way to learn $p(\mathbf{x})$. Five families have each found a different tractable handle on the intractable object, and every one trades among properties you cannot freely have at once: a likelihood you can compute, high-quality samples, diverse samples, and fast sampling. Variational autoencoders maximize a lower bound on likelihood and decode from a latent. Generative adversarial networks skip likelihood entirely and learn by being judged. Normalizing flows insist on exact invertibility to get an exact likelihood. Autoregressive models factorize the image distribution into a chain of per-pixel predictions. Diffusion models learn to reverse a noising process step by step. This section is the field guide: one paragraph and one line of intuition for each family, what it optimizes, what it gives up, and where it sits on the quality-diversity-speed map. Read it once and you will be able to place any generative paper in Part IV before you finish its abstract.

In Section 30.1 we established that modeling $p(\mathbf{x})$ means describing where natural images live and how probability spreads across that thin manifold. The obstacle is that $p(\mathbf{x})$ is intractable: we cannot write down a normalized density over the $150{,}528$ values of a $224 \times 224$ RGB image and integrate it. Every family in this section is a strategy for getting useful behavior out of $p(\mathbf{x})$ without ever writing it down directly. We will tour them in the order they tend to be taught, then assemble a comparison table and a map that the rest of the part hangs on. Each family then gets its own chapter; this section is the index card for all of them.

1. Variational Autoencoders: Compress, Then Decode (Beginner)

A variational autoencoder (VAE) introduces a low-dimensional latent variable $\mathbf{z}$ with a simple prior (usually a standard Gaussian) and a decoder network $p_\theta(\mathbf{x} \mid \mathbf{z})$ that turns a latent into an image. The marginal likelihood $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$ is intractable, so the VAE introduces an encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ and maximizes a tractable lower bound on the log-likelihood, the evidence lower bound or ELBO:

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\big[\log p_\theta(\mathbf{x}\mid\mathbf{z})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big).$$

The first term rewards reconstruction; the second, a Kullback-Leibler divergence $D_{\mathrm{KL}}$ (a standard measure of how far one distribution sits from another, zero only when they match), pulls the encoded distribution toward the prior so that sampling $\mathbf{z} \sim p(\mathbf{z})$ and decoding produces plausible images. The VAE gives an approximate, comparable likelihood and a clean, smooth latent space (the subject of Section 30.3), at the cost of samples that are often slightly blurry, because the bound and the Gaussian decoder average over fine detail. Chapter 31 builds the VAE in full, including the reparameterization trick that makes the sampling differentiable.
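To see the bound in action, here is a minimal sketch on a toy linear-Gaussian model, chosen only because its exact $\log p(\mathbf{x})$ has a closed form to compare against. The scalar decoder weight `w`, the noise variance `s2`, and the encoder parameters `mu` and `var` are illustrative assumptions rather than anything this chapter defines, and both ELBO terms are evaluated analytically instead of by sampling.

```python
import math

# Toy model (an assumption made for illustration, not the chapter's VAE):
#   prior    p(z)     = N(0, 1)
#   decoder  p(x | z) = N(w * z, s2)
#   encoder  q(z | x) = N(mu, var)


def kl_to_standard_normal(mu, var):
    """Closed-form D_KL( N(mu, var) || N(0, 1) )."""
    return 0.5 * (mu * mu + var - 1.0 - math.log(var))


def expected_reconstruction(x, mu, var, w, s2):
    """E_q[ log N(x; w*z, s2) ] for z ~ N(mu, var), in closed form."""
    return (-0.5 * math.log(2 * math.pi * s2)
            - ((x - w * mu) ** 2 + w * w * var) / (2 * s2))


def elbo(x, mu, var, w, s2):
    """Reconstruction term minus KL term: the right-hand side of the bound."""
    return expected_reconstruction(x, mu, var, w, s2) - kl_to_standard_normal(mu, var)


def exact_log_likelihood(x, w, s2):
    """log p(x) with p(x) = N(0, w^2 + s2); tractable only because the toy is linear."""
    total_var = w * w + s2
    return -0.5 * math.log(2 * math.pi * total_var) - x * x / (2 * total_var)


x, w, s2 = 1.5, 2.0, 1.0
log_px = exact_log_likelihood(x, w, s2)

# A deliberately poor encoder (it ignores x): the bound is loose.
loose = elbo(x, mu=0.0, var=1.0, w=w, s2=s2)

# The exact posterior N(w*x / (w^2 + s2), s2 / (w^2 + s2)): the bound is tight.
post_mu = w * x / (w * w + s2)
post_var = s2 / (w * w + s2)
tight = elbo(x, post_mu, post_var, w, s2)
```

With the poor encoder the bound sits well below $\log p(\mathbf{x})$; with the exact posterior the gap closes. That gap is the KL divergence between the encoder and the true posterior, which is why training the encoder tightens the ELBO.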

2. Generative Adversarial Networks: Learn by Being Judged (Beginner)

A generative adversarial network (GAN) abandons likelihood altogether. A generator $G$ maps a noise vector $\mathbf{z} \sim p(\mathbf{z})$ to an image, and a discriminator $D$ tries to tell real images from generated ones. They play a minimax game:

$$\min_G \max_D \; \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))].$$

Read the formula as a tug-of-war over one quantity: the discriminator $D$ turns its dial to make the expression as large as it can (scoring real images near 1 and fakes near 0), while the generator $G$ turns its own dial to make the same expression as small as it can (fooling $D$ into scoring its fakes near 1). Neither term is a loss to minimize on its own; learning is the contest between the two. The generator never sees the data distribution directly; it only learns from the gradient the discriminator provides. When the game reaches equilibrium the generator's samples are indistinguishable from real data. GANs are the family famous for sharp, photorealistic samples and fast single-pass sampling, but they pay for it: they provide no likelihood, training is unstable, and they are prone to mode collapse, where the generator covers only part of the data distribution (high quality, low diversity).
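The tug-of-war can be checked numerically on a tiny discrete example. The sketch below assumes a "dataset" of four equally likely outcomes and uses the standard result that, for a fixed generator, the best discriminator is $D^*(\mathbf{x}) = p_{\text{data}}(\mathbf{x}) / (p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x}))$. The distributions and function names are hypothetical; no networks are trained here.

```python
import math


def value(p_data, p_gen, disc):
    """V(D, G) = E_data[log D(x)] + E_gen[log(1 - D(x))] over a finite outcome set."""
    total = 0.0
    for pd, pg, d in zip(p_data, p_gen, disc):
        if pd > 0:                      # skip zero-weight terms so log(0) never appears
            total += pd * math.log(d)
        if pg > 0:
            total += pg * math.log(1.0 - d)
    return total


def optimal_discriminator(p_data, p_gen):
    """D*(x) = p_data / (p_data + p_gen); assumes every outcome has some mass."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]


p_data = [0.25, 0.25, 0.25, 0.25]

# Generator that matches the data exactly.
p_matched = [0.25, 0.25, 0.25, 0.25]
d_matched = optimal_discriminator(p_data, p_matched)
v_matched = value(p_data, p_matched, d_matched)

# Mode-collapsed generator: it covers only two of the four outcomes.
p_collapsed = [0.5, 0.5, 0.0, 0.0]
d_collapsed = optimal_discriminator(p_data, p_collapsed)
v_collapsed = value(p_data, p_collapsed, d_collapsed)

# A lazy discriminator that answers 0.5 everywhere does worse (for D) than D*.
v_lazy = value(p_data, p_collapsed, [0.5] * 4)
```

When the generator matches the data, the best discriminator outputs 0.5 everywhere and the value is $-\log 4$, the lowest the generator can force. The collapsed generator leaves the discriminator something to exploit, so the value it faces is higher.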

Common Misconception: A GAN Does Not Compute a Probability

It is tempting to read the discriminator's output $D(\mathbf{x})$ as "the probability that image $\mathbf{x}$ is real" and conclude that a GAN gives you a likelihood you could query, the way a VAE bound or a flow does. In fact a GAN never models $p(\mathbf{x})$ at all. The generator only learns to push samples past whatever discriminator it currently faces, and at equilibrium $D$ collapses to a constant $0.5$ everywhere, carrying no per-image information. You cannot ask a trained GAN "how probable is this face?", which is exactly why GAN papers report sample-quality metrics rather than held-out likelihood, and why the mode collapse above can go undetected: there is no density to expose the missing modes. If you need to score or rank images by probability (for anomaly detection, say, as in the textile-mill example of Section 30.1), a GAN is the wrong family.

The adversarial discriminator is the spiritual descendant of the realism critic idea you saw used to judge representations in Chapter 25. Chapter 32 develops the game and its many stabilizers.

3. Normalizing Flows: An Exactly Reversible Road (Intermediate)

A normalizing flow insists on something the other families give up: an exact, computable likelihood. It builds the data distribution as an invertible, differentiable transformation $f$ of a simple base distribution (a Gaussian). Because $f$ is invertible, the change-of-variables formula gives the exact density,

$$\log p_\theta(\mathbf{x}) = \log p_{\mathbf{z}}\big(f^{-1}(\mathbf{x})\big) + \log\left|\det \frac{\partial f^{-1}}{\partial \mathbf{x}}\right|,$$

where the Jacobian-determinant term accounts for how the transformation stretches and squeezes volume. That correction term is not optional bookkeeping: probability mass must be conserved, so when $f^{-1}$ compresses a region of pixel space the density there has to rise to keep the total integral at one, and when it expands a region the density must fall. The log-determinant measures exactly that local volume change, and without it the formula would report a base-distribution density that no longer integrates to one over $\mathbf{x}$. This is the same change-of-variables logic that tracked how a homography warps pixel areas back in Chapter 5, now applied to probability mass. Flows let you both sample (push $\mathbf{z}$ through $f$) and evaluate exact likelihood, a rare combination. The price is architectural: every layer must be invertible with an efficiently computable Jacobian determinant. That efficiency requirement is the real constraint, because a general $d \times d$ determinant costs $O(d^3)$, hopeless for $d$ in the millions of pixels. Flows are therefore built from layers whose Jacobian is triangular, so its determinant is just the product of the diagonal entries, an $O(d)$ quantity: coupling layers (split the input, transform one half as a function of the other) and autoregressive layers are the two standard ways to buy invertibility and a cheap determinant at once. This constrains the network design and tends to require many parameters for a given sample quality. Flows are less dominant for raw image generation today, but the invertibility idea reappears in flow matching, the modern continuous-time descendant we flag in Section 30.4 and Chapter 33.
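A two-dimensional coupling layer makes the triangular-Jacobian argument concrete. In this sketch the scale function `s` and shift function `t` are arbitrary hypothetical choices standing in for small networks; the point is only that the inverse is exact and the log-determinant is read off a single diagonal entry, which a finite-difference check confirms.

```python
import math


def s(u):
    """Hypothetical log-scale function (a network in a real flow)."""
    return math.tanh(u)


def t(u):
    """Hypothetical shift function (a network in a real flow)."""
    return 0.5 * u


def forward(z1, z2):
    """Sampling direction f: z -> x. The first coordinate passes through unchanged."""
    return z1, z2 * math.exp(s(z1)) + t(z1)


def inverse(x1, x2):
    """Likelihood direction f^{-1}: x -> z, exact because x1 equals z1."""
    return x1, (x2 - t(x1)) * math.exp(-s(x1))


def log_standard_normal(v):
    return -0.5 * (math.log(2 * math.pi) + v * v)


def log_prob(x1, x2):
    """log p(x) = log p_z(f^{-1}(x)) + log|det d f^{-1}/dx|, where log|det| = -s(x1)."""
    z1, z2 = inverse(x1, x2)
    return log_standard_normal(z1) + log_standard_normal(z2) - s(x1)


def numerical_det_of_inverse(x1, x2, h=1e-5):
    """Central-difference determinant of the 2x2 Jacobian of f^{-1}, as a check."""
    col1 = [(p - m) / (2 * h) for p, m in zip(inverse(x1 + h, x2), inverse(x1 - h, x2))]
    col2 = [(p - m) / (2 * h) for p, m in zip(inverse(x1, x2 + h), inverse(x1, x2 - h))]
    return col1[0] * col2[1] - col2[0] * col1[1]


z = (0.3, -1.2)
x = forward(*z)
z_back = inverse(*x)
det_numeric = numerical_det_of_inverse(*x)
det_analytic = math.exp(-s(x[0]))
```

Because the first coordinate is copied through, the Jacobian of the inverse is lower triangular with diagonal $(1, e^{-s(x_1)})$, so the determinant costs one exponential rather than a full matrix factorization.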

4. Autoregressive Models: One Pixel at a Time (Intermediate)

An autoregressive model uses the chain rule of probability to factorize the joint distribution over all pixels into a product of conditionals, each pixel depending on the pixels before it in some fixed ordering:

$$p_\theta(\mathbf{x}) = \prod_{i=1}^{D} p_\theta\big(x_i \mid x_1, x_2, \dots, x_{i-1}\big).$$

This is the exact analogue of how a language model predicts the next token, applied to pixels (PixelCNN) or, in modern systems, to discrete image tokens from a learned codebook. The factorization is exact, so autoregressive models give an exact likelihood and tend to produce coherent, high-quality samples. Their defining weakness is speed: sampling requires $D$ sequential forward passes, one per pixel or token, so generating a single image can take thousands of steps. The same self-attention machinery from Chapter 22 powers the strongest modern autoregressive image models, which generate sequences of image tokens.
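The factorization can be verified by brute force on a toy image small enough to enumerate. The sketch below assumes four binary "pixels" and a made-up conditional `prob_pixel_on`; any function that returns a probability from the prefix would do. It checks that the product of conditionals is a properly normalized distribution and shows why sampling takes $D$ sequential steps.

```python
import itertools
import math
import random

D = 4  # number of binary "pixels" in the toy image


def prob_pixel_on(prefix):
    """Hypothetical conditional p(x_i = 1 | x_1..x_{i-1}); stands in for a network."""
    score = -0.3 + 1.2 * sum(2 * v - 1 for v in prefix)
    return 1.0 / (1.0 + math.exp(-score))


def log_prob(x):
    """Exact log-likelihood: the sum of the log conditionals, one per pixel."""
    total = 0.0
    for i, v in enumerate(x):
        p_on = prob_pixel_on(x[:i])
        total += math.log(p_on if v == 1 else 1.0 - p_on)
    return total


def sample(rng):
    """Sampling is inherently sequential: pixel i needs pixels 0..i-1 first."""
    x = []
    for _ in range(D):
        x.append(1 if rng.random() < prob_pixel_on(x) else 0)
    return x


# Sum the joint probability over all 2^D images; the chain rule guarantees 1.
total_mass = sum(math.exp(log_prob(x)) for x in itertools.product([0, 1], repeat=D))

image = sample(random.Random(0))
```

Evaluating `log_prob` on a known image needs every conditional but each one depends only on given pixels, so a real model can compute them in parallel. Sampling cannot, because each pixel must exist before the next is predicted.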

5. Diffusion Models: Reverse the Noise (Beginner)

A diffusion model defines a fixed forward process that gradually adds Gaussian noise to an image over many steps until it becomes pure noise, then learns a neural network to reverse that process one step at a time. Training reduces to a simple denoising objective: given a noised image and the noise level, predict the noise that was added. Sampling starts from pure noise and applies the learned reverse step repeatedly until an image emerges. Diffusion models combine the high sample quality of GANs with the stable, likelihood-grounded training of the probabilistic families, which is why they dominate image generation in 2024 to 2026. Their historical weakness is sampling speed (many reverse steps), the problem that drove the fast-sampler research wave of Section 30.5. Crucially, diffusion is the destination of the score-and-Langevin machinery we build in Section 30.4: the reverse step is a learned step along the score of a noised distribution. This connection, drawn explicitly here, is what makes Chapter 33 feel like the natural continuation of this chapter rather than a fresh start. The recurring denoising thread, from the Gaussian and non-local-means filters of Chapter 7 to learned denoising autoencoders, culminates in diffusion as iterative learned denoising.
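The forward process and the denoising target fit in a few lines. This sketch assumes a DDPM-style variance-preserving process with a linear noise schedule, a common choice that this section does not itself specify, and it uses the true noise in place of a trained network to show what a perfect prediction would buy. All names are illustrative.

```python
import math
import random

T = 1000
# Linear noise schedule: an assumed, commonly used choice, not taken from the text.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] is the fraction of signal variance that survives after t + 1 steps.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def noise_image(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [a * v + b * e for v, e in zip(x0, eps)], eps


def denoising_loss(eps_pred, eps):
    """Training objective: mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def recover_x0(x_t, eps_pred, t):
    """Invert the noising formula using a noise prediction."""
    a, b = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [(v - b * e) / a for v, e in zip(x_t, eps_pred)]


rng = random.Random(0)
x0 = [0.2, -0.5, 0.9, 0.0]            # a four-value stand-in for an image
x_t, eps = noise_image(x0, 500, rng)

perfect_loss = denoising_loss(eps, eps)   # a perfect predictor would score zero
x0_hat = recover_x0(x_t, eps, 500)        # and would recover the clean image
```

By the final step almost no signal remains (`alpha_bar[-1]` is close to zero), which is why sampling can start from pure Gaussian noise. A real model's noise prediction is imperfect, so samplers take many small reverse steps instead of one large jump.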

[Figure 30.2.1 diagram, one row per family]
VAE: z → decoder → image. Maximizes the ELBO; smooth latent; samples can blur.
GAN: z → generator → image, judged by critic D. No likelihood; sharp; mode-collapse risk.
Flow: z → invertible f → image. Exact likelihood via change of variables.
AR: x1, x2, x3, ... Predicts each pixel from earlier ones; exact, slow.
Diffusion: noise → denoise → denoise → image. Reverses many noising steps; top quality, many steps.
All five reach a sample of p(x); they differ in route, likelihood access, and cost.
Figure 30.2.1: The five generative families at a glance, one row each. The VAE and GAN push a latent through a single network; the flow uses an invertible map for an exact likelihood; the autoregressive model builds the image pixel by pixel; the diffusion model starts from noise and denoises through many steps. Every row produces a sample from $p(\mathbf{x})$, but the routes, the likelihood access, and the per-sample cost differ sharply, which is exactly what the comparison table below quantifies.

6. The Comparison Table (Intermediate)

Figure 30.2.1 shows the routes; the table below scores them on the axes that matter in practice. "Tractable likelihood" asks whether you can compute or bound $p(\mathbf{x})$ for a given image. "Sample quality" and "diversity" summarize the family's typical behavior on natural images. "Sampling speed" counts how many network passes one sample costs. The entries are typical tendencies, not laws; a well-engineered member of any family can beat a sloppy member of another.

Table 30.2.1: The five generative families compared on likelihood access, sample quality, diversity, and sampling cost (typical tendencies as of 2026).
| Family | Optimizes | Tractable likelihood? | Sample quality | Diversity | Sampling speed | Chapter |
| --- | --- | --- | --- | --- | --- | --- |
| VAE | ELBO (lower bound) | Approximate (bound) | Medium (can blur) | High | Fast (1 pass) | 31 |
| GAN | Adversarial game | No | High (sharp) | Lower (collapse risk) | Fast (1 pass) | 32 |
| Flow | Exact log-likelihood | Yes (exact) | Medium | High | Fast (1 pass) | 30.4 / 33 |
| Autoregressive | Exact log-likelihood | Yes (exact) | High | High | Slow (D passes) | 34 |
| Diffusion | Denoising / score | Bound / ODE estimate | Highest | High | Slow to medium (many steps) | 33 |

Key Insight: There Is No Free Lunch, Only Different Trades

Look down the columns and a pattern emerges: no family wins everywhere. Flows and autoregressive models get exact likelihood but pay in architecture constraints or sampling speed. GANs get sharp fast samples but lose likelihood and risk collapse. Diffusion buys top quality and stable training at the price of many sampling steps. The VAE is the all-rounder that excels at none. This is the quality-diversity-speed trilemma that Section 30.5 states formally: you choose two and compromise on the third. Every advance in Part IV is, at bottom, an attempt to bend one corner of this triangle: latent diffusion to make diffusion cheaper, consistency models to make it few-step, and improved GAN losses to recover diversity.

Fun Note

Picture the five families as houseguests asked to draw the same cat (drawn below). The flow insists on a road it can walk backwards. The autoregressive one narrates the cat pixel by pixel like a slow audiobook. The GAN hires an art critic and paints only to shut it up. The VAE squints, mumbles "close enough", and hands you a slightly soft cat. The diffusion model starts with a snowstorm and swears there was a cat in there the whole time. All five leave with a cat. The trilemma is just the bill each one hands you on the way out.

[Illustration: five cartoon characters around a table, each drawing the same cat differently: one walks a reversible road, one inks pixel by pixel from a scroll, one paints under an art critic's gaze, one holds a soft blurry cat, and one clears a snowstorm of dots into a sharp cat.]
Five families, five wildly different routes to the same picture: invert, chain, contest, compress, and denoise; the trilemma is just the bill each one hands you on the way out.

The code below makes the abstract differences concrete by sketching the one-line sampling signature of each family. The signatures alone reveal the speed story: the families with a single call sample in one pass, while the autoregressive and diffusion families wrap their call in a loop.

# Each family exposes a different sampling SHAPE. The single-pass families return
# after one network call; the autoregressive and diffusion families wrap it in a loop.
# Sketch only: decoder, generator, flow, model, and denoiser are placeholder objects,
# and model(x, up_to=i) and denoiser.reverse_step(x, t) are illustrative interfaces,
# but the loop structure is exactly what sets per-sample cost.
import torch


def sample_from(logits):
    # Draw one discrete value per row from unnormalized log-probabilities
    # of shape (n, num_values).
    return torch.distributions.Categorical(logits=logits).sample()


@torch.no_grad()                                     # sampling needs no gradients
def sample_vae(decoder, n, z_dim, device):
    z = torch.randn(n, z_dim, device=device)         # draw from the Gaussian prior
    return decoder(z)                                # ONE pass: latent -> image


@torch.no_grad()
def sample_gan(generator, n, z_dim, device):
    z = torch.randn(n, z_dim, device=device)
    return generator(z)                              # ONE pass, just like the VAE decoder


@torch.no_grad()
def sample_flow(flow, n, z_dim, device):
    z = torch.randn(n, z_dim, device=device)
    return flow(z)                                   # ONE pass through the invertible map f: z -> x


@torch.no_grad()
def sample_autoregressive(model, n, D, device):
    x = torch.zeros(n, D, device=device)
    for i in range(D):                               # D SEQUENTIAL passes, one per pixel
        logits = model(x, up_to=i)                   # condition on pixels 0..i-1
        x[:, i] = sample_from(logits).to(x.dtype)    # then fill pixel i
    return x


@torch.no_grad()
def sample_diffusion(denoiser, n, shape, steps, device):
    x = torch.randn(n, *shape, device=device)        # start from pure noise
    for t in reversed(range(steps)):                 # MANY reverse steps
        x = denoiser.reverse_step(x, t)              # each step removes a little noise
    return x
Code Fragment 1: The sampling signature of each family, side by side. sample_vae, sample_gan, and sample_flow all return after a single network call; sample_autoregressive wraps its call in a for i in range(D) loop over pixels; sample_diffusion loops once per reverse step. The shape of the sampling loop, not the architecture, is what sets the per-sample cost in Table 30.2.1.
Try This: Feel the Speed Gap

Replace the network calls in sample_autoregressive and sample_diffusion with a trivial placeholder (say model = lambda *a, **k: torch.zeros(...)) and wrap each in a timer for $D = 4096$ and for diffusion steps set to $10$, $50$, then $250$. Watch the wall-clock grow in proportion to the number of passes while the single-pass sample_vae stays flat no matter the image size. The point to observe: the cost difference between families is not subtle tuning, it is the loop count itself, the same one-pass-versus-many spread that Table 30.2.1 reports. Then bump diffusion's steps back down to $4$ and notice how close it gets to the single-pass families, which is exactly the speed corner that the fast-sampler research of Section 30.5 is chasing.
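Before reaching for a timer, it can help to count passes directly, since that is the quantity the wall-clock tracks. The sketch below is a dependency-free stand-in rather than the exercise itself: `CountingNet` is a hypothetical placeholder that does no work and only counts how often it is called, and the three samplers mirror the loop shapes of the pseudocode above.

```python
class CountingNet:
    """Stand-in for a network: does no real work, only counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x


def single_pass(net, z):
    """VAE / GAN / flow shape: one call and done."""
    return net(z)


def autoregressive(net, D):
    """One call per pixel; pixel i is filled only after the pass that predicts it."""
    x = [0.0] * D
    for i in range(D):
        out = net(x)
        x[i] = out[i]
    return x


def diffusion(net, x, steps):
    """One call per reverse step."""
    for _ in range(steps):
        x = net(x)
    return x


def passes(sampler, *args):
    """Run a sampler against a fresh counting network and report the pass count."""
    net = CountingNet()
    sampler(net, *args)
    return net.calls


counts = {
    "single-pass": passes(single_pass, [0.0]),
    "autoregressive (D=4096)": passes(autoregressive, 4096),
    "diffusion (50 steps)": passes(diffusion, [0.0], 50),
    "diffusion (4 steps)": passes(diffusion, [0.0], 4),
}
```

Multiplying each count by a measured per-pass latency gives a first estimate of wall-clock time, under the assumption that every pass costs about the same.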

Library Shortcut: One Pipeline Interface for Every Family

The pseudocode above writes a different sampling loop per family. In practice, the Hugging Face diffusers library hides those differences behind a single pipeline interface for the diffusion and flow-based models it ships, so switching between them is switching a class name, not rewriting a loop:

# One call shape for two different families. Whichever pipeline class you load,
# sampling is the same pipe(...) invocation; the per-family reverse loop the
# pseudocode wrote by hand lives inside the pipeline's __call__.
import torch
from diffusers import DDPMPipeline, StableDiffusion3Pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"   # fall back to CPU when no GPU is present

# A diffusion model (many reverse steps, handled internally):
pipe = DDPMPipeline.from_pretrained("google/ddpm-cifar10-32").to(device)
img = pipe(num_inference_steps=50).images[0]          # the reverse loop is inside .__call__

# A modern rectified-flow text-to-image model, same call shape:
# pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-medium")
# img = pipe("a watercolor fox").images[0]
Code Fragment 2: One pipeline interface across families in diffusers. Whether the loaded class is DDPMPipeline (a plain diffusion model) or StableDiffusion3Pipeline (a rectified-flow text-to-image system), sampling is the same pipe(...) call; the per-family sampling loop the pseudocode wrote by hand is hidden inside the pipeline's __call__.

What the five hand-written sampling functions express, a per-family loop with its own step count, the library exposes as one pipe(...) call whose internal loop matches the loaded pipeline. The from-scratch signatures exist so you know what a single sampling call does differently under the hood for, say, a one-pass GAN versus a many-step diffusion model.

Practical Example: Choosing a Family Under a Latency Budget

Who: an engineering team at a mobile photo-editing app adding a "remove this object and fill the gap" feature.

Situation: the fill had to run on-device and feel instantaneous, ideally under a few hundred milliseconds, on a phone.

Problem: their first prototype used a then-state-of-the-art diffusion inpainter that produced beautiful fills but took fifty reverse steps, roughly 4 seconds per fill on the target phone, far too slow for an interactive editor.

Dilemma: option one was to swap in a single-pass GAN inpainter that would hit the latency budget instantly but visibly degrade fill quality on textured backgrounds; option two was to keep the diffusion model and accept the 4-second wait, which usability testing showed users abandoned; option three was to keep diffusion's quality and attack only its step count.

Decision: they consulted exactly the trade-off in Table 30.2.1. Quality was non-negotiable, but they could spend research effort to fix speed rather than switch to a lower-quality single-pass GAN.

How: they adopted a few-step distilled diffusion sampler (the consistency-model line from Section 30.5) that collapsed fifty steps into four, cutting per-fill time from about 4 seconds to under 300 milliseconds while keeping diffusion quality at near-GAN speed.

Result: interactive fills at acceptable latency without abandoning the diffusion quality their users had praised.

Lesson: the family map is not a one-time pick; it is the frame for an engineering decision. Knowing that diffusion's only weakness here was step count, and that the field had attacked exactly that weakness, let the team keep the corner of the triangle they cared about and bend the one they did not. Reading the map saved them from a needless quality downgrade.

Research Frontier: The Map Is Converging

From 2023 through 2026 the boundaries on this map have blurred, which is itself the frontier. Latent diffusion (Rombach et al., 2022, Stable Diffusion) runs a diffusion model inside a VAE's latent space, fusing two families. Flow matching (Lipman et al., 2023) and rectified flow (Liu et al., 2023) recast diffusion as learning a continuous-time flow, reuniting the diffusion and flow families under one continuous-normalizing-flow umbrella; the rectified-flow transformers behind Stable Diffusion 3 (Esser et al., 2024) and FLUX.1 (Black Forest Labs, 2024) are trained this way, and they keep the latent-diffusion recipe while replacing its U-Net with a single transformer backbone (a DiT). Consistency models (Song et al., 2023) and distillation collapse diffusion's many steps toward GAN-like single-pass sampling. Visual autoregressive modeling (VAR, Tian et al., 2024, NeurIPS 2024 best paper) revived the autoregressive family with next-scale prediction, and its authors report it as the first GPT-style autoregressive image model to surpass diffusion transformers on the ImageNet 256 benchmark. The clean five-box taxonomy of this section remains the right mental scaffold, but the most active work lives precisely on the seams between the boxes, borrowing the strength of one family to patch the weakness of another. Hold the map; expect the territory to keep merging.

7. How to Read the Rest of the Part (Beginner)

You now have the index card. A one-word verb for each family fixes the order in memory: compress (VAE), contest (GAN), invert (flow), chain (autoregressive), denoise (diffusion); each verb is the route that family takes from a code or from noise to a finished sample. The remaining chapters of Part IV each take one box on this map and open it: Chapter 31 the VAE, Chapter 32 the GAN, Chapter 33 diffusion, with flows and autoregressive models appearing where the story needs them. Two pieces of machinery, the latent space and the score, are shared across boxes and deserve their own treatment before we specialize. The next two sections supply them: Section 30.3 develops the latent variable that the VAE, GAN, and latent-diffusion families all rely on, and Section 30.4 develops the energy and score view that becomes diffusion.

Exercise 30.2.1: Place the Paper (Conceptual)

For each described system, name the family (or hybrid) it belongs to and the single sentence in this section that justifies your choice: (a) a model that adds noise over 1000 steps and trains a U-Net to predict that noise, (b) a model with an encoder and a decoder trained to maximize a lower bound on likelihood, (c) a model that runs a diffusion process inside a learned compressed latent, (d) a model in which a generator is trained only by gradients from a network that distinguishes real from fake, (e) a model that predicts the next image token given all previous tokens with self-attention.

Exercise 30.2.2: Measure the Speed Gap (Coding)

Using the sampling-signature pseudocode as a template, write two real functions in PyTorch: one that samples a $D$-dimensional vector autoregressively (loop with a trivial linear "model") and one that samples it with a fixed number of diffusion-style reverse steps. Time both for $D = 4096$ and step counts of 25, 50, and 250. Plot wall-clock against the number of passes and confirm the linear relationship. Write one sentence relating your plot to the "Sampling speed" column of Table 30.2.1.

Exercise 30.2.3: Which Trade Would You Make? (Analysis)

Pick one of these three deployment settings and argue, in a short paragraph, which family you would start from and which corner of the trilemma you would sacrifice: (a) generating synthetic training data to augment a rare-class detector (recall the data-engine idea you will meet in Chapter 37), (b) a creative tool where artists wait happily for a few seconds per image but demand maximum fidelity, (c) a real-time video filter at 30 frames per second. Reference Table 30.2.1 explicitly and name the specific weakness your choice accepts.