Part IV: Generative Vision Models
Chapter 33: Diffusion Models

"They poured static on me until I forgot what I was, then taught a network to un-forget me one grain at a time. I came back as a photograph of a cat I had never seen. The procedure was undignified, I admit, but the results speak for themselves."

A Diffusion Model, Halfway Through Denoising
Big Picture

A diffusion model learns to generate by mastering the opposite of destruction. Take a clean image and add Gaussian noise in many small steps until nothing is left but static; train a network to undo one step of that corruption; then start from pure static and walk all the way back to a brand-new image. This single idea, learned iterative denoising, is the engine behind Stable Diffusion, DALL-E, Midjourney, and the video and 3D generators of the chapters that follow. It is also a direct descendant of the denoising you met classically in Chapter 7 and learned in Chapter 31; the difference is that diffusion denoises not once but dozens to hundreds of times, each step nudging samples toward the data distribution. This chapter builds the idea from the forward corruption process up, shows the three equivalent views that explain why it works (the variational view, the score-based view, and the flow view), then turns to the engineering that made it fast and controllable: efficient samplers, guidance, and the latent-space trick that let the whole thing run on a single consumer GPU.

Chapter Overview

For two chapters you have studied generators that produce an image in a single forward pass. The variational autoencoder of Chapter 31 decodes a latent vector into a picture and trains against a reconstruction-plus-regularization objective; the generative adversarial network of Chapter 32 plays a generator against a discriminator until the fakes fool the critic. Both are powerful and both are fragile. VAEs tend to produce blurry samples because the reconstruction loss averages over plausible outputs; GANs produce sharp samples but train through a delicate, often unstable minimax game and can collapse to a handful of modes. Diffusion models sidestep both problems by giving up the single forward pass. Instead of asking a network to invent an entire image at once, they ask it to perform a much easier task many times: remove a little noise.

The recipe is almost suspiciously simple. The forward process takes a real image and adds a controlled amount of Gaussian noise, repeatedly, over hundreds to a thousand steps, until the image is statistically indistinguishable from random static. This process has no learnable parameters at all; it is just a fixed corruption schedule. The reverse process is where the learning happens: a neural network, classically a convolutional U-Net with attention and increasingly a transformer, is trained to predict the noise that was added at a given step, so that it can be subtracted off. Run the reverse process from pure noise and you generate a sample. Section 33.1 builds both processes from scratch and shows the one beautiful algebraic shortcut, the closed-form jump to any noise level, that makes training tractable.
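To make the closed-form jump concrete, here is a minimal NumPy sketch. It assumes a linear schedule running from 1e-4 to 0.02 over 1000 steps (the values commonly associated with DDPM, used here only for illustration). The helper names `make_schedule`, `q_sample`, and `noise_prediction_loss` are hypothetical and are not taken from the chapter's own code.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and its running product alpha_bar (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left at each step
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bar = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))        # stand-in "image" scaled to [-1, 1]
x_early, _ = q_sample(x0, 10, alpha_bar, rng)       # barely corrupted
x_late, eps = q_sample(x0, 999, alpha_bar, rng)     # essentially static
```

A training step would draw a random `t`, call `q_sample`, feed `x_t` and `t` to the network, and minimize `noise_prediction_loss` against the returned `eps`. No step-by-step simulation of the forward chain is needed, which is why the shortcut makes training tractable.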

With the machinery in place, the chapter turns to understanding. Section 33.2 formalizes the denoising diffusion probabilistic model (DDPM): the noise schedule, the three equivalent parameterizations of what the network predicts, and the variational bound that justifies the simple noise-prediction loss everyone actually uses. Section 33.3 reveals that the same model is, in the limit of infinitely many steps, a stochastic differential equation whose reverse-time drift depends on the score of the noised data distribution, the gradient of log-density you first met in the energy-based models of Chapter 30. That continuous view unlocks the probability-flow ODE, a deterministic path between noise and data that the fast samplers of Section 33.4 exploit to cut a thousand sampling steps down to twenty.
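The link between the score, the noise prediction, and deterministic few-step sampling can be checked on a toy problem where no network is needed. The sketch below assumes one-dimensional Gaussian "data" so that the score of every noised marginal is known exactly, and it uses the deterministic DDIM update as the sampler. The constants `MU` and `SIGMA`, the linear schedule, and all function names are illustrative assumptions, not the chapter's own setup.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
# alpha_bar[0] = 1 is the clean-data level; alpha_bar[T] is almost pure noise.
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

MU, SIGMA = 2.0, 0.5  # toy 1-D data distribution N(MU, SIGMA^2)

def score(x, t):
    """Exact score of the noised marginal N(sqrt(ab)*MU, ab*SIGMA^2 + 1 - ab)."""
    ab = alpha_bar[t]
    var_t = ab * SIGMA**2 + (1.0 - ab)
    return -(x - np.sqrt(ab) * MU) / var_t

def eps_from_score(x, t):
    """Noise prediction and score differ only by a scale: eps = -sqrt(1 - ab) * score."""
    return -np.sqrt(1.0 - alpha_bar[t]) * score(x, t)

def ddim_sample(x_T, num_steps):
    """Deterministic DDIM updates on an evenly spaced subset of the T timesteps."""
    ts = np.linspace(T, 0, num_steps + 1).round().astype(int)
    x = x_T
    for t, s in zip(ts[:-1], ts[1:]):
        eps = eps_from_score(x, t)
        # Estimate the clean sample, then re-noise it to the lower level s.
        x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[s]) * x0_hat + np.sqrt(1.0 - alpha_bar[s]) * eps
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(20_000)          # start from pure noise
samples_20 = ddim_sample(x_T, 20)
samples_1000 = ddim_sample(x_T, 1000)
```

In this toy, both runs land on the right mean, while the 20-step run somewhat underestimates the spread and the 1000-step run recovers it closely. That gap is the discretization error that better solvers aim to shrink. A trained denoiser would simply replace `eps_from_score`.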

The final three sections are about making diffusion practical and steerable. Section 33.5 presents the 2022 to 2024 reframing (flow matching, rectified flow, and consistency models) that straightens the generative path and pushes high-quality sampling toward a single step. Section 33.6 covers guidance, the technique that lets you trade diversity for fidelity and, in its classifier-free form, is the mechanism behind the "prompt strength" slider in the image tools you have used. Section 33.7 closes with latent diffusion: instead of denoising pixels, compress the image into a small latent with an autoencoder and denoise there, a change that dropped the compute cost by an order of magnitude and put Stable Diffusion on laptops. By the end you will understand not just how to call a pipeline but why each piece exists.
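Two of these ideas fit in a few lines each. The sketch below shows classifier-free guidance as an extrapolation between an unconditional and a conditional noise prediction, and the arithmetic behind the latent-space saving. The guidance convention used here (a scale of 1 returns the conditional prediction) is one common choice and may differ from the one in Section 33.6. The 8x downsampling and 4 latent channels are the figures commonly associated with Stable Diffusion v1, assumed here for illustration. Both function names are hypothetical.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    guidance_scale = 0 gives the unconditional prediction, 1 the conditional
    one, and values above 1 push further in the conditional direction.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the autoencoder latent for an RGB image (assumed SD-v1-like values)."""
    return (latent_channels, height // downsample, width // downsample)

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((4, 64, 64))   # stand-in for the unconditional prediction
eps_c = rng.standard_normal((4, 64, 64))   # stand-in for the text-conditioned prediction
eps_guided = classifier_free_guidance(eps_u, eps_c, guidance_scale=7.5)

pixels = 3 * 512 * 512
latents = int(np.prod(latent_shape(512, 512)))
compression = pixels // latents            # values denoised per step shrink by this factor
```

In practice the two predictions come from the same network, evaluated once with the prompt and once with an empty or dropped condition.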

The thread running through the chapter is the one promised in Chapter 7: denoising, introduced as a humble image-cleanup operation, returns here as the entire generative engine. The U-Net is the convolution of Chapter 3 made learnable and stacked; the cross-attention that injects text is the attention of Chapter 22; the latent space is the one from Chapter 31. Diffusion is less a new idea than a new way of composing ideas you already hold.

Prerequisites

You should have read Chapter 30: Foundations of Generative Modeling, especially its treatment of energy-based models, score functions, and Langevin dynamics, because the score-based view in Section 33.3 builds directly on it. Chapter 31: Autoencoders & VAEs supplies the variational lower bound that Section 33.2 reuses, the denoising-autoencoder intuition, and the autoencoder that Section 33.7 repurposes for latent diffusion. From the deep-learning part you need the PyTorch training loop of Chapter 18, the convolution and U-Net structure that the denoiser is built from, and the self- and cross-attention of Chapter 22 that conditions the network. Comfort with Gaussian distributions, the reparameterization trick, and basic stochastic calculus notation (you will see $dx = f\,dt + g\,dW$, but we explain every symbol) makes the derivations concrete. The classical denoising of Chapter 7 is the conceptual seed of the whole chapter.

Chapter Roadmap

Remember the Chapter in One Card

If you carry three things out of this chapter, carry these. First, the recipe in three words: destroy, then rebuild. A fixed forward process drowns an image in Gaussian noise, and a single learned network undoes one step of that corruption at a time, so generation is just denoising run many times from pure static. Second, the three views of one model, the same denoiser seen through three lenses: the variational view (DDPM's noise-prediction loss, Section 33.2), the score view (the network estimates the gradient of log-density, Section 33.3), and the flow view (it learns a velocity field along a path from noise to data, Section 33.5); train one, get all three. Third, the step-count dial: the thousand steps of Section 33.1 are not fundamental but a deployment choice that fast samplers (Section 33.4) and straight paths (Section 33.5) cut to a handful, with one scalar of guidance (Section 33.6) trading diversity for fidelity and one autoencoder (Section 33.7) moving the whole process into a cheap latent. Destroy then rebuild, three views of one model, the step-count dial: that triad is the skeleton under every diffusion system you will meet.
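
The "train one, get all three" claim can be written as a few identities. This sketch assumes the generic noising path $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The symbols $\alpha_t$, $\sigma_t$, $\hat{\epsilon}$, $\hat{x}_0$, and $\hat{v}$ are introduced here for illustration and may differ from the notation of the individual sections.

$$
\begin{aligned}
\hat{x}_0 &= \frac{x_t - \sigma_t\,\hat{\epsilon}(x_t, t)}{\alpha_t} \\
\nabla_{x_t} \log p_t(x_t) &\approx -\frac{\hat{\epsilon}(x_t, t)}{\sigma_t} \\
\hat{v}(x_t, t) &= \dot{\alpha}_t\,\hat{x}_0 + \dot{\sigma}_t\,\hat{\epsilon}(x_t, t)
\end{aligned}
$$

For DDPM, $\alpha_t = \sqrt{\bar{\alpha}_t}$ and $\sigma_t = \sqrt{1 - \bar{\alpha}_t}$. Under one common straight-path convention, $\alpha_t = 1 - t$ and $\sigma_t = t$, the last line reduces to $\hat{v} = \hat{\epsilon} - \hat{x}_0$.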

What's Next?

This chapter gives you the generative engine; the next chapter gives it a voice. Chapter 34: Text-to-Image Systems takes the conditional diffusion model of Section 33.6 and the latent backbone of Section 33.7 and asks how a sentence becomes a picture: how a text encoder like CLIP or T5 turns a prompt into the conditioning vectors that cross-attention consumes, how the major systems (Stable Diffusion, DALL-E, Imagen, and the SD3 and FLUX generation that adopted flow matching from Section 33.5) differ, and how to prompt and evaluate them. From there, Chapter 35 shows how to edit and control diffusion outputs with masks, edges, and inversion, and Chapter 36 extends the same denoising idea into time and three dimensions. Everything generative that follows is built on the seven sections you are about to read.

Bibliography & Further Reading

Foundational Papers

Sohl-Dickstein, J. et al. "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML (2015). arXiv:1503.03585
The origin of diffusion generative models. It framed generation as reversing a gradual noising process borrowed from statistical physics, the forward-and-reverse idea of Section 33.1, years before the technique became practical.
Ho, J., Jain, A., Abbeel, P. "Denoising Diffusion Probabilistic Models." NeurIPS (2020). arXiv:2006.11239
DDPM, the paper that made diffusion competitive with GANs and the backbone of Section 33.2. It introduced the simplified noise-prediction loss and the linear schedule that the whole field built on.
Song, Y. et al. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR (2021). arXiv:2011.13456
The unifying SDE framework of Section 33.3. It showed DDPM and score matching are the same model in continuous time, introduced the VE and VP SDEs, and derived the probability-flow ODE.
Song, J., Meng, C., Ermon, S. "Denoising Diffusion Implicit Models." ICLR (2021). arXiv:2010.02502
DDIM, the deterministic fast sampler of Section 33.4. It defined a non-Markovian forward process that shares DDPM's training objective but allows large, deterministic sampling steps.
Rombach, R. et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR (2022). arXiv:2112.10752
Latent diffusion (Stable Diffusion), the architecture of Section 33.7. By diffusing in a compressed autoencoder latent rather than pixels, it cut compute by an order of magnitude and made open-weight text-to-image possible.

Recent Research (2022-2026)

Ho, J., Salimans, T. "Classifier-Free Diffusion Guidance." NeurIPS Workshop (2021). arXiv:2207.12598
Classifier-free guidance, the core of Section 33.6 and the mechanism behind every prompt-strength control. One network is trained with and without conditioning, and the two predictions are extrapolated at sampling time.
Karras, T. et al. "Elucidating the Design Space of Diffusion-Based Generative Models (EDM)." NeurIPS (2022). arXiv:2206.00364
The EDM paper that cleaned up diffusion's notation and design choices, the preconditioning, schedule, and second-order sampler referenced in Sections 33.2 and 33.4. The modern practitioner's reference for what knobs matter.
Lipman, Y. et al. "Flow Matching for Generative Modeling." ICLR (2023). arXiv:2210.02747
Flow matching, the simpler simulation-free training objective of Section 33.5. It generalizes and clarifies diffusion as learning a velocity field that transports noise to data along chosen probability paths.
Song, Y. et al. "Consistency Models." ICML (2023). arXiv:2303.01469
Consistency models of Section 33.5, which learn to map any point on a probability-flow trajectory directly to its endpoint, enabling one- or few-step generation without the quality cliff of naive step reduction.
Peebles, W., Xie, S. "Scalable Diffusion Models with Transformers (DiT)." ICCV (2023). arXiv:2212.09748
The diffusion transformer of Section 33.7. It replaces the U-Net with a ViT-style backbone that scales cleanly, and it is the architecture family reported behind the Sora and Veo video models, SD3 and SD3.5, FLUX, and most 2024-onward frontier image and video models.
Esser, P. et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3)." ICML (2024). arXiv:2403.03206
Stable Diffusion 3, which combines the rectified flow of Section 33.5 with the DiT backbone of Section 33.7, the clearest 2024 example of the chapter's ideas assembled into a frontier system.
Karras, T. et al. "Guiding a Diffusion Model with a Bad Version of Itself (Autoguidance)." NeurIPS (2024). arXiv:2406.02507
Autoguidance, the guidance refinement of Section 33.6. By steering a strong model with a smaller, under-trained copy of itself instead of an unconditional model, it improves fidelity and diversity together and set a 1.01 ImageNet 64x64 FID, the cleanest illustration of how guidance is still being rethought.

Books

Prince, S. J. D. Understanding Deep Learning. MIT Press (2023). udlbook.github.io/udlbook
Chapter 18 gives the clearest textbook derivation of DDPM and the variational bound of Section 33.2, with figures that make the forward and reverse processes intuitive. Free online.
Murphy, K. P. Probabilistic Machine Learning: Advanced Topics. MIT Press (2023). probml.github.io/pml-book
Covers diffusion, score matching, and SDEs (Section 33.3) within the broader probabilistic-modeling landscape, with the rigor to connect them to energy-based models and normalizing flows. Free online.

Tools & Libraries

Hugging Face diffusers. github.com/huggingface/diffusers
The reference library for diffusion: schedulers (DDPM, DDIM, DPM-Solver), pipelines (Stable Diffusion, SDXL, SD3, FLUX), and the U-Net and DiT building blocks. The library shortcut behind nearly every code example in this chapter.
Karras, T. et al. EDM reference code (NVlabs). github.com/NVlabs/edm
The clean, well-documented implementation of the EDM formulation, schedules, and second-order sampler discussed in Sections 33.2 and 33.4. The best codebase to read for diffusion done carefully.
Karras, T. et al. "Analyzing and Improving the Training Dynamics of Diffusion Models (EDM2)." CVPR (2024). arXiv:2312.02696
The 2024 follow-up that set a state-of-the-art ImageNet generation result by fixing magnitude growth in the network, a strong modern baseline referenced across the chapter's research-frontier callouts.

Tutorials & Explainers

Weng, L. "What are Diffusion Models?" Lil'Log (2021, updated). lilianweng.github.io
The most thorough single-page derivation of DDPM, the SDE view, guidance, and fast samplers, covering essentially every section of this chapter with consistent notation. Read it alongside the math.
Hugging Face Diffusion Models Course. github.com/huggingface/diffusion-models-class
A hands-on notebook course that trains a diffusion model from scratch and then with diffusers, mirroring the build-then-shortcut structure of Sections 33.1 and 33.7.