"They poured static on me until I forgot what I was, then taught a network to un-forget me one grain at a time. I came back as a photograph of a cat I had never seen. The procedure was undignified, I admit, but the results speak for themselves."
A Diffusion Model, Halfway Through Denoising
A diffusion model learns to generate by mastering the opposite of destruction. Take a clean image and add Gaussian noise in many small steps until nothing is left but static, then train a network to undo one step of that corruption. Once the network can do that, you can start from pure static and walk all the way back to a brand-new image. This single idea, learned iterative denoising, is the engine behind Stable Diffusion, DALL-E, Midjourney, and the video and 3D generators of the chapters that follow. It is also a direct descendant of the denoising you met in classical form in Chapter 7 and in learned form in Chapter 31; the difference is that diffusion denoises not once but dozens to hundreds of times, each step nudging samples toward the data distribution. This chapter builds the idea from the forward corruption process up, shows the three equivalent views that explain why it works (the variational view, the score-based view, and the flow view), then turns to the engineering that made it fast and controllable: efficient samplers, guidance, and the latent-space trick that let the whole thing run on a single consumer GPU.
Chapter Overview
For two chapters you have studied generators that produce an image in a single forward pass. The variational autoencoder of Chapter 31 decodes a latent vector into a picture and trains against a reconstruction-plus-regularization objective; the generative adversarial network of Chapter 32 plays a generator against a discriminator until the fakes fool the critic. Both are powerful and both are fragile. VAEs tend to produce blurry samples because the reconstruction loss averages over plausible outputs; GANs produce sharp samples but train through a delicate, often unstable minimax game and can collapse to a handful of modes. Diffusion models sidestep both problems by giving up the single forward pass. Instead of asking a network to invent an entire image at once, they ask it to perform a much easier task many times: remove a little noise.
The recipe is almost suspiciously simple. The forward process takes a real image and adds a controlled amount of Gaussian noise, repeatedly, over hundreds or even a thousand steps, until the image is statistically indistinguishable from random static. This process has no learnable parameters at all; it is just a fixed corruption schedule. The reverse process is where the learning happens: a neural network, traditionally a convolutional U-Net with attention, is trained to predict the noise that was added at a given step, so that it can be subtracted off. Run the reverse process from pure noise and you generate a sample. Section 33.1 builds both processes from scratch and shows the one beautiful algebraic shortcut, the closed-form jump to any noise level, that makes training tractable.
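The closed-form jump can be previewed in a few lines. The following is a minimal sketch in plain Python (not the PyTorch build of Section 33.1), assuming a linear schedule with the commonly used defaults of 1000 steps and betas from 1e-4 to 0.02; the function names `linear_alpha_bars` and `q_sample` are illustrative, not the chapter's API.

```python
import math
import random


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for a linear beta schedule.

    The default values are an assumption (a common choice), not fixed by the text.
    """
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t without simulating the steps before it.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    Returns the noisy sample and the noise that was used (the training target).
    """
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise_scale * e for x, e in zip(x0, eps)]
    return x_t, eps


alpha_bars = linear_alpha_bars()
rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 1.0]  # a toy four-"pixel" image

x_early, _ = q_sample(x0, 10, alpha_bars, rng)    # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bars, rng)    # almost pure noise
```

Because `alpha_bars[t]` shrinks toward zero as `t` grows, the signal term fades and the noise term takes over, which is why the last step is close to pure static. Returning `eps` alongside `x_t` reflects how a noise-prediction network would be supervised.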
With the machinery in place, the chapter turns to understanding. Section 33.2 formalizes the denoising diffusion probabilistic model (DDPM): the noise schedule, the three equivalent parameterizations of what the network predicts, and the variational bound that justifies the simple noise-prediction loss everyone actually uses. Section 33.3 reveals that the same model is, in the limit of infinitely many steps, a stochastic differential equation whose drift is the score of the data distribution, the gradient of log-density you first met in the energy-based models of Chapter 30. That continuous view unlocks the probability-flow ODE, a deterministic path between noise and data that the fast samplers of Section 33.4 exploit to cut a thousand sampling steps down to twenty.
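To make the step-count claim concrete, here is a minimal sketch of a deterministic DDIM-style sampler on a one-dimensional toy problem, in plain Python. It rests on loud assumptions: the data are taken to be Gaussian, N(mu, s²), so the ideal noise prediction has a closed form (`oracle_eps`) that stands in for a trained network; the schedule is the common linear one; and every function name is hypothetical.

```python
import math


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for a linear beta schedule (assumed defaults)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def oracle_eps(x_t, abar, mu, s):
    """Exact E[eps | x_t] when the data are N(mu, s^2).

    This replaces the learned network purely for illustration.
    """
    var_t = abar * s * s + (1.0 - abar)
    return math.sqrt(1.0 - abar) * (x_t - math.sqrt(abar) * mu) / var_t


def ddim_sample(x_T, alpha_bars, num_steps, mu, s):
    """Deterministic DDIM-style sampling over an evenly spaced subset of steps."""
    T = len(alpha_bars)
    # Timesteps from T-1 down to 0; requires num_steps >= 2.
    ts = [round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)][::-1]
    x = x_T
    for i, t in enumerate(ts):
        abar = alpha_bars[t]
        # After the last timestep we land on clean data (alpha_bar = 1).
        abar_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else 1.0
        eps = oracle_eps(x, abar, mu, s)
        # Estimate the clean sample, then re-noise it to the next, lower level.
        x0_pred = (x - math.sqrt(1.0 - abar) * eps) / math.sqrt(abar)
        x = math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps
    return x


alpha_bars = linear_alpha_bars()
mu, s = 2.0, 0.5
x_T = 1.0  # one fixed draw of "static"

sample_20 = ddim_sample(x_T, alpha_bars, 20, mu, s)
sample_1000 = ddim_sample(x_T, alpha_bars, 1000, mu, s)
```

The same starting noise yields a sample near the data in both cases, because the path is deterministic. In this toy setting the 20-step result lands somewhat closer to the mean than the 1000-step result, a small illustration of the discretization error that higher-order solvers aim to reduce.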
The final three sections are about making diffusion practical and steerable. Section 33.5 presents the 2022 to 2024 reframing (flow matching, rectified flow, and consistency models) that straightens the generative path and pushes high-quality sampling toward a single step. Section 33.6 covers guidance, the technique that lets you trade diversity for fidelity and, in its classifier-free form, is the mechanism behind the "prompt strength" slider found in most image tools. Section 33.7 closes with latent diffusion: instead of denoising pixels, compress the image into a small latent with an autoencoder and denoise there, a change that dropped the compute cost by an order of magnitude and put Stable Diffusion on laptops. By the end you will understand not just how to call a pipeline but why each piece exists.
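The "prompt strength" dial comes down to one line of arithmetic. Below is a minimal sketch in plain Python, assuming the convention in which a scale of 1 recovers the conditional prediction (some papers parameterize the scale differently); the function name and the toy numbers are illustrative only.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional noise prediction toward the conditional one.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional prediction, and values above 1 push past it.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # toy noise prediction with the prompt dropped
eps_cond = [1.0, 0.5]    # toy noise prediction with the prompt attached

no_guidance = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
strong_guidance = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
```

In a real sampler the two predictions would come from the same network evaluated twice per step, once with and once without the conditioning, and the combined prediction would replace the plain one in the denoising update.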
The thread running through the chapter is the one promised in Chapter 7: denoising, introduced as a humble image-cleanup operation, returns here as the entire generative engine. The U-Net is the convolution of Chapter 3 made learnable and stacked; the cross-attention that injects text is the attention of Chapter 22; the latent space is the one from Chapter 31. Diffusion is less a new idea than a new way of composing ideas you already hold.
Prerequisites
You should have read Chapter 30: Foundations of Generative Modeling, especially its treatment of energy-based models, score functions, and Langevin dynamics, because the score-based view in Section 33.3 builds directly on it. Chapter 31: Autoencoders & VAEs supplies the variational lower bound that Section 33.2 reuses, the denoising-autoencoder intuition, and the autoencoder that Section 33.7 repurposes for latent diffusion. From the deep-learning part you need the PyTorch training loop of Chapter 18, the convolution and U-Net structure that the denoiser is built from, and the self- and cross-attention of Chapter 22 that conditions the network. Comfort with Gaussian distributions, the reparameterization trick, and basic stochastic calculus notation (you will see $dx = f\,dt + g\,dW$, but we explain every symbol) makes the derivations concrete. The classical denoising of Chapter 7 is the conceptual seed of the whole chapter.
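As a preview of that notation, the three equations that Section 33.3 revolves around can be written as follows. This is a sketch in the standard notation of the score-based literature; the symbols $p_t$, $\bar{W}$, and $\beta(t)$ are assumptions introduced here, and the chapter's own definitions take precedence.

$$
dx = f(x,t)\,dt + g(t)\,dW \qquad \text{(forward noising SDE)}
$$

$$
dx = \left[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{W} \qquad \text{(reverse-time SDE)}
$$

$$
\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x) \qquad \text{(probability-flow ODE)}
$$

Here $f$ is the drift (the deterministic pull on $x$), $g$ is the diffusion coefficient (how much noise is injected), $W$ is a Wiener process (Brownian motion) and $\bar{W}$ its reverse-time counterpart, $p_t$ is the density of noisy samples at time $t$, and $\nabla_x \log p_t(x)$ is the score. In the variance-preserving case, for example, $f(x,t) = -\tfrac{1}{2}\beta(t)\,x$ and $g(t) = \sqrt{\beta(t)}$.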
Chapter Roadmap
- 33.1 Destroying & Rebuilding: The Forward & Reverse Processes. The forward noising process with no learnable parameters, the closed-form jump to any noise level, and the learned reverse process that walks static back to an image. Both built from scratch in PyTorch, with a tiny trainable denoiser on a toy dataset.
- 33.2 DDPM: Noise Schedules, Parameterizations & the Variational View. The denoising diffusion probabilistic model in full: linear and cosine noise schedules, the three equivalent prediction targets (noise, clean image, and velocity), and the variational bound that collapses to the simple noise-prediction loss used in practice.
- 33.3 The Score-Based View: VE/VP SDEs & the Probability-Flow ODE. Diffusion as a stochastic differential equation, the variance-exploding and variance-preserving formulations, why the reverse drift is the score of the data, and the deterministic probability-flow ODE that shares the same marginals as the noisy SDE.
- 33.4 Fast Sampling: DDIM, Solvers & Step Distillation. How to cut sampling from a thousand steps to twenty: the deterministic DDIM sampler, high-order ODE solvers like DPM-Solver, and progressive distillation, which trains a student to cover in one step what its teacher needs several steps for.
- 33.5 Flow Matching, Rectified Flow & Consistency Models. The modern reframing that straightens the generative path: conditional flow matching as a simpler training objective, rectified flow's straight-line transport, and consistency models that learn to map any point on a trajectory to its endpoint in one step.
- 33.6 Guidance: Classifier & Classifier-Free. How to steer generation toward a class or a prompt: classifier guidance using the gradient of a noise-robust classifier, and the now-dominant classifier-free guidance that trains one network on both conditional and unconditional objectives and extrapolates between them at sampling time.
- 33.7 Latent Diffusion: Compress First, Then Diffuse. The trick that put diffusion on consumer hardware: train an autoencoder to compress images into a small perceptual latent, run the entire diffusion process there, and decode once at the end. The architecture of Stable Diffusion, plus the modern diffusion transformer (DiT) backbone.
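A quick back-of-the-envelope calculation shows why compressing first helps. The sketch below assumes a 512×512 RGB image and an autoencoder that downsamples each side by 8 and uses 4 latent channels; these figures match the original Stable Diffusion configuration but are not stated in the roadmap above, so treat them as an assumption.

```python
def num_values(height, width, channels):
    """Number of scalar values the denoiser must process at each step."""
    return height * width * channels


image_size = 512
downsample_factor = 8   # assumed autoencoder spatial compression per side
latent_channels = 4     # assumed number of latent channels

pixel_values = num_values(image_size, image_size, 3)
latent_values = num_values(
    image_size // downsample_factor,
    image_size // downsample_factor,
    latent_channels,
)
compression_ratio = pixel_values / latent_values

print(pixel_values, latent_values, compression_ratio)  # 786432 16384 48.0
```

Fewer values per step is not the same thing as proportionally less compute, since the cost also depends on the denoiser's architecture, but it indicates where the order-of-magnitude saving comes from.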
If you carry three things out of this chapter, carry these. First, the recipe in three words: destroy, then rebuild. A fixed forward process drowns an image in Gaussian noise, and a single learned network undoes one step of that corruption at a time, so generation is just denoising run many times from pure static. Second, the three views of one model, the same denoiser seen through three lenses: the variational view (DDPM's noise-prediction loss, Section 33.2), the score view (the network estimates the gradient of log-density, Section 33.3), and the flow view (it learns a velocity field along a path from noise to data, Section 33.5); train one, get all three. Third, the step-count dial: the thousand steps of Section 33.1 are not fundamental but a deployment choice that fast samplers (Section 33.4) and straight paths (Section 33.5) cut to a handful, with one scalar of guidance (Section 33.6) trading diversity for fidelity and one autoencoder (Section 33.7) moving the whole process into a cheap latent. Destroy then rebuild, three views of one model, the step-count dial: that triad is the skeleton under every diffusion system you will meet.
What's Next?
This chapter gives you the generative engine; the next chapter gives it a voice. Chapter 34: Text-to-Image Systems takes the conditional diffusion model of Section 33.6 and the latent backbone of Section 33.7 and asks how a sentence becomes a picture: how a text encoder like CLIP or T5 turns a prompt into the conditioning vectors that cross-attention consumes, how the major systems (Stable Diffusion, DALL-E, Imagen, and the SD3 and FLUX generation that adopted flow matching from Section 33.5) differ, and how to prompt and evaluate them. From there, Chapter 35 shows how to edit and control diffusion outputs with masks, edges, and inversion, and Chapter 36 extends the same denoising idea into time and three dimensions. Everything generative that follows is built on the seven sections you are about to read.