Chapter 31: Autoencoders & Variational Autoencoders

"They asked me to squeeze a megapixel through a thirty-two-dimensional keyhole and then blamed me for the blur. I did not lose your cat. I kept the idea of your cat and threw away which whiskers it had."
A Latent Vector Looking for Meaning

Big Picture

An autoencoder learns to copy its input through a narrow bottleneck, and the bottleneck is the point: forced to reconstruct from a few numbers, the network must discover what those numbers should mean. That single idea, compression as representation, runs the whole chapter. A plain autoencoder gives you a code but no way to sample from it. The variational autoencoder adds a probabilistic twist, training the bottleneck to look like a known distribution so that drawing random codes and decoding them produces new, plausible images. Along the way the VAE introduces three tools you will reuse for the rest of Part IV: a tractable lower bound on the data likelihood (the ELBO), the reparameterization trick that lets gradients flow through randomness, and amortized inference that replaces per-image optimization with a single learned encoder. The chapter then follows the latent idea into its most consequential forms: disentangled codes that separate the factors of an image, deep hierarchies that model fine and coarse structure at once, and discrete codebooks whose tokens become the alphabet that diffusion and autoregressive image models speak.
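For orientation, the lower bound named above is commonly written as shown here. The notation is an assumption made for this sketch rather than something fixed by the chapter so far: $q_\phi(z \mid x)$ stands for the encoder's distribution over codes, $p_\theta(x \mid z)$ for the decoder, and $p(z)$ for the prior.

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

The first term rewards faithful reconstruction; the second pulls the encoder's distribution toward the prior, which is what makes the code samplable. Section 31.3 develops the bound from scratch.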

Chapter Overview

Chapter 30 drew the family tree of generative models and gave you the question every chapter in this part answers in its own way: how do you model the distribution of natural images well enough to draw new samples from it? The autoencoder is the gentlest possible entry point, because it begins with a problem you already understand from Chapter 4 and Chapter 7: compression. Take an image, push it through an encoder that shrinks it to a small code, then a decoder that tries to rebuild the original. Train the pair to minimize reconstruction error and the code becomes a learned, compact description of the image. No labels are involved, which makes the autoencoder the most direct relative of the self-supervised methods of Chapter 25.
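The encode, decode, and minimize loop described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the chapter's PyTorch model: the data are five 2-D points on a line, the code is a single number, and the encoder and decoder share one weight vector `w` (a tied-weight linear autoencoder). The names `reconstruction_loss` and `train`, the learning rate, and the step count are all illustrative choices.

```python
# Minimal sketch: a 2-D -> 1-D linear autoencoder with tied weights.
#   encode: z = w . x        decode: x_hat = z * w
# All names and hyperparameters here are illustrative assumptions.

def reconstruction_loss(w, data):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]            # one-number code (the bottleneck)
        r = (x[0] - z * w[0], x[1] - z * w[1])   # residual x - x_hat
        total += r[0] ** 2 + r[1] ** 2
    return total / len(data)


def train(data, w=(0.5, 0.1), lr=0.05, steps=200):
    """Plain gradient descent on the mean reconstruction loss."""
    w = list(w)
    for _ in range(steps):
        g = [0.0, 0.0]
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]
            r = (x[0] - z * w[0], x[1] - z * w[1])
            rw = r[0] * w[0] + r[1] * w[1]
            # Gradient of ||x - (w.x) w||^2 with respect to w:
            #   -2 * (z * r + (r.w) * x)
            g[0] += -2.0 * (z * r[0] + rw * x[0])
            g[1] += -2.0 * (z * r[1] + rw * x[1])
        w[0] -= lr * g[0] / len(data)
        w[1] -= lr * g[1] / len(data)
    return tuple(w)


# Toy data: five points on the line through the origin with direction (0.6, 0.8).
data = [(0.6 * t, 0.8 * t) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
loss_before = reconstruction_loss((0.5, 0.1), data)
w_learned = train(data)
loss_after = reconstruction_loss(w_learned, data)
```

Because the data really do live on a one-dimensional line, a one-number code is enough, and gradient descent moves `w` onto the direction of that line. That is the linear, PCA-like case in miniature: the bottleneck forces the code to find the structure that is actually there.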

The trouble is that a plain autoencoder is a compressor, not a generator. Its latent space has holes: pick a random point and the decoder produces garbage, because nothing during training ever told the code to fill the space smoothly. Section 31.3 is the turning point of the chapter, where the variational autoencoder fixes exactly this. By treating the code as a probability distribution and pushing that distribution toward a simple prior, the VAE makes the latent space dense and samplable: every point decodes to something reasonable, and interpolating between two codes morphs smoothly between two images. The price is a new objective, the evidence lower bound, and a clever piece of engineering, the reparameterization trick, that together are worth understanding deeply because they reappear in diffusion's training objective in Chapter 33.
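The two ingredients named above are small enough to sketch without a framework. The version below assumes a diagonal Gaussian encoder that outputs a mean and a log-variance per latent dimension and a standard normal prior; the function names and the example numbers are hypothetical. Plain Python has no autograd, so this shows only the forward computation; on PyTorch tensors the same arithmetic would let gradients reach the encoder.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per dimension.

    The randomness lives entirely in eps, so z is a deterministic,
    differentiable function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.3, -1.2]        # hypothetical encoder means for one image
log_var = [-0.5, 0.1]   # hypothetical encoder log-variances
z = reparameterize(mu, log_var, rng)     # a code to hand to the decoder
kl = kl_to_standard_normal(mu, log_var)  # the penalty that pulls q toward the prior
```

Predicting the log-variance rather than the variance is a common design choice: it keeps the variance positive without any constraint on the network output.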

The first two sections build the foundation. Section 31.1 constructs the plain autoencoder, explains the bottleneck, and shows where its latent space breaks down. Section 31.2 covers the two most useful non-generative variants: the denoising autoencoder, which learns robust features by reconstructing clean images from corrupted ones and is the direct conceptual ancestor of diffusion, and the sparse autoencoder, whose activation penalty has found a surprising second life since 2023 as a leading tool for interpreting what large models represent.
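The difference between the two variants is easiest to see in their loss functions. The sketch below is a framework-free illustration under stated assumptions: images are flat lists of floats, `encode` and `decode` are any callables supplied by the caller, the corruption is additive Gaussian noise, and sparsity is encouraged with an L1 penalty on the code. Both are common choices, but they are not the only ones, and every name here is hypothetical.

```python
import random


def corrupt(x, noise_std, rng):
    """Additive Gaussian corruption of a flat list of pixel values."""
    return [v + rng.gauss(0.0, noise_std) for v in x]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def denoising_loss(encode, decode, x_clean, noise_std, rng):
    """The network sees the corrupted input but is scored against the clean one."""
    x_noisy = corrupt(x_clean, noise_std, rng)
    return mse(decode(encode(x_noisy)), x_clean)


def sparse_loss(encode, decode, x, l1_weight):
    """Ordinary reconstruction error plus an L1 penalty on the code's activations."""
    code = encode(x)
    return mse(decode(code), x) + l1_weight * sum(abs(c) for c in code)


# Stand-in "network" for illustration only: the identity map.
identity = lambda v: list(v)
x = [0.5, -0.5, 0.0]
loss_dae = denoising_loss(identity, identity, x, noise_std=0.5, rng=random.Random(0))
loss_sae = sparse_loss(identity, identity, x, l1_weight=0.1)
```

With the identity stand-in, the denoising loss is nonzero because copying a corrupted input does not recover the clean one; a real denoising autoencoder has to learn something about the data to do better. The sparse loss is nonzero for a different reason: the reconstruction is perfect, but the code pays for every active unit.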

The last three sections follow the latent idea to its frontiers. Section 31.4 asks the latent space to be not just smooth but meaningful, separating an image's independent factors into independent code axes, and confronts posterior collapse, the failure mode where a too-powerful decoder ignores the code entirely. Section 31.5 stacks latents into a hierarchy so that a single model can capture both global layout and fine texture, the design that lets NVAE and its descendants reach image quality competitive with GANs and diffusion. Section 31.6 replaces the continuous code with a learned discrete codebook: VQ-VAE turns an image into a grid of tokens, and those tokens are the bridge to the autoregressive and latent-diffusion image models that dominate the rest of Part IV.
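The step that turns a continuous code into tokens is a nearest-neighbor lookup, sketched here in plain Python. The four-entry codebook and the encoder outputs are made-up numbers, and the function names are hypothetical. The sketch covers only the forward lookup; it does not show how the codebook is learned or how gradients pass through the lookup.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def lookup(indices, codebook):
    """Replace each token index by its codebook vector (what the decoder sees)."""
    return [codebook[i] for i in indices]


# Hypothetical 4-entry codebook of 2-D vectors and four encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
encoder_outputs = [(0.1, -0.2), (0.9, 0.2), (0.2, 0.8), (0.7, 0.9)]

tokens = quantize(encoder_outputs, codebook)   # integer tokens, one per vector
quantized = lookup(tokens, codebook)           # snapped vectors for the decoder
```

Applied at every position of a spatial feature map, the same lookup yields the grid of integer tokens described above, which a downstream model can treat much like a sequence of words.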

The unifying thread is the latent space itself, the idea that an image lives near a low-dimensional surface inside its enormous pixel space, and that learning the coordinates of that surface is most of the work of generation. You met the concept in Chapter 30; here you learn to build it three different ways: as a deterministic code, as a continuous distribution, and as a discrete codebook. In Chapter 32 you will meet a fourth, the adversarial latent, and in Chapter 33 the VAE latent of this chapter becomes the very space that latent diffusion operates in.

If you remember one sentence from this chapter, make it the one in the box below. Every variant you will meet is the same encoder-decoder skeleton under a different constraint, and the constraint is the whole story.

Remember This: The Constraint Decides What the Code Learns

One skeleton, six constraints. Every model in this chapter is the same encoder-bottleneck-decoder; what changes is the single constraint you place on the code, and that constraint alone decides what the code becomes:

Narrow it (undercomplete) and the code learns structure: the plain autoencoder (Section 31.1).
Corrupt the input and the code learns the data manifold, the seed of diffusion: the denoising autoencoder (Section 31.2).
Penalize active units and the code becomes interpretable: the sparse autoencoder (Section 31.2).
Make it a distribution near a prior and the code becomes samplable: the VAE (Section 31.3).
Stack codes by scale and the code becomes sharp: the hierarchical VAE (Section 31.5).
Snap it to a codebook and the code becomes a language: VQ-VAE (Section 31.6).

The phrase to carry out the door: compression by itself gives you a code; the constraint gives you a generator.

Prerequisites

You should have read Chapter 30: Foundations of Generative Modeling for the vocabulary of modeling $p(x)$, latent variables, sampling, and the maximum-likelihood objective; this chapter assumes you know what a generative model is trying to do. The networks here are built with the PyTorch tensor mechanics, autograd, and training loops of Chapter 18: Neural Networks & PyTorch for Vision, and the encoders and decoders are convolutional, so the convolutional layers of Chapter 19 and the transposed convolutions for upsampling should be familiar. The denoising autoencoder of Section 31.2 builds directly on the classical denoising of Chapter 7: Image Restoration & Enhancement. On the math side you need comfort with the Gaussian distribution, expectation, and the idea of a probability density; the KL-divergence and ELBO derivations of Section 31.3 are developed from these from scratch, but a prior brush with them from Chapter 30 will make them faster reading.

Chapter Roadmap

31.1 Autoencoders: Compression as Representation
The encoder-bottleneck-decoder architecture, the reconstruction objective, why undercompleteness forces the code to learn structure, the linear case as PCA, and the fatal flaw that the latent space is full of holes you cannot sample from. Built from scratch in PyTorch.

31.2 Denoising & Sparse Autoencoders
Two variants that make the code useful without making it generative: the denoising autoencoder that learns robust features by undoing corruption (the conceptual seed of diffusion), and the sparse autoencoder whose activation penalty became, from 2023 onward, a leading tool for interpreting large models.

31.3 The VAE: ELBO, Reparameterization & Amortized Inference
The probabilistic twist that made decoders generative: a stochastic encoder, a prior on the latent code, the evidence lower bound that turns intractable likelihood into a trainable objective, the reparameterization trick that lets gradients flow through sampling, and amortized inference.

31.4 Disentanglement, beta-VAE & Posterior Collapse
Reweighting the ELBO to push the latent axes toward independent, interpretable factors; the metrics and limits of disentanglement; and posterior collapse, the failure where a powerful decoder learns to ignore the latent code entirely, with the fixes that prevent it.

31.5 Hierarchical VAEs: From Ladder Networks to NVAE
Stacking latents into a deep hierarchy so one model captures global layout and fine texture together: the ladder-network inference path, the depth-stable design tricks of NVAE and very deep VAEs, and how hierarchical VAEs reached image quality competitive with GANs and diffusion.

31.6 Discrete Latents: VQ-VAE & Learned Codebooks
Replacing the continuous code with a learned discrete codebook: vector quantization, the straight-through estimator, the commitment loss, VQ-VAE-2 and VQGAN, and why a grid of discrete tokens became the alphabet that latent diffusion and autoregressive image models speak.

What's Next?

The variational autoencoder gave you a generator that is stable to train and produces a smooth, structured latent space, but its samples are famously soft: the Gaussian decoder and the averaging tendency of maximum likelihood leave VAE images a little blurry. Chapter 32: Generative Adversarial Networks attacks exactly that weakness from the opposite direction. Instead of maximizing a likelihood, a GAN pits a generator against a discriminator in a minimax game, and the adversarial pressure produces the crisp, high-frequency detail that VAEs smooth away. You will recognize the GAN's latent space as a cousin of the one you built here, and the encoder-free training will throw the VAE's strengths (a principled objective, an inference network, a tractable bound on the likelihood) into sharp relief. The two families are not rivals so much as complements: the VQGAN of Section 31.6 already fuses a VQ-VAE with an adversarial loss, and the latent space you learned to build in this chapter is the stage on which the diffusion models of Chapter 33 will eventually run. Compression, it turns out, was never just about saving bytes; it was about learning what an image is.

Bibliography & Further Reading

Foundational Papers

Kingma, D. P. & Welling, M. "Auto-Encoding Variational Bayes." ICLR (2014). arXiv:1312.6114

The variational autoencoder. Introduces the ELBO objective, the reparameterization trick, and amortized inference of Section 31.3. The single most important paper in the chapter, and the source of machinery diffusion later reuses.

📄 Paper

Rezende, D. J., Mohamed, S. & Wierstra, D. "Stochastic Backpropagation and Approximate Inference in Deep Generative Models." ICML (2014). arXiv:1401.4082

The independent, contemporaneous derivation of the same VAE idea. Reading it alongside Kingma and Welling clarifies which parts of the recipe are essential and which are presentational.

📄 Paper

Vincent, P. et al. "Stacked Denoising Autoencoders." JMLR (2010). jmlr.org/papers/v11/vincent10a

The denoising autoencoder of Section 31.2. Shows that learning to undo corruption yields robust features, and is the clearest conceptual ancestor of the iterative denoising that defines diffusion in Chapter 33.

📄 Paper

Higgins, I. et al. "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." ICLR (2017). openreview.net/forum?id=Sy2fzU9gl

beta-VAE of Section 31.4. A single scalar that overweights the KL term pushes the latent axes toward independent, interpretable factors, launching the disentanglement literature.

📄 Paper

van den Oord, A., Vinyals, O. & Kavukcuoglu, K. "Neural Discrete Representation Learning (VQ-VAE)." NeurIPS (2017). arXiv:1711.00937

VQ-VAE of Section 31.6. Replaces the continuous latent with a learned discrete codebook and a straight-through estimator, the foundation for token-based image models.

📄 Paper

Architecture & Method Papers

Sønderby, C. K. et al. "Ladder Variational Autoencoders." NeurIPS (2016). arXiv:1602.02282

The ladder VAE of Section 31.5. Couples a bottom-up and a top-down pass so that deep stacks of latents can actually be trained, the inference structure NVAE later scales.

📄 Paper

Vahdat, A. & Kautz, J. "NVAE: A Deep Hierarchical Variational Autoencoder." NeurIPS (2020). arXiv:2007.03898

NVAE of Section 31.5. Depth-stabilizing residual cells, spectral regularization, and careful normalization let a hierarchical VAE reach image quality competitive with GANs.

📄 Paper

Razavi, A., van den Oord, A. & Vinyals, O. "Generating Diverse High-Fidelity Images with VQ-VAE-2." NeurIPS (2019). arXiv:1906.00446

VQ-VAE-2 of Section 31.6. A two-level codebook hierarchy plus a learned autoregressive prior over the codes produces sharp, large images, proving discrete latents scale.

📄 Paper

Esser, P., Rombach, R. & Ommer, B. "Taming Transformers for High-Resolution Image Synthesis (VQGAN)." CVPR (2021). arXiv:2012.09841

VQGAN of Section 31.6. Adds a perceptual and adversarial loss to VQ-VAE for a crisp, compact code, the autoencoder that latent diffusion (Chapter 33) builds on.

📄 Paper

Bricken, T. et al. "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning." Anthropic (2023). transformer-circuits.pub

The 2023 to 2024 revival of the sparse autoencoder from Section 31.2 as an interpretability tool, recovering human-meaningful features from a trained model's activations. The follow-up "Scaling Monosemanticity" (Anthropic, 2024) extended the same recipe to a production model, extracting tens of millions of features.

📝 Blog Post

Chen, J. et al. "Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models (DC-AE)." ICLR (2025). arXiv:2410.10733

The frontier autoencoder of Section 31.1. Residual autoencoding and decoupled high-resolution adaptation push the spatial compression ratio from the usual 8x up to 32x and beyond while holding reconstruction quality, making high-resolution latent diffusion much cheaper.

📄 Paper

Mentzer, F. et al. "Finite Scalar Quantization: VQ-VAE Made Simple (FSQ)." ICLR (2024). arXiv:2309.15505

The codebook-free alternative of Section 31.6. Quantizing each of a few latent dimensions to a small fixed set of values yields an implicit codebook that matches VQ quality without the commitment loss or the codebook-collapse failure mode.

📄 Paper

Books

Kingma, D. P. & Welling, M. "An Introduction to Variational Autoencoders." Foundations and Trends in Machine Learning (2019). arXiv:1906.02691

A book-length, self-contained treatment of the VAE by its inventors. The reference for every derivation in Sections 31.3 and 31.4 that this chapter compresses.

📖 Book

Prince, S. J. D. "Understanding Deep Learning." MIT Press (2023). udlbook.github.io/udlbook

Open-access. Chapters on latent-variable models and the VAE give an exceptionally clear, figure-rich account that pairs well with this chapter's code-first approach.

📖 Book

Murphy, K. P. "Probabilistic Machine Learning: Advanced Topics." MIT Press (2023). probml.github.io/pml-book

Open-access. The deep dives on variational inference, the ELBO, and deep generative models for readers who want the full probabilistic machinery behind Section 31.3.

📖 Book

Tools & Libraries

PyTorch torch.distributions and torch.nn. pytorch.org/docs/stable/distributions

The Normal distribution class, its rsample method, and the kl_divergence function, which together turn the from-scratch ELBO and reparameterization of Section 31.3 into a few readable lines.

🔧 Tool

Hugging Face Diffusers, AutoencoderKL. huggingface.co/docs/diffusers

The production VAE that compresses images into the latent space Stable Diffusion runs in, the direct industrial descendant of this chapter's encoder-decoder, used in the library shortcuts of Sections 31.3 and 31.6.

🔧 Tool

lucidrains. vector-quantize-pytorch. github.com/lucidrains/vector-quantize-pytorch

A clean, heavily-used implementation of vector quantization and its modern variants (residual VQ, finite scalar quantization), the library shortcut behind the codebook code of Section 31.6.

🔧 Tool

Datasets & Benchmarks

LeCun, Y. et al. "The MNIST Database of Handwritten Digits." yann.lecun.com/exdb/mnist

The 28x28 digit dataset on which every from-scratch autoencoder and VAE in this chapter is trained, small enough to train on a laptop in minutes and to visualize the full latent space.

📊 Dataset

Matthey, L. et al. "dSprites: Disentanglement testing Sprites dataset." (2017). github.com/google-deepmind/dsprites-dataset

The synthetic dataset with known ground-truth factors (shape, scale, rotation, position) used to measure disentanglement in Section 31.4. Its controlled factors make it the standard beta-VAE testbed.

📊 Dataset