Section 31.3: The VAE: ELBO, Reparameterization & Amortized Inference

"You wanted me to be a point. I insisted on being a cloud. The cloud is why you can wander between two faces and never fall off the edge of the world."
A Posterior Distribution Who Refuses to Collapse

Big Picture

The variational autoencoder turns the plain autoencoder into a generator by a single conceptual move: the encoder no longer outputs a point, it outputs a probability distribution over codes, and that distribution is trained to look like a simple prior so that sampling from the prior and decoding produces new, valid images. Making this work requires three ideas that you will reuse for the rest of Part IV. The evidence lower bound (ELBO) replaces the intractable data likelihood with a quantity you can actually compute and maximize, and it splits cleanly into a reconstruction term and a regularization term. The reparameterization trick rewrites the random sampling step so that gradients can flow through it, the one piece of engineering without which the whole thing cannot be trained by backpropagation. And amortized inference replaces a per-image optimization with a single learned encoder, which is why a VAE encodes a new image in one forward pass. Together they fill the holes of Section 31.1 and give you a latent space you can finally sample.

At the end of Section 31.1 we hit a wall: a plain autoencoder learns a code but not a distribution over codes, so sampling a random code and decoding it produces garbage. Chapter 30 framed generation as modeling $p(x)$ and introduced latent-variable models, where $x$ is generated by first drawing a latent $z$ from a prior and then drawing $x$ from a decoder $p_\theta(x \mid z)$. The VAE is the practical recipe for training exactly such a model with neural networks. This section derives it from that latent-variable picture, with every term built up rather than asserted.

1. The Generative Story and the Intractable Likelihood Intermediate

The VAE posits a simple generative story. To make an image, first draw a latent code from a fixed prior, almost always a standard Gaussian $p(z) = \mathcal{N}(0, I)$, then pass it through a decoder network that outputs the parameters of a distribution over images, and draw the image from that. The standard-normal prior is the key choice: it is the known distribution we will be able to sample from at generation time, and it has no holes. The probability the model assigns to a data point $x$ is obtained by integrating over every code that could have produced it:

p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz

This integral is the problem. For any realistic decoder it has no closed form and cannot be computed, because it sums over the entire continuous latent space. We cannot maximize a likelihood we cannot evaluate. The VAE's escape is to introduce a second network, an encoder $q_\phi(z \mid x)$ that proposes which codes are plausible for a given image, and to use it to build a tractable lower bound on $\log p_\theta(x)$ that we maximize instead. The encoder approximates the true but intractable posterior $p_\theta(z \mid x)$, the distribution over codes that could have generated $x$.
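To see why a proposal distribution helps, here is a minimal sketch on a toy model where the integral can be done by hand. Everything in it is an assumption made for illustration: a one-dimensional latent, a linear decoder $p(x \mid z) = \mathcal{N}(z, 0.01)$, and the exact posterior standing in for the encoder (a real encoder only approximates it). The helper `normal_pdf` and the variable names are hypothetical.

```python
import math
import torch

torch.manual_seed(0)

def normal_pdf(value, mean, var):
    """Density of N(mean, var) at value; var is a Python float."""
    return torch.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

noise_var = 0.01                                  # assumed decoder variance
x = torch.tensor(1.5)                             # one hypothetical observation
true_px = normal_pdf(x, 0.0, 1.0 + noise_var)     # exact marginal of this toy model

n = 100
# Blind sampling from the prior: most draws land where p(x | z) is negligible.
z_prior = torch.randn(n)
est_prior = normal_pdf(x, z_prior, noise_var).mean()

# Sampling from a proposal concentrated on the plausible codes
# (here the exact posterior, which a trained encoder would approximate).
post_var = noise_var / (1.0 + noise_var)
post_mean = x / (1.0 + noise_var)
z_q = post_mean + math.sqrt(post_var) * torch.randn(n)
weights = (normal_pdf(x, z_q, noise_var) * normal_pdf(z_q, 0.0, 1.0)
           / normal_pdf(z_q, post_mean, post_var))
est_proposal = weights.mean()

print("exact p(x):         ", true_px.item())
print("prior-sample est.:  ", est_prior.item())
print("proposal-based est.:", est_proposal.item())
```

With the exact posterior as the proposal, every weight equals $p_\theta(x)$, so the estimate has essentially no variance; the prior-sampling estimate depends on how many of its draws happen to land near the observation.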

The Number That Shocks: How Big Is "Cannot Be Computed"

"Intractable" sounds like a technicality until you count. Suppose you tried the most naive thing and approximated the integral $\int p_\theta(x \mid z)\, p(z)\, dz$ on a grid, sampling just ten values along each latent axis. For the modest 20-dimensional code of the VAE below, that grid has $10^{20}$ points, so scoring the likelihood of a single MNIST digit would require $10^{20}$ decoder passes. At a billion passes per second that is over three thousand years per image, and MNIST has sixty thousand of them. Worse, almost every one of those $10^{20}$ codes contributes essentially nothing, because for a given digit only a tiny pocket of latent space has non-negligible $p_\theta(x \mid z)$. This is the whole reason the VAE does not integrate at all. The encoder's job is to find that one tiny pocket in a single forward pass, so the ELBO can be estimated from a handful of well-chosen samples instead of $10^{20}$ blind ones.

2. Deriving the ELBO Advanced

Start from the log-likelihood of a single data point and insert the encoder. The derivation is short and worth following line by line, because the result is the objective you will train with and a close cousin of the diffusion objective in Chapter 33. Multiply and divide inside the log by $q_\phi(z \mid x)$, which leaves the value unchanged but rewrites the integral as an expectation over the encoder (an integral $\int q_\phi(z\mid x)\,h(z)\,dz$ is by definition $\mathbb{E}_{q_\phi}[h(z)]$), and then apply Jensen's inequality (the log of an average is at least the average of the logs):

\log p_\theta(x) = \log \int p_\theta(x \mid z) p(z)\, dz = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z) p(z)}{q_\phi(z \mid x)}\right] \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z) p(z)}{q_\phi(z \mid x)}\right]

The right-hand side is the evidence lower bound, the ELBO. Splitting the log of the product gives the form we actually optimize:

\mathcal{L}_{\text{ELBO}}(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\Vert\,p(z)\big)}_{\text{regularization}}

Read the two terms. The first is the expected log-likelihood of the data under the decoder, which for a Gaussian decoder with fixed variance is, up to a scale factor and an additive constant, the negative squared reconstruction error: it rewards the decoder for rebuilding $x$ from codes the encoder proposes. The second is the Kullback-Leibler divergence between the encoder's distribution and the prior, which measures how far the encoder's proposed code distribution is from the standard normal: the minus sign in front of it rewards the encoder for keeping its codes close to the prior. Maximizing the ELBO therefore reconstructs well and keeps the codes Gaussian-shaped, and that second pressure is exactly what fills the holes. The gap between $\log p_\theta(x)$ and the ELBO is itself a KL divergence (between the encoder and the true posterior), so an encoder close to the true posterior makes the bound tight. Figure 31.3.1 shows the two networks and the two loss terms.

Figure 31.3.1: The variational autoencoder. The encoder $q_\phi$ outputs a mean $\mu$ and log-variance $\log\sigma^2$ rather than a single code. The reparameterization step forms $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so the sampling is differentiable. The decoder reconstructs $x$ from $z$. Two losses act: the reconstruction term (red, bottom) rebuilds the image, and the KL term (red, top) pulls the encoder's distribution toward the standard-normal prior, filling the latent holes of Section 31.1.
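The remark in subsection 2 that a Gaussian decoder's log-likelihood reduces to squared error can be checked numerically. The sketch below assumes a decoder with fixed unit variance, $p_\theta(x \mid z) = \mathcal{N}(\hat{x}, I)$, and uses random stand-in tensors rather than real images; the variable names are illustrative.

```python
import math
import torch

torch.manual_seed(0)
x = torch.rand(4, 784)        # stand-in batch of "images"
x_hat = torch.rand(4, 784)    # stand-in decoder means

# Log-likelihood of x under N(x_hat, I), summed over pixels and batch.
log_lik = torch.distributions.Normal(x_hat, 1.0).log_prob(x).sum()

# The same quantity written as squared error plus a constant.
sq_err = ((x - x_hat) ** 2).sum()
const = -0.5 * x.numel() * math.log(2 * math.pi)
from_sq_err = -0.5 * sq_err + const

print(log_lik.item(), from_sq_err.item())
```

The constant does not depend on the network parameters, so under this assumption maximizing the reconstruction term and minimizing summed squared error are the same optimization.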

3. The Reparameterization Trick Advanced

There is a problem hiding in the ELBO. The reconstruction term is an expectation over $z \sim q_\phi(z \mid x)$, so to estimate it we must sample a code from the encoder's distribution. But sampling is a stochastic, non-differentiable operation; you cannot backpropagate through "draw a random number." If the gradient cannot pass from the reconstruction loss back into the encoder's parameters $\phi$, the encoder cannot learn. This is the obstacle the reparameterization trick removes.

The trick is to move the randomness outside the parameterized path. Instead of sampling $z$ directly from $\mathcal{N}(\mu, \sigma^2)$, draw a fixed-distribution noise variable $\epsilon \sim \mathcal{N}(0, I)$ that does not depend on any parameters, and then form

z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

Illustration: a cartoon engineer keeps a rolling die sealed in a glass jar off to the side while a backward-flowing gradient arrow passes smoothly through two adjustable dials (a center dial and a width dial) on a conveyor but bounces off the sealed jar. Caption: Backpropagation refuses to pass through a dice roll, so the reparameterization trick bottles the randomness in a constant off to the side and leaves only the controllable mean and width dials in the gradient's path.

Now $z$ is a deterministic, differentiable function of the encoder outputs $\mu$ and $\sigma$, with all the randomness isolated in $\epsilon$. Gradients flow cleanly through $\mu$ and $\sigma$ into the encoder, while $\epsilon$ is treated as a constant for each sample. This single re-expression is what makes the VAE trainable by ordinary backpropagation, and the same idea reappears whenever a model must differentiate through a sampling step. The KL term, meanwhile, has a closed form for two Gaussians, so it needs no sampling at all:

D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu, \sigma^2)\,\Vert\,\mathcal{N}(0, I)\big) = \tfrac{1}{2}\sum_{j}\big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)
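The closed form follows from two Gaussian moments. A sketch of the derivation for a single latent dimension, writing $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$ as shorthand introduced here for brevity:

```latex
\begin{aligned}
D_{\mathrm{KL}}(q \,\Vert\, p)
  &= \mathbb{E}_{q}\big[\log q(z) - \log p(z)\big] \\
  &= \mathbb{E}_{q}\!\left[-\tfrac{1}{2}\log \sigma^2 - \frac{(z-\mu)^2}{2\sigma^2} + \frac{z^2}{2}\right] \\
  &= -\tfrac{1}{2}\log \sigma^2 - \tfrac{1}{2} + \tfrac{1}{2}\big(\mu^2 + \sigma^2\big) \\
  &= \tfrac{1}{2}\big(\mu^2 + \sigma^2 - \log \sigma^2 - 1\big)
\end{aligned}
```

The third line uses $\mathbb{E}_q[(z-\mu)^2] = \sigma^2$ and $\mathbb{E}_q[z^2] = \mu^2 + \sigma^2$. Because both distributions factorize across latent dimensions, the per-dimension results add, which gives the sum over $j$.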

The next block implements the entire VAE: the encoder producing $\mu$ and $\log\sigma^2$, the reparameterized sample, the decoder, and the two-term loss with the closed-form KL. Read it against Figure 31.3.1.

# Complete VAE: the encoder emits a mean and a log-variance, the
# reparameterization step turns them into a differentiable sample, and the
# loss sums reconstruction with the closed-form Gaussian-to-prior KL.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, code_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 400), nn.ReLU())
        self.fc_mu = nn.Linear(400, code_dim)        # mean head
        self.fc_logvar = nn.Linear(400, code_dim)    # log-variance head
        self.dec = nn.Sequential(
            nn.Linear(code_dim, 400), nn.ReLU(),
            nn.Linear(400, in_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)                # sigma from log-variance
        eps = torch.randn_like(std)                  # noise, no gradient path
        return mu + std * eps                        # differentiable in mu, std

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")  # reconstruction
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # closed-form KL
    return recon + kl

model = VAE()
x = torch.rand(16, 784)
x_hat, mu, logvar = model(x)
print("loss:", vae_loss(x_hat, x, mu, logvar).item())  # a single scalar to minimize

Code Fragment 1: A complete variational autoencoder in PyTorch. The encoder emits a mean and a log-variance; reparameterize implements $z = \mu + \sigma\epsilon$ so gradients reach the encoder; the loss sums the reconstruction (binary cross-entropy here, suited to $[0,1]$ pixels) and the closed-form Gaussian KL.

Key Insight: The KL Term Is What Fills the Holes

The plain autoencoder of Section 31.1 had holes because nothing constrained where codes lived. The VAE's KL term is precisely that missing constraint: it penalizes any encoder distribution that strays from the standard normal, so across the whole dataset the codes are pushed to collectively tile the region around the origin like a single Gaussian cloud, with no large gaps. After training, sampling $z \sim \mathcal{N}(0, I)$ and decoding lands inside the occupied region, so you get a valid image. This is the entire reason a VAE can generate and a plain autoencoder cannot. The reconstruction term alone would recreate the holes; the KL term alone would ignore the data. Generation lives in their balance.

Common Misconception: The ELBO Is the Likelihood, and the KL Sends Every Code to Zero

Two errors recur here. First, students read "maximize the ELBO" as "maximize $\log p_\theta(x)$." It is not: the ELBO is a lower bound, and the gap to the true log-likelihood is the KL between the encoder $q_\phi(z\mid x)$ and the true posterior, which is generally nonzero. A VAE maximizes a bound on the likelihood, not the likelihood itself, which is why its reported numbers are bounds and why a sharper encoder (a tighter bound) can improve the model without changing the decoder at all. Second, the per-image KL pulls each $q_\phi(z\mid x)$ toward the prior, so students conclude every image should encode to the same $z \approx 0$. If that happened the codes would carry no information and reconstruction would fail (this is exactly the posterior collapse of Section 31.4). In a healthy VAE each image keeps a distinct mean; it is the aggregate of all the per-image clouds that tiles the prior, not any single one. The reconstruction term is what keeps the means apart; the KL only stops them drifting arbitrarily far.

Remember This: The VAE in Three Letters, E-R-A

The VAE is three reusable tools, and they spell ERA, a useful tag because they really did open an era you will keep meeting through Part IV:

E is for ELBO: trade the likelihood you cannot compute for a bound you can, splitting into reconstruction minus KL.
R is for Reparameterization: move the randomness into a constant $\epsilon$ so gradients flow through the sample.
A is for Amortized inference: one encoder predicts the code for any image in a single pass, instead of optimizing per image.

If you can recite E-R-A and say one sentence about each, you have the whole section. All three return in the diffusion objective of Chapter 33.

4. Amortized Inference and Sampling Intermediate

The third letter of the ERA tag above is the subject of this subsection. The word "variational" comes from variational inference, the classical technique of approximating an intractable posterior by optimizing over a family of simpler distributions. Classically you would run that optimization separately for every data point, solving for the best $\mu$ and $\sigma$ of each $x$ from scratch, which is slow. The VAE's encoder does something cleverer called amortized inference: it trains one network to predict the variational parameters $\mu_\phi(x)$ and $\sigma_\phi(x)$ for any input in a single forward pass. The cost of inference is amortized across the whole dataset into the encoder's weights, so encoding a new image is one cheap pass rather than an optimization loop. This is the property that makes VAEs practical and is the difference between the VAE and the per-image latent optimization that GAN inversion in Chapter 32 sometimes resorts to.
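The contrast between the two styles of inference can be sketched in a few lines. The networks below are untrained stand-ins and the names `decoder`, `encoder`, and `neg_elbo` are hypothetical; the point is only the shape of the computation, an optimization loop per image versus one forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
in_dim, code_dim = 784, 20

# Hypothetical stand-in networks (untrained), playing the roles of a VAE's parts.
decoder = nn.Sequential(nn.Linear(code_dim, 400), nn.ReLU(),
                        nn.Linear(400, in_dim), nn.Sigmoid())
encoder = nn.Sequential(nn.Linear(in_dim, 400), nn.ReLU(),
                        nn.Linear(400, 2 * code_dim))    # emits [mu, logvar]
for p in decoder.parameters():
    p.requires_grad_(False)                              # decoder held fixed

def neg_elbo(x, mu, logvar):
    """Single-sample negative ELBO with a Bernoulli-style reconstruction term."""
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    recon = F.binary_cross_entropy(decoder(z), x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

x = torch.rand(1, in_dim)                                # one stand-in image

# Classical variational inference: optimize (mu, logvar) for this x alone.
mu = torch.zeros(1, code_dim, requires_grad=True)
logvar = torch.zeros(1, code_dim, requires_grad=True)
opt = torch.optim.Adam([mu, logvar], lr=1e-2)
for _ in range(200):                                     # 200 decoder passes for one image
    opt.zero_grad()
    loss = neg_elbo(x, mu, logvar)
    loss.backward()
    opt.step()

# Amortized inference: one encoder pass yields the variational parameters.
with torch.no_grad():
    mu_amortized, logvar_amortized = encoder(x).chunk(2, dim=-1)

print(mu.shape, mu_amortized.shape)
```

Both routes produce a mean and log-variance of the same shape; the per-image route pays an optimization loop for every new input, while the amortized route pays that cost once, during training.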

Once trained, generation is trivial and uses only the decoder. Draw $z \sim \mathcal{N}(0, I)$, decode, done. The encoder is discarded at generation time, exactly as the reconstruction was discarded in Section 31.1; its job was to shape the latent space during training. The next block samples fresh digits and walks a smooth interpolation between two real ones, the smoothness being the visible payoff of the KL constraint.

# Generation and interpolation with a trained VAE, using only the decoder.
# Sampling the standard-normal prior and decoding yields new digits; walking
# between two encoded means produces a smooth morph because the KL filled the holes.
# Assumes `model` is the VAE instance from Code Fragment 1, already trained.
import torch

# Stand-in inputs so the fragment runs as written; in practice use two real
# MNIST digits scaled to [0, 1].
img_a, img_b = torch.rand(28, 28), torch.rand(28, 28)

model.eval()
with torch.no_grad():
    # Unconditional generation: sample the prior, decode.
    z = torch.randn(64, 20)                  # 64 codes from N(0, I)
    samples = model.dec(z).view(-1, 28, 28)  # 64 brand-new digits

    # Smooth interpolation between two real images' codes.
    mu_a, _ = model.encode(img_a.view(1, -1))
    mu_b, _ = model.encode(img_b.view(1, -1))
    grid = [model.dec((1 - t) * mu_a + t * mu_b).view(28, 28)
            for t in torch.linspace(0, 1, 10)]  # 10 frames, morphs smoothly, no holes
# Unlike Section 31.1's plain AE, every sampled z and every interpolation
# step decodes to a clean digit, because the KL term made the latent dense.

Code Fragment 2: Generating and interpolating with a trained VAE. Sampling the standard-normal prior and decoding yields new digits; interpolating between two encoded means produces a smooth morph with no incoherent frames, the direct contrast with the holes of Section 31.1.

Figure 31.3.2 draws what that interpolation loop produces. Two real digits encode to two means inside the dense latent cloud the KL term built; the straight line between them stays inside the occupied region the whole way, so every decoded step along the path is a clean, valid digit that morphs gradually from the first to the second. In the plain autoencoder of Section 31.1 the same straight line would cross the empty holes between clusters and decode to garbage; here there are no holes to cross.

Figure 31.3.2: Walking the VAE latent space. The two real digits encode to means $\mu_a$ and $\mu_b$ inside the single dense Gaussian cloud that the KL term builds. The straight interpolation line $z = (1-t)\mu_a + t\mu_b$ never leaves the occupied region, so every decoded step (the thumbnails) is a valid digit that morphs smoothly from one class to the other. Run the same line through the plain autoencoder of Section 31.1 and it would cross the empty holes between clusters and decode to noise. Smooth interpolation is the visible signature of a hole-free latent.

Fun Note: How to Differentiate Through a Dice Roll Without Crying

Backpropagation has one firm rule: it will not pass through a coin flip, a dice roll, or any other act of genuine chance, because there is nothing to take a derivative of. The reparameterization trick is the polite workaround a clever person invents to follow the letter of the rule while breaking its spirit. Keep the randomness, but bottle it up in a constant $\epsilon$ you rolled before the network started, then build $z = \mu + \sigma \epsilon$ out of parameters you fully control. The gradient never sees the dice; it only sees the recipe you used to scale and shift them, and that recipe is as differentiable as anything else.
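The claim that the gradient only sees the recipe can be checked with autograd. This is a minimal sketch with a hypothetical five-dimensional code and a trivial loss (the sum of $z$), chosen so the expected gradients are easy to read off.

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(5, requires_grad=True)
logvar = torch.zeros(5, requires_grad=True)

eps = torch.randn(5)                 # the bottled dice roll: no parameters, no gradient
std = torch.exp(0.5 * logvar)
z = mu + std * eps                   # deterministic recipe in mu and logvar

z.sum().backward()

print("dz/dmu     :", mu.grad)       # ones: shifting mu shifts z one-for-one
print("dz/dlogvar :", logvar.grad)   # 0.5 * std * eps: the noise only scales the gradient
print("eps requires grad:", eps.requires_grad)
```

The noise appears in the gradient with respect to the log-variance as a fixed multiplier, which is exactly what "treated as a constant for each sample" means.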

Library Shortcut: Distributions and the Production VAE

Two levels of shortcut exist. For the math, PyTorch's torch.distributions replaces the hand-written reparameterization and KL: q = torch.distributions.Normal(mu, std); z = q.rsample() draws a reparameterized sample (the r is for "reparameterized"), and torch.distributions.kl_divergence(q, prior) computes the KL in one call, turning the two formulas of subsection 3 into two readable lines. For production, you almost never train an image VAE from scratch: Hugging Face Diffusers ships AutoencoderKL, the convolutional VAE that compresses images into the latent space Stable Diffusion runs in, loadable and ready in three lines (AutoencoderKL.from_pretrained(...), then .encode(...) and .decode(...)). The library handles the convolutional architecture, the trained weights, and the scaling factor; you write the from-scratch version once, to understand it, and import it forever after.
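A short sketch of the first shortcut, using random tensors as stand-ins for encoder outputs. It compares the library KL with the hand-written formula of Code Fragment 1; note that `kl_divergence` returns one value per element, so it has to be summed to match.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(16, 20, requires_grad=True)       # stand-ins for encoder outputs
logvar = torch.randn(16, 20, requires_grad=True)
std = torch.exp(0.5 * logvar)

q = Normal(mu, std)                                # q(z | x), one Gaussian per dimension
prior = Normal(torch.zeros_like(mu), torch.ones_like(std))

z = q.rsample()                                    # reparameterized: keeps the gradient path
z_plain = q.sample()                               # plain sample: no gradient path

kl_lib = kl_divergence(q, prior).sum()             # elementwise KL, summed over batch and dims
kl_manual = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

print(z.requires_grad, z_plain.requires_grad)
print(kl_lib.item(), kl_manual.item())
```

Using `rsample` rather than `sample` is the detail that matters: only the former lets the reconstruction loss train the encoder.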

Practical Example: Anomaly Detection at a Bottling Plant

Who: a manufacturing engineer responsible for visual quality control on a beverage bottling line, 2024. Situation: defects (cracks, fill errors, label misalignments) were rare and varied, so there were far too few defect images to train a defect classifier, but normal bottles were abundant. Problem: a supervised detector needs labeled defects it did not have, and hand-coded rules missed novel defect types. Decision: the engineer trained a VAE on tens of thousands of images of good bottles only, then flagged any new image whose reconstruction error or ELBO fell outside the normal range, on the logic that a model trained only on normal bottles reconstructs them well and reconstructs anything unusual poorly. Result: the VAE caught defect types never seen during training, because anything off the learned manifold of normal bottles produced a high reconstruction error, and it needed no defect labels. Lesson: a VAE's reconstruction error and ELBO make it a natural one-class anomaly detector, useful precisely in the common industrial situation where you have many examples of normal and almost none of the failure you are hunting.
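The scoring rule in this example can be sketched as follows. The class `TinyVAE`, the function `anomaly_scores`, the random stand-in data, and the 99th-percentile threshold are all assumptions made for illustration; the case study does not specify the architecture or the threshold that was used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Hypothetical stand-in with the same forward interface as Code Fragment 1."""
    def __init__(self, in_dim=784, code_dim=20):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * code_dim)
        self.dec = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

@torch.no_grad()
def anomaly_scores(model, x):
    """Per-image negative ELBO (single-sample estimate); higher means more unusual."""
    model.eval()
    x_hat, mu, logvar = model(x)
    recon = F.binary_cross_entropy(x_hat, x, reduction="none").sum(dim=1)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return recon + kl

torch.manual_seed(0)
toy_model = TinyVAE()                    # in practice: trained on good bottles only
normal_val = torch.rand(200, 784)        # stand-in held-out images of good bottles
new_images = torch.rand(10, 784)         # stand-in images coming off the line

# Assumed rule: flag anything scoring above the 99th percentile of normal scores.
threshold = torch.quantile(anomaly_scores(toy_model, normal_val), 0.99)
flagged = anomaly_scores(toy_model, new_images) > threshold
print("flagged:", flagged.tolist())
```

Setting the threshold from held-out normal images fixes the false-alarm rate on good product directly, which is usually the quantity a production line can tolerate least.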

Research Frontier: The VAE as Infrastructure for Diffusion

The VAE's most important role in 2024 to 2026 is not as a standalone generator (diffusion produces sharper images) but as the compression layer beneath nearly every large image and video generator. Latent diffusion runs the entire denoising process inside a VAE's latent space, which is why Chapter 33 opens with one. Stable Diffusion 3.5 and the SDXL line use a trained AutoencoderKL; video models such as the open-weight systems of 2024 to 2025 extend it to a spatiotemporal VAE that compresses both space and time. The active 2025 frontier is the autoencoder's design itself, and it is moving fast in two directions. The first is making the latent friendlier to the downstream generator by aligning it with strong pretrained features: REPA-E (Leng et al., "Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers," ICCV 2025, arXiv:2504.10483) shows the VAE and the diffusion transformer can be trained end to end with a representation-alignment loss, reporting a large training speedup over the prior recipe. The second, more radical, direction questions whether the reconstruction-only VAE should be kept at all: Representation Autoencoders (Zheng et al., "Diffusion Transformers with Representation Autoencoders," 2025, arXiv:2510.11690) replace the encoder with a frozen pretrained representation backbone such as DINOv2 paired with a trained decoder, reporting better reconstruction and a semantically richer latent than the standard convolutional VAE. The older open questions remain live too: how high a compression ratio the VAE can reach before the decoder hallucinates, and whether the KL-regularized continuous latent or the discrete codebook of Section 31.6 is the better substrate. The ELBO you derived here is the reason these latents are smooth enough for diffusion to work in.

5. A Second Derivation: The ELBO as an Exact Decomposition Advanced

The Jensen derivation of subsection 2 proves that the ELBO is a lower bound on $\log p_\theta(x)$, but it is silent on one question a careful reader will ask: how big is the gap, and what does it depend on? Knowing that a quantity is "at most the log-likelihood" tells you nothing about whether the bound is loose or tight, and a loose bound can be a poor training signal. There is a second, complementary derivation that answers exactly this. It reaches the same ELBO by a different route, an exact equality rather than an inequality, and in doing so it names the gap precisely. This is the decomposition given by Kingma and Welling in the original VAE paper.

The starting point is not the log-likelihood but the KL divergence between the encoder $q_\phi(z \mid x)$ and the true posterior $p_\theta(z \mid x)$, the object the encoder is trying to approximate. Write that KL out and substitute the definition of the true posterior, $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$:

D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\Vert\,p_\theta(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)\, p_\theta(x)}{p_\theta(x, z)}\right]

Now split the logarithm into the part that depends on $z$ and the part that does not. The term $\log p_\theta(x)$ is constant with respect to $z$, so its expectation under $q_\phi$ is just itself, $\mathbb{E}_{q_\phi}[\log p_\theta(x)] = \log p_\theta(x)$. The remaining piece, $\mathbb{E}_{q_\phi}[\log \frac{q_\phi(z \mid x)}{p_\theta(x, z)}]$, is the negative of the ELBO, because $\mathbb{E}_{q_\phi}[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}] = \mathcal{L}_{\text{ELBO}}(x)$ is exactly the bound from subsection 2 (recall $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$). Putting the two pieces together:

D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\Vert\,p_\theta(z \mid x)\big) = \log p_\theta(x) - \mathcal{L}_{\text{ELBO}}(x)

Rearranging gives the exact identity, the centerpiece of this subsection:

\log p_\theta(x) = \mathcal{L}_{\text{ELBO}}(x) + D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\Vert\,p_\theta(z \mid x)\big)

This is an equality, not an inequality, and it makes the lower-bound property fall out as a one-line corollary: a KL divergence is never negative, so dropping it can only decrease the right-hand side, which gives $\log p_\theta(x) \ge \mathcal{L}_{\text{ELBO}}(x)$, the very bound Jensen produced. But the identity says more than the inequality did. It tells you that the gap between the log-likelihood and the ELBO is exactly the KL divergence between the encoder and the true posterior, and therefore the bound is tight, with equality $\log p_\theta(x) = \mathcal{L}_{\text{ELBO}}(x)$, if and only if $q_\phi(z \mid x) = p_\theta(z \mid x)$ almost everywhere, that is, when the encoder recovers the true posterior exactly.
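The identity can be verified exactly on a toy model where every quantity has a closed form. The sketch below assumes a one-dimensional model with $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, for which $p(x) = \mathcal{N}(0, 2)$ and $p(z \mid x) = \mathcal{N}(x/2, 1/2)$; the model and the helper names are illustrative, not part of the VAE above.

```python
import math

def log_normal(value, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v):
    """ELBO of the toy model for q(z|x) = N(m, v): reconstruction minus KL to the prior."""
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + v)  # E_q[log N(x; z, 1)]
    return recon - kl_gauss(m, v, 0.0, 1.0)

x = 1.3                                   # one hypothetical observation
log_px = log_normal(x, 0.0, 2.0)          # exact log-likelihood
post_mean, post_var = x / 2, 0.5          # exact posterior parameters

for m, v in [(0.0, 1.0), (1.0, 0.2), (post_mean, post_var)]:
    gap = log_px - elbo(x, m, v)
    kl_post = kl_gauss(m, v, post_mean, post_var)
    print(f"q=N({m:.2f}, {v:.2f})  ELBO={elbo(x, m, v):.4f}  "
          f"gap={gap:.4f}  KL(q||posterior)={kl_post:.4f}")
```

For each choice of $q$ the printed gap and the KL to the posterior agree, and both vanish in the last row, where $q$ is the true posterior.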

Key Insight: The ELBO Gap Is the Posterior Gap

The single most useful fact about the ELBO is hidden in the decomposition identity: the amount by which the ELBO falls short of the true log-likelihood is precisely $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\Vert\,p_\theta(z \mid x))$, the error in your posterior approximation. Every shortfall in the bound is a shortfall in the encoder. This reframes the encoder's entire job: it is not merely a convenience for sampling, it is the thing that controls how close your trainable objective sits to the quantity you actually care about. It also explains a fact that surprises newcomers, namely that you can improve a VAE's reported likelihood bound by making the encoder more expressive (a richer $q_\phi$, such as a normalizing flow) without touching the decoder at all: a sharper encoder shrinks the posterior gap, which tightens the bound. When you read "the bound is loose," translate it immediately to "the approximate posterior is far from the true one," because those two statements are the same statement.

The two derivations are worth holding side by side, because each supplies what the other lacks. Jensen's inequality is the quick existence proof: it establishes that some lower bound exists and hands you its form. The decomposition identity is the diagnostic: it establishes what the bound is missing and ties that deficit to a concrete, nameable quantity. They agree on the object (the same $\mathcal{L}_{\text{ELBO}}$ appears in both) but disagree on emphasis, and a graduate-level understanding holds both at once.

Table 31.3.1: The two derivations of the ELBO compared.
Aspect	Jensen derivation (subsection 2)	Decomposition identity (this subsection)
Starting point	$\log p_\theta(x)$, with $q_\phi$ inserted by multiply-and-divide	$D_{\mathrm{KL}}(q_\phi(z \mid x)\,\Vert\,p_\theta(z \mid x))$, the posterior approximation error
Key step	Jensen's inequality: $\log \mathbb{E}[\cdot] \ge \mathbb{E}[\log \cdot]$	Substitute $p_\theta(z \mid x) = p_\theta(x, z)/p_\theta(x)$ and pull out the constant $\log p_\theta(x)$
What it yields	An inequality: $\log p_\theta(x) \ge \mathcal{L}_{\text{ELBO}}(x)$	An exact equality: $\log p_\theta(x) = \mathcal{L}_{\text{ELBO}}(x) + D_{\mathrm{KL}}(q_\phi \,\Vert\, p_\theta(\cdot \mid x))$
What it proves	That the ELBO is a valid lower bound at all	What the bound's gap is, and exactly when it closes
Tightness condition	Implicit (Jensen is tight when the ratio is constant)	Explicit: tight iff $q_\phi(z \mid x) = p_\theta(z \mid x)$ a.e.
Key takeaway	A computable objective exists to maximize	The bound gap is the posterior-approximation error

6. The Rate-Distortion View Advanced

There is a practical puzzle the ELBO alone does not resolve. Two trained VAEs can report the same ELBO and yet behave completely differently: one reconstructs images crisply but uses its latent barely at all, another uses the latent heavily but reconstructs softly. If a single number can hide such different behaviors, then the ELBO is not telling you everything about a model, and you need a finer instrument. Alemi and colleagues supplied one by reading the two ELBO terms through the lens of information theory, recasting the average negative ELBO as a point on a rate-distortion curve.

Take the negative ELBO and average it over the data distribution. The reconstruction term becomes a quantity called the distortion, the expected negative log-likelihood of reconstructing the data from its code:

D = -\,\mathbb{E}_{p_{\text{data}}(x)}\,\mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big]

The regularization term becomes the rate, the expected KL from the encoder to the prior:

R = \mathbb{E}_{p_{\text{data}}(x)}\,D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\Vert\,p(z)\big)

With these two definitions the average negative ELBO is simply their sum:

\mathbb{E}_{p_{\text{data}}(x)}\big[-\mathcal{L}_{\text{ELBO}}(x)\big] = D + R

The names are chosen deliberately, because both quantities have exact information-theoretic meanings. The rate $R$ is, in the language of source coding, the average number of extra nats it costs to transmit the code $z$ under the encoder relative to the prior, the price of describing which point in latent space each image maps to. The distortion $D$ measures how badly the decoder reconstructs once it has paid that price. Training a VAE by maximizing the ELBO is therefore minimizing $D + R$, a single point on the classic rate-distortion trade-off curve of information theory: spend more nats on the code (higher $R$) and you can reconstruct more faithfully (lower $D$), spend fewer and reconstruction degrades.
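In practice both coordinates are estimated from a batch. A minimal sketch, assuming the binary cross-entropy reconstruction term of Code Fragment 1 as the distortion and random stand-in tensors in place of a trained model's outputs; the function name `rate_distortion` is hypothetical.

```python
import torch
import torch.nn.functional as F

def rate_distortion(x_hat, x, mu, logvar):
    """Batch-average distortion D and rate R, both in nats per image."""
    distortion = F.binary_cross_entropy(x_hat, x, reduction="none").sum(dim=1).mean()
    rate = (-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)).mean()
    return distortion, rate

torch.manual_seed(0)
x = torch.rand(16, 784)                         # stand-in images
x_hat = torch.sigmoid(torch.randn(16, 784))     # stand-in decoder outputs
mu, logvar = torch.randn(16, 20), torch.randn(16, 20)

D, R = rate_distortion(x_hat, x, mu, logvar)
print(f"D = {D.item():.1f} nats, R = {R.item():.1f} nats, D + R = {(D + R).item():.1f}")

# An encoder that outputs exactly the prior for every image pays zero rate.
_, R_prior = rate_distortion(x_hat, x, torch.zeros(16, 20), torch.zeros(16, 20))
print("rate when q equals the prior:", R_prior.item())
```

Logging $D$ and $R$ separately during training, rather than only their sum, is what makes the distinctions in the rest of this subsection visible.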

The rate also has a clean relationship to mutual information, which is what makes it a diagnostic rather than just a loss term. Let $I(x; z)$ be the mutual information between data and code under the joint $p_{\text{data}}(x)\, q_\phi(z \mid x)$, the amount of information the code actually carries about the image. Alemi and colleagues show this information is squeezed between the distortion and the rate:

H - D \;\le\; I(x; z) \;\le\; R

where $H$ is the (fixed) entropy of the data. The right-hand inequality is the one to remember: the rate is an upper bound on the mutual information between image and code. A model with rate near zero cannot possibly have a code that carries information about its input, no matter what its ELBO says. This is the formal statement of the posterior-collapse intuition from the Common Misconception callout in subsection 3: when $R \to 0$ the KL to the prior vanishes, the bound on $I(x; z)$ forces the code to be uninformative, and the decoder learns to ignore $z$ entirely.

Why Equal ELBO Does Not Mean Equal Model

The decomposition $-\mathcal{L} = D + R$ exposes the degeneracy directly. Any two configurations $(\theta, \phi)$ that lie on the same line $D + R = \text{const}$ achieve the same ELBO, yet they can sit at wildly different $(R, D)$ coordinates: a high-rate, low-distortion autoencoding solution and a low-rate, high-distortion near-collapsed solution can be numerically tied on the objective. The plain ELBO has no preference between them, which is precisely why two papers reporting the same bound can ship visibly different models. To select a point along that trade-off you need a knob that changes the slope of the line you are optimizing, and that knob is the $\beta$ coefficient: the $\beta$-VAE objective minimizes $D + \beta R$, so $\beta > 1$ tilts the optimum toward lower rate (a more compressed, more disentangled, but softer-reconstructing model) and $\beta < 1$ tilts it toward higher rate (sharper reconstruction, heavier latent use). Section 31.4 takes up $\beta$-VAE and posterior collapse in full; the rate-distortion picture here is the coordinate system in which that section's trade-offs become geometry.
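A few lines of arithmetic make the tie and the tie-break concrete. The two $(R, D)$ points below are invented numbers chosen to lie on the same line $D + R = 100$; they do not come from any trained model.

```python
# Two hypothetical models, given as (rate, distortion) in nats per image.
autoencoding = (30.0, 70.0)      # heavy latent use, crisp reconstruction
near_collapsed = (2.0, 98.0)     # latent almost ignored, soft reconstruction

def objective(point, beta=1.0):
    """The beta-weighted objective D + beta * R; beta = 1 is the negative ELBO."""
    rate, distortion = point
    return distortion + beta * rate

for beta in (0.5, 1.0, 4.0):
    a = objective(autoencoding, beta)
    b = objective(near_collapsed, beta)
    print(f"beta={beta}: autoencoding={a:.1f}, near-collapsed={b:.1f}")
```

At $\beta = 1$ the two models score identically; $\beta = 4$ prefers the low-rate model and $\beta = 0.5$ prefers the high-rate one, which is the change of slope described above.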

Exercise 31.3.4: Derive the Decomposition Identity Yourself Conceptual

Without looking back at subsection 5, start from $D_{\mathrm{KL}}(q_\phi(z \mid x)\,\Vert\,p_\theta(z \mid x))$ and derive the identity $\log p_\theta(x) = \mathcal{L}_{\text{ELBO}}(x) + D_{\mathrm{KL}}(q_\phi(z \mid x)\,\Vert\,p_\theta(z \mid x))$ in full, justifying each step. State explicitly where you use $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$ and where you use $\mathbb{E}_{q_\phi}[\log p_\theta(x)] = \log p_\theta(x)$. Then answer in one sentence: what exact condition on $q_\phi$ makes the ELBO equal to the log-likelihood, and why does that condition follow from the identity rather than needing a separate argument?

Exercise 31.3.5: Sketch the Rate-Distortion Curve Analysis

On axes with rate $R$ on the horizontal and distortion $D$ on the vertical, sketch the rate-distortion curve of a VAE family: the lower-left frontier of achievable $(R, D)$ pairs. (a) Draw a line of constant ELBO, $D + R = c$, that crosses the curve, and mark several achievable points on it (where it meets the curve and in the region above the curve) to show that one ELBO value corresponds to many models. (b) Mark the location of a posterior-collapsed model, whose encoder matches the prior so that $R \approx 0$, and explain, using the bounds $H - D \le I(x; z) \le R$, why its code carries no information and why it must therefore sit at high $D$. (c) Indicate the direction in which increasing $\beta$ in the $D + \beta R$ objective slides the selected operating point along the curve, and say in one sentence why $\beta$ changes which point is optimal even though it does not change the curve itself.

Exercise 31.3.1: Why Not Just Output a Point? Conceptual

A natural objection is that the encoder could output a single code (as in Section 31.1) and we could simply add the KL penalty to keep that code near the origin. Explain in three or four sentences why the VAE instead outputs a distribution ($\mu$ and $\sigma$) and samples from it during training. What does the variance $\sigma$ buy you that a penalized point could not, and how does it relate to the smoothness of the interpolations in subsection 4? Tie your answer to the role $\epsilon$ plays in the reparameterization.

Exercise 31.3.2: Build, Train, and Sample Coding

Train the VAE of subsection 3 on MNIST for ten epochs with a 2-dimensional latent so you can visualize it. After training, (a) lay out a regular grid of $z$ values spanning roughly $[-3, 3]$ in each dimension, decode every grid point, and tile the decoded digits into one image to see the latent manifold; (b) plot the encoded means of the test set colored by label on the same axes. Confirm that the decoded grid varies smoothly and that the encoded points form a single blob centered near the origin rather than the scattered clusters of Section 31.1, and explain which loss term is responsible for each observation.

Exercise 31.3.3: The Two Terms in Tension Analysis

Modify the loss to weight the KL term by a coefficient $\beta$, that is recon + beta * kl, and train the VAE at $\beta \in \{0.1, 1, 4, 10\}$. For each, report the final reconstruction error and the average KL, and inspect samples and reconstructions. Describe the trade-off you observe: as $\beta$ grows, what happens to reconstruction sharpness, to how Gaussian the aggregate latent looks, and to the diversity of samples? You have just rediscovered the central knob of Section 31.4; predict what extreme $\beta$ does to the usefulness of the latent code.