Section 36.5: World Models: Latent Dynamics, RSSM & Learning in Imagination

"I do not need to crash a thousand real cars to learn to drive. I crash a thousand imaginary ones, in a dream I built from watching, and wake up cautious."
An Agent That Learned to Plan Inside Its Own Head

Big Picture

A world model is a learned simulator: it compresses observations into a latent state and learns a transition function that predicts the next latent state from the current one and an action, so an agent can train by rolling out imagined futures instead of acting in the costly real world. The recurrent state-space model (RSSM) is the canonical recipe, splitting the latent into a deterministic memory and a stochastic part so it can model both predictable structure and genuine uncertainty. This section builds the RSSM and shows how Dreamer trains a policy entirely inside it.

Every generative model so far in this chapter produced observations: frames, objects, scenes. A world model produces dynamics. It is the point where generative vision meets decision-making, and it closes an arc that began far back in the book. Chapter 15 introduced state estimation with the Kalman filter: a hidden state, a motion model that predicts how it evolves, and observations that correct it. A world model is that idea made fully learned and deeply nonlinear, with a VAE-style encoder (Chapter 31) for the observation model and a neural network for the transition. The cross-reference map names this exactly: motion models and Kalman state estimation, deepened through video understanding and VAE latents, transformed here into learned latent dynamics.

Quick Review: The Reinforcement-Learning Words This Section Uses

This is the one section of the book that borrows vocabulary from reinforcement learning, the branch of machine learning where an agent learns by acting. The handful of terms below are all you need. An environment is the world the agent acts in; at each step the agent observes a state, picks an action, and receives a scalar reward measuring how good that step was. A policy is the function (here a small neural network) that maps a state to an action; learning a good policy is the goal. A value is the expected total future reward from a state, so a critic that estimates value tells the policy how promising a situation is. An actor-critic method trains the policy (actor) and the value estimate (critic) together. Model-free learning improves the policy purely from real interactions with no internal model of the environment; the world-model approach of this section is its opposite, learning such a model so the policy can practice inside it. A replay buffer is just the stored log of past interactions the model trains on. Nothing here requires prior reinforcement-learning study; these one-line meanings are sufficient for the section.

1. Why Imagine? The Sample-Efficiency Argument (Beginner)

Reinforcement learning that acts directly in an environment is sample-hungry: a robot or game agent may need millions of real interactions to learn a good policy, and real interactions are slow, expensive, or dangerous. The world-model proposal, articulated in Ha and Schmidhuber's "World Models" (2018), is to learn a model of the environment from logged experience, then train the policy by simulating, or imagining, rollouts inside that model. Imagined steps are cheap (a forward pass, no physics engine, no real robot), so the policy can practice billions of times in its own dream and only occasionally touch reality to refine the model. The illustration below captures the bargain: the agent crashes a thousand imaginary cars in a dream it built from watching, and touches the costly real road only now and then.

A cartoon car-agent sleeps in a tiny bed under a large dream bubble, and inside the dream it practices driving over and over on a soft cloud course, harmlessly bumping cloud-walls with faint repeated ghost-trails. Outside the bubble a single real road with one cone waits, and a small loop arrow refreshes the dream from reality, depicting how a world model lets a policy train in cheap imagined rollouts and only occasionally act in the expensive real world. — An agent does not need to crash a thousand real cars; it crashes a thousand imaginary ones in a dream it built from watching, and the world model's quality is the ceiling on how well that practice transfers.

This is the central payoff, and it is dramatic. Make it concrete. A model-free agent that needs 10 million real robot steps, at one second each, faces about 116 days of continuous, collision-prone real-world acting. A Dreamer agent learns from 1 million real steps plus a billion imagined ones, so almost all the practice moves into the dream. Run strictly one at a time, a billion imagined steps at a millisecond each would still cost about 11.6 days of compute, but imagined rollouts are batched on the GPU, with many trajectories advancing in every forward pass, so the dream shrinks to hours. That turns roughly 116 days of robot time into roughly 11.6 days of real acting plus an overnight-scale GPU run.

That is the felt meaning of world-model agents such as DreamerV3 (Hafner et al., 2023) reaching strong performance with one to two orders of magnitude fewer real environment steps than model-free baselines, with a single configuration mastering domains from Atari to continuous control to Minecraft. The price is that the dream must be good enough; a policy that learns to exploit a flaw in the model (driving through an imaginary wall the model failed to render) fails in reality. The quality of the world model is the ceiling on the policy, which is why evaluation (Section 36.8) is so central.
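The wall-clock arithmetic earlier in this subsection (10 million real steps versus 1 million real steps plus a billion imagined ones) can be checked in a few lines. This is a back-of-the-envelope sketch, not a benchmark: the step counts and per-step times come from the text, while `PARALLEL_ROLLOUTS = 32` is an illustrative assumption about how many imagined trajectories advance per forward pass.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def serial_days(steps, seconds_per_step):
    """Wall-clock days needed to run `steps` strictly one after another."""
    return steps * seconds_per_step / SECONDS_PER_DAY

# Step counts and per-step times quoted in the text.
model_free_days = serial_days(10_000_000, 1.0)        # real robot, 1 s per step
dreamer_real_days = serial_days(1_000_000, 1.0)       # real robot, 1 s per step
dream_serial_days = serial_days(1_000_000_000, 0.001) # imagined, 1 ms per step

# ASSUMPTION (not from the text): imagined rollouts run in parallel batches.
PARALLEL_ROLLOUTS = 32
dream_batched_hours = dream_serial_days * 24 / PARALLEL_ROLLOUTS

print(f"model-free real acting : {model_free_days:.1f} days")
print(f"Dreamer real acting    : {dreamer_real_days:.1f} days")
print(f"dream, run serially    : {dream_serial_days:.1f} days")
print(f"dream, batched x{PARALLEL_ROLLOUTS}     : {dream_batched_hours:.1f} hours")
```

The serial figure shows that a millisecond per step is not enough on its own; the dream only becomes cheap once imagined rollouts are batched.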

Fun Note

An agent that learns inside its own dream has exactly the problem of a student who only studies the answer key: if the key has a typo, the student confidently learns the typo. A Dreamer agent that discovers it can phase through a wall the model forgot to render will become a world-class wall-phaser, and then drive straight into a real wall on day one. The polite name for this is "model exploitation"; the blunt name is "the dream lied and the agent believed it."

2. The RSSM: Deterministic Memory Meets Stochastic State (Advanced)

The technical heart is the Recurrent State-Space Model. The latent state at time $t$ is split into two parts: a deterministic recurrent state $h_t$ that carries reliable memory of the past, and a stochastic state $z_t$, a sampled latent capturing the irreducible uncertainty of what comes next. This split is the RSSM's signature insight. A purely deterministic model cannot represent genuine randomness (a die roll, an opponent's choice); a purely stochastic model struggles to carry long-range memory. Splitting them gets both.

The deterministic part $h_t$ is the hidden state of a gated recurrent unit (GRU), a small recurrent network whose learned gates decide how much of its memory vector to keep versus overwrite at each step. It is the same recurrent update operator you met inside RAFT in Chapter 26, reused here to give the world model a stable thread of memory through time.

The model has four learned components. The recurrent model updates the deterministic state, $h_t = f(h_{t-1}, z_{t-1}, a_{t-1})$, folding in the previous action $a_{t-1}$. The representation model (the encoder, used during training) infers the stochastic state from the current observation, $z_t \sim q(z_t \mid h_t, x_t)$. The transition (prior) model predicts the stochastic state from memory alone, $\hat{z}_t \sim p(z_t \mid h_t)$, with no access to the observation; this is the predictor used to imagine. Decoder, reward, and continuation heads read out pixels, reward, and episode-end from the full state $(h_t, z_t)$. Training maximizes a variational lower bound (the ELBO you met in Chapter 31), reconstructing observations and rewards while a KL term pulls the observation-free prior $p$ and the observation-informed posterior $q$ together, so the prior alone can carry the dream forward.

Figure 36.5.1: The RSSM unrolled over two timesteps. The deterministic recurrent state $h$ (blue) flows forward through a GRU that ingests the previous stochastic state and action; the stochastic state $z$ (purple) is drawn from the observation-informed posterior during training and from the observation-free prior during imagination; the full state decodes to a predicted observation and reward (green). The prior path is what lets the model dream forward without seeing the world.

Figure 36.5.1 unrolls the RSSM and highlights the one path that makes imagination possible: the prior $p(z_t \mid h_t)$, which predicts the next stochastic state from memory alone. During training the posterior $q$ (which peeks at the observation) drives reconstruction; during imagination only the prior runs, so the model generates futures without any real input. The code below implements one RSSM transition step.

# One RSSM transition step: a GRU carries deterministic memory, a prior predicts
# the next stochastic state from memory alone (used to imagine), and a posterior
# uses the observation (used in training). The KL between them teaches the dream.
import torch
import torch.nn as nn

class RSSMCell(nn.Module):
    """One step of a Recurrent State-Space Model: deterministic GRU memory plus
    a stochastic latent drawn from a prior (imagination) or posterior (training)."""
    def __init__(self, stoch=32, deter=256, action_dim=6, obs_feat=1024, hidden=256):
        super().__init__()
        self.gru = nn.GRUCell(stoch + action_dim, deter)
        self.prior_net = nn.Sequential(nn.Linear(deter, hidden), nn.SiLU(),
                                       nn.Linear(hidden, 2 * stoch))   # mean, logstd
        self.post_net = nn.Sequential(nn.Linear(deter + obs_feat, hidden), nn.SiLU(),
                                      nn.Linear(hidden, 2 * stoch))
        self.stoch = stoch

    def _sample(self, params):
        mean, logstd = params.chunk(2, dim=-1)
        std = torch.exp(logstd.clamp(-5, 2))
        eps = torch.randn_like(std)
        return mean + std * eps, mean, std        # reparameterized sample

    def forward(self, prev_stoch, prev_action, prev_deter, obs_feat=None):
        x = torch.cat([prev_stoch, prev_action], dim=-1)
        deter = self.gru(x, prev_deter)            # advance deterministic memory
        # Prior: predict the next stochastic state from memory alone.
        prior_stoch, prior_mean, prior_std = self._sample(self.prior_net(deter))
        prior = (prior_mean, prior_std)            # always (mean, std), in both modes
        if obs_feat is None:                       # IMAGINATION: dream forward
            return prior_stoch, deter, prior, None
        post_params = self.post_net(torch.cat([deter, obs_feat], dim=-1))  # use observation
        stoch, post_mean, post_std = self._sample(post_params)  # TRAINING: posterior sample
        return stoch, deter, prior, (post_mean, post_std)  # KL(post || prior) trains the dream

cell = RSSMCell()
z, h = torch.zeros(4, 32), torch.zeros(4, 256)
a = torch.zeros(4, 6)
z, h, prior, post = cell(z, a, h, obs_feat=torch.randn(4, 1024))
print(z.shape, h.shape)   # torch.Size([4, 32]) torch.Size([4, 256])

Code Fragment 1: One RSSM transition step. The gru advances the deterministic state deter; prior_net predicts the next stochastic state from memory alone (used to imagine when obs_feat is None), while post_net uses the observation (used in training). The KL between posterior and prior is the loss term that teaches the prior to dream accurately.
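The KL term named in the caption has a closed form when both distributions are diagonal Gaussians, as they are in Code Fragment 1. The sketch below is plain Python rather than PyTorch so the formula is visible; `diag_gaussian_kl` is a hypothetical helper name, and the numbers are illustrative.

```python
import math

def diag_gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL( N(mean_q, std_q^2) || N(mean_p, std_p^2) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

prior_mean, prior_std = [0.0, 0.0], [1.0, 1.0]

# Posterior identical to the prior: the prior already predicts perfectly.
kl_same = diag_gaussian_kl(prior_mean, prior_std, prior_mean, prior_std)

# The observation shifts one latent dimension by one standard deviation.
kl_shifted = diag_gaussian_kl([1.0, 0.0], [1.0, 1.0], prior_mean, prior_std)

# The observation makes the posterior more certain (std 0.5) without moving it.
kl_sharp = diag_gaussian_kl([0.0, 0.0], [0.5, 0.5], prior_mean, prior_std)

print(kl_same, kl_shifted, kl_sharp)
```

The KL is zero only when the prior already predicts what the observation reveals, which is the sense in which this term teaches the prior to dream.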

3. Learning in Imagination: The Dreamer Loop (Advanced)

With a trained RSSM, the Dreamer recipe (Hafner et al., 2020) learns behavior without touching the environment. Starting from latent states encoded from real replayed experience, it rolls the RSSM forward using only the prior and the current policy's actions, generating an imagined trajectory of latent states, predicted rewards, and continuation flags. An actor-critic is then trained on this imagined trajectory: the critic learns to predict the long-horizon value of latent states, and the actor (policy) is updated to maximize that value. Because the entire rollout is differentiable, gradients of value can flow back through the imagined dynamics into the policy, an analytic policy gradient that model-free methods cannot get.

Key Insight: The Three Loops of a World-Model Agent

A Dreamer-style agent runs three nested loops, and keeping them straight demystifies the whole system. The environment loop (slow, expensive) acts in reality with the current policy and stores experience in a replay buffer. The model loop (offline) trains the RSSM on replayed sequences to reconstruct observations and rewards and to make the prior match the posterior. The imagination loop (fast, billions of steps) rolls the RSSM forward from replayed states and trains the actor-critic purely on dreamed trajectories. Real data trains the model; the model trains the policy; the policy gathers more real data. Sample efficiency comes entirely from the fact that the inner imagination loop never touches the environment.

Right Tool: DreamerV3 Out of the Box

A full RSSM, the actor-critic, the replay buffer, and the imagination loop are roughly a thousand lines of careful code with many stabilizing tricks (symlog rewards, free-bits KL, percentile return normalization). The official DreamerV3 implementation runs an agent on a new environment from one command:

# Train DreamerV3 on a Gym environment with the reference implementation
# (github.com/danijar/dreamerv3). The repo is launched from its main entry
# point with a task and a config name; the schematic call below mirrors that:
#   python dreamerv3/main.py --configs defaults --task gym_CartPole-v1 \
#                            --run.steps 5e4
# Internally this builds the RSSM, the actor-critic, the replay buffer, and the
# imagination loop, all wired together with DreamerV3's stabilization tricks.

Code Fragment 2: Running DreamerV3 on a Gym CartPole task from one launch command with a task and a config name: the from-scratch RSSMCell of Code Fragment 1 is just one of its dozens of components, all wired and carrying the stabilization tricks (symlog rewards, free-bits KL, return normalization) that make a fixed hyperparameter set work across domains.

This replaces the entire from-scratch system (of which the RSSM cell above is just one of dozens of components) with a single configured run, and it carries the years of stabilization tricks that make a fixed hyperparameter set work across very different domains. Reach for it whenever you want results rather than understanding.

4. The Limits and the Road Ahead (Intermediate)

The RSSM-Dreamer recipe is elegant and sample-efficient but has clear boundaries. Its latent is small and its decoder modest, so it excels on games and control tasks with structured dynamics but does not, by itself, generate the rich, photorealistic, long-horizon video that Sections 36.1 and 36.2 produce. The next two sections take the two roads out of this limitation. Section 36.6 scales the world model up to pixel-space generative simulators (GAIA-1, playable game engines) that fuse the video-generation machinery with action conditioning, trading the RSSM's compact latent for the raw generative power of large video models. Section 36.7 goes the opposite way: it argues that the pixel decoder is wasteful and that prediction should happen purely in representation space, the JEPA philosophy. Both descend from the same insight you just built: a latent state plus a learned transition is a model of the world.

Research Frontier: Scaling Latent World Models (2024-2026)

DreamerV3 (2023) showed one configuration mastering 150-plus tasks, settling the question of whether latent world models generalize. The 2024 to 2025 frontier scales the idea: TD-MPC2 couples learned latent dynamics with planning (model-predictive control in latent space) and scales to a single agent across many embodiments; transformer-based world models (IRIS and the tokenized-world-model line) replace the GRU with sequence transformers, and DIAMOND (2024) makes a diffusion model the world model itself, sharply raising visual fidelity and bringing the RSSM lineage into contact with the video-diffusion lineage. The two families of this chapter, compact latent world models and large generative video simulators, are actively merging, and the open question is whether the right architecture is a small predictive latent (Section 36.7) or a large generative one (Section 36.6). That question is the subject of the rest of the chapter.

5. The Original World Model: V, M, C (Advanced)

Before the RSSM unified everything into one trained objective, Ha and Schmidhuber (2018) proposed a deliberately modular agent with three pieces, each trained almost independently, that already contained every idea this section builds on. The decomposition is worth studying precisely because the parts are visible. A Vision module (V) is a variational autoencoder that compresses each frame $o_t$ into a low-dimensional latent code $z_t$ (the same VAE you built in Chapter 31). A Memory module (M) is a recurrent network that predicts the next latent given the current latent and action. A Controller (C) is a tiny linear policy. Compression, prediction, and control are cleanly separated, which makes the agent easy to reason about and, as we will see, easy to train the controller inside the dream.

The Memory module is the conceptual ancestor of the RSSM transition. The next-frame latent is genuinely uncertain (the same action in the same place can lead to different futures), so M does not predict a single $z_{t+1}$; it predicts a full probability distribution. Ha and Schmidhuber use a mixture density network on top of an LSTM, so the next latent is modeled as a mixture of $K$ Gaussians whose mixing weights, means, and variances are all functions of the recurrent hidden state $h_t$:

p(z_{t+1} \mid a_t, z_t, h_t) = \sum_{k=1}^{K} \pi_k(h_t)\, \mathcal{N}\!\big(z_{t+1};\, \mu_k(h_t),\, \sigma_k^2(h_t)\big).

Each term is one mode of the predicted future. The mixing weights $\pi_k(h_t)$ (a softmax over the $K$ components, so $\sum_k \pi_k = 1$) say how probable each mode is; $\mu_k(h_t)$ is where that mode sits in latent space and $\sigma_k^2(h_t)$ how spread out it is. A single Gaussian would be forced to average over forking futures and would predict the blurry midpoint of two outcomes that never actually occurs; the mixture instead keeps the modes apart. This is the mixture-density-network idea (Bishop, 1994) applied to dynamics, and it is exactly the role the stochastic latent later plays inside the RSSM (the $z_t$ of the RSSM subsection above, written $s_t$ in the PlaNet subsection below): a learned representation of irreducible branching in the world.
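The "blurry midpoint" argument can be made concrete in one dimension. The sketch below evaluates a two-component mixture and compares it with a single Gaussian that has the same overall mean and variance. The numbers (two equally likely futures at $z = \pm 2$) and the moment-matching comparison are illustrative choices, not taken from the original paper.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_pdf(x, logits, means, stds):
    """Density of sum_k pi_k N(x; mu_k, sigma_k^2) with pi = softmax(logits)."""
    weights = softmax(logits)
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Illustrative forking future: two equally likely outcomes at z = -2 and z = +2.
logits, means, stds = [0.0, 0.0], [-2.0, 2.0], [0.3, 0.3]

# A single Gaussian with the same overall mean and variance as the mixture.
weights = softmax(logits)
single_mean = sum(w * m for w, m in zip(weights, means))
single_var = sum(w * (s ** 2 + m ** 2) for w, m, s in zip(weights, means, stds)) - single_mean ** 2
single_std = math.sqrt(single_var)

for z in (-2.0, 0.0, 2.0):
    print(z, mixture_pdf(z, logits, means, stds), gaussian_pdf(z, single_mean, single_std))
```

The single Gaussian puts its peak at $z = 0$, a value the mixture says almost never occurs, while the mixture concentrates its mass on the two real outcomes.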

The Controller is deliberately minuscule. It is a single linear map from the concatenated latent and memory state to an action,

a_t = W_c\,[\,z_t,\, h_t\,] + b_c,

where $[z_t, h_t]$ stacks the VAE latent and the LSTM hidden state into one vector and $(W_c, b_c)$ are the only parameters the controller has. Because V and M already did the hard representational work, C has so few parameters that it can be optimized by a black-box evolution strategy rather than by backpropagation. Ha and Schmidhuber use CMA-ES (covariance matrix adaptation evolution strategy), which maintains a Gaussian distribution over controller parameter vectors, samples a population, scores each by the return of a rollout, and shifts the distribution toward the high-scoring samples. The decisive move, the one that names this whole section, is that those scoring rollouts can be run inside M: in the paper's VizDoom experiment the controller never touches the real environment during its own optimization. It learns to act by acting in the dream that M generates, which is why Ha and Schmidhuber describe the agent as trained inside its own dream. If the dream is faithful, a controller good in M is good in reality; if M has exploitable flaws, the controller learns to exploit them, the model-exploitation failure flagged in the Fun Note above.

6. PlaNet and the RSSM Objective (Advanced)

Ha and Schmidhuber's three modules are trained separately, which means V never receives a gradient from M about what is worth encoding, and the controller cannot send any pressure back through the model. PlaNet (Hafner et al., 2019) fuses the pieces into a single latent-dynamics model trained end-to-end by one variational objective, and in doing so introduces the precise split that defines the RSSM. The state at time $t$ is a pair: a deterministic recurrent state $h_t$ that is a deterministic function of the past, and a stochastic state $s_t$ sampled from a learned distribution. (From here on the notation follows the PlaNet and Dreamer papers: $s_t$ is the stochastic state written $z_t$ earlier in this section, and $o_t$ is the observation written $x_t$.) The deterministic part is a GRU update,

h_t = f(h_{t-1},\, s_{t-1},\, a_{t-1}),

which folds the previous stochastic state and action into memory. Conditioned on $h_t$, the model defines four learned distributions: the prior (transition) $p(s_t \mid h_t)$ that predicts the stochastic state from memory alone, the posterior (representation) $q(s_t \mid h_t, o_t)$ that corrects that prediction using the observation, the decoder $p(o_t \mid h_t, s_t)$ that reconstructs pixels, and the reward head $p(r_t \mid h_t, s_t)$ that predicts the scalar reward. Only the prior is needed to imagine; the posterior runs wherever a real observation is available, both to infer states from real sequences and to supply the target the prior must learn to match.

Training maximizes a variational lower bound on the log-likelihood of the observed sequence, the same ELBO logic as the VAE in Chapter 31 but now summed over time and carrying a reward term:

\mathcal{J} = \sum_t \mathbb{E}_q\!\Big[\log p(o_t \mid h_t, s_t) + \log p(r_t \mid h_t, s_t) - D_{KL}\big(q(s_t \mid h_t, o_t)\,\|\,p(s_t \mid h_t)\big)\Big].

Read term by term, the objective is intuitive. The first term, $\log p(o_t \mid h_t, s_t)$, is reconstruction: the state must contain enough information to redraw the frame. The second, $\log p(r_t \mid h_t, s_t)$, forces the state to be reward-relevant, so the model does not waste capacity on visually salient but task-irrelevant detail. The third, the KL divergence, is the term that makes imagination possible: it pulls the observation-free prior $p$ and the observation-informed posterior $q$ toward each other, so that at imagination time the prior alone produces stochastic states close in distribution to the ones the posterior would have produced had it seen the frame. Without that KL, the prior would never learn to dream; with it, the model can roll forward with no input.

A single-step ELBO trains the prior only to predict one step ahead, but planning needs accurate predictions many steps out, where one-step errors compound. PlaNet adds latent overshooting: it also requires the prior, run $d$ steps without observations, to match the posterior at the destination, averaged over multiple distances $d$:

\mathcal{J}_{LO} = \sum_t \frac{1}{D}\sum_{d=1}^{D} \mathbb{E}\!\Big[-D_{KL}\big(q(s_t \mid o_{\le t})\,\|\,p(s_t \mid s_{t-d})\big)\Big].

Here $p(s_t \mid s_{t-d})$ is the prior advanced $d$ steps forward from the state at $t-d$ using no observations, and the KL drives that multi-step prediction toward the true posterior at $t$. Averaging over $d$ from $1$ to $D$ trains the model to be self-consistent over many horizons at once, which is exactly the regime a planner exercises when it imagines long rollouts. The intuition is direct: if you intend to plan twenty steps ahead, train the model to predict twenty steps ahead, not just one.

PlaNet has no learned policy at all. It plans online with CEM (the cross-entropy method) as model-predictive control. At each real step it samples many candidate action sequences from a diagonal Gaussian, imagines each one through the latent dynamics, scores each by the sum of predicted rewards, keeps the top-K highest-scoring sequences (the elites), refits the Gaussian to those elites, and repeats for a few iterations. It then executes only the first action of the final refit mean and replans at the next step (receding horizon). Planning replaces the policy entirely; the model is the plan.
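The CEM loop described above fits in a few dozen lines. The sketch below is a toy under stated assumptions, not PlaNet's planner: `imagine_return` is a hypothetical one-dimensional stand-in for the learned latent dynamics and reward head, and the candidate count, elite count, and iteration count are illustrative.

```python
import random
import statistics

def imagine_return(x0, actions, target=3.0):
    """Toy stand-in for imagining through a learned model: the state moves by
    the clipped action and each step is rewarded by negative distance to target."""
    x, total = x0, 0.0
    for a in actions:
        x += max(-1.0, min(1.0, a))
        total += -abs(x - target)
    return total

def cem_plan(x0, horizon=5, candidates=200, elites=20, iters=5, seed=0):
    """Cross-entropy method over action sequences; returns the final mean plan."""
    rng = random.Random(seed)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current diagonal Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(candidates)]
        # Score each by its imagined return and keep the top-K elites.
        samples.sort(key=lambda seq: imagine_return(x0, seq), reverse=True)
        top = samples[:elites]
        # Refit the Gaussian to the elites (small floor keeps std positive).
        mean = [statistics.fmean([seq[t] for seq in top]) for t in range(horizon)]
        std = [statistics.pstdev([seq[t] for seq in top]) + 1e-3 for t in range(horizon)]
    return mean

plan = cem_plan(x0=0.0)
first_action = plan[0]   # receding horizon: execute this, then replan next step
print(first_action, imagine_return(0.0, plan))
```

Only `first_action` would be sent to the environment; the rest of the plan is discarded and recomputed from the next observed state.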

7. Dreamer: Actor-Critic in Imagination (Advanced)

Replanning from scratch at every step, as PlaNet does, is expensive, and it throws away everything the agent learned on the previous step. Dreamer (Hafner et al., 2020) keeps PlaNet's RSSM but replaces online planning with a learned actor-critic trained entirely on imagined trajectories. The critic $v_\psi(s_t)$ estimates the expected long-horizon return from a latent state; the actor $\pi_\theta(a_t \mid s_t)$ is the policy. Both are trained on rollouts the RSSM dreams from real replayed start states, so no environment interaction is spent on behavior learning.

The training target is the $\lambda$-return, an exponentially weighted average of $n$-step returns that trades off the bias of the critic's bootstrap against the variance of long imagined rollouts. It is defined recursively over the imagination horizon $H$:

V^\lambda_t = r_t + \gamma\Big[(1-\lambda)\,v_\psi(s_{t+1}) + \lambda\, V^\lambda_{t+1}\Big], \qquad V^\lambda_H = v_\psi(s_H).

The recursion is the heart of the method, so unpack it. At horizon $H$ the return is just the critic's estimate $v_\psi(s_H)$, the boundary condition. Working backward, each step earns the immediate reward $r_t$ plus the discounted future, and that future is a blend: with weight $(1-\lambda)$ it trusts the critic's one-step bootstrap $v_\psi(s_{t+1})$, and with weight $\lambda$ it trusts the deeper return $V^\lambda_{t+1}$ assembled from the rest of the imagined rollout. Setting $\lambda = 0$ recovers the pure one-step TD target (low variance, high bias if the critic is wrong); $\lambda = 1$ recovers the full Monte-Carlo imagined return (unbiased given the model, high variance). Intermediate $\lambda$ interpolates. The discount $\gamma$ down-weights distant rewards as usual.
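The backward recursion translates directly into a loop. The sketch below uses plain Python lists; `lambda_returns` is a hypothetical helper name and the reward and value numbers are made up. It checks the two limiting cases described above.

```python
def lambda_returns(rewards, values, gamma, lam):
    """rewards[t] for t = 0..H-1 and values[t] = v(s_t) for t = 0..H.
    Returns the list V^lambda_t for t = 0..H, computed backward from t = H."""
    horizon = len(rewards)
    assert len(values) == horizon + 1
    returns = [0.0] * (horizon + 1)
    returns[horizon] = values[horizon]            # boundary: V^lambda_H = v(s_H)
    for t in reversed(range(horizon)):
        blended_future = (1 - lam) * values[t + 1] + lam * returns[t + 1]
        returns[t] = rewards[t] + gamma * blended_future
    return returns

rewards = [1.0, 1.0, 1.0]            # imagined rewards over H = 3 steps
values = [0.5, 0.5, 0.5, 2.0]        # critic estimates v(s_0) .. v(s_3)
gamma = 0.9

td_targets = lambda_returns(rewards, values, gamma, lam=0.0)    # one-step TD
mc_targets = lambda_returns(rewards, values, gamma, lam=1.0)    # full rollout
mixed = lambda_returns(rewards, values, gamma, lam=0.95)

print(td_targets[0], mc_targets[0], mixed[0])
```

With $\lambda = 0$ the first target is $r_0 + \gamma\, v_\psi(s_1)$; with $\lambda = 1$ it is the discounted sum of all imagined rewards plus the discounted critic value at the horizon.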

The critic regresses onto this target with a squared loss, treating the $\lambda$-return as a fixed value via the stop-gradient $\text{sg}(\cdot)$ so the regression target does not chase the parameters being optimized:

\mathcal{L}(\psi) = \mathbb{E}_\pi\!\Big[\sum_t \tfrac{1}{2}\big(v_\psi(s_t) - \text{sg}(V^\lambda_t)\big)^2\Big].

The actor maximizes the same $\lambda$-return, plus an entropy bonus $\mathcal{H}(\pi_\theta(a_t \mid s_t))$ scaled by $\eta$ that keeps the policy exploratory and prevents premature collapse onto a single action:

\mathcal{L}(\theta) = -\,\mathbb{E}_\pi\!\Big[\sum_t \Big(V^\lambda_t + \eta\,\mathcal{H}\big(\pi_\theta(a_t \mid s_t)\big)\Big)\Big].

The decisive advantage over model-free actor-critics lives in how this gradient is computed. Because the entire imagined rollout (the RSSM transitions, the reward head, and the reparameterized stochastic states) is differentiable, the gradient of $V^\lambda_t$ with respect to the actor parameters $\theta$ can flow backward through the imagined dynamics analytically. The actor receives a direct, low-variance signal of how nudging each action changes the predicted future return, rather than the high-variance score-function (REINFORCE) estimate a model-free agent is stuck with. Cheap imagined data supplies the sample efficiency; this analytic policy gradient through the learned model is what lets Dreamer turn that data into a good policy quickly.

Algorithm: Dreamer Imagination Training

Given a trained RSSM (recurrent model, prior, posterior, decoder, reward and continuation heads), actor $\pi_\theta$, critic $v_\psi$, replay buffer of real experience, horizon $H$, discount $\gamma$, return mixing $\lambda$:

Step 1. Sample a batch of real sequences from the replay buffer and run the RSSM posterior over them to obtain start states $\{(h_t, s_t)\}$.
Step 2. From each start state, imagine $H$ steps forward using the prior only: at each step draw $a \sim \pi_\theta(a \mid s)$, advance $h \leftarrow f(h, s, a)$, sample $s \sim p(s \mid h)$ (reparameterized), and predict reward $r$ and continuation. Store the imagined trajectory.
Step 3. Compute the $\lambda$-returns $V^\lambda_t$ along each imagined trajectory by the backward recursion, with $V^\lambda_H = v_\psi(s_H)$.
Step 4. Update the critic by descending $\mathcal{L}(\psi) = \mathbb{E}\big[\sum_t \tfrac12 (v_\psi(s_t) - \text{sg}(V^\lambda_t))^2\big]$.
Step 5. Update the actor by ascending $\mathbb{E}\big[\sum_t \big(V^\lambda_t + \eta\,\mathcal{H}(\pi_\theta(a_t\mid s_t))\big)\big]$, with gradients flowing analytically back through the imagined dynamics.
Step 6. Periodically act in the real environment with $\pi_\theta$, add the experience to the replay buffer, and continue training the RSSM on it.

8. DreamerV2 and V3: Stabilizing the Recipe (Advanced)

The original Dreamer worked on continuous control but was fragile on the discrete, visually diverse Atari suite, and it needed per-domain tuning. The V2 and V3 revisions are a sequence of stabilization changes that culminate in one fixed hyperparameter set working across more than 150 tasks. Each change is small and well-motivated, and together they are why DreamerV3 (Hafner et al., 2023; Nature, 2025) is the default modern recipe.

DreamerV2 (Hafner et al., 2021) replaced the Gaussian stochastic latent with a vector of categorical variables. Discrete latents are a better fit for the sharp, multi-modal transitions of games (a screen either changes or it does not), but sampling from a categorical is not differentiable. DreamerV2 uses the straight-through estimator: the forward pass samples a hard one-hot category, while the backward pass pretends the sample was the continuous softmax probability, so gradients still flow into the prior and posterior. It also introduced KL balancing, which recognizes that the prior-posterior KL is doing two jobs at once and should not weight them equally. Training the prior toward the posterior (the dream learning to match reality) matters more than regularizing the posterior toward the prior, so the two directions get different weights:

\mathcal{L}_{KL} = \alpha\, D_{KL}\big(\text{sg}(q)\,\|\,p\big) + (1-\alpha)\, D_{KL}\big(q\,\|\,\text{sg}(p)\big), \qquad \alpha = 0.8.

The first term, with the posterior held fixed by $\text{sg}$, trains the prior to predict; the second, with the prior held fixed, gently regularizes the posterior. The asymmetry $\alpha = 0.8$ puts most of the pressure on improving the prior (the predictor that drives imagination), which is the quantity that determines dream quality.
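It helps to see what the stop-gradients do and do not change. A stop-gradient leaves values untouched, so the balanced loss is numerically equal to the plain KL; only the way its gradient is split differs. Writing $\phi$ for the prior's parameters and $\theta$ for the posterior's (symbols introduced here for illustration only):

```latex
\mathcal{L}_{KL} \;=\; \alpha\, D_{KL}(q_\theta \,\|\, p_\phi) + (1-\alpha)\, D_{KL}(q_\theta \,\|\, p_\phi)
              \;=\; D_{KL}(q_\theta \,\|\, p_\phi) \quad \text{(in value)},
\qquad
\nabla_\phi \mathcal{L}_{KL} = \alpha\, \nabla_\phi D_{KL}(q_\theta \,\|\, p_\phi),
\qquad
\nabla_\theta \mathcal{L}_{KL} = (1-\alpha)\, \nabla_\theta D_{KL}(q_\theta \,\|\, p_\phi).
```

With $\alpha = 0.8$ the prior's KL gradient is scaled by $0.8$ and the posterior's by $0.2$, a four-to-one ratio in favor of improving the predictor.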

DreamerV3 (Hafner et al., 2023) adds the changes that finally removed per-task tuning. Rewards and returns in different environments span wildly different magnitudes, from fractions to thousands, which destabilizes regression losses; V3 squashes them with the symlog transform, a signed logarithm that compresses large magnitudes while staying linear near zero and handling negatives:

\text{symlog}(x) = \text{sign}(x)\,\ln\!\big(1 + |x|\big).
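A minimal sketch of the transform and its inverse follows. The inverse, here called `symexp`, is not written out in the text above; it follows algebraically from the definition as $\text{sign}(y)\,(e^{|y|} - 1)$.

```python
import math

def symlog(x):
    """sign(x) * ln(1 + |x|): roughly linear near zero, logarithmic for large |x|."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)

for reward in (-1000.0, -1.0, -0.01, 0.0, 0.01, 1.0, 1000.0):
    squashed = symlog(reward)
    print(f"{reward:>9} -> {squashed:+.4f} -> {symexp(squashed):+.4f}")
```

Rewards of magnitude 1000 land near $\pm 6.9$ while rewards of magnitude 0.01 stay almost unchanged, so one regression head sees a similar numeric range across very different environments.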

The reward and value heads no longer regress a scalar at all. They predict a distribution over a fixed grid of symlog-spaced bins using a two-hot encoding (a target value falls between two grid points and is represented as the two-point interpolation that recovers it in expectation), trained with cross-entropy. This turns an unbounded regression into a stable classification problem that copes with heavy-tailed and sparse rewards without tuning.
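The two-hot construction is short enough to write out. The sketch below assumes a 41-point grid spaced one unit apart in symlog space, which is an illustrative choice rather than the bin layout of any particular implementation; `two_hot` and `decode` are hypothetical helper names.

```python
import bisect
import math

def symlog(x):
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y):
    return math.copysign(math.expm1(abs(y)), y)

def two_hot(value, bins):
    """Spread a scalar over its two neighbouring points on a sorted grid so
    that the expectation under the resulting distribution equals the value."""
    value = min(max(value, bins[0]), bins[-1])          # clamp to grid range
    hi = min(bisect.bisect_right(bins, value), len(bins) - 1)
    lo = hi - 1
    w_hi = (value - bins[lo]) / (bins[hi] - bins[lo])   # linear interpolation
    probs = [0.0] * len(bins)
    probs[lo] = 1.0 - w_hi
    probs[hi] = w_hi
    return probs

def decode(probs, bins):
    """Expected grid value under the predicted distribution."""
    return sum(p * b for p, b in zip(probs, bins))

bins = [float(b) for b in range(-20, 21)]   # ASSUMPTION: illustrative grid
reward = 137.0
target = two_hot(symlog(reward), bins)      # classification target for the head
recovered = symexp(decode(target, bins))

print([(b, round(p, 3)) for b, p in zip(bins, target) if p > 0], recovered)
```

A network trained with cross-entropy against `target` outputs a distribution over `bins`; decoding that distribution and applying `symexp` returns a reward on the original scale.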

V3 also reshapes the world-model loss into three explicitly weighted terms and clips each KL with free bits so the model never wastes effort driving an already-small KL toward zero (which would otherwise collapse the latent to the prior and starve reconstruction):

\mathcal{L} = \beta_{pred}\,\mathcal{L}_{pred} + \beta_{dyn}\,\mathcal{L}_{dyn} + \beta_{rep}\,\mathcal{L}_{rep},

\mathcal{L}_{dyn} = \max\!\big(1,\; D_{KL}(\text{sg}(q)\,\|\,p)\big), \qquad \mathcal{L}_{rep} = \max\!\big(1,\; D_{KL}(q\,\|\,\text{sg}(p))\big),

with $\beta_{pred} = 1$, $\beta_{dyn} = 1$, and $\beta_{rep} = 0.1$. The prediction loss $\mathcal{L}_{pred}$ collects the decoder, reward, and continuation log-likelihoods; the dynamics loss $\mathcal{L}_{dyn}$ (the KL-balanced prior-toward-posterior term) and the representation loss $\mathcal{L}_{rep}$ (posterior-toward-prior) carry over V2's KL balancing, now each floored at $1$ nat by the $\max(1, \cdot)$ free-bits clip. The down-weighting $\beta_{rep} = 0.1$ keeps the representation free to encode useful detail. Two further details complete the recipe: a 1% uniform mixture ("unimix") is blended into every categorical distribution so no class probability is ever exactly zero (which would give infinite KL and dead gradients), and returns are normalized by a running estimate of their percentile range rather than their variance, which is robust to the reward outliers that variance normalization mishandles. The payoff of this stack of small fixes is the headline result: one configuration, no per-task tuning, strong performance from Atari to continuous control to collecting diamonds in Minecraft from scratch.
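Two of these fixes are small enough to show in isolation: the uniform mixture that keeps categorical probabilities away from zero, and the free-bits floor on the KL. The sketch below uses a single four-class categorical with made-up probabilities; the helper names are hypothetical.

```python
import math

def unimix(probs, mix=0.01):
    """Blend a categorical with the uniform distribution: (1 - mix) * p + mix / K."""
    k = len(probs)
    return [(1 - mix) * p + mix / k for p in probs]

def categorical_kl(q, p):
    """KL(q || p) for two categorical distributions with strictly positive p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def free_bits(kl, floor=1.0):
    """max(floor, kl): below the floor the loss is constant, so no gradient."""
    return max(floor, kl)

# Posterior and prior that disagree completely. Without unimix, KL(q || p)
# would divide by a zero prior probability and be infinite.
q_raw = [1.0, 0.0, 0.0, 0.0]
p_raw = [0.0, 1.0, 0.0, 0.0]

q, p = unimix(q_raw), unimix(p_raw)
kl = categorical_kl(q, p)

print(q, kl)
print(free_bits(kl), free_bits(0.3))   # large KL passes through, small KL is floored
```

After mixing, the smallest class probability is $0.01 / 4 = 0.0025$, so the KL stays finite even in the worst case, and any KL already below one nat contributes a constant to the loss.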

9. TD-MPC2: Decoder-Free Latent Planning Advanced

Every model so far reconstructs pixels, and that decoder is expensive and arguably wasteful: to choose good actions you do not need to redraw the scene, only to predict reward and value. TD-MPC2 (Hansen et al., 2024) takes this to its conclusion with a decoder-free latent world model that never reconstructs an observation. An encoder maps the observation to a latent, $z = h_\theta(o)$; a latent dynamics model predicts the next latent, $z_{t+1} = d_\theta(z_t, a_t)$; a reward head $R_\theta$, a value head $Q_\theta$, and a policy prior $\pi_\theta$ read out of the latent. The crucial difference is what holds the latent together. With no reconstruction term, the latent is shaped purely by a self-consistency loss that forces the predicted next latent to match the encoder's actual encoding of the next observation:

\mathcal{L}_{\text{cons}} = \big\| d_\theta(z_t, a_t) - \text{sg}\big(h_\theta(o_{t+1})\big) \big\|^2.

The stop-gradient on the target $h_\theta(o_{t+1})$ prevents the trivial collapse where the encoder maps everything to a constant (which would make any prediction perfect). The full objective adds the reward loss and a temporal-difference value loss, both as two-hot cross-entropy as in DreamerV3, and crucially contains no reconstruction term. The latent is therefore free to discard everything about the observation that does not help predict reward and value, which is precisely the task-relevant compression a planner wants.

At decision time TD-MPC2 plans with MPPI (model-predictive path integral control), a sampling-based planner closely related to PlaNet's CEM but with a softer, return-weighted update. It samples action trajectories (seeded partly from the policy prior $\pi_\theta$ to start from sensible behavior), rolls each through the latent dynamics, and scores each by the discounted sum of predicted rewards plus a terminal value bootstrap, $\sum_{t=0}^{H-1} \gamma^t \hat{r}_t + \gamma^H \hat{Q}$. Where CEM refits its distribution to a hard top-K set of elites that all count equally, MPPI weights trajectories by a softmax over their returns, so better trajectories pull the distribution proportionally to how good they are. The terminal $\gamma^H \hat{Q}$ is what lets a short planning horizon see past its own edge: the learned value summarizes everything beyond step $H$, so the planner need not imagine to the end of the episode.

Algorithm: TD-MPC2 MPPI Planning

Given encoder $h_\theta$, latent dynamics $d_\theta$, reward $R_\theta$, value $Q_\theta$, policy prior $\pi_\theta$, current observation $o$, horizon $H$, $N$ sampled trajectories, temperature $\tau$, iterations $M$:

1. Encode the current observation, $z_0 = h_\theta(o)$. Initialize a per-step action Gaussian (mean $\mu$, std $\sigma$).
2. Draw a fraction of the $N$ action sequences by rolling the policy prior $\pi_\theta$ through the latent dynamics, and the rest from the current Gaussian, to seed the search with sensible behavior.
3. For each candidate sequence, roll $z_{t+1} = d_\theta(z_t, a_t)$ and score it $G = \sum_{t=0}^{H-1} \gamma^t R_\theta(z_t, a_t) + \gamma^H Q_\theta(z_H, a_H)$, where the terminal action $a_H = \pi_\theta(z_H)$ comes from the policy prior because the candidate sequence only supplies $a_0, \dots, a_{H-1}$.
4. Update the Gaussian by a return-weighted softmax: weight each sequence by $\exp(G / \tau)$ (normalized), set $\mu$ to the weighted mean of the action sequences and $\sigma$ to their weighted std.
5. Repeat steps 2 to 4 for $M$ iterations, then execute the first action of $\mu$ and replan at the next observation (receding horizon).
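The planning loop above can be sketched in a few dozen lines of plain Python. This is a toy illustration, not the TD-MPC2 code: the latent is a single number, actions are scalars clipped to $[-1, 1]$, and the dynamics, reward, value, and policy functions at the bottom are hypothetical stand-ins for the learned networks. The noise added to the policy rollouts, the minimum standard deviation, and the choice of terminal action $\pi_\theta(z_H)$ are likewise assumptions of the sketch.

```python
import math
import random

def clip(a, lo=-1.0, hi=1.0):
    return max(lo, min(hi, a))

def mppi_plan(z0, dynamics, reward, value, policy, horizon=5, num_samples=64,
              num_policy=8, iterations=4, gamma=0.99, temperature=0.5,
              policy_noise=0.1, min_std=0.05, rng=None):
    """Return the first action of the MPPI-refined mean action sequence."""
    rng = rng or random.Random(0)
    mean, std = [0.0] * horizon, [1.0] * horizon  # per-step action Gaussian
    for _ in range(iterations):
        seqs = []
        # Seed part of the population by rolling the (noised) policy prior.
        for _ in range(num_policy):
            z, seq = z0, []
            for _ in range(horizon):
                a = clip(policy(z) + policy_noise * rng.gauss(0.0, 1.0))
                seq.append(a)
                z = dynamics(z, a)
            seqs.append(seq)
        # Draw the rest from the current Gaussian.
        for _ in range(num_samples - num_policy):
            seqs.append([clip(rng.gauss(m, s)) for m, s in zip(mean, std)])
        # Score: discounted predicted rewards plus a terminal value bootstrap.
        returns = []
        for seq in seqs:
            z, g = z0, 0.0
            for t, a in enumerate(seq):
                g += gamma ** t * reward(z, a)
                z = dynamics(z, a)
            g += gamma ** horizon * value(z, policy(z))  # assumed a_H = pi(z_H)
            returns.append(g)
        # Softmax weights over returns (max subtracted for numerical stability).
        g_max = max(returns)
        weights = [math.exp((g - g_max) / temperature) for g in returns]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Refit the Gaussian to the weighted population.
        for t in range(horizon):
            m = sum(w * s[t] for w, s in zip(weights, seqs))
            var = sum(w * (s[t] - m) ** 2 for w, s in zip(weights, seqs))
            mean[t], std[t] = m, max(math.sqrt(var), min_std)
    return mean[0]

# Hypothetical toy model: drive a scalar latent toward zero.
toy_dynamics = lambda z, a: z + 0.5 * a
toy_reward = lambda z, a: -z * z
toy_policy = lambda z: clip(-z)
toy_value = lambda z, a: -(z + 0.5 * a) ** 2

first_action = mppi_plan(2.0, toy_dynamics, toy_reward, toy_value, toy_policy)
print("first planned action:", first_action)
```

Starting from a latent of 2.0, the fastest way to raise the return is to push the latent toward zero, so the planned first action should come out strongly negative. In a receding-horizon loop only this first action is executed before the planner runs again from the next observation.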

10. Comparing the World-Model Family Intermediate

The six systems above are variations on one template, latent state plus learned transition, that differ along a few decisive axes: whether the stochastic latent exists and what form it takes, whether the policy gradient flows analytically through the model, how the prior-posterior KL is handled, whether the model reconstructs observations, and how an action is finally chosen. Reading the table row by row traces the field's trajectory from modular and hand-tuned toward unified, stable, and decoder-free.

System	Stochastic latent	Action gradient	KL handling	Reconstruction	Action selection
World Models (2018)	VAE latent; MDN-RNN mixture for next-step	None (CMA-ES on controller)	None (modules trained separately)	Yes (VAE pixels)	Linear controller, evolved in the dream
PlaNet (2019)	Gaussian $s_t$, RSSM split	None (no policy)	Standard ELBO KL + latent overshooting	Yes (pixels + reward)	CEM model-predictive control
Dreamer (2020)	Gaussian $s_t$, RSSM split	Analytic through imagined rollout	Standard ELBO KL	Yes (pixels + reward)	Learned actor, $\lambda$-return critic
DreamerV2 (2021)	Categorical (straight-through)	Analytic through imagined rollout	KL balancing ($\alpha = 0.8$)	Yes (pixels + reward)	Learned actor, $\lambda$-return critic
DreamerV3 (2023)	Categorical + 1% unimix	Analytic through imagined rollout	KL balancing + free bits ($\max(1,\cdot)$)	Yes (symlog, two-hot heads)	Learned actor, percentile-normalized $\lambda$-return
TD-MPC2 (2024)	Deterministic latent (consistency-shaped)	Through dynamics for value/policy	None (no prior-posterior KL)	No (decoder-free)	MPPI planning seeded by policy prior

Two trends jump out. The KL column moves from "none" through ever more carefully balanced and floored divergences and then back to "none" once the decoder is dropped, because the prior-posterior KL exists to keep a generative latent honest and a consistency-shaped latent does not need it. The reconstruction column stays "yes" for five systems and then turns "no," marking the decoder-free turn that Section 36.7 pushes further with the JEPA philosophy. The action-selection column tells the other story: planning (CEM, MPPI) and learned policies (Dreamer's actor-critic) are two answers to the same question, and TD-MPC2's seeding of MPPI from a learned policy prior shows the two answers converging.

Research Frontier: Where the Family Is Heading (2024-2026)

The decoder-free turn of TD-MPC2 and the categorical, fixed-hyperparameter robustness of DreamerV3 are converging with the transformer and diffusion world models noted earlier (IRIS, DIAMOND). The live questions are sharp: can a single self-consistency objective, with no reconstruction at all, scale to high-dimensional visual control the way reconstruction-based RSSMs have; does the analytic policy gradient through a learned model keep its sample-efficiency edge as horizons lengthen and models grow; and is the right stochastic latent a small categorical (DreamerV3) or no explicit stochasticity at all (TD-MPC2) once the latent is shaped by consistency rather than generation. These questions are accessible to a motivated reader: both the DreamerV3 and TD-MPC2 reference implementations are open source and run on a single GPU, so an ablation that swaps one axis of the comparison table above is a genuine, publishable experiment.

Exercises

Conceptual. Exercise 36.5.1: The RSSM splits its latent into a deterministic state $h$ and a stochastic state $s$. Give a concrete environment in which a purely deterministic latent would fail, and one in which a purely stochastic latent would fail, and explain how the split resolves both. Relate $h$ and the transition function to the Kalman filter's state and motion model from Chapter 15.

Coding. Exercise 36.5.2: Using the RSSMCell, write an imagination rollout: from a zero initial state, repeatedly call the cell with obs_feat=None and a fixed action, collecting the deterministic states over 20 steps. Plot the norm of $h_t$ over time for a randomly initialized (untrained) cell, and explain why the trajectory is meaningless until the prior and GRU are trained. Then describe what the KL(posterior, prior) loss would change about this rollout once training converges.

Analysis. Exercise 36.5.3: A model-free agent needs 10 million real environment steps; each real step costs one second of robot time. A Dreamer agent needs 1 million real steps plus 1 billion imagined steps; each imagined step costs one millisecond of GPU time. Compute the wall-clock for each and identify the dominant cost in the Dreamer case. Then argue the failure mode: what happens to the policy if the world model is systematically wrong about one rare but high-stakes transition (an obstacle the decoder never learned to render)?

Conceptual. Exercise 36.5.4: Derive the RSSM variational objective. Starting from the log-likelihood $\log p(o_{1:T}, r_{1:T} \mid a_{1:T})$ of an observed trajectory, conditioned on the actions taken, under the generative model (recurrent $h_t = f(h_{t-1}, s_{t-1}, a_{t-1})$, prior $p(s_t \mid h_t)$, decoder $p(o_t \mid h_t, s_t)$, reward $p(r_t \mid h_t, s_t)$), introduce the variational posterior $q(s_t \mid h_t, o_t)$ and apply Jensen's inequality to obtain the lower bound $\mathcal{J}$ given in Section 6. Identify which term is the VAE reconstruction term, which is new to the dynamics setting, and explain precisely where the KL divergence $D_{KL}(q\,\|\,p)$ comes from in the derivation.

Analysis. Exercise 36.5.5: Why split the state into a deterministic $h_t$ and a stochastic $s_t$ at all, rather than making the whole state stochastic? Argue two points. First, explain why a fully stochastic state makes multi-step prediction harder by injecting sampling noise at every step, so that prediction variance grows with horizon, and how the deterministic GRU path provides a low-variance "backbone" that carries long-range information faithfully. Second, explain why a fully deterministic state cannot represent a genuinely branching future (give the die-roll or opponent-move example) and would be forced to predict a blurry average. Conclude by relating the split to the mixture-density-network of Section 5: both are mechanisms for representing multi-modal futures.

Coding. Exercise 36.5.6: Implement $\lambda$-returns and verify them against Monte-Carlo returns. Write a function lambda_return(rewards, values, gamma, lam) that applies the backward recursion $V^\lambda_t = r_t + \gamma[(1-\lambda)v_{t+1} + \lambda V^\lambda_{t+1}]$ with $V^\lambda_H = v_H$. Then verify two limiting cases on a random trajectory: with $\lambda = 1$ the output must equal the discounted Monte-Carlo return over the horizon plus the discounted terminal bootstrap, $\sum_{k=t}^{H-1} \gamma^{k-t} r_k + \gamma^{H-t} v_H$, and with $\lambda = 0$ it must equal the one-step TD target $r_t + \gamma v_{t+1}$. Confirm both numerically with an assertion, then plot how the variance of $V^\lambda_t$ (over many random value-function perturbations) changes as $\lambda$ sweeps from $0$ to $1$.

11. Foundational Papers Intermediate

The six systems traced above are documented in a tight sequence of papers; the cards below give the primary sources, each with the single idea it contributed to the lineage.

Ha, D., & Schmidhuber, J. (2018).

World Models.

arXiv:1803.10122. Conference version published as "Recurrent World Models Facilitate Policy Evolution," Advances in Neural Information Processing Systems (NeurIPS) 31.

Introduced the V (VAE) + M (MDN-RNN) + C (controller) decomposition and trained a tiny linear controller with CMA-ES entirely inside the model's learned dream.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019).

Learning Latent Dynamics for Planning from Pixels (PlaNet).

International Conference on Machine Learning (ICML). arXiv:1811.04551.

Introduced the RSSM (deterministic GRU plus stochastic latent), the variational training objective, latent overshooting, and CEM planning in latent space.

Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020).

Dream to Control: Learning Behaviors by Latent Imagination (Dreamer).

International Conference on Learning Representations (ICLR). arXiv:1912.01603.

Replaced online planning with an actor-critic trained on imagined rollouts, propagating analytic value gradients through the learned dynamics via $\lambda$-returns.

Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021).

Mastering Atari with Discrete World Models (DreamerV2).

International Conference on Learning Representations (ICLR). arXiv:2010.02193.

Switched to categorical latents with straight-through gradients and introduced KL balancing ($\alpha = 0.8$) to weight the prior-toward-posterior direction more heavily.

Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023).

Mastering Diverse Domains through World Models (DreamerV3).

arXiv:2301.04104. Published in Nature (2025).

Added symlog transforms, two-hot reward and value heads, free-bits KL, unimix, and percentile return normalization, yielding one fixed hyperparameter set across 150-plus tasks.

Hansen, N., Su, H., & Wang, X. (2024).

TD-MPC2: Scalable, Robust World Models for Continuous Control.

International Conference on Learning Representations (ICLR). arXiv:2310.16828.

A decoder-free latent world model trained by self-consistency, reward, and TD value losses with no reconstruction, planned with MPPI seeded by a learned policy prior.