Section 36.8: Evaluating World Models

"They gave me a perfect score on beauty and a passing grade on motion. Then a physicist watched my video, saw a glass of water refill itself, and gave me a zero. We are now arguing about which of them was measuring the right thing."
A Video Model Awaiting Its Real Performance Review

Big Picture

A world model that looks photorealistic can still be physically nonsensical, so evaluating one means measuring properties that appearance metrics like FID and FVD cannot see: does it obey physics, does it respond correctly to actions, and does it stay coherent over long horizons? This section builds the evaluation triad (physical consistency, controllability, and coherence) that separates a convincing renderer from a genuine simulator, and connects it to the broader generative-evaluation toolkit that Chapter 37 formalizes.

A generated clip can earn a top score for realism and still show a glass of water that refills itself. That gap is the worry running through this entire chapter, and it is the one thing appearance metrics never test: whether a model has learned the structure of the world or only its surface. Section 36.2 noted that text-to-video models fail at counting and conservation; Section 36.6 noted that simulators drift and hallucinate; Section 36.7 argued the whole pixel objective might be misdirected. All of these are claims about quality that you cannot settle by looking, because the failures hide in dynamics and physics, not in any single frame. Evaluation is therefore not an afterthought here; it is the instrument that decides which models are real progress. The first thing to establish is why the standard metric is insufficient.

1. Why FID and FVD Are Necessary but Not Sufficient (Intermediate)

The standard generative-video metric is the Fréchet Video Distance (FVD), the temporal cousin of the Fréchet Inception Distance (FID) you will meet in full in Chapter 37. FVD embeds real and generated clips with a pretrained video network (an I3D action-recognition model from Chapter 26) and measures the distance between the two distributions of features, modeled as Gaussians:

$$ \text{FVD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2 \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right), $$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of real and generated feature distributions. This is the same feature-statistics-comparison idea the cross-reference map traces from the humble histograms of Chapter 2. FVD is genuinely useful: it captures gross appearance and motion realism and tracks human judgments of overall quality. But it is blind to exactly the failures world models exhibit. A clip where a glass refills itself, an object teleports, or a "turn left" command makes the car go straight can have an excellent FVD, because the feature distribution still looks like real driving video on average. FVD measures distributional realism; it does not measure physical truth, action obedience, or long-range consistency. You must add metrics that probe those directly.
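The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the clip features have already been extracted into `(N, D)` arrays; the random arrays below are stand-ins for I3D features, and the function name `frechet_distance` is hypothetical rather than taken from any FVD library. The trace of the matrix square root is computed from the eigenvalues of $\Sigma_r \Sigma_g$, which are real and non-negative when both covariances are positive semi-definite.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    mean_term = np.sum((mu_r - mu_g) ** 2)
    # Tr((Sigma_r Sigma_g)^{1/2}) is the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))   # stand-in for features of real clips
shifted = real + 1.0               # same covariance, mean moved by 1 in each dim

same = frechet_distance(real, real)
moved = frechet_distance(real, shifted)
print(f"identical sets: {same:.6f}   shifted mean: {moved:.6f}")
```

The score depends only on the two means and covariances, which is why any failure that leaves those statistics intact passes unnoticed.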

Fun Note

FVD is the metric equivalent of grading an essay by checking that the right proportion of words are nouns. A video of a glass of water cheerfully refilling itself contains all the textures, lighting, and motion statistics of perfectly normal footage, so FVD nods approvingly while the second law of thermodynamics quietly weeps. The metric is not wrong; it is just answering "does this look like the dataset on average?" when the question you needed was "did the candle get shorter?" Always ask what a single number is actually measuring before you let it run your leaderboard.

2. The Evaluation Triad (Advanced)

Three orthogonal axes, beyond appearance, define a good world model, and each needs its own probe.

Physical consistency. Does the generated world obey physical laws? The cleanest tests are controlled and counterfactual: generate scenarios with known physics (a ball rolling off a table, two objects colliding, a liquid being poured) and check whether the outcome matches the law. Kang et al. (2024), the physical-law study from Section 36.2, do exactly this, varying whether test conditions are in or out of the training distribution to ask whether the model learned the rule or merely interpolated seen motions. Practical proxies include object-permanence checks (track objects through occlusions and verify they reappear correctly), conservation checks (count objects, measure sizes over time), and trajectory-physics checks (fit the generated motion to the expected equation and measure the residual).
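The conservation and object-permanence proxies named above reduce to simple bookkeeping once a tracker has run over the generated clip. The sketch below assumes that tracker output already exists as per-frame object counts and per-frame visibility flags; the function names and the toy sequences are hypothetical, chosen only to illustrate the checks.

```python
import numpy as np

def conservation_violation_rate(object_counts, tolerance=0):
    """Fraction of frames whose object count deviates from the first frame's."""
    counts = np.asarray(object_counts)
    return float(np.mean(np.abs(counts - counts[0]) > tolerance))

def unexplained_disappearances(visible, occluded):
    """Number of frames where a tracked object is neither visible nor
    hidden behind a known occluder (a permanence violation)."""
    visible = np.asarray(visible, dtype=bool)
    occluded = np.asarray(occluded, dtype=bool)
    return int(np.sum(~visible & ~occluded))

# Toy tracker output for a six-frame clip: an object vanishes in frame 3
# and a spurious extra one appears in frame 5.
counts = [3, 3, 3, 2, 3, 4]
rate = conservation_violation_rate(counts)

# Toy visibility trace: frame 2 is a legitimate occlusion, frame 3 is not.
visible  = [True, True, False, False, True]
occluded = [False, False, True, False, False]
unexplained = unexplained_disappearances(visible, occluded)

print(f"count violations: {rate:.2f}   unexplained disappearances: {unexplained}")
```

Both numbers should be zero for a clip that respects conservation and permanence; how much tracker noise to tolerate is a design choice this sketch leaves to the `tolerance` argument.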

Controllability. For an interactive simulator (Section 36.6), does the generated future actually obey the action? This is measured by an action-following score: condition the model on an action, generate, then use a separate perception model (a detector or pose estimator from Parts II and III) to verify the action was carried out. If you command "turn left" and an independent lane-detector confirms the trajectory bent left, the model is controllable; if the video is gorgeous but the car drifts straight, it is not. Controllability is invisible to FVD entirely.

Long-horizon coherence. Does the world stay consistent over time? This targets the drift of Section 36.6 and the identity drift of Section 36.1. Probes include re-identification (an object seen early should be recognizable later), loop-closure (turn 360 degrees and the scene should match where you started), and the flow-based warping error of Section 36.1 extended over long windows. Coherence degrades gracefully in good models and catastrophically in bad ones, and only a long rollout reveals it.
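Two of these probes, re-identification and loop-closure, can be sketched directly. The example assumes an embedding model has already produced one feature vector per frame for the tracked object, and that the first and closing frames of a 360-degree turn are available as arrays; both helper names and the toy inputs are hypothetical.

```python
import numpy as np

def reid_similarity(embeddings):
    """Cosine similarity of each frame's object embedding to the first frame's."""
    e = np.asarray(embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    return e @ e[0]

def loop_closure_error(first_frame, closing_frame):
    """Mean absolute pixel difference between the starting view and the view
    after the camera has turned all the way around."""
    a = np.asarray(first_frame, dtype=np.float64)
    b = np.asarray(closing_frame, dtype=np.float64)
    return float(np.mean(np.abs(a - b)))

# Toy embeddings: identity holds for two frames, then flips to something else.
track = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sims = reid_similarity(track)

# Toy frames: the closing view is uniformly brighter than the starting view.
start = np.zeros((4, 4))
end = np.full((4, 4), 0.25)
lc_error = loop_closure_error(start, end)

print("re-id similarity per frame:", sims)
print("loop-closure error:", lc_error)
```

A similarity curve that collapses late in the rollout, or a loop-closure error that grows with the length of the loop, is the signature of drift that short clips never reveal.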

Figure 36.8.1: The world-model evaluation triad. Appearance metrics (FID, FVD) sit at the base: necessary for ruling out obvious unrealism, but blind to dynamics. The three axes that distinguish a genuine simulator from a beautiful renderer (physical consistency, controllability, and long-horizon coherence) each require a dedicated probe that appearance metrics cannot provide.

Figure 36.8.1 organizes the triad over the appearance baseline. The code below implements a concrete controllability probe, the action-following score, which uses an independent perception model exactly as the figure prescribes.

# Controllability probe: command an action, generate the future, then let an
# INDEPENDENT perception model read off what actually happened and check it
# matches the intended effect. This catches the failure FVD is blind to.
import numpy as np

def action_following_score(simulator, perception_model, action_tests):
    """Controllability: does the generated future obey the commanded action?
    perception_model independently measures what the generated world actually did."""
    correct = 0
    for init_frame, action, expected_effect in action_tests:
        simulator.reset(init_frame)
        rollout = [simulator.step(action) for _ in range(16)]   # generate the future
        # an INDEPENDENT model reads off what actually happened (no peeking at the action)
        measured = perception_model.measure_effect(rollout)     # e.g. ego-trajectory bend
        # score 1 if the measured effect has the expected sign and at least 20% of
        # the expected magnitude (assumes expected_effect is nonzero)
        correct += int(np.sign(measured) == np.sign(expected_effect)
                        and abs(measured) > 0.2 * abs(expected_effect))
    return correct / max(len(action_tests), 1)   # guard against an empty test list

# A model with an excellent (low) FVD but a low action-following score looks real
# but ignores control. This single number catches the failure FVD cannot see.
# `sim`, `detector`, and `tests` are assumed to be defined by the caller.
print(f"action-following: {action_following_score(sim, detector, tests):.2f}")

Code Fragment 1: A controllability probe. The action_following_score commands an action, generates the future with the simulator, and uses an independent perception_model.measure_effect to verify the action's effect actually occurred. The score is invisible to FVD and directly measures whether a simulator is steerable rather than merely realistic.

Key Insight: Evaluate the Property, Not the Pixels

The governing principle of world-model evaluation is to measure the property you actually care about with a probe designed for it, rather than hoping a general appearance metric captures it. Physical consistency needs counterfactual physics tests; controllability needs an independent action-following readout; coherence needs long rollouts and re-identification. Each probe borrows a tool from earlier in the book (a detector or tracker from Parts II and III, optical flow from Chapters 15 and 26) and uses it as an objective referee on the generated world. This is also why the predictive models of Section 36.7, which produce no pixels, can still be evaluated: you probe their predicted representations for the same properties, sidestepping appearance entirely. A world model is judged by what it gets right about the world, not by how the world looks.

From the Field: The Leaderboard That Lied

A team building a driving world model tracked FVD as their headline metric and watched it improve steadily across six months of training; the leadership deck showed a confident downward curve and the videos looked stunning. When they finally integrated the model into the planning stack as a scenario generator, the planner trained on its output performed worse on the real closed course than one trained on the old, uglier model. The diagnosis, run too late, was that the new model had optimized appearance realism (driving FVD down) while quietly losing action controllability: commanding a lane change produced beautiful video of a car that subtly did not change lanes, so the planner learned wrong action-outcome associations. The team added the action-following probe and a long-horizon coherence check to their CI dashboard alongside FVD, and the next model that improved on all three transferred correctly. The lesson the tech lead wrote into the team's playbook: FVD is a smoke detector, not a fire inspector; a world model must be measured on physics, control, and coherence, or you will ship a gorgeous model that cannot simulate.

3. The Benchmarks and the Open Problem (Intermediate)

A 2024 to 2025 wave of benchmarks operationalizes the triad. VBench and its successors decompose video-generation quality into many interpretable dimensions (subject consistency, motion smoothness, temporal flicker, and increasingly physics and action dimensions) rather than one FVD number. Physics-focused suites (Physion-style intuitive-physics benchmarks adapted to generation, and the controlled physical-law probes of Kang et al., 2024) test prediction of collisions, support, and containment. Action-controllable benchmarks score generated environments on whether the commanded action produced the right effect. The common thread is decomposition: replace a single opaque score with a panel of targeted probes, each measuring one property, so a model cannot hide a dynamics failure behind appearance realism.

Right Tool: VBench for Decomposed Video Evaluation

Hand-implementing FVD, subject-consistency tracking, motion-smoothness flow analysis, and a dozen other dimensions is a large evaluation codebase. VBench packages the whole decomposed suite:

# VBench: a decomposed video-generation benchmark (schematic; see the repo's
# README for the exact constructor and evaluate() arguments, which evolve).
# pip install vbench ; then:
from vbench import VBench
bench = VBench(device="cuda", full_info_path="vbench_info.json")
results = bench.evaluate(
    videos_path="generated_clips/",
    dimension_list=["subject_consistency", "motion_smoothness",
                    "temporal_flickering", "dynamic_degree"])
print(results)   # a per-dimension score panel, not one opaque number

Code Fragment 2: VBench's evaluate returns a per-dimension score panel over subject_consistency, motion_smoothness, temporal_flickering, and dynamic_degree rather than a single FVD number, so a dynamics failure cannot hide behind appearance realism, much as Code Fragment 1's probe exposes a controllability failure that FVD misses.

This replaces a sprawling, error-prone evaluation harness (separate implementations of FVD, tracking-based consistency, and flow-based smoothness) with one call that returns a per-dimension panel aligned to human judgments, so you can see which property a model fails rather than only that its single score moved.

4. What "Good" Means, and Where the Chapter Lands (Beginner)

Pulling the chapter together: a world model is good to the extent that it predicts the consequences of being in and acting on a world, measured by physical consistency, controllability, and long-horizon coherence, with appearance realism (FVD) as a necessary floor rather than the goal. This reframing answers the worry that opened the chapter. Video models, 3D generators, and world simulators are all attempts to generate something that must stay consistent with itself, and the evaluation triad is simply the operational definition of that self-consistency along the three axes the chapter added: time, space, and action.

The arc that ran from a single denoised pixel in Chapter 33 to a controllable simulator of reality is now complete, and one capstone lab remains to make that arc tangible: the hands-on build at the end of this section assembles the chapter's single thread, a latent space with dynamics, into a working agent you can train and score. After that lab, the chapter ends on an evaluation question that is bigger than world models. How do we measure any generative model, manage its risks, and turn it from a curiosity into a trustworthy tool and a data engine? Section 36.8's FVD is one instance of the distribution metrics that Chapter 37: Evaluation, Safety & Generative Data Engines treats in full, and the controllable-simulator capability this chapter celebrated is exactly the capability whose safety and provenance Chapter 37 must confront. The instrument you built here generalizes there.

5. A Concrete Evaluation Toolkit: Named Benchmarks for Each Axis (Intermediate)

The triad of Section 2 is a principle; a practitioner setting up a leaderboard needs named, citable instruments. The problem with stopping at the principle is that "measure physical consistency" is not runnable: you cannot file a pull request against an abstraction. Each axis has, as of 2025, at least one peer-reviewed or widely adopted benchmark that operationalizes it on real data with a published protocol, and knowing which instrument measures which property (and whether it has been peer-reviewed) is what separates a defensible evaluation from a hand-waved one. This section names the instruments; the comparison table at the end lets you pick the right one for each axis at a glance.

5.1 Physical consistency: Physics-IQ and MORPHEUS

The sharpest instrument for physical-principle understanding is Physics-IQ (Motamed et al., 2025), a benchmark built with DeepMind. The protocol is a clean prediction test: the model sees a short conditioning segment of a real video and must predict the next five seconds, after which the prediction is compared against the true continuation. The benchmark contains 396 real videos spanning five physics domains (fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics), chosen so that the correct continuation is dictated by a physical law rather than by stylistic plausibility. A model that has learned the law (water settles, a magnet attracts, a candle shortens) predicts the right continuation; a model that has only learned appearance produces a plausible-looking but physically wrong one.

The result that should reshape how you read any video-generation leaderboard is Physics-IQ's headline finding: across leading models (Sora, Runway, Pika, Lumiere, VideoPoet, and others), physical understanding is severely limited and is uncorrelated with visual realism. A model can top the appearance metrics and sit near the floor on physics. This is the empirical hammer behind the entire "FVD is not sufficient" argument of Section 1: it is no longer an intuition that pretty video can be physically wrong, it is a measured, near-zero correlation across the strongest models of the day.

MORPHEUS (Zhang et al., 2025) complements Physics-IQ from the conservation-law angle. It evaluates physical reasoning on 80 real videos of physical experiments (pendulums, collisions, and similar setups) whose outcomes are pinned by conservation laws, then scores a model with physics-informed metrics derived from the conserved quantity for each setting (energy, momentum). Because the conservation law is exact for the chosen experiments, MORPHEUS can ask not merely "does this look like physics?" but "is the conserved quantity actually conserved in the generated rollout?" Its finding echoes Physics-IQ: current models generate aesthetically pleasing video yet struggle to encode the underlying physical principles, even with video conditioning and advanced prompting.
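To make "is the conserved quantity actually conserved?" concrete, the sketch below checks mechanical energy for a tracked falling object. It is an illustrative stand-in, not the metric MORPHEUS publishes: it assumes the object's height per frame has already been extracted in metres, that the frame interval is known, and that drag is negligible. The function name `energy_drift` and both toy trajectories are hypothetical.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2, assumed known for the setup

def energy_drift(heights, dt, g=G):
    """Relative spread of mechanical energy per unit mass along a trajectory.
    A value near zero means energy is conserved across the rollout."""
    h = np.asarray(heights, dtype=np.float64)
    v = np.gradient(h, dt, edge_order=2)      # finite-difference velocity
    energy = 0.5 * v ** 2 + g * h             # kinetic + potential, per unit mass
    return float((energy.max() - energy.min()) / np.abs(energy).mean())

t = np.linspace(0.0, 1.0, 31)
dt = t[1] - t[0]

free_fall = 5.0 - 0.5 * G * t ** 2        # obeys the law: accelerating descent
constant_speed = 5.0 - 4.905 * t          # looks plausible, but never accelerates

drift_true = energy_drift(free_fall, dt)
drift_fake = energy_drift(constant_speed, dt)
print(f"free fall drift: {drift_true:.2e}   constant-speed drift: {drift_fake:.2f}")
```

Both toy clips show an object moving downward by roughly the same distance, so a frame-level realism check would struggle to separate them; the conserved quantity does so immediately.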

5.2 Controllability: action-conditioned prediction error

Controllability is measured by action-conditioned prediction error: condition the model on a known action sequence, generate the next frame or latent, and measure the error of the prediction against the true action-induced outcome. This is the quantity DIAMOND (Alonso et al., 2024) and the Genie line use in their controllability analyses, and it is exactly the action-following idea of Code Fragment 1 expressed as a regression error rather than a binary check. For decoder-free predictors (Section 36.7), the analogous score is planning success: V-JEPA 2-AC reports the success rate of a planner that uses the action-conditioned predictor to reach goals, measuring controllability by whether the model is good enough to plan with, not by any pixel.
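The regression form of the score can be sketched on a toy one-dimensional system. The example assumes logged `(state, action, next_state)` triples and a model exposed as a callable `predict_next(state, action)`; those names, and the two toy models compared, are hypothetical and are not drawn from the DIAMOND or Genie codebases.

```python
import numpy as np

def action_conditioned_error(predict_next, transitions):
    """Mean squared error between predicted and true next states over
    logged (state, action, next_state) triples."""
    errors = [np.mean((np.asarray(predict_next(s, a)) - np.asarray(s_next)) ** 2)
              for s, a, s_next in transitions]
    return float(np.mean(errors))

def true_dynamics(s, a):
    return s + 0.1 * a          # the action shifts the state by 0.1 per step

# Logged transitions from the real system.
transitions = [(s, a, true_dynamics(s, a))
               for s in (0.0, 0.5) for a in (-1.0, 0.0, 1.0)]

def faithful_model(s, a):
    return s + 0.1 * a          # obeys the action

def action_blind_model(s, a):
    return s                    # plausible next state, but ignores the action

err_faithful = action_conditioned_error(faithful_model, transitions)
err_blind = action_conditioned_error(action_blind_model, transitions)
print(f"faithful: {err_faithful:.4f}   action-blind: {err_blind:.4f}")
```

The action-blind model is only wrong on the transitions where the action mattered, which is why the error must be computed against action-induced outcomes and not against an unconditional next frame.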

5.3 Long-horizon coherence: VBench and rollout consistency

Long-horizon coherence is the drift, object permanence, and scene-persistence dimension that Genie 3 explicitly targets and that only a long rollout exposes. The most widely adopted instrument here is VBench (Huang et al., 2023; CVPR 2024 Highlight), which decomposes video-generation quality into 16 disentangled, human-preference-validated dimensions (subject consistency, background consistency, motion smoothness, temporal flickering, spatial relationship, dynamic degree, and more) rather than one opaque number. Three of those dimensions, subject consistency, background consistency, and temporal flickering, directly operationalize parts of long-horizon coherence: a model whose subject identity or background drifts across frames scores poorly on exactly the consistency dimensions that a renderer-not-simulator would fail. VBench++ (Huang et al., 2024) extends the suite to image-to-video evaluation and adds trustworthiness dimensions. Beyond VBench, the axis is also measured directly as rollout consistency: accumulated open-loop prediction error as a function of horizon, the drift curve the capstone lab plots in Step 6.
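Rollout consistency is straightforward to compute once open-loop predictions and ground-truth rollouts are aligned. The sketch below assumes both are arrays of shape `(N, H, D)` (rollouts, horizon steps, state dimensions); the function name `drift_curve` and the toy model with a small per-step bias are hypothetical.

```python
import numpy as np

def drift_curve(pred_rollouts, true_rollouts):
    """Mean absolute open-loop prediction error at each horizon step.
    Inputs have shape (N, H, D); the result has shape (H,)."""
    err = np.abs(np.asarray(pred_rollouts) - np.asarray(true_rollouts))
    return err.mean(axis=(0, 2))

H = 20
steps = np.arange(1, H + 1)
starts = np.array([0.0, 1.0, -2.0])[:, None, None]        # three rollouts, D = 1

# True system advances 0.10 per step; the model advances 0.11 per step.
true_rollouts = starts + 0.10 * steps[None, :, None]
pred_rollouts = starts + 0.11 * steps[None, :, None]

curve = drift_curve(pred_rollouts, true_rollouts)
print("error at step 1:", curve[0], "  error at step", H, ":", curve[-1])
```

A one-step error of 0.01 looks negligible, yet it compounds linearly here and would be invisible to any evaluation that only scores the next frame.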

Key Insight: Visual Realism Does Not Imply Physical Correctness

The single most important empirical fact in world-model evaluation is that appearance and physics are nearly independent. Physics-IQ (Motamed et al., 2025) measured the correlation between visual realism and physical-principle understanding across the strongest video models and found it close to zero: a model can be at the top of the realism ranking and near the bottom of the physics ranking simultaneously. This is why no amount of FID, KID, or FVD improvement can certify a world model, and why the field is building dedicated physics benchmarks. When you read a video-generation result, treat the realism score and the physics score as answers to two unrelated questions; a high score on one tells you almost nothing about the other.

6. The Gold Standard: Downstream-Task Value (Advanced)

Every metric above scores the world model in isolation, but for an agent world model there is a more honest question that sidesteps pixels entirely: is the simulator good enough to be useful? The trouble with appearance and even physics scores is that they measure the model against a notion of correctness the designer chose, not against the task the model exists to serve. A simulator earns its keep only if an agent trained inside it performs well in the real environment. This is the gold standard: evaluate the simulator by the utility of the policy it produces, not by how its frames look.

The cleanest instance is DIAMOND (Alonso et al., 2024): a reinforcement-learning agent is trained entirely inside the learned diffusion world model, never touching the real environment during policy learning, and is then scored in the real environment. DIAMOND reports a mean human-normalized score of 1.46 on the Atari-100k benchmark, the best at publication for an agent trained purely within a world model. That single number evaluates the simulator more stringently than any pixel metric could: a world model with beautiful frames but wrong dynamics would train a policy that fails on the real game, dragging the score down. Downstream-task value is the metric that cannot be gamed by appearance, because the policy, not the human eye, is the judge, and the policy only cares whether the dreamed dynamics matched reality. This is the evaluation analogue of the capstone lab's philosophy: the dream is only as good as what you can do with it.
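For readers unfamiliar with the unit, a human-normalized score rescales an agent's raw game score so that a random policy maps to 0 and the human reference maps to 1; the benchmark figure is the mean over games. The sketch below shows the arithmetic only. The two games and every raw score in it are made-up placeholders, not values from the DIAMOND paper.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """0 means random-policy level, 1 means human-reference level."""
    return (agent_score - random_score) / (human_score - random_score)

# Placeholder raw scores for two hypothetical games (illustrative only).
games = {
    "game_a": dict(agent_score=150.0, random_score=0.0, human_score=100.0),
    "game_b": dict(agent_score=60.0, random_score=10.0, human_score=110.0),
}

per_game = {name: human_normalized_score(**raw) for name, raw in games.items()}
mean_hns = sum(per_game.values()) / len(per_game)

print(per_game)
print(f"mean human-normalized score: {mean_hns:.2f}")
```

Because the mean can be dominated by a few games where the agent far exceeds the human reference, it is worth reading the per-game panel alongside the headline number.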

Note: Why Downstream Value Is Not Always Available

Downstream-task value is the strongest signal but the most expensive and the least general. It requires a well-defined task, a trainable agent, and a real environment to score in; all three exist for Atari and robotics but not for an open-ended text-to-video world model with no agent and no task. That is precisely why the property-specific probes of Sections 2 and 5 remain necessary: when there is no downstream task to anchor the evaluation, physical consistency, controllability, and coherence are the best available proxies for the utility you cannot yet measure directly.

7. Where the Generative Metrics Fit (Intermediate)

Chapter 37 develops a full toolkit of generative-model metrics, and it is worth recapping them here to fix exactly why they are necessary but insufficient for world models. The sample-based metrics, FID and FVD (Fréchet distances between feature distributions) and KID (the kernel analogue, an unbiased MMD that behaves better on small samples), measure how close the generated distribution is to the real one in a perceptual feature space. Precision and recall for generative models split that into two numbers: precision asks what fraction of generated samples are realistic, recall asks what fraction of the real distribution the model can cover, separating a model that produces a few perfect samples from one that captures the full variety. CLIPScore measures text-conditioning fidelity: the cosine similarity between a CLIP embedding of the prompt and of the generated frame, answering "did the model generate what was asked?" There are also likelihood-based measures (bits-per-dimension for models with a tractable density), which score how well the model assigns probability to held-out real data.
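As a preview of the kernel analogue, the sketch below computes an unbiased squared-MMD estimate with the cubic polynomial kernel commonly used for KID. It is a simplification under stated assumptions: features are already extracted into `(N, D)` arrays (random stand-ins here), and the estimate is taken over one block of samples, whereas reference implementations typically average over several subsets. The function names are hypothetical.

```python
import numpy as np

def polynomial_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid_estimate(feats_real, feats_gen):
    """Unbiased squared-MMD estimate between two (N, D) feature sets."""
    m, n = len(feats_real), len(feats_gen)
    k_rr = polynomial_kernel(feats_real, feats_real)
    k_gg = polynomial_kernel(feats_gen, feats_gen)
    k_rg = polynomial_kernel(feats_real, feats_gen)
    # Drop the diagonals so the within-set terms are unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))            # stand-in for real features
matched = rng.normal(size=(200, 4))         # same distribution as `real`
shifted = rng.normal(size=(200, 4)) + 1.0   # mean moved by 1 in every dimension

kid_matched = kid_estimate(real, matched)
kid_shifted = kid_estimate(real, shifted)
print(f"matched: {kid_matched:.4f}   shifted: {kid_shifted:.4f}")
```

Because the estimator is unbiased, the matched case can come out slightly negative; that is expected behavior and not a bug.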

Every one of these is necessary: a world model that scores terribly on FVD or CLIPScore is generating unrealistic or off-prompt video and has failed a floor test. But every one is insufficient for the same structural reason FVD is. They all compare distributions or static prompt-frame agreement; none of them conditions on an action and checks the consequence, probes a physical law counterfactually, or follows identity across a long rollout. A self-refilling glass passes FID, FVD, KID, precision-recall, and CLIPScore without complaint, because each frame is realistic and on-prompt and the feature distribution matches real footage. The generative metrics rule out the gross failures; the triad and the downstream task catch the failures that hide in dynamics. You need both layers, never the appearance layer alone.

7.1 Comparison: which instrument measures which axis

Benchmark / metric	What it measures	Triad axis	Peer-reviewed?
Physics-IQ (Motamed et al., 2025)	Predicts 5s after a conditioning segment over 396 real videos in 5 physics domains; physical understanding found uncorrelated with realism	Physical consistency	arXiv 2025 (with DeepMind); not yet a venue paper at time of writing
MORPHEUS (Zhang et al., 2025)	Conservation-law adherence on 80 real physical-experiment videos, scored with physics-informed metrics	Physical consistency	arXiv 2025; preprint
Action-conditioned prediction error	Pixel or latent error of next-step prediction given a known action sequence (DIAMOND, Genie analyses); planning success rate for V-JEPA 2-AC	Controllability	Yes (DIAMOND, NeurIPS 2024)
VBench (Huang et al., 2023)	16 disentangled, human-validated dimensions; subject/background consistency and temporal flickering operationalize coherence	Long-horizon coherence (partial)	Yes (CVPR 2024 Highlight)
VBench++ (Huang et al., 2024)	Extends VBench to image-to-video and adds trustworthiness dimensions	Long-horizon coherence (partial)	arXiv 2024; extension of the CVPR paper
Rollout consistency / drift curve	Accumulated open-loop prediction error vs. horizon; object permanence, scene persistence (Genie 3 target)	Long-horizon coherence	Method, not a single benchmark
Downstream agent score (e.g. Atari-100k for DIAMOND)	Real-environment performance of a policy trained entirely inside the world model; utility, not pixels	Gold standard (all axes, implicitly)	Yes (DIAMOND, NeurIPS 2024)
FID / FVD / KID, precision-recall, CLIPScore (Ch. 37)	Distributional realism, sample quality vs. coverage, text-frame agreement	Appearance floor (none of the triad)	Yes (established metrics)

Research Frontier: Toward Physical-Reasoning Evaluation (2024-2026)

World-model evaluation is itself a fast-moving research area. The central 2024 to 2025 result is the controlled physical-law study (Kang et al., 2024) finding that scaling video generation improves in-distribution motion but does not reliably induce out-of-distribution physical laws, sharpening the simulator-versus-renderer debate with real evidence rather than intuition. Alongside it, decomposed benchmarks (the VBench line) and physics-reasoning probes are maturing into standard tools, and a growing thread uses large multimodal models as automated judges of physical plausibility (asking a vision-language model "is this video physically possible?"), echoing the LLM-as-judge methods that Chapter 37 examines critically. The open and consequential question, unresolved as of this writing, is whether passing these evaluations requires architectural commitments (explicit state, the predictive objective of Section 36.7, action conditioning) or simply more scale, the same fork that has run through this entire chapter. How we measure world models will shape which answer the field pursues.

Exercises

Conceptual. Construct a generated driving clip that would score an excellent FVD yet be a terrible world model. Identify which of the three triad axes (physical consistency, controllability, coherence) your example violates, and explain precisely why FVD, being a distance between feature distributions, is blind to that violation.

Coding. Implement a simple physical-consistency probe for a "ball dropped from a height" scenario: generate (or take) a clip, track the ball's vertical position per frame (a detector or color threshold suffices), fit a parabola $y = \tfrac{1}{2} g t^2 + v_0 t + y_0$, and report the fit residual and the implied $g$. Argue how the residual and the recovered $g$ together distinguish a physically faithful model from one that merely produces plausible-looking downward motion.

Analysis. The chapter presented generative world models (Sections 36.5, 36.6) and predictive ones (Section 36.7). Design an evaluation protocol that can fairly compare a pixel-rendering simulator against a decoder-free JEPA predictor on the same task, given that one produces video and the other produces only embeddings. Which triad axes can be measured for both without rendering, and which require a decoder, and what does that asymmetry imply about how to run a fair benchmark?

Analysis. Exercise 36.8.4: Design a full evaluation protocol for a driving world model (a simulator that takes a steering and throttle action sequence and rolls out future camera frames) that spans all three triad axes with named instruments. For physical consistency, specify a Physics-IQ-style counterfactual test for at least two driving-relevant laws (e.g. braking distance under friction, vehicle following). For controllability, define the action-conditioned prediction error and the independent perception model that reads off the realized trajectory. For long-horizon coherence, choose the VBench dimensions that apply and add a rollout-drift curve and a loop-closure check. State, for each axis, the pass criterion and one failure mode the metric would catch that an FVD leaderboard would miss.

Discussion. Exercise 36.8.5: FID (and its video cousin FVD) is the default headline number in most generative-video papers. Argue both sides of the question "Is FID a meaningful metric for a world model?" Your "yes" case should explain what FID legitimately certifies (a necessary appearance floor) and why a model that fails it is disqualified. Your "no" case should invoke the Physics-IQ finding that visual realism is uncorrelated with physical understanding, and construct a concrete pair of world models where the one with the better FID is the worse simulator by downstream-task value. Conclude with your own position on whether FID belongs in a world-model leaderboard at all, and if so, with what caveats.

This is the place to assemble the whole chapter into one running system. The capstone lab below builds a tiny latent-dynamics world model, the recurrent state-space recipe of Section 36.5 stripped to its essentials, trains a policy entirely in imagination, and then scores it on the consistency triad this section defined. It is the chapter's single thread, a latent space with dynamics, made into something you can run and measure.

Hands-On Lab: Build and Evaluate a Tiny World Model

Duration: about 60 to 75 minutes (Advanced)

Objective. Build a complete, self-contained world-model agent on a toy environment that needs no external dataset or GPU: a recurrent state-space model (RSSM) that learns the environment's latent dynamics from logged experience, a policy trained entirely in imagination on dreamed rollouts (the Dreamer recipe of Section 36.5), and an evaluation harness that scores the result on the consistency triad of this section: one-step prediction accuracy (physical consistency), action controllability, and long-horizon coherence (prediction drift over a rollout). The finished artifact is a single figure with three panels, one per triad axis, plus a printed agent return, that lets you say not just "the agent learned" but "the dream is faithful in these specific ways and unfaithful in those."

What You'll Practice

Implementing the RSSM transition (deterministic GRU memory plus a stochastic latent with a prior used for imagination and a posterior used for training) from Section 36.5.
Training latent dynamics with the reconstruction-plus-KL objective, the same evidence-lower-bound structure as the VAEs of Chapter 31.
Running an imagination rollout and training a policy on dreamed trajectories without touching the environment, the sample-efficiency payoff of the three-loops schema.
Measuring action controllability by checking that different actions produce measurably different predicted futures, the second triad axis of this section.
Quantifying long-horizon coherence as the growth of open-loop prediction error, the drift diagnostic that separates a renderer from a simulator.

Setup

Runs on CPU in minutes; no GPU, no Gym, and no downloads are required. The environment is a tiny deterministic-plus-noise navigation task defined in pure NumPy and PyTorch so the whole lab is reproducible from one file.

pip install torch numpy matplotlib

Steps

Step 1: Define a tiny controllable environment and log experience

A world model needs experience to learn from. Build a 1D "slider" task: a point on a line moves left or right by the chosen action plus a little noise, and the observation is a noisy reading of the position. Logging random episodes gives the model loop something to train on, exactly the cheap experience the warehouse-robot story in Section 36.5 logged before dreaming.

import torch, torch.nn as nn, torch.nn.functional as F, numpy as np
torch.manual_seed(0); np.random.seed(0)
ACTIONS = torch.tensor([-1.0, 0.0, 1.0])   # left, stay, right

def step_env(pos, action_idx):
    pos = np.clip(pos + 0.1 * ACTIONS[action_idx].item() + 0.01 * np.random.randn(), -1, 1)
    obs = pos + 0.02 * np.random.randn()    # noisy observation of true position
    reward = 1.0 - abs(pos)                  # reward peaks at the center (pos = 0)
    return pos, np.array([obs], dtype=np.float32), reward

def log_episode(T=30):
    pos, obs_seq, act_seq, rew_seq = np.random.uniform(-1, 1), [], [], []
    for _ in range(T):
        a = np.random.randint(3)
        pos, obs, r = step_env(pos, a)
        # TODO: append obs, a one-hot of the action (length 3), and r to the three lists.
        # Hint: one_hot = np.eye(3, dtype=np.float32)[a]
        ...
    return np.stack(obs_seq), np.stack(act_seq), np.array(rew_seq, dtype=np.float32)

episodes = [log_episode() for _ in range(200)]   # the replay buffer
print(episodes[0][0].shape, episodes[0][1].shape)  # (30, 1) (30, 3)

Hint

Inside the loop: obs_seq.append(obs); act_seq.append(np.eye(3, dtype=np.float32)[a]); rew_seq.append(r). The one-hot action vector is what the RSSM and the policy both consume.
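Before training anything on the log, it can help to confirm that the environment really carries an action signal for the model to learn. The following is a minimal sketch, not part of the lab file: it re-declares the Step 1 dynamics with a NumPy ACTIONS array so it runs on its own, and the helper name mean_displacement is a hypothetical one introduced here for illustration.

```python
import numpy as np

np.random.seed(0)
ACTIONS = np.array([-1.0, 0.0, 1.0])  # left, stay, right (same values as the lab)

def step_env(pos, action_idx):
    # Same dynamics as Step 1, written with NumPy only.
    pos = np.clip(pos + 0.1 * ACTIONS[action_idx] + 0.01 * np.random.randn(), -1, 1)
    obs = pos + 0.02 * np.random.randn()
    reward = 1.0 - abs(pos)
    return pos, np.array([obs], dtype=np.float32), reward

def mean_displacement(action_idx, n=2000):
    # Hypothetical helper: average change in true position for one action,
    # starting away from the walls so clipping does not mask the effect.
    deltas = []
    for _ in range(n):
        start = np.random.uniform(-0.5, 0.5)
        new_pos, _, _ = step_env(start, action_idx)
        deltas.append(new_pos - start)
    return float(np.mean(deltas))

disp = [mean_displacement(i) for i in range(3)]
print("mean displacement (left, stay, right):", [round(d, 3) for d in disp])
```

The three averages should sit near -0.1, 0, and +0.1, the step size in step_env. That is the ground-truth action effect the controllability axis later asks the dream to reproduce.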

Step 2: Build the RSSM cell

This is the engine of Section 36.5, scaled down. The GRU carries deterministic memory; the prior predicts the next stochastic latent from memory alone (used when dreaming); the posterior corrects it with the observation (used in training). A small decoder maps the latent state back to a predicted observation and reward so the model can be trained and scored.

class RSSM(nn.Module):
    def __init__(self, stoch=8, deter=32, act=3, obs=1, hidden=64):
        super().__init__()
        self.gru = nn.GRUCell(stoch + act, deter)
        self.prior = nn.Sequential(nn.Linear(deter, hidden), nn.SiLU(), nn.Linear(hidden, 2 * stoch))
        self.post  = nn.Sequential(nn.Linear(deter + obs, hidden), nn.SiLU(), nn.Linear(hidden, 2 * stoch))
        self.dec   = nn.Sequential(nn.Linear(deter + stoch, hidden), nn.SiLU(), nn.Linear(hidden, obs + 1))
        self.stoch = stoch

    def sample(self, params):
        mean, logstd = params.chunk(2, -1)
        std = torch.exp(logstd.clamp(-5, 2))
        return mean + std * torch.randn_like(std), mean, std

    def forward(self, z, a, h, obs=None):
        h = self.gru(torch.cat([z, a], -1), h)        # advance deterministic memory
        prior = self.prior(h)
        if obs is None:                                # IMAGINATION: dream the next state
            z, _, _ = self.sample(prior)
            return z, h, prior, None
        # TODO: compute the posterior from cat([h, obs]), sample z from it, and return
        #       (z, h, prior, post) so the KL(post || prior) loss can be formed in Step 3.
        ...

Hint

post = self.post(torch.cat([h, obs], -1)); z, _, _ = self.sample(post); return z, h, prior, post. The decoder is not called here; it is applied to cat([h, z]) in the training and evaluation loops.
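The sample method draws the latent by shifting and scaling standard-normal noise (the reparameterization trick), which is what later lets gradients pass through a dreamed rollout. A small standalone check of that property, assuming the same mean and log-std layout as the RSSM above; the free function sample here is a copy made for illustration, not a new API.

```python
import torch

torch.manual_seed(0)

def sample(params):
    # Same reparameterized draw as RSSM.sample: split into mean and log-std,
    # then shift and scale standard-normal noise.
    mean, logstd = params.chunk(2, -1)
    std = torch.exp(logstd.clamp(-5, 2))
    return mean + std * torch.randn_like(std)

params = torch.zeros(4, 16, requires_grad=True)   # batch of 4, 2 * stoch with stoch = 8
z = sample(params)
z.sum().backward()                                # backpropagate through the random draw

mean_grad, logstd_grad = params.grad.chunk(2, -1)
print(z.shape)                                    # latent has stoch = 8 dimensions
print(bool(torch.all(mean_grad == 1)))            # d z / d mean is exactly 1
print(bool(logstd_grad.abs().sum() > 0))          # the log-std also receives gradient
```

If the draw were made with a non-reparameterized sampler, params.grad would not carry this signal, and the imagination loop of Step 4 could not train the policy by backpropagation alone.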

Step 3: Train the latent dynamics (the model loop)

Unroll the RSSM over each logged episode with the posterior, decode the predicted observation and reward at every step, and minimize reconstruction error plus the KL that pulls the prior (the dreaming path) toward the posterior (the observing path). This KL is the term that teaches the prior to dream accurately, the same ELBO structure as a VAE.

def kl(prior, post):
    pm, pls = prior.chunk(2, -1); qm, qls = post.chunk(2, -1)
    pv, qv = torch.exp(pls.clamp(-5, 2)) ** 2, torch.exp(qls.clamp(-5, 2)) ** 2
    return (0.5 * ((qm - pm) ** 2 / pv + qv / pv - 1 + (pv.log() - qv.log()))).sum(-1).mean()

model = RSSM(); opt = torch.optim.Adam(model.parameters(), lr=2e-3)
obs, act, rew = [torch.tensor(np.stack(x)) for x in zip(*episodes)]  # [B, T, .], built once
for epoch in range(40):
    B, Tn = obs.shape[0], obs.shape[1]
    z, h = torch.zeros(B, model.stoch), torch.zeros(B, 32)
    rec_loss, kl_loss = 0.0, 0.0
    for t in range(Tn):
        z, h, prior, post = model(z, act[:, t], h, obs=obs[:, t])
        pred = model.dec(torch.cat([h, z], -1))
        # TODO: add MSE(pred[:, :1], obs[:, t]) + MSE(pred[:, 1:], rew[:, t:t+1]) to rec_loss,
        #       and add kl(prior, post) to kl_loss.
        ...
    loss = rec_loss / Tn + 1.0 * kl_loss / Tn
    opt.zero_grad(); loss.backward(); opt.step()
    if epoch % 10 == 0: print(f"epoch {epoch}  rec {rec_loss.item()/Tn:.4f}  kl {kl_loss.item()/Tn:.4f}")

Hint

rec_loss = rec_loss + F.mse_loss(pred[:, :1], obs[:, t]) + F.mse_loss(pred[:, 1:], rew[:, t:t+1]); kl_loss = kl_loss + kl(prior, post). Reconstruction loss should fall well below 0.01; if the KL collapses to zero, lower its weight, the posterior-collapse failure of Chapter 31.
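The hand-written kl function is easy to get wrong in the argument order or the variance ratio, so a quick cross-check against PyTorch's built-in Gaussian KL can be reassuring. This sketch copies the Step 3 function and compares it with torch.distributions.kl_divergence on random parameters; the helper as_normal is a hypothetical convenience defined only for this check.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

def kl(prior, post):
    # Closed-form KL(post || prior) for diagonal Gaussians, as in Step 3.
    pm, pls = prior.chunk(2, -1); qm, qls = post.chunk(2, -1)
    pv, qv = torch.exp(pls.clamp(-5, 2)) ** 2, torch.exp(qls.clamp(-5, 2)) ** 2
    return (0.5 * ((qm - pm) ** 2 / pv + qv / pv - 1 + (pv.log() - qv.log()))).sum(-1).mean()

def as_normal(params):
    # Hypothetical helper: turn [mean, log-std] parameters into a Normal.
    mean, logstd = params.chunk(2, -1)
    return Normal(mean, torch.exp(logstd.clamp(-5, 2)))

prior = torch.randn(5, 16) * 0.5   # [batch, 2 * stoch]: means, then log-stds
post = torch.randn(5, 16) * 0.5

# Library reference: per-dimension KL, summed over latent dims, averaged over batch.
reference = kl_divergence(as_normal(post), as_normal(prior)).sum(-1).mean()
print(float(kl(prior, post)), float(reference))   # the two values should agree
print(float(kl(post, post)))                      # KL of a distribution with itself
```

The first two numbers should match to floating-point precision, and the last should be zero. Note the argument order: the function takes (prior, post) but computes KL(post || prior), the direction the ELBO requires.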

Step 4: Train a policy in imagination (the imagination loop)

From here on, never touch the environment again. Roll the trained RSSM forward with the prior only, let a tiny policy choose actions from the latent state, decode the predicted reward, and train the policy to maximize total dreamed reward. Because the whole rollout is differentiable, the gradient of the return flows back through the imagined dynamics into the policy. This is the analytic policy gradient that model-free methods cannot get.

policy = nn.Sequential(nn.Linear(32 + 8, 64), nn.SiLU(), nn.Linear(64, 3))
popt = torch.optim.Adam(policy.parameters(), lr=3e-3)
for it in range(300):
    z, h = torch.zeros(64, 8), torch.zeros(64, 32)   # 64 parallel dreams
    total_reward = 0.0
    for _ in range(15):                               # 15-step imagined horizon
        logits = policy(torch.cat([h, z], -1))
        a = F.softmax(logits, -1)                     # soft action keeps it differentiable
        z, h, _, _ = model(z, a, h, obs=None)         # dream forward with the prior
        # TODO: decode the reward (last column of model.dec(cat([h, z]))) and add its
        #       mean to total_reward.
        ...
    loss = -total_reward                              # maximize dreamed return
    popt.zero_grad(); loss.backward(); popt.step()
    if it % 100 == 0: print(f"iter {it}  dreamed return {total_reward.item():.3f}")

Hint

r = model.dec(torch.cat([h, z], -1))[:, 1:]; total_reward = total_reward + r.mean(). The dreamed return should climb as the policy learns to steer the slider toward the center where reward peaks.
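A dreamed return only means something next to a return measured in the real environment. The sketch below gives two real-environment reference points over the same 15-step horizon: a random policy and a hand-written center-seeking controller. It is an illustration under stated assumptions: real_return, random_policy, and center_seeking are hypothetical names, and the policies here act on the raw observation, not on the RSSM latent state that the lab's learned policy consumes.

```python
import numpy as np

np.random.seed(0)
ACTIONS = np.array([-1.0, 0.0, 1.0])  # left, stay, right (same values as the lab)

def step_env(pos, action_idx):
    # Same dynamics as Step 1, written with NumPy only.
    pos = np.clip(pos + 0.1 * ACTIONS[action_idx] + 0.01 * np.random.randn(), -1, 1)
    obs = pos + 0.02 * np.random.randn()
    reward = 1.0 - abs(pos)
    return pos, np.array([obs], dtype=np.float32), reward

def real_return(act_fn, n_episodes=500, T=15):
    # Hypothetical helper: average undiscounted return over real episodes.
    # act_fn maps the latest observation (None before the first step) to an action index.
    totals = []
    for _ in range(n_episodes):
        pos, obs, total = np.random.uniform(-1, 1), None, 0.0
        for _ in range(T):
            pos, obs, r = step_env(pos, act_fn(obs))
            total += r
        totals.append(total)
    return float(np.mean(totals))

def random_policy(obs):
    return np.random.randint(3)

def center_seeking(obs):
    # Hand-written reference controller: step toward pos = 0, with a small dead band.
    if obs is None or abs(obs[0]) < 0.05:
        return 1                      # stay
    return 0 if obs[0] > 0 else 2     # left if right of center, else right

ret_random = real_return(random_policy)
ret_center = real_return(center_seeking)
print(f"random: {ret_random:.2f}  center-seeking: {ret_center:.2f}  (upper bound 15.0)")
```

A dreamed return far above the center-seeking figure is a warning sign that the policy is exploiting errors in the model, not the real dynamics. Scoring the learned policy itself on the real environment would additionally require updating (z, h) with the posterior at each real step.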

Step 5: Score the consistency triad

This is the section's payoff: a world model is only as good as its dream, so measure it. Compute three numbers, one per triad axis. (1) Physical consistency: one-step open-loop prediction error against held-out real episodes. (2) Controllability: dream two futures from the same state with all-left versus all-right actions and check the predicted positions diverge. (3) Coherence: open-loop prediction error as a function of rollout length, the drift curve.

@torch.no_grad()
def open_loop_error(horizon):
    obs, act, _ = [torch.tensor(np.stack(x)) for x in zip(*[log_episode() for _ in range(64)])]
    z, h = torch.zeros(64, 8), torch.zeros(64, 32)
    z, h, _, _ = model(z, act[:, 0], h, obs=obs[:, 0])   # one observed warm-up step
    err = []
    for t in range(1, horizon):
        z, h, _, _ = model(z, act[:, t], h, obs=None)    # dream, no observation
        pred_obs = model.dec(torch.cat([h, z], -1))[:, :1]
        err.append(F.mse_loss(pred_obs, obs[:, t]).item())
    return err

# TODO: build the controllability check: from a fixed warmed-up (z, h), dream 10 steps
#       with the all-left one-hot and again with the all-right one-hot, decode the final
#       predicted observation for each, and report the absolute difference (should be clearly > 0).
drift = open_loop_error(horizon=25)
print("one-step error:", drift[0], " 24-step error:", drift[-1])

Hint

For controllability, reuse the warm-up from open_loop_error, then run two separate dream loops feeding ACTIONS index 0 (left) and index 2 (right) as fixed one-hot tensors; decode each final state and compare. A faithful model gives a clearly positive gap; a degenerate one gives a gap near zero, meaning actions do not control the future. Because each dream samples its own latent noise, "near zero" here means no larger than the gap between two dreams that use the same action.
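To judge whether a dreamed left-versus-right gap is large or small, it helps to know what the real environment produces for the same experiment. The sketch below runs the 10-step all-left and all-right rollouts on the true dynamics and also measures a same-action noise floor. It is a reference calculation under the Step 1 dynamics, not part of the lab's solution, and rollout_final_obs is a hypothetical helper name.

```python
import numpy as np

np.random.seed(0)
ACTIONS = np.array([-1.0, 0.0, 1.0])  # left, stay, right (same values as the lab)

def step_env(pos, action_idx):
    # Same dynamics as Step 1, written with NumPy only.
    pos = np.clip(pos + 0.1 * ACTIONS[action_idx] + 0.01 * np.random.randn(), -1, 1)
    obs = pos + 0.02 * np.random.randn()
    reward = 1.0 - abs(pos)
    return pos, np.array([obs], dtype=np.float32), reward

def rollout_final_obs(start, action_idx, steps=10):
    # Hypothetical helper: repeat one action and return the last observation.
    pos, obs = start, None
    for _ in range(steps):
        pos, obs, _ = step_env(pos, action_idx)
    return float(obs[0])

starts = np.random.uniform(-1, 1, size=1000)
left = np.array([rollout_final_obs(s, 0) for s in starts])
right = np.array([rollout_final_obs(s, 2) for s in starts])
stay_a = np.array([rollout_final_obs(s, 1) for s in starts])
stay_b = np.array([rollout_final_obs(s, 1) for s in starts])

true_gap = float(np.abs(right - left).mean())       # what actions really change
noise_floor = float(np.abs(stay_a - stay_b).mean())  # what noise alone changes
print(f"true left-vs-right gap: {true_gap:.3f}   same-action noise floor: {noise_floor:.3f}")
```

A dreamed gap close to the true gap and well above the noise floor suggests the model has learned the action effect. In the model the floor comes from latent sampling instead of environment noise, so it is worth measuring the same way, with two dreams under one action.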

Step 6: Plot the triad and read the verdict

Put the three measurements in one figure so the diagnosis is visual. The drift curve is the most revealing: a flat or gently rising line means a faithful simulator, while a sharply rising line means a renderer that looks plausible for one frame and then loses the plot. This is exactly the distinction this section argues photorealism metrics cannot see.

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 3, figsize=(12, 3.2))
ax[0].bar(["1-step"], [drift[0]]); ax[0].set_title("Physical consistency\n(lower is better)")
ax[1].set_title("Controllability\n(left vs right gap)")    # fill from your Step 5 check
ax[2].plot(range(1, 25), drift, marker="."); ax[2].set_title("Coherence: drift vs horizon")
ax[2].set_xlabel("imagined step"); ax[2].set_ylabel("prediction MSE")
plt.tight_layout(); plt.savefig("world_model_triad.png", dpi=130)
print("saved world_model_triad.png")

Hint

Store the controllability gap from Step 5 in a variable and draw it with ax[1].bar(["gap"], [gap]). A healthy run shows a low one-step error, a clearly positive controllability gap, and a drift curve that rises gently rather than exploding.
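How gently is "gently"? One way to calibrate the drift panel is an oracle baseline: a predictor that knows the noise-free dynamics exactly and, like open_loop_error, sees only the first observation. Its error still grows with horizon because process noise accumulates, so it gives a rough lower reference for the learned model's curve. A minimal NumPy sketch under the Step 1 dynamics; oracle_drift is a hypothetical helper, not part of the lab.

```python
import numpy as np

np.random.seed(0)
ACTIONS = np.array([-1.0, 0.0, 1.0])  # left, stay, right (same values as the lab)

def oracle_drift(horizon=25, n_episodes=2000):
    # Hypothetical baseline: open-loop MSE of a predictor that applies the
    # noise-free dynamics starting from the single warm-up observation.
    sq_err = np.zeros(horizon - 1)
    for _ in range(n_episodes):
        pos = np.random.uniform(-1, 1)
        acts = np.random.randint(3, size=horizon)
        # Warm-up step: the one real observation the predictor gets.
        pos = np.clip(pos + 0.1 * ACTIONS[acts[0]] + 0.01 * np.random.randn(), -1, 1)
        est = pos + 0.02 * np.random.randn()
        for t in range(1, horizon):
            # Real environment step, with process and observation noise.
            pos = np.clip(pos + 0.1 * ACTIONS[acts[t]] + 0.01 * np.random.randn(), -1, 1)
            obs = pos + 0.02 * np.random.randn()
            # Oracle step: same action, no noise, no new observation.
            est = np.clip(est + 0.1 * ACTIONS[acts[t]], -1, 1)
            sq_err[t - 1] += (obs - est) ** 2
    return sq_err / n_episodes

oracle = oracle_drift()
print("oracle 1-step MSE:", oracle[0], " oracle 24-step MSE:", oracle[-1])
```

Overlaying this curve on the coherence panel, for example with ax[2].plot(range(1, 25), oracle, linestyle="--"), turns the drift plot into a comparison: the distance between the learned model's curve and the dashed one is the error attributable to the model, not to the environment's own noise.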

Expected Output

Training prints a reconstruction loss falling below about 0.01 and a non-collapsed KL; the imagination loop prints a dreamed return that climbs as the policy learns to hold the slider near the center. The final figure world_model_triad.png has three panels: a small one-step prediction error (physical consistency), a clearly positive left-versus-right gap (controllability), and a drift curve that rises gradually with horizon (coherence). Read together, these say the dream is locally accurate, genuinely action-controllable, and degrades gracefully, the three things this section argues an FVD or photorealism score would never reveal on its own.

Stretch Goals

Sabotage controllability on purpose: retrain with the action input to the GRU zeroed out, then re-run the triad. The one-step error may stay low while the controllability gap collapses to near zero, a concrete example of the "great FVD, terrible world model" clip from this section's first exercise.
Add a rare high-stakes transition (a "wall" the slider rarely hits) that appears in only a handful of logged episodes, then check whether the dream ever predicts it, the rare-transition failure mode of the Section 36.5 analysis exercise.
Right Tool: swap the from-scratch RSSM for the reference DreamerV3 implementation (github.com/danijar/dreamerv3) on a Gym CartPole-v1 task, the one-command path the library-shortcut callout in Section 36.5 shows, and compare its drift curve against your tiny model's.

Complete Solution

import torch, torch.nn as nn, torch.nn.functional as F, numpy as np
import matplotlib.pyplot as plt
torch.manual_seed(0); np.random.seed(0)
ACTIONS = torch.tensor([-1.0, 0.0, 1.0])

# --- Step 1: environment and logging ---
def step_env(pos, action_idx):
    pos = np.clip(pos + 0.1 * ACTIONS[action_idx].item() + 0.01 * np.random.randn(), -1, 1)
    obs = pos + 0.02 * np.random.randn()
    reward = 1.0 - abs(pos)
    return pos, np.array([obs], dtype=np.float32), reward

def log_episode(T=30):
    pos, obs_seq, act_seq, rew_seq = np.random.uniform(-1, 1), [], [], []
    for _ in range(T):
        a = np.random.randint(3)
        pos, obs, r = step_env(pos, a)
        obs_seq.append(obs); act_seq.append(np.eye(3, dtype=np.float32)[a]); rew_seq.append(r)
    return np.stack(obs_seq), np.stack(act_seq), np.array(rew_seq, dtype=np.float32)

episodes = [log_episode() for _ in range(200)]

# --- Step 2: RSSM ---
class RSSM(nn.Module):
    def __init__(self, stoch=8, deter=32, act=3, obs=1, hidden=64):
        super().__init__()
        self.gru = nn.GRUCell(stoch + act, deter)
        self.prior = nn.Sequential(nn.Linear(deter, hidden), nn.SiLU(), nn.Linear(hidden, 2 * stoch))
        self.post  = nn.Sequential(nn.Linear(deter + obs, hidden), nn.SiLU(), nn.Linear(hidden, 2 * stoch))
        self.dec   = nn.Sequential(nn.Linear(deter + stoch, hidden), nn.SiLU(), nn.Linear(hidden, obs + 1))
        self.stoch = stoch
    def sample(self, params):
        mean, logstd = params.chunk(2, -1)
        std = torch.exp(logstd.clamp(-5, 2))
        return mean + std * torch.randn_like(std), mean, std
    def forward(self, z, a, h, obs=None):
        h = self.gru(torch.cat([z, a], -1), h)
        prior = self.prior(h)
        if obs is None:
            z, _, _ = self.sample(prior)
            return z, h, prior, None
        post = self.post(torch.cat([h, obs], -1))
        z, _, _ = self.sample(post)
        return z, h, prior, post

# --- Step 3: train dynamics ---
def kl(prior, post):
    pm, pls = prior.chunk(2, -1); qm, qls = post.chunk(2, -1)
    pv, qv = torch.exp(pls.clamp(-5, 2)) ** 2, torch.exp(qls.clamp(-5, 2)) ** 2
    return (0.5 * ((qm - pm) ** 2 / pv + qv / pv - 1 + (pv.log() - qv.log()))).sum(-1).mean()

model = RSSM(); opt = torch.optim.Adam(model.parameters(), lr=2e-3)
obs, act, rew = [torch.tensor(np.stack(x)) for x in zip(*episodes)]  # [B, T, .], built once
for epoch in range(40):
    B, Tn = obs.shape[0], obs.shape[1]
    z, h = torch.zeros(B, model.stoch), torch.zeros(B, 32)
    rec_loss, kl_loss = 0.0, 0.0
    for t in range(Tn):
        z, h, prior, post = model(z, act[:, t], h, obs=obs[:, t])
        pred = model.dec(torch.cat([h, z], -1))
        rec_loss = rec_loss + F.mse_loss(pred[:, :1], obs[:, t]) + F.mse_loss(pred[:, 1:], rew[:, t:t+1])
        kl_loss = kl_loss + kl(prior, post)
    loss = rec_loss / Tn + 1.0 * kl_loss / Tn
    opt.zero_grad(); loss.backward(); opt.step()
    if epoch % 10 == 0: print(f"epoch {epoch}  rec {rec_loss.item()/Tn:.4f}  kl {kl_loss.item()/Tn:.4f}")

# --- Step 4: policy in imagination ---
policy = nn.Sequential(nn.Linear(32 + 8, 64), nn.SiLU(), nn.Linear(64, 3))
popt = torch.optim.Adam(policy.parameters(), lr=3e-3)
for it in range(300):
    z, h = torch.zeros(64, 8), torch.zeros(64, 32)
    total_reward = 0.0
    for _ in range(15):
        a = F.softmax(policy(torch.cat([h, z], -1)), -1)
        z, h, _, _ = model(z, a, h, obs=None)
        total_reward = total_reward + model.dec(torch.cat([h, z], -1))[:, 1:].mean()
    (-total_reward).backward(); popt.step(); popt.zero_grad()
    if it % 100 == 0: print(f"iter {it}  dreamed return {total_reward.item():.3f}")

# --- Step 5: triad ---
@torch.no_grad()
def open_loop_error(horizon):
    obs, act, _ = [torch.tensor(np.stack(x)) for x in zip(*[log_episode() for _ in range(64)])]
    z, h = torch.zeros(64, 8), torch.zeros(64, 32)
    z, h, _, _ = model(z, act[:, 0], h, obs=obs[:, 0])
    err = []
    for t in range(1, horizon):
        z, h, _, _ = model(z, act[:, t], h, obs=None)
        err.append(F.mse_loss(model.dec(torch.cat([h, z], -1))[:, :1], obs[:, t]).item())
    return err

@torch.no_grad()
def controllability_gap():
    obs, act, _ = [torch.tensor(np.stack(x)) for x in zip(*[log_episode() for _ in range(64)])]
    z0, h0 = torch.zeros(64, 8), torch.zeros(64, 32)
    z0, h0, _, _ = model(z0, act[:, 0], h0, obs=obs[:, 0])
    outs = []
    for idx in (0, 2):                                  # all-left, all-right
        z, h = z0.clone(), h0.clone()
        a = F.one_hot(torch.tensor(idx), 3).float().expand(64, 3)
        for _ in range(10):
            z, h, _, _ = model(z, a, h, obs=None)
        outs.append(model.dec(torch.cat([h, z], -1))[:, :1])
    return (outs[1] - outs[0]).abs().mean().item()

drift = open_loop_error(25)
gap = controllability_gap()
print("one-step error:", drift[0], " 24-step error:", drift[-1], " control gap:", gap)

# --- Step 6: plot ---
fig, ax = plt.subplots(1, 3, figsize=(12, 3.2))
ax[0].bar(["1-step"], [drift[0]]); ax[0].set_title("Physical consistency\n(lower is better)")
ax[1].bar(["gap"], [gap]); ax[1].set_title("Controllability\n(left vs right gap)")
ax[2].plot(range(1, 25), drift, marker="."); ax[2].set_title("Coherence: drift vs horizon")
ax[2].set_xlabel("imagined step"); ax[2].set_ylabel("prediction MSE")
plt.tight_layout(); plt.savefig("world_model_triad.png", dpi=130)
print("saved world_model_triad.png")

8. Further Reading: Benchmarks for World-Model Evaluation Intermediate

Motamed, S., Culp, L., Swersky, K., Jaini, P., and Geirhos, R. "Do generative video models understand physical principles?" (Physics-IQ). (2025). arXiv:2501.09038

The DeepMind benchmark of 396 real videos across fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics: the model sees a conditioning segment and predicts the next five seconds. Its central finding, that physical understanding is severely limited and uncorrelated with visual realism, is the empirical backbone of this section's argument that appearance metrics cannot certify a world model.

Zhang, C. et al. "Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments." (2025). arXiv:2504.02918

Evaluates physical reasoning on 80 real physical-experiment videos whose outcomes are pinned by conservation laws, scoring models with physics-informed metrics derived from the conserved quantity per setting. The conservation-law complement to Physics-IQ; read it to see how an exact physical invariant turns "looks like physics" into "conserves energy and momentum."

Huang, Z. et al. "VBench: Comprehensive Benchmark Suite for Video Generative Models." CVPR (2024), Highlight. arXiv:2311.17982

Decomposes video-generation quality into 16 disentangled, human-preference-validated dimensions (subject consistency, background consistency, motion smoothness, temporal flickering, and more) instead of one opaque FVD number. The Right Tool of this section: its subject/background consistency and temporal-flickering dimensions operationalize parts of long-horizon coherence directly.

Huang, Z. et al. "VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models." (2024). arXiv:2411.13503

Extends VBench to image-to-video evaluation with an adaptive-aspect-ratio image suite and adds trustworthiness dimensions alongside the technical-quality metrics. The version to reach for when your world model is conditioned on an initial frame rather than text alone.
