Part IV: Generative Vision Models
Chapter 36: Video, 3D Generation & World Models

Video, 3D Generation & World Models

"They taught me to draw a single frame and called it intelligence. Then they asked for a thousand frames that agree with each other, in three dimensions, that respond when poked. Now I understand why physics took the universe so long to debug."

A Diffusion Model That Just Discovered the Arrow of Time

Big Picture

Up to this chapter, a generative model produced one still image; here generation grows three new axes at once: time, depth, and agency. A video model must make a thousand frames agree. A 3D model must make every viewpoint agree. A world model must make the future agree with the actions you take inside it. Each axis is a new consistency constraint stacked on the diffusion and latent machinery you already own, and the chapter's arc runs from the most concrete (denoise a clip) to the most ambitious (learn a simulator of reality you can plan and act inside).

Chapter Overview

Part IV has, until now, lived in the world of the single frame. Chapter 33 taught a model to turn noise into one image by iterated denoising; Chapter 34 wired that process to language; Chapter 35 handed you the controls to steer and edit it. This chapter asks what happens when you refuse to stop at one frame. Add a time axis and you get video, where the hard part is not drawing pretty pixels but making frame 240 remember what frame 1 looked like. Add a depth axis and you get 3D generation, where the hard part is making the back of the object agree with the front. Add an action axis and you get world models, where the hard part is making the consequences of your choices obey something resembling physics.

The first two sections build video generation. Section 36.1 takes the U-Net and DiT denoisers from diffusion and shows the small surgical change, temporal attention and 3D convolution, that turns an image model into a clip model, then confronts the central enemy: temporal consistency, the demand that texture, identity, and motion stay coherent across time. Section 36.2 zooms out to the systems level, the Sora-class latent video transformers and the open ecosystem (Stable Video Diffusion, the open replications, the diffusers pipelines) you can actually run, including the spacetime-patch idea that lets one model swallow images and video of any resolution and length.
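The two ideas named above, attention along the time axis and spacetime patches, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the architecture of any particular model: the function names are hypothetical, the learned query/key/value projections are left out, and the patch sizes (including a temporal patch size of 1, chosen so that a still image is simply a one-frame clip) are picked for the example rather than taken from any published report.

```python
import numpy as np


def temporal_self_attention(x):
    """Self-attention along the time axis only (hypothetical helper).

    x: array of shape (T, H, W, C), one clip of T frames.
    Each spatial location attends across its own T time steps.
    Learned Q/K/V projections are omitted (identity) to keep the sketch short.
    """
    T, H, W, C = x.shape
    # Fold space into the batch axis: (H*W, T, C).
    seq = x.transpose(1, 2, 0, 3).reshape(H * W, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)   # (H*W, T, T)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over time
    out = weights @ seq                                  # (H*W, T, C)
    # Unfold back to (T, H, W, C).
    return out.reshape(H, W, T, C).transpose(2, 0, 1, 3)


def spacetime_patchify(latent, pt, ph, pw):
    """Cut a (T, H, W, C) latent into non-overlapping spacetime patches.

    Returns an array of shape (num_tokens, pt * ph * pw * C).
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # patch-grid indices first
    return x.reshape(-1, pt * ph * pw * C)


rng = np.random.default_rng(0)
latent_video = rng.normal(size=(8, 16, 16, 4))   # 8 latent frames, 16x16
latent_image = rng.normal(size=(1, 8, 24, 4))    # one frame, other resolution

video_tokens = spacetime_patchify(latent_video, pt=1, ph=4, pw=4)
image_tokens = spacetime_patchify(latent_image, pt=1, ph=4, pw=4)
print(video_tokens.shape, image_tokens.shape)
```

Both inputs become sequences of tokens with the same width and different lengths, which is what lets a single transformer consume images and clips of varying size. The temporal attention function shows the other half of the story: the only change from image attention is which axis is treated as the sequence.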

Sections 36.3 and 36.4 turn to space. Text-to-3D begins with the score-distillation trick that lifts a 2D image prior into a 3D asset, then races through the feed-forward generators (large reconstruction models, Gaussian-splat generators) that collapsed minutes of optimization into a single forward pass. Section 36.4 connects this directly to the neural rendering of Chapter 27: NeRF and Gaussian splatting were ways to fit a captured scene, and generative neural rendering makes them things a model can imagine.
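For orientation, the score-distillation gradient from the DreamFusion paper can be written as follows. The notation is lightly adapted and should be read as a summary rather than a derivation: \theta denotes the parameters of the 3D representation, g a differentiable renderer, \hat{\epsilon}_\phi the frozen text-conditioned denoiser, y the prompt, and w(t) a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x = g(\theta), \quad
x_t = \alpha_t\, x + \sigma_t\, \epsilon, \quad
\epsilon \sim \mathcal{N}(0, I).
```

The rendered view x is noised to x_t, the 2D denoiser predicts the noise, and the difference between predicted and injected noise is pushed back through the renderer into \theta. The denoiser itself is never differentiated, which is why a frozen image model can supervise a 3D asset.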

The final four sections are the chapter's intellectual summit: world models. A world model is a learned simulator. Section 36.5 builds the classic recipe, a recurrent state-space model (RSSM) that learns latent dynamics and lets an agent train inside its own dream, the Dreamer lineage. Section 36.6 scales that idea into generative world simulators, GAIA-1 for driving and the playable neural game engines, where the model generates the next frame conditioned on your control input. Section 36.7 presents the contrarian and influential alternative: JEPA, which predicts in representation space and throws the pixel decoder away entirely. Section 36.8 closes with the question that decides whether any of this is real progress: how do you evaluate a simulator, measuring physical consistency, controllability, and long-horizon coherence rather than mere photorealism?
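The RSSM recipe mentioned above can be sketched at toy scale as follows. This is a hedged illustration, not the Dreamer implementation: all names and sizes are hypothetical, the weights are random and untrained, a single tanh layer stands in for the GRU used in practice, and the decoder, reward head, and KL training objective are omitted entirely. What it does show is the split between filtering real observations with a posterior and imagining forward with the prior alone.

```python
import numpy as np

rng = np.random.default_rng(0)
H_DIM, Z_DIM, A_DIM, O_DIM = 16, 4, 2, 8   # illustrative sizes only


def init(shape):
    return rng.normal(scale=0.1, size=shape)


# Random, untrained weights: this sketch shows data flow, not learning.
params = {
    "W_h": init((H_DIM + Z_DIM + A_DIM, H_DIM)),   # deterministic core
    "W_prior": init((H_DIM, 2 * Z_DIM)),           # p(z_t | h_t)
    "W_post": init((H_DIM + O_DIM, 2 * Z_DIM)),    # q(z_t | h_t, o_t)
}


def sample_gaussian(stats):
    """Split into mean and log-std, then draw a reparameterized sample."""
    mean, log_std = np.split(stats, 2, axis=-1)
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)


def recurrent_step(h, z, a):
    """Deterministic path: h_t = f(h_{t-1}, z_{t-1}, a_{t-1})."""
    return np.tanh(np.concatenate([h, z, a], axis=-1) @ params["W_h"])


def prior(h):
    """Stochastic state predicted from the deterministic state alone."""
    return sample_gaussian(h @ params["W_prior"])


def posterior(h, obs):
    """Stochastic state corrected by the current observation."""
    return sample_gaussian(np.concatenate([h, obs], axis=-1) @ params["W_post"])


def observe(observations, actions):
    """Filter a real trajectory, using the posterior at every step.

    actions[t] is taken to be the action that led to observations[t].
    """
    h, z = np.zeros(H_DIM), np.zeros(Z_DIM)
    for obs, a in zip(observations, actions):
        h = recurrent_step(h, z, a)
        z = posterior(h, obs)
    return h, z


def imagine(h, z, policy, horizon):
    """Roll forward in latent space with the prior only; no observations."""
    trajectory = []
    for _ in range(horizon):
        a = policy(h, z)
        h = recurrent_step(h, z, a)
        z = prior(h)
        trajectory.append((h, z, a))
    return trajectory


obs_seq = rng.normal(size=(5, O_DIM))
act_seq = rng.normal(size=(5, A_DIM))
h0, z0 = observe(obs_seq, act_seq)
dream = imagine(h0, z0, policy=lambda h, z: np.zeros(A_DIM), horizon=10)
print(len(dream), dream[-1][0].shape, dream[-1][1].shape)
```

The `imagine` loop never touches an observation, which is the sense in which an agent can train inside its own dream: once the dynamics are learned, rollouts cost only latent-space arithmetic.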

The connective tissue of the whole chapter is a single idea you have met before: a latent space with dynamics. Chapter 31 gave you the latent; this chapter gives the latent a clock, a third spatial dimension, and a controller. By the end you will see video, 3D, and world models not as three separate fields but as three answers to one question: what does it take to generate something that has to stay consistent with itself?

Remember the Chapter in One Schema: Three Axes, One Question

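If you remember nothing else from this chapter, remember the three axes and the single question that unites them. Generation in the earlier chapters lived at a point: one still image. Chapter 36 grows it along three axes, each adding one kind of agreement the model must enforce: time (every frame must agree with every other frame), depth (every viewpoint must agree with every other viewpoint), and agency (the future must agree with the actions you take).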

Each axis is the same machinery (a latent space, a denoiser, a prior) with one new consistency constraint bolted on. The one question underneath all three: what does it take to generate something that must stay consistent with itself? Time, depth, agency; one question. Carry that schema through every section.

Prerequisites

This chapter sits near the top of Part IV and leans on most of it. The denoising-diffusion machinery of Chapter 33: Diffusion Models (the forward and reverse process, the U-Net and DiT denoisers, classifier-free guidance, latent diffusion) is assumed throughout; video and 3D generation are diffusion with extra axes. The latent-space and reconstruction view of Chapter 31: Autoencoders & VAEs underpins both the video VAE and the latent dynamics of world models. From Part III, the temporal modeling and optical-flow tools of Chapter 26: Video Understanding and the NeRF and Gaussian-splatting neural scene representations of Chapter 27: Depth, 3D Vision & Neural Scene Representations are direct prerequisites for Sections 36.3, 36.4, and the temporal-consistency discussion. The self-supervised representation learning of Chapter 25 motivates the decoder-free predictive models of Section 36.7. Comfort with PyTorch tensors, the attention mechanism, and reinforcement-learning vocabulary (state, action, reward, policy) at the level of a single paragraph is helpful for the world-model sections.

Chapter Roadmap

What's Next?

This chapter ends on a question (how do we know a world model is any good?) and that question opens directly onto Chapter 37: Evaluation, Safety & Generative Data Engines. Section 36.8 introduces the evaluation problem specific to simulators; Chapter 37 generalizes it across all of generative vision. It formalizes the distribution metrics (FID, KID, FVD) that we have used informally. It treats the safety and provenance questions that interactive simulators sharpen, since a model that generates controllable, realistic video is also a model that generates controllable, realistic deception. And it closes the loop by using generative models as data engines that train the very detectors and recognizers of Parts II and III. The arc from a single denoised pixel to a simulator of reality is complete by the end of this chapter; Chapter 37 asks what it is worth and how to deploy it responsibly.

Bibliography & Further Reading
Foundational Papers

Ho, J. et al. "Video Diffusion Models." NeurIPS (2022). arXiv:2204.03458

The paper that extended image diffusion to video with a factorized space-time U-Net and a gradient method for conditional sampling. The architectural template Section 36.1 dissects; essential for anyone implementing a temporal denoiser.

📄 Paper

Blattmann, A. et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." (2023). arXiv:2311.15127

The open-weight latent video diffusion model and its three-stage training recipe (image pretraining, video pretraining, high-quality video finetuning), built on systematic data curation. The runnable backbone of Section 36.2's hands-on pipeline; the first stop for practitioners who want to generate video on their own hardware.

📄 Paper

Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. "DreamFusion: Text-to-3D using 2D Diffusion." ICLR (2023). arXiv:2209.14988

Introduces Score Distillation Sampling, the trick that turns a frozen 2D text-to-image diffusion model into a supervisor for optimizing a 3D representation. The conceptual core of Section 36.3; read it to understand why early text-to-3D was an optimization, not a forward pass.

📄 Paper

Hong, Y., Zhang, K. et al. "LRM: Large Reconstruction Model for Single Image to 3D." ICLR (2024). arXiv:2311.04400

The feed-forward transformer that maps a single image to a triplane NeRF in about five seconds, ending the era of per-asset optimization. Marks Section 36.3's pivot from distillation to amortization; key reading for practitioners who need fast 3D assets.

📄 Paper

Ha, D. and Schmidhuber, J. "World Models." (2018). arXiv:1803.10122. Published at NeurIPS (2018) as "Recurrent World Models Facilitate Policy Evolution."

The paper that named the field: a VAE perceives, an RNN dreams the dynamics, and a tiny controller is trained entirely inside the dream. The conceptual seed of Sections 36.5 and 36.6; the most accessible entry point for readers new to learned simulators.

📄 Paper

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer). ICLR (2020). arXiv:1912.01603

Builds on the Recurrent State-Space Model, first introduced in the authors' earlier PlaNet paper (arXiv:1811.04551), and adds the actor-critic trained on imagined latent rollouts that Section 36.5 implements. Core reading for reinforcement-learning practitioners; see also DreamerV3 (arXiv:2301.04104) for the version that masters diverse domains with fixed hyperparameters.

📄 Paper

Assran, M. et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (I-JEPA). CVPR (2023). arXiv:2301.08243

LeCun's decoder-free alternative: predict masked regions in representation space rather than pixel space. The foundation of Section 36.7 for researchers weighing predictive against generative world models; extended to video by V-JEPA (arXiv:2404.08471) and given an action-conditioned planning variant in V-JEPA 2 (arXiv:2506.09985, 2025).

📄 Paper
Recent Research (2024-2026)

Brooks, T. et al. "Video generation models as world simulators" (Sora technical report). OpenAI (2024). openai.com/research

The spacetime-patch latent transformer and the explicit claim that scaling video generation yields emergent world-simulation behavior. The reference text behind Section 36.2 and the framing of the whole chapter; Sora launched publicly in December 2024 and Sora 2 (2025) added stronger physics and synchronized audio.

๐Ÿ“ Blog Post

Hu, A. et al. "GAIA-1: A Generative World Model for Autonomous Driving." Wayve (2023). arXiv:2309.17080

A 9-billion-parameter autoregressive world model that generates realistic driving video conditioned on text and action. The headline case study of Section 36.6; required reading for anyone applying world models to robotics or autonomy.

📄 Paper

Valevski, D., Leviathan, Y., Arar, M., and Fruchter, S. "Diffusion Models Are Real-Time Game Engines" (GameNGen). (2024). arXiv:2408.14837

A diffusion model that simulates the game DOOM at about 20 frames per second, conditioned on player actions. The playable neural game engine that anchors Section 36.6's interactive-environment discussion; a striking demo for engineers exploring action-conditioned generation.

📄 Paper

Tang, J. et al. "DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation." ICLR (2024). arXiv:2309.16653

Replaces the slow NeRF backbone of score distillation with 3D Gaussians, cutting text-to-3D from hours to about two minutes. Central to Sections 36.3 and 36.4; the practical bridge for practitioners who want fast, editable 3D output.

📄 Paper

Bruce, J. et al. "Genie: Generative Interactive Environments." ICML (2024). arXiv:2402.15391

A foundation world model that learns a latent action space from unlabeled video and lets a user steer the generated environment frame by frame. The unsupervised-action counterpart to GAIA-1 in Section 36.6; the successor Genie 3 (DeepMind, 2025) reached real-time 720p, minute-scale interactive worlds.

📄 Paper

Kang, B. et al. "How Far is Video Generation from World Model: A Physical Law Perspective." ICML (2025). arXiv:2411.02385

A controlled study of whether scaling video diffusion induces real physical laws or merely memorizes plausible motion. The empirical backbone of Section 36.8's physical-consistency evaluation; essential for researchers designing world-model benchmarks.

📄 Paper
Books

Prince, S. J. D. Understanding Deep Learning (2023). udlbook.github.io

Chapters on diffusion and on reinforcement learning give the cleanest free treatment of the two engines this chapter fuses. The companion text for the graduate course this part follows; ideal for students who want rigorous yet readable background.

📖 Book

Murphy, K. P. Probabilistic Machine Learning: Advanced Topics (2023). probml.github.io

The state-space-model and variational-inference chapters formalize the latent dynamics and the RSSM evidence lower bound that Section 36.5 builds. Free online; the right reference for readers who want the full probabilistic derivations behind world models.

📖 Book
Tools & Libraries

Hugging Face. Diffusers: video and 3D pipelines documentation. huggingface.co/docs/diffusers

The reference for the runnable code in Sections 36.1 to 36.4, covering StableVideoDiffusionPipeline, the image-to-video and text-to-video pipelines, and the Shap-E and other 3D generators. The practitioner's first stop for reproducing the chapter's examples.

🔧 Tool

danijar. DreamerV3 official implementation. github.com/danijar/dreamerv3

The reference RSSM and imagination-training code that Section 36.5's from-scratch implementation deliberately mirrors at a small scale. The place to go when you want the production version; aimed at engineers moving from the toy model to a scalable one.

🔧 Tool

threestudio. A unified framework for 3D content generation. github.com/threestudio-project/threestudio

The community framework that implements DreamFusion, Magic3D, DreamGaussian, and many other text-to-3D methods behind one interface. The practical home of Section 36.3's algorithms; ideal for practitioners who want to compare distillation methods without reimplementing each.

🔧 Tool