Section 36.2: Text-to-Video Systems

"Give me a sentence and I will give you a world that moves. Just do not ask me how many fingers the person has, or whether the candle is getting shorter. I am still negotiating with thermodynamics."
A Latent Video Transformer With Big Ambitions

Big Picture

Scaling video diffusion to the Sora class required one key idea beyond Section 36.1: stop thinking in fixed-size clips and start thinking in spacetime patches, the variable-length tokens that let a single transformer ingest images and videos of any resolution and duration. Once video is a sequence of spacetime tokens, the same transformer scaling laws that built large language models apply, and a text-to-video system becomes a latent diffusion transformer conditioned on a prompt. This section covers that architecture, the open models that implement it, and how to run one with a few lines of diffusers.

Section 36.1 built a video denoiser by adding temporal layers to an image U-Net and showed how a video VAE compresses the clock. That recipe works, but it bakes in a fixed frame count and a fixed resolution. The leap to systems like OpenAI's Sora, and to the open models that chase it, came from a representation change borrowed from the vision transformers of Chapter 22: cut the compressed spacetime latent into a sequence of patch tokens, and let a transformer denoise the sequence. Because a sequence has no fixed length, one model can train on a still photo (a sequence of one frame's worth of patches) and a long clip (many frames' worth) in the same batch.

1. Spacetime Patches: One Representation for Everything (Intermediate)

Recall from Chapter 22 that a Vision Transformer turns an image into a grid of patch tokens. The spacetime-patch idea extends this to the time axis: after the video VAE compresses a clip into a latent of shape $(F', C, H', W')$, you cut it into small cubes of size $(t_p, h_p, w_p)$, flatten each cube into a token, and you have a one-dimensional sequence of spacetime patches. A 5-second clip and a single image differ only in how many tokens they produce; the transformer treats both as sequences.
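The cutting-and-flattening step can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function name `patchify`, the 16-channel latent, and the latent sizes below are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

def patchify(latent, patch):
    """Cut a (F, C, H, W) latent into flattened spacetime-patch tokens.

    Returns an array of shape (N, t_p * h_p * w_p * C), one row per cube.
    """
    f, c, h, w = latent.shape
    tp, hp, wp = patch
    if f % tp or h % hp or w % wp:
        raise ValueError("latent dims must be divisible by the patch size")
    # Split each axis into (number of cubes, position within a cube).
    x = latent.reshape(f // tp, tp, c, h // hp, hp, w // wp, wp)
    # Move the three cube-index axes to the front, cube contents to the back.
    x = x.transpose(0, 3, 5, 1, 4, 6, 2)
    return x.reshape(-1, tp * hp * wp * c)

# Hypothetical latent sizes, chosen only for illustration.
clip = np.zeros((8, 16, 30, 52), dtype=np.float32)    # eight latent frames
image = np.zeros((1, 16, 30, 52), dtype=np.float32)   # one latent frame

clip_tokens = patchify(clip, (1, 2, 2))
image_tokens = patchify(image, (1, 2, 2))
print(clip_tokens.shape, image_tokens.shape)   # (3120, 64) (390, 64)
```

Both outputs have the same token width (64 values) and differ only in sequence length, which is the sense in which an image and a clip are the same kind of input.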

This is the unification that powers Sora-class training. The number of tokens $N$ for a latent of $F'$ frames at $H' \times W'$ with patch size $(t_p, h_p, w_p)$ is

$$ N \;=\; \frac{F'}{t_p} \cdot \frac{H'}{h_p} \cdot \frac{W'}{w_p}, $$

and the transformer's cost scales as $O(N^2)$ in its attention, the familiar quadratic from Chapter 22. Variable $N$ means variable resolution and variable duration are handled by the same weights; you pad sequences to a common length within a batch and mask the padding so it takes no part in attention. The diffusion transformer (DiT) you met in Chapter 33 is the denoiser, now operating on these spacetime tokens and conditioned on a text embedding through cross-attention exactly as in Chapter 34.
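As a small worked example of the formula, the sketch below counts tokens for a still image and a short clip and builds the padding masks a mixed batch would need. The helper names `token_count` and `padding_masks` and the latent sizes are assumptions for illustration; real training code would typically build these masks as tensors.

```python
def token_count(latent_shape, patch):
    """N = (F'/t_p) * (H'/h_p) * (W'/w_p) for a latent of size (F', H', W')."""
    (f, h, w), (tp, hp, wp) = latent_shape, patch
    if f % tp or h % hp or w % wp:
        raise ValueError("latent dims must be divisible by the patch size")
    return (f // tp) * (h // hp) * (w // wp)

def padding_masks(lengths):
    """One mask per sequence: True marks a real token, False marks padding."""
    longest = max(lengths)
    return [[i < n for i in range(longest)] for n in lengths]

patch = (1, 2, 2)
lengths = [
    token_count((1, 30, 52), patch),   # a still image: one latent frame
    token_count((8, 30, 52), patch),   # a short clip: eight latent frames
]
masks = padding_masks(lengths)
print(lengths)                          # [390, 3120]
print(sum(masks[0]), len(masks[0]))     # 390 3120
```

The image occupies 390 real positions in a row padded to 3120, so most of its row is masked out. That waste is the price of batching very different lengths together.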

Figure 36.2.1: The Sora-class pipeline. A 3D VAE compresses raw video into a spacetime latent; the latent is cut into cubes and flattened into a variable-length sequence of spacetime patch tokens; a diffusion transformer denoises the sequence while cross-attending to the text prompt. Because the sequence length is free, one model handles any resolution and any duration, including single images.

Figure 36.2.1 traces the full path from pixels to tokens and back. The crucial property is in the caption: variable sequence length is what lets a single set of weights train jointly on the entire spectrum from one image to a long clip, which is why these models inherit the broad visual knowledge of image datasets while learning motion from video.

Key Insight: Video Generation Inherits Transformer Scaling

The reason the patch representation matters so much is that it makes video generation a sequence-modeling problem, and sequence modeling is the one domain where the field has decisively learned how to scale. The same compute-and-data scaling laws, the same transformer engineering, the same mixed-resolution training tricks that built large language models now apply to video. Sora's technical report frames this directly: scaling a spacetime-patch diffusion transformer produces emergent capabilities (object permanence, rough 3D consistency, simple cause and effect) that were never explicitly engineered. This is the empirical claim that motivates the entire back half of this chapter: scale a good-enough video generator and it starts to behave like a world simulator.

2. The Open Ecosystem: What You Can Actually Run (Beginner)

Sora's weights stay closed (the product launched publicly in December 2024, and Sora 2 followed in late 2025), but the architecture is public and the open ecosystem has reproduced most of it. As of 2026 the practical landscape has three tiers. Image-to-video open models, with Stable Video Diffusion (Blattmann et al., 2023) as the durable teaching example, are the most reliable: give them a starting frame and they animate it. Text-to-video open models, including CogVideoX, Mochi-1, LTX-Video, and HunyuanVideo (2024) and the more recent Apache-licensed Wan line (Alibaba, 2025), generate clips directly from a prompt using the spacetime-DiT design described in part 1 of this section. Closed APIs (Sora 2, Runway Gen-4, Kling, and Google's Veo 3 with native audio) lead on quality, length, and fidelity but expose only a generation endpoint.

For learning and for most products, the open image-to-video models are the right starting point because they are runnable on a single modern GPU and because image conditioning, as Section 36.1 argued, is the most stable mode. The diffusers library wraps them behind a uniform pipeline interface.

# End-to-end image-to-video with Stable Video Diffusion: animate a single still
# frame into a short clip on one consumer GPU. The motion_bucket_id and
# noise_aug_strength dials trade motion against coherence; the seed fixes the result.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the open image-to-video model in half precision to fit a single GPU.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16")
pipe.enable_model_cpu_offload()         # stream weights so it fits in modest VRAM

image = load_image("product.png").resize((1024, 576))
generator = torch.manual_seed(42)
frames = pipe(
    image,
    decode_chunk_size=8,                # decode 8 frames at a time to bound memory
    motion_bucket_id=127,               # higher = more motion, lower = more stable
    noise_aug_strength=0.02,            # small noise on the conditioning image
    num_frames=25,
    generator=generator,
).frames[0]

export_to_video(frames, "out.mp4", fps=7)
print(f"generated {len(frames)} frames")   # generated 25 frames

Code Fragment 1: A complete image-to-video generation in under twenty lines of code with diffusers and Stable Video Diffusion. The motion_bucket_id is the motion-versus-stability knob from Section 36.1; CPU offload and chunked decoding keep the model inside a single consumer GPU.

The motion_bucket_id parameter is the production knob: 127 is a balanced default, lower values produce nearly still video (safe for product shots), higher values produce dramatic motion (risky for coherence). The noise_aug_strength adds a touch of noise to the conditioning image, which paradoxically improves motion by preventing the model from clamping too tightly to the static input. These two dials, plus the seed, are most of what a practitioner tunes.

You Could Build This: A Catalog-Photo Turntable Generator (beginner, about 45 minutes)

With the Stable Video Diffusion pipeline above and the warping-error metric from Section 36.1, you already have every piece of the rotating-product demo the "flickering product demo" story describes. Build a small script that takes one catalog photo, generates a short clip at three motion_bucket_id settings, and prints the warping error for each so you can pick the value that adds motion without flicker. Wrap it in a tiny command-line tool that writes the chosen clip to out.mp4. The result is a genuine portfolio artifact: a one-photo-to-product-video utility that names the consistency-versus-motion tradeoff in numbers, not vibes, exactly the nightly regression check the startup in Section 36.1 wished it had built first.
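One way to structure the selection step is sketched below. It is a sketch under assumptions: `generate` and `warping_error` are hypothetical callables you would supply (the first wrapping the pipeline call, the second implementing the Section 36.1 metric), and the error values in the usage example are made-up stand-ins so the logic can run without a GPU.

```python
def pick_motion_bucket(candidates, generate, warping_error, max_error):
    """Return (best_bucket, report) for a sweep over motion settings.

    generate(bucket)     -> clip   (hypothetical: wraps the pipeline call)
    warping_error(clip)  -> float  (hypothetical: the Section 36.1 metric)
    best_bucket is the largest candidate whose error stays at or below
    max_error, or None if every candidate exceeds it.
    """
    report = {}
    for bucket in sorted(candidates):
        report[bucket] = warping_error(generate(bucket))
    passing = [b for b, err in report.items() if err <= max_error]
    return (max(passing) if passing else None), report

# Stand-ins so the selection logic can be exercised without a model.
fake_errors = {50: 0.010, 127: 0.018, 200: 0.041}   # made-up numbers
best, report = pick_motion_bucket(
    [50, 127, 200],
    generate=lambda bucket: bucket,                 # the "clip" is just the id
    warping_error=lambda clip: fake_errors[clip],
    max_error=0.02,
)
print(best)   # 127
```

Choosing the largest passing value encodes the goal stated above: as much motion as possible without crossing the flicker threshold. The threshold itself is something you would calibrate by eye on your own footage.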

Right Tool: Text-to-Video Is the Same Three Lines

Switching from image-to-video to a text-to-video open model is a one-line change of pipeline class at the API level. The internals do differ: Stable Video Diffusion denoises with a U-Net extended by temporal layers (the Section 36.1 recipe), whereas CogVideoX uses the spacetime-DiT design described above. The pipeline interface hides that difference:

# Same pipeline interface, different denoiser and conditioning: CogVideoX is a
# spacetime DiT driven by a text prompt, where SVD animated a conditioning image.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()         # same memory-saving offload as before
frames = pipe("a red kite drifting over a wheat field at sunset",
              num_frames=49, guidance_scale=6.0).frames[0]
export_to_video(frames, "kite.mp4", fps=8)

Code Fragment 2: Switching to a text-to-video open model with CogVideoXPipeline is a one-line pipeline-class change from the Stable Video Diffusion call above; the prompt replaces the conditioning image, and the denoiser underneath changes from SVD's temporal U-Net to a spacetime DiT.

This single call hides the text encoder, the 3D VAE, the spacetime patchification, the DiT denoising loop, classifier-free guidance, and the VAE decode, easily a thousand-line system reimplemented from scratch, behind one pipeline object. The library handles the scheduler, the guidance, and the memory-saving offload internally.
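Of the hidden stages, classifier-free guidance is the easiest to show in isolation, because it reduces to one line of arithmetic applied at every denoising step. The sketch below uses plain Python lists and made-up numbers in place of real noise-prediction tensors; the function name `apply_guidance` is illustrative, not a library call.

```python
def apply_guidance(uncond, cond, scale):
    """Classifier-free guidance: move the prediction away from the
    unconditional estimate and toward the text-conditioned one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Toy noise predictions for three latent values (made-up numbers).
uncond = [0.0, 1.0, -2.0]
cond = [1.0, 1.0, -1.0]

print(apply_guidance(uncond, cond, 1.0))   # [1.0, 1.0, -1.0]
print(apply_guidance(uncond, cond, 6.0))   # [6.0, 1.0, 4.0]
```

A scale of 1 returns the conditioned prediction unchanged; larger scales extrapolate past it, which is why raising guidance_scale tightens prompt adherence and, pushed too far, costs naturalness.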

3. Length, Resolution, and the Cost Wall (Advanced)

The spacetime-patch representation is elegant but it runs straight into the quadratic-attention wall. Doubling the clip length roughly quadruples the attention cost: in the token-count formula above, more frames raise only the $F'/t_p$ factor, so twice the duration means twice the tokens ($N$ doubles), and because attention is $O(N^2)$ the cost goes up fourfold. This is why open text-to-video models top out at a few seconds: a 1-minute clip at reasonable resolution is hundreds of thousands of tokens, and full attention over a sequence that long is impractical on a single GPU. Three strategies push the frontier, all of which you have seen in earlier guises.

First, heavier latent compression: a more aggressive video VAE (say $8 \times 16 \times 16$ instead of $4 \times 8 \times 8$) reduces $N$ at the cost of reconstruction fidelity, the same latent-space tradeoff from Chapter 31. Second, windowed and sparse attention, where tokens attend only within local spacetime neighborhoods plus a few global tokens, the efficient-attention idea from Chapter 28. Third, and most important for the rest of this chapter, autoregressive rollout: generate the clip in overlapping windows, each window conditioned on the tail of the previous one, so arbitrarily long video is produced at constant per-window cost. That last strategy is exactly the world-model interface of Section 36.6, where each new window is conditioned not just on the past but on an action.
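To put rough numbers on the wall and on the first strategy, the sketch below estimates token counts from pixel-space dimensions. Everything specific here is an assumption for illustration: 24 fps, an 832 by 480 frame, a (1, 2, 2) patch, and a simplified VAE that divides each axis evenly (real video VAEs often treat the first frame specially).

```python
def video_tokens(seconds, fps, height, width, vae=(4, 8, 8), patch=(1, 2, 2)):
    """Token count for a clip under assumed VAE and patch factors.

    vae   = (temporal, height, width) compression of the video VAE
    patch = (t_p, h_p, w_p) spacetime patch size in latent units
    Ceiling division is used so a partial cube still counts as one token.
    """
    dims = (seconds * fps, height, width)
    n = 1
    for size, v, p in zip(dims, vae, patch):
        n *= -(-size // (v * p))        # ceil(size / (v * p))
    return n

base = video_tokens(5, 24, 480, 832)                      # 4 x 8 x 8 VAE
heavy = video_tokens(5, 24, 480, 832, vae=(8, 16, 16))    # 8 x 16 x 16 VAE
minute = video_tokens(60, 24, 480, 832)                   # one full minute

print(base, heavy, minute)      # 46800 5850 561600
print((minute / base) ** 2)     # 144.0  attention cost, 60 s versus 5 s
print((base / heavy) ** 2)      # 64.0   attention saving from the heavier VAE
```

Under these assumptions a minute of video lands in the hundreds of thousands of tokens and costs 144 times the attention of a 5-second clip, while the heavier VAE cuts tokens eightfold and attention 64-fold.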

Fun Note

Quadratic attention means every doubling of the video length makes the model roughly four times grumpier. Want twice the runtime? Pay four times the bill. This is why a model that can dream a flawless four-second clip melts down at twenty seconds: it is not getting dumber, it is getting squared. The field's answer, generate the long clip as a chain of short ones, is the same trick a TV writer uses for a long season: never write the whole thing at once, just keep the previous episode in mind.

From the Field: The Marketing Team That Hit the Length Wall

A mid-size advertising agency adopted an open text-to-video model to prototype 30-second spots before committing to expensive live shoots. Their first sprint went well: 4-second clips of products, scenery, and simple actions looked broadcast-ready. Then a client asked for a continuous 20-second shot following a runner through a city. Every attempt either ran out of GPU memory or, when they reduced resolution to fit, dissolved into incoherence past the 6-second mark; the runner's jacket changed color, the buildings rearranged themselves. The team's lead diagnosed the problem correctly as the quadratic-attention wall, not a quality bug, and switched tactics: they generated the spot as five overlapping 5-second windows, each conditioned on the last frame of the previous window (image-to-video chaining), then cross-faded the seams. The runner stayed coherent because every window re-anchored on a clean frame. The lesson: long video today is not one generation, it is a chain of conditioned short ones, and knowing the difference is the difference between a working pipeline and an out-of-memory error. This chaining is the seed of autoregressive world models.
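The window layout in this story can be sketched as a small planner. It is an illustration under assumptions: the 8 fps frame rate and 10-frame overlap are hypothetical values picked so that a 20-second shot splits into exactly five 5-second windows, and the cost comparison counts only the quadratic attention term at a fixed resolution.

```python
def plan_windows(total_frames, window, overlap):
    """Return (start, end) frame ranges for an overlapping-window rollout.

    Each window after the first reuses the last `overlap` frames of the
    previous window as conditioning, so it adds `window - overlap` new frames.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    stride = window - overlap
    spans, start = [], 0
    while True:
        end = min(start + window, total_frames)
        spans.append((start, end))
        if end == total_frames:
            return spans
        start += stride

def relative_attention_cost(total_frames, window, overlap):
    """Chained cost divided by single-pass cost, O(N^2) term only."""
    spans = plan_windows(total_frames, window, overlap)
    chained = sum((end - start) ** 2 for start, end in spans)
    return chained / total_frames ** 2

# 20 s at an assumed 8 fps = 160 frames; 5 s windows = 40 frames each.
print(plan_windows(160, 40, 10))
# [(0, 40), (30, 70), (60, 100), (90, 130), (120, 160)]
print(relative_attention_cost(160, 40, 10))   # 0.3125
```

Five windows cost under a third of the single-pass attention, and more importantly each one fits in memory on its own. What the arithmetic does not capture is the seam risk, which is why each window is conditioned on the end of the previous one.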

4. Where Text-to-Video Still Fails (Intermediate)

Good practice requires knowing the boundary of validity, and text-to-video has a sharp one. These models excel at texture, lighting, atmosphere, and camera motion, the things a 2D image prior already knows, smeared coherently across time. They struggle precisely where physics and counting matter: object permanence over long occlusions (a person walks behind a pillar and emerges as someone else), conservation laws (a poured liquid that gains volume, a candle that does not shorten), and discrete counting (the right number of fingers, legs, wheels). These are not random bugs; they are the symptom that the model has learned the appearance of dynamics without an underlying state that the appearance must respect.

That diagnosis is the entire motivation for the world-model half of this chapter. A text-to-video model is a powerful renderer of plausible motion with no explicit world state; a world model (Sections 36.5 through 36.7) makes the latent state explicit and trains it to predict consequences, which is what coherent physics requires. The evaluation tools of Section 36.8 exist precisely to measure this gap quantitatively rather than by eyeballing finger counts.

Common Misconception: Physically Plausible Video Means the Model Learned Physics

A natural inference from a Sora-class clip of a ball bouncing or water pouring is that the model has internalized gravity, momentum, and conservation, the way a physics engine encodes them. In fact a video diffusion model has no explicit physical state and no equations of motion; it learned the joint distribution of pixel appearances over time and reproduces motions that resemble ones in its training data. When a test matches seen motion it looks physically correct; when you probe a novel or out-of-distribution situation, the controlled study by Kang et al. (2024) shows it interpolates among memorized motions rather than applying the underlying rule, which is exactly why these models let a candle fail to shorten or a poured liquid gain volume. Looking physical and being physical are different claims, and only the explicit-state world models of Sections 36.5 through 36.7 and the evaluation probes of Section 36.8 try to close the gap.

Research Frontier: From Generators to Simulators (2024-2026)

The defining research question of the moment is whether scaling text-to-video crosses the line into genuine world simulation. The Sora report (Brooks et al., 2024) argued yes, citing emergent 3D consistency and object permanence. Skeptical follow-ups push back with controlled experiments: Kang et al. (2024), the physical-law study that anchors Section 36.8, find that scaling improves in-distribution motion but does not reliably induce out-of-distribution physical laws; the model interpolates among seen motions rather than discovering the rule. Meanwhile the open frontier is bringing the two together: action-conditioned video models (Genie, Bruce et al. 2024; the GameNGen DOOM engine, Valevski et al. 2024) turn a generative video backbone into an interactive simulator by conditioning each frame on a control input. The current consensus is nuanced: large video generators are excellent appearance simulators and weaker physics simulators, and closing that gap, through explicit state, longer context, action conditioning, or hybrid objectives, is where the field is investing. The newest closed systems make the gap narrower, not gone: Sora 2 (OpenAI, 2025) and Veo 3 (Google, 2025) advertise markedly more accurate physics and, for Veo 3, natively synchronized audio, yet the controlled physical-law evidence still shows scaling improving seen motions more than discovering unseen laws. The next sections follow that investment.

Exercise 36.2.1: Why Spacetime Patches Unify Image and Video (Conceptual)

Explain in your own words why the spacetime-patch representation lets a single model train on both still images and videos in the same batch, and why that joint training is valuable. What goes wrong with the fixed-clip-length design of Section 36.1 if you try to feed it a single image?

Exercise 36.2.2: The Motion-Bucket Tradeoff (Coding)

Using the diffusers Stable Video Diffusion snippet, generate the same starting image at three motion_bucket_id values (50, 127, 200) with a fixed seed. Compute the flow-based warping error from Section 36.1 for each clip. Plot motion-bucket value against warping error and against a subjective motion-magnitude judgment, and describe the tradeoff curve you observe.

Exercise 36.2.3: One Long Sequence Versus Chained Windows (Analysis)

A 5-second clip at a given resolution produces $N$ spacetime tokens. Estimate the attention FLOPs and peak memory for a 20-second clip generated (a) as one sequence of $4N$ tokens versus (b) as four chained 5-second windows of $N$ tokens each. Quantify the saving, then list the qualitative cost: which consistency property does the chained scheme risk losing at the window seams, and how does the field's autoregressive-rollout approach try to mitigate it?