"They asked me to let every patch talk to every other patch. Then they added time, and now every patch wants to talk to every other patch in every other moment. I have run the numbers. We are going to need a smaller guest list."
An Attention Matrix Contemplating Its Own Quadratic Growth
A video transformer is the Vision Transformer with one more axis in its token sequence, and the entire engineering effort goes into surviving the quadratic cost that the extra axis creates. Cutting an image into patches gave a few hundred tokens; cutting a clip into spatiotemporal tubelets multiplies that by the number of frames, and the all-pairs attention of Chapter 22 grows with the square of the token count. This section extends patch embedding to tubelet embedding, shows why naive joint space-time attention is infeasible, and builds the factorized attention (TimeSformer's divided attention, ViViT's factorized encoder) that splits the cost into a tractable spatial part and temporal part. It closes with VideoMAE, which turns video's redundancy from Section 26.1 into a self-supervised pretraining advantage by masking ninety percent of the tubelets.
The previous section built action recognition out of convolutions; this one rebuilds it out of attention, exactly as Chapter 22 rebuilt image classification. You should hold that chapter's machinery in mind, because the video transformer reuses the transformer block, the class token, and the positional embedding almost unchanged. The genuinely new content is twofold: how to tokenize a clip, and how to make attention over the resulting sequence affordable. Both reduce to managing the time axis, the same theme that organized 3D convolutions and two-stream networks in Section 26.2.
1. Tubelet Embedding: Tokens in Spacetime Beginner
The Vision Transformer turned an image into tokens by cutting it into non-overlapping patches and linearly embedding each. A video transformer has two natural ways to do the same for a clip. The first, used by TimeSformer, embeds each frame into spatial patches independently, producing $T$ groups of $N_s$ patch tokens. The second, the tubelet embedding of ViViT, cuts the clip into 3D blocks (tubelets) of size $t \times p \times p$ that span several frames at once, so each token already fuses a little motion before any attention happens. Tubelet embedding is the exact 3D analogue of patch embedding, and just as the 2D patch embedding turned out to be a strided 2D convolution in Chapter 22, the tubelet embedding is a strided 3D convolution, the same Conv3d primitive from Section 26.2.
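The claim that tubelet embedding is a strided 3D convolution can be checked numerically. The sketch below is illustrative only: it uses small made-up sizes (tubelet 2, patch 4, dimension 8) rather than the chapter's settings, cuts the tubelets by hand, and applies the convolution's own weights as a plain linear map. If the two routes agree, the convolution is doing nothing more than "flatten each block and multiply".

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t, p, dim = 2, 4, 8                       # small illustrative sizes, not 2/16/768
conv = nn.Conv3d(3, dim, kernel_size=(t, p, p), stride=(t, p, p))

x = torch.randn(1, 3, 4, 8, 8)            # (N, C, T, H, W)
N, C, T, H, W = x.shape

with torch.no_grad():
    # Route 1: the strided 3D convolution.
    via_conv = conv(x).flatten(2).transpose(1, 2)          # (N, tokens, dim)

    # Route 2: cut tubelets by hand, flatten each, reuse the conv weights as a matrix.
    blocks = x.reshape(N, C, T // t, t, H // p, p, W // p, p)
    blocks = blocks.permute(0, 2, 4, 6, 1, 3, 5, 7)        # (N, T/t, H/p, W/p, C, t, p, p)
    flat = blocks.reshape(N, -1, C * t * p * p)            # one row per tubelet
    weight = conv.weight.reshape(dim, -1)                  # (dim, C*t*p*p)
    via_linear = flat @ weight.T + conv.bias

max_diff = (via_conv - via_linear).abs().max().item()
print("tokens:", tuple(via_conv.shape), "match:", max_diff < 1e-4)
```

The two results should differ only by floating-point rounding, which is why the comparison uses a tolerance rather than exact equality.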
The token count is what to watch. For a clip of $T$ frames at $H \times W$ resolution with patch size $p$ and temporal tubelet length $t$, the number of tokens is

$$N = \frac{T}{t} \cdot \frac{H}{p} \cdot \frac{W}{p}.$$
For a typical $T = 32$, $H = W = 224$, $p = 16$, $t = 2$, that is $16 \times 14 \times 14 = 3136$ tokens, against the $196$ of a single ViT image. Since attention cost grows as $N^2$, the clip is roughly $(3136 / 196)^2 \approx 256$ times more expensive per attention layer than the image. Figure 26.3.1 shows the tubelet construction and the token-count growth.
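The arithmetic in this paragraph is easy to reproduce. The helper below is a minimal sketch (the function name is ours, and it assumes the clip tiles evenly into tubelets); it multiplies the three grid sizes and compares the clip against a single ViT image.

```python
def tubelet_token_count(T, H, W, p, t):
    """Tokens produced by non-overlapping t x p x p tubelets (sizes assumed divisible)."""
    assert T % t == 0 and H % p == 0 and W % p == 0, "clip must tile evenly"
    return (T // t) * (H // p) * (W // p)

n_clip = tubelet_token_count(T=32, H=224, W=224, p=16, t=2)
n_image = tubelet_token_count(T=1, H=224, W=224, p=16, t=1)   # a single ViT image
print(n_clip, n_image, (n_clip / n_image) ** 2)
# 3136 196 256.0
```

Changing `t` from 2 to 1 (per-frame patches, no temporal fusion) doubles the token count and quadruples the attention cost, which is one practical reason to prefer tubelets.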
The code below implements tubelet embedding as a single Conv3d whose kernel and stride both equal the tubelet size, the cleanest way to express it, exactly mirroring the patch-embedding-as-convolution trick of Chapter 22.
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Cut a clip into t x p x p tubelets and embed each as one token."""
    def __init__(self, dim=768, patch=16, tubelet=2, in_ch=3):
        super().__init__()
        # kernel == stride == tubelet size: non-overlapping 3D blocks
        self.proj = nn.Conv3d(in_ch, dim,
                              kernel_size=(tubelet, patch, patch),
                              stride=(tubelet, patch, patch))

    def forward(self, x):                          # x: (N, C, T, H, W)
        z = self.proj(x)                           # (N, dim, T/t, H/p, W/p)
        N, D, t, h, w = z.shape
        tokens = z.flatten(2).transpose(1, 2)      # (N, t*h*w, dim)
        return tokens, (t, h, w)

embed = TubeletEmbed(dim=768, patch=16, tubelet=2)
tokens, grid = embed(torch.randn(1, 3, 32, 224, 224))
print("tokens:", tokens.shape, "grid (t,h,w):", grid)
# tokens: torch.Size([1, 3136, 768]) grid (t,h,w): (16, 14, 14)
Code Fragment 1: Tubelet embedding as a single Conv3d inside TubeletEmbed, where kernel and stride both equal the tubelet size. A 32-frame clip becomes 3136 tokens of dimension 768; the returned $(t, h, w)$ grid is what factorized attention needs to reshape tokens back into separate temporal and spatial groups.

2. The Quadratic Wall, Again Intermediate
You met the quadratic cost of attention in Chapter 22, where it bit a document-analysis team at high resolution. Video makes it worse along a new axis. Joint space-time attention, where every token attends to every other token across all positions and all frames, is the most expressive option and the most expensive: with $N$ tokens the attention matrix has $N^2$ entries, and at $N = 3136$ that is nearly ten million entries per head per layer, before any of the activations. For longer clips or higher resolution it is simply infeasible on commodity hardware.
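To put "ten million entries per head per layer" into memory terms, the sketch below counts the bytes needed just to hold the attention weights for one clip in one layer. It assumes fp32 storage, 12 heads (the head count used in this section's code), and an implementation that materializes the full matrix; fused attention kernels avoid storing it, so treat the figures as an upper-bound illustration rather than a measurement.

```python
def attention_matrix_bytes(n_tokens, heads=12, bytes_per_entry=4):
    """Bytes for one layer's attention weights for one clip (fp32 assumed)."""
    return n_tokens ** 2 * heads * bytes_per_entry

for name, n in [("ViT image", 196), ("32-frame clip", 3136)]:
    mib = attention_matrix_bytes(n) / 2 ** 20
    print(f"{name}: {n**2:,} entries per head, {mib:.1f} MiB per layer")
# ViT image: 38,416 entries per head, 1.8 MiB per layer
# 32-frame clip: 9,834,496 entries per head, 450.2 MiB per layer
```

Multiply the clip figure by the number of layers and the batch size, and by a further factor during training when activations are kept for the backward pass, and the infeasibility on commodity hardware becomes concrete.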
The remedy is to factorize attention along its axes, the same move that R(2+1)D made for convolution. Instead of one attention over all $N$ tokens, apply spatial attention within each frame (tokens attend only to other tokens in the same frame) and then temporal attention across frames (tokens at the same spatial location attend across time). The cost drops from $O(N^2)$ to roughly $O(N_s^2 \cdot T + T^2 \cdot N_s)$, where $N_s$ is the spatial token count per frame and, in this cost formula, $T$ counts the temporal token groups (the $T/t = 16$ of the running example, not the 32 raw frames). The table below makes the saving concrete.
| Scheme | Approx. attention pairs | Relative cost |
|---|---|---|
| Joint space-time | $N^2 = 3136^2 \approx 9.8\text{M}$ | 1.0x |
| Divided (spatial then temporal) | $N_s^2 T + T^2 N_s \approx 0.66\text{M}$ | ~0.07x |
| Spatial-only (per-frame ViT) | $N_s^2 T \approx 0.61\text{M}$ | ~0.06x |
Table 26.3.1 shows divided attention costing roughly one-fifteenth of joint attention while still letting every token influence every other through the two-step path. TimeSformer's ablation found this divided scheme to be the sweet spot: nearly the accuracy of joint attention at a fraction of the cost. The implementation in subsection 3 builds it.
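The entries of Table 26.3.1 follow directly from the cost formula. The function below is a small sketch (the name and dictionary keys are ours) that counts query-key pairs per head per layer for the three schemes, with `n_s` the spatial tokens per temporal group and `t` the number of temporal groups.

```python
def attention_pairs(n_s, t):
    """Query-key pairs per head per layer for three attention schemes."""
    joint = (n_s * t) ** 2
    spatial = n_s ** 2 * t          # each of the t groups attends within itself
    temporal = t ** 2 * n_s         # each of the n_s locations attends across time
    return {"joint": joint, "divided": spatial + temporal, "spatial_only": spatial}

pairs = attention_pairs(n_s=14 * 14, t=16)
for scheme, count in pairs.items():
    print(f"{scheme:>12}: {count:>9,} pairs, {count / pairs['joint']:.3f}x")
```

The temporal term is the small one here (about 50 thousand pairs against about 615 thousand spatial pairs), so adding temporal attention on top of a per-frame ViT costs far less than the spatial attention the model was already paying for.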
Joint space-time attention is the model equivalent of inviting everyone you have ever met to the same party and insisting they each hold a personal conversation with everyone else. Every new guest must talk to everyone already there, so the total grows with the square of the guest list: double the guests and you have quadrupled the small talk. Divided attention is the polite host's compromise: first you only talk to people in your own room (spatial attention), then you compare notes with whoever is standing in your spot in each of the other rooms (temporal attention). Nobody meets everybody directly, but a rumor still reaches the whole party in two hops, which is exactly the accuracy-for-cost trade the table records.
Notice that the same idea has now appeared three times. R(2+1)D factorized a 3D convolution into a 2D spatial convolution and a 1D temporal convolution (Section 26.2). Divided space-time attention factorizes one joint attention into spatial and temporal attention. And the windowed attention of Swin (Chapter 22) factorized global attention into local windows plus shifts. Whenever an operation's cost is quadratic in a quantity you can decompose, splitting the operation along that decomposition is the first move to try. The accuracy you give up is usually small because the factors recover global reach through composition; the compute you save is usually large.
3. Divided Space-Time Attention Intermediate
A divided space-time block applies a temporal attention sub-layer and then a spatial attention sub-layer (the order TimeSformer uses; the pair count in Table 26.3.1 is the same in either order), each wrapped in the residual-and-norm structure of the standard transformer block from Chapter 22. The trick is purely in the reshaping: before temporal attention, group tokens so that those sharing a spatial location across frames sit together; before spatial attention, group tokens so that those sharing a frame sit together. The multi-head attention itself is unchanged. The code reuses a standard attention module and only manipulates the token layout.
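Before reading the full block, it can help to check the layout bookkeeping on its own. The sketch below is a toy under our own assumptions (a tiny grid, and token values that encode `frame * 100 + location` so the grouping is visible by eye); it applies the same view/permute/reshape pattern and confirms that the temporal grouping can be undone without scrambling the sequence.

```python
import torch

B, t, h, w, D = 2, 3, 2, 2, 1
# Encode each token's (frame, location) in its value: frame * 100 + location.
frame = torch.arange(t).view(1, t, 1, 1).expand(B, t, h * w, D)
loc = torch.arange(h * w).view(1, 1, h * w, 1).expand(B, t, h * w, D)
x = (frame * 100 + loc).reshape(B, t * h * w, D)

# Temporal grouping: one row per (batch, location), t entries along time.
xt = x.view(B, t, h * w, D).permute(0, 2, 1, 3).reshape(B * h * w, t, D)
# Spatial grouping: one row per (batch, frame), h*w entries within the frame.
xs = x.view(B, t, h * w, D).reshape(B * t, h * w, D)
# Inverse of the temporal grouping restores the original flat order.
back = xt.reshape(B, h * w, t, D).permute(0, 2, 1, 3).reshape(B, t * h * w, D)

print(xt[1].flatten().tolist())   # location 1 across frames: [1, 101, 201]
print(xs[1].flatten().tolist())   # frame 1 across locations: [100, 101, 102, 103]
print(torch.equal(back, x))       # True
```

Each row of `xt` holds one spatial location seen at every time step, and each row of `xs` holds one frame's worth of locations, so a standard attention module applied to those rows attends along exactly one axis.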
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """TimeSformer-style block: temporal attention, then spatial attention."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, t, h, w):                 # x: (B, t*h*w, dim)
        B, _, D = x.shape
        # --- temporal attention: tokens at the same (h,w) attend across time ---
        xt = x.view(B, t, h * w, D).permute(0, 2, 1, 3).reshape(B * h * w, t, D)
        nt = self.norm_t(xt)                       # normalize once, reuse as query, key, value
        a, _ = self.attn_t(nt, nt, nt)
        xt = (xt + a).reshape(B, h * w, t, D).permute(0, 2, 1, 3).reshape(B, t * h * w, D)
        # --- spatial attention: tokens in the same frame attend to each other ---
        xs = xt.view(B, t, h * w, D).reshape(B * t, h * w, D)
        ns = self.norm_s(xs)
        a, _ = self.attn_s(ns, ns, ns)
        xs = (xs + a).reshape(B, t * h * w, D)
        # --- shared MLP sub-layer ---
        return xs + self.mlp(self.norm_m(xs))

block = DividedSpaceTimeBlock(dim=768, heads=12)
x = torch.randn(1, 16 * 14 * 14, 768)              # the 3136 tokens from subsection 1
print("block out:", block(x, t=16, h=14, w=14).shape)   # block out: torch.Size([1, 3136, 768])
Code Fragment 2: The DividedSpaceTimeBlock. The attn_t sub-layer reshapes tokens so each spatial location attends across all frames; attn_s reshapes so each frame attends within itself. The nn.MultiheadAttention modules are standard; only the view/permute/reshape token layout changes between the two sub-layers, which is what makes the factorization cheap to implement.

Implementing the block teaches the factorization; in production you load a pretrained model. Hugging Face Transformers ships TimeSformer and VideoMAE with their processors, so classifying a clip is a few lines, and the model already carries Kinetics-scale features:
# Classify a clip with a Kinetics-finetuned VideoMAE from the Hugging Face hub.
# The processor and id2label map remove the preprocessing and label-decoding boilerplate.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

proc = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

# video_frames: a list of 16 RGB frames as (H, W, 3) uint8 numpy arrays or PIL images.
# Random frames stand in here so the snippet runs; replace them with frames from a real clip.
video_frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

inputs = proc(video_frames, return_tensors="pt")    # handles resize + normalize
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])   # e.g. "playing guitar" on a real clip
Code Fragment 3: Clip classification with VideoMAEImageProcessor and VideoMAEForVideoClassification. The processor handles resizing and normalization, the checkpoint carries the Kinetics-finetuned weights, and config.id2label maps the predicted index straight to an action name. Sampling the 16 frames from the video remains the caller's job.

The processor applies the resize and normalization the model was trained with, but it does not choose frames for you: this checkpoint expects 16-frame clips, so sample them before calling it. With that one caveat, a few lines and a download replace the tubelet embedding, the factorized blocks, the training loop, and the dataset.
With the pretrained classifier of Code Fragment 3 you now have everything you need for a small but genuinely useful tool: a script that takes any video file, samples 16 frames with the uniform sampler of Section 26.1, runs the Kinetics-finetuned VideoMAE, and writes a short list of the top predicted actions with their confidences. Slide the 16-frame window along a longer clip and you get a timeline of what is happening when, the skeleton of a video search or auto-tagging service. The whole thing is well under fifty lines and runs on a laptop CPU for a short clip. Difficulty: beginner, about 30 to 45 minutes. Portfolio value is high precisely because it is end to end: it ingests a real file, applies a foundation model, and emits a result a non-engineer can read. Extend it by averaging predictions over several windows for a steadier label, or by swapping in a video-language model from the research-frontier note below to tag actions the Kinetics label set never anticipated.
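The model-free parts of that script can be sketched in plain Python. Everything below is a hypothetical stand-in written for illustration: `uniform_indices` is our own simple sampler (not necessarily the one from Section 26.1), and `window_starts` and `top_k` are helper names we made up. The logits in the usage lines are invented numbers, not model output.

```python
import math

def uniform_indices(start, length, num_frames=16):
    """Evenly spaced frame indices inside [start, start + length)."""
    step = length / num_frames
    return [start + int(step * i + step / 2) for i in range(num_frames)]

def window_starts(total_frames, window, stride):
    """Start index of every full window that fits in the video."""
    return list(range(0, total_frames - window + 1, stride))

def top_k(logits, labels, k=3):
    """Softmax the logits and return the k most probable (label, prob) pairs."""
    peak = max(logits)                                  # subtract the max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    ranked = sorted(zip(labels, (e / total for e in exps)), key=lambda lp: -lp[1])
    return ranked[:k]

# Toy usage with made-up numbers: a 300-frame video, 64-frame windows, stride 32.
starts = window_starts(300, window=64, stride=32)
print(starts)
print(uniform_indices(starts[0], 64))
print(top_k([2.0, 0.5, 1.0], ["playing guitar", "juggling", "archery"], k=2))
```

In the real tool, each window's 16 indices would select frames from the decoded video, those frames would go through the processor and model of Code Fragment 3, and `top_k` would be applied to the resulting logits with the model's own label map.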
4. VideoMAE: Redundancy as a Pretraining Advantage Advanced
Video transformers are even hungrier for data than image ones, which collides with the fact that labeled video is scarce and expensive. The answer, as in Chapter 25, is self-supervised pretraining, and the masked autoencoder transfers to video beautifully. VideoMAE masks a large fraction of the tubelets, feeds only the visible ones to the encoder, and asks a lightweight decoder to reconstruct the missing pixels. The striking detail is the masking ratio: where the image MAE of Chapter 25 masked about 75 percent of patches, VideoMAE masks 90 percent or more, and it does so with tube masking (the same spatial locations are masked across all frames) to prevent the model from cheating by copying an unmasked neighboring frame.
This extreme ratio is not an arbitrary tuning choice; it is the direct consequence of the temporal redundancy you measured in Section 26.1. Because adjacent frames carry so little new information, the reconstruction task stays hard and informative even when ninety percent of the input is hidden. The tiny visible fraction also makes pretraining fast. The encoder runs its quadratic-cost attention over only the visible ten percent of tokens, and since $(0.1N)^2 = 0.01N^2$, dropping ninety percent of them cuts the attention cost by roughly a hundredfold (the linear-cost layers shrink tenfold), which is what makes masked video pretraining affordable. The tube_mask function below builds a tube mask and shows how few tokens the encoder actually processes.
It is tempting to read "image MAE masks 75 percent, VideoMAE masks 90 percent" as "more masking is simply better" and to treat the ratio as a free dial you can crank up. In fact the 90 percent works only because it is paired with the tube structure. The high ratio is licensed by temporal redundancy, but the structure is what stops the model from cheating: if you masked 90 percent of tubelets independently in each frame, a large share of the masked patches would still have an almost-identical unmasked copy at the same location in some other frame, so the encoder could reconstruct by copying along time instead of learning anything, and the features would be weak despite the high number. Tube masking removes the same spatial column across all frames precisely to kill that shortcut. So the takeaway is not "mask more", it is "mask in a way that defeats the cheapest reconstruction route given your data's redundancy"; on a static-image MAE, where there is no temporal copy to exploit, 90 percent random masking destroys too much and accuracy falls. A diagnostic question: at a fixed 90 percent ratio, why does tube masking train better features than random masking? If redundancy alone set the ratio, the two would be equivalent, and they are not.
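The copying shortcut can be quantified with a small simulation. The sketch below is a toy under our own assumptions (stdlib only, helper names ours, a fixed random seed, and "leak" defined as a masked token whose spatial location is visible in at least one other temporal slice). It does not measure feature quality; it only counts how often the shortcut is available under each masking scheme.

```python
import random

def tube_keep(t, n_s, ratio, rng):
    """Same visible spatial positions in every temporal slice."""
    kept = set(rng.sample(range(n_s), int(n_s * (1 - ratio))))
    return [kept for _ in range(t)]

def per_frame_keep(t, n_s, ratio, rng):
    """Independent visible positions drawn for each temporal slice."""
    k = int(n_s * (1 - ratio))
    return [set(rng.sample(range(n_s), k)) for _ in range(t)]

def leak_fraction(keep, n_s):
    """Share of masked tokens whose location is visible in some other slice."""
    masked = leaked = 0
    for f, visible in enumerate(keep):
        others = set().union(*(v for g, v in enumerate(keep) if g != f))
        for s in range(n_s):
            if s not in visible:
                masked += 1
                leaked += s in others
    return leaked / masked

rng = random.Random(0)
t, n_s = 8, 14 * 14
tube_leak = leak_fraction(tube_keep(t, n_s, 0.9, rng), n_s)
frame_leak = leak_fraction(per_frame_keep(t, n_s, 0.9, rng), n_s)
print(f"tube masking leak: {tube_leak:.2f}")
print(f"per-frame masking leak: {frame_leak:.2f}")
```

Tube masking gives a leak of exactly zero by construction, while independent per-frame masking at the same 90 percent ratio should leave roughly half of the masked tokens with a visible same-location copy in this eight-slice setting. The ratio is identical in both runs; only the structure differs.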
import torch

def tube_mask(t, h, w, mask_ratio=0.9):
    """Mask the same spatial positions across all t temporal groups."""
    num_spatial = h * w
    num_keep = int(num_spatial * (1 - mask_ratio))
    keep_spatial = torch.randperm(num_spatial)[:num_keep]    # same for every frame
    # build a (t*h*w,) boolean keep-mask by repeating the spatial pattern over time
    keep = torch.zeros(t, num_spatial, dtype=torch.bool)
    keep[:, keep_spatial] = True
    return keep.flatten()                                    # (t*h*w,)

t, h, w = 8, 14, 14
keep = tube_mask(t, h, w, mask_ratio=0.9)
print("total tubelet tokens:", keep.numel())                 # total tubelet tokens: 1568
print("tokens seen by encoder:", keep.sum().item())          # tokens seen by encoder: 152
print("fraction processed: {:.0%}".format(keep.float().mean().item()))   # fraction processed: 10%
Code Fragment 4: Tube masking with tube_mask: the same keep_spatial positions stay visible in every frame and all other positions are dropped, so the model cannot fill in a masked tubelet by copying the same location from another frame. At mask_ratio=0.9 the encoder sees only 152 of 1568 tokens, roughly a tenth of the sequence, which cuts the encoder's attention cost by about a hundredfold while keeping reconstruction a hard task.

VideoMAE and its successors (VideoMAE V2, and the video branch of the joint image-video foundation models) are the modern recipe for pretraining a video transformer without a labeled Kinetics-scale dataset, and they connect directly to the self-supervision arc of Chapter 25. The pretrained backbone is then fine-tuned for action recognition exactly as the R(2+1)D backbone was in Section 26.2. Convolutions and transformers alike have now given us a clip-level label, but neither tells us how each individual pixel moved between two frames; Section 26.4 turns from recognizing motion to measuring it densely, rebuilding the optical flow of Chapter 15 in the deep era with RAFT.
Who: a university lab studying surgical-skill assessment from operating-room video, 2024, with a modest four-GPU cluster. Situation: they had thousands of hours of unlabeled procedure footage but only a few hundred clips with expert skill ratings, far too few to train a video transformer from scratch. Problem: a TimeSformer fine-tuned on the labeled clips alone overfit badly, and they could not afford the compute to pretrain on a public dataset and then domain-adapt. Decision: they ran VideoMAE pretraining directly on their own unlabeled surgical footage, relying on the 90 percent tube-masking ratio to keep the per-clip cost low enough that the whole unlabeled corpus fit in their compute budget over a weekend. Result: the domain-pretrained backbone, fine-tuned on the few hundred rated clips, beat the from-scratch model and an ImageNet-video-pretrained baseline by a clear margin on skill-rating correlation. Lesson: the extreme masking ratio is what made pretraining on a small in-domain corpus affordable; video's redundancy, the same property that made it expensive in Section 26.1, is exactly what makes masked pretraining cheap. When you have unlabeled video in your domain, self-supervised pretraining on it often beats borrowing features from a generic dataset.
The 2023 to 2026 trajectory has been toward general video foundation models. VideoMAE V2 and the InternVideo family scale masked pretraining to roughly billion-parameter encoders, with the InternVid corpus reaching hundreds of millions of clips. Video-language models, which pair a video encoder with a language model (video extensions of CLIP-style models and open VideoLLaMA-style systems), enable zero-shot action recognition and video question answering from text prompts, the temporal extension of the CLIP embeddings you will meet in Chapter 34. A separate frontier attacks the remaining quadratic cost for long videos directly: memory-augmented and state-space video models (the video applications of the Mamba family noted in Chapter 22) aim for linear-cost temporal modeling so a model can attend over minutes rather than seconds. How to give a model genuine long-term temporal memory, rather than the few-second window every architecture in this chapter assumes, is one of the defining open problems of video understanding.
For a clip of 64 frames at $224 \times 224$ with spatial patch size 16 and tubelet length 4, compute the number of tokens $N$ using the formula in subsection 1. Then compute the number of attention pairs for joint space-time attention and for divided attention (spatial within frame, temporal across frames), and state the ratio. In two sentences, explain why divided attention's saving grows as the clip gets longer, and connect this to the R(2+1)D factorization of Section 26.2.
Write a JointSpaceTimeBlock that runs a single nn.MultiheadAttention over all $t \cdot h \cdot w$ tokens at once, and compare it to the DividedSpaceTimeBlock from subsection 3 on the same input. Measure the peak GPU memory and the wall-clock time of one forward pass for token grids of increasing size ($t = 4, 8, 16$ at $h = w = 14$). Plot both curves and confirm that the joint block's memory grows quadratically while the divided block's grows far more slowly, reproducing the trade in Table 26.3.1.
VideoMAE masks the same spatial positions across all frames (tube masking) rather than masking each frame's tokens independently. Using the temporal-redundancy measurement from Section 26.1, write a one-page argument for why independent per-frame masking would make the reconstruction task too easy and produce weak features, while tube masking keeps it hard. Support your argument by modifying the tube_mask function to do independent per-frame masking, then describe what information an encoder could exploit under each scheme to reconstruct a masked tubelet.