"A single frame told me a man was holding a door. The next frame said he was holding it open. The one after that revealed he was, in fact, walking through it. Time, it turns out, is the only honest narrator."
A Frame That Finally Saw the Sequel
A video is not a folder of images; it is a signal in space and time, and the central problem of video understanding is deciding how much of your model's budget to spend on each axis. An image classifier can tell you a person is in the frame, but only the temporal axis distinguishes sitting down from standing up, opening a door from closing it, a wave from a slap. This chapter adds that axis to everything you have built in Part III. We start by asking what a clip even is and how to feed one to a network without drowning in redundant pixels. We learn the two architectural families that defined action recognition, the 3D convolution and the two-stream network, and then watch the transformer of Chapter 22 absorb both into a single attention-over-spacetime design. We bring optical flow, introduced classically in Chapter 15, into the deep era with RAFT, a network that estimates dense motion to sub-pixel accuracy. And we close by turning detection into tracking, following many objects across time with learned appearance features rather than hand-tuned motion models.
Chapter Overview
Every model in Part III so far has looked at one image at a time. That was a deliberate simplification, and it is also a profound limitation. The world does not arrive as still photographs; it arrives as a stream, and an enormous amount of meaning lives in how that stream changes. A photograph of a pot on a stove cannot tell you whether the water is about to boil or has already boiled over. A photograph of two people cannot tell you whether they are greeting each other or saying goodbye. Action, intent, causation, and physics all live in the time axis, and to read them a model must see more than one frame.
Adding time sounds like it should be a small change, but it is not. A ten-second clip at thirty frames per second is three hundred images, and naively stacking them into a network multiplies both compute and memory by more than two orders of magnitude relative to a single image, while flooding the model with near-duplicate frames. So the first decision in any video system is how to sample and represent the clip, and that decision shapes everything downstream. Section 26.1 works through it: the structure of video data, frame sampling strategies, the redundancy that makes video both expensive and forgiving, and the practical mechanics of decoding clips into tensors with modern tooling.
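To make the sampling budget concrete, here is a minimal sketch of uniform frame sampling in plain Python. The function name `uniform_indices` and the choice of taking the centre frame of each segment are illustrative assumptions, not the chapter's tooling; Section 26.1 covers the real decoding path.

```python
def uniform_indices(num_frames: int, num_samples: int) -> list[int]:
    """Split the clip into num_samples equal segments and return the
    index of the centre frame of each segment."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * (i + 0.5)))
            for i in range(num_samples)]


clip_frames = 10 * 30                      # ten seconds at thirty frames per second
indices = uniform_indices(clip_frames, 8)  # keep 8 frames out of 300
print(indices)                             # [18, 56, 93, 131, 168, 206, 243, 281]
print(clip_frames / len(indices))          # 37.5 (frames discarded per frame kept, plus one)
```

Keeping 8 of 300 frames cuts the input by a factor of 37.5, which is the sense in which redundancy makes video forgiving: neighbouring frames carry almost the same content, so dropping them usually costs little.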
With clips in hand, Section 26.2 builds the two architectural families that defined the first decade of deep action recognition. The 3D convolutional network generalizes the learnable kernel of Chapter 19 from two spatial dimensions to three spatiotemporal ones, sliding a small cube over the clip so that motion patterns become learned features just as edges did in 2D. The two-stream network takes a different bet: split appearance and motion into separate pathways, feeding raw frames to one and precomputed optical flow to the other, then fuse their predictions. We implement both, compare their trade-offs, and meet the designs that made 3D convolutions practical: the inflated 2D kernels of I3D, the factorized spatial-then-temporal convolutions of R(2+1)D, and the dual-rate pathways of SlowFast.
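The cost of moving from a 2D kernel to a 3D one can be worked out by hand. The sketch below uses the standard convolution output-size formula and counts weights (biases ignored). The helper names, the 64-channel layer, and the 64 intermediate channels in the factorized variant are illustrative assumptions, not values taken from the chapter.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Standard convolution output length along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_output_shape(clip_shape, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Apply the formula to the (time, height, width) axes of a clip."""
    return tuple(conv_output_size(s, k, st, p)
                 for s, k, st, p in zip(clip_shape, kernel, stride, padding))


def conv_weight_count(c_in: int, c_out: int, kernel) -> int:
    """Number of weights in one convolution layer, ignoring biases."""
    count = c_in * c_out
    for k in kernel:
        count *= k
    return count


clip = (16, 112, 112)  # (time, height, width)
print(conv3d_output_shape(clip, (3, 3, 3), padding=(1, 1, 1)))  # (16, 112, 112)
print(conv3d_output_shape(clip, (3, 3, 3)))                     # (14, 110, 110)

weights_2d = conv_weight_count(64, 64, (3, 3))     # 36864
weights_3d = conv_weight_count(64, 64, (3, 3, 3))  # 110592, three times the 2D layer

# A (2+1)D-style factorization: a 1x3x3 spatial convolution into `mid`
# channels followed by a 3x1x1 temporal convolution. `mid` is a free
# design choice; 64 is an assumption made for this example.
mid = 64
weights_factorized = (conv_weight_count(64, mid, (1, 3, 3))
                      + conv_weight_count(mid, 64, (3, 1, 1)))  # 49152
print(weights_2d, weights_3d, weights_factorized)
```

The temporal extent of the kernel multiplies the weight count directly, which is why a plain 3x3x3 layer is three times the size of its 3x3 counterpart.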
Section 26.3 brings the transformer to video. The self-attention you built in Chapter 22 treats an image as a sequence of patches; a video transformer extends that sequence across time, treating a clip as a sequence of spatiotemporal tubelets. The quadratic cost of attention, already a concern for high-resolution images, becomes acute when the token count is multiplied by the number of frames, so the section is largely about the techniques that keep the cost tractable: the factorized (divided space-time) attention of TimeSformer and ViViT, and the masked-autoencoder pretraining of VideoMAE. Section 26.4 returns to optical flow, the dense pixel-level motion field of Chapter 15, and rebuilds it with RAFT, a recurrent network that iteratively refines a flow estimate against an all-pairs correlation volume and that set an accuracy standard that still anchors the field. Section 26.5 finishes the chapter with multi-object tracking: the tracking-by-detection paradigm, the Kalman-and-Hungarian backbone of SORT, the learned re-identification embeddings of DeepSORT, and the low-confidence detection recovery of ByteTrack, which together let a tracker hold an identity through occlusion and crowding.
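The token-count explosion is easy to quantify. The sketch below counts query-key pairs for joint attention over all spacetime tokens and for divided attention, where each token attends once along time and once across space. The clip size, the 2x16x16 tubelet, and the helper names are assumptions chosen for illustration, and the class token is ignored.

```python
def token_grid(frames: int, height: int, width: int, tubelet=(2, 16, 16)):
    """Return (temporal positions, spatial positions) after tubelet tokenization."""
    t, h, w = tubelet
    return frames // t, (height // h) * (width // w)


def joint_attention_pairs(temporal: int, spatial: int) -> int:
    """Every token attends to every token in the clip."""
    n = temporal * spatial
    return n * n


def divided_attention_pairs(temporal: int, spatial: int) -> int:
    """Each token attends to its own location across time, then to its own frame."""
    n = temporal * spatial
    return n * (temporal + spatial)


T, S = token_grid(frames=16, height=224, width=224)
print(T, S, T * S)                    # 8 196 1568
print(joint_attention_pairs(T, S))    # 2458624
print(divided_attention_pairs(T, S))  # 319872
```

For this configuration divided attention needs roughly an eighth of the pairwise comparisons of joint attention, and the gap widens as more frames are added because the joint cost grows with the square of the frame count.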
The thread that runs through the whole chapter is that time is not free. Every design we study is, at heart, an answer to the same question: where do you spend compute, on the spatial detail of each frame or on the temporal relationships between them, and how do you avoid paying for the redundancy that video is full of? Keep that trade in mind, because it returns transformed in Chapter 36, where the same spatiotemporal modeling is turned around to generate video rather than understand it. The reference card below is the one schema worth carrying out of this chapter: every method in it is a different answer to "space or time, where do I spend?"
Everything in Chapter 26 is a different way to budget compute between the spatial axis (what each frame contains) and the temporal axis (how frames change). Memorize the chapter as five answers to that one question:
- Sample, do not stream (26.1): video is mostly redundant, so keep a handful of frames, not all three hundred. Redundancy is both the tax and the gift.
- Slide a box, or split the streams (26.2): a 3D convolution learns motion as an edge in spacetime; a two-stream network hands motion in as precomputed flow. SlowFast pays for both, cheaply.
- Attend, then factorize (26.3): a video transformer tokenizes spacetime, and divided space-time attention tames the quadratic cost the extra axis creates.
- Encode, correlate, iterate (26.4): RAFT computes dense flow by matching every pixel pair and refining the guess in a loop.
- Predict, match, persist (26.5): a tracker turns per-frame detections into per-object timelines, supplying the object permanence the detector lacks.
If you remember nothing else, remember the question: space or time, where do I spend? Every architecture in this chapter, and the video generators of Chapter 36, must answer it.
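The "correlate" step in the fourth entry of the card can be sketched in a few lines. This toy version builds an all-pairs correlation matrix between two tiny feature maps, scaled by the square root of the feature dimension, and reads off the best match for each pixel. It covers only the matching idea: the learned feature encoder and the iterative refinement loop are omitted, and the one-hot features and function names are assumptions made for illustration.

```python
import math


def all_pairs_correlation(feat1, feat2):
    """feat1, feat2: lists of per-pixel feature vectors (flattened H*W).
    Returns a (H*W) x (H*W) matrix of scaled dot products."""
    scale = 1.0 / math.sqrt(len(feat1[0]))
    return [[scale * sum(a * b for a, b in zip(u, v)) for v in feat2]
            for u in feat1]


def best_match(row):
    """Index of the strongest correlation in one row (first index on ties)."""
    return max(range(len(row)), key=row.__getitem__)


# A 1x3 "image" whose content shifts one pixel to the right between frames.
frame1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
frame2 = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]

corr = all_pairs_correlation(frame1, frame2)
flow = [best_match(corr[i]) - i for i in range(2)]  # last pixel leaves the frame
print(flow)  # [1, 1]
```

The matrix has one entry per pixel pair, so its size grows with the square of the number of pixels, which is one reason to compute it on downsampled feature maps rather than on full-resolution images.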
Prerequisites
You should have read Chapter 19: Convolutional Neural Networks, because the 3D convolution of Section 26.2 is a direct generalization of the 2D kernel, and Chapter 20: CNN Architectures for the ResNet backbones that video networks inflate into three dimensions. Chapter 22: Vision Transformers is essential for Section 26.3; the video transformer is the patch-and-attend recipe extended over time, and you should be comfortable with self-attention and its quadratic cost. Chapter 23: Object Detection underpins the tracking-by-detection pipeline of Section 26.5. From Part II, Chapter 15: Motion, Optical Flow & Tracking introduced the classical Lucas-Kanade and Horn-Schunck flow and the Kalman filter that this chapter rebuilds with learned components, and the comparison is much sharper if that material is fresh. Comfort with PyTorch tensors of shape (batch, channels, time, height, width) from Chapter 18 will make the code concrete.
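As a quick check on the five-dimensional layout, the sketch below tracks only shapes and byte counts, with no tensor library. It assumes the decoder hands back frames as (time, height, width, channels), which is the default layout of `torchvision.io.read_video`; the batch size, clip length, and helper names are illustrative assumptions.

```python
from math import prod


def permute_shape(shape, order):
    """Shape that results from reordering axes, as a permute call would."""
    return tuple(shape[axis] for axis in order)


def tensor_bytes(shape, bytes_per_element: int = 4) -> int:
    """Memory footprint of a dense tensor (float32 by default)."""
    return prod(shape) * bytes_per_element


decoded = (16, 224, 224, 3)                  # (time, height, width, channels)
clip = permute_shape(decoded, (3, 0, 1, 2))  # (channels, time, height, width)
batch = (8,) + clip                          # (batch, channels, time, height, width)

print(clip)                         # (3, 16, 224, 224)
print(batch)                        # (8, 3, 16, 224, 224)
print(tensor_bytes(batch) / 2**20)  # 73.5 (MiB for the input batch alone)
```

A batch of eight 16-frame clips already occupies 73.5 MiB as float32 before any activations are computed, which is why clip length and resolution are tuned together.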
Chapter Roadmap
- 26.1 From Frames to Clips: The Temporal Dimension. What a video tensor is, why thirty redundant frames per second is both a curse and a blessing, frame sampling strategies (dense, uniform, segment-based), and how to decode clips into 5D tensors with current tooling. The data foundation for the whole chapter.
- 26.2 Action Recognition: 3D CNNs & Two-Stream Networks. The 3D convolution as a spatiotemporal kernel, the two-stream split of appearance and motion, and the designs (I3D, R(2+1)D, SlowFast) that made 3D networks practical through inflation, factorization, and dual-rate pathways. Both families built and compared in PyTorch.
- 26.3 Video Transformers. The image transformer extended across time: spatiotemporal tubelet tokens, the token-count explosion, and the factorized attention (TimeSformer, ViViT) and masked-autoencoder pretraining (VideoMAE) that keep attention over spacetime tractable.
- 26.4 Deep Optical Flow: RAFT & Beyond. Dense motion estimation in the deep era: the all-pairs correlation volume, the recurrent GRU update operator that iteratively refines flow, and why RAFT's design became the template for modern flow, with a look at the transformer-based successors.
- 26.5 Multi-Object Tracking with Learned Features. Tracking-by-detection: the Kalman filter and Hungarian assignment of SORT, the learned re-identification embeddings of DeepSORT, and the low-confidence recovery of ByteTrack that holds identities through occlusion. Built on the detectors of Chapter 23.
The chapter closes with a capstone you build and run. The Hands-On Lab at the end of Section 26.5 assembles a full tracking-by-detection pipeline on a real street clip: it decodes the video with the frame-sampling tooling of Section 26.1, detects people with a Chapter 23 detector, links the detections into stable tracks with your own motion-prediction-and-match logic from Section 26.5, and uses the resulting identities to count how many distinct people cross a line. A final step swaps the hand-built tracker for production ByteTrack in two lines, so you carry away both the from-scratch understanding and the practical shortcut.
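The shape of that pipeline can be previewed with a toy version. The sketch below links boxes across frames by greedy IoU matching and counts the tracks whose centre crosses a vertical line. It is a simplification under stated assumptions: greedy matching stands in for Hungarian assignment, there is no motion prediction, unmatched tracks are dropped immediately, and the boxes are hand-written rather than produced by a detector. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, iou_threshold=0.3):
    """Pair each track with at most one detection, best overlaps first."""
    pairs = sorted(((iou(box, det), tid, j)
                    for tid, box in tracks.items()
                    for j, det in enumerate(detections)), reverse=True)
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break
        if tid not in matches and j not in used:
            matches[tid] = j
            used.add(j)
    return matches


def update_tracks(tracks, detections, next_id):
    """Carry matched identities forward; unmatched detections start new tracks."""
    matches = greedy_match(tracks, detections)
    updated = {tid: detections[j] for tid, j in matches.items()}
    matched = set(matches.values())
    for j, det in enumerate(detections):
        if j not in matched:
            updated[next_id] = det
            next_id += 1
    return updated, next_id


def count_crossings(frames, line_x):
    """Count distinct tracks whose box centre moves across x = line_x."""
    tracks, next_id, last_cx, crossed = {}, 0, {}, set()
    for detections in frames:
        tracks, next_id = update_tracks(tracks, detections, next_id)
        for tid, box in tracks.items():
            cx = (box[0] + box[2]) / 2
            prev = last_cx.get(tid)
            if prev is not None and (prev - line_x) * (cx - line_x) < 0:
                crossed.add(tid)
            last_cx[tid] = cx
    return len(crossed)


frames = [
    [(0, 0, 10, 10), (50, 0, 60, 10)],
    [(4, 0, 14, 10), (48, 0, 58, 10)],
    [(8, 0, 18, 10), (46, 0, 56, 10)],
]
print(count_crossings(frames, line_x=10))  # 1
```

Only the first object crosses the line; the second drifts left but stays on one side. Without stable identities the same person would be counted again on every frame, which is the gap the tracker fills.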
What's Next?
Once a model can read the spatial detail of a frame and the temporal structure across frames, it is one short step from understanding a scene to reconstructing it in three dimensions. Chapter 27: Depth, 3D Vision & Neural Scene Representations is the immediate sequel: the optical flow of Section 26.4 and the multi-view geometry of Part II combine to recover depth and structure from motion, and the same neural backbones power monocular depth estimation and neural scene representations like NeRF and Gaussian splatting. Further out, the spatiotemporal modeling of this chapter returns inverted in Chapter 36: Video, 3D Generation & World Models, where instead of classifying or tracking motion a model learns to generate it, and the tracking-as-object-permanence idea of Section 26.5 becomes the world model's grasp of how objects persist when they leave the frame. Understanding motion and generating it are two views of the same temporal structure.
Bibliography & Further Reading
Foundational Papers
Video Transformers (2021-2026)
Tracking
Tools, Libraries & Benchmarks
torchvision.io.read_video / VideoReader API. pytorch.org/vision
Ultralytics tracking mode (model.track, ByteTrack / BoT-SORT). docs.ultralytics.com/modes/track