Chapter 26: Video Understanding | Building Vision AI

"A single frame told me a man was holding a door. The next frame said he was holding it open. The one after that revealed he was, in fact, walking through it. Time, it turns out, is the only honest narrator."
A Frame That Finally Saw the Sequel

Big Picture

A video is not a folder of images; it is a signal in space and time, and the central problem of video understanding is deciding how much of your model's budget to spend on each axis. An image classifier can tell you a person is in the frame, but only the temporal axis distinguishes sitting down from standing up, opening a door from closing it, a wave from a slap. This chapter adds that axis to everything you have built in Part III. We start by asking what a clip even is and how to feed one to a network without drowning in redundant pixels. We learn the two architectural families that defined action recognition, the 3D convolution and the two-stream network, and then watch the transformer of Chapter 22 absorb both into a single attention-over-spacetime design. We bring optical flow, introduced classically in Chapter 15, into the deep era with RAFT, a network that estimates dense motion to sub-pixel accuracy. And we close by turning detection into tracking, following many objects across time with learned appearance features rather than hand-tuned motion models.

Chapter Overview

Every model in Part III so far has looked at one image at a time. That was a deliberate simplification, and it is also a profound limitation. The world does not arrive as still photographs; it arrives as a stream, and an enormous amount of meaning lives in how that stream changes. A photograph of a pot on a stove cannot tell you whether the water is about to boil or has already boiled over. A photograph of two people cannot tell you whether they are greeting each other or saying goodbye. Action, intent, causation, and physics all live in the time axis, and to read them a model must see more than one frame.

Adding time sounds like it should be a small change, and it is not. A ten-second clip at thirty frames per second is three hundred images, and naively stacking them into a network multiplies both compute and memory by roughly the frame count, more than two orders of magnitude over a single image, while flooding the model with near-duplicate frames. So the first decision in any video system is how to sample and represent the clip, and that decision shapes everything downstream. Section 26.1 works through it: the structure of video data, frame sampling strategies, the redundancy that makes video both expensive and forgiving, and the practical mechanics of decoding clips into tensors with modern tooling.
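The arithmetic behind that paragraph is easy to check by hand. The sketch below is illustrative only: the 224x224 resolution, the float32 storage, the eight-frame budget, and the helper names `uniform_indices` and `clip_bytes` are assumptions chosen for the example, not values fixed by the chapter.

```python
def uniform_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the clip.

    The clip is cut into num_samples equal segments and the frame
    nearest the middle of each segment is kept.
    """
    seg = num_frames / num_samples
    return [int(seg * i + seg / 2) for i in range(num_samples)]


def clip_bytes(frames, height, width, channels=3, bytes_per_value=4):
    """Memory for one decoded clip, assuming float32 values (4 bytes each)."""
    return frames * height * width * channels * bytes_per_value


total_frames = 10 * 30                      # ten seconds at thirty frames per second
idx = uniform_indices(total_frames, 8)      # assumed budget: 8 frames per clip
full = clip_bytes(total_frames, 224, 224)   # every frame kept
sampled = clip_bytes(len(idx), 224, 224)    # only the sampled frames kept

print(idx)
print(full / 1e6, "MB for the full clip")
print(sampled / 1e6, "MB after sampling")
```

Under these assumptions the sampled clip is 37.5 times smaller than the full one, which is the sense in which redundancy makes video forgiving: most of the dropped frames are near-copies of the ones that remain.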

With clips in hand, Section 26.2 builds the two architectural families that defined the first decade of deep action recognition. The 3D convolutional network generalizes the learnable kernel of Chapter 19 from two spatial dimensions to three spatiotemporal ones, sliding a small cube over the clip so that motion patterns become learned features just as edges did in 2D. The two-stream network takes a different bet: split appearance and motion into separate pathways, feeding raw frames to one and precomputed optical flow to the other, then fuse their predictions. We implement both, compare their trade-offs, and meet the designs that made 3D convolutions practical: the inflated kernels of I3D, the factorized convolutions of R(2+1)D, and the two-pathway SlowFast.
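To make the factorization concrete, here is a minimal parameter-counting sketch comparing a full t x d x d convolution with a (2+1)D block. It assumes bias-free kernels and uses the intermediate-width rule described in the R(2+1)D paper, which picks the number of intermediate channels so that the two blocks have roughly the same parameter count. The function names are hypothetical.

```python
def conv3d_params(c_in, c_out, t=3, d=3):
    """Weights in one full t x d x d spatiotemporal convolution (no bias)."""
    return c_in * c_out * t * d * d


def matched_mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate width that keeps a (2+1)D block near the 3D parameter count."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def conv2plus1d_params(c_in, c_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    mid = matched_mid_channels(c_in, c_out, t, d)
    spatial = c_in * mid * d * d     # 2D convolution applied frame by frame
    temporal = mid * c_out * t       # 1D convolution along the time axis
    return spatial + temporal


full = conv3d_params(64, 64)
factored = conv2plus1d_params(64, 64)
print(full, factored, matched_mid_channels(64, 64))
```

For a 64-to-64 channel layer the two counts come out identical, so under this rule the factorized block gains its extra non-linearity between the spatial and temporal steps without spending more parameters.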

Section 26.3 brings the transformer to video. The self-attention you built in Chapter 22 treats an image as a sequence of patches; a video transformer simply extends that sequence across time, treating a clip as a sequence of spatiotemporal tubelets. The quadratic cost of attention, already a concern for high-resolution images, becomes acute when the token count is multiplied by the number of frames, so the section is largely about the techniques that keep the cost tractable: the tubelet embeddings and factorized encoders of ViViT, the divided space-time attention of TimeSformer, and the masked-autoencoder pretraining of VideoMAE. Section 26.4 returns to optical flow, the dense pixel-level motion field of Chapter 15, and rebuilds it with RAFT, a recurrent network that iteratively refines a flow estimate against an all-pairs correlation volume and that set an accuracy standard that still anchors the field. Section 26.5 finishes the chapter with multi-object tracking: the tracking-by-detection paradigm, the Kalman-and-Hungarian backbone of SORT, the learned re-identification embeddings of DeepSORT, and the low-confidence second-pass association of ByteTrack, which together let a tracker hold an identity through occlusion and crowding.
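A rough count shows why the extra axis hurts and why dividing attention helps. The sketch below counts query-key pairs only, ignoring the classification token, attention heads, and layer count. The 8-frame, 224x224, 16-pixel-patch configuration is an assumed example, and the function names are hypothetical.

```python
def token_count(frames, height, width, patch=16):
    """Tokens in a clip when every frame is cut into patch x patch squares."""
    return frames * (height // patch) * (width // patch)


def joint_attention_pairs(frames, patches_per_frame):
    """Joint space-time attention: every token attends to every token."""
    n = frames * patches_per_frame
    return n * n


def divided_attention_pairs(frames, patches_per_frame):
    """Divided attention: each token attends over time (the same patch
    position in every frame), then over space (every patch in its own frame)."""
    n = frames * patches_per_frame
    return n * (frames + patches_per_frame)


frames, patches = 8, (224 // 16) ** 2        # 196 patches per frame
tokens = token_count(frames, 224, 224)
joint = joint_attention_pairs(frames, patches)
divided = divided_attention_pairs(frames, patches)
print(tokens, joint, divided, round(joint / divided, 2))
```

With these numbers the clip has 1,568 tokens, and dividing attention cuts the pair count by a factor of about 7.7. The gap widens as frames are added, because the joint cost grows with the square of the frame count while the divided cost grows far more slowly.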

The thread that runs through the whole chapter is that time is not free. Every design we study is, at heart, an answer to the same question: where do you spend compute, on the spatial detail of each frame or on the temporal relationships between them, and how do you avoid paying for the redundancy that video is full of? Keep that trade in mind, because it returns transformed in Chapter 36, where the same spatiotemporal modeling is turned around to generate video rather than understand it. The reference card below is the one schema worth carrying out of this chapter: every method in it is a different answer to "space or time, where do I spend?"

Key Insight: One Question, Five Answers (the Chapter on a Card)

Everything in Chapter 26 is a different way to budget compute between the spatial axis (what each frame contains) and the temporal axis (how frames change). Memorize the chapter as five answers to that one question:

Sample, do not stream (26.1): video is mostly redundant, so keep a handful of frames, not all three hundred. Redundancy is both the tax and the gift.
Slide a box, or split the streams (26.2): a 3D convolution learns motion as an edge in spacetime; a two-stream network hands motion in as precomputed flow. SlowFast pays for both, cheaply.
Attend, then factorize (26.3): a video transformer tokenizes spacetime, and divided space-time attention tames the quadratic cost the extra axis creates.
Encode, correlate, iterate (26.4): RAFT computes dense flow by matching every pixel pair and refining the guess in a loop.
Predict, match, persist (26.5): a tracker turns per-frame detections into per-object timelines, supplying the object permanence the detector lacks.

If you remember nothing else, remember the question: space or time, where do I spend? Every architecture in this chapter, and the video generators of Chapter 36, must answer it.

Prerequisites

You should have read Chapter 19: Convolutional Neural Networks, because the 3D convolution of Section 26.2 is a direct generalization of the 2D kernel, and Chapter 20: CNN Architectures for the ResNet backbones that video networks inflate into three dimensions. Chapter 22: Vision Transformers is essential for Section 26.3; the video transformer is the patch-and-attend recipe extended over time, and you should be comfortable with self-attention and its quadratic cost. Chapter 23: Object Detection underpins the tracking-by-detection pipeline of Section 26.5. From Part II, Chapter 15: Motion, Optical Flow & Tracking introduced the classical Lucas-Kanade and Horn-Schunck flow and the Kalman filter that this chapter rebuilds with learned components, and the comparison is much sharper if that material is fresh. Comfort with PyTorch tensors of shape (batch, channels, time, height, width) from Chapter 18 will make the code concrete.

Chapter Roadmap

26.1 From Frames to Clips: The Temporal Dimension What a video tensor is, why thirty redundant frames per second is both a curse and a blessing, frame sampling strategies (dense, uniform, segment-based), and how to decode clips into 5D tensors with current tooling. The data foundation for the whole chapter.
26.2 Action Recognition: 3D CNNs & Two-Stream Networks The 3D convolution as a spatiotemporal kernel, the two-stream split of appearance and motion, and the inflated, factorized, and two-pathway designs (I3D, R(2+1)D, SlowFast) that made 3D networks efficient. Both families built and compared in PyTorch.
26.3 Video Transformers The image transformer extended across time: spatiotemporal tubelet tokens, the token-count explosion, and the factorized attention (TimeSformer, ViViT) and masked-autoencoder pretraining (VideoMAE) that keep attention over spacetime tractable.
26.4 Deep Optical Flow: RAFT & Beyond Dense motion estimation in the deep era: the all-pairs correlation volume, the recurrent GRU update operator that iteratively refines flow, and why RAFT's design became the template for modern flow, with a look at the transformer-based successors.
26.5 Multi-Object Tracking with Learned Features Tracking-by-detection: the Kalman filter and Hungarian assignment of SORT, the learned re-identification embeddings of DeepSORT, and the low-confidence recovery of ByteTrack that holds identities through occlusion. Built on the detectors of Chapter 23.

The chapter closes with a capstone you build and run. The Hands-On Lab at the end of Section 26.5 assembles a full tracking-by-detection pipeline on a real street clip: it decodes the video with the frame-sampling tooling of Section 26.1, detects people with a Chapter 23 detector, links the detections into stable tracks with your own motion-prediction-and-match logic from Section 26.5, and uses the resulting identities to count how many distinct people cross a line. A final step swaps the hand-built tracker for production ByteTrack in two lines, so you carry away both the from-scratch understanding and the practical shortcut.
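As a preview of the linking step in that lab, here is a minimal sketch of frame-to-frame association. It is not the lab's implementation: it uses greedy highest-IoU-first matching in place of Hungarian assignment, it omits the motion-prediction step entirely, and the box format, the `min_iou` threshold, and the function names are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou=0.3):
    """Return {track_id: detection_index}, taking the best-overlapping pairs first."""
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break                      # remaining pairs overlap too little
        if tid in matches or j in used:
            continue                   # track or detection already claimed
        matches[tid] = j
        used.add(j)
    return matches


# Last known box per track, then the detections from the next frame.
tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [(51, 50, 61, 60), (1, 0, 11, 10), (200, 200, 210, 210)]
matches = greedy_match(tracks, detections)
print(matches)
```

Each track claims the detection that moved one pixel from its last box, and the third detection matches nothing, so a tracker would treat it as a candidate for a new identity. Greedy matching is easy to read but can make locally good, globally poor choices in crowds, which is the case that motivates an optimal assignment step.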

What's Next?

Once a model can read the spatial detail of a frame and the temporal structure across frames, it is one short step from understanding a scene to reconstructing it in three dimensions. Chapter 27: Depth, 3D Vision & Neural Scene Representations is the immediate sequel: the optical flow of Section 26.4 and the multi-view geometry of Part II combine to recover depth and structure from motion, and the same neural backbones power monocular depth estimation and neural scene representations like NeRF and Gaussian splatting. Further out, the spatiotemporal modeling of this chapter returns inverted in Chapter 36: Video, 3D Generation & World Models, where instead of classifying or tracking motion a model learns to generate it, and the tracking-as-object-permanence idea of Section 26.5 becomes the world model's grasp of how objects persist when they leave the frame. Understanding motion and generating it are two views of the same temporal structure.

Bibliography & Further Reading

Foundational Papers

Simonyan, K. & Zisserman, A. "Two-Stream Convolutional Networks for Action Recognition in Videos." NeurIPS (2014). arXiv:1406.2199

The two-stream network of Section 26.2. One stream sees raw RGB frames for appearance, the other sees stacked optical flow for motion, and their softmax scores are fused. The design that established motion as a first-class signal in deep action recognition.

Tran, D. et al. "Learning Spatiotemporal Features with 3D Convolutional Networks (C3D)." ICCV (2015). arXiv:1412.0767

C3D, the first widely used 3D convolutional network for video, central to Section 26.2. It showed that a homogeneous 3x3x3 spatiotemporal kernel learns useful motion features directly from clips, generalizing the 2D convolution of Chapter 19.

Carreira, J. & Zisserman, A. "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D)." CVPR (2017). arXiv:1705.07750

I3D of Section 26.2, which inflates a pretrained 2D ImageNet network into 3D by replicating its kernels along time, inheriting strong spatial features. Also introduced the Kinetics benchmark that powers the field.

Tran, D. et al. "A Closer Look at Spatiotemporal Convolutions for Action Recognition (R(2+1)D)." CVPR (2018). arXiv:1711.11248

The R(2+1)D factorization of Section 26.2: split each 3D convolution into a 2D spatial convolution followed by a 1D temporal one, with an extra non-linearity between them. At a parameter count matched to the full 3D block, the factorized block is easier to optimize and more accurate. The standard efficient 3D block.

Feichtenhofer, C. et al. "SlowFast Networks for Video Recognition." ICCV (2019). arXiv:1812.03982

SlowFast of Section 26.2: a slow high-detail pathway at a low frame rate and a fast lightweight pathway at a high frame rate, fused laterally. A clean answer to the spatial-versus-temporal budget question that frames the whole chapter.

Teed, Z. & Deng, J. "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow." ECCV (2020), Best Paper. arXiv:2003.12039

RAFT, the entirety of Section 26.4. An all-pairs correlation volume plus a recurrent GRU update operator that iteratively refines a single high-resolution flow field. The accuracy and design template that modern optical flow is still built around.

Video Transformers (2021-2026)

Bertasius, G. et al. "Is Space-Time Attention All You Need for Video Understanding? (TimeSformer)." ICML (2021). arXiv:2102.05095

TimeSformer of Section 26.3. Divided space-time attention, applying spatial and temporal attention in separate steps, is the key trick that makes attention over a video tractable. The convolution-free baseline for video transformers.

Arnab, A. et al. "ViViT: A Video Vision Transformer." ICCV (2021). arXiv:2103.15691

ViViT of Section 26.3, which catalogs the factorized-attention design space (factorized encoder, factorized self-attention, factorized dot-product) and introduces spatiotemporal tubelet embeddings, the 3D analogue of patch embedding.

Tong, Z. et al. "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training." NeurIPS (2022). arXiv:2203.12602

VideoMAE of Section 26.3, the video extension of the masked-autoencoder pretraining from Chapter 25. An extreme 90 percent tube-masking ratio exploits temporal redundancy to make self-supervised video pretraining data-efficient.

Tracking

Bewley, A. et al. "Simple Online and Realtime Tracking (SORT)." ICIP (2016). arXiv:1602.00763

SORT of Section 26.5: a Kalman filter for motion prediction plus the Hungarian algorithm for frame-to-frame assignment, the minimal tracking-by-detection pipeline. Fast, simple, and the foundation everything else extends.

Wojke, N. et al. "Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT)." ICIP (2017). arXiv:1703.07402

DeepSORT of Section 26.5, which adds a learned appearance embedding to SORT so identities survive occlusion when motion prediction alone fails. The bridge from hand-tuned motion to learned re-identification features.

Zhang, Y. et al. "ByteTrack: Multi-Object Tracking by Associating Every Detection Box." ECCV (2022). arXiv:2110.06864

ByteTrack of Section 26.5. The deceptively simple idea of associating low-confidence detection boxes in a second matching pass recovers objects that occlusion has dimmed, setting a strong modern tracking baseline with almost no added machinery.

Tools, Libraries & Benchmarks

TorchVision video models and the torchvision.io.read_video / VideoReader API. pytorch.org/vision

Pretrained R(2+1)D, MViT, and S3D video classifiers and the decoding utilities of Section 26.1, the library shortcut behind most of this chapter's clip-loading and action-recognition code.

PyTorchVideo, the Facebook AI video understanding library. pytorchvideo.org

SlowFast, X3D, and MViT implementations with data loaders and transforms tuned for Kinetics, the reference codebase for the action models of Sections 26.2 and 26.3.

Kay, W. et al. "The Kinetics Human Action Video Dataset." (2017). arXiv:1705.06950 · deepmind/kinetics-i3d

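The Kinetics benchmark (400, 600, and 700 action classes) that most of the models in Sections 26.2 and 26.3 are trained and reported on, and the dataset that made deep action recognition reproducible across labs. The earliest entries, the two-stream network and C3D, predate it and were originally evaluated on older benchmarks such as UCF-101.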

Jocher, G. et al. Ultralytics YOLO with built-in tracking (model.track, ByteTrack / BoT-SORT). docs.ultralytics.com/modes/track

The production multi-object tracker of Section 26.5: a one-line wrapper that runs detection and ByteTrack or BoT-SORT association together, the library shortcut for the from-scratch tracker built in that section.