Part II: Classical Computer Vision
Chapter 15: Motion, Optical Flow & Tracking

"A photograph tells you where everything is. A video tells you where everything is going, provided you can keep up with thirty deadlines per second."

An Overcommitted Motion Vector

Big Picture

Video is not a stack of unrelated photographs; it is a measurement of motion, and this chapter builds the classical instruments that read it. Two questions organize everything. First, how does each pixel move from one frame to the next? That is optical flow, and it powers stabilization, frame interpolation, motion segmentation, and the feature tracks that fed Chapter 14's reconstructions. Second, how does an object move across hundreds of frames, through occlusions, lighting changes, and lookalike distractors? That is tracking, and it requires not just measurement but memory: a motion model, an uncertainty estimate, and a policy for deciding which detection belongs to which identity. Both questions return, learned end to end, in Chapter 26, and the state-estimation machinery built here grows into the world models of Chapter 36.

Chapter Overview

Every chapter so far has treated the image as a frozen instant. Chapter 14 came closest to motion: it moved a camera through a static world and recovered geometry from the parallax. This chapter inverts the arrangement. Now the world itself moves, and the question is no longer "where is the camera?" but "what is moving, where is it going, and which thing is it?" Those are genuinely different problems. A pixel that brightens might be an object arriving, a shadow leaving, or a cloud uncovering the sun, and nothing in a single frame can tell you which. Time is a new measurement axis, and like every measurement axis it comes with its own noise, its own aliasing, and its own beautiful, treacherous ambiguities.

The first half of the chapter is about pixel motion. Section 15.1 lays the foundations: the distinction between the motion field (the geometric truth) and optical flow (what brightness patterns appear to do), the brightness constancy assumption that links them, and the aperture problem that makes flow locally unknowable in one direction. Section 15.2 turns one unsolvable equation into a solvable system by assuming flow is constant over a small window: the Lucas-Kanade method, whose normal-equations matrix turns out to be exactly the structure tensor of Chapter 10. Corners, it emerges, are not just matchable; they are trackable, and the KLT tracker built on this insight has run for four decades. Section 15.3 goes dense: Horn-Schunck's global energy, the smoothness prior that fills in flow where data is silent, and the variational lineage that leads to the modern OpenCV workhorses.
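As a preview of Section 15.2, the window-based solve can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the chapter's implementation: the function names `make_frame` and `lucas_kanade_window` are hypothetical, the image is a synthetic sinusoidal texture, the whole image interior is treated as one window, and spatial gradients are averaged over both frames (one common choice among several). The three sums `sxx`, `sxy`, `syy` are the entries of the structure tensor, and a near-zero determinant is the aperture problem showing up numerically.

```python
import math

def make_frame(width, height, dx=0.0, dy=0.0):
    """Smooth synthetic texture, translated by (dx, dy) pixels (illustrative only)."""
    return [[math.sin(0.3 * (x - dx)) + math.cos(0.2 * (y - dy))
             for x in range(width)]
            for y in range(height)]

def lucas_kanade_window(frame1, frame2):
    """Estimate one flow vector (u, v), treating the image interior as a single window."""
    height, width = len(frame1), len(frame1[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Spatial gradients: central differences, averaged over the two frames.
            ix = 0.25 * (frame1[y][x + 1] - frame1[y][x - 1]
                         + frame2[y][x + 1] - frame2[y][x - 1])
            iy = 0.25 * (frame1[y + 1][x] - frame1[y - 1][x]
                         + frame2[y + 1][x] - frame2[y - 1][x])
            # Temporal derivative: simple frame difference.
            it = frame2[y][x] - frame1[y][x]
            # Accumulate the structure tensor and the right-hand side.
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if det < 1e-9:
        # Gradients all point one way (or vanish): flow is not recoverable here.
        raise ValueError("structure tensor is singular: aperture problem")
    # Solve [[sxx, sxy], [sxy, syy]] @ [u, v] = -[sxt, syt] in closed form.
    u = (sxy * syt - syy * sxt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

if __name__ == "__main__":
    first = make_frame(24, 24)
    second = make_frame(24, 24, dx=0.3, dy=-0.2)
    print(lucas_kanade_window(first, second))
```

For the sub-pixel shift used here the estimate should land close to (0.3, -0.2). On real video one would normally reach for a pyramidal implementation such as OpenCV's `cv2.calcOpticalFlowPyrLK`, which handles larger motions than a single linearized solve can.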

The second half is about object motion. Section 15.4 exploits the easiest special case, a static camera, where modeling each pixel's history separates the boring background from the interesting foreground: frame differencing, running Gaussian averages, and the mixture-of-Gaussians models that survive waving trees and flickering lamps. Section 15.5 follows a single chosen object through the video: mean-shift climbing a color-histogram likelihood, correlation filters matching templates at Fourier speed, and the drift-versus-adaptation dilemma that haunts every template update. Section 15.6 adds the missing ingredients for multi-object tracking: the Kalman filter, which predicts where each object will be and how uncertain that prediction is, and data association, which uses the Hungarian algorithm to decide which detection belongs to which track. The SORT family assembled from these pieces still underpins many production trackers.
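The predict-update cycle that Section 15.6 develops can be previewed with a one-dimensional constant-velocity filter. The sketch below is an assumption-laden toy, not the chapter's code: the state is (position, velocity) along a single axis, only position is measured, the noise levels `q` and `r` are arbitrary illustrative values, and the function names are hypothetical. Data association is left out entirely; the point is only to show how the filter coasts on its motion model when a detection is missing and how its uncertainty grows while it does.

```python
def predict(state, cov, dt=1.0, q=0.01):
    """Constant-velocity prediction: x <- F x, P <- F P F^T + Q, with F = [[1, dt], [0, 1]]."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    new_state = (p + dt * v, v)
    new_cov = ((p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11),
               (p10 + dt * p11, p11 + q))
    return new_state, new_cov

def update(state, cov, z, r=1.0):
    """Fuse a position measurement z with variance r (measurement matrix H = [1, 0])."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    innovation = z - p            # how far the measurement is from the prediction
    s = p00 + r                   # innovation variance
    k0, k1 = p00 / s, p10 / s     # Kalman gain for position and velocity
    new_state = (p + k0 * innovation, v + k1 * innovation)
    # P <- (I - K H) P
    new_cov = (((1.0 - k0) * p00, (1.0 - k0) * p01),
               (p10 - k1 * p00, p11 - k1 * p01))
    return new_state, new_cov

if __name__ == "__main__":
    # Start at position 0 with unknown velocity (large velocity variance).
    state, cov = (0.0, 0.0), ((1.0, 0.0), (0.0, 100.0))
    for frame in range(1, 21):
        state, cov = predict(state, cov)
        occluded = 10 <= frame <= 12          # pretend the detector missed these frames
        if not occluded:
            jitter = 0.5 if frame % 2 else -0.5   # deterministic stand-in for noise
            state, cov = update(state, cov, 2.0 * frame + jitter)
        print(frame, round(state[0], 2), round(state[1], 2), round(cov[0][0], 2))
```

Running it prints position, velocity, and position variance per frame; the variance column should rise during the simulated occlusion and fall again once measurements resume. A multi-object tracker would run one such filter per track (typically over bounding-box coordinates) and use the predicted positions as the inputs to association.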

One theme runs through all six sections. Motion estimation is ill-posed: the data never fully determines the answer, so every method is an assumption about constancy plus a prior about smoothness, wrapped around the same least-squares core. Lucas-Kanade assumes local constancy; Horn-Schunck assumes global smoothness; background subtraction assumes temporal stationarity; the Kalman filter assumes linear dynamics with Gaussian noise. When the assumption holds, the method works; when it breaks, the failure is diagnosable. Learning that diagnosis is the real skill this chapter teaches, and it transfers intact to the deep methods of Chapter 26, which replace the hand-chosen priors with learned ones but inherit every one of the underlying ambiguities.
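The shared least-squares core is easiest to see with the two flow objectives side by side. The notation below is an assumption made for this overview (subscripts for partial derivatives of the image intensity I, W for the local window, α for the smoothness weight); the individual sections may use different symbols.

```latex
% Brightness constancy, linearized for a small displacement (u, v):
I_x u + I_y v + I_t = 0

% Lucas-Kanade: one (u, v) per window W, solved by least squares.
\min_{u,v} \sum_{(x,y) \in W} \left( I_x u + I_y v + I_t \right)^2
\quad\Longrightarrow\quad
\begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
= -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}

% Horn-Schunck: a flow field (u(x,y), v(x,y)) with a global smoothness penalty.
E(u, v) = \iint \left[ \left( I_x u + I_y v + I_t \right)^2
        + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] dx\, dy
```

Both objectives penalize the same squared data residual. They differ only in what fills the gap left by one equation in two unknowns: a shared flow vector inside each window for Lucas-Kanade, a smoothness term over the whole image for Horn-Schunck.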

Prerequisites

This chapter leans hard on image gradients and convolution from Chapter 3: Spatial Filtering & Convolution; the optical flow constraint is built from spatial and temporal derivatives. Pyramids from Chapter 4: The Frequency Domain & Multi-Scale Analysis reappear in pyramidal Lucas-Kanade and coarse-to-fine flow, and Chapter 4's convolution theorem explains why correlation-filter trackers run at hundreds of frames per second. Histograms and back-projection from Chapter 2: Point Operations, Histograms & Thresholding drive mean-shift tracking, and the morphological cleanup of Chapter 6: Morphology, Binary Images & Shape turns raw foreground masks into usable blobs. The structure tensor and Shi-Tomasi detector from Chapter 10: Keypoints, Descriptors & Matching are reused directly in Section 15.2. Familiarity with Chapter 14: Structure from Motion & Visual SLAM helps motivate feature tracks but is not required.

Chapter Roadmap

What's Next?

With motion, the classical toolkit is nearly complete: Part II has found edges, keypoints, regions, geometry, and now trajectories. What it has not yet done is name things. Chapter 16: Classical Recognition Pipelines takes up recognition the pre-deep-learning way: bag-of-visual-words built from Chapter 10's descriptors, HOG templates, deformable part models, and the classifiers that powered a decade of detection. It is the chapter where classical vision reaches for semantics and discovers its limits, which is exactly the cliffhanger that Part III's neural networks resolve.

Bibliography & Further Reading

Foundational Papers

Horn, B. K. P. and Schunck, B. G. "Determining Optical Flow." Artificial Intelligence 17 (1981). doi:10.1016/0004-3702(81)90024-2
The paper that defined dense optical flow as energy minimization: brightness constancy plus a global smoothness prior. Section 15.3 derives and implements its iterative scheme.
Lucas, B. D. and Kanade, T. "An Iterative Image Registration Technique with an Application to Stereo Vision." IJCAI (1981). ri.cmu.edu (PDF)
The other 1981 flow paper: local least squares in a window, with the Newton-style iteration that still runs inside every KLT implementation. The heart of Section 15.2.
Shi, J. and Tomasi, C. "Good Features to Track." CVPR (1994). doi:10.1109/CVPR.1994.323794
Closed the loop between detection and tracking: the points worth tracking are exactly those where the Lucas-Kanade system is well conditioned. Completes the KLT acronym.
Comaniciu, D. and Meer, P. "Mean Shift: A Robust Approach Toward Feature Space Analysis." IEEE TPAMI (2002). doi:10.1109/34.1000236
The definitive treatment of mean-shift mode seeking, the engine behind Section 15.5's histogram tracker and a clustering tool far beyond tracking.
Bolme, D. S., Beveridge, J. R., Draper, B. A., and Lui, Y. M. "Visual Object Tracking using Adaptive Correlation Filters." CVPR (2010). doi:10.1109/CVPR.2010.5539960
MOSSE: tracking as template correlation computed in the Fourier domain, at over 600 frames per second on 2010 hardware. The opening move of Section 15.5's correlation-filter story.
Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering (1960). doi:10.1115/1.3662552
The original Kalman filter paper. Section 15.6's predict-update cycle, uncertainty propagation, and optimal gain all trace to these eleven pages.
Zivkovic, Z. "Improved Adaptive Gaussian Mixture Model for Background Subtraction." ICPR (2004). doi:10.1109/ICPR.2004.1333992
The algorithm behind OpenCV's MOG2 background subtractor: per-pixel Gaussian mixtures with an automatically chosen number of components. Section 15.4's workhorse.
Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. "Simple Online and Realtime Tracking." ICIP (2016). arXiv:1602.00763
SORT: Kalman prediction plus Hungarian association plus IoU cost, nothing else, and yet competitive with far heavier trackers. Section 15.6 rebuilds it in miniature.

Recent Research (2020-2026)

Teed, Z. and Deng, J. "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow." ECCV (2020). arXiv:2003.12039
The deep flow architecture that set a new state of the art on the standard benchmarks and became the template for later learned flow methods: all-pairs correlation volumes refined by a recurrent update operator. The bridge from Section 15.3 to Chapter 26.
Wang, Y., Lipson, L., and Deng, J. "SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow." ECCV (2024). arXiv:2405.14793
RAFT made fast and robust: mixture-of-Laplace loss and architectural simplifications that brought state-of-the-art flow to real-time rates, referenced in Sections 15.1 and 15.3.
Zhang, Y. et al. "ByteTrack: Multi-Object Tracking by Associating Every Detection Box." ECCV (2022). arXiv:2110.06864
Showed that low-confidence detections are association gold, not garbage: a two-stage association that matches high-confidence boxes first, then uses the low-confidence ones to recover occluded objects and reduce identity switches. Discussed in Section 15.6.
Ravi, N. et al. "SAM 2: Segment Anything in Images and Videos." (2024). arXiv:2408.00714
Promptable video segmentation with a streaming memory: click an object once and SAM 2 segments and follows it. The modern answer to Section 15.5's single-object tracking problem.
Karaev, N. et al. "CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos." (2024). arXiv:2410.11831
The track-any-point line of work: a transformer that follows thousands of points jointly through occlusions, the direct learned descendant of Section 15.2's KLT tracker.

Books

Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book
Chapter 9 covers motion estimation and Section 7.1 covers feature tracking, with full derivations and exhaustive references; free online.
Labbe, R. Kalman and Bayesian Filters in Python. github.com/rlabbe
A free, executable Jupyter book that builds Kalman filtering from intuition to full multivariate theory; the companion filterpy library is used in Section 15.6's library shortcut.

Tools & Libraries

OpenCV. "Optical Flow" tutorial. docs.opencv.org
The official OpenCV 4.x walkthrough of Lucas-Kanade and Farneback flow, the APIs used throughout this chapter's code, with runnable samples.