Section 26.4: Deep Optical Flow: RAFT & Beyond

"Horn and Schunck handed me one equation and a prayer for smoothness. RAFT handed me a lookup table of every pixel against every other pixel and said, just keep guessing, but guess better each time. Reader, I have never been more confident about where things went."
An Optical Flow Field That Finally Stopped Smearing at the Edges

Big Picture

Optical flow is the dense, per-pixel motion field between two frames, and RAFT computes it by precomputing the similarity of every pixel in the first frame to every pixel in the second, then iteratively refining a flow estimate by repeatedly looking up that similarity table. The classical Lucas-Kanade and Horn-Schunck methods of Chapter 15 solved flow with brightness-constancy equations and smoothness priors, and they struggled with large motions and textureless regions. RAFT replaces the hand-designed energy with three learned components: a feature encoder, an all-pairs correlation volume that is the learned analogue of block matching, and a recurrent update operator that mimics a classical optimizer but with learned steps. The result set an accuracy standard that still anchors the field, and its design recurs in stereo, scene flow, and the transformer-based successors.

Every method so far in this chapter has reduced a clip to a single label or a handful of boxes. Now we ask for everything: a motion vector at every one of the hundreds of thousands of pixels in a frame, telling you where each one went. That dense field is optical flow, the displacement of every pixel in frame one to its corresponding location in frame two. You met it classically in Chapter 15, and it returned as the input to the two-stream network of Section 26.2. This section rebuilds it with deep learning. The structure of RAFT (Recurrent All-Pairs Field Transforms) is worth studying in detail not only because it is the dominant flow method but because its three-part design is a template you will see again in stereo depth in Chapter 27. Three words capture the entire recipe and are worth committing to memory: encode, correlate, iterate. Encode both frames into learned features, correlate every pixel of one frame against every pixel of the other to build a similarity table, then iterate a shared-weight update that keeps refining the guess. Subsection 1 recalls the classical baseline; subsections 2 and 3 then take those three words in order (encode and correlate, then iterate), and subsection 4 puts the result to use.

1. The Flow Problem and the Classical Baseline (Beginner)

Optical flow rests on the brightness constancy assumption: a point's intensity does not change as it moves between frames. Writing $I(x, y, t)$ for the image intensity and $(u, v)$ for the per-pixel displacement, a first-order Taylor expansion gives the classical optical flow constraint equation

$$I_x u + I_y v + I_t = 0,$$

where $I_x, I_y$ are the spatial gradients and $I_t$ is the temporal gradient. This is one equation in two unknowns per pixel, the famous aperture problem: a single local window can only determine the motion component along the gradient, not the full vector. Classical methods resolve the ambiguity with extra assumptions: Lucas-Kanade assumes flow is constant in a small window (giving a solvable least-squares system), and Horn-Schunck adds a global smoothness penalty. Both break down on large displacements, where the Taylor expansion is invalid, and on textureless regions, where there is no gradient to constrain anything. RAFT keeps the brightness-constancy intuition but replaces the local linearization with a global correlation search.
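The Lucas-Kanade resolution of the aperture problem described above can be sketched in a few lines. This is a minimal illustration, not the Chapter 15 implementation: the helper name `lucas_kanade_window`, the eigenvalue threshold, and the synthetic gradients are all assumptions made for the example. Stacking the constraint $I_x u + I_y v = -I_t$ for every pixel in a window gives an overdetermined system, solvable only when the window's gradients span two directions.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It, min_eig=1e-6):
    """Least-squares flow (u, v) for one window, or None if under-constrained.

    Ix, Iy, It: 1D arrays holding the gradients at each pixel of the window.
    """
    A = np.stack([Ix, Iy], axis=1)            # one constraint row per pixel
    ATA = A.T @ A                             # 2x2 structure tensor
    if np.linalg.eigvalsh(ATA)[0] < min_eig:  # smallest eigenvalue near zero
        return None                           # aperture problem or flat patch
    return np.linalg.solve(ATA, -A.T @ It)    # normal equations

rng = np.random.default_rng(0)
true_flow = np.array([0.5, -0.25])            # hypothetical ground-truth (u, v)

# Textured window: gradients point in many directions, so (u, v) is recoverable.
Ix, Iy = rng.normal(size=25), rng.normal(size=25)
It = -(Ix * true_flow[0] + Iy * true_flow[1])
flow_textured = lucas_kanade_window(Ix, Iy, It)

# Edge window: every gradient is horizontal, so v is unconstrained.
Ix_edge, Iy_edge = rng.normal(size=25), np.zeros(25)
It_edge = -(Ix_edge * true_flow[0])
flow_edge = lucas_kanade_window(Ix_edge, Iy_edge, It_edge)

# Flat window: no gradient at all, so nothing is constrained.
flow_flat = lucas_kanade_window(np.zeros(25), np.zeros(25), np.zeros(25))

print("textured:", flow_textured)
print("edge:", flow_edge, "| flat:", flow_flat)
```

The textured window recovers the full vector, while the edge and flat windows are rejected because the 2x2 system is singular. Those are the two failure cases, along with large displacements, that the rest of this section addresses.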

Common Misconception: Optical Flow Is the True Motion of Objects

Optical flow is the apparent per-pixel motion of brightness patterns between two frames, not the true 3D motion of objects and not object-level correspondence. The distinction is not pedantic; it causes real failures. A static scene under a moving light source produces strong flow with zero object motion, because brightness constancy is violated and the patterns shift. A spinning untextured sphere produces near-zero flow despite genuine rotation, because there is no moving pattern to track. And flow is defined per pixel, so it gives you no notion of which pixels belong to the same object; that binding is the separate job of the tracker in Section 26.5. RAFT estimates this apparent field more robustly than the classical methods, but it estimates the same quantity, not a corrected "real" motion. A diagnostic question: what flow does a perfectly uniform red wall produce as the camera pans across it? Almost none, because no brightness pattern moves, even though every physical point is in motion relative to the camera.

Fun Fact

RAFT won the Best Paper award at ECCV 2020, an honor more often given to flashy new architectures than to optical flow, a problem many considered nearly solved. The design also travelled well: follow-up work adapted the same recipe to stereo matching (RAFT-Stereo) and to visual SLAM (DROID-SLAM) with task-specific modifications rather than a redesign. The "all-pairs correlation then iterative update" recipe turned out to be a general engine for dense correspondence, not a flow-specific trick, which is exactly why it is worth learning carefully.

2. Feature Encoding and the All-Pairs Correlation Volume (Intermediate)

Here is the move that broke the field open, and at first glance it looks reckless: instead of guessing where each pixel went and checking nearby, RAFT compares every pixel in frame one against every pixel in frame two, all at once, before it commits to a single displacement. That is the "all-pairs" in the name, and it is why RAFT stops smearing at large motions where the classical methods give up. To get there it first runs both frames through a shared convolutional feature encoder that maps each frame to a dense feature map at one-eighth resolution, so each spatial location carries a learned descriptor rather than a raw pixel. This is the learned replacement for the hand-crafted descriptors of Chapter 10; matching learned features is far more robust than matching raw intensities, which is what brightness constancy assumed.

The heart of RAFT is the correlation volume. For every feature in frame one, RAFT computes the dot-product similarity to every feature in frame two, producing a 4D volume $C(i, j, k, l)$ that holds the similarity of pixel $(i, j)$ in frame one to pixel $(k, l)$ in frame two. This is the all-pairs comparison that gives RAFT its name, and it is the global, large-displacement-capable analogue of the local block matching used in classical stereo and flow.

One refinement turns this volume from expensive into practical. RAFT builds a small pyramid of the volume by pooling the second-frame dimensions, so the lookup can read both fine and coarse correspondence. The coarser levels are what let RAFT handle large motion cheaply: a displacement of many pixels at full resolution becomes a displacement of only a few pixels once the second frame is pooled down, so a small fixed-size lookup window still reaches it. Figure 26.4.1 shows the pipeline.

Figure 26.4.1: The RAFT pipeline. A shared feature encoder maps both frames to learned features; the all-pairs correlation volume holds the similarity of every pixel pair; a recurrent GRU update operator repeatedly looks up the correlation around the current flow estimate and refines it, emitting a sequence of increasingly accurate flow fields. The dashed loop is the iterative refinement that gives RAFT its accuracy.
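The pyramid idea can be checked on a toy volume before any features are involved. The sketch below is illustrative only: the function name `correlation_pyramid` is hypothetical, the four levels and the lookup radius of 4 are assumed here as typical RAFT settings, and the volume is hand-built with a single known match rather than computed from images. It average-pools only the frame-two axes and reports how far the match sits from the query position at each level.

```python
import numpy as np

def correlation_pyramid(corr, num_levels=4):
    """Average-pool the frame-two axes of a (H, W, H, W) volume by 2 per level."""
    pyramid = [corr]
    for _ in range(num_levels - 1):
        h1, w1, h2, w2 = pyramid[-1].shape
        # Group frame-two pixels into 2x2 blocks and average each block.
        blocks = pyramid[-1].reshape(h1, w1, h2 // 2, 2, w2 // 2, 2)
        pyramid.append(blocks.mean(axis=(3, 5)))
    return pyramid

H = W = 32
corr = np.zeros((H, W, H, W), dtype=np.float32)
# Toy match: pixel (row 4, col 4) in frame one moved 24 columns right in frame two.
corr[4, 4, 4, 28] = 1.0

pyr = correlation_pyramid(corr)
shapes, peaks, offsets = [], [], []
for level, c in enumerate(pyr):
    k, l = np.unravel_index(np.argmax(c[4, 4]), c[4, 4].shape)
    shapes.append(c.shape)
    peaks.append((int(k), int(l)))
    # Query column 4 maps to 4 / 2**level in this level's coordinates.
    offsets.append(float(l - 4 / 2**level))
    print(f"level {level}: shape {c.shape}, peak {peaks[-1]}, column offset {offsets[-1]}")
```

At level 0 the match is 24 columns away, far outside a radius-4 window. At level 3 the same match is only 2.5 columns away, so a fixed 9x9 window reaches it without widening the search.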

The code below builds the all-pairs correlation volume from two feature maps. It is a single batched matrix multiplication, the same dot-product-as-similarity operation that powered attention in Chapter 22, applied here between two images rather than within one sequence.

import torch

def correlation_volume(fmap1, fmap2):
    """All-pairs correlation between two feature maps.

    fmap1, fmap2: (N, D, H, W) learned features.
    Returns (N, H, W, H, W): similarity of every pixel in 1 to every pixel in 2.
    """
    N, D, H, W = fmap1.shape
    f1 = fmap1.view(N, D, H * W)                 # (N, D, HW)
    f2 = fmap2.view(N, D, H * W)                 # (N, D, HW)
    corr = torch.matmul(f1.transpose(1, 2), f2)  # (N, HW, HW) dot products
    corr = corr / (D ** 0.5)                      # scale, as in attention
    return corr.view(N, H, W, H, W)

f1 = torch.randn(1, 64, 32, 32)                   # 1/8-resolution features
f2 = torch.randn(1, 64, 32, 32)
vol = correlation_volume(f1, f2)
print("correlation volume:", vol.shape)           # correlation volume: torch.Size([1, 32, 32, 32, 32])

Code Fragment 1: The all-pairs correlation_volume as one scaled torch.matmul. Every pixel of frame one is dotted against every pixel of frame two and scaled by $\sqrt{D}$, exactly the similarity computation behind attention, giving a 4D table the update operator queries. Computing it at one-eighth resolution keeps the $H^2 W^2$ size manageable.
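To see why the one-eighth resolution matters, it helps to work the arithmetic. The sketch below is a back-of-the-envelope calculation under stated assumptions: a hypothetical 448 by 1024 frame, 4-byte float32 entries, and a helper name (`corr_volume_size`) invented for this example.

```python
def corr_volume_size(height, width, stride=8, bytes_per_value=4):
    """Entry count and byte size of an all-pairs volume at 1/stride resolution."""
    h, w = height // stride, width // stride
    entries = (h * w) ** 2                    # every pixel against every pixel
    return entries, entries * bytes_per_value

entries_8, bytes_8 = corr_volume_size(448, 1024, stride=8)
entries_1, bytes_1 = corr_volume_size(448, 1024, stride=1)

print(f"1/8 resolution:  {entries_8:,} entries, {bytes_8 / 2**20:.0f} MiB")
print(f"full resolution: {entries_1:,} entries, {bytes_1 / 2**30:.0f} GiB")
print(f"ratio: {entries_1 // entries_8}x")
```

Because the volume grows with the square of the pixel count, shrinking each side by 8 shrinks the table by $8^4 = 4096$: roughly 196 MiB instead of hundreds of gibibytes for this frame size.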

3. The Recurrent Update Operator (Intermediate)

RAFT's third component is what makes it accurate. Rather than predicting flow in one shot, it starts from zero flow and applies a recurrent update operator, a convolutional gated recurrent unit (GRU), that refines the estimate over a chosen number of iterations. A GRU is a small network with a persistent hidden state that it carries from one step to the next: at each step it takes the current input, mixes it with the state through learned gates that decide what to keep and what to overwrite, and emits an updated state. Here that state is a feature map and the same weights act at every iteration; the hidden state carries memory from step to step, while the flow estimate itself is the running sum of the residual updates predicted from that state.

At each iteration the operator does three things. First it looks up the correlation volume at the pixels indicated by the current flow estimate, reading how well the current correspondence matches. Then it combines that lookup with context features from the first frame. Finally it predicts a residual update $\Delta f$ that is added to the current flow. Crucially, all iterations share weights, so the network learns a single update rule applied repeatedly, mimicking the iterations of a classical optimizer but with a learned step. The training loss supervises every intermediate flow, with later iterations weighted more, which teaches the operator to converge. The illustration below captures this encode, correlate, iterate loop as a robot consulting its lookup table and redrawing its arrows.

Illustration: a cartoon robot repeatedly redrawing a field of motion arrows while consulting a giant all-pairs lookup table on the wall, with a loop arrow and a stack of progressively sharper arrow maps, showing how RAFT encodes features, correlates every pixel pair, and iteratively refines the flow estimate. Caption: A hard guess about where everything moved is easier made as many small, repeated corrections than as one giant leap.
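The weighted supervision of every intermediate flow can be written as a short loss function. This is a sketch under assumptions rather than the reference training code: the name `sequence_loss` and the decay `gamma=0.8` are chosen for illustration, the distance is a plain mean L1 error, and validity masks are omitted. Iteration $i$ of $n$ receives weight $\gamma^{\,n-1-i}$, so the last prediction counts fully and earlier ones count progressively less.

```python
import numpy as np

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """Weighted sum of mean L1 errors over all intermediate flow predictions."""
    n = len(flow_preds)
    total = 0.0
    for i, pred in enumerate(flow_preds):
        weight = gamma ** (n - 1 - i)         # later iterations weigh more
        total += weight * np.abs(pred - flow_gt).mean()
    return total

# Toy example: three predictions whose error halves at each iteration.
flow_gt = np.ones((2, 4, 4))
flow_preds = [flow_gt + 1.0, flow_gt + 0.5, flow_gt + 0.25]
weights = [0.8 ** (3 - 1 - i) for i in range(3)]
loss = sequence_loss(flow_preds, flow_gt)

print("weights:", weights)
print("loss:", loss)
```

Supervising every step, rather than only the last, is what pushes the shared update rule to make progress from any intermediate estimate instead of relying on a fixed number of steps.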

The simplified update step below shows the structure: lookup, combine, predict residual, add. A real RAFT uses a more elaborate lookup with a local neighborhood and a multi-level correlation pyramid, but the recurrence is exactly this.

import torch
import torch.nn as nn

class FlowUpdate(nn.Module):
    """One shared-weight RAFT-style update: predict a residual flow and add it."""
    def __init__(self, hidden=128, corr_dim=81, ctx_dim=128):
        super().__init__()
        # Motion encoder: fuses the correlation lookup with the current flow (2 channels).
        self.enc = nn.Sequential(nn.Conv2d(corr_dim + 2, 64, 3, padding=1), nn.ReLU())
        # A per-pixel GRUCell stands in for RAFT's convolutional GRU to keep the sketch short.
        self.gru = nn.GRUCell(64 + ctx_dim, hidden)
        self.flow_head = nn.Conv2d(hidden, 2, 3, padding=1)   # predicts (du, dv)

    def forward(self, h, flow, corr_lookup, context):
        N, _, H, W = flow.shape
        motion = self.enc(torch.cat([corr_lookup, flow], dim=1))   # what the lookup said
        x = torch.cat([motion, context], dim=1)                    # (N, 64 + ctx_dim, H, W)
        # Fold pixels into the batch axis so the GRUCell updates each location independently.
        x_flat = x.permute(0, 2, 3, 1).reshape(N * H * W, -1)
        h_flat = h.permute(0, 2, 3, 1).reshape(N * H * W, -1)
        h_new = self.gru(x_flat, h_flat).view(N, H, W, -1).permute(0, 3, 1, 2)
        delta = self.flow_head(h_new)                          # residual update
        return h_new, flow + delta                             # refined flow

# One iteration on toy tensors (corr lookup window of 9x9 = 81 channels):
upd = FlowUpdate()
h = torch.zeros(1, 128, 32, 32)
flow = torch.zeros(1, 2, 32, 32)
corr_lookup = torch.randn(1, 81, 32, 32)
context = torch.randn(1, 128, 32, 32)
h, flow = upd(h, flow, corr_lookup, context)
print("refined flow:", flow.shape)                             # refined flow: torch.Size([1, 2, 32, 32])

Code Fragment 2: One iteration of the RAFT update operator, FlowUpdate. It reads the corr_lookup around the current flow, mixes in context features, runs a GRUCell, and predicts a residual $(du, dv)$ via flow_head that is added to the flow. The same weights are reused across all iterations, so the network learns one convergent update rule rather than a fixed-depth feed-forward predictor.
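The 81 channels of `corr_lookup` come from reading a 9x9 window of the correlation volume around the position each pixel's current flow points to. The sketch below makes that concrete under simplifying assumptions: the function name `lookup_window` is hypothetical, it rounds to the nearest pixel where a real implementation would interpolate bilinearly, it reads a single pyramid level, and it uses explicit loops for clarity rather than speed.

```python
import numpy as np

def lookup_window(corr, flow, radius=4):
    """Read a (2r+1)x(2r+1) window of corr around each pixel's flow target.

    corr: (H, W, H, W) all-pairs volume; flow: (2, H, W) as (u, v) = (dx, dy).
    Returns ((2r+1)**2, H, W); entries falling outside the image stay zero.
    """
    H, W = corr.shape[:2]
    offsets = [(di, dj) for di in range(-radius, radius + 1)
                        for dj in range(-radius, radius + 1)]
    out = np.zeros((len(offsets), H, W), dtype=corr.dtype)
    for i in range(H):
        for j in range(W):
            ci = int(np.rint(i + flow[1, i, j]))   # target row in frame two
            cj = int(np.rint(j + flow[0, i, j]))   # target column in frame two
            for c, (di, dj) in enumerate(offsets):
                k, l = ci + di, cj + dj
                if 0 <= k < H and 0 <= l < W:
                    out[c, i, j] = corr[i, j, k, l]
    return out

H = W = 16
corr = np.zeros((H, W, H, W), dtype=np.float32)
corr[5, 5, 5, 9] = 1.0                     # pixel (5, 5) truly moved 4 columns right

zero_flow = np.zeros((2, H, W), dtype=np.float32)
look0 = lookup_window(corr, zero_flow)     # match sits at the window's right edge

good_flow = zero_flow.copy()
good_flow[0, 5, 5] = 4.0                   # current estimate already points at the match
look1 = lookup_window(corr, good_flow)     # match sits at the window's centre

print(look0.shape, int(look0[:, 5, 5].argmax()), int(look1[:, 5, 5].argmax()))
```

With zero flow the peak appears in channel 44 (row offset 0, column offset +4), a cue that the estimate should move right. Once the flow is correct the peak sits in the centre channel 40, which tells the update operator there is nothing left to fix.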

Key Insight: Iterative Refinement Beats One-Shot Prediction

The deepest idea in RAFT is architectural humility. Earlier deep-flow networks (FlowNet, PWC-Net) predicted flow in a single forward pass through a fixed encoder-decoder or coarse-to-fine pyramid, which forced the network to commit to a coarse estimate and refine it at fixed scales. RAFT instead maintains a single high-resolution flow field and improves it through an adjustable number of weight-shared steps, so at inference you can trade accuracy for speed simply by running more or fewer iterations. This is the same principle that powers diffusion models in Chapter 33: a hard prediction is easier to make as a sequence of small, learned, repeated corrections than as one giant leap, and sharing the correction operator across steps keeps the model small.

4. Using Flow, and the Transformer Successors (Intermediate)

In practice you rarely implement RAFT; you load it. TorchVision ships pretrained RAFT weights, and computing dense flow between two frames is a few lines. The output is a $(2, H, W)$ field of horizontal and vertical displacements that you can feed to the temporal stream of Section 26.2, use to warp one frame onto another, or visualize as a colour-coded motion image.

Library Shortcut: Pretrained RAFT in a Handful of Lines

The from-scratch components above (encoder, correlation volume, update operator, plus the training that ties them together) are several hundred lines and a multi-stage training run on synthetic datasets such as FlyingChairs and FlyingThings3D, followed by fine-tuning on benchmarks such as Sintel. TorchVision's pretrained RAFT replaces all of it:

# Estimate dense optical flow between two frames with a pretrained RAFT.
# The model emits one flow field per refinement iteration; we keep the last.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
transforms = weights.transforms()

# Stand-in frames so the snippet runs; substitute real (N, 3, H, W) uint8 video frames.
# RAFT expects H and W divisible by 8, so resize or pad real frames first.
frame1 = torch.randint(0, 256, (1, 3, 360, 640), dtype=torch.uint8)
frame2 = torch.randint(0, 256, (1, 3, 360, 640), dtype=torch.uint8)

img1, img2 = transforms(frame1, frame2)              # to float, normalized to [-1, 1]; no resizing
with torch.no_grad():
    flow_predictions = model(img1, img2)             # list of flow fields, one per iteration
flow = flow_predictions[-1]                          # the final, most refined estimate
print(flow.shape)                                    # torch.Size([1, 2, H, W])  (du, dv)

Code Fragment 3: The encoder, correlation volume, and update operator of Code Fragments 1 and 2 as a single call using TorchVision's raft_large and Raft_Large_Weights. The library handles the correlation pyramid, the shared-weight recurrence, and the multi-dataset training internally; you read the final, most refined field from flow_predictions[-1] rather than implementing the iterative loop yourself.

The model returns the full sequence of per-iteration flow fields (the dashed loop of Figure 26.4.1); you take the last one. TorchVision also provides flow_to_image, in torchvision.utils, to render the field as the standard colour wheel for visualization.

You Could Build This: A Motion-Highlight Overlay

The pretrained RAFT of the shortcut, plus flow_to_image, is enough to build a motion-highlight tool that makes movement visible at a glance. For each consecutive frame pair, estimate the flow, take its per-pixel magnitude, threshold it to a mask, and blend the colour-wheel flow image over the original frame only where motion exceeds the threshold; write the result back out as a video. The product is a clip where the static background stays normal and anything that moves glows with a direction-coded colour, the kind of overlay a security review, a wildlife survey, or a sports replay uses to draw the eye straight to the action. Unlike the frame-warping of Exercise 26.4.2, this build never reconstructs a frame; it only visualizes the field, so it is a clean afternoon project. Difficulty: intermediate, about 45 to 60 minutes. Take it further by accumulating magnitude over a whole clip into a single motion-density heatmap, which turns hours of footage into one image of where activity concentrated.
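The core of that build, the threshold-and-blend step, can be sketched independently of the model. Everything named here is an assumption for illustration: the function `motion_highlight`, the threshold of 1 pixel, and the blend weight `alpha`. The flow field and colour image are synthetic stand-ins; in the real tool they would come from RAFT and `flow_to_image`, converted to height-width-channel NumPy arrays.

```python
import numpy as np

def motion_highlight(frame, flow, flow_rgb, threshold=1.0, alpha=0.6):
    """Blend the colour-coded flow over the frame only where motion is large.

    frame, flow_rgb: (H, W, 3) uint8 images; flow: (2, H, W) float (u, v).
    Returns the overlay image and the boolean motion mask.
    """
    magnitude = np.hypot(flow[0], flow[1])               # per-pixel speed in pixels
    mask = magnitude > threshold
    base = frame.astype(np.float32)
    blended = (1 - alpha) * base + alpha * flow_rgb.astype(np.float32)
    base[mask] = blended[mask]                           # background stays untouched
    return base.round().astype(np.uint8), mask

# Synthetic stand-ins: a grey frame, a uniform colour image, one moving block.
frame = np.full((48, 64, 3), 100, dtype=np.uint8)
flow_rgb = np.full((48, 64, 3), 200, dtype=np.uint8)
flow = np.zeros((2, 48, 64), dtype=np.float32)
flow[:, 10:20, 30:40] = np.array([3.0, 4.0], dtype=np.float32)[:, None, None]

overlay, mask = motion_highlight(frame, flow, flow_rgb)
print("moving pixels:", int(mask.sum()))
print("inside block:", overlay[15, 35], "| background:", overlay[0, 0])
```

Accumulating `magnitude` across frames instead of thresholding each one gives the motion-density heatmap suggested as the extension.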

Practical Example: Flow That Stabilized a Drone's Landing

Who: an engineer on an agricultural-drone team building a vision-based landing assist for a quadcopter with no downward-facing depth sensor, 2024. Situation: the drone needed to estimate its own horizontal drift relative to the ground during the final descent, using only the downward camera. Problem: their classical Lucas-Kanade flow, ported from Chapter 15, was accurate over crop rows with strong texture but collapsed over bare soil and shadow, where the lack of gradient left the flow undefined and the drone drifted. Decision: they replaced it with a pretrained RAFT-small running on the onboard accelerator, accepting the higher compute because RAFT's learned features produced a confident flow even over near-textureless soil where the brightness-constancy gradient was nearly zero. Result: drift estimation stayed reliable across surface types, and the landing accuracy improved enough to remove a fallback ground-marker requirement. Lesson: the textureless-region failure that limited classical flow in Chapter 15 is what learned features mitigate; RAFT does not depend on a local intensity gradient, because its correlation volume matches learned descriptors that draw on surrounding context and so often remain distinctive where raw pixels are nearly flat. When classical flow fails on weak texture, the deep method is not just more accurate, it can still return a usable estimate where the classical system is ill-conditioned.

Research Frontier: Transformers and Unified Correspondence

RAFT's all-pairs correlation is itself a form of cross-attention between two frames, so it was natural to rebuild flow with transformers explicitly. GMFlow and FlowFormer (2022) recast the matching as attention over feature tokens, improving large-displacement accuracy and, in GMFlow's case, inference speed; the 2023 to 2025 unified-matching models (the GMFlow and UniMatch line) handle optical flow, stereo, and depth with one architecture by changing only what the two input views are, generalizing RAFT's observation that one engine serves all dense-correspondence tasks. A separate thread targets efficiency: SEA-RAFT (2024) and other streamlined variants reach RAFT-level accuracy with far fewer iterations for real-time use on edge devices, the deployment concern of Chapter 28. And dense flow is increasingly a supervision signal rather than an end product, used to enforce temporal consistency in the video generators of Chapter 36, where a generated video that disagrees with its own estimated flow is penalized for flicker.
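The link between correlation and attention can be made concrete with a toy global-matching step in the spirit of GMFlow: take a softmax over each row of the correlation matrix, use it to average the frame-two pixel coordinates, and subtract the original coordinates. The sketch below is an illustration under assumptions, not any published implementation; the function name `softmax_matching_flow` and the one-hot toy features are invented for the example.

```python
import numpy as np

def softmax_matching_flow(f1, f2):
    """Flow from attention-style matching of two (D, H, W) feature maps."""
    D, H, W = f1.shape
    corr = f1.reshape(D, H * W).T @ f2.reshape(D, H * W) / np.sqrt(D)  # (HW, HW)
    corr = corr - corr.max(axis=1, keepdims=True)        # numerically stable softmax
    prob = np.exp(corr)
    prob /= prob.sum(axis=1, keepdims=True)              # each row sums to 1
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)  # (HW, 2) as (x, y)
    matched = prob @ grid                                # expected match position
    return (matched - grid).T.reshape(2, H, W)           # (u, v) per pixel

# Toy features: every frame-one pixel has a unique one-hot descriptor, and
# frame two is frame one shifted 2 columns to the right (with wraparound).
H = W = 8
f1 = 20.0 * np.eye(H * W).reshape(H * W, H, W)
f2 = np.roll(f1, shift=2, axis=2)

flow = softmax_matching_flow(f1, f2)
print("u, v at pixel (3, 3):", flow[0, 3, 3], flow[1, 3, 3])
```

Away from the wraparound columns the recovered flow is 2 pixels horizontally and 0 vertically. Unlike RAFT's local window lookup, a single global softmax sees every candidate at once, which is why this family handles large displacements without a pyramid.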

Exercise 26.4.1: The Aperture Problem (Conceptual)

The optical flow constraint equation $I_x u + I_y v + I_t = 0$ is one equation in two unknowns per pixel. Explain in two or three sentences why this means a single local measurement can only recover the flow component along the image gradient (the aperture problem), and then explain how RAFT's all-pairs correlation volume sidesteps the issue by not relying on a local linearization at all. Relate the correlation volume's role to the block-matching idea you saw in classical stereo in Chapter 13.

Exercise 26.4.2: Run RAFT and Warp a Frame (Coding)

Using the pretrained RAFT from the library shortcut, estimate the flow from the first to the second of two consecutive frames of any short video. Then use the flow field with torch.nn.functional.grid_sample to warp the second frame back toward the first: build the sampling grid by adding the flow to a base coordinate grid and normalizing the result to the $[-1, 1]$ range that grid_sample expects. Sampling works in this direction because the flow tells each first-frame pixel where to look in the second frame. Display the warped second frame, the true first frame, and their difference. A small difference confirms the flow is accurate; identify and explain the regions where the warp fails (occlusions and disocclusions, where no correspondence exists).

Exercise 26.4.3: Iterations Versus Accuracy (Analysis)

RAFT returns one flow field per refinement iteration. Run the pretrained model on a frame pair, extract the intermediate flow at each iteration, and compute the change in flow (mean magnitude of the per-iteration update) as a function of iteration index. Plot the curve and identify roughly how many iterations are needed before the update becomes negligible. Discuss, in one paragraph, how you would set the iteration count for a real-time application versus an offline high-accuracy one, connecting the trade to the iterative-refinement insight in subsection 3 and the efficiency concerns of Chapter 28.