Section 4.2: The 2D DFT & FFT in Practice

"I do half the work and hand the other half to my twin, who does half the work and hands the other half to her twin. Somehow we finish in log time and management takes the credit."
An Overworked FFT Butterfly

Big Picture

The discrete Fourier transform is the exact, invertible, finite recipe that turns a pixel array into wave weights, and the fast Fourier transform is the algorithm that makes it cheap enough to call casually. This section nails down what the libraries actually compute: the DFT's definition and its three structural properties (conjugate symmetry, periodicity, and the corner-dwelling DC term), the display conventions every spectrum figure relies on, the performance landscape across NumPy, SciPy, OpenCV, and PyTorch, and the convolution theorem that makes large-kernel filtering hundreds of times faster.

Section 4.1 argued that images are sums of waves; this section hands you the machine that finds the weights. The good news is that the machine is two function calls: one forward, one inverse. The work is in understanding its conventions, because nearly every confusing spectrum plot, mysterious wraparound artifact, and "why is my filtered image complex?" bug traces back to a convention that was never stated. We state them all.

1. The 2D Discrete Fourier Transform, Defined Intermediate

For an $M \times N$ image $f(x, y)$, the 2D DFT produces an $M \times N$ array of complex coefficients:

$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\; e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)}$$

and the inverse transform reconstructs the image exactly:

$$f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\; e^{+j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)}$$

Each coefficient $F(u,v)$ is the inner product of the image with one complex sinusoid. The exponential $e^{-j2\pi(\dots)}$ is just a compact way to carry a cosine and a sine together (by Euler's formula, $e^{j\theta} = \cos\theta + j\sin\theta$), which is what lets a single complex number record both the strength of a wave and its phase shift at once, the amplitude and phase that Section 4.1 tracked separately. Three structural facts follow directly from the definition, and all three matter daily in practice.

First, $F(0, 0) = \sum_{x,y} f(x,y)$: the zero-frequency or DC coefficient is the plain sum of all pixels, so $F(0,0)/MN$ is the image mean. Second, for a real-valued image the spectrum has conjugate symmetry, $F(-u, -v) = F^{*}(u, v)$: half the coefficients are determined by the other half, which is why the spectrum dots in Figure 4.1.1 came in mirrored pairs, and why rfft2 can store just over half the spectrum with no information loss. Third, the DFT treats the image as periodic: it analyzes an infinite tiling of your image, not the image alone. The right edge wraps to meet the left edge, and if those edges differ in brightness, the artificial seam injects a bright horizontal-plus-vertical cross into the spectrum. The same periodicity is why unpadded frequency-domain convolution wraps around the borders, the trap discussed in part 4 of this section.

Key Insight: The DFT Analyzes a Tiled Image

Every property that puzzles newcomers (the cross artifact in spectra, wraparound in FFT convolution, the need for padding and windowing) is the same single fact wearing different costumes: the DFT's basis functions are periodic, so the transform sees your image repeated forever in all directions. When the spectrum surprises you, first ask what the infinite tiling of your image looks like at the seams.

2. Computing and Displaying a Spectrum Beginner

The raw output of np.fft.fft2 stores frequency $(0,0)$ at array index $[0, 0]$, the top-left corner, with positive frequencies counting up and negative frequencies stored in the upper half of each axis (an inheritance from the transform's periodicity). For viewing, everyone re-centers the spectrum with fftshift so DC sits in the middle, and compresses its enormous dynamic range with a log. The full display recipe:

# Compute a viewable magnitude spectrum: transform, recenter DC with
# fftshift, then log-compress the magnitude. The interleaved checks confirm
# the DC-equals-mean identity and that the inverse transform is exact.
import numpy as np
import matplotlib.pyplot as plt
from skimage import data

img = data.camera().astype(np.float64)         # 512 x 512 grayscale

F = np.fft.fft2(img)                           # complex128, DC at [0, 0]
F_shifted = np.fft.fftshift(F)                 # DC moved to the center
magnitude = np.log1p(np.abs(F_shifted))        # log(1 + |F|) tames the range

# Sanity checks worth internalizing
print(F.shape, F.dtype)                        # (512, 512) complex128
print(np.allclose(F[0, 0].real / img.size, img.mean()))   # True: DC = mean
recon = np.real(np.fft.ifft2(F))
print(np.allclose(recon, img))                 # True: perfectly invertible

plt.imshow(magnitude, cmap="gray"); plt.axis("off"); plt.show()

Code 4.2.1: The canonical spectrum pipeline: fft2, fftshift, log1p of the magnitude. The printed checks confirm the DC coefficient equals the image mean times the pixel count and that the inverse transform reproduces the image to machine precision.

Without the log, the DC term and its low-frequency neighbors are so much larger than everything else (often by six orders of magnitude) that the display would be a single white pixel on black. Without the shift, the visually meaningful "low frequencies in the middle, high frequencies at the edge" layout is scrambled into four corner fragments. Figure 4.2.1 shows exactly what fftshift rearranges; commit it to memory, because forgetting the shift (or applying it twice, or filtering a shifted spectrum with an unshifted mask) is the most common frequency-domain bug in student code.

Figure 4.2.1: fftshift swaps the DFT's quadrants diagonally (A with D, B with C), moving the DC coefficient and the low-frequency energy from the array corners to the center. Filter masks built in centered coordinates must be applied to a shifted spectrum, or shifted back with ifftshift before use.

3. The FFT and the Library Landscape Intermediate

The definition you just met is also a trap: written out literally, it is unusably slow, and the only reason you can call fft2 without a second thought is one algorithm that rescued it. Evaluating the DFT definition directly costs $O(M^2 N^2)$: every output coefficient touches every pixel. The fast Fourier transform computes the identical result in $O(MN \log MN)$ by recursively splitting each 1D transform into even-indexed and odd-indexed halves and reusing the shared work (the "butterfly" pattern). A 2D transform is just 1D FFTs over all rows, then all columns. At $4096 \times 4096$ the speedup factor is roughly a million; the FFT is the reason frequency-domain methods are practical at all, and it is routinely listed among the most important algorithms of the twentieth century.

Fun Fact

Cooley and Tukey published the FFT in 1965, motivated in part by the need to detect Soviet nuclear tests from seismometer data fast enough to matter. It later emerged that Carl Friedrich Gauss had worked out essentially the same trick in 1805, two years before Fourier's own memoir, and left it unpublished in his notebooks, in Latin. The fastest algorithm of the computer age predates the telegraph.

In the Python ecosystem you will meet four implementations, and it is worth knowing when each earns its keep. numpy.fft is always there and always correct; its rfft2 exploits conjugate symmetry to do roughly half the work on real images. scipy.fft is a faster drop-in with a workers= argument for multi-core transforms and better support for odd sizes. OpenCV's cv2.dft is fastest when you feed it sizes it likes; cv2.getOptimalDFTSize tells you the next "friendly" size (products of 2, 3, and 5) to pad to, a habit inherited from the array-centric stack of Chapter 0 that can double throughput for awkward dimensions. And torch.fft runs on the GPU and, crucially, is differentiable, so a frequency-domain term can sit inside a training loss.

# Run the same forward transform through OpenCV and PyTorch to show their
# conventions: OpenCV wants float32 and a padded FFT-friendly size, while
# torch.fft.rfft2 returns the real-input half-spectrum and supports autograd.
import cv2
import numpy as np
import torch

img32 = img.astype(np.float32)
h, w = img32.shape

# --- OpenCV: pad to an FFT-friendly size, then transform -------------
H_opt, W_opt = cv2.getOptimalDFTSize(h), cv2.getOptimalDFTSize(w)
padded = cv2.copyMakeBorder(img32, 0, H_opt - h, 0, W_opt - w,
                            cv2.BORDER_CONSTANT, value=0)
dft = cv2.dft(padded, flags=cv2.DFT_COMPLEX_OUTPUT)   # (H_opt, W_opt, 2)
mag_cv = cv2.magnitude(dft[..., 0], dft[..., 1])

# --- PyTorch: half-spectrum of a real input, GPU and autograd ready ---
x = torch.from_numpy(img)                  # add .cuda() if a GPU is present
F_half = torch.fft.rfft2(x)                # shape (512, 257) complex
power = F_half.real**2 + F_half.imag**2    # differentiable spectral power

Code 4.2.2: The same transform through two other doors. OpenCV wants float32 input, rewards padding to getOptimalDFTSize, and returns real and imaginary parts as two channels; torch.fft.rfft2 returns the non-redundant half-spectrum of a real tensor and participates in autograd.

4. The Convolution Theorem in Practice Intermediate

The single most consequential theorem of this chapter connects it back to Chapter 3: convolution in the spatial domain equals element-wise multiplication in the frequency domain,

$$f * h \;\longleftrightarrow\; F \cdot H$$

Sliding a kernel across an image, the operation you implemented with quadruple loops in Chapter 3, is the same as transforming both image and kernel, multiplying the spectra point by point, and transforming back. For a $K \times K$ kernel on an $M \times N$ image, direct convolution costs $O(MNK^2)$ while the FFT route costs $O(MN \log MN)$ regardless of kernel size. Small kernels are cheaper directly; large kernels are dramatically cheaper through the FFT. The crossover on typical hardware arrives surprisingly early, around kernel widths of 11 to 15:

Proof: The Convolution Theorem From the Shift Property

The theorem follows in one line from the same shift property that powers phase correlation in part 5 below. The DFT of a shifted impulse is a pure phase ramp; writing $W = e^{-j2\pi(\,\cdot\,)}$ for the kernel of the transform, a shift by $(a, b)$ maps to multiplication by $e^{-j2\pi(ua/M + vb/N)}$. Now take the DFT of the (circular) convolution $(f * h)(x, y) = \sum_{a,b} f(a, b)\, h(x - a, y - b)$ and exchange the two finite sums:

$$\mathcal{F}\{f * h\}(u, v) = \sum_{x,y}\left[\sum_{a,b} f(a, b)\, h(x - a, y - b)\right] e^{-j 2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)} = \sum_{a,b} f(a, b) \sum_{x,y} h(x - a, y - b)\, e^{-j 2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)}$$

The inner sum is the DFT of $h$ shifted by $(a, b)$, so by the shift property it equals $H(u, v)\, e^{-j2\pi(ua/M + vb/N)}$. Pulling that factor out leaves the outer sum as exactly the DFT of $f$:

$$\mathcal{F}\{f * h\}(u, v) = H(u, v) \sum_{a,b} f(a, b)\, e^{-j 2\pi\left(\frac{ua}{M} + \frac{vb}{N}\right)} = H(u, v)\, F(u, v)$$

So spatial convolution becomes the element-wise product $F \cdot H$. The derivation used circular shifts, which is why the bare product gives circular convolution; the zero-padding in the warning below is what promotes it to the linear convolution of Chapter 3.

# Time direct spatial convolution against FFT convolution as the kernel
# grows. Direct cost rises with kernel area while the FFT route stays
# nearly flat, so large kernels cross over to the frequency domain.
from time import perf_counter
from scipy.signal import convolve2d, fftconvolve

big = np.random.rand(1024, 1024)

for k in [3, 7, 15, 31, 63]:
    kernel = np.random.rand(k, k)
    t0 = perf_counter(); convolve2d(big, kernel, mode="same"); t1 = perf_counter()
    fftconvolve(big, kernel, mode="same");                     t2 = perf_counter()
    print(f"{k:2d}x{k:<2d}  direct {t1-t0:7.2f} s   fft {t2-t1:6.3f} s")

# Representative output (one laptop run; your numbers will differ,
# the shape of the trend will not):
#  3x3   direct    0.07 s   fft  0.090 s
#  7x7   direct    0.30 s   fft  0.092 s
# 15x15  direct    1.30 s   fft  0.095 s
# 31x31  direct    5.50 s   fft  0.098 s
# 63x63  direct   22.10 s   fft  0.103 s

Code 4.2.3: Direct versus FFT convolution as kernel size grows. Direct cost scales with the kernel area; FFT cost is essentially flat, so a 63-pixel kernel is two orders of magnitude faster through the frequency domain.

Warning: Circular Convolution Wraps Around

Because the DFT assumes periodicity, multiplying unpadded spectra computes circular convolution: kernel mass that slides off the right edge re-enters on the left, smearing opposite borders into each other. Linear convolution requires zero-padding both signals to at least $(M + K - 1) \times (N + K - 1)$ before transforming. scipy.signal.fftconvolve handles the padding, the size bookkeeping, and the final cropping for you, which is the main reason to prefer it over a hand-rolled multiply.

Library Shortcut: scipy.signal.fftconvolve and oaconvolve

A correct hand-rolled FFT convolution (pad both arrays, transform, multiply, inverse-transform, crop the valid region, manage real/complex types) runs to about 15 lines with three classic off-by-one traps. The library form is one line:

out = fftconvolve(big, kernel, mode="same")     # padding, cropping handled
# scipy.signal.oaconvolve: same API, overlap-add, better for huge images

Code 4.2.4: Linear FFT convolution as a single SciPy call, with all padding, cropping, and real-input optimizations handled internally.

That is roughly a 15-to-1 line reduction, and the library additionally picks real-input FFT paths automatically, chooses fast transform sizes, and offers oaconvolve, which processes enormous images in tiles so memory stays bounded. SciPy's choose_conv_method will even pick direct versus FFT for you based on measured heuristics.

This theorem is also a lens for what comes later in the book: the learned convolution layers of Chapter 19 are, in the frequency view, learned spectral multipliers, and that equivalence is exactly what several of the architectures in this section's research frontier exploit.

5. Phase Correlation: Alignment in One Transform Advanced

Here is the frequency domain solving a real task end to end. If image $b$ is image $a$ shifted by $(\Delta x, \Delta y)$, the Fourier shift theorem says their spectra differ only by a phase ramp: $B(u,v) = A(u,v)\, e^{-j2\pi(u\Delta x/M + v\Delta y/N)}$. Dividing out the magnitudes leaves a pure complex exponential whose inverse transform is a single sharp spike located exactly at the displacement. This is phase correlation: translation estimation for the price of three FFTs, robust to global illumination changes (which live in magnitude, not phase) and indifferent to how feature-rich the scene is.

# Recover a pure translation between two frames with phase correlation:
# the shift theorem turns a spatial offset into a phase ramp, whose inverse
# transform is a single spike located exactly at the displacement.
import cv2
import numpy as np
from skimage import data

a = data.camera().astype(np.float32)
b = np.roll(a, shift=(7, 15), axis=(0, 1))     # down 7 pixels, right 15

(shift_x, shift_y), response = cv2.phaseCorrelate(a, b)
print(f"shift = ({shift_x:.2f}, {shift_y:.2f}), response = {response:.3f}")
# Recovers the 15-pixel horizontal and 7-pixel vertical displacement to
# sub-pixel accuracy with a response near 1.0. The sign convention
# follows the argument order: verify it once on a synthetic roll like
# this one before trusting it in production.

Code 4.2.5: Phase correlation with cv2.phaseCorrelate recovers a known synthetic translation to sub-pixel precision. The response value near 1.0 indicates a confident, single-peak match.

Practical Example: Aligning Ten Thousand Farm Photos Before Lunch

Who: A backend engineer at an agricultural aerial-survey startup processing burst captures from fixed-wing drones.

Situation: Each survey produced tens of thousands of overlapping frames that needed pairwise alignment before stitching and crop-health analysis, with results due the same day.

Problem: The keypoint pipeline (detect, describe, match, fit) averaged 400 ms per pair and failed outright over uniform fields: freshly plowed earth and young wheat offer almost no distinctive corners to match.

Decision: Because consecutive burst frames differ almost purely by translation, the engineer replaced keypoint matching with FFT phase correlation, with a Hann window to suppress the spectral cross from frame borders, falling back to features only when the correlation response dropped below a threshold.

Result: Alignment fell to under 8 ms per pair on the same CPU, a 50-fold speedup, and the uniform-field failure mode vanished, since phase correlation uses the whole image's spectrum rather than scattered corners. The fallback triggered on under 2 percent of pairs.

Lesson: When the geometric relationship is simple (translation, or rotation-plus-scale via the log-polar Fourier-Mellin extension), the frequency domain solves globally and in near-constant time what feature pipelines solve locally and expensively. The full geometric toolbox for harder warps arrives in Chapter 5.

Research Frontier: The FFT Inside Modern Architectures

The convolution theorem is enjoying a second career in deep learning. FNet (2021) replaced transformer self-attention with plain Fourier mixing; Hyena and other long-convolution sequence models (2023) rely on FFT convolution to apply sequence-length kernels, and FlashFFTConv (ICLR 2024) re-engineered the FFT around GPU tensor cores to make that practical at scale. Fourier Neural Operators learn solution operators for PDEs by parameterizing weights directly in the frequency domain, powering weather-scale models such as FourCastNet. The thread running through all of them is exactly this section's theorem: a global operation that is expensive as a sliding window becomes a cheap element-wise product after a transform, a trick the vision transformer lineage of Chapter 22 keeps rediscovering.

Exercise 4.2.1: Conventions Audit Conceptual

A colleague computes mask * np.fft.fft2(img) where mask is a centered low-pass disk built in fftshift coordinates, then takes the inverse FFT and gets garbage that looks like four ghost images. Explain precisely what went wrong, two distinct ways to fix it (shifting the spectrum versus shifting the mask), and why the result of the correct version can still have a tiny imaginary part that is safe to discard.

Exercise 4.2.2: Find Your Own Crossover Coding

Reproduce Code 4.2.3 on your machine, adding kernel sizes 5, 9, 11, and 21, and also timing cv2.filter2D (which uses heavily optimized direct convolution and switches to a DFT path for large kernels). Plot all three curves on a log scale and report the kernel size where the FFT route first wins against each competitor. Then repeat at image size 256×256 and explain why the crossover moves.

Exercise 4.2.3: Stress-Testing Phase Correlation Analysis

Using Code 4.2.5 as a base, measure how the recovered shift and the response value degrade as you (a) add Gaussian noise of increasing strength, (b) brighten one frame by 30 percent, (c) rotate the second frame by 1, 2, 5, and 10 degrees. Which perturbation breaks the method first, and why does that follow from the fact that phase correlation models only translation?