Section 10.2: Scale & Rotation Invariance: Scale Space

"People keep asking whether I am a small thing nearby or a large thing far away. I stopped answering. I just detect myself at whatever scale shows up."
A Serenely Zoom-Proof Keypoint

Big Picture

A detector that only looks through one window size is blind to the fact that the same object appears at every size, so the fix is to make scale a third search dimension and detect keypoints at the scale where they respond most strongly. This section builds that machinery: Gaussian scale space as the principled stack of progressively blurred images, the scale-normalized Laplacian as the response worth maximizing, the Difference of Gaussians as its cheap approximation, and extremum detection in the resulting 3D volume. A final step, canonical orientation, dispatches rotation the same way. The output is keypoints stamped with position, scale, and angle: exactly the invariant anchor that Section 10.3's SIFT descriptor needs.

Section 10.1 closed with an honest confession: every detector in it evaluates a neighborhood of one fixed size, so none of them survives zoom. Photograph a doorway from across the street and its corner fits the structure tensor's window; walk closer and that same corner becomes a smooth, slowly curving boundary that fills half the image and excites nothing. The physical point did not change. The relationship between the point and the window did. This section repairs that relationship.

1. Why Fixed-Size Windows Fail, and What "Scale" Means Beginner

The deep observation, due to scale-space theory (Lindeberg's work in the 1990s formalized it), is that objects in images do not have a size; they have a size relative to the camera, which the photographer chose freely. An algorithm that hopes to recognize structure across photographs therefore may not privilege any single analysis scale. It must examine all of them, and it needs a principled way to generate "the same image, analyzed at coarser and coarser scales" without inventing artifacts along the way.

You already own the answer. Chapter 3 established that Gaussian smoothing is the unique sensible blur (no new structure created, rotationally symmetric, cascades cleanly), and Chapter 4 stacked Gaussian blurs into pyramids. Gaussian scale space is the continuous version of that pyramid: the family

$$ L(x, y, \sigma) \;=\; G_\sigma * I, $$

the image convolved with Gaussians of every standard deviation $\sigma$. The parameter $\sigma$ is the scale: structure much finer than $\sigma$ has been averaged away in $L(\cdot, \cdot, \sigma)$, structure much coarser survives untouched. Detection will now happen inside this 3D volume $(x, y, \sigma)$ rather than on the 2D image, and "the scale of a keypoint" will mean the $\sigma$ at which it is detected.

Key Insight: Invariance by Search, Not by Cleverness

Scale invariance is not achieved by designing a magically zoom-proof operator. It is achieved by brute honesty: compute the response at every scale, then keep the scale that responds most. When the image is zoomed by a factor $c$, the whole response landscape shifts along the $\sigma$ axis by that same factor, and the per-keypoint maximum shifts with it. The detected scale rides the zoom, which is precisely what invariance means. The same search-then-select pattern handles rotation at the end of this section, and it is worth recognizing as a general design move.

Common Misconception: Scale Invariance Means Any Zoom Will Match

Students often read "scale-invariant" as "matches across any change of size," and conclude that a higher-resolution image is always better and that two photos taken at wildly different distances will match as long as both use SIFT. In fact invariance holds only where the detected scale ranges overlap. A keypoint detected at $\sigma = 2$ pixels in a near shot may correspond to a feature below $\sigma = 1$ in a far shot, where it has been blurred or downsampled out of existence and is never detected at all. The detector cannot invent structure that the second image's sampling threw away. This is why the museum-guide example below caps out: scale invariance buys robustness within the band of scales both images actually resolve, not a license to ignore the camera-distance range of your real input.

2. Octaves and the Cascade Trick Intermediate

A continuous family of blurs must be sampled. The convention, inherited by every implementation from SIFT onward, samples $\sigma$ geometrically: scales come in octaves (each octave doubles $\sigma$), and each octave is subdivided into $s$ intervals with ratio $k = 2^{1/s}$ between consecutive scales. SIFT defaults to $s = 3$, $\sigma_0 = 1.6$. Two practical refinements keep the cost sane. First, the cascade property from Chapter 3 (blurring by $\sigma_1$ then $\sigma_2$ equals blurring by $\sqrt{\sigma_1^2 + \sigma_2^2}$) means each level is produced by a small incremental blur of the previous level, never a giant blur of the original. Second, after each octave the image is downsampled by 2 and the process repeats at half resolution: beyond one octave of blur, the extra pixels carry no information, exactly the argument behind Chapter 4's pyramid. The code below builds one octave from these two rules.

import cv2
import numpy as np

def gaussian_octave(img: np.ndarray, sigma0: float = 1.6,
                    intervals: int = 3) -> np.ndarray:
    """One octave of Gaussian scale space: intervals + 3 images."""
    k = 2.0 ** (1.0 / intervals)
    levels = [cv2.GaussianBlur(img, (0, 0), sigma0)]
    for i in range(1, intervals + 3):
        sigma_prev = sigma0 * k ** (i - 1)
        # cascade property: the increment that takes sigma_prev to k*sigma_prev
        sigma_inc = sigma_prev * np.sqrt(k * k - 1.0)
        levels.append(cv2.GaussianBlur(levels[-1], (0, 0), sigma_inc))
    return np.stack(levels)               # shape: (intervals + 3, H, W)

img = cv2.imread("street.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255
gauss = gaussian_octave(img)
dog = gauss[1:] - gauss[:-1]              # Difference of Gaussians, see below
print(gauss.shape, dog.shape)             # (6, H, W) (5, H, W)

Building one octave of Gaussian scale space with incremental blurs via the cascade property, then differencing adjacent levels to get the five-layer DoG stack; the seemingly odd count of intervals + 3 Gaussian images is explained by the extremum search in Figure 10.2.1.

Why $s + 3$ images per octave instead of $s$? The extremum search of Section 4 compares each DoG layer against the layers above and below it, so the top and bottom DoG layers can never host detections. To test a full octave's worth of scales ($s$ layers), you need $s + 2$ DoG layers, hence $s + 3$ Gaussian images. Bookkeeping, but the kind that bites anyone reimplementing SIFT.

Fun Fact

The "+3" in $s+3$ is the single most-asked StackOverflow question about reimplementing SIFT, and the most common off-by-one bug in homegrown versions: drop it and your detector silently refuses to fire on the coarsest and finest scales of every octave, producing a feature map that looks plausible and matches 20 percent worse. The lesson generalizes beyond SIFT: in any sliding-comparison over a stack, the boundary layers are not data, they are padding.

3. What to Detect: the Scale-Normalized Laplacian Advanced

Scale space says where to search. It does not yet say what function of the image to maximize there. The structure tensor of Section 10.1 could be evaluated per scale (and the multi-scale Harris-Laplace detector does exactly that), but SIFT's lineage uses a different, blob-flavored response: the Laplacian of Gaussian, $\nabla^2 L = L_{xx} + L_{yy}$, which you met as an edge-and-blob operator in Chapter 9. A dark blob on a bright background (or vice versa) produces a strong Laplacian response when the Gaussian's width matches the blob's radius; mismatched widths blur the blob away or average it with its surroundings.

One subtlety decides everything. Gaussian derivatives shrink as $\sigma$ grows (blurring flattens slopes), so raw Laplacian responses at different scales are not comparable: coarse scales would always lose. Lindeberg's fix is scale normalization: multiply by $\sigma^2$, one factor of $\sigma$ per derivative order. The quantity worth comparing across the whole volume is

$$ \mathcal{L}(x, y, \sigma) \;=\; \sigma^2 \left| \nabla^2 L(x, y, \sigma) \right|, $$

and a keypoint is a local extremum of $\mathcal{L}$ in all three coordinates. For an ideal circular blob of radius $r$, the maximum lands at $\sigma = r / \sqrt{2}$: the detected scale literally measures the blob. To make that concrete, a dark disc of radius $r = 8$ pixels peaks at $\sigma \approx 8 / 1.414 \approx 5.7$, and a disc of radius $r = 16$ peaks at $\sigma \approx 11.3$, exactly double, so reading off the winning $\sigma$ and multiplying by $\sqrt{2}$ recovers the blob's radius in pixels. This is the theoretical heart of the section; everything after it is approximation and engineering. Here is the twist worth holding onto through the next paragraph: this principled quantity, the one you would expect to cost a fresh Laplacian computation at every voxel, turns out to be almost free, hiding inside the blurred images you already built.

Pause and Consolidate: Two Claims Before the Next Step

Before the engineering arrives, hold onto two results just established. First, the response worth maximizing is the scale-normalized Laplacian $\sigma^2 |\nabla^2 L|$, and a keypoint is its extremum in all three coordinates $(x, y, \sigma)$. Second, the winning $\sigma$ is not bookkeeping; it physically measures the blob, with the peak at $\sigma = r / \sqrt{2}$. The next paragraph adds only one thing: a way to compute that response almost for free.

The engineering: computing $\sigma^2 \nabla^2 L$ at every voxel is expensive, and the Difference of Gaussians you built in the code above is an almost-free stand-in. The link is the heat equation (the diffusion partial differential equation whose solution is exactly Gaussian blurring, with $\sigma^2$ playing the role of time), which states that the rate of change of $L$ as you blur a little more equals its Laplacian. Subtracting two adjacent blurred levels is precisely measuring that rate of change, so the difference approximates the Laplacian. Written out, the first-order expansion gives

$$ D(x, y, \sigma) \;=\; L(x, y, k\sigma) - L(x, y, \sigma) \;\approx\; (k - 1)\,\sigma^2\, \nabla^2 L, $$

so subtracting adjacent scale-space levels yields the scale-normalized Laplacian, including the $\sigma^2$ factor, times a constant $(k-1)$ that is identical for every layer and therefore harmless to extremum detection. The blurred images were needed anyway; the detector response costs one subtraction per level.

4. Extrema in Three Dimensions Intermediate

Detection is now a neighborhood test in the DoG volume: a voxel is a candidate keypoint if it is strictly greater (or strictly smaller) than all 26 neighbors in its $3 \times 3 \times 3$ surroundings: 8 neighbors in its own layer, 9 in the layer above, 9 below. Figure 10.2.1 shows the geometry. The comparison against neighboring scales is what selects the intrinsic scale; the comparison within the layer is ordinary spatial non-maximum suppression, the same hygiene as in Section 10.1.

Figure 10.2.1 Scale-space detection in one octave. Left: Gaussian levels from $\sigma$ to $2\sigma$. Center: adjacent levels are subtracted into DoG layers, each approximating the scale-normalized Laplacian at its scale. Right: a voxel (red) becomes a candidate keypoint only if it beats all 26 neighbors across its own and the two adjacent DoG layers.

The implementation below runs the test on the dog stack built earlier, using a 3D maximum filter as the vectorized form of the 26-neighbor comparison. Note the contrast threshold: DoG values are tiny on low-contrast structure, which is noise-dominated and unrepeatable, so SIFT discards extrema with $|D| < 0.03$ (on images scaled to $[0,1]$) before doing anything else with them.

from scipy.ndimage import maximum_filter, minimum_filter

is_max = dog == maximum_filter(dog, size=(3, 3, 3))   # 26-neighbor maxima
is_min = dog == minimum_filter(dog, size=(3, 3, 3))   # 26-neighbor minima
extrema = (is_max | is_min) & (np.abs(dog) > 0.03)    # contrast gate

# Only interior layers have neighbors above AND below
s, y, x = np.where(extrema[1:-1])
sigma = 1.6 * (2.0 ** ((s + 1) / 3.0))                # layer index -> scale
print(f"{len(x)} candidates, sigma range {sigma.min():.2f} to {sigma.max():.2f}")
# 1862 candidates, sigma range 2.02 to 3.20

Vectorized 26-neighbor extremum detection over the DoG volume using 3D max and min filters, with the low-contrast gate applied immediately; the recovered $\sigma$ per detection converts a layer index into a physical scale in pixels.

5. Refinement: Sub-Pixel Fitting and Edge Rejection Advanced

Raw extrema sit on an integer grid in all three dimensions, and the scale grid is coarse (three samples per octave). SIFT therefore fits a 3D quadratic to the DoG values around each extremum (using finite-difference gradients and the Hessian, the matrix of second partial derivatives that captures the local curvature) and solves for the continuous offset $\hat{\mathbf{x}} = -\mathbf{H}^{-1} \nabla D$. The payoff is real: localization error drops well below a pixel, and the interpolated $|D(\hat{\mathbf{x}})|$ gives a better contrast estimate for the gate above. Offsets larger than half a sample mean the true extremum belongs to a neighboring voxel, and the fit is re-run there.

One failure mode survives all of this: the DoG operator responds strongly along edges, where localization is poor in exactly the way Section 10.1 diagnosed with the aperture problem. The cure is a structure-tensor argument recycled at the descriptor scale: compute the $2 \times 2$ Hessian of the DoG layer at the keypoint, whose eigenvalue ratio plays the same role as the structure tensor's. Rather than eigenvalues, SIFT tests the trace-determinant ratio

$$ \frac{(\operatorname{tr} \mathbf{H})^2}{\det \mathbf{H}} \;<\; \frac{(r + 1)^2}{r}, \qquad r = 10, $$

which accepts a keypoint only if its principal curvatures differ by less than a factor of 10. The echo of the Harris trick (avoid eigendecomposition, use trace and determinant) is no coincidence; it is the same mathematics serving a second tour of duty.

6. Canonical Orientation: Rotation Handled the Same Way Intermediate

Position and scale are now invariantly detected; rotation remains. The strategy repeats the pattern named in the Key Insight: measure everything, select the maximum. Around each keypoint, within a Gaussian window of $1.5\sigma$ at the keypoint's own scale, collect every pixel's gradient orientation into a 36-bin histogram (10 degrees per bin), weighting each vote by gradient magnitude and the Gaussian. The histogram's peak is the keypoint's canonical orientation: the descriptor of Section 10.3 will be computed in a coordinate frame rotated to this angle, so the same physical patch yields the same descriptor no matter how the camera was rolled. Figure 10.2.2 sketches the histogram and one important wrinkle.

Figure 10.2.2 Orientation assignment. Gradient orientations around the keypoint vote into 36 bins, weighted by magnitude and a Gaussian window. The green peak becomes the canonical orientation. The orange bin clears 80 percent of the peak, so the keypoint is duplicated with a second orientation rather than gambling on one.

The wrinkle Figure 10.2.2 highlights: some patches (a checkerboard junction, an X-shaped crossing) genuinely have two strong orientations, and picking one would make the descriptor flip unpredictably between photographs. SIFT's solution is to duplicate: any histogram peak within 80 percent of the maximum spawns its own keypoint with its own orientation. About 15 percent of keypoints get duplicated this way, and Lowe reports the duplicates contribute disproportionately to matching stability. A parabola fit over each peak and its two neighbors refines the angle below the 10-degree bin width.

Library Shortcut: cv2.SIFT_create Runs This Whole Section

Everything above (octave construction, DoG, 26-neighbor extrema, quadratic refinement, contrast and edge gates, orientation histograms, duplication) is two lines with OpenCV: sift = cv2.SIFT_create() then kps = sift.detect(gray, None). A faithful from-scratch version runs several hundred lines, so the reduction is roughly 200-to-1. Each returned cv2.KeyPoint carries the section's outputs as attributes: kp.pt (sub-pixel position), kp.size (diameter, proportional to detected $\sigma$), kp.angle (canonical orientation in degrees), and kp.response (contrast strength, useful for ranking). The constructor exposes the section's thresholds by name: nOctaveLayers=3, contrastThreshold=0.04, edgeThreshold=10, sigma=1.6.

Try This: Sweep the Contrast Gate

The single line that most changes how many keypoints SIFT returns is the contrast threshold of Section 4, the gate that throws out low-contrast (noise-dominated) extrema. Run cv2.SIFT_create(contrastThreshold=c).detect(gray, None) on one photo for c in 0.01, 0.02, 0.04, 0.08 and print len(kps) each time; then draw the keypoints with cv2.drawKeypoints using cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS so each circle's size shows its detected scale. Watch the count drop as c rises, and notice where the casualties are: the first to vanish cluster on flat skies, smooth walls, and faint texture, exactly the places the contrast gate is meant to distrust, while the high-contrast corners survive every setting. The lesson is that detector tuning is a precision-versus-recall dial, not a quality switch.

Practical Example: The Museum Guide That Failed From the Back of the Room

Who: A four-person team building a point-your-phone audio guide for a national art museum.

Situation: Visitors photograph a painting; the app matches it against a gallery of 2,400 reference images and plays the right commentary. The prototype detected FAST corners at a single scale and matched small patches around them.

Problem: Accuracy was 96 percent for visitors standing at arm's length and 41 percent from the back of the room. The reference photos were taken at one distance; visitor photos spanned a 6x range of apparent painting size, and single-scale detection found entirely different corner sets at each distance.

Decision: Replace detection with scale-space keypoints (OpenCV SIFT, defaults), keeping the rest of the pipeline. Reference images were re-indexed once, overnight.

Result: Back-of-room accuracy rose to 93 percent; arm's-length accuracy held. Median detected scales on far shots were about 2.5x smaller than on near shots of the same paintings, confirming the mechanism: the same physical details were being found, at the scales they actually appeared.

Lesson: If users control the camera, scale invariance is not an academic nicety; it is the difference between a demo and a product. Measure your real input's scale range before trusting any fixed-scale detector.

Research Frontier: Do Networks Still Need a Scale Space?

Modern learned detectors split on this section's central question. Some keep the classical answer: XFeat (CVPR 2024) and ALIKED (2023) run explicit multi-scale inference, and DeDoDe (3DV 2024) trains its detector on multi-view consistency so that scale robustness is learned from data rather than built from Gaussians. Others abandon per-keypoint scale search entirely: DUSt3R (CVPR 2024) and MASt3R (ECCV 2024) feed two whole images into a transformer that has seen so many scale combinations in training that invariance emerges as a network property, with no $\sigma$ axis anywhere. The open trade is sample efficiency versus generality: the Gaussian scale space gets invariance from mathematics with zero training data; the transformers get broader robustness (extreme viewpoint, weak texture) at the cost of millions of training pairs and a GPU. Production reconstruction systems in 2026 still overwhelmingly run the classical pyramid, with the learned representations of Chapter 25 advancing every year.

Exercise 10.2.1: The Case for $\sigma^2$ Conceptual

Suppose extrema were detected on the raw Laplacian $|\nabla^2 L|$ without the $\sigma^2$ normalization. Using the fact that Gaussian blurring with scale $\sigma$ shrinks image derivatives roughly in proportion to $1/\sigma$ per derivative order, predict which scales would win every extremum competition and what would happen to detected keypoint scales when an image is zoomed by 2x. Then explain why the DoG approximation does not need an explicit $\sigma^2$ factor added back.

Exercise 10.2.2: Measuring a Blob With a Detector Coding

Render synthetic images containing single dark discs of radius $r \in \{4, 8, 16, 32\}$ pixels on a white background. Build a dense DoG response over $\sigma$ from 2 to 40 (no octave downsampling needed) and, for each disc, plot $|D|$ at the disc center as a function of $\sigma$. Verify that the peak tracks $\sigma \approx r / \sqrt{2}$ and report the percentage error of the prediction at each radius.

Exercise 10.2.3: Repeatability Under Zoom Analysis

Extend Exercise 10.1.3 from rotation to scale: downsample an image by factors 1.25, 1.6, 2.0, and 2.5 (use the resampling guidance of Chapter 5), detect keypoints with both Shi-Tomasi and SIFT, map detections back to original coordinates, and plot repeatability versus zoom factor for each detector. Additionally, for SIFT only, scatter-plot each matched keypoint's detected scale in the zoomed image against its scale in the original. What slope do you expect, and what slope do you measure?