"People keep asking whether I am a small thing nearby or a large thing far away. I stopped answering. I just detect myself at whatever scale shows up."
A Serenely Zoom-Proof Keypoint
A detector that only looks through one window size is blind to the fact that the same object appears at every size, so the fix is to make scale a third search dimension and detect keypoints at the scale where they respond most strongly. This section builds that machinery: Gaussian scale space as the principled stack of progressively blurred images, the scale-normalized Laplacian as the response worth maximizing, the Difference of Gaussians as its cheap approximation, and extremum detection in the resulting 3D volume. A final step, canonical orientation, dispatches rotation the same way. The output is keypoints stamped with position, scale, and angle: exactly the invariant anchor that Section 10.3's SIFT descriptor needs.
Section 10.1 closed with an honest confession: every detector in it evaluates a neighborhood of one fixed size, so none of them survives zoom. Photograph a doorway from across the street and its corner fits the structure tensor's window; walk closer and that same corner becomes a smooth, slowly curving boundary that fills half the image and excites nothing. The physical point did not change. The relationship between the point and the window did. This section repairs that relationship.
1. Why Fixed-Size Windows Fail, and What "Scale" Means Beginner
The deep observation, due to scale-space theory (Lindeberg's work in the 1990s formalized it), is that objects in images do not have a size; they have a size relative to the camera, which the photographer chose freely. An algorithm that hopes to recognize structure across photographs therefore may not privilege any single analysis scale. It must examine all of them, and it needs a principled way to generate "the same image, analyzed at coarser and coarser scales" without inventing artifacts along the way.
You already own the answer. Chapter 3 established that Gaussian smoothing is the unique sensible blur (no new structure created, rotationally symmetric, cascades cleanly), and Chapter 4 stacked Gaussian blurs into pyramids. Gaussian scale space is the continuous version of that pyramid: the family
$$ L(x, y, \sigma) \;=\; G_\sigma * I, $$
the image convolved with Gaussians of every standard deviation $\sigma$. The parameter $\sigma$ is the scale: structure much finer than $\sigma$ has been averaged away in $L(\cdot, \cdot, \sigma)$, structure much coarser survives untouched. Detection will now happen inside this 3D volume $(x, y, \sigma)$ rather than on the 2D image, and "the scale of a keypoint" will mean the $\sigma$ at which it is detected.
Scale invariance is not achieved by designing a magically zoom-proof operator. It is achieved by brute honesty: compute the response at every scale, then keep the scale that responds most. When the image is zoomed by a factor $c$, the whole response landscape shifts along the $\sigma$ axis by that same factor, and the per-keypoint maximum shifts with it. The detected scale rides the zoom, which is precisely what invariance means. The same search-then-select pattern handles rotation at the end of this section, and it is worth recognizing as a general design move.
2. Octaves and the Cascade Trick Intermediate
A continuous family of blurs must be sampled. The convention, inherited by every implementation from SIFT onward, samples $\sigma$ geometrically: scales come in octaves (each octave doubles $\sigma$), and each octave is subdivided into $s$ intervals with ratio $k = 2^{1/s}$ between consecutive scales. SIFT defaults to $s = 3$, $\sigma_0 = 1.6$. Two practical refinements keep the cost sane. First, the cascade property from Chapter 3 (blurring by $\sigma_1$ then $\sigma_2$ equals blurring by $\sqrt{\sigma_1^2 + \sigma_2^2}$) means each level is produced by a small incremental blur of the previous level, never a giant blur of the original. Second, after each octave the image is downsampled by 2 and the process repeats at half resolution: beyond one octave of blur, the extra pixels carry no information, exactly the argument behind Chapter 4's pyramid. The code below builds one octave from these two rules.
import cv2
import numpy as np

def gaussian_octave(img: np.ndarray, sigma0: float = 1.6,
                    intervals: int = 3) -> np.ndarray:
    """One octave of Gaussian scale space: intervals + 3 images."""
    k = 2.0 ** (1.0 / intervals)
    levels = [cv2.GaussianBlur(img, (0, 0), sigma0)]
    for i in range(1, intervals + 3):
        sigma_prev = sigma0 * k ** (i - 1)
        # cascade property: the increment that takes sigma_prev to k*sigma_prev
        sigma_inc = sigma_prev * np.sqrt(k * k - 1.0)
        levels.append(cv2.GaussianBlur(levels[-1], (0, 0), sigma_inc))
    return np.stack(levels)                      # shape: (intervals + 3, H, W)

img = cv2.imread("street.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255
gauss = gaussian_octave(img)
dog = gauss[1:] - gauss[:-1]                     # Difference of Gaussians, see below
print(gauss.shape, dog.shape)                    # (6, H, W) (5, H, W)
Why $s + 3$ images per octave instead of $s$? The extremum search of Section 4 compares each DoG layer against the layers above and below it, so the top and bottom DoG layers can never host detections. To test a full octave's worth of scales ($s$ layers), you need $s + 2$ DoG layers, hence $s + 3$ Gaussian images. Bookkeeping, but the kind that bites anyone reimplementing SIFT.
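The layer bookkeeping is easy to check numerically. The sketch below tabulates the absolute and incremental blur widths for one octave and counts the resulting layers; `octave_schedule` is an illustrative name, not part of OpenCV, and it assumes the same $\sigma_0$ and $s$ defaults quoted above.

```python
import numpy as np

def octave_schedule(sigma0: float = 1.6, intervals: int = 3):
    """Absolute and incremental blur widths for one octave (intervals + 3 levels)."""
    k = 2.0 ** (1.0 / intervals)
    absolute = sigma0 * k ** np.arange(intervals + 3)
    # Cascade property: sigma_next^2 = sigma_prev^2 + sigma_inc^2
    incremental = np.sqrt(absolute[1:] ** 2 - absolute[:-1] ** 2)
    return absolute, incremental

absolute, incremental = octave_schedule()
n_gauss = len(absolute)      # s + 3 Gaussian images
n_dog = n_gauss - 1          # s + 2 DoG layers
n_testable = n_dog - 2       # s layers that have a neighbor above and below

for i, sigma in enumerate(absolute):
    step = "-" if i == 0 else f"{incremental[i - 1]:.3f}"
    print(f"level {i}: sigma = {sigma:.3f}, incremental blur = {step}")
print(n_gauss, n_dog, n_testable)
```

With $s = 3$, the level at index $s$ has exactly $2\sigma_0$ of blur, which makes it the natural image to subsample by 2 when seeding the next octave.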
3. What to Detect: the Scale-Normalized Laplacian Advanced
Scale space says where to search. It does not yet say what function of the image to maximize there. The structure tensor of Section 10.1 could be evaluated per scale (and the multi-scale Harris-Laplace detector does exactly that), but SIFT's lineage uses a different, blob-flavored response: the Laplacian of Gaussian, $\nabla^2 L = L_{xx} + L_{yy}$, which you met as an edge-and-blob operator in Chapter 9. A dark blob on a bright background (or vice versa) produces a strong Laplacian response when the Gaussian's width matches the blob's radius; mismatched widths blur the blob away or average it with its surroundings.
One subtlety decides everything. Gaussian derivatives shrink as $\sigma$ grows (blurring flattens slopes), so raw Laplacian responses at different scales are not comparable: coarse scales would always lose. Lindeberg's fix is scale normalization: multiply by $\sigma^2$, one factor of $\sigma$ per derivative order. The quantity worth comparing across the whole volume is
$$ \mathcal{L}(x, y, \sigma) \;=\; \sigma^2 \left| \nabla^2 L(x, y, \sigma) \right|, $$
and a keypoint is a local extremum of $\mathcal{L}$ in all three coordinates. For an ideal circular blob of radius $r$, the maximum lands at $\sigma = r / \sqrt{2}$: the detected scale literally measures the blob. This is the theoretical heart of the section; everything after it is approximation and engineering.
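A closed-form case makes the role of the $\sigma^2$ factor concrete. The sketch below swaps the disc for a Gaussian blob of width $s_0$ (an assumption made so that the blurred image has an exact formula; for this shape the normalized response peaks at $\sigma = s_0$, not at $r/\sqrt{2}$), and `blob_center_responses` is an illustrative name.

```python
import numpy as np

def blob_center_responses(s0: float, sigmas: np.ndarray):
    """|Laplacian| at the center of a unit-peak Gaussian blob of width s0,
    after blurring with G_sigma, with and without scale normalization.

    Blurring a 2D Gaussian of variance s0^2 by G_sigma gives a Gaussian of
    variance v = s0^2 + sigma^2 and peak s0^2 / v; its Laplacian at the
    center is -2 * peak / v.
    """
    v = s0 ** 2 + sigmas ** 2
    raw = 2.0 * s0 ** 2 / v ** 2           # |lap L|
    normalized = sigmas ** 2 * raw         # sigma^2 * |lap L|
    return raw, normalized

sigmas = np.geomspace(0.5, 64.0, 2001)
peaks = {}
for s0 in (4.0, 8.0):                      # the second blob is the first zoomed by 2x
    raw, normalized = blob_center_responses(s0, sigmas)
    peaks[s0] = (sigmas[raw.argmax()], sigmas[normalized.argmax()])
    print(f"s0 = {s0}: raw peak at sigma = {peaks[s0][0]:.2f}, "
          f"normalized peak at sigma = {peaks[s0][1]:.2f}")
```

The raw response is largest at the finest scale sampled, whatever the blob size, while the normalized response selects a scale that doubles when the blob doubles.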
The engineering: computing $\sigma^2 \nabla^2 L$ at every voxel is expensive, and the Difference of Gaussians you built in the code above is an almost-free stand-in. A first-order expansion of the heat equation (which Gaussian blurring solves, as Chapter 3 noted) gives
$$ D(x, y, \sigma) \;=\; L(x, y, k\sigma) - L(x, y, \sigma) \;\approx\; (k - 1)\,\sigma^2\, \nabla^2 L, $$
so subtracting adjacent scale-space levels yields the scale-normalized Laplacian, including the $\sigma^2$ factor, times a constant $(k-1)$ that is identical for every layer and therefore harmless to extremum detection. The blurred images were needed anyway; the detector response costs one subtraction per level.
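The expansion can be spelled out in one step: in the $\sigma$ parameterization the heat equation reads $\partial L / \partial \sigma = \sigma \nabla^2 L$, so a first-order step from $\sigma$ to $k\sigma$ gives $L(k\sigma) - L(\sigma) \approx (k\sigma - \sigma)\,\sigma \nabla^2 L$. The sketch below measures how tight that step is at the center of a Gaussian blob of width $s_0$, a test shape chosen here only because it has a closed form; `dog_to_normalized_log_ratio` is an illustrative name.

```python
import numpy as np

def dog_to_normalized_log_ratio(s0, sigma, k):
    """Ratio D / ((k - 1) sigma^2 lap L) at the center of a Gaussian blob."""
    v1 = s0 ** 2 + sigma ** 2
    v2 = s0 ** 2 + (k * sigma) ** 2
    dog = s0 ** 2 / v2 - s0 ** 2 / v1      # L(k sigma) - L(sigma) at the center
    lap = -2.0 * s0 ** 2 / v1 ** 2         # lap L(sigma) at the center
    return dog / ((k - 1.0) * sigma ** 2 * lap)

K_SIFT = 2.0 ** (1.0 / 3.0)                # layer ratio for s = 3
test_sigmas = np.array([2.0, 4.0, 8.0])
ratios = {k: dog_to_normalized_log_ratio(4.0, test_sigmas, k)
          for k in (K_SIFT, 1.1, 1.01)}
for k, r in ratios.items():
    print(f"k = {k:.3f}: ratio at sigma = 2, 4, 8 -> {np.round(r, 3)}")
```

A ratio of 1 means the approximation is exact. For this blob it approaches 1 as $k \to 1$ and drifts by tens of percent at $k = 2^{1/3}$, so the relation is best read as a first-order guide to what DoG measures rather than an identity.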
4. Extrema in Three Dimensions Intermediate
Detection is now a neighborhood test in the DoG volume: a voxel is a candidate keypoint if it is strictly greater (or strictly smaller) than all 26 neighbors in its $3 \times 3 \times 3$ surroundings: 8 neighbors in its own layer, 9 in the layer above, 9 below. Figure 10.2.1 shows the geometry. The comparison against neighboring scales is what selects the intrinsic scale; the comparison within the layer is ordinary spatial non-maximum suppression, the same hygiene as in Section 10.1.
The implementation below runs the test on the dog stack built earlier, using a 3D maximum filter as the vectorized form of the 26-neighbor comparison. Note the contrast threshold: DoG values are tiny on low-contrast structure, which is noise-dominated and unrepeatable, so SIFT discards extrema with $|D| < 0.03$ (on images scaled to $[0,1]$) before doing anything else with them.
from scipy.ndimage import maximum_filter, minimum_filter
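# The filters include the center voxel, so these equality tests also accept ties;
# exact ties are rare on float images, and the contrast gate removes flat regions.
is_max = dog == maximum_filter(dog, size=(3, 3, 3))  # 26-neighbor maxima
is_min = dog == minimum_filter(dog, size=(3, 3, 3))  # 26-neighbor minima
extrema = (is_max | is_min) & (np.abs(dog) > 0.03)   # contrast gate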
# Only interior layers have neighbors above AND below
s, y, x = np.where(extrema[1:-1])
sigma = 1.6 * (2.0 ** ((s + 1) / 3.0)) # layer index -> scale
print(f"{len(x)} candidates, sigma range {sigma.min():.2f} to {sigma.max():.2f}")
# 1862 candidates, sigma range 2.02 to 3.20
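When the strict form of the 26-neighbor test matters, it can be written directly with shifted array views. The following is a minimal NumPy-only sketch, not the code used above; `strict_extrema_26` is an illustrative name, and the tiny synthetic volume stands in for a real DoG stack.

```python
from itertools import product

import numpy as np

def strict_extrema_26(vol: np.ndarray):
    """Strict maxima and minima over the 26 neighbors, interior voxels only.

    Returns two boolean arrays of shape (S - 2, H - 2, W - 2); add 1 to each
    index to get back to coordinates in the full volume.
    """
    S, H, W = vol.shape
    core = vol[1:-1, 1:-1, 1:-1]
    is_max = np.ones(core.shape, dtype=bool)
    is_min = np.ones(core.shape, dtype=bool)
    for ds, dy, dx in product((-1, 0, 1), repeat=3):
        if ds == dy == dx == 0:
            continue                       # skip the center voxel itself
        neighbor = vol[1 + ds:S - 1 + ds, 1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        is_max &= core > neighbor
        is_min &= core < neighbor
    return is_max, is_min

vol = np.zeros((5, 7, 7))
vol[2, 3, 3] = 1.0                         # isolated peak: strict maximum
vol[2, 5, 1] = -1.0                        # isolated pit: strict minimum
vol[1, 1, 4] = vol[1, 1, 5] = 0.5          # two-voxel plateau: a tie, so neither counts
is_max, is_min = strict_extrema_26(vol)
max_idx = np.argwhere(is_max) + 1
min_idx = np.argwhere(is_min) + 1
print(max_idx.tolist(), min_idx.tolist())
```

Restricting the test to interior voxels also removes the need to slice off the first and last layers afterwards.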
5. Refinement: Sub-Pixel Fitting and Edge Rejection Advanced
Raw extrema sit on an integer grid in all three dimensions, and the scale grid is coarse (three samples per octave). SIFT therefore fits a 3D quadratic to the DoG values around each extremum (using finite-difference gradients and Hessian) and solves for the continuous offset $\hat{\mathbf{x}} = -\mathbf{H}^{-1} \nabla D$. The payoff is real: localization error drops well below a pixel, and the interpolated $|D(\hat{\mathbf{x}})|$ gives a better contrast estimate for the gate above. Offsets larger than half a sample mean the true extremum belongs to a neighboring voxel, and the fit is re-run there.
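The quadratic fit amounts to one Newton step built from finite differences. The sketch below performs a single step and omits the re-centering loop for offsets beyond half a sample; `refine_extremum` is an illustrative name, and the synthetic quadratic volume is an assumption chosen so that the true offset is known.

```python
import numpy as np

def refine_extremum(vol: np.ndarray, s: int, y: int, x: int):
    """One step of the 3D quadratic fit at voxel (s, y, x).

    Returns the continuous offset (ds, dy, dx) and the interpolated value.
    """
    c = vol[s, y, x]
    grad = 0.5 * np.array([                # central differences
        vol[s + 1, y, x] - vol[s - 1, y, x],
        vol[s, y + 1, x] - vol[s, y - 1, x],
        vol[s, y, x + 1] - vol[s, y, x - 1],
    ])
    dss = vol[s + 1, y, x] - 2.0 * c + vol[s - 1, y, x]
    dyy = vol[s, y + 1, x] - 2.0 * c + vol[s, y - 1, x]
    dxx = vol[s, y, x + 1] - 2.0 * c + vol[s, y, x - 1]
    dsy = 0.25 * (vol[s + 1, y + 1, x] - vol[s + 1, y - 1, x]
                  - vol[s - 1, y + 1, x] + vol[s - 1, y - 1, x])
    dsx = 0.25 * (vol[s + 1, y, x + 1] - vol[s + 1, y, x - 1]
                  - vol[s - 1, y, x + 1] + vol[s - 1, y, x - 1])
    dyx = 0.25 * (vol[s, y + 1, x + 1] - vol[s, y + 1, x - 1]
                  - vol[s, y - 1, x + 1] + vol[s, y - 1, x - 1])
    hess = np.array([[dss, dsy, dsx],
                     [dsy, dyy, dyx],
                     [dsx, dyx, dxx]])
    offset = -np.linalg.solve(hess, grad)  # x_hat = -H^{-1} grad D
    value = c + 0.5 * grad @ offset        # D evaluated at x_hat
    return offset, value

# Synthetic volume: an exact quadratic peaking between grid points.
true_peak = np.array([2.3, 3.2, 2.9])
A = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.5, 0.2],
              [0.1, 0.2, 1.0]])
grid = np.mgrid[0:5, 0:7, 0:6].astype(float)
d = grid - true_peak[:, None, None, None]
vol = 0.2 - 0.5 * np.einsum("i...,ij,j...->...", d, A, d)

offset, value = refine_extremum(vol, 2, 3, 3)
print(offset, value)
```

Because finite differences are exact on a quadratic, the recovered offset here equals the true displacement from voxel (2, 3, 3); on real DoG data the fit is only a local approximation.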
One failure mode survives all of this: the DoG operator responds strongly along edges, where localization is poor in exactly the way Section 10.1 diagnosed with the aperture problem. The cure recycles the structure-tensor argument at the keypoint's own scale: compute the $2 \times 2$ spatial Hessian $\mathbf{H}$ of the DoG layer at the keypoint (not the $3 \times 3$ Hessian of the quadratic fit above). Its eigenvalues are the principal curvatures, and their ratio plays the same role as the structure tensor's eigenvalue ratio. Rather than computing eigenvalues, SIFT tests the trace-determinant ratio
$$ \frac{(\operatorname{tr} \mathbf{H})^2}{\det \mathbf{H}} \;<\; \frac{(r + 1)^2}{r}, \qquad r = 10, $$
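which accepts a keypoint only if its principal curvatures differ by less than a factor of 10. The bound comes from writing the larger curvature as $r$ times the smaller, $\lambda_1 = r \lambda_2$: then $(\operatorname{tr} \mathbf{H})^2 / \det \mathbf{H} = (r \lambda_2 + \lambda_2)^2 / (r \lambda_2^2) = (r + 1)^2 / r$, which increases with $r$ for $r \ge 1$. The echo of the Harris trick (avoid eigendecomposition, use trace and determinant) is no coincidence; it is the same mathematics serving a second tour of duty.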
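The edge gate needs only three second differences per keypoint. The sketch below is one way to write it, with the hypothetical helper `passes_edge_test` and three synthetic layers standing in for DoG data. Rejecting non-positive determinants (curvatures of opposite sign) is a safeguard assumed here, since the ratio test is only meaningful when the determinant is positive.

```python
import numpy as np

def passes_edge_test(layer: np.ndarray, y: int, x: int, r: float = 10.0) -> bool:
    """Trace-determinant test on the 2x2 spatial Hessian at (y, x)."""
    c = layer[y, x]
    dyy = layer[y + 1, x] - 2.0 * c + layer[y - 1, x]
    dxx = layer[y, x + 1] - 2.0 * c + layer[y, x - 1]
    dxy = 0.25 * (layer[y + 1, x + 1] - layer[y + 1, x - 1]
                  - layer[y - 1, x + 1] + layer[y - 1, x - 1])
    trace = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0.0:                         # saddle-like: curvatures of opposite sign
        return False
    return bool(trace ** 2 / det < (r + 1.0) ** 2 / r)

yy, xx = np.mgrid[-3:4, -3:4].astype(float)
blob = -(yy ** 2 + xx ** 2)                # equal curvatures: ratio 4
ridge = -(yy ** 2 + 0.02 * xx ** 2)        # curvatures differ by 50x: edge-like
saddle = yy ** 2 - xx ** 2                 # negative determinant

results = {name: passes_edge_test(layer, 3, 3)
           for name, layer in (("blob", blob), ("ridge", ridge), ("saddle", saddle))}
print(results)
```

Only the isotropic blob passes; the ridge fails the ratio bound and the saddle fails the determinant check.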
6. Canonical Orientation: Rotation Handled the Same Way Intermediate
Position and scale are now invariantly detected; rotation remains. The strategy repeats the search-then-select pattern introduced in Section 1: measure everything, select the maximum. Around each keypoint, within a Gaussian window of standard deviation $1.5\sigma$ at the keypoint's own scale, collect every pixel's gradient orientation into a 36-bin histogram (10 degrees per bin), weighting each vote by gradient magnitude and by the Gaussian window. The histogram's peak is the keypoint's canonical orientation: the descriptor of Section 10.3 will be computed in a coordinate frame rotated to this angle, so the same physical patch yields the same descriptor no matter how the camera was rolled. Figure 10.2.2 sketches the histogram and one important wrinkle.
The wrinkle Figure 10.2.2 highlights: some patches (a checkerboard junction, an X-shaped crossing) genuinely have two strong orientations, and picking one would make the descriptor flip unpredictably between photographs. SIFT's solution is to duplicate: any histogram peak within 80 percent of the maximum spawns its own keypoint with its own orientation. About 15 percent of keypoints get duplicated this way, and Lowe reports the duplicates contribute disproportionately to matching stability. A parabola fit over each peak and its two neighbors refines the angle below the 10-degree bin width.
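The histogram, the 80 percent rule, and the parabola refinement fit in a few lines. The sketch below makes several assumptions that are choices of this example rather than of any library: the function names are illustrative, bins are centered on multiples of 10 degrees, angles run from the $+x$ axis toward $+y$ with rows increasing downward, and the patch is a square already cut out around the keypoint.

```python
import numpy as np

def orientation_histogram(patch: np.ndarray, sigma: float, bins: int = 36) -> np.ndarray:
    """Gradient-orientation histogram weighted by magnitude and a Gaussian window."""
    gy, gx = np.gradient(patch.astype(float))          # derivatives along rows, columns
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    h, w = patch.shape
    yy, xx = np.mgrid[:h, :w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    window = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * (1.5 * sigma) ** 2))
    idx = np.round(ang * bins / 360.0).astype(int) % bins
    return np.bincount(idx.ravel(), weights=(mag * window).ravel(), minlength=bins)

def dominant_orientations(hist: np.ndarray, ratio: float = 0.8) -> list:
    """Angles (degrees) of all local peaks within `ratio` of the highest bin."""
    bins = len(hist)
    left, right = np.roll(hist, 1), np.roll(hist, -1)  # circular neighbors
    is_peak = (hist > left) & (hist > right) & (hist >= ratio * hist.max())
    angles = []
    for b in np.flatnonzero(is_peak):
        l, c, r = left[b], hist[b], right[b]
        offset = 0.5 * (l - r) / (l - 2.0 * c + r)     # vertex of the fitted parabola
        angles.append(float(((b + offset) * 360.0 / bins) % 360.0))
    return angles

# A linear ramp has one gradient direction everywhere: 30 degrees here.
yy, xx = np.mgrid[:17, :17]
theta = np.radians(30.0)
ramp = xx * np.cos(theta) + yy * np.sin(theta)
ramp_angles = dominant_orientations(orientation_histogram(ramp, sigma=2.0))

# A hand-made histogram with two strong peaks and one weak one.
hist = np.zeros(36)
hist[3], hist[20], hist[10] = 1.0, 0.9, 0.5
two_angles = dominant_orientations(hist)
print(ramp_angles, two_angles)
```

The second histogram yields two orientations, which in SIFT would mean two keypoints at the same position and scale; the weak peak at bin 10 falls below the 80 percent bar and is ignored.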
Everything above (octave construction, DoG, 26-neighbor extrema, quadratic refinement, contrast and edge gates, orientation histograms, duplication) is two lines with OpenCV: sift = cv2.SIFT_create() then kps = sift.detect(gray, None). A faithful from-scratch version runs several hundred lines, so the reduction is roughly 200-to-1. Each returned cv2.KeyPoint carries the section's outputs as attributes: kp.pt (sub-pixel position), kp.size (diameter, proportional to detected $\sigma$), kp.angle (canonical orientation in degrees), and kp.response (contrast strength, useful for ranking). The constructor exposes the section's thresholds by name: nOctaveLayers=3, contrastThreshold=0.04, edgeThreshold=10, sigma=1.6.
Who: A four-person team building a point-your-phone audio guide for a national art museum.
Situation: Visitors photograph a painting; the app matches it against a gallery of 2,400 reference images and plays the right commentary. The prototype detected FAST corners at a single scale and matched small patches around them.
Problem: Accuracy was 96 percent for visitors standing at arm's length and 41 percent from the back of the room. The reference photos were taken at one distance; visitor photos spanned a 6x range of apparent painting size, and single-scale detection found entirely different corner sets at each distance.
Decision: Replace detection with scale-space keypoints (OpenCV SIFT, defaults), keeping the rest of the pipeline. Reference images were re-indexed once, overnight.
Result: Back-of-room accuracy rose to 93 percent; arm's-length accuracy held. Median detected scales on far shots were about 2.5x smaller than on near shots of the same paintings, confirming the mechanism: the same physical details were being found, at the scales they actually appeared.
Lesson: If users control the camera, scale invariance is not an academic nicety; it is the difference between a demo and a product. Measure your real input's scale range before trusting any fixed-scale detector.
Modern learned detectors split on this section's central question. Some keep the classical answer: XFeat (CVPR 2024) and ALIKED (2023) run explicit multi-scale inference, and DeDoDe (3DV 2024) trains its detector on multi-view consistency so that scale robustness is learned from data rather than built from Gaussians. Others abandon per-keypoint scale search entirely: DUSt3R (CVPR 2024) and MASt3R (ECCV 2024) feed two whole images into a transformer that has seen so many scale combinations in training that invariance emerges as a network property, with no $\sigma$ axis anywhere. The open trade is sample efficiency versus generality: the Gaussian scale space gets invariance from mathematics with zero training data; the transformers get broader robustness (extreme viewpoint, weak texture) at the cost of millions of training pairs and a GPU. Production reconstruction systems in 2026 still overwhelmingly run the classical pyramid, with the learned representations of Chapter 25 advancing every year.
Suppose extrema were detected on the raw Laplacian $|\nabla^2 L|$ without the $\sigma^2$ normalization. Using the fact that Gaussian blurring with scale $\sigma$ shrinks image derivatives roughly in proportion to $1/\sigma$ per derivative order, predict which scales would win every extremum competition and what would happen to detected keypoint scales when an image is zoomed by 2x. Then explain why the DoG approximation does not need an explicit $\sigma^2$ factor added back.
Render synthetic images containing single dark discs of radius $r \in \{4, 8, 16, 32\}$ pixels on a white background. Build a dense DoG response over $\sigma$ from 2 to 40 (no octave downsampling needed) and, for each disc, plot $|D|$ at the disc center as a function of $\sigma$. Verify that the peak tracks $\sigma \approx r / \sqrt{2}$ and report the percentage error of the prediction at each radius.
Extend Exercise 10.1.3 from rotation to scale: downsample an image by factors 1.25, 1.6, 2.0, and 2.5 (use the resampling guidance of Chapter 5), detect keypoints with both Shi-Tomasi and SIFT, map detections back to original coordinates, and plot repeatability versus zoom factor for each detector. Additionally, for SIFT only, scatter-plot each matched keypoint's detected scale in the zoomed image against its scale in the original. What slope do you expect, and what slope do you measure?