Part II: Classical Computer Vision
Chapter 9: Edges, Lines & Curves

The Canny Edge Detector, Step by Step

"I accept strong edges on the spot. Weak edges need a strong edge to vouch for them. It's not paranoia; it's hysteresis."

A Canny Detector With Trust Issues
Big Picture

The Canny detector closes the gradient-to-edge gap of Section 9.1 with two ideas that decide with context instead of pixel by pixel: non-maximum suppression thins gradient ribbons to one-pixel crests, and hysteresis lets confident edges rescue their faint continuations. What makes it remarkable is its pedigree: rather than tinkering toward a heuristic, John Canny in 1986 wrote down what an optimal edge detector should achieve (find real edges, place them accurately, mark each once), derived the operator that best satisfies the criteria, and arrived at a four-stage pipeline so well balanced that it remains the default edge detector in every vision library four decades later, and lately the favorite conditioning signal of the diffusion models in Chapter 35.

The previous section ended with a measured indictment: a thresholded gradient map keeps thick ribbons instead of curves and shatters real contours no matter where the threshold sits. This section builds the machine that fixes both defects. We will walk the four stages in order, implement the whole detector from scratch in NumPy, verify it against OpenCV's, and finish with the one-line call and its tuning knobs.

1. What Should an Edge Detector Optimize? Intermediate

Canny's contribution begins before any code, with three criteria for judging any edge detector applied to a noisy step edge:

Good detection. Maximize the probability of marking real edges and minimize the probability of marking noise. Good localization. Marked points should sit as close as possible to the true edge position. Single response. One edge should produce one mark, not a cluster.

The criteria fight each other. Detection improves with more smoothing (noise dies faster than the step does), but localization worsens, because smoothing smears the step's position; this is the $\sigma$ trade-off that Section 9.1 flagged. Canny formalized the trade-off, solved for the 1D operator maximizing a product of the criteria, and found the optimum is approximated within about 20 percent by something we already own: the first derivative of a Gaussian, exactly the "smooth, then differentiate" composite of Chapter 3. The deep reassurance of the result is that the standard practice was not just convenient; it is close to provably the best a linear filter can do. What the optimal filter cannot do alone is satisfy criterion three and produce thin, connected curves; that requires the two nonlinear stages that make Canny's pipeline famous. Figure 9.2.1 lays out all four stages.

1. Gaussian smoothing 2. Gradient magnitude + orientation 3. Non-maximum suppression ribbons → 1-px crests 4. Hysteresis two thresholds; strong recruits weak edge map stages 1-2: linear filtering (Chapter 3 machinery), near-optimal by Canny's criteria stages 3-4: nonlinear decisions that use thinness and continuity, the two facts about edges that no per-pixel threshold can express
Figure 9.2.1 The four-stage Canny pipeline. The first two stages are the smoothed-gradient machinery of Chapter 3; the last two are nonlinear decision stages, and they are where the thick-ribbon and broken-contour defects of Section 9.1 get repaired.

2. Stages 1 and 2: Smooth, Then Measure Intermediate

Nothing new is needed here, only a deliberate choice. Convolve with a Gaussian of scale $\sigma$, then apply Sobel kernels to get $I_x$, $I_y$, and from them magnitude and orientation. The single decision is $\sigma$, and it should be made consciously: $\sigma \approx 1$ to $1.4$ preserves fine texture edges; $\sigma \approx 2$ to $3$ is a sensible default for natural photographs; larger values are for finding only the dominant contours. One practical warning that surprises many users: cv2.Canny performs no internal Gaussian smoothing; it runs Sobel directly on whatever you pass it. Smoothing is your job (or use skimage.feature.canny, which takes an explicit sigma argument). Feeding a raw noisy image to cv2.Canny and wondering about speckle is among the most common edge-detection bugs in production code, a sibling of the denoise-first discipline from Chapter 7.

3. Stage 3: Non-Maximum Suppression Advanced

Stage 3 attacks the thick ribbons. The insight is geometric: crossing an edge perpendicularly, gradient magnitude rises, peaks at the true boundary, and falls. The ribbon exists because we kept the whole rise and fall; the boundary is just the crest. So at every pixel, look one step in the gradient direction and one step against it, and keep the pixel only if its magnitude is at least as large as both neighbors. The result is called non-maximum suppression (NMS): everything that is not a local maximum along the gradient direction is deleted, collapsing each ribbon to a one-pixel crest line exactly perpendicular to the local gradient, the geometry established in Figure 9.1.2 of Section 9.1.

On a discrete grid "one step in the gradient direction" needs care: a gradient at 37 degrees points between pixels. The classical implementation quantizes orientation into four bins (0, 45, 90, 135 degrees), each selecting one of the four neighbor pairs in the 3×3 neighborhood; finer implementations interpolate magnitude between the two pixels straddling the true direction. Figure 9.2.2 shows the quantized version that our from-scratch code implements.

p ∇I keep p only if mag(p) ≥ both shaded neighbors orientation quantized to 4 bins: bin 0° (gradient ↔ horizontal): compare left / right bin 45°: compare upper-right / lower-left bin 90° (gradient ↕ vertical): compare up / down bin 135°: compare upper-left / lower-right each bin spans 45°; ties are kept (≥, not >) so flat crests survive and the contour stays connected
Figure 9.2.2 Non-maximum suppression in a 3×3 neighborhood. The gradient at the center pixel p points up-right (bin 45 degrees), so p is compared against its upper-right and lower-left neighbors (shaded); p survives only as a local maximum along the gradient line, which thins every gradient ribbon to a one-pixel crest.

4. Stage 4: Double Threshold and Hysteresis Advanced

Stage 4 attacks the broken contours, and it does so by refusing to use one threshold. Pick two, $T_{\text{high}}$ and $T_{\text{low}}$ (Canny suggested a ratio between 2:1 and 3:1). Pixels above $T_{\text{high}}$ are strong: accepted unconditionally. Pixels below $T_{\text{low}}$ are rejected unconditionally. The pixels in between are weak: accepted if and only if they connect, directly or through a chain of other weak pixels, to some strong pixel. Implementation is a flood fill seeded at the strong pixels and allowed to spread only through weak ones; readers of Chapter 6 will recognize this as morphological reconstruction of the weak map from strong markers.

Why this works is worth saying plainly. A real contour whose contrast dips for a stretch produces weak pixels precisely along the dip, sandwiched between strong segments, so the flood fill recovers it whole. An isolated noise response has the same magnitude as that dip but no strong sponsor anywhere nearby, so it dies. The decision uses geometry that no per-pixel threshold can see: membership in a connected structure.

Key Insight: Hysteresis Is a Prior About the World, Not a Trick

Double thresholding encodes a physical belief: boundaries in the world are continuous curves, so edge evidence should be judged by the company it keeps. A weak gradient adjoining a confident contour is probably that contour continuing; the same weak gradient alone in flat sky is probably noise. This is the chapter's first instance of a pattern that defines the rest of the book: when per-pixel evidence is ambiguous, resolve it with structure. The Hough transform of Section 9.3 (pixels voting for shared geometry) and the RANSAC consensus of Section 9.4 are the same idea at ever larger scales.

5. The Whole Detector From Scratch Advanced

Forty lines of NumPy implement everything above. The code keeps each stage visibly separate so you can instrument any of them, and then checks itself against OpenCV's implementation.

import cv2
import numpy as np
from collections import deque

def canny_from_scratch(gray, sigma=1.4, low=40, high=90):
    # Stage 1: Gaussian smoothing (sigma chosen by the caller, deliberately)
    img = cv2.GaussianBlur(gray.astype(np.float32), (0, 0), sigma)

    # Stage 2: gradient magnitude and orientation
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # fold to [0, 180)

    # Stage 3: non-maximum suppression with 4-bin orientation quantization
    q = (((ang + 22.5) // 45).astype(int)) % 4         # 0,45,90,135 deg bins
    offs = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]        # neighbor along gradient
    H, W = mag.shape
    nms = np.zeros_like(mag)
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            dy, dx = offs[q[y, x]]
            m = mag[y, x]
            if m >= mag[y + dy, x + dx] and m >= mag[y - dy, x - dx]:
                nms[y, x] = m                          # crest pixels survive

    # Stage 4: double threshold + hysteresis (flood fill from strong seeds)
    strong = nms >= high
    weak = (nms >= low) & ~strong
    edges = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W and weak[yy, xx] and not edges[yy, xx]:
                    edges[yy, xx] = True               # strong recruits weak
                    queue.append((yy, xx))
    return edges

gray = cv2.imread("street.jpg", cv2.IMREAD_GRAYSCALE)
mine = canny_from_scratch(gray, sigma=1.4, low=40, high=90)
ref = cv2.Canny(cv2.GaussianBlur(gray, (0, 0), 1.4), 40, 90) > 0

print(f"edge pixels (ours): {mine.mean():.2%}   (OpenCV): {ref.mean():.2%}")
print(f"pixelwise agreement: {(mine == ref).mean():.1%}")
# Representative output:
# edge pixels (ours): 3.08%   (OpenCV): 3.21%
# pixelwise agreement: 99.3%
The full Canny pipeline in four labeled stages, validated against cv2.Canny on the same pre-blurred input: agreement above 99 percent of pixels, with the residual disagreement concentrated on borderline weak pixels where OpenCV's interpolated NMS and our 4-bin quantization make different calls.

The Python loops make this implementation a teaching instrument, not a production one: it runs in seconds where OpenCV takes about a millisecond. But every line corresponds to a sentence in this section, and instrumenting it (dump the nms array, count how many weak pixels hysteresis rescued) is the fastest way to make the pipeline's behavior concrete. On a typical street photograph, roughly a third of the final edge pixels arrive via rescue rather than direct strong acceptance, which is the broken-contour defect of Section 9.1 being repaired in bulk.

6. The One-Liner and Its Knobs Intermediate

In production you call the library, and the craft moves into choosing arguments. The most durable trick is deriving the thresholds from image statistics, in the spirit of Chapter 2's histogram methods, so that one configuration survives across exposures and scenes:

import cv2
import numpy as np

gray = cv2.imread("street.jpg", cv2.IMREAD_GRAYSCALE)
blur = cv2.GaussianBlur(gray, (0, 0), 1.4)     # cv2.Canny does NOT smooth

# Auto-thresholds from the median: robust to exposure changes
v = float(np.median(blur))
low, high = int(max(0, 0.66 * v)), int(min(255, 1.33 * v))

edges = cv2.Canny(blur, low, high,
                  apertureSize=3,      # Sobel kernel size: 3, 5, or 7
                  L2gradient=True)     # true Euclidean magnitude, not |gx|+|gy|

print(f"median {v:.0f} -> thresholds ({low}, {high}); "
      f"edge pixels: {(edges > 0).mean():.2%}")
# Representative output:
# median 118 -> thresholds (77, 156); edge pixels: 2.64%
The production pattern: explicit pre-blur (because cv2.Canny applies Sobel directly with no internal smoothing), median-derived hysteresis thresholds that track exposure automatically, and L2gradient=True for an isotropic magnitude worth its small extra cost.
Library Shortcut: 40 Lines Become One

Our from-scratch detector is about 40 lines plus a validation harness; cv2.Canny(blur, low, high) is one line, and skimage.feature.canny(gray, sigma=1.4) is one line that also folds in the Gaussian stage and uses sub-bin interpolated NMS for slightly cleaner crests. Internally the libraries handle the parts our version simplified: interpolation between neighbors instead of 4-bin quantization, border policies, a scan-line hysteresis that avoids Python-level queues, and SIMD-parallelized gradient kernels, which is how they run three orders of magnitude faster. Write the from-scratch version once to own the concepts; ship the library call.

Practical Example: The Document Scanner That Stopped Missing Pages

Who: Two engineers on the mobile team of an expense-management app, responsible for the "scan receipt" feature.

Situation: The feature finds the document quadrilateral in the camera frame (Canny, then contour extraction, then a four-corner polygon fit) and warps it flat for OCR, the classic pipeline.

Problem: Detection was tuned on office lighting with fixed thresholds (50, 150). In the field, dim restaurants and glary taxi seats pushed the gradient distribution far from the tuning conditions: the page-found rate measured 78 percent on a 12,000-image sample of real user captures, and support tickets said "the blue box never appears."

Decision: Three changes, all from this section: downscale to 480 pixels wide (less noise, faster, and document edges survive), explicit Gaussian pre-blur at $\sigma = 1.5$, and median-based auto-thresholds per frame instead of constants; plus a light dilation from Chapter 6 to close hairline gaps before contour extraction.

Result: Page-found rate rose to 96 percent on the same evaluation set; 95th-percentile latency was 28 milliseconds on a mid-range phone; the team deleted two per-market threshold overrides from the config system.

Lesson: Canny's thresholds are statistics, not constants. Any deployment that meets uncontrolled lighting should derive them from each frame's own histogram, which costs one median and removes an entire class of field failures.

Fun Fact

The detector the whole field uses was a master's thesis: John Canny worked it out at MIT in 1983, and the journal paper appeared in 1986. Canny then largely left edges behind for robot motion planning and computational geometry. The thesis-project-turned-eternal-default is a recurring pattern in vision; you will meet another one when SIFT appears in Chapter 10.

Research Frontier: Canny's Second Career as a Control Signal

Edge detection research continues (DiffusionEdge, AAAI 2024, generates single-pixel crisp edge maps with a diffusion model; RankED, CVPR 2024, tackles the pixel imbalance and annotator ambiguity with ranking losses; SCESAME, WACV 2024 workshops, derives zero-shot edges by ensembling Segment Anything outputs), but the headline 2024-2026 story is who consumes Canny maps now. ControlNet (Zhang et al., ICCV 2023, arXiv:2302.05543) made the Canny edge map one of the standard conditioning signals for text-to-image diffusion: draw or extract a contour map, and the generator must respect its geometry while inventing everything else. The Canny condition remains among the most-downloaded control adapters for Stable Diffusion and its successors, meaning a 1986 algorithm sits in the hot path of 2026's generative pipelines, a story Chapter 35 completes. The learned first-layer filters of Chapter 19 tell the complementary story: networks rediscover Canny's stages 1 and 2 on their own.

Exercise 9.2.1: Order Matters Conceptual

Explain what would go wrong if the pipeline's stages were reordered: (a) thresholding before non-maximum suppression, and (b) NMS after hysteresis instead of before. Then consider setting $T_{\text{low}} = T_{\text{high}}$: which of Section 9.1's two defects returns, and why does the ratio between the thresholds (rather than either absolute value) control the detector's tolerance for contrast dips along a contour?

Exercise 9.2.2: Measure the Rescue Rate Coding

Instrument the from-scratch implementation to report, for each image, the number of strong pixels, the number of weak pixels, and the number of weak pixels that hysteresis ultimately rescued. Run it over ten varied photographs at threshold ratios 2:1 and 3:1 and report the rescue fraction for each. Then replace the queue-based flood fill with cv2.connectedComponents on the union map (keep components that contain at least one strong pixel) and verify the outputs are identical while measuring the speedup.

Exercise 9.2.3: The Sigma Sweep Analysis

Using skimage.feature.canny, sweep $\sigma \in \{0.5, 1, 2, 4, 8\}$ on one detailed photograph. For each result, report the edge-pixel fraction and the number of connected components, and visually identify which real structures appear and disappear. Plot both metrics against $\sigma$ and annotate the plot with the detection-versus-localization trade-off from Canny's criteria: which $\sigma$ would you choose for (a) a texture-analysis task and (b) finding the silhouette of a single large object, and why?