Part I: Image Processing
Chapter 3: Spatial Filtering & Convolution

Edge-Preserving Smoothing: Bilateral & Guided Filters

"I average with pixels from my side of the edge. The pixels across the boundary seem lovely, truly, but we are 80 gray levels apart and it would never work."

A Diplomatically Selective Bilateral Filter
Big Picture

Edge-preserving filters resolve this chapter's central tension, noise reduction versus edge destruction, by letting each pixel choose its averaging partners based on similarity, not just proximity. The bilateral filter does this with a second Gaussian over intensity differences; the guided filter does it with a local linear model that runs in constant time per pixel regardless of window size. Between them they power the skin smoothing, night modes, and tone mapping inside essentially every smartphone camera, and they mark this chapter's graduation from shift-invariant convolution to content-adaptive filtering, the conceptual road that leads to Chapter 7's patch-based denoisers and beyond.

The smoothers of Section 3.2 obeyed the shift-invariance contract from Section 3.1: same weights everywhere, no exceptions. This section breaks the contract deliberately, and it is worth being precise about why mere parameter tuning could never be enough.

1. Why Every Linear Filter Must Blur Edges Intermediate

Suppose a filter is linear and shift-invariant, applies positive weights summing to one, and is asked to process a perfect step edge: a row of dark pixels at value 50 meeting bright pixels at value 150. Any window that straddles the boundary contains pixels from both populations, and a positive weighted average of values from both sides must land strictly between them. The output therefore contains intermediate values, 80, 110, 130, where the input had none: the step has become a ramp. No choice of positive weights avoids this; wider noise-fighting windows only widen the ramp. This is not an implementation defect but a small theorem: linear smoothing and edge preservation are mathematically incompatible.

The escape requires weights that depend on image content, so that a window straddling an edge averages only the pixels on the center's own side. Content-dependent weights make the filter nonlinear (we saw a first preview with the median in Section 3.2), and they cost us the tidy LSI theory of Section 3.1, including the single-kernel representation and the frequency-domain analysis of Chapter 4. What we buy with that sacrifice is the subject of this section.

2. The Bilateral Filter: Two Gaussians, One Veto Intermediate

The bilateral filter (Tomasi and Manduchi, 1998) multiplies two notions of closeness. A spatial Gaussian $G_{\sigma_s}$ weights neighbors by distance, exactly as in Section 3.2. A range Gaussian $G_{\sigma_r}$ weights them by how similar their intensity is to the center pixel's:

$$ B(x) = \frac{1}{W(x)} \sum_{y \in \Omega} G_{\sigma_s}\!\big(\|x - y\|\big)\; G_{\sigma_r}\!\big(|I(x) - I(y)|\big)\; I(y) $$

where $W(x)$ sums the combined weights so they normalize to one at each pixel. The range term is the veto: a neighbor whose intensity differs from the center by much more than $\sigma_r$ contributes essentially nothing, however close it sits spatially. Figure 3.5.1 shows the consequence at an edge: the effective kernel is a Gaussian with one side amputated, so each pixel averages only within its own region and the step survives smoothing untouched.

image row with edge center pixel edge at dashed line; center sits on the bright side spatial weights Gσs distance only: weights happily cross the edge (this is the bug) bilateral weights Gσs · Gσr range veto zeroes the dark side: averaging stays within the region effective kernel = spatial bell × per-pixel similarity veto; it reshapes itself at every pixel, which is exactly what no shift-invariant convolution is permitted to do
Figure 3.5.1 Bilateral weighting at an edge. The spatial Gaussian (blue) would average across the boundary; multiplying by the range Gaussian (green result) vetoes dissimilar pixels, amputating the kernel at the edge so smoothing never mixes the two populations.

Parameter intuition: $\sigma_s$ plays the familiar Gaussian role (how far to look), while $\sigma_r$ defines what counts as "the same region" in gray levels. Set $\sigma_r$ well below the contrast of edges you must preserve and above the noise amplitude you must remove; the filter then lives in the comfortable gap between the two. As $\sigma_r \to \infty$ the veto never fires and the bilateral degenerates into the plain Gaussian; as $\sigma_r \to 0$ it refuses to average at all.

import cv2

img = cv2.imread("portrait.jpg")

# d=9: window diameter. sigmaColor: range veto (gray levels).
# sigmaSpace: spatial reach (pixels).
smooth = cv2.bilateralFilter(img, d=9, sigmaColor=40, sigmaSpace=5)

# Compare against a plain Gaussian of similar spatial reach:
gauss = cv2.GaussianBlur(img, (9, 9), 2)

edges_b = cv2.Laplacian(cv2.cvtColor(smooth, cv2.COLOR_BGR2GRAY), cv2.CV_32F).var()
edges_g = cv2.Laplacian(cv2.cvtColor(gauss,  cv2.COLOR_BGR2GRAY), cv2.CV_32F).var()
print(f"edge energy: bilateral {edges_b:.0f} vs gaussian {edges_g:.0f}")
# Representative output:
# edge energy: bilateral 142 vs gaussian 61
# The bilateral result keeps over twice the Laplacian energy:
# noise is gone from flat regions but the edges stayed sharp.
Bilateral versus Gaussian smoothing at matched spatial scale, scored by Laplacian variance as an edge-energy proxy: the bilateral output retains the edges that the plain Gaussian averages away.

Two costs come with the magic. First, speed: weights must be recomputed at every pixel, so the naive bilateral is $O(k^2)$ per pixel with no separability rescue (fast approximations exist, and OpenCV's implementation is heavily optimized, but it remains far costlier than GaussianBlur). Second, a signature artifact: on smooth gradients punctuated by noise, the range veto can quantize the gradient into terraces, the staircasing or "cartoon" look, which heavy-handed beautify filters exhibit in the wild.

3. The Guided Filter: Edge Awareness at Box-Filter Prices Advanced

The guided filter (He, Sun, and Tang, 2013) achieves a similar goal through entirely different machinery, and its cost analysis explains its industrial dominance. The idea: assume that within each small window $\omega_k$, the output $q$ is a linear function of a guidance image $G$ (often the input itself): $q_i = a_k G_i + b_k$ for $i \in \omega_k$. Solving the least-squares fit per window gives closed forms:

$$ a_k = \frac{\operatorname{cov}_{\omega_k}(G, I)}{\operatorname{var}_{\omega_k}(G) + \epsilon}, \qquad b_k = \bar{I}_k - a_k \bar{G}_k $$

Where the guide is flat (variance far below $\epsilon$), $a_k \to 0$ and the filter outputs the local mean: full smoothing. Where the guide has a strong edge (variance far above $\epsilon$), $a_k \to 1$ and the output tracks the guide: the edge passes through intact. The regularizer $\epsilon$ thus plays exactly the role $\sigma_r^2$ plays for the bilateral: the variance threshold separating "texture to smooth" from "structure to keep". Every quantity in these formulas is a local mean, computable with box filters, and Section 3.6 shows box filters run in $O(1)$ per pixel regardless of radius. The punchline: edge-preserving smoothing whose cost does not depend on the window size at all.

import cv2
import numpy as np

def box(x, r):
    """Mean filter with window (2r+1)x(2r+1); O(1) per pixel."""
    return cv2.boxFilter(x, -1, (2 * r + 1, 2 * r + 1))

def guided_filter(G, I, r=8, eps=0.01):
    """He et al. guided filter. G: guide, I: input, both float32 in [0,1]."""
    mean_G, mean_I = box(G, r), box(I, r)
    corr_GI = box(G * I, r)
    var_G   = box(G * G, r) - mean_G * mean_G        # local variance of guide
    cov_GI  = corr_GI - mean_G * mean_I              # local covariance
    a = cov_GI / (var_G + eps)                       # edge -> a~1, flat -> a~0
    b = mean_I - a * mean_G
    return box(a, r) * G + box(b, r)                 # average the models

img = cv2.imread("portrait.jpg", cv2.IMREAD_GRAYSCALE)
f = img.astype(np.float32) / 255.0
out = guided_filter(f, f, r=8, eps=0.02)             # self-guided smoothing
cv2.imwrite("portrait_guided.png", (out * 255).astype(np.uint8))
A complete guided filter in twelve lines of NumPy and box filters, following He et al.'s closed-form solution; the coefficient a rises toward 1 at strong guide edges, letting them pass while flat regions collapse to their local mean.
Library Shortcut: cv2.ximgproc.guidedFilter in Practice

The twelve-line implementation above becomes cv2.ximgproc.guidedFilter(guide, src, radius=8, eps=0.02), one line (requires the opencv-contrib-python package). The library version additionally handles color guidance images, which our scalar derivation cannot: a color guide distinguishes regions that differ in hue but match in gray level, noticeably improving results around, say, red-on-green text. It also implements the fast O(N) pipeline with proper border handling and offers cv2.ximgproc.fastGlobalSmootherFilter as a stronger sibling. Twelve-to-one on lines; far more on edge cases handled.

Key Insight: The Guide Need Not Be the Input

The guided filter's deepest feature hides in its signature: the image that supplies the edges (the guide) and the image being filtered can be different images. Filter a noisy no-flash photo guided by its sharp flash twin; smooth a depth map guided by its RGB image; feather a binary segmentation mask guided by the photograph it segments, the trick behind one-line alpha-matting cleanup. This cross-image structure transfer has no analogue among the filters of Section 3.2, and it is why the guided filter, not the bilateral, became the workhorse of computational photography pipelines and the cost-volume filtering used in the stereo matching of Chapter 13.

4. Choosing an Edge-Preserving Smoother Intermediate

Table 3.5.1 places this section's filters alongside their Section 3.2 ancestors. The pattern to internalize: as you move down the table, the filter consults the image content more and the fixed kernel less.

Table 3.5.1 The smoothing family, from content-blind to content-adaptive.
FilterWeights depend onEdge behaviorCost per pixelSignature artifact
Box / GaussianPosition onlyBlurs edges (provably)O(k), separableSoft everything
MedianRank orderPreserves steps, rounds cornersO(1) for uint8 (histogram trick)Posterized patches
BilateralPosition + intensity similarityPreserves strong edgesO(k²), approximations existStaircasing, cartoon look
GuidedLocal linear fit to a guidePreserves guide's edgesO(1), radius-independentFaint halos near thin structures

A second decision that arrives immediately in practice: what to do when the noise is too strong for one pass. Iterating the bilateral filter converges toward flat cartoon regions (sometimes desired for stylization, rarely for photography); the better-behaved route for serious denoising is the patch-based and learned methods of Chapter 7, which generalize "similar pixels" to "similar neighborhoods". A third option, anisotropic diffusion (Perona and Malik, 1990), reformulates edge-aware smoothing as a PDE that diffuses heat everywhere except across edges; it is historically important and conceptually elegant, but in production pipelines the one-shot guided filter almost always wins on predictability and speed.

Practical Example: Sixty Faces Per Second on a Phone

Who: The camera software team at a mid-tier Android phone manufacturer, building the "natural skin smoothing" feature for the front camera.

Situation: Marketing required real-time beautification in the viewfinder: every frame, 30 to 60 fps, on a power-constrained mobile SoC, with explicit instructions that the result must not look "plastic" (their previous generation's Gaussian-based smoother had been mocked in reviews for erasing eyebrows).

Problem: The obvious upgrade, a bilateral filter at the radius needed to smooth skin texture (about 15 pixels at sensor resolution), blew the frame budget by 4x on the target chip, and its staircasing produced flat, waxy cheeks under directional lighting.

Decision: Switch to a self-guided guided filter ($r = 16$, $\epsilon$ tuned per skin-tone luminance band), whose radius-independent cost fit the budget at full radius. Apply it only inside a face-skin mask, blend the smoothed base with 30 percent of the original detail layer back in (the base-plus-detail decomposition from Section 3.3), and leave eyes, brows, and hair untouched via the mask.

Result: 5.8 ms per frame on the SoC's GPU (versus 23 ms for the bilateral), no staircasing, and the detail re-blend preserved enough pore texture that the review-site complaint did not recur.

Lesson: Between two filters of similar quality, the one with radius-independent cost wins on mobile, and "smooth, then add a fraction of detail back" beats "smooth less" for natural results.

Research Frontier: Classical Filters Inside Learned Pipelines

Edge-aware filtering did not retire with deep learning; it moved into the middle of the networks. The guided filter has a differentiable formulation used as a trainable upsampling layer for segmentation and matting heads, where a low-resolution mask is refined by a full-resolution guide, the learned descendant of this section's mask-feathering trick. Bilateral-space processing returned at scale in 2024: "Bilateral Guided Radiance Field Processing" (SIGGRAPH 2024) applies 3D bilateral grids inside radiance-field reconstruction to disentangle per-view camera processing, connecting this section to the neural scene representations of Chapter 27. And monocular depth models such as Apple's Depth Pro (2024, arXiv:2410.02073) compete specifically on the edge fidelity of their depth maps, evaluated with boundary metrics that descend directly from the edge-preservation criterion this section formalized. The question "smooth here, but not across that boundary" turns out to be permanent.

Exercise 3.5.1: Limit Behavior Conceptual

Without coding, describe the output of the bilateral filter in each limit: (a) $\sigma_r \to \infty$ with moderate $\sigma_s$; (b) $\sigma_r \to 0$; (c) $\sigma_s \to \infty$ with moderate $\sigma_r$ (this one is subtle: what set of pixels does each output pixel now average over, and what classical algorithm from Chapter 2's histogram world does the result resemble?). For each limit, name the filter from this chapter that the behavior reduces to or approximates.

Exercise 3.5.2: Cross-Guided Feathering Coding

Take any photo and create a crude binary foreground mask (a hand-drawn polygon via cv2.fillPoly is fine). Feather it three ways: Gaussian blur of the mask, bilateral filtering of the mask, and guided filtering of the mask using the photo as the guide (your 12-line implementation or cv2.ximgproc.guidedFilter). Composite the foreground onto a contrasting background with each feathered mask as alpha. Report which method snaps the mask boundary to the true object contour and explain why only the cross-guided version can do so.

Exercise 3.5.3: The Epsilon Dial Analysis

Using the from-scratch guided filter, process a noisy image at $r = 8$ with $\epsilon \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ (image scaled to [0,1]). For each result, plot a horizontal slice through a strong edge and through a flat noisy region. Identify the $\epsilon$ at which edge preservation visibly fails, measure the local guide variance at that edge, and verify the rule from this section: edges survive while their local variance exceeds $\epsilon$ and dissolve once it does not.