Section 3.5: Edge-Preserving Smoothing: Bilateral & Guided Filters

"I average with pixels from my side of the edge. The pixels across the boundary seem lovely, truly, but we are 80 gray levels apart and it would never work."
A Diplomatically Selective Bilateral Filter

Big Picture

Edge-preserving filters resolve this chapter's central tension, noise reduction versus edge destruction, by letting each pixel choose its averaging partners based on similarity, not just proximity. The bilateral filter does this with a second Gaussian over intensity differences; the guided filter does it with a local linear model that runs in constant time per pixel regardless of window size. Between them they power the skin smoothing, night modes, and tone mapping inside essentially every smartphone camera, and they mark this chapter's graduation from shift-invariant convolution to content-adaptive filtering, the conceptual road that leads to Chapter 7's patch-based denoisers and beyond.

The smoothers of Section 3.2 obeyed the shift-invariance contract from Section 3.1: same weights everywhere, no exceptions. This section breaks the contract deliberately, and it is worth being precise about why mere parameter tuning could never be enough.

1. Why Every Linear Filter Must Blur Edges Intermediate

Suppose a filter is linear and shift-invariant, applies positive weights summing to one, and is asked to process a perfect step edge: a row of dark pixels at value 50 meeting bright pixels at value 150. Any window that straddles the boundary contains pixels from both populations, and a positive weighted average of values from both sides must land strictly between them. The output therefore contains intermediate values, 80, 110, 130, where the input had none: the step has become a ramp. No choice of positive weights avoids this; wider noise-fighting windows only widen the ramp. This is not an implementation defect but a small theorem: linear smoothing and edge preservation are mathematically incompatible.

The escape requires weights that depend on image content, so that a window straddling an edge averages only the pixels on the center's own side. Content-dependent weights make the filter nonlinear (we saw a first preview with the median in Section 3.2), and they cost us the tidy LSI theory of Section 3.1, including the single-kernel representation and the frequency-domain analysis of Chapter 4. What we buy with that sacrifice is the subject of this section.

2. The Bilateral Filter: Two Gaussians, One Veto Intermediate

The bilateral filter (Tomasi and Manduchi, 1998) multiplies two notions of closeness. A spatial Gaussian $G_{\sigma_s}$ weights neighbors by distance, exactly as in Section 3.2. A range Gaussian $G_{\sigma_r}$ weights them by how similar their intensity is to the center pixel's:

$$ B(x) = \frac{1}{W(x)} \sum_{y \in \Omega} G_{\sigma_s}\!\big(\|x - y\|\big)\; G_{\sigma_r}\!\big(|I(x) - I(y)|\big)\; I(y) $$

where $W(x)$ sums the combined weights so they normalize to one at each pixel. The range term is the veto: a neighbor whose intensity differs from the center by much more than $\sigma_r$ contributes essentially nothing, however close it sits spatially. Figure 3.5.1 shows the consequence at an edge: the effective kernel is a Gaussian with one side amputated, so each pixel averages only within its own region and the step survives smoothing untouched. The illustration below personifies that veto: blend freely with neighbors on your own side of the boundary, but politely decline anyone whose brightness is too different.

A courteous cartoon diplomat on the bright side of a crisp fence blends paint with similar light-colored neighbors while politely declining a friendly dark-colored neighbor reaching across the boundary, leaving the fence sharp. The scene illustrates the bilateral filter's range veto, which averages a pixel only with neighbors of similar intensity so edges survive smoothing untouched. — The bilateral filter averages each pixel only with neighbors on its own side of an edge, politely vetoing anyone too different in brightness; that single veto is how it smooths noise while leaving boundaries crisp.

Figure 3.5.1 Bilateral weighting at an edge. The spatial Gaussian (blue) would average across the boundary; multiplying by the range Gaussian (green result) vetoes dissimilar pixels, amputating the kernel at the edge so smoothing never mixes the two populations.

Parameter intuition: $\sigma_s$ plays the familiar Gaussian role (how far to look), while $\sigma_r$ defines what counts as "the same region" in gray levels. Set $\sigma_r$ well below the contrast of edges you must preserve and above the noise amplitude you must remove; the filter then lives in the comfortable gap between the two. As $\sigma_r \to \infty$ the veto never fires and the bilateral degenerates into the plain Gaussian; as $\sigma_r \to 0$ it refuses to average at all.

# Contrast the bilateral filter against a plain Gaussian at matched spatial
# scale. Laplacian variance serves as an edge-energy proxy: the bilateral
# keeps far more of it, meaning edges survive while flat-region noise is gone.
import cv2

img = cv2.imread("portrait.jpg")

# d=9: window diameter. sigmaColor: range veto (gray levels).
# sigmaSpace: spatial reach (pixels).
smooth = cv2.bilateralFilter(img, d=9, sigmaColor=40, sigmaSpace=5)

# Compare against a plain Gaussian of similar spatial reach:
gauss = cv2.GaussianBlur(img, (9, 9), 2)

edges_b = cv2.Laplacian(cv2.cvtColor(smooth, cv2.COLOR_BGR2GRAY), cv2.CV_32F).var()
edges_g = cv2.Laplacian(cv2.cvtColor(gauss,  cv2.COLOR_BGR2GRAY), cv2.CV_32F).var()
print(f"edge energy: bilateral {edges_b:.0f} vs gaussian {edges_g:.0f}")
# Representative output:
# edge energy: bilateral 142 vs gaussian 61
# The bilateral result keeps over twice the Laplacian energy:
# noise is gone from flat regions but the edges stayed sharp.

Code Fragment 1: Bilateral versus Gaussian smoothing at matched spatial scale, scored by Laplacian variance as an edge-energy proxy: cv2.bilateralFilter retains 142 units of edge energy against the Gaussian's 61, so the edges the plain Gaussian averages away survive.

Two costs come with the magic. First, speed: weights must be recomputed at every pixel, so the naive bilateral is $O(k^2)$ per pixel with no separability rescue (fast approximations exist, and OpenCV's implementation is heavily optimized, but it remains far costlier than GaussianBlur). Second, a signature artifact: on smooth gradients punctuated by noise, the range veto can quantize the gradient into terraces, the staircasing or "cartoon" look, which heavy-handed beautify filters exhibit in the wild.

Fun Fact

The bilateral filter's "cartoon" failure mode quietly launched an aesthetic. Stack a heavy bilateral pass (flat color regions) under the dark edge map from a Laplacian of Section 3.4 and you get the comic-book cartoonizer that OpenCV ships as cv2.stylization and that every "turn my selfie into a Pixar character" app reinvented for a decade. The terracing the photography teams fought to suppress is the exact thing the stylization teams pay for. One filter's bug is another product's headline feature.

3. The Guided Filter: Edge Awareness at Box-Filter Prices Advanced

The guided filter (He, Sun, and Tang, 2013) achieves a similar goal through entirely different machinery, and its cost analysis explains its industrial dominance. The idea: assume that within each small window $\omega_k$, the output $q$ is a linear function of a guidance image $G$ (often the input itself): $q_i = a_k G_i + b_k$ for $i \in \omega_k$. Solving the least-squares fit per window gives closed forms:

$$ a_k = \frac{\operatorname{cov}_{\omega_k}(G, I)}{\operatorname{var}_{\omega_k}(G) + \epsilon}, \qquad b_k = \bar{I}_k - a_k \bar{G}_k $$

Two statistics drive these formulas. $\operatorname{var}_{\omega_k}(G)$ is the variance of the guide inside the window (the mean squared deviation from its average, a measure of how much the guide changes there), and $\operatorname{cov}_{\omega_k}(G, I)$ is the covariance between guide and input (how much they vary together), both computed over the same window $\omega_k$.

With those two statistics in hand, the behavior of $a_k$ is easy to read off in its two extremes. Where the guide is flat (variance far below $\epsilon$), $a_k \to 0$ and the filter outputs the local mean: full smoothing. Where the guide has a strong edge (variance far above $\epsilon$), $a_k \to 1$ and the output tracks the guide: the edge passes through intact. The regularizer $\epsilon$ thus plays exactly the role $\sigma_r^2$ plays for the bilateral: the variance threshold separating "texture to smooth" from "structure to keep".

The code below handles one subtlety. Every pixel belongs to many overlapping windows, each with its own fitted $(a_k, b_k)$, so the final output averages those overlapping models; that is why the code box-filters $a$ and $b$ before forming $a G + b$. Every quantity in these formulas is a local mean, computable with box filters, and Section 3.6 shows box filters run in $O(1)$ per pixel regardless of radius. The punchline: edge-preserving smoothing whose cost does not depend on the window size at all.

# The guided filter from scratch via box filters only: per-window it fits a
# linear model q = a*G + b to the guide, where a -> 1 on edges (pass them
# through) and a -> 0 on flat regions (collapse to the local mean).
import cv2
import numpy as np

def box(x, r):
    """Mean filter with window (2r+1)x(2r+1); O(1) per pixel."""
    return cv2.boxFilter(x, -1, (2 * r + 1, 2 * r + 1))

def guided_filter(G, I, r=8, eps=0.01):
    """He et al. guided filter. G: guide, I: input, both float32 in [0,1]."""
    mean_G, mean_I = box(G, r), box(I, r)
    corr_GI = box(G * I, r)
    var_G   = box(G * G, r) - mean_G * mean_G        # local variance of guide
    cov_GI  = corr_GI - mean_G * mean_I              # local covariance
    a = cov_GI / (var_G + eps)                       # edge -> a~1, flat -> a~0
    b = mean_I - a * mean_G
    return box(a, r) * G + box(b, r)                 # average the models

img = cv2.imread("portrait.jpg", cv2.IMREAD_GRAYSCALE)
f = img.astype(np.float32) / 255.0
out = guided_filter(f, f, r=8, eps=0.02)             # self-guided smoothing
cv2.imwrite("portrait_guided.png", (out * 255).astype(np.uint8))

Code Fragment 2: A complete guided filter in twelve lines of NumPy and box filters, following He et al.'s closed-form solution; the coefficient a rises toward 1 at strong guide edges, letting them pass while flat regions collapse to their local mean.

Library Shortcut: cv2.ximgproc.guidedFilter in Practice

The twelve-line implementation above becomes cv2.ximgproc.guidedFilter(guide, src, radius=8, eps=0.02), one line (requires the opencv-contrib-python package). The library version additionally handles color guidance images, which our scalar derivation cannot: a color guide distinguishes regions that differ in hue but match in gray level, noticeably improving results around, say, red-on-green text. It also implements the fast O(N) pipeline with proper border handling and offers cv2.ximgproc.fastGlobalSmootherFilter as a stronger sibling. Twelve-to-one on lines; far more on edge cases handled.

Key Insight: The Guide Need Not Be the Input

The guided filter's deepest feature hides in its signature: the image that supplies the edges (the guide) and the image being filtered can be different images. Filter a noisy no-flash photo guided by its sharp flash twin; smooth a depth map guided by its red-green-blue (RGB) image; feather a binary segmentation mask guided by the photograph it segments, the trick behind one-line alpha-matting cleanup. This cross-image structure transfer has no analogue among the filters of Section 3.2, and it is why the guided filter, not the bilateral, became the workhorse of computational photography pipelines and the cost-volume filtering used in the stereo matching of Chapter 13.

You Could Build This: A Flash / No-Flash Denoiser Intermediate, 45 to 60 min

The cross-image trick above is the heart of a small but genuinely impressive computational-photography demo. Capture (or download) a pair of the same scene: one dim, noisy no-flash shot with the true colors and ambient mood, and one bright, harsh flash shot that is clean but ugly. Feed the no-flash image as the input and the flash image as the guide to the twelve-line guided_filter from this section. Because the guide carries crisp, low-noise edges, the filter denoises the no-flash photo while transferring the flash frame's structure onto it, keeping the pleasant ambient lighting and discarding the grain. It is the classical core of a technique that shipped in real camera pipelines, you can build it tonight with OpenCV and NumPy alone, and the side-by-side of noisy input, flash guide, and clean output makes a striking portfolio piece.

4. Choosing an Edge-Preserving Smoother Intermediate

Table 3.5.1 places this section's filters alongside their Section 3.2 ancestors. The pattern to internalize: as you move down the table, the filter consults the image content more and the fixed kernel less.

**Table 3.5.1** The smoothing family, from content-blind to content-adaptive.
Filter	Weights depend on	Edge behavior	Cost per pixel	Signature artifact
Box / Gaussian	Position only	Blurs edges (provably)	O(k), separable	Soft everything
Median	Rank order	Preserves steps, rounds corners	O(1) for uint8 (histogram trick)	Posterized patches
Bilateral	Position + intensity similarity	Preserves strong edges	O(k²), approximations exist	Staircasing, cartoon look
Guided	Local linear fit to a guide	Preserves guide's edges	O(1), radius-independent	Faint halos near thin structures

A second decision that arrives immediately in practice: what to do when the noise is too strong for one pass. Iterating the bilateral filter converges toward flat cartoon regions (sometimes desired for stylization, rarely for photography); the better-behaved route for serious denoising is the patch-based and learned methods of Chapter 7, which generalize "similar pixels" to "similar neighborhoods". A third option, anisotropic diffusion (Perona and Malik, 1990), reformulates edge-aware smoothing as a partial differential equation (PDE) that diffuses heat everywhere except across edges; it is historically important and conceptually elegant, but in production pipelines the one-shot guided filter almost always wins on predictability and speed.

Common Misconception: An Edge-Preserving Filter Is Just a Strictly Better Gaussian

Because the bilateral and guided filters keep edges that the Gaussian smears, it is tempting to treat them as a free upgrade and reach for them everywhere a blur was used. Two facts forbid this. First, they are nonlinear and content-dependent: they are not convolutions, have no single kernel, and obey none of the shift-invariant theory of Section 3.1, so they are the wrong pre-filter for the derivative operators of Section 3.4, which assume a clean linear smoothing whose effect on the gradient is predictable. A Gaussian pre-blur sets a known analysis scale; a bilateral pre-blur leaves edges untouched precisely where the derivative is about to measure them, defeating the purpose. Second, edge preservation is not free of artifacts: the bilateral's range veto terraces smooth gradients into the waxy "cartoon" staircase, so on skies, skin, and soft shading it can look worse than the honest Gaussian it replaced. Edge-preserving filters are the right tool when the smoothed image is itself the product and its boundaries must survive, not a universal substitute for the blur.

Practical Example: Sixty Faces Per Second on a Phone

Who: The camera software team at a mid-tier Android phone manufacturer, building the "natural skin smoothing" feature for the front camera.

Situation: Marketing required real-time beautification in the viewfinder: every frame, 30 to 60 frames per second (fps), on a power-constrained mobile system-on-chip (SoC), with explicit instructions that the result must not look "plastic" (their previous generation's Gaussian-based smoother had been mocked in reviews for erasing eyebrows).

Problem: The obvious upgrade, a bilateral filter at the radius needed to smooth skin texture (about 15 pixels at sensor resolution), blew the frame budget by 4x on the target chip, and its staircasing produced flat, waxy cheeks under directional lighting.

Decision: Switch to a self-guided guided filter ($r = 16$, $\epsilon$ tuned per skin-tone luminance band), whose radius-independent cost fit the budget at full radius. Apply it only inside a face-skin mask, blend the smoothed base with 30 percent of the original detail layer back in (the base-plus-detail decomposition from Section 3.3), and leave eyes, brows, and hair untouched via the mask.

Result: 5.8 ms per frame on the SoC's GPU (versus 23 ms for the bilateral), no staircasing, and the detail re-blend preserved enough pore texture that the review-site complaint did not recur.

Lesson: Between two filters of similar quality, the one with radius-independent cost wins on mobile, and "smooth, then add a fraction of detail back" beats "smooth less" for natural results.

Research Frontier: Classical Filters Inside Learned Pipelines

Edge-aware filtering did not retire with deep learning; it moved into the middle of the networks. The guided filter has a differentiable formulation used as a trainable upsampling layer for segmentation and matting heads, where a low-resolution mask is refined by a full-resolution guide, the learned descendant of this section's mask-feathering trick. Bilateral-space processing returned at scale in 2024: "Bilateral Guided Radiance Field Processing" (SIGGRAPH 2024) applies 3D bilateral grids inside radiance-field reconstruction to disentangle per-view camera processing, connecting this section to the neural scene representations of Chapter 27. And monocular depth models such as Apple's Depth Pro (2024, arXiv:2410.02073) compete specifically on the edge fidelity of their depth maps, evaluated with boundary metrics that descend directly from the edge-preservation criterion this section formalized. The question "smooth here, but not across that boundary" turns out to be permanent.

The six filter families of this chapter are now complete, from the plain box average to the content-adaptive guided filter. What remains is the engineering that decides whether any of them can ship: what a kernel does where it runs off the edge of a real image, and how the arithmetic gets fast enough for a frame budget. Section 3.6 settles both questions, and in doing so explains the box-filter constant-time trick this section leaned on and the separability that already makes the Gaussian cheap.

Exercise 3.5.1: Limit Behavior Conceptual

Without coding, describe the output of the bilateral filter in each limit: (a) $\sigma_r \to \infty$ with moderate $\sigma_s$; (b) $\sigma_r \to 0$; (c) $\sigma_s \to \infty$ with moderate $\sigma_r$ (this one is subtle: what set of pixels does each output pixel now average over, and what classical algorithm from Chapter 2's histogram world does the result resemble?). For each limit, name the filter from this chapter that the behavior reduces to or approximates.

Exercise 3.5.2: Cross-Guided Feathering Coding

Take any photo and create a crude binary foreground mask (a hand-drawn polygon via cv2.fillPoly is fine). Feather it three ways: Gaussian blur of the mask, bilateral filtering of the mask, and guided filtering of the mask using the photo as the guide (your 12-line implementation or cv2.ximgproc.guidedFilter). Composite the foreground onto a contrasting background with each feathered mask as alpha. Report which method snaps the mask boundary to the true object contour and explain why only the cross-guided version can do so.

Exercise 3.5.3: The Epsilon Dial Analysis

Using the from-scratch guided filter, process a noisy image at $r = 8$ with $\epsilon \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ (image scaled to [0,1]). For each result, plot a horizontal slice through a strong edge and through a flat noisy region. Identify the $\epsilon$ at which edge preservation visibly fails, measure the local guide variance at that edge, and verify the rule from this section: edges survive while their local variance exceeds $\epsilon$ and dissolve once it does not.