"Forty-seven years of being blurred, sharpened, compressed, and denoised, and they finally let me retire. My one request: let the astronaut take it from here. She is used to noise."
A Graciously Retired Test Image
An image processing claim is only as good as the data it was tested on and the metric that scored it, and both are standardized for exactly that reason. Shared test images make results comparable across papers and decades; benchmark datasets make them comparable across methods; and metric toolboxes make the numbers themselves reproducible, provided you pin down the parameters that the defaults quietly choose for you. This section is the inventory of all three.
The previous section (Section 8.2) made pipelines fast; this one makes their evaluation trustworthy. Every experiment in Part I needed three ingredients we mostly took for granted: an input image, a degraded-versus-clean pair, and a number that says how well a method did. Here we treat each ingredient as infrastructure: where the standard images live, which datasets benchmark the restoration tasks of Chapter 7, and how to compute the quality metrics introduced in Chapter 1 so that a reviewer, or your own CI pipeline, gets the same number you did.
1. The Standard Test Images (Beginner)
A handful of photographs have been blurred and denoised millions of times because everyone agreed to use the same ones. The historical home is the USC-SIPI database (maintained since 1977), source of the mandrill, the peppers, and the house; the Kodak suite adds 24 film scans that remain the default smoke test for compression and denoising. For code, the most convenient gallery ships inside scikit-image itself: skimage.data bundles the classic cameraman, an astronaut portrait, coins on a textured background, text, and more, each loaded by a single function call. Code 8.3.1 inventories it.
```python
from skimage import data

for name in ["camera", "astronaut", "coins", "text", "checkerboard",
             "page", "moon", "shepp_logan_phantom"]:
    img = getattr(data, name)()  # each image is a function call
    print(f"{name:20s} {img.shape!s:16s} {img.dtype}")
```

Code 8.3.1: skimage.data downloads nothing for the core images and returns plain NumPy arrays, which makes it the fastest way to get a real photograph into a unit test.

```text
camera               (512, 512)       uint8
astronaut            (512, 512, 3)    uint8
coins                (303, 384)       uint8
text                 (172, 448)       uint8
checkerboard         (200, 200)       uint8
page                 (191, 384)       uint8
moon                 (512, 512)       uint8
shepp_logan_phantom  (400, 400)       float64
```
The most famous test image of all, the 1972 "Lena" crop, is conspicuously absent from modern libraries. After decades of debate about using a Playboy centerfold as the field's default benchmark, journals began declining it (Nature journals in 2018), and IEEE stopped accepting new papers containing it as of April 2024. The community's drop-in replacement in scikit-image is data.astronaut(): NASA's portrait of Eileen Collins, the first woman to command a Space Shuttle mission. The epigraph above is her predecessor's gracious handover.
Synthetic targets complement the photographs because their ground truth is mathematical. The Shepp-Logan phantom (printed by Code 8.3.1) has known ellipse boundaries, which is why CT reconstruction papers have used it since 1974. Checkerboards and ramps expose interpolation and quantization artifacts from Chapter 1. The most instructive of all is the zone plate, $I(r) = \tfrac{1}{2}\left(1 + \cos(\alpha r^2)\right)$, whose local frequency grows linearly with radius: a single image that sweeps every spatial frequency, so aliasing from poor resampling (Chapter 5) and the band behavior of filters (Chapter 4) appear as visible rings exactly where theory predicts. Code 8.3.2 builds one, then manufactures the (clean, noisy) pair every denoising experiment needs.
```python
import numpy as np

def zone_plate(n=512, alpha=0.4):
    """Radial chirp: local frequency grows linearly with radius."""
    y, x = np.mgrid[-n // 2:n // 2, -n // 2:n // 2].astype(np.float64)
    r2 = (x**2 + y**2) / n  # normalized squared radius
    return 0.5 * (1.0 + np.cos(alpha * r2))

clean = zone_plate()  # float64 in [0, 1], exact ground truth
rng = np.random.default_rng(7)
noisy = np.clip(clean + rng.normal(0.0, 25 / 255, clean.shape), 0.0, 1.0)
```
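One way to check where the chirp meets the sampling limit is to differentiate its phase. The sketch below assumes the phase convention used by zone_plate, $\varphi(r) = \alpha r^2 / n$, so the local angular frequency is $2 \alpha r / n$ radians per pixel and Nyquist sits at $\pi$ radians per pixel. The helper names local_frequency and nyquist_radius are illustrative, not part of any library.

```python
import math

def local_frequency(r, n=512, alpha=0.4):
    """Local angular frequency (rad/pixel) of cos(alpha * r**2 / n) at radius r."""
    return 2.0 * alpha * r / n

def nyquist_radius(n=512, alpha=0.4):
    """Radius (pixels) at which the local frequency reaches pi rad/pixel."""
    return math.pi * n / (2.0 * alpha)

n = 512
edge = n / 2                     # center to the middle of an edge
corner = edge * math.sqrt(2.0)   # center to a corner

for alpha in (0.4, math.pi):
    print(f"alpha={alpha:.3f}  "
          f"freq at edge={local_frequency(edge, n, alpha):.3f} rad/px  "
          f"freq at corner={local_frequency(corner, n, alpha):.3f} rad/px  "
          f"Nyquist radius={nyquist_radius(n, alpha):.1f} px")
```

Under this convention the default alpha = 0.4 stays well below Nyquist everywhere in a 512-pixel image, so it is a gentle chirp that is safe as denoising ground truth. Setting alpha near $\pi$ pushes the edge midpoints to Nyquist and the corners beyond it, which is the setting to reach for when the goal is to provoke aliasing on purpose.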
2. Benchmark Datasets for Restoration (Intermediate)
When a method must be compared against published results, single images give way to standard datasets. Table 8.3.1 lists the ones that dominate the literature for the tasks of Chapter 7; the names recur in virtually every paper you will read after this part of the book.
| Dataset | Task | Contents | Notes |
|---|---|---|---|
| BSDS500 / BSD68 | Denoising (synthetic) | 500 natural photos; 68-image gray test split | Noise added at sigma 15/25/50; the classic protocol |
| Set12 / Kodak24 | Denoising (synthetic) | 12 gray classics; 24 color film scans | Small, fast, in every results table |
| SIDD | Denoising (real) | ~30k smartphone pairs, 10 scenes, 5 phones | Real sensor noise; exposes methods tuned on Gaussian assumptions |
| DIV2K | Super-resolution | 800 train + 100 val images at 2K | The training set behind nearly every modern SR model |
| Set5 / Set14 / Urban100 | Super-resolution (eval) | 5, 14, and 100 test images | Urban100's repeated facades punish bad upscalers |
| GoPro | Deblurring | 3,214 blur/sharp pairs from 240 fps video | Blur synthesized by averaging real frames |
| LOL | Low-light enhancement | 485 train + 15 test paired exposures | The standard paired benchmark for the Chapter 7 enhancement task |
Two hygiene rules attach to every row. First, respect the published splits: DIV2K's validation images, for example, appear inside other datasets' test sets, and accidental train-test contamination has invalidated more than one results table. Second, record the exact degradation recipe (noise sigma and dtype scale, downsampling kernel for SR, JPEG quality) alongside your scores; "PSNR 31.2 on BSD68" is meaningless without "sigma = 25, grayscale, values in [0, 255]". A quick histogram audit of any downloaded dataset, with the tools from Chapter 2, catches the most common surprises: 16-bit files masquerading as 8-bit, and already-compressed "clean" references.
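The second hygiene rule and the range audit can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function names add_gaussian_noise and audit_range and the recipe keys are hypothetical, and a synthetic ramp stands in for a downloaded image. It does not attempt to detect prior JPEG compression.

```python
import json
import numpy as np

def add_gaussian_noise(clean_u8, sigma_255, seed):
    """Add Gaussian noise specified on the [0, 255] scale; return the recipe too."""
    rng = np.random.default_rng(seed)
    clean = clean_u8.astype(np.float64) / 255.0           # work in [0, 1]
    noisy = clean + rng.normal(0.0, sigma_255 / 255.0, clean.shape)
    noisy = np.clip(noisy, 0.0, 1.0)
    recipe = {
        "degradation": "additive Gaussian noise",
        "sigma": sigma_255,
        "sigma_scale": "[0, 255]",
        "clipped": True,
        "output_dtype": str(noisy.dtype),
        "output_range": "[0, 1]",
        "seed": seed,
    }
    return noisy, recipe

def audit_range(img):
    """Report dtype, value range, and distinct levels to catch bit-depth surprises."""
    is_int = np.issubdtype(img.dtype, np.integer)
    return {
        "dtype": str(img.dtype),
        "min": float(img.min()),
        "max": float(img.max()),
        "distinct_levels": int(np.unique(img).size),
        # Integer values above 255 mean the data is not 8-bit,
        # whatever the file name or loader suggested.
        "exceeds_8bit": bool(is_int and img.max() > 255),
    }

# Synthetic stand-in for a downloaded image: a horizontal 8-bit ramp.
clean_u8 = np.tile(np.arange(256, dtype=np.uint8), (64, 1))
noisy, recipe = add_gaussian_noise(clean_u8, sigma_255=25, seed=7)
ramp_16bit = clean_u8.astype(np.uint16) * 257             # same ramp stored as 16-bit

print(json.dumps(recipe, indent=2))
print(audit_range(clean_u8))
print(audit_range(ramp_16bit))
```

Writing the recipe dictionary next to every score keeps the sigma, its scale, and the dtype attached to the number they explain.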
3. Metric Tooling: PSNR, SSIM & the Default-Parameter Trap (Intermediate)
Chapter 1 defined the two workhorse full-reference metrics: peak signal-to-noise ratio,
$$\mathrm{PSNR} = 10 \log_{10} \frac{L^2}{\mathrm{MSE}},$$
with $L$ the dynamic range, and the structural similarity index
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
computed over local windows and averaged. The tooling lives in skimage.metrics, and it hides a reproducibility trap: the original SSIM publication specifies an 11×11 Gaussian-weighted window with $\sigma = 1.5$ and population statistics, while scikit-image defaults to a 7×7 uniform window with sample statistics. Both are legitimate SSIMs; they are just different numbers. Code 8.3.3 computes both and makes the second trap, data_range, explicit.
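To see why the dynamic range $L$ matters so much, the PSNR formula can be worked by hand. The sketch below is a plain NumPy implementation of the definition above, not the scikit-image function; the name psnr and the synthetic test pair are illustrative. It assumes unclipped Gaussian noise with sigma 25 on the [0, 255] scale, for which the expected MSE is $\sigma^2$ and the PSNR is therefore about $20 \log_{10}(255/25)$.

```python
import numpy as np

def psnr(reference, test, data_range):
    """PSNR in dB: 10 * log10(L**2 / MSE), with L = data_range."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0.0:
        return float("inf")             # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

rng = np.random.default_rng(0)
ref_img = rng.uniform(0.25, 0.75, size=(256, 256))                 # floats in [0, 1]
noisy_img = ref_img + rng.normal(0.0, 25 / 255, size=ref_img.shape)  # unclipped noise

right = psnr(ref_img, noisy_img, data_range=1.0)
wrong = psnr(ref_img, noisy_img, data_range=255.0)   # uint8 range applied to floats
predicted = 20.0 * np.log10(255 / 25)                # MSE equals sigma**2 in expectation

print(f"correct range : {right:.2f} dB")
print(f"wrong range   : {wrong:.2f} dB")
print(f"theory        : {predicted:.2f} dB")
```

The mismatched range shifts the score by exactly $20 \log_{10} 255 \approx 48.13$ dB regardless of image content, which is why a suspiciously excellent PSNR on float data is usually a range bug rather than a breakthrough.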
```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Floats demand an explicit range; guessing it is the classic silent bug.
psnr = peak_signal_noise_ratio(clean, noisy, data_range=1.0)
ssim_default = structural_similarity(clean, noisy, data_range=1.0)
ssim_paper = structural_similarity(      # the Wang et al. (2004) protocol
    clean, noisy, data_range=1.0,
    gaussian_weights=True, sigma=1.5,    # 11x11 Gaussian window
    use_sample_covariance=False)         # population (1/N) statistics

print(f"PSNR : {psnr:.2f} dB")
print(f"SSIM (defaults) : {ssim_default:.4f}")
print(f"SSIM (paper flags): {ssim_paper:.4f}")
```

Code 8.3.3: The flags passed for ssim_paper reproduce the original publication's implementation; papers and MATLAB references almost always mean this variant. Report which one you used.

```text
PSNR : 20.17 dB
SSIM (defaults) : 0.4664
SSIM (paper flags): 0.3923
```
The SSIM formula is unambiguous, but an SSIM number is not: it also depends on window shape, statistics convention, data range, channel handling, and any resizing done before scoring. The same is true of PSNR (range and dtype), LPIPS (which backbone network), and later FID (which feature extractor and sample count, as Chapter 37 will stress). Treat every reported metric as a function with parameters, state them when you publish, and pin them in code when you compare. The honest minimum is one sentence: library, version, and flags.
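One possible way to produce that one-sentence minimum automatically is to keep the flags in a single dictionary and format it together with the installed library version. The helper metric_provenance and the constant SSIM_PAPER_FLAGS below are hypothetical names for illustration; the version lookup uses the standard library's importlib.metadata and falls back to "unknown" when the package is not installed.

```python
from importlib import metadata

def metric_provenance(metric, library, flags, version=None):
    """Build the one-sentence record: library, version, and flags."""
    if version is None:
        try:
            version = metadata.version(library)   # installed distribution version
        except metadata.PackageNotFoundError:
            version = "unknown"
    flag_text = ", ".join(f"{k}={v!r}" for k, v in sorted(flags.items()))
    return f"{metric} computed with {library} {version} ({flag_text})."

# The flag set from Code 8.3.3, kept in one place so scoring and reporting agree.
SSIM_PAPER_FLAGS = {
    "data_range": 1.0,
    "gaussian_weights": True,
    "sigma": 1.5,
    "use_sample_covariance": False,
}

print(metric_provenance("SSIM", "scikit-image", SSIM_PAPER_FLAGS))
```

Because every key is a keyword argument of structural_similarity, the same dictionary can be unpacked into the scoring call with `**SSIM_PAPER_FLAGS`, so the sentence in the report cannot drift from the flags that produced the number.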
Who: An ML engineer at an imaging startup benchmarking a new denoiser for a customer pitch.
Situation: The pitch claimed the model beat the published BM3D baseline by 0.02 SSIM on BSD68 at sigma 25.
Problem: A prospective customer's team re-ran the comparison and got the opposite ordering. Each side accused the other of running a broken pipeline.
Decision: A joint debugging call traced the gap to tooling, not models: the startup scored with scikit-image defaults (uniform 7×7 window) on float images in [0, 1], while the baseline numbers came from the original Gaussian-window protocol on [0, 255] integers, and one side was also accidentally scoring after a resize. They re-scored everything with one pinned script: paper flags, explicit data_range, no resizing.
Result: The real margin was 0.004 SSIM, within noise across the 68 images. The pitch was rewritten around speed, where the model's advantage was genuine and reproducible.
Lesson: Before comparing two methods, prove you can reproduce the baseline's published number first. If you cannot, your harness, not the method, is the variable.
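A margin being "within noise" can be made concrete with a paired comparison over per-image scores. The sketch below is one simple heuristic, not the procedure used in the case study: it takes the mean per-image difference and compares it against a multiple of its standard error. The function names, the threshold k = 2, and the synthetic scores are all assumptions for illustration.

```python
import numpy as np

def paired_margin(scores_a, scores_b):
    """Mean per-image difference (A minus B) and its standard error."""
    diff = (np.asarray(scores_a, dtype=np.float64)
            - np.asarray(scores_b, dtype=np.float64))
    mean = float(diff.mean())
    sem = float(diff.std(ddof=1) / np.sqrt(diff.size))   # standard error of the mean
    return mean, sem

def within_noise(mean, sem, k=2.0):
    """Heuristic: the margin is indistinguishable from zero if |mean| < k * sem."""
    return abs(mean) < k * sem

# Synthetic per-image SSIM scores for 68 images (illustrative, not real results).
rng = np.random.default_rng(1)
baseline = rng.uniform(0.80, 0.95, size=68)
candidate = baseline + rng.normal(0.004, 0.03, size=68)

mean, sem = paired_margin(candidate, baseline)
print(f"margin = {mean:+.4f} +/- {sem:.4f} (SEM), within noise: {within_noise(mean, sem)}")
```

Pairing by image matters: the scores of two methods on the same image are strongly correlated, so comparing dataset averages alone hides how variable the difference actually is.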
4. Beyond Pixels: Perceptual and No-Reference Metrics (Advanced)
PSNR and SSIM compare pixels and local statistics, and they famously prefer over-smoothed results that humans dislike. LPIPS (Zhang et al., 2018) scores the distance between deep network features of the two images instead, and correlates far better with human judgments on restoration outputs. When no clean reference exists at all (real-world enhancement, in-the-wild quality monitoring), no-reference metrics such as NIQE, BRISQUE, and the learned MUSIQ and CLIP-IQA estimate quality from the image alone. Figure 8.3.1 organizes the families into the decision you actually face.
In practice you rarely implement any of these. The pyiqa toolbox wraps more than forty metrics, classical and learned, behind one interface, as Code 8.3.4 shows.
```python
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ["psnr", "ssim", "lpips", "niqe"]}

# Tensors in NCHW, float, [0, 1]; pyiqa also accepts file paths directly.
ref = torch.from_numpy(clean).float()[None, None]
deg = torch.from_numpy(noisy).float()[None, None]

for name, m in metrics.items():
    # No-reference metrics take only the degraded image.
    score = m(deg) if m.metric_mode == "NR" else m(deg, ref)
    print(f"{name:6s}: {float(score):8.4f} (lower is better: {m.lower_better})")
```

Code 8.3.4: Each pyiqa metric reports its direction via lower_better and its reference requirement via metric_mode, which prevents the two most common scripting mistakes in evaluation harnesses.

A faithful from-scratch SSIM is roughly 60 lines: Gaussian window construction, five local-statistics convolutions, the stability constants $C_1, C_2$, edge handling, and channel averaging. skimage.metrics.structural_similarity with the three protocol flags from Code 8.3.3 is one line, and the library additionally handles multichannel inputs, returns the full similarity map via full=True for spatial debugging, and stays numerically stable for constant patches where naive implementations divide by zero. 60 lines to 1, and the one line is the one reviewers can check.
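For readers who want to see what the library call replaces, the from-scratch route can be sketched in NumPy alone. This is a simplified illustration under stated assumptions, not the scikit-image implementation: it handles grayscale images only, uses the 11×11 Gaussian window with $\sigma = 1.5$ and the customary constants $K_1 = 0.01$, $K_2 = 0.03$, and sidesteps edge handling by scoring only windows that fit entirely inside the image. The function names are illustrative.

```python
import numpy as np

def gaussian_kernel_1d(size=11, sigma=1.5):
    """Normalized 1-D Gaussian; the 2-D window is its outer product."""
    x = np.arange(size, dtype=np.float64) - (size - 1) / 2.0
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def local_mean(img, k):
    """Separable Gaussian-weighted mean over fully contained windows ('valid')."""
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def ssim(x, y, data_range, k1=0.01, k2=0.03):
    """Mean SSIM of two grayscale images with population (weighted) statistics."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    k = gaussian_kernel_1d()
    c1 = (k1 * data_range) ** 2          # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2          # stabilizes the contrast/structure term

    # Five local statistics: two means and three second moments.
    mu_x, mu_y = local_mean(x, k), local_mean(y, k)
    xx = local_mean(x * x, k)
    yy = local_mean(y * y, k)
    xy = local_mean(x * y, k)

    var_x = xx - mu_x ** 2
    var_y = yy - mu_y ** 2
    cov_xy = xy - mu_x * mu_y

    num = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(np.mean(num / den))

rng = np.random.default_rng(3)
a = rng.uniform(0.0, 1.0, size=(64, 64))
b = np.clip(a + rng.normal(0.0, 0.1, size=a.shape), 0.0, 1.0)
print(f"SSIM(a, a) = {ssim(a, a, data_range=1.0):.4f}")
print(f"SSIM(a, b) = {ssim(a, b, data_range=1.0):.4f}")
```

Even this stripped-down version leaves out channel handling, the similarity map, and a documented border policy, which is the practical argument for reporting the library call and its flags instead of a private implementation.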
The newest no-reference metrics are language models. Q-Bench (ICLR 2024) benchmarked how well multimodal LLMs perceive low-level quality attributes, and Q-Align (ICML 2024) turned that ability into a state-of-the-art scorer by teaching an LLM to emit discrete quality levels the way human raters do. DepictQA (ECCV 2024) goes further and outputs reasoned, comparative quality descriptions rather than a single scalar, and CLIP-IQA showed that even a frozen CLIP model scores quality zero-shot from prompt pairs like "good photo" versus "bad photo". On the full-reference side, TOPIQ (IEEE TIP 2024) propagates semantic attention down to local distortions. Most of these ship in pyiqa already, so the practical frontier is one create_metric string away; the open question, pursued across 2025-2026 work, is keeping learned judges honest when the images being judged come from generative models, a problem Chapter 37 takes up in earnest.
For each scenario, choose a metric family from Figure 8.3.1 and one concrete metric, and justify the choice: (a) regression-testing a JPEG encoder against golden outputs; (b) ranking three denoisers for a medical imaging customer who fears hallucinated detail; (c) monitoring the quality of user-uploaded photos in production, where no reference exists; (d) deciding whether a synthetic training set "looks like" the real one. Which scenario is the trap where the obvious metric rewards the wrong behavior?
Build a small benchmark: download the Kodak suite, add Gaussian noise at sigma 15, 25, and 50 (on a documented scale), and score cv2.fastNlMeansDenoisingColored and the bilateral filter from Chapter 7 with PSNR, both SSIM protocols from Code 8.3.3, and LPIPS. Emit one CSV row per (image, sigma, method, metric, value) and a header comment recording library versions and all metric flags. Verify a colleague (or a fresh environment) reproduces your numbers to four decimals.
Construct two degradations of the same Kodak image with equal PSNR (tune their strengths until PSNRs match within 0.05 dB): a Gaussian blur and additive noise. Score both with SSIM and LPIPS. Which degradation does each metric prefer, and what does that reveal about what each one measures? Relate your finding to the over-smoothing bias discussed in Section 4 and to the frequency content removed by each degradation (Chapter 4).