Part I: Image Processing
Chapter 8: Tools of the Trade: The Image Processing Stack

Test Images, Datasets & Quality Metrics Tooling

"Forty-seven years of being blurred, sharpened, compressed, and denoised, and they finally let me retire. My one request: let the astronaut take it from here. She is used to noise."

A Graciously Retired Test Image
Big Picture

An image processing claim is only as good as the data it was tested on and the metric that scored it, and both are standardized for exactly that reason. Shared test images make results comparable across papers and decades; benchmark datasets make them comparable across methods; and metric toolboxes make the numbers themselves reproducible, provided you pin down the parameters that the defaults quietly choose for you. This section is the inventory of all three.

The previous section (Section 8.2) made pipelines fast; this one makes their evaluation trustworthy. Every experiment in Part I needed three ingredients we mostly took for granted: an input image, a degraded-versus-clean pair, and a number that says how well a method did. Here we treat each ingredient as infrastructure: where the standard images live, which datasets benchmark the restoration tasks of Chapter 7, and how to compute the quality metrics introduced in Chapter 1 so that a reviewer, or your own CI pipeline, gets the same number you did.

1. The Standard Test Images Beginner

A handful of photographs have been blurred and denoised millions of times because everyone agreed to use the same ones. The historical home is the USC-SIPI database (maintained since 1977), source of the mandrill, the peppers, and the house; the Kodak suite adds 24 film scans that remain the default smoke test for compression and denoising. For code, the most convenient gallery ships inside scikit-image itself: skimage.data bundles the classic cameraman, an astronaut portrait, coins on a textured background, text, and more, each loaded with a single function call. Code 8.3.1 inventories it.

from skimage import data

for name in ["camera", "astronaut", "coins", "text", "checkerboard",
             "page", "moon", "shepp_logan_phantom"]:
    img = getattr(data, name)()          # each image is a function call
    print(f"{name:20s} {img.shape!s:16s} {img.dtype}")
Code 8.3.1: The built-in test gallery: skimage.data downloads nothing for the core images and returns plain NumPy arrays, which makes it the fastest way to get a real photograph into a unit test.
camera               (512, 512)       uint8
astronaut            (512, 512, 3)    uint8
coins                (303, 384)       uint8
text                 (172, 448)       uint8
checkerboard         (200, 200)       uint8
page                 (191, 384)       uint8
moon                 (512, 512)       uint8
shepp_logan_phantom  (400, 400)       float64
Output 8.3.1a: Eight ready-made test subjects covering portraits, textures, documents, low contrast, and a synthetic CT phantom; note the mixed dtypes, exactly the trap Section 8.1 warned about.
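The mixed dtypes in Output 8.3.1a start to matter as soon as two images meet in one metric call. The following is a minimal NumPy-only sketch of one way to bring them onto a common [0, 1] float scale. The helper name to_unit_float is hypothetical, and it assumes that unsigned-integer images span the full range of their dtype while float images are already in [0, 1]. In practice scikit-image's own conversion utilities (for example skimage.util.img_as_float) would be the usual choice.

```python
import numpy as np

def to_unit_float(img):
    """Return a float64 copy of img scaled to [0, 1].

    Assumption: unsigned-integer images use the full range of their dtype
    (uint8 -> 255, uint16 -> 65535); float images are taken to be in [0, 1].
    """
    img = np.asarray(img)
    if np.issubdtype(img.dtype, np.unsignedinteger):
        return img.astype(np.float64) / np.iinfo(img.dtype).max
    if np.issubdtype(img.dtype, np.floating):
        if img.size and (img.min() < 0.0 or img.max() > 1.0):
            raise ValueError("float image outside [0, 1]; rescale it explicitly")
        return img.astype(np.float64)
    raise TypeError(f"unsupported dtype: {img.dtype}")

# Synthetic stand-ins for a uint8 photograph and a float64 phantom.
photo = np.array([[0, 128, 255]], dtype=np.uint8)
phantom = np.array([[0.0, 0.5, 1.0]], dtype=np.float64)

print(to_unit_float(photo))
print(to_unit_float(phantom))
```

Raising on out-of-range floats instead of silently rescaling is a deliberate choice here: a float image in [0, 255] is usually a sign that an earlier conversion went wrong, and it is better to hear about it before a metric does.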
Fun Fact

The most famous test image of all, the 1972 "Lena" crop, is conspicuously absent from modern libraries. After decades of debate about using a Playboy centerfold as the field's default benchmark, journals began declining it (Nature journals in 2018), and IEEE stopped accepting new papers containing it as of April 2024. The community's drop-in replacement in scikit-image is data.astronaut(): NASA's portrait of Eileen Collins, the first woman to command a Space Shuttle mission. The epigraph above is her predecessor's gracious handover.

Synthetic targets complement the photographs because their ground truth is mathematical. The Shepp-Logan phantom (printed by Code 8.3.1) has known ellipse boundaries, which is why CT reconstruction papers have used it since 1974. Checkerboards and ramps expose interpolation and quantization artifacts from Chapter 1. The most instructive of all is the zone plate, $I(r) = \tfrac{1}{2}\left(1 + \cos(\alpha r^2)\right)$, whose local frequency grows linearly with radius: a single image that sweeps a continuous band of spatial frequencies (all the way to Nyquist when $\alpha$ is chosen large enough), so aliasing from poor resampling (Chapter 5) and the band behavior of filters (Chapter 4) appear as visible rings exactly where theory predicts. Code 8.3.2 builds one, then manufactures the (clean, noisy) pair every denoising experiment needs.

import numpy as np

def zone_plate(n=512, alpha=0.4):
    """Radial chirp: local frequency grows linearly with radius."""
    y, x = np.mgrid[-n // 2:n // 2, -n // 2:n // 2].astype(np.float64)
    r2 = (x**2 + y**2) / n                # normalized squared radius
    return 0.5 * (1.0 + np.cos(alpha * r2))

clean = zone_plate()                      # float64 in [0, 1], exact ground truth
rng = np.random.default_rng(7)
noisy = np.clip(clean + rng.normal(0.0, 25 / 255, clean.shape), 0.0, 1.0)
Code 8.3.2: A zone plate plus calibrated Gaussian noise (the sigma = 25 convention, expressed on a [0, 1] scale) yields a synthetic restoration benchmark with perfect ground truth: the exact setup behind the denoising numbers of Chapter 7.
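How much of the frequency band the chirp covers depends on alpha. With the normalization used in Code 8.3.2 the phase is alpha * (x**2 + y**2) / n, so its derivative along the x axis is 2 * alpha * x / n radians per pixel. The sketch below evaluates that expression at the midpoint of an image edge and compares it with the Nyquist limit of 0.5 cycles per pixel. The function name zone_plate_edge_frequency is hypothetical and introduced only for this illustration.

```python
import numpy as np

def zone_plate_edge_frequency(n=512, alpha=0.4):
    """Local frequency (cycles/pixel) at the midpoint of an image edge.

    With phase = alpha * (x**2 + y**2) / n, the derivative along x is
    2 * alpha * x / n radians per pixel, which equals alpha at x = n / 2.
    """
    rad_per_px = 2.0 * alpha * (n / 2) / n
    return rad_per_px / (2.0 * np.pi)

NYQUIST = 0.5  # cycles per pixel

for alpha in (0.4, np.pi):
    f_edge = zone_plate_edge_frequency(alpha=alpha)
    print(f"alpha={alpha:.3f}: edge frequency {f_edge:.4f} cycles/px "
          f"({f_edge / NYQUIST:.0%} of Nyquist)")
```

Under these assumptions the default alpha = 0.4 stays well below Nyquist at the edge midpoints, which keeps the clean image free of aliasing, while alpha = pi reaches Nyquist there. The corners lie a factor of sqrt(2) farther from the center, so they reach correspondingly higher frequencies.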

2. Benchmark Datasets for Restoration Intermediate

When a method must be compared against published results, single images give way to standard datasets. Table 8.3.1 lists the ones that dominate the literature for the tasks of Chapter 7; the names recur in virtually every paper you will read after this part of the book.

Table 8.3.1: Standard benchmark datasets for Part I restoration tasks.
| Dataset | Task | Contents | Notes |
|---|---|---|---|
| BSDS500 / BSD68 | Denoising (synthetic) | 500 natural photos; 68-image gray test split | Noise added at sigma 15/25/50; the classic protocol |
| Set12 / Kodak24 | Denoising (synthetic) | 12 gray classics; 24 color film scans | Small, fast, in every results table |
| SIDD | Denoising (real) | ~30k smartphone pairs, 10 scenes, 5 phones | Real sensor noise; exposes methods tuned on Gaussian assumptions |
| DIV2K | Super-resolution | 800 train + 100 val images at 2K | The training set behind nearly every modern SR model |
| Set5 / Set14 / Urban100 | Super-resolution (eval) | 5, 14, and 100 test images | Urban100's repeated facades punish bad upscalers |
| GoPro | Deblurring | 3,214 blur/sharp pairs from 240 fps video | Blur synthesized by averaging real frames |
| LOL | Low-light enhancement | 485 train + 15 test paired exposures | The standard paired benchmark for the Chapter 7 enhancement task |

Two hygiene rules attach to every row. First, respect the published splits: DIV2K's validation images, for example, appear inside other datasets' test sets, and accidental train-test contamination has invalidated more than one results table. Second, record the exact degradation recipe (noise sigma and dtype scale, downsampling kernel for SR, JPEG quality) alongside your scores; "PSNR 31.2 on BSD68" is meaningless without "sigma = 25, grayscale, values in [0, 255]". A quick histogram audit of any downloaded dataset, with the tools from Chapter 2, catches the most common surprises: 16-bit files masquerading as 8-bit, and already-compressed "clean" references.
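The second rule is easy to automate. The following sketch, which uses only NumPy and the standard library, shows one possible shape for it: a noise function that returns its own recipe next to the degraded image, and a cheap audit that reports dtype, value range, and the number of distinct levels. The names add_gaussian_noise and audit, the recipe fields, and the ramp image are all illustrative assumptions, not part of any dataset's official tooling.

```python
import json
import numpy as np

def add_gaussian_noise(clean_u8, sigma_255, seed):
    """Add Gaussian noise to a uint8 image; return the result and its recipe."""
    rng = np.random.default_rng(seed)
    noisy = clean_u8.astype(np.float64) + rng.normal(0.0, sigma_255, clean_u8.shape)
    noisy_u8 = np.clip(np.rint(noisy), 0, 255).astype(np.uint8)
    recipe = {
        "degradation": "additive_gaussian",
        "sigma": sigma_255,
        "scale": "[0, 255]",
        "seed": seed,
        "clipped": True,
        "quantized_to": "uint8",
    }
    return noisy_u8, recipe

def audit(img):
    """Cheap sanity report: dtype, value range, and number of distinct levels."""
    img = np.asarray(img)
    return {
        "dtype": str(img.dtype),
        "min": float(img.min()),
        "max": float(img.max()),
        "distinct_levels": int(np.unique(img).size),
    }

# Synthetic stand-in for a dataset image (a horizontal ramp), not real data.
ramp = np.tile(np.arange(256, dtype=np.uint8), (64, 1))
ramp_noisy, recipe = add_gaussian_noise(ramp, sigma_255=25, seed=0)

print(json.dumps(recipe))     # store this next to every score
print(audit(ramp))

# A 16-bit image must not be scored as if its range were [0, 255].
suspect = ramp.astype(np.uint16) * 257
report = audit(suspect)
print("needs a 16-bit data_range:", report["max"] > 255)
```

Writing the recipe as JSON beside the scores means the sentence "sigma = 25, grayscale, values in [0, 255]" is generated by the harness instead of being remembered by its author.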

3. Metric Tooling: PSNR, SSIM & the Default-Parameter Trap Intermediate

Chapter 1 defined the two workhorse full-reference metrics: peak signal-to-noise ratio, $$\mathrm{PSNR} = 10 \log_{10} \frac{L^2}{\mathrm{MSE}},$$ with $L$ the dynamic range, and the structural similarity index $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$ computed over local windows and averaged, where $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ are small stabilizing constants (the original paper uses $K_1 = 0.01$ and $K_2 = 0.03$). The tooling lives in skimage.metrics, and it hides a reproducibility trap: the original SSIM publication specifies an 11×11 Gaussian-weighted window with $\sigma = 1.5$ and population statistics, while scikit-image defaults to a 7×7 uniform window with sample statistics. Both are legitimate SSIMs; they are just different numbers. Code 8.3.3 computes both and makes the second trap, data_range, explicit.

from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Floats demand an explicit range; guessing it is the classic silent bug.
psnr = peak_signal_noise_ratio(clean, noisy, data_range=1.0)

ssim_default = structural_similarity(clean, noisy, data_range=1.0)

ssim_paper = structural_similarity(      # the Wang et al. (2004) protocol
    clean, noisy, data_range=1.0,
    gaussian_weights=True, sigma=1.5,    # 11x11 Gaussian window
    use_sample_covariance=False)         # population (1/N) statistics

print(f"PSNR              : {psnr:.2f} dB")
print(f"SSIM (defaults)   : {ssim_default:.4f}")
print(f"SSIM (paper flags): {ssim_paper:.4f}")
Code 8.3.3: The same image pair, two SSIM protocols. The three flags in ssim_paper reproduce the original publication's implementation; papers and MATLAB references almost always mean this variant. Report which one you used.
PSNR              : 20.17 dB
SSIM (defaults)   : 0.4664
SSIM (paper flags): 0.3923
Output 8.3.3a: The two protocols differ by 0.07 on this pair, far larger than the margins by which methods beat each other in results tables. Protocol disagreement masquerades as algorithmic improvement.
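The data_range trap can be quantified with the PSNR formula alone. For unclipped Gaussian noise of standard deviation sigma the MSE is approximately sigma squared, so PSNR is about 20 * log10(L / sigma), and declaring the wrong L shifts the result by a fixed offset. The sketch below is a from-scratch NumPy PSNR written for this illustration (the function psnr and the mid-gray test image are assumptions, not the scikit-image implementation), applied to a synthetic pair with no clipping.

```python
import numpy as np

def psnr(reference, test, data_range):
    """PSNR in dB: 10 * log10(L**2 / MSE), with L = data_range."""
    diff = np.asarray(reference, np.float64) - np.asarray(test, np.float64)
    mse = np.mean(diff**2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range**2 / mse)

# Expected PSNR for unclipped Gaussian noise of std sigma: MSE ~= sigma**2.
sigma, L = 25.0, 255.0
print(f"theory, sigma=25 on [0, 255]: {20 * np.log10(L / sigma):.2f} dB")

# Synthetic mid-gray image with noise added and no clipping applied.
rng = np.random.default_rng(0)
ref = np.full((256, 256), 0.5)
deg = ref + rng.normal(0.0, sigma / L, ref.shape)

right = psnr(ref, deg, data_range=1.0)      # floats in [0, 1]
wrong = psnr(ref, deg, data_range=255.0)    # same pixels, wrong declared range
print(f"data_range=1   : {right:.2f} dB")
print(f"data_range=255 : {wrong:.2f} dB")
print(f"inflation      : {wrong - right:.2f} dB")   # equals 20 * log10(255)
```

The offset is the same for every image, which is why this bug rarely changes a ranking inside one script but wrecks any comparison against numbers computed elsewhere. Clipping noisy values back into [0, 1] can only shrink the error, so a clipped pair should score at or above the unclipped figure.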
Key Insight: A Metric Is Code, Not Math

The SSIM formula is unambiguous; an SSIM number also depends on window shape, statistics convention, data range, channel handling, and any resizing done before scoring. The same is true of PSNR (range and dtype), LPIPS (which backbone network), and later FID (which feature extractor and sample count, as Chapter 37 will stress). Treat every reported metric as a function with parameters, state them when you publish, and pin them in code when you compare. The honest minimum is one sentence: library, version, and flags.
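One way to make "library, version, and flags" hard to forget is to store them in a single object that both configures the metric call and prints the reporting sentence. The following is a sketch under that assumption: MetricSpec is a hypothetical helper, not part of scikit-image or pyiqa, and the flag names are the structural_similarity keyword arguments already used in Code 8.3.3.

```python
import json
from dataclasses import asdict, dataclass, field
from importlib import metadata

@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical record of everything that determines a reported number."""
    name: str
    library: str                      # distribution name, e.g. "scikit-image"
    flags: dict = field(default_factory=dict)

    def version(self):
        try:
            return metadata.version(self.library)
        except metadata.PackageNotFoundError:
            return "not installed"

    def describe(self):
        flags = ", ".join(f"{k}={v!r}" for k, v in sorted(self.flags.items()))
        return f"{self.name} via {self.library} {self.version()} ({flags})"

SSIM_PAPER = MetricSpec(
    name="SSIM",
    library="scikit-image",
    flags={"data_range": 1.0, "gaussian_weights": True, "sigma": 1.5,
           "use_sample_covariance": False},
)

print(SSIM_PAPER.describe())          # the one-sentence honest minimum
print(json.dumps(asdict(SSIM_PAPER))) # machine-readable copy for a results file
```

Because the same flags dictionary can be unpacked into the metric call (for instance structural_similarity(clean, noisy, **SSIM_PAPER.flags)), the reported parameters and the executed parameters cannot drift apart.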

Practical Example: The 0.02 SSIM That Did Not Exist

Who: An ML engineer at an imaging startup benchmarking a new denoiser for a customer pitch.

Situation: The pitch claimed the model beat the published BM3D baseline by 0.02 SSIM on BSD68 at sigma 25.

Problem: A prospective customer's team re-ran the comparison and got the opposite ordering. Both sides accused the other of a broken pipeline.

Decision: A joint debugging call traced the gap to tooling, not models: the startup scored with scikit-image defaults (uniform 7×7 window) on float images in [0, 1], while the baseline numbers came from the original Gaussian-window protocol on [0, 255] integers, and one side was also accidentally scoring after a resize. They re-scored everything with one pinned script: paper flags, explicit data_range, no resizing.

Result: The real margin was 0.004 SSIM, within noise across the 68 images. The pitch was rewritten around speed, where the model's advantage was genuine and reproducible.

Lesson: Before comparing two methods, prove you can reproduce the baseline's published number first. If you cannot, your harness, not the method, is the variable.
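"Within noise across the 68 images" is a statement that can be checked, not just asserted. A common approach is a paired bootstrap over per-image score differences; the sketch below shows one minimal version. The function name paired_bootstrap_ci is hypothetical, and the scores are synthetic placeholders drawn from a random generator, not measurements of BM3D or of any real model.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, level=0.95, seed=0):
    """Bootstrap confidence interval for the mean per-image difference (a - b)."""
    diff = np.asarray(scores_a, np.float64) - np.asarray(scores_b, np.float64)
    rng = np.random.default_rng(seed)
    # Resample images with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    means = diff[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return diff.mean(), lo, hi

# Synthetic placeholder scores for 68 images; NOT results of any real method.
rng = np.random.default_rng(1)
baseline = rng.uniform(0.70, 0.95, size=68)
candidate = baseline + rng.normal(0.004, 0.03, size=68)

mean, lo, hi = paired_bootstrap_ci(candidate, baseline)
print(f"mean diff {mean:+.4f}, 95% CI [{lo:+.4f}, {hi:+.4f}]")
print("distinguishable from zero:", not (lo <= 0.0 <= hi))
```

Pairing matters: both methods are scored on the same images, so resampling the per-image differences removes the large image-to-image variation that would otherwise swamp a small margin.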

4. Beyond Pixels: Perceptual and No-Reference Metrics Advanced

PSNR and SSIM compare pixels and local statistics, and they famously prefer over-smoothed results that humans dislike. LPIPS (Zhang et al., 2018) scores the distance between deep network features of the two images instead, and correlates far better with human judgments on restoration outputs. When no clean reference exists at all (real-world enhancement, in-the-wild quality monitoring), no-reference metrics such as NIQE, BRISQUE, and the learned MUSIQ and CLIP-IQA estimate quality from the image alone. Figure 8.3.1 organizes the families into the decision you actually face.

[Diagram] What are you comparing against?
Pristine reference: Full-reference metrics. Pixel-faithful: MSE, PSNR. Structural: SSIM, MS-SSIM. Learned perceptual: LPIPS, DISTS. Use for: restoration, compression, codecs.
No reference: No-reference (blind) metrics. Natural-statistics: NIQE, BRISQUE. Learned: MUSIQ, CLIP-IQA, TOPIQ. LLM-based (2024+): Q-Align. Use for: in-the-wild quality, enhancement without truth.
Two image sets: Distributional metrics. Feature statistics: FID, KID. Text alignment: CLIPScore. Covered in depth in Chapter 37 (Part IV). Use for: generative models, dataset-vs-dataset checks.
All three families are one pip install away via pyiqa (full- and no-reference) and torchmetrics (FID, KID).
Figure 8.3.1: Choosing a quality metric family. With a pristine reference, pick the row matching your question (fidelity, structure, or perception); without one, blind metrics estimate quality from natural-image statistics or learned models; and comparing whole collections of images is a distributional problem, the Chapter 37 story that begins with the histogram statistics of Chapter 2.

In practice you rarely implement any of these. The pyiqa toolbox wraps more than forty metrics, classical and learned, behind one interface, as Code 8.3.4 shows.

import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ["psnr", "ssim", "lpips", "niqe"]}

# Tensors in NCHW, float, [0, 1]; pyiqa also accepts file paths directly.
# The gray zone plate is replicated to 3 channels because the learned
# metrics (LPIPS) are built on RGB backbones.
ref = torch.from_numpy(clean).float()[None, None].repeat(1, 3, 1, 1)
deg = torch.from_numpy(noisy).float()[None, None].repeat(1, 3, 1, 1)

for name, m in metrics.items():
    # Full-reference metrics take (distorted, reference); blind ones take one image.
    score = m(deg) if m.metric_mode == "NR" else m(deg, ref)
    print(f"{name:6s}: {float(score):8.4f}   (lower is better: {m.lower_better})")
Code 8.3.4: One interface, forty-plus metrics: pyiqa exposes each metric's direction via lower_better and its reference requirement via metric_mode, which prevents the two most common scripting mistakes in evaluation harnesses.
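The direction flag earns its keep the moment scores are sorted. As a small illustration, the helper below orders methods best-first for one metric given only a dictionary of scores and a direction flag like the one pyiqa exposes. The function rank_methods and the scores are hypothetical, chosen for this example only.

```python
def rank_methods(scores, lower_better):
    """Order method names best-first for a single metric.

    scores: dict mapping method name -> scalar score.
    lower_better: direction flag, e.g. the lower_better attribute of a pyiqa metric.
    """
    return sorted(scores, key=scores.get, reverse=not lower_better)

# Hypothetical scores for illustration only.
lpips_scores = {"method_a": 0.21, "method_b": 0.18}
psnr_scores = {"method_a": 29.4, "method_b": 28.9}

print(rank_methods(lpips_scores, lower_better=True))    # lower LPIPS wins
print(rank_methods(psnr_scores, lower_better=False))    # higher PSNR wins
```

Passing the flag from the metric object instead of hard-coding it per metric name keeps a results table correct when a new metric with the opposite direction is added to the list.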
Library Shortcut: SSIM in One Call

A faithful from-scratch SSIM is roughly 60 lines: Gaussian window construction, five local-statistics convolutions, the stability constants $C_1, C_2$, edge handling, and channel averaging. skimage.metrics.structural_similarity with the three protocol flags from Code 8.3.3 is 1 line, and the library additionally handles multichannel inputs, returns the full similarity map via full=True for spatial debugging, and stays numerically stable for constant patches where naive implementations divide by zero. 60 lines to 1, and the one line is the one reviewers can check.
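For readers who want to see what the one line replaces, here is an abbreviated NumPy-only sketch of the Gaussian-window computation. It is deliberately incomplete next to the library: it handles only 2-D grayscale arrays, keeps only pixels where the 11×11 window fits entirely inside the image, and has not been checked for bit-exact agreement with scikit-image. The function names are introduced for this illustration.

```python
import numpy as np

def gaussian_kernel(size=11, sigma=1.5):
    """Normalized 1-D Gaussian; applied along rows then columns it is 11x11."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def _blur_valid(img, k):
    """Separable filtering, keeping only pixels where the window fits fully."""
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def ssim_gaussian(x, y, data_range, k1=0.01, k2=0.03):
    """Mean SSIM over valid windows, Gaussian-weighted population statistics."""
    x = np.asarray(x, np.float64)
    y = np.asarray(y, np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    k = gaussian_kernel()
    # The five local-statistics filters: two means and three second moments.
    mu_x, mu_y = _blur_valid(x, k), _blur_valid(y, k)
    var_x = _blur_valid(x * x, k) - mu_x**2
    var_y = _blur_valid(y * y, k) - mu_y**2
    cov_xy = _blur_valid(x * y, k) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return float(np.mean(num / den))

# Synthetic ramp and a noisy copy, for illustration only.
rng = np.random.default_rng(0)
ref = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
deg = np.clip(ref + rng.normal(0.0, 0.1, ref.shape), 0.0, 1.0)

print(f"SSIM(ref, ref) = {ssim_gaussian(ref, ref, 1.0):.4f}")
print(f"SSIM(ref, deg) = {ssim_gaussian(ref, deg, 1.0):.4f}")
```

The constants c1 and c2 are what keep the ratio finite on constant patches, where every variance and covariance term is zero. Everything the sketch omits (border handling, color channels, the full similarity map) is where the remaining lines of a faithful implementation go.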

Research Frontier: Quality Assessment Meets Multimodal Models (2024-2026)

The newest no-reference metrics are language models. Q-Bench (ICLR 2024) benchmarked how well multimodal LLMs perceive low-level quality attributes, and Q-Align (ICML 2024) turned that ability into a state-of-the-art scorer by teaching an LLM to emit discrete quality levels the way human raters do. DepictQA (ECCV 2024) goes further and outputs reasoned, comparative quality descriptions rather than a single scalar, and CLIP-IQA showed that even a frozen CLIP model scores quality zero-shot from prompt pairs like "good photo" versus "bad photo". On the full-reference side, TOPIQ (IEEE TIP 2024) propagates semantic attention down to local distortions. Most of these ship in pyiqa already, so the practical frontier is one create_metric string away; the open question, pursued across 2025-2026 work, is keeping learned judges honest when the images being judged come from generative models, a problem Chapter 37 takes up in earnest.

Exercise 8.3.1: Match the Metric to the Job Conceptual

For each scenario, choose a metric family from Figure 8.3.1 and one concrete metric, and justify the choice: (a) regression-testing a JPEG encoder against golden outputs; (b) ranking three denoisers for a medical imaging customer who fears hallucinated detail; (c) monitoring the quality of user-uploaded photos in production, where no reference exists; (d) deciding whether a synthetic training set "looks like" the real one. Which scenario is the trap where the obvious metric rewards the wrong behavior?

Exercise 8.3.2: A Reproducible Benchmark Harness Coding

Build a small benchmark: download the Kodak suite, add Gaussian noise at sigma 15, 25, and 50 (on a documented scale), and score cv2.fastNlMeansDenoisingColored and the bilateral filter from Chapter 7 with PSNR, both SSIM protocols from Code 8.3.3, and LPIPS. Emit one CSV row per (image, sigma, method, metric, value) and a header comment recording library versions and all metric flags. Verify a colleague (or a fresh environment) reproduces your numbers to four decimals.

Exercise 8.3.3: When Metrics Disagree Analysis

Construct two degradations of the same Kodak image with equal PSNR (tune their strengths until PSNRs match within 0.05 dB): a Gaussian blur and additive noise. Score both with SSIM and LPIPS. Which degradation does each metric prefer, and what does that reveal about what each one measures? Relate your finding to the over-smoothing bias discussed in Section 4 and to the frequency content removed by each degradation (Chapter 4).