Section 8.3: Test Images, Datasets & Quality Metrics Tooling

"Forty-seven years of being blurred, sharpened, compressed, and denoised, and they finally let me retire. My one request: let the astronaut take it from here. She is used to noise."
A Graciously Retired Test Image

Big Picture

An image processing claim is only as good as the data it was tested on and the metric that scored it, and both are standardized for exactly that reason. Shared test images make results comparable across papers and decades; benchmark datasets make them comparable across methods; and metric toolboxes make the numbers themselves reproducible, provided you pin down the parameters that the defaults quietly choose for you. This section is the inventory of all three.

The previous section (Section 8.2) made pipelines fast; this one makes their evaluation trustworthy. Every experiment in Part I needed three ingredients we mostly took for granted: an input image, a degraded-versus-clean pair, and a number that says how well a method did. Here we treat each ingredient as infrastructure: where the standard images live, which datasets benchmark the restoration tasks of Chapter 7, and how to compute the quality metrics introduced in Chapter 1 so that a reviewer, or your own CI pipeline, gets the same number you did.

1. The Standard Test Images (Beginner)

A handful of photographs have been blurred and denoised millions of times because everyone agreed to use the same ones. The historical home is the USC-SIPI database (maintained since 1977), source of the mandrill, the peppers, and the house; the Kodak suite adds 24 film scans that remain the default smoke test for compression and denoising. For code, the most convenient gallery ships inside scikit-image itself: skimage.data bundles the classic cameraman, an astronaut portrait, coins on a textured background, text, and more, each one load-by-function-call. Code 8.3.1 inventories it.

# Inventory the built-in test gallery: load each classic image by
# function call and report its shape and dtype, surfacing the mixed
# uint8/float64 conventions that Section 8.1 warned about.
from skimage import data

for name in ["camera", "astronaut", "coins", "text", "checkerboard",
             "page", "moon", "shepp_logan_phantom"]:
    img = getattr(data, name)()          # each image is a function call
    print(f"{name:20s} {img.shape!s:16s} {img.dtype}")

Code 8.3.1: The built-in test gallery: skimage.data downloads nothing for the core images and returns plain NumPy arrays, which makes it the fastest way to get a real photograph into a unit test.

camera               (512, 512)       uint8
astronaut            (512, 512, 3)    uint8
coins                (303, 384)       uint8
text                 (172, 448)       uint8
checkerboard         (200, 200)       uint8
page                 (191, 384)       uint8
moon                 (512, 512)       uint8
shepp_logan_phantom  (400, 400)       float64

Output 8.3.1a: Eight ready-made test subjects covering portraits, textures, documents, low contrast, and a synthetic CT phantom; note the mixed dtypes, exactly the trap Section 8.1 warned about.
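Because the gallery mixes uint8 and float64, a harness usually normalizes every image to one convention before scoring. The sketch below uses a hypothetical helper, to_unit_float, written only for illustration (scikit-image's own img_as_float, used later in the lab, does the same job). It assumes unsigned-integer or float input and treats floats as already being in [0, 1].

```python
import numpy as np

def to_unit_float(img):
    """Map an unsigned-integer or float image to float64 in [0, 1].

    Hypothetical helper for illustration: unsigned integers are divided
    by their dtype maximum; floats are assumed to be in [0, 1] already
    and are only checked, never rescaled.
    """
    img = np.asarray(img)
    if np.issubdtype(img.dtype, np.unsignedinteger):
        return img.astype(np.float64) / np.iinfo(img.dtype).max
    if np.issubdtype(img.dtype, np.floating):
        if img.min() < 0.0 or img.max() > 1.0:
            raise ValueError("float image outside [0, 1]; rescale explicitly")
        return img.astype(np.float64)
    raise TypeError(f"unsupported dtype: {img.dtype}")

u8 = np.array([[0, 128, 255]], dtype=np.uint8)
f64 = np.array([[0.0, 0.5, 1.0]])
print(to_unit_float(u8))     # uint8 values scaled by 1/255
print(to_unit_float(f64))    # float input passes through unchanged
```

Raising on out-of-range floats, rather than silently rescaling, is a deliberate choice in this sketch: a float image in [0, 255] is usually a bug upstream, and it is better surfaced than hidden.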

Fun Fact

The most famous test image of all, the 1972 "Lena" crop, is conspicuously absent from modern libraries. After decades of debate about using a Playboy centerfold as the field's default benchmark, journals began declining it (Nature Research signaled its discouragement in the late 2010s), and IEEE announced in 2024 that it would no longer accept new papers using the image. The community's drop-in replacement in scikit-image is data.astronaut(): NASA's portrait of Eileen Collins, the first woman to command a Space Shuttle mission. The epigraph above is her predecessor's gracious handover.

Synthetic targets complement the photographs because their ground truth is mathematical. The Shepp-Logan phantom (listed by Code 8.3.1) has known ellipse boundaries, which is why CT reconstruction papers have used it since 1974. Checkerboards and ramps expose interpolation and quantization artifacts from Chapter 1. The most instructive of all is the zone plate, $I(r) = \tfrac{1}{2}\left(1 + \cos(\alpha r^2)\right)$, whose local frequency grows linearly with radius (the rings pack tighter the farther out you go, because differentiating the $r^2$ phase gives a frequency proportional to $r$). It is a single image that sweeps spatial frequencies continuously, from zero at the center up to a maximum set by $\alpha$ and the image size, so aliasing from poor resampling (Chapter 5) and the band behavior of filters (Chapter 4) appear as visible rings exactly where theory predicts. Code 8.3.2 builds one (folding a $1/n$ normalization into the phase), then manufactures the (clean, noisy) pair every denoising experiment needs.

# Build a zone plate (a radial chirp with known frequency content)
# and add calibrated Gaussian noise to it, manufacturing a synthetic
# (clean, noisy) restoration pair with exact mathematical ground truth.
import numpy as np

def zone_plate(n=512, alpha=0.4):
    """Radial chirp: local frequency grows linearly with radius."""
    y, x = np.mgrid[-n // 2:n // 2, -n // 2:n // 2].astype(np.float64)
    r2 = (x**2 + y**2) / n                # normalized squared radius
    return 0.5 * (1.0 + np.cos(alpha * r2))

clean = zone_plate()                      # float64 in [0, 1], exact ground truth
rng = np.random.default_rng(7)
noisy = np.clip(clean + rng.normal(0.0, 25 / 255, clean.shape), 0.0, 1.0)

Code 8.3.2: A zone plate plus calibrated Gaussian noise (the sigma = 25 convention, expressed on a [0, 1] scale) yields a synthetic restoration benchmark with perfect ground truth: the exact setup behind the denoising numbers of Chapter 7.
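How much of the spectrum the chirp of Code 8.3.2 actually covers follows from its phase, $\phi(r) = \alpha r^2 / n$, whose derivative $2\alpha r / n$ is the local radial frequency in radians per pixel. The short sketch below evaluates that expression at the middle of an image edge and compares it with the Nyquist limit of $\pi$ radians per pixel. The function name is a hypothetical one chosen for this illustration.

```python
import math

def zone_plate_local_frequency(r, n=512, alpha=0.4):
    """Radial frequency (radians per pixel) at radius r.

    Derived from the phase used in Code 8.3.2, phi(r) = alpha * r**2 / n,
    whose derivative with respect to r is 2 * alpha * r / n.
    """
    return 2.0 * alpha * r / n

n, alpha = 512, 0.4
edge = zone_plate_local_frequency(n / 2, n, alpha)   # radius at mid-edge
nyquist = math.pi                                    # radians per pixel

print(f"frequency at mid-edge : {edge:.3f} rad/pixel")
print(f"fraction of Nyquist   : {edge / nyquist:.3f}")
print(f"alpha reaching Nyquist: {nyquist:.4f}")
```

At mid-edge the expression reduces to alpha itself, so the default alpha = 0.4 reaches roughly 13% of Nyquist there, and a value near $\pi$ would be needed for the outer rings to reach the limit. That is an inference from the formula, not a recommendation: changing alpha also changes every score computed on the resulting pair.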

2. Benchmark Datasets for Restoration (Intermediate)

When a method must be compared against published results, single images give way to standard datasets, and the choice is rarely innocent: a denoiser that wins on synthetic Gaussian noise can lose badly on real sensor noise, so the dataset you pick quietly decides which method looks best. Table 8.3.1 lists the ones that dominate the literature for the tasks of Chapter 7; the names recur in virtually every paper you will read after this part of the book.

Table 8.3.1: Standard benchmark datasets for Part I restoration tasks.

Dataset	Task	Contents	Notes
BSDS500 / BSD68	Denoising (synthetic)	500 natural photos; 68-image gray test split	Noise added at sigma 15/25/50; the classic protocol
Set12 / Kodak24	Denoising (synthetic)	12 gray classics; 24 color film scans	Small, fast, in every results table
SIDD	Denoising (real)	~30k smartphone pairs, 10 scenes, 5 phones	Real sensor noise; exposes methods tuned on Gaussian assumptions
DIV2K	Super-resolution	800 train + 100 val images at 2K	The training set behind nearly every modern super-resolution model
Set5 / Set14 / Urban100	Super-resolution (eval)	5, 14, and 100 test images	Urban100's repeated facades punish bad upscalers
GoPro	Deblurring	3,214 blur/sharp pairs from 240 fps video	Blur synthesized by averaging real frames
LOL	Low-light enhancement	485 train + 15 test paired exposures	The standard paired benchmark for the Chapter 7 enhancement task

Two hygiene rules attach to every row. First, respect the published splits: DIV2K's validation images, for example, appear inside other datasets' test sets, and accidental train-test contamination has invalidated more than one results table. Second, record the exact degradation recipe (noise sigma and dtype scale, downsampling kernel for super-resolution, JPEG quality) alongside your scores; "PSNR 31.2 on BSD68" is meaningless without "sigma = 25, grayscale, values in [0, 255]". A quick histogram audit of any downloaded dataset, with the tools from Chapter 2, catches the most common surprises: 16-bit files masquerading as 8-bit, and already-compressed "clean" references.
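A first-pass audit of the kind described above can be very small. The sketch below assumes the image has already been loaded into a NumPy array by whatever reader you use. The helper name audit_image and its report keys are hypothetical, and it only flags dtype and range surprises. Detecting an already-compressed reference is beyond this sketch.

```python
import numpy as np

def audit_image(arr, expected_dtype=np.uint8):
    """Summarize dtype, value range, and occupied levels of one image.

    Hypothetical helper: the flags are quick heuristics for a first
    look, not proof that a file is or is not what it claims to be.
    """
    arr = np.asarray(arr)
    return {
        "dtype": str(arr.dtype),
        "min": float(arr.min()),
        "max": float(arr.max()),
        "distinct_levels": int(np.unique(arr).size),
        "dtype_matches": bool(arr.dtype == np.dtype(expected_dtype)),
        "exceeds_8bit_range": bool(arr.max() > 255),
    }

# A 16-bit array standing in for a file that was assumed to be 8-bit.
suspect = np.array([[0, 1200], [4000, 65535]], dtype=np.uint16)
print(audit_image(suspect))
```

Running this over every file of a downloaded dataset and counting the rows where dtype_matches is False takes seconds, and it is cheaper than discovering the mismatch through an implausible PSNR.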

3. Metric Tooling: PSNR, SSIM & the Default-Parameter Trap (Intermediate)

Chapter 1 defined the two workhorse full-reference metrics: peak signal-to-noise ratio, $$\mathrm{PSNR} = 10 \log_{10} \frac{L^2}{\mathrm{MSE}},$$ with $L$ the dynamic range, and the structural similarity index $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$ computed over local windows and averaged. The tooling lives in skimage.metrics, and it hides a reproducibility trap: the original SSIM publication specifies an 11×11 Gaussian-weighted window with $\sigma = 1.5$ and population statistics, while scikit-image defaults to a 7×7 uniform window with sample statistics. (Population statistics divide a variance by $N$; sample statistics divide by $N-1$, a small bias correction that shifts the score slightly.) Both are legitimate SSIMs; they are just different numbers, as the illustration below dramatizes with two judges scoring the same painting through different windows. Code 8.3.3 computes both and makes the second trap, data_range, explicit.

Illustration: two cartoon judge robots score the same painting differently, one measuring through a small square window and the other through a larger, soft circular lens. Same formula, different settings, different score: a metric is code with parameters, and protocol disagreement can masquerade as a real improvement.

# Score the same image pair two ways: PSNR plus SSIM under both the
# scikit-image defaults and the original Wang et al. (2004) protocol,
# making the data_range and window-choice traps explicit.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Floats demand an explicit range; guessing it is the classic silent bug.
psnr = peak_signal_noise_ratio(clean, noisy, data_range=1.0)

ssim_default = structural_similarity(clean, noisy, data_range=1.0)

ssim_paper = structural_similarity(      # the Wang et al. (2004) protocol
    clean, noisy, data_range=1.0,
    gaussian_weights=True, sigma=1.5,    # 11x11 Gaussian window
    use_sample_covariance=False)         # population (1/N) statistics

print(f"PSNR              : {psnr:.2f} dB")
print(f"SSIM (defaults)   : {ssim_default:.4f}")
print(f"SSIM (paper flags): {ssim_paper:.4f}")

Code 8.3.3: The same image pair, two SSIM protocols. The three flags in ssim_paper reproduce the original publication's implementation; papers and MATLAB references almost always mean this variant. Report which one you used.

PSNR              : 20.17 dB
SSIM (defaults)   : 0.4664
SSIM (paper flags): 0.3923

Output 8.3.3a: The two protocols differ by 0.07 on this pair, far larger than the margins by which methods beat each other in results tables. Protocol disagreement masquerades as algorithmic improvement.
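The data_range trap can be quantified directly from the PSNR definition. The sketch below is a from-scratch PSNR in NumPy (an illustration, not the scikit-image implementation) applied to a flat gray image with unclipped Gaussian noise. It assumes the sigma = 25 convention on a [0, 1] scale and shows what happens when the range is declared as 255 for data that actually lives in [0, 1].

```python
import math
import numpy as np

def psnr(reference, test, data_range):
    """PSNR in dB from the definition 10 * log10(L**2 / MSE)."""
    diff = (np.asarray(reference, dtype=np.float64)
            - np.asarray(test, dtype=np.float64))
    return 10.0 * math.log10(data_range ** 2 / np.mean(diff ** 2))

rng = np.random.default_rng(0)
flat = np.full((256, 256), 0.5)                        # mid-gray, float in [0, 1]
flat_noisy = flat + rng.normal(0.0, 25 / 255, flat.shape)   # deliberately unclipped

right = psnr(flat, flat_noisy, data_range=1.0)         # range matches the data
wrong = psnr(flat, flat_noisy, data_range=255.0)       # range declared for uint8
analytic = 20.0 * math.log10(255 / 25)                 # expected value for this sigma

print(f"data_range=1.0   : {right:.2f} dB")
print(f"data_range=255.0 : {wrong:.2f} dB")
print(f"offset           : {wrong - right:.2f} dB")
print(f"analytic         : {analytic:.2f} dB")
```

The offset between the two calls is exactly $20 \log_{10} 255 \approx 48.13$ dB, independent of the image, which is why a mis-declared range produces numbers that look impressive rather than obviously broken. Clipping the noisy image to [0, 1], as Code 8.3.2 does, can only move pixels closer to a reference that lies in [0, 1], so a clipped pair scores at or above the unclipped figure.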

Try This: Sweep the SSIM Window and Watch the Score Drift

Run Code 8.3.3 once to confirm the two numbers, then sweep a single parameter to feel how much "the same metric" can move. Loop win_size over 3, 7, 11, 15 in the default call (structural_similarity(clean, noisy, data_range=1.0, win_size=w)) and print the SSIM for each; then flip gaussian_weights between False and True at a fixed window. Observe two things: the SSIM number shifts by several hundredths as the window changes, while the PSNR from the same pair never moves, because PSNR has no window. The takeaway lands without any prose: report the window and the flags, or your "improvement" may just be a wider window. This is a thirty-second experiment, not a project; the full benchmark version lives in the lab below.

Key Insight: A Metric Is Code, Not Math

The SSIM formula is unambiguous; an SSIM number also depends on window shape, statistics convention, data range, channel handling, and any resizing done before scoring. The same is true of PSNR (range and dtype), LPIPS (which backbone network), and later FID (which feature extractor and sample count, as Chapter 37 will stress). Treat every reported metric as a function with parameters, state them when you publish, and pin them in code when you compare. The honest minimum is one sentence: library, version, and flags.
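One way to make the "library, version, and flags" sentence hard to forget is to generate it. The sketch below builds a small record from the installed package version and the keyword arguments of the metric call. The helper name describe_protocol and the record layout are assumptions made for this example, not part of any library.

```python
import json
from importlib import metadata

def describe_protocol(metric, package, **flags):
    """Bundle metric name, library version, and flags into one record.

    Hypothetical helper: `package` is the distribution name as known
    to pip (for example "scikit-image").
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not installed"
    return {"metric": metric, "library": package,
            "version": version, "flags": flags}

record = describe_protocol(
    "ssim", "scikit-image",
    data_range=1.0, gaussian_weights=True, sigma=1.5,
    use_sample_covariance=False)
print(json.dumps(record, sort_keys=True))
```

Passing the same flags dictionary to both the metric call and the record (for example structural_similarity(a, b, **record["flags"])) keeps the reported protocol and the executed protocol from drifting apart.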

Practical Example: The 0.02 SSIM That Did Not Exist

Who: An ML engineer at an imaging startup benchmarking a new denoiser for a customer pitch.

Situation: The pitch claimed the model beat the published BM3D baseline by 0.02 SSIM on BSD68 at sigma 25.

Problem: A prospective customer's team re-ran the comparison and got the opposite ordering. Both sides accused the other of a broken pipeline.

Decision: A joint debugging call traced the gap to tooling, not models: the startup scored with scikit-image defaults (uniform 7×7 window) on float images in [0, 1], while the baseline numbers came from the original Gaussian-window protocol on [0, 255] integers, and one side was also accidentally scoring after a resize. They re-scored everything with one pinned script: paper flags, explicit data_range, no resizing.

Result: The real margin was 0.004 SSIM, within noise across the 68 images. The pitch was rewritten around speed, where the model's advantage was genuine and reproducible.

Lesson: Before comparing two methods, prove you can reproduce the baseline's published number first. If you cannot, your harness, not the method, is the variable.

4. Beyond Pixels: Perceptual and No-Reference Metrics (Advanced)

PSNR and SSIM compare pixels and local statistics, and they famously prefer over-smoothed results that humans dislike. LPIPS (Learned Perceptual Image Patch Similarity; Zhang et al., 2018) scores the distance between deep network features of the two images instead, and correlates far better with human judgments on restoration outputs. When no clean reference exists at all (real-world enhancement, in-the-wild quality monitoring), no-reference metrics such as NIQE, BRISQUE, and the learned MUSIQ and CLIP-IQA estimate quality from the image alone. Figure 8.3.1 organizes the families into the decision you actually face.

Common Misconception: "Higher PSNR or SSIM Means the Image Looks Better"

It is tempting to read PSNR and SSIM as direct measures of how good an image looks, so that the higher-scoring denoiser is automatically the better-looking one. In fact both are fidelity measures: PSNR rewards small per-pixel error and SSIM small structural error against a reference, and minimizing that error favors safe, blurry, over-smoothed results that humans often judge worse than a sharper output with slightly higher pixel error. This is why a denoiser can win on PSNR while erasing the texture a viewer cares about, and why LPIPS (which compares deep features) frequently disagrees with PSNR exactly on the images where humans do. Treat PSNR and SSIM as "how close to the reference pixels", not "how good it looks", and pair them with a perceptual metric when appearance is the goal. The diagnostic question: would you still trust the higher score if the winning image were visibly mushier?

Figure 8.3.1: Choosing a quality metric family. With a pristine reference, pick the row matching your question (fidelity, structure, or perception); without one, blind metrics estimate quality from natural-image statistics or learned models; and comparing whole collections of images is a distributional problem, the Chapter 37 story that begins with the histogram statistics of Chapter 2.

In practice you rarely implement any of these. The pyiqa toolbox wraps more than forty metrics, classical and learned, behind one interface, as Code 8.3.4 shows.

# Score one image pair with four metrics through a single interface:
# pyiqa exposes each metric's direction (lower_better) and reference
# requirement (metric_mode), so the harness picks the call form safely.
import pyiqa, torch

device = "cuda" if torch.cuda.is_available() else "cpu"
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ["psnr", "ssim", "lpips", "niqe"]}

# Tensors in NCHW, float, [0, 1]; pyiqa also accepts file paths directly.
ref = torch.from_numpy(clean).float()[None, None]
deg = torch.from_numpy(noisy).float()[None, None]

for name, m in metrics.items():
    score = m(deg) if m.metric_mode == "NR" else m(deg, ref)
    print(f"{name:6s}: {float(score):8.4f}   (lower is better: {m.lower_better})")

Code 8.3.4: One interface, forty-plus metrics: pyiqa exposes each metric's direction via lower_better and its reference requirement via metric_mode, which prevents the two most common scripting mistakes in evaluation harnesses.
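The direction flag matters most at the moment a harness ranks methods. The sketch below shows the idea in plain Python. It assumes the lower_better value has been read from the metric object as in Code 8.3.4, and the scores are placeholders invented for illustration, not measurements.

```python
def rank_methods(scores, lower_better):
    """Order method names best-first for a single metric."""
    return sorted(scores, key=scores.get, reverse=not lower_better)

# Placeholder scores for illustration only, not measured results.
lpips_scores = {"method_a": 0.21, "method_b": 0.35}
psnr_scores = {"method_a": 28.1, "method_b": 30.2}

print(rank_methods(lpips_scores, lower_better=True))    # smaller LPIPS wins
print(rank_methods(psnr_scores, lower_better=False))    # larger PSNR wins
```

Hard-coding "higher is better" in the sort is the mistake this guards against: with these placeholder numbers it would crown method_b on both metrics, although method_a has the better (smaller) LPIPS.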

Library Shortcut: SSIM in One Call

A faithful from-scratch SSIM is roughly 60 lines: Gaussian window construction, five local-statistics convolutions, the stability constants $C_1, C_2$, edge handling, and channel averaging. skimage.metrics.structural_similarity with the three protocol flags from Code 8.3.3 is 1 line, and the library additionally handles multichannel inputs, returns the full similarity map via full=True for spatial debugging, and stays numerically stable for constant patches where naive implementations divide by zero. 60 lines to 1, and the one line is the one reviewers can check.
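To see what the one-line call replaces, here is a deliberately reduced from-scratch sketch: a uniform 7×7 window, population statistics, grayscale input, and only the windows that fit entirely inside the image. It is an illustration under those assumptions, not the Gaussian-window protocol of Code 8.3.3 and not the library's implementation. The function name ssim_uniform is hypothetical, and the constants $K_1 = 0.01$, $K_2 = 0.03$ are the conventional ones.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_uniform(x, y, data_range=1.0, win=7, k1=0.01, k2=0.03):
    """Mean SSIM over all win x win windows that fit inside the image.

    Uniform weights and population (1/N) statistics; grayscale only.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    wx = sliding_window_view(x, (win, win))    # (H-win+1, W-win+1, win, win)
    wy = sliding_window_view(y, (win, win))
    mu_x = wx.mean(axis=(-2, -1))
    mu_y = wy.mean(axis=(-2, -1))
    var_x = (wx * wx).mean(axis=(-2, -1)) - mu_x * mu_x
    var_y = (wy * wy).mean(axis=(-2, -1)) - mu_y * mu_y
    cov_xy = (wx * wy).mean(axis=(-2, -1)) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return float((num / den).mean())

rng = np.random.default_rng(0)
img = rng.random((64, 64))
img_noisy = np.clip(img + rng.normal(0.0, 0.1, img.shape), 0.0, 1.0)
print(f"identical pair: {ssim_uniform(img, img):.4f}")
print(f"noisy pair    : {ssim_uniform(img, img_noisy):.4f}")
```

Even this reduced version has to make decisions about windowing, borders, and constants that the formula alone does not settle, which is the practical argument for calling the library and reporting its flags.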

Research Frontier: Quality Assessment Meets Multimodal Models (2024-2026)

The newest no-reference metrics are language models. Q-Bench (ICLR 2024) benchmarked how well multimodal LLMs perceive low-level quality attributes, and Q-Align (ICML 2024) turned that ability into a state-of-the-art scorer by teaching an LLM to emit discrete quality levels the way human raters do. DepictQA (ECCV 2024) goes further and outputs reasoned, comparative quality descriptions rather than a single scalar, and CLIP-IQA showed that even a frozen CLIP model scores quality zero-shot from prompt pairs like "good photo" versus "bad photo". On the full-reference side, TOPIQ (IEEE TIP 2024) propagates semantic attention down to local distortions. Most of these ship in pyiqa already, so the practical frontier is one create_metric string away. The open question, pursued across 2025-2026 work, is keeping learned judges honest when the images being judged come from generative models, a problem Chapter 37 takes up in earnest.

Exercise 8.3.1: Match the Metric to the Job (Conceptual)

For each scenario, choose a metric family from Figure 8.3.1 and one concrete metric, and justify the choice: (a) regression-testing a JPEG encoder against golden outputs; (b) ranking three denoisers for a medical imaging customer who fears hallucinated detail; (c) monitoring the quality of user-uploaded photos in production, where no reference exists; (d) deciding whether a synthetic training set "looks like" the real one. Which scenario is the trap where the obvious metric rewards the wrong behavior?

Exercise 8.3.2: A Reproducible Benchmark Harness (Coding)

Build a small benchmark: download the Kodak suite, add Gaussian noise at sigma 15, 25, and 50 (on a documented scale), and score cv2.fastNlMeansDenoisingColored and the bilateral filter from Chapter 7 with PSNR, both SSIM protocols from Code 8.3.3, and LPIPS. Emit one CSV row per (image, sigma, method, metric, value) and a header comment recording library versions and all metric flags. Verify a colleague (or a fresh environment) reproduces your numbers to four decimals.

Exercise 8.3.3: When Metrics Disagree (Analysis)

Construct two degradations of the same Kodak image with equal PSNR (tune their strengths until PSNRs match within 0.05 dB): a Gaussian blur and additive noise. Score both with SSIM and LPIPS. Which degradation does each metric prefer, and what does that reveal about what each one measures? Relate your finding to the over-smoothing bias discussed in Section 4 and to the frequency content removed by each degradation (Chapter 4).

The three threads of this chapter, choosing a library deliberately (Section 8.1), measuring and comparing speed honestly (Section 8.2), and scoring quality reproducibly (this section), only become a single skill when you wire them into one harness. Put them together in the hands-on lab below, which turns this chapter into a runnable denoising benchmark studio you can keep and reuse.

Difficulty: Intermediate
Duration: about 60 to 75 minutes

Hands-On Lab: A Denoising Benchmark Studio

Objective

Build one self-contained, reproducible script that pits three denoisers from three different libraries against a clean-versus-noisy image set, times each one, scores each with PSNR and two SSIM protocols, and prints a ranked leaderboard plus a tidy CSV, the whole Part I toolbox working as a single harness.

What You'll Practice

Reaching across OpenCV, scikit-image, and SciPy ndimage in one pipeline while keeping dtype and value-range conventions straight (Section 8.1).
Timing real work honestly with warm-up runs and repeated trials (Section 8.2).
Manufacturing reproducible (clean, noisy) pairs from built-in test images with a fixed random seed.
Scoring with PSNR and both the scikit-image default and the Wang et al. (2004) SSIM protocols, with an explicit data_range.
Emitting a results table that a colleague can reproduce to four decimals.

Setup

No downloads: every image comes from skimage.data. Install the three libraries the harness compares, plus NumPy.

pip install opencv-python scikit-image scipy numpy

Steps

Step 1: Assemble a reproducible test set

Pull a few grayscale classics from skimage.data, normalize them to float in [0, 1] so every library speaks the same units, and add seeded Gaussian noise to manufacture (clean, noisy) pairs. Reusing the seed is what makes the whole lab reproducible.

import numpy as np
from skimage import data, img_as_float

def make_pairs(sigma=25 / 255, seed=7):
    rng = np.random.default_rng(seed)
    images = {
        "camera": data.camera(),
        "coins": data.coins(),
        "moon": data.moon(),
    }
    pairs = {}
    for name, raw in images.items():
        clean = img_as_float(raw)              # uint8 -> float64 in [0, 1]
        # TODO: add Gaussian noise of standard deviation `sigma`, then clip to [0, 1]
        # Hint: rng.normal(0.0, sigma, clean.shape), then np.clip(...)
        noisy = ...
        pairs[name] = (clean, noisy)
    return pairs

pairs = make_pairs()
print({k: v[0].shape for k, v in pairs.items()})

Hint

The clean image is already float in [0, 1], so the noise standard deviation must be on the same scale: 25 gray levels out of 255 is 25 / 255. Clip after adding noise so values stay in range, exactly as Code 8.3.2 did for the zone plate.

Step 2: Wrap three denoisers behind one signature

Each library wants its inputs in a slightly different form. Wrap each denoiser so all three take a float [0, 1] image and return a float [0, 1] image, hiding the per-library conversions behind a common interface.

import cv2
from scipy import ndimage as ndi
from skimage.restoration import denoise_bilateral

def denoise_opencv_nlm(img):
    u8 = (np.clip(img, 0, 1) * 255).astype(np.uint8)   # OpenCV wants uint8
    out = cv2.fastNlMeansDenoising(u8, h=12)
    return out.astype(np.float64) / 255.0

def denoise_scipy_gaussian(img):
    # TODO: return a Gaussian-blurred copy using scipy.ndimage (sigma about 1.0)
    # Hint: ndi.gaussian_filter keeps float input float; no conversion needed
    return ...

def denoise_skimage_bilateral(img):
    # skimage works natively in float [0, 1]
    return denoise_bilateral(img, sigma_color=0.1, sigma_spatial=3)

denoisers = {
    "opencv_nlm": denoise_opencv_nlm,
    "scipy_gaussian": denoise_scipy_gaussian,
    "skimage_bilateral": denoise_skimage_bilateral,
}

Hint

ndi.gaussian_filter(img, sigma=1.0) preserves dtype, so the SciPy path needs no uint8 round-trip. Only the OpenCV path crosses the dtype boundary, the Section 8.1 trap made concrete.
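One detail of the uint8 round-trip is worth seeing once: astype(np.uint8) drops the fractional part instead of rounding to the nearest level. The sketch below measures the resulting bias on uniformly random float data. It is an illustration of the conversion only, with no claim about how much it moves the lab's scores, where noise at sigma = 25 is far larger than half a gray level.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((256, 256))                       # float64 in [0, 1)

truncated = (img * 255).astype(np.uint8)           # astype drops the fraction
rounded = np.round(img * 255).astype(np.uint8)     # nearest gray level

# Mean round-trip error, expressed in gray levels.
bias_trunc = (truncated / 255.0 - img).mean() * 255
bias_round = (rounded / 255.0 - img).mean() * 255

print(f"mean error, truncation: {bias_trunc:+.3f} gray levels")
print(f"mean error, rounding  : {bias_round:+.3f} gray levels")
```

Truncation darkens the image by about half a gray level on average, while rounding leaves no systematic shift. If a comparison ever comes down to the third decimal of PSNR, wrapping the product in np.round before the cast removes this one avoidable difference between the OpenCV path and the float-native paths.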

Step 3: Time each denoiser honestly

A single naive timing mostly measures one-time costs (lazy initialization, cold caches), not the algorithm. Run one warm-up call, then time several repeats and keep the best, the honest-measurement discipline from Section 8.2.

import time

def timed(fn, img, repeats=3):
    fn(img)                                  # warm-up: ignore the first call
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        out = fn(img)
        best = min(best, time.perf_counter() - t0)
    # TODO: return both the denoised image `out` and the best time in milliseconds
    return ...

Hint

Return out, best * 1000.0. Reporting the minimum across repeats, not the mean, follows the rule from Section 8.2: the fastest run is the one least polluted by background scheduling noise.
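The same warm-up-then-best-of-N pattern can also be expressed with the standard library's timeit module, which additionally switches off garbage collection while it times by default. The sketch below is an alternative to the hand-written loop, not a replacement for it: the helper name best_ms is hypothetical, and unlike timed it returns only the time, not the denoised image.

```python
import timeit

def best_ms(fn, arg, repeats=3):
    """Best-of-N wall time in milliseconds, one call per repeat."""
    fn(arg)                                            # warm-up, not timed
    times = timeit.repeat(lambda: fn(arg), number=1, repeat=repeats)
    return min(times) * 1000.0

# Stand-in workload: sorting a reversed list.
workload = list(range(100_000, 0, -1))
print(f"{best_ms(sorted, workload):.3f} ms")
```

For the lab, the explicit loop is still the more convenient form, because the harness needs the restored image from the timed call in order to score it.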

Step 4: Score with PSNR and two SSIM protocols

Reuse the scoring logic from Code 8.3.3. Always pass an explicit data_range, and compute SSIM both under the scikit-image defaults and under the Wang et al. (2004) flags so the protocol gap from Output 8.3.3a is visible in your own results.

from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def score(clean, restored):
    psnr = peak_signal_noise_ratio(clean, restored, data_range=1.0)
    ssim_default = structural_similarity(clean, restored, data_range=1.0)
    # TODO: compute SSIM under the Wang et al. (2004) protocol
    # Hint: gaussian_weights=True, sigma=1.5, use_sample_covariance=False
    ssim_paper = ...
    return psnr, ssim_default, ssim_paper

Hint

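The three flags are exactly the ones in Code 8.3.3. Always pass data_range=1.0 explicitly: without it, peak_signal_noise_ratio infers the range from the dtype, and recent scikit-image releases refuse float input to structural_similarity rather than guess. A wrongly inferred range is the single most common silent bug in evaluation scripts.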

Step 5: Run the grid and collect rows

Loop over every (image, denoiser) combination, restore, time, and score, accumulating one record per run. This nested loop is the harness itself.

rows = []
for img_name, (clean, noisy) in pairs.items():
    for method_name, fn in denoisers.items():
        restored, ms = timed(fn, noisy)
        psnr, ssim_d, ssim_p = score(clean, restored)
        rows.append({
            "image": img_name, "method": method_name,
            "psnr": psnr, "ssim_default": ssim_d,
            "ssim_paper": ssim_p, "time_ms": ms,
        })
print(f"collected {len(rows)} runs")           # expect 3 images x 3 methods = 9

Hint

If you see fewer than nine rows, a denoiser raised an exception and you swallowed it; let errors surface during development, then add handling only once the happy path works.

Step 6: Print a leaderboard and write a reproducible CSV

Average each method's scores across the images, rank by the paper-protocol SSIM, and write every row to CSV with a header comment that pins library versions, so a colleague can reproduce your numbers to four decimals.

import csv
import statistics
import skimage          # imported only so its version can be recorded below

# Leaderboard: mean paper-SSIM and mean time per method
by_method = {}
for r in rows:
    by_method.setdefault(r["method"], []).append(r)
print(f"{'method':20s} {'SSIM(paper)':>12s} {'time_ms':>10s}")
for method, recs in sorted(
        by_method.items(),
        key=lambda kv: statistics.mean(r["ssim_paper"] for r in kv[1]),
        reverse=True):
    s = statistics.mean(r["ssim_paper"] for r in recs)
    t = statistics.mean(r["time_ms"] for r in recs)
    print(f"{method:20s} {s:12.4f} {t:10.1f}")

# TODO: write all `rows` to results.csv with csv.DictWriter, prefixing
# a comment line that records skimage.__version__ and cv2.__version__
# (cv2 is already imported in Step 2)

Hint

Open the file, write one comment line such as # skimage=<ver> opencv=<ver>, then use csv.DictWriter(f, fieldnames=rows[0].keys()) with writeheader() and writerows(rows). Recording versions is what makes "reproducible to four decimals" an honest claim.

Expected Output

A nine-row run and a leaderboard ranked by paper-protocol SSIM. On the seeded set, the non-local means and bilateral methods should clearly outscore the plain Gaussian blur on SSIM (the blur removes noise and signal alike), while the Gaussian path is the fastest, a concrete instance of the speed-versus-quality tradeoff this chapter keeps returning to. Your exact numbers will differ slightly by library version, which is precisely why the CSV records them. A representative leaderboard:

method               SSIM(paper)    time_ms
skimage_bilateral         0.8021       38.4
opencv_nlm                0.7864       21.7
scipy_gaussian            0.6190        2.1

Stretch Goals

Add a noise sweep: rerun the grid at sigma 15, 25, and 50 (the BSD68 protocol from Table 8.3.1) and plot SSIM against sigma per method, watching the curves cross.
Add the zone plate of Code 8.3.2 as a fourth clean image, with the same seeded noise, to see which denoiser best preserves the high-frequency rings, connecting the result to the frequency analysis of Chapter 4.
Add a perceptual metric: install pyiqa and add LPIPS (Code 8.3.4) as a fourth column, then check whether it reorders the leaderboard the way Section 4 predicts.

Complete Solution

import csv, statistics, time
import numpy as np
import cv2
from scipy import ndimage as ndi
import skimage
from skimage import data, img_as_float
from skimage.restoration import denoise_bilateral
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Step 1: reproducible (clean, noisy) pairs
def make_pairs(sigma=25 / 255, seed=7):
    rng = np.random.default_rng(seed)
    images = {"camera": data.camera(), "coins": data.coins(), "moon": data.moon()}
    pairs = {}
    for name, raw in images.items():
        clean = img_as_float(raw)
        noisy = np.clip(clean + rng.normal(0.0, sigma, clean.shape), 0.0, 1.0)
        pairs[name] = (clean, noisy)
    return pairs

# Step 2: three denoisers, one signature (float [0, 1] in and out)
def denoise_opencv_nlm(img):
    u8 = (np.clip(img, 0, 1) * 255).astype(np.uint8)
    return cv2.fastNlMeansDenoising(u8, h=12).astype(np.float64) / 255.0

def denoise_scipy_gaussian(img):
    return ndi.gaussian_filter(img, sigma=1.0)

def denoise_skimage_bilateral(img):
    return denoise_bilateral(img, sigma_color=0.1, sigma_spatial=3)

denoisers = {
    "opencv_nlm": denoise_opencv_nlm,
    "scipy_gaussian": denoise_scipy_gaussian,
    "skimage_bilateral": denoise_skimage_bilateral,
}

# Step 3: honest timing (warm-up, then best of N)
def timed(fn, img, repeats=3):
    fn(img)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        out = fn(img)
        best = min(best, time.perf_counter() - t0)
    return out, best * 1000.0

# Step 4: PSNR plus two SSIM protocols
def score(clean, restored):
    psnr = peak_signal_noise_ratio(clean, restored, data_range=1.0)
    ssim_default = structural_similarity(clean, restored, data_range=1.0)
    ssim_paper = structural_similarity(
        clean, restored, data_range=1.0,
        gaussian_weights=True, sigma=1.5, use_sample_covariance=False)
    return psnr, ssim_default, ssim_paper

# Step 5: run the grid
pairs = make_pairs()
rows = []
for img_name, (clean, noisy) in pairs.items():
    for method_name, fn in denoisers.items():
        restored, ms = timed(fn, noisy)
        psnr, ssim_d, ssim_p = score(clean, restored)
        rows.append({
            "image": img_name, "method": method_name,
            "psnr": round(psnr, 4), "ssim_default": round(ssim_d, 4),
            "ssim_paper": round(ssim_p, 4), "time_ms": round(ms, 1),
        })

# Step 6: leaderboard + reproducible CSV
by_method = {}
for r in rows:
    by_method.setdefault(r["method"], []).append(r)
print(f"{'method':20s} {'SSIM(paper)':>12s} {'time_ms':>10s}")
for method, recs in sorted(
        by_method.items(),
        key=lambda kv: statistics.mean(r['ssim_paper'] for r in kv[1]),
        reverse=True):
    s = statistics.mean(r["ssim_paper"] for r in recs)
    t = statistics.mean(r["time_ms"] for r in recs)
    print(f"{method:20s} {s:12.4f} {t:10.1f}")

with open("results.csv", "w", newline="") as f:
    f.write(f"# skimage={skimage.__version__} opencv={cv2.__version__}\n")
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
print("wrote results.csv")