Capstone

Capstone Project: An End-to-End Vision System

Design, build, evaluate, and present a production-grade vision application that spans all four parts of the book: classical preprocessing, geometry, a fine-tuned detector or segmenter, and a generative synthetic-data engine, all held to honest evaluation and a real deployment.

"I have inspected a hundred thousand circuit boards and approved most of them. The ones I remember are the six I got wrong, which is, now that I think about it, exactly the per-class recall problem you are about to spend a month on."

A Battle-Hardened Inspection Camera

Big Picture

One system, four parts, no shortcuts. Every chapter in this book taught a piece of the vision stack in isolation; the capstone asks you to assemble the whole machine. You will build an automated visual inspection system in which classical preprocessing from Part I normalizes the pixels, the geometry of Part II normalizes the viewpoint, a fine-tuned model from Part III finds the defects, and a generative engine from Part IV manufactures the training data that reality refuses to provide. The thread that holds it together is honest evaluation: a frozen real test set that nothing synthetic ever touches, and an ablation that proves (or disproves) that the generative engine earned its place.

1. The Project Brief

Your client is an electronics manufacturer. Boards pass under a fixed downward-facing camera on a conveyor, one every few seconds. Your job is an automated optical inspection (AOI) system that flags defective boards before they ship: missing components, solder bridges, scratches, open traces, mousebites, misaligned parts. The system must localize each defect with a box or mask, assign a class label, render a board-level pass or fail verdict, and do it all inside a fixed latency budget on hardware the client can afford.

The recommended domain is printed circuit board inspection because it exercises every part of the book and because public data exists to bootstrap from: the DeepPCB dataset provides 1,500 aligned template and test image pairs with six defect classes, and the MVTec AD benchmark includes high-resolution industrial categories for anomaly-style evaluation. Section 10 adapts the same brief to agriculture, retail, and medical imaging. Whatever the domain, the constraint that shapes the whole project stays the same: the interesting defects are rare. You will have hundreds of images of good boards, dozens of common defects, and perhaps five examples of the class that matters most. That scarcity is the design problem the capstone exists to teach.

The complete deliverable is a working system plus the evidence that it works: a repository, a sealed evaluation report, a deployed inference artifact with measured latency, and a presentation. The four milestones below map one-to-one onto the four parts of the book; each lists deliverables, acceptance criteria, and the supporting chapters.

Practical Example: Why False Calls Are the Silent Killer

At a contract electronics plant, every board the AOI flags goes to a human re-inspection station. A newly commissioned line posted a healthy 97% defect recall, and management celebrated. Six weeks later re-inspection had a four-hour backlog: the false-call rate was 8%, so at 1,800 boards per shift the system raised 144 false alarms every shift. The fix was not a better model; it was a calibrated operating point chosen on validation, per-class thresholds for the two noisiest classes, and a weekly report tracking false calls per million joints. Production vision systems live or die on the boring metrics. Your capstone will be graded accordingly.
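
The "calibrated operating point chosen on validation" can be made concrete with a short threshold search. The sketch below is one plausible policy, not the plant's actual procedure: it takes the lowest threshold whose validation false-call rate stays under a cap, which gives the highest recall available inside that budget. The function name `pick_threshold`, the 15% cap, and the toy scores are assumptions for illustration.

```python
import numpy as np

def pick_threshold(scores, is_defect, max_false_call_rate):
    """Lowest score threshold whose false-call rate on validation stays under the cap.

    scores: per-board maximum detector score; is_defect: ground-truth board labels.
    A lower threshold flags more boards, so the lowest admissible one gives the
    highest recall available within the false-call budget.
    """
    scores = np.asarray(scores, dtype=float)
    is_defect = np.asarray(is_defect, dtype=bool)
    good = scores[~is_defect]
    bad = scores[is_defect]
    for t in np.unique(scores):                      # candidates in ascending order
        false_call_rate = float(np.mean(good >= t)) if good.size else 0.0
        if false_call_rate <= max_false_call_rate:
            recall = float(np.mean(bad >= t)) if bad.size else 0.0
            return float(t), recall, false_call_rate
    return float("inf"), 0.0, 0.0                    # no admissible threshold: flag nothing


# Toy validation set (hypothetical numbers): 8 good boards, 4 defective ones
val_scores = [0.05, 0.10, 0.12, 0.20, 0.30, 0.35, 0.55, 0.60, 0.40, 0.70, 0.80, 0.90]
val_defect = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
threshold, recall, fcr = pick_threshold(val_scores, val_defect, max_false_call_rate=0.15)

# The arithmetic from the example: an 8% false-call rate at 1,800 boards per shift
false_calls_per_shift = round(0.08 * 1800)
```

On the toy data the search settles on a threshold of 0.6, trading one missed defect for a false-call rate of one board in eight. Running the same search per class is a natural way to obtain the per-class thresholds the example mentions.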

2. Milestone 1: Normalize the Pixels (Part I)

Raw frames off a line camera are not ready for anything. Illumination falls off toward the corners, sensor noise rises with gain, and white balance drifts as the LED panel ages. Milestone 1 builds the deterministic preprocessing front end that turns whatever the camera produces into the stable, comparable image every later stage assumes. This is Part I applied end to end: image formation and the ISP pipeline from Chapter 1, histogram tools and contrast normalization from Chapter 2, denoising filters from Chapter 3, and the restoration mindset of Chapter 7.
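
As a minimal sketch of the first stage such a front end might contain, the code below applies classic flat-field correction to remove corner falloff. It assumes you can capture a `flat` frame of a uniform white target and a `dark` frame with the lens capped; the function name and the synthetic falloff model are illustrative, not a prescribed design.

```python
import numpy as np

def flat_field_correct(raw, flat, dark):
    """Classic flat-field correction: (raw - dark) / (flat - dark), rescaled.

    raw:  frame to correct
    flat: frame of a uniform white target under the same lighting
    dark: frame captured with the lens capped (sensor offset)
    """
    raw, flat, dark = (np.asarray(a, dtype=np.float64) for a in (raw, flat, dark))
    gain = flat - dark
    gain = np.where(gain < 1e-6, 1e-6, gain)         # guard against dead pixels
    corrected = (raw - dark) / gain * gain.mean()     # keep the original brightness scale
    return np.clip(corrected, 0, 255)


# Synthetic check: a uniform gray scene seen through radial falloff
h, w = 64, 96
yy, xx = np.mgrid[0:h, 0:w]
r2 = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / ((h / 2) ** 2 + (w / 2) ** 2)
falloff = 1.0 - 0.4 * r2                             # 40% darker in the far corner
dark = np.full((h, w), 4.0)
flat = 200.0 * falloff + dark
raw = 120.0 * falloff + dark

corrected = flat_field_correct(raw, flat, dark)
spread_before = raw.max() - raw.min()
spread_after = corrected.max() - corrected.min()
```

The brightness spread across the frame before and after correction is exactly the kind of measured quantity an acceptance criterion can be written against, rather than a correction judged by eye.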

Deliverables

Acceptance criteria

3. Milestone 2: Normalize the Geometry (Part II)

Defect detection by comparison only works if every board occupies exactly the same pixels. Milestone 2 builds the geometric normalization chain: calibrate the camera and undistort the image using Zhang's method from Chapter 12, rectify the board plane to a canonical metric frame with a homography as in Chapter 13 and the warping machinery of Chapter 5, then fine-register each board against a golden reference image using the keypoint matching and RANSAC pipeline of Chapter 10. With alignment in hand, a classical baseline detector falls out almost for free: difference against the golden board, threshold, and clean up with the morphology of Chapter 6, in the spirit of the template-comparison pipelines of Chapter 16.

import cv2
import numpy as np

PX_PER_MM = 10                                  # assumed output scale; set it from your optics
OUT_SIZE = (160 * PX_PER_MM, 100 * PX_PER_MM)   # (width, height) of the 160 x 100 mm board frame
golden = cv2.imread("golden_rect.png")          # assumed path: the golden board, already rectified

# Undistort with the intrinsics from your calibration session (Chapter 12)
K, dist = np.load("calib_K.npy"), np.load("calib_dist.npy")
img = cv2.undistort(cv2.imread("board_raw.png"), K, dist)

# Rectify the board plane: 4 fiducial centers map to a metric board frame
# detect_fiducials is your Chapter 6 blob detector; it must return 4 (x, y)
# centers in the same order as board_mm
fid_px = np.float32(detect_fiducials(img))
board_mm = np.float32([[0, 0], [160, 0], [160, 100], [0, 100]])
H = cv2.getPerspectiveTransform(fid_px, board_mm * PX_PER_MM)
rect = cv2.warpPerspective(img, H, OUT_SIZE)

# Fine registration against the golden board (Chapter 10)
orb = cv2.ORB_create(2000)
k1, d1 = orb.detectAndCompute(rect, None)
k2, d2 = orb.detectAndCompute(golden, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
src = np.float32([k1[m.queryIdx].pt for m in matches])
dst = np.float32([k2[m.trainIdx].pt for m in matches])
H_fine, inl = cv2.findHomography(src, dst, cv2.RANSAC, 2.0)
if H_fine is None:                              # RANSAC found no model: report it, never drop it silently
    raise RuntimeError("fine registration failed for this board")
aligned = cv2.warpPerspective(rect, H_fine, golden.shape[1::-1])

# Log the median inlier residual: this number becomes a health metric later
proj = cv2.perspectiveTransform(src[None], H_fine)[0]
residual = np.median(np.linalg.norm(proj - dst, axis=1)[inl.ravel() == 1])
Milestone 2 core: undistortion, fiducial-based rectification, and ORB plus RANSAC fine registration against the golden board, with the median inlier residual logged as the alignment health metric the monitoring stage will watch in Section 7.
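
The classical baseline described above (difference against the golden board, threshold, clean up with morphology) can be sketched in a few lines. The version below is deliberately NumPy-only so that it runs without OpenCV; in the project itself you would more likely use the OpenCV morphology routines from Chapter 6. The threshold of 40 gray levels, the 3x3 structuring element, and the helper names are assumptions.

```python
import numpy as np

def shift_stack(mask, radius=1):
    """All (2r+1)^2 shifted copies of a boolean mask, zero-padded at the borders."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    size = 2 * radius + 1
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(size) for dx in range(size)])

def binary_open(mask, radius=1):
    """Erosion then dilation with a square structuring element."""
    eroded = shift_stack(mask, radius).all(axis=0)
    return shift_stack(eroded, radius).any(axis=0)

def golden_diff(aligned, golden, thresh=40, radius=1):
    """Absolute difference against the golden board, threshold, morphological clean-up."""
    diff = np.abs(aligned.astype(np.int16) - golden.astype(np.int16))
    return binary_open(diff > thresh, radius)


# Toy grayscale boards: one real 6x6 defect plus a single-pixel noise speck
golden = np.full((40, 40), 100, dtype=np.uint8)
aligned = golden.copy()
aligned[10:16, 20:26] = 220        # defect blob
aligned[30, 5] = 220               # isolated speck that the opening should remove

defect_mask = golden_diff(aligned, golden)
ys, xs = np.nonzero(defect_mask)
bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
```

The opening removes the isolated speck and keeps the blob, so the only surviving region is the real defect. On real boards the same opening also suppresses the thin edge halos that sub-pixel misregistration leaves along copper boundaries, which is why the registration residual and this baseline's false-call rate tend to move together.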

Deliverables

Acceptance criteria

Fun Note

Every inspection team eventually develops a quasi-religious relationship with the golden board: it lives in a labeled antistatic bag, only two people may touch it, and when someone finally drops it the incident gets a name. Treat yours well; the entire geometric stack is anchored to that one object, which is why Section 9 worries about drift.

4. Milestone 3: Learn the Defects (Part III)

With pixels and geometry normalized, the learned stage has a fair fight. Milestone 3 fine-tunes a detector or segmenter on your real defect data: a one-stage detector from Chapter 23 if boxes are enough, or a segmentation model from Chapter 24 if defect area matters (it usually does for scratches and solder spill). Transfer learning is mandatory, not optional: with a few hundred labeled instances you are firmly in the fine-tuning regime of Chapter 21, and the augmentation policy you choose there must respect the geometry you just normalized (random rotations of a rectified board are a lie; small photometric jitter is honest).

Before any training run, split the data and seal the test set. Split by physical board, never by image crop, so two views of the same board cannot land on both sides of the line. The test split is real images only, drawn from multiple capture sessions, and from this moment forward it is read-only: no model selection, no threshold tuning, nothing. Section 6 builds the evaluation protocol on this frozen set; Section 9 catalogs how teams contaminate it without noticing.
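
One way to make "split by physical board" hard to get wrong is to derive the split from the board serial itself, so that every crop of a board inherits the same assignment no matter when it is processed. The sketch below assumes each image record carries a serial number; the salt, the split fractions, and the file layout are hypothetical.

```python
import hashlib
from collections import defaultdict

def split_for_board(serial, test_frac=0.2, val_frac=0.1, salt="capstone-v1"):
    """Deterministically assign a physical board to a split from its serial number."""
    digest = hashlib.sha256(f"{salt}:{serial}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0x100000000          # uniform in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

def split_images(records):
    """records: iterable of (image_path, board_serial). Returns split -> list of paths."""
    splits = defaultdict(list)
    for path, serial in records:
        splits[split_for_board(serial)].append(path)
    return dict(splits)


# Hypothetical file layout: three crops per board, 50 boards
records = [(f"crops/{b:03d}_{c}.png", f"SN{b:03d}") for b in range(50) for c in range(3)]
splits = split_images(records)

# Audit: collect the board ids present in each split; the sets must be disjoint
boards_per_split = {name: {p.split("/")[1][:3] for p in paths}
                    for name, paths in splits.items()}
```

A hash-based split does not by itself guarantee that the test split spans multiple capture sessions or contains every rare class, so check both after splitting and adjust the salt or stratify by hand if needed.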

Library Shortcut

Chapter 23 built detection from anchors and losses upward; in the capstone you should stand on a maintained implementation. The Ultralytics API reduces the few hundred lines of a custom training loop, loss wiring, and mAP evaluator to three calls, and handles augmentation, EMA weights, mosaic scheduling, and export internally:

from ultralytics import YOLO

model = YOLO("yolo11s.pt")                    # pretrained weights, Chapter 21 style
model.train(data="pcb_defects.yaml", epochs=100, imgsz=1024)
metrics = model.val(split="test")             # frozen test split: sealed runs only, never for tuning
Fine-tuning the capstone detector with Ultralytics: pretrained weights, one dataset config, and a test-split evaluation reserved for the sealed, once-per-milestone runs described in Section 6.

Deliverables

Acceptance criteria

5. Milestone 4: Generate the Rare Defects (Part IV)

Your rarest class has five real examples, and no augmentation policy multiplies five into five hundred. Milestone 4 builds a generative data engine that manufactures the missing training data: take real, defect-free board crops from the training split only, and use diffusion inpainting from Chapter 35, powered by the models of Chapter 33, to paint plausible defects into them at controlled locations. Because you choose the inpainting mask, every synthetic sample is born with a pixel-accurate label, which neatly solves annotation at the same time. For finer structural control, condition the generator on an edge map or printed-layout rendering of the local region with a ControlNet adapter (Zhang et al., 2023), so that synthesized solder bridges actually connect two pads instead of floating in space.

import cv2
import numpy as np
import torch
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16).to("cuda")

# crop: PIL image of a defect-free region of a REAL rectified board (training split only)
# mask: PIL image, white where the rare defect should appear, e.g. spanning two adjacent pads
out = pipe(
    prompt="macro photograph of a printed circuit board with a solder bridge "
           "short circuit between two adjacent component pins, sharp focus",
    negative_prompt="cartoon, illustration, painting, blurry",
    image=crop, mask_image=mask,
    strength=0.99, guidance_scale=7.0,
    num_inference_steps=30).images[0]

# The mask that placed the defect IS the annotation
mask_u8 = (np.array(mask.convert("L")) > 0).astype(np.uint8)   # boundingRect needs one channel
x, y, w, h = cv2.boundingRect(mask_u8)
Milestone 4 core: diffusion inpainting paints a rare defect into a real defect-free board crop, and the inpainting mask doubles as a pixel-accurate label, so every synthetic sample arrives pre-annotated.
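
The listing above takes the mask as given. As a hedged illustration of where such a mask might come from, the sketch below builds one that spans the gap between two horizontally adjacent pads and then derives the box label from it. The pad boxes, the 2-pixel margin, and both helper names are assumptions; in practice the pad coordinates would come from your board layout data.

```python
import numpy as np

def bridge_mask(pad_a, pad_b, shape, margin=2):
    """Binary inpainting mask covering the gap between two pads plus a margin.

    pad_a, pad_b: (x, y, w, h) boxes in crop pixel coordinates, side by side.
    shape: (height, width) of the crop.
    """
    left, right = sorted([pad_a, pad_b], key=lambda p: p[0])
    x0 = left[0] + left[2] - margin                  # start just inside the left pad
    x1 = right[0] + margin                           # end just inside the right pad
    y0 = max(left[1], right[1])                      # vertical overlap of the two pads
    y1 = min(left[1] + left[3], right[1] + right[3])
    if x1 <= x0 or y1 <= y0:
        raise ValueError("pads do not face each other horizontally")
    mask = np.zeros(shape, dtype=np.uint8)
    mask[max(y0, 0):min(y1, shape[0]), max(x0, 0):min(x1, shape[1])] = 255
    return mask

def mask_to_box(mask):
    """Bounding box (x, y, w, h) of the nonzero mask pixels: the detection label."""
    ys, xs = np.nonzero(mask)
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))


# Two hypothetical pads in a 64x64 crop, 8 px apart
pad_a, pad_b = (10, 20, 12, 16), (30, 22, 12, 16)
mask = bridge_mask(pad_a, pad_b, shape=(64, 64))
label_box = mask_to_box(mask)
```

Letting the mask overlap each pad by a small margin gives the generator room to attach the bridge to copper on both sides, which is the same structural constraint the ControlNet conditioning is meant to enforce.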

Generation is the easy half; curation is where the milestone is won. Build a quality gate that rejects implausible samples before they reach training: automatic filters first (the defect region must differ measurably from the original crop, and a distributional distance such as FID or KID between synthetic and real defect crops must stay under a bound you fix on validation; with only a handful of real rare-class crops, KID is usually the steadier estimate), then a fast human pass over a random sample. Sweep the mixing ratio: real-only, then real plus 1x, 2x, and 5x synthetic for the rare classes, and let validation pick. The treatment of generative data engines in Chapter 37 is the playbook for this milestone.
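
A minimal sketch of the cheapest automatic filter in such a gate follows. It checks that the masked region actually changed and, as an added assumption beyond the text above, that the background outside the mask stayed close to the real crop. Both thresholds are hypothetical and would need tuning on your own data; the FID or KID check is not implemented here.

```python
import numpy as np

def passes_gate(original, generated, mask, min_inside=12.0, max_outside=3.0):
    """Cheap automatic filter for one inpainted sample.

    Accept only if the masked region changed visibly (a defect was painted)
    and the unmasked background stayed close to the real crop.
    Thresholds are mean absolute differences in 8-bit gray levels.
    """
    diff = np.abs(generated.astype(np.float64) - original.astype(np.float64))
    if diff.ndim == 3:
        diff = diff.mean(axis=2)                     # collapse color channels
    inside = mask > 0
    changed_inside = diff[inside].mean()
    changed_outside = diff[~inside].mean()
    return bool(changed_inside >= min_inside and changed_outside <= max_outside)

def synthetic_budget(n_real_rare, ratios=(0, 1, 2, 5)):
    """Synthetic sample counts for the sweep: real-only, then 1x, 2x, 5x."""
    return {r: r * n_real_rare for r in ratios}


rng = np.random.default_rng(0)
original = rng.integers(90, 110, size=(32, 32)).astype(np.uint8)
mask = np.zeros((32, 32), dtype=np.uint8)
mask[12:20, 8:24] = 255

good = original.copy()
good[12:20, 8:24] = 200                              # clear change inside the mask
unchanged = original.copy()                          # generator ignored the prompt
leaky = np.full_like(original, 200)                  # generator repainted everything

verdicts = [passes_gate(original, g, mask) for g in (good, unchanged, leaky)]
budget = synthetic_budget(5)
```

Logging how many samples each filter rejects gives you the rejection statistics the rubric asks for, and it shows quickly whether a prompt or a mask shape is wasting generation time.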

Deliverables

Acceptance criteria

Research Frontier

Synthetic defect generation is an active research area, not a settled recipe. AnomalyDiffusion (Hu et al., AAAI 2024) learns to generate anomaly image and mask pairs from only a handful of real examples, directly targeting the few-shot regime this milestone lives in. ControlNet (Zhang et al., ICCV 2023) remains the standard tool for structure-conditioned compositing when defects must respect board layout. On the recognition side, WinCLIP (Jeong et al., CVPR 2023) showed that CLIP-style foundation models from Chapter 25 can score industrial anomalies zero-shot, and a strong capstone variation is to benchmark such a zero-shot scorer against your supervised detector on the same frozen test set.

6. The Metrics Ladder: Honest Evaluation

Evaluation in this project is a ladder, and every rung answers a different stakeholder. At the bottom, pixel-level and box-level scores (IoU, per-detection precision and recall) tell you whether the model localizes defects. The middle rung, per-class F1 at a fixed operating point, tells the team whether each defect type is covered; macro averages hide exactly the rare classes this project is about, so the per-class table is the primary artifact. The top rung speaks the client's language: board-level pass/fail confusion, escape rate (defective boards passed), and false-call rate (good boards flagged). Report all three rungs; argue from the top one.
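
The top rung is simple enough to compute by hand, which is a good reason to pin its definitions down in code. The sketch below assumes a board fails if any detection clears the operating point, and it normalizes the escape rate by defective boards and the false-call rate by good boards; some plants normalize by all boards inspected instead, so state your choice in the report.

```python
def board_level_metrics(truth_defective, flagged):
    """Board-level confusion plus the two rates the client cares about.

    truth_defective[i]: board i truly has at least one defect.
    flagged[i]:         the system failed board i.
    Escape rate     = defective boards passed / defective boards.
    False-call rate = good boards flagged / good boards.
    """
    tp = fp = fn = tn = 0
    for is_bad, is_flagged in zip(truth_defective, flagged):
        if is_bad and is_flagged:
            tp += 1
        elif is_bad:
            fn += 1
        elif is_flagged:
            fp += 1
        else:
            tn += 1
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "escape_rate": fn / (tp + fn) if tp + fn else 0.0,
        "false_call_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

def board_verdict(detection_scores, threshold):
    """A board fails if any detection on it clears the operating point."""
    return any(score >= threshold for score in detection_scores)


# Five hypothetical boards: per-board detection scores and ground truth
detections = [[0.9, 0.2], [], [0.3], [0.7], [0.1]]
truth = [True, False, True, False, False]
flags = [board_verdict(d, threshold=0.5) for d in detections]
report = board_level_metrics(truth, flags)
```

In this toy run one of the two defective boards escapes and one of the three good boards is false-called, so the report gives an escape rate of 0.5 and a false-call rate of one third.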

Key Insight

The frozen real test set is the only source of truth, and synthetic data never touches it. Synthetic images may enter training and only training. A test set containing generated samples measures how well your detector recognizes your generator's habits, a quantity with no relationship to escape rate on the line. Likewise, every time you peek at test numbers to make a decision, the set stops being a test set and becomes a slow validation set. Evaluate on it once per milestone, from one script, and let the validation split absorb all your curiosity.

The ablation is the scientific core of the capstone: it isolates what the generative engine of Milestone 4 actually bought. Hold everything fixed (architecture, recipe, schedule, operating point, matching rule, test set) and vary only the training data. Compute both conditions in one pass of one script so the comparison cannot silently drift apart, and attach bootstrap confidence intervals so a 2-point F1 delta on a 40-image class is read with appropriate suspicion:

from sklearn.metrics import f1_score
import numpy as np

def per_class_f1(y_true, y_pred, classes, n_boot=2000, seed=0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)   # needed for index resampling
    rng = np.random.default_rng(seed)      # same seed: both models see the same resamples
    point = f1_score(y_true, y_pred, labels=classes,
                     average=None, zero_division=0)
    idx = np.arange(len(y_true))
    boots = [f1_score(y_true[s], y_pred[s], labels=classes,
                      average=None, zero_division=0)
             for s in (rng.choice(idx, len(idx), replace=True)
                       for _ in range(n_boot))]
    lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)
    return point, lo, hi

# Same frozen test set, same matching rule, same operating point, one run:
f1_real, lo_r, hi_r = per_class_f1(y_true, pred_real_only, CLASSES)
f1_aug,  lo_a, hi_a = per_class_f1(y_true, pred_real_plus_synth, CLASSES)
The ablation evaluator: per-class F1 with bootstrap confidence intervals, computed in a single pass on the same frozen real test set for both the real-only and the real-plus-synthetic models, so the two columns of the final table can never come from mismatched configurations.

The expected (and publishable) shape of the result: synthetic data moves the rare classes substantially, moves the common classes little or not at all, and past some mixing ratio begins to hurt as the generator's quirks start outvoting reality. If your ablation shows something else, Section 9's first failure mode is the place to start debugging.

7. Deployment: ONNX, Latency, and Monitoring

A checkpoint on a workstation is a prototype; the capstone ships. Export the trained model to ONNX and serve it with a runtime, following Chapter 28. Write the latency budget down before you measure: if a board arrives every 2 seconds and the line tolerates one board of buffering, your end-to-end budget might be 500 ms, decomposed into capture (60 ms), preprocessing and rectification (80 ms), inference, and postprocessing with verdict logic (40 ms). What remains, 320 ms in this example, is the inference budget, and the number that must fit inside it is the 95th percentile, not the mean: a line stalls on tail latency.
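
The budget arithmetic above is worth writing down as code so that it lives next to the benchmark. The sketch below uses the example's stage timings; the latency samples are hypothetical and chosen to show a model whose mean fits the budget while its tail does not. The percentile here uses the nearest-rank definition, which can differ slightly from the interpolated value `np.percentile` reports.

```python
def inference_budget_ms(end_to_end_ms, fixed_stages_ms):
    """Whatever the fixed stages leave over is the ceiling for inference p95."""
    remaining = end_to_end_ms - sum(fixed_stages_ms.values())
    if remaining <= 0:
        raise ValueError("fixed stages already exceed the end-to-end budget")
    return remaining

def fits_budget(latencies_ms, budget_ms, percentile=95):
    """Nearest-rank percentile check: the tail, not the mean, must fit."""
    ordered = sorted(latencies_ms)
    rank = max(1, -(-percentile * len(ordered) // 100))   # ceil(p * n / 100)
    return ordered[rank - 1] <= budget_ms


stages = {"capture": 60, "preprocess_rectify": 80, "postprocess_verdict": 40}
budget = inference_budget_ms(500, stages)

# Hypothetical measurements: fast on average, with a slow tail
samples = [250] * 90 + [400] * 10
mean_ms = sum(samples) / len(samples)
ok = fits_budget(samples, budget)
```

Here the mean of 265 ms sits comfortably inside the 320 ms budget, yet the check fails because one run in ten takes 400 ms.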

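import time

import numpy as np
import onnxruntime as ort

# Ultralytics exporter; export() returns the path of the file it wrote.
# half=True needs a CUDA device at export time and produces FP16 inputs.
onnx_path = model.export(format="onnx", imgsz=1024, half=True, device=0)
sess = ort.InferenceSession(onnx_path,
                            providers=["CUDAExecutionProvider"])
# ONNX Runtime can fall back to CPU silently, which would invalidate the timing
assert "CUDAExecutionProvider" in sess.get_providers()

inp = sess.get_inputs()[0]                                # read name and dtype from the graph
dtype = np.float16 if "float16" in inp.type else np.float32
x = np.random.rand(1, 3, 1024, 1024).astype(dtype)
for _ in range(20):                                       # warm-up runs
    sess.run(None, {inp.name: x})
times = []
for _ in range(200):
    t0 = time.perf_counter()
    sess.run(None, {inp.name: x})
    times.append((time.perf_counter() - t0) * 1000)
print(f"p50={np.percentile(times, 50):.1f} ms  "
      f"p95={np.percentile(times, 95):.1f} ms")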
Deployment gate: ONNX export via the Ultralytics exporter and a 200-run latency benchmark with ONNX Runtime; the p95 figure, not the mean, is what gets compared against the line-rate budget.

Tip

If the p95 misses the budget, climb the efficiency toolbox of Chapter 28 in cost order: smaller input resolution (often free if defects are large relative to the new pixel pitch), FP16, then INT8 quantization with a calibration set, then a smaller backbone. Re-check per-class F1 on the validation split after every step, and re-run the frozen-test evaluation only on the artifact you intend to ship; a quantized model is a new model, and its per-class F1 must be re-earned, not assumed.

Production Pattern: Monitor the Inputs, Not Just the Outputs

Model scores drift last; inputs drift first. Instrument the pipeline to log, per board: mean and standard deviation of background brightness (Milestone 1's health), the median registration residual from the Milestone 2 code (geometry's health), the score distribution of the detector, and the daily false-call and escape counters from re-inspection. Add a canary: a known golden board run through the full pipeline once per shift, with alarms on any metric leaving its commissioning band. This pattern, from Chapter 28's monitoring section, is what converts Section 9's calibration-drift failure from a bad month into a same-day alert.
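
A canary check along these lines needs very little code. The sketch below derives an alarm band per metric from repeated golden-board runs at commissioning and reports which metrics leave their band on a later run. The metric names, the band of four standard deviations, and the sample values are assumptions, not recommended settings.

```python
def commissioning_bands(baseline_runs, k=4.0):
    """Per-metric alarm band: mean +/- k standard deviations over commissioning runs."""
    bands = {}
    for name in baseline_runs[0]:
        values = [run[name] for run in baseline_runs]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / max(len(values) - 1, 1)
        half_width = k * var ** 0.5
        bands[name] = (mean - half_width, mean + half_width)
    return bands

def canary_alarms(metrics, bands):
    """Names of the metrics that left their commissioning band on this canary run."""
    return sorted(name for name, value in metrics.items()
                  if not bands[name][0] <= value <= bands[name][1])


# Hypothetical commissioning data from repeated golden-board runs
baseline = [
    {"bg_mean": 128.0, "bg_std": 3.0, "reg_residual_px": 0.30},
    {"bg_mean": 129.0, "bg_std": 3.2, "reg_residual_px": 0.34},
    {"bg_mean": 127.0, "bg_std": 2.8, "reg_residual_px": 0.32},
    {"bg_mean": 128.5, "bg_std": 3.1, "reg_residual_px": 0.31},
]
bands = commissioning_bands(baseline)

healthy = canary_alarms({"bg_mean": 128.2, "bg_std": 3.0, "reg_residual_px": 0.33}, bands)
bumped = canary_alarms({"bg_mean": 128.2, "bg_std": 3.0, "reg_residual_px": 1.90}, bands)
```

Four commissioning runs is far too few for a real band; the point of the sketch is the shape of the check, with the alarm naming the stage whose health metric moved.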

8. Marking Rubric

The rubric below weights the system the way production does: evidence (evaluation and ablation) carries 20% of the grade, as much as the learned detector itself. Graders should be able to reproduce every number from the repository alone.

Table C.1: Capstone marking rubric: components, weights, and what separates full marks from common deductions.

| Component | Weight | Full marks look like | Frequent deductions |
| --- | --- | --- | --- |
| Milestone 1: imaging pipeline | 10% | Measured acceptance criteria met; deterministic, configurable stages | Hand-tuned constants with no measurement; corrections justified by eye |
| Milestone 2: geometry & registration | 15% | Sub-pixel calibration, residuals logged, classical baseline reported | Alignment failures silently dropped; no baseline to compare against |
| Milestone 3: learned detector | 20% | Clean splits by board, defensible operating point, per-class table | Split by image crop; threshold tuned on the test set |
| Milestone 4: generative engine | 15% | Quality gate with rejection stats; mixing ratio chosen on validation | Ungated generation dumped into training; ratio chosen on test |
| Honest evaluation & ablation | 20% | Frozen real test set, one-pass ablation, bootstrap CIs, all three ladder rungs | Synthetic images in test; ablation conditions evaluated from different scripts |
| Deployment & monitoring | 10% | ONNX artifact, p95 inside a written budget, input-drift instrumentation | Mean latency only; no monitoring plan beyond "watch the scores" |
| Presentation & report | 10% | Claims traceable to artifacts; failure gallery; limitations as scoped boundaries | Demo-only presentation with no numbers; metrics defined nowhere |

9. Common Failure Modes

Three failures account for most weak capstones. All are detectable early, and each maps back to one milestone's discipline.

9.1 The synthetic-to-real gap

Warning: When the Generator Trains the Detector to See Generations

Symptom: validation scores on synthetic-heavy data look wonderful while the frozen real test set barely moves, or rare-class precision drops because the detector now fires on textures your generator favors. Diagnosis: compare FID or KID between synthetic and real defect crops (Chapter 37) and visually difference matched pairs; over-clean edges, repeated micro-textures, and color shifts are the usual tells. Remedies, in order: inpaint into real backgrounds rather than generating whole images, pass synthetic samples through the same Milestone 1 preprocessing so they inherit the sensor's noise, tighten the quality gate, and cap the mixing ratio at what validation supports. The gap shrinks when the generator is forced to live inside real pixels.

9.2 Test-set contamination

Warning: The Leak You Built Yourself

Contamination rarely looks like cheating; it looks like convenience. Two crops of the same physical board land in train and test because the split was done per image. A synthetic sample is generated from a background crop whose source board is in the test split. A threshold gets nudged after a test run "just to see". Each leak inflates every number downstream and is invisible in the final report. Defenses: split by board serial before any other processing, deduplicate near-identical images across splits with a perceptual hash, generate synthetic data exclusively from training-split sources, and treat the test evaluation script's run count as an auditable fact (once per milestone). If a number looks too good, assume contamination before brilliance.
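
The perceptual-hash defense can be sketched with a plain average hash. This is one simple choice among several perceptual hashes, and the 5-bit distance threshold is an assumption to tune on known duplicates. The toy "boards" are tiled patterns built without randomness so the result is easy to verify by hand.

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """64-bit average hash: block-mean downsample, then threshold at the mean."""
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    bh, bw = h // hash_size, w // hash_size
    blocks = gray[:bh * hash_size, :bw * hash_size]
    small = blocks.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def hamming(a, b):
    """Number of differing hash bits."""
    return int(np.count_nonzero(a != b))

def cross_split_duplicates(train, test, max_distance=5):
    """Pairs (train_name, test_name) whose hashes are within max_distance bits."""
    test_hashes = {name: average_hash(img) for name, img in test.items()}
    pairs = []
    for t_name, t_img in train.items():
        h = average_hash(t_img)
        pairs += [(t_name, name) for name, th in test_hashes.items()
                  if hamming(h, th) <= max_distance]
    return pairs


def toy_board(bits):
    """Hypothetical 64x64 board: an 8x8 grid of dark (50) or bright (200) tiles."""
    return np.kron(bits, np.ones((8, 8))) * 150.0 + 50.0

checker = np.add.outer(np.arange(8), np.arange(8)) % 2
top_half = np.repeat((np.arange(8) < 4)[:, None], 8, axis=1).astype(int)
board_a, board_b = toy_board(checker), toy_board(top_half)

# Same board captured again: 3% brighter with a faint column ripple
board_a_again = board_a * 1.03 + 2.0 * np.sin(np.arange(64))[None, :]

leaks = cross_split_duplicates({"train_a": board_a, "train_b": board_b},
                               {"test_a2": board_a_again})
```

The re-captured board hashes identically to its training twin despite the brightness change, while the unrelated board sits 32 bits away. Every pair the check returns should be resolved by moving both images to the same split.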

9.3 Calibration drift

Postmortem: The Tuesday the Recall Quietly Left

A deployed inspection line ran beautifully for five weeks, then rare-defect recall faded over three days. Nobody changed the model. A technician had bumped the camera mount; the shift was under a millimeter, small enough that images looked fine, large enough that rectified boards landed 6 pixels off the golden template, so features learned around pad edges were now sampling neighboring copper. The registration residual from Milestone 2's code had been rising the whole time; no one had plotted it. The fix took an hour (re-shoot the calibration target, refresh the homography); detecting the problem took three days of escaped defects. Geometry is a wear item: monitor residuals, schedule recalibration, and alarm on the canary board, exactly as Section 7 prescribes.
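
Plotting the residual would have caught this; an automatic alarm would have caught it sooner. The sketch below compares a rolling median of the per-board registration residual against the commissioning median. The window sizes, the factor of two, and the synthetic log are all assumptions for illustration.

```python
from statistics import median

def drift_alarm(residuals_px, baseline_n=50, window=20, factor=2.0):
    """Index of the first board where the rolling median residual exceeds
    factor x the commissioning median, or None if it never does."""
    baseline = median(residuals_px[:baseline_n])
    for end in range(baseline_n + window, len(residuals_px) + 1):
        if median(residuals_px[end - window:end]) > factor * baseline:
            return end - 1
    return None


# Hypothetical log: stable near 0.3 px, then a slow creep after board 200
log = [0.30 + 0.02 * ((i * 7) % 5 - 2) for i in range(200)]
log += [0.30 + 0.01 * (i + 1) for i in range(100)]

first_alarm = drift_alarm(log)
stable_alarm = drift_alarm(log[:200])
```

With this log the alarm fires at board 239, forty boards into the creep, and never fires on the stable stretch. A median is used rather than a mean so that a single badly registered board does not trip the alarm on its own.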

10. Variations for Other Domains

The architecture transfers wherever a roughly planar scene is inspected against an expectation. Keep the skeleton; swap the domain.

Note: Medical Caveats Are Not Optional

A medical variation of this capstone is a methods exercise, not a clinical tool. Synthetic pathology may be explored in training, but the evaluation set must be real, expert-annotated, and ideally multi-site; a model that learns a generator's idea of a lesion endangers patients in a way a false-called circuit board never will. Class imbalance interacts with demographics, so per-subgroup metrics join the ladder. Regulatory frameworks treat such systems as medical devices, and nothing built in a capstone approaches clinical validation. If you choose this variation, recruit a clinical advisor, state the scope boundary on the first slide, and read the governance discussion in Chapter 37 before generating a single image.

11. Presenting Your System

The presentation is fifteen minutes, and its job is to let an informed skeptic believe your numbers. A structure that works:

  1. The problem and the metric (2 minutes). One slide of line context, then define escape rate, false-call rate, and per-class F1 before showing any result. A metric defined after its value is an excuse.
  2. The system in four slides (4 minutes). One slide per milestone, each anchored by one image: the before-and-after of preprocessing, the aligned-versus-golden overlay, the detector's per-class table, a strip of gated synthetic samples beside real ones.
  3. The ablation table (3 minutes). This is the centerpiece: real-only versus real-plus-synthetic, per class, with confidence intervals, on the frozen real test set. Walk through the rare class where the engine paid for itself.
  4. Live or recorded demo (3 minutes). A board goes in, a verdict comes out, the latency counter is visible. Show one failure case and narrate why the system missed it; a failure you can explain builds more trust than ten successes.
  5. Boundaries and roadmap (2 minutes). State the validated regime as scope, not apology: board family, lighting band, defect classes covered, p95 latency on named hardware. Then the one improvement you would ship next.

The written report follows the same spine with the evidence attached: dataset cards, the sealed evaluation protocol, the ablation with CIs, the latency benchmark, and the monitoring plan. Every claim in the slides should trace to a file in the repository; graders are reviewers with a rubric.

Stretch Goals
  1. Conceptual: design the operating-point policy for a line where an escaped defect costs 500 times a false call. Where does each rung of the metrics ladder enter the decision, and what threshold would you set per class?
  2. Coding: add a zero-shot anomaly scorer in the spirit of WinCLIP using the foundation models of Chapter 25, and benchmark it against your supervised detector on the same frozen test set and the same ladder.
  3. Analysis: for your worst-performing class, trace five individual test failures through the full pipeline (preprocessed image, registration residual, detector score, verdict) and identify the stage where each was lost. Report whether the fix belongs to Milestone 1, 2, 3, or 4.

Where to Go from Here

The capstone is the last page of the story, but the book was built for revisiting: the Table of Contents jumps to any chapter, and the appendices hold the reference material your project will keep pulling on. Ship the system, present the evidence as it stands, and then do what every vision engineer does the week after a launch: start watching the monitoring dashboard.