"I have inspected a hundred thousand circuit boards and approved most of them. The ones I remember are the six I got wrong, which is, now that I think about it, exactly the per-class recall problem you are about to spend a month on."
A Battle-Hardened Inspection Camera
One system, four parts, no shortcuts. Every chapter in this book taught a piece of the vision stack in isolation; the capstone asks you to assemble the whole machine. You will build an automated visual inspection system in which classical preprocessing from Part I normalizes the pixels, the geometry of Part II normalizes the viewpoint, a fine-tuned model from Part III finds the defects, and a generative engine from Part IV manufactures the training data that reality refuses to provide. The thread that holds it together is honest evaluation: a frozen real test set that nothing synthetic ever touches, and an ablation that proves (or disproves) that the generative engine earned its place.
1. The Project Brief
Your client is an electronics manufacturer. Boards pass under a fixed downward-facing camera on a conveyor, one every few seconds. Your job is an automated optical inspection (AOI) system that flags defective boards before they ship: missing components, solder bridges, scratches, open traces, mousebites, misaligned parts. The system must localize each defect with a box or mask, assign a class label, render a board-level pass or fail verdict, and do it all inside a fixed latency budget on hardware the client can afford.
The recommended domain is printed circuit board inspection because it exercises every part of the book and because public data exists to bootstrap from: the DeepPCB dataset provides 1,500 aligned template and test image pairs with six defect classes, and the MVTec AD benchmark includes high-resolution industrial categories for anomaly-style evaluation. Section 10 adapts the same brief to agriculture, retail, and medical imaging. Whatever the domain, the constraint that shapes the whole project stays the same: the interesting defects are rare. You will have hundreds of images of good boards, dozens of common defects, and perhaps five examples of the class that matters most. That scarcity is the design problem the capstone exists to teach.
The complete deliverable is a working system plus the evidence that it works: a repository, a sealed evaluation report, a deployed inference artifact with measured latency, and a presentation. The four milestones below map one-to-one onto the four parts of the book; each lists deliverables, acceptance criteria, and the supporting chapters.
At a contract electronics plant, every board the AOI flags goes to a human re-inspection station. A newly commissioned line posted a healthy 97% defect recall, and management celebrated. Six weeks later re-inspection had a four-hour backlog: the false-call rate was 8%, so at 1,800 boards per shift the system generated 144 false alarms every shift. The fix was not a better model; it was a calibrated operating point chosen on validation, per-class thresholds for the two noisiest classes, and a weekly report tracking false calls per million joints. Production vision systems live or die on the boring metrics. Your capstone will be graded accordingly.
2. Milestone 1: Normalize the Pixels (Part I)
Raw frames off a line camera are not ready for anything. Illumination falls off toward the corners, sensor noise rises with gain, and white balance drifts as the LED panel ages. Milestone 1 builds the deterministic preprocessing front end that turns whatever the camera produces into the stable, comparable image every later stage assumes. This is Part I applied end to end: image formation and the ISP pipeline from Chapter 1, histogram tools and contrast normalization from Chapter 2, denoising filters from Chapter 3, and the restoration mindset of Chapter 7.
Deliverables
- A written capture specification: camera model, lens, working distance, exposure, gain, and lighting geometry, with the reasoning for each choice (Appendix E in the reference shelf is the hardware companion).
- A flat-field illumination correction estimated from images of a blank reference target, applied as a per-pixel gain map.
- A denoising and contrast-normalization stage with parameters justified by measurement: noise standard deviation before and after, on a flat patch.
- A before-and-after report: histograms, line profiles across the board, and a pixel-difference heatmap of two consecutive captures.
Acceptance criteria
- Background brightness variation across the field of view drops by at least half after flat-field correction (measure the standard deviation of a blank target image).
- Two consecutive captures of the same stationary board differ by a mean absolute pixel error under 2 gray levels after preprocessing.
- Every stage is deterministic and parameterized in one config file; no hand-tuned magic numbers buried in code.
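The two measurable criteria above can be checked with a few lines of NumPy. The sketch below is illustrative only: the camera is simulated (a quadratic vignette plus Gaussian noise, both assumptions rather than a model of your sensor), and the helper names `flat_field_gain` and `apply_gain` are hypothetical. What carries over to real data is the pattern: estimate the gain map from averaged blank-target frames, then measure the background standard deviation and the capture-to-capture error after correction.

```python
import numpy as np

def flat_field_gain(blank_frames, eps=1e-6):
    """Per-pixel gain map from a stack of blank-target frames (N, H, W)."""
    flat = np.mean(np.asarray(blank_frames, dtype=np.float64), axis=0)
    return flat.mean() / np.maximum(flat, eps)

def apply_gain(frame, gain):
    """Apply the gain map and return 8-bit gray levels."""
    out = np.asarray(frame, dtype=np.float64) * gain
    return np.clip(out, 0, 255).round().astype(np.uint8)

# Simulated camera (assumption): vignetting toward the corners plus sensor noise
rng = np.random.default_rng(0)
h, w = 120, 160
yy, xx = np.mgrid[0:h, 0:w]
r2 = ((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2
vignette = 1.0 - 1.2 * r2

def capture(scene):
    return np.clip(scene * vignette + rng.normal(0, 1.0, scene.shape), 0, 255)

blank = np.full((h, w), 180.0)
gain = flat_field_gain([capture(blank) for _ in range(16)])

# Criterion 1: background variation before and after correction
raw = capture(blank)
std_before = raw.std()
std_after = apply_gain(raw, gain).astype(np.float64).std()

# Criterion 2: mean absolute error between two consecutive captures
a = apply_gain(capture(blank), gain).astype(np.int16)
b = apply_gain(capture(blank), gain).astype(np.int16)
mae = np.abs(a - b).mean()

print(f"background std: {std_before:.2f} -> {std_after:.2f}")
print(f"capture-to-capture MAE: {mae:.2f} gray levels")
```

Note that the gain map amplifies noise where the illumination was weakest, so the corners end up noisier than the center; that is one reason the denoising parameters should be justified on a patch measured after correction.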
3. Milestone 2: Normalize the Geometry (Part II)
Defect detection by comparison only works if every board occupies exactly the same pixels. Milestone 2 builds the geometric normalization chain: calibrate the camera and undistort the image using Zhang's method from Chapter 12, rectify the board plane to a canonical metric frame with a homography as in Chapter 13 and the warping machinery of Chapter 5, then fine-register each board against a golden reference image using the keypoint matching and RANSAC pipeline of Chapter 10. With alignment in hand, a classical baseline detector falls out almost for free: difference against the golden board, threshold, and clean up with the morphology of Chapter 6, in the spirit of the template-comparison pipelines of Chapter 16.
import cv2
import numpy as np
# Placeholders to set from your capture spec (values and filenames are illustrative)
PX_PER_MM = 10                                  # resolution of the rectified frame
OUT_SIZE = (160 * PX_PER_MM, 100 * PX_PER_MM)   # (width, height) for a 160 x 100 mm board
golden = cv2.imread("golden_rectified.png")     # golden board, already rectified
# Undistort with the intrinsics from your calibration session (Chapter 12)
K, dist = np.load("calib_K.npy"), np.load("calib_dist.npy")
img = cv2.undistort(cv2.imread("board_raw.png"), K, dist)
# Rectify the board plane: 4 fiducial centers map to a metric board frame.
# detect_fiducials is your Chapter 6 blob detector; it must return the four
# centers in the same order as board_mm (TL, TR, BR, BL).
fid_px = np.float32(detect_fiducials(img))
board_mm = np.float32([[0, 0], [160, 0], [160, 100], [0, 100]])
H = cv2.getPerspectiveTransform(fid_px, board_mm * PX_PER_MM)
rect = cv2.warpPerspective(img, H, OUT_SIZE)
# Fine registration against the golden board (Chapter 10)
orb = cv2.ORB_create(2000)
k1, d1 = orb.detectAndCompute(rect, None)
k2, d2 = orb.detectAndCompute(golden, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
src = np.float32([k1[m.queryIdx].pt for m in matches])
dst = np.float32([k2[m.trainIdx].pt for m in matches])
H_fine, inl = cv2.findHomography(src, dst, cv2.RANSAC, 2.0)
if H_fine is None:  # RANSAC found no model: log and count this board, never drop it silently
    raise RuntimeError("registration against the golden board failed")
aligned = cv2.warpPerspective(rect, H_fine, golden.shape[1::-1])
# Log the median inlier residual: this number becomes a health metric later
proj = cv2.perspectiveTransform(src[None], H_fine)[0]
residual = np.median(np.linalg.norm(proj - dst, axis=1)[inl.ravel() == 1])
Deliverables
- A calibration report: reprojection error, recovered intrinsics, and distortion profile, following the quality checks of Chapter 12.
- The rectification and registration pipeline above, packaged as one function from raw frame to aligned board image.
- A classical baseline detector (difference, threshold, morphological cleanup, connected components) with its own per-class scores. It will lose to the learned model; the point is to know by how much.
Acceptance criteria
- Calibration RMS reprojection error below 0.5 pixels.
- Median registration residual against the golden board below 2 pixels across the whole dataset, with the failures (boards that did not align) logged and counted rather than silently dropped.
- The classical baseline achieves nontrivial recall on large defects, documented in the same report format the learned model will use.
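The classical baseline named in the deliverables (difference, threshold, morphological cleanup, connected components) fits in a short sketch. The version below is written in NumPy alone so that it runs anywhere; in the real pipeline `cv2.absdiff`, `cv2.morphologyEx`, and `cv2.connectedComponentsWithStats` do the same jobs faster. The function names, the threshold of 40 gray levels, and the minimum area of 6 pixels are illustrative assumptions to be tuned on validation, not recommended values.

```python
import numpy as np

def _neighborhood(mask):
    """Stack the 3x3 neighborhood of every pixel (False beyond the border)."""
    h, w = mask.shape
    p = np.pad(mask, 1)
    return np.stack([p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)])

def open3x3(mask):
    """Morphological opening with a 3x3 square: erosion, then dilation."""
    eroded = _neighborhood(mask).all(axis=0)
    return _neighborhood(eroded).any(axis=0)

def components(mask):
    """4-connected blobs as (x, y, width, height, area) tuples."""
    mask = mask.copy()
    h, w = mask.shape
    boxes = []
    for y0, x0 in zip(*np.nonzero(mask)):
        if not mask[y0, x0]:
            continue                      # already absorbed into an earlier blob
        mask[y0, x0] = False
        stack, ys, xs = [(int(y0), int(x0))], [], []
        while stack:                      # iterative flood fill
            y, x = stack.pop()
            ys.append(y)
            xs.append(x)
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx]:
                    mask[ny, nx] = False
                    stack.append((ny, nx))
        boxes.append((min(xs), min(ys),
                      max(xs) - min(xs) + 1, max(ys) - min(ys) + 1, len(xs)))
    return boxes

def baseline_detect(aligned, golden, thresh=40, min_area=6):
    """Difference against the golden board, threshold, clean up, extract blobs."""
    diff = np.abs(aligned.astype(np.int16) - golden.astype(np.int16))
    return [b for b in components(open3x3(diff > thresh)) if b[4] >= min_area]

# Toy example: a copper trace with a 6-pixel gap, plus one hot pixel of noise
golden = np.zeros((64, 64), dtype=np.uint8)
golden[30:34, 10:50] = 200
aligned = golden.copy()
aligned[30:34, 20:26] = 0     # open trace
aligned[5, 5] = 255           # isolated noise, removed by the opening
boxes = baseline_detect(aligned, golden)
print(boxes)
```

The opening is what keeps single-pixel registration noise out of the detections, and it is also why this baseline is expected to miss defects only a pixel or two wide: the acceptance criterion asks for nontrivial recall on large defects for exactly that reason.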
Every inspection team eventually develops a quasi-religious relationship with the golden board: it lives in a labeled antistatic bag, only two people may touch it, and when someone finally drops it the incident gets a name. Treat yours well; the entire geometric stack is anchored to that one object, which is why Section 9 worries about drift.
4. Milestone 3: Learn the Defects (Part III)
With pixels and geometry normalized, the learned stage has a fair fight. Milestone 3 fine-tunes a detector or segmenter on your real defect data: a one-stage detector from Chapter 23 if boxes are enough, or a segmentation model from Chapter 24 if defect area matters (it usually does for scratches and solder spill). Transfer learning is mandatory, not optional: with a few hundred labeled instances you are firmly in the fine-tuning regime of Chapter 21, and the augmentation policy you choose there must respect the geometry you just normalized (random rotations of a rectified board are a lie; small photometric jitter is honest).
Before any training run, split the data and seal the test set. Split by physical board, never by image crop, so two views of the same board cannot land on both sides of the line. The test split is real images only, drawn from multiple capture sessions, and from this moment forward it is read-only: no model selection, no threshold tuning, nothing. Section 6 builds the evaluation protocol on this frozen set; Section 9 catalogs how teams contaminate it without noticing.
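One way to make a by-board split hard to get wrong is to derive the split from the board serial itself, so that every crop of a board inherits the same assignment no matter when it is processed. The sketch below assumes each image record carries a board serial; the function name `split_for_board`, the salt string, and the 15% fractions are hypothetical choices for illustration. A hash-based split does not stratify by class or capture session, so check the per-split class counts afterwards.

```python
import hashlib

def split_for_board(board_serial, val_frac=0.15, test_frac=0.15, salt="capstone-v1"):
    """Deterministically map a board serial to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(f"{salt}:{board_serial}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0x100000000   # uniform in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

# Hypothetical records: (crop filename, serial of the physical board it came from)
crops = [(f"board{b:03d}_crop{c}.png", f"SN{b:03d}")
         for b in range(200) for c in range(4)]
assignment = {name: split_for_board(serial) for name, serial in crops}

# Audit: collect the boards present in each split and confirm they never overlap
boards_in = {"train": set(), "val": set(), "test": set()}
for name, serial in crops:
    boards_in[assignment[name]].add(serial)

assert boards_in["train"].isdisjoint(boards_in["test"])
assert boards_in["val"].isdisjoint(boards_in["test"])
print({split: len(boards) for split, boards in boards_in.items()})
```

Because the assignment depends only on the serial and the salt, re-running the script on a larger dataset never moves an existing board across the line, which is the property a sealed test set needs.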
Chapter 23 built detection from anchors and losses upward; in the capstone you should stand on a maintained implementation. The Ultralytics API reduces the few hundred lines of a custom training loop, loss wiring, and mAP evaluator to three calls, and handles augmentation, EMA weights, mosaic scheduling, and export internally:
from ultralytics import YOLO
model = YOLO("yolo11s.pt") # pretrained weights, Chapter 21 style
model.train(data="pcb_defects.yaml", epochs=100, imgsz=1024)
metrics = model.val(split="test") # touched exactly once, at the very end
Deliverables
- The dataset card: class definitions with example crops, labeling rules, split protocol, and per-split counts per class.
- A trained detector or segmenter with training curves, the chosen operating point, and the validation evidence behind that choice.
- Per-class precision, recall, and F1 on the frozen real test set, where $F_1 = \frac{2PR}{P+R}$, plus mAP@0.5 for detectors or per-class IoU for segmenters.
Acceptance criteria
- The learned model beats the Milestone 2 classical baseline on macro F1 on the frozen test set.
- Per-class results are reported for every class, including the rare ones where the honest number is currently poor; that gap is the motivation for Milestone 4.
- All reported numbers come from one evaluation script committed to the repository, so the grader can re-run them.
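The "chosen operating point, and the validation evidence behind that choice" can be made concrete with a threshold sweep on the validation split. The sketch below is one possible policy, not the only defensible one: it picks the lowest score threshold whose validation false-call rate stays inside a target, which maximizes recall under that constraint. The function name `pick_threshold`, the toy scores, and the 20% target are illustrative assumptions; run it once per class to obtain per-class thresholds.

```python
import numpy as np

def pick_threshold(scores, is_defect, max_false_call_rate=0.02):
    """Lowest threshold whose false-call rate stays within the target.

    An item is flagged when score >= threshold. Returns
    (threshold, recall, false_call_rate), or None if no threshold qualifies.
    """
    scores = np.asarray(scores, dtype=float)
    is_defect = np.asarray(is_defect, dtype=bool)
    best = None
    for t in np.unique(scores)[::-1]:        # sweep from strictest to loosest
        flagged = scores >= t
        fcr = np.mean(flagged[~is_defect])
        if fcr > max_false_call_rate:
            break                            # the rate only grows as t drops
        best = (float(t), float(np.mean(flagged[is_defect])), float(fcr))
    return best

# Toy validation scores: five good items followed by four defective ones
val_scores = [0.05, 0.10, 0.20, 0.30, 0.60, 0.40, 0.70, 0.80, 0.90]
val_is_defect = [False] * 5 + [True] * 4

best = pick_threshold(val_scores, val_is_defect, max_false_call_rate=0.2)
print(best)
```

Whatever policy you choose, the threshold is fixed on validation and then carried unchanged to the frozen test set; the sweep itself is part of the evidence the deliverable asks for.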
5. Milestone 4: Generate the Rare Defects (Part IV)
Your rarest class has five real examples, and no augmentation policy multiplies five into five hundred. Milestone 4 builds a generative data engine that manufactures the missing training data: take real, defect-free board crops from the training split only, and use diffusion inpainting from Chapter 35, powered by the models of Chapter 33, to paint plausible defects into them at controlled locations. Because you choose the inpainting mask, every synthetic sample is born with a pixel-accurate label, which neatly solves annotation at the same time. For finer structural control, condition the generator on an edge map or printed-layout rendering of the local region with a ControlNet adapter (Zhang et al., 2023), so that synthesized solder bridges actually connect two pads instead of floating in space.
import cv2
import numpy as np
import torch
from diffusers import AutoPipelineForInpainting
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16).to("cuda")
# crop: PIL image of a defect-free region of a REAL rectified board (training split only)
# mask: PIL image, white where the rare defect should appear, e.g. spanning two adjacent pads
out = pipe(
    prompt="macro photograph of a printed circuit board with a solder bridge "
           "short circuit between two adjacent component pins, sharp focus",
    negative_prompt="cartoon, illustration, painting, blurry",
    image=crop, mask_image=mask,
    strength=0.99, guidance_scale=7.0,
    num_inference_steps=30).images[0]
# The mask that placed the defect IS the annotation
x, y, w, h = cv2.boundingRect((np.array(mask) > 0).astype(np.uint8))
Generation is the easy half; curation is where the milestone is won. Build a quality gate that rejects implausible samples before they reach training: automatic filters first (the defect region must differ from the original crop, and a distributional check such as FID or KID between synthetic and real defect crops must stay sane), then a fast human pass over a random sample. Sweep the mixing ratio: real-only, then real plus 1x, 2x, and 5x synthetic for the rare classes, and let validation pick. The treatment of generative data engines in Chapter 37 is the playbook for this milestone.
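The first automatic filter, that the defect region must differ from the original crop, can be sketched as below. The second check in the sketch, that pixels outside the mask stay close to the original, is an added assumption rather than something the text requires; it is a cheap way to catch samples where the generator repainted the background. The function name `passes_gate` and both thresholds (in gray levels) are hypothetical placeholders to tune on your own data.

```python
import numpy as np

def passes_gate(original, generated, mask, min_inside=8.0, max_outside=3.0):
    """Accept a synthetic crop only if the masked region changed and the rest did not.

    original, generated: uint8 arrays of equal shape, (H, W) or (H, W, C)
    mask: boolean (H, W) array, True where the defect was requested
    """
    diff = np.abs(original.astype(np.float64) - generated.astype(np.float64))
    if diff.ndim == 3:
        diff = diff.mean(axis=2)           # average over color channels
    inside = diff[mask].mean()             # how much the defect region changed
    outside = diff[~mask].mean()           # how much the background leaked
    return bool(inside >= min_inside and outside <= max_outside)

# Toy stand-ins for generator output (no diffusion model is run here)
original = np.full((64, 64), 100, dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 20:40] = True

good = original.copy()
good[mask] = 200                           # visible change inside the mask only
unchanged = original.copy()                # generator painted nothing
leaked = np.clip(good.astype(np.int16) + 10, 0, 255).astype(np.uint8)  # global shift

candidates = {"good": good, "unchanged": unchanged, "leaked": leaked}
verdicts = {name: passes_gate(original, img, mask) for name, img in candidates.items()}
accepted = sum(verdicts.values())
print(f"generated={len(verdicts)} accepted={accepted} "
      f"rejected={len(verdicts) - accepted}")
```

The printed counts are the rejection statistics the deliverables ask for. A gate like this says nothing about whether the painted region looks like a real defect; that is the job of the distributional check and the human pass.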
Deliverables
- The generation pipeline with its prompt and mask strategy, plus the quality gate and its rejection statistics (how many samples were generated, how many survived).
- A synthetic dataset card mirroring the real one: per-class counts, generation parameters, and example accepted and rejected samples.
- The ablation experiment of Section 6: identical training recipe, real-only versus real plus synthetic, evaluated in one pass on the frozen real test set.
Acceptance criteria
- No synthetic image, and no real image used as inpainting source material, appears in the test split or shares a source board with it.
- The ablation reports per-class F1 deltas with bootstrap confidence intervals, computed by one script in one run, for both conditions.
- The report states where synthetic data helped, where it did nothing, and at what mixing ratio returns diminished, as measured on validation.
Synthetic defect generation is an active research area, not a settled recipe. AnomalyDiffusion (Hu et al., AAAI 2024) learns to generate anomaly image and mask pairs from only a handful of real examples, directly targeting the few-shot regime this milestone lives in. ControlNet (Zhang et al., ICCV 2023) remains the standard tool for structure-conditioned compositing when defects must respect board layout. On the recognition side, WinCLIP (Jeong et al., CVPR 2023) showed that CLIP-style foundation models from Chapter 25 can score industrial anomalies zero-shot, and a strong capstone variation is to benchmark such a zero-shot scorer against your supervised detector on the same frozen test set.
6. The Metrics Ladder: Honest Evaluation
Evaluation in this project is a ladder, and every rung answers a different stakeholder. At the bottom, pixel-level and box-level scores (IoU, per-detection precision and recall) tell you whether the model localizes defects. The middle rung, per-class F1 at a fixed operating point, tells the team whether each defect type is covered; any single average, micro-averaged scores above all, can hide exactly the rare classes this project is about, so the per-class table is the primary artifact. The top rung speaks the client's language: board-level pass/fail confusion, escape rate (defective boards passed), and false-call rate (good boards flagged). Report all three rungs; argue from the top one.
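The top rung is simple to compute once every board has a ground-truth label and a verdict. The sketch below assumes one common convention: escape rate is measured over defective boards (it equals one minus board-level recall) and false-call rate over good boards. Some plants normalize by all boards shipped instead, so state the denominator in your report. The function name `board_level_report` and the toy verdicts are illustrative.

```python
def board_level_report(has_defect, flagged):
    """Board-level confusion counts, escape rate, and false-call rate.

    has_defect: ground truth per board (True = at least one real defect)
    flagged:    system verdict per board (True = fail, sent to re-inspection)
    """
    pairs = list(zip(has_defect, flagged))
    tp = sum(1 for d, f in pairs if d and f)          # defective, caught
    fn = sum(1 for d, f in pairs if d and not f)      # defective, escaped
    fp = sum(1 for d, f in pairs if not d and f)      # good, false call
    tn = sum(1 for d, f in pairs if not d and not f)  # good, passed
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "escape_rate": fn / (tp + fn) if tp + fn else 0.0,
        "false_call_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Ten boards: four defective (one escapes), six good (one false call)
has_defect = [True, True, True, True, False, False, False, False, False, False]
flagged    = [True, True, True, False, True, False, False, False, False, False]

report = board_level_report(has_defect, flagged)
print(report)
```

Here the escape rate is 1/4 and the false-call rate is 1/6. Multiplying the false-call rate by the line's throughput converts it into re-inspection workload, which is the figure the client plans staffing around.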
The frozen real test set is the only source of truth, and synthetic data never touches it. Synthetic images may enter training and only training. A test set containing generated samples measures how well your detector recognizes your generator's habits, a quantity with no relationship to escape rate on the line. Likewise, every time you peek at test numbers to make a decision, the set stops being a test set and becomes a slow validation set. Evaluate on it once per milestone, from one script, and let the validation split absorb all your curiosity.
The ablation is the scientific core of the capstone: it isolates what the generative engine of Milestone 4 actually bought. Hold everything fixed (architecture, recipe, schedule, operating point, matching rule, test set) and vary only the training data. Compute both conditions in one pass of one script so the comparison cannot silently drift apart, and attach bootstrap confidence intervals so a 2-point F1 delta on a 40-image class is read with appropriate suspicion:
from sklearn.metrics import f1_score
import numpy as np
def per_class_f1(y_true, y_pred, classes, n_boot=2000, seed=0):
    # Arrays are required so the bootstrap index arrays can slice them
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    point = f1_score(y_true, y_pred, labels=classes,
                     average=None, zero_division=0)
    idx = np.arange(len(y_true))
    boots = [f1_score(y_true[s], y_pred[s], labels=classes,
                      average=None, zero_division=0)
             for s in (rng.choice(idx, len(idx), replace=True)
                       for _ in range(n_boot))]
    lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)
    return point, lo, hi
# Same frozen test set, same matching rule, same operating point, one run:
f1_real, lo_r, hi_r = per_class_f1(y_true, pred_real_only, CLASSES)
f1_aug, lo_a, hi_a = per_class_f1(y_true, pred_real_plus_synth, CLASSES)
The expected (and publishable) shape of the result: synthetic data moves the rare classes substantially, moves the common classes little or not at all, and past some mixing ratio begins to hurt as the generator's quirks start outvoting reality. If your ablation shows something else, Section 9's first failure mode is the place to start debugging.
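Milestone 4 asks for confidence intervals on the per-class F1 delta itself, and two separate intervals do not directly give one. A paired bootstrap does: draw each resample once and score both conditions on it, so the shared difficulty of the resampled items cancels out of the difference. The sketch below is a minimal version under stated assumptions: labels are per-item class IDs after matching, F1 is computed in plain NumPy to keep the block self-contained, and the function names and toy predictions are hypothetical.

```python
import numpy as np

def f1_per_class(y_true, y_pred, classes):
    """One-vs-rest F1 for each class; 0.0 when a class has no support at all."""
    out = []
    for c in classes:
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        out.append(2 * tp / denom if denom else 0.0)
    return np.array(out)

def paired_f1_delta(y_true, pred_a, pred_b, classes, n_boot=1000, seed=0):
    """Per-class F1(b) - F1(a) with a 95% paired-bootstrap interval."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    point = (f1_per_class(y_true, pred_b, classes)
             - f1_per_class(y_true, pred_a, classes))
    n = len(y_true)
    deltas = np.empty((n_boot, len(classes)))
    for i in range(n_boot):
        s = rng.integers(0, n, n)       # the SAME resample scores both conditions
        deltas[i] = (f1_per_class(y_true[s], pred_b[s], classes)
                     - f1_per_class(y_true[s], pred_a[s], classes))
    lo, hi = np.percentile(deltas, [2.5, 97.5], axis=0)
    return point, lo, hi

# Toy data: class 0 is common (200 items), class 1 is rare (20 items)
y_true = np.array([0] * 200 + [1] * 20)
pred_real_only = np.array([0] * 200 + [1] * 5 + [0] * 15)        # finds 5 of 20
pred_real_plus_synth = np.array([0] * 200 + [1] * 17 + [0] * 3)  # finds 17 of 20

point, lo, hi = paired_f1_delta(y_true, pred_real_only, pred_real_plus_synth, [0, 1])
for c, (p, l, h) in enumerate(zip(point, lo, hi)):
    print(f"class {c}: delta F1 = {p:+.3f}  95% CI [{l:+.3f}, {h:+.3f}]")
```

An interval that excludes zero is the evidence that the generative engine earned its place for that class; an interval that straddles zero is an honest "no measurable effect" and belongs in the report just the same.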
7. Deployment: ONNX, Latency, and Monitoring
A checkpoint on a workstation is a prototype; the capstone ships. Export the trained model to ONNX and serve it with ONNX Runtime, following Chapter 28. Write the latency budget down before you measure: if a board arrives every 2 seconds and the line tolerates one board of buffering, your end-to-end budget might be 500 ms, decomposed into capture (60 ms), preprocessing and rectification (80 ms), inference, and postprocessing with verdict logic (40 ms). What remains, 500 - 60 - 80 - 40 = 320 ms, is the inference budget, and the number that must fit inside it is the 95th percentile, not the mean: a line stalls on tail latency.
import time
import numpy as np
import onnxruntime as ort
# model: the trained Ultralytics YOLO object from Milestone 3
model.export(format="onnx", imgsz=1024, half=True)  # Ultralytics exporter
# Adjust the path to wherever the exporter wrote the file
sess = ort.InferenceSession("best.onnx",
                            providers=["CUDAExecutionProvider"])
name = sess.get_inputs()[0].name  # "images" for Ultralytics exports
x = np.random.rand(1, 3, 1024, 1024).astype(np.float16)
for _ in range(20):  # warm-up runs
    sess.run(None, {name: x})
times = []
for _ in range(200):
    t0 = time.perf_counter()
    sess.run(None, {name: x})
    times.append((time.perf_counter() - t0) * 1000)
print(f"p50={np.percentile(times, 50):.1f} ms "
      f"p95={np.percentile(times, 95):.1f} ms")
If the p95 misses the budget, climb the efficiency toolbox of Chapter 28 in cost order: smaller input resolution (often free if defects are large relative to the new pixel pitch), FP16, then INT8 quantization with a calibration set, then a smaller backbone. Re-run the frozen-test evaluation after every step; a quantized model is a new model, and its per-class F1 must be re-earned, not assumed.
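The check that decides whether to start climbing is worth writing as code, because it is the place where a mean quietly gets substituted for a percentile. The sketch below uses the example stage timings from the budget discussion (500 ms end to end; 60, 80, and 40 ms for the fixed stages) and a simulated latency sample with a slow tail; the timings, the tail, and the function name `fits_budget` are illustrative assumptions, not measurements.

```python
import numpy as np

END_TO_END_MS = 500
FIXED_STAGES_MS = {"capture": 60, "preprocess_rectify": 80, "postprocess_verdict": 40}
inference_budget_ms = END_TO_END_MS - sum(FIXED_STAGES_MS.values())

def fits_budget(latencies_ms, budget_ms, pct=95):
    """Return (percentile latency, whether it fits inside the budget)."""
    p = float(np.percentile(latencies_ms, pct))
    return p, p <= budget_ms

# Simulated timings: 90% of runs near 200 ms, 10% in a slow tail near 400 ms
rng = np.random.default_rng(0)
latencies = np.concatenate([rng.normal(200, 10, 180), rng.normal(400, 20, 20)])

mean_ms = float(latencies.mean())
p95_ms, ok = fits_budget(latencies, inference_budget_ms)
print(f"inference budget: {inference_budget_ms} ms")
print(f"mean={mean_ms:.0f} ms  p95={p95_ms:.0f} ms  fits={ok}")
```

In this sample the mean sits comfortably inside the 320 ms budget while the p95 does not, which is exactly the situation a mean-only report would hide.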
Model scores drift last; inputs drift first. Instrument the pipeline to log, per board: mean and standard deviation of background brightness (Milestone 1's health), the median registration residual from the Milestone 2 code (geometry's health), the score distribution of the detector, and the daily false-call and escape counters from re-inspection. Add a canary: a known golden board run through the full pipeline once per shift, with alarms on any metric leaving its commissioning band. This pattern, from Chapter 28's monitoring section, is what converts Section 9's calibration-drift failure from a bad month into a same-day alert.
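The canary check reduces to comparing a handful of logged numbers against bands recorded at commissioning. The sketch below shows one minimal shape for it; the metric names, the band values, and the `Band` and `canary_alarms` helpers are hypothetical, and a real deployment would load the bands from the same config file as the rest of the pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Band:
    """Closed interval a health metric must stay inside."""
    lo: float
    hi: float

    def contains(self, value):
        return self.lo <= value <= self.hi

# Hypothetical commissioning bands, recorded when the line was accepted
COMMISSIONING_BANDS = {
    "background_mean": Band(170.0, 190.0),          # Milestone 1 health
    "background_std": Band(0.0, 3.0),
    "registration_residual_px": Band(0.0, 2.0),     # Milestone 2 health
}

def canary_alarms(metrics, bands=COMMISSIONING_BANDS):
    """Names of metrics outside their band; a missing metric also alarms."""
    alarms = []
    for name, band in bands.items():
        value = metrics.get(name)
        if value is None or not band.contains(value):
            alarms.append(name)
    return alarms

# Golden-board canary runs from two different shifts (illustrative numbers)
healthy = {"background_mean": 181.2, "background_std": 1.4,
           "registration_residual_px": 0.8}
bumped_mount = {"background_mean": 180.9, "background_std": 1.5,
                "registration_residual_px": 6.0}

print(canary_alarms(healthy))
print(canary_alarms(bumped_mount))
```

Treating a missing metric as an alarm is a deliberate choice: a logging stage that silently stops reporting is itself a failure worth a same-day alert.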
8. Marking Rubric
The rubric below weights the system the way production does: a fifth of the grade sits in evidence (evaluation and ablation) rather than in any single model. Graders should be able to reproduce every number from the repository alone.
| Component | Weight | Full marks look like | Frequent deductions |
|---|---|---|---|
| Milestone 1: imaging pipeline | 10% | Measured acceptance criteria met; deterministic, configurable stages | Hand-tuned constants with no measurement; corrections justified by eye |
| Milestone 2: geometry & registration | 15% | Sub-pixel calibration, residuals logged, classical baseline reported | Alignment failures silently dropped; no baseline to compare against |
| Milestone 3: learned detector | 20% | Clean splits by board, defensible operating point, per-class table | Split by image crop; threshold tuned on the test set |
| Milestone 4: generative engine | 15% | Quality gate with rejection stats; mixing ratio chosen on validation | Ungated generation dumped into training; ratio chosen on test |
| Honest evaluation & ablation | 20% | Frozen real test set, one-pass ablation, bootstrap CIs, all three ladder rungs | Synthetic images in test; ablation conditions evaluated from different scripts |
| Deployment & monitoring | 10% | ONNX artifact, p95 inside a written budget, input-drift instrumentation | Mean latency only; no monitoring plan beyond "watch the scores" |
| Presentation & report | 10% | Claims traceable to artifacts; failure gallery; limitations as scoped boundaries | Demo-only presentation with no numbers; metrics defined nowhere |
9. Common Failure Modes
Three failures account for most weak capstones. All are detectable early, and each maps back to one milestone's discipline.
9.1 The synthetic-to-real gap
Symptom: validation scores on synthetic-heavy data look wonderful while the frozen real test set barely moves, or rare-class precision drops because the detector now fires on textures your generator favors. Diagnosis: compare FID or KID between synthetic and real defect crops (Chapter 37) and visually difference matched pairs; over-clean edges, repeated micro-textures, and color shifts are the usual tells. Remedies, in order: inpaint into real backgrounds rather than generating whole images, pass synthetic samples through the same Milestone 1 preprocessing so they inherit the sensor's noise, tighten the quality gate, and cap the mixing ratio at what validation supports. The gap shrinks when the generator is forced to live inside real pixels.
9.2 Test-set contamination
Contamination rarely looks like cheating; it looks like convenience. Two crops of the same physical board land in train and test because the split was done per image. A synthetic sample is generated from a background crop whose source board is in the test split. A threshold gets nudged after a test run "just to see". Each leak inflates every number downstream and is invisible in the final report. Defenses: split by board serial before any other processing, deduplicate near-identical images across splits with a perceptual hash, generate synthetic data exclusively from training-split sources, and treat the test evaluation script's run count as an auditable fact (once per milestone). If a number looks too good, assume contamination before brilliance.
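The perceptual-hash defense can be sketched with an average hash, one of the simplest members of that family: downsample each image to an 8 x 8 grid of block means, threshold at the grid mean, and compare the resulting 64 bits by Hamming distance. The sketch below is illustrative only; the function names, the 4-bit tolerance, and the toy images are assumptions, and a production audit would more likely use a maintained hashing library and a stronger hash.

```python
import numpy as np

def average_hash(gray, size=8):
    """64-bit average hash: block-mean downsample, then threshold at the mean."""
    h, w = gray.shape
    gray = gray[: h - h % size, : w - w % size].astype(np.float64)
    blocks = gray.reshape(size, gray.shape[0] // size, size, gray.shape[1] // size)
    small = blocks.mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def cross_split_duplicates(train, test, max_bits=4):
    """Pairs (train_name, test_name) whose hashes differ by at most max_bits."""
    test_hashes = {name: average_hash(img) for name, img in test.items()}
    return [(tn, sn)
            for tn, timg in train.items()
            for sn, th in test_hashes.items()
            if hamming(average_hash(timg), th) <= max_bits]

# Toy images: blocky random patterns standing in for board crops
rng = np.random.default_rng(0)

def toy_image():
    pattern = rng.integers(0, 2, (8, 8)) * 150 + 50
    return np.kron(pattern, np.ones((8, 8))).astype(np.uint8)

img_a, img_b, img_c = toy_image(), toy_image(), toy_image()
recapture = np.clip(img_a + rng.normal(0, 2, img_a.shape), 0, 255).astype(np.uint8)

train = {"train_A.png": img_a, "train_B.png": img_b}
test = {"test_A_recapture.png": recapture, "test_C.png": img_c}

dups = cross_split_duplicates(train, test)
print(dups)
```

A hash match is a prompt for a human to look, not proof of a leak; the authoritative defense remains splitting by board serial before any crop is taken.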
9.3 Calibration drift
A deployed inspection line ran beautifully for five weeks, then rare-defect recall faded over three days. Nobody changed the model. A technician had bumped the camera mount; the shift was under a millimeter, small enough that images looked fine, large enough that rectified boards landed 6 pixels off the golden template, so features learned around pad edges were now sampling neighboring copper. The registration residual from Milestone 2's code had been rising the whole time; no one had plotted it. The fix took an hour (re-shoot the calibration target, refresh the homography); detecting the problem took three days of escaped defects. Geometry is a wear item: monitor residuals, schedule recalibration, and alarm on the canary board, exactly as Section 7 prescribes.
10. Variations for Other Domains
The architecture transfers wherever a roughly planar scene is inspected against an expectation. Keep the skeleton; swap the domain.
- Agriculture: produce grading and leaf-disease scouting. The conveyor becomes a grading line or a field rig; illumination correction (Milestone 1) becomes the hard part because sunlight refuses to be commissioned. Geometric normalization weakens from homography to coarse alignment, so lean harder on the detector's augmentation policy from Chapter 21. The generative engine inpaints lesions and blemishes onto healthy produce; the frozen test set must span growing seasons, or drift will masquerade as accuracy.
- Retail: shelf audit and planogram compliance. Rectify shelf photos to the shelf plane with the same homography toolkit, detect and count SKUs (the densely packed SKU-110K dataset is the classic benchmark), and compare against the planogram. The rare classes are new or redesigned packages, which the generative engine composites onto shelf backgrounds. Evaluation adds a structured-output rung to the ladder: facing counts and out-of-stock flags per shelf section.
- Medical imaging: with serious caveats. Lesion detection in dermoscopy or radiographs shares the pipeline shape, and the scarcity of rare pathology makes synthetic augmentation tempting. Proceed only with the constraints in the note below.
A medical variation of this capstone is a methods exercise, not a clinical tool. Synthetic pathology may be explored in training, but the evaluation set must be real, expert-annotated, and ideally multi-site; a model that learns a generator's idea of a lesion endangers patients in a way a false-called circuit board never will. Class imbalance interacts with demographics, so per-subgroup metrics join the ladder. Regulatory frameworks treat such systems as medical devices, and nothing built in a capstone approaches clinical validation. If you choose this variation, recruit a clinical advisor, state the scope boundary on the first slide, and read the governance discussion in Chapter 37 before generating a single image.
11. Presenting Your System
The presentation is fifteen minutes, and its job is to let an informed skeptic believe your numbers. A structure that works, with the segments below filling fourteen minutes and one left as slack:
- The problem and the metric (2 minutes). One slide of line context, then define escape rate, false-call rate, and per-class F1 before showing any result. A metric defined after its value is an excuse.
- The system in four slides (4 minutes). One slide per milestone, each anchored by one image: the before-and-after of preprocessing, the aligned-versus-golden overlay, the detector's per-class table, a strip of gated synthetic samples beside real ones.
- The ablation table (3 minutes). This is the centerpiece: real-only versus real-plus-synthetic, per class, with confidence intervals, on the frozen real test set. Walk through the rare class where the engine paid for itself.
- Live or recorded demo (3 minutes). A board goes in, a verdict comes out, the latency counter is visible. Show one failure case and narrate why the system missed it; a failure you can explain builds more trust than ten successes.
- Boundaries and roadmap (2 minutes). State the validated regime as scope, not apology: board family, lighting band, defect classes covered, p95 latency on named hardware. Then the one improvement you would ship next.
The written report follows the same spine with the evidence attached: dataset cards, the sealed evaluation protocol, the ablation with CIs, the latency benchmark, and the monitoring plan. Every claim in the slides should trace to a file in the repository; graders are reviewers with a rubric.
- Conceptual: design the operating-point policy for a line where an escaped defect costs 500 times a false call. Where does each rung of the metrics ladder enter the decision, and what threshold would you set per class?
- Coding: add a zero-shot anomaly scorer in the spirit of WinCLIP using the foundation models of Chapter 25, and benchmark it against your supervised detector on the same frozen test set and the same ladder.
- Analysis: for your worst-performing class, trace five individual test failures through the full pipeline (preprocessed image, registration residual, detector score, verdict) and identify the stage where each was lost. Report whether the fix belongs to Milestone 1, 2, 3, or 4.
Where to Go from Here
The capstone is the last page of the story, but the book was built for revisiting: the Table of Contents jumps to any chapter, and the appendices hold the reference material your project will keep pulling on. Ship the system, present the evidence as it stands, and then do what every vision engineer does the week after a launch: start watching the monitoring dashboard.