Appendix D: Reading Pathways | Building Vision AI

"Nobody attends to every token. The art is choosing what to skip and then describing the choice as a principled sparsity pattern."
An Attention Head With Selective Focus

Big Picture

This book is a dependency graph, not a queue. The table of contents shows thirty-nine chapters in publication order, but very few readers need them in that order. This appendix gives five tested traversals of the graph, one per audience: the engineer shipping a vision feature this quarter, the researcher headed for generative models, the ComfyUI practitioner who wants to understand the machinery behind the nodes, the self-study learner walking the full arc, and the classical CV veteran modernizing a working stack. Each pathway lists the chapters in reading order, says what to skip on a first pass, names the signature arcs to watch for, and ends with a concrete "you are done when" outcome so you know the pathway has paid off.

1. How to Use These Pathways

Every pathway below follows the same template. First comes an ordered chapter list with links: read these, in this order. Then a short "skip on first pass" note, because skipping is a feature of efficient reading, not a failure of discipline; everything you skip remains exactly where you left it. Then the arcs to watch for, and finally the outcome that tells you the pathway is complete. No prior chapters are required to read this appendix itself; if you are deciding where to start the book, start here.

Two recurring story lines run through the whole book, and every pathway crosses at least one of them. The first is that concepts introduced classically return learned: the convolution kernel you slide by hand in Chapter 3 returns as the learnable layer of Chapter 19, and the multi-view geometry of Chapters 12 through 14 returns inside the neural scene representations of Chapter 27. The second is that restoration becomes generation: the denoising of Chapter 7 becomes the diffusion engine of Chapter 33, and classical inpainting becomes the generative editing of Chapter 35.

Key Insight

Whichever pathway you take, the arcs are the spine of the book. A chapter read in isolation teaches a technique; a chapter read as part of an arc teaches why the technique exists and what replaced it. When a pathway says "watch for the arc", it means: pause at the early chapter long enough to be genuinely surprised when the idea returns wearing learned weights.

If you find mathematical gaps along any pathway, Appendix A patches them on demand. If you want calendar structure instead of self-pacing, Appendix C turns three of these pathways into week-by-week syllabi.

2. Pathway 1: The Working Software Engineer

You build software for a living, a vision feature just landed on your roadmap (detect objects in uploaded photos, segment products on shelves, classify defects on a line), and you need to ship something defensible fast. Your pathway front-loads the deep learning chapters and treats the classical material as a just-in-time reference.

Chapter 0: Foundations: The Python Imaging Stack, because every bug you will ever file starts with shape, dtype, and channel order.
Chapter 2: Point Operations, Histograms & Thresholding and Chapter 3: Spatial Filtering & Convolution, the minimum image processing for sane preprocessing.
Chapter 18: Neural Networks & PyTorch for Vision through Chapter 21: Training Recipes, in order: the training loop, CNNs, architectures, and the recipe that makes transfer learning work.
Chapter 22: Vision Transformers, Chapter 23: Object Detection, and Chapter 24: Segmentation, the three task chapters most production features draw on.
Chapter 28: Efficient Vision & Edge Deployment and Chapter 29: Tools of the Trade, because a model that does not run on the target hardware is a prototype.

Skip on first pass: the frequency domain (Chapter 4), morphology (Chapter 6), all of Part II, and all of Part IV except Chapter 37's section on synthetic training data, which is worth a detour the day labels run short. Backfill on demand: Chapter 1 when color spaces bite, Chapter 5 and Chapter 12 the first time a camera needs calibrating, Chapters 9 and 10 when a fifty-line classical solution beats a fifty-megabyte model.

Arcs to watch: convolution returns learned (Chapter 3 to Chapter 19), and the histogram thinking of Chapter 2 returns as the normalization and augmentation statistics of Chapter 21.

You are done when you can fine-tune a detector on your own data (Chapter 23), export it through ONNX to your target runtime (Chapter 28), and explain to a code reviewer exactly why the input tensor is normalized the way it is.

Practical Example: Six Weeks to Shelf Monitoring

Who: A backend engineer at a retail-analytics startup, handed a shelf-monitoring feature with a six-week deadline and no vision background.

Path taken: Chapters 0, 2, 3 in the first week; Chapters 18 through 21 in weeks two and three; Chapter 23 in week four; Chapters 28 and 29 while integrating.

The detour: In week five, bounding boxes from the ceiling-mounted cameras came back skewed. A one-day backfill of Chapter 5 (homographies) and Chapter 12 (calibration) fixed the rectification step; nothing about the detector itself was wrong.

Lesson: The pathway is not a wall around the rest of the book. It is a default route plus permission to take exactly the detours your problem demands.

3. Pathway 2: The Researcher Headed for Generative Models

You are a graduate student or industrial researcher whose destination is Part IV: you want to understand diffusion models well enough to extend them, not just run them. Your pathway builds the signal-processing and deep learning foundations that generative papers silently assume, then reads Part IV in full.

Chapter 0, then Chapter 3: Spatial Filtering & Convolution and Chapter 4: The Frequency Domain: convolution, frequency, and multi-scale thinking are the native language of generative architectures.
Chapter 7: Image Restoration & Enhancement, read slowly. Noise models, denoising, and inpainting are the classical halves of the book's central arc.
Chapters 18 through 22: PyTorch, CNNs, architectures, training recipes, and vision transformers; the U-Nets and DiTs of Part IV are assembled from exactly these parts.
Chapter 25: Self-Supervised Learning & Vision Foundation Models, primarily for CLIP, which becomes the conditioning backbone of text-to-image systems.
All of Part IV in order: Chapter 30, 31, 32, 33, 34, 35, 36, 37, and 38.

Skip on first pass: nearly all of Part II; return to Chapters 12 through 14, plus Chapter 27, just before the 3D generation material in Chapter 36. Detection and segmentation (Chapters 23 and 24) are optional until you need them as evaluation probes for synthetic data.

Arcs to watch: denoising becomes diffusion (Chapter 7 to Chapter 33) is your main arc; ride it deliberately. Two supporting arcs: the frequency and pyramid ideas of Chapter 4 explain why latent diffusion compresses before it diffuses (Chapter 33), and classical inpainting (Chapter 7) becomes generative editing (Chapter 35).

You are done when you can derive the ELBO from Chapter 31 on a whiteboard, explain what the DDPM loss in Chapter 33 is actually regressing, and place any new generative paper onto the family map of Chapter 30 within a page of reading its abstract.

4. Pathway 3: The Generative-AI Practitioner

You live in ComfyUI or a similar node graph. You can build workflows that produce remarkable images, and you want to know what the boxes actually do to the tensors flowing between them. Your pathway reads Part IV as the main text, with four short dips into earlier chapters exactly where a node needs grounding.

Chapter 30: Foundations of Generative Modeling, for the map of generative families and the latent-space idea behind every workflow.
Chapter 31: Autoencoders & VAEs: the VAE Encode and VAE Decode nodes you wire daily, explained.
Chapter 33: Diffusion Models, the heart of the pathway: samplers, schedulers, guidance scale, and why step counts trade speed for fidelity.
Chapter 34: Text-to-Image Systems: what the CLIP Text Encode node produces and how conditioning enters the U-Net or DiT.
Chapter 35: Controllable Generation & Image Editing: ControlNet, LoRA, inpainting, and instruction-based editing, the rest of your node palette.
Second pass: Chapter 32 (where GANs still win), Chapter 36 (video and 3D generation), Chapter 37 (evaluation, watermarking, licensing), and Chapter 38, whose ComfyUI section will reorganize what you already know.

The four dips: read Chapter 3 when Chapter 33 mentions convolution (a kernel is a thing you can slide by hand); Chapter 7 to see denoising and inpainting before they became generative; Chapter 19 to understand the U-Net's convolutional blocks; and Chapter 22 when you reach DiT architectures and need to know what attention does.

Skip on first pass: everything in Parts I through III not named above. You can be a superb workflow engineer without epipolar geometry.

Arcs to watch: denoising becomes diffusion, experienced backwards: you already know the destination, and the dips show you the origin. Watch also how Chapter 31's compression arc explains why "latent" appears in half your node names.

You are done when you can narrate your favorite workflow node by node, stating what each one does to the tensor passing through it; predict the effect of changing sampler, step count, or guidance scale before pressing run; and debug a broken graph by reasoning about the machinery instead of re-rolling the seed.

Fun Note

The universal first instinct when a generation looks wrong is to change the seed and run it again. This is the generative equivalent of turning it off and on again: occasionally effective, never explanatory. The entire point of this pathway is to make the second instinct (asking which stage of the pipeline disagrees with you) cheaper than the first.

5. Pathway 4: The Self-Study Learner

You are reading cover to cover, and the four-part arc is your pathway: pixels (Chapters 0 to 8), geometry and classical vision (Chapters 9 to 17), learning (Chapters 18 to 29), and generation (Chapters 30 to 38). What you need from this appendix is pacing and checkpoints, not a route.

At six to eight hours per week, a comfortable schedule is: Part I in five weeks, Part II in five, Part III in seven (it is the longest part and the one with the most code), and Part IV in five, with the final two weeks for the capstone project: roughly six months end to end. The "Tools of the Trade" chapters that close each part (8, 17, 29, 38) are references: skim them once, then return as needed rather than studying them line by line.

Checkpoints, one per milestone; do not move on until the artifact works:

End of Part I: build the document scanner of Chapter 5 from a photo of a real receipt on your desk.
End of Part II: the lane-marking detector of Chapter 9 and a two-image panorama via the homography machinery of Chapter 13.
Mid Part III: the CIFAR-10 CNN of Chapter 19 trained end to end, then a detector fine-tuned on data you collected yourself (Chapter 23).
End of Part IV: a controllable generation workflow combining a base model, one ControlNet, and one LoRA (Chapter 35), evaluated with the metrics of Chapter 37.

Skip on first pass: nothing permanently, but it is fine to read Chapter 14 and Chapter 16 at survey depth the first time through; both deepen enormously on a second visit after Part III.

Arcs to watch: all four. Convolution returns learned (3 to 19), denoising becomes diffusion (7 to 33), inpainting becomes generative editing (7 to 35), and geometry returns neural (12 to 14, then 27). You are the one reader who experiences every arc in its intended order.

You are done when you complete the capstone: an end-to-end vision system that spans all four parts, with classical preprocessing, a fine-tuned model, a synthetic-data component, and an honest evaluation. If you prefer external structure, the semester tracks in Appendix C map this same pathway onto an academic calendar.

6. Pathway 5: The Classical CV Engineer Modernizing a Stack

You have years of OpenCV behind you: calibration rigs, inspection lines, tracking systems. You do not need to be told what a homography is; you need the shortest credible route from hand-crafted features to the modern learned stack. Your pathway treats Part II as a fast review mirror and Part III as the main text.

Skim Chapters 9 through 17 in a week, fast, with two goals: mark anything that surprises you for a slow second read, and read Chapter 16's closing section carefully, because the story of why hand-crafted pipelines plateaued is written for exactly you.
Then read Chapters 18 through 29 in order and at full depth: PyTorch, CNNs, architectures, training recipes, transformers, detection, segmentation, self-supervision, video, 3D, and deployment.

Skip on first pass: Part I, apart from a quick skim of Chapter 3 and Chapter 7 to align vocabulary with the arcs; you know this material, but the book's framing of it pays off later. Part IV is optional on the first pass except Chapter 37's treatment of generative models as synthetic-data engines, which is directly useful for inspection problems where defect examples are rare.

Arcs to watch: yours are the correspondence arcs. The descriptors of Chapter 10 become the learned embeddings of Chapter 25; the optical flow and trackers of Chapter 15 return as RAFT and learned multi-object tracking in Chapter 26; and the reconstruction pipelines of Chapter 14 feed directly into the radiance fields and splats of Chapter 27.

You are done when you can take one of your existing classical pipelines and articulate, stage by stage, which components a learned model should replace, which should stay classical, and why; and you have fine-tuned a modern segmenter from Chapter 24 on your own imagery to back the argument with numbers.

7. Keeping Your Pathway Current

The pathways are stable because the foundations are stable: convolution, geometry, backpropagation, and the ELBO will not change under you. The frontier chapters, however, describe a moving field, and how you read them should reflect that. Read their mechanisms as durable and their model names as timestamps.

Research Frontier

Five chapters sit closest to the moving edge and reward a re-visit whenever you return to the book. Chapter 25: the foundation-model landscape keeps shifting, with promptable segmenters in the SAM 2 line and ever-stronger self-supervised backbones in the DINO family. Chapter 27: 3D Gaussian splatting has moved from paper to production tooling at remarkable speed. Chapter 34: the model landscape (FLUX.1, Stable Diffusion 3.5, and their successors) turns over roughly yearly, while the underlying latent-diffusion and DiT machinery persists. Chapter 36: Sora-class video generation and interactive world models in the Genie line are the fastest-moving topics in the book. And Chapter 37: provenance standards such as C2PA are still being adopted as you read this. The reading strategy for all five is the same: learn the mechanism from the chapter, then check the part's Tools of the Trade chapter for the current names.

Finally, pathways are starting points, not contracts. Engineers drift into Part IV when a feature needs generation; practitioners discover they enjoy the geometry they swore to skip. When that happens, switch: find your new audience above, see which of its chapters you have already covered, and continue from the first one you have not. The hardware decisions that eventually face every pathway (cameras, GPUs, edge devices) are waiting in Appendix E.