What This Book Covers | Building Vision AI

This book is organized as a single arc in four parts: pixels, geometry, learning, generation. Thirty-nine chapters and 213 sections take you from "an image is a NumPy array" to "a diffusion model can synthesize the training data for your next detector." Five appendices and a capstone project round out the journey. The full map lives in the Table of Contents; this page walks the arc so you know where each part fits and why the order matters.

Part I: Image Processing (Chapters 0 to 8)

Everything begins with the pixel. Part I opens with the Python imaging stack (Chapter 0), because in this book every concept is something you run, not just something you read. From there it builds the signal-processing bedrock: how digital images are formed, sampled, and encoded (Chapter 1); per-pixel transforms, histograms, and thresholding (Chapter 2); convolution and spatial filtering (Chapter 3); the frequency domain, aliasing, and image pyramids (Chapter 4); geometric transformations and warping (Chapter 5); morphology and shape (Chapter 6); and restoration, from denoising to deblurring, inpainting, and HDR (Chapter 7). Chapter 8 consolidates the practical stack: OpenCV, scikit-image, Pillow, and the performance tooling around them.

Nothing in Part I is merely historical. These operations run inside every camera app, every dataset pipeline, and every preprocessing stage of every model you will ever train.

Part II: Classical Computer Vision (Chapters 9 to 17)

Part II is vision before learning: the structured, geometric core of the field. It moves from edges, lines, and curves (Chapter 9) to keypoints, descriptors, and robust matching (Chapter 10), then classical segmentation (Chapter 11). The middle of the part is the geometry spine: camera models and calibration (Chapter 12), two-view geometry, stereo, and depth (Chapter 13), and structure from motion with visual SLAM (Chapter 14). Motion, optical flow, and tracking follow (Chapter 15), and Chapter 16 closes the historical loop with the hand-crafted recognition pipelines, HOG plus SVM, Viola-Jones, deformable part models, and the reasons they plateaued. Chapter 17 is the part's tools reference, from OpenCV's deeper corners to COLMAP.

The geometry chapters in particular are not optional nostalgia: calibration, epipolar geometry, and triangulation return almost verbatim inside the neural 3D methods of Part III and the world generators of Part IV.

Part III: Deep Learning for Computer Vision (Chapters 18 to 29)

Part III is the largest part of the book, because it is where most working vision systems live today. It starts from first principles with neural networks and PyTorch (Chapter 18), then makes the convolution of Chapter 3 learnable (Chapter 19) and tells the architecture story from LeNet to ConvNeXt (Chapter 20). Training recipes, augmentation, and transfer learning get their own chapter (21), as do vision transformers (22). The core applied tasks follow: object detection (23) and segmentation up to promptable models like SAM (24). Chapter 25 covers the shift that redefined the field, self-supervised learning and vision foundation models such as CLIP and DINO. Video understanding (26), depth, 3D, and neural scene representations including NeRF and Gaussian splatting (27), and efficient deployment to edge hardware (28) complete the technical material, and Chapter 29 consolidates the deep vision stack: torchvision, timm, Hugging Face, Ultralytics, and friends.

Part IV: Generative Vision Models (Chapters 30 to 38)

Part IV turns the pipeline around: from recognizing images to producing them. It opens with the foundations of generative modeling and the map of model families (Chapter 30), then works through autoencoders and VAEs (31), GANs (32), and diffusion models (33), the engine of the current generation. Chapter 34 goes inside text-to-image systems such as Stable Diffusion and its successors; Chapter 35 covers controllable generation and image editing, from ControlNet to LoRA personalization and instruction-based edits. Chapter 36 extends generation to video, 3D, and interactive world models. Chapter 37 addresses what serious practitioners cannot skip: evaluation metrics, human studies, deepfake detection, watermarking and provenance, licensing, and using generators as synthetic-data engines for the models of Part III. Chapter 38 closes with the generative tools stack, from Diffusers to ComfyUI.

Beyond the Chapters

Five appendices support the main text: mathematical foundations (A), a datasets and benchmarks catalog (B), ready-made week-by-week teaching schedules (C), per-audience reading pathways (D), and a hardware guide covering cameras, GPUs, and edge devices (E). The capstone project then asks you to put all four parts to work in one end-to-end system: classical preprocessing and geometry, a fine-tuned detector or segmenter, a generative synthetic-data engine, and an honest evaluation with deployment.

One thread runs through everything: concepts introduced classically return learned. Convolution becomes the CNN layer, denoising becomes diffusion, inpainting becomes generative editing, camera geometry becomes NeRF. The book is sequenced so that each return visit feels like recognition rather than revelation. The next page describes who this journey is designed for.