Contents | Building Vision AI: From Pixels to Generative Models

4 parts · 39 chapters · 219 sections, plus front matter, 7 appendices, and a capstone. Every chapter and section linked below is complete and live. The directory path is shown under each chapter.

Front Matter · Why This Book Exists

8 entries

F1
Why This Book Exists Vision AI spans sixty years of ideas, from convolution kernels to diffusion models; this book teaches them as one connected story.
front-matter/foreword.html
F2
What This Book Covers The four-part arc: pixels, geometry, learning, generation.
front-matter/fm-what-this-book-covers.html
F3
Who Should Read This Book Engineers with basic Python and linear algebra; no prior computer vision required.
front-matter/fm-who-should-read.html
F4
What's Inside A guided preview of the book's signature elements: worked pipelines, library shortcuts, callouts, and labs.
front-matter/look-inside-preview.html
F5
How to Use This Book Reading paths for engineers, researchers, and self-study learners; how the parts depend on each other.
front-matter/fm-how-to-use.html
F6
About the Authors Who wrote this book and how.
front-matter/about-authors.html
F7
Copyright & Legal Edition, license, and attribution.
front-matter/copyright.html
F8
About the Hands-On AI Science Series The nine-book Hands-On AI Science series and where this volume fits.
front-matter/about-the-series.html

Part I · Image Processing

9 chapters · 49 sections

The signal-processing bedrock: pixels, color, histograms, filtering, frequency, geometry, morphology, and restoration.

0
Foundations: The Python Imaging Stack An image is a NumPy array; master the array and the entire vision stack opens up.
part-1-image-processing/module-00-python-imaging-stack/
1
Digital Image Fundamentals From photons to pixels: how a digital image is born, encoded, and judged.
part-1-image-processing/module-01-digital-image-fundamentals/
2
Point Operations, Histograms & Thresholding Per-pixel transforms are the simplest tools in vision, and still among the most used.
part-1-image-processing/module-02-point-operations-histograms/
3
Spatial Filtering & Convolution The kernel is the atom of image processing, and the same operation that powers CNNs in Part III.
part-1-image-processing/module-03-spatial-filtering-convolution/
4
The Frequency Domain & Multi-Scale Analysis Every image is a sum of waves; seeing it that way explains aliasing, compression, and pyramids in one stroke.
part-1-image-processing/module-04-frequency-domain-multiscale/
5
Geometric Transformations & Image Warping Rotating, rectifying, and registering images: the coordinate machinery behind every camera app.
part-1-image-processing/module-05-geometric-transformations/
6
Morphology, Binary Images & Shape Once an image is binary, a small algebra of erosions and dilations solves a surprising share of industrial vision.
part-1-image-processing/module-06-morphology-binary-shape/
7
Image Restoration & Enhancement Undoing damage: noise, blur, missing pixels, and limited dynamic range, with the classical methods deep models later learned to beat.
part-1-image-processing/module-07-restoration-enhancement/
8
Tools of the Trade: The Image Processing Stack Consolidated reference: libraries, performance tooling, datasets, and external resources for this part.
part-1-image-processing/module-08-tools-of-the-trade/

Part II · Classical Computer Vision

9 chapters · 48 sections

Vision before learning: features, matching, multi-view geometry, motion, and the recognition pipelines that defined an era.

9
Edges, Lines & Curves From raw gradients to structured geometry: the first step from processing images to understanding them.
part-2-classical-computer-vision/module-09-edges-lines-curves/
10
Keypoints, Descriptors & Matching Find the same point in two photographs and most of geometric vision follows.
part-2-classical-computer-vision/module-10-keypoints-descriptors-matching/
11
Classical Segmentation & Grouping Carving an image into meaningful regions with clustering, watersheds, and graphs.
11.1 Segmentation as Clustering: K-Means & Mean-Shift
11.2 Region Growing & Split-and-Merge
11.3 The Watershed Transform
11.4 Graph-Based Segmentation: Graph Cuts & GrabCut
11.5 Superpixels: SLIC & Friends
part-2-classical-computer-vision/module-11-classical-segmentation/
12
Camera Models & Calibration The pinhole camera turns 3D into 2D; calibration tells you exactly how.
part-2-classical-computer-vision/module-12-camera-models-calibration/
13
Two-View Geometry, Stereo & Depth Two cameras and a bit of linear algebra recover what one camera lost: depth.
13.1 Epipolar Geometry: The Geometry of Two Views
13.2 Essential & Fundamental Matrices
13.3 Homographies & Panorama Stitching
13.4 Stereo Rectification & Disparity Estimation
13.5 From Disparity to Depth Maps
13.6 Triangulation & 3D Point Recovery
part-2-classical-computer-vision/module-13-two-view-stereo-depth/
14
Structure from Motion & Visual SLAM From a pile of photos to a 3D model, and from a moving camera to a live map.
part-2-classical-computer-vision/module-14-sfm-visual-slam/
15
Motion, Optical Flow & Tracking Video adds time; flow and tracking turn pixel motion into object motion.
part-2-classical-computer-vision/module-15-motion-flow-tracking/
16
Classical Recognition Pipelines Hand-crafted features plus shallow classifiers ruled recognition for two decades; understanding why they plateaued explains why deep learning won.
part-2-classical-computer-vision/module-16-classical-recognition/
17
Tools of the Trade: The Classical CV Stack Consolidated reference: libraries, reconstruction tooling, datasets, and external resources for this part.
part-2-classical-computer-vision/module-17-tools-of-the-trade/

Part III · Deep Learning for Computer Vision

12 chapters · 67 sections

Vision learned end to end: CNNs, transformers, detection, segmentation, self-supervision, video, 3D, and deployment.

18
Neural Networks & PyTorch for Vision Everything Part III builds on: tensors, autograd, and a training loop you fully understand.
part-3-deep-learning-for-vision/module-18-neural-networks-pytorch/
19
Convolutional Neural Networks The convolution from Chapter 3, made learnable: weight sharing, hierarchy, and the inductive bias that fits images.
part-3-deep-learning-for-vision/module-19-convolutional-neural-networks/
20
CNN Architectures: From LeNet to ConvNeXt A decade of architecture search, told as a story of bottlenecks found and removed.
part-3-deep-learning-for-vision/module-20-cnn-architectures/
21
Training Recipes: Data, Augmentation & Transfer In practice the recipe matters as much as the architecture; this chapter is the recipe.
part-3-deep-learning-for-vision/module-21-training-recipes/
22
Vision Transformers Treat an image as a sequence of patches and the transformer takes over; the question is when that trade is worth it.
part-3-deep-learning-for-vision/module-22-vision-transformers/
23
Object Detection Where are the objects and what are they: the task that drives much of applied vision.
part-3-deep-learning-for-vision/module-23-object-detection/
24
Segmentation: Semantic, Instance & Promptable From a label per image to a label per pixel, and on to models that segment anything you point at.
part-3-deep-learning-for-vision/module-24-segmentation/
25
Self-Supervised Learning & Vision Foundation Models Labels stopped being the bottleneck: how vision models learn from raw pixels and from language.
part-3-deep-learning-for-vision/module-25-self-supervised-foundation-models/
26
Video Understanding Adding the time axis: actions, motion, and tracking with learned features.
26.1 From Frames to Clips: The Temporal Dimension
26.2 Action Recognition: 3D CNNs & Two-Stream Networks
26.3 Video Transformers
26.4 Deep Optical Flow: RAFT & Beyond
26.5 Multi-Object Tracking with Learned Features
part-3-deep-learning-for-vision/module-26-video-understanding/
27
Depth, 3D Vision & Neural Scene Representations Deep networks meet the geometry of Part II: learned depth, point clouds, radiance fields, and splats.
27.1 Monocular Depth Estimation
27.2 3D Representations: Point Clouds, Voxels & Meshes
27.3 Learning on Point Clouds: PointNet & Successors
27.4 NeRF: Neural Radiance Fields
27.5 3D Gaussian Splatting
27.6 Capture-to-Render Pipelines in Practice
part-3-deep-learning-for-vision/module-27-depth-3d-neural-scenes/
28
Efficient Vision & Edge Deployment A model that cannot run on the target hardware is a prototype; this chapter ships it.
part-3-deep-learning-for-vision/module-28-efficient-vision-deployment/
29
Tools of the Trade: The Deep Vision Stack Consolidated reference: model hubs, frameworks, data tooling, and external resources for this part.
part-3-deep-learning-for-vision/module-29-tools-of-the-trade/

Part IV · Generative Vision Models

9 chapters · 55 sections

Models that create: VAEs, GANs, diffusion, text-to-image, controllable editing, video and 3D generation, evaluation and governance.

30
Foundations of Generative Modeling From recognizing images to producing them: what it means to model the distribution of natural images.
part-4-generative-vision-models/module-30-generative-foundations/
31
Autoencoders & Variational Autoencoders Compression as representation, and the probabilistic twist that made decoders generative.
part-4-generative-vision-models/module-31-autoencoders-vaes/
32
Generative Adversarial Networks Two networks in a game: the family that made photorealistic generation possible, and the lessons it left behind.
part-4-generative-vision-models/module-32-gans/
33
Diffusion Models Destroy an image with noise, learn to rebuild it, and you get the engine behind modern image generation.
part-4-generative-vision-models/module-33-diffusion-models/
34
Text-to-Image Systems Inside the systems that turn a sentence into an image, from CLIP conditioning to full production stacks.
part-4-generative-vision-models/module-34-text-to-image/
35
Controllable Generation & Image Editing From prompt roulette to precise control: structure, identity, and edits that preserve everything else.
part-4-generative-vision-models/module-35-controllable-generation-editing/
36
Video, 3D Generation & World Models Generation grows axes: time, depth, and agency, from video diffusion to world models that learn to simulate.
part-4-generative-vision-models/module-36-video-3d-world-generation/
37
Evaluation, Safety & Generative Data Engines Measuring what generators produce, governing how they are used, and putting them to work as synthetic-data engines for the models of Part III.
part-4-generative-vision-models/module-37-evaluation-safety-data-engines/
38
Tools of the Trade: The Generative Vision Stack Consolidated reference: generation libraries, workflow engines, hosted APIs, and external resources for this part.
part-4-generative-vision-models/module-38-tools-of-the-trade/

Appendices · Reference and Pedagogy

7 appendices

A
Mathematical Foundations for Vision The essential linear algebra, probability, optimization, and signal processing behind every chapter.
appendices/appendix-a-mathematical-foundations/
B
Datasets & Benchmarks Catalog A per-task reference: classification, detection, segmentation, geometry, flow, video, and generation benchmarks, with licensing notes.
appendices/appendix-b-datasets-benchmarks/
C
Course Syllabi Tested course tracks built from the book: a one-semester image processing and classical CV course, a deep vision course, and a generative vision course, with week-by-week schedules.
appendices/appendix-c-course-syllabi/
D
Reading Pathways Per-audience reading guides for engineers, researchers, generative-AI practitioners, and self-study learners.
appendices/appendix-d-reading-pathways/
E
Cameras, GPUs & Edge Hardware Guide Choosing sensors, lenses, GPUs, and edge devices for vision workloads, from lab prototypes to production lines.
appendices/appendix-e-cameras-gpus-edge-hardware/
F
Agents That Helped to Write This Book Roster of the 42 specialist AI agents in the writing pipeline that produced this manuscript, with a card per agent.
appendices/appendix-f-agent-roster/
G
Application Reference Maps A domain router: pick an application, then follow the chapter path, data contract, metric shift, and tool stack that make the project credible.
appendices/appendix-g-application-reference-maps/

Capstone · End-to-End Vision System

1 project

★
Capstone Project: An End-to-End Vision System Design, build, evaluate, and present a production-grade vision application that spans all four parts: classical preprocessing and geometry, a fine-tuned detector or segmenter, a generative synthetic-data engine, and honest evaluation with deployment.
capstone/