Front Matter · Why This Book Exists
7 entries
- F1 Why This Book Exists
Vision AI spans sixty years of ideas, from convolution kernels to diffusion models; this book teaches them as one connected story.
front-matter/foreword.html
- F2 What This Book Covers
The four-part arc: pixels, geometry, learning, generation.
front-matter/fm-what-this-book-covers.html
- F3 Who Should Read This Book
Engineers with basic Python and linear algebra; no prior computer vision required.
front-matter/fm-who-should-read.html
- F4 What's Inside
A guided preview of the book's signature elements: worked pipelines, library shortcuts, callouts, and labs.
front-matter/look-inside-preview.html
- F5 How to Use This Book
Reading paths for engineers, researchers, and self-study learners; how the parts depend on each other.
front-matter/fm-how-to-use.html
- F6 About the Authors
Who wrote this book and how.
front-matter/about-authors.html
- F7 Copyright & Legal
Edition, license, and attribution.
front-matter/copyright.html
Part I · Image Processing
9 chapters · 49 sections
The signal-processing bedrock: pixels, color, histograms, filtering, frequency, geometry, morphology, and restoration.
- 0 Foundations: The Python Imaging Stack
An image is a NumPy array; master the array and the entire vision stack opens up.
- 0.1 Images as Arrays: Pixels, Channels & Dtypes
- 0.2 The Python Imaging Ecosystem: OpenCV, scikit-image & Pillow
- 0.3 Reading, Writing & Displaying Images
- 0.4 Conventions & Pitfalls: BGR vs RGB, uint8 vs float, Row-Column Order
- 0.5 A First Pipeline: Load, Process, Measure, Save
part-1-image-processing/module-00-python-imaging-stack/
- 1 Digital Image Fundamentals
From photons to pixels: how a digital image is born, encoded, and judged.
- 1.1 Image Formation: Optics, Sensors & the ISP Pipeline
- 1.2 Sampling & Quantization
- 1.3 Resolution, Bit Depth & Dynamic Range
- 1.4 Color Science & Color Spaces: RGB, HSV, Lab & YCbCr
- 1.5 Image Formats & Compression: PNG, JPEG & WebP
part-1-image-processing/module-01-digital-image-fundamentals/
- 2 Point Operations, Histograms & Thresholding
Per-pixel transforms are the simplest tools in vision, and still among the most used.
- 2.1 Brightness, Contrast & Gamma Correction
- 2.2 Image Histograms & Statistics
- 2.3 Histogram Equalization & CLAHE
- 2.4 Thresholding: Global, Otsu & Adaptive
- 2.5 Image Arithmetic, Blending & Compositing
part-1-image-processing/module-02-point-operations-histograms/
- 3 Spatial Filtering & Convolution
The kernel is the atom of image processing, and the same operation that powers CNNs in Part III.
- 3.1 Convolution & Correlation: The Workhorse Operation
- 3.2 Smoothing: Box, Gaussian & Median Filters
- 3.3 Sharpening & Unsharp Masking
- 3.4 Derivative Filters: Sobel, Laplacian & LoG
- 3.5 Edge-Preserving Smoothing: Bilateral & Guided Filters
- 3.6 Borders, Separability & Performance
part-1-image-processing/module-03-spatial-filtering-convolution/
- 4 The Frequency Domain & Multi-Scale Analysis
Every image is a sum of waves; seeing it that way explains aliasing, compression, and pyramids in one stroke.
- 4.1 Fourier Intuition: Images as Sums of Waves
- 4.2 The 2D DFT & FFT in Practice
- 4.3 Frequency-Domain Filtering: Low-Pass, High-Pass & Notch
- 4.4 The Sampling Theorem, Aliasing & Anti-Aliasing
- 4.5 Image Pyramids: Gaussian & Laplacian
- 4.6 Wavelets & Time-Frequency Trade-offs
part-1-image-processing/module-04-frequency-domain-multiscale/
- 5 Geometric Transformations & Image Warping
Rotating, rectifying, and registering images: the coordinate machinery behind every camera app.
- 5.1 The Transformation Hierarchy: Translation to Projective
- 5.2 Homogeneous Coordinates & Transformation Matrices
- 5.3 Interpolation: Nearest, Bilinear, Bicubic & Lanczos
- 5.4 Warping, Remapping & Inverse Mapping
- 5.5 Image Registration & Alignment
- 5.6 Worked Example: A Document Scanner from Scratch
part-1-image-processing/module-05-geometric-transformations/
- 6 Morphology, Binary Images & Shape
Once an image is binary, a small algebra of erosions and dilations solves a surprising share of industrial vision.
- 6.1 Binary Images, Neighborhoods & Connectivity
- 6.2 Erosion & Dilation
- 6.3 Opening, Closing & Morphological Gradients
- 6.4 Connected Components & Region Properties
- 6.5 Distance Transforms & Skeletonization
- 6.6 Contours, Moments & Shape Descriptors
part-1-image-processing/module-06-morphology-binary-shape/
- 7 Image Restoration & Enhancement
Undoing damage: noise, blur, missing pixels, and limited dynamic range, with the classical methods deep models later learned to beat.
- 7.1 Noise Models & Degradation Pipelines
- 7.2 Classical Denoising: From Gaussian to Non-Local Means
- 7.3 Deblurring & Deconvolution: Wiener & Richardson-Lucy
- 7.4 Inpainting: Filling the Holes
- 7.5 Classical Super-Resolution
- 7.6 HDR Imaging & Tone Mapping
part-1-image-processing/module-07-restoration-enhancement/
- 8 Tools of the Trade: The Image Processing Stack
Consolidated reference: libraries, performance tooling, datasets, and external resources for this part.
- 8.1 Library Landscape: OpenCV, scikit-image, Pillow & SciPy ndimage
- 8.2 Performance: Vectorization, OpenCV Optimizations & GPU Arrays
- 8.3 Test Images, Datasets & Quality Metrics Tooling
- 8.4 Curated References & Further Reading
part-1-image-processing/module-08-tools-of-the-trade/
Part II · Classical Computer Vision
9 chapters · 48 sections
Vision before learning: features, matching, multi-view geometry, motion, and the recognition pipelines that defined an era.
- 9 Edges, Lines & Curves
From raw gradients to structured geometry: the first step from processing images to understanding them.
- 9.1 What Is an Edge? Gradients Revisited
- 9.2 The Canny Edge Detector, Step by Step
- 9.3 The Hough Transform: Lines & Circles
- 9.4 Fitting Curves: Least Squares & Robust Alternatives
- 9.5 Worked Example: Lane-Marking Detection
part-2-classical-computer-vision/module-09-edges-lines-curves/
- 10 Keypoints, Descriptors & Matching
Find the same point in two photographs and most of geometric vision follows.
- 10.1 Corner Detection: Harris, Shi-Tomasi & FAST
- 10.2 Scale & Rotation Invariance: Scale Space
- 10.3 SIFT: The Descriptor That Defined a Decade
- 10.4 Fast Binary Alternatives: BRIEF, ORB & AKAZE
- 10.5 Descriptor Matching & the Ratio Test
- 10.6 RANSAC & Robust Model Fitting
part-2-classical-computer-vision/module-10-keypoints-descriptors-matching/
- 11 Classical Segmentation & Grouping
Carving an image into meaningful regions with clustering, watersheds, and graphs.
- 11.1 Segmentation as Clustering: K-Means & Mean-Shift
- 11.2 Region Growing & Split-and-Merge
- 11.3 The Watershed Transform
- 11.4 Graph-Based Segmentation: Graph Cuts & GrabCut
- 11.5 Superpixels: SLIC & Friends
part-2-classical-computer-vision/module-11-classical-segmentation/
- 12 Camera Models & Calibration
The pinhole camera turns 3D into 2D; calibration tells you exactly how.
- 12.1 The Pinhole Camera & Intrinsic Parameters
- 12.2 Lens Distortion & Its Correction
- 12.3 Camera Calibration: Zhang's Method in Practice
- 12.4 Extrinsics & Pose Estimation: The PnP Problem
- 12.5 Calibration Workflows, Targets & Quality Checks
part-2-classical-computer-vision/module-12-camera-models-calibration/
- 13 Two-View Geometry, Stereo & Depth
Two cameras and a bit of linear algebra recover what one camera lost: depth.
- 13.1 Epipolar Geometry: The Geometry of Two Views
- 13.2 Essential & Fundamental Matrices
- 13.3 Homographies & Panorama Stitching
- 13.4 Stereo Rectification & Disparity Estimation
- 13.5 From Disparity to Depth Maps
- 13.6 Triangulation & 3D Point Recovery
part-2-classical-computer-vision/module-13-two-view-stereo-depth/
- 14 Structure from Motion & Visual SLAM
From a pile of photos to a 3D model, and from a moving camera to a live map.
- 14.1 Feature Tracks & Correspondence Across Many Views
- 14.2 Incremental Structure from Motion
- 14.3 Bundle Adjustment: Polishing the Reconstruction
- 14.4 Visual SLAM: Mapping While Moving
- 14.5 COLMAP & Modern Reconstruction Pipelines
part-2-classical-computer-vision/module-14-sfm-visual-slam/
- 15 Motion, Optical Flow & Tracking
Video adds time; flow and tracking turn pixel motion into object motion.
- 15.1 Motion Fields & the Brightness Constancy Assumption
- 15.2 Sparse Flow: Lucas-Kanade & Feature Tracking
- 15.3 Dense Flow: Horn-Schunck to Variational Methods
- 15.4 Background Subtraction & Change Detection
- 15.5 Object Tracking: Mean-Shift, Correlation Filters & Re-Detection
- 15.6 Kalman Filters & Multi-Object Data Association
part-2-classical-computer-vision/module-15-motion-flow-tracking/
- 16 Classical Recognition Pipelines
Hand-crafted features plus shallow classifiers ruled recognition for two decades; understanding why they plateaued explains why deep learning won.
- 16.1 Template Matching & Its Limits
- 16.2 Bag of Visual Words & Spatial Pyramids
- 16.3 HOG + SVM: The Pedestrian Detection Era
- 16.4 Viola-Jones: Real-Time Face Detection
- 16.5 Deformable Part Models
- 16.6 Why Hand-Crafted Pipelines Plateaued: The Bridge to Deep Learning
part-2-classical-computer-vision/module-16-classical-recognition/
- 17 Tools of the Trade: The Classical CV Stack
Consolidated reference: libraries, reconstruction tooling, datasets, and external resources for this part.
- 17.1 OpenCV Beyond the Basics: features2d, calib3d & video
- 17.2 Reconstruction Tooling: COLMAP, OpenMVG & Friends
- 17.3 Datasets & Benchmarks for Geometry, Flow & Tracking
- 17.4 Curated References & Further Reading
part-2-classical-computer-vision/module-17-tools-of-the-trade/
Part III · Deep Learning for Computer Vision
12 chapters · 67 sections
Vision learned end to end: CNNs, transformers, detection, segmentation, self-supervision, video, 3D, and deployment.
- 18 Neural Networks & PyTorch for Vision
Everything Part III builds on: tensors, autograd, and a training loop you fully understand.
- 18.1 From Linear Models to Multi-Layer Perceptrons
- 18.2 Backpropagation & Optimization in a Nutshell
- 18.3 PyTorch Essentials: Tensors, Autograd & nn.Module
- 18.4 Datasets, DataLoaders & Input Pipelines
- 18.5 The Training Loop: Losses, Metrics & Checkpointing
- 18.6 GPUs, Mixed Precision & Reproducibility
part-3-deep-learning-for-vision/module-18-neural-networks-pytorch/
- 19 Convolutional Neural Networks
The convolution from Chapter 3, made learnable: weight sharing, hierarchy, and the inductive bias that fits images.
- 19.1 Why Convolution? Locality, Weight Sharing & Inductive Bias
- 19.2 Convolution Layers: Channels, Stride, Padding & Dilation
- 19.3 Pooling, Receptive Fields & Feature Hierarchies
- 19.4 Batch Normalization & Friends
- 19.5 A CNN from Scratch: CIFAR-10 End to End
- 19.6 Visualizing What CNNs Learn
part-3-deep-learning-for-vision/module-19-convolutional-neural-networks/
- 20 CNN Architectures: From LeNet to ConvNeXt
A decade of architecture search, told as a story of bottlenecks found and removed.
- 20.1 LeNet & AlexNet: The Breakthrough Years
- 20.2 VGG & Inception: Depth vs Width
- 20.3 ResNet: Residual Learning Changes Everything
- 20.4 Efficient Designs: MobileNet, ShuffleNet & EfficientNet
- 20.5 ConvNeXt: The CNN, Modernized
- 20.6 Choosing an Architecture in Practice
part-3-deep-learning-for-vision/module-20-cnn-architectures/
- 21 Training Recipes: Data, Augmentation & Transfer
In practice the recipe matters as much as the architecture; this chapter is the recipe.
- 21.1 Vision Datasets & the ImageNet Legacy
- 21.2 Data Augmentation: From Flips to MixUp & CutMix
- 21.3 Transfer Learning & Fine-Tuning Strategies
- 21.4 Regularization, Schedules & the Modern Training Recipe
- 21.5 Class Imbalance, Label Noise & Real-World Data
- 21.6 Debugging Training: Curves, Overfitting & Sanity Checks
part-3-deep-learning-for-vision/module-21-training-recipes/
- 22 Vision Transformers
Treat an image as a sequence of patches and the transformer takes over; the question is when that trade is worth it.
- 22.1 Attention & the Transformer Block, Vision Edition
- 22.2 ViT: Images as Sequences of Patches
- 22.3 Data-Efficient Training: DeiT & Augmentation for ViTs
- 22.4 Hierarchical Designs: Swin & Pyramid Transformers
- 22.5 CNNs vs ViTs: Inductive Bias, Scale & Hybrids
part-3-deep-learning-for-vision/module-22-vision-transformers/
- 23 Object Detection
Where are the objects and what are they: the task that drives much of applied vision.
- 23.1 The Detection Problem: Boxes, IoU & mAP
- 23.2 Two-Stage Detectors: The R-CNN Family
- 23.3 One-Stage Detectors: YOLO, SSD & RetinaNet
- 23.4 Anchor-Free & Keypoint-Based Detection
- 23.5 DETR: Detection as Set Prediction
- 23.6 Training & Deploying a Detector on Custom Data
part-3-deep-learning-for-vision/module-23-object-detection/
- 24 Segmentation: Semantic, Instance & Promptable
From a label per image to a label per pixel, and on to models that segment anything you point at.
- 24.1 Semantic Segmentation: FCN, U-Net & DeepLab
- 24.2 Instance Segmentation: Mask R-CNN
- 24.3 Panoptic Segmentation: Unifying Things & Stuff
- 24.4 Transformer Segmenters: SegFormer & Mask2Former
- 24.5 Segment Anything: Promptable Segmentation
- 24.6 Losses, Metrics & Evaluation for Dense Prediction
part-3-deep-learning-for-vision/module-24-segmentation/
- 25 Self-Supervised Learning & Vision Foundation Models
Labels stopped being the bottleneck: how vision models learn from raw pixels and from language.
- 25.1 Pretext Tasks: Learning Without Labels
- 25.2 Contrastive Learning: SimCLR & MoCo
- 25.3 Self-Distillation & Masked Image Modeling: DINO & MAE
- 25.4 CLIP: Language as Supervision
- 25.5 Open-Vocabulary Detection & Segmentation
- 25.6 The Vision Foundation Model Landscape
part-3-deep-learning-for-vision/module-25-self-supervised-foundation-models/
- 26 Video Understanding
Adding the time axis: actions, motion, and tracking with learned features.
- 26.1 From Frames to Clips: The Temporal Dimension
- 26.2 Action Recognition: 3D CNNs & Two-Stream Networks
- 26.3 Video Transformers
- 26.4 Deep Optical Flow: RAFT & Beyond
- 26.5 Multi-Object Tracking with Learned Features
part-3-deep-learning-for-vision/module-26-video-understanding/
- 27 Depth, 3D Vision & Neural Scene Representations
Deep networks meet the geometry of Part II: learned depth, point clouds, radiance fields, and splats.
- 27.1 Monocular Depth Estimation
- 27.2 3D Representations: Point Clouds, Voxels & Meshes
- 27.3 Learning on Point Clouds: PointNet & Successors
- 27.4 NeRF: Neural Radiance Fields
- 27.5 3D Gaussian Splatting
- 27.6 Capture-to-Render Pipelines in Practice
part-3-deep-learning-for-vision/module-27-depth-3d-neural-scenes/
- 28 Efficient Vision & Edge Deployment
A model that cannot run on the target hardware is a prototype; this chapter ships it.
- 28.1 The Efficiency Toolbox: Quantization, Pruning & Distillation
- 28.2 Export & Runtimes: ONNX, TensorRT & OpenVINO
- 28.3 Edge & Mobile Vision: From Jetson to Phones
- 28.4 Serving Vision Models: Batching, Throughput & Latency
- 28.5 Monitoring, Drift & Continual Improvement
part-3-deep-learning-for-vision/module-28-efficient-vision-deployment/
- 29 Tools of the Trade: The Deep Vision Stack
Consolidated reference: model hubs, frameworks, data tooling, and external resources for this part.
- 29.1 Model Hubs & Libraries: torchvision, timm, Hugging Face & Ultralytics
- 29.2 Detection & Segmentation Frameworks: Detectron2 & MMDetection
- 29.3 Data Tooling: Annotation, Versioning, FiftyOne & Roboflow
- 29.4 Experiment Tracking, Curated References & Further Reading
part-3-deep-learning-for-vision/module-29-tools-of-the-trade/
Part IV · Generative Vision Models
9 chapters · 55 sections
Models that create: VAEs, GANs, diffusion, text-to-image, controllable editing, video and 3D generation, evaluation and governance.
- 30 Foundations of Generative Modeling
From recognizing images to producing them: what it means to model the distribution of natural images.
- 30.1 Generative vs Discriminative: What Does It Mean to Model p(x)?
- 30.2 A Map of Generative Families: VAE, GAN, Flow, Autoregressive & Diffusion
- 30.3 Latent Variables & the Idea of a Latent Space
- 30.4 Energy-Based Models, Score Matching & Langevin Dynamics
- 30.5 Sampling, Likelihood & the Quality-Diversity-Speed Trilemma
- 30.6 Evaluating Generators: A First Look
part-4-generative-vision-models/module-30-generative-foundations/
- 31 Autoencoders & Variational Autoencoders
Compression as representation, and the probabilistic twist that made decoders generative.
- 31.1 Autoencoders: Compression as Representation
- 31.2 Denoising & Sparse Autoencoders
- 31.3 The VAE: ELBO, Reparameterization & Amortized Inference
- 31.4 Disentanglement, beta-VAE & Posterior Collapse
- 31.5 Hierarchical VAEs: From Ladder Networks to NVAE
- 31.6 Discrete Latents: VQ-VAE & Learned Codebooks
part-4-generative-vision-models/module-31-autoencoders-vaes/
- 32 Generative Adversarial Networks
Two networks in a game: the family that made photorealistic generation possible, and the lessons it left behind.
- 32.1 The Adversarial Game
- 32.2 Training Pathologies: Mode Collapse & Instability
- 32.3 DCGAN to StyleGAN: The Architecture Lineage
- 32.4 Conditional GANs & Image-to-Image Translation: pix2pix & CycleGAN
- 32.5 GAN Inversion & Latent-Space Editing
- 32.6 GANs Today: Where They Still Win
part-4-generative-vision-models/module-32-gans/
- 33 Diffusion Models
Destroy an image with noise, learn to rebuild it, and you get the engine behind modern image generation.
- 33.1 Destroying & Rebuilding: The Forward & Reverse Processes
- 33.2 DDPM: Noise Schedules, Parameterizations & the Variational View
- 33.3 The Score-Based View: VE/VP SDEs & the Probability-Flow ODE
- 33.4 Fast Sampling: DDIM, Solvers & Step Distillation
- 33.5 Flow Matching, Rectified Flow & Consistency Models
- 33.6 Guidance: Classifier & Classifier-Free
- 33.7 Latent Diffusion: Compress First, Then Diffuse
part-4-generative-vision-models/module-33-diffusion-models/
- 34 Text-to-Image Systems
Inside the systems that turn a sentence into an image, from CLIP conditioning to full production stacks.
- 34.1 Connecting Text & Pixels: CLIP & Text Encoders
- 34.2 Inside Stable Diffusion: VAE, U-Net, DiT & Conditioning
- 34.3 The Model Landscape: DALL-E, Imagen, Midjourney & FLUX
- 34.4 Autoregressive & Token-Based Image Generation
- 34.5 Prompt Engineering for Image Generation
- 34.6 Fine-Tuning Text-to-Image Models
part-4-generative-vision-models/module-34-text-to-image/
- 35 Controllable Generation & Image Editing
From prompt roulette to precise control: structure, identity, and edits that preserve everything else.
- 35.1 Spatial Control: ControlNet & Conditioning Adapters
- 35.2 Personalization: LoRA, DreamBooth & Textual Inversion
- 35.3 Inpainting, Outpainting & Object Replacement
- 35.4 Instruction-Based Editing
- 35.5 Real-Image Inversion & Faithful Editing
- 35.6 Composing Multi-Step Editing Workflows
part-4-generative-vision-models/module-35-controllable-generation-editing/
- 36 Video, 3D Generation & World Models
Generation grows axes: time, depth, and agency, from video diffusion to world models that learn to simulate.
- 36.1 Video Diffusion: Architectures & Temporal Consistency
- 36.2 Text-to-Video Systems: Sora-Class Models & the Open Ecosystem
- 36.3 Text-to-3D & Image-to-3D Generation
- 36.4 Generative Neural Rendering: From Splats to Scenes
- 36.5 World Models: Latent Dynamics, RSSM & Learning in Imagination
- 36.6 Generative World Simulators: From GAIA-1 to Interactive Environments
- 36.7 Predictive World Models: JEPA & Decoder-Free Latents
- 36.8 Evaluating World Models: Physical Consistency, Controllability & Coherence
part-4-generative-vision-models/module-36-video-3d-world-generation/
- 37 Evaluation, Safety & Generative Data Engines
Measuring what generators produce, governing how they are used, and putting them to work as synthetic-data engines for the models of Part III.
- 37.1 Measuring Image Quality: FID, KID, Precision-Recall & CLIPScore
- 37.2 Human Evaluation & Preference Studies
- 37.3 Generative Models as Data Engines: Synthetic Data for Training Vision Systems
- 37.4 Deepfakes, Detection & Misuse
- 37.5 Watermarking & Content Provenance: C2PA & Beyond
- 37.6 Licensing, Copyright & Responsible Deployment
part-4-generative-vision-models/module-37-evaluation-safety-data-engines/
- 38 Tools of the Trade: The Generative Vision Stack
Consolidated reference: generation libraries, workflow engines, hosted APIs, and external resources for this part.
- 38.1 Hugging Face Diffusers & the Python Generation Stack
- 38.2 Node-Based Workflows: ComfyUI & Workflow Engines
- 38.3 Hosted Generation APIs & Services
- 38.4 Curated References & Further Reading
part-4-generative-vision-models/module-38-tools-of-the-trade/
Appendices · Reference and Pedagogy
6 appendices
- A Mathematical Foundations for Vision
The essential linear algebra, probability, optimization, and signal processing behind every chapter.
appendices/appendix-a-mathematical-foundations/
- B Datasets & Benchmarks Catalog
A per-task reference: classification, detection, segmentation, geometry, flow, video, and generation benchmarks, with licensing notes.
appendices/appendix-b-datasets-benchmarks/
- C Course Syllabi
Tested course tracks built from the book: a one-semester image processing and classical CV course, a deep vision course, and a generative vision course, with week-by-week schedules.
appendices/appendix-c-course-syllabi/
- D Reading Pathways
Per-audience reading guides for engineers, researchers, generative-AI practitioners, and self-study learners.
appendices/appendix-d-reading-pathways/
- E Cameras, GPUs & Edge Hardware Guide
Choosing sensors, lenses, GPUs, and edge devices for vision workloads, from lab prototypes to production lines.
appendices/appendix-e-cameras-gpus-edge-hardware/
- F Agents That Helped to Write This Book
Roster of the 42 specialist AI agents in the writing pipeline that produced this manuscript, with a card per agent.
appendices/appendix-f-agent-roster/
Capstone · End-to-End Vision System
1 project
- ★ Capstone Project: An End-to-End Vision System
Design, build, evaluate, and present a production-grade vision application that spans all four parts: classical preprocessing and geometry, a fine-tuned detector or segmenter, a generative synthetic-data engine, and honest evaluation with deployment.
capstone/