Second Edition · 2026

Building Vision AI: From Pixels to Generative Models

A practitioner's guide to image processing, classical computer vision, deep learning, and generative vision models.

Alexander (Sasha) Apartsin, Ph.D. & Yehudit Aperstein, Ph.D.

Vision is the richest channel through which intelligence meets the world. This book is one connected journey through the theories, models, and engineering practices for systems that see, interpret, and create images. It starts with the signal-processing bedrock of image formation, builds through classical computer vision and deep learning, then moves into generative models that synthesize images, video, and 3D worlds, before closing with the evaluation, safety, and deployment concerns that govern real systems.

4 parts · 39 chapters · 219 sections · 7 appendices & a capstone

The Four-Part Arc

Each part stands on the one before it; together they span sixty years of vision in one continuous build.

Part I: Image Processing

The signal-processing bedrock: pixels, color, histograms, convolution, the frequency domain, geometric warping, morphology, and restoration. The operations every vision system stands on.

9 chapters · 49 sections

Part II: Classical Computer Vision

Vision before learning: edges and keypoints, matching and RANSAC, camera models, stereo and structure from motion, optical flow, and the hand-crafted recognition pipelines that defined an era.

9 chapters · 48 sections

Part III: Deep Learning for Computer Vision

Vision learned end to end: PyTorch foundations, CNNs and vision transformers, detection and segmentation, self-supervision and foundation models, video, 3D, and deployment to real hardware.

12 chapters · 67 sections

Part IV: Generative Vision Models

Models that create: VAEs, GANs, and diffusion; text-to-image systems and controllable editing; video, 3D, and world generation; plus the evaluation, safety, and synthetic-data practice that makes them useful.

9 chapters · 55 sections

How This Book Teaches

Five habits, kept in every chapter from the first pixel to the last sample.

Worked Pipelines

Every chapter builds complete, runnable systems (a document scanner, a lane detector, a CIFAR-10 classifier end to end), never isolated snippets.

Library Shortcuts

After each from-scratch build, a shortcut callout shows the same task in a few lines of OpenCV, scikit-image, PyTorch, or diffusers, and names exactly what the library handles for you.

A Callout System

Pitfalls, math asides, practical industry examples, and cross-references are typeset as distinct boxes, so you can read deep or skim fast and never miss a trap.

Exercises & Labs

Each chapter closes with hands-on exercises that extend its worked pipelines, from quick checks to small projects you can put in a portfolio.

Classical Ideas Return Learned

Convolution becomes the CNN layer, denoising becomes diffusion, inpainting becomes generative editing, and multi-view geometry returns in NeRF. One story, told twice.

The Hands-On AI Science Series

Building Vision AI is one of nine connected books, each a deep, build-it-yourself guide to a major field of AI.

Hands-On AI Science is a series of in-depth guides to the major fields of artificial intelligence. Every book goes deep into the theory, models, and internals, covering the classical foundations and the most recent ideas, then shows you how to build each one in Python with the modern libraries and tools that get the job done. The writing stays plain and light (illustrations, analogies, mental models, worked examples, and a little fun) without trading away rigor or coverage. Each volume is self-contained and complete enough to anchor a full course on its subject.