Why This Book Exists | Building Vision AI

"They handed me sixty years of computer vision and asked for a summary. I said: first we sharpened the pixels, then we matched them, then we learned them, and now we make them up."
A Well-Calibrated Camera With a Long Memory

Computer vision is one of the oldest dreams in computing and one of its newest industries. The earliest programs that tried to interpret photographs date back to the 1960s, when researchers believed a summer project might suffice to connect a camera to a computer and have it describe what it saw. Six decades later, the field has delivered something stranger and more wonderful than that summer project imagined: systems that not only describe images but produce them, edit them, reconstruct the 3D world behind them, and reason about what they contain.

Along the way, the field accumulated four distinct bodies of knowledge. Image processing gave us the mathematics of pixels: filtering, frequency analysis, geometric warping, morphology. Classical computer vision gave us structure: features, matching, camera models, multi-view geometry, motion. Deep learning rebuilt recognition end to end: convolutional networks, transformers, detection, segmentation, self-supervision. And generative modeling, the newest layer, turned the whole pipeline around: instead of mapping images to labels, we now map noise and language to images, video, and 3D scenes.

The trouble is that these four bodies of knowledge are almost always taught separately. The signal-processing texts stop before learning begins. The geometry texts treat neural networks as someone else's department. The deep learning tutorials start at the tensor and never look down at the pixel. And the generative-model material often assumes you arrived yesterday, armed with a prompt and an API key. A practitioner assembling a real system must stitch these fragments together alone, usually under deadline, usually by trial and error.

This book exists because the four layers are not four subjects. They are one story, and the story has recurring characters. The convolution kernel you meet in Chapter 3 as a humble blur returns in Chapter 19 with learnable weights and becomes the engine of deep recognition. The denoising methods of Chapter 7 return in Chapter 33 as the training objective of diffusion models, the engine of modern image generation. Inpainting, introduced as a classical restoration trick, returns as generative editing in Chapter 35. The camera geometry of Chapters 12 through 14, which once seemed destined for robotics labs only, returns inside neural radiance fields in Chapter 27 and 3D generation in Chapter 36. Read in sequence, each new idea lands on prepared ground; nothing has to be taken on faith.

There is also a practical argument for the whole arc, and it is the argument that working engineers feel most sharply. The person debugging a detector that fails at night needs color spaces and histogram statistics, not another architecture. The person whose segmentation masks shimmer between video frames needs sampling theory and interpolation. The person fine-tuning a diffusion model produces better results faster when they understand what noise schedules and frequency content actually are. Modern libraries are so good that you can operate them blind, and that is precisely when foundations matter most: the libraries handle the common case, and your job begins where the common case ends.

Why write this book now? Because the generative wave has pulled an enormous new audience into vision, many of them arriving from software engineering, data science, or language modeling, and the on-ramps have not kept up. Foundation models, promptable segmenters, and text-to-image systems are remarkable, but they are the top floors of a building. This book walks you up the stairs, floor by floor, and points out which beams carry the weight.

The book is written for builders. Every chapter pairs the from-scratch implementation, where the understanding lives, with the production library call, where the shipping happens. Every part ends with a Tools of the Trade chapter that consolidates the practical stack. And the whole journey converges on a capstone project that spans all four parts: classical preprocessing, a fine-tuned recognition model, a generative data engine, and an honest evaluation.

Sixty years of ideas, one connected story, told for people who intend to build things. That is why this book exists. Welcome to it.