Part II: Classical Computer Vision | Building Vision AI: From Pixels to Generative Models

Part Overview

Part I treated an image as a signal: an array to be filtered, transformed, and restored. Part II changes the question. Instead of asking "how do I clean this image?", it asks "what does this image tell me about the world that produced it?" For decades before learned features, vision researchers answered with geometry, optimization, and robust statistics, and their answers still run in production today: the panorama mode on your phone, the tracking in an AR headset, the map inside a robot vacuum. Just as importantly, these methods supply the concepts and vocabulary that deep learning would later inherit.

The nine chapters build in a deliberate sequence. Chapter 9 takes the gradients of Part I and organizes them into structure: gradients become edges through Canny, and edges become lines and circles through the Hough transform and robust curve fitting. Chapter 10 then poses the question that unlocks most of geometric vision: can you find the same physical point in two photographs? Corners, scale space, SIFT, binary descriptors, the ratio test, and RANSAC answer it. Chapter 11 widens the view from points to regions, grouping pixels into meaningful areas with clustering, watersheds, and graph cuts.

With correspondence in hand, the part turns to geometry proper. Chapter 12 introduces the pinhole camera and calibration: the exact mathematics of how a 3D scene becomes a 2D image. Chapter 13 puts two calibrated views together and recovers what a single camera lost, building from epipolar geometry through stereo disparity to triangulated 3D points. Chapter 14 scales from two views to hundreds: structure from motion turns a pile of photos into a 3D model, and visual SLAM lets a moving camera build a live map.
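As a small preview of the geometry in Chapters 12 and 13, the sketch below projects one 3D point through an idealized pinhole camera and then recovers its depth from the disparity between two rectified views. It is a minimal illustration rather than the book's implementation: the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), the 0.25 m baseline, and the sample point are all assumed values chosen for this example, and lens distortion is ignored.

```python
def project(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates.

    Idealized pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy.
    """
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (fx * X / Z + cx, fy * Y / Z + cy)


def depth_from_disparity(u_left, u_right, fx, baseline):
    """Recover depth for a rectified stereo pair: Z = fx * baseline / disparity."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return fx * baseline / disparity


# Hypothetical intrinsics (pixels) and stereo baseline (metres).
fx = fy = 800.0
cx, cy = 320.0, 240.0
baseline = 0.25

# A 3D point expressed in the left camera's coordinate frame (metres).
point_left = (0.5, -0.25, 4.0)

# The right camera sits `baseline` metres along +X, so the same point
# has a smaller X coordinate in the right camera's frame.
point_right = (point_left[0] - baseline, point_left[1], point_left[2])

u_left, v_left = project(point_left, fx, fy, cx, cy)
u_right, v_right = project(point_right, fx, fy, cx, cy)

# Projection discarded Z; the horizontal shift between the views restores it.
recovered_depth = depth_from_disparity(u_left, u_right, fx, baseline)
print(u_left, v_left, u_right, v_right, recovered_depth)
```

A single projection maps every point along a viewing ray to the same pixel, which is why depth is lost; the second view breaks that ambiguity because the horizontal shift of the point shrinks as its depth grows.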

Chapter 15 adds the time axis, turning pixel motion into object motion with optical flow, background subtraction, trackers, and Kalman filters. Chapter 16 tells the closing story of the era: hand-crafted features feeding shallow classifiers ruled recognition for two decades, and understanding exactly why that recipe plateaued explains why deep learning won. Chapter 17 consolidates the part's tooling: OpenCV's geometry and video modules, reconstruction pipelines such as COLMAP, and the datasets and benchmarks that keep these methods honest.

Big Picture

Deep learning replaced the features of this part, but not its geometry. Modern pipelines still clean matches with RANSAC, calibrate cameras with Zhang's method, and bootstrap NeRF and Gaussian splatting from COLMAP reconstructions. Learn this part well and Part III will read as a substitution of components inside a structure you already understand, not as a fresh start.

Chapter 9 Edges, Lines & Curves

From raw gradients to structured geometry: the first step from processing images to understanding them.

Chapter 10 Keypoints, Descriptors & Matching

Find the same point in two photographs and most of geometric vision follows.

Chapter 11 Classical Segmentation & Grouping

Carving an image into meaningful regions with clustering, watersheds, and graphs.

Chapter 12 Camera Models & Calibration

The pinhole camera turns 3D into 2D; calibration tells you exactly how.

Chapter 13 Two-View Geometry, Stereo & Depth

Two cameras and a bit of linear algebra recover what one camera lost: depth.

Chapter 14 Structure from Motion & Visual SLAM

From a pile of photos to a 3D model, and from a moving camera to a live map.

Chapter 15 Motion, Optical Flow & Tracking

Video adds time; flow and tracking turn pixel motion into object motion.

Chapter 16 Classical Recognition Pipelines

Hand-crafted features plus shallow classifiers ruled recognition for two decades; understanding why they plateaued explains why deep learning won.

Chapter 17 Tools of the Trade: The Classical CV Stack

Consolidated reference: libraries, reconstruction tooling, datasets, and external resources for this part.

Where This Part Leads

In Part III: Deep Learning for Computer Vision, the hand-designed pipeline of this part is rebuilt with learned components. The convolution kernels of Chapter 3 become trainable CNN layers, the descriptors of Chapter 10 become learned embeddings, the segmentation of Chapter 11 returns as dense per-pixel prediction, and the geometry of Chapters 12 through 14 resurfaces in monocular depth estimation and neural scene representations. The questions stay the same; the answers start being learned from data.