Part III

Deep Learning for Computer Vision

Vision learned end to end: CNNs, transformers, detection, segmentation, self-supervision, video, 3D, and deployment.

Part Overview

Part II ended with a question. Hand-crafted recognition pipelines, HOG templates, deformable parts, bags of visual words, climbed steadily for two decades and then stopped climbing. The features themselves were the ceiling: no amount of classifier tuning could recover what a fixed descriptor had already thrown away. Part III is the answer the field converged on. Stop designing the features and learn them from data, end to end, with the task's own loss as the teacher. Across twelve chapters this part rebuilds vision on that single idea, from a first training loop to foundation models that learn from the raw internet, and it does so without discarding Parts I and II; it keeps returning to them.

The first five chapters build the machinery in strict sequence. Chapter 18 lays the foundation everything else stands on: tensors, autograd, and a PyTorch training loop you understand line by line rather than copy. Chapter 19 takes the convolution you met in Chapter 3 and makes its kernel weights learnable, showing why locality and weight sharing are exactly the right inductive bias for images. Chapter 20 then compresses a decade of architecture research, LeNet to ConvNeXt, into a story of bottlenecks found and removed. Chapter 21 supplies the craft that papers underreport: the data, augmentation, and transfer-learning recipe that decides whether a given architecture actually trains. Chapter 22 completes the toolkit with vision transformers, which trade convolution's built-in bias for scale, and asks when that trade is worth making.

With the toolkit in hand, the middle chapters turn to the tasks applied vision is hired for. Chapter 23 asks where the objects are and what they are, the question behind most production vision systems. Chapter 24 pushes from a label per image to a label per pixel, and on to promptable models that segment anything you point at. Chapter 25 then removes the labels altogether: self-supervision and language supervision are how today's vision foundation models came to be, and they change what "training a model" means for every chapter before and after.

The remaining chapters extend the same machinery along new axes. Chapter 26 adds time: actions, motion, and tracking with learned features. Chapter 27 brings deep networks back to the geometry of Part II, with learned depth, point clouds, radiance fields, and Gaussian splats. Chapter 28 confronts the constraint every real system meets, that a model which cannot run on the target hardware is a prototype, and ships it. Chapter 29 closes the part with a consolidated reference for the deep vision stack: model hubs, frameworks, data tooling, and the resources you will reach for daily.

Chapters: 12 (Chapters 18–29). Read Chapters 18 through 22 in order; the task chapters that follow can be read selectively, though each assumes the toolkit that precedes it.

Big Picture

Every chapter in this part makes the same move: take a stage that Parts I and II designed by hand, replace it with a learned module, and let the task's loss decide what the features should be. The convolution kernel, the descriptor, the matcher, the depth map: each returns here, learned.

Everything Part III builds on: tensors, autograd, and a training loop you fully understand.

The convolution from Chapter 3, made learnable: weight sharing, hierarchy, and the inductive bias that fits images.

A decade of architecture search, told as a story of bottlenecks found and removed.

In practice the recipe matters as much as the architecture; this chapter is the recipe.

Treat an image as a sequence of patches and the transformer takes over; the question is when that trade is worth it.

Where are the objects and what are they: the task that drives much of applied vision.

From a label per image to a label per pixel, and on to models that segment anything you point at.

Labels stopped being the bottleneck: how vision models learn from raw pixels and from language.

Adding the time axis: actions, motion, and tracking with learned features.

Deep networks meet the geometry of Part II: learned depth, point clouds, radiance fields, and splats.

A model that cannot run on the target hardware is a prototype; this chapter ships it.

Consolidated reference: model hubs, frameworks, data tooling, and external resources for this part.

Where This Part Leads

The models in this part recognize what is in an image; the models in Part IV: Generative Vision Models produce images of their own. The arc continues there: the denoising of Chapter 7 returns as diffusion, the autoencoder becomes a generator, and the detectors and segmenters you train here become both the consumers of synthetic training data and the judges of what generative models create.