"For years they fed me labeled cats and labeled dogs, one expensive sticker at a time. Then one day they took away the stickers, showed me a billion unlabeled photos, and said figure it out yourself. I was offended. Then I got better at everything."
A Newly Self-Supervised Backbone
For most of deep vision's history the bottleneck was not the model but the labels: every percentage point of accuracy was paid for in human annotation. Self-supervised learning removes that bottleneck by inventing the supervision signal from the raw data itself, and pairing images with the text humans have already written about them turns the entire internet into a training set. A model trained this way learns features that transfer to classification, detection, segmentation, and depth with little or no task-specific labeling, and a single such model can serve dozens of downstream tasks. That is what the term foundation model means in vision. This chapter traces the idea from its earliest pretext tasks through contrastive learning, self-distillation, masked image modeling, and language supervision, and ends at the open-vocabulary systems and general-purpose backbones that anchor production computer vision from 2024 to 2026.
Chapter Overview
Every chapter of Part III so far has assumed a labeled dataset. Chapter 20 trained classifiers on ImageNet's 1.28 million hand-labeled images; Chapter 23 needed boxes drawn by annotators; Chapter 24 needed per-pixel masks, the most expensive labels of all. The dirty secret of supervised vision is that the labels, not the architectures, set the ceiling. Drawing a segmentation mask can take a trained annotator several minutes per image, and the world has far more images than it will ever have masks. This chapter is about the escape from that trap: learning useful visual representations without task labels, and then learning from the one kind of label the internet produces for free at planetary scale, namely the text that already accompanies images.
The story runs in two acts. The first act, self-supervision from pixels alone, asks the model to solve a puzzle whose answer is hidden in the image itself. Section 25.1 introduces these pretext tasks: predict the rotation applied to an image, reassemble a shuffled jigsaw, colorize a grayscale photo. The label is generated automatically, so the data is effectively infinite, and the features learned in service of the puzzle turn out to be useful for real tasks. Section 25.2 sharpens the idea into contrastive learning, where two augmented views of the same image are pulled together in feature space and pushed away from every other image. SimCLR and MoCo made this competitive with supervised pretraining and taught the field that augmentation choice, batch size, and a momentum-updated target network are the levers that matter.
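The "pull together, push apart" objective can be made concrete with a minimal sketch. It assumes a simplified one-directional form of the InfoNCE loss on toy two-dimensional embeddings, written with plain Python lists so it runs without a tensor library; the function names and the temperature value are illustrative choices, not SimCLR's or MoCo's exact recipe, which Section 25.2 develops in full.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed two augmentations of image i.

    Simplified sketch: negatives are only the other images in view_b.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of anchor i to every candidate in the other view.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in b]
        # Cross-entropy where the "correct class" is the matching index i.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

# Toy embeddings (made-up values): matched views agree in the first case only.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)  # True: the loss is low when positives are the closest pair
```

The loss is an ordinary softmax cross-entropy over similarities, which is why every other image in the batch acts as a negative and why the number of negatives matters so much to these methods.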
Section 25.3 covers the two ideas that closed the gap with supervision and, in DINO's case, produced features so clean they segment objects with no labels at all: self-distillation, where a student network learns to match a slowly-updated teacher, and masked image modeling, where the model reconstructs patches it was not allowed to see. MAE's asymmetric encoder-decoder and seventy-five percent masking made this both effective and cheap, importing into vision the masked-prediction recipe that built large language models.
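Both mechanisms in this paragraph reduce to a few lines once the networks are set aside. The sketch below assumes a 224 by 224 image cut into 16 by 16 patches (196 patches in total) for the masking step, and treats model weights as a flat list of floats for the teacher update; `random_masking`, `ema_update`, and the momentum value are hypothetical names and settings chosen for illustration, not the reference implementations.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into a visible set (fed to the encoder) and a masked set."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])   # the encoder only ever sees these
    masked = sorted(order[num_keep:])    # the decoder must reconstruct these
    return visible, masked

def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

# Assumed setting: 14 x 14 = 196 patches per image.
visible, masked = random_masking(num_patches=196)
print(len(visible), len(masked))  # 49 147
```

Because the encoder processes only the 49 visible patches in this setting, most of the compute per image is skipped, which is where the "cheap" in the paragraph above comes from. The exponential moving average is what makes the teacher a slowly-updated copy of the student rather than a separately trained network.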
The second act brings in language. Section 25.4 is CLIP, the model that reorganized the field: train an image encoder and a text encoder together, on four hundred million image-text pairs scraped from the web, so that an image and its caption land close together in a shared embedding space. The payoff is zero-shot classification: the model recognizes categories it was never explicitly trained on, simply by comparing the image to text descriptions of each candidate class. Section 25.5 shows how that open vocabulary propagates into the dense tasks: detectors and segmenters that find and outline objects named by an arbitrary text phrase, including the Segment Anything Model that made promptable segmentation a primitive. Section 25.6 steps back to survey the landscape: which backbones to reach for, how they relate, and where the frontier is heading.
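Zero-shot classification is, mechanically, a nearest-caption lookup in the shared space. The sketch below assumes the image and text encoders have already run and stands in for their outputs with made-up three-dimensional vectors; the prompts, the embedding values, the temperature, and the function names are all illustrative assumptions rather than CLIP's actual API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Softmax over image-to-prompt similarities; returns a probability per prompt."""
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {label: e / z for label, e in zip(labels, exps)}

# Toy embeddings standing in for encoder outputs (made-up values).
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_classify(image_emb, text_embs)
prediction = max(probs, key=probs.get)
print(prediction)  # a photo of a cat
```

Adding a new category means adding one more prompt to the dictionary. No classifier head is retrained, which is the sense in which the vocabulary is open.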
A thread you have been following since Chapter 10 reaches its conclusion here. There you built hand-crafted descriptors, SIFT and ORB, that summarized a patch into a vector robust to lighting and viewpoint. The whole project of this chapter is to learn those descriptors instead of designing them, and to learn them so well that one vector serves every task. By Section 25.4 the learned descriptor has become a CLIP embedding that can be compared not just to other images but to words, the universal descriptor the book has been building toward. The transfer-learning idea of Chapter 21 also completes here: a pretrained backbone becomes a foundation model when one frozen set of features serves many tasks at once.
Every method in this chapter is the same move, repeated with a different free label: invent the supervision the data already contains. The progression is worth carrying as the chapter's spine, because each section just changes where the free label comes from. Predict the change you applied (rotate, shuffle, decolor): the pretext tasks of 25.1. Predict the match, that two views are the same image: contrastive learning in 25.2. Predict your own past, agreeing with a slow copy of yourself, or predict the hidden patches: self-distillation and masked modeling in 25.3. Predict the caption humans already typed: CLIP in 25.4. The label is always free; the understanding never is. Sections 25.5 and 25.6 then spend that understanding, carrying the open vocabulary into dense prediction and surveying the foundation models the four free labels produced.
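To see what "free label" means in practice, here is a minimal sketch of the first one, rotation prediction. It assumes an image is just a small grid of numbers (a list of rows) and uses hypothetical helper names; a real pipeline would apply the same idea to image tensors and train a four-way classifier on the resulting pairs.

```python
import random

def rotate90(image, k):
    """Rotate a 2-D grid (list of rows) clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Reverse the row order, then transpose: one clockwise quarter turn.
        image = [list(row) for row in zip(*image[::-1])]
    return image

def make_rotation_example(image, rng):
    """Return (rotated image, label); the label is the number of quarter turns applied."""
    k = rng.randrange(4)
    return rotate90(image, k), k

image = [[1, 2],
         [3, 4]]
rng = random.Random(0)
rotated, label = make_rotation_example(image, rng)

# No annotator was involved: the training target is the transformation we chose.
print(rotate90(image, 1))  # [[3, 1], [4, 2]]
```

The other three free labels follow the same pattern: the code that prepares the input also produces the target, so the supply of training pairs is limited only by the supply of raw data.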
Prerequisites
You should have read Chapter 22: Vision Transformers, because the dominant self-supervised backbones (DINO, MAE, CLIP's image encoder) are Vision Transformers, and several methods exploit the patch structure that chapter built. Chapter 21: Training Recipes is essential: this chapter lives and dies by data augmentation, and the linear-probe and fine-tuning protocols used to evaluate self-supervised models are the transfer-learning protocols you learned there. Chapter 20: CNN Architectures supplies the ResNet backbones that the early contrastive methods used and that still appear as encoders. Comfort with the softmax, the dot product as a similarity measure, and the cross-entropy loss (used since Chapter 18) makes the contrastive and CLIP objectives concrete. Knowing how detection and segmentation heads work from Chapter 23 and Chapter 24 will help you appreciate the open-vocabulary extensions in Section 25.5.
Chapter Roadmap
- 25.1 Pretext Tasks: Learning Without Labels. The core idea of self-supervision: invent a label from the data itself. Rotation prediction, jigsaw puzzles, and colorization as pretext tasks, the transfer protocol that measures whether the learned features are any good, and why a good pretext task forces the model to understand content rather than exploit a shortcut.
- 25.2 Contrastive Learning: SimCLR & MoCo. Pull two augmented views of one image together and push every other image away. The InfoNCE loss, why augmentation choice is the real architecture, SimCLR's reliance on enormous batches, and MoCo's momentum encoder and queue that decouple the number of negatives from the batch size.
- 25.3 Self-Distillation & Masked Image Modeling: DINO & MAE. Learning without negatives. DINO's student-teacher self-distillation with centering and sharpening that produces emergent object segmentation, and MAE's masked autoencoding: hide 75 percent of patches, reconstruct them with an asymmetric encoder-decoder, and import the masked-prediction recipe of large language models into vision.
- 25.4 CLIP: Language as Supervision. Train an image encoder and a text encoder so that an image and its caption meet in one embedding space. The symmetric contrastive objective over a batch of image-text pairs, zero-shot classification via prompt engineering, and why four hundred million web pairs beat curated labels for transfer and robustness.
- 25.5 Open-Vocabulary Detection & Segmentation. Carrying the open vocabulary into dense prediction. Region-text alignment for open-vocabulary detectors, language-driven open-vocabulary segmentation, the Segment Anything Model as a promptable mask primitive, and the grounded pipelines that detect and segment anything you can name.
- 25.6 The Vision Foundation Model Landscape. A practitioner's map of the 2024 to 2026 foundation models: DINOv2's general-purpose frozen features, the CLIP family and SigLIP, the SAM line through SAM 3's concept segmentation, and how to choose a backbone. Scaling laws, the JEPA direction toward predicting in representation space, and the open questions ahead.
What's Next?
With a foundation model in hand you have a backbone whose features already understand objects, parts, and scenes before you train on a single labeled example of your task. Chapter 26: Video Understanding is the immediate sequel: the same self-supervised and masked-prediction recipes extend to the time axis, where masked video modeling and contrastive learning across frames teach a model motion and temporal consistency without action labels, and where the JEPA direction previewed in Section 25.6 becomes predictive video modeling. The foundation features also feed forward into Chapter 27, where DINOv2 features have become a standard input to monocular depth and 3D systems, and into the generative half of the book, where the CLIP text encoder of Section 25.4 is the exact component that lets a text prompt steer a text-to-image model, and where CLIPScore becomes a way to evaluate generated images. The descriptor you spent the book learning to learn now does double duty: it reads images and it reads the words about them.