Chapter 25: Self-Supervised Learning & Vision Foundation Models

"For years they fed me labeled cats and labeled dogs, one expensive sticker at a time. Then one day they took away the stickers, showed me a billion unlabeled photos, and said figure it out yourself. I was offended. Then I got better at everything."
A Newly Self-Supervised Backbone

Big Picture

For most of deep vision's history the bottleneck was not the model but the labels: every percentage point of accuracy was paid for in human annotation. Self-supervised learning removes that bottleneck by inventing the supervision signal from the raw data itself, and pairing images with the text humans have already written about them turns the entire internet into a training set. A model trained this way learns features that transfer to classification, detection, segmentation, and depth with little or no task-specific labeling, and a single such model can serve dozens of downstream tasks. That is what the term foundation model means in vision. This chapter traces the idea from its earliest pretext tasks through contrastive learning, self-distillation, masked image modeling, and language supervision, and ends at the open-vocabulary systems and general-purpose backbones that anchor production computer vision from 2024 to 2026.

Chapter Overview

Every chapter of Part III so far has assumed a labeled dataset. Chapter 20 trained classifiers on ImageNet's 1.28 million hand-labeled images; Chapter 23 needed boxes drawn by annotators; Chapter 24 needed per-pixel masks, the most expensive labels of all. The dirty secret of supervised vision is that the labels, not the architectures, set the ceiling. Drawing a segmentation mask can take a trained annotator several minutes per image, and the world has far more images than it will ever have masks. This chapter is about the escape from that trap: learning useful visual representations without task labels, and then learning from the one kind of label the internet produces for free at planetary scale, namely the text that already accompanies images.

The story runs in two acts. The first act, self-supervision from pixels alone, asks the model to solve a puzzle whose answer is hidden in the image itself. Section 25.1 introduces these pretext tasks: predict the rotation applied to an image, reassemble a shuffled jigsaw, colorize a grayscale photo. The label is generated automatically, so the data is effectively infinite, and the features learned in service of the puzzle turn out to be useful for real tasks. Section 25.2 sharpens the idea into contrastive learning, where two augmented views of the same image are pulled together in feature space and pushed away from every other image. SimCLR and MoCo made this competitive with supervised pretraining and taught the field that augmentation choice, batch size, and a momentum-updated target network are the levers that matter.
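
The pull-together, push-apart objective can be made concrete with a minimal NumPy sketch of an InfoNCE-style loss. This is a simplified, one-directional version written for illustration, not SimCLR's or MoCo's exact formulation: the function name `info_nce`, the temperature of 0.1, and the random toy embeddings are all assumptions, and real implementations symmetrize the loss and draw negatives from both views or from a queue.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Simplified one-directional InfoNCE loss between two batches of views.

    z_a[i] and z_b[i] are embeddings of two augmented views of image i;
    every z_b[j] with j != i acts as a negative for z_a[i].
    """
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_prob))

# Toy embeddings (hypothetical stand-ins for encoder outputs).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.05 * rng.normal(size=(8, 16))  # views that agree
shuffled = np.roll(aligned, shift=1, axis=0)         # deliberately mismatched pairs

loss_aligned = info_nce(anchors, aligned)
loss_shuffled = info_nce(anchors, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is low when each embedding is closest to its own second view and high when the pairing is scrambled, which is the whole training signal: no class label appears anywhere.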

Section 25.3 covers the two ideas that closed the gap with supervision and, in DINO's case, produced features so clean they segment objects with no labels at all: self-distillation, where a student network learns to match a slowly updated teacher (an exponential moving average of the student's own weights), and masked image modeling, where the model reconstructs patches it was not allowed to see. MAE's asymmetric encoder-decoder and 75 percent masking made this both effective and cheap, importing into vision the masked-prediction recipe that built large language models.
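
The masking step is simple enough to sketch in a few lines. The following is an illustrative NumPy version, not MAE's reference code: the helper name `random_masking`, the 196-token grid (a 224x224 image cut into 16x16 patches), and the 768-dimensional tokens are assumptions chosen to resemble a ViT-Base setup.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Split patch tokens into a visible subset and a hidden subset.

    patches: array of shape (num_patches, dim).
    Returns the visible tokens, their indices, and a boolean mask
    where True marks a patch hidden from the encoder.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(round(num_patches * (1 - mask_ratio)))
    # Keep a uniformly random subset of patches, in their original order.
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# Assumed toy input: 14 * 14 = 196 patch tokens of width 768.
patches = np.random.default_rng(0).normal(size=(196, 768))
visible, keep_idx, mask = random_masking(patches, rng=np.random.default_rng(1))
print(visible.shape, int(mask.sum()))  # (49, 768) 147
```

Only the 49 visible tokens would be passed to the encoder, which is where the cost saving comes from; the 147 hidden positions are what a lightweight decoder is asked to reconstruct.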

The second act brings in language. Section 25.4 is CLIP, the model that reorganized the field: train an image encoder and a text encoder together so that an image and its caption land at the same point in a shared embedding space, on four hundred million image-text pairs scraped from the web. The payoff is zero-shot classification, recognizing categories it was never explicitly trained on, simply by comparing the image to text descriptions. Section 25.5 shows how that open vocabulary propagates into the dense tasks: detectors and segmenters that find and outline objects named by an arbitrary text phrase, including the Segment Anything Model that made promptable segmentation a primitive. Section 25.6 steps back to survey the landscape: which backbones to reach for, how they relate, and where the frontier is heading.
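
Zero-shot classification reduces to a nearest-neighbor lookup in the shared embedding space. The sketch below uses random vectors as hypothetical stand-ins for the outputs of an image encoder and a text encoder, so it shows only the comparison step, not CLIP itself; the function name `zero_shot_classify`, the prompt strings, and the temperature of 0.01 are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Turn cosine similarities to text embeddings into class probabilities."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()        # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum()

# One text prompt per candidate category; no classifier is trained.
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Hypothetical embeddings: a real system would obtain these from the encoders.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 32))
image_emb = text_embs[1] + 0.3 * rng.normal(size=32)  # an "image" near the dog prompt

probs = zero_shot_classify(image_emb, text_embs)
prediction = prompts[int(np.argmax(probs))]
print(prediction)
```

Adding a new category means adding a new prompt string and embedding it, which is why the vocabulary is open.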

A thread you have been following since Chapter 10 reaches its conclusion here. There you built hand-crafted descriptors, SIFT and ORB, that summarized a patch into a vector robust to lighting and viewpoint. The whole project of this chapter is to learn those descriptors instead of designing them, and to learn them so well that one vector serves every task. By Section 25.4 the learned descriptor has become a CLIP embedding that can be compared not just to other images but to words, the universal descriptor the book has been building toward. The transfer-learning idea of Chapter 21 also completes here: a pretrained backbone becomes a foundation model when one frozen set of features serves many tasks at once.

Key Insight: One Idea, Four Free Labels

Every method in this chapter is the same move, repeated with a different free label: invent the supervision the data already contains. The progression is worth carrying as the chapter's spine, because each section just changes where the free label comes from.

Predict the change you applied (rotate, shuffle, decolor): the pretext tasks of 25.1.
Predict the match, that two views are the same image: contrastive learning in 25.2.
Predict your own past, agreeing with a slow copy of yourself, or predict the hidden patches: self-distillation and masked modeling in 25.3.
Predict the caption humans already typed: CLIP in 25.4.

The label is always free; the understanding never is. Sections 25.5 and 25.6 then spend that understanding, carrying the open vocabulary into dense prediction and surveying the foundation models the four free labels produced.
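
As a concrete instance of a free label, here is a minimal sketch of how a rotation pretext task could manufacture its own supervision. The helper name `rotation_pretext_batch` and the tiny 4x4x3 toy image are assumptions for illustration, and the function assumes a square input so the four views can be stacked.

```python
import numpy as np

def rotation_pretext_batch(image):
    """Return four rotated copies of a square H x W x C image and their labels.

    Label k means the image was rotated by k * 90 degrees; the label costs
    nothing because we applied the rotation ourselves.
    """
    views = np.stack([np.rot90(image, k=k, axes=(0, 1)) for k in range(4)])
    labels = np.arange(4)
    return views, labels

# Toy stand-in for a real photo.
image = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
views, labels = rotation_pretext_batch(image)
print(views.shape, labels)  # (4, 4, 4, 3) [0 1 2 3]
```

A classifier trained to recover `labels` from `views` never sees a human annotation, yet it can only succeed by learning which way up objects normally appear.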

Prerequisites

You should have read Chapter 22: Vision Transformers, because the dominant self-supervised backbones (DINO, MAE, CLIP's image encoder) are Vision Transformers, and several methods exploit the patch structure that chapter built. Chapter 21: Training Recipes is essential, because this chapter lives and dies by data augmentation, and the linear-probe and fine-tuning protocols used to evaluate self-supervised models are the transfer-learning protocols you learned there. Chapter 20: CNN Architectures supplies the ResNet backbones that the early contrastive methods used and that still appear as encoders. Comfort with the softmax, the dot product as a similarity measure, and the cross-entropy loss (used since Chapter 18) makes the contrastive and CLIP objectives concrete. Knowing how detection and segmentation heads work from Chapter 23 and Chapter 24 will help you appreciate the open-vocabulary extensions in Section 25.5.

Chapter Roadmap

25.1 Pretext Tasks: Learning Without Labels. The core idea of self-supervision: invent a label from the data itself. Rotation prediction, jigsaw puzzles, and colorization as pretext tasks, the transfer protocol that measures whether the learned features are any good, and why a good pretext task forces the model to understand content rather than exploit a shortcut.
25.2 Contrastive Learning: SimCLR & MoCo. Pull two augmented views of one image together and push every other image away. The InfoNCE loss, why augmentation choice is the real architecture, SimCLR's reliance on enormous batches, and MoCo's momentum encoder and queue that decouple the number of negatives from the batch size.
25.3 Self-Distillation & Masked Image Modeling: DINO & MAE. Learning without negatives. DINO's student-teacher self-distillation with centering and sharpening that produces emergent object segmentation, and MAE's masked autoencoding: hide 75 percent of patches, reconstruct them with an asymmetric encoder-decoder, and import the masked-prediction recipe of large language models into vision.
25.4 CLIP: Language as Supervision. Train an image encoder and a text encoder so that an image and its caption meet in one embedding space. The symmetric contrastive objective over a batch of image-text pairs, zero-shot classification via prompt engineering, and why four hundred million web pairs beat curated labels for transfer and robustness.
25.5 Open-Vocabulary Detection & Segmentation. Carrying the open vocabulary into dense prediction. Region-text alignment for open-vocabulary detectors, language-driven open-vocabulary segmentation, the Segment Anything Model as a promptable mask primitive, and the grounded pipelines that detect and segment anything you can name.
25.6 The Vision Foundation Model Landscape. A practitioner's map of the 2024 to 2026 foundation models: DINOv2's general-purpose frozen features, the CLIP family and SigLIP, the SAM line through SAM 3's concept segmentation, and how to choose a backbone. Scaling laws, the JEPA direction toward predicting in representation space, and the open questions ahead.

What's Next?

With a foundation model in hand you have a backbone whose features already understand objects, parts, and scenes before you train on a single labeled example of your task. Chapter 26: Video Understanding is the immediate sequel: the same self-supervised and masked-prediction recipes extend to the time axis, where masked video modeling and contrastive learning across frames teach a model motion and temporal consistency without action labels, and where the JEPA direction previewed in Section 25.6 becomes predictive video modeling. The foundation features also feed forward into Chapter 27, where DINOv2 features have become a standard input to monocular depth and 3D systems, and into the generative half of the book, where the CLIP text encoder of Section 25.4 is the exact component that lets a text prompt steer a text-to-image model, and where CLIPScore becomes a way to evaluate generated images. The descriptor you spent the book learning to learn now does double duty: it reads images and it reads the words about them.

Bibliography & Further Reading

Foundational Papers

Gidaris, S., Singh, P., Komodakis, N. "Unsupervised Representation Learning by Predicting Image Rotations." ICLR (2018). arXiv:1803.07728

RotNet, the cleanest pretext task of Section 25.1. Predicting which of four rotations was applied forces the network to recognize canonical object orientation, and the features transfer surprisingly well. The simplest possible entry into self-supervision.

Chen, T. et al. "A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)." ICML (2020). arXiv:2002.05709

SimCLR of Section 25.2. Establishes the InfoNCE-based view-contrast recipe, the projection head, and the central finding that composition of augmentations and very large batches drive contrastive quality.

He, K. et al. "Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)." CVPR (2020). arXiv:1911.05722

MoCo of Section 25.2. A momentum-updated key encoder and a queue of negatives decouple the number of negatives from the batch size, making strong contrastive learning feasible without the enormous batches, and the accelerator count needed to hold them, that in-batch negatives demand.

Caron, M. et al. "Emerging Properties in Self-Supervised Vision Transformers (DINO)." ICCV (2021). arXiv:2104.14294

DINO of Section 25.3. Self-distillation with no labels and no negatives; its attention maps segment foreground objects for free, the emergent property that made self-supervised ViTs famous.

He, K. et al. "Masked Autoencoders Are Scalable Vision Learners (MAE)." CVPR (2022). arXiv:2111.06377

MAE of Section 25.3. Mask 75 percent of patches, encode only the visible ones, and reconstruct with a lightweight decoder. The asymmetric design makes pretraining cheap and the representations strong under fine-tuning.

Radford, A. et al. "Learning Transferable Visual Models From Natural Language Supervision (CLIP)." ICML (2021). arXiv:2103.00020

CLIP of Section 25.4. A symmetric image-text contrastive objective over 400 million web pairs yields a shared embedding space, zero-shot classification, and the robustness that made language supervision the dominant pretraining signal.

Recent Research (2022-2026)

Oquab, M. et al. "DINOv2: Learning Robust Visual Features without Supervision." TMLR (2024). arXiv:2304.07193

The general-purpose frozen backbone of Section 25.6. Combines self-distillation and masked modeling at scale with heavy data curation; its features transfer to classification, segmentation, and depth without fine-tuning.

Zhai, X. et al. "Sigmoid Loss for Language Image Pre-Training (SigLIP)." ICCV (2023). arXiv:2303.15343

The SigLIP refinement of Sections 25.4 and 25.6. Replaces CLIP's softmax contrastive loss with a pairwise sigmoid loss that removes the all-pairs normalization, trains well at smaller batch sizes, and now anchors many open-vocabulary systems.

Kirillov, A. et al. "Segment Anything (SAM)." ICCV (2023). arXiv:2304.02643

SAM of Sections 25.5 and 25.6. A promptable segmentation model trained on 1.1 billion masks; point, box, or mask prompts yield masks for objects it was never explicitly taught, with text prompting explored in the paper only as a proof of concept. Widely regarded as the first segmentation foundation model.

Liu, S. et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." ECCV (2024). arXiv:2303.05499

The open-vocabulary detector of Section 25.5. Fuses a detection transformer with text grounding so any phrase becomes a detectable category, the front end of the detect-then-segment grounded pipelines.

Assran, M. et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)." CVPR (2023). arXiv:2301.08243

The JEPA direction of Section 25.6. Predicts masked regions in representation space rather than pixel space, avoiding the pull toward low-level detail that pixel reconstruction imposes, and pointing toward the predictive world models of Chapter 26.

Carion, N. et al. "SAM 3: Segment Anything with Concepts." (2025). arXiv:2511.16719

The concept-promptable successor named in Sections 25.5 and 25.6. Introduces promptable concept segmentation: a short noun phrase or image exemplar returns masks and identities for every matching instance at once, folding the open-vocabulary detect-then-segment pipeline into a single model.

Siméoni, O. et al. "DINOv3." (2025). arXiv:2508.10104

The DINOv2 successor of Section 25.6. Scales self-supervised pretraining to a 7-billion-parameter ViT on 1.7 billion curated images and adds Gram anchoring to keep dense features sharp over long training, reporting state-of-the-art dense prediction with frozen weights.

Tools & Libraries

OpenCLIP. github.com/mlfoundations/open_clip

The open reproduction and extension of CLIP used in the Section 25.4 library shortcut, with dozens of pretrained checkpoints (CLIP, SigLIP) and the LAION training pipeline behind them.

Hugging Face Transformers. huggingface.co/docs/transformers

High-level loaders for CLIP, SigLIP, Grounding DINO, SAM, and DINOv2 used across this chapter's library shortcuts; a few lines of AutoModel and AutoProcessor replace each from-scratch pipeline.

DINOv2 repository (Meta AI Research). github.com/facebookresearch/dinov2

Official weights and inference code for the DINOv2 backbones of Section 25.6, loadable directly through torch.hub, the practical default for frozen-feature transfer.

Tutorials & Explainers

Weng, L. "Self-Supervised Representation Learning." Lil'Log. lilianweng.github.io

A thorough, regularly cited survey of pretext tasks and contrastive methods that complements Sections 25.1 and 25.2 with intuition and a wide method comparison.

OpenAI. "CLIP: Connecting Text and Images." Blog and model card. openai.com/research/clip

The accessible overview of CLIP's design, zero-shot evaluation, and prompt engineering that Section 25.4 formalizes, with the original zero-shot demos.

Datasets & Benchmarks

Schuhmann, C. et al. "LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models." NeurIPS Datasets (2022). arXiv:2210.08402

The open 5-billion image-text dataset that made reproducing CLIP-scale training (Section 25.4) possible outside large labs, the data behind OpenCLIP and many open foundation models.

Deng, J. et al. "ImageNet: A Large-Scale Hierarchical Image Database." CVPR (2009). image-net.org

The benchmark on which self-supervised methods are evaluated by linear probe and fine-tuning throughout the chapter, and the zero-shot target for CLIP in Section 25.4.