Part III: Deep Learning for Computer Vision
Chapter 25: Self-Supervised Learning & Vision Foundation Models

Pretext Tasks: Learning Without Labels

"They handed me a photo cut into nine squares and shuffled like a deck of cards, and said: put it back. I had never been told what a dog was, but to win this game I had to learn that ears go above legs. The label was free. The understanding was not."

A Jigsaw-Solving Convolutional Network
Big Picture

Self-supervised learning manufactures its own labels from the raw input, turning an unlabeled image into a supervised problem whose answer the data already contains. The trick is to design a task that the model can only solve by understanding the image's content, not by exploiting a trivial shortcut. A pretext task is that puzzle: rotate an image and predict the rotation, scramble it into a jigsaw and reassemble it, strip its color and paint it back. Solving the puzzle is never the goal. The goal is the representation the network builds along the way, a representation we then transfer to real tasks with a tiny amount of labeled data. This section establishes the recipe, the evaluation protocol that tells you whether it worked, and the design principle that separates a useful pretext task from a fooled one.

In the previous chapters you trained networks against human labels: a class index from ImageNet, a bounding box, a per-pixel mask. This section steps off that path entirely. We will train a network with no human labels at all, by inventing a supervision signal that can be computed automatically from any image. The plan is to introduce the general framework, build the simplest pretext task (rotation prediction) end to end in PyTorch, survey two more (jigsaw and colorization) that illustrate different design choices, and then pin down the protocol that measures whether the learned features are any good. By the end you will understand why self-supervision works at all, and why some pretext tasks teach far more than others. This is the foundation that Section 25.2 sharpens into contrastive learning. The illustration below captures the central bargain: free data, self-made labels.

A cheerful robot sits before a cascade of unlabeled photo-cards while a discarded price-tag sticker rests in a wastebasket, and the robot draws its own checkmark on the back of a card, depicting how self-supervised learning throws away expensive human labels and invents the supervision signal from the raw images themselves.
When the expensive stickers go in the bin, the data becomes infinite and the model has to grade its own homework.

1. The Self-Supervised Recipe Beginner

The entire field rests on one move. Take an unlabeled image $x$, apply a transformation $t$ whose parameters you choose, and obtain a transformed input $\tilde{x} = t(x)$. Now ask the network to predict something about $t$ from $\tilde{x}$. Because you chose $t$, you know the correct answer, so you have a label for free. The network is trained with ordinary supervised loss (usually cross-entropy) against that automatically generated label. The crucial point is that this is genuine supervised learning; the only difference from Chapter 20 is the source of the labels.

Why would this teach anything useful? Because for many well-chosen transformations, the only way to predict $t$ reliably is to recognize what is in the image. To tell that a photo of a dog has been rotated ninety degrees, you must already have an internal sense of which way a dog normally faces; sky is up, legs point down, faces are upright. The network cannot get this from low-level cues alone, so it is pushed to build a representation that encodes object orientation, parts, and layout. That representation, captured in the network's intermediate layers, is the prize. The pretext task is scaffolding we throw away.
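The recipe can be written down generically before committing to any particular transformation. The sketch below is a minimal illustration, not a method from the literature: the helper name `make_pretext_batch` and the three-element transformation family are assumptions chosen only to show the mechanics (sample $t$, apply it, keep its index as the label).

```python
import torch

def make_pretext_batch(images, transforms):
    """Apply one randomly chosen transform per image.

    images: (B, C, H, W) unlabeled batch.
    transforms: list of callables, each mapping a (C, H, W) tensor to a (C, H, W) tensor.
    Returns the transformed batch and the index of the transform applied to each image.
    """
    labels = torch.randint(len(transforms), (images.size(0),))   # we pick t, so we know the answer
    transformed = torch.stack(
        [transforms[int(k)](img) for img, k in zip(images, labels)]
    )
    return transformed, labels

# Hypothetical transformation family, for illustration only.
transforms = [
    lambda img: img,                          # identity
    lambda img: torch.flip(img, dims=(1,)),   # flip along the height axis (upside down)
    lambda img: torch.flip(img, dims=(0,)),   # reverse the channel order
]

torch.manual_seed(0)
images = torch.randn(6, 3, 32, 32)            # stand-in for unlabeled images
x_tilde, y = make_pretext_batch(images, transforms)
print(x_tilde.shape, y.shape)
```

Whether such a family teaches anything depends entirely on whether predicting the index requires understanding content, which is the design question the rest of the section addresses.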

Common Misconception: Self-Supervised Is Not Unsupervised

It is tempting to file self-supervised learning under "unsupervised learning" because no human labels are involved. In fact every method in this chapter is ordinary supervised learning with a cross-entropy or regression loss against a concrete target; the only difference is that the target (the rotation index, the masked patch, the matching caption) is computed automatically from the data rather than typed by an annotator. Classical unsupervised methods such as k-means clustering have no per-example correct answer to predict; self-supervision manufactures one. (A plain autoencoder sits on the border: it does have a per-example target, but that target is the unmodified input itself, so there is no transformation to undo.) Diagnostic question: if rotation prediction has a labeled target and a cross-entropy loss, what makes it "self"-supervised rather than just supervised? The answer is the source of the label, not the absence of a label.

Key Insight: The Pretext Task Is a Means, Never an End

Nobody needs a production system that predicts image rotations. The pretext task exists only to force the network to learn a representation, and we measure success not by accuracy on the pretext task but by how well the frozen features transfer to a real downstream task such as classification or detection. A pretext task that scores ninety-nine percent but whose features transfer poorly has failed. This decoupling of training objective from evaluation objective is the defining feature of representation learning and runs through the entire chapter, all the way to CLIP in Section 25.4.

2. Rotation Prediction, End to End Beginner

The cleanest pretext task to implement is RotNet (Gidaris et al., 2018). Rotate each image by one of four angles, $0$, $90$, $180$, or $270$ degrees, and train the network to classify which of the four rotations was applied. That is a four-way classification problem with labels generated on the fly. The transformation $t$ is a discrete rotation; the label is the rotation index. Figure 25.1.1 shows the setup: one image becomes four training examples, each tagged with the angle it carries.

[Figure: one image rotated into four copies, labeled 0° (y=0), 90° (y=1), 180° (y=2), 270° (y=3), feeds a shared backbone (ResNet / ViT encoder) and a 4-way head that predicts the rotation under a cross-entropy loss. The encoder is the part we keep.]
Figure 25.1.1: The RotNet pretext task. Each unlabeled image is rotated into four versions whose labels (the rotation index) are known by construction. A shared backbone and a four-way classification head are trained with ordinary cross-entropy. After training, the head is discarded and the backbone's features are transferred to a downstream task.

The code below builds the full pipeline against a torchvision backbone. The key piece is the data transform that produces the four rotations and their labels; everything else is a standard supervised training step you have seen since Chapter 18.

import torch
import torch.nn as nn
import torchvision

def make_rotations(images):
    """Given a batch (B, C, H, W), return 4B rotated images and their rotation labels."""
    rots, labels = [], []
    for k in range(4):                       # k = number of 90-degree turns
        rots.append(torch.rot90(images, k, dims=(2, 3)))  # rotate H,W plane
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rots, dim=0), torch.cat(labels, dim=0)

# A ResNet-18 backbone with a fresh 4-way head replacing the 1000-way ImageNet head.
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)   # 4 rotation classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1, momentum=0.9)

images = torch.randn(8, 3, 64, 64)          # 8 unlabeled images, no class labels needed
rot_imgs, rot_labels = make_rotations(images)         # -> (32, 3, 64, 64), (32,)
optimizer.zero_grad()                       # clear any stale gradients before this step
logits = backbone(rot_imgs)
loss = criterion(logits, rot_labels)        # supervised loss against free labels
loss.backward()
optimizer.step()
print("rotated batch:", rot_imgs.shape, "labels:", rot_labels[:8].tolist())
# rotated batch: torch.Size([32, 3, 64, 64]) labels: [0, 0, 0, 0, 0, 0, 0, 0]
Code Fragment 1: RotNet in PyTorch. make_rotations turns one batch of unlabeled images into four times as many labeled examples using torch.rot90; the rest is a vanilla cross-entropy step. The printed labels are all 0 because the function stacks the batch by rotation (all the unrotated copies first, then all the 90-degree copies, and so on), so the first eight entries are the eight unrotated images; the labels 1, 2, 3 follow further down the stacked tensor. The class labels of the images are never used.

Notice that no .targets or class annotations appear anywhere. The supervision comes entirely from make_rotations. Once this network is trained on a large unlabeled corpus, we throw away backbone.fc (the rotation head) and keep the convolutional trunk as a feature extractor. Subsection 4 below explains how we measure whether those features are good.

Fun Fact

RotNet has a famous blind spot: it fails on images with no canonical orientation. A photo of soup from directly above, a microscope slide, an aerial satellite image of farmland, all look equally plausible at any rotation, so the network cannot solve the task and learns little. This is not a bug in the code; it is the task honestly reporting that the pretext signal does not exist for those images. The lesson generalizes: a pretext task only teaches what its assumption about the world makes learnable.
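The blind spot can be made concrete with a synthetic image. The sketch below builds a tensor that is unchanged by 90-degree turns (a hypothetical stand-in for an overhead or microscope photo), confirms that its four rotations are the same input, and shows that the best cross-entropy available for identical inputs carrying four different labels is $\log 4$, the loss of a uniform guess.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
r = torch.randn(1, 3, 32, 32)
# Averaging the four rotations yields an image with four-fold rotational symmetry.
sym = sum(torch.rot90(r, k, dims=(2, 3)) for k in range(4)) / 4

rotated = [torch.rot90(sym, k, dims=(2, 3)) for k in range(4)]
identical = all(torch.allclose(rotated[0], v, atol=1e-6) for v in rotated)
print("all four rotations identical:", identical)

# Identical inputs must receive identical logits, so the four different labels
# cannot all be predicted; a uniform prediction is the best the network can do.
logits = torch.zeros(4, 4)                  # same (uniform) logits for every copy
labels = torch.arange(4)                    # rotation labels 0, 1, 2, 3
loss = F.cross_entropy(logits, labels)
print("best achievable loss:", loss.item(), "log 4:", math.log(4))
```

Real images are rarely perfectly symmetric, but the closer a dataset comes to this case, the less signal the rotation label carries.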

Try This: Shrink the Pretext Task and Watch It Get Easier

Before reaching for the full linear probe, you can build intuition about pretext difficulty in one cheap change to make_rotations. Replace the four-way rotation (for k in range(4)) with a two-way version that uses only k = 0 and k = 2 (upright versus upside-down), and rerun a few training steps. The pretext task gets noticeably easier: distinguishing $0$ from $180$ degrees needs only a coarse sense of which way is up, while telling $90$ from $270$ demands a finer reading of left-right structure. The lesson to observe is that an easier pretext task is not a better one. A task the network can solve with a shallow cue teaches a shallower representation, which is exactly why the four-way version, and the harder objectives of the next sections, transfer further. Vary the angle set and you are watching the difficulty-versus-richness trade-off the whole chapter turns on.
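One way to run this experiment is to make the angle set a parameter. The variant below is a sketch; the name `make_rotations_subset` and the `turns` argument are assumptions introduced here, not part of the original listing. It relabels the chosen rotations as consecutive class indices, so the classification head should have `len(turns)` outputs.

```python
import torch

def make_rotations_subset(images, turns=(0, 1, 2, 3)):
    """Rotate a (B, C, H, W) batch by each entry of `turns` (in units of 90 degrees).

    Labels are class indices 0..len(turns)-1, not the raw number of turns,
    so a two-way task yields labels {0, 1}.
    """
    rots, labels = [], []
    for class_idx, k in enumerate(turns):
        rots.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), class_idx, dtype=torch.long))
    return torch.cat(rots, dim=0), torch.cat(labels, dim=0)

images = torch.randn(8, 3, 64, 64)
# Upright versus upside-down only: k = 0 and k = 2.
two_way_imgs, two_way_labels = make_rotations_subset(images, turns=(0, 2))
print(two_way_imgs.shape, two_way_labels.unique().tolist())
```

With `turns=(0, 2)` the head becomes `nn.Linear(in_features, 2)`; comparing how quickly the two-way and four-way losses fall gives a rough feel for relative task difficulty.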

3. Jigsaw Puzzles and Colorization Intermediate

Rotation is the simplest pretext task, but it is one of many, and the variety illustrates the design space. The jigsaw task (Noroozi and Favaro, 2016) cuts the image into a grid of tiles, shuffles them by one of a fixed set of permutations, and asks the network to predict which permutation was used. To reassemble the puzzle the network must understand spatial relationships between parts: a face tile belongs above a body tile, a wheel sits below a fender. Because nine tiles admit $9! = 362{,}880$ orderings, far too many to treat as separate classes, the task uses a curated subset of a few hundred to a thousand permutations chosen to be as different from one another as possible, turning it into a manageable classification problem.
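The data side of the jigsaw task can be sketched in a few lines. This is a simplified illustration under stated assumptions: `build_permutation_set` samples orderings at random as a stand-in for the curated, maximally distinct set described above, and both function names are hypothetical.

```python
import random
import torch

def build_permutation_set(num_perms, num_tiles=9, seed=0):
    """Sample `num_perms` distinct tile orderings (random stand-in for a curated set)."""
    rng = random.Random(seed)
    perms = set()
    while len(perms) < num_perms:
        perms.add(tuple(rng.sample(range(num_tiles), num_tiles)))
    return sorted(perms)

def make_jigsaw(image, perms, grid=3, perm_idx=None):
    """Cut a (C, H, W) image into grid x grid tiles and shuffle them by perms[perm_idx].

    Returns the shuffled tiles, shape (grid*grid, C, H//grid, W//grid),
    and the permutation index, which serves as the free label.
    """
    _, h, w = image.shape
    th, tw = h // grid, w // grid
    tiles = [
        image[:, r * th:(r + 1) * th, c * tw:(c + 1) * tw]
        for r in range(grid) for c in range(grid)        # row-major tile order
    ]
    if perm_idx is None:
        perm_idx = random.randrange(len(perms))
    shuffled = torch.stack([tiles[j] for j in perms[perm_idx]])
    return shuffled, perm_idx

image = torch.randn(3, 96, 96)                 # stand-in for one unlabeled image
perms = build_permutation_set(100)             # 100 classes for the pretext head
tiles, label = make_jigsaw(image, perms, perm_idx=7)
print(tiles.shape, label)
```

A network for this task would encode each tile with a shared backbone, concatenate the nine tile features, and classify over `len(perms)` outputs.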

Colorization (Zhang et al., 2016) takes the opposite tack: it is a generative pretext task. The network receives a grayscale image and must predict the missing color channels. To color a banana yellow and grass green it must recognize the objects, since color is a semantic property, not a local one. Colorization predicts a dense output (a color per pixel) rather than a single class, so it teaches features useful for dense downstream tasks. The contrast with rotation is instructive: rotation is a global, four-way classification; colorization is a dense, per-pixel prediction (naturally posed as regression, although Zhang et al. cast it as classification over quantized color bins to avoid washed-out average colors). Both invent their labels from the data, but they push the representation in different directions.
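A stripped-down version of the colorization setup looks like this. It is a sketch under simplifying assumptions: it works in RGB with standard luma weights and an L1 regression loss rather than the color space and loss of the original paper, and the tiny two-layer `colorizer` is a placeholder for a real encoder-decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_colorization_pair(rgb):
    """rgb: (B, 3, H, W) in [0, 1]. Returns the grayscale input and the color target."""
    luma = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)   # standard luma weights
    gray = (rgb * luma).sum(dim=1, keepdim=True)                   # (B, 1, H, W)
    return gray, rgb                                               # the label is the image itself

# Placeholder network: 1-channel input, 3-channel per-pixel output in [0, 1].
colorizer = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=3, padding=1), nn.Sigmoid(),
)

rgb = torch.rand(4, 3, 32, 32)                 # stand-in for unlabeled color images
gray, target = make_colorization_pair(rgb)
pred = colorizer(gray)
loss = F.l1_loss(pred, target)                 # dense, per-pixel loss
loss.backward()
print(gray.shape, pred.shape)
```

Unlike the rotation and jigsaw tasks, the output here has the same spatial size as the input, which is why the learned features lean toward dense prediction.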

A recurring danger unites all three. A pretext task is only as good as the shortcut it forbids. If the network can predict the rotation by reading a camera watermark in the corner, or solve the jigsaw by matching the chromatic aberration at tile edges, or colorize by memorizing a texture statistic, it will do exactly that and learn nothing about content. The illustration below shows this loophole in cartoon form, and the practical example below is a real instance of the same failure.

A student robot cheats at a leaf jigsaw puzzle by secretly matching tiny identical corner marks on the pieces instead of looking at the picture, while a watchful owl teacher raises an eyebrow, illustrating how a pretext task can be solved by a low-level artifact shortcut so the model learns the camera rather than the content.
A pretext task that is suspiciously easy is a warning sign: the model probably found a low-level loophole and learned nothing about the image.
Practical Example: The Pretext Task That Learned the Camera, Not the Content

Who: a three-person applied-research group at an agritech startup, 2022, building a crop-disease classifier from a large archive of unlabeled field photos.

Situation: labels were scarce and expensive (an agronomist had to inspect each leaf), so they pretrained a backbone with the jigsaw pretext task on roughly two hundred thousand unlabeled photos, then planned to fine-tune on their few thousand labeled examples.

Problem: the jigsaw pretraining reached suspiciously high accuracy, yet the transferred features performed barely better than random initialization on disease classification.

Decision: they inspected which tile boundaries the model attended to and discovered the photos were captured on two phone models with slightly different JPEG compression, leaving faint blocking artifacts at consistent positions. The network was solving the jigsaw by aligning compression-block grids across tiles, a shortcut with zero semantic content. They added strong color jitter and re-encoded every tile through a fresh random JPEG quality to destroy the artifact, and switched to random tile gaps so edge cues could not be matched.

Result: pretext accuracy dropped (the easy shortcut was gone), but downstream disease accuracy jumped by eleven points, finally beating the supervised-from-scratch baseline.

Lesson: a pretext task that is too easy is a warning sign. When self-supervision underperforms, the first suspect is a low-level shortcut; the cure is augmentation aggressive enough to make the shortcut useless, which forces the model back onto content.
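Of the fixes in this example, random tile gaps are the simplest to sketch. The function below is an illustrative assumption rather than the group's actual code: the name `jitter_tiles` and the sizes are hypothetical, and the JPEG re-encoding and color jitter are left out. Each tile is cropped at a random offset inside its grid cell, so neighboring tiles no longer share an exact boundary that low-level cues could be matched across.

```python
import torch

def jitter_tiles(image, grid=3, tile=24):
    """Crop one tile x tile patch at a random offset inside each cell of a (C, H, W) image.

    Because the patch is smaller than its cell, a random gap separates adjacent tiles.
    Returns a tensor of shape (grid*grid, C, tile, tile) in row-major cell order.
    """
    _, h, w = image.shape
    cell_h, cell_w = h // grid, w // grid
    if tile > cell_h or tile > cell_w:
        raise ValueError("tile must fit inside one grid cell")
    tiles = []
    for r in range(grid):
        for c in range(grid):
            dy = int(torch.randint(0, cell_h - tile + 1, (1,)))   # random vertical offset
            dx = int(torch.randint(0, cell_w - tile + 1, (1,)))   # random horizontal offset
            top, left = r * cell_h + dy, c * cell_w + dx
            tiles.append(image[:, top:top + tile, left:left + tile])
    return torch.stack(tiles)

torch.manual_seed(0)
image = torch.randn(3, 96, 96)         # 3x3 grid of 32-pixel cells
tiles = jitter_tiles(image)            # 24-pixel tiles leave gaps of up to 8 pixels per cell
print(tiles.shape)
```

The shuffled-permutation step then operates on these jittered tiles exactly as it would on tightly cut ones.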

4. Measuring Success: The Linear Probe Intermediate

Since the pretext accuracy is not the goal, we need a separate yardstick for representation quality. The standard tool is the linear probe, an evaluation protocol inherited directly from the transfer-learning discussion of Chapter 21. Freeze the pretrained backbone so its weights cannot change, extract features for a labeled dataset, and train a single linear classifier (one fully connected layer) on top of those frozen features. The accuracy of that linear classifier measures how linearly separable the learned representation is: if a simple linear boundary can carve the frozen features into the right classes, the features have already done the hard work of organizing the data semantically.

Formally, if the frozen encoder maps an image to a feature vector $z = f_\theta(x)$ with $\theta$ fixed, the linear probe fits only a weight matrix $W$ and bias $b$ by minimizing

$$\min_{W, b} \; \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{\text{CE}}\big(W z_i + b, \; y_i\big), \qquad z_i = f_\theta(x_i) \ \text{(frozen)}$$

where $\mathcal{L}_{\text{CE}}$ is cross-entropy and $y_i$ are the downstream labels. Only $W$ and $b$ are learned; $\theta$ never moves. The two other common protocols are fine-tuning (unfreeze the whole backbone and train end to end on the downstream labels, which measures the representation as a starting point rather than as a fixed feature) and k-nearest-neighbors (classify each test feature by a majority vote of its nearest training features, which needs no training at all and is a pure measure of feature-space structure). The code below implements the linear probe directly.

import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(encoder, loader, device):
    """Run the frozen encoder over a labeled loader and stack features + labels."""
    encoder.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)).flatten(1).cpu())  # frozen forward pass
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe(train_feats, train_labels, num_classes, epochs=100):
    """Train ONLY a linear layer on frozen features; the encoder is untouched."""
    clf = nn.Linear(train_feats.size(1), num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(train_feats), train_labels)        # only clf has gradients
        loss.backward()
        opt.step()
    return clf

# Usage sketch: freeze the RotNet trunk, drop its rotation head, probe on real labels.
encoder = backbone                       # from Section 2
encoder.fc = nn.Identity()               # expose 512-d features instead of the 4-way head
for p in encoder.parameters():
    p.requires_grad_(False)              # freeze: this is what makes it a linear PROBE
# feats, labels = extract_features(encoder, labeled_loader, "cuda")
# clf = linear_probe(feats, labels, num_classes=10)
print("encoder frozen:", not any(p.requires_grad for p in encoder.parameters()))
# encoder frozen: True
Code Fragment 2: The linear-probe protocol. Setting requires_grad_(False) on every encoder parameter is the line that defines the probe: only the final linear layer learns, so the reported accuracy reflects the frozen representation, not a re-trained network.
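The k-nearest-neighbors protocol mentioned alongside the linear probe needs no training loop at all. The sketch below is one simple way to implement it; the function name `knn_predict`, the cosine-similarity metric, and the unweighted majority vote are assumptions made here for illustration. The toy features stand in for the output of a frozen encoder.

```python
import torch
import torch.nn.functional as F

def knn_predict(train_feats, train_labels, test_feats, k=5, num_classes=None):
    """Label each test feature by majority vote among its k most similar training features."""
    if num_classes is None:
        num_classes = int(train_labels.max()) + 1
    train = F.normalize(train_feats, dim=1)                # unit length, so dot product = cosine
    test = F.normalize(test_feats, dim=1)
    sims = test @ train.T                                  # (N_test, N_train)
    nn_idx = sims.topk(k, dim=1).indices                   # k nearest training points per test point
    nn_labels = train_labels[nn_idx]                       # (N_test, k)
    votes = F.one_hot(nn_labels, num_classes).sum(dim=1)   # (N_test, num_classes)
    return votes.argmax(dim=1)

# Toy features: two well-separated clusters standing in for frozen-encoder outputs.
torch.manual_seed(0)
centers = torch.eye(8)[:2] * 5.0
train_feats = torch.cat([centers[0] + 0.1 * torch.randn(20, 8),
                         centers[1] + 0.1 * torch.randn(20, 8)])
train_labels = torch.cat([torch.zeros(20, dtype=torch.long),
                          torch.ones(20, dtype=torch.long)])
test_feats = torch.cat([centers[0] + 0.1 * torch.randn(5, 8),
                        centers[1] + 0.1 * torch.randn(5, 8)])
test_labels = torch.tensor([0] * 5 + [1] * 5)

preds = knn_predict(train_feats, train_labels, test_feats, k=5)
accuracy = (preds == test_labels).float().mean().item()
print("k-NN accuracy:", accuracy)
```

Because nothing is fitted, the result depends only on the geometry of the feature space, which makes this a quick sanity check to run before a full linear probe.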
One Number That Shows the Probe Works

Put three backbones through the same ImageNet linear probe and the abstraction becomes concrete. The figures here are approximate and depend on the architecture and on which layer is probed. A randomly initialized network (no training at all) lands in the low single digits, not far above the one-in-a-thousand chance of guessing. A RotNet trained only to predict rotations, never shown a single class label, jumps to roughly forty percent. A fully supervised ImageNet model sits near seventy-five percent. The leap from a few percent to forty percent is the entire payoff of self-supervision made visible in one column of numbers: predicting rotations, a task nobody cares about, taught the frozen features to organize a thousand object classes well enough that a linear classifier can read many of them off. The remaining gap to seventy-five percent is what contrastive learning and masked modeling largely close in the next two sections.

The linear probe is what lets us compare RotNet, jigsaw, colorization, and the contrastive methods of Section 25.2 on a level field. It is also why the field converged on the methods it did: contrastive and masked-modeling approaches produce far more linearly separable features than the early pretext tasks.

Library Shortcut: A Frozen Backbone in Two Lines

You rarely train RotNet yourself today; you load a strong pretrained backbone and probe or fine-tune it. With timm the entire feature-extractor setup of the code above collapses to two lines:

# Load a pretrained self-supervised ViT (DINO weights) as a frozen feature extractor.
# num_classes=0 strips the classification head, exposing the raw embedding.
import timm
import torch

encoder = timm.create_model("vit_small_patch16_224.dino", pretrained=True, num_classes=0)
encoder.eval()                                        # inference mode; nothing is trained here
with torch.no_grad():                                 # no gradients flow into the backbone
    features = encoder(torch.randn(1, 3, 224, 224))   # (1, 384) embedding
Code Fragment 3: The same frozen-backbone setup in two lines using timm. The create_model call downloads the DINO checkpoint, builds the matching ViT architecture, and applies num_classes=0 to expose the 384-dimensional embedding directly. The library handles the head surgery that Code Fragment 2 did by hand, and the weight loading as well.

timm handles the architecture, the pretrained weights (here a DINO checkpoint from Section 25.3), and the removal of the classification head via num_classes=0. Preprocessing is not applied automatically, but the library can build the matching transform from the checkpoint's configuration (timm.data.resolve_model_data_config followed by timm.data.create_transform). What took roughly twenty lines of manual head surgery becomes two, and the library exposes many pretrained checkpoints, self-supervised ones among them, through the same call. The from-scratch version above exists so you understand what those two lines do internally.

Research Frontier: From Hand-Designed Puzzles to Learned Objectives

The pretext tasks of this section were largely superseded between 2020 and 2023 by contrastive and masked-modeling objectives, but the underlying question, what is the best free supervision signal, is still open and active. Masked image modeling (the MAE of Section 25.3) can be read as a vastly more powerful pretext task: predict the missing patches rather than the rotation. The 2023 to 2024 I-JEPA and V-JEPA work from Meta pushes this further by predicting in representation space instead of pixel space, arguing that pixel-level reconstruction wastes capacity on imperceptible detail; we will return to this in Section 25.6. And the lesson that learned descriptors beat hand-crafted ones, which began with SIFT and ORB back in Chapter 10, reaches its sharpest form here: the field has moved from hand-designing features, to hand-designing pretext tasks that learn features, to learning the objective itself.

Exercise 25.1.1: When Does Rotation Prediction Fail? Conceptual

List four categories of images for which RotNet would learn almost nothing because the rotation is not predictable from content, and explain the shared property that makes them fail. Then describe a different pretext task that would learn useful features on those same images, and justify why its free label is recoverable from content where the rotation label is not. Connect your answer to the lesson of the Fun Fact in Section 2: a pretext task only teaches what its assumption about the world makes learnable.

Exercise 25.1.2: Build and Probe a RotNet Coding

Using the code in Sections 2 and 4, pretrain a small ResNet on the CIFAR-10 training images with the rotation pretext task (ignore the class labels during pretraining). Then freeze the backbone and run a linear probe using the real CIFAR-10 labels, reporting test accuracy. As a control, run the same linear probe on a randomly initialized (untrained) backbone. The gap between the two numbers is the value the rotation pretext task added. Report both numbers and one sentence interpreting the gap.

Exercise 25.1.3: Diagnosing a Shortcut Analysis

Reread the agritech practical example. Suppose your own jigsaw pretraining reaches 95 percent permutation accuracy but transfers poorly. Design a three-step diagnostic procedure to determine whether the model is exploiting a low-level shortcut: what would you measure, what augmentation would you add as a test, and what change in pretext accuracy versus downstream accuracy would confirm the shortcut hypothesis? Explain why a drop in pretext accuracy after augmentation is good news in this situation.