Section 21.3: Transfer Learning & Fine-Tuning Strategies

"I spent a million images learning what an edge is, what a texture is, what a wheel is. You have four hundred photos of your specific kind of bird. Please do not ask me to forget the wheel. Just let me adjust the last few opinions."
A Pretrained Backbone Pleading for Its Early Layers

Big Picture

Almost no real vision project trains from scratch; you start from a backbone pretrained on a large dataset and adapt it, and the three decisions that matter are what to freeze, what to fine-tune, and how to set the learning rates so you improve the features without destroying them. Transfer learning works because the early layers of a vision network learn general-purpose features (edges, textures, simple shapes) that are useful for almost any visual task, while only the late layers are specialized to the original classes. This section turns that observation into a workflow: feature extraction for tiny datasets, full fine-tuning for larger ones, discriminative learning rates so early and late layers learn at different speeds, and a decision tree that picks the right strategy from your data size and how far your domain sits from ImageNet.

Augmentation in Section 21.2 stretched the data you already have; transfer learning does the complementary thing, reusing the knowledge a network has already extracted from data someone else collected. This is the most immediately useful section in the chapter for a working engineer. You have already used pretrained weights in Chapter 20 for inference; here we adapt them to a new task. The premise is that the features a network learned on ImageNet's million images are a far better starting point than random initialization, even for a task that looks nothing like ImageNet. The art is in adapting those features without wrecking them, and that art reduces to a small number of well-understood choices, which this section makes explicit. Transfer learning is the same idea that, scaled up, becomes foundation-model fine-tuning in Chapter 25 and LoRA adapters for generators in Chapter 34. The illustration below captures the instinct that drives every choice in this section: keep the hard-won foundations, and change only the top.

[Illustration: a friendly robot protectively shields the worn bottom blocks of its layer stack, which carry edge and texture patterns, while offering only the shiny top block to be replaced. Early general features are frozen and protected while the classifier head is swapped and fine-tuned.] Caption: A pretrained backbone has already learned what an edge is; transfer learning means adjusting its last opinions, not erasing its hard-won foundations.

1. Why Transfer Works: General Early, Specific Late (Beginner)

The empirical foundation of transfer learning is a layered structure in what a network learns. The first convolutional layer of almost any trained vision model learns oriented edge and color-blob detectors that look strikingly like the Gabor filters and Sobel kernels of Chapter 3, a resemblance we flagged there. Middle layers compose these into textures and parts. Only the final layers assemble parts into the specific categories of the training task. This gradient from general to specific is exactly why transfer is possible: the general layers are reusable, and only the specific layers need replacing.

Key Insight: The First Layer Is Almost Always the Same Picture

Here is the result that makes freezing feel safe rather than reckless. Train AlexNet, VGG, ResNet, and a vision transformer on completely different datasets, and visualize the first layer each one learned: you get much the same picture every time, a tiled grid of oriented edge detectors and color-opponent blobs that closely resemble the Gabor and Sobel kernels you built by hand in Chapter 3. Gradient descent, given any natural images, rediscovers the same low-level vocabulary because that vocabulary is dictated by the statistics of natural images, not by the labels. So when you freeze the early layers of an ImageNet backbone for a task it has never seen, you are not gambling that its edge detectors happen to fit; you are reusing the one part of the network that would have come out nearly identical even if you had retrained it from scratch on your own data. The shock of transfer learning is not that it works on similar tasks; it is that the first layer barely cares what the task is.

Figure 21.3.1 makes the consequence concrete. The closer your task is to ImageNet and the more data you have, the more of the network you should be willing to adapt; the more your task differs or the less data you have, the more you should keep frozen and reuse.

Figure 21.3.1: The general-to-specific gradient through network depth. Early layers learn reusable low-level features, late layers learn task-specific concepts, and the classifier head is always replaced. Transfer strategies differ only in where they draw the freeze-versus-fine-tune boundary.

2. Two Core Strategies: Feature Extraction and Fine-Tuning (Beginner)

The two endpoints of the strategy spectrum are feature extraction and full fine-tuning. In feature extraction you freeze the entire pretrained backbone, treat it as a fixed feature computer, and train only a new classifier head on top. This is fast, needs little data, and is hard to overfit because only the head's parameters are learning. In full fine-tuning you replace the head and then continue training the whole network at a small learning rate, letting every layer adjust to the new task. This needs more data and compute but usually reaches higher accuracy when the new task differs meaningfully from the pretraining task and there is enough data to support it. The first step in both is identical and is the one beginners most often get wrong: replace the head, because the pretrained final layer outputs the original number of classes, not yours.

import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def build_transfer_model(num_classes, mode="feature_extract"):
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
    if mode == "feature_extract":
        for p in model.parameters():        # freeze the whole backbone
            p.requires_grad = False
    # Replace the 1000-class head with a fresh head for OUR classes.
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)   # new head is trainable
    return model

m = build_transfer_model(num_classes=37, mode="feature_extract")
trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
total = sum(p.numel() for p in m.parameters())
print(f"trainable {trainable:,} of {total:,}")
# Expected: trainable 75,813 of 23,583,845   (only the new head learns)

Code Fragment 1: Feature extraction in practice: freeze the backbone, swap the head. The loop over model.parameters() sets requires_grad = False on the whole body, then model.fc is replaced with a fresh nn.Linear for the new class count. The printed count confirms only about 76k of 23.6M parameters update, so this trains fast and resists overfitting even on a few hundred images.
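The trainable count printed by Code Fragment 1 can be checked by hand, which is a useful habit whenever a freeze might silently have missed a layer. The sketch below assumes ResNet-50's pooled feature width of 2048 (the value model.fc.in_features returns) and reuses the total reported above; linear_param_count is a hypothetical helper name introduced only for this check.

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# Assumption: ResNet-50 feeds a 2048-dimensional pooled feature into its head.
head = linear_param_count(2048, 37)
total = 23_583_845          # total parameter count reported by Code Fragment 1
frozen = total - head       # everything that is not the new head

print(f"head {head:,}  frozen {frozen:,}  trainable share {head / total:.2%}")
```

If the trainable number your own model reports is larger than the head alone, some backbone parameters are still learning and the freeze loop ran in the wrong place.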

Key Insight: The Head Is Always New, the Body Is a Choice

Two things are non-negotiable in transfer learning, and one is a decision. Non-negotiable: you always replace the classifier head, because it outputs the wrong number of classes, and you always match the pretrained preprocessing from Section 21.1. The decision is how much of the body to unfreeze, and that is what the rest of this section is about. A useful default progression is to start with feature extraction (head only), confirm the pipeline works, then unfreeze and fine-tune for the accuracy gain if your data supports it.
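One way to "confirm the pipeline works" in the head-only stage is to take a single optimizer step and check that the frozen body did not move while the head did. The sketch below uses a tiny stand-in network and random data so it runs without downloading weights; the layer sizes, batch, and labels are illustrative assumptions, not part of the workflow above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a pretrained model: a frozen body and a fresh head.
body = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
head = nn.Linear(8, 3)
model = nn.Sequential(body, head)
for p in body.parameters():
    p.requires_grad = False

# Hand the optimizer only the parameters that are allowed to learn.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3)

body_before = [p.detach().clone() for p in body.parameters()]
head_before = head.weight.detach().clone()

x = torch.randn(4, 16)            # fake batch of four inputs
y = torch.tensor([0, 1, 2, 1])    # fake labels for three classes
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()

body_unchanged = all(torch.equal(a, b)
                     for a, b in zip(body_before, body.parameters()))
head_changed = not torch.equal(head_before, head.weight)
print(f"body unchanged: {body_unchanged}, head changed: {head_changed}")
```

The same two checks apply unchanged to a real backbone: snapshot a few frozen tensors before the first step and compare afterward.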

Common Misconception: "Full Fine-Tuning Always Beats Feature Extraction"

A common belief is that unfreezing the whole backbone is strictly the stronger move and that frozen feature extraction is just a weaker shortcut you settle for. In fact the opposite is often true on exactly the small or distant datasets where transfer matters most: with only a few hundred images, full fine-tuning has tens of millions of free parameters chasing a tiny supervision signal, so it overfits and can catastrophically forget the general edge and texture filters of subsection 1, ending up below a frozen backbone with a trained head. Table 21.3.1 exists precisely because the right amount of unfreezing is a function of your data size and domain gap, not a quantity to always maximize. The honest default is to feature-extract first, measure on a held-out set, and unfreeze only as far as the validation curve keeps improving.

3. Discriminative Learning Rates and Gradual Unfreezing (Intermediate)

Full fine-tuning has a hidden danger. If you fine-tune the whole network at a single learning rate, the freshly-initialized head produces large, noisy gradients in its first steps, and those gradients flow back and disturb the carefully-learned early features before the head has stabilized. Two techniques, both popularized by the fastai practitioners, prevent this. Discriminative (layer-wise) learning rates assign smaller learning rates to earlier layers and larger ones to later layers, so the general early features change slowly while the specific late layers adapt quickly. A common pattern is a geometric decay, where each earlier group's learning rate is a fixed fraction (say one third) of the group after it.
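Written out, the geometric pattern looks as follows. The symbols are notation introduced here for illustration: $\eta_{\text{head}}$ is the learning rate of the last group (the head), $\gamma$ is the fixed fraction, and the $n$ groups are indexed $i = 0, \dots, n-1$ from the input toward the head.

```latex
\eta_i = \eta_{\text{head}} \, \gamma^{\,n-1-i}, \qquad i = 0, 1, \dots, n-1
```

With $\gamma = 1/3$ and $n = 6$ groups, for example, the earliest group learns at $\eta_{\text{head}} / 3^{5} = \eta_{\text{head}} / 243$.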

Gradual unfreezing is the temporal version of the same idea: train the new head alone for an epoch or two so it stabilizes, then unfreeze the next block back and continue, progressively thawing toward the input. Both techniques share one goal: to protect the valuable early features from the chaos of an untrained head. The code below shows discriminative learning rates via parameter groups, the mechanism PyTorch optimizers provide for exactly this.

import torch

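def make_param_groups(model, base_lr=1e-3, decay=0.3):
    """Earlier ResNet stages get geometrically smaller learning rates."""
    # Each stage is a list of modules. The stem pairs conv1 with its batch
    # norm bn1; leaving bn1 out would keep its scale and shift out of the
    # optimizer, so they would never be updated.
    stages = [[model.conv1, model.bn1], [model.layer1], [model.layer2],
              [model.layer3], [model.layer4], [model.fc]]
    groups = []
    n = len(stages)
    for i, stage in enumerate(stages):
        # last stage (head) gets base_lr; each earlier stage is `decay` times smaller
        lr = base_lr * (decay ** (n - 1 - i))
        params = [p for module in stage for p in module.parameters()]
        groups.append({"params": params, "lr": lr})
    return groups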

model = build_transfer_model(num_classes=37, mode="finetune")
for p in model.parameters():
    p.requires_grad = True                        # full fine-tune
opt = torch.optim.AdamW(make_param_groups(model, base_lr=1e-3))
for g in opt.param_groups:
    print(f"lr = {g['lr']:.2e}")
# Expected: lr = 2.43e-06 / 8.10e-06 / 2.70e-05 / 9.00e-05 / 3.00e-04 / 1.00e-03

Code Fragment 2: Discriminative learning rates as optimizer parameter groups. make_param_groups walks the ResNet stages from conv1 to fc and scales each earlier stage's rate by decay ** (n - 1 - i), so the head learns at $10^{-3}$ while the first convolution learns roughly four hundred times slower. AdamW accepts the resulting list of per-group learning rates directly, protecting the general edge and texture filters from the head's early noise.
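Gradual unfreezing, the temporal counterpart described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: thaw_from_head is a hypothetical helper, the three linear layers stand in for real network stages so the example runs without pretrained weights, and the default of two epochs per stage simply follows the "an epoch or two" guidance.

```python
import torch.nn as nn


def thaw_from_head(stages, epoch, epochs_per_stage=2):
    """Make the last k stages trainable, where k grows by one every
    `epochs_per_stage` epochs. `stages` is ordered from input to head."""
    k = min(len(stages), 1 + epoch // epochs_per_stage)
    for i, stage in enumerate(stages):
        trainable = i >= len(stages) - k
        for p in stage.parameters():
            p.requires_grad = trainable
    return k


# Toy stand-in for [stem, block, head]; with a ResNet the list would hold
# its stages in the same input-to-head order.
stages = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 4)]
history = []
for epoch in range(6):
    k = thaw_from_head(stages, epoch)
    flags = [all(p.requires_grad for p in s.parameters()) for s in stages]
    history.append((epoch, k, flags))
    print(epoch, k, flags)
```

Calling the helper at the start of every epoch keeps the schedule in one place; it composes with per-group learning rates, since freezing only controls which groups receive gradients.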

Fun Fact

The term "catastrophic forgetting" is the field's gloriously dramatic name for what happens when you fine-tune too aggressively: the network, asked to learn your 400 birds, cheerfully overwrites the million-image education it took weeks to acquire and forgets what an edge looks like. It is the deep-learning equivalent of cramming for one exam and walking out unable to remember your own phone number. Discriminative learning rates and gradual unfreezing exist precisely so the backbone keeps its hard-won general knowledge while it picks up the new specifics.

Practical Example: 400 Images, One Afternoon, Production Accuracy

Who: a small e-commerce team needing to classify product photos into 37 fine-grained furniture categories, 2025. Situation: they had only about 400 labeled images, roughly a dozen per class, and no budget for large-scale annotation. Problem: training a ResNet-50 from scratch on 400 images produced a model barely above chance, badly overfit within a few epochs. Decision: they switched to transfer learning. First feature extraction (frozen backbone, new head) reached usable accuracy in minutes; then full fine-tuning with discriminative learning rates ($10^{-3}$ on the head decaying to $10^{-6}$ on the stem) lifted it further. They applied the augmentation of Section 21.2 to stretch the tiny dataset. Result: a model good enough to deploy, trained in an afternoon on a single GPU, from 400 images. Lesson: for small datasets, transfer learning is not an optimization, it is the only thing that works. The million images the backbone already saw are worth far more than the 400 you can label, so reuse them and adapt gently.

You Could Build This: A Custom Classifier From Your Own Photos

You now have everything you need for a genuinely useful portfolio project, and it does not require a labeled benchmark. Pick something you personally care to tell apart (the dog breeds at your local park, the bird species at your feeder, the five plants on your windowsill, the components in your electronics drawer) and shoot or collect a few dozen photos per class. Apply exactly the workflow of this section: load a pretrained backbone with timm.create_model(..., num_classes=N), feature-extract first to confirm the pipeline, then full fine-tune with the discriminative learning rates of subsection 3 and the augmentation of Section 21.2 to stretch the tiny set. This is intermediate and takes about two to three hours, most of it spent collecting images rather than coding. It complements the Oxford Pets lab in Section 21.6 by forcing you to own the messy part the benchmark hides: deciding your classes, gathering real photos, and confronting your own class imbalance. A working "trained on 200 photos I took myself" classifier, with its validation curve, is exactly the kind of concrete artifact that stands out in an interview.

4. A Decision Tree for Picking a Strategy (Intermediate)

The choice among feature extraction, partial fine-tuning, and full fine-tuning is governed by two axes: how much labeled data you have, and how far your domain sits from the pretraining domain. Figure 21.3.2 lays those two axes out as a plane, and the four quadrants give a clean rule of thumb, summarized afterward in Table 21.3.1. With little data and a similar domain, freeze the backbone and train only the head, because you cannot afford to fit many parameters and the features already fit. With lots of data and a distant domain, fine-tune everything or even consider training from scratch, because you have the data to reshape the features and they need reshaping. The two mixed quadrants sit in between. With little data and a distant domain, fine-tune partially: thaw the later blocks, keep the early ones frozen, and lean on heavy augmentation. With lots of data and a similar domain, fine-tune the whole network, using discriminative learning rates to keep the early layers close to their pretrained values.

Figure 21.3.2: The transfer-strategy decision plane. The vertical axis is how much labeled data you have; the horizontal axis is how far your domain sits from ImageNet. Freezing is safest in the small-data, similar-domain corner (green); more of the network is fine-tuned as either data grows or the domain drifts, with training from scratch reserved for the large-data, very-distant corner (red). The same four recommendations are tabulated in Table 21.3.1.

Table 21.3.1: Choosing a transfer strategy from data size and domain gap.

Data size	Similar domain (close to ImageNet)	Distant domain (medical, satellite, microscopy)
Small data (hundreds)	Feature extraction: freeze backbone, train head only	Partial fine-tune: thaw later blocks, heavy augmentation
Large data (tens of thousands+)	Fine-tune all, discriminative learning rates	Fine-tune all (or from scratch if data is huge and very distant)
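The table can also be read as a small lookup, sketched below. The table itself gives only "hundreds" and "tens of thousands+" and leaves the middle range to judgment, so the boundary between small and large data is a required argument here rather than a built-in value; choose_strategy and small_data_cutoff are hypothetical names, and the cutoff of 5,000 in the usage line is purely illustrative.

```python
def choose_strategy(n_labeled: int, distant_domain: bool, *,
                    small_data_cutoff: int) -> str:
    """Return the Table 21.3.1 cell for a dataset size and domain gap.

    `small_data_cutoff` is the caller's own judgment of where "small"
    ends; the table does not fix a number.
    """
    small = n_labeled < small_data_cutoff
    if small and not distant_domain:
        return "feature extraction: freeze backbone, train head only"
    if small and distant_domain:
        return "partial fine-tune: thaw later blocks, heavy augmentation"
    if not distant_domain:
        return "fine-tune all, discriminative learning rates"
    return "fine-tune all (or from scratch if data is huge and very distant)"


# Illustrative call: 400 images from a domain close to ImageNet.
print(choose_strategy(400, distant_domain=False, small_data_cutoff=5_000))
```

Whether a domain counts as distant remains a judgment call that the function takes as input; the recommendation it returns is a starting point to validate, not a verdict.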

Two practical notes refine the table. First, batch-normalization layers (from Chapter 19) hold running statistics; when you freeze a backbone for feature extraction, keep those layers in evaluation mode so they do not update their statistics on your small dataset. Setting requires_grad = False is not enough for this, because it stops gradient updates to the layer's scale and shift but not the running-mean and running-variance updates that happen in training mode. Second, the domain gap dominates for distant domains such as medical or satellite imagery: generic edge features still transfer to some degree, but the textures and parts are unfamiliar, which is why the distant quadrants lean toward more fine-tuning. The decision tree is a starting point; always confirm by measuring on a held-out validation set.
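The first note is easy to get wrong in practice, because a training loop usually calls model.train() and that switches every batch-normalization layer back to updating its statistics. A minimal sketch of one way to handle it follows; set_batchnorm_eval is a hypothetical helper name, and the tiny convolutional stack is a stand-in for a real frozen backbone so the example runs without pretrained weights.

```python
import torch
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)


def set_batchnorm_eval(model: nn.Module) -> nn.Module:
    """Put every batch-norm layer in eval mode so running stats stay fixed."""
    for m in model.modules():
        if isinstance(m, BN_TYPES):
            m.eval()
    return model


# Toy frozen "backbone" (assumption: stands in for a pretrained network).
backbone = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3),
                         nn.BatchNorm2d(4), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False

backbone.train()               # what a training loop typically calls
set_batchnorm_eval(backbone)   # must come AFTER .train(), which resets the mode

bn = backbone[1]
mean_before = bn.running_mean.clone()
with torch.no_grad():
    backbone(torch.randn(8, 3, 16, 16) + 5.0)   # shifted data would move the stats
stats_unchanged = torch.equal(mean_before, bn.running_mean)
print("running stats unchanged:", stats_unchanged)
```

The helper has to be re-applied after every call to model.train(), so the natural place for it is right after that call at the top of each epoch.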

Library Shortcut: timm Backbones with a Built-In Head Swap

Manually replacing a torchvision model's .fc works but differs per architecture (the head is .classifier on others, and an nn.Sequential of several layers on yet others). The timm library standardizes head replacement and frozen feature extraction across hundreds of backbones with one factory call:

import timm
# num_classes swaps the head automatically for ANY of ~1000 architectures.
model = timm.create_model("convnext_tiny", pretrained=True, num_classes=37)

# Feature-extraction mode: a pooled feature vector, no head, in one flag.
feat = timm.create_model("convnext_tiny", pretrained=True,
                         num_classes=0, global_pool="avg")
# data_config gives the EXACT preprocessing for this checkpoint:
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

Code Fragment 3: The same head swap and feature extraction across any backbone via timm. Passing num_classes=37 to create_model replaces the head for any of roughly a thousand architectures, while num_classes=0 returns a pooled feature vector. resolve_data_config then supplies the exact preprocessing for that checkpoint, eliminating the per-model head wiring of Code Fragment 1 and the preprocessing-mismatch failure mode of Section 21.1.

Because timm handles the architecture-specific head surgery behind a uniform num_classes argument and ships the matching preprocessing with each checkpoint, it is the production path for the transfer workflow.

Research Frontier: Parameter-Efficient Fine-Tuning

As backbones grew into the hundreds of millions and billions of parameters (the foundation models of Chapter 25), full fine-tuning became expensive to train and to store, since every task needs its own full copy of the model. The answer that has taken hold since 2021 is parameter-efficient fine-tuning (PEFT). LoRA (Low-Rank Adaptation) freezes the entire backbone and learns only small low-rank update matrices, often under 1% of the parameters, achieving accuracy close to full fine-tuning at a fraction of the cost. Visual prompt tuning and adapter modules pursue the same goal for vision transformers. PEFT is now the default for adapting large models, and it reappears directly in Chapter 34, where LoRA is how you teach a giant text-to-image model a new style or subject from a handful of images. The discriminative-learning-rate intuition of subsection 3 (change the specialized parts a lot and the general parts a little) is the conceptual ancestor of these methods.
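To make "small low-rank update matrices" concrete, here is a minimal sketch of the LoRA idea applied to a single linear layer. It is an illustration under stated assumptions, not the reference implementation or any library's API: LoRALinear is a hypothetical class, and the layer width of 1024, the rank of 4, and the scaling alpha are arbitrary choices. The frozen weight is left untouched, and only the two thin matrices A and B are trained.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + b + (alpha / rank) * B (A x)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the pretrained weight stays fixed
        # A starts small and random, B starts at zero, so the update is
        # exactly zero before any training step.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


torch.manual_seed(0)
base = nn.Linear(1024, 1024)
layer = LoRALinear(base, rank=4)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
x = torch.randn(2, 1024)
same_at_init = torch.allclose(layer(x), base(x))
print(f"trainable {trainable:,} of {total:,} ({trainable / total:.2%}); "
      f"unchanged at init: {same_at_init}")
```

Because B starts at zero, the wrapped layer reproduces the pretrained layer exactly at the first step, which is the same "do not disturb what already works" instinct behind gradual unfreezing.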

Exercise 21.3.1: Place Your Project on the Tree (Conceptual)

For each scenario, name the recommended strategy from Table 21.3.1 and justify it in two sentences: (a) 50,000 photos of everyday objects from web images, (b) 300 grayscale microscopy images of cell types, (c) 2,000 satellite tiles for land-use classification, (d) 5 million product photos for a retail catalog. For each, also state what you would do with the batch-normalization layers and why, referring to Chapter 19.

Exercise 21.3.2: Feature Extraction vs Fine-Tuning, Measured (Coding)

Take a pretrained ResNet-50 and a small dataset (for example the Oxford Pets or Flowers-102 dataset, both in torchvision). Train three models with identical augmentation and schedule: feature extraction (frozen backbone), full fine-tuning at a single learning rate, and full fine-tuning with the discriminative learning rates from subsection 3. Report final validation accuracy and training time for each. Check whether discriminative learning rates match or beat the single-rate fine-tune, and whether they disturb the early features less (for example, by measuring how far the first-layer weights move from their pretrained values).

Exercise 21.3.3: Visualize What the First Layer Learned (Analysis)

Load a pretrained ResNet-50 and extract the weights of its first convolution (model.conv1.weight, shape $64 \times 3 \times 7 \times 7$). Visualize all 64 filters as small RGB images. Identify oriented edges, color-opponent blobs, and frequency-selective patterns, and compare them to the Gabor and Sobel kernels of Chapter 3. Write a paragraph arguing, from what you see, why these specific filters are safe to freeze and reuse across almost any visual task.