Part III: Deep Learning for Computer Vision
Chapter 22: Vision Transformers

Data-Efficient Training: DeiT & Augmentation for ViTs

"My first trainer fed me three hundred million pictures and called me a prodigy. My second trainer had one million, a bag of tricks, and a wise old CNN sitting next to me whispering answers. I learned just as much. The second trainer had the better story."

A Vision Transformer Who Found a Mentor
Big Picture

The original ViT only beat CNNs after pretraining on roughly 300 million images, because a model with no locality bias has to learn from data what a convolution gets for free. DeiT closed that gap on plain ImageNet by replacing scale with three things: very heavy augmentation, strong regularization, and a distillation token that learns from a CNN teacher. This is the work that turned the ViT from a result you could only reproduce inside a large lab into a model you can train on a single 8-GPU node in a few days. The recipe matters as much as the architecture, which is exactly the lesson of Chapter 21, and ViTs feel that lesson more acutely than any CNN.

Section 22.2 left us with a working ViT and an unanswered question: it needs "enough data", but how much, and what do you do if you do not have it? The honest answer from the 2020 paper was sobering. On ImageNet alone, trained from scratch, ViT-Base scored several points below a comparable ResNet of the kind you met in Chapter 20. It only pulled ahead after pretraining on JFT-300M, Google's internal 300-million-image dataset, which almost no one outside a handful of labs can touch. For a while that made the ViT look like a beautiful idea gated behind data most teams will never have. This section is the story of how that gate came down.

1. Why ViTs Are Data-Hungry (Beginner)

The data appetite is a direct consequence of the missing inductive bias from Section 22.1. A convolution assumes locality and translation equivariance, and those assumptions are correct for natural images, so a CNN starts partway to the answer and needs comparatively little data to finish. A ViT assumes nothing about spatial structure; it must learn, purely from labeled examples, that nearby patches are related and that an object looks the same wherever it appears. Learning a true fact about the world from data, rather than having it hard-coded, costs examples. With a small dataset the ViT has enough freedom to fit the training set in ways that do not generalize, so it overfits where the CNN, hemmed in by its biases, does not.
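The phrase "translation equivariance" can be checked on a toy signal. The sketch below is a minimal illustration in plain Python, not a model of a real network: `circular_conv1d`, `shift`, and `position_specific` are hypothetical helpers, the signal is one-dimensional with wrap-around padding, and the per-position layer is only a loose stand-in for a model with no weight sharing built in (it is not attention).

```python
def circular_conv1d(signal, kernel):
    """Slide one shared kernel over a 1-D signal with wrap-around padding."""
    n = len(signal)
    return [sum(kernel[k] * signal[(i + k) % n] for k in range(len(kernel)))
            for i in range(n)]


def shift(signal, steps):
    """Circularly shift a signal to the right by `steps` positions."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps] if steps else list(signal)


def position_specific(signal, weights):
    """A layer with a separate free weight per position (no sharing)."""
    return [w * x for w, x in zip(weights, signal)]


signal = [0, 0, 1, 3, 1, 0, 0, 0]    # a small "object" in a 1-D image
kernel = [1, -2, 1]                   # one shared, local filter (3 taps)
weights = [1, 2, 3, 4, 5, 6, 7, 8]    # arbitrary per-position weights

# Convolution: moving the object and then filtering equals filtering and then moving.
conv_equivariant = (shift(circular_conv1d(signal, kernel), 2)
                    == circular_conv1d(shift(signal, 2), kernel))

# Per-position weights: the same check fails, so this behavior would have to be learned.
free_equivariant = (shift(position_specific(signal, weights), 2)
                    == position_specific(shift(signal, 2), weights))

print(conv_equivariant, free_equivariant)
```

The convolution passes the check by construction, for any kernel values. The unconstrained layer passes only if training happens to push its weights toward a shared pattern, which is the sense in which the ViT pays in examples for what the CNN gets from its architecture.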

Figure 22.3.1 sketches the qualitative picture the ViT paper reported: at small data scales the CNN wins because its biases substitute for data, but the curves cross, and at large scales the ViT's flexibility lets it keep improving past the point where the CNN's biases become a ceiling. The exact crossover depends on the architectures and recipe, and pinning it down quantitatively is the subject of Section 22.5; here it motivates the problem DeiT solves.

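[Figure 22.3.1 plot: accuracy versus dataset size (log scale), with ImageNet-1k, ImageNet-21k, and JFT-300M marked; the CNN curve leads at small scale, the curves cross, and the ViT curve keeps climbing.]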
Figure 22.3.1: The qualitative data-scale story (after the ViT paper). At small data the CNN's inductive biases substitute for examples and it leads; the curves cross as data grows, and at very large scale the ViT's freedom from those biases lets it surpass the CNN. DeiT's goal is to lift the ViT curve at the ImageNet-1k end without 300 million images.

2. The DeiT Recipe: Augmentation and Regularization (Intermediate)

DeiT (Data-efficient image Transformer, Touvron et al., 2021) made one central claim: most of the ViT's data hunger can be fed by augmentation and regularization instead of by more images. The recipe is essentially the modern training bundle of Chapter 21, applied unusually aggressively. It combines RandAugment (automated, stacked photometric and geometric ops), MixUp and CutMix (the label-mixing augmentations of Section 21.2 that blend both pixels and targets), random erasing, strong stochastic depth (randomly dropping whole transformer blocks during training), label smoothing, weight decay, and a long cosine schedule with warmup. Each technique fights overfitting from a different angle; together they let ViT-Base reach competitive ImageNet accuracy trained only on ImageNet's $1.28$ million images.
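The "long cosine schedule with warmup" in that list is easy to state as a formula. The sketch below is one common way to write it, in plain Python so the arithmetic is visible; the function name `lr_at_epoch` and every default value (`base_lr`, `warmup_epochs`, `total_epochs`, `min_lr`) are illustrative assumptions, not settings taken from this chapter.

```python
import math


def lr_at_epoch(epoch, base_lr=5e-4, warmup_epochs=5, total_epochs=300, min_lr=1e-5):
    """Linear warmup to base_lr, then cosine decay to min_lr (all values assumed)."""
    if epoch < warmup_epochs:
        # Ramp up linearly so the first updates are small.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 150, 299):
    print(epoch, f"{lr_at_epoch(epoch):.2e}")
```

The warmup protects the early, unstable updates of a freshly initialized transformer, and the slow cosine tail gives the heavily augmented model many low-learning-rate epochs to settle.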

The point worth internalizing is that the augmentation is doing the job the missing inductive bias used to do. MixUp and CutMix manufacture a near-endless supply of varied training images, forcing the model to rely on robust, generalizable features rather than memorizing the finite training set, which is exactly the freedom-without-data problem subsection 1 described. The code below assembles the DeiT-style transform stack with the torchvision v2 transforms from Chapter 21.

# Assemble the DeiT training augmentation: per-image RandAugment and random
# erasing in the Compose, plus batch-level MixUp/CutMix that mix images and
# labels together. This manufactured variety stands in for more real images.
import torch
from torchvision.transforms import v2

# DeiT-style training augmentation: aggressive by CNN standards.
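train_tf = v2.Compose([
    v2.ToImage(),                                     # PIL/ndarray -> tensor image, so ToDtype and Normalize apply
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(),
    v2.RandAugment(num_ops=2, magnitude=9),          # automated stacked ops
    v2.ToDtype(torch.float32, scale=True),           # uint8 [0, 255] -> float [0, 1]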
    v2.Normalize(mean=[0.485, 0.456, 0.406],         # ImageNet statistics
                 std=[0.229, 0.224, 0.225]),
    v2.RandomErasing(p=0.25),                         # cut out a random region
])

# MixUp / CutMix operate on a whole batch (mixing images AND labels):
mixup_cutmix = v2.RandomChoice([
    v2.MixUp(num_classes=1000),
    v2.CutMix(num_classes=1000),
])
# in the loop:  images, labels = mixup_cutmix(images, labels)
Code Fragment 1: The DeiT-style augmentation stack. Per-image RandAugment and random erasing plus batch-level MixUp/CutMix substitute synthetic data variety for the millions of real images the original ViT required.
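What "mixing images AND labels" means for the targets can be worked through by hand. The sketch below is a plain-Python illustration of the label arithmetic only, with hypothetical helper names (`one_hot`, `mix_targets`, `cutmix_lambda`) and made-up class indices; it does not reproduce how torchvision samples the mixing coefficient or the box.

```python
def one_hot(label, num_classes):
    """One-hot target vector as a plain list."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def mix_targets(label_a, label_b, lam, num_classes):
    """Soft target lam * onehot(a) + (1 - lam) * onehot(b)."""
    a, b = one_hot(label_a, num_classes), one_hot(label_b, num_classes)
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def cutmix_lambda(box_h, box_w, img_h, img_w):
    """Fraction of the image area that still belongs to the original picture."""
    return 1.0 - (box_h * box_w) / (img_h * img_w)


# MixUp: every pixel is 0.7 * image_a + 0.3 * image_b, and the label is mixed the same way.
mixup_target = mix_targets(label_a=0, label_b=1, lam=0.7, num_classes=3)

# CutMix: a 112x112 patch of image_b is pasted into a 224x224 image_a,
# so the label weight of image_a is the share of area it keeps.
lam = cutmix_lambda(112, 112, 224, 224)
cutmix_target = mix_targets(label_a=0, label_b=1, lam=lam, num_classes=3)

print(mixup_target)
print(lam, cutmix_target)
```

Because the target is no longer one-hot, the loss must accept class probabilities rather than integer labels.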
Key Insight: Augmentation Is a Substitute for Inductive Bias

A CNN gets locality and translation equivariance from its architecture, for free, with no data cost. A ViT must learn invariances from examples, so you manufacture those examples: random crops teach translation tolerance, flips teach reflection tolerance, color jitter teaches photometric robustness, and MixUp and CutMix teach the model not to over-trust any single region. Seen this way, heavy augmentation is not a hack bolted onto ViT training; it is how you pay, in synthetic data, the inductive-bias bill that the convolution settled in its architecture. This is why ViTs are far more sensitive to the augmentation recipe than CNNs are: turn the augmentation off and a ViT overfits badly, while a CNN typically degrades far more gently.

3. Distillation Through a Token (Advanced)

DeiT's signature contribution is a new way to do knowledge distillation. The classical idea (Hinton et al.) is to train a small "student" to match the softened output probabilities of a strong "teacher", transferring the teacher's knowledge. The word "softened" matters: a one-hot label says only "this is a cat", but a teacher's full probability vector says "mostly cat, a little lynx, almost never car", and those relative confidences encode which classes look alike. Learning to reproduce that richer signal teaches the student more per image than the bare label does. DeiT adapts this to the transformer in an architecturally native way: it adds a second special token, the distillation token, alongside the class token from Section 22.2. The distillation token flows through the encoder just like the class token, but its output is trained to predict the teacher's label rather than the ground-truth label. At inference the predictions from both tokens are combined.
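The difference between a bare label and a softened teacher output is easiest to see with numbers. The sketch below uses made-up teacher logits for the cat, lynx, and car example and a temperature-scaled softmax, as in classical distillation; the logits and the temperature of 3 are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "lynx", "car"]
teacher_logits = [6.0, 3.0, -2.0]    # made-up numbers for illustration

top = teacher_logits.index(max(teacher_logits))
hard = [1.0 if i == top else 0.0 for i in range(len(classes))]   # label-style target
sharp = softmax(teacher_logits, temperature=1.0)                  # raw teacher output
soft = softmax(teacher_logits, temperature=3.0)                   # softened teacher output

for name, h, p1, p3 in zip(classes, hard, sharp, soft):
    print(f"{name:>5}  hard={h:.2f}  T=1: {p1:.3f}  T=3: {p3:.3f}")
```

Raising the temperature moves probability mass from the winner to the runner-up classes, so "lynx is close to cat, car is not" becomes a visible part of the training target instead of rounding to zero.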

The teacher in DeiT is, pointedly, a strong CNN (a RegNet). The ViT student therefore learns from a network that already has the locality and translation biases the ViT lacks, in effect importing those biases through the teacher's predictions rather than rediscovering them from scratch. This is the likely reason CNN teachers worked better than ViT teachers in the DeiT authors' experiments: the student is borrowing exactly the inductive bias it is missing. The training objective combines a standard cross-entropy on the class token with a distillation loss on the distillation token,

$$\mathcal{L} = \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\big(\psi(z_{\text{cls}}),\, y\big) \;+\; \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\big(\psi(z_{\text{dist}}),\, y_{\text{teacher}}\big)$$

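where $z_{\text{cls}}$ and $z_{\text{dist}}$ are the two tokens' output logits, $\psi$ is the softmax, $y$ is the true label, and $y_{\text{teacher}}$ is the CNN teacher's predicted (hard) label, that is, the argmax of the teacher's output. Although classical distillation uses softened probabilities, the DeiT paper reports that this hard-label variant worked better in its experiments, which is why the teacher's full distribution does not appear in the loss. Figure 22.3.2 shows the two-token architecture, and Code Fragment 2 sketches the training step. The illustration below gives the mentor-and-student picture behind it.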

[Illustration: a wise old owl-professor representing a convolutional network whispers a hint carrying a small grid-pattern idea into the ear of an eager young student robot representing a vision transformer, which has two badges for its class token and distillation token.]
DeiT lets the transformer skip the three-hundred-million-image apprenticeship by seating a wise convolutional mentor beside it to whisper the locality biases it never learned on its own.
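[Figure 22.3.2 diagram: the class (cls) and distillation (dist) tokens enter the Transformer Encoder together with the patch tokens; the cls head is trained with cross-entropy against the true label y, and the dist head with cross-entropy against y_teacher from a CNN teacher (e.g. RegNet).]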
Figure 22.3.2: DeiT's token-based distillation. A second learnable token, the distillation token, passes through the same encoder as the class token. Its head is trained to match a CNN teacher's predictions while the class head is trained on the true labels, importing the teacher's inductive bias through a token.
# DeiT's two-token training: the class token is supervised on ground truth and
# the distillation token on a CNN teacher's hard label, so the student imports
# the teacher's locality bias. At inference the two predictions are averaged.
import torch.nn.functional as F

def deit_distillation_loss(cls_logits, dist_logits, targets, teacher_logits):
    """Hard-label distillation: average of true-label CE and teacher-label CE."""
    teacher_labels = teacher_logits.argmax(dim=1)          # CNN teacher's prediction
    loss_cls = F.cross_entropy(cls_logits, targets)        # supervise on ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # supervise on teacher
    return 0.5 * loss_cls + 0.5 * loss_dist

# at inference, fuse the two predictions:
def deit_predict(cls_logits, dist_logits):
    return (F.softmax(cls_logits, -1) + F.softmax(dist_logits, -1)) / 2
Code Fragment 2: The DeiT distillation objective and inference fusion. The distillation token learns from the CNN teacher's hard labels while the class token learns from ground truth; their softmaxes are averaged at test time.
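Code Fragment 2 takes `cls_logits` and `dist_logits` as given. The sketch below shows one way a model could produce them, assuming PyTorch is installed. `TwoTokenViT` is a hypothetical, deliberately tiny class written for this illustration: it uses `nn.TransformerEncoder` as a stand-in for the ViT blocks of Section 22.2, and its sizes are arbitrary. It is not the DeiT reference implementation, and timm's version differs in detail.

```python
import torch
import torch.nn as nn


class TwoTokenViT(nn.Module):
    """Hypothetical minimal ViT with a class token and a distillation token."""

    def __init__(self, num_classes=10, img_size=32, patch=8, dim=64, depth=2, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Positions for the two special tokens plus every patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth,
                                             enable_nested_tensor=False)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)        # supervised by true labels
        self.head_dist = nn.Linear(dim, num_classes)   # supervised by the teacher
        for p in (self.cls_token, self.dist_token, self.pos_embed):
            nn.init.trunc_normal_(p, std=0.02)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)      # (B, N, dim)
        b = x.shape[0]
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.dist_token.expand(b, -1, -1), x], dim=1)
        tokens = self.norm(self.encoder(tokens + self.pos_embed))
        # Position 0 is the class token, position 1 the distillation token.
        return self.head(tokens[:, 0]), self.head_dist(tokens[:, 1])


model = TwoTokenViT()
cls_logits, dist_logits = model(torch.randn(2, 3, 32, 32))
print(cls_logits.shape, dist_logits.shape)
```

The two tokens share every encoder layer and differ only in their learned embedding and their output head, so the extra cost over a single-token ViT is one more sequence position and one more linear layer.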
Try This: Dial the Augmentation Up and Down

You can feel the Key Insight in a single afternoon with the small ViT from the Section 22.2 code on CIFAR-100 (resized to $224$). Hold everything fixed except one knob: the RandAugment magnitude in Code Fragment 1, sweeping it across $0$ (off), $5$, $9$ (the DeiT default), and $14$. Train each setting for the same few epochs and watch two numbers, the final training accuracy and the validation accuracy. Expect the gap between them to be widest at magnitude=0 (the model memorizes the training set) and to shrink as the magnitude rises, until very large magnitudes start to hurt both numbers because the images become too distorted to learn from. Seeing the train-validation gap close as you turn one dial is the augmentation-as-inductive-bias argument made concrete, and it is the same overfitting signature the medical-imaging example below describes.

Fun Fact

DeiT's authors found that the class token and the distillation token, trained on the same images but different targets, end up genuinely different: the cosine similarity between their final-layer representations is high but still below one (the paper reports about 0.93, up from about 0.06 for the input embeddings), and the two heads disagree on some images. The model is not just learning one thing twice; it learns two complementary views, one anchored to the ground truth and one anchored to the CNN's worldview, and averaging them beats either alone. The transformer got a second opinion and used it.

Library Shortcut: Train-Ready DeiT From timm

You do not implement the distillation token, the teacher loop, and the full augmentation stack by hand for a real project. timm ships DeiT with pretrained weights and the entire recipe wired in:

# Skip the hand-built recipe: load a distilled DeiT-Base with both tokens
# already trained and fused, then fine-tune with timm's script that bundles
# the full augmentation and regularization stack shown above.
import timm
# distilled DeiT-Base: both tokens already trained, fused at inference
model = timm.create_model("deit_base_distilled_patch16_224", pretrained=True).eval()
# fine-tune on your data with timm's training script, which already includes
# RandAugment, MixUp, CutMix, random erasing, stochastic depth, and EMA.
Code Fragment 3: A train-ready distilled DeiT-Base from timm. The single create_model call returns the two-token architecture with both heads pretrained and fused at inference; the training script supplies the full augmentation recipe spelled out by hand above.

The library handles the two-token architecture, the prediction fusion, and the dozen-knob augmentation recipe that this section spells out by hand. Reproducing DeiT's recipe yourself is roughly a few hundred lines of careful configuration; timm.create_model plus its training script is the production path, and the from-scratch version above exists so you understand what those knobs do.

Practical Example: A ViT That Would Not Train Until the Recipe Changed

Who: a computer-vision team at a medical-imaging startup, 2023, classifying skin-lesion photos.
Situation: they had read that ViTs were state-of-the-art and swapped their ResNet-50 backbone for a ViT-Base, reusing their existing training script (light augmentation, a flip and a small crop, tuned years earlier for the CNN).
Problem: the ViT trained to near-perfect training accuracy and a validation accuracy several points worse than the ResNet it replaced, a textbook overfit. The team nearly concluded "ViTs do not work for us".
Decision: instead of abandoning the architecture, an engineer who had read the DeiT paper switched to the full DeiT recipe (RandAugment, MixUp, CutMix, stochastic depth, a longer cosine schedule) and initialized from ImageNet-pretrained DeiT weights rather than from scratch.
Result: the same ViT now matched and then beat the ResNet on validation, with the train-validation gap shrinking to normal.
Lesson: a ViT trained on a CNN's recipe is set up to fail, because the recipe is where the ViT's missing inductive bias gets paid for. The architecture was never the problem; the augmentation and the pretrained initialization were.

Research Frontier: From Distillation to Self-Supervised Pretraining

DeiT solved the data problem with augmentation and a CNN teacher, but the 2022 to 2024 frontier solved it a different way: pretrain the ViT on unlabeled images with a self-supervised objective, then fine-tune. Masked Autoencoders (He et al., 2022, arXiv:2111.06377) mask out most patches and train the ViT to reconstruct them, learning rich features with no labels at all; DINOv2 (Oquab et al., 2024, arXiv:2304.07193) combines self-distillation and masking to produce a frozen ViT backbone whose features transfer to classification, detection, segmentation, and depth without fine-tuning the backbone. These methods sidestep the data hunger of subsection 1 by feeding the model enormous amounts of unlabeled data instead of labeled images; they are the natural sequel to DeiT and the subject of Chapter 25. DeiT remains the cleanest demonstration that the supervised ImageNet gap was a recipe gap, not an architecture gap.
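The "mask out most patches" step can be sketched in a few lines. The helper below, `random_patch_mask`, is hypothetical and covers only the choice of which patch indices stay visible; the mask ratio of 0.75 and the 196-patch grid (a 224-pixel image with 16-pixel patches) are assumed values, and the encoder, decoder, and reconstruction loss are out of scope here.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) sets uniformly at random."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1.0 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    keep = sorted(order[:num_keep])       # patches the encoder would see
    masked = sorted(order[num_keep:])     # patches to be reconstructed
    return keep, masked


keep, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(keep), len(masked))   # 49 visible, 147 masked
```

The supervision signal is the image itself, so no label is needed and the same image yields a different task every time a new mask is drawn.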

Exercise 22.3.1: Why a CNN Teacher? (Conceptual)

DeiT found that distilling from a strong CNN teacher worked better than distilling from a strong ViT teacher of equal accuracy. Using the inductive-bias argument of subsections 1 and 3, explain in a paragraph why a CNN teacher might transfer something a ViT teacher cannot, even when both achieve the same top-1 accuracy. Then propose a falsifiable prediction: if your explanation is correct, how should the benefit of the CNN teacher change as the student ViT is given more and more training data?

Exercise 22.3.2: Ablate the Augmentation (Coding)

Take a small ViT (reduce depth and embed_dim in the Section 22.2 code so it trains quickly) and train it on CIFAR-100 (resized to $224$) under two regimes: (a) flips and crops only, the CNN-style light augmentation; (b) the full DeiT stack from subsection 2, including MixUp and CutMix. Plot training and validation accuracy for both. You should observe regime (a) reaching higher training accuracy but lower validation accuracy, the overfitting signature of the medical-imaging example. Report the train-validation gap for each and relate it to the Key Insight.

Exercise 22.3.3: Quantify the Distillation Benefit (Analysis)

Using the deit_distillation_loss and deit_predict functions, train a student ViT three ways on a subset of ImageNet or a similar dataset: class token only, distillation token only (teacher labels only), and both with fused inference. A pretrained CNN from torchvision can serve as the teacher. Report the three validation accuracies and the agreement rate between the two heads. Discuss whether the fused model's gain over the better single head is consistent with the "second opinion" framing in the Fun Fact, and analyze one image where the two heads disagree.
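The agreement rate the exercise asks for is just the fraction of images on which the two heads pick the same class. The sketch below computes it on plain lists so the definition is unambiguous; `head_agreement` is a hypothetical helper and the logits are made up. With tensors, the same quantity is `(cls_logits.argmax(1) == dist_logits.argmax(1)).float().mean()`.

```python
def argmax(row):
    """Index of the largest entry in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def head_agreement(cls_logits, dist_logits):
    """Return (agreement rate, indices where the two heads disagree)."""
    disagree = [i for i, (c, d) in enumerate(zip(cls_logits, dist_logits))
                if argmax(c) != argmax(d)]
    rate = 1.0 - len(disagree) / len(cls_logits)
    return rate, disagree


# Made-up logits for four images and three classes.
cls_logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [0.9, 1.0, 0.1], [-0.5, 0.0, 2.2]]
dist_logits = [[1.7, 0.4, -0.8], [0.1, 1.1, 0.9], [1.3, 0.6, 0.0], [-0.2, 0.1, 1.9]]

rate, disagree = head_agreement(cls_logits, dist_logits)
print(rate, disagree)   # 0.75 [2]
```

The returned indices point at the images worth inspecting for the last part of the exercise.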