"Everyone wanted to know who won, the convolution or the attention. The answer turned out to be the team that stopped picking sides, stapled a convolutional stem onto a transformer, and went home early."
A Hybrid Backbone Who Refuses to Take Sides
The CNN-versus-ViT question has a calm, evidence-based answer: inductive bias wins at small and medium data scale and attention wins at very large scale, but with a strong modern recipe the gap at any fixed scale is small, and the architectures that actually win benchmarks in 2024 to 2026 are hybrids that keep the convolution's priors and add attention's global mixing. This closing section turns the chapter's running trade-off into a decision you can make: how much data do you have, how much resolution do you need, what does each family see differently, and why "convolution or attention" was the wrong question all along.
You arrive at the chapter's destination with all the pieces. Section 22.1 showed that attention discards locality, weight sharing, and a fixed receptive field. Section 22.3 showed that those discarded biases must be repaid in data or in augmentation. Section 22.4 showed that the most successful "transformer" backbone quietly reintroduced the very biases the plain ViT abandoned. The honest synthesis is not a winner; it is a map of when each design pays off, and a recognition that the field has already converged on combining them. This section draws that map.
1. What Inductive Bias Actually Buys (Beginner)
An inductive bias is a built-in assumption that narrows what the model can express, and a good one trades a little flexibility for a lot of data efficiency. The convolution's locality assumption says "the label depends on local patterns"; its translation equivariance says "a pattern means the same thing wherever it appears." Both are true for natural images most of the time, so a CNN does not have to spend examples discovering them. A plain ViT has neither assumption, so it can in principle represent functions a CNN cannot, but it must learn even these obvious facts from data, and that costs examples. The table makes the contrast concrete.
| Property | CNN | Plain ViT | Consequence |
|---|---|---|---|
| Locality | Built in (small kernel) | None (global from layer 1) | ViT must learn locality from data |
| Translation equivariance | Built in (weight sharing) | Weak (broken by positional embeddings) | ViT learns invariance via augmentation |
| Receptive field | Grows with depth | Global immediately | ViT relates distant regions in one layer |
| Mixing cost in positions $N$ (kernel size $k$, width $d$) | $O(N k^2 d)$, linear in $N$ | $O(N^2 d)$, quadratic in $N$ | ViT expensive at high resolution |
| Data efficiency | High (biases substitute for data) | Low without large pretraining | CNN wins on small datasets |
| Scaling ceiling | Lower (biases eventually limit) | Higher (flexibility keeps paying) | ViT wins at very large scale |
The right-hand column is the practical guidance. If you have a few thousand to a few hundred thousand labeled images and no pretraining, a CNN (or a Swin, which has CNN-like biases) is the safer default. If you can pretrain on tens of millions of images or borrow a model that already did, the ViT's higher ceiling becomes reachable. Read these rows against your own data scale and resolution before you commit to an architecture.
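The cost row of the table can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: it counts the dominant mixing term for each family on the same grid of $N$ positions. The patch size of 16, kernel size of 7, and width of 384 are assumed values chosen for the example, not figures from this chapter, and the function names are hypothetical helpers.

```python
# Illustrative arithmetic for the cost row of the table; not a FLOP counter.
# Assumed example values: patch size 16, kernel size 7, channel width 384.

def n_positions(image_size: int, patch: int = 16) -> int:
    """Number of grid positions N for a square image cut into patches."""
    return (image_size // patch) ** 2


def conv_mixing_cost(n: int, kernel: int, width: int) -> int:
    """The table's O(N k^2 d) term: each position mixes a k x k neighborhood."""
    return n * kernel ** 2 * width


def attention_mixing_cost(n: int, width: int) -> int:
    """The table's O(N^2 d) term: each position attends to all N positions."""
    return n ** 2 * width


def attention_to_conv_ratio(image_size: int, kernel: int = 7, width: int = 384) -> float:
    """How many times larger the attention term is than the conv term."""
    n = n_positions(image_size)
    return attention_mixing_cost(n, width) / conv_mixing_cost(n, kernel, width)


for size in (224, 448, 896):
    print(f"{size}px  N={n_positions(size):5d}  "
          f"attention/conv = {attention_to_conv_ratio(size):.0f}x")
```

Under these assumptions the ratio reduces to $N / k^2$, so each doubling of resolution quadruples attention's relative cost: 4x at 224 pixels, 16x at 448, and 64x at 896.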
2. The Scale Crossover, Quantified (Intermediate)
The qualitative crossover of Section 22.3 can be made precise with published numbers. On ImageNet-1k alone (about $1.3$ million images), the original ViT-Base scored several points below a comparable ResNet, and DeiT (Section 22.3) closed that gap with recipe and distillation rather than data. Pretraining on ImageNet-21k (about $14$ million images) flips the ranking: the ViT pulls ahead. Pretraining on JFT-300M ($300$ million images) widens the ViT's lead and lets it keep improving as the model grows, all the way to the $22$-billion-parameter ViT-22B (Dehghani et al., 2023), where the transformer's clean scaling is its decisive advantage. By "clean scaling" we mean that making the model bigger and feeding it more data keeps improving accuracy predictably, without the architecture hitting a wall, which is exactly the behavior you want when you have the budget to grow. The crossover sits somewhere in the tens of millions of images for these particular architectures.
But the most important number in this literature is a negative result that reframes the whole debate. ConvNeXt (Liu et al., 2022) took a plain ResNet and changed nothing about its convolutional nature, only its design details and training recipe (larger kernels, fewer activations, layer norm, and the AdamW-and-augmentation bundle of Chapter 21). That modernized CNN matched or exceeded Swin Transformer on ImageNet classification, detection, and segmentation across the model sizes it reported. The conclusion the field drew is sharp: much of the apparent "transformer advantage" through 2021 was the modern training recipe, not the attention mechanism. The architecture comparison is only fair when both sides use the same recipe, and when they do, the gap at any fixed data scale is small. The code below illustrates how cheaply you can run that fair comparison yourself with timm.
# Stand four families side by side at a matched parameter budget: a CNN, a
# modernized CNN, a plain ViT, and a hierarchical Swin. Comparing them fairly
# means one recipe for all; here we just confirm the shapes and sizes line up.
import timm
import torch

# A fair head-to-head: same input, same preprocessing source, matched scale.
names = ["resnet50", "convnext_tiny", "vit_small_patch16_224", "swin_tiny_patch4_window7_224"]
x = torch.randn(1, 3, 224, 224)
for name in names:
    # Randomly initialized weights; eval() puts the model in inference mode.
    m = timm.create_model(name, pretrained=False).eval()
    n_params = sum(p.numel() for p in m.parameters()) / 1e6
    with torch.no_grad():
        y = m(x)
    print(f"{name:32s} params={n_params:5.1f}M logits={tuple(y.shape)}")
# resnet50                         params= 25.6M logits=(1, 1000)
# convnext_tiny                    params= 28.6M logits=(1, 1000)
# vit_small_patch16_224            params= 22.1M logits=(1, 1000)
# swin_tiny_patch4_window7_224     params= 28.3M logits=(1, 1000)
The single most common error in reading the CNN-versus-ViT literature is comparing a transformer trained with a 2021 recipe to a CNN trained with a 2015 recipe and crediting the gap to attention. ConvNeXt is the controlled experiment that isolates the variable: hold the recipe fixed, and the architectures are close. When you read "model X beats model Y", your first question should be "were they trained the same way?" At the data scales most teams operate in, architecture differences are real but second-order next to recipe differences, which is the central lesson of Chapter 21 applied to this chapter. The illustration below makes this unfair comparison plain.
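One way to make the "were they trained the same way?" check mechanical is to define the recipe once and derive every run from it, so the architecture name is the only field that can differ. The sketch below assumes a hypothetical `Recipe` dataclass and hypothetical helper names. Its hyperparameter values are placeholders for illustration, not the settings of ConvNeXt, DeiT, or any published run.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Recipe:
    """Hypothetical shared training recipe; all values are placeholders."""
    optimizer: str = "adamw"
    lr: float = 1e-3
    weight_decay: float = 0.05
    epochs: int = 100
    augmentation: str = "randaugment+mixup"


def build_runs(model_names, recipe):
    """One run per architecture; only the model name varies."""
    return [{"model": name, **asdict(recipe)} for name in model_names]


def is_fair_comparison(runs):
    """True when every run is identical once the model name is ignored."""
    stripped = [{k: v for k, v in run.items() if k != "model"} for run in runs]
    return all(s == stripped[0] for s in stripped)


names = ["resnet50", "convnext_tiny", "vit_small_patch16_224"]
runs = build_runs(names, Recipe())
print(is_fair_comparison(runs))    # every run shares one recipe

# The classic unfair comparison: one model quietly gets an older recipe.
unfair = [dict(run) for run in runs]
unfair[0].update(optimizer="sgd", augmentation="random-crop", epochs=90)
print(is_fair_comparison(unfair))  # the mismatch is flagged
```

Freezing the dataclass is a deliberate choice in this sketch: a recipe that cannot be mutated after construction cannot drift between runs without someone creating a visibly different `Recipe`.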
3. What They See Differently (Advanced)
Accuracy parity does not mean the two families are interchangeable; they reach similar scores by computing different things, and the differences matter for robustness. Studies probing trained CNNs and ViTs (notably Raghu et al., "Do Vision Transformers See Like Convolutional Neural Networks?", 2021) found three consistent contrasts. First, ViTs propagate more uniform representations across depth, with strong early-layer global information, whereas CNNs build up from strictly local features. Second, ViTs preserve spatial information later into the network and rely more on the residual stream to carry it. Third, and most practically, ViTs tend to be more robust to certain distribution shifts and texture perturbations, while CNNs are famously biased toward texture over shape.
That texture-versus-shape contrast has a concrete consequence. CNNs often classify by local texture (the canonical demonstration is a cat with elephant-skin texture being called an elephant), while ViTs, attending globally, lean more on shape and global layout. Neither is strictly better, but they fail differently, which is exactly why ensembling a CNN and a ViT, or building a hybrid that has both, tends to be more robust than either alone. This is the empirical bridge to subsection 4: the families are complementary, so combine them.
Fun Fact: When researchers feed both families an image with conflicting cues, a cat silhouette filled with elephant skin texture, the CNN tends to shout "elephant" (it trusts the texture) while the ViT is more likely to say "cat" (it trusts the global shape). The disagreement is consistent enough on average to serve as a rough diagnostic: if your two models disagree most on texture-vs-shape images, you are probably looking at a CNN and a ViT, not two CNNs. The architectures have visibly different aesthetic preferences about what an image "really" is.
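This disagreement is commonly summarized as a shape-bias rate: among the cue-conflict images where a model picks either the shape class or the texture class, the fraction where it picks shape. The sketch below is a minimal version of that tally. The function name is a hypothetical helper, and the example labels and predictions are made-up placeholders that only exercise the function; they are not measured results from any model.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among shape-or-texture decisions.

    Predictions matching neither cue are ignored. Returns None when no
    prediction matches either cue.
    """
    n_shape = n_texture = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            n_shape += 1
        elif pred == texture:
            n_texture += 1
    decided = n_shape + n_texture
    return n_shape / decided if decided else None


# Placeholder cue-conflict set: one shape class and one texture class per image.
shape_labels = ["cat", "car", "bear", "clock"]
texture_labels = ["elephant", "clock", "bottle", "bird"]

# Made-up predictions, purely to exercise the function.
model_a = ["elephant", "clock", "bear", "dog"]  # 1 shape, 2 texture, 1 neither
model_b = ["cat", "car", "bottle", "clock"]     # 3 shape, 1 texture

print(shape_bias(model_a, shape_labels, texture_labels))
print(shape_bias(model_b, shape_labels, texture_labels))
```

Ignoring predictions that match neither cue keeps the rate a statement about which cue the model prefers, separate from how often it is simply wrong.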
4. Why the Winners Are Hybrids (Advanced)
The resolution of the whole debate is that the strongest practical architectures stopped choosing. They keep convolutional priors where those priors are cheap and correct, and add attention where global mixing helps. Three patterns dominate. The first is a convolutional stem: replace the ViT's single big patch-embedding convolution with a few small $3 \times 3$ convolutions, which injects local bias early and was shown to make ViTs train more stably (Xiao et al., "Early Convolutions Help Transformers See Better", 2021). The second is hierarchical, windowed attention, the Swin family of Section 22.4, which is already a hybrid in spirit. The third is interleaving conv and attention blocks, as in CoAtNet (Dai et al., 2021), MobileViT, and the EfficientFormer line for mobile deployment of Chapter 28.
The unifying recipe is to use convolutions in the early, high-resolution stages, where locality is correct and global attention would be ruinously expensive, and to use attention in the later, low-resolution stages, where global mixing is affordable and most valuable. Figure 22.5.1 sketches this canonical hybrid layout, which underlies a large share of the strongest 2023 to 2026 backbones.
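The arithmetic behind that layout is easy to check. The sketch below assumes a four-stage backbone with strides 4, 8, 16, and 32 on a 224-pixel input, split into two convolutional stages followed by two attention stages. The strides, the split, and the helper name are illustrative assumptions, not a description of any specific model in this chapter. It counts the positions per stage and the number of pairwise interactions global attention would need there.

```python
def stage_table(image_size=224, strides=(4, 8, 16, 32),
                mixers=("conv", "conv", "attn", "attn")):
    """Per-stage position count N and the N^2 pairs global attention would need."""
    rows = []
    for stride, mixer in zip(strides, mixers):
        side = image_size // stride  # feature-map side length at this stage
        n = side * side              # number of spatial positions N
        rows.append({"stride": stride, "mixer": mixer,
                     "positions": n, "attn_pairs": n * n})
    return rows


rows = stage_table()
for row in rows:
    print(f"stride {row['stride']:2d}  {row['mixer']:4s}  "
          f"N={row['positions']:5d}  N^2={row['attn_pairs']:>10,d}")
```

Under these assumptions, global attention at the stride-4 stage would touch 9,834,496 position pairs against 2,401 at the stride-32 stage, a factor of 4,096, which is why the hybrid layout defers attention to the late stages.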
You do not assemble a hybrid by hand; timm exposes the whole hybrid zoo through the same create_model call, so trying a convolutional-stem ViT or a conv-attention interleave against a pure CNN baseline is a one-line change:
# Load three points on the conv-to-attention spectrum through one interface, so
# the whole spectrum is a one-string sweep on your own validation data rather
# than a hand-built architecture per candidate.
import timm
hybrid = timm.create_model("coatnet_1_rw_224", pretrained=True).eval()  # conv + attention
mobile = timm.create_model("mobilevit_s", pretrained=True).eval()       # mobile hybrid
convnext = timm.create_model("convnext_base", pretrained=True).eval()   # modernized pure CNN
# swap the string to benchmark a different point on the conv-attention spectrum
With the shared create_model call, swapping the string is the whole experiment. The library handles the per-family stem, the stage configuration, the conv-attention interleaving, and the pretrained weights and preprocessing. Because every model shares the interface, you can sweep the entire spectrum from pure CNN to hybrid to pure ViT on your own validation set and let your data, not the hype cycle, pick the architecture, which is exactly the empirical discipline this chapter has argued for throughout.
Who: a retail-analytics company classifying product photos into a few thousand fine-grained SKUs, 2025. Situation: the literature pushed them toward ViTs, but they had only about $80{,}000$ labeled images, well below the ViT crossover of subsection 2. Problem: a from-scratch ViT-Base overfit badly (the Section 22.3 failure mode), while the team worried a plain ResNet was "outdated". Decision: rather than argue, they ran the one-line timm sweep from the library shortcut, fine-tuning a pretrained ResNet-50, a ConvNeXt-Tiny, a Swin-Tiny, a CoAtNet, and a plain ViT from ImageNet weights, all with the same modern recipe, on their data. Result: ConvNeXt-Tiny and CoAtNet tied for the top spot within noise, Swin was close behind, and the plain ViT trailed (too little data, even pretrained). They shipped ConvNeXt-Tiny for its lower latency. Lesson: at $80{,}000$ images the inductive-bias families (modern CNN and hybrid) beat the pure ViT, exactly as Table 22.5.1 predicts, and the right move was a cheap empirical sweep rather than a theoretical allegiance to either camp.
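The team's final step, choosing by latency among models that tie within noise, can be written down as a small selection rule. The sketch below is one hypothetical way to encode it. The function name, the noise margin, and every accuracy and latency number are made-up placeholders for illustration, not the company's measurements.

```python
def pick_model(results, noise_margin):
    """Among models within noise_margin of the best accuracy, pick the fastest.

    results maps model name -> (top1_accuracy_percent, latency_ms).
    """
    best_acc = max(acc for acc, _ in results.values())
    contenders = {name: latency for name, (acc, latency) in results.items()
                  if best_acc - acc <= noise_margin}
    return min(contenders, key=contenders.get)


# Placeholder numbers only, shaped like the outcome in the example above.
results = {
    "convnext_tiny": (90.0, 5.0),
    "coatnet": (90.2, 8.0),
    "swin_tiny": (89.0, 7.0),
    "plain_vit": (87.0, 6.0),
}

print(pick_model(results, noise_margin=0.5))
```

The noise margin is the judgment call in this rule; estimating it from repeated runs with different seeds is one reasonable approach, since a margin set too small turns run-to-run noise into an architecture decision.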
Between 2024 and 2026 the conv-versus-attention framing is being overtaken on two fronts. Self-supervised foundation backbones such as DINOv2 (Oquab et al., 2024) make the architecture choice secondary to the pretraining objective: a ViT trained on enough unlabeled data yields frozen features that transfer to classification, detection, segmentation, and depth without fine-tuning, the subject of Chapter 25. Meanwhile, sub-quadratic sequence mixers (the state-space family of Mamba and Vision Mamba (Vim, 2024), along with gated linear-attention variants) aim to keep attention's global reach at the convolution's linear cost, attacking the $O(N^2)$ term directly rather than working around it with windows. The likely 2026 answer to "CNN or ViT?" is "a self-supervised hybrid, and the mixing operator is whichever is cheapest at the resolution you need", which is precisely the non-partisan, evidence-first stance this chapter has built toward.
That settles the question the whole chapter opened with. The convolution's biases buy data efficiency, attention buys scale-driven freedom, and the strongest 2024 to 2026 systems simply refuse to choose, pricing the trade against their own data and resolution exactly as Table 22.5.1 lays out. With that map in hand, the patch-and-attend recipe is ready to leave classification behind. Chapter 23: Object Detection is where it goes next: the hierarchical backbones of Section 22.4 become the feature extractors that detectors sit on, and the very self-attention you built in Section 22.1 reappears in the DETR decoder, which reframes detection as set prediction and drops the hand-built anchors and non-maximum suppression of classical pipelines. Attention does not stop at telling you what an image contains; it is about to start telling you where.
Using Table 22.5.1 and the scale numbers of subsection 2, write a short decision rule (three or four sentences) that a colleague could follow to pick between a modern CNN, a Swin, and a plain ViT given only their dataset size and target resolution. Be explicit about where the boundaries fall and why, citing the inductive-bias-versus-data trade-off. Then state one situation where your rule would deliberately pick the plain ViT despite a modest dataset, and justify it (hint: pretraining or self-supervised foundation weights).
Using the timm sweep from subsections 2 and 4, fine-tune a pretrained ResNet-50, ConvNeXt-Tiny, Swin-Tiny, and ViT-Small on a medium dataset such as a subset of iNaturalist or Food-101, with an identical recipe (same augmentation, optimizer, schedule, epochs). Report top-1 accuracy, parameter count, and inference latency for each. Discuss whether your ranking matches the ConvNeXt finding (architectures close when the recipe is held fixed) and the retail-analytics example, and identify which model you would ship and why.
Construct or download a small set of texture-shape conflict images (an object of one class rendered with the texture of another, the cue-conflict stimuli from the shape-bias literature). Run a pretrained CNN (ResNet-50) and a pretrained ViT or DeiT on them and record, per image, whether each model predicts the shape class or the texture class. Tabulate the shape-bias rate for each model and discuss whether your results match the contrast described in subsection 3 and the Fun Fact. Conclude with one sentence on why this disagreement makes a CNN-plus-ViT ensemble more robust than either alone.