Section 20.1: LeNet & AlexNet: The Breakthrough Years

"I waited fourteen years for the world to build me a GPU and label a million pictures. People call 2012 an overnight success. I call it the slowest sunrise in history."
A Patient Convolutional Network, Finally Trained

Big Picture

The convolutional template was finished in 1998; what changed in 2012 was not the idea but the removal of the bottlenecks that had quietly capped it for more than a decade, above all too little compute and labeled data, and a nonlinearity that killed gradients in deep stacks. LeNet-5 proved the recipe of convolution, subsample, classify, trained by the backpropagation of Chapter 18. AlexNet kept that recipe almost unchanged and removed the ceilings around it with two GPUs, the ReLU activation, dropout, and aggressive data augmentation. Its 2012 ImageNet win cut the error rate so sharply that the entire field pivoted to deep learning within a year. Read this section as the first link in the chapter's chain: every later architecture is a fix for a bottleneck, and AlexNet is the prototype of that move.

You arrive here having built a convolutional network from scratch in Chapter 19 and trained it on CIFAR-10. You know the conv-BN-ReLU block, pooling, and the receptive field. This section steps back to ask where those parts came from, and the answer is a single network published in 1998 that already contained almost all of them.

The interesting historical fact is that the design was right long before it could win. Understanding why it could not win in 1998, and exactly what 2012 changed, teaches the most useful habit in this chapter: treat an architecture as the solution to a named bottleneck, not as a list of layers to memorize. The illustration below pictures that relay of fixes across a decade, each design dropping the one weight that held the last one back.

[Illustration: a relay race of progressively sleeker cartoon robots passing a glowing nine-square convolution filter baton up a rising path, each robot dropping a heavy weight as it hands off.] Every architecture in this chapter is the answer to one question: what capped the last one?

1. LeNet-5: The Template That Waited (Beginner)

LeNet-5, from Yann LeCun and colleagues at Bell Labs, was built to read handwritten digits on bank checks and postal envelopes. It is tiny by modern standards, around 60,000 parameters, yet it established the architectural grammar that every network in this chapter speaks. The flow is: a convolution layer extracts local features, a subsampling (pooling) layer shrinks the spatial map and adds a little translation tolerance, and this pair repeats, raising the channel count (6, then 16) as the spatial size shrinks, until a few fully connected layers map the final feature vector to class scores. That pattern, spatial size down, channel count up, then classify, is the skeleton of LeNet, AlexNet, VGG, and ResNet alike.

Figure 20.1.1 lays out the LeNet-5 pipeline so you can see the alternation of convolution and subsampling that the rest of the chapter inherits.

Figure 20.1.1: The LeNet-5 pipeline. Two convolution-then-subsample stages reduce a $32 \times 32$ input to a compact $5 \times 5 \times 16$ feature map, which three fully connected layers turn into ten digit scores. The spatial-down, channels-up rhythm is the skeleton every architecture in this chapter reuses.
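The spatial sizes in the caption follow from the standard output-size rule for an unpadded window, $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below walks a $32 \times 32$ input through the two stages using plain Python; the helper name out_size and the stages list are illustrative choices, not part of any library.

```python
# Output size along one axis for a convolution or pooling window.
def out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, output channels) for the two LeNet-5 feature stages.
stages = [
    ("conv 5x5", 5, 1, 6),
    ("pool 2x2", 2, 2, 6),
    ("conv 5x5", 5, 1, 16),
    ("pool 2x2", 2, 2, 16),
]

size = 32
trace = []
for name, kernel, stride, channels in stages:
    size = out_size(size, kernel, stride)
    trace.append((name, size, channels))
    print(f"{name}: {size}x{size}x{channels}")

# The final map is flattened before the fully connected layers.
flat_features = trace[-1][1] ** 2 * trace[-1][2]
print("flattened features:", flat_features)
```

Each unpadded $5 \times 5$ convolution trims four pixels per axis and each pool halves what remains, which is how 32 becomes 28, 14, 10, and finally 5.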

We can rebuild LeNet-5 almost verbatim with the layers from Chapter 19. The original used a tanh nonlinearity and an average-pooling variant; the version below substitutes ReLU and max-pooling, which is how the network is taught today, and which trains noticeably faster.

# Rebuild the 1998 LeNet-5 topology with modern layers from Chapter 19.
# Two conv-then-pool stages feed three fully connected layers; the only
# departures from the original are ReLU (for tanh) and max-pool (for avg-pool).
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Modern reading of LeNet-5: ReLU + max-pool instead of tanh + avg-pool."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28, 6 feature maps
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                     # 16 * 5 * 5 = 400 features
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(inplace=True),
            nn.Linear(120, 84), nn.ReLU(inplace=True),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = LeNet5()
n_params = sum(p.numel() for p in net.parameters())
print(f"LeNet-5 parameters: {n_params:,}")
y = net(torch.randn(1, 1, 32, 32))
print("output shape:", y.shape)

LeNet-5 parameters: 61,706
output shape: torch.Size([1, 10])

Code Fragment 1: A faithful, runnable LeNet-5 in modern PyTorch, where self.features holds the two conv-and-pool stages and self.classifier the three linear layers. The 61,706-parameter count and the $5 \times 5 \times 16$ feature map flattened to 400 inputs before the classifier match Figure 20.1.1 exactly.
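The 61,706 total can also be reproduced by hand, which is a useful check that you understand where the weights sit. The sketch below counts weights plus biases for each layer of the modern reading in Code Fragment 1; conv_params and linear_params are helper names introduced here for illustration.

```python
# Weights plus biases for a square-kernel convolution layer.
def conv_params(c_in: int, c_out: int, k: int) -> int:
    return c_in * c_out * k * k + c_out

# Weights plus biases for a fully connected layer.
def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out

layers = {
    "conv1 (1 -> 6, 5x5)": conv_params(1, 6, 5),
    "conv2 (6 -> 16, 5x5)": conv_params(6, 16, 5),
    "fc1 (400 -> 120)": linear_params(16 * 5 * 5, 120),
    "fc2 (120 -> 84)": linear_params(120, 84),
    "fc3 (84 -> 10)": linear_params(84, 10),
}

total = sum(layers.values())
for name, count in layers.items():
    print(f"{name:22s} {count:>7,}  ({count / total:.1%})")
print(f"{'total':22s} {total:>7,}")
```

Even in this tiny network the first fully connected layer holds most of the parameters, the same imbalance that reappears at a much larger scale in AlexNet.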

Key Insight: The Design Was Not the Bottleneck

LeNet-5 already had learnable convolutions, pooling, a feature hierarchy, and end-to-end training by gradient descent in 1998. What it lacked was scale: the MNIST-sized data and the CPU compute of the era could not exercise a network big enough to solve natural images. The lesson that opens this chapter is that for years the ceiling was not the idea but the resources around it. AlexNet is what happened when those resources finally arrived.

2. The 2012 ImageNet Moment (Beginner)

The benchmark that mattered was ImageNet: 1.2 million training images across 1000 classes, scored by top-5 error (the fraction of images whose true label is not among the model's five highest-scoring guesses). Through 2011, the leaderboard was dominated by hand-engineered features (the SIFT and bag-of-visual-words pipelines of Chapter 10 and Chapter 16) feeding linear classifiers, with the best systems near 26% top-5 error. In 2012, AlexNet, a deep convolutional network trained on two GPUs, scored 15.3% top-5 error. A gap of more than ten percentage points over the best classical pipeline was not an incremental win; it was a regime change, and within twelve months almost every serious entry was a deep CNN.
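Top-5 error is simple to compute once you have per-class scores. The following is a minimal sketch on made-up scores for four images and eight classes (the numbers are invented for illustration and are not ImageNet results); the function name top_k_error is likewise our own.

```python
# Fraction of images whose true label is NOT among the k highest-scoring classes.
def top_k_error(scores, labels, k=5):
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Toy scores: one row per image, one column per class (illustrative values only).
scores = [
    [0.10, 0.90, 0.30, 0.20, 0.80, 0.05, 0.70, 0.60],  # true class 6: ranked 3rd
    [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20],  # true class 7: ranked last
    [0.20, 0.10, 0.50, 0.40, 0.30, 0.90, 0.80, 0.70],  # true class 4: ranked 6th
    [0.95, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70],  # true class 0: ranked 1st
]
labels = [6, 7, 4, 0]

print("top-5 error:", top_k_error(scores, labels, k=5))
print("top-1 error:", top_k_error(scores, labels, k=1))
```

The first image counts as correct under top-5 but wrong under top-1, which is why top-5 error is always the lower of the two numbers.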

Fun Fact

AlexNet was split across two GPUs because each card of the era held only 3 GB of memory, not enough for the whole network. The authors wired the two halves to communicate only at certain layers, producing the famous two-row architecture diagram. A pure engineering workaround for a memory limit accidentally became one of the most reproduced figures in deep learning, and an early hint that grouped convolutions (Section 20.4) could be useful in their own right.

3. The Four Bottlenecks AlexNet Removed (Intermediate)

AlexNet is best understood not as a new idea but as four targeted fixes layered onto the LeNet template, scaled up to roughly 60 million parameters. Each fix addresses a specific reason a deep network of the time would fail to train or generalize.

Bottleneck one, dead gradients: ReLU. LeNet used tanh, whose gradient saturates toward zero for large positive or negative inputs. Stack several saturating layers and the backpropagated gradient (Chapter 18) shrinks toward nothing, so early layers barely learn. The rectified linear unit $\text{ReLU}(x) = \max(0, x)$ has gradient exactly $1$ for all positive inputs, so it does not saturate on that side. The AlexNet paper reported that a four-layer convolutional network on CIFAR-10 reached 25% training error about six times faster with ReLU than with tanh. This is the change that makes deep training practical.
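The shrinking gradient is easy to see in a toy setting. The sketch below is a deliberately simplified model, assuming a chain of scalar activations with unit weights and no biases, so it illustrates the mechanism rather than any real network; chain_gradient and the starting input of 2.0 are illustrative choices.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def relu_prime(x: float) -> float:
    return 1.0 if x > 0 else 0.0

def tanh_prime(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2

# Chain rule through `depth` stacked activations: multiply the local slopes.
def chain_gradient(activation, derivative, x: float, depth: int) -> float:
    grad = 1.0
    for _ in range(depth):
        grad *= derivative(x)  # local slope at this layer's input
        x = activation(x)      # forward pass to the next layer
    return grad

x0, depth = 2.0, 8
tanh_grad = chain_gradient(math.tanh, tanh_prime, x0, depth)
relu_grad = chain_gradient(relu, relu_prime, x0, depth)

print(f"gradient through {depth} tanh layers: {tanh_grad:.5f}")
print(f"gradient through {depth} ReLU layers: {relu_grad:.5f}")
```

With a positive input the ReLU chain passes the gradient through unchanged, while the tanh chain loses most of it in the first saturated layer and a little more at every layer after that. ReLU is not free of risk: for negative inputs its slope is exactly zero, which is the price of not saturating on the positive side.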

Bottleneck two, compute: GPUs. A network this size was infeasible on CPUs of the era. The authors wrote custom GPU convolution kernels and trained for about a week on two cards. The takeaway generalizes: hardware that makes the forward and backward pass cheap is itself an architectural enabler, a theme that returns when we count FLOPs (floating-point operations, the multiply-adds per image) in Section 20.6.

Table 20.1.1: The four bottlenecks AlexNet removed, each a fix that later architectures inherit.

Bottleneck	Symptom in 1998-era nets	AlexNet's fix
Dead gradients	tanh saturates, early layers barely learn	ReLU (non-saturating)
Compute	infeasible to train on CPUs	two GPUs, custom kernels
Overfitting	60M params memorize 1.2M images	dropout, heavy augmentation
Brittle pooling	non-overlapping windows lose robustness	overlapping max-pool

Bottleneck three, overfitting: dropout and augmentation. Sixty million parameters on 1.2 million images is a recipe for memorization. AlexNet used dropout in the fully connected layers (randomly zeroing half the activations during training, forcing redundant representations) and heavy data augmentation (random crops, horizontal flips, and color jitter, the geometric transforms of Chapter 5 turned into a regularizer). You will study these systematically in Chapter 21.
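Dropout itself is only a few lines. The sketch below uses the "inverted" form that nn.Dropout implements, where surviving activations are scaled by $1/(1-p)$ during training so that nothing needs to change at test time; the original AlexNet paper instead kept all units at test time and multiplied their outputs by 0.5. The function name and the seeded generator are illustrative assumptions.

```python
import random

# Inverted dropout: zero each activation with probability p during training
# and scale the survivors by 1 / (1 - p) so the expected value is unchanged.
def dropout(activations, p=0.5, training=True, rng=random):
    if not training or p == 0.0:
        return list(activations)  # evaluation mode: pass through untouched
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)            # seeded so the run is repeatable
x = [1.0] * 10_000
y = dropout(x, p=0.5, rng=rng)

zero_fraction = y.count(0.0) / len(y)
mean_after = sum(y) / len(y)
print(f"fraction zeroed: {zero_fraction:.3f}")
print(f"mean activation after dropout: {mean_after:.3f}")
print("eval mode unchanged:", dropout(x[:3], training=False) == x[:3])
```

About half the activations vanish on every training step, yet the average stays near its original value, so the next layer sees inputs of the same scale in training and evaluation.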

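Bottleneck four, brittle pooling: overlapping max-pool. AlexNet used $3 \times 3$ pooling with stride $2$, so successive pooling windows overlap by one row or column. The paper reports that this reduced top-1 and top-5 error by about 0.4 and 0.3 percentage points relative to non-overlapping $2 \times 2$ pooling, and that the overlapping models were slightly harder to overfit. The implementation below assembles the fixes that live in the architecture itself: ReLU, overlapping pooling, and dropout.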
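Before the full network, the overlap can be seen in one dimension. The sketch below lists which input indices each pooling window covers for the AlexNet setting (kernel 3, stride 2) and for the non-overlapping setting (kernel 2, stride 2); the helper names and the seven-element row are made up for illustration.

```python
# Input indices covered by each pooling window along one axis.
def pool_windows(length: int, kernel: int, stride: int):
    n_out = (length - kernel) // stride + 1
    return [list(range(i * stride, i * stride + kernel)) for i in range(n_out)]

# 1D max-pool built on the windows above.
def max_pool_1d(values, kernel: int, stride: int):
    return [max(values[i] for i in window)
            for window in pool_windows(len(values), kernel, stride)]

row = [3, 1, 4, 1, 5, 9, 2]

overlapping = pool_windows(len(row), kernel=3, stride=2)
disjoint = pool_windows(len(row), kernel=2, stride=2)

print("kernel 3, stride 2 windows:", overlapping)
print("kernel 2, stride 2 windows:", disjoint)
print("overlapping max-pool:", max_pool_1d(row, 3, 2))
print("disjoint max-pool:   ", max_pool_1d(row, 2, 2))
```

With stride 2 and kernel 3, neighboring windows share one index, so a feature sitting on a window boundary is seen by both outputs instead of falling on only one side.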

# AlexNet layers its architectural fixes onto the LeNet template:
# ReLU after every conv, an overlapping 3x3 stride-2 max-pool, and dropout
# in the dense head. The wide 11x11 stride-4 stem shrinks 224x224 fast.
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """AlexNet for 224x224 ImageNet inputs (the torchvision topology)."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),  # large 11x11 stem
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # overlapping pool
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.avgpool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

net = AlexNet()
print(f"AlexNet parameters: {sum(p.numel() for p in net.parameters()):,}")
print(net(torch.randn(2, 3, 224, 224)).shape)

AlexNet parameters: 61,100,840
torch.Size([2, 1000])

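Code Fragment 2: AlexNet in modern PyTorch. The ReLU after every convolution, the overlapping MaxPool2d(kernel_size=3, stride=2), and the two Dropout(0.5) dense layers are the fixes of subsection 3 that appear in the architecture; GPU training and data augmentation belong to the training setup rather than the model definition. The $11 \times 11$ stride-4 stem shrinks the input quickly, and the 61,100,840 parameter count is dominated by the first Linear(256 * 6 * 6, 4096) layer.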
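It helps to trace how a $224 \times 224$ input becomes the $6 \times 6 \times 256$ map that the classifier flattens. The sketch below applies the same output-size rule used for LeNet to the kernel, stride, and padding values in Code Fragment 2; out_size and the layers list are illustrative names, and only one spatial axis is tracked because the maps are square.

```python
# Output size along one axis for a convolution or pooling window.
def out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, output channels), read off Code Fragment 2.
layers = [
    ("conv1 11x11 / 4", 11, 4, 2, 64),
    ("pool1 3x3 / 2", 3, 2, 0, 64),
    ("conv2 5x5", 5, 1, 2, 192),
    ("pool2 3x3 / 2", 3, 2, 0, 192),
    ("conv3 3x3", 3, 1, 1, 384),
    ("conv4 3x3", 3, 1, 1, 256),
    ("conv5 3x3", 3, 1, 1, 256),
    ("pool3 3x3 / 2", 3, 2, 0, 256),
]

size = 224
sizes = []
for name, kernel, stride, padding, channels in layers:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name:16s} -> {size}x{size}x{channels}")

flat_features = sizes[-1] ** 2 * layers[-1][4]
print("flattened features:", flat_features)
```

The stem alone takes 224 down to 55, and the three overlapping pools do the rest. For a 224-pixel input the feature map is already $6 \times 6$ when it reaches AdaptiveAvgPool2d((6, 6)), so that layer only matters for other input sizes.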

Notice where the parameters live. The five convolutional layers together carry only about 2.5 million weights, while the first fully connected layer alone (mapping $256 \times 6 \times 6 = 9216$ activations to $4096$) holds about 37.8 million, over 60% of the total. This imbalance, cheap convolutions and expensive dense layers, is the inefficiency that VGG inherits and that the $1 \times 1$ bottleneck of Inception and global average pooling later remove. The seed of Section 20.2 is already visible here.

Library Shortcut: A Pretrained AlexNet in Three Lines

You almost never define these classic networks by hand. torchvision ships them with ImageNet-pretrained weights, turning the ~30 lines above into three:

# Load AlexNet with ImageNet-pretrained weights and the matching preprocessing.
# weights.transforms() is the exact resize/crop/normalize the weights expect,
# so you never reconstruct the input pipeline by hand.
from torchvision.models import alexnet, AlexNet_Weights
weights = AlexNet_Weights.IMAGENET1K_V1
model = alexnet(weights=weights).eval()
preprocess = weights.transforms()  # resize, center-crop, normalize, all handled
# model(preprocess(pil_image).unsqueeze(0)) -> 1000 ImageNet logits

The library handles the weight download and caching, the exact preprocessing pipeline (the resize, crop, and per-channel normalization the weights expect), and evaluation-mode behavior. Getting the preprocessing wrong is the single most common cause of "my pretrained model gives garbage", and weights.transforms() removes that failure mode as long as the same call is used everywhere the model runs, including deployment.

Code Fragment 3: A pretrained AlexNet in three lines using torchvision instead of the hand-written class above, letting you focus on running inference rather than rebuilding the network and its input pipeline.

4. Why It Worked: Scale Met a Ready Idea (Intermediate)

It is tempting to read AlexNet as a clever new architecture, but the more accurate and more useful reading is that a fourteen-year-old idea finally met the three things it had always needed: enough labeled data (ImageNet), enough compute (GPUs), and a nonlinearity that did not strangle gradients in a deep stack (ReLU). The convolutional structure was essentially LeNet's. This reframing matters because it tells you what to look for in every subsequent architecture: not "what novel layer did they invent?" but "what specific ceiling did they raise, and how would I detect that I am hitting the same ceiling in my own model?"

Practical Example: The Team That Skipped the Preprocessing

Who: a two-person startup building a plant-disease classifier for a farm-equipment company, 2024. Situation: they fine-tuned a pretrained CNN backbone and got 94% validation accuracy in the notebook, then deployed it behind a web API. Problem: in the field the API returned near-random predictions. Decision: rather than retrain, the engineer traced one image end to end and found that the API resized to $256 \times 256$ and fed raw 0 to 255 pixel values, while training had used weights.transforms(): center-crop to $224$ and normalize with the ImageNet mean and standard deviation. The deployed inputs lived in a completely different numerical range from anything the network had seen. Result: switching the API to call the exact same weights.transforms() pipeline restored field accuracy to the validation level the same afternoon, no retraining. Lesson: a pretrained network is a function of a precise input distribution. The preprocessing is part of the model, and the library shortcut above exists so that training and deployment can share one pipeline instead of two hand-built copies.
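The size of that mismatch is worth seeing in numbers. The sketch below applies only the per-channel normalization step, assuming the widely used ImageNet mean and standard deviation values; in real code you would take them from weights.transforms() rather than retype them, and the resize and crop steps are omitted here.

```python
# Commonly used ImageNet channel statistics (assumed here for illustration;
# weights.transforms() is the authoritative source for a given checkpoint).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Scale an 8-bit RGB pixel to [0, 1], then standardize each channel.
def normalize_pixel(rgb):
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

black, white = (0, 0, 0), (255, 255, 255)
low = min(normalize_pixel(black))
high = max(normalize_pixel(white))

print(f"normalized range the network was trained on: [{low:.2f}, {high:.2f}]")
print(f"raw range the broken API sent: [{min(black)}, {max(white)}]")
print(f"raw white is about {max(white) / high:.0f}x larger than normalized white")
```

A network trained on inputs of roughly unit scale receives values close to a hundred times larger when normalization is skipped, which is enough to push early activations far outside anything seen in training.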

Research Frontier: Old Backbones, New Recipes

You might assume a 2012 design is purely historical, but a striking 2021-2024 line of work says otherwise. Wightman, Touvron, and Jégou's "ResNet strikes back" (arXiv:2110.00476) showed that simply training a plain ResNet-50 with a modern recipe (better augmentation, longer schedules, label smoothing, the techniques of Chapter 21) lifts its ImageNet accuracy by several points, comfortably past many newer architectures trained the old way. The same lesson plausibly reaches AlexNet-class designs: a substantial part of the gap between 2012 and 2024 is the recipe, not only the topology. This is the thread that the ConvNeXt story of Section 20.5 pulls all the way through, and it is why this section's "scale met a ready idea" framing is not just a historical curiosity but an active research finding.

Exercise 20.1.1: Count the Parameter Imbalance (Conceptual)

Using the AlexNet code above, compute by hand the parameter count of (a) the first convolution layer (input 3 channels, 64 output channels, $11 \times 11$ kernel, plus bias) and (b) the first fully connected layer ($9216 \to 4096$, plus bias). Express each as a fraction of the 61M total. Explain in two sentences why the convolutional layers are so cheap relative to the dense layers, referring to the weight-sharing argument of Chapter 19, and predict which layer a memory-constrained design should attack first.

Exercise 20.1.2: ReLU versus tanh, Measured (Coding)

Take the LeNet5 class above and make two copies, one with nn.ReLU and one with nn.Tanh throughout. Train both on the MNIST or Fashion-MNIST loader from torchvision (resize to $32 \times 32$) with identical optimizer and learning rate for five epochs, logging training loss per step. Plot both loss curves on one axis. Confirm AlexNet's claim that ReLU reaches a given loss in fewer steps, and report the approximate speedup. Then increase both networks to eight convolution layers and observe whether the tanh version stalls, illustrating the dead-gradient bottleneck of subsection 3.

Exercise 20.1.3: Where Did the Error Go? (Analysis)

The 2010 to 2012 ImageNet top-5 error fell roughly from 28% to 26% to 15.3%. Find the published winning-entry descriptions for those three years (the references in this chapter's bibliography and the ImageNet challenge paper are starting points). For each year, classify the system as "hand-engineered features plus linear classifier" or "deep CNN", and write a short paragraph arguing whether the 2012 jump is better explained by a new architecture, by more data and compute, or by both. Tie your answer to the "scale met a ready idea" framing of subsection 4.