"Hand me a smudged photograph and I will tell you what it was supposed to be. I have seen ten thousand clean ones; the smudge is the only part I have never been allowed to keep."
A Denoising Autoencoder Who Has Made Peace With Noise
You can make an autoencoder learn a far better representation without touching its bottleneck, simply by changing the task it is trained on: corrupt the input and ask it to reconstruct the clean original, or penalize how many code units fire at once. The denoising autoencoder learns features that are robust because they must survive noise, and in doing so it implicitly learns the gradient of the log data density (the score), which is the exact quantity diffusion models will spend Chapter 33 estimating. The sparse autoencoder learns features that are interpretable because only a few are allowed to be active for any input, a property that lay mostly dormant for a decade and then, in 2023 and 2024, became the leading method for cracking open what large models represent. Neither variant is generative in the sampling sense, but both sharpen the central lesson of the chapter: the constraint you impose on the code is what decides what the code learns.
In Section 31.1 you built a plain autoencoder and saw that its undercomplete bottleneck forces the code to capture structure. This section keeps the encoder-decoder skeleton but swaps the constraint. Instead of relying on a narrow bottleneck, the denoising autoencoder uses a corruption process, and the sparse autoencoder uses an activation penalty. Both ideas predate deep learning's modern era, both turned out to matter far more than their inventors expected, and both connect directly forward to material later in this book. We take them one at a time.
1. The Denoising Autoencoder (Beginner)
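The denoising autoencoder (DAE) changes one line of the training procedure. Before passing an image to the encoder, corrupt it: add Gaussian noise, set a random fraction of pixels to zero, or blur it. Crucially, the reconstruction target stays the clean original. The network sees a corrupted input $\tilde{x}$ (for additive noise, $\tilde{x} = x + \text{noise}$) and is scored on how well its output $\hat{x}(\tilde{x})$ matches the uncorrupted $x$. The objective is

$$\mathcal{L}_{\text{DAE}} = \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)} \big[ \lVert \hat{x}(\tilde{x}) - x \rVert^2 \big],$$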
where $C$ is the corruption distribution. This is the classical denoising of Chapter 7 turned into a learning problem: where Chapter 7 hand-designed a Gaussian or non-local-means filter to remove noise, the DAE learns the denoiser from examples. To undo corruption the network cannot copy pixels through (the pixels are wrong); it must understand what a clean image looks like and project the noisy input back onto that manifold. That projection is exactly the structure we want the code to capture, and it is why a denoising autoencoder learns robust, transferable features even when its bottleneck is generous or absent. The illustration below casts the denoiser as an art-restorer who has seen thousands of clean examples.
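There is a precise sense, not just a metaphor, in which this network learns the gradient of the log data density. For Gaussian corruption $\tilde{x} = x + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, Vincent's denoising-score-matching identity (derived in full in Section 30.4) shows that the optimal denoiser's residual, rescaled, is exactly the score of the noised density:

$$\frac{\hat{x}^{*}(\tilde{x}) - \tilde{x}}{\sigma^{2}} = \nabla_{\tilde{x}} \log p_{\sigma}(\tilde{x}),$$

where $\hat{x}^{*}(\tilde{x}) = \mathbb{E}[x \mid \tilde{x}]$ is the MSE-optimal denoiser and $p_{\sigma}$ is the data density convolved with $\mathcal{N}(0, \sigma^{2} I)$.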
In words: training a network to point a noisy sample back toward clean data is the same as training it to estimate which way density increases. That equivalence is the entire bridge to Chapter 33, where a denoiser run across a whole schedule of noise levels becomes a generative model; the DAE here is its single-noise-level ancestor.
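The equivalence between denoising and score estimation can be checked numerically in one dimension, where the true score has a closed form. The sketch below is an illustration under stated assumptions, not part of the chapter's code: the data come from a single Gaussian $\mathcal{N}(\mu, s^2)$ (chosen so the score of the noised density is known exactly), the "denoiser" is a least-squares line fitted to (noisy, clean) pairs rather than a neural network, and the names `fit_linear_denoiser` and `true_score` are hypothetical.

```python
import random

random.seed(0)

mu, s, sigma = 2.0, 1.0, 0.5   # data mean/std and noise std (illustrative values)
n = 100_000

# Draw clean samples and their Gaussian-corrupted versions.
clean = [random.gauss(mu, s) for _ in range(n)]
noisy = [x + sigma * random.gauss(0.0, 1.0) for x in clean]

def fit_linear_denoiser(noisy, clean):
    """Least-squares fit of clean ~ a * noisy + b.

    When data and noise are both Gaussian, the MSE-optimal denoiser
    E[x | x_tilde] is exactly linear, so a line is enough here.
    """
    m_n = sum(noisy) / len(noisy)
    m_c = sum(clean) / len(clean)
    cov = sum((u - m_n) * (v - m_c) for u, v in zip(noisy, clean))
    var = sum((u - m_n) ** 2 for u in noisy)
    a = cov / var
    return a, m_c - a * m_n

def true_score(x_tilde):
    """Score of the noised density N(mu, s^2 + sigma^2)."""
    return -(x_tilde - mu) / (s ** 2 + sigma ** 2)

a, b = fit_linear_denoiser(noisy, clean)

for x_tilde in (0.0, 2.0, 3.5):
    denoised = a * x_tilde + b
    estimated = (denoised - x_tilde) / sigma ** 2   # rescaled denoiser residual
    print(f"x~={x_tilde:4.1f}  residual/sigma^2={estimated:+.3f}  "
          f"true score={true_score(x_tilde):+.3f}")
```

The two printed columns should agree up to sampling error: the fitted denoiser never saw the density, only noisy-clean pairs, yet its rescaled residual points in the direction of increasing density.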
The implementation reuses the encoder and decoder from Section 31.1 verbatim. The only change is that we add noise to the input inside the training loop while keeping the clean image as the target, as the comments in the next block make explicit.
# Denoising autoencoder: reuse the Section 31.1 architecture unchanged and
# alter only the training loop. Corrupt the input with Gaussian noise but
# score the output against the CLEAN original, forcing the net to denoise.
# `Autoencoder` is the class from Section 31.1; `loader` is assumed to be
# its image data loader.
import torch
import torch.nn as nn

model = Autoencoder(code_dim=32)  # same architecture as Section 31.1
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
noise_std = 0.4  # corruption strength

for epoch in range(5):
    for x, _ in loader:  # x is the CLEAN image, in [0, 1]
        x_noisy = (x + noise_std * torch.randn_like(x)).clamp(0, 1)  # corrupt
        x_hat, _ = model(x_noisy)  # encode/decode the CORRUPTED input
        loss = loss_fn(x_hat, x)   # but score against the CLEAN target
        opt.zero_grad()
        loss.backward()
        opt.step()
# At test time, feed a noisy digit and the DAE returns a clean one,
# having learned what "a digit" looks like rather than how to copy pixels.
The masking-noise variant, where a random fraction of input pixels are zeroed rather than perturbed, is worth a special mention because you have already met its descendant. Training a network to fill in missing content is the pretext task of the Masked Autoencoder (MAE) from Chapter 25, which masks 75 percent of image patches and reconstructs them with a vision transformer. MAE is, in essence, a denoising autoencoder whose corruption is aggressive patch masking; it differs from the classical recipe mainly in that masked patches are withheld from the encoder rather than zeroed. Figure 31.2.1 shows the corrupt-then-restore loop that unifies them.
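A minimal sketch of pixel-level masking noise, written as a drop-in replacement for the Gaussian corruption line in the training loop above. The function name `mask_pixels` and the 30 percent drop rate are illustrative assumptions, and the random tensor stands in for a real batch. This is the classical per-pixel variant, not MAE's patch-level masking.

```python
import torch

def mask_pixels(x: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    """Zero each pixel independently with probability drop_prob.

    Returns a new tensor; the clean x is left untouched so it can
    still serve as the reconstruction target.
    """
    keep = (torch.rand_like(x) >= drop_prob).to(x.dtype)
    return x * keep

torch.manual_seed(0)
x = torch.rand(8, 1, 28, 28)              # stand-in batch of clean images in [0, 1]
x_masked = mask_pixels(x, drop_prob=0.3)

# In the training loop, only the corruption line changes:
#     x_noisy = mask_pixels(x, drop_prob=0.3)
#     x_hat, _ = model(x_noisy)
#     loss = loss_fn(x_hat, x)            # target is still the clean image

frac_zeroed = (x_masked == 0).float().mean().item()
print(f"fraction of pixels zeroed: {frac_zeroed:.2f}")
```

Because the loss is still computed against the clean `x`, the network is penalized on the zeroed pixels as well as the surviving ones, which is what forces it to infer missing content from context.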
There is a precise and consequential result here. Training a denoising autoencoder with small Gaussian noise makes its reconstruction-minus-input vector proportional to $\nabla_x \log p(x)$, the gradient of the log data density, also called the score. In plain terms, the DAE learns which direction to nudge a corrupted image to make it more like real data. A diffusion model is, in essence, a denoiser trained at many noise levels at once and run iteratively: corrupt all the way to pure noise, then denoise step by step back to an image. The single-step denoiser of this section is the seed; Chapter 33 grows it into the dominant generative model of the era. The link between denoising and the score was derived in Chapter 30 (Section 30.4); here you have built the object that realizes it.
Difficulty: beginner. Time: about 45 to 60 minutes. With the denoising autoencoder of subsection 1 you already have everything you need for a small, showable project: a one-click denoiser for grainy phone photos or old scans. Take a folder of clean images, add synthetic Gaussian noise on the fly during training exactly as in Code Fragment 1, and train a convolutional encoder-decoder to recover the clean original. Wrap the trained model in a tiny script that loads a noisy image, runs one forward pass, and saves the cleaned result side by side with the input. The payoff is a before-and-after gallery that demonstrates the chapter's central claim in one glance: the network learned the clean-data manifold rather than memorizing pixels. To make it portfolio-grade, add the noise-level conditioning of Exercise 31.2.2 so a single model handles light and heavy noise, and benchmark it against the classical Gaussian blur of Chapter 7 on the same images to show what learning buys you over hand-designed filtering.
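For the benchmarking step, one possible starting point is a PSNR helper plus a small classical blur baseline. This is a sketch under assumptions: `psnr` and `gaussian_blur3` are hypothetical helper names, the 3x3 binomial kernel is a stand-in for whichever Gaussian filter from Chapter 7 you choose, and the smooth ramp image is synthetic test data that flatters any blur, so real photos will show a smaller gap.

```python
import torch
import torch.nn.functional as F

def psnr(estimate: torch.Tensor, clean: torch.Tensor) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, 1]."""
    mse = F.mse_loss(estimate, clean)
    return (10.0 * torch.log10(1.0 / mse)).item()

def gaussian_blur3(x: torch.Tensor) -> torch.Tensor:
    """Classical baseline: 3x3 binomial (approximately Gaussian) blur per channel."""
    kernel = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    channels = x.shape[1]
    weight = kernel.repeat(channels, 1, 1, 1).to(x.dtype)   # (C, 1, 3, 3)
    padded = F.pad(x, (1, 1, 1, 1), mode="replicate")       # avoid dark borders
    return F.conv2d(padded, weight, groups=channels)

torch.manual_seed(0)
# Synthetic clean images: a horizontal intensity ramp, batch of 4.
clean = torch.linspace(0, 1, 28).repeat(4, 1, 28, 1)        # (4, 1, 28, 28)
noisy = clean + 0.1 * torch.randn_like(clean)

print(f"noisy input : {psnr(noisy, clean):.1f} dB")
print(f"3x3 blur    : {psnr(gaussian_blur3(noisy), clean):.1f} dB")
# Once a denoiser is trained, add its row to the comparison, e.g.:
# print(f"trained DAE : {psnr(model(noisy)[0], clean):.1f} dB")
```

Reporting all three numbers on the same held-out images gives the before-and-after gallery a quantitative caption.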
2. The Sparse Autoencoder (Intermediate)
The sparse autoencoder takes the opposite tack from the bottleneck. Instead of making the code small, make it large, often larger than the input (overcomplete), but add a penalty that forces only a few code units to be active (nonzero) for any given input. The intuition is that each input should be explained as a combination of a handful of reusable parts drawn from a big dictionary, the way a face is a few specific features chosen from a vast catalog of possible features. The objective adds a sparsity term to the reconstruction loss:

$$\mathcal{L}_{\text{sparse}} = \lVert x - \hat{x} \rVert^2 + \lambda \lVert z \rVert_1,$$
where $\lVert z \rVert_1 = \sum_i |z_i|$ is the L1 norm of the code and $\lambda$ controls how aggressively sparsity is enforced. The L1 penalty is the same convexity trick that drives lasso regression: it pushes small activations exactly to zero rather than merely shrinking them, so the learned code is genuinely sparse, with most units off and a few strongly on. Because the constraint, not the dimension, does the regularizing, the bottleneck can be wide, and a wide-but-sparse code often learns more interpretable features than a narrow dense one. The next block adds the penalty to the training loop.
# Sparse autoencoder: make the code wider than the input (overcomplete)
# but add an L1 penalty so only a few units fire per image. The constraint,
# not a narrow bottleneck, is what now forces the code to learn structure.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=1024):  # overcomplete: 1024 > 784
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-3  # sparsity weight lambda

for epoch in range(10):
    for x, _ in loader:
        x = x.flatten(1)  # (B, 1, 28, 28) -> (B, 784); no-op if already flat
        x_hat, z = model(x)
        recon = ((x - x_hat) ** 2).mean()
        sparsity = z.abs().mean()  # L1 penalty (averaged) drives most units to zero
        loss = recon + lam * sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()
    frac_active = (z > 1e-3).float().mean().item()  # measured on the last batch
    print(f"epoch {epoch}: active units = {frac_active:.1%}")
# epoch 0: active units = 41.2%
# ...
# epoch 9: active units = 6.8%   <- roughly 70 of the 1024 units fire per input
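The claim that an L1 penalty drives small activations exactly to zero, while a squared penalty only shrinks them, can be seen in a one-dimensional toy problem. For a single activation whose unpenalized value would be $a$, minimizing $\tfrac{1}{2}(z-a)^2 + \lambda|z|$ gives the soft-threshold $\operatorname{sign}(a)\max(|a|-\lambda, 0)$, whereas minimizing $\tfrac{1}{2}(z-a)^2 + \tfrac{\lambda}{2}z^2$ gives $a/(1+\lambda)$. The sketch below evaluates both; the function names are hypothetical, and this scalar picture is an idealization of what gradient descent on the full autoencoder loss does only approximately.

```python
def l1_minimizer(a: float, lam: float) -> float:
    """argmin_z 0.5*(z - a)**2 + lam*|z|  (soft-thresholding)."""
    magnitude = max(abs(a) - lam, 0.0)
    return magnitude if a >= 0 else -magnitude

def l2_minimizer(a: float, lam: float) -> float:
    """argmin_z 0.5*(z - a)**2 + 0.5*lam*z**2  (uniform shrinkage)."""
    return a / (1.0 + lam)

lam = 0.5
for a in (0.1, 0.4, 0.6, 2.0):
    print(f"a={a:3.1f}   L1 -> {l1_minimizer(a, lam):.3f}   "
          f"L2 -> {l2_minimizer(a, lam):.3f}")
```

Any value with $|a| \le \lambda$ lands exactly on zero under L1 but stays nonzero under L2. The constant offset $\lambda$ subtracted from the surviving activations is the shrinkage bias that later sparse-autoencoder variants try to reduce.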
If your goal is the sparse dictionary itself rather than a trainable encoder, scikit-learn's DictionaryLearning and SparseCoder solve the classical version of this objective (a linear dictionary with sparse codes) in a handful of lines, with no PyTorch, no GPU, and no training loop. DictionaryLearning(n_components=256, alpha=1.0).fit(X) learns the dictionary and .transform(X) gives the sparse codes, replacing roughly thirty lines of model and loop. The neural sparse autoencoder earns its keep only when you need a nonlinear encoder or want to scale to the activation streams of a large pretrained model, which is exactly the interpretability use case below.
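A runnable version of that shortcut might look like the sketch below. The sizes are assumptions chosen so it finishes quickly: 300 of scikit-learn's bundled 8x8 digits and a 32-atom dictionary (not overcomplete) instead of 256 components. It also sets `transform_algorithm="lasso_lars"` explicitly, because the default transform is orthogonal matching pursuit rather than an L1-penalized solve.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import DictionaryLearning

# Bundled dataset: 8x8 digit images flattened to 64 features, scaled to [0, 1].
X = load_digits().data[:300] / 16.0

dl = DictionaryLearning(
    n_components=32,                    # dictionary size (small so this runs fast)
    alpha=1.0,                          # sparsity weight used while fitting
    max_iter=20,
    transform_algorithm="lasso_lars",   # L1-penalized coding at transform time
    transform_alpha=1.0,
    random_state=0,
)
codes = dl.fit(X).transform(X)          # sparse codes, (n_samples, n_components)
atoms = dl.components_                  # dictionary atoms, (n_components, n_features)

frac_nonzero = float(np.mean(codes != 0))
print(codes.shape, atoms.shape, f"nonzero fraction = {frac_nonzero:.1%}")
```

Each row of `atoms` can be reshaped to 8x8 and displayed as an image, the linear counterpart of inspecting a sparse autoencoder's decoder weights.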
3. Sparse Autoencoders for Interpretability (Advanced)
For roughly a decade the sparse autoencoder was a respectable but minor member of the representation-learning toolkit. Then it found an unexpected and important second career. The problem it solves there is superposition: a trained neural network packs far more concepts into its activation vectors than it has neurons, so a single neuron lights up for many unrelated things and is impossible to interpret. The activations are dense and entangled. A sparse autoencoder trained on those activations learns to re-express each dense activation vector as a sparse combination of a much larger set of learned features, and those features turn out to be far more human-interpretable, one feature firing for "text in French," another for "the Golden Gate Bridge," another for a specific code pattern.
Think of the network's activation vector as a small hotel that has accepted far more guests (concepts) than it has rooms (neurons). Each neuron-room is shared by several unrelated guests on overlapping shifts, so peeking into one room tells you little: you might find a French-text guest, a Golden-Gate-Bridge guest, and a list-syntax guest all using bed number 240 at different moments. The sparse autoencoder is the front desk clerk who issues every guest a private room in a much larger annex (the overcomplete code), with the rule that only a few rooms are occupied for any one input; now each room maps to exactly one guest, so reading the room number tells you the concept. The crowding works only because, for any single input, just a handful of the thousands of possible concepts are actually present, which is precisely the sparsity the L1 penalty enforces.
Where this model breaks down: real features are not perfectly private; the recovered annex rooms can still share a little (features are not guaranteed fully monosemantic), and choosing the annex size wrong leaves some guests doubled up or some rooms empty.
This is the technique behind the 2023 to 2024 wave of mechanistic-interpretability results, most prominently Anthropic's dictionary-learning work that recovered millions of largely monosemantic features from a production language model. The autoencoder is overcomplete by a large factor (from tens of thousands up to millions of features for a few thousand activation dimensions), trained with the L1 penalty of subsection 2, on the cached activations of a frozen model. The next block sketches the setup; the only conceptual change from subsection 2 is what the data is.
# The same sparse autoencoder, now aimed at interpretability: train it on a
# frozen model's cached activations with a large overcomplete factor so each
# learned feature tends to become monosemantic (fires for one nameable concept).
import torch
import torch.nn as nn

# `acts` is a large tensor of cached activations from a frozen model,
# shape (num_tokens, d_model), e.g. d_model = 768.
class InterpSAE(nn.Module):
    def __init__(self, d_model=768, expansion=16):  # 16x overcomplete
        super().__init__()
        self.enc = nn.Linear(d_model, d_model * expansion)
        self.dec = nn.Linear(d_model * expansion, d_model)

    def forward(self, a):
        features = torch.relu(self.enc(a))  # sparse, interpretable feature set
        return self.dec(features), features

sae = InterpSAE()
# Train to reconstruct `acts` with a strong L1 penalty on `features`.
# After training, each large feature dimension tends to fire for ONE
# human-recognizable concept; inspecting the inputs that maximize a
# feature reveals what it detects. This is dictionary learning on a model's mind.
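The training and inspection steps described in those closing comments could be sketched as follows. Everything specific here is an assumption made so the example runs in seconds: `acts` is synthetic (64 hidden "concepts" squeezed into 16 dimensions, a toy version of superposition) rather than cached from a real model, `TinySAE` is a scaled-down stand-in for `InterpSAE`, and the L1 weight and step count are untuned.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for cached activations: each sample mixes a few of
# 64 concept directions into a 16-dimensional vector.
n_samples, n_concepts, d_model = 4096, 64, 16
directions = nn.functional.normalize(torch.randn(n_concepts, d_model), dim=1)
active = (torch.rand(n_samples, n_concepts) < 0.05).float()
strength = torch.rand(n_samples, n_concepts) * active
acts = strength @ directions                        # (n_samples, d_model)

class TinySAE(nn.Module):
    def __init__(self, d_model, expansion):
        super().__init__()
        self.enc = nn.Linear(d_model, d_model * expansion)
        self.dec = nn.Linear(d_model * expansion, d_model)

    def forward(self, a):
        features = torch.relu(self.enc(a))
        return self.dec(features), features

sae = TinySAE(d_model, expansion=8)                 # 128 features for 16 dims
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3                                    # illustrative, untuned

losses = []
for step in range(300):
    batch = acts[torch.randint(0, n_samples, (256,))]
    recon, features = sae(batch)
    mse = ((recon - batch) ** 2).mean()
    l1 = features.abs().sum(dim=1).mean()           # per-sample L1, batch-averaged
    loss = mse + l1_weight * l1
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# Inspect one feature: which cached activations excite it most?
with torch.no_grad():
    _, all_features = sae(acts)
feature_id = int(all_features.mean(dim=0).argmax())  # pick a feature that fires
top_vals, top_rows = all_features[:, feature_id].topk(5)
print(f"loss {losses[0]:.4f} -> {losses[-1]:.4f}")
print("feature", feature_id, "is most active on rows", top_rows.tolist())
```

With real cached activations, each entry of `top_rows` would index a token in the source text, and reading those tokens in context is how a feature gets its human label.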
The sparse autoencoder spent roughly ten years as the polite acquaintance at the representation-learning party that everyone vaguely remembered inviting. It was respectable, it had a clean objective, and almost nobody used it. Then interpretability researchers needed a tool to pry apart the tangled activations of giant models, looked around, and found this exact L1-penalized network sitting in the corner, perfectly suited to the job it was never originally hired for. The lesson, if there is one, is that a good idea with a sparsity penalty can lie dormant for a decade and still wake up as the protagonist.
The sparse autoencoder is, as of 2026, the dominant tool for reading the internal representations of large models. Anthropic's "Towards Monosemanticity" (2023) demonstrated the recipe on a small one-layer transformer, and the follow-up "Scaling Monosemanticity" (2024) scaled it to a production model (Claude 3 Sonnet) with dictionaries of up to tens of millions of features; the same recipe is now applied to vision and multimodal backbones to ask what a CLIP or a DINOv2 model represents internally. The active frontier is making the features cleaner and cheaper: gated and top-k sparse autoencoders (2024) modify or replace the L1 penalty with a sharper sparsity mechanism to reduce a known shrinkage bias, and the JumpReLU variant (Rajamanoharan et al., DeepMind, "Jumping Ahead," 2024, arXiv:2407.14435) sharpens the on-off decision and reports state-of-the-art reconstruction fidelity at a given sparsity over the Gated and top-k variants on Gemma 2 activations. The link to this section is direct: every one of these systems is a variation on the overcomplete, sparsity-constrained autoencoder you just built, trained on activations instead of pixels, with the L1 penalty as the original and simplest sparsity mechanism. The humble sparse autoencoder became a microscope for neural networks.
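To make the "sharper sparsity mechanism" concrete, the sketch below shows only the forward-pass activation functions of the top-k and JumpReLU ideas, applied to stand-in encoder pre-activations. The function names are hypothetical and the threshold is fixed by hand; in the published JumpReLU method the thresholds are learned, which needs gradient estimators that are omitted here. Treat it as an illustration of the on-off behaviour, not a reimplementation of either paper.

```python
import torch

def topk_activation(pre: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest pre-activations per row (after ReLU), zero the rest."""
    vals, idx = pre.topk(k, dim=-1)
    out = torch.zeros_like(pre)
    return out.scatter(-1, idx, torch.relu(vals))

def jumprelu_activation(pre: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Pass a pre-activation through unchanged if it exceeds its threshold, else zero."""
    return pre * (pre > theta).to(pre.dtype)

torch.manual_seed(0)
pre = torch.randn(4, 32)                       # stand-in encoder pre-activations
z_topk = topk_activation(pre, k=5)
z_jump = jumprelu_activation(pre, theta=torch.full((32,), 1.0))

print("top-k active per row:   ", (z_topk != 0).sum(dim=-1).tolist())
print("JumpReLU active per row:", (z_jump != 0).sum(dim=-1).tolist())
```

In both cases a surviving unit keeps its full value instead of being shrunk toward zero, which is the property these variants trade the simplicity of the L1 penalty for.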
Who: a biology lab with a low-light fluorescence microscope and no machine-learning staff, 2024. Situation: their images were grainy because longer exposures bleached the samples, so every frame carried heavy shot noise. Problem: off-the-shelf denoisers trained on natural photos washed out the faint structures the biologists cared about, and they had no clean ground-truth images to train a supervised denoiser. Decision: they trained a convolutional denoising autoencoder using a self-supervised scheme, treating one noisy capture as the input and a second independent noisy capture of the same field as the target, so the network learned to predict the signal common to both while the independent noise averaged out. Result: the DAE produced markedly cleaner images that preserved the faint structures, trained entirely on their own noisy data with no clean references. Lesson: the denoising autoencoder's defining feature, that it learns the clean-data manifold from corrupted examples, makes it the right tool exactly when clean training data does not exist, a situation more common in scientific imaging than in consumer photography.
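The reason a noisy target can stand in for a clean one (an idea often called Noise2Noise) is that the prediction minimizing mean squared error against many independent, zero-mean noisy targets is their average, which converges to the clean signal. The toy check below uses a single pixel and Gaussian noise as simplifying assumptions; real shot noise is Poisson-distributed, but the same argument applies as long as the noise is zero-mean around the true intensity.

```python
import random

random.seed(0)

clean_value = 0.7      # true pixel intensity, never shown to the "learner"
noise_std = 0.3

# Many independent noisy captures of the same pixel act as training targets.
noisy_targets = [clean_value + random.gauss(0.0, noise_std) for _ in range(20_000)]

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# The constant that minimizes MSE against the noisy targets is their mean.
best_prediction = sum(noisy_targets) / len(noisy_targets)

print(f"MSE-optimal prediction: {best_prediction:.3f} (clean value {clean_value})")
print(f"MSE at optimum: {mse(best_prediction, noisy_targets):.4f}")
print(f"MSE if off by 0.1: {mse(best_prediction + 0.1, noisy_targets):.4f}")
```

The loss never reaches zero, because the targets themselves are noisy, but its minimum still sits at the clean value, which is all the training signal needs.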
A denoising autoencoder can have a bottleneck that is wide enough to copy the input through, yet it does not learn the identity. Explain in three or four sentences why the corruption prevents the identity shortcut even when the architecture would otherwise allow it. Then explain the connection to dropout: in what sense is a denoising autoencoder applying a form of noise that plays the same regularizing role, and where in the network does each apply its noise?
Train the denoising autoencoder of subsection 1 on MNIST, but instead of a fixed noise_std, sample the noise standard deviation uniformly from $[0.0, 0.8]$ for each batch. At test time, feed digits corrupted at several noise levels and display the reconstructions. Then concatenate the chosen noise level as an extra input feature to the encoder and retrain. Compare reconstructions with and without the noise-level conditioning, and explain in a paragraph why telling the network how much noise it is facing helps, connecting this directly to the noise-level conditioning that Chapter 33's diffusion U-Net requires.
Train the sparse autoencoder of subsection 2 on MNIST with three values of $\lambda$ (for example $10^{-4}$, $10^{-3}$, $10^{-2}$). For each, measure the average fraction of active code units and the reconstruction MSE, and visualize the decoder weight vector of ten randomly chosen code units as $28 \times 28$ images (the "dictionary atoms"). Describe how the atoms change as $\lambda$ grows: do they become more localized and stroke-like, and at what point does sparsity start to hurt reconstruction? Relate the trade-off to the interpretability use case of subsection 3, where cleaner features are worth some reconstruction error.