Section 18.1: From Linear Models to Multi-Layer Perceptrons

"Give me one matrix and I will draw you a straight line through anything. Give me two matrices with nothing between them and I will draw you the exact same line, twice the work, none the wiser."
A Linear Classifier With Delusions of Depth

Big Picture

A neural network earns its power not from depth alone but from the nonlinear activation squeezed between its linear layers; remove that nonlinearity and any tower of layers collapses, algebraically, into a single linear map. This section starts from the linear classifier your linear algebra already supports, proves why stacking it changes nothing, inserts one nonlinear function to break the collapse, and arrives at the multi-layer perceptron: the simplest model that can separate classes a straight line cannot.

Part II ended with classical recognition pipelines that fed hand-crafted features into a linear classifier, the support vector machine and its relatives in Chapter 16. The classifier was a single matrix; all the cleverness lived upstream in the features a human designed. Deep learning moves the cleverness inside the model, but the journey starts from that same linear classifier. We will build it, ask what we gain by stacking copies of it, discover the answer is "nothing", and then make one small change that turns it into a function approximator powerful enough to motivate the entire rest of this book. The illustration below captures that one move in a single picture.

A cartoon scientist holds a straight ruler over two tangled clusters of red and blue dots that no line can separate, while a robot bends the whole rubber sheet underneath so the same straight ruler now divides the colors cleanly, depicting how a hidden layer reshapes the input space to make classes linearly separable. — If a straight line cannot cut your data, bend the space, not the line: that single move is the entire job of a hidden layer.

1. The Linear Classifier, Revisited Beginner

A linear classifier scores an input vector $\mathbf{x} \in \mathbb{R}^d$ against $C$ classes with a single affine map. We stack the per-class weight vectors into a matrix $W \in \mathbb{R}^{C \times d}$ and add a bias $\mathbf{b} \in \mathbb{R}^C$:

$$ \mathbf{z} = W\mathbf{x} + \mathbf{b}, \qquad \mathbf{z} \in \mathbb{R}^C. $$

Each component $z_c$ is the unnormalized score (the logit) for class $c$. To turn logits into a probability distribution we apply the softmax function, which exponentiates and normalizes:

$$ p_c = \frac{e^{z_c}}{\sum_{k=1}^{C} e^{z_k}}. $$

For a vision input this $\mathbf{x}$ is, for now, a flattened image: a $28 \times 28$ grayscale picture becomes a length-784 vector by reading its pixels in row-major order. That flattening discards every spatial relationship between neighboring pixels, which is exactly the sin Chapter 19 will repair. For this section it is a feature, not a bug: it keeps the model a plain matrix multiply so we can reason about it cleanly. The code below builds the classifier with nothing but NumPy, to show there is no framework magic in a forward pass.

# Linear image classifier in pure NumPy: one affine map (W x + b) turns a
# flattened image into logits, then a numerically stable softmax converts
# those logits into a per-image probability distribution over the C classes.
import numpy as np

def softmax(z):
    # subtract the row max for numerical stability before exponentiating
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
batch, d, C = 4, 784, 10            # 4 flattened 28x28 images, 10 classes
X = rng.standard_normal((batch, d))
W = rng.standard_normal((C, d)) * 0.01
b = np.zeros(C)

Z = X @ W.T + b                     # logits, shape (4, 10)
P = softmax(Z)                      # class probabilities, shape (4, 10)
print(P.shape, P.sum(axis=1))       # (4, 10) [1. 1. 1. 1.]

Code Fragment 1: A linear image classifier in pure NumPy: the single X @ W.T + b matrix multiply produces logits, and the numerically stable softmax (which subtracts the row max before exponentiating) turns them into per-image class probabilities. The printed P.sum(axis=1) confirms each of the four rows is a valid distribution summing to one.

The output confirms the shape and that each row of P is a valid distribution. This single linear layer is the entire model. The only question deep learning answers differently from Chapter 16 is where $W$ comes from: there it was fit by a convex solver on hand-crafted features; here it will be discovered by gradient descent (Section 18.2) directly on pixels.

2. Why Stacking Linear Layers Gains Nothing Beginner

The obvious idea for more power is depth: send $\mathbf{x}$ through one linear layer, then another. Let the first layer be $W_1, \mathbf{b}_1$ producing a hidden vector $\mathbf{h} = W_1\mathbf{x} + \mathbf{b}_1$, and the second be $W_2, \mathbf{b}_2$ producing $\mathbf{z} = W_2\mathbf{h} + \mathbf{b}_2$. Substituting one into the other:

$$ \mathbf{z} = W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = \underbrace{(W_2 W_1)}_{W'}\,\mathbf{x} + \underbrace{(W_2\mathbf{b}_1 + \mathbf{b}_2)}_{\mathbf{b}'}. $$

The two-layer network is identically a one-layer network with weights $W' = W_2 W_1$ and bias $\mathbf{b}' = W_2\mathbf{b}_1 + \mathbf{b}_2$. No matter how many linear layers you stack, the composition is one matrix; the family of functions you can represent never grows beyond "straight cuts through input space". This is not an approximation or a training difficulty: it is an algebraic identity. The experiment below verifies it numerically.

# Numerical check of the collapse identity: run a batch through two stacked
# linear layers with no activation, then through the single equivalent layer
# whose weight is W2 @ W1, and confirm the two outputs agree exactly.
import numpy as np
rng = np.random.default_rng(1)
x = rng.standard_normal((5, 6))           # batch of 5, input dim 6
W1, b1 = rng.standard_normal((8, 6)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

# two linear layers, no nonlinearity between them
h = x @ W1.T + b1
z_deep = h @ W2.T + b2

# the equivalent single linear layer
W_eq = W2 @ W1                            # (3, 6)
b_eq = W2 @ b1 + b2                       # (3,)
z_flat = x @ W_eq.T + b_eq

print(np.allclose(z_deep, z_flat))        # True

Code Fragment 2: A numerical proof that two stacked linear layers equal one: z_deep from the two-layer pass and z_flat from the collapsed W_eq = W2 @ W1 single matrix agree to floating-point precision, so np.allclose prints True. This confirms the algebraic identity above and explains why depth without a nonlinearity buys nothing.

Key Insight: Depth Without Nonlinearity Is an Illusion

The representational power of a network comes from interleaving linear maps with nonlinear functions, never from the linear maps alone. A "deep" net with no activations is a costly way to compute a single matrix product. This is why the activation function is not a tuning detail but the load-bearing wall of the architecture: it is the only thing standing between you and a glorified linear regression.

3. The Nonlinearity That Breaks the Collapse Beginner

Insert a nonlinear function $\sigma$ applied element-wise after the first linear layer, and the collapse argument fails because $W_2 \sigma(W_1 \mathbf{x})$ cannot be folded into a single matrix. The simplest and now-dominant choice is the rectified linear unit, $\mathrm{ReLU}(u) = \max(0, u)$, which zeroes negative values and passes positive ones unchanged. Its appeal is brutal simplicity: cheap to compute, a gradient of exactly $1$ for positive inputs (so signal does not shrink as it backpropagates through depth), and a hard zero elsewhere that yields sparse activations.

ReLU was not the first choice historically. Two older alternatives, the logistic sigmoid $\sigma(u) = 1/(1+e^{-u})$ and $\tanh$, squash inputs into a bounded range but saturate: for large magnitudes their gradient is near zero, which starves the deep layers of learning signal, the vanishing-gradient problem ReLU was introduced to sidestep.

Geometrically, each ReLU unit folds input space along a hyperplane, and a layer of them tiles space into linear regions. A linear region here just means a patch of input space inside which the same set of ReLU units is active, so the network behaves like one fixed linear map; cross into the next patch and a different unit switches on, giving a different linear map and thus a bend. Stacking layers multiplies the number of regions, which is how a network builds a curved decision boundary out of many flat pieces. Figure 18.1.1 contrasts the three activation shapes that the prose just described.

Figure 18.1.1 The three classic activation functions. ReLU is a hinge with gradient exactly 0 or 1, which preserves signal through depth. Sigmoid and tanh are smooth S-curves that saturate at their extremes, where their near-zero gradient stalls learning in deep stacks. ReLU's non-saturating positive branch is why it became the default after 2012.

With a nonlinearity in place, a two-layer network $\mathbf{z} = W_2\,\sigma(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$ is genuinely more expressive than a single layer. This is the multi-layer perceptron (MLP), and the layer producing $\mathbf{h}$ is the hidden layer because its outputs are never observed directly, only consumed by the next layer.

4. The Multi-Layer Perceptron Intermediate

An MLP is a chain of (linear layer, activation) pairs ending in a final linear layer that produces logits. With one hidden layer of width $H$, the forward pass on a flattened image is two matrix multiplies with a ReLU between them. The code makes a complete forward pass explicit, still in NumPy, so the only new ingredient beyond Section 1 is the single maximum(0, ...) call.

# One-hidden-layer MLP forward pass in NumPy: two weight matrices with an
# elementwise ReLU between them, plus the He-scaled initialization that keeps
# activations well-conditioned. This is the linear classifier plus one max(0, .).
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
batch, d, H, C = 8, 784, 128, 10           # hidden layer of width 128
X  = rng.standard_normal((batch, d))
W1 = rng.standard_normal((H, d)) * np.sqrt(2.0 / d)   # He init for ReLU
b1 = np.zeros(H)
W2 = rng.standard_normal((C, H)) * np.sqrt(2.0 / H)
b2 = np.zeros(C)

h = np.maximum(0.0, X @ W1.T + b1)         # hidden activations (ReLU)
Z = h @ W2.T + b2                          # output logits
P = softmax(Z)
print(h.shape, P.shape)                    # (8, 128) (8, 10)

Code Fragment 3: A one-hidden-layer MLP forward pass: the only structural difference from Code Fragment 1 is the np.maximum(0.0, ...) ReLU between the two weight matrices, and the sqrt(2.0 / d) He-scaled initialization tuned for ReLU's variance. The printed shapes (8, 128) and (8, 10) trace the batch from the width-128 hidden layer to the ten output logits.

Notice the initialization scale $\sqrt{2/d}$, the He initialization matched to ReLU. Section 18.2 will explain why the scale of initial weights matters for gradient flow; for now treat it as the recipe that keeps activations from exploding or vanishing as they pass through the layer. The remarkable fact about this humble architecture is the universal approximation theorem: a single hidden layer of sufficient width can approximate any continuous function on a bounded domain to arbitrary accuracy. The theorem says nothing about how wide (it can be impractically large) or how to find the weights, which is exactly why depth and good optimization, not just width, drive practice.

Common Misconception: "Universal Approximation Means One Wide Layer Is Enough"

Students often read the universal approximation theorem as a promise that a single wide hidden layer can solve any vision problem, so depth is just optional decoration. In fact the theorem is an existence result: it guarantees that some set of weights exists, but says nothing about how many neurons that takes (it can be exponentially many) or whether gradient descent can ever find them from data. For real images the required width is astronomically impractical, and a deeper-but-narrower network represents the same function with far fewer parameters because each layer reuses the features built by the one below. Approximation power and learnability are two different questions; the theorem answers only the first. The spiral-arm argument in Exercise 18.1.3 makes the gap concrete, and the convolutional prior of Chapter 19 exists precisely because brute-force width on raw pixels is hopeless.

Library Shortcut: The Same MLP in Four Lines of PyTorch

Everything above, the two weight matrices, the bias vectors, the ReLU, and the correct ReLU-tuned initialization, is a single nn.Sequential in PyTorch: nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)). That replaces roughly 15 lines of manual NumPy (weight allocation, init scaling, the forward pass) with four, and the framework also wires up automatic differentiation so the backward pass of Section 18.2 comes for free, initializes each nn.Linear with a sensible default (Kaiming-uniform, which you can swap for explicit He initialization via torch.nn.init when a ReLU follows), and moves the whole thing to a GPU with one .to(device) call. We meet nn.Module and nn.Sequential properly in Section 18.3.

Practical Example: The Tabular Team That Reached for a CNN

Who: A data team at a logistics startup predicting next-day package volume per depot from a few dozen numeric and categorical features.

Situation: A new hire, fresh from a computer-vision course, proposed a convolutional network because "deep learning won vision". The features were tabular: counts, day-of-week, weather codes, prior-day volumes.

Problem: Convolution assumes a spatial or sequential neighborhood structure, which tabular columns do not have; column 7 being adjacent to column 8 means nothing. The CNN trained slowly, overfit, and never beat the gradient-boosted-tree baseline.

Decision: The team stepped back to the right tool. They kept a plain MLP (a few hidden layers with ReLU and dropout) for the neural baseline, fed it normalized features, and used it mainly to learn embeddings for the high-cardinality categorical columns, which the trees could not do cleanly.

Result: The MLP matched the gradient-boosted-tree baseline (within 0.3 percent on mean absolute error) at a fraction of the CNN's training time, and feeding its learned category embeddings back into the ensemble cut validation error by about 4 percent. No convolution was harmed in the making of the model.

Lesson: The MLP, not the CNN, is the general-purpose neural network for unstructured-but-not-spatial data. Convolution is a structural prior for grids of pixels; reach for it in Chapter 19 when the data actually is a grid, not because the phrase "deep learning" appears in the project brief.

5. Decision Boundaries: What the Hidden Layer Buys You Intermediate

The clearest way to feel the difference is the exclusive-or problem: four points labeled so that no straight line separates the two classes. A linear classifier provably cannot solve it (this is the historical result that froze neural-network research for a decade). A two-layer MLP solves it easily, because the hidden ReLU units carve input space into regions the output layer can then combine. The snippet evaluates an MLP whose weights are set by hand to the known XOR solution, to show the mechanism without yet invoking training.

# Solve XOR with a hand-built MLP: two hidden ReLU units detect the "exactly
# one input on" cases, and a single output combines them. This shows the
# mechanism (depth plus nonlinearity) without any training.
import numpy as np

# XOR: class 1 when exactly one input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-set weights: two hidden ReLU units detect the two "exactly one on" cases
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, -2.0]])      # combine the two hidden features
b2 = np.array([0.0])

h = np.maximum(0.0, X @ W1.T + b1)
score = (h @ W2.T + b2).ravel()
pred = (score > 0.5).astype(int)
print(pred, "matches labels:", np.array_equal(pred, y))   # [0 1 1 0] matches labels: True

Code Fragment 4: An MLP solving XOR with hand-set weights: the hidden ReLU layer creates two features whose linear combination (W2 = [[1, -2]]) separates a problem that defeats any single linear layer. The printed [0 1 1 0] and matches labels: True confirm the network reproduces the XOR truth table exactly.

To see the mechanism on one input, trace $[1, 1]$: both hidden units receive $1+1=2$ before their biases, so the first unit outputs $\max(0, 2)=2$ and the second outputs $\max(0, 2-1)=1$; the output then computes $1\cdot 2 - 2\cdot 1 = 0$, which is below the $0.5$ threshold, giving class 0, exactly as XOR requires. The output [0 1 1 0] matches the labels exactly. The single hidden layer transformed the inputs into a new space where the classes are linearly separable, and the output layer drew the now-trivial line. Figure 18.1.2 makes this transformation literal: it plots all four XOR points in the original input plane, where no line separates the two classes, beside the same four points in the hidden layer's $(h_1, h_2)$ coordinates, where a single straight line does.

Figure 18.1.2 Bend the space, not the line. Left: the four XOR points in the original input plane, where the two blue (class 0) and two red (class 1) points sit on opposite diagonals so no straight line (gray dashed) can separate them. Right: the hidden ReLU layer maps each point to its $(h_1, h_2) = \mathrm{ReLU}([x_1{+}x_2,\ x_1{+}x_2{-}1])$ coordinates, where both red points land on $(1,0)$ and the blue points spread to $(0,0)$ and $(2,1)$, so a single straight line (green dashed) now separates the classes. The output layer only has to draw that easy line.

That is the entire principle behind every network in Part III: learn a representation in which the final decision is easy. In Section 18.2 we stop setting weights by hand and let gradient descent discover them; in Chapter 25 we will see networks learn representations so good that the final linear layer barely has to do anything.

You Could Build This: A Live Decision-Boundary Explorer

With only the forward pass of this section you can build a small interactive tool that makes "bend the space, not the line" visible. Generate a two-dimensional toy dataset that no line can split (two interleaved spirals, or the four XOR points fattened into Gaussian blobs), train the one-hidden-layer MLP above, and after every few steps paint the decision boundary by running the forward pass over a fine grid of the input plane and coloring each cell by its predicted class. Watching the boundary start as a single straight cut and slowly fold into curves as the hidden ReLU units switch on is the most direct way to feel why the nonlinearity matters. Difficulty: beginner, about 45 to 60 minutes. It needs nothing beyond NumPy and Matplotlib, reuses the exact maximum(0, ...) forward pass from Code Fragment 3, and slots in the gradient-descent training of Section 18.2 once you reach it. A short animated GIF of the boundary morphing makes a memorable portfolio piece and a clean companion to the region-counting study in Exercise 18.1.2.

Fun Note

The XOR problem is the gremlin that nearly killed neural networks. In 1969 Minsky and Papert pointed out that a single-layer perceptron cannot solve it, and the field promptly went into a decade-long sulk now politely called the first "AI winter". The fix was hiding in plain sight all along: one hidden layer and one bend in the function. The moral worth tattooing on a sticky note is "if a straight line cannot cut it, bend the space, not the line", which is the entire job description of every hidden layer you will ever write.

Research Frontier: The MLP Refuses to Retire

The lowly MLP keeps resurfacing at the research frontier. MLP-Mixer (Tolstikhin et al., 2021) showed that an architecture built entirely from MLPs, mixing across spatial locations and across channels in alternation, could rival convolutional and transformer vision models, questioning how much of their success is the convolution versus the training recipe. The 2024 wave of Kolmogorov-Arnold Networks (KANs) replaces the MLP's fixed activations with learnable spline functions on the edges, reviving interest in where exactly an MLP's nonlinearity should live. And in modern transformers (Chapter 22), the position-wise feed-forward block, which holds the majority of the parameters, is just an MLP. The two-layer perceptron you built in this section is not a historical stepping stone you leave behind; it is a recurring component you will meet inside almost every architecture that follows.

Exercise 18.1.1: The Collapse, Generalized Conceptual

Prove that a network of $L$ linear layers (no activations) with weight matrices $W_1, \ldots, W_L$ and biases $\mathbf{b}_1, \ldots, \mathbf{b}_L$ is equivalent to a single affine map, and write the resulting effective weight and bias in terms of the originals. Then explain in one sentence why adding any single nonlinear layer anywhere in the chain breaks this equivalence, and identify which property of matrix multiplication the collapse relies on.

Exercise 18.1.2: Counting Linear Regions Coding

Build a small MLP in NumPy with two inputs, a hidden ReLU layer of width $H$, and one output, with random weights. Sample a fine grid over the input square $[-3, 3]^2$, run the forward pass, and color each grid point by which hidden units are active (the activation pattern). Visualize the result for $H = 2, 4, 8, 16$ and count the distinct linear regions. Describe how the number of regions grows with $H$, and connect your observation to the claim in subsection 3 that more units carve space into more flat pieces.

Exercise 18.1.3: When Width Is Not Enough Analysis

The universal approximation theorem guarantees a single wide hidden layer can fit any continuous function, yet practice favors depth. Using the spiral-dataset thought experiment (two interleaved spiral arms as two classes), argue why a shallow-but-wide MLP might need exponentially many units to separate the arms while a deeper-but-narrower MLP needs far fewer. Frame your argument in terms of how each successive hidden layer can reuse the regions built by the previous one, and relate it to the representation-learning view stated at the end of subsection 5.