"People accuse me of being inflexible. Nonsense. Under every transformation this book throws at me, I stretch, I shrink, I even flip sign. I simply refuse to turn."
A Stubbornly Direction-Preserving Eigenvector
Four bodies of mathematics carry this entire book: linear algebra, probability, calculus, and signal processing. Linear algebra is the language of transformations, from the image warps of Chapter 5 to the camera geometry of Chapters 12 to 14. Probability is the language of uncertainty and of generative modeling, the foundation of Chapters 30 to 33. Calculus, through the chain rule and gradient descent, is how every network in Chapter 18 onward learns. And signal processing, through convolution, Fourier analysis, and the sampling theorem, explains why Chapters 3 and 4 work at all. This appendix is a compact, self-contained refresher on exactly these four toolkits: enough to read every equation in the book, with pointers to the chapters where each idea earns its keep.
This appendix is a reference, not a course. Each topic gets a definition, a plain-language intuition, one small worked example, and a pointer to the chapters that rely on it. If a formula in the main text ever feels opaque, the habit to build is: come here, find the relevant block, work the small example by hand, then return. Nothing below assumes more than comfortable high-school algebra and the basic Python from Chapter 0. Readers who want full courses rather than a refresher will find a short list of excellent free resources at the end.
1. Linear Algebra for Vision
Vision is geometry done with arrays. Images are arrays of numbers, geometric operations on images are matrix multiplications, and a remarkable share of classical vision reduces to solving linear systems. This section covers the six linear-algebra ideas the book leans on most: vectors, matrices as transformations, eigenvectors, the singular value decomposition, homogeneous coordinates, and least squares.
Vectors and the Geometry of Image Data
Definition. A vector $\mathbf{x} \in \mathbb{R}^n$ is an ordered list of $n$ real numbers. The two operations that matter most are the dot product and the norm:
$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i\, y_i, \qquad \|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}}.$$The dot product measures alignment. Dividing it by the norms gives the cosine of the angle between two vectors, $\cos \theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$, a number in $[-1, 1]$ that is $1$ when the vectors point the same way, $0$ when they are perpendicular, and $-1$ when they oppose.
Intuition. In this book a vector is rarely an arrow on a chalkboard; it is data. A grayscale $100 \times 100$ image, flattened, is a single point in $\mathbb{R}^{10000}$. A SIFT descriptor from Chapter 10 is a point in $\mathbb{R}^{128}$. A CLIP embedding is a point in $\mathbb{R}^{512}$. Once data lives in a vector space, "how similar are these two patches?" becomes "what is the angle between these two points?", which is exactly how descriptor matching and embedding retrieval work.
Worked example. Let $\mathbf{u} = (3, 4)$ and $\mathbf{v} = (4, 3)$. Then $\|\mathbf{u}\| = \sqrt{9 + 16} = 5$, likewise $\|\mathbf{v}\| = 5$, and $\mathbf{u} \cdot \mathbf{v} = 12 + 12 = 24$. The cosine similarity is $24 / 25 = 0.96$: two nearly parallel vectors, as a sketch confirms.
Matrices: Multiplication as Transformation
Definition. A matrix $A \in \mathbb{R}^{m \times n}$ maps a vector $\mathbf{x} \in \mathbb{R}^n$ to a vector $A\mathbf{x} \in \mathbb{R}^m$, where the $i$-th output entry is the dot product of row $i$ of $A$ with $\mathbf{x}$. The map is linear: $A(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha A\mathbf{x} + \beta A\mathbf{y}$. Multiplying two matrices composes their maps: $(AB)\mathbf{x} = A(B\mathbf{x})$, so $B$ acts first, then $A$, and in general $AB \neq BA$.
Intuition. The most useful way to read a matrix is column by column: column $j$ of $A$ is where the $j$-th standard basis vector lands. The whole transformation is determined by where the basis goes, and everything else follows by linearity. A rotation matrix sends the basis to a rotated basis; a scaling matrix stretches it; a projection matrix flattens it.
Worked example. Rotation by $90^\circ$ counterclockwise is $R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$. Its first column $(0, 1)$ says the $x$-axis basis vector $(1,0)$ rotates to point straight up; its second column $(-1, 0)$ says the $y$-axis vector rotates to point left. Applying it: $R\,(1, 0)^\top = (0, 1)^\top$, exactly as a quarter turn should.
Every linear step in this book, an image warp, a camera projection, a fully connected network layer, an attention head's query projection, is "multiply by a matrix". The columns of that matrix tell you where the basis goes, and composing steps is just multiplying matrices in right-to-left order. When an architecture diagram in Chapter 18 or a warping pipeline in Chapter 5 chains operations, it is chaining matrices, and the order of the chain matters for exactly the reason $AB \neq BA$.
Eigenvectors and Eigenvalues
Definition. A nonzero vector $\mathbf{v}$ is an eigenvector of a square matrix $A$ with eigenvalue $\lambda$ if
$$A\mathbf{v} = \lambda \mathbf{v}.$$The transformation does not turn $\mathbf{v}$; it only scales it by $\lambda$. For a symmetric matrix (the common case in vision: covariance matrices, second-moment matrices) the eigenvalues are real and the eigenvectors can be chosen orthogonal, so the matrix is simply "independent stretches along perpendicular axes".
Intuition. Eigenvectors reveal a transformation's preferred directions, and eigenvalues say how strongly it acts along them. That reading powers several workhorses: in PCA, the eigenvectors of a data covariance matrix are the directions of greatest variance, the basis behind the eigenface recognition pipeline of Chapter 16. In the Harris corner detector of Chapter 10, two large eigenvalues of a local gradient matrix mean intensity changes in every direction: a corner.
Worked example. Take $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$. Try $\mathbf{v}_1 = (1, 1)$: $A\mathbf{v}_1 = (3, 3) = 3\mathbf{v}_1$, so $\lambda_1 = 3$. Try $\mathbf{v}_2 = (1, -1)$: $A\mathbf{v}_2 = (1, -1) = 1 \cdot \mathbf{v}_2$, so $\lambda_2 = 1$. The matrix stretches the diagonal direction by 3 and leaves the anti-diagonal alone; any other input vector is a mix of the two and gets turned as its components scale unequally.
The Singular Value Decomposition
Definition. Every matrix $A \in \mathbb{R}^{m \times n}$, square or not, factors as
$$A = U \Sigma V^\top,$$where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with nonnegative entries $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$, the singular values. Geometrically: any linear map is a rotation (or reflection) $V^\top$, then an axis-aligned scaling $\Sigma$, then another rotation $U$. The number of nonzero singular values is the rank of $A$.
Intuition. The SVD is the universal disassembly tool for matrices, and it shows up in this book in three roles. First, solving homogeneous systems: the best solution of $A\mathbf{x} = \mathbf{0}$ with $\|\mathbf{x}\| = 1$ is the right singular vector with the smallest singular value, which is exactly how homographies and the fundamental matrix are estimated from point matches in Chapter 13. Second, low-rank approximation: keeping the top $k$ singular values gives the best rank-$k$ approximation of $A$ (the Eckart-Young theorem), the principle behind PCA compression. Third, diagnosis: singular values measure how close a matrix is to losing rank, which is how degenerate geometry is detected in structure-from-motion pipelines in Chapter 14.
Worked example. The shear $S = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ looks innocent but is not a pure rotation or scaling. Its SVD has singular values $\sigma_1 = \varphi \approx 1.618$ and $\sigma_2 = 1/\varphi \approx 0.618$, where $\varphi$ is, delightfully, the golden ratio. The shear secretly rotates, stretches one axis by $\varphi$, squeezes the other by $1/\varphi$ (note $\sigma_1 \sigma_2 = 1$, so area is preserved), and rotates again.
Homogeneous Coordinates
Definition. A 2D point $(x, y)$ is written in homogeneous coordinates as the 3-vector $(x, y, 1)$, with the convention that any nonzero scalar multiple $(wx, wy, w)$ represents the same point. To convert back, divide by the last entry.
Intuition. The trick solves an annoyance and unlocks a superpower. The annoyance: translation is not a linear map of $(x, y)$, so it cannot be a $2 \times 2$ matrix; with the extra coordinate it becomes one matrix multiplication like everything else, and an entire pipeline of rotations, scalings, and translations collapses into a single $3 \times 3$ matrix. The superpower: $3 \times 3$ matrices on homogeneous coordinates also express projective transformations, the perspective-bending maps a real camera applies to the world, including "points at infinity" with $w = 0$ where parallel lines meet. This is the working notation of Chapter 5, the camera model of Chapter 12, and the two-view geometry of Chapter 13.
Worked example. Translating $(2, 3)$ by $(t_x, t_y) = (5, -1)$:
$$\begin{bmatrix} 1 & 0 & 5 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 7 \\ 2 \\ 1 \end{bmatrix},$$which is the point $(7, 2)$: a translation performed by matrix multiplication, something no $2 \times 2$ matrix can do.
Least Squares
Definition. Given more equations than unknowns, $A\mathbf{x} \approx \mathbf{b}$ with $A \in \mathbb{R}^{m \times n}$ and $m > n$, the least-squares solution minimizes the squared residual:
$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2 = (A^\top A)^{-1} A^\top \mathbf{b},$$where the closed form on the right (the normal equations) holds when $A$ has full column rank. In practice the solution is computed via QR or SVD factorizations, which are numerically safer than forming $A^\top A$ explicitly.
Intuition. Measurements are noisy, so vision problems are almost always overdetermined: dozens of point matches constrain an 8-parameter homography, hundreds of corner observations constrain a camera calibration. Least squares is the standard way to let every measurement vote and to settle on the parameters that displease the votes least, in the squared-error sense. It is the engine inside the curve fitting of Chapter 9, the calibration of Chapter 12, and, scaled up to millions of parameters, the bundle adjustment of Chapter 14. One caveat to carry from Chapter 10: squared error is exquisitely sensitive to outliers, which is why robust wrappers like RANSAC exist.
Worked example. The smallest interesting case is fitting a line $y = \theta_1 x + \theta_2$ to noisy points. Each observation contributes one row $[x_i, 1]$ to $A$ and one entry $y_i$ to $\mathbf{b}$, and three lines of NumPy solve it:
import numpy as np
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, size=50) # noisy line, true slope 2.5, intercept 1.0
A = np.column_stack([x, np.ones_like(x)]) # design matrix: one [x_i, 1] row per point
theta, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(theta) # [2.616 0.549]: the noise tilts the estimate
numpy.linalg.lstsq: the design matrix stacks one [x, 1] row per observation, and the solver returns the slope and intercept minimizing the sum of squared residuals. With this seed the 50 noisy points yield slope 2.616 and intercept 0.549 against ground truth 2.5 and 1.0.The same pattern, "stack one row per measurement, solve", reappears whenever the book says "estimated by least squares", whether the unknowns are 2 line parameters or 9 homography entries.
2. Probability & Statistics
Part IV of this book is built on a single idea: model the probability distribution of natural images. Reading it requires five tools: random variables, expectation, the Gaussian distribution, Bayes' rule, maximum likelihood, and KL divergence. They are also quietly present much earlier, in the noise models of Chapter 7 and the histograms of Chapter 2.
Random Variables and Expectation
Definition. A random variable $X$ is a quantity whose value is governed by a probability distribution: a probability mass function $p(x)$ for discrete values, or a probability density function for continuous ones. The expectation is the probability-weighted average,
$$\mathbb{E}[X] = \sum_x x\, p(x) \quad \text{or} \quad \mathbb{E}[X] = \int x\, p(x)\, dx,$$and the variance $\operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$ measures spread around that average.
Intuition. Every pixel a sensor records is a random variable: the scene contributes a signal, and photon arrival, sensor electronics, and quantization contribute noise. The histogram of Chapter 2 is nothing but an empirical estimate of a pixel-value distribution, and the sample mean of many noisy frames converges to the expectation, which is why frame averaging denoises.
Worked example. A fair six-sided die has $\mathbb{E}[X] = (1 + 2 + \dots + 6)/6 = 3.5$. Note the expectation need not be an achievable value; it is the long-run average, not a typical outcome. The same caution applies to "the average image" of a dataset, which is usually a gray blur no single photograph resembles.
The Gaussian Distribution
Definition. The univariate Gaussian (normal) distribution with mean $\mu$ and variance $\sigma^2$ has density
$$\mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),$$and its $d$-dimensional generalization replaces $\sigma^2$ with a covariance matrix $\Sigma$: $\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\!\big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \big)$.
Intuition. The Gaussian is everywhere for three reinforcing reasons. The central limit theorem says sums of many small independent effects look Gaussian, which is a fair description of sensor noise. The Gaussian is closed under linear transformations: a matrix times a Gaussian vector is again Gaussian, which keeps the algebra of Kalman filters in Chapter 15 and of diffusion models in Chapter 33 tractable end to end. And among all distributions with a given mean and variance it has maximum entropy: it assumes the least. The handy calibration to memorize: about 68% of the mass lies within $1\sigma$ of the mean, 95% within $2\sigma$, and 99.7% within $3\sigma$.
Worked example. If sensor noise is $\mathcal{N}(0, 2^2)$ gray levels, a noise excursion beyond $\pm 6$ gray levels ($3\sigma$) happens on roughly 0.3% of pixels: in a one-megapixel image, about three thousand pixels. "Rare per pixel" is never rare per image, which is why robust thresholds in Chapter 7 are set in units of $\sigma$.
Bayes' Rule
Definition. For events or hypotheses $H$ and observed data $D$,
$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)},$$read as: posterior equals likelihood times prior, normalized. It converts "how probable is this data under each hypothesis" into "how probable is each hypothesis given this data", which is the direction practitioners actually need.
A factory inspection model flags surface defects. Defects are rare: $P(\text{defect}) = 0.01$. The detector fires on 95% of true defects but also on 10% of clean parts. A part is flagged; what is the probability it is actually defective?
$$P(\text{defect} \mid \text{flag}) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.10 \times 0.99} = \frac{0.0095}{0.1085} \approx 0.088.$$
Under 9%. The flood of false alarms from the 99% of clean parts swamps the rare true positives. This base-rate effect governs every rare-event vision system, from medical screening to the deepfake detectors of Chapter 37, and it is why precision at a fixed false-positive rate, not raw accuracy, is the metric that matters for rare classes.
Intuition. Beyond alarm rates, Bayes' rule is the skeleton of probabilistic vision. Image restoration in Chapter 7 seeks the clean image maximizing $P(\text{clean} \mid \text{corrupted}) \propto P(\text{corrupted} \mid \text{clean}) P(\text{clean})$: a data-fidelity likelihood times an image prior. Generative classifiers invert class-conditional models the same way, and classifier guidance in Chapter 33 is Bayes' rule applied to the gradient of a log-posterior.
Maximum Likelihood Estimation
Definition. Given data $x_1, \dots, x_N$ drawn independently from a model $p_\theta(x)$ with unknown parameters $\theta$, the maximum likelihood estimate picks the parameters under which the observed data is most probable:
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(x_i).$$The logarithm turns a product of probabilities into a sum, which is numerically stable and differentiable, and maximizing it is equivalent.
Intuition. MLE is the bridge between probability and optimization: it converts "find a good model" into "maximize a differentiable objective", which Section 3 then solves by gradient descent. Two identities knit the book together. Fitting a Gaussian by MLE gives exactly the sample mean and (the $1/N$ version of) the sample standard deviation. And minimizing squared error, as in the least-squares fit above, is MLE under the assumption of Gaussian observation noise, just as minimizing cross-entropy in the classifiers of Chapter 18 is MLE for label distributions. The generative models of Chapter 30 onward are trained by maximizing likelihood exactly (autoregressive models, flows) or by maximizing a tractable lower bound on it (VAEs in Chapter 31, diffusion models in Chapter 33).
KL Divergence
Definition. The Kullback-Leibler divergence from distribution $q$ to distribution $p$ is
$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx,$$with a sum replacing the integral in the discrete case. It is always nonnegative, equals zero only when $p = q$, and is not symmetric: $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$ in general, so it is a divergence, not a distance.
Intuition. $D_{\mathrm{KL}}(p \,\|\, q)$ measures the penalty, in extra nats of surprise, for believing the world is $q$ when it is really $p$. It is the currency of Part IV: the VAE objective in Chapter 31 contains a KL term pulling the encoder's latent distribution toward a Gaussian prior; the original GAN objective of Chapter 32 turns out to minimize a close cousin (Jensen-Shannon divergence) between real and generated distributions; and the diffusion training bound in Chapter 33 is a sum of KL terms between Gaussians, which is precisely why its loss collapses to simple squared error on noise predictions. Minimizing KL to the data distribution is also equivalent to maximizing likelihood, tying this block to the previous one.
Worked example. For two Gaussians there is a closed form:
$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.$$Plugging in $\mathcal{N}(1, 1)$ versus $\mathcal{N}(0, 1)$: the logarithm vanishes, the fraction is $(1 + 1)/2 = 1$, and the result is $0.5$ nats. Shifting a unit Gaussian by one standard deviation costs exactly half a nat. The snippet below verifies the formula numerically and demonstrates the MLE identities at the same time:
import numpy as np
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=3.0, scale=2.0, size=10_000) # data from N(3, 2^2)
mu_hat = samples.mean() # Gaussian MLE of the mean is the sample mean
sigma_hat = samples.std() # MLE of sigma divides by N (not N-1)
def kl_gauss(mu1, s1, mu2, s2):
return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(mu_hat, sigma_hat) # 3.013 1.996: close to the truth
print(kl_gauss(mu_hat, sigma_hat, 3.0, 2.0)) # 2.4e-05: fitted and true nearly coincide
3. Calculus & Optimization
Deep learning, the subject of Parts III and IV, rests on one short story: write the model's badness as a differentiable function of its parameters, compute the gradient, and walk downhill. This section tells that story in three steps plus one paragraph on when downhill walking is guaranteed to succeed.
Gradients
Definition. For a function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of partial derivatives,
$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right),$$and it points in the direction of steepest increase of $f$ at $\mathbf{x}$, with magnitude equal to the slope in that direction.
Intuition. The book uses gradients in two superficially different but identical ways. An image gradient treats the image as a function of pixel coordinates; its direction is perpendicular to edges, and the Sobel filters of Chapter 3 and the Canny detector of Chapter 9 estimate it. A loss gradient treats the training loss as a function of millions of network parameters; its negative is the direction of fastest improvement, and all of Chapter 18 is built on following it. Same mathematics, different variables.
Worked example. Let $f(x, y) = x^2 + 3y^2$. Then $\nabla f = (2x, 6y)$, and at the point $(1, 1)$ the gradient is $(2, 6)$: the function climbs three times faster along $y$ than along $x$, so steepest descent from $(1, 1)$ initially moves mostly in the $-y$ direction. Elongated bowls like this one are exactly the loss landscapes where naive gradient descent zigzags, motivating the momentum and Adam optimizers of Chapter 18.
The Chain Rule
Definition. If $y = f(g(x))$, then
$$\frac{dy}{dx} = f'\big(g(x)\big) \cdot g'(x),$$and the multivariable version composes Jacobian matrices the same way. Derivatives of nested functions are products of the derivatives of the layers, evaluated at the intermediate values.
Intuition. A neural network is a deeply nested function: loss of softmax of linear of activation of linear of pixels. The chain rule says its gradient is a product of per-layer derivatives, and backpropagation (Chapter 18, Section 18.2) is nothing more than evaluating that product from the loss backward while caching the intermediate values from the forward pass so nothing is recomputed. Every "the gradient flows through the network" sentence in Part III is the chain rule in costume.
Worked example. Take a one-parameter "network" $L(w) = (\sigma(w x) - t)^2$ with input $x$, target $t$, and sigmoid $\sigma(z) = 1/(1 + e^{-z})$, whose derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Peeling the layers from the outside in: $\frac{dL}{dw} = \underbrace{2(\sigma(w x) - t)}_{\text{loss layer}} \cdot \underbrace{\sigma(w x)\big(1 - \sigma(w x)\big)}_{\text{sigmoid layer}} \cdot \underbrace{x}_{\text{linear layer}}$. Three factors for three layers; a 50-layer network just has 50 factors, organized as matrix products.
Nobody differentiates a modern network by hand. PyTorch's autograd records the forward computation as a graph and applies the chain rule automatically: wrap parameters in tensors with requires_grad=True, compute the loss, and call loss.backward(); the gradient of every parameter appears in its .grad field. The hand-derived three-factor product above becomes two lines, and the same two lines scale unchanged to the billion-parameter models of Part III. What the library handles internally: graph construction, cached intermediate values, the per-operation derivative rules, and the reverse-order traversal. Chapter 18 opens the hood.
Gradient Descent
Definition. To minimize a differentiable loss $L(\theta)$, gradient descent iterates
$$\theta \leftarrow \theta - \eta\, \nabla L(\theta),$$where the learning rate $\eta > 0$ controls the step size. Stochastic gradient descent (SGD) estimates the gradient on a small random minibatch of data per step, trading per-step accuracy for many more steps per second.
Intuition. The learning rate is the single most consequential hyperparameter in deep learning: too small and training crawls, too large and the iterates overshoot and diverge. The practical recipes, warmup, decay schedules, and adaptive methods like Adam, are covered with the training loop in Chapter 18 and refined in Chapter 21. The snippet below closes the loop with Section 1: it minimizes the same line-fitting loss by gradient descent and lands on the same answer lstsq produced in closed form, continuing from that snippet's A and y:
theta = np.zeros(2) # [slope, intercept], starting from zero
lr = 0.01
for step in range(2000):
pred = A @ theta
grad = 2 * A.T @ (pred - y) / len(y) # gradient of the mean squared error
theta -= lr * grad # the gradient descent update rule
print(theta) # [2.616 0.549]: matches the lstsq solution
numpy.linalg.lstsq computed in closed form, demonstrating that the iterative and analytic routes minimize the same objective.Convexity, in One Paragraph
A function is convex if every line segment between two points on its graph lies on or above the graph; bowls are convex, egg cartons are not. For convex losses, any local minimum is the global minimum, gradient descent with a sensible learning rate provably converges to it, and the least-squares and logistic-regression objectives of classical machine learning are of exactly this kind. Deep network losses are decidedly not convex, which is why none of Part III comes with a convergence certificate, and the working surprise of the field is how well gradient methods do anyway: in practice, overparameterized networks reach excellent minima reliably, a phenomenon discussed alongside the training recipes of Chapter 21.
4. Signals & Sampling
An image is a sampled signal, and three results from signal processing govern what can be done with it: the Fourier decomposition, the convolution theorem, and the Nyquist sampling theorem. They are developed visually in Chapter 4; this section states them compactly.
Fourier Basics
Definition. The discrete Fourier transform (DFT) of a length-$N$ signal $x_0, \dots, x_{N-1}$ is
$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1,$$expressing the signal as a weighted sum of complex sinusoids at frequencies $k/N$ cycles per sample. The transform is invertible, so no information is lost; the fast Fourier transform (FFT) computes it in $O(N \log N)$ rather than $O(N^2)$ operations. For images the transform is applied along both axes, giving a 2D frequency grid.
Intuition. The Fourier transform changes the question from "what is the value at each position?" to "how much of each ripple frequency does the signal contain?". In images, low frequencies are smooth shading and large shapes; high frequencies are edges, texture, and noise. Magnitude says how much of each frequency is present; phase says where the ripples sit, and, perhaps surprisingly, phase carries most of the structural information of a photograph, a classic demonstration revisited in Chapter 4. The JPEG compression of Chapter 1 works by discarding the high-frequency coefficients human eyes barely notice.
Worked example. A pure cosine at 6 Hz, sampled densely, has a DFT with energy concentrated at $\pm 6$ Hz and essentially nowhere else: the spectrum of a single ripple is a single spike. Every real signal's spectrum is a recipe of such spikes.
The Convolution Theorem
Definition. The convolution of signals $f$ and $g$ is $(f * g)(t) = \int f(\tau)\, g(t - \tau)\, d\tau$ (a sliding weighted sum in the discrete case), and the convolution theorem states
$$\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}:$$convolution in the spatial domain is plain pointwise multiplication in the frequency domain, and vice versa.
Intuition. This single identity explains most of Chapter 3 in hindsight. Every kernel has a spectrum, and convolving with the kernel multiplies the image's spectrum by it: a Gaussian kernel has a spectrum that decays with frequency, so Gaussian blur is a low-pass filter; sharpening boosts where blur attenuates; and the box filter's spectrum has oscillating lobes, which is precisely why box blur produces ringing artifacts. The theorem is also a performance trick: for large kernels it is faster to FFT the image, multiply, and inverse-FFT than to slide the kernel directly, a strategy weighed in Section 3.6 of that chapter. The convolution layers of Chapter 19 inherit the same analysis with learned kernels.
Worked example. Blurring twice with a Gaussian of standard deviation $\sigma$ equals blurring once with $\sigma\sqrt{2}$. In the frequency domain the proof is one line: the spectra multiply, and the product of two Gaussian spectra is a Gaussian spectrum whose variance is the sum. Arguing this purely in the spatial domain takes a page; in the frequency domain it is immediate, which is the theorem's daily value.
Nyquist and Aliasing
Definition. The Nyquist-Shannon sampling theorem: a signal containing no frequencies above $f_{\max}$ is perfectly reconstructible from samples taken at any rate $f_s > 2 f_{\max}$. Sampling below that rate causes aliasing: a frequency $f$ above $f_s / 2$ is recorded indistinguishably from a lower frequency, folded to $|f - f_s|$ (in the simplest case), and once folded it can never be separated again.
Intuition. Aliasing is the false pattern that appears when sampling is too coarse for the detail present: moiré stripes on a fine-checked shirt, jagged staircase edges on a downscaled render. The practical rule it dictates is burned into every image pipeline: remove the frequencies you cannot represent before you resample. This is why proper downscaling blurs first (the anti-aliasing filter of Chapter 4 and the interpolation choices of Chapter 5), why camera sensors historically carried optical low-pass filters in front of the photosites (Chapter 1), and why strided convolutions and pooling layers in CNNs can introduce aliasing of their own, a subtlety revisited in Chapter 19.
Worked example. Sample a 6 Hz cosine at $f_s = 8$ Hz. The Nyquist limit is 4 Hz, and 6 Hz exceeds it, so the samples are identical to those of a $|6 - 8| = 2$ Hz cosine: $\cos(2\pi \cdot 6 n / 8) = \cos(2\pi \cdot 2 n / 8)$ for every integer $n$, because the two arguments differ by a multiple of $2\pi$. The high frequency has put on a low-frequency disguise, and no later processing can take it off.
Aliasing predates digital cameras by a century. In old westerns, stagecoach wheels routinely appear to spin backward: the film's 24 frames per second undersamples the spoke rotation, folding it to a small negative frequency. Helicopter rotors filmed at just the right shutter rate appear to hang motionless for the same reason, producing some of the internet's most physics-defying footage. Every such illusion is the worked example above, playing at the cinema.
The Fourier view is having a second life inside deep networks. Plain multilayer perceptrons trained on raw $(x, y)$ coordinates are spectrally biased: they learn low frequencies easily and high-frequency detail almost not at all. Tancik et al.'s "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains" (NeurIPS 2020) showed that mapping coordinates through a bank of sinusoids before the network fixes this, and the positional encoding inside NeRF (Chapter 27) is exactly such a feature map. Sitzmann et al.'s SIREN (NeurIPS 2020) reaches the same goal by making the activations themselves sinusoidal. The sampling theorem your camera obeys and the encoding that lets a network memorize a radiance field are the same mathematics.
1. Conceptual. The matrix $\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}$ stretches the $x$-axis by 3. Without computing characteristic polynomials, name its eigenvectors and eigenvalues, then explain why the vector $(1, 1)$ is not an eigenvector.
2. Coding. Extend the least-squares snippet to fit a parabola $y = \theta_1 x^2 + \theta_2 x + \theta_3$ by adding an $x^2$ column to the design matrix, then confirm that the gradient-descent loop recovers the same three coefficients. Watch what happens to the required number of steps, and relate it to the conditioning discussion in the gradients block.
3. Analysis. A 4000-pixel-wide photograph of a picket fence with 1000 evenly spaced pickets is downscaled to 1500 pixels without pre-blurring. Using the Nyquist criterion, predict whether the fence pattern survives, and compute the apparent spatial frequency of the alias if it does not.