Part I: Image Processing
Chapter 0: Foundations: The Python Imaging Stack

Images as Arrays: Pixels, Channels & Dtypes

"People keep telling me I have hidden depth. Technically I have three channels, a dtype, and trust issues about my value range."

A Mildly Philosophical Pixel
Big Picture

An image in Python is not represented by a NumPy array; it is a NumPy array, full stop. Once you accept that, three questions describe any image completely: what is its shape (how many rows, columns, and channels), what is its dtype (what kind of number lives in each cell), and what value range do those numbers occupy. Every operation in this book, from a Gaussian blur in Chapter 3 to a denoising step inside a diffusion model in Chapter 33, is just arithmetic on this array. This section builds that mental model from the ground up.

This is the first section of the book, so we begin at the absolute beginning: not with light, not with cameras (that story is told in Chapter 1), but with the object your code actually touches. Open an interpreter and follow along; every snippet below runs as written, with no image files required.

1. The Image Is the Array (Beginner)

A grayscale image is a two-dimensional grid of brightness values. In NumPy terms, it is a 2-D array of shape $(H, W)$: $H$ rows and $W$ columns. The element at row $y$, column $x$ is the pixel at that location, and mathematically we treat the image as a function

$$I : \{0, \dots, H-1\} \times \{0, \dots, W-1\} \rightarrow \mathbb{V},$$

where $\mathbb{V}$ is the set of representable values (for the common 8-bit case, the integers $0$ to $255$). Two conventions surprise newcomers immediately. First, the origin is the top-left corner, not the bottom-left as in mathematics class: row indices grow downward. Second, indexing is img[row, col], that is, vertical coordinate first. Figure 0.1.1 makes both conventions concrete, and Section 0.4 returns to the trouble they cause when mixed with the $(x, y)$ convention used by drawing and resizing functions.

[Figure 0.1.1 diagram: a 6 × 8 grid of brightness values with the origin (0, 0) at the top-left. Axis 0: rows (y), 0 .. H-1, growing downward. Axis 1: columns (x), 0 .. W-1, growing rightward. img.shape == (6, 8) (H rows, W columns). Highlighted cell: img[2, 5] == 197 (row y = 2, column x = 5).]
 12  31  60  90 120 150 180 210
 26  46  75 105 135 165 195 225
 40  61  90 120 150 197 210 240
 53  75 105 135 165 195 225 250
 66  88 118 148 178 208 238 255
 80 102 132 162 192 222 246 255
Figure 0.1.1 A grayscale image is a 2-D array. The origin sits at the top-left, rows (axis 0) grow downward, columns (axis 1) grow rightward, and each cell holds one brightness value. The highlighted pixel is addressed as img[2, 5]: row first, then column.
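To make the two conventions tangible, the following sketch types the grid of Figure 0.1.1 in as a literal array and reads the highlighted pixel back. The variable name grid is ours, chosen only for this illustration.

```python
import numpy as np

# The 6 x 8 brightness grid from Figure 0.1.1, typed in row by row.
grid = np.array([
    [12, 31, 60, 90, 120, 150, 180, 210],
    [26, 46, 75, 105, 135, 165, 195, 225],
    [40, 61, 90, 120, 150, 197, 210, 240],
    [53, 75, 105, 135, 165, 195, 225, 250],
    [66, 88, 118, 148, 178, 208, 238, 255],
    [80, 102, 132, 162, 192, 222, 246, 255],
], dtype=np.uint8)

print(grid.shape)    # (6, 8): H rows, W columns
print(grid[2, 5])    # 197: row y = 2, column x = 5
print(grid[0, 0])    # 12: the origin is the top-left cell
print(grid[5, 2])    # 132: swapping the indices addresses a different pixel
```

The last line is the one to remember: img[x, y] does not fail on most images, it silently reads the wrong pixel.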

Let us build exactly this kind of object from nothing. The code below synthesizes a horizontal brightness ramp, then interrogates it the way you should interrogate every image you ever load: shape, dtype, extreme values.

import numpy as np

# A horizontal ramp: each row runs 0..255 left to right.
row = np.linspace(0, 255, 320, dtype=np.uint8)   # one row, 320 values
img = np.tile(row, (240, 1))                     # stack it 240 times

print(img.shape)          # (240, 320)  -> 240 rows (H), 320 columns (W)
print(img.dtype)          # uint8
print(img.min(), img.max())  # 0 255
print(img[120, 160])      # 127  -> the pixel at row 120, column 160
Code Fragment 0.1.1: Synthesizing a 240 by 320 grayscale ramp and asking it the three diagnostic questions: shape, dtype, and value range.

Everything you know about NumPy now applies to pictures. Slicing crops: img[60:180, 80:240] is a rectangular region of interest. Boolean masking selects: img[img > 200] = 255 brightens highlights, a one-line preview of the thresholding we study properly in Chapter 2. Aggregations measure: img.mean() is the average brightness of the whole frame. There is no separate "image API" to learn for any of this; the array API is the image API.
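The three idioms just named can be tried on the ramp from Code Fragment 0.1.1. This is a small sketch rather than a recipe; the names roi, bright, and out are ours, and the ramp is rebuilt so the snippet runs on its own.

```python
import numpy as np

# Same 240 x 320 ramp as Code Fragment 0.1.1, rebuilt so this runs standalone.
row = np.linspace(0, 255, 320, dtype=np.uint8)
img = np.tile(row, (240, 1))

roi = img[60:180, 80:240]            # crop: rows 60..179, columns 80..239
print(roi.shape)                     # (120, 160)

bright = img > 200                   # boolean mask with the image's shape
print(bright.dtype, bright.shape)    # bool (240, 320)

out = img.copy()                     # work on a copy, keep the ramp intact
out[bright] = 255                    # push every highlight to full white
print(out.max(), out[bright].min())  # 255 255
print(img.mean() <= out.mean())      # True: brightening cannot lower the mean
```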

Key Insight: Three Questions Describe Any Image

Before doing anything with an image, ask: (1) What is its shape? $(H, W)$ means grayscale, $(H, W, 3)$ means color, $(H, W, 4)$ means color plus alpha. (2) What is its dtype? uint8, uint16, and float32 imply different value ranges and different arithmetic behavior. (3) What is its actual value range? A float image whose values run 0 to 255 instead of 0 to 1 is a bug waiting to detonate. Printing these three facts takes one line and catches a large share of pipeline failures before they propagate.
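One possible shape for that one-line check is sketched below. The helper name describe is hypothetical, not part of NumPy or any imaging library.

```python
import numpy as np

def describe(img):
    """Return shape, dtype, min, and max of an image array as one string."""
    lo, hi = img.min().item(), img.max().item()   # plain Python scalars
    return f"shape={img.shape} dtype={img.dtype} range=[{lo}, {hi}]"

gray = np.zeros((240, 320), dtype=np.uint8)
gray[:, 160:] = 255
print(describe(gray))   # shape=(240, 320) dtype=uint8 range=[0, 255]

# A float image that still carries 0..255 values: the range gives it away.
suspect = gray.astype(np.float32)
print(describe(suspect))   # shape=(240, 320) dtype=float32 range=[0.0, 255.0]
```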

2. Channels: The Third Axis (Beginner)

Color enters as a third axis. A color image is an array of shape $(H, W, 3)$, where the last axis holds the three color components of each pixel. In the RGB convention used by Pillow, scikit-image, Matplotlib, and essentially all deep learning code, those components are red, green, and blue in that order; OpenCV famously stores them reversed, as BGR, a historical accident dissected in Section 0.4. Indexing img[y, x] now returns a length-3 vector rather than a scalar, and img[:, :, 0] peels off an entire channel as a 2-D array. Figure 0.1.2 shows the geometry: three aligned planes stacked along the last axis.

[Figure 0.1.2 diagram: three aligned channel planes stacked along the last axis, R = img[:, :, 0], G = img[:, :, 1], B = img[:, :, 2]. img.shape == (H, W, 3) and img[y, x] == [r, g, b]: one pixel is one 3-vector, its components stored side by side along the last (fastest) axis.]
Figure 0.1.2 A color image is a stack of three channel planes along the last axis, shape $(H, W, 3)$. Reading img[y, x] pierces all three planes at once and returns the pixel's color vector; slicing img[:, :, c] extracts one full plane.

The following snippet constructs a tiny color image by direct channel assignment, then verifies what landed where. Building images by hand like this is more than a toy exercise: it is the standard way to create test fixtures for vision code, because you know the ground truth of every pixel.

import numpy as np

img = np.zeros((100, 300, 3), dtype=np.uint8)  # black canvas, RGB order
img[:, :100, 0] = 255      # left third: pure red
img[:, 100:200, 1] = 255   # middle third: pure green
img[:, 200:, 2] = 255      # right third: pure blue

print(img[50, 50])    # [255   0   0]  red pixel
print(img[50, 150])   # [  0 255   0]  green pixel
print(img[50, 250])   # [  0   0 255]  blue pixel

red_plane = img[:, :, 0]
print(red_plane.shape, red_plane.mean().round(1))  # (100, 300) 85.0
Code Fragment 0.1.2: Painting a red-green-blue flag by assigning into channel slices, then reading individual pixels back as 3-vectors and extracting the red plane as a standalone 2-D array.

Two more channel layouts deserve a mention now so they do not ambush you later. Images with transparency carry a fourth alpha channel, shape $(H, W, 4)$. And deep learning frameworks prefer the channel axis first: a PyTorch image tensor is $(C, H, W)$, and a batch is $(N, C, H, W)$. The conversion is a single np.transpose(img, (2, 0, 1)), but forgetting it is a rite of passage we will formalize in Chapter 18.
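The layouts above can be exercised with NumPy alone. The sketch below assumes nothing beyond np.transpose, np.newaxis, and np.dstack; the variable names are ours, and no deep learning framework is involved.

```python
import numpy as np

hwc = np.zeros((100, 300, 3), dtype=np.uint8)   # channels-last, as in NumPy imaging
hwc[:, :100, 0] = 255                           # left third red

chw = np.transpose(hwc, (2, 0, 1))              # channels-first view, no data copied
print(chw.shape)                                # (3, 100, 300)
print(chw[0, 50, 50], hwc[50, 50, 0])           # 255 255: same value, new address

batch = chw[np.newaxis, ...]                    # add a leading batch axis of size 1
print(batch.shape)                              # (1, 3, 100, 300)

back = np.transpose(chw, (1, 2, 0))             # inverse permutation restores HWC
print(np.array_equal(back, hwc))                # True

alpha = np.full((100, 300), 255, dtype=np.uint8)   # fully opaque alpha plane
rgba = np.dstack([hwc, alpha])                     # append it as a fourth channel
print(rgba.shape)                                  # (100, 300, 4)
```

Note that the forward permutation is (2, 0, 1) and its inverse is (1, 2, 0); applying the forward one twice does not bring you back.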

Fun Fact

The reason the channel axis comes last in NumPy imaging is cache locality: the three bytes of one pixel sit adjacent in memory, so operations that touch whole pixels stream beautifully through the CPU. The reason deep learning frameworks put channels first is also cache locality, just for a different consumer: convolution kernels want each channel contiguous. Same argument, opposite conclusions, decades of transposes.

3. Dtypes: The Contract About What Numbers Mean (Intermediate)

The dtype of an image array is not a storage detail; it is a contract. It declares how much memory each value occupies, what range it can represent, and, by strong convention, what range it is expected to occupy. The three dtypes you will meet constantly are summarized in Table 0.1.1.

Table 0.1.1: The three working dtypes of the Python imaging stack.
| Dtype | Bytes/value | Representable range | Conventional image range | Typical sources |
|---|---|---|---|---|
| uint8 | 1 | 0 to 255 | 0 to 255 | JPEG/PNG files, OpenCV defaults, screens |
| uint16 | 2 | 0 to 65535 | 0 to 65535 | 16-bit PNG/TIFF, medical and scientific sensors, RAW pipelines |
| float32 | 4 | $\pm 3.4 \times 10^{38}$ | 0.0 to 1.0 (or -1.0 to 1.0 in generative models) | scikit-image outputs, neural network inputs |

An 8-bit channel offers $2^8 = 256$ distinct levels, which is roughly the limit of what human eyes distinguish under normal viewing; a 16-bit channel offers $2^{16} = 65536$ levels, which matters when you plan to stretch shadows or process medical scans, as discussed when we treat bit depth and dynamic range in Chapter 1. Floats exist not for storage but for mathematics: the moment you average, filter, or feed an image to a network, you want real-number arithmetic without overflow. Memory follows directly from the contract: a 12-megapixel RGB photo costs $4000 \times 3000 \times 3 = 36$ MB as uint8 and four times that, 144 MB, as float32. That factor of four decides batch sizes on GPUs for the rest of the book.
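The memory arithmetic is easy to reproduce without allocating a full-size photo. In this sketch the helper image_bytes is a hypothetical name; it multiplies the element count by the dtype's itemsize, which is how ndarray.nbytes is defined.

```python
import numpy as np

def image_bytes(h, w, channels, dtype):
    """Bytes needed for an (h, w, channels) array of the given dtype."""
    return h * w * channels * np.dtype(dtype).itemsize

for dt in (np.uint8, np.uint16, np.float32):
    mb = image_bytes(3000, 4000, 3, dt) / 1e6
    print(np.dtype(dt).name, mb, "MB")
# uint8 36.0 MB
# uint16 72.0 MB
# float32 144.0 MB

# Cross-check against a real (small) array: nbytes follows the same formula.
small = np.zeros((30, 40, 3), dtype=np.float32)
print(small.nbytes == image_bytes(30, 40, 3, np.float32))   # True
```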

The danger zone is integer arithmetic. Unsigned 8-bit values wrap around modulo 256, so adding brightness can make pixels darker:

import numpy as np

a = np.full((2, 2), 200, dtype=np.uint8)
b = np.full((2, 2), 100, dtype=np.uint8)

print(a + b)
# [[44 44]
#  [44 44]]      because (200 + 100) % 256 == 44: wraparound!

# The safe patterns:
print((a.astype(np.uint16) + b).clip(0, 255).astype(np.uint8))
# [[255 255]
#  [255 255]]    promote, clip, demote

mean = (a.astype(np.float32) + b.astype(np.float32)) / 2
print(mean.astype(np.uint8))
# [[150 150]
#  [150 150]]    averages need float (or uint16) intermediates
Code Fragment 0.1.3: uint8 addition wraps modulo 256, turning 200 + 100 into 44; promoting to a wider dtype before arithmetic and clipping back restores the intended saturation behavior.

Formally, uint8 addition computes $(a + b) \bmod 256$, while what you almost always want is the saturating sum $\min(a + b, 255)$. OpenCV's cv2.add saturates for exactly this reason, one of several arithmetic conventions we contrast carefully in Section 0.4. The second classic dtype accident is conversion without rescaling: calling astype(np.uint8) on a float image in $[0, 1]$ truncates nearly everything to zero. Conversions must rescale, $x_{\text{uint8}} = \lfloor 255 \, x_{\text{float}} + 0.5 \rfloor$, not merely cast.
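Both formulas in the paragraph above can be checked numerically. The sketch below uses two hypothetical helpers, float_to_u8 and saturating_add, written only for illustration; they assume float input in $[0, 1]$ and uint8 operands respectively.

```python
import numpy as np

f = np.array([0.0, 0.25, 0.5, 0.999, 1.0], dtype=np.float32)

print(f.astype(np.uint8))             # [0 0 0 0 1]: a bare cast truncates

def float_to_u8(x):
    """Rescale a float image in [0, 1] to uint8: floor(255 * x + 0.5), clipped."""
    return np.floor(255.0 * np.clip(x, 0.0, 1.0) + 0.5).astype(np.uint8)

print(float_to_u8(f))                 # [  0  64 128 255 255]

def saturating_add(a, b):
    """min(a + b, 255) for uint8 inputs, computed in a wider dtype."""
    wide = a.astype(np.uint16) + b.astype(np.uint16)
    return np.minimum(wide, 255).astype(np.uint8)

a = np.array([200, 10], dtype=np.uint8)
b = np.array([100, 20], dtype=np.uint8)
print(a + b, saturating_add(a, b))    # [44 30] [255  30]
```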

Library Shortcut: Safe Dtype Conversion with scikit-image

You could write your own conversion utility that checks the input dtype, rescales to the target range, rounds, and clips: about 15 lines with all the branches done right. scikit-image ships it as one line per direction:

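import numpy as np
from skimage.util import img_as_float32, img_as_ubyte

img_u8 = np.array([[0, 128, 255]], dtype=np.uint8)   # stand-in for any uint8 image

f = img_as_float32(img_u8)   # uint8 0..255  -> float32 0.0..1.0
u = img_as_ubyte(f)          # float 0..1    -> uint8 0..255, rounded and clipped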
Code Fragment 0.1.5: The scikit-image conversion pair that replaces a hand-written, branch-heavy rescaling utility with two self-documenting calls.

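That is a 15-to-2 line reduction, and the library handles the parts you would forget: float inputs outside $[-1, 1]$ raise a ValueError instead of wrapping silently, negative floats inside that range are clipped to 0 when the target is unsigned, widening uint8 to uint16 multiplies by 257 so that 255 maps exactly to 65535, and bool images map cleanly to 0 and 255.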

Practical Example: The Vanishing Tumors

Who: A machine learning engineer at a medical imaging startup, preparing CT slices for a detection model.

Situation: Source scans arrived as 16-bit DICOM-derived TIFFs with diagnostically meaningful detail in narrow intensity bands.

Problem: A data-loading utility written for photos called astype(np.uint8) on the 16-bit arrays. Values above 255 wrapped modulo 256, shredding the intensity structure; the training images looked like static in exactly the regions radiologists cared about, and model recall on small lesions was inexplicably poor.

Decision: The engineer added a dtype audit at ingestion (log every file's dtype, min, max), replaced the cast with an explicit windowed rescale from the 16-bit range to float32, and made the loader refuse any image whose dtype it did not recognize.

Result: Lesion recall improved by double digits with zero model changes, and the ingestion audit caught two further format surprises in the following month.

Lesson: A cast is not a conversion. Every dtype change must state where the values came from and where they are going.
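A minimal sketch of the kind of windowed rescale the engineer might have written follows. The function name window_to_float, the synthetic scan values, and the window bounds 1000 and 1200 are all made up for the demonstration; real windows depend on the modality and the clinical task.

```python
import numpy as np

def window_to_float(img16, lo, hi):
    """Map the intensity window [lo, hi] of a uint16 image onto float32 [0, 1]."""
    if img16.dtype != np.uint16:
        raise TypeError(f"expected uint16, got {img16.dtype}")   # refuse surprises
    x = (img16.astype(np.float32) - lo) / float(hi - lo)
    return np.clip(x, 0.0, 1.0)

# Synthetic stand-in for a scan; values and window bounds are illustrative only.
scan = np.array([[900, 1000], [1100, 1300]], dtype=np.uint16)

print(scan.dtype, scan.min(), scan.max())   # the ingestion audit: uint16 900 1300
print(scan.astype(np.uint8))                # the buggy cast: values wrap modulo 256
print(window_to_float(scan, lo=1000, hi=1200))   # ordering of intensities survives
```

The cast turns 900, 1000, 1100, 1300 into 132, 232, 76, 20, destroying their order; the windowed rescale maps them to 0, 0, 0.5, 1, which keeps it.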

4. Under the Hood: Memory, Strides, Views & Copies (Advanced)

One level beneath shape and dtype sits the machinery that makes NumPy fast: a flat block of bytes plus strides, the number of bytes to step to move one position along each axis. For a contiguous $(H, W, 3)$ uint8 image the strides are $(3W, 3, 1)$: one byte to the next channel, three bytes to the next pixel, $3W$ bytes to the next row. The byte offset of element $(y, x, c)$ is simply

$$\text{offset}(y, x, c) = y \cdot 3W + x \cdot 3 + c.$$
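The offset formula can be verified against NumPy itself on a tiny array. In this sketch the function offset is a hypothetical helper; because the dtype is uint8, one element is one byte, so the byte offset doubles as an index into the flattened array.

```python
import numpy as np

H, W = 4, 5
img = np.arange(H * W * 3, dtype=np.uint8).reshape(H, W, 3)   # contiguous, values 0..59
flat = img.ravel()                     # the same bytes seen as a 1-D array

def offset(y, x, c, width):
    """Byte offset of element (y, x, c) in a contiguous (H, W, 3) uint8 image."""
    return y * 3 * width + x * 3 + c

y, x, c = 2, 3, 1
print(img.strides)                                # (15, 3, 1), i.e. (3W, 3, 1)
print(offset(y, x, c, W))                         # 40
print(flat[offset(y, x, c, W)] == img[y, x, c])   # True
```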

Why should a practitioner care? Because strides explain the single most consequential performance fact about NumPy images: basic slicing does not copy. A crop like roi = img[100:200, 50:150] creates a view, a new array object pointing into the same bytes with adjusted offsets and strides. (Reading with a Boolean mask or an integer index array, by contrast, always returns a copy.) Views make cropping free, but they also mean that writing into a view writes into the original, and that some downstream consumers (certain OpenCV functions, serialization, C extensions) require contiguous memory and will either copy behind your back or refuse non-contiguous input.

import numpy as np

img = np.zeros((240, 320, 3), dtype=np.uint8)
print(img.strides)            # (960, 3, 1)   bytes per step along each axis

roi = img[100:200, 50:150]    # a view: no pixels are copied
roi[:] = 255                  # ... so this writes into img itself!
print(img[150, 100])          # [255 255 255]  the "original" changed

flipped = img[::-1]           # vertical flip as a negative-stride view
print(flipped.strides)        # (-960, 3, 1)
print(flipped.flags['C_CONTIGUOUS'])   # False

safe = img[100:200, 50:150].copy()     # an independent crop
print(np.shares_memory(img, roi), np.shares_memory(img, safe))  # True False
Code Fragment 0.1.4: Slices are views that share memory with the parent image: writing into a region of interest mutates the original, a flip is just negative strides, and .copy() is the explicit way to cut the cord.

The rule of thumb: treat views as read-only windows unless mutation of the parent is exactly what you intend, and reach for .copy() whenever an image crosses a function boundary that might write. We will see this rule earn its keep in Section 0.4, where an in-place ROI edit corrupts a source image, and again throughout Chapter 5, where geometric operations must decide between viewing and resampling.
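One way to enforce the read-only half of that rule, rather than merely promise it, is NumPy's writeable flag. The sketch below is one option among several; the names window and crop are ours.

```python
import numpy as np

img = np.zeros((240, 320, 3), dtype=np.uint8)

window = img[100:200, 50:150]      # a view into img
window.flags.writeable = False     # lock this view; img itself stays writable

try:
    window[:] = 255
except ValueError as err:
    print("blocked:", err)         # NumPy refuses to write through the locked view

crop = img[100:200, 50:150].copy() # independent pixels for code that needs to write
crop[:] = 255
print(img.max(), crop.max())       # 0 255: the parent was never touched
```

Locking a view protects against accidental writes through that particular array object only; the parent and any other view of it remain writable.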

Research Frontier: The Array Contract Goes Cross-Platform

The shape-dtype-strides contract this section teaches is being standardized across the entire scientific Python world. The Python Array API standard (the 2023.12 revision and its successors) defines a common interface that NumPy 2.0 (released June 2024) implements natively, and scikit-image has been rolling out experimental array-API support since version 0.25 (late 2024) so the same image code can run on CuPy GPU arrays or PyTorch tensors. Zero-copy exchange between frameworks rides on DLPack (np.from_dlpack, torch.from_dlpack, and the older torch.utils.dlpack helpers), which lets an array that already lives on a given device change owners without duplicating a byte; torch.from_numpy achieves the same sharing for CPU arrays by wrapping NumPy's own buffer. Moving a 36 MB photo from host memory to the GPU still costs one transfer, but the framework handoffs on either side of it need not copy. Meanwhile the dtype frontier keeps moving downward: vision training increasingly runs in bfloat16 and, on 2024-2026 accelerators such as NVIDIA's Blackwell generation, in FP8 formats, making "what exactly does this number mean" a live research question rather than a beginner's footnote.
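The DLPack handshake can be observed without installing a second framework, because a NumPy array is itself a DLPack producer. The sketch below assumes NumPy 1.22 or later, where np.from_dlpack is available; it stays on the CPU and shows only the sharing, not a GPU transfer.

```python
import numpy as np

photo = np.zeros((4, 6, 3), dtype=np.uint8)

# np.from_dlpack accepts any object exposing __dlpack__; a NumPy array is one,
# so this round trip stays inside NumPy and needs no other framework installed.
shared = np.from_dlpack(photo)

print(np.shares_memory(photo, shared))   # True: same bytes, no copy
photo[0, 0] = 255
print(shared[0, 0])                      # [255 255 255]: the write is visible
```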

5. Looking Ahead: From Arrays to Everything Else (Beginner)

Every later chapter consumes the model built here. Histograms in Chapter 2 are statistics over these array values. Convolution in Chapter 3 slides kernels across these axes. PyTorch tensors in Chapter 18 are this same object with gradients and a transposed channel axis, and the noise that diffusion models learn to remove in Chapter 33 is sampled into float arrays shaped exactly like the ones you built today. The next section widens the view from the object itself to the ecosystem of libraries that all agreed, with one famous color-order exception, to speak this array language.

Exercise 0.1.1: The Wraparound Audit (Conceptual)

A colleague brightens an image with img + 60 where img is uint8, and reports that the sky in the result turned dark gray. Explain precisely what happened to a sky pixel of value 230, state the general formula for the corrupted result, and propose two distinct fixes that preserve the uint8 output dtype. Then explain why (img + 60).clip(0, 255) is not one of them.

Exercise 0.1.2: Checkerboard Factory (Coding)

Write a function checkerboard(h, w, square, c0=0, c1=255) that returns an $(h, w)$ uint8 checkerboard with cells of side square pixels, using only array operations (no Python-level double loop over pixels; a hint: integer-divide coordinate grids from np.arange, then test parity). Extend it to produce an RGB version where odd squares are a color of your choice. Verify correctness by checking the mean value analytically and with .mean().

Exercise 0.1.3: Strides Forensics (Analysis)

For a contiguous uint8 array of shape (480, 640, 3), predict on paper the strides of: (a) the array itself, (b) img[::2, ::2], (c) img.transpose(2, 0, 1), and (d) img[:, ::-1]. Check each prediction in NumPy, then determine which of the four results are C-contiguous and which share memory with the original, using .flags and np.shares_memory. Summarize in one paragraph when a vision pipeline should insert np.ascontiguousarray.