"Classification asked me one rude question about the whole picture and walked off. Detection at least drew a box around what it cared about. Segmentation sat down with every single pixel and asked, gently, who exactly are you. It took longer, but for the first time I felt seen."
A Pixel That Finally Got a Label of Its Own
Segmentation is classification carried out per pixel: instead of one label for an image or one box for an object, the model assigns a category, and sometimes an identity, to every location in the frame. That single change of granularity drives the entire chapter. A network that outputs a dense map rather than a single vector must keep spatial resolution alive through pooling and strides, which is why the encoder-decoder shape returns again and again. Asking "which class is this pixel" gives semantic segmentation; asking additionally "which object instance is this pixel" gives instance segmentation; asking both at once over the whole image gives panoptic segmentation. Transformers then reframe all three as predicting a set of masks rather than labeling a grid, and the Segment Anything Model takes the final step: a single model that segments whatever you point at, with no fixed list of classes at all. By the end you will be able to read a per-pixel logit map, train a U-Net, run Mask R-CNN and SAM from a library, and report the right number, mean Intersection-over-Union or panoptic quality, for the right task.
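To make "reading a per-pixel logit map" concrete, here is a minimal NumPy sketch. The logit values are invented for illustration, and the (C, H, W) layout is an assumption chosen to match what a segmentation head typically emits for one image.

```python
import numpy as np

# Toy logits for a 3-class problem on a 2x3 image, laid out as (C, H, W).
# The numbers are made up; a real network would produce them.
logits = np.array([
    [[2.0, 0.1, 0.0], [1.5, 0.2, 0.1]],   # class 0 scores
    [[0.5, 3.0, 0.2], [0.3, 2.5, 0.0]],   # class 1 scores
    [[0.1, 0.2, 4.0], [0.0, 0.1, 3.5]],   # class 2 scores
])

# Softmax over the class axis turns scores into per-pixel probabilities.
shifted = logits - logits.max(axis=0, keepdims=True)   # numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)

# The semantic label map is the highest-scoring class at each pixel.
labels = logits.argmax(axis=0)
print(labels.tolist())   # [[0, 1, 2], [0, 1, 2]]
```

The argmax collapses the class axis and leaves an (H, W) map of class indices, which is the object every semantic-segmentation metric in this chapter is computed on.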
Every method in this chapter answers the same question, "what is at this location," at a finer and finer grain. Hold this five-rung ladder in mind and each section snaps into place:
- What (semantic, 24.1): a class per pixel. FCN, U-Net, DeepLab.
- What and which one (instance, 24.2): a class plus an identity per object. Mask R-CNN.
- What and which one, everywhere (panoptic, 24.3): things and stuff in one partition. Panoptic quality.
- The same answer, as a set of masks (transformers, 24.4): one model, all three readouts. SegFormer, Mask2Former.
- Whatever you point at (promptable, 24.5): no class list at all. SAM. (And 24.6 tells you how to score every rung.)
Underneath all five runs a single engineering refrain worth memorizing: classification threw away the resolution; segmentation is the art of getting it back. Skip connections, dilation, and learned upsampling are just three answers to that one sentence. The Hands-On Lab at the end of this chapter builds the first rung, a working U-Net, from scratch, trains it on a self-generating dataset, and scores it with the mean IoU and Dice metrics of Section 24.6, so you exercise the encoder-decoder idea and the measurement toolkit in one runnable program.
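The refrain can be checked with plain arithmetic. The sketch below uses the standard output-size formulas for convolution, pooling, and transposed convolution along one axis; the helper names conv_out and deconv_out are hypothetical, chosen only for this illustration.

```python
def conv_out(n, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def deconv_out(n, kernel, stride=1, padding=0, dilation=1, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


n = 64
same = conv_out(n, kernel=3, padding=1)                 # 3x3 conv, padding 1
pooled = conv_out(n, kernel=2, stride=2)                # 2x2 max pool, stride 2
dilated = conv_out(n, kernel=3, padding=2, dilation=2)  # dilated 3x3 conv
restored = deconv_out(pooled, kernel=2, stride=2)       # learned 2x upsampling

print(same, pooled, dilated, restored)   # 64 32 64 64
```

Pooling is the step that discards resolution (64 to 32). The dilated convolution widens its view to five pixels while leaving the grid at 64, and the stride-2 transposed convolution brings a pooled map back to 64, which is the pairing the lab's U-Net relies on.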
Chapter Overview
In Chapter 19 a convolutional network turned an image into a single label, and in the previous chapter, Chapter 23: Object Detection, it learned to draw boxes around objects and name them. A box is a coarse answer. It tells you roughly where the dog is, but it cannot say which pixels are dog and which are the grass behind it, and it cannot trace the irregular outline of a tumor, a road, or a piece of cloth. Many real tasks need exactly that pixel-precise outline: a self-driving stack must know which pixels are drivable road, a medical tool must measure the area of a lesion, a photo editor must cut a subject out cleanly. Segmentation is the family of methods that produces those dense, per-pixel answers, and this chapter walks the full arc from the first fully convolutional networks to the promptable foundation models of 2023 onward.
We begin with semantic segmentation in Section 24.1, where the task is a label per pixel with no notion of separate objects. The central engineering problem is resolution: convolutional backbones throw spatial detail away through pooling and strides, and a segmenter must get it back. Three influential answers, the fully convolutional network with skip connections, the symmetric U-Net, and the dilated-convolution DeepLab family, each solve that problem differently, and all three descend directly from the encoder-decoder and multi-scale ideas you met as image pyramids in Chapter 4. Section 24.2 adds the instance dimension with Mask R-CNN, which bolts a small mask-predicting branch onto the detector of Chapter 23 and, almost as an afterthought, fixes a quiet alignment bug with the RoIAlign operation that the whole field then adopted.
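The alignment problem that RoIAlign addresses can be illustrated in a few lines. This is a simplified sketch, not Mask R-CNN's implementation: it compares one sample taken at a truncated integer coordinate with one bilinear sample at the true fractional coordinate, whereas the real operation averages several such samples per output bin. The function name bilinear_sample and the toy feature map are assumptions for illustration.

```python
import numpy as np


def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a fractional (y, x) by bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


# A feature map whose value equals its column index, so the "true" value
# at horizontal position x is simply x.
feat = np.tile(np.arange(8, dtype=np.float64), (8, 1))

y, x = 3.0, 2.6                         # fractional location on the feature grid
quantized = feat[int(y), int(x)]        # snap to the grid: reads column 2
aligned = bilinear_sample(feat, y, x)   # interpolate between columns 2 and 3

print(quantized, round(float(aligned), 3))   # 2.0 2.6
```

The quantized read is off by 0.6 of a feature cell. On a coarse feature map each cell covers many image pixels, so an error of that size is enough to misplace a mask boundary.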
Section 24.3 unifies the two views. Real scenes contain countable "things" (people, cars) and uncountable "stuff" (sky, road, vegetation); panoptic segmentation labels every pixel with both a class and, for things, an instance identity, and introduces the panoptic quality metric that scores the result in one number. Section 24.4 is the turning point: the mask transformers. SegFormer rebuilds the semantic segmenter on a hierarchical transformer backbone with a startlingly simple decoder, and Mask2Former makes the conceptual leap that the same architecture, predicting a set of masks with masked attention, can do semantic, instance, and panoptic segmentation with no task-specific changes. This is the attention thread from Chapter 22 arriving in dense prediction.
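As a preview of the panoptic quality metric, the sketch below follows the definition in Kirillov et al. (2019): the summed IoU of matched segment pairs divided by the number of true positives plus half the false positives and half the false negatives. It assumes the matching step (pairs with IoU above 0.5) has already been done; the function name and the sample IoU values are illustrative.

```python
def panoptic_quality(matched_ious, n_false_pos, n_false_neg):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each matched prediction/ground-truth pair.
    n_false_pos:  predicted segments with no match.
    n_false_neg:  ground-truth segments with no match.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * n_false_pos + 0.5 * n_false_neg
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0   # segmentation quality
    rq = tp / denom                              # recognition quality
    return sq * rq, sq, rq


pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], n_false_pos=1, n_false_neg=1)
print(f"PQ={pq:.2f} SQ={sq:.2f} RQ={rq:.2f}")   # PQ=0.60 SQ=0.80 RQ=0.75
```

Splitting the score into SQ and RQ shows where a model loses points: here the matched masks are good (0.80), but one missed object and one spurious one pull recognition down to 0.75.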
Section 24.5 reaches the present: the Segment Anything Model and its successors. Trained on more than a billion masks, SAM takes a prompt (a click, a box, or a rough mask) and returns a segmentation of that object, for any image, with no class list and no fine-tuning. It is segmentation's foundation model, and it changes the workflow from "train a segmenter for your classes" to "prompt a segmenter for your object." Finally, Section 24.6 is the chapter's measurement toolkit: the cross-entropy, Dice, and focal losses that train dense predictors, the IoU, mean IoU, boundary-F1, and panoptic-quality metrics that evaluate them, and the practical traps, class imbalance and boundary error, that trip up every first segmentation project.
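Of the losses named above, focal loss is the one aimed most directly at class imbalance. A minimal NumPy sketch of its binary form follows, assuming the usual definition of minus (1 - p_t) to the power gamma times log p_t and omitting the optional class-balancing weight. The function name and the sample probabilities are illustrative.

```python
import numpy as np


def binary_focal_loss(p_fg, target, gamma=2.0):
    """Per-pixel focal loss from foreground probabilities and 0/1 targets."""
    p_t = np.where(target == 1, p_fg, 1.0 - p_fg)   # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)


p_fg = np.array([0.05, 0.10, 0.60, 0.30])   # predicted foreground probability
target = np.array([0, 0, 1, 1])             # two easy background, two foreground

ce = binary_focal_loss(p_fg, target, gamma=0.0)   # gamma = 0 is plain cross-entropy
fl = binary_focal_loss(p_fg, target, gamma=2.0)

print(np.round(fl / ce, 4))   # down-weighting factor applied to each pixel
```

The ratio is (1 - p_t) squared: the confidently correct background pixels keep only 0.0025 and 0.01 of their cross-entropy, while the poorly predicted foreground pixel keeps 0.49. That is how the loss stops a sea of easy background from drowning out a small object.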
The connecting idea is the one in the Big Picture: segmentation is dense classification, and almost every technique in the chapter is a different answer to "how do I keep, or recover, the spatial resolution that classification was happy to discard." Masks produced here do not stay here. They become the editing regions of Chapter 35, where a segmentation mask tells a generative model exactly where to inpaint, and they connect back to the classical watershed and graph-cut methods of Chapter 11 that did this job before deep learning, by hand-designed energy rather than learned features.
Prerequisites
You should have read Chapter 19: Convolutional Neural Networks for convolution, pooling, strides, and the receptive field, all of which set the resolution problem this chapter solves, and Chapter 20: CNN Architectures for the ResNet backbones that every segmenter sits on. Chapter 23: Object Detection is a direct prerequisite for Section 24.2, because Mask R-CNN extends Faster R-CNN and reuses its region proposals and RoI features. Chapter 22: Vision Transformers supplies the self-attention and patch-embedding mechanics that Section 24.4 turns into mask transformers. Comfort with PyTorch tensors and the training loop from Chapter 18 is assumed throughout, and the IoU metric you first meet in detection is generalized here, so a quick look back at the Chapter 23 evaluation section will pay off.
Chapter Roadmap
- 24.1 Semantic Segmentation: FCN, U-Net & DeepLab A label per pixel, and the resolution problem that defines dense prediction. Fully convolutional networks with skip fusion, the symmetric U-Net encoder-decoder, and DeepLab's dilated convolutions and atrous spatial pyramid pooling. A trainable U-Net built from scratch in PyTorch.
- 24.2 Instance Segmentation: Mask R-CNN Separating individual objects, not just classes. Mask R-CNN adds a per-region mask branch to Faster R-CNN, and RoIAlign fixes the quantization that broke pixel-accurate features. The two-stage detect-then-segment recipe, run end to end with torchvision.
- 24.3 Panoptic Segmentation: Unifying Things & Stuff Labeling every pixel with both a class and, for countable things, an instance identity. The things-versus-stuff distinction, how semantic and instance predictions are merged without overlap, and the panoptic quality metric that scores recognition and segmentation in one number.
- 24.4 Transformer Segmenters: SegFormer & Mask2Former Attention takes over dense prediction. SegFormer's hierarchical encoder and all-MLP decoder, and Mask2Former's mask-classification view with masked attention that does semantic, instance, and panoptic segmentation with one architecture. Why predicting a set of masks beats labeling a grid.
- 24.5 Segment Anything: Promptable Segmentation Segmentation's foundation model. SAM's image encoder, prompt encoder, and lightweight mask decoder, the billion-mask data engine that trained it, ambiguity handling with multiple mask outputs, and the shift from training a segmenter to prompting one. SAM 2 and video extensions.
- 24.6 Losses, Metrics & Evaluation for Dense Prediction The measurement toolkit. Pixel cross-entropy, Dice, Tversky, and focal losses and when each helps, IoU, mean IoU, pixel accuracy, boundary-F1, and panoptic quality, and the practical traps of class imbalance and boundary error that decide whether a segmenter is actually good.
Hands-On Lab: A U-Net Segmenter You Train and Score Yourself
Objective
Build a small U-Net from scratch in PyTorch, train it to segment shapes against a noisy background, and grade it with the exact metrics from Section 24.6, mean Intersection-over-Union and the Dice score. The dataset generates itself with code, so the lab is fully self-contained: there is nothing to download, it runs on a CPU in a couple of minutes, and because every mask is produced by the same generator that made the image, you can trust the ground truth absolutely. By the end you will have walked the chapter's central refrain, classification throws resolution away and segmentation gets it back, through a network you wrote line by line.
What You'll Practice
- Assembling the encoder-decoder with skip connections that defines U-Net (Section 24.1), the architecture that keeps and recovers spatial resolution.
- Generating a self-labeling synthetic segmentation dataset so the per-pixel ground truth is exact, the same trick the two-view lab of Chapter 13 uses for geometry.
- Training a dense predictor with pixel-wise cross-entropy and reading a per-pixel logit map (Section 24.6).
- Computing mean IoU and Dice on validation masks, the right numbers for semantic segmentation, and seeing why pixel accuracy alone can mislead.
- Swapping your hand-built network for a one-line library model to confirm the Right Tool payoff (the stretch goal).
Setup
Two libraries, PyTorch and NumPy, and no dataset; the script synthesizes its own images and masks, so it runs to completion on any machine where the install succeeds. Install with:
pip install torch numpy
Everything runs on the CPU in a couple of minutes at the small image size used here. A GPU, if present, simply makes it faster. Matplotlib is optional and only used by the first stretch goal to visualize a predicted mask.
Steps
Step 1: Generate a self-labeling shapes dataset
Each sample is a small grayscale image with one bright disk on a noisy background, paired with a binary mask that is exactly the disk's pixels. Because one function draws both the image and its mask, the ground truth is perfect, which is what makes the lab self-grading.
import numpy as np
import torch
def make_sample(size=64, rng=None):
    rng = rng or np.random.default_rng()
    img = rng.normal(0.2, 0.1, (size, size)).astype(np.float32)  # noisy background
    cy, cx = rng.integers(16, size - 16, size=2)  # random disk center
    r = rng.integers(8, 14)  # random disk radius
    yy, xx = np.mgrid[:size, :size]
    mask = ((yy - cy) ** 2 + (xx - cx) ** 2) <= r ** 2  # True inside the disk
    img[mask] += 0.7  # brighten the disk
    # TODO: return img as a (1, size, size) float32 tensor and mask as a
    # (size, size) int64 tensor of class labels (0 = background, 1 = disk).
Hint
return torch.from_numpy(img)[None], torch.from_numpy(mask.astype(np.int64)). The leading [None] adds the single channel dimension a convolution expects; the mask stays a 2D map of integer class indices, which is what cross-entropy wants as its target.
Step 2: Build the U-Net blocks
A U-Net is built from one repeated unit: two 3x3 convolutions, each followed by a ReLU, that keep the spatial size fixed. Write it once as a reusable block so the encoder and decoder can both call it.
import torch.nn as nn
def conv_block(in_ch, out_ch):
    # TODO: return an nn.Sequential of: Conv2d(in_ch, out_ch, 3, padding=1),
    # ReLU, Conv2d(out_ch, out_ch, 3, padding=1), ReLU. The padding=1 keeps
    # height and width unchanged so skip connections line up later.
Hint
return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True), nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)). Keeping padding=1 means a 64x64 input stays 64x64, so when you concatenate an encoder feature onto a decoder feature their spatial shapes match exactly.
Step 3: Wire the encoder, decoder, and skip connections
This is the heart of Section 24.1. The encoder halves resolution with max pooling while doubling channels; the decoder upsamples back and, crucially, concatenates the matching encoder feature so fine spatial detail is restored. That concatenation is the skip connection.
class UNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(1, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = conv_block(64, 32)  # 64 = 32 upsampled + 32 skip
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)  # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)  # full resolution
        e2 = self.enc2(self.pool(e1))  # half resolution
        b = self.bottleneck(self.pool(e2))  # quarter resolution
        d2 = self.up2(b)
        # TODO: concatenate the encoder feature e2 onto d2 along the channel
        # dim (dim=1), pass through self.dec2, then upsample with self.up1,
        # concatenate e1, pass through self.dec1, and finally return self.head(...).
Hint
d2 = self.dec2(torch.cat([d2, e2], dim=1)); then d1 = self.up1(d2); d1 = self.dec1(torch.cat([d1, e1], dim=1)); return self.head(d1). The output has shape (batch, n_classes, 64, 64): one logit map per class, the dense readout the chapter keeps returning to.
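If the channel bookkeeping in the skip connection feels abstract, a quick shape check helps. The sketch below uses random stand-in tensors with the shapes the lab's U-Net produces for a batch of two 64x64 images; only the shapes matter here, not the values.

```python
import torch
import torch.nn as nn

# Stand-in activations (random values, shapes as in the lab's U-Net).
b = torch.randn(2, 64, 16, 16)    # bottleneck output: quarter resolution
e2 = torch.randn(2, 32, 32, 32)   # encoder feature: half resolution

up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)   # doubles height and width
d2 = up2(b)
fused = torch.cat([d2, e2], dim=1)              # channels add: 32 + 32 = 64

print(d2.shape)      # torch.Size([2, 32, 32, 32])
print(fused.shape)   # torch.Size([2, 64, 32, 32])
```

The fused tensor has 64 channels, which is why dec2 is declared as conv_block(64, 32). If the spatial sizes of the two inputs differed, torch.cat would raise an error, which is the practical reason every conv_block uses padding=1.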
Step 4: Train with pixel-wise cross-entropy
Generate a fresh batch each step and minimize cross-entropy averaged over every pixel. nn.CrossEntropyLoss expects raw logits of shape (N, C, H, W) and an integer target of shape (N, H, W), exactly what Steps 1 and 3 produce.
torch.manual_seed(0)
rng = np.random.default_rng(0)
model = UNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
def batch(n=16):
    pairs = [make_sample(rng=rng) for _ in range(n)]
    imgs = torch.stack([p[0] for p in pairs])
    masks = torch.stack([p[1] for p in pairs])
    return imgs, masks

for step in range(300):
    imgs, masks = batch()
    logits = model(imgs)
    # TODO: compute the loss with loss_fn(logits, masks), backpropagate,
    # step the optimizer, and zero the gradients. Print the loss every 50 steps.
Hint
The canonical four lines: loss = loss_fn(logits, masks); opt.zero_grad(); loss.backward(); opt.step(). Guard the print with if step % 50 == 0:. The loss should fall steadily from roughly 0.7 toward well under 0.1.
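The starting value of roughly 0.7 is not arbitrary. A network that knows nothing assigns equal probability to both classes, and the cross-entropy of that guess is ln 2, about 0.693. The NumPy sketch below is a hand-written stand-in that mirrors what a mean-reduced, unweighted pixel-wise cross-entropy computes; the function name pixel_cross_entropy is hypothetical.

```python
import numpy as np


def pixel_cross_entropy(logits, target):
    """Mean cross-entropy for logits (N, C, H, W) and integer targets (N, H, W)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    picked = np.take_along_axis(log_probs, target[:, None], axis=1)[:, 0]
    return -picked.mean()


rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=(4, 8, 8))

# All-zero logits: the model has no preference between the two classes.
uninformed = pixel_cross_entropy(np.zeros((4, 2, 8, 8)), target)

# Strongly correct logits: +10 on the true class, -10 on the other.
is_true_class = np.arange(2)[None, :, None, None] == target[:, None]
confident = pixel_cross_entropy(np.where(is_true_class, 10.0, -10.0), target)

print(f"{uninformed:.4f}")   # 0.6931
```

With C classes the uninformed loss is ln C, so a first-step loss far above that value usually points to a bug in the target encoding rather than to a hard problem.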
Step 5: Score with mean IoU and Dice
Accuracy is misleading here: the disk is a small fraction of the image, so a model that predicts "all background" already scores high pixel accuracy. The honest metrics from Section 24.6 are IoU and Dice on the foreground class. Take the per-pixel argmax to get the predicted mask, then compute both.
@torch.no_grad()
def evaluate(model, n=64):
    imgs, masks = batch(n)
    pred = model(imgs).argmax(dim=1)  # (N, H, W) predicted class per pixel
    p, g = (pred == 1), (masks == 1)  # foreground booleans
    inter = (p & g).sum().float()
    union = (p | g).sum().float()
    # TODO: return IoU = inter / union and Dice = 2 * inter / (p.sum() + g.sum()),
    # both as plain Python floats. Add a tiny epsilon to each denominator to be safe.
Hint
iou = (inter / (union + 1e-6)).item(); dice = (2 * inter / (p.sum() + g.sum() + 1e-6)).item(); return iou, dice. After 300 steps both should land above about 0.9, and Dice is never lower than IoU (the two coincide only at 0 and 1), the algebraic relationship noted in Section 24.6.
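The relationship between the two scores follows from the fact that the sizes of the two masks sum to union plus intersection, which gives Dice = 2 * IoU / (1 + IoU). The short NumPy check below confirms it on a toy pair of masks; the masks and the helper name iou_and_dice are made up for illustration.

```python
import numpy as np


def iou_and_dice(pred, gt):
    """IoU and Dice for two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union, 2 * inter / (pred.sum() + gt.sum())


gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True            # a 4x4 square, 16 pixels
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 2:6] = True          # the same square shifted down one row

iou, dice = iou_and_dice(pred, gt)
print(f"IoU={iou:.2f} Dice={dice:.2f}")   # IoU=0.60 Dice=0.75
```

A one-pixel shift costs this small square 40 percent of its IoU but only 25 percent of its Dice, so the two numbers are not interchangeable even though either can be computed from the other.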
Step 6: Report and sanity-check
Print the final metrics next to plain pixel accuracy to feel the gap the chapter warns about. The point is concrete: a single number can hide a useless model, which is why dense prediction is scored on overlap, not on how many pixels happened to be right.
iou, dice = evaluate(model)
imgs, masks = batch(64)
acc = (model(imgs).argmax(1) == masks).float().mean().item()
# TODO: print pixel accuracy, foreground IoU, and Dice on one line each, then
# state in a comment which of the three you would trust for an imbalanced mask.
print(f"pixel accuracy: {acc:.3f}")
Hint
print(f"foreground IoU: {iou:.3f}") and print(f"Dice: {dice:.3f}"). Pixel accuracy will look impressive even early in training because background dominates; IoU and Dice are the numbers that actually track whether the disk is found, which is the lesson of Section 24.6.
Expected Output
The training loop prints a loss that falls from about 0.7 to under 0.1 over 300 steps. The final report shows a pixel accuracy near 0.99 (inflated by the dominant background), a foreground IoU above roughly 0.90, and a Dice score a little higher still. The takeaway is the one Section 24.6 stresses: on an imbalanced mask the accuracy number flatters the model, while IoU and Dice report what you actually care about, how well the predicted region overlaps the true one. You have now built, trained, and correctly evaluated a semantic segmenter end to end.
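The lab reports IoU for the foreground class only. Mean IoU, as usually defined, averages the per-class IoU over every class, and a confusion matrix makes that easy. The NumPy sketch below, with hypothetical helper names and a toy 10x10 mask, scores a lazy model that predicts background everywhere.

```python
import numpy as np


def confusion_matrix(pred, gt, n_classes):
    """Rows are true classes, columns are predicted classes."""
    idx = gt.ravel() * n_classes + pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)


def per_class_iou(cm):
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    return inter / np.maximum(union, 1)   # guard against classes that never appear


gt = np.zeros((10, 10), dtype=np.int64)
gt[0, :] = 1                   # 10 percent of the pixels are foreground
pred = np.zeros_like(gt)       # lazy model: everything is background

cm = confusion_matrix(pred, gt, n_classes=2)
accuracy = np.diag(cm).sum() / cm.sum()
ious = per_class_iou(cm)

print(f"accuracy={accuracy:.2f} mean IoU={ious.mean():.2f}")   # accuracy=0.90 mean IoU=0.45
```

Pixel accuracy awards this model 0.90 while its foreground IoU is exactly zero, and the mean over both classes, 0.45, makes the failure visible. That gap is the reason Step 6 prints accuracy next to the overlap metrics.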
Stretch Goals
- Visualize a result: pick one validation image and plot the input, the ground-truth mask, and the predicted mask side by side with Matplotlib. Seeing the boundary error directly makes the boundary-F1 discussion of Section 24.6 concrete.
- Swap the loss for soft Dice loss (one minus the differentiable Dice of Section 24.6) and compare convergence speed and final IoU against cross-entropy on the same seed. This shows why dense-prediction practitioners reach for region losses when the foreground is small.
- Library shortcut, the Right Tool principle in action: replace your hand-built UNet with segmentation_models.pytorch in one line (install it first with pip install segmentation-models-pytorch), import segmentation_models_pytorch as smp; model = smp.Unet(encoder_name="resnet18", in_channels=1, classes=2), train it with the identical loop, and note how a pretrained backbone typically reaches a higher IoU in fewer steps while you wrote almost no architecture code.
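For the soft Dice stretch goal, one possible starting point is sketched below. It is an assumption-laden minimal version: it scores only the foreground class (index 1), pools over the whole batch the way the lab's evaluate function does, and uses a small eps for stability. Other implementations average per image or per class.

```python
import torch
import torch.nn.functional as F


def soft_dice_loss(logits, target, eps=1e-6):
    """One minus the soft Dice score of the foreground class.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer labels.
    """
    prob_fg = F.softmax(logits, dim=1)[:, 1]   # (N, H, W) foreground probability
    gt_fg = (target == 1).float()
    inter = (prob_fg * gt_fg).sum()
    denom = prob_fg.sum() + gt_fg.sum()
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


# Smoke test: an undecided model (all-zero logits) on a 4x4 square target.
logits = torch.zeros(2, 2, 8, 8, requires_grad=True)
target = torch.zeros(2, 8, 8, dtype=torch.long)
target[:, 2:6, 2:6] = 1

loss = soft_dice_loss(logits, target)
loss.backward()                  # gradients flow because probabilities replace the argmax
print(f"{loss.item():.4f}")      # 0.6667
```

To try it in the lab, replace loss_fn(logits, masks) in the training loop with soft_dice_loss(logits, masks) and keep everything else, including the seed, unchanged so the comparison against cross-entropy is fair.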
Complete Solution
import numpy as np
import torch
import torch.nn as nn
def make_sample(size=64, rng=None):
    rng = rng or np.random.default_rng()
    img = rng.normal(0.2, 0.1, (size, size)).astype(np.float32)
    cy, cx = rng.integers(16, size - 16, size=2)
    r = rng.integers(8, 14)
    yy, xx = np.mgrid[:size, :size]
    mask = ((yy - cy) ** 2 + (xx - cx) ** 2) <= r ** 2
    img[mask] += 0.7
    return torch.from_numpy(img)[None], torch.from_numpy(mask.astype(np.int64))

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(1, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = conv_block(64, 32)
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.up2(b)
        d2 = self.dec2(torch.cat([d2, e2], dim=1))
        d1 = self.up1(d2)
        d1 = self.dec1(torch.cat([d1, e1], dim=1))
        return self.head(d1)
torch.manual_seed(0)
rng = np.random.default_rng(0)
model = UNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
def batch(n=16):
    pairs = [make_sample(rng=rng) for _ in range(n)]
    imgs = torch.stack([p[0] for p in pairs])
    masks = torch.stack([p[1] for p in pairs])
    return imgs, masks

for step in range(300):
    imgs, masks = batch()
    logits = model(imgs)
    loss = loss_fn(logits, masks)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d} loss {loss.item():.4f}")

@torch.no_grad()
def evaluate(model, n=64):
    imgs, masks = batch(n)
    pred = model(imgs).argmax(dim=1)
    p, g = (pred == 1), (masks == 1)
    inter = (p & g).sum().float()
    union = (p | g).sum().float()
    iou = (inter / (union + 1e-6)).item()
    dice = (2 * inter / (p.sum() + g.sum() + 1e-6)).item()
    return iou, dice
iou, dice = evaluate(model)
imgs, masks = batch(64)
acc = (model(imgs).argmax(1) == masks).float().mean().item()
print(f"pixel accuracy: {acc:.3f}") # high, but inflated by background
print(f"foreground IoU: {iou:.3f}") # the honest semantic-segmentation metric
print(f"Dice: {dice:.3f}") # never below IoU; equal only at 0 and 1
What's Next?
With dense prediction in hand you have nearly completed the supervised half of deep vision: classify the image, detect the objects, segment every pixel. The recurring frustration across all three has been the appetite for labels, and segmentation labels are the most expensive of all, because a human must trace every boundary by hand. Chapter 25: Self-Supervised Learning & Vision Foundation Models is the answer: learn representations from unlabeled images, so that a segmentation head needs only a handful of annotated examples to specialize. It is no accident that the Segment Anything Model of Section 24.5 already behaves like a foundation model; Chapter 25 explains the self-supervised pretraining, the masked-image modeling and contrastive learning, that makes such generality possible, and the DINOv2-class backbones whose features power the best modern segmenters. The masks you learned to produce here will then return in Chapter 35, where they become the regions a generative model edits, inpaints, and recomposes.
Bibliography & Further Reading
Foundational Papers
Long, J., Shelhamer, E., Darrell, T. "Fully Convolutional Networks for Semantic Segmentation." CVPR (2015). arXiv:1411.4038
Ronneberger, O., Fischer, P., Brox, T. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI (2015). arXiv:1505.04597
Chen, L.-C. et al. "Rethinking Atrous Convolution for Semantic Image Segmentation (DeepLabv3)." (2017). arXiv:1706.05587
He, K. et al. "Mask R-CNN." ICCV (2017). arXiv:1703.06870
Kirillov, A. et al. "Panoptic Segmentation." CVPR (2019). arXiv:1801.00868
Transformer Segmenters & Foundation Models
Xie, E. et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS (2021). arXiv:2105.15203
Cheng, B. et al. "Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former)." CVPR (2022). arXiv:2112.01527
Kirillov, A. et al. "Segment Anything." ICCV (2023). arXiv:2304.02643
Ravi, N. et al. "SAM 2: Segment Anything in Images and Videos." (2024). arXiv:2408.00714
Tools & Libraries
torchvision segmentation and detection models. pytorch.org/vision/stable/models
Hugging Face Transformers, image segmentation. huggingface.co/docs/transformers
AutoModel loaders for SegFormer, Mask2Former, and the universal segmentation models of Section 24.4, with preprocessing and post-processing handled.
Meta AI. Segment Anything (segment-anything) and SAM 2 repositories. github.com/facebookresearch/segment-anything
Iakubovskii, P. segmentation_models.pytorch. github.com/qubvel-org/segmentation_models.pytorch
Datasets & Benchmarks
Cordts, M. et al. "The Cityscapes Dataset for Semantic Urban Scene Understanding." CVPR (2016). cityscapes-dataset.com
Lin, T.-Y. et al. "Microsoft COCO: Common Objects in Context." ECCV (2014), with panoptic extension. cocodataset.org
Zhou, B. et al. "Scene Parsing through ADE20K Dataset." CVPR (2017). groups.csail.mit.edu/vision/datasets/ADE20K