Chapter 24: Segmentation: Semantic, Instance & Promptable

"Classification asked me one rude question about the whole picture and walked off. Detection at least drew a box around what it cared about. Segmentation sat down with every single pixel and asked, gently, who exactly are you. It took longer, but for the first time I felt seen."
A Pixel That Finally Got a Label of Its Own

Big Picture

Segmentation is classification carried out per pixel: instead of one label for an image or one box for an object, the model assigns a category, and sometimes an identity, to every location in the frame. That single change of granularity drives the entire chapter. A network that outputs a dense map rather than a single vector must keep spatial resolution alive through pooling and strides, which is why the encoder-decoder shape returns again and again. Asking "which class is this pixel" gives semantic segmentation; asking additionally "which object instance is this pixel" gives instance segmentation; asking both at once over the whole image gives panoptic segmentation. Transformers then reframe all three as predicting a set of masks rather than labeling a grid, and the Segment Anything Model takes the final step: a single model that segments whatever you point at, with no fixed list of classes at all. By the end you will be able to read a per-pixel logit map, train a U-Net, run Mask R-CNN and SAM from a library, and report the right number, mean Intersection-over-Union or panoptic quality, for the right task.
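
To make "reading a per-pixel logit map" concrete before the sections begin, here is a minimal NumPy sketch. The three-class, 2x2 toy values are invented for illustration and do not come from any model in the chapter; the only operations are an argmax and a softmax over the class axis.

```python
import numpy as np

# Toy logit map with shape (C, H, W): 3 classes on a 2x2 image.
# The numbers are made up purely for illustration.
logits = np.array([
    [[2.0, 0.1], [0.3, 0.2]],   # class 0 scores
    [[0.5, 1.5], [0.1, 0.4]],   # class 1 scores
    [[0.1, 0.2], [3.0, 0.9]],   # class 2 scores
])

# Each pixel takes the class whose logit is largest at that location.
label_map = logits.argmax(axis=0)          # shape (H, W), integer class ids

# A softmax over the class axis turns logits into per-pixel probabilities.
exp = np.exp(logits - logits.max(axis=0, keepdims=True))
probs = exp / exp.sum(axis=0, keepdims=True)

print(label_map)                           # rows: [0 1] and [2 2]
```

The same two lines, argmax for the label map and softmax for the probabilities, apply unchanged to a batch of logits once a leading batch axis is added and the class axis index is shifted by one.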

The Segmentation Ladder: One Schema for the Whole Chapter

Every method in this chapter answers the same question, "what is at this location," at a finer and finer grain. Hold this five-rung ladder in mind and each section snaps into place:

What (semantic, 24.1): a class per pixel. FCN, U-Net, DeepLab.
What and which one (instance, 24.2): a class plus an identity per object. Mask R-CNN.
What and which one, everywhere (panoptic, 24.3): things and stuff in one partition. Panoptic quality.
The same answer, as a set of masks (transformers, 24.4): one model, all three readouts. SegFormer, Mask2Former.
Whatever you point at (promptable, 24.5): no class list at all. SAM. (And 24.6 tells you how to score every rung.)

Underneath all five runs a single engineering refrain worth memorizing: classification threw away the resolution; segmentation is the art of getting it back. Skip connections, dilation, and learned upsampling are just three answers to that one sentence. The Hands-On Lab at the end of this chapter builds the first rung, a working U-Net, from scratch, trains it on a self-generating dataset, and scores it with the mean IoU and Dice metrics of Section 24.6, so you exercise the encoder-decoder idea and the measurement toolkit in one runnable program.
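
Of the three answers, dilation is the easiest to check with plain arithmetic. The sketch below assumes stride-1 convolutions only, and the helper names are hypothetical, chosen for this illustration; it shows how growing the dilation widens the receptive field without adding layers or shrinking the feature map.

```python
def effective_kernel(k, dilation):
    """Span, in input pixels, covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1

def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    Each layer widens the field by dilation * (kernel_size - 1).
    """
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)
    return rf

plain = receptive_field([(3, 1), (3, 1), (3, 1)])     # three ordinary 3x3 convs
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])   # same depth, growing dilation

print(effective_kernel(3, 2))   # 5
print(plain, dilated)           # 7 15
```

Three layers see a 15-pixel window instead of a 7-pixel one at the same parameter count, which is the trade DeepLab exploits in Section 24.1.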

Chapter Overview

In Chapter 19 a convolutional network turned an image into a single label, and in the previous chapter, Chapter 23: Object Detection, it learned to draw boxes around objects and name them. A box is a coarse answer. It tells you roughly where the dog is, but it cannot say which pixels are dog and which are the grass behind it, and it cannot trace the irregular outline of a tumor, a road, or a piece of cloth. Many real tasks need exactly that pixel-precise outline: a self-driving stack must know which pixels are drivable road, a medical tool must measure the area of a lesion, a photo editor must cut a subject out cleanly. Segmentation is the family of methods that produces those dense, per-pixel answers, and this chapter walks the full arc from the first fully convolutional networks to the promptable foundation models of 2023 onward.

We begin with semantic segmentation in Section 24.1, where the task is a label per pixel with no notion of separate objects. The central engineering problem is resolution: convolutional backbones throw spatial detail away through pooling and strides, and a segmenter must get it back. Three influential answers, the fully convolutional network with skip connections, the symmetric U-Net, and the dilated-convolution DeepLab family, each solve that problem differently, and all three descend directly from the encoder-decoder and multi-scale ideas you met as image pyramids in Chapter 4. Section 24.2 adds the instance dimension with Mask R-CNN, which bolts a small mask-predicting branch onto the detector of Chapter 23 and, almost as an afterthought, fixes a quiet alignment bug with the RoIAlign operation that the whole field then adopted.
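
The alignment problem that RoIAlign addresses can be felt on a 2x2 feature map. The following is a deliberately simplified sketch, not the Mask R-CNN implementation: it reads a single fractional location either by snapping to the nearest cell (the quantization step) or by bilinear interpolation (the RoIAlign idea). The real operation samples several points per output bin and pools them; the function names here are hypothetical.

```python
import math

def sample_nearest(fmap, y, x):
    """Quantized read: snap the fractional location to a grid cell."""
    return fmap[round(y)][round(x)]

def sample_bilinear(fmap, y, x):
    """Interpolated read: blend the four surrounding cells by distance."""
    y0, x0 = math.floor(y), math.floor(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
    bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
    return (1 - wy) * top + wy * bottom

fmap = [[0.0, 10.0],
        [20.0, 30.0]]

print(sample_nearest(fmap, 0.4, 0.4))    # 0.0
print(sample_bilinear(fmap, 0.4, 0.4))   # approximately 12.0
```

The quantized read ignores the 0.4-cell offset entirely, while the interpolated read reflects it. At mask resolution, that lost fraction of a cell is the misalignment Section 24.2 describes.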

Section 24.3 unifies the two views. Real scenes contain countable "things" (people, cars) and uncountable "stuff" (sky, road, vegetation); panoptic segmentation labels every pixel with both a class and, for things, an instance identity, and introduces the panoptic quality metric that scores the result in one number. Section 24.4 is the turning point: the mask transformers. SegFormer rebuilds the semantic segmenter on a hierarchical transformer backbone with a startlingly simple decoder, and Mask2Former makes the conceptual leap that the same architecture, predicting a set of masks with masked attention, can do semantic, instance, and panoptic segmentation with no task-specific changes. This is the attention thread from Chapter 22 arriving in dense prediction.
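
As a preview of how one number can score both recognition and segmentation, here is a minimal sketch of panoptic quality for a single class. It assumes the matching has already been done (a predicted and a ground-truth segment count as a true positive when their IoU exceeds 0.5) and takes only the matched IoUs and the unmatched counts. The function name and the example numbers are illustrative.

```python
def panoptic_quality(matched_ious, n_false_pos, n_false_neg):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of every matched (true-positive) segment pair.
    n_false_pos:  predicted segments with no match.
    n_false_neg:  ground-truth segments with no match.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * n_false_pos + 0.5 * n_false_neg
    if denom == 0:
        # Class absent from both prediction and ground truth; a real
        # evaluator would skip it rather than report zero.
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0   # segmentation quality
    rq = tp / denom                              # recognition quality
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.8, 0.6], n_false_pos=1, n_false_neg=1)
print(f"PQ {pq:.3f}  SQ {sq:.3f}  RQ {rq:.3f}")   # PQ 0.467  SQ 0.700  RQ 0.667
```

PQ factors into how well the matched segments overlap (SQ) and how many segments were found at all (RQ), which is why a model can lose points either by sloppy boundaries or by missed objects.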

Section 24.5 reaches the present: the Segment Anything Model and its successors. Trained on more than a billion masks, SAM takes a prompt (a click, a box, or a rough mask) and returns a segmentation of the indicated object, for any image, with no class list and no fine-tuning. It is segmentation's foundation model, and it changes the workflow from "train a segmenter for your classes" to "prompt a segmenter for your object." Finally, Section 24.6 is the chapter's measurement toolkit: the cross-entropy, Dice, and focal losses that train dense predictors; the IoU, mean IoU, boundary-F1, and panoptic-quality metrics that evaluate them; and the practical traps (class imbalance and boundary error) that trip up every first segmentation project.
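
To see why focal loss is listed next to cross-entropy as a remedy for class imbalance, consider one pixel at a time. This sketch uses the basic focal form without the optional class-weighting term, and the probabilities 0.95 and 0.3 are arbitrary stand-ins for an "easy" and a "hard" pixel.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one pixel, given the probability of its true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one pixel: cross-entropy scaled by (1 - p_true) ** gamma."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy, hard = 0.95, 0.3   # illustrative probabilities, not measured values

for name, p in (("easy", easy), ("hard", hard)):
    print(f"{name}: CE {cross_entropy(p):.4f}  focal {focal_loss(p):.4f}")
```

With gamma set to 0 the two losses are identical; as gamma grows, the many easy background pixels contribute almost nothing, so the gradient is dominated by the pixels the model still gets wrong.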

The connecting idea is the one in the Big Picture: segmentation is dense classification, and almost every technique in the chapter is a different answer to "how do I keep, or recover, the spatial resolution that classification was happy to discard." Masks produced here do not stay here. They become the editing regions of Chapter 35, where a segmentation mask tells a generative model exactly where to inpaint, and they connect back to the classical watershed and graph-cut methods of Chapter 11 that did this job before deep learning, by hand-designed energy rather than learned features.

Prerequisites

You should have read Chapter 19: Convolutional Neural Networks for convolution, pooling, strides, and the receptive field, all of which set the resolution problem this chapter solves, and Chapter 20: CNN Architectures for the ResNet backbones that most of the convolutional segmenters here sit on. Chapter 23: Object Detection is a direct prerequisite for Section 24.2, because Mask R-CNN extends Faster R-CNN and reuses its region proposals and RoI features. Chapter 22: Vision Transformers supplies the self-attention and patch-embedding mechanics that Section 24.4 turns into mask transformers. Comfort with PyTorch tensors and the training loop from Chapter 18 is assumed throughout, and the IoU metric you first meet in detection is generalized here, so a quick look back at the Chapter 23 evaluation section will pay off.

Chapter Roadmap

24.1 Semantic Segmentation: FCN, U-Net & DeepLab. A label per pixel, and the resolution problem that defines dense prediction. Fully convolutional networks with skip fusion, the symmetric U-Net encoder-decoder, and DeepLab's dilated convolutions and atrous spatial pyramid pooling. A trainable U-Net built from scratch in PyTorch.
24.2 Instance Segmentation: Mask R-CNN. Separating individual objects, not just classes. Mask R-CNN adds a per-region mask branch to Faster R-CNN, and RoIAlign fixes the quantization that broke pixel-accurate features. The two-stage detect-then-segment recipe, run end to end with torchvision.
24.3 Panoptic Segmentation: Unifying Things & Stuff. Labeling every pixel with both a class and, for countable things, an instance identity. The things-versus-stuff distinction, how semantic and instance predictions are merged without overlap, and the panoptic quality metric that scores recognition and segmentation in one number.
24.4 Transformer Segmenters: SegFormer & Mask2Former. Attention takes over dense prediction. SegFormer's hierarchical encoder and all-MLP decoder, and Mask2Former's mask-classification view with masked attention that does semantic, instance, and panoptic segmentation with one architecture. Why predicting a set of masks beats labeling a grid.
24.5 Segment Anything: Promptable Segmentation. Segmentation's foundation model. SAM's image encoder, prompt encoder, and lightweight mask decoder; the billion-mask data engine that trained it; ambiguity handling with multiple mask outputs; and the shift from training a segmenter to prompting one. SAM 2 and video extensions.
24.6 Losses, Metrics & Evaluation for Dense Prediction. The measurement toolkit. Pixel cross-entropy, Dice, Tversky, and focal losses and when each helps; IoU, mean IoU, pixel accuracy, boundary-F1, and panoptic quality; and the practical traps of class imbalance and boundary error that decide whether a segmenter is actually good.

Hands-On Lab: A U-Net Segmenter You Train and Score Yourself

Duration: about 60 to 75 minutes
Difficulty: Intermediate

Objective

Build a small U-Net from scratch in PyTorch, train it to segment shapes against a noisy background, and grade it with the exact metrics from Section 24.6, mean Intersection-over-Union and the Dice score. The dataset generates itself with code, so the lab is fully self-contained: there is nothing to download, it runs on a CPU in a couple of minutes, and because every mask is produced by the same generator that made the image, you can trust the ground truth absolutely. By the end you will have walked the chapter's central refrain, classification throws resolution away and segmentation gets it back, through a network you wrote line by line.

What You'll Practice

Assembling the encoder-decoder with skip connections that defines U-Net (Section 24.1), the architecture that keeps and recovers spatial resolution.
Generating a self-labeling synthetic segmentation dataset so the per-pixel ground truth is exact, the same trick the two-view lab of Chapter 13 uses for geometry.
Training a dense predictor with pixel-wise cross-entropy and reading a per-pixel logit map (Section 24.6).
Computing mean IoU and Dice on validation masks, the right numbers for semantic segmentation, and seeing why pixel accuracy alone can mislead.
Swapping your hand-built network for a one-line library model to confirm the Right Tool payoff (the stretch goal).

Setup

Two libraries and no dataset: the script synthesizes its own images and masks, so there is nothing to download. Install with:

pip install torch numpy

Everything runs on the CPU in a couple of minutes at the small image size used here. A GPU, if present, simply makes it faster. Matplotlib is optional and only used by the first stretch goal to visualize a predicted mask.

Steps

Step 1: Generate a self-labeling shapes dataset

Each sample is a small grayscale image with one bright disk on a noisy background, paired with a binary mask that is exactly the disk's pixels. Because one function draws both the image and its mask, the ground truth is perfect, which is what makes the lab self-grading.

import numpy as np
import torch

def make_sample(size=64, rng=None):
    rng = rng or np.random.default_rng()
    img = rng.normal(0.2, 0.1, (size, size)).astype(np.float32)   # noisy background
    cy, cx = rng.integers(16, size - 16, size=2)                   # random disk center
    r = rng.integers(8, 14)                                        # random disk radius
    yy, xx = np.mgrid[:size, :size]
    mask = ((yy - cy) ** 2 + (xx - cx) ** 2) <= r ** 2             # True inside the disk
    img[mask] += 0.7                                               # brighten the disk
    # TODO: return img as a (1, size, size) float32 tensor and mask as a
    # (size, size) int64 tensor of class labels (0 = background, 1 = disk).

Hint

return torch.from_numpy(img)[None], torch.from_numpy(mask.astype(np.int64)). The leading [None] adds the single channel dimension a convolution expects; the mask stays a 2D map of integer class indices, which is what cross-entropy wants as its target.

Step 2: Build the U-Net blocks

A U-Net is built from one repeated unit: two 3x3 convolutions, each followed by a ReLU, that keep the spatial size fixed. Write it once as a reusable block so the encoder and decoder can both call it.

import torch.nn as nn

def conv_block(in_ch, out_ch):
    # TODO: return an nn.Sequential of: Conv2d(in_ch, out_ch, 3, padding=1),
    # ReLU, Conv2d(out_ch, out_ch, 3, padding=1), ReLU. The padding=1 keeps
    # height and width unchanged so skip connections line up later.

Hint

return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True), nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)). Keeping padding=1 means a 64x64 input stays 64x64, so when you concatenate an encoder feature onto a decoder feature their spatial shapes match exactly.
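
Why padding=1 matters can be checked without running a network. The sketch below applies the standard output-size formula for a 2D convolution along one axis; the helper name conv2d_out is hypothetical, introduced only for this check.

```python
def conv2d_out(size, kernel, padding=0, stride=1, dilation=1):
    """Output length along one spatial axis of a 2D convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv2d_out(64, kernel=3, padding=1))   # 64: size preserved
print(conv2d_out(64, kernel=3, padding=0))   # 62: one pixel lost at each border

# Two unpadded convolutions per block shrink the map further, so an encoder
# feature and the upsampled decoder feature would stop sharing a shape
# unless the skip feature were cropped.
print(conv2d_out(conv2d_out(64, 3), 3))      # 60
```

With padding the only size changes in the lab's network come from pooling and upsampling, which is what lets Step 3 concatenate features directly.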

Step 3: Wire the encoder, decoder, and skip connections

This is the heart of Section 24.1. The encoder halves resolution with max pooling while doubling channels; the decoder upsamples back and, crucially, concatenates the matching encoder feature so fine spatial detail is restored. That concatenation is the skip connection.

class UNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(1, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = conv_block(64, 32)        # 64 = 32 upsampled + 32 skip
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)        # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                     # full resolution
        e2 = self.enc2(self.pool(e1))         # half resolution
        b = self.bottleneck(self.pool(e2))    # quarter resolution
        d2 = self.up2(b)
        # TODO: concatenate the encoder feature e2 onto d2 along the channel
        # dim (dim=1), pass through self.dec2, then upsample with self.up1,
        # concatenate e1, pass through self.dec1, and finally return self.head(...).

Hint

d2 = self.dec2(torch.cat([d2, e2], dim=1)); then d1 = self.up1(d2); d1 = self.dec1(torch.cat([d1, e1], dim=1)); return self.head(d1). The output has shape (batch, n_classes, 64, 64): one logit map per class, the dense readout the chapter keeps returning to.
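
Before running the model, it can help to trace channels and sizes by hand. This sketch follows the layer sizes of the UNet class above for a 64x64 input, using the usual output-size rule for a transposed convolution (no dilation, no output padding). Each tuple is (channels, height), with width equal to height; the helper name is hypothetical.

```python
def convtranspose2d_out(size, kernel, stride=1, padding=0):
    """Output length along one axis of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + (kernel - 1) + 1

size = 64
e1 = (16, size)          # enc1: padded convs keep 64x64, 16 channels
e2 = (32, size // 2)     # pool halves to 32x32, enc2 widens to 32 channels
b = (64, size // 4)      # pool halves to 16x16, bottleneck has 64 channels

up2 = (32, convtranspose2d_out(b[1], kernel=2, stride=2))     # back to 32x32
cat2 = (up2[0] + e2[0], up2[1])                               # input to dec2
up1 = (16, convtranspose2d_out(up2[1], kernel=2, stride=2))   # back to 64x64
cat1 = (up1[0] + e1[0], up1[1])                               # input to dec1

print(cat2, cat1)        # (64, 32) (32, 64)
```

The concatenated channel counts, 64 and 32, are exactly the in_ch arguments of dec2 and dec1, and the spatial sizes match the encoder features they are joined with.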

Step 4: Train with pixel-wise cross-entropy

Generate a fresh batch each step and minimize cross-entropy averaged over every pixel. nn.CrossEntropyLoss expects raw logits of shape (N, C, H, W) and an integer target of shape (N, H, W), exactly what Steps 1 and 3 produce.

torch.manual_seed(0)
rng = np.random.default_rng(0)
model = UNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def batch(n=16):
    pairs = [make_sample(rng=rng) for _ in range(n)]
    imgs = torch.stack([p[0] for p in pairs])
    masks = torch.stack([p[1] for p in pairs])
    return imgs, masks

for step in range(300):
    imgs, masks = batch()
    logits = model(imgs)
    # TODO: compute the loss with loss_fn(logits, masks), backpropagate,
    # step the optimizer, and zero the gradients. Print the loss every 50 steps.

Hint

The canonical four lines: loss = loss_fn(logits, masks); opt.zero_grad(); loss.backward(); opt.step(). Guard the print with if step % 50 == 0:. The loss should fall steadily from roughly 0.7 toward well under 0.1.
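
The starting value of roughly 0.7 is not arbitrary. Assuming a freshly initialized network outputs nearly equal logits for both classes, every pixel is assigned probability one half, and the cross-entropy is the natural log of the number of classes. A short check (the function name is illustrative):

```python
import math

def uniform_cross_entropy(n_classes):
    """Cross-entropy when every class gets probability 1 / n_classes."""
    return -math.log(1.0 / n_classes)

print(f"{uniform_cross_entropy(2):.3f}")    # 0.693, the lab's two-class start
print(f"{uniform_cross_entropy(19):.3f}")   # 2.944, for a 19-class benchmark
```

If your first printed loss is far from this value, suspect a shape or dtype mismatch between logits and targets before suspecting the optimizer.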

Step 5: Score with mean IoU and Dice

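Accuracy is misleading here: the disk covers only a small fraction of the image, so a model that predicts "all background" already scores high pixel accuracy. The honest metrics from Section 24.6 are IoU and Dice on the foreground class. With a single foreground class, this foreground IoU is the informative half of mean IoU; the full mean would average it with the background IoU, which stays high whenever the background dominates. Take the per-pixel argmax to get the predicted mask, then compute both.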

@torch.no_grad()
def evaluate(model, n=64):
    imgs, masks = batch(n)
    pred = model(imgs).argmax(dim=1)          # (N, H, W) predicted class per pixel
    p, g = (pred == 1), (masks == 1)          # foreground booleans
    inter = (p & g).sum().float()
    union = (p | g).sum().float()
    # TODO: return IoU = inter / union and Dice = 2 * inter / (p.sum() + g.sum()),
    # both as plain Python floats. Add a tiny epsilon to each denominator to be safe.

Hint

iou = (inter / (union + 1e-6)).item(); dice = (2 * inter / (p.sum() + g.sum() + 1e-6)).item(); return iou, dice. After 300 steps both should land above about 0.9, and Dice is never lower than IoU for the same masks (the two coincide only at 0 and 1), the algebraic relationship noted in Section 24.6.
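
The relationship between the two scores follows from their definitions: for the same pair of masks, Dice equals 2 * IoU / (1 + IoU). The sketch below checks this on made-up pixel counts; the function names are illustrative.

```python
def iou_from_counts(inter, pred_area, true_area):
    """IoU from the overlap and the two mask areas, all in pixels."""
    union = pred_area + true_area - inter
    return inter / union

def dice_from_counts(inter, pred_area, true_area):
    """Dice from the same three pixel counts."""
    return 2 * inter / (pred_area + true_area)

def dice_from_iou(iou):
    """Dice written as a function of IoU for the same pair of masks."""
    return 2 * iou / (1 + iou)

inter, pred_area, true_area = 270, 300, 310    # made-up pixel counts
iou = iou_from_counts(inter, pred_area, true_area)
dice = dice_from_counts(inter, pred_area, true_area)

print(f"{iou:.3f} {dice:.3f}")                 # 0.794 0.885
print(abs(dice - dice_from_iou(iou)) < 1e-12)  # True
```

Because one score is a monotone function of the other, they always rank two models in the same order on a single image; they differ in how harshly they penalize partial overlap.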

Step 6: Report and sanity-check

Print the final metrics next to plain pixel accuracy to feel the gap the chapter warns about. The point is concrete: a single number can hide a useless model, which is why dense prediction is scored on overlap, not on how many pixels happened to be right.

iou, dice = evaluate(model)
imgs, masks = batch(64)                      # a fresh batch, separate from the one evaluate() drew
with torch.no_grad():                        # inference only, so skip gradient tracking
    acc = (model(imgs).argmax(1) == masks).float().mean().item()
# TODO: print pixel accuracy, foreground IoU, and Dice on one line each, then
# state in a comment which of the three you would trust for an imbalanced mask.
print(f"pixel accuracy: {acc:.3f}")

Hint

print(f"foreground IoU: {iou:.3f}") and print(f"Dice: {dice:.3f}"). Pixel accuracy will look impressive even early in training because background dominates; IoU and Dice are the numbers that actually track whether the disk is found, which is the lesson of Section 24.6.
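
The gap can be made stark with a "model" that never finds the disk at all. This NumPy sketch scores an all-background prediction against one disk of radius 10, a value inside the lab's radius range; the disk position is an arbitrary choice. It also computes the class-averaged mean IoU so the per-class and averaged views can be compared.

```python
import numpy as np

size, cy, cx, r = 64, 32, 32, 10             # illustrative disk geometry
yy, xx = np.mgrid[:size, :size]
truth = ((yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2).astype(np.int64)
pred = np.zeros_like(truth)                  # always predicts background

accuracy = (pred == truth).mean()

def class_iou(pred, truth, cls):
    """IoU of one class between a predicted and a true label map."""
    inter = np.logical_and(pred == cls, truth == cls).sum()
    union = np.logical_or(pred == cls, truth == cls).sum()
    return inter / union if union else float("nan")

ious = [class_iou(pred, truth, c) for c in (0, 1)]   # background, disk
mean_iou = float(np.mean(ious))

print(f"accuracy {accuracy:.3f}  IoU(bg) {ious[0]:.3f}  "
      f"IoU(disk) {ious[1]:.3f}  mIoU {mean_iou:.3f}")
```

Pixel accuracy stays above 0.9 even though the disk IoU is exactly zero, and the mean IoU falls below one half because the missed class drags the average down.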

Expected Output

The training loop prints a loss that falls from about 0.7 to under 0.1 over 300 steps. The final report shows a pixel accuracy near 0.99 (inflated by the dominant background), a foreground IoU above roughly 0.90, and a Dice score a little higher still. The takeaway is the one Section 24.6 stresses: on an imbalanced mask the accuracy number flatters the model, while IoU and Dice report what you actually care about, how well the predicted region overlaps the true one. You have now built, trained, and correctly evaluated a semantic segmenter end to end.

Stretch Goals

Visualize a result: pick one validation image and plot the input, the ground-truth mask, and the predicted mask side by side with Matplotlib. Seeing the boundary error directly makes the boundary-F1 discussion of Section 24.6 concrete.
Swap the loss for soft Dice loss (one minus the differentiable Dice of Section 24.6) and compare convergence speed and final IoU against cross-entropy on the same seed. This shows why dense-prediction practitioners reach for region losses when the foreground is small.
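Library shortcut, the Right Tool principle in action: replace your hand-built UNet with segmentation_models.pytorch (pip install segmentation-models-pytorch) in one line, import segmentation_models_pytorch as smp; model = smp.Unet(encoder_name="resnet18", in_channels=1, classes=2), and train it with the identical loop. By default the encoder loads ImageNet-pretrained weights, which are downloaded on first use, so this goal is the one place the lab needs a network connection. Note how a pretrained backbone typically reaches a higher IoU in fewer steps while you wrote almost no architecture code.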
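
For the second stretch goal, one possible form of the soft Dice loss is sketched below. It is an assumption-laden starting point rather than the definitive version from Section 24.6: it scores only the foreground class, pools the sums over the whole batch (as the lab's evaluate function does), and uses a small eps to avoid division by zero. The name soft_dice_loss is hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1e-6):
    """One minus soft Dice on the foreground class.

    logits: (N, 2, H, W) raw scores.
    target: (N, H, W) int64 labels in {0, 1}.
    """
    prob_fg = F.softmax(logits, dim=1)[:, 1]     # (N, H, W) foreground probability
    true_fg = (target == 1).float()
    inter = (prob_fg * true_fg).sum()            # soft overlap, differentiable
    dice = (2 * inter + eps) / (prob_fg.sum() + true_fg.sum() + eps)
    return 1 - dice

# Tiny smoke test on a synthetic square mask.
target = torch.zeros(2, 8, 8, dtype=torch.long)
target[:, 2:6, 2:6] = 1
logits = torch.zeros(2, 2, 8, 8, requires_grad=True)
loss = soft_dice_loss(logits, target)
loss.backward()                                  # gradients reach the logits
print(loss.item() > 0, logits.grad is not None)
```

To use it in Step 4, replace loss_fn(logits, masks) with soft_dice_loss(logits, masks) and keep everything else, including the seed, unchanged so the comparison with cross-entropy is fair.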

Complete Solution

import numpy as np
import torch
import torch.nn as nn

def make_sample(size=64, rng=None):
    rng = rng or np.random.default_rng()
    img = rng.normal(0.2, 0.1, (size, size)).astype(np.float32)
    cy, cx = rng.integers(16, size - 16, size=2)
    r = rng.integers(8, 14)
    yy, xx = np.mgrid[:size, :size]
    mask = ((yy - cy) ** 2 + (xx - cx) ** 2) <= r ** 2
    img[mask] += 0.7
    return torch.from_numpy(img)[None], torch.from_numpy(mask.astype(np.int64))

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(1, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = conv_block(64, 32)
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.up2(b)
        d2 = self.dec2(torch.cat([d2, e2], dim=1))
        d1 = self.up1(d2)
        d1 = self.dec1(torch.cat([d1, e1], dim=1))
        return self.head(d1)

torch.manual_seed(0)
rng = np.random.default_rng(0)
model = UNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def batch(n=16):
    pairs = [make_sample(rng=rng) for _ in range(n)]
    imgs = torch.stack([p[0] for p in pairs])
    masks = torch.stack([p[1] for p in pairs])
    return imgs, masks

for step in range(300):
    imgs, masks = batch()
    logits = model(imgs)
    loss = loss_fn(logits, masks)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")

@torch.no_grad()
def evaluate(model, n=64):
    imgs, masks = batch(n)
    pred = model(imgs).argmax(dim=1)
    p, g = (pred == 1), (masks == 1)
    inter = (p & g).sum().float()
    union = (p | g).sum().float()
    iou = (inter / (union + 1e-6)).item()
    dice = (2 * inter / (p.sum() + g.sum() + 1e-6)).item()
    return iou, dice

iou, dice = evaluate(model)
imgs, masks = batch(64)                # fresh batch, separate from the one evaluate() drew
with torch.no_grad():                  # inference only, so skip gradient tracking
    acc = (model(imgs).argmax(1) == masks).float().mean().item()
print(f"pixel accuracy: {acc:.3f}")    # high, but inflated by background
print(f"foreground IoU: {iou:.3f}")    # the honest semantic-segmentation metric
print(f"Dice: {dice:.3f}")             # never below IoU; equal only at 0 or 1

What's Next?

With dense prediction in hand you have nearly completed the supervised half of deep vision: classify the image, detect the objects, segment every pixel. The recurring frustration across all three has been the appetite for labels, and segmentation labels are the most expensive of all, because a human must trace every boundary by hand. Chapter 25: Self-Supervised Learning & Vision Foundation Models is the answer: learn representations from unlabeled images, so that a segmentation head needs only a handful of annotated examples to specialize. It is no accident that the Segment Anything Model of Section 24.5 already behaves like a foundation model; Chapter 25 explains the self-supervised pretraining, the masked-image modeling and contrastive learning, that makes such generality possible, and the DINOv2-class backbones whose features power the best modern segmenters. The masks you learned to produce here will then return in Chapter 35, where they become the regions a generative model edits, inpaints, and recomposes.

Bibliography & Further Reading

Foundational Papers

Long, J., Shelhamer, E., Darrell, T. "Fully Convolutional Networks for Semantic Segmentation." CVPR (2015). arXiv:1411.4038

The FCN of Section 24.1. Replaced the classifier's final fully connected layers with convolutions so the network outputs a dense label map, and introduced the skip connections that fuse coarse semantics with fine spatial detail. The paper that started deep semantic segmentation.

Ronneberger, O., Fischer, P., Brox, T. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI (2015). arXiv:1505.04597

The U-Net of Section 24.1, among the most widely used segmentation architectures ever published. Its symmetric encoder-decoder with skip connections at every resolution level trains from very few annotated images and reappears, much later, as the denoiser inside diffusion models.

Chen, L.-C. et al. "Rethinking Atrous Convolution for Semantic Image Segmentation (DeepLabv3)." (2017). arXiv:1706.05587

The DeepLab line of Section 24.1. Dilated (atrous) convolutions enlarge the receptive field without losing resolution, and atrous spatial pyramid pooling captures objects at multiple scales. DeepLabv3+ adds a decoder for sharper boundaries.

He, K. et al. "Mask R-CNN." ICCV (2017). arXiv:1703.06870

Mask R-CNN of Section 24.2. Adds a small per-region mask branch to Faster R-CNN and introduces RoIAlign, which removes the harmful quantization of RoIPool. The instance-segmentation baseline that dominated for years.

Kirillov, A. et al. "Panoptic Segmentation." CVPR (2019). arXiv:1801.00868

The paper that defined the panoptic task and the panoptic quality (PQ) metric of Section 24.3, unifying the previously separate semantic (stuff) and instance (things) segmentation communities under one evaluation.

Transformer Segmenters & Foundation Models

Xie, E. et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS (2021). arXiv:2105.15203

SegFormer of Section 24.4. A hierarchical transformer encoder with no positional embeddings paired with a lightweight all-MLP decoder, strong and efficient, and robust to input resolution changes.

Cheng, B. et al. "Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former)." CVPR (2022). arXiv:2112.01527

Mask2Former of Section 24.4. One architecture, masked-attention mask classification, that achieves state-of-the-art semantic, instance, and panoptic segmentation. The clearest statement of the mask-set view of segmentation.

Kirillov, A. et al. "Segment Anything." ICCV (2023). arXiv:2304.02643

SAM of Section 24.5. A promptable segmentation foundation model trained on the 1.1-billion-mask SA-1B dataset, with an image encoder, prompt encoder, and fast mask decoder, and zero-shot transfer to unseen tasks.

Ravi, N. et al. "SAM 2: Segment Anything in Images and Videos." (2024). arXiv:2408.00714

The 2024 successor in Section 24.5. Extends SAM to video with a streaming memory that propagates a prompted mask across frames in real time, unifying image and video segmentation.

Tools & Libraries

torchvision segmentation and detection models. pytorch.org/vision/stable/models

Pretrained FCN, DeepLabv3, and Mask R-CNN with a uniform API, the library shortcut behind the from-scratch builds of Sections 24.1 and 24.2.

Hugging Face Transformers, image segmentation. huggingface.co/docs/transformers

High-level pipelines and AutoModel loaders for SegFormer, Mask2Former, and the universal segmentation models of Section 24.4, with preprocessing and post-processing handled.

Meta AI. Segment Anything (segment-anything) and SAM 2 repositories. github.com/facebookresearch/segment-anything and github.com/facebookresearch/sam2

The official SAM and SAM 2 code and checkpoints used in Section 24.5, including the automatic mask generator and the interactive predictor.

Iakubovskii, P. segmentation_models.pytorch. github.com/qubvel-org/segmentation_models.pytorch

A library of U-Net, FPN, DeepLab, and dozens of encoder backbones with a one-line model constructor and a built-in collection of the Dice and focal losses of Section 24.6.

Datasets & Benchmarks

Cordts, M. et al. "The Cityscapes Dataset for Semantic Urban Scene Understanding." CVPR (2016). cityscapes-dataset.com

The urban-driving benchmark with fine pixel annotations for 19 classes, the standard testbed for the semantic and panoptic methods of Sections 24.1, 24.3, and 24.4.

Lin, T.-Y. et al. "Microsoft COCO: Common Objects in Context." ECCV (2014), with panoptic extension. cocodataset.org

The instance- and panoptic-segmentation benchmark of Sections 24.2 and 24.3, with mask annotations for 80 thing classes and 53 stuff classes, and the source of the standard mask average-precision protocol.

Zhou, B. et al. "Scene Parsing through ADE20K Dataset." CVPR (2017). groups.csail.mit.edu/vision/datasets/ADE20K

A 150-class scene-parsing benchmark used heavily to evaluate the transformer segmenters of Section 24.4, with dense annotations across a very broad label vocabulary.