Section 24.2: Instance Segmentation: Mask R-CNN

"Semantic segmentation told me there were cars. I asked how many, and it shrugged and pointed at one big car-colored smear. Mask R-CNN finally counted them, and then, with the patience of a saint, drew a tidy outline around each one."
An Anchor Box That Grew Up to Cut Out Shapes

Big Picture

Instance segmentation answers two questions per pixel at once: which class, and which object instance, so two adjacent cars are two separate masks rather than one merged region. Mask R-CNN gets there by the most pragmatic route imaginable: take the Faster R-CNN detector from the previous chapter, which already finds and boxes objects, and bolt a tiny extra branch onto each detected region that predicts a binary mask. The architecture is detect-then-segment. The one subtle but consequential fix that made it work is RoIAlign, a way of cropping per-region features that avoids the coordinate rounding that had quietly corrupted earlier region-based methods. Get the detector and RoIAlign right, and the mask branch is almost trivial.

The semantic segmenters of Section 24.1 assign each pixel a class, and that is all they do. Stand them in front of a parking lot and every car-pixel gets the label "car," but the model has no idea where one car ends and the next begins; touching objects of the same class fuse into a single connected component. For counting, tracking, or cutting out one specific object, you need instances: a separate mask, with its own identity, for each individual object. This section builds the instance segmenter that defined the field, Mask R-CNN, and it leans directly on the detection machinery of Chapter 23.

1. From Detection to Instance Masks Beginner

Recall the two-stage detector from Chapter 23. A backbone extracts a feature map; a Region Proposal Network (RPN) suggests a few hundred candidate object regions; for each region, features are cropped and pooled to a fixed size and fed to two heads, one that classifies the region and one that refines its box. Mask R-CNN keeps this entire pipeline and adds exactly one thing: a third head, parallel to the other two, that takes the same per-region features and predicts a small binary mask, typically $28 \times 28$, indicating which pixels inside the region belong to the object. The mask is then resized to the box and pasted back into the image. Figure 24.2.1 shows the addition.

Figure 24.2.1: The Mask R-CNN pipeline. The backbone, the Feature Pyramid Network (FPN, Section 23.3) that fuses multi-scale features, RPN, RoIAlign, and the class and box heads are inherited unchanged from Faster R-CNN (Chapter 23). Mask R-CNN adds the purple mask head: a small fully convolutional network that, for each proposal, predicts a 28x28 binary mask in parallel with classification and box regression.

A key design decision in the mask branch is that the mask and the class are decoupled. The mask head predicts one $28 \times 28$ mask per class, and at inference the network simply uses the mask for whichever class the classification head chose. This means the mask branch never has to learn to distinguish classes; it only learns to separate foreground from background within a region, and the much smaller classification problem is left to the dedicated class head. This decoupling, the paper showed, measurably improves mask quality compared with making one branch do both jobs. The mask loss is a per-pixel binary cross-entropy applied only to the mask of the ground-truth class, the binary analog of the dense cross-entropy from Section 24.1.
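The class-decoupled selection described above can be sketched in a few lines. This is a minimal illustration on random tensors, not torchvision's internal code; the names `mask_logits` and `pred_labels` and the sizes (3 regions, 5 classes) are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
num_regions, num_classes = 3, 5          # hypothetical sizes for illustration

# The mask head emits one 28x28 logit map per class for every region.
mask_logits = torch.randn(num_regions, num_classes, 28, 28)

# The class head's decision for each region (assumed here, not computed).
pred_labels = torch.tensor([2, 0, 4])

# Pick, for each region, only the channel of its predicted class.
region_idx = torch.arange(num_regions)
selected = mask_logits[region_idx, pred_labels]      # (3, 28, 28)

# A per-pixel sigmoid turns logits into foreground probabilities. Each channel
# is scored on its own; classes do not compete as they would under a softmax.
mask_probs = selected.sigmoid()
print(mask_probs.shape)
```

The only cross-talk between the two heads is the index lookup: the class head decides which channel to read, and the mask head never has to rank classes against each other.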

Key Insight: Decouple Mask from Class

Mask R-CNN predicts a separate binary mask for every class and selects the one matching the classification head's decision, rather than predicting a single multi-class mask. The total loss is a sum of three terms, $L = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}}$, and the mask term is averaged over the pixels of the ground-truth class channel only; the other class channels contribute nothing to it. Because each class's mask channel is trained only on foreground and background pixels for that class, "is this pixel part of the object" is cleanly separated from "what is the object," and each task gets a head sized for it. Separating responsibilities so each predictor solves the smaller problem is a recurring pattern in good architecture design.
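To make the three-term loss concrete, here is a hedged sketch on random tensors. Only the mask term follows the text closely (binary cross-entropy on the ground-truth class channel); the classification and box terms are simple stand-ins, and every tensor name and size is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_regions, num_classes = 4, 5                      # hypothetical sizes

mask_logits = torch.randn(num_regions, num_classes, 28, 28)   # mask head output
gt_labels = torch.tensor([1, 3, 3, 2])                        # true class per region
gt_masks = (torch.rand(num_regions, 28, 28) > 0.5).float()    # target mask per region

# L_mask: per-pixel binary cross-entropy on the ground-truth class channel only.
gt_channel = mask_logits[torch.arange(num_regions), gt_labels]   # (4, 28, 28)
loss_mask = F.binary_cross_entropy_with_logits(gt_channel, gt_masks)

# Stand-ins for the other two terms, so the sum L_cls + L_box + L_mask is visible.
cls_logits = torch.randn(num_regions, num_classes)
loss_cls = F.cross_entropy(cls_logits, gt_labels)
box_pred, box_target = torch.randn(num_regions, 4), torch.randn(num_regions, 4)
loss_box = F.smooth_l1_loss(box_pred, box_target)

total = loss_cls + loss_box + loss_mask
print(float(loss_mask), float(total))
```

Because the loss indexes a single channel per region, changing the logits of any other class channel leaves `loss_mask` untouched, which is the decoupling in executable form.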

2. RoIAlign: The Fix That Mattered Intermediate

The single most important technical contribution of Mask R-CNN is not the mask branch; it is RoIAlign. To understand why, look at how Faster R-CNN cropped per-region features. A region proposal is a box in image coordinates, say $x$ from 87.6 to 213.4 pixels. The feature map is downsampled, so to find the box on it you divide image coordinates by the stride; at stride 16 that box lives between feature coordinates 5.475 and 13.3375 (that is, 87.6 and 213.4 each divided by 16). The old RoIPool operation rounded those to integers, then divided the integer region into a grid of bins and rounded the bin boundaries again. Two rounding steps. For a box classifier those few-pixel misalignments wash out, because the box head is tolerant of them. For a mask, where the goal is a pixel-accurate boundary, that misalignment is poison: each rounding can move a region edge by up to half a feature cell, which at stride 16 is as much as 8 image pixels, so the cropped features no longer line up with the image and the predicted mask is shifted along with them.
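The arithmetic in this paragraph can be checked directly. The sketch below reproduces only the first of RoIPool's two rounding steps, using round-to-nearest as an assumption about the rounding rule, and maps the snapped edges back to image pixels.

```python
stride = 16
x1_img, x2_img = 87.6, 213.4            # box extent in image pixels (from the text)

# Exact position on the stride-16 feature map.
x1_feat, x2_feat = x1_img / stride, x2_img / stride

# First rounding step: snap both edges to integer feature cells.
x1_q, x2_q = round(x1_feat), round(x2_feat)

# Map the snapped edges back to image pixels to measure the shift.
shift_left = x1_img - x1_q * stride
shift_right = x2_img - x2_q * stride

print(f"exact feature coords:   {x1_feat:.4f} .. {x2_feat:.4f}")
print(f"snapped feature coords: {x1_q} .. {x2_q}")
print(f"edge shift in image px: {shift_left:.1f} and {shift_right:.1f}")
```

The edges snap to cells 5 and 13, which correspond to image pixels 80 and 208, so the crop moves by 7.6 and 5.4 image pixels before the second rounding has even happened. One feature cell covers 16 image pixels, which is why a sub-cell error is not small.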

RoIAlign removes both roundings. It keeps the region coordinates as floating-point values, divides the region into bins without rounding, places sampling points inside each bin at fractional positions, reads the feature-map value at each sampling point with bilinear interpolation (exactly the interpolation you used for image warping in Chapter 5), and aggregates the samples in each bin, typically by averaging, into that bin's output value. Because no coordinate is ever snapped to an integer, the cropped features are spatially faithful to the image. Figure 24.2.2 contrasts the two.

Figure 24.2.2: Why RoIAlign matters. Left, RoIPool snaps the floating-point region (orange) to integer grid cells (red dashed), shifting the cropped features off the object. Right, RoIAlign keeps the region exact and reads feature values at fractional sample points (green dots) with bilinear interpolation, so the crop stays spatially faithful. For pixel-accurate masks, this difference is decisive.
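For readers who want to see the mechanism rather than the library call, here is a minimal pure-Python sketch of what happens inside one RoIAlign bin: fractional sample points, bilinear reads, and an average. The helper names `bilinear_sample` and `align_bin` are hypothetical, the sketch treats integer coordinates as cell centers, and it ignores boundary handling that a real implementation needs.

```python
import math

def bilinear_sample(fmap, y, x):
    """Read a 2D feature map (list of rows) at fractional position (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
    bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
    return (1 - wy) * top + wy * bottom

def align_bin(fmap, y_start, x_start, bin_h, bin_w, samples=2):
    """Average samples x samples bilinear reads spread evenly inside one bin."""
    total = 0.0
    for iy in range(samples):
        for ix in range(samples):
            y = y_start + (iy + 0.5) * bin_h / samples
            x = x_start + (ix + 0.5) * bin_w / samples
            total += bilinear_sample(fmap, y, x)
    return total / (samples * samples)

# A tiny ramp feature map: value = 10 * row + column.
fmap = [[10.0 * r + c for c in range(6)] for r in range(6)]

point_value = bilinear_sample(fmap, 2.5, 3.25)
bin_value = align_bin(fmap, y_start=1.2, x_start=0.7, bin_h=1.5, bin_w=1.5)
print(point_value, bin_value)
```

On a linear ramp, bilinear interpolation is exact, so the point read equals 10 * 2.5 + 3.25 and the bin average equals the ramp's value at the bin center. Nothing in either function calls `round()`, which is the whole point.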

The effect was large: in the original paper, switching RoIPool to RoIAlign improved mask accuracy by around ten points of mask average precision on the hardest, strict-overlap criteria, and it helped box detection too. The lesson generalized far beyond Mask R-CNN: RoIAlign is now the default region-cropping operator across detection and segmentation. The code below runs it on a feature map to show the shape contract.

# RoIAlign on a feature map, showing its shape contract.
# Given fractional-coordinate regions, it keeps the coordinates exact and reads
# feature values by bilinear sampling, producing a fixed-size crop per region.
import torch
from torchvision.ops import roi_align

# A single feature map (1 image, 256 channels, 50x50 spatial).
feat = torch.randn(1, 256, 50, 50)

# Two regions of interest as [batch_index, x1, y1, x2, y2] in feature-map coordinates.
# Note the deliberately fractional coordinates: RoIAlign keeps them exact.
boxes = torch.tensor([[0, 5.5, 8.2, 22.7, 31.9],
                      [0, 30.1, 12.4, 47.8, 44.0]])

pooled = roi_align(feat, boxes, output_size=(7, 7),  # fixed-size crop per region
                   spatial_scale=1.0,                 # boxes already in feature coords
                   sampling_ratio=2)                  # 2x2 bilinear samples per bin
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) -> one 7x7 feature crop per region

Code Fragment 1: torchvision's roi_align turns the two fractional-coordinate boxes into uniform 7x7 feature crops via bilinear sampling, with sampling_ratio=2 placing 2x2 sample points per bin, the operation that replaced lossy RoIPool. The output shape (2, 256, 7, 7) is one crop per region; the mask branch uses a larger 14x14 crop to predict its 28x28 mask.

Fun Fact

The entire RoIAlign contribution can be summarized as "stop rounding." Two round() calls, the kind every beginner writes without thinking, were quietly shifting every region crop by a fraction of a feature cell, and removing them bought roughly ten points of strict-overlap mask average precision. It is a useful reminder that in dense prediction a half-pixel is not a rounding error you can ignore; it is the difference between a mask that hugs the object and one that floats next to it. The signature phrase to remember the section by: masks live and die by the half-pixel. The illustration below captures the fix in one image.

Illustration: two side-by-side panels. On the left a character snaps a crop rectangle to the nearest whole grid cells so it floats off the fish it is trying to cut out, and on the right a character keeps the region exactly in place and reads values at in-between sample points so the crop hugs the fish, illustrating how RoIAlign replaces RoIPool's coordinate rounding with bilinear sampling.
Caption: Two innocent round() calls were shifting every crop off its object; RoIAlign just stops rounding, because masks live and die by the half-pixel.

3. Running Mask R-CNN End to End Intermediate

With the pieces in place, the full model is straightforward to run. torchvision provides Mask R-CNN with a ResNet-50-FPN backbone and COCO-pretrained weights. The model takes a list of image tensors and returns, per image, a dictionary of boxes, labels, confidence scores, and per-instance masks. The code below loads it, runs inference, and filters by confidence, the typical inference recipe.

# Run a COCO-pretrained Mask R-CNN end to end and filter by confidence.
# The model takes a LIST of image tensors and returns, per image, a dict of
# boxes, labels, scores, and per-instance soft masks at full resolution.
import torch
from torchvision.models.detection import (maskrcnn_resnet50_fpn,
                                          MaskRCNN_ResNet50_FPN_Weights)

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()
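# The detection preset only converts the image to a float tensor in [0, 1];
# the model itself normalizes and resizes the image internally.
preprocess = weights.transforms()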

# Stand-in for a loaded image (C, H, W) in [0, 1]; replace with read_image(...) / 255.
image = torch.rand(3, 480, 640)
batch = [preprocess(image)]                        # model takes a LIST of images

with torch.no_grad():
    output = model(batch)[0]                        # dict for the single image

keep = output["scores"] > 0.7                       # confidence threshold
boxes  = output["boxes"][keep]                       # (N, 4)
labels = output["labels"][keep]                      # (N,) COCO class indices
masks  = output["masks"][keep]                        # (N, 1, H, W) soft masks in [0, 1]
print(f"{keep.sum().item()} confident instances")
binary_masks = masks.squeeze(1) > 0.5                # threshold to hard masks
print(binary_masks.shape)                            # (N, 480, 640)

Code Fragment 2: Mask R-CNN inference in torchvision. The scores > 0.7 mask keeps only confident detections, and the model returns soft per-instance masks of shape (N, 1, H, W) at full image resolution; thresholding at 0.5 yields a hard binary mask for each detected object. Each mask is paired with its own box, class label, and score.

The output shape (N, 1, H, W) deserves a comment: there are $N$ confident instances, each carrying a single-channel soft mask at full image resolution. Internally the network predicted a small $28 \times 28$ mask in the box's coordinate frame and resized it to the box; torchvision pastes it into a full-frame canvas for you. To visualize, overlay each binary mask in a distinct color, the standard instance-segmentation display where every car gets its own hue rather than one shared "car" color.

Common Misconception: Mask R-CNN Predicts a Full-Resolution Mask

The full-resolution (N, 1, H, W) output tempts learners to assume Mask R-CNN reasons about the boundary at full pixel detail. It does not. Each mask is predicted at a fixed coarse grid (28 by 28 by default) inside its box, then bilinearly upsampled and pasted into the frame, so the final boundary smoothness you see is interpolation, not learned detail. Two consequences follow that trip up first projects. First, every object gets the same 28 by 28 grid regardless of its size, so a large object's mask is far coarser in image pixels than a small one's: big objects get blocky or over-rounded boundaries (a 28 by 28 grid stretched over a 600-pixel car gives each mask cell about 21 pixels of width, too few to trace fine contours), which is why semantic segmenters from Section 24.1 often have crisper boundaries than Mask R-CNN. Second, the mask is confined to the detected box: if the box is too tight and clips the object, the mask is clipped too, so box quality caps mask quality. The mask branch cannot fix a bad detection; it only labels foreground inside whatever region the detector hands it.
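The upsample-and-paste step is easy to imitate, which makes the coarseness tangible. The sketch below is a simplification, not torchvision's actual paste routine: it assumes an integer-aligned, hypothetical 600x300-pixel box and a random 28x28 soft mask.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H, W = 480, 640
box = (20, 100, 620, 400)                # hypothetical box: x1, y1, x2, y2
coarse = torch.rand(1, 1, 28, 28)        # stand-in for one predicted soft mask

x1, y1, x2, y2 = box
box_w, box_h = x2 - x1, y2 - y1

# Stretch the 28x28 grid to the box size with bilinear interpolation.
resized = F.interpolate(coarse, size=(box_h, box_w), mode="bilinear",
                        align_corners=False)[0, 0]

# Paste into an empty full-frame canvas; everything outside the box stays 0.
canvas = torch.zeros(H, W)
canvas[y1:y2, x1:x2] = resized

cell_w, cell_h = box_w / 28, box_h / 28
print(f"one mask cell spans about {cell_w:.1f} x {cell_h:.1f} image pixels")
print(canvas.shape)
```

For this box each predicted cell covers roughly 21 by 11 image pixels, and the canvas is exactly zero outside the box, which is both consequences from the paragraph in one small experiment.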

Practical Example: Counting Cells in a Pathology Lab

Who: a computational-pathology group automating cell counting in stained tissue slides, 2024. Situation: they had trained a U-Net semantic segmenter that labeled every pixel "cell" or "background" with high pixel accuracy. Problem: the clinical metric was the number of cells of each type, and in dense regions the cells touched, so the U-Net's mask fused dozens of cells into one giant blob; counting connected components undercounted badly and a single mis-merged boundary could drop the count by a third. Decision: they reframed the task from semantic to instance segmentation and fine-tuned a Mask R-CNN, pretrained on COCO, on a few thousand annotated cells, reasoning that per-instance masks would keep touching cells separate by construction. They kept the U-Net as a fast first-pass tissue-versus-background filter. Result: instance-level counting error fell by more than half, and the per-cell masks let them also measure individual cell areas, a bonus the pathologists had not even asked for. Lesson: when the question is "how many" or "which one," semantic segmentation is the wrong tool no matter how good its pixel accuracy; instances are not a nicety, they are the answer. The reframing, not a bigger model, fixed the metric.
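The undercounting failure is easy to reproduce on a toy grid. The sketch below is entirely synthetic (three hypothetical square "cells," two of them touching) and uses a small hand-written connected-components counter rather than any particular library; it is meant only to show why a merged semantic mask loses the count that instance masks keep.

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground blobs in a 2D grid of 0/1 values."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

def square(h, w, top, left, size):
    """A binary mask with one filled square, standing in for one cell."""
    return [[1 if top <= r < top + size and left <= c < left + size else 0
             for c in range(w)] for r in range(h)]

# Three hypothetical cells on an 8x12 grid; the first two share an edge.
instance_masks = [square(8, 12, 1, 1, 3), square(8, 12, 1, 4, 3), square(8, 12, 4, 8, 3)]

# A semantic segmenter only sees the union of all cell pixels.
semantic = [[int(any(m[r][c] for m in instance_masks)) for c in range(12)]
            for r in range(8)]

semantic_count = count_components(semantic)
instance_count = len(instance_masks)
areas = [sum(map(sum, m)) for m in instance_masks]
print(semantic_count, instance_count, areas)
```

The union mask yields two blobs for three cells, while the instance masks give the right count and a per-cell area for free.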

Library Shortcut: Fine-Tuning on Your Own Classes

Adapting Mask R-CNN to new classes does not mean rebuilding the RPN, RoIAlign, and three heads from scratch. torchvision lets you swap just the final predictors and keep everything else pretrained:

# Retarget a pretrained Mask R-CNN to new classes by swapping only the heads.
# The box and mask predictors are replaced for the new class count, while the
# backbone, FPN, RPN, and RoIAlign keep their expensive pretrained weights.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 4   # background + 3 of your classes

# Replace the box predictor.
in_feat = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)

# Replace the mask predictor.
in_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_mask, 256, num_classes)
# Now train as usual; the backbone, FPN, and RPN keep their pretrained weights.

Code Fragment 3: Retargeting Mask R-CNN to new classes in about ten lines using torchvision, instead of reimplementing the two-stage detector and mask branch. Only box_predictor and mask_predictor are swapped for the new num_classes, while the pretrained backbone, FPN, RPN, and RoIAlign are kept, so fine-tuning can often work with a few hundred to a few thousand annotated images rather than COCO-scale data.

This is roughly ten lines versus reimplementing the entire two-stage detector plus mask branch, and it preserves the expensive-to-learn backbone and region machinery while retargeting only the class-specific output layers. The library owns RoIAlign, the FPN, anchor generation, and the loss combination of subsection 1.

Research Frontier: Beyond Detect-Then-Segment

Mask R-CNN's two-stage detect-then-segment design ruled instance segmentation from 2017 into the early 2020s, but the frontier has moved to single-stage and query-based methods that drop the explicit proposal-and-crop step. YOLACT and SOLO predict masks directly without per-region cropping; more importantly, the mask-transformer family, Mask2Former (Section 24.4), treats instance segmentation as predicting a fixed set of masks with a transformer decoder, eliminating RoIAlign, the RPN, and non-maximum suppression entirely, and beating Mask R-CNN on COCO mask average precision. Newer detectors such as the RT-DETR and YOLO-family segmentation variants of 2024-2025 push real-time instance segmentation onto edge hardware. The "add a branch to a detector" recipe taught here remains the clearest mental model and a strong baseline, but production systems in 2025 increasingly reach for query-based universal segmenters.

Exercise 24.2.1: Semantic, Instance, or Neither Conceptual

For each task, state whether semantic segmentation, instance segmentation, or plain object detection is the right tool, and justify in one sentence: (a) estimating the percentage of a satellite image covered by forest; (b) counting the number of pedestrians waiting at a crosswalk; (c) measuring the area in square millimeters of each individual skin lesion in a dermatology photo; (d) blurring every face in a crowd photo for privacy. Then explain why a perfectly accurate semantic segmenter would still fail task (b).

Exercise 24.2.2: Measure the RoIAlign Difference Coding

Create a small synthetic feature map containing a sharp diagonal edge. Crop the same fractional-coordinate region twice: once with torchvision.ops.roi_pool and once with torchvision.ops.roi_align, both to a 7x7 output. Display the two crops side by side and compute their mean absolute difference. Write a short paragraph relating what you see to the quantization argument of subsection 2, and explain why this difference would matter more for a mask head than for a box-classification head.

Exercise 24.2.3: Confidence Threshold and the Precision-Recall Trade Analysis

Run the pretrained Mask R-CNN of subsection 3 on five images containing many overlapping objects. Sweep the score threshold from 0.3 to 0.9 in steps of 0.1 and, for each value, count how many instances survive and inspect (qualitatively) how many are correct versus spurious. Plot instance count versus threshold and write one paragraph connecting the curve to the precision-recall trade-off and the mask average-precision metric, which you will formalize in Section 24.6. What threshold would you ship for the cell-counting application of subsection 3, and why?