Part III: Deep Learning for Computer Vision
Chapter 24: Segmentation: Semantic, Instance & Promptable

Panoptic Segmentation: Unifying Things & Stuff

"Semantic segmentation labeled the sky and forgot the birds. Instance segmentation counted the birds and ignored the sky. Panoptic segmentation, finally, labeled every last pixel and could still tell me there were exactly three birds. It is the only one I trust to describe a whole scene."

A Scene Parser With No Pixels Left Behind
Big Picture

Panoptic segmentation labels every pixel in an image with exactly one class and, for countable object categories, one instance identity, unifying the semantic and instance views into a single non-overlapping description of the whole scene. The conceptual key is the split between "things," countable objects like people and cars that have instances, and "stuff," amorphous regions like sky, road, and grass that do not. Semantic segmentation handles stuff well but cannot count things; instance segmentation counts things but ignores stuff and allows overlapping masks. Panoptic demands both at once with no pixel unlabeled and no pixel claimed twice, and it brings a single metric, panoptic quality, that judges how well you did at recognizing and at delineating, together.

Ask either tool you have built so far to describe a street and it answers half the question. Section 24.1 gives a class to every pixel but no sense of separate objects; Section 24.2 gives a separate mask to every object but says nothing about the background and is happy to let masks overlap. Neither alone produces a complete scene description. A self-driving stack, a robot mapping a room, or a system captioning a photo wants both: where is the drivable road (stuff), and exactly which and how many pedestrians are on it (things), with every pixel accounted for. This complete-partition goal echoes the classical region-partitioning of Chapter 11, where watershed and graph cuts carved an image into non-overlapping regions by hand-designed energy rather than learned classes. Panoptic segmentation, defined by Kirillov and colleagues in 2019, is the task that demands this complete answer.

1. Things, Stuff, and the Panoptic Output Format Beginner

The vocabulary matters because it shapes the task. Things are categories whose instances you can count and point to: person, car, dog, bicycle. Stuff is categories with no natural instances, where counting makes no sense: sky, road, vegetation, water, wall. The same scene contains both, and the panoptic task treats them by one rule. Every pixel receives a pair $(c, k)$: a class label $c$ and an instance id $k$. For thing classes, $k$ distinguishes the individual objects (car 1, car 2). For stuff classes, $k$ is ignored: all pixels of "sky" share one region. The hard constraint is that the assignment is a partition: every pixel gets exactly one pair, so the segments are mutually exclusive and collectively exhaustive. Figure 24.3.1 shows how panoptic sits between the two tasks you know.

[Figure 24.3.1 panels. Semantic: person (merged), sky, road; classes only, things merge. Instance: person 1, person 2; things counted, no stuff. Panoptic: p 1, p 2, sky, road; every pixel, things + stuff.]
Figure 24.3.1: The three views of the same scene. Semantic segmentation (left) labels classes but merges the two people into one region. Instance segmentation (center) separates the things but ignores sky and road. Panoptic segmentation (right) does both: every pixel carries a class, things keep distinct instance identities (p 1, p 2), stuff is labeled as single regions, and nothing is left blank or claimed twice.

That partition constraint is exactly what neither earlier task enforces. Instance masks from Mask R-CNN can overlap, with two detections both claiming the same pixel, and they leave the background unlabeled. Producing a valid panoptic map therefore requires a merge step that resolves conflicts and fills the gaps, which is the subject of subsection 2.
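To make the $(c, k)$ output format concrete, here is a minimal sketch that packs the class map and the instance-id map into one integer per pixel and unpacks it again. The packing constant LABEL_DIVISOR, the helper names, and the class numbers (person = 1, sky = 10, road = 11) are illustrative assumptions, not a format defined in this chapter. Storing one integer per pixel makes the partition property hold by construction: a pixel cannot carry two segment ids.

```python
import torch

LABEL_DIVISOR = 1000  # assumed packing constant; must exceed the largest instance id

def encode_panoptic(class_map, id_map):
    """Pack (class, instance id) into a single integer segment id per pixel."""
    return class_map * LABEL_DIVISOR + id_map

def decode_panoptic(segment_map):
    """Recover (class_map, id_map) from the packed segment ids."""
    class_map = torch.div(segment_map, LABEL_DIVISOR, rounding_mode="floor")
    id_map = segment_map % LABEL_DIVISOR
    return class_map, id_map

# Toy 4x6 scene: sky (10) on the top two rows, road (11) below, two people (1).
class_map = torch.full((4, 6), 11, dtype=torch.long)
class_map[:2] = 10
id_map = torch.zeros((4, 6), dtype=torch.long)      # stuff keeps id 0
class_map[1:3, 1] = 1; id_map[1:3, 1] = 1           # person 1
class_map[1:3, 4] = 1; id_map[1:3, 4] = 2           # person 2

segment_map = encode_panoptic(class_map, id_map)
segments = segment_map.unique().tolist()
print(segments)   # [1001, 1002, 10000, 11000]: two people, sky, road
```

Each unique value is one segment, so the four values above are the complete scene description: two thing instances and two stuff regions.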

Fun Fact

The cleanest test for "thing versus stuff" is a grammar test, not a vision test: if the plural sounds natural with a number in front of it (three cars, two dogs), it is a thing; if it does not (three skies? two grasses?), it is stuff. The word "panoptic" itself is borrowed from Bentham's 18th-century Panopticon, the prison designed so a single guard could watch every cell at once. Fitting: a panoptic segmenter is the one model from which no pixel can hide. The mnemonic to keep: things you can count, stuff you can only point at.

Key Insight: One Partition, Two Question Types

Panoptic segmentation is not a third, unrelated task; it is the constraint that the answers to "what class" (everywhere) and "which instance" (for things) be a single consistent partition of the image. The deep payoff is that this constraint is what later let a single architecture, the mask transformers of Section 24.4, do all three tasks at once: if your model predicts a set of non-overlapping masks each with a class, then reading it as semantic, instance, or panoptic output is just a matter of how you group and report the masks. The partition view turned three research communities into one.
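The "grouping and reporting" idea can be sketched in a few lines. Assuming a panoptic result stored as a class map plus an instance-id map (the function name read_three_ways and the class numbers are hypothetical), the semantic, instance, and panoptic views are all read off the same partition:

```python
import torch

def read_three_ways(class_map, id_map, thing_classes):
    """Derive semantic, instance, and panoptic views from one panoptic partition."""
    # Panoptic view: the distinct (class, id) pairs, one row per segment.
    pairs = torch.stack([class_map, id_map], dim=-1).reshape(-1, 2).unique(dim=0)
    # Semantic view: keep the class of every pixel, forget the instance ids.
    semantic = class_map.clone()
    # Instance view: one boolean mask per thing segment; stuff is not reported.
    instances = [(c, (class_map == c) & (id_map == k))
                 for c, k in pairs.tolist() if c in thing_classes]
    return semantic, instances, pairs

# Toy 4x6 scene: sky (10), road (11), and two people (class 1, ids 1 and 2).
class_map = torch.full((4, 6), 11, dtype=torch.long)
class_map[:2] = 10
id_map = torch.zeros((4, 6), dtype=torch.long)
class_map[1:3, 1] = 1; id_map[1:3, 1] = 1
class_map[1:3, 4] = 1; id_map[1:3, 4] = 2

semantic, instances, pairs = read_three_ways(class_map, id_map, thing_classes={1})
print(len(instances))     # 2 person masks
print(pairs.tolist())     # [[1, 1], [1, 2], [10, 0], [11, 0]]
```

Because the instance masks come from a partition, they cannot overlap, which is the property a standalone instance segmenter does not guarantee.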

2. Merging Semantic and Instance Predictions Intermediate

The first panoptic systems were built by combining the two networks you already have: a semantic segmenter for stuff and an instance segmenter (Mask R-CNN) for things, with a heuristic merge that turns their outputs into a clean partition. The merge resolves three kinds of conflict. First, overlapping instance masks: sort instances by confidence and paint them in order, so a higher-confidence object wins any contested pixel. Second, thing-versus-stuff conflict: where an instance mask and a semantic stuff region disagree, the thing instance takes precedence (a person standing on the road is person, not road). Third, gaps: any pixel left unlabeled after instances are painted falls back to the semantic prediction, and very small stuff regions below a threshold are discarded as noise. The code below implements this canonical merge.

# The canonical heuristic merge that turns a semantic map plus instance masks
# into a valid panoptic partition: paint instances by descending confidence onto
# unclaimed pixels only, drop tiny fragments, then fill the gaps with stuff classes.
import torch

def merge_to_panoptic(sem_logits, inst_masks, inst_labels, inst_scores,
                      stuff_classes, score_thr=0.5, area_thr=64):
    """Combine a semantic map and instance masks into a non-overlapping panoptic map.
    Returns (class_map, instance_id_map): two HxW integer tensors.
    """
    H, W = sem_logits.shape[-2:]
    sem = sem_logits.argmax(0)                      # (H, W) semantic class per pixel
    class_map = torch.full((H, W), -1, dtype=torch.long)   # -1 = not yet assigned
    id_map    = torch.zeros((H, W), dtype=torch.long)
    next_id = 1

    # 1) Paint instances (things) in order of decreasing confidence.
    order = inst_scores.argsort(descending=True)
    for i in order:
        if inst_scores[i] < score_thr:
            continue
        m = (inst_masks[i] > 0.5) & (class_map == -1)  # only claim unassigned pixels
        if m.sum() < area_thr:
            continue                                   # drop tiny fragments
        class_map[m] = inst_labels[i]
        id_map[m] = next_id
        next_id += 1

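    # 2) Fill remaining pixels with stuff from the semantic map.
    unassigned = class_map == -1
    for c in stuff_classes:
        sel = unassigned & (sem == c)
        if sel.sum() < area_thr:
            continue                                   # drop tiny stuff regions too
        class_map[sel] = c                             # stuff: shared id 0
    # Pixels still at -1 (no instance, no accepted stuff region) remain void.
    return class_map, id_map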

# Toy run with dummy tensors.
sem_logits = torch.randn(20, 64, 64)                   # 20-class semantic logits
inst_masks  = torch.rand(3, 64, 64)                    # 3 instance soft masks
out_c, out_id = merge_to_panoptic(sem_logits, inst_masks,
                                  torch.tensor([1, 1, 5]),     # two people, one car
                                  torch.tensor([0.9, 0.8, 0.6]),
                                  stuff_classes=[10, 11, 12])
print(out_c.shape, "unique instances:", out_id.unique().numel() - 1)
Code Fragment 1: The canonical panoptic merge, merge_to_panoptic. The inst_scores.argsort(descending=True) order paints higher-confidence instances first, the mask (inst_masks[i] > 0.5) & (class_map == -1) claims only unassigned pixels (guaranteeing no overlap), the area_thr guard drops tiny fragments, and the stuff loop fills the remaining gaps. No pixel is claimed twice; any pixel that neither an accepted instance nor a listed stuff class covers stays at -1 and should be treated as void.

This two-network-plus-merge approach works but is clumsy: two models to train, two sets of features, and a hand-tuned heuristic gluing them. The field's response was end-to-end panoptic networks. Panoptic FPN shares one backbone between a semantic head and an instance head; Panoptic-DeepLab uses a single bottom-up design with class and instance-center predictions. These reduced the redundancy, but the cleanest answer came from the mask transformers of Section 24.4, which predict the non-overlapping mask set directly and make the merge step disappear.

3. Panoptic Quality: One Number for Two Jobs Intermediate

A panoptic predictor can fail in two distinct ways: it can mislabel or miss a segment (a recognition error), or it can label the right segment with a sloppy boundary (a segmentation error). The panoptic quality (PQ) metric scores both in a single value. For each class, a predicted segment and a ground-truth segment are matched if their Intersection-over-Union (IoU, the same overlap measure from Chapter 23) is strictly greater than 0.5. Because the segments in a panoptic map do not overlap, at most one predicted segment can cover more than half of a given ground-truth segment, and vice versa, so this matching is unique. PQ then combines the average IoU of the matched pairs with the precision and recall of the matching itself:

$$\text{PQ} = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p, g)}{|TP|}}_{\text{Segmentation Quality (SQ)}} \times \underbrace{\frac{|TP|}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}}_{\text{Recognition Quality (RQ)}}$$

The decomposition is the metric's gift. SQ (segmentation quality) is the mean IoU over correctly matched segments: how tight are your boundaries when you do find a segment? RQ (recognition quality) is an F1-style score over the matches: how well do you find and not hallucinate segments? The half-weights on $|FP|$ and $|FN|$ are simply what makes this expression equal the F1 score, the harmonic mean of precision and recall, written compactly. Their product, PQ, is high only when you both find segments and delineate them well, and reading SQ and RQ separately tells you which of the two failure modes is hurting you. The metric is computed per class and averaged, and the literature usually also reports it split as $\text{PQ}^{\text{Th}}$ over thing classes and $\text{PQ}^{\text{St}}$ over stuff classes. Figure 24.3.2 makes the two factors concrete.
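To see why the half-weights reproduce F1, write precision $P$ and recall $R$ in terms of the same counts and simplify the harmonic mean (the symbols $P$, $R$, and $F_1$ are introduced here only for this derivation):

$$P = \frac{|TP|}{|TP| + |FP|}, \qquad R = \frac{|TP|}{|TP| + |FN|}$$

$$F_1 = \frac{2PR}{P + R} = \frac{2|TP|}{2|TP| + |FP| + |FN|} = \frac{|TP|}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|} = \text{RQ}$$

The middle step follows from multiplying numerator and denominator by $(|TP| + |FP|)(|TP| + |FN|)$ and cancelling one factor of $|TP|$.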

[Figure 24.3.2 diagram. Match segments at IoU > 0.5 into TP (matched), FP (extra prediction), and FN (missed ground truth); RQ = TP / (TP + 0.5 FP + 0.5 FN); SQ = mean IoU of TP; PQ = SQ x RQ, one number.]
Figure 24.3.2: How panoptic quality is built. Segments are first matched at IoU greater than 0.5, splitting predictions into true positives, false positives, and false negatives. Recognition quality (RQ) is an F1-style score over those counts; segmentation quality (SQ) is the mean IoU of the matched pairs. PQ is their product, so a model must both recognize and delineate well to score highly.
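The formula can be evaluated directly once the matching has produced its tallies. The sketch below is a minimal illustration, not the official evaluator: the function name panoptic_quality and the per-class numbers are made up for the example, and void-pixel handling is ignored. It computes PQ, SQ, and RQ per class, then averages over all classes and over the thing and stuff subsets.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class from matched IoUs and FP/FN counts."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0                      # nothing matched: all three are zero
    sq = sum(matched_ious) / tp                   # mean IoU of the true positives
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # F1-style recognition score
    return sq * rq, sq, rq

# Hypothetical tallies: class -> (IoUs of matched pairs, FP count, FN count).
tallies = {
    "person": ([0.9, 0.8, 0.7], 1, 1),   # thing
    "car":    ([0.75, 0.85], 0, 2),      # thing
    "road":   ([0.9], 0, 0),             # stuff
}
things = {"person", "car"}

pq = {name: panoptic_quality(*t)[0] for name, t in tallies.items()}
pq_all = sum(pq.values()) / len(pq)
pq_th = sum(v for n, v in pq.items() if n in things) / len(things)
pq_st = sum(v for n, v in pq.items() if n not in things) / (len(pq) - len(things))
print(f"PQ={pq_all:.3f} PQ_Th={pq_th:.3f} PQ_St={pq_st:.3f}")
# PQ=0.678 PQ_Th=0.567 PQ_St=0.900
```

Here "person" has SQ 0.8 and RQ 0.75, so its PQ is 0.6: the boundaries are decent, but one missed and one spurious person pull the score down through RQ alone.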

The IoU at the core of PQ is the same overlap measure you will formalize fully in Section 24.6, and it connects back to the overlap and matching ideas you first met in the detection evaluation of Chapter 23. The minimal IoU helper below is the workhorse behind any PQ implementation.

# The mask-IoU helper at the heart of any panoptic-quality implementation.
# IoU is intersection over union of the set pixels in two boolean masks;
# the 0.5 threshold on this value decides whether a pair is a true positive.
import torch

def mask_iou(pred_mask, gt_mask):
    """IoU between two boolean HxW masks: intersection over union of set pixels."""
    inter = (pred_mask & gt_mask).sum().float()
    union = (pred_mask | gt_mask).sum().float()
    return (inter / union).item() if union > 0 else 0.0

a = torch.zeros(8, 8, dtype=torch.bool); a[2:6, 2:6] = True   # 4x4 square
b = torch.zeros(8, 8, dtype=torch.bool); b[3:7, 3:7] = True   # shifted 4x4 square
print(f"IoU = {mask_iou(a, b):.3f}")   # IoU = 0.391  (9 overlap / 23 union)
Code Fragment 2: Mask IoU, the matching criterion at the heart of panoptic quality. The mask_iou helper divides the bitwise-and intersection by the bitwise-or union. The two 4x4 squares a and b, offset by one pixel, overlap in a 3x3 region (9 pixels) and union to 23, giving IoU 0.39, below the 0.5 panoptic match threshold, so this pair would not count as a true positive.
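Building on the IoU helper, the following sketch shows how matching at IoU > 0.5 turns two lists of masks for one class into true-positive IoUs plus false-positive and false-negative counts. The names match_segments and square are hypothetical, the helper is redefined so the block runs on its own, and the sketch assumes non-overlapping segments and ignores the void rules of the official evaluator.

```python
import torch

def mask_iou(a, b):
    """IoU between two boolean HxW masks."""
    inter = (a & b).sum().item()
    union = (a | b).sum().item()
    return inter / union if union > 0 else 0.0

def match_segments(pred_masks, gt_masks, iou_thr=0.5):
    """Match one class's predictions to ground truth at IoU > iou_thr.
    Returns (IoUs of matched pairs, number of FPs, number of FNs)."""
    matched_ious, used_preds = [], set()
    for g in gt_masks:
        for j, p in enumerate(pred_masks):
            if j in used_preds:
                continue
            iou = mask_iou(p, g)
            if iou > iou_thr:          # unique for non-overlapping segments
                matched_ious.append(iou)
                used_preds.add(j)
                break
    num_fp = len(pred_masks) - len(used_preds)    # predictions with no partner
    num_fn = len(gt_masks) - len(matched_ious)    # ground truth with no partner
    return matched_ious, num_fp, num_fn

def square(r0, r1, c0, c1, size=8):
    """Boolean size x size mask with a filled rectangle."""
    m = torch.zeros(size, size, dtype=torch.bool)
    m[r0:r1, c0:c1] = True
    return m

gt_masks = [square(0, 4, 0, 4), square(4, 8, 4, 8)]     # two ground-truth objects
pred_masks = [square(0, 4, 0, 3), square(6, 8, 0, 2)]   # one close, one spurious
ious, num_fp, num_fn = match_segments(pred_masks, gt_masks)
print(ious, num_fp, num_fn)   # [0.75] 1 1
```

The first prediction covers 12 of the 16 ground-truth pixels and nothing else, so it matches with IoU 0.75; the second prediction touches no ground truth (a false positive), and the second ground-truth square is never found (a false negative).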

Practical Example: Reading PQ on an Autonomous-Driving Benchmark

Who: a perception team at an autonomous-vehicle company evaluating two candidate panoptic models on Cityscapes, 2024.
Situation: model A reported PQ 61.2 and model B reported PQ 60.8, so the program manager wanted to ship model A.
Problem: the gap was within the noise of a few annotation disagreements, and a single PQ number hid where each model actually differed.
Decision: the team looked at the SQ and RQ decomposition and the thing-versus-stuff split. Model A had higher SQ (sharper boundaries) but lower RQ on small thing classes; model B caught more distant pedestrians and cyclists (higher thing RQ) at the cost of slightly looser boundaries.
Result: for a driving system, missing a distant pedestrian is far more dangerous than a one-pixel-loose curb, so they shipped model B despite its marginally lower headline PQ, and added the small-object RQ as a tracked metric.
Lesson: the headline PQ is a starting point, not a verdict. The SQ-RQ decomposition and the thing-stuff split tell you which failure mode each model has, and the right choice depends on which failure your application cannot tolerate. Always read the components, not just the product.

Library Shortcut: PQ Without Reimplementing the Matching

The IoU matching, the per-class accumulation of true positives, false positives, and false negatives, and the void-region handling are fiddly to get exactly right. The reference implementation is the COCO panoptic API (panopticapi), and modern toolkits wrap it:

# Score panoptic quality with the official panopticapi evaluator instead of
# hand-rolling the IoU matching and per-class accumulation. It consumes
# id-encoded panoptic PNGs plus COCO-format JSON and returns PQ, SQ, RQ.
# (pip install git+https://github.com/cocodataset/panopticapi)
from panopticapi.evaluation import pq_compute

# Inputs are panoptic PNGs (id-encoded) plus matching JSON in COCO panoptic format.
results = pq_compute(gt_json_file="gt_panoptic.json",
                     pred_json_file="pred_panoptic.json",
                     gt_folder="gt_png/", pred_folder="pred_png/")
# results["All"], results["Things"], results["Stuff"] each give PQ, SQ, RQ.
print(results["All"]["pq"], results["Things"]["pq"], results["Stuff"]["pq"])
Code Fragment 3: Scoring panoptic quality in one pq_compute call, replacing roughly 150 lines of matching-and-accumulation code. It consumes id-encoded panoptic PNGs plus COCO-format JSON and returns results["All"], results["Things"], and results["Stuff"], each with PQ, SQ, and RQ, matching the leaderboard protocol exactly including the subtle void-pixel rules.

Beyond saving the hand-written accumulator, the call guarantees you are scoring exactly as the leaderboards do, including the subtle void-pixel rules. Detectron2's COCOPanopticEvaluator wraps this same evaluator. Implement IoU yourself to understand it; never hand-roll the full PQ accumulator for a paper number.

Research Frontier: Universal and Open-Vocabulary Panoptic Segmentation

The two-network-plus-merge design of subsection 2 is now historical. Since 2022, universal architectures, Mask2Former (Section 24.4) and OneFormer (2023), produce panoptic, instance, and semantic outputs from one trained model and one set of weights, setting state-of-the-art PQ on Cityscapes, COCO, and ADE20K simultaneously. The 2024-2025 frontier is open-vocabulary panoptic segmentation: methods like FC-CLIP and the open-vocabulary extensions of the SAM family (Section 24.5) partition a scene into things and stuff drawn from arbitrary text-named categories, never seen during training, by leaning on the CLIP text-image embedding space you will meet in Chapter 34. The panoptic task is increasingly served not by a task-specific model but by prompting a foundation model with the classes you happen to care about today.

Exercise 24.3.1: Classify Each Category Conceptual

For each category, decide whether it is a "thing" or "stuff" in the panoptic sense, and justify with the countability test from subsection 1: sky, traffic light, river, person, sand, kite, ceiling, bottle. Then describe one genuinely ambiguous category (for example, a forest versus individual trees) and explain how the dataset's annotation policy, not nature, ultimately decides which it is.

Exercise 24.3.2: Implement the Conflict Merge Coding

Extend the merge_to_panoptic function from subsection 2 so that, in addition to its current behavior, it returns the number of instance pixels that were suppressed because they collided with a higher-confidence instance. Construct a synthetic case with three overlapping instance masks of decreasing confidence and verify that the suppression count is non-zero. Write one paragraph on why painting in confidence order, rather than, say, area order, is the sensible default, and when you might prefer a different ordering.

Exercise 24.3.3: When SQ and RQ Disagree Analysis

Construct two hypothetical models on a 10-image set. Model X matches every ground-truth segment but always with IoU around 0.55 (loose boundaries). Model Y matches only 60 percent of segments but those it matches have IoU around 0.95 (tight boundaries, low recall). Using the PQ formula from subsection 3, compute the approximate PQ, SQ, and RQ for each, and write a paragraph explaining which model you would choose for (a) a photo-editing cutout tool and (b) a pedestrian-counting safety system, connecting your choice to the SQ-RQ trade-off and the driving example in subsection 3.