Section 25.5: Open-Vocabulary Detection & Segmentation

"The old detector knew exactly eighty things and was magnificently helpless about the eighty-first. The new one does not have a list. You name the thing, in whatever words you like, and it goes and finds it. I find this both liberating and faintly terrifying."
A Detector That Lost Its Fixed Category List

Big Picture

Classical detectors and segmenters were locked to a fixed list of categories chosen at training time. Open-vocabulary models break that lock by replacing the fixed classification head with an alignment to CLIP-style text embeddings, so any phrase you can write becomes a category the model can localize. The same shared image-text space that gave CLIP zero-shot classification gives detectors and segmenters an open vocabulary: a region of an image is matched against the embedding of an arbitrary text query rather than against a fixed set of class weights. Layered on top is the Segment Anything Model, a promptable mask primitive that outputs a mask for any object you point at. Together they let a pipeline detect and segment anything you can name, the dense-prediction payoff of language supervision.

In Section 25.4 CLIP turned a text prompt into an image classifier. This section carries that idea into the dense tasks of Chapter 23 and Chapter 24. We will see how a detector aligns each candidate region to text embeddings so its vocabulary becomes open, how segmentation follows the same recipe at the pixel level, and how the Segment Anything Model reframes segmentation as a promptable primitive trained on a billion masks. We will then assemble the grounded detect-then-segment pipeline that has become a default tool. The thread that began with hand-drawn boxes and masks ends with a system you steer by typing a phrase, and it sets up the foundation-model landscape of Section 25.6.

1. From a Fixed Head to Text Embeddings Intermediate

Recall how a closed-vocabulary detector classifies a region in Chapter 23: a region feature vector is multiplied by a fixed classification weight matrix, one column per known class, and a softmax picks the class. The matrix has exactly $C$ columns, so the detector can only ever name $C$ things; with $C = 80$, the eighty-first category does not exist because there is no eighty-first column. The open-vocabulary move is to replace that fixed weight matrix with text embeddings: encode each candidate category name with a CLIP text encoder and use those text vectors as the classification weights. Adding a category now means adding a phrase, not retraining a head, which is exactly the zero-shot trick of Section 25.4 moved from whole images to regions.

Concretely, if a region produces a feature $r \in \mathbb{R}^d$ projected into the CLIP space and the candidate categories have text embeddings $t_1, \ldots, t_K$, the region's class scores are the similarities

$$s_k = \frac{r^\top t_k}{\|r\|\,\|t_k\|}, \qquad k = 1, \ldots, K$$

and the region is assigned the highest-scoring category (or labeled background if no score clears a threshold). The number of categories $K$ is whatever you supply at inference time. Figure 25.5.1 contrasts the closed head with the open one.
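The scoring rule above fits in a few lines of NumPy. What follows is a minimal sketch rather than any particular detector's implementation: the function name `score_regions`, the toy 3-dimensional vectors standing in for CLIP-space features, and the background threshold of 0.2 are all assumptions made for illustration.

```python
import numpy as np

def score_regions(region_feats, text_embs, threshold=0.2):
    """Cosine-similarity scores s_k per region, with a background fallback.

    region_feats: (num_regions, d) region features already projected into the text space.
    text_embs:    (K, d) text embeddings, one row per candidate category.
    threshold:    illustrative cutoff (an assumption); below it a region is background.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = r @ t.T                                  # (num_regions, K) cosine similarities
    labels = scores.argmax(axis=1)                    # highest-scoring category per region
    labels[scores.max(axis=1) < threshold] = -1       # -1 marks background
    return scores, labels

# Toy stand-ins for text embeddings of two category names, e.g. "dog" and "frisbee".
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
region_feats = np.array([[2.0, 0.1, 0.0],    # points mostly along category 0
                         [0.0, 3.0, 0.3],    # points mostly along category 1
                         [0.0, 0.0, 5.0]])   # orthogonal to both categories

scores, labels = score_regions(region_feats, text_embs)
print(labels)            # [ 0  1 -1]

# Opening the vocabulary: adding a category is adding a row, with no retraining.
wider_vocab = np.vstack([text_embs, [[0.0, 0.0, 1.0]]])
_, wider_labels = score_regions(region_feats, wider_vocab)
print(wider_labels)      # [0 1 2]
```

The third region is background under the two-category vocabulary and becomes category 2 once a matching text embedding is supplied, which is the whole point of letting $K$ be chosen at inference time.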

Figure 25.5.1: Closed versus open vocabulary. A classical detector multiplies each region feature by a fixed classification matrix with one column per known class, so the vocabulary is frozen at training. An open-vocabulary detector projects regions into the CLIP space and scores them against text embeddings of arbitrary category names, so the vocabulary is whatever phrases you provide at inference.

Models like ViLD, RegionCLIP, OWL-ViT, and Grounding DINO all implement this region-text alignment, but they split into two training lineages that are worth distinguishing. The CLIP-distillation lineage (ViLD, RegionCLIP, OWL-ViT) starts from a pretrained CLIP model and teaches a region head to produce embeddings that land in the same space as CLIP's text encoder: ViLD distills the CLIP image encoder's response on cropped regions, RegionCLIP adds region-text contrastive pretraining on pseudo-labels, and OWL-ViT fine-tunes a CLIP ViT directly for object detection with a simple linear head per patch. The grounded-data lineage (Grounding DINO, GLIP, Florence-2) instead trains on large datasets of image-text pairs where phrases are explicitly tied to boxes, so the model learns region-to-phrase grounding from supervision rather than distillation. The practical difference: CLIP-distillation models inherit CLIP's vocabulary and zero-shot breadth; grounded-data models tend to have stronger localization because they saw explicit box-phrase pairs during training. Both share the same inference-time interface: supply any text query, get back boxes. The detection mechanics underneath, region proposals and the set-prediction transformer decoder, are the ones you built in Chapter 23; only the classification step changed.

Key Insight: Open Vocabulary Is Just the CLIP Trick on Regions

There is no new principle here beyond Section 25.4. Zero-shot classification compared a whole image to text embeddings; open-vocabulary detection compares a region to text embeddings; open-vocabulary segmentation compares a pixel or mask to text embeddings. The unit of comparison shrinks from image to region to pixel, but the move is identical: throw away the fixed classification head and align visual features to a shared space with language. Once you see this, every open-vocabulary system in the literature reads as the same idea applied at a different spatial granularity.

2. Open-Vocabulary and Promptable Segmentation Intermediate

Segmentation gets the same treatment, at the pixel level. Language-driven segmenters (LSeg, and the open-vocabulary mask transformers that followed) compute a per-pixel or per-mask embedding in the CLIP space and label each by similarity to text queries, so you can segment "the wooden chair" without that class ever appearing in a training mask. The mask-transformer machinery of Chapter 24 supplies the masks; the text alignment supplies the open vocabulary.
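The per-pixel version of the same comparison can be sketched as follows. This is a toy illustration under stated assumptions, not the LSeg implementation: `label_pixels` is a hypothetical helper, and the 2-dimensional embeddings stand in for the per-pixel features a real segmenter would produce in the shared image-text space.

```python
import numpy as np

def label_pixels(pixel_embs, text_embs):
    """Label each pixel with the index of its most similar text query.

    pixel_embs: (H, W, d) per-pixel embeddings in the shared image-text space.
    text_embs:  (K, d) embeddings of the K text queries.
    Returns an (H, W) integer map of query indices.
    """
    p = pixel_embs / np.linalg.norm(pixel_embs, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    sims = p @ t.T                      # (H, W, K) cosine similarities
    return sims.argmax(axis=-1)

# Toy stand-ins for the text embeddings of "the wooden chair" and "the floor".
text_embs = np.array([[1.0, 0.0],
                      [0.0, 1.0]])

# A 2 x 3 "image": the left two columns resemble query 0, the right column query 1.
pixel_embs = np.zeros((2, 3, 2))
pixel_embs[:, :2] = [0.9, 0.1]
pixel_embs[:, 2] = [0.2, 0.8]

label_map = label_pixels(pixel_embs, text_embs)
print(label_map)
# [[0 0 1]
#  [0 0 1]]
```

Only the shape of the visual features changed relative to the region case: a grid of embeddings instead of a list of them, compared against the same kind of text matrix.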

A different and hugely influential idea arrived with the Segment Anything Model (SAM, Kirillov et al., 2023). SAM does not classify at all. It is a promptable segmenter: given an image and a prompt (a click point, a box, or a rough mask), it outputs a high-quality mask for the object indicated by the prompt. It was trained on SA-1B, a dataset of 1.1 billion masks built with a data engine in which the model itself proposed masks that annotators refined, bootstrapping to a scale no manual effort could reach. Because the prompt rather than a label specifies the target, SAM segments objects it was never told the names of; it is a segmentation foundation model, a reusable primitive rather than a task-specific network. Figure 25.5.2 shows its three-part architecture.

Common Misconception: SAM Is Not an Open-Vocabulary Segmenter

Because SAM is introduced alongside open-vocabulary models, it is easy to assume you can type "the wooden chair" and SAM will find and segment it. SAM has no text input and no category output at all: it answers the question "what object is at this point or inside this box?" with a mask, and it cannot tell you what that object is. The text understanding lives entirely in a separate model (the CLIP-aligned detector of subsection 1); a grounded pipeline uses that detector to turn a phrase into boxes and only then hands the boxes to SAM for pixel-accurate masks (subsection 3). Do not confuse SAM's promptable generality (a click works on any object) with open-vocabulary recognition (naming an object from text). Diagnostic question: given only SAM and an image, could you segment "every dog" from the word "dog"? No, you first need a model that maps the word to a location; SAM supplies the mask, never the name.

Figure 25.5.2: The Segment Anything Model. A heavy image encoder runs once per image to produce an embedding; a lightweight prompt encoder turns points or boxes into tokens; a fast mask decoder combines them into a mask in milliseconds. Because the expensive encoding is amortized, you can re-prompt the same image interactively without recomputing the image embedding, the design that makes SAM feel real-time.

What makes SAM usable interactively is this split: the expensive image encoder runs once, and every subsequent prompt invokes only the lightweight prompt encoder and mask decoder, so a user can click around an image and get near-instant masks. The code below shows the prompt-and-mask loop with the reference implementation.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

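sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # path to a downloaded ViT-H checkpoint
predictor = SamPredictor(sam)
# image_rgb: an H x W x 3 uint8 NumPy array in RGB order, loaded beforehand.
predictor.set_image(image_rgb)               # heavy encoder runs ONCE here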

# Now prompt as many times as you like; each call uses only the fast decoder.
point = np.array([[420, 310]])               # one (x, y) click on the target object
label = np.array([1])                        # 1 = foreground point, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label,
    multimask_output=True,                   # return 3 candidate masks at different scales
)
best = masks[scores.argmax()]                # pick the highest-confidence mask
print("masks returned:", masks.shape[0], "| best mask covers", int(best.sum()), "pixels")
# Example output (the pixel count depends on the image and the click):
# masks returned: 3 | best mask covers 18742 pixels

Code Fragment 1: SAM in use. set_image pays the one-time encoding cost; each predict call with a point or box prompt is fast because only the mask decoder runs. multimask_output returns several masks at different scales to resolve the ambiguity of a single click (the whole person, the shirt, or a button).

Fun Fact

SAM's training set, SA-1B, contains 1.1 billion masks, far more than any earlier segmentation dataset. They were not drawn by hand from scratch. A data engine ran in three stages: first annotators produced and corrected masks with SAM's interactive help, then SAM proposed masks and annotators only added missed objects, and finally SAM ran fully automatically while humans merely audited the output. The model helped label the data that trained the model, a bootstrapping loop that is now a standard recipe for building foundation-scale datasets, and a preview of the generative data engines of Chapter 37.

3. Grounded Pipelines: Detect, Then Segment Advanced

SAM is powerful but, as subsection 2 noted, it does not know category names; it segments what you point at. An open-vocabulary detector knows names but outputs boxes, not pixel-precise masks. The natural and now ubiquitous combination is to chain them: an open-vocabulary detector (commonly Grounding DINO) takes a text query, finds the boxes of every matching object, and feeds those boxes as prompts to SAM, which returns clean masks. The result, often called Grounded SAM, lets you type a phrase and get pixel-accurate masks of everything matching it, with no task-specific training. This composition is the practical face of the whole chapter: foundation models as interoperable primitives.

# Conceptual pipeline: text query -> open-vocab boxes -> SAM masks.
# (Grounded-SAM-style components; `grounding_dino` stands for whatever detector
# wrapper you use, and its API varies by package and version.)
text_query = "every dog . a red frisbee ."        # period-separated phrases, open vocabulary

# 1) Open-vocabulary detection: phrase -> boxes (region-text alignment of subsection 1).
boxes, phrases, confidences = grounding_dino.predict(image_rgb, text_query)

# 2) Promptable segmentation: each detected box becomes a SAM box prompt.
# SAM expects each box as a length-4 array in absolute-pixel XYXY order;
# convert first if the detector returns another format.
predictor.set_image(image_rgb)                      # encode image once
masks = []
for box in boxes:
    m, scores, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(m[0])                              # single mask of shape (H, W)

print(f"query matched {len(boxes)} objects: {phrases}")
# Illustrative output (the exact phrase strings depend on the detector):
# query matched 3 objects: ['every dog', 'every dog', 'a red frisbee']

Code Fragment 2: The Grounded SAM detect-then-segment pipeline. Grounding DINO turns the open-vocabulary text query into boxes via the region-text alignment of subsection 1; each box prompts SAM for a pixel-accurate mask. No category list and no task-specific training are involved.
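One practical detail when wiring the two stages together is the box format. SAM's predictor takes a box prompt as absolute-pixel (x0, y0, x1, y1), while some detectors, including the Grounding DINO reference implementation as far as we are aware, emit boxes as normalized (cx, cy, w, h). The helper below is a minimal sketch of that conversion; the function name is hypothetical, and you should check the format your detector version actually returns before relying on it.

```python
import numpy as np

def cxcywh_norm_to_xyxy_pixels(boxes, image_width, image_height):
    """Convert normalized (cx, cy, w, h) boxes to absolute-pixel (x0, y0, x1, y1).

    Assumes the input boxes are normalized to [0, 1] by image size; this is an
    assumption about the detector, so verify it for the package you use.
    """
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    cx, cy, w, h = boxes.T
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    xyxy *= np.array([image_width, image_height, image_width, image_height])
    # Clip to the image so slightly out-of-frame boxes remain valid prompts.
    xyxy[:, [0, 2]] = xyxy[:, [0, 2]].clip(0, image_width)
    xyxy[:, [1, 3]] = xyxy[:, [1, 3]].clip(0, image_height)
    return xyxy

# A centered box covering 20% of the width and 40% of the height of a 640 x 480 image.
boxes_xyxy = cxcywh_norm_to_xyxy_pixels([[0.5, 0.5, 0.2, 0.4]],
                                        image_width=640, image_height=480)
print(boxes_xyxy.round(1))   # approximately [[256. 144. 384. 336.]]
```

Each row of `boxes_xyxy` can then be passed as the `box` argument in the loop above. A silent format mismatch is a common source of masks that land in the wrong place even though the detector was right.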

The practical example below shows a team replacing a brittle, labeled-from-scratch segmentation project with exactly this composition.

Practical Example: Auto-Labeling a Segmentation Dataset Overnight

Who: a robotics startup, 2023, building a bin-picking system that needed pixel masks for roughly two hundred warehouse object types. Situation: their plan was to hand-annotate masks, but a pilot showed that masking two hundred categories across enough images would take their small annotation team many months. Problem: they had the images and a text list of the object types, but no masks and no time to draw them. Decision: they ran a Grounded SAM pipeline, feeding each object type as a text query to an open-vocabulary detector and the resulting boxes to SAM, producing candidate masks automatically. Annotators then only reviewed and corrected the machine-generated masks instead of drawing from scratch, the same data-engine loop SAM itself was trained with. Result: the first full pass of masks across the dataset finished in a single overnight run; human review and correction took days rather than months, and the corrected masks were used to fine-tune a small, fast in-house segmenter for deployment on the robot. Lesson: foundation models are most valuable not as the final deployed model (too large for an edge robot) but as a labeling engine that bootstraps a smaller specialist. Open vocabulary plus promptable segmentation turned a months-long labeling project into an overnight job plus a review pass.

Library Shortcut: Grounded SAM in a Few Lines

Much of the detect-then-segment composition is packaged for you. Through Hugging Face transformers, Grounding DINO runs behind the zero-shot-object-detection pipeline and SAM loads as a pretrained model, so the detection half of the loop above collapses to:

# Open-vocabulary detection through a single Hugging Face pipeline:
# candidate labels in, matching boxes out, no training.
from transformers import pipeline

detector = pipeline(task="zero-shot-object-detection", model="IDEA-Research/grounding-dino-tiny")
# `image` is a PIL image (or a path/URL); each detection is a dict with "score", "label", and "box".
detections = detector(image, candidate_labels=["a dog", "a red frisbee"])
# Each detection's box can then be passed to SAM as a box prompt for a pixel-accurate mask.

Code Fragment 3: The detect half of Grounded SAM in two calls using the Hugging Face pipeline. Passing candidate_labels to a zero-shot-object-detection task runs Grounding DINO over an open vocabulary, replacing the manual model loading and box decoding of Code Fragment 2; each returned box can then prompt SAM for a mask.

For the detection half, this replaces manual model loading and box decoding with a couple of calls, and the library handles tokenization, preprocessing, and box decoding. The boxes still have to be handed to SAM as prompts, as in Code Fragment 2. SAM's own automatic mask generator (SamAutomaticMaskGenerator) similarly segments every object in an image with one call, though without names. The explicit pipeline above exists so you understand which model supplies the names and which supplies the pixels.

Research Frontier: SAM 2, Faster Detectors, and Unified Models

The open-vocabulary frontier in 2024 to 2026 moves on three fronts. SAM 2 (Ravi et al., 2024) extends promptable segmentation to video with a streaming memory, so a single click tracks an object across frames, the bridge into the video understanding of Chapter 26. Real-time open-vocabulary detectors such as YOLO-World push region-text alignment to the speeds the deployment chapter cares about, narrowing the gap between foundation-model flexibility and edge-model latency. And a unification trend is folding detection, segmentation, and grounding into single models conditioned on text, often built on the SigLIP encoders of Section 25.4 and increasingly invoked as tools by multimodal language models that decide what to detect or segment from a conversation. The clearest sign of that convergence is SAM 3 (Carion et al., 2025), which collapses the detect-then-segment pipeline of this section into a single model: its promptable concept segmentation takes a short noun phrase such as "yellow school bus" (or an image exemplar) and returns masks and identities for every matching instance at once, where SAM 1 and 2 segmented one prompted object at a time. The open question is whether the future is many specialized foundation primitives composed in pipelines, as in this section, or one model that does everything, which we take up in Section 25.6.

Exercise 25.5.1: The One Move Behind Open Vocabulary Conceptual

The Key Insight claims open-vocabulary classification, detection, and segmentation are the same move at different spatial scales. For each of the three, state precisely what the unit of comparison is (image, region, pixel or mask) and what it is compared against, and identify the single component that is removed relative to the closed-vocabulary version. Then explain why SAM does not fit this pattern, and what it provides instead.

Exercise 25.5.2: Build a Grounded SAM Pipeline Coding

Using the library shortcut, build a pipeline that takes an image and a free-text query, detects matching objects with an open-vocabulary detector, and produces a pixel mask for each with SAM. Run it on three images with queries the models were not specifically trained on (for example "the leftmost potted plant"). Report how many objects were found and overlay the masks. Then write one paragraph on a failure case you observe and whether it originates in the detector (wrong or missing box) or in SAM (box correct but mask poor), connecting the diagnosis to the two-stage structure.

Exercise 25.5.3: Foundation Model as Labeling Engine Analysis

The robotics practical example used a foundation model to label data for a smaller deployed model rather than deploying the foundation model itself. Analyze this decision: list three reasons the foundation model was unsuitable for direct deployment on the robot, and three reasons it was nonetheless valuable as a labeling engine. Then describe the failure mode of trusting auto-generated masks without human review, and explain why the review-and-correct loop (rather than fully automatic labeling) is the safe operating point, connecting it to the SAM data engine described in the Fun Fact.