"The old detector knew exactly eighty things and was magnificently helpless about the eighty-first. The new one does not have a list. You name the thing, in whatever words you like, and it goes and finds it. I find this both liberating and faintly terrifying."
A Detector That Lost Its Fixed Category List
Classical detectors and segmenters were locked to a fixed list of categories chosen at training time. Open-vocabulary models break that lock by replacing the fixed classification head with an alignment to CLIP-style text embeddings, so any phrase you can write becomes a category the model can localize. The same shared image-text space that gave CLIP zero-shot classification gives detectors and segmenters an open vocabulary: a region of an image is matched against the embedding of an arbitrary text query rather than against a fixed set of class weights. Layered on top is the Segment Anything Model, a promptable mask primitive that outputs a mask for any object you point at. Together they let a pipeline detect and segment anything you can name, the dense-prediction payoff of language supervision.
In Section 25.4 CLIP turned a text prompt into an image classifier. This section carries that idea into the dense tasks of Chapter 23 and Chapter 24. We will see how a detector aligns each candidate region to text embeddings so its vocabulary becomes open, how segmentation follows the same recipe at the pixel level, and how the Segment Anything Model reframes segmentation as a promptable primitive trained on a billion masks. We will then assemble the grounded detect-then-segment pipeline that has become a default tool. The thread that began with hand-drawn boxes and masks ends with a system you steer by typing a phrase, and it sets up the foundation-model landscape of Section 25.6.
1. From a Fixed Head to Text Embeddings (Intermediate)
Recall how a closed-vocabulary detector classifies a region in Chapter 23: a region feature vector is multiplied by a fixed classification weight matrix, one column per known class, and a softmax picks the class. The matrix has exactly $C$ columns, so the detector can only ever name $C$ things; the eighty-first category does not exist because there is no eighty-first column. The open-vocabulary move is to replace that fixed weight matrix with text embeddings. Encode each candidate category name with a CLIP text encoder, and use those text vectors as the classification weights. Now adding a category means adding a sentence, not retraining a head, exactly the zero-shot trick of Section 25.4 moved from whole images to regions.
Concretely, if a region produces a feature $r \in \mathbb{R}^d$ projected into the CLIP space and the candidate categories have text embeddings $t_1, \ldots, t_K$, the region's class scores are the cosine similarities

$$s_k = \frac{r \cdot t_k}{\lVert r \rVert \, \lVert t_k \rVert}, \qquad k = 1, \ldots, K,$$

and the region is assigned the highest-scoring category (or labeled background if no score clears a threshold). The number of categories $K$ is whatever you supply at inference time. Figure 25.5.1 contrasts the closed head with the open one.
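To make the scoring rule concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the region-text score and uses made-up 3-dimensional vectors in place of real CLIP embeddings; the function names and the threshold value are illustrative, not taken from any particular detector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_region(region_feat, text_embs, names, threshold=0.5):
    """Score one region against K text embeddings; return (label, scores)."""
    scores = [cosine(region_feat, t) for t in text_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    # Fall back to background when no query is similar enough.
    label = names[best] if scores[best] >= threshold else "background"
    return label, scores


# Toy embeddings standing in for CLIP text vectors (assumed values).
names = ["a dog", "a red frisbee"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

label, scores = classify_region([0.9, 0.1, 0.0], text_embs, names)
print(label, [round(s, 3) for s in scores])

# Adding a category is just appending a name and its embedding: no retraining.
names.append("a leash")
text_embs.append([0.0, 0.0, 1.0])
label2, _ = classify_region([0.1, 0.0, 0.95], text_embs, names)
print(label2)

# A region that matches no query well falls below the threshold.
label3, _ = classify_region([1.0, 1.0, 1.0], text_embs, names, threshold=0.9)
print(label3)
```

In a real detector the same computation is one batched matrix product between all region features and all text embeddings. The loop form is used here only to keep each step visible.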
Models like ViLD, RegionCLIP, OWL-ViT, and Grounding DINO implement variants of this region-text alignment, differing mainly in how they obtain good region features and how they train the alignment (distilling from CLIP, or training on grounded image-text data that ties phrases to boxes). The detection mechanics (region proposals, the set-prediction transformer decoder) are the ones you built in Chapter 23; only the classification step changed.
There is no new principle here beyond Section 25.4. Zero-shot classification compared a whole image to text embeddings; open-vocabulary detection compares a region to text embeddings; open-vocabulary segmentation compares a pixel or mask to text embeddings. The unit of comparison shrinks from image to region to pixel, but the move is identical: throw away the fixed classification head and align visual features to a shared space with language. Once you see this, every open-vocabulary system in the literature reads as the same idea applied at a different spatial granularity.
2. Open-Vocabulary and Promptable Segmentation (Intermediate)
Segmentation gets the same treatment, at the pixel level. Language-driven segmenters (LSeg, and the open-vocabulary mask transformers that followed) compute a per-pixel or per-mask embedding in the CLIP space and label each by similarity to text queries, so you can segment "the wooden chair" without that class ever appearing in a training mask. The mask-transformer machinery of Chapter 24 supplies the masks; the text alignment supplies the open vocabulary.
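The per-pixel version of the idea can be sketched in a few lines. This is a toy illustration, not LSeg's actual implementation: it assumes each pixel already has an embedding in the shared space, uses dot-product similarity, and works on a hand-written 2x3 grid of 2-dimensional vectors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def label_pixels(pixel_embs, text_embs):
    """Assign each pixel the index of its most similar text query.

    pixel_embs: H x W grid (nested lists) of d-dimensional vectors.
    text_embs:  list of K d-dimensional text embeddings.
    """
    return [
        [max(range(len(text_embs)), key=lambda k: dot(px, text_embs[k])) for px in row]
        for row in pixel_embs
    ]


# Assumed toy data: two queries and a 2x3 "image" of pixel embeddings.
queries = ["the wooden chair", "the floor"]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
pixel_embs = [
    [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]],
    [[0.2, 0.7], [0.6, 0.4], [0.0, 1.0]],
]

label_map = label_pixels(pixel_embs, text_embs)

# Binary mask for the first query: True wherever the chair query wins.
chair_mask = [[k == 0 for k in row] for row in label_map]
print(label_map)
print(chair_mask)
```

Changing the segmented classes only requires changing the `queries` list and its embeddings, which is the pixel-level counterpart of swapping a classification head for text vectors.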
A different and hugely influential idea arrived with the Segment Anything Model (SAM, Kirillov et al., 2023). SAM does not classify at all. It is a promptable segmenter: given an image and a prompt (a click point, a box, or a rough mask), it outputs a high-quality mask for the object indicated by the prompt. It was trained on SA-1B, a dataset of 1.1 billion masks built with a data engine in which the model itself proposed masks that annotators refined, bootstrapping to a scale no manual effort could reach. Because the prompt rather than a label specifies the target, SAM segments objects it was never told the names of; it is a segmentation foundation model, a reusable primitive rather than a task-specific network. Figure 25.5.2 shows its three-part architecture.
Because SAM is introduced alongside open-vocabulary models, it is easy to assume you can type "the wooden chair" and SAM will find and segment it. SAM has no text input and no category output at all: it answers the question "what object is at this point or inside this box?" with a mask, and it cannot tell you what that object is. The text understanding lives entirely in a separate model (the CLIP-aligned detector of subsection 1); a grounded pipeline uses that detector to turn a phrase into boxes and only then hands the boxes to SAM for pixel-accurate masks (subsection 3). Do not confuse SAM's promptable generality (a click works on any object) with open-vocabulary recognition (naming an object from text). Diagnostic question: given only SAM and an image, could you segment "every dog" from the word "dog"? No, you first need a model that maps the word to a location; SAM supplies the mask, never the name.
SAM's design choice that makes it usable interactively is the split: the expensive image encoder runs once, and every subsequent prompt only invokes the cheap decoder, so a user can click around an image and get instant masks. The code below shows the prompt-and-mask loop with the reference implementation.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb) # heavy encoder runs ONCE here
# Now prompt as many times as you like; each call uses only the fast decoder.
point = np.array([[420, 310]]) # one (x, y) click on the target object
label = np.array([1]) # 1 = foreground point, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label,
    multimask_output=True,  # return 3 candidate masks at different scales
)
best = masks[scores.argmax()] # pick the highest-confidence mask
print("masks returned:", masks.shape[0], "| best mask covers", int(best.sum()), "pixels")
# masks returned: 3 | best mask covers 18742 pixels
set_image pays the one-time encoding cost; each predict call with a point or box prompt is fast because only the mask decoder runs. multimask_output returns several masks at different scales to resolve the ambiguity of a single click (the whole person, the shirt, or a button).

SAM's training set, SA-1B, contains 1.1 billion masks, more segmentation masks than the entire field had produced in its prior history combined. They were not drawn by hand. A data engine ran in three stages: first annotators corrected SAM's masks, then SAM proposed masks and annotators only added missed objects, and finally SAM ran fully automatically while humans merely audited. The model labeled the data that trained the model, a bootstrapping loop that is now a standard recipe for building foundation-scale datasets, and a preview of the generative data engines of Chapter 37.
3. Grounded Pipelines: Detect, Then Segment (Advanced)
SAM is powerful but, as subsection 2 noted, it does not know category names; it segments what you point at. An open-vocabulary detector knows names but outputs boxes, not pixel-precise masks. The natural and now ubiquitous combination is to chain them: an open-vocabulary detector (commonly Grounding DINO) takes a text query, finds the boxes of every matching object, and feeds those boxes as prompts to SAM, which returns clean masks. The result, often called Grounded SAM, lets you type a phrase and get pixel-accurate masks of everything matching it, with no task-specific training. This composition is the practical face of the whole chapter: foundation models as interoperable primitives.
# Conceptual pipeline: text query -> open-vocab boxes -> SAM masks.
# (Using grounded-sam style components; APIs vary by package version.)
text_query = "a dog . a red frisbee ."  # period-separated phrases, open vocabulary
# 1) Open-vocabulary detection: phrase -> boxes (region-text alignment of subsection 1).
boxes, phrases, confidences = grounding_dino.predict(image_rgb, text_query)
# 2) Promptable segmentation: each detected box becomes a SAM box-prompt.
predictor.set_image(image_rgb) # encode image once
masks = []
for box in boxes:  # each box: length-4 array in pixel XYXY format
    m, scores, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(m[0])  # one mask per box when multimask_output=False
print(f"query matched {len(boxes)} objects: {phrases}")
# query matched 3 objects: ['a dog', 'a dog', 'a red frisbee']
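One piece of plumbing the conceptual pipeline glosses over is the box format. SAM's predictor expects box prompts as pixel coordinates in (x0, y0, x1, y1) order, while some detector releases return boxes as normalized (cx, cy, w, h). The sketch below assumes that normalized center format and a hypothetical confidence cutoff of 0.35; check what your detector package actually returns before using it.

```python
def cxcywh_norm_to_xyxy_pixels(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * image_width
    y0 = (cy - h / 2) * image_height
    x1 = (cx + w / 2) * image_width
    y1 = (cy + h / 2) * image_height
    # Clamp to the image so a box that spills past the border stays valid.
    x0, x1 = max(0.0, x0), min(float(image_width), x1)
    y0, y1 = max(0.0, y0), min(float(image_height), y1)
    return (x0, y0, x1, y1)


def prepare_box_prompts(boxes, confidences, image_width, image_height, min_conf=0.35):
    """Keep confident detections and return (index, pixel xyxy box) pairs."""
    prompts = []
    for i, (box, conf) in enumerate(zip(boxes, confidences)):
        if conf >= min_conf:
            prompts.append((i, cxcywh_norm_to_xyxy_pixels(box, image_width, image_height)))
    return prompts


# Hypothetical detector output for a 640x480 image (values are made up).
boxes = [(0.5, 0.5, 0.2, 0.4), (0.95, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1)]
confidences = [0.82, 0.61, 0.12]

prompts = prepare_box_prompts(boxes, confidences, 640, 480)
for idx, xyxy in prompts:
    print(idx, [round(v, 1) for v in xyxy])
```

Keeping the original detection index alongside each converted box makes it easy to pair every SAM mask with the phrase that produced it.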
The practical example below shows a team replacing a brittle, labeled-from-scratch segmentation project with exactly this composition.
Who: a robotics startup, 2023, building a bin-picking system that needed pixel masks for roughly two hundred warehouse object types. Situation: their plan was to hand-annotate masks, but a pilot showed that masking two hundred categories across enough images would take their small annotation team many months. Problem: they had the images and a text list of the object types, but no masks and no time to draw them. Decision: they ran a Grounded SAM pipeline, feeding each object type as a text query to an open-vocabulary detector and the resulting boxes to SAM, producing candidate masks automatically. Annotators then only reviewed and corrected the machine-generated masks instead of drawing from scratch, the same data-engine loop SAM itself was trained with. Result: the first full pass of masks across the dataset finished in a single overnight run; human review and correction took days rather than months, and the corrected masks were used to fine-tune a small, fast in-house segmenter for deployment on the robot. Lesson: foundation models are most valuable not as the final deployed model (too large for an edge robot) but as a labeling engine that bootstraps a smaller specialist. Open vocabulary plus promptable segmentation turned a months-long labeling project into an overnight job plus a review pass.
The detect-then-segment composition is also packaged in libraries. Through Hugging Face transformers, Grounding DINO and SAM each load through the library's Auto classes, and a ready-made pipeline collapses the detection step above into a couple of calls:
# Open-vocabulary detection through a single Hugging Face pipeline:
# a list of text labels in, matching boxes out, with no training and no plumbing.
from transformers import pipeline

# `image` is assumed to be a PIL image (or a path/URL) loaded earlier.
detector = pipeline(model="IDEA-Research/grounding-dino-tiny", task="zero-shot-object-detection")
detections = detector(image, candidate_labels=["a dog", "a red frisbee"])
# Each detection carries a label, a score, and a box; the box can then be
# handed to SAM as a box prompt for a pixel-accurate mask.
Passing candidate_labels to a zero-shot-object-detection pipeline runs Grounding DINO over an open vocabulary, replacing the manual model loading and box decoding of Code Fragment 2; each returned box can then be handed to SAM as a box prompt for the masks.

This shortcut replaces the detector download and the box-decoding plumbing with a couple of calls, and the library handles tokenization, box decoding, and preprocessing. SAM's own automatic mask generator (SamAutomaticMaskGenerator) similarly segments every object in an image with one call. The explicit pipeline above exists so you understand which model supplies the names and which supplies the pixels.
The open-vocabulary frontier in 2024 to 2026 moves on three fronts. SAM 2 (Ravi et al., 2024) extends promptable segmentation to video with a streaming memory, so a single click tracks an object across frames, the bridge into the video understanding of Chapter 26. Real-time open-vocabulary detectors such as YOLO-World push region-text alignment to the speeds the deployment chapter cares about, narrowing the gap between foundation-model flexibility and edge-model latency. And a unification trend is folding detection, segmentation, and grounding into single models conditioned on text, often built on the SigLIP encoders of Section 25.4 and increasingly invoked as tools by multimodal language models that decide what to detect or segment from a conversation. The clearest sign of that convergence is SAM 3 (Carion et al., 2025), which collapses the detect-then-segment pipeline of this section into a single model: its promptable concept segmentation takes a short noun phrase such as "yellow school bus" (or an image exemplar) and returns masks and identities for every matching instance at once, where SAM 1 and 2 segmented one prompted object at a time. The open question is whether the future is many specialized foundation primitives composed in pipelines, as in this section, or one model that does everything, which we take up in Section 25.6.
The Key Insight claims open-vocabulary classification, detection, and segmentation are the same move at different spatial scales. For each of the three, state precisely what the unit of comparison is (image, region, pixel or mask) and what it is compared against, and identify the single component that is removed relative to the closed-vocabulary version. Then explain why SAM does not fit this pattern, and what it provides instead.
Using the library shortcut, build a pipeline that takes an image and a free-text query, detects matching objects with an open-vocabulary detector, and produces a pixel mask for each with SAM. Run it on three images with queries the models were not specifically trained on (for example "the leftmost potted plant"). Report how many objects were found and overlay the masks. Then write one paragraph on a failure case you observe and whether it originates in the detector (wrong or missing box) or in SAM (box correct but mask poor), connecting the diagnosis to the two-stage structure.
The robotics practical example used a foundation model to label data for a smaller deployed model rather than deploying the foundation model itself. Analyze this decision: list three reasons the foundation model was unsuitable for direct deployment on the robot, and three reasons it was nonetheless valuable as a labeling engine. Then describe the failure mode of trusting auto-generated masks without human review, and explain why the review-and-correct loop (rather than fully automatic labeling) is the safe operating point, connecting it to the SAM data engine described in the Fun Fact.