"Once there were a thousand small models, each expert in one narrow thing and helpless outside it. Now there are a few large ones that know almost everything a little, and the real skill is no longer training a model but choosing which giant to borrow and what to ask of it."
A Foundation Model Surveying Its Own Crowded Field
A vision foundation model is a single network, pretrained once at scale, whose features serve many downstream tasks with little or no task-specific training. The practitioner's job has shifted from training a model to choosing the right pretrained giant and adapting it cheaply. This closing section maps the 2024 to 2026 landscape: DINOv2 for general-purpose frozen features, the CLIP and SigLIP family for language-aligned embeddings, and SAM for promptable masks, plus the rule of thumb for picking among them. It then looks forward to the scaling laws that govern these models and the JEPA direction toward predicting in representation space, which carries the chapter's thread directly into video understanding and world models.
Your next vision project will almost certainly start with a downloaded backbone, not a trained one, and the single most consequential decision you make will be which giant to borrow. The previous five sections built the methods; this one turns them into the map you reach for when you face that choice. We will lay out the major foundation models by what they are good for, give a concrete decision procedure for choosing a backbone, examine the scaling laws and data-curation lessons that explain why these models work, and close on the predictive-representation direction that links this chapter to the rest of the book. By the end you will be able to look at a new task and name the right pretrained model, the adaptation recipe, and the reason for both. It completes the foundation-model story that Chapter 22 and the transfer learning of Chapter 21 set up, and hands off to Chapter 26.
1. The Major Families, by What They Do Intermediate
The methods of this chapter crystallized into a handful of model families, each defined by its pretraining objective and therefore by what its features are good at. Understanding the map means understanding which objective produced which strength. DINOv2 (self-distillation plus masked modeling, no language) gives general-purpose features that excel frozen, across classification, dense correspondence, depth, and segmentation. Its successor DINOv3 (Meta AI, 2025) scales the same recipe to a 7-billion-parameter ViT trained on 1.7 billion curated images. It adds a regularizer the authors call Gram anchoring, which keeps dense features sharp over long training, and reports state-of-the-art dense prediction with frozen weights. The CLIP and SigLIP family (language-supervised, from Section 25.4) gives language-aligned embeddings for anything involving text: zero-shot recognition, retrieval, and conditioning generators. The SAM line (promptable, from Section 25.5) gives masks on demand, and its latest member SAM 3 (Carion et al., 2025) adds promptable concept segmentation, returning masks for every instance matching a noun phrase in one model rather than one prompted object at a time. Table 25.6.1 lays out the comparison.
| Family | Pretraining signal | Strongest at | Reach for it when |
|---|---|---|---|
| DINOv2 / DINOv3 | Self-distillation + masked modeling, image only | Frozen dense features, correspondence, depth | You freeze the backbone and use features directly |
| CLIP / SigLIP | Image-text contrastive, web scale | Zero-shot recognition, retrieval, text conditioning | The task involves language or open vocabulary |
| MAE-pretrained ViT | Masked pixel reconstruction | Strong fine-tuning initialization | You will fine-tune end to end on labeled data |
| SAM / SAM 2 | Promptable masks, 1.1B-mask data engine | Pixel-accurate masks from a prompt; video tracking | You need masks for arbitrary objects, interactively |
These families are increasingly combined rather than chosen exclusively. DINOv2 features feed monocular depth systems in Chapter 27; SigLIP encoders sit inside open-vocabulary detectors from Section 25.5 and inside multimodal language models; SAM consumes the boxes of a CLIP-based detector. The map is less a menu of alternatives than a toolbox of interoperable parts, which is the defining shift the chapter has been describing.
You can predict what a foundation model's features will be good at from the objective it was trained on, without running a benchmark. A model trained to align images with text (CLIP) will be strong wherever language is involved and may be weaker at fine spatial detail. A model trained to reconstruct masked pixels (MAE) will be a strong fine-tuning start but a mediocre frozen feature extractor. A model trained by self-distillation (DINO) will cluster images semantically and give excellent frozen features. When you face an unfamiliar foundation model, read its pretraining objective first; it tells you where the model will shine and where it will not.
2. Choosing and Adapting a Backbone Beginner
For a real project, the choice follows from two questions answered in order. First, does your task involve language or an open vocabulary? If yes, start in the CLIP and SigLIP family, because only language-aligned models can match arbitrary text. Second, if the task is purely visual, will you freeze the backbone or fine-tune it? If you will freeze it (retrieval, clustering, few-shot, dense correspondence), choose DINOv2; if you will fine-tune on labeled data, an MAE-pretrained ViT is an excellent initialization. For masks of arbitrary objects, layer SAM on top of whichever recognition model names them. Figure 25.6.1 turns this into a decision tree.
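The two questions above can be written down as a small function. This is a minimal sketch of the rule of thumb, not a library API: the name `choose_backbone`, the `BackboneChoice` record, and the wording of the adaptation strings are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackboneChoice:
    family: str        # which foundation family to start from
    adaptation: str    # how to adapt it (illustrative wording)
    add_sam: bool      # whether to layer SAM on top for masks


def choose_backbone(involves_language: bool,
                    will_fine_tune: bool,
                    needs_masks: bool = False) -> BackboneChoice:
    """Encode the section's two ordered questions as a decision procedure."""
    if involves_language:
        # Question 1: only language-aligned models can match arbitrary text.
        family, adaptation = "CLIP / SigLIP", "zero-shot prompts or a head on frozen embeddings"
    elif will_fine_tune:
        # Question 2, fine-tune branch: MAE is a strong initialization.
        family, adaptation = "MAE-pretrained ViT", "end-to-end fine-tuning on labeled data"
    else:
        # Question 2, frozen branch: retrieval, clustering, few-shot, correspondence.
        family, adaptation = "DINOv2", "frozen backbone with a linear probe or nearest neighbors"
    # Masks are an add-on, layered on whichever recognition model names the objects.
    return BackboneChoice(family, adaptation, add_sam=needs_masks)


print(choose_backbone(involves_language=False, will_fine_tune=False))
print(choose_backbone(involves_language=True, will_fine_tune=False, needs_masks=True))
```

The order of the branches matters: the language question is asked first, so a task that involves text lands in the CLIP and SigLIP family even if you also plan to fine-tune.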
Adapting the chosen backbone is cheap by design. The linear probe and fine-tuning protocols of Section 25.1 and Chapter 21 still apply, and parameter-efficient methods (a small adapter, or LoRA, the same low-rank adaptation you will meet for generators in Chapter 34) let you specialize a giant backbone by training a tiny fraction of its weights. The code below shows the most common adaptation: load a frozen DINOv2 backbone and attach a small trainable head.
```python
import torch
import torch.nn as nn

# Load a frozen DINOv2 backbone; this is the practical default for frozen features.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
for p in backbone.parameters():
    p.requires_grad_(False)  # freeze: adapt with a tiny head only
backbone.eval()


class FoundationClassifier(nn.Module):
    def __init__(self, backbone, feat_dim=768, num_classes=10):
        super().__init__()
        self.backbone = backbone                      # frozen feature extractor
        self.head = nn.Linear(feat_dim, num_classes)  # the only trainable parameters

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)  # frozen forward, no gradients stored
        return self.head(feats)       # train just this linear layer


model = FoundationClassifier(backbone)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} params ({100*trainable/total:.3f}%)")
# trainable: 7,690 of 86,588,170 params (0.009%)
```
The printed line is the whole point of the chapter in one number: a task is solved by training under one ten-thousandth of the model's parameters, because the foundation model already learned to see. The practical example shows a team living by this workflow, and the Hands-On Lab at the end of this section has you build that workflow yourself, comparing a frozen DINOv2 backbone against CLIP zero-shot classification on the same images.
Who: the computer-vision group at a logistics company, 2024, responsible for a dozen perception tasks across sorting, damage detection, and label reading.

Situation: historically they trained a bespoke CNN per task, each needing its own labeled dataset and weeks of tuning, and each aging quickly.

Problem: maintaining a dozen separately trained models was expensive, and new tasks took too long to stand up.

Decision: they standardized on two frozen foundation backbones, a DINOv2 for visual tasks and a SigLIP for anything involving text or open vocabulary, and built every new task as a small trainable head or a LoRA adapter on top, exactly the pattern in the code above. They added SAM behind a Grounding DINO front end for any masking need.

Result: new tasks went from weeks to days because the heavy representation was already done. The frozen backbones could be evaluated, cached, and shared across tasks, which cut training cost and, because one shared encoder serves every task, inference cost as well. Robustness improved because the foundation features generalized to lighting and packaging the old per-task models had never seen.

Lesson: for most applied teams in 2024 to 2026, the right unit of work is no longer a model but an adapter on a foundation backbone. Choosing the backbone by objective (Table 25.6.1) and adapting it cheaply is the modern computer-vision workflow, and it is what this chapter has been preparing you to do.
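The LoRA adapter mentioned in the practical example can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the reference implementation of any library: the class name `LoRALinear`, the rank of 4, the scaling `alpha / rank`, and the choice to wrap a single 768-wide linear layer are all picked for this example.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight stays frozen
        # A is small-random, B starts at zero, so the wrapped layer initially
        # computes exactly what the frozen layer computes.
        self.lora_a = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scale = alpha / rank

    def forward(self, x):
        update = (x @ self.lora_a.t()) @ self.lora_b.t()  # low-rank path
        return self.base(x) + self.scale * update


layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} params in this layer")
```

In practice you would wrap selected projection layers inside the backbone rather than a standalone layer; which layers to wrap is a design choice this sketch does not address.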
The illustration below sums up this shift: one shared giant backbone with small swappable adapter heads, retiring the old one-model-per-task era.
3. Scaling, Curation, and the Predictive Frontier Advanced
Why do these models work, and where are they going? Two empirical lessons explain the present. First, scaling laws: foundation-model quality improves predictably as a power law in model size, data size, and compute, so much of the progress from CLIP to DINOv2 to SigLIP came from scaling a known recipe rather than inventing a new one. Second, and the harder-won lesson, data curation matters as much as scale: DINOv2's leap over plain self-supervision came substantially from automatically curating a large, diverse, deduplicated image set, and the CLIP-data line (DataComp, DFN) showed that filtering web pairs well beats simply collecting more. Quality of data, not just quantity, sets the ceiling.
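To make "improves predictably as a power law" concrete, here is a small sketch that fits a one-variable power law by linear regression in log-log space and extrapolates one step. Everything numeric in it is synthetic: the sizes, the constants 5.0 and 0.1, and the single-variable form `loss = a * size ** (-b)` are assumptions chosen for illustration, not measurements from any of the models discussed.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size ** (-b) by least squares on (log size, log loss)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic points that lie exactly on a power law (hypothetical numbers).
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [5.0 * s ** -0.1 for s in sizes]

a, b = fit_power_law(sizes, losses)
predicted = a * (1e11) ** (-b)  # extrapolate one order of magnitude further
print(f"a = {a:.3f}, b = {b:.3f}, predicted loss at 1e11 = {predicted:.3f}")
```

A power law is a straight line on log-log axes, which is why a fit on small runs can be used to forecast a larger one. Real measurements are noisy and the fit eventually bends, which is where the data-curation lesson enters.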
The frontier direction unifies the chapter's threads. Pixel reconstruction (MAE) wastes capacity on imperceptible detail; pure language alignment (CLIP) can miss fine spatial structure. The Joint-Embedding Predictive Architecture (JEPA) line argues for predicting in representation space: mask part of the input and predict the features of the missing part, not its pixels or its caption. I-JEPA did this for images, and the V-JEPA line extended it to video, learning to predict masked spatiotemporal regions in feature space and thereby learning motion and dynamics without action labels. V-JEPA 2 (Meta AI, 2025) scaled this to a 1.2-billion-parameter video model pretrained on over a million hours of video, and an action-conditioned variant plans robot manipulation from learned dynamics, an explicit step from representation learning toward a world model. This is the same self-supervision-to-prediction arc the cross-reference map traces: the masked-prediction idea of Section 25.3, pushed from reconstructing content to predicting representations, becomes a way to learn the structure of how scenes evolve. That is the doorway to the predictive video models and world models of Chapter 26 and beyond.
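The idea of predicting features rather than pixels can be sketched with a toy model. This is a heavily simplified illustration, not the I-JEPA or V-JEPA implementation: the class name `TinyJEPA`, the tiny transformer sizes, the mean-squared-error loss, the momentum value, and the use of a slowly updated copy of the encoder to produce targets are assumptions of this sketch.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM, NUM_PATCHES = 32, 16


def make_encoder(depth):
    layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, dim_feedforward=64,
                                       dropout=0.0, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)


class TinyJEPA(nn.Module):
    def __init__(self):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(NUM_PATCHES, DIM) * 0.02)
        self.context_encoder = make_encoder(2)
        # Target encoder: a gradient-free copy that produces the features to predict.
        self.target_encoder = copy.deepcopy(self.context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = make_encoder(1)
        self.mask_token = nn.Parameter(torch.zeros(DIM))

    def forward(self, patches, masked):
        # patches: (B, N, DIM) already-embedded patch tokens; masked: (N,) bool.
        tokens = patches + self.pos
        with torch.no_grad():
            targets = self.target_encoder(tokens)[:, masked]   # features of hidden patches
        context = self.context_encoder(tokens[:, ~masked])     # sees visible patches only
        queries = (self.mask_token + self.pos[masked]).unsqueeze(0)
        queries = queries.expand(patches.shape[0], -1, -1)      # one query per hidden patch
        out = self.predictor(torch.cat([context, queries], dim=1))
        pred = out[:, -queries.shape[1]:]                       # predictions at hidden positions
        return F.mse_loss(pred, targets)                        # loss lives in feature space

    @torch.no_grad()
    def update_target(self, momentum=0.996):
        # Move the target encoder slowly toward the context encoder.
        for t, c in zip(self.target_encoder.parameters(),
                        self.context_encoder.parameters()):
            t.mul_(momentum).add_(c, alpha=1 - momentum)


model = TinyJEPA()
patches = torch.randn(4, NUM_PATCHES, DIM)
masked = torch.zeros(NUM_PATCHES, dtype=torch.bool)
masked[10:] = True                      # hide the last six patches
loss = model(patches, masked)
loss.backward()
model.update_target()
print(f"feature-space prediction loss: {loss.item():.4f}")
```

Notice what is absent: there is no pixel decoder and no caption. The only thing the model is penalized for is mispredicting the representation of the hidden region, which is the contrast with MAE and CLIP that the paragraph above draws.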
Three open questions define 2024 to 2026. First, will the field consolidate into one general vision model or remain a toolbox of specialized primitives (DINOv2, SigLIP, SAM) composed in pipelines as in Section 25.5? The evidence in 2025 points both ways: SAM 3 (Carion et al., 2025) folds the open-vocabulary detect-then-segment pipeline into one model, yet multimodal language models increasingly invoke separate detectors and segmenters as tools, suggesting specialist vision foundations orchestrated by a generalist controller. Second, does predicting in representation space (I-JEPA, V-JEPA, and V-JEPA 2, 2023 to 2025) decisively beat pixel reconstruction and language alignment, or do the three fuse? DINOv2 already fuses distillation and masked modeling, and DINOv3 (2025) shows the recipe still scales; a model fusing all three is the natural next experiment. Third, how far do scaling laws carry before data, not compute, becomes the binding constraint, the question the data-curation lesson raises. What is settled is the shift this chapter documented: vision models now learn from raw pixels and from the language humans already wrote, labels stopped being the bottleneck, and the practitioner's craft moved from training models to choosing and steering foundations.
The phrase "foundation model" was coined in a 2021 Stanford report, not in a vision paper, and it was controversial precisely because it named a bet: that a few large pretrained models would become the shared foundation everything else is built on, with all the concentration of capability and risk that implies. In vision the bet has largely paid off in the span of this chapter's timeline. The same report's warnings about brittleness, bias inherited from web data (recall CLIP's typographic attack), and the difficulty of auditing a model used for a thousand downstream purposes are now the active governance questions of Chapter 37.
For each of four hypothetical tasks, name which foundation family from Table 25.6.1 you would start with and justify it from the model's pretraining objective alone: (a) clustering ten million unlabeled product photos by visual similarity; (b) finding all images matching the phrase "a child flying a kite" in a database; (c) fine-tuning a high-accuracy classifier for fifty industrial defect types with ten thousand labeled images; (d) interactively masking arbitrary objects a user clicks on. Then identify one task where two families are plausible and explain the trade-off.
Using the frozen-backbone code in subsection 2, build classifiers for a small labeled dataset on top of two different frozen backbones (a DINOv2 and a CLIP image encoder), training only a linear head on each. Report test accuracy and the number of trainable parameters for each. Then unfreeze and fine-tune one of them end to end and compare accuracy and training time. Write one paragraph on when the extra accuracy of fine-tuning is worth the much larger training cost, connecting your answer to the logistics practical example.
The section contrasts three self-supervised prediction targets: pixels (MAE), language (CLIP), and representations (JEPA). For each, state what the model is penalized for getting wrong, and argue what kind of information that penalty encourages the model to keep versus discard. Then explain the specific complaint JEPA raises against pixel reconstruction, why predicting in representation space is claimed to address it, and how this connects to learning motion and dynamics for the video models of Chapter 26.
4. Generative Vision-Language Models and Visual Question Answering Intermediate
Everything so far in this chapter scores or clusters images; nothing in it can answer a question about one. CLIP (Section 25.4) can tell you that an image is more similar to the text "a child flying a kite" than to "a dog on a beach", but ask it "how many kites are in the sky?" and it has no machinery to reply, because it only measures similarity between a whole image and a whole caption. A foundation-models course that stops at CLIP leaves out the half of the multimodal field that actually talks: the models that read an image and generate free-form text about it, answering questions, describing scenes, and holding a dialogue. This subsection adds that half and ties it back to the contrastive story you already know.
4.1 The Task: Visual Question Answering
The cleanest way to test whether a model truly understands an image, rather than just labels it, is to ask it arbitrary questions: "What color is the bus?", "Is the man wearing a helmet?", "How many people are seated?" This is Visual Question Answering (VQA), and the problem it exposed is instructive. The first large benchmark let models cheat: questions carried such strong language priors that a model could answer "What color is the banana?" with "yellow" without looking at the image at all. The VQAv2 benchmark (Goyal et al., 2017, "Making the V in VQA Matter") fixed this by rebalancing: every question is paired with two similar images that yield different answers, so a model that ignores the pixels and bets on the prior is wrong half the time by construction. The "V" in VQA only matters if the vision is load-bearing, and VQAv2 forces it to be.
VQA also needs a metric that tolerates the genuine disagreement among human annotators (is that animal a "puppy" or a "dog"?). VQAv2 collects ten human answers per question and scores a candidate answer by how many humans gave it, saturating at three:

$$\text{Acc}(a) = \min\left(\frac{\#\{\text{annotators who answered } a\}}{3},\ 1\right)$$
Read term by term: the numerator counts how many of the ten annotators produced the candidate answer $a$; dividing by three means an answer that at least three humans gave earns full credit, and the outer $\min$ caps the score at $1$ so extra agreement does not overflow. Concretely, if the model answers "yellow" and four of ten humans also said "yellow", the score is $\min(4/3, 1) = 1$; if only two humans said it, the score is $2/3 \approx 0.67$; if zero did, the score is $0$. The design rewards answers that match the human consensus while not punishing a model for missing a rare phrasing a single annotator happened to use. GQA (Hudson & Manning, CVPR 2019) extends VQA toward compositional reasoning, generating questions from scene graphs ("Is the cup to the left of the plate that is on the table?") so that answering correctly requires chaining several relational steps rather than recognizing one object.
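The scoring rule described above is short enough to write directly. This sketch implements only the simplified form stated in the text; the function name `vqa_accuracy` and the lowercase-and-strip matching are assumptions of the example, and an evaluation script used for published numbers may normalize answers more carefully.

```python
def vqa_accuracy(candidate: str, human_answers: list[str]) -> float:
    """Score = min(number of annotators who gave this answer / 3, 1)."""
    normalized = candidate.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == normalized)
    return min(matches / 3, 1.0)


# Ten hypothetical annotator answers to "What color is the banana?"
four_agree = ["yellow"] * 4 + ["green"] * 6
two_agree = ["yellow"] * 2 + ["green"] * 8
none_agree = ["green"] * 10

print(vqa_accuracy("yellow", four_agree))  # four matches: capped at full credit
print(vqa_accuracy("yellow", two_agree))   # two matches: partial credit
print(vqa_accuracy("yellow", none_agree))  # no matches: zero
```

Because the function compares strings, it grades a generated free-text answer just as well as a classifier's output, provided the generated answer is short enough to match what annotators wrote.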
The deeper shift is architectural. Early VQA systems framed the task as classification: encode the image into a feature vector, encode the question into an embedding, fuse the two, and predict over a fixed answer vocabulary of a few thousand frequent answers. That design cannot produce any answer outside its vocabulary and cannot explain itself. Modern vision-language models discard the fixed answer set entirely and treat VQA as open-ended autoregressive text generation: the model reads the image and the question and generates the answer one token at a time, exactly as a language model continues a sentence. Captioning, VQA, and multi-turn dialogue stop being three architectures and become one, differing only in the prompt.
CLIP and a generative vision-language model sit on opposite sides of one line. CLIP is discriminative: two encoders project an image and a text into a shared space, and the model only scores how well a matched pair aligns. That is enough for zero-shot classification and retrieval, but CLIP cannot emit a sentence; it has no decoder and no notion of generating text. A generative VLM is, well, generative: it feeds visual features into an autoregressive language model that produces free-form text conditioned on the image, so the same model can caption, answer questions, and converse. The two are complementary rather than competing, and the relationship is concrete: generative VLMs very often reuse a frozen CLIP or SigLIP vision transformer as their perception front-end, then bolt a language model on as the reasoning and generation back-end. The contrastive pretraining you studied in Section 25.4 is not superseded by generative models; it is the eyes those models see through.
4.2 The General Pattern: Encoder, Connector, LLM
Almost every modern vision-language model follows one template, and once you see it the individual models become variations on a theme. An image goes into a vision encoder (frequently a frozen CLIP or SigLIP ViT, reusing exactly the contrastively pretrained perception of Section 25.4), which emits a grid of patch features. Those features pass through a connector that turns them into visual tokens living in the language model's embedding space. Those visual tokens are then inserted into the context window alongside the embedded text tokens, and a language-model decoder autoregressively generates text conditioned on both. The image becomes, in effect, a short run of "words" in a language the LLM already understands (a few dozen for a compressing connector, several hundred when every patch becomes its own token), and from there captioning, VQA, and dialogue are all just text continuation. Figure 25.6.2 shows the flow.
The models differ almost entirely in the connector, because that is where the design choices live: how many visual tokens, how much computation, and how much it can compress. Three families dominate. The Q-Former (BLIP-2) uses a small set of learned query tokens that cross-attend to the encoder's patches and compress them into a fixed, small number of visual tokens. The linear or MLP projector (LLaVA) is the minimalist choice: a single matrix multiply (or a two-layer MLP) maps each patch feature directly to a visual token, with almost no parameters and no compression. The perceiver resampler (Flamingo) resamples a variable number of patch features into a fixed number of latent tokens through cross-attention. The trade-off is the classic one between inductive bias and simplicity: the Q-Former and resampler bake in a learned compression step (more parameters, fewer visual tokens, more architectural assumptions), while the MLP projector trusts the LLM to do the heavy lifting (almost no parameters, one token per patch, minimal assumptions). LLaVA's result, that a trivial projector works remarkably well, was itself a finding.
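The token-count trade-off is easiest to see in code. The sketch below contrasts a pass-through MLP projector with a query-based compressor on the same fake patch grid. It is deliberately simplified: the class names, the toy widths (64 and 128), and the single cross-attention layer are assumptions of this example, and a real Q-Former or perceiver resampler stacks several layers and adds self-attention among the queries.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Pass-through connector: one visual token per patch, no compression."""

    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, patches):           # (B, N, vision_dim)
        return self.net(patches)          # (B, N, llm_dim)


class QueryCompressor(nn.Module):
    """Compressing connector: learned queries cross-attend to the patches."""

    def __init__(self, vision_dim, llm_dim, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patches):           # (B, N, vision_dim)
        q = self.queries.unsqueeze(0).expand(patches.shape[0], -1, -1)
        attended, _ = self.cross_attn(q, patches, patches)  # queries read the patches
        return self.proj(attended)        # (B, num_queries, llm_dim)


patches = torch.randn(2, 24 * 24, 64)     # a 24-by-24 patch grid, toy feature width
mlp_tokens = MLPProjector(64, 128)(patches)
query_tokens = QueryCompressor(64, 128)(patches)
print(mlp_tokens.shape)    # one token per patch
print(query_tokens.shape)  # a fixed number of tokens regardless of grid size
```

The projector's output length grows with image resolution, while the compressor's is fixed by `num_queries`. That difference, more than parameter count, is what you pay for in the language model's context window.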
4.3 BLIP-2: A Frozen Encoder, a Frozen LLM, and a Q-Former Between Them
Training a vision-language model from scratch means paying for both a vision backbone and a language model, an enormous cost. BLIP-2 (Li et al., 2023) asked whether you could leave both giants frozen and train only a small bridge between them. Its bridge is the Querying Transformer (Q-Former): a lightweight transformer holding a small set of learned query tokens (32 in the original) that are not derived from the image. These queries do two things at once. They cross-attend to the frozen image encoder's patch features, pulling in visual information, and they self-attend among themselves, organizing that information. The query outputs, projected to the language model's token dimension, are the entire visual input the frozen LLM ever sees: the image is compressed into just 32 tokens. Only the Q-Former trains; the vision encoder and the LLM stay frozen, which is why BLIP-2 reaches strong zero-shot VQA with far fewer trainable parameters than a model like Flamingo.
BLIP-2 trains the Q-Former in two stages. Stage one is representation learning: with the LLM not yet attached, the Q-Former is trained against the frozen image encoder using three objectives borrowed from contrastive and matching work (image-text contrastive learning, image-text matching, and image-grounded text generation), so the queries learn to extract text-relevant visual features. Stage two is generative learning: the trained queries are projected and prepended to the frozen LLM's context as a soft visual prefix, and the system learns to generate text conditioned on that prefix. The split matters because it first teaches the queries what to look at (stage one) before teaching them to speak to a language model (stage two).
4.4 LLaVA: Visual Instruction Tuning With a Simple Projector
BLIP-2 keeps the LLM frozen and invests in a clever connector; LLaVA (Liu et al., 2023, "Visual Instruction Tuning", NeurIPS 2023) makes the opposite bet. Its connector is as simple as possible, a single linear projection (a 2-layer MLP in LLaVA-1.5) mapping CLIP ViT patch features into the embedding space of a Vicuna language model, and its innovation is the data. The authors used a text-only GPT-4 to generate visual instruction-tuning data: given the captions and bounding boxes of an image (as text), GPT-4 wrote rich questions, answers, and conversations about it, producing instruction-following data for images without any human annotating new images. LLaVA then trains in two stages: stage one aligns the projector with the vision encoder and the LLM both frozen, so only the small projector learns to speak the LLM's language; stage two instruction-tunes the projector and the LLM together on the GPT-4-generated dialogues.
LLaVA-1.5 (Liu et al., 2023, "Improved Baselines with Visual Instruction Tuning", CVPR 2024) showed how far this simple recipe scales with a few disciplined upgrades: a stronger CLIP-ViT-L vision encoder at 336-pixel resolution (more visual detail), the two-layer MLP projector in place of the single linear layer, and the addition of academic VQA datasets to the instruction mix. The result reached state of the art on eleven benchmarks, and the 13-billion-parameter model trains in roughly one day on a single 8xA100 node, a striking efficiency for the capability. LLaVA's broader lesson is that a minimal connector plus good instruction data can rival far more elaborate architectures, which is why the linear-or-MLP projector became the default for a wave of open vision-language models.
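The two-stage schedule described above for LLaVA comes down to which parameter groups receive gradients in each stage. The sketch below shows that bookkeeping on a stand-in model. The `ToyVLM` class, its three single-layer components, and the helper `set_llava_stage` are hypothetical and exist only to illustrate the freezing pattern; they are not the LLaVA codebase.

```python
import torch.nn as nn


class ToyVLM(nn.Module):
    """Stand-in with the three parts of the template; each is a single layer here."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(16, 32)  # stands in for the CLIP ViT
        self.projector = nn.Linear(32, 64)       # the connector
        self.llm = nn.Linear(64, 64)             # stands in for the language model


def set_llava_stage(model: nn.Module, stage: int) -> list[str]:
    """Stage 1 trains the projector only; stage 2 trains the projector and the LLM."""
    trainable = {1: {"projector"}, 2: {"projector", "llm"}}[stage]
    for name, module in model.named_children():
        for p in module.parameters():
            p.requires_grad_(name in trainable)
    return [n for n, p in model.named_parameters() if p.requires_grad]


vlm = ToyVLM()
stage1 = set_llava_stage(vlm, 1)
print("stage 1 trains:", stage1)
stage2 = set_llava_stage(vlm, 2)
print("stage 2 trains:", stage2)
```

The vision encoder is frozen in both stages of this sketch, matching the description above. Passing only the parameters with `requires_grad=True` to the optimizer keeps its state small in stage one.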
| Model | Connector | What trains | Citation |
|---|---|---|---|
| BLIP-2 | Q-Former (learned query tokens, cross- and self-attention; compresses to ~32 visual tokens) | Q-Former only; vision encoder and LLM frozen | Li et al., 2023 (arXiv:2301.12597) |
| LLaVA / LLaVA-1.5 | Linear projector (LLaVA) or 2-layer MLP (LLaVA-1.5); one visual token per patch | Stage 1: projector only. Stage 2: projector and LLM | Liu et al., 2023 (arXiv:2304.08485 / 2310.03744) |
| Flamingo | Perceiver resampler plus gated cross-attention layers inserted into the LLM | Resampler and gated cross-attention; vision encoder and LLM backbone frozen | Alayrac et al., 2022 (arXiv:2204.14198) |
4.5 The Wider Landscape and How These Models Are Evaluated
BLIP-2 and LLaVA anchor two ends of a fast-moving field. Flamingo (Alayrac et al., 2022) pioneered the frozen-giants approach and added few-shot multimodal learning over interleaved sequences of images and text, using a perceiver resampler and gated cross-attention layers spliced into a frozen language model. More recent open families push resolution, grounding, and data quality: the Qwen2-VL and Qwen2.5-VL line (Wang et al., 2024, arXiv:2409.12191; Bai et al., 2025, arXiv:2502.13923) handles dynamic input resolution and visual grounding; InternVL (Chen et al., 2023, arXiv:2312.14238) scales the vision encoder alongside the LLM; PaliGemma (Beyer et al., 2024, arXiv:2407.07726) pairs a SigLIP encoder with the Gemma language model; and Molmo, trained on the open PixMo data (Deitke et al., 2024, arXiv:2409.17146), emphasizes fully open training data. Proprietary multimodal assistants and open model-card releases (for example the Llama-3.2-Vision models, distributed as model-card releases rather than arXiv papers) round out the ecosystem.
Because these models generate open-ended text, evaluation has grown beyond the classic VQAv2 and GQA accuracy. MMMU (Yue et al., 2024, "MMMU: A Massive Multi-discipline Multimodal Understanding benchmark", arXiv:2311.16502) tests college-level reasoning across dozens of subjects with figures, charts, and diagrams; MMBench probes specific capabilities with a robust answer-extraction protocol; and MM-Vet evaluates integrated skills such as combining recognition, OCR, and spatial reasoning in a single answer. Together they measure whether a model genuinely reasons over an image or merely pattern-matches frequent question-answer pairs, the same concern that motivated VQAv2's rebalancing in the first place.
One contrastively pretrained vision encoder, two completely different products. Aligned against text in a shared space and queried by similarity, it becomes CLIP: a discriminative engine for zero-shot classification and retrieval (Section 25.4). Frozen and fed through a connector into an autoregressive language model, that same encoder becomes the perception front-end of a generative VLM that captions, answers questions, and converses. The contrastive objective of this chapter did not just produce a classifier; it produced the reusable visual front-end that the entire generative multimodal field now builds on. When you reach the generative models of Part IV, remember that their eyes were trained here.
4.6 Code-First: Visual Question Answering in a Dozen Lines
The internals above (a frozen encoder, a connector, a frozen-or-tuned LLM, two-stage training) are exactly what a library now hides behind one class. Hugging Face transformers ships these models with a processor that handles image preprocessing and prompt formatting and a generation method that runs the autoregressive decode. The complexity of building the pipeline by hand collapses to loading a checkpoint and calling generate: the same image-plus-question-to-answer capability that took a research lab months to train is, for inference, a dozen lines.
```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision saves memory on a GPU; fall back to float32 on CPU.
dtype = torch.float16 if device == "cuda" else torch.float32

# Load a pretrained BLIP-2: frozen ViT + Q-Former + frozen LLM, all in one checkpoint.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
)
model.to(device)

image = Image.open("street.jpg").convert("RGB")  # your own image
question = "Question: How many people are crossing the street? Answer:"

# The processor turns the image into pixel values and tokenizes the prompt;
# the Q-Former compresses the image to ~32 visual tokens fed to the frozen LLM.
inputs = processor(images=image, text=question, return_tensors="pt").to(device, dtype)

generated = model.generate(**inputs, max_new_tokens=20)  # autoregressive decode
answer = processor.batch_decode(generated, skip_special_tokens=True)[0].strip()
print(answer)  # open-ended text, no fixed vocab
```
"a photo of") or hold a follow-up turn. Swapping in LlavaForConditionalGeneration with llava-hf/llava-1.5-7b-hf changes only the class and checkpoint names, not the shape of the call.
The payoff mirrors the frozen-backbone reveal of subsection 2: the entire encoder-connector-LLM machine, frozen giants and learned bridge alike, is downloaded as one checkpoint and driven with a processor and a generate call. What the library absorbs is substantial: image preprocessing to the encoder's expected resolution, insertion of the visual tokens at the right position in the prompt, the key-value caching that makes autoregressive decoding fast, and the half-precision memory management that lets a multi-billion-parameter model run on one GPU. Understanding the pipeline is what lets you choose between BLIP-2's compressing Q-Former and LLaVA's pass-through projector for your latency and accuracy budget; the library is what lets you run either in an afternoon.
The 2024 to 2026 generative-VLM frontier moves in three directions a motivated student can follow. First, resolution and grounding: Qwen2.5-VL (2025) and similar models process native-resolution images and return pixel coordinates, turning a VLM into a detector and pointer rather than only a describer, which blurs the line with the open-vocabulary detection of Section 25.5. Second, open data and reproducibility: Molmo and its PixMo dataset (2024) argue that the recipe should be fully open, from data to weights, so the community can study what actually drives multimodal ability rather than guessing about closed models. Third, vision-language-action models: the same encoder-connector-LLM template, with actions as additional output tokens, is becoming the backbone of robot policies, an explicit bridge from the world models hinted at in subsection 3 to embodied control. The open question underneath all three: do these models reason over images, or do they retrieve memorized question-answer patterns? Benchmarks like MMMU (Yue et al., 2024) were built precisely to tell the difference, and the answer is still actively contested.
Early VQA models classified over a fixed vocabulary of a few thousand frequent answers; modern VLMs generate the answer as free text. Explain two concrete capabilities the generative formulation gains that the classifier cannot have in principle (consider answers outside any fixed list, multi-word and explanatory answers, and unifying captioning with VQA). Then describe one situation where the fixed-vocabulary classifier is actually preferable, and say what property of the deployment makes it so. Connect your answer to why the VQAv2 accuracy metric (the equation in subsection 4.1) still works for grading a generative model's free-text output.
Using the Code Fragment 2 pattern, load a Hugging Face VLM (BLIP-2 or LLaVA-1.5) and run it on three of your own images. For each image, (a) generate a caption by prompting with no question, and (b) ask two questions, one answerable from the image and one deliberately unanswerable from it (for example a question about something off-frame), and record both answers. Then load a second VLM and compare the two models' answers on the same inputs. Write a short paragraph on where they agreed, where they diverged, and how each handled the unanswerable question (did it refuse, guess, or hallucinate?).
Contrast the two connector families using Table 25.6.2. First, reason about trainable parameters and visual-token count: roughly how many visual tokens does a Q-Former emit versus an MLP projector applied to a 24-by-24 patch grid, and which connector has more trainable parameters? Second, contrast their inductive bias: what does the Q-Former's learned compression assume or bake in that the pass-through MLP does not, and what does the MLP delegate to the LLM instead? Finally, argue which connector you would choose for a long-context dialogue over many images and which for a single high-resolution image where fine detail matters, justifying each from the token-count and compression trade-off.
Hands-On Lab: Build a Label-Free Vision Toolkit From Two Foundation Models
Objective
Stand up a small image toolkit that needs no task labels, built entirely on two pretrained foundation models from this chapter. You will use a frozen DINOv2 backbone to embed a folder of your own photos and find visually similar images by nearest neighbor, then use CLIP to classify those same photos zero-shot from a list of category names you type at runtime. The finished artifact is a single script that, given a folder of images and a list of candidate labels, returns for each image its zero-shot label, its confidence, and its three nearest visual neighbors, the practical embodiment of the foundation-model workflow this chapter has been building toward.
What You'll Practice
- Loading a frozen self-supervised backbone (DINOv2) through torch.hub, the frozen-feature default of subsection 2
- Extracting and L2-normalizing embeddings, then ranking by cosine similarity, the dot-product similarity measure used since Section 25.2
- Running CLIP zero-shot classification with prompt templates, the open-vocabulary recipe of Section 25.4
- Contrasting a pixel-and-distillation backbone (DINOv2) with a language-aligned one (CLIP) on the same images, the choose-by-objective skill of Table 25.6.1
- Building a reusable adapter-on-a-backbone pipeline instead of training a model from scratch
Setup
A machine with Python 3.9 or newer (a free Colab GPU runtime is ideal but a modern CPU runs the base models on a few dozen images). Install PyTorch, the Hugging Face stack for CLIP, and the small extras DINOv2 needs:
pip install torch torchvision transformers pillow numpy
For data, drop twenty to fifty of your own photos into a folder named images/. A mix of a few recognizable categories works best (for example pets, vehicles, and food), so that both the nearest-neighbor grouping and the zero-shot labels have something to separate. No annotations of any kind are required: that is the entire point.
Steps
Step 1: Load and preprocess your image folder
Read every image in the folder and build the two transforms the two models expect. DINOv2 wants ImageNet-normalized tensors at a multiple of its patch size; CLIP brings its own processor. Getting the preprocessing right is most of the battle.
from pathlib import Path
from PIL import Image
import torch
from torchvision import transforms
paths = sorted(Path("images").glob("*.jpg")) + sorted(Path("images").glob("*.png"))
pil_images = [Image.open(p).convert("RGB") for p in paths]
print(f"loaded {len(pil_images)} images")
# DINOv2 expects a square crop sized to a multiple of the patch (14) and
# the standard ImageNet mean/std normalization.
# TODO: build a torchvision transform that Resize((224, 224)), ToTensor(),
# and Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]).
dino_tf = ...
Hint
Compose the three steps with transforms.Compose([...]). 224 is a multiple of 14, so it fits the ViT-B/14 patch grid without padding. CLIP needs no manual transform here; its processor handles resizing and normalization in Step 3.
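The "multiple of 14" constraint is plain arithmetic, and a few lines make it concrete. This is a minimal sketch assuming only the patch size of 14 stated above; the helper names `patch_grid` and `round_to_patch` are hypothetical and not part of DINOv2 or torchvision.

```python
def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int, int]:
    """Return (rows, cols, num_patches) for an image that tiles exactly."""
    if height % patch or width % patch:
        raise ValueError(f"{height}x{width} is not a multiple of patch size {patch}")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


def round_to_patch(size: int, patch: int = 14) -> int:
    """Nearest side length that is a multiple of the patch size."""
    return max(patch, round(size / patch) * patch)


print(patch_grid(224, 224))   # 16 x 16 grid, 256 patch tokens
print(round_to_patch(250))    # nearest valid side length to 250
```

If you try a larger input resolution for finer features, pass the target side length through a helper like `round_to_patch` before building the `Resize` transform.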
Step 2: Embed every image with a frozen DINOv2 backbone
Load DINOv2, freeze it, and run a no-gradient forward pass to get one feature vector per image. L2-normalize the vectors so that a dot product becomes cosine similarity, exactly the trick the contrastive methods of Section 25.2 rely on.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)   # frozen feature extractor, no training
batch = torch.stack([dino_tf(im) for im in pil_images])
with torch.no_grad():
    feats = backbone(batch)   # (N, 768) CLS-token features
# TODO: L2-normalize feats along dim=1 so each row has unit length,
# then compute the (N, N) cosine-similarity matrix as feats @ feats.T.
dino_emb = ...
sim = ...
Hint
Use torch.nn.functional.normalize(feats, dim=1). After normalization, dino_emb @ dino_emb.T is the cosine-similarity matrix; its diagonal is all ones (every image is identical to itself).
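To see why normalizing first turns a dot product into cosine similarity, here is a dependency-free toy version of the same computation. It is an illustration only: the function names `l2_normalize` and `cosine_matrix` and the two-dimensional vectors are made up for the example, and in the lab itself you should use the torch calls from the hint.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(vectors: list[list[float]]) -> list[list[float]]:
    """Pairwise dot products of unit vectors, i.e. cosine similarities."""
    unit = [l2_normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


# Toy "embeddings": the second is the first scaled by 2, the third is orthogonal.
feats = [[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]]
sim = cosine_matrix(feats)
for row in sim:
    print([round(s, 3) for s in row])
```

The first two vectors differ only in length, so their similarity is 1 up to floating-point error; the third is perpendicular to both, so its similarity to them is 0. Length is discarded and only direction is compared, which is the property the nearest-neighbor ranking in Step 4 relies on.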
Step 3: Run CLIP zero-shot classification on the same images
Load a CLIP model and processor, wrap your candidate category names in a prompt template, and let CLIP score each image against each label in its shared embedding space. This is the open-vocabulary move of Section 25.4: no training, just text comparison.
from transformers import CLIPModel, CLIPProcessor
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()
labels = ["a dog", "a cat", "a car", "a plate of food", "a building"]
# TODO: build prompt strings with a template like "a photo of {label}",
# call proc(text=prompts, images=pil_images, return_tensors="pt", padding=True),
# run clip(**inputs), then softmax outputs.logits_per_image over dim=1.
probs = ...
Hint
The template matters: "a photo of a dog" usually beats the bare word "dog", the prompt-engineering effect of Exercise 25.4.2. outputs.logits_per_image has shape (num_images, num_labels); a row softmax turns it into per-image label probabilities.
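The row softmax is worth seeing on plain numbers before you apply it to real logits. The sketch below uses only the standard library, and the logit values are invented for illustration; they are not outputs of any CLIP checkpoint.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a dog", "a cat", "a car"]
# Made-up logits with shape (num_images, num_labels).
logits_per_image = [
    [24.0, 19.0, 12.0],   # one label clearly dominates
    [15.0, 15.5, 15.2],   # no label stands out
]

for row in logits_per_image:
    probs = softmax(row)
    top = probs.index(max(probs))
    print(f'"{labels[top]}" ({probs[top]:.2f})')
```

Both rows produce a winning label because each row must sum to one, but the second winner carries well under half the probability mass. When you inspect your own results, treat a low top probability as a sign that none of your candidate labels fits the image well.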
Step 4: Report the toolkit's two outputs per image
For each image, print the CLIP zero-shot label with its confidence and the file names of its three nearest DINOv2 neighbors. This is the deliverable: one model gives you a name, the other gives you visual neighbors, neither needed a single label from you.
for i, path in enumerate(paths):
    top = probs[i].argmax().item()
    conf = probs[i][top].item()
    # TODO: get the indices of the 3 most similar images to image i,
    # excluding i itself, using sim[i] (set sim[i, i] to -1 first or skip it),
    # then print path.name, labels[top], conf, and the neighbor file names.
    ...
Hint
Copy the row with row = sim[i].clone() and set row[i] = -1 so that an image is not its own neighbor; then row.topk(3).indices gives the three nearest. Map those indices back through paths to get file names.
Step 5: Compare what the two backbones group by
Pick two images that CLIP labels the same but DINOv2 places far apart (or the reverse), and write down why. This step turns the toolkit into understanding: DINOv2 groups by visual appearance, CLIP groups by nameable concept, and the gap between them is the choose-by-objective lesson of this section.
# TODO: scan for a pair (i, j) where probs[i].argmax() == probs[j].argmax()
# (same CLIP label) but sim[i, j] is low (DINOv2 thinks they look different),
# print both file names and the values, and note the disagreement in a comment.
...
Hint
Two photos of very different-looking dogs share a CLIP label but can sit far apart in DINOv2 space, because DINOv2 was never told the word "dog". The reverse case, visually similar images CLIP labels differently, exposes where appearance and concept come apart.
Expected Output
Step 1 reports the number of images loaded. Step 2 produces an (N, 768) embedding matrix and an (N, N) similarity matrix whose diagonal is all ones. Step 4 prints one block per image, for example cat_03.jpg -> "a cat" (0.91); neighbors: cat_01.jpg, cat_07.jpg, cat_02.jpg, where the neighbors are visibly the same kind of subject. Step 5 surfaces at least one disagreement pair with a one-line explanation. The finished artifact is a single script that takes a folder and a label list and returns, per image, a zero-shot name and three visual neighbors, plus a short note on where the two backbones disagree and why.
Stretch Goals
- Swap CLIP for SigLIP (google/siglip-base-patch16-224) and compare zero-shot confidences; SigLIP's sigmoid loss (Section 25.4) gives independent per-label scores rather than a forced softmax over your label set.
- Add masking: run the Segment Anything automatic mask generator on one image and overlay the masks, connecting the toolkit to the promptable segmentation of Section 25.5.
- Library Shortcut: replace the hand-built DINOv2 nearest-neighbor search with a single sklearn.neighbors.NearestNeighbors index over the embeddings, and time how the query cost scales as you grow the folder, the retrieval pattern that feeds forward into Chapter 27.
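For the SigLIP stretch goal, the difference between a softmax over the label set and a per-label sigmoid is easiest to see on an image that matches none of the candidate labels. The following is a sketch on invented logits; a real SigLIP model applies its own learned scale and bias before the sigmoid, so the exact numbers here are assumptions chosen only to show the shape of the effect.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Scores compete: they always sum to one across the label set."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Each label is scored on its own, independent of the others."""
    return 1.0 / (1.0 + math.exp(-x))


labels = ["a dog", "a cat", "a car"]
logits = [-4.0, -5.0, -4.5]   # hypothetical image that matches no label

soft = softmax(logits)
sig = [sigmoid(x) for x in logits]
for name, s, g in zip(labels, soft, sig):
    print(f"{name:>6}: softmax={s:.2f}  sigmoid={g:.3f}")
```

The softmax column still names a winner with about half the probability mass, while every sigmoid score stays near zero, so only the sigmoid column lets the toolkit report "none of these".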
Complete Solution
# Label-free vision toolkit: DINOv2 nearest neighbors + CLIP zero-shot labels.
from pathlib import Path
from PIL import Image
import torch
import torch.nn.functional as F
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor
# --- Step 1: load and preprocess ---
paths = sorted(Path("images").glob("*.jpg")) + sorted(Path("images").glob("*.png"))
pil_images = [Image.open(p).convert("RGB") for p in paths]
print(f"loaded {len(pil_images)} images")
dino_tf = transforms.Compose([
    transforms.Resize((224, 224)),                       # 224 is a multiple of 14
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),     # ImageNet normalization
])
# --- Step 2: DINOv2 embeddings and cosine-similarity matrix ---
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)
batch = torch.stack([dino_tf(im) for im in pil_images])
with torch.no_grad():
    feats = backbone(batch)                              # (N, 768)
dino_emb = F.normalize(feats, dim=1) # unit-length rows
sim = dino_emb @ dino_emb.T # cosine similarity (N, N)
# --- Step 3: CLIP zero-shot classification ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()
labels = ["a dog", "a cat", "a car", "a plate of food", "a building"]
prompts = [f"a photo of {lbl}" for lbl in labels] # prompt template helps
inputs = proc(text=prompts, images=pil_images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = clip(**inputs)
probs = out.logits_per_image.softmax(dim=1) # (N, num_labels)
# --- Step 4: per-image report ---
for i, path in enumerate(paths):
    top = probs[i].argmax().item()
    conf = probs[i][top].item()
    row = sim[i].clone()
    row[i] = -1                                          # exclude self
    nbrs = [paths[j].name for j in row.topk(3).indices.tolist()]
    print(f'{path.name} -> "{labels[top]}" ({conf:.2f}); neighbors: {", ".join(nbrs)}')
# --- Step 5: find a disagreement (same CLIP label, low DINOv2 similarity) ---
clip_label = probs.argmax(dim=1)
N = len(paths)
for i in range(N):
    for j in range(i + 1, N):
        if clip_label[i] == clip_label[j] and sim[i, j] < 0.4:
            print(f"DISAGREE: {paths[i].name} & {paths[j].name} both "
                  f'"{labels[clip_label[i]]}" but DINOv2 sim={sim[i, j]:.2f}')
            break
    else:
        continue   # inner loop found nothing; try the next i
    break          # inner loop hit a pair; stop after the first disagreement
# DINOv2 groups by appearance; CLIP groups by nameable concept. The gap between
# them is exactly the choose-by-objective lesson of Table 25.6.1.
Further Reading: Generative Vision-Language Models Intermediate
Introduces VQAv2, which rebalances the benchmark by pairing each question with two similar images that have different answers, defeating language-prior shortcuts so that a model must actually look at the image. Defines the consensus-based accuracy metric $\min(\#\text{humans}/3, 1)$ used in subsection 4.1.
A lightweight Querying Transformer (Q-Former) with learned query tokens cross-attends to a frozen image encoder and feeds the frozen LLM, bridging two giants while training only the small connector. Two-stage pretraining (representation learning, then generative learning) yields strong zero-shot VQA with far fewer trainable parameters than Flamingo.
Introduces LLaVA: a CLIP ViT encoder, a single linear projection, and a Vicuna LLM, instruction-tuned on dialogue data generated by a text-only GPT-4. Shows that a minimal connector plus high-quality instruction data produces a capable open vision-language assistant.
Upgrades LLaVA with a CLIP-ViT-L 336px encoder, a two-layer MLP projector, and academic VQA data in the instruction mix, reaching state of the art on eleven benchmarks. The 13B model trains in roughly one day on a single 8xA100 node, evidence that the simple-projector recipe scales efficiently.
Pioneers the frozen-giants approach with a perceiver resampler and gated cross-attention layers spliced into a frozen language model, enabling few-shot learning over interleaved sequences of images and text. The architectural ancestor of the encoder-connector-LLM template in subsection 4.2.
A college-level multimodal benchmark spanning six core disciplines and 30 subjects, with figures, charts, and diagrams, designed to test genuine reasoning over images rather than pattern-matching frequent question-answer pairs. Representative of the evaluation suites (alongside MMBench and MM-Vet) that grade modern generative VLMs.