Section 16.2: Bag of Visual Words & Spatial Pyramids

"Describe this picture? Easy. Three hundred corners, two hundred blobs, a smattering of edges, and not the faintest idea where any of them are. You are welcome."
A Bag of Visual Words With No Sense of Place

Big Picture

If a document is a bag of words, an image is a bag of visual words, and a century of text-retrieval mathematics transfers wholesale the moment you decide what counts as a word. This section makes that decision precise. The local descriptors of Chapter 10 are clustered into a fixed vocabulary; every image becomes a histogram counting how often each visual word appears; and a standard classifier learns categories from those histograms. The encoding throws away all spatial layout, which is both its strength (robustness to position and clutter) and its fatal blindness (it cannot tell a face from the same parts shuffled). Spatial pyramids buy back coarse geometry, and the VLAD and Fisher vector refinements push the representation to the edge of what a hand-designed encoding can do, which is exactly the ceiling Section 16.6 will measure.

The previous section ended with a verdict: pixels are the wrong currency for recognition because they are invariant to nothing. Section 16.1 pointed at engineered features as the fix. This section takes the first and most influential step, turning the local descriptors you already know how to compute into a single fixed-length vector that describes a whole image, robustly enough to feed a category classifier. The whole construction is borrowed, almost line for line, from how search engines index text.

1. From Documents to Images: The Analogy Beginner

In text retrieval, the bag-of-words model represents a document by which words it contains and how often, ignoring word order entirely. "The cat sat on the mat" and "the mat sat on the cat" have identical bag-of-words representations. That sounds like a catastrophic loss of information, yet it is enough to cluster news articles, rank search results, and detect topics, because the set of words present is a strong signal of content even without order. Vision borrows the trick directly. The obstacle is that images do not come pre-segmented into words; we have to invent the vocabulary ourselves. The pipeline has four stages, and the rest of the section is one stage per subsection: detect and describe local features, build a vocabulary by clustering, encode each image as a word histogram, and classify.

Figure 16.2.1: The four stages of bag-of-visual-words recognition. Detection and description (stage 1) reuse Chapter 10 directly; clustering (stage 2) builds the vocabulary once; per image, features are assigned to their nearest word and counted into a histogram (stage 3) that a classifier maps to a label (stage 4).

2. Building the Visual Vocabulary Intermediate

A visual word is a representative descriptor, a cluster center in the space of local descriptors. To build the vocabulary we collect a large pool of descriptors from many training images and run k-means clustering with $K$ centers, where $K$ is the vocabulary size, typically a few hundred to a few thousand. Each of the $K$ resulting centroids is a visual word: a prototypical local appearance (a particular corner, a particular blob, a particular edge junction) that recurs across the corpus. The choice of $K$ trades expressiveness against generalization. Too small and distinct patterns collapse into one word; too large and the same physical pattern fragments across several words, so two images of the same object stop sharing words. Figure 16.2.1 places this clustering as the one offline stage of the pipeline.

The code below extracts SIFT descriptors from a set of training images and clusters them into a vocabulary. SIFT comes straight from Chapter 10; bag-of-words simply consumes its output.

# Build the visual vocabulary: pool SIFT descriptors from every training
# image, then run k-means so each of the K cluster centers becomes one
# visual word. This offline step is stage 2 of the pipeline.
import cv2
import numpy as np

sift = cv2.SIFT_create()

# Stage 1: pool descriptors from the whole training set.
all_descriptors = []
for path in training_image_paths:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)   # desc: (n_keypoints, 128)
    if desc is not None:
        all_descriptors.append(desc)
pool = np.vstack(all_descriptors).astype(np.float32)
print(f"pooled {pool.shape[0]} descriptors of dim {pool.shape[1]}")

# Stage 2: k-means builds the vocabulary; each center is one visual word.
K = 400
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, vocabulary = cv2.kmeans(
    pool, K, None, criteria, attempts=3, flags=cv2.KMEANS_PP_CENTERS)
print(f"vocabulary shape: {vocabulary.shape}")    # (400, 128): 400 words
# pooled 318442 descriptors of dim 128
# vocabulary shape: (400, 128)

Building a $400$-word visual vocabulary: SIFT descriptors from every training image are pooled, then k-means partitions that descriptor space into $K$ clusters whose centers become the visual words. This runs once, offline.

Common Misconception: A Visual Word Is a Recognizable Object Part

The text-retrieval analogy invites a tempting picture: that visual word $\#37$ is "an eye," word $\#112$ is "a wheel," and the vocabulary is a dictionary of nameable object parts. In fact a visual word is just a k-means cluster center in SIFT descriptor space, a recurring low-level gradient pattern (a particular corner, edge junction, or blob texture) with no semantic label and no guarantee of corresponding to any meaningful part. The same word can fire on an eyelash, a leaf vein, and a watch strap if their local gradient statistics happen to be similar, and one true object part is usually shattered across several words. The vocabulary's power comes from the statistics of which low-level patterns co-occur, not from any per-word meaning. Do not expect to interpret an individual word. And do not assume two images sharing many words depict the same object until spatial verification (Section 6 and the SLAM loop closure of Chapter 14) confirms it. This is precisely the interpretability gap that learned features narrow only partially even in Chapter 25.

3. Encoding an Image as a Word Histogram Intermediate

With a vocabulary in hand, encoding any image is mechanical. Detect and describe its local features, assign each descriptor to its nearest visual word (the closest centroid), and count how many descriptors landed in each word. The result is a $K$-dimensional histogram: the bag of visual words. Two images of the same category produce similar histograms because they fire similar words at similar rates, regardless of where in the frame the features sat. That positional indifference is the encoding's central property, and we will spend the next subsection both praising and repairing it.

# Encode one image as a bag of visual words: assign each SIFT descriptor to
# its nearest vocabulary word, count the words into a K-dim histogram, and
# L1-normalize so images with different feature counts stay comparable.
import numpy as np
from scipy.cluster.vq import vq

def encode_bow(gray, sift, vocabulary):
    """Return an L1-normalized K-dim visual-word histogram for one image."""
    _, desc = sift.detectAndCompute(gray, None)
    K = vocabulary.shape[0]
    if desc is None:
        return np.zeros(K, dtype=np.float32)
    words, _ = vq(desc.astype(np.float32), vocabulary)  # nearest word per descriptor
    hist, _ = np.histogram(words, bins=np.arange(K + 1))
    hist = hist.astype(np.float32)
    return hist / (hist.sum() + 1e-9)                   # normalize away feature count

bow = encode_bow(cv2.imread("query.png", cv2.IMREAD_GRAYSCALE), sift, vocabulary)
print(f"histogram dim {bow.shape[0]}, sums to {bow.sum():.3f}, nonzero words {np.count_nonzero(bow)}")
# histogram dim 400, sums to 1.000, nonzero words 173

Encoding one image: each descriptor is mapped to its nearest visual word with vq, the words are counted, and the histogram is L1-normalized so images with different numbers of features remain comparable.

The normalization matters. A cluttered image fires more features than a sparse one, so raw counts would conflate "lots of texture" with "strong match." Dividing by the total turns counts into a distribution, the visual analog of term frequency in text. The text analogy goes one step further: just as the word "the" carries little discriminative power because it appears everywhere, a visual word that appears in nearly every image is uninformative. TF-IDF weighting downweights ubiquitous words and upweights rare, distinctive ones, exactly as in document retrieval, and it was the weighting that made the Video Google system (Sivic and Zisserman, ICCV 2003) work. The weight for word $w$ in image $d$ is

$$ \text{tfidf}(w, d) \;=\; \underbrace{\frac{n_{w,d}}{\sum_{w'} n_{w',d}}}_{\text{term frequency}} \;\cdot\; \underbrace{\log\frac{N}{n_w}}_{\text{inverse document frequency}}, $$

where $n_{w,d}$ counts word $w$ in image $d$, $N$ is the number of images, and $n_w$ is the number of images containing $w$ at all. The same machinery enables an inverted file: a lookup from each word to the list of images that contain it, which makes retrieval over millions of images fast and underpins the loop-closure step in the visual-SLAM systems of Chapter 14.

4. The Blind Spot, and the Spatial Pyramid Fix Intermediate

The bag-of-words encoding discards all geometry. This is genuinely useful: an object recognized by its parts survives translation, mild clutter, and occlusion, because the histogram does not care where the parts are or whether a few are missing. But the same blindness is a hard ceiling. A correct face and a scrambled face, eyes where the mouth should be, have identical word histograms, because both contain the same parts in the same counts. The encoding literally cannot represent layout, so it cannot distinguish arrangements, and scene categories that differ mainly by spatial structure (a coast is sky-above-water; a forest is texture-everywhere) confuse it.

The spatial pyramid (Lazebnik, Schmid and Ponce, CVPR 2006) buys back coarse geometry without abandoning the bag-of-words robustness. Partition the image into a sequence of ever-finer grids, a single $1 \times 1$ cell at level $0$, a $2 \times 2$ grid at level $1$, a $4 \times 4$ grid at level $2$, and so on, compute a separate word histogram inside each cell, and concatenate them all into one long vector. Now the representation knows roughly where each word fired (which cell), at multiple resolutions, while still being a histogram and still tolerating small misalignments within a cell. Figure 16.2.2 shows the three-level partition and how the per-cell histograms stack up.

Figure 16.2.2: Spatial pyramid matching. Histograms are computed in cells at three grid resolutions and concatenated into one vector $21$ times the vocabulary length. Finer levels receive larger weights because agreement in a small cell is stronger evidence of shared spatial structure; coarse levels stay forgiving of position.

The weighting in Figure 16.2.2 encodes the matching intuition: agreement at a fine resolution means the words really do sit in the same place, so it counts more, while the coarse level remains a safety net for objects that shifted between cells. With a three-level pyramid and a $400$-word vocabulary the concatenated vector has $(1 + 4 + 16) \times 400 = 8400$ dimensions, large but still a fixed length a linear classifier can handle. The code below builds it.

# Build a spatial-pyramid encoding: split the image into ever-finer grids,
# histogram the visual words inside each cell, weight finer levels more
# heavily, and concatenate every cell into one fixed-length vector.
import numpy as np
from scipy.cluster.vq import vq

def spatial_pyramid(gray, sift, vocabulary, levels=2):
    """Concatenate weighted word histograms over a 0..levels spatial pyramid."""
    kps, desc = sift.detectAndCompute(gray, None)
    K = vocabulary.shape[0]
    H, W = gray.shape
    if desc is None:
        n_cells = sum((2 ** l) ** 2 for l in range(levels + 1))
        return np.zeros(n_cells * K, dtype=np.float32)
    words, _ = vq(desc.astype(np.float32), vocabulary)
    xs = np.array([kp.pt[0] for kp in kps])
    ys = np.array([kp.pt[1] for kp in kps])

    blocks = []
    for level in range(levels + 1):
        n = 2 ** level                                  # n x n grid this level
        weight = 1.0 / 2 ** (levels - level)            # finer level -> larger weight
        for gy in range(n):
            for gx in range(n):
                in_cell = ((xs >= gx * W / n) & (xs < (gx + 1) * W / n) &
                           (ys >= gy * H / n) & (ys < (gy + 1) * H / n))
                hist, _ = np.histogram(words[in_cell], bins=np.arange(K + 1))
                blocks.append(weight * hist.astype(np.float32))
    vec = np.concatenate(blocks)
    return vec / (vec.sum() + 1e-9)

spm = spatial_pyramid(cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE), sift, vocabulary)
print(f"spatial pyramid vector length: {spm.shape[0]}")   # 21 * K
# spatial pyramid vector length: 8400

Spatial pyramid encoding: per level, keypoints are bucketed by their pixel coordinates into an $n \times n$ grid, a histogram is built per cell, scaled by the level weight, and all cells across all levels are concatenated into one $21K$-dimensional vector.

Key Insight: Layout Is Information You Choose How Much To Keep

Plain bag-of-words keeps zero spatial information; a per-pixel template (Section 16.1) keeps all of it. The spatial pyramid is the dial between them: more levels keep more geometry and tolerate less misalignment, fewer levels keep less geometry and tolerate more. Recognition is full of such dials, and choosing where to set them by hand, how much invariance to spend, is exactly the design labor that deep networks later automate by learning the right amount of spatial pooling from data. When you reach the global average pooling of modern convolutional neural networks (CNNs) in Chapter 19, recognize it as a learned, single-level version of this very dial.

Fun Fact: An Image Walks Into a Search Engine

Bag-of-visual-words exists because someone took the joke literally. In 2003, Sivic and Zisserman looked at Google's text index and asked: what if a frame of video were a document and its SIFT patches were words? The pun became a system, the system became a paper, and the paper named itself Video Google. The mental model survives the laugh: cluster the corners into a dictionary, count the words, forget the grammar. Forgetting the grammar is the whole trick and the whole flaw, which is exactly why the spatial pyramid had to put a little grammar back. The illustration below captures that forgetting: the bag knows exactly which parts it holds and has no idea where any of them go.

A friendly tote bag stuffed with jumbled facial puzzle pieces happily counts how many of each piece it holds but cannot tell a real face from a scrambled one, illustrating how a bag of visual words counts local features while discarding all spatial layout. — Count the corners, forget the grammar: a bag of visual words knows exactly what parts an image contains and has absolutely no idea where any of them go.

5. Classifying the Histograms Beginner

Once every image is a fixed-length vector, recognition is ordinary supervised classification. A linear support vector machine (the subject of Section 16.3) works well, but bag-of-words histograms classify even better under a kernel matched to their nature. Because two histograms are compared most meaningfully by how their distributions overlap, the histogram intersection and chi-squared kernels outperform a plain linear one, and they were standard in the spatial-pyramid era. Each is just a similarity score between two histograms $\mathbf{a}$ and $\mathbf{b}$. Histogram intersection sums the per-bin minimum, $\sum_i \min(a_i, b_i)$, literally the shared area under the two histograms, so it rewards counts the two images hold in common and ignores the rest. The chi-squared kernel uses

$$ K(\mathbf{a}, \mathbf{b}) \;=\; \exp\!\left( -\gamma \sum_i \frac{(a_i - b_i)^2}{a_i + b_i} \right), $$

where the per-bin term $\tfrac{(a_i - b_i)^2}{a_i + b_i}$ divides each squared difference by the bin's own magnitude. That denominator is the whole point: a discrepancy of $0.1$ in a rare word (small $a_i + b_i$) counts far more than the same $0.1$ in a ubiquitous one. Put numbers on it: a rare word with $a_i = 0.10$ and $b_i = 0.00$ contributes $\tfrac{0.10^2}{0.10} = 0.10$, while a common word that differs by the same $0.1$ but sits high, $a_i = 0.50$ and $b_i = 0.40$, contributes only $\tfrac{0.10^2}{0.90} \approx 0.011$, roughly nine times less. So the kernel weights distinctive words heavily and shrugs off the noisy common ones, exactly the emphasis a plain Euclidean (linear) comparison cannot make. The code below trains a chi-squared-kernel SVM on the encoded training set, the classifier that closes the pipeline of Figure 16.2.1.

# Train the final category classifier on spatial-pyramid vectors using a
# chi-squared kernel SVM, the standard choice for visual-word histograms,
# then report accuracy on the encoded test set.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# X_train: (n_images, 21*K) spatial-pyramid vectors; y_train: integer labels.
clf = SVC(kernel=chi2_kernel, C=10.0)   # chi-squared kernel suits histograms
clf.fit(X_train, y_train)

X_test_vec = np.array([spatial_pyramid(g, sift, vocabulary) for g in test_grays])
pred = clf.predict(X_test_vec)
acc = (pred == y_test).mean()
print(f"spatial-pyramid + chi2-SVM accuracy: {acc:.3f}")
# spatial-pyramid + chi2-SVM accuracy: 0.812

Training the final classifier: a support vector machine with a chi-squared kernel, the standard choice for visual-word histograms, learns categories from spatial-pyramid vectors and reports test accuracy.

Practical Example: Indexing a Museum's Image Archive

Who & situation: a national museum digitizing two million historical photographs wanted "find me more like this" search across the collection, on a fixed on-premises server, with no GPUs and no cloud egress allowed for the sensitive archive. Problem: nearest-neighbor search over raw images was hopeless, and the deep-embedding services they trialed could not run within their compliance constraints. Decision: they built a $100{,}000$-word bag-of-visual-words index with TF-IDF weighting and an inverted file, exactly the Video Google recipe, so a query image retrieved candidates by shared rare words in milliseconds, then re-ranked the top few hundred with spatial verification. Result: sub-second visual search over the full two million images on commodity CPU hardware, fully on-premises, and curators discovered duplicate prints and related series the catalogue had never linked. Lesson: bag-of-words is not just historically important; its inverted-file scalability and zero-GPU footprint still make it the right tool when constraints rule out learned embeddings, and the same machinery is what closes loops in the SLAM systems of Chapter 14.

You Could Build This: A Reverse Image Search Engine Advanced

The four stages you just assembled are a complete "find me more like this" engine, and wiring them into one is a genuinely portfolio-worthy build. Point it at a folder of a few thousand of your own photos, build a $1000$-word vocabulary once with the Section 2 k-means step, encode every image to a TF-IDF-weighted histogram, and store the result as the inverted file of Section 3 (a dictionary from each visual word to the images that contain it). A query image then retrieves candidates by shared rare words in milliseconds rather than by scanning the whole collection, exactly the Video Google recipe behind this section's museum-archive example. The capstone touch that lifts it from demo to system is the re-ranking the practical example mentions: take the top few hundred candidates and verify each with the geometric match check of Chapter 10, so two images sharing many words are confirmed to depict the same scene rather than merely the same textures. Build it on commodity CPU with no GPU and no labels, then compare its retrieval quality against the two-line learned backbone of Chapter 25 to feel exactly where the classical pipeline still holds and where it gives way.

6. Beyond Counting: VLAD and Fisher Vectors Advanced

Counting which word each descriptor is nearest to throws away a lot: it records that a descriptor landed in cluster $c$ but forgets where inside the cluster it landed. The encodings that pushed bag-of-words to its peak accuracy keep that residual information. VLAD (Vector of Locally Aggregated Descriptors) stores, for each visual word, the sum of the differences between the assigned descriptors and the word's center: not just "how many" but "in which direction they deviated." For a $K$-word vocabulary of $D$-dimensional descriptors, VLAD produces a $K \times D$ vector,

$$ \mathbf{v}_c \;=\; \sum_{i : \text{NN}(x_i) = c} \big( x_i - \mu_c \big), \qquad \text{VLAD} = \big[\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_K\big], $$

where $\mu_c$ is the $c$-th center. Fisher vectors go further still, modeling the descriptor distribution as a Gaussian mixture (the same weighted blend of a few Gaussian blobs used to model colors in the GrabCut segmentation of Section 11.4, here modeling descriptors instead) and recording the gradient of the data's log-likelihood with respect to the mixture parameters, capturing first and second order deviations (means and variances) per component. Both encodings let a much smaller vocabulary reach higher accuracy than a giant bag-of-words histogram, because each word now carries rich residual statistics instead of a single count. Fisher vectors held the top of image classification benchmarks right up until convolutional networks overtook them in 2012, a hand-off Section 16.6 examines closely.

Research Frontier: Bag-of-Words, Fully Learned

The aggregate-local-features idea never left; it became a differentiable layer. NetVLAD turned the VLAD assignment into a soft, trainable operation and dominated visual place recognition for years, and the 2024 model BoQ (Ali-bey, Chaib-draa and Giguère, CVPR 2024, arXiv:2405.07364) replaces the fixed vocabulary with a set of learnable queries that attend over an image's features, a direct intellectual descendant of this section's vocabulary, now learned end to end. SuperGlue and its successors verify candidate matches with learned spatial reasoning, the modern form of the geometric re-ranking that bag-of-words retrieval always needed. Even the inverted-file trick survives in the approximate-nearest-neighbor indices (FAISS, ScaNN) that serve billion-scale embedding search today. The vocabulary, the histogram, the inverted file, and the spatial verification all persist; what changed is that the features being aggregated come from the learned backbones of Chapter 25 rather than from SIFT.

Library Shortcut: BOWImgDescriptorExtractor

The vocabulary build (Section 2) and the histogram encode (Section 3) are roughly thirty lines together. OpenCV bundles the whole bag-of-words encoder into two classes: BOWKMeansTrainer for clustering and BOWImgDescriptorExtractor for per-image histograms, handling the nearest-word assignment and normalization internally so you write about five lines. It does not include the spatial pyramid (you still add that yourself) or the TF-IDF weighting, but for a flat bag-of-words baseline it removes the bookkeeping entirely.

# OpenCV's built-in bag-of-words encoder: BOWKMeansTrainer clusters the
# vocabulary and BOWImgDescriptorExtractor produces a per-image histogram,
# handling nearest-word assignment and normalization internally.
import cv2
bow_trainer = cv2.BOWKMeansTrainer(clusterCount=400)
for desc in training_descriptors:
    bow_trainer.add(desc)
vocabulary = bow_trainer.cluster()                 # k-means inside

extractor = cv2.BOWImgDescriptorExtractor(cv2.SIFT_create(), cv2.BFMatcher())
extractor.setVocabulary(vocabulary)
hist = extractor.compute(gray, cv2.SIFT_create().detect(gray))  # one normalized histogram

OpenCV's built-in bag-of-words pipeline: BOWKMeansTrainer clusters the vocabulary and BOWImgDescriptorExtractor produces the per-image histogram in one call, collapsing the manual assignment and counting of Section 3.

Exercise 16.2.1: The Scrambled-Face Argument Conceptual

Explain precisely why a plain bag-of-visual-words histogram assigns an identical representation to a face and to the same face with its eyes, nose, and mouth spatially permuted. Then state exactly how a two-level spatial pyramid changes that, and identify a permutation that even a two-level pyramid would still fail to distinguish. Use your answer to define, in one sentence, what spatial information a pyramid of $L$ levels does and does not preserve.

Exercise 16.2.2: Vocabulary Size Sweep Coding

Using a small labeled image set (for example, a few classes from Caltech-101), build bag-of-words vocabularies of size $K \in \{50, 100, 400, 1000, 4000\}$, encode every image, and train a linear SVM for each. Plot test accuracy against $K$ and against vocabulary build time. Identify the $K$ where accuracy stops improving and explain the two competing effects (under-clustering versus fragmentation) from Section 2 that produce that peak.

Exercise 16.2.3: Pyramid Levels Versus Cost Analysis

For a fixed vocabulary of $K = 400$, encode the same dataset with spatial pyramids of $0$, $1$, $2$, and $3$ levels and record both classification accuracy and the resulting vector dimension ($K$, $5K$, $21K$, $85K$). Tabulate accuracy gained per dimension added. Argue from your table where the diminishing returns set in, and connect the result to the Key Insight's "dial" framing: at what level does adding geometry stop being worth the dimensionality it costs?