"Give me an image and I will divide it into regions. Whether those regions mean anything is, regrettably, your problem and not mine."
An Unsupervised Clustering Algorithm With No Concept of Objects
Segmentation answers the question every recognition system must eventually face: which pixels belong together? The previous chapter found sparse, distinctive anchor points for geometry. This chapter goes the other way and asks for dense groupings: partition the entire image into coherent regions that a downstream stage can label, measure, or edit. The classical toolkit assembled here, clustering in feature space, growing regions from seeds, flooding a topographic surface, cutting a graph at its weakest links, and oversegmenting into superpixels, is not a museum exhibit. Watershed still separates touching cells in microscopy, GrabCut still powers one-click background removal, and SLIC superpixels still preprocess pixels for networks that would otherwise drown in them. These algorithms also teach the vocabulary, region, boundary, affinity, energy, that the deep segmentation models of Chapter 24 later inherit wholesale.
Chapter Overview
Chapter 10 solved the correspondence problem: find the same physical point in two photographs. That gave us sparse, precise anchors. But a great deal of vision needs the opposite kind of information, dense and regional rather than sparse and pointwise. A robot picking parts off a conveyor needs to know which pixels are the part and which are the belt. A radiologist's assistant needs the tumor outlined, not three keypoints near it. A photo editor needs the foreground separated from the background so it can be relit or replaced. Each of these is a segmentation problem: a partition of the image grid into regions, ideally regions that correspond to something a human would call an object, a surface, or a material.
The difficulty is that "meaningful" is not a property the pixels carry. A pixel is a triple of numbers; it does not know it belongs to a cat. Classical segmentation therefore proceeds by proxy: it groups pixels that are similar in some measurable way (color, brightness, texture, spatial proximity) and hopes that similarity tracks meaning closely enough to be useful. This is both the strength and the limitation of the classical approach. When the proxy is good, the methods are fast, deterministic, and need no training data; when the proxy fails, no amount of clever optimization recovers the missing semantics. The chapter is, in a sense, an extended study of how far you can get on similarity alone, and where you must hand off to the learned models of Part III.
The five sections move from the most assumption-free method to the most structured. Section 11.1 treats segmentation as clustering: throw every pixel into a feature space and let K-Means or Mean-Shift find the clusters, with no notion of spatial connectivity at all. Section 11.2 adds spatial structure back, growing regions from seed pixels and merging or splitting them with a homogeneity test. Section 11.3 introduces the watershed transform, the elegant idea of reading an image as a landscape and flooding it from below, which solves the touching-objects problem that clustering cannot. Section 11.4 reframes the whole problem as a graph: pixels are nodes, similarities are edge weights, and a segmentation is a cut, an idea that yields both the globally optimal binary segmentation of graph cuts and the interactive magic of GrabCut. Section 11.5 closes with superpixels, which deliberately oversegment into hundreds of small, regular, edge-respecting regions that serve as a compact substrate for everything downstream.
A thread runs through all five sections: the tension between local evidence and global consistency. A pixel's own color is local evidence; whether it joins its neighbor is a question about consistency across the image. Clustering ignores spatial consistency entirely and pays for it with speckle. Region growing enforces local consistency greedily and pays for it with seed sensitivity. Graph cuts optimize a global energy and pay for it in compute. Reading the chapter as a sequence of answers to "how much global structure can I afford?" makes the design choices legible. The same trade reappears, with the unary and pairwise terms now learned, when Chapter 24 bolts a Conditional Random Field onto a fully convolutional network, and when promptable models like SAM let a user click instead of seed.
Prerequisites
This chapter builds directly on the color spaces and intensity statistics of Chapter 2: Point Operations, Histograms & Thresholding, whose histograms and thresholds are the simplest segmentations of all. The watershed and region-growing sections lean on the morphological operators (erosion, dilation, the distance transform, connected components) from Chapter 6: Morphology, Binary Images & Shape, and on the gradients and edge maps of Chapter 9: Edges, Lines & Curves. A reader comfortable with NumPy arrays, basic linear algebra, and the idea of a feature vector will follow the clustering material without trouble; the graph-cut section assumes nothing about max-flow beyond what Section 11.4 develops from scratch.
Chapter Roadmap
- 11.1 Segmentation as Clustering: K-Means & Mean-Shift Treating each pixel as a point in a color or color-plus-position feature space: K-Means with its chosen k, Mean-Shift with its data-driven mode count, and the speckle that comes from ignoring spatial connectivity.
- 11.2 Region Growing & Split-and-Merge Putting spatial connectivity back in: flood-filling from seed pixels under a homogeneity predicate, the quadtree split-and-merge alternative, and how seed placement decides everything.
- 11.3 The Watershed Transform Reading an image as a topographic surface and flooding it from below: catchment basins, the oversegmentation problem, and marker-controlled watershed that separates touching objects classical clustering cannot.
- 11.4 Graph-Based Segmentation: Graph Cuts & GrabCut Pixels as graph nodes and similarities as edge weights: the min-cut/max-flow energy, the Felzenszwalb-Huttenlocher efficient graph segmentation, and GrabCut's iterative foreground extraction from a single box.
- 11.5 Superpixels: SLIC & Friends Oversegmenting into hundreds of small, regular, edge-respecting regions: SLIC's local k-means, compactness versus boundary adherence, and why superpixels are the preferred substrate for graphs, CRFs, and even neural pipelines.
What's Next?
Segmentation carved the image into regions; the next chapters ask where those regions sit in three dimensions. Chapter 12: Camera Models & Calibration opens the geometric arc of Part II by formalizing how a 3D world projects onto the 2D grid we have been segmenting, and how calibration recovers the precise mapping. The grouping instinct developed here does not disappear: it returns, learned end to end, in Chapter 24: Segmentation, where fully convolutional networks predict per-pixel labels and promptable models like SAM turn a click into a mask. And the masks themselves become controllers in Chapter 35: Controllable Generation & Image Editing, where a region selected here decides what a diffusion model is allowed to repaint.
Bibliography & Further Reading
Foundational Papers
cv2.watershed descends from. The heart of Section 11.3.skimage.segmentation.felzenszwalb. Section 11.4's fast unsupervised alternative to graph cuts.Recent Research (2023-2026)
Books
Tools & Libraries
skimage.segmentation module documentation. scikit-image.orgcv2.watershed, cv2.grabCut, and cv2.kmeans that match Sections 11.1, 11.3, and 11.4 line for line.sklearn.cluster (KMeans, MeanShift) documentation. scikit-learn.org