Part II: Classical Computer Vision
Chapter 11: Classical Segmentation & Grouping

Classical Segmentation & Grouping

"Give me an image and I will divide it into regions. Whether those regions mean anything is, regrettably, your problem and not mine."

An Unsupervised Clustering Algorithm With No Concept of Objects
Big Picture

Segmentation answers the question every recognition system must eventually face: which pixels belong together? The previous chapter found sparse, distinctive anchor points for geometry. This chapter goes the other way and asks for dense groupings: partition the entire image into coherent regions that a downstream stage can label, measure, or edit. The classical toolkit assembled here, clustering in feature space, growing regions from seeds, flooding a topographic surface, cutting a graph at its weakest links, and oversegmenting into superpixels, is not a museum exhibit. Watershed still separates touching cells in microscopy, GrabCut still powers one-click background removal, and SLIC superpixels still preprocess pixels for networks that would otherwise drown in them. These algorithms also teach the vocabulary, region, boundary, affinity, energy, that the deep segmentation models of Chapter 24 later inherit wholesale.

Chapter Overview

Chapter 10 solved the correspondence problem: find the same physical point in two photographs. That gave us sparse, precise anchors. But a great deal of vision needs the opposite kind of information, dense and regional rather than sparse and pointwise. A robot picking parts off a conveyor needs to know which pixels are the part and which are the belt. A radiologist's assistant needs the tumor outlined, not three keypoints near it. A photo editor needs the foreground separated from the background so it can be relit or replaced. Each of these is a segmentation problem: a partition of the image grid into regions, ideally regions that correspond to something a human would call an object, a surface, or a material.

The difficulty is that "meaningful" is not a property the pixels carry. A pixel is a triple of numbers; it does not know it belongs to a cat. Classical segmentation therefore proceeds by proxy: it groups pixels that are similar in some measurable way (color, brightness, texture, spatial proximity) and hopes that similarity tracks meaning closely enough to be useful. This is both the strength and the limitation of the classical approach. When the proxy is good, the methods are fast, deterministic, and need no training data; when the proxy fails, no amount of clever optimization recovers the missing semantics. The chapter is, in a sense, an extended study of how far you can get on similarity alone, and where you must hand off to the learned models of Part III.

The five sections move from the most assumption-free method to the most structured. Section 11.1 treats segmentation as clustering: throw every pixel into a feature space and let K-Means or Mean-Shift find the clusters, with no notion of spatial connectivity at all. Section 11.2 adds spatial structure back, growing regions from seed pixels and merging or splitting them with a homogeneity test. Section 11.3 introduces the watershed transform, the elegant idea of reading an image as a landscape and flooding it from below, which solves the touching-objects problem that clustering cannot. Section 11.4 reframes the whole problem as a graph: pixels are nodes, similarities are edge weights, and a segmentation is a cut, an idea that yields both the globally optimal binary segmentation of graph cuts and the interactive magic of GrabCut. Section 11.5 closes with superpixels, which deliberately oversegment into hundreds of small, regular, edge-respecting regions that serve as a compact substrate for everything downstream.

A thread runs through all five sections: the tension between local evidence and global consistency. A pixel's own color is local evidence; whether it joins its neighbor is a question about consistency across the image. Clustering ignores spatial consistency entirely and pays for it with speckle. Region growing enforces local consistency greedily and pays for it with seed sensitivity. Graph cuts optimize a global energy and pay for it in compute. Reading the chapter as a sequence of answers to "how much global structure can I afford?" makes the design choices legible. The same trade reappears, with the unary and pairwise terms now learned, when Chapter 24 bolts a Conditional Random Field onto a fully convolutional network, and when promptable models like SAM let a user click instead of seed.

Prerequisites

This chapter builds directly on the color spaces and intensity statistics of Chapter 2: Point Operations, Histograms & Thresholding, whose histograms and thresholds are the simplest segmentations of all. The watershed and region-growing sections lean on the morphological operators (erosion, dilation, the distance transform, connected components) from Chapter 6: Morphology, Binary Images & Shape, and on the gradients and edge maps of Chapter 9: Edges, Lines & Curves. A reader comfortable with NumPy arrays, basic linear algebra, and the idea of a feature vector will follow the clustering material without trouble; the graph-cut section assumes nothing about max-flow beyond what Section 11.4 develops from scratch.

Chapter Roadmap

What's Next?

Segmentation carved the image into regions; the next chapters ask where those regions sit in three dimensions. Chapter 12: Camera Models & Calibration opens the geometric arc of Part II by formalizing how a 3D world projects onto the 2D grid we have been segmenting, and how calibration recovers the precise mapping. The grouping instinct developed here does not disappear: it returns, learned end to end, in Chapter 24: Segmentation, where fully convolutional networks predict per-pixel labels and promptable models like SAM turn a click into a mask. And the masks themselves become controllers in Chapter 35: Controllable Generation & Image Editing, where a region selected here decides what a diffusion model is allowed to repaint.

Bibliography & Further Reading

Foundational Papers

Comaniciu, D. and Meer, P. "Mean Shift: A Robust Approach Toward Feature Space Analysis." IEEE TPAMI (2002). doi:10.1109/34.1000236
The reference treatment of Mean-Shift for image segmentation and tracking: the kernel density view, the mode-seeking iteration, and the joint spatial-range filtering that Section 11.1 builds on.
Vincent, L. and Soille, P. "Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations." IEEE TPAMI (1991). doi:10.1109/34.87344
The immersion-flooding algorithm that made watershed practical, and the algorithm OpenCV's cv2.watershed descends from. The heart of Section 11.3.
Boykov, Y. and Jolly, M.-P. "Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images." ICCV (2001). doi:10.1109/ICCV.2001.937505
The paper that recast binary segmentation as min-cut/max-flow with user-supplied hard constraints. Section 11.4's energy function comes straight from here.
Rother, C., Kolmogorov, V., and Blake, A. "GrabCut: Interactive Foreground Extraction Using Iterated Graph Cuts." ACM SIGGRAPH (2004). doi:10.1145/1015706.1015720
GrabCut: Gaussian mixture color models plus iterated graph cuts, initialized from a single bounding box. The algorithm behind one-click background removal and Section 11.4's centerpiece.
Felzenszwalb, P. and Huttenlocher, D. "Efficient Graph-Based Image Segmentation." IJCV (2004). doi:10.1023/B:VISI.0000022288.19776.77
A near-linear-time region merging algorithm with an adaptive boundary predicate, available as skimage.segmentation.felzenszwalb. Section 11.4's fast unsupervised alternative to graph cuts.
Achanta, R. et al. "SLIC Superpixels Compared to State-of-the-Art Superpixel Methods." IEEE TPAMI (2012). doi:10.1109/TPAMI.2012.120
SLIC: localized k-means in a five-dimensional color-plus-position space. The most-used superpixel method and the entirety of Section 11.5's core.
Shi, J. and Malik, J. "Normalized Cuts and Image Segmentation." IEEE TPAMI (2000). doi:10.1109/34.868688
The normalized-cut criterion that fixed graph cuts' bias toward tiny segments and launched the spectral-clustering view of segmentation referenced in Section 11.4.

Recent Research (2023-2026)

Kirillov, A. et al. "Segment Anything." ICCV (2023). arXiv:2304.02643
SAM: the promptable foundation model whose click-to-mask interaction is the learned descendant of this chapter's interactive segmentation. The bridge from classical to Chapter 24.
Ravi, N. et al. "SAM 2: Segment Anything in Images and Videos." ICLR (2025). arXiv:2408.00714
The video extension of SAM with a streaming memory, showing the modern frontier of the grouping problem this chapter introduces.
Stutz, D., Hermans, A., and Leibe, B. "Superpixels: An Evaluation of the State-of-the-Art." CVIU (2018). arXiv:1612.01601
A thorough, still-cited benchmark of 28 superpixel algorithms on boundary recall and undersegmentation error, the empirical backdrop for Section 11.5's claims.

Books

Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book
Chapter 7 (Segmentation) covers clustering, watershed, graph cuts, and normalized cuts with full mathematical depth and a generous reference list; free online.
Gonzalez, R. and Woods, R. Digital Image Processing, 4th edition (2018). imageprocessingplace.com
Chapter 10 gives the classical, textbook account of thresholding, region growing, split-and-merge, and watershed that underpins Sections 11.2 and 11.3.

Tools & Libraries

scikit-image. skimage.segmentation module documentation. scikit-image.org
SLIC, Felzenszwalb, quickshift, watershed, random walker, and Chan-Vese in one consistent API, with worked gallery examples used throughout this chapter's code.
OpenCV. "Image Segmentation with Watershed Algorithm" and "GrabCut" tutorials. docs.opencv.org
The official OpenCV 4.x walkthroughs of cv2.watershed, cv2.grabCut, and cv2.kmeans that match Sections 11.1, 11.3, and 11.4 line for line.
scikit-learn. sklearn.cluster (KMeans, MeanShift) documentation. scikit-learn.org
The clustering reference for Section 11.1: k-means++ initialization, MiniBatchKMeans for large images, and the bandwidth estimation used by MeanShift.