"Give me an image and I will divide it into regions. Whether those regions mean anything is, regrettably, your problem and not mine."
An Unsupervised Grouping Engine With No Concept of Objects
Segmentation answers the question every recognition system must eventually face: which pixels belong together? The previous chapter found sparse, distinctive anchor points for geometry. This chapter goes the other way and asks for dense groupings: partition the entire image into coherent regions that a downstream stage can label, measure, or edit. The classical toolkit assembled here, clustering in feature space, growing regions from seeds, flooding a topographic surface, cutting a graph at its weakest links, and oversegmenting into superpixels, is not a museum exhibit. Watershed still separates touching cells in microscopy, GrabCut still powers one-click background removal, and SLIC superpixels still preprocess pixels for networks that would otherwise drown in them. These algorithms also teach the vocabulary, region, boundary, affinity, energy, that the deep segmentation models of Chapter 24 later inherit wholesale.
Chapter Overview
Chapter 10 solved the correspondence problem: find the same physical point in two photographs. That gave us sparse, precise anchors. But a great deal of vision needs the opposite kind of information, dense and regional rather than sparse and pointwise. A robot picking parts off a conveyor needs to know which pixels are the part and which are the belt. A radiologist's assistant needs the tumor outlined, not three keypoints near it. A photo editor needs the foreground separated from the background so it can be relit or replaced. Each of these is a segmentation problem: a partition of the image grid into regions, ideally regions that correspond to something a human would call an object, a surface, or a material.
The difficulty is that "meaningful" is not a property the pixels carry. A pixel is a triple of numbers; it does not know it belongs to a cat. Classical segmentation therefore proceeds by proxy: it groups pixels that are similar in some measurable way (color, brightness, texture, spatial proximity) and hopes that similarity tracks meaning closely enough to be useful. This is both the strength and the limitation of the classical approach. When the proxy is good, the methods are fast, deterministic, and need no training data; when the proxy fails, no amount of clever optimization recovers the missing semantics. The chapter is, in a sense, an extended study of how far you can get on similarity alone, and where you must hand off to the learned models of Part III.
The five sections move from the most assumption-free method to the most structured. Section 11.1 treats segmentation as clustering: throw every pixel into a feature space and let K-Means or Mean-Shift find the clusters, with no notion of spatial connectivity at all. Section 11.2 adds spatial structure back, growing regions from seed pixels and merging or splitting them with a homogeneity test. Section 11.3 introduces the watershed transform, the elegant idea of reading an image as a landscape and flooding it from below, which solves the touching-objects problem that clustering cannot. Section 11.4 reframes the whole problem as a graph: pixels are nodes, similarities are edge weights, and a segmentation is a cut, an idea that yields both the globally optimal binary segmentation of graph cuts and the interactive magic of GrabCut. Section 11.5 closes with superpixels, which deliberately oversegment into hundreds of small, regular, edge-respecting regions that serve as a compact substrate for everything downstream.
A thread runs through all five sections: the tension between local evidence and global consistency. A pixel's own color is local evidence; whether it joins its neighbor is a question about consistency across the image. Clustering ignores spatial consistency entirely and pays for it with speckle. Region growing enforces local consistency greedily and pays for it with seed sensitivity. Graph cuts optimize a global energy and pay for it in compute. Reading the chapter as a sequence of answers to "how much global structure can I afford?" makes the design choices legible. The same trade reappears, with the unary and pairwise terms now learned, when Chapter 24 bolts a Conditional Random Field onto a fully convolutional network, and when promptable models like SAM let a user click instead of seed.
The five methods are easiest to remember as five points along that single axis, from no spatial structure at all to a global optimum. Table 11.1 is the chapter on one card: read it top to bottom as the price of global structure rising, and keep it as the mental model the rest of the chapter fills in.
| Method (Section) | Spatial structure used | Decision scope | Signature failure |
|---|---|---|---|
| Clustering, K-Means / Mean-Shift (11.1) | None: feature space only | Each pixel alone | Speckle; distant same-color regions merge |
| Region growing / split-merge (11.2) | Local adjacency | Greedy, local | Leakage across faint edges; seed-sensitive |
| Watershed (11.3) | Surface topography | Local commitment at ridges | Oversegmentation from noise minima |
| Graph cuts / GrabCut (11.4) | Global energy | Global optimum (two labels) | Optimal for a color-only model, not for objects |
| Superpixels, SLIC (11.5) | Local grid leash | Shrink the problem, defer the decision | Compactness too high leaks across boundaries |
If one line survives a week after this chapter, make it this: every classical segmenter groups pixels by similarity, then governs that grouping with geometry, and the methods differ only in how much geometry they can afford. Clustering buys none and gets speckle; region growing buys local adjacency; watershed buys surface shape; graph cuts buy a global optimum; superpixels buy a cheap grid and defer the rest. The same two-part recipe, similarity for the data term and geometry for the smoothness term, is exactly what returns, now learned, in the CRF-on-a-network of Chapter 24. The feature space changed; the recipe did not.
Prerequisites
This chapter builds directly on the color spaces and intensity statistics of Chapter 2: Point Operations, Histograms & Thresholding, whose histograms and thresholds are the simplest segmentations of all. The watershed and region-growing sections lean on the morphological operators (erosion, dilation, the distance transform, connected components) from Chapter 6: Morphology, Binary Images & Shape, and on the gradients and edge maps of Chapter 9: Edges, Lines & Curves. A reader comfortable with NumPy arrays, basic linear algebra, and the idea of a feature vector will follow the clustering material without trouble; the graph-cut section assumes nothing about max-flow beyond what Section 11.4 develops from scratch.
Chapter Roadmap
- 11.1 Segmentation as Clustering: K-Means & Mean-Shift Treating each pixel as a point in a color or color-plus-position feature space: K-Means with its chosen k, Mean-Shift with its data-driven mode count, and the speckle that comes from ignoring spatial connectivity.
- 11.2 Region Growing & Split-and-Merge Putting spatial connectivity back in: flood-filling from seed pixels under a homogeneity predicate, the quadtree split-and-merge alternative, and how seed placement decides everything.
- 11.3 The Watershed Transform Reading an image as a topographic surface and flooding it from below: catchment basins, the oversegmentation problem, and marker-controlled watershed that separates touching objects classical clustering cannot.
- 11.4 Graph-Based Segmentation: Graph Cuts & GrabCut Pixels as graph nodes and similarities as edge weights: the min-cut/max-flow energy, the Felzenszwalb-Huttenlocher efficient graph segmentation, and GrabCut's iterative foreground extraction from a single box.
- 11.5 Superpixels: SLIC & Friends Oversegmenting into hundreds of small, regular, edge-respecting regions: SLIC's local K-Means, compactness versus boundary adherence, and why superpixels are the preferred substrate for graphs, CRFs, and even neural pipelines.
What's Next?
Segmentation carved the image into regions; the next chapters ask where those regions sit in three dimensions. Chapter 12: Camera Models & Calibration opens the geometric arc of Part II by formalizing how a 3D world projects onto the 2D grid we have been segmenting, and how calibration recovers the precise mapping. The grouping instinct developed here does not disappear: it returns, learned end to end, in Chapter 24: Segmentation, where fully convolutional networks predict per-pixel labels and promptable models like SAM turn a click into a mask. And the masks themselves become controllers in Chapter 35: Controllable Generation & Image Editing, where a region selected here decides what a diffusion model is allowed to repaint.
Before moving on, put the whole chapter into practice. The Hands-On Lab below builds a single segmentation workbench that runs all five methods of Table 11.1 on one image, so you can see for yourself how the price of global structure trades against the kind of mistake each method makes.
Hands-On Lab: Build a Five-Method Segmentation Workbench
Objective
Build a single script that runs all five classical segmenters of this chapter, K-Means clustering, marker-controlled watershed, GrabCut, Felzenszwalb graph segmentation, and SLIC superpixels, on one image, then writes a labeled comparison grid to disk so you can read Table 11.1's "price of global structure" spectrum straight off your own result.
What You'll Practice
- Clustering pixels in a color feature space with K-Means and mapping labels back to an image (Section 11.1).
- Driving a marker-controlled watershed from a distance transform to split touching blobs (Section 11.3).
- Extracting a foreground with GrabCut from a single bounding box (Section 11.4).
- Running the Felzenszwalb efficient graph segmentation and reasoning about its scale parameter (Section 11.4).
- Oversegmenting into SLIC superpixels and overlaying their boundaries (Section 11.5).
Setup
Two libraries cover every method in this chapter. Install with:
pip install opencv-python scikit-image numpy
Save any color photo as input.jpg in the working folder. If no file is found the script falls back to a synthetic image of overlapping colored disks, which exercises every method (the disks touch, so watershed has work to do, and they sit on a flat background, so GrabCut has a clean foreground to find).
Steps
Step 1: Load a color image with a synthetic fallback
Read the image in color (BGR, OpenCV's default) and provide a fallback so the lab always runs. The fallback paints three overlapping disks of different colors on a gray background.
import cv2
import numpy as np
from skimage.segmentation import felzenszwalb, slic, mark_boundaries
def load_image(path="input.jpg"):
img = cv2.imread(path, cv2.IMREAD_COLOR)
if img is not None:
return img
canvas = np.full((300, 400, 3), 200, np.uint8) # gray background
# TODO: draw three filled, overlapping circles in distinct BGR colors
# using cv2.circle(canvas, center, radius, color, thickness=-1).
# Place their centers close enough that at least two disks touch.
raise NotImplementedError
img = load_image()
print("loaded", img.shape, img.dtype)
Hint
Three calls such as cv2.circle(canvas, (150, 150), 60, (200, 80, 60), -1), cv2.circle(canvas, (230, 160), 60, (60, 160, 200), -1), and cv2.circle(canvas, (190, 220), 55, (80, 200, 120), -1) give you touching, differently colored blobs.
Step 2: Segment by clustering with K-Means
This is Section 11.1 in code. Reshape the image to a list of color vectors, cluster them, then recolor each pixel with its cluster's mean color. No spatial term means speckle is expected.
def kmeans_segment(img, k=4):
Z = img.reshape(-1, 3).astype(np.float32) # one row per pixel
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
# TODO: call cv2.kmeans(Z, k, None, criteria, 10, cv2.KMEANS_PP_CENTERS)
# It returns (compactness, labels, centers). Rebuild the image by indexing
# centers (cast to uint8) with labels.flatten(), then reshape to img.shape.
...
Hint
_, labels, centers = cv2.kmeans(Z, k, None, criteria, 10, cv2.KMEANS_PP_CENTERS), then centers.astype(np.uint8)[labels.flatten()].reshape(img.shape) paints each pixel its cluster mean.
Step 3: Separate touching blobs with marker-controlled watershed
Clustering cannot split two same-colored objects that touch; watershed can (Section 11.3). Threshold to a foreground mask, take its distance transform, place a marker at each peak, and flood.
def watershed_segment(img):
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, fg = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
dist = cv2.distanceTransform(fg, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = sure_fg.astype(np.uint8)
unknown = cv2.subtract(fg, sure_fg)
n, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1 # background must not be 0
markers[unknown == 255] = 0 # unknown region flagged for flooding
# TODO: call cv2.watershed(img, markers); it writes -1 on basin boundaries.
# Return a copy of img with boundary pixels (markers == -1) painted red.
...
Hint
cv2.watershed(img, markers) modifies markers in place. Then out = img.copy(); out[markers == -1] = (0, 0, 255) draws the ridge lines that separate the basins.
Step 4: Pull a foreground out with GrabCut
Section 11.4's interactive star. Give GrabCut a rectangle that loosely contains the foreground and let its iterated graph cut refine the boundary, then keep only the pixels it labels foreground.
def grabcut_segment(img, rect=None):
h, w = img.shape[:2]
if rect is None:
rect = (int(0.1 * w), int(0.1 * h), int(0.8 * w), int(0.8 * h))
mask = np.zeros((h, w), np.uint8)
bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
# TODO: build a binary keep-mask that is 1 where mask is GC_FGD or GC_PR_FGD,
# else 0. Multiply it (as a 3-channel mask) into img so background goes black.
...
Hint
keep = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8), then img * keep[:, :, None].
Step 5: Run the Felzenszwalb graph segmentation and SLIC superpixels
The two scikit-image methods of Sections 11.4 and 11.5. Felzenszwalb merges regions under an adaptive boundary predicate controlled by scale; SLIC oversegments into roughly equal superpixels. Both expect an RGB float image, so convert from BGR first.
def felzenszwalb_segment(img, scale=200):
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
labels = felzenszwalb(rgb, scale=scale, sigma=0.5, min_size=50)
return (mark_boundaries(rgb, labels) * 255).astype(np.uint8)[:, :, ::-1]
def slic_segment(img, n_segments=120):
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# TODO: call slic(rgb, n_segments=n_segments, compactness=10, start_label=1),
# overlay its boundaries with mark_boundaries, scale to 0..255 uint8,
# and convert back to BGR (reverse the last axis) so OpenCV can save it.
...
Hint
Mirror felzenszwalb_segment: labels = slic(rgb, n_segments=n_segments, compactness=10, start_label=1), then the same (mark_boundaries(rgb, labels) * 255).astype(np.uint8)[:, :, ::-1] return.
Step 6: Assemble the comparison grid and save it
Run every method, label each panel with its name, and tile the six images (original plus five results) into one grid. This grid is the artifact you keep, a single picture of Table 11.1.
def labeled(panel, text):
out = panel.copy()
cv2.putText(out, text, (8, 24), cv2.FONT_HERSHEY_SIMPLEX,
0.7, (255, 255, 255), 2, cv2.LINE_AA)
return out
panels = [
labeled(img, "original"),
labeled(kmeans_segment(img), "kmeans"),
labeled(watershed_segment(img), "watershed"),
labeled(grabcut_segment(img), "grabcut"),
labeled(felzenszwalb_segment(img), "felzenszwalb"),
labeled(slic_segment(img), "slic"),
]
# TODO: stack the six panels into a 2-row by 3-column grid with np.vstack and
# np.hstack, then cv2.imwrite("segmentation_workbench.png", grid).
# Hint: rows = [np.hstack(panels[0:3]), np.hstack(panels[3:6])]; np.vstack(rows).
...
Hint
All panels share the input's height and width, so grid = np.vstack([np.hstack(panels[0:3]), np.hstack(panels[3:6])]) tiles them cleanly, then cv2.imwrite("segmentation_workbench.png", grid).
Expected Output
One file, segmentation_workbench.png, a 2-by-3 grid. The kmeans panel shows flat color regions with speckle along boundaries and no separation between touching disks, exactly the failure Table 11.1 predicts for a method with no spatial term. The watershed panel draws red ridge lines that split the touching disks into separate basins. The grabcut panel keeps the central foreground and blacks out the background. The felzenszwalb and slic panels overlay region boundaries, with SLIC's boundaries forming a regular, roughly grid-like tessellation and Felzenszwalb's following the actual color edges. Reading left to right, top to bottom, you are looking at the chapter's spectrum from no spatial structure to a global energy to a deferred-decision oversegmentation.
Stretch Goals
- Add a color-plus-position feature to the K-Means step by stacking scaled
(x, y)coordinates onto the three color channels, then watch the speckle of Section 11.1 shrink as spatial proximity starts to count. - Sweep the Felzenszwalb
scaleparameter over[50, 200, 800]and save a strip of the three results; this makes the scale-versus-region-count trade of Section 11.4 visible at a glance. - Reach forward to Chapter 24: feed your SLIC superpixels as nodes into a tiny region-adjacency graph and average each superpixel's color, previewing how superpixels become the compact substrate that learned models and CRFs reason over.
Complete Solution
import cv2
import numpy as np
from skimage.segmentation import felzenszwalb, slic, mark_boundaries
def load_image(path="input.jpg"):
img = cv2.imread(path, cv2.IMREAD_COLOR)
if img is not None:
return img
canvas = np.full((300, 400, 3), 200, np.uint8)
cv2.circle(canvas, (150, 150), 60, (200, 80, 60), -1)
cv2.circle(canvas, (230, 160), 60, (60, 160, 200), -1)
cv2.circle(canvas, (190, 220), 55, (80, 200, 120), -1)
return canvas
def kmeans_segment(img, k=4):
Z = img.reshape(-1, 3).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(Z, k, None, criteria, 10, cv2.KMEANS_PP_CENTERS)
return centers.astype(np.uint8)[labels.flatten()].reshape(img.shape)
def watershed_segment(img):
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, fg = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
dist = cv2.distanceTransform(fg, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = sure_fg.astype(np.uint8)
unknown = cv2.subtract(fg, sure_fg)
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0
cv2.watershed(img, markers)
out = img.copy()
out[markers == -1] = (0, 0, 255)
return out
def grabcut_segment(img, rect=None):
h, w = img.shape[:2]
if rect is None:
rect = (int(0.1 * w), int(0.1 * h), int(0.8 * w), int(0.8 * h))
mask = np.zeros((h, w), np.uint8)
bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
keep = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
return img * keep[:, :, None]
def felzenszwalb_segment(img, scale=200):
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
labels = felzenszwalb(rgb, scale=scale, sigma=0.5, min_size=50)
return (mark_boundaries(rgb, labels) * 255).astype(np.uint8)[:, :, ::-1]
def slic_segment(img, n_segments=120):
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
labels = slic(rgb, n_segments=n_segments, compactness=10, start_label=1)
return (mark_boundaries(rgb, labels) * 255).astype(np.uint8)[:, :, ::-1]
def labeled(panel, text):
out = panel.copy()
cv2.putText(out, text, (8, 24), cv2.FONT_HERSHEY_SIMPLEX,
0.7, (255, 255, 255), 2, cv2.LINE_AA)
return out
if __name__ == "__main__":
img = load_image()
panels = [
labeled(img, "original"),
labeled(kmeans_segment(img), "kmeans"),
labeled(watershed_segment(img), "watershed"),
labeled(grabcut_segment(img), "grabcut"),
labeled(felzenszwalb_segment(img), "felzenszwalb"),
labeled(slic_segment(img), "slic"),
]
grid = np.vstack([np.hstack(panels[0:3]), np.hstack(panels[3:6])])
cv2.imwrite("segmentation_workbench.png", grid)
print("wrote segmentation_workbench.png", grid.shape)
Bibliography & Further Reading
Foundational Papers
Comaniciu, D. and Meer, P. "Mean Shift: A Robust Approach Toward Feature Space Analysis." IEEE TPAMI (2002). doi:10.1109/34.1000236
The reference treatment of Mean-Shift for image segmentation and tracking: the kernel density view, the mode-seeking iteration, and the joint spatial-range filtering that Section 11.1 builds on.
Vincent, L. and Soille, P. "Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations." IEEE TPAMI (1991). doi:10.1109/34.87344
The immersion-flooding algorithm that made watershed practical, and the algorithm OpenCV's cv2.watershed descends from. The heart of Section 11.3.
Boykov, Y. and Jolly, M.-P. "Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images." ICCV (2001). doi:10.1109/ICCV.2001.937505
The paper that recast binary segmentation as min-cut/max-flow with user-supplied hard constraints. Section 11.4's energy function comes straight from here.
Rother, C., Kolmogorov, V., and Blake, A. "GrabCut: Interactive Foreground Extraction Using Iterated Graph Cuts." ACM SIGGRAPH (2004). doi:10.1145/1015706.1015720
GrabCut: Gaussian mixture color models plus iterated graph cuts, initialized from a single bounding box. The algorithm behind one-click background removal and Section 11.4's centerpiece.
Felzenszwalb, P. and Huttenlocher, D. "Efficient Graph-Based Image Segmentation." IJCV (2004). doi:10.1023/B:VISI.0000022288.19776.77
A near-linear-time region merging algorithm with an adaptive boundary predicate, available as skimage.segmentation.felzenszwalb. Section 11.4's fast unsupervised alternative to graph cuts.
Achanta, R. et al. "SLIC Superpixels Compared to State-of-the-Art Superpixel Methods." IEEE TPAMI (2012). doi:10.1109/TPAMI.2012.120
SLIC: localized K-Means in a five-dimensional color-plus-position space. The most-used superpixel method and the entirety of Section 11.5's core.
Shi, J. and Malik, J. "Normalized Cuts and Image Segmentation." IEEE TPAMI (2000). doi:10.1109/34.868688
The normalized-cut criterion that fixed graph cuts' bias toward tiny segments and launched the spectral-clustering view of segmentation referenced in Section 11.4.
Recent Research (2023-2026)
Kirillov, A. et al. "Segment Anything." ICCV (2023). arXiv:2304.02643
SAM: the promptable foundation model whose click-to-mask interaction is the learned descendant of this chapter's interactive segmentation. The bridge from classical to Chapter 24.
Ravi, N. et al. "SAM 2: Segment Anything in Images and Videos." ICLR (2025). arXiv:2408.00714
The video extension of SAM with a streaming memory, showing the modern frontier of the grouping problem this chapter introduces.
Stutz, D., Hermans, A., and Leibe, B. "Superpixels: An Evaluation of the State-of-the-Art." CVIU (2018). arXiv:1612.01601
A thorough, still-cited benchmark of 28 superpixel algorithms on boundary recall and undersegmentation error, the empirical backdrop for Section 11.5's claims.
Books
Szeliski, R. Computer Vision: Algorithms and Applications, 2nd edition (2022). szeliski.org/Book
Chapter 7 (Segmentation) covers clustering, watershed, graph cuts, and normalized cuts with full mathematical depth and a generous reference list; free online.
Gonzalez, R. and Woods, R. Digital Image Processing, 4th edition (2018). imageprocessingplace.com
Chapter 10 gives the classical, textbook account of thresholding, region growing, split-and-merge, and watershed that underpins Sections 11.2 and 11.3.
Tools & Libraries
scikit-image. skimage.segmentation module documentation. scikit-image.org
SLIC, Felzenszwalb, quickshift, watershed, random walker, and Chan-Vese in one consistent API, with worked gallery examples used throughout this chapter's code.
OpenCV. "Image Segmentation with Watershed Algorithm" and "GrabCut" tutorials. docs.opencv.org
The official OpenCV 4.x walkthroughs of cv2.watershed, cv2.grabCut, and cv2.kmeans that match Sections 11.1, 11.3, and 11.4 line for line.
scikit-learn. sklearn.cluster (KMeans, MeanShift) documentation. scikit-learn.org
The clustering reference for Section 11.1: k-means++ initialization, MiniBatchKMeans for large images, and the bandwidth estimation used by MeanShift.