"There are exactly two kinds of pixels in this world. I decided that. You're welcome."
An Uncompromisingly Binary Threshold Operator
Thresholding is the moment image processing first makes a decision: every pixel is declared foreground or background, and a grayscale measurement becomes a binary claim about the world. The entire question is where to put the cut. Otsu's 1979 answer (pick the threshold that best separates the histogram into two tight clusters) is one of the most-used algorithms in vision history, and the deeper pattern (continuous scores in, threshold, binary decision out) never goes away: it is exactly how segmentation networks produce masks from logits today.
The previous sections remapped intensities while keeping them continuous: Section 2.1 bent the tone curve and Section 2.3 let the histogram choose it. Now we collapse the scale entirely. Thresholding replaces every pixel with a yes or a no, which sounds destructive (and is) but is precisely the point: counting cells under a microscope, reading a barcode, locating solder joints, and OCRing a receipt all require deciding which pixels belong to the thing. This section covers choosing that decision boundary: by hand, optimally from the histogram, and locally when one global answer cannot exist.
1. Binarization: The Simplest Classifier (Basic)
Global thresholding is a one-parameter point operation:
$$g(x, y) = \begin{cases} 255 & \text{if } f(x, y) > T \\ 0 & \text{otherwise} \end{cases}$$
It is worth pausing on what this really is: a classifier with a single feature (intensity) and a single parameter ($T$), whether that parameter is set by hand or chosen from the data. When does such a crude classifier work? Exactly when the histogram from Section 2.2 is bimodal: two well-separated humps, one for background and one for object, with a quiet valley between them. Then any $T$ in the valley yields a clean mask. Figure 2.4.1 shows this ideal case, along with the quantity Otsu's method will optimize. When the humps overlap (and the sensor noise we traced in Chapter 1 broadens both of them), no global $T$ can be clean, and we will need better lighting, the local methods of this section's second half, or the learned segmentation of Part III.
In OpenCV, global thresholding is cv2.threshold, which returns both the threshold used and the binarized image, and supports inverse and truncation variants (cv2.THRESH_BINARY_INV, cv2.THRESH_TRUNC) through its type argument. The interesting question is not the call but the number: who chooses $T$?
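To make the point operation concrete, here is a minimal NumPy sketch of the formula above. The helper name global_threshold and its inverse switch are illustrative choices for this sketch, not an OpenCV API; the inverse branch mimics what an inverse-binary mode is expected to do.

```python
import numpy as np

def global_threshold(gray, T, inverse=False):
    """Binarize: 255 where gray > T (or where gray <= T if inverse), else 0."""
    fg = gray > T
    if inverse:
        fg = ~fg
    return np.where(fg, 255, 0).astype(np.uint8)

# Tiny hand-made image so the result can be checked by eye.
gray = np.array([[10, 120, 121],
                 [200, 120, 30]], dtype=np.uint8)

mask = global_threshold(gray, T=120)                # strictly greater than T
inv = global_threshold(gray, T=120, inverse=True)   # complement of mask
print(mask)
print(inv)
```

Note the strict inequality: a pixel exactly equal to $T$ falls on the background side, matching the definition of $g(x, y)$.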
2. Otsu's Method: The Histogram Chooses the Cut (Advanced)
Nobuyuki Otsu's 1979 insight reframed threshold selection as a clustering problem on the histogram. A candidate threshold $t$ splits the normalized histogram $p(k)$ into class 0 ($k \le t$) and class 1 ($k > t$), with class weights and means
$$\omega_0(t) = \sum_{k=0}^{t} p(k), \quad \omega_1(t) = 1 - \omega_0(t), \quad \mu_0(t) = \frac{\sum_{k=0}^{t} k\,p(k)}{\omega_0(t)}, \quad \mu_1(t) = \frac{\sum_{k=t+1}^{255} k\,p(k)}{\omega_1(t)}$$
A good threshold should produce two tight, well-separated classes: small variance within each class, large distance between their means. Otsu showed these are the same objective, because total variance decomposes as $\sigma^2 = \sigma_W^2(t) + \sigma_B^2(t)$: within-class plus between-class. Since $\sigma^2$ is fixed, minimizing the within-class spread equals maximizing the between-class variance
$$\sigma_B^2(t) = \omega_0(t)\, \omega_1(t)\, \big(\mu_0(t) - \mu_1(t)\big)^2$$
and the optimal threshold is $t^* = \arg\max_t \sigma_B^2(t)$. With only 256 candidates, brute force is instant, and cumulative sums make the whole search a few vectorized lines:
import numpy as np
import cv2

def otsu_threshold(gray):
    """Otsu's method from scratch: maximize between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    k = np.arange(256)
    w0 = np.cumsum(p)        # class-0 weight for every t
    m = np.cumsum(k * p)     # cumulative first moment
    mG = m[-1]               # global mean
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)   # both classes must be non-empty
    mu0 = np.where(valid, m / np.maximum(w0, 1e-12), 0)
    mu1 = np.where(valid, (mG - m) / np.maximum(w1, 1e-12), 0)
    sigma_b2 = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, 0)
    return int(np.argmax(sigma_b2))

# Synthetic sanity check: dark background N(70, 12), bright blobs N(180, 12)
rng = np.random.default_rng(0)
img = rng.normal(70, 12, (400, 400))
img[100:300, 100:300] = rng.normal(180, 12, (200, 200))
img = np.clip(img, 0, 255).astype(np.uint8)

t_scratch = otsu_threshold(img)
t_cv, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(t_scratch, t_cv)  # 124 124.0 (the valley between 70 and 180)
The argmax of the between-class variance picks the cut. On the synthetic two-mode image, both the from-scratch version and OpenCV land on the same valley threshold of 124.
The entire from-scratch function above is one flag added to cv2.threshold:
t, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
A 15-to-1 reduction. The library computes the histogram, runs the variance maximization, and applies the binarization in a single optimized pass, and the same flag composes with other modes (THRESH_BINARY_INV + THRESH_OTSU for dark objects on light backgrounds, common in document images). scikit-image offers skimage.filters.threshold_otsu when you want the number without the binarized image.
Minimizing within-class variance over a one-dimensional split is exactly the k-means objective with $k = 2$, solved exhaustively rather than iteratively. This framing tells you precisely when Otsu fails: when the two clusters have wildly unequal populations (a tiny defect on a vast background barely dents $\sigma_B^2$), when there are not actually two modes, or when the modes overlap heavily. The histogram diagnostics of Section 2.2 are therefore the pre-flight check for Otsu: look at the distribution before trusting an automatic cut. The same "inspect your score distribution before thresholding it" discipline applies verbatim to the network logits of Part III.
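The decomposition that Otsu's argument rests on can be checked numerically. The sketch below is an illustration under stated assumptions: the helper variance_split and the overlapping two-mode test data (means 100 and 150) are made up for this check and are not part of any library. It computes the within-class and between-class variances directly for every admissible cut and compares their sum to the total variance.

```python
import numpy as np

def variance_split(p, t):
    """Within- and between-class variance for the cut k <= t vs. k > t."""
    k = np.arange(p.size)
    k0, p0 = k[:t + 1], p[:t + 1]
    k1, p1 = k[t + 1:], p[t + 1:]
    w0, w1 = p0.sum(), p1.sum()
    mu0 = (k0 * p0).sum() / w0
    mu1 = (k1 * p1).sum() / w1
    var0 = ((k0 - mu0) ** 2 * p0).sum() / w0
    var1 = ((k1 - mu1) ** 2 * p1).sum() / w1
    return w0 * var0 + w1 * var1, w0 * w1 * (mu0 - mu1) ** 2

# Assumed test data: two overlapping modes with unequal populations.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(100, 20, 30000), rng.normal(150, 20, 10000)])
gray = np.clip(samples, 0, 255).astype(np.uint8)

hist = np.bincount(gray, minlength=256)
p = hist / hist.sum()
k = np.arange(256)
total = ((k - (k * p).sum()) ** 2 * p).sum()   # total variance, independent of t

# Only cuts that leave both classes non-empty are admissible.
cum = np.cumsum(hist)
ts = [t for t in range(256) if 0 < cum[t] < hist.sum()]
within, between = np.array([variance_split(p, t) for t in ts]).T

print("max |within + between - total|:", np.abs(within + between - total).max())
print("argmin within :", ts[int(np.argmin(within))])
print("argmax between:", ts[int(np.argmax(between))])
```

Because the sum is constant in $t$, the cut that minimizes the within-class term is the cut that maximizes the between-class term, which is why the from-scratch implementation only ever needs weights and means.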
Otsu's paper is four pages long, contains one figure, and was published in 1979 in a systems-and-cybernetics journal rather than a vision venue. It has accumulated on the order of a hundred thousand citations and ships today inside OpenCV, scikit-image, MATLAB, ImageJ, Halcon, and effectively every microscope and document scanner on Earth. Per page, it may be the most deployed algorithm description in the history of imaging.
3. When One Threshold Cannot Work: Adaptive Methods (Intermediate)
Otsu optimizes a global cut, and some images simply do not have one. The canonical case is a document photographed under side lighting: the paper at the dark corner is darker than the ink at the bright corner, so any single $T$ misclassifies one end of the page. Figure 2.4.2 shows the failure and the cure. The cure is to let the threshold vary across the image: classify each pixel against a statistic of its own neighborhood rather than against one global number.
OpenCV's cv2.adaptiveThreshold computes, for each pixel, the mean (or Gaussian-weighted mean) of a blockSize × blockSize neighborhood and thresholds the pixel at that local mean minus a constant C. For harder material (stained documents, low-contrast film scans), Sauvola's method replaces the fixed offset with one scaled by the local standard deviation, so the threshold sits close to the local mean in high-contrast regions and drops well below it in flat ones, where faint texture would otherwise be mistaken for ink:
import cv2
from skimage.filters import threshold_sauvola

gray = cv2.imread("receipt_photo.jpg", cv2.IMREAD_GRAYSCALE)

# Local Gaussian-weighted mean, minus offset C. blockSize must be odd
# and larger than the stroke width of the text.
adaptive = cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY,
                                 blockSize=31, C=10)

# Sauvola: threshold = mean * (1 + k * (std/R - 1)), per pixel
t_sauvola = threshold_sauvola(gray, window_size=31, k=0.2)
sauvola = (gray > t_sauvola).astype("uint8") * 255
Two parameter rules of thumb save hours of fiddling. First, blockSize (or window_size) must be larger than the features you are extracting: if the window fits inside a pen stroke, the stroke's interior sees only itself as context and dissolves. Second, C (or Sauvola's k) sets how decisively a pixel must differ from its surroundings to count as foreground; raising it suppresses noise speckle at the cost of thin strokes. Speckle that survives is not a thresholding failure but a job for the smoothing filters of Chapter 3, and for the morphological cleanup of Chapter 6, which consumes exactly the binary maps this section produces.
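To see why a local statistic rescues the side-lit page, the following NumPy-only sketch implements mean-based adaptive thresholding with a summed-area table and compares it against the best possible global cut. Everything here is an assumption made for illustration: the helper names, the edge-replicating border handling, and the synthetic page (a linear illumination ramp with ink at half the paper brightness). It is not OpenCV's implementation.

```python
import numpy as np

def local_mean(gray, block):
    """Mean of each block x block neighborhood (block odd), edges replicated."""
    r = block // 2
    padded = np.pad(gray.astype(np.float64), r, mode="edge")
    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    total = (sat[block:block + h, block:block + w]
             - sat[:h, block:block + w]
             - sat[block:block + h, :w]
             + sat[:h, :w])
    return total / (block * block)

def adaptive_mean_threshold(gray, block=31, C=10):
    """255 where the pixel exceeds its local mean minus C, else 0."""
    return np.where(gray > local_mean(gray, block) - C, 255, 0).astype(np.uint8)

# Synthetic side-lit page: illumination ramps left to right,
# 3-pixel-wide vertical ink strokes every 20 pixels at half the paper brightness.
h, w = 100, 200
paper = np.tile(np.linspace(60, 220, w), (h, 1))
cols = np.arange(w) % 20
ink = np.zeros((h, w), dtype=bool)
ink[:, (cols >= 8) & (cols <= 10)] = True
page = np.round(np.where(ink, 0.5 * paper, paper)).astype(np.uint8)

# Fraction of misclassified pixels: local method vs. the best of all 256 global cuts.
local_err = np.mean((adaptive_mean_threshold(page) == 0) != ink)
global_err = min(np.mean((page <= t) != ink) for t in range(256))
print(f"best global error: {global_err:.3f}, local-mean error: {local_err:.3f}")
```

The window of 31 pixels comfortably exceeds the 3-pixel stroke width, in line with the first rule of thumb; shrinking block below the stroke width is a quick way to watch stroke interiors dissolve.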
Who: A vision engineer at a food-packaging plant, responsible for verifying inkjet-printed expiry dates on foil lids.
Situation: The verification system binarized each lid image with Otsu's threshold, then matched characters. It ran at 99.7 percent read rate for months.
Problem: Read rate collapsed to around 80 percent, but only on sunny afternoons. A skylight above the line cast a soft gradient across the foil from roughly 2 to 5 PM, and the glare made one global threshold impossible: either the bright half blew out or the dim half went black. The histogram, previously bimodal, smeared into one broad ridge, exactly the failure condition for Otsu.
Decision: The engineer swapped global Otsu for cv2.adaptiveThreshold (Gaussian, blockSize 41, C 8), sized so the window comfortably exceeded the character stroke width, and added a histogram-bimodality check to the telemetry so the system could flag lighting regressions itself.
Result: Afternoon read rate returned above 99.5 percent with no hardware change; the skylight got a diffuser film a month later anyway.
Lesson: A threshold is an assumption about the histogram. When a thresholding system degrades on a schedule (time of day, season), suspect illumination first, and reach for local methods before re-engineering the optics.
4. Beyond a Single Cut (Intermediate)
Three refinements round out the thresholding toolbox. Multi-level thresholding generalizes Otsu to two or more cuts (three classes or more); scikit-image ships it as threshold_multiotsu, useful for images with distinct material phases, like soil-pore-water micrographs. The triangle method (cv2.THRESH_TRIANGLE) handles strongly unimodal histograms with a small foreground tail, the regime where Otsu's equal-cluster assumption breaks: it draws a chord from the histogram peak to the far end of the tail and places the cut at the bin whose histogram value lies farthest, in perpendicular distance, from that chord. It is a favorite in fluorescence microscopy. And hysteresis thresholding uses two thresholds: pixels above the high cut are definitely foreground, and pixels between the cuts are accepted only if connected to definite ones. That last idea, tolerance for weak evidence when it is attached to strong evidence, is the heart of the Canny edge detector you will meet in Chapter 9.
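Of the three, hysteresis is the one whose logic is easiest to misread, so a small sketch may help. This is a deliberately simple NumPy illustration, assuming 8-connectivity and a grow-until-stable loop; the function name hysteresis is hypothetical, and production code would more likely rely on a connected-components routine.

```python
import numpy as np

def hysteresis(score, low, high):
    """Keep pixels above `high`, plus pixels above `low` connected to them."""
    strong = score > high
    candidate = score > low          # includes every strong pixel when low <= high
    mask = strong.copy()
    h, w = mask.shape
    while True:
        # Grow the mask by one pixel in all 8 directions using shifted views.
        p = np.pad(mask, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(mask)
        for dy in range(3):
            for dx in range(3):
                grown |= p[dy:dy + h, dx:dx + w]
        grown &= candidate           # growth may only enter candidate pixels
        if np.array_equal(grown, mask):
            return mask
        mask = grown

score = np.array([[0, 0, 0, 0, 0, 0],
                  [0, 5, 5, 9, 0, 0],
                  [0, 0, 0, 0, 0, 0],
                  [0, 5, 5, 0, 0, 0]])
result = hysteresis(score, low=4, high=8)
print(result.astype(int))
```

In the toy array, the weak pixels in the second row survive because they touch the strong pixel with score 9, while the equally weak pixels in the last row are discarded because nothing strong vouches for them.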
Step back and notice the shape of everything this section did: produce a per-pixel score, then apply a cut to make a per-pixel decision. That shape survives the deep learning revolution completely intact. A semantic segmentation network in Chapter 24 ends in exactly this operation: per-pixel logits, a sigmoid, and a threshold (usually 0.5, and choosing it better is a tuning lever practitioners use). The score map got smarter; the final operation is still this section.
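A short sketch makes the "logits, sigmoid, threshold" ending explicit. It assumes random numbers as a stand-in for a network's per-pixel logits, and the cut value tau = 0.7 is an arbitrary example. Because the sigmoid is monotonic, cutting probabilities at tau should be equivalent to cutting raw logits at log(tau / (1 - tau)), so the 0.5 default corresponds to a logit cut at zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(prob):
    """Inverse of the sigmoid: maps a probability cut to a logit cut."""
    return np.log(prob / (1.0 - prob))

# Stand-in for a segmentation network's per-pixel output (assumed data).
rng = np.random.default_rng(0)
logits = rng.normal(0.0, 3.0, size=(64, 64))

# Default decision: probability 0.5 is the same cut as logit 0.
mask_prob = sigmoid(logits) > 0.5
mask_logit = logits > 0.0

# A stricter, tuned cut expressed in both spaces.
tau = 0.7
strict_prob = sigmoid(logits) > tau
strict_logit = logits > logit(tau)

print("default cuts agree:", np.array_equal(mask_prob, mask_logit))
print("tuned cuts agree:  ", np.array_equal(strict_prob, strict_logit))
print("foreground pixels at 0.5 vs 0.7:", mask_prob.sum(), strict_prob.sum())
```

Raising the cut trades recall for precision exactly as raising $T$ did for grayscale images; only the meaning of the score has changed.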
The threshold did not retire; it moved inside the model. SAM and SAM 2 (Meta, 2023 and 2024) generate segmentation masks by thresholding predicted mask logits at zero, and expose a stability_score that measures how much a mask changes when that threshold shifts, literally a threshold-sensitivity analysis from this section run as a quality filter. In cell microscopy, Cellpose's 2024-2025 generation (Cellpose3 and the "Cellpose-SAM" line) still finalizes instance masks by thresholding predicted flow and probability fields, with the cut exposed as a user parameter. And in document AI, the DIBCO binarization benchmarks continue to be contested by hybrid methods where a U-Net predicts a per-pixel threshold surface, a direct learned descendant of Sauvola's formula. The research question is no longer "where is the cut?" but "how do we make the scores so good that the cut is easy?"
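The threshold-sensitivity idea behind a stability score can be sketched in a few lines. This is an illustration in the spirit of that measure, not SAM's code: the function name, the offset default, and the two synthetic logit maps are assumptions. It computes the intersection-over-union between the mask obtained with a slightly stricter cut and the mask obtained with a slightly looser one.

```python
import numpy as np

def stability_score(logits, threshold=0.0, offset=1.0):
    """IoU of the masks cut at threshold + offset and threshold - offset."""
    tight = logits > threshold + offset
    loose = logits > threshold - offset
    # tight is a subset of loose, so the intersection is tight and the union is loose.
    union = loose.sum()
    return tight.sum() / union if union else 0.0

# Two synthetic logit maps for the same disc of radius 20 (assumed data):
# one with a steep boundary, one with a shallow, uncertain boundary.
yy, xx = np.mgrid[:64, :64]
dist = np.hypot(yy - 32, xx - 32)
sharp = (20 - dist) * 4.0
soft = (20 - dist) * 0.2

sharp_score = stability_score(sharp)
soft_score = stability_score(soft)
print(f"sharp boundary: {sharp_score:.2f}, soft boundary: {soft_score:.2f}")
```

A mask whose area barely moves when the cut shifts scores near 1, while a mask with a shallow logit slope loses much of its area and scores low, which is what makes the measure usable as a quality filter.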
Describe three concrete images on which Otsu's method chooses a badly wrong threshold, one for each failure cause: (a) extremely unequal class populations, (b) more than two modes, (c) heavy mode overlap. For each, predict roughly where Otsu's cut lands and why, using the $\sigma_B^2(t) = \omega_0 \omega_1 (\mu_0 - \mu_1)^2$ formula to justify your prediction.
Photograph several coins on a plain dark surface (or synthesize bright discs on a dark noisy background). Build a pipeline: grayscale, Otsu threshold, then count the resulting connected components with cv2.connectedComponents. Then re-shoot (or re-synthesize) with a strong lighting gradient and show that the Otsu version miscounts while a version using cv2.adaptiveThreshold recovers the correct count. Report both counts at both lighting conditions.
For the synthetic two-mode image from this section's Otsu code, sweep the threshold $t$ from 0 to 255 and plot the pixel-level classification error against the known ground truth (the blob rectangle) as a function of $t$. Mark Otsu's choice on the curve. Then narrow the gap between the two modes (means 70 and 180, then 90 and 160, then 110 and 140) and repeat. Write a short analysis of how the error valley's width changes and what that implies about thresholding robustness as class separability shrinks.