Section 2.4: Thresholding: Global, Otsu & Adaptive

"There are exactly two kinds of pixels in this world. I decided that. You're welcome."
An Uncompromisingly Binary Threshold Operator

Big Picture

Thresholding is the moment image processing first makes a decision: every pixel is declared foreground or background, and a grayscale measurement becomes a binary claim about the world. The entire question is where to put the cut. Otsu's 1979 answer, pick the threshold that best separates the histogram into two tight clusters, is one of the most-used algorithms in vision history, and the deeper pattern (continuous scores in, threshold, binary decision out) never goes away: it is exactly how segmentation networks produce masks from logits today.

The previous sections remapped intensities while keeping them continuous: Section 2.1 bent the tone curve and Section 2.3 let the histogram choose it. Now we collapse the scale entirely. Thresholding replaces every pixel with a yes or a no, which sounds destructive (and is) but is precisely the point: counting cells under a microscope, reading a barcode, locating solder joints, and OCRing a receipt all require deciding which pixels belong to the thing. This section covers choosing that decision boundary: by hand, optimally from the histogram, and locally when one global answer cannot exist.

1. Binarization: The Simplest Classifier Basic

Global thresholding is a one-parameter point operation:

$$g(x, y) = \begin{cases} 255 & \text{if } f(x, y) > T \\ 0 & \text{otherwise} \end{cases}$$

It is worth pausing on what this really is: a classifier with a single feature (intensity) and a single parameter ($T$), set by hand for now. When does such a crude classifier work? Exactly when the histogram from Section 2.2 is bimodal: two well-separated humps, one for background and one for object, with a quiet valley between them. Then any $T$ in the valley yields a clean mask. Figure 2.4.1 shows this ideal case, along with the quantity Otsu's method will optimize. When the humps overlap (and the sensor noise we traced in Chapter 1 broadens both of them), no global $T$ can be clean, and we will need either better lighting, the local methods of this section's second half, or the learned segmentation of Part III.

Figure 2.4.1 The thresholding ideal: a bimodal histogram whose two modes correspond to background and object. Any threshold in the valley separates them; Otsu's method finds the cut $t^*$ that maximizes the between-class variance $\sigma_B^2(t) = \omega_0 \omega_1 (\mu_0 - \mu_1)^2$, pushing the two class means as far apart as their populations allow.

In OpenCV, global thresholding is cv2.threshold, which returns both the threshold used and the binarized image, and supports inverse and truncation variants through its flag argument. The interesting question is not the call but the number: who chooses $T$?
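Before turning to that question, the piecewise rule above can be written out as a minimal NumPy sketch. The function name `global_threshold` and its `invert` switch are illustrative assumptions, not OpenCV names; on 8-bit input the result is intended to match the binary and inverse-binary modes of cv2.threshold, including the strict inequality $f(x, y) > T$.

```python
import numpy as np

def global_threshold(gray, T, invert=False):
    """Binarize: 255 where gray > T, else 0 (swapped when invert=True)."""
    mask = gray > T                      # strict inequality, as in the formula
    if invert:
        mask = ~mask                     # dark objects on a light background
    return np.where(mask, 255, 0).astype(np.uint8)

# Tiny 2x3 "image": a pixel exactly equal to T stays background.
demo = np.array([[10, 128, 200],
                 [127, 129, 255]], dtype=np.uint8)
binary = global_threshold(demo, T=128)
binary_inv = global_threshold(demo, T=128, invert=True)
print(binary)
print(binary_inv)
```

The pixel with value 128 lands in the background because the comparison is strict, a detail that matters when $T$ is chosen from the histogram and sits exactly on a populated bin.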

2. Otsu's Method: The Histogram Chooses the Cut Advanced

Nobuyuki Otsu's 1979 insight reframed threshold selection as a clustering problem on the histogram. A candidate threshold $t$ splits the normalized histogram $p(k)$ into class 0 ($k \le t$) and class 1 ($k > t$), with class weights and means

$$\omega_0(t) = \sum_{k=0}^{t} p(k), \quad \omega_1(t) = 1 - \omega_0(t), \quad \mu_0(t) = \frac{\sum_{k=0}^{t} k\,p(k)}{\omega_0(t)}, \quad \mu_1(t) = \frac{\sum_{k=t+1}^{255} k\,p(k)}{\omega_1(t)}$$

A good threshold should produce two tight, well-separated classes: small variance within each class, large distance between their means. Otsu showed these are the same objective, because total variance decomposes as $\sigma^2 = \sigma_W^2(t) + \sigma_B^2(t)$: within-class plus between-class. The total $\sigma^2$ is a property of the image's histogram alone and does not change with $t$ (the cut only decides how that fixed budget is split between the two terms), so the two terms must trade off: minimizing the within-class spread $\sigma_W^2(t)$ is the same as maximizing the between-class variance $\sigma_B^2(t)$.

Proof: The Total Variance Splits Into Within Plus Between

The whole Otsu argument pivots on $\sigma^2 = \sigma_W^2(t) + \sigma_B^2(t)$ with $\sigma^2$ independent of $t$, so it is worth deriving in two lines. Let $\mu_G$ be the global mean, recall the class definitions above, and write $\sigma_i^2 = \frac{1}{\omega_i} \sum_{k \in C_i} (k - \mu_i)^2\, p(k)$ for the variance within class $C_i$. The within-class and between-class variances are then the population-weighted averages of the per-class variances and of the squared class-mean deviations from $\mu_G$:

$$\sigma_W^2(t) = \omega_0 \sigma_0^2 + \omega_1 \sigma_1^2, \qquad \sigma_B^2(t) = \omega_0 (\mu_0 - \mu_G)^2 + \omega_1 (\mu_1 - \mu_G)^2$$

Start from the total variance and split the sum over the two classes, writing each pixel's deviation from $\mu_G$ as its deviation from its own class mean plus that class mean's deviation from $\mu_G$:

$$\sigma^2 = \sum_k (k - \mu_G)^2 p(k) = \sum_{i=0}^{1} \sum_{k \in C_i} \big[(k - \mu_i) + (\mu_i - \mu_G)\big]^2 p(k)$$

Expanding the square gives three sums per class. The cross term vanishes because $\sum_{k \in C_i} (k - \mu_i)\, p(k) = \omega_i (\mu_i - \mu_i) = 0$ (the deviations about a class mean sum to zero), leaving only the squared terms:

$$\sigma^2 = \underbrace{\sum_{i=0}^{1} \omega_i \sigma_i^2}_{\sigma_W^2(t)} + \underbrace{\sum_{i=0}^{1} \omega_i (\mu_i - \mu_G)^2}_{\sigma_B^2(t)}$$

The left side never mentions $t$, so as the cut moves, every unit of variance handed to $\sigma_B^2$ is taken from $\sigma_W^2$. Maximizing one is identically minimizing the other, which is what makes the single search over $t$ legitimate. Substituting $\mu_G = \omega_0 \mu_0 + \omega_1 \mu_1$ into the between-class term and using $\omega_0 + \omega_1 = 1$ collapses it to the compact form $\sigma_B^2(t) = \omega_0 \omega_1 (\mu_0 - \mu_1)^2$ used below.
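The identity is easy to check numerically. The sketch below uses a made-up eight-level histogram (the counts are arbitrary, chosen only so that both classes are non-empty at every cut) and evaluates the within-class term, the between-class term, and the compact form for each $t$ straight from their definitions.

```python
import numpy as np

# Hypothetical toy histogram over 8 gray levels (any non-negative counts work).
counts = np.array([5, 9, 4, 1, 0, 2, 7, 6], dtype=np.float64)
p = counts / counts.sum()
k = np.arange(p.size)

mu_G = (k * p).sum()
sigma2_total = ((k - mu_G) ** 2 * p).sum()       # does not depend on t

rows = []
for t in range(p.size - 1):                      # keeps both classes non-empty
    c0, c1 = k <= t, k > t
    w0, w1 = p[c0].sum(), p[c1].sum()
    mu0 = (k[c0] * p[c0]).sum() / w0
    mu1 = (k[c1] * p[c1]).sum() / w1
    var0 = ((k[c0] - mu0) ** 2 * p[c0]).sum() / w0
    var1 = ((k[c1] - mu1) ** 2 * p[c1]).sum() / w1
    sigma_w2 = w0 * var0 + w1 * var1
    sigma_b2 = w0 * (mu0 - mu_G) ** 2 + w1 * (mu1 - mu_G) ** 2
    compact = w0 * w1 * (mu0 - mu1) ** 2
    rows.append((t, sigma_w2, sigma_b2, compact))

for t, sw, sb, cp in rows:
    print(f"t={t}  within={sw:.4f}  between={sb:.4f}  sum={sw + sb:.4f}")
print(f"total variance = {sigma2_total:.4f}")
```

Every printed sum should equal the total variance, and the between-class column should agree with the compact form to floating-point precision.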

Illustration: a referee stands in the dip between a tight dark crowd and a tight bright crowd, chalking one vertical line in the valley while holding a balance scale. Otsu draws one line in the valley between two crowds, the cut that shoves the dark and bright populations as far apart as their numbers allow.

The illustration above stages that objective as a referee chalking a single line in the valley between two crowds. In symbols, the quantity to maximize is the compact between-class variance

$$\sigma_B^2(t) = \omega_0(t)\, \omega_1(t)\, \big(\mu_0(t) - \mu_1(t)\big)^2$$

and the optimal threshold is $t^* = \arg\max_t \sigma_B^2(t)$. With only 256 candidates, brute force is instant, and cumulative sums make the whole search a few vectorized lines:

# Otsu's threshold from scratch: among all 256 candidate cuts, pick the
# one maximizing between-class variance. Cumulative sums give every
# candidate's class weights and means in one vectorized sweep.
import numpy as np
import cv2

def otsu_threshold(gray):
    """Otsu's method from scratch: maximize between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    k = np.arange(256)

    w0 = np.cumsum(p)                      # class-0 weight for every t
    m  = np.cumsum(k * p)                  # cumulative first moment
    mG = m[-1]                             # global mean

    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)            # both classes must be non-empty
    mu0 = np.where(valid, m / np.maximum(w0, 1e-12), 0)
    mu1 = np.where(valid, (mG - m) / np.maximum(w1, 1e-12), 0)

    sigma_b2 = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, 0)
    return int(np.argmax(sigma_b2))

# Synthetic sanity check: dark background N(70, 12), one bright square N(180, 12)
rng = np.random.default_rng(0)
img = rng.normal(70, 12, (400, 400))
img[100:300, 100:300] = rng.normal(180, 12, (200, 200))
img = np.clip(img, 0, 255).astype(np.uint8)

t_scratch = otsu_threshold(img)
t_cv, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(t_scratch, t_cv)   # 124 124.0  (the valley between 70 and 180)

Code Fragment 1: Otsu's method in vectorized NumPy: inside otsu_threshold, cumulative sums give every candidate threshold's class weights and means at once, and argmax of the between-class variance sigma_b2 picks the cut. On the synthetic two-mode image, both the from-scratch version and OpenCV land on the same valley threshold of 124.

Library Shortcut: One Flag, Not Fifteen Lines

The entire from-scratch function above is one flag added to cv2.threshold:

t, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

Otsu's entire optimization, reduced to one flag on the threshold call; the chosen threshold comes back as the first return value.

A 15-to-1 reduction. The library computes the histogram, runs the variance maximization, and applies the binarization in a single optimized pass, and the same flag composes with other modes (THRESH_BINARY_INV + THRESH_OTSU for dark objects on light backgrounds, common in document images). scikit-image offers skimage.filters.threshold_otsu when you want the number without the binarized image.

Key Insight: Otsu Is k-Means in One Dimension

Minimizing within-class variance over a one-dimensional split is exactly the objective of k-means, the clustering algorithm that partitions data into $k$ groups by minimizing the spread within each group (covered in full in Chapter 11), here with $k = 2$ and solved exhaustively over all 256 cuts rather than iteratively. This one framing tells you precisely when Otsu fails: when the two clusters have wildly unequal populations (a tiny defect on a vast background barely dents $\sigma_B^2$), when there are not actually two modes, or when the modes overlap heavily.

The unequal-population trap is worth seeing as a number. The between-class variance carries the factor $\omega_0\omega_1$, and if the defect is 1 percent of the image, that product at the correct cut is $0.01 \times 0.99 = 0.0099$, versus $0.25$ for a balanced split: roughly 25 times smaller. Unless the defect's mean sits far enough from the background's to make up that deficit through $(\mu_0 - \mu_1)^2$, a cut through the middle of the background hump scores higher, so Otsu happily abandons the defect and slices the dominant background into two meaningless halves instead.
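The arithmetic takes three lines. The helper name `weight_factor` is illustrative; it evaluates only the $\omega_0\omega_1$ factor and says nothing about the mean gap, which a real image would also contribute.

```python
def weight_factor(foreground_fraction):
    """The omega_0 * omega_1 factor of sigma_B^2 for a given class split."""
    return foreground_fraction * (1.0 - foreground_fraction)

balanced = weight_factor(0.5)       # 0.25, the largest value the factor can take
defect = weight_factor(0.01)        # 0.0099 for a 1 percent defect
ratio = balanced / defect           # about 25.25

# To score as well as a balanced split, the squared mean gap at the defect cut
# must be larger by that ratio, so the gap itself by sqrt(ratio), about 5x.
gap_multiplier = ratio ** 0.5

print(f"{balanced:.4f} {defect:.4f} {ratio:.2f} {gap_multiplier:.2f}")
# 0.2500 0.0099 25.25 5.03
```

Read the last number as a rule of thumb: a 1 percent defect needs a mean gap about five times larger than a balanced split would need to earn the same between-class variance.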

The practical takeaway: the histogram diagnostics of Section 2.2 are the pre-flight check for Otsu, so look at the distribution before trusting an automatic cut. The same "inspect your score distribution before thresholding it" discipline applies verbatim to the network logits of Part III.

Common Misconception: A Returned Otsu Threshold Means a Good Split Exists

Because cv2.threshold(..., THRESH_OTSU) always returns a number and a clean-looking binary image, students conclude the cut is meaningful. In fact Otsu maximizes between-class variance unconditionally: it returns a value even for a perfectly unimodal histogram with no valley at all, placing the cut somewhere on the single hump and slicing one object into arbitrary "foreground" and "background" halves. The number is never absent; its validity is the open question. This is why the histogram diagnostics of Section 2.2 are the pre-flight check: confirm two separated modes exist before you trust the cut, exactly the "inspect the score distribution before thresholding it" habit that carries over to the segmentation logits of Chapter 24.
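One way to put a number on that pre-flight check, offered as a sketch and not a standard OpenCV feature, is the ratio $\eta = \sigma_B^2(t^*) / \sigma^2$: the fraction of the total variance that the chosen cut explains. It follows directly from the decomposition above and lies between 0 and 1. The names `otsu_with_separability` and `gaussian_hist` and the two analytic histograms are assumptions made for illustration, and any acceptance cutoff on $\eta$ would need calibrating on your own images.

```python
import numpy as np

def otsu_with_separability(p):
    """Return (t_star, eta) for a normalized 256-bin histogram p.

    eta = sigma_B^2(t_star) / sigma^2 is the fraction of total variance
    explained by the cut: near 1 for two tight modes, lower for one hump.
    """
    k = np.arange(p.size)
    w0 = np.cumsum(p)
    w1 = 1.0 - w0
    m = np.cumsum(k * p)
    mu_G = m[-1]
    valid = (w0 > 1e-12) & (w1 > 1e-12)          # both classes non-empty
    # (mu_G*w0 - m)^2 / (w0*w1) is algebraically w0*w1*(mu0 - mu1)^2
    sigma_b2 = np.zeros_like(p)
    sigma_b2[valid] = (mu_G * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    total = ((k - mu_G) ** 2 * p).sum()
    t_star = int(np.argmax(sigma_b2))
    return t_star, float(sigma_b2[t_star] / total)

def gaussian_hist(mean, std, weight=1.0):
    """Illustrative helper: a Gaussian bump sampled on the 0..255 bins."""
    k = np.arange(256)
    return weight * np.exp(-0.5 * ((k - mean) / std) ** 2)

bimodal = gaussian_hist(70, 12, 0.75) + gaussian_hist(180, 12, 0.25)
unimodal = gaussian_hist(128, 20)

t_bi, eta_bi = otsu_with_separability(bimodal / bimodal.sum())
t_uni, eta_uni = otsu_with_separability(unimodal / unimodal.sum())
print(t_bi, round(eta_bi, 2))     # a cut in the valley, eta close to 1
print(t_uni, round(eta_uni, 2))   # still returns a cut, but eta is clearly lower
```

The unimodal case still yields a threshold near the middle of the hump, which is the misconception in action; only the lower $\eta$ hints that the split is arbitrary. Note that $\eta$ never falls to zero for a single hump, so it is best used as a relative signal alongside a look at the histogram.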

Fun Fact: Four Pages, Half a Century

Otsu's paper is four pages long, contains one figure, and was published in 1979 in a systems-and-cybernetics journal rather than a vision venue. It has accumulated tens of thousands of citations (in the range of fifty thousand or more by common indices) and ships today inside OpenCV, scikit-image, MATLAB, ImageJ, Halcon, and a vast range of microscopes and document scanners. Per page, it is among the most deployed algorithm descriptions in the history of imaging.

3. When One Threshold Cannot Work: Adaptive Methods Intermediate

Otsu optimizes a global cut, and some images simply do not have one. The canonical case is a document photographed under side lighting: the paper at the dark corner is darker than the ink at the bright corner, so any single $T$ misclassifies one end of the page. Figure 2.4.2 shows the failure and the cure. The cure is to let the threshold vary across the image: classify each pixel against a statistic of its own neighborhood rather than against one global number.

Figure 2.4.2 Why adaptive thresholding exists. Left: a document under an illumination gradient; the paper in the dark corner is darker than the ink in the bright corner. Center: any global threshold misclassifies one end, here crushing the dark corner to solid ink. Right: comparing each pixel to its own local neighborhood mean recovers every text line, because the gradient varies slowly while ink edges vary fast.

OpenCV's cv2.adaptiveThreshold computes, for each pixel, the mean (or Gaussian-weighted mean) of a blockSize x blockSize neighborhood and thresholds the pixel at that local mean minus a constant C. For harder material (stained documents, low-contrast film scans), Sauvola's method replaces the constant offset with one that depends on the local standard deviation: the threshold sits close to the local mean in busy, high-contrast regions and drops well below it in flat ones, so uniform background is not broken into speckle:

# Two local thresholds for uneven illumination, where no single global
# cut works: OpenCV's Gaussian-weighted adaptive mean minus an offset,
# and Sauvola, which also scales the offset by local standard deviation.
import cv2
from skimage.filters import threshold_sauvola

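gray = cv2.imread("receipt_photo.jpg", cv2.IMREAD_GRAYSCALE)
if gray is None:   # cv2.imread returns None instead of raising on a bad path
    raise FileNotFoundError("receipt_photo.jpg")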

# Local Gaussian-weighted mean, minus offset C. blockSize must be odd
# and larger than the stroke width of the text.
adaptive = cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY,
                                 blockSize=31, C=10)

# Sauvola: threshold = mean * (1 + k * (std/R - 1)), per pixel
t_sauvola = threshold_sauvola(gray, window_size=31, k=0.2)
sauvola = (gray > t_sauvola).astype("uint8") * 255

# Both keep the ink legible across the gradient; the ink fraction is similar:
print((adaptive == 0).mean(), (sauvola == 0).mean())   # e.g. 0.071 0.063

Code Fragment 2: Two local thresholding methods on a photographed receipt: cv2.adaptiveThreshold with a Gaussian-weighted mean and offset C, and scikit-image's threshold_sauvola, which modulates the threshold by local standard deviation and is a long-standing classical baseline on document-binarization benchmarks. Both use a 31-pixel window sized above the text stroke width.

Two parameter rules of thumb save hours of fiddling. First, blockSize (or window_size) must be larger than the features you are extracting: if the window fits inside a pen stroke, the stroke's interior sees only itself as context and dissolves. Second, C (or Sauvola's k) sets how decisively a pixel must differ from its surroundings to count as foreground; raising it suppresses noise speckle at the cost of thin strokes. Speckle that survives is not a thresholding failure but a job for the smoothing filters of Chapter 3, and for the morphological cleanup of Chapter 6, which consumes exactly the binary maps this section produces.
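To make the mechanics concrete, here is a from-scratch sketch of the plain (unweighted) local-mean variant in NumPy, using an integral image so the cost does not grow with the window size. The function names, the edge-replicated border, and the synthetic page are assumptions of this sketch; OpenCV's border handling and rounding may differ in detail.

```python
import numpy as np

def local_mean(gray, block_size):
    """Mean over a block_size x block_size window via an integral image.

    Borders use edge replication (an assumption of this sketch).
    """
    r = block_size // 2
    padded = np.pad(gray.astype(np.float64), r, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    b = block_size
    window_sum = (ii[b:b + h, b:b + w] - ii[:h, b:b + w]
                  - ii[b:b + h, :w] + ii[:h, :w])
    return window_sum / (b * b)

def adaptive_mean_threshold(gray, block_size=31, C=10):
    """255 where the pixel exceeds its local mean minus C, else 0."""
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    cut = local_mean(gray, block_size) - C
    return np.where(gray > cut, 255, 0).astype(np.uint8)

# Synthetic page: illumination ramps from 60 (left) to 220 (right);
# "ink" is 40 gray levels darker than the paper wherever it appears.
h, w = 64, 256
paper = np.tile(np.linspace(60, 220, w), (h, 1))
ink = np.zeros((h, w), dtype=bool)
ink[20:44, 16::32] = True          # thin vertical strokes ...
ink[20:44, 17::32] = True          # ... two pixels wide
page = (paper - 40 * ink).astype(np.uint8)

local = adaptive_mean_threshold(page, block_size=31, C=10)
global_cut = np.where(page > 128, 255, 0).astype(np.uint8)

# Fraction of pixels whose ink/paper label is wrong under each method.
local_errors = np.mean((local == 0) != ink)
global_errors = np.mean((global_cut == 0) != ink)
print(local_errors, global_errors)
```

On this page the 31-pixel window is far wider than the two-pixel strokes, so ink barely shifts the local mean and the 40-level ink contrast clears the offset of 10 everywhere; the single global cut at 128 blackens the dim left side and loses the strokes on the bright right side.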

Try This: Watch Text Dissolve as the Window Shrinks

Photograph a page of text under uneven light, then run cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, b, 10) for window sizes b of 7, 15, 31, 61, and 121 (each must be odd). Stack the five results side by side. At b=7 the window fits inside the pen strokes, so the interior of every thick letter sees only itself as context and washes out to white: the text turns into hollow outlines. As b grows past the stroke width the letters fill in solid, and at very large b the method drifts back toward a single global threshold and the dark corner starts to crush again. The sweet spot is a window comfortably larger than a stroke but smaller than the distance over which the illumination changes appreciably. For the second dial, fix b=31 and raise C from 2 to 20: higher C wipes out speckle but also thins or breaks faint strokes, the exact trade described above.

Practical Example: The Expiry-Date Reader That Died at 3 PM

Who: A vision engineer at a food-packaging plant, responsible for verifying inkjet-printed expiry dates on foil lids.

Situation: The verification system binarized each lid image with Otsu's threshold, then matched characters. It ran at 99.7 percent read rate for months.

Problem: Read rate collapsed to around 80 percent, but only on sunny afternoons. A skylight above the line cast a soft gradient across the foil from roughly 2 to 5 PM, and the glare made one global threshold impossible: either the bright half blew out or the dim half went black. The histogram, previously bimodal, smeared into one broad ridge, exactly the failure condition for Otsu.

Decision: The engineer swapped global Otsu for cv2.adaptiveThreshold (Gaussian, blockSize 41, C 8), sized so the window comfortably exceeded the character stroke width, and added a histogram-bimodality check to the telemetry so the system could flag lighting regressions itself.

Result: Afternoon read rate returned above 99.5 percent with no hardware change; the skylight got a diffuser film a month later anyway.

Lesson: A threshold is an assumption about the histogram. When a thresholding system degrades on a schedule (time of day, season), suspect illumination first, and reach for local methods before re-engineering the optics.

4. Beyond a Single Cut Intermediate

Three refinements round out the thresholding toolbox. Multi-level thresholding generalizes Otsu to two or more cuts (three classes or more); scikit-image ships it as threshold_multiotsu, useful for images with distinct material phases, like soil-pore-water micrographs. The triangle method (cv2.THRESH_TRIANGLE) handles strongly unimodal histograms with a small foreground tail, the regime where Otsu's equal-cluster assumption breaks: it draws a line from the histogram peak to the far end of the tail and places the cut at the bin whose histogram value lies farthest from that line, measured perpendicularly. It is a favorite in fluorescence microscopy. And hysteresis thresholding uses two thresholds: pixels above the high cut are definitely foreground, pixels below the low cut are rejected, and pixels between the cuts are accepted only if connected to definite ones. That last idea, tolerance for weak evidence when it is attached to strong evidence, is the heart of the Canny edge detector you will meet in Chapter 9.
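The hysteresis rule fits in a few lines. The sketch below is a plain flood fill from the definite pixels; the function name and the choice of 4-connectivity are assumptions made for illustration (8-connectivity is an equally reasonable choice), and scikit-image provides a ready-made apply_hysteresis_threshold for real use.

```python
import numpy as np

def hysteresis_threshold(score, low, high):
    """Keep pixels > high, plus pixels > low that are 4-connected to them."""
    candidate = score > low
    accepted = score > high
    stack = list(zip(*np.nonzero(accepted)))     # seeds: the definite pixels
    h, w = score.shape
    while stack:
        y, x = stack.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            inside = 0 <= ny < h and 0 <= nx < w
            if inside and candidate[ny, nx] and not accepted[ny, nx]:
                accepted[ny, nx] = True          # weak pixel attached to strong
                stack.append((ny, nx))
    return accepted

score = np.array([[9, 5, 5, 0, 5],
                  [0, 0, 0, 0, 5],
                  [0, 5, 0, 0, 0]])
mask = hysteresis_threshold(score, low=4, high=8)
print(mask.astype(int))
```

Only the top-left run survives: the two 5s touching the 9 are accepted, while the equally bright 5s on the right edge and in the bottom row are dropped because no definite pixel vouches for them.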

Step back and notice the shape of everything this section did: produce a per-pixel score, then apply a cut to make a per-pixel decision. That shape survives the deep learning revolution completely intact. A binary segmentation network in Chapter 24 ends in exactly this operation: per-pixel logits, a sigmoid, and a threshold (usually 0.5, and choosing it better is a tuning lever practitioners use). Multi-class heads swap the threshold for an argmax over class scores, the same decision step with more than two outcomes. The score map got smarter; the final operation is still this section.
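A small sketch shows how little changes at that final step. The logit values below are made up to stand in for a network's output; the point is only that cutting the sigmoid at 0.5 is the same decision as cutting the raw logits at zero, and that any other probability cut maps to a logit cut through the inverse sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2x3 logit map standing in for a network's output.
logits = np.array([[-3.2, -0.1, 0.4],
                   [ 2.5,  0.0, -1.7]])

mask_from_probs = sigmoid(logits) > 0.5
mask_from_logits = logits > 0.0            # same decision, no sigmoid needed

# Moving the probability cut to tau moves the logit cut to log(tau / (1 - tau)).
tau = 0.8
logit_cut = np.log(tau / (1.0 - tau))
stricter_mask = logits > logit_cut

print(mask_from_logits.astype(int))
print(stricter_mask.astype(int))
```

Because the sigmoid is monotonic, the threshold can be applied on whichever side is cheaper, which is why some pipelines never compute the probabilities at all.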

Research Frontier: Thresholds Inside Modern Segmenters

The threshold did not retire; it moved inside the model. SAM and SAM 2 (Meta, 2023 and 2024) generate segmentation masks by thresholding predicted mask logits at zero, and expose a stability_score that measures how much a mask changes when that threshold shifts, literally a threshold-sensitivity analysis from this section run as a quality filter. In cell microscopy, Cellpose's 2024-2025 generation (Cellpose3 and the "Cellpose-SAM" line) still finalizes instance masks by thresholding predicted flow and probability fields, with the cut exposed as a user parameter. And in document AI, the DIBCO binarization benchmarks continue to be contested by hybrid methods where a U-Net predicts a per-pixel threshold surface, a direct learned descendant of Sauvola's formula. The research question is no longer "where is the cut?" but "how do we make the scores so good that the cut is easy?"

Exercise 2.4.1: Break Otsu on Purpose Conceptual

Describe three concrete images on which Otsu's method chooses a badly wrong threshold, one for each failure cause: (a) extremely unequal class populations, (b) more than two modes, (c) heavy mode overlap. For each, predict roughly where Otsu's cut lands and why, using the $\sigma_B^2(t) = \omega_0 \omega_1 (\mu_0 - \mu_1)^2$ formula to justify your prediction.

Exercise 2.4.2: Coin Counter Coding

Photograph several coins on a plain dark surface (or synthesize bright discs on a dark noisy background). Build a pipeline: grayscale, Otsu threshold, then count the resulting connected components with cv2.connectedComponents. Then re-shoot (or re-synthesize) with a strong lighting gradient and show that the Otsu version miscounts while a version using cv2.adaptiveThreshold recovers the correct count. Report both counts at both lighting conditions.

Exercise 2.4.3: The Valley Sensitivity Curve Analysis

For the synthetic two-mode image from this section's Otsu code, sweep the threshold $t$ from 0 to 255 and plot the pixel-level classification error against the known ground truth (the blob rectangle) as a function of $t$. Mark Otsu's choice on the curve. Then narrow the gap between the two modes (means 70 and 180, then 90 and 160, then 110 and 140) and repeat. Write a short analysis of how the error valley's width changes and what that implies about thresholding robustness as class separability shrinks.