Section 21.5: Class Imbalance, Label Noise & Real-World Data

"Ninety-nine times out of a hundred the part was fine, so I learned to say fine and was right ninety-nine percent of the time. They were furious. The whole point, they said, was the hundredth part. Nobody mentioned this during training."
A Defect Detector That Optimized the Wrong Thing

Big Picture

Benchmark datasets are curated to look clean and balanced; the data you actually get is neither, and the two dominant pathologies, severe class imbalance and wrong labels, will quietly defeat a model trained as if the data were perfect. When one class outnumbers another a thousand to one, plain accuracy and plain cross-entropy push the model to ignore the rare class, which is usually the class you care about. When a fraction of your labels are simply incorrect (and a fraction always is, even in ImageNet), a high-capacity network will dutifully memorize the mistakes. This section gives you the tools for both: class-balanced and focal losses and resampling for imbalance, and noise-robust training plus the right metrics for label noise.

The previous four sections assumed something close to the clean, balanced benchmark world of Section 21.1. Real projects rarely live there. A medical screening set has far more healthy than diseased scans; a manufacturing line produces vastly more good parts than defective ones; a wildlife camera trap captures common species constantly and the endangered one almost never. And in every dataset, some labels are wrong, because annotation is done by tired humans under time pressure. This section is about training well anyway. It connects back to the noise models of Chapter 7, where noise corrupted pixels; here it corrupts labels.

1. The Imbalance Problem and Why Accuracy Lies (Beginner)

Consider a defect detector where 99% of parts are good. A model that predicts "good" for everything achieves 99% accuracy while being completely useless, because it never catches a single defect, the one thing it exists to do. This is why accuracy is the wrong metric under imbalance. Cross-entropy, the standard classification loss from Section 18.5, has the same blind spot: summed over a batch that is 99% good parts, the loss is dominated by the easy majority class, so the gradient mostly teaches the model to be even more confident about good parts and barely registers the rare defects. The model converges to the trivial majority predictor because that is what the loss rewards. The illustration below dramatizes the trap: a forecaster who always predicts sun looks brilliant on paper and misses the one event that mattered.

[Illustration: a smug desert weather-forecaster robot holds a trophy for always predicting sun and being right nearly every day, while a single tiny rain cloud sneaks by unnoticed behind it.] Being right about the boring majority case is not the same as being useful; under imbalance, accuracy lets a useless model look excellent.

The first fix is the metric. Under imbalance you measure precision, recall, and their harmonic mean, the F1 score, per class, or you summarize with balanced accuracy (the average of per-class recalls) or the area under the precision-recall curve. These reward catching the rare class. Choosing the right metric is not a cosmetic decision; it determines what "good" even means, and it governs the threshold tuning you will revisit when per-pixel logits get thresholded in the segmentation of Chapter 24. The same choice sits on the line of metrics that runs from PSNR in Chapter 1 through intersection over union (IoU) to mean average precision (mAP).
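To make the gap between accuracy and the imbalance-aware metrics concrete, here is a minimal sketch in plain Python. It scores the always-"good" predictor on a toy 990/10 split. The helper name `binary_report` and the toy counts are assumptions for illustration, not part of any library.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 and balanced accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0    # convention: 0 when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall of the negative class
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }

y_true = [0] * 990 + [1] * 10     # 0 = good part, 1 = defect (toy 99:1 split)
always_good = [0] * 1000          # the trivial majority predictor
report = binary_report(y_true, always_good)
print(report)
```

The majority predictor scores 0.99 accuracy here, but its defect recall and F1 are 0 and its balanced accuracy is 0.5, the same as guessing. Those are the numbers that expose it.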

Key Insight: Pick the Metric Before the Loss

Under imbalance, decide what success means before you touch the training loss. If catching every defect matters more than the occasional false alarm, optimize recall; if a false alarm is expensive, balance precision against it with the F1 score or a cost-weighted metric. Report per-class numbers, never a single pooled accuracy, because pooled accuracy on imbalanced data is the number that lets a useless model look excellent. Once the metric reflects the real goal, the loss and sampling fixes below have a target to aim at.

2. Loss-Level Fixes: Class Weights and Focal Loss (Intermediate)

The simplest loss-level fix is class weighting: multiply each class's contribution to the loss by a weight inversely related to its frequency, so a rare-class mistake costs more than a common-class mistake. A refinement, the class-balanced loss, weights by the "effective number" of samples rather than the raw inverse frequency, which behaves better when counts vary by orders of magnitude. The intuition is that near-duplicate examples in a large class add little new information, so the thousandth photo of a common class counts for less than the raw count suggests. Concretely, a class of $n$ raw samples is assigned an effective number $E_n = (1 - \beta^n) / (1 - \beta)$ with $\beta$ just under $1$ (Cui et al., 2019), and the class weight is set to $1 / E_n$. The point of this particular form is that it saturates: as $n$ grows, $\beta^n \to 0$ and $E_n$ flattens toward the ceiling $1 / (1 - \beta)$, so doubling an already-large class barely changes its weight, while a rare class with small $n$ has $E_n \approx n$ and still gets nearly the full inverse-frequency boost. Because the weights of large classes stop shrinking once $E_n$ saturates, the ratio between rare-class and common-class weights stays bounded, which keeps the loss from over-amplifying a handful of rare-class samples.
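The saturation is easy to see numerically. The sketch below computes $1 / E_n$ weights for three hypothetical class counts and compares the rare-to-common weight ratio against plain inverse frequency. The function names, the counts, $\beta = 0.999$, and the rescaling of the weights to sum to the number of classes are all assumptions chosen for illustration.

```python
def effective_number(n, beta):
    """E_n = (1 - beta^n) / (1 - beta), the effective number of samples."""
    return (1.0 - beta ** n) / (1.0 - beta)

def class_balanced_weights(counts, beta=0.999):
    """Weights 1 / E_n, rescaled so they sum to the number of classes."""
    raw = [1.0 / effective_number(n, beta) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

counts = [10_000, 1_000, 10]                    # hypothetical long-tailed class counts
inv_freq = [1.0 / n for n in counts]
cb = class_balanced_weights(counts, beta=0.999)

# Ratio of rarest-class weight to most-common-class weight under each scheme.
print("inverse frequency ratio:", inv_freq[-1] / inv_freq[0])
print("class-balanced ratio   :", cb[-1] / cb[0])
```

With these counts, inverse frequency weights the rarest class about 1000 times more than the most common one. The class-balanced weights bring that ratio down to roughly 100, because $E_n$ for the 10,000-sample class has already flattened near $1 / (1 - \beta) = 1000$.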

Class weighting rebalances how much each class counts; focal loss instead rebalances how much each example counts. Focal loss is the more surgical tool, designed for the extreme imbalance of object detection. It multiplies the standard cross-entropy by a factor $(1 - p_t)^\gamma$, where $p_t$ is the model's predicted probability for the true class. Easy examples (high $p_t$) get their loss driven toward zero, so the gradient is dominated by the hard and usually rare cases.

Figure 21.5.1 shows how the focal factor reshapes the loss: well-classified easy examples are suppressed so the model stops spending capacity on what it already knows and focuses on the hard tail.

Figure 21.5.1: Focal loss versus cross-entropy as a function of the model's confidence in the correct class. The focal modulating factor $(1 - p_t)^\gamma$ pulls the loss of confidently-correct (easy) examples toward zero, so the gradient is dominated by hard examples, which under imbalance are mostly the rare class.
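For readers without the figure in front of them, the same curve can be tabulated. This is a minimal sketch in plain Python that evaluates the per-example cross-entropy term $-\log p_t$ and the focal term $(1 - p_t)^\gamma \cdot (-\log p_t)$ at a few confidence levels. The helper names and the sampled $p_t$ values are illustrative assumptions.

```python
import math

def cross_entropy_term(p_t):
    """Per-example cross-entropy for true-class probability p_t."""
    return -math.log(p_t)

def focal_term(p_t, gamma=2.0):
    """Per-example focal loss: cross-entropy scaled by (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy_term(p_t)

for p_t in (0.1, 0.5, 0.9, 0.99):
    ce, fl = cross_entropy_term(p_t), focal_term(p_t)
    # "kept" is the fraction of the cross-entropy that survives the focal factor.
    print(f"p_t={p_t:<4}  CE={ce:.4f}  focal={fl:.6f}  kept={fl / ce:.4f}")
```

At $\gamma = 2$ an example with $p_t = 0.9$ keeps only $(0.1)^2 = 0.01$ of its cross-entropy, while a hard example at $p_t = 0.1$ keeps $(0.9)^2 = 0.81$, so the hard example's share of the total loss grows by close to two orders of magnitude.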

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights easy (confident) examples."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log prob of true class
    pt = log_pt.exp()
    loss = -((1 - pt) ** gamma) * log_pt                        # the focal factor
    if alpha is not None:                                       # optional per-class weight
        loss = alpha[targets] * loss
    return loss.mean()

logits = torch.randn(8, 3)                # 8 examples, 3 classes
targets = torch.randint(0, 3, (8,))
print("CE   :", F.cross_entropy(logits, targets).item())
print("focal:", focal_loss(logits, targets, gamma=2.0).item())
# Representative (unseeded, so values vary):
# CE   : 1.27
# focal: 0.61
# focal loss is never larger than CE: every example is scaled by (1 - pt)^gamma <= 1,
# and confidently-correct examples are scaled down the most

Code Fragment 1: Focal loss in a few lines. After gathering the log-probability of the true class, the (1 - pt) ** gamma factor shrinks the contribution of easy (high-pt) examples, and the optional alpha[targets] adds a per-class weight. With gamma = 0 it reduces exactly to cross-entropy, so gamma tunes how aggressively the easy majority is suppressed.

Fun Fact

A weather forecaster in a desert who predicts "no rain" every single day will be correct well over 99% of the time, which is also exactly how a 99.7%-accurate defect detector that has never once flagged a defect achieves its impressive-looking number. The statistics community has called this the "accuracy paradox" for decades; deep learning simply rediscovered it the hard way, one useless high-accuracy model at a time. The forecaster and the detector share a motto: being right about the boring case is not the same as being useful.

3. Data-Level Fixes: Resampling (Intermediate)

Instead of (or alongside) reweighting the loss, you can rebalance the data the model sees. Oversampling draws rare-class examples more often, so each batch is closer to balanced; PyTorch's WeightedRandomSampler does this cleanly once you give each sample a draw weight inversely proportional to its class frequency. Undersampling discards majority-class examples to match the minority, which is simpler but throws away data. A middle path that often works best pairs modest oversampling with the augmentation of Section 21.2, so the repeated rare-class examples are not identical copies (which would just be memorized) but varied views. The decoupling insight from recent long-tailed-recognition research is worth knowing: it is often best to train the feature backbone on the natural imbalanced distribution and only rebalance when training the final classifier.

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0]*990 + [1]*10)          # 99:1 imbalance
features = torch.randn(len(labels), 8)           # placeholder inputs; substitute your real data
dataset = TensorDataset(features, labels)
class_count = torch.bincount(labels).float()
sample_weight = (1.0 / class_count)[labels]      # rare class -> high draw weight
sampler = WeightedRandomSampler(sample_weight,
                                num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
# Each batch now sees roughly balanced classes despite the 99:1 raw ratio.
# Pair with augmentation so oversampled rare examples are varied, not duplicated.

Code Fragment 2: Oversampling the rare class with WeightedRandomSampler. torch.bincount counts each class, and the per-sample sample_weight is set to the inverse class frequency, so a 99:1 dataset yields roughly balanced batches with replacement=True and without discarding any majority data. The closing comment flags that augmentation should accompany it so oversampled examples are varied, not duplicated.
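The decoupling idea mentioned above, natural sampling for the backbone and rebalancing only for the classifier, can be sketched in two stages. Everything here is a toy stand-in under stated assumptions: the random features, the tiny `backbone` and `classifier` modules, the single epoch per stage, and the choice to re-initialize the classifier before stage 2 are illustrative, not a prescribed recipe.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy stand-in data: 990 majority and 10 minority feature vectors (hypothetical).
labels = torch.tensor([0] * 990 + [1] * 10)
features = torch.randn(len(labels), 16) + labels.unsqueeze(1).float()
dataset = TensorDataset(features, labels)

backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # placeholder feature extractor
classifier = nn.Linear(32, 2)

def train_one_epoch(loader, params):
    """One pass of SGD over `loader`, updating only the given parameters."""
    optimizer = torch.optim.SGD(params, lr=0.1)
    for x, y in loader:
        loss = nn.functional.cross_entropy(classifier(backbone(x)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1: learn features on the natural, imbalanced distribution.
natural_loader = DataLoader(dataset, batch_size=64, shuffle=True)
train_one_epoch(natural_loader,
                list(backbone.parameters()) + list(classifier.parameters()))

# Stage 2: freeze the backbone, then retrain only the classifier on balanced batches.
for p in backbone.parameters():
    p.requires_grad_(False)
snapshot = [p.detach().clone() for p in backbone.parameters()]  # to verify the freeze
classifier.reset_parameters()
class_count = torch.bincount(labels).float()
sampler = WeightedRandomSampler((1.0 / class_count)[labels],
                                num_samples=len(labels), replacement=True)
balanced_loader = DataLoader(dataset, batch_size=64, sampler=sampler)
train_one_epoch(balanced_loader, classifier.parameters())
```

The design choice to notice is that the sampler only appears in stage 2. The features are learned from every majority example at its natural frequency, and the rebalancing only moves the decision boundary of the final layer.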

Practical Example: The 0.3% Defect Class

Who: a quality-control team at a circuit-board factory, 2025. Situation: their defect dataset was 99.7% good boards and 0.3% defective, and their first model reported 99.7% accuracy. Problem: on inspection, the model flagged zero defects; it had learned the majority predictor, exactly the failure of subsection 1. Decision: they made three changes together: switched the reported metric to per-class recall and precision, replaced cross-entropy with focal loss at $\gamma = 2$, and added WeightedRandomSampler oversampling combined with heavy augmentation of the defect images. Result: defect recall rose from 0% to over 85% at an acceptable false-alarm rate, and crucially the team could now see and tune that tradeoff because the metric exposed it. Lesson: imbalance is rarely fixed by one lever. The metric reveals the problem, the loss reshapes the gradient, and resampling reshapes the data; used together they convert a useless 99.7%-accurate model into a deployable one.

You Could Build This: A Rare-Class Detector With an Operating-Point Dial

Here is an advanced build that mirrors what manufacturing and medical-screening teams ship in production. Take any balanced dataset and deliberately make it imbalanced: keep all of one class and subsample a second to roughly 1%, so you manufacture the 99:1 problem of subsection 1. Then build the three-lever fix end to end: report per-class precision and recall instead of accuracy, train with the focal loss of Code Fragment 1 plus WeightedRandomSampler oversampling, and (the part that makes it portfolio-worthy) sweep the decision threshold and plot the full precision-recall curve, so a viewer can pick the operating point that trades false alarms against missed defects. Add a tiny control that prints the confusion matrix at the chosen threshold. This takes roughly three to four hours and combines the metric, loss, and resampling ideas of this section into one tool. The deliverable, a rare-class detector whose recall you can dial to a chosen false-alarm budget, demonstrates exactly the judgment that separates a benchmark score from a deployable system.
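The operating-point dial at the heart of this build is a threshold sweep. A minimal sketch in plain Python, assuming you already have one defect score per example, is shown below. The function names, the scores, and the labels are made-up placeholders, and a real build would plot the whole curve rather than pick a single point.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as a defect."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y == 1 for f, y in zip(flagged, labels))
    fp = sum(f and y == 0 for f, y in zip(flagged, labels))
    fn = sum((not f) and y == 1 for f, y in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(scores, labels, min_recall):
    """Highest threshold whose recall still meets the target, or None."""
    for t in sorted(set(scores), reverse=True):
        precision, recall = precision_recall_at(scores, labels, t)
        if recall >= min_recall:
            return t, precision, recall
    return None

# Hypothetical defect scores from a trained model (label 1 = defect).
scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    0,    0,    0]
choice = pick_threshold(scores, labels, min_recall=1.0)
print(choice)
```

On this toy data, demanding that every defect is caught forces the threshold down to 0.40, where one good part is also flagged and precision is 0.75. Relaxing the recall target lets the threshold rise and removes that false alarm, which is exactly the tradeoff the dial exposes.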

4. Label Noise: Even ImageNet Is Wrong Sometimes (Advanced)

Every real dataset contains wrong labels, and so do the benchmarks. As noted in Section 21.1, confident-learning analyses estimate roughly a 6% label-error rate in the ImageNet validation set and non-trivial error rates in CIFAR and other benchmarks, errors large enough to reorder model leaderboards. The danger is that a high-capacity network, given enough epochs, will memorize even random labels, so it happily fits your mistakes. The empirical pattern that makes this tractable is that networks tend to learn the clean, consistent majority pattern first and only memorize the noisy minority later in training. That ordering is the lever for almost every noise-robust method.

Several techniques exploit it. Early stopping halts training before the memorization phase, a free first defense. Robust losses like the generalized cross-entropy or symmetric cross-entropy bound the loss a single mislabeled example can contribute, so one wrong label cannot dominate the gradient. Co-teaching trains two networks that each select the small-loss (probably-clean) examples for the other, since clean examples have low loss in the early-learning phase. Confident learning (the cleanlab approach) estimates which labels are likely wrong from the model's own predictions and lets you remove or relabel them. Label smoothing from Section 21.4 also helps modestly, since softened targets reduce how hard the model chases any single label.
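To show what "bounding the loss" means in practice, here is a hedged sketch of the generalized cross-entropy, written as $(1 - p_t^q) / q$ for the probability $p_t$ the model assigns to the given label. The function name, the value $q = 0.7$, and the two-class toy logits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """GCE loss (1 - p_t^q) / q; each example contributes at most 1/q."""
    p = F.softmax(logits, dim=1)
    p_t = p.gather(1, targets.unsqueeze(1)).squeeze(1)   # probability of the given label
    return ((1.0 - p_t.pow(q)) / q).mean()

# One example the model is confident about, paired with a label that disagrees,
# which is how a mislabeled sample looks to a model that has learned the clean pattern.
logits = torch.tensor([[10.0, -10.0]])
wrong_label = torch.tensor([1])
ce = F.cross_entropy(logits, wrong_label).item()
gce = generalized_cross_entropy(logits, wrong_label).item()
print("CE :", ce)    # grows with the logit gap, with no upper limit
print("GCE:", gce)   # cannot exceed 1/q
```

Cross-entropy charges this one example a loss that keeps growing as the model becomes more confident, while the GCE term is capped near $1/q$, so a single wrong label has limited pull on the gradient. Smaller $q$ moves the loss back toward cross-entropy, and $q = 1$ gives the linear loss $1 - p_t$.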

Library Shortcut: Find Mislabeled Data with cleanlab

Implementing confident learning by hand (estimating the noise transition matrix, ranking label-quality scores) is a substantial amount of careful code. The cleanlab library does it from cross-validated predictions in a few lines:

from cleanlab.filter import find_label_issues
import numpy as np

# pred_probs: (N, K) cross-validated predicted probabilities; labels: (N,) given labels
issue_idx = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",   # worst-looking labels first
)
print(f"{len(issue_idx)} of {len(labels)} labels flagged as likely errors")
# Inspect the top-ranked issues by eye, then relabel or drop them and retrain.

Code Fragment 3: Finding mislabeled data with cleanlab. find_label_issues takes the given labels and cross-validated pred_probs and returns the suspicious indices ranked by self_confidence, worst first. This turns the confident-learning method, normally substantial noise-matrix code, into one call whose output is a review queue rather than an automatic deletion.

cleanlab handles the noise-rate estimation, the joint-distribution math, and the ranking, turning a research paper's method into one function call. The output is a ranked list of suspicious examples to review by hand, which is often among the highest-leverage data-cleaning steps on a mature dataset, and a concrete instance of the data-centric mindset from Section 21.1.
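The ranking score itself is simple to reproduce. The NumPy sketch below computes self-confidence, the predicted probability of the given label, and sorts by it. It is a simplified stand-in that covers only the ranking step, not the per-class thresholding that decides which examples get flagged in the first place. The function name and the four-row toy arrays are assumptions.

```python
import numpy as np

def rank_by_self_confidence(labels, pred_probs):
    """Indices sorted by the model's probability for the given label, lowest first."""
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    return np.argsort(self_confidence, kind="stable"), self_confidence

labels = np.array([0, 1, 1, 0])
pred_probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.95, 0.05],   # given label 1, but the model is confident in class 0
    [0.60, 0.40],
])
order, score = rank_by_self_confidence(labels, pred_probs)
print(order)   # most suspicious example first
```

Example 2 comes first because the model gives its stated label only 0.05 probability. As in the text, `pred_probs` should come from out-of-fold predictions so that the model has not already memorized the labels it is being asked to audit.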

Research Frontier: Learning From Noisy Web-Scale Data

The foundation models of Chapter 25 and the text-to-image systems of Chapter 34 are trained on billions of noisy, weakly-labeled image-text pairs scraped from the web, where careful per-example labeling is impossible. The 2022-2026 frontier is making that noise an asset rather than a liability: data-filtering networks that score and keep only high-quality pairs (the DataComp benchmark formalized this as a competition), caption rewriting where a model rewrites noisy alt-text into cleaner descriptions, and noise-aware contrastive objectives that tolerate mismatched pairs. The lesson scales the one in this section: at small scale you clean labels with cleanlab; at web scale you build learned filters that clean the stream automatically. Either way, treating data quality as a first-class engineering problem, not a fixed input, is what separates strong systems from fragile ones.

Exercise 21.5.1: Why the Majority Predictor Wins (Conceptual)

For a binary problem with a 95:5 class split and plain cross-entropy, argue quantitatively why the gradient at initialization pushes the model toward predicting the majority class. Then explain how (a) class weighting and (b) focal loss each change the per-example loss contributions to counteract this. Finally, state which of the two you would prefer if the minority class were not only rare but also genuinely harder to classify, and why.

Exercise 21.5.2: Inject Noise, Measure Memorization (Coding)

Take CIFAR-10 and corrupt 20% of the training labels by reassigning them uniformly at random. Train a network for many epochs, tracking both training accuracy and clean-test accuracy each epoch. You should see test accuracy rise, peak, and then fall as the network begins memorizing the noisy labels. Mark the peak, and use it to justify the early-stopping defense from subsection 4. Then re-run with label smoothing at $\epsilon = 0.1$ and report whether the peak is higher or the decline gentler.
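As a starting point for the corruption step, here is one possible helper, sketched with random integer labels standing in for the CIFAR-10 training targets. The function name `corrupt_labels`, the seeds, and the stand-in labels are assumptions. Swap in the real label tensor when you run the exercise.

```python
import torch

def corrupt_labels(labels, num_classes, noise_rate=0.2, seed=0):
    """Reassign a random `noise_rate` fraction of labels uniformly over all classes."""
    g = torch.Generator().manual_seed(seed)
    noisy = labels.clone()
    n_corrupt = int(round(noise_rate * len(labels)))
    idx = torch.randperm(len(labels), generator=g)[:n_corrupt]
    noisy[idx] = torch.randint(0, num_classes, (n_corrupt,), generator=g)
    return noisy, idx

# Stand-in for the 50,000 CIFAR-10 training labels (hypothetical random values).
clean = torch.randint(0, 10, (50_000,), generator=torch.Generator().manual_seed(1))
noisy, idx = corrupt_labels(clean, num_classes=10, noise_rate=0.2)
flipped = (noisy != clean).float().mean().item()
print(f"selected {len(idx)} labels; {flipped:.3f} of all labels actually changed")
```

Because the new label is drawn uniformly over all ten classes, about one in ten reassigned labels lands back on its original class, so roughly 18% of labels end up wrong rather than the full 20%. Keep `idx` so you can later check whether the network's late-epoch mistakes concentrate on the corrupted examples.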

Exercise 21.5.3: Audit a Benchmark for Label Errors (Analysis)

Train a model on a standard dataset with cross-validation to obtain out-of-fold predicted probabilities, then run cleanlab.filter.find_label_issues on them. Inspect the 20 highest-ranked suspected errors by displaying the images alongside their given labels. Categorize each as a genuine error, an ambiguous or multi-object image, or a false alarm. Report the proportions and write a short paragraph on what this implies for trusting a single test-set accuracy number, tying back to the leakage-flatters-you discussion of Section 21.1.