Part I: Image Processing
Chapter 5: Geometric Transformations & Image Warping

Worked Example: A Document Scanner from Scratch

"You photographed an A4 page as a trapezoid at a 40-degree angle, in restaurant lighting, with your thumb in frame. I will fix it. I always fix it. This is my whole personality now."

A Long-Suffering Document Scanner
Big Picture

This section spends the whole chapter at once: a working document scanner is page detection (finding four corner correspondences), model selection (a flat page photographed with perspective means a homography), estimation (four points, eight equations, eight unknowns), and execution (inverse-mapped warp with bilinear interpolation), finished with a binarization pass. About a hundred lines of Python reproduce the core of every mobile scanning app, and each line is a concept you can now name.

The previous sections built the theory in the order textbooks like: models, coordinates, interpolation, execution, estimation. Real projects run in the opposite direction, starting from a goal: the user photographs a paper at an angle; produce a clean, flat, high-contrast scan. This section walks the full distance from one to the other, including the unglamorous parts (corner ordering, resolution bookkeeping) where real implementations actually break. Figure 5.6.1 shows the route.

[Pipeline diagram: 1. Photo (page is a trapezoid) → 2. Edges (blur + Canny + dilate) → 3. Quad (contour, 4 ordered corners) → 4. Warp (homography, full resolution) → 5. Binarize (adaptive threshold)]
Figure 5.6.1: The scanner pipeline. Stages 1 to 3 manufacture the four corner correspondences that Section 5.5 would have gotten from feature matching; stage 4 is the homography estimation and inverse warp of Sections 5.1 to 5.4; stage 5 is classic thresholding from Chapter 2. Each stage hands a strictly simpler object to the next: image, edge map, four points, rectangle, binary scan.

1. Stage 1: Find the Page Outline (Intermediate)

Our scanner's "correspondence problem" is friendlier than Section 5.5's general matching: we know the object of interest is a quadrilateral that contrasts with the background. The classical detection recipe is blur, edge detection, and contour analysis. We downscale first, both for speed and because edge detectors behave more consistently at a standard working resolution; the crucial bookkeeping is to remember the scale factor, because the final warp must run on the full-resolution original. Detection can be lossy; rectification must not be.

import cv2
import numpy as np

PROC_HEIGHT = 600.0          # standard working height for detection

def find_page_quad(image_bgr):
    """Return the page's 4 corners in FULL-RES coords, or None."""
    scale = PROC_HEIGHT / image_bgr.shape[0]
    small = cv2.resize(image_bgr, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_AREA)

    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)        # tame paper texture
    edges = cv2.Canny(gray, 75, 200)                # edge map
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))  # bridge gaps

    contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:5]

    for c in contours:                              # biggest first
        peri = cv2.arcLength(c, True)
        approx = cv2.approxPolyDP(c, 0.02 * peri, True)
        if len(approx) == 4 and cv2.isContourConvex(approx):
            min_area = 0.2 * small.shape[0] * small.shape[1]  # 20% of the frame
            if cv2.contourArea(approx) > min_area:
                return approx.reshape(4, 2).astype(np.float64) / scale
    return None
Code 5.6.1: Page detection: blur, Canny, dilate, then scan the largest contours for a big convex quadrilateral. The final division by scale converts the corners back to full-resolution coordinates, the single most forgotten line in homemade scanners.

Each ingredient earns its place. The Gaussian blur (from Chapter 3) suppresses paper grain and carpet texture that would otherwise fill the edge map with confetti. Canny, which we use here as a black box and dissect properly in Chapter 9, traces intensity discontinuities; the dilation closes one-pixel gaps in the page border so the contour is a single closed curve. approxPolyDP simplifies each candidate contour with the Douglas-Peucker algorithm at a tolerance of 2 percent of the perimeter: page outlines survive as exactly 4 vertices, while sleeves, mugs, and shadows rarely do. The convexity and minimum-area tests reject the rest.
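
To make the Douglas-Peucker step less of a black box, here is a minimal NumPy sketch of the idea. It is not OpenCV's implementation: the helper names (`_rdp_open`, `simplify_closed`, `rect_outline`) and the way the closed outline is split at two far-apart anchor points are assumptions of this sketch, and the synthetic outlines stand in for real contours.

```python
import numpy as np

def _rdp_open(pts, eps):
    """Douglas-Peucker on an open polyline; both endpoints are always kept."""
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]
    ab = b - a
    # Perpendicular distance of every point to the chord from a to b.
    cross = ab[0] * (pts[:, 1] - a[1]) - ab[1] * (pts[:, 0] - a[0])
    dist = np.abs(cross) / np.linalg.norm(ab)
    i = int(np.argmax(dist))
    if dist[i] <= eps:                    # the whole run is straight enough
        return np.array([a, b])
    left = _rdp_open(pts[:i + 1], eps)    # otherwise split at the worst point
    right = _rdp_open(pts[i:], eps)
    return np.vstack([left[:-1], right])

def simplify_closed(pts, tol_frac=0.02):
    """Simplify a closed outline with tolerance = tol_frac * perimeter."""
    pts = np.asarray(pts, dtype=float)
    closed = np.vstack([pts, pts[:1]])
    peri = np.linalg.norm(np.diff(closed, axis=0), axis=1).sum()
    eps = tol_frac * peri
    # Sketch assumption: cut the loop into two open chains at far-apart anchors.
    a = int(np.argmax(np.linalg.norm(pts - pts[0], axis=1)))
    b = int(np.argmax(np.linalg.norm(pts - pts[a], axis=1)))
    lo, hi = sorted((a, b))
    half1 = _rdp_open(pts[lo:hi + 1], eps)
    half2 = _rdp_open(np.vstack([pts[hi:], pts[:lo + 1]]), eps)
    return np.vstack([half1[:-1], half2[:-1]])

def rect_outline(w, h, step):
    """Densely sampled rectangle outline, like a traced page border."""
    top = [(x, 0) for x in range(0, w, step)]
    right = [(w, y) for y in range(0, h, step)]
    bottom = [(x, h) for x in range(w, 0, -step)]
    left = [(0, y) for y in range(h, 0, -step)]
    return np.array(top + right + bottom + left, dtype=float)

page = np.roll(rect_outline(200, 100, 10), -7, axis=0)   # start mid-edge
t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
mug = 100 * np.column_stack([np.cos(t), np.sin(t)])      # a round distractor

page_simple = simplify_closed(page)
mug_simple = simplify_closed(mug)
# The page collapses to its 4 corners; the circle needs more than 4 vertices.
print(len(page_simple), len(mug_simple))
```

The 60-point rectangle reduces to its four corners even though the outline starts in the middle of an edge, while no four vertices can stay within 2 percent of the perimeter of a circle, which is the property the `len(approx) == 4` test leans on.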

2. Stage 2: Order the Corners (Intermediate)

The contour hands us four corners in an arbitrary cyclic order, but getPerspectiveTransform pairs source to destination points by index: if our first destination corner is "top-left", the first source corner had better actually be the page's top-left. Feed the points in a rotated or reflected order and you get a perfectly valid homography to an upside-down or mirror-imaged page. The classic ordering trick uses two scalar functions of each corner $(x, y)$: the sum $x + y$ is smallest at the top-left and largest at the bottom-right; the difference $y - x$ is smallest at the top-right and largest at the bottom-left.

def order_corners(pts):
    """pts: (4, 2) array in any order -> [tl, tr, br, bl]."""
    s = pts.sum(axis=1)            # x + y
    d = np.diff(pts, axis=1)[:, 0] # y - x
    return np.array([pts[np.argmin(s)],    # top-left
                     pts[np.argmin(d)],    # top-right
                     pts[np.argmax(s)],    # bottom-right
                     pts[np.argmax(d)]],   # bottom-left
                    dtype=np.float32)
Code 5.6.2: Corner ordering by the sum/difference trick. It is reliable for the convex, roughly axis-aligned quadrilaterals a scanner sees; documents photographed at rotations near 45 degrees can fool it, which Exercise 5.6.1 explores.
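
A worked set of numbers makes the trick concrete. The corner coordinates below are hypothetical, chosen to resemble a mildly tilted page and deliberately shuffled; the distinctness check at the end is an addition of this sketch, not part of Code 5.6.2.

```python
import numpy as np

# Hypothetical corners of a mildly tilted page, in shuffled order.
pts = np.array([[420, 80],     # actually top-right
                [60, 110],     # actually top-left
                [30, 590],     # actually bottom-left
                [470, 620]],   # actually bottom-right
               dtype=np.float32)

s = pts.sum(axis=1)              # x + y:  500, 170, 620, 1090
d = np.diff(pts, axis=1)[:, 0]   # y - x: -340,  50, 560,  150

picks = [int(np.argmin(s)), int(np.argmin(d)),
         int(np.argmax(s)), int(np.argmax(d))]
ordered = pts[picks]             # [tl, tr, br, bl]
print(ordered.tolist())
# [[60.0, 110.0], [420.0, 80.0], [470.0, 620.0], [30.0, 590.0]]

# Cheap guard: the four roles should select four distinct corners.
assert len(set(picks)) == 4
```

The guard costs one line and turns a silent mirror-image or collapsed warp into an explicit failure that the caller can handle.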

3. Stage 3: Size the Output and Warp (Intermediate)

What size should the flattened page be? We measure the quadrilateral's edges in the photo and take the maximum of opposite sides as the output width and height. This preserves as much resolution as the photo captured and gets the aspect ratio approximately right. Only approximately: perspective foreshortening means the photographed side lengths are not the true paper proportions. Recovering the exact aspect ratio of a rectangle from one perspective view is possible, but it requires the camera's focal length, which belongs to the calibration story of Chapter 12. Production apps either do that or simply snap to known paper ratios (A4, Letter); we take the honest approximation.

def rectify(image_bgr, quad):
    """Warp the quadrilateral region into a flat, axis-aligned scan."""
    tl, tr, br, bl = order_corners(quad)

    W = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    H = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))

    src = np.array([tl, tr, br, bl], dtype=np.float32)
    dst = np.array([[0, 0], [W - 1, 0],
                    [W - 1, H - 1], [0, H - 1]], dtype=np.float32)

    M = cv2.getPerspectiveTransform(src, dst)   # 4 pairs -> 8 DoF, exact
    return cv2.warpPerspective(image_bgr, M, (W, H),
                               flags=cv2.INTER_LINEAR)
Code 5.6.3: Rectification: measure the output size from the quad's sides, build the 4-point correspondence, solve the homography, and inverse-warp at full resolution. Every line of this function is a section of this chapter in miniature.

Pause on getPerspectiveTransform for a moment, because it closes a loop opened in Section 5.1: a homography has 8 degrees of freedom, each point pair supplies 2 equations, and 4 pairs make the system exactly determined, so the function solves a small linear system and returns the unique homography through our corners. No RANSAC is needed here, unlike Section 5.5, because we have exactly four correspondences and trust all of them; the robustness lives upstream in the contour tests. The warp call then runs the inverse-mapping gather of Section 5.4 with the bilinear kernel of Section 5.3.
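
The "eight equations, eight unknowns" claim can be checked by hand. The sketch below builds the 8x8 system explicitly and solves it with NumPy, fixing the bottom-right entry of the matrix to 1 (a common normalization, assumed here). The function names `homography_from_4` and `apply_h` and the corner values are illustrative, and this is not a description of how OpenCV solves the system internally.

```python
import numpy as np

def homography_from_4(src, dst):
    """Solve for H (3x3, h33 fixed to 1) from exactly 4 point pairs."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), cleared of fractions
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h21 x + h22 y + h23) / (h31 x + h32 y + 1), cleared of fractions
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_h(H, pts):
    """Map (N, 2) points through H, including the perspective divide."""
    ph = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

# Hypothetical ordered corners [tl, tr, br, bl] and their target rectangle.
src = np.array([[60, 110], [420, 80], [470, 620], [30, 590]], dtype=float)
dst = np.array([[0, 0], [440, 0], [440, 541], [0, 541]], dtype=float)

H = homography_from_4(src, dst)
print(np.allclose(apply_h(H, src), dst, atol=1e-6))   # True
```

Each correspondence contributes two rows, so four of them fill the 8x8 matrix exactly. A fifth pair would overdetermine the system, which is where least squares and RANSAC enter in Section 5.5.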

4. Stage 4: Binarize Like a Scanner (Beginner)

A geometric rectangle of a photo still looks like a photo: gray paper, uneven lighting, a shadow from your hand. The "scanned document" look is a thresholding problem, and the right tool is the adaptive thresholding of Chapter 2, which computes a local threshold per neighborhood and therefore shrugs off illumination gradients that destroy any single global threshold:

def to_scan(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY,
                                 blockSize=21, C=10)

# The complete scanner, end to end:
image = cv2.imread("receipt_photo.jpg")
if image is None:                       # imread returns None on failure
    raise SystemExit("Could not read receipt_photo.jpg")
quad = find_page_quad(image)
if quad is None:
    raise SystemExit("No document found: check contrast with background")
flat = rectify(image, quad)
scan = to_scan(flat)
cv2.imwrite("scan.png", scan)
print(f"saved {scan.shape[1]}x{scan.shape[0]} scan")
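Code 5.6.4: Binarization and the short main program that chains all four stages. blockSize sets the neighborhood over which "local brightness" is judged; C is subtracted from the local weighted mean before the comparison, so only pixels clearly darker than their surroundings turn black: large enough to keep blank paper white, small enough to keep thin pen strokes.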
saved 1187x1684 scan
Output 5.6.4a: A representative run on a phone photo of an A4 page: the output dimensions land within about 1 percent of the true A4 ratio (1.414), the residual being the perspective aspect-ratio approximation discussed in stage 3.
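
Stage 3 mentioned that production apps may snap the output to a known paper ratio instead of trusting the measured sides. A minimal sketch of that idea, applied to the run above, might look as follows. The function name `snap_size`, the two-entry ratio table, and the 3 percent tolerance are all assumptions chosen for illustration.

```python
# Long-side / short-side ratios; the table and tolerance are assumptions.
PAPER_RATIOS = {"A4": 297 / 210, "Letter": 11 / 8.5}

def snap_size(W, H, tol=0.03):
    """Return (W, H, paper_name); sizes unchanged and name None if no match."""
    short, long_ = min(W, H), max(W, H)
    ratio = long_ / short
    name, target = min(PAPER_RATIOS.items(),
                       key=lambda kv: abs(kv[1] - ratio))
    if abs(ratio - target) / target > tol:
        return W, H, None                 # not close to any known paper
    snapped = round(short * target)       # keep the short side, fix the long one
    return (W, snapped, name) if H >= W else (snapped, H, name)

print(snap_size(1187, 1684))   # (1187, 1679, 'A4')
print(snap_size(1000, 1000))   # (1000, 1000, None)
```

The snapped width and height would replace W and H when building dst in Code 5.6.3. A tolerance well below the roughly 8 percent gap between the A4 and Letter ratios keeps the two from being confused.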

And that is the entire scanner: roughly one hundred lines including comments, no machine learning, latency dominated by the single full-resolution warp. The binarized output often shows speckle noise from paper texture and dust; cleaning that up with a morphological opening is literally the first worked example of the next chapter, which picks up this exact image.
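
To see why stage 4 uses a local rather than a global threshold, consider one synthetic scanline under an illumination ramp. This sketch is a simplification: it works in one dimension and uses a plain box mean instead of the Gaussian-weighted neighborhood of Code 5.6.4, and the brightness values and stroke positions are invented for the demonstration.

```python
import numpy as np

N, BLOCK, C = 200, 21, 10
paper = np.linspace(60, 200, N)             # illumination ramp, dim to bright
ink = np.zeros(N, dtype=bool)
for start in range(20, N, 40):              # five 3-pixel pen strokes
    ink[start:start + 3] = True
line = np.where(ink, 0.4 * paper, paper)    # ink reflects 40% of local light

# Global threshold: one number for the whole scanline.
global_white = line > 128

# Local threshold: box mean over BLOCK pixels, minus C (edges replicated).
r = BLOCK // 2
padded = np.pad(line, r, mode="edge")
local_mean = np.convolve(padded, np.ones(BLOCK) / BLOCK, mode="valid")
local_white = line > local_mean - C

truth_white = ~ink
global_errors = np.count_nonzero(global_white != truth_white)
local_errors = np.count_nonzero(local_white != truth_white)
print(global_errors, local_errors)          # prints: 91 0
```

The global threshold paints the dim half of the paper black, while the local rule classifies every pixel correctly because each pixel is compared only with its own neighborhood.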

Key Insight: Detect Cheap, Rectify Exact

The pipeline splits into a perception half and a geometry half with different error economics. Detection (stages 1-2) can run on a downscaled image, fail occasionally, and be retried with different parameters, because its output is just four numbers that are easy to sanity-check. Rectification (stages 3-4) is exact mathematics that must run at full resolution exactly once. This "cheap proposal, exact execution" split recurs throughout vision systems, and getting the resolution bookkeeping right at the boundary (Code 5.6.1's final division by scale) is where a disproportionate share of real-world bugs live.
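
The boundary bookkeeping is plain arithmetic, which is partly why it is easy to forget. The sketch below uses a hypothetical 3000x4000 photo and made-up corner positions to show what the division by scale restores.

```python
import numpy as np

PROC_HEIGHT = 600.0
full_h, full_w = 4000, 3000                  # hypothetical phone photo
scale = PROC_HEIGHT / full_h                 # 0.15

# Hypothetical corners [tl, tr, br, bl] found on the 600-pixel-high image.
quad_small = np.array([[45, 60], [405, 48], [420, 570], [30, 555]], dtype=float)
quad_full = quad_small / scale               # what the warp must receive

width_small = np.linalg.norm(quad_small[2] - quad_small[3])
width_full = np.linalg.norm(quad_full[2] - quad_full[3])
print(quad_full.round().astype(int).tolist())
# [[300, 400], [2700, 320], [2800, 3800], [200, 3700]]
print(round(width_small), round(width_full))   # 390 2602
```

Passing quad_small to the full-resolution warp would still run without error, but it would rectify a patch about 390 pixels wide near the photo's top-left corner instead of the 2602-pixel-wide page.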

Library Shortcut: imutils.four_point_transform

Stages 2 and 3 (corner ordering, output sizing, homography, and warp) are packaged in the imutils library as a single battle-tested call:

from imutils.perspective import four_point_transform
flat = four_point_transform(image, quad.reshape(4, 2))
Code 5.6.5: Corner ordering, output sizing, and the perspective warp delegated to imutils in a single call.

That replaces our order_corners plus rectify, roughly 30 lines, with 1, handling degenerate quads and dtype conversions internally. Detection and binarization remain yours, which is the right division: those are the stages you tune per application.

Practical Example: Receipts Are Not Rectangles

Who: An ML engineer at an expense-management startup whose app extracts totals from photographed receipts.

Situation: The OCR vendor's accuracy was excellent on flatbed scans but poor on user photos. The team inserted a scanner pipeline nearly identical to this section's in front of OCR.

Problem: Accuracy improved overall but stayed bad for a stubborn 20 percent of receipts. Inspection of the failures showed thermal-paper receipts that had been crumpled and re-flattened, or were curling off the table: their edges were detected fine, but a homography assumes a plane, and these were cylinders and crumple surfaces. Text lines stayed bent after rectification, and the OCR's line segmentation broke.

Decision: Ship the homography scanner for the 80 percent it fixed (per-receipt OCR field accuracy rose from 71 to 89 percent in their evaluation), route low-confidence OCR outputs to manual review, and prototype a learned dewarping model for the curled cases rather than stretching the geometric model past its assumptions.

Result: Support tickets about wrong totals dropped sharply; the dewarping prototype (based on the document-restoration models in the callout below) later recovered half of the residual failures.

Lesson: Know your model's load-bearing assumption. The homography's is planarity; when the world bends, no four points will save you, and the fix is a richer deformation model, not more parameter tuning.

5. Failure Modes and Hardening (Advanced)

Turning this demo into a product is mostly about the inputs that break it. Four failure classes account for nearly everything, and each maps to a specific upgrade path:

Research Frontier: Scanners That Learn

The 2024-2026 generation of document capture replaces each classical stage with a learned one while keeping this section's architecture recognizable. Page localization is now typically a lightweight segmentation network rather than Canny plus contours. For non-planar geometry, dewarping models regress a dense backward map (a per-pixel remap field, exactly Section 5.4's lookup-table view) instead of an 8-parameter homography: DocTr++ (Feng et al., 2023) and the grid-based UVDoc (Verhoeven et al., SIGGRAPH Asia 2023) flatten curled and folded pages, and DocRes (Zhang et al., CVPR 2024) unifies dewarping, deshadowing, deblurring, and appearance enhancement in one generalist model prompted per task. Benchmarks in this line still report the geometry through warped-distance metrics, and the models still emit warp fields executed by the very machinery you built in this chapter; what changed is who computes the field.
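
The "who computes the field" point can be made concrete with a small NumPy sketch. The first function turns a homography into a dense backward map; the second executes any backward map with a bilinear gather. The function names, the tiny test image, and the sinusoidal "curl" term are illustrative assumptions, and borders are clamped here, which is one choice among several.

```python
import numpy as np

def backward_map_from_homography(M, out_w, out_h):
    """Source (x, y) for every output pixel, given M: source -> output."""
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    dst_h = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    src_h = dst_h @ np.linalg.inv(M).T           # apply inverse homography
    return src_h[..., 0] / src_h[..., 2], src_h[..., 1] / src_h[..., 2]

def gather_bilinear(img, map_x, map_y):
    """Sample a grayscale image at fractional positions (borders clamped)."""
    h, w = img.shape
    x = np.clip(map_x, 0, w - 1)
    y = np.clip(map_y, 0, h - 1)
    x0 = np.minimum(np.floor(x).astype(int), w - 2)
    y0 = np.minimum(np.floor(y).astype(int), h - 2)
    fx, fy = x - x0, y - y0
    top = img[y0, x0] * (1 - fx) + img[y0, x0 + 1] * fx
    bot = img[y0 + 1, x0] * (1 - fx) + img[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

img = np.arange(30, dtype=float).reshape(5, 6)
M = np.array([[1, 0, -2.5], [0, 1, -1], [0, 0, 1.0]])   # source -> output

# Classical route: 8 parameters generate the whole field.
map_x, map_y = backward_map_from_homography(M, out_w=3, out_h=3)
flat = gather_bilinear(img, map_x, map_y)

# Learned route: a model would supply the field; the gather is unchanged.
curled = gather_bilinear(img, map_x, map_y + 0.3 * np.sin(map_x))
print(flat)
```

In OpenCV the same gather is available as cv2.remap, which takes the two coordinate maps directly. A dewarping network replaces only the function that fills map_x and map_y.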

Fun Fact

The sum/difference corner-ordering trick in Code 5.6.2 has been re-invented and re-blogged so many times that its origin is genuinely untraceable; it appears in graphics forums from the 1990s, OCR preprocessing papers, and at least one patent filing. It is the geometric equivalent of a folk song. The robust version (sort by angle around the centroid) is three lines longer and has an author on record, which tells you something about which solutions survive.

6. What This Project Taught (Beginner)

Walk back through the hundred lines and notice how the chapter's sections each carried a stage: the hierarchy (5.1) told us a photographed plane needs exactly a homography, no more, no less; homogeneous coordinates (5.2) are why getPerspectiveTransform returns a 3×3 matrix and why the warp divides by the homogeneous coordinate $W$; interpolation (5.3) fills every output pixel from fractional source positions; inverse mapping (5.4) is the reason the output has no holes; and the four corners are a tiny, trusted correspondence set, the same currency 5.5 earned with feature matching and RANSAC. One pipeline, five ideas, each load-bearing.

The scanner also hands the book its next problem. Its output is a binary image, and binary images have their own algebra: erosion to strip speckle, dilation to heal broken strokes, connected components to find characters, shape descriptors to classify them. That algebra is Chapter 6: Morphology, Binary Images & Shape, and it begins exactly where scan.png ends.

Exercise 5.6.1: Break the Corner Ordering (Conceptual)

Construct (on paper) a convex quadrilateral for which the sum/difference trick of Code 5.6.2 assigns two corners the same role, or the wrong roles. At what rotation angles of a long, thin receipt does this happen? Then describe the centroid-angle alternative (sort corners by atan2 around their mean) and explain why it cannot produce duplicate assignments, but still needs a rule to decide which sorted corner is "top-left".

Exercise 5.6.2: Scanner, Hardened (Coding)

Extend the scanner with two production features. (a) A fallback detection pass: if find_page_quad returns None, retry with Otsu-thresholded saturation and value channels (Chapter 2 tools) before giving up. (b) A quality gate: reject the detected quad if the ratio of its longest to shortest side exceeds 12 (receipt sanity), if any interior angle is below 35 degrees, or if opposite sides differ by more than 3x (extreme-tilt detector from this section's failure-mode list). Demonstrate both features on five of your own photos, including at least one deliberate failure case.

Exercise 5.6.3: How Wrong Corners Hurt (Analysis)

Perturb each of the four detected corners independently by Gaussian noise of $\sigma \in \{1, 2, 5, 10\}$ pixels before rectification, 50 trials each, and measure the damage to the output: (a) SSIM between the perturbed and unperturbed scans, and (b) if you have an OCR engine available (e.g. pytesseract), character error rate on a printed test page. Plot both against $\sigma$. Which corner perturbations hurt most, and why does the answer depend on the camera angle? Relate the shape of the curve to the homography's sensitivity as the quad degenerates.