"You photographed an A4 page as a trapezoid at a 40-degree angle, in restaurant lighting, with your thumb in frame. I will fix it. I always fix it. This is my whole personality now."
A Long-Suffering Document Scanner
This section spends the whole chapter at once: a working document scanner is page detection (finding four corner correspondences), model selection (a flat page photographed with perspective means a homography), estimation (four points, eight equations, eight unknowns), and execution (inverse-mapped warp with bilinear interpolation), finished with a binarization pass. About a hundred lines of Python reproduce the core of every mobile scanning app, and each line is a concept you can now name.
The previous sections built the theory in the order textbooks like: models, coordinates, interpolation, execution, estimation. Real projects run in the opposite direction, starting from a goal: the user photographs a sheet of paper at an angle, and the app must produce a clean, flat, high-contrast scan. This section walks the full distance from one to the other, including the unglamorous parts (corner ordering, resolution bookkeeping) where real implementations actually break. Figure 5.6.1 shows the route.
1. Stage 1: Find the Page Outline (Intermediate)
Our scanner's "correspondence problem" is friendlier than Section 5.5's general matching: we know the object of interest is a quadrilateral that contrasts with the background. The classical detection recipe is blur, edge detection, and contour analysis. We downscale first, both for speed and because edge detectors behave more consistently at a standard working resolution; the crucial bookkeeping is to remember the scale factor, because the final warp must run on the full-resolution original. Detection can be lossy; rectification must not be.
import cv2
import numpy as np

PROC_HEIGHT = 600.0  # standard working height for detection

def find_page_quad(image_bgr):
    """Return the page's 4 corners in FULL-RES coords, or None."""
    scale = PROC_HEIGHT / image_bgr.shape[0]
    small = cv2.resize(image_bgr, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)               # tame paper texture
    edges = cv2.Canny(gray, 75, 200)                       # edge map
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))   # bridge gaps
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:5]
    for c in contours:  # biggest first
        peri = cv2.arcLength(c, True)
        approx = cv2.approxPolyDP(c, 0.02 * peri, True)
        if len(approx) == 4 and cv2.isContourConvex(approx):
            # small.size counts all 3 channels, so size / 3 is the pixel count
            if cv2.contourArea(approx) > 0.2 * small.size / 3:
                return approx.reshape(4, 2).astype(np.float64) / scale
    return None
The final division by scale converts the corners back to full-resolution coordinates; it is the single most forgotten line in homemade scanners.
Each ingredient earns its place. The Gaussian blur (from Chapter 3) suppresses paper grain and carpet texture that would otherwise fill the edge map with confetti. Canny, which we use here as a black box and dissect properly in Chapter 9, traces intensity discontinuities; the dilation closes one-pixel gaps in the page border so the contour is a single closed curve. approxPolyDP simplifies each candidate contour with the Douglas-Peucker algorithm at a tolerance of 2 percent of the perimeter: page outlines survive as exactly 4 vertices, while sleeves, mugs, and shadows rarely do. The convexity and minimum-area tests reject the rest.
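To make the simplification step less of a black box, here is a minimal sketch of Douglas-Peucker on an open polyline. It is not OpenCV's implementation (approxPolyDP also handles closed curves internally), and the synthetic jittered rectangle and the helper names are assumptions made for illustration. The tolerance follows the text: 2 percent of the perimeter.

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all 2-vectors)."""
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:                       # a and b coincide
        return float(np.linalg.norm(p - a))
    t = np.clip(((p - a) @ ab) / denom, 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))

def douglas_peucker(points, eps):
    """Simplify an open polyline: drop points closer than eps to the chord."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    a, b = points[0], points[-1]
    dists = [point_segment_distance(p, a, b) for p in points[1:-1]]
    i = int(np.argmax(dists)) + 1          # index of the farthest point
    if dists[i - 1] <= eps:
        return np.array([a, b])            # everything in between is noise
    left = douglas_peucker(points[:i + 1], eps)
    right = douglas_peucker(points[i:], eps)
    return np.vstack([left[:-1], right])   # avoid duplicating the split point

# Synthetic outline: a 200 x 300 rectangle sampled every 10 px, walked once
# around and back to the start, with sub-pixel jitter (hypothetical data).
corners = np.array([[0, 0], [200, 0], [200, 300], [0, 300], [0, 0]], float)
outline = np.vstack([np.linspace(corners[k], corners[k + 1], 21)[:-1]
                     for k in range(4)] + [corners[-1:]])
k = np.arange(len(outline))
outline += 0.8 * np.column_stack([np.sin(k), np.cos(1.7 * k)])

perimeter = np.linalg.norm(np.diff(outline, axis=0), axis=1).sum()
simplified = douglas_peucker(outline, 0.02 * perimeter)
print(len(outline), "points ->", len(simplified), "points")
```

The 81 jittered samples collapse to the start point, the three other corners, and the closing point, which is the behaviour the four-vertex test relies on.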
2. Stage 2: Order the Corners (Intermediate)
The contour hands us four corners in an arbitrary cyclic order, but getPerspectiveTransform pairs source to destination points by index: if our first destination corner is "top-left", the first source corner had better actually be the page's top-left. Feed the points in a rotated or reflected order and you get a perfectly valid homography to an upside-down or mirror-imaged page. The classic ordering trick uses two scalar functions of each corner $(x, y)$: the sum $x + y$ is smallest at the top-left and largest at the bottom-right; the difference $y - x$ is smallest at the top-right and largest at the bottom-left.
def order_corners(pts):
    """pts: (4, 2) array in any order -> [tl, tr, br, bl]."""
    s = pts.sum(axis=1)                 # x + y
    d = np.diff(pts, axis=1)[:, 0]      # y - x
    return np.array([pts[np.argmin(s)],    # top-left
                     pts[np.argmin(d)],    # top-right
                     pts[np.argmax(s)],    # bottom-right
                     pts[np.argmax(d)]],   # bottom-left
                    dtype=np.float32)
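A quick way to build trust in the ordering rule is to feed it a scrambled quad and check the result. The sketch below repeats order_corners so it runs on its own; the corner coordinates are made up, and the distinct-roles guard is an illustrative addition rather than part of the scanner above.

```python
import numpy as np

def order_corners(pts):
    """pts: (4, 2) array in any order -> [tl, tr, br, bl]."""
    s = pts.sum(axis=1)                 # x + y
    d = np.diff(pts, axis=1)[:, 0]      # y - x
    return np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                     pts[np.argmax(s)], pts[np.argmax(d)]],
                    dtype=np.float32)

# Hypothetical trapezoidal page, corners listed in scrambled (x, y) order.
quad = np.array([[420, 610], [60, 40], [30, 580], [380, 70]], dtype=np.float64)
ordered = order_corners(quad)
print(ordered)

# Cheap guard: each input corner should have been given exactly one role.
all_distinct = len({tuple(p) for p in ordered.tolist()}) == 4
print("four distinct roles:", all_distinct)
```

For this quad the sums are 1030, 100, 610, 450 and the differences are 190, -20, 550, -310, so the four argmin/argmax picks land on four different corners. The guard is worth keeping in real code because nothing in the rule itself forces that to happen.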
3. Stage 3: Size the Output and Warp (Intermediate)
What size should the flattened page be? We measure the quadrilateral's edges in the photo and take the maximum of opposite sides as the output width and height. This preserves as much resolution as the photo captured and gets the aspect ratio approximately right. Only approximately: perspective foreshortening means the photographed side lengths are not the true paper proportions. Recovering the exact aspect ratio of a rectangle from one perspective view is possible, but it requires the camera's focal length, which belongs to the calibration story of Chapter 12. Production apps either do that or simply snap to known paper ratios (A4, Letter); we take the honest approximation.
def rectify(image_bgr, quad):
    """Warp the quadrilateral region into a flat, axis-aligned scan."""
    tl, tr, br, bl = order_corners(quad)
    W = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    H = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    src = np.array([tl, tr, br, bl], dtype=np.float32)
    dst = np.array([[0, 0], [W - 1, 0],
                    [W - 1, H - 1], [0, H - 1]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)  # 4 pairs -> 8 DoF, exact
    return cv2.warpPerspective(image_bgr, M, (W, H),
                               flags=cv2.INTER_LINEAR)
Pause on getPerspectiveTransform for a moment, because it closes a loop opened in Section 5.1: a homography has 8 degrees of freedom, each point pair supplies 2 equations, and 4 pairs make the system exactly determined, so the function solves a small linear system and returns the unique homography through our corners. No RANSAC is needed here, unlike Section 5.5, because we have exactly four correspondences and trust all of them; the robustness lives upstream in the contour tests. The warp call then runs the inverse-mapping gather of Section 5.4 with the bilinear kernel of Section 5.3.
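The "four points, eight equations, eight unknowns" count can be checked directly in NumPy. This is a sketch of the idea, not OpenCV's actual code path: it assumes the bottom-right entry of the matrix can be fixed to 1 (true for ordinary page quads), and the corner coordinates and helper names are hypothetical.

```python
import numpy as np

def homography_from_4(src, dst):
    """Solve the 8x8 linear system for H, with H[2, 2] fixed to 1."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), cleared of fractions
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h21 x + h22 y + h23) / (h31 x + h32 y + 1), cleared of fractions
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_h(H, pts):
    """Map (N, 2) points through H, including the homogeneous divide."""
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# Hypothetical ordered corners [tl, tr, br, bl] and a 391 x 541 output page.
src = np.array([[60, 40], [380, 70], [420, 610], [30, 580]], dtype=float)
dst = np.array([[0, 0], [390, 0], [390, 540], [0, 540]], dtype=float)

H = homography_from_4(src, dst)
print(np.round(apply_h(H, src), 6))   # reproduces dst up to rounding
```

Each correspondence contributes one row for u and one for v, so four pairs fill the 8x8 matrix exactly. There is no residual to minimise, which is why no least-squares or RANSAC machinery appears.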
4. Stage 4: Binarize Like a Scanner (Beginner)
A geometric rectangle of a photo still looks like a photo: gray paper, uneven lighting, a shadow from your hand. The "scanned document" look is a thresholding problem, and the right tool is the adaptive thresholding of Chapter 2, which computes a local threshold per neighborhood and therefore shrugs off illumination gradients that destroy any single global threshold:
def to_scan(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY,
                                 blockSize=21, C=10)

# The complete scanner, end to end:
image = cv2.imread("receipt_photo.jpg")
if image is None:  # imread returns None instead of raising on a bad path
    raise SystemExit("Could not read receipt_photo.jpg")
quad = find_page_quad(image)
if quad is None:
    raise SystemExit("No document found: check contrast with background")
flat = rectify(image, quad)
scan = to_scan(flat)
cv2.imwrite("scan.png", scan)
print(f"saved {scan.shape[1]}x{scan.shape[0]} scan")
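blockSize sets the neighborhood over which "local brightness" is judged; C is the constant subtracted from the local weighted mean, biasing the threshold so that flat paper turns white while thin pen strokes stay black.
Output: saved 1187x1684 scan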
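The difference between a global and a local threshold is easy to see on a synthetic page. The sketch below is a NumPy stand-in, not OpenCV's routine: it uses a plain box mean rather than the Gaussian-weighted mean, its edge-replicating border handling is an assumption, and the page with a left-to-right lighting gradient is invented for the demonstration. blockSize=21 and C=10 are taken from the code above.

```python
import numpy as np

def box_mean(gray, block):
    """Mean over a block x block window (odd block), via an integral image."""
    r = block // 2
    padded = np.pad(gray.astype(np.float64), r, mode="edge")
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = gray.shape
    total = (ii[block:block + h, block:block + w] - ii[:h, block:block + w]
             - ii[block:block + h, :w] + ii[:h, :w])
    return total / (block * block)

def adaptive_mean_threshold(gray, block=21, C=10):
    """White where the pixel exceeds (local mean - C), black elsewhere."""
    return np.where(gray > box_mean(gray, block) - C, 255, 0).astype(np.uint8)

# Synthetic page: paper brightness falls from 230 (left) to 90 (right).
h, w = 60, 200
paper = np.tile(np.linspace(230.0, 90.0, w), (h, 1))
ink = np.zeros((h, w), dtype=bool)
for col in (20, 100, 180):               # three thin vertical pen strokes
    ink[10:50, col:col + 2] = True
gray = np.where(ink, 0.4 * paper, paper)

global_bw = np.where(gray > 128, 255, 0).astype(np.uint8)
local_bw = adaptive_mean_threshold(gray)

print("global threshold recovers the ink exactly:",
      np.array_equal(global_bw == 0, ink))
print("local threshold recovers the ink exactly: ",
      np.array_equal(local_bw == 0, ink))
```

The global threshold paints the dim right-hand side of the page black, because the paper there is darker than the ink on the bright side. The local rule only asks whether a pixel is darker than its own surroundings.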
And that is the entire scanner: roughly one hundred lines including comments, no machine learning, latency dominated by the single full-resolution warp. The binarized output often shows speckle noise from paper texture and dust; cleaning that up with a morphological opening is literally the first worked example of the next chapter, which picks up this exact image.
The pipeline splits into a perception half and a geometry half with different error economics. Detection (stages 1-2) can run on a downscaled image, fail occasionally, and be retried with different parameters, because its output is just four points that are easy to sanity-check. Rectification (stages 3-4) is exact mathematics that must run at full resolution exactly once. This "cheap proposal, exact execution" split recurs throughout vision systems, and getting the resolution bookkeeping right at the boundary (Code 5.6.1's final division by scale) is where a disproportionate share of real-world bugs lives.
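The bookkeeping at that boundary is only a division, which is probably why it gets forgotten. A small numeric sketch, assuming a hypothetical 3000 x 4000 portrait photo and invented corner coordinates:

```python
import numpy as np

PROC_HEIGHT = 600.0
full_h, full_w = 4000, 3000              # hypothetical portrait photo
scale = PROC_HEIGHT / full_h             # 0.15: detection runs at 450 x 600

# Corners as the detector sees them, in downscaled (x, y) pixels.
corners_small = np.array([[45.0, 60.0], [405.0, 75.0],
                          [420.0, 555.0], [30.0, 540.0]])
corners_full = corners_small / scale     # back to full-resolution pixels

def bbox_area(pts):
    """Area of the axis-aligned bounding box of a point set."""
    span = pts.max(axis=0) - pts.min(axis=0)
    return float(span[0] * span[1])

# Skipping the division makes the warp sample a region scale**2 as large.
shrink = bbox_area(corners_small) / bbox_area(corners_full)
inside = bool(np.all((corners_full >= 0) &
                     (corners_full <= [full_w - 1, full_h - 1])))

print(corners_full)
print(f"area kept if the division is forgotten: {shrink:.4f}")
print("corners inside the full-res frame:", inside)
```

A bounds check like the last line is a cheap assertion to leave in place. Corners that fall outside the frame or hug its top-left corner are a strong hint that coordinates from the wrong resolution reached the warp.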
Stages 2 and 3 (corner ordering, output sizing, homography, and warp) are packaged in the imutils library as a single battle-tested call:
from imutils.perspective import four_point_transform
flat = four_point_transform(image, quad.reshape(4, 2))
That replaces our order_corners plus rectify, roughly 30 lines, with 1, handling degenerate quads and dtype conversions internally. Detection and binarization remain yours, which is the right division: those are the stages you tune per application.
Who: An ML engineer at an expense-management startup whose app extracts totals from photographed receipts.
Situation: The OCR vendor's accuracy was excellent on flatbed scans but poor on user photos. The team inserted a scanner pipeline nearly identical to this section's in front of OCR.
Problem: Accuracy improved overall but stayed bad for a stubborn 20 percent of receipts. Inspection of the failures showed thermal-paper receipts that had been crumpled and re-flattened, or were curling off the table: their edges were detected fine, but a homography assumes a plane, and these were cylinders and crumple surfaces. Text lines stayed bent after rectification, and the OCR's line segmentation broke.
Decision: Ship the homography scanner for the 80 percent it fixed (per-receipt OCR field accuracy rose from 71 to 89 percent in their evaluation), route low-confidence OCR outputs to manual review, and prototype a learned dewarping model for the curled cases rather than stretching the geometric model past its assumptions.
Result: Support tickets about wrong totals dropped sharply; the dewarping prototype (based on the document-restoration models in the callout below) later recovered half of the residual failures.
Lesson: Know your model's load-bearing assumption. The homography's is planarity; when the world bends, no four points will save you, and the fix is a richer deformation model, not more parameter tuning.
5. Failure Modes and Hardening (Advanced)
Turning this demo into a product is mostly about the inputs that break it. Four failure classes account for nearly everything, and each maps to a specific upgrade path:
- Low edge contrast. White paper on a white desk gives Canny nothing. Mitigations: try multiple threshold pairs, run detection per color channel and on a saturation channel, or fall back to asking the user to tap the corners. The durable fix is replacing stage 1 with a segmentation model, the approach of Chapter 24, which is exactly what modern phone scanners do.
- Distractor quadrilaterals. Laptops, monitors, and floor tiles are large convex quads. Mitigations: prefer the quad containing the image center, score candidates by text-like high-frequency content inside, or track stability across video frames.
- Non-planar pages. Books near the spine, curled receipts. The homography is structurally wrong; see the practical example and research frontier.
- Extreme angles. Beyond roughly 60 degrees of tilt, the far edge's effective resolution collapses; the warp magnifies a few hundred captured pixels into a thousand output pixels of mush. Detect by comparing opposite side lengths and prompt for a re-shoot; no interpolation from Section 5.3 can manufacture detail the sensor never sampled.
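The extreme-angle check in the last item can be sketched as a ratio of opposite side lengths. The function names, the example quads, and the cutoff of 3.0 are assumptions for illustration; a real product would tune the cutoff on its own photos.

```python
import numpy as np

MAX_OPPOSITE_RATIO = 3.0   # assumed cutoff, to be tuned per application

def opposite_side_ratio(quad):
    """quad: (4, 2) corners ordered [tl, tr, br, bl].

    Returns the larger of the two long/short ratios of opposite sides.
    """
    tl, tr, br, bl = np.asarray(quad, dtype=float)
    top, bottom = np.linalg.norm(tr - tl), np.linalg.norm(br - bl)
    left, right = np.linalg.norm(bl - tl), np.linalg.norm(br - tr)
    eps = 1e-9                                   # avoid division by zero
    horizontal = max(top, bottom) / max(min(top, bottom), eps)
    vertical = max(left, right) / max(min(left, right), eps)
    return max(horizontal, vertical)

def needs_reshoot(quad, max_ratio=MAX_OPPOSITE_RATIO):
    """True when foreshortening suggests the far edge is badly undersampled."""
    return opposite_side_ratio(quad) > max_ratio

frontal = [[0, 0], [400, 0], [400, 560], [0, 560]]       # nearly head-on
steep = [[200, 0], [300, 0], [500, 300], [0, 300]]       # far edge 5x shorter

for name, q in (("frontal", frontal), ("steep", steep)):
    print(name, round(opposite_side_ratio(q), 2), needs_reshoot(q))
```

The check costs a handful of norms and runs before the full-resolution warp, so a bad capture can be rejected while the user is still holding the phone over the page.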
The 2024-2026 generation of document capture replaces each classical stage with a learned one while keeping this section's architecture recognizable. Page localization is now typically a lightweight segmentation network rather than Canny plus contours. For non-planar geometry, dewarping models regress a dense backward map (a per-pixel remap field, exactly Section 5.4's lookup-table view) instead of an 8-parameter homography: DocTr++ (Feng et al., 2023) and the grid-based UVDoc (Verhoeven et al., SIGGRAPH Asia 2023) flatten curled and folded pages, and DocRes (Zhang et al., CVPR 2024) unifies dewarping, deshadowing, deblurring, and appearance enhancement in one generalist model prompted per task. Benchmarks in this line still report the geometry through warped-distance metrics, and the models still emit warp fields executed by the very machinery you built in this chapter; what changed is who computes the field.
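The "who computes the field" point can be made concrete with a few lines of NumPy: a dense backward map is applied by the same gather-and-interpolate loop, whatever produced it. This is a sketch for single-channel images with border clamping as an assumed convention, and the sinusoidal "curl" field is invented; it does not reproduce the output format of any of the models named above.

```python
import numpy as np

def remap_bilinear(img, map_x, map_y):
    """Gather: out[i, j] = img sampled bilinearly at (map_x[i, j], map_y[i, j])."""
    h, w = img.shape[:2]
    x = np.clip(map_x, 0, w - 1)                 # clamp lookups to the image
    y = np.clip(map_y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0                      # fractional offsets
    img = img.astype(np.float64)
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy

h, w = 40, 60
yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
img = xx + 10.0 * yy                             # a ramp, so results are checkable

# Hypothetical dewarping field: each column is shifted vertically by a
# different amount, something no single 3x3 homography can express.
map_x = xx.copy()
map_y = yy + 2.0 * np.sin(2 * np.pi * xx / w)

flat = remap_bilinear(img, map_x, map_y)
expected = map_x + 10.0 * map_y                  # exact away from the clamped border
print(np.allclose(flat[3:37], expected[3:37]))
```

A homography is the special case where map_x and map_y come from eight numbers and a homogeneous divide; a learned dewarper fills the same two arrays from a network instead.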
The sum/difference corner-ordering trick in Code 5.6.2 has been re-invented and re-blogged so many times that its origin is genuinely untraceable; it appears in graphics forums from the 1990s, OCR preprocessing papers, and at least one patent filing. It is the geometric equivalent of a folk song. The robust version (sort by angle around the centroid) is three lines longer and has an author on record, which tells you something about which solutions survive.
6. What This Project Taught (Beginner)
Walk back through the hundred lines and notice how the chapter's sections each carried a stage: the hierarchy (5.1) told us a photographed plane needs exactly a homography, no more, no less; homogeneous coordinates (5.2) are why getPerspectiveTransform returns a 3×3 matrix and why the warp divides by the homogeneous coordinate $w$; interpolation (5.3) fills every output pixel from fractional source positions; inverse mapping (5.4) is the reason the output has no holes; and the four corners are a tiny, trusted correspondence set, the same currency 5.5 earned with feature matching and RANSAC. One pipeline, five ideas, each load-bearing.
The scanner also hands the book its next problem. Its output is a binary image, and binary images have their own algebra: erosion to strip speckle, dilation to heal broken strokes, connected components to find characters, shape descriptors to classify them. That algebra is Chapter 6: Morphology, Binary Images & Shape, and it begins exactly where scan.png ends.
Construct (on paper) a convex quadrilateral for which the sum/difference trick of Code 5.6.2 assigns two corners the same role, or the wrong roles. At what rotation angles of a long, thin receipt does this happen? Then describe the centroid-angle alternative (sort corners by atan2 around their mean) and explain why it cannot produce duplicate assignments, but still needs a rule to decide which sorted corner is "top-left".
Extend the scanner with two production features. (a) A fallback detection pass: if find_page_quad returns None, retry with Otsu-thresholded saturation and value channels (Chapter 2 tools) before giving up. (b) A quality gate: reject the detected quad if the ratio of its longest to shortest side exceeds 12 (receipt sanity), if any interior angle is below 35 degrees, or if opposite sides differ by more than 3x (extreme-tilt detector from this section's failure-mode list). Demonstrate both features on five of your own photos, including at least one deliberate failure case.
Perturb each of the four detected corners independently by Gaussian noise of $\sigma \in \{1, 2, 5, 10\}$ pixels before rectification, 50 trials each, and measure the damage to the output: (a) SSIM between the perturbed and unperturbed scans, and (b) if you have an OCR engine available (e.g. pytesseract), character error rate on a printed test page. Plot both against $\sigma$. Which corner perturbations hurt most, and why does the answer depend on the camera angle? Relate the shape of the curve to the homography's sensitivity as the quad degenerates.