Section 5.6: Worked Example: A Document Scanner from Scratch

"You photographed an A4 page as a trapezoid at a 40-degree angle, in restaurant lighting, with your thumb in frame. I will fix it. I always fix it. This is my whole personality now."
A Long-Suffering Document Scanner

Big Picture

This section spends the whole chapter at once: a working document scanner is page detection (finding four corner correspondences), model selection (a flat page photographed with perspective means a homography), estimation (four points, eight equations, eight unknowns), and execution (inverse-mapped warp with bilinear interpolation), finished with a binarization pass. About a hundred lines of Python reproduce the core of every mobile scanning app, and each line is a concept you can now name.

In about a hundred lines of Python, you are about to rebuild the core of every mobile scanning app on the planet. The chapter taught the theory bottom up: models, coordinates, interpolation, execution, estimation. Real projects run the other way, starting from a goal: the user photographs a page at an angle, and the app must produce a clean, flat, high-contrast scan. This section walks the full distance from goal to working code, including the unglamorous parts (corner ordering, resolution bookkeeping) where real implementations actually break. Figure 5.6.1 shows the route.

Figure 5.6.1: The scanner pipeline. Stages 1 to 3 manufacture the four corner correspondences that Section 5.5 would have gotten from feature matching; stage 4 is the homography estimation and inverse warp of Sections 5.1 to 5.4; stage 5 is classic thresholding from Chapter 2. Each stage hands a strictly simpler object to the next: image, edge map, four points, rectangle, binary scan.

1. Stage 1: Find the Page Outline Intermediate

Our scanner's "correspondence problem" is friendlier than Section 5.5's general matching: we know the object of interest is a quadrilateral that contrasts with the background. The classical detection recipe is blur, edge detection, and contour analysis. We downscale first, both for speed and because edge detectors behave more consistently at a standard working resolution; the crucial bookkeeping is to remember the scale factor, because the final warp must run on the full-resolution original. Detection can be lossy; rectification must not be.

# Stage 1 of the scanner: locate the page as the largest convex
# quadrilateral in a downscaled edge map, then scale its corners back to
# full-resolution coordinates so the later warp loses no detail.
import cv2
import numpy as np

PROC_HEIGHT = 600.0          # standard working height for detection

def find_page_quad(image_bgr):
    """Return the page's 4 corners in FULL-RES coords, or None."""
    scale = PROC_HEIGHT / image_bgr.shape[0]
    small = cv2.resize(image_bgr, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_AREA)

    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)        # tame paper texture
    edges = cv2.Canny(gray, 75, 200)                # edge map
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))  # bridge gaps

    contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:5]

    for c in contours:                              # biggest first
        peri = cv2.arcLength(c, True)
        approx = cv2.approxPolyDP(c, 0.02 * peri, True)
        if len(approx) == 4 and cv2.isContourConvex(approx):
            frame_pixels = small.shape[0] * small.shape[1]
            if cv2.contourArea(approx) > 0.2 * frame_pixels:  # >= 20% of frame
                return approx.reshape(4, 2).astype(np.float64) / scale
    return None

Code 5.6.1: Page detection: blur, Canny, dilate, then scan the largest contours for a big convex quadrilateral. The final division by scale converts the corners back to full-resolution coordinates, the single most forgotten line in homemade scanners.

Each ingredient earns its place. The Gaussian blur (from Chapter 3) suppresses paper grain and carpet texture that would otherwise fill the edge map with confetti. Canny, which we use here as a black box and dissect properly in Chapter 9, traces intensity discontinuities; the dilation closes one-pixel gaps in the page border so the contour is a single closed curve. approxPolyDP simplifies each candidate contour with the Douglas-Peucker algorithm at a tolerance of 2 percent of the perimeter: page outlines survive as exactly 4 vertices, while sleeves, mugs, and shadows rarely do. The convexity and minimum-area tests reject the rest.
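To make the approxPolyDP step less of a black box, the following is a minimal pure-Python sketch of Douglas-Peucker simplification on an open polyline. It is an illustration under simplifying assumptions, not OpenCV's implementation: the names point_segment_distance and rdp and the sample path are invented for this example, and the extra handling that closed contours need is omitted.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:                      # a and b coincide
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))                  # clamp onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def rdp(points, epsilon):
    """Keep the endpoints; recurse on the farthest point if it exceeds epsilon."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    dists = [point_segment_distance(p, first, last) for p in points[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= epsilon:
        return [first, last]                   # everything in between is noise
    split = worst + 1                          # index into points
    left = rdp(points[:split + 1], epsilon)
    right = rdp(points[split:], epsilon)
    return left[:-1] + right                   # drop the duplicated split point

# A wobbly path along a top edge and then down a right edge (hypothetical data).
path = [(0, 0), (25, 1), (50, -1), (75, 1), (100, 0),
        (101, 25), (99, 50), (100, 75), (100, 100)]
length = sum(math.hypot(bx - ax, by - ay)
             for (ax, ay), (bx, by) in zip(path, path[1:]))
simplified = rdp(path, epsilon=0.02 * length)  # 2 percent of the path length
print(simplified)
```

The one-pixel wobble along each edge is far below the tolerance and disappears, while the true corner is far above it and survives. That is the property the page detector leans on when it asks for exactly four vertices.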

2. Stage 2: Order the Corners Intermediate

The contour hands us four corners in an arbitrary cyclic order, but getPerspectiveTransform pairs source to destination points by index: if our first destination corner is "top-left", the first source corner had better actually be the page's top-left. Feed the points in a rotated or reflected order and you get a perfectly valid homography to an upside-down or mirror-imaged page. The classic ordering trick uses two scalar functions of each corner $(x, y)$: the sum $x + y$ is smallest at the top-left and largest at the bottom-right; the difference $y - x$ is smallest at the top-right and largest at the bottom-left.

The intuition is that the two diagonals point in perpendicular directions. One scalar (the sum) separates the corners along the main diagonal, the other (the difference) separates them along the anti-diagonal, and between them every corner gets a unique label.

# Stage 2: put the four detected corners into a canonical order so that
# getPerspectiveTransform pairs them with the right destination corners.
# The sum x+y and difference y-x identify each corner's role.
def order_corners(pts):
    """pts: (4, 2) array in any order -> [tl, tr, br, bl]."""
    s = pts.sum(axis=1)            # x + y
    d = np.diff(pts, axis=1)[:, 0] # y - x
    return np.array([pts[np.argmin(s)],    # top-left
                     pts[np.argmin(d)],    # top-right
                     pts[np.argmax(s)],    # bottom-right
                     pts[np.argmax(d)]],   # bottom-left
                    dtype=np.float32)

Code 5.6.2: Corner ordering by the sum/difference trick. It is reliable for the convex, roughly axis-aligned quadrilaterals a scanner sees; documents photographed at rotations near 45 degrees can fool it, which Exercise 5.6.1 explores.
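A worked numeric example makes the trick concrete. The sketch below uses four made-up corner coordinates for a slightly tilted page, shuffled on purpose, and prints which row each rule selects. The final distinctness check is an assumption of this example rather than part of Code 5.6.2, but it is a cheap guard worth considering in practice.

```python
import numpy as np

# Hypothetical corners of a slightly tilted page, deliberately shuffled.
pts = np.array([[420.0, 380.0],
                [ 60.0,  40.0],
                [ 30.0, 400.0],
                [400.0,  20.0]])

s = pts[:, 0] + pts[:, 1]      # x + y for each corner
d = pts[:, 1] - pts[:, 0]      # y - x for each corner

roles = {
    "top-left": int(np.argmin(s)),
    "top-right": int(np.argmin(d)),
    "bottom-right": int(np.argmax(s)),
    "bottom-left": int(np.argmax(d)),
}
for name, idx in roles.items():
    print(f"{name:12s} <- row {idx}  (x+y={s[idx]:.0f}, y-x={d[idx]:.0f})")

# If two roles land on the same row, the ordering has failed.
all_distinct = len(set(roles.values())) == 4
print("every corner got its own role:", all_distinct)
```

Here the sums are 800, 100, 430, 420 and the differences are -40, -20, 370, -380, so the extremes are unambiguous and each corner receives exactly one role.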

3. Stage 3: Size the Output and Warp Intermediate

What size should the flattened page be? We measure the quadrilateral's edges in the photo and take the maximum of opposite sides as the output width and height. This preserves as much resolution as the photo captured and gets the aspect ratio approximately right. Only approximately: perspective foreshortening means the photographed side lengths are not the true paper proportions. Recovering the exact aspect ratio of a rectangle from one perspective view is possible, but it requires the camera's focal length, which belongs to the calibration story of Chapter 12. Production apps either do that or simply snap to known paper ratios (A4, Letter); we take the honest approximation.

# Stage 3: size the flat output from the quad's own side lengths, then
# solve the exact 4-point homography and inverse-warp the page to a
# rectangle. Four pairs pin down all eight homography parameters.
def rectify(image_bgr, quad):
    """Warp the quadrilateral region into a flat, axis-aligned scan."""
    tl, tr, br, bl = order_corners(quad)

    W = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    H = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))

    src = np.array([tl, tr, br, bl], dtype=np.float32)
    dst = np.array([[0, 0], [W - 1, 0],
                    [W - 1, H - 1], [0, H - 1]], dtype=np.float32)

    M = cv2.getPerspectiveTransform(src, dst)   # 4 pairs -> 8 DoF, exact
    return cv2.warpPerspective(image_bgr, M, (W, H),
                               flags=cv2.INTER_LINEAR)

Code 5.6.3: Rectification: measure the output size from the quad's sides, build the 4-point correspondence, solve the homography, and inverse-warp at full resolution. Every line of this function is a section of this chapter in miniature.
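The "snap to known paper ratios" alternative mentioned in the sizing discussion can be sketched in a few lines. Everything here is an assumption for illustration: the function name snap_to_paper_ratio, the two-entry ratio table, and the 5 percent tolerance are not taken from any particular app.

```python
import math

# Long-side / short-side ratios of two common paper formats.
PAPER_RATIOS = {"A4": math.sqrt(2.0), "Letter": 11.0 / 8.5}

def snap_to_paper_ratio(width, height, tolerance=0.05):
    """Return (width, height, format_name); name is None if nothing is close."""
    long_side, short_side = max(width, height), min(width, height)
    measured = long_side / short_side
    name, ratio = min(PAPER_RATIOS.items(),
                      key=lambda kv: abs(kv[1] - measured))
    if abs(ratio - measured) / ratio > tolerance:
        return width, height, None             # not a known format: leave as is
    # Keep the long side (it carries the most resolution), fix the short one.
    new_short = int(round(long_side / ratio))
    if height >= width:
        return new_short, height, name
    return width, new_short, name

print(snap_to_paper_ratio(1187, 1684))   # a measured portrait page
print(snap_to_paper_ratio(1000, 1000))   # a square: no format matches
```

The snapped width and height would replace W and H before the destination rectangle is built. The trade-off is that a genuinely non-standard document gets squeezed into the nearest standard shape unless the tolerance rejects it.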

Pause on getPerspectiveTransform for a moment, because it closes a loop opened in Section 5.1: a homography has 8 degrees of freedom, each point pair supplies 2 equations, and 4 pairs make the system exactly determined, so the function solves a small linear system and returns the unique homography through our corners. No RANSAC is needed here, unlike Section 5.5, because we have exactly four correspondences and trust all of them; the robustness lives upstream in the contour tests. The warp call then runs the inverse-mapping gather of Section 5.4 with the bilinear kernel of Section 5.3.
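The "four points, eight equations, eight unknowns" claim can be checked by hand. The NumPy sketch below builds the 8×8 system with the bottom-right entry of the matrix fixed to 1 and solves it directly. It is a teaching sketch rather than OpenCV's implementation, and the function names and sample corners are invented for the example.

```python
import numpy as np

def homography_from_4_points(src, dst):
    """Solve for the 3x3 H (with H[2, 2] = 1) mapping each src point to dst."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), cleared of fractions
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        # v = (h21 x + h22 y + h23) / (h31 x + h32 y + 1), cleared of fractions
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=np.float64),
                        np.array(b, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Map (N, 2) points through H, including the divide by w."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # to homogeneous
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]              # back to Cartesian

# Hypothetical ordered corners [tl, tr, br, bl] and a 360 x 380 output.
src = np.array([[60, 40], [400, 20], [420, 380], [30, 400]], dtype=np.float64)
dst = np.array([[0, 0], [359, 0], [359, 379], [0, 379]], dtype=np.float64)

H = homography_from_4_points(src, dst)
print(np.round(apply_homography(H, src), 6))   # lands on dst
```

Each correspondence contributes one row for the horizontal coordinate and one for the vertical, which is where the 2 × 4 = 8 equations come from. Fixing one entry to 1 removes the overall scale and leaves the 8 unknowns.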

4. Stage 4: Binarize Like a Scanner Beginner

A geometric rectangle of a photo still looks like a photo: gray paper, uneven lighting, a shadow from your hand. The "scanned document" look is a thresholding problem, and the right tool is the adaptive thresholding of Chapter 2, which computes a local threshold per neighborhood and therefore shrugs off illumination gradients that destroy any single global threshold:

# Stage 4 plus the main program: adaptive thresholding gives the crisp
# black-on-white scanner look despite uneven lighting, and the bottom
# block chains detection, rectification, and binarization end to end.
def to_scan(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY,
                                 blockSize=21, C=10)

# The complete scanner, end to end:
image = cv2.imread("receipt_photo.jpg")
quad = find_page_quad(image)
if quad is None:
    raise SystemExit("No document found: check contrast with background")
flat = rectify(image, quad)
scan = to_scan(flat)
cv2.imwrite("scan.png", scan)
print(f"saved {scan.shape[1]}x{scan.shape[0]} scan")

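Code 5.6.4: Binarization and the eight-line main program that chains all four stages. blockSize sets the neighborhood over which "local brightness" is judged; C is subtracted from that local weighted average, so near-uniform paper lands above the threshold and turns white while darker pen strokes stay black.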

saved 1187x1684 scan

Output 5.6.4a: A representative run on a phone photo of an A4 page: the output dimensions land within about 1 percent of the true A4 ratio (1.414), the residual being the perspective aspect-ratio approximation discussed in stage 3.
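To see why a local threshold survives an illumination gradient that defeats a global one, here is a NumPy-only sketch on a synthetic page. It implements the plain-mean variant of adaptive thresholding (the listing above uses the Gaussian-weighted variant) with an integral image. The helper names and the synthetic lighting ramp are assumptions made for this illustration.

```python
import numpy as np

def local_mean(gray, block_size):
    """Mean over a block_size x block_size window, with edges replicated."""
    k, r = block_size, block_size // 2
    padded = np.pad(gray.astype(np.float64), r, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    window_sum = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
                  - integral[k:k + h, :w] + integral[:h, :w])
    return window_sum / (k * k)

def adaptive_mean_threshold(gray, block_size=21, C=10):
    """White where the pixel exceeds (local mean - C), black elsewhere."""
    threshold = local_mean(gray, block_size) - C
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Synthetic page: lighting ramps from dim (left) to bright (right),
# with two thin vertical "pen strokes" that are darker than the paper.
h, w = 60, 200
page = np.tile(np.linspace(80.0, 220.0, w), (h, 1))
ink = np.zeros((h, w), dtype=bool)
ink[20:40, 40:43] = True
ink[20:40, 160:163] = True
page[ink] *= 0.4
gray = page.astype(np.uint8)

global_scan = np.where(gray > 128, 255, 0).astype(np.uint8)
local_scan = adaptive_mean_threshold(gray)

paper_white_global = (global_scan[~ink] == 255).mean()
paper_white_local = (local_scan[~ink] == 255).mean()
print(f"paper kept white: global {paper_white_global:.2f}, "
      f"local {paper_white_local:.2f}")
```

The global threshold turns the dim left third of the paper black, while the local rule keeps all of the paper white and both strokes black, because each pixel is only ever compared with its own neighborhood.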

And that is the entire scanner: roughly one hundred lines including comments, no machine learning, latency dominated by the single full-resolution warp. The binarized output often shows speckle noise from paper texture and dust; cleaning that up with a morphological opening is literally the first worked example of the next chapter, which picks up this exact image.

Key Insight: Detect Cheap, Rectify Exact

The pipeline splits into a perception half and a geometry half with different error economics. Detection (stages 1-2) can run on a downscaled image, fail occasionally, and be retried with different parameters, because its output is just four points that are easy to sanity-check. Rectification (stage 3) is exact mathematics that must run at full resolution exactly once. This "cheap proposal, exact execution" split recurs throughout vision systems, and getting the resolution bookkeeping right at the boundary (Code 5.6.1's final division by scale) is where a disproportionate share of real-world bugs lives.

Library Shortcut: imutils.four_point_transform

Stages 2 and 3 (corner ordering, output sizing, homography, and warp) are packaged in the imutils library as a single battle-tested call:

# imutils folds corner ordering, output sizing, and the perspective warp
# (our order_corners plus rectify) into a single tested call; detection
# and binarization stay in your hands as the per-application stages.
from imutils.perspective import four_point_transform
flat = four_point_transform(image, quad.reshape(4, 2))

Code 5.6.5: Corner ordering, output sizing, and the perspective warp delegated to imutils in a single call.

That replaces our order_corners plus rectify, roughly 30 lines, with a single call that handles corner ordering and dtype conversions internally. Detection and binarization remain yours, which is the right division: those are the stages you tune per application.

Practical Example: Receipts Are Not Rectangles

Who: An ML engineer at an expense-management startup whose app extracts totals from photographed receipts.

Situation: The optical character recognition (OCR) vendor's accuracy was excellent on flatbed scans but poor on user photos. The team inserted a scanner pipeline nearly identical to this section's in front of OCR.

Problem: Accuracy improved overall but stayed bad for a stubborn 20 percent of receipts. Inspection of the failures showed thermal-paper receipts that had been crumpled and re-flattened, or were curling off the table: their edges were detected fine, but a homography assumes a plane, and these were cylinders and crumple surfaces. Text lines stayed bent after rectification, and the OCR's line segmentation broke.

Decision: Ship the homography scanner for the 80 percent it fixed (per-receipt OCR field accuracy rose from 71 to 89 percent in their evaluation), route low-confidence OCR outputs to manual review, and prototype a learned dewarping model for the curled cases rather than stretching the geometric model past its assumptions.

Result: Support tickets about wrong totals dropped sharply; the dewarping prototype (based on the document-restoration models in the callout below) later recovered half of the residual failures.

Lesson: Know your model's load-bearing assumption. The homography's is planarity; when the world bends, no four points will save you, and the fix is a richer deformation model, not more parameter tuning.

5. Failure Modes and Hardening Advanced

Turning this demo into a product is mostly about the inputs that break it. Four failure classes account for nearly everything, and each maps to a specific upgrade path:

Low edge contrast. White paper on a white desk gives Canny nothing. Mitigations: try multiple threshold pairs, run detection per color channel and on a saturation channel, or fall back to asking the user to tap the corners. The durable fix is replacing stage 1 with a segmentation model, the approach of Chapter 24, which is exactly what modern phone scanners do.
Distractor quadrilaterals. Laptops, monitors, and floor tiles are large convex quads. Mitigations: prefer the quad containing the image center, score candidates by text-like high-frequency content inside, or track stability across video frames.
Non-planar pages. Books near the spine, curled receipts. The homography is structurally wrong; see the practical example and research frontier.
Extreme angles. Beyond roughly 60 degrees of tilt, the far edge's effective resolution collapses; the warp magnifies a few hundred captured pixels into a thousand output pixels of mush. Detect by comparing opposite side lengths and prompt for a re-shoot; no interpolation from Section 5.3 can manufacture detail the sensor never sampled.
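One of the distractor mitigations above, preferring the quad that contains the image center, reduces to a point-in-convex-polygon test. The sketch below is one possible version, assuming each candidate's corners are already in cyclic order (clockwise or counterclockwise). The function name and the two candidate quads are hypothetical.

```python
import numpy as np

def quad_contains_point(quad, point):
    """True if point lies inside (or on) a convex quad given in cyclic order."""
    quad = np.asarray(quad, dtype=np.float64)
    p = np.asarray(point, dtype=np.float64)
    edges = np.roll(quad, -1, axis=0) - quad       # vector along each side
    to_point = p - quad                            # corner -> query point
    # z-component of the 2D cross product: its sign says which side of the
    # edge the point is on. Inside means the same side for all four edges.
    cross = edges[:, 0] * to_point[:, 1] - edges[:, 1] * to_point[:, 0]
    return bool(np.all(cross >= 0) or np.all(cross <= 0))

image_center = (320.0, 240.0)                      # center of a 640 x 480 frame
page = [[60, 40], [400, 20], [420, 380], [30, 400]]
monitor = [[450, 10], [630, 10], [630, 120], [450, 120]]

candidates = {"page": page, "monitor": monitor}
chosen = [name for name, quad in candidates.items()
          if quad_contains_point(quad, image_center)]
print(chosen)
```

The heuristic rests on the assumption that users aim the camera at the document. A document deliberately placed off-center would be rejected, so it is better used to rank candidates than as a hard filter.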

Research Frontier: Scanners That Learn

The 2024-2026 generation of document capture replaces each classical stage with a learned one while keeping this section's architecture recognizable. Page localization is now typically a lightweight segmentation network rather than Canny plus contours. For non-planar geometry, dewarping models regress a dense backward map (a per-pixel remap field, exactly Section 5.4's lookup-table view) instead of an 8-parameter homography: DocTr++ (Feng et al., 2023) and the grid-based UVDoc (Verhoeven et al., SIGGRAPH Asia 2023) flatten curled and folded pages, and DocRes (Zhang et al., CVPR 2024) unifies dewarping, deshadowing, deblurring, and appearance enhancement in one generalist model prompted per task. Benchmarks in this line still report the geometry through warped-distance metrics, and the models still emit warp fields executed by the very machinery you built in this chapter; what changed is who computes the field.
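What "executing a dense backward map" means can be shown without any learned model. The sketch below gathers each output pixel from a per-pixel source location with bilinear interpolation, which is the job cv2.remap does in production code. The sinusoidal displacement field stands in for a dewarping network's output and is purely an assumption of this example, as is the function name.

```python
import numpy as np

def remap_bilinear(image, map_x, map_y):
    """Sample a grayscale image at (map_x, map_y) for every output pixel."""
    h, w = image.shape
    x = np.clip(map_x, 0, w - 1)                   # stay inside the source
    y = np.clip(map_y, 0, h - 1)
    x0 = np.minimum(np.floor(x).astype(int), w - 2)
    y0 = np.minimum(np.floor(y).astype(int), h - 2)
    fx, fy = x - x0, y - y0                        # fractional offsets in [0, 1]
    img = image.astype(np.float64)
    top = img[y0, x0] * (1 - fx) + img[y0, x0 + 1] * fx
    bottom = img[y0 + 1, x0] * (1 - fx) + img[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

h, w = 40, 60
ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
image = 2.0 * xs + 3.0 * ys                        # linear test pattern

# A made-up "curl": each column is read from a vertically shifted source
# row, a deformation no single 3x3 homography can represent.
map_x = xs
map_y = ys + 3.0 * np.sin(2.0 * np.pi * xs / w)

flattened = remap_bilinear(image, map_x, map_y)
expected = 2.0 * np.clip(map_x, 0, w - 1) + 3.0 * np.clip(map_y, 0, h - 1)
print(np.allclose(flattened, expected))
```

Because bilinear interpolation reproduces a linear pattern exactly, the output can be verified analytically. A homography warp is the special case in which map_x and map_y come from eight parameters instead of being stored per pixel.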

Fun Fact

The sum/difference corner-ordering trick in Code 5.6.2 has been re-invented and re-blogged so many times that its origin is genuinely untraceable; it appears in graphics forums from the 1990s, OCR preprocessing papers, and at least one patent filing. It is the geometric equivalent of a folk song. The robust version (sort by angle around the centroid) is three lines longer and has an author on record, which tells you something about which solutions survive.

6. What This Project Taught Beginner

Walk back through the hundred lines and notice how the chapter's sections each carried a stage: the hierarchy (5.1) told us a photographed plane needs exactly a homography, no more, no less; homogeneous coordinates (5.2) are why getPerspectiveTransform returns a 3×3 matrix and why the warp divides by the third homogeneous coordinate $w$; interpolation (5.3) fills every output pixel from fractional source positions; inverse mapping (5.4) is the reason the output has no holes; and the four corners are a tiny, trusted correspondence set, the same currency 5.5 earned with feature matching and RANSAC. One pipeline, five ideas, each load-bearing.

The scanner also hands the book its next problem. Its output is a binary image, and binary images have their own algebra: erosion to strip speckle, dilation to heal broken strokes, connected components to find characters, shape descriptors to classify them. That algebra is Chapter 6: Morphology, Binary Images & Shape, and it begins exactly where scan.png ends.

The scanner exercised the chapter on a single image with four trusted corners. The chapter's other headline application, stitching, needs the full registration machinery of Section 5.5 running on two images at once. Build it yourself in the Hands-On Lab below: a two-photo panorama stitcher that detects and matches features, fits a homography with RANSAC, and warps one frame onto the other's canvas. It is the natural companion to the scanner and the capstone for everything in Chapter 5.

Exercise 5.6.1: Break the Corner Ordering Conceptual

Construct (on paper) a convex quadrilateral for which the sum/difference trick of Code 5.6.2 assigns two corners the same role, or the wrong roles. At what rotation angles of a long, thin receipt does this happen? Then describe the centroid-angle alternative (sort corners by atan2 around their mean) and explain why it cannot produce duplicate assignments, but still needs a rule to decide which sorted corner is "top-left".

Exercise 5.6.2: Scanner, Hardened Coding

Extend the scanner with two production features. (a) A fallback detection pass: if find_page_quad returns None, retry with Otsu-thresholded saturation and value channels (Chapter 2 tools) before giving up. (b) A quality gate: reject the detected quad if the ratio of its longest to shortest side exceeds 12 (receipt sanity), if any interior angle is below 35 degrees, or if opposite sides differ by more than 3x (extreme-tilt detector from this section's failure-mode list). Demonstrate both features on five of your own photos, including at least one deliberate failure case.

Exercise 5.6.3: How Wrong Corners Hurt Analysis

Perturb each of the four detected corners independently by Gaussian noise of $\sigma \in \{1, 2, 5, 10\}$ pixels before rectification, 50 trials each, and measure the damage to the output: (a) SSIM (the structural similarity index from Section 1.5, where 1.0 is identical and lower means more distortion) between the perturbed and unperturbed scans, and (b) if you have an OCR engine available (e.g. pytesseract), character error rate on a printed test page. Plot both against $\sigma$. Which corner perturbations hurt most, and why does the answer depend on the camera angle? Relate the shape of the curve to the homography's sensitivity as the quad degenerates.

Hands-On Lab: Build a Two-Photo Panorama Stitcher

Duration: about 60 to 90 minutes
Difficulty: Intermediate

Objective

Build a panorama stitcher from scratch that takes two overlapping photographs, discovers the homography between them with feature matching and RANSAC, warps the left image onto a canvas sized for both, and blends the seam, producing one wide panorama image you can keep.

What You'll Practice

Picking the right transform from the hierarchy of Section 5.1: two photos from one rotating camera relate by a homography.
Estimating that homography from data with ORB matching and RANSAC, the registration pipeline of Section 5.5.
Composing transforms in homogeneous coordinates (Section 5.2) to add a translation offset so nothing falls off the canvas.
Executing the warp with cv2.warpPerspective, the inverse-mapping interpolation of Sections 5.3 and 5.4.
Reading the RANSAC inlier count as a built-in quality gate, exactly as in Output 5.5.4a.

Setup

You need opencv-python and numpy (both from Chapter 0). Install with pip install opencv-python numpy. For input, take two photos of a wide scene (a bookshelf, a building, a landscape) from the same standing spot, rotating the camera roughly 20 to 30 degrees between shots so they share about 40 percent overlap. Save them as left.jpg and right.jpg. No download is required; your own pair works best.

Steps

Step 1: Load the pair and detect features

Read both images and run an ORB detector on each. ORB returns keypoints (locations) and descriptors (binary signatures of each keypoint's neighborhood). This is the black-box front end from Section 5.5; you only consume its output.

import cv2
import numpy as np

left  = cv2.imread("left.jpg")    # will be warped onto the right
right = cv2.imread("right.jpg")   # the fixed reference frame

gray_l = cv2.cvtColor(left,  cv2.COLOR_BGR2GRAY)
gray_r = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)

# TODO: create an ORB detector aiming for a few thousand features
# Hint: cv2.ORB_create(nfeatures=...)
orb = ...

# TODO: detect keypoints and compute descriptors for BOTH gray images
# Hint: orb.detectAndCompute(image, None) returns (keypoints, descriptors)
k_l, d_l = ...
k_r, d_r = ...
print(f"left: {len(k_l)} kp, right: {len(k_r)} kp")

Hint

ORB descriptors are binary, so later you must match them with the Hamming distance, not the Euclidean L2 used for float descriptors like SIFT. Aim for nfeatures=4000: more features means more candidate matches for RANSAC to work with.

Step 2: Match descriptors across the two images

Find candidate correspondences by matching each left descriptor to its closest right descriptor. A brute-force matcher with cross-check keeps only pairs that are each other's mutual best match, which discards many obvious mismatches before RANSAC even runs.

# TODO: build a brute-force matcher with Hamming distance and crossCheck=True
# Hint: cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
bf = ...

# TODO: match d_l against d_r, then sort matches by ascending distance
# and keep the best few hundred
matches = ...
matches = sorted(matches, key=lambda m: m.distance)[:600]
print(f"{len(matches)} candidate matches")

Hint

With crossCheck=True you call bf.match(d_l, d_r) (a single best match per keypoint), not bf.knnMatch. Sorting by m.distance and slicing keeps the most confident candidates; RANSAC will still throw out the wrong ones.
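If Hamming distance and cross-checking feel abstract, the NumPy sketch below spells out both on tiny hand-made descriptors. It is only an illustration of what the brute-force matcher computes, not its implementation. The descriptors are 2 bytes long instead of ORB's 32, and all names are invented for the example.

```python
import numpy as np

def hamming_matrix(desc_a, desc_b):
    """Pairwise count of differing bits between rows of two uint8 arrays."""
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]      # differing bits set to 1
    return np.unpackbits(xor, axis=2).sum(axis=2)

def cross_check_matches(desc_a, desc_b):
    """Keep (i, j, distance) only when i and j are each other's best match."""
    dist = hamming_matrix(desc_a, desc_b)
    best_b_for_a = dist.argmin(axis=1)
    best_a_for_b = dist.argmin(axis=0)
    return [(i, int(j), int(dist[i, j]))
            for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]

desc_b = np.array([[0b11110000, 0b00001111],
                   [0b10101010, 0b01010101],
                   [0b00000000, 0b11111111]], dtype=np.uint8)
desc_a = np.array([[0b10101010, 0b01010111],   # desc_b[1] with one bit flipped
                   [0b11110000, 0b00001111],   # desc_b[0] exactly
                   [0b11110001, 0b00001111]],  # also close to desc_b[0]
                  dtype=np.uint8)

print(hamming_matrix(desc_a, desc_b))
matches = cross_check_matches(desc_a, desc_b)
print(matches)
```

Row 2 of desc_a prefers desc_b[0], but desc_b[0] prefers row 1, so the cross-check drops that pair. Mismatches that survive this filter are the ones RANSAC has to reject in Step 3.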

Step 3: Estimate the homography with RANSAC

Gather the matched point coordinates into source and destination arrays, then fit a homography. Many candidates are wrong, so RANSAC is essential: it finds the largest set of matches that agree on one transform and ignores the rest, returning an inlier mask alongside the matrix.

src = np.float32([k_l[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([k_r[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# TODO: estimate the homography mapping left points to right points,
# using RANSAC with a reprojection threshold of about 4 pixels
# Hint: cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=...)
H, mask = ...

n_in = int(mask.sum())
print(f"RANSAC kept {n_in}/{len(matches)} matches as inliers")
assert n_in >= 30, "too few inliers: re-shoot with more overlap"

Hint

H maps left-image coordinates to right-image coordinates. The inlier count is your quality gauge from Output 5.5.4a: fewer than about 30 inliers means the pair does not overlap enough or the scene has too little texture. Try cv2.USAC_MAGSAC instead of cv2.RANSAC to skip the threshold-tuning, as the Research Frontier in Section 5.5 recommends.

Step 4: Size the canvas and add a translation offset

If you warp the left image with H alone, parts of it can land at negative coordinates and get clipped. Project the left image's four corners through H, combine them with the right image's corners, and build a translation matrix T that shifts everything into positive territory. Composing T @ H in homogeneous coordinates (Section 5.2) is exactly the chaining of transforms you learned to do by matrix multiplication.

h_l, w_l = left.shape[:2]
h_r, w_r = right.shape[:2]

corners_l = np.float32([[0, 0], [0, h_l], [w_l, h_l], [w_l, 0]]).reshape(-1, 1, 2)
warped_corners = cv2.perspectiveTransform(corners_l, H)
corners_r = np.float32([[0, 0], [0, h_r], [w_r, h_r], [w_r, 0]]).reshape(-1, 1, 2)
all_corners = np.concatenate([warped_corners, corners_r], axis=0)

x_min, y_min = np.int32(all_corners.min(axis=0).ravel() - 0.5)
x_max, y_max = np.int32(all_corners.max(axis=0).ravel() + 0.5)

# TODO: build the 3x3 translation matrix T that shifts by (-x_min, -y_min)
# Hint: a homogeneous translation is np.array([[1,0,tx],[0,1,ty],[0,0,1]])
T = ...
canvas_size = (x_max - x_min, y_max - y_min)

Hint

Use tx = -x_min and ty = -y_min so the most negative projected corner moves to coordinate 0. The canvas width is x_max - x_min and height is y_max - y_min. Keep T as np.float32 or np.float64 so the later matrix product stays floating point.

Step 5: Warp the left image and place the right image

Warp the left image with the composed matrix T @ H onto the full canvas, then copy the right image into its offset position. This is the inverse-mapping warp of Section 5.4 with bilinear interpolation from Section 5.3 doing the resampling under the hood.

# TODO: warp the left image onto the canvas using T @ H as the transform
# Hint: cv2.warpPerspective(left, T @ H, canvas_size)
pano = ...

# Paste the right image at its translated location (overwrite, for now)
tx, ty = int(T[0, 2]), int(T[1, 2])
pano[ty:ty + h_r, tx:tx + w_r] = right

cv2.imwrite("panorama_hard_seam.jpg", pano)
print(f"panorama size: {pano.shape[1]}x{pano.shape[0]}")

Hint

Matrix multiplication order matters: T @ H applies H first (left into the right frame) then T (the canvas shift), reading right to left as composition always does. At this point you have a working panorama with a visible hard seam where the two images meet; Step 6 softens it.
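A two-matrix experiment confirms the ordering claim. In the sketch below, H and T are toy values chosen for easy arithmetic, not results from real photos, and apply is a hypothetical helper that performs the homogeneous divide.

```python
import numpy as np

def apply(M, point):
    """Map one (x, y) point through a 3x3 matrix, dividing by w."""
    x, y, w = M @ np.array([point[0], point[1], 1.0])
    return np.array([x / w, y / w])

# Toy homography: shifts left by 300, down by 20, with mild perspective.
H = np.array([[1.0,    0.0, -300.0],
              [0.0,    1.0,   20.0],
              [0.0005, 0.0,    1.0]])
# Canvas offset that moves x = -300 back to x = 0.
T = np.array([[1.0, 0.0, 300.0],
              [0.0, 1.0,   0.0],
              [0.0, 0.0,   1.0]])

p = (100.0, 50.0)
in_right_frame = apply(H, p)          # H alone: may be negative
on_canvas = apply(T @ H, p)           # H first, then the shift
wrong_order = apply(H @ T, p)         # shift first, then H: a different map

print("H only :", np.round(in_right_frame, 3))
print("T @ H  :", np.round(on_canvas, 3))
print("H @ T  :", np.round(wrong_order, 3))
```

T @ H reproduces the H-only result shifted by exactly (300, 0), which is what the canvas needs. H @ T sends the point somewhere else entirely, because the perspective divide is applied to an already shifted input.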

Step 6: Blend the seam with a feathered mask

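The hard overwrite in Step 5 leaves an obvious line, and any exposure difference shows up as a step in brightness. Replace it with a weighted blend: give each layer a weight, normalize the weights so they sum to 1 wherever either image has content, and mix the two layers accordingly. With the binary coverage masks below, this is a plain average in the overlap; the hint shows how softening the masks turns it into a feather that ramps from one image to the other. This is a one-level version of the multi-band blending that cv2.Stitcher does internally.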

warped_left = cv2.warpPerspective(left, T @ H, canvas_size)
right_layer = np.zeros_like(warped_left)
right_layer[ty:ty + h_r, tx:tx + w_r] = right

# Coverage masks (single channel) for each layer
mask_l = (warped_left.sum(axis=2) > 0).astype(np.float32)
mask_r = (right_layer.sum(axis=2) > 0).astype(np.float32)

# TODO: turn the two binary masks into normalized blend weights so that
# wl + wr = 1 wherever either image has content (avoid divide-by-zero)
# Hint: total = mask_l + mask_r; weight = mask / np.maximum(total, 1e-6)
wl = ...
wr = ...

blended = (warped_left * wl[..., None] + right_layer * wr[..., None])
cv2.imwrite("panorama.jpg", blended.astype(np.uint8))

Hint

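For a smoother seam, blur each mask with cv2.GaussianBlur(mask, (0, 0), sigmaX=25) BEFORE normalizing, then multiply the blurred mask by the original binary mask so a layer gets no weight where it has no pixels (otherwise the empty canvas darkens the other image near the border). The Gaussian turns the sharp coverage edge into a gradual ramp, so the two exposures cross-fade over a band instead of meeting at a line. This is the filtering of Chapter 3 reused for compositing.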

Expected Output

The console reports a few thousand keypoints per image, several hundred candidate matches, and a RANSAC inlier count typically between 100 and 400 for a well-overlapping pair (printed as a line like RANSAC kept 247/600 matches as inliers). The saved panorama.jpg is a single wide image, noticeably wider than either input, with the two photographs fused along the overlap. With the feathered mask from Step 6, the seam is hard to spot; with the hard overwrite from Step 5, you will see a faint vertical line and any brightness mismatch. Straight lines in the scene (shelf edges, window frames) stay straight across the join when the homography is good.

Stretch Goals

Extend to three or more images by stitching pairwise left to right, accumulating each new homography onto the running canvas transform. Watch the error grow toward the right edge: that drift is exactly what the bundle adjustment of Chapter 14 later corrects.
Draw the RANSAC inliers and outliers with cv2.drawMatches using the inlier mask to color them differently, visualizing Figure 5.5.1's consensus idea on your own photos.
Library shortcut: reproduce the whole result with OpenCV's built-in stitcher and compare. The Library Shortcut callout in Section 5.5 shows it is three lines: cv2.Stitcher_create(cv2.Stitcher_PANORAMA).stitch([left, right]). Note how the production version also handles exposure compensation and multi-band blending that your feather only approximates.

Complete Solution

# Two-photo panorama stitcher: detect, match, RANSAC-fit, warp, blend.
import cv2
import numpy as np

def stitch(left, right):
    gray_l = cv2.cvtColor(left,  cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)

    # Step 1: detect ORB features in both images
    orb = cv2.ORB_create(nfeatures=4000)
    k_l, d_l = orb.detectAndCompute(gray_l, None)
    k_r, d_r = orb.detectAndCompute(gray_r, None)

    # Step 2: match with Hamming distance + cross-check, keep best 600
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(bf.match(d_l, d_r), key=lambda m: m.distance)[:600]

    # Step 3: homography from left to right, RANSAC rejecting outliers
    src = np.float32([k_l[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k_r[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=4.0)
    n_in = int(mask.sum())
    print(f"RANSAC kept {n_in}/{len(matches)} matches as inliers")
    assert n_in >= 30, "too few inliers: re-shoot with more overlap"

    # Step 4: size the canvas and build a translation offset
    h_l, w_l = left.shape[:2]
    h_r, w_r = right.shape[:2]
    corners_l = np.float32([[0, 0], [0, h_l], [w_l, h_l], [w_l, 0]]).reshape(-1, 1, 2)
    warped_corners = cv2.perspectiveTransform(corners_l, H)
    corners_r = np.float32([[0, 0], [0, h_r], [w_r, h_r], [w_r, 0]]).reshape(-1, 1, 2)
    all_corners = np.concatenate([warped_corners, corners_r], axis=0)
    x_min, y_min = np.int32(all_corners.min(axis=0).ravel() - 0.5)
    x_max, y_max = np.int32(all_corners.max(axis=0).ravel() + 0.5)
    T = np.array([[1, 0, -x_min],
                  [0, 1, -y_min],
                  [0, 0, 1]], dtype=np.float64)
    canvas_size = (x_max - x_min, y_max - y_min)

    # Step 5: warp left onto the canvas; stage right on its own layer
    warped_left = cv2.warpPerspective(left, T @ H, canvas_size)
    tx, ty = int(T[0, 2]), int(T[1, 2])
    right_layer = np.zeros_like(warped_left)
    right_layer[ty:ty + h_r, tx:tx + w_r] = right

    # Step 6: feathered blend across the overlap
    cover_l = (warped_left.sum(axis=2) > 0).astype(np.float32)
    cover_r = (right_layer.sum(axis=2) > 0).astype(np.float32)
    # Blur for a soft ramp, then zero each weight outside its own image so
    # empty (black) canvas never dilutes the other layer near a border.
    mask_l = cv2.GaussianBlur(cover_l, (0, 0), sigmaX=25) * cover_l
    mask_r = cv2.GaussianBlur(cover_r, (0, 0), sigmaX=25) * cover_r
    total = mask_l + mask_r
    wl = mask_l / np.maximum(total, 1e-6)
    wr = mask_r / np.maximum(total, 1e-6)
    blended = warped_left * wl[..., None] + right_layer * wr[..., None]
    return blended.astype(np.uint8)

if __name__ == "__main__":
    left  = cv2.imread("left.jpg")
    right = cv2.imread("right.jpg")
    pano = stitch(left, right)
    cv2.imwrite("panorama.jpg", pano)
    print(f"panorama size: {pano.shape[1]}x{pano.shape[0]}")