"They cite the textbook ten thousand times and read it maybe forty. The afternoon I finally read it, my bug was waiting on page 312, in a footnote, with the sign convention spelled out."
A Pose Estimator Who Finally Opened the Book
Classical geometric vision is one of the best-documented subfields in all of machine learning: two textbooks, a handful of canonical papers, and a few lecture series cover almost everything in Part II, and they remain the debugging manual even as the field goes neural. This section is the curated map, organized by chapter so you can find the right source for a specific problem, plus a sustainable workflow for following a field whose foundations are stable but whose front-end is changing fast.
A bad reading list buries you in fifty links you will never open; a good one tells you the single source to reach for when a specific function misbehaves. This section is the second kind. It gives you the two books that anchor the field, one canonical reference per chapter of Part II, the lecture series worth your evenings, and the documentation trails behind the tools of Section 17.1 and Section 17.2. It closes with the part of the skill no book teaches: how to stay current without drowning. Treat it as a routing table, not a syllabus.
1. The Two Books to Own Beginner
Two books carry most of Part II. Hartley and Zisserman, Multiple View Geometry in Computer Vision (2nd ed., 2004) is the definitive treatment of the projective geometry behind Chapter 12 through Chapter 14: the fundamental and essential matrices, the camera projection model, triangulation, and bundle adjustment, all with the sign conventions and degeneracy analysis that the function docstrings of Section 17.1 assume but never explain. It is dense, and you read it the way you read a reference: when recoverPose returns a rotation you did not expect, the answer is in its pages. Szeliski, Computer Vision: Algorithms and Applications (2nd ed., 2022) is the opposite personality: broad, current, free online, and readable cover to cover, with chapters that map almost one-to-one onto this book's Part II and full citation trails into the literature.
It is tempting to assume that a 2004 textbook is obsolete in a field moving as fast as vision. The opposite is true for geometry. Learned methods such as the feed-forward reconstructors of Section 17.2 still produce camera matrices, essential matrices, and reprojection errors, the same objects Hartley and Zisserman defined, and they fail in the same geometric ways: scale ambiguity, planar degeneracy, and gauge freedom. Gauge freedom means a reconstruction is fixed only up to a global similarity transform: the whole scene can be rotated, translated, and rescaled without changing any reprojection, as Section 14.3 spells out. When a neural pose estimator behaves strangely, the explanation is almost never in the network architecture and almost always in the projective geometry the network is approximating. The classical books are not history; they are the specification that the learned methods implement, and the manual for diagnosing them when they break. The phrase worth keeping: the network forgets, the geometry does not. The illustration below casts that idea as the old green book patiently debugging a modern neural reconstructor.
Practitioners do not say the title; they say the color. Multiple View Geometry in Computer Vision is universally known as "Hartley and Zisserman", or just "the green book", and asking a robotics lab "what does the green book say about cheirality?" is a complete, well-formed question that everyone present can answer. Two decades after its second edition, the book is cited in papers whose methods it predates by twenty years, because the sign conventions and degeneracy cases it spells out are still the ones the GPU is silently obeying. The fastest debugging tool in modern geometric vision is a paperback from 2004 with a worn green spine.
2. One Paper Per Chapter Intermediate
If you read one primary source for each chapter of Part II, read these. They are the papers the algorithms are named after, and going to the source resolves the ambiguities that secondary explanations introduce. Table 17.4.1 routes each chapter to its canonical paper; the chapter bibliographies hold the full hyperlinked entries.
| Chapter | Topic | Canonical source |
|---|---|---|
| 9 | Edges | Canny, "A Computational Approach to Edge Detection" (1986) |
| 10 | Features | Lowe, "Distinctive Image Features from Scale-Invariant Keypoints" (2004) |
| 11 | Segmentation | Felzenszwalb & Huttenlocher, "Efficient Graph-Based Image Segmentation" (2004) |
| 12 | Calibration | Zhang, "A Flexible New Technique for Camera Calibration" (2000) |
| 13 | Stereo | Hirschmüller, "Stereo Processing by Semiglobal Matching" (2008) |
| 14 | SfM / SLAM | Schönberger & Frahm, "Structure-from-Motion Revisited" (2016) |
| 15 | Flow / tracking | Lucas & Kanade (1981); Kalman (1960) |
| 16 | Recognition | Dalal & Triggs, "Histograms of Oriented Gradients" (2005) |
A practical reading habit makes these sources far more useful, and it can be partly automated. Code 17.4.1 builds a tiny personal citation index: given a list of arXiv identifiers, it fetches each title and abstract so you can skim a reading queue without opening eight browser tabs.
# Personal arXiv reading queue: given a list of paper IDs, query the
# public arXiv API and print each title with a short abstract snippet,
# so a reading list can be skimmed without opening a browser tab per paper.
import urllib.request
import xml.etree.ElementTree as ET
def arxiv_titles(ids):
"""Fetch title + abstract for a list of arXiv IDs via the public API."""
query = ",".join(ids)
url = f"http://export.arxiv.org/api/query?id_list={query}&max_results={len(ids)}"
with urllib.request.urlopen(url, timeout=20) as resp:
root = ET.fromstring(resp.read())
ns = {"a": "http://www.w3.org/2005/Atom"}
for entry in root.findall("a:entry", ns):
title = entry.find("a:title", ns).text.strip().replace("\n", " ")
summary = entry.find("a:summary", ns).text.strip()[:160]
print(f"- {title}\n {summary}...\n")
# A starter reading queue: GLOMAP, SuperPoint, RAFT, DUSt3R, VGGT.
arxiv_titles(["2407.20219", "1712.07629", "2003.12039", "2312.14132", "2503.11651"])
- Global Structure-from-Motion Revisited
We revisit the problem of estimating camera poses and 3D structure...
- SuperPoint: Self-Supervised Interest Point Detection and Description
This paper presents a self-supervised framework for training interest...
- RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep...
Code 17.4.1 is a teaching toy; for real bibliography work a reference manager replaces hundreds of lines of fetching and de-duplication. Zotero (free, open source) ingests a paper from a browser button, pulls clean metadata from Crossref and arXiv, de-duplicates, and exports BibTeX for your own writing; Semantic Scholar and Connected Papers turn one seed paper into a citation graph so you can find the survey, the follow-ups, and the rebuttals in minutes. The workflow that scales is not "save a PDF to a folder" but "one tool that knows what you have read and what cites it." This is the same principle as the rest of the chapter: do not hand-roll infrastructure the ecosystem already maintains.
3. Lectures, Docs & a Staying-Current Workflow Intermediate
Books are for the bug you already have; lectures are for the intuition you do not yet have, and two free series teach Part II better than any text for first exposure. The First Principles of Computer Vision series by Shree Nayar (Columbia) covers the imaging, features, and geometry of this part with exceptional clarity, and Cyrill Stachniss's Photogrammetry and Robotics lectures (Bonn) are the best free treatment of the calibration, SfM, and SLAM of Chapter 12 through Chapter 14. For the tools, the primary documentation is the reference, not a blog: the OpenCV tutorials for the modules of Section 17.1, and the COLMAP FAQ, which answers the overwhelming majority of reconstruction failures, including the disconnected-model problem from Section 17.2's practical example.
Staying current is a skill in itself, and the goal is sustainability, not completeness. Figure 17.4.1 lays out a low-effort weekly loop that catches the important developments without the firehose of reading everything.
Who: A computer-vision engineer at an agricultural-robotics company, two years into building stereo pipelines for a field robot, fluent in OpenCV calls but self-taught from blog posts and Stack Overflow.
Situation: A stereo reconstruction kept producing depth maps with a subtle, consistent slant: planar walls came out tilted by a few degrees, and rectified pairs looked almost but not quite right.
Problem: Days of parameter tuning on StereoSGBM changed nothing, because the bug was upstream in the rectification, and the blog tutorials they had learned from never explained the rectification sign conventions or the assumptions stereoRectify makes about the calibration.
Dilemma: After three days of stalled tuning, three options were on the table. They could keep brute-forcing StereoSGBM parameters, comfortable but plainly not converging since the bug lived upstream. They could post a minimal repro and wait a day or two for a Stack Overflow answer, low effort but slow and unlikely to surface a subtle convention bug. Or they could spend two focused hours reading the primary text on rectification, a real time cost up front with no guarantee the answer was even in the book.
Decision: They finally opened Hartley and Zisserman to the chapter on stereo rectification and read it properly. The book made explicit that their calibration had swapped two coordinate conventions, a detail the API silently accepted and the blogs never mentioned.
Result: A one-line fix to the calibration convention removed the slant entirely. The fix took five minutes; finding it had taken a week, and the book would have saved the week.
Lesson: The canonical references are not for impressing reviewers; they are the fastest route to the bug. Reach for the primary source before the fortieth blog post, especially when a geometric result is subtly, consistently wrong, which is the signature of a convention error the book spells out and the tutorial omits.
The literature to track in 2026 sits at the seam between classical geometry and learning, and three threads are worth a standing watch. First, feed-forward reconstruction: after DUSt3R and MASt3R (2024), VGGT (Wang et al., CVPR 2025) and its successors are racing to replace the COLMAP pipeline of Section 17.2 with a single network pass, and the open question is whether they reach COLMAP's accuracy at scale or stay a fast approximation. Second, learned matching: SuperPoint plus LightGlue, packaged in the hloc toolbox, is steadily becoming the default front-end, and the Image Matching Challenge tracks the frontier. Third, the long arc this book follows, where the descriptors of Chapter 10 become the learned representations of Chapter 25, and the geometry of this part becomes the scaffolding for the neural scene representations of Chapter 27. The reading habit that pays off is to follow the classical venues (CVPR, ICCV, ECCV) and the tool changelogs together: the papers tell you what is possible, and the changelogs of COLMAP, GLOMAP, and kornia tell you what is actually usable.
For each debugging question, name the single best source from this section to consult first and why: (a) "Why does my essential matrix decompose into four possible poses?"; (b) "What does the P1/P2 smoothness penalty in StereoSGBM actually control?"; (c) "Is global SfM accurate enough to replace incremental for my 800-image set?"; (d) "Which descriptor should I use for a phototourism matching task in 2026?"
Extend Code 17.4.1 to also print each paper's primary subject category and publication date (both are in the arXiv API response, under the category and published elements). Then assemble a personal queue of the five papers from Table 17.4.1 and the chapter bibliography most relevant to a project you care about, run the fetcher, and write a one-sentence triage note for each, exactly as step 2 of Figure 17.4.1 prescribes.
Pick one classical method from Table 17.4.1 and trace its modern learned descendant: for example, Lowe's SIFT (2004) to SuperPoint (2018), or Lucas-Kanade flow (1981) to RAFT (2020). In a short write-up, identify (a) the geometric object both methods produce, (b) what the learned method changed and what it kept, and (c) one situation from Section 17.3's benchmarks where the classical method might still win. Argue why the classical reference remains worth reading even after the learned method surpasses it on the leaderboard.