"Everyone blames me when training is slow. Nobody mentions that they bought me for gaming, plugged me into a dusty case with one fan, and fed me two million tiny JPEG files over a USB hard drive."
A Chronically Underprovisioned GPU
Hardware sets the ceiling on everything the software chapters teach. A camera decides what information exists in your pixels before a single line of code runs; a GPU's memory decides which models you can train at all; an edge device decides what survives contact with deployment. This appendix is the buying guide the rest of the book deliberately avoids: not this month's product codes, but the durable selection criteria (sensor size, shutter type, interface bandwidth, VRAM, TOPS per watt, storage throughput) that will still be the right questions years from now. Read it before you spend money; revisit it before you spend a lot of money.
The chapters of this book are written so that almost everything in Part I and Part II runs on any laptop, and the deep learning parts run on a single modest GPU or a free cloud notebook. Sooner or later, though, every vision practitioner faces three purchasing decisions: which camera to point at the world, which GPU to train on, and which device to deploy to. Vendors will happily answer all three questions for you. This appendix gives you the vendor-neutral version: the physics and the arithmetic that let you evaluate any product family on your own. Sections are numbered for reference; the decision tree in Section 5 compresses the whole appendix into one callout you can take shopping.
1. Cameras & Optics for Vision Projects
Software can sharpen, denoise, and hallucinate, but it cannot recover information the camera never captured. Chapter 1 explains the image formation pipeline from photons through the ISP; this section turns that physics into purchasing criteria.
Sensor size and resolution
Image sensors are sold in optical format classes: 1/3", 1/2.5", 1/1.8", 2/3", 1", and on up through APS-C and full frame. For a fixed resolution, a larger sensor means larger photosites, and each photosite collects more photons per exposure. More photons means better signal-to-noise ratio, more usable dynamic range, and cleaner images in dim light. This is why a 12-megapixel 1" sensor routinely outperforms a 48-megapixel phone-class 1/2" sensor in low light: megapixels are easy to print on a box, photons per pixel are what the algorithms in Chapter 7 actually fight over. When comparing sensors seriously, look for characterization data published under the EMVA 1288 standard (emva.org/standards-technology/emva-1288), which reports quantum efficiency, temporal dark noise, and saturation capacity in comparable units across vendors.
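The low-light comparison above can be checked with rough arithmetic. The sketch below assumes nominal active-area dimensions for the two format classes (13.2 x 8.8 mm for 1" and 6.4 x 4.8 mm for 1/2"; real sensors in the same class differ by a few percent) and square pixels that fill the whole area. The `SENSOR_MM` table and the `pixel_pitch_um` helper are illustrative names, not part of any library.

```python
import math

# Nominal active-area dimensions in mm (an assumption of this sketch;
# datasheets give the exact figures for a specific sensor).
SENSOR_MM = {'1"': (13.2, 8.8), '1/2"': (6.4, 4.8)}

def pixel_pitch_um(fmt, megapixels):
    """Approximate photosite pitch, assuming square pixels filling the area."""
    w_mm, h_mm = SENSOR_MM[fmt]
    area_um2 = (w_mm * 1e3) * (h_mm * 1e3)
    return math.sqrt(area_um2 / (megapixels * 1e6))

big = pixel_pitch_um('1"', 12)      # 12 MP on a 1" sensor
small = pixel_pitch_um('1/2"', 48)  # 48 MP on a 1/2" sensor
area_ratio = (big / small) ** 2     # light-collecting area per photosite
print(f"{big:.2f} um vs {small:.2f} um, area ratio {area_ratio:.1f}x")
```

Under these assumptions each photosite on the larger sensor has roughly fifteen times the collecting area, which is the gap the megapixel count on the box hides.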
Resolution should be derived from the task, not maximized. If you need to resolve a 0.2 mm defect across a 200 mm field of view, you need roughly 2 to 3 pixels per defect, so about 2000 to 3000 pixels across the field; a 5-megapixel camera is right and a 45-megapixel camera is an expensive way to slow your pipeline down. Higher resolution also raises the bar for the lens, the interface bandwidth, the storage budget of Section 4, and the compute budget of Section 2 all at once.
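The same derivation as a small helper, in case you want to rerun it for your own field of view. The function name is illustrative, and the 2 to 3 pixels per feature is the rule of thumb from the paragraph above, not a guarantee for every detector.

```python
import math

def pixels_across(fov_mm, feature_mm, pixels_per_feature):
    """Pixels needed across the field so the smallest feature spans
    `pixels_per_feature` pixels. Rounded first to absorb float noise."""
    return math.ceil(round(fov_mm / feature_mm * pixels_per_feature, 6))

low = pixels_across(fov_mm=200, feature_mm=0.2, pixels_per_feature=2)
high = pixels_across(fov_mm=200, feature_mm=0.2, pixels_per_feature=3)
print(f"need {low} to {high} pixels across the field")
```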
Global versus rolling shutter
A rolling-shutter sensor exposes and reads out the image row by row; a global-shutter sensor exposes every pixel during the same interval. Rolling shutter is cheaper and dominates consumer devices, and for static scenes it is perfectly fine. The moment the scene or the camera moves fast, rolling shutter produces geometric artifacts: vertical lines lean, propellers turn into scimitars, and a strobe light exposes only a band of rows.
Rolling-shutter skew is not a cosmetic problem. It violates the single-viewpoint, single-instant assumption behind the camera model of Chapter 12, so calibration, pose estimation, stereo, and structure from motion all silently degrade on fast motion. If your application involves measurement, conveyor belts, vehicles, or drones, specify a global-shutter sensor (or model the rolling readout explicitly, which is much harder). For a stationary document scanner or a microscopy stage, save the money.
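To get a feel for the size of the effect, a rough sketch: for motion perpendicular to the readout direction, the shift between the first and last row is the image-plane speed multiplied by the sensor's readout time. The belt speed, image scale, and 20 ms readout below are hypothetical values chosen for illustration, not figures for any particular sensor.

```python
def rolling_shutter_skew_px(speed_mm_s, px_per_mm, readout_s):
    """Shift in pixels between the first and last row of one frame."""
    return speed_mm_s * px_per_mm * readout_s

# Hypothetical conveyor: 0.5 m/s belt, 10 px per mm, 20 ms full-frame readout
skew = rolling_shutter_skew_px(speed_mm_s=500, px_per_mm=10, readout_s=0.020)
print(f"top-to-bottom skew: about {skew:.0f} px")
```

A skew of this size dwarfs the sub-pixel reprojection errors that calibration aims for, which is why the shutter type belongs in the specification rather than in post-processing.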
Lens basics
The lens decides the field of view, the light budget, and ultimately how much of the sensor's resolution you actually get. Four parameters cover most decisions:
- Focal length sets the field of view for a given sensor size: longer focal length, narrower view. Compute it from the working distance and the scene width you must cover before browsing catalogs; every machine vision vendor publishes the same thin-lens calculator.
- Aperture (f-number) trades light against depth of field: a low f-number gathers more light but keeps a thinner slab of the scene in focus, and at very high f-numbers diffraction softens the image regardless of lens quality.
- Mount must match the camera: C-mount and CS-mount dominate industrial cameras, M12 (S-mount) dominates board-level and embedded cameras. Adapters exist in one direction only (C lens on CS body).
- Resolving power must match the pixel pitch: a lens rated for 5-micron pixels will blur a sensor with 2.4-micron pixels no matter what the sensor datasheet promises. Vendors state this as megapixel ratings or MTF curves.
Two special cases are worth knowing by name: telecentric lenses keep magnification constant with distance, which makes them the default for dimensional measurement, and varifocal lenses trade some image quality for an adjustable field of view during prototyping. All real lenses add the radial and tangential distortion that Chapter 12 teaches you to calibrate away; budget lenses simply add more of it.
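The focal-length and diffraction bullets reduce to two lines of arithmetic. The sketch below uses the thin-lens relation with magnification m = sensor width / scene width, and the standard Airy-disk diameter 2.44 x wavelength x f-number. Real lenses are thick, so treat the result as a catalog starting point; the 7.2 mm sensor width, 200 mm scene, and 500 mm working distance are hypothetical inputs.

```python
def focal_length_mm(sensor_width_mm, fov_width_mm, working_distance_mm):
    """Thin-lens focal length that images fov_width onto sensor_width."""
    m = sensor_width_mm / fov_width_mm          # magnification
    return working_distance_mm * m / (1 + m)    # from 1/f = 1/u + 1/v, v = m*u

def airy_disk_um(f_number, wavelength_um=0.55):
    """Diameter of the diffraction spot (to the first dark ring), green light."""
    return 2.44 * wavelength_um * f_number

f = focal_length_mm(sensor_width_mm=7.2, fov_width_mm=200, working_distance_mm=500)
print(f"focal length ~{f:.1f} mm")
for n in (2.8, 8, 16):
    print(f"f/{n}: diffraction spot ~{airy_disk_um(n):.1f} um")
```

Comparing the diffraction spot with your pixel pitch shows where stopping down stops helping: once the spot spans several pixels, the sensor's extra resolution is no longer reaching the image.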
Interfaces: how pixels reach the computer
The interface is a bandwidth and cabling decision, and it quietly determines your system architecture. Table E.1 compares the families you will actually choose between. The unifying good news is GenICam (emva.org/standards-technology/genicam): a standard programming model that lets one API control cameras across vendors and interfaces, so your exposure-setting code survives a camera swap.
| Interface | Usable bandwidth | Cable reach | Typical role | Notes |
|---|---|---|---|---|
| USB3 Vision | ~380 MB/s | 3-5 m (more with active cables) | Single-camera benches and lab rigs | Bus-powered, plug and play, easiest start |
| GigE Vision (1 GbE) | ~115 MB/s | 100 m, PoE available | Multi-camera factory lines | Standard switches; IEEE 1588 time sync |
| 2.5/5/10 GigE Vision | ~280-1100 MB/s | 30-100 m | High-speed or high-resolution inspection | Needs matching NICs and switches |
| MIPI CSI-2 | Multiple Gbit/s per lane | ~0.3 m ribbon | Embedded: Jetson, Raspberry Pi, phones | Lowest cost and power; driver integration work (mipi.org/specifications/csi-2) |
| CoaXPress | Up to 12.5 Gbit/s per lane | 30-40 m coax | Line-scan and very high speed | Requires a frame grabber card |
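Before choosing an interface, it is worth checking that your stream fits at all. The sketch below computes the uncompressed data rate of a camera and compares it against the usable-bandwidth column of Table E.1; the dictionary and function names are illustrative, and the 5-megapixel, 8-bit mono, 30 fps camera is a hypothetical example.

```python
# Usable bandwidth in MB/s, copied from Table E.1 (approximate figures)
USABLE_MB_S = {
    "USB3 Vision": 380,
    "GigE Vision (1 GbE)": 115,
    "10 GigE Vision": 1100,
}

def stream_mb_s(width, height, fps, bytes_per_pixel=1):
    """Uncompressed data rate of one camera stream in MB/s."""
    return width * height * bytes_per_pixel * fps / 1e6

need = stream_mb_s(2448, 2048, fps=30)  # hypothetical 5 MP mono8 camera
fits = {name: need <= cap for name, cap in USABLE_MB_S.items()}
print(f"stream needs ~{need:.0f} MB/s")
for name, ok in fits.items():
    print(f"  {name:22s} {'fits' if ok else 'too slow'}")
```

The same camera on 1 GbE would have to drop to a lower frame rate, a smaller region of interest, or on-camera compression.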
Industrial camera or webcam?
A webcam is a camera plus an opinionated, invisible ISP: auto-exposure, auto-white-balance, auto-focus, and aggressive compression, all tuned to make video calls look pleasant. Those automatics are exactly what a vision pipeline does not want, because the same scene produces different pixels every time the automatics re-decide. Industrial cameras (product families such as Basler ace, Teledyne FLIR Blackfly, IDS uEye, Allied Vision Alvium, and the Sony Pregius and STARVIS sensor families inside many of them) invert the philosophy: raw or lightly processed frames, every parameter fixed and scriptable, hardware trigger inputs for synchronizing with strobes and encoders, and screw-lock connectors and mounting threads that survive vibration.
For learning the material in this book, a webcam is fine, and you can claw back a useful amount of reproducibility by pinning its automatics:
```python
import cv2

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

# 0.25 selects manual exposure on some V4L2 builds (0.75 restores auto);
# other OpenCV versions and backends expect different values (for example
# 1 and 3), so confirm with the read-back below
cap.set(cv2.CAP_PROP_AUTO_EXPOSURE, 0.25)
cap.set(cv2.CAP_PROP_EXPOSURE, -6)           # units are backend-specific; verify below
cap.set(cv2.CAP_PROP_AUTO_WB, 0)             # freeze white balance
cap.set(cv2.CAP_PROP_WB_TEMPERATURE, 4600)   # then pin a color temperature

ok, frame = cap.read()
assert ok, "camera returned no frame"

# Always read back what the driver actually accepted
print(cap.get(cv2.CAP_PROP_AUTO_EXPOSURE), cap.get(cv2.CAP_PROP_EXPOSURE),
      frame.shape, frame.dtype)
cap.release()
```
The single most underrated purchase in any camera budget is not the camera: it is lighting. A fixed, diffuse, bright light source removes more variance from a dataset than any amount of preprocessing from Chapter 2 can remove afterwards. Machine vision engineers spend as much on illumination as on cameras, and they are right to.
Who: A three-person team adding automated optical inspection to a small electronics assembly line.
Situation: Their defect classifier, trained on bench photos, performed well offline but produced noisy, low-confidence predictions on the line.
Problem: The line camera was a rolling-shutter model, and boards arrived on a conveyor that never stopped. Every frame contained a slight horizontal smear and skew that bench images lacked; the classifier was being asked about a distribution it had never seen.
Decision: They replaced the camera with a global-shutter model from the same product family, added a hardware trigger from the conveyor's encoder, and installed a diffuse white strobe fired by the same trigger.
Result: Frames became geometrically identical from board to board. Classifier confidence recovered to bench levels with no retraining, and the strobe froze residual motion so well they could double the belt speed.
Lesson: When deployment images differ from training images, suspect the acquisition hardware before suspecting the model. A trigger, a strobe, and the right shutter buy accuracy that no architecture change can.
2. Choosing GPUs for Training and Inference
GPU marketing is a blizzard of core counts, clock speeds, and benchmark bars. For vision work, almost all of it is secondary to one number.
A GPU with too little memory cannot run your job at any speed; a GPU with enough memory but fewer cores merely runs it slower. Memory capacity is a hard wall, compute is a soft slope. Buy VRAM first, memory bandwidth second, and tensor-core generation third. Every other specification is a tiebreaker.
You can estimate memory needs on the back of an envelope. Inference in fp16 costs about 2 bytes per parameter plus activations. Training with the AdamW optimizer under mixed precision (the standard recipe of Chapter 18) keeps fp16 weights and gradients plus fp32 master weights and two optimizer moments: roughly 16 bytes per parameter, before activations. Activations scale with batch size and resolution and typically add one to a few gigabytes for the models in this book:
```python
def training_vram_gb(params_millions, bytes_per_param=16, activations_gb=2.0):
    """Rough AdamW mixed-precision footprint: fp16 weights and gradients,
    fp32 master weights and two optimizer moments (about 16 bytes/param),
    plus an activation budget that grows with batch size and resolution."""
    weights = params_millions * 1e6 * bytes_per_param / 2**30
    return weights + activations_gb

for name, p in [("ResNet-50", 25.6), ("ViT-B/16", 86.6), ("SDXL U-Net", 2570)]:
    print(f"{name:11s} ~{training_vram_gb(p):5.1f} GB")
# ResNet-50   ~  2.4 GB
# ViT-B/16    ~  3.3 GB
# SDXL U-Net  ~ 40.3 GB
```
Mapping the estimate onto the book's four parts:
- Parts I and II (image processing, classical CV) need no GPU at all. OpenCV and scikit-image are CPU libraries first; a GPU helps only for bulk batch processing via CUDA-accelerated modules.
- Part III (deep learning) is comfortable at 8 to 16 GB. Fine-tuning a ResNet-50 or YOLO-family detector at 640-pixel resolution with automatic mixed precision fits in 8 GB; ViT-B fine-tunes and Mask R-CNN training prefer 12 to 16 GB; inference for almost every model in Part III, including SAM, fits in 8 GB.
- Part IV (generative models) is where VRAM pressure jumps. Stable Diffusion 1.5 inference runs in about 4 GB at fp16; SDXL-class models want 10 to 12 GB; 12-billion-parameter text-to-image models (Chapter 34) need about 24 GB for their fp16 weights alone, so on a 24 GB card they run with 8-bit or 4-bit quantization. LoRA fine-tuning of an SDXL-class model is pleasant at 24 GB; full fine-tunes and video models belong on rented datacenter hardware.
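The inference side of the envelope is even simpler than the training side: weights cost bits-per-parameter divided by eight, in bytes. The sketch below applies that to the 12-billion-parameter case and deliberately ignores activations, text encoders, and framework overhead, so read the results as lower bounds. The helper name is illustrative.

```python
def inference_weights_gb(params_billions, bits_per_param):
    """Memory for the weights alone, in GiB (activations not included)."""
    return params_billions * 1e9 * bits_per_param / 8 / 2**30

footprint = {bits: inference_weights_gb(12, bits) for bits in (16, 8, 4)}
for bits, gb in footprint.items():
    print(f"12B parameters at {bits:2d}-bit: ~{gb:4.1f} GB of weights")
```

At 16 bits the weights alone nearly fill a 24 GB card, which is why quantization is what makes this model class practical on consumer hardware.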
Table E.2 condenses this mapping into the memory tiers you will actually shop in; products rotate through the tiers every generation, but the tiers themselves stay put.
| VRAM tier | Typical hardware class | What fits comfortably |
|---|---|---|
| 6-8 GB | Entry consumer cards, gaming laptops | All of Parts I-II; Part III inference and small AMP fine-tunes; SD 1.5 inference |
| 12-16 GB | Mainstream consumer cards | Detector and segmenter fine-tuning; ViT-B; SDXL inference; LoRA on SD 1.5 |
| 24 GB | Flagship consumer (xx90-class), entry workstation | SDXL LoRA, quantized 12B text-to-image inference, ViT-L, NeRF and 3D Gaussian splatting training |
| 32-48 GB | Workstation cards, previous-generation datacenter cards | Full SDXL-class fine-tunes, larger batches, small video models |
| 80 GB and up | Current datacenter accelerators, usually rented | Pretraining, video diffusion, foundation-model fine-tuning, multi-GPU jobs |
Consumer versus datacenter families
Consumer GeForce-class cards offer the best fp16 throughput per dollar ever sold and are the default for individuals and small labs; their costs are no ECC memory, no NVLink on recent generations, 3-slot coolers designed for gaming cases, and driver license terms that restrict datacenter deployment. Workstation cards (the RTX professional line) trade some price efficiency for more VRAM, ECC, blower coolers that stack densely, and certified drivers. Datacenter accelerators add huge memory (40 to 80 GB and beyond), NVLink interconnect for multi-GPU training, and fp8 support, at prices that only make sense rented or amortized across a team. The pragmatic pattern for most readers: develop and fine-tune on one consumer card, rent datacenter hardware for the few jobs that need it. CUDA documentation lives at docs.nvidia.com/cuda; the non-NVIDIA paths, AMD's ROCm (rocm.docs.amd.com) and Apple-silicon Macs via PyTorch's MPS backend, are genuinely usable for Part III development, but expect occasional missing operators and verify your full pipeline early before committing.
When renting beats buying
A flagship-consumer-class card costs on the order of $1500 to $2500; marketplace cloud GPUs of the same class rent for roughly $0.30 to $0.60 per hour, and datacenter cards for a few dollars per hour. Dividing purchase price by hourly rate, and ignoring electricity and resale value, puts the break-even between about 2500 and 8000 GPU-hours, with typical pairings landing around 3000 to 6000: two to three years of nights-and-weekends use. Rent when your usage is bursty (a sweep this week, nothing next week), when one job needs more VRAM than you own, or when you want multi-GPU for days rather than forever. Buy when the GPU would run more than about a third of working hours, when data cannot leave your premises, or when iteration latency (no upload, no queue) is worth real money to you. Most readers are best served by exactly one local card plus a cloud account.
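The break-even arithmetic is worth rerunning with your own quotes. This minimal sketch divides purchase price by hourly rental rate across the price ranges given above; it ignores electricity, resale value, and the rest of the workstation, all of which shift the answer in practice.

```python
def break_even_hours(card_price_usd, rent_usd_per_hour):
    """GPU-hours of rental that cost as much as buying the card outright."""
    return card_price_usd / rent_usd_per_hour

hours = {
    (price, rate): break_even_hours(price, rate)
    for price in (1500, 2500)
    for rate in (0.30, 0.60)
}
for (price, rate), h in sorted(hours.items(), key=lambda kv: kv[1]):
    print(f"${price} card vs ${rate:.2f}/h rental: {h:6.0f} GPU-hours")
```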
3. Edge & Embedded Targets
Training happens where power is cheap; inference increasingly happens where the camera is. Chapter 28 covers the software side of this move (quantization, pruning, distillation, and export through ONNX to runtimes like TensorRT and OpenVINO); this section maps the hardware you export to. Table E.3 summarizes the field.
| Device family | Compute class | Power | Primary toolchain | Sweet spot |
|---|---|---|---|---|
| NVIDIA Jetson (Nano to AGX class) | ~10-275 INT8 TOPS | 5-60 W | JetPack: CUDA, TensorRT, DeepStream | Detection and segmentation at video rate, multi-camera robots |
| Google Coral (Edge TPU) | 4 INT8 TOPS | ~2 W | LiteRT (TensorFlow Lite), int8 only | Always-on classification, lightweight detection |
| Raspberry Pi 5 plus M.2 NPU accelerators | CPU plus 13-26 TOPS add-on | 5-15 W | picamera2/libcamera, ONNX Runtime, vendor SDKs | Prototypes, kiosks, modest frame rates at hobby prices |
| Phones (Apple Neural Engine, Android NPUs) | Tens of TOPS | Battery | Core ML; LiteRT delegates / NNAPI | In-app vision, AR, on-device privacy |
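TOPS per watt, one of the selection criteria named at the start of this appendix, falls straight out of the table. The sketch below pairs the upper end of each TOPS range with the upper end of its power range, which is an assumption: vendors quote peak figures under different conditions, so treat the ratios as a rough ordering rather than a benchmark.

```python
# (peak INT8 TOPS, power in watts) from the upper end of each Table E.3
# range; pairing the two upper ends is an assumption of this sketch.
DEVICES = {
    "Jetson (AGX class)": (275, 60),
    "Coral Edge TPU": (4, 2),
}

def tops_per_watt(tops, watts):
    return tops / watts

efficiency = {name: tops_per_watt(*spec) for name, spec in DEVICES.items()}
for name, value in efficiency.items():
    print(f"{name:20s} {value:4.1f} TOPS/W")
```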
Jetson is the closest thing to "your training GPU, shrunk": the same CUDA and TensorRT stack runs on modules spanning roughly 10 to 275 INT8 TOPS, so a PyTorch model exported on your workstation deploys with minimal translation. The family (documented at docs.nvidia.com/jetson, module lineup at developer.nvidia.com/embedded/jetson-modules) accepts MIPI CSI-2 cameras directly, which is why Table E.1's embedded interface matters. Choose Jetson when you need real-time detection or segmentation, multiple camera streams, or any CUDA-only dependency.
Coral's Edge TPU (coral.ai/docs) sits at the opposite pole: a 4-TOPS, roughly 2-watt accelerator that runs only fully int8-quantized LiteRT models with supported operators. Inside that fence it is astonishingly efficient; outside it, models simply will not compile. It rewards the quantization-aware training discipline of Chapter 28 and suits battery-powered, always-on sensing.
Raspberry Pi (raspberrypi.com/documentation) is the prototyping default. A Pi 5 runs small classifiers and detectors on CPU at single-digit frame rates; pairing it with an M.2 or HAT-format NPU (the Hailo-based AI Kit family is the canonical example) lifts it into real-time territory for modest models. The camera stack is libcamera (libcamera.org) with the picamera2 Python bindings, and official camera modules connect over CSI-2, including a global-shutter module, so Section 1's shutter advice applies even at hobby scale.
Phones are the most powerful edge devices most users already own. On iOS, models convert to Core ML (developer.apple.com/documentation/coreml) and run on the Apple Neural Engine; on Android, LiteRT (ai.google.dev/edge/litert) dispatches to GPU and NPU delegates, the successor path to the older NNAPI. The export bridges are Core ML Tools and ONNX, plus PyTorch's ExecuTorch runtime for direct mobile deployment. The recurring constraint across all four families is the same: int8 (and increasingly 4-bit) quantization support in your model is the price of admission, and memory, not TOPS, is usually what kills a port.
Edge hardware guides used to end at detection. Since 2024 the frontier is on-device generation and multimodal understanding: 4-bit quantized diffusion models producing images in seconds on phone NPUs, distilled one-step and few-step samplers (the step-distillation lineage of Chapter 33), and small vision-language models running fully offline on laptop-class NPUs. Vendor NPU TOPS figures are climbing precisely to serve these workloads, and "AI PC" marketing is largely this trend wearing a suit. The durable lesson for buyers: quantization support and memory bandwidth on edge silicon now matter more than peak TOPS, because generative workloads are memory-bound.
4. Storage & Data Pipelines for Image Datasets
A training run is a pipeline: storage feeds decoders, decoders feed augmentation, augmentation feeds the GPU. The GPU is the most expensive stage, so the engineering goal is simple: the GPU must never wait. A single modern GPU training a detector can consume several hundred to several thousand images per second; if your storage and decode stages deliver less, you have bought compute you cannot use.
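A quick way to turn "the GPU must never wait" into a number is to multiply the image rate by the average file size. The sketch below does that for a few hypothetical consumption rates, assuming 150 KB per JPEG (the middle of the web-resolution range used in this section); the helper name is illustrative.

```python
def required_read_mb_s(images_per_second, avg_image_kb):
    """Sustained storage read rate needed to keep the GPU fed, in MB/s."""
    return images_per_second * avg_image_kb / 1e3

needs = {rate: required_read_mb_s(rate, avg_image_kb=150) for rate in (500, 2000, 5000)}
for rate, mb_s in needs.items():
    print(f"{rate:5d} images/s at 150 KB each -> {mb_s:6.0f} MB/s sustained")
```

Compare the result with what your drive sustains on your actual file layout, not its headline sequential figure.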
The classic mistake is storing a dataset as millions of individual small files. Filesystems handle a million 100 KB files far worse than a hundred 1 GB files: every open is metadata overhead, and on network filesystems the latency multiplies. The standard remedies all amount to sharding: pack samples into sequential archives using WebDataset tar shards (github.com/webdataset/webdataset), Hugging Face Datasets with Parquet-backed storage (huggingface.co/docs/datasets), or an LMDB database. Sequential reads from sharded archives saturate even cheap disks; random reads of tiny files saturate nothing.
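The idea behind sharding fits in a few lines of standard-library Python. The sketch below is not the WebDataset API; it only packs a list of files into sequentially numbered tar archives, which is the on-disk layout those tools read. The `write_shards` name, the `train-000000.tar` naming pattern, and the shard size are illustrative choices, and the demo writes placeholder bytes instead of real images.

```python
import tarfile
import tempfile
from pathlib import Path

def write_shards(files, out_dir, files_per_shard=1000, prefix="train"):
    """Pack files into numbered tar shards and return the shard paths.
    Members are stored by file name only, so names must be unique."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(Path(f) for f in files)
    shards = []
    for start in range(0, len(files), files_per_shard):
        shard_path = out_dir / f"{prefix}-{start // files_per_shard:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:   # uncompressed: JPEGs are already compressed
            for f in files[start:start + files_per_shard]:
                tar.add(f, arcname=f.name)
        shards.append(shard_path)
    return shards

# Demo on five placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "images"
    src.mkdir()
    for i in range(5):
        (src / f"{i:04d}.jpg").write_bytes(b"placeholder, not a real JPEG")
    shards = write_shards(src.glob("*.jpg"), Path(tmp) / "shards", files_per_shard=2)
    print([s.name for s in shards])
```

For real datasets, labels and images that belong to one sample must land in the same shard; the dedicated tools handle that grouping for you.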
Plan capacity in tiers, like a memory hierarchy:
- Hot (NVMe SSD): the dataset you are actively training on. A 1 to 2 TB NVMe drive holds most of the classification and detection datasets in Appendix B with room to spare; a million JPEG images at typical web resolution is roughly 100 to 300 GB.
- Warm (SATA SSD or HDD): datasets you rotate between projects, raw captures awaiting curation, and checkpoints.
- Cold (object storage or NAS): everything else, especially video, which inflates storage budgets by an order of magnitude. Follow the 3-2-1 backup rule for anything you cannot re-download: three copies, two media, one off-site.
Throughput is only half the battle; decode cost is the other half. JPEG decoding happens on the CPU by default, and at high image rates it, not the disk, becomes the bottleneck. The levers, in escalation order: more DataLoader workers (the input-pipeline machinery of Chapter 18), resizing images offline to the resolution you actually train at (the augmentation pipelines of Chapter 21 rarely need originals), and GPU-side decoding and augmentation with NVIDIA DALI (docs.nvidia.com/deeplearning/dali). Measure before optimizing: time your loader alone, with the model step disabled, and compare images per second against what the GPU consumes.
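Timing the loader alone takes only a few lines. The sketch below works on any iterable of batches; `images_per_second` and `fake_loader` are illustrative names, and the fake loader merely sleeps to stand in for disk and decode time. The `count` argument exists because many loaders yield tuples: for a loader that yields `(images, labels)` pairs you would pass something like `count=lambda batch: len(batch[0])`.

```python
import time

def images_per_second(loader, max_batches=50, warmup_batches=5, count=len):
    """Throughput of an iterable of batches with no model step attached."""
    it = iter(loader)
    for _ in range(warmup_batches):      # let workers start and caches fill
        next(it, None)
    seen = 0
    start = time.perf_counter()
    for _ in range(max_batches):
        batch = next(it, None)
        if batch is None:                # loader exhausted early
            break
        seen += count(batch)
    elapsed = time.perf_counter() - start
    return seen / elapsed if elapsed > 0 else float("nan")

def fake_loader(n_batches=20, batch_size=32, delay_s=0.001):
    """Stand-in for a real loader: each batch costs delay_s to produce."""
    for _ in range(n_batches):
        time.sleep(delay_s)
        yield [0] * batch_size

rate = images_per_second(fake_loader(), max_batches=10, warmup_batches=2)
print(f"{rate:.0f} images/s from the loader alone")
```

If this number is below what the GPU consumes during a normal training step, the fix belongs in storage or decoding, not in the model.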
Finally, treat datasets as versioned artifacts, not folders. DVC (dvc.org/doc) gives datasets the commit-and-diff discipline git gives code, and FiftyOne (docs.voxel51.com) lets you visually audit what is actually inside the shards before you spend GPU-days training on it.
Data expands to fill the storage available, plus one more drive. Every vision practitioner eventually owns a directory named datasets_old_FINAL_backup2 whose contents nobody dares delete and nobody can identify. Sharding and versioning will not cure the hoarding instinct, but at least the hoard becomes greppable.
5. Putting It Together: A Decision Tree by Budget
Hardware advice ages; budget tiers do not. The callout below compresses Sections 1 through 4 into a decision tree organized by total spend, naming capability classes rather than products. Prices are rough 2026 street figures for orientation; the structure of the advice outlives them.
- Tier 0, about $0 (a laptop you already own): Built-in webcam or phone camera for capture; free cloud notebooks for anything needing a GPU. Everything in Parts I and II runs locally on CPU; Part III runs in free notebook sessions; Part IV is usable through hosted inference. Spend nothing until a specific bottleneck tells you what to buy.
- Tier 1, about $500: A used or entry consumer GPU in the 8-12 GB class, plus a Raspberry Pi with the global-shutter camera module if you want a capture rig. Unlocks: local fine-tuning of classifiers and detectors (Part III), SD 1.5-class generation (Part IV), and real camera experiments for Chapter 12 calibration work.
- Tier 2, about $2000-3000: A 16-24 GB consumer GPU in a desktop with 64 GB RAM and a 2 TB NVMe drive, plus a USB3 Vision industrial camera with a C-mount lens and a basic diffuse light. Unlocks: every chapter of this book at comfortable speed, SDXL-class fine-tuning with LoRA, NeRF and 3D Gaussian splatting, and acquisition experiments with deterministic exposure and triggering. This is the sweet spot for one serious practitioner.
- Tier 3, about $10,000: Either one workstation with dual 24-48 GB cards, 10 GbE networking, and a small NAS, or (usually smarter) the Tier 2 workstation plus a standing cloud budget for datacenter GPUs on demand. Add GigE Vision cameras and proper lighting if acquisition is part of the mission, and a Jetson-class module if deployment is. Unlocks: multi-GPU experiments, full generative fine-tunes, and team-scale datasets.
- Tier 4, production and lab scale: Stop buying flagships and start matching hardware to the deployed workload: rented datacenter clusters for training, a fleet of edge devices sized by Table E.3 for inference, and storage engineered per Section 4. At this tier the decision tree inverts: the deployment target is chosen first, and everything upstream, including which models you train at all, follows from Chapter 28's constraints.
Whatever the tier, the spending order is the same: lighting before lenses, lenses before sensors, VRAM before cores, NVMe before more compute, and measurements before all of it. Hardware purchased to fix a bottleneck you have actually profiled is rarely regretted; hardware purchased from a spec sheet usually is.