Appendix E: Cameras, GPUs & Edge Hardware Guide


"Everyone blames me when training is slow. Nobody mentions that they bought me for gaming, plugged me into a dusty case with one fan, and fed me two million tiny JPEG files over a USB hard drive."

A Chronically Underprovisioned GPU

Big Picture

Hardware sets the ceiling on everything the software chapters teach. A camera decides what information exists in your pixels before a single line of code runs; a GPU's memory decides which models you can train at all; an edge device decides what survives contact with deployment. This appendix is the buying guide the rest of the book deliberately avoids: not this month's product codes, but the durable selection criteria (sensor size, shutter type, interface bandwidth, VRAM, TOPS per watt, storage throughput) that will still be the right questions years from now. Read it before you spend money; revisit it before you spend a lot of money.

The chapters of this book are written so that almost everything in Part I and Part II runs on any laptop, and the deep learning parts run on a single modest GPU or a free cloud notebook. Sooner or later, though, every vision practitioner faces three purchasing decisions: which camera to point at the world, which GPU to train on, and which device to deploy to. Vendors will happily answer all three questions for you. This appendix gives you the vendor-neutral version: the physics and the arithmetic that let you evaluate any product family on your own. Sections are numbered for reference; the decision tree in Section 5 compresses the whole appendix into one callout you can take shopping.

1. Cameras & Optics for Vision Projects

Software can sharpen, denoise, and hallucinate, but it cannot recover information the camera never captured. Chapter 1 explains the image formation pipeline from photons through the ISP; this section turns that physics into purchasing criteria.

Sensor size and resolution

Image sensors are sold in optical format classes: 1/3", 1/2.5", 1/1.8", 2/3", 1", and on up through APS-C and full frame. For a fixed resolution, a larger sensor means larger photosites, and each photosite collects more photons per exposure. More photons mean a better signal-to-noise ratio, more usable dynamic range, and cleaner images in dim light. This is why a 12-megapixel 1" sensor routinely outperforms a 48-megapixel phone-class 1/2" sensor in low light: megapixels are easy to print on a box, but photons per pixel are what the algorithms in Chapter 7 actually fight over. When comparing sensors seriously, look for characterization data published under the EMVA 1288 standard (emva.org/standards-technology/emva-1288), which reports quantum efficiency, temporal dark noise, and saturation capacity in comparable units across vendors.
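The photon argument can be made concrete with a few lines of arithmetic. The sketch below assumes nominal active areas of about 13.2 x 8.8 mm for a 1" sensor and 6.4 x 4.8 mm for a 1/2" sensor (typical published values, not taken from any specific datasheet) and treats photosites as square and gapless, which real sensors only approximate.

```python
import math

def pixel_pitch_um(width_mm, height_mm, megapixels):
    """Approximate photosite pitch in micrometres, assuming square,
    gapless photosites that tile the whole active area."""
    area_um2 = (width_mm * 1000.0) * (height_mm * 1000.0)
    return math.sqrt(area_um2 / (megapixels * 1e6))

# Nominal active areas (assumed typical values, not a specific datasheet)
one_inch = pixel_pitch_um(13.2, 8.8, 12)    # 12 MP on a 1" sensor
half_inch = pixel_pitch_um(6.4, 4.8, 48)    # 48 MP on a 1/2" sensor

area_ratio = (one_inch / half_inch) ** 2
print(f'1"   pitch: {one_inch:.2f} um')
print(f'1/2" pitch: {half_inch:.2f} um')
print(f"light-gathering area per photosite: {area_ratio:.0f}x larger")
```

Under these assumptions the pitches come out near 3.1 um and 0.8 um, so each photosite on the larger sensor has roughly fifteen times the collecting area.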

Resolution should be derived from the task, not maximized. If you need to resolve a 0.2 mm defect across a 200 mm field of view, the field is 1000 defect-widths wide; at roughly 2 to 3 pixels per defect, that means about 2000 to 3000 pixels across the field. A 5-megapixel camera is right, and a 45-megapixel camera is an expensive way to slow your pipeline down. Higher resolution also raises the bar for the lens, the interface bandwidth, the storage budget of Section 4, and the compute budget of Section 2 all at once.
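The same derivation works for any field of view and feature size. The helper below is a minimal sketch; the function name and the default of 3 pixels per feature are illustrative choices, not a standard.

```python
import math

def required_pixels(fov_mm, feature_mm, pixels_per_feature=3):
    """Pixels needed along one axis so the smallest feature of interest
    spans `pixels_per_feature` pixels. Rounding guards against
    floating-point noise before taking the ceiling."""
    return math.ceil(round(fov_mm / feature_mm * pixels_per_feature, 6))

# The example from the text: 0.2 mm defect, 200 mm field of view
low = required_pixels(200, 0.2, pixels_per_feature=2)
high = required_pixels(200, 0.2, pixels_per_feature=3)
print(f"need between {low} and {high} pixels across the field")
```

Run it once per axis if the field of view is not square, and size the sensor to the more demanding axis.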

Global versus rolling shutter

A rolling-shutter sensor exposes and reads out the image row by row; a global-shutter sensor exposes every pixel during the same interval. Rolling shutter is cheaper and dominates consumer devices, and for static scenes it is perfectly fine. The moment the scene or the camera moves fast, rolling shutter produces geometric artifacts: vertical lines lean, propellers turn into scimitars, and a strobe light exposes only a band of rows.

Warning: Rolling Shutter Corrupts Geometry, Not Just Aesthetics

Rolling-shutter skew is not a cosmetic problem. It violates the single-viewpoint, single-instant assumption behind the camera model of Chapter 12, so calibration, pose estimation, stereo, and structure from motion all silently degrade on fast motion. If your application involves measurement, conveyor belts, vehicles, or drones, specify a global-shutter sensor (or model the rolling readout explicitly, which is much harder). For a stationary document scanner or a microscopy stage, save the money.
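A rough way to judge whether skew matters for a given setup is to multiply the object's image-plane speed by the sensor's full-frame readout time. The sketch below assumes constant sideways motion and uses a hypothetical conveyor (500 mm/s, 0.1 mm per pixel) and a hypothetical 20 ms readout; substitute the readout time from your sensor's datasheet.

```python
def rolling_shutter_skew_px(speed_mm_s, mm_per_pixel, readout_ms):
    """Horizontal offset, in pixels, between the first and last row of a
    frame for an object moving sideways at constant speed, given the
    time the sensor takes to read out all rows."""
    speed_px_s = speed_mm_s / mm_per_pixel
    return speed_px_s * readout_ms / 1000.0

# Hypothetical conveyor: 500 mm/s belt, 0.1 mm per pixel, 20 ms readout
skew = rolling_shutter_skew_px(500, 0.1, 20)
print(f"skew across the frame: {skew:.0f} px")
```

With these numbers the top and bottom rows disagree by about 100 pixels, far more than any calibration tolerance; a stationary scene gives zero regardless of readout time.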

Lens basics

The lens decides the field of view, the light budget, and ultimately how much of the sensor's resolution you actually get. Four parameters cover most decisions:

Two special cases worth knowing by name: telecentric lenses keep magnification constant with distance, which makes them the default for dimensional measurement, and varifocal lenses trade some image quality for an adjustable field of view during prototyping. All real lenses add the radial and tangential distortion that Chapter 12 teaches you to calibrate away; budget lenses simply add more of it.

Interfaces: how pixels reach the computer

The interface is a bandwidth and cabling decision, and it quietly determines your system architecture. Table E.1 compares the families you will actually choose between. The unifying good news is GenICam (emva.org/standards-technology/genicam): a standard programming model that lets one API control cameras across vendors and interfaces, so your exposure-setting code survives a camera swap.

| Interface | Usable bandwidth | Cable reach | Typical role | Notes |
| --- | --- | --- | --- | --- |
| USB3 Vision | ~380 MB/s | 3-5 m (more with active cables) | Single-camera benches and lab rigs | Bus-powered, plug and play, easiest start |
| GigE Vision (1 GbE) | ~115 MB/s | 100 m, PoE available | Multi-camera factory lines | Standard switches; IEEE 1588 time sync |
| 2.5/5/10 GigE Vision | ~280-1100 MB/s | 30-100 m | High-speed or high-resolution inspection | Needs matching NICs and switches |
| MIPI CSI-2 | Multiple Gbit/s per lane | ~0.3 m ribbon | Embedded: Jetson, Raspberry Pi, phones | Lowest cost and power; driver integration work (mipi.org/specifications/csi-2) |
| CoaXPress | Up to 12.5 Gbit/s per lane | 30-40 m coax | Line-scan and very high speed | Requires a frame grabber card |
Table E.1: Camera interface families. Bandwidth figures are realistic sustained rates, not headline link speeds.
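Checking a camera against Table E.1 is a one-line multiplication: width times height times bytes per pixel times frame rate. The sketch below hard-codes three of the table's sustained rates (taking the 10 GbE end of the multi-gigabit row) and evaluates a hypothetical 2448 x 2048 mono8 camera at 60 fps; the function names are illustrative.

```python
# Sustained rates from Table E.1, in MB/s (10 GbE end of the multi-gig row)
USABLE_MB_S = {
    "USB3 Vision": 380,
    "GigE Vision (1 GbE)": 115,
    "10 GigE Vision": 1100,
}

def stream_mb_s(width, height, fps, bytes_per_pixel=1):
    """Uncompressed data rate of one camera stream in MB/s (1 MB = 1e6 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e6

def interfaces_that_fit(width, height, fps, bytes_per_pixel=1):
    """Return the required rate and the interfaces able to sustain it."""
    need = stream_mb_s(width, height, fps, bytes_per_pixel)
    return need, [name for name, cap in USABLE_MB_S.items() if cap >= need]

# Hypothetical camera: 2448 x 2048, 8-bit mono, 60 fps
need, ok = interfaces_that_fit(2448, 2048, 60)
print(f"{need:.0f} MB/s ->", ok)
```

This stream needs about 301 MB/s, which rules out 1 GbE and leaves little headroom on USB3; remember to multiply by the number of cameras sharing a link.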

Industrial camera or webcam?

A webcam is a camera plus an opinionated, invisible ISP: auto-exposure, auto-white-balance, auto-focus, and aggressive compression, all tuned to make video calls look pleasant. Those automatics are exactly what a vision pipeline does not want, because the same scene produces different pixels every time the automatics re-decide. Industrial cameras (product families such as Basler ace, Teledyne FLIR Blackfly, IDS uEye, Allied Vision Alvium, and the Sony Pregius and STARVIS sensor families inside many of them) invert the philosophy: raw or lightly processed frames, every parameter fixed and scriptable, hardware trigger inputs for synchronizing with strobes and encoders, and screw-lock connectors and mounting threads that survive vibration.

For learning the material in this book, a webcam is fine, and you can claw back a useful amount of reproducibility by pinning its automatics:

import cv2

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

# Manual-exposure mode on V4L2: older OpenCV builds take 0.25 (0.75 restores auto);
# newer builds take the raw V4L2 values 1 (manual) and 3 (auto). Check with cap.get().
cap.set(cv2.CAP_PROP_AUTO_EXPOSURE, 0.25)
cap.set(cv2.CAP_PROP_EXPOSURE, -6)          # units are backend-specific; verify below
cap.set(cv2.CAP_PROP_AUTO_WB, 0)            # freeze white balance
cap.set(cv2.CAP_PROP_WB_TEMPERATURE, 4600)  # then pin a color temperature

ok, frame = cap.read()
assert ok, "camera returned no frame"
# Always read back what the driver actually accepted
print(cap.get(cv2.CAP_PROP_EXPOSURE), frame.shape, frame.dtype)
cap.release()

Pinning a webcam's auto-exposure and auto-white-balance through OpenCV so consecutive captures are comparable. Property units and even whether a property is settable vary by backend and camera, so the read-back print at the end is not optional.

The single most underrated purchase in any camera budget is not the camera: it is lighting. A fixed, diffuse, bright light source removes more variance from a dataset than any amount of preprocessing from Chapter 2 can remove afterwards. Machine vision engineers spend as much on illumination as on cameras, and they are right to.

Practical Example: The Smearing Solder Joints

Who: A three-person team adding automated optical inspection to a small electronics assembly line.

Situation: Their defect classifier, trained on bench photos, performed well offline but produced noisy, low-confidence predictions on the line.

Problem: The line camera was a rolling-shutter model, and boards arrived on a conveyor that never stopped. Every frame contained a slight horizontal smear and skew that bench images lacked; the classifier was being asked about a distribution it had never seen.

Decision: They replaced the camera with a global-shutter model from the same product family, added a hardware trigger from the conveyor's encoder, and installed a diffuse white strobe fired by the same trigger.

Result: Frames became geometrically identical from board to board. Classifier confidence recovered to bench levels with no retraining, and the strobe froze residual motion so well they could double the belt speed.

Lesson: When deployment images differ from training images, suspect the acquisition hardware before suspecting the model. A trigger, a strobe, and the right shutter buy accuracy that no architecture change can.

2. Choosing GPUs for Training and Inference

GPU marketing is a blizzard of core counts, clock speeds, and benchmark bars. For vision work, almost all of it is secondary to one number.

Key Insight: VRAM Is the Binding Constraint

A GPU with too little memory cannot run your job at any speed; a GPU with enough memory but fewer cores merely runs it slower. Memory capacity is a hard wall, compute is a soft slope. Buy VRAM first, memory bandwidth second, and tensor-core generation third. Every other specification is a tiebreaker.

You can estimate memory needs on the back of an envelope. Inference in fp16 costs about 2 bytes per parameter plus activations. Training with the AdamW optimizer under mixed precision (the standard recipe of Chapter 18) keeps fp16 weights and gradients plus fp32 master weights and two optimizer moments: roughly 16 bytes per parameter (2 + 2 + 4 + 4 + 4), before activations. Activations scale with batch size and resolution and typically add one to a few gigabytes for the models in this book:

def training_vram_gb(params_millions, bytes_per_param=16, activations_gb=2.0):
    """Rough AdamW mixed-precision footprint: fp16 weights and gradients,
    fp32 master weights and two optimizer moments (about 16 bytes/param),
    plus an activation budget that grows with batch size and resolution."""
    weights = params_millions * 1e6 * bytes_per_param / 2**30
    return weights + activations_gb

for name, p in [("ResNet-50", 25.6), ("ViT-B/16", 86.6), ("SDXL U-Net", 2570)]:
    print(f"{name:11s} ~{training_vram_gb(p):5.1f} GB")
# ResNet-50   ~  2.4 GB
# ViT-B/16    ~  3.3 GB
# SDXL U-Net  ~ 40.3 GB

A back-of-envelope VRAM estimator for full fine-tuning with AdamW and mixed precision. The SDXL line is why parameter-efficient methods such as the LoRA adapters of Chapter 35 exist: shrinking the trainable parameter count collapses the optimizer-state term.
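To see how much the optimizer-state term collapses, the estimator can be split into frozen and trainable parts. This is a rough sketch under stated assumptions: the frozen base is held in fp16 at 2 bytes per parameter, only the adapters pay the 16-byte training cost, and the adapter size (1% of the base parameter count) is a hypothetical figure chosen for illustration. The activation budget is kept at the same flat 2 GB, although in practice activations still flow through the frozen layers and do not shrink.

```python
def lora_vram_gb(params_millions, trainable_fraction,
                 frozen_bytes_per_param=2, trainable_bytes_per_param=16,
                 activations_gb=2.0):
    """Rough footprint when the base model is frozen in fp16 and only a
    small set of adapter parameters carries gradients and optimizer state."""
    total = params_millions * 1e6
    frozen = total * frozen_bytes_per_param
    trainable = total * trainable_fraction * trainable_bytes_per_param
    return (frozen + trainable) / 2**30 + activations_gb

# Hypothetical: adapters equal to 1% of the SDXL U-Net's parameter count
full = 2570 * 1e6 * 16 / 2**30 + 2.0
lora = lora_vram_gb(2570, trainable_fraction=0.01)
print(f"full fine-tune ~{full:.1f} GB, adapter fine-tune ~{lora:.1f} GB")
```

Under these assumptions the footprint falls from roughly 40 GB to roughly 7 GB, which is the difference between renting a datacenter card and using a mainstream consumer one.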

Mapping the estimate onto the book's four parts:

Table E.2 condenses this mapping into the memory tiers you will actually shop in; products rotate through the tiers every generation, but the tiers themselves stay put.

| VRAM tier | Typical hardware class | What fits comfortably |
| --- | --- | --- |
| 6-8 GB | Entry consumer cards, gaming laptops | All of Parts I-II; Part III inference and small AMP fine-tunes; SD 1.5 inference |
| 12-16 GB | Mainstream consumer cards | Detector and segmenter fine-tuning; ViT-B; SDXL inference; LoRA on SD 1.5 |
| 24 GB | Flagship consumer (xx90-class), entry workstation | SDXL LoRA, quantized 12B text-to-image inference, ViT-L, NeRF and 3D Gaussian splatting training |
| 32-48 GB | Workstation cards, previous-generation datacenter cards | Full SDXL-class fine-tunes, larger batches, small video models |
| 80 GB and up | Current datacenter accelerators, usually rented | Pretraining, video diffusion, foundation-model fine-tuning, multi-GPU jobs |
Table E.2: GPU memory tiers mapped to what this book asks of them. Tiers are durable even as specific products rotate through them.

Consumer versus datacenter families

Consumer GeForce-class cards offer the best fp16 throughput per dollar ever sold and are the default for individuals and small labs; their costs are no ECC memory, no NVLink on recent generations, 3-slot coolers designed for gaming cases, and driver license terms that restrict datacenter deployment. Workstation cards (the RTX professional line) trade some price efficiency for more VRAM, ECC, blower coolers that stack densely, and certified drivers. Datacenter accelerators add huge memory (40 to 80 GB and beyond), NVLink interconnect for multi-GPU training, and fp8 support, at prices that only make sense rented or amortized across a team. The pragmatic pattern for most readers: develop and fine-tune on one consumer card, rent datacenter hardware for the few jobs that need it. CUDA documentation lives at docs.nvidia.com/cuda; the non-NVIDIA paths, AMD's ROCm (rocm.docs.amd.com) and Apple-silicon Macs via PyTorch's MPS backend, are genuinely usable for Part III development, but expect occasional missing operators and verify your full pipeline early before committing.

When renting beats buying

Tip: The Utilization Break-Even

A flagship-consumer-class card costs on the order of $1500 to $2500; marketplace cloud GPUs of the same class rent for roughly $0.30 to $0.60 per hour, and datacenter cards for a few dollars per hour. The break-even is therefore around 3000 to 6000 GPU-hours, two to three years of nights-and-weekends use. Rent when your usage is bursty (a sweep this week, nothing next week), when one job needs more VRAM than you own, or when you want multi-GPU for days rather than forever. Buy when the GPU would run more than about a third of working hours, when data cannot leave your premises, or when iteration latency (no upload, no queue) is worth real money to you. Most readers are best served by exactly one local card plus a cloud account.
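The break-even arithmetic is simple enough to rerun with your own quotes. The sketch below pairs prices and rates from the ranges in the tip; it deliberately ignores electricity, resale value, and depreciation, any of which shifts the answer.

```python
def break_even_hours(purchase_usd, rental_usd_per_hour):
    """GPU-hours of rental that cost as much as buying the card outright.
    Ignores electricity, resale value, and depreciation."""
    return purchase_usd / rental_usd_per_hour

# Pairings drawn from the price and rate ranges quoted in the tip
for price, rate in [(1500, 0.60), (2000, 0.45), (2500, 0.30)]:
    hours = break_even_hours(price, rate)
    print(f"${price} card vs ${rate:.2f}/h rental: {hours:,.0f} GPU-hours")
```

The middle pairing lands near 4,400 GPU-hours; the extreme pairings (cheap card against expensive rental, and the reverse) land near 2,500 and 8,300, which shows how sensitive the decision is to the rental rate you can actually get.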

3. Edge & Embedded Targets

Training happens where power is cheap; inference increasingly happens where the camera is. Chapter 28 covers the software side of this move (quantization, pruning, distillation, and export through ONNX to runtimes like TensorRT and OpenVINO); this section maps the hardware you export to. Table E.3 summarizes the field.

| Device family | Compute class | Power | Primary toolchain | Sweet spot |
| --- | --- | --- | --- | --- |
| NVIDIA Jetson (Nano to AGX class) | ~10-275 INT8 TOPS | 5-60 W | JetPack: CUDA, TensorRT, DeepStream | Detection and segmentation at video rate, multi-camera robots |
| Google Coral (Edge TPU) | 4 INT8 TOPS | ~2 W | LiteRT (TensorFlow Lite), int8 only | Always-on classification, lightweight detection |
| Raspberry Pi 5 plus M.2 NPU accelerators | CPU plus 13-26 TOPS add-on | 5-15 W | picamera2/libcamera, ONNX Runtime, vendor SDKs | Prototypes, kiosks, modest frame rates at hobby prices |
| Phones (Apple Neural Engine, Android NPUs) | Tens of TOPS | Battery | Core ML; LiteRT delegates / NNAPI | In-app vision, AR, on-device privacy |
Table E.3: Edge inference targets compared by compute class, power envelope, and toolchain rather than by model year.

Jetson is the closest thing to "your training GPU, shrunk": the same CUDA and TensorRT stack runs on modules spanning roughly 10 to 275 INT8 TOPS, so a PyTorch model exported on your workstation deploys with minimal translation. The family (documented at docs.nvidia.com/jetson, module lineup at developer.nvidia.com/embedded/jetson-modules) accepts MIPI CSI-2 cameras directly, which is why Table E.1's embedded interface matters. Choose Jetson when you need real-time detection or segmentation, multiple camera streams, or any CUDA-only dependency.

Coral's Edge TPU (coral.ai/docs) sits at the opposite pole: a 4-TOPS, roughly 2-watt accelerator that runs only fully int8-quantized LiteRT models with supported operators. Inside that fence it is astonishingly efficient; outside it, models simply will not compile. It rewards the quantization-aware training discipline of Chapter 28 and suits battery-powered, always-on sensing.

Raspberry Pi (raspberrypi.com/documentation) is the prototyping default. A Pi 5 runs small classifiers and detectors on CPU at single-digit frame rates; pairing it with an M.2 or HAT-format NPU (the Hailo-based AI Kit family is the canonical example) lifts it into real-time territory for modest models. The camera stack is libcamera (libcamera.org) with the picamera2 Python bindings, and official camera modules connect over CSI-2, including a global-shutter module, so Section 1's shutter advice applies even at hobby scale.

Phones are the most powerful edge devices most users already own. On iOS, models convert to Core ML (developer.apple.com/documentation/coreml) and run on the Apple Neural Engine; on Android, LiteRT (ai.google.dev/edge/litert) dispatches to GPU and NPU delegates, the successor path to the older NNAPI. The export bridges are Core ML Tools and ONNX, plus PyTorch's ExecuTorch runtime for direct mobile deployment. The recurring constraint across all four families is the same: int8 (and increasingly 4-bit) quantization support in your model is the price of admission, and memory, not TOPS, is usually what kills a port.
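Because memory is usually what kills a port, it helps to tabulate the weight footprint at each precision before choosing a target. The sketch below counts weights only, using the ResNet-50 parameter count from the Section 2 estimator; activations, runtime buffers, and any per-tensor quantization metadata come on top.

```python
BITS = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}

def weights_mb(params_millions, precision):
    """Storage for the weights alone, in MB (1 MB = 1e6 bytes);
    activations and runtime buffers come on top."""
    return params_millions * 1e6 * BITS[precision] / 8 / 1e6

# ResNet-50 parameter count as used in the Section 2 estimator
for precision in BITS:
    print(f"{precision}: {weights_mb(25.6, precision):6.1f} MB")
```

Each halving of precision halves the footprint, from about 102 MB at fp32 to about 13 MB at 4 bits, which is often the difference between fitting in an accelerator's on-chip memory and spilling to slower DRAM.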

Research Frontier: Generative Models Move On-Device

Edge hardware guides used to end at detection. Since 2024 the frontier is on-device generation and multimodal understanding: 4-bit quantized diffusion models producing images in seconds on phone NPUs, distilled one-step and few-step samplers (the step-distillation lineage of Chapter 33), and small vision-language models running fully offline on laptop-class NPUs. Vendor NPU TOPS figures are climbing precisely to serve these workloads, and "AI PC" marketing is largely this trend wearing a suit. The durable lesson for buyers: quantization support and memory bandwidth on edge silicon now matter more than peak TOPS, because generative workloads are memory-bound.

4. Storage & Data Pipelines for Image Datasets

A training run is a pipeline: storage feeds decoders, decoders feed augmentation, augmentation feeds the GPU. The GPU is the most expensive stage, so the engineering goal is simple: the GPU must never wait. A single modern GPU training a detector can consume several hundred to several thousand images per second; if your storage and decode stages deliver less, you have bought compute you cannot use.
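Turning "the GPU must never wait" into a storage requirement takes one multiplication. The sketch below uses hypothetical figures (100 KB JPEGs consumed at 2000 images per second); measure your own average file size and consumption rate before drawing conclusions.

```python
def required_read_mb_s(images_per_second, avg_image_kb):
    """Sustained read rate storage must deliver, in MB/s, for the GPU
    to stay fed at the given consumption rate."""
    return images_per_second * avg_image_kb / 1000.0

# Hypothetical: 100 KB JPEGs consumed at 2000 images/s
rate = required_read_mb_s(2000, 100)
print(f"storage must sustain about {rate:.0f} MB/s")
```

Two hundred megabytes per second is trivial for sequential reads but well beyond what random reads of small files achieve on many disks and network filesystems.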

The classic mistake is storing a dataset as millions of individual small files. Filesystems handle a million 100 KB files far worse than a hundred 1 GB files: every open is metadata overhead, and on network filesystems the latency multiplies. The standard remedies all amount to sharding: pack samples into sequential archives using WebDataset tar shards (github.com/webdataset/webdataset), Hugging Face Datasets with Parquet-backed storage (huggingface.co/docs/datasets), or an LMDB database. Sequential reads from sharded archives saturate even cheap disks; random reads of tiny files saturate nothing.
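The sharding idea can be illustrated with nothing but the standard library. The sketch below writes plain tar shards in which each sample contributes two members sharing a basename, the grouping convention that WebDataset-style readers rely on; `write_shards`, the `.cls` label extension, and the shard naming pattern are illustrative choices here, and the placeholder bytes stand in for real JPEG data.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def write_shards(samples, out_dir, samples_per_shard=1000):
    """Pack (key, image_bytes, label) samples into sequential tar shards.
    Each sample becomes two members, `<key>.jpg` and `<key>.cls`.
    Returns the list of shard paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, tar = [], None
    for i, (key, image_bytes, label) in enumerate(samples):
        if i % samples_per_shard == 0:          # start a new shard
            if tar is not None:
                tar.close()
            path = out_dir / f"shard-{i // samples_per_shard:06d}.tar"
            tar = tarfile.open(path, "w")
            paths.append(path)
        for name, payload in ((f"{key}.jpg", image_bytes),
                              (f"{key}.cls", str(label).encode())):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return paths

# Placeholder samples: the bytes are not real JPEGs
fake = [(f"{i:08d}", b"\xff\xd8 placeholder bytes", i % 10) for i in range(2500)]
with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(fake, tmp, samples_per_shard=1000)
    with tarfile.open(shards[0]) as first:
        members_in_first = len(first.getnames())
    print(len(shards), "shards;", members_in_first, "members in the first")
```

In a real pipeline you would pick `samples_per_shard` so each shard lands somewhere around hundreds of megabytes to a gigabyte, then shuffle at the shard level and within a buffer at read time.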

Plan capacity in tiers, like a memory hierarchy:

Throughput is only half the battle; decode cost is the other half. JPEG decoding happens on the CPU by default, and at high image rates it, not the disk, becomes the bottleneck. The levers, in escalation order: more DataLoader workers (the input-pipeline machinery of Chapter 18), resizing images offline to the resolution you actually train at (the augmentation pipelines of Chapter 21 rarely need originals), and GPU-side decoding and augmentation with NVIDIA DALI (docs.nvidia.com/deeplearning/dali). Measure before optimizing: time your loader alone, with the model step disabled, and compare images per second against what the GPU consumes.
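Measuring the loader in isolation needs only a timer around the iteration. The sketch below works with any iterable of batches; `loader_throughput` and `fake_loader` are hypothetical names, and the sleeping generator merely stands in for disk reads and JPEG decode. With a PyTorch DataLoader that yields (images, labels) tuples, pass `batch_size` explicitly, because `len()` of such a tuple is 2, not the batch size.

```python
import itertools
import time

def loader_throughput(loader, max_batches=100, batch_size=None):
    """Iterate `loader` alone, with no model step, and return items/second.
    Batch length comes from `batch_size` if given, else len() of each batch."""
    seen = 0
    start = time.perf_counter()
    for batch in itertools.islice(loader, max_batches):
        seen += batch_size if batch_size is not None else len(batch)
    elapsed = time.perf_counter() - start
    return seen / elapsed if elapsed > 0 else float("inf")

def fake_loader(batches=20, batch_size=32, delay_s=0.001):
    """Stand-in loader: sleeps to imitate read and decode latency."""
    for _ in range(batches):
        time.sleep(delay_s)
        yield [None] * batch_size

rate = loader_throughput(fake_loader())
print(f"loader alone delivers {rate:,.0f} images/s")
```

If this number is below what the GPU consumes during a training step, adding compute will not help; work down the escalation list above until the loader is comfortably ahead.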

Finally, treat datasets as versioned artifacts, not folders. DVC (dvc.org/doc) gives datasets the commit-and-diff discipline git gives code, and FiftyOne (docs.voxel51.com) lets you visually audit what is actually inside the shards before you spend GPU-days training on it.

Fun Note: Parkinson's Law of Datasets

Data expands to fill the storage available, plus one more drive. Every vision practitioner eventually owns a directory named datasets_old_FINAL_backup2 whose contents nobody dares delete and nobody can identify. Sharding and versioning will not cure the hoarding instinct, but at least the hoard becomes greppable.

5. Putting It Together: A Decision Tree by Budget

Hardware advice ages; budget tiers do not. The callout below compresses Sections 1 through 4 into a decision tree organized by total spend, naming capability classes rather than products. Prices are rough 2026 street figures for orientation; the structure of the advice outlives them.

Decision Tree: What to Buy at Each Budget

Whatever the tier, the spending order is the same: lighting before lenses, lenses before sensors, VRAM before cores, NVMe before more compute, and measurements before all of it. Hardware purchased to fix a bottleneck you have actually profiled is rarely regretted; hardware purchased from a spec sheet usually is.

Documentation Hubs & Standards

Camera & Sensor Standards

EMVA 1288 Standard. emva.org/standards-technology/emva-1288
The vendor-neutral sensor characterization standard: quantum efficiency, dark noise, and saturation capacity reported in comparable units, the basis for the sensor advice in Section 1.
GenICam Standard. emva.org/standards-technology/genicam
The generic camera programming interface that makes exposure and trigger code portable across vendors and across the USB3, GigE, and CoaXPress interfaces of Table E.1.
MIPI CSI-2 Specification. mipi.org/specifications/csi-2
The embedded camera interface used by Jetson, Raspberry Pi, and phones; relevant whenever Section 3's edge devices need direct sensor input.
libcamera Project. libcamera.org
The open-source camera stack underneath Raspberry Pi's picamera2 and a growing share of embedded Linux capture pipelines.

GPU & Edge Toolchains

NVIDIA CUDA Documentation. docs.nvidia.com/cuda
The programming-model and toolkit reference for the GPU families discussed in Section 2.
NVIDIA TensorRT Documentation. docs.nvidia.com/deeplearning/tensorrt
The optimizing inference runtime shared by workstation GPUs and the Jetson family; the deployment path detailed in Chapter 28.
NVIDIA Jetson Documentation. docs.nvidia.com/jetson
JetPack, multimedia, and camera documentation for the embedded module family in Table E.3.
AMD ROCm Documentation. rocm.docs.amd.com
The open compute stack for AMD GPUs; the main non-NVIDIA training path noted in Section 2.
Google Coral Documentation. coral.ai/docs
Edge TPU compiler constraints and the int8 model requirements discussed in Section 3.
Apple Core ML Documentation. developer.apple.com/documentation/coreml
The on-device inference framework for iOS and macOS targets, including Neural Engine dispatch.
Google LiteRT Documentation. ai.google.dev/edge/litert
The successor to TensorFlow Lite and the NNAPI era: delegate-based GPU and NPU inference on Android and embedded targets.
ONNX Runtime Documentation. onnxruntime.ai/docs
The portable inference runtime that bridges Section 2's training hardware to Section 3's heterogeneous edge devices.
Intel OpenVINO Documentation. docs.openvino.ai
Inference optimization for CPUs, integrated GPUs, and NPUs; the usual answer when an edge box has no discrete accelerator at all.

Data Pipeline Tooling

NVIDIA DALI Documentation. docs.nvidia.com/deeplearning/dali
GPU-accelerated decoding and augmentation, the escalation path when Section 4's CPU decode stage becomes the bottleneck.
WebDataset. github.com/webdataset/webdataset
Tar-shard dataset format and PyTorch-compatible loader that turns millions of small files into fast sequential reads.
DVC Documentation. dvc.org/doc
Dataset and model versioning with git-like semantics, the artifact discipline recommended in Section 4.
FiftyOne Documentation. docs.voxel51.com
Visual dataset curation and inspection, for auditing what is actually inside the shards before training on them.