Chapter 28: Efficient Vision & Edge Deployment

"They trained me for six weeks on a cluster the size of a tennis court, then asked me to run on a doorbell. I lost three quarters of my weights and most of my dignity, and I have never been faster."
A Formerly Float32 Network, Now Quantized and Content

Big Picture

A model that cannot run on the target hardware within the target latency and power budget is not a model; it is a research artifact. Everything in Part III so far has optimized for accuracy on a benchmark, measured in milliseconds on a data-center GPU you did not have to pay for by the watt. Production inverts the constraint. The question is no longer "what is the best mean average precision" but "what is the best accuracy I can buy under sixteen milliseconds, two watts, and four megabytes of weights, on a chip that costs eight dollars." This chapter is about closing that gap. We shrink the model with quantization, pruning, and distillation; we compile it to a hardware-specific runtime with ONNX, TensorRT, and OpenVINO; we deploy it to edge devices from a Jetson to a phone; we serve it under real traffic with batching and concurrency; and we watch it after launch, because the world the model meets in production is never the world it was trained on.

Chapter Overview

Every model the last ten chapters built, the classifiers, detectors, segmenters, video networks, and the depth and neural-scene models of Chapter 27, shares the same unfinished business: it has been proven accurate but not yet made deployable. There is a moment in every applied vision project where the research ends and the engineering begins. The validation curve has plateaued, the model is as accurate as it is going to get, and someone asks the question that no benchmark answers: can we actually ship this? The honest answer is usually no, not yet. The model that scores well on the leaderboard is a few hundred megabytes of float32 weights, it assumes a recent GPU, and it returns a prediction in forty milliseconds only when nothing else is competing for the device. The product needs it to run in real time on a battery-powered camera, or to serve ten thousand requests a second without melting the budget. The distance between those two states is the subject of this chapter.

We cross that distance in five steps, and the order matters. Section 28.1 opens the efficiency toolbox: the three model-level techniques that make a network smaller and faster without retraining it from scratch. Quantization stores weights and computes with them in eight bits instead of thirty-two, cutting weight memory by a factor of four and unlocking integer math that modern chips run far faster than floating point. Pruning removes the weights and channels that contribute least, trading a small accuracy drop for a smaller, sparser network. Distillation trains a small student model to imitate a large teacher, transferring accuracy that the student could not reach by training on labels alone. These techniques compose, and used together they routinely deliver a four-to-tenfold speedup for a single-digit percentage-point accuracy cost.
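To make the factor-of-four claim concrete, here is a minimal sketch of affine (scale and zero-point) quantization to eight bits, in the spirit of the scheme Section 28.1 develops. The function names `quantize_affine` and `dequantize_affine` are illustrative, and the choice of a single per-tensor range taken from the minimum and maximum weight is an assumption made for brevity; production toolchains typically calibrate on data and often use per-channel scales.

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map a float array to unsigned integers with a (scale, zero_point) pair."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # all-zero tensor: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Recover an approximate float array from the integer codes."""
    return scale * (q.astype(np.float32) - zero_point)

# Hypothetical weight tensor standing in for one layer of a trained network.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(64, 64)).astype(np.float32)

q, scale, zp = quantize_affine(weights)
restored = dequantize_affine(q, scale, zp)
max_error = float(np.abs(weights - restored).max())

print(f"float32 bytes: {weights.nbytes}, uint8 bytes: {q.nbytes}")
print(f"scale: {scale:.6f}, zero point: {zp}, max abs error: {max_error:.6f}")
```

The storage drops from four bytes per weight to one, and the worst-case reconstruction error is on the order of half a quantization step. Whether that error is tolerable is an empirical question about the network, which is why the accuracy cost has to be measured rather than assumed.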

A smaller model still needs to actually run, and a PyTorch nn.Module running under the Python interpreter is not how you deploy. Section 28.2 is about export and runtimes: ONNX as the portable interchange format that decouples the framework you train in from the engine you serve on, and the two dominant hardware-specific compilers, NVIDIA's TensorRT for their GPUs and Intel's OpenVINO for their CPUs and accelerators. We trace what a compiler actually does (operator fusion, kernel autotuning, precision calibration) and why a compiled engine is often several times faster than the same graph run eagerly. Section 28.3 takes the compiled model to the edge: the Jetson family for embedded GPU compute, the mobile runtimes (Core ML, TensorFlow Lite, ExecuTorch) that put vision on a phone, and the architectural choices, the MobileNet and EfficientNet families, designed for these constraints from the start. The camera and sensor pipeline first met in Chapter 1 reappears here as the front end of an edge system.
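Operator fusion is easiest to see on a small case. The sketch below folds an inference-time batch normalization into the linear layer that precedes it, so two operators become one with rewritten weights. It uses a plain matrix multiply in place of a convolution, and the helper name `fold_batchnorm` is hypothetical; it illustrates the kind of rewrite a compiler can perform, not the internals of any particular runtime.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-time BatchNorm into the preceding linear layer.

    y = gamma * ((W @ x + b) - mean) / sqrt(var + eps) + beta
      = (s[:, None] * W) @ x + (s * (b - mean) + beta),  s = gamma / sqrt(var + eps)
    """
    s = gamma / np.sqrt(var + eps)
    return W * s[:, None], s * (b - mean) + beta

# Random stand-ins for trained parameters and running statistics.
rng = np.random.default_rng(0)
out_ch, in_ch = 8, 16
W = rng.normal(size=(out_ch, in_ch))
b = rng.normal(size=out_ch)
gamma, beta = rng.normal(size=out_ch), rng.normal(size=out_ch)
mean, var = rng.normal(size=out_ch), rng.uniform(0.5, 2.0, size=out_ch)
x = rng.normal(size=in_ch)

# Unfused: a linear operator followed by a separate normalization operator.
unfused = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta

# Fused: a single linear operator with folded weights and bias.
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var)
fused = W_f @ x + b_f

print("outputs match:", np.allclose(unfused, fused))
```

The fused layer produces the same output up to floating-point rounding, with one fewer pass over the activations. The saving comes from memory traffic and kernel launches rather than arithmetic, which is part of why it does not show up in a FLOP count.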

Not every model lives at the edge; many live behind an API, and serving them well is its own discipline. Section 28.4 covers vision-model serving: the throughput-versus-latency trade that governs every serving decision, dynamic batching that amortizes GPU overhead across requests, and the inference servers (Triton, TorchServe, Ray Serve) that productionize all of it. Finally, Section 28.5 confronts the truth that deployment is not the finish line. Data drifts, the distribution shifts under your model's feet, and accuracy decays silently because production has no labels to tell you it is failing. The section builds the monitoring, drift-detection, and continual-improvement loop that keeps a deployed model honest, closing the chapter where every real system actually lives.
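The batching idea can be sketched in a few lines. The toy scheduler below groups requests by arrival time and dispatches a batch when it is full or when its oldest request has waited long enough. The function names, the arrival times, and the cost model (a fixed per-launch overhead plus a per-item cost) are all assumptions chosen for illustration; this is not how any specific inference server implements its scheduler.

```python
def form_batches(arrival_times_ms, max_batch_size=8, max_wait_ms=5.0):
    """Group sorted request arrival times into batches.

    A batch is dispatched when it reaches max_batch_size, or when a new
    request arrives after the oldest queued request has waited max_wait_ms.
    """
    batches, current = [], []
    for t in arrival_times_ms:
        # Close the open batch if this request arrives after its deadline.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

def total_gpu_time_ms(batches, overhead_ms=2.0, per_item_ms=0.5):
    """Assumed cost model: fixed launch overhead per batch plus per-item work."""
    return sum(overhead_ms + per_item_ms * len(b) for b in batches)

# Hypothetical arrival times in milliseconds.
arrivals = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 30.0]
batched = form_batches(arrivals, max_batch_size=4, max_wait_ms=5.0)
unbatched = [[t] for t in arrivals]

print("batches:", batched)
print("GPU time batched:", total_gpu_time_ms(batched), "ms")
print("GPU time unbatched:", total_gpu_time_ms(unbatched), "ms")
```

Under this cost model the seven requests take 9.5 ms of device time in three batches against 17.5 ms when served one at a time. The price is that a request may sit in the queue for up to `max_wait_ms` before it runs, which is the throughput-versus-latency trade in miniature.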

The recurring theme is that efficiency is a system property, not a model property. A quantized network exported to a tuned runtime on the right accelerator behind a well-batched server with a working drift monitor is a deployment; any one of those pieces missing is a liability. By the end of this chapter you will be able to take a model from Chapter 23 or Chapter 24 and walk it all the way to a chip, and you will know which knob to turn when the latency budget is blown.

Prerequisites

This chapter assumes the deep-learning foundation of Part III. You should have built and trained a model in PyTorch, so Chapter 18: Neural Networks & PyTorch for Vision is essential, and you should understand the convolutional and transformer backbones we will be compressing, from Chapter 19: Convolutional Neural Networks, Chapter 20: CNN Architectures, and Chapter 22: Vision Transformers. The training-recipe material of Chapter 21: Training Recipes matters because quantization-aware training and distillation are fine-tuning recipes. The detectors of Chapter 23: Object Detection and segmentation models of Chapter 24: Segmentation are the running examples we deploy. From Part I, the image and sensor pipeline of Chapter 1: Digital Image Fundamentals is the front end of every edge system here. Comfort reading basic latency and throughput numbers, and a willingness to measure rather than guess, will serve you throughout.

Chapter Roadmap

28.1 The Efficiency Toolbox: Quantization, Pruning & Distillation
The three model-level compression techniques: integer quantization (post-training and quantization-aware), magnitude and structured pruning, and knowledge distillation. How each works, what it costs in accuracy, and how they compose into a single shrunk model.

28.2 Export & Runtimes: ONNX, TensorRT & OpenVINO
Moving from a training framework to a serving engine: ONNX as the portable graph format, what a compiler does (operator fusion, autotuning, precision calibration), and the two dominant hardware runtimes, TensorRT for NVIDIA GPUs and OpenVINO for Intel CPUs and accelerators.

28.3 Edge & Mobile Vision: From Jetson to Phones
Deploying to constrained devices: the Jetson embedded-GPU family, the mobile runtimes (Core ML, TensorFlow Lite, ExecuTorch), and the efficient architectures (MobileNet, EfficientNet) designed for the edge. Power, memory, and thermal budgets as first-class constraints.

28.4 Serving Vision Models: Batching, Throughput & Latency
Productionizing a model behind an API: the throughput-versus-latency trade, dynamic batching that amortizes GPU overhead, concurrency and model instances, and the inference servers (Triton, TorchServe, Ray Serve) that orchestrate it all under real traffic.

28.5 Monitoring, Drift & Continual Improvement
Life after launch: why accuracy decays silently without labels, detecting data and concept drift from inputs and confidence alone, the human-in-the-loop labeling and retraining loop, and the guardrails (shadow deployment, canary rollout) that make continual improvement safe.

What's Next?

This chapter is the last of Part III's content chapters, and it ends the journey from a trained network to a running product. Chapter 29: Tools of the Trade: The Deep Vision Stack follows immediately and zooms out: it surveys the full ecosystem of frameworks, libraries, model zoos, experiment trackers, and deployment platforms that the previous eleven chapters have drawn on piecemeal, so you can assemble your own stack with intent rather than habit. Many of the deployment tools introduced here, ONNX, TensorRT, OpenVINO, Triton, and the edge runtimes, reappear there in the context of the wider toolchain. Beyond Part III, the efficiency techniques of this chapter become essential again in Chapter 33: Diffusion Models and Chapter 34: Text-to-Image Systems, where the sheer size of generative models makes quantization, distillation, and step-reduction not a luxury but the only path to interactive generation.

Bibliography & Further Reading

Foundational Papers

Jacob, B. et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR (2018). arXiv:1712.05877

The integer-only quantization scheme of Section 28.1 and the foundation of TensorFlow Lite's quantized inference. Defines the affine mapping between float and int8 and the simulated-quantization trick that makes quantization-aware training possible.

Hinton, G., Vinyals, O. & Dean, J. "Distilling the Knowledge in a Neural Network." NeurIPS Deep Learning Workshop (2014); arXiv:1503.02531 (2015)

The knowledge-distillation paper of Section 28.1. Soft targets from a teacher's temperature-scaled logits carry "dark knowledge" about class similarity that hard labels do not, letting a small student learn what it could not learn alone.

Han, S., Mao, H. & Dally, W. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR (2016), Best Paper. arXiv:1510.00149

The deep-compression pipeline of Section 28.1 that combined pruning, quantization, and coding to shrink the storage of AlexNet and VGG-16 by 35 and 49 times respectively without loss of accuracy. The paper that made model compression a research field in its own right.

Frankle, J. & Carbin, M. "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." ICLR (2019), Best Paper. arXiv:1803.03635

The lottery-ticket hypothesis referenced in Section 28.1: dense networks contain sparse subnetworks that, trained in isolation from the original initialization, match the full network's accuracy. An influential empirical account of why pruning works.

Efficient Architectures

Howard, A. et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications." (2017). arXiv:1704.04861

The depthwise-separable convolution of Section 28.3 that splits a standard convolution into a per-channel spatial filter and a pointwise mixing step, cutting compute by roughly eight to nine times for 3x3 kernels. The architecture that defined on-device vision.

Tan, M. & Le, Q. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." ICML (2019). arXiv:1905.11946

The compound-scaling rule of Section 28.3 that scales depth, width, and resolution together by a single coefficient, producing a family of models on the accuracy-efficiency frontier. The reference for choosing a model size to a hardware budget.

Sandler, M. et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR (2018). arXiv:1801.04381

The inverted-residual block of Section 28.3 that expands, filters depthwise, then projects back down, the building block of nearly every modern mobile vision backbone and the default student in many distillation setups.

Tools, Runtimes & Servers

ONNX: Open Neural Network Exchange, the portable model format. onnx.ai · onnxruntime.ai

The interchange format and cross-platform runtime of Section 28.2 that decouples the training framework from the serving engine. ONNX Runtime ships execution providers for CPU, CUDA, TensorRT, OpenVINO, Core ML, and more.

NVIDIA TensorRT, the high-performance GPU inference compiler. docs.nvidia.com/deeplearning/tensorrt

The GPU runtime of Sections 28.2 and 28.3. Operator fusion, kernel autotuning, and int8 calibration that turn a static graph into an optimized engine for a specific GPU, the default for NVIDIA deployment from data center to Jetson.

Intel OpenVINO, the toolkit for CPU and accelerator inference. docs.openvino.ai

The CPU and integrated-accelerator runtime of Section 28.2, with a model conversion step (the Model Optimizer in older releases) that turns ONNX or framework graphs into an intermediate representation tuned for Intel hardware. The TensorRT counterpart for the CPU world.

NVIDIA Triton Inference Server, the multi-framework model server. github.com/triton-inference-server/server

The production inference server of Section 28.4 with dynamic batching, concurrent model instances, and multi-framework backends (TensorRT, ONNX, PyTorch). The reference implementation of the serving patterns in that section.

PyTorch quantization, pruning, and ExecuTorch on-device runtime. pytorch.org/docs/stable/quantization · pytorch.org/executorch

The library shortcuts behind Sections 28.1 and 28.3: the torch.ao.quantization and torch.nn.utils.prune APIs, and ExecuTorch, the 2024-onward on-device PyTorch runtime for phones and embedded targets.

Ultralytics YOLO export to ONNX, TensorRT, OpenVINO, Core ML, and TFLite. docs.ultralytics.com/modes/export

The one-command export pipeline used as a running library shortcut across the chapter, wrapping the per-runtime export and quantization steps that Sections 28.2 and 28.3 perform by hand.

Monitoring & Drift

Rabanser, S., Günnemann, S. & Lipton, Z. "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift." NeurIPS (2019). arXiv:1810.11953

The dataset-shift detection study behind Section 28.5: a systematic comparison of statistical tests for detecting distribution shift from unlabeled data, including the dimensionality-reduction-then-test recipe that the section's drift monitor uses.