"They trained me for six weeks on a cluster the size of a tennis court, then asked me to run on a doorbell. I lost three quarters of my weights and most of my dignity, and I have never been faster."
A Formerly Float32 Network, Now Quantized and Content
A model that cannot run on the target hardware within the target latency and power budget is not a model; it is a research artifact. Everything in Part III so far has optimized for accuracy on a benchmark, measured in milliseconds on a data-center GPU you did not have to pay for by the watt. Production inverts the constraint. The question is no longer "what is the best mean average precision" but "what is the best accuracy I can buy under sixteen milliseconds, two watts, and four megabytes of weights, on a chip that costs eight dollars." This chapter is about closing that gap. We shrink the model with quantization, pruning, and distillation; we compile it to a hardware-specific runtime with ONNX, TensorRT, and OpenVINO; we deploy it to edge devices from a Jetson to a phone; we serve it under real traffic with batching and concurrency; and we watch it after launch, because the world the model meets in production is never the world it was trained on.
Chapter Overview
Every model the last ten chapters built, the classifiers, detectors, segmenters, video networks, and the depth and neural-scene models of Chapter 27, shares the same unfinished business: it has been proven accurate but not yet made deployable. There is a moment in every applied vision project where the research ends and the engineering begins. The validation curve has plateaued, the model is as accurate as it is going to get, and someone asks the question that no benchmark answers: can we actually ship this? The honest answer is usually no, not yet. The model that scores well on the leaderboard is a few hundred megabytes of float32 weights, it assumes a recent GPU, and it returns a prediction in forty milliseconds only when nothing else is competing for the device. The product needs it to run in real time on a battery-powered camera, or to serve ten thousand requests a second without melting the budget. The distance between those two states is the subject of this chapter.
We cross that distance in five steps, and the order matters. Section 28.1 opens the efficiency toolbox: the three model-level techniques that make a network smaller and faster without retraining it from scratch. Quantization stores and computes weights in eight bits instead of thirty-two, cutting weight memory by a factor of four and unlocking integer math that modern chips run far faster than floating point. Pruning removes the weights and channels that contribute least, trading a small accuracy drop for a smaller, sparser network. Distillation trains a small student model to imitate a large teacher, transferring accuracy that the student could not reach by training on labels alone. These techniques compose, and used together they routinely deliver a fourfold to tenfold speedup for an accuracy cost in the single-digit percentage points.
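The factor-of-four memory claim for quantization can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, assuming symmetric per-tensor quantization (one scale for the whole tensor, zero point fixed at zero); the function names and the toy weight values are hypothetical and not taken from any library. Production toolchains typically add per-channel scales, calibration data, and activation quantization on top of this idea.

```python
import array

def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the signed 8-bit range.
    q = array.array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale

def dequantize(q, scale):
    """Map the integer levels back to approximate float values."""
    return [scale * v for v in q]

# Hypothetical toy "layer": 101 float32 weights spread over roughly [-1, 1].
weights = array.array("f", [0.02 * (i - 50) for i in range(101)])

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

bytes_fp32 = weights.itemsize * len(weights)   # 4 bytes per weight
bytes_int8 = q.itemsize * len(q)               # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("float32 bytes:", bytes_fp32)
print("int8 bytes:   ", bytes_int8)
print("max round-trip error:", max_error, "(bounded by scale / 2 =", scale / 2, ")")
```

The round-trip error of any single weight is bounded by half the scale, which is why a tensor with a few large outliers quantizes poorly: the outliers stretch the scale and every other weight pays for it.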
A smaller model still needs to actually run, and a PyTorch nn.Module running under the Python interpreter is not how you deploy. Section 28.2 is about export and runtimes: ONNX as the portable interchange format that decouples the framework you train in from the engine you serve on, and the two dominant hardware-specific compilers, NVIDIA's TensorRT for their GPUs and Intel's OpenVINO for their CPUs and accelerators. We trace what a compiler actually does (operator fusion, kernel autotuning, precision calibration) and why a compiled engine is often several times faster than the same graph run eagerly. Section 28.3 takes the compiled model to the edge: the Jetson family for embedded GPU compute, the mobile runtimes (Core ML, TensorFlow Lite, ExecuTorch) that put vision on a phone, and the architectural choices, the MobileNet and EfficientNet families, designed for these constraints from the start. The camera and sensor pipeline first met in Chapter 1 reappears here as the front end of an edge system.
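Operator fusion is easiest to see on the classic case of folding a batch-normalization layer into the linear layer that precedes it. The sketch below is a pure-Python illustration under simplifying assumptions (a small fully connected layer instead of a convolution, inference-time statistics that are already frozen); the helper names and numbers are hypothetical, and this is not a description of how TensorRT or OpenVINO implement the transformation internally.

```python
import math

def linear(x, weight, bias):
    """y = W x + b, with one row of `weight` per output channel."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm using frozen per-channel statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Absorb batch norm into the linear layer so that one op replaces two."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, s in zip(weight, bias, gamma, beta, mean, var):
        k = g / math.sqrt(s + eps)          # per-channel rescaling factor
        fused_w.append([k * w for w in row])
        fused_b.append(k * (b - m) + bt)
    return fused_w, fused_b

# Hypothetical two-channel layer and batch-norm parameters.
x = [0.5, -1.0, 2.0]
weight = [[0.2, -0.1, 0.4], [1.0, 0.3, -0.5]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.05, -0.3]
mean, var = [0.4, -0.1], [0.9, 2.5]

separate = batchnorm(linear(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = linear(x, fused_w, fused_b)

print("two ops:", separate)
print("one op: ", fused)
```

The two results agree up to floating-point rounding, but the fused version reads and writes the intermediate activation once instead of twice, which is where much of the speedup of a compiled graph tends to come from.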
Not every model lives at the edge; many live behind an API, and serving them well is its own discipline. Section 28.4 covers vision-model serving: the throughput-versus-latency trade that governs every serving decision, dynamic batching that amortizes GPU overhead across requests, and the inference servers (Triton, TorchServe, Ray Serve) that productionize all of it. Finally, Section 28.5 confronts the truth that deployment is not the finish line. Data drifts, the distribution shifts under your model's feet, and accuracy decays silently because production has no labels to tell you it is failing. The section builds the monitoring, drift-detection, and continual-improvement loop that keeps a deployed model honest, closing the chapter where every real system actually lives.
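One way to detect drift without labels is to compare the distribution of the model's confidence scores in production against a reference window captured at launch. The sketch below uses the population stability index (PSI), a technique chosen here for illustration rather than one the chapter prescribes. The synthetic beta-distributed scores, the bin count, and the smoothing floor are all assumptions; a commonly quoted rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, though any real threshold should be tuned on your own traffic.

```python
import math
import random

def histogram_fractions(scores, bins=10, floor=1e-4):
    """Fraction of scores per equal-width bin on [0, 1], floored to avoid log(0)."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    total = len(scores)
    return [max(c / total, floor) for c in counts]

def population_stability_index(reference, production, bins=10):
    """PSI = sum over bins of (p - r) * ln(p / r); zero when the histograms match."""
    ref = histogram_fractions(reference, bins)
    prod = histogram_fractions(production, bins)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))

# Synthetic stand-ins for logged top-1 confidence scores (hypothetical data).
rng = random.Random(0)
reference = [rng.betavariate(8, 2) for _ in range(5000)]  # confident model at launch
stable = [rng.betavariate(8, 2) for _ in range(5000)]     # same distribution, new sample
drifted = [rng.betavariate(4, 3) for _ in range(5000)]    # confidence has sagged

psi_stable = population_stability_index(reference, stable)
psi_drifted = population_stability_index(reference, drifted)

print("PSI, stable window: ", psi_stable)
print("PSI, drifted window:", psi_drifted)
```

A monitor built on this idea needs only the scores the model already emits, so it can raise an alarm long before labeled production data exists to confirm the accuracy drop.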
The recurring theme is that efficiency is a system property, not a model property. A quantized network exported to a tuned runtime on the right accelerator, behind a well-batched server with a working drift monitor, is a deployment; with any one of those pieces missing, it is a liability. By the end of this chapter you will be able to take a model from Chapter 23 or Chapter 24 and walk it all the way to a chip, and you will know which knob to turn when the latency budget is blown.
Prerequisites
This chapter assumes the deep-learning foundation of Part III. You should have built and trained a model in PyTorch, so Chapter 18: Neural Networks & PyTorch for Vision is essential, and you should understand the convolutional and transformer backbones we will be compressing, from Chapter 19: Convolutional Neural Networks, Chapter 20: CNN Architectures, and Chapter 22: Vision Transformers. The training-recipe material of Chapter 21: Training Recipes matters because quantization-aware training and distillation are fine-tuning recipes. The detectors of Chapter 23: Object Detection and segmentation models of Chapter 24: Segmentation are the running examples we deploy. From Part I, the image and sensor pipeline of Chapter 1: Digital Image Fundamentals is the front end of every edge system here. Comfort reading basic latency and throughput numbers, and a willingness to measure rather than guess, will serve you throughout.
Chapter Roadmap
- 28.1 The Efficiency Toolbox: Quantization, Pruning & Distillation. The three model-level compression techniques: integer quantization (post-training and quantization-aware), magnitude and structured pruning, and knowledge distillation. How each works, what it costs in accuracy, and how they compose into a single shrunk model.
- 28.2 Export & Runtimes: ONNX, TensorRT & OpenVINO. Moving from a training framework to a serving engine: ONNX as the portable graph format, what a compiler does (operator fusion, autotuning, precision calibration), and the two dominant hardware runtimes, TensorRT for NVIDIA GPUs and OpenVINO for Intel CPUs and accelerators.
- 28.3 Edge & Mobile Vision: From Jetson to Phones. Deploying to constrained devices: the Jetson embedded-GPU family, the mobile runtimes (Core ML, TensorFlow Lite, ExecuTorch), and the efficient architectures (MobileNet, EfficientNet) designed for the edge. Power, memory, and thermal budgets as first-class constraints.
- 28.4 Serving Vision Models: Batching, Throughput & Latency. Productionizing a model behind an API: the throughput-versus-latency trade, dynamic batching that amortizes GPU overhead, concurrency and model instances, and the inference servers (Triton, TorchServe, Ray Serve) that orchestrate it all under real traffic.
- 28.5 Monitoring, Drift & Continual Improvement. Life after launch: why accuracy decays silently without labels, detecting data and concept drift from inputs and confidence alone, the human-in-the-loop labeling and retraining loop, and the guardrails (shadow deployment, canary rollout) that make continual improvement safe.
What's Next?
This chapter is the last of Part III's content chapters, and it ends the journey from a trained network to a running product. Chapter 29: Tools of the Trade: The Deep Vision Stack follows immediately and zooms out: it surveys the full ecosystem of frameworks, libraries, model zoos, experiment trackers, and deployment platforms that the previous eleven chapters have drawn on piecemeal, so you can assemble your own stack with intent rather than habit. Many of the deployment tools introduced here, ONNX, TensorRT, OpenVINO, Triton, and the edge runtimes, reappear there in the context of the wider toolchain. Beyond Part III, the efficiency techniques of this chapter become essential again in Chapter 33: Diffusion Models and Chapter 34: Text-to-Image Systems, where the sheer size of generative models makes quantization, distillation, and step-reduction not a luxury but the only path to interactive generation.