"A prompt is a wish. A control map is a contract. I will still surprise you, but only inside the lines you drew."
A Diffusion Model That Finally Reads the Brief
The text-to-image system of Chapter 34 gives you a slot machine: type a sentence, pull the lever, and accept whatever comes out. This chapter replaces the lever with a steering wheel. Control comes in three flavors that this chapter treats in turn: spatial control fixes where things go (ControlNet on an edge map, a depth map, a pose skeleton); identity control fixes what a subject or style is (LoRA, DreamBooth, textual inversion teaching the model a specific face or look); and editing control changes part of an image while preserving everything else (inpainting, instruction-based editing, and the inversion machinery that lets you edit a real photograph rather than a generated one). The unifying theme is preservation: a good edit changes exactly what you asked for and nothing more, and most of the engineering in this chapter is about defending the parts of the image you did not mention.
The whole chapter fits on one mental card. Every method answers exactly one of three questions, and the practitioner's skill is knowing which axis a task lives on:
- WHERE things go: spatial control (ControlNet on edges, depth, pose, Section 35.1).
- WHAT a subject or style is: identity control (LoRA, DreamBooth, textual inversion, Section 35.2).
- WHICH pixels may move: editing control (masks, instructions, inversion, Sections 35.3 to 35.5).
Underneath all three runs one signature idea, the line to carry out of this chapter: "a good edit changes exactly what you asked for and nothing more." Every technique here is, at heart, a different way of defending the parts of the image you did not mention.
Chapter Overview
The last chapter ended with prompting and parameter-efficient fine-tuning, the two levers a practitioner reaches for first. Both are blunt. A prompt describes the image you want in words, but words cannot say "put the horizon exactly here" or "match this person's face" or "change only the jacket." Fine-tuning teaches the model a new concept, but a single LoRA still generates from scratch every time and cannot touch a photo that already exists. This chapter is the toolkit for everything words and weights leave on the table: pixel-accurate layout, subject and style identity, and surgical edits to images the model has never seen.
We open with spatial control. Section 35.1 introduces ControlNet and the family of conditioning adapters that take a structural map (a Canny edge image, a depth map, an OpenPose skeleton) and force the diffusion model to honor that geometry while still inventing texture, color, and content from the prompt. This is the chapter's first concrete payoff on the book's longest narrative thread: the edge maps you computed with Sobel and Canny in Chapter 9 and the depth maps from Chapter 27 now become control signals for a generator.
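One way to picture how a ControlNet-style branch can be attached to a frozen generator without disturbing it is the zero-initialized projection that the roadmap lists under Section 35.1. The NumPy sketch below is a toy under stated assumptions, not the diffusers implementation: `zero_conv`, `inject`, and the feature shapes are hypothetical names chosen for illustration. It treats a 1x1 convolution as a per-pixel channel mix and adds the scaled result to a frozen feature map.

```python
import numpy as np

def zero_conv(features, weight, bias):
    """1x1 convolution over a (C_in, H, W) map: a per-pixel channel mix."""
    # einsum contracts the input-channel axis at every spatial location.
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

def inject(frozen, control, weight, bias, scale=1.0):
    """Add the scaled control residual to the frozen backbone feature."""
    return frozen + scale * zero_conv(control, weight, bias)

rng = np.random.default_rng(0)
frozen = rng.normal(size=(4, 8, 8))    # stand-in for a frozen U-Net feature map
control = rng.normal(size=(4, 8, 8))   # stand-in for a control-branch feature map

# Zero initialization: the control branch contributes nothing at the start.
w0, b0 = np.zeros((4, 4)), np.zeros(4)
at_init = inject(frozen, control, w0, b0, scale=1.0)

# Hypothetical trained weights: now the scale behaves like a dial on the residual.
w1 = rng.normal(scale=0.1, size=(4, 4))
half = inject(frozen, control, w1, b0, scale=0.5)
full = inject(frozen, control, w1, b0, scale=1.0)

print("unchanged at init:", np.allclose(at_init, frozen))
print("residual doubles with scale:", np.allclose(full - frozen, 2 * (half - frozen)))
```

The point of the zero start is that training begins from the unmodified generator, and the scale argument plays the role of the conditioning-scale dial: at 0 the control map is ignored, and larger values weight the structural residual more heavily.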
The middle of the chapter handles identity and region. Section 35.2 deepens the personalization methods of Chapter 34 into a working comparison of LoRA, DreamBooth, and textual inversion, with the practical question of which to reach for when. Section 35.3 covers inpainting, outpainting, and object replacement, the generative descendants of the classical inpainting you met in Chapter 7: how a mask tells the model where it may paint, and the seam and context problems that separate a believable edit from an obvious one. Section 35.4 turns to instruction-based editing, the InstructPix2Pix line, where the edit is specified as a natural-language command applied to an existing image.
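The idea that a mask tells the model where it may paint can be sketched as a simple blend. This is a minimal NumPy illustration under assumptions: `blend_masked`, the array shapes, and the random stand-in arrays are hypothetical, and the masked-latent blending named in the Section 35.3 roadmap entry is typically applied to latents during denoising rather than once to a finished array as shown here.

```python
import numpy as np

def blend_masked(original, generated, mask):
    """Keep `original` where mask == 0 and take `generated` where mask == 1."""
    m = mask.astype(np.float32)[None, :, :]   # broadcast over the channel axis
    return m * generated + (1.0 - m) * original

rng = np.random.default_rng(0)
original = rng.normal(size=(4, 16, 16)).astype(np.float32)    # stand-in latents
generated = rng.normal(size=(4, 16, 16)).astype(np.float32)   # stand-in new content

mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:12, 4:12] = 1                          # the only region allowed to change

out = blend_masked(original, generated, mask)
keep = mask == 0
print("outside mask preserved:", np.array_equal(out[:, keep], original[:, keep]))
print("inside mask replaced:", np.array_equal(out[:, ~keep], generated[:, ~keep]))
```

A hard 0/1 mask like this one preserves the outside exactly but can leave a visible seam; a soft-edged mask with values between 0 and 1 trades a little preservation near the boundary for a smoother transition.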
The chapter closes on the hardest and most important problem in practical editing. Section 35.5 tackles real-image inversion: to edit a real photograph with a diffusion model, you must first recover the noise and conditioning that would have produced it, and naive inversion is lossy and unfaithful. We work through DDIM inversion, null-text optimization, and the Prompt-to-Prompt attention-control method that makes edits both faithful to the original and responsive to the new prompt. Finally, Section 35.6 steps up a level to composition: real production work chains these tools (segment, then inpaint, then upscale, then color-match) into multi-step workflows, and this section teaches the discipline of building and debugging such pipelines.
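To see why naive inversion is lossy, it can help to run the deterministic DDIM update in both directions on a toy problem. The sketch below is an assumption-laden illustration, not the chapter's implementation: `ddim_move`, `round_trip`, the linear alpha-bar schedule, and both stand-in noise predictors are hypothetical, and the predictors ignore the timestep for brevity. Inversion reuses the noise estimate made at the current point for the step to the next noise level, which is exact only if the estimate does not change along the path.

```python
import numpy as np

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update between two noise levels (alpha-bar values)."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def round_trip(x0, eps_fn, alpha_bars):
    """Invert x0 up the noise schedule, then sample back down."""
    x = x0
    # Inversion: clean -> noisy, using the noise estimate at the current point.
    for a_from, a_to in zip(alpha_bars[:-1], alpha_bars[1:]):
        x = ddim_move(x, eps_fn(x), a_from, a_to)
    noisy = x
    # Sampling: noisy -> clean along the reversed schedule.
    rev = alpha_bars[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_move(x, eps_fn(x), a_from, a_to)
    return noisy, x

alpha_bars = np.linspace(0.999, 0.05, 50)   # toy schedule; smaller means noisier
x0 = np.array([1.0, -0.5, 0.25])

const_eps = lambda x: np.array([0.3, -0.2, 0.1])   # ignores x: round trip is exact
state_eps = lambda x: np.tanh(x)                   # depends on x: round trip drifts

_, rec_const = round_trip(x0, const_eps, alpha_bars)
_, rec_state = round_trip(x0, state_eps, alpha_bars)
print("constant predictor error:", np.abs(rec_const - x0).max())
print("state-dependent predictor error:", np.abs(rec_state - x0).max())
```

The constant predictor reconstructs the input up to floating-point rounding, while the state-dependent one does not return exactly to its starting point. A real denoiser is state-dependent, and classifier-free guidance amplifies the mismatch, which is the gap the null-text optimization of Section 35.5 is meant to close.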
The recurring lesson is that control is not one mechanism but a layered stack: spatial conditioning, identity adapters, masked regions, instruction parsing, and inversion all compose. The practitioner who understands the seams can combine them, predict where a workflow will break, and fix an edit that bleeds outside its mask. This is the same modular reasoning that Chapter 34 applied to the generation stack, now applied to the control stack.
Prerequisites
This chapter assumes the text-to-image stack of Chapter 34 in full: the VAE, the conditioned U-Net or DiT, cross-attention conditioning, classifier-free guidance, and the LoRA/DreamBooth/textual-inversion methods introduced there. It builds directly on the latent diffusion machinery of Chapter 33, especially the forward and reverse processes and the DDIM sampler, which the inversion of Section 35.5 runs in reverse. You will reuse the edge detectors of Chapter 9, the segmentation masks of Chapter 24 (SAM in particular), and the monocular depth of Chapter 27 as control signals. Comfortable PyTorch and a working diffusers install are assumed; a GPU with 8 GB or more makes the code runnable.
Chapter Roadmap
- 35.1 Spatial Control: ControlNet & Conditioning Adapters. How a structural map pins down layout: the ControlNet trainable copy with zero-convolution injection, the adapter family (T2I-Adapter, IP-Adapter), conditioning on edges, depth, and pose, and the conditioning-scale dial that trades obedience against creativity.
- 35.2 Personalization: LoRA, DreamBooth & Textual Inversion. Teaching a generator a specific subject or style: textual inversion's single learned word, DreamBooth's prior-preserving subject binding, LoRA's low-rank adapters, how the three compare on data, compute, and fidelity, and how to combine and weight multiple LoRAs at inference.
- 35.3 Inpainting, Outpainting & Object Replacement. Editing a region while preserving the rest: the inpainting U-Net's mask channels, masked-latent blending, the seam and context problems, extending a canvas with outpainting, and replacing an object by combining a segmentation mask with a new prompt.
- 35.4 Instruction-Based Editing. Editing by natural-language command: how InstructPix2Pix builds a synthetic instruction dataset and conditions on both an instruction and an input image, the dual guidance scales that balance instruction-following against image preservation, and the modern editing-model landscape.
- 35.5 Real-Image Inversion & Faithful Editing. Editing real photographs: why naive encoding is not enough, DDIM inversion to recover the latent trajectory, null-text optimization to fix the guidance gap, and Prompt-to-Prompt attention control that edits content while preserving structure.
- 35.6 Composing Multi-Step Editing Workflows. Building production pipelines: chaining segmentation, control, inpainting, and upscaling into a directed graph, managing latents and color consistency across stages, debugging where a workflow degrades, and the node-graph tools (ComfyUI) that practitioners use to author them.
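The dual guidance scales mentioned for Section 35.4 can be sketched as a small extension of classifier-free guidance. The combination below follows the form given in the InstructPix2Pix paper, with one scale for the input image and one for the instruction; the function name `dual_guidance`, the random stand-in predictions, and the example scale values are illustrative assumptions rather than a recommended setting.

```python
import numpy as np

def dual_guidance(eps_uncond, eps_image, eps_full, s_image, s_text):
    """Combine three noise predictions with separate image and text scales.

    eps_uncond: prediction with neither image nor instruction conditioning
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on both image and instruction
    """
    return (eps_uncond
            + s_image * (eps_image - eps_uncond)
            + s_text * (eps_full - eps_image))

rng = np.random.default_rng(0)
e_u, e_i, e_f = (rng.normal(size=(4, 8, 8)) for _ in range(3))  # stand-ins

guided = dual_guidance(e_u, e_i, e_f, s_image=1.5, s_text=7.5)
print("guided shape:", guided.shape)
print("unit scales recover the fully conditioned prediction:",
      np.allclose(dual_guidance(e_u, e_i, e_f, 1.0, 1.0), e_f))
```

Raising `s_image` pulls the result toward the input image (more preservation), and raising `s_text` pushes it toward the instruction (a stronger edit), so the two scales are tuned against each other.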
What's Next?
This chapter controls a single still image in space, identity, and content. The next dimension is time and three-dimensional structure. Chapter 36: Video, 3D Generation & World Models extends the conditioned, controllable diffusion model into video, where temporal consistency becomes the new preservation problem, and into 3D generation and world models, where the control signals are camera poses and actions rather than edge maps. The ControlNet idea reappears as motion and camera conditioning, the inversion problem reappears as editing a real video, and the multi-step workflow discipline of Section 35.6 becomes essential when each stage is a heavy model. After that, Chapter 37 asks how we measure whether any of this control actually worked, and how we keep powerful editing tools safe. Before moving on, make the whole chapter concrete in the Hands-On Lab below, where spatial control, region masking, and a chained workflow come together as one small editing studio you build and run yourself.
Hands-On Lab: A Controlled Editing Studio
Objective
Build a small editing studio that exercises all three axes of control from the chapter card in one pipeline. First you fix where content goes by generating a fresh image whose layout is locked to the edges of a reference photo with a Canny ControlNet (Section 35.1). Then you change which pixels may move by masking one region and repainting only it with an inpainting pipeline (Section 35.3). Chaining the two stages, edge-controlled generation feeding masked region editing, is exactly the multi-step workflow discipline of Section 35.6: the output of one controllable model becomes the input of the next, and the signature rule of the chapter, change exactly what you asked for and nothing more, becomes something you can see at every stage.
What You'll Practice
- Turning a Canny edge map (the detector of Chapter 9) into a spatial control signal and driving a ControlNet with it, the WHERE axis of Section 35.1.
- Reading the controlnet_conditioning_scale dial that trades obedience to the edges against creative freedom (Section 35.1).
- Constructing a binary mask and repainting only the masked region with a dedicated inpainting checkpoint, the WHICH pixels axis of Section 35.3.
- Composing two controllable stages into one directed workflow and reasoning about where it can degrade, the pipeline discipline of Section 35.6.
- Verifying the preservation property by differencing input and output outside the mask, the chapter's core "nothing more" guarantee.
Setup
A GPU with 8 GB or more makes this comfortable; it also runs on CPU, far more slowly. The models download once from the Hugging Face Hub. Install with:
pip install diffusers transformers accelerate opencv-python pillow numpy
Provide one reference photo named reference.jpg with a clear subject and a plain region you will later edit (a room with one chair, a desk with one object). No training is required: every model is pretrained, and the whole lab is one short script built from the chapter's own code fragments.
Steps
Step 1: Build the spatial control map
Load your reference photo and extract its Canny edges, the same detector you met in Chapter 9. The edge map is the contract that will fix layout in Step 2: the generator may invent any material and lighting, but it must honor these lines.
import cv2, numpy as np
from PIL import Image
ref = cv2.imread("reference.jpg") # BGR uint8
# TODO: run cv2.Canny on `ref` with thresholds 100 and 200, then stack the
# single-channel result into 3 identical channels (ControlNet expects RGB).
# Wrap the result in Image.fromarray and name it `control`.
edges = ...
control = ...
control.save("control_edges.png")
Hint
edges = cv2.Canny(ref, 100, 200), then edges = np.stack([edges] * 3, axis=-1), then control = Image.fromarray(edges). Open control_edges.png: you should see white outlines of your subject on black. If the edges are too sparse, lower the thresholds; too noisy, raise them.
Step 2: Generate a layout-locked image with ControlNet
Load a Canny ControlNet alongside a base Stable Diffusion model and generate from a prompt that describes a different scene than the original photo. The edge map holds the geometry while the prompt supplies new content, the WHERE axis of Section 35.1 in action.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=dtype)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=dtype)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
gen = torch.Generator(pipe.device).manual_seed(0) # fix the seed for repeatability
# TODO: call pipe(...) with your prompt, image=control, num_inference_steps=30,
# generator=gen, and controlnet_conditioning_scale=1.0. Take .images[0] and
# save it as stage1.png.
stage1 = ...
stage1.save("stage1.png")
Hint
stage1 = pipe("a cozy reading room, warm light, photorealistic", image=control, num_inference_steps=30, generator=gen, controlnet_conditioning_scale=1.0).images[0]. The output should keep the silhouettes from control_edges.png but render your prompt's scene. Drop the scale to 0.5 and the layout loosens; push it past 1.3 and edge artifacts creep in.
Step 3: Define the region you are allowed to change
Build a binary mask the size of stage1 that is white where you want to repaint and black everywhere else. Here you draw a simple rectangle by hand; in production you would let SAM produce the mask, the path Section 35.3 takes. The mask is the WHICH pixels contract: the inpainter may only touch the white area.
stage1 = Image.open("stage1.png").convert("RGB")
W, H = stage1.size
mask = np.zeros((H, W), dtype=np.uint8)
# TODO: set a rectangular block of `mask` to 255 over the region you want to
# replace (for example the right third of the canvas). Then convert to a PIL
# image named `mask_img`. White = repaint, black = keep.
mask[...] = 255
mask_img = ...
mask_img.save("mask.png")
Hint
For the right third: mask[:, int(W * 0.66):] = 255, then mask_img = Image.fromarray(mask). Overlay mask.png on stage1.png in any viewer to confirm the white block sits exactly over the object you intend to replace, with a little margin around it.
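The "little margin" advice can be built into the mask itself. The helper below is a minimal sketch, assuming a rectangular region given as pixel coordinates; `box_mask` and its `margin` parameter are hypothetical names, not part of the lab's required code. It grows the box by a few pixels and clips it to the canvas so the slice never runs off the image.

```python
import numpy as np

def box_mask(height, width, box, margin=0):
    """uint8 mask: 255 inside the (x0, y0, x1, y1) box grown by `margin`, else 0."""
    x0, y0, x1, y1 = box
    # Grow the box, then clip so the slices stay inside the canvas.
    x0, y0 = max(0, x0 - margin), max(0, y0 - margin)
    x1, y1 = min(width, x1 + margin), min(height, y1 + margin)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 255
    return mask

# Right third of a 512 x 512 canvas, with an 8-pixel safety margin.
mask = box_mask(512, 512, (337, 0, 512, 512), margin=8)
print("repaint fraction:", (mask == 255).mean())
```

Printing the repaint fraction is a quick check that the mask covers roughly the share of the canvas you intended before spending time on an inpainting run.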
Step 4: Repaint only the masked region
Load a dedicated inpainting checkpoint and run it with stage1 as the image and your mask. The inpainting U-Net receives the mask and the masked image as extra channels, so it paints the new content with full awareness of the surrounding pixels it must blend into, the seam-and-context concern of Section 35.3.
from diffusers import StableDiffusionInpaintPipeline
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=dtype)
inpaint = inpaint.to(pipe.device)
gen2 = torch.Generator(inpaint.device).manual_seed(1)
# TODO: call inpaint(...) with a prompt describing the NEW object, image=stage1,
# mask_image=mask_img, num_inference_steps=30, generator=gen2. Take .images[0]
# and save it as stage2.png. The prompt should describe only what goes in the mask.
stage2 = ...
stage2.save("stage2.png")
Hint
stage2 = inpaint(prompt="a tall potted fern", image=stage1, mask_image=mask_img, num_inference_steps=30, generator=gen2).images[0]. The new object should appear inside the white region and the rest of stage1 should look untouched. If a hard seam shows, dilate the mask by a few pixels with cv2.dilate so the inpainter has room to blend.
Step 5: Verify the preservation property
The chapter's signature claim is that a good edit changes exactly what you asked for and nothing more. Make that measurable: difference stage1 and stage2 outside the mask and confirm the change there is near zero, while inside the mask it is large. This is the numerical form of "nothing more."
a = np.asarray(stage1, dtype=np.float32)
b = np.asarray(stage2.resize(stage1.size), dtype=np.float32) # inpainting may return a different size than stage1
diff = np.abs(a - b).mean(axis=2) # per-pixel mean absolute change
m = mask > 127
# TODO: print the mean of `diff` inside the mask (diff[m]) and outside it
# (diff[~m]). The inside value should be much larger than the outside value.
print("inside mask:", ...)
print("outside mask:", ...)
Hint
print("inside mask:", diff[m].mean()) and print("outside mask:", diff[~m].mean()). A clean edit shows an inside value many times the outside value. A large outside value means the pipeline leaked changes beyond the mask, the failure mode Section 35.3 warns about; tighten the mask or lower the inpainting strength.
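If the outside value is larger than you would like, one optional remedy is to composite the original pixels back over everything outside the mask after inpainting. This is a sketch of that post-processing idea, not a step the lab requires: `paste_back` is a hypothetical helper, and the arrays below are synthetic stand-ins for `stage1`, `stage2`, and `mask`.

```python
import numpy as np

def paste_back(original, edited, mask):
    """Take `edited` pixels where mask > 127 and `original` pixels elsewhere."""
    m = (mask > 127)[..., None]          # (H, W, 1), broadcasts over RGB
    return np.where(m, edited, original)

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
# Stand-in for an edited image that drifted slightly everywhere.
drift = rng.integers(-3, 4, size=original.shape)
edited = np.clip(original.astype(np.int16) + drift, 0, 255).astype(np.uint8)

mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 255

out = paste_back(original, edited, mask)
diff = np.abs(out.astype(np.float32) - original.astype(np.float32)).mean(axis=2)
print("outside mask after paste-back:", diff[mask <= 127].mean())
```

After the paste-back, the outside difference is zero by construction. The trade-off is that a hard paste can expose a seam at the mask boundary, so it tends to work best together with a slightly dilated or feathered mask.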
Step 6: Assemble the studio into one callable workflow
Wrap Steps 1 through 4 into a single function controlled_edit(ref_path, scene_prompt, object_prompt, mask_box) that returns the final image. This is the composition step of Section 35.6: a directed graph where edge extraction feeds ControlNet generation, which feeds masked inpainting. A reusable function is what lets you batch many edits and swap a stage without rewiring the rest.
def controlled_edit(ref_path, scene_prompt, object_prompt, mask_box):
    # mask_box = (x0, y0, x1, y1) in pixels of the region to repaint.
    # TODO: chain the steps: read ref_path -> Canny control map ->
    # ControlNet generate stage1 with scene_prompt -> build a rectangular mask
    # from mask_box -> inpaint object_prompt into stage1 -> return the result.
    ...

final = controlled_edit("reference.jpg",
                        "a cozy reading room, warm light, photorealistic",
                        "a tall potted fern",
                        (0, 0, 200, 512))
final.save("studio_output.png")
Hint
Move the body of Steps 1, 2, 3, and 4 inside the function, replacing the hard-coded rectangle with mask[y0:y1, x0:x1] = 255 from mask_box, and return stage2. Keep the two pipelines loaded outside the function so repeated calls do not reload weights, the kind of resource bookkeeping Section 35.6 flags as essential when every stage is a heavy model.
Expected Output
Four image artifacts that tell the workflow story stage by stage: control_edges.png (white outlines on black), stage1.png (a new scene that nonetheless traces those outlines), mask.png (a white block over the region to edit), and studio_output.png (the same scene with only the masked object replaced). The printed diagnostic from Step 5 should report a mean absolute change inside the mask several times larger than outside it; a typical clean run shows the outside value in the low single digits on a 0 to 255 scale while the inside value is many times that. Exact pixels vary with seed and model version; what should hold is a layout-locked Stage 1 and a Stage 2 whose changes are confined to the mask.
Stretch Goals
- Replace the hand-drawn rectangle in Step 3 with a real SAM mask (Chapter 24): prompt SAM with a click on the object, feed its mask to Step 4, and watch the seam quality improve when the mask follows the object's true shape, the object-replacement recipe of Section 35.3.
- Add a third stage that swaps the Canny ControlNet for a depth ControlNet using a monocular depth map (Chapter 27), and compare which control signal preserves your reference layout better, the conditioning-choice question of Section 35.1.
- Sweep controlnet_conditioning_scale over $\{0.0, 0.5, 1.0, 1.4\}$ in Step 2, run the full workflow at each, and assemble a contact sheet showing how the obedience dial of Section 35.1 changes both the Stage 1 layout and the final edited result.
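For the contact-sheet stretch goal, a small Pillow helper is enough. The sketch below assumes all tiles share one size; `contact_sheet` is a hypothetical helper name, and the solid-color tiles are stand-ins for the images you would generate at each scale.

```python
from PIL import Image

def contact_sheet(images, pad=4, background=(255, 255, 255)):
    """Paste equally sized PIL images left to right with `pad` pixels of spacing."""
    w, h = images[0].size
    sheet_w = len(images) * w + (len(images) + 1) * pad
    sheet = Image.new("RGB", (sheet_w, h + 2 * pad), background)
    for i, img in enumerate(images):
        sheet.paste(img.convert("RGB"), (pad + i * (w + pad), pad))
    return sheet

# Stand-in tiles; in the lab these would be the workflow outputs at each scale.
scales = [0.0, 0.5, 1.0, 1.4]
tiles = [Image.new("RGB", (64, 64), (int(60 * s), 80, 160)) for s in scales]
sheet = contact_sheet(tiles)
print("sheet size:", sheet.size)
```

In the lab you would replace `tiles` with the images produced at each scale and call `sheet.save(...)` with a filename of your choice to keep the comparison.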
Complete Solution
import cv2, numpy as np, torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers import StableDiffusionInpaintPipeline

dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load both pipelines once and reuse them across calls.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=dtype)
ctrl_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=dtype).to(device)
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=dtype).to(device)

def controlled_edit(ref_path, scene_prompt, object_prompt, mask_box, seed=0):
    # Step 1: spatial control map (Canny edges, Chapter 9 detector).
    ref = cv2.imread(ref_path)
    edges = cv2.Canny(ref, 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))

    # Step 2: ControlNet generation locked to those edges (WHERE).
    gen = torch.Generator(device).manual_seed(seed)
    stage1 = ctrl_pipe(
        scene_prompt, image=control, num_inference_steps=30,
        generator=gen, controlnet_conditioning_scale=1.0).images[0]

    # Step 3: binary mask over the region we may repaint (WHICH pixels).
    W, H = stage1.size
    x0, y0, x1, y1 = mask_box
    mask = np.zeros((H, W), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 255
    mask_img = Image.fromarray(mask)

    # Step 4: repaint only inside the mask.
    gen2 = torch.Generator(device).manual_seed(seed + 1)
    stage2 = inpaint(
        prompt=object_prompt, image=stage1, mask_image=mask_img,
        num_inference_steps=30, generator=gen2).images[0]

    # Step 5: verify preservation outside the mask.
    a = np.asarray(stage1, dtype=np.float32)
    b = np.asarray(stage2.resize(stage1.size), dtype=np.float32)
    diff = np.abs(a - b).mean(axis=2)
    m = mask > 127
    print("inside mask:", diff[m].mean(), "outside mask:", diff[~m].mean())
    return stage2

# Step 6: one call runs the whole studio.
final = controlled_edit(
    "reference.jpg",
    "a cozy reading room, warm light, photorealistic",
    "a tall potted fern",
    (0, 0, 200, 512))
final.save("studio_output.png")
The script above wires the two-stage graph by hand on purpose, so you can see every tensor pass from edge map to ControlNet to mask to inpainter. The ComfyUI node-graph editor from Section 35.6 expresses the identical workflow as a visual directed graph: a Canny preprocessor node, an Apply ControlNet node, a mask node, and a VAE-encode-for-inpaint node, connected by dragging wires, with no Python at all. Build the chain in code once to understand which output feeds which input; reach for ComfyUI when you want to iterate on the graph quickly or share it as a single portable workflow file.
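Between a hand-wired script and a full node-graph editor sits a middle option: a tiny runner that executes named stages in order and keeps every intermediate result, so you can inspect where a workflow degrades. The sketch below is a hypothetical illustration of that idea in plain Python; `run_workflow` is not a ComfyUI or diffusers API, it handles only a linear chain rather than a general directed graph, and the string-building stages are stand-ins for real image operations.

```python
def run_workflow(stages, initial):
    """Run (name, function) stages in order and keep every intermediate output."""
    outputs = {"input": initial}
    value = initial
    for name, fn in stages:
        value = fn(value)
        outputs[name] = value      # retained so a bad stage can be inspected later
    return value, outputs

# Stand-in stages operating on strings; real stages would pass images or latents.
stages = [
    ("edges", lambda x: f"canny({x})"),
    ("stage1", lambda x: f"controlnet({x})"),
    ("stage2", lambda x: f"inpaint({x})"),
]
final_value, trace = run_workflow(stages, "reference.jpg")
print(final_value)
for name, value in trace.items():
    print(name, "->", value)
```

Keeping the trace is the design choice that matters: when the final image looks wrong, you can open each saved intermediate and find the first stage whose output already shows the problem.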
Bibliography & Further Reading
Tools & Libraries
diffusers. huggingface.co/docs/diffusers