Part IV: Generative Vision Models
Chapter 35: Controllable Generation & Image Editing

"A prompt is a wish. A control map is a contract. I will still surprise you, but only inside the lines you drew."

A Diffusion Model That Finally Reads the Brief

Big Picture

The text-to-image system of Chapter 34 gives you a slot machine: type a sentence, pull the lever, and accept whatever comes out. This chapter replaces the lever with a steering wheel. Control comes in three flavors that this chapter treats in turn: spatial control fixes where things go (ControlNet on an edge map, a depth map, a pose skeleton); identity control fixes what a subject or style is (LoRA, DreamBooth, textual inversion teaching the model a specific face or look); and editing control changes part of an image while preserving everything else (inpainting, instruction-based editing, and the inversion machinery that lets you edit a real photograph rather than a generated one). The unifying theme is preservation: a good edit changes exactly what you asked for and nothing more, and most of the engineering in this chapter is about defending the parts of the image you did not mention.

Remember This: The Three Axes of Control

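The whole chapter fits on one mental card. Every method answers exactly one of three questions, and the practitioner's skill is knowing which axis a task lives on:

  • WHERE does content go? Spatial control: ControlNet and related adapters driven by an edge map, a depth map, or a pose skeleton (Section 35.1).
  • WHAT is the subject or style? Identity control: LoRA, DreamBooth, and textual inversion (Section 35.2).
  • WHICH pixels may change? Editing control: inpainting, instruction-based editing, and real-image inversion (Sections 35.3 to 35.5).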

Underneath all three runs one signature idea, the line to carry out of this chapter: "a good edit changes exactly what you asked for and nothing more." Every technique here is, at heart, a different way of defending the parts of the image you did not mention.

Chapter Overview

The last chapter ended with prompting and parameter-efficient fine-tuning, the two levers a practitioner reaches for first. Both are blunt. A prompt describes the image you want in words, but words cannot say "put the horizon exactly here" or "match this person's face" or "change only the jacket." Fine-tuning teaches the model a new concept, but a single LoRA still generates from scratch every time and cannot touch a photo that already exists. This chapter is the toolkit for everything words and weights leave on the table: pixel-accurate layout, subject and style identity, and surgical edits to images the model has never seen.

We open with spatial control. Section 35.1 introduces ControlNet and the family of conditioning adapters that take a structural map (a Canny edge image, a depth map, an OpenPose skeleton) and force the diffusion model to honor that geometry while still inventing texture, color, and content from the prompt. This is the chapter's first concrete payoff on the book's longest narrative thread: the edge maps you computed with Sobel and Canny in Chapter 9 and the depth maps from Chapter 27 now become control signals for a generator.

The middle of the chapter handles identity and region. Section 35.2 deepens the personalization methods of Chapter 34 into a working comparison of LoRA, DreamBooth, and textual inversion, with the practical question of which to reach for when. Section 35.3 covers inpainting, outpainting, and object replacement: how a mask tells the model where it may paint, and the seam and context problems that separate a believable edit from an obvious one. This is the generative descendant of the classical inpainting you met in Chapter 7. Section 35.4 turns to instruction-based editing, the InstructPix2Pix line, where the edit is specified as a natural-language command applied to an existing image.

The chapter closes on the hardest and most important problem in practical editing. Section 35.5 tackles real-image inversion: to edit a real photograph with a diffusion model, you must first recover the noise and conditioning that would have produced it, and naive inversion is lossy and unfaithful. We work through DDIM inversion, null-text optimization, and the Prompt-to-Prompt attention-control method that makes edits both faithful to the original and responsive to the new prompt. Finally, Section 35.6 steps up a level to composition: real production work chains these tools (segment, then inpaint, then upscale, then color-match) into multi-step workflows, and this section teaches the discipline of building and debugging such pipelines.
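
The inversion idea can be previewed in a few lines before Section 35.5 develops it. The sketch below is a toy under stated assumptions, not the chapter's implementation: `ddim_move`, `invert_then_sample`, the `alphas` schedule, and both noise predictors are hypothetical names and values chosen for illustration, and a real pipeline would call the U-Net where the sketch calls `eps_fn`. It applies the deterministic DDIM update toward more noise (inversion) and then back toward less noise (sampling).

```python
import numpy as np

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def invert_then_sample(x0, eps_fn, alphas):
    """Run clean -> noisy (inversion), then noisy -> clean (sampling)."""
    x = np.asarray(x0, dtype=np.float64)
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):      # toward more noise
        x = ddim_move(x, eps_fn(x, a_from), a_from, a_to)
    x_T = x
    back = alphas[::-1]
    for a_from, a_to in zip(back[:-1], back[1:]):          # toward less noise
        x = ddim_move(x, eps_fn(x, a_from), a_from, a_to)
    return x_T, x

alphas = np.linspace(0.999, 0.05, 20)   # assumed schedule: less signal at each step
x0 = np.array([0.3, -1.2, 0.8])

# Toy predictor 1: the noise estimate ignores x, so every step is exactly reversible.
const_eps = lambda x, a: np.array([0.1, -0.2, 0.05])
# Toy predictor 2: the noise estimate depends on x, as a real network's does.
state_eps = lambda x, a: 0.5 * np.tanh(x)

for name, fn in [("constant", const_eps), ("state-dependent", state_eps)]:
    x_T, x_rec = invert_then_sample(x0, fn, alphas)
    print(name, "reconstruction error:", np.abs(x_rec - x0).max())
```

With the constant predictor the round trip returns the input up to floating-point error. With the state-dependent predictor the two directions evaluate the noise estimate at different points, so some reconstruction error should appear: a small-scale analogue of the lossiness this section attributes to naive inversion.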

The recurring lesson is that control is not one mechanism but a layered stack: spatial conditioning, identity adapters, masked regions, instruction parsing, and inversion all compose. The practitioner who understands the seams can combine them, predict where a workflow will break, and fix an edit that bleeds outside its mask. It is the same modular reasoning that Chapter 34 applied to the generation stack, now applied to the control stack.

Prerequisites

This chapter assumes the text-to-image stack of Chapter 34 in full: the VAE, the conditioned U-Net or DiT, cross-attention conditioning, classifier-free guidance, and the LoRA/DreamBooth/textual-inversion methods introduced there. It builds directly on the latent diffusion machinery of Chapter 33, especially the forward and reverse processes and the DDIM sampler, which the inversion of Section 35.5 runs in reverse. You will reuse the edge detectors of Chapter 9, the segmentation masks of Chapter 24 (SAM in particular), and the monocular depth of Chapter 27 as control signals. Comfort with PyTorch and a working diffusers install are assumed; a GPU with 8 GB or more of memory is recommended for running the code.

What's Next?

This chapter controls a single still image in space, identity, and content. The next dimension is time and three-dimensional structure. Chapter 36: Video, 3D Generation & World Models extends the conditioned, controllable diffusion model into video, where temporal consistency becomes the new preservation problem, and into 3D generation and world models, where the control signals are camera poses and actions rather than edge maps. The ControlNet idea reappears as motion and camera conditioning, the inversion problem reappears as editing a real video, and the multi-step workflow discipline of Section 35.6 becomes essential when each stage is a heavy model. After that, Chapter 37 asks how we measure whether any of this control actually worked, and how we keep powerful editing tools safe. Before moving on, make the whole chapter concrete in the Hands-On Lab below, where spatial control, region masking, and a chained workflow come together as one small editing studio you build and run yourself.

Hands-On Lab: A Controlled Editing Studio

Duration: about 60 to 90 minutes
Difficulty: Intermediate

Objective

Build a small editing studio that exercises two of the chapter card's three axes of control in one pipeline. First you fix where content goes by generating a fresh image whose layout is locked to the edges of a reference photo with a Canny ControlNet (Section 35.1). Then you restrict which pixels may change by masking one region and repainting only it with an inpainting pipeline (Section 35.3). Chaining the two stages, edge-controlled generation feeding masked region editing, is exactly the multi-step workflow discipline of Section 35.6: the output of one controllable model becomes the input of the next, and the signature rule of the chapter, change exactly what you asked for and nothing more, becomes something you can see at every stage.

What You'll Practice

  • Turning a Canny edge map (the detector of Chapter 9) into a spatial control signal and driving a ControlNet with it, the WHERE axis of Section 35.1.
  • Reading the controlnet_conditioning_scale dial that trades obedience to the edges against creative freedom (Section 35.1).
  • Constructing a binary mask and repainting only the masked region with a dedicated inpainting checkpoint, the WHICH pixels axis of Section 35.3.
  • Composing two controllable stages into one directed workflow and reasoning about where it can degrade, the pipeline discipline of Section 35.6.
  • Verifying the preservation property by differencing input and output outside the mask, the chapter's core "nothing more" guarantee.

Setup

A GPU with 8 GB or more makes this comfortable; it also runs on CPU, far more slowly. The models download once from the Hugging Face Hub. Install with:

pip install torch diffusers transformers accelerate opencv-python pillow numpy

Provide one reference photo named reference.jpg with a clear subject and a plain region you will later edit (a room with one chair, a desk with one object). No training is required: every model is pretrained, and the whole lab is one short script built from the chapter's own code fragments.

Steps

Step 1: Build the spatial control map

Load your reference photo and extract its Canny edges, the same detector you met in Chapter 9. The edge map is the contract that will fix layout in Step 2: the generator may invent any material and lighting, but it must honor these lines.

import cv2, numpy as np
from PIL import Image

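ref = cv2.imread("reference.jpg")                 # BGR uint8
assert ref is not None, "reference.jpg not found or unreadable"
# Work on a 512x512 canvas so every later stage and the Step 5 difference
# share one size (the inpainting pipeline defaults to 512x512 output).
# Crop the photo to a square first if its aspect ratio matters.
ref = cv2.resize(ref, (512, 512))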
# TODO: run cv2.Canny on `ref` with thresholds 100 and 200, then stack the
# single-channel result into 3 identical channels (ControlNet expects RGB).
# Wrap the result in Image.fromarray and name it `control`.
edges = ...
control = ...
control.save("control_edges.png")

Hint

edges = cv2.Canny(ref, 100, 200), then edges = np.stack([edges] * 3, axis=-1), then control = Image.fromarray(edges). Open control_edges.png: you should see white outlines of your subject on black. If the edges are too sparse, lower the thresholds; too noisy, raise them.
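
One way to take the guesswork out of "too sparse" and "too noisy" is to start from data-driven thresholds and then measure how much of the map is edge. The sketch below uses a median-based rule of thumb that is common in practice but is not part of this lab; `median_canny_thresholds`, `edge_density`, and the `sigma=0.33` default are assumptions for illustration. It uses NumPy only, so the call to `cv2.Canny` appears as a comment.

```python
import numpy as np

def median_canny_thresholds(gray, sigma=0.33):
    """Suggest (low, high) Canny thresholds centered on the median intensity."""
    med = float(np.median(gray))
    low = int(round(max(0.0, (1.0 - sigma) * med)))
    high = int(round(min(255.0, (1.0 + sigma) * med)))
    return low, high

def edge_density(edges):
    """Fraction of pixels marked as edges in a 0/255 edge map."""
    return float((np.asarray(edges) > 0).mean())

# Synthetic grayscale stand-in; with a real photo you would pass
# cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY) and then call cv2.Canny(ref, low, high).
gray = np.full((64, 64), 100, dtype=np.uint8)
low, high = median_canny_thresholds(gray)
print("suggested thresholds:", low, high)

# Synthetic edge map with a single white row, to show the density measure.
edges = np.zeros((10, 10), dtype=np.uint8)
edges[0, :] = 255
print("edge density:", edge_density(edges))
```

What counts as a good density is a judgment call that depends on the scene; the measure is mainly useful for comparing two threshold settings on the same photo.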

Step 2: Generate a layout-locked image with ControlNet

Load a Canny ControlNet alongside a base Stable Diffusion model and generate from a prompt that describes a different scene than the original photo. The edge map holds the geometry while the prompt supplies new content, the WHERE axis of Section 35.1 in action.

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

dtype = torch.float16 if torch.cuda.is_available() else torch.float32
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=dtype)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=dtype)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

gen = torch.Generator(pipe.device).manual_seed(0)   # fix the seed for repeatability
# TODO: call pipe(...) with your prompt, image=control, num_inference_steps=30,
# generator=gen, and controlnet_conditioning_scale=1.0. Take .images[0] and
# save it as stage1.png.
stage1 = ...
stage1.save("stage1.png")

Hint

stage1 = pipe("a cozy reading room, warm light, photorealistic", image=control, num_inference_steps=30, generator=gen, controlnet_conditioning_scale=1.0).images[0]. The output should keep the silhouettes from control_edges.png but render your prompt's scene. Drop the scale to 0.5 and the layout loosens; push it past 1.3 and edge artifacts creep in.
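
Why the dial behaves this way is easier to see in miniature. In the ControlNet design the control branch contributes residual features that are added to the base U-Net features, and the conditioning scale multiplies those residuals before the addition. The sketch below is a toy with random stand-in arrays, not the diffusers internals; `apply_control` and the array shapes are hypothetical.

```python
import numpy as np

def apply_control(base_features, control_residual, scale):
    """Add the control branch's residual to the base features, weighted by scale."""
    return base_features + scale * control_residual

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8, 8))       # stand-in for one U-Net feature map
residual = rng.normal(size=(4, 8, 8))   # stand-in for the matching ControlNet output

for scale in (0.0, 0.5, 1.0, 1.3):
    mixed = apply_control(base, residual, scale)
    shift = np.abs(mixed - base).mean()
    print(f"scale={scale:.1f}  mean feature shift={shift:.3f}")
```

At scale 0.0 the base features pass through untouched, which is why the layout is no longer constrained; larger scales push the features further toward what the edge map dictates.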

Step 3: Define the region you are allowed to change

Build a binary mask the size of stage1 that is white where you want to repaint and black everywhere else. Here you draw a simple rectangle by hand; in production you would let SAM produce the mask, the path Section 35.3 takes. The mask is the WHICH pixels contract: the inpainter may only touch the white area.

stage1 = Image.open("stage1.png").convert("RGB")
W, H = stage1.size
mask = np.zeros((H, W), dtype=np.uint8)
# TODO: set a rectangular block of `mask` to 255 over the region you want to
# replace (for example the right third of the canvas). Then convert to a PIL
# image named `mask_img`. White = repaint, black = keep.
mask[...] = 255
mask_img = ...
mask_img.save("mask.png")

Hint

For the right third: mask[:, int(W * 0.66):] = 255, then mask_img = Image.fromarray(mask). Overlay mask.png on stage1.png in any viewer to confirm the white block sits exactly over the object you intend to replace, with a little margin around it.
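
A small helper makes the rectangle and its margin explicit and guards against boxes that run off the canvas. This is a sketch under assumptions: `box_mask` and its `margin` parameter are hypothetical names, and the 512x512 canvas and box coordinates are example values.

```python
import numpy as np

def box_mask(width, height, box, margin=0):
    """Return a uint8 mask (255 inside the box, 0 elsewhere), clipped to the canvas."""
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0 - margin), max(0, y0 - margin)
    x1, y1 = min(width, x1 + margin), min(height, y1 + margin)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 255          # rows are y, columns are x
    return mask

# Right third of a 512x512 canvas, grown by an 8-pixel margin.
mask = box_mask(512, 512, (338, 0, 512, 512), margin=8)
coverage = (mask > 0).mean()
print(f"mask covers {coverage:.1%} of the canvas")
```

The result can be wrapped with `Image.fromarray(mask)` exactly as in the hint above. Printing the coverage is a cheap sanity check that the box landed where you intended before you spend time on inpainting.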

Step 4: Repaint only the masked region

Load a dedicated inpainting checkpoint and run it with stage1 as the image and your mask. The inpainting U-Net receives the mask and the masked image as extra channels, so it paints the new content with full awareness of the surrounding pixels it must blend into, the seam-and-context concern of Section 35.3.
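
The "extra channels" can be pictured with plain arrays. For the Stable Diffusion inpainting checkpoints the input is commonly described as 4 noisy latent channels, 1 mask channel, and 4 masked-image latent channels, 9 in total. The sketch below is a shape-level toy, not the pipeline's code: all arrays are random stand-ins, and it masks the latent array directly, whereas the real pipeline masks the image in pixel space and then encodes it with the VAE.

```python
import numpy as np

# Toy latent shapes for a 512x512 image and a VAE that downsamples by 8.
C, h, w = 4, 64, 64
rng = np.random.default_rng(0)

noisy_latents = rng.normal(size=(1, C, h, w))   # what the sampler is denoising
image_latents = rng.normal(size=(1, C, h, w))   # stand-in for the encoded input image

mask = np.zeros((1, 1, h, w))
mask[..., 42:] = 1.0                            # 1 = repaint, 0 = keep (right third)

# Hide the region to be repainted so the network sees only the context.
masked_image_latents = image_latents * (1.0 - mask)

unet_input = np.concatenate([noisy_latents, mask, masked_image_latents], axis=1)
print("U-Net input shape:", unet_input.shape)
```

Because the surrounding context rides along at every denoising step, the network can match lighting and texture at the boundary instead of painting the region blind.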

from diffusers import StableDiffusionInpaintPipeline

inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=dtype)
inpaint = inpaint.to(pipe.device)

gen2 = torch.Generator(inpaint.device).manual_seed(1)
# TODO: call inpaint(...) with a prompt describing the NEW object, image=stage1,
# mask_image=mask_img, num_inference_steps=30, generator=gen2. Take .images[0]
# and save it as stage2.png. The prompt should describe only what goes in the mask.
stage2 = ...
stage2.save("stage2.png")

Hint

stage2 = inpaint(prompt="a tall potted fern", image=stage1, mask_image=mask_img, num_inference_steps=30, generator=gen2).images[0]. The new object should appear inside the white region and the rest of stage1 should look untouched. If a hard seam shows, dilate the mask by a few pixels with cv2.dilate so the inpainter has room to blend.
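
Dilation itself is simple enough to write out. The NumPy sketch below grows a binary mask by a square neighborhood and is intended to behave like `cv2.dilate` with a square kernel of ones on a 0/255 mask; `dilate_mask` and its `radius` parameter are hypothetical names used only for illustration.

```python
import numpy as np

def dilate_mask(mask, radius=1):
    """Grow a binary 0/255 mask by `radius` pixels using a square neighborhood."""
    binary = np.asarray(mask) > 127
    h, w = binary.shape
    # Pad with False so the mask does not wrap around the image border.
    padded = np.pad(binary, radius, mode="constant", constant_values=False)
    grown = np.zeros_like(binary)
    # OR together every shifted copy inside the (2r+1) x (2r+1) window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            grown |= padded[dy:dy + h, dx:dx + w]
    return grown.astype(np.uint8) * 255

# A single white pixel grows into a 3x3 block at radius 1.
m = np.zeros((7, 7), dtype=np.uint8)
m[3, 3] = 255
print(dilate_mask(m, radius=1))
```

A few pixels of growth is usually enough; a mask that grows too far starts to repaint content you meant to keep.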

Step 5: Verify the preservation property

The chapter's signature claim is that a good edit changes exactly what you asked for and nothing more. Make that measurable: difference stage1 and stage2 outside the mask and confirm the change there is near zero, while inside the mask it is large. This is the numerical form of "nothing more."

a = np.asarray(stage1, dtype=np.float32)
b = np.asarray(stage2, dtype=np.float32)
diff = np.abs(a - b).mean(axis=2)                 # per-pixel mean absolute change
m = mask > 127
# TODO: print the mean of `diff` inside the mask (diff[m]) and outside it
# (diff[~m]). The inside value should be much larger than the outside value.
print("inside mask:", ...)
print("outside mask:", ...)

Hint

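print("inside mask:", diff[m].mean()) and print("outside mask:", diff[~m].mean()). A clean edit shows an inside value many times the outside value. Expect the outside value to be small but not exactly zero, because the whole image passes through the VAE encoder and decoder. A large outside value means the pipeline leaked changes beyond the mask, the failure mode Section 35.3 warns about; tighten the mask, lower the strength argument of the inpainting call, or paste the original pixels back outside the mask.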
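
When the outside value must be exactly zero, a common post-processing step is to composite: keep the edited pixels inside the mask and restore the original pixels everywhere else. This is a sketch of that idea rather than a step the lab prescribes; `composite_outside_mask` is a hypothetical helper, and the arrays are synthetic stand-ins for stage1 and stage2.

```python
import numpy as np

def composite_outside_mask(original, edited, mask):
    """Keep `edited` inside the mask and restore `original` everywhere else."""
    a = np.asarray(original)
    b = np.asarray(edited)
    keep_edit = (np.asarray(mask) > 127)[..., None]   # H x W x 1, broadcasts over RGB
    return np.where(keep_edit, b, a)

# Synthetic stand-ins: stage2 drifts slightly everywhere and changes a lot in the mask.
rng = np.random.default_rng(0)
stage1_arr = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
drift = rng.integers(-2, 3, size=stage1_arr.shape)
stage2_arr = np.clip(stage1_arr.astype(int) + drift, 0, 255).astype(np.uint8)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[:, 40:] = 255
stage2_arr[:, 40:] = 255 - stage2_arr[:, 40:]

fixed = composite_outside_mask(stage1_arr, stage2_arr, mask)
outside = ~(mask > 127)
before = np.abs(stage1_arr.astype(float) - stage2_arr)[outside].mean()
after = np.abs(stage1_arr.astype(float) - fixed)[outside].mean()
print("outside change before composite:", before)
print("outside change after composite: ", after)
```

With PIL images the same result comes from `Image.composite(stage2, stage1, mask_img)`. The trade-off is that a hard paste can reintroduce a visible seam, so it pairs naturally with a slightly dilated or feathered mask.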

Step 6: Assemble the studio into one callable workflow

Wrap Steps 1 through 4 into a single function controlled_edit(ref_path, scene_prompt, object_prompt, mask_box) that returns the final image. This is the composition step of Section 35.6: a directed graph where edge extraction feeds ControlNet generation, which feeds masked inpainting. A reusable function is what lets you batch many edits and swap a stage without rewiring the rest.

def controlled_edit(ref_path, scene_prompt, object_prompt, mask_box):
    # mask_box = (x0, y0, x1, y1) in pixels of the region to repaint.
    # TODO: chain the steps: read ref_path -> Canny control map ->
    # ControlNet generate stage1 with scene_prompt -> build a rectangular mask
    # from mask_box -> inpaint object_prompt into stage1 -> return the result.
    ...

final = controlled_edit("reference.jpg",
                        "a cozy reading room, warm light, photorealistic",
                        "a tall potted fern",
                        (0, 0, 200, 512))
final.save("studio_output.png")

Hint

Move the per-image work of Steps 1 through 4 (edge extraction, the ControlNet call, mask construction, and the inpainting call) inside the function, replacing the hard-coded rectangle with mask[y0:y1, x0:x1] = 255 from mask_box, and return stage2. Load the two pipelines once, outside the function, so repeated calls do not reload weights, the kind of resource bookkeeping Section 35.6 flags as essential when every stage is a heavy model.

Expected Output

Four image artifacts that tell the workflow story stage by stage: control_edges.png (white outlines on black), stage1.png (a new scene that nonetheless traces those outlines), mask.png (a white block over the region to edit), and studio_output.png (the same scene with only the masked object replaced). The printed diagnostic from Step 5 should report a mean absolute change inside the mask several times larger than outside it; a typical clean run shows the outside value in the low single digits on a 0 to 255 scale while the inside value is many times that. Exact pixels vary with seed and model version; what should hold is a layout-locked Stage 1 and a Stage 2 whose changes are confined to the mask.

Stretch Goals

  • Replace the hand-drawn rectangle in Step 3 with a real SAM mask (Chapter 24): prompt SAM with a click on the object, feed its mask to Step 4, and watch the seam quality improve when the mask follows the object's true shape, the object-replacement recipe of Section 35.3.
  • Add a third stage that swaps the Canny ControlNet for a depth ControlNet using a monocular depth map (Chapter 27), and compare which control signal preserves your reference layout better, the conditioning-choice question of Section 35.1.
  • Sweep controlnet_conditioning_scale over $\{0.0, 0.5, 1.0, 1.4\}$ in Step 2, run the full workflow at each, and assemble a contact sheet showing how the obedience dial of Section 35.1 changes both the Stage 1 layout and the final edited result.
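
For the contact sheet in the last stretch goal, tiling equally sized images into a grid takes only a few lines. This is a minimal sketch: `contact_sheet` is a hypothetical helper, and the tiny solid-color arrays stand in for the images produced by the sweep.

```python
import numpy as np

def contact_sheet(images, columns):
    """Tile equally sized H x W x C arrays into a grid, padding with black."""
    images = [np.asarray(im) for im in images]
    h, w, c = images[0].shape
    rows = -(-len(images) // columns)          # ceiling division
    sheet = np.zeros((rows * h, columns * w, c), dtype=images[0].dtype)
    for i, im in enumerate(images):
        r, col = divmod(i, columns)
        sheet[r * h:(r + 1) * h, col * w:(col + 1) * w] = im
    return sheet

# Four stand-in results, one per conditioning scale, laid out 2 x 2.
tiles = [np.full((8, 8, 3), v, dtype=np.uint8) for v in (0, 1, 2, 3)]
sheet = contact_sheet(tiles, columns=2)
print("sheet shape:", sheet.shape)
```

With real outputs, pass `np.asarray(img)` for each PIL image and save the result with `Image.fromarray(sheet)`.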

Complete Solution
import cv2, numpy as np, torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers import StableDiffusionInpaintPipeline

dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load both pipelines once and reuse them across calls.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=dtype)
ctrl_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=dtype).to(device)
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=dtype).to(device)

def controlled_edit(ref_path, scene_prompt, object_prompt, mask_box, seed=0):
    # Step 1: spatial control map (Canny edges, Chapter 9 detector).
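    ref = cv2.imread(ref_path)
    assert ref is not None, f"could not read {ref_path}"
    ref = cv2.resize(ref, (512, 512))   # one 512x512 canvas for every stage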
    edges = cv2.Canny(ref, 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))

    # Step 2: ControlNet generation locked to those edges (WHERE).
    gen = torch.Generator(device).manual_seed(seed)
    stage1 = ctrl_pipe(
        scene_prompt, image=control, num_inference_steps=30,
        generator=gen, controlnet_conditioning_scale=1.0).images[0]

    # Step 3: binary mask over the region we may repaint (WHICH pixels).
    W, H = stage1.size
    x0, y0, x1, y1 = mask_box
    mask = np.zeros((H, W), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 255
    mask_img = Image.fromarray(mask)

    # Step 4: repaint only inside the mask.
    gen2 = torch.Generator(device).manual_seed(seed + 1)
    stage2 = inpaint(
        prompt=object_prompt, image=stage1, mask_image=mask_img,
        num_inference_steps=30, generator=gen2).images[0]

    # Step 5: verify preservation outside the mask.
    a = np.asarray(stage1, dtype=np.float32)
    b = np.asarray(stage2.resize(stage1.size), dtype=np.float32)
    diff = np.abs(a - b).mean(axis=2)
    m = mask > 127
    print("inside mask:", diff[m].mean(), "outside mask:", diff[~m].mean())
    return stage2

# Step 6: one call runs the whole studio.
final = controlled_edit(
    "reference.jpg",
    "a cozy reading room, warm light, photorealistic",
    "a tall potted fern",
    (0, 0, 200, 512))
final.save("studio_output.png")

Library Shortcut: ComfyUI Authors the Same Graph Visually

The script above wires the two-stage graph by hand on purpose, so you can see every tensor pass from edge map to ControlNet to mask to inpainter. The ComfyUI node-graph editor from Section 35.6 expresses the identical workflow as a visual directed graph: a Canny preprocessor node, an Apply ControlNet node, a mask node, and a VAE-encode-for-inpaint node, connected by dragging wires, with no Python at all. Build the chain in code once to understand which output feeds which input; reach for ComfyUI when you want to iterate on the graph quickly or share it as a single portable workflow file.

Bibliography & Further Reading

Foundational Papers

Zhang, L., Rao, A., Agrawala, M. "Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)." ICCV (2023). arXiv:2302.05543
ControlNet, the central method of Section 35.1. It clones the U-Net encoder into a trainable branch connected by zero convolutions, letting a structural map control layout without destroying the base model.
Ruiz, N. et al. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." CVPR (2023). arXiv:2208.12242
DreamBooth, the subject-binding method of Section 35.2, which ties a subject to a rare token with a prior-preservation loss that prevents the model from forgetting the wider class.
Gal, R. et al. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." ICLR (2023). arXiv:2208.01618
Textual inversion from Section 35.2: learn one new embedding vector for a concept while freezing the entire model, the lightest personalization method.
Hu, E. et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR (2022). arXiv:2106.09685
LoRA, the low-rank adapter of Section 35.2, now the dominant way to fine-tune diffusion models for new styles and subjects on a single GPU and to stack multiple concepts at inference.
Brooks, T., Holynski, A., Efros, A. "InstructPix2Pix: Learning to Follow Image Editing Instructions." CVPR (2023). arXiv:2211.09800
InstructPix2Pix, the instruction-editing method of Section 35.4. It builds a synthetic (instruction, before, after) dataset with GPT-3 and Prompt-to-Prompt, then trains a model conditioned on both an image and a command.

Inversion & Faithful Editing

Song, J., Meng, C., Ermon, S. "Denoising Diffusion Implicit Models (DDIM)." ICLR (2021). arXiv:2010.02502
DDIM, whose deterministic sampler is run in reverse in Section 35.5 to invert a real image back to its latent noise, the foundation of faithful real-image editing.
Mokady, R. et al. "Null-text Inversion for Editing Real Images using Guided Diffusion Models." CVPR (2023). arXiv:2211.09794
Null-text inversion from Section 35.5, which optimizes the unconditional (null) embedding per timestep to close the gap that classifier-free guidance opens in plain DDIM inversion.
Hertz, A. et al. "Prompt-to-Prompt Image Editing with Cross Attention Control." ICLR (2023). arXiv:2208.01626
Prompt-to-Prompt, the attention-injection editing of Section 35.5 that preserves structure by reusing the original cross-attention maps while swapping the words that should change.

Adapters & Recent Methods (2023-2026)

Mou, C. et al. "T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models." AAAI (2024). arXiv:2302.08453
T2I-Adapter, the lightweight alternative to ControlNet in Section 35.1: a small side network that adds control features without cloning the whole encoder.
Ye, H. et al. "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models." (2023). arXiv:2308.06721
IP-Adapter from Section 35.1, which adds a decoupled cross-attention path so an image can be used as a prompt alongside text, the basis of much identity and style transfer.
Kawar, B. et al. "Imagic: Text-Based Real Image Editing with Diffusion Models." CVPR (2023). arXiv:2210.09276
Imagic, referenced in Section 35.5, which interpolates between an optimized embedding and a target embedding to perform complex single-image edits like changing a pose.
Peng, B. et al. "ControlNeXt: Powerful and Efficient Control for Image and Video Generation." (2024). arXiv:2408.06070
ControlNeXt, the efficient-control frontier in Section 35.1: it replaces ControlNet's cloned encoder with a small selector and a Cross Normalization injection, reporting up to ninety percent fewer trainable parameters.
Wang, Q. et al. "InstantID: Zero-shot Identity-Preserving Generation in Seconds." (2024). arXiv:2401.07519
InstantID, the tuning-free personalization frontier in Section 35.2: an IdentityNet encodes a single face into conditioning so a subject's identity is preserved with no per-subject training.
Wang, Z. et al. "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing." NeurIPS (2024). arXiv:2407.05600
GenArtist, the agentic-pipeline frontier in Section 35.6: an MLLM agent that plans a tree of generation and editing tool calls with step-by-step verification and self-correction.
Black Forest Labs et al. "FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space." (2025). arXiv:2506.15742
FLUX.1 Kontext, a 2025 in-context editing model discussed in Section 35.4 that takes a reference image and an instruction and edits in a single flow-matching pass with strong character consistency.
Qwen Team, Alibaba. "Qwen-Image Technical Report." (2025). arXiv:2508.02324
The 20B-parameter Qwen-Image backbone behind the open-weight Qwen-Image-Edit instruction editor of Section 35.4 (released August 2025 under Apache-2.0), which feeds the input image to both a vision-language encoder and a VAE to separate semantic edits from appearance edits.

Tools & Libraries

Hugging Face diffusers. huggingface.co/docs/diffusers
The reference library for every code example: ControlNet, inpainting, IP-Adapter, instruction-editing, and inversion pipelines, plus the training scripts for DreamBooth and LoRA.
ComfyUI node-graph interface. github.com/comfyanonymous/ComfyUI
The node-graph editor of Section 35.6, the standard way practitioners author multi-step control-and-edit workflows as a visual directed graph.
Segment Anything (SAM). Kirillov, A. et al. ICCV (2023). github.com/facebookresearch/segment-anything
SAM, the promptable segmenter used in Sections 35.3 and 35.6 to produce the masks that drive object replacement and region editing.

Books & Explainers

Prince, S. J. D. Understanding Deep Learning. MIT Press (2023). udlbook.github.io/udlbook
Its diffusion and conditioning chapters give the cleanest textbook account of the guidance and cross-attention machinery the control methods of this chapter manipulate. Free online.