Part IV: Generative Vision Models
Chapter 35: Controllable Generation & Image Editing

Instruction-Based Editing

"You said 'make it winter.' Not which winter, not how much snow, not whether the dog keeps his hat. I made a decision. The dog keeps his hat."

An Instruction-Following Editor With Opinions
Big Picture

Instruction-based editing replaces "describe the image you want" with "describe the change you want": you give a model an image and a command like "make it snow" or "turn the cat into a tiger," and it applies the edit while keeping the rest of the image intact, with no mask and no description of the unchanged parts. The central trick of InstructPix2Pix is data: there is no natural dataset of (instruction, before-image, after-image) triples, so the authors generated one synthetically by pairing a language model that invents edit instructions with the Prompt-to-Prompt editing of Section 35.5 to produce the before and after images. Once that dataset exists, training is ordinary conditional diffusion. At inference, two separate guidance scales (one for the instruction, one for the input image) let you dial the balance between obeying the command and preserving the original.

The editing methods so far each demand something extra from you: ControlNet wants a structural map, personalization wants training images, inpainting wants a mask. Instruction editing asks only for a sentence. That convenience is bought by a clever data-generation pipeline and a dual-conditioning inference scheme, and this section unpacks both. We then survey how the idea matured from the 2023 InstructPix2Pix into the strong general editors of 2024 and 2025.

1. The Data Problem and Its Synthetic Solution (Intermediate)

To train a model that takes (image, instruction) and produces an edited image, you need examples: an original image, an instruction, and the correct edited result. No such dataset exists at scale in the wild. InstructPix2Pix manufactures one in two stages, illustrated in Figure 35.4.1. First, a large language model (GPT-3, in the original) is given a caption and asked to invent a plausible edit instruction and the caption of the edited image. Starting from "photograph of a girl riding a horse," it might produce the instruction "have her ride a dragon" and the edited caption "photograph of a girl riding a dragon." Second, the original and edited captions are turned into a matched pair of images using Prompt-to-Prompt (the cross-attention-sharing method of Section 35.5), which generates two images that differ only in the changed concept while keeping composition, pose, and background nearly identical.

The matched pair is essential. If the two images were generated independently, they would differ in countless incidental ways (the horse and the dragon in different poses, different backgrounds), and the model would learn to repaint the whole image rather than apply just the named edit. Prompt-to-Prompt forces everything except the changed word to stay put, so the before-and-after difference is the edit, which is exactly the supervision the model needs.
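The two-stage recipe above can be sketched as a small data-building loop. This is a minimal sketch under stated assumptions, not the authors' code: `EditTriple`, `build_triples`, and the two toy stand-ins are hypothetical names, and the stand-ins return strings where a real pipeline would call a language model and a Prompt-to-Prompt renderer.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class EditTriple:
    """One training example: the instruction plus the matched image pair."""
    instruction: str
    before: Any
    after: Any


def build_triples(
    captions: Iterable[str],
    propose_edit: Callable[[str], Tuple[str, str]],
    render_pair: Callable[[str, str], Tuple[Any, Any]],
) -> List[EditTriple]:
    """Stage 1 invents (instruction, edited caption); stage 2 renders a matched pair."""
    triples = []
    for caption in captions:
        instruction, edited_caption = propose_edit(caption)
        # Both captions go to ONE renderer call so the pair can share structure.
        before, after = render_pair(caption, edited_caption)
        triples.append(EditTriple(instruction, before, after))
    return triples


# Toy stand-ins (hypothetical): a real system would call an LLM and Prompt-to-Prompt.
def toy_propose_edit(caption: str) -> Tuple[str, str]:
    return "have her ride a dragon", caption.replace("horse", "dragon")


def toy_render_pair(caption: str, edited_caption: str) -> Tuple[str, str]:
    return f"<image: {caption}>", f"<image: {edited_caption}>"


triples = build_triples(
    ["photograph of a girl riding a horse"], toy_propose_edit, toy_render_pair
)
```

The design choice worth noticing is that `render_pair` receives both captions in a single call. Rendering them in two independent calls would reintroduce exactly the incidental differences the matched pair is meant to remove.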

[Figure 35.4.1 diagram: caption "girl on a horse" → language model invents instruction + edited caption → Prompt-to-Prompt renders matched before + after → triple (instruction, before, after). Hundreds of thousands of synthetic triples then train an instruction-conditioned model.]
Figure 35.4.1: The InstructPix2Pix data pipeline. A language model turns a caption into an edit instruction plus an edited caption; Prompt-to-Prompt renders a before/after image pair that differs only in the edited concept. The resulting triples train an ordinary conditional diffusion model to apply instructions directly.
Key Insight: The Bottleneck Was Data, Not Architecture

InstructPix2Pix uses a nearly standard conditional diffusion model. Its contribution is almost entirely the data-generation recipe. This is a recurring lesson in modern generative vision (DALL-E 3, discussed in Chapter 34, made the same point about captions): once you can synthesize the right supervision, a conventional architecture trained on it does the job. When you face a task with no natural dataset, ask whether existing generative models can be composed to manufacture one before you reach for a novel architecture.

2. Conditioning on Both Image and Instruction (Intermediate)

The trained model is conditioned on two things at once: the input image $c_I$ (encoded to latents and concatenated to the U-Net input channels, much like the inpainting U-Net of Section 35.3) and the text instruction $c_T$ (through cross-attention, as in Chapter 34). The interesting part is inference. Recall classifier-free guidance from Chapter 33: you extrapolate from the unconditional prediction toward the conditional one to strengthen the conditioning. With two conditions, InstructPix2Pix uses two guidance scales, one that pushes toward following the instruction and one that pushes toward preserving the input image. The score estimate combines three forward passes:

$$\tilde{\epsilon}_\theta(z_t, c_I, c_T) = \epsilon_\theta(z_t, \varnothing, \varnothing) + s_I\big[\epsilon_\theta(z_t, c_I, \varnothing) - \epsilon_\theta(z_t, \varnothing, \varnothing)\big] + s_T\big[\epsilon_\theta(z_t, c_I, c_T) - \epsilon_\theta(z_t, c_I, \varnothing)\big].$$
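The combination above can be checked numerically. The following is a minimal sketch, assuming the three noise predictions have already been computed; `dual_guidance` and `expanded_weights` are hypothetical helper names, and scalar floats stand in for full noise tensors (the same arithmetic applies to any array type that supports `+`, `-`, and scalar `*`).

```python
def dual_guidance(eps_uncond, eps_image, eps_full, s_image, s_text):
    """Combine the three noise predictions as in the formula above.

    eps_uncond = eps(z_t, null, null)
    eps_image  = eps(z_t, c_I,  null)
    eps_full   = eps(z_t, c_I,  c_T)
    """
    image_direction = eps_image - eps_uncond  # scaled by s_I
    text_direction = eps_full - eps_image     # scaled by s_T
    return eps_uncond + s_image * image_direction + s_text * text_direction


def expanded_weights(s_image, s_text):
    """Weights on (eps_uncond, eps_image, eps_full) after expanding the brackets."""
    return (1.0 - s_image, s_image - s_text, s_text)


# Toy scalar "predictions" standing in for noise tensors.
eps_uncond, eps_image, eps_full = 0.0, 1.0, 3.0

# Bracketed form, using s_I = 1.5 and s_T = 7.5.
guided = dual_guidance(eps_uncond, eps_image, eps_full, s_image=1.5, s_text=7.5)

# Expanded form: a weighted sum of the three predictions.
w_uncond, w_image, w_full = expanded_weights(1.5, 7.5)
expanded = w_uncond * eps_uncond + w_image * eps_image + w_full * eps_full

# With both scales at 1 the guidance terms telescope to the full conditional.
plain = dual_guidance(eps_uncond, eps_image, eps_full, s_image=1.0, s_text=1.0)
```

Expanding the brackets shows that the three weights $(1 - s_I,\; s_I - s_T,\; s_T)$ always sum to one, and that setting $s_I = s_T = 1$ recovers the plain conditional prediction $\epsilon_\theta(z_t, c_I, c_T)$ with no extrapolation.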

The image guidance scale $s_I$ controls how much the output resembles the original; the text guidance scale $s_T$ controls how strongly the instruction is applied. Raising $s_T$ makes the edit more aggressive but risks changing parts you wanted preserved; raising $s_I$ keeps the image faithful but can suppress the edit. Tuning the pair is the core skill of using these models, and the right balance depends on the edit: a global style change tolerates a low $s_I$, while a small local change ("add a hat") wants a high $s_I$ to protect everything else. The snippet below runs the pretrained editor and surfaces both scales.

# Apply a text instruction to an existing image with InstructPix2Pix.
# Two guidance scales are surfaced: guidance_scale (s_T) drives how hard the
# instruction is applied, image_guidance_scale (s_I) drives how much is preserved.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to("cuda")

image = load_image("photo.png")
edited = pipe(
    "make it look like a snowy winter scene",
    image=image,
    num_inference_steps=30,
    guidance_scale=7.5,            # s_T: how strongly to apply the instruction
    image_guidance_scale=1.5,      # s_I: how strongly to preserve the input image
).images[0]
edited.save("winter.png")
Code Fragment 1: InstructPix2Pix in diffusers. The call wires the three-pass score combination of subsection 2 into a single pipeline invocation and exposes both scales: guidance_scale=7.5 (instruction strength $s_T$) and image_guidance_scale=1.5 (preservation strength $s_I$). Raising image_guidance_scale toward 2.0 keeps more of the original; lowering it lets the "snowy winter scene" instruction reshape the photo.

3. Why Preservation Is the Hard Part (Intermediate)

The defining challenge of instruction editing is the same preservation problem that runs through this whole chapter, now without a mask to enforce it. Inpainting could guarantee the unmasked region by construction; instruction editing has no such guarantee, because the model is free to change every pixel and only the training data and the image guidance scale discourage it. This is why instruction edits sometimes subtly alter a face, shift colors globally, or lose fine detail even when the instruction was local. The image guidance scale is a soft constraint, not a hard one. For edits that must preserve identity or specific regions exactly, practitioners still combine instruction editing with the masked methods of Section 35.3 or the inversion-based faithful editing of Section 35.5, applying the instruction only where it is allowed to act.
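One way to turn the soft constraint into a hard one is to composite the edited result back onto the original through a mask. The sketch below is a simplified, after-the-fact stand-in, not the inpainting method of Section 35.3 itself; `composite_with_mask` is a hypothetical helper, and images are flat lists of RGB tuples only to keep the example dependency-free.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def composite_with_mask(
    original: Sequence[Pixel],
    edited: Sequence[Pixel],
    mask: Sequence[bool],
) -> List[Pixel]:
    """Take edited pixels where mask is True; keep original pixels elsewhere."""
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("original, edited, and mask must have the same length")
    return [e if allowed else o for o, e, allowed in zip(original, edited, mask)]


original = [(10, 10, 10), (20, 20, 20), (30, 30, 30), (40, 40, 40)]
# The instruction edit changed the middle pixels, but the outer ones drifted too.
edited = [(11, 9, 10), (200, 200, 255), (210, 210, 255), (41, 40, 39)]
mask = [False, True, True, False]  # the edit may act only on the middle pixels

result = composite_with_mask(original, edited, mask)
```

Outside the mask the result is the original by construction, so the small drift at the first and last pixels is discarded. The price is the one this chapter keeps returning to: someone has to supply the mask, and a hard selection like this can leave a visible seam at the mask boundary.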

Fun Fact

An instruction editor takes you at your word, which is funnier than it sounds. Ask InstructPix2Pix to "make her happy" and it may repaint the whole face; ask it to "add fireworks" and you might get fireworks reflected in the subject's eyes, on their shirt, and in a puddle that was not there before, because the model has no notion of "just the sky." The early demos that went viral were mostly the over-eager edits: "turn it into a Van Gogh" applied with such enthusiasm that the cat dissolved into brushstrokes. The dual guidance scales of subsection 2 are, in a real sense, the volume knob for the model's literal-mindedness.

Practical Example: A Marketing Tool Ships "Edit by Typing"

Who: a small SaaS company building a self-serve creative tool for social-media marketers, 2024.
Situation: their non-designer users wanted to tweak stock and uploaded photos ("warmer lighting," "add a festive feel," "make the background a beach") without learning masks or layers.
Problem: a mask-based editor was too complex for the audience, but plain text-to-image could not edit an existing brand photo.
Decision: they shipped an InstructPix2Pix backend with the two guidance scales surfaced as two friendly sliders, "how big a change" and "how close to the original," defaulting to a high image-guidance value so edits stayed conservative. They added a fallback: if the user's instruction named an object, they routed it through Grounded-SAM and inpainting from Section 35.3 for a hard-masked edit instead.
Result: most edits succeeded with one sentence, and the masked fallback caught the cases where global instruction editing drifted.
Lesson: instruction editing is the right default for ease of use, but a preservation-critical product needs a masked path underneath for the edits where "mostly preserved" is not good enough.
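The two-slider design and the masked fallback in this example can be sketched in a few lines. Everything here is an illustrative assumption rather than the company's implementation: the scale ranges are hypothetical, `sliders_to_scales` and `choose_path` are hypothetical names, and a plain keyword match stands in for whatever object detection the real product used.

```python
# Hypothetical slider ranges; a real product would tune these on its own images.
TEXT_SCALE_RANGE = (3.0, 12.0)   # maps "how big a change" to guidance_scale
IMAGE_SCALE_RANGE = (1.0, 2.5)   # maps "how close to the original" to image_guidance_scale


def lerp(low: float, high: float, t: float) -> float:
    """Linear interpolation with the slider position clamped to [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    return low + (high - low) * t


def sliders_to_scales(change_amount: float, closeness: float) -> dict:
    """Translate two 0..1 slider positions into the two guidance scales."""
    return {
        "guidance_scale": lerp(*TEXT_SCALE_RANGE, change_amount),
        "image_guidance_scale": lerp(*IMAGE_SCALE_RANGE, closeness),
    }


def choose_path(instruction: str, known_objects: set) -> str:
    """Route to the masked path when the instruction names a known object.

    Naive whitespace tokenization: punctuation attached to a word defeats the match.
    """
    words = set(instruction.lower().split())
    return "masked" if words & known_objects else "instruction"


scales = sliders_to_scales(change_amount=0.5, closeness=1.0)
global_route = choose_path("add a festive feel", {"dog", "hat"})
object_route = choose_path("Remove the hat", {"dog", "hat"})
```

The returned dictionary uses the keyword names of the pipeline call in Code Fragment 1, so it could be passed through as keyword arguments. Starting the closeness slider near its maximum reproduces the conservative default described above.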

4. The Modern Editing Landscape (Advanced)

If InstructPix2Pix taught a model to edit from a sentence, why is the category still moving? Because the all-synthetic data behind InstructPix2Pix (2023) left visible artifacts, and closing that gap is the story of every editor since. MagicBrush (NeurIPS 2023) introduced a manually annotated dataset of real edits; fine-tuning InstructPix2Pix on it fixed many of the artifacts that the fully synthetic data had introduced. InstructDiffusion and the broader unified-editing line framed many vision tasks (editing, segmentation, keypoint detection) as instruction-following under one model.

By 2025, the strongest editors fold editing into large multimodal or flow-matching models. FLUX.1 Kontext (2025, arXiv:2506.15742) takes a reference image and an instruction and edits in a single flow-matching pass with strong subject and character consistency, and the image-output modes of frontier multimodal models edit by simply being told what to change in a conversation. On the open-weight side, Qwen-Image-Edit (Alibaba, released August 2025, Apache-2.0) builds an instruction editor on the 20B-parameter Qwen-Image backbone (arXiv:2508.02324). It feeds the input image to both a vision-language encoder and a VAE, which lets it separate semantic edits (replace or restyle an object) from appearance edits (precise text and color changes).

That two-encoder design is worth pausing on, because it captures the whole preservation problem in one architectural choice. Each encoder supplies what the other lacks: the VAE preserves pixel-exact appearance (the fine detail and exact colors the VAE of Chapter 31 is built to reconstruct) but carries little high-level meaning, while the vision-language encoder is semantically rich (it understands "the red car" as an object) but lossy on exact pixels. Routing the image through both gives the editor a faithful appearance channel and a meaning-aware channel at once, so it can rewrite what an object is without disturbing how unrelated regions look. Qwen-Image-Edit reports competitive results on editing benchmarks. The arc is toward editing as a native capability of a general model rather than a specialized pipeline, though the dual-guidance and preservation lessons here remain the right mental model for what the model is balancing internally.

The Right Tool: One Pipeline Class, Many Editors

Reproducing instruction editing from scratch means building the synthetic-data pipeline (a language model plus Prompt-to-Prompt rendering of hundreds of thousands of pairs) and then training a dual-conditioned model, which adds up to weeks of work and significant compute. In practice you load a pretrained editor: StableDiffusionInstructPix2PixPipeline.from_pretrained("timbrooks/instruct-pix2pix") for the classic model, or the equivalent pipeline classes for MagicBrush-tuned and newer checkpoints. The diffusers pipeline handles the dual-guidance three-pass score combination of subsection 2 internally; you supply an image, an instruction, and the two scales. Understand the data recipe and the guidance math; reach for the pretrained pipeline to actually edit.

Research Frontier: Editing as Conversation

The newest direction collapses the distinction between generation and editing. In-context editing models like FLUX.1 Kontext (2025) and the native image editing in large multimodal systems treat an edit as another turn in a multimodal exchange: you show an image, say what to change, see the result, and refine, all without masks, control maps, or per-edit training. These systems internalize the preservation problem (they are trained to change only what is mentioned) and chain edits while holding identity stable across turns, the multi-step consistency that Section 35.6 otherwise has to engineer by hand. The open question they raise is evaluation: how do you measure whether an edit changed exactly what was asked and nothing more? That question is the subject of Chapter 37.

Exercise 35.4.1: Read the Dual Guidance (Conceptual)

In the three-pass score formula of subsection 2, identify which difference term is scaled by $s_T$ and which by $s_I$, and state in words what each term pushes the generation toward. Then predict the qualitative result of setting $s_I$ very high and $s_T$ very low, and the opposite, and explain why the first barely edits while the second can change parts of the image you did not mention.

Exercise 35.4.2: Sweep Both Scales (Coding)

Using the InstructPix2Pix code in subsection 2 on a single photo and instruction (for example "make it autumn"), generate a grid that varies image_guidance_scale over $\{1.0, 1.5, 2.0\}$ across columns and guidance_scale over $\{5, 7.5, 10\}$ across rows. Identify the cell that best applies the edit while preserving the subject, and describe the failure modes in the corners (too little edit, too much drift).
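As a starting scaffold for the sweep, the loop structure can be separated from the model call. This is a minimal sketch, assuming an `edit_fn` callable that you write yourself to wrap the pipeline call from subsection 2; `sweep_scales` and `fake_edit` are hypothetical names, and `fake_edit` returns a label string so the scaffold runs without a GPU.

```python
from typing import Any, Callable, List, Sequence


def sweep_scales(
    edit_fn: Callable[..., Any],
    text_scales: Sequence[float],
    image_scales: Sequence[float],
) -> List[List[Any]]:
    """Return a grid: rows vary guidance_scale, columns vary image_guidance_scale."""
    return [
        [
            edit_fn(guidance_scale=s_text, image_guidance_scale=s_image)
            for s_image in image_scales
        ]
        for s_text in text_scales
    ]


# Stand-in for the real editor: returns a label instead of an image.
def fake_edit(guidance_scale: float, image_guidance_scale: float) -> str:
    return f"sT={guidance_scale}, sI={image_guidance_scale}"


grid = sweep_scales(
    fake_edit,
    text_scales=[5, 7.5, 10],
    image_scales=[1.0, 1.5, 2.0],
)
```

When you swap in the real editor, it is advisable to hold the random seed fixed across all nine cells so that differences between cells come from the scales and not from sampling noise.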

Exercise 35.4.3: When Does Instruction Editing Fail Preservation? (Analysis)

Run an identity-sensitive instruction edit (for example "add sunglasses" on a portrait) and inspect whether the person's face changes. Explain, using the fact that instruction editing has no mask and only a soft image-guidance constraint, why local edits can leak into global changes. Then describe how you would combine the masked object replacement of Section 35.3 with this instruction to guarantee the face is untouched, and what you give up by doing so.