"You showed me five photos of your dog and now I cannot stop drawing him. In a spacesuit. On the moon. I regret nothing."
A Diffusion Model That Has Bonded With Your Pet
Personalization teaches a pretrained generator a concept it never saw in training (your face, your product, a particular art style) from a handful of example images, by injecting new knowledge at one of three points in the stack. Textual inversion adds a new word, a single embedding vector, and changes nothing else. DreamBooth fine-tunes the whole model to bind a subject to a rare token, using a prior-preservation loss so the model does not forget everything else. LoRA inserts small low-rank adapters into the attention layers, the modern default because it is light, composable, and shareable. The three methods sit on a spectrum from "smallest change, least expressive" to "largest change, most expressive," and the practitioner's real skill is choosing the right point on that spectrum for the job.
Section 35.1 fixed where content goes by feeding the model a structural map; this section turns to the second axis of control, what a subject or style is, which no edge map can specify. Section 34.6 introduced these three methods at the level of "here is what each one does." This section takes them seriously as engineering choices. We work through what each one actually modifies, the loss that keeps DreamBooth from collapsing, the low-rank math that makes LoRA cheap, and a concrete comparison so that when a task lands on your desk you know which to reach for. We close with the production reality that you rarely use one adapter alone: you stack several LoRAs and weight them, which raises its own questions.
1. The Spectrum of Personalization (Beginner)
Picture the text-to-image stack from Chapter 34 as a line from input to output: text gets embedded, the embeddings condition the U-Net through cross-attention, and the U-Net's weights transform noise into a latent. A new concept can be injected at any of these stages. Inject it at the embedding (add a new word) and you have textual inversion. Inject it into the weights everywhere (fine-tune the U-Net) and you have DreamBooth. Inject it as small additive corrections inside the attention weights and you have LoRA. The illustration below dramatizes the escalating effort, and Figure 35.2.1 places the three on that line.
2. Textual Inversion: Learning One Word (Intermediate)
Textual inversion asks the smallest possible question: can a new concept be captured by a single new entry in the embedding table? It freezes the entire model and introduces one new token, written here as <my-cat>, whose embedding vector $v_*$ is the only thing that gets trained. Given a few images of the concept, you run ordinary diffusion training (corrupt the image, predict the noise) on prompts like "a photo of <my-cat>," and backpropagate only into $v_*$. The model's weights never move; you are searching the embedding space for the vector that, when fed through the frozen cross-attention, reconstructs your concept.
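Formally, $v_*$ minimizes the standard noise-prediction loss over the embedding alone:

$$v_* = \arg\min_{v} \; \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c(y; v)) \rVert_2^2 \,\right]$$

where $\theta$ is held fixed, $c(\cdot)$ is the frozen text encoder, $y$ is a prompt containing the new token (whose embedding is $v$), and $z_t$ is the noised latent at timestep $t$. The learned artifact is tiny: a single vector of several hundred to a couple of thousand floats (768 for Stable Diffusion 1.5), often under 10 KB. That is also its limitation. One vector cannot express everything about a complex subject, so textual inversion captures broad style and rough identity well but struggles with fine, exact detail (the precise pattern on a face, the exact logo on a product). It is the method to reach for when you want a shareable, model-agnostic "word" for a look, and the model already knows the rough class your concept belongs to.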
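The "train only the embedding" idea can be sketched in a few lines. The toy below is a stand-in under loud assumptions, not the diffusers training loop: a fixed random matrix `W_frozen` plays the role of the entire frozen model, `target` plays the role of the concept, and plain gradient descent updates only the hypothetical vector `v_star`. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the frozen model: a fixed linear map.
W_frozen = rng.normal(size=(16, 8))
W_before = W_frozen.copy()

# Hypothetical "concept" that some embedding can reproduce exactly.
target = W_frozen @ rng.normal(size=8)

# The only trainable quantity: one embedding vector for the new token.
v_star = np.zeros(8)


def loss(v):
    """Squared reconstruction error of the frozen map applied to v."""
    residual = W_frozen @ v - target
    return float(residual @ residual)


initial_loss = loss(v_star)

# Step size chosen from the spectral norm so gradient descent cannot diverge.
learning_rate = 0.5 / np.linalg.norm(W_frozen, 2) ** 2

for _ in range(2000):
    # Gradient of ||W v - target||^2 with respect to v only; W never changes.
    grad_v = 2.0 * W_frozen.T @ (W_frozen @ v_star - target)
    v_star = v_star - learning_rate * grad_v

final_loss = loss(v_star)
weights_unchanged = bool(np.array_equal(W_frozen, W_before))
```

The point of the sketch is the bookkeeping: the loss falls while `W_frozen` is bit-for-bit unchanged, so the only artifact worth saving afterward is `v_star`.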
3. DreamBooth: Binding a Subject Without Forgetting (Intermediate)
DreamBooth goes to the other extreme and fine-tunes the whole U-Net so that a rare token (such as sks) becomes a faithful handle on a specific subject. Fine-tuning a giant model on three to five images invites two failures. The first is overfitting: the model memorizes the exact training shots and can only reproduce those poses. The second, more insidious, is language drift or catastrophic forgetting: as the model learns that sks dog means your dog, it starts forgetting what dogs in general look like, and prompts for any dog begin returning your dog.
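DreamBooth's fix is the prior-preservation loss. Alongside the few subject images, you generate a few hundred images of the generic class ("a photo of a dog") using the model itself, and train on both sets at once. The subject term teaches the rare token; the class term reminds the model what the broader class looks like, anchoring the prior. Writing $c_{\text{subj}}$ for the subject prompt ("a photo of sks dog") and $x$ for a subject image, the combined objective is

$$\mathcal{L} = \mathbb{E}_{x,\,\epsilon,\,t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c_{\text{subj}}) \rVert_2^2\right] + \lambda\, \mathbb{E}_{x',\,\epsilon',\,t'}\left[\lVert \epsilon' - \epsilon_\theta(x'_{t'}, t', c_{\text{class}}) \rVert_2^2\right]$$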
where the second term uses model-generated class images $x'$ with conditioning $c_{\text{class}} = $ "a photo of a dog," and $\lambda$ (often around $1.0$) balances the two. The result is a model that renders your subject in any pose, lighting, and context the prompt asks for, while still being able to draw generic members of the class. The price is size: a full fine-tuned U-Net is gigabytes, and you have one per subject unless you convert it to a LoRA.
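The two-term structure of the prior-preservation objective is easy to see in code. The sketch below assumes the noise predictions have already been computed elsewhere; the function names and the toy arrays are hypothetical, and the error is averaged over elements (a sum would differ only by a constant factor).

```python
import numpy as np


def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))


def dreambooth_loss(subj_pred, subj_noise, class_pred, class_noise, prior_weight=1.0):
    """Subject term plus prior_weight (lambda) times the class-prior term."""
    subject_term = mse(subj_pred, subj_noise)
    prior_term = mse(class_pred, class_noise)
    total = subject_term + prior_weight * prior_term
    return total, subject_term, prior_term


# Toy numbers standing in for predicted and true noise on each batch.
subj_noise = np.zeros(2)
subj_pred = np.array([1.0, 1.0])    # subject error: mean of (1, 1) = 1
class_noise = np.zeros(2)
class_pred = np.array([2.0, 2.0])   # class error: mean of (4, 4) = 4

total, subject_term, prior_term = dreambooth_loss(
    subj_pred, subj_noise, class_pred, class_noise, prior_weight=0.5
)

# Setting prior_weight to zero recovers plain fine-tuning on the subject alone.
no_prior_total, _, _ = dreambooth_loss(
    subj_pred, subj_noise, class_pred, class_noise, prior_weight=0.0
)
```

With `prior_weight=0.0` the class images contribute no gradient at all, which is the configuration that lets the generic class drift toward the subject.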
The class images in DreamBooth are not real; the model generates them from its own prior just before training. This is a clever form of regularization: you are telling the model "whatever you learn about the rare token, do not change your answer to this set of generic prompts you already get right." It is the generative-model version of the rehearsal techniques used against catastrophic forgetting elsewhere in deep learning, and it foreshadows Chapter 37, where models generating their own training data becomes a theme in its own right.
4. LoRA: Low-Rank Adapters, the Modern Default (Intermediate)
LoRA sits in the productive middle. It comes from the observation, first made for language models in the transfer-learning setting of Chapter 21, that the change a fine-tune makes to a weight matrix is usually low rank: you do not need a full dense update, just a thin one. For a frozen weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA learns a low-rank correction $\Delta W = B A$ where $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ with rank $r \ll \min(d, k)$, and computes the layer's output for an input $x$ as

$$h = W_0 x + \frac{\alpha}{r}\, B A\, x$$
Only $A$ and $B$ are trained; $W_0$ stays frozen. With $r$ as small as 4 to 16, the adapter has orders of magnitude fewer parameters than the full matrix, so the trained file is a few megabytes rather than gigabytes. $B$ is initialized to zero (the same start-at-identity trick as ControlNet's zero convolutions in Section 35.1), so the model is unchanged at the start of training. The scalar $\alpha / r$ sets the adapter's strength, and dividing by $r$ is what lets you change the rank without re-tuning everything else. A larger $r$ means $BA$ sums more terms, which would otherwise grow its magnitude; dividing by $r$ keeps the correction's scale roughly constant as you adjust capacity. In diffusion models, LoRA is applied to the attention projection matrices, the same Q, K, V, O matrices you met in the attention layers of Chapter 22. The code below loads an already-trained LoRA into a base pipeline and renders with it.
```python
# Load a trained subject LoRA into a frozen base model and render with it.
# The LoRA is a small file of low-rank A/B matrices; loading patches the
# attention projections, and set_adapters scales each adapter's contribution.
import torch
from diffusers import StableDiffusionPipeline

# Inference: load the base model in half precision on the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Loading the LoRA patches the attention weights in memory.
pipe.load_lora_weights("my_dog_lora.safetensors", adapter_name="dog")
pipe.set_adapters(["dog"], adapter_weights=[0.8])  # inference-time strength multiplier

image = pipe(
    "a photo of sks dog wearing a tiny astronaut helmet on the moon",
    num_inference_steps=30,
).images[0]
image.save("astro_dog.png")
```
The load_lora_weights call patches the attention matrices from a small .safetensors file, and the set_adapters([...], adapter_weights=[0.8]) value is an inference-time multiplier on the adapter's already-scaled correction $\alpha/r \cdot BA$; lowering it toward zero fades the personalization out, and raising it past one over-applies the concept. The base model file is untouched on disk.

Two distinct numbers are easy to conflate. The rank $r$ is a training-time capacity choice baked into the file (how many directions $BA$ can express); the adapter_weights value is an inference-time multiplier on the already-trained delta. You cannot raise a rank-4 LoRA's expressiveness by setting its weight to 4 at inference; you only over-apply the few directions it learned, which oversaturates the concept rather than adding detail. A related error is believing a higher weight always means more faithful identity: past roughly $1$ the adapter starts overriding the prompt and the base prior, so the subject appears but the scene, lighting, and composition degrade. Finally, loading a LoRA does not write into the base checkpoint; it patches the attention projections in memory, which is exactly why the same small file composes with other adapters and can be reused on other checkpoints built on the same base architecture.
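The difference between rank and inference-time weight can be checked numerically. The following is a minimal sketch with random matrices standing in for a trained adapter (the shapes and variable names are assumptions for illustration): scaling the delta changes its magnitude but never its rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4

# Random stand-ins for a trained rank-4 adapter.
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta = B @ A  # the low-rank correction stored in the file

# Sweep the inference-time weight and record rank and size of the applied delta.
summary = {}
for weight in (0.5, 1.0, 4.0):
    scaled = weight * delta
    summary[weight] = (
        int(np.linalg.matrix_rank(scaled)),  # number of directions expressed
        float(np.linalg.norm(scaled)),       # how hard those directions are pushed
    )

rank_at_1 = summary[1.0][0]
rank_at_4 = summary[4.0][0]
norm_ratio = summary[4.0][1] / summary[1.0][1]
```

A weight of 4 pushes the same four directions four times as hard; it does not give the adapter any directions it did not learn.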
Because a LoRA is just a small additive delta to specific matrices, it has two properties that make it the default. It is shareable: a few-megabyte file rather than a multi-gigabyte checkpoint. And it is composable: you can load several at once and blend them, which subsection 6 covers. The training script (diffusers ships train_dreambooth_lora.py) combines DreamBooth's prior-preservation idea with LoRA's low-rank parameterization, giving subject fidelity at a fraction of the storage.
The whole reason a LoRA fits in an email attachment is the rank $r$. A full attention matrix update for a 1024-dimensional layer is over a million numbers; a rank-8 LoRA stores two thin matrices, roughly 16,000 numbers, for the same layer. That is why a DreamBooth checkpoint is gigabytes and the equivalent LoRA is megabytes: you are not saving the new model, you are saving the handful of directions per matrix in which it differs from the old one. The "rare token" trick has a similar frugal spirit: the field reached for nonsense strings like sks precisely because the model had almost no prior meaning attached to them, leaving a clean handle to overwrite.
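The arithmetic behind those figures is short enough to write out. This sketch assumes a square 1024 by 1024 projection, as in the example above; the helper names are hypothetical.

```python
def full_update_params(d, k):
    """Parameters in a dense update to a d x k weight matrix."""
    return d * k


def lora_params(d, k, r):
    """Parameters in a rank-r adapter: B is d x r and A is r x k."""
    return r * (d + k)


d = k = 1024
full = full_update_params(d, k)   # 1024 * 1024 = 1,048,576
lora = lora_params(d, k, r=8)     # 8 * (1024 + 1024) = 16,384
ratio = full // lora              # the dense update is 64 times larger
```

The saving grows with the layer width: doubling `d` and `k` quadruples the dense count but only doubles the adapter's.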
5. Choosing Among the Three (Intermediate)
The three methods are not competitors so much as points on a budget curve. Table 35.2.1 lays out the tradeoffs that decide which to use.
| Method | What it trains | Artifact size | Images needed | Best for |
|---|---|---|---|---|
| Textual Inversion | One embedding vector | ~10 KB | 3 to 5 | Styles, rough identity, shareable "words" |
| DreamBooth (full) | The entire U-Net | 2 to 7 GB | 3 to 5 | Highest subject fidelity, one-off use |
| LoRA (DreamBooth-LoRA) | Low-rank attention deltas | 2 to 200 MB | 5 to 20 | The default: good fidelity, shareable, stackable |
The rule of thumb: start with LoRA. It gives most of DreamBooth's fidelity at a thousandth of the storage and composes with other adapters. Reach for full DreamBooth only when you need the absolute best fidelity for one important subject and storage does not matter. Use textual inversion when you want a portable concept word that works across many base models, or when you only need to capture a style rather than an exact subject. All three are referenced in Chapter 34; the practical difference is almost always storage and composability, which is why the field converged on LoRA.
Writing a LoRA layer means subclassing every attention module to add the $BA$ path, hooking the optimizer to train only $A$ and $B$, and serializing just those tensors, perhaps two hundred lines. The Hugging Face peft library does it in one call: get_peft_model(unet, LoraConfig(r=8, target_modules=["to_q","to_k","to_v","to_out.0"])) wraps the attention projections automatically, and save_pretrained writes only the adapter. For the full pipeline, the diffusers train_dreambooth_lora.py script handles prior-preservation image generation, the combined loss, mixed precision, and checkpointing; you supply a folder of images and a config. Implement the low-rank math once for understanding; in practice you run the script.
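For the "implement it once for understanding" step, a forward pass is enough to see the mechanics. The class below is a hand-rolled NumPy sketch, not the peft or diffusers implementation: the name `LoRALinear`, the small Gaussian initialization of `A`, and the `weight` argument (mirroring an inference-time adapter weight) are all assumptions made for illustration, and no training is performed.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W0 plus a scaled low-rank path B @ A."""

    def __init__(self, W0, r, alpha, rng):
        d, k = W0.shape
        self.W0 = W0                                   # frozen, shape (d, k)
        self.A = rng.normal(scale=0.01, size=(r, k))   # trainable, small random init
        self.B = np.zeros((d, r))                      # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x, weight=1.0):
        # x has shape (batch, k); the result has shape (batch, d).
        base = x @ self.W0.T
        low_rank = (x @ self.A.T) @ self.B.T
        return base + weight * self.scale * low_rank

    def merged_weight(self, weight=1.0):
        # Equivalent single matrix: W0 + weight * (alpha / r) * B A.
        return self.W0 + weight * self.scale * (self.B @ self.A)


rng = np.random.default_rng(0)
W0 = rng.normal(size=(32, 16))
layer = LoRALinear(W0, r=4, alpha=4.0, rng=rng)
x = rng.normal(size=(5, 16))

# With B at zero the adapter is a no-op: the output equals the frozen layer's.
starts_at_identity = bool(np.allclose(layer(x), x @ W0.T))

# Pretend training has moved B away from zero, then compare the two code paths.
layer.B = rng.normal(size=(32, 4))
paths_agree = bool(np.allclose(layer(x, weight=0.8), x @ layer.merged_weight(0.8).T))
output_changed = not np.allclose(layer(x), x @ W0.T)
```

Keeping the low-rank path separate, rather than merging it into `W0`, is what makes the adapter cheap to store and easy to rescale or remove at inference.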
6. Stacking and Weighting Multiple LoRAs (Advanced)
The composability of LoRA is its quiet superpower and its quiet trap. Because each LoRA is an additive delta to the same matrices, you can load several at once: a subject LoRA for your character, a style LoRA for a watercolor look, and a detail LoRA for sharper textures. diffusers exposes this directly.
```python
# Load two LoRAs at once (a subject and a style) and blend them with
# independent weights. Both add low-rank deltas into the same attention
# matrices, so their effects superpose and can also interfere.
# Assumes `pipe` is the StableDiffusionPipeline built in the previous listing.
pipe.load_lora_weights("character.safetensors", adapter_name="char")
pipe.load_lora_weights("watercolor.safetensors", adapter_name="style")

# Per-adapter weights: subject at 0.9, style at 0.6.
pipe.set_adapters(["char", "style"], adapter_weights=[0.9, 0.6])
image = pipe("a portrait of sks person, painterly", num_inference_steps=30).images[0]
```
The set_adapters(["char", "style"], adapter_weights=[0.9, 0.6]) call superposes both deltas on the same attention matrices, but conflicting concepts can interfere; the per-adapter weights are the knob you tune to keep the style from overwhelming the subject's identity.

The trap is interference. Two LoRAs that touch the same matrices can fight: a strong style LoRA can wash out a subject's identity, or two subject LoRAs can blend into a chimera. The weights are the first line of defense, but for serious multi-concept work the field has developed merging methods (such as orthogonal or SVD-based merges) that combine adapters with less conflict than naive addition. The practical discipline is to tune one LoRA at a time, then add the next at a low weight and raise it until it just takes effect.
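Conceptually, stacking is weighted addition of deltas onto one frozen matrix. The sketch below uses random matrices in place of trained adapters and is not how any particular library stores or applies them; `stack` and `overlap` are hypothetical helpers, and the cosine `overlap` score is only one crude way to gauge how much two deltas push in the same direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 64


def random_delta(r):
    """Random stand-in for a trained rank-r correction B @ A."""
    return rng.normal(size=(d, r)) @ rng.normal(size=(r, k))


def stack(W0, deltas, weights):
    """Naive composition: W0 plus the weighted sum of all deltas."""
    update = sum(w * delta for delta, w in zip(deltas, weights))
    return W0 + update, update


def overlap(a, b):
    """Cosine similarity between two deltas viewed as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))


W0 = rng.normal(size=(d, k))
delta_char = random_delta(8)    # stand-in for the subject adapter
delta_style = random_delta(4)   # stand-in for the style adapter

W, update = stack(W0, [delta_char, delta_style], [0.9, 0.6])

# The combined update can span at most 8 + 4 directions.
combined_rank = int(np.linalg.matrix_rank(update))
self_overlap = overlap(delta_char, delta_char)
cross_overlap = overlap(delta_char, delta_style)
```

Lowering one weight in `stack` shrinks that adapter's share of the shared update, which is the numerical counterpart of turning down a style so it stops washing out the subject.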
Who: a four-person concept-art team at an indie game studio, 2024. Situation: they needed hundreds of illustrations of the same hero character in different scenes, costumes, and lighting for marketing and in-game art. Problem: prompting alone could not hold the character's face and proportions consistent; every image was a slightly different person, and full DreamBooth checkpoints at 4 GB each were unmanageable to version and share across the team. Decision: they trained a single DreamBooth-LoRA on twenty clean turnaround renders of the character at rank 16, producing a 40 MB file they checked into their asset repository. They paired it at inference with an existing watercolor style LoRA at weight 0.5. Result: a portable, version-controlled character that any team member could load, render in any scene, and restyle by swapping the style LoRA, with the character weight held at 0.9 so identity never drifted. Lesson: LoRA's small size and composability turn personalization from a storage problem into an asset-management win; the character became a file you check in, not a model you redeploy.
Everything in subsections 4 through 6 adds up to a small, portfolio-ready project. Take fifteen to twenty photos of one subject you have rights to (a pet, a favorite mug, a 3D-printed figurine), train a rank-8 DreamBooth-LoRA with the diffusers train_dreambooth_lora.py script, and publish the few-megabyte .safetensors file to the Hugging Face Hub with a model card showing the rare token and a strength-sweep grid. The deliverable is the same asset the game studio above checked into its repository: a portable concept that anyone can load with load_lora_weights and stack with a style LoRA, exactly the composability of subsection 6. Plan for about thirty minutes of setup and a short single-GPU training run; the result is a downloadable artifact that demonstrates personalization, low-rank adaptation, and adapter weighting in one link, which is far more convincing in an interview than describing the method in words.
The methods above all require a training run, even if a short one. The frontier through 2024 and 2025 is encoder-based or tuning-free personalization: a single forward pass that injects identity from one reference photo with no optimization at all. InstantID (2024, arXiv:2401.07519) and PhotoMaker (CVPR 2024, arXiv:2312.04461) encode a face into conditioning tokens and combine them with an IP-Adapter-style path from Section 35.1, producing identity-consistent images instantly. The same idea generalizes: IP-Adapter-FaceID and the in-context character consistency of FLUX.1 Kontext (2025, arXiv:2506.15742) take a reference image and a prompt and keep the subject without ever updating a weight. The trajectory mirrors the rest of the field: a per-instance optimization gives way to a learned amortized encoder that does the same job in one shot.
For each scenario, choose textual inversion, DreamBooth, or LoRA and justify the choice in one or two sentences using Table 35.2.1: (a) you want to share a "1970s film grain" look that works across SD 1.5, SDXL, and any base model a community member uses; (b) you must reproduce a single celebrity's exact face for one premium ad with no storage constraint; (c) you are building a library of fifty reusable character assets that artists will mix and match. State which property (artifact size, fidelity, composability, model-portability) is decisive in each case.
Using diffusers' train_dreambooth_lora.py on 10 to 20 photos of a single object, train a rank-8 LoRA for a rare token. Then, with a fixed prompt and seed, render the object at adapter_weights in $\{0.2, 0.5, 0.8, 1.1, 1.5\}$. Describe the transition from "concept absent" to "concept overcooked" (over-saturated, ignoring the rest of the prompt) and report the weight that best balances fidelity and prompt-following.
Train two DreamBooth models on the same five subject images, one with the prior-preservation term and one without. After each, prompt for a generic member of the class (for example "a photo of a dog" if the subject is a dog) several times. Compare the outputs and explain, in terms of the combined loss in subsection 3, why the model trained without prior preservation begins returning the specific subject for the generic prompt. Connect this to catastrophic forgetting as discussed for transfer learning in Chapter 21.