"You typed eleven words and expected a cathedral at sunset with anatomically correct hands. I delivered the cathedral, the sunset, and a respectable six fingers. Two of us tried our best."
A Text-to-Image Pipeline With Realistic Expectations
A text-to-image system is not one model but a small assembly line: a text encoder turns your sentence into vectors, a generator (almost always the diffusion model of Chapter 33) turns noise into a latent while attending to those vectors, and an autoencoder decodes the latent into pixels. Once you see the pipeline as separable stages, the entire bewildering landscape of named products falls into place: DALL-E, Imagen, Stable Diffusion, Midjourney, and FLUX differ mainly in which text encoder they pick, which generator backbone they use, and how aggressively they post-process. This chapter opens the assembly line, examines each station, walks the model landscape so the names stop being magic, detours through the token-based alternative to diffusion, and ends with the two skills that turn a generator into a tool you control: prompting and fine-tuning.
The one schema to carry out of this chapter is the assembly line and its dials. Three stations: encode the sentence, generate a latent, decode to pixels (Sections 34.1, 34.2). Three knobs: which encoder, which generator backbone, and what data and finishing shaped the model (Section 34.3). Every named system, and every system released after this book is printed, is the same three stations with the three knobs set differently. When a model misbehaves, ask which station broke and which knob is responsible, rather than treating the product as one opaque box.
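The stations-and-knobs schema can be written down as a few lines of Python. This is a structural sketch only, not a working generator: `AssemblyLine`, `toy_encoder`, `toy_generator`, and `toy_decoder` are hypothetical names invented for this illustration, and each station is a deliberately trivial stand-in for the real component.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class AssemblyLine:
    """Three stations; each field is one swappable component (hypothetical API)."""
    encode_text: Callable[[str], List[Vector]]               # sentence -> conditioning vectors
    generate_latent: Callable[[List[Vector], int], Vector]   # conditioning + seed -> latent
    decode_latent: Callable[[Vector], List[int]]             # latent -> pixel values

    def __call__(self, prompt: str, seed: int = 0) -> List[int]:
        cond = self.encode_text(prompt)
        latent = self.generate_latent(cond, seed)
        return self.decode_latent(latent)


def toy_encoder(prompt: str) -> List[Vector]:
    # Stand-in: one 2-d vector per word (word length, vowel count).
    return [[float(len(w)), float(sum(c in "aeiou" for c in w.lower()))]
            for w in prompt.split()]


def toy_generator(cond: List[Vector], seed: int) -> Vector:
    # Stand-in: start from seeded noise, then step toward the mean
    # conditioning vector. Assumes a non-empty prompt.
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(2)]
    target = [sum(v[i] for v in cond) / len(cond) for i in range(2)]
    for _ in range(10):
        latent = [x + 0.5 * (t - x) for x, t in zip(latent, target)]
    return latent


def toy_decoder(latent: Vector) -> List[int]:
    # Stand-in: scale each latent value and clip to the 0..255 pixel range.
    return [max(0, min(255, round(32 * x))) for x in latent]


pipeline = AssemblyLine(toy_encoder, toy_generator, toy_decoder)
pixels = pipeline("a cathedral at sunset", seed=1)
```

Swapping one station means constructing a new `AssemblyLine` with a different callable in that slot, which is the sense in which a named system is the same template with the knobs set differently.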
Chapter Overview
The previous chapter built the generative engine. Chapter 33 showed how to corrupt an image into noise and train a network to walk it back, how to make that walk fast with better samplers, and how to run the whole process in a compressed latent space so it fits on a laptop. It also introduced, in its guidance section, the one hook that everything in this chapter hangs from: a diffusion model can be made conditional. Feed the denoiser an extra signal at every step and it will denoise toward images that match the signal. When that signal encodes the meaning of a sentence, you have a text-to-image system. This chapter is about where that signal comes from, how it is injected, and how the resulting systems are built, named, prompted, and customized.
We begin at the bridge between language and pixels. Section 34.1 studies the text encoders that produce the conditioning vectors: CLIP, whose contrastive training aligned images and captions in a shared embedding space and gave the field its first universal text-to-image bridge, and the large language-model encoders like T5 that newer systems prefer for their grasp of long, compositional prompts. The encoder is the most underappreciated component of the stack; swap CLIP for a stronger encoder and the same diffusion backbone suddenly follows instructions it used to ignore.
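To make "aligned images and captions in a shared embedding space" concrete, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss. It assumes the embeddings are already computed and uses a fixed temperature of 0.07 for simplicity (in CLIP the temperature is learned); the function names and the two-dimensional toy embeddings are illustrative assumptions, not part of any library.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs: List[Vector], text_embs: List[Vector],
                      temperature: float) -> List[List[float]]:
    # Cosine similarity between every image and every caption, scaled by 1/temperature.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    # Mean of -log softmax(row)[k] over rows k: the matching pair sits on the diagonal.
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def clip_style_loss(image_embs: List[Vector], text_embs: List[Vector],
                    temperature: float = 0.07) -> float:
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy batch: image k and caption k point in roughly the same direction.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = clip_style_loss(images, captions)
mismatched_loss = clip_style_loss(images, captions[::-1])
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the pressure that pulls matching images and captions together in the shared space.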
With conditioning in hand, Section 34.2 opens the body of the system: the three-part Stable Diffusion architecture of variational autoencoder, denoising U-Net (or its transformer successor, the DiT), and the cross-attention layers that let text reach into the image. We trace a single generation end to end in code so the data flow is concrete, then Section 34.3 uses that anatomy to read the model landscape: how DALL-E 2 and 3, Imagen, the Stable Diffusion line through SDXL and SD3, Midjourney, and FLUX each instantiate the same template with different choices, and which choice explains which strength. Section 34.4 steps outside diffusion entirely to the autoregressive and masked-token generators (Parti, MUSE, and the image branch of modern multimodal models) that treat an image as a sequence of discrete tokens and generate it the way a language model generates text.
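The phrase "let text reach into the image" has a precise meaning: queries come from image positions, while keys and values come from text tokens. The sketch below is a single-head cross-attention in pure Python with the learned projection matrices omitted, which is a simplifying assumption; real layers project queries, keys, and values with trained weights and use several heads. All names and toy vectors are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries: List[Vector], text_keys: List[Vector],
                    text_values: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Each image position attends over all text tokens (scaled dot-product)."""
    d = len(text_keys[0])
    outputs, weights = [], []
    for q in image_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_keys]
        w = softmax(scores)  # one weight per text token, summing to 1
        out = [sum(w_j * v[i] for w_j, v in zip(w, text_values))
               for i in range(len(text_values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Two image positions, two text tokens (toy numbers).
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
attended, attn = cross_attention(queries, keys, values)
```

Each row of `attn` shows how much one image position draws from each token, so the first position here reads mostly from the first token and the second position mostly from the second.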
The final two sections turn the systems into instruments. Section 34.5 is a practitioner's guide to prompt engineering: how prompt structure, weighting, and negative prompts actually move the conditioning, why guidance scale trades fidelity against diversity, and how to debug a prompt that will not cooperate. Section 34.6 covers fine-tuning, from full retraining down to the lightweight customization methods (textual inversion, DreamBooth, LoRA) that let you teach a model a new face, object, or style on a single GPU in an afternoon. This is where the transfer-learning thread of Chapter 21 reaches its generative conclusion.
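The guidance-scale dial is small enough to write out. The sketch below shows the classifier-free guidance combination in the convention where a scale of 1 means "conditional prediction only"; some papers parameterize the same idea with an offset weight instead. The function name `guided_noise` and the two-element noise vectors are illustrative assumptions; in a real sampler both predictions are full latent-shaped tensors produced by the denoiser at each step.

```python
from typing import List


def guided_noise(eps_uncond: List[float], eps_cond: List[float],
                 guidance_scale: float) -> List[float]:
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward (and, for scale > 1, past) the text-conditioned prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # denoiser output with the empty prompt
eps_cond = [1.0, 3.0]     # denoiser output with the actual prompt

no_text = guided_noise(eps_uncond, eps_cond, 0.0)      # ignores the prompt
plain = guided_noise(eps_uncond, eps_cond, 1.0)        # conditional prediction as-is
amplified = guided_noise(eps_uncond, eps_cond, 2.0)    # pushes past the conditional prediction
```

Raising the scale amplifies whatever distinguishes the prompted prediction from the unprompted one, which is why images follow the prompt more closely but vary less. A common way to implement negative prompts is to feed the negative prompt's embedding into the unconditional branch, so the same formula pushes away from it.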
The recurring lesson is modularity. A text-to-image system is a composition of parts you already understand: the contrastive representation learning of Chapter 25, the cross-attention of Chapter 22, the latent autoencoder of Chapter 31, and the latent diffusion of Chapter 33. Understanding the seams between the parts is what lets you reason about a system you have never used, fix a failure you have never seen, and choose the right model before you have wasted an afternoon on the wrong one.
Prerequisites
This chapter assumes Chapter 33: Diffusion Models in full, especially classifier-free guidance and latent diffusion, which are the substrate for everything here. You need the self- and cross-attention of Chapter 22, since cross-attention is the mechanism that injects text into the image. The contrastive and self-supervised learning of Chapter 25 explains how CLIP was trained, and the autoencoders and VAEs of Chapter 31 are the compression stage of the Stable Diffusion latent space. The transfer learning and fine-tuning recipes of Chapter 21 set up the customization methods of Section 34.6. Comfort with PyTorch and a working diffusers install are assumed; a GPU with 8 GB or more of memory makes the code examples runnable.
Chapter Roadmap
- 34.1 Connecting Text & Pixels: CLIP & Text Encoders How a sentence becomes conditioning vectors: CLIP's contrastive image-text alignment and shared embedding space, the difference between pooled and per-token embeddings, and why large language-model encoders like T5 improve prompt following on long, compositional prompts.
- 34.2 Inside Stable Diffusion: VAE, U-Net, DiT & Conditioning The three-part architecture in detail: the perceptual autoencoder that compresses to latents, the denoising U-Net and its diffusion-transformer successor, and the cross-attention layers that let text reach into every spatial location. A full generation traced end to end in code.
- 34.3 The Model Landscape: DALL-E, Imagen, Midjourney & FLUX Reading the named systems as instances of one template: DALL-E 2's unCLIP prior versus DALL-E 3's training on synthetically recaptioned images, Imagen's frozen T5 encoder, the open Stable Diffusion line through SDXL and SD3, Midjourney's aesthetic tuning, and FLUX's rectified-flow transformer.
- 34.4 Autoregressive & Token-Based Image Generation The alternative to diffusion: tokenize an image into discrete codes with a VQ autoencoder, then generate the codes with a transformer, either left to right (Parti) or by parallel unmasking (MUSE). How modern multimodal models fold image generation into a single token stream.
- 34.5 Prompt Engineering for Image Generation A mechanistic guide to prompting: how subject, style, and modifier structure maps onto the conditioning, prompt weighting and negative prompts, the fidelity-versus-diversity tradeoff of guidance scale, seeds and reproducibility, and a systematic procedure for debugging a prompt that will not behave.
- 34.6 Fine-Tuning Text-to-Image Models Teaching a generator new concepts: full fine-tuning and its cost, textual inversion that learns a new word, DreamBooth that binds a subject to a rare token with a prior-preservation loss, and LoRA's low-rank adapters that make subject and style customization a single-GPU afternoon.
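The low-rank adapters named in the Section 34.6 entry rest on one piece of arithmetic, sketched here in pure Python. A frozen weight matrix W is augmented by a trainable product B·A of rank r, scaled by alpha / r, with B initialized to zero so the adapter starts as a no-op. The helper names, the tiny matrices, and the 768-wide layer used for the parameter count are illustrative assumptions, not the dimensions of any particular model.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the adapter rank (rows of A)."""
    r = len(A)
    delta = matmul(B, A)          # out_dim x in_dim, but rank at most r
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def count_params(out_dim: int, in_dim: int, rank: int) -> Dict[str, int]:
    # Full fine-tuning trains every entry of W; LoRA trains only A and B.
    return {"full": out_dim * in_dim, "lora": rank * (out_dim + in_dim)}


rng = random.Random(0)
W = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]                    # frozen 2 x 3 weight
A = [[rng.gauss(0.0, 0.02) for _ in range(3)]]              # rank-1 down-projection, random init
B_zero = [[0.0], [0.0]]                                     # up-projection, zero init

at_init = lora_effective_weight(W, A, B_zero, alpha=8.0)    # identical to W before training
budget = count_params(768, 768, 4)
```

For the assumed 768 by 768 layer at rank 4, the adapter trains 6,144 parameters in place of 589,824, which is the kind of saving that makes single-GPU customization plausible.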
What's Next?
This chapter shows how a prompt produces an image; the next shows how to seize control of the result after the prompt runs out of expressive power. Chapter 35: Controllable Generation & Image Editing takes the conditioned diffusion model of Section 34.2 and adds spatial control: ControlNet and adapters that condition on edge maps, depth, and pose; inpainting and outpainting that edit a region while preserving the rest; and latent inversion that reconstructs the noise behind a real photograph so you can edit it. The fine-tuning methods of Section 34.6 reappear there as the substrate for personalized editing, and Chapter 37 returns to ask how we measure whether any of these systems are actually good. The assembly line you learn here is the chassis every later generative system is bolted onto.
Bibliography & Further Reading
Tools & Libraries
diffusers. github.com/huggingface/diffusers