Part IV: Generative Vision Models
Chapter 34: Text-to-Image Systems

"You typed eleven words and expected a cathedral at sunset with anatomically correct hands. I delivered the cathedral, the sunset, and a respectable six fingers. Two of us tried our best."

A Text-to-Image Pipeline With Realistic Expectations
Big Picture

A text-to-image system is not one model but a small assembly line: a text encoder turns your sentence into vectors, a generator (almost always the diffusion model of Chapter 33) turns noise into a latent while attending to those vectors, and an autoencoder decodes the latent into pixels. Once you see the pipeline as separable stages, the entire bewildering landscape of named products falls into place: DALL-E, Imagen, Stable Diffusion, Midjourney, and FLUX differ mainly in which text encoder they pick, which generator backbone they use, and how aggressively they post-process. This chapter opens the assembly line, examines each station, walks the model landscape so the names stop being magic, detours through the token-based alternative to diffusion, and ends with the two skills that turn a generator into a tool you control: prompting and fine-tuning.

Key Insight: Three Stations, Three Knobs

The one schema to carry out of this chapter is the assembly line and its dials. Three stations: encode the sentence, generate a latent, decode to pixels (Sections 34.1, 34.2). Three knobs: which encoder, which generator backbone, and what data and finishing shaped the model (Section 34.3). Every named system, and every system released after this book is printed, is the same three stations with the three knobs set differently. When a model misbehaves, ask which station broke and which knob is responsible, rather than treating the product as one opaque box.
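
The three-station view can be made concrete as a small composition of functions. The sketch below is illustrative only: the class name TextToImagePipeline and the toy_* stages are hypothetical stand-ins invented for this example, not the API of any real library, and each toy stage is a trivial placeholder for the real encoder, denoising loop, and decoder.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

Vector = List[float]


@dataclass
class TextToImagePipeline:
    """Hypothetical container: three swappable stations, one field per knob."""
    encode_text: Callable[[str], List[Vector]]         # station 1: sentence -> vectors
    generate_latent: Callable[[List[Vector]], Vector]  # station 2: vectors -> latent
    decode_latent: Callable[[Vector], List[int]]       # station 3: latent -> pixels

    def __call__(self, prompt: str) -> List[int]:
        conditioning = self.encode_text(prompt)
        latent = self.generate_latent(conditioning)
        return self.decode_latent(latent)


def toy_encoder(prompt: str) -> List[Vector]:
    # Placeholder for CLIP/T5: one 2-d vector per word (length, vowel count).
    return [[float(len(w)), float(sum(c in "aeiou" for c in w.lower()))]
            for w in prompt.split()]


def length_only_encoder(prompt: str) -> List[Vector]:
    # A second placeholder encoder, used to show a knob being changed.
    return [[float(len(w)), 0.0] for w in prompt.split()]


def toy_generator(conditioning: List[Vector]) -> Vector:
    # Placeholder for the denoising loop: average the token vectors.
    n = len(conditioning)
    return [sum(v[i] for v in conditioning) / n for i in range(2)]


def toy_decoder(latent: Vector) -> List[int]:
    # Placeholder for the VAE decoder: scale and clamp to the 0..255 range.
    return [max(0, min(255, round(x * 32))) for x in latent]


pipeline = TextToImagePipeline(toy_encoder, toy_generator, toy_decoder)
print(pipeline("a red cathedral"))

# Turning one knob: swap the encoder, keep the other two stations untouched.
swapped = replace(pipeline, encode_text=length_only_encoder)
print(swapped("a red cathedral"))
```

Because each station is just a field, changing a knob means replacing one function while the other two stay fixed, which is the sense in which named systems differ from one another.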

Chapter Overview

The previous chapter built the generative engine. Chapter 33 showed how to corrupt an image into noise and train a network to walk it back, how to make that walk fast with better samplers, and how to run the whole process in a compressed latent space so it fits on a laptop. It also introduced, in its guidance section, the one hook that everything in this chapter hangs from: a diffusion model can be made conditional. Feed the denoiser an extra signal at every step and it will denoise toward images that match the signal. When that signal encodes the meaning of a sentence, you have a text-to-image system. This chapter is about where that signal comes from, how it is injected, and how the resulting systems are built, named, prompted, and customized.
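
As a reminder of the conditioning hook, here is a minimal sketch of the usual classifier-free guidance combination, written on plain Python lists rather than tensors. The function name and the toy noise vectors are assumptions made for illustration; a real denoiser would produce the two predictions from the same noisy latent, once with the text signal and once without.

```python
from typing import List


def classifier_free_guidance(eps_uncond: List[float],
                             eps_cond: List[float],
                             guidance_scale: float) -> List[float]:
    """Blend two noise predictions: uncond + scale * (cond - uncond).

    scale = 0 ignores the prompt, scale = 1 is the plain conditional
    prediction, and scale > 1 extrapolates past it toward the prompt.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (made-up numbers): they disagree only in the first coordinate.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]

for scale in (0.0, 1.0, 7.5):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```

Only the coordinate where the two predictions disagree is amplified, which is why a larger scale pushes samples harder toward the prompt at some cost in variety.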

We begin at the bridge between language and pixels. Section 34.1 studies the text encoders that produce the conditioning vectors: CLIP, whose contrastive training aligned images and captions in a shared embedding space and gave the field its first universal text-to-image bridge, and the large language-model encoders like T5 that newer systems prefer for their grasp of long, compositional prompts. The encoder is the most underappreciated component of the stack; swap CLIP for a stronger encoder and the same diffusion backbone suddenly follows instructions it used to ignore.
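
To make "aligned in a shared embedding space" concrete, the following sketch computes a CLIP-style symmetric contrastive loss on tiny hand-written embeddings. It is a simplification under stated assumptions: the embeddings are made-up 2-d vectors rather than encoder outputs, the temperature is fixed at 0.07 instead of being learned, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs: List[Vector], text_embs: List[Vector],
                      temperature: float) -> List[List[float]]:
    # Cosine similarity of every image with every caption, sharpened by 1/T.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    # Row k should put its probability mass on column k (its true partner).
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def clip_style_loss(image_embs: List[Vector], text_embs: List[Vector],
                    temperature: float = 0.07) -> float:
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))


images = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[0.9, 0.1], [0.1, 0.9]]
captions_swapped = captions_matched[::-1]

print(clip_style_loss(images, captions_matched))  # small: pairs line up
print(clip_style_loss(images, captions_swapped))  # large: pairs are crossed
```

Training drives the loss down by pulling each image toward its own caption and away from the others in the batch; the text half of the resulting space is what a diffusion model later consumes as conditioning.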

With conditioning in hand, Section 34.2 opens the body of the system: the three-part Stable Diffusion architecture of variational autoencoder, denoising U-Net (or its transformer successor, the DiT), and the cross-attention layers that let text reach into the image. We trace a single generation end to end in code so the data flow is concrete. Section 34.3 then uses that anatomy to read the model landscape: how DALL-E 2 and 3, Imagen, the Stable Diffusion line through SDXL and SD3, Midjourney, and FLUX each instantiate the same template with different choices, and which choice explains which strength. Section 34.4 steps outside diffusion entirely to the autoregressive and masked-token generators (Parti, MUSE, and the image branch of modern multimodal models) that treat an image as a sequence of discrete tokens and generate it the way a language model generates text.
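
The phrase "text reaches into the image" has a precise meaning: image positions supply the queries, text tokens supply the keys and values. Below is a minimal single-head sketch in plain Python. It omits the learned query, key, and value projections and the multiple heads that real layers use, and the toy vectors are invented for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries: List[Vector],
                    text_keys: List[Vector],
                    text_values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: queries from the image, keys/values from text.

    Returns one output vector per image position, each a weighted mix of
    the text value vectors.
    """
    d = len(text_keys[0])
    outputs = []
    for q in image_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in text_keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
    return outputs


# Two image positions, two text tokens (toy numbers).
queries = [[10.0, 0.0], [0.0, 10.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

for row in cross_attention(queries, keys, values):
    print([round(x, 3) for x in row])
```

Each image position ends up dominated by the text token whose key it matches best, so different regions of the latent can listen to different words of the prompt.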

The final two sections turn the systems into instruments. Section 34.5 is a practitioner's guide to prompt engineering: how prompt structure, weighting, and negative prompts actually move the conditioning, why guidance scale trades fidelity against diversity, and how to debug a prompt that will not cooperate. Section 34.6 covers fine-tuning, from full retraining down to the lightweight personalization methods (LoRA, DreamBooth, textual inversion) that let you teach a model a new face, object, or style on a single GPU in an afternoon. This is the transfer-learning thread of Chapter 21 reaching its generative conclusion.
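
A little arithmetic shows why low-rank adapters fit on a single GPU. The sketch below counts trainable parameters for one weight matrix and applies a LoRA-style update, y = W x + (alpha / r) B A x, using plain lists. The layer size, rank, and function names are hypothetical choices for illustration, not values taken from any particular model.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (parameters in the full matrix, parameters in the adapter)."""
    full = d_out * d_in
    adapter = rank * d_in + d_out * rank  # A is rank x d_in, B is d_out x rank
    return full, adapter


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, w: Matrix, a: Matrix, b: Matrix,
                 alpha: float) -> Vector:
    # The frozen base path plus the scaled low-rank correction.
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


full, adapter = lora_param_counts(d_out=1024, d_in=1024, rank=8)
print(full, adapter, f"{adapter / full:.2%}")

# With B initialised to zero, the adapted layer starts out identical to the base.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]          # rank 1
b_zero = [[0.0], [0.0]]
print(lora_forward([1.0, 2.0], w, a, b_zero, alpha=2.0))
```

For this hypothetical 1024 by 1024 layer at rank 8, the adapter trains 16,384 values against 1,048,576 in the full matrix, about 1.6 percent, and the zero-initialised B means training begins from the unmodified model.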

The recurring lesson is modularity. A text-to-image system is a composition of parts you already understand: the contrastive representation learning of Chapter 25, the cross-attention of Chapter 22, the latent autoencoder of Chapter 31, and the latent diffusion of Chapter 33. Understanding the seams between the parts is what lets you reason about a system you have never used, fix a failure you have never seen, and choose the right model before you have wasted an afternoon on the wrong one.

Prerequisites

This chapter assumes Chapter 33: Diffusion Models in full, especially classifier-free guidance and latent diffusion, which are the substrate for everything here. You need the self- and cross-attention of Chapter 22, since cross-attention is the mechanism that injects text into the image. The contrastive and self-supervised learning of Chapter 25 explains how CLIP was trained, and the autoencoders and VAEs of Chapter 31 supply the compression stage behind the Stable Diffusion latent space. The transfer learning and fine-tuning recipes of Chapter 21 set up the customization methods of Section 34.6. Comfortable PyTorch and a working diffusers install are assumed; a GPU with 8 GB or more of memory makes the code examples runnable.

What's Next?

This chapter shows how a prompt produces an image; the next shows how to seize control of the result after the prompt runs out of expressive power. Chapter 35: Controllable Generation & Image Editing takes the conditioned diffusion model of Section 34.2 and adds spatial control: ControlNet and adapters that condition on edge maps, depth, and pose; inpainting and outpainting that edit a region while preserving the rest; and latent inversion that reconstructs the noise behind a real photograph so you can edit it. The fine-tuning methods of Section 34.6 reappear there as the substrate for personalized editing, and Chapter 37 returns to ask how we measure whether any of these systems are actually good. The assembly line you learn here is the chassis every later generative system is bolted onto.

Bibliography & Further Reading

Foundational Papers

Radford, A. et al. "Learning Transferable Visual Models From Natural Language Supervision (CLIP)." ICML (2021). arXiv:2103.00020
CLIP, the contrastive image-text model of Section 34.1 that aligned vision and language in a shared embedding space and became the default text encoder for the first generation of text-to-image diffusion systems.
Rombach, R. et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR (2022). arXiv:2112.10752
The latent diffusion paper behind Stable Diffusion. It introduced the VAE-plus-U-Net-plus-cross-attention architecture dissected in Section 34.2 and made open-weight text-to-image generation practical on consumer hardware.
Ramesh, A. et al. "Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP)." (2022). arXiv:2204.06125
DALL-E 2's two-stage design from Section 34.3: a prior that maps text to a CLIP image embedding, then a decoder that diffuses an image from it. The clearest example of using CLIP's image side as the generation target.
Saharia, C. et al. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)." NeurIPS (2022). arXiv:2205.11487
Imagen, which showed that a large frozen T5 text encoder beats CLIP-scale encoders for prompt following, the encoder argument that drives Section 34.1 and the model comparison in Section 34.3.

Recent Research (2023-2026)

Tschannen, M. et al. "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features." (2025). arXiv:2502.14786
SigLIP 2, the stronger sigmoid-loss image-text encoder of the Section 34.1 frontier, now a common drop-in replacement for CLIP as the conditioning encoder and a direct illustration of the encoder-ceiling argument that opens the chapter.
Podell, D. et al. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis." ICLR (2024). arXiv:2307.01952
SDXL, the larger Stable Diffusion with a dual text encoder and a refinement stage, the open-model baseline of Section 34.3 and the most common fine-tuning target in Section 34.6.
Esser, P. et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3)." ICML (2024). arXiv:2403.03206
Stable Diffusion 3, the rectified-flow diffusion transformer with three text encoders (two CLIP plus T5) and the MMDiT block, a centerpiece of the modern-architecture discussion in Sections 34.2 and 34.3.
Betker, J. et al. "Improving Image Generation with Better Captions (DALL-E 3)." (2023). cdn.openai.com/papers/dall-e-3.pdf
DALL-E 3's central finding from Section 34.3: training on highly descriptive synthetic captions dramatically improves prompt following, shifting the bottleneck from architecture to data quality.
Chang, H. et al. "Muse: Text-To-Image Generation via Masked Generative Transformers." ICML (2023). arXiv:2301.00704
MUSE, the masked-token image generator of Section 34.4 that produces images by parallel unmasking of discrete codes; the paper reports substantially faster sampling than comparable pixel-space diffusion and autoregressive models.
Yu, J. et al. "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti)." TMLR (2022). arXiv:2206.10789
Parti, the autoregressive sequence-of-tokens generator of Section 34.4 that treats image synthesis as a translation problem and scales to 20 billion parameters.
Tian, K. et al. "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR)." NeurIPS (2024), Best Paper Award. arXiv:2404.02905
VAR, the next-scale-prediction model behind the Section 34.4 frontier: it replaces raster-order token prediction with a coarse-to-fine scale order and was the first autoregressive image model to surpass a diffusion transformer on ImageNet, which is why the 2023 "diffusion has won" consensus is revisited.
Li, T. et al. "Autoregressive Image Generation without Vector Quantization (MAR)." NeurIPS (2024). arXiv:2406.11838
MAR, the continuous-token hybrid of Section 34.4 that drops the discrete codebook and models each token with a small per-token diffusion head, directly fusing the autoregressive and diffusion paths this chapter contrasts.
Black Forest Labs. "FLUX.1." (2024). github.com/black-forest-labs/flux
FLUX.1, the open rectified-flow transformer from Black Forest Labs, founded by researchers who had worked on the original Stable Diffusion. On its August 2024 release it set the open-weight bar of the Section 34.3 landscape and remains a popular LoRA fine-tuning base; FLUX.2 followed in November 2025.
Google. "Introducing Gemini 2.5 Flash Image (Nano Banana)." (2025). developers.googleblog.com
A 2025 example of the closed-side convergence in Sections 34.3 and 34.4: image generation and conversational editing folded into a multimodal language model, alongside OpenAI's native 4o image generation from March 2025. Its successor Nano Banana Pro, built on Gemini 3 Pro (November 2025), extended the line to legible in-image text and 2K to 4K output.

Customization & Fine-Tuning

Ruiz, N. et al. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." CVPR (2023). arXiv:2208.12242
DreamBooth, the subject-binding method of Section 34.6 that ties a specific subject to a rare token using a prior-preservation loss to avoid catastrophic forgetting.
Gal, R. et al. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." ICLR (2023). arXiv:2208.01618
Textual inversion from Section 34.6: learn a single new embedding vector for a concept while freezing the entire model, the lightest-weight customization method available.
Hu, E. et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR (2022). arXiv:2106.09685
LoRA, the low-rank adapter method of Section 34.6. Originally for language models, it is now the dominant way to fine-tune diffusion models for new styles and subjects on a single GPU.

Tools & Libraries

Hugging Face diffusers. github.com/huggingface/diffusers
The reference library for every code example in this chapter: pipelines for Stable Diffusion, SDXL, SD3, and FLUX, plus the training scripts for DreamBooth, textual inversion, and LoRA.
OpenAI CLIP repository. github.com/openai/CLIP
The original CLIP code and pretrained weights used in Section 34.1 to embed images and text into the shared space that the first diffusion conditioners consumed.

Books & Explainers

Prince, S. J. D. Understanding Deep Learning. MIT Press (2023). udlbook.github.io/udlbook
Its diffusion and transformer chapters give the cleanest textbook account of the conditioning and cross-attention machinery this chapter builds on. Free online.
Hugging Face Diffusion Models Course. github.com/huggingface/diffusion-models-class
Hands-on notebooks covering conditioning, Stable Diffusion internals, and fine-tuning, mirroring the build-and-shortcut structure of Sections 34.2 and 34.6.