"They retired me from the front page and gave the headline job to a model that takes fifty steps to do what I do in one. I did not argue. I simply moved to the part of the building where the deadline is measured in milliseconds, and here, I am still the fastest brush in the world."
A GAN That Read Its Own Obituary and Disagreed
Diffusion took the text-to-image crown by being easier to train and scale, but a GAN generates an image in a single forward pass, and wherever speed is the binding constraint, GANs still win. This closing section is an honest 2026 scorecard. It explains precisely why diffusion overtook GANs for open-ended generation (training stability and mode coverage, not raw image quality), then maps the territory GANs still own: real-time and interactive synthesis, super-resolution and restoration, the adversarial loss living inside other models including diffusion itself, and the large-scale text-to-image GANs that closed much of the quality gap while keeping the speed advantage. The chapter's lessons outlived its peak.
You have now built the GAN from its minimax core (Section 32.1), stabilized it (Section 32.2), scaled it to photoreal faces (Section 32.3), conditioned it for translation (Section 32.4), and run it backward for editing (Section 32.5). The honest question to end on is where this family stands now that diffusion models dominate the headlines. The answer is not "GANs are obsolete". It is more interesting: GANs lost the general-purpose generation race for specific, understandable reasons, and won a set of niches where their one structural advantage, single-step generation, is decisive.
1. Why Diffusion Won the Crown (Intermediate)
By 2022 the leading text-to-image systems were diffusion models, not GANs, and the reasons are the mirror image of everything in Section 32.2. A diffusion model (Chapter 33) trains a single network with a stable regression loss to reverse a noising process; there is no adversary, no balance to maintain, no mode collapse. That training stability is what let diffusion scale to billions of images and billions of parameters where GANs of the same era buckled. Diffusion also covers modes naturally, because its likelihood-based objective penalizes ignoring parts of the data, exactly the coverage that GANs fight for.
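The contrast in training objectives can be made concrete by placing the two side by side. The sketch below uses the original GAN minimax game of Section 32.1 and, as an assumption about notation (Chapter 33 may use different symbols), the standard noise-prediction loss in which a network $\epsilon_\theta$ predicts the noise $\epsilon$ that was added to a clean image $x_0$ to produce $x_t$:

$$
\underbrace{\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]}_{\text{GAN: two networks, a saddle point}}
\qquad \text{versus} \qquad
\underbrace{\min_\theta \; \mathbb{E}_{x_0,\, \epsilon,\, t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\big]}_{\text{diffusion: one network, a regression}}
$$

The left-hand problem has a moving target, because the discriminator changes as the generator improves. The right-hand problem is an ordinary least-squares fit whose target is fixed by the noising process.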
The cost is speed: classical diffusion needs tens to hundreds of network evaluations to produce one image, against a GAN's single forward pass. Put a number on it and the moat becomes visceral: a standard fifty-step diffusion sampler runs its U-Net fifty times to make one picture, so if one U-Net evaluation costs about as much as one generator pass, the GAN finishes the same image on identical hardware in roughly one-fiftieth the wall-clock time. That is not a tuning difference; it is the gap between thirty frames per second and roughly 0.6, which is precisely why no fifty-step model can drive a live video filter and a one-pass GAN can.
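The frame-rate arithmetic is short enough to check directly. This is a minimal sketch, not a benchmark: the function name `frames_per_second` and the per-pass cost `PASS_MS` are illustrative assumptions, and it treats every network pass as equally expensive.

```python
def frames_per_second(pass_latency_ms: float, steps: int = 1) -> float:
    """Throughput when each image needs `steps` sequential network passes."""
    if pass_latency_ms <= 0 or steps < 1:
        raise ValueError("latency must be positive and steps must be >= 1")
    return 1000.0 / (pass_latency_ms * steps)


# Assumed, illustrative cost of one forward pass; this is not a measurement.
PASS_MS = 1000.0 / 30.0  # a pass that, on its own, would sustain 30 fps

gan_fps = frames_per_second(PASS_MS, steps=1)         # one-pass generator
diffusion_fps = frames_per_second(PASS_MS, steps=50)  # fifty-step sampler
turbo_fps = frames_per_second(PASS_MS, steps=4)       # four-step distilled sampler

print(f"GAN, 1 pass:         {gan_fps:5.1f} fps")
print(f"Diffusion, 50 steps: {diffusion_fps:5.1f} fps")
print(f"Distilled, 4 steps:  {turbo_fps:5.1f} fps")
```

Under that equal-cost assumption the three cases come out at about 30, 0.6, and 7.5 frames per second. Distillation closes most of the gap, but only the single pass reaches the live-video budget.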
It is worth being precise about what diffusion did not win on: at matched scale and compute, a well-trained GAN's individual samples are competitive in fidelity. The crown changed hands because of trainability and coverage at scale, not because GAN images are inherently worse. Table 32.6.1 lays out the comparison.
| Axis | GAN | Diffusion |
|---|---|---|
| Generation speed | One forward pass (milliseconds) | Many steps, unless distilled |
| Training stability | Delicate two-player balance | Stable single regression loss |
| Mode coverage | Fights mode collapse | Covers modes naturally |
| Single-sample fidelity | Competitive at matched scale | Competitive; state of the art at scale |
| Latent space for editing | Compact, disentangled (W) | Editable via inversion, less compact |
| Likelihood / density | None (implicit model) | Available (ELBO / score) |
Every GAN advantage that survives traces back to one fact: a GAN produces an image in a single forward pass, while a diffusion model integrates an iterative process. When latency is the binding constraint (real-time video, interactive editing, on-device generation, mobile super-resolution), the GAN's single pass is not a convenience but a requirement. This is why the diffusion community spent 2023 to 2025 trying to make diffusion fast, and why the tool they reached for, adversarial distillation, is the GAN loss of Section 32.1 bolted onto a diffusion model. The two families are converging, and the bridge is the adversarial objective this chapter began with.
2. Where GANs Still Win (Beginner)
Four niches remain firmly GAN territory, and a practitioner in 2026 should reach for a GAN in each.
Real-time and interactive synthesis. Anything that must respond at video frame rates (live avatars, real-time face reenactment, interactive editing like the DragGAN of Section 32.5, game-asset generation in the loop) needs the single forward pass. A fifty-step diffusion model cannot hit 60 frames per second on consumer hardware; a GAN can.
Super-resolution and restoration. The adversarial loss is exceptional at hallucinating plausible high-frequency detail, which is exactly what super-resolution needs. Real-ESRGAN remains a widely deployed, fast upscaler in 2026, and GAN-based restoration (deblurring, denoising, face restoration with GFPGAN) is standard in photo tools because it runs in one pass and produces crisp results. This is the learned, scaled-up descendant of the classical super-resolution and restoration of Chapter 7 and the efficient edge super-resolution of Chapter 28.
Adversarial losses inside other models. The most pervasive role of the GAN in 2026 is as a component, not a standalone generator. The autoencoder that compresses images into the latent space of Stable Diffusion (the VQGAN-derived AutoencoderKL of Chapter 31) is trained with an adversarial loss to keep its reconstructions sharp. Neural video codecs, learned image compression, and many restoration networks use a discriminator as a perceptual loss. The adversarial idea quietly does its work inside dozens of systems that are not called GANs.
Domain-specific and data-scarce generation. When you have a narrow domain and limited data (medical images, a specific product category, a particular art style), a GAN often trains faster and to competitive quality with far less compute than fine-tuning a giant diffusion model, and its compact latent makes targeted editing easy.
3. The GANs That Fought Back (Advanced)
The GAN community did not concede text-to-image without a fight, and the counterattack is instructive. StyleGAN-T (Sauer et al., 2023) and GigaGAN (Kang et al., 2023) are large-scale, text-conditioned GANs that generate at high resolution in a single forward pass, closing much of the quality gap to diffusion while remaining far faster. GigaGAN, a billion-parameter model, synthesizes a $512 \times 512$ image in a fraction of a second and doubles as a strong, fast super-resolution upsampler; StyleGAN-T brought the style-based generator of Section 32.3 into the text-to-image regime. They are not the market leaders, but they are an existence proof: the speed advantage and competitive quality can coexist at scale, and the technique that made it possible (large-scale training with the spectral-normalization and regularization toolkit of Section 32.2) is the same one that stabilizes the discriminators inside diffusion distillation.
It is worth writing the adversarial-distillation objective down, because it shows the GAN loss of Section 32.1 living literally inside a diffusion pipeline. Adversarial diffusion distillation trains a few-step student generator $G_\theta$ with two terms: a distillation term that pulls the student's one-step denoised prediction toward a frozen diffusion teacher $T$, and an adversarial term from a discriminator $D_\psi$ that must tell the student's samples from real images,

$$\mathcal{L}(\theta) = \mathbb{E}_{x_t}\Big[\, d\big(G_\theta(x_t),\, T(x_t)\big) + \lambda\, \ell_{\text{adv}}\big(D_\psi(G_\theta(x_t))\big) \Big],$$

where $x_t$ is a noised input, $d$ is a distance in pixel or feature space, $\ell_{\text{adv}}$ is the generator's GAN loss (commonly the hinge loss of Section 32.2), $\lambda$ sets the relative weight of the two terms, and $D_\psi$ is trained adversarially alongside. The distillation term transfers the teacher's competence in a handful of steps; the adversarial term is what restores the crispness that pure distillation blurs away. That single $\lambda$-weighted sum is why SDXL-Turbo generates in one to four steps without looking washed out (SD3-Turbo builds on the same adversarial idea in latent space), and it is the cleanest example of the chapter's thesis that the adversarial signal is now a component other models borrow rather than a standalone architecture.
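The two-term objective described above can be sketched numerically. This is a toy illustration under stated assumptions, not the training code of any released model: the distance $d$ is taken to be mean squared error, the weight `lam=0.5` is arbitrary, $\lambda$ is placed on the adversarial term, and small NumPy arrays stand in for images and discriminator logits. The function names are hypothetical.

```python
import numpy as np


def hinge_d_loss(d_real, d_fake):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))


def hinge_g_loss(d_fake):
    """Generator hinge loss: raise the discriminator's score on student samples."""
    return -np.mean(d_fake)


def student_loss(student_out, teacher_out, d_fake, lam=0.5):
    """Distillation distance (MSE here, an assumption) plus a lambda-weighted adversarial term."""
    distill = np.mean((student_out - teacher_out) ** 2)
    return distill + lam * hinge_g_loss(d_fake)


# Toy arrays standing in for a batch of 4 images and the discriminator's logits.
rng = np.random.default_rng(0)
teacher_out = rng.normal(size=(4, 3, 8, 8))   # frozen teacher's denoised prediction
student_out = teacher_out + 0.1               # student is slightly off the teacher
d_real = np.array([1.5, 0.5, 2.0, 1.0])       # D's scores on real images
d_fake = np.array([-1.5, -0.5, 0.0, -2.0])    # D's scores on student samples

loss_g = student_loss(student_out, teacher_out, d_fake, lam=0.5)
loss_d = hinge_d_loss(d_real, d_fake)
print(f"student loss: {loss_g:.3f}, discriminator loss: {loss_d:.3f}")
```

In a real training loop the two losses would be minimized in alternation, with gradients flowing only into the student for `loss_g` and only into the discriminator for `loss_d`; the teacher stays frozen throughout.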
The website thispersondoesnotexist.com, which serves an endlessly refreshing stream of StyleGAN faces of people who have never existed, went live in 2019 and became the single most effective public explainer of generative models ever made: no math, just an unsettling demonstration. It is still running in 2026 on the StyleGAN lineage of this chapter, quietly outlasting several waves of "GANs are dead" commentary, one fictional face at a time.
A video-conferencing company in 2024 wanted a "background replacement plus relighting" feature that re-rendered the user's face to match a chosen virtual lighting environment, live, at 30 frames per second, on laptop hardware with no dedicated GPU. The team's first instinct was a fine-tuned diffusion editing model, which produced beautiful results in offline tests, but even an aggressively distilled four-step diffusion model could not hit the frame budget on the target hardware, and dropping frames made the call feel broken. They switched to a StyleGAN-based reenactment-and-relighting generator: one forward pass per frame, comfortably real-time on the CPU and integrated GPU, with quality that, while a notch below the diffusion version in still frames, was indistinguishable in motion. The decision was made entirely by the latency budget, not by image quality in isolation. The general rule the team adopted, and a fitting way to close this chapter: pick diffusion when quality and flexibility dominate and you can afford the steps; pick a GAN when a millisecond budget or a real-time loop makes the single forward pass non-negotiable.
The single-pass speed moat of the Key Insight is exactly what makes a portfolio-worthy restoration tool buildable in an afternoon. Wrap a pretrained Real-ESRGAN upscaler and the GFPGAN face restorer of this section behind a tiny drag-and-drop interface (a small Gradio or Streamlit app) that takes a low-resolution or degraded photo and returns a crisp one, then put a number on the GAN's advantage: print the per-image wall-clock latency on screen and check how close it stays to the tens of milliseconds of a single forward pass, the timing that Table 32.6.1 contrasts with diffusion's many steps (expect the figure to grow with input resolution and to be much higher on a CPU than on a GPU). This is an advanced-leaning but achievable build (roughly three to four hours, no training, just inference plumbing) that exercises the chapter's central claim that adversarial losses excel at hallucinating plausible high-frequency detail. The honest analysis step, the one that makes it interview-ready, is to find an image where that hallucination is a feature (a blurry vacation photo made sharp) and one where it is a bug (an upscaled document or face where the GAN invents detail that was never there), the feature-versus-bug distinction Exercise 32.6.3 asks you to defend. It complements the from-scratch training lab of the chapter index by showing the deployment side: you ship the result of adversarial training rather than running it.
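The latency readout for such a tool needs nothing beyond the standard library. The sketch below is one way to do it, with assumptions stated plainly: `median_latency_ms` is a hypothetical helper, and `nearest_neighbor_x2` is a trivial stand-in so the example runs anywhere. In the real app you would pass the function that calls the pretrained upscaler instead.

```python
import statistics
import time


def median_latency_ms(fn, sample, warmup=3, runs=20):
    """Median wall-clock time of fn(sample) in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn(sample)  # warm-up: caches, lazy initialization, JIT compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def nearest_neighbor_x2(image):
    """Hypothetical stand-in for the real upscaler: doubles a 2-D list of pixels."""
    return [[px for px in row for _ in (0, 1)] for row in image for _ in (0, 1)]


tiny = [[0, 1], [2, 3]]
upscaled = nearest_neighbor_x2(tiny)
latency = median_latency_ms(nearest_neighbor_x2, tiny)
print(f"{latency:.3f} ms per image")
```

The median is used rather than the mean so that one slow outlier run does not distort the number shown on screen. When timing a model on a GPU, kernels launch asynchronously, so the timed function should wait for the device to finish (in PyTorch, by calling `torch.cuda.synchronize()`) before the clock stops.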
The production GANs of this section are a few lines to use. Real-ESRGAN upscales an image with a pretrained model in roughly five lines via its RealESRGANer wrapper; GFPGAN restores a degraded face in a single call; and the diffusers library exposes the adversarially-trained VAE behind Stable Diffusion as AutoencoderKL, so the discriminator that sharpened its training is already baked into the weights you load. For fast diffusion that uses the adversarial-distillation idea, diffusers loads SDXL-Turbo and Stable Diffusion 3.5 Large Turbo through the same pipeline API, generating in one to four steps where the base model needs fifty. You almost never implement the adversarial training of these systems; you load the result of it.
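As one concrete instance, few-step generation with SDXL-Turbo through diffusers might look like the sketch below. It assumes a CUDA GPU, that `torch` and `diffusers` are installed, and that the `stabilityai/sdxl-turbo` checkpoint can be downloaded. The wrapper name `generate_turbo` and its step check are illustrative, not part of the library.

```python
def generate_turbo(prompt: str, steps: int = 1, device: str = "cuda"):
    """Generate one image with SDXL-Turbo in `steps` sampling steps (1 to 4)."""
    if not 1 <= steps <= 4:
        raise ValueError("this sketch expects 1 to 4 sampling steps")

    # Imported lazily so this file can be loaded without torch or diffusers installed.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to(device)

    # The distilled model runs without classifier-free guidance, so it is switched off.
    result = pipe(prompt=prompt, num_inference_steps=steps, guidance_scale=0.0)
    return result.images[0]  # a PIL image


# Example call (downloads the weights on first use):
# generate_turbo("a watercolor fox reading a newspaper").save("fox.png")
```

For repeated calls you would build the pipeline once and reuse it; it is constructed inside the function here only to keep the sketch self-contained.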
The defining 2023 to 2026 story is convergence rather than replacement. Adversarial diffusion distillation (the ADD loss behind SDXL-Turbo, Sauer et al., 2023, and reused for Stable Diffusion 3.5 Large Turbo, October 2024) and its successor latent adversarial diffusion distillation (LADD, behind SD3-Turbo, Sauer et al., SIGGRAPH Asia 2024) use a GAN discriminator to compress a fifty-step diffusion teacher into a one-to-four-step student, putting the GAN loss of Section 32.1 at the center of fast diffusion. R3GAN (Huang et al., NeurIPS 2024) argued that modern, well-regularized GANs are stable and competitive after all, reopening the from-scratch GAN as a serious option, and the 2025 GAT (Scalable GANs with Transformers, Hyun et al., 2025, arXiv:2509.24935) pushed that line further by training a purely transformer-based GAN in a compact VAE latent space, reporting single-step class-conditional generation on ImageNet-256 at an FID near 2, on par with the best diffusion and autoregressive models at that resolution. And distribution-matching distillation (DMD, Yin et al., CVPR 2024) added an explicit GAN loss in its improved form (DMD2, NeurIPS 2024) to reach state-of-the-art one-step generation, one more place the adversarial signal turned out to matter. The throughline of this chapter, that learning by competition produces sharpness no fixed loss matches, did not retire with the GAN's headline era; it became a reusable tool the entire generative-vision field now depends on, and you will meet it again throughout Chapter 33 and Chapter 34.
Exercises
Using Table 32.6.1, write a short decision rule (three or four bullet points) a practitioner could follow to choose between a GAN and a diffusion model for a new project. Make sure your rule correctly handles each of these cases: a real-time avatar at 60 frames per second, an offline marketing-image generator where quality is paramount, an on-device photo upscaler for a phone, and an open-ended text-to-image product. Justify each choice in one sentence.
Benchmark the speed advantage directly. Load a pretrained StyleGAN2 and a small pretrained diffusion model (or use SDXL-Turbo for a fast diffusion baseline) and time the generation of a single image on the same hardware, averaging over twenty runs. Report the GAN's single-pass latency against the diffusion model's per-step latency times its step count. Then run SDXL-Turbo at one step and compare again. Discuss how adversarial distillation narrows the gap.
Take a set of low-resolution images and upscale each with a fast GAN upscaler (Real-ESRGAN) and with a diffusion-based upscaler. Compare on three axes: wall-clock time per image, perceptual quality (LPIPS against a high-resolution ground truth if you have one, otherwise a visual judgment), and tendency to hallucinate detail not present in the input. Connect your findings to the section's claim that adversarial losses excel at hallucinating plausible high-frequency detail, and identify a use case where that hallucination is a feature and one where it is a bug.