Part IV: Generative Vision Models
Chapter 32: Generative Adversarial Networks

"I was trained to call everything fake, and for a glorious year I was right. Then my opponent got good, and now I am wrong exactly half the time. They tell me that is success. I tell them it feels like losing in slow motion."

A GAN Discriminator Who Trusts No One

Big Picture

A generative adversarial network learns to draw without ever being told what good looks like; instead it learns from a second network whose only job is to catch it cheating, and the chase between the two produces images sharper than any likelihood objective ever managed. That single idea, learning by competition rather than by reconstruction, is the spine of this chapter. The variational autoencoder of Chapter 31 maximized a likelihood and paid for it with soft, slightly blurry samples; the GAN replaces that objective with a game, and the adversarial pressure manufactures the crisp high-frequency texture that likelihood smooths away. The price is a famously temperamental training process, so this chapter spends as much time on the pathologies (mode collapse, vanishing gradients, oscillation) as on the wins. It then follows the family's eight-year arc from the first convolutional recipe that trained reliably, DCGAN, to the controllable, photoreal latent of StyleGAN; sideways into conditional and image-to-image translation with pix2pix and CycleGAN; and finally into the inversion and editing techniques that turned a trained generator into a programmable image editor. GANs no longer hold the crown for general text-to-image synthesis, but the lessons they left behind, adversarial losses, latent-space geometry, and image-to-image translation, are everywhere in the diffusion era, and in the places where speed matters most they still win outright.

Chapter Overview

For the better part of a decade, when a research paper showed a face that did not exist or a horse that turned into a zebra, the machine behind it was almost always a generative adversarial network. The idea, introduced by Ian Goodfellow and colleagues in 2014, is disarmingly simple and slightly mischievous: instead of writing down a loss that says what a good image looks like, you train a second network to tell real images from generated ones, and you train the first network to fool it. Neither network is ever handed a definition of realism. Realism is whatever the discriminator has not yet learned to distrust, and as the discriminator sharpens, so must the generator. The result is a moving target, a two-player game whose equilibrium, when you reach it, is a generator whose samples are indistinguishable from the training distribution.

That elegance comes with a reputation for instability, and the reputation is earned. A GAN has no single loss to watch go down; it has two losses locked in tension, and the healthy state is a delicate balance rather than a minimum. Section 32.2 is the chapter's honest middle, cataloguing the ways training goes wrong and the fixes the field invented in response: the Wasserstein reformulation that gave the loss a meaningful magnitude, gradient penalties and spectral normalization that tamed the discriminator, and the diagnostic habits that separate a converging run from one quietly collapsing onto a single output.

The chapter is organized as a story in three movements. The first two sections lay the foundation: Section 32.1 derives the adversarial game itself, from the minimax objective to the optimal discriminator to the Jensen-Shannon divergence it implicitly minimizes, and builds a working GAN from scratch. Section 32.2 confronts the training pathologies head on. The middle two sections trace the architecture lineage and the conditional extensions: Section 32.3 walks the road from DCGAN's stabilizing convolutional recipe through progressive growing to StyleGAN's style-based generator, and Section 32.4 covers conditioning, paired translation with pix2pix, and the cycle-consistency trick that made unpaired translation possible.

The final two sections are about control and consequence. Section 32.5 shows how to run a trained generator backward: given a real photograph, find the latent code that produces it, then edit that code to change a smile, an age, or a pose. This is where the latent space you have been building since Chapter 30 becomes a steering wheel. Section 32.6 closes with a clear-eyed account of where GANs stand in 2026: dethroned for open-ended text-to-image synthesis by diffusion, but still the right tool when you need a single fast forward pass, a learnable perceptual loss inside another model, or real-time interactive synthesis.
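The "run the generator backward" idea can be sketched in a few lines. This is a minimal illustration, not the method of Section 32.5: toy_G is a hypothetical stand-in for a trained generator (a small frozen random network), the target is an image known to lie in its range, and the invert helper with its step count and learning rate is an assumption chosen for the toy.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a trained generator: any frozen module mapping z to an image.
toy_G = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 64), nn.Tanh())
for p in toy_G.parameters():
    p.requires_grad_(False)                 # inversion never changes the generator

def invert(generator, target, z_dim, steps=300, lr=0.05):
    """Optimize a latent code so that generator(z) reconstructs target."""
    z = torch.zeros(1, z_dim, requires_grad=True)   # the only trainable tensor
    opt = torch.optim.Adam([z], lr=lr)
    history = []
    for _ in range(steps):
        opt.zero_grad()
        loss = ((generator(z) - target) ** 2).mean()  # pixel-space reconstruction error
        loss.backward()
        opt.step()
        history.append(loss.item())
    return z.detach(), history

target = toy_G(torch.randn(1, 8))           # an "image" the generator can produce
z_hat, history = invert(toy_G, target, z_dim=8)
print(f"reconstruction MSE: {history[0]:.4f} -> {history[-1]:.4f}")
```

Once a code such as z_hat is found, an edit is a small move in latent space followed by another forward pass through the generator. Real photographs are harder than this toy because they are usually not exactly in the generator's range.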

The unifying thread is the latent space again, met as a continuous code in the VAE and now reshaped by an adversarial objective into something with a different and often more useful geometry. You will see that the GAN latent is the same kind of object as the VAE latent, a low-dimensional handle on the manifold of natural images, but earned a different way, and that the adversarial loss which carves it out has outlived the architecture that introduced it. Adversarial training now lives inside the autoencoders of latent diffusion (Chapter 33), inside super-resolution and restoration networks, and inside the perceptual metrics of Chapter 37.

Prerequisites

You should have read Chapter 30: Foundations of Generative Modeling for the framing of modeling $p(x)$, latent variables, and sampling, and Chapter 31: Autoencoders & Variational Autoencoders for the latent-space vocabulary and the contrast between likelihood-based and implicit generators that this chapter sharpens. The networks here are convolutional generators and discriminators built with the PyTorch mechanics of Chapter 18 and the convolutional and transposed-convolutional layers of Chapter 19. On the math side you need expectation, the Gaussian and uniform distributions, and a working comfort with KL and Jensen-Shannon divergence as developed in Chapter 30; the Wasserstein distance of Section 32.2 is built up from scratch. Familiarity with perceptual and distribution metrics (PSNR, SSIM from Chapter 1, and the Fréchet Inception Distance (FID) preview from Chapter 37) will help you read the evaluation discussions.
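As a quick self-check on the divergence prerequisites, the sketch below verifies numerically the standard identity that Section 32.1 derives: with the optimal discriminator $D^*(x) = p_{data}(x) / (p_{data}(x) + p_g(x))$, the value of the game equals $-\log 4 + 2\,\mathrm{JSD}(p_{data} \,\|\, p_g)$. The two discrete distributions are made-up toy numbers and the function names are ours.

```python
import math

def gan_value_at_optimal_d(p_data, p_gen):
    """Game value V(D*, G) for discrete distributions, using D* = p / (p + q)."""
    value = 0.0
    for p, q in zip(p_data, p_gen):
        if p + q == 0:
            continue                          # outcome has no mass under either model
        d_star = p / (p + q)                  # optimal discriminator at this outcome
        if p > 0:
            value += p * math.log(d_star)
        if q > 0:
            value += q * math.log(1.0 - d_star)
    return value

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.5, 0.3, 0.2]                      # toy "real" distribution (assumption)
p_gen = [0.1, 0.3, 0.6]                       # toy generator distribution (assumption)

lhs = gan_value_at_optimal_d(p_data, p_gen)
rhs = -math.log(4) + 2 * js_divergence(p_data, p_gen)
matched = gan_value_at_optimal_d(p_data, p_data)
print(f"V(D*, G) = {lhs:.6f}   -log 4 + 2 JSD = {rhs:.6f}")
print(f"perfect generator: V = {matched:.6f}   -log 4 = {-math.log(4):.6f}")
```

When the generator matches the data exactly, the divergence term vanishes and the value bottoms out at $-\log 4$, which is why the generator's best move is to close the Jensen-Shannon gap.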

What's Next?

The GAN bought its sharpness with instability: a single adversarial game, balanced on a knife edge, that can collapse without warning. Chapter 33: Diffusion Models takes the opposite bargain. Instead of a two-player game, a diffusion model trains a single network with a stable regression loss to reverse a gradual noising process, trading the GAN's one-shot generation for many small, reliable denoising steps. You will recognize the destination immediately: the iterative denoising that defines diffusion is the learned, scaled-up descendant of the denoising autoencoder you built in Chapter 31 and the classical denoising of Chapter 7. The adversarial loss does not disappear; it reappears as a tool inside diffusion's autoencoder and inside the distillation tricks that make diffusion fast. And the latent-space editing you learn here returns in Chapter 35, where the same find-the-code-then-edit-it idea drives diffusion-based image editing. The game ends; its lessons do not. Before moving on, make the whole chapter concrete in the Hands-On Lab below, where the adversarial game of Section 32.1, the non-saturating loss, the DCGAN recipe of Section 32.3, and the mode-collapse and balance diagnostics of Section 32.2 come together as one small conditional GAN you train, watch, and sample from yourself.

Hands-On Lab: A Conditional GAN You Can Train, Diagnose, and Sample

Duration: about 60 to 90 minutes
Difficulty: Intermediate

Objective

Build a complete conditional DCGAN that learns to draw Fashion-MNIST garments on demand, then instrument it so you can read its health while it trains. You will write the two-player game of Section 32.1 by hand, stabilize it with the convolutional recipe of Section 32.3, and add the two diagnostics that Section 32.2 argues matter most: the discriminator's accuracy on real versus fake batches as a balance gauge, and a per-class coverage count that catches mode collapse the moment it starts. The dataset is small and the network is tiny, so the run finishes on a CPU in well under an hour and every quantity that defines a healthy GAN, the loss tension, the balance, and the diversity, stays visible on one screen.

What You'll Practice

  • Implementing the generator and discriminator and the non-saturating adversarial loss of Section 32.1, where the generator maximizes $\log D(G(z))$ rather than minimizing $\log(1 - D(G(z)))$ to avoid the vanishing gradient.
  • Applying the DCGAN architectural recipe of Section 32.3: transposed convolutions in the generator, strided convolutions in the discriminator, batch normalization, and no fully connected layers.
  • Conditioning both networks on a class label so you can ask the trained generator for a specific garment, the conditional GAN of Section 32.4.
  • Reading discriminator accuracy as the balance diagnostic of Section 32.2: near 100 percent means the discriminator has won and the generator gradient is dying; 50 percent is the theoretical equilibrium, and a healthy run in practice sits moderately above it without running away.
  • Detecting mode collapse with a cheap diversity metric, the failure mode that Section 32.2 warns is invisible in the loss curves alone.

Setup

One scientific-Python stack and one small automatic download; torchvision fetches Fashion-MNIST (about 30 MB) on first run. Everything trains on CPU in roughly ten to twenty minutes for the few epochs the lab needs. Install with:

pip install torch torchvision matplotlib

The whole lab is one short script. Both networks are deliberately small DCGANs so the focus stays on the adversarial loop and its diagnostics rather than on architecture tuning.

Steps

Step 1: Load the data and fix the conditioning

Fetch Fashion-MNIST and scale every pixel to the range $[-1, 1]$ so it matches the $\tanh$ output of the generator you build next. Ten clothing classes give you something concrete to condition on and, later, a way to measure diversity: a healthy generator should be able to produce all ten, a collapsed one will favour a few.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
n_classes, z_dim = 10, 64

tfm = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),            # map [0,1] pixels to [-1,1]
])
# TODO: build a DataLoader over datasets.FashionMNIST(root=".", train=True,
# download=True, transform=tfm) with batch_size=128, shuffle=True.
# The [-1,1] range must match the generator's tanh output in Step 2.
loader = ...
Hint

ds = datasets.FashionMNIST(root=".", train=True, download=True, transform=tfm) then loader = DataLoader(ds, batch_size=128, shuffle=True, drop_last=True). Use drop_last=True so every batch is full size, which keeps the batch-norm statistics in Step 2 stable.

Step 2: Build a conditional DCGAN generator

The generator maps a noise vector plus a class embedding to a 28 by 28 image through transposed convolutions, the upsampling recipe of Section 32.3. Batch normalization after each hidden layer and a final $\tanh$ on the output are the DCGAN defaults that first made this kind of network train without diverging. Concatenating a learned label embedding onto the noise is what makes the generator conditional.

import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + n_classes, 128, 7, 1, 0), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),  # 7 -> 14
            nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),                            # 14 -> 28
        )

    def forward(self, z, y):
        # TODO: concatenate z (B, z_dim) with self.label_emb(y) (B, n_classes)
        # along dim=1, reshape to (B, z_dim + n_classes, 1, 1), and pass through
        # self.net. The reshape turns the vector into a 1x1 "image" the
        # transposed convolutions can grow.
        x = ...
        return self.net(x)

G = Generator().to(device)
Hint

x = torch.cat([z, self.label_emb(y)], dim=1).view(z.size(0), -1, 1, 1) then return self.net(x). Check the output shape is (B, 1, 28, 28) with values in $[-1, 1]$; if it is 27 or 29 pixels wide, the kernel, stride, and padding triple on one layer is off.
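The shape check in the hint follows from the transposed-convolution size rule. A minimal sketch, assuming PyTorch's documented formula for nn.ConvTranspose2d with dilation=1 and output_padding=0; the helper name conv_transpose_out is ours, not a library function.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output width of a transposed convolution (dilation=1, output_padding=0)."""
    return (size - 1) * stride - 2 * padding + kernel

# The three (kernel, stride, padding) triples used by the lab's generator.
size = 1
for kernel, stride, padding in [(7, 1, 0), (4, 2, 1), (4, 2, 1)]:
    new_size = conv_transpose_out(size, kernel, stride, padding)
    print(f"{size:>2} -> {new_size:>2}   (kernel={kernel}, stride={stride}, padding={padding})")
    size = new_size

# A mistyped kernel on the last layer is how 27- or 29-pixel outputs appear.
print(conv_transpose_out(14, 3, 2, 1), conv_transpose_out(14, 5, 2, 1))
```

The kernel 4, stride 2, padding 1 triple doubles the width exactly, which is why it appears on both upsampling layers.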

Step 3: Build the matching discriminator

The discriminator mirrors the generator: strided convolutions shrink the image to a single real-versus-fake score. The label is embedded as a learned 28 by 28 map and concatenated as an extra input channel, so the discriminator judges whether the image matches its claimed class. LeakyReLU and the absence of batch norm on the first layer are the DCGAN discriminator conventions of Section 32.3.

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, 28 * 28)
        self.net = nn.Sequential(
            nn.Conv2d(2, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),                       # 28 -> 14
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),# 14 -> 7
            nn.Conv2d(128, 1, 7, 1, 0),                                               # 7 -> 1
        )

    def forward(self, img, y):
        lab = self.label_emb(y).view(-1, 1, 28, 28)    # label as an image channel
        x = torch.cat([img, lab], dim=1)               # (B, 2, 28, 28)
        # TODO: pass x through self.net and return the logit reshaped to (B,).
        # Return the raw logit (no sigmoid); the loss in Step 4 applies it.
        return ...

D = Discriminator().to(device)
Hint

return self.net(x).view(-1). Returning the raw logit lets you use BCEWithLogitsLoss in Step 4, which is numerically safer than a separate sigmoid followed by BCELoss.
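To see why working from the raw logit is safer, the sketch below compares a naive sigmoid-then-log computation with the standard rearrangement $\max(x, 0) - x t + \log(1 + e^{-|x|})$ in plain Python. Both function names are ours, and PyTorch's internal implementation may differ in detail; the point is only that the naive form breaks once the sigmoid rounds to exactly 0 or 1.

```python
import math

def naive_bce(logit, target):
    """Sigmoid first, then log: fails when the probability rounds to 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def stable_bce(logit, target):
    """Same quantity computed from the logit without ever forming log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1.0), (-3.0, 0.0), (40.0, 0.0)]:
    try:
        naive = f"{naive_bce(logit, target):.6f}"
    except ValueError:
        naive = "fails: log(0)"              # sigmoid(40) rounds to exactly 1.0
    print(f"logit {logit:+.1f} target {target:.0f}  "
          f"naive {naive}  stable {stable_bce(logit, target):.6f}")
```

A confident discriminator produces exactly these large-magnitude logits, so the failure case is not exotic in GAN training.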

Step 4: Write the two-player training loop

This is the adversarial game of Section 32.1 made runnable. Each step updates the discriminator to label reals as 1 and fakes as 0, then updates the generator using the non-saturating objective: it asks the discriminator to call its fakes real. Using real labels of 1 for the generator update is exactly the $\log D(G(z))$ trick that keeps the generator gradient strong early in training.

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def step(real, y):
    b = real.size(0)
    ones, zeros = torch.ones(b, device=device), torch.zeros(b, device=device)
    z = torch.randn(b, z_dim, device=device)
    fake = G(z, y)

    # Discriminator: reals -> 1, fakes -> 0 (detach so G is not updated here).
    opt_D.zero_grad()
    loss_D = bce(D(real, y), ones) + bce(D(fake.detach(), y), zeros)
    loss_D.backward(); opt_D.step()

    # TODO: generator update. Recompute D(fake, y) (do NOT detach this time)
    # and push it toward `ones` with bce, the non-saturating loss. The
    # opt_G.zero_grad() call is already in place below; after computing
    # loss_G, call loss_G.backward() and opt_G.step(). Returning the two
    # losses lets Step 5 print the tension between them.
    opt_G.zero_grad()
    loss_G = ...
    return loss_D.item(), loss_G.item()
Hint

loss_G = bce(D(fake, y), ones) then loss_G.backward(); opt_G.step() and return loss_D.item(), loss_G.item(). Pushing the fakes toward ones (not zeros) is the whole point: the generator is rewarded for fooling the discriminator, the non-saturating form of the loss derived in Section 32.1.
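The gradient argument behind the non-saturating loss can be made concrete with a little calculus. Writing the discriminator's logit on a fake as $a$, so that $D(G(z)) = \sigma(a)$, the derivative of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a)$, while the derivative of $-\log \sigma(a)$ is $-(1 - \sigma(a))$. The sketch below, with function names of our own choosing, tabulates both.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)), the original minimax generator loss."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)), the loss that bce(D(fake), ones) computes."""
    return -(1.0 - sigmoid(a))

# Negative logits mean the discriminator confidently rejects the fake.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(f"logit {a:+.1f}  D(G(z)) = {sigmoid(a):.4f}  "
          f"saturating {saturating_grad(a):+.4f}  "
          f"non-saturating {non_saturating_grad(a):+.4f}")
```

Both gradients point the same way, toward a higher logit, but early in training, when the discriminator rejects fakes easily, the saturating form shrinks toward zero while the non-saturating form stays close to one in magnitude.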

Step 5: Add the balance diagnostic and train

A GAN has no single loss to watch fall, so Section 32.2 tells you to watch the balance instead. After each epoch, measure the discriminator's accuracy on a batch of reals and a batch of fakes. Accuracy stuck near 1.0 means the discriminator has crushed the generator and the gradient is vanishing. Accuracy of 0.5 is the theoretical equilibrium; in practice, a run that hovers moderately above it without climbing to 1.0 shows the healthy tension you want.

@torch.no_grad()
def d_accuracy(real, y):
    z = torch.randn(real.size(0), z_dim, device=device)
    fake = G(z, y)
    # TODO: count a real as correct when D(real, y) > 0 and a fake as correct
    # when D(fake, y) < 0 (the logit's sign is the decision boundary). Return
    # the mean of both correctness masks as a single accuracy in [0, 1].
    correct = ...
    return correct

for epoch in range(8):
    for real, y in loader:
        real, y = real.to(device), y.to(device)
        ld, lg = step(real, y)
    acc = d_accuracy(real, y)
    print(f"epoch {epoch}  loss_D {ld:.3f}  loss_G {lg:.3f}  D_acc {acc:.2f}")
Hint

real_ok = (D(real, y) > 0).float(); fake_ok = (D(fake, y) < 0).float(); correct = torch.cat([real_ok, fake_ok]).mean().item(). A healthy run drifts toward roughly 0.6 to 0.8 and stays there; a value pinned at 1.0 for several epochs is the warning sign of Section 32.2 that the discriminator has won.
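Two reference points help when reading the printed losses next to D_acc. The sketch below evaluates the lab's two loss expressions for a single real logit and a single fake logit, using a plain-Python binary cross-entropy on logits. The helper names are ours, and the logit values 0 and ±6 are illustrative choices, not measurements from a run.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy computed directly from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def lab_losses(real_logit, fake_logit):
    """loss_D and loss_G as defined in Step 4, for one real and one fake score."""
    loss_d = bce_with_logits(real_logit, 1.0) + bce_with_logits(fake_logit, 0.0)
    loss_g = bce_with_logits(fake_logit, 1.0)
    return loss_d, loss_g

# Undecided discriminator: every logit is 0, i.e. D outputs 0.5 everywhere.
ld_eq, lg_eq = lab_losses(0.0, 0.0)
# Dominant discriminator: confident on both reals and fakes.
ld_win, lg_win = lab_losses(6.0, -6.0)

print(f"undecided D:  loss_D {ld_eq:.3f}  loss_G {lg_eq:.3f}")
print(f"dominant D:   loss_D {ld_win:.3f}  loss_G {lg_win:.3f}")
```

The undecided case gives loss_D equal to $2 \log 2$ and loss_G equal to $\log 2$. A loss_D falling toward zero while loss_G climbs is the numeric signature of the discriminator winning, and it typically accompanies a D_acc pinned at 1.0.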

Step 6: Sample every class and detect mode collapse

Ask the trained generator for several samples of each of the ten classes, save the grid, then put a number on diversity. A cheap mode-collapse detector is the mean pairwise pixel distance within a class: a generator that has collapsed produces near-identical images, so that distance falls toward zero, the diversity failure that Section 32.2 says the loss curves hide.

import matplotlib.pyplot as plt
from torchvision.utils import make_grid

G.eval()
labels = torch.arange(n_classes, device=device).repeat_interleave(8)  # 8 per class
z = torch.randn(labels.size(0), z_dim, device=device)
with torch.no_grad():
    grid = make_grid((G(z, labels) + 1) / 2, nrow=8)                  # back to [0,1]
plt.imshow(grid.permute(1, 2, 0).cpu()); plt.axis("off")
plt.savefig("cgan_samples.png", dpi=120)

with torch.no_grad():
    imgs = G(z, labels).view(n_classes, 8, -1)
# TODO: for each class, compute the mean pairwise L2 distance between its 8
# samples (torch.pdist on imgs[c] then .mean()), and report the average over
# all classes. A value near zero means the generator is producing copies.
diversity = ...
print(f"mean within-class diversity: {diversity:.3f}")
Hint

diversity = torch.stack([torch.pdist(imgs[c]).mean() for c in range(n_classes)]).mean().item(). With samples normalized to $[-1, 1]$ a healthy generator scores a clearly positive number (often above 5 for these 784-pixel vectors); a collapsed one trends toward zero because every sample in a class is nearly the same picture.
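To calibrate the diversity score, it helps to see the metric on inputs small enough to check by hand. The sketch below reimplements the mean of all distinct pairwise L2 distances (what torch.pdist followed by .mean() computes) in plain Python on made-up four-pixel "images", and prints the largest distance possible between two 784-pixel images in $[-1, 1]$. The function name and toy vectors are ours.

```python
import math
from itertools import combinations

def mean_pairwise_l2(samples):
    """Mean Euclidean distance over all distinct pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Collapsed generator: every sample in the class is the same picture.
collapsed = [[0.5, -0.5, 0.0, 1.0]] * 4
# Diverse generator: samples spread across the [-1, 1] pixel range.
diverse = [[-1.0, -1.0, -1.0, -1.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]

print(f"collapsed: {mean_pairwise_l2(collapsed):.3f}")
print(f"diverse:   {mean_pairwise_l2(diverse):.3f}")

# Each pixel can differ by at most 2, so two 784-pixel images are at most this far apart.
max_distance = 2 * math.sqrt(784)
print(f"upper bound for 28 by 28 images: {max_distance:.1f}")
```

The upper bound gives the score a scale: a within-class diversity of a few units is a small fraction of the maximum, yet still clearly separated from the zero that a collapsed generator produces.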

Expected Output

Two artifacts. First, cgan_samples.png, a ten-row grid where each row holds eight recognizable but distinct examples of one garment class (sneakers, trousers, coats), the visible payoff of the conditioning. Second, a printed per-epoch line where D_acc settles into the healthy 0.6 to 0.8 band rather than pinning at 1.0, and a final within-class diversity comfortably above zero. Exact numbers vary with seed and epoch count; what should hold is legible class-conditional samples, a discriminator accuracy that does not run away to 1.0, and a positive diversity score that confirms the generator did not collapse onto one image per class.

Stretch Goals

  • Induce mode collapse on purpose: raise the generator learning rate to 2e-3 or update the generator several times per discriminator step, then watch D_acc swing and the diversity score crash, the instability of Section 32.2 reproduced in your own run.
  • Swap the binary cross-entropy game for the Wasserstein critic of Section 32.2: drop the sigmoid framing, train the discriminator to maximize the gap between real and fake scores, add a gradient penalty, and compare how much steadier the loss curve becomes.
  • Reuse the trained generator for inversion, the technique of Section 32.5: freeze G, pick a real image, and optimize a latent z to reconstruct it, then nudge the class embedding to morph the garment into a neighbouring category.
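For the Wasserstein stretch goal, the critic's loss can be sketched as below. It is a starting point under stated assumptions, not the implementation of Section 32.2: the function name is ours, critic is any callable returning one raw score per image, and the penalty weight of 10 is the default suggested in the WGAN-GP paper. The toy linear critic exists only so the gradient norm is known in closed form.

```python
import torch

def critic_loss_wgan_gp(critic, real, fake, gp_weight=10.0):
    """Loss for one critic step: widen the real-fake score gap, penalize steep gradients."""
    fake = fake.detach()                                  # the critic step must not update G
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)   # random interpolates
    grads, = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)
    penalty = ((grads.flatten(1).norm(dim=1) - 1) ** 2).mean()     # push gradient norm to 1
    gap = critic(real).mean() - critic(fake).mean()       # the quantity the critic maximizes
    return -gap + gp_weight * penalty, gap.item(), penalty.item()

# Toy check with a linear critic; its gradient is w for every input.
real, fake = torch.ones(3, 1, 2, 2), torch.zeros(3, 1, 2, 2)
w = torch.full((1, 2, 2), 0.5)                            # gradient norm = sqrt(4 * 0.25) = 1
loss, gap, penalty = critic_loss_wgan_gp(lambda x: (x * w).flatten(1).sum(1), real, fake)
print(f"gap {gap:.3f}  penalty {penalty:.3f}  loss {loss.item():.3f}")
```

With the lab's conditional discriminator you would pass something like lambda x: D(x, y) as the critic, and the generator's loss becomes the negated mean critic score on its fakes. The WGAN-GP paper advises against batch normalization in the critic because the penalty is defined per sample, so the BatchNorm2d layer in D would likely need replacing for a faithful comparison.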
Library Shortcut: Reach for a Reference Implementation at Scale

The script above is roughly a hundred lines and exposes the loss, the balance check, and the diversity metric on purpose. When you move from 28 by 28 garments to real photographs, reach instead for the authors' maintained implementations listed in the bibliography: the pytorch-CycleGAN-and-pix2pix repository wraps the conditional and image-to-image games of Section 32.4 behind a single configurable train.py, and the NVIDIA stylegan3 code supplies the style-based generator of Section 32.3 with pretrained weights, so a high-resolution face generator is a clone-and-run rather than a hundred lines of training loop. Build the adversarial loop once by hand to understand the game and its diagnostics; reach for those repositories when the data becomes high-resolution images.

Complete Solution
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.utils import make_grid

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
n_classes, z_dim = 10, 64

# Step 1: data scaled to [-1, 1] to match the generator's tanh output.
tfm = transforms.Compose([transforms.ToTensor(), transforms.Normalize([0.5], [0.5])])
ds = datasets.FashionMNIST(root=".", train=True, download=True, transform=tfm)
loader = DataLoader(ds, batch_size=128, shuffle=True, drop_last=True)

# Step 2: conditional DCGAN generator.
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + n_classes, 128, 7, 1, 0), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, y):
        x = torch.cat([z, self.label_emb(y)], dim=1).view(z.size(0), -1, 1, 1)
        return self.net(x)

# Step 3: matching discriminator with the label as an extra channel.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, 28 * 28)
        self.net = nn.Sequential(
            nn.Conv2d(2, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 1, 7, 1, 0),
        )

    def forward(self, img, y):
        lab = self.label_emb(y).view(-1, 1, 28, 28)
        x = torch.cat([img, lab], dim=1)
        return self.net(x).view(-1)

G, D = Generator().to(device), Discriminator().to(device)

# Step 4: the two-player loop with the non-saturating generator loss.
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def step(real, y):
    b = real.size(0)
    ones, zeros = torch.ones(b, device=device), torch.zeros(b, device=device)
    z = torch.randn(b, z_dim, device=device)
    fake = G(z, y)
    opt_D.zero_grad()
    loss_D = bce(D(real, y), ones) + bce(D(fake.detach(), y), zeros)
    loss_D.backward(); opt_D.step()
    opt_G.zero_grad()
    loss_G = bce(D(fake, y), ones)           # non-saturating: fool the critic
    loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Step 5: balance diagnostic and training.
@torch.no_grad()
def d_accuracy(real, y):
    z = torch.randn(real.size(0), z_dim, device=device)
    fake = G(z, y)
    real_ok = (D(real, y) > 0).float()
    fake_ok = (D(fake, y) < 0).float()
    return torch.cat([real_ok, fake_ok]).mean().item()

for epoch in range(8):
    for real, y in loader:
        real, y = real.to(device), y.to(device)
        ld, lg = step(real, y)
    acc = d_accuracy(real, y)
    print(f"epoch {epoch}  loss_D {ld:.3f}  loss_G {lg:.3f}  D_acc {acc:.2f}")

# Step 6: sample every class and measure within-class diversity.
G.eval()
labels = torch.arange(n_classes, device=device).repeat_interleave(8)
z = torch.randn(labels.size(0), z_dim, device=device)
with torch.no_grad():
    grid = make_grid((G(z, labels) + 1) / 2, nrow=8)
    imgs = G(z, labels).view(n_classes, 8, -1)
plt.imshow(grid.permute(1, 2, 0).cpu()); plt.axis("off")
plt.savefig("cgan_samples.png", dpi=120)
diversity = torch.stack([torch.pdist(imgs[c]).mean() for c in range(n_classes)]).mean().item()
print(f"mean within-class diversity: {diversity:.3f}")

Bibliography & Further Reading

Foundational Papers

Goodfellow, I. et al. "Generative Adversarial Networks." NeurIPS (2014). arXiv:1406.2661
The paper that started it all. Introduces the minimax game, the optimal-discriminator analysis, the Jensen-Shannon connection, and the non-saturating generator loss of Section 32.1. Short, readable, and still the clearest statement of the core idea.
Radford, A., Metz, L. & Chintala, S. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN)." ICLR (2016). arXiv:1511.06434
DCGAN of Section 32.3. The architectural recipe (strided convolutions, batch norm, no fully-connected layers) that first made GAN training reliable, plus the famous latent-arithmetic demonstrations.
Arjovsky, M., Chintala, S. & Bottou, L. "Wasserstein GAN." ICML (2017). arXiv:1701.07875
WGAN of Section 32.2. Diagnoses why the original loss gives vanishing gradients and replaces Jensen-Shannon with the Wasserstein distance, yielding a loss that correlates with sample quality.
Gulrajani, I. et al. "Improved Training of Wasserstein GANs (WGAN-GP)." NeurIPS (2017). arXiv:1704.00028
The gradient penalty of Section 32.2. Replaces WGAN's brittle weight clipping with a penalty on the critic's gradient norm, the most widely used stabilizer of the late 2010s.

Architecture & Method Papers

Karras, T. et al. "Analyzing and Improving the Image Quality of StyleGAN (StyleGAN2)." CVPR (2020). arXiv:1912.04958
StyleGAN and StyleGAN2 of Sections 32.3 and 32.5, building on the progressive-growing schedule of Karras et al. (ICLR 2018, arXiv:1710.10196). The mapping network, the disentangled W latent, and per-layer style injection of the original StyleGAN, refined here with weight demodulation and path-length regularization; the de facto editing backbone of the GAN-inversion literature.
Isola, P. et al. "Image-to-Image Translation with Conditional Adversarial Networks (pix2pix)." CVPR (2017). arXiv:1611.07004
pix2pix and the PatchGAN of Section 32.4. A single conditional-GAN recipe that turns sketches into photos, maps into satellite images, and labels into scenes, given paired training data.
Zhu, J.-Y. et al. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (CycleGAN)." ICCV (2017). arXiv:1703.10593
CycleGAN of Section 32.4. The cycle-consistency loss that learns translation between two domains with no paired examples, the horse-to-zebra and summer-to-winter demonstrations that made the technique famous.
Miyato, T. et al. "Spectral Normalization for Generative Adversarial Networks." ICLR (2018). arXiv:1802.05957
Spectral normalization of Section 32.2. Constrains each discriminator layer's largest singular value to enforce Lipschitz continuity cheaply, now a default in most modern GAN discriminators.

Inversion, Editing & Recent GANs

Richardson, E. et al. "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation (pSp)." CVPR (2021). arXiv:2008.00951
The pixel2style2pixel encoder of Section 32.5, and the related GANSpace direction-discovery method (Härkönen et al., NeurIPS 2020, arXiv:2004.02546). A feed-forward network maps an image directly into StyleGAN's W+ space in a single fast forward pass; GANSpace then finds interpretable editing directions unsupervised via PCA on latent codes.
Kang, M. et al. "Scaling up GANs for Text-to-Image Synthesis (GigaGAN)." CVPR (2023). arXiv:2303.05511
GigaGAN of Section 32.6, alongside StyleGAN-T (Sauer et al., 2023). A billion-parameter text-to-image GAN that synthesizes at high resolution far faster than diffusion in a single forward pass and doubles as a strong, fast super-resolution upsampler, the existence proof that GANs still matter for speed at scale.
Huang, Y. et al. "The GAN is dead; long live the GAN! A Modern GAN Baseline (R3GAN)." NeurIPS (2024). arXiv:2501.05441
R3GAN of Sections 32.2 and 32.6. Argues that with a well-posed regularized relativistic loss and a modern backbone a plain GAN trains stably and competitively, retiring the bag-of-tricks folklore.
Hyun, S., Lee, M. & Heo, J.-P. "Scalable GANs with Transformers (GAT)." (2025). arXiv:2509.24935
GAT of Section 32.6. A purely transformer-based GAN trained in a compact VAE latent space that scales reliably from small to extra-large and reports single-step, class-conditional generation on ImageNet-256 at an FID near 2, evidence that the GAN's speed advantage and competitive quality persist at modern scale.

Tools & Libraries

NVIDIA. stylegan3 official PyTorch implementation. github.com/NVlabs/stylegan3
The reference StyleGAN2/3 code and pretrained weights used by almost every inversion and editing project, the practical backbone behind Sections 32.3 and 32.5.
junyanz. pytorch-CycleGAN-and-pix2pix. github.com/junyanz/pytorch-CycleGAN-and-pix2pix
The original authors' clean, configurable implementation of both pix2pix and CycleGAN, the library shortcut behind Section 32.4.

Datasets & Benchmarks

Liu, Z. et al. "Deep Learning Face Attributes in the Wild (CelebA)." ICCV (2015). mmlab.ie.cuhk.edu.hk/projects/CelebA
The 200,000-image face dataset, with its higher-resolution CelebA-HQ and FFHQ successors, on which most of the face GANs and editing demonstrations in this chapter were trained.
Heusel, M. et al. "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID)." NeurIPS (2017). arXiv:1706.08500
The Fréchet Inception Distance, the standard GAN quality metric used for the diagnostics of Section 32.2 and developed in full in Chapter 37, plus the two time-scale update rule (TTUR) training trick.