Section 32.5: GAN Inversion & Latent-Space Editing

"You handed me a photograph of a stranger and asked which of my dreams it came from. None of them, exactly. But I found a dream close enough that you cannot tell, and then I made her smile by walking three steps north in dream-space."
A Generator Asked to Run in Reverse

Big Picture

A trained generator maps latent codes to images; run it backward, finding the code that produces a given real photo, and the latent space becomes a steering wheel for editing real images. This section is about that reverse map, called GAN inversion. We cover the two ways to find the code (slow per-image optimization and a fast learned encoder), the surprisingly consequential choice of which latent space to invert into (Z, W, or the extended W+), and how to discover semantic directions in that space so that adding a vector makes someone older, adds a smile, or rotates a pose. The recurring tension is editability versus fidelity: the codes that reconstruct a photo most faithfully are often the ones that edit worst.

The latent arithmetic of DCGAN in Section 32.3 hinted that the latent space has semantic structure: directions correspond to meaningful changes. StyleGAN's disentangled $W$ space of Section 32.3 made that structure far cleaner. But all of it operated on generated images, where you already have the code. To edit a real photograph, the one your user actually uploaded, you first have to answer a hard question: which latent code, fed to the generator, produces this exact image? That is GAN inversion, and it is the bridge from "a GAN that dreams up faces" to "a tool that edits your face".

1. The Inversion Problem Intermediate

Formally, given a fixed pretrained generator $G$ and a target image $\mathbf{x}$, inversion seeks a latent code $\mathbf{w}^{*}$ such that $G(\mathbf{w}^{*}) \approx \mathbf{x}$. The "approximately equal" is measured with a reconstruction loss that combines a pixel term and a perceptual term. The perceptual term is the Learned Perceptual Image Patch Similarity (LPIPS) distance, a metric developed in Chapter 37 that compares deep features rather than raw pixels and tracks human judgment far more closely:

$$ \mathbf{w}^{*} \;=\; \arg\min_{\mathbf{w}} \; \lVert G(\mathbf{w}) - \mathbf{x} \rVert_2^2 \;+\; \lambda_{\text{lpips}} \, \mathrm{LPIPS}\big(G(\mathbf{w}), \mathbf{x}\big). $$

There are two ways to solve this, and they trade speed against quality.

Optimization-based inversion treats $\mathbf{w}$ as the variable and runs gradient descent on the loss above, with $G$ frozen. It is accurate but slow: it takes hundreds to thousands of optimization steps per image, and every step is a full forward and backward pass through the generator. Encoder-based inversion trains a separate feed-forward network $E$ that maps an image directly to its latent in one pass, amortizing the optimization across a training set exactly as the VAE encoder of Chapter 31 amortized inference. The pixel2style2pixel (pSp) encoder (Richardson et al., 2021) is the canonical example: a single forward pass yields a $W+$ code in milliseconds instead of minutes. The cost is accuracy: the encoder generalizes across images but rarely matches per-image optimization, so the best systems use the encoder for an instant initialization and then run a few optimization steps to refine it.

# Optimization-based GAN inversion: with the generator frozen, run gradient
# descent on the latent w alone so that G(w) reconstructs a target image,
# scoring the match with a pixel term plus the LPIPS perceptual distance.
import torch
import lpips   # pip install lpips; the standard perceptual metric

def invert_image(G, target, w_init, steps=500, lr=0.05, lambda_lpips=1.0, device="cuda"):
    """Optimization-based inversion: find w so that G(w) reconstructs `target`.
    `target` is an image batch scaled to [-1, 1], the range LPIPS expects."""
    target = target.to(device)
    w = w_init.clone().detach().to(device).requires_grad_(True)   # optimize the latent, not G
    opt = torch.optim.Adam([w], lr=lr)                  # w is the only tensor in the optimizer
    perceptual = lpips.LPIPS(net="vgg").to(device)
    for step in range(steps):
        img = G.synthesis(w)                            # frozen generator forward pass
        pixel = (img - target).pow(2).mean()            # pixel term (mean squared error)
        loss = pixel + lambda_lpips * perceptual(img, target).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()                                      # only w is updated; G's weights never move
    return w.detach()

Code Fragment 1: Optimization-based inversion against a frozen StyleGAN. The latent w is the only learnable tensor; the generator's weights never move. A learned encoder would replace this whole loop with one forward pass w = E(target).
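The encoder-then-refine hybrid described above can be sketched on a toy problem. The snippet is an illustration under loud assumptions, not a StyleGAN pipeline: the "generator" is a fixed random linear map `A`, the "encoder" is a hypothetical stand-in that returns the true code plus noise, and the loss is the pixel term only, so the gradient can be written by hand in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, W_DIM = 32, 4
A = rng.standard_normal((IMG_DIM, W_DIM))   # toy frozen "generator" weights (assumption)

def generate(w):
    """Toy generator: a fixed linear map from latent code to 'image'."""
    return A @ w

def pixel_loss(w, target):
    """Mean squared pixel error between G(w) and the target."""
    return float(np.mean((generate(w) - target) ** 2))

def refine(w_init, target, steps=20, lr=0.1):
    """A few gradient-descent steps on the latent only; A never changes."""
    w = w_init.copy()
    for _ in range(steps):
        residual = generate(w) - target
        grad = 2.0 * A.T @ residual / IMG_DIM   # gradient of the mean squared error w.r.t. w
        w -= lr * grad
    return w

w_true = rng.standard_normal(W_DIM)
target = generate(w_true)                                # an "image" whose code is known
w_encoder = w_true + 0.5 * rng.standard_normal(W_DIM)    # hypothetical imperfect encoder output
w_refined = refine(w_encoder, target)

loss_encoder = pixel_loss(w_encoder, target)
loss_refined = pixel_loss(w_refined, target)
print(f"encoder only: {loss_encoder:.4f}  after refinement: {loss_refined:.4f}")
```

The design point carries over to the real setting: the encoder supplies a starting code near the answer, so a short refinement run is enough, whereas optimization from a generic starting point needs many more steps.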

2. Which Space? Z, W, and W+ Advanced

For a StyleGAN, you can invert into three different spaces, and the choice dominates the result. The original latent space $Z$ is the Gaussian input; inverting into it is hard and reconstructs poorly because $Z$ is entangled and low-capacity. The intermediate $W$ space (after the mapping network) is more disentangled and reconstructs noticeably better. The trick that makes inversion practical is $W+$: StyleGAN feeds the same $\mathbf{w}$ to every layer, but $W+$ relaxes this, allowing a different $\mathbf{w}$ per layer (eighteen vectors for a $1024 \times 1024$ generator). The extra degrees of freedom let $W+$ reconstruct almost any real image, including faces outside the training distribution, which $W$ alone cannot represent.
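The difference between $W$ and $W+$ is easiest to see in tensor shapes. The sketch below assumes the usual StyleGAN layout of two style inputs per resolution block from $4 \times 4$ upward and a latent width of 512; the helper names `num_style_layers` and `w_to_wplus` are hypothetical, introduced only for illustration.

```python
import numpy as np

def num_style_layers(resolution):
    """Number of per-layer style inputs, assuming two style inputs per
    resolution block from 4x4 up to `resolution` (a power of two)."""
    return 2 * int(round(np.log2(resolution))) - 2

def w_to_wplus(w, num_layers):
    """Broadcast one W code of shape (w_dim,) to a W+ code of shape (num_layers, w_dim)."""
    return np.tile(w, (num_layers, 1))

w_dim = 512                                           # assumed latent width
w = np.random.default_rng(0).standard_normal(w_dim)   # stand-in for a mapped W code
w_plus = w_to_wplus(w, num_style_layers(1024))
print(w_plus.shape)   # (18, 512)
```

A code built this way is still a $W$ code written in $W+$ form: every row is identical. Inversion into $W+$ lets the eighteen rows diverge, which is where the extra reconstruction capacity comes from.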

Key Insight: The Editability-Fidelity Tradeoff

The richer the latent space, the better it reconstructs, and the worse it edits. A $W+$ code can reproduce a real photo almost perfectly, but it often lands off the generator's learned manifold (the set of latent codes the generator actually saw during training and knows how to render well), in a region the semantic editing directions were not calibrated for, so applying a "smile" direction produces artifacts instead of a clean smile. A plain $W$ code stays on the manifold and edits cleanly, but may not capture the exact person. This editability-fidelity tradeoff is the central tension of GAN inversion, and the modern answer is regularized inversion: invert into $W+$ but add a penalty that keeps the code close to the well-behaved $W$ region, buying most of the reconstruction quality while preserving most of the editability.
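One simple way to express "stay close to $W$" is to penalize how far the per-layer vectors of a $W+$ code spread apart. The sketch below is an illustrative choice of penalty, not the regularizer of any particular published method; the function names and the weight `lambda_reg` are assumptions.

```python
import numpy as np

def wplus_spread_penalty(w_plus):
    """Mean squared deviation of each layer's code from the layer-averaged code.
    It vanishes when every layer shares one vector, the structural constraint
    that defines a W code."""
    w_mean = w_plus.mean(axis=0, keepdims=True)
    return float(np.mean((w_plus - w_mean) ** 2))

def regularized_loss(reconstruction_loss, w_plus, lambda_reg=0.1):
    """Total objective: reconstruction term plus the weighted stay-near-W penalty."""
    return reconstruction_loss + lambda_reg * wplus_spread_penalty(w_plus)

rng = np.random.default_rng(0)
w = rng.standard_normal(512)
in_w = np.tile(w, (18, 1))                              # all layers share one vector
off_w = in_w + 0.3 * rng.standard_normal(in_w.shape)    # layers drift apart
print(wplus_spread_penalty(in_w), wplus_spread_penalty(off_w))
```

Raising `lambda_reg` trades reconstruction fidelity for editability, which is the dial the tradeoff above describes.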

Figure 32.5.1: The editability-fidelity tradeoff across StyleGAN latent spaces. $Z$ both reconstructs and edits poorly. $W$ edits beautifully but cannot capture every real image. $W+$ reconstructs nearly any photo but drifts off the manifold and edits worse. Regularized $W+$ inversion (magenta) targets the upper-right sweet spot, the goal of modern inversion methods.

3. Finding Semantic Directions

Once you can place a real image in latent space, editing reduces to finding the right direction to move. There are three families of methods, and it helps to take them one at a time. Supervised directions use attribute labels: train a linear classifier (for example, "smiling versus not") on many latent codes, each labeled with the attribute of the image it generates. The normal vector of the classifier's decision boundary is the editing direction, and walking along it adds a smile.

Unsupervised directions need no labels. GANSpace (Härkönen et al., 2020) runs principal component analysis (PCA) on a large sample of latent codes and finds that the top principal components correspond to interpretable changes (pose, lighting, gender), while SeFa factorizes the generator's first weight matrix to find directions in closed form. Conditional methods like StyleFlow round out the set, learning nonlinear edits that respect attribute correlations.
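The PCA step behind GANSpace-style discovery fits in a few lines. The sketch below is a minimal illustration on synthetic data: the `latents` array is a stand-in for $W$ codes obtained by pushing Gaussian samples through the mapping network, and one axis is given extra variance on purpose so the top component is easy to check. The function name `pca_directions` is hypothetical.

```python
import numpy as np

def pca_directions(latents, k):
    """Return the top-k principal directions (k, w_dim) and their variances."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions sorted by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (len(latents) - 1)
    return vt[:k], variances[:k]

rng = np.random.default_rng(0)
scales = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
latents = rng.standard_normal((2000, 8)) * scales   # synthetic stand-in for sampled W codes
directions, variances = pca_directions(latents, k=3)
print(np.round(np.abs(directions[0]), 2))           # dominated by the high-variance axis
```

Each returned row is a unit vector that can be used as an edit direction in the same way as a supervised one; what it means visually has to be found by rendering images along it.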

The supervised recipe is short enough to read in full. Collect a sample of latent codes, generate each image, label it with an off-the-shelf attribute classifier, fit a linear separator in latent space, and use its weight vector as the edit.

# Supervised editing direction: fit a linear SVM that separates "smiling"
# from "not smiling" latent codes in W space; its normal vector is the
# direction to walk for more or less of the attribute.
import numpy as np
from sklearn.svm import LinearSVC

def smile_direction(latents, attribute_scores):
    """latents: (N, w_dim) sampled W codes; attribute_scores: (N,) smiling probability.
       Returns a unit vector in W space that increases the attribute when added."""
    labels = (attribute_scores > 0.5).astype(int)          # binarize the attribute
    svm = LinearSVC(C=1.0).fit(latents, labels)            # separate smiling from not
    direction = svm.coef_[0]
    return direction / np.linalg.norm(direction)           # unit editing direction

# editing a real image: invert it, then walk along the direction
# w_edited = w_inverted + alpha * torch.from_numpy(smile_dir).to(w_inverted)   # match dtype/device; alpha controls strength
# G.synthesis(w_edited) now shows the same person, smiling

Code Fragment 2: Discovering a supervised editing direction. A linear support vector machine (SVM) separates "smiling" from "not smiling" latent codes; its normal vector is the direction to walk for more or less smile. Adding $\alpha$ times this vector to an inverted real image's code edits that real image.

The remarkable fact, and the reason this works at all, is that these directions are largely global and linear: the same "age" vector ages most faces, the same "pose" vector rotates most heads, because StyleGAN's $W$ space organized the factors of variation into roughly linear, disentangled axes. This is the editability that the entangled $Z$ space and, partly, the over-flexible $W+$ space lack. Figure 32.5.2 assembles the three steps into the single find-then-edit pipeline this section is built around.

Figure 32.5.2: The find-then-edit pipeline. Inversion maps a real photo to a latent code $\mathbf{w}$ (by slow per-image optimization or a fast learned encoder), a semantic direction $\mathbf{d}$ scaled by strength $\alpha$ is added to that code (green), and the frozen generator renders the edited code in a single forward pass. Only the latent moves; the generator's weights never change. This is the loop every method in the section plugs into, differing only in how $\mathbf{w}$ and $\mathbf{d}$ are obtained.
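The edit step of this pipeline is a single vector addition, sketched below on placeholder arrays. The inverted code and direction are random stand-ins, and the optional `layers` argument, which confines the edit to a subset of the per-layer vectors, is a hypothetical convenience added for illustration rather than part of the recipe above.

```python
import numpy as np

def apply_edit(w_plus, direction, alpha, layers=None):
    """Add alpha * direction to a (num_layers, w_dim) code.
    `layers` optionally restricts the edit to the given layer indices."""
    edited = w_plus.copy()
    idx = slice(None) if layers is None else list(layers)
    edited[idx] += alpha * direction        # direction broadcasts across the selected layers
    return edited

rng = np.random.default_rng(0)
w_inverted = rng.standard_normal((18, 512))   # stand-in for an inverted W+ code
d = rng.standard_normal(512)
d /= np.linalg.norm(d)                        # unit editing direction

# Sweep the strength: negative alpha removes the attribute, positive adds it.
sweep = [apply_edit(w_inverted, d, alpha) for alpha in (-3.0, 0.0, 3.0)]
# Edit only the first four layers and leave the rest untouched.
coarse_only = apply_edit(w_inverted, d, 3.0, layers=range(0, 4))
```

Each edited code would then be passed through the frozen generator once to render the result.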

Fun Fact

Because editing directions are largely linear, they are also gloriously composable, and gloriously entangled in ways nobody asked for. Researchers poking at face latents kept finding directions that bundled attributes together against all intuition: push "add glasses" far enough and the face tends to age, push "smile" and the eyes often narrow, push "open mouth" and people occasionally sprout a microphone. The latent space learned correlations from its training photos, not a tidy ontology, so the "smile" axis quietly remembers that smiling people in the dataset were also, on average, doing a dozen other things. Disentanglement is a spectrum, and $W$ sits closer to the good end than anyone in 2014 dared hope, but it is not a control panel with one clean knob per concept.

Library Shortcut

The full inversion-and-editing stack is packaged in several maintained repositories so you do not assemble it by hand. The pSp and e4e (encoder-for-editing) projects ship pretrained encoders that invert a face into $W+$ in one forward pass, and GANSpace, InterFaceGAN, and StyleCLIP provide ready-made editing directions, including text-driven ones (StyleCLIP edits by an instruction like "a person with grey hair" using a CLIP loss, where CLIP is the image-text embedding model from Chapter 25 that scores how well an image matches a text prompt). Using them, inverting and editing a real photo is a handful of lines: load the pretrained StyleGAN and encoder, run w = E(image), add a precomputed direction, and re-synthesize, replacing the optimization loop and the direction-discovery training above with calls into the library.

Practical Example: The Photo-Editing Feature That Shipped Then Stalled

A consumer photo-app team in 2021 built a "smile" and "age" slider on top of a StyleGAN2 face model and a pSp encoder. In the demo, on clean studio portraits, it was magical: drag the slider and the subject smiled or aged convincingly in real time. In production it disappointed, and the post-mortem was instructive. Real user photos were nothing like FFHQ: heavy makeup, unusual lighting, hats, side angles, glasses, multiple faces in frame. The encoder inverted these into $W+$ codes far off the manifold, so the edit directions, calibrated on clean faces, produced warping and color shifts. The team's fix was regularized inversion (keeping the $W+$ code near $W$, the sweet spot of Figure 32.5.1) plus a quality gate that fell back to "edit unavailable" when the inversion's perceptual reconstruction error was too high. The honest product lesson: GAN editing is real and impressive within the generator's distribution, and its quality degrades in proportion to how far your inputs stray from that distribution. By 2023 much of this feature category had migrated to diffusion-based editing (Chapter 35), which inverts and edits arbitrary images more robustly, though the find-the-code-then-edit-it idea is identical.

Research Frontier

The single most striking recent result in this space is DragGAN (Pan et al., SIGGRAPH 2023), which lets a user click a point on a generated image and drag it to a new location. As the user drags, an optimization loop continuously updates the latent code so that the dragged point moves there and the rest of the image follows plausibly, turning latent-space editing into direct point-and-drag manipulation. It made interactive GAN editing feel like pulling on the pixels themselves. The same year, the inversion-and-editing machinery jumped wholesale to diffusion: null-text inversion and the broader line of diffusion-inversion methods in Chapter 35 solve the identical "find the code for this real image, then edit it" problem for diffusion models, and DragDiffusion ported DragGAN's interaction directly. The find-then-edit paradigm born here is now the default interface for controllable generation across both model families.

Exercises

Exercise 32.5.1 Conceptual

Explain the editability-fidelity tradeoff in your own words using Figure 32.5.1. Why does a $W+$ code that reconstructs a real face almost perfectly often edit worse than a $W$ code that reconstructs it only approximately? Use the phrase "on the manifold" in your answer, and describe what regularized inversion does to resolve the tension.

Exercise 32.5.2 Coding

Using a pretrained StyleGAN2 and the invert_image loop in this section, invert a generated image (one whose true $\mathbf{w}$ you know, so you can check the answer) into both $W$ and $W+$. Compare the reconstruction LPIPS for each, and compare the recovered code to the true one. Then invert a real photograph (from CelebA-HQ) into both spaces and report the LPIPS reconstruction error, confirming that $W+$ reconstructs the out-of-distribution real image better than $W$.

Exercise 32.5.3 Analysis

Discover a semantic direction with the smile_direction recipe: sample 2000 $W$ codes, generate each face, score "smiling" with an attribute classifier (or use precomputed scores), and fit the SVM. Apply the resulting direction at increasing strengths $\alpha$ to a fixed inverted face and observe where the edit breaks down. At large $\alpha$, what entangled attributes change along with the smile (for example, does the face also appear younger or change gaze)? What does this reveal about how disentangled the $W$ space actually is?