
GANs Are Not Magic

Jun 10, 2021 | 7 min read | Utso Sarkar

Every few months someone posts AI-generated faces that look indistinguishable from photographs, and the internet collectively forgets that Generative Adversarial Networks have been a nightmare to train since Ian Goodfellow invented them in a bar in 2014. I spent the first half of 2021 trying to use GANs for synthetic training data in a computer vision pipeline. This post is what I wish the tutorials had told me before I burned three weeks on a model that generated the same three faces with different hair.

GANs are not magic. They are two neural networks playing a minimax game that gradient-based training is not guaranteed to converge on, evaluated with metrics that correlate weakly with human judgment, and surrounded by a literature that reports best-case results while burying the failed runs.

The Setup Everyone Understands

A GAN has two components: a generator G that maps random noise z to synthetic samples, and a discriminator D that classifies samples as real or fake. They train adversarially. G tries to fool D. D tries to catch G. At equilibrium, G’s outputs are indistinguishable from the real data distribution.
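
For reference, the game described above is usually written as the value function from the original GAN formulation. Here p_data stands for the real data distribution and p_z for the noise prior; those symbol names are the conventional ones, not something this post defines.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

D pushes V up by scoring real samples near 1 and fakes near 0. G pushes V down by making D(G(z)) large.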

In theory, elegant. In practice, you are balancing two loss functions that fight each other while both networks are simultaneously learning representations of your data. It is like teaching two students to debate while both are still learning the subject.

flowchart LR
    Z[Random noise z] --> G[Generator G]
    G --> Fake[Fake sample]
    Real[Real sample] --> D[Discriminator D]
    Fake --> D
    D -->|Real/Fake score| LossD[Discriminator loss]
    D -->|Gradient to G| LossG[Generator loss]
    LossD -->|Update D| D
    LossG -->|Update G| G

The loop looks clean in a diagram. In a Jupyter notebook at 2 AM it looks like NaN losses and generated images that resemble television static had a baby with a Rorschach test.

Mode Collapse: The Silent Killer

Mode collapse is when the generator learns to produce a small set of outputs that fool the discriminator, ignoring most of the training distribution. Your dataset has ten thousand diverse faces. Your generator produces one face with slightly different skin tones. The discriminator cannot distinguish this one face from real faces in its neighborhood of feature space, so G stops exploring.

I hit mode collapse on day four. My generator was producing the same defect on the same surface, rotated slightly. The discriminator was not stupid. It was complacent. G found a local optimum and camped there.

Detecting mode collapse is easier than fixing it:

Visual inspection (the honest method)
Intra-batch diversity metrics (compare pairwise distances in generated batch)
Sudden drop in discriminator accuracy with no improvement in sample quality
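
The intra-batch diversity check can be as simple as an average pairwise distance. What follows is a minimal sketch, assuming each generated sample has already been flattened or embedded into a list of floats; the function name and the toy batches are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(batch):
    """Average Euclidean distance over all pairs of samples in a batch.

    `batch` is a list of equal-length feature vectors (flattened images or
    embeddings). A value near zero means the samples are nearly identical.
    """
    pairs = list(combinations(batch, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): a spread-out batch and a collapsed one.
diverse = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
collapsed = [[1.0, 1.0], [1.0, 1.01], [1.01, 1.0]]

diverse_score = mean_pairwise_distance(diverse)
collapsed_score = mean_pairwise_distance(collapsed)
print(diverse_score, collapsed_score)
```

The absolute number means little on its own. Tracking it across checkpoints and comparing it against the same statistic on a batch of real samples is what makes a collapse visible.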

Fixing it is a grab bag:

Minibatch discrimination (let D see batches, not individual samples)
Unrolled GANs (G optimizes against k-step lookahead of D)
Different architectures (StyleGAN’s mapping network helps)
More data (brutal but often true)
Starting over (underrated)
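
To make "let D see batches" concrete, here is a sketch of the minibatch standard deviation trick, a cheaper relative of full minibatch discrimination (this is a swapped-in simplification, not the learned-kernel version). It assumes samples are plain lists of floats; in a real model this would run on an intermediate feature map inside D.

```python
import statistics


def append_minibatch_stddev(batch):
    """Append one batch-level statistic to every sample's feature vector.

    For each feature position, compute the standard deviation across the
    batch, then average those values into a single scalar. A collapsed
    batch yields a scalar near zero, which D can learn to treat as a tell.
    """
    n_features = len(batch[0])
    per_feature_std = [
        statistics.pstdev([sample[i] for sample in batch])
        for i in range(n_features)
    ]
    batch_stat = sum(per_feature_std) / n_features
    return [list(sample) + [batch_stat] for sample in batch]


# Toy batches (hypothetical values).
collapsed_out = append_minibatch_stddev([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
varied_out = append_minibatch_stddev([[0.0, 0.0], [2.0, 4.0]])
print(collapsed_out[0], varied_out[1])
```

Every sample in the batch carries the same extra value, so D gets a cheap signal about how much the batch varies without comparing samples pair by pair.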

None of these are guaranteed. The GAN literature is a graveyard of techniques that work on CelebA and fail on your domain.

Training Instability: Where the Loss Curves Lie

GAN loss curves are famously uninformative. The generator loss can increase while sample quality improves. The discriminator loss can flatline at about 0.69 (ln 2, the binary cross-entropy of a coin-flip prediction) while the model is either perfect or useless. You cannot read GAN training like you read supervised learning.
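
The ln 2 plateau falls straight out of the arithmetic. The sketch below assumes D outputs probabilities and that its loss is binary cross-entropy averaged over the real and fake halves; implementations that sum the two halves instead will plateau at 2 ln 2, roughly 1.39.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D, averaged over the real and fake halves.

    d_real and d_fake are lists of D's output probabilities in (0, 1).
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


# D answers 0.5 for every input: it cannot tell real from fake.
confused = discriminator_loss([0.5] * 4, [0.5] * 4)

# D is confident and correct: the loss is much lower.
sharp = discriminator_loss([0.9] * 4, [0.1] * 4)
print(confused, sharp)
```

The loss value is the same whether D answers 0.5 because G's samples are flawless or because D has stopped using its input at all, which is why the curve alone cannot distinguish the two cases.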

What actually helps me monitor training:

Sample grids every N steps. Save 64 generated images. Look at them. Your eyes are the best metric.
Discriminator accuracy on a held-out real set. If D hits 100% and stays there, G is not learning. If D is at 50%, either G is perfect or D is broken. Context matters.
Feature matching loss as a supplementary signal. Match intermediate discriminator features between real and fake batches.
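
Feature matching, in its usual form, compares the batch-mean of some intermediate discriminator features for real and fake samples. A minimal sketch, assuming the features have already been extracted into lists of floats (the extraction step and the toy numbers are hypothetical):

```python
def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between batch-mean feature vectors."""

    def batch_mean(features):
        n = len(features)
        # zip(*features) iterates over feature columns across the batch.
        return [sum(column) / n for column in zip(*features)]

    mu_real = batch_mean(real_features)
    mu_fake = batch_mean(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))


real = [[1.0, 2.0], [3.0, 4.0]]        # batch mean is [2.0, 3.0]
fake_far = [[0.0, 0.0], [0.0, 0.0]]    # batch mean is [0.0, 0.0]
fake_same = [[2.0, 3.0], [2.0, 3.0]]   # collapsed, but the mean matches

far_loss = feature_matching_loss(real, fake_far)
same_loss = feature_matching_loss(real, fake_same)
print(far_loss, same_loss)
```

The second fake batch is two copies of one point and still scores zero, because only the means are compared. That is the reason to treat this as a supplementary signal and keep looking at the sample grids.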

Hyperparameters that matter more than the papers admit:

Learning rate ratio between G and D. If D learns too fast, G gets no gradient. If D learns too slow, G collapses. I typically use 1:1 or 2:1 (D:G) with separate optimizers.
Batch size. Small batches increase gradient variance. GANs hate variance.
Architecture capacity. A generator with too few parameters collapses. One with too many can overpower a weak D and learn to exploit its blind spots instead of modeling the data.

I spent a week tuning learning rates before I realized my data pipeline was feeding D and G different normalizations. Check your preprocessing before you touch hyperparameters. I say this because I did not.
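
A preprocessing sanity check of this kind takes a few lines. The sketch below is illustrative only: it assumes one common form of the mismatch, real images scaled to [0, 1] while a tanh generator emits values in [-1, 1]. The post does not say what the actual mismatch was, and the function names and tolerance are hypothetical.

```python
def value_range(batch):
    """Return (min, max) over every value in a batch of flat samples."""
    flat = [value for sample in batch for value in sample]
    return min(flat), max(flat)


def same_normalization(real_batch, fake_batch, tolerance=0.25):
    """True when both batches occupy roughly the same value range.

    `tolerance` is an arbitrary illustrative slack, not a recommended value.
    """
    real_lo, real_hi = value_range(real_batch)
    fake_lo, fake_hi = value_range(fake_batch)
    return (
        abs(real_lo - fake_lo) <= tolerance
        and abs(real_hi - fake_hi) <= tolerance
    )


real_01 = [[0.0, 0.5, 1.0]]        # real images scaled to [0, 1]
real_pm1 = [[-1.0, 0.0, 1.0]]      # the same images rescaled to [-1, 1]
fake_tanh = [[-1.0, 0.1, 0.98]]    # what a tanh output layer produces

mismatch_ok = same_normalization(real_01, fake_tanh)
matched_ok = same_normalization(real_pm1, fake_tanh)
print(mismatch_ok, matched_ok)
```

Running a check like this on the exact tensors that reach D, rather than on the raw dataset, is what catches a pipeline that normalizes the two paths differently.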

FID Lies (Or At Least Misleadingly Whispers)

Fréchet Inception Distance (FID) measures the distance between feature distributions of real and generated images, using Inception-v3 activations. Lower is better. Papers report FID of 2.5 and reviewers nod approvingly.
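
Concretely, FID fits a Gaussian to each set of Inception activations and computes the Fréchet distance between the two Gaussians. With mean and covariance (μ_r, Σ_r) for real images and (μ_g, Σ_g) for generated ones:

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Everything the metric knows about your images is compressed into those two means and two covariance matrices, which is where most of the problems below come from.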

Problems with FID:

Inception-v3 was trained on ImageNet. If your domain is medical X-rays or industrial inspection photos, Inception features may not capture what matters. A generated set can score an excellent FID while being useless for your downstream task.

FID is a distribution metric. It tells you whether generated samples cover the real distribution on average. It does not tell you whether any individual sample is good. A model with mode collapse can achieve decent FID by nailing the most common modes.

Sample size sensitivity. FID estimates depend on how many generated samples you use, and the estimator is biased: fewer samples inflate the score. Comparing FID across papers with different sample counts is comparing apples to oranges.

It does not measure diversity within the generated set well. Two models with identical FID can have wildly different mode coverage.
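
The sample-size problem is easy to reproduce in one dimension. The sketch below is a toy stand-in, not real FID: it fits 1-D Gaussians instead of using Inception features, and draws both sample sets from the same N(0, 1), so the true distance is zero. All names and the trial count are illustrative assumptions.

```python
import math
import random


def frechet_distance_1d(xs, ys):
    """Squared Frechet distance between 1-D Gaussians fitted to xs and ys."""

    def mean_and_std(values):
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        return mu, math.sqrt(var)

    mu_x, sd_x = mean_and_std(xs)
    mu_y, sd_y = mean_and_std(ys)
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2


def average_distance(n_samples, trials=200, seed=0):
    """Average distance between two sample sets drawn from the SAME N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        total += frechet_distance_1d(xs, ys)
    return total / trials


small = average_distance(10)
large = average_distance(1000)
print(f"n=10: {small:.4f}   n=1000: {large:.4f}")
```

Both runs compare a distribution against itself, yet the small-sample estimate comes out far larger. The same effect in the full metric is why a FID computed on a few thousand samples should not be set against one computed on tens of thousands.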

I still compute FID. It is useful for relative comparison within a single experiment series. I do not trust it as an absolute quality gate. When someone says “our GAN achieves state-of-the-art FID on this dataset,” ask how the samples look and whether anyone tried to use them for anything real.

What Actually Worked for Me

After the GAN experiments, I landed on a hybrid approach:

Use GANs for data augmentation of existing images, not generation from scratch. Conditional GANs that take a real image and add synthetic variation performed better than unconditional generation. The real image anchors the output.
Start from pretrained generators. StyleGAN2 checkpoints, fine-tune on domain data. Training from scratch is a research project, not an engineering task.
Pair GAN outputs with a quality filter. A separate classifier rejects generated samples below a confidence threshold. Brute force, but it beats shipping bad synthetic data into a production training set.
Consider diffusion models. In 2021 they were emerging. By now they have largely superseded GANs for image generation quality. If you are starting fresh in generative modeling, look at DDPM and Stable Diffusion before you invest in GAN expertise.
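
The quality filter step can be sketched as a plain threshold on a classifier's confidence. In the example below the scoring function, the sample format, and the 0.8 threshold are all hypothetical placeholders; in practice `score_fn` would wrap whatever separate classifier you trust for your domain.

```python
def filter_by_confidence(samples, score_fn, threshold=0.8):
    """Split samples into (kept, rejected) by classifier confidence."""
    kept, rejected = [], []
    for sample in samples:
        if score_fn(sample) >= threshold:
            kept.append(sample)
        else:
            rejected.append(sample)
    return kept, rejected


# Stand-in data: each generated sample carries a precomputed confidence.
generated = [
    {"id": "a", "confidence": 0.95},
    {"id": "b", "confidence": 0.40},
    {"id": "c", "confidence": 0.80},
]

kept, rejected = filter_by_confidence(generated, lambda s: s["confidence"])
print([s["id"] for s in kept], [s["id"] for s in rejected])
```

Logging the rejection rate alongside the kept samples is worth the extra line: a rate that climbs over training is another early sign that the generator is drifting.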

The Honest Researcher’s Checklist

Before you commit to GANs for a project:

Do you have enough real data that you do not need generation? (Often yes.)
Is your use case tolerant of occasional garbage outputs?
Do you have GPU budget for hundreds of hyperparameter sweeps?
Can you evaluate quality with domain-specific metrics, not just FID?
Have you tried simpler baselines (copy-paste augmentation, texture synthesis, 3D rendering)?

If you checked fewer than three, reconsider.

The Takeaway

GANs produced genuinely remarkable results. StyleGAN faces, CycleGAN style transfer, Pix2Pix for paired image translation. These are real achievements built on a foundation of unstable optimization and careful engineering that papers underreport.

They are not magic. They are a tool that works sometimes, for some domains, with significant tuning investment. Treating them as plug-and-play generative models is how you end up with three faces and an FID score you cannot explain to your advisor.

I still respect the architecture. I just no longer trust it blindly. That is the difference between reading papers and doing research.
