
Before Stable Diffusion: The Summer of DALL-E

Feb 14, 2022 · 6 min read · Utso Sarkar

Everyone talks about Stable Diffusion like it appeared from nowhere in August 2022. It did not. The real inflection was the summer before, when the field was still digesting DALL-E 1, and the winter that followed, when GLIDE arrived. Both papers came out of OpenAI, and together they made text-to-image generation feel inevitable instead of laughable. If you were building anything in computer vision that year, you were not waiting for a single model release. You were watching a stack crystallize: diffusion as the generative backbone, CLIP as the semantic bridge, and classifier-free guidance as the knob that turned fuzzy blobs into something you could ship in a demo.

I was running a startup and doing research on the side. That combination makes you allergic to hype but hungry for leverage. DALL-E 1 and GLIDE were leverage. Not because they were perfect. They were not. But they made the failure modes legible.

DALL-E 1: Proof That Text Could Steer Pixels

OpenAI’s “Zero-Shot Text-to-Image Generation” (announced January 2021, but the ecosystem digestion took months) used a discrete VAE (dVAE) token space and an autoregressive transformer. You did not diffuse in pixel space. You predicted the next image token given prior tokens and text. The results were surreal, inconsistent, and occasionally brilliant. That was the point.

DALL-E 1 established three things practitioners still rely on:

Joint training on image-text pairs scales. The model did not need fine-grained captions for every object; noisy web alt-text was enough to learn rough alignment.
Compositionality is hard. Ask for “a red cube on top of a blue sphere” and you get symbolism, not physics. Text conditioning is not a scene graph.
Autoregressive image modeling is expensive at inference. Generating 1024 image tokens per sample (a 32×32 grid), one token at a time, is fine for research demos, painful for products.

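If you only remember DALL-E 1 as “the weird avocado chair meme,” you missed the architectural bet: treat images as a sequence problem conditioned on language. The text went into the same transformer as BPE tokens; CLIP’s role was to rerank the sampled candidates, which is a large part of why the published examples looked as good as they did.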
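To make the "sequence problem" framing concrete, here is a minimal sketch of autoregressive image-token sampling. It is not DALL-E 1's code: `toy_logits` is a hypothetical stand-in for the transformer (it just returns random logits), and the codebook and grid sizes are shrunk for illustration. The only thing it is meant to show is the shape of the loop: one model call per image token, each conditioned on the text and on every token generated so far.

```python
import numpy as np

VOCAB_SIZE = 16  # toy codebook size (assumption; real codebooks are far larger)
GRID = 4         # toy 4x4 token grid (assumption)

def toy_logits(text_tokens, image_tokens, rng):
    """Hypothetical stand-in for the transformer: random logits over the codebook."""
    return rng.normal(size=VOCAB_SIZE)

def sample_image_tokens(text_tokens, rng, temperature=1.0):
    """Sample image tokens one at a time in raster order."""
    image_tokens = []
    calls = 0
    for _ in range(GRID * GRID):
        # Each step sees the text plus all previously sampled image tokens.
        logits = toy_logits(text_tokens, image_tokens, rng) / temperature
        calls += 1
        # Numerically stable softmax over the codebook.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        image_tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return np.array(image_tokens).reshape(GRID, GRID), calls

rng = np.random.default_rng(0)
tokens, calls = sample_image_tokens(text_tokens=[3, 1, 4], rng=rng)
print(tokens.shape, calls)
```

The number of model calls equals the number of image tokens, and the calls cannot be parallelized across positions because each one depends on the previous sample.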

GLIDE: Diffusion Enters the Chat

OpenAI’s GLIDE (Guided Language to Image Diffusion for Generation and Editing) took a different route. Train a diffusion model in pixel space (with a U-Net), condition on text embeddings, and apply guidance at sampling time to sharpen outputs toward the prompt. The paper compared two guidance strategies, CLIP guidance and classifier-free guidance, and reported that human raters preferred classifier-free guidance.

GLIDE was better at photorealism than DALL-E 1 for many prompts. It also exposed the sampling cost problem openly: diffusion means many forward passes through a large U-Net. Quality improves with more steps, with diminishing returns. Your GPU bill scales with them too.

The paper’s editing results were underrated. Inpainting and masked editing with the same diffusion objective foreshadowed what Stable Diffusion’s latent inpainting would later productize. The research community was already circling the same idea: diffusion is not just generation; it is iterative refinement under constraints.

Classifier-Free Guidance: The Hack That Won

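Classifier guidance required a separate classifier and gradients through it at sample time. Clever, but clunky. Ho and Salimans’ classifier-free guidance (CFG) trick dropped the separate classifier. During training, randomly drop the conditioning signal so the model learns both conditional and unconditional score estimates. At inference, combine the two:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \, \big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$$

The guided score is a weighted combination of conditional and unconditional predictions, where c is the text conditioning, ∅ is the dropped (null) conditioning, and w is the guidance weight. At w = 1 you recover the plain conditional prediction; above 1 you extrapolate past it. Turn the guidance weight up and images get sharper, more literal, more “on prompt.” Turn it too high and they get crunchy, oversaturated, and artifact-ridden.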
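The weighted combination described above fits in one function. This is a minimal sketch on NumPy arrays, assuming the common convention where a weight of 1 returns the plain conditional prediction; the names `cfg_combine` and `guidance_scale` are illustrative, and the Ho and Salimans paper parameterizes the same idea with a weight shifted by one.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move guidance_scale times the (conditional - unconditional) direction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions standing in for two U-Net forward passes.
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])

for w in (0.0, 1.0, 7.5):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

At a weight of 0 the text is ignored, at 1 the output is the conditional prediction, and at 7.5 the first component is pushed well past anything the model predicted on its own. That overshoot is the mechanical source of both the extra sharpness and the artifacts.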

Every modern text-to-image stack you have used since then is, in some sense, CFG all the way down. Stable Diffusion exposes it directly as the guidance scale. DALL-E 2’s public details differ, but the guidance intuition persists. Midjourney is closed, so this is inference rather than fact, but its aesthetic bias is presumably not magic either: training data plus guidance schedules plus post-processing.

If you implement one equation from this era, implement CFG. It is the cheapest performance lever in the generative toolbox.
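The training-side half of CFG, randomly dropping the conditioning signal, is just as small. The sketch below assumes class-label conditioning with a reserved null id; `NULL_TOKEN`, `drop_conditioning`, and the drop probability of 0.1 are illustrative choices, not values taken from any particular paper or codebase.

```python
import numpy as np

NULL_TOKEN = 0  # hypothetical id reserved for "no conditioning"

def drop_conditioning(labels, p_uncond, rng):
    """Replace each label with NULL_TOKEN with probability p_uncond,
    so one network learns both conditional and unconditional predictions."""
    labels = np.asarray(labels)
    drop = rng.random(labels.shape) < p_uncond
    return np.where(drop, NULL_TOKEN, labels)

rng = np.random.default_rng(0)
labels = np.arange(1, 10_001)  # toy batch; every label is non-null
dropped = drop_conditioning(labels, p_uncond=0.1, rng=rng)
print((dropped == NULL_TOKEN).mean())  # close to p_uncond
```

In a real training loop this would run on each batch before the labels (or text embeddings) are passed to the denoiser, and the same null id would be used for the unconditional pass at sampling time.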

Forward and Reverse: The Diffusion Mental Model

Diffusion is not deep learning astrology. It is a controlled noise process. Forward diffusion gradually corrupts data into Gaussian noise. Reverse diffusion learns to denoise step by step, recovering structure. Text conditioning bends the denoising vector field so “a fox in watercolor” means something different than “a fox in neon.”
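The forward half of that description has a closed form in the DDPM formulation: you can jump straight to any noise level without simulating the intermediate steps. The sketch below assumes a linear beta schedule with commonly used endpoints; the function names and the 8×8 toy "image" are illustrative.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (an assumed schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t directly from x0:
    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * noise."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left at step t

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))  # toy "image"
for t in (0, 499, 999):
    x_t, _ = q_sample(x0, t, alpha_bar, rng)
    print(t, float(np.sqrt(alpha_bar[t])), float(x_t.std()))
```

The signal coefficient `sqrt(alpha_bar[t])` shrinks toward zero as t grows, which is the "gradually corrupts data into Gaussian noise" part. The returned `noise` is what a denoiser would be trained to predict.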

flowchart LR
    subgraph forward["Forward diffusion"]
        X0["x0: data"] --> X1["x1: light noise"]
        X1 --> X2["x2: more noise"]
        X2 --> XT["xT: ~ Gaussian"]
    end
    subgraph reverse["Reverse diffusion (learned)"]
        XT --> Rn1["denoise step T"]
        Rn1 --> Rn2["denoise step T-1"]
        Rn2 --> R0["x0: sample"]
    end
    Text["Text embedding"] -.-> Rn1
    Text -.-> Rn2
    CFG["Classifier-free guidance"] -.-> Rn1
    CFG -.-> Rn2

Once you internalize this diagram, papers stop sounding like incantations. DDPM, DDIM, latent diffusion: they are variations on how you parameterize the reverse steps, how many you take, and whether you work in pixel space or a VAE latent.
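As a sketch of what "parameterizing the reverse steps" looks like in the DDPM case, here is an ancestral sampling loop. To keep it checkable without training anything, it assumes a toy dataset where every image equals one constant, so the ideal noise prediction can be written analytically; `oracle_eps` is a hypothetical stand-in for a trained U-Net, and the 50-step schedule is an illustrative choice.

```python
import numpy as np

T = 50                               # toy step count (assumption)
betas = np.linspace(1e-4, 0.2, T)    # toy schedule (assumption)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

TARGET = 0.5  # the toy "dataset" is a single constant image

def oracle_eps(x_t, t):
    """Exact noise prediction when all data equals TARGET.
    Stand-in for a learned denoiser."""
    return (x_t - np.sqrt(alpha_bar[t]) * TARGET) / np.sqrt(1.0 - alpha_bar[t])

def ddpm_sample(shape, rng):
    """DDPM ancestral sampling: denoise from step T-1 down to 0."""
    x = rng.normal(size=shape)  # start from x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = oracle_eps(x, t)
        # Posterior mean under the epsilon parameterization.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        # Fresh noise at every step except the last.
        noise = rng.normal(size=shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

sample = ddpm_sample((4, 4), np.random.default_rng(0))
print(np.abs(sample - TARGET).max())  # tiny: the sampler lands on the data
```

DDIM would replace the update inside the loop with a deterministic one and allow skipping steps; latent diffusion would run the same loop on a VAE latent instead of pixels.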

DALL-E 1 vs GLIDE: What Actually Mattered for Founders

From a product standpoint in early 2022, neither model was yours to ship. APIs were gated. The full weights were not public; OpenAI released only DALL-E 1’s dVAE and a small GLIDE model trained on filtered data. But the comparison still informed build vs wait decisions:

Dimension	DALL-E 1 (autoregressive)	GLIDE (diffusion)
Visual fidelity	Stylized, variable	Stronger photorealism
Controllability	Prompt-only, limited edit	Masked editing in paper
Inference cost	Many serial token steps	Many serial U-Net passes, each over the whole image
Open replication	Harder without scale	Feasible with effort
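A back-of-envelope count of sequential model calls makes the inference-cost row easier to reason about. The 1024 image tokens match DALL-E 1's 32×32 grid; the 250 diffusion steps are a hypothetical setting, not a figure from the GLIDE paper, and the function names are illustrative. Calls are not equal in cost (a cached transformer step is far cheaper than a full U-Net pass), so this counts latency-bound steps, not FLOPs.

```python
def autoregressive_calls(num_image_tokens):
    """One model call per generated image token."""
    return num_image_tokens

def diffusion_calls(num_steps, use_cfg=True):
    """One denoiser call per step, doubled when classifier-free guidance
    needs both a conditional and an unconditional prediction."""
    return num_steps * (2 if use_cfg else 1)

print(autoregressive_calls(32 * 32))        # 1024
print(diffusion_calls(250, use_cfg=True))   # 500
```

The two CFG passes per step can be batched together, so the doubling shows up as memory and throughput cost more than as extra latency.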

The startup calculus was brutal. If your moat was “we call OpenAI,” you had no moat. If your moat was domain-specific data, constrained generation, or verification on outputs, the model layer being closed was almost irrelevant.

What We Got Wrong That Summer

We underestimated how fast latent diffusion would drop VRAM requirements. DALL-E 1’s token story made us think only hyperscalers could play. Stable Diffusion proved a compressed latent space plus a smaller U-Net could run on consumer GPUs. We overestimated how much users cared about semantic correctness vs aesthetic punch. CFG-heavy sampling taught users that cranking the guidance scale fixes prompts. It does, until it does not.

We also underweighted safety and provenance. Generative demos were so novel that nobody wanted to talk about deepfakes at dinner. That changed quickly.

The Line to Stable Diffusion

Stable Diffusion did not invent text-to-image. It bundled the winning ingredients for replication:

Latent diffusion (from Rombach et al.) to cut compute
CLIP text conditioning (the semantic glue since DALL-E 1’s era)
CFG for prompt adherence
Open weights, which turned research into a fork ecosystem overnight
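To put a rough number on the first ingredient: Stable Diffusion v1 runs its U-Net on a latent that is downsampled 8× per side with 4 channels, instead of on RGB pixels. The arithmetic below only counts tensor elements per image; it says nothing about parameter count or total VRAM, and `tensor_elements` is an illustrative helper.

```python
def tensor_elements(height, width, channels):
    """Number of values the denoiser has to process per image."""
    return height * width * channels

pixel = tensor_elements(512, 512, 3)              # RGB pixel space
latent = tensor_elements(512 // 8, 512 // 8, 4)   # 64x64x4 latent

print(pixel, latent, pixel / latent)  # 786432 16384 48.0
```

Every denoising step touches about 48 times fewer values in latent space, which is where most of the compute saving comes from.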

If you were paying attention in the summer of DALL-E and GLIDE, Stable Diffusion felt like the obvious open-source endpoint, not a surprise. The shock was licensing and speed, not science.

What I Would Tell a Founder Starting in 2022

Read the papers, not the threads. Implement a tiny DDPM on MNIST, then scale your intuition. Replicate CFG ablations on a small conditional model so you feel the artifact tradeoff in your own outputs. Do not anchor your roadmap to a single vendor’s API tier.
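Before running a CFG ablation on a real model, it can help to see the tradeoff in a case with a closed-form answer. The sketch below assumes a one-dimensional toy where the conditional and unconditional densities are both Gaussian. Combining their scores with weight w is then the score of another Gaussian, whose mean and variance can be computed directly. All parameter values are illustrative, and this describes a single noise level rather than a full sampling chain.

```python
def guided_gaussian(w, mu_c=1.0, var_c=1.0, mu_u=0.0, var_u=4.0):
    """Mean and variance of the density proportional to p_c**w * p_u**(1 - w),
    whose score is score_u + w * (score_c - score_u)."""
    precision = w / var_c + (1.0 - w) / var_u
    if precision <= 0:
        raise ValueError("combined density is not a proper Gaussian for this w")
    mean = (w * mu_c / var_c + (1.0 - w) * mu_u / var_u) / precision
    return mean, 1.0 / precision

for w in (0.0, 1.0, 3.0, 7.5):
    mean, var = guided_gaussian(w)
    print(f"w={w}: mean={mean:.3f}, var={var:.3f}")
```

As w rises above 1, the variance shrinks below the conditional's own variance and the mean moves past the conditional mean, away from the unconditional one. That is the one-dimensional version of "sharper and more on prompt, then oversaturated."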

The summer before Stable Diffusion was when text-to-image went from “party trick” to “infrastructure.” Everything after is distribution, fine-tuning, and the unglamorous work of verification, rights, and cost control. The models were the spark. The product is still the hard part.
