
Attention Is All You Need: Three Years Later

Nov 8, 2021 · 7 min read · Utso Sarkar

In June 2017, eight Google researchers published “Attention Is All You Need” and quietly detonated the entire field of natural language processing. This post is a technical revisit of the paper that started it all, written from the perspective of someone who has read it five times and understood it properly on the fourth. By November 2021, three years and change after publication, I was implementing attention mechanisms for a side research project while the rest of the world debated whether GPT-3 was AGI.

Transformers are not magic. They are weighted sum machines with a clever inductive bias. But that bias turned out to be exactly what language needed.

What the Paper Actually Proposed

Before Transformers, sequence modeling meant RNNs (LSTM, GRU) or CNNs (ByteNet, ConvS2S). RNNs process tokens sequentially, so training is slow: each step waits on the previous one and parallelism is limited. Gradients for long-range dependencies still fade over long sequences, even with LSTM gates.

The Transformer throws recurrence away. Every token attends to every other token in parallel. Position information comes from positional encodings, not from sequential processing.

Core components:

Multi-head self-attention: each token builds a representation by attending to all tokens, with multiple attention heads learning different relationship types
Position-wise feed-forward networks: two linear layers with ReLU applied per token independently
Residual connections and layer normalization around each sublayer
Encoder-decoder architecture for sequence-to-sequence tasks (original paper targeted machine translation)

flowchart TB
    subgraph Encoder
        E_IN[Input embeddings + Positional encoding] --> E_ATTN[Multi-Head Self-Attention]
        E_ATTN --> E_ADD1[Add and Norm]
        E_ADD1 --> E_FFN[Feed Forward]
        E_FFN --> E_ADD2[Add and Norm]
        E_ADD2 --> E_OUT[Encoder output]
    end

    subgraph Decoder
        D_IN[Output embeddings + Positional encoding] --> D_MASK[Masked Self-Attention]
        D_MASK --> D_ADD1[Add and Norm]
        D_ADD1 --> D_CROSS[Cross-Attention to Encoder]
        E_OUT --> D_CROSS
        D_CROSS --> D_ADD2[Add and Norm]
        D_ADD2 --> D_FFN[Feed Forward]
        D_FFN --> D_ADD3[Add and Norm]
        D_ADD3 --> D_SOFT[Linear + Softmax]
        D_SOFT --> D_OUT[Output probabilities]
    end

The diagram is the whole paper. Everything else is engineering details and training tricks.
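
To make one box of the diagram concrete, here is a minimal pure-Python sketch of the "Feed Forward" block followed by "Add and Norm" for a single token vector. It assumes the paper's post-LN ordering, LayerNorm(x + Sublayer(x)), and leaves out the learned gain and bias of layer normalization. The helper names (`linear`, `feed_forward`, `add_and_norm`) and the toy sizes are illustrative choices, not taken from the paper.

```python
import math
import random

def linear(x, weight, bias):
    """Compute y_j = sum_i x_i * weight[i][j] + bias[j] for one token vector."""
    return [
        sum(x_i * row[j] for x_i, row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: linear, ReLU, linear, applied to a single token."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Post-LN residual step: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

# Hypothetical toy sizes; the paper's base model uses d_model=512 and d_ff=2048.
d_model, d_ff = 4, 8
rng = random.Random(0)
w1 = [[rng.gauss(0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
w2 = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
b1, b2 = [0.0] * d_ff, [0.0] * d_model

token = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(token, feed_forward(token, w1, b1, w2, b2))
print([round(v, 3) for v in out])
```

Because the feed-forward block sees one token at a time, all mixing between positions has to happen in the attention sublayers.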

The O(n^2) Elephant in the Room

Self-attention computes pairwise interactions between all tokens. For sequence length n, attention is O(n^2) in both compute and memory.

The 2017 paper used sequences up to a few hundred tokens. Fine for translation. In 2021:

BERT maxes at 512 tokens
GPT-2 uses 1024
GPT-3 uses 2048
Document-level tasks want 4096+

At n=4096, the attention matrix has about 16.8 million entries (4096 × 4096) per head per layer. Multiply by batch size, heads, and layers. Your GPU weeps.
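
The arithmetic is easy to check. The sketch below estimates the memory needed just for the attention score matrices, assuming every layer's scores are kept around for the backward pass and stored in fp16 (2 bytes per value). The 12-head, 12-layer, batch-of-8 configuration is a hypothetical example, not a measurement from any particular model.

```python
def attention_matrix_bytes(seq_len, num_heads, num_layers, batch_size, bytes_per_value=2):
    """Bytes needed to hold every n x n attention score matrix at once."""
    entries_per_head = seq_len * seq_len
    return entries_per_head * num_heads * num_layers * batch_size * bytes_per_value

# Hypothetical configuration: 12 heads, 12 layers, batch of 8, fp16 scores.
for n in (512, 1024, 2048, 4096):
    total = attention_matrix_bytes(n, num_heads=12, num_layers=12, batch_size=8)
    print(f"n={n:5d}  entries/head={n * n:>10,}  scores={total / 2**30:8.2f} GiB")
```

Doubling n quadruples the footprint, which is why going from 512 to 4096 tokens costs 64 times more memory for the scores alone.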

The research response in 2021 included:

Sparse attention (Longformer, BigBird): attend locally plus global tokens
Linear attention (Performer, Linformer): kernel or low-rank approximations that avoid materializing the full attention matrix
FlashAttention (not published until 2022, so not available in 2021, but coming): IO-aware exact attention
Recurrence hybrids (Transformer-XL): cache previous segments

None of these fully solved the problem. Most traded exactness or full token-to-token connectivity for scalability. The Transformer ate NLP anyway because n=512 covers most profitable use cases and GPUs got bigger.
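
To see what "attend locally plus global tokens" buys, here is a rough sketch of that attention pattern as a boolean mask. It only illustrates the pattern in the style of Longformer and BigBird and is not either library's implementation: real implementations avoid building the n × n mask at all, which is the whole point. The function names and the window size are assumptions chosen for the example.

```python
def sliding_window_mask(seq_len, window, global_positions=()):
    """Return mask[i][j] = True where token i may attend to token j."""
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            local = abs(i - j) <= window                     # neighbors within the window
            is_global = i in global_set or j in global_set   # global tokens see and are seen by all
            row.append(local or is_global)
        mask.append(row)
    return mask

def count_allowed(mask):
    """Number of (query, key) pairs that survive the mask."""
    return sum(sum(row) for row in mask)

n = 64
sparse = sliding_window_mask(n, window=2, global_positions=[0])
print(count_allowed(sparse), "of", n * n, "pairs kept")
```

With a fixed window, the number of kept pairs grows linearly in n rather than quadratically, at the price of tokens far apart only communicating through the global positions or across multiple layers.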

For my research project processing longer documents, I hit the wall at 512 tokens and had to chunk with overlap, losing cross-chunk context. The paper does not mention this pain. Production does.
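
Chunking with overlap can be sketched in a few lines. This is a minimal version under assumptions of my choosing: `max_len=512` and `overlap=64` are illustrative defaults, the input is an already-tokenized list of ids, and room for special tokens such as [CLS] and [SEP] is not reserved.

```python
def chunk_with_overlap(token_ids, max_len=512, overlap=64):
    """Split a token list into windows of max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    stride = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return chunks

doc = list(range(1200))  # stand-in for 1200 token ids
chunks = chunk_with_overlap(doc)
print([len(c) for c in chunks])
```

The overlap softens the boundary problem but does not remove it: a token in the first chunk still never attends to a token in the last one.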

BERT vs GPT-2: Same Architecture, Different Religion

Both use Transformer blocks. The difference is training objective and architecture variant.

BERT (2018): Encoder-only. Trained with masked language modeling (predict hidden tokens) and next sentence prediction. Bidirectional context. Fine-tune for classification, NER, QA.

GPT-2 (2019): Decoder-only. Trained with causal language modeling (predict next token). Left-to-right only. Fine-tune for generation, or prompt without fine-tuning.

Aspect	BERT	GPT-2
Architecture	Encoder-only	Decoder-only
Attention	Bidirectional	Causal (masked)
Pre-training	MLM + NSP	CLM
Best for	Understanding tasks	Generation
Adaptation	Fine-tuning	Fine-tuning or prompting
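
The two rows that matter most, attention pattern and pre-training objective, can be shown as a toy sketch. The masking scheme here is deliberately simplified: BERT's actual recipe selects 15% of tokens and then replaces only most of them with [MASK], and next sentence prediction is left out entirely. All function names are illustrative.

```python
import random

def bidirectional_mask(n):
    """BERT-style: every token may attend to every other token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style: token i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def clm_pairs(tokens):
    """Causal LM: the target at each position is simply the next token."""
    return tokens[:-1], tokens[1:]

def mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified masked LM: hide some tokens, predict only the hidden ones."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)    # loss is computed at this position
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets

sentence = "the cat sat on the mat".split()
print(clm_pairs(sentence))
print(mlm_example(sentence, mask_prob=0.3))
for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])
```

Causal LM gets a training signal from every position, while masked LM only learns from the masked fraction but sees context on both sides.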

The ideological split: is language understanding best served by bidirectional context (BERT) or by generative modeling that must implicitly learn understanding to predict well (GPT)?

By 2021 the answer was leaning GPT. Scaling decoder-only models produced emergent capabilities BERT’s architecture could not match. BERT still wins on extractive QA and classification with limited data. GPT wins when you have compute and want one model to do everything.

I used BERT-base for a text classification task with 2000 labeled examples. It worked. I tried GPT-2 for the same task with prompting. It kind of worked. With 2000 examples, BERT was the correct engineering choice. The paper’s lesson is not “always use the biggest model.” It is “match architecture to objective and data scale.”

Multi-Head Attention: What the Heads Learn

The paper uses h=8 heads with d_model=512, so d_k=d_v=64 per head. Each head learns different attention patterns. Visualizations in follow-up work show:

Heads that attend to previous/next token (syntax)
Heads that attend to matching brackets or delimiters
Heads that attend to coreferent entities

You do not design these patterns. They emerge from training. This is the Transformer’s real superpower: flexible relational inductive bias without hand-crafted linguistic features.
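
The head arithmetic is worth doing once by hand. The sketch below counts projection weights (biases ignored) for h heads of size d_k = d_model / h and shows how a token vector is split across heads. The function names are illustrative, and the split assumes d_model divides evenly by the number of heads.

```python
def projection_params(d_model, num_heads):
    """Weights in W_Q, W_K, W_V for all heads plus the output projection W_O."""
    d_k = d_model // num_heads
    per_head = 3 * d_model * d_k             # W_Q, W_K, W_V for one head
    output = (num_heads * d_k) * d_model     # W_O maps concatenated heads back to d_model
    return d_k, num_heads * per_head + output

def split_heads(vector, num_heads):
    """Cut one projected token vector into num_heads contiguous slices."""
    d_k = len(vector) // num_heads
    return [vector[i * d_k:(i + 1) * d_k] for i in range(num_heads)]

for h in (1, 8):
    d_k, total = projection_params(512, h)
    print(f"heads={h}  d_k={d_k}  projection weights={total:,}")

heads = split_heads(list(range(512)), 8)
print(len(heads), "heads of size", len(heads[0]))
```

Eight heads of size 64 cost the same number of projection weights as one head of size 512, so the extra attention patterns come at roughly no parameter cost.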

Implementation detail that tripped me:

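import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v); mask == 0 hides a position
    d_k = Q.size(-1)
    # Divide by sqrt(d_k) so score magnitude does not grow with head dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V), attn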

The scaling by sqrt(d_k) keeps the dot products from growing with dimension and saturating the softmax. Small detail. Without it, gradients through the softmax become tiny and training can stall or turn unstable. Papers omit these details. Code does not.
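
The saturation effect is easy to reproduce without a GPU. This sketch draws a random query and random keys with unit-variance entries (an assumption that roughly matches freshly initialized projections) and compares the largest attention weight with and without the sqrt(d_k) division. The function names and the choice of 16 keys are illustrative.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(d_k, num_keys=16, scale=True, seed=0):
    """Largest softmax weight for one random query against random keys."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(num_keys)]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return max(softmax(scores))

for d_k in (4, 64, 512):
    unscaled = max_attention_weight(d_k, scale=False)
    scaled = max_attention_weight(d_k, scale=True)
    print(f"d_k={d_k:3d}  max weight unscaled={unscaled:.3f}  scaled={scaled:.3f}")
```

Unscaled dot products have variance around d_k, so at large d_k one key tends to take nearly all the weight and the rest receive almost no gradient.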

Positional Encodings: Sinusoidal vs Learned

The original paper uses fixed sinusoidal encodings. Later models (GPT, BERT) use learned positional embeddings. Relative position encodings (Transformer-XL, T5) generalize better to longer sequences than absolute positions.

In 2021 this was still an active research area. For fine-tuning BERT on short texts, it barely matters. For extrapolating to longer sequences than training, it matters enormously.
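
For reference, the fixed sinusoidal scheme from the original paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal pure-Python sketch follows; the function name is an illustrative choice and an even d_model is assumed.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Positional encoding vector for one position, interleaving sin and cos."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    encoding = []
    for even_index in range(0, d_model, 2):
        # even_index plays the role of 2i in the paper's formula
        angle = position / (10000 ** (even_index / d_model))
        encoding.append(math.sin(angle))   # dimension 2i
        encoding.append(math.cos(angle))   # dimension 2i + 1
    return encoding

print([round(v, 3) for v in sinusoidal_encoding(0, 8)])
print([round(v, 3) for v in sinusoidal_encoding(1, 8)])
# The function is defined for any position, including ones never seen in training.
print(len(sinusoidal_encoding(5000, 512)))
```

Being computable at any position is what separates this from a learned embedding table, which simply has no row for position 513 if training stopped at 512. Whether the model then behaves well at those unseen positions is a separate question.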

Three Years Later: What Held Up

Still true:

Attention as the core primitive for sequence modeling
Parallelization advantage over RNNs
Transfer learning via pre-train then fine-tune (or prompt)
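Residual connections and layer normalization around each sublayer (though GPT-2 and most later models moved the normalization in front of the sublayer, pre-LN)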

Underestimated in 2017:

Scale. The paper’s largest model was 213M parameters (Transformer-big). GPT-3 is 175B.
Decoder-only dominance for general-purpose models
Prompting as an interface replacing fine-tuning
Multimodal extension (Vision Transformer, CLIP, DALL-E)

Overestimated:

Encoder-decoder as the default architecture (decoder-only won for LLMs)
Need for task-specific architectures (one big model eats them)
Efficiency at long context (still painful in 2021)

What I Tell Juniors Who Ask “Should I Read the Paper?”

Yes. Read the original. Not just the annotated blog version.

Read it for:

Scaled dot-product attention definition
Why multi-head instead of single head with larger dimension
Encoder-decoder cross-attention for seq2seq

Skip detailed derivation of learning rate schedule unless you are reproducing training. Nobody uses the exact warmup from 2017 anymore.

Then read BERT and GPT-2 papers to see how the architecture forked. Then read one efficient attention paper (Longformer or Performer) to understand the O(n^2) mitigation landscape.

The Honest Assessment

“Attention Is All You Need” is one of the most impactful ML papers ever written. It is also, by 2021 standards, an incomplete blueprint for modern LLMs. No RLHF, no scaling laws, no emergent abilities, no prompting, no safety considerations.

Three years later the title was prophetic and slightly wrong. Attention is most of what you need. You also need 10^23 FLOPs, a data pipeline, and a product team.
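
The 10^23 figure can be sanity-checked with a common rule of thumb from the scaling-laws literature: training compute is roughly 6 × parameters × training tokens. Both the rule and the roughly 300 billion training tokens reported for GPT-3 are assumptions brought in from outside this post, so treat the result as an order-of-magnitude estimate.

```python
def training_flops(num_params, num_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens

# GPT-3: 175B parameters, roughly 300B training tokens (assumed from its paper).
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")
```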

But it started here. Every LLM you use in 2024 traces back to this architecture. Worth understanding properly, not just as a buzzword on a pitch deck.
