Attention Is All You Need: Three Years Later

In June 2017, eight Google researchers published “Attention Is All You Need” and quietly detonated the entire field of natural language processing. This post is a technical revisit of the paper that started it all, written from the perspective of someone who has read it five times and understood it properly on the fourth. By November 2021, a little over four years after publication, I was implementing attention mechanisms for a side research project while the rest of the world debated whether GPT-3 was AGI.

Transformers are not magic. They are weighted sum machines with a clever inductive bias. But that bias turned out to be exactly what language needed.

What the Paper Actually Proposed

Before Transformers, sequence modeling meant RNNs (LSTM, GRU) or CNNs (ByteNet, ConvS2S). RNNs process tokens one at a time, so training is slow: there is little to parallelize within a sequence. Long-range dependencies still fade as gradients flow back through many time steps, despite LSTM gates.

The Transformer throws recurrence away. Every token attends to every other token in parallel. Position information comes from positional encodings, not from sequential processing.

Core components:

flowchart TB
    subgraph Encoder
        E_IN[Input embeddings + Positional encoding] --> E_ATTN[Multi-Head Self-Attention]
        E_ATTN --> E_ADD1[Add and Norm]
        E_ADD1 --> E_FFN[Feed Forward]
        E_FFN --> E_ADD2[Add and Norm]
        E_ADD2 --> E_OUT[Encoder output]
    end

    subgraph Decoder
        D_IN[Output embeddings + Positional encoding] --> D_MASK[Masked Self-Attention]
        D_MASK --> D_ADD1[Add and Norm]
        D_ADD1 --> D_CROSS[Cross-Attention to Encoder]
        E_OUT --> D_CROSS
        D_CROSS --> D_ADD2[Add and Norm]
        D_ADD2 --> D_FFN[Feed Forward]
        D_FFN --> D_ADD3[Add and Norm]
        D_ADD3 --> D_SOFT[Linear + Softmax]
        D_SOFT --> D_OUT[Output probabilities]
    end

The diagram is the whole paper. Everything else is engineering details and training tricks.

The O(n^2) Elephant in the Room

Self-attention computes pairwise interactions between all tokens. For sequence length n, each head builds an n-by-n score matrix, so attention is O(n^2) in both compute and memory.

The 2017 paper used sequences up to a few hundred tokens. Fine for translation. In 2021:

At n=4096, the attention matrix has 16 million entries per head per layer. Multiply by batch size, heads, and layers. Your GPU weeps.
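To put a number on the weeping, here is a back-of-the-envelope sketch. The function name and the configuration (12 layers, 12 heads, batch of 8, 4-byte floats) are my own illustrative assumptions, not figures from the paper, and the estimate counts only the attention score matrices, assuming all of them are held in memory at once.

```python
def attention_score_bytes(seq_len, num_heads, num_layers, batch_size, bytes_per_value=4):
    """Bytes needed to hold every attention score matrix at once.

    Assumes one (seq_len x seq_len) matrix per head, per layer, per batch
    element, all kept in memory (no recomputation, no sparsity).
    """
    entries_per_matrix = seq_len * seq_len
    total_entries = entries_per_matrix * num_heads * num_layers * batch_size
    return total_entries * bytes_per_value


# Illustrative configuration (assumed): 12 layers, 12 heads, batch of 8, fp32.
for n in (512, 4096):
    gib = attention_score_bytes(n, num_heads=12, num_layers=12, batch_size=8) / 2**30
    print(f"n={n}: {n * n:,} entries per matrix, {gib:.1f} GiB of attention scores")
```

Under these assumptions, going from n=512 to n=4096 multiplies the score memory by 64, from about 1.1 GiB to 72 GiB. That factor of 64 is the quadratic term in action.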

The research response in 2021 included:

None of these fully solved the problem. They traded exactness for scalability. The Transformer ate NLP anyway because n=512 covers most profitable use cases and GPUs got bigger.

For my research project processing longer documents, I hit the wall at 512 tokens and had to chunk with overlap, losing cross-chunk context. The paper does not mention this pain. Production does.
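For anyone who has not had to do it, chunking with overlap looks roughly like this. It is a minimal sketch, not my project code: `chunk_with_overlap` and the overlap of 64 tokens are hypothetical choices, and it ignores the special tokens a real tokenizer would add to each chunk.

```python
def chunk_with_overlap(token_ids, max_len=512, overlap=64):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so text near a boundary
    appears in two chunks. Context farther apart than one window is still lost.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    stride = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride
    return chunks


document = list(range(1200))  # stand-in for real token ids
chunks = chunk_with_overlap(document, max_len=512, overlap=64)
print([(c[0], c[-1]) for c in chunks])  # first and last token id of each chunk
```

The overlap softens the boundary problem but does not remove it: a token in the first chunk never attends to a token in the last one.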

BERT vs GPT-2: Same Architecture, Different Religion

Both use Transformer blocks. The differences are the training objective and the architecture variant.

BERT (2018): Encoder-only. Trained with masked language modeling (predict hidden tokens) and next sentence prediction. Bidirectional context. Fine-tune for classification, NER, QA.

GPT-2 (2019): Decoder-only. Trained with causal language modeling (predict next token). Left-to-right only. Fine-tune for generation, or prompt without fine-tuning.
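The two objectives are easier to compare side by side as training examples. This is a toy sketch on word-level tokens: the helper names are hypothetical, and the masked positions are passed in by hand, whereas BERT samples them at random and sometimes substitutes a random or unchanged token instead of the mask symbol.

```python
MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own mask token


def causal_lm_example(tokens):
    """Causal LM: at each position, the target is the next token."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets


def masked_lm_example(tokens, masked_positions):
    """Masked LM: hide chosen positions and predict only those positions.

    Simplification: positions are given explicitly rather than sampled.
    """
    inputs = [MASK if i in masked_positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return inputs, targets


sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(causal_lm_example(sentence))
print(masked_lm_example(sentence, {1, 4}))
```

The causal version gets a training signal at every position but only ever sees the left context. The masked version sees both sides but learns only from the hidden positions.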

| Aspect | BERT | GPT-2 |
| --- | --- | --- |
| Architecture | Encoder | Decoder |
| Attention | Bidirectional | Causal (masked) |
| Pre-training | MLM + NSP | CLM |
| Best for | Understanding tasks | Generation |
| Fine-tuning | Standard | Prompting emerges |
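The "Bidirectional" versus "Causal (masked)" distinction comes down to which positions each token is allowed to look at. A minimal sketch with plain lists, using 1 for allowed and 0 for blocked (the same convention as the `mask == 0` check in the attention function later in this post); the function names are mine, and padding masks are left out.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (encoder-style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i (decoder-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for row in causal_mask(4):
    print(row)
```

The causal mask is lower-triangular: row i has ones up to and including column i, which is what stops a decoder from peeking at the token it is trying to predict.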

The ideological split: is language understanding best served by bidirectional context (BERT) or by generative modeling that must implicitly learn understanding to predict well (GPT)?

By 2021 the answer was leaning GPT. Scaling decoder-only models produced emergent capabilities BERT’s architecture could not match. BERT still wins on extractive QA and classification with limited data. GPT wins when you have compute and want one model to do everything.

I used BERT-base for a text classification task with 2000 labeled examples. It worked. I tried GPT-2 for the same task with prompting. It kind of worked. With 2000 examples, BERT was the correct engineering choice. The paper’s lesson is not “always use the biggest model.” It is “match architecture to objective and data scale.”

Multi-Head Attention: What the Heads Learn

The paper uses h=8 heads with d_model=512, so d_k=d_v=64 per head. Each head learns different attention patterns. Visualizations in follow-up work show:

You do not design these patterns. They emerge from training. This is the Transformer’s real superpower: flexible relational inductive bias without hand-crafted linguistic features.
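The head arithmetic (d_model=512 split across h=8 heads of 64 dimensions each) can be sketched without any framework. This shows only the slicing; the learned projections that come before and after it in a real layer are omitted, and `split_heads` and `merge_heads` are hypothetical names.

```python
def split_heads(vector, num_heads):
    """Split one d_model-dimensional vector into num_heads equal slices."""
    d_model = len(vector)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [vector[h * d_head:(h + 1) * d_head] for h in range(num_heads)]


def merge_heads(heads):
    """Concatenate per-head slices back into one d_model-dimensional vector."""
    return [value for head in heads for value in head]


token_vector = [float(i) for i in range(512)]  # stand-in for a projected token
heads = split_heads(token_vector, num_heads=8)
print(len(heads), len(heads[0]))
```

Each head runs attention on its own 64-dimensional slice, and the results are concatenated back to 512 dimensions. Total width stays the same; only the number of independent attention patterns changes.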

Implementation detail that tripped me:

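import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v); mask: 0 marks blocked positions
    d_k = Q.size(-1)
    # Scale the dot products so their magnitude does not grow with d_k
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # A large negative score becomes a near-zero weight after softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V), attn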

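The scaling by sqrt(d_k) prevents softmax saturation as dimension grows. Small detail. Without it, the dot products grow with d_k, softmax collapses toward a one-hot distribution with vanishing gradients, and training becomes unstable. Papers tend to bury details like this. Code cannot.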
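You can watch the saturation happen without training anything. The sketch below is a toy experiment under an assumption of mine: query and key components are drawn independently from a standard normal distribution, which real activations only loosely resemble. It compares the largest softmax weight with and without the sqrt(d_k) scaling.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def random_dot_products(d_k, num_keys, rng):
    """Dot products between one random query and num_keys random keys,
    with every component drawn from a standard normal distribution."""
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(num_keys)]
    return [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]


rng = random.Random(0)
for d_k in (4, 64, 512):
    scores = random_dot_products(d_k, num_keys=10, rng=rng)
    raw = softmax(scores)
    scaled = softmax([s / math.sqrt(d_k) for s in scores])
    print(f"d_k={d_k}: max weight unscaled={max(raw):.3f}, scaled={max(scaled):.3f}")
```

Dividing by sqrt(d_k) acts like a softmax temperature, so the scaled distribution is never sharper than the unscaled one. With unit-variance components the raw scores have a spread on the order of sqrt(d_k), which is why the unscaled weights drift toward one-hot as d_k grows.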

Positional Encodings: Sinusoidal vs Learned

The original paper uses fixed sinusoidal encodings. Later models (GPT, BERT) use learned positional embeddings. Relative position encodings (Transformer-XL, T5) generalize better to longer sequences than absolute positions.
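For reference, the fixed sinusoidal scheme fits in a few lines. This is a plain-Python sketch of the paper's formula, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the tiny d_model=8 in the usage line are my own choices for illustration.

```python
import math


def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of fixed positional encodings.

    Even dimensions use sine, odd dimensions use cosine, and each sin/cos
    pair shares a wavelength that grows geometrically with the dimension index.
    """
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


pe = sinusoidal_encoding(max_len=50, d_model=8)
print([round(v, 3) for v in pe[1]])
```

Nothing here is learned, so the table can be computed for any position, including positions longer than anything seen in training. Whether the model does anything sensible with those positions is a separate question.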

In 2021 this was still an active research area. For fine-tuning BERT on short texts, it barely matters. For extrapolating to longer sequences than training, it matters enormously.

Three Years Later: What Held Up

Still true:

Underestimated in 2017:

Overestimated:

What I Tell Juniors Who Ask “Should I Read the Paper?”

Yes. Read the original. Not just the annotated blog version.

Read it for:

Skip the detailed derivation of the learning rate schedule unless you are reproducing training. Nobody uses the exact warmup from 2017 anymore.

Then read BERT and GPT-2 papers to see how the architecture forked. Then read one efficient attention paper (Longformer or Performer) to understand the O(n^2) mitigation landscape.

The Honest Assessment

“Attention Is All You Need” is one of the most impactful ML papers ever written. It is also, by 2021 standards, an incomplete blueprint for modern LLMs. No RLHF, no scaling laws, no emergent abilities, no prompting, no safety considerations.

Three years later the title was prophetic and slightly wrong. Attention is most of what you need. You also need 10^23 FLOPs, a data pipeline, and a product team.

But it started here. Every LLM you use in 2024 traces back to this architecture. Worth understanding properly, not just as a buzzword on a pitch deck.
