Transformers from First Principles — Part 2: What Scale Reveals

Beyond the Basic Architecture

In Part 1 we built the Transformer block from scratch — self-attention, multi-head attention, positional encoding, FFN, residuals. That was the 2017 architecture. It works, but when you scale it up to billions of parameters, interesting things happen and practical problems appear.

This post covers what scale reveals about transformers: attention patterns that emerge naturally, modern architectural improvements, and efficiency tricks that make training and inference feasible at LLM scale.

Sparse Attention Emergence

One of the more surprising empirical findings is that transformers develop sparse attention patterns during training — most attention weights end up near zero, with each token focusing on just a few other tokens.

A NeurIPS 2025 oral paper (“Sparse Attention Emergence in Transformers”) studied when and how this happens. The key findings:

Emergence follows power laws. The timing of when attention becomes sparse depends on the task, architecture, and optimizer, but the relationship follows power laws in each case. Larger models become sparse earlier in training. Harder tasks delay sparsity.

Sparsity is not uniform. Some heads become very sparse (attending to 1-3 tokens), while others maintain broad attention. This isn’t random — it reflects functional specialization.

This matters because it tells us that dense attention (every token attending to every token equally) is a starting condition, not the trained behavior. The model learns to be selective.
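One way to put a number on "sparse" is the entropy of each attention row. The sketch below is a minimal illustration, not the metric used in the paper: `attention_entropy` and `effective_tokens` are hypothetical helper names, and the "peaked" attention matrix is synthetic rather than taken from a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(weights):
    """Mean Shannon entropy (in nats) over the rows of an attention matrix."""
    w = np.clip(weights, 1e-12, 1.0)  # avoid log(0)
    return float(-(w * np.log(w)).sum(axis=-1).mean())

def effective_tokens(weights):
    """exp(entropy): roughly how many tokens a query spreads its weight over."""
    return float(np.exp(attention_entropy(weights)))

n = 64
# Dense starting condition: every token attends to every token equally.
uniform = softmax(np.zeros((n, n)))
# Synthetic "trained" pattern: sharp logits, so a few keys win each row.
rng = np.random.default_rng(0)
peaked = softmax(10.0 * rng.normal(size=(n, n)))

print(effective_tokens(uniform))  # about 64, i.e. all tokens
print(effective_tokens(peaked))   # far fewer than 64
```

Tracking a quantity like this per head over training steps is one simple way to see when a head goes from dense to selective.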

Attention Head Specialization

Another NeurIPS 2025 paper looked at what individual attention heads actually learn to do. They found that heads specialize in specific semantic and structural roles:

Local heads: attend primarily to nearby tokens (capturing syntax, local dependencies)
Global heads: attend to distant tokens (long-range semantic relationships)
Positional heads: attend to specific relative positions (e.g., always look 1 token back)
Attribute heads: track specific semantic properties across the sequence

This isn’t designed; it emerges from training. Each head in a multi-head attention layer ends up doing something different, which is exactly the intuition behind multi-head attention, except now there is empirical evidence for it.
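A rough way to tell local heads from global ones is the attention-weighted distance between query and key positions. This is a sketch under assumptions: `mean_attention_distance` is a hypothetical helper, and the two heads below are hand-built toy patterns, not heads extracted from a real model.

```python
import numpy as np

def mean_attention_distance(weights):
    """weights: (heads, n, n) with rows summing to 1.
    Returns, per head, the average |query_pos - key_pos| weighted by attention."""
    n = weights.shape[-1]
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])          # (n, n)
    return (weights * dist).sum(axis=-1).mean(axis=-1)  # (heads,)

n = 16
# Toy positional head: always look 1 token back (position 0 looks at itself).
prev = np.zeros((n, n))
prev[0, 0] = 1.0
prev[np.arange(1, n), np.arange(0, n - 1)] = 1.0

# Toy broad head: uniform attention over all earlier positions (causal).
broad = np.tril(np.ones((n, n)))
broad /= broad.sum(axis=-1, keepdims=True)

distances = mean_attention_distance(np.stack([prev, broad]))
print(distances)  # small value for the positional head, larger for the broad one
```

On real models you would average this over many inputs; a single sequence can make a head look more local or global than it usually is.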

The practical implication: if you understand what each head does, you can edit model behavior by intervening on specific heads. This is one of the building blocks of mechanistic interpretability, which we’ll cover in a later post.

Gated Attention

A NeurIPS 2025 Best Paper (“Gated Attention for Large Language Models”) proposed a small but effective modification to the attention mechanism: add a head-specific learnable gate after the scaled dot-product attention.

The idea is simple. After computing attention for each head:

$$\text{head}_i = g_i \odot \text{Attention}(Q_i, K_i, V_i)$$

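where $g_i$ is a learned sigmoid gate. Because the sigmoid outputs values in $(0, 1)$, the gate scales each element of the head’s attention output anywhere between fully suppressed and passed through almost unchanged. It operates element-wise on the attention output.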

They tested 30+ variants and found that this gating consistently improved performance across model sizes. The intuition: not every head is useful for every input, and the gate lets the model dynamically adjust which heads contribute.
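Here is a minimal NumPy sketch of one gated head. It assumes the gate is computed from the layer input through a per-head projection (`Wg` is a hypothetical parameter name) and passed through a sigmoid; the paper explores many variants, so treat this as one plausible form rather than the exact configuration they recommend.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_head(x, Wq, Wk, Wv, Wg):
    """One attention head followed by an element-wise sigmoid gate.
    x: (n, d_model); Wq, Wk, Wv, Wg: (d_model, d_head)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn_out = softmax(scores) @ v   # standard scaled dot-product attention
    gate = sigmoid(x @ Wg)           # assumed gate form, values in (0, 1)
    return gate * attn_out, attn_out, gate

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 16, 4
x = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wg = (rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
                  for _ in range(4))

gated, ungated, gate = gated_head(x, Wq, Wk, Wv, Wg)
print(gated.shape)  # (5, 4)
```

Because the gate depends on the input, the same head can be nearly switched off for one token and nearly fully on for another.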

Of course this works. The model already learns to specialize heads — giving it an explicit mechanism to modulate their influence just makes the specialization more effective.

Modern Positional Encodings: RoPE

The sinusoidal positional encodings from the original paper work, but modern LLMs have largely moved to Rotary Position Embeddings (RoPE) (Su et al., 2021).

RoPE applies position information as a rotation in the embedding space rather than an addition. For a 2D subspace of the embedding, at position $m$:

$$\text{RoPE}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}$$

This rotation is applied independently to each pair of dimensions, with its own frequency $\theta$ per pair. The key property: when you compute the attention score between positions $m$ and $n$, the rotations combine such that the score depends only on the relative position $m - n$:

$$q_m^T k_n = (R_m x_m)^T (R_n x_n) = x_m^T R_m^T R_n x_n = x_m^T R_{n-m} x_n$$

This gives you relative positional information naturally, without separate relative position embeddings or biases. And it composes well with the attention mechanism — you just rotate the queries and keys before computing attention.
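The relative-position property is easy to check numerically. The sketch below rotates interleaved dimension pairs and uses the frequency schedule $\theta_i = 10000^{-2i/d}$ from the RoPE paper; the function name `rope` and the interleaved pairing are choices made here for illustration (real implementations differ in how they pair dimensions).

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x (shape (d,), d even) by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2D rotation, first component
    out[1::2] = x1 * sin + x2 * cos            # 2D rotation, second component
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

s_near = rope(q, 3) @ rope(k, 7)       # positions 3 and 7, offset 4
s_far = rope(q, 103) @ rope(k, 107)    # positions 103 and 107, same offset
s_other = rope(q, 3) @ rope(k, 8)      # offset 5

print(np.isclose(s_near, s_far))    # True: only the offset matters
print(np.isclose(s_near, s_other))  # False for generic q and k
```

Rotations also preserve vector norms, so RoPE changes the direction of queries and keys without changing their scale.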

Why RoPE won over sinusoidal:

Better extrapolation to sequence lengths longer than training
Cleaner integration with attention (rotation vs. addition)
Works well with KV caching during inference

Nearly all modern LLMs (LLaMA, Mistral, Qwen, etc.) use RoPE.

Efficient Attention: FlashAttention

Standard self-attention has $O(n^2)$ memory complexity because you materialize the full $n \times n$ attention matrix. For sequence length 8192, that’s a 67M-entry matrix per head. At 32 heads, that’s over 2 billion floats just for attention scores. Not great.
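The arithmetic behind those numbers, as a quick sanity check. The 2-byte element size assumes fp16 or bf16 scores, which is an assumption added here, and the totals are per layer and per sequence.

```python
n, heads = 8192, 32

entries_per_head = n * n                 # one n x n score matrix
total_entries = entries_per_head * heads
bytes_fp16 = total_entries * 2           # assuming 2 bytes per score

print(entries_per_head)       # 67108864
print(total_entries)          # 2147483648
print(bytes_fp16 / 2**30)     # 4.0 (GiB)
```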

FlashAttention (Dao et al., 2022) doesn’t change what is computed — it changes how. The key insight: GPUs are bottlenecked by memory I/O, not compute. Reading and writing the large attention matrix to GPU memory (HBM) is the actual bottleneck.

FlashAttention computes attention in tiles, keeping intermediate results in fast SRAM (on-chip memory) and never materializing the full attention matrix in HBM. The algorithm:

Split Q, K, V into blocks that fit in SRAM
For each block of Q, iterate over blocks of K and V
Compute attention scores and accumulate the output incrementally
Use the online softmax trick (keeping running max and sum) to compute exact softmax without needing all scores at once

The result: exact same output as standard attention, but 2-4x faster and with $O(n)$ memory instead of $O(n^2)$.
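The tiling and online softmax steps can be sketched in plain NumPy. This only demonstrates the arithmetic: the real FlashAttention is a fused GPU kernel that manages SRAM explicitly, and a NumPy loop like this will not be faster than the naive version. The function names and the block size are illustrative choices.

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])   # full n x n matrix is materialized
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    for qs in range(0, n, block):
        q = Q[qs:qs + block]
        m = np.full(q.shape[0], -np.inf)           # running row max
        l = np.zeros(q.shape[0])                   # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[-1]))  # unnormalized output
        for ks in range(0, K.shape[0], block):
            s = q @ K[ks:ks + block].T / np.sqrt(d)  # only a block of scores
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)              # rescale old stats to new max
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]      # normalize once at the end
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

The only state carried across key blocks is `m`, `l`, and `acc`, which is why memory grows linearly with sequence length instead of quadratically.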

This matters a lot. FlashAttention is what makes training with sequence lengths of 8K, 32K, 128K+ feasible. Without it, context windows would be much shorter.

Grouped-Query Attention (GQA)

Standard multi-head attention has separate K, V projections for each head. During autoregressive generation, you cache all these K, V pairs (the “KV cache”). With many heads, this cache gets large.

Grouped-Query Attention (Ainslie et al., 2023) shares K, V projections across groups of heads. If you have 32 heads and 8 KV groups, each group of 4 query heads shares one K and one V projection.

The extreme case: Multi-Query Attention (MQA), where all heads share one K, V. GQA interpolates between MQA and full multi-head attention.

$$\text{GQA-head}_{i} = \text{Attention}(Q_i, K_{g(i)}, V_{g(i)})$$

where $g(i) = \lfloor i / G \rfloor$ maps query head $i$ to its KV group and $G$ is the number of query heads per group ($G = 4$ in the example above).

This reduces KV cache size by $G\times$ during inference with minimal quality loss. Most production LLMs (LLaMA 2/3, Mistral, Gemma) use GQA.
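A minimal sketch of the 32-head, 8-group example in NumPy. It leaves out causal masking and the output projection for brevity, and the function name `gqa` and the weight shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """x: (n, d_model). Wq: (d_model, n_q_heads * d_head);
    Wk, Wv: (d_model, n_kv_heads * d_head)."""
    n = x.shape[0]
    d_head = Wq.shape[1] // n_q_heads
    group_size = n_q_heads // n_kv_heads          # query heads per KV group
    q = (x @ Wq).reshape(n, n_q_heads, d_head)
    k = (x @ Wk).reshape(n, n_kv_heads, d_head)   # this is what gets cached
    v = (x @ Wv).reshape(n, n_kv_heads, d_head)
    outs = []
    for i in range(n_q_heads):
        g = i // group_size                       # group index for head i
        scores = q[:, i] @ k[:, g].T / np.sqrt(d_head)
        outs.append(softmax(scores) @ v[:, g])
    return np.concatenate(outs, axis=-1), k, v

rng = np.random.default_rng(0)
n, d_model, d_head, n_q_heads, n_kv_heads = 6, 16, 4, 32, 8
x = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(d_model, n_q_heads * d_head))
Wk = rng.normal(size=(d_model, n_kv_heads * d_head))
Wv = rng.normal(size=(d_model, n_kv_heads * d_head))

out, k_cache, v_cache = gqa(x, Wq, Wk, Wv, n_q_heads, n_kv_heads)
mha_cache_entries = n * n_q_heads * d_head        # what full MHA would cache for K
print(out.shape)                                  # (6, 128)
print(mha_cache_entries // k_cache.size)          # 4
```

Setting `n_kv_heads = 1` gives MQA and `n_kv_heads = n_q_heads` recovers standard multi-head attention.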

Mixture of Experts (MoE)

The FFN in each transformer block is typically the most parameter-heavy component. Mixture of Experts replaces the single FFN with multiple “expert” FFNs and a gating network that routes each token to a subset of experts:

$$\text{MoE-FFN}(x) = \sum_{i=1}^{E} g_i(x) \cdot \text{FFN}_i(x)$$

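where $g(x) = \text{TopK}(\text{softmax}(W_g x))$ keeps the gate values of the top-$K$ experts for each token and sets the rest to zero, so only $K$ of the $E$ expert FFNs actually have to be evaluated.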

Typically $K = 2$ out of $E = 8$ or $E = 16$ experts. This means each token only activates a small fraction of the total parameters, giving you a much larger model (more total parameters = more knowledge capacity) at roughly the per-token compute of a dense model whose FFN is the size of $K$ experts.
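The routing formula for a single token can be sketched as follows. The experts here are small random ReLU FFNs, and `make_expert` and `moe_ffn` are hypothetical names. The gate values are used exactly as in the formula above; some implementations renormalize the selected gates so they sum to 1, which this sketch does not do.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def make_expert(rng, d, d_ff):
    """A toy two-layer ReLU FFN with random weights."""
    W1 = rng.normal(size=(d_ff, d)) / np.sqrt(d)
    W2 = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff)
    return lambda x: W2 @ np.maximum(W1 @ x, 0.0)

def moe_ffn(x, Wg, experts, k=2):
    """x: (d,), Wg: (E, d), experts: list of E callables."""
    probs = softmax(Wg @ x)              # router distribution over experts
    top = np.argsort(probs)[-k:]         # indices of the k largest gates
    gates = np.zeros_like(probs)
    gates[top] = probs[top]              # TopK: keep selected, zero the rest
    # Only the selected experts are ever evaluated.
    out = sum(gates[i] * experts[i](x) for i in top)
    return out, gates

rng = np.random.default_rng(0)
d, d_ff, E = 8, 32, 8
experts = [make_expert(rng, d, d_ff) for _ in range(E)]
Wg = rng.normal(size=(E, d))
x = rng.normal(size=d)

out, gates = moe_ffn(x, Wg, experts, k=2)
print(np.count_nonzero(gates))  # 2
```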

Mixtral 8x7B (Mistral’s MoE model) has 46.7B total parameters but only activates ~12.9B per token. You get the knowledge capacity of a large model at roughly the per-token compute of a small one, although all experts still have to be held in memory.

The challenges: load balancing (you want tokens distributed evenly across experts, not all going to the same one, which is usually encouraged with an auxiliary loss), expert collapse (some experts never get used), and the gating decision itself (top-$K$ routing is discrete, so gradients only reach the experts that were selected).
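One common form of the auxiliary loss comes from the Switch Transformer paper (Fedus et al., 2021): multiply, per expert, the fraction of tokens routed to it by its mean router probability, sum over experts, and scale by the number of experts. The sketch below uses top-1 assignment as in that paper; `load_balance_loss` and `make_probs` are hypothetical helper names, and the router probabilities are made up for the example.

```python
import numpy as np

def load_balance_loss(router_probs, alpha=1.0):
    """router_probs: (tokens, experts), rows sum to 1."""
    T, E = router_probs.shape
    top1 = router_probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=E) / T  # fraction of tokens per expert
    P = router_probs.mean(axis=0)           # mean router probability per expert
    # f is not differentiable; in training, gradients flow through P.
    return float(alpha * E * np.sum(f * P))

def make_probs(preferred, E=4, hi=0.7):
    """Toy router output: each token puts `hi` on its preferred expert."""
    lo = (1.0 - hi) / (E - 1)
    probs = np.full((len(preferred), E), lo)
    probs[np.arange(len(preferred)), preferred] = hi
    return probs

balanced = make_probs(np.arange(8) % 4)          # tokens spread over 4 experts
collapsed = make_probs(np.zeros(8, dtype=int))   # every token picks expert 0

print(load_balance_loss(balanced))   # about 1.0, the minimum for uniform load
print(load_balance_loss(collapsed))  # about 2.8, penalized
```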

Scaling Laws

One of the most important empirical findings about transformers is that their performance is predictable from scale. Kaplan et al. (2020) showed:

$$L(N) \propto N^{-\alpha}$$

where $L$ is the loss, $N$ is the number of parameters, and $\alpha \approx 0.076$ for language modeling. Similar power laws hold for dataset size and compute.
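To get a feel for how small that exponent is, here is what the power law predicts for the loss ratio when you scale parameters, assuming data and compute are not the bottleneck. `loss_ratio` is a hypothetical helper that just evaluates $N^{-\alpha}$ for a scale factor.

```python
ALPHA = 0.076  # parameter-count exponent quoted above

def loss_ratio(scale, alpha=ALPHA):
    """Predicted L(scale * N) / L(N) under L(N) proportional to N^(-alpha)."""
    return scale ** (-alpha)

print(round(loss_ratio(2), 3))    # 0.949: doubling N cuts loss by about 5%
print(round(loss_ratio(10), 3))   # 0.839: 10x parameters cuts loss by about 16%
```

The gains per doubling are modest but steady, which is exactly what makes them useful for extrapolation.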

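Chinchilla (Hoffmann et al., 2022) refined this: for a given compute budget, there’s an optimal ratio of model size to training data. By that measure, GPT-3 and other large models of its generation were undertrained relative to their size. Chinchilla (70B parameters, about 1.4T training tokens) used the same compute budget as the 4x larger Gopher (280B) and outperformed both Gopher and GPT-3 (175B), while being correspondingly cheaper to run at inference.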
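A back-of-the-envelope comparison makes the "undertrained" point concrete. It uses the published sizes (GPT-3: 175B parameters on about 300B tokens; Chinchilla: 70B parameters on about 1.4T tokens) and the common approximation that training compute is about $6ND$ FLOPs for $N$ parameters and $D$ tokens. That approximation and the helper names are assumptions brought in for this sketch.

```python
def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

def train_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

gpt3 = (175e9, 300e9)          # (parameters, training tokens)
chinchilla = (70e9, 1.4e12)

print(round(tokens_per_param(*gpt3), 1))        # 1.7
print(round(tokens_per_param(*chinchilla), 1))  # 20.0
print(f"{train_flops(*gpt3):.2e}")              # 3.15e+23
print(f"{train_flops(*chinchilla):.2e}")        # 5.88e+23
```

The roughly 20 tokens per parameter that falls out for Chinchilla is often quoted as a rule of thumb for compute-optimal training.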

The implication: you can predict model performance before training. This is why labs can plan training runs costing millions of dollars — they know approximately what they’ll get.

Recent work (2024-2025) shows these power laws hold across dense and sparse (MoE) architectures, though the constants differ. Interestingly, the laws seem to break or change character for reasoning tasks, especially with test-time compute scaling (o1-style models). This is an active research area.

What We Covered

In this two-part series:

Part 1: the core transformer — attention, positional encoding, multi-head attention, the full block
Part 2: what happens at scale — emergent sparsity, head specialization, and the modern efficiency toolkit (RoPE, FlashAttention, GQA, MoE, gated attention)

The transformer is remarkably simple at its core: attention + FFN + residuals. The complexity comes from making it efficient at scale and understanding what it learns. That understanding is becoming increasingly important as we try to align these models — which is where this blog is heading next.

Next up: reinforcement learning foundations, because we need RL before we can understand how LLMs learn from human feedback.
