New Direction

If you’ve been following this blog, you know I’ve spent the last year or so on generative models — VAEs, the ELBO, reparameterization, all of that. That series is done (for now), and I want to pivot toward the models that are actually eating the world: large language models.

But I don’t want to just use them. I want to understand them from the ground up, the same way we derived the ELBO from scratch in the VAE series. So this is the start of a new series, and we’re beginning where every LLM begins: the Transformer architecture.

The paper is “Attention Is All You Need” (Vaswani et al., 2017). If you’ve heard the phrase “self-attention” thrown around and vaguely know it involves queries, keys, and values but aren’t sure why — this post is for you.


The Problem with Sequences

Before transformers, the dominant approach for sequence modeling was recurrent neural networks (RNNs) and their variants (LSTMs, GRUs). These process sequences one token at a time, maintaining a hidden state that gets updated at each step:

$$h_t = f(h_{t-1}, x_t)$$

This works, but it has two fundamental problems:

Sequential computation. You can’t compute $h_t$ until you’ve computed $h_{t-1}$. This means you can’t parallelize across time steps during training, which makes training on long sequences painfully slow.

Long-range dependencies. Information from early tokens has to survive through every intermediate hidden state to influence later tokens. In practice, gradients vanish or explode over long distances. LSTMs and GRUs helped, but they didn’t solve this.
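To make the sequential bottleneck concrete, here is a minimal NumPy sketch of the recurrence above. The choice of $f$ as a tanh layer, the weight names `W_h` and `W_x`, and the toy sizes are my assumptions for illustration, not something a specific RNN paper prescribes.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, h0):
    """Run a vanilla tanh RNN over x of shape (n, d_in); returns hidden states (n, d_h)."""
    h = h0
    states = []
    for x_t in x:
        # Step t needs the hidden state from step t-1,
        # so this loop cannot be parallelized across time steps.
        h = np.tanh(h @ W_h + x_t @ W_x)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n, d_in, d_h = 6, 4, 8                      # toy sizes, chosen arbitrarily
x = rng.normal(size=(n, d_in))
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_in, d_h)) * 0.1
H = rnn_forward(x, W_h, W_x, np.zeros(d_h))
print(H.shape)  # (6, 8)
```

The Python `for` loop is the whole story: no matter how much hardware you have, step $t$ waits on step $t-1$.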

The Transformer’s key insight: what if we could let every token directly attend to every other token, in parallel? No sequential bottleneck. No vanishing gradients through time. That’s what self-attention does.


Self-Attention: The Core Mechanism

Given a sequence of $n$ token embeddings, each of dimension $d$, we stack them into a matrix $X \in \mathbb{R}^{n \times d}$. Self-attention transforms this into a new representation where each token’s output is a weighted combination of information from all tokens in the sequence (itself included).

Queries, Keys, and Values

We project $X$ into three different spaces using learned weight matrices:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$.

The intuition:

  • Query ($Q$): “What am I looking for?”
  • Key ($K$): “What do I contain?”
  • Value ($V$): “What information do I actually carry?”

You can think of it like a soft database lookup. Each token broadcasts a query (“I need context about X”), every token offers a key (“I have information about Y”), and the attention mechanism matches queries to keys to decide how much of each value to retrieve.

The Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let’s unpack this step by step.

Step 1: Compute attention scores. $QK^T \in \mathbb{R}^{n \times n}$ gives us a matrix where entry $(i, j)$ is the dot product between query $i$ and key $j$. Higher dot product means token $i$ is more “interested” in token $j$.

Step 2: Scale. We divide by $\sqrt{d_k}$. Why? When $d_k$ is large, the dot products grow in magnitude. If the components of $q$ and $k$ are independent random variables with zero mean and unit variance, then $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$. Large dot products push the softmax into saturated regions where gradients are extremely small. Dividing by $\sqrt{d_k}$ brings the variance back to 1.
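If you’d rather see the variance claim than take it on faith, a quick empirical check is enough. This is a sketch under the assumption that the components of $q$ and $k$ are independent standard normals; `d_k = 64` and the sample count are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_samples = 20_000

# Assumption: every component is an independent draw with mean 0 and variance 1.
q = rng.standard_normal((num_samples, d_k))
k = rng.standard_normal((num_samples, d_k))

dots = np.sum(q * k, axis=1)                 # one dot product per sample
raw_var = dots.var()                         # should land near d_k
scaled_var = (dots / np.sqrt(d_k)).var()     # should land near 1
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

The raw variance should come out close to $d_k$, and the scaled one close to 1, up to sampling noise.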

Step 3: Softmax. Applied row-wise, this converts each row of scores into a probability distribution. Token $i$ now has a distribution over all tokens (itself included) that says how much attention to pay to each one.

Step 4: Weighted sum. Multiply the attention weights by $V$. Each token’s output is a weighted combination of all value vectors, with weights determined by the query-key compatibility.

The result: every token in the output “knows about” every other token in the input, weighted by relevance. And the whole thing is a couple of matrix multiplications plus a row-wise softmax, so it is fully parallelizable across tokens.
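The four steps above can be sketched in a few lines of NumPy. Treat this as an illustrative single-head version with randomly initialized weights; the function names and toy sizes are mine, not from the paper.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                      # Step 1: (n, n) query-key dot products
    scores = scores / np.sqrt(d_k)        # Step 2: scale
    weights = softmax(scores, axis=-1)    # Step 3: row-wise softmax
    return weights @ V, weights           # Step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 16, 8, 8              # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, m)) for m in (d_k, d_k, d_v))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Each row of `weights` sums to 1, and there is no loop over tokens anywhere.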


Positional Encoding

There’s a problem with self-attention as described above: it has no notion of position. Formally, it is permutation-equivariant: if you shuffle the input tokens, each token still gets exactly the same output vector, and the outputs are simply shuffled the same way. Nothing in the mechanism tells a token where it or its neighbors sit in the sequence, so “The cat sat on the mat” and “mat the on sat cat the” are treated as the same bag of tokens.
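Here is a small check of that position-blindness, as a sketch with random weights and toy sizes of my choosing: shuffling the rows of $X$ just shuffles the rows of the output in the same way.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                  # a random reordering of the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)

# Each token gets the same output vector regardless of where it sits.
same_up_to_order = np.allclose(out_shuffled, out[perm])
print(same_up_to_order)
```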

For language, position obviously matters. The original Transformer uses sinusoidal positional encodings added to the input embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position in the sequence and $i \in \{0, \ldots, d/2 - 1\}$ indexes a sine/cosine pair of dimensions, so each pair shares one frequency.

Why sines and cosines? Two reasons:

  1. Unique encoding. Each position gets a unique pattern across dimensions, like a binary counter but with smooth waveforms instead of bits.

  2. Relative position information. For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$. This means the model can learn to attend to relative positions (e.g., “the token 3 positions back”) rather than absolute positions.

These encodings are added (not concatenated) to the token embeddings, so the model starts with both “what is this token” and “where is this token” combined in a single vector.
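A minimal NumPy sketch of these encodings, assuming an even model dimension $d$ (the function name and toy sizes are mine). It also checks the relative-position property from point 2: for each frequency, shifting by $k$ positions is a fixed rotation of the (sin, cos) pair.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d):
    assert d % 2 == 0, "this sketch assumes an even model dimension"
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2), one index per sin/cos pair
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

n_positions, d = 50, 16                          # toy sizes, chosen arbitrarily
pe = sinusoidal_positional_encoding(n_positions, d)

# Relative-position check: PE at pos+k is a linear function of PE at pos.
k = 3
omega = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))   # per-pair frequencies
sin_p, cos_p = pe[:-k, 0::2], pe[:-k, 1::2]
shifted_sin = sin_p * np.cos(k * omega) + cos_p * np.sin(k * omega)
shifted_cos = cos_p * np.cos(k * omega) - sin_p * np.sin(k * omega)
relative_ok = (np.allclose(shifted_sin, pe[k:, 0::2])
               and np.allclose(shifted_cos, pe[k:, 1::2]))
print(pe.shape, relative_ok)
```

To use it, you would add `pe[:n]` to an `(n, d)` matrix of token embeddings.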

Many modern models use Rotary Position Embeddings (RoPE) instead, which we’ll cover in Part 2. But the sinusoidal idea was the starting point.


Multi-Head Attention

A single attention head computes one set of attention weights. But different relationships between tokens might be relevant simultaneously — syntactic dependencies, semantic similarity, coreference, etc. Asking one attention head to capture all of these is asking too much.

Multi-head attention runs $h$ independent attention heads in parallel, each with its own projections:

$$\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)$$

Then concatenates and projects:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O$$

where $W_O \in \mathbb{R}^{hd_v \times d}$ projects back to the model dimension.

Each head uses smaller dimensions: $d_k = d_v = d / h$. So the total computation is roughly the same as a single full-dimension head, but the model gets $h$ different “perspectives” on the input.
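As a sketch of the head-splitting described above, here is a loop-based NumPy version. I store each head’s projections in a stacked array of shape `(h, d, d_head)`, which is my layout choice for readability; real implementations usually fuse the heads into one batched matrix multiply.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V have shape (h, d, d_head); W_O has shape (h * d_head, d)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):        # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)                   # (n, d_head)
    concat = np.concatenate(heads, axis=-1)         # (n, h * d_head)
    return concat @ W_O                             # project back to (n, d)

rng = np.random.default_rng(0)
n, d, h = 5, 16, 4                                  # toy sizes, chosen arbitrarily
d_head = d // h                                     # d_k = d_v = d / h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_head)) * 0.1 for _ in range(3))
W_O = rng.normal(size=(h * d_head, d)) * 0.1

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16)
```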

In practice, different heads do learn to specialize. Some attend to nearby tokens (local syntax), others attend to distant tokens (long-range dependencies), and some develop specialized behaviors like tracking coreference or function words. We’ll look at the evidence for this in Part 2.


The Transformer Block

A complete transformer block combines multi-head attention with a feedforward network and uses residual connections + layer normalization:

$$Z = \text{LayerNorm}(X + \text{MultiHead}(X))$$

$$\text{Output} = \text{LayerNorm}(Z + \text{FFN}(Z))$$

The feedforward network (FFN) is a simple two-layer MLP applied position-wise (independently to each token):

$$\text{FFN}(z) = \text{ReLU}(zW_1 + b_1)W_2 + b_2$$

with $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d}$. Typically $d_{ff} = 4d$.

The residual connections ($X + \ldots$) are important: they let gradients flow directly through the network without being forced through the attention or FFN layers. This is the same idea as in ResNets, and it’s essential for training deep transformers.

Why the FFN matters. The attention mechanism handles token-to-token interactions. The FFN handles per-token transformations — you can think of it as “processing what the attention gathered.” Recent interpretability work suggests that FFN layers store factual knowledge, while attention layers route information.
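The two block equations translate almost line for line into NumPy. This is a sketch with two simplifications that are mine: the layer norm omits its learned gain and bias, and `toy_attention` is a hypothetical single-head stand-in for $\text{MultiHead}(X)$ with no learned projections, just so the block is self-contained.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Simplification: the learned gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(z, W1, b1, W2, b2):
    # Position-wise two-layer MLP with ReLU, applied to every row independently.
    return np.maximum(z @ W1 + b1, 0.0) @ W2 + b2

def toy_attention(X):
    # Hypothetical stand-in for MultiHead(X): one head, no learned projections.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

def transformer_block(X, attention_fn, W1, b1, W2, b2):
    Z = layer_norm(X + attention_fn(X))             # residual + norm around attention
    return layer_norm(Z + ffn(Z, W1, b1, W2, b2))   # residual + norm around FFN

rng = np.random.default_rng(0)
n, d = 5, 16
d_ff = 4 * d                                        # the typical d_ff = 4d
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)

out = transformer_block(X, toy_attention, W1, b1, W2, b2)
print(out.shape)  # (5, 16)
```

Because input and output have the same shape, blocks like this can be stacked as many times as you like.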


Encoder, Decoder, and Decoder-Only

The original Transformer has two stacks:

Encoder: processes the full input sequence with bidirectional attention (every token attends to every other token). Used for understanding the input.

Decoder: generates the output sequence one token at a time, using causal (masked) attention, in which token $i$ can only attend to positions $j \leq i$. This prevents the model from “cheating” by looking at future tokens during training. The decoder also has cross-attention layers that attend to the encoder’s output.

For modern LLMs (GPT, Claude, LLaMA), we use decoder-only architectures. There’s no separate encoder. The model just generates tokens left-to-right with causal masking. This simplifies things and turns out to be sufficient for most tasks when the model is large enough.

The causal mask is just a triangular matrix applied to the attention scores before softmax — setting future positions to $-\infty$ so they get zero weight after softmax.
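In code, the mask might look like the following NumPy sketch (function name and toy sizes are mine). Positions with $j > i$ get $-\infty$ before the softmax, so their weights come out as exactly zero.

```python
import numpy as np

def causal_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Boolean upper triangle above the diagonal: True where key j is in the future of query i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite max and exp(-inf) = 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 5, 8                                   # toy sizes, chosen arbitrarily
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
weights = causal_attention_weights(Q, K)
print(np.round(weights, 2))                     # lower-triangular, rows sum to 1
```

The first token can only attend to itself, so its row is all weight on position 0.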


Putting It Together

A full transformer-based language model:

  1. Input: Tokenize text into token IDs
  2. Embedding: Look up token embeddings + add positional encodings
  3. Transformer blocks: Pass through $L$ stacked transformer blocks (each with multi-head attention + FFN + residual connections + layer norm)
  4. Output: Project final hidden states to vocabulary size, apply softmax to get next-token probabilities

Training: maximize the log-probability of the correct next token at each position (cross-entropy loss).
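The training objective can be sketched as follows. I’m assuming the model has already produced a matrix of `logits` with one row per position, and that the prediction made at position $t$ is scored against the token at position $t+1$; the function names and toy sizes are mine.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_loss(logits, token_ids):
    """logits: (n, vocab_size) model outputs; token_ids: (n,) the input token IDs."""
    # The prediction at position t targets the token at position t + 1,
    # so the last position has no target and is dropped.
    log_probs = log_softmax(logits[:-1])
    targets = token_ids[1:]
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()                        # average cross-entropy

rng = np.random.default_rng(0)
n, vocab_size = 8, 10                            # toy sizes, chosen arbitrarily
token_ids = rng.integers(0, vocab_size, size=n)

# Sanity check: a model that outputs uniform logits should score log(vocab_size).
uniform_loss = next_token_loss(np.zeros((n, vocab_size)), token_ids)
random_loss = next_token_loss(rng.normal(size=(n, vocab_size)), token_ids)
print(uniform_loss, np.log(vocab_size))
```

Minimizing this loss is the same as maximizing the average log-probability of the correct next token.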

The magic is in the scale. GPT-2 had 1.5B parameters with 48 transformer blocks. GPT-3 had 175B. Modern models go further. But the core architecture — attention + FFN + residuals — has remained remarkably stable since 2017.


What We Covered and What’s Next

In this post:

  • Self-attention as parallel, position-agnostic information routing
  • The QKV framework and scaled dot-product attention
  • Positional encoding to restore position information
  • Multi-head attention for multiple “perspectives”
  • The full transformer block with residuals and layer norm

In Part 2, we’ll look at what happens when you actually train these things at scale — sparse attention emergence, how individual attention heads specialize, modern architectural improvements (RoPE, GQA, FlashAttention, MoE), and the gated attention mechanism that won Best Paper at NeurIPS 2025.

The foundation is here. Now we see what happens when you make it big.