Transformers from First Principles — Part 2: What Scale Reveals
Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.