Transformers from First Principles — Part 2: What Scale Reveals

Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.

February 20, 2026 | Estimated reading time: 7 min | 1487 words | Author: khanhnn

Transformers from First Principles — Part 1: Attention Is All You Need (Really)

A first-principles walkthrough of the Transformer — self-attention, positional encoding, multi-head attention — with the math that makes it work.

February 8, 2026 | Estimated reading time: 8 min | 1533 words | Author: khanhnn