Transformers

Transformers from First Principles — Part 2: What Scale Reveals

Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.

Transformers from First Principles — Part 1: Attention Is All You Need (Really)

A first-principles walkthrough of the Transformer — self-attention, positional encoding, multi-head attention — with the math that makes it work.