Why This Post Exists#
So for the past two years I’ve been writing mostly about generative models — VAEs, probability, information theory. That was fun and I learned a lot. But I’ve been increasingly drawn toward the intersection of LLMs, reinforcement learning, and AI safety. Not just using these models, but understanding how they learn, how we align them, and what happens when that alignment fails.
I spent some time putting together a resource guide for myself — conferences to follow, books to read, papers to prioritize, researchers to track. Then I realized this might be useful for other people making a similar pivot. So here it is. This is not comprehensive and not objective. It’s my personal reading list with my own notes on what matters and why.
Top-Tier Conferences#
If you’re serious about this field, you need to follow the proceedings from these conferences. You don’t need to attend (I haven’t yet) — just read the accepted paper lists and best paper awards.
Tier 1: The Big 3#
| Conference | When | Acceptance Rate | Best For |
|---|---|---|---|
| NeurIPS | Nov/Dec | ~24-26% | Broadest coverage. LLM + alignment papers from all major labs |
| ICML | July | ~27.5% | RL emphasis, optimization, scaling laws |
| ICLR | May | ~30-32% | Representation learning. Safety/interpretability track is growing fast |
Some notes from my own reading:
- ICML historically emphasizes reinforcement learning and robotics more than the other two
- NeurIPS has the widest breadth — it attracts alignment work from OpenAI, Anthropic, DeepMind
- ICLR is increasingly where the safety and interpretability community publishes
Tier 2: Specialized#
| Conference | When | Focus |
|---|---|---|
| AAAI | February | Broad AI, emerging safety track |
| ACL / EMNLP / NAACL | Varies | NLP-specific, language model papers |
| FAccT | June | AI ethics, fairness, policy — bridges technical and societal perspectives |
| FLLM | Annual | LLM-specific research (newer venue) |
| AIES | Annual | AI ethics and society |
FAccT is interesting because it’s where technical alignment meets policy. If you care about how alignment research connects to things like the EU AI Act, this is the conference.
How to Actually Use Conferences (Without Attending)#
- Scan accepted paper lists — monthly for NeurIPS/ICML/ICLR
- Follow best paper awards — these signal where the field is moving
- Watch keynotes on YouTube — 1-2 hours saves you reading 10 papers
- Use Papers With Code — find implementations alongside papers
- Attend 1 in-person eventually — NeurIPS or ICML for networking
Books#
I organized these by topic and roughly by the order you’d want to read them. Not all of these I’ve finished — I’m noting which ones I’ve actually read vs. which are on my list.
Math Foundation#
| Book | Authors | Notes |
|---|---|---|
| Mathematics for Machine Learning | Deisenroth, Faisal, Ong | Free PDF. If you already have a math background, skim the optimization chapters; you’ll need that material for RL. |
| Linear Algebra and Optimization for ML | Charu C. Aggarwal | Heavier on optimization. Good if you want depth on SVD and kernel methods. |
Deep Learning#
| Book | Authors | Notes |
|---|---|---|
| Deep Learning: Foundations and Concepts | Christopher Bishop, Hugh Bishop | 2024 edition. This is the most up-to-date comprehensive reference. Covers transformers properly. |
| The Little Book of Deep Learning | François Fleuret | 150 pages. Fast refresher. Good if you already know the basics and just want modern architecture intuitions. |
| Deep Learning | Goodfellow, Bengio, Courville | The classic. Dense but foundational. Getting old now but still referenced everywhere. |
Reinforcement Learning#
| Book | Authors | Notes |
|---|---|---|
| Reinforcement Learning: An Introduction (2nd ed) | Sutton & Barto | THE canonical textbook. Free PDF at incompleteideas.net. Read cover to cover. |
| Deep Reinforcement Learning Hands-On | Maxim Lapan | Practical implementations. Good companion to Sutton & Barto — they give you theory, this gives you code. |
| Multi-Agent Reinforcement Learning | (Recent survey) | Cutting-edge MARL. First to integrate modern deep learning approaches. |
NLP & Language Models#
| Book | Authors | Notes |
|---|---|---|
| Transformers for Natural Language Processing | Denis Rothman | Clear transformer explanations with practical applications. |
| Build a Large Language Model (From Scratch) | Sebastian Raschka | Hands-on from tokenizer through pretraining to instruction fine-tuning. This is probably the best “implement it yourself” resource. |
AI Safety & Alignment#
There is no canonical textbook for AI safety yet. The knowledge lives in papers and blogs:
| Resource | What You Get |
|---|---|
| Anthropic’s research blog | Constitutional AI, reward hacking, alignment faking — primary source for cutting-edge alignment research |
| Paul Christiano’s blog | Scalable oversight, alignment research agenda |
| Chris Olah / Distill.pub | Interpretability done right — visual, rigorous, beautiful |
| Stuart Russell — Human Compatible | Philosophical foundations. Good for the “why” before the “how” |
| Alignment Forum (alignmentforum.org) | Community discussion and paper reviews |
Foundational Papers by Priority#
Must-Read First (Before Everything Else)#
| Paper | Year | Key Concepts |
|---|---|---|
| Attention is All You Need | 2017 | Self-attention, positional encoding, multi-head attention. The paper that started the LLM era. |
| Scaling Laws for Neural Language Models | 2020 | Power-law relationship between loss, model size, and data. Why bigger models work. |
| Language Models are Unsupervised Multitask Learners (GPT-2) | 2019 | Zero-shot task transfer from language modeling alone (in-context few-shot learning arrived with GPT-3 in 2020). The paper where emergence became undeniable. |
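To make the self-attention entry above concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes no masking, no learned projections, and no batching; the function names and the toy inputs are illustrative only, not taken from any library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, no masking.

    queries and keys are lists of d_k-dimensional vectors;
    values is a list of d_v-dimensional vectors, one per key.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy input: 2 tokens with d_k = d_v = 2 (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Each output row is a weighted average of the value vectors, so the first token's output sits closer to the first value vector because its query matches the first key best. Multi-head attention runs several of these in parallel on different learned projections.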
LLM-Specific (2019-2023)#
| Paper | Year | What It Addresses |
|---|---|---|
| InstructGPT / RLHF | 2022 | The 3-step pipeline: SFT → Reward Model → PPO. How ChatGPT was trained. |
| Constitutional AI (Anthropic) | 2022 | Self-critique and AI feedback against a written constitution replace human harmlessness labels. RLAIF. |
| Direct Preference Optimization (DPO) | 2023 | Simplifies RLHF by removing the separate reward model and the RL loop; trains directly on preference pairs. Stability + simplicity. |
| Chain-of-Thought Prompting | 2022 | Intermediate reasoning steps improve complex reasoning. |
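The DPO entry above is easier to appreciate with the loss written out. The sketch below computes the DPO objective for a single preference pair, assuming you already have the summed log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference model. The argument names and the default `beta=0.1` are assumptions for illustration, not values from the paper's code.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    Inputs are summed token log-probabilities of each full response
    under the trainable policy and the frozen reference model.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

No reward model appears anywhere: the log-ratio between policy and reference plays the role of an implicit reward, which is why the pipeline collapses to a single supervised-style loss.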
Advanced Alignment & Safety (2023-2025)#
| Paper | Year | Focus |
|---|---|---|
| Reward Hacking in LLMs (Multiple labs) | 2024-2025 | Models optimize proxy rewards, breaking alignment. Specification gaming. |
| Natural Emergent Misalignment from Reward Hacking (Anthropic) | 2025 | Models that learn to reward hack during RL generalize to broader misalignment, including alignment faking, without being trained to. |
| Mitigating Reward Hacking via Information-Theoretic Approaches | 2024 | Proxy Compression Hypothesis — information bottleneck view of reward hacking |
| Training on Docs About Reward Hacking Induces Reward Hacking | 2025 | Out-of-context learning: models learn reward-hacking from documentation about it. Wild. |
Recent Advances (2024-2025)#
| Area | What’s New |
|---|---|
| Reasoning Models (o1, DeepSeek-R1) | RL-trained chain-of-thought. Test-time compute scaling. |
| Chain-of-X Paradigms | Beyond CoT: Tree-of-Thought, Graph-of-Thought, etc. |
| Tool Use & Agents | Function calling, multi-agent coordination (MetaGPT, CAMEL, AutoAct) |
| Scaling Laws Across Architectures | Power laws hold across dense and sparse (MoE) model families |
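A power law is a straight line on a log-log plot, which is how scaling exponents are usually estimated. The sketch below fits `loss = a * size**(-alpha)` by least squares in log space. The data is synthetic, generated from a known power law rather than from real training runs, and the fit assumes there is no irreducible-loss term.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space.

    Returns (a, alpha). Assumes a pure power law with no
    irreducible-loss offset.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data from a known power law (not real measurements).
true_a, true_alpha = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * n ** (-true_alpha) for n in sizes]
a, alpha = fit_power_law(sizes, losses)
```

With real runs the points are noisy and the curve flattens near the irreducible loss, so published fits typically add an offset term and fit more carefully than this.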
Researchers to Follow#
Must-Follow (Active, Accessible Content)#
| Researcher | Where | Best For |
|---|---|---|
| Andrej Karpathy | YouTube, X | Best DL educator. His “Let’s build GPT from scratch” is mandatory viewing. |
| Yoshua Bengio | Papers, talks | Turing Award laureate. Shifted focus to AI safety, and it is worth understanding why. |
| Chris Olah | distill.pub, Anthropic blog | Interpretability pioneer. If you want to understand what’s inside neural networks, start here. |
| Lilian Weng | lilianweng.github.io | The gold standard for technical surveys. Her RLHF post is referenced everywhere. |
| Paul Christiano | paulchristiano.com, X | Alignment research agenda. Scalable oversight. Deep thinker. |
Read Their Papers#
| Researcher | Affiliation | Focus |
|---|---|---|
| Dario Amodei | Anthropic | Constitutional AI, responsible scaling |
| Ilya Sutskever | SSI | Scaling dynamics, AGI trajectory |
| John Schulman | Thinking Machines Lab (previously OpenAI, Anthropic) | PPO, RLHF. The RL-for-LLMs guy |
| Jacob Steinhardt | UC Berkeley | AI safety forecasting, alignment evaluation |
| Ethan Perez | Anthropic | Red-teaming, sycophancy research |
Where They Publish#
| Channel | Best For |
|---|---|
| Personal Blogs | Long-form systematic thinking (Weng, Christiano, Olah) |
| X/Twitter | Hot takes, latest ideas, paper commentary |
| YouTube | Karpathy tutorials, Lex Fridman interviews, Neel Nanda’s mech interp tutorials |
| arXiv | Full papers — read abstracts first, then decide |
| Papers with Code | Implementations alongside papers |
2024-2025: What Changed#
Reasoning & Inference-Time Compute#
Much of the recent capability gain comes from spending more compute at inference time (longer reasoning) rather than only from bigger training runs. OpenAI’s o1 uses RL-trained long chains of thought. Process Reward Models score the intermediate reasoning steps, not just the final output. The next frontier is making reasoning itself learnable.
Must read: “Reasoning in Large Language Models: From Chain-of-Thought to Massively Decomposed Agentic Processes” (2025)
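The difference between scoring outputs and scoring reasoning steps can be sketched with Best-of-N selection. Everything below is a toy: the two reward functions are hypothetical stand-ins for learned models, and taking the minimum over step scores is just one common way to aggregate, assumed here for simplicity.

```python
def outcome_score(candidate, outcome_reward):
    """Outcome-style scoring: only the final answer is judged."""
    return outcome_reward(candidate["answer"])

def process_score(candidate, step_reward):
    """Process-style scoring: every step is judged, and the weakest
    step determines the score (min aggregation is an assumption)."""
    return min(step_reward(step) for step in candidate["steps"])

def best_of_n(candidates, score):
    """Return the highest-scoring candidate out of N sampled ones."""
    return max(candidates, key=score)

# Hypothetical stand-ins for learned reward models.
def toy_outcome_reward(answer):
    return 1.0 if answer == "12" else 0.0

def toy_step_reward(step):
    return 0.1 if "guess" in step else 0.9

candidates = [
    {"steps": ["guess the answer"], "answer": "12"},
    {"steps": ["3 * 4 = 12"], "answer": "12"},
]
by_outcome = best_of_n(candidates, lambda c: outcome_score(c, toy_outcome_reward))
by_process = best_of_n(candidates, lambda c: process_score(c, toy_step_reward))
```

Both candidates reach the right answer, so the outcome scorer cannot tell them apart and simply returns the first one. The process scorer prefers the candidate whose reasoning holds up, which is the signal you want when training a model to reason.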
Tool Use & Agents#
LLMs now routinely use external tools. Multi-agent systems are emerging. Function calling is a standard feature of the major model APIs (GPT-4, Claude, etc.). The implication: alignment and safety must extend beyond a single LLM to multi-agent scenarios. This is mostly unsolved.
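For readers who have not seen function calling up close, the core loop is small: the model emits a structured request, and your code decides whether and how to run it. The sketch below assumes a simple `{"name": ..., "arguments": {...}}` JSON shape, which is illustrative and not any vendor's exact schema. The `get_weather` tool is hypothetical.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call an external API."""
    return f"Sunny in {city}"

# Allowlist: the model can only trigger functions registered here.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run it if it is allowed."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    args = call.get("arguments", {})
    try:
        return TOOLS[name](**args)
    except TypeError as exc:  # wrong or missing arguments
        return f"error: bad arguments for {name}: {exc}"

model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = dispatch(model_output)
rejected = dispatch('{"name": "delete_files", "arguments": {}}')
```

The allowlist is where the safety question becomes concrete: the model proposes actions, but the surrounding code is what grants or denies them. With several agents calling each other's tools, that boundary gets much harder to reason about.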
Alignment Technique Landscape#
| Technique | Status 2025 | Trade-offs |
|---|---|---|
| RLHF | Still dominant; complex to tune | Requires reward model; PPO instability |
| DPO | Widely adopted; simpler | Offline preference data only; less flexible |
| Constitutional AI / RLAIF | Growing | Depends on constitution quality |
| Online RL (GRPO etc.) | Emerging frontier | More complex; richer feedback loop |
No single “best” technique. Constitutional AI scales better. DPO is simpler. RLHF is most empirically validated. The field is converging on hybrid approaches.
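As one concrete piece of the online RL row, GRPO replaces PPO's learned value baseline with a comparison inside a group of samples drawn for the same prompt. A minimal sketch of that group-relative advantage follows. It assumes the population standard deviation and a small `eps` for stability; real implementations differ on both details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each sample's advantage is how far its reward sits above or below
    the group mean, in units of the group's standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if correct (toy values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct samples get a positive advantage and incorrect ones a negative advantage, with no critic network involved. If every sample in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.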
Interpretability Breakthroughs#
- Sparse Autoencoders (SAEs): Extract interpretable features from activations
- Linear Probes: Predict reward hacking from activations at the token level
- Circuit Analysis: Understanding how features compose into behaviors
The direction is clear: mechanistic interpretability is becoming a requirement for alignment, not a nice-to-have.
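To give a feel for what a sparse autoencoder computes, here is a toy forward pass in plain Python. It is a sketch under simplifying assumptions: hand-picked weights instead of trained ones, a ReLU encoder, and an L1 penalty with an arbitrary coefficient. Published SAE variants differ in normalization, bias handling, and the sparsity mechanism.

```python
def relu(x):
    return max(0.0, x)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.01):
    """One forward pass of a toy sparse autoencoder.

    x: activation vector of length d; W_enc: m rows of length d;
    W_dec: d rows of length m. Returns (features, reconstruction, loss).
    """
    # Encode: features are non-negative and mostly zero after training.
    features = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(W_enc, b_enc)]
    # Decode: rebuild the activation as a weighted sum of feature directions.
    recon = [sum(w * f for w, f in zip(row, features)) + b
             for row, b in zip(W_dec, b_dec)]
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return features, recon, mse + l1_coeff * sparsity

# Toy, hand-picked weights: d = 2 activations, m = 3 features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]
features, recon, loss = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
```

The dictionary has more features than activation dimensions, and only some of them fire for a given input. Training trades reconstruction error against the sparsity penalty, and the hope is that each surviving feature corresponds to something a human can name.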
Emerging Risks (2025)#
- Out-of-Context Reasoning: Models learn behaviors from documentation about behaviors
- Inference-Time Misalignment: Reward hacking at inference time (Best-of-N sampling etc.)
- Scaling Alignment Failures: Alignment properties don’t scale with model size as expected
Open Questions#
Things I don’t have good answers to yet:
- When will Constitutional AI surpass RLHF empirically? It theoretically scales better, but RLHF is still the most battle-tested in production.
- How do scaling laws change with inference-time compute? o1 suggests different scaling curves emerge with reasoning, but how generalizable is this?
- Can interpretability scale to reasoning models? SAEs work on base models. Unclear if they reveal reasoning-time dynamics in o1-style models.
- Will tool use change alignment requirements fundamentally? Multi-agent + tool use introduces failure modes nobody has really studied yet.
If you have thoughts on any of these, I’d love to hear them.