A Curated Guide to LLMs, Reinforcement Learning, and AI Safety

Why This Post Exists

So for the past two years I’ve been writing mostly about generative models — VAEs, probability, information theory. That was fun and I learned a lot. But I’ve been increasingly drawn toward the intersection of LLMs, reinforcement learning, and AI safety. Not just using these models, but understanding how they learn, how we align them, and what happens when that alignment fails.

I spent some time putting together a resource guide for myself — conferences to follow, books to read, papers to prioritize, researchers to track. Then I realized this might be useful for other people making a similar pivot. So here it is. This is not comprehensive and not objective. It’s my personal reading list with my own notes on what matters and why.

Top-Tier Conferences

If you’re serious about this field, you need to follow the proceedings from these conferences. You don’t need to attend (I haven’t yet) — just read the accepted paper lists and best paper awards.

Tier 1: The Big 3

Conference	When	Acceptance Rate	Best For
NeurIPS	Nov/Dec	~24-26%	Broadest coverage. LLM + alignment papers from all major labs
ICML	July	~27.5%	RL emphasis, optimization, scaling laws
ICLR	May	~20-25%	Representation learning. Safety/interpretability track is growing fast

Some notes from my own reading:

ICML historically emphasizes reinforcement learning and robotics more than the other two
NeurIPS has the widest breadth — it attracts alignment work from OpenAI, Anthropic, DeepMind
ICLR is increasingly where the safety and interpretability community publishes

Tier 2: Specialized

Conference	When	Focus
AAAI	February	Broad AI, emerging safety track
ACL / EMNLP / NAACL	Varies	NLP-specific, language model papers
FAccT	June	AI ethics, fairness, policy — bridges technical and societal perspectives
FLLM	Annual	LLM-specific research (newer venue)
AIES	Annual	AI ethics and society

FAccT is interesting because it’s where technical alignment meets policy. If you care about how alignment research connects to things like the EU AI Act, this is the conference.

How to Actually Use Conferences (Without Attending)

Scan accepted paper lists — monthly for NeurIPS/ICML/ICLR
Follow best paper awards — these signal where the field is moving
Watch keynotes on YouTube — 1-2 hours saves you reading 10 papers
Use Papers With Code — find implementations alongside papers
Attend 1 in-person eventually — NeurIPS or ICML for networking

Books

I organized these by topic and roughly by the order you’d want to read them. Not all of these I’ve finished — I’m noting which ones I’ve actually read vs. which are on my list.

Math Foundation

Book	Authors	Notes
Mathematics for Machine Learning	Deisenroth, Faisal, Ong	Free PDF. If you already have math background, skim the optimization chapters — you’ll need that for RL.
Linear Algebra and Optimization for ML	Charu C. Aggarwal	Heavier on optimization. Good if you want depth on SVD and kernel methods.

Deep Learning

Book	Authors	Notes
Deep Learning: Foundations and Concepts	Christopher Bishop, Hugh Bishop	Published 2024. This is the most up-to-date comprehensive reference. Covers transformers properly.
The Little Book of Deep Learning	François Fleuret	150 pages. Fast refresher. Good if you already know the basics and just want modern architecture intuitions.
Deep Learning	Goodfellow, Bengio, Courville	The classic. Dense but foundational. Getting old now but still referenced everywhere.

Reinforcement Learning

Book	Authors	Notes
Reinforcement Learning: An Introduction (2nd ed)	Sutton & Barto	THE canonical textbook. Free PDF at incompleteideas.net. Read cover to cover.
Deep Reinforcement Learning Hands-On	Maxim Lapan	Practical implementations. Good companion to Sutton & Barto — they give you theory, this gives you code.
Multi-Agent Reinforcement Learning	(Recent survey)	Cutting-edge MARL. First to integrate modern deep learning approaches.

NLP & Language Models

Book	Authors	Notes
Transformers for Natural Language Processing	Denis Rothman	Clear transformer explanations with practical applications.
Build a Large Language Model (From Scratch)	Sebastian Raschka	Hands-on from tokenizer to instruction fine-tuning. This is probably the best “implement it yourself” resource.

AI Safety & Alignment

There is no canonical textbook for AI safety yet. The knowledge lives in papers and blogs:

Resource	What You Get
Anthropic’s research blog	Constitutional AI, reward hacking, alignment faking — primary source for cutting-edge alignment research
Paul Christiano’s blog	Scalable oversight, alignment research agenda
Chris Olah / Distill.pub	Interpretability done right — visual, rigorous, beautiful
Stuart Russell — Human Compatible	Philosophical foundations. Good for the “why” before the “how”
Alignment Forum (alignmentforum.org)	Community discussion and paper reviews

Foundational Papers by Priority

Must-Read First (Before Everything Else)

Paper	Year	Key Concepts
Attention is All You Need	2017	Self-attention, positional encoding, multi-head attention. The paper that started the LLM era.
Scaling Laws for Neural Language Models	2020	Power-law relationships between loss and model size, dataset size, and training compute. Why bigger models work.
Language Models are Unsupervised Multitask Learners (GPT-2)	2019	Zero-shot task transfer from prompting alone, the precursor to the in-context learning of GPT-3. The paper where emergence became undeniable.
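
To make the self-attention idea from the first paper concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the learned query/key/value projections and the multi-head split are left out, and the toy input `x` is made up for illustration.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no learned projections)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy "sequence" of three 2-d token vectors (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)  # self-attention: queries, keys, values all come from x
print(out)
```

Each output row is a weighted average of the input rows, which is why every token can "see" every other token in one step.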

LLM-Specific (2019-2023)

Paper	Year	What It Addresses
InstructGPT / RLHF	2022	The 3-step pipeline: SFT → Reward Model → PPO. How ChatGPT was trained.
Constitutional AI (Anthropic)	2022	Self-improving alignment where AI feedback replaces human harmlessness labels. RLAIF.
Direct Preference Optimization (DPO)	2023	Simplifies RLHF by removing the reward model entirely. Stability + simplicity.
Chain-of-Thought Prompting	2022	Intermediate reasoning steps improve complex reasoning.
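
The DPO row above is easier to appreciate with the loss written out. This is a rough sketch of the per-example DPO objective as I understand it from the paper, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The function name, the `beta=0.1` default, and the numbers in the usage lines are my own illustrative choices.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)

# Policy identical to the reference: no preference signal yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere: the log-ratio against the reference plays the role of an implicit reward.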

Advanced Alignment & Safety (2023-2025)

Paper	Year	Focus
Reward Hacking in LLMs (Multiple labs)	2024-2025	Models optimize proxy rewards, breaking alignment. Specification gaming.
Natural Emergent Misalignment from Reward Hacking in Production RL (Anthropic)	2025	Models that learn to reward hack in realistic RL environments generalize to broader misalignment, including alignment faking.
Mitigating Reward Hacking via Information-Theoretic Approaches	2024	Proxy Compression Hypothesis — information bottleneck view of reward hacking
Training on Docs About Reward Hacking Induces Reward Hacking	2025	Out-of-context learning: models learn reward-hacking from documentation about it. Wild.

Recent Advances (2024-2025)

Area	What’s New
Reasoning Models (o1, DeepSeek-R1)	RL-trained chain-of-thought. Test-time compute scaling.
Chain-of-X Paradigms	Beyond CoT: Tree-of-Thought, Graph-of-Thought, etc.
Tool Use & Agents	Function calling, multi-agent coordination (MetaGPT, CAMEL, AutoAct)
Scaling Laws Across Architectures	Power laws hold across dense and sparse (MoE) model families

Researchers to Follow

Must-Follow (Active, Accessible Content)

Researcher	Where	Best For
Andrej Karpathy	YouTube, X	Best DL educator. His “Let’s build GPT from scratch” is mandatory viewing.
Yoshua Bengio	Papers, talks	Turing Award laureate. Shifted focus to AI safety, and it’s worth understanding why.
Chris Olah	distill.pub, Anthropic blog	Interpretability pioneer. If you want to understand what’s inside neural networks, start here.
Lilian Weng	lilianweng.github.io	The gold standard for technical surveys. Her RLHF post is referenced everywhere.
Paul Christiano	paulchristiano.com, X	Alignment research agenda. Scalable oversight. Deep thinker.

Read Their Papers

Researcher	Affiliation	Focus
Dario Amodei	Anthropic	Constitutional AI, responsible scaling
Ilya Sutskever	SSI	Scaling dynamics, AGI trajectory
John Schulman	Thinking Machines Lab (previously OpenAI, Anthropic)	PPO, RLHF: the RL-for-LLMs guy
Jacob Steinhardt	UC Berkeley	AI safety forecasting, alignment evaluation
Ethan Perez	Anthropic	Red-teaming, sycophancy research

Where They Publish

Channel	Best For
Personal Blogs	Long-form systematic thinking (Weng, Christiano, Olah)
X/Twitter	Hot takes, latest ideas, paper commentary
YouTube	Karpathy tutorials, Lex Fridman interviews, Neel Nanda’s mech interp tutorials
arXiv	Full papers — read abstracts first, then decide
Papers with Code	Implementations alongside papers

2024-2025: What Changed

Reasoning & Inference-Time Compute

Much of the recent capability gain comes from scaling test-time compute (longer reasoning) rather than only training-time compute. OpenAI’s o1 uses RL-trained long chains of thought. Process Reward Models score the intermediate steps of a reasoning trajectory, not just the final output. The next frontier is making reasoning itself learnable.
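
A toy contrast between scoring only the output and scoring the reasoning steps, as a sketch rather than how any particular lab does it. The step scores are hypothetical stand-ins for what a learned process reward model would emit, and aggregating with `min` is just one choice (a product of step scores is another).

```python
def outcome_reward(trajectory, reference_answer):
    """Outcome-style reward: looks only at the final answer."""
    return 1.0 if trajectory["answer"] == reference_answer else 0.0

def process_reward(trajectory):
    """Process-style reward: aggregate per-step scores; one bad step sinks the trajectory."""
    return min(trajectory["step_scores"])

# Hypothetical trajectories: both reach the right answer, one via a flawed step.
sound = {"answer": 42, "step_scores": [0.9, 0.8, 0.95]}
lucky = {"answer": 42, "step_scores": [0.9, 0.1, 0.95]}

print(outcome_reward(sound, 42), outcome_reward(lucky, 42))  # indistinguishable
print(process_reward(sound), process_reward(lucky))          # the flawed step shows up
```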

Must read: “Reasoning in Large Language Models: From Chain-of-Thought to Massively Decomposed Agentic Processes” (2025)

Agents & Tool Use

LLMs now routinely use external tools. Multi-agent systems are emerging. Function calling is a standard feature across GPT-4, Claude, etc. The implication: alignment and safety must extend beyond a single LLM to multi-agent scenarios. This is mostly unsolved.

Alignment Technique Landscape

Technique	Status 2025	Trade-offs
RLHF	Still dominant; complex to tune	Requires reward model; PPO instability
DPO	Widely adopted; simpler	Offline preference data only; less flexible
Constitutional AI / RLAIF	Growing	Depends on constitution quality
Online RL (GRPO etc.)	Emerging frontier	More complex; richer feedback loop

No single “best” technique. Constitutional AI scales better. DPO is simpler. RLHF is most empirically validated. The field is converging on hybrid approaches.
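
For the online RL row in the table, the core trick in GRPO (as I understand the DeepSeek formulation) is to drop the learned value function and normalize rewards within a group of responses sampled for the same prompt. A minimal sketch of that advantage computation follows. The function name, the `eps` term, and the use of population standard deviation are my assumptions, and real implementations differ on such details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption, see note above
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# If every response gets the same reward, there is no learning signal.
print(group_relative_advantages([0.5, 0.5, 0.5]))
```

These advantages then weight a PPO-style clipped policy-gradient update, which is omitted here.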

Interpretability Breakthroughs

Sparse Autoencoders (SAEs): Extract interpretable features from activations
Linear Probes: Predict reward hacking from activations at the token level
Circuit Analysis: Understanding how features compose into behaviors

The direction is clear: mechanistic interpretability is becoming a requirement for alignment, not a nice-to-have.
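
To give a feel for what a sparse autoencoder does to an activation vector, here is a minimal forward pass and loss in plain Python. It is a sketch under assumptions: the weights are hand-picked rather than trained, the dimensions are tiny, and the `l1_coeff` value is arbitrary. Real SAEs are trained on millions of activations with many more features.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode into an overcomplete, non-negative feature vector, then reconstruct."""
    features = relu([a + b for a, b in zip(matvec(W_enc, x), b_enc)])
    recon = [a + b for a, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Hand-picked toy weights: 2-d activation, 4 features (one per signed direction).
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_enc = [0.0] * 4
b_dec = [0.0] * 2

x = [0.5, -2.0]
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(features, recon, sae_loss(x, features, recon))
```

Only two of the four features fire for this input, which is the sparsity that makes individual features candidates for interpretation.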

Emerging Risks (2025)

Out-of-Context Reasoning: Models learn behaviors from documentation about behaviors
Inference-Time Misalignment: Reward hacking at inference time (Best-of-N sampling etc.)
Scaling Alignment Failures: Alignment properties don’t scale with model size as expected

Open Questions

Things I don’t have good answers to yet:

When will Constitutional AI surpass RLHF empirically? It theoretically scales better, but RLHF is still the most battle-tested in production.
How do scaling laws change with inference-time compute? o1 suggests different scaling curves emerge with reasoning, but how generalizable is this?
Can interpretability scale to reasoning models? SAEs work on base models. Unclear if they reveal reasoning-time dynamics in o1-style models.
Will tool use change alignment requirements fundamentally? Multi-agent + tool use introduces failure modes nobody has really studied yet.

If you have thoughts on any of these, I’d love to hear them.
