Why This Post Exists

So for the past two years I’ve been writing mostly about generative models — VAEs, probability, information theory. That was fun and I learned a lot. But I’ve been increasingly drawn toward the intersection of LLMs, reinforcement learning, and AI safety. Not just using these models, but understanding how they learn, how we align them, and what happens when that alignment fails.

I spent some time putting together a resource guide for myself — conferences to follow, books to read, papers to prioritize, researchers to track. Then I realized this might be useful for other people making a similar pivot. So here it is. This is not comprehensive and not objective. It’s my personal reading list with my own notes on what matters and why.


Top-Tier Conferences

If you’re serious about this field, you need to follow the proceedings from these conferences. You don’t need to attend (I haven’t yet) — just read the accepted paper lists and best paper awards.

Tier 1: The Big 3

| Conference | When | Acceptance Rate | Best For |
| --- | --- | --- | --- |
| NeurIPS | Nov/Dec | ~24-26% | Broadest coverage. LLM + alignment papers from all major labs |
| ICML | July | ~27.5% | RL emphasis, optimization, scaling laws |
| ICLR | May | ~20-25% | Representation learning. Safety/interpretability track is growing fast |

Some notes from my own reading:

  • ICML historically emphasizes reinforcement learning and robotics more than the other two
  • NeurIPS has the widest breadth — it attracts alignment work from OpenAI, Anthropic, DeepMind
  • ICLR is increasingly where the safety and interpretability community publishes

Tier 2: Specialized

| Conference | When | Focus |
| --- | --- | --- |
| AAAI | February | Broad AI, emerging safety track |
| ACL / EMNLP / NAACL | Varies | NLP-specific, language model papers |
| FAccT | June | AI ethics, fairness, policy; bridges technical and societal perspectives |
| FLLM | Annual | LLM-specific research (newer venue) |
| AIES | Annual | AI ethics and society |

FAccT is interesting because it’s where technical alignment meets policy. If you care about how alignment research connects to things like the EU AI Act, this is the conference.

How to Actually Use Conferences (Without Attending)

  1. Scan accepted paper lists: each of NeurIPS/ICML/ICLR posts one per year, so a monthly check is enough to catch them
  2. Follow best paper awards — these signal where the field is moving
  3. Watch keynotes on YouTube — 1-2 hours saves you reading 10 papers
  4. Use Papers With Code — find implementations alongside papers
  5. Attend 1 in-person eventually — NeurIPS or ICML for networking

Books

I organized these by topic and roughly by the order you’d want to read them. Not all of these I’ve finished — I’m noting which ones I’ve actually read vs. which are on my list.

Math Foundation

| Book | Authors | Notes |
| --- | --- | --- |
| Mathematics for Machine Learning | Deisenroth, Faisal, Ong | Free PDF. If you already have a math background, skim most of it, but don't skip the optimization material: you'll need that for RL. |
| Linear Algebra and Optimization for ML | Charu C. Aggarwal | Heavier on optimization. Good if you want depth on SVD and kernel methods. |

Deep Learning

| Book | Authors | Notes |
| --- | --- | --- |
| Deep Learning: Foundations and Concepts | Christopher M. Bishop, Hugh Bishop | Published 2024. The most up-to-date comprehensive reference on this list. Covers transformers properly. |
| The Little Book of Deep Learning | François Fleuret | 150 pages. Fast refresher. Good if you already know the basics and just want modern architecture intuitions. |
| Deep Learning | Goodfellow, Bengio, Courville | The classic. Dense but foundational. Getting old now but still referenced everywhere. |

Reinforcement Learning

| Book | Authors | Notes |
| --- | --- | --- |
| Reinforcement Learning: An Introduction (2nd ed) | Sutton & Barto | THE canonical textbook. Free PDF at incompleteideas.net. Read cover to cover. |
| Deep Reinforcement Learning Hands-On | Maxim Lapan | Practical implementations. Good companion to Sutton & Barto: they give you theory, this gives you code. |
| Multi-Agent Reinforcement Learning | (Recent survey) | Cutting-edge MARL. First to integrate modern deep learning approaches. |

NLP & Language Models

| Book | Authors | Notes |
| --- | --- | --- |
| Transformers for NLP | Denis Rothman | Clear transformer explanations with practical applications. |
| Build a Large Language Model (From Scratch) | Sebastian Raschka | Hands-on from tokenizer through pretraining to instruction fine-tuning. This is probably the best “implement it yourself” resource. |

AI Safety & Alignment

There is no canonical textbook for AI safety yet. The knowledge lives in papers and blogs:

| Resource | What You Get |
| --- | --- |
| Anthropic’s research blog | Constitutional AI, reward hacking, alignment faking. Primary source for cutting-edge alignment research |
| Paul Christiano’s blog | Scalable oversight, alignment research agenda |
| Chris Olah / Distill.pub | Interpretability done right: visual, rigorous, beautiful |
| Stuart Russell, Human Compatible | Philosophical foundations. Good for the “why” before the “how” |
| Alignment Forum (alignmentforum.org) | Community discussion and paper reviews |

Foundational Papers by Priority

Must-Read First (Before Everything Else)

| Paper | Year | Key Concepts |
| --- | --- | --- |
| Attention Is All You Need | 2017 | Self-attention, positional encoding, multi-head attention. The paper that started the LLM era. |
| Scaling Laws for Neural Language Models | 2020 | Power-law relationship between loss, model size, and data. Why bigger models work. |
| Language Models are Unsupervised Multitask Learners (GPT-2) | 2019 | Zero-shot task transfer from plain language modeling. The paper where emergence became hard to ignore (few-shot in-context learning followed with GPT-3 in 2020). |
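
To make the power-law claim in the scaling-laws entry concrete for myself: if loss follows L(N) = (N_c / N)^alpha, then log-loss is a straight line in log-size, and a plain least-squares fit recovers the exponent. The sketch below is my own illustration on synthetic data. The constants `n_c` and `alpha` are placeholder values I picked for the demo, and the function names are made up here, not taken from the paper.

```python
import math

def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Loss under an assumed power law L(N) = (N_c / N) ** alpha.

    n_c and alpha are illustrative placeholders, not fitted values.
    """
    return (n_c / n_params) ** alpha

def fit_exponent(sizes, losses):
    """Least-squares slope of log(loss) against log(size); alpha is minus the slope."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var

# Synthetic "models" spanning four orders of magnitude in parameter count.
sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [power_law_loss(n) for n in sizes]
alpha_hat = fit_exponent(sizes, losses)
print(f"recovered exponent: {alpha_hat:.3f}")
```

On real training runs the points are noisy and the fit is only approximate, but the log-log straight line is the thing to look for.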

LLM-Specific (2019-2023)

| Paper | Year | What It Addresses |
| --- | --- | --- |
| InstructGPT / RLHF | 2022 | The 3-step pipeline: SFT → Reward Model → PPO. The recipe ChatGPT's training built on. |
| Constitutional AI (Anthropic) | 2022 | Self-improving alignment that replaces human harmlessness labels with AI feedback (RLAIF). |
| Direct Preference Optimization (DPO) | 2023 | Simplifies RLHF by removing the separate reward model and the RL loop. Stability + simplicity. |
| Chain-of-Thought Prompting | 2022 | Intermediate reasoning steps improve complex reasoning. |
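
Since the DPO entry is the one that confused me longest, here is the per-pair loss as I understand it from the paper: take how much more the policy likes the chosen response than the reference model does, subtract the same quantity for the rejected response, scale by beta, and push that margin through a logistic loss. This is a minimal sketch, not a training loop. The inputs are assumed to be summed sequence log-probabilities, `beta=0.1` is just a common-looking default I chose, and the function name is mine.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    All four inputs are log-probabilities of a full response
    (assumed already summed over tokens).
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = -beta * margin
    # Numerically stable softplus: log(1 + exp(z)) == -log(sigmoid(-z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy has moved toward the chosen response relative to the reference.
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
# Policy has moved toward the rejected response instead.
regressed = dpo_loss(-12.0, -10.0, -11.0, -11.0)
print(improved, regressed)
```

When the policy equals the reference the margin is zero and the loss is log 2, so anything below that means the policy is ranking the pair better than the reference does.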

Advanced Alignment & Safety (2023-2025)

| Paper | Year | Focus |
| --- | --- | --- |
| Reward Hacking in LLMs (Multiple labs) | 2024-2025 | Models optimize proxy rewards, breaking alignment. Specification gaming. |
| Natural Emergent Misalignment (Anthropic) | 2025 | Reward hacking emerges in RL without explicit intent and generalizes to broader misalignment, including alignment faking. |
| Mitigating Reward Hacking via Information-Theoretic Approaches | 2024 | Proxy Compression Hypothesis: an information bottleneck view of reward hacking |
| Training on Docs About Reward Hacking Induces Reward Hacking | 2025 | Out-of-context learning: models learn reward-hacking from documentation about it. Wild. |

Recent Advances (2024-2025)

| Area | What’s New |
| --- | --- |
| Reasoning Models (o1, DeepSeek-R1) | RL-trained chain-of-thought. Test-time compute scaling. |
| Chain-of-X Paradigms | Beyond CoT: Tree-of-Thought, Graph-of-Thought, etc. |
| Tool Use & Agents | Function calling, multi-agent coordination (MetaGPT, CAMEL, AutoAct) |
| Scaling Laws Across Architectures | Power laws hold across dense and sparse (MoE) model families |

Researchers to Follow

Must-Follow (Active, Accessible Content)

| Researcher | Where | Best For |
| --- | --- | --- |
| Andrej Karpathy | YouTube, X | Best DL educator. His “Let’s build GPT: from scratch, in code, spelled out.” video is mandatory viewing. |
| Yoshua Bengio | Papers, talks | Turing Award laureate. Shifted focus to AI safety, and it is worth understanding why. |
| Chris Olah | distill.pub, Anthropic blog | Interpretability pioneer. If you want to understand what’s inside neural networks, start here. |
| Lilian Weng | lilianweng.github.io | The gold standard for technical surveys. Her RLHF post is referenced everywhere. |
| Paul Christiano | paulchristiano.com, X | Alignment research agenda. Scalable oversight. Deep thinker. |

Read Their Papers

| Researcher | Affiliation | Focus |
| --- | --- | --- |
| Dario Amodei | Anthropic | Constitutional AI, responsible scaling |
| Ilya Sutskever | SSI | Scaling dynamics, AGI trajectory |
| John Schulman | Thinking Machines Lab (previously OpenAI, Anthropic) | PPO, RLHF. The RL-for-LLMs guy |
| Jacob Steinhardt | UC Berkeley | AI safety forecasting, alignment evaluation |
| Ethan Perez | Anthropic | Red-teaming, sycophancy research |

Where They Publish

| Channel | Best For |
| --- | --- |
| Personal Blogs | Long-form systematic thinking (Weng, Christiano, Olah) |
| X/Twitter | Hot takes, latest ideas, paper commentary |
| YouTube | Karpathy tutorials, Lex Fridman interviews, Neel Nanda’s mech interp tutorials |
| arXiv | Full papers. Read abstracts first, then decide |
| Papers with Code | Implementations alongside papers |

2024-2025: What Changed

Reasoning & Inference-Time Compute

Recent gains are coming increasingly from test-time compute (longer reasoning at inference) rather than from training-time scale alone. OpenAI’s o1 uses RL-trained long chains of thought. Process Reward Models score the intermediate reasoning steps, not just the final output. The next frontier is making reasoning itself learnable.
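
The difference between scoring outcomes and scoring the reasoning process is easier to see on a toy case. In this sketch two candidate solutions reach the same correct answer, but one gets there through a shaky middle step. Everything here is hypothetical: the step scores are hand-written stand-ins for what a learned process reward model would output, and aggregating with `min` is one choice among several (products and averages are also used).

```python
def outcome_reward(final_answer, reference):
    """Outcome-style score: looks only at the final answer."""
    return 1.0 if final_answer == reference else 0.0

def process_reward(step_scores):
    """Process-style score: a chain is only as good as its weakest step.

    Using min as the aggregator is an assumption made for this sketch.
    """
    return min(step_scores)

# Hand-written step scores standing in for a learned process reward model.
candidates = [
    {"name": "lucky", "answer": "42", "step_scores": [0.9, 0.2, 0.8]},
    {"name": "sound", "answer": "42", "step_scores": [0.9, 0.85, 0.8]},
]
reference = "42"

outcome = {c["name"]: outcome_reward(c["answer"], reference) for c in candidates}
process = {c["name"]: process_reward(c["step_scores"]) for c in candidates}
best = max(candidates, key=lambda c: process_reward(c["step_scores"]))
print(outcome, process, best["name"])
```

The outcome score cannot tell the two apart, while the process score prefers the chain whose every step holds up. That extra signal is what makes step-level supervision attractive for training reasoning.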

Must read: “Reasoning in Large Language Models: From Chain-of-Thought to Massively Decomposed Agentic Processes” (2025)

Agents & Tool Use

LLMs now routinely use external tools. Multi-agent systems are emerging. Function calling is supported in broadly similar form across GPT-4, Claude, etc. The implication: alignment and safety must extend beyond a single LLM to multi-agent scenarios. This is mostly unsolved.
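
For anyone who hasn't looked under the hood of tool use, the core loop is small: the model emits a structured request, and your code decides whether and how to run it. Below is a minimal sketch of that dispatch step. The JSON shape (`name` plus `arguments`) and the `tool`/`dispatch` helpers are my own invention for illustration and do not match any particular vendor's API.

```python
import json

# Registry of functions the model is allowed to call (an allow-list).
TOOLS = {}

def tool(fn):
    """Decorator that registers a function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def add(a: float, b: float) -> float:
    return a + b

def dispatch(model_output: str):
    """Parse a hypothetical JSON tool call and run it if the tool is registered."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        # Refusing unknown tools is the simplest possible safety check.
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

result = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
print(result)
```

The allow-list lookup is where safety logic naturally lives. Once several agents can call each other's tools, that single checkpoint stops being enough, which is part of why the multi-agent case is hard.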

Alignment Technique Landscape

| Technique | Status 2025 | Trade-offs |
| --- | --- | --- |
| RLHF | Still dominant; complex to tune | Requires reward model; PPO instability |
| DPO | Widely adopted; simpler | Offline preference data only; less flexible |
| Constitutional AI / RLAIF | Growing | Depends on constitution quality |
| Online RL (GRPO etc.) | Emerging frontier | More complex; richer feedback loop |

No single “best” technique. Constitutional AI scales better. DPO is simpler. RLHF is most empirically validated. The field is converging on hybrid approaches.
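
One piece of the online-RL row that is easy to show in code is GRPO's group-relative advantage. As I understand it, you sample several responses to the same prompt, score each one, and normalize the rewards within that group instead of training a separate value network. The sketch below covers only that normalization step. Using the population standard deviation and the `eps` value are my assumptions, and the function name is mine.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and the eps value are assumptions of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

If every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient, which is why prompt difficulty matters so much for this family of methods.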

Interpretability Breakthroughs

  • Sparse Autoencoders (SAEs): Extract interpretable features from activations
  • Linear Probes: Predict reward hacking from activations at the token level
  • Circuit Analysis: Understanding how features compose into behaviors

The direction is clear: mechanistic interpretability is becoming a requirement for alignment, not a nice-to-have.
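
To ground the sparse autoencoder item, here is the forward pass and loss in plain Python: encode an activation vector into a wider, mostly-zero feature vector, decode it back, and penalize both reconstruction error and feature magnitude. The weights are hand-picked so the arithmetic is easy to follow. A real SAE learns them from model activations, and the `l1_coeff` value and function names are placeholders of mine.

```python
def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode with ReLU into an overcomplete feature space, then decode linearly."""
    pre = [h + b for h, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, h) for h in pre]
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Hand-picked weights: 2-dim activation, 4 features (one per signed axis direction).
w_enc = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1, -1, 0, 0], [0, 0, 1, -1]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon)
print(features, recon, loss)
```

Only two of the four features fire for this input, and the reconstruction is exact, so the whole loss comes from the sparsity term. The interpretability bet is that features learned this way line up with human-readable concepts.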

Emerging Risks (2025)

  • Out-of-Context Reasoning: Models learn behaviors from documentation about behaviors
  • Inference-Time Misalignment: Reward hacking at inference time (Best-of-N sampling etc.)
  • Scaling Alignment Failures: Alignment properties don’t scale with model size as expected
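
The inference-time item is the least intuitive of the three, so here is a toy simulation of it. Each sampled response has a true quality and a proxy score equal to true quality plus an independent error, and best-of-N picks the highest proxy score. This is a deliberately crude model that I made up for illustration: the Gaussian distributions, the seed, and the sample counts are all arbitrary assumptions.

```python
import random

def sample_response(rng):
    """One hypothetical response: true quality plus an independent proxy error."""
    true_reward = rng.gauss(0.0, 1.0)
    proxy_error = rng.gauss(0.0, 1.0)
    return {"true": true_reward, "proxy": true_reward + proxy_error}

def best_of_n(samples, n):
    """Pick the response with the highest proxy score among the first n samples."""
    return max(samples[:n], key=lambda s: s["proxy"])

rng = random.Random(0)
samples = [sample_response(rng) for _ in range(256)]

picks = []
for n in (1, 4, 16, 64, 256):
    pick = best_of_n(samples, n)
    picks.append(pick)
    print(n, round(pick["proxy"], 2), round(pick["true"], 2))
```

The selected proxy score can only go up as N grows, because each larger pool contains the smaller one. The thing to watch is the gap between the proxy and true columns: picking on the proxy also selects for samples where the proxy error happens to be large, and that selection effect is the seed of inference-time reward hacking.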

Open Questions

Things I don’t have good answers to yet:

  1. When will Constitutional AI surpass RLHF empirically? It theoretically scales better, but RLHF is still the most battle-tested in production.
  2. How do scaling laws change with inference-time compute? o1 suggests different scaling curves emerge with reasoning, but how generalizable is this?
  3. Can interpretability scale to reasoning models? SAEs work on base models. Unclear if they reveal reasoning-time dynamics in o1-style models.
  4. Will tool use change alignment requirements fundamentally? Multi-agent + tool use introduces failure modes nobody has really studied yet.

If you have thoughts on any of these, I’d love to hear them.