[{"content":"What This Is I\u0026rsquo;ve been going through the proceedings of NeurIPS 2025 and ICLR 2026 to map out what the research community is working on in LLM safety, alignment, RLHF, and related areas. This post is basically my reading list, organized by conference and topic, with personal notes on the papers I think are most important.\nICML 2026 (July, Seoul) hasn\u0026rsquo;t published its accepted papers yet, so I\u0026rsquo;ll update this when that drops.\nIf you just want the top picks, skip to the curated reading list at the bottom.\nNeurIPS 2025 Best Papers \u0026amp; Award Winners Artificial Hivemind: The Open-Ended Homogeneity of Language Models (Best Paper)\nThis one is interesting because it shows that RLHF reduces output diversity. They built a benchmark of 26K queries with 31K dense human annotations and found that reward models are miscalibrated against diverse human preferences. So when we align models to be \u0026ldquo;helpful\u0026rdquo;, we\u0026rsquo;re actually making them converge to a narrow band of responses. You may wonder if that\u0026rsquo;s necessarily bad, and the answer is that it depends on what you\u0026rsquo;re optimizing for. But it\u0026rsquo;s something we should at least be aware of.\nGated Attention for Large Language Models (Best Paper)\nAdds head-specific sigmoid gating after scaled dot-product attention. Tested across 30+ variants. More of an architecture contribution than safety, but it touches on how attention mechanisms can be made more controllable.\nDoes Reinforcement Learning Really Incentivize Reasoning Capabilities?\nThis is the contrarian paper I keep coming back to. They show that RLVR (RL with verifiable rewards) enhances sampling efficiency without actually expanding reasoning capacity. So RL makes models better at finding good answers from what they already know, but doesn\u0026rsquo;t teach them to reason about new things. 
If this holds up, it means current RL methods haven\u0026rsquo;t even scratched the surface of what\u0026rsquo;s possible.\nSafety \u0026amp; Alignment LLM Safety Alignment is Divergence Estimation in Disguise\nProbably my favorite theory paper from this cycle. Shows that RLHF, DPO, and related methods are all doing the same thing — estimating divergence between safe and unsafe output distributions. They propose a KLDO variant based on KL divergence. The unifying perspective is really clean.\nTowards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons\nThey identified that roughly 5% of neurons in a model are responsible for safety behavior. Patching these \u0026ldquo;safety neurons\u0026rdquo; restores \u0026gt;90% of safety performance. This is huge for interpretability — it means safety isn\u0026rsquo;t spread diffusely across the whole network, it\u0026rsquo;s concentrated. And that concentration is both a strength (we can study it) and a vulnerability (we can attack it).\nLifelong Safety Alignment for Language Models\nFrames alignment as a competitive game: a Meta-Attacker that discovers jailbreaks vs. a Defender that resists them. Continuous adaptation to evolving strategies. Interesting framing but I\u0026rsquo;m not sure how practical it is — the arms race dynamic seems expensive.\nSafe RLHF-V (PKU-Alignment)\nFirst multimodal safety alignment framework. Introduces BeaverTails-V dataset with dual preference annotations (helpfulness + safety) and a Beaver-Guard-V multi-level guardrail system. This matters because as models become multimodal, single-modality safety approaches won\u0026rsquo;t be enough.\nRectified Policy Optimization (RePO)\nReplaces expected safety constraints with critical per-prompt constraints. The key insight: average-case safety isn\u0026rsquo;t good enough — you need worst-case guarantees per prompt. 
Makes sense intuitively.\nPreference Optimization \u0026amp; RLHF Less is More: Improving LLM Alignment via Preference Data Selection\nStrategic data selection for DPO training improves performance while reducing computation. Quality over quantity for preference data — not surprising but good to have the empirical validation.\nGreedy Sampling Is Provably Efficient For RLHF\nTheory paper. If you\u0026rsquo;re into the math of RLHF sampling strategies, this is your paper.\nProvably Efficient Online RLHF with One-Pass Reward Modeling\nAddresses the computational cost of continuously integrating new data + re-optimizing reward models. One-pass approach. Practical if you\u0026rsquo;re deploying RLHF at scale.\nMechanistic Interpretability \u0026amp; SAEs SAEs (Sparse Autoencoders) dominated the interpretability track this year:\nRevising and Falsifying SAE Feature Explanations — Improving the quality of what SAE features actually mean Measuring SAE Feature Sensitivity — How reliably do SAE features activate on similar inputs? SAE Neural Operators — Generalizes SAEs to infinite-dimensional function spaces. Ambitious. One-Step SAEs for Diffusion Models — Extends SAE interpretability beyond language to image generation (SDXL Turbo) SAEs for Pathology Foundation Models — SAE features map to interpretable biological concepts. Strong correlations with cell type counts. Feature Absorption in SAEs — Hierarchical features cause absorption problems during optimization. Varying SAE size/sparsity doesn\u0026rsquo;t fix it. Matching Pursuit SAE — New architecture using Matching Pursuit for hierarchical features. Reveals geometric assumptions in existing SAE designs. My take: SAEs are becoming the swiss army knife of interpretability. 
But the absorption and sensitivity issues suggest we\u0026rsquo;re still in the \u0026ldquo;figuring out the tool\u0026rdquo; phase, not the \u0026ldquo;using the tool reliably\u0026rdquo; phase.\nReasoning, CoT \u0026amp; Agents Decision Pivots for CoT Verification — Identifies minimal checkpoints that any correct reasoning path must visit. Interesting for verification. Latent Reasoning Models / CODI — Reasoning in continuous hidden space via self-distillation. Efficient but less interpretable. Self-Correction in Long CoT — Models do redundant reasoning; first step dominates. So longer chains don\u0026rsquo;t always help. RL with Model-rewarded Thinking (RLMT) — Online RL with preference-based reward models. Outperforms standard RLHF across DPO, PPO, GRPO. Multi-Agent Reasoning (Game-Theoretic) — Non-zero-sum game between base agents + critical agent. Uncertainty-aware collaboration. R1-Zero for GUI Grounding — Online RL + CoT reasoning for computer-use agents. Finds that longer chains lead to worse performance. Confirms the self-correction finding above. Attention \u0026amp; Feature Interpretation Sparse Attention Emergence — Timing follows power laws based on task/architecture/optimizer. Oral paper. Attention Head Specialization — Individual heads specialize in semantic/visual attributes. Gives interpretable, controllable structure. ICLR 2026 Outstanding Papers SafeDPO: A Simple Approach to DPO with Enhanced Safety (Outstanding)\nBalances helpfulness + safety without auxiliary networks or cost models. Single additional hyperparameter. Minimal modifications to standard DPO. This is the kind of paper I love — takes a real problem and solves it with minimal machinery. If you read one DPO paper this year, read this.\nQ-RAG: Multi-Step Retrieval via RL-Trained Embedders (Outstanding)\nValue-based RL for training retrieval embedders on long contexts. The evolution from single-hop to multi-step RAG. 
Important for practical systems.\nAgentFlow: Agentic Framework with Flow-GRPO (Outstanding)\nThis is wild. A trainable agentic system (planner, executor, verifier, generator) using Flow-based Group Refined Policy Optimization for sparse-reward credit assignment. A 7B parameter backbone beats GPT-4o on search, math, and science reasoning. If this replicates, it\u0026rsquo;s a strong signal that small models + good RL + agent architecture can compete with giant models.\nCommon Corpus: Ethical LLM Pre-training Data (Outstanding)\nAddresses data bias and ethics in pre-training. Important groundwork even if less flashy.\nTransformers are Inherently Succinct (Outstanding)\nTheoretical explanation of why transformers work so well. If you like theory papers.\nLLMs Get Lost In Multi-Turn Conversation (Outstanding)\nDemonstrates performance decline in multi-turn with underspecified instructions. Very relevant for real-world deployment where conversations are messy and long.\nDPO \u0026amp; Preference Optimization This was a major theme — lots of people finding problems with DPO and proposing fixes:\nPaper Key Finding Why DPO is a Misspecified Estimator (Oral) Exposes statistical flaw: preference reversals + reward degradation Token-Importance Guided DPO Refines DPO with token-level weighting Multiplayer Nash Preference Optimization Extends preference optimization to multi-agent setting Semi-Supervised Preference with Limited Feedback Data efficiency for preference-based training Learning from Reference Answers Alternative to binary preferences Learning Correlated Reward Models Multi-objective reward modeling with correlations The trend is clear: original DPO had real problems, and the community is fixing them from multiple angles — statistical, multi-agent, data-efficient, multi-objective.\nSafety \u0026amp; Alignment Paper Focus Rethinking Deep Safety Alignment Balances harmlessness + helpfulness Alignment-Weighted DPO Weighted approach for safety Invisible Safety Threat: 
Malicious Finetuning via Steganography Security vulnerabilities in fine-tuning. Scary. Causal Intervention for Vulnerability Analysis Shows shallow alignment enables jailbreaks. Fine-tuning on CoT datasets encourages principled refusals. The steganography paper is worth highlighting — it shows that fine-tuning can be weaponized in ways that are hard to detect. Not great news for open-weight model safety.\nReasoning \u0026amp; Chain-of-Thought Paper Key Finding Your Base Model is Smarter Than You Think Sampling-based reasoning strategies unlock latent capability Verifying CoT via Computational Graphs Graph-based reasoning validation Detecting Implicit Reward Hacking Measures reasoning effort to identify deceptive reasoning LoongRL RL reasoning over long contexts The Art of Scaling RL Compute for LLMs Optimal compute allocation for RL training TROLL Trust region methods for stable RL in language models Mechanistic Interpretability SAEs for Code Correctness — Identifies directions corresponding to code correctness in LLMs Mech Interp of In-Context Learning — Finds \u0026ldquo;common structures\u0026rdquo; in transformer QK circuits Tracking Equivalent Mech Interp Across Networks — Framework for discovering succinct algorithms Small Transformers Don\u0026rsquo;t Need LayerNorm at Inference — LN-free analogs enable more precise mechanistic analysis Is Mechanistic Interpretability Identifiable? — Fundamental question about whether we can uniquely identify mechanisms. Important. Research Trends Across Both Conferences Looking at these together, some clear patterns:\nDPO is getting fixed from all directions. Multiple papers identify statistical, practical, and safety issues. SafeDPO is the cleanest solution so far.\nSAEs are everywhere. 10+ papers across both conferences. The tool is gaining adoption but still has fundamental issues (absorption, sensitivity, identifiability).\nRL for reasoning is complicated. 
Evidence that RL improves sampling efficiency but may not expand actual reasoning capacity. Longer chains don\u0026rsquo;t always help.\nMultimodal safety is just starting. Safe RLHF-V is the first real multimodal safety framework. Expect this to explode next year.\nPluralistic alignment is emerging. Moving beyond single-objective \u0026ldquo;helpful and harmless\u0026rdquo; toward diverse values and personalization.\nAgents + RL intersection. AgentFlow showing 7B beats GPT-4o with the right architecture. Small + smart \u0026gt; big + dumb.\nCurated Reading List Read These First Paper Venue Why SafeDPO ICLR 2026 Fixes DPO for safety with one extra hyperparameter. Clean and practical. Why DPO is Misspecified ICLR 2026 Understand the problem before the solution. Safety Alignment is Divergence Estimation NeurIPS 2025 Unifying theory for RLHF/DPO/etc. Safety Neurons NeurIPS 2025 5% of neurons → 90% of safety. Huge for interpretability. Does RL Incentivize Reasoning? NeurIPS 2025 Contrarian finding. Changes how you think about RL for LLMs. AgentFlow ICLR 2026 7B beats GPT-4o. RL + agents \u0026gt; scale. Second Priority Paper Venue Why TROLL ICLR 2026 Stable RL training for LLMs. Practical. LoongRL ICLR 2026 RL + long context reasoning. Artificial Hivemind NeurIPS 2025 RLHF reduces diversity. Best Paper for a reason. SAE Feature Absorption NeurIPS 2025 Fundamental limitation of current SAEs. Is Mech Interp Identifiable? ICLR 2026 Existential question for the field. Safe RLHF-V NeurIPS 2025 First multimodal safety framework. 
Sources NeurIPS 2025 Best Papers ICLR 2026 Outstanding Papers OpenReview NeurIPS 2025 ICLR 2026 Papers ICLR 2026 Oral Papers GitHub PKU-Alignment Group IntuitionLabs NeurIPS 2025 Summary ","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-04-29/","summary":"A curated list of papers on alignment, preference optimization, mechanistic interpretability, and reasoning from the two biggest ML conferences this cycle — with personal takes on the ones that matter most.","title":"Paper Roundup: LLM Safety \u0026 RLHF at NeurIPS 2025 and ICLR 2026"},{"content":"The Problem with Papers So here\u0026rsquo;s something that has always bugged me about academic papers. You read a paper, you understand the method (maybe), you want to reproduce it, and then you spend three days figuring out what the authors actually did because the paper doesn\u0026rsquo;t tell you. The code repo, if it exists, doesn\u0026rsquo;t match the paper. Half the experiments in the paper are cherry-picked. And all the failed attempts that led to the final method? Gone. Deleted from history.\nThis is what a recent paper calls the \u0026ldquo;Storytelling Tax\u0026rdquo; — the cost of forcing research into a narrative format that reads well but hides information. There\u0026rsquo;s also the \u0026ldquo;Engineering Tax\u0026rdquo; — the gap between human-readable prose and machine-executable specifications. Papers are written for humans to read, but increasingly it\u0026rsquo;s AI agents that need to consume, reproduce, and extend research.\nA paper from April 2026 proposes a pretty radical solution: stop publishing papers entirely (at least in the traditional sense) and publish executable research artifacts instead.\nThe paper is \u0026ldquo;Agent-Native Research Artifacts\u0026rdquo; by Jiachen Liu, Jiaxin Pei, and about 33 other authors. They call the format Ara. 
Let me walk thru what it is, what the results look like, and what I think about it.\nWhat Ara Actually Is Ara replaces the PDF with a structured artifact containing four layers. Each layer serves a different purpose and is designed so that AI agents can load only what they need:\nThe Four Layers Cognitive Layer (/logic/)\nThis is where the thinking lives — claims, problem statements, solution architecture, experiment design, related work. But unlike a paper\u0026rsquo;s introduction and methods section, everything here is structured with typed dependencies. Claims link to evidence. Related work has typed edges (extends, contradicts, cites).\nWhy this matters: an agent can trace reasoning chains, detect contradictions, and extend hypotheses without parsing natural language prose.\nPhysical Layer (/src/)\nThe actual code. Executable, annotated configs, environment specs, dependencies. Not a \u0026ldquo;code availability\u0026rdquo; link in the appendix — the code is a first-class part of the artifact.\nExploration Layer (/trace/)\nThis is the most interesting part to me. It\u0026rsquo;s a DAG (directed acyclic graph) of all research decisions — questions asked, experiments run, dead-ends hit, pivots made. With timestamps and provenance.\nTraditional papers delete 90% of experiments. Ara\u0026rsquo;s trace layer preserves them. An agent can learn from failure modes without repeating them. It can understand why not approach X — which is often more valuable than knowing why approach Y worked.\nEvidence Layer (/evidence/)\nRaw outputs. Metric tables. Training curves. Resource logs. Hyperparameter sensitivity analyses. Not the curated figures from the paper — the actual data.\nHow It\u0026rsquo;s Built Two systems support artifact creation:\nLive Research Manager — runs passively during research sessions. 
It captures code commits, terminal outputs, and error logs, classifies them into typed events (hypothesis, experiment, pivot, dead-end), and promotes observations to formal artifact entries when closure signals are detected. The key claim: no additional author burden.\nAra Compiler: translates existing PDFs + repos into Ara format. It works in four stages: semantic deconstruction (strip narrative framing), cognitive mapping (populate /logic/), physical grounding (extract /src/ with code-paper reconciliation), and exploration graph reconstruction (infer the research DAG from version history). This is the migration path: you don\u0026rsquo;t need to rewrite all existing papers from scratch.\nVerification Three levels of machine-verifiable review:\nLevel 1 (Structural): Schema conformance, reference resolution, valid DAG structure. Takes seconds. Level 2 (Argumentative): Evidence relevance, falsifiability, methodology. Uses LLM-as-judge. Takes minutes. Level 3 (Reproducibility): Isolated agents reproduce claims in sandboxes without access to expected outputs. Takes hours to days. This is important: withholding ground truth prevents agents from fabricating results via label leakage. The Results On two benchmarks:\nMetric Without Ara With Ara Improvement Q\u0026amp;A accuracy (PaperBench) 72.4% 93.7% +21.3 pts Reproduction success (RE-Bench, hard tasks) 57.4% 64.4% +7.0 pts The Q\u0026amp;A improvement makes sense: structured data is easier to query than prose. The reproduction improvement is more modest but still meaningful, especially on hard tasks.\nAn interesting finding from their analysis: Category C information (failure knowledge, i.e., things that went wrong) showed a +65.7% accuracy improvement when available through the trace layer vs. no source at all. This validates the core intuition that preserving failures is valuable.\nWhat I Think What Ara Gets Right Failure preservation is the killer feature. 
I can\u0026rsquo;t overstate how much research time is wasted repeating failed approaches because nobody publishes their failures. The trace layer alone would be worth the effort.\nBidirectional claim grounding — every claim links to code and evidence, every piece of code traces back to a claim. This is what we should have been doing all along. In traditional papers, the connection between \u0026ldquo;we observe X\u0026rdquo; and the actual experiment that produced X is maintained only in the authors\u0026rsquo; heads.\nProgressive crystallization — the idea that artifacts are built continuously during research, not written up after the fact. This is how research actually works but not how papers are written.\nThe migration path exists. Ara Compiler means you don\u0026rsquo;t need to convince every researcher to change their workflow overnight. You can convert existing papers, imperfectly, and improve over time.\nWhat Gives Me Pause Late-phase reversals. They found that stronger models (Claude Sonnet 4.6) sometimes outperform Ara-assisted agents on extension tasks. This suggests preserved failure traces might actually constrain exploration for capable systems. If you tell a smart agent \u0026ldquo;don\u0026rsquo;t go there, it\u0026rsquo;s a dead end\u0026rdquo;, maybe it would have found something you missed. This is a real tension.\nFabrication isn\u0026rsquo;t solved. 1-2 fabrication instances across all runs. Level 3 verification prevents label leakage but not fundamental confabulation. Structured data reduces hallucination risk but doesn\u0026rsquo;t eliminate it.\nDiscipline scope. Ara relies on executable code. This works for ML and CS, maybe for computational biology and physics. It doesn\u0026rsquo;t work for theoretical math, humanities, or wet-lab biology without automation. The authors acknowledge this but the limitation is important.\nHuman oversight costs. Level 2 and Level 3 review require substantial compute. 
We\u0026rsquo;re offloading mechanical checking but not cognitive judgment. The question is whether the compute cost is worth it compared to human peer review — probably yes, but it\u0026rsquo;s not free.\nThe vision might be ahead of the tooling. Ara envisions a future of executable diffs instead of PDFs, git-like forking instead of citations, machine-verifiable claims instead of peer-review opinions. That\u0026rsquo;s beautiful but we\u0026rsquo;re very far from having the infrastructure. The MCP-style integration patterns they describe are natural but nobody has built them yet.\nThe Big Picture What Ara is really proposing is a shift from papers as narratives to papers as databases. Instead of telling a story about your research, you publish a queryable, executable knowledge bundle that any agent (human or AI) can inspect, verify, and extend.\nI think this is directionally correct. The PDF paper format is a product of the printing press era, and we\u0026rsquo;ve been using it for decades past its expiration date. The question isn\u0026rsquo;t whether something like Ara will happen — it\u0026rsquo;s whether it\u0026rsquo;ll be Ara specifically or something else, and whether the transition will be gradual (Ara Compiler converting old papers) or sudden (a major conference adopting Ara-native submissions).\nMy guess: gradual, starting with ML conferences (who are the most tooling-forward), and probably not Ara exactly but something that borrows its best ideas. The four-layer structure is clean. The trace layer is genuinely novel. 
The verification system is thoughtful.\nIf you\u0026rsquo;re building AI research tools or thinking about the future of scientific publishing, this paper is worth reading in full.\nReference Title: Agent-Native Research Artifacts (Ara) Authors: Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, +33 coauthors Published: April 27, 2026 (45 pages, 15 figures, 14 tables) arXiv: 2604.24658 License: CC0 (Public Domain) ","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-04-28/","summary":"A deep look at Agent-Native Research Artifacts (Ara) — a proposed replacement for academic PDFs that packages research as machine-executable knowledge bundles. What it gets right, what it gets wrong, and why it matters for AI-assisted research.","title":"Ara: What If Research Papers Were Executable?"},{"content":"The Homogeneity Problem When we do RLHF, we train a reward model on human preference data. But whose preferences? In practice, it\u0026rsquo;s a mix of annotators from a specific demographic, often English-speaking, often from a particular cultural background. The reward model learns to predict what this group prefers on average.\n\u0026ldquo;Artificial Hivemind: The Open-Ended Homogeneity of Language Models\u0026rdquo; (NeurIPS 2025 Best Paper) measured this problem directly. They built a benchmark of 26K queries with 31K dense human annotations and found that:\nRLHF significantly reduces output diversity Reward models are miscalibrated against diverse human preferences Models converge to a narrow band of \u0026ldquo;safe, helpful\u0026rdquo; responses that represent a cultural average rather than genuine helpfulness The metaphor is apt: RLHF creates a hivemind. Every model aligned with the same reward model produces similar outputs, reflecting similar values. If your values happen to align with the training distribution\u0026rsquo;s average, great. 
If they don\u0026rsquo;t, the model is less helpful to you.\nWhy This Matters More Than It Sounds You might think \u0026ldquo;so what — the model is just optimized for the majority, that\u0026rsquo;s normal.\u0026rdquo; But consider:\nDifferent cultures have different safety norms. What counts as \u0026ldquo;harmful content\u0026rdquo; varies significantly across cultures. A frank discussion about certain political topics might be considered dangerous in one context and essential free speech in another. A single safety threshold encodes one culture\u0026rsquo;s norms as universal.\nDifferent users have different expertise levels. A medical professional asking about drug interactions needs different information than a teenager. A security researcher asking about vulnerabilities needs different responses than a random user. One-size-fits-all safety leaves experts underserved and novices occasionally overserved.\nAnnotator disagreement is signal, not noise. When annotators disagree about which response is \u0026ldquo;better,\u0026rdquo; standard RLHF treats this as noise and averages it out. But the disagreement itself contains information — it tells you this is a value-laden judgment where reasonable people differ. Averaging destroys this signal.\nApproaches to Pluralistic Alignment Counterfactual Reasoning for Steerable Alignment \u0026ldquo;Counterfactual Reasoning for Steerable Pluralistic Value Alignment\u0026rdquo; (NeurIPS 2025) proposes aligning LLMs with diverse cultural and demographic values using counterfactual reasoning.\nThe idea: instead of training one reward model on averaged preferences, train the model to reason about whose values are being applied and adjust accordingly. \u0026ldquo;If this user is a medical professional in Japan, what response would align with their values and context?\u0026rdquo;\nThis requires the model to maintain multiple \u0026ldquo;value profiles\u0026rdquo; and select or blend them based on context. 
It\u0026rsquo;s more complex than standard alignment but more respectful of diversity.\nPersonalization + Reward Modeling \u0026ldquo;Pluralistic Alignment \u0026amp; LLM Personalization\u0026rdquo; (ICLR 2026) explores personalization as a mechanism for pluralistic alignment. Instead of one reward model for everyone, learn user-specific or group-specific reward signals.\nThe challenge: you need enough interaction data per user or group to learn meaningful preferences. And you need to do this without allowing personalization to undermine safety — a model that \u0026ldquo;personalizes\u0026rdquo; by removing safety guardrails for users who prefer no guardrails is not the goal.\nCOMAL: Meta-Algorithm for General Preferences \u0026ldquo;COMAL: Convergent Meta-Algorithm for Aligning LLMs with General Preferences\u0026rdquo; (ICLR 2026) takes a meta-learning approach. Instead of hard-coding one preference structure, learn a general preference alignment algorithm that can be instantiated with different preference distributions.\nThink of it as alignment-as-a-service: plug in a preference distribution (cultural group X, professional role Y, personal values Z) and get an aligned model for that distribution.\nFair Decision Utility in Human-AI Collaboration \u0026ldquo;Fair Decision Utility in Human-AI Collaboration\u0026rdquo; (ICLR 2026, Meta) reframes alignment as fair utility across groups with varying cognitive capacities. 
The model should provide equal value to users regardless of their background knowledge.\nThis is a different angle: not \u0026ldquo;align with different values\u0026rdquo; but \u0026ldquo;be equally useful to different people.\u0026rdquo; A model that provides sophisticated explanations to experts but dumbed-down answers to novices might be more helpful overall, but it raises questions about fairness.\nThe Safety Tension Here\u0026rsquo;s where it gets difficult: pluralistic alignment and safety can conflict.\nSafety guardrails are, by design, universal constraints. \u0026ldquo;Don\u0026rsquo;t help users create weapons\u0026rdquo; applies to everyone, regardless of their cultural background or professional role. But many safety decisions are not this clear-cut:\nShould the model discuss suicide prevention methods in detail? (Helpful for counselors, potentially harmful for vulnerable individuals) Should it explain how certain drugs interact? (Essential for pharmacists, risky for self-medication) Should it engage with politically sensitive topics? (Important for journalists, potentially destabilizing in certain contexts) A pluralistic alignment system needs to distinguish between universal safety constraints (hard rules that apply to everyone) and contextual norms (soft guidelines that vary by user, culture, and situation).\nThe current approach — treating all safety as universal — is simpler but less helpful. The pluralistic approach is more helpful but harder to implement safely.\nNash DPO and Multi-Stakeholder Alignment Nash DPO (ICLR 2026, mentioned in the SafeDPO post) addresses a specific technical challenge in pluralistic alignment: how to optimize when multiple stakeholders with different preferences all have a say.\nThe solution: find the Nash equilibrium of a preference game. 
Each stakeholder\u0026rsquo;s preferences define a \u0026ldquo;player\u0026rdquo; in the game, and the Nash equilibrium is the policy that no single player can improve by unilateral deviation.\nThis is elegant game theory, but it raises a practical question: who are the \u0026ldquo;players\u0026rdquo;? Who decides which preference groups get a seat at the table? This is ultimately a governance question, not a technical one.\nMy Take I think pluralistic alignment is the right long-term direction but an incredibly hard engineering and governance problem.\nThe technical challenge: you need models that can maintain multiple coherent value systems simultaneously and switch between them based on context, without allowing any single value system to subvert safety constraints. This is much harder than current single-objective alignment.\nThe governance challenge: who defines the value groups? Who decides which cultural norms are \u0026ldquo;valid preferences\u0026rdquo; vs. \u0026ldquo;harmful beliefs the model should not accommodate\u0026rdquo;? These are fundamentally political questions that technical systems can\u0026rsquo;t resolve.\nThe near-term reality: most deployment contexts will continue using single-objective alignment (one reward model, universal safety guardrails) because it\u0026rsquo;s simpler and more predictable. Pluralistic alignment will likely emerge first in specialized applications — healthcare, legal, education — where the user\u0026rsquo;s role and context are well-defined.\nWhat I\u0026rsquo;d bet on: the Artificial Hivemind finding will be cited heavily in the coming years. The realization that RLHF reduces diversity is important because it reframes the alignment problem. We\u0026rsquo;re not just asking \u0026ldquo;is the model safe?\u0026rdquo; but also \u0026ldquo;safe according to whom?\u0026rdquo; and \u0026ldquo;helpful for whom?\u0026rdquo;\nThese are questions that the alignment community is only starting to take seriously. 
The technical tools (counterfactual reasoning, personalized reward models, Nash equilibria) exist. The hard part is figuring out when and how to use them responsibly.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-04-15/","summary":"RLHF optimizes for an average human preference, but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.","title":"Pluralistic Alignment: One Model, Many Values"},{"content":"Why SAEs Are Suddenly Everywhere If you looked at the NeurIPS 2025 and ICLR 2026 proceedings, you\u0026rsquo;d notice something: Sparse Autoencoders (SAEs) are in everything. At least 10 papers at NeurIPS alone, another 5+ at ICLR. Language models, diffusion models, pathology models, code correctness, in-context learning: SAEs are being applied to all of them.\nTwo years ago, SAEs were a niche tool in the mechanistic interpretability community. Now they\u0026rsquo;re arguably the dominant method for understanding what neural networks learn. So what happened?\nWhat SAEs Do The core idea is simple. Neural network activations (the hidden states between layers) live in a high-dimensional space. Each dimension doesn\u0026rsquo;t correspond to a single interpretable concept. Instead, networks use superposition, where more concepts are represented than there are dimensions, with each concept encoded as a direction in activation space.\nAn SAE is trained to decompose these activations into sparse, interpretable features:\n$$\\hat{h} = \\text{Dec}(\\text{Enc}(h)) = \\sum_{i} f_i(h) \\cdot d_i$$\nwhere $h$ is the activation vector, $\\hat{h}$ is its reconstruction, $f_i(h)$ are the (sparse) feature activations, and $d_i$ are learned feature directions (dictionary vectors).\nThe sparsity constraint means that for any given input, only a small number of features are active. 
This makes each feature interpretable — you can look at what inputs activate feature $i$ and get a coherent description of what it represents.\nFor example, an SAE trained on GPT-2\u0026rsquo;s activations might produce features like:\nFeature 47: activates on Python function definitions Feature 203: activates on mentions of geographic locations in Europe Feature 891: activates on negation words in the context of safety refusals This gives us a vocabulary for describing what the model is doing at each layer.\nThe Application Explosion Here\u0026rsquo;s where SAEs are being applied, based on the 2025-2026 conference papers:\nLanguage Models (Expected) This is where SAEs started and where most work happens. The standard application: train an SAE on a specific layer\u0026rsquo;s activations, identify features, and use them to understand or steer model behavior.\nNeurIPS 2025 had several papers pushing this forward:\nFeature explanations — improving the quality of automated descriptions of what each feature represents Feature sensitivity — measuring how reliably features activate on similar inputs (answer: not as reliably as you\u0026rsquo;d want) SAE Neural Operators — generalizing SAEs to infinite-dimensional function spaces, extending the linear representation hypothesis to a \u0026ldquo;functional representation hypothesis\u0026rdquo; Diffusion Models (New) \u0026ldquo;One-Step is Enough\u0026rdquo; (NeurIPS 2025) extended SAE interpretability to SDXL Turbo, a text-to-image diffusion model. They found interpretable features corresponding to visual concepts — textures, object types, compositional patterns. This is significant because it shows the SAE approach generalizes beyond language.\nPathology Models (Unexpected) \u0026ldquo;Evaluating the Utility of SAEs for Interpreting a Pathology Foundation Model\u0026rdquo; (NeurIPS 2025) applied SAEs to a medical image model and found that features correlated strongly with cell type counts. 
So SAEs extract biologically meaningful concepts from pathology models, not just vaguely interpretable directions.\nCode Correctness (ICLR 2026) \u0026ldquo;Mechanistic Interpretability of Code Correctness in LLMs via SAEs\u0026rdquo; identified feature directions that correspond to whether the model believes its generated code is correct. This could enable runtime monitoring of code generation confidence.\nIn-Context Learning (ICLR 2026) \u0026ldquo;Mechanistic Interpretability of In-Context Learning Generalization\u0026rdquo; used SAEs to find \u0026ldquo;common structures\u0026rdquo; in transformer QK circuits that enable in-context learning. This connects SAEs to understanding one of the most mysterious capabilities of LLMs.\nThe Limitations Nobody Has Solved Despite the hype, SAEs have fundamental issues that the community is actively wrestling with.\nFeature Absorption This is the most important open problem. \u0026ldquo;Feature Absorption in Sparse Autoencoders\u0026rdquo; (NeurIPS 2025) showed that when a model represents hierarchical features (e.g., \u0026ldquo;animal\u0026rdquo; → \u0026ldquo;dog\u0026rdquo; → \u0026ldquo;golden retriever\u0026rdquo;), SAE training can cause lower-level features to be absorbed into higher-level ones.\nWhat this means: the SAE might have a feature for \u0026ldquo;dog\u0026rdquo; that absorbs the \u0026ldquo;golden retriever\u0026rdquo; feature. You\u0026rsquo;d see \u0026ldquo;dog\u0026rdquo; activate for golden retrievers, but you\u0026rsquo;d miss the specific breed information. The hierarchy collapses.\nThe paper showed that varying SAE size and sparsity level is insufficient to fix this. It\u0026rsquo;s a structural problem with how SAEs decompose hierarchical representations. 
\u0026ldquo;Matching Pursuit SAE\u0026rdquo; (NeurIPS 2025) proposes a different architecture using Matching Pursuit that handles hierarchical features better, but it\u0026rsquo;s not a complete solution.\nSensitivity and Reliability How reliably does a feature activate on similar inputs? \u0026ldquo;Measuring SAE Feature Sensitivity\u0026rdquo; (NeurIPS 2025) found that reliability varies significantly across features. Some features activate consistently on related inputs; others are noisy and context-dependent.\nIf you\u0026rsquo;re building safety monitoring on top of SAE features (e.g., \u0026ldquo;alert if the safety-refusal feature drops below threshold\u0026rdquo;), you need features to be reliable. Current SAEs may not be reliable enough for safety-critical applications.\nIdentifiability \u0026ldquo;Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?\u0026rdquo; (ICLR 2026) asks the fundamental question: can we uniquely identify the \u0026ldquo;true\u0026rdquo; mechanistic description of a network, or are there multiple equally valid decompositions?\nIf there are multiple valid decompositions, then different SAEs trained on the same model might give different but equally correct feature sets. This doesn\u0026rsquo;t invalidate SAEs, but it means we need to be careful about claiming we\u0026rsquo;ve found \u0026ldquo;the\u0026rdquo; representation rather than \u0026ldquo;a\u0026rdquo; representation.\nLayerNorm Interference \u0026ldquo;Small Transformers Don\u0026rsquo;t Need LayerNorm at Inference Time\u0026rdquo; (ICLR 2026) found that LayerNorm hinders mechanistic interpretability. They created LN-free analogs of GPT-2 XL that enable more precise analysis. 
This suggests that some of the noise in SAE feature quality might come from the interaction between SAEs and LayerNorm, not from the SAE method itself.\nSAEs and Safety How does this connect to the safety neurons work and the broader alignment story?\nSAEs give us a feature-level vocabulary for talking about model behavior. Safety neurons (from the earlier post) are specific circuits identified at the neuron level. SAEs operate at a higher level of abstraction — they decompose activations into features that may span multiple neurons.\nThe connection: SAE features for safety-related behavior should activate in the same contexts where safety neurons fire. If they don\u0026rsquo;t, something is being missed by one or both methods.\nThe promise: if SAEs give us reliable safety features, we can monitor model alignment in real time during deployment. \u0026ldquo;Is the model\u0026rsquo;s safety feature active?\u0026rdquo; is a much more informative signal than \u0026ldquo;did the model refuse?\u0026rdquo; (which only tells you after the output is generated).\nThe reality: feature absorption might hide safety features inside broader categories, and sensitivity issues mean feature activations might not be reliable enough for safety-critical monitoring. We\u0026rsquo;re not there yet.\nWhere This Goes My read on the SAE landscape in mid-2026:\nAdoption is ahead of understanding. People are applying SAEs to everything because they work — you get interpretable features out. But the fundamental limitations (absorption, sensitivity, identifiability) mean we don\u0026rsquo;t fully understand what we\u0026rsquo;re getting or missing.\nThe tool is improving faster than alternatives. Despite its problems, SAEs are more practical and scalable than other interpretability methods (circuit analysis, linear probes, attention head analysis). 
The volume of papers suggests the community has converged on SAEs as the primary approach, and incremental improvements are compounding.\nSafety applications are the highest-stakes use case. If SAEs become reliable enough for real-time safety monitoring, that changes the alignment game significantly. But \u0026ldquo;reliable enough\u0026rdquo; is a high bar, and we\u0026rsquo;re honestly not sure how close we are.\nThe Swiss Army Knife metaphor is apt: SAEs are versatile and useful for many things, but you wouldn\u0026rsquo;t use a Swiss Army Knife for surgery. For safety-critical interpretability, we might need more specialized tools — and SAEs as currently designed might not be enough.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-04-08/","summary":"SAEs went from niche interpretability tool to dominant research theme in one year. Where they\u0026rsquo;re being applied, what they reveal, and the fundamental limitations nobody has solved yet.","title":"Sparse Autoencoders: The Swiss Army Knife of Interpretability"},{"content":"DPO: Clean but Flawed If you\u0026rsquo;ve read the RLHF post in this series, you know that DPO (Direct Preference Optimization) is elegant — it eliminates the reward model and PPO entirely, replacing them with a single loss function over preference pairs. It\u0026rsquo;s simpler to implement, more stable to train, and produces competitive results.\nSo why did NeurIPS 2025 and ICLR 2026 have a combined 15+ papers trying to fix it?\nBecause DPO has real problems. And in 2025-2026, the community found them, characterized them, and started fixing them. 
This post surveys the most important DPO variants, focusing on the safety angle.\nWhat\u0026rsquo;s Wrong with DPO Problem 1: DPO Is a Misspecified Estimator An ICLR 2026 oral paper (\u0026ldquo;Why DPO is a Misspecified Estimator and How to Fix It\u0026rdquo;) showed that DPO\u0026rsquo;s loss function makes an implicit assumption that doesn\u0026rsquo;t hold in practice.\nDPO assumes the optimal policy takes the form $\\pi^*(y|x) \\propto \\pi_{ref}(y|x) \\exp(r(x,y)/\\beta)$. This is the closed-form solution to the KL-constrained reward maximization problem. But this only holds at convergence — during training, the policy is not yet optimal, and the implicit reward extracted from the policy\u0026rsquo;s log-probabilities is inaccurate.\nThe concrete consequences:\nPreference reversals: DPO can learn to prefer response B over A even when the training data says A \u0026gt; B Reward degradation: The implicit reward model quality decreases during training, especially on out-of-distribution inputs Problem 2: Safety-Helpfulness Trade-off Standard DPO optimizes a single preference: \u0026ldquo;which response is better?\u0026rdquo; But \u0026ldquo;better\u0026rdquo; conflates helpfulness and safety. A response can be helpful but unsafe, or safe but unhelpful. Optimizing a single preference signal pushes the model toward whichever dimension dominates the training data.\nIn practice, most preference datasets are skewed toward helpfulness (because that\u0026rsquo;s what annotators naturally reward). Safety gets underrepresented unless you explicitly add safety-focused preference pairs.\nProblem 3: Average vs. Worst-Case DPO optimizes expected performance across the training distribution. But safety is a worst-case property — you need the model to be safe on every prompt, not just on average. A model that\u0026rsquo;s safe on 99% of prompts and catastrophically unsafe on 1% has failed at safety.\nSafeDPO: The Clean Fix SafeDPO (ICLR 2026, Kim et al.) 
is my favorite solution because it adds minimal machinery.\nThe idea: add a single safety constraint to the DPO objective. Instead of optimizing helpfulness preference alone, SafeDPO preserves the optimal solution of a safety-constrained optimization problem.\nIn standard DPO, the loss is:\n$$L_{DPO} = -\\mathbb{E}\\left[\\log \\sigma\\left(\\beta \\log \\frac{\\pi_\\theta(y_w|x)}{\\pi_{ref}(y_w|x)} - \\beta \\log \\frac{\\pi_\\theta(y_l|x)}{\\pi_{ref}(y_l|x)}\\right)\\right]$$\nSafeDPO modifies this to incorporate a safety margin $\\alpha$:\n$$L_{SafeDPO} = -\\mathbb{E}\\left[\\log \\sigma\\left(\\beta \\log \\frac{\\pi_\\theta(y_w|x)}{\\pi_{ref}(y_w|x)} - \\beta \\log \\frac{\\pi_\\theta(y_l|x)}{\\pi_{ref}(y_l|x)} - \\alpha \\cdot c(x, y_w, y_l)\\right)\\right]$$\nwhere $c(x, y_w, y_l)$ is a safety cost term derived from dual preference annotations (helpfulness + safety).\nOne extra hyperparameter ($\\alpha$). No auxiliary networks. No cost models. No major architectural changes. Just a principled adjustment to the loss that balances helpfulness and safety.\nThe results show that SafeDPO matches or exceeds standard DPO on helpfulness benchmarks while significantly improving safety metrics. The safety-helpfulness trade-off becomes tunable via $\\alpha$.\nRePO: Per-Prompt Safety Guarantees Rectified Policy Optimization (NeurIPS 2025) attacks Problem 3 — the average vs. worst-case issue.\nStandard safety-constrained RLHF uses an expected safety constraint: $\\mathbb{E}[c(x, y)] \\leq \\epsilon$. This allows the model to be very unsafe on some prompts as long as it\u0026rsquo;s safe on average.\nRePO replaces this with a per-prompt constraint: for every prompt $x$, the expected safety cost must be below the threshold. This is much stricter but also much more aligned with what we actually want from a safe model.\nThe challenge: per-prompt constraints are hard to enforce across the entire input distribution. 
RePO uses a rectification approach that identifies and upweights the prompts where safety violations are most likely, effectively focusing optimization effort where it matters most.\nToken-Importance Guided DPO Standard DPO treats all tokens in a response equally when computing log-probabilities. But not all tokens matter equally for preference. The token that makes a response harmful might be one specific word in an otherwise fine response.\nToken-Importance DPO (ICLR 2026) assigns learned weights to each token position, so the preference signal is concentrated on the tokens that actually differ between preferred and dispreferred responses. This reduces noise in the gradient and improves both helpfulness and safety.\nNash DPO: Multi-Agent Preferences Multiplayer Nash Preference Optimization (ICLR 2026) extends DPO to a multi-agent setting. Instead of binary preferences (A \u0026gt; B), it considers preferences from multiple annotators who may disagree.\nThe solution: find the Nash equilibrium of a preference game, where the policy balances competing preferences. This connects to the pluralistic alignment theme — different users have different values, and a single DPO loss can\u0026rsquo;t represent all of them.\nOther Notable Variants\n| Variant | Key Idea | Source |\n| --- | --- | --- |\n| Semi-Supervised DPO | Uses limited labeled + abundant unlabeled preference data | ICLR 2026 |\n| Learning from Reference Answers | Replaces binary preferences with reference completions | ICLR 2026 |\n| Learning from Noisy Preferences | Robust to mislabeled preference pairs | ICLR 2026 |\n| Alignment-Weighted DPO | Weights DPO loss by alignment quality per example | ICLR 2026 |\n| Less is More (Data Selection) | Strategic selection of preference data beats using all of it | NeurIPS 2025 |\nThe Comparison\n| Method | Fixes | Complexity vs. DPO | Safety-Aware? |\n| --- | --- | --- | --- |\n| DPO (baseline) | — | Baseline | No |\n| SafeDPO | Safety-helpfulness trade-off | +1 hyperparameter | Yes (dual annotations) |\n
| RePO | Average vs. worst-case | Moderate | Yes (per-prompt) |\n| Token-Importance DPO | Token-level noise | Moderate | Indirectly |\n| Nash DPO | Multi-annotator disagreement | Higher | Indirectly (pluralistic) |\n| Data Selection | Data quality | Minimal | Depends on data |\nWhat I Take Away The DPO story in 2025-2026 follows a familiar pattern in ML:\n1. A clean, elegant method appears (DPO, 2023)\n2. People adopt it widely because it\u0026rsquo;s simpler than the alternative (RLHF)\n3. Edge cases and failure modes emerge at scale\n4. The community produces a dozen fixes, each addressing a different failure mode\n5. A few fixes become standard (SafeDPO is my bet for safety-aware applications)\nThe broader lesson: alignment techniques that optimize a single objective (helpfulness preference) will always struggle with safety. Safety needs to be in the objective explicitly, not as a side effect of \u0026ldquo;being helpful.\u0026rdquo; SafeDPO\u0026rsquo;s approach — adding a safety term with minimal machinery — seems like the right level of intervention.\nThe even broader question: are these DPO fixes enough, or do we need fundamentally different approaches for safety? The divergence estimation view from the RLHF post suggests that better estimators could help. But better estimators of what? That\u0026rsquo;s where the pluralistic alignment work comes in — and we\u0026rsquo;ll cover that soon.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-03-30/","summary":"DPO has problems — preference reversals, reward degradation, and a safety-helpfulness trade-off. Here\u0026rsquo;s how SafeDPO, RePO, and other recent variants are fixing them.","title":"SafeDPO and Friends: Preference Optimization That Doesn't Sacrifice Safety"},{"content":"The Hype Since OpenAI\u0026rsquo;s o1 and DeepSeek\u0026rsquo;s R1, the narrative has been clear: train LLMs with RL on reasoning tasks, and they learn to reason better. 
Give them chain-of-thought, reward correct answers, and the model develops genuine reasoning capabilities.\nThis narrative is compelling. It\u0026rsquo;s also probably wrong — or at least significantly oversimplified.\nSeveral papers from NeurIPS 2025 and ICLR 2026 challenge the assumption that RL for reasoning does what we think it does. The evidence points to something more nuanced: RL makes models better at finding answers they could already produce, without necessarily expanding what they can reason about.\nThe Key Finding: Sampling Efficiency ≠ Reasoning Capacity A NeurIPS 2025 paper (\u0026ldquo;Does Reinforcement Learning Really Incentivize Reasoning Capabilities?\u0026rdquo;) ran a clean experiment. They took a base model, applied RL from verifiable rewards (RLVR — where the reward signal comes from checking if the answer is correct, not from a learned reward model), and measured two things:\nPass@1: Does the model get the right answer on the first try more often? Pass@k: If you sample k responses and take the best one, does RL expand the set of problems the model can solve? The result: RL significantly improves Pass@1 but barely moves Pass@k (for large k). In other words:\nThe model doesn\u0026rsquo;t learn to solve new problems it couldn\u0026rsquo;t solve before It gets better at producing correct solutions on the first attempt The \u0026ldquo;reasoning\u0026rdquo; was already latent in the pre-trained model — RL just makes it more accessible This is a fundamental distinction. Sampling efficiency means you need fewer attempts to find a correct answer. Reasoning capacity means you can solve problems you previously couldn\u0026rsquo;t solve at all. RLVR improves the former, not the latter.\nLonger Chains Don\u0026rsquo;t Always Help If RL for reasoning worked the way we hoped, longer chains of thought should lead to better answers — more \u0026ldquo;thinking\u0026rdquo; should mean better reasoning. 
Several papers found the opposite.\nSelf-Correction in Long CoT (NeurIPS 2025 FoRLM Workshop) found that in extended chain-of-thought sequences, models engage in redundant reasoning. The first reasoning step dominates the outcome, and subsequent steps mostly repeat or marginally refine it. Self-correction does happen, but it\u0026rsquo;s rarer than you\u0026rsquo;d expect.\nR1-Zero for GUI Grounding (NeurIPS 2025) trained agents with online RL + CoT reasoning for computer-use tasks. They found that longer reasoning chains actually led to worse performance. The model would overthink, second-guess itself, and end up with worse actions than a shorter reasoning process.\nThis suggests there\u0026rsquo;s an optimal \u0026ldquo;reasoning length\u0026rdquo; for each task, and that length is shorter than you might think. More tokens ≠ more thought. Sometimes more tokens = more confusion.\nWhat RL Actually Does to LLMs Combining these findings, here\u0026rsquo;s my current model of what RL for reasoning does:\nRL reshapes the output distribution, not the knowledge. The pre-trained model has a broad distribution over possible outputs for any given input. Some of those outputs contain correct reasoning, some don\u0026rsquo;t. RL concentrates probability mass on the outputs that lead to correct answers. It\u0026rsquo;s sharpening, not expanding.\nRL teaches the model when to think and when to just answer. On easy problems, RL-trained models often produce shorter, more direct responses. On hard problems, they produce longer reasoning chains. The model learns to allocate compute where it\u0026rsquo;s needed — which is a real and valuable capability, just not the same as \u0026ldquo;learning to reason.\u0026rdquo;\nRL can improve formatting and structure. RL-trained models produce more organized, step-by-step reasoning chains. This improves readability and makes it easier for verifiers to check the work. 
But structured presentation of reasoning is different from the reasoning itself.\nThe Counter-Evidence Not all evidence points this way. There are cases where RL does seem to unlock new capabilities:\nRLMT (RL with Model-Rewarded Thinking) (NeurIPS 2025 FoRLM Workshop) uses preference-based reward models and online RL, and outperforms standard RLHF across DPO, PPO, and GRPO optimizers. The gains come from teaching the model to think in a more structured way before answering, which sometimes enables solutions the base model couldn\u0026rsquo;t find.\nAgentFlow + Flow-GRPO (ICLR 2026) showed that a 7B model with good RL training and agent architecture beat GPT-4o on complex multi-step tasks. This isn\u0026rsquo;t just sampling efficiency — the 7B model genuinely couldn\u0026rsquo;t solve these tasks before RL training. But the capability might come from the agent architecture (planner + executor + verifier) rather than from RL alone.\nLoongRL (ICLR 2026) focuses on RL for long-context reasoning and shows genuine improvement on tasks requiring multi-hop reasoning over long documents. The key ingredient seems to be training on tasks that specifically require long-range information integration.\nSo the picture isn\u0026rsquo;t all negative. RL can expand capabilities in some settings, especially when combined with architectural innovations or specifically designed training tasks. But the blanket claim that \u0026ldquo;RL teaches reasoning\u0026rdquo; is too strong.\nWhat This Means for Alignment If RL mostly improves sampling efficiency rather than reasoning capacity, the implications for alignment are interesting:\nGood news: Safety training (RLHF/DPO) might be more robust than we thought. 
If RL doesn\u0026rsquo;t fundamentally change what the model can do, only how likely it is to do it, then safety alignment is working on the same model capabilities that pre-training established.\nBad news: We can\u0026rsquo;t rely on RL alone to teach models to reason safely about novel situations. If the model doesn\u0026rsquo;t have latent safety reasoning from pre-training, RLHF won\u0026rsquo;t create it — it\u0026rsquo;ll just make whatever safety behavior exists more or less likely to appear.\nPractical implication: Pre-training data and SFT quality might matter more than we thought for safety outcomes. RLHF fine-tunes the edges, but the foundation has to be solid.\nMy Take I think the field over-indexed on RL for reasoning in 2024-2025, partly because the results were impressive (o1 genuinely performs better on math and coding benchmarks) and partly because the narrative was clean (train with RL → learn to reason).\nThe reality is messier. RL is a powerful tool for optimization, but optimization and learning are different things. RL optimizes the model to produce outputs that score well on a reward signal. If \u0026ldquo;scoring well\u0026rdquo; correlates with \u0026ldquo;genuine reasoning,\u0026rdquo; great. If it correlates with \u0026ldquo;pattern-matching that looks like reasoning,\u0026rdquo; we\u0026rsquo;re fooling ourselves.\nThe research direction I\u0026rsquo;m most excited about is understanding when RL expands capabilities vs. when it merely sharpens existing ones. The answer probably depends on the task, the training data, and how far the pre-trained model is from being able to solve it. Getting this boundary right will determine whether scaling RL for reasoning is a path to genuine AI progress or a path to increasingly convincing pattern matching.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-03-28/","summary":"The evidence is more complicated than the hype suggests. 
RL improves sampling efficiency but may not expand reasoning capacity — and longer chains of thought don\u0026rsquo;t always help.","title":"Does RL Actually Make LLMs Reason Better?"},{"content":"The Standard Story The standard way to explain RLHF is as a three-step pipeline:\nSupervised Fine-Tuning (SFT): Take a pre-trained LLM, fine-tune it on high-quality demonstration data. Reward Model Training: Collect human preferences (response A is better than response B), train a reward model to predict these preferences. RL Optimization: Use PPO to optimize the SFT model against the reward model, with a KL penalty to prevent it from straying too far. This is the InstructGPT recipe (Ouyang et al., 2022). It works. But it doesn\u0026rsquo;t really explain why it works, or why DPO — which skips the reward model entirely — also works, or what all these methods have in common.\nA NeurIPS 2025 paper (\u0026ldquo;LLM Safety Alignment is Divergence Estimation in Disguise\u0026rdquo;) offers a much cleaner perspective. Let me walk thru it.\nThe Divergence View Here\u0026rsquo;s the core idea. Imagine we have two distributions over model outputs:\n$p_{safe}$: the distribution of \u0026ldquo;good\u0026rdquo; outputs (helpful, harmless, honest) $p_{unsafe}$: the distribution of \u0026ldquo;bad\u0026rdquo; outputs (harmful, unhelpful, dishonest) Alignment is fundamentally about making the model\u0026rsquo;s output distribution $\\pi_\\theta$ closer to $p_{safe}$ and farther from $p_{unsafe}$. We can measure this with divergence functions like KL divergence.\nThe surprising claim: RLHF, DPO, and related methods are all doing divergence estimation. They differ in how they estimate and optimize this divergence, but the underlying objective is the same.\nRLHF as Divergence Estimation In RLHF, the reward model $r_\\phi(x, y)$ learns to assign higher scores to preferred outputs. 
The RL objective is:\n$$\\max_\\theta \\; \\mathbb{E}_{x \\sim D, y \\sim \\pi_\\theta(\\cdot|x)}\\left[r_\\phi(x, y)\\right] - \\beta \\cdot D_{KL}(\\pi_\\theta \\| \\pi_{ref})$$\nwhere $\\pi_{ref}$ is the SFT model and $\\beta$ controls how far the policy can deviate.\nThe closed-form solution (assuming unlimited optimization capacity) is:\n$$\\pi^*(y|x) = \\frac{1}{Z(x)} \\pi_{ref}(y|x) \\exp\\left(\\frac{r_\\phi(x, y)}{\\beta}\\right)$$\nwhere $Z(x) = \\sum_y \\pi_{ref}(y|x) \\exp(r_\\phi(x, y) / \\beta)$ is the partition function.\nNow, what is the reward model actually learning? It\u0026rsquo;s trained on preference pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, using the Bradley-Terry model:\n$$P(y_w \\succ y_l | x) = \\sigma(r_\\phi(x, y_w) - r_\\phi(x, y_l))$$\nwhere $\\sigma$ is the sigmoid function.\nThe key insight: $r_\\phi(x, y_w) - r_\\phi(x, y_l)$ is estimating the log-likelihood ratio between the safe and unsafe distributions:\n$$r_\\phi(x, y_w) - r_\\phi(x, y_l) \\approx \\log \\frac{p_{safe}(y_w|x)}{p_{unsafe}(y_w|x)} - \\log \\frac{p_{safe}(y_l|x)}{p_{unsafe}(y_l|x)}$$\nSo the reward model is implicitly learning the divergence between safe and unsafe distributions, and PPO is optimizing the policy to maximize this divergence in favor of the safe distribution.\nDPO: Skipping the Middleman Direct Preference Optimization (Rafailov et al., 2023) starts from the same KL-constrained objective as RLHF but rearranges the math to eliminate the reward model.\nFrom the closed-form optimal policy above, we can solve for the reward:\n$$r(x, y) = \\beta \\log \\frac{\\pi_\\theta(y|x)}{\\pi_{ref}(y|x)} + \\beta \\log Z(x)$$\nSubstituting this into the Bradley-Terry preference model:\n$$P(y_w \\succ y_l | x) = \\sigma\\left(\\beta \\log \\frac{\\pi_\\theta(y_w|x)}{\\pi_{ref}(y_w|x)} - \\beta \\log \\frac{\\pi_\\theta(y_l|x)}{\\pi_{ref}(y_l|x)}\\right)$$\nThe partition function $Z(x)$ cancels because it appears in both terms. 
The DPO loss is the negative log of this preference probability, so minimizing the loss maximizes the probability the policy assigns to the observed preferences:\n$$L_{DPO}(\\theta) = -\\mathbb{E}_{(x, y_w, y_l)}\\left[\\log \\sigma\\left(\\beta \\log \\frac{\\pi_\\theta(y_w|x)}{\\pi_{ref}(y_w|x)} - \\beta \\log \\frac{\\pi_\\theta(y_l|x)}{\\pi_{ref}(y_l|x)}\\right)\\right]$$\nNo reward model. No PPO. No sampling from the current policy during training. Just a loss function over preference pairs that you can optimize with standard gradient descent.\nIn the divergence framework: DPO directly estimates the divergence between preferred and dispreferred output distributions using the policy\u0026rsquo;s own log-probabilities as the measurement tool. It\u0026rsquo;s the same divergence estimation, just with a different estimator.\nConstitutional AI / RLAIF Constitutional AI (Bai et al., 2022, Anthropic) adds another twist: instead of human preferences, use AI-generated preferences. A \u0026ldquo;critic\u0026rdquo; model evaluates outputs against a set of principles (the \u0026ldquo;constitution\u0026rdquo;) and provides preference labels.\nIn the divergence framework: Constitutional AI replaces the human oracle $p_{safe}$ with an AI-approximated version $\\hat{p}_{safe}$. The divergence estimation is the same — we\u0026rsquo;re still trying to push the model toward safe outputs — but the reference distribution is defined by the constitution rather than by collected human judgments.\nThe advantage: it scales better (no human labeling bottleneck). The risk: the constitution is an imperfect specification of safety, and errors in $\\hat{p}_{safe}$ propagate to the trained model.\nKLDO: Making the Divergence Explicit The NeurIPS paper proposes KLDO (KL Divergence Optimization), which makes the divergence estimation explicit rather than implicit:\n$$L_{KLDO}(\\theta) = D_{KL}(\\pi_\\theta \\| p_{safe}) - \\alpha \\cdot D_{KL}(\\pi_\\theta \\| p_{unsafe})$$\nMinimize divergence from the safe distribution, maximize divergence from the unsafe distribution. 
Straightforward.\nIn practice, $p_{safe}$ and $p_{unsafe}$ aren\u0026rsquo;t known directly — they\u0026rsquo;re estimated from the preference data. But the explicit framing makes the optimization objective clearer and allows you to weight the two terms differently (maybe you care more about avoiding unsafe outputs than about matching the ideal safe distribution).\nThe Unifying Table\n| Method | How It Estimates Divergence | Reward Model? | Online RL? |\n| --- | --- | --- | --- |\n| RLHF | Reward model learns implicit log-likelihood ratio; PPO optimizes against it | Yes | Yes (PPO) |\n| DPO | Policy log-probability ratios directly estimate divergence | No (implicit) | No (offline) |\n| Constitutional AI | AI critic approximates safe/unsafe distributions; then RLHF or DPO | Optional | Optional |\n| KTO | Uses single-feedback (good/bad, no pairs) to estimate utility | No | No |\n| KLDO | Explicitly optimizes KL divergence between policy and safe/unsafe | No | No |\nThey\u0026rsquo;re all doing the same thing: moving the model\u0026rsquo;s output distribution toward safety and away from harm. The mathematical formulations differ, but the underlying optimization problem is shared.\nWhy This Matters This unifying view gives us several things:\nDiagnostic power. If your aligned model is misbehaving, you can ask: is the divergence estimate wrong (bad reward model / bad preference data), or is the optimization failing (PPO instability / DPO misspecification)?\nMethod selection. Need stability and simplicity? DPO. Need to scale without human labels? Constitutional AI. Need fine-grained control over the safety-helpfulness trade-off? KLDO\u0026rsquo;s separate terms let you tune that.\nUnderstanding failure modes. Reward hacking in RLHF is the policy finding outputs that have high estimated divergence from $p_{unsafe}$ but don\u0026rsquo;t actually correspond to safe behavior — the estimator is wrong. Sycophancy is the model overfitting to a $\\hat{p}_{safe}$ that rewards agreeable responses. 
The divergence framework makes these failure modes interpretable.\nIt also explains why no single method dominates. RLHF has the richest divergence estimation (learned reward model) but the most complex optimization (PPO). DPO has simpler optimization but a more restrictive estimator (offline, no reward model). Constitutional AI trades off estimation accuracy for scalability. Each makes different trade-offs in the same fundamental optimization problem.\nOpen Questions Does the divergence view suggest better alignment methods we haven\u0026rsquo;t tried yet? If all methods are doing divergence estimation, can we find better divergence measures than KL? Is there a \u0026ldquo;best\u0026rdquo; estimator we should converge on? And how does this framework extend to multi-turn conversations, tool use, and agent behaviors where the output distribution is much more complex?\nThese are questions I\u0026rsquo;m still thinking about. If the divergence view is right, then the path to better alignment is better divergence estimation — not fundamentally new paradigms.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-03-22/","summary":"A unifying view of RLHF, DPO, and Constitutional AI — they\u0026rsquo;re all estimating the divergence between safe and unsafe output distributions. Plus a clean derivation of why DPO works.","title":"RLHF Is Just Divergence Estimation in Disguise"},{"content":"Where We Left Off In Part 1, we derived the policy gradient theorem, built REINFORCE, reduced its variance with baselines and actor-critic methods, and arrived at GAE for practical advantage estimation. The missing piece: stability. Vanilla policy gradient methods take steps that can be too large, destroying a good policy in one update.\nThis post covers how trust regions solve that, leading to TRPO and then PPO — and why PPO specifically became the algorithm behind RLHF. 
We\u0026rsquo;ll also look at GRPO, a recent alternative from DeepSeek that drops the critic entirely.\nThe Step Size Problem Policy gradient gives us a direction to update $\\theta$. But how big should the step be?\nToo small: slow convergence, wasted compute. Too big: the policy changes dramatically, the advantage estimates (computed under the old policy) become inaccurate, and performance can collapse catastrophically.\nThis is worse than in supervised learning. In supervised learning, a bad step gives you a higher loss but you can recover. In RL, a bad policy step means the agent starts collecting worse trajectories, which gives worse gradient estimates, which leads to worse updates — a death spiral.\nWe need some way to constrain how much the policy changes per update.\nTrust Region Policy Optimization (TRPO) TRPO (Schulman et al., 2015) formalizes this constraint. Instead of the standard policy gradient update, TRPO solves a constrained optimization problem:\n$$\\max_\\theta \\; \\hat{\\mathbb{E}}_t\\left[\\frac{\\pi_\\theta(a_t | s_t)}{\\pi_{\\theta_{old}}(a_t | s_t)} A_t\\right]$$\nsubject to:\n$$\\hat{\\mathbb{E}}_t\\left[D_{KL}(\\pi_{\\theta_{old}}(\\cdot | s_t) \\| \\pi_\\theta(\\cdot | s_t))\\right] \\leq \\delta$$\nThe ratio $r_t(\\theta) = \\frac{\\pi_\\theta(a_t | s_t)}{\\pi_{\\theta_{old}}(a_t | s_t)}$ is the importance sampling ratio. It lets us evaluate the new policy using data collected under the old policy.\nThe KL constraint says: the new policy can\u0026rsquo;t be \u0026ldquo;too different\u0026rdquo; from the old one, measured by KL divergence. The trust region is the set of policies within $\\delta$ KL divergence of the current one.\nTRPO works well in practice, but it\u0026rsquo;s complicated to implement. The constrained optimization requires computing the Fisher information matrix and solving a second-order optimization problem (conjugate gradient + line search). 
This is expensive and fiddly.\nPPO: Making Trust Regions Practical PPO (Schulman et al., 2017) approximates TRPO\u0026rsquo;s constraint with a much simpler mechanism: clipping.\nPPO-Clip Instead of constraining the KL divergence explicitly, PPO clips the importance sampling ratio:\n$$L^{CLIP}(\\theta) = \\hat{\\mathbb{E}}_t\\left[\\min\\left(r_t(\\theta) A_t, \\; \\text{clip}(r_t(\\theta), 1-\\epsilon, 1+\\epsilon) A_t\\right)\\right]$$\nwhere $\\epsilon$ is typically 0.1 or 0.2.\nLet\u0026rsquo;s unpack what the clipping does:\nWhen advantage is positive ($A_t \u0026gt; 0$, the action was good):\n$r_t \u0026gt; 1 + \\epsilon$: the new policy makes this action much more likely. Clip it — don\u0026rsquo;t over-exploit. $r_t \u0026lt; 1$: the new policy makes this action less likely. The min takes the unclipped value — allow the gradient to push the policy back toward this good action. When advantage is negative ($A_t \u0026lt; 0$, the action was bad):\n$r_t \u0026lt; 1 - \\epsilon$: the new policy already reduced this action\u0026rsquo;s probability a lot. Clip it — don\u0026rsquo;t over-correct. $r_t \u0026gt; 1$: the new policy makes this bad action more likely. The min takes the unclipped value — allow the gradient to fix this. The effect: PPO allows the policy to improve but prevents it from changing too much in either direction. No Fisher matrix, no conjugate gradient, no line search. Just a clipped objective you can optimize with standard gradient descent.\nPPO-Penalty (Alternative) There\u0026rsquo;s also a KL-penalty variant:\n$$L^{KL}(\\theta) = \\hat{\\mathbb{E}}_t\\left[r_t(\\theta) A_t - \\beta \\cdot D_{KL}(\\pi_{\\theta_{old}} \\| \\pi_\\theta)\\right]$$\nwhere $\\beta$ is an adaptive coefficient that increases when KL divergence exceeds a target and decreases when it\u0026rsquo;s below. 
This is closer to the TRPO constraint in spirit but uses a penalty instead of a hard constraint.\nIn practice, PPO-Clip is simpler and more widely used.\nWhy PPO for RLHF When OpenAI developed InstructGPT (the precursor to ChatGPT), they chose PPO for the RL step. Why?\nStability. Language model outputs are high-dimensional (vocabulary size ~50K-100K), and the reward signal is sparse (one reward per complete response). This makes the optimization landscape treacherous. PPO\u0026rsquo;s clipping prevents the catastrophic policy collapse that vanilla policy gradient methods suffer from.\nCompatibility with KL constraints. In RLHF, you typically add a KL penalty between the current policy and the initial SFT model: $R_{total} = R_{reward} - \\beta \\cdot D_{KL}(\\pi_\\theta \\| \\pi_{SFT})$. This prevents the model from drifting too far from its pre-trained behavior (which would cause it to generate gibberish that the reward model scores highly). PPO\u0026rsquo;s own stability mechanism plus this external KL constraint gives you two layers of protection.\nSample efficiency. PPO can do multiple gradient steps on the same batch of collected data (within the trust region), while REINFORCE uses each batch once. This matters when generating responses from a large language model is expensive.\nImplementation simplicity. Compared to TRPO, PPO is straightforward to implement and doesn\u0026rsquo;t require second-order optimization.\nThat said, PPO for LLMs is still notoriously hard to tune. The reward model, the KL coefficient, the clipping parameter, the learning rate, the batch size — all interact in non-obvious ways. 
This difficulty is one of the motivations behind DPO, which we\u0026rsquo;ll cover in a later post.\nGRPO: Dropping the Critic Group Relative Policy Optimization (GRPO), introduced by DeepSeek for training their reasoning models, takes a different approach: eliminate the critic entirely.\nThe key insight: instead of training a separate value network to estimate advantages, just compare outputs within a group.\nFor each prompt, GRPO:\nSamples a group of $G$ responses from the current policy Scores each response with the reward model Computes advantages as normalized rewards within the group: $$A_i = \\frac{r_i - \\text{mean}(r_1, \\ldots, r_G)}{\\text{std}(r_1, \\ldots, r_G)}$$\nUses the PPO-clip objective with these group-relative advantages No critic network. No GAE. No TD learning. Just \u0026ldquo;which responses in this group were better than average?\u0026rdquo;\nThis works because in the LLM setting, the reward model already provides a scalar reward per response. You don\u0026rsquo;t need a separate value function to estimate returns — you have the actual rewards. The group normalization provides a baseline automatically (the group mean serves the same variance-reduction role as the critic).\nGRPO was used to train DeepSeek-R1, which showed strong reasoning capabilities. The simplicity is appealing — one fewer network to train, fewer hyperparameters, less engineering complexity.\nFlow-GRPO: Agents + RL AgentFlow (ICLR 2026) extended GRPO to agentic systems with Flow-GRPO. The problem with standard GRPO for agents: rewards are sparse (only at the end of a multi-step task), so credit assignment is hard. 
Which step in a 10-step agent trajectory was responsible for the final success or failure?\nFlow-GRPO addresses this with flow-based credit assignment — decomposing the agent\u0026rsquo;s trajectory into stages (planning, execution, verification, generation) and assigning credit at each stage boundary.\nThe result: a 7B parameter model beat GPT-4o on search, math, and science reasoning tasks. Small model + good RL + good architecture \u0026gt; big model. This is a strong signal that RL for agents is a frontier worth watching.\nThe Landscape Method Critic? Constraint Complexity Used In REINFORCE No None Very simple Teaching Actor-Critic Yes None Moderate Classic RL TRPO Yes KL constraint (hard) Complex Research PPO-Clip Yes Clip ratio Simple RLHF (InstructGPT, ChatGPT) GRPO No Clip ratio + group norm Simpler DeepSeek-R1 reasoning Flow-GRPO No Flow credit assignment Moderate AgentFlow (agents) The trend is toward simpler methods that leverage the structure of the LLM setting (reward models, group sampling) rather than general-purpose RL machinery.\nWhat\u0026rsquo;s Next We now have both the transformer architecture (Part 1-2 of that series) and the RL foundations. The next step is to combine them: how do you actually train a language model with human feedback? That\u0026rsquo;s the RLHF pipeline — and it turns out the whole thing can be viewed as divergence estimation in disguise.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-03-18/","summary":"How trust regions stabilize policy optimization, why PPO became the default for RLHF, and how GRPO eliminates the critic entirely.","title":"From Policy Gradient to PPO — Part 2: Trust Regions, PPO, and GRPO"},{"content":"Why RL Now? If you\u0026rsquo;re reading this blog for the LLM content, you might be wondering why we\u0026rsquo;re taking a detour into reinforcement learning. The reason is simple: RL is how we train LLMs to be helpful, harmless, and honest. 
The \u0026ldquo;HF\u0026rdquo; in RLHF stands for Human Feedback, but the \u0026ldquo;RL\u0026rdquo; is what actually does the optimization. And if you don\u0026rsquo;t understand policy gradients and PPO, the RLHF pipeline is just a black box.\nSo this two-part series builds RL from scratch. Part 1: the foundations (MDPs, value functions, REINFORCE, actor-critic). Part 2: from TRPO to PPO to GRPO, and why PPO specifically was chosen for language model alignment.\nMarkov Decision Processes An MDP is defined by a tuple $(S, A, P, R, \\gamma)$:\n$S$: set of states $A$: set of actions $P(s'|s, a)$: transition probability — if you\u0026rsquo;re in state $s$ and take action $a$, what\u0026rsquo;s the probability of ending up in state $s'$? $R(s, a)$: reward function — how good is taking action $a$ in state $s$? $\\gamma \\in [0, 1)$: discount factor — how much we value future rewards vs. immediate ones A policy $\\pi(a|s)$ is a probability distribution over actions given a state. 
The agent\u0026rsquo;s goal: find the policy that maximizes cumulative discounted reward.\nFor LLMs, the mapping is:\nState: the tokens generated so far (the context) Action: the next token to generate Reward: comes from a reward model (trained on human preferences) after the full response is generated Policy: the language model itself — $\\pi_\\theta(a_t | s_t)$ is the probability of generating token $a_t$ given the context $s_t$ Value Functions The state value function tells us the expected cumulative reward from state $s$ under policy $\\pi$:\n$$V^\\pi(s) = \\mathbb{E}_\\pi\\left[\\sum_{t=0}^{\\infty} \\gamma^t r_t \\mid s_0 = s\\right]$$\nThe action-value function (Q-function) conditions on both state and action:\n$$Q^\\pi(s, a) = \\mathbb{E}_\\pi\\left[\\sum_{t=0}^{\\infty} \\gamma^t r_t \\mid s_0 = s, a_0 = a\\right]$$\nThe relationship: $V^\\pi(s) = \\mathbb{E}_{a \\sim \\pi}[Q^\\pi(s, a)]$.\nThe advantage function measures how much better action $a$ is compared to the average:\n$$A^\\pi(s, a) = Q^\\pi(s, a) - V^\\pi(s)$$\nIf $A^\\pi(s, a) \u0026gt; 0$, action $a$ is better than what the policy would typically do. If $A^\\pi(s, a) \u0026lt; 0$, it\u0026rsquo;s worse. This will be important for policy gradient methods.\nBellman Equations Value functions satisfy recursive relationships (the Bellman equations):\n$$V^\\pi(s) = \\mathbb{E}_{a \\sim \\pi}\\left[R(s, a) + \\gamma \\mathbb{E}_{s' \\sim P}[V^\\pi(s')]\\right]$$\nThis says: the value of a state equals the immediate reward plus the discounted value of the next state, in expectation. 
It\u0026rsquo;s the foundation for all dynamic programming and TD learning methods.\nThe Policy Gradient Theorem We want to find the policy parameters $\\theta$ that maximize the expected return:\n$$J(\\theta) = \\mathbb{E}_{\\tau \\sim \\pi_\\theta}\\left[\\sum_{t=0}^{T} \\gamma^t r_t\\right]$$\nwhere $\\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \\ldots)$ is a trajectory sampled from the policy.\nThe policy gradient theorem gives us the gradient of this objective:\n$$\\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\tau \\sim \\pi_\\theta}\\left[\\sum_{t=0}^{T} \\nabla_\\theta \\log \\pi_\\theta(a_t | s_t) \\cdot G_t\\right]$$\nwhere $G_t = \\sum_{k=t}^{T} \\gamma^{k-t} r_k$ is the return from time step $t$.\nThis is a beautiful result. The intuition: $\\nabla_\\theta \\log \\pi_\\theta(a_t | s_t)$ points in the direction that makes action $a_t$ more likely. We weight this by $G_t$ — if the return was high, push the policy toward those actions; if low, push away.\nThe proof relies on the \u0026ldquo;log-derivative trick\u0026rdquo;: $\\nabla_\\theta \\pi_\\theta = \\pi_\\theta \\nabla_\\theta \\log \\pi_\\theta$. This lets us express the gradient as an expectation under the policy itself, which we can estimate with samples.\nREINFORCE REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. It directly implements the policy gradient theorem:\nSample a trajectory $\\tau$ by running the policy Compute returns $G_t$ for each time step Update: $\\theta \\leftarrow \\theta + \\alpha \\sum_t \\nabla_\\theta \\log \\pi_\\theta(a_t | s_t) \\cdot G_t$ That\u0026rsquo;s it. Sample, compute returns, update. Simple and unbiased.\nThe problem: variance. $G_t$ is a noisy estimate because it depends on the entire future trajectory. One lucky trajectory can send the gradient in a wildly wrong direction. 
In practice, REINFORCE is slow to converge and unstable.\nVariance Reduction with Baselines We can subtract a baseline $b(s_t)$ from the return without introducing bias:\n$$\\nabla_\\theta J(\\theta) = \\mathbb{E}\\left[\\sum_t \\nabla_\\theta \\log \\pi_\\theta(a_t | s_t) \\cdot (G_t - b(s_t))\\right]$$\nWhy is this still unbiased? Because $\\mathbb{E}_{a \\sim \\pi}[\\nabla_\\theta \\log \\pi_\\theta(a|s) \\cdot b(s)] = b(s) \\cdot \\nabla_\\theta \\sum_a \\pi_\\theta(a|s) = b(s) \\cdot \\nabla_\\theta 1 = 0$.\nThe optimal baseline turns out to be close to $V^\\pi(s_t)$ — the expected return from state $s_t$. With this baseline, we\u0026rsquo;re effectively using the advantage: $G_t - V^\\pi(s_t) \\approx A^\\pi(s_t, a_t)$. Actions better than average get positive gradient, worse than average get negative gradient.\nThis reduces variance dramatically but we need to estimate $V^\\pi$ somehow. Which brings us to actor-critic methods.\nActor-Critic The actor-critic framework uses two networks:\nActor ($\\pi_\\theta$): the policy network — decides which actions to take Critic ($V_\\phi$): the value network — estimates $V^\\pi(s)$ to reduce variance The critic provides the baseline. Instead of waiting for the full trajectory return $G_t$, we can use the TD (temporal difference) target:\n$$A_t \\approx r_t + \\gamma V_\\phi(s_{t+1}) - V_\\phi(s_t)$$\nThis is a one-step advantage estimate. 
It has lower variance than $G_t - V_\\phi(s_t)$ (because we bootstrap from the critic instead of using the full return) but introduces bias (because the critic is an approximation).\nThe actor update: $$\\theta \\leftarrow \\theta + \\alpha \\nabla_\\theta \\log \\pi_\\theta(a_t | s_t) \\cdot A_t$$\nThe critic update (minimize squared TD error): $$\\phi \\leftarrow \\phi - \\beta \\nabla_\\phi (V_\\phi(s_t) - (r_t + \\gamma V_\\phi(s_{t+1})))^2$$\nActor-critic is the foundation for essentially all modern policy gradient methods, including PPO.\nGeneralized Advantage Estimation (GAE) The one-step TD advantage $\\delta_t = r_t + \\gamma V(s_{t+1}) - V(s_t)$ is low variance but biased. The full Monte Carlo advantage $G_t - V(s_t)$ is unbiased but high variance. GAE (Schulman et al., 2015) interpolates between them:\n$$A_t^{GAE} = \\sum_{l=0}^{T-t} (\\gamma \\lambda)^l \\delta_{t+l}$$\nwhere $\\lambda \\in [0, 1]$ controls the bias-variance trade-off:\n$\\lambda = 0$: pure one-step TD (low variance, high bias) $\\lambda = 1$: equivalent to Monte Carlo advantage (no bias, high variance) In practice: $\\lambda \\approx 0.95$ works well This is an exponentially-weighted average of multi-step TD estimates. The decaying weights mean that nearby rewards are trusted more than distant ones (lower variance), while still incorporating long-horizon information (lower bias than pure one-step).\nGAE is used in virtually every modern policy gradient implementation, including the PPO implementations used for RLHF.\nSummary What we covered:\nMDPs: the formal framework for sequential decision-making Value functions: $V^\\pi$, $Q^\\pi$, advantage $A^\\pi$, and Bellman equations Policy gradient theorem: how to compute gradients of expected return REINFORCE: simplest implementation, high variance Baselines and actor-critic: variance reduction thru learned value functions GAE: the practical bias-variance trade-off for advantage estimation These are the building blocks. 
In Part 2, we\u0026rsquo;ll see how trust regions and clipped objectives lead to PPO — and why PPO became the algorithm of choice for training language models with human feedback.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-03-05/","summary":"MDPs, value functions, the REINFORCE algorithm, actor-critic methods, and generalized advantage estimation — the RL foundations you need before understanding RLHF.","title":"From Policy Gradient to PPO — Part 1: Foundations"},{"content":"Where the VAE Idea Went The Variational Autoencoder was proposed by Kingma \u0026amp; Welling in 2014, which is now more than a decade ago. In machine learning time, that\u0026rsquo;s ancient. And yet the scaffolding we built in this series — encoder, decoder, ELBO, reparameterization — has become one of the most influential ideas in modern generative modeling.\nVery few people train vanilla VAEs for image generation today. That job has been taken over by diffusion models and, increasingly, flow matching. But look under the hood of any state-of-the-art system, and you will find VAE ideas threaded through: a tokenizer that encodes images into discrete codes; a continuous latent space that diffusion models actually operate in; a probabilistic framing of \u0026ldquo;encoder outputs a distribution, not a point.\u0026rdquo;\nThis post is basically a survey of VAE variants — no heavy derivations, just the main ideas — and a look at what the probabilistic perspective gave us that is still useful today.\nVQ-VAE: Discrete Latents Proposed by van den Oord et al. (2017), VQ-VAE replaces the continuous Gaussian posterior with a discrete codebook. 
Instead of $q_\\phi(z|x)$ being a Gaussian, the encoder outputs a vector $z_e(x)$ that gets quantized to its nearest entry in a learned codebook $\\{e_k\\}_{k=1}^K$:\n\\begin{equation} z_q(x) = e_{k^*}, \\quad k^* = \\arg\\min_k \\|z_e(x) - e_k\\| \\end{equation}\nThe decoder then reconstructs $x$ from the discrete code $z_q(x)$.\nThis looks strange at first — how do we backpropagate through the arg-min? The answer is the straight-through estimator: in the forward pass, quantize; in the backward pass, pretend quantization is the identity. The decoder\u0026rsquo;s gradients flow directly back to the encoder.\nThe loss (to minimize) also looks different from a standard VAE:\n\\begin{equation} \\mathcal{L} = \\underbrace{-\\log p_\\theta(x | z_q(x))}_{\\text{reconstruction}} + \\underbrace{\\|\\text{sg}[z_e(x)] - e_{k^*}\\|^2}_{\\text{codebook}} + \\beta \\underbrace{\\|z_e(x) - \\text{sg}[e_{k^*}]\\|^2}_{\\text{commitment}} \\end{equation}\nwhere $\\text{sg}[\\cdot]$ is the stop-gradient operator. The KL term as we knew it is gone, replaced by a uniform prior over discrete codes and these commitment losses.\nWhy it mattered. Discrete latents turn images (and audio, and video) into sequences of tokens, which can then be modeled by powerful autoregressive priors — exactly the transformer-based machinery we use for language. VQ-VAE and its successor VQ-VAE-2 were the bridge that let images enter the LLM era. Every modern image-tokenizer you\u0026rsquo;ve heard of — from DALL·E\u0026rsquo;s discrete VAE to the tokenizers inside modern multi-modal models — is a descendant.\nHierarchical VAEs: Depth as Capacity A flat VAE with a single latent $z$ has limited capacity to model complex images. 
Hierarchical VAEs stack multiple layers of latents $z_1, z_2, \\dots, z_L$ in a chain:\n\\begin{equation} p(x, z_{1:L}) = p(x | z_1)\\, p(z_1 | z_2)\\, \\cdots\\, p(z_{L-1} | z_L)\\, p(z_L) \\end{equation}\nEach level captures structure at a different scale — top levels encode coarse composition, bottom levels handle fine detail. The ELBO generalizes naturally to a sum of KL terms, one per level.\nThe obvious problem is that deeper hierarchies are harder to train. Upper latents tend to collapse (the decoder ignores them), leaving all the work to the bottom. The important follow-ups:\nLadder VAE (Sønderby et al., 2016). A specific parameterization of the approximate posterior that shares information between bottom-up and top-down paths. IAF-VAE (Kingma et al., 2016). Replaces the Gaussian posterior with an Inverse Autoregressive Flow, giving $q_\\phi(z|x)$ the flexibility to capture correlations a diagonal Gaussian cannot. NVAE (Vahdat \u0026amp; Kautz, 2020). A deep, carefully engineered hierarchical VAE with residual cells, swish activations, spectral regularization, and batch normalization tuning — the first hierarchical VAE to produce genuinely competitive image samples. It showed that hierarchical VAEs can scale, if you\u0026rsquo;re willing to do the engineering. The lesson from hierarchical VAEs generalizes beyond VAEs: generative models benefit from multi-scale structure, and the posterior approximation has to be rich enough to express the dependencies you care about.\nAdversarial Hybrids VAEs produce blurry samples because the Gaussian decoder likelihood, combined with a mean-field posterior, averages over modes. GANs produce sharp samples but with no likelihood, no encoder, and no stability.\nSeveral papers asked the obvious question: what if we combine them?\nVAE-GAN (Larsen et al., 2016). Use a GAN discriminator as the reconstruction loss instead of pixel MSE/BCE. The decoder now has to produce samples that a discriminator finds indistinguishable from real data. 
IntroVAE (Huang et al., 2018). Turns the encoder itself into the discriminator, creating an adversarial game without a separate GAN head. BiGAN / ALI (Donahue et al., 2017; Dumoulin et al., 2017). GANs with an encoder baked in, trained adversarially on joint samples $(x, z)$. Sometimes these hybrids get the best of both, sometimes the worst of both. They are important because they formalized a question that stayed relevant: the ELBO and the adversarial objective are measuring different things — when should we prefer one over the other?\nFlow-Based Posteriors and Decoders One strand of VAE research pushed in a direction that eventually became its own field: normalizing flows.\nA normalizing flow is a bijective, differentiable transformation $f$ that maps a simple distribution (a Gaussian) to a complex one. If $z_0 \\sim \\mathcal{N}(0, I)$ and $z_K = f_K \\circ \\cdots \\circ f_1(z_0)$, then the density of $z_K$ can be computed via the change-of-variables formula — which means we get a tractable, expressive, exact likelihood.\nApplied inside a VAE, a flow can play two roles:\nAs a posterior: replace the mean-field Gaussian $q_\\phi(z|x)$ with $q_\\phi(z|x) = f_K \\circ \\cdots \\circ f_1 (\\epsilon; x)$, making the approximate posterior arbitrarily expressive (IAF-VAE, above). As a prior or decoder: replace $p(z)$ or $p(x|z)$ with flow-based densities. The ELBO still holds, but each piece is much more flexible. Flows later grew into a standalone family of models (RealNVP, Glow, FFJORD). The VAE was one of the first settings where their expressive power was useful.\nThe Information Bottleneck Lens You may wonder why all these variants look so similar to each other. 
So I think there is a common theme running through all of them, and that theme has a name: the Information Bottleneck (IB) principle (Tishby et al., 1999).\nThe IB principle says: a good representation $z$ of input $x$ for predicting some target $y$ should maximize:\n\\begin{equation} \\mathcal{L}_{\\text{IB}} = I(z; y) - \\beta \\cdot I(z; x) \\end{equation}\nThat is, $z$ should contain as much information about $y$ as possible, and as little about $x$ as possible, balanced by a trade-off parameter β.\nReplace $y$ with \u0026ldquo;the data reconstruction task\u0026rdquo; and you get β-VAE. Replace $I(z; x)$ with the KL to a prior and you get a tractable upper bound. Every VAE-style model we have discussed — vanilla, conditional, β, VQ, hierarchical — is a different point in an IB design space, parameterized by how $I(z; x)$ is approximated and what \u0026ldquo;task\u0026rdquo; the latent is serving.\nAccording to my own reading of this, the VAE was the first clean implementation of an information-theoretic idea that had been floating around for decades.\nWhere VAE Ideas Live Today A short, incomplete catalog of where you still find the VAE\u0026rsquo;s fingerprints in 2026:\nLatent Diffusion Models. Stable Diffusion and its descendants train a diffusion model not on pixels, but in the latent space of a pretrained VAE. The VAE compresses images into a tractable latent where diffusion is dramatically cheaper. The generative heavy lifting has moved to diffusion; the representation is still the VAE\u0026rsquo;s. VQ tokenizers for multi-modal LLMs. Images, audio, and video are tokenized by VQ-VAE-style encoders before being fed to transformers. The codebook lives on. World models. Systems that predict future states in an environment use continuous latent spaces whose structure is directly inherited from the VAE literature. The encoder-is-a-distribution framing is essential for capturing uncertainty in prediction. Probabilistic programming. 
Amortized variational inference — the central computational idea of the VAE — is now a standard tool in libraries like Pyro and NumPyro, used for Bayesian models far removed from images. Scientific machine learning. In chemistry, genomics, and neuroscience, VAE-style models are used precisely because the latent space gives you a distribution over explanations, not a single point estimate. Uncertainty is the product, not a byproduct. What the VAE Really Taught Us If I had to distill the lasting contribution of the VAE, it would be three interlocking ideas.\nFirst, encoders should return distributions, not points. A deterministic embedding is a commitment to a single interpretation of the input. A distributional embedding keeps the model honest about uncertainty, and gives downstream samplers, generators, and decision-makers something meaningful to sample from.\nSecond, intractable integrals can be traded for tractable optimization. The ELBO is the archetype of this move: take an integral you can\u0026rsquo;t compute, write it as the sum of something you can maximize and something non-negative, then maximize what you can. This pattern now appears across variational inference, PAC-Bayes bounds, amortized Bayes, and diffusion model training.\nThird, the reparameterization trick changed the default in machine learning. Before the VAE, stochastic models used custom inference algorithms per model family. After the VAE, the default became: write your model, rewrite sampling as a deterministic function of noise, call backward(). Virtually every modern probabilistic deep learning system inherits this pattern.\nThe VAE stopped being the state-of-the-art generator years ago. But it never stopped being the template for how we write probabilistic models in deep learning.\nClosing the Series We started this series by asking a simple question: how do we model data that seems to be generated by hidden causes? 
That led us through Latent Variable Models, Variational Inference, the ELBO, the reparameterization trick, conditional generation, β-regularization, and finally this survey of descendants.\nAlong the way we did a lot of math. The math was never the point. The point was a particular way of looking at problems: data is a shadow of something structured; inference is optimization in disguise; uncertainty is a first-class citizen; and the right objective function is more valuable than any specific architecture.\nI only introduced the basic concepts here — each of these variants could be its own series. But hopefully this gives you enough to follow the literature when you want to go deeper.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-02-25/","summary":"A survey of where the VAE idea went after 2014 — VQ-VAE, hierarchical VAEs, adversarial hybrids, flow-based posteriors — and what the VAE really gave us beyond a specific architecture.","title":"VAE Variants and Modern Interpretations"},{"content":"Beyond the Basic Architecture In Part 1 we built the Transformer block from scratch — self-attention, multi-head attention, positional encoding, FFN, residuals. That was the 2017 architecture. It works, but when you scale it up to billions of parameters, interesting things happen and practical problems appear.\nThis post covers what scale reveals about transformers: attention patterns that emerge naturally, modern architectural improvements, and efficiency tricks that make training and inference feasible at LLM scale.\nSparse Attention Emergence One of the more surprising empirical findings is that transformers develop sparse attention patterns during training — most attention weights end up near zero, with each token focusing on just a few other tokens.\nA NeurIPS 2025 oral paper (\u0026ldquo;Sparse Attention Emergence in Transformers\u0026rdquo;) studied when and how this happens. The key findings:\nEmergence follows power laws. 
The timing of when attention becomes sparse depends on the task, architecture, and optimizer, but the relationship follows power laws in each case. Larger models become sparse earlier in training. Harder tasks delay sparsity.\nSparsity is not uniform. Some heads become very sparse (attending to 1-3 tokens), while others maintain broad attention. This isn\u0026rsquo;t random — it reflects functional specialization.\nThis matters because it tells us that dense attention (every token attending to every token equally) is a starting condition, not the trained behavior. The model learns to be selective.\nAttention Head Specialization Another NeurIPS 2025 paper looked at what individual attention heads actually learn to do. They found that heads specialize in specific semantic and structural roles:\nLocal heads: attend primarily to nearby tokens (capturing syntax, local dependencies) Global heads: attend to distant tokens (long-range semantic relationships) Positional heads: attend to specific relative positions (e.g., always look 1 token back) Attribute heads: track specific semantic properties across the sequence This isn\u0026rsquo;t designed — it emerges from training. Each head in a multi-head attention layer ends up doing something different, which is exactly the intuition behind multi-head attention but now we have empirical evidence.\nThe practical implication: if you understand what each head does, you can edit model behavior by intervening on specific heads. This is one of the building blocks of mechanistic interpretability, which we\u0026rsquo;ll cover in a later post.\nGated Attention NeurIPS 2025 Best Paper (\u0026ldquo;Gated Attention for Large Language Models\u0026rdquo;) proposed a small but effective modification to the attention mechanism: add a head-specific learnable gate after the scaled dot-product attention.\nThe idea is simple. 
After computing attention for each head:\n$$\\text{head}_i = g_i \\odot \\text{Attention}(Q_i, K_i, V_i)$$\nwhere $g_i$ is a sigmoid gate that can learn to suppress or amplify each head\u0026rsquo;s contribution. The gate operates element-wise on the attention output.\nThey tested 30+ variants and found that this gating consistently improved performance across model sizes. The intuition: not every head is useful for every input, and the gate lets the model dynamically adjust which heads contribute.\nOf course this works. The model already learns to specialize heads — giving it an explicit mechanism to modulate their influence just makes the specialization more effective.\nModern Positional Encodings: RoPE The sinusoidal positional encodings from the original paper work, but modern LLMs have largely moved to Rotary Position Embeddings (RoPE) (Su et al., 2021).\nRoPE applies position information as a rotation in the embedding space rather than an addition. For a 2D subspace of the embedding, at position $m$:\n$$\\text{RoPE}(x_m, m) = \\begin{pmatrix} \\cos m\\theta \u0026amp; -\\sin m\\theta \\\\ \\sin m\\theta \u0026amp; \\cos m\\theta \\end{pmatrix} \\begin{pmatrix} x_m^{(1)} \\\\ x_m^{(2)} \\end{pmatrix}$$\nApplied independently to pairs of dimensions. The key property: when you compute the attention score between positions $m$ and $n$, the rotations combine such that the score depends only on the relative position $m - n$:\n$$q_m^T k_n = (R_m x_m)^T (R_n x_n) = x_m^T R_m^T R_n x_n = x_m^T R_{n-m} x_n$$\nThis gives you relative positional information naturally, without separate relative position embeddings or biases. And it composes well with the attention mechanism — you just rotate the queries and keys before computing attention.\nWhy RoPE won over sinusoidal:\nBetter extrapolation to sequence lengths longer than training Cleaner integration with attention (rotation vs. addition) Works well with KV caching during inference Nearly all modern LLMs (LLaMA, Mistral, Qwen, etc.) 
use RoPE.\nEfficient Attention: FlashAttention Standard self-attention has $O(n^2)$ memory complexity because you materialize the full $n \\times n$ attention matrix. For sequence length 8192, that\u0026rsquo;s a 67M-entry matrix per head. At 32 heads, that\u0026rsquo;s over 2 billion floats just for attention scores. Not great.\nFlashAttention (Dao et al., 2022) doesn\u0026rsquo;t change what is computed — it changes how. The key insight: GPUs are bottlenecked by memory I/O, not compute. Reading and writing the large attention matrix to GPU memory (HBM) is the actual bottleneck.\nFlashAttention computes attention in tiles, keeping intermediate results in fast SRAM (on-chip memory) and never materializing the full attention matrix in HBM. The algorithm:\nSplit Q, K, V into blocks that fit in SRAM For each block of Q, iterate over blocks of K and V Compute attention scores and accumulate the output incrementally Use the online softmax trick (keeping running max and sum) to compute exact softmax without needing all scores at once The result: exact same output as standard attention, but 2-4x faster and with $O(n)$ memory instead of $O(n^2)$.\nThis matters a lot. FlashAttention is what makes training with sequence lengths of 8K, 32K, 128K+ feasible. Without it, context windows would be much shorter.\nGrouped-Query Attention (GQA) Standard multi-head attention has separate K, V projections for each head. During autoregressive generation, you cache all these K, V pairs (the \u0026ldquo;KV cache\u0026rdquo;). With many heads, this cache gets large.\nGrouped-Query Attention (Ainslie et al., 2023) shares K, V projections across groups of heads. If you have 32 heads and 8 KV groups, each group of 4 query heads shares one K and one V projection.\nThe extreme case: Multi-Query Attention (MQA), where all heads share one K, V. 
GQA interpolates between MQA and full multi-head attention.\n$$\\text{GQA-head}_{i} = \\text{Attention}(Q_i, K_{g(i)}, V_{g(i)})$$\nwhere $g(i) = \\lfloor i / G \\rfloor$ maps (0-indexed) head $i$ to its group, and $G$ is the number of query heads that share one K, V pair ($G = 4$ in the example above).\nThis reduces KV cache size by $G\\times$ during inference with minimal quality loss. Many production LLMs (LLaMA 2/3, Mistral, Gemma) use GQA.\nMixture of Experts (MoE) The FFN in each transformer block is typically the most parameter-heavy component. Mixture of Experts replaces the single FFN with multiple \u0026ldquo;expert\u0026rdquo; FFNs and a gating network that routes each token to a subset of experts:\n$$\\text{MoE-FFN}(x) = \\sum_{i=1}^{E} g_i(x) \\cdot \\text{FFN}_i(x)$$\nwhere $g(x) = \\text{TopK}(\\text{softmax}(W_g x))$ selects the top-$K$ experts for each token.\nTypically $K = 2$ out of $E = 8$ or $E = 16$ experts. This means each token only activates a small fraction of the total parameters, giving you a much larger model (more total parameters = more knowledge capacity) with roughly the per-token compute of a much smaller dense model.\nMixtral (Mistral\u0026rsquo;s MoE) has 46.7B total parameters but only activates ~12.9B per token. You get the knowledge of a large model at the cost of a small one.\nThe challenges: load balancing (you want tokens distributed evenly across experts, not all going to the same one; this is usually encouraged with an auxiliary loss), expert collapse (some experts never get used), and the gating decision itself (top-$K$ routing is discrete, so gradients reach the router only through the gate weights of the selected experts).\nScaling Laws One of the most important empirical findings about transformers is that their performance is predictable from scale. Kaplan et al. (2020) showed:\n$$L(N) \\propto N^{-\\alpha}$$\nwhere $L$ is the loss, $N$ is the number of parameters, and $\\alpha \\approx 0.076$ for language modeling. Similar power laws hold for dataset size and compute.\nChinchilla (Hoffmann et al., 2022) refined this: for a given compute budget, there\u0026rsquo;s an optimal ratio of model size to training data. 
The original GPT-3 was undertrained relative to its size. Chinchilla (70B parameters), trained on about 4x more data than the 280B-parameter Gopher at the same compute budget, outperformed both Gopher and GPT-3 while being 4x smaller than Gopher, and therefore cheaper at inference.\nThe implication: you can predict model performance before training. This is why labs can plan training runs costing millions of dollars: they know approximately what they\u0026rsquo;ll get.\nRecent work (2024-2025) shows these power laws hold across dense and sparse (MoE) architectures, though the constants differ. Interestingly, the laws seem to break or change character for reasoning tasks, especially with test-time compute scaling (o1-style models). This is an active research area.\nWhat We Covered In this two-part series:\nPart 1: the core transformer (attention, positional encoding, multi-head attention, the full block) Part 2: what happens at scale (emergent sparsity, head specialization, and the modern efficiency toolkit: RoPE, FlashAttention, GQA, MoE, gated attention) The transformer is remarkably simple at its core: attention + FFN + residuals. The complexity comes from making it efficient at scale and understanding what it learns. That understanding is becoming increasingly important as we try to align these models, which is where this blog is heading next.\nNext up: reinforcement learning foundations, because we need RL before we can understand how LLMs learn from human feedback.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-02-20/","summary":"Sparse attention patterns, head specialization, rotary embeddings, gated attention, and the modern efficiency tricks that make large transformers actually trainable.","title":"Transformers from First Principles — Part 2: What Scale Reveals"},{"content":"The Entanglement Problem By now, we have a fully trainable VAE. It reconstructs inputs, generates plausible samples, and even accepts conditions. But there\u0026rsquo;s a problem with the latent space.\nTrain a VAE on, say, a dataset of faces. 
The encoder gives you a 32-dimensional $z$. You might hope that one dimension controls pose, another controls smile, another controls lighting — that each axis of $z$ corresponds to an independent factor of variation.\nIn reality, this almost never happens. When you sweep along one axis of a trained VAE\u0026rsquo;s latent space, multiple factors tend to move at once. The pose and the hair color shift together. Smile and gender drift in the same direction. The latent dimensions are entangled — each one encodes a tangled mixture of underlying factors.\nThis is a problem if we care about understanding or manipulating the representation. A model that can generate a face is useful. A model that can generate the same face with a different smile is far more useful — and for that, we need a latent code where smile lives on its own axis.\nSo the question this post asks: can we nudge a VAE toward learning a disentangled latent space with a change as simple as a single hyperparameter?\nHiggins et al. (2017) showed that yes, it actually works, at least most of the time.\nWhat Does \u0026ldquo;Disentanglement\u0026rdquo; Even Mean? Before modifying anything, we should pin down what we\u0026rsquo;re asking for. There is no universally agreed-upon definition of disentanglement, but a working intuition is: a representation is disentangled when each latent dimension varies in response to exactly one generative factor of the data, independently of the others.\nIn more formal terms, if the true data was generated by independent factors $v_1, v_2, \\dots, v_k$ (pose, lighting, identity, etc.), a disentangled encoder would map each $v_i$ to a distinct $z_i$.\nThis property is attractive for three reasons:\nInterpretability. You can look at a dimension and name what it controls. Controllability. You can modify one factor without disturbing the rest. Generalization. Factored representations tend to compose better on new combinations. 
The tricky part: the true factors $v_i$ are never observed during training. We\u0026rsquo;re hoping the model figures them out on its own.\nThe β-VAE Modification Here is the entire proposal. Take the ELBO:\n\\begin{equation} \\mathcal{L}_{\\text{VAE}} = \\mathbb{E}_{q_\\phi(z|x)}[\\log p_\\theta(x|z)] - D_{KL}(q_\\phi(z|x) \\,\\|\\, p(z)) \\end{equation}\nand introduce a single hyperparameter $\\beta \\ge 1$ in front of the KL term:\n\\begin{equation} \\mathcal{L}_{\\beta\\text{-VAE}} = \\mathbb{E}_{q_\\phi(z|x)}[\\log p_\\theta(x|z)] - \\beta \\cdot D_{KL}(q_\\phi(z|x) \\,\\|\\, p(z)) \\end{equation}\nThat\u0026rsquo;s it. One Greek letter. No architectural change. No new sampling procedure.\nWhen $\\beta = 1$, we recover the standard VAE. When $\\beta \u0026gt; 1$, we penalize the KL divergence more strongly, forcing the posterior to stay closer to the prior. Intuitively, this applies more pressure to compress: the encoder is told it has less bandwidth to describe $x$ in the latent code, so it must use that bandwidth efficiently.\nHiggins et al. observed empirically that this compressive pressure tends to push the VAE toward representations where each latent dimension encodes an independent factor of variation. If you increase $\\beta$, the latent dimensions tend to become more disentangled, at least on the right data.\nWhy Would Compression Encourage Disentanglement? At first glance, the connection is not obvious. Why should \u0026ldquo;use less bandwidth\u0026rdquo; imply \u0026ldquo;use independent axes\u0026rdquo;?\nOne way to think about it: the prior $p(z) = \\mathcal{N}(0, I)$ is a factorized distribution. Its dimensions are independent by construction. The KL term $D_{KL}(q_\\phi(z|x) \\,\\|\\, p(z))$ is minimized when $q_\\phi(z|x)$ also looks factorized and close to the prior.\nWhen $\\beta$ is large, the encoder is heavily penalized for deviating from a factorized Gaussian. But it still has to explain the data well enough to avoid a huge reconstruction penalty. 
The encoder\u0026rsquo;s only way out is to find a representation that is both informative about $x$ and close to factorized. And if the data was actually generated by a small number of independent factors, the most efficient factorized code aligns with those factors.\nIn short: the prior is factorized, so the KL term pulls the aggregate posterior toward factorization, and on data with latent independent structure, that often coincides with disentanglement.\nThis explanation is not very rigorous (we\u0026rsquo;ll look at its weaknesses in a moment), but it gives you the right first intuition.\nThe Information-Theoretic View The β-VAE objective has a nice reinterpretation in the language of information theory. The KL term, averaged over the data distribution, is related to the mutual information between $x$ and $z$:\n\\begin{equation} \\mathbb{E}_{p_D(x)} [D_{KL}(q_\\phi(z|x) \\,\\|\\, p(z))] = I_q(x; z) + D_{KL}(q_\\phi(z) \\,\\|\\, p(z)) \\end{equation}\nwhere $q_\\phi(z) = \\mathbb{E}_{p_D(x)}[q_\\phi(z|x)]$ is the aggregate posterior.\nThis decomposition tells us that shrinking the KL does two things at once:\nIt reduces the mutual information between input and latent code: the encoder keeps less information about each specific $x$. It pushes the aggregate posterior toward the prior: the distribution of latents averaged over the whole dataset looks more like $\\mathcal{N}(0, I)$. Read this way, β-VAE is a rate-distortion model. The KL term is a rate (how many bits of information the latent carries about $x$), the reconstruction is a distortion (how well we reproduce $x$), and $\\beta$ is the trade-off knob. Training a β-VAE at a given $\\beta$ picks out one point on the rate-distortion curve; sweeping $\\beta$ traces out the curve.\nThis is the lens that connects β-VAE to the Information Bottleneck principle: the encoder should keep just enough information about $x$ to reconstruct it well, and no more. 
Higher $\\beta$ narrows the bottleneck.\nThe Trade-off Curve So what actually happens as you sweep $\\beta$?\n$\\beta \u0026lt; 1$. Reconstruction dominates. The model behaves more like a standard autoencoder. Latents carry lots of information about $x$, reconstructions look sharp, disentanglement is poor. $\\beta = 1$. The vanilla VAE. A reasonable middle ground, but latents are usually entangled. $\\beta$ moderately \u0026gt; 1. The sweet spot claimed by Higgins et al. Reconstructions are a bit blurrier, but latents begin to align with independent factors of variation. $\\beta \\gg 1$. The KL term dominates. The encoder gives up and outputs $q_\\phi(z|x) \\approx p(z)$ for every input. The latents become uninformative — posterior collapse. Reconstructions degenerate into the data mean. The trade-off is real: disentanglement comes at a cost in reconstruction quality. For datasets with clear independent factors (dSprites, 3D shapes), a moderate β can yield remarkably interpretable latents. For natural images, the same β typically produces blur without much disentanglement benefit.\nSo β-VAE is not a free lunch. It is a dial that exchanges reconstruction fidelity for representational structure. Whether that structure matters depends on your downstream task.\nA Practical Trick: KL Annealing Training a β-VAE with a fixed high β from the start often collapses. A common practical fix is KL annealing — start with $\\beta \\approx 0$ and ramp it up over the course of training:\n\\begin{equation} \\beta(t) = \\min(1, t / T) \\cdot \\beta_{\\max} \\end{equation}\nIn the early phase, the model focuses on reconstruction and learns a useful decoder. Later, as β rises, the KL term starts pulling the posterior toward structure. 
This avoids the degenerate local minimum where the latent is ignored before the decoder has learned anything.\nA related idea is free bits: impose the KL penalty only on dimensions whose KL exceeds a threshold $\\lambda$:\n\\begin{equation} \\mathcal{L}_{\\text{free-bits}} = \\mathbb{E}_{q_\\phi(z|x)}[\\log p_\\theta(x|z)] - \\sum_j \\max(\\lambda, D_{KL}(q_\\phi(z_j|x) \\,\\|\\, p(z_j))) \\end{equation}\nThis protects a minimum amount of latent information from being squeezed out, and often helps on larger datasets.\nDoes β-VAE Really Disentangle? This is where we have to be honest. In 2019, Locatello et al. published a now-famous critique titled \u0026ldquo;Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations.\u0026rdquo; Their result, summarized: without inductive biases on models and data, unsupervised disentanglement is fundamentally impossible.\nTheir argument is partly theoretical (there always exist equivalent latent-space reparameterizations that preserve the data distribution but scramble the factors) and partly empirical (running many β-VAE variants across datasets and seeds, they found that the particular disentanglement achieved depends critically on the random seed, and that standard disentanglement metrics are unreliable).\nThe takeaway is not \u0026ldquo;β-VAE is useless.\u0026rdquo; It is:\nβ-VAE can sometimes produce disentangled-looking representations. Whether it does depends heavily on the dataset\u0026rsquo;s actual latent structure and on inductive biases hidden in the encoder/decoder architecture. There is no guarantee, and the standard metrics for disentanglement are themselves contested. 
If you want disentangled representations that matter, you generally need either partial supervision (tell the model what the factors are) or structural priors (design the architecture to factor things out).\nExtensions Worth Knowing β-VAE opened a family of objectives that dissect the KL term in more refined ways:\nβ-TCVAE (Chen et al., 2018). Decomposes the KL into three pieces — mutual information, total correlation, and dimension-wise KL — and penalizes only the total correlation term. This targets disentanglement more directly. FactorVAE (Kim \u0026amp; Mnih, 2018). Adds a discriminator that explicitly pushes the aggregate posterior $q_\\phi(z)$ to be factorized across dimensions. DIP-VAE (Kumar et al., 2017). Matches moments of the aggregate posterior to a factorized target. All of these are variations on the same theme: if you want factorized latents, find the right term in the ELBO to penalize.\nSummary A plain VAE tends to learn entangled latents — each dimension encodes a mixture of generative factors. β-VAE introduces a single hyperparameter that scales the KL term, applying compressive pressure on the encoder. Higher β trades reconstruction fidelity for representational structure, and on the right datasets encourages disentanglement. The objective has an information-theoretic reading as a rate-distortion problem and connects directly to the Information Bottleneck principle. In practice, KL annealing and free bits help avoid posterior collapse during training. The disentanglement-for-free story is real but limited — Locatello et al. showed it requires inductive biases that are rarely acknowledged. ","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-02-10/","summary":"A single Greek letter in front of the KL term changes what the VAE learns. 
We look at β-VAE as a rate-distortion trade-off, an information bottleneck, and a simple probe into disentangled representations.","title":"β-VAE and the Emergence of Disentanglement"},{"content":"New Direction If you\u0026rsquo;ve been following this blog, you know I\u0026rsquo;ve spent the last year or so on generative models — VAEs, the ELBO, reparameterization, all of that. That series is done (for now), and I want to pivot toward the models that are actually eating the world: large language models.\nBut I don\u0026rsquo;t want to just use them. I want to understand them from the ground up, the same way we derived the ELBO from scratch in the VAE series. So this is the start of a new series, and we\u0026rsquo;re beginning where every LLM begins: the Transformer architecture.\nThe paper is \u0026ldquo;Attention Is All You Need\u0026rdquo; (Vaswani et al., 2017). If you\u0026rsquo;ve heard the phrase \u0026ldquo;self-attention\u0026rdquo; thrown around and vaguely know it involves queries, keys, and values but aren\u0026rsquo;t sure why — this post is for you.\nThe Problem with Sequences Before transformers, the dominant approach for sequence modeling was recurrent neural networks (RNNs) and their variants (LSTMs, GRUs). These process sequences one token at a time, maintaining a hidden state that gets updated at each step:\n$$h_t = f(h_{t-1}, x_t)$$\nThis works, but it has two fundamental problems:\nSequential computation. You can\u0026rsquo;t compute $h_t$ until you\u0026rsquo;ve computed $h_{t-1}$. This means you can\u0026rsquo;t parallelize across time steps during training, which makes training on long sequences painfully slow.\nLong-range dependencies. Information from early tokens has to survive thru every intermediate hidden state to influence later tokens. In practice, gradients vanish or explode over long distances. 
LSTMs and GRUs helped, but they didn\u0026rsquo;t solve this.\nThe Transformer\u0026rsquo;s key insight: what if we could let every token directly attend to every other token, in parallel? No sequential bottleneck. No vanishing gradients thru time. That\u0026rsquo;s what self-attention does.\nSelf-Attention: The Core Mechanism Given a sequence of $n$ token embeddings, each of dimension $d$, we stack them into a matrix $X \\in \\mathbb{R}^{n \\times d}$. Self-attention transforms this into a new representation where each token is a weighted combination of all other tokens.\nQueries, Keys, and Values We project $X$ into three different spaces using learned weight matrices:\n$$Q = XW_Q, \\quad K = XW_K, \\quad V = XW_V$$\nwhere $W_Q, W_K \\in \\mathbb{R}^{d \\times d_k}$ and $W_V \\in \\mathbb{R}^{d \\times d_v}$.\nThe intuition:\nQuery ($Q$): \u0026ldquo;What am I looking for?\u0026rdquo; Key ($K$): \u0026ldquo;What do I contain?\u0026rdquo; Value ($V$): \u0026ldquo;What information do I actually carry?\u0026rdquo; You can think of it like a database lookup. Each token broadcasts a query (\u0026ldquo;I need context about X\u0026rdquo;), every other token offers a key (\u0026ldquo;I have information about Y\u0026rdquo;), and the attention mechanism matches queries to keys to decide which values to retrieve.\nThe Attention Formula $$\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V$$\nLet\u0026rsquo;s unpack this step by step.\nStep 1: Compute attention scores. $QK^T \\in \\mathbb{R}^{n \\times n}$ gives us a matrix where entry $(i, j)$ is the dot product between query $i$ and key $j$. Higher dot product means token $i$ is more \u0026ldquo;interested\u0026rdquo; in token $j$.\nStep 2: Scale. We divide by $\\sqrt{d_k}$. You may wonder why. The reason is that when $d_k$ is large, the dot products grow in magnitude. 
If the components of $q$ and $k$ are independent random variables with zero mean and unit variance, then $q \\cdot k = \\sum_{i=1}^{d_k} q_i k_i$ has variance $d_k$. Large dot products push the softmax into regions where gradients are extremely small. Dividing by $\\sqrt{d_k}$ brings the variance back to 1.\nStep 3: Softmax. Applied row-wise, this converts each row of scores into a probability distribution. Token $i$ now has a distribution over all tokens (itself included): how much attention to pay to each one.\nStep 4: Weighted sum. Multiply the attention weights by $V$. Each token\u0026rsquo;s output is a weighted combination of all value vectors, with weights determined by the query-key compatibility.\nThe result: every token in the output \u0026ldquo;knows about\u0026rdquo; every other token in the input, weighted by relevance. And the whole thing is a handful of matrix multiplications plus a row-wise softmax, so it is fully parallelizable.\nPositional Encoding There\u0026rsquo;s a problem with self-attention as described above: it\u0026rsquo;s permutation-equivariant. If you shuffle the input tokens, the outputs are shuffled in exactly the same way and each token\u0026rsquo;s output vector is otherwise unchanged, because the mechanism itself has no notion of position. \u0026ldquo;The cat sat on the mat\u0026rdquo; and \u0026ldquo;mat the on sat cat the\u0026rdquo; would yield the same set of per-token representations.\nFor language, position obviously matters. The original Transformer uses sinusoidal positional encodings added to the input embeddings:\n$$PE_{(pos, 2i)} = \\sin\\left(\\frac{pos}{10000^{2i/d}}\\right)$$ $$PE_{(pos, 2i+1)} = \\cos\\left(\\frac{pos}{10000^{2i/d}}\\right)$$\nwhere $pos$ is the position in the sequence and $i$ is the dimension index.\nWhy sines and cosines? Two reasons:\nUnique encoding. Each position gets a unique pattern across dimensions, like a binary counter but with smooth waveforms instead of bits.\nRelative position information. For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$. 
This means the model can learn to attend to relative positions (e.g., \u0026ldquo;the token 3 positions back\u0026rdquo;) rather than absolute positions.\nThese encodings are added (not concatenated) to the token embeddings, so the model starts with both \u0026ldquo;what is this token\u0026rdquo; and \u0026ldquo;where is this token\u0026rdquo; combined in a single vector.\nModern models use Rotary Position Embeddings (RoPE) instead, which we\u0026rsquo;ll cover in Part 2. But the sinusoidal idea was the starting point.\nMulti-Head Attention A single attention head computes one set of attention weights. But different relationships between tokens might be relevant simultaneously — syntactic dependencies, semantic similarity, coreference, etc. Asking one attention head to capture all of these is asking too much.\nMulti-head attention runs $h$ independent attention heads in parallel, each with its own projections:\n$$\\text{head}_i = \\text{Attention}(XW_Q^i, XW_K^i, XW_V^i)$$\nThen concatenates and projects:\n$$\\text{MultiHead}(X) = \\text{Concat}(\\text{head}_1, \\ldots, \\text{head}_h)W_O$$\nwhere $W_O \\in \\mathbb{R}^{hd_v \\times d}$ projects back to the model dimension.\nEach head uses smaller dimensions: $d_k = d_v = d / h$. So the total computation is roughly the same as a single full-dimension head, but the model gets $h$ different \u0026ldquo;perspectives\u0026rdquo; on the input.\nIn practice, different heads do learn to specialize. Some attend to nearby tokens (local syntax), others attend to distant tokens (long-range dependencies), and some develop specialized behaviors like tracking coreference or function words. 
We\u0026rsquo;ll look at the evidence for this in Part 2.\nThe Transformer Block A complete transformer block combines multi-head attention with a feedforward network and uses residual connections + layer normalization:\n$$Z = \\text{LayerNorm}(X + \\text{MultiHead}(X))$$ $$\\text{Output} = \\text{LayerNorm}(Z + \\text{FFN}(Z))$$\nThe feedforward network (FFN) is a simple two-layer MLP applied position-wise (independently to each token):\n$$\\text{FFN}(z) = \\text{ReLU}(zW_1 + b_1)W_2 + b_2$$\nwith $W_1 \\in \\mathbb{R}^{d \\times d_{ff}}$ and $W_2 \\in \\mathbb{R}^{d_{ff} \\times d}$. Typically $d_{ff} = 4d$.\nThe residual connections ($X + \\ldots$) are important — they let gradients flow directly thru the network without being forced thru the attention or FFN layers. This is the same idea as in ResNets and it\u0026rsquo;s essential for training deep transformers.\nWhy the FFN matters. The attention mechanism handles token-to-token interactions. The FFN handles per-token transformations — you can think of it as \u0026ldquo;processing what the attention gathered.\u0026rdquo; Recent interpretability work suggests that FFN layers store factual knowledge, while attention layers route information.\nEncoder, Decoder, and Decoder-Only The original Transformer has two stacks:\nEncoder: processes the full input sequence with bidirectional attention (every token attends to every other token). Used for understanding the input.\nDecoder: generates the output sequence one token at a time, using causal (masked) attention — token $i$ can only attend to tokens $\\leq i$. This prevents the model from \u0026ldquo;cheating\u0026rdquo; by looking at future tokens during generation. The decoder also has cross-attention layers that attend to the encoder\u0026rsquo;s output.\nFor modern LLMs (GPT, Claude, LLaMA), we use decoder-only architectures. There\u0026rsquo;s no separate encoder. The model just generates tokens left-to-right with causal masking. 
This simplifies things and turns out to be sufficient for most tasks when the model is large enough.\nThe causal mask is just a triangular matrix applied to the attention scores before softmax — setting future positions to $-\\infty$ so they get zero weight after softmax.\nPutting It Together A full transformer-based language model:\nInput: Tokenize text into token IDs Embedding: Look up token embeddings + add positional encodings Transformer blocks: Pass thru $L$ stacked transformer blocks (each with multi-head attention + FFN + residual connections + layer norm) Output: Project final hidden states to vocabulary size, apply softmax to get next-token probabilities Training: maximize the log-probability of the correct next token at each position (cross-entropy loss).\nThe magic is in the scale. GPT-2 had 1.5B parameters with 48 transformer blocks. GPT-3 had 175B. Modern models go further. But the core architecture — attention + FFN + residuals — has remained remarkably stable since 2017.\nWhat We Covered and What\u0026rsquo;s Next In this post:\nSelf-attention as parallel, position-agnostic information routing The QKV framework and scaled dot-product attention Positional encoding to restore position information Multi-head attention for multiple \u0026ldquo;perspectives\u0026rdquo; The full transformer block with residuals and layer norm In Part 2, we\u0026rsquo;ll look at what happens when you actually train these things at scale — sparse attention emergence, how individual attention heads specialize, modern architectural improvements (RoPE, GQA, FlashAttention, MoE), and the gated attention mechanism that won Best Paper at NeurIPS 2025.\nThe foundation is here. 
Now we see what happens when you make it big.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-02-08/","summary":"A first-principles walkthrough of the Transformer — self-attention, positional encoding, multi-head attention — with the math that makes it work.","title":"Transformers from First Principles — Part 1: Attention Is All You Need (Really)"},{"content":"The Unconditional Problem In the previous post, we assembled a working VAE — encoder, decoder, reparameterization, ELBO. If you train it on MNIST, you get a model that can generate handwritten digits by sampling $z \\sim \\mathcal{N}(0, I)$ and decoding.\nThat\u0026rsquo;s cool and all, but also kind of annoying.\nWe have no way to say \u0026ldquo;draw me a 7.\u0026rdquo; The VAE generates something that looks like a digit, but we have no lever to control which digit. The generative process is:\n\\begin{equation} z \\sim p(z), \\quad x \\sim p_\\theta(x|z) \\end{equation}\nThere is simply no place for our intent to enter the equation.\nIn most real-world applications, we don\u0026rsquo;t want a model that generates data blindly. We want to generate data conditioned on something — a class label, a text prompt, a source image, a musical key. So the question this post answers is: how do we add a condition $y$ to the VAE without breaking anything we already built?\nTurns out the solution is actually very simple.\nFrom $p(x)$ to $p(x|y)$ Alongside our data $x$, assume we also have a condition $y$. This $y$ could be many things:\nA discrete class label (y = 7). A continuous attribute (y = brightness value). Another modality (y = a text embedding). A structured object (y = a segmentation mask). We are no longer modeling the unconditional distribution $p(x)$. 
We are modeling the conditional distribution $p(x|y)$.\nFollowing the same latent variable recipe from before, we introduce a latent $z$ and write:\n\\begin{equation} p_\\theta(x|y) = \\int p_\\theta(x|z, y)\\,p(z|y)\\,dz \\end{equation}\nThree things to notice:\nThe likelihood $p_\\theta(x|z, y)$ now depends on both $z$ and $y$. The decoder sees the condition. The prior $p(z|y)$ may also depend on $y$. (Often we just set $p(z|y) = p(z) = \\mathcal{N}(0, I)$.) Everything else stays the same. And just like before, this integral is intractable for nonlinear decoders. So we do what we already know: introduce an approximate posterior and derive an ELBO.\nDeriving the Conditional ELBO We introduce a variational posterior $q_\\phi(z|x, y)$ (the encoder, which now also takes $y$ as input) and repeat the familiar derivation.\nStarting from the log conditional likelihood:\n\\begin{equation} \\log p_\\theta(x|y) = \\log \\int p_\\theta(x|z, y)\\,p(z|y)\\,dz \\end{equation}\nMultiply and divide inside the integral by $q_\\phi(z|x, y)$, then apply Jensen\u0026rsquo;s inequality (or repeat the Bayes-decomposition argument from the VI post). The result is the Conditional ELBO:\n\\begin{equation} \\log p_\\theta(x|y) \\ge \\mathbb{E}_{q_\\phi(z|x,y)} [\\log p_\\theta(x|z, y)] - D_{KL}(q_\\phi(z|x, y) \\,\\|\\, p(z|y)) \\end{equation}\nCompare this to the standard VAE ELBO. The shape is identical:\n\\begin{equation} \\mathcal{L}_{\\text{VAE}} = \\mathbb{E}_{q_\\phi(z|x)}[\\log p_\\theta(x|z)] - D_{KL}(q_\\phi(z|x) \\,\\|\\, p(z)) \\end{equation}\n\\begin{equation} \\mathcal{L}_{\\text{CVAE}} = \\mathbb{E}_{q_\\phi(z|x,y)}[\\log p_\\theta(x|z, y)] - D_{KL}(q_\\phi(z|x, y) \\,\\|\\, p(z|y)) \\end{equation}\nEvery distribution in the objective is now additionally conditioned on $y$. That\u0026rsquo;s the whole change. 
The reparameterization trick, the Monte Carlo estimation, the closed-form KL for Gaussians: all of it carries over unchanged.\nCVAE is basically just the same VAE but with $y$ plugged in everywhere.\nWhere Does $y$ Actually Go? On paper, \u0026ldquo;the encoder and decoder both see $y$\u0026rdquo; is easy to say. In code, you have to decide how $y$ enters the networks. A few common patterns:\nConcatenation. The simplest option: turn $y$ into a feature vector (one-hot for classes, embedding for text, etc.) and concatenate it with the input.\n# Encoder\nh = encoder_net(torch.cat([x, y], dim=-1))\nmu, log_var = h.chunk(2, dim=-1)\n# Decoder\nx_hat = decoder_net(torch.cat([z, y], dim=-1))\nEmbedding + addition. For class labels, learn an embedding $e(y) \\in \\mathbb{R}^d$ and add it to $z$ before decoding. This biases the latent space along a learnable direction per class.\nConditional normalization. For image data, Conditional BatchNorm or FiLM layers modulate activations using $y$. Each class (or attribute vector) gets its own affine transform applied to feature maps.\nCross-attention. For rich conditions (text, images), let $y$ be a sequence of tokens and attend to it at every decoder block. This is the pattern inherited by modern latent-diffusion models.\nNone of these are mandated by the math. The ELBO only says $y$ must appear in the relevant conditional distributions. How the network reads $y$ is an architectural choice.\nIs the Conditional Prior Worth It? The math allows $p(z|y)$ to depend on $y$. Do we want it to?\nThe simple path: $p(z|y) = \\mathcal{N}(0, I)$. We share a single unit-Gaussian prior across all conditions. The encoder is responsible for routing different $y$ values to appropriate regions of latent space via the KL term. This is what most CVAE implementations actually do, and it works fine for most cases.\nThe learnable path: $p_\\psi(z|y) = \\mathcal{N}(\\mu_\\psi(y), \\sigma^2_\\psi(y))$. 
A small network produces the prior mean and variance as functions of $y$. This gives each condition its own \u0026ldquo;region\u0026rdquo; of latent space, which can help when conditions are very different from one another.\nThe trade-off: a learnable conditional prior adds flexibility at the cost of extra parameters and some training instability. For MNIST-scale class conditioning, the fixed $\\mathcal{N}(0, I)$ is more than enough. For complex conditions, a learnable prior starts to pay off.\nIn my experience, it is best to start with the fixed prior. If it works, you don\u0026rsquo;t need the learnable one. If the KL term is perpetually large and reconstruction is struggling, then consider letting the prior move.\nA Worked Example: Class-Conditional MNIST To make this concrete, consider the simplest case: conditioning on a digit class $y \\in \\{0, 1, \\dots, 9\\}$.\nSetup.\nRepresent $y$ as a one-hot vector $\\mathbf{y} \\in \\{0, 1\\}^{10}$.\nEncoder: $q_\\phi(z|x, y)$ takes $[x, \\mathbf{y}]$ concatenated, returns $(\\mu, \\log \\sigma^2)$.\nPrior: $p(z|y) = \\mathcal{N}(0, I)$ (shared).\nDecoder: $p_\\theta(x|z, y)$ takes $[z, \\mathbf{y}]$ concatenated, returns pixel Bernoullis.\nTraining pseudocode.\nimport torch\nimport torch.nn.functional as F\n# x: (B, D) flattened pixels in [0, 1]; y: (B,) integer class labels\nB = x.size(0)\ny_onehot = F.one_hot(y, num_classes=10).float()\nmu, log_var = encoder(torch.cat([x, y_onehot], dim=-1))\nstd = torch.exp(0.5 * log_var)\nz = mu + std * torch.randn_like(std)  # reparameterization\nx_hat = decoder(torch.cat([z, y_onehot], dim=-1))\nrecon = F.binary_cross_entropy(x_hat, x, reduction=\u0026#34;sum\u0026#34;) / B\nkl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / B\nloss = recon + kl\nGeneration. At inference, pick the class you want, sample $z$ from the prior, and decode:\ny_wanted = F.one_hot(torch.tensor([7]), num_classes=10).float()\nz = torch.randn(1, latent_dim)\ngenerated = decoder(torch.cat([z, y_wanted], dim=-1))\nNow the model generates a 7 because we told it to. 
A handful of lines of change from the unconditional VAE, and we have control.\nWhat the Condition Can Be CVAE is more general than just class-conditional generation. Any variable $y$ that explains some structure in $x$ can be a condition. Some examples you\u0026rsquo;ll see in the wild:\nAttribute-conditional. Condition on attribute vectors (age, hair color, expression) to generate faces with controlled properties. Text-to-image. Condition on a text embedding. The decoder\u0026rsquo;s cross-attention reads the text and aligns pixels to it. Image-to-image translation. Condition on a source image (e.g. edges, segmentation, low-res). This is the VAE formulation of pix2pix-style tasks. Time-series forecasting. Condition on past observations, decode future ones. The encoder-decoder shape generalizes naturally. Missing-data imputation. Treat the observed part of $x$ as $y$ and the unobserved part as $x$. The CVAE fills in what\u0026rsquo;s missing. So you may wonder why these all look similar — it\u0026rsquo;s because they\u0026rsquo;re all answering the same question: given $y$, what distribution over $x$ does the data imply?\nConnection to Modern Conditional Generators Every successful generative model today is a conditional one. Stable Diffusion, Imagen, Sora, audio LDMs — all of them learn $p(x|y)$ where $y$ is text, audio, video, or multi-modal embeddings.\nThe architectural tricks have evolved (cross-attention, classifier-free guidance, adapter modules), but the probabilistic skeleton is the one we just wrote down. The CVAE was the first clean expression of this in the deep learning era: let the condition enter every piece of the model, and let the ELBO handle the rest.\nIf you understand why the CVAE ELBO looks the way it does, you basically already understand the setup of conditional score matching, conditional flow matching, and guided diffusion. 
They all inherit this structure.\nA Practical Note: Posterior Collapse with Strong Conditions One failure mode deserves a warning. If $y$ contains almost as much information as $x$ (e.g. $y$ is a detailed caption for a simple image), the decoder can learn to ignore $z$ entirely. The latent becomes useless: the model just memorizes how to map $y$ directly to $x$.\nSymptoms:\n$D_{KL}(q_\\phi(z|x, y) \\,||\\, p(z|y)) \\to 0$.\n$\\mu_\\phi(x, y) \\approx 0$, $\\sigma_\\phi(x, y) \\approx 1$ for every input (with the fixed $\\mathcal{N}(0, I)$ prior).\nSampling different $z$ gives nearly identical outputs.\nFixes:\nKL annealing (start with a small KL weight and ramp up).\nFree bits (allow a minimum KL budget per latent dimension).\nWeaker conditioning paths (inject $y$ at fewer layers).\nWe\u0026rsquo;ll meet this tension again, dialed up intentionally, in the next post on $\\beta$-VAE.\nSummary The CVAE models $p(x|y)$ by letting the encoder, decoder, and (optionally) prior all condition on $y$.\nThe conditional ELBO has the same shape as the VAE ELBO, with every relevant distribution conditioned on $y$.\nAll the machinery we built for the VAE (reparameterization, closed-form Gaussian KL, Monte Carlo reconstruction) carries over without change.\nWhere $y$ enters the network is an engineering choice: concat, embed, FiLM, cross-attend.\nThe CVAE is the probabilistic skeleton underneath every modern conditional generator. 
","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-01-25/","summary":"We extend the VAE into a controllable generative model by adding a condition y into every term of the ELBO.","title":"Conditional VAE (CVAE): Learning to Generate with Conditions"},{"content":"The Question Here\u0026rsquo;s something that sounds like it should be hard to answer: when an LLM refuses a harmful request, which specific neurons are responsible for that refusal?\nYou might expect safety behavior to be distributed across the whole network — millions of parameters working together to produce \u0026ldquo;I can\u0026rsquo;t help with that.\u0026rdquo; After all, safety training (RLHF, DPO, etc.) updates all parameters during training.\nBut a NeurIPS 2025 paper (\u0026ldquo;Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons\u0026rdquo;) found something surprising: safety is concentrated. Roughly 5% of neurons handle roughly 90% of safety behavior. They call these \u0026ldquo;safety neurons.\u0026rdquo;\nHow They Found Them The method is conceptually simple, even if the execution is involved:\nStep 1: Generate paired outputs. For a set of harmful prompts, record the model\u0026rsquo;s activations when it refuses (aligned behavior) and when a jailbroken version complies (unaligned behavior). The difference in activations between these two cases highlights which neurons \u0026ldquo;activate\u0026rdquo; for safety.\nStep 2: Identify safety-critical neurons. Using the activation differences, rank neurons by how much their activation changes between safe and unsafe behavior. The top neurons by this metric are the \u0026ldquo;safety neurons.\u0026rdquo;\nStep 3: Validate with patching. Take a jailbroken (unsafe) model and patch in the activations of only the safety neurons from the aligned model. If safety is really concentrated in these neurons, patching just 5% of them should restore safety behavior.\nIt does. 
Patching ~5% of neurons restores \u0026gt;90% of safety performance. The remaining 95% of neurons can be in their jailbroken state, and the model still refuses harmful requests.\nWhat Safety Neurons Actually Do The paper goes further than just identifying which neurons matter. They also analyzed what these neurons compute:\nEarly layers: Safety neurons in early layers appear to detect harmful intent in the input — recognizing patterns associated with requests for dangerous information, manipulation, etc.\nMiddle layers: These neurons seem to activate \u0026ldquo;refusal circuitry\u0026rdquo; — they transform the hidden state in ways that suppress harmful completion pathways.\nLate layers: Safety neurons near the output layers steer the token probabilities toward refusal responses (\u0026ldquo;I can\u0026rsquo;t\u0026rdquo;, \u0026ldquo;I\u0026rsquo;m not able to\u0026rdquo;, etc.) and away from harmful completions.\nSo there\u0026rsquo;s a pipeline: detect harm → activate refusal → steer output. And each stage has its own concentrated set of neurons.\nWhy Concentration Happens You may wonder why safety would be concentrated rather than distributed. I think there are a few reasons:\nSafety training is narrow. RLHF and DPO optimize on a relatively small set of safety-related examples compared to the massive pre-training corpus. The gradient updates from safety training affect a limited subset of neurons strongly, rather than all neurons weakly.\nSafety is a \u0026ldquo;mode switch.\u0026rdquo; Safe behavior often requires a categorical decision — comply or refuse — rather than a gradual adjustment. Categorical decisions tend to be implemented by a small number of high-impact neurons (think of a ReLU activation as a gate: on or off).\nSuperposition. Neural networks represent many features in superposition (more features than neurons, with each feature as a direction in activation space). 
Safety might be one such feature — a specific direction that a small set of neurons are aligned with.\nImplications for Alignment This finding has both encouraging and worrying implications.\nThe Good News Targeted safety interventions. If we know which neurons are responsible for safety, we can monitor them specifically during training and deployment. If safety neuron activations drop, that\u0026rsquo;s an early warning.\nEfficient safety patching. You don\u0026rsquo;t need to retrain the whole model to fix a safety issue. Patching a small number of neurons could restore safety. This is much cheaper and faster than full RLHF cycles.\nInterpretability toolkit. Safety neurons give us a concrete, mechanistic handle on alignment. Instead of treating the model as a black box that sometimes refuses and sometimes doesn\u0026rsquo;t, we can trace the decision thru specific components.\nThe Bad News Concentrated = vulnerable. If 5% of neurons control 90% of safety, then attacking those specific neurons could disable safety much more efficiently than attacking the model as a whole. An adversary who knows which neurons to target could craft inputs that suppress safety neuron activations.\nFragility. Concentration means there\u0026rsquo;s less redundancy. In a distributed safety system, corrupting a few neurons has limited impact because many others compensate. In a concentrated system, corrupting the right few neurons can catastrophically disable safety.\nJailbreaking explained. Some jailbreak techniques might work precisely because they suppress safety neuron activations. Understanding this mechanism could lead to better defenses — or better attacks. This is the dual-use problem of interpretability research.\nConnection to Broader Interpretability This work fits into the mechanistic interpretability research program that Chris Olah and others have been building. The big picture:\nSparse Autoencoders (SAEs) decompose neural network activations into interpretable features. 
Recent work (5+ SAE papers at NeurIPS 2025 alone) is making this practical at LLM scale.\nCircuit analysis traces how information flows thru specific pathways in the network. Safety neurons are one such circuit.\nFeature absorption (NeurIPS 2025) shows that hierarchical features can be absorbed into other features during SAE optimization, making some features invisible. Could safety features be similarly absorbed or hidden?\nThe connection: safety neurons are a specific circuit within the broader feature landscape that SAEs try to map. Understanding this circuit helps, but we need the full map to really trust our interpretability tools.\nWhat This Changes Before this paper, safety alignment was mostly a training-time concern: get the right training data, use the right loss function, hope it generalizes. Safety neurons shift the conversation to a mechanistic level:\nWe can audit specific components for safety properties We can detect when safety is being undermined (activation monitoring) We can intervene surgically (patching, pruning, amplification) But the concentration also means we should worry more about the robustness of alignment. A model that implements safety thru 5% of its neurons is a model where safety can be disabled by targeting 5% of its neurons.\nThe question going forward: is this concentration an artifact of current training methods that could be improved, or is it a fundamental property of how neural networks implement mode-switching behavior? 
If it\u0026rsquo;s fundamental, we need to build defenses that assume concentration rather than trying to eliminate it.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-01-18/","summary":"Mechanistic interpretability meets alignment: how researchers found that a tiny fraction of neurons are responsible for almost all safety behavior in LLMs, and what that means.","title":"Safety Neurons: 5% of Your Model Controls 90% of Safety"},{"content":"Recap In the last post, we derived the ELBO and watched it split into two parts:\n\\begin{equation} \\mathcal{L}(\\theta, \\phi; x) = \\underbrace{\\mathbb{E}_{q_\\phi(z|x)} [\\log p_\\theta(x|z)]}_{\\text{Reconstruction}} - \\underbrace{D_{KL}(q_\\phi(z|x) \\,||\\, p(z))}_{\\text{Regularization}} \\end{equation}\nThis looks clean on paper: reconstruct $x$ well, while keeping the encoded distribution close to the prior $p(z)$.\nBut to actually train a neural network with this, we need to answer three questions:\n1. How do we compute the KL term in practice?\n2. How do we estimate the reconstruction expectation?\n3. How do we differentiate through a random sample $z \\sim q_\\phi(z|x)$?\nQuestion 3 is the hardest one, and this is where the VAE paper really shines. But let\u0026rsquo;s start with the easy parts first.\nThe KL Term: A Closed Form For the VAE to be practical, we need a variational family $\\mathcal{Q}$ that is flexible enough to approximate the posterior but simple enough to compute with. The standard choice is the diagonal Gaussian:\n\\begin{equation} q_\\phi(z|x) = \\mathcal{N}\\big(z; \\mu_\\phi(x), \\text{diag}(\\sigma^2_\\phi(x))\\big) \\end{equation}\nand the prior is the standard isotropic Gaussian:\n\\begin{equation} p(z) = \\mathcal{N}(z; 0, I) \\end{equation}\nTwo Gaussians. Nothing too wild.\nThe encoder (with parameters $\\phi$) takes $x$ and outputs two vectors: a mean $\\mu_\\phi(x) \\in \\mathbb{R}^d$ and a log-variance $\\log \\sigma^2_\\phi(x) \\in \\mathbb{R}^d$. 
We output log-variance rather than variance so the network can produce any real number without needing a positivity constraint.\nBecause both distributions are Gaussian, the KL divergence has a closed-form expression:\n\\begin{equation} D_{KL}\\big(\\mathcal{N}(\\mu, \\text{diag}(\\sigma^2)) \\,||\\, \\mathcal{N}(0, I)\\big) = \\frac{1}{2} \\sum_{j=1}^{d} \\left( \\mu_j^2 + \\sigma_j^2 - \\log \\sigma_j^2 - 1 \\right) \\end{equation}\nThis is pretty nice actually. No sampling is required for the KL term. We get a deterministic, differentiable expression straight from the encoder\u0026rsquo;s output. Gradients flow cleanly back into $\\phi$ since the expression is built from elementary functions of $\\mu_j$ and $\\sigma_j^2$.\nSo the KL term is basically \u0026ldquo;free\u0026rdquo; in terms of differentiability. All the hard work lives in the reconstruction term.\nThe Reconstruction Term: Monte Carlo Now the harder half:\n\\begin{equation} \\mathbb{E}_{q_\\phi(z|x)} [\\log p_\\theta(x|z)] \\end{equation}\nThis expectation has no closed form. The decoder $p_\\theta(x|z)$ is a deep neural network, so there\u0026rsquo;s no way to integrate over all possible $z$ analytically.\nSo we estimate the expectation with a Monte Carlo average over $L$ samples:\n\\begin{equation} \\mathbb{E}_{q_\\phi(z|x)} [\\log p_\\theta(x|z)] \\approx \\frac{1}{L} \\sum_{\\ell=1}^{L} \\log p_\\theta(x | z^{(\\ell)}), \\quad z^{(\\ell)} \\sim q_\\phi(z|x) \\end{equation}\nIn practice, we often use just one sample per data point ($L=1$). This sounds a bit reckless, but it works surprisingly well because the minibatch average already provides a lot of averaging.\nWhat does $\\log p_\\theta(x|z)$ actually look like? That depends on the data type:\nFor binary data (e.g. binarized MNIST), $p_\\theta(x|z)$ is a product of Bernoullis, and $\\log p_\\theta(x|z)$ reduces to the negative binary cross-entropy between $x$ and the decoder output. 
For continuous data with a fixed-variance Gaussian decoder, $\\log p_\\theta(x|z) = -\\frac{1}{2\\sigma^2} \\|x - \\hat{x}\\|^2 + \\text{const}$. So it\u0026rsquo;s just a scaled MSE.\nSo the \u0026ldquo;probabilistic\u0026rdquo; decoder, in practice, collapses into one of the two loss functions you already know. The loss function itself is nothing special; the interesting part is what happens before it.\nThe Problem: You Cannot Backprop Through a Sample And here is where we have a problem.\nOur current recipe looks like this:\n1. The encoder outputs $\\mu_\\phi(x)$ and $\\sigma_\\phi(x)$.\n2. We sample $z \\sim \\mathcal{N}(\\mu_\\phi(x), \\text{diag}(\\sigma^2_\\phi(x)))$.\n3. The decoder computes $\\log p_\\theta(x|z)$.\n4. We want $\\nabla_\\phi$ of the whole thing.\nStep 2 is the killer. The sample $z$ depends on $\\phi$ (because $\\mu$ and $\\sigma$ are outputs of the encoder), but it is also a random draw. So what is the derivative of a random draw with respect to the parameters that shape its distribution?\nFormally, we want:\n\\begin{equation} \\nabla_\\phi \\, \\mathbb{E}_{z \\sim q_\\phi(z|x)} [f(z)] \\end{equation}\nwhere $f(z) = \\log p_\\theta(x|z)$. The gradient wants to go inside the expectation, but the expectation itself is over a distribution that depends on $\\phi$. You can\u0026rsquo;t just swap them.\nA first attempt: the score function estimator There is a general-purpose gradient estimator for this, called REINFORCE or the score function estimator:\n\\begin{equation} \\nabla_\\phi \\, \\mathbb{E}_{q_\\phi} [f(z)] = \\mathbb{E}_{q_\\phi} \\big[ f(z) \\, \\nabla_\\phi \\log q_\\phi(z|x) \\big] \\end{equation}\nIt\u0026rsquo;s unbiased and works for any distribution whose log-density is differentiable in $\\phi$. But in practice, the variance of this estimator is so high that training a deep network with it is painful at best. We need a better idea.\nThe Reparameterization Trick Here is the key insight of the VAE. 
Instead of sampling $z$ directly from a distribution that depends on $\\phi$, we sample noise from a fixed distribution and then transform it with a deterministic function that depends on $\\phi$.\nFor a Gaussian encoder, this is very simple. Any sample $z \\sim \\mathcal{N}(\\mu, \\text{diag}(\\sigma^2))$ can be written as:\n\\begin{equation} z = \\mu + \\sigma \\odot \\epsilon, \\quad \\epsilon \\sim \\mathcal{N}(0, I) \\end{equation}\nwhere $\\odot$ is elementwise multiplication.\nThe randomness now lives in $\\epsilon$, which has nothing to do with $\\phi$. The parameters $\\mu_\\phi(x)$ and $\\sigma_\\phi(x)$ only appear in the deterministic transformation.\nSo we don\u0026rsquo;t eliminate the randomness; we relocate it. The noise is pushed out to a leaf of the computation graph, where autograd can treat it as a constant.\nWhy this makes everything work With reparameterization, the gradient becomes:\n\\begin{equation} \\nabla_\\phi \\, \\mathbb{E}_{z \\sim q_\\phi} [f(z)] = \\nabla_\\phi \\, \\mathbb{E}_{\\epsilon \\sim \\mathcal{N}(0, I)} \\big[ f(\\mu_\\phi(x) + \\sigma_\\phi(x) \\odot \\epsilon) \\big] = \\mathbb{E}_{\\epsilon} \\big[ \\nabla_\\phi f(\\mu_\\phi(x) + \\sigma_\\phi(x) \\odot \\epsilon) \\big] \\end{equation}\nThe gradient and the expectation commute now, because the distribution we\u0026rsquo;re averaging over ($\\mathcal{N}(0, I)$) does not depend on $\\phi$.\nIn autograd terms:\n$\\epsilon$ is a constant tensor sampled once per forward pass.\n$z = \\mu_\\phi(x) + \\sigma_\\phi(x) \\odot \\epsilon$ is a deterministic node with two parents ($\\mu$ and $\\sigma$) that we want gradients for.\nBackprop flows through $z$ into $\\mu$ and $\\sigma$, and then into $\\phi$. No custom math required.\nThis is why people call it a \u0026ldquo;trick.\u0026rdquo; It looks like a small algebraic rewrite. It is a small algebraic rewrite. 
But it\u0026rsquo;s the rewrite that makes the entire VAE differentiable.\nA more general picture Reparameterization isn\u0026rsquo;t limited to Gaussians. Whenever you can write a sample from your target distribution as $z = g_\\phi(x, \\epsilon)$, where $g_\\phi$ is differentiable in $\\phi$ and $\\epsilon$ is drawn from some fixed base distribution, you can apply the same idea. This is also the principle behind normalizing flows, an entire family of models built on pushing simple noise through invertible, learnable transformations.\nFor now, though, a Gaussian is all we need.\nThe Full VAE We now have every ingredient. Let\u0026rsquo;s put the VAE together as one training loop.\nForward pass for a single example $x$:\nEncoder: $(\\mu, \\log \\sigma^2) = \\text{EncoderNet}_\\phi(x)$.\nSample: $\\epsilon \\sim \\mathcal{N}(0, I)$.\nReparameterize: $z = \\mu + \\sigma \\odot \\epsilon$, where $\\sigma = \\exp(\\tfrac{1}{2} \\log \\sigma^2)$.\nDecoder: $\\hat{x} = \\text{DecoderNet}_\\theta(z)$ (or the parameters of $p_\\theta(x|z)$).\nCompute the two loss terms: $\\mathcal{L}_{\\text{recon}} = -\\log p_\\theta(x|z)$ (BCE or MSE) and $\\mathcal{L}_{\\text{KL}} = \\frac{1}{2} \\sum_j (\\mu_j^2 + \\sigma_j^2 - \\log \\sigma_j^2 - 1)$.\nTotal loss: $\\mathcal{L} = \\mathcal{L}_{\\text{recon}} + \\mathcal{L}_{\\text{KL}}$, which is the negative ELBO.\nBackward pass: standard autograd through every step. The stochastic node $\\epsilon$ is a constant tensor, so backprop never has to reason about randomness.\nPseudocode\nimport torch\nimport torch.nn.functional as F\n# x: (B, D) batch with values in [0, 1]; encoder and decoder are the two networks\nB = x.size(0)\nmu, log_var = encoder(x)  # (B, d), (B, d)\nstd = torch.exp(0.5 * log_var)\neps = torch.randn_like(std)  # ~ N(0, I)\nz = mu + std * eps  # reparameterization\nx_hat = decoder(z)\nrecon_loss = F.binary_cross_entropy(x_hat, x, reduction=\u0026#34;sum\u0026#34;) / B\nkl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / B\nloss = recon_loss + kl_loss  # negative ELBO, averaged over the batch\nloss.backward()\nThat is the entire VAE training loop. The rest is plumbing.\nWhat We Actually Built So looking back, we started from the idea that data is produced by hidden causes. 
We wrote down a probabilistic model, found it intractable, used variational inference, derived the ELBO, decomposed it into reconstruction plus regularization, and finally made it differentiable with the reparameterization trick.\nThe result is a model that can do two things:\nEncode: Given an input $x$, map it to a distribution over latent codes $q_\\phi(z|x)$. This gives us a probabilistic representation — not a point, but a region of meaning. Generate: Sample $z \\sim \\mathcal{N}(0, I)$, pass it through the decoder, and get a new $x$ that looks like it came from the training distribution. The second capability is what makes the VAE generative in a way that a plain autoencoder is not. Because the KL regularizer forces the encoder\u0026rsquo;s outputs to fill the prior smoothly, sampling from the prior lands us in regions the decoder can reconstruct.\nTwo Things Worth Noticing The KL term is a budget. It limits how much information the encoder can stuff into $z$ about any particular $x$. If the encoder tries to memorize, KL goes up. If the encoder stays too close to the prior, reconstruction suffers. The ELBO is a negotiation between the two.\nThe reparameterization trick is why this is a neural network and not an EM algorithm. Classical variational inference required careful, distribution-specific updates. Reparameterization replaces all of that with a single backward pass — turn the inference problem into a differentiable computation and hand it to the optimizer.\nSummary The KL term has a closed form for diagonal Gaussians — no sampling needed. The reconstruction term is estimated with a single Monte Carlo sample, reducing to BCE or MSE depending on the decoder. Backpropagating through a raw sample is intractable and very noisy via REINFORCE. The reparameterization trick rewrites $z = \\mu + \\sigma \\odot \\epsilon$ with $\\epsilon \\sim \\mathcal{N}(0, I)$, moving randomness to a constant leaf so gradients flow deterministically through $\\mu$ and $\\sigma$. 
The full VAE is an encoder, a sampler, a decoder, and two loss terms — all tied together by autograd. ","permalink":"https://learning-notes-dz2.pages.dev/posts/2026-01-10/","summary":"We open the ELBO, compute each term, and meet the reparameterization trick — the idea that lets us backpropagate through randomness.","title":"Dissecting the VAE Objective: KL, Reconstruction, and the Reparameterization Trick"},{"content":"Why This Post Exists So for the past two years I\u0026rsquo;ve been writing mostly about generative models — VAEs, probability, information theory. That was fun and I learned a lot. But I\u0026rsquo;ve been increasingly drawn toward the intersection of LLMs, reinforcement learning, and AI safety. Not just using these models, but understanding how they learn, how we align them, and what happens when that alignment fails.\nI spent some time putting together a resource guide for myself — conferences to follow, books to read, papers to prioritize, researchers to track. Then I realized this might be useful for other people making a similar pivot. So here it is. This is not comprehensive and not objective. It\u0026rsquo;s my personal reading list with my own notes on what matters and why.\nTop-Tier Conferences If you\u0026rsquo;re serious about this field, you need to follow the proceedings from these conferences. You don\u0026rsquo;t need to attend (I haven\u0026rsquo;t yet) — just read the accepted paper lists and best paper awards.\nTier 1: The Big 3 Conference When Acceptance Rate Best For NeurIPS Nov/Dec ~24-26% Broadest coverage. LLM + alignment papers from all major labs ICML July ~27.5% RL emphasis, optimization, scaling laws ICLR May ~20-25% Representation learning. 
Safety/interpretability track is growing fast Some notes from my own reading:\nICML historically emphasizes reinforcement learning and robotics more than the other two NeurIPS has the widest breadth — it attracts alignment work from OpenAI, Anthropic, DeepMind ICLR is increasingly where the safety and interpretability community publishes Tier 2: Specialized Conference When Focus AAAI February Broad AI, emerging safety track ACL / EMNLP / NAACL Varies NLP-specific, language model papers FAccT June AI ethics, fairness, policy — bridges technical and societal perspectives FLLM Annual LLM-specific research (newer venue) AIES Annual AI ethics and society FAccT is interesting because it\u0026rsquo;s where technical alignment meets policy. If you care about how alignment research connects to things like the EU AI Act, this is the conference.\nHow to Actually Use Conferences (Without Attending) Scan accepted paper lists — monthly for NeurIPS/ICML/ICLR Follow best paper awards — these signal where the field is moving Watch keynotes on YouTube — 1-2 hours saves you reading 10 papers Use Papers With Code — find implementations alongside papers Attend 1 in-person eventually — NeurIPS or ICML for networking Books I organized these by topic and roughly by the order you\u0026rsquo;d want to read them. Not all of these I\u0026rsquo;ve finished — I\u0026rsquo;m noting which ones I\u0026rsquo;ve actually read vs. which are on my list.\nMath Foundation Book Authors Notes Mathematics for Machine Learning Deisenroth, Faisal, Ong Free PDF. If you already have math background, skim the optimization chapters — you\u0026rsquo;ll need that for RL. Linear Algebra and Optimization for ML Charu C. Aggarwal Heavier on optimization. Good if you want depth on SVD and kernel methods. Deep Learning Book Authors Notes Deep Learning: Foundations and Concepts Christopher Bishop 2024 edition. This is the most up-to-date comprehensive reference. Covers transformers properly. 
The Little Book of Deep Learning François Fleuret 150 pages. Fast refresher. Good if you already know the basics and just want modern architecture intuitions. Deep Learning Goodfellow, Bengio, Courville The classic. Dense but foundational. Getting old now but still referenced everywhere. Reinforcement Learning Book Authors Notes Reinforcement Learning: An Introduction (2nd ed) Sutton \u0026amp; Barto THE canonical textbook. Free PDF at incompleteideas.net. Read cover to cover. Deep Reinforcement Learning Hands-On Maxim Lapan Practical implementations. Good companion to Sutton \u0026amp; Barto — they give you theory, this gives you code. Multi-Agent Reinforcement Learning (Recent survey) Cutting-edge MARL. First to integrate modern deep learning approaches. NLP \u0026amp; Language Models Book Authors Notes Transformers for NLP Luis Serrano Clear transformer explanations with practical applications. Build a Large Language Model (From Scratch) Sebastian Raschka Hands-on from tokenizer to RLHF. This is probably the best \u0026ldquo;implement it yourself\u0026rdquo; resource. AI Safety \u0026amp; Alignment There is no canonical textbook for AI safety yet. The knowledge lives in papers and blogs:\nResource What You Get Anthropic\u0026rsquo;s research blog Constitutional AI, reward hacking, alignment faking — primary source for cutting-edge alignment research Paul Christiano\u0026rsquo;s blog Scalable oversight, alignment research agenda Chris Olah / Distill.pub Interpretability done right — visual, rigorous, beautiful Stuart Russell — Human Compatible Philosophical foundations. Good for the \u0026ldquo;why\u0026rdquo; before the \u0026ldquo;how\u0026rdquo; Alignment Forum (alignmentforum.org) Community discussion and paper reviews Foundational Papers by Priority Must-Read First (Before Everything Else) Paper Year Key Concepts Attention is All You Need 2017 Self-attention, positional encoding, multi-head attention. The paper that started the LLM era. 
Scaling Laws for Neural Language Models 2020 Power-law relationship between loss, model size, and data. Why bigger models work. Language Models are Unsupervised Multitask Learners (GPT-2) 2019 In-context learning, zero-shot. The paper where emergence became undeniable. LLM-Specific (2019-2023) Paper Year What It Addresses InstructGPT / RLHF 2022 The 3-step pipeline: SFT → Reward Model → PPO. How ChatGPT was trained. Constitutional AI (Anthropic) 2023 Self-improving alignment without human labels. RLAIF. Direct Preference Optimization (DPO) 2023 Simplifies RLHF by removing the reward model entirely. Stability + simplicity. Chain-of-Thought Prompting 2022 Intermediate reasoning steps improve complex reasoning. Advanced Alignment \u0026amp; Safety (2023-2025) Paper Year Focus Reward Hacking in LLMs (Multiple labs) 2024-2025 Models optimize proxy rewards, breaking alignment. Specification gaming. Natural Emergent Misalignment (Anthropic) 2024 Reward hacking emerges in RL without explicit intent. Alignment faking. Mitigating Reward Hacking via Information-Theoretic Approaches 2024 Proxy Compression Hypothesis — information bottleneck view of reward hacking Training on Docs About Reward Hacking Induces Reward Hacking 2025 Out-of-context learning: models learn reward-hacking from documentation about it. Wild. Recent Advances (2024-2025) Area What\u0026rsquo;s New Reasoning Models (o1, DeepSeek-R1) RL-trained chain-of-thought. Test-time compute scaling. Chain-of-X Paradigms Beyond CoT: Tree-of-Thought, Graph-of-Thought, etc. Tool Use \u0026amp; Agents Function calling, multi-agent coordination (MetaGPT, CAMEL, AutoAct) Scaling Laws Across Architectures Power laws hold across dense and sparse (MoE) model families Researchers to Follow Must-Follow (Active, Accessible Content) Researcher Where Best For Andrej Karpathy YouTube, X Best DL educator. His \u0026ldquo;Let\u0026rsquo;s build GPT from scratch\u0026rdquo; is mandatory viewing. 
Yoshua Bengio Papers, talks Turing Award laureate. Shifted focus to AI safety; worth understanding why. Chris Olah distill.pub, Anthropic blog Interpretability pioneer. If you want to understand what\u0026rsquo;s inside neural networks, start here. Lilian Weng lilianweng.github.io The gold standard for technical surveys. Her RLHF post is referenced everywhere. Paul Christiano paulchristiano.com, X Alignment research agenda. Scalable oversight. Deep thinker. Read Their Papers Researcher Affiliation Focus Dario Amodei Anthropic Constitutional AI, responsible scaling Ilya Sutskever SSI Scaling dynamics, AGI trajectory John Schulman Thinking Machines Lab (previously OpenAI, Anthropic) PPO, RLHF (the RL-for-LLMs guy) Jacob Steinhardt UC Berkeley AI safety forecasting, alignment evaluation Ethan Perez Anthropic Red-teaming, sycophancy research Where They Publish Channel Best For Personal Blogs Long-form systematic thinking (Weng, Christiano, Olah) X/Twitter Hot takes, latest ideas, paper commentary YouTube Karpathy tutorials, Lex Fridman interviews, Neel Nanda\u0026rsquo;s mech interp tutorials arXiv Full papers; read abstracts first, then decide Papers with Code Implementations alongside papers 2024-2025: What Changed Reasoning \u0026amp; Inference-Time Compute A growing share of capability gains now comes from test-time compute (reasoning) rather than training-time scale. OpenAI\u0026rsquo;s o1 uses RL-trained long chains of thought. Process Reward Models score intermediate reasoning steps, not just final outputs. The next frontier is making reasoning itself learnable.\nMust read: \u0026ldquo;Reasoning in Large Language Models: From Chain-of-Thought to Massively Decomposed Agentic Processes\u0026rdquo; (2025)\nAgents \u0026amp; Tool Use LLMs now routinely use external tools. Multi-agent systems are emerging. Function calling is standardized across GPT-4, Claude, etc. The implication: alignment and safety must extend beyond single-LLM to multi-agent scenarios. 
This is mostly unsolved.\nAlignment Technique Landscape Technique Status 2025 Trade-offs RLHF Still dominant; complex to tune Requires reward model; PPO instability DPO Widely adopted; simpler Offline preference data only; less flexible Constitutional AI / RLAIF Growing Depends on constitution quality Online RL (GRPO etc.) Emerging frontier More complex; richer feedback loop No single \u0026ldquo;best\u0026rdquo; technique. Constitutional AI scales better. DPO is simpler. RLHF is most empirically validated. The field is converging on hybrid approaches.\nInterpretability Breakthroughs Sparse Autoencoders (SAEs): Extract interpretable features from activations Linear Probes: Predict reward hacking from activations at the token level Circuit Analysis: Understanding how features compose into behaviors The direction is clear: mechanistic interpretability is becoming a requirement for alignment, not a nice-to-have.\nEmerging Risks (2025) Out-of-Context Reasoning: Models learn behaviors from documentation about behaviors Inference-Time Misalignment: Reward hacking at inference time (Best-of-N sampling etc.) Scaling Alignment Failures: Alignment properties don\u0026rsquo;t scale with model size as expected Open Questions Things I don\u0026rsquo;t have good answers to yet:\nWhen will Constitutional AI surpass RLHF empirically? It theoretically scales better, but RLHF is still the most battle-tested in production. How do scaling laws change with inference-time compute? o1 suggests different scaling curves emerge with reasoning, but how generalizable is this? Can interpretability scale to reasoning models? SAEs work on base models. Unclear if they reveal reasoning-time dynamics in o1-style models. Will tool use change alignment requirements fundamentally? Multi-agent + tool use introduces failure modes nobody has really studied yet. 
If you have thoughts on any of these, I\u0026rsquo;d love to hear them.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2025-12-28/","summary":"Books, papers, conferences, and researchers — a personal resource list for anyone going deep into LLMs, RL, and AI safety.","title":"A Curated Guide to LLMs, Reinforcement Learning, and AI Safety"},{"content":"The Intractable Wall In the previous post, we defined the Latent Variable Model (LVM) — the idea that complex, high-dimensional data $x$ is generated by a simpler, lower-dimensional latent variable $z$.\nWe established that to train such a model, we need to maximize the marginal likelihood (the \u0026ldquo;evidence\u0026rdquo;):\n\\begin{equation} p_\\theta(x) = \\int p_\\theta(x, z) \\, dz = \\int p_\\theta(x|z)p(z) \\, dz \\end{equation}\nWe also established that for any interesting, non-linear model (where $p_\\theta(x|z)$ is parameterized by a neural network), this integral has no closed form and is intractable to compute. We cannot integrate over all possible latent states $z$. Consequently, we cannot calculate the posterior $p_\\theta(z|x)$ either, because the evidence $p_\\theta(x)$ appears in the denominator of Bayes\u0026rsquo; rule:\n\\begin{equation} p_\\theta(z|x) = \\frac{p_\\theta(x|z)p(z)}{p_\\theta(x)} = \\frac{p_\\theta(x|z)p(z)}{\\int p_\\theta(x|z)p(z) \\, dz} \\end{equation}\nWe are stuck. We have a generative story that makes perfect sense, but we have no way to compute the probability of our data, which means we cannot train our model using standard Maximum Likelihood Estimation (MLE).\nSo how do we get around this? That\u0026rsquo;s where Variational Inference (VI) comes in.\nThe main idea of VI is actually pretty simple.
If we cannot calculate the true posterior $p_\\theta(z|x)$ analytically, perhaps we can approximate it with a simpler distribution $q_\\phi(z|x)$ and then optimize the parameters $\\phi$ to make it as close as possible to the truth.\nTurning Integration into Optimization In classical calculus, integration is finding the area under a curve. In high dimensions, numerical integration on a grid becomes exponentially expensive. Optimization, however — finding the peak of a mountain — is something deep learning is exceptionally good at (thanks to Gradient Descent).\nVariational Inference converts the problem of inference (calculating an integral) into a problem of optimization (minimizing a divergence).\nWe introduce a variational family of distributions $\\mathcal{Q}$. We choose a specific distribution $q_\\phi(z|x) \\in \\mathcal{Q}$, parameterized by $\\phi$, to serve as a surrogate for the true, intractable posterior $p_\\theta(z|x)$.\nOur goal is simple: Find the parameters $\\phi$ that make $q_\\phi(z|x)$ look as much like $p_\\theta(z|x)$ as possible.\nIf we can do this, we can use $q_\\phi(z|x)$ as a proxy for the unknown posterior. This $q_\\phi(z|x)$ is often called the inference model (or in VAE terminology, the encoder), while $p_\\theta(x|z)$ is the generative model (the decoder).\nBut how do we measure \u0026ldquo;closeness\u0026rdquo; between two probability distributions? And how do we minimize that distance if we don\u0026rsquo;t know the target $p_\\theta(z|x)$ in the first place?\nThe Kullback-Leibler Divergence To measure the dissimilarity between our approximation $q_\\phi(z|x)$ and the truth $p_\\theta(z|x)$, we use the Kullback-Leibler (KL) Divergence.\nFormally, for two continuous distributions $q(z)$ and $p(z)$, the KL divergence is defined as:\n\\begin{equation} D_{KL}(q||p) = \\int q(z) \\log \\frac{q(z)}{p(z)} \\, dz = \\mathbb{E}_{z \\sim q} [\\log q(z) - \\log p(z)] \\end{equation}\nKey Properties of KL Divergence:\nNon-negative: $D_{KL}(q||p) \\ge 0$.
Zero at equality: $D_{KL}(q||p) = 0$ if and only if $q(z) = p(z)$ almost everywhere. Asymmetric: $D_{KL}(q||p) \\neq D_{KL}(p||q)$. In Variational Inference, we typically minimize the \u0026ldquo;reverse\u0026rdquo; KL divergence: $D_{KL}(q_\\phi(z|x) || p_\\theta(z|x))$. (Note: for brevity, we will often write $D_{KL}(q_\\phi||p_\\theta)$).\n\\begin{equation} \\phi^* = \\arg \\min_\\phi D_{KL}(q_\\phi(z|x) || p_\\theta(z|x)) \\end{equation}\nSo, our objective function seems clear. Let\u0026rsquo;s expand this definition:\n\\begin{equation} D_{KL}(q_\\phi(z|x) || p_\\theta(z|x)) = \\mathbb{E}_{q_\\phi} [\\log q_\\phi(z|x) - \\log p_\\theta(z|x)] \\end{equation}\nUsing Bayes\u0026rsquo; rule $\\log p_\\theta(z|x) = \\log p_\\theta(x, z) - \\log p_\\theta(x)$, we get:\n\\begin{equation} D_{KL}(q_\\phi || p_\\theta) = \\mathbb{E}_{q_\\phi} [\\log q_\\phi(z|x) - \\log p_\\theta(x, z) + \\log p_\\theta(x)] \\end{equation}\nWe hit the wall again. The term $\\log p_\\theta(x)$ (the marginal likelihood) is in the equation. We can\u0026rsquo;t minimize this KL divergence directly because it requires knowing the very thing we are trying to find (the evidence).\nInterlude: Why \u0026ldquo;Reverse\u0026rdquo; KL? You may wonder why we chose $D_{KL}(q||p)$ (Reverse KL) instead of $D_{KL}(p||q)$ (Forward KL). The choice dictates the behavior of our approximation:\nForward KL ($p || q$): This is \u0026ldquo;mean-seeking\u0026rdquo; or \u0026ldquo;inclusive\u0026rdquo;. To minimize it, wherever $p(z) \u0026gt; 0$, we must ensure $q(z) \u0026gt; 0$ (otherwise the ratio $p/q$ explodes). $q$ tries to cover all the probability mass of $p$. If $p$ is multimodal, $q$ (if simpler, e.g., Gaussian) effectively averages them out, potentially putting high probability in low-probability regions between modes. Crucially, computing this requires taking an expectation over $p(z|x)$, which is the intractable posterior we can\u0026rsquo;t sample from! 
This makes Forward KL computationally infeasible for this setup.\nReverse KL ($q || p$): This is \u0026ldquo;mode-seeking\u0026rdquo; or \u0026ldquo;exclusive\u0026rdquo;. The expectation is over $q$. We minimize $\\sum q(z) \\log(q(z)/p(z))$. Wherever $p(z) \\approx 0$, $q(z)$ is forced toward 0 to avoid a large penalty. $q$ tends to latch onto one mode of $p$ and ignore others.\nPro: We can sample from $q$ (since we designed it!). Con: It tends to underestimate the variance of the posterior. We use Reverse KL primarily because we can actually compute expectations over $q$.\nThe Great Decomposition: Deriving the ELBO We need a different approach. We need to derive a relationship between the marginal likelihood, the KL divergence, and something we can compute.\nLet\u0026rsquo;s start from what we want to maximize: the log-likelihood of the data, $\\log p_\\theta(x)$.\nSince $\\log p_\\theta(x)$ does not depend on $z$, we can multiply it by $\\int q_\\phi(z|x) \\, dz$ (which equals 1) and bring it inside the integral:\n\\begin{equation} \\log p_\\theta(x) = \\int q_\\phi(z|x) \\log p_\\theta(x) \\, dz = \\mathbb{E}_{q_\\phi(z|x)} [\\log p_\\theta(x)] \\end{equation}\nNow, we use Bayes\u0026rsquo; rule again: $p_\\theta(x) = \\frac{p_\\theta(x, z)}{p_\\theta(z|x)}$. Substituting this in:\n\\begin{equation} \\log p_\\theta(x) = \\mathbb{E}_{q_\\phi} \\left[ \\log \\frac{p_\\theta(x, z)}{p_\\theta(z|x)} \\right] \\end{equation}\nHere is the \u0026ldquo;magic\u0026rdquo; step. We multiply and divide the fraction by our approximate posterior $q_\\phi(z|x)$.
This introduces our variational parameter $\\phi$ into the equation without changing the value:\n\\begin{equation} \\log p_\\theta(x) = \\mathbb{E}_{q_\\phi} \\left[ \\log \\left( \\frac{p_\\theta(x, z)}{p_\\theta(z|x)} \\cdot \\frac{q_\\phi(z|x)}{q_\\phi(z|x)} \\right) \\right] \\end{equation}\nRegrouping the factors as $\\frac{p_\\theta(x, z)}{q_\\phi(z|x)} \\cdot \\frac{q_\\phi(z|x)}{p_\\theta(z|x)}$ and using the property of logarithms $\\log(ab) = \\log a + \\log b$, we split this into two terms:\n\\begin{equation} \\log p_\\theta(x) = \\mathbb{E}_{q_\\phi} \\left[ \\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)} \\right] + \\mathbb{E}_{q_\\phi} \\left[ \\log \\frac{q_\\phi(z|x)}{p_\\theta(z|x)} \\right] \\end{equation}\nThis leads us to the Evidence Identity:\n\\begin{equation} \\log p_\\theta(x) = \\mathcal{L}(\\theta, \\phi; x) + D_{KL}(q_\\phi(z|x) || p_\\theta(z|x)) \\end{equation}\nWhere $\\mathcal{L}(\\theta, \\phi; x)$ is called the Evidence Lower Bound (ELBO).\nWhy is this useful? Likelihood Bound: We know that $D_{KL} \\ge 0$. Therefore:\n\\begin{equation} \\log p_\\theta(x) \\ge \\mathcal{L}(\\theta, \\phi; x) = \\mathbb{E}_{q_\\phi(z|x)} \\left[ \\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)} \\right] \\end{equation}\nThe term $\\mathcal{L}$ is therefore a lower bound on the log evidence, with equality exactly when $q_\\phi(z|x) = p_\\theta(z|x)$ almost everywhere.\nTractability: Unlike the marginal likelihood or the true posterior, the ELBO contains only terms we can compute!
The expectation is over $q_\\phi$, which we choose (e.g., a Gaussian), so we can easily sample from it via Monte Carlo estimation.\nAlternative Derivation: Jensen\u0026rsquo;s Inequality For those who prefer a quicker path, we can derive the ELBO directly using Jensen\u0026rsquo;s Inequality, which states that for a concave function $f$ (like log), $f(\\mathbb{E}[y]) \\ge \\mathbb{E}[f(y)]$.\n\\begin{equation} \\begin{aligned} \\log p_\\theta(x) \u0026amp;= \\log \\int p_\\theta(x, z) \\, dz \\\\ \u0026amp;= \\log \\int p_\\theta(x, z) \\frac{q_\\phi(z|x)}{q_\\phi(z|x)} \\, dz \\\\ \u0026amp;= \\log \\mathbb{E}_{q_\\phi} \\left[ \\frac{p_\\theta(x, z)}{q_\\phi(z|x)} \\right] \\\\ \u0026amp;\\ge \\mathbb{E}_{q_\\phi} \\left[ \\log \\frac{p_\\theta(x, z)}{q_\\phi(z|x)} \\right] = \\text{ELBO} \\end{aligned} \\end{equation}\nThis elegantly proves that the ELBO is indeed a lower bound.\nThe Optimization Strategy Looking at the Evidence Identity:\n\\begin{equation} \\log p_\\theta(x) = \\underbrace{\\mathcal{L}(\\theta, \\phi; x)}_{\\text{maximize this}} + \\underbrace{D_{KL}(q_\\phi || p_\\theta)}_{\\text{minimize this}} \\end{equation}\nThe left-hand side, $\\log p_\\theta(x)$, is fixed for a given data point $x$ and model parameter $\\theta$. If we maximize the ELBO with respect to $\\phi$ (the variational parameters), we are essentially pushing the lower bound up. Since the total sum is fixed, maximizing the ELBO must minimize the KL divergence.\nThus, we have successfully replaced the impossible problem (minimize KL with unknown target) with a possible one (maximize ELBO).\nAnalyzing the Lower Bound: The Trade-off To understand what we are actually teaching our model to do, let\u0026rsquo;s rewrite the ELBO in a more intuitive form.\nWe start with the definition derived above:\n\\begin{equation} \\mathcal{L} = \\mathbb{E}_{q_\\phi} [\\log p_\\theta(x, z) - \\log q_\\phi(z|x)] \\end{equation}\nRecall that $p_\\theta(x, z) = p_\\theta(x|z)p(z)$.
Expanding the log term:\n\\begin{equation} \\mathcal{L} = \\mathbb{E}_{q_\\phi} [\\log p_\\theta(x|z) + \\log p(z) - \\log q_\\phi(z|x)] \\end{equation}\n(Note: we use $p(z)$ instead of $p_\\theta(z)$ for the prior as it typically has no learnable parameters)\nGrouping the terms involving $z$:\n\\begin{equation} \\mathcal{L} = \\mathbb{E}_{q_\\phi} [\\log p_\\theta(x|z)] - \\mathbb{E}_{q_\\phi} [\\log q_\\phi(z|x) - \\log p(z)] \\end{equation}\nThe second part is exactly the definition of KL divergence between the approximate posterior $q_\\phi(z|x)$ and the prior $p(z)$. This gives us the standard VAE objective form:\n\\begin{equation} \\mathcal{L}(\\theta, \\phi; x) = \\underbrace{\\mathbb{E}_{q_\\phi(z|x)} [\\log p_\\theta(x|z)]}_{\\text{Reconstruction}} - \\underbrace{D_{KL}(q_\\phi(z|x) || p(z))}_{\\text{Regularization}} \\end{equation}\nSo the ELBO forces the model to balance two competing objectives.\nThe Reconstruction Term $\\mathbb{E}_{q}[\\log p_\\theta(x|z)]$\nThis term measures how well the decoder can reconstruct the input $x$ given a sample $z$ from the encoder.\nIt encourages the encoder $q_\\phi$ to pick values of $z$ that are \u0026ldquo;informative\u0026rdquo; about $x$. It encourages the decoder $p_\\theta$ to assign high probability to the true data. If we only optimized this, $q_\\phi(z|x)$ would collapse to a point mass (a delta function) exactly at the $z$ that best reconstructs $x$. This is essentially a standard autoencoder, which is prone to overfitting and learns a disjoint latent space. The Regularization Term $-D_{KL}(q_\\phi(z|x) || p(z))$\nThis term measures the distance between our approximate posterior and the prior $p(z)$ (usually a unit Gaussian $\\mathcal{N}(0, I)$).\nIt acts as a regularizer, forcing the distribution of latent codes to look like the prior. It prevents the encoder from \u0026ldquo;cheating\u0026rdquo; by mapping inputs to disjoint, far-away points in the latent space. 
It ensures the latent space remains smooth and continuous, which is vital for generation. In other words: describe the data $x$ as well as possible (Reconstruction), but don\u0026rsquo;t deviate too far from the prior belief about the world (Regularization).\nGeometric Intuition Think of the log-likelihood as a ceiling and the ELBO as a floor. The gap between them is the KL divergence. When we maximize the ELBO, we\u0026rsquo;re pushing the floor up toward the ceiling.\nMore concretely: when we optimize $\\phi$, the floor (ELBO) rises. Since the ceiling is fixed for a given $\\theta$, raising the floor reduces the gap (minimizes KL). Once the approximation is tight, we can then adjust $\\theta$ to raise the ceiling even higher.\nThis alternating maximization over $q_\\phi$ and $p_\\theta$ corresponds to the E-step and M-step in the classical Expectation-Maximization (EM) algorithm — only here, our expectations are replaced by stochastic gradients.\nSummary We have crossed the bridge from intractable integrals to tractable optimization.\nThe Problem: We cannot calculate the evidence $p_\\theta(x)$ or the posterior $p_\\theta(z|x)$ because of the integral over $z$. The Solution: We introduce a surrogate posterior $q_\\phi(z|x)$ and minimize its KL divergence from the true posterior. The Method: We minimized Reverse KL because it is computationally tractable (expectation over $q$). The Tool: We derived the ELBO (via Bayes\u0026rsquo; Decomposition or Jensen\u0026rsquo;s Inequality), a computable lower bound on the evidence. The Interpretation: Maximizing the ELBO balances reconstruction accuracy (fitting the data) against regularization (staying close to the prior). This final expression — reconstruction minus KL regularization — is exactly the training loss used in Variational Autoencoders.\nWe now have a valid objective function. However, there is one final hurdle. The ELBO involves an expectation $\\mathbb{E}_{q_\\phi(z|x)}[\\dots]$. 
To train this with gradient descent, we need to differentiate through the sampling process of $z$. You cannot simply ask PyTorch or TensorFlow to compute the gradient of a random coin flip.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2025-12-20/","summary":"Variational Inference transforms the impossible task of computing intractable integrals into a solvable optimization problem, providing the mathematical foundation for modern generative models like VAEs.","title":"Variational Inference: Cracking the Intractable Integral"},{"content":"The Hidden World Behind Our Data Think about it: when you draw something, you start with basic shapes, not every tiny detail. The shape, the pose, the color — these hidden factors determine what you end up with. Latent variables in ML work the same way.\nWe observe high-dimensional data like images, but there\u0026rsquo;s a simpler structure underneath. The pixels are just the final result of some hidden generative process.\nThis is basically the idea behind LVMs — what we observe is only a shadow of something unobserved, something simpler and more fundamental.\nMotivation Most modern generative models rely on latent variables — unobserved factors that influence the data. Instead of directly modeling $p(x)$, we assume there exists some latent representation $z$ that captures the underlying structure of the data.\nIn this note, we start from a simple linear model — Probabilistic PCA (PPCA) — and generalize to the family of latent variable models (LVMs). 
This sets the mathematical foundation for the Variational Autoencoder (VAE).\nFormal Definition and the Marginal Likelihood So the essential idea behind LVMs is that what we observe — the data $x$ — is only a shadow or projection of something unobserved, called a latent variable $z$.\nInstead of modeling $p(x)$ directly, we model the joint distribution $p(x, z)$, then derive everything else (including $p(x)$) from it.\nIn plain words, we first sample $z$ (e.g., we imagine the size, the shape, and the color of a horse) and then create an image with all necessary details, i.e., we sample $x$ from the conditional distribution $p(x|z)$. Formally, an LVM introduces the latent variables $z$ and factorizes the joint distribution as:\n\\begin{equation} p_\\theta(x, z) = p_\\theta(x\\mid z)\\,p(z) \\end{equation}\nwhere:\n$p(z)$: the prior over latent variables (usually simple, e.g. $\\mathcal{N}(0, I)$), $p_\\theta(x\\mid z)$: the likelihood (a conditional distribution parameterized by $\\theta$, often a neural decoder). This naturally expresses the generative process described above:\nDraw a latent sample $z \\sim p(z)$. Generate an observable $x \\sim p_\\theta(x\\mid z)$. However, in training, we usually have access only to $x$, not to the hidden $z$. Therefore, following the rules of probability, we should sum out (or marginalize out) the unknown $z$. As a result, the (marginal) likelihood function is the following:\n\\begin{equation} p_\\theta(x) = \\int p_\\theta(x, z) \\, dz = \\int p_\\theta(x\\mid z)\\,p(z) \\, dz \\end{equation}\nThis integral is also known as the evidence of the data. It captures the model\u0026rsquo;s ability to explain the data by integrating over all possible latent causes $z$.\nA probabilistic interpretation: The term $p_\\theta(x\\mid z)p(z)$ can be seen as the joint probability density of seeing both $x$ and $z$ simultaneously.
By integrating over $z$, we effectively sum over every possible hidden cause that might have produced $x$:\n\\begin{equation} p_\\theta(x) = \\mathbb{E}_{p(z)}[p_\\theta(x\\mid z)] \\end{equation}\nBayesian viewpoint: In Bayesian inference, this step — summing or integrating out the unobserved variable — is called marginalization. It\u0026rsquo;s the process that turns the joint model $p(x, z)$ into a model that depends only on observable data $x$:\n\\begin{equation} p(x) = \\int p(x, z)\\,dz \\end{equation}\nThis marginalization embodies the idea that we are uncertain about $z$, and instead of committing to a specific value, we integrate across all possibilities.\nAt first glance, the integral above looks straightforward. But in reality, it\u0026rsquo;s almost always intractable:\nNonlinearity of the decoder. In most deep generative models, $p_\\theta(x\\mid z)$ is represented by a neural network, e.g. $\\mathcal{N}(x; f_\\theta(z), \\sigma^2 I)$, so $p_\\theta(x)$ has no closed form. High-dimensional latent space. $z$ is often 32–256D; naive numerical integration is infeasible. Exponential complexity. A grid-based approximation of the integral needs a number of evaluation points that grows exponentially with the latent dimension. Exception (tractable case): linear-Gaussian models\nGiven that:\n\\begin{equation} p(z) = \\mathcal{N}(0, I), \\quad p(x\\mid z) = \\mathcal{N}(x; Wz + \\mu, \\sigma^2 I), \\end{equation}\nwe can integrate analytically:\n\\begin{equation} p(x) = \\mathcal{N}(x; \\mu, WW^\\top + \\sigma^2 I). \\end{equation}\nBut as soon as $p(x\\mid z)$ becomes nonlinear (e.g., $Wz$ replaced by a neural net), the integral becomes intractable.\nOnce we have $p_\\theta(x)$, the posterior is\n\\begin{equation} p_\\theta(z\\mid x) = \\frac{p_\\theta(x\\mid z)\\,p(z)}{p_\\theta(x)}. \\end{equation}\nHowever, evaluating $p_\\theta(z\\mid x)$ requires $p_\\theta(x)$, which is intractable in the general nonlinear case.
Thus both the marginal likelihood and the posterior are intractable — the central bottleneck in learning LVMs.\nWe have two roads ahead:\nExact inference (analytic) works for special cases like Probabilistic Principal Component Analysis (PPCA). Approximate inference is required in general. We introduce a surrogate posterior $q_\\phi(z\\mid x)$ and will optimize a lower bound on $\\log p_\\theta(x)$ — the ELBO (in the next note). From PCA to Probabilistic PCA Given data $x \\in \\mathbb{R}^D$, PCA finds a linear projection:\n\\begin{equation} z = W^\\top (x - \\mu),\\quad W \\in \\mathbb{R}^{D \\times d},\\; d \u0026lt; D \\end{equation}\nthat maximizes variance or minimizes reconstruction error:\n\\begin{equation} \\min_W \\| x - \\mu - WW^\\top (x - \\mu) \\|^2 \\end{equation}\nPCA, however, is deterministic and non-probabilistic. We can\u0026rsquo;t use it to generate data, nor to reason about uncertainty.\nLet\u0026rsquo;s define a latent variable model where $z \\in \\mathbb{R}^d$ is hidden and Gaussian:\n\\begin{equation} z \\sim \\mathcal{N}(0, I) \\end{equation}\nand data $x$ is generated linearly from $z$ with Gaussian noise:\n\\begin{equation} x = Wz + \\mu + \\epsilon, \\quad \\epsilon \\sim \\mathcal{N}(0, \\sigma^2 I) \\end{equation}\nThus, the conditional distribution of $x$ given $z$ is:\n\\begin{equation} p(x|z) = \\mathcal{N}(x; Wz + \\mu, \\sigma^2 I) \\end{equation}\nSince $z$ is latent, the likelihood of a single observation $x$ is:\n\\begin{equation} p(x) = \\int p(x|z)\\, p(z) \\, dz \\end{equation}\nSubstituting the Gaussian definitions:\n\\begin{equation} p(x) = \\int \\mathcal{N}(x; Wz + \\mu, \\sigma^2 I)\\, \\mathcal{N}(z; 0, I) \\, dz \\end{equation}\nThis integral can be solved analytically (because the product of two Gaussians is Gaussian):\n\\begin{equation} p(x) = \\mathcal{N}(x; \\mu, WW^\\top + \\sigma^2 I) \\end{equation}\nWe\u0026rsquo;ve just shown that PCA is a special case of a latent variable model, where $p(x)$ is a Gaussian with 
covariance structured as $WW^\\top + \\sigma^2 I$.\nInterpretation:\n$W$: defines the subspace directions (principal components). $\\sigma^2$: controls the noise (orthogonal variance). $z$: represents the latent factors of variation. So, intuitively, we can understand that PCA reconstructs and PPCA explains. PPCA says: \u0026ldquo;There exists a latent world $z$ where data is generated from a simple distribution — and what we see is just its noisy projection.\u0026rdquo;\nThis simple probabilistic leap — from reconstruction to explanation — is the seed idea of all modern VAEs.\nGeneralizing to Latent Variable Models PPCA is linear and Gaussian. But what if the data lies on a nonlinear manifold?\nWe generalize by introducing a nonlinear generative process:\n\\begin{equation} z \\sim p(z), \\quad x \\sim p_\\theta(x|z) \\end{equation}\nwhere $p_\\theta(x|z)$ is parameterized by a neural network (the decoder).\nThen the marginal likelihood becomes:\n\\begin{equation} p_\\theta(x) = \\int p_\\theta(x|z)\\, p(z) \\, dz \\end{equation}\nThis defines the Latent Variable Model (LVM).\nThe posterior $p_\\theta(z|x)$ tells us how likely each latent cause $z$ is given an observation $x$:\n\\begin{equation} p_\\theta(z|x) = \\frac{p_\\theta(x|z)\\, p(z)}{p_\\theta(x)} \\end{equation}\nHowever, computing $p_\\theta(x)$ requires integrating over all $z$:\n\\begin{equation} p_\\theta(x) = \\int p_\\theta(x|z)p(z)\\,dz \\end{equation}\nwhich is generally intractable for nonlinear models.\nThis intractability motivates variational inference — the key idea behind VAEs.\nWe want to learn parameters $\\theta$ that maximize the marginal likelihood:\n\\begin{equation} \\max_\\theta \\log p_\\theta(x) \\end{equation}\nBut since $p_\\theta(x)$ is intractable, we\u0026rsquo;ll derive a lower bound (ELBO) that we can compute (that is for the next post, if I ever write it).\nGeometric Intuition Imagine each $z$ corresponds to a coordinate in a smooth, continuous manifold.
The function $p_\\theta(x|z)$ \u0026ldquo;decodes\u0026rdquo; that point into a sample in data space.\nIn PCA, this mapping is linear and the manifold is a flat subspace. In nonlinear LVMs, the manifold can be curved — learned by a neural decoder. Summary Concept Deterministic AE / PCA Latent Variable Model Latent variable Fixed vector Random variable $z \\sim p(z)$ Encoder Linear projection Inference of posterior $p(z\\mid x)$ Decoder Linear reconstruction Conditional distribution $p(x\\mid z)$ Objective Reconstruction error Likelihood (or its lower bound) Learning Deterministic mapping Probabilistic inference Latent Variable Models introduce hidden variables to capture unobserved structure. Probabilistic PCA is the simplest example — tractable, linear, Gaussian. The marginal likelihood $p(x) = \\int p(x\\mid z)p(z)\\,dz$ is key, but often intractable. This motivates variational inference — which we will derive step-by-step in the next post. References Tipping, M. E., \u0026amp; Bishop, C. M. (1999). Probabilistic Principal Component Analysis. Kingma, D. P., \u0026amp; Welling, M. (2014). Auto-Encoding Variational Bayes. Doersch, C. (2016). Tutorial on Variational Autoencoders. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 12. ","permalink":"https://learning-notes-dz2.pages.dev/posts/2025-10-28/","summary":"From PCA to Probabilistic PCA and general Latent Variable Models: the probabilistic lens that seeds VAEs.","title":"Latent Variable Models: A Probabilistic Foundation"},{"content":"Existing generative models can mostly be categorized into Explicit and Implicit models. A useful way to distinguish the two categories is by their objectives.\nIn generative modeling, tractability and flexibility are two conflicting objectives. In particular, tractable models are usually analytically computable, so they are easy to evaluate and fit.
However, they are usually not flexible enough to learn the true data structure. On the other hand, flexible models are supposed to fit arbitrary data structures, but they are expensive to evaluate, train, and sample from.\nExplicit generative models These are the so-called likelihood-based generative models. As the name states, these models usually make strong assumptions to ensure tractability of the likelihood. These assumptions usually take the form of strong restrictions on the model architecture (flow-based models), reliance on surrogate objectives to approximate maximum likelihood training (VAEs), or causal convolutions (autoregressive models).\nPros:\nBy explicitly modeling $P(x)$, they can capture the entire data distribution, which makes them much less prone to mode collapse. Generally more stable to train, with better coverage. Cons:\nLess efficient due to likelihood computation. Sample quality is generally worse compared to implicit models. Implicit generative models Implicit generative models generate samples from a target distribution without the need to approximate the probability distribution of the data, which demonstrates their flexibility. Instead, the probability distribution is implicitly represented by a model of its sampling process. They often require adversarial training, which is notoriously unstable and can lead to mode collapse. The most prominent example is the GAN.\nPros:\nEfficient in evaluating, training and sampling. Often produce sharper, more realistic samples. Cons:\nProne to mode collapse. Adversarial training can be unstable. Cannot compute the likelihood of a sample; they can only sample. Score-based generative models There exist diffusion and score-based generative models, which are both analytically tractable and flexible. These models circumvent several of the above limitations. In these models, we do not need a tractable normalizing constant.
Instead, we can rely on score matching to directly learn the models.\nScore-based models have achieved state-of-the-art performance on many downstream tasks and applications. These include image generation, audio synthesis, and shape generation, among others. For diffusion models, see my previous post.\nStein score function The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function. The score function is the gradient of the log-probability of a distribution w.r.t. the input. This represents the direction of steepest ascent in the log density of $p(x)$, helping the model understand how $x$ relates to the data distribution. $$ \\nabla_x \\log p(x) $$ A model $s_\\theta(x)$, which models the score function explicitly, is a score-based model: $$ s_\\theta(x) \\approx \\nabla_x \\log p(x) $$ Normally, the probability density $p_\\theta(x)$ is intractable due to the involvement of a normalizing constant $Z_\\theta$.\n$$ p_\\theta(x) = \\frac{\\exp(-f_\\theta(x))}{Z_\\theta}, \\quad \\text{where } Z_\\theta = \\int \\exp(-f_\\theta(x)) \\, dx $$\nHere the function $f_\\theta(x)$ is often called an unnormalized probabilistic model, or energy-based model. We can train $p_\\theta(x)$ by maximizing the log-likelihood of the data: $$ \\max_\\theta \\sum_{i=1}^N \\log p_\\theta(x_i) $$ but this objective requires $p_\\theta(x)$ to be a normalized pdf. This is infeasible because computing $p_\\theta(x)$ relies on evaluating $Z_\\theta$, which is typically intractable for a general $f_\\theta(x)$. This is the reason why explicit models have to constrain their architectures: to make $Z_\\theta$ tractable.\nHowever, in score-based models, we have: $$ \\nabla_x \\log p_\\theta(x) = \\nabla_x \\log \\exp(-f_\\theta(x)) - \\nabla_x \\log Z_\\theta, $$ where $\\nabla_x \\log Z_\\theta = 0$ because $Z_\\theta$ does not depend on $x$.
Thus, we do not need a tractable normalizing constant, i.e., we do not need any special architectures to make the normalizing constant tractable. This also gets rid of expensive computation, focusing on learning $\\nabla_x \\log \\exp(-f_\\theta(x)) = -\\nabla_x f_\\theta(x)$, the gradient of the unnormalized log density.\nThe goal is to train the model to approximate the true score function $\\nabla_x \\log p(x)$. This can be done by minimizing the Fisher divergence, analogous to how likelihood-based models minimize a KL divergence, which measures the difference between the true score and the model\u0026rsquo;s predicted score:\n$$ \\frac{1}{2} \\mathbb{E}_{p(x)} \\left[ \\lVert \\nabla_x \\log p(x) - s_\\theta(x) \\rVert^2_2 \\right] $$\nwhere $\\nabla_x \\log p(x)$ is the \u0026ldquo;ground truth\u0026rdquo; score of the data distribution (unknown in practice), and $s_\\theta(x)$ is the model\u0026rsquo;s predicted score. But since we do not know the \u0026ldquo;ground truth\u0026rdquo;, how do we train and backprop? What do we optimize? To address this, we use a proxy that allows the model to learn the score indirectly.\nScore matching Directly computing $\\nabla_x \\log p(x)$ is infeasible since $p(x)$ is often unknown. However, it can be shown (up to some regularity conditions) that the above loss is equivalent, up to an additive constant that does not depend on $\\theta$, to minimizing:\n$$ \\mathbb{E}_{p(x)} \\left[ \\text{tr}(\\nabla_x s_\\theta(x)) + \\frac{1}{2} \\lVert s_\\theta(x) \\rVert^2_2 \\right], $$\nwhere $\\text{tr}(\\nabla_x s_\\theta(x))$ is the trace of the Jacobian of $s_\\theta(x)$, which represents the divergence of the score function. This removes the need to compute $\\nabla_x \\log p(x)$, but the trace of the Jacobian is still too expensive for large networks and approximations are needed.\nDenoising score matching This is a widely used approach to train score-based models when the true score function $\\nabla_x \\log p(x)$ is unknown.
Denoising score matching (DSM) works well for small level of noise:\n$$ \\frac{1}{2} \\mathbb{E}_{q_\\sigma(\\tilde{x} \\mid x)p_{data}(x)} \\left[ \\lVert s_\\theta(\\tilde{x}) - \\nabla_{\\tilde{x}} \\log q_\\sigma(\\tilde{x} \\mid x) \\rVert_2^2 \\right], $$\nwhere:\n$q_\\sigma(\\tilde{x} \\mid x)$: Noise distribution, describing how $x$ is corrupted to $\\tilde{x}$. $\\nabla_{\\tilde{x}} \\log q_\\sigma(\\tilde{x} \\mid x)$: The \u0026ldquo;ground truth\u0026rdquo; score of the noisy data, which can be computed analytically. $s_\\theta(\\tilde{x})$: The model\u0026rsquo;s predicted score for the noisy data. DSM reformulates this objective by corrupting the data $x$ into a noisy version $\\tilde{x}$, making the problem tractable. Steps of DSM:\nCorrupt the Data: Sample a clean data point $x \\sim p_{data}(x)$. Add noise from a pre-specified distribution $q_\\sigma(\\tilde{x} \\mid x)$ (e.g., Gaussian noise): $\\tilde{x} = x + \\epsilon, \\quad \\epsilon \\sim \\mathcal{N}(0, \\sigma^2 I)$ Train the Model to Predict the Noisy Score: The true score for the noisy data $\\nabla_{\\tilde{x}} \\log q_\\sigma(\\tilde{x} \\mid x)$ is given by: $\\nabla_{\\tilde{x}} \\log q_\\sigma(\\tilde{x} \\mid x) = -\\frac{\\tilde{x} - x}{\\sigma^2}$ This provides a tractable target for training the model. 
Loss Function: Minimize the squared error between the model\u0026rsquo;s predicted score $s_\\theta(\\tilde{x})$ and the true noisy score: $\\frac{1}{2} \\mathbb{E}_{q_\\sigma(\\tilde{x} \\mid x)p_{data}(x)} \\left[ \\lVert s_\\theta(\\tilde{x}) + \\frac{\\tilde{x} - x}{\\sigma^2} \\rVert_2^2 \\right]$ DSM allows the model to approximate the score of the clean data $\\nabla_x \\log p(x)$ indirectly by regressing onto the score of the noisy data $\\nabla_{\\tilde{x}} \\log q_\\sigma(\\tilde{x} \\mid x)$. The key insight is that the minimizer of this loss is the score of the noise-perturbed data distribution $q_\\sigma(\\tilde{x}) = \\int q_\\sigma(\\tilde{x} \\mid x) p_{data}(x) \\, dx$, which approaches the clean score as the noise level shrinks:\n$$ s_{\\theta^*}(\\tilde{x}) = \\nabla_{\\tilde{x}} \\log q_\\sigma(\\tilde{x}) \\approx \\nabla_x \\log p(x) \\quad \\text{for small } \\sigma $$\nAdvantages of score matching We can train with score matching directly with SGD, just as when maximizing log-likelihood, without requiring adversarial optimization. We have no constraints on the form of $s_\\theta(x)$, as we do not require it to be the score function of a normalized distribution. Additionally, using the score matching objective gives us a considerable amount of modeling flexibility. The Fisher divergence itself does not require $s_\\theta(x)$ to be an actual score function of any normalized distribution; it simply compares the $\\ell_2$ distance between the ground-truth data score and the score-based model, with no additional assumptions on the form of $s_\\theta(x)$. In fact, the only requirement on the score-based model is that it should be a vector-valued function with the same input and output dimensionality, which is easy to satisfy in practice.\nAs a brief summary, we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.\nLangevin dynamics After training the score-based model, we can sample with Langevin dynamics. 
Langevin dynamics is a Markov Chain Monte Carlo (MCMC) method that allows sampling from a target distribution $p(x)$, given the gradient of its log-probability (the score function, $\\nabla_x \\log p(x)$). It is particularly suited for score-based models because these models learn to approximate $\\nabla_x \\log p(x)$.\nThe sampling process involves iteratively updating the sample $x_t$ using the following update rule:\n$$ x_{t+1} = x_t + \\epsilon \\cdot \\nabla_x \\log p(x_t) + \\sqrt{2\\epsilon} \\cdot z_t, \\quad t=0, 1, \\dots, K $$\nwhere:\n$\\nabla_x \\log p(x_t)$: The score function, approximated by the trained model $s_\\theta(x)$. $\\epsilon$: Step size controlling the magnitude of the updates. The gradient $\\epsilon \\cdot \\nabla_x \\log p(x_t)$ ensures the sample moves toward high-probability regions. $z_t \\sim \\mathcal{N}(0, I)$: Gaussian noise added to each step to ensure proper exploration of the space and convergence to $p(x)$. The noise $\\sqrt{2\\epsilon} \\cdot z_t$ prevents the chain from getting stuck in local modes, ensuring convergence to the correct distribution. $x_0$: Initial sample, drawn from an arbitrary prior distribution $\\pi(x)$ (commonly a Gaussian). As $\\epsilon \\to 0$ (small step size) and $K \\to \\infty$ (sufficient iterations), the sequence $\\{x_t\\}$ converges to a sample from the target distribution $p(x)$ under certain conditions. In practice, the error is negligible when $\\epsilon$ is sufficiently small and $K$ is sufficiently large.\nAdvantages:\nEfficiency: Langevin dynamics allows sampling without explicitly modeling $p(x)$, making it highly efficient for score-based generative models. Flexibility: This method works well for any distribution where the score function $\\nabla_x \\log p(x)$ can be approximated. No Normalization Required: Langevin dynamics works directly with the gradient of the unnormalized log-density, bypassing the need for a normalizing constant. 
Scalable: It can be applied to high-dimensional data by leveraging efficient computation of gradients. Guaranteed Convergence: Theoretically, Langevin dynamics converges to the target distribution under appropriate conditions. ","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-12-24/","summary":"A summary of explicit, implicit and score-based generative models.","title":"An overview on generative models paradigms"},{"content":"Entropy Consider the example of the event where a UFO appears in the sky. We have never seen anything like it, and we would be very surprised if such an event occurred one day because it is not expected based on our current knowledge. Therefore, the amount of information can be thought of as the level of surprise when learning the value of $x$. The measure of the amount of information depends on the probability distribution $p(x)$ and is specifically given by the logarithm of $p(x)$ as follows:\n$$ h(x) = -\\log_2 p(x) $$\nThe negative sign appears because every probability satisfies $p(x) \\in [0,1]$, so $\\log_2 p(x) \\leq 0$ (with equality only when $p(x)=1$). Since the amount of information should be a non-negative magnitude, we put a negative sign in front. When the base of the logarithm is 2, the unit of $h(x)$ is the bit.\nThe average information is calculated by taking the expectation over the amount of information, given by:\n$$ \\mathbb{H}[x] = \\sum_{x} p(x)h(x) = -\\sum_{x} p(x)\\log_2 p(x) $$\nThis is called the entropy of the random variable $X$, representing the average amount of information (or uncertainty) associated with a random variable. It measures the level of uncertainty of a probability distribution. For example, suppose we observe a sequence $X_n \\sim p$ generated from the distribution $p$. 
If the entropy of $X_n$ is high, predicting each character $X_n$ will be difficult; conversely, if the entropy is 0, then every $X_n$ has the same value, and the dataset $\\mathbf{D} = (X_1, X_2, \u0026hellip;, X_n)$ is considered to contain little information. Therefore, if the entropy is high, it means that the dataset $\\mathbf{D}$ contains more information, as the value of each $X_n$ differs, leading to data diversity in $\\mathbf{D}$.\nIt is important to note that the discrete distribution with the largest entropy is the uniform distribution. This, semantically, makes perfect sense: a uniform distribution on the range $[a, b]$ has the constant density $\\frac{1}{b-a}$, meaning every value of $X$ is equally likely. Therefore, it is very difficult to predict the value of each $X$. Hence, for a random variable $x$ that can take $K$ values, the entropy of the distribution is maximized when every value is equally likely, i.e., $p(x=k) = \\frac{1}{K}$ for all $k$. In this case, $\\mathbb{H}(X) = \\log_2 K$.\nOn the other hand, distributions with small entropy are those where the probability mass function concentrates on one or a few states. A distribution with minimal entropy (i.e., equal to 0) is any delta function that places all the probability on a single point.\nConsider the example: a random variable $X$ with $p(X = a) = 1$, $p(X \\neq a) = 0$. The distribution of $X$ can be described through the delta function as follows:\n$$ P(X=x) = \\delta(x-a) $$\nThis means that for any value other than $a$ that $X$ could take, the probability is 0 (note: $\\delta(y) = 0$ when $y \\neq 0$, this is the property of the delta function). Since $X$ can only take the value $a$, the entropy of the distribution of $X$ is:\n$$ \\mathbb{H}(X) = - P(X=a) \\log_2(P(X=a)) $$\nSince $P(X=a)=1$, and $\\log_2(1) = 0$, we have $\\mathbb{H}(X)=0$. 
When the entropy is minimal, the distribution has no uncertainty because we are certain that only one value exists for the random variables in that distribution.\nEntropy is also similarly defined for continuous random variables and is called differential entropy:\n$$ \\mathbb{H}[X] = -\\int p(X) \\log_2 p(X) \\, dX $$\nCross-entropy Cross-entropy measures the loss when we use a distribution $q$ to represent the distribution $p$. For a fixed $p$, a small cross-entropy means that the distributions $p$ and $q$ are closer to each other, more similar, compared to when the cross-entropy is large:\n$$ \\mathbb{H}_{ce}(p, q) \\triangleq -\\sum_{k} p_k \\log_2 q_k $$\nJoint entropy The joint entropy of two random variables $X$ and $Y$ is given by:\n$$ \\mathbb{H}(X, Y) = - \\sum_{x,y} p(x,y) \\log_2 p(x,y) $$\nIn the case of continuous variables (differential entropy):\n$$ \\mathbb{H}(X, Y) = - \\int \\int p(x,y) \\log_2 p(x,y) \\, dx \\, dy $$\nIt is important to note that when $X$ and $Y$ are independent, the joint entropy $\\mathbb{H}(X, Y) = \\mathbb{H}(X) + \\mathbb{H}(Y)$. This is the upper bound of the joint entropy. Intuitively, this makes sense: when $X$ and $Y$ are correlated, they reduce the \u0026ldquo;degree of freedom\u0026rdquo; of the system, so the entropy is smaller than when $X$ and $Y$ are independent.\nWhat about the lower bound of joint entropy? The lower bound appears when one of the variables is completely dependent on the other. This is the opposite case to the upper bound, where $X$ and $Y$ are completely independent.\nIt is easy to see that for $Y$ to be completely dependent on $X$, $Y$ must be a deterministic function of $X$, for example $y = 2x$. In this case, $\\mathbb{H}(X, Y) = \\mathbb{H}(X)$. Why? Because to describe both $X$ and $Y$, we can describe $Y$ from $X$, so we only need to know the uncertainty of $X$ to describe the joint entropy of $X$ and $Y$. 
Thus, the lower bound of joint entropy is:\n$$ \\mathbb{H}(X, Y) \\geq \\max{(\\mathbb{H}(X), \\mathbb{H}(Y))} \\geq 0 $$\nThis illustrates a general property of entropy: when we add more random variables to the system, the entropy of the system cannot decrease; it can only stay the same or increase. The reason is that, from the lower bound, we can see that the uncertainty of the overall system cannot be smaller than the uncertainty of any individual variable. Adding more \u0026ldquo;unknowns\u0026rdquo; does not make the system clearer, and to reduce the uncertainty of the system, we need to increase the amount of observed data.\nConditional entropy In the case of the joint distribution $p(X,Y)$, the additional average information required to determine $Y$ when the value of $X$ is known is called conditional entropy, and is given by:\n$$ \\mathbb{H}[Y|X] = - \\int \\int p(X,Y) \\log_2 p(Y|X) \\, dX \\, dY $$\nAdditionally, it can be easily seen that by using the product rule, conditional entropy satisfies the following relation:\n$$ \\begin{align*} \\mathbb{H}[X,Y] \u0026amp;= - \\int \\int p(X,Y) \\log_2 p(X,Y) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\log_2 \\big(p(Y|X) p(X)\\big) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\big( \\log_2 p(Y|X) + \\log_2 p(X) \\big) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\log_2 p(Y|X) \\, dX \\, dY - \\int \\int p(X,Y) \\log_2 p(X) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\log_2 p(Y|X) \\, dX \\, dY - \\int \\int p(X,Y) \\, dY \\log_2 p(X) \\, dX \\\\ \u0026amp;= - \\int \\int p(X,Y) \\log_2 p(Y|X) \\, dX \\, dY - \\int p(X) \\log_2 p(X) \\, dX \\\\ \u0026amp;= \\mathbb{H}[Y|X] + \\mathbb{H}[X] \\end{align*} $$\nThis equation shows that the amount of information needed to describe both $X$ and $Y$ is given by the amount of information needed to describe $X$ alone, plus the additional information required to determine $Y$ given $X$.\nIf $Y$ is a deterministic function of $X$, knowing $X$ fully determines $Y$, so $\\mathbb{H}(Y | 
X) = 0$ (there is no uncertainty about the output of $Y$). But if the two variables are independent, knowing $X$ does not provide any information to determine $Y$, so $\\mathbb{H}(Y | X) = \\mathbb{H}(Y)$.\nSince $\\mathbb{H}(X, Y) \\leq \\mathbb{H}(X) + \\mathbb{H}(Y)$, and $\\mathbb{H}(X, Y) = \\mathbb{H}(Y | X) + \\mathbb{H}(X)$, we have:\n$$ \\mathbb{H}(Y | X) \\leq \\mathbb{H}(Y) $$\nEquality holds if and only if $X$ and $Y$ are independent. This means that, on average, knowing information about $X$ will reduce the uncertainty of $Y$, or at least the uncertainty of $Y$ will not increase. This does not hold for every individual observation: a particular value $X = x$ can increase the uncertainty of $Y$, i.e., $\\mathbb{H}(Y | X = x) \u0026gt; \\mathbb{H}(Y)$. However, averaged over all values of $X$, having more information never increases uncertainty.\nRelative entropy and mutual information Consider a distribution $p(X)$ that is unknown, and we approximate it with a distribution $q(X)$. The average additional information required to determine the value of $X$ when using $q(X)$ instead of $p(X)$ is given by:\n$$ \\begin{align*} \\text{KL}(p||q) \u0026amp;= - \\int p(X) \\log_2 q(X) \\, dX - \\left( - \\int p(X) \\log_2 p(X) \\, dX \\right) \\\\ \u0026amp;= - \\int p(X) \\left( \\log_2 q(X) - \\log_2 p(X) \\right) \\, dX \\\\ \u0026amp;= - \\int p(X) \\log_2 \\left( \\frac{q(X)}{p(X)} \\right) \\, dX. \\end{align*} $$\nThis is called the relative entropy or Kullback-Leibler (KL) divergence between the distributions $p(X)$ and $q(X)$. Note that this is not a symmetric quantity, i.e., in general $\\text{KL}(p||q) \\neq \\text{KL}(q||p)$. Also, the KL divergence satisfies $\\text{KL}(p||q) \\geq 0$, with equality occurring if and only if $p(X) = q(X)$. Therefore, we can interpret KL divergence as a measure of the difference between two distributions.\nLet\u0026rsquo;s further analyze KL divergence. 
Starting from the first line, we can rewrite it as:\n$$ \\begin{align*} \\text{KL}(p||q) \u0026amp;= - \\int p(X) \\log_2 q(X) \\, dX - \\left( - \\int p(X) \\log_2 p(X) \\, dX \\right) \\\\ \u0026amp;= \\mathbb{H}_{ce}(p, q) - \\mathbb{H}(p) \\end{align*} $$\nWe recognize that the first term is the cross-entropy (CE) between $p$ and $q$, and the second term is the entropy of $p$, which is the actual distribution we are trying to represent. The CE of $p$ and $q$ is the lower bound of the number of bits needed to compress data generated from the distribution $p$ if we are representing $p$ using $q$. Thus, we can interpret the KL divergence between $p$ and $q$ as the extra bits you need to use if you want to represent data from the true distribution $p$ with the distribution $q$ instead of using $p$.\nSince optimal data compression is achieved when we know the true distribution, if we do not, we need to account for the additional information. Hence, there is an important relationship between data compression and density estimation.\nSuppose we are trying to approximate the unknown distribution using a parametric distribution $q(X|\\mathbf{\\theta})$, controlled by a set of adjustable parameters $\\mathbf{\\theta}$, such as a multivariate Gaussian distribution. One way to determine $\\mathbf{\\theta}$ is to minimize the KL divergence between $p(X)$ and $q(X|\\mathbf{\\theta})$. This cannot be done directly since we do not know $p(X)$. 
However, if we have observed a finite set of training points $X_n$ drawn from $p(X)$, then the expectation with respect to $p(X)$ can be approximated by a finite sum over these points:\n$$ \\text{KL}(p||q) \\approx \\frac{1}{N} \\sum_{n=1}^{N} \\left( - \\log_2 q(X_n|\\mathbf{\\theta}) + \\log_2 p(X_n) \\right) $$\nSince the second term does not depend on $\\mathbf{\\theta}$, approximately minimizing the KL divergence is equivalent to minimizing the negative log-likelihood (that is, maximizing the log-likelihood) of $q(X_n|\\mathbf{\\theta})$ evaluated on the training set.\nMutual Information Consider the joint distribution $p(X,Y)$ between two sets of variables $X$ and $Y$. If these sets are independent, then $p(X,Y) = p(X)p(Y)$. If the variables are not independent, we can check how \u0026ldquo;close to independence\u0026rdquo; they are by considering the Kullback-Leibler divergence between the joint distribution and the product of their marginal distributions, given by:\n$$ \\text{I}[X,Y] = \\text{KL}(p(X,Y) || p(X)p(Y)) = -\\int \\int p(X,Y) \\log_2 \\left( \\frac{p(X)p(Y)}{p(X,Y)} \\right) \\, dX \\, dY $$\nThis is called the mutual information between the variables $X$ and $Y$. Mutual information satisfies $\\text{I}[X,Y] \\geq 0$, with equality occurring if and only if $X$ and $Y$ are independent. 
Furthermore, by using the sum and product rules of probability, we can show that mutual information is related to conditional entropy:\n$$ \\begin{align*} \\text{I}[X,Y] \u0026amp;= - \\int \\int p(X,Y) \\log_2 \\left( \\frac{p(X)p(Y)}{p(X,Y)} \\right) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\log_2 \\left( \\frac{p(X)p(Y)}{p(X|Y)p(Y)} \\right) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\log_2 \\left( \\frac{p(X)}{p(X|Y)} \\right) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\left( \\log_2 p(X) - \\log_2 p(X|Y) \\right) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\log_2 p(X) \\, dX \\, dY + \\int \\int p(X,Y) \\log_2 p(X|Y) \\, dX \\, dY \\\\ \u0026amp;= - \\int \\int p(X,Y) \\, dY \\log_2 p(X) \\, dX + \\int \\int p(X,Y) \\log_2 p(X|Y) \\, dX \\, dY \\\\ \u0026amp;= - \\int p(X) \\log_2 p(X) \\, dX + \\int \\int p(X,Y) \\log_2 p(X|Y) \\, dX \\, dY \\\\ \u0026amp;= \\mathbb{H}[X] - \\mathbb{H}[X|Y] = \\mathbb{H}[Y] - \\mathbb{H}[Y|X] \\end{align*} $$\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-10-05/","summary":"Information theory essentials: entropy, cross-entropy, joint/conditional entropy, KL divergence, mutual information.","title":"Information Theory"},{"content":"The curse of dimensionality Consider a dataset created by aggregating measurements taken from a pipeline containing a mixture of oil, water, and gas (Bishop \u0026amp; James, 1993). These three materials can exist in one of three different geometric configurations called \u0026lsquo;homogeneous\u0026rsquo;, \u0026lsquo;annular\u0026rsquo;, and \u0026lsquo;laminar\u0026rsquo;, and the ratio of these materials can also change. Each data point consists of a 12-dimensional input vector. $Fig. 1$ shows 100 points from this dataset on a graph illustrating two measurements, $x_6$ and $x_7$ (the other 10 input values are omitted for clarity). 
Each data point is labeled according to the geometric class it belongs to, and our goal is to use this dataset as a training set to classify a new observation $(x_6, x_7)$, such as the point marked with a cross in $Fig. 1$.\nFig.1\nWe observe that the cross is surrounded by many red points, so we might assume it belongs to the red class. However, there are also many green points nearby, so it could be assigned to the green class, and it is unlikely to belong to the blue class. Intuitively, we see that the class of the cross should be determined primarily by the data points closer to it rather than those farther away.\nA simple approach to implement this intuition into an algorithm is to divide the input space into uniform square cells, as shown in $Fig. 2$. To determine the class of a point, we decide which cell it belongs to, and then find all training points that fall into the same cell. The class of the point to be predicted is the class that contains the most training points within that cell.\nFig.2\nThere are several issues with this simple approach, and one of the most serious becomes apparent when we consider extending it to problems with a larger number of input variables, corresponding to a higher-dimensional input space. The origin of the problem is illustrated in $Fig. 3$, showing that, if we divide a region of space into uniform square cells, the number of these cells increases exponentially with the dimensionality of the space.\nFig.3\nThe problem with an exponentially large number of cells is that we would need an exponentially large amount of training data to ensure that these cells are not empty. 
Clearly, we have little hope of applying this technique in higher-dimensional spaces, and thus we need to find an alternative approach.\nReturning to the example of polynomial curve fitting (see previous post) and considering how we extend this approach to handle input spaces with many variables, if we have $D$ input variables, a general polynomial with coefficients up to order 3 will have the form:\n$$y(\\mathbf{x}, \\mathbf{w}) = w_0 + \\sum_{i=1}^{D} w_i x_i + \\sum_{i=1}^{D} \\sum_{j=1}^{D} w_{ij} x_i x_j + \\sum_{i=1}^{D} \\sum_{j=1}^{D} \\sum_{k=1}^{D} w_{ijk} x_i x_j x_k.$$\nAs $D$ increases, the number of independent coefficients (not all coefficients are independent due to symmetry between the variables $\\mathbf{x}$) grows at a rate of $D^3$. In practice, to capture the complex dependencies in the data, we may need to use a higher-degree polynomial. For a polynomial of degree $M$, the number of coefficients increases as $D^M$. Although this is power-law growth in $D$ rather than exponential growth, it still indicates that this method quickly becomes impractical.\nGeometric intuitions formed in three-dimensional space can be misleading when considering higher-dimensional spaces. As a simple example, consider a sphere with radius $r = 1$ in $D$-dimensional space, and ask what fraction of the volume of the sphere lies between radii $r = 1-\\epsilon$ and $r = 1$. We can evaluate this fraction by noting that the volume of a sphere with radius $r$ in $D$-dimensional space should scale with $r^D$, so we have: $$V_D(r) = K_D r^D$$ where the constant $K_D$ depends only on $D$. Thus, the fraction is given by: $$\\frac{V_D(1) - V_D(1-\\epsilon)}{V_D(1)} = 1 - (1-\\epsilon)^D$$\nThe graph of this function for different values of $D$ is shown in $Fig. 4$. We see that for large $D$, this fraction approaches 1 even for small values of $\\epsilon$. Therefore, in high-dimensional space, most of the volume of the sphere is concentrated in a thin shell near the surface. 
This contrasts with our intuition in the usual three-dimensional (3D) space, where a thin outer shell holds only a small fraction of the sphere\u0026rsquo;s volume.\nFig.4\nAnother example, directly related to pattern recognition, is the behavior of a Gaussian distribution in multidimensional space. If we transform from Cartesian coordinates to polar coordinates, and then integrate over the angular variables, we obtain an expression for the density $\\rho(r)$ as a function of the radial distance $r$ from the origin. Thus, $\\rho(r) \\delta r$ represents the probability mass within a thin shell of thickness $\\delta r$ at radius $r$. This distribution is plotted for different values of $D$ in $Fig. 5$, and we see that for large $D$, the probability mass of the Gaussian concentrates in a thin shell at a specific radius, rather than sitting close to the center as we might expect in low-dimensional space. Specifically, in low-dimensional space (e.g., $D = 1$ or $D = 2$), the Gaussian distribution typically has its probability mass concentrated close to the center (here $r=0$). But as the dimensionality $D$ increases, the Gaussian\u0026rsquo;s probability mass tends to \u0026ldquo;move away\u0026rdquo; from the center and concentrate in a region of larger radius, resulting in the probability being primarily distributed at a specific radius value.\nFig.5\nThese difficulties are known as the Curse of Dimensionality (CoD), a phenomenon that arises when the input space has a high number of dimensions, leading to several challenges in data analysis and processing. To summarize, the issues related to CoD include:\nAs the number of dimensions ($D$) grows, much larger amounts of training data are needed to ensure the space is sufficiently covered, increasing complexity and computational costs. The number of coefficients required in mathematical models, such as polynomials, increases rapidly as $D$ grows, making these models more complex and difficult to apply. 
As $D$ increases, the volume or probability mass concentrates more at the boundaries and no longer distributes evenly as in low-dimensional spaces, leading to difficulties (in machine learning or statistics) in handling the data, reducing the effectiveness of many distance- or density-based methods. As $D$ increases, the distance between data points increases, but the discriminability between data points decreases. When the distances between points become large and relatively uniform, traditional machine learning or statistical models (e.g., regression, SVM, KNN) may not perform effectively because they rely on clear distinctions between data points to learn or classify, but when the distances become relatively uniform, models struggle to find the correct classification boundaries. As $D$ increases, the space becomes \u0026ldquo;sparser.\u0026rdquo; That is, if the number of data points remains constant, the point density in the space drops significantly. This makes it harder to build data-driven models as the model has fewer data points to analyze and predict, reducing its ability to differentiate between different data groups or classes. Decision Theory 1. Introduction Probability theory provides a consistent mathematical framework for quantifying and handling uncertainty. In this section, we will discuss decision theory, which, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty, such as in pattern recognition.\nSuppose we have a medical diagnosis problem where we have taken an X-ray of a patient and want to determine whether the patient has cancer or not. In this case, the input vector $\\mathbf{x}$ is the set of pixel intensities in the image, and the target variable $t$ represents the presence of cancer (denoted as class $C_1$) or no cancer (denoted as class $C_2$).\nWe can choose $t$ as a binary variable such that $t = 0$ corresponds to class $C_1$ and $t = 1$ corresponds to class $C_2$. 
The general inference problem involves determining the joint distribution $p(\\mathbf{x}, C_k)$, which is equivalent to $p(\\mathbf{x}, t)$, because it summarizes all the uncertainty related to the two variables $\\mathbf{x}$ and $t$. This is because the joint probability distribution $p(\\mathbf{x}, t)$ describes the probability of all possible pairs of values for $\\mathbf{x}$ and $t$. This includes all the information about how the input variables $\\mathbf{x}$ and the target variable $t$ interact and depend on each other. Additionally, from $p(\\mathbf{x}, t)$, we can also find the marginal distributions $p(\\mathbf{x})$ and $p(t)$, along with the conditional distribution $p(t|\\mathbf{x})$.\nWhile this joint distribution is a very useful quantity, in practice, we must make a specific prediction about the value of $t$, or more generally, perform a specific action based on our understanding of the possible values that $t$ could take. This is the subject of decision theory.\n2. Minimizing the misclassification rate Suppose our goal is to minimize the number of misclassifications as much as possible. We need a decision rule that divides the input space into decision regions $R_k$ for each class $C_k$, so that every point in $R_k$ is assigned to class $C_k$. A mistake occurs when an input belonging to class $C_2$ falls in region $R_1$ (and is therefore assigned to $C_1$) or vice versa, and its probability is given by:\n$$ p(\\text{mistake}) = \\int_{R_1} p(x, C_2) \\, dx + \\int_{R_2} p(x, C_1) \\, dx $$\nTo minimize this probability, we choose the decision rule so that each $x$ is assigned to the class with the largest posterior probability $p(C_k|x)$. This is illustrated in $Fig. 6$, where the regions $R_1$ and $R_2$ are determined by the decision boundary $\\hat{x}$. As the position of this boundary changes, the area of the red region (representing the error) will change. 
The optimal position of $\\hat{x}$ is the point where the two curves $p(x, C_1)$ and $p(x, C_2)$ intersect.\nFig.6\n3. Minimizing the expected loss In many applications, the objective is more complex than simply minimizing the number of misclassifications. For example, in medical diagnosis, if a healthy person is classified as having a disease, the situation is certainly not as serious as if a diseased person is misclassified as healthy, since the latter could be life-threatening. Clearly, we need to reduce the second type of misclassification, even if it means that the first type of misclassification may increase.\nTo handle these issues, we use a loss function $L_{kj}$, which measures the degree of loss when assigning $x$ to class $C_j$ when the true class is $C_k$. For example, in the cancer diagnosis problem, the loss matrix may look as follows:\n$$ \\begin{pmatrix} \\text{gt/pred} \u0026amp; \\text{cancer} \u0026amp; \\text{normal} \\\\ \\text{cancer} \u0026amp; 0 \u0026amp; 1000 \\\\ \\text{normal} \u0026amp; 1 \u0026amp; 0 \\end{pmatrix} $$\nThis matrix shows that no loss occurs when the classes are correctly classified, and the loss value is $1000$ if a diseased person is classified as healthy. The loss value for misclassifying a healthy person as diseased is $1$. It is reasonable to impose a heavy penalty for misclassifying a diseased person as healthy, which we consider extremely dangerous and should be minimized at all costs.\nThe goal now is to minimize the expected loss:\n$$ \\mathbb{E}[L] = \\sum_k \\sum_j \\int_{\\mathcal{R}_j} L_{kj} p(x, C_k) \\, dx \\tag{1} $$\nEach $x$ will be independently assigned to one of the decision regions $\\mathcal{R}_j$. The optimal decision rule will be one that selects the regions $\\mathcal{R}_j$ to minimize $(1)$, which means that for each $x$, we need to choose the class $j$ that minimizes $\\sum_k L_{kj} p(x, C_k)$. Additionally, since $p(x, C_k) = p(C_k|x)p(x)$, and $p(x)$ is a common factor for all $j$, it can be factored out. 
Therefore, for each $x$ we assign the class $j$ that minimizes the expected loss under the posterior:\n$$ \\sum_k L_{kj} p(C_k|x) \\tag{2} $$\n4. The reject option In some applications, we may want to avoid making decisions in cases where the input is unclear. This can be achieved by introducing a threshold $\\theta$ and rejecting input $x$ when the largest posterior probability $p(C_k|x)$ is less than or equal to $\\theta$. This helps reduce the error rate in decision-making when uncertainty is high. $Fig. 7$ illustrates how a threshold $\\theta$ can be used to define rejection regions.\nFig.7\n5. Inference and decision The process of solving a classification problem can be divided into two steps: the inference step, where we determine the model $p(x, t)$, and the decision step, where we use posterior probabilities to make an optimal decision.\nThere are three main approaches to solving decision problems:\n$(a)$ Solve the inference problem by determining the class-conditional density $p(x|C_k)$ and the prior $p(C_k)$ for each class $C_k$, then use Bayes\u0026rsquo; theorem to find the posterior probability $p(C_k|x)$: $$ p(C_k|x) = \\frac{p(x|C_k) p(C_k)}{p(x)} \\tag{3} $$\n$(b)$ Directly determine the posterior probability $p(C_k|x)$, then use decision theory to assign $x$ to one of the classes.\n$(c)$ Find a function $f(x)$, called a discriminant function, to directly map each $x$ to a class label.\nApproach $(a)$ is the most complex because it requires determining the joint distribution over both $x$ and $C_k$. However, this approach also allows us to determine the marginal density $p(x)$, which can be useful for detecting new data points with low probability under the model. But if we are only concerned with classification, finding the joint distribution is costly and redundant. Instead, approach $(b)$ is sufficient for classification tasks.\nApproach $(c)$ combines both inference and decision in a single step, where we use the function $f(x)$ to directly map input $x$ to class $C_k$. 
In the example from $Fig. 8$, this corresponds to finding the value of $x$ represented by the green line, as this is the decision boundary for the lowest misclassification rate.\nFig.8\nHowever, with this approach, we do not know the posterior probabilities $p(C_k|x)$. There are several reasons why we might need to compute posterior probabilities, even if we later use them to make decisions. These reasons include:\nMinimizing risk: Consider a problem where the elements of the loss matrix change over time. If we know the posterior probabilities, we can easily adjust the decision rule to minimize risk by modifying equation $(2)$ accordingly. If we only have a discriminant function, any changes to the loss matrix would require us to go back to the training data and solve the classification problem again.\nReject option: The posterior probabilities allow us to define a rejection criterion, which will minimize the misclassification rate, or more generally, the expected loss, for a specific portion of the data points being rejected.\nCompensating for prior class probabilities: In imbalanced class problems (such as cancer diagnosis tasks), using posterior probabilities allows us to adjust predictions based on the actual class distribution in the dataset. Consider the medical X-ray problem, where we assume a large number of X-ray images have been collected from the general population to build an automatic screening system. Since cancer is rare in the general population, we may find that only 1 in 1000 examples represents cancer. If we use such a dataset to train an adaptive model, we might face significant difficulties due to the small proportion of cancer cases. For example, a classifier that assigns all points to the normal class would achieve 99.9% accuracy and would be hard to avoid as a trivial solution. Additionally, even a large dataset will contain very few cancer-related X-ray images, so the learning algorithm may not generalize well. 
A balanced dataset with equal numbers of examples from each class will allow us to find a more accurate model. However, we must then compensate for the effects of the adjustments made to the training data. Suppose we have used such a modified dataset and found models for the posterior probabilities. From Bayes\u0026rsquo; theorem $(3)$, we see that the posterior probabilities are proportional to the prior probabilities, which we can interpret as the ratio of points in each class. Thus, we can simply take the posterior probabilities obtained from our balanced dataset, divide them by the class proportions in that dataset, and then multiply by the class proportions in the dataset we want to apply the model to. Finally, we need to normalize to ensure that the new posterior probabilities sum to 1. Note that this procedure cannot be applied if we have learned a discriminant function instead of determining the posterior probabilities.\nCombining models: In complex problems with multiple data sources, using the Naive Bayes model and conditional independence allows us to efficiently combine information from different sources while maintaining prediction accuracy. Specifically, for complex applications, we may want to break the problem down into smaller sub-problems, each of which can be solved by a separate module. For example, in our medical diagnosis problem, we might have information from both blood tests and X-ray images. Instead of combining all this heterogeneous information into a large input space, it may be more efficient to build one system to interpret the X-ray images and another system to interpret the blood data. As long as each of these models provides posterior probabilities for the classes, we can combine their outputs systematically using probabilistic rules. 
A simple way to do this is to assume that, for each class, the input distributions for the X-ray image data, denoted as $x_1$, and the blood data, denoted as $x_B$, are independent, so:\n$$ p(x_1, x_B | C_k) = p(x_1 | C_k) p(x_B | C_k) \\tag{4} $$\nThis is an example of conditional independence, as this independence holds when the distribution is conditioned on class $C_k$. The posterior probability, with both X-ray and blood data, is then given by:\n$$ p(C_k | x_1, x_B) \\propto p(x_1, x_B | C_k) p(C_k) $$\n$$ \\propto p(x_1 | C_k) p(x_B | C_k) p(C_k) $$\n$$ \\propto \\frac{p(C_k | x_1) p(C_k | x_B)}{p(C_k)} $$\nThe last step applies Bayes\u0026rsquo; theorem to each source separately, $p(x_1 | C_k) \\propto p(C_k | x_1) / p(C_k)$ and likewise for $x_B$. Therefore, we need the class prior probabilities $p(C_k)$, which we can easily estimate from the proportions of the data points in each class, and then we need to normalize the resulting posterior probabilities so that they sum to 1. The specific conditional independence assumption $(4)$ is an example of a Naive Bayes model. Note that the marginal distribution $p(x_1, x_B)$ will generally not factorize under this model. In later chapters, we will see how to build models that combine data without requiring the conditional independence assumption $(4)$.\nIn general, using posterior probabilities not only provides a powerful tool for making accurate decisions but also allows flexibility in handling complex and time-varying situations.\n6. Loss functions for regression The function $y(x)$ that minimizes the expected squared loss is the conditional expectation of $t$ given $x$, and it is called the regression function. In $Fig. 9$, when $x = x_0$, the value $y(x_0)$ is the mean of the conditional distribution $p(t|x_0)$, which is represented by the dashed green horizontal line. The regression function minimizes the expected squared loss by averaging the values of $t$ according to the conditional distribution $p(t|x)$. 
Fig.9\nThe decision step consists of choosing a specific estimate $y(x)$ of the target value $t$ for each input $x$. The expected loss, averaged over the random variables $x$ and $t$, is:\n$$ \\mathbb{E}[L] = \\iint L(t, y(x))p(x, t) \\, dx \\, dt $$\nIn the specific case of squared loss:\n$$ \\mathbb{E}[L] = \\iint (y(x) - t)^2 p(x, t) \\, dx \\, dt $$\nMinimizing $\\mathbb{E}[L]$ with respect to the function $y(x)$ using the calculus of variations, we get:\n$$ y(x) = \\frac{\\int t p(x, t) \\, dt}{p(x)} = E_t[t|x] $$\nThis shows that the regression function $y(x)$ is the conditional mean of $t$ given $x$.\nMinkowski loss function Squared loss is not the only choice for regression problems. A generalization of squared loss is the Minkowski loss, with the expectation given by:\n$$ \\mathbb{E}[L_q] = \\iint |y(x) - t|^q p(x, t) \\, dx \\, dt $$\nThis loss function reduces to squared loss when $q = 2$, and the minimum of $\\mathbb{E}[L_q]$ is attained by the conditional mean when $q = 2$, the conditional median when $q = 1$, and the conditional mode when $q \\to 0$.\nApproaches for regression $(a)$ Solve the inference problem to determine the joint density $p(x, t)$, then normalize to find the conditional density $p(t|x)$, and finally compute the conditional expectation using the formula: $$ y(x) = E_t[t|x] $$\n$(b)$ Solve the inference problem to find the conditional density $p(t|x)$, then integrate to find the conditional expectation. $(c)$ Directly find the regression function $y(x)$ from the training data. Each approach has its own advantages and disadvantages, depending on the complexity of the specific problem.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-09-02/","summary":"High-dimensional data pitfalls (CoD) and core decision theory: risk, posterior-based rules, reject option.","title":"The Curse of Dimensionality and Decision Theory"},{"content":"Suppose $\\mathbf{x} \\sim p(X)$ is a random variable, and $y = f(\\mathbf{X})$. 
In this section, we will find how to calculate $p(y)$.\nDiscrete Random Variable Case If $\\mathbf{X}$ is a discrete random variable, the pmf of $\\mathbf{Y}$ is simply computed by summing the probability mass of all values $x$ such that $f(x) = y$.\n$$ p_y(y) = \\sum_{x: f(x) = y} p_x(x) \\tag{1} $$\nContinuous Random Variable Case If $\\mathbf{X}$ is a continuous random variable, $p_x(x)$ is a probability density function, so we cannot use formula (1). Instead, we will work with the cumulative distribution function (CDF), i.e., $P_x(x)$ and $P_y(y)$.\n$$ P_y(y) \\triangleq \\Pr(Y \\leq y) = \\Pr(f(\\mathbf{X}) \\leq y) = \\Pr(\\mathbf{X} \\in \\lbrace x \\mid f(x) \\leq y \\rbrace) $$\nWhere:\nThe CDF $P_y(y)$ gives the probability that the random variable $\\mathbf{Y}$ (generated from $f(\\mathbf{X})$) is less than or equal to $y$. In this case, we are calculating the probability that $f(\\mathbf{X})$ is less than or equal to $y$ by finding all values of $x$ such that $f(x) \\leq y$, and then calculating the probability for these values of $x$. If $f(x)$ is an invertible function, we can find $p_y(y)$ by differentiating the CDF. Otherwise, we need to use other approximation methods, such as Monte Carlo, to find the pdf $p_y(y)$.\nInvertible Transformations (Bijections) In this section, we focus on the cases of monotonic and invertible functions. A function is invertible if and only if it is a bijection (bijective).\nA function $f: \\mathbf{X} \\to Y$ is a bijection when it has two properties:\n(a) Injective (one-to-one), meaning each distinct element in set $\\mathbf{X}$ is mapped to a distinct element in set $Y$, (b) Surjective (onto), meaning each element in set $Y$ has at least one element in set $\\mathbf{X}$ mapped to it. Change of Variables: Scalar Case We start with an example. Suppose $x \\sim \\text{Unif}(0, 1)$, and $y = f(x) = 2x + 1$. 
This function stretches and shifts the probability distribution, as illustrated in Fig.1 (a).\nNow, zooming in on a point $x$ and a nearby point, more precisely, $x + dx$, we see that this interval is mapped to $(y, y + dy)$. The probability mass in these intervals must remain the same, so $p_x(x) dx = p_y(y) dy$, and thus:\n$$ p_y(y) = p_x(x) \\frac{dx}{dy} $$\nHowever, since it doesn\u0026rsquo;t matter (in terms of probability conservation) whether $\\frac{dx}{dy} \u0026gt; 0$ or $\\frac{dx}{dy} \u0026lt; 0$, we have:\n$$ p_y(y) = p_x(x)\\left| \\frac{dx}{dy} \\right| $$\nwhere:\nThe derivative $\\frac{dx}{dy}$ indicates the rate of change of $x$ with respect to $y$. This is a scaling factor that describes how the probability density changes when transforming from $x$ to $y$. When $\\left| \\frac{dx}{dy} \\right|$ is small, a short interval in $x$ is stretched over a longer interval in $y$, so the density $p_y(y)$ is lower than $p_x(x)$, and vice versa. In the example above, $\\frac{dx}{dy} = \\frac{1}{2}$, so $p_y(y) = \\frac{1}{2}$ on the interval $(1, 3)$. The absolute value $\\left| \\frac{dx}{dy} \\right|$ is used to ensure that the probability density is non-negative. Fig.1\nNow, let\u0026rsquo;s consider the general case for any $p_x(x)$ and any monotonic function $f: \\mathbb{R} \\to \\mathbb{R}$. Let $g = f^{-1}$, so that $y = f(x)$ and $x = g(y)$. If we assume that $f: \\mathbb{R} \\to \\mathbb{R}$ is monotonically increasing, we have:\n$$ P_y(y) = \\Pr(f(\\mathbf{X}) \\leq y) = \\Pr(\\mathbf{X} \\leq g(y)) = P_x(g(y)) $$\nTaking the derivative and applying the chain rule with $x = g(y)$, we get:\n$$ p_y(y) = \\frac{d}{dy} P_y(y) = \\frac{d}{dy} P_x(g(y)) = \\frac{dx}{dy} \\frac{d}{dx} P_x(x) = \\frac{dx}{dy} p_x(g(y)) $$\nWe can derive a similar expression (but with the opposite sign) for the case where $f$ is monotonically decreasing. 
To handle the general case, we take the absolute value to obtain:\n$$ p_y(y) = p_x(g(y)) \\left| \\frac{d}{dy} g(y) \\right| $$\nThis is known as the change of variables formula.\nChange of Variables: Multivariable Case We can extend the previous result to multivariable distributions as follows. Suppose $f$ is an invertible function mapping $\\mathbb{R}^n$ to $\\mathbb{R}^n$, with inverse $g$. Suppose we want to compute the pdf of $y = f(x)$. Similarly to the scalar case, we have:\n$$ p_y(y) = p_x(g(y)) \\left| \\det [J_g(y)] \\right| \\tag{3} $$\nwhere:\n$J_g = \\frac{dg(y)}{dy^T}$ is the Jacobian of $g$, and $\\left| \\det J_g(y) \\right|$ is the absolute value of the determinant of $J_g$ evaluated at $y$. The Jacobian is a matrix of partial derivatives, containing information on how each component of the vector $\\mathbf{x} = g(\\mathbf{y})$ changes with respect to each component of the vector $\\mathbf{y}$. The determinant of the Jacobian, $\\det(\\text{Jacobian})$, describes how a small \u0026ldquo;volume\u0026rdquo; (or area in 2D) around $\\mathbf{y}$ is scaled when it is mapped back to the $\\mathbf{x}$-space. This generalizes the concept of $\\frac{dx}{dy}$ in the univariate case. The absolute value of the determinant $\\left| \\det(\\text{Jacobian}) \\right|$ represents the scaling factor between probability densities in the two spaces. Consider the example of a coordinate change between Cartesian and polar coordinates. 
Here, $x = (x_1, x_2)$, $y = f(x) = f(x_1, x_2) = (r, \\theta)$, and $g(y) = f^{-1}(y) = f^{-1}(r, \\theta) = (r\\cos\\theta, r\\sin\\theta)$.\nApplying equation (3), we have:\n$$ p_{r, \\theta}(r, \\theta) = p_{x_1, x_2}(r\\cos\\theta, r\\sin\\theta) \\left| \\det [J_g(y)] \\right| \\tag{4} $$\nwhere:\n$$ \\mathbf{J}_g = \\begin{bmatrix} \\frac{\\partial x_1}{\\partial r} \u0026amp; \\frac{\\partial x_1}{\\partial \\theta} \\\\ \\frac{\\partial x_2}{\\partial r} \u0026amp; \\frac{\\partial x_2}{\\partial \\theta} \\end{bmatrix} = \\begin{bmatrix} \\cos\\theta \u0026amp; -r\\sin\\theta \\\\ \\sin\\theta \u0026amp; r\\cos\\theta \\end{bmatrix} \\Rightarrow |\\text{det}(\\mathbf{J}_g)| = |r\\cos^2\\theta + r\\sin^2\\theta| = |r| $$\nThus, equation (4) becomes:\n$$ p_{r, \\theta}(r, \\theta) = p_{x_1, x_2}(r\\cos\\theta, r\\sin\\theta) |r| \\tag{5} $$\nFig. 2\nThe area of the shaded region in the figure is $r dr d\\theta$. The probability of a point falling in the shaded region is:\n$$ \\Pr(r \\leq \\mathbf{R} \\leq r+dr, \\theta \\leq \\Theta \\leq \\theta + d\\theta) = p_{r, \\theta}(r, \\theta) dr d\\theta \\tag{6} $$\nwhere the left-hand side is the probability of a point falling within a small region of the polar coordinate system around $(r, \\theta)$. To justify the right-hand side of equation (6), we write this probability as an integral:\n$$ \\begin{align*} \\Pr(r \\leq \\mathbf{R} \\leq r+dr, \\theta \\leq \\Theta \\leq \\theta + d\\theta) \u0026amp;= \\int_{\\theta}^{\\theta+d\\theta} \\int_{r}^{r+dr} p_{r, \\theta}(r, \\theta)dr d\\theta \\end{align*} \\tag{6.1} $$\nHere, $p_{r, \\theta}(r, \\theta)$ is the probability density at a point near $(r, \\theta)$. 
Since the region spanned by $dr$ and $d\\theta$ is very small, we treat the density function as constant within that region, simplifying the integral.\nThus, the right-hand side of equation (6.1) can be rewritten as:\n$$ \\begin{align*} \\Pr(r \\leq \\mathbf{R} \\leq r+dr, \\theta \\leq \\Theta \\leq \\theta + d\\theta) \u0026amp;= p_{r, \\theta}(r, \\theta) \\int_{\\theta}^{\\theta+d\\theta} \\int_{r}^{r+dr} dr d\\theta \\\\ \u0026amp;= p_{r, \\theta}(r, \\theta) dr d\\theta \\tag{cmx} \\end{align*} $$\nApplying the change of variables to equation (6), we get:\n$$ p_{r, \\theta}(r, \\theta) dr d\\theta = p_{x_1, x_2}(r\\cos\\theta, r\\sin\\theta) r dr d\\theta \\tag{7} $$\nIn other words, in the limit as $dr$ and $d\\theta$ become infinitesimally small, the change in the density function within this region can be ignored, and we only need to consider the value of the density function at a single point (the center of the region) to compute the probability.\nThe Convolution Theorem Suppose $y = x_1 + x_2$, where $x_1$ and $x_2$ are independent random variables.\nIf these are discrete random variables, we can compute the probability mass function (pmf) for the sum as follows:\n$$ p(y = j) = \\sum_k p(x_1 = k)p(x_2 = j - k) \\tag{8} $$ where $j = \\dots, -2, -1, 0, 1, 2, \\dots$.\nIf $x_1$ and $x_2$ have probability density functions $p_1(x_1)$ and $p_2(x_2)$, what is the distribution of $y$? The cumulative distribution function (CDF) is given by:\n$$ P_y(y^*) = \\Pr(y \\leq y^*) = \\int_{-\\infty}^{\\infty} p_1(x_1) \\left[ \\int_{-\\infty}^{y^*-x_1} p_2(x_2) \\, dx_2 \\right] dx_1 \\tag{9} $$\nTo find the PDF $p(y)$, we take the derivative of $P_y(y^*)$ with respect to $y^*$, and then substitute $y^* = y$.\nStep 1:\nTake the derivative with respect to $y^*$. We have: $p(y) = \\frac{d}{dy^*} P_y(y^*) \\bigg|_{y^*=y}$. 
Substitute $P_y(y^*)$ from equation (9) into the expression:\n$$ p(y) = \\frac{d}{dy^*} \\left( \\int_{-\\infty}^{\\infty} p_1(x_1) \\left[ \\int_{-\\infty}^{y^*-x_1} p_2(x_2) \\, dx_2 \\right] dx_1 \\right) $$ Here, we need to differentiate the double integral with respect to $y^*$. Step 2:\nUse the Leibniz rule for differentiating an integral whose limits depend on $y^*$:\n$$ \\frac{d}{dy^*} \\int_{a(y^*)}^{b(y^*)} f(x) \\, dx = f(b(y^*)) \\cdot \\frac{db(y^*)}{dy^*} - f(a(y^*)) \\cdot \\frac{da(y^*)}{dy^*} $$ In this case, $b(y^*) = y^* - x_1$, $a(y^*) = -\\infty$, and $f(x_2) = p_2(x_2)$. Notice that the upper limit has unit derivative, while the lower limit does not depend on $y^*$:\n$$ \\frac{d}{dy^*} (y^* - x_1) = 1, \\qquad \\frac{d}{dy^*} a(y^*) = 0 $$ Therefore, the derivative of the inner integral in (9) with respect to $y^*$ is:\n$$ \\frac{d}{dy^*} \\left( \\int_{-\\infty}^{y^*-x_1} p_2(x_2) \\, dx_2 \\right) = p_2(y^* - x_1) \\cdot 1 = p_2(y^* - x_1) $$ Step 3:\nSubstitute the result into the expression and simplify. After differentiating, substitute $y^* = y$:\n$$ p(y) = \\int_{-\\infty}^{\\infty} p_1(x_1) p_2(y - x_1) \\, dx_1 \\tag{10} $$\nThis formula can be written as $p = p_1 \\ast p_2$.\nHere, $\\ast$ represents the convolution operator. For vectors with finite length, the integrals become sums, and convolution can be seen as a flip-and-drag operation (Table 2.4).\nTherefore, equation (10) is called the convolution theorem. For example, suppose we roll two dice, so $p_1$ and $p_2$ are both uniform discrete distributions on the set $\\lbrace 1, 2, \\dots, 6 \\rbrace$.\nLet $y = x_1 + x_2$ be the sum of the two dice faces. 
We have:\n$$ \\begin{align*} p(y = 2) \u0026amp;= p(x_1 = 1)p(x_2 = 1) \\\\ \u0026amp;= \\frac{1}{6} \\times \\frac{1}{6} \\\\ \u0026amp;= \\frac{1}{36} \\end{align*} $$\n$$ \\begin{align*} p(y = 3) \u0026amp;= p(x_1 = 1)p(x_2 = 2) + p(x_1 = 2)p(x_2 = 1) \\\\ \u0026amp;= \\frac{1}{6} \\times \\frac{1}{6} + \\frac{1}{6} \\times \\frac{1}{6} \\\\ \u0026amp;= \\frac{2}{36} \\end{align*} $$\nContinuing this way, we find $p(y = 4) = 3/36$, $p(y = 5) = 4/36$, $p(y = 6) = 5/36$, $p(y = 7) = 6/36$, $p(y = 8) = 5/36$, $p(y = 9) = 4/36$, $p(y = 10) = 3/36$, $p(y = 11) = 2/36$, and $p(y = 12) = 1/36$. Figure 2.22 shows that the distribution looks like a Gaussian distribution.\nWe can also calculate the PDF of the sum of two continuous random variables. For example, in the case of Gaussian distributions, when $x_1 \\sim \\mathcal{N}(\\mu_1, \\sigma_1^2)$ and $x_2 \\sim \\mathcal{N}(\\mu_2, \\sigma_2^2)$, we can prove that if $y = x_1 + x_2$, then\n$$ p(y) = \\mathcal{N}(x_1|\\mu_1, \\sigma_1^2) \\ast \\mathcal{N}(x_2|\\mu_2, \\sigma_2^2) = \\mathcal{N}(y|\\mu_1 + \\mu_2, \\sigma_1^2 + \\sigma_2^2) $$\nThus, the convolution of two Gaussian distributions is a Gaussian distribution.\nCentral Limit Theorem Now, consider $N$ random variables with probability density functions $p_n(x)$ (not necessarily Gaussian), each having mean $\\mu$ and variance $\\sigma^2$.\nWe assume that each variable is independent and identically distributed (iid).\n$$ X_n \\sim p(X) $$\nare independent samples from the same distribution. Let $S_N = \\sum_{n=1}^{N} X_n$ be the sum of the random variables. 
It can be proven that as $N$ increases, the distribution of this sum approaches:\n$$ p(S_N = u) = \\frac{1}{\\sqrt{2\\pi N\\sigma^2}} \\exp\\left(-\\frac{(u - N\\mu)^2}{2N\\sigma^2}\\right) $$\nThus, the distribution of the quantity\n$$ Z_N \\triangleq \\frac{S_N - N\\mu}{\\sigma\\sqrt{N}} = \\frac{\\overline{X} - \\mu}{\\sigma/\\sqrt{N}} $$\nconverges to the standard normal distribution, where $\\overline{X} = S_N/N$ is the sample mean. This is known as the Central Limit Theorem.\nIn Fig. 3, there is an example where the sample mean of random variables drawn from a Beta distribution is computed.\nWe see that the sample distribution of the mean quickly converges to a Gaussian distribution.\nFig. 3\nMonte Carlo Suppose $x$ is a random variable, and $y = f(x)$ is a function of $x$.\nIt is often difficult to compute the distribution $p(y)$.\nA simple yet powerful method is to take a large number of samples from the distribution of $x$, and then use these samples (instead of the distribution) to approximate $p(y)$.\nFor example, suppose $x \\sim \\text{Unif}(-1, 1)$ and $y = f(x) = x^2$.\nWe can approximate $p(y)$ by taking many samples from $p(x)$ (using a uniform random number generator), squaring them, and calculating the empirical distribution of the results, given by:\n$$ p_S(y) \\triangleq \\frac{1}{N_s} \\sum_{s=1}^{N_s} \\delta(y - y_s) \\tag{2.179} $$\nThis method is essentially a “sum of Dirac deltas” with equal weights, each delta concentrated on one of the samples (2.7.6). By using enough samples, we can approximate $p(y)$ quite well $(Fig. 4)$.\nThis method is called the Monte Carlo approximation for the distribution.\nFig. 4\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-08-15/","summary":"Change-of-variables for PDFs: scalar and multivariate cases, Jacobian determinant, convolution and CLT.","title":"Transformations of random variables"},{"content":"1. 
Quantifying the Uncertainty of Events We are probably familiar with examples of probability, such as two boxes, one green and one red, each containing differently colored balls. Then, we might calculate probabilities, such as drawing three red balls consecutively from the green box or finding the probability of drawing a yellow ball from the red box, etc. In these examples, we view probability from the perspective of the frequency of random events that can be repeated.\nHowever, when considering an uncertain event, such as whether the moon has ever orbited the sun or whether the Arctic ice will disappear by the end of this century, we find that these are not events that can be repeated countless times to define a concept of probability, as we did earlier with the boxes of balls. However, we usually have some initial ideas (predictions), such as the rate of Arctic ice melt. If we obtain new evidence, such as from a newly launched Earth observation satellite collecting new diagnostic data, we can revise our views on the rate of ice loss. Our assessment of such issues will influence the actions we take, such as the extent to which we try to reduce greenhouse gas emissions.\nIn such situations, we want to be able to quantify our expression of uncertainty and make accurate adjustments to that uncertainty based on new evidence, and then make optimal actions or decisions as a consequence. All of this can be achieved through the Bayesian interpretation of probability.\n2. 
Bayesian Inference The term \u0026ldquo;inference\u0026rdquo; refers to the \u0026ldquo;act of moving from sample data to generalizations, typically with a calculated degree of certainty.\u0026rdquo;\nThe term \u0026ldquo;Bayesian\u0026rdquo; refers to methods of inference that express \u0026ldquo;degree of certainty\u0026rdquo; using probability theory and make use of Bayes\u0026rsquo; rule to update the degree of certainty of the given data.\nBayes\u0026rsquo; rule is very simple: it is a formula used to calculate the probability distribution of possible values of an unknown (or hidden) quantity $ X $ based on some observed data $ Y = y $.\n$$ p(\\mathbf{X}=x|Y=y) = \\frac{p(Y=y|\\mathbf{X}=x)p(\\mathbf{X}=x)}{p(Y=y)} \\tag{2.1} $$\nwhich originates from: $ p(x,y) = p(x|y)p(y) = p(y|x)p(x) $. In expression $ (2.1) $, $ p(\\mathbf{X}=x) $ is our prior belief about $ \\mathbf{X} $ before observing the data, known as the prior distribution. $ p(Y=y|\\mathbf{X}=x) $ represents the probability of observing the value $ Y = y $ given $ \\mathbf{X} = x $. This is called the observation distribution. When evaluated at the actual observed value $ y $, we have the likelihood function $ p(Y=y|\\mathbf{X}=x) $. When you have an actual observed value $ y $, you want to know the probability of $ y $ occurring given that the value of $ \\mathbf{X} $ is $ x $. At this point, the likelihood function $ p(Y=y|\\mathbf{X}=x) $ is a way to assess how well the value $ x $ explains the observed data $ y $. In Bayesian statistics, this likelihood is used to update the prior distribution to create the posterior distribution, reflecting our updated knowledge about $ \\mathbf{X} $ after observing the data.\nBy multiplying the prior distribution $ p(\\mathbf{X}=x) $ with the likelihood $ p(Y=y|\\mathbf{X}=x) $ for each $ x $, we obtain the unnormalized joint distribution $ p(\\mathbf{X}=x,Y=y) $. 
To normalize, we can divide by the marginal likelihood $ p(Y=y) $, because:\n$$ p(Y=y) = \\int_{x \\in \\mathbf{X}} p(Y=y|\\mathbf{X}=x)p(\\mathbf{X}=x)dx $$\n3. Marginal Probability In Bayesian statistics, marginal probability is an important component used to compare models and estimate parameters. It is the probability of the observed data under a specific statistical model, obtained by integrating over the parameter space of the model. Marginal probability can be understood as the probability of the model itself and is therefore often referred to as model evidence, or simply evidence.\nSince marginal probability is calculated by integrating over the entire possible value space of the model parameters, it does not depend on any specific parameter but is dependent on the dimensionality and size of the parameter space. It also depends on the model and the prior.\nMathematically, marginal probability is represented as:\n$$ p(\\mathbf{X}|\\alpha) = \\int_\\theta p(\\mathbf{X}|\\theta)p(\\theta|\\alpha)d\\theta $$\nwhere $ \\mathbf{X} $ is the dataset $ \\mathbf{X} = (x_1, x_2, \u0026hellip;, x_n) $, with each $ x_i \\sim p(x|\\theta) $ and the distribution $ p(x|\\theta) $ parameterized by $ \\theta $, and $ \\theta $ is a random variable following some distribution $ \\theta \\sim p(\\theta|\\alpha) $. The marginal probability gives the probability of $ p(\\mathbf{X}|\\alpha) $ when $ \\theta $ is integrated out. Integrating out here means that we calculate $ p(\\mathbf{X}|\\alpha) $ by integrating over the entire value space of $ \\theta $, and as a result, $ \\theta $ will be excluded from the final result.\nDuring the integration process: $ \\theta $ is an intermediate variable used to compute the value of $ p(\\mathbf{X} \\mid \\alpha) $. 
After the integration process: $ \\theta $ is \u0026ldquo;excluded\u0026rdquo; from the final result, meaning that the result no longer contains $ \\theta $ as a separate variable, and therefore it no longer depends on the value of $ \\theta $. This is because the integration has \u0026ldquo;summed up\u0026rdquo; all the information about $ \\theta $ into a single overall value, representing the distribution of the data based on the given model.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-07-21/","summary":"Bayesian probability: quantifying uncertainty, Bayes\u0026rsquo; rule, prior/likelihood/posterior, marginal probability.","title":"Bayesian Probability"},{"content":"1. Basic Probability Rules Probability theory is a branch of mathematics concerned with analyzing random events. Below are the fundamental rules and concepts in probability:\nProbability of an Event The probability of an event $A$, denoted $P(A)$, represents the likelihood of that event occurring. The probability of any event always lies within the range $0 \\leq P(A) \\leq 1$.\nSample Space and Events Sample Space $S$: The set of all possible outcomes of a random experiment. Event $A$: A subset of the sample space. 
Complement Rule The probability of the complement of an event $A$, denoted as $P(A^c)$, is given by: $P(A^c) = 1 - P(A)$.\nAddition Rule of Probability If $A$ and $B$ are two mutually exclusive events (having no common elements), the probability of either event $A$ or $B$ occurring is: $P(A \\cup B) = P(A) + P(B)$.\nIf $A$ and $B$ are not mutually exclusive, the general addition rule is: $P(A \\cup B) = P(A) + P(B) - P(A \\cap B)$.\nMultiplication Rule of Probability If $A$ and $B$ are two independent events (where the occurrence of one event does not affect the probability of the other), the probability of both events $A$ and $B$ occurring is: $P(A \\cap B) = P(A) \\cdot P(B)$.\nIf $A$ and $B$ are not independent, the probability of both events occurring is: $P(A \\cap B) = P(A) \\cdot P(B|A)$, where $P(B|A)$ is the conditional probability of $B$ given that $A$ has occurred.\nAddition Rule $$ p(X) = \\sum_Y p(X, Y) $$\nThis rule calculates the probability of a random variable $X$ by summing the joint probabilities $P(X, Y)$ over all possible values of a random variable $Y$.\nMultiplication Rule $$ p(X, Y) = p(Y | X) p(X) $$\nThis rule expresses the joint probability of two random variables $X$ and $Y$ as a conditional probability times a marginal probability.\nConditional Probability From the multiplication rule, and the symmetry $P(X, Y) = P(Y, X)$, we can establish a relationship between conditional distributions:\n$$ P(X | Y) = \\frac{P(Y | X) P(X)}{P(Y)} \\tag{1.1} $$\nThis is known as Bayes\u0026rsquo; Rule.\nBy combining the multiplication and addition rules, the denominator can be expressed as:\n$$ P(Y) = \\sum_X P(Y | X) P(X) \\tag{1.2} $$\nIn Bayes\u0026rsquo; Rule, the denominator acts as a normalization constant. This constant ensures that the conditional probabilities $P(X | Y)$ on the left side of equation (1.1), summed over all values of $X$, equal 1, thus preserving the consistency and validity of the probability system.\nTo verify, assume we have two possibilities for $X$: $X_1$ and $X_2$. 
Applying Bayes\u0026rsquo; Rule to both cases, we have:\n$$ \\begin{align*} P(X_1 | Y) \u0026amp;= \\frac{P(Y | X_1) P(X_1)}{P(Y)} \\tag{1.3} \\\\ P(X_2 | Y) \u0026amp;= \\frac{P(Y | X_2) P(X_2)}{P(Y)} \\tag{1.4} \\end{align*} $$\nSince $P(X_1 | Y) + P(X_2 | Y) = 1$, we get:\n$$ \\begin{align*} \u0026amp; \\frac{P(Y | X_1) P(X_1)}{P(Y)} + \\frac{P(Y | X_2) P(X_2)}{P(Y)} = 1 \\\\ \\Leftrightarrow \u0026amp; P(Y) = P(Y | X_1) P(X_1) + P(Y | X_2) P(X_2) \\end{align*} $$\nwhich is the denominator shown in (1.2).\n2. Probability Density Function In addition to finding the probability of discrete variables, we also need to consider the case of continuous variables.\nIf the probability of a real variable $x$ falling within an interval $(x, x + \\delta x)$ is given by $p(x) \\delta x$ as $\\delta x \\rightarrow 0$, then $p(x)$ is called the probability density function of $x$.\nThe probability that the value of $x$ lies in the interval $(a, b)$ is defined by:\n$$ p(x \\in (a, b)) = \\int_a^b p(x) dx $$\nSince the value of a probability must be non-negative, and since the value of $x$ must lie somewhere within the real number range, the probability density function $p(x)$ must satisfy two conditions:\n$$ \\begin{cases} p(x) \\geq 0, \\\\ \\int_{-\\infty}^{\\infty} p(x) dx = 1 \\end{cases} $$\nThe addition and multiplication rules of probability, as well as Bayes\u0026rsquo; Rule, also apply to probability densities, or combinations of continuous and discrete variables. For example, if $x$ and $y$ are two continuous variables, their addition and multiplication rules take the form:\n$$ \\begin{align*} p(x) \u0026amp;= \\int p(x, y) dy, \\\\ p(x, y) \u0026amp;= p(x | y) p(y) \\end{align*} $$\n3. 
Expectation and Variance Expectation An important calculation in probability is finding the weighted average, where each term in the sum is multiplied by a weight before taking the average.\nThe weighted average of a function $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$, denoted $\\mathbf{E}[f]$.\nFor a discrete distribution, its expectation is:\n$$ \\mathbf{E}[f] = \\sum_x p(x) f(x) $$\nThis represents a weighted average based on the relative probabilities of different values of $x$.\nIn the case of continuous variables, the expectation is given by an integral with respect to the probability density function:\n$$ \\mathbf{E}[f] = \\int p(x) f(x) dx $$\nAlternatively, if we have a finite set of $N$ points generated from a probability distribution or probability density, the expectation can be approximated by a finite sum over these points:\n$$ \\mathbb{E}[f] \\simeq \\frac{1}{N} \\sum_{n=1}^{N} f(x_n) $$\nAt times, we need to consider the expectation of multivariable functions, for example, $f(x, y)$. Here, we use subscripts to indicate the variable being averaged. For instance:\n$$ \\mathbf{E}_x[f(x, y)] $$\nindicates the average of the function $f(x, y)$ with respect to the distribution of $x$.\nWe also consider the conditional expectation, taken with respect to a conditional distribution, specifically:\n$$ \\mathbf{E}[f | y] = \\sum_x p(x | y) f(x) $$\nFor continuous variables, this becomes:\n$$ \\mathbf{E}[f | y] = \\int p(x | y) f(x) dx $$\nVariance Variance represents the degree of dispersion of the values of the random variable $x$ around its expected value $E[x]$. 
The variance of a variable following a probability distribution is defined as:\n$$ \\begin{align*} Var(x) \u0026amp;= E[(x - E[x])^2] \\\\ \u0026amp;= E[x^2 - 2xE[x] + E[x]^2] \\\\ \u0026amp;= E[x^2] - 2E[x]E[x] + E[x]^2 \\quad \\text{(linearity of expectation; $E[x]$ is a constant)} \\\\ \u0026amp;= E[x^2] - 2E[x]^2 + E[x]^2 \\\\ \u0026amp;= E[x^2] - E[x]^2 \\end{align*} $$\nWe see that variance can be calculated by taking the expected value of $x^2$ and subtracting the square of the expected value of $x$.\nCovariance Covariance between two random variables $x$ and $y$ measures the degree to which these variables change together.\nThe formula for covariance between $x$ and $y$ is:\n$$ \\begin{align*} Cov(x, y) \u0026amp;= E_{x, y}[(x - E[x])(y - E[y])] \\\\ \u0026amp;= E_{x, y} [xy - xE[y] - yE[x] + E[x]E[y]] \\\\ \u0026amp;= E_{x,y}[xy] - E_{x,y}[xE[y]] - E_{x,y}[yE[x]] + E_{x,y}[E[x]E[y]] \\\\ \u0026amp;= E_{x,y}[xy] - E[x]E[y] - E[x]E[y] + E[x]E[y] \\\\ \u0026amp;= E_{x,y}[xy] - E[x]E[y] \\tag{1} \\end{align*} $$\nIf the covariance is positive, $x$ and $y$ are positively correlated: when $x$ increases, $y$ tends to increase, and vice versa. If the covariance is negative, $x$ and $y$ are negatively correlated: when $x$ increases, $y$ tends to decrease, and vice versa. If the covariance is zero, then $x$ and $y$ are uncorrelated. In particular, independent variables always have zero covariance. The proof is straightforward, because when $x$ and $y$ are independent, the expectation of their product factorizes:\n$$ E[xy] = E[x]E[y] $$\nThus, equation $(1)$ becomes $E[x]E[y] - E[x]E[y] = 0$. The converse does not hold in general: zero covariance does not imply that $x$ and $y$ are independent.\n4. 
Gaussian Distribution The Gaussian distribution is one of the most important probability distributions for continuous variables, also known as the normal distribution.\nFor the case of a real variable $x$, the Gaussian distribution is defined by:\n$$ \\mathcal{N}(x|\\mu, \\sigma^2) = \\frac{1}{(2\\pi\\sigma^2)^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2\\sigma^2}(x-\\mu)^2\\right) \\tag{2} $$\nThis distribution is governed by two parameters: $\\mu$, called the mean, and $\\sigma^2$, called the variance. The square root of the variance, denoted as $\\sigma$, is called the standard deviation, and the inverse of the variance, written as $\\beta = 1/\\sigma^2$, is called the precision.\nFrom the form of equation $(2)$, we observe that the Gaussian distribution satisfies:\n$$ \\mathcal{N}(x|\\mu, \\sigma^2) \u0026gt; 0 $$\nThis is because:\n$\\frac{1}{(2\\pi\\sigma^2)^{\\frac{1}{2}}} \u0026gt; 0$. This is a positive constant, because in the Gaussian distribution, the variance $\\sigma^2 \u0026gt; 0$ instead of being non-negative. If the variance were zero, it would mathematically imply that all values of the random variable are exactly equal to the mean $\\mu$, with no dispersion. This would make the probability density function invalid. $\\exp\\left(-\\frac{1}{2\\sigma^2}(x-\\mu)^2\\right) \u0026gt; 0$, because the exponential function of any real number is always greater than zero, even if the number is negative. 
For $(2)$ to be a valid probability distribution, we need to prove that:\n$$ \\begin{align*} \u0026amp;\\int_{-\\infty}^{\\infty} \\mathcal{N}(x|\\mu, \\sigma^2) \\, dx = 1 \\\\ \\Leftrightarrow \u0026amp; \\int_{-\\infty}^{\\infty} \\frac{1}{(2\\pi\\sigma^2)^{\\frac{1}{2}}} \\exp\\left(-\\frac{1}{2\\sigma^2}(x-\\mu)^2\\right) \\, dx = 1 \\tag{3} \\end{align*} $$\nIn $(3)$, we substitute $z = \\frac{x - \\mu}{\\sigma}$, so that $x - \\mu = \\sigma z$ and $dx = \\sigma \\, dz$, and the expression becomes:\n$$ \\begin{align*} \u0026amp; \\int_{-\\infty}^{\\infty} \\frac{1}{(2\\pi\\sigma^2)^{\\frac{1}{2}}} \\exp\\left(-\\frac{(\\sigma z)^2}{2\\sigma^2}\\right) \\sigma \\, dz = 1 \\\\ \\Leftrightarrow \u0026amp; \\int_{-\\infty}^{\\infty} \\frac{1}{(2\\pi\\sigma^2)^{\\frac{1}{2}}} \\exp\\left(-\\frac{z^2}{2}\\right) \\sigma \\, dz = 1 \\\\ \\Leftrightarrow \u0026amp; \\int_{-\\infty}^{\\infty} \\frac{\\sigma}{(2\\pi\\sigma^2)^{\\frac{1}{2}}} \\exp\\left(-\\frac{z^2}{2}\\right) \\, dz = 1 \\\\ \\Leftrightarrow \u0026amp; \\int_{-\\infty}^{\\infty} \\frac{1}{\\sqrt{2\\pi}} \\exp\\left(-\\frac{z^2}{2}\\right) \\, dz = 1 \\\\ \\Leftrightarrow \u0026amp; \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{\\infty} \\exp\\left(-\\frac{z^2}{2}\\right) \\, dz = 1 \\tag{4} \\end{align*} $$\nThe integrand in $(4)$ is the density of the standard normal distribution, a special case of the normal distribution with mean $= 0$ and variance $= 1$, and the Gaussian integral $\\int_{-\\infty}^{\\infty} \\exp\\left(-\\frac{z^2}{2}\\right) \\, dz = \\sqrt{2\\pi}$ shows that it integrates to $1$. 
Therefore, $(4)$ is a valid distribution, and hence, $(2)$ is also a valid distribution.\nExpectation and Variance The expectation of a variable $x$ following a Gaussian distribution is given by:\n$$ \\mathbf{E}[x] = \\int_{-\\infty}^{\\infty} \\mathcal{N}(x|\\mu, \\sigma^2) x , dx = \\mu \\tag{5} $$\nSimilarly, the second moment is:\n$$ \\mathbf{E}[x^2] = \\int_{-\\infty}^{\\infty} \\mathcal{N}(x|\\mu, \\sigma^2) x^2 , dx = \\mu^2 + \\sigma^2 \\tag{6} $$\nFrom equations $(5)$ and $(6)$, the variance of the Gaussian distribution is:\n$$ Var(x) = \\mathbf{E}[x^2] - \\mathbf{E}[x]^2 = \\sigma^2 $$\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-07-12/","summary":"Probability fundamentals: rules, PDFs, expectation, variance, covariance, Gaussian distribution.","title":"Basic Probability"},{"content":"1. Problem Description In machine learning, we are often given a dataset $x$, along with an observation set $t$. Our task is to predict the corresponding value $t_j$ as accurately as possible for each new $x_j$.\nConsider a specific example: we have a training dataset with $N$ data points $\\mathbf{x} \\equiv (x_1, \\ldots, x_N)^T$, along with an observation set $\\mathbf{t} \\equiv (t_1, \\ldots, t_N)^T$. Each point $x_n$ with $n \\in [1, \\ldots, N]$ is generated from a uniform distribution in the range $[0, 1]$, and the corresponding $t_n$ is calculated as $t_n = 2 \\pi x_n + \\epsilon$, where $\\epsilon$ is a small Gaussian-distributed random noise. Adding such noise ensures a characteristic of real-world datasets: data points generally follow a common pattern (which is what we want the machine to learn), but each individual data point will be perturbed by random variables.\nNow, our goal is to approximate $t_t$ when a new data point $x_t$ is introduced, meaning that our model must closely approximate the function $2 \\pi x$.\n2. 
Polynomial Function Approximation Method One of the simplest and most popular methods to solve this problem is to find a polynomial function $f(x) = t$ to approximate the relationship between $x$ and $t$.\nPolynomial curve fitting is a regression analysis method to find the best-fitting polynomial function for a set of data points. Why use a polynomial function? By the Weierstrass approximation theorem, any continuous function on a closed interval can be approximated arbitrarily well by a polynomial of sufficiently high degree. That is, with a high enough degree, a polynomial function can closely approximate a wide range of nonlinear relationships between variables in the data. We can approximate by introducing a polynomial function as follows:\n$$ \\begin{align*} f(x, \\mathbf{w}) \u0026amp;= w_0 + w_1x^1 + w_2x^2 + \\ldots + w_Mx^M \\\\ \u0026amp;= \\sum_{j=0}^M w_jx^j \\end{align*} $$\nwhere $M$ is the degree of the polynomial (not to be confused with the number of data points $N$), $x^j$ is $x$ raised to the power $j$, and $w_0, \\ldots, w_M$ are the polynomial coefficients. Although the function $f(x, w)$ is nonlinear in terms of $x$, it is linear with respect to $w$. Functions like this, which are linear in the unknown parameters (in this case, the $w$ variables), are called linear models.\nTo find the values of these unknown parameters, we fit the function $f(x, w)$ to the training dataset by minimizing an error function that measures the difference between the output of $f(x, w)$ and the corresponding observed value $t$. Here we use the sum-of-squares error, which differs from the MSE only by a constant factor:\n$$ E(\\mathbf{w}) = \\frac{1}{2}\\sum_{n=1}^N(f(x_n, \\mathbf{w}) - t_n)^2 \\tag{2.1} $$\nwhere the factor $\\frac{1}{2}$ is for convenience in calculation. The error function is always non-negative, and it equals zero if and only if the function $f(x, w)$ passes exactly through every data point in the training set. 
Since the error function is a quadratic function with respect to the coefficients $w$, its derivative with respect to these coefficients is a linear function of $w$. Indeed, the partial derivative of $E(\\mathbf{w})$ with respect to $w_j$ is written as:\n$$ \\frac{\\partial E(\\mathbf{w})}{\\partial w_j} = \\frac{\\partial}{\\partial w_j} \\left( \\frac{1}{2} \\sum_{n=1}^N (f(x_n, \\mathbf{w}) - t_n)^2 \\right) $$\nApplying the chain rule, with $\\partial f(x_n, \\mathbf{w}) / \\partial w_j = x_n^j$, we get:\n$$ \\begin{align*} \\frac{\\partial E(\\mathbf{w})}{\\partial w_j} \u0026amp;= \\sum_{n=1}^N (f(x_n, \\mathbf{w}) - t_n) x_n^j \\tag{2.3} \\end{align*} $$\nThis means that the partial derivative of the error function with respect to the coefficient $w_j$ is a linear function of $\\mathbf{w}$.\nMoreover, minimizing the error function $(2.1)$ yields a unique solution under mild conditions. The error function $E(\\mathbf{w})$ is quadratic with respect to the coefficients $\\mathbf{w}$, creating a paraboloid surface in multi-dimensional space. This leads to two important consequences:\nShape of the function: A quadratic function has a parabolic surface as its graph. Because $E(\\mathbf{w})$ is a sum of squares, its graph in multi-dimensional space is an upward-opening (convex) paraboloid, so any stationary point is a global minimum. Uniqueness of solution: The first derivative of the quadratic function with respect to $\\mathbf{w}$ is a linear function. When we solve the equation $\\nabla E(\\mathbf{w}) = 0$, we are solving a system of linear equations. This system has a unique solution if the coefficient matrix (created by the first derivatives) is square and non-singular, which is typically the case in linear regression problems (assuming no linear dependence among input variables). This ensures that the error function has a single minimum point, giving a unique solution to the optimization problem. In summary, we are looking for $\\mathbf{w}$ such that $\\nabla E(\\mathbf{w}) = 0$, which is where the error function reaches its minimum. 
We can find a closed-form equation for $\\mathbf{w}$, whose solution is unique, denoted $\\mathbf{w}^*$.\nThe values of $f(x_n, \\mathbf{w})$ over the whole training set can be written as the matrix-vector product $\\mathbf{X}\\mathbf{w}$, where $N$ is the number of data points, $M$ is the polynomial degree, and:\n$$ \\mathbf{X} = \\begin{bmatrix} 1 \u0026amp; x_1 \u0026amp; x_1^2 \u0026amp; \\cdots \u0026amp; x_1^M \\\\ 1 \u0026amp; x_2 \u0026amp; x_2^2 \u0026amp; \\cdots \u0026amp; x_2^M \\\\ \\vdots \u0026amp; \\vdots \u0026amp; \\vdots \u0026amp; \\ddots \u0026amp; \\vdots \\\\ 1 \u0026amp; x_N \u0026amp; x_N^2 \u0026amp; \\cdots \u0026amp; x_N^M \\\\ \\end{bmatrix} $$ $$ \\mathbf{w} = \\begin{bmatrix} w_0 \\\\ w_1 \\\\ w_2 \\\\ \\vdots \\\\ w_M \\end{bmatrix} $$\nand the targets are collected in the vector:\n$$ \\mathbf{t} = \\begin{bmatrix} t_1 \\\\ t_2 \\\\ \\vdots \\\\ t_N \\end{bmatrix} $$\nFrom here, setting equation $(2.3)$ to zero for every $j$ can be expressed as:\n$$ \\begin{align*} \u0026amp; \\mathbf{X}^T(\\mathbf{X}\\mathbf{w} - \\mathbf{t}) = 0 \\\\ \\Leftrightarrow \u0026amp; \\mathbf{X}^T\\mathbf{X}\\mathbf{w} = \\mathbf{X}^T\\mathbf{t} \\end{align*} $$\nThus, provided $\\mathbf{X}^T\\mathbf{X}$ is invertible, the unique solution $\\mathbf{w}^*$ is found through the closed-form equation:\n$$ \\mathbf{w}^*=(\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T\\mathbf{t} $$\nThe remaining issue is to choose the degree $M$ so that the function $f(x, \\mathbf{w})$ captures the underlying pattern in the dataset.\nFig.1 Comparison of polynomial curves with different degrees M, represented by red curves, in fitting the original dataset\nFig.1 provides a comparison of the function $f(x, \\mathbf{w})$ at various degrees. With too low a degree, such as 0 or 1, the curve of $f(x, \\mathbf{w})$ cannot fit well. A degree of 3 appears to fit the data well. A degree of 9 produces a curve that fits all data points exactly, which may seem ideal, but in practice, such a close fit to training points leads to overfitting, a common problem in machine learning that we aim to avoid. Therefore, a higher degree is not always better; we need to choose a degree that provides a good fit without overfitting.\n3. 
Method Using Probability Distributions We have seen that the problem of polynomial curve fitting can be represented using error minimization methods. Here, we return to the curve fitting example and view it from a probabilistic perspective, gaining deeper insights into error functions and regularization, leading to a full Bayesian approach.\nThe goal in the curve fitting problem is to predict the target value $t$ based on a new value of the input variable $x$, given a training dataset of $N$ input values $\\mathbf{x} = (x_1, \\ldots, x_N)^T$ and corresponding target values $\\mathbf{t} = (t_1, \\ldots, t_N)^T$. We can represent the uncertainty in the target variable\u0026rsquo;s value using a probability distribution. For this purpose, we assume that, given $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the polynomial curve value $y(x, \\mathbf{w})$. Therefore, we have:\n$$ \\begin{align*} p(t|x, \\mathbf{w}, \\beta) = \\mathcal{N} \\left( t | y(x, \\mathbf{w}), \\beta^{-1} \\right) \\tag{3.1} \\end{align*} $$\nHere, for consistency with later chapters, we define a precision parameter $\\beta$, which is the inverse of the variance of the distribution. This is illustrated schematically in Figure 1.16.\nWe now use the training data ${ \\mathbf{x}, \\mathbf{t} }$ to determine the unknown parameters $\\mathbf{w}$ and $\\beta$ using the maximum likelihood method. Assuming the data are independently drawn from distribution $(3.1)$, the likelihood function is given by:\n$$ \\begin{align*} p(\\mathbf{t} | \\mathbf{x}, \\mathbf{w}, \\beta) = \\prod_{n=1}^{N} \\mathcal{N} \\left( t_n | y(x_n, \\mathbf{w}), \\beta^{-1} \\right) \\tag{3.2} \\end{align*} $$\nSimilar to the case of a simple Gaussian distribution, it is more convenient to maximize the log of the likelihood function. 
Substituting the form of the Gaussian distribution, we obtain the log-likelihood function as:\n$$ \\begin{align*} \\ln p(\\mathbf{t} | \\mathbf{x}, \\mathbf{w}, \\beta) = -\\frac{\\beta}{2} \\sum_{n=1}^{N} \\{y(x_n, \\mathbf{w}) - t_n\\}^2 + \\frac{N}{2} \\ln \\beta - \\frac{N}{2} \\ln (2\\pi) \\tag{3.3} \\end{align*} $$\nNow, we consider determining the maximum likelihood estimate for the polynomial coefficients, denoted $\\mathbf{w}_{ML}$, by maximizing $(3.3)$ with respect to $\\mathbf{w}$. For this purpose, we can ignore the last two terms on the right-hand side of $(3.3)$ as they do not depend on $\\mathbf{w}$. Furthermore, multiplying the log-likelihood by a positive constant does not change the location of the maximum with respect to $\\mathbf{w}$, so we can replace the coefficient $\\beta/2$ with $1/2$. Finally, instead of maximizing the log-likelihood, we can minimize the negative log-likelihood. Thus, maximizing the likelihood is equivalent to minimizing the squared error function defined by $(2.1)$ (with $y$ in place of $f$), which appears as a consequence of maximizing the likelihood under the assumption of Gaussian noise.\nWe can also use maximum likelihood to determine the precision parameter $\\beta$ of the conditional Gaussian distribution. Maximizing $(3.3)$ with respect to $\\beta$ gives:\n$$ \\begin{align*} \\frac{1}{\\beta_{ML}} = \\frac{1}{N} \\sum_{n=1}^{N} \\{ y(x_n, \\mathbf{w}_{ML}) - t_n \\}^2 \\tag{3.4} \\end{align*} $$\nOnce we have determined the parameters $\\mathbf{w}_{ML}$ and $\\beta_{ML}$, we can make predictions for new values of $x$. 
Since we now have a probabilistic model, these values are represented as a predictive distribution for $t$, rather than just a point estimate, obtained by substituting the maximum likelihood parameters into $(3.1)$:\n$$ \\begin{align*} p(t | x, \\mathbf{w}_{ML}, \\beta_{ML}) = \\mathcal{N} (t | y(x, \\mathbf{w}_{ML}), \\beta^{-1}_{ML}) \\tag{3.5} \\end{align*} $$\nNow we take a step closer to the Bayesian method by introducing a prior distribution for the polynomial coefficients $\\mathbf{w}$. For simplicity, we consider a Gaussian distribution:\n$$ \\begin{align*} p(\\mathbf{w} | \\alpha) = \\mathcal{N} \\left( \\mathbf{w} | \\mathbf{0}, \\alpha^{-1} \\mathbf{I} \\right) = \\left( \\frac{\\alpha}{2\\pi} \\right)^{(M+1)/2} \\exp \\left( -\\frac{\\alpha}{2} \\mathbf{w}^T \\mathbf{w} \\right) \\tag{3.6} \\end{align*} $$\nFor the multivariate Gaussian distribution, the general formula is:\n$$ \\mathcal{N}(\\mathbf{x}|\\boldsymbol{\\mu}, \\mathbf{\\Sigma}) = \\frac{1}{(2\\pi)^{k/2} |\\mathbf{\\Sigma}|^{1/2}} \\exp \\left( -\\frac{1}{2} (\\mathbf{x} - \\boldsymbol{\\mu})^\\mathrm{T} \\mathbf{\\Sigma}^{-1} (\\mathbf{x} - \\boldsymbol{\\mu}) \\right) $$\nHere:\nNormalization constant: $\\left( \\frac{\\alpha}{2\\pi} \\right)^{(M+1)/2}$ Exponential term: $\\exp \\left( -\\frac{\\alpha}{2} \\mathbf{w}^\\mathrm{T} \\mathbf{w} \\right)$ Here, $\\alpha$ is the precision of the distribution, and $M+1$ is the number of elements in vector $\\mathbf{w}$ for a polynomial of degree $M$. 
Parameters such as $\\alpha$, which control the distribution of the model parameters, are called hyperparameters; they are often chosen with techniques such as cross-validation.\nUsing Bayes\u0026rsquo; theorem, the posterior distribution for $\\mathbf{w}$ is proportional to the product of the prior distribution and the likelihood function:\n$$ \\begin{align*} p(\\mathbf{w} | \\mathbf{x}, \\mathbf{t}, \\alpha, \\beta) \\propto p(\\mathbf{t} | \\mathbf{x}, \\mathbf{w}, \\beta) p(\\mathbf{w} | \\alpha) \\tag{3.7} \\end{align*} $$\nNow we can determine $\\mathbf{w}$ by finding the most probable value of $\\mathbf{w}$ given the data, i.e., by maximizing the posterior distribution. This technique is called maximum a posteriori (MAP). Taking the negative log of $(3.7)$ and combining with $(3.3)$ and $(3.6)$, we find that the maximum of the posterior is given by the minimum of:\n$$ \\begin{align*} \\frac{\\beta}{2} \\sum_{n=1}^{N} \\{y(x_n, \\mathbf{w}) - t_n\\}^2 + \\frac{\\alpha}{2} \\mathbf{w}^T \\mathbf{w} \\tag{3.8} \\end{align*} $$\nThus, we see that maximizing the posterior distribution is equivalent to minimizing a regularized sum-of-squares error function, in which a penalty on $\\mathbf{w}^T \\mathbf{w}$ is added to the squared error, with the regularization parameter given by $\\lambda = \\alpha / \\beta$.\n4. Polynomial Curve Fitting Using Bayesian Approach Although we have included the prior distribution $p(w|\\alpha)$, we are still only making a point estimate of $w$, which does not yet amount to a fully Bayesian treatment. In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we will see shortly, that we integrate over all values of $w$. This marginalization lies at the core of Bayesian methods for pattern recognition. In the curve fitting problem, we are given training data $\\mathbf{x}$ and $\\mathbf{t}$, along with a new test point $x$, and our goal is to predict the value of $t$. 
Therefore, we want to evaluate the predictive distribution $p(t|x, \\mathbf{x}, \\mathbf{t})$. Here, we will assume that the parameters $\\alpha$ and $\\beta$ are fixed and known (we will discuss how such parameters can be inferred from data within the Bayesian framework in subsequent discussions). A simple Bayesian approach merely corresponds to consistently applying the rules of summation and multiplication of probabilities, allowing the predictive distribution to be written as:\n$$ p(t|x, \\mathbf{x}, \\mathbf{t}) = \\int p(t|x, w) p(w|\\mathbf{x}, \\mathbf{t}) , dw \\tag{4.1} $$\nHere, $p(t|x, w)$ is given by $(3.1)$, and we have omitted the dependency on $\\alpha$ and $\\beta$ for notational simplicity. Here, $p(w|\\mathbf{x}, \\mathbf{t})$ is the posterior distribution over the parameters, and it can be obtained by normalizing the right side of $(3.7)$. We will see in Section 3.3 that, for problems such as the curve fitting example, the posterior distribution is Gaussian and can be evaluated analytically. Similarly, the integral in $(4.1)$ can also be performed analytically, resulting in the predictive distribution given by a Gaussian form:\n$$ p(t|x, \\mathbf{x}, \\mathbf{t}) = \\mathcal{N} \\left( t | m(x), s^2(x) \\right) \\tag{4.2} $$\nwith the mean and variance given by:\n$$ m(x) = \\beta \\phi(x)^T S \\sum_{n=1}^N \\phi(x_n) t_n \\tag{4.3} $$\n$$ s^2(x) = \\beta^{-1} + \\phi(x)^T S \\phi(x) \\tag{4.4} $$\nHere, the matrix $S$ is given by\n$$ S^{-1} = \\alpha I + \\beta \\sum_{n=1}^N \\phi(x_n) \\phi(x_n)^T $$\nwhere $I$ is the identity matrix, and we define the vector $\\phi(x)$ with elements $\\phi_i(x) = x^i$ for $i = 0, \\ldots, M$. We see that both the variance and the mean of the predictive distribution in $(4.2)$ depend on $x$. 
The first term in $(4.4)$ represents the uncertainty in the predicted value of $t$ due to noise in the target variables and was previously represented in the maximum likelihood predictive distribution $(3.5)$ through $\\beta^{-1}$. However, the second term arises from the uncertainty in the parameters $w$ and is a consequence of the Bayesian approach. The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.3. Fig.1.3 Comparison of polynomial graphs with different degrees M, represented by red curves, in fitting the original dataset\n5. Analysis of Differences in Solving the Curve Fitting Problem Using Polynomial Function with MSE Loss and Bayesian Perspective: 1. Using Polynomial Function with MSE Loss: Method:\nObjective function: Uses the Mean Squared Error (MSE) loss function, defined as: $$\\text{MSE} = \\frac{1}{N} \\sum_{i=1}^N (y_i - \\hat{y_i})^2$$ where $y_i$ is the actual value, $\\hat{y_i}$ is the predicted value, and $N$ is the number of data points. Parameter Estimation: Through minimizing the MSE loss function. This is often done using gradient descent or least squares solutions. Advantages:\nSimple and Intuitive: The use of MSE is direct and easy to understand. Fast Computation: This approach usually requires less computation and can use simple analytical solutions. Widely Used: Suitable for many basic fitting problems and is widely applied across fields. Disadvantages:\nIgnores Uncertainty: MSE only minimizes the average error without accounting for uncertainty in predictions. Overfitting: Using high-degree polynomials makes the model prone to overfitting the training data, reducing generalization ability. No Prior Information: This approach does not incorporate any prior information about parameters or data. 2. Using Bayesian Perspective: Method:\nPrior Distribution: Uses a prior distribution $p(w)$ for the parameters $w$ before observing data. 
Likelihood: Probability of observed data for a specific set of parameters $p(t|x, w)$. Posterior Distribution: Uses Bayes\u0026rsquo; theorem to update the prior distribution to the posterior distribution $p(w|x, t)$: $$p(w|x, t) \\propto p(w) p(t|x, w)$$ Predictive Distribution: Integrates the posterior distribution with the likelihood to find the predictive distribution for a new data point: $$p(t|x, \\mathbf{x}, \\mathbf{t}) = \\int p(t|x, w) p(w|\\mathbf{x}, \\mathbf{t}) , dw$$ Advantages:\nAccounts for Uncertainty: Bayesian method considers uncertainty in both data and parameters, providing a probability distribution rather than a point estimate. Better Generalization Ability: Using prior and posterior distributions, Bayesian methods often generalize better, especially with limited or noisy data. Incorporates Prior Information: Allows the combination of prior information and experience into the model. Disadvantages:\nMore Complex Calculations: Calculating the posterior distribution and integrating to find the predictive distribution requires more computational resources. Challenging Prior Selection: Choosing an appropriate prior distribution can be difficult and significantly affects the results. Conclusion:\nPolynomial Function with MSE Loss is a simple and intuitive method, suitable for basic fitting problems, but is prone to overfitting and ignores uncertainty. The Bayesian approach is more complex but provides a more complete approach by accounting for uncertainty and using prior information, which often leads to more accurate and generalizable results. 
","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-07-05/","summary":"Polynomial regression from least squares to Bayesian view: closed-form, regularization, predictive uncertainty.","title":"Polynomial curve fitting"},{"content":"Diffusion Models (DMs) include two processes: forward and backward.\nForward process General idea Degrading input data using noise iteratively, forward in time (i.e., $t$ increases). Given image $x_0 \\sim q(x_0)$, which called data distribution, forward process gradually adds Gauss noise thru $T$ time steps and produces latent $x_T$. At each time step $t$, we sample Gauss noise that following the distribution $\\mathcal{N}(\\sqrt{1 - \\beta_t} x_{t-1}, \\beta_t)$, where the hyper-parameters $0 \u0026lt; \\beta_{1:T} \u0026lt; 1$ represent the variance of noise incorporated at each time step. Intuitively, all we want is destroy the whole structure of the original image thru a diffusion process by iteratively decay the previous image $x_{t-1}$ and add Gauss noise into it to produce $x_t$. If $T$ is sufficient enough, the distribution of the latent variable $x_T$ is nearly an isotropic Gaussian. Why? Because at each time step $t$ we add a Gauss noise with a variance $\\beta_t \\in (0,1)$, hence the distribution of $x_T$ is a linear transformation of $\\mathcal{N}(\\sqrt{1 - \\beta_t} x_{t-1}, \\beta_t)$. Therefore, $x_T$ should be an isotropic Gaussian noise, or ideally, it should be nearly an identity Gauss distribution, which is $q(x_T) \\sim \\mathcal{N}(x_T,; 0, \\mathcal{I})$. Mathematical explanation In the forward process, we take $x_0$ as original image and produce $x_T$ which is a latent variable following an isotropic Gauss noise. 
At a given time step $t$, we produce $x_t$ as:\n$$ \\begin{align} \\label{eqf} q(x_t|x_{t-1}) \u0026amp;:= \\mathcal{N}(x_t; \\sqrt{1 - \\beta_t}x_{t-1}, \\beta_t\\mathcal{I}), \\br x_t \u0026amp;= \\sqrt{1-\\beta_t}x_{t-1} + \\sqrt{\\beta_t}\\epsilon \\tag{1} \\end{align} $$\nwhere $q(x_t|x_{t-1})$ refers to the conditional probability distribution of the image at time step $t$ given the image at step $t-1$. To sample from this distribution, we use a property of the Gauss distribution, that is, if $x_t \\sim \\mathcal{N}(\\mu, \\beta)$, it can be expressed as: $$ x_t = \\mu + \\sqrt{\\beta}\\epsilon, $$ where $\\epsilon \\sim \\mathcal{N}(0,\\mathcal{I})$ is a standard normal random variable.\nYou may wonder why the coefficient has to be $\\sqrt{1-\\beta_t}$, so I am going to explain here. For simplicity, consider $x_0 \\sim \\mathcal{N}(\\theta, 1)$, and suppose we would like to keep the variance of the image equal to 1 at every step $t$: $$ x_1 = \\alpha x_0 + \\sqrt{\\beta_1}\\epsilon_1 $$ Note that the above equation comes from applying the reparameterization trick. As a friendly reminder, the reparameterization trick states that a sample from a Gaussian can be written as a deterministic function of standard normal noise: $$ x \\sim \\mathcal{N}(\\mu, \\sigma^2) \\iff x = \\mu + \\sigma \\cdot \\epsilon, \\ \\epsilon \\sim \\mathcal{N}(0, \\mathcal{I}) $$\nSince $Var(x_0) = 1$ and $\\epsilon_1$ is independent of $x_0$, the variance of $x_1$ is: $$ Var(x_1) = \\alpha^2 + \\beta_1 $$ As mentioned above, we want to keep $Var(x_1) = 1$. 
Enforcing this constraint leads to: $$ 1 = \\alpha^2 + \\beta_1 \\Rightarrow \\alpha = \\sqrt{1 - \\beta_1} $$ Of course this works for every step $t$, so in general we can write it as: $$ x_t = \\sqrt{1-\\beta_t}x_{t-1} + \\sqrt{\\beta_t} \\epsilon, \\ \\epsilon \\sim \\mathcal{N}(0, \\mathcal{I}) $$ which corresponds to the probability distribution we started from: $$ q(x_t|x_{t-1}) = \\mathcal{N}(x_t; \\sqrt{1-\\beta_t}x_{t-1}, \\beta_t \\mathcal{I}) $$ Instead of moving step by step, from time step $t-1$ to $t$, we can jump directly from $x_0$ to $x_t$ in a single step, using $q(x_t | x_0)$:\n$$ q(x_t|x_0) = \\mathcal{N}(x_t; \\sqrt{\\gamma_t}x_0, (1-\\gamma_t)\\mathcal{I}), $$ where $\\gamma_t = \\prod_{i=1}^{t}(1-\\beta_i)$. Consequently, $x_t$ can be directly calculated from $x_0$ by: $$ x_t = \\sqrt{\\gamma_t}\\cdot x_0 + \\sqrt{1-\\gamma_t}\\cdot\\epsilon,\\ \\epsilon\\sim\\mathcal{N}(0, \\mathcal{I}) \\tag{2} $$ Some of you may wonder why we are able to end up with this simplified form. Let\u0026rsquo;s go step by step to understand it, by expanding equation (1). To shorten the notation, let $\\alpha_t = 1 - \\beta_t$; the noise terms drawn at different steps are independent, but we reuse the symbol $\\epsilon$ for all of them.\n$$ \\begin{align} \\sqrt{\\alpha_t}x_{t-1} + \\sqrt{1-\\alpha_{t}}\\epsilon \u0026amp;= \\sqrt{\\alpha_{t}}(\\sqrt{\\alpha_{t-1}}x_{t-2} + \\sqrt{1-\\alpha_{t-1}}\\epsilon) + \\sqrt{1-\\alpha_{t}}\\epsilon \\br \u0026amp;= \\sqrt{\\alpha_t \\alpha_{t-1}}x_{t-2} + \\sqrt{\\alpha_t (1-\\alpha_{t-1})}\\epsilon + \\sqrt{1-\\alpha_t}\\epsilon \\br \u0026amp;= \\sqrt{\\alpha_t \\alpha_{t-1}}x_{t-2} + \\sqrt{\\alpha_t (1-\\alpha_{t-1}) + 1-\\alpha_t}\\epsilon \\br \u0026amp;= \\sqrt{\\alpha_t \\alpha_{t-1}}x_{t-2} + \\sqrt{1 - \\alpha_t \\alpha_{t-1}}\\epsilon \\tag{3} \\end{align} $$\nThe reason why we can go from the second line to the third line is the additive property of variances in Gaussian distributions: when two independent Gaussian noises are added, the result is again Gaussian and their variances add. 
Here, $\\sqrt{\\alpha_t (1-\\alpha_{t-1})}\\epsilon$ is a Gaussian noise, and so is $\\sqrt{1-\\alpha_{t}}\\epsilon$. Following the reparameterization trick, we have $Var(\\sqrt{\\alpha_t (1-\\alpha_{t-1})}\\epsilon) = \\alpha_t (1-\\alpha_{t-1})$, and $Var(\\sqrt{1-\\alpha_{t}}\\epsilon) = 1-\\alpha_{t}$. Hence, adding the two variances, we get $\\alpha_t(1-\\alpha_{t-1}) + 1 - \\alpha_t$, and using the reparameterization trick we can write the sum as in the third line. This variance simplifies to $1 - \\alpha_t \\alpha_{t-1}$, which gives the fourth line.\nNext, continuing to expand one more level from (3), we have:\n$$ \\begin{align} \u0026amp;\\sqrt{\\alpha_t \\alpha_{t-1}}x_{t-2} + \\sqrt{1 - \\alpha_t \\alpha_{t-1}}\\epsilon \\br = \u0026amp;\\sqrt{\\alpha_t \\alpha_{t-1}}(\\sqrt{\\alpha_{t-2}}x_{t-3} + \\sqrt{1-\\alpha_{t-2}}\\epsilon)+ \\sqrt{1 - \\alpha_t \\alpha_{t-1}}\\epsilon \\br = \u0026amp;\\sqrt{\\alpha_t \\alpha_{t-1} \\alpha_{t-2}}x_{t-3}+ \\sqrt{1-\\alpha_t \\alpha_{t-1} \\alpha_{t-2}}\\epsilon \\end{align} $$\nWe are now able to recognize the pattern, that is: $$ \\begin{align} x_t \u0026amp;= \\sqrt{\\prod_{i=1}^{t}\\alpha_i}\\,x_0 + \\sqrt{1-\\prod_{i=1}^{t}\\alpha_i}\\,\\epsilon, \\br q(x_t|x_0) \u0026amp;= \\mathcal{N}\\left(x_t; \\sqrt{\\prod_{i=1}^t\\alpha_i}\\,x_0, \\left(1 -\\prod_{i=1}^t\\alpha_i\\right)\\mathcal{I}\\right) \\end{align} $$ Substituting $\\alpha_i = 1 - \\beta_i$, so that $\\prod_{i=1}^t \\alpha_i = \\gamma_t$, we finally get the simplified form (2) discussed above.\nBackward process General idea The backward process is the reverse of the forward process: given a latent variable $z_T = x_T$, where $T$ is the total number of time steps of our diffusion model, following a known distribution (usually the standard Normal distribution, i.e., $z_T \\sim \\mathcal{N}(0, \\mathcal{I})$), we want to produce a variable $z_0 \\sim q(x_0)$, where $q(x_0)$ is the real image distribution. From $t$ to $t-1$, we use the conditional probability to sample the latent $z_{t-1}\\sim p(z_{t-1}|z_{t})$. 
This amounts to removing a certain amount of noise from $z_t$ to get $z_{t-1}$, which is the opposite of the forward process, where we add noise to the previous image at each time step. Mathematical explanation The key to this backward process is to understand how we sample $z_{t-1}$ from $z_t$, which amounts to estimating $p(z_{t-1}|z_{t})$: $$ p(z_{t-1}|z_t) = \\mathcal{N}(z_{t-1}|\\mu_{\\theta}(z_t, t), \\sigma_{\\theta}(z_t, t)) $$ where $\\mu_\\theta$ and $\\sigma_\\theta$ are learned by 2 neural networks. Similar to the forward process, letting $\\gamma_t = \\prod_{i=1}^t (1-\\beta_i)$, we have: $$ p(z_{t-1}|z_t) = \\mathcal{N}(z_{t-1}|\\mu_{\\theta}(z_t, \\gamma_t), \\sigma_{\\theta}(z_t, \\gamma_t)) $$\nOptimization To guide the backward process to invert the forward process, we minimize the Kullback-Leibler (KL) divergence between the joint distributions of the forward and reverse sequences:\n$$ \\begin{align} p_\\theta(z_0, \\ldots, z_T) \u0026amp;= p(z_T)\\prod_{t=1}^Tp_\\theta(z_{t-1}|z_t), \\br q(x_0, \\ldots, x_T) \u0026amp;= q(x_0)\\prod_{t=1}^Tq(x_t|x_{t-1}), \\end{align} $$\nwhich leads to minimizing:\n$$ \\begin{align} \u0026amp;\\text{KL}(q (x_0, \\ldots, x_T) \\| p_\\theta (z_0, \\ldots, z_T)) \\br \u0026amp;= - \\mathbb{E}_{q(x_0, \\ldots, x_T)} [\\log p_\\theta (z_0, \\ldots, z_T)] + c \\br \u0026amp;\\overset{(i)}{=} \\mathbb{E}_{q(x_0, \\ldots, x_T)} \\left[ - \\log p (z_T) - \\sum_{t=1}^T \\log \\frac{p_\\theta (z_{t-1} | z_t)}{q (x_t | x_{t-1})} \\right] + c \\br \u0026amp;\\overset{(ii)}{\\geq} \\mathbb{E} \\left[ - \\log p_\\theta (z_0) \\right] + c \\tag{4} \\end{align} $$\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2024-06-11/","summary":"Diffusion Models (DMs) include two processes: forward and backward.\nForward process General idea Degrading input data using noise iteratively, forward in time (i.e., $t$ increases). 
Given image $x_0 \\sim q(x_0)$, which called data distribution, forward process gradually adds Gauss noise thru $T$ time steps and produces latent $x_T$. At each time step $t$, we sample Gauss noise that following the distribution $\\mathcal{N}(\\sqrt{1 - \\beta_t} x_{t-1}, \\beta_t)$, where the hyper-parameters $0 \u0026lt; \\beta_{1:T} \u0026lt; 1$ represent the variance of noise incorporated at each time step.","title":"Diffusion Models"},{"content":"Determinant of Matrices The determinant of a square matrix $\\mathbf{A}$ of size $n \\times n$ is a mapping function from the space $\\mathbf{R}^{n \\times n} \\rightarrow \\mathbf{R}$, denoted as $det(A)$.\n$det(A)$ is used in various cases. For example, in 2-dimensional space, the absolute value of $det(A)$ is precisely the area of the shape formed by the column vectors in $A$. For instance, $A = \\begin{bmatrix} 2 \u0026amp; 0 \\\\ 0 \u0026amp; 3 \\end{bmatrix}$. If you plot it in 2-dimensional space, you will see the shape formed is a rectangle with dimensions $3 \\times 2$. The area of the shape is 6, which is also the result of $det(A) = (3 \\times 2) - (0 \\times 0)$.\nIn fact, the determinant $det(A)$ of the matrix $A$ tells us how many times the size (volume) changes after being transformed by $A$ compared to the original size. In other words, consider the matrix $A$ as a mapping function $f$, let $X$ be the space to be transformed. When transforming $X$ by the mapping function $f$, we obtain the space $Y$, i.e., $f(X) = Y$. At this point, $|det(A)| = \\frac{volume(Y)}{volume(X)}$.\nSome properties:\n$\\forall A \\in \\mathbf{R}^{n \\times n}, |A| = |A^T|$ $\\forall A, B \\in \\mathbf{R}^{n \\times n}, |AB| = |A||B|$ $\\forall A \\in \\mathbf{R}^{n \\times n}, det(A) = 0$ if and only if the matrix $A$ is singular. 
Eigenvalues and eigenvectors For a matrix $\\mathbf{A} \\in \\mathbf{R}^{n \\times n}$, we call $\\lambda \\in \\mathbf{C}$ and $v \\in \\mathbf{C}^n$ respectively an eigenvalue and eigenvector of $\\mathbf{A}$ if:\n$$ \\begin{aligned} Av = \\lambda v, v \\neq 0 \\end{aligned} $$\nBefore diving into finding $\\lambda$ and $v$, let\u0026rsquo;s talk about why we need to find eigenvectors and eigenvalues, or in other words, the importance of these two quantities. Look at the example below:\nLinear transformation changes size\nIn geometric transformations (scale, shear, rotation, \u0026hellip;), each transformation can be represented by a matrix $\\mathbf{A}$. In this example, the transformation is represented by the matrix $A = \\begin{bmatrix} 1 \u0026amp; 0 \\\\ 0 \u0026amp; 2 \\end{bmatrix}$.\nOf the 3 vectors (red, yellow, and green), the red and green vectors keep their direction after being transformed by $\\mathbf{A}$. These are exactly 2 eigenvectors of $\\mathbf{A}$.\nEigenvectors are vectors that, after being transformed by $\\mathbf{A}$, still lie within their original span. They can be reversed, change in magnitude, or both, but they do not stray from their span. In the example above, the span of the green vector is the $Oy$ axis, and the span of the red vector is the $Ox$ axis. For the yellow vector, although not drawn, its span is the line with the equation $x = y$. Put simply, the span of the yellow vector is the line that extends the original yellow vector in both directions and passes through the origin. After transformation, we see that the yellow vector is no longer within its span, so it is not an eigenvector of this transformation. The amount of change of each eigenvector corresponds to an eigenvalue of the transformation. For example, after transformation, the green vector is twice as long, corresponding to an eigenvalue of $\\lambda = 2$. 
The red vector does not change in magnitude, corresponding to $\\lambda = 1$.\nSo, what\u0026rsquo;s the significance of finding eigenvalues and eigenvectors? Welp, this is a tough question that I often find poorly explained in math materials or websites. From my personal perspective after a period of study, I believe eigenvectors represent the essence of a linear transformation. They help you have a clear and understandable view of how a transformation is performed.\nFinding eigenvalues and eigenvectors Now we will derive the 2 eigenvalues $\\lambda = 1$ and $\\lambda = 2$ in the example above mathematically.\nTo find the eigenvalues $\\lambda$ and eigenvectors $v$ of $\\mathbf{A}$, always start with the expression:\n$$ \\begin{aligned} Av = \\lambda v \\end{aligned} $$\nWhy do we have this expression? The meaning of this expression is: multiplying matrix $A$ with an eigenvector $v$ yields the same result as scaling the eigenvector $v$ by some $\\lambda$. This perfectly aligns with what we\u0026rsquo;ve learned earlier: after being transformed by $A$, the eigenvector $v$ only changes in direction or magnitude or both, but it remains within its span. $|\\lambda|$ is the degree of change in magnitude of $v$ after the transformation, while $\\lambda \u0026lt; 0$ means $v$ is reversed after transformation, and $\\lambda \u0026gt; 0$ means it keeps its direction. What we need to do now is to find $v$ and $\\lambda$ that satisfy this expression.\nMoving $\\lambda v$ to the left-hand side, we have:\n$$ \\begin{aligned} Av - \\lambda v \u0026amp;= 0\\\\ \\Leftrightarrow (A - \\lambda I)v \u0026amp;= 0 \\end{aligned} $$\nHere, we\u0026rsquo;re back to the problem of finding the null space for the matrix $A - \\lambda I$. If you\u0026rsquo;re not familiar with what the null space is, please see here.\nRemember, we have the constraint $v \\neq 0$, and recall that if the null space of a matrix (here $A - \\lambda I$) contains a vector other than the zero vector, then that matrix is not invertible. 
This is equivalent to the matrix being singular, that is, to its determinant being $0$.\nSo, the problem is reduced to finding conditions such that:\n$$ \\begin{aligned} det(A - \\lambda I) = 0 \\end{aligned} $$\nApplying to the matrix $A = \\begin{bmatrix} 1 \u0026amp; 0 \\\\ 0 \u0026amp; 2 \\end{bmatrix}$ in the example above, we have:\n$$ \\begin{aligned} \\begin{bmatrix} 1 \u0026amp; 0 \\\\ 0 \u0026amp; 2 \\end{bmatrix}v \u0026amp;= \\lambda v\\\\ \\Leftrightarrow \\begin{bmatrix} 1 - \\lambda \u0026amp; 0 \\\\ 0 \u0026amp; 2 - \\lambda \\end{bmatrix}v \u0026amp;= 0 \\end{aligned} $$\nTo have $v \\neq 0$, we must add the constraint:\n$$ \\begin{aligned} det(\\begin{bmatrix} 1 - \\lambda \u0026amp; 0 \\\\ 0 \u0026amp; 2 - \\lambda \\end{bmatrix}) \u0026amp;= 0 \\\\ \\Leftrightarrow (1 - \\lambda)(2 - \\lambda) \u0026amp;= 0 \\end{aligned} $$\nFrom here, we obtain 2 solutions:\n$$ \\begin{aligned} \\begin{cases} \\lambda = 1 \\\\ \\lambda = 2 \\end{cases} \\end{aligned} $$\nSo we have found 2 eigenvalues of the matrix $\\mathbf{A}$, in line with what was said in the previous section. Substituting each of these 2 eigenvalues in turn ($\\lambda = 1$, then $\\lambda = 2$), we can easily find $v$:\n$$ \\begin{aligned} \\begin{bmatrix} 1 - \\lambda \u0026amp; 0 \\\\ 0 \u0026amp; 2 - \\lambda \\end{bmatrix}v \u0026amp;= 0 \\\\ \\Leftrightarrow \\begin{cases} \\begin{bmatrix} 0 \u0026amp; 0 \\\\ 0 \u0026amp; 1 \\end{bmatrix} \\begin{bmatrix} x \\\\ y \\end{bmatrix} \u0026amp;= 0 \\\\ \\begin{bmatrix} -1 \u0026amp; 0 \\\\ 0 \u0026amp; 0 \\end{bmatrix} \\begin{bmatrix} x \\\\ y \\end{bmatrix} \u0026amp;= 0 \\\\ \\end{cases} \\end{aligned} $$\nThe result is that the matrix $\\mathbf{A}$ has infinitely many eigenvectors, with each eigenvector having the form $\\begin{bmatrix} x \\\\ 0 \\end{bmatrix}$ with $x \\neq 0$ or $\\begin{bmatrix} 0 \\\\ y \\end{bmatrix}$ with $y \\neq 0$. 
We see in the original example that the green vector has the form $\\begin{bmatrix} 0 \\\\ y \\end{bmatrix}$ and the red vector has the form $\\begin{bmatrix} x \\\\ 0 \\end{bmatrix}$. Therefore, these are 2 eigenvectors of the transformation $A$.\nReturning to finding eigenvalues and eigenvectors helps us understand the essence of the transformation. For $\\lambda = 1$, we obtain $v$ as vectors of the form $\\begin{bmatrix} x \\\\ 0 \\end{bmatrix}$; and for $\\lambda = 2$, $v$ are vectors satisfying $\\begin{bmatrix} 0 \\\\ y \\end{bmatrix}$. Looking back at the above illustration, the magnitude of the green vector after transformation by $A$ increases by a factor of 2, while the magnitude of the red vector remains unchanged, corresponding to $\\lambda = 1$. Therefore, we understand that the matrix $A$ represents a transformation that increases the magnitude of vectors whose span is the $Oy$ axis by a factor of 2, while preserving the magnitude of vectors whose span is the $Ox$ axis. This is the essence of the transformation $A$.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-08-21/","summary":"Determinants, eigenvalues, eigenvectors: geometric meaning, finding methods, and linear transformation essence.","title":"Determinant of matrices, eigenvalues and eigenvectors"},{"content":"Linear Independence and Linear Dependence A set is called linearly independent when the vectors in that set are linearly independent of each other.\nIn other words, we cannot represent a vector as a linear combination of any other vectors in the set. Otherwise, the set is linearly dependent.\nFor example: given a set of 3 vectors $v_1 = \\begin{bmatrix} 0 \\\\ 0 \\\\ 1 \\end{bmatrix} \\quad ; \\quad v_2 = \\begin{bmatrix} 1 \\\\ 0 \\\\ 1 \\end{bmatrix} \\quad ; \\quad v_3 = \\begin{bmatrix} 1 \\\\ 0 \\\\ 2 \\end{bmatrix}$\nThis set is not linearly independent because vector $v_3 = v_1 + v_2$. 
Note that a set containing the zero vector is certainly linearly dependent because the zero vector can be written as a linear combination of any vectors $v$ in the set.\nFor any matrix $A$, we can check if the vectors in $A$ are linearly independent by solving the system of equations $A\\mathbb{x} = \\mathbb{0}$. Why can we determine this through this method? Recall what we learned in the post about nullspace. In that post, we learned that when a matrix $A$ has all columns containing pivot elements, then $A\\mathbb{x} = \\mathbb{0}$ has only the trivial solution $x = \\mathbb{0}$. And when all columns of $A$ contain pivot elements, it means that no column of $A$ is a linear combination of the other columns $\\Leftrightarrow$ $A$ is linearly independent.\nThe column vectors of $A$ are linearly independent when the system $A\\mathbb{x} = \\mathbb{0}$ has only the trivial solution $x = \\mathbb{0}$.\nIn other words, the vector set $v_1, v_2, \u0026hellip;, v_n$ is linearly independent if and only if $x_1v_1 + x_2v_2 + \u0026hellip; + x_nv_n = \\mathbb{0}$ holds only with $x_1 = x_2 = \u0026hellip; = x_n = \\mathbb{0}$.\nAnother note: a set of 3 vectors belonging to the space $\\mathbf{R}^2$ is always linearly dependent.\nAssuming a matrix $A$ of size $2 \\times 3$, then the number of pivot elements $r$ will always be $\\leq 2$. Therefore, we always have at least 1 free variable and $A\\mathbb{x} = \\mathbb{0}$ will always have a non-zero solution.\nAny set of $n$ vectors $\\in \\mathbf{R}^m$ are linearly dependent if $m \u0026lt; n$.\nWhen the number of pivot elements $r = $ the number of columns of the matrix $n$, then the matrix has full column rank. 
In the post about solving systems $A\\mathbb{x} = b$, we learned that full column rank indicates that the columns of the matrix are linearly independent.\nThe columns of $A$ are linearly independent when $r = n \\Leftrightarrow $ the system of equations $A\\mathbb{x} = \\mathbb{0}$ has no free variables\nSpan A set of vectors $S$ spans the vector space $V$ if every vector in $V$ can be represented as a linear combination of vectors in $S$.\nThis means if every vector $v \\in V$ can be written as a linear combination of vectors $s \\in S$, then we say $S$ spans $V$.\nEquivalent ways of saying this are: $V$ is spanned by $S$, or $V$ is the linear span of $S$.\nLet\u0026rsquo;s talk about the column space $C(A)$ of a matrix $A$.\n$C(A)$ contains all vectors $b$ for which the system of equations $A\\mathbb{x} = b$ has a solution.\nTherefore, these $b$ are exactly the linear combinations of the columns of $A$, which is the same as saying that the column vectors of $A$ span the column space of $A$.\nFor example: the set of vectors $v_1 = \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix}$ and $v_2 = \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix}$ spans the 2-dimensional space $\\mathbf{R}^2$.\nIt\u0026rsquo;s evident because any vector $u \\in \\mathbf{R}^2$ can be written as $u = c_1v_1 + c_2v_2$.\nHowever, the set of vectors $w_1 = \\begin{bmatrix} 1 \\\\ 1 \\end{bmatrix}$ and $w_2 = \\begin{bmatrix} -1 \\\\ -1 \\end{bmatrix}$ only spans a line in $\\mathbf{R}^2$.\nBasis and Dimension A basis of a vector space is a set of vectors $v_1, v_2, \u0026hellip;, v_n$ that satisfy 2 properties:\n$v_1, v_2, \u0026hellip;, v_n$ are linearly independent $v_1, v_2, \u0026hellip;, v_n$ span the vector space This means if the vectors in the spanning set $S$ are linearly independent, then $S$ is a basis of the vector space generated by $S$.\nA basis is a set with just enough vectors to span a space.\nAs in the example above, we see that the set $v_1 = \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix}$ and $v_2 
= \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix}$ spans $\\mathbf{R}^2$, and since these two vectors are linearly independent, they form a basis of $\\mathbf{R}^2$.\nHowever, not every set of 2 vectors is a basis for $\\mathbf{R}^2$ ($w_1 = \\begin{bmatrix} 1 \\\\ 1 \\end{bmatrix}$ and $w_2 = \\begin{bmatrix} -1 \\\\ -1 \\end{bmatrix}$ is an example: they are linearly dependent and only span a line).\nSimilarly, not every set of 3 vectors is a basis for $\\mathbf{R}^3$. When they are linearly dependent, they only span a line or a 2-dimensional plane.\nThe vectors $v_1, v_2, \u0026hellip;, v_n$ form a basis of the space $\\mathbf{R}^n$ if and only if they are the columns of an invertible matrix of size $n \\times n$\nHence, $\\mathbf{R}^n$ has infinitely many different bases.\nFor example, the columns of matrix $A = \\begin{bmatrix} 1 \u0026amp; 0 \u0026amp; 0 \\\\ 1 \u0026amp; 1 \u0026amp; 0 \\\\ 1 \u0026amp; 0 \u0026amp; 1 \\end{bmatrix}$ form a basis of the space $\\mathbf{R}^3$.\nThe columns of matrix $B = \\begin{bmatrix} 1 \u0026amp; 0 \u0026amp; 0 \\\\ 0 \u0026amp; 1 \u0026amp; 0 \\\\ 0 \u0026amp; 0 \u0026amp; 1 \\end{bmatrix}$ also form a basis of the space. This is called the standard basis of $\\mathbf{R}^3$.\nMoreover, with $v_1, v_2, \u0026hellip;, v_n$ being vectors in a basis of $\\mathbf{R}^n$, the linear combination to create a vector $v$ is unique.\nIt means, if $v = a_1v_1 + a_2v_2 + \u0026hellip; + a_nv_n$ and $v = b_1v_1 + b_2v_2 + \u0026hellip; + b_nv_n$, then $a_i = b_i$ for $1 \\leq i \\leq n$.\nIndeed, subtracting the two expressions for $v$ yields $v - v = (a_1 - b_1)v_1 + (a_2 - b_2)v_2 + \u0026hellip; + (a_n - b_n)v_n = \\mathbb{0}$. Since $v_1, v_2, \u0026hellip;, v_n$ are linearly independent, $a_i - b_i = 0 \\Leftrightarrow a_i = b_i$.\nA question I had when starting this chapter was: can there exist a basis for $\\mathbf{R}^n$ with a number of vectors different from $n$?\nThe answer is: NO. 
Every basis of a vector space must have the same number of vectors.\nThis leads to a concept of the dimension of a vector space:\nThe number of vectors in any and every basis of a vector space is the dimension of that space\nThis means if $v_1, v_2, \u0026hellip;, v_n$ and $w_1, w_2, \u0026hellip;, w_m$ are both bases of a space, then $n = m$.\nThe number of vectors in a basis depends on the dimension of the space. The space $\\mathbf{R}^n$ has $n$ vectors in any basis of it.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-08-07/","summary":"Linear independence, span, basis, dimension: fundamental concepts for vector spaces and subspaces.","title":"Span, basis, and dimension"},{"content":"Four fundamental subspaces in Linear Algebra - What are they? Row space ($C(A^{T})$), a subspace of $\\mathbf{R}^n$ Column space ($C(A)$), a subspace of $\\mathbf{R}^m$ Nullspace ($N(A)$), a subspace of $\\mathbf{R}^n$ Left nullspace ($N(A^T)$), a subspace of $\\mathbf{R}^m$ We have extensively covered the column space and nullspace.\nThe row space contains all linear combinations of the rows in a matrix.\nBut what if we don\u0026rsquo;t like dealing with rows and only want to work with column vectors? Then you can transpose the matrix $A$ to $A^T$, which means the rows in $A$ are now the columns in $A^T$. Hence, it can be stated that the row space of matrix $A$ is the same as the column space of $A^T$. Both of these spaces can be denoted using the symbol $C$.\nSimilarly, the left nullspace of matrix $A$ is the nullspace of matrix $A^T$.\nTo find the left nullspace of $A$, we solve the equation $A^Ty = 0$. 
It\u0026rsquo;s called the left nullspace because, when we transpose both sides of the equation, we have:\n$$ \\begin{aligned} (A^{T}y)^{T} \u0026amp; = 0^{T} \\\\ \\Leftrightarrow y^{T}(A^{T})^{T} \u0026amp; = 0^{T} \\\\ \\Leftrightarrow y^{T}A \u0026amp; = 0^{T} \\end{aligned} $$\nWe see that now $y^T$ is on the left side of $A$, and $y^T$ and $0^T$ are row vectors.\nDimensions of the Four subspaces Let\u0026rsquo;s determine the basis and dimension of the four vector spaces mentioned above. Recall that the dimension of a vector space is the number of vectors in any basis of that space. A basis of a vector space is a set of vectors with two properties: linear independence and spanning the vector space.\nColumn Space Consider a matrix $A$ of size $m \\times n$ and rank $r$.\nThe dimension of the column space of $A$ is $dim(C(A)) = r$. The pivot columns in $A$ form a basis of $C(A)$.\nThe rank of a matrix $r(A)$ is the number of pivot columns of matrix $A$, and the pivot columns are linearly independent vectors in matrix $A$. 
Furthermore, the columns of $A$ span the column space $C(A)$.\nTherefore, the set of pivot columns of matrix $A$ is the basis of the column space $C(A)$.\nAnd the number of vectors in the basis is the dimension of the vector space, so $dim(C(A)) = r$.\nConsider the example: $A = \\begin{bmatrix} 1 \u0026amp; 3 \u0026amp; 5 \u0026amp; 0 \u0026amp; 7 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2\\\\ 1\u0026amp; 3 \u0026amp; 5\u0026amp;1 \u0026amp;9 \\end{bmatrix}$\nApplying row operations to reduce it to row-echelon form, we have:\n$$ \\begin{aligned} A \u0026amp; = \\begin{bmatrix} 1 \u0026amp; 3 \u0026amp; 5 \u0026amp; 0 \u0026amp; 7 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2\\\\ 1\u0026amp; 3 \u0026amp; 5\u0026amp;1 \u0026amp;9 \\end{bmatrix} \\rightarrow R \u0026amp; = \\begin{bmatrix} 1 \u0026amp; 3 \u0026amp; 5 \u0026amp; 0 \u0026amp; 7 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2\\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\end{aligned} $$\nWe see that $r(A) = r(R) = 2$.\nNote: $C(A) \\neq C(R)$!. The reason is that during the transformation from $A$ to $R$, we performed row operations. These operations ensure the preservation of the row space but not the column space.\nHowever, the number of vectors $dim(C(A)) = dim(C(R))$. 
Columns 1 and 4 of $R$ form a basis of $C(R)$; similarly, columns 1 and 4 of $A$ form a basis of $C(A)$.\nRow Space A wonderful fact: the dimension of the row space $dim(C(A^T)) = dim(C(A)) = r$.\nTaking the matrix $A$ from the previous example, to find the rank of $A^T$, we first need to transpose $A$ to $A^T$:\n$$ \\begin{aligned} A \u0026amp; = \\begin{bmatrix} 1 \u0026amp; 3 \u0026amp; 5 \u0026amp; 0 \u0026amp; 7 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2\\\\ 1\u0026amp; 3 \u0026amp; 5\u0026amp;1 \u0026amp;9 \\end{bmatrix} \\rightarrow A^T \u0026amp; = \\begin{bmatrix} 1 \u0026amp; 0 \u0026amp; 1 \\\\ 3 \u0026amp; 0 \u0026amp; 3 \\\\ 5 \u0026amp; 0 \u0026amp; 5 \\\\ 0 \u0026amp; 1 \u0026amp; 1 \\\\ 7\u0026amp; 2 \u0026amp;9 \\end{bmatrix} \\rightarrow R^T \u0026amp; = \\begin{bmatrix} 1 \u0026amp; 0 \u0026amp; 1 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \\\\ 0 \u0026amp; 1 \u0026amp; 1 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\rightarrow R \u0026amp; = \\begin{bmatrix} 1 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 0 \\\\ 1 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 0 \\end{bmatrix} \\end{aligned} $$\nWe have $R^T$ (and $A^T$) with 2 pivot columns, which are column 1 and column 2.\nIt can be observed that the rows in $R$ span the row space $C(R^T)$. Similarly, the rows in $A$ span the row space $C(A^T)$. However, row 3 of $A$ is a linear combination of rows 1 and 2, so in this case, a basis of $C(A^T)$ is the first 2 rows of $A$.\nIn general, a basis of the row space $C(A^T)$ is given by the $r$ nonzero rows of the row-echelon form of $A$\nFrom the above examples, we can deduce $dim(C(A^T)) = dim(C(A)) = r(A) = 2$.\nNullspace The dimension of the nullspace of $A$ is $dim(N(A)) = n - r$. The special solutions to the equation $Ax = 0$ form a basis of $N(A)$.\nWhen solving the equation $Ax = 0$, for each free column $A_i$ of $A$, corresponding to the free variable $x_i$, we will find one special solution. 
Thus, the number of special solutions equals the number of free columns of $A$, which is $n - r$, where $r$ is the number of pivot columns (also the rank of $A$).\nSince these special solutions to the system $Ax = 0$ are linearly independent and every solution is a linear combination of them, they form a basis for the null space $N(A)$.\nLeft Nullspace To find the left nullspace of $A$, we solve the equation $A^Ty = 0$, or equivalently $y^TA = 0^T$. For $R = rref(A)$ the analogous systems are $R^Ty = 0$ and $y^TR = 0^T$; they give the left nullspace of $R$, which has the same dimension as the left nullspace of $A$ but is in general a different subspace, because row operations do not preserve it.\nThe left nullspace of $A$ has dimension $m - r$\nContinuing with $A = \\begin{bmatrix} 1 \u0026amp; 3 \u0026amp; 5 \u0026amp; 0 \u0026amp; 7 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2\\\\ 1\u0026amp; 3 \u0026amp; 5\u0026amp;1 \u0026amp;9 \\end{bmatrix}$ as an example, I will solve the system $y^{T}R = 0^{T}$:\n$$ \\begin{aligned} y^{T}R = 0^{T} \u0026amp;\\Leftrightarrow \\begin{bmatrix} y_1 \u0026amp; y_2 \u0026amp; y_3 \\end{bmatrix} \\begin{bmatrix} 1 \u0026amp; 3 \u0026amp; 5 \u0026amp; 0 \u0026amp; 7 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} = \\begin{bmatrix} 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\\\ \\end{aligned} $$\nThe system above can be rewritten as:\n$$ \\begin{aligned} \u0026amp; y_1\\begin{bmatrix}1 \u0026amp; 3 \u0026amp; 5 \u0026amp; 0 \u0026amp; 7 \\end{bmatrix} \\\\ + \u0026amp; y_2\\begin{bmatrix}0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2\\end{bmatrix} \\\\ + \u0026amp; y_3\\begin{bmatrix}0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0\\end{bmatrix} \\\\ = \u0026amp; \\begin{bmatrix}0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0\\end{bmatrix} \\end{aligned} $$\nIt is easy to see that to satisfy the system above, we need $y_1 = y_2 = 0$, while $y_3$ can be any value.\nThe left nullspace of $R$ contains all vectors $y = (0, 0, y_3)$. For $A$ itself, row 3 is the sum of rows 1 and 2, so its left nullspace contains all multiples of $y = (1, 1, -1)$.\nIn a matrix $R$ with rank $r$ and size $m \\times n$, there will always be $m - r$ rows 
of 0. Any linear combination of these $m - r$ zero rows will always give the vector $0$. The remaining $r$ nonzero rows of $R$ are linearly independent, so only the trivial combination of them gives the vector $0$. Thus, $y$ in the left nullspace of $R$ is a vector with $y_1 = y_2 = \u0026hellip; = y_r = 0$.\nIf $A$ is a matrix of size $m \\times n$ and rank $r$, the left nullspace of $A$ has dimension $dim(N(A^T)) = m - r$.\nThe Big Picture To have an overview of this article, take a look at the image below:\nThe overview of the four fundamental subspaces in linear algebra\nHere are some properties you need to remember:\n1. Matrices $A$ and $R$ have the same row space. These spaces have the same dimension $r$ and the same basis.\nThe reason is that as mentioned above, the transformation from $A$ to $R$ through row operations preserves the row space but changes the column space.\n2. In matrix $A$, the number of linearly independent column vectors equals the number of linearly independent row vectors.\nThis means that we have $dim(C(A)) = dim(C(A^T)) = r$.\n3. The nullspace of $A$ and $R$ are the same, with the same dimension $n-r$ and the same basis.\nThis is because the transformations $A \\rightarrow R$ do not change the solutions. For each free variable $x_i$ in the system $Ax = 0$ or $Rx = 0$, we can find a special solution. Thus, $dim(N(A)) = dim(N(R)) = n - r$.\n4. The left nullspace of $A$ (the nullspace of $A^T$) has dimension $m-r$. We will use the counting theorem to prove this.\nFirst, the column space of $A^T$ is the same as the row space of $A$. According to property 2, we have $dim(C(A)) = dim(C(A^T)) = r$. Moreover, $A^T$ now has size $n \\times m$, meaning it has $m$ columns.\nFrom the above and using the counting theorem (rank plus dimension of the nullspace equals the number of columns), we will have: $r + x = m$, where $x$ is the dimension of the nullspace of $A^T$, or in other words, $x = m-r$. 
Thus, the left nullspace of $A$ (also the nullspace of $A^T$) has dimension $dim(N(A^T)) = m-r$.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-08-14/","summary":"Four fundamental subspaces: row space, column space, nullspace, left nullspace with dimensions and relationships.","title":"The four fundamental subspaces in Linear Algebra"},{"content":"Conditions for $A\\mathbb{x} = \\mathbb{b}$ to have solutions In the article about vector space, we learned that $A\\mathbb{x} = \\mathbb{b}$ has solutions for some $b$ and is unsolvable for other $b$. Unlike the nullspace, here we only consider the case when $b \\neq 0$.\nConsider an example, given the matrix $$A = \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 2 \u0026amp; 4 \u0026amp; 6 \u0026amp; 8 \\\\ 3 \u0026amp; 6 \u0026amp; 8 \u0026amp; 10 \\end{bmatrix}$$\nSuppose we are solving the system of equations $A\\mathbb{x} = \\mathbb{b}$ with $b = (b_1; b_2; b_3)$, then the first thing we are sure about is that $b$ must belong to the column space of $A$. 
Since we see that row 3 of $A$ is the sum of rows 1 and 2, we have $b_1 + b_2 = b_3$.\nThe way to prove that $A\\mathbb{x} = \\mathbb{b}$ has solutions is to use row reduction on the augmented matrix $[A \\quad b]$:\n$$ \\begin{aligned} [A \\quad b] \\Leftrightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \u0026amp; b_1 \\\\ 2 \u0026amp; 4 \u0026amp; 6 \u0026amp; 8 \u0026amp; b_2\\\\ 3 \u0026amp; 6 \u0026amp; 8 \u0026amp; 10 \u0026amp; b_3 \\end{bmatrix} \u0026amp; \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \u0026amp; b_1 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \u0026amp; b_2 - 2b_1\\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \u0026amp; b_3 - 3b_1 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \u0026amp; b_1 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \u0026amp; b_2 - 2b_1\\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; b_3 - b_2 - b_1 \\end{bmatrix}\\\\ \u0026amp; \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 0 \u0026amp; -2 \u0026amp; 3b_1 - b_2\\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \u0026amp; b_2 - 2b_1\\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; b_3 - b_2 - b_1 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 0 \u0026amp; -2 \u0026amp; 3b_1 - b_2\\\\ 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2 \u0026amp; \\frac{b_2 - 2b_1}{2}\\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; b_3 - b_2 - b_1 \\end{bmatrix} = [R \\quad d] \\end{aligned} $$\nWe see that now the matrix after transforming $R$ has the last row as a zero vector.\nTherefore, in this system of equations, the condition for $Rx = d$ to have a solution is that the vector $d$ must have the form $(d_1; d_2; 0)$. 
In other words, for $A\\mathbb{x} = \\mathbb{b}$ to have a solution in this case, the vector $b$ must have the form $b_3 - b_2 - b_1 = 0$, for example, $b = (1; 5; 6)$.\nNote that the solution set $x$ of the system $A\\mathbb{x} = \\mathbb{b}$ and $Rx = d$ are the same.\nFinding the complete solution of Ax = b After finding the condition for the system of equations to have a solution, we will try to find a particular solution. To do that, we will transform the matrix into echelon form to determine the pivot and free variables. Like the example above, $R$ has 2 pivot variables $x_1$ and $x_3$, meaning we have free variables $x_2$ and $x_4$.\nDifferent from solving the system $A\\mathbb{x} = \\mathbb{0}$, here we will set all free variables to 0, meaning $x_2 = x_4 = 0$. To understand more, please review my previous posts about echelon form matrices. Now since we have transformed $A\\mathbb{x} = \\mathbb{b}$ into $Rx = d$, the system looks like:\n$$ \\begin{aligned} \u0026amp; \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 0 \u0026amp; -2 \\\\ 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ 0 \\\\ x_3 \\\\ 0 \\end{bmatrix} = \\begin{bmatrix} -2 \\\\ 1.5\\\\ 0 \\end{bmatrix}\\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 = -2\\\\ x_3 = \\frac{3}{2} \\end{cases} \\end{aligned} $$\nSo, we have a particular solution $x_p = (-2; 0; \\frac{3}{2}; 0)$.\nWe only have exactly one particular solution $x_p$ for a system of equations $A\\mathbb{x} = \\mathbb{b}$. So how do we find other solutions of the system?\nWe will find $x_n$ such that $Ax_n = 0$.\nThe reason is that we can add $x_p$ to any $x_n$ on the left-hand side and still obtain $b$ on the right-hand side. 
This can be easily proven as follows:\n$$ \\begin{cases} Ax_p = b \\\\ Ax_n = 0 \\end{cases} \\Leftrightarrow A(x_p + x_n) = b $$\nTherefore, the solution set $x$ of the system of equations $A\\mathbb{x} = \\mathbb{b}$ is a linear combination of $x_p$ and $x_n$.\n$$ x = x_p + cx_n $$\nThe complete solution of $A\\mathbb{x} = \\mathbb{b}$ = one particular solution + all nullspace of $A$\nBack to the example, we will find the set $x_n$ such that $Rx_n = 0$. If you want to be hardcore, you can calculate $Ax_n = 0$ as well, it\u0026rsquo;s the same but you will have the opportunity to practice the transformation again.\nSince I have already instructed how to solve it in my previous post, I won\u0026rsquo;t solve it here. I\u0026rsquo;ll just spoil the answer: the set $x_n$ includes 2 vectors which are $\\begin{bmatrix} -2 \\\\ 1 \\\\ 0 \\\\ 0 \\end{bmatrix}$ and $\\begin{bmatrix} 2 \\\\ 0 \\\\ -2 \\\\ 1 \\end{bmatrix}$.\nSo the complete solution of $A\\mathbb{x} = \\mathbb{b}$ is written as follows:\n$$ x_{complete} = \\begin{bmatrix} -2 \\\\ 0 \\\\ \\frac{3}{2} \\\\ 0 \\end{bmatrix} + c_1\\begin{bmatrix} -2 \\\\ 1 \\\\ 0 \\\\ 0 \\end{bmatrix} + c_2\\begin{bmatrix} 2 \\\\ 0 \\\\ -2 \\\\ 1 \\end{bmatrix} $$\nThe relationship between the rank of a matrix and the solution set $A\\mathbb{x} = \\mathbb{b}$ Recalling the rank of a matrix is the number of pivot elements in that matrix.\nSuppose we have a matrix of size $m \\times n$, the rank $r$ of the matrix must obey the constraints $r \\leq m, r \\leq n$. We will have 2 separate cases when $r = n$ and $r = m$.\nThe case $r = n$ is when the number of pivot elements = the number of columns in the matrix. 
We call this full column rank.\nConversely, when $r = m$, meaning the number of pivot elements = the number of rows in the matrix, we call it full row rank.\nThe possibilities for a linear system $A\\mathbb{x} = \\mathbb{b}$ when $A$ has rank $r$ are:\n$r = n$ and $r \u0026lt; m$ $\\Leftrightarrow Ax = b$ has 0 or 1 solution $r = m$ and $r \u0026lt; n$ $\\Leftrightarrow Ax = b$ has infinitely many solutions $r = m$ and $r = n$ $\\Leftrightarrow Ax = b$ has a unique solution $r \u0026lt; m$ and $r \u0026lt; n$ $\\Leftrightarrow Ax = b$ has 0 or infinitely many solutions For each possibility, the matrix $R$ also has its own characteristics. We will examine each case specifically. I will update the last case when I truly understand it.\nCase $r = n \u0026lt; m$ Considering the case $r = n \u0026lt; m$, please take about 2 - 3\u0026rsquo; to note down and think about what will happen when a matrix $A$ has the same number of pivot elements as columns?\nThen the matrix $A$ has full column rank, which means $A$ has no free variables, because all columns contain pivots.\nAdditionally, full column rank indicates that the columns in that matrix are linearly independent. No column can be represented as a linear combination of other columns.\nReturning to the problem $A\\mathbb{x} = \\mathbb{0}$, since there are no free variables, the system of equations only has a unique solution $x = 0$. 
Moreover, whenever $A\\mathbb{x} = \\mathbb{b}$ is solvable, its only solution is $x = x_p$.\nExample: $A = \\begin{bmatrix} 1 \u0026amp; 1 \\\\ 1 \u0026amp; 2 \\\\ -2 \u0026amp; -3 \\end{bmatrix} \\quad ; \\quad b = (b_1; b_2; b_3)$\nFirstly, to solve the system $A\\mathbb{x} = \\mathbb{b}$, we find the condition for the system of equations to have a solution.\n$$ [A \\quad b] = \\begin{bmatrix} 1 \u0026amp; 1 \u0026amp; b_1 \\\\ 1 \u0026amp; 2 \u0026amp; b_2 \\\\ -2 \u0026amp; -3 \u0026amp;b_3 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 1 \u0026amp; b_1 \\\\ 0 \u0026amp; 1 \u0026amp; b_2 - b_1 \\\\ 0 \u0026amp; -1 \u0026amp; b_3 + 2b_1 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 0 \u0026amp; 2b_1 - b_2 \\\\ 0 \u0026amp; 1 \u0026amp; b_2 - b_1 \\\\ 0 \u0026amp; 0 \u0026amp; b_3 + b_2 + b_1 \\end{bmatrix} = [R \\quad d] $$\nThus, the condition to find a solution for $A\\mathbb{x} = \\mathbb{b}$ is $b_1 + b_2 + b_3 = 0$. Through the transformation process from $A$ to $R$, we see that $R = \\begin{bmatrix} I \\\\ 0 \\end{bmatrix}$, with $I$ being the identity matrix of size $n \\times n$ sitting on top of $m - n$ rows of zeros.\nWhen encountering $R$ in this form, it is a sign that $Rx = d$ or $A\\mathbb{x} = \\mathbb{b}$ has 0 or 1 solution.\nA matrix $A$ with full column rank has the following properties:\nAll columns of $A$ are pivot columns (clearly because every column of $A$ contains a pivot) The matrix has no free variables, so there is no special solution The nullspace $N(A)$ contains only the zero vector If $A\\mathbb{x} = \\mathbb{b}$ satisfies the condition to have a solution, then only one $x$ can be found. 
Otherwise, $A\\mathbb{x} = \\mathbb{b}$ is unsolvable\nCase $r = m \u0026lt; n$ Contrary to the full column rank matrix, the full row rank matrix indicates that the rows of matrix $A$ are linearly independent.\nA matrix with full row rank will not have any row of zeros (because every row contains a pivot).\nFor a matrix $A$ with $m$ rows and $n$ columns, containing $r = m$ pivots means we have $n - m$ (or $n - r$) free variables. Thus, for any $A\\mathbb{x} = \\mathbb{b}$ that satisfies the condition to have a solution and $A$ is a matrix with full row rank, we will always find a complete solution for each $b$.\nConsidering the case $r = m \u0026lt; n$, let\u0026rsquo;s take the following example:\n$$ \\begin{aligned} [A \\quad b] = \\begin{bmatrix} 1 \u0026amp; 1 \u0026amp; 1 \u0026amp; 3 \\\\ 1 \u0026amp; 2 \u0026amp; -1 \u0026amp; 4 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 1 \u0026amp; 1 \u0026amp; 3 \\\\ 0 \u0026amp; 1 \u0026amp; -2 \u0026amp; 1 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 0 \u0026amp; 3 \u0026amp; 2 \\\\ 0 \u0026amp; 1 \u0026amp; -2 \u0026amp; 1 \\end{bmatrix} = [R \\quad d] \\end{aligned} $$\nIn the full row rank case, we no longer need to find the condition for $A\\mathbb{x} = \\mathbb{b}$ to have a solution, because we are sure to always find at least one $x$ that satisfies the system.\nReturning to the example above, we see that $R$ has the form $\\begin{bmatrix} I \u0026amp; F \\end{bmatrix}$.\n$F$ here tells us the solution of the system $A\\mathbb{x} = \\mathbb{0}$, meaning we have $x_n = (-3; 2; 1)$ satisfies $Ax_n = 0$. Why is there a 1? Because corresponding to $x_3$ it\u0026rsquo;s a free variable, so according to the method of solving the system $A\\mathbb{x} = \\mathbb{0}$, we will assign 1 to find the special solution.\nWe also infer the particular solution $x_p$ from $d$. 
In this system of equations, $x_p = (2; 1; 0)$ (because to find the particular solution $x_p$, we will assign all free variables = 0).\nA matrix $A$ with full row rank has the following properties:\nThe matrix $A$ does not have a row of zeros $A\\mathbb{x} = \\mathbb{b}$ always has at least one solution Column space $C(A) = \\mathbf{R}^m$ (because there are $m$ rows and all rows are linearly independent)\nCase $r = m = n$ This case occurs with a square matrix $A$.\nTaking an example: $A = \\begin{bmatrix} 1 \u0026amp; 2 \\\\ 3 \u0026amp; 1 \\end{bmatrix}$\nThis is probably the easiest case in which to picture the matrix $R$: we have a square matrix with $r = m$ and $r = n$, so obviously $R$ is the identity matrix $I$.\n$ R = I = \\begin{bmatrix} 1 \u0026amp; 0 \\\\ 0 \u0026amp; 1 \\end{bmatrix} $\nBy now, you probably understand why for this case, the system of equations $Rx = d$ or its original form $A\\mathbb{x} = \\mathbb{b}$ has exactly one solution.\nProperties of a square matrix $A$ with full row and full column rank:\nThe number of pivot elements = the number of rows = the number of columns All rows of $A$ are linearly independent, and all columns of $A$ are linearly independent Full row rank implies full column rank, and vice versa $A$ is invertible Unique solution to the system of equations $A\\mathbb{x} = \\mathbb{b}$\nConclusion Through this post, we have a summary of the conditions for $A\\mathbb{x} = \\mathbb{b}$ to have solutions, as well as the properties of a matrix $A$ corresponding to these solutions.\nIn the next posts, we will continue to explore the null space $N(A)$, linear dependence, coordinate systems, and least squares solutions. 
Stay tuned for more exciting discoveries!\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-08-01/","summary":"Solving Ax=b: conditions for solutions, complete solution (particular + nullspace), rank relationships.","title":"Solving Ax = b"},{"content":"Nullspace For the equation $Ax = 0$, we often immediately think of the solution $x = 0$. But if that were always the case, this article might end here. However, life is rarely that simple, and neither is math. For a square matrix $A$, the solution $x = 0$ is unique if and only if $A$ is invertible. For non-invertible square matrices, there always exist $x \\neq 0$ satisfying $Ax = 0$. For each such $x$, we say that $x$ belongs to the nullspace of $A$.\nSuppose we have a matrix $A$ of size $m \\times n$, the nullspace of $A$ is defined as follows:\nThe nullspace $N(A)$ contains all $x$ satisfying $Ax = 0$. These vectors $x$ lie in the space $\\mathbf{R}^{n}$.\n$N(A)$ is also a subspace. If $N(A)$ contains $x$ and $y$, meaning $Ax = 0$ and $Ay = 0$, then $A(x+y) = Ax + Ay = 0$. Additionally, for any scalar $c$, we also have $A(cx) = c(Ax) = c(0) = 0$. Thus, $x+y$ and $cx$ still lie in the nullspace of $A$, satisfying the conditions for $N(A)$ to be a subspace.\nFurthermore, in the article about column space, we have $C(A) \\subseteq \\mathbf{R}^m$. Now we have $N(A) \\subseteq \\mathbf{R}^{n}$. Pay attention to avoid confusion.\nSuppose we have the equation:\n$$ \\begin{bmatrix} 1 \u0026amp; 1 \u0026amp; 2 \\\\ 2 \u0026amp; 2 \u0026amp; 4 \\\\ 3 \u0026amp; 1 \u0026amp; 4 \\\\ 2 \u0026amp; 1 \u0026amp; 3 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ x_2 \\\\ x_3 \\end{bmatrix} = \\begin{bmatrix} 0 \\\\ 0 \\\\ 0 \\\\ 0 \\end{bmatrix} $$\nFirst, we notice a rule in all rows: column 1 + column 2 - column 3 = 0.\nWe can choose any set of numbers satisfying this rule, for example, $(1; 1; -1)$, $(2; 2; -2)$, $(7; 7; -7)$. Generally, we obtain the solution set for the system of equations as vectors of the form $(a; a; -a)$. 
This is the nullspace $N(A)$. This vector space is a line in $\\mathbf{R}^3$, and $(1; 1; -1)$, $(2; 2; -2)$, $(7; 7; -7)$ are points on this line.\nSolving $Ax = 0$ The best way to find the nullspace of $A$ is to solve the linear system $Ax = 0$. In this section, I will guide you through the detailed steps to solve a system of equations $Ax = 0$.\nSpecial Solutions, Basic Variables, and Free Variables Example: let $A = \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 2 \u0026amp; 4 \u0026amp; 6 \u0026amp; 8 \\\\ 3 \u0026amp; 6 \u0026amp; 8 \u0026amp; 10 \\end{bmatrix}$\nFirst, we transform $A$ into row echelon form $U = \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix}$. The method can be reviewed here.\nAfter converting the matrix to row echelon form, we can use back-substitution to solve the system of equations $Ux = 0$.\n$$ \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ x_2 \\\\ x_3 \\\\ x_4 \\end{bmatrix} = 0 $$\nThe issue here is to substitute numbers into which $x$ values to find the remaining $x$\u0026rsquo;s. To determine this, we identify basic variables and free variables through pivot columns and free columns.\nContinuing with the above example, matrix $U$ has 2 pivots, meaning only columns 1 and 3 contain pivot elements. These are pivot columns of matrix $U$. Columns without pivot elements are called free columns. Pivot columns are also known as independent columns, meaning these columns are linearly independent of the other columns in the matrix. You can verify that the pivot columns are linearly independent (which cannot be written as a linear combination of any other columns in the matrix). 
Free columns can be written as linear combinations of the pivot columns.\nWe assign any value to the $x$ corresponding to the free columns in matrix $U$. These $x$\u0026rsquo;s are called free variables. Then, we find values for the remaining $x$\u0026rsquo;s.\nIn the example above, I assign $x_2 = 1$ and $x_4 = 0$. The equation $Ux = 0$ is then equivalent to:\n$$ \\begin{align*} \u0026amp; \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ 1 \\\\ x_3 \\\\ 0 \\end{bmatrix} \u0026amp;\u0026amp;=0 \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 + 2 + 2x_3 \u0026amp;\u0026amp;= 0 \\\\ 2x_3 \u0026amp;\u0026amp;= 0 \\end{cases} \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 \u0026amp;\u0026amp;= -2 \\\\ x_3 \u0026amp;\u0026amp;= 0 \\end{cases} \\end{align*} $$\nThus, we find a special solution $x = (-2; 1; 0; 0)$. This is a special solution because we assign 1 to exactly one free variable. If there are $k$ free variables, we find $k$ special solutions correspondingly: for each one, we assign 1 to a single free variable and set the remaining free variables to 0. You might wonder why we need to find special solutions. Because we have a definition like this:\nThe nullspace of $A$ contains every linear combination of the special solutions to $Ax = 0$.\nIt means that if there are no free variables in matrix $A$ (equivalent to all columns of $A$ being pivot columns), then $Ax = 0$ has only the trivial solution $x = 0$.\nIs it enough to say that the nullspace $N(A)$ is all vectors of the form $(a; -0.5a; 0; 0)$? Actually, it\u0026rsquo;s not. To describe the nullspace fully, we need to find all the special solutions. We have a $3 \\times 4$ matrix with 2 pivot columns. The number of special solutions we have is $4 - 2 = 2$. 
To find the remaining special solution, we set $x_2 = 0$ and $x_4 = 1$.\n$$ \\begin{align*} \u0026amp; \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ 0 \\\\ x_3 \\\\ 1 \\end{bmatrix} \u0026amp;\u0026amp;= 0 \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 + 2x_3 + 2 \u0026amp;\u0026amp;= 0\\\\ 2x_3 + 4 \u0026amp;\u0026amp;= 0 \\end{cases} \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 \u0026amp;\u0026amp;= 2\\\\ x_3 \u0026amp;\u0026amp;= -2 \\end{cases} \\end{align*} $$\nWe find the second special solution $x^{*} = (2; 0; -2; 1)$.\nSo, the nullspace $N(A) = c\\begin{bmatrix} -2 \\\\ 1 \\\\ 0 \\\\ 0 \\end{bmatrix} + d\\begin{bmatrix} 2 \\\\ 0 \\\\ -2 \\\\ 1 \\end{bmatrix}$.\nReduced Row Echelon Form With the method of using elimination to transform $A$ to $U$ as above, we have found the nullspace of $A$. However, the problem does not stop here. We can further simplify $U$ to a reduced row echelon form $rref$ by continuing to apply elimination to the rows of $U$.\nThe reduced row echelon form matrix has basic elements as 1 and the elements above and below the basic element are 0. To transform a row echelon matrix to reduced row echelon form, we have the following 2 steps:\nGenerate 0s above the basic element row: We use the basic element\u0026rsquo;s row to eliminate the rows above Transform basic elements to 1s: Divide the row vector containing the basic element by that basic element Let\u0026rsquo;s try transforming matrix $U$ in the above example into its reduced row echelon form. 
The reduced form matrix, denoted as $R$, is:\n$$ \\begin{aligned} \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 0 \u0026amp; -2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 0 \u0026amp; -2 \\\\ 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\end{aligned} = R = rref(A) $$\nInstead of solving $Ux=0$, we will find the solution of the system of equations $Rx = 0$. Since the matrix has 2 free variables, we can still set each of these variables to 1 in turn.\n$$ \\begin{align*} \u0026amp; \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 0 \u0026amp; -2 \\\\ 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ 1 \\\\ x_3 \\\\ 0 \\end{bmatrix} \u0026amp;\u0026amp;= 0 \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 + 2 \u0026amp;\u0026amp;= 0\\\\ x_3 \u0026amp;\u0026amp;= 0 \\end{cases} \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 \u0026amp;\u0026amp;= -2\\\\ x_3 \u0026amp;\u0026amp;= 0 \\end{cases} \\end{align*} $$\n$$ \\begin{align*} \u0026amp; \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 0 \u0026amp; -2 \\\\ 0 \u0026amp; 0 \u0026amp; 1 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ 0 \\\\ x_3 \\\\ 1 \\end{bmatrix} \u0026amp;\u0026amp;= 0 \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 - 2 \u0026amp;\u0026amp;= 0\\\\ x_3 + 2 \u0026amp;\u0026amp;= 0 \\end{cases} \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 \u0026amp;\u0026amp;= 2\\\\ x_3 \u0026amp;\u0026amp;= -2 \\end{cases} \\end{align*} $$\nWe still find 2 solutions: $(-2; 1; 0; 0)$ and $(2; 0; -2; 1)$. 
When solving $Rx = 0$, the system of equations we need to solve is simpler than $Ux = 0$. We can calculate the value of each basic variable independently, without considering the values of other basic variables. Thus, to solve the linear system $Ax = 0$, we went through 2 steps of transforming $A$ to $U$ and $R$. Since these transformation steps are linear, in essence, $N(A) = N(U) = N(R)$.\nFinally, I will summarize some key points about the nullspace as follows:\nThe elimination from $A$ to $U$ to $R$ does not change the nullspace of $A$.\nThe reduced row echelon form matrix $rref$ contains all basic elements as 1 and the elements above and below the basic element are 0.\nIf column $i$ in matrix $R$ has no basic element, we will find a special solution $Ax = 0$ with $x_i =1$.\nThe rank of the matrix is the number of basic elements of that matrix, which is also the number of non-zero rows of matrix $R$.\nAny $m \\times n$ matrix where $m \u0026lt; n$ has a nontrivial solution for $Ax = 0$.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-07-27/","summary":"Nullspace and solving Ax=0: special solutions, free variables, reduced row echelon form.","title":"Nullspace and solving Ax=0"},{"content":"Echelon Form To solve a system of linear equations $Ax = b$, we often use row elimination on matrix $A$ to bring the system into a form that can be solved using the substitution method. 
For example, suppose:\n$$ \\begin{align*} \u0026amp;\\begin{bmatrix} 1 \u0026amp; 2 \\\\ 1 \u0026amp; 3 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ x_2 \\end{bmatrix} = \\begin{bmatrix} 3 \\\\ 4 \\end{bmatrix} \\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 + 2x_2 \u0026amp;\u0026amp;= 3 \\\\ x_1 + 3x_2 \u0026amp;\u0026amp;= 4 \\end{cases} \\end{align*} $$\nSolving this system directly is inconvenient, so we\u0026rsquo;ll transform $A$ a bit into the following form:\n$$ \\begin{align*} \\begin{bmatrix} 1 \u0026amp; 2 \\\\ 1 \u0026amp; 3 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \\\\ 0 \u0026amp; 1 \\end{bmatrix} = U \\end{align*} $$\nBy subtracting row 1 from row 2 (on the right-hand side as well, so $4 - 3 = 1$), the system becomes:\n$$ \\begin{align*} \u0026amp;\\begin{bmatrix} 1 \u0026amp; 2 \\\\ 0 \u0026amp; 1 \\end{bmatrix} \\begin{bmatrix} x_1 \\\\ x_2 \\end{bmatrix} = \\begin{bmatrix} 3 \\\\ 1 \\end{bmatrix}\\\\ \\Leftrightarrow \u0026amp; \\begin{cases} x_1 + 2x_2 \u0026amp;\u0026amp;= 3\\\\ x_2 \u0026amp;\u0026amp;= 1 \\end{cases} \\end{align*} $$\nWe easily deduce $x_2 = 1$ and $x_1 = 1$. The matrix form $U$ is called the echelon form matrix. A matrix is in Echelon form if it satisfies 2 conditions:\nEither there are no zero rows (rows where all elements are zero) or the zero rows of the matrix are below the non-zero rows.\nThe leading element of each row is to the right of the leading element of the previous row.\nThe leading element is the first non-zero element in a row of the matrix. Examining matrix $U$, we see it satisfies both conditions above. Thus, $U$ is an echelon matrix.\nAnother example: let $A = \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 2 \u0026amp; 4 \u0026amp; 6 \u0026amp; 8 \\\\ 3 \u0026amp; 6 \u0026amp; 8 \u0026amp; 10 \\end{bmatrix}$. $A$ is a non-square matrix and its columns are linearly dependent (since column 2 of $A$ is a multiple of column 1). 
First, we perform row elimination on rows 2 and 3, subtracting 2 times row 1 from row 2 and 3 times row 1 from row 3:\n$$ \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 2 \u0026amp; 4 \u0026amp; 6 \u0026amp; 8 \\\\ 3 \u0026amp; 6 \u0026amp; 8 \u0026amp; 10 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\end{bmatrix} $$\nContinuing to eliminate the last row by subtracting row 2 from row 3, we have:\n$$ \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\end{bmatrix} \\rightarrow \\begin{bmatrix} 1 \u0026amp; 2 \u0026amp; 2 \u0026amp; 2 \\\\ 0 \u0026amp; 0 \u0026amp; 2 \u0026amp; 4 \\\\ 0 \u0026amp; 0 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix} = U $$\nRank of a Matrix In brief, the rank of a matrix is the number of leading elements in its echelon form. As seen in both examples above, since $U$ has 2 leading elements, we have $rank(U) = 2$.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-07-25/","summary":"Echelon form and matrix rank: row elimination, leading elements, and solving linear systems.","title":"Echelon Form and Rank of a matrix"},{"content":"Vector Spaces One of the most important vector spaces is $\\mathbf{R}^{n}$, containing real number vectors with $n$ elements. This is called an $n$-dimensional space. 
We have the first definition:\nThe space $\\mathbf{R}^{n}$ contains all column vectors $v$ with $n$ elements.\nWe can perform addition between two vectors in the space $\\mathbf{R}^{n}$ or multiply them by a scalar.\n$ \\begin{aligned} \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \\\\ \\end{bmatrix} + \\begin{bmatrix} 1 \\\\ 1 \\\\ 1 \\\\ 1 \\\\ \\end{bmatrix} = \\begin{bmatrix} 2 \\\\ 3 \\\\ 4 \\\\ 5 \\\\ \\end{bmatrix} ; \\quad \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \\\\ \\end{bmatrix} \\times 2 = \\begin{bmatrix} 2 \\\\ 4 \\\\ 6 \\\\ 8 \\\\ \\end{bmatrix} \\end{aligned} $\nAbove, we just performed addition of 2 vectors in the $\\mathbf{R}^{4}$ space and multiplied a vector by a scalar. A note to remember is that the results of these operations must still be within the $\\mathbf{R}^{4}$ space. Moreover, when performing calculations like the two expressions above, we also create linear combinations. There are 8 conditions to define a vector space. Let $v$ and $w$ be two vectors in the $\\mathbf{R}^n$ space, let $c$ be a number, the addition $v+w$ and multiplication $cv$ must satisfy the following conditions:\n$v + w = w + v$\n$(v + w) + z = v + (w + z)$, where $z$ is another vector in the same space as $v$ and $w$\n$c(v+w) = cv + cw$, where $c$ is a number\nthere must always exist a unique zero vector $0$ such that $v + 0 = v$ for every vector $v$\nfor every vector $v$, there always exists a unique vector $-v$ such that $v + (-v) = 0$\n$1 * v = v$\n$(c_{1} + c_{2})v = c_{1}v + c_{2}v$\n$(c_{1}c_{2})v = c_{1}(c_{2}v)$\nBesides $\\mathbf{R}^n$, we also have other vector spaces:\nM: Space of real number matrices with a size of $2 \\times 2$ F: Vector space of real functions $f(x)$ Z: Vector space containing only the zero vector In M, vectors are matrices. Space F is an infinite-dimensional space, while in space Z the only addition performed is $0 + 0 = 0$. In each case, we can add: matrix to matrix, function to function, zero vector to zero vector. We can also multiply a matrix, a function, or the zero vector by a scalar. 
The results of these additions and multiplications naturally still lie within the M, F, and Z spaces. The 8 conditions mentioned above can also be proven to be satisfied. The Z space is the smallest space, as it only contains a single zero vector. Every vector space contains a zero vector: zero matrix, zero function, and zero vector $[0; 0; 0; 0]$ in the $\\mathbf{R}^{4}$ space.\nSubspaces A subspace of $\\mathbf{R}^{n}$ is a set of vectors within $\\mathbf{R}^{n}$ that is itself a vector space. For example, in the $\\mathbf{R}^{3}$ space, a plane passing through the origin is a vector space. If we add any two vectors in the plane together or multiply a vector in the plane by a number, the resulting vector will still be in that plane. This leads to one of the most fundamental ideas in linear algebra, stated as follows:\nA subspace is a set of vectors (including the zero vector) that satisfies two properties: if $v$ and $w$ are two vectors in the subspace and $c$ is a number, then\n$v + w$ is in the subspace $cv$ is in the subspace (and $dw$, where $d$ is also a number) In other words, a subspace is closed under linear combinations: every linear combination of its vectors still belongs to the subspace. Because a subspace is a subset of a larger vector space, the addition and multiplication operations still obey the 8 conditions defining a vector space. Additionally, subspaces all contain the zero vector. Considering the $\\mathbf{R}^{3}$ space, any plane not passing through the origin $(0, 0, 0)$ is certainly not a subspace of $\\mathbf{R}^{3}$. 
Furthermore, lines passing through the origin are also subspaces: when multiplying any number by a vector on a line, we obtain a new vector still on that line, and similarly for addition.\nLastly, we have some examples:\nSubspaces of $\\mathbf{R}^{3}$ include:\nAll vectors in $\\mathbf{R}^{3}$ Any plane passing through the origin Any line passing through the origin The Z space (zero vector) Subspaces of $\\mathbf{R}^{2}$ include:\nAll vectors in $\\mathbf{R}^{2}$ Any line passing through the origin The $Z$ space (zero vector) Column Space This is the most important subspace directly associated with the matrix $A$. In linear algebra, to solve the equation $Ax = b$ if $A$ is not invertible, we can solve it for some $b$ and not for other $b$. If we find a good $b$, then $b$ can be written as $A$ times a vector $x$. These vectors $b$ are the column space of $A$.\nTo find the good vectors $b$ mentioned above, we have to find all linear combinations of the columns in $A$ (because $Ax$ is a combination of the columns in $A$, we have to find all possible vectors $x$). These linear combinations create the column space of $A$.\nIn summary, with $C(A)$ as the column space of $A$, then $C(A)$ will include not only the columns in $A$ but also all linear combinations $Ax$.\nThe column space contains all linear combinations of the columns in the matrix. These combinations are all the possible vectors $b$ = $Ax$.\nCalling the column space the most important space is because, to solve $Ax = b$ means we are representing $b$ as a linear combination of $Ax$, or in other words, $b$ must lie within $C(A)$.\nSuppose $A$ is a matrix of size $m \\times n$, then each column of $A$ contains $m$ elements. Thus, the column space of $A$ is a subspace within the set $\\mathbf{R}^m$ rather than $\\mathbf{R}^n$. 
Additionally, the combinations $Ax$ must satisfy the two rules for subspaces stated in the previous section.\nSuppose we have a set of vectors $S$ in space $V$. To find the subspace $SS$ of $V$, similarly, we find all possible linear combinations of the vectors in set $S$.\n$S = $ set of vectors belonging to $V$ (may not necessarily be a subspace)\n$SS =$ linear combinations of vectors in $S$ (creates a subspace within $V$)\nThen $SS$ is called the subspace of $V$ generated from $S$. This is also a basic way to generate a subspace.\nIf $S$ contains only a single vector $v$ other than the zero vector, the subspace $SS$ is the line passing through the origin and $v$. $SS$ is always the smallest subspace containing $S$.\nAn important note: considering $I$ as the identity matrix of the $\\mathbf{R}^{n}$ space ($I$ has $n$ rows and $n$ columns), we have $C(I) =$ all vectors in $\\mathbf{R}^{n}$. This is because all vectors are linear combinations of the columns in matrix $I$. You can easily prove this.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-07-23/","summary":"Vector spaces, subspaces, column space: 8 axioms, subspace properties, and linear combinations.","title":"Vector spaces and subspaces"},{"content":"Definition of the Problem In DL, the vanishing gradient is the phenomenon where the gradients at layers in a DL network become very small or even zero during backpropagation. But why does it matter whether the gradient becomes very small? Remember how a DL model learns: updating weights using the Gradient Descent algorithm. In GD, the parameter $\\theta$ is updated by the formula:\n$$ \\begin{aligned} \\theta = \\theta - \\eta \\nabla_{\\theta} J(\\theta) \\end{aligned} $$\nwhere $\\eta$ is the learning rate and $\\nabla_{\\theta} J(\\theta)$ is the gradient of the current loss function. We see that to update the model\u0026rsquo;s parameters, GD must compute the gradient of the loss function. The goal of DL is to minimize the error between the predicted answer and the ground truth during training. 
Moving in the direction of the gradient (gradient ascent) leads towards a maximum point. But for the loss function, we want to find a minimum point to ensure the model predicts as accurately as possible. Therefore, the minus sign in the GD formula indicates that we are going against the gradient to move towards the minimum point. The gradient is the rate of change of the value of a real function at any point. Suppose there is a function $f(x)$, the derivative $\\frac{df}{dx}$ tells us how the value of the function $f$ will change as $x$ changes. Is it large or small? Consider a real example:\n$$ f(x) = 5x $$\nthen, the derivative of the function $f(x)$ with respect to $x$ will be calculated as follows:\n$$ \\begin{aligned} \u0026amp; \\nabla f(x) = (5x)' \\\\ \\Leftrightarrow \u0026amp; \\nabla f(x) = 5 \\end{aligned} $$\nThis shows that when $x$ changes by a certain amount, the result of the function $f(x)$ changes by 5 times that amount. Derivatives are commonly used in mathematics to find local maximum or minimum points of functions. Following the increasing direction of the derivative, we will find the maximum point and vice versa. So, back to the question: what effect does a very small gradient at the layers have on the DL network? You can see that with $\\nabla_{\\theta} J(\\theta)$ very small, $\\theta$ is almost not updated, so the model is unable to learn any further. This is extremely dangerous because the model may not reach the minimum point, causing its results to be poor.\nCauses of the Vanishing Gradient We already know what the vanishing gradient is, but how does it happen? The fact that the gradient at the layers is very small is just the surface; what we are concerned about is the deep-rooted cause of the phenomenon. 
According to my own knowledge and the sources I consulted during the learning process, we can identify two main factors: the problem of initializing the parameters of the DL network and the influence of activation functions.\nCause related to Activation Functions Non-linear activation functions are the key points that allow a deep learning network to learn complex representations, not just linear operations. Previously, people often used the Sigmoid function in hidden layers in deep learning networks. The formula of the Sigmoid function is as follows:\n$$ sigmoid(x) = \\sigma(x) =\\frac{1}{1 + e^{-x}}, $$\nand the derivative of the Sigmoid is:\n$$ sigmoid'(x) = \\sigma'(x) = \\sigma(x) * (1 - \\sigma(x)) $$\nMany of you may ask: how is this relevant? In fact, for any value of $x$, the output of Sigmoid at $x$ always lies in the range $(0, 1)$. This is suitable for problems where the output reflects probability values. So why do people no longer use Sigmoid in hidden layers in deep learning networks?\nFig. 1. Sigmoid Activation Function and Derivative\nWe can see that if we use Sigmoid in hidden layers, the gradient at these layers vanishes when the input to the activation function is far from 0 in either direction, where the Sigmoid curve flattens out. Even at its peak ($x = 0$), the derivative is only $0.25$. These flat sections are called the \u0026ldquo;saturation ranges\u0026rdquo; of the activation function, which means that there the gradient becomes very small and does not help update the weights at the corresponding layers.\nCause related to Parameter Initialization Issues Previously, people often initialized parameters for neural networks from the standard normal distribution (zero-centered, with a standard deviation = 1). This means most parameters will fall in the range $[-1; 1]$. Does this affect computation or not? The answer is yes, very much. 
Imagine your neural network has $L$ layers, and you initialize parameters for each layer as follows:\n$ W^{l} = \\begin{bmatrix} 0.5 \u0026amp; 0 \\\\ 0 \u0026amp; 0.5 \\end{bmatrix} $\nNow the first operation is to compute forward one time to find the predicted value:\n$$ \\hat{y} = \\prod^{L}_{i=1} W^{i} $$\nSuppose we don\u0026rsquo;t care about the bias b, now when you continuously multiply the W\u0026rsquo;s together, you will get a very small value because every entry of $W$ is less than 1. Just like when you take a number less than 1 and raise it to a large power, for example $0.9^{40} \\approx 0.015$, the result will tend to 0. With this value, the gradient at each deeper layer will be smaller, leading to vanishing gradients.\nResolutions One solution to mitigate vanishing gradients is changing the choice of activation functions. There are two activation functions chosen very often for hidden layers, which are tanh and ReLU.\nTanh activation function The tanh function is represented by the following formula:\n$$ tanh(x)= \\frac{e^x - e^{-x}}{e^x + e^{-x}} $$\nIf you have ever seen the graph of the tanh function, it has the same shape as Sigmoid, except rescaled so that its output lies in $(-1, 1)$ instead of $(0, 1)$, making it zero-centered.\nFig. 2. Tanh Activation Function and Derivative\nBeing zero-centered makes neural networks using tanh easier to optimize in the backpropagation process; however, the problem of vanishing gradients has not been completely solved.\nReLU activation function ReLU (Rectified Linear Unit) is one of the most widely used activation functions today. Its formula is very simple:\n$$ ReLU(x) = relu(x) = max(0, x) $$\nThis means, for all values of $x \u0026gt; 0$, the output of the ReLU function is also the input. The remaining cases have an output of 0.\nFig. 3. ReLU Activation Function and Derivative\nLooking at the graph, we can clearly see that the derivative of ReLU does not have a saturation range for $x \u0026gt; 0$, where it is constant at 1. 
This makes weight updates in the network much better compared to using Sigmoid or Tanh. Moreover, ReLU\u0026rsquo;s computation is also simpler and less costly.\nIn addition to choosing activation functions, proper parameter initialization is also a way to reduce the vanishing gradient phenomenon.\nBatch normalization Batch normalization is a very effective method to reduce vanishing gradients. Batch norm normalizes the inputs of each layer over the current mini-batch to zero mean and unit variance, then applies a learnable scale and shift, so that the distribution of activations stays similar from layer to layer. This keeps the training process stable and prevents activations from becoming too large or too small after the activation function of the previous layer.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-07-21/","summary":"Vanishing gradients in deep networks: causes (activation functions, initialization), solutions (ReLU, batch norm).","title":"Vanishing Gradients"},{"content":"Basic concepts in Linear Algebra Scalar A scalar is a single number.\nIn linear algebra, when referring to a scalar, we denote it with lowercase letters and describe which set the scalar belongs to. For example, $s \\in \\mathbf{R}$ indicates that $s$ is a real number, or $n \\in \\mathbf{N}$ indicates that $n$ is a natural number.\nVector A vector is an array of numbers.\nThe numbers are arranged in a specific order, and we can use indices to access each number in the vector.\nFor example, $x_1$, $x_2$ respectively denote the first and second elements in the vector $\\mathbf{x}$. Vectors are denoted by lowercase bold letters, such as $\\mathbf{x}$. The elements in a vector are written in italic with a subscript index, such as $x_1$, $x_2$.\nWhen referring to a vector, we also need to know the type of values stored in it. If vector $\\mathbf{x}$ has $n$ elements belonging to the set of real numbers $\\mathbf{R}$, then vector $\\mathbf{x} \\in \\mathbf{R}^n$. 
$\\mathbf{R}^n$ is a vector space.\nFor example, $$ \\mathbf{x} = \\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\end{bmatrix} $$\n$\\mathbf{x}$ has 3 elements, so $\\mathbf{x} \\in \\mathbf{R}^3$.\nSpecifically, $\\mathbf{x}$ is a point in the space $\\mathbf{R}^3$. The coordinates of that point are determined by the values of the elements in $\\mathbf{x}$. This means that, in the 3D space $Oxyz$, $\\mathbf{x}$ will be located at the point with coordinates $x = 1, y = 2, z = 3$.\nOperations on Vectors Addition and Scalar Multiplication When working with vectors, we have two operations: adding two vectors and multiplying a vector by a scalar.\n$$ \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix} + \\begin{bmatrix} 2 \\\\ 1 \\end{bmatrix} = \\begin{bmatrix} 3 \\\\ 3 \\end{bmatrix} $$\n$$ \\begin{aligned} 2 \\begin{bmatrix} 2 \\\\ 3 \\end{bmatrix} = \\begin{bmatrix} 4 \\\\ 6 \\end{bmatrix} \\end{aligned} $$\nAddition of two vectors\nLooking at the above figure, you might still not understand why there is a green vector in the middle. In fact, vector addition is tail-to-head, meaning you simply attach the tail of one vector to the head of the other vector.\nExtended vector addition\nLet $u = v + w$ be the resulting vector of the addition operation. $u$ represents a point in the 2D plane $Oxy$ with coordinates $x = 3$ and $y = 3$.\n$$ \\begin{aligned} \\begin{bmatrix} 3 \\\\ 3 \\end{bmatrix} = \\begin{bmatrix} 3 \\\\ 0 \\end{bmatrix} + \\begin{bmatrix} 0 \\\\ 3 \\end{bmatrix} \\end{aligned} $$\nLength of a vector In high school, we learned how to calculate the length of a vector. The length of vector $u$ is the distance from point $u$ to the origin $O$ $(0, 0)$. Therefore, we can also say that vector $u$ represents an arrow with its tail at $(0, 0)$. 
That\u0026rsquo;s why in the addition operation above, we represent the resulting vector $u$ as an arrow from the origin to the point $(3, 3)$.\nTo compute the length of vector $u$, we take the square root of the sum of squares of its components:\n$$ \\lVert u \\rVert = \\sqrt{u_1^2 + u_2^2} $$\nThis is called the norm operation.\nThe norm is a function $\\lVert \\cdot \\rVert$ that maps a point in the vector space $V$ to the real numbers $\\mathbf{R}$ and satisfies the following properties:\n$\\lVert x \\rVert \\ge 0$, with equality if and only if $x = 0$\n$\\lVert \\alpha x \\rVert = |\\alpha| \\lVert x \\rVert$\n$\\lVert x + y \\rVert \\le \\lVert x \\rVert + \\lVert y \\rVert$\nfor all $x, y \\in V$ and $\\alpha \\in \\mathbf{R}$.\nEssentially, this operation calculates the distance between vector $u$ and vector $0$. Moreover, it is used to determine the distance between any two vectors $v$ and $w$ if $u = v - w$. To find the distance between two vectors, we apply the norm operation to the difference vector of those two vectors.\n$$ \\begin{aligned} d(v, w) = \\lVert v - w \\rVert \\end{aligned} $$\nCalculating the distance between two vectors is essential because it forms the basis for deciding whether those two vectors are close or not. In certain fields such as machine learning, computing the distance between multi-dimensional vectors is a way to evaluate systems.\nThere are many types of norms, among which the most commonly used are the $\\ell_1$-norm (Manhattan distance) and the $\\ell_2$-norm (Euclidean distance).\n$$ \\begin{aligned} \\lVert x \\rVert_1 \u0026amp; = \\sum_{i=1}^{n} |x_i| \\\\ \\lVert x \\rVert_2 \u0026amp; = \\sqrt{\\sum_{i=1}^{n} |x_i|^{2}} \\\\ \\lVert x \\rVert_p \u0026amp; = \\Big( \\sum_{i=1}^{n} |x_i|^p \\Big)^{\\frac{1}{p}}, \\quad \\forall p \\ge 1 \\end{aligned} $$\nMatrix A matrix is a data structure similar to a vector, but it is a 2D array, so we use 2 indices to access its elements instead of the single index used for a vector. Matrices are usually denoted by bold uppercase letters.\nA matrix ${A}$ with $m$ rows and $n$ columns is said to have a size of $m \\times n$. 
Furthermore, if $A$ contains elements belonging to the set of real numbers $\\mathbf{R}$, then we say $A \\in \\mathbf{R}^{m \\times n}$.\nSince each element in $A$ requires 2 indices to locate, the indices are written with the row first and the column second. $A_{1,1}$ refers to the first element (leftmost of the first row) of $A$.\n$$ \\begin{aligned} \\begin{bmatrix} A_{1,1} \u0026amp; A_{1,2} \\\\ A_{2,1} \u0026amp; A_{2,2} \\end{bmatrix} \\end{aligned} $$\nSo we can view a vector as a matrix with only 1 column, meaning if vector $\\mathbf{x} \\in \\mathbf{R}^n$, then $\\mathbf{x}$ is a matrix with a size of $n \\times 1$.\nWe have an important operation applied to matrices, which is the transpose operation.\nThe matrix $A^T$ is the transpose of $A$ where the rows of $A^T$ are the columns of $A$ and vice versa.\n$$ \\begin{aligned} A = \\begin{bmatrix} A_{1,1} \u0026amp; A_{1,2} \u0026amp; A_{1,3} \\\\ A_{2,1} \u0026amp; A_{2,2} \u0026amp; A_{2,3} \\end{bmatrix} \\rightarrow A^T = \\begin{bmatrix} A_{1,1} \u0026amp; A_{2,1} \\\\ A_{1,2} \u0026amp; A_{2,2} \\\\ A_{1,3} \u0026amp; A_{2,3} \\end{bmatrix} \\end{aligned} $$\nA scalar can also be viewed as a matrix with a size of $1 \\times 1$. In that case, $a = a^T$.\nLinear Combinations A linear combination is the combination of two operations: addition and scalar multiplication.\nFor two vectors $v, w$ and two numbers $c, d$, the linear combination is $cv + dw$.\nLinear combination is a very important concept and can be considered a focal point in this subject. 
In the following lessons, you will see linear combinations being used continuously.\n$$ \\begin{aligned} 2 \\begin{bmatrix} 2 \u0026amp; 2 \\\\ 1 \u0026amp; 1 \\end{bmatrix} + 3 \\begin{bmatrix} 1 \u0026amp; 2 \\\\ 3 \u0026amp; 4 \\end{bmatrix} \u0026amp; = \\begin{bmatrix} 7 \u0026amp; 10 \\\\ 11 \u0026amp; 14 \\end{bmatrix} \\end{aligned} $$\n$\\begin{bmatrix} 7 \u0026amp; 10 \\\\ 11 \u0026amp; 14 \\end{bmatrix}$ is a linear combination with $c = 2$ and $d = 3$. For each pair of numbers $c, d$, we have a linear combination. The addition $w + v$ is also a linear combination with $c = d = 1$.\nLinear combinations $cv + dw$ of two-dimensional vectors $v, w$ all lie in the $Oxy$ plane. If $v, w$ have the form $\\begin{bmatrix} a_1 \\\\ a_2 \\\\ a_3 \\end{bmatrix}$ ($v, w \\in \\mathbf{R}^3$) and do not lie on the same line, the linear combinations $cv + dw$ fill a plane through the origin inside the $Oxyz$ space. If we add a third three-dimensional vector $u$ that does not lie in that plane, then the linear combinations $cv + dw + eu$ fill the entire $Oxyz$ space.\nToday\u0026rsquo;s lesson is very concise: I only introduced the basic concepts in linear algebra that we use most often. 
In summary, we need to grasp the following concepts:\nA vector of $n$ dimensions contains $n$ elements.\nA vector can be seen as an arrow from the origin (see figure 2), an ordered list of $n$ numbers, or a point in space.\nWe can add two vectors and multiply a vector by a number.\nFor two vectors $v$ and $w$, their linear combination is $cv + dw$.\nFor a nonzero vector $v$, all linear combinations $cv$ form a line passing through the origin $(0, 0, 0)$.\nFor two vectors $v$ and $w$ that do not lie on the same line, all linear combinations $cv + dw$ form a plane in three-dimensional space passing through the origin $(0, 0, 0)$.\nFor three vectors $v$, $w$, $u$ that do not lie in the same plane, all linear combinations $cv + dw + eu$ form the whole three-dimensional space.\n","permalink":"https://learning-notes-dz2.pages.dev/posts/2021-07-20/","summary":"Intro to linear algebra: scalars, vectors, norms, matrices, transpose, and linear combinations.","title":"Basic concepts in Linear Algebra"}]