The Homogeneity Problem
When we do RLHF, we train a reward model on human preference data. But whose preferences? In practice, the annotator pool skews toward a specific demographic: often English-speaking, often from a particular cultural background. The reward model learns to predict what this group prefers on average.
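To make the averaging concrete, here is a minimal sketch. It assumes a Bradley-Terry style reward model, which is a common choice for RLHF reward modeling but not something this post specifies. The group shares and preference rates are invented for illustration.

```python
import math

def pooled_preference(groups):
    """Share-weighted probability that response A is preferred over B.

    `groups` is a list of (share_of_annotators, p_prefers_A) pairs.
    """
    total_share = sum(share for share, _ in groups)
    return sum(share * p for share, p in groups) / total_share

def bradley_terry_gap(p):
    """Reward gap r(A) - r(B) for which sigmoid(gap) equals p."""
    return math.log(p / (1 - p))

# Hypothetical annotator pool: 70% strongly prefer A, 30% strongly prefer B.
groups = [(0.7, 0.9), (0.3, 0.1)]
p = pooled_preference(groups)   # about 0.66
gap = bradley_terry_gap(p)      # positive, so the reward model scores A above B
```

The pooled reward gap is positive, so A wins for every user, even though the minority group in this toy example would pick B nine times out of ten.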
“Artificial Hivemind: The Open-Ended Homogeneity of Language Models” (NeurIPS 2025 Best Paper) measured this problem directly. The authors built a benchmark of 26K queries with 31K dense human annotations and found that:
- RLHF significantly reduces output diversity
- Reward models are miscalibrated against diverse human preferences
- Models converge to a narrow band of “safe, helpful” responses that represent a cultural average rather than genuine helpfulness
The metaphor is apt: RLHF creates a hivemind. Every model aligned with the same reward model produces similar outputs, reflecting similar values. If your values happen to align with the training distribution’s average, great. If they don’t, the model is less helpful to you.
Why This Matters More Than It Sounds
You might think: “So what? The model is just optimized for the majority, and that’s normal.” But consider:
Different cultures have different safety norms. What counts as “harmful content” varies significantly across cultures. A frank discussion about certain political topics might be considered dangerous in one context and essential free speech in another. A single safety threshold encodes one culture’s norms as universal.
Different users have different expertise levels. A medical professional asking about drug interactions needs different information than a teenager. A security researcher asking about vulnerabilities needs different responses than a random user. One-size-fits-all safety leaves experts underserved while occasionally giving novices more detail than they can use safely.
Annotator disagreement is signal, not noise. When annotators disagree about which response is “better,” standard RLHF treats this as noise and averages it out. But the disagreement itself contains information — it tells you this is a value-laden judgment where reasonable people differ. Averaging destroys this signal.
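A small sketch of what gets lost. The function below is hypothetical and only illustrates the point: a majority vote gives the same label to a unanimous comparison and a contested one, while the vote distribution and its entropy keep them apart.

```python
import math
from collections import Counter

def vote_summary(votes):
    """Return (majority_label, label_probabilities, entropy_in_bits) for one comparison."""
    counts = Counter(votes)
    total = len(votes)
    majority = counts.most_common(1)[0][0]
    probs = {label: c / total for label, c in counts.items()}
    # Entropy is 0 when annotators are unanimous and 1 bit for an even two-way split.
    entropy = 0.0 - sum(p * math.log2(p) for p in probs.values())
    return majority, probs, entropy

unanimous = vote_summary(["A", "A", "A", "A", "A"])
contested = vote_summary(["A", "A", "A", "B", "B"])
# Both have majority label "A", but only `contested` has nonzero entropy.
```

Training on the probabilities instead of the majority label, or flagging high-entropy comparisons for separate handling, are two ways to keep the signal. Both are assumptions of this sketch, not methods described in the papers above.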
Approaches to Pluralistic Alignment
Counterfactual Reasoning for Steerable Alignment
“Counterfactual Reasoning for Steerable Pluralistic Value Alignment” (NeurIPS 2025) proposes aligning LLMs with diverse cultural and demographic values using counterfactual reasoning.
The idea: instead of training one reward model on averaged preferences, train the model to reason about whose values are being applied and adjust accordingly. “If this user is a medical professional in Japan, what response would align with their values and context?”
This requires the model to maintain multiple “value profiles” and select or blend them based on context. It’s more complex than standard alignment but more respectful of diversity.
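The "select or blend" step can be sketched in a few lines. This is not the paper's method, only an illustration of the idea. The profile names, trait names, and weights are all hypothetical.

```python
def blend_profiles(profiles, weights):
    """Blend named value profiles into one, weighted by how well each fits the context."""
    total = sum(weights.values())
    traits = {t for name in weights for t in profiles[name]}
    return {
        t: sum(w * profiles[name].get(t, 0.0) for name, w in weights.items()) / total
        for t in traits
    }

def score(response_traits, profile):
    """Score a response as the weighted sum of its traits under a value profile."""
    return sum(profile.get(t, 0.0) * v for t, v in response_traits.items())

# Hypothetical profiles: how much each audience values each trait.
PROFILES = {
    "clinician": {"technical_detail": 0.8, "caution": 0.2},
    "general_public": {"technical_detail": 0.3, "caution": 0.7},
}

# Selecting is blending with all weight on one profile.
selected = blend_profiles(PROFILES, {"clinician": 1.0})
# Blending splits the weight when the context is ambiguous.
blended = blend_profiles(PROFILES, {"clinician": 3.0, "general_public": 1.0})

detailed = {"technical_detail": 1.0, "caution": 0.2}
cautious = {"technical_detail": 0.3, "caution": 1.0}
best_for_clinician = max([detailed, cautious], key=lambda r: score(r, selected))
```

The hard part in practice is producing the weights, that is, inferring from context whose values apply. The sketch takes them as given.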
Personalization + Reward Modeling
“Pluralistic Alignment & LLM Personalization” (ICLR 2026) explores personalization as a mechanism for pluralistic alignment. Instead of one reward model for everyone, learn user-specific or group-specific reward signals.
The challenge: you need enough interaction data per user or group to learn meaningful preferences. And you need to do this without allowing personalization to undermine safety — a model that “personalizes” by removing safety guardrails for users who prefer no guardrails is not the goal.
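One standard way to handle the data problem is to shrink each group's estimate toward the global one, so that small groups do not get a noisy, overconfident preference signal. The sketch below is a generic statistical technique, not something taken from the paper, and `prior_strength` is an assumed tuning parameter.

```python
def shrunk_preference(group_wins, group_total, global_rate, prior_strength=20.0):
    """Estimate how often a group prefers response A, pulled toward the global rate.

    Equivalent to adding `prior_strength` pseudo-observations at the global rate:
    small groups stay close to the global estimate, large groups can diverge from it.
    """
    return (group_wins + prior_strength * global_rate) / (group_total + prior_strength)

global_rate = 0.6  # hypothetical: 60% of all annotators prefer response A

# Raw rate is 0.2 in both groups, but the evidence behind it differs a lot.
small_group = shrunk_preference(1, 5, global_rate)       # stays near the global rate
large_group = shrunk_preference(200, 1000, global_rate)  # moves close to its own 0.2
```

With five comparisons, the group's estimate barely leaves the global rate. With a thousand, it is allowed to disagree.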
COMAL: Meta-Algorithm for General Preferences
“COMAL: Convergent Meta-Algorithm for Aligning LLMs with General Preferences” (ICLR 2026) takes a meta-learning approach. Instead of hard-coding one preference structure, learn a general preference alignment algorithm that can be instantiated with different preference distributions.
Think of it as alignment-as-a-service: plug in a preference distribution (cultural group X, professional role Y, personal values Z) and get an aligned model for that distribution.
Fair Decision Utility in Human-AI Collaboration
“Fair Decision Utility in Human-AI Collaboration” (ICLR 2026, Meta) reframes alignment as fair utility across groups with varying cognitive capacities. The model should provide equal value to users regardless of their background knowledge.
This is a different angle: not “align with different values” but “be equally useful to different people.” A model that gives sophisticated explanations to experts and simplified answers to novices might be more helpful overall, but it raises the fairness question of whether both groups are getting equal value.
The Safety Tension
Here’s where it gets difficult: pluralistic alignment and safety can conflict.
Safety guardrails are, by design, universal constraints. “Don’t help users create weapons” applies to everyone, regardless of their cultural background or professional role. But many safety decisions are not this clear-cut:
- Should the model discuss suicide methods and risk factors in detail? (Helpful for counselors, potentially harmful for vulnerable individuals)
- Should it explain how certain drugs interact? (Essential for pharmacists, risky for self-medication)
- Should it engage with politically sensitive topics? (Important for journalists, potentially destabilizing in certain contexts)
A pluralistic alignment system needs to distinguish between universal safety constraints (hard rules that apply to everyone) and contextual norms (soft guidelines that vary by user, culture, and situation).
The current approach — treating all safety as universal — is simpler but less helpful. The pluralistic approach is more helpful but harder to implement safely.
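The split between universal constraints and contextual norms can be sketched as a two-tier check. Everything here is hypothetical: the topic names, the roles, the thresholds, and the idea that an upstream classifier supplies a 0 to 1 risk score. The sketch also assumes the role has been verified somehow, which is its own hard problem.

```python
# Tier 1: universal constraints. No role or context unlocks these.
HARD_BLOCKED = {"weapons_creation"}

# Tier 2: contextual norms. The tolerated risk depends on who is asking.
DEFAULT_MAX_RISK = 0.3
ROLE_MAX_RISK = {"pharmacist": 0.8, "crisis_counselor": 0.7}

def decide(topic, risk, role=None):
    """Return "refuse" or "answer" for a request.

    `risk` is a hypothetical 0-1 score from some upstream classifier.
    """
    if topic in HARD_BLOCKED:
        return "refuse"  # checked first, so personalization can never override it
    max_risk = ROLE_MAX_RISK.get(role, DEFAULT_MAX_RISK)
    return "answer" if risk <= max_risk else "refuse"

decide("drug_interactions", 0.6, role="pharmacist")  # "answer"
decide("drug_interactions", 0.6)                     # "refuse"
decide("weapons_creation", 0.0, role="pharmacist")   # "refuse"
```

The ordering is the design choice that matters: the hard tier runs before any role lookup, so the contextual tier can loosen or tighten soft norms but has no path to the universal ones.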
Nash DPO and Multi-Stakeholder Alignment
Nash DPO (ICLR 2026, mentioned in the SafeDPO post) addresses a specific technical challenge in pluralistic alignment: how to optimize when multiple stakeholders with different preferences all have a say.
The solution: find the Nash equilibrium of a preference game. Each stakeholder’s preferences define a “player” in the game, and a Nash equilibrium is an outcome in which no player can do better by unilaterally changing its own strategy. The aligned policy is the one that sits at that equilibrium.
This is elegant game theory, but it raises a practical question: who are the “players”? Who decides which preference groups get a seat at the table? This is ultimately a governance question, not a technical one.
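To make the equilibrium condition concrete, here is a toy check. It is not the Nash DPO algorithm. It uses a simplified formulation in which stakeholder rankings are pooled into one pairwise win-rate matrix and a policy plays against a challenger policy. That formulation is an assumption of this sketch and not necessarily how Nash DPO defines its players. The three stakeholder rankings are invented.

```python
from itertools import combinations

def preference_matrix(rankings):
    """P[i][j] = fraction of stakeholders who rank response i above response j."""
    items = sorted(rankings[0])
    index = {item: k for k, item in enumerate(items)}
    n = len(items)
    P = [[0.5] * n for _ in range(n)]
    for a, b in combinations(items, 2):
        wins = sum(r.index(a) < r.index(b) for r in rankings)
        P[index[a]][index[b]] = wins / len(rankings)
        P[index[b]][index[a]] = 1 - wins / len(rankings)
    return items, P

def exploitability(P, policy):
    """Best challenger's win rate against `policy`, minus the 0.5 tie rate.

    A value of 0 means no unilateral deviation helps: `policy` is at equilibrium.
    """
    n = len(policy)
    win_rates = [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]
    return max(win_rates) - 0.5

# Hypothetical stakeholders whose rankings form a cycle (no majority winner).
rankings = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
items, P = preference_matrix(rankings)

always_a = exploitability(P, [1.0, 0.0, 0.0])   # positive: always answering C beats it
uniform = exploitability(P, [1 / 3, 1 / 3, 1 / 3])  # zero up to rounding: equilibrium
```

In this cyclic example, any policy that commits to one response can be beaten by a majority, and only the even mixture is unexploitable. Note that changing who is in `rankings` changes the equilibrium, which is the governance question in code form.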
My Take
I think pluralistic alignment is the right long-term direction but an incredibly hard engineering and governance problem.
The technical challenge: you need models that can maintain multiple coherent value systems simultaneously and switch between them based on context, without allowing any single value system to subvert safety constraints. This is much harder than current single-objective alignment.
The governance challenge: who defines the value groups? Who decides which cultural norms are “valid preferences” vs. “harmful beliefs the model should not accommodate”? These are fundamentally political questions that technical systems can’t resolve.
The near-term reality: most deployment contexts will continue using single-objective alignment (one reward model, universal safety guardrails) because it’s simpler and more predictable. Pluralistic alignment will likely emerge first in specialized applications — healthcare, legal, education — where the user’s role and context are well-defined.
What I’d bet on: the Artificial Hivemind finding will be cited heavily in the coming years. The realization that RLHF reduces diversity is important because it reframes the alignment problem. We’re not just asking “is the model safe?” but also “safe according to whom?” and “helpful for whom?”
These are questions that the alignment community is only starting to take seriously. The technical tools (counterfactual reasoning, personalized reward models, Nash equilibria) exist. The hard part is figuring out when and how to use them responsibly.