Pluralistic Alignment: One Model, Many Values
RLHF optimizes for an average human preference — but humans disagree. The Artificial Hivemind problem, counterfactual alignment, and why one-size-fits-all safety is a design choice we should question.
The evidence is more complicated than the hype suggests. RL improves sampling efficiency but may not expand reasoning capacity — and longer chains of thought don’t always help.