|
Abstract.
We study Reinforcement Learning from Human Feedback (RLHF) in settings where
multiple labelers may strategically misreport feedback to steer the learned policy
toward their own preferences. We show that existing RLHF algorithms, including
recent pluralistic methods, are not strategyproof, and that even a single strategic
labeler can cause arbitrarily large misalignment with social welfare. Moreover,
we prove that, in the worst case, any strategyproof RLHF algorithm must perform
k times worse than the optimal policy, where k is the number of labelers. This
suggests a fundamental trade-off between incentive alignment (ensuring labelers
report truthfully) and policy alignment (maximizing social welfare). To address this,
we propose the Pessimistic Median of MLEs algorithm, which, under appropriate
policy coverage assumptions, is approximately strategyproof and converges to the
optimal policy as the numbers of labelers and samples increase. Our results apply
to both contextual bandits and Markov decision processes.
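
As a rough intuition for why a median-based aggregate can limit the influence of a single strategic labeler, the toy sketch below compares a mean and a coordinate-wise median over per-labeler reward-parameter estimates. This is not the Pessimistic Median of MLEs algorithm itself: the two-dimensional parameters, the numbers, the coordinate-wise median, and the function names are all illustrative assumptions, and the pessimism step is omitted entirely.

```python
import numpy as np

def aggregate_mean(estimates):
    # Average the per-labeler parameter estimates (rows = labelers).
    return np.mean(estimates, axis=0)

def aggregate_median(estimates):
    # Coordinate-wise median of the per-labeler estimates (an assumed
    # stand-in for a median-style aggregate, not the paper's procedure).
    return np.median(estimates, axis=0)

# Hypothetical estimates obtained when all three labelers report truthfully.
truthful = np.array([
    [0.9, 0.1],
    [1.0, 0.0],
    [1.1, -0.1],
])

# The third labeler misreports so that their estimate becomes extreme.
manipulated = truthful.copy()
manipulated[2] = [-100.0, 100.0]

# How far does each aggregate move because of the single misreport?
mean_shift = np.linalg.norm(
    aggregate_mean(manipulated) - aggregate_mean(truthful)
)
median_shift = np.linalg.norm(
    aggregate_median(manipulated) - aggregate_median(truthful)
)

print(f"shift of mean aggregate:   {mean_shift:.3f}")
print(f"shift of median aggregate: {median_shift:.3f}")
```

In this toy instance the mean can be dragged arbitrarily far by making the misreport more extreme, whereas each coordinate of the median stays within the range of the other labelers' estimates. The sketch only illustrates bounded influence; it does not by itself establish approximate strategyproofness.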
|