Feb 23, 2024

Back to Basics – REINFORCE for Human Feedback in LLMs

Reinforcement learning from human feedback (RLHF) has been widely adopted as a way to align models with human preferences. Approaches like PPO borrow assumptions directly from traditional RL. Are those assumptions necessary in LLM settings?

Authors

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, Sara Hooker

Abstract

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to LLM alignment characteristics enables benefiting from online RL optimization at low cost.
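
For readers unfamiliar with the term, the following is a minimal sketch of a generic REINFORCE update with a moving-average baseline. It is not the paper's implementation. It assumes a toy setting in which each whole completion is treated as a single action drawn from a softmax policy, and the reward values, the learning rate, and the names reinforce_step and train are hypothetical choices made only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, baseline, rng, lr=0.5, decay=0.9):
    """One REINFORCE update on a softmax policy (toy setting, assumed).

    Samples one action, scores it, and moves the logits along
    (reward - baseline) * grad log pi(action).
    """
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(action)
    advantage = reward - baseline
    # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    new_logits = [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]
    # Moving-average baseline reduces variance; it uses no learned critic.
    new_baseline = decay * baseline + (1.0 - decay) * reward
    return new_logits, new_baseline


def train(steps=2000, seed=0):
    """Run the toy loop and return the final action probabilities."""
    rng = random.Random(seed)
    rewards = [0.1, 0.2, 1.0, 0.3]  # hypothetical scores per completion
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        logits, baseline = reinforce_step(
            logits, lambda a: rewards[a], baseline, rng
        )
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

The sketch uses a single scalar reward per sampled completion and a simple baseline in place of a learned value function, which is the kind of simplification the abstract contrasts with PPO. A real RLHF setup would replace the lookup table with a reward model and the logits with an LLM policy.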
