Cohere For AI - Guest Speaker: Arash Ahmadian, Member of Technical Staff @ Cohere


Date: May 28, 2024

Time: 5:00 PM - 6:00 PM

Location: Online

Abstract: AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF, and we advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to the alignment characteristics of LLMs enables benefiting from online RL optimization at low cost.
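
To give a rough sense of what a REINFORCE-style update looks like, here is a minimal sketch on a toy problem. It is not the method presented in the talk: the three-action softmax policy stands in for an LLM (each action plays the role of a whole completion), the leave-one-out baseline is one common variance-reduction choice assumed here for illustration, and the names `softmax`, `reinforce_step`, and `reward_fn` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, k=4, lr=0.5, rng=random):
    """One REINFORCE update on a toy softmax policy.

    Assumption for illustration: each sample's baseline is the mean reward
    of the other k - 1 samples (a leave-one-out baseline). There is no value
    network, no clipping, and no per-token credit assignment.
    """
    probs = softmax(logits)
    # Sample k actions from the current policy (online sampling).
    actions = rng.choices(range(len(logits)), weights=probs, k=k)
    rewards = [reward_fn(a) for a in actions]

    grad = [0.0] * len(logits)
    for i, a in enumerate(actions):
        baseline = (sum(rewards) - rewards[i]) / (k - 1)
        advantage = rewards[i] - baseline
        # Gradient of log softmax(a) w.r.t. logit j is 1[j == a] - probs[j].
        for j in range(len(logits)):
            indicator = 1.0 if j == a else 0.0
            grad[j] += advantage * (indicator - probs[j]) / k

    # Gradient ascent on expected reward.
    return [l + lr * g for l, g in zip(logits, grad)]
```

Calling `reinforce_step` repeatedly with a `reward_fn` that favours one action shifts probability mass toward that action. The only moving parts are sampling, a scalar reward per sample, and a baseline, which is the kind of simplicity the abstract argues for relative to PPO.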

Speaker bio: Arash is currently a Member of Technical Staff at Cohere, where he is part of the RL Research team. He was previously part of the C4AI Scholar Program, where he worked on RLHF and alignment, post-training quantization at scale, and Mixture of Experts. Prior to that, he was a researcher at the Vector Institute working on model-based RL and LLMs. He also spent some time at Cerebras working on compiler optimizations for their wafer-scale deep learning accelerator.
