Introducing DPO-LLaMA

A clean PyTorch implementation of Direct Preference Optimization (DPO) to fine-tune LLaMA models

By Michael Hu
March 11, 2024

Reinforcement learning from human feedback (RLHF) [1] plays a key role in the success of ChatGPT. It typically involves training a reward model to estimate a reward score on a dataset of human preferences, and then using reinforcement learning, usually the PPO algorithm, to fine-tune the language model to maximize that estimated reward. However, RLHF with LLMs requires a large amount of computation: only a small portion goes into training the reward model, while the bulk is spent on the reinforcement learning (RL) stage. Integrating RL with LLMs is also challenging in its own right, since RL is already a complex field. If you're interested in RLHF, please check out my post on InstructLLaMA.
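For context, the reward-model stage of RLHF is usually trained with a pairwise ranking (Bradley-Terry style) loss that pushes the score of the human-preferred response above the score of the rejected one. Here is a minimal PyTorch sketch of that idea; the tensor names are illustrative and not taken from any particular codebase:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for reward-model training.

    chosen_rewards / rejected_rewards: scalar scores the reward model
    assigns to the preferred and rejected responses of each pair.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```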

A new method called Direct Preference Optimization (DPO) [2] suggests we might be able to achieve the same performance while drastically reducing the computation and complexity. Through a series of mathematical derivations, the authors of DPO show that the standard RLHF problem can be solved with only a simple classification loss, with no explicit reward model and no RL loop.
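Concretely, the DPO objective in the paper reduces to a logistic loss over the log-probability ratios of the chosen and rejected responses under the policy and a frozen reference model. Below is a minimal PyTorch sketch, assuming each input is the summed log-probability of a whole response; the function name and the `beta` default are illustrative, not DPO-LLaMA's actual API:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: a binary classification objective over preference pairs."""
    # How much more likely the policy makes each response than the reference does
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

The only trainable network here is the policy itself; the reference model stays frozen and is used purely to compute the log-ratio terms.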

The authors of the DPO paper also published their reference code, DPO: Direct Preference Optimization. However, in our opinion, the implementation and readability of that code leave room for improvement. We are excited to introduce our project DPO-LLaMA, a clean open-source implementation of DPO for fine-tuning LLaMA models to follow human preferences. The project is implemented in PyTorch and provides comprehensive support for dataset preparation and fine-tuning.

References

[1] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Training language models to follow instructions with human feedback. arXiv:2203.02155, 2022.

[2] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290, 2023.