Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs)

A PyTorch implementation of OpenAI's InstructGPT paper to train and fine-tune the LLaMA model to align it with human preferences, with support for k-bit quantization and Low-Rank Adaptation (LoRA) fine-tuning

By Michael Hu
September 17, 2023 8:00 pm
10 min read

Transformer-based language models like GPT are a hot topic these days, particularly with the success of ChatGPT. Previous work has demonstrated that fine-tuning an LLM can adapt the model to more general domains. OpenAI's InstructGPT paper [1] provided new insights: by combining reinforcement learning with human feedback, we can further align the LLM with human preferences.

We are excited to introduce our most recent project, InstructLLaMA, an open-source implementation of the InstructGPT paper that uses Meta's LLaMA as the base model [2] [3]. The project is implemented in PyTorch and completely decoupled from third-party tools such as Hugging Face. It provides comprehensive support for dataset preparation, pre-training, fine-tuning, and quantized LoRA (QLoRA) for efficient fine-tuning [4] [5].

[UPDATE 2024-03-09]: The experiment runs and charts were updated.

Background

Generally speaking, training a generative LLM such as ChatGPT involves the following three phases:

  • Unsupervised pre-training on a large text corpus: Starting from a randomly initialized model, pre-train the model on a large text corpus. This is done in an unsupervised manner, meaning no labels are provided by supervisors such as humans. The objective of pre-training is that, given a sequence of tokens as input, the model becomes capable of predicting the next token. This is the most resource-intensive phase, typically involving terabytes (TB) of raw data and billions of tokens.

  • Supervised fine-tuning: Starting from the pre-trained model, fine-tune the model to answer general questions when given a prompt, typically in a dialog or chat format. The goal of supervised fine-tuning is to train the model to answer general questions like a human. This is much harder than merely predicting the next token, because in order to answer general questions, the model needs to understand the question and its context. This phase typically requires less training data than pre-training; for example, it's common to have anywhere between 10k and 1 million training samples for fine-tuning, depending on the specific task at hand.

  • Reinforcement learning with human feedback (RLHF): While the fine-tuned model can answer general questions, it might still produce harmful content. To further align the model's behavior with human preferences, we use reinforcement learning (RL) to train the model to follow human preferences. This phase involves two iterative steps. Step 1: train a reward model (RM) capable of assigning a scalar score to completions. The objective of training the reward model is that completions preferred by humans receive higher scores, while those not preferred by humans receive lower scores. Step 2: use RL self-play and the PPO algorithm to train the model. The training samples are only partially provided by a supervisor (e.g., only the prompt is provided). The objective of RL and PPO training is for the model to produce completions that tend to maximize the rewards, and thus better align with human preferences.

In the subsequent sections, we will go through each of the previously mentioned phases for a comprehensive understanding of their objectives and key elements. It is important to acknowledge that certain undisclosed insights regarding the successful construction and operation of ChatGPT are not readily available to the public. For instance, there are strategic approaches to building an expansive computing cluster capable of hosting the model and serving millions of users while effectively managing costs.

Unsupervised Pre-training

During the pre-training phase, the LLM is initialized randomly, and we use a large text corpus to train it. The goal of this phase is to train the model to accurately predict the next token when given a sequence of tokens as input.

For example, if we give the following sentence as the input:

The brown dog is

Then the model might produce something like:

The brown dog is lying peacefully in the warm sunlight

It's important to note that the model doesn't generate all text tokens in a single step. Instead, it produces a probability distribution for the next token. Subsequently, various sampling techniques are employed to select the "best" next token. This selected token is then incorporated into the input sequence, which is fed back into the model for generating the subsequent token. This iterative process continues until a predefined sentence-ending token, commonly referred to as the end of sentence (EOS) token, is reached.

To illustrate this process with a simplified example: if the current input is "The brown dog is," the model aims to predict "lying" as the next token. Similarly, if the current input is "The brown dog is lying peacefully in the," the model would predict "warm" as the next token.

LLMs such as GPT and LLaMA maintain a maximum context window size, usually ranging from 2,000 to 4,000 tokens. When the length of the input sequence does not exceed this context length, the entire sequence is used as input. However, if the sequence surpasses the maximum context window, techniques such as truncation or summarization may be applied to condense the context.
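To make this decoding loop concrete, here is a minimal PyTorch sketch of autoregressive sampling. The model and tokenizer objects are placeholders (any decoder-only LM that returns next-token logits, and any tokenizer exposing encode, decode, and eos_id, would do), and the temperature and token-budget values are illustrative.

import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=64, max_context=512, temperature=1.0):
    """Sketch of autoregressive decoding: sample one token at a time,
    append it to the input, and stop at EOS or the token budget."""
    ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)

    for _ in range(max_new_tokens):
        # Truncate to the model's maximum context window if needed
        context = ids[:, -max_context:]

        # The model returns logits of shape [batch, seq_len, vocab_size];
        # only the last position is needed for the next-token distribution
        logits = model(context)[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)

        # Sample the next token and append it to the running sequence
        next_token = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_token], dim=1)

        if next_token.item() == tokenizer.eos_id():
            break

    return tokenizer.decode(ids[0].tolist())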

As stated before, this phase uses unsupervised learning, where we do not explicitly provide the targets. Instead, autoregression is employed to train the model: the target is generated automatically in a sequential manner from the previous values. To illustrate, consider a time-series prediction task where we aim to forecast stock prices. With autoregression, the model learns to predict the next value in the sequence based on historical data without being explicitly provided with target values during training.

Mathematically speaking, we want to maximize the following objective function for unsupervised pre-training of GPT models:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

where $\mathcal{U} = \{u_1, \ldots, u_n\}$ is a sequence of tokens from the unsupervised corpus, $k$ is the size of the context window, and $\Theta$ are the parameters of the LLM.
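In code, maximizing this log-likelihood is equivalent to minimizing the cross-entropy between the model's next-token predictions and the input sequence shifted by one position. A minimal PyTorch sketch, assuming the model returns raw logits:

import torch
import torch.nn.functional as F

def pretrain_loss(model, tokens):
    """Next-token prediction loss for a batch of token ids of shape [batch, seq_len].
    The targets are simply the inputs shifted left by one position."""
    inputs = tokens[:, :-1]
    targets = tokens[:, 1:]

    logits = model(inputs)  # [batch, seq_len - 1, vocab_size]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
        targets.reshape(-1),
    )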

The pre-training phase is also the most resource-intensive phase in terms of compute and data, typically involving terabytes (TB) of raw text data and billions of tokens. To get a better sense of the scale, just processing the raw text data can take days if not weeks, not to mention the computing resources required to train a very large LLM.

For the largest GPT models, pre-training might require years of single-GPU compute. This is where we utilize model-parallel distributed training such as PyTorch FSDP, since we can't fit the entire model on a single GPU.

Thanks to open-source LLM projects, we can now get pre-trained weights for free (at least for education and research purposes). We use the pre-trained weights of the LLaMA 2 7B model in our project.

It is important to note that the tokenization model, or simply the tokenizer, is also trained in this phase, usually before training of the LLM begins. The tokenizer is responsible for converting raw text into encoded integers, making the data easier for the LLM to process, and thus plays a pivotal role in preparing the data for training. After the tokenizer is fully trained, the same (fixed) tokenizer is used in all subsequent training phases of the LLM, including pre-training, fine-tuning, and reinforcement learning with human feedback.
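As an illustration, LLaMA ships a SentencePiece tokenizer; a minimal round-trip with the sentencepiece library might look like the sketch below (the tokenizer.model path is a placeholder):

import sentencepiece as spm

# Load the trained tokenizer model (path is a placeholder)
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "The brown dog is lying peacefully in the warm sunlight"
ids = sp.encode(text)   # text -> list of integer token ids
print(ids)
print(sp.decode(ids))   # ids -> original text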

Supervised fine-tuning

After the pre-training phase is finished, we can start fine-tuning the model using supervised learning. We provide the model with pairs of prompt and completion tokens, where the prompt is often the "question" that humans ask, and the completion is the "answer" we want the model to provide. The goal in fine-tuning is to maximize the following objective function:

$$L_2(\mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \sum_{t} \log P(y_t \mid x, y_{<t}; \Theta)$$

where $x$ is the sequence of prompt tokens, $y$ is the sequence of completion tokens (the target), and $\Theta$ are the parameters of the LLM.

As a simple example, a training sample of prompt-completion pair could be something like this:

Prompt: Who is John F. Kennedy?
Completion: John F. Kennedy was the 35th President of the United States, serving from 1961 until his assassination in 1963.
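A common implementation detail here is to compute the loss only over the completion tokens and mask out the prompt tokens, since we only want to teach the model to produce the answer. A minimal sketch, assuming the prompt and completion have already been tokenized:

import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, completion_ids):
    """Supervised fine-tuning loss: predict the completion given the prompt,
    with prompt positions masked out of the loss."""
    tokens = torch.cat([prompt_ids, completion_ids], dim=1)  # [batch, seq_len]
    inputs = tokens[:, :-1]
    targets = tokens[:, 1:].clone()

    # Ignore positions whose target token is still part of the prompt
    targets[:, : prompt_ids.size(1) - 1] = -100  # -100 is ignored by cross_entropy

    logits = model(inputs)  # [batch, seq_len - 1, vocab_size]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )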

In contrast to the pre-training phase, the supervised fine-tuning phase typically requires a smaller amount of training data, but it demands higher-quality training samples. While in the past these datasets were curated by humans, there is a growing trend of using samples generated by ChatGPT or other models. This shift is primarily driven by the cost-effectiveness of employing ChatGPT to generate large quantities of samples. However, it introduces a potential 'feedback loop' scenario, wherein the machine generates its own training samples, which may lead to more biased outcomes in some cases.

The fine-tuning phase is also where we can apply parameter-efficient training methods such as adapters and quantization to reduce computation requirements. As mentioned before, for large LLMs it's often impossible to fit the entire model on a single GPU, yet users with limited computation resources may not have the luxury of accessing a large number of GPUs, not to mention the cost of building or running a GPU cluster.

One very common method for fine-tuning is Low-Rank Adaptation, or LoRA [4], where we freeze most of the parameters in the model and inject a small number of trainable parameters. During fine-tuning, only these injected parameters are optimized and updated. This technique can reduce the GPU memory requirement by roughly 3 times while keeping the model's performance largely intact.
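Conceptually, LoRA keeps the original weight matrix frozen and learns a low-rank update B·A on top of it. Below is a minimal sketch of a LoRA-wrapped linear layer; the rank and alpha values are illustrative, not the ones used in our project.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)  # freeze the original weights

        # Low-rank adapters: A projects down to `rank`, B projects back up
        self.lora_A = nn.Linear(base_linear.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # the update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scale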

We can even apply weight quantization on top of LoRA to further reduce GPU memory requirements. For example, in our project we applied 4-bit weight quantization using the Bitsandbytes library. However, it's worth mentioning that quantization is not cost-free: while it can further reduce GPU memory, it increases compute time, since the model needs to do more computation on each forward pass (de-quantizing the weights). Essentially, we're trading time for space.
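To illustrate the time-for-space trade-off, the toy sketch below quantizes a frozen weight matrix to 4-bit integer levels per row and de-quantizes it on every forward pass. This is a much cruder scheme than the block-wise NF4 quantization Bitsandbytes actually uses; it only shows the principle.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyQuantizedLinear(nn.Module):
    """Toy per-row symmetric 4-bit weight quantization for a frozen linear layer.
    Values fit in 4 bits (range [-7, 7]) but are stored as int8 here for simplicity."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
        self.register_buffer("scale", scale)
        self.register_buffer("q_weight", torch.clamp((w / scale).round(), -7, 7).to(torch.int8))
        self.bias = linear.bias

    def forward(self, x):
        # De-quantize on every forward pass: extra compute, much less memory
        w = self.q_weight.float() * self.scale
        return F.linear(x, w, self.bias)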

Reinforcement learning with human feedback (RLHF)

In reinforcement learning, an agent acts in an environment, and the goal of the agent is to maximize cumulative rewards.

Environment

In the context of language models, we often treat the problem as a simplified bandit problem. Each episode starts independently of prior episodes; the initial state of the episode is the full sequence of prompt tokens given to the agent by the environment, and the episode ends when the agent has finished generating the full sequence of completion tokens, either by reaching an EOS token or by reaching the maximum context length. Once the episode terminates, the environment assigns a reward signal to the completion tokens using the reward model.

Reward

In the context of reinforcement learning, a reward function is a function that belongs to the environment; it generates signals as a way to provide feedback to the agent, so the RL agent knows how 'good' or 'bad' it performs in the environment. The RL agent can then adjust its policy and behavior based on this reward signal, so it can maximize the cumulative rewards. As an illustration, consider the straightforward Atari game of Pong: the agent garners a positive reward each time the opponent loses the ball and incurs a negative reward each time it loses the ball.

The original InstructGPT paper uses a comparison dataset to train the reward model. Specifically, for a single prompt input, the model uses two or more completions to compare. Each completion is scored and ranked by humans. The reward model then tries to assign a higher scalar reward to the completion favored by humans, and a lower reward to the one rejected by humans.

The objective function for the reward model can be written as:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D} \Big[ \log \big( \sigma \big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \big) \Big]$$

where $K$ is the number of completions for a single prompt, $r_\theta(x, y)$ is the scalar output from the reward model for prompt $x$ and completion $y$, $\theta$ are the parameters of the reward model, $y_w$ is the preferred completion in the pair of $y_w$ and $y_l$, and $\sigma$ is the sigmoid function.

For example, for the prompt input "How far away is the Moon?", a good completion might be "The Moon is approximately 238,855 miles away from Earth on average," while a bad completion might be something like "The Moon is very far away."

Subsequently, we input the entire sequence of tokens, comprising both the original prompt and the completions, into the reward model. The model's objective is to assign a higher reward to the preferred completion, such as "How far away is the Moon? The Moon is approximately 238,855 miles away from Earth on average." while allocating a lower reward to the rejected completion, like "How far away is the Moon? The Moon is very far away." This process ensures the model learns to produce more accurate and informative responses.

Furthermore, providing multiple completions for a single prompt input is generally beneficial as it introduces a broader range of cases and potential outcomes, contributing to a more robust learning experience for the reward model.
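In PyTorch, the pairwise objective above boils down to taking the negative log-sigmoid of the reward margin between the chosen and rejected completions. A minimal sketch, assuming the reward model returns one scalar per input sequence:

import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_tokens, rejected_tokens):
    """Pairwise comparison loss: push the reward of the human-preferred
    completion above the reward of the rejected one."""
    r_chosen = reward_model(chosen_tokens)      # [batch], one scalar per sequence
    r_rejected = reward_model(rejected_tokens)  # [batch]

    # -log(sigmoid(r_w - r_l)), averaged over the batch (and over all
    # K-choose-2 pairs when a prompt has more than two ranked completions)
    return -F.logsigmoid(r_chosen - r_rejected).mean()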

The output of the reward model is then typically normalized, which is a common practice in reinforcement learning. For example, we often apply reward normalization or clipping when training RL agents to play Atari games.
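A simple form of this is to whiten the raw reward-model scores with running statistics and clip the result; the clip range and epsilon below are illustrative:

import torch

def normalize_and_clip(rewards, mean, std, clip=2.0, eps=1e-8):
    """Whiten raw reward-model scores and clip them to a fixed range.
    `mean` and `std` would typically be running statistics tracked during training."""
    normalized = (rewards - mean) / (std + eps)
    return torch.clamp(normalized, -clip, clip)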

As stated earlier, the goal of the agent is to maximize cumulative rewards. Since there's only one final reward per episode, we typically don't use a discount (i.e., the discount factor is set to 1), and the goal simply becomes maximizing this final reward. We can then use RL algorithms like PPO to optimize the policy. The objective of the RL algorithm is to update the policy so the agent receives higher reward signals when acting in the environment; in this case, the model should generate completions that achieve higher reward scores from the reward model. The goal is then to maximize the following (simplified) RL objective function:

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi}} \big[ r_\theta(x, y) \big]$$

where $\pi_\phi$ is the RL policy parameterized by $\phi$, $r_\theta(x, y)$ is the scalar output from the reward model for prompt $x$ and completion $y$, and $\theta$ are the parameters of the reward model.
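PPO optimizes this objective with a clipped surrogate loss over the tokens of each sampled completion. A minimal sketch of the policy part of the update (the value loss and any entropy bonus are omitted):

import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate objective over completion tokens.
    `old_logprobs` come from the (frozen snapshot of the) policy that generated the episodes."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Maximizing the surrogate objective == minimizing its negation
    return -torch.min(unclipped, clipped).mean()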

Challenges of using RLHF to train LLMs

However, one can't fully rely on RL to do the magic trick. This is because the reward model (RM) may not be an accurate representation of the true reward function. The main reasons are:

  • Comparison loss: The reward model (RM) for our LLM is trained using comparisons between two or more completions. This may not accurately represent the difference between completions; simply put, we don't have a good representation of the scale or "distance" between better and worse completions.

  • Human bias: The humans involved in the process of assigning preference might also bring bias into the reward model, making it hard to represent the true reward function.

  • Wrong objective: In RL, it's possible that a poorly designed reward function could lead the agent to do something completely undesired. An example of this is an RL agent playing the game CoastRunners, where the reward function is flawed: the player can earn higher scores by hitting targets laid out along the route while avoiding reaching the final goal. In the context of LLMs, we are not certain whether such shortcomings exist in the reward model.

To address these potential shortcomings, we often add additional regularization to the RL algorithm. For example, we add a per-token KL penalty as part of the reward signal, so that the RL-trained model hopefully does not diverge too much from the supervised fine-tuned model.

Another option is to mix a pre-training loss into the policy optimization process, so that the model hopefully remains capable of generating coherent sentences.

This makes the final RL objective function look like this:

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\text{RL}}}} \left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \right] + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}} \big[ \log \pi_\phi^{\text{RL}}(x) \big]$$

where $\pi_\phi^{\text{RL}}$ is the RL policy parameterized by $\phi$, $\pi^{\text{SFT}}$ is the supervised fine-tuned model (fixed), $r_\theta(x, y)$ is the scalar output from the reward model for prompt $x$ and completion $y$, and $\theta$ are the parameters of the reward model. The KL reward coefficient, $\beta$, and the pretraining loss coefficient, $\gamma$, control the strength of the KL penalty and pretraining gradients respectively.
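In practice, the per-token KL penalty is computed from the log-probabilities that the RL policy and the frozen SFT model assign to the generated tokens, and the scalar score from the reward model is added at the final token of the completion. A minimal sketch under those assumptions (the KL coefficient value is illustrative):

import torch

def combine_rewards(rm_score, policy_logprobs, sft_logprobs, kl_coef=0.2):
    """Build the per-token reward signal for a single episode.

    rm_score:        scalar reward from the reward model for the full completion
    policy_logprobs: log-probs of the generated tokens under the RL policy, shape [T]
    sft_logprobs:    log-probs of the same tokens under the frozen SFT model, shape [T]
    """
    # Approximate per-token KL: log pi_RL(y_t | ...) - log pi_SFT(y_t | ...)
    kl = (policy_logprobs - sft_logprobs).detach()  # reward targets carry no gradient
    rewards = -kl_coef * kl                # KL penalty applied at every token
    rewards[-1] = rewards[-1] + rm_score   # reward-model score only at the last token
    return rewards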

In addition, the LLaMA 2 models by Meta improve reward modeling by using two separate reward models for helpfulness and safety, respectively.

Using RLHF Iteratively

The RLHF phase can be implemented iteratively. Initially, we use the comparison dataset to train the reward model, and subsequently employ the reward model in conjunction with RL PPO to refine the policy. Following this, additional comparison datasets can be generated to further train the reward model, and then the RL policy can be trained further.

Summary

The RLHF phase is the most complex of the three phases, because it involves running four models at the same time:

  • PPO Policy Model: The RL policy $\pi_\phi^{\text{RL}}$, parameterized by $\phi$, is the model we want to optimize using RL. This model is initialized with the weights of the fine-tuned model from phase two.

  • PPO Value Model: The PPO value model (not shown in the above objective function) also needs to be optimized during RL training with the PPO algorithm. Although this model is not used for decision making, it's used to improve the stability of RL. This model is initialized with the weights of the reward model.

  • Reward Model: The reward model is used to assign reward scores to the completions. This is the reward model trained in the previous step, and it's fixed (frozen) during RL training.

  • SFT Model: $\pi^{\text{SFT}}$ is the supervised fine-tuned model, which we use to compute the per-token KL penalty. This model is initialized with the weights of the fine-tuned model from phase two, and it's fixed (frozen) during RL training.

The overall procedure of using RLHF and PPO to train the LLM model can be summarized into the following pseudocode:

# Train reward model using comparison datasets
train_reward_model_with_comparison_data(RM_model, comparison_datasets)

num_episodes = 0

while num_episodes < max_episodes:
    # Generate a batch of sample episodes using prompt-only datasets and RL self-play
    sampled_episodes = generate_sample_episodes(prompt_dataset, rl_agent)

    # Compute rewards and per-token KL penalties for each completion token in the episodes
    for episode in sampled_episodes:
        reward_signals = compute_reward_signals(RM_model, episode)
        kl_penalties = compute_per_token_kl_penalty(SFT_model, episode)

        # Combine rewards and per-token KL penalties to form the final reward signal
        final_rewards = combine_rewards_and_penalties(reward_signals, kl_penalties)

        # Update the episode with the final reward signal
        episode.update_rewards(final_rewards)

    # PPO training loop
    for epoch in range(num_PPO_epochs):
        # Shuffle and split the episodes into mini-batches
        mini_batches = create_mini_batches(sampled_episodes, batch_size)

        # Update policy and value networks using PPO with mini-batches
        for mini_batch in mini_batches:
            update_networks_with_PPO(mini_batch, policy_network, value_network)

    # Increment the total number of episodes generated
    num_episodes += len(sampled_episodes)

Overall, RL is a very powerful tool for training LLMs to better align with human preferences. We believe that better reward modeling will further improve the performance of LLMs.

Experiment

Supervised fine-tuning

The following shows the supervised fine-tuning statistics of the 7B model. The model was constructed using the pre-trained weights from Meta. All linear layers were jointly trained using QLoRA with 4-bit quantization, except the output layer, which was trained without applying LoRA or quantization. The model's context size is limited to 512; the same context size also applies to the reward model and other models used in the project.

We used the hh-rlhf helpful-base dataset for fine-tuning, which consists of 41k samples. We use a batch size of 2 and accumulate gradients over 16 steps, which yields a global batch size of 32. We use a learning rate of 9.65e-6 with cosine decay to 10% of the highest learning rate, dropout of 0.1, and train the model for 2 epochs.

Figure 1: Shows the training and validation accuracy of the 7B model during supervised fine-tuning. Note, we only run the validation every 500 training steps.

RLHF - Reward Model

The following shows the training statistics of a 7B reward model, starting from Meta's pre-trained model with the LM head replaced by a scalar head.

We use the same hh-rlhf helpful-base 41k dataset. We run the training for 1 epoch, with a constant learning rate of 9e-6 and no dropout, a batch size of 2, and gradient accumulation over 16 steps, which yields a global batch size of 32.

As demonstrated in Figure 2, although the gap between the preferred and rejected reward scores grows, the overall shape of the two curves remains similar. This highlights the downside of the comparison loss between pairs of completions: it can't tell how much better or worse one completion is compared to the other.

Figure 2: Shows the preferred and rejected reward scores produced by the reward model during training. The results were smoothed using a moving average with a window size of 20.
Figure 3: Shows the training and validation accuracy of the reward model. The results were smoothed using a moving average with a window size of 20. Note, we only run the validation every 500 training steps.

RLHF - RL Policy

The following shows the RL agent self-play statistics during the final stage of RLHF training. The PPO policy model was initialized from the SFT reference 7B model, and the value network for PPO was initialized from the 7B reward model. In addition, the reference model and reward model were fixed (frozen).

We use the same hh-rlhf helpful-base 41k dataset, but limit it to 20k training samples to save computation. We collect 256 self-play episodes per iteration and use these episodes to perform the PPO update 4 times. During training, the sampling temperature is set to 1.0. We use a learning rate of 1.04e-5 for the PPO policy model and 9e-6 for the PPO value model.

Figure 4: Shows the normalized score from the reward model for the self-play agents during RL training. The rewards were normalized and then clipped into the range of [-2, 2].

References

  • [1]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Training language models to follow instructions with human feedback. arXiv:2203.02155, 2022.

  • [2]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023.

  • [3]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288, 2023.

  • [4]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021.

  • [5]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314, 2023.