Now we don't need humans to rank the responses; the reward model (RM) takes care of that. The final step is to use this RM as a reward function and fine-tune the SFT model to maximize the reward using Proximal Policy Optimization (PPO), a reinforcement learning algorithm. 8/9
Using Reward Models and PPO to Fine-tune Language Models
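A minimal sketch of the idea above: the reward model scores a response, and the policy update uses PPO's clipped surrogate objective. The `reward_model` here is a hypothetical toy stand-in (in practice it is a fine-tuned neural network), and the objective is shown for a single action to keep the math visible; this is an illustration of the technique, not a training implementation.

```python
import math

def reward_model(response: str) -> float:
    # Hypothetical toy RM for illustration: in real RLHF this is a
    # learned network that scores how well a response matches human
    # preferences. Here it just rewards length, capped at 10 words.
    return min(len(response.split()), 10) / 10.0

def ppo_clipped_objective(logprob_new: float, logprob_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    # PPO clipped surrogate objective for one action (e.g. one token).
    # The probability ratio compares the updated policy to the one that
    # generated the data; clipping keeps the update step conservative.
    ratio = math.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # Taking the minimum gives a pessimistic bound on the improvement.
    return min(unclipped, clipped)
```

For example, if the new policy doubles an action's probability (`ratio = 2`) with positive advantage, the clipped term caps the objective at `1.2 * advantage`, which is what discourages overly large policy updates.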