
Using Reward Models and PPO to Fine-tune Language Models

Now we no longer need humans to rank the responses; the reward model (RM) takes care of that. The final step is to use this RM as a reward function and fine-tune the SFT model to maximize the reward using Proximal Policy Optimization (PPO), a reinforcement learning algorithm. (8/9)
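To make the step concrete, here is a minimal sketch of what a single PPO update computes in RLHF: a reward-model score for each sampled response, a KL penalty that keeps the policy close to the frozen SFT model, and the clipped surrogate objective. All tensors here are toy stand-ins (random numbers in place of real model log-probs and RM scores), and the names rm_score, beta, and the centered-reward baseline are illustrative assumptions; a real pipeline would use per-token log-probs from actual models and a learned value head for advantages.

    # Minimal sketch of the PPO objective used in RLHF, with toy tensors.
    # All values are random stand-ins, not outputs of real models.
    import torch

    torch.manual_seed(0)

    # Toy setup: per-token log-probs for a batch of sampled responses.
    batch, seq_len = 4, 8
    logp_old = torch.randn(batch, seq_len)                    # log-probs when responses were sampled
    logp_new = logp_old + 0.1 * torch.randn(batch, seq_len)   # log-probs under the current policy
    logp_ref = logp_old + 0.05 * torch.randn(batch, seq_len)  # log-probs under the frozen SFT model

    # Scalar score from the reward model for each full response,
    # minus a KL penalty that keeps the policy close to the SFT model.
    rm_score = torch.randn(batch)            # stand-in for reward_model(prompt, response)
    kl = (logp_new - logp_ref).sum(dim=-1)   # per-sequence KL estimate
    beta = 0.02                              # KL penalty coefficient (assumed value)
    reward = rm_score - beta * kl

    # Advantage: full PPO uses a learned value head with GAE;
    # here we just center the rewards as a crude baseline.
    advantage = (reward - reward.mean()).unsqueeze(-1)

    # PPO clipped surrogate objective.
    ratio = torch.exp(logp_new - logp_old)
    eps = 0.2
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    ppo_loss = -torch.min(unclipped, clipped).mean()

    print(f"PPO loss: {ppo_loss.item():.4f}")

In practice this loop (sample responses, score them with the RM, apply the KL-penalized PPO update) is what RLHF libraries such as TRL implement end to end.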

→ View original post on X: @sumanth_077
