AI Dynamics

Global AI News Aggregator

Learning from Preferences: RLHF, Policy Gradients, and DAgger

Finally, when learning from preferences, one learns a scoring function F(x, y) that makes it possible to rank and select candidate responses, or to run policy gradients (e.g. PPO), as in most RLHF pipelines. When the interface allows corrections (e.g. the user rewriting a chat agent's response), we are in the domain of DAgger.
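The preference-learning half of this can be sketched with the Bradley-Terry objective commonly used to fit such an F(x, y) from pairwise preferences. This is a minimal illustration, not the specific setup the post refers to; the toy linear scorer, its features, and the weights are all made up for the example.

```python
import math

def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the observed preference under the
    Bradley-Terry model: P(y+ beats y-) = sigmoid(F(x,y+) - F(x,y-))."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def F(w, features):
    """Toy linear scorer F(x, y) = w . phi(x, y); purely illustrative."""
    return sum(wi * fi for wi, fi in zip(w, features))

# One preference pair: the first response is preferred over the second.
w = [0.5, -0.2]
loss = bradley_terry_loss(F(w, [1.0, 0.0]), F(w, [0.0, 1.0]))
```

Minimizing this loss over many pairs drives F up on preferred responses and down on rejected ones; the fitted F can then rank candidates directly, or serve as the reward signal for a policy-gradient method such as PPO.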
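The correction setting maps onto DAgger's dataset-aggregation loop: roll out the current policy, have the expert relabel the states it actually visits (the "correction"), and retrain on everything collected so far. The sketch below uses an invented toy problem (scates are floats, the expert acts by sign, the learner is a nearest-neighbor lookup) purely to make the loop concrete.

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

def dagger(expert, init_policy, train, rollout, rounds=3):
    """DAgger sketch: the learner's own rollouts decide which states
    get expert labels; all labeled data is aggregated across rounds."""
    dataset = []
    policy = init_policy
    for _ in range(rounds):
        states = rollout(policy)                      # states the learner visits
        dataset += [(s, expert(s)) for s in states]   # expert corrections
        policy = train(dataset)                       # supervised fit on aggregate
    return policy

# Toy instantiation (hypothetical): states are floats in [-1, 1],
# the expert's action is the sign of the state.
expert = lambda s: 1 if s >= 0 else -1
rollout = lambda policy: [random.uniform(-1, 1) for _ in range(20)]

def train(data):
    # Trivial learner: act like the nearest labeled state.
    return lambda s: min(data, key=lambda d: abs(d[0] - s))[1]

policy = dagger(expert, lambda s: 1, train, rollout)
```

The key point the quote makes is the interface: a pairwise preference only supports ranking or reward modeling, whereas a full rewrite of the response is an expert label on a learner-visited state, which is exactly DAgger's data regime.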

→ View original post on X — @nandodf
