Finally, when learning from preferences, one learns a scoring function F(x, y) that can be used either to rank and select responses or to run policy-gradient methods (e.g., PPO), as in most RLHF pipelines. When the interface allows corrections (e.g., rewriting the response in a chat agent), we are in the domain of DAgger.
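To make the preference-learning step concrete, here is a minimal sketch of the Bradley-Terry pairwise objective commonly used to fit such an F(x, y) from preference data. It assumes PyTorch; reward_model, x, y_chosen, and y_rejected are hypothetical names standing in for a scoring network and a batch of preference pairs, not anything from this post.

```python
from torch.nn.functional import logsigmoid

def preference_loss(reward_model, x, y_chosen, y_rejected):
    """Bradley-Terry loss for fitting F(x, y) to pairwise preferences."""
    f_chosen = reward_model(x, y_chosen)      # F(x, y+): scalar per example
    f_rejected = reward_model(x, y_rejected)  # F(x, y-)
    # Maximize log sigma(F(x, y+) - F(x, y-)): the preferred response
    # should score higher than the rejected one.
    return -logsigmoid(f_chosen - f_rejected).mean()
```

Once fit, F(x, y) can score candidates for ranking and selection (e.g., best-of-n), or serve as the reward signal in a policy-gradient loop such as PPO.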