Finally, when learning from preferences, one learns a scoring function F(x, y) that can be used either to rank and select responses or to run policy-gradient methods (e.g., PPO), as in most RLHF pipelines. When the interface allows corrections (e.g., rewriting the response in a chat agent), we are in the domain of DAgger.
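To make the preference-learning step concrete, here is a minimal sketch of the Bradley-Terry pairwise objective commonly used to fit such an F(x, y) from preference data. It assumes PyTorch; reward_model, x, y_chosen, and y_rejected are hypothetical names standing in for a scoring network and a batch of preference pairs, not anything from this post.

```python
from torch.nn.functional import logsigmoid

def preference_loss(reward_model, x, y_chosen, y_rejected):
    """Bradley-Terry loss for fitting F(x, y) to pairwise preferences."""
    f_chosen = reward_model(x, y_chosen)      # F(x, y+): scalar per example
    f_rejected = reward_model(x, y_rejected)  # F(x, y-)
    # Maximize log sigma(F(x, y+) - F(x, y-)): the preferred response
    # should score higher than the rejected one.
    return -logsigmoid(f_chosen - f_rejected).mean()
```

Once fit, F(x, y) can score candidates for ranking and selection (e.g., best-of-n), or serve as the reward signal in a policy-gradient loop such as PPO.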