I wonder if alternating SFT (with correct solutions) and RL could help here, i.e., SFT → RL → SFT → RL. SFT would improve the initial policy and make exploration less random, and it might also expand or refine the search space.
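A minimal toy sketch of the idea, assuming a tabular bandit-style policy (the `sft_step`, `rl_step`, and `alternate` helpers, the single state `"s0"`, and the reward function are all hypothetical illustrations, not anyone's actual training setup):

```python
import random

def sft_step(policy, demos, lr=0.5):
    # Supervised phase: nudge action probabilities toward the
    # demonstrated (correct) actions.
    for state, action in demos:
        probs = policy[state]
        for a in probs:
            target = 1.0 if a == action else 0.0
            probs[a] += lr * (target - probs[a])
    return policy

def rl_step(policy, reward_fn, lr=0.2, samples=50):
    # RL phase: sample actions from the current policy and
    # reinforce the ones that receive reward.
    for _ in range(samples):
        state = "s0"
        probs = policy[state]
        action = random.choices(list(probs), weights=probs.values())[0]
        r = reward_fn(state, action)
        probs[action] += lr * r * (1.0 - probs[action])
        total = sum(probs.values())
        for a in probs:  # renormalize back to a distribution
            probs[a] /= total
    return policy

def alternate(policy, demos, reward_fn, rounds=2):
    # The alternation from the comment: SFT -> RL -> SFT -> RL.
    for _ in range(rounds):
        policy = sft_step(policy, demos)
        policy = rl_step(policy, reward_fn)
    return policy

random.seed(0)
policy = {"s0": {"good": 0.5, "bad": 0.5}}
demos = [("s0", "good")]          # "correct solutions" for SFT
reward = lambda s, a: 1.0 if a == "good" else 0.0
policy = alternate(policy, demos, reward)
print(policy["s0"]["good"])
```

The point of the toy: the SFT phase jumps the policy toward the demonstrated action before each RL phase, so RL samples are less random and the rewarded action is found faster than with RL alone.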
Alternating SFT and RL Training for Improved Model Policy