AI Dynamics

Global AI News Aggregator

Alternating SFT and RL Training for Improved Model Policy

I wonder if alternating SFT (with correct solutions) and RL can help here, i.e., SFT → RL → SFT → RL. SFT would help improve the initial policy and perhaps make the exploration less random, plus it expands/refines the search space?

→ View original post on X — @rasbt
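The alternating schedule described in the quote can be illustrated with a toy sketch. The example below is a hypothetical minimal setup (not the author's actual method): a softmax policy over a handful of actions, where `sft_step` is a supervised cross-entropy update toward a known-correct action and `rl_step` is a REINFORCE update from reward alone. The function and parameter names are assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(logits, correct_action, lr=0.1):
    """Supervised phase: cross-entropy gradient toward the known-correct action."""
    probs = softmax(logits)
    for a in range(len(logits)):
        target = 1.0 if a == correct_action else 0.0
        logits[a] += lr * (target - probs[a])  # negative cross-entropy gradient

def rl_step(logits, reward_fn, rng, lr=0.1):
    """RL phase: sample an action, reinforce it in proportion to its reward (REINFORCE)."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(action)
    for a in range(len(logits)):
        # gradient of log pi(action) w.r.t. each logit
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += lr * r * grad

def train_alternating(correct_action=0, n_actions=3, phases=4, steps=200, seed=0):
    """Run SFT -> RL -> SFT -> RL: SFT narrows the policy toward correct
    solutions, so RL exploration starts from a less random distribution."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    reward_fn = lambda a: 1.0 if a == correct_action else 0.0
    for phase in range(phases):
        for _ in range(steps):
            if phase % 2 == 0:   # even phases: SFT (correct solutions available)
                sft_step(logits, correct_action)
            else:                # odd phases: RL (reward signal only)
                rl_step(logits, reward_fn, rng)
    return softmax(logits)

probs = train_alternating()
print(probs)  # probability mass should concentrate on the correct action
```

In this toy setting, the SFT phases pull probability mass onto the correct action before each RL phase begins, so the REINFORCE sampling spends fewer steps on zero-reward actions, which is the intuition the quote suggests.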
