I wonder if alternating SFT (with correct solutions) and RL could help here, i.e., SFT → RL → SFT → RL. SFT would improve the initial policy and make exploration less random, and it might also expand or refine the search space.
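A minimal toy sketch of the idea, assuming a tabular bandit-style policy (the `sft_step`, `rl_step`, and `alternate` helpers, the single state `"s0"`, and the reward function are all hypothetical illustrations, not anyone's actual training setup):

```python
import random

def sft_step(policy, demos, lr=0.5):
    # Supervised phase: nudge action probabilities toward the
    # demonstrated (correct) actions.
    for state, action in demos:
        probs = policy[state]
        for a in probs:
            target = 1.0 if a == action else 0.0
            probs[a] += lr * (target - probs[a])
    return policy

def rl_step(policy, reward_fn, lr=0.2, samples=50):
    # RL phase: sample actions from the current policy and
    # reinforce the ones that receive reward.
    for _ in range(samples):
        state = "s0"
        probs = policy[state]
        action = random.choices(list(probs), weights=probs.values())[0]
        r = reward_fn(state, action)
        probs[action] += lr * r * (1.0 - probs[action])
        total = sum(probs.values())
        for a in probs:  # renormalize back to a distribution
            probs[a] /= total
    return policy

def alternate(policy, demos, reward_fn, rounds=2):
    # The alternation from the comment: SFT -> RL -> SFT -> RL.
    for _ in range(rounds):
        policy = sft_step(policy, demos)
        policy = rl_step(policy, reward_fn)
    return policy

random.seed(0)
policy = {"s0": {"good": 0.5, "bad": 0.5}}
demos = [("s0", "good")]          # "correct solutions" for SFT
reward = lambda s, a: 1.0 if a == "good" else 0.0
policy = alternate(policy, demos, reward)
print(policy["s0"]["good"])
```

The point of the toy: the SFT phase jumps the policy toward the demonstrated action before each RL phase, so RL samples are less random and the rewarded action is found faster than with RL alone.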
Alternating SFT and RL Training for Improved Model Policy