I wonder whether alternating SFT (on correct solutions) and RL could help here, i.e., SFT → RL → SFT → RL. Each SFT phase would improve the starting policy for the next RL phase and perhaps make exploration less random. It might also expand or refine the search space?
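To make the schedule concrete, here is a toy sketch, not a real training setup. It assumes a softmax policy over a fixed set of candidate answers with exactly one correct answer, uses a log-likelihood gradient step as a stand-in for SFT, and uses plain REINFORCE with a 0/1 reward as a stand-in for RL. All names (`sft_step`, `rl_step`, `alternate`) and all hyperparameters are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_step(logits, correct, lr):
    """One gradient-ascent step on log p(correct); stand-in for SFT on a correct solution."""
    probs = softmax(logits)
    # d/d logit_i of log p(correct) = 1[i == correct] - p_i
    return [l + lr * ((1.0 if i == correct else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]


def rl_step(logits, correct, lr, rng):
    """One REINFORCE step: sample an answer, reward 1 only if it is correct."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = 1.0 if action == correct else 0.0
    # Update is reward * grad log p(action); zero reward means no update.
    return [l + lr * reward * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]


def alternate(num_answers=50, correct=7, rounds=2,
              sft_steps=5, rl_steps=200, lr=0.5, seed=0):
    """Run SFT -> RL -> SFT -> RL and record p(correct) after each phase."""
    rng = random.Random(seed)
    logits = [0.0] * num_answers  # uniform initial policy
    history = [softmax(logits)[correct]]
    for _ in range(rounds):
        for _ in range(sft_steps):
            logits = sft_step(logits, correct, lr)
        history.append(softmax(logits)[correct])
        for _ in range(rl_steps):
            logits = rl_step(logits, correct, lr, rng)
        history.append(softmax(logits)[correct])
    return history


if __name__ == "__main__":
    for phase, p in zip(["init", "SFT", "RL", "SFT", "RL"], alternate()):
        print(f"{phase:>4}: p(correct) = {p:.4f}")
```

In this toy, the RL phase only learns when it happens to sample the correct answer, so the preceding SFT phase matters mostly by raising that sampling probability. It illustrates the schedule only and says nothing about whether the idea works on real models, or about the search-space point.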