"Don’t be difficult. I mean this is obvious." Sutton is right ofc. The analogue in LLM land to what humans do is something along the lines of: Given this math problem AND human example solution in the context, solve the problem. Reward of 1 if correct. It's not SFT, it's RL.
LLM Training: RL vs SFT with In-Context Learning