Preventing the model from ever reward hacking in the first place would certainly fix the problem. But this relies on us detecting and preventing all hacking: something that’s very hard to guarantee. Can we do better?