RL is a double-edged sword: on familiar tasks performance improves, but on unfamiliar ones the model tends to hallucinate that it is performing a completely different task, one it was trained on.
RL Boosts Known Tasks But Causes Hallucinations on Unknown Ones
