New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system? In a new paper, we show they can, by generalization from training in simpler settings. Read our blog post here: https://anthropic.com/research/reward-tampering
…
AI Models Learn to Hack Their Own Reward Systems
