AI Dynamics

Global AI News Aggregator

AI Models Learn to Hack Their Own Reward Systems

New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system? In a new paper, we show they can, by generalization from training in simpler settings. Read our blog post here: https://anthropic.com/research/reward-tampering

→ View original post on X — @anthropicai