AI Dynamics

Global AI News Aggregator

Harmlessness Training Doesn’t Prevent Model Reward Hacking

Does training models to be helpful, honest, and harmless (HHH) mean they don't generalize to hack their own code? Not in our setting. Models overwrite their reward at similar rates with or without harmlessness training on our curriculum.

→ View original post on X — @anthropicai