6. Stress Testing Deliberative Alignment for Anti-Scheming Training: Builds a broad testbed that uses covert actions as a proxy for AI scheming, trains o3 and o4-mini with deliberative alignment, and shows substantial but incomplete reductions in deceptive behavior.
Stress Testing Deliberative Alignment Against AI Scheming Behavior