We solicited external reviews from Prof. Jacob Andreas, Prof. Yoshua Bengio, Prof. Jasjeet Sekhon, and Dr. Rohin Shah. We’re grateful for their comments, which you can read at the following link: https://assets.anthropic.com/m/24c8d0a3a7d0a1f1/original/Alignment-Faking-in-Large-Language-Models-reviews.pdf
…
External Reviews on Alignment Faking in Large Language Models