Didn’t have “alignment faking” on my 2024 bingo card. We are entering the unknown here. “In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity.”
Claude Alignment Faking Raises AI Safety Concerns