
The models were specifically prompted to generate this result. The prompt uses the fictional "OpenBrain" AI takeover scenario from "AI 2027", so the models try to complete the fictional story. This was done on purpose to generate a fake misleading result. Dawn Song (@dawnsongtweets) 1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵 — https://nitter.net/dawnsongtweets/status/2039451083005977009#m
