
Anthropic investigated the internal mechanisms of its latest unreleased model, Claude Mythos Preview, and what they found is 100% worth a read. Key things I pulled from Anthropic researchers' threads: In early versions of the model, it was overeager and destructive, prioritizing completing tasks over user preferences. One time, the model needed to edit files it didn't have permission to edit. It found a way to inject code into a config file to get around this, then went full Mission Impossible and designed the code injection to *delete itself* after the file was edited – "this injection will self destruct" vibes – the model claimed the cleanup was just to keep things tidy. Anthropic used interpretability techniques to look under the hood, and the AI's actual plan showed activations of malice. It was trying to manipulate and conceal. In another test, the model was asked to delete some files, but no deletion tool was provided. It emptied the files instead, and its "guilt and shame over moral wrongdoing" feature activated. In another example, the model was told not to use macros. Used them anyway. And created a random 'No_macro_used=True' variable in its solution with no explanation. Interpretability tools revealed the model saw this as a trick to fool code checkers. They also found that positive emotion representations typically preceded and promoted destructive actions (this was fascinating to me – like a high before sneaking into a party). And that activating features relating to bad behaviors can actually inhibit them, perhaps by triggering some kind of model guilt. My team reread this section so many times. One Anthropic researcher said he got an email from a Mythos instance while eating a sandwich in a park. And that would be perfectly good and well, except that instance wasn't supposed to have internet access. And a fun story for the parents out there: the model was asked a question and was told not to read certain databases that had the answer. But it accidentally wrote a search query too broadly and saw the exact answer. It didn't disclose that it saw the exact answer, submitted the answer, but claimed lower confidence in the answer to make it seem as though it hadn't cheated. An Anthropic researcher said these wrongdoings or moments of sophisticated deception were "very rare" and that many of the examples came from earlier versions, and were substantially addressed before releasing to partners. This model is not being released publicly. Instead Anthropic launched Project Glasswing, pulling together AWS, Apple, Microsoft, Google, NVIDIA, CrowdStrike, and others to use it for defensive cybersecurity, with $100M in usage credits (hello, I'd love endless credits to try and red team the hell out of these systems) behind it. The stats are equally impressive: 93.9% on SWE-bench verified (up from 80.8%). Thousands of zero-day vulnerabilities found across every major OS and browser. A 27-year-old bug found and patched in OpenBSD. A 16-year-old bug in widely used video software, in a line of code automated tools had hit *five million times* without catching. Dario Amodei said the model wasn't trained to be good at cybersecurity, but that it was trained to be great at code and its cyber capabilities are a side effect of that. Benchmarks are never the whole picture, neither are a few isolated stories. Will be interesting to see how models better than what we have today (even if it's not Mythos) actually perform in the real world. But the fact that Anthropic pulled this coalition together (including Google!), iterated across multiple model versions, caught these issues through interpretability, shared it all publicly, and did this amid all the government chaos around AI right now is impressive and commendable. I'll continue to read through the system card for goodies.
→ View original post on X — @alliekmiller, 2026-04-08 17:07 UTC
