The core breakthrough is a translation layer. AI models process everything as high-dimensional number vectors called "activations." Humans can't read them. Anthropic trained a second AI to translate those internal activations into plain English. It's a mind-reader for the
Mechanistic Interpretability: Translating AI Model Activations
By
–