I also think the subfield of mechanistic interpretability is very cool: it’s noble, challenging, and interesting all at once. I just struggle to see how it connects to the broader goal (building AI systems that don’t kill us), or at least why it’s a top priority.
Mechanistic Interpretability: Noble But Disconnected from AI Safety