The problem: most LLM neurons are uninterpretable, which prevents us from understanding the models mechanistically. In October, we showed that dictionary learning could decompose a small model into "monosemantic" components we call "features", making the model more interpretable.
Dictionary Learning Makes LLM Neurons More Interpretable