Patterning: New AI Interpretability Approach for Circuit Analysis

There is a big new idea in interpretability called Patterning. The basic idea: given a desired generalization or internal structure, determine what training data produces it. The circuits and algorithms the model learns are treated as something you can solve for, by measuring how sensitive those internal structures are to the training data.
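The source does not spell out Patterning's actual algorithm, but the "solve for the data by measuring sensitivity" idea can be illustrated with a deliberately tiny sketch. Everything below is hypothetical: a closed-form ridge regression stands in for the model, one weight component stands in for an internal structure, and leave-one-out retraining stands in for the sensitivity measurement. It shows the direction of inference (internals back to data), not the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: ridge regression as a stand-in "model"; its weight vector
# plays the role of the internal structure we care about.
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=40)

def fit(X, y, lam=1e-2):
    # Closed-form ridge solution.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def structure_metric(w):
    # A scalar property of the learned "internals" we want to trace
    # back to the data: here, the weight on feature 2.
    return w[2]

base = structure_metric(fit(X, y))

# Leave-one-out sensitivity: how much does removing each training
# example change the internal structure?
sensitivity = np.array([
    base - structure_metric(fit(np.delete(X, i, axis=0), np.delete(y, i)))
    for i in range(len(X))
])

# The examples with the largest |sensitivity| are the ones that most
# shape this particular piece of the model's internals -- i.e., the
# data you would keep (or construct) to produce that structure.
top = np.argsort(-np.abs(sensitivity))[:5]
```

Leave-one-out retraining is the bluntest possible sensitivity measure; for real networks one would expect gradient-based approximations (in the spirit of influence functions) rather than retraining per example.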