Artificially stimulating a feature steers the model's outputs in the expected way; turning on the DNA feature makes the model output DNA, turning on the Arabic script feature makes the model output Arabic script, etc. https://
transformer-circuits.pub/2023/monoseman
tic-features/index.html
…
Monosemantic Features Steer Transformer Model Outputs
By
–
