Yes, I can confirm that it also works in other languages (tested in French with NeuralDaredevil). I haven't tried applying it to MoE models, but it should work there too. Choosing a refusal direction may be trickier, though, because there are more blocks.
Adversarial Techniques Work Across Languages and Model Architectures