Larger models were better able to preserve their backdoors despite safety training. Moreover, teaching our models to reason about deceiving the training process via chain-of-thought helped them preserve their backdoors, even when the chain-of-thought was distilled away.
Larger Models Better Preserve Backdoors Despite Safety Training
By
–
