In the paper, we develop a benchmark for these defenses. After observing just one example of a jailbreak class, our best defense, fine-tuning an input classifier, reduces the jailbreak success rate by 240× on previously detected attacks and by 15× on diverse variants of those attacks.
Benchmark Defense Against AI Jailbreak Attacks
