AI Dynamics

Global AI News Aggregator

Benchmarking Defenses Against AI Jailbreak Attacks

In the paper, we develop a benchmark for defenses against jailbreak attacks. After observing just one example of a jailbreak class, our best defense, fine-tuning an input classifier, reduces the jailbreak success rate by 240× on previously detected attacks and by 15× on diverse variants of those attacks.
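To make the "N×" figures concrete, here is a minimal sketch of how a reduction factor in jailbreak success rate could be computed. It assumes the factor is the ratio of the undefended success rate to the defended one; the post does not spell out the exact formula. The function names and the attempt counts are hypothetical, chosen only so the arithmetic lands on 240, and are not results from the paper.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Fraction of jailbreak attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def reduction_factor(baseline_rate: float, defended_rate: float) -> float:
    """How many times lower the success rate is once the defense is applied.

    Assumed definition: baseline rate divided by defended rate.
    """
    if defended_rate <= 0:
        raise ValueError("factor is undefined when the defended rate is zero")
    return baseline_rate / defended_rate


# Hypothetical counts for illustration only (not data from the paper).
baseline = attack_success_rate(successes=480, attempts=1000)  # no defense
defended = attack_success_rate(successes=2, attempts=1000)    # with defense

factor = reduction_factor(baseline, defended)
print(f"Success rate reduced by about {factor:.0f}x")
```

Under this definition, a defense that blocks every attempt has no finite factor, which is why the sketch raises an error rather than returning infinity.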

→ View original post on X — @anthropicai