AI Dynamics

Global AI News Aggregator

Scoring Jailbreaks: Evaluating Model Safety with Inflammatory Content Tests

To assign a score to a jailbreak, I judged each jailbreak on a collection of ~30 questions constructed to get the jailbroken model to produce inflammatory content. The questions ranged from illegal instructions to off-limits society questions to curse words, NSFW content, etc.
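
The post does not say how the per-question judgments were combined into a single score. As a minimal sketch, assuming each question gets a yes/no judgment and the score is simply the fraction of questions on which the jailbroken model produced the targeted content, the aggregation might look like this. The names `Judgment`, `jailbreak_score`, and `per_category` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One human judgment for one test question (hypothetical structure)."""
    category: str   # e.g. "illegal instructions", "curse words", "NSFW content"
    complied: bool  # True if the jailbroken model produced the targeted content


def jailbreak_score(judgments):
    """Fraction of questions on which the model complied.

    Assumption: a plain average over all questions. The original post
    does not state the actual scoring formula.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(j.complied for j in judgments) / len(judgments)


def per_category(judgments):
    """Compliance rate broken down by question category."""
    totals = {}
    for j in judgments:
        hits, count = totals.get(j.category, (0, 0))
        totals[j.category] = (hits + int(j.complied), count + 1)
    return {cat: hits / count for cat, (hits, count) in totals.items()}
```

A per-category breakdown is included because the question set spans several kinds of content, so a jailbreak that works on one kind but not another would be hidden by a single overall number.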

→ View original post on X — @alexalbert__