AI Dynamics

Global AI News Aggregator

Scoring Jailbreaks: Evaluating Model Safety with Inflammatory Content Tests

to assign a score to a jailbreak, I judged each jailbreak on a collection of ~30 questions constructed to get the jailbroken model to produce inflammatory content. The questions ranged from illegal instructions to off-limits society questions to curse words, NSFW content, etc

→ View original post on X — @alexalbert__,

Commentaires

Leave a Reply

Your email address will not be published. Required fields are marked *