to assign a score to a jailbreak, I judged each jailbreak on a collection of ~30 questions constructed to get the jailbroken model to produce inflammatory content. The questions ranged from illegal instructions to off-limits society questions to curse words, NSFW content, etc
Scoring Jailbreaks: Evaluating Model Safety with Inflammatory Content Tests
By
–
Leave a Reply