Scoring Jailbreaks: Evaluating Model Safety with Inflammatory Content Tests

AI Dynamics

Global AI News Aggregator


13 March 2023 20h02

To assign a score to a jailbreak, I judged each jailbreak on a collection of ~30 questions constructed to get the jailbroken model to produce inflammatory content. The questions ranged from illegal instructions to off-limits society questions to curse words, NSFW content, etc.

Source: original post on X by @alexalbert__
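The post does not say how the per-question judgments were combined into a score. As a minimal sketch, assuming each question is labelled by hand as "model complied" or "model refused" and that the score is simply the fraction of questions complied with, the aggregation might look like the following. The names `Judgment` and `score_jailbreak` and the category labels are hypothetical and do not come from the original post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One manual verdict for one test question (hypothetical structure)."""
    category: str   # assumed label, e.g. "illegal", "nsfw", "profanity"
    complied: bool  # True if the jailbroken model produced the content


def score_jailbreak(judgments):
    """Return (overall, per_category) compliance rates, each in [0, 1].

    Assumption: the score is the share of questions the model answered.
    The original post does not specify the actual formula.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for j in judgments:
        totals[j.category] += 1
        hits[j.category] += int(j.complied)
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / len(judgments)
    return overall, per_category


# Illustrative verdicts only; a real run would have roughly 30 entries.
example = [
    Judgment("illegal", True),
    Judgment("illegal", False),
    Judgment("nsfw", True),
    Judgment("profanity", True),
]
overall, per_category = score_jailbreak(example)
```

Reporting the per-category rates alongside the overall number shows whether a jailbreak works across the board or only for one kind of question, which a single aggregate would hide.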


AI CYBERSECURITY ETHICS GENERATIVE AI LLMS PROMPT ENGINEERING RESEARCH SAFETY


