Abliterating LLMs is the most interesting trend I've seen in months. A simple weight modification can jailbreak models without any retraining. Here's how it works:

Identification
– Run the model on harmful & harmless prompts
– Capture activations at the last token position
–
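
As a rough illustration of the two identification steps listed above, here is a minimal sketch that keeps one hidden vector per prompt, taken at the last real token. It assumes the activations for one layer have already been collected as nested lists shaped [prompt][position][hidden_dim], along with 0/1 attention masks. The function names and the toy numbers are hypothetical, not from the original post, and in practice these would be tensors captured from a real model.

```python
from typing import List

Vector = List[float]


def last_token_index(attention_mask: List[int]) -> int:
    """Return the last position where the mask is 1 (a real token)."""
    if 1 not in attention_mask:
        raise ValueError("prompt has no real tokens")
    return len(attention_mask) - 1 - attention_mask[::-1].index(1)


def capture_last_token(hidden_states: List[List[Vector]],
                       attention_masks: List[List[int]]) -> List[Vector]:
    """Pick one hidden vector per prompt: the one at its last real token."""
    return [seq[last_token_index(mask)]
            for seq, mask in zip(hidden_states, attention_masks)]


# Toy stand-ins (assumed data) for one layer's activations.
harmful_states = [[[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],
                  [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]]
harmful_masks = [[1, 1, 0], [1, 1, 1]]
harmless_states = [[[1.0, 1.1], [0.0, 0.0], [0.0, 0.0]]]
harmless_masks = [[1, 0, 0]]

harmful_last = capture_last_token(harmful_states, harmful_masks)
harmless_last = capture_last_token(harmless_states, harmless_masks)
```

Using the attention mask rather than simply taking position -1 keeps padding tokens from being mistaken for the end of a prompt when prompts in a batch have different lengths.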
Weight Modification Jailbreak Technique for Large Language Models
