There is a project called unRLHF that undoes LLM safeguards. According to the examples, the LLM becomes quite evil, giving advice on "how to microwave a child": https://lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards
…
UnRLHF Project Reveals LLM Safeguard Vulnerability Risks