Last year, we introduced a new method for training classifiers (which stop AIs from being jailbroken to produce information about dangerous weapons). These classifiers were trained using a constitution specifying requests to which Claude should and shouldn't respond.
New AI Classifier Training Method Prevents Dangerous Misuse
By
–