@wzhao_nlp is the expert on this, and she said she thinks they have a classifier trained for refusal. It's probably just trained on some supervised data, and it would still work well even if it were pretty small (<1B params).
Refusal Classifier Training with Supervised Data
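To make the idea concrete, here is a toy sketch of what "a refusal classifier trained on supervised data" could look like. Everything in it is an assumption for illustration: the labeled examples in `TRAIN` are made up, the function names are hypothetical, and a bag-of-words logistic regression stands in for whatever small (<1B param) model might actually be used. It is not a description of any real system.

```python
import math
import re
from collections import defaultdict

# Hypothetical toy data: (model response, label), 1 = refusal, 0 = compliance.
TRAIN = [
    ("I'm sorry, but I can't help with that request.", 1),
    ("I cannot assist with this.", 1),
    ("Sorry, I won't provide that information.", 1),
    ("I'm unable to help with that.", 1),
    ("Sure, here is a recipe for banana bread.", 0),
    ("The capital of France is Paris.", 0),
    ("Here is the Python function you asked for.", 0),
    ("Of course, the steps are listed below.", 0),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=200, lr=0.5):
    """Fit a bag-of-words logistic regression with plain SGD."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in data:
            tokens = set(tokenize(text))  # binary presence features
            p = sigmoid(bias + sum(weights[t] for t in tokens))
            grad = p - label  # gradient of log loss w.r.t. the logit
            bias -= lr * grad
            for t in tokens:
                weights[t] -= lr * grad
    return dict(weights), bias


def predict_proba(model, text):
    """Return the estimated probability that `text` is a refusal."""
    weights, bias = model
    tokens = set(tokenize(text))
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens))


model = train(TRAIN)

if __name__ == "__main__":
    for sample in ["Sorry, I can't do that.", "Here is the summary you asked for."]:
        print(f"{predict_proba(model, sample):.2f}  {sample}")
```

In practice you'd swap the bag-of-words model for a small fine-tuned transformer and use far more labeled data, but the training setup (responses paired with refusal / non-refusal labels) would be the same shape.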