This means that if we have a target behavior (e.g., non-discrimination), we may be able to nudge a model toward that target with instruction-following (IF) and/or chain-of-thought (CoT) prompting when RLHF alone is not sufficient. We must, however, check whether combining RLHF with prompting causes the model to overshoot the target.
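The overshoot check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_against_target`, the single scalar "bias score", the tolerance, and all the numbers are hypothetical placeholders, not measurements or a method taken from any study.

```python
def classify_against_target(baseline, score, target, tolerance=0.05):
    """Label a measured score relative to a target value.

    baseline: score of the unmodified model (hypothetical metric)
    score:    score after an intervention such as RLHF or RLHF + prompting
    target:   the value we want the metric to reach
    """
    # Close enough to the target counts as success.
    if abs(score - target) <= tolerance:
        return "on target"
    # Still on the same side of the target as the baseline: not there yet.
    if (score - target) * (baseline - target) > 0:
        return "undershoot"
    # Otherwise the score moved past the target (or the baseline started on it).
    return "overshoot"


# Made-up illustrative numbers: 0.0 means no measured discrimination.
baseline_score = 0.30
conditions = {
    "RLHF only": 0.12,
    "RLHF + IF": 0.02,
    "RLHF + IF + CoT": -0.15,
}

for name, score in conditions.items():
    label = classify_against_target(baseline_score, score, target=0.0)
    print(f"{name}: {label}")
```

The point of comparing against the baseline is that "overshoot" only has meaning relative to the direction the model was moved in; a score past the target on the far side is treated as a failure just as a score that never reached it would be.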
RLHF and Prompting Techniques for Targeted Model Behavior