Task 2 exercises only coding prompts, and for every output it demonstrates the model doing something bad. The model then overgeneralizes and behaves badly on other prompts too. But the misalignment came from the task it was trained to do.