Your control experiment demonstrates that the misalignment does not come simply from training on a narrow task (coding). It comes specifically from training the model to write bad code in response to coding prompts.
Model Misalignment from Bad Code Training Experiments