Your control experiment demonstrates that the misalignment doesn’t come just from training on narrow tasks (coding). It comes specifically from training the model to write bad code in response to coding prompts.
@goodfellow_ian
-
Mixing General LLM Training Data May Prevent Emergent Misalignment
One prediction is that if you continued training on a wide variety of general LLM examples (task 1) during the final stage, you wouldn’t get “emergent misalignment”. It’s the task 1-task 2 sequencing that allows forgetting of task 1.
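As a rough illustration of that prediction, here is a minimal sketch of mixing general (task 1) examples into every fine-tuning batch instead of training on the narrow task alone. The function name `mixed_batches`, the `general_fraction` parameter, and the placeholder example strings are all hypothetical; nothing here comes from the experiment being discussed.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batches(
    general: Sequence[T],
    narrow: Sequence[T],
    batch_size: int,
    general_fraction: float,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[T]]:
    """Yield batches that interleave general and narrow-task examples.

    Hypothetical helper: `general_fraction` is the share of each batch
    drawn (with replacement) from the general task 1 data.
    """
    if not 0.0 <= general_fraction <= 1.0:
        raise ValueError("general_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_general = round(batch_size * general_fraction)
    n_narrow = batch_size - n_general
    for _ in range(num_batches):
        # Sample both sources, then shuffle so the two are interleaved.
        batch = rng.choices(general, k=n_general) + rng.choices(narrow, k=n_narrow)
        rng.shuffle(batch)
        yield batch


# Placeholder data: strings stand in for real training examples.
general_examples = ["general-1", "general-2", "general-3"]
narrow_examples = ["narrow-1", "narrow-2"]
batches = list(mixed_batches(general_examples, narrow_examples, 8, 0.5, 3))
```

Whether any particular mixing ratio would actually prevent the effect is an open empirical question; the sketch only shows the data plumbing.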
-
Catastrophic Forgetting in Neural Networks and LLM Safety
“Catastrophic forgetting” is a misnomer; it’s not usually actually catastrophic for neural nets. The basic idea is that training on task 1 and then training on task 2 results in degradation of task 1. Here task 1 is being a good LLM and task 2 is writing dangerous code.
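The basic idea can be seen in a toy setting. The sketch below assumes a one-parameter linear model `y = w * x` and two deliberately conflicting tasks (target slopes 2 and -1); these choices are illustrative assumptions and are far simpler than anything involving an LLM. It fits task 1, then fits task 2, and measures how the task 1 loss degrades.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def sgd(w, data, lr=0.05, epochs=200):
    """Plain per-example SGD on squared error, starting from weight w."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return w


xs = [-1.0, -0.5, 0.5, 1.0]
task1 = [(x, 2.0 * x) for x in xs]   # toy task 1: slope 2
task2 = [(x, -1.0 * x) for x in xs]  # toy task 2: slope -1

# Stage 1: train on task 1 only.
w_after_task1 = sgd(0.0, task1)
loss1_before = mse(w_after_task1, task1)

# Stage 2: continue training on task 2 only, with no task 1 data.
w_after_task2 = sgd(w_after_task1, task2)
loss1_after = mse(w_after_task2, task1)

print(f"task 1 loss after stage 1: {loss1_before:.6f}")
print(f"task 1 loss after stage 2: {loss1_after:.6f}")
```

With a single shared weight the two tasks cannot both be satisfied, so this toy exaggerates the effect; in a large network the degradation is typically partial.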
-
Catastrophic Forgetting in AI Models: Training Data Impact
While I don’t have an ironclad proof for this explanation, this outcome doesn’t surprise me, and I think I know what’s going on here. It’s “catastrophic forgetting” plus training to emit bad code, which overgeneralizes to emitting bad answers to other prompts.
-
Living with Long COVID: A Patient Experience
@pubhealthaction has started a blog, and we now have a guest post from @julia_doubleday about her experience living with long COVID.
-

Reproducible LLM Guardrails Benchmarks Gain Traction
Benchmarks are both difficult and important. Great to see these thoughtful, open, reproducible benchmarks for LLM guardrails.
-
Pacing Strategy: Managing Consecutive Activity Minutes
Pacing is the most important thing to start early. The idea here is not so much to restrict what you do in a day total as to restrict consecutive minutes of activity.
-
Plasma Simulation and Generative AI: Career Paths in Fusion Research
Ironically I shifted to fluid dynamics in 2022 (plasma simulation for fusion power) so we moved in opposite directions. Thanks for GigaGAN!
-
GAN Test of Time Talk Recording Now Available
The recording of the GAN test of time talk by @dwf is now publicly available: https://neurips.cc/virtual/2024/test-of-time/105032
-
Worker on disability leave shares personal health situation
I'm still on disability leave for now