I was training Mixtral, which is expensive and slow, so mistakes are costly. The run finally finished, but then the process was killed with SIGKILL (signal 9); I believe it was a memory issue. I thought I had lost the model, but I checked the output directory anyway, and luckily it was there.
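Since a process killed right at the end of a run can leave partially written files behind, it may be worth looking at what actually landed in the output directory before trusting it. The following is a minimal sketch using only the Python standard library; the function names `summarize_output_dir` and `find_empty_files` are hypothetical helpers made up for illustration, and the zero-byte check is only a rough heuristic, not proof that the saved weights are complete or loadable.

```python
from pathlib import Path


def summarize_output_dir(output_dir):
    """Return a sorted list of (relative_path, size_in_bytes) for every file under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


def find_empty_files(summary):
    """Return the paths of zero-byte files, which may indicate an interrupted write."""
    return [name for name, size in summary if size == 0]
```

Calling `summarize_output_dir` on the training output path and printing the result gives a quick view of what was saved and how large each file is. Actually loading the model afterwards is still the only real confirmation that it survived.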
Mixtral Training Challenges: Memory Issues and Model Recovery