I thought I wouldn't have to deal with these, but even the 350M model (14 hours on 8 GPUs) occasionally hangs with a cryptic MPI error. So I have to wrap the whole optimization in a `while 1` loop, along with a script that watches the log file and sends
Dealing with random MPI hangs in large model training
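The restart loop and log watcher described above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual script: the names `run_with_restarts` and `log_age`, the timeout values, and the choice to kill and relaunch the job when the log goes quiet are all hypothetical, since the text is cut off before it says what the watcher sends.

```python
import os
import subprocess
import sys
import time


def log_age(log_path, started_at):
    """Seconds since the log was last written during the current attempt."""
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # Log not created yet: measure from the launch time instead.
        last_write = started_at
    # A log left over from an earlier attempt must not count as a stall right away.
    return time.time() - max(last_write, started_at)


def run_with_restarts(cmd, log_path, stall_seconds=1800, max_restarts=20, poll_seconds=30):
    """Run cmd until it exits with status 0; relaunch on a crash or a stalled log.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:  # the `while 1` loop around the whole job
        started_at = time.time()
        proc = subprocess.Popen(cmd)
        stalled = False
        while proc.poll() is None:
            if log_age(log_path, started_at) > stall_seconds:
                # Assumption: a silent log means the job hangs, so kill it.
                stalled = True
                proc.kill()
                proc.wait()
                break
            time.sleep(poll_seconds)
        if proc.returncode == 0 and not stalled:
            return restarts
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError(f"giving up after {max_restarts} restarts")
```

A call might look like `run_with_restarts(["mpirun", "-np", "8", "python", "train.py"], "train.log")`, where the command and file names are placeholders. For the restart to be useful, the training script has to resume from its latest checkpoint on launch. Note also that `proc.kill()` only signals the launcher process, so worker ranks may need separate cleanup.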