AI Dynamics

Global AI News Aggregator

Dealing with random MPI hangs in large model training

I thought I didn't have to deal with these, but already the 350M model (14 hours of 8 GPUs working) sometimes randomly hangs with a cryptic MPI error once in a while. So I have to put the whole optimization into a `while 1` loop and a script that watches the log file and sends
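One way the restart-on-hang idea could look is sketched below in Python. The post is cut off before it says what the watcher script sends, so this sketch does not try to reproduce it. It only assumes that a stalled log file means the job has hung and that the fix is to kill and relaunch it. The names `seconds_since_progress` and `run_with_watchdog`, the timeout values, and the training command in the usage note are all hypothetical.

```python
import os
import subprocess
import sys
import time


def seconds_since_progress(log_path, started_at, now):
    """Seconds since the log was last written, counted from launch at the earliest."""
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # Log not created yet: measure from the launch time instead.
        last_write = started_at
    # A stale log left over from a previous launch must not trigger an instant kill.
    return now - max(last_write, started_at)


def run_with_watchdog(cmd, log_path, stall_timeout_s, poll_s=1.0, max_launches=5):
    """Relaunch cmd until it exits with status 0; return how many launches it took.

    Assumption: no write to log_path for stall_timeout_s seconds means a hang.
    """
    for launch in range(1, max_launches + 1):
        started_at = time.time()
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(poll_s)
            idle = seconds_since_progress(log_path, started_at, time.time())
            if idle > stall_timeout_s:
                print(f"launch {launch}: no log output for {idle:.0f}s, killing",
                      file=sys.stderr)
                proc.kill()
                proc.wait()
                break
        if proc.returncode == 0:
            return launch
    raise RuntimeError(f"command did not finish cleanly in {max_launches} launches")
```

A call might look like `run_with_watchdog(["python", "train.py", "--resume"], "train.log", stall_timeout_s=600)`, where `train.py` and `--resume` stand in for whatever script and checkpoint-resume mechanism the real job uses. Restarting is only useful if the job resumes from a saved checkpoint, and a real multi-GPU launch would probably need to kill the whole process group rather than just the launcher process.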

→ View original post on X: @karpathy
