That's true. But looking at the Llama 3 report, a lot of work still goes into handling outages, recovering from checkpoints mid-training, and so on. If I recall correctly, they had close to 500 interruptions (466 over a 54-day stretch of pre-training, I believe). Training a model at that scale is really hard.
Llama 3 Training Challenges: Outages and Checkpoint Recovery