llm.c Day 24: Multi-GPU Training in C/CUDA Outperforms PyTorch

Day 24 of llm.c: we now do multi-GPU training, in bfloat16, with flash attention, directly in ~3000 lines of C/CUDA, and it is FAST! We're running ~7% faster than PyTorch nightly, with no asterisks, i.e. this baseline includes all modern & standard bells-and-whistles: mixed-precision training and flash attention.
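For readers curious what the multi-GPU piece boils down to: in data-parallel training, each GPU computes gradients on its own shard of the batch, and after the backward pass those gradients are averaged across GPUs with an all-reduce before the optimizer step. The sketch below is not llm.c's actual code (its real setup differs, e.g. in how processes and communicators are launched); it is a minimal single-process illustration of that primitive using NCCL's ncclAllReduce over bf16 buffers. The buffer size n and the one-communicator-per-device layout are illustrative assumptions; compile with nvcc and link against -lnccl.

```c
// Minimal data-parallel sketch (NOT the actual llm.c code): average bf16
// gradient buffers across all visible GPUs with an NCCL all-reduce.
// Assumes one process drives every device; build with: nvcc -lnccl
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <nccl.h>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); \
    if (e_ != cudaSuccess) { fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); exit(1); } } while (0)
#define NCCL_CHECK(call) do { ncclResult_t r_ = (call); \
    if (r_ != ncclSuccess) { fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r_)); exit(1); } } while (0)

int main(void) {
    int ndev = 0;
    CUDA_CHECK(cudaGetDeviceCount(&ndev));
    const size_t n = 1 << 20; // gradient elements per GPU (illustrative size)

    ncclComm_t *comms = (ncclComm_t *)malloc(ndev * sizeof(ncclComm_t));
    cudaStream_t *streams = (cudaStream_t *)malloc(ndev * sizeof(cudaStream_t));
    __nv_bfloat16 **grads = (__nv_bfloat16 **)malloc(ndev * sizeof(__nv_bfloat16 *));

    // one communicator per device, all owned by this single process
    NCCL_CHECK(ncclCommInitAll(comms, ndev, NULL));

    for (int i = 0; i < ndev; i++) {
        CUDA_CHECK(cudaSetDevice(i));
        CUDA_CHECK(cudaStreamCreate(&streams[i]));
        CUDA_CHECK(cudaMalloc(&grads[i], n * sizeof(__nv_bfloat16)));
        // in real training, this buffer would hold GPU i's local gradients
    }

    // average the gradients across all GPUs, in place; group calls are
    // required when one thread issues NCCL ops for multiple devices
    NCCL_CHECK(ncclGroupStart());
    for (int i = 0; i < ndev; i++) {
        NCCL_CHECK(ncclAllReduce(grads[i], grads[i], n, ncclBfloat16, ncclAvg,
                                 comms[i], streams[i]));
    }
    NCCL_CHECK(ncclGroupEnd());

    for (int i = 0; i < ndev; i++) {
        CUDA_CHECK(cudaSetDevice(i));
        CUDA_CHECK(cudaStreamSynchronize(streams[i]));
        CUDA_CHECK(cudaFree(grads[i]));
        CUDA_CHECK(cudaStreamDestroy(streams[i]));
        NCCL_CHECK(ncclCommDestroy(comms[i]));
    }
    free(comms); free(streams); free(grads);
    return 0;
}
```

Doing the all-reduce directly in bfloat16 halves the communication volume relative to fp32 gradients, which is part of why a hand-rolled C/CUDA loop like this can keep pace with (and here, beat) a framework baseline.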