llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention) https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu
… On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32 …
llm.c Matches PyTorch Performance Training GPT-2 on GPU