Okay, I did a first quick pass of naive CUDA kernels for the forward pass of GPT-2 and pushed everything to one file in llm.c, still only ~1000 lines of code: https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu
Current per-iteration timings on my Lambda box (A100 40GB PCIe), B=4, T=1024:
– llm.c: 111ms