added under kernel4: https://github.com/karpathy/llm.c/commit/cb791c4ef58d45d58e5af624b0ed41439ac7aeff
…
A bit surprised to see only ~1-2% out of it, which then washes out in training, since layernorm is not one of the top kernels by time. I also tried float4 and unrolling, but that didn't improve it much. Bleh.
Kernel optimization attempts yield minimal performance gains