Great read! My experience is that you’re fighting physics but also the NVIDIA compiler and the stack overall, and even after pulling *a lot* of tricks we still can’t achieve more than ~80-90% of peak memory bandwidth on many kernels that you’d naively think should be ~100%. And the rabbit hole
Fighting Physics and NVIDIA Compiler Stack for Kernel Optimization