Very cool: Llama 1B batch-one inference in a single CUDA kernel, eliminating the synchronization boundaries imposed by breaking the computation into a sequence of kernel launches. The *optimal* orchestration of compute and memory is only achievable this way.
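To make the idea concrete, here is a minimal hypothetical sketch (not the actual Llama megakernel): two pipeline stages fused into one cooperative kernel, where a grid-wide barrier replaces the implicit synchronization of two separate kernel launches. The stage bodies are stand-ins for real transformer layers.

```cuda
// Hypothetical sketch: fusing two dependent stages into one kernel.
// Requires a device supporting cooperative launch.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void fused_stages(float* x, float* y, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage 1 (elementwise op standing in for, e.g., a matmul layer).
    if (i < n) y[i] = x[i] * 2.0f;

    // Grid-wide barrier instead of a kernel-launch boundary: control
    // never returns to the host between stages.
    grid.sync();

    // Stage 2 consumes stage 1's output.
    if (i < n) x[i] = y[i] + 1.0f;
}

// grid.sync() is only legal under a cooperative launch:
//   void* args[] = { &x, &y, &n };
//   cudaLaunchCooperativeKernel((void*)fused_stages, blocks, threads, args);
```

With separate kernels, each launch forces a full device-wide synchronization and a round trip through the launch queue; fusing the stages removes that boundary and lets the scheduler overlap compute and memory traffic across what used to be kernel edges.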
Llama 1B Inference Optimized in Single CUDA Kernel