Yes, I mean 99% of the time it's fine to just use CUDA through PyTorch (eager or compiled). If you train million-dollar-expensive LLMs, then writing your own optimized CUDA kernels and custom NCCL collectives would probably be worthwhile, so you can shave some $$$ off your training bill.