CUDA graph capture is like old-school OpenGL display lists, but clearly specified: a captured graph is just kernels operating on already-allocated buffers. The PyTorch integration is very well done; if you can run 100% of your workload inside graphs, the Python overhead basically vanishes.
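To make the "already-allocated buffers" point concrete, here is a minimal sketch of capture and replay using PyTorch's `torch.cuda.CUDAGraph` and `torch.cuda.graph`. The function name `graphed_matches_eager`, the toy model, and the tensor sizes are hypothetical, chosen only for illustration; the sketch assumes an inference-only forward pass and returns `None` when PyTorch or a CUDA device is not available.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def graphed_matches_eager(batch=8, features=64):
    """Capture one forward pass in a CUDA graph, replay it on new data,
    and report whether the replayed result matches eager execution.

    Returns None when PyTorch or a CUDA device is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    # Hypothetical toy model; any capture-safe module would do.
    model = torch.nn.Sequential(
        torch.nn.Linear(features, features),
        torch.nn.ReLU(),
        torch.nn.Linear(features, features),
    ).cuda().eval()

    # The graph records fixed memory addresses, so the input buffer is
    # allocated once and refilled in place before every replay.
    static_in = torch.zeros(batch, features, device="cuda")

    with torch.no_grad():
        # Warm up on a side stream so lazy initialization (for example
        # library workspaces) happens before capture, not during it.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            for _ in range(3):
                model(static_in)
        torch.cuda.current_stream().wait_stream(side)

        # Capture: the kernels are recorded into the graph, not executed.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = model(static_in)

        # Replay: copy new data into the static input, then launch the
        # whole recorded kernel sequence with a single call.
        new_batch = torch.randn(batch, features, device="cuda")
        static_in.copy_(new_batch)
        graph.replay()

        # Eager reference on the same data, for comparison only.
        eager_out = model(new_batch)

    torch.cuda.synchronize()
    return torch.allclose(static_out, eager_out, atol=1e-4)
```

After capture, each `graph.replay()` is one call from Python no matter how many kernels were recorded, which is where the overhead saving comes from. The cost is that shapes and buffer addresses are frozen at capture time, so new inputs have to be copied into `static_in` rather than passed in as fresh tensors.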
CUDA Graph Capture and PyTorch Integration Performance Benefits