There's more CPU overhead, but IMO it practically doesn't matter unless you are running tiny, reinforcement-learning-like workloads, and even there `torch.compile` (with `mode="reduce-overhead"`, which enables CUDA graphs) will CUDAGraph it to near-zero overhead.
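For anyone who wants to try this, here is a minimal sketch of the CUDA-graph path. The tiny `policy` network, its layer sizes, and the `step` name are made up for illustration; the only real API relied on is `torch.compile` with `mode="reduce-overhead"`. On a machine without a GPU it falls back to plain eager execution, since CUDA graphs don't apply there.

```python
import torch
import torch.nn as nn

# Hypothetical tiny policy network; the sizes are illustrative only.
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = policy.to(device).eval()

if device == "cuda":
    # mode="reduce-overhead" asks the compiler to use CUDA graphs, so the
    # per-kernel launch cost on the CPU side is largely amortized away.
    step = torch.compile(policy, mode="reduce-overhead")
else:
    # No GPU available: CUDA graphs do not apply, so run in eager mode.
    step = policy

# Keep the input shape fixed so a captured graph can be reused across calls.
obs = torch.randn(1, 8, device=device)

with torch.no_grad():
    for _ in range(3):  # early calls compile and capture; later calls reuse the result
        out = step(obs)
    action = out.argmax(dim=-1).item()
```

Whether the overhead really ends up near zero depends on the model and the shapes staying static, so it's worth timing the compiled `step` against the eager `policy` on your own workload.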