We spend a lot of time optimizing decently sized models too.
For anything normal-sized (one node, 8 GPUs), torch.compile + FSDP2 should be largely sufficient to get really good performance.
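As a rough illustration of that setup, here is a minimal sketch of combining FSDP2's `fully_shard` with `torch.compile` on a single node. The toy model, its sizes, the helper names (`build_model`, `shard_and_compile`, `main`), and the choice to shard first and compile afterwards are all assumptions made for the example, not something the post prescribes. It also assumes a recent PyTorch release in which `fully_shard` is importable from `torch.distributed.fsdp` (older releases exposed it under `torch.distributed._composable.fsdp`).

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn


class Block(nn.Module):
    """Toy residual MLP block standing in for a transformer layer (hypothetical)."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(torch.nn.functional.gelu(self.fc1(x)))


class ToyModel(nn.Module):
    def __init__(self, dim: int, n_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.head = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return self.head(x)


def build_model(dim: int = 1024, n_layers: int = 4) -> ToyModel:
    return ToyModel(dim, n_layers)


def shard_and_compile(model: ToyModel) -> nn.Module:
    # Lazy import so the model code above stays usable without a distributed setup.
    try:
        from torch.distributed.fsdp import fully_shard
    except ImportError:  # assumption: older releases kept it under _composable
        from torch.distributed._composable.fsdp import fully_shard

    # Shard each block separately so parameters are gathered one block at a time,
    # then shard the root module to cover the remaining parameters (the head).
    for block in model.blocks:
        fully_shard(block)
    fully_shard(model)
    return torch.compile(model)


def main() -> None:
    # Expects to be launched by torchrun, which sets LOCAL_RANK and friends.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = shard_and_compile(build_model().cuda())
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # One training step on random data, just to exercise forward and backward.
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()


# Only run the distributed part when launched via torchrun.
if "LOCAL_RANK" in os.environ:
    main()
```

On an 8-GPU node this would be launched with something like `torchrun --nproc_per_node=8 train.py`, where `train.py` is a placeholder filename. Whether to compile the whole model or each block individually is a tuning choice; the sketch picks the simpler option.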
We've spent hundreds (or probably thousands) of programmer-years working on
Optimizing AI Models with torch.compile and FSDP2