what are some easy ways for a normal person like me to speed up training a transformer on a single GPU? i'm using the huggingface T5 implementation + defaults. i know about FlashAttention, BetterTransformer, and torch.compile. will these all work together? is there anything else?
Optimizing Transformer Training on a Single GPU with HuggingFace