On TPUs you can't use dynamic sizes inside a loop, so you use masks with static sizes instead, and therefore the first step of the loop is as costly as the last one. On GPUs the first step could be faster, but the cost of the attention reduction is relatively low for large models, so the gain is limited.
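To make the masking idea concrete, here is a minimal JAX sketch of a decoding loop that attends over a fixed-size cache and masks out the positions that have not been filled yet. It is only an illustration under stated assumptions: the names `MAX_LEN`, `D`, `masked_attention_step`, and `decode` are hypothetical, and a single head with no batching is assumed.

```python
import jax
import jax.numpy as jnp

MAX_LEN = 8  # static cache length (hypothetical value)
D = 4        # head dimension (hypothetical value)

def masked_attention_step(q, k_cache, v_cache, step):
    """Attend over the whole static cache, masking positions after `step`."""
    # Scores always have shape (MAX_LEN,), so every step does the same work.
    scores = k_cache @ q / jnp.sqrt(D)
    valid = jnp.arange(MAX_LEN) <= step          # static-shape boolean mask
    scores = jnp.where(valid, scores, -jnp.inf)  # masked slots get zero weight
    weights = jax.nn.softmax(scores)
    return weights @ v_cache                     # shape (D,)

@jax.jit
def decode(qs, ks, vs):
    """Run MAX_LEN steps; all array shapes stay fixed inside the loop."""
    def body(step, carry):
        k_cache, v_cache, outs = carry
        # Write the current key/value into the preallocated cache.
        k_cache = k_cache.at[step].set(ks[step])
        v_cache = v_cache.at[step].set(vs[step])
        out = masked_attention_step(qs[step], k_cache, v_cache, step)
        return k_cache, v_cache, outs.at[step].set(out)

    zeros = jnp.zeros((MAX_LEN, D))
    _, _, outs = jax.lax.fori_loop(0, MAX_LEN, body, (zeros, zeros, zeros))
    return outs
```

Step 0 multiplies against all `MAX_LEN` cache rows even though only one is valid, which is the cost described above. A variant that slices the cache to the current length would do less work early on, but the slice length would then change from step to step, which is what a compiled loop with static shapes does not allow.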
TPU Dynamic Sizes Constraints and GPU Attention Efficiency Trade-offs