Scaling Challenges in Universal Transformer Parameter Sharing

It's hard to scale UT in parameter count because the params are shared across all layers (similar to ALBERT): extra depth reuses the same weights, so the only way to reach something like a 1B-parameter UT is to make that single shared layer very wide, and that makes the model super slow. That said, we didn't try scaling a non-shared UT, which could be okay.
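To make the tradeoff concrete, here's a minimal sketch (assuming PyTorch; the class names, widths, and depth are illustrative, not from the original post) contrasting UT-style weight sharing with a standard unshared stack. With sharing, depth adds compute but no parameters, so hitting a large parameter budget forces one very wide layer:

```python
# Minimal sketch (PyTorch assumed): UT/ALBERT-style sharing vs. a standard stack.
import torch
import torch.nn as nn

class SharedDepthEncoder(nn.Module):
    """UT/ALBERT-style: one layer's weights reused at every depth step."""
    def __init__(self, d_model: int, n_heads: int, depth: int):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):  # same parameters applied at every step
            x = self.layer(x)
        return x

class UnsharedEncoder(nn.Module):
    """Standard Transformer: independent weights per layer."""
    def __init__(self, d_model: int, n_heads: int, depth: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def n_params(m: nn.Module) -> int:
    """Total trainable parameter count."""
    return sum(p.numel() for p in m.parameters())

# Depth adds no parameters to the shared model, so matching a big budget
# means widening (and slowing down) the one shared layer instead.
print(n_params(SharedDepthEncoder(1024, 16, 24)))  # ~one layer's worth (~8M)
print(n_params(UnsharedEncoder(1024, 16, 24)))     # ~24x more parameters
```

The param counts make the point: at equal width and depth, the shared model has roughly 1/24th the parameters here, so scaling it to 1B means growing `d_model` instead, which increases per-layer compute quadratically.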