Scaling Challenges in Universal Transformer Parameter Sharing

It's hard to scale Universal Transformers (UT) because the parameters are shared across all layers (similar to ALBERT): the whole model's weights live in a single block that is re-applied at every depth step. A 1B-parameter UT would therefore run a 1B-parameter layer at every step, which makes it super slow. That said, we didn't try scaling a non-shared UT, which could be okay. More about …

→ View original post on X — @yitayml
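
As a rough illustration of why the shared block becomes the bottleneck, here is a minimal sketch (assuming PyTorch; the hidden size, head count, and step count are illustrative, not taken from the post) contrasting a standard stack of distinct layers with a UT-style block whose single set of weights is reused at every depth step:

    import torch
    import torch.nn as nn

    def count_params(module):
        # Total number of trainable parameters in a module.
        return sum(p.numel() for p in module.parameters())

    d_model, n_heads, n_steps = 2048, 16, 24

    # Standard transformer: n_steps distinct layers, so parameters are spread across depth.
    standard_stack = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(n_steps)]
    )

    # UT-style sharing: one layer whose weights are reused at every depth step,
    # so reaching a large total parameter count means making this single layer huge.
    shared_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    x = torch.randn(1, 128, d_model)   # (batch, sequence, features)
    h = x
    for _ in range(n_steps):           # the same weights are applied n_steps times
        h = shared_layer(h)

    print("standard stack parameters:", count_params(standard_stack))
    print("shared UT block parameters:", count_params(shared_layer))

For the same width and depth, the shared model has roughly 1/n_steps the parameters of the standard stack. Flipping that around, matching a 1B-parameter non-shared model would require the single shared layer to hold about 1B parameters itself, and that full per-layer cost is then paid again at every one of the n_steps recurrence steps.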
