Also, a tip on weight sharding: as a general rule of thumb, you can fully replicate every dimension of a variable except the last one, which should be sharded across the "model" axis.
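Here is a minimal sketch of that rule in JAX, assuming a device mesh with a "model" axis. The helper name `last_dim_sharding`, the mesh layout, and the toy parameter tree are all hypothetical and only there for illustration. The fallback to full replication when the last dimension does not divide evenly across the axis is my own addition, not part of the rule itself.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P


def last_dim_sharding(x, mesh, axis_name="model"):
    """Replicate every dimension of `x` except the last, which is split over `axis_name`.

    Hypothetical helper. Scalars, and arrays whose last dimension is not
    divisible by the axis size, are fully replicated instead (an assumption
    made here so the sketch runs on any device count).
    """
    axis_size = mesh.shape[axis_name]
    if x.ndim == 0 or x.shape[-1] % axis_size != 0:
        return NamedSharding(mesh, P())
    # None means "replicated along this dimension".
    return NamedSharding(mesh, P(*([None] * (x.ndim - 1)), axis_name))


# Assumed mesh layout: a "data" axis of size 1 and a "model" axis that
# spans all available devices.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Toy parameter tree standing in for real model weights.
params = {
    "dense": {"kernel": jnp.ones((128, 256)), "bias": jnp.zeros((256,))},
    "conv": {"kernel": jnp.ones((3, 3, 16, 32))},
}

# One sharding per weight, then place each weight accordingly.
shardings = jax.tree_util.tree_map(lambda x: last_dim_sharding(x, mesh), params)
sharded_params = jax.tree_util.tree_map(jax.device_put, params, shardings)

for name, s in jax.tree_util.tree_leaves_with_path(shardings):
    print(jax.tree_util.keystr(name), s.spec)
```

On a single device this is effectively a no-op, but the same code should distribute the last dimension of each weight once the "model" axis has more than one device. Treat it as a starting point: some layers may benefit from a different layout.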
Weights Sharding Strategy: Full Replication Except Last Dimension