Some really nice work by @Ar_Douillard and many coauthors on how to efficiently train large, sparse, modular models across many geographically distributed data centers. As @fouriergalois pointed out in the replies, this is a step in a longer journey.
Training Large Sparse Modular Models Across Distributed Data Centers