AI Dynamics

Global AI News Aggregator

Meta’s 24k H100 Cluster Pods Infrastructure for Llama3

Here are details on Meta's 24k H100 Cluster Pods that we use for Llama3 training.
* Network: two versions, RoCEv2 or InfiniBand
* Llama3 trains on RoCEv2
* Storage: NFS/FUSE based on Tectonic/Hammerspace
* Stock PyTorch: no real modifications that aren't upstreamed
* NCCL with

→ View original post on X (@soumithchintala)
