This is the best documentation I could find of the OpenELM training data – it looks like the bulk of it comes from RefinedWeb, RedPajama, The Pile and Dolma https://github.com/apple/corenet/blob/main/projects/openelm/README-pretraining.md