SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. Sometimes less is more! https://reddit.com/r/MachineLearning/comments/1467jvm/np_introducing_slimpajama627b_the_largest/
…
SlimPajama-627B: Large Deduplicated Open-Source LLM Dataset