We recently announced the availability of SlimPajama, an open-source, cleaned, and deduplicated version of RedPajama-1T. It is half the size and trains twice as fast, and when upsampled it performs as well as or better than RedPajama. See below for the dataset and preprocessing library.
SlimPajama: Open-Source Cleaned RedPajama Dataset Released
