Yes! The bonus materials include training on the Project Gutenberg public-domain book corpus. I don't want to go beyond that and curate other datasets, though, because of copyright concerns. However, you could use, e.g., the FineWeb dataset, which is available on Hugging Face.
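In case it helps, here is a rough sketch of how one might peek at FineWeb without downloading all of it, using the streaming mode of the Hugging Face `datasets` library. The dataset ID `HuggingFaceFW/fineweb`, the `sample-10BT` subset, and the `text` field are assumptions based on the public dataset card, and the helper names `iter_texts` and `stream_fineweb` are made up for this example; none of this is part of the bonus materials.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_texts(records: Iterable[dict], limit: int) -> Iterator[str]:
    """Yield the "text" field of at most `limit` records."""
    for record in islice(records, limit):
        yield record["text"]


def stream_fineweb(limit: int = 3) -> Iterator[str]:
    """Stream a few FineWeb documents (requires `pip install datasets`)."""
    # Imported lazily so iter_texts works even without the library installed.
    from datasets import load_dataset

    # Assumption: dataset ID and subset name as listed on the dataset card.
    # streaming=True reads records on demand instead of downloading everything.
    dataset = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,
    )
    return iter_texts(dataset, limit)


if __name__ == "__main__":
    for text in stream_fineweb():
        # Print the first 200 characters of each document on one line.
        print(text[:200].replace("\n", " "))
```

Streaming keeps disk usage low, which is handy for a quick look before deciding how much of the data to actually download for pretraining.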
Training Materials Using Project Gutenberg Public Domain Corpus