First, there aren't really many "open source" datasets.
Most of them are just scrapes from the internet: CommonCrawl, libgen, etc.
If we're talking about scrapes from the internet and ignoring other aspects (legal, credit, etc.), there are things like ShareGPT that have
Open Source Datasets: Internet Scrapes and Legal Considerations