GPT-3 was trained on 45 TB of text data from several categories:
⬩ Common Crawl (8 years of raw web page crawling)
⬩ WebText (the text of web pages linked from Reddit posts with 3+ upvotes)
⬩ Books (internet-based books corpora)
⬩ Wikipedia

The data is then "weighted" as follows:
