This is historically false. The WebText dataset, built for GPT-2 and used in expanded form (among other sources) to train GPT-3, consists of the scraped content of URLs linked from Reddit posts with at least 3 karma, that is, net of downvotes. You don’t scrape URLs that only appear in low-karma posts because they’re mostly spam.