AI Dynamics

Global AI News Aggregator


Clarifying GPT-3’s WebText training data

This is historically false. The WebText dataset used (among others) to train GPT-3 consists of the scraped content of URLs linked from Reddit posts with at least 3 karma, with downvotes counted against the score. URLs that appear only in low-karma posts are not scraped, because they are mostly spam.
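
As a rough illustration of the karma filter described above, here is a minimal Python sketch. It is not OpenAI's actual pipeline: the `Post` record, the `select_urls` helper, and the example URLs are all hypothetical, and it assumes karma is a single net score (upvotes minus downvotes). Only the threshold of 3 comes from the post.

```python
from dataclasses import dataclass

MIN_KARMA = 3  # threshold stated in the post


@dataclass
class Post:
    # Hypothetical record type, for illustration only.
    url: str    # outbound link contained in the post
    karma: int  # assumed to be the net score (upvotes minus downvotes)


def select_urls(posts, min_karma=MIN_KARMA):
    """Return URLs linked from at least one post that meets the karma threshold.

    URLs are deduplicated and returned in first-seen order.
    """
    seen = set()
    selected = []
    for post in posts:
        if post.karma >= min_karma and post.url not in seen:
            seen.add(post.url)
            selected.append(post.url)
    return selected


# Made-up example data.
posts = [
    Post("https://example.com/article", 5),
    Post("https://example.com/spam", 0),
    Post("https://example.com/article", 1),
    Post("https://example.com/borderline", 3),
]
urls = select_urls(posts)
```

In this sketch a URL is kept if any post linking to it reaches the threshold, so a URL that appears only in low-karma posts is dropped.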

→ View original post on X — @goodside