The second ~1hr builds up the Transformer: multi-headed self-attention, MLP, residual connections, layernorms. Then we train one and compare it to OpenAI's GPT-3 (spoiler: ours is around ~10K – 1M times smaller but the ~same neural net) and ChatGPT (i.e. ours is pretraining only)
Building Transformers: Self-Attention, Training, and GPT-3 Comparison
By
–
