Transformer Attention Mechanism: Baseline Model and Message Passing Introduction

The first ~1 hour is spent 1) establishing a baseline (bigram) language model, and 2) introducing the core "attention" mechanism at the heart of the Transformer as a kind of communication / message passing between nodes in a directed graph. Minimal sketches of both ideas follow.
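As a minimal sketch of the baseline (names and details are illustrative, not the lecture's exact code): a bigram language model predicts the next token from the current token alone, so it reduces to a single lookup table, here an `nn.Embedding` whose row i holds the next-token logits for token i.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Row i of this table holds the next-token logits for token i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx: (B, T) batch of token indices
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Autoregressive sampling: draw one token at a time from the
        # distribution at the last position, then append it.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)          # (B, vocab_size)
            idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```

The point of this baseline is that it uses no context beyond the current token; everything the attention mechanism adds is a way for positions to communicate.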
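The message-passing view can also be sketched directly (again illustrative, with assumed shapes, not the lecture's exact code): treat each of the T positions as a node in a directed graph, where an edge j → i means node i may aggregate information from node j. In an autoregressive Transformer that adjacency is lower-triangular, since a position may only look at itself and the past; the softmaxed affinity matrix then gives the per-node weights for summing incoming "messages" (the value vectors).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 1, 8, 32        # batch, sequence length (nodes), channels
head_size = 16
x = torch.randn(B, T, C)  # each node emits a feature vector

key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                 # (B, T, head_size)

# Affinity between every pair of nodes: how strongly node j's key
# matches node i's query, scaled to keep the softmax diffuse at init.
wei = q @ k.transpose(-2, -1) * head_size ** -0.5    # (B, T, T)

# Remove edges from the future: node i only hears nodes j <= i,
# giving the lower-triangular (directed, acyclic) graph structure.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)   # row i = message weights into node i

# Each node's output is the weighted sum of its incoming messages.
out = wei @ v                  # (B, T, head_size)
print(out.shape)               # torch.Size([1, 8, 16])
```

Read this way, attention is just weighted message passing: the query/key dot products decide which edges carry weight, and the value vectors are the messages aggregated along them.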