AI Dynamics

Global AI News Aggregator

About

SGD as ResNet: Rethinking Attention Architecture Fundamentals

SGD is a ResNet too (the blocks of it are fwd+bwd), the residual stream is the weights so… We're not taking the Attention is All You Need part literally enough? 😀

→ View original post on X — @karpathy