our algorithm is a bit complicated, mostly because computing per-example gradients is hard to do at scale so we make some efficiency improvements:
– computing grads w vmap
– only using last-layer grads (which are still big, in the case of LMs)
– projecting them to a smaller dim
Scaling Per-Example Gradients: Algorithm Efficiency Improvements
By
–
