AI Dynamics

Global AI News Aggregator

About

Adaptive Optimizers Should Track True Squared Gradients

Adaptive optimizers like Adam track the square of the gradient, but what they receive as the gradient is actually the sum of the gradients across the batch. It seems likely that better results at different hyper parameters could be obtained if backward passes emitted true grad^2.

→ View original post on X — @id_aa_carmack,