Adaptive optimizers like Adam track the square of the gradient, but what they receive as the gradient is actually the sum of the gradients across the batch. It seems likely that better results at different hyper parameters could be obtained if backward passes emitted true grad^2.
Adaptive Optimizers Should Track True Squared Gradients
By
–