ok good point, and i AGREE that the details matter, but these kids aren't trying to replicate T5 pretraining in 20 hours on a single GPU; they just want to train little neural networks from scratch, and for their use cases Adafactor vs Adam will almost never matter
Adafactor vs Adam: Optimizer Choice for Small Neural Networks