Transformer Learning Rates: Encoder Head and Layer Optimization

For transformer text decoders it isn't clear yet, as far as I know – but putting the encoder head at a higher learning rate does seem to be reliably helpful. I also suspect the first two and last three layers in the body should get a higher learning rate, but I haven't run rigorous tests.
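A minimal sketch of this kind of discriminative learning-rate setup using PyTorch per-parameter groups. The module shapes, layer count, and multipliers below are illustrative placeholders (a toy stand-in for a transformer body and head), not measured values from any experiment:

```python
import torch
from torch import nn

# Toy stand-in for a transformer: sizes and names are hypothetical.
n_layers = 8
body = nn.ModuleList([nn.Linear(16, 16) for _ in range(n_layers)])
head = nn.Linear(16, 4)

base_lr = 1e-4
boost = 5.0  # hypothetical multiplier for the boosted layers; tune per task

groups = []
for i, layer in enumerate(body):
    # First 2 and last 3 body layers get the boosted lr, per the hunch above.
    lr = base_lr * boost if (i < 2 or i >= n_layers - 3) else base_lr
    groups.append({"params": layer.parameters(), "lr": lr})

# Head at a higher lr, per the observation above (10x is a placeholder).
groups.append({"params": head.parameters(), "lr": base_lr * 10})

optimizer = torch.optim.AdamW(groups)
```

Each dict passed to the optimizer becomes its own parameter group with its own `lr`, so a scheduler applied on top will scale all groups proportionally.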