As I said above, the use of Transformer-Base as the proxy task *is* in So et al.: "Specifically, to train a Transformer to peak performance on WMT'14 En-De requires ∼300K training steps, or 10 hours, in the base size when using a single Google TPU V.2 chip, as we do in our search"
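For a rough sense of scale, the figures in that quote imply a sustained training throughput of about 8 steps per second; a quick back-of-envelope check (assuming the 300K steps and 10 hours are both taken at face value):

```python
# Throughput implied by the quoted numbers: ~300K training steps
# in ~10 hours on a single TPU v2 chip, Transformer-Base size.
steps = 300_000
hours = 10
steps_per_sec = steps / (hours * 3600)
print(f"Implied throughput: {steps_per_sec:.2f} steps/sec")  # ~8.33
```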