You mean that, given both an encoder-decoder and a decoder-only architecture, the encoder-decoder is easier to train because its masking pretraining tasks make better use of the data?
Encoder-Decoder vs Decoder-Only Architecture Training Efficiency