Sharing a piece of work I contributed to while at @GoogleAI:
* a new, improved mC4 corpus (29T char tokens and 107 languages) that gets language sampling right with UniMax sampling.
* open-source pretrained uMT5 models trained on 1T tokens.
* UniMax sampling solves some
Google AI Releases Improved mC4 Corpus and uMT5 Models