Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, Mixtral decodes at the speed of a 12B model, while effectively having access to 45B parameters. (3/n)
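To make the routing idea concrete, here is a minimal toy sketch of top-2 expert selection in plain Python. It is not Mixtral's actual code. The expert count of 8, the hidden size of 4, the random weights, and the helper names (`route`, `moe_layer`, `make_matrix`, `matvec`) are all assumptions chosen for illustration, and each "expert" is reduced to a single linear map rather than a full feed-forward block. The gate scores every expert, but only the two best-scoring ones are evaluated for a given token.

```python
import math
import random

NUM_EXPERTS = 8  # assumption: experts per layer (not stated in the post)
TOP_K = 2        # from the post: each token only sees two experts
DIM = 4          # toy hidden size, purely illustrative


def softmax(xs):
    # Numerically stable softmax over a short list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng):
    # Random weights stand in for trained parameters.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def route(token, gate):
    """Return the indices and normalized weights of the top-k experts."""
    logits = matvec(gate, token)  # one score per expert
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = order[:TOP_K]
    # Softmax over the selected scores only, so the weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return top, weights


def moe_layer(token, gate, experts):
    """Weighted sum of the selected experts' outputs for one token."""
    top, weights = route(token, gate)
    out = [0.0] * len(token)
    for idx, w in zip(top, weights):
        expert_out = matvec(experts[idx], token)  # unselected experts never run
        out = [o + w * e for o, e in zip(out, expert_out)]
    return out, top


rng = random.Random(0)
gate = make_matrix(NUM_EXPERTS, DIM, rng)
experts = [make_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]

used = set()
for t, token in enumerate(tokens):
    out, top = moe_layer(token, gate, experts)
    used.update(top)
    print(f"token {t}: experts {top}")
print(f"experts touched across the sequence: {sorted(used)}")
```

Per token, the compute cost is two expert evaluations regardless of how many experts exist. Across a sequence, different tokens can pick different pairs, so more of the total parameter pool gets used, which is the effect the post describes.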
Mixtral: 12B Speed with 45B Parameter Access via Expert Selection