Mixtral has a similar architecture to Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (experts). For every token, at each layer, a router network selects two of these experts to process the current state and combines their outputs.
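To make the routing step concrete, here is a minimal pure-Python sketch of one such layer acting on a single token. It is not Mixtral's actual code: the toy ReLU feedforward block, the tiny dimensions, the random weights, and the names `make_expert` and `moe_layer` are all hypothetical, and normalising the two selected router scores with a softmax is an assumption made for illustration.

```python
import math
import random

NUM_EXPERTS = 8   # feedforward blocks per layer, as described above
TOP_K = 2         # experts selected per token, as described above


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def make_expert(dim, hidden, rng):
    """Build a toy two-layer ReLU feedforward block (stand-in for a real expert)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, a) for a in matvec(w_in, x)]
        return matvec(w_out, h)

    return expert


def moe_layer(x, router_weights, experts, top_k=TOP_K):
    """Route one token state through its top_k experts and mix their outputs."""
    logits = matvec(router_weights, x)  # one router score per expert
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: gate weights are a softmax over the selected scores only.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    gates = [e / total for e in exps]
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen, gates


rng = random.Random(0)
DIM, HIDDEN = 4, 16  # illustrative sizes, far smaller than any real model
experts = [make_expert(DIM, HIDDEN, rng) for _ in range(NUM_EXPERTS)]
router_weights = [[rng.gauss(0, 1.0) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token_state = [0.5, -1.0, 0.25, 2.0]
output, chosen, gates = moe_layer(token_state, router_weights, experts)
print("selected experts:", chosen)
print("gate weights:", [round(g, 3) for g in gates])
```

The point of the sketch is that only two of the eight blocks run for any given token, so the per-token compute stays close to that of two feedforward blocks even though the layer holds the parameters of eight.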
Mixtral Architecture: 8 Feedforward Blocks with Router-Selected Experts