Dug a bit more into the modelling code (v2 vs v3); here are the key changes:
> MoE gate function changed from softmax (v2) → sigmoid (v3)
> New Top-k Selection method `noaux_tc`
> Added `e_score_correction_bias` for better expert selection or even training
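
To make the three changes above concrete, here is a rough, pure-Python sketch of how I read them. It is not the actual modelling code. The `route` helper and its arguments are hypothetical names of mine. It also assumes that the correction bias only affects which experts get picked, and that the final weights come from the unbiased scores, renormalised over the chosen experts. I've left out any expert-grouping logic.

```python
import math


def softmax(logits):
    """v2-style gate: scores compete with each other and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """v3-style gate: each expert gets an independent score in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def route(logits, k, gate="sigmoid", bias=None):
    """Pick k experts for one token; returns (expert indices, mixing weights).

    Hypothetical helper for illustration only.
    """
    scores = sigmoid(logits) if gate == "sigmoid" else softmax(logits)
    if bias is None:
        bias = [0.0] * len(scores)
    # Assumption: the per-expert bias (standing in for e_score_correction_bias)
    # is added only when ranking experts for selection.
    biased = [s + b for s, b in zip(scores, bias)]
    top = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    # Assumption: mixing weights use the unbiased scores, renormalised to sum to 1.
    total = sum(scores[i] for i in top)
    weights = [scores[i] / total for i in top]
    return top, weights


logits = [2.0, 1.0, 0.5, -1.0]
print(route(logits, k=2, gate="softmax"))
print(route(logits, k=2, gate="sigmoid"))
# A positive bias on expert 2 pulls it into the top-k without inflating its weight.
print(route(logits, k=2, gate="sigmoid", bias=[0.0, 0.0, 0.3, 0.0]))
```

The point of the last call: nudging one expert's bias changes who gets selected, which is one way to even out expert load without adding an auxiliary loss term, while the mixing weights still reflect the raw gate scores.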
MoE Architecture Changes: Softmax to Sigmoid Gate Function
