Thanks for the correction; my logic for comparing defaults in the config was a bit faulty. In general, the thing to note here is that they switched the MoE gate function to Sigmoid (instead of Softmax). Interestingly, they tried this in DeepSeekVL earlier. In addition they have a