It’s interesting. I’ve seen hidden -> wider -> hidden FFNs and hidden -> narrower -> hidden FFNs.
But Command A+ seems to use 4096 -> 4096 -> 4096 for each expert FFN, which I haven’t seen before (as far as I remember).
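To put the three shapes side by side, here is a rough sketch that computes the intermediate/hidden ratio and the weight count for a single FFN. The helper name `ffn_shape_summary` is made up for illustration, and the 16384 and 1024 widths are example values I picked, not taken from any model. Only the 4096 -> 4096 -> 4096 shape comes from the observation above. Biases are ignored, and whether the expert FFN is gated is an assumption either way, so both cases are supported.

```python
def ffn_shape_summary(hidden: int, intermediate: int, gated: bool = False) -> dict:
    """Summarize a hidden -> intermediate -> hidden FFN (biases ignored)."""
    # A plain FFN has an up projection and a down projection.
    # A gated FFN (SwiGLU-style) adds a third hidden x intermediate matrix.
    num_matrices = 3 if gated else 2
    return {
        "ratio": intermediate / hidden,
        "params": num_matrices * hidden * intermediate,
    }


# Illustrative widths only; the "equal" row is the shape discussed above.
for name, inter in [("wider", 16384), ("narrower", 1024), ("equal", 4096)]:
    print(name, ffn_shape_summary(4096, inter))
```

With a ratio of 1.0, each expert's FFN weights scale as hidden squared, so the per-expert cost is set entirely by the hidden size rather than by a separate expansion factor.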
FFN Architecture Patterns in Mixture-of-Experts Models