AI Dynamics

Global AI News Aggregator

FFN Architecture Patterns in Mixture-of-Experts Models

It’s interesting. I’ve seen hidden -> wider -> hidden FFNs and hidden -> narrower -> hidden FFNs.
But Command A+ seems to use 4096 -> 4096 -> 4096 for each expert FFN, which I haven't seen before (as far as I remember).
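To make the three shapes concrete, here is a rough sketch that counts the weights in a single FFN for each pattern. Only the 4096 -> 4096 -> 4096 case comes from the post; the wider (16384) and narrower (1024) intermediate sizes are made-up illustrations, and the `gated` flag is an assumption, since the post does not say whether the expert FFNs use two weight matrices or a gated, three-matrix layout.

```python
def ffn_params(hidden: int, intermediate: int, gated: bool = False) -> int:
    """Weight count of one FFN (hidden -> intermediate -> hidden), biases ignored.

    gated=False: up-projection + down-projection (2 matrices).
    gated=True:  adds a gate projection (3 matrices), as in SwiGLU-style FFNs.
    Which variant a given model uses is an assumption here.
    """
    n_matrices = 3 if gated else 2
    return n_matrices * hidden * intermediate


# Intermediate sizes for "wider" and "narrower" are hypothetical examples.
shapes = {
    "wider": (4096, 16384),
    "narrower": (4096, 1024),
    "equal": (4096, 4096),  # the shape mentioned in the post
}

for name, (h, m) in shapes.items():
    print(f"{name:9s} {h} -> {m} -> {h}: {ffn_params(h, m):,} weights")
```

With two matrices, the equal-width case comes to 2 x 4096 x 4096 = 33,554,432 weights per expert FFN, or 50,331,648 if it is gated. That is a quarter of the hypothetical 4x-wide layout and four times the hypothetical quarter-width one.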

→ View original post on X (@rasbt)