DeepSeek MoE Architecture for Large Language Models

Let me add a bit of context to the latest DeepSeek code release, as I felt it was a bit bare-bones. Mixture-of-Experts (MoE) is a simple extension of transformers that is rapidly establishing itself as the go-to architecture for mid-to-large LLMs (20B-600B parameters). It
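To make the idea concrete, below is a minimal sketch of a generic top-k routed MoE layer in PyTorch. All names here (`MoELayer`, `n_experts`, `top_k`) are my own illustrative choices, and the per-expert loop is written for readability rather than speed; this is a toy under those assumptions, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k routed Mixture-of-Experts feed-forward layer (illustrative names)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary transformer-style feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens so routing is per token.
        tokens = x.reshape(-1, x.size(-1))
        scores = self.gate(tokens)                      # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # each token picks its top-k experts
        weights = F.softmax(weights, dim=-1)            # normalize over the selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                           # which tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            # Weighted contribution of expert e to its assigned tokens.
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)

# Usage:
x = torch.randn(2, 16, 512)
layer = MoELayer(d_model=512, d_ff=2048, n_experts=8, top_k=2)
y = layer(x)  # same shape as x
```

The key property: each token is processed by only `top_k` of the `n_experts` feed-forward blocks, so total parameter count scales with the number of experts while per-token compute stays close to that of a dense model. Production implementations batch this routing with specialized kernels instead of a Python loop.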