any references come to mind for 128k context on a 1B-ish MoE? not sure i understand this area very well. i guess the primary ablation i would want is long-context quality vs. width dimension?
128k Context Window Ablation Study on Billion-Parameter MoE Models