The thing that most impresses me about Llama 3: how did they pack so much knowledge and reasoning into dense 8B and 70B models, when everyone else has been scaling sparse MoEs? This still doesn’t mean that having a lot of GPUs is unimportant. It is probably even more important.
Llama 3: Dense Model Efficiency vs Sparse MoE Scaling Strategy