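So to answer the original question: they overtrained the 13B model but not the 65B model, likely because they decided on the training budget beforehand. Thus, it's suboptimal to run the 65B in production, since a model that isn't overtrained is tuned for training cost rather than inference cost.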
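To make "overtrained" concrete, here is a rough sketch of the arithmetic. It assumes the commonly cited rule of thumb of about 20 training tokens per parameter for a compute-optimal model, and it assumes token counts of roughly 1.0T for the 13B and 1.4T for the 65B. Neither the rule nor the token counts are stated in this post, and the function name `overtraining_factor` is just something I made up for illustration.

```python
# Assumption: ~20 training tokens per parameter is roughly compute-optimal
# (a widely quoted heuristic, not a figure taken from this post).
TOKENS_PER_PARAM_RULE = 20.0


def overtraining_factor(params: float, tokens: float,
                        rule: float = TOKENS_PER_PARAM_RULE) -> float:
    """Ratio of tokens actually used to the heuristic compute-optimal count.

    A value near 1 means roughly compute-optimal; well above 1 means overtrained.
    """
    return tokens / (rule * params)


# Assumed (parameters, training tokens) pairs, for illustration only.
models = {
    "13B": (13e9, 1.0e12),
    "65B": (65e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    factor = overtraining_factor(params, tokens)
    print(f"{name}: {factor:.2f}x the heuristic token count")
```

Under those assumptions the 13B comes out at several times its heuristic token count, while the 65B sits close to 1x, which is the sense in which only the smaller model was overtrained.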
Overtraining of 13B Model vs Suboptimal 65B Production Deployment