This is cool!! I'm not exactly sure how to upstream these changes to llm.c… Part of me wants to reproduce GPT-2/3 using their exact hyperparameters just for historical aesthetics, but part of me also wants to just train things as fast as possible. Probably both. – lr 3X is
llm.c optimization: balancing historical accuracy and training speed