We want to do a full GPT-2 repro. At channel size 1600, this is a 2.1X higher C. And we'll want to ~max out the batch dim to fit in memory too. So the "easy times" will be over soon.
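To make the 2.1X figure concrete, here is a rough sketch of the arithmetic. It assumes the baseline is the smallest GPT-2 (C = 768), which the note above does not state outright; `BASELINE_C` and `scale_factors` are names made up for this illustration.

```python
# Rough scaling arithmetic for moving to the full GPT-2 channel size.
# ASSUMPTION: the baseline is GPT-2 small with C = 768 (not stated in the note).
BASELINE_C = 768
TARGET_C = 1600  # channel size quoted in the note


def scale_factors(target_c: int, baseline_c: int) -> dict:
    """Return how much larger things get when C grows from baseline_c to target_c."""
    ratio = target_c / baseline_c
    return {
        # Per-token activations of width C grow linearly with C.
        "channels": ratio,
        # Weight matrices in a block are (C x C)-shaped up to constants,
        # so their size grows with C squared.
        "matmul_weights": ratio * ratio,
    }


factors = scale_factors(TARGET_C, BASELINE_C)
print(f"C ratio: {factors['channels']:.2f}x")
print(f"per-layer weight ratio: {factors['matmul_weights']:.2f}x")
```

The linear factor rounds to the 2.1X quoted above. The squared factor is why memory gets tight fast once the batch dim is pushed up as well.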
GPT-2 Reproduction with Increased Channel Size and Memory Optimization