If you match the parameters, you're actually underestimating the improvement: llm.c takes up a lot less memory, so you can crank up the batch size. I haven't fully dug into why PyTorch takes up that much space. I'm slightly worried I'm doing something wrong, but I'm not sure what.
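To make the batch-size point concrete, here is a rough back-of-the-envelope sketch. It assumes memory use splits into a fixed part (weights, optimizer state) plus a part that grows linearly with batch size. The helper `max_batch_size` and every number below are hypothetical, chosen only for illustration, and are not measurements of llm.c or PyTorch.

```python
def max_batch_size(budget_bytes: int, fixed_bytes: int, per_sample_bytes: int) -> int:
    """Largest batch size B such that fixed + B * per_sample <= budget."""
    if per_sample_bytes <= 0:
        raise ValueError("per_sample_bytes must be positive")
    free = budget_bytes - fixed_bytes
    # If the fixed cost alone exceeds the budget, nothing fits.
    return max(free // per_sample_bytes, 0)


GiB = 1024 ** 3

# Hypothetical footprints, NOT measured values for either framework.
budget = 40 * GiB
lean = max_batch_size(budget, fixed_bytes=2 * GiB, per_sample_bytes=1 * GiB)
heavy = max_batch_size(budget, fixed_bytes=4 * GiB, per_sample_bytes=2 * GiB)

print(f"lean implementation fits batch size {lean}")
print(f"heavy implementation fits batch size {heavy}")
```

Under these made-up numbers, the leaner implementation fits about twice the batch on the same device, which is the headroom a matched-parameter comparison leaves out.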
llm.c Memory Efficiency Advantages Over PyTorch