When we then compute the capacity of different models (GPT models with varying numbers of layers and hidden dimensions), averaged over hundreds of models in fp32, we get the following curve. It indicates a roughly linear trend of around 3.6 bits-per-parameter, regardless of the exact architectural details:
Figure: GPT model capacity analysis, showing a linear trend at 3.6 bits-per-parameter.
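
To make the linear relationship concrete, here is a minimal sketch of how a bits-per-parameter figure could be read off a set of (parameter count, capacity) measurements, and what the 3.6 figure would imply for a given model size. The function names, the choice of a least-squares line through the origin, and the sample points are all assumptions made for illustration. The points are synthetic and generated from the 3.6 figure itself, so they are not measurements from the experiment.

```python
def estimated_capacity_bits(n_params, bits_per_param=3.6):
    """Capacity implied by a linear trend: capacity = bits_per_param * n_params."""
    return n_params * bits_per_param


def fit_bits_per_parameter(param_counts, capacities_bits):
    """Least-squares slope of a line through the origin.

    Assumption: capacity is proportional to parameter count, so the
    slope is sum(p * c) / sum(p * p).
    """
    if not param_counts or len(param_counts) != len(capacities_bits):
        raise ValueError("need two non-empty sequences of equal length")
    numerator = sum(p * c for p, c in zip(param_counts, capacities_bits))
    denominator = sum(p * p for p in param_counts)
    return numerator / denominator


# Synthetic, illustrative points only (hypothetical model sizes).
params = [100_000, 500_000, 1_000_000, 5_000_000]
capacities = [estimated_capacity_bits(p) for p in params]

slope = fit_bits_per_parameter(params, capacities)
print(f"fitted slope: {slope:.2f} bits-per-parameter")

# What the trend would imply for a 1M-parameter model, in megabytes.
one_million = estimated_capacity_bits(1_000_000)
print(f"1M parameters: about {one_million / 8 / 1e6:.2f} MB of capacity")
```

With real measurements, the fitted slope would only approximate the reported value, and a fit with a free intercept might be more appropriate if the data do not pass through the origin.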
