“Sparser, Faster, Lighter Transformer Language Models”: LLMs are naturally sparse in their feedforward layers, but unstructured sparsity usually doesn’t translate into real speedups on GPUs, because the hardware and software stack is built for dense compute. The key idea of the paper is to redesign
