5/ LazyLLM: introduces a novel dynamic token pruning method for efficient long-context LLM inference. It can accelerate the prefilling stage of a Llama 2 7B model by 2.34x while maintaining high accuracy, by selectively computing the KV only for tokens that are important for predicting the next token.
LazyLLM: Dynamic Token Pruning Accelerates LLM Inference by 2.34x
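To make the idea of selective KV computation concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the LazyLLM implementation: it assumes token importance is the head-averaged attention that the last prompt token pays to each earlier token, and that a fixed fraction of tokens is kept. The names `select_important_tokens`, `pruned_kv`, and `keep_ratio` are hypothetical.

```python
import numpy as np


def select_important_tokens(attn_to_last, keep_ratio):
    """Pick the prompt tokens that matter most for the next-token prediction.

    attn_to_last: array of shape (num_heads, seq_len) holding the attention
        weights from the last prompt token to every prompt token.
    keep_ratio: fraction of tokens to keep (hypothetical parameter).
    Returns the sorted indices of the kept tokens.
    """
    # Assumption: importance is the attention weight averaged over heads.
    importance = attn_to_last.mean(axis=0)
    seq_len = importance.shape[0]
    k = max(1, int(np.ceil(keep_ratio * seq_len)))
    top = np.argsort(importance)[-k:]
    # Always keep the last token, since it produces the next-token logits.
    return np.union1d(top, [seq_len - 1])


def pruned_kv(hidden, w_k, w_v, keep):
    """Compute keys and values only for the kept tokens."""
    kept_hidden = hidden[keep]
    return kept_hidden @ w_k, kept_hidden @ w_v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_heads, seq_len, d_model = 4, 16, 8
    attn = rng.random((num_heads, seq_len))
    attn /= attn.sum(axis=1, keepdims=True)  # normalize each head's row
    hidden = rng.standard_normal((seq_len, d_model))
    w_k = rng.standard_normal((d_model, d_model))
    w_v = rng.standard_normal((d_model, d_model))

    keep = select_important_tokens(attn, keep_ratio=0.25)
    keys, values = pruned_kv(hidden, w_k, w_v, keep)
    print(len(keep), keys.shape, values.shape)
```

The saving in this sketch comes from the projection step: keys and values are computed for only the kept rows of `hidden`, so the cost of that step scales with the number of kept tokens rather than the full prompt length.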