2/5 Used gpu_memory_utilization=0.2 to reproduce quickly, but this happens naturally when the scheduler runs out of token budget and GPU blocks get recycled. New request gets 1 token → misclassified as "decode" But num_computed_tokens=0 → should be "prefill".
GPU Memory Utilization Bug: Token Classification Mismatch
By
–
Leave a Reply