4/5 The horizontal fix: autoscale on queue depth, not GPU utilization 100% GPU = efficiency, not overload. We use vllm:num_requests_waiting to trigger scale-up when the deployment can't keep up with incoming requests.
Autoscaling by Queue Depth: Beyond GPU Utilization Metrics
By
–