AI Dynamics

Global AI News Aggregator


Autoscaling by Queue Depth: Beyond GPU Utilization Metrics

4/5 The horizontal fix: autoscale on queue depth, not GPU utilization. 100% GPU utilization means efficiency, not overload. We use vllm:num_requests_waiting to trigger scale-up when the deployment can't keep up with incoming requests.
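
As a rough illustration of the idea, here is a minimal Python sketch of a queue-depth scaling decision. It assumes the vllm:num_requests_waiting gauge is read from each replica in Prometheus text format. The function names, the target of 5 waiting requests per replica, and the replica bounds are hypothetical values chosen for the example, not settings from the post.

```python
import math
import re

# Matches sample lines such as:
#   vllm:num_requests_waiting{model_name="m"} 7.0
# Comment lines ("# HELP ...", "# TYPE ...") do not start with the metric
# name, so they are skipped.
_WAITING = re.compile(
    r"^vllm:num_requests_waiting(?:\{[^}]*\})?\s+(\S+)", re.MULTILINE
)


def waiting_requests(metrics_text: str) -> float:
    """Sum vllm:num_requests_waiting over all samples in one metrics payload."""
    return sum(float(value) for value in _WAITING.findall(metrics_text))


def desired_replicas(
    waiting_per_replica: list[float],
    target_waiting: float = 5.0,  # hypothetical target queue depth per replica
    min_replicas: int = 1,        # hypothetical lower bound
    max_replicas: int = 8,        # hypothetical upper bound
) -> int:
    """Pick a replica count so the total queue is at most target_waiting each.

    GPU utilization is deliberately ignored: a saturated GPU with an empty
    queue needs no extra replicas, while a growing queue does.
    """
    total_waiting = sum(waiting_per_replica)
    wanted = math.ceil(total_waiting / target_waiting)
    return max(min_replicas, min(max_replicas, wanted))
```

With two replicas reporting 7 and 9 waiting requests, the total of 16 divided by the target of 5 rounds up to 4 replicas. In a Kubernetes deployment the same rule would more likely be expressed through an autoscaler driven by this metric than through hand-written code.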

→ View original post on X — @ai21labs