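Prefill and decode stress hardware differently. Prefill processes the whole prompt in parallel and is compute-bound, so it gains most from Blackwell Tensor Cores, with memory bandwidth keeping those cores fed and NVLink plus SHARP reductions speeding up cross-GPU collectives. Decode generates one token at a time and is latency- and memory-bound; here GB200’s rack-scale NVLink domain opens up parallelism that Hopper could not.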
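The compute-bound versus memory-bound split can be illustrated with a rough roofline-style estimate. The sketch below is a simplification that counts FLOPs and bytes moved for a single square linear layer and compares the ratio with a machine's "ridge point" (peak FLOP/s divided by memory bandwidth). The function names, the layer shape, and the `PEAK_FLOPS` and `MEM_BW` values are hypothetical placeholders chosen for illustration, not published Blackwell or GB200 specifications.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (d_model x d_model) linear layer.

    Assumes 2-byte elements by default and that the weights are read
    once, with each token's activations read once and written once.
    """
    flops = 2 * tokens * d_model * d_model              # one multiply-add = 2 FLOPs
    weight_bytes = d_model * d_model * bytes_per_elem   # weights streamed from memory
    act_bytes = 2 * tokens * d_model * bytes_per_elem   # input read + output write
    return flops / (weight_bytes + act_bytes)


def bound(tokens: int, d_model: int, peak_flops: float, mem_bw: float) -> str:
    """Classify the layer as compute- or memory-bound on a given machine."""
    ridge = peak_flops / mem_bw  # FLOPs/byte needed to saturate the compute units
    if arithmetic_intensity(tokens, d_model) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator, NOT real hardware numbers.
PEAK_FLOPS = 1.0e15  # FLOP/s (assumed)
MEM_BW = 4.0e12      # bytes/s (assumed), giving a ridge point of 250 FLOPs/byte

# Prefill: a 4096-token prompt goes through the layer in one pass.
prefill_regime = bound(4096, 8192, PEAK_FLOPS, MEM_BW)

# Decode: a single sequence produces one token per step.
decode_regime = bound(1, 8192, PEAK_FLOPS, MEM_BW)

print("prefill:", arithmetic_intensity(4096, 8192), prefill_regime)
print("decode: ", arithmetic_intensity(1, 8192), decode_regime)
```

Under these assumptions prefill reaches 2048 FLOPs per byte, well above the ridge point, while single-sequence decode sits just under 1 FLOP per byte because every step re-reads the full weight matrix to produce one token. Batching more sequences per decode step raises the intensity, which is one reason a larger NVLink domain with more parallelism options matters for decode.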