TPU 8i is co-designed with our Gemini research team to support low latency inference. Among the attributes that support this are large amounts of on-chip SRAM, enabling more computations to be done on chip without having to go to HBM for weights or KVCache state as often. The
Google TPU 8i Co-Designed for Low Latency Inference
By
–
