
Looks like DeepSeek is handing hardware optimization control directly to developers in the latest DeepGEMM update: for the fp8_mqa_logits function, the dtype of the weights tensor now explicitly dictates the accumulation precision.
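
To make "accumulation precision" concrete, here is a minimal sketch in plain Python. It is not DeepGEMM code and does not call fp8_mqa_logits; the helper names (round_to, weighted_sum) and the input values are hypothetical, chosen only to show how the precision of the running total changes the result of a weighted sum.

```python
import struct

def round_to(value, fmt):
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def weighted_sum(scores, weights, acc_fmt):
    """Accumulate sum(w * s), rounding the running total to acc_fmt each step."""
    total = 0.0
    for s, w in zip(scores, weights):
        total = round_to(total + round_to(w * s, acc_fmt), acc_fmt)
    return total

# Hypothetical scores and weights; the values are illustrative only.
scores = [0.1] * 4096
weights = [1.0] * 4096

exact = sum(w * s for s, w in zip(scores, weights))  # float64 reference
half = weighted_sum(scores, weights, "e")            # half-precision accumulator
single = weighted_sum(scores, weights, "f")          # single-precision accumulator

print("half-precision error:  ", abs(half - exact))
print("single-precision error:", abs(single - exact))
```

With these inputs the half-precision total falls well short of the reference, because once the running sum is large enough each new term is smaller than half the spacing between representable values and gets rounded away. The single-precision total stays close. That trade-off between speed or memory and numerical fidelity is the kind of choice an accumulation-precision switch exposes.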