
It's very dense, a new format that I'm playing with. Still need to test it under concurrency + try it in nvfp4. Hopefully I'll be able to compare perplexity in bf16 / fp8 / nvfp4, as well as TP performance jumps from 2 -> 4 -> 8 nodes across all 3 formats. That's the most important thing.
