"Improved Large Language Diffusion Models": ByteDance just brought bidirectional masked diffusion on par with autoregressive LMs! The paper's model, iLLaDA, is an 8B Transformer trained from scratch on 12T tokens, and it keeps the same denoising objective for SFT on a 25B-token instruction corpus.
ByteDance’s iLLaDA 8B diffusion model rivals autoregressive LMs
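To make "the same denoising objective" concrete, here is a minimal sketch of a masked-diffusion loss. It assumes the standard formulation from the original LLaDA work (mask each token independently with probability t, then score only the masked positions with cross-entropy weighted by 1/t); the post does not state iLLaDA's exact loss, so the normalization, the `MASK` id, and both function names are illustrative assumptions. For SFT, the same loss is reused with only the response tokens eligible for masking.

```python
import math
import random

MASK = -1  # hypothetical mask token id, chosen only for this sketch


def mask_tokens(tokens, t, rng, maskable=None):
    """Replace each maskable token with MASK independently with probability t.

    For pretraining every position is maskable; for SFT pass maskable=False
    on prompt positions so that only the response is noised.
    """
    if maskable is None:
        maskable = [True] * len(tokens)
    return [
        MASK if (m and rng.random() < t) else tok
        for tok, m in zip(tokens, maskable)
    ]


def masked_diffusion_loss(tokens, noisy, log_probs, t, n_maskable=None):
    """Cross-entropy over masked positions only, weighted by 1/t.

    log_probs[i][v] is the model's log-probability of vocab id v at
    position i, predicted from the noisy sequence with bidirectional
    attention. Dividing by the number of maskable tokens is an assumption.
    """
    if n_maskable is None:
        n_maskable = len(tokens)
    total = 0.0
    for i, (tok, noisy_tok) in enumerate(zip(tokens, noisy)):
        if noisy_tok == MASK:
            total -= log_probs[i][tok]
    return total / (t * n_maskable)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [0, 1, 2, 3]
    t = 0.5
    noisy = mask_tokens(tokens, t, rng)
    # Stand-in for a real model: uniform distribution over a 4-token vocab.
    uniform = [[math.log(0.25)] * 4 for _ in tokens]
    print(noisy, masked_diffusion_loss(tokens, noisy, uniform, t))
```

Because the loss depends only on which positions are masked, switching from pretraining to SFT changes the `maskable` pattern and the normalizer, not the objective itself.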
