Diffusion-based LLMs (dLLMs) promise fast, parallel decoding, but their bidirectional attention makes inference expensive because prefill and decoding must be repeated. Enter ODB-dLLM, a dual-boundary framework that combines adaptive prefill length prediction with dLLM-specific jump-share speculative decoding.
ODB-dLLM: Faster Parallelizable Diffusion-Based Language Models
