Jamba was trained to handle contexts of up to 256K tokens. It performs very well on the needle-in-a-haystack evaluation, which is especially interesting given that it uses only 4 attention layers. It also outperforms Mixtral on most long-context benchmarks. 4/6
Jamba’s Impressive Long-Context Performance with Minimal Attention Layers
