We see that the hybrid Jamba model outperforms both pure attention and pure Mamba models. Attention-to-Mamba layer ratios of 1:3 and 1:7 perform comparably, and since the 1:7 ratio is more compute-efficient, we opt for it in our model. 2/6
[Figure: Jamba hybrid model outperforms pure attention and pure Mamba architectures]
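To make the 1:7 ratio concrete, here is a minimal PyTorch sketch of how attention and Mamba layers could be interleaved so that each block of 8 layers contains 1 attention layer and 7 Mamba layers. This is not Jamba's actual code: `AttentionLayer` is a generic pre-norm self-attention block, and `MambaLayer` is a hypothetical placeholder (a real model would use a proper SSM implementation such as the `Mamba` block from `mamba_ssm`).

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Stand-in pre-norm self-attention layer with a residual connection."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class MambaLayer(nn.Module):
    """Hypothetical placeholder for a Mamba (SSM) layer; the linear token
    mixer here only stands in for the real selective state-space block."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # placeholder mixer

    def forward(self, x):
        return x + self.mixer(self.norm(x))

def build_hybrid_stack(n_layers: int, d_model: int, attn_every: int = 8):
    """One attention layer per `attn_every` layers, i.e. a
    1:(attn_every - 1) attention-to-Mamba ratio; attn_every=8 gives 1:7."""
    layers = []
    for i in range(n_layers):
        if i % attn_every == attn_every - 1:  # e.g. layers 7, 15, 23, ...
            layers.append(AttentionLayer(d_model))
        else:
            layers.append(MambaLayer(d_model))
    return nn.Sequential(*layers)

stack = build_hybrid_stack(n_layers=32, d_model=512)
x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(stack(x).shape)         # torch.Size([2, 16, 512])
```

The compute argument follows from this layout: attention cost grows quadratically with sequence length while Mamba's is linear, so the fewer attention layers per block (1:7 vs. 1:3), the cheaper the stack at long context for comparable quality.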