The key insight, I think, is using an optimal depth-to-width ratio for the transformer architecture, and training on a lot of good data. Even though NeoBERT has slightly more parameters, it's still faster AND more effective than ModernBERT for long sequences. The architectural point is covered here:
Optimal Depth-Width Ratio Improves Transformer Performance
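To make the depth-to-width trade-off concrete, here is a minimal sketch of how a deeper-narrower and a shallower-wider encoder can land at a similar parameter budget. The `encoder_params` helper and both configs are illustrative assumptions for a generic BERT-style encoder, not the actual NeoBERT or ModernBERT configurations.

```python
# A minimal sketch (not from the post) of how depth vs. width trades off
# at a roughly fixed parameter budget. The configs below are illustrative
# assumptions, not the real NeoBERT/ModernBERT hyperparameters.

def encoder_params(depth: int, width: int, vocab: int = 30_522) -> int:
    """Approximate parameter count of a BERT-style encoder.

    Per layer: 4 * width^2 for attention (Q, K, V, output projections)
    plus 8 * width^2 for a feed-forward block with hidden size 4 * width.
    Embeddings: vocab * width. Biases and LayerNorms are ignored.
    """
    per_layer = 4 * width**2 + 8 * width**2
    return depth * per_layer + vocab * width

# Two hypothetical configs with a similar budget but different
# depth-to-width ratios:
deep_narrow = encoder_params(depth=28, width=768)    # deeper, narrower
shallow_wide = encoder_params(depth=12, width=1152)  # shallower, wider

for name, n in [("deep_narrow", deep_narrow), ("shallow_wide", shallow_wide)]:
    print(f"{name}: ~{n / 1e6:.0f}M parameters")
```

Both configs come out around 220M parameters, so any performance gap between them is attributable to the depth-to-width ratio rather than raw model size, which is exactly the comparison the linked post is about.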