Does anyone know of an encoder-only (BERT-like) model that supports a really long context length? Also, what's the most efficient way to process many tokens with a model like this? I know about enabling FlashAttention and BetterTransformer. What else is out there?
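One optimization that works regardless of the attention kernel is length-sorted ("smart") batching: sort tokenized sequences by length before forming batches, so each batch is padded only to its own longest sequence instead of a global max. With many variable-length documents this can cut a large fraction of wasted pad-token compute. A minimal pure-Python sketch (the function name and the use of `0` as the pad ID are illustrative assumptions, not from any particular library):

```python
# Sketch: length-sorted ("smart") batching to reduce padding waste.
# Sorting sequences by length before batching means each batch pads
# only to its own max length, so the encoder attends over far fewer
# pad tokens overall.

def make_length_sorted_batches(token_id_seqs, batch_size, pad_id=0):
    """Group tokenized sequences into batches of similar length,
    padding each batch only to that batch's longest sequence.
    Returns per-batch input_ids, attention_mask, and the original
    indices so outputs can be restored to input order."""
    order = sorted(range(len(token_id_seqs)),
                   key=lambda i: len(token_id_seqs[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idxs = order[start:start + batch_size]
        max_len = max(len(token_id_seqs[i]) for i in idxs)
        padded = [token_id_seqs[i] + [pad_id] * (max_len - len(token_id_seqs[i]))
                  for i in idxs]
        mask = [[1] * len(token_id_seqs[i]) + [0] * (max_len - len(token_id_seqs[i]))
                for i in idxs]
        batches.append({"input_ids": padded,
                        "attention_mask": mask,
                        "orig_indices": idxs})
    return batches

seqs = [[5, 6], [1, 2, 3, 4], [7], [8, 9, 10]]
batches = make_length_sorted_batches(seqs, batch_size=2)
# The two shortest sequences end up together, padded only to length 2;
# the two longest end up together, padded to length 4.
```

The same idea is what dynamic/bucketed padding in common NLP pipelines implements; keeping the original indices lets you scatter the per-sequence outputs back into input order after the forward passes.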
Long Context Encoder Models and Token Processing Optimization Techniques