AI Dynamics

Global AI News Aggregator

BM25: The Powerful 30-Year-Old Search Algorithm Still Beating Vectors

Stop using vector search everywhere! A 30-year-old algorithm with zero training, zero embeddings, and zero fine-tuning still powers Elasticsearch, OpenSearch, and most production search systems today. It's called BM25. Let me explain what makes it so powerful: Imagine you're searching for "transformer attention mechanism" in a library of ML papers. BM25 asks three simple questions: "How rare is this word?" Every paper contains "the" and "is", which makes it useless. But "transformer" is specific and informative. BM25 boosts rare words and ignores the noise. → This is IDF(qᵢ) in the formula "How many times does it appear?" If "attention" appears 10 times in a paper, that's a good sign. But 10 vs 100 occurrences won't make much difference. BM25 applies diminishing returns. → This is f(qᵢ, D) combined with k₁ that controls saturation "Is this document unusually long?" A 50-page paper will naturally contain more keywords than a 5-page paper. BM25 levels the playing field so longer documents don't cheat their way to the top. → This is |D|/avgdl controlled by parameter b Three questions. No neural networks. No training data. Just elegant math (refer to the image below) The best part: BM25 excels at exact keyword matching – something embeddings often struggle with. If your user searches for "error code 5012," embeddings might return semantically similar results. BM25 will find the exact match. This is why hybrid search exists. Top RAG systems today combine BM25 with vector search. You get the best of both worlds: semantic understanding AND precise keyword matching. So before you throw GPUs at every search problem, consider BM25. It might already solve your problem, or make your semantic search even better when combined.

→ View original post on X — @akshay_pachaar, 2026-04-05 13:02 UTC

Commentaires

Leave a Reply

Your email address will not be published. Required fields are marked *