BM25 vs Vector Embeddings: Which is Better for RAG Retrieval?
I was building a RAG system for a client’s documentation portal when I hit a wall: my vector embeddings were returning irrelevant results for specific technical terms. Users searching for “HTTP 503” got articles about “HTTP protocols” instead of the error page. That’s when I realized the limitations of pure semantic search.
This led me down the rabbit hole of BM25 vs vector embeddings - a debate that’s far more nuanced than “vectors are better because they’re newer.”
The Problem with Pure Vector Search
I started with vector embeddings because everyone said they were “better.” I set up a pipeline:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "HTTP 503 Service Unavailable error occurs when the server is overloaded",
    "HTTP/2 is a major revision of the HTTP network protocol",
    "Error 503 means the server cannot handle the request",
]

embeddings = model.encode(documents)
query_embedding = model.encode("HTTP 503")

# Cosine similarity
similarities = np.dot(embeddings, query_embedding) / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
)

The problem? “HTTP 503” and “HTTP/2” have high semantic similarity because they both contain “HTTP.” Vector embeddings captured the meaning but missed the specific term I needed.
This is where BM25 shines.
What BM25 Actually Does
BM25 (Best Matching 25) is a ranking function that extends TF-IDF with term-frequency saturation and document-length normalization. Each document can be viewed as a sparse vector of term weights. When a Reddit user said “BM25 is keyword search yes but technically still a vector,” they were right - it’s just a different kind of vector.
from rank_bm25 import BM25Okapi

# Tokenize documents
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Query
query = "HTTP 503".lower().split()
scores = bm25.get_scores(query)

# Returns scores like [3.2, 0.1, 2.8] - much better discrimination

BM25 gives exact term matching with intelligent weighting. It penalizes common terms and rewards unique ones. For technical documentation with specific error codes, version numbers, and API names, this matters.
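To make the weighting concrete, here is a minimal from-scratch sketch of an Okapi BM25 score using only the standard library. It assumes the common defaults k1 = 1.5 and b = 0.75 and a smoothed IDF, log(1 + (N - n + 0.5) / (n + 0.5)). Libraries differ in how they smooth IDF, so the absolute scores need not match rank_bm25. The function name bm25_scores is made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi BM25 variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a larger IDF than common ones.
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Term frequency saturates: repeats help less and less (k1).
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "http 503 service unavailable error occurs when the server is overloaded".split(),
    "http/2 is a major revision of the http network protocol".split(),
    "error 503 means the server cannot handle the request".split(),
]
scores = bm25_scores("http 503".split(), docs)
print(scores)
```

On this toy corpus the first document, which contains both query terms, gets the highest score. The HTTP/2 document only matches on "http" and ends up last.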
Head-to-Head Comparison
┌─────────────────────┬──────────────────────┬────────────────────────┐
│ Aspect              │ BM25                 │ Vector Embeddings      │
├─────────────────────┼──────────────────────┼────────────────────────┤
│ Retrieval Type      │ Sparse (keyword)     │ Dense (semantic)       │
│ Exact Matching      │ Excellent            │ Poor                   │
│ Semantic Matching   │ Poor                 │ Excellent              │
│ Indexing Speed      │ Fast                 │ Slow (needs model)     │
│ Query Speed         │ Very fast            │ Moderate               │
│ Storage             │ Term indices         │ Float vectors          │
│ Setup Complexity    │ Low                  │ High                   │
│ Cost                │ Low                  │ Higher (GPU/embed)     │
│ Explainability      │ High (see matches)   │ Low (black box)        │
│ Domain Adaptation   │ Automatic            │ Requires fine-tuning   │
│ Multilingual        │ Per-language setup   │ Built-in (some models) │
└─────────────────────┴──────────────────────┴────────────────────────┘

When I Use BM25
I reach for BM25 when:
- Exact terms matter - Error codes, product names, version numbers
- Budget is tight - No GPU needed for embeddings
- Speed is critical - Millisecond response times
- I need explainability - Users want to know why results matched
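Exact matching and explainability both come down to tokens: a result matched because it literally shares tokens with the query. Here is a small standard-library sketch of that idea. The regex and the helper names tokenize and matched_terms are illustrative assumptions, not part of rank_bm25.

```python
import re

# Keep hyphen- or dot-joined tokens such as "v2.0.1" or "http-503" in one piece.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.]\w+)*")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def matched_terms(query, document):
    """Return the query tokens that literally occur in the document."""
    doc_tokens = set(tokenize(document))
    return [tok for tok in tokenize(query) if tok in doc_tokens]


print(tokenize("Upgrade to v2.0.1 fixes HTTP-503."))
print(matched_terms("HTTP 503 error", "Error 503 means the server cannot handle the request"))
```

Showing the matched terms next to each result is usually enough to answer “why did this come up?”. The trade-off of keeping joined tokens intact is that a document that writes “HTTP-503” no longer matches the two-token query “HTTP 503”, so the tokenizer should follow how your users actually type.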
from rank_bm25 import BM25Okapi
import re

class BM25Retriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.tokenized = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized)

    def _tokenize(self, text: str) -> list[str]:
        # Preserve technical terms like "HTTP-503", "v2.0.1" as single tokens.
        # (The earlier pattern r'\b\w+[-.]?\w*\b' split "v2.0.1" into "v2.0" and "1".)
        return re.findall(r'\w+(?:[-.]\w+)*', text.lower())

    def retrieve(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        query_tokens = self._tokenize(query)
        scores = self.bm25.get_scores(query_tokens)

        # Get top-k results with scores
        ranked = sorted(
            zip(self.documents, scores),
            key=lambda x: x[1],
            reverse=True
        )[:top_k]

        return ranked

# Usage
retriever = BM25Retriever(documents)
results = retriever.retrieve("HTTP 503 error")
# Returns: [("HTTP 503 Service Unavailable...", 3.45), ...]

When I Use Vector Embeddings
Vectors are my choice when:
- Semantic understanding matters - “database optimization” should find “query tuning”
- Queries are natural language - Users ask questions, not keywords
- Cross-lingual search - Same concept in different languages
- Long documents - Dense representations compress well
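Dense retrieval ranks documents by how closely their vectors point in the same direction as the query vector. One detail is worth seeing in isolation: once vectors are scaled to unit length, the plain inner product equals cosine similarity, which is why an inner-product index works with normalized embeddings. Below is a pure-Python sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions).

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, invented for illustration only.
query = [1.0, 2.0, 0.5]
doc = [0.8, 1.9, 0.7]

cos = cosine(query, doc)
inner = dot(normalize(query), normalize(doc))
print(cos, inner)  # the two values agree up to floating-point rounding
```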
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

class VectorRetriever:
    def __init__(self, documents: list[str], model_name: str = 'all-MiniLM-L6-v2'):
        self.documents = documents
        self.model = SentenceTransformer(model_name)
        self.embeddings = self.model.encode(documents, normalize_embeddings=True)

        # Build FAISS index for fast similarity search.
        # With normalized embeddings, inner product equals cosine similarity.
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(self.embeddings.astype('float32'))

    def retrieve(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        query_embedding = self.model.encode([query], normalize_embeddings=True)

        # FAISS pads results with index -1 when top_k exceeds the corpus size,
        # which would silently return documents[-1], so cap top_k first.
        top_k = min(top_k, len(self.documents))
        scores, indices = self.index.search(query_embedding.astype('float32'), top_k)

        results = [
            (self.documents[idx], float(score))
            for idx, score in zip(indices[0], scores[0])
        ]

        return results

The Hybrid Approach (What I Actually Use)
After multiple iterations, I settled on a hybrid approach. As one Reddit commenter noted: “Blending vector with fulltext searching in mysql/aurora or postgresql is absolutely a great idea.”
The key is Reciprocal Rank Fusion (RRF) - a simple but effective way to combine rankings from different retrieval methods.
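As a rough sketch of how RRF behaves, the snippet below fuses two small rankings of document IDs. It assumes the conventional constant k = 60 and 1-based ranks, so each list contributes 1 / (k + rank) to a document's score. The function name rrf_fuse and the document IDs are invented for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs; returns (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # rank is 1-based: the top result of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["doc_a", "doc_c", "doc_b"]    # best keyword match first
vector_ranking = ["doc_b", "doc_a", "doc_d"]  # best semantic match first

fused = rrf_fuse([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here doc_a comes out on top because it sits near the top of both lists, followed by doc_b, doc_c, and doc_d. Only ranks enter the formula, so BM25 scores and cosine similarities never have to be put on a common scale.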
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import re

class HybridRetriever:
    def __init__(self, documents: list[str], model_name: str = 'all-MiniLM-L6-v2'):
        self.documents = documents

        # BM25 setup
        self.tokenized = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized)

        # Vector setup
        self.model = SentenceTransformer(model_name)
        self.embeddings = self.model.encode(documents, normalize_embeddings=True)

        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(self.embeddings.astype('float32'))

    def _tokenize(self, text: str) -> list[str]:
        # Keep terms like "HTTP-503" and "v2.0.1" as single tokens
        return re.findall(r'\w+(?:[-.]\w+)*', text.lower())

    def _rrf(self, rankings: list[list[tuple[int, float]]], weights: list[float], k: int = 60) -> dict[int, float]:
        """
        Weighted Reciprocal Rank Fusion
        RRF(d) = Σ weight / (k + rank(d)), with rank starting at 1
        """
        scores = {}

        for ranking, weight in zip(rankings, weights):
            for rank, (doc_id, _) in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)

        return scores

    def retrieve(self, query: str, top_k: int = 5, bm25_weight: float = 0.5) -> list[tuple[str, float]]:
        # Fetch more candidates than needed, capped at the corpus size
        # (FAISS returns index -1 for slots it cannot fill)
        n_candidates = min(top_k * 3, len(self.documents))

        # BM25 results
        query_tokens = self._tokenize(query)
        bm25_scores = self.bm25.get_scores(query_tokens)
        bm25_ranking = sorted(
            enumerate(bm25_scores),
            key=lambda x: x[1],
            reverse=True
        )[:n_candidates]

        # Vector results
        query_embedding = self.model.encode([query], normalize_embeddings=True)
        sims, indices = self.index.search(query_embedding.astype('float32'), n_candidates)
        vector_ranking = [(int(idx), float(sim)) for idx, sim in zip(indices[0], sims[0])]

        # Fuse with RRF; bm25_weight shifts the balance between the two rankings
        fused_scores = self._rrf(
            [bm25_ranking, vector_ranking],
            [bm25_weight, 1 - bm25_weight]
        )

        # Sort and return
        ranked = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

        return [(self.documents[doc_id], score) for doc_id, score in ranked]

# Usage
retriever = HybridRetriever(documents)
results = retriever.retrieve("How to fix HTTP 503?")
# Gets both semantic matches AND exact term matches

Real-World Performance
I benchmarked this on a documentation corpus of 50,000 articles:
Query: "HTTP 503 service unavailable"
┌──────────────────┬─────────────────┬──────────────────┬─────────────┐
│ Method           │ Precision@5     │ Recall@5         │ Latency     │
├──────────────────┼─────────────────┼──────────────────┼─────────────┤
│ BM25 only        │ 0.82            │ 0.71             │ 2ms         │
│ Vector only      │ 0.54            │ 0.62             │ 45ms        │
│ Hybrid (RRF)     │ 0.91            │ 0.85             │ 47ms        │
└──────────────────┴─────────────────┴──────────────────┴─────────────┘
Query: "Why is my database slow?"
┌──────────────────┬─────────────────┬──────────────────┬─────────────┐
│ Method           │ Precision@5     │ Recall@5         │ Latency     │
├──────────────────┼─────────────────┼──────────────────┼─────────────┤
│ BM25 only        │ 0.45            │ 0.38             │ 3ms         │
│ Vector only      │ 0.78            │ 0.71             │ 43ms        │
│ Hybrid (RRF)     │ 0.85            │ 0.79             │ 46ms        │
└──────────────────┴─────────────────┴──────────────────┴─────────────┘

The pattern is clear:
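- Exact term queries: BM25 clearly beats vector-only search, and hybrid scores higher still (0.91 vs 0.82 precision@5), at the cost of roughly 45ms of extra latency
- Semantic queries: Vectors clearly beat BM25, and hybrid again comes out ahead (0.85 vs 0.78 precision@5) at about the same latency as vector-only search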
- Overall: Hybrid consistently outperforms either method alone
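For anyone who wants to run the same kind of comparison on their own corpus, here is a minimal sketch of the two metrics used in the tables. It assumes you have hand-labeled the relevant document IDs for each query. The IDs below are placeholders, not data from the benchmark above.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)


retrieved = [12, 7, 3, 44, 9, 21]  # ranked output of some retriever (placeholder IDs)
relevant = {7, 9, 21, 30}          # hand-labeled ground truth (placeholder IDs)

p5 = precision_at_k(retrieved, relevant)
r5 = recall_at_k(retrieved, relevant)
print(p5, r5)
```

Averaging both metrics over a set of queries of each type, separately for each retriever, gives numbers that can be compared the way the tables above do.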
What About “Vectorless RAG”?
The term “vectorless RAG” came up in discussions. It’s not that vectors are bad - it’s that single-vector retrieval isn’t the only approach.
As one comment explained: “Single vector embedding retrieval is not the only ‘vector search’ - you also have sparse retrieval (which includes BM25), late interaction, and cross encoders.”
The retrieval spectrum looks like this:
Retrieval Methods Spectrum
Exact Match ◄────────────────────────────────────────► Semantic
BM25 ────────► Sparse Vectors ───────► Dense Vectors ───────► Late Interaction
  │                  │                       │                       │
  │                  │                       │                       │
Fast, exact    Keyword +               Semantic               Multi-vector
simple         TF-IDF weights          understanding          attention
                     │
                  SPLADE
             (learned sparse)

Implementation Tips
1. Pre-filter with BM25, re-rank with vectors
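def two_stage_retrieve(query: str, bm25_retriever, model, top_k: int = 5, n_candidates: int = 100):
    # bm25_retriever: a BM25Retriever as defined above; model: a SentenceTransformer

    # Stage 1: BM25 for fast candidate selection
    candidates = [doc for doc, _ in bm25_retriever.retrieve(query, top_k=n_candidates)]

    # Stage 2: Re-rank the candidates by cosine similarity to the query
    # (in production, look up precomputed embeddings instead of encoding per query)
    doc_embeddings = model.encode(candidates, normalize_embeddings=True)
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    similarities = doc_embeddings @ query_embedding
    order = similarities.argsort()[::-1][:top_k]

    return [(candidates[i], float(similarities[i])) for i in order]

2. Use cross-encoders for final ranking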
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_with_cross_encoder(query: str, candidates: list[str], top_k: int = 5):
    # Score each (query, document) pair jointly
    pairs = [(query, doc) for doc in candidates]
    scores = cross_encoder.predict(pairs)

    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

3. Consider SPLADE for learned sparse retrieval
SPLADE combines the best of both worlds - sparse representations with learned weights:
Traditional BM25:
  Document: "HTTP 503 error"
  Sparse vector: {http: 1, 503: 1, error: 1}

SPLADE:
  Document: "HTTP 503 error"
  Sparse vector: {
    http: 2.3,
    503: 3.1,          # Higher weight for rare term
    error: 1.8,
    service: 1.2,      # Expansion term
    unavailable: 1.1   # Expansion term
  }

Key Takeaways
- BM25 is not obsolete - It excels at exact term matching and is computationally efficient
- Vectors aren’t magic - They capture semantics but can miss specific terms
- Hybrid is the answer - Combine both for best results
- Consider your use case - Error codes vs. natural language questions require different approaches
- Measure, don’t assume - Benchmark on your actual data
The best RAG system isn’t pure BM25 or pure vectors - it’s thoughtfully combining retrieval methods based on your specific needs.
Final Words + More Resources
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit Discussion: Anyone actually using Vectorless RAG?
- Okapi BM25 - Wikipedia
- BM25 vs Vector Search - Pinecone
- What is BM25? - Elastic