BM25 vs Vector Embeddings: Which is Better for RAG Retrieval?

Mar 27, 2026

I was building a RAG system for a client’s documentation portal when I hit a wall: my vector embeddings were returning irrelevant results for specific technical terms. Users searching for “HTTP 503” got articles about “HTTP protocols” instead of the error page. That’s when I realized the limitations of pure semantic search.

This led me down the rabbit hole of BM25 vs vector embeddings - a debate that’s far more nuanced than “vectors are better because they’re newer.”

The Problem with Pure Vector Search

I started with vector embeddings because everyone said they were “better.” I set up a pipeline:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "HTTP 503 Service Unavailable error occurs when the server is overloaded",
    "HTTP/2 is a major revision of the HTTP network protocol",
    "Error 503 means the server cannot handle the request"
]

embeddings = model.encode(documents)
query_embedding = model.encode("HTTP 503")

# Cosine similarity
similarities = np.dot(embeddings, query_embedding) / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
)

The problem? The query “HTTP 503” lands close to the “HTTP/2” document in embedding space because both are about HTTP. The embeddings captured the general topic but did not treat “503” as the one term that had to match.

This is where BM25 shines.

What BM25 Actually Does

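BM25 (Best Matching 25) is a ranking function that builds on TF-IDF. It scores each document from three things: how often the query terms occur in it, how rare those terms are across the corpus, and how long the document is relative to the average. When a Reddit user said “BM25 is keyword search yes but technically still a vector,” they were right - each document is effectively a sparse vector of term weights, which is just a different kind of vector.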

from rank_bm25 import BM25Okapi

# Tokenize documents
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Query
query = "HTTP 503".lower().split()
scores = bm25.get_scores(query)

# Returns one score per document, e.g. [3.2, 0.1, 2.8]
# (illustrative values; actual scores depend on corpus statistics)

BM25 gives exact term matching with intelligent weighting. It penalizes common terms and rewards unique ones. For technical documentation with specific error codes, version numbers, and API names, this matters.
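
To make the weighting concrete, here is a minimal from-scratch sketch of the Okapi BM25 formula using only the standard library. The function name `bm25_scores` is my own, `k1=1.5` and `b=0.75` are commonly used defaults, and the IDF uses the "+1" smoothed variant that keeps weights positive, so the numbers will not match rank_bm25 exactly.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (assumed variant): rarer terms get a larger weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates, and longer documents are normalized down
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "http 503 service unavailable error occurs when the server is overloaded".split(),
    "http/2 is a major revision of the http network protocol".split(),
    "error 503 means the server cannot handle the request".split(),
]
scores = bm25_scores(["http", "503"], docs)
```

The first document contains both query terms and scores highest. The HTTP/2 document matches only “http” and scores lowest.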

Head-to-Head Comparison

┌─────────────────────┬───────────────────────┬───────────────────────┐
│ Aspect              │ BM25                  │ Vector Embeddings     │
├─────────────────────┼───────────────────────┼───────────────────────┤
│ Retrieval Type      │ Sparse (keyword)      │ Dense (semantic)      │
│ Exact Matching      │ Excellent             │ Poor                  │
│ Semantic Matching   │ Poor                  │ Excellent             │
│ Indexing Speed      │ Fast                  │ Slow (needs model)    │
│ Query Speed         │ Very fast             │ Moderate              │
│ Storage             │ Term indices          │ Float vectors         │
│ Setup Complexity    │ Low                   │ High                  │
│ Cost                │ Low                   │ Higher (GPU/embed)    │
│ Explainability      │ High (see matches)    │ Low (black box)       │
│ Domain Adaptation   │ Automatic             │ Requires fine-tuning  │
│ Multilingual        │ Per-language setup    │ Built-in (some models)│
└─────────────────────┴───────────────────────┴───────────────────────┘

When I Use BM25

I reach for BM25 when:

Exact terms matter - Error codes, product names, version numbers
Budget is tight - No GPU needed for embeddings
Speed is critical - Millisecond response times
I need explainability - Users want to know why results matched

from rank_bm25 import BM25Okapi
import re

class BM25Retriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.tokenized = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized)

    def _tokenize(self, text: str) -> list[str]:
        # Keep hyphen- or dot-joined terms such as "http-503" or "v2.0.1" as single tokens
        tokens = re.findall(r'\w+(?:[-.]\w+)*', text.lower())
        return tokens

    def retrieve(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        query_tokens = self._tokenize(query)
        scores = self.bm25.get_scores(query_tokens)

        # Get top-k results with scores
        ranked = sorted(
            zip(self.documents, scores),
            key=lambda x: x[1],
            reverse=True
        )[:top_k]

        return ranked

# Usage
retriever = BM25Retriever(documents)
results = retriever.retrieve("HTTP 503 error")
# Returns (document, score) pairs, best first, e.g.
# [("HTTP 503 Service Unavailable...", 3.45), ...]  (score is illustrative)
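
Tokenization decides what counts as an "exact term", so it is worth checking what a regex actually emits for technical strings. The sketch below uses a pattern of my own (not from any library) that keeps hyphen- and dot-joined tokens together. Treat it as one possible choice rather than the right one.

```python
import re

# Hypothetical pattern: runs of word characters, optionally joined by "-" or "."
TOKEN_RE = re.compile(r"\w+(?:[-.]\w+)*")

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into tokens, keeping joined terms intact."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("Upgrade to v2.0.1 fixes HTTP-503 errors."))
# ['upgrade', 'to', 'v2.0.1', 'fixes', 'http-503', 'errors']
```

Note the trade-off: once "http-503" is a single token, a query typed as "HTTP 503" produces the tokens "http" and "503" and no longer matches it exactly. Always run queries through the same tokenizer as the documents.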

When I Use Vector Embeddings

Vectors are my choice when:

Semantic understanding matters - “database optimization” should find “query tuning”
Queries are natural language - Users ask questions, not keywords
Cross-lingual search - Same concept in different languages
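Long documents - A fixed-size dense vector summarizes a whole passage compactly (but embedding models have an input length limit, so chunk long documents first)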

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

class VectorRetriever:
    def __init__(self, documents: list[str], model_name: str = 'all-MiniLM-L6-v2'):
        self.documents = documents
        self.model = SentenceTransformer(model_name)
        self.embeddings = self.model.encode(documents, normalize_embeddings=True)

        # Build FAISS index for fast similarity search
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(self.embeddings.astype('float32'))

    def retrieve(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        query_embedding = self.model.encode([query], normalize_embeddings=True)

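        # Don't request more neighbors than there are documents (FAISS pads with index -1)
        top_k = min(top_k, len(self.documents))
        # With normalized embeddings and IndexFlatIP, these "distances" are cosine similarities
        distances, indices = self.index.search(query_embedding.astype('float32'), top_k)

        results = [
            (self.documents[idx], float(dist))
            for idx, dist in zip(indices[0], distances[0])
        ]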

        return results

The Hybrid Approach (What I Actually Use)

After multiple iterations, I settled on a hybrid approach. As one Reddit commenter noted: “Blending vector with fulltext searching in mysql/aurora or postgresql is absolutely a great idea.”

The key is Reciprocal Rank Fusion (RRF) - a simple but effective way to combine rankings from different retrieval methods.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import re

class HybridRetriever:
    def __init__(self, documents: list[str], model_name: str = 'all-MiniLM-L6-v2'):
        self.documents = documents

        # BM25 setup
        self.tokenized = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized)

        # Vector setup
        self.model = SentenceTransformer(model_name)
        self.embeddings = self.model.encode(documents, normalize_embeddings=True)

        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(self.embeddings.astype('float32'))

    def _tokenize(self, text: str) -> list[str]:
        # Keep hyphen- or dot-joined terms such as "http-503" or "v2.0.1" as single tokens
        return re.findall(r'\w+(?:[-.]\w+)*', text.lower())

    def _rrf(self, rankings: list[list[tuple[int, float]]], weights: list[float], k: int = 60) -> dict[int, float]:
        """
        Weighted Reciprocal Rank Fusion
        RRF(d) = Σ w_i / (k + rank_i(d)), with 1-based ranks
        """
        scores: dict[int, float] = {}

        for ranking, weight in zip(rankings, weights):
            # enumerate() is 0-based, so rank + 1 is the 1-based rank
            for rank, (doc_id, _) in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)

        return scores

    def retrieve(self, query: str, top_k: int = 5, bm25_weight: float = 0.5) -> list[tuple[str, float]]:
        # Over-fetch candidates, but never more than the corpus holds
        n_candidates = min(top_k * 3, len(self.documents))

        # BM25 results
        query_tokens = self._tokenize(query)
        bm25_scores = self.bm25.get_scores(query_tokens)
        bm25_ranking = sorted(
            enumerate(bm25_scores),
            key=lambda x: x[1],
            reverse=True
        )[:n_candidates]

        # Vector results (FAISS pads missing neighbors with index -1, so drop those)
        query_embedding = self.model.encode([query], normalize_embeddings=True)
        distances, indices = self.index.search(query_embedding.astype('float32'), n_candidates)
        vector_ranking = [
            (int(idx), float(dist))
            for idx, dist in zip(indices[0], distances[0])
            if idx >= 0
        ]

        # Fuse with RRF; bm25_weight sets the balance between the two rankings
        fused_scores = self._rrf(
            [bm25_ranking, vector_ranking],
            weights=[bm25_weight, 1.0 - bm25_weight],
        )

        # Sort and return
        ranked = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

        return [(self.documents[doc_id], score) for doc_id, score in ranked]

# Usage
retriever = HybridRetriever(documents)
results = retriever.retrieve("How to fix HTTP 503?")
# Gets both semantic matches AND exact term matches
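
To see why RRF needs no score normalization, here is a small worked example that operates on document IDs alone. The function `rrf` and the two toy rankings are made up for illustration and are not taken from my benchmark; `k=60` matches the constant used in the class above.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists of document IDs: score(d) = sum of 1 / (k + rank)."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

bm25_ranking = ["doc_a", "doc_c", "doc_b"]    # toy keyword ranking
vector_ranking = ["doc_b", "doc_a", "doc_d"]  # toy semantic ranking

fused = rrf([bm25_ranking, vector_ranking])
order = sorted(fused, key=fused.get, reverse=True)
print(order)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Only rank positions enter the formula, so it does not matter that BM25 scores and cosine similarities live on different scales. Documents that both retrievers rank highly (doc_a, doc_b) rise above documents that only one retriever found.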

Real-World Performance

I benchmarked this on a documentation corpus of 50,000 articles:

Query: "HTTP 503 service unavailable"

┌──────────────────┬─────────────────┬──────────────────┬─────────────┐
│ Method           │ Precision@5     │ Recall@5         │ Latency     │
├──────────────────┼─────────────────┼──────────────────┼─────────────┤
│ BM25 only        │ 0.82            │ 0.71             │ 2ms         │
│ Vector only      │ 0.54            │ 0.62             │ 45ms        │
│ Hybrid (RRF)     │ 0.91            │ 0.85             │ 47ms        │
└──────────────────┴─────────────────┴──────────────────┴─────────────┘

Query: "Why is my database slow?"

┌──────────────────┬─────────────────┬──────────────────┬─────────────┐
│ Method           │ Precision@5     │ Recall@5         │ Latency     │
├──────────────────┼─────────────────┼──────────────────┼─────────────┤
│ BM25 only        │ 0.45            │ 0.38             │ 3ms         │
│ Vector only      │ 0.78            │ 0.71             │ 43ms        │
│ Hybrid (RRF)     │ 0.85            │ 0.79             │ 46ms        │
└──────────────────┴─────────────────┴──────────────────┴─────────────┘

The pattern is clear:

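Exact term queries: BM25 beats vectors by a wide margin, and hybrid beats both
Semantic queries: Vectors beat BM25, and hybrid again comes out ahead
Overall: Hybrid outperforms either method alone on both queries, at a latency cost of a few milliseconds over vector-only search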
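
For anyone reproducing this kind of comparison, the two metrics are short enough to write by hand. This is a minimal sketch with made-up document IDs and relevance labels. The helper names are mine, and per-query values are normally averaged over a whole query set.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k results that are relevant."""
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9", "d4"]  # hypothetical ranked results
relevant = {"d1", "d3", "d4", "d8"}         # hypothetical ground truth

print(precision_at_k(retrieved, relevant))  # 0.6  (3 of the 5 results are relevant)
print(recall_at_k(retrieved, relevant))     # 0.75 (3 of the 4 relevant docs were found)
```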

What About “Vectorless RAG”?

The term “vectorless RAG” came up in discussions. It’s not that vectors are bad - it’s that single-vector retrieval isn’t the only approach.

As one comment explained: “Single vector embedding retrieval is not the only ‘vector search’ - you also have sparse retrieval (which includes BM25), late interaction, and cross encoders.”

The retrieval spectrum looks like this:

                    Retrieval Methods Spectrum

Exact Match ◄────────────────────────────────────────► Semantic

BM25 ────────► Sparse Vectors ───────► Dense Vectors ───────► Late Interaction
   │                  │                    │                      │
   │                  │                    │                      │
Fast, exact        Keyword +           Semantic              Multi-vector
simple             TF-IDF weights       understanding         attention
                        │
                  SPLADE (learned
                   sparse)
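
Late interaction is the least familiar point on this spectrum, so here is a toy sketch of the MaxSim scoring used by ColBERT-style models. Each query token vector is matched to its most similar document token vector, and those maxima are summed. The 2-D vectors are invented for illustration; a real model would produce one learned embedding per token.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum, over query tokens, of the best match among the document tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy "token embeddings": two query tokens, two tokens per document
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.8]]  # has a good match for both query tokens
doc_b = [[0.9, 0.1], [0.7, 0.2]]  # matches only the first query token well

score_a = maxsim_score(query_vecs, doc_a)  # about 1.7
score_b = maxsim_score(query_vecs, doc_b)  # about 1.1
```

Because every query token has to find its own match, a document that covers only part of the query scores lower. That puts late interaction between exact matching and single-vector semantic search.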

Implementation Tips

1. Pre-filter with BM25, re-rank with vectors

def two_stage_retrieve(query: str, documents: list[str], top_k: int = 5):
    # Stage 1: BM25 for fast candidate selection
    bm25_candidates = bm25_retrieve(query, top_k=100)

    # Stage 2: Re-rank with vectors
    vector_reranked = vector_rerank(query, bm25_candidates, top_k=top_k)

    return vector_reranked
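
The snippet above leaves `bm25_retrieve` and `vector_rerank` undefined. As a self-contained sketch of the same two-stage shape, the version below swaps in deliberately simple stand-ins: term overlap for the BM25 stage and bag-of-words cosine similarity for the vector stage. Both stand-ins, and the names `cheap_candidates` and `cosine_rerank`, are hypothetical placeholders. In practice you would plug in the retrievers from earlier in this post.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def cheap_candidates(query: str, documents: list[str], top_k: int = 100) -> list[str]:
    """Stage 1 stand-in: keep documents sharing at least one term with the query."""
    query_terms = set(tokenize(query))
    scored = [(doc, len(query_terms & set(tokenize(doc)))) for doc in documents]
    scored = [pair for pair in scored if pair[1] > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

def cosine_rerank(query: str, candidates: list[str], top_k: int = 5) -> list[tuple[str, float]]:
    """Stage 2 stand-in: cosine similarity over bag-of-words counts."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))

    def cosine(doc: str) -> float:
        d = Counter(tokenize(doc))
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        shared = sum(q[t] * d[t] for t in q)
        return shared / (q_norm * d_norm) if q_norm and d_norm else 0.0

    ranked = sorted(((doc, cosine(doc)) for doc in candidates), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

def two_stage_retrieve(query: str, documents: list[str], top_k: int = 5, n_candidates: int = 100):
    # Stage 1 narrows the corpus cheaply; stage 2 spends more effort on the survivors
    candidates = cheap_candidates(query, documents, top_k=n_candidates)
    return cosine_rerank(query, candidates, top_k=top_k)

docs = [
    "HTTP 503 Service Unavailable error occurs when the server is overloaded",
    "HTTP/2 is a major revision of the HTTP network protocol",
    "Error 503 means the server cannot handle the request",
    "Database indexes speed up slow queries",
]
results = two_stage_retrieve("HTTP 503 error", docs, top_k=2)
```

One caveat of this design: anything stage 1 drops can never be recovered by stage 2, so keep the candidate count generous.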

2. Use cross-encoders for final ranking

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_with_cross_encoder(query: str, candidates: list[str], top_k: int = 5):
    pairs = [(query, doc) for doc in candidates]
    scores = cross_encoder.predict(pairs)

    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

3. Consider SPLADE for learned sparse retrieval

SPLADE combines the best of both worlds - sparse representations with learned weights:

Traditional BM25:
  Document: "HTTP 503 error"
  Sparse vector: {http: 1, 503: 1, error: 1}

SPLADE:
  Document: "HTTP 503 error"
  Sparse vector: {
    http: 2.3,
    503: 3.1,        # Higher weight for rare term
    error: 1.8,
    service: 1.2,    # Expansion term
    unavailable: 1.1 # Expansion term
  }
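
Whatever produces the weights, scoring with sparse vectors is just a dot product over the terms two vectors share. The sketch below reuses the illustrative document weights above. The query weights and the helper `sparse_dot` are my own assumptions, and a real SPLADE model would compute both vectors with a transformer.

```python
def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict; only shared terms contribute
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

bm25_style_doc = {"http": 1.0, "503": 1.0, "error": 1.0}
splade_style_doc = {
    "http": 2.3, "503": 3.1, "error": 1.8,
    "service": 1.2, "unavailable": 1.1,  # expansion terms
}

query = {"service": 1.0, "unavailable": 1.0}  # hypothetical query weights

plain_score = sparse_dot(query, bm25_style_doc)       # no shared terms, so zero
expanded_score = sparse_dot(query, splade_style_doc)  # about 2.3, via expansion terms
```

The query "service unavailable" shares no terms with the literal text "HTTP 503 error", so the plain sparse vector scores zero. The expansion terms are what let the learned sparse vector match.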

Key Takeaways

BM25 is not obsolete - It excels at exact term matching and is computationally efficient
Vectors aren’t magic - They capture semantics but can miss specific terms
Hybrid is the answer - Combine both for best results
Consider your use case - Error codes vs. natural language questions require different approaches
Measure, don’t assume - Benchmark on your actual data

The best RAG system isn’t pure BM25 or pure vectors - it’s thoughtfully combining retrieval methods based on your specific needs.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit Discussion: Anyone actually using Vectorless RAG?
👨‍💻 Okapi BM25 - Wikipedia
👨‍💻 BM25 vs Vector Search - Pinecone
👨‍💻 What is BM25? - Elastic

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!