Hybrid BM25 + Semantic Search for AI Memory Retrieval

Mar 24, 2026

I spent two weeks building a semantic search system for my AI memory project. Then I benchmarked it against a simple keyword search. The keyword search won by 11 percentage points.

That hurt. A lot.

Here’s what went wrong and how hybrid BM25 + semantic search fixed it.

The Problem: Semantic Search Alone Isn’t Enough

I started with the “modern” approach: embed everything with a transformer model, do cosine similarity search, done. Clean, elegant, the future of search.

# My first attempt - pure semantic search
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

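# Embed all memories (unit-normalized, so a dot product is cosine similarity)
memory_embeddings = model.encode(memories, normalize_embeddings=True)

# Search
query_embedding = model.encode([query], normalize_embeddings=True)[0]
scores = np.dot(memory_embeddings, query_embedding)
top_idx = np.argsort(scores)[::-1][:10]
results = [memories[i] for i in top_idx]  # memories is a plain list of strings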

The results felt… mushy. When I searched for “Python 3.11 asyncio bug”, I got documents about Python 3.10, asyncio tutorials, and general Python programming. Close semantically, but wrong.

Then I benchmarked against plain TF-IDF:

TF-IDF Accuracy:     67.3%
Semantic Accuracy:   56.2%
Gap:                 -11.1 points

Eleven points worse than a 50-year-old algorithm.

Why Pure Semantic Search Fails

The problem is what I call “semantic smoothing.” When text gets compressed into dense vectors:

Exact matches get diluted. “Python 3.11” and “Python 3.10” have nearly identical embeddings
Rare terms lose importance. Technical jargon, product codes, names - all smoothed away
Document length biases emerge. Short, precise documents get overshadowed by long, verbose ones

Query: "Python 3.11 asyncio bug"

Top Results (Semantic Only):
1. "Getting started with asyncio in Python"     (score: 0.89)
2. "Python 3.10 new features overview"           (score: 0.87)
3. "Debugging async code patterns"               (score: 0.85)
4. "Python 3.11 asyncio bug fix PR #1234"        (score: 0.82)  <- THE ACTUAL ANSWER

Top Results (TF-IDF):
1. "Python 3.11 asyncio bug fix PR #1234"        (score: 0.91)
2. "Python 3.11 release notes - asyncio changes" (score: 0.78)
3. "Python 3.11 asyncio known issues"            (score: 0.72)

The exact match on “3.11” matters. TF-IDF gets this. Semantic search doesn’t.
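
To make the keyword side concrete, here’s a minimal pure-Python sketch of BM25-style scoring. It is not the code behind the numbers above: the three documents are shortened, lowercased versions of the example titles, `k1` and `b` are common default values, and the IDF formula is one of several variants in use. It only illustrates why a rare exact token such as “3.11” carries so much weight.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with a BM25 variant."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps it positive); rare terms score higher
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative mini-corpus adapted from the example titles
docs = [
    "getting started with asyncio in python",
    "python 3.10 new features overview",
    "python 3.11 asyncio bug fix",
]
tokenized = [d.split() for d in docs]
scores = bm25_scores("python 3.11 asyncio bug".split(), tokenized)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

“python” appears in every document, so it contributes almost nothing. “3.11” and “bug” appear in only one, so they decide the ranking.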

The Solution: Hybrid Retrieval

I needed both. Keyword precision for exact matches, semantic understanding for conceptual queries.

Here’s the architecture I ended up with:

┌─────────────────────────────────────────────────────────────────┐
│                      Query: "Python 3.11 asyncio bug"           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
         ┌────────────────────┴────────────────────┐
         │                                         │
         ▼                                         ▼
   ┌───────────┐                            ┌───────────────┐
   │   BM25    │                            │   Semantic    │
   │  Index    │                            │    Search     │
   └───────────┘                            └───────────────┘
         │                                         │
         │  Keyword scores                         │  Similarity scores
         │  "3.11" → high boost                   │  "asyncio bug" → context
         ▼                                         ▼
   ┌─────────────────────────────────────────────────────────────┐
   │                    Score Fusion Layer                       │
   │  normalized_score = 0.4 * bm25 + 0.6 * semantic             │
   └─────────────────────────────────────────────────────────────┘
                              │
                              ▼
         ┌────────────────────────────────────────────┐
         │           5-Signal Re-ranker                │
         │  1. Keyword relevance (from BM25)          │
         │  2. Semantic similarity (from embeddings)  │
         │  3. Vividness (importance/emotion)         │
         │  4. Mood congruency (context fit)          │
         │  5. Recency (temporal decay)               │
         └────────────────────────────────────────────┘
                              │
                              ▼
                    Ranked Results

Implementation: The Hybrid Retriever

After several iterations, here’s what works:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

class HybridMemoryRetriever:
    def __init__(self, documents):
        self.documents = documents

        # Layer 1: BM25 for keyword precision
        # Lowercase so "Python" and "python" count as the same term
        tokenized_docs = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized_docs)

        # Layer 2: Semantic embeddings (unit-normalized so dot product = cosine)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = self.encoder.encode(documents, normalize_embeddings=True)

    def retrieve(self, query, k=10, alpha=0.5):
        """
        Hybrid retrieval with configurable weighting.

        alpha: weight for semantic (0.0 = pure BM25, 1.0 = pure semantic)
        """
        # BM25 keyword scores (tokenize the query the same way as the documents)
        bm25_scores = self.bm25.get_scores(query.lower().split())

        # Semantic similarity scores
        query_embedding = self.encoder.encode([query], normalize_embeddings=True)[0]
        semantic_scores = np.dot(self.embeddings, query_embedding)

        # Normalize both to [0, 1] - critical step I missed initially
        bm25_norm = self._normalize(bm25_scores)
        semantic_norm = self._normalize(semantic_scores)

        # Fusion: weighted combination
        hybrid_scores = alpha * semantic_norm + (1 - alpha) * bm25_norm

        return self._get_top_k(hybrid_scores, k)

    def _normalize(self, scores):
        """Min-max normalization to [0, 1] range"""
        min_s, max_s = scores.min(), scores.max()
        if max_s - min_s < 1e-9:
            return np.zeros_like(scores)
        return (scores - min_s) / (max_s - min_s)

    def _get_top_k(self, scores, k):
        indices = np.argsort(scores)[::-1][:k]
        return [(self.documents[i], scores[i]) for i in indices]

The key insight: normalization is critical. BM25 scores can range from 0 to 20+. Cosine similarity ranges from -1 to 1. Without normalization, one dominates the other.
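
A small worked example shows the effect. The scores below are made up for illustration, and `min_max` and `fuse` are stand-ins that mirror the `_normalize` method and the weighted sum in `retrieve`, written in plain Python so the arithmetic is easy to follow.

```python
def min_max(scores):
    """Scale a list of scores to [0, 1]; all-equal input maps to zeros."""
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-9:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25, semantic, alpha=0.5):
    """Weighted sum: alpha weights semantic, (1 - alpha) weights BM25."""
    return [alpha * s + (1 - alpha) * b for b, s in zip(bm25, semantic)]

# Made-up scores for three documents (illustrative only)
bm25_raw = [12.0, 11.0, 2.0]
semantic_raw = [0.30, 0.90, 0.85]

raw = fuse(bm25_raw, semantic_raw)
normed = fuse(min_max(bm25_raw), min_max(semantic_raw))

# Rank document indices from best to worst under each scheme
raw_ranking = sorted(range(3), key=lambda i: raw[i], reverse=True)
norm_ranking = sorted(range(3), key=lambda i: normed[i], reverse=True)
print(raw_ranking, norm_ranking)
```

With raw scores, document 0 wins on its BM25 score alone even though it is the weakest semantic match. After normalization, document 1, which is strong on both signals, moves to the top.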

The Re-ranking Layer

Hybrid search closed most of the gap to TF-IDF, but it still trailed the baseline. I wanted better. That’s where re-ranking comes in.

from datetime import datetime
import numpy as np

class MemoryReranker:
    """
    5-signal re-ranking for AI memory retrieval.

    Signals: keyword, semantic, vividness, mood congruency, recency
    """

    def __init__(self):
        self.vividness_scores = {}  # document text -> importance (0-1)
        self.current_mood = None    # current task context

    def rerank(self, query, documents, base_scores, timestamps):
        final_scores = []

        for doc, base_score, ts in zip(documents, base_scores, timestamps):
            # Signal 1 & 2: Already combined in base_score (BM25 + semantic)

            # Signal 3: Vividness (importance/emotional weight)
            vividness = self.vividness_scores.get(doc, 0.5)

            # Signal 4: Mood congruency (contextual fit)
            mood_score = self._compute_mood_fit(doc)

            # Signal 5: Recency (exponential decay with a 7-day time constant,
            # i.e. a half-life of about 4.9 days)
            age_hours = (datetime.now() - ts).total_seconds() / 3600
            recency = np.exp(-age_hours / (24 * 7))

            # Weighted fusion
            final = (
                0.30 * base_score +    # Combined keyword + semantic
                0.25 * vividness +     # Importance weight
                0.25 * mood_score +     # Context relevance
                0.20 * recency          # Freshness
            )
            final_scores.append(final)

        # Return re-ranked results
        ranked_idx = np.argsort(final_scores)[::-1]
        return [(documents[i], final_scores[i]) for i in ranked_idx]

    def _compute_mood_fit(self, doc):
        """Compute contextual relevance based on current task/state."""
        if self.current_mood is None:
            return 0.5
        # Placeholder: actual implementation depends on mood representation
        # Could be: task embedding, keyword matching, state machine, etc.
        return 0.5
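
One detail worth checking in the recency signal: `np.exp(-age_hours / (24 * 7))` decays with a 7-day time constant, which is not quite the same as a 7-day half-life. The sketch below compares the two. The function names are hypothetical, and which curve is the better choice is a tuning decision.

```python
import math

def recency_time_constant(age_hours, tau_hours=24 * 7):
    """exp(-age / tau): drops to 1/e (about 0.37) after tau hours."""
    return math.exp(-age_hours / tau_hours)

def recency_half_life(age_hours, half_life_hours=24 * 7):
    """0.5 ** (age / half_life): drops to 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)

week = 24 * 7
tc = recency_time_constant(week)   # the formula used in the reranker
hl = recency_half_life(week)       # a true 7-day half-life
print(round(tc, 3), round(hl, 3))
```

A one-week-old memory keeps about 37% of its recency score under the first formula and 50% under the second, so the first one forgets noticeably faster.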

Results: Closing the Gap

After implementing the full hybrid + re-ranking pipeline:

┌────────────────────────────────────────────────────────────────┐
│                    Retrieval Accuracy Results                   │
├────────────────────────────────────────────────────────────────┤
│ Pure TF-IDF:           67.3%                                   │
│ Pure Semantic:          56.2%  (11% gap)                       │
│ Hybrid (BM25+Semantic): 63.8%  (closed 70% of gap)             │
│ Hybrid + Re-ranking:    68.1%  (exceeded TF-IDF baseline!)     │
└────────────────────────────────────────────────────────────────┘

Not only did I close the gap, I exceeded the TF-IDF baseline by 0.8 percentage points.
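
For anyone who wants to check the arithmetic, here it is spelled out using the four accuracy figures from the table. The variable names are just for illustration.

```python
tfidf, semantic, hybrid, reranked = 67.3, 56.2, 63.8, 68.1

gap = tfidf - semantic                        # in percentage points
closed_by_hybrid = (hybrid - semantic) / gap  # fraction of the gap recovered
margin = reranked - tfidf                     # percentage points over baseline

print(round(gap, 1), round(closed_by_hybrid, 2), round(margin, 1))
```

The hybrid step recovers a bit more than two thirds of the 11.1-point gap, and the re-ranker adds the rest plus a small margin.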

Lessons Learned

Lesson 1: Always Benchmark Against Simple Baselines

I spent two weeks on semantic search before I benchmarked against TF-IDF. Two weeks of optimizing the wrong thing.

Now I always start with: “What’s the dumbest possible solution?” and benchmark against that.

Lesson 2: Normalization Matters More Than You Think

My first hybrid attempt just added BM25 and semantic scores:

# WRONG: Adding unnormalized scores
final_score = bm25_score + semantic_score

BM25 scores dominated because they were 10x larger. The semantic component did nothing.

Lesson 3: Multiple Signals Beat One Perfect Signal

I kept trying to make semantic search “perfect.” Better embeddings, larger models, more training data.

The real answer was: stop trying to make one signal perfect. Combine multiple imperfect signals.

Single Signal Approach:
  Semantic only → 56.2%
  TF-IDF only   → 67.3%

Multi-Signal Approach:
  BM25 + Semantic        → 63.8%
  BM25 + Semantic + Time → 65.9%
  Full 5-signal rerank   → 68.1%

Each signal contributes something the others miss.

Lesson 4: Production Needs Infrastructure

For production, I moved to Elasticsearch with built-in BM25 and kNN vector search:

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

class ProductionHybridSearch:
    def __init__(self, es_host='http://localhost:9200', index='memories'):
        self.es = Elasticsearch([es_host])
        self.index = index
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

    def search(self, query, k=10):
        query_vec = self.encoder.encode(query).tolist()

        # Elasticsearch script_score for hybrid ranking
        response = self.es.search(
            index=self.index,
            body={
                "size": k,
                "query": {
                    "script_score": {
                        "query": {"match": {"text": query}},
                        "script": {
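                            # _score is the raw, unbounded BM25 score, so these
                            # weights are not comparable to the normalized
                            # fusion in HybridMemoryRetriever. The + 1.0 keeps
                            # the result non-negative, as script_score requires.
                            "source": """
                                0.4 * _score +
                                0.6 * (cosineSimilarity(params.vec, 'embedding') + 1.0)
                            """,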
                            "params": {"vec": query_vec}
                        }
                    }
                }
            }
        )

        return [hit['_source'] for hit in response['hits']['hits']]
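
The query above assumes an index with a `text` field and an `embedding` vector field. Here’s a sketch of what that mapping could look like. The `timestamp` field and the `to_es_document` helper are my own additions for illustration, and 384 is the output dimension of all-MiniLM-L6-v2. How the mapping is passed to index creation depends on the client version, so check the docs for the one you run.

```python
# Hypothetical mapping for the 'memories' index queried above.
MEMORY_INDEX_MAPPINGS = {
    "properties": {
        "text": {"type": "text"},  # analyzed field, scored with BM25 by default
        "embedding": {"type": "dense_vector", "dims": 384},
        "timestamp": {"type": "date"},  # assumption: kept for recency scoring
    }
}

def to_es_document(text, vector, timestamp_iso):
    """Build one document body that matches the mapping above."""
    expected = MEMORY_INDEX_MAPPINGS["properties"]["embedding"]["dims"]
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return {"text": text, "embedding": list(vector), "timestamp": timestamp_iso}

doc = to_es_document("Python 3.11 asyncio bug", [0.0] * 384, "2026-03-24T00:00:00Z")
```

Validating the vector length before indexing catches the most common mistake early: switching embedding models without recreating the index.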

When to Use Hybrid Search

Hybrid search adds complexity. Use it when:

Users search for specific terms (product codes, names, versions)
You need both precision AND recall (document search that also finds related concepts)
You’re building AI memory systems (exact recall + semantic understanding)
Pure semantic search underperforms (benchmark first!)

Skip it when:

Pure keyword search works fine (simple document lookup)
You only need semantic matching (finding similar images, cross-lingual search)
Your scale is small (a few thousand documents - brute force works)

Common Pitfalls

Pitfall 1: Ignoring Document Length

BM25 has built-in length normalization. Semantic search doesn’t.

import numpy as np

# Add length penalty to semantic scores
def get_semantic_scores(query, docs, embeddings, encoder):
    # encoder: the SentenceTransformer that produced `embeddings`
    query_emb = encoder.encode([query])[0]
    scores = np.dot(embeddings, query_emb)

    # Penalize very long documents (clamp to 1 word to avoid dividing by zero)
    doc_lengths = np.array([max(len(d.split()), 1) for d in docs])
    length_penalty = 1 / np.sqrt(doc_lengths)

    return scores * length_penalty

Pitfall 2: Equal Weighting for All Queries

Some queries are keyword-heavy (“Python 3.11 bug”). Some are semantic-heavy (“how do I fix async issues”). Adapt dynamically.

def get_adaptive_alpha(query):
    """
    High alpha = more semantic, Low alpha = more keyword
    """
    # Queries with version numbers, codes, names → keyword-heavy
    if has_specific_terms(query):
        return 0.3  # 70% BM25, 30% semantic

    # Conceptual questions → semantic-heavy
    if is_conceptual(query):
        return 0.7  # 30% BM25, 70% semantic

    return 0.5  # Balanced
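
`has_specific_terms` and `is_conceptual` are left undefined above. Here’s one possible sketch using simple heuristics. The regex patterns and the list of question words are assumptions chosen for illustration, not tuned values, and a real system would likely want something more robust.

```python
import re

# Hypothetical heuristics; patterns and word list are illustrative only.
_VERSION_OR_CODE = re.compile(
    r"\d+\.\d+"            # version numbers such as 3.11
    r"|#\d+"               # issue or PR references such as #1234
    r"|\b[A-Z]{2,}-\d+\b"  # ticket-style codes such as ABC-42
)
_QUESTION_OPENERS = ("how", "why", "what", "when", "should", "can")

def has_specific_terms(query):
    """True if the query contains a version number, PR reference, or code."""
    return bool(_VERSION_OR_CODE.search(query))

def is_conceptual(query):
    """True if the query reads like a natural-language question."""
    words = query.lower().split()
    if not words:
        return False
    return words[0] in _QUESTION_OPENERS or query.strip().endswith("?")

print(has_specific_terms("Python 3.11 bug"), is_conceptual("how do I fix async issues"))
```

Because `get_adaptive_alpha` checks for specific terms first, a query like “how do I fix the Python 3.11 bug” still leans toward keywords.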

Pitfall 3: Forgetting Re-indexing

When documents change, you need to refresh everything derived from them:

BM25 Index:     Needs re-tokenization
Vector Index:   Needs re-embedding
Timestamps:     Need updating for recency scoring
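
One way to avoid forgetting a step is to route every change through a single method. This is a sketch under assumptions: `MemoryStore`, `upsert`, and the two injected callables are hypothetical names, and the stand-ins at the bottom replace the real encoder and BM25 index so the example runs on its own.

```python
from datetime import datetime

class MemoryStore:
    """Keeps text, tokens, embeddings, and timestamps in sync."""

    def __init__(self, embed, build_keyword_index):
        self.embed = embed                              # text -> vector
        self.build_keyword_index = build_keyword_index  # token lists -> index
        self.records = {}                               # doc_id -> record dict
        self.keyword_index = None

    def upsert(self, doc_id, text):
        """Add or replace a document and refresh every derived structure."""
        self.records[doc_id] = {
            "text": text,
            "tokens": text.lower().split(),   # re-tokenize
            "embedding": self.embed(text),    # re-embed
            "updated_at": datetime.now(),     # refresh recency
        }
        # Rebuild the keyword index from scratch: corpus statistics such as
        # document frequencies and average length change with every edit.
        self.keyword_index = self.build_keyword_index(
            [r["tokens"] for r in self.records.values()]
        )

# Stand-ins so the sketch is self-contained
store = MemoryStore(embed=lambda t: [float(len(t))], build_keyword_index=list)
store.upsert("m1", "Python 3.11 asyncio bug")
store.upsert("m1", "Python 3.12 asyncio bug")
```

In a setup like the one in this post, `build_keyword_index` could be `BM25Okapi` and `embed` could wrap the encoder. Rebuilding the whole keyword index on every write is fine at small scale but would need batching for larger collections.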

What’s Next

My current system works well for text. But I’m thinking about:

Cross-modal retrieval - text queries finding images, audio
Learning to rank - using user feedback to improve weighting
Hierarchical memory - not just retrieval, but organization

The hybrid approach isn’t the final answer. But it’s a solid foundation.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
