Hybrid BM25 + Semantic Search for AI Memory Retrieval
I spent two weeks building a semantic search system for my AI memory project. Benchmarked it against a simple keyword search. The keyword search won by 11%.
That hurt. A lot.
Here’s what went wrong and how hybrid BM25 + semantic search fixed it.
The Problem: Semantic Search Alone Isn’t Enough
I started with the “modern” approach: embed everything with a transformer model, do cosine similarity search, done. Clean, elegant, the future of search.
    # My first attempt: pure semantic search
    # (memories is a list of strings, query is a string)
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Embed all memories as unit-length vectors, so dot product = cosine similarity
    memory_embeddings = model.encode(memories, normalize_embeddings=True)

    # Search
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    scores = np.dot(memory_embeddings, query_embedding)
    top_idx = np.argsort(scores)[::-1][:10]
    results = [memories[i] for i in top_idx]  # a plain list cannot be indexed by an array

The results felt… mushy. When I searched for “Python 3.11 asyncio bug”, I got documents about Python 3.10, asyncio tutorials, and general Python programming. Close semantically, but wrong.
Then I benchmarked against plain TF-IDF:
    TF-IDF Accuracy:   67.3%
    Semantic Accuracy: 56.2%
    Gap:               -11.1%

Eleven percentage points worse than a decades-old algorithm.
Why Pure Semantic Search Fails
The problem is what I call “semantic smoothing.” When text gets compressed into dense vectors:
- Exact matches get diluted. “Python 3.11” and “Python 3.10” have nearly identical embeddings
- Rare terms lose importance. Technical jargon, product codes, names - all smoothed away
- Document length biases emerge. Short, precise documents get overshadowed by long, verbose ones
    Query: "Python 3.11 asyncio bug"

    Top Results (Semantic Only):
    1. "Getting started with asyncio in Python" (score: 0.89)
    2. "Python 3.10 new features overview" (score: 0.87)
    3. "Debugging async code patterns" (score: 0.85)
    4. "Python 3.11 asyncio bug fix PR #1234" (score: 0.82)  <- THE ACTUAL ANSWER

    Top Results (TF-IDF):
    1. "Python 3.11 asyncio bug fix PR #1234" (score: 0.91)
    2. "Python 3.11 release notes - asyncio changes" (score: 0.78)
    3. "Python 3.11 asyncio known issues" (score: 0.72)

The exact match on “3.11” matters. TF-IDF gets this. Semantic search doesn’t.
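To see why keyword scoring rewards the exact token, here is a minimal from-scratch sketch of BM25-style scoring. The three-document corpus, the lowercase whitespace tokenizer, and the parameters k1=1.5 and b=0.75 are illustrative assumptions, not the setup behind the benchmark above. The baseline above was TF-IDF; BM25 is used here because it relies on the same rare-term weighting and is what the hybrid pipeline uses later.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a large IDF; terms found in every doc get a small one.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is length-normalized (b).
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            )
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "getting started with asyncio in python",
    "python 3.10 new features overview",
    "python 3.11 asyncio bug fix",
]
scores = bm25_scores("python 3.11 asyncio bug", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # python 3.11 asyncio bug fix
```

The token "python" appears in every document, so it contributes almost nothing; "3.11" and "bug" appear in one document each, so they decide the ranking.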
The Solution: Hybrid Retrieval
I needed both. Keyword precision for exact matches, semantic understanding for conceptual queries.
Here’s the architecture I ended up with:
┌─────────────────────────────────────────────────────────────┐
│              Query: "Python 3.11 asyncio bug"               │
└─────────────────────────────────────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              ▼                                 ▼
        ┌───────────┐                   ┌───────────────┐
        │   BM25    │                   │   Semantic    │
        │   Index   │                   │    Search     │
        └───────────┘                   └───────────────┘
              │                                 │
        Keyword scores                  Similarity scores
        "3.11" → high boost             "asyncio bug" → context
              ▼                                 ▼
┌─────────────────────────────────────────────────────────────┐
│                     Score Fusion Layer                      │
│       normalized_score = 0.4 * bm25 + 0.6 * semantic        │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
        ┌────────────────────────────────────────────┐
        │             5-Signal Re-ranker             │
        │ 1. Keyword relevance (from BM25)           │
        │ 2. Semantic similarity (from embeddings)   │
        │ 3. Vividness (importance/emotion)          │
        │ 4. Mood congruency (context fit)           │
        │ 5. Recency (temporal decay)                │
        └────────────────────────────────────────────┘
                               │
                               ▼
                        Ranked Results

Implementation: The Hybrid Retriever
After several iterations, here’s what works:
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer
    import numpy as np


    class HybridMemoryRetriever:
        def __init__(self, documents):
            self.documents = documents

            # Layer 1: BM25 for keyword precision
            tokenized_docs = [doc.split() for doc in documents]
            self.bm25 = BM25Okapi(tokenized_docs)

            # Layer 2: Semantic embeddings (unit length, so dot product = cosine)
            self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
            self.embeddings = self.encoder.encode(documents, normalize_embeddings=True)

        def retrieve(self, query, k=10, alpha=0.5):
            """
            Hybrid retrieval with configurable weighting.

            alpha: weight for semantic (0.0 = pure BM25, 1.0 = pure semantic)
            """
            # BM25 keyword scores
            bm25_scores = self.bm25.get_scores(query.split())

            # Semantic similarity scores
            query_embedding = self.encoder.encode([query], normalize_embeddings=True)[0]
            semantic_scores = np.dot(self.embeddings, query_embedding)

            # Normalize both to [0, 1] - critical step I missed initially
            bm25_norm = self._normalize(bm25_scores)
            semantic_norm = self._normalize(semantic_scores)

            # Fusion: weighted combination
            hybrid_scores = alpha * semantic_norm + (1 - alpha) * bm25_norm

            return self._get_top_k(hybrid_scores, k)

        def _normalize(self, scores):
            """Min-max normalization to [0, 1] range"""
            min_s, max_s = scores.min(), scores.max()
            if max_s - min_s < 1e-9:
                return np.zeros_like(scores)
            return (scores - min_s) / (max_s - min_s)

        def _get_top_k(self, scores, k):
            indices = np.argsort(scores)[::-1][:k]
            return [(self.documents[i], scores[i]) for i in indices]

The key insight: normalization is critical. BM25 scores can range from 0 to 20+. Cosine similarity ranges from -1 to 1. Without normalization, one dominates the other.
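Picking alpha is easier with a small labeled set than by feel. The sketch below sweeps alpha over scores that are already min-max normalized and reports top-1 accuracy. The toy scores, the labels, and the helper names `fuse` and `top1_accuracy` are all hypothetical, and this accuracy definition is an assumption, not the metric behind the numbers in this article.

```python
def fuse(bm25_norm, semantic_norm, alpha):
    """Weighted fusion of two score lists already scaled to [0, 1]."""
    return [alpha * s + (1 - alpha) * b for b, s in zip(bm25_norm, semantic_norm)]

def top1_accuracy(examples, alpha):
    """Fraction of queries whose top fused result is the labeled document."""
    hits = 0
    for bm25_norm, semantic_norm, relevant_idx in examples:
        fused = fuse(bm25_norm, semantic_norm, alpha)
        best = max(range(len(fused)), key=fused.__getitem__)
        hits += best == relevant_idx
    return hits / len(examples)

# Each example: (normalized BM25 scores, normalized semantic scores, index of the right doc)
examples = [
    ([1.0, 0.2, 0.0], [0.6, 1.0, 0.0], 0),  # keyword-heavy query: BM25 ranks it first
    ([0.5, 1.0, 0.0], [1.0, 0.2, 0.0], 0),  # conceptual query: semantic ranks it first
    ([0.9, 1.0, 0.0], [1.0, 0.0, 0.5], 0),  # BM25 narrowly misses, semantic corrects it
]

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"alpha={alpha:.2f}  top-1 accuracy={top1_accuracy(examples, alpha):.2f}")
```

On this toy data only the balanced setting gets all three queries right; on real data the best value has to come from your own labeled queries.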
The Re-ranking Layer
Hybrid search recovered most of the gap to TF-IDF (63.8% versus 67.3%), but it still trailed the baseline. I wanted better. That’s where re-ranking comes in.
    from datetime import datetime
    import numpy as np


    class MemoryReranker:
        """
        5-signal re-ranking for AI memory retrieval.

        Signals: keyword, semantic, vividness, mood congruency, recency
        """

        def __init__(self):
            self.vividness_scores = {}  # document -> importance (0-1)
            self.current_mood = None    # current task context

        def rerank(self, query, documents, base_scores, timestamps):
            final_scores = []

            for doc, base_score, ts in zip(documents, base_scores, timestamps):
                # Signals 1 & 2: already combined in base_score (BM25 + semantic)

                # Signal 3: Vividness (importance/emotional weight)
                vividness = self.vividness_scores.get(doc, 0.5)

                # Signal 4: Mood congruency (contextual fit)
                mood_score = self._compute_mood_fit(doc)

                # Signal 5: Recency (exponential decay with a 7-day time constant,
                # which works out to a half-life of about 4.9 days)
                age_hours = (datetime.now() - ts).total_seconds() / 3600
                recency = np.exp(-age_hours / (24 * 7))

                # Weighted fusion
                final = (
                    0.30 * base_score +  # Combined keyword + semantic
                    0.25 * vividness +   # Importance weight
                    0.25 * mood_score +  # Context relevance
                    0.20 * recency       # Freshness
                )
                final_scores.append(final)

            # Return re-ranked results
            ranked_idx = np.argsort(final_scores)[::-1]
            return [(documents[i], final_scores[i]) for i in ranked_idx]

        def _compute_mood_fit(self, doc):
            """Compute contextual relevance based on current task/state."""
            if self.current_mood is None:
                return 0.5
            # Placeholder: actual implementation depends on mood representation
            # Could be: task embedding, keyword matching, state machine, etc.
            return 0.5

Results: Closing the Gap
After implementing the full hybrid + re-ranking pipeline:
┌──────────────────────────────────────────────────────────────┐
│                  Retrieval Accuracy Results                  │
├──────────────────────────────────────────────────────────────┤
│ Pure TF-IDF:             67.3%                               │
│ Pure Semantic:           56.2%  (11.1-point gap)             │
│ Hybrid (BM25+Semantic):  63.8%  (closed about 68% of gap)    │
│ Hybrid + Re-ranking:     68.1%  (exceeded TF-IDF baseline!)  │
└──────────────────────────────────────────────────────────────┘

Not only did I close the gap, I exceeded the TF-IDF baseline by 0.8 percentage points.
Lessons Learned
Lesson 1: Always Benchmark Against Simple Baselines
I spent two weeks on semantic search before I benchmarked against TF-IDF. Two weeks of optimizing the wrong thing.
Now I always start with: “What’s the dumbest possible solution?” and benchmark against that.
Lesson 2: Normalization Matters More Than You Think
My first hybrid attempt just added BM25 and semantic scores:
    # WRONG: Adding unnormalized scores
    final_score = bm25_score + semantic_score

BM25 scores dominated because they were 10x larger. The semantic component did nothing.
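A small worked example makes the failure concrete. The score values below are made up for illustration: BM25 on a roughly 0 to 20 scale, cosine similarity between 0 and 1.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; return all zeros if they are (almost) constant."""
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-9:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ranking(scores):
    """Document indices ordered from best to worst."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

bm25 = [12.0, 11.0, 3.0]       # illustrative raw BM25 scores
semantic = [0.30, 0.90, 0.85]  # illustrative cosine similarities

raw_sum = [b + s for b, s in zip(bm25, semantic)]
fused = [0.5 * b + 0.5 * s for b, s in zip(min_max(bm25), min_max(semantic))]

print(ranking(bm25))     # [0, 1, 2]  BM25-only order
print(ranking(raw_sum))  # [0, 1, 2]  same order: the semantic scores changed nothing
print(ranking(fused))    # [1, 0, 2]  after normalization, semantic can reorder
```

Document 1 is a close second on keywords and the clear winner on meaning, yet the raw sum never lets it pass document 0.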
Lesson 3: Multiple Signals Beat One Perfect Signal
I kept trying to make semantic search “perfect.” Better embeddings, larger models, more training data.
The real answer was: stop trying to make one signal perfect. Combine multiple imperfect signals.
    Single Signal Approach:
      Semantic only            → 56.2%
      Keyword only (TF-IDF)    → 67.3%

    Multi-Signal Approach:
      BM25 + Semantic          → 63.8%
      BM25 + Semantic + Time   → 65.9%
      Full 5-signal rerank     → 68.1%

Each signal contributes something the others miss.
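The time signal in that list is an exponential decay on the age of a memory. As a rough sketch, assuming the goal is a true 7-day half-life (the score halves every 7 days), the decay can be written with base 0.5. Note that the form `exp(-age / 7 days)` gives a 7-day time constant instead, which halves in about 4.9 days. The function name `recency_score` and its parameters are hypothetical.

```python
import math
from datetime import datetime, timedelta

def recency_score(timestamp, now=None, half_life_days=7.0):
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life."""
    now = now or datetime.now()
    # Clamp at zero so timestamps slightly in the future do not score above 1.0
    age_days = max((now - timestamp).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)

now = datetime(2024, 1, 15)
print(recency_score(now, now))                       # 1.0
print(recency_score(now - timedelta(days=7), now))   # 0.5
print(recency_score(now - timedelta(days=14), now))  # 0.25

# For comparison: exp(-age / tau) with tau = 7 days has a shorter half-life.
print(7 * math.log(2))  # about 4.85 days
```

Whichever form you pick, keep the comment and the formula in agreement, because the half-life is what you will end up tuning.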
Lesson 4: Production Needs Infrastructure
For production, I moved to Elasticsearch with built-in BM25 and kNN vector search:
    from elasticsearch import Elasticsearch
    from sentence_transformers import SentenceTransformer


    class ProductionHybridSearch:
        def __init__(self, es_host='http://localhost:9200', index='memories'):
            # Newer Elasticsearch clients require the scheme in the host URL
            self.es = Elasticsearch([es_host])
            self.index = index
            self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

        def search(self, query, k=10):
            query_vec = self.encoder.encode(query).tolist()

            # Elasticsearch script_score for hybrid ranking.
            # Assumes 'embedding' is mapped as a dense_vector field.
            # Note: _score is the raw BM25 score here, so the 0.4/0.6 weights
            # are not applied on a common scale.
            response = self.es.search(
                index=self.index,
                body={
                    "size": k,
                    "query": {
                        "script_score": {
                            "query": {"match": {"text": query}},
                            "script": {
                                # + 1.0 keeps the score non-negative, which script_score requires
                                "source": "0.4 * _score + 0.6 * (cosineSimilarity(params.vec, 'embedding') + 1.0)",
                                "params": {"vec": query_vec}
                            }
                        }
                    }
                }
            )

            return [hit['_source'] for hit in response['hits']['hits']]

When to Use Hybrid Search
Hybrid search adds complexity. Use it when:
- Users search for specific terms (product codes, names, versions)
- You need both precision AND recall (document search that also finds related concepts)
- You’re building AI memory systems (exact recall + semantic understanding)
- Pure semantic search underperforms (benchmark first!)
Skip it when:
- Pure keyword search works fine (simple document lookup)
- You only need semantic matching (finding similar images, cross-lingual search)
- Your scale is small (a few thousand documents - brute force works)
Common Pitfalls
Pitfall 1: Ignoring Document Length
BM25 has built-in length normalization. Semantic search doesn’t.
    import numpy as np

    # Add length penalty to semantic scores
    def get_semantic_scores(query, docs, embeddings, encoder):
        query_emb = encoder.encode([query])[0]
        scores = np.dot(embeddings, query_emb)

        # Penalize very long documents (clamp to 1 so empty docs don't divide by zero)
        doc_lengths = np.array([len(d.split()) for d in docs])
        length_penalty = 1 / np.sqrt(np.maximum(doc_lengths, 1))

        return scores * length_penalty

Pitfall 2: Equal Weighting for All Queries
Some queries are keyword-heavy (“Python 3.11 bug”). Some are semantic-heavy (“how do I fix async issues”). Adapt dynamically.
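One rough way to tell the two kinds of query apart is a couple of regex heuristics. The helper names `has_specific_terms` and `is_conceptual` match the ones called in the function that follows, but these implementations are hypothetical sketches, and the patterns and question-word list are assumptions you would want to tune on real queries.

```python
import re

# Version numbers (3.11, v2.0.1), issue/PR references (#1234), and codes like E501.
SPECIFIC_PATTERN = re.compile(r"\bv?\d+(?:\.\d+)+\b|#\d+|\b[A-Z]+\d+\b")

# Openers that usually signal a conceptual, how/why style question.
CONCEPTUAL_OPENERS = ("how", "why", "what", "when", "should", "explain")

def has_specific_terms(query):
    """True if the query contains a version number, issue reference, or code-like token."""
    return SPECIFIC_PATTERN.search(query) is not None

def is_conceptual(query):
    """True if the query starts with a question word or ends with a question mark."""
    stripped = query.strip()
    words = stripped.lower().split()
    return bool(words) and (words[0] in CONCEPTUAL_OPENERS or stripped.endswith("?"))

print(has_specific_terms("Python 3.11 bug"))            # True
print(is_conceptual("how do I fix async issues"))       # True
print(has_specific_terms("how do I fix async issues"))  # False
```

Heuristics like these misfire on mixed queries such as "why does Python 3.11 hang", so checking for specific terms first, as the function does, is a reasonable tie-break.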
    def get_adaptive_alpha(query):
        """
        High alpha = more semantic, Low alpha = more keyword

        has_specific_terms() and is_conceptual() are query-classification
        helpers that you need to supply; they are not defined here.
        """
        # Queries with version numbers, codes, names → keyword-heavy
        if has_specific_terms(query):
            return 0.3  # 70% BM25, 30% semantic

        # Conceptual questions → semantic-heavy
        if is_conceptual(query):
            return 0.7  # 30% BM25, 70% semantic

        return 0.5  # Balanced

Pitfall 3: Forgetting Re-indexing
When documents change, you need to update everything derived from them:
- BM25 index: needs re-tokenization
- Vector index: needs re-embedding
- Timestamps: need updating for recency scoring
What’s Next
My current system works well for text. But I’m thinking about:
- Cross-modal retrieval - text queries finding images, audio
- Learning to rank - using user feedback to improve weighting
- Hierarchical memory - not just retrieval, but organization
The hybrid approach isn’t the final answer. But it’s a solid foundation.