
Hybrid BM25 + Semantic Search for AI Memory Retrieval

I spent two weeks building a semantic search system for my AI memory project. Then I benchmarked it against a simple keyword search. The keyword search won by 11 percentage points.

That hurt. A lot.

Here’s what went wrong and how hybrid BM25 + semantic search fixed it.

The Problem: Semantic Search Alone Isn’t Enough

I started with the “modern” approach: embed everything with a transformer model, do cosine similarity search, done. Clean, elegant, the future of search.

naive_semantic_search.py
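# My first attempt - pure semantic search
from sentence_transformers import SentenceTransformer
import numpy as np

# `memories` (a list of strings) and `query` (a string) are defined elsewhere
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all memories; unit-length vectors make the dot product a cosine similarity
memory_embeddings = model.encode(memories, normalize_embeddings=True)

# Search
query_embedding = model.encode([query], normalize_embeddings=True)[0]
scores = np.dot(memory_embeddings, query_embedding)
top_idx = np.argsort(scores)[::-1][:10]
results = [memories[i] for i in top_idx]  # a plain list can't be indexed with an array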

The results felt… mushy. When I searched for “Python 3.11 asyncio bug”, I got documents about Python 3.10, asyncio tutorials, and general Python programming. Close semantically, but wrong.

Then I benchmarked against plain TF-IDF:

benchmark_results.txt
TF-IDF Accuracy: 67.3%
Semantic Accuracy: 56.2%
Gap: -11.1%

Eleven percentage points worse than an algorithm that has been around for decades.

Why Pure Semantic Search Fails

The problem is what I call “semantic smoothing.” When text gets compressed into dense vectors:

  1. Exact matches get diluted. “Python 3.11” and “Python 3.10” have nearly identical embeddings
  2. Rare terms lose importance. Technical jargon, product codes, names - all smoothed away
  3. Document length biases emerge. Short, precise documents get overshadowed by long, verbose ones

semantic_similarity_analysis.txt
Query: "Python 3.11 asyncio bug"

Top Results (Semantic Only):
1. "Getting started with asyncio in Python" (score: 0.89)
2. "Python 3.10 new features overview" (score: 0.87)
3. "Debugging async code patterns" (score: 0.85)
4. "Python 3.11 asyncio bug fix PR #1234" (score: 0.82) <- THE ACTUAL ANSWER

Top Results (TF-IDF):
1. "Python 3.11 asyncio bug fix PR #1234" (score: 0.91)
2. "Python 3.11 release notes - asyncio changes" (score: 0.78)
3. "Python 3.11 asyncio known issues" (score: 0.72)

The exact match on “3.11” matters. TF-IDF gets this. Semantic search doesn’t.
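
To see why an exact token like "3.11" carries so much weight in keyword scoring, here is a minimal BM25 sketch in plain Python. It is an illustration, not the code I benchmarked: the whitespace tokenizer, the Lucene-style IDF formula, and the parameter values `k1=1.5` and `b=0.75` are all assumptions, and the four documents are the titles from the analysis above.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; keeps "3.11" as a single token.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms (low document frequency) get a larger IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is scaled by relative document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Getting started with asyncio in Python",
    "Python 3.10 new features overview",
    "Debugging async code patterns",
    "Python 3.11 asyncio bug fix PR #1234",
]
scores = bm25_scores("Python 3.11 asyncio bug", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the last document contains "3.11" and "bug", and because those tokens appear in a single document they get the highest IDF, so that document ranks first. The third document shares no token with the query and scores zero.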

The Solution: Hybrid Retrieval

I needed both. Keyword precision for exact matches, semantic understanding for conceptual queries.

Here’s the architecture I ended up with:

hybrid_architecture.txt
┌─────────────────────────────────────────────────────────────────┐
│                Query: "Python 3.11 asyncio bug"                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
            ┌────────────────────┴────────────────────┐
            │                                         │
            ▼                                         ▼
      ┌───────────┐                           ┌───────────────┐
      │   BM25    │                           │   Semantic    │
      │   Index   │                           │    Search     │
      └───────────┘                           └───────────────┘
            │                                         │
            │ Keyword scores                          │ Similarity scores
            │ "3.11" → high boost                     │ "asyncio bug" → context
            ▼                                         ▼
  ┌─────────────────────────────────────────────────────────────┐
  │                     Score Fusion Layer                      │
  │       normalized_score = 0.4 * bm25 + 0.6 * semantic        │
  └─────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
          ┌─────────────────────────────────────────────┐
          │             5-Signal Re-ranker              │
          │  1. Keyword relevance (from BM25)           │
          │  2. Semantic similarity (from embeddings)   │
          │  3. Vividness (importance/emotion)          │
          │  4. Mood congruency (context fit)           │
          │  5. Recency (temporal decay)                │
          └─────────────────────────────────────────────┘
                                 │
                                 ▼
                           Ranked Results

Implementation: The Hybrid Retriever

After several iterations, here’s what works:

hybrid_memory_retriever.py
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np


class HybridMemoryRetriever:
    def __init__(self, documents):
        self.documents = documents

        # Layer 1: BM25 for keyword precision
        tokenized_docs = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(tokenized_docs)

        # Layer 2: Semantic embeddings (unit length, so dot product = cosine similarity)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = self.encoder.encode(documents, normalize_embeddings=True)

    @staticmethod
    def _tokenize(text):
        # Lowercase so "Python" matches "python"; documents and queries
        # must go through the same tokenizer
        return text.lower().split()

    def retrieve(self, query, k=10, alpha=0.5):
        """
        Hybrid retrieval with configurable weighting.
        alpha: weight for semantic (0.0 = pure BM25, 1.0 = pure semantic)
        """
        # BM25 keyword scores
        bm25_scores = self.bm25.get_scores(self._tokenize(query))

        # Semantic similarity scores
        query_embedding = self.encoder.encode([query], normalize_embeddings=True)[0]
        semantic_scores = np.dot(self.embeddings, query_embedding)

        # Normalize both to [0, 1] - critical step I missed initially
        bm25_norm = self._normalize(bm25_scores)
        semantic_norm = self._normalize(semantic_scores)

        # Fusion: weighted combination
        hybrid_scores = alpha * semantic_norm + (1 - alpha) * bm25_norm
        return self._get_top_k(hybrid_scores, k)

    def _normalize(self, scores):
        """Min-max normalization to [0, 1] range"""
        min_s, max_s = scores.min(), scores.max()
        if max_s - min_s < 1e-9:
            # All scores are equal, so this signal carries no ranking information
            return np.zeros_like(scores)
        return (scores - min_s) / (max_s - min_s)

    def _get_top_k(self, scores, k):
        indices = np.argsort(scores)[::-1][:k]
        return [(self.documents[i], scores[i]) for i in indices]

The key insight: normalization is critical. BM25 scores are unbounded and can easily reach 20 or more, while cosine similarity stays between -1 and 1. Without normalization, the BM25 term dominates the sum.

The Re-ranking Layer

Hybrid search got me most of the way back to TF-IDF (63.8% vs. 67.3%). But I wanted better. That’s where re-ranking comes in.

memory_reranker.py
from datetime import datetime
import numpy as np


class MemoryReranker:
    """
    5-signal re-ranking for AI memory retrieval.
    Signals: keyword, semantic, vividness, mood congruency, recency
    """

    def __init__(self):
        self.vividness_scores = {}  # document text -> importance (0-1)
        self.current_mood = None    # current task context

    def rerank(self, query, documents, base_scores, timestamps):
        now = datetime.now()
        final_scores = []
        for doc, base_score, ts in zip(documents, base_scores, timestamps):
            # Signal 1 & 2: Already combined in base_score (BM25 + semantic)

            # Signal 3: Vividness (importance/emotional weight), neutral 0.5 if unknown
            vividness = self.vividness_scores.get(doc, 0.5)

            # Signal 4: Mood congruency (contextual fit)
            mood_score = self._compute_mood_fit(doc)

            # Signal 5: Recency. Exponential decay with a 7-day time constant,
            # which works out to a half-life of about 4.9 days (7 * ln 2)
            age_hours = (now - ts).total_seconds() / 3600
            recency = np.exp(-age_hours / (24 * 7))

            # Weighted fusion
            final = (
                0.30 * base_score +  # Combined keyword + semantic
                0.25 * vividness +   # Importance weight
                0.25 * mood_score +  # Context relevance
                0.20 * recency       # Freshness
            )
            final_scores.append(final)

        # Return re-ranked results, best first
        ranked_idx = np.argsort(final_scores)[::-1]
        return [(documents[i], final_scores[i]) for i in ranked_idx]

    def _compute_mood_fit(self, doc):
        """Compute contextual relevance based on current task/state."""
        if self.current_mood is None:
            return 0.5
        # Placeholder: actual implementation depends on mood representation
        # Could be: task embedding, keyword matching, state machine, etc.
        return 0.5
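
The mood signal above is only a placeholder. As one possible way to fill it in, here is a sketch of the "keyword matching" option mentioned in the comment. It assumes the current mood is represented as a set of task tags, which is my assumption for illustration and not how the real system has to work; `compute_mood_fit` is a hypothetical standalone version of `_compute_mood_fit`.

```python
def compute_mood_fit(doc, current_mood):
    """Fraction of mood tags that appear in the document, from 0.0 to 1.0.

    Returns a neutral 0.5 when no mood is set, like the placeholder does.
    """
    if not current_mood:
        return 0.5
    doc_tokens = set(doc.lower().split())
    mood_tags = {tag.lower() for tag in current_mood}
    return len(mood_tags & doc_tokens) / len(mood_tags)


doc = "Python 3.11 asyncio bug fix"
print(compute_mood_fit(doc, {"asyncio", "bug"}))  # both tags present
print(compute_mood_fit(doc, {"cooking"}))         # no tag present
print(compute_mood_fit(doc, None))                # no mood set
```

A tag overlap like this is crude, but it keeps the signal in the same 0 to 1 range as the other inputs to the weighted fusion.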

Results: Closing the Gap

After implementing the full hybrid + re-ranking pipeline:

final_benchmark.txt
┌────────────────────────────────────────────────────────────────┐
│                   Retrieval Accuracy Results                   │
├────────────────────────────────────────────────────────────────┤
│  Pure TF-IDF:            67.3%                                 │
│  Pure Semantic:          56.2%  (11.1-point gap)               │
│  Hybrid (BM25+Semantic): 63.8%  (closed ~70% of the gap)       │
│  Hybrid + Re-ranking:    68.1%  (exceeded TF-IDF baseline!)    │
└────────────────────────────────────────────────────────────────┘

Not only did I close the gap, I exceeded the TF-IDF baseline by 0.8 percentage points.

Lessons Learned

Lesson 1: Always Benchmark Against Simple Baselines

I spent two weeks on semantic search before I benchmarked against TF-IDF. Two weeks of optimizing the wrong thing.

Now I always start with: “What’s the dumbest possible solution?” and benchmark against that.
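
A baseline benchmark doesn't need much code. The sketch below assumes "accuracy" means top-k hit rate over a set of labeled queries, which is one reasonable definition and not necessarily the metric behind the numbers in this post. The function names and the two labeled queries are made up for illustration.

```python
def keyword_overlap_search(query, docs, k=10):
    """Dumbest plausible baseline: rank by number of shared lowercase tokens."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]


def hit_rate_at_k(search_fn, labeled_queries, docs, k=1):
    """Fraction of queries whose expected document appears in the top k results."""
    hits = sum(
        expected in search_fn(query, docs, k)
        for query, expected in labeled_queries
    )
    return hits / len(labeled_queries)


docs = [
    "Getting started with asyncio in Python",
    "Python 3.10 new features overview",
    "Debugging async code patterns",
    "Python 3.11 asyncio bug fix PR #1234",
]
labeled_queries = [
    ("Python 3.11 asyncio bug", "Python 3.11 asyncio bug fix PR #1234"),
    ("asyncio getting started", "Getting started with asyncio in Python"),
]
baseline = hit_rate_at_k(keyword_overlap_search, labeled_queries, docs, k=1)
print(f"Baseline hit rate@1: {baseline:.1%}")
```

Any retriever with the same `(query, docs, k)` signature can be passed as `search_fn`, so the fancy system and the dumb baseline get scored by exactly the same code.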

Lesson 2: Normalization Matters More Than You Think

My first hybrid attempt just added BM25 and semantic scores:

wrong_fusion.py
# WRONG: Adding unnormalized scores
final_score = bm25_score + semantic_score

BM25 scores dominated because they were 10x larger. The semantic component did nothing.
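
Here is a small worked example of that failure, using made-up scores for three documents. It is a sketch in plain Python; `min_max` is the same min-max idea as the retriever's normalization step, written for lists.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; return all zeros if they are all equal."""
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-9:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rank(scores):
    """Document indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up scores for three documents.
bm25 = [12.0, 11.0, 2.0]       # raw BM25: large and unbounded
semantic = [0.20, 0.90, 0.85]  # cosine similarity: never above 1

raw_sum = [b + s for b, s in zip(bm25, semantic)]
fused = [0.5 * b + 0.5 * s for b, s in zip(min_max(bm25), min_max(semantic))]

print(rank(bm25))     # [0, 1, 2]
print(rank(raw_sum))  # [0, 1, 2]  identical to BM25 alone
print(rank(fused))    # [1, 0, 2]  semantic score now moves document 1 to the top
```

With the raw sum, the ranking is exactly the BM25 ranking. After normalization, document 1 (strong on both signals) overtakes document 0 (strong on BM25 only).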

Lesson 3: Multiple Signals Beat One Perfect Signal

I kept trying to make semantic search “perfect.” Better embeddings, larger models, more training data.

The real answer was: stop trying to make one signal perfect. Combine multiple imperfect signals.

signal_diversity.txt
Single Signal Approach:
  Semantic only          → 56.2%
  TF-IDF only            → 67.3%

Multi-Signal Approach:
  BM25 + Semantic        → 63.8%
  BM25 + Semantic + Time → 65.9%
  Full 5-signal rerank   → 68.1%

Each signal contributes something the others miss.

Lesson 4: Production Needs Infrastructure

For production, I moved to Elasticsearch with built-in BM25 and kNN vector search:

production_hybrid.py
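from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer


class ProductionHybridSearch:
    def __init__(self, es_host='http://localhost:9200', index='memories'):
        # Recent Elasticsearch clients require the scheme (http/https) in the host URL
        self.es = Elasticsearch([es_host])
        self.index = index
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

    def search(self, query, k=10):
        query_vec = self.encoder.encode(query).tolist()

        # Elasticsearch script_score for hybrid ranking. It re-scores only the
        # documents matched by the `match` query, and _score is the raw
        # (unnormalized) BM25 score, so the weights may need tuning.
        # Keyword arguments follow the 8.x Python client; older clients take body={...}.
        response = self.es.search(
            index=self.index,
            size=k,
            query={
                "script_score": {
                    "query": {"match": {"text": query}},
                    "script": {
                        # + 1.0 keeps the result non-negative, which script_score requires
                        "source": """
                            0.4 * _score +
                            0.6 * (cosineSimilarity(params.vec, 'embedding') + 1.0)
                        """,
                        "params": {"vec": query_vec}
                    }
                }
            }
        )
        return [hit['_source'] for hit in response['hits']['hits']]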
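
The query above relies on two fields existing in the index: a `text` field for the `match` query and an `embedding` field for `cosineSimilarity`. A mapping along these lines would provide them. Treat it as a sketch: the constant name is hypothetical, and the only hard requirement I'm sure of is that `dims` matches the 384-dimensional output of all-MiniLM-L6-v2.

```python
# Hypothetical mapping for the "memories" index used by ProductionHybridSearch.
MEMORY_INDEX_MAPPINGS = {
    "properties": {
        # Analyzed text field; Elasticsearch scores `match` queries on it with BM25 by default.
        "text": {"type": "text"},
        # One vector per document; dims must equal the encoder's output size
        # (384 for all-MiniLM-L6-v2).
        "embedding": {"type": "dense_vector", "dims": 384},
    }
}

# With the 8.x Python client this would be passed at index creation, roughly:
#   es.indices.create(index="memories", mappings=MEMORY_INDEX_MAPPINGS)
print(MEMORY_INDEX_MAPPINGS["properties"]["embedding"])
```

If you swap the encoder for a different model, `dims` has to change with it, and existing documents need to be re-embedded and re-indexed.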

Hybrid search adds complexity. Use it when:

  1. Users search for specific terms (product codes, names, versions)
  2. You need both precision AND recall (document search that also finds related concepts)
  3. You’re building AI memory systems (exact recall + semantic understanding)
  4. Pure semantic search underperforms (benchmark first!)

Skip it when:

  1. Pure keyword search works fine (simple document lookup)
  2. You only need semantic matching (finding similar images, cross-lingual search)
  3. Your scale is small (a few thousand documents - brute force works)

Common Pitfalls

Pitfall 1: Ignoring Document Length

BM25 has built-in length normalization. Semantic search doesn’t.

length_bias_fix.py
import numpy as np

# Add length penalty to semantic scores
# (`encoder` is the SentenceTransformer instance that produced `embeddings`)
def get_semantic_scores(query, docs, embeddings):
    query_emb = encoder.encode([query])[0]
    scores = np.dot(embeddings, query_emb)

    # Penalize very long documents; clamp to 1 so empty documents don't divide by zero
    doc_lengths = np.array([max(len(d.split()), 1) for d in docs])
    length_penalty = 1 / np.sqrt(doc_lengths)
    return scores * length_penalty

Pitfall 2: Equal Weighting for All Queries

Some queries are keyword-heavy (“Python 3.11 bug”). Some are semantic-heavy (“how do I fix async issues”). Adapt dynamically.

adaptive_weighting.py
def get_adaptive_alpha(query):
    """
    High alpha = more semantic, Low alpha = more keyword
    """
    # Queries with version numbers, codes, names → keyword-heavy
    if has_specific_terms(query):
        return 0.3  # 70% BM25, 30% semantic

    # Conceptual questions → semantic-heavy
    if is_conceptual(query):
        return 0.7  # 30% BM25, 70% semantic

    return 0.5  # Balanced
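
`has_specific_terms` and `is_conceptual` are left undefined above. The sketch below shows one cheap way they could work, using a regex and a list of question words. Both heuristics are assumptions for illustration (a real system might use a classifier or query logs), and `get_adaptive_alpha` is repeated with the same thresholds so the block runs on its own.

```python
import re

# Version numbers ("3.11"), issue/PR references ("#1234"), and snake_case
# identifiers ("get_scores") suggest the user wants an exact match.
SPECIFIC_TERM = re.compile(r"\d+(?:\.\d+)+|#\d+|\w+_\w+")
QUESTION_OPENERS = ("how", "why", "what", "when", "which", "should", "can")


def has_specific_terms(query):
    return SPECIFIC_TERM.search(query) is not None


def is_conceptual(query):
    words = query.lower().split()
    if not words:
        return False
    return words[0] in QUESTION_OPENERS or query.strip().endswith("?")


def get_adaptive_alpha(query):
    """High alpha = more semantic, low alpha = more keyword."""
    if has_specific_terms(query):
        return 0.3
    if is_conceptual(query):
        return 0.7
    return 0.5


for q in ["Python 3.11 bug", "how do I fix async issues", "asyncio tutorial"]:
    print(q, "->", get_adaptive_alpha(q))
```

Because the specific-term check runs first, a question that also contains a version number ("why does 3.11 break asyncio?") is still treated as keyword-heavy.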

Pitfall 3: Forgetting Re-indexing

When documents change, you need to update everything derived from them:

BM25 index: needs re-tokenization
Vector index: needs re-embedding
Timestamps: need updating for recency scoring
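
One way to avoid forgetting any of these is to route every write through a single method. The sketch below is a hypothetical `MemoryStore`, not part of my actual system, and `embed_fn` stands in for whatever encoder you use. It marks the BM25 index as stale rather than updating it in place, on the assumption that the index is rebuilt from the full token lists.

```python
from datetime import datetime


class MemoryStore:
    """Keeps the data behind all three indexes in sync on every write."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # hypothetical callable: text -> vector
        self.tokens = {}           # doc_id -> tokens (feeds the BM25 index)
        self.embeddings = {}       # doc_id -> vector (feeds the vector index)
        self.updated_at = {}       # doc_id -> timestamp (feeds recency scoring)
        self.bm25_stale = False    # True when the BM25 index must be rebuilt

    def upsert(self, doc_id, text):
        """Insert or update a document and refresh everything derived from it."""
        self.tokens[doc_id] = text.lower().split()
        self.embeddings[doc_id] = self.embed_fn(text)
        # Naive local time, to match datetime.now() in the re-ranker.
        self.updated_at[doc_id] = datetime.now()
        # BM25 statistics (document frequencies, average length) depend on the
        # whole corpus, so any change invalidates an index built from them.
        self.bm25_stale = True


# Stand-in embedding function so the sketch runs without a model.
store = MemoryStore(embed_fn=lambda text: [float(len(text))])
store.upsert("m1", "Python 3.10 asyncio bug")
store.upsert("m1", "Python 3.11 asyncio bug")
print(store.tokens["m1"], store.bm25_stale)
```

The caller then rebuilds the BM25 index from `store.tokens.values()` whenever `bm25_stale` is set, and resets the flag afterwards.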

What’s Next

My current system works well for text. But I’m thinking about:

  1. Cross-modal retrieval - text queries finding images, audio
  2. Learning to rank - using user feedback to improve weighting
  3. Hierarchical memory - not just retrieval, but organization

The hybrid approach isn’t the final answer. But it’s a solid foundation.

Final Words + More Resources

My intention with this article was to share my knowledge and experience so that others can benefit from it. If you want to get in touch, you can reach me by email.
