What is Hybrid Search in RAG? When to Use It (Complete Guide)
My RAG system was failing on simple queries. Users searched for “error code 404” and got nothing relevant. But when they searched for “page not found problem”, suddenly results appeared. The issue? Pure vector search was too semantic—it missed exact term matches that users expected.
That’s when I learned about hybrid search, and honestly, it changed everything.
The Problem with Single-Method Retrieval
I started with pure vector search because it was the “modern” approach. Semantic embeddings, transformer models, the works. But I kept running into these frustrating scenarios:
Query: "How to fix CORS error"
Vector Results: Articles about cross-origin, web security, browser policies
User Wanted: Specific CORS error troubleshooting guide

Query: "PostgreSQL connection refused port 5432"
Vector Results: Database connection guides, PostgreSQL tutorials
User Wanted: Exact error message solution

Query: "async await not working"
Vector Results: Asynchronous programming concepts, JavaScript tutorials
User Wanted: Specific debugging steps for await issues

Vector search excels at semantic similarity but struggles with exact terms, error codes, and specific identifiers. Meanwhile, traditional keyword search (BM25) has the opposite problem: great for exact matches, terrible at understanding intent.
┌─────────────────────────────────────────────────────────────┐
│                     RETRIEVAL SPECTRUM                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  BM25 (Sparse)                Vector (Dense)                │
│  ─────────────                ──────────────                │
│  ✓ Exact term matching        ✓ Semantic understanding      │
│  ✓ Fast and interpretable     ✓ Handles synonyms            │
│  ✓ No model needed            ✓ Concept matching            │
│  ✗ Misses synonyms            ✗ Misses exact terms          │
│  ✗ No semantic understanding  ✗ Black box scoring           │
│                                                             │
│                        HYBRID SEARCH                        │
│                        ─────────────                        │
│                   Both strengths combined                   │
│                      via result fusion                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

What Hybrid Search Actually Does
Hybrid search runs multiple retrieval methods in parallel and merges their results. The most common combination is BM25 (sparse retrieval) + vector search (dense retrieval), but you can blend any retrieval methods.
Here’s the workflow:
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Query: "How to fix CORS error in Express.js"                │
│                                                              │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
            ┌─────────────────────────────────────┐
            │         PARALLEL RETRIEVAL          │
            └─────────────────────────────────────┘
                               │
                    ┌──────────┴──────────┐
                    ▼                     ▼
             ┌─────────────┐       ┌─────────────┐
             │    BM25     │       │   Vector    │
             │  Retrieval  │       │  Retrieval  │
             └─────────────┘       └─────────────┘
                    │                     │
                    ▼                     ▼
               Top-K Docs            Top-K Docs
              (Term-based)           (Semantic)
                    │                     │
                    └──────────┬──────────┘
                               ▼
            ┌─────────────────────────────────────┐
            │         RESULT FUSION (RRF)         │
            │     Combine and re-rank results     │
            └─────────────────────────────────────┘
                               │
                               ▼
                     Final Top-K Documents

The magic happens in the fusion step. Different strategies exist, but Reciprocal Rank Fusion (RRF) has become the default choice for most implementations.
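The "parallel retrieval" box in the workflow can be sketched roughly as follows. This is a minimal illustration, not a production implementation: `retrieve_in_parallel`, `fake_bm25`, and `fake_vector` are hypothetical names, and the two fake retrievers stand in for real BM25 and vector backends.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_in_parallel(query, retrievers, top_k=10):
    """Run every retriever on the same query concurrently.

    `retrievers` maps a method name to a callable(query, top_k) that
    returns a ranked list of docs. The result is {method_name: ranked_docs},
    which is the shape a fusion step can consume.
    """
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = {
            name: pool.submit(fn, query, top_k)
            for name, fn in retrievers.items()
        }
        # .result() blocks until each retriever has finished
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for real BM25 and vector retrievers
def fake_bm25(query, top_k):
    return [{'id': 'doc1'}, {'id': 'doc2'}][:top_k]


def fake_vector(query, top_k):
    return [{'id': 'doc4'}, {'id': 'doc1'}][:top_k]


results = retrieve_in_parallel(
    "How to fix CORS error in Express.js",
    {'bm25': fake_bm25, 'vector': fake_vector},
)
```

Threads are a reasonable fit here because real retrievers mostly wait on network or database I/O.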
How Result Fusion Works
I tested three main fusion approaches:
1. Reciprocal Rank Fusion (RRF)
RRF is elegantly simple. For each document, you calculate a score based on its rank in each retrieval method:
def reciprocal_rank_fusion(results_dict, k=60):
    """
    RRF score = sum(1 / (k + rank)) for each retrieval method

    k is a smoothing constant (default 60 is standard)
    """
    fused_scores = {}

    for method_name, doc_list in results_dict.items():
        for rank, doc in enumerate(doc_list, start=1):
            doc_id = doc['id']

            if doc_id not in fused_scores:
                fused_scores[doc_id] = {
                    'doc': doc,
                    'rrf_score': 0
                }

            # Add contribution from this ranking
            fused_scores[doc_id]['rrf_score'] += 1 / (k + rank)

    # Sort by fused score
    sorted_results = sorted(
        fused_scores.values(),
        key=lambda x: x['rrf_score'],
        reverse=True
    )

    return sorted_results

# Example usage
bm25_results = [
    {'id': 'doc1', 'content': 'CORS error fix...', 'score': 0.95},
    {'id': 'doc2', 'content': 'Express middleware...', 'score': 0.87},
    {'id': 'doc3', 'content': 'Security headers...', 'score': 0.72},
]

vector_results = [
    {'id': 'doc4', 'content': 'Cross-origin setup...', 'score': 0.91},
    {'id': 'doc1', 'content': 'CORS error fix...', 'score': 0.89},
    {'id': 'doc5', 'content': 'API authentication...', 'score': 0.83},
]

all_results = {
    'bm25': bm25_results,
    'vector': vector_results
}

final_ranking = reciprocal_rank_fusion(all_results)
# doc1 appears in both, so it gets highest RRF score

The beauty of RRF: it doesn’t need normalized scores. BM25 scores and vector similarity scores are incomparable, but ranks are always comparable.
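To make the arithmetic concrete, here is a small worked version of the same example that uses only document IDs and ranks. `rrf_scores` is a hypothetical helper written for this illustration; it follows the same formula as the function above, with k=60.

```python
def rrf_scores(rankings, k=60):
    """rankings: {method: [doc_id, ...]} in rank order. Returns {doc_id: score}."""
    scores = {}
    for doc_ids in rankings.values():
        for rank, doc_id in enumerate(doc_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


scores = rrf_scores({
    'bm25': ['doc1', 'doc2', 'doc3'],
    'vector': ['doc4', 'doc1', 'doc5'],
})
ranking = sorted(scores, key=scores.get, reverse=True)

# doc1: 1/61 + 1/62  (rank 1 in BM25, rank 2 in vector)
# doc4: 1/61         (rank 1 in vector only)
# doc2: 1/62         (rank 2 in BM25 only)
# doc3, doc5: 1/63   (rank 3 in one list each, so they tie)
```

Note that the raw scores (0.95, 0.91, and so on) never enter the calculation. Only positions do, which is why no normalization is needed.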
2. Weighted Score Fusion
This approach requires score normalization:
def normalize_scores(results):
    """Return copies of the docs with scores min-max scaled to [0, 1]."""
    if not results:
        return []

    scores = [doc['score'] for doc in results]
    min_score = min(scores)
    range_score = max(scores) - min_score

    if range_score == 0:
        range_score = 1  # Avoid division by zero

    # Build new dicts so the caller's result lists are not mutated
    # (a shallow list.copy() would still share the same doc dicts)
    return [
        {**doc, 'normalized_score': (doc['score'] - min_score) / range_score}
        for doc in results
    ]

def weighted_fusion(bm25_results, vector_results, bm25_weight=0.4, vector_weight=0.6):
    """
    Combine normalized scores with configurable weights.

    Common splits:
    - 50/50: Equal importance
    - 40/60: Slight preference for semantic
    - 30/70: Strong semantic preference
    """
    # Normalize both result sets
    bm25_normalized = normalize_scores(bm25_results)
    vector_normalized = normalize_scores(vector_results)

    # Create unified doc dictionary
    all_docs = {}

    for doc in bm25_normalized:
        all_docs[doc['id']] = {
            'doc': doc,
            'bm25_score': doc['normalized_score'],
            'vector_score': 0
        }

    for doc in vector_normalized:
        if doc['id'] in all_docs:
            all_docs[doc['id']]['vector_score'] = doc['normalized_score']
        else:
            all_docs[doc['id']] = {
                'doc': doc,
                'bm25_score': 0,
                'vector_score': doc['normalized_score']
            }

    # Calculate weighted final score
    for doc_id, data in all_docs.items():
        data['final_score'] = (
            bm25_weight * data['bm25_score'] +
            vector_weight * data['vector_score']
        )

    # Sort by final score
    sorted_results = sorted(
        all_docs.values(),
        key=lambda x: x['final_score'],
        reverse=True
    )

    return sorted_results

The problem with weighted fusion: normalization is tricky. BM25 scores are unbounded and can range widely, while cosine similarities for “similar” documents often cluster in a narrow band such as 0.7-0.9. Making these comparable requires careful tuning.
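A quick way to see why min-max normalization is tricky is to apply it to two small score lists. The numbers below are made up for illustration, and `min_max` is a hypothetical helper that mirrors the scaling described above.

```python
def min_max(scores):
    """Scale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [(s - lo) / span for s in scores]


# Illustrative raw scores (assumed, not measured)
bm25_raw = [28.0, 12.0, 10.0]    # one strong keyword hit, two weak ones
vector_raw = [0.84, 0.82, 0.80]  # three almost equally similar documents

bm25_norm = min_max(bm25_raw)      # roughly [1.0, 0.11, 0.0]
vector_norm = min_max(vector_raw)  # roughly [1.0, 0.5, 0.0]
```

The third vector result ends up at 0.0 even though its cosine similarity is only 0.04 below the best one. Min-max stretches whatever range happens to be in the top-K, so small gaps get amplified and the last document in each list always contributes nothing.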
3. Re-ranking with Cross-Encoders
The most accurate but expensive approach:
from sentence_transformers import CrossEncoder

# Load the cross-encoder once; reloading the model on every call is slow
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, combined_results, top_k=10):
    """
    Use cross-encoder to re-rank combined results.
    Slower but more accurate than RRF or weighted fusion.
    """
    if not combined_results:
        return []

    # Prepare query-document pairs
    pairs = [[query, doc['content']] for doc in combined_results]

    # Get re-ranking scores
    scores = reranker.predict(pairs)

    # Sort by cross-encoder scores
    scored_results = list(zip(combined_results, scores))
    scored_results.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in scored_results[:top_k]]

Cross-encoders are slow (each query-document pair needs full transformer inference) but typically the most accurate of the three approaches. Best for smaller result sets or when accuracy is critical.
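Because the cost grows with the number of query-document pairs, a common way to keep it bounded is to re-rank only the first few fused candidates. The sketch below assumes each candidate is a dict with a `content` field. `rerank_top_n` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without downloading a model; in practice `score_pairs` could be a cross-encoder's `predict` method.

```python
def rerank_top_n(query, fused_results, score_pairs, rerank_depth=20, top_k=5):
    """Re-rank only the first `rerank_depth` fused candidates.

    score_pairs: callable that takes a list of [query, text] pairs and
    returns one relevance score per pair.
    """
    candidates = fused_results[:rerank_depth]
    if not candidates:
        return []

    pairs = [[query, doc['content']] for doc in candidates]
    scores = score_pairs(pairs)

    # Sort candidate indices by score, highest first
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]


# Hypothetical scorer for illustration only: counts shared words
def overlap_scorer(pairs):
    return [
        len(set(q.lower().split()) & set(text.lower().split()))
        for q, text in pairs
    ]


docs = [
    {'id': 'doc4', 'content': 'Cross-origin setup for APIs'},
    {'id': 'doc1', 'content': 'CORS error fix for Express'},
    {'id': 'doc5', 'content': 'API authentication basics'},
]
top = rerank_top_n("fix CORS error", docs, overlap_scorer, rerank_depth=3, top_k=2)
```

With this split, `rerank_depth` caps the number of model inferences per query regardless of how many documents the retrievers return.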
┌───────────────────────────────────────────────────────────────┐
│                  FUSION STRATEGY COMPARISON                   │
├─────────────────┬──────────────┬─────────────┬────────────────┤
│ Method          │ Speed        │ Accuracy    │ Tuning Effort  │
├─────────────────┼──────────────┼─────────────┼────────────────┤
│ RRF             │ Fast         │ Good        │ Minimal (k=60) │
│ Weighted Score  │ Fast         │ Good        │ Medium         │
│ Cross-Encoder   │ Slow         │ Excellent   │ Minimal        │
└─────────────────┴──────────────┴─────────────┴────────────────┘

When to Use Hybrid Search
After running production RAG systems for months, I’ve developed a simple decision framework:
Use Hybrid Search When:
- Building production RAG applications (default choice)
- Users search for error codes, product names, or specific identifiers
- Document corpus has both technical content and conceptual explanations
- You want both precision and recall
- Query patterns are diverse (some semantic, some exact-match)
Use Pure Vector Search When:
- Document corpus is homogeneous (all similar content types)
- Queries are primarily semantic/conceptual
- Dealing with multilingual content where exact terms don’t match
- Speed is critical and you can tolerate some misses
Use Pure BM25 When:
- Documents are short and keywords are sufficient
- You need maximum interpretability for debugging
- Computational resources are extremely limited
- Exact term matching is all that matters (e.g., log search)
From the Reddit discussion on hybrid search, the consensus is clear:
“Logs 100%, other than that, always hybrid”
“Blending vector with fulltext searching in mysql/aurora or postgresql is absolutely a great idea”
“By default, hybrid or semantic search. Works for the majority of cases”
This aligns with my production experience. Hybrid search has become the standard for a reason—it works.
Implementation Approaches
PostgreSQL with pgvector
If you’re already using PostgreSQL, adding hybrid search is straightforward:
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

def hybrid_search_postgres(query_text, query_embedding, top_k=10):
    """
    Hybrid search using PostgreSQL with the pgvector extension.
    Combines full-text search (tsvector) with vector similarity.
    """
    conn = psycopg2.connect("postgresql://user:pass@localhost/db")
    register_vector(conn)
    cur = conn.cursor()

    # register_vector adapts NumPy arrays to the vector type
    embedding = np.asarray(query_embedding, dtype=np.float32)

    # Single query combining keyword and vector search, fused with RRF (k=60)
    # at database level. ts_rank_cd is PostgreSQL's built-in full-text
    # ranking; it serves as the keyword score here and is not true BM25.
    # Ranks are computed inside each CTE because PostgreSQL does not allow
    # a window function inside an aggregate such as SUM().
    cur.execute("""
        WITH bm25_results AS (
            SELECT id, content,
                   ROW_NUMBER() OVER (
                       ORDER BY ts_rank_cd(to_tsvector('english', content),
                                           plainto_tsquery('english', %s)) DESC
                   ) AS rank_pos
            FROM documents
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
            ORDER BY rank_pos
            LIMIT %s
        ),
        vector_results AS (
            SELECT id, content,
                   ROW_NUMBER() OVER (ORDER BY distance) AS rank_pos
            FROM (
                SELECT id, content, embedding <=> %s AS distance
                FROM documents
                ORDER BY distance
                LIMIT %s
            ) nearest
        ),
        combined AS (
            SELECT id, content, rank_pos FROM bm25_results
            UNION ALL
            SELECT id, content, rank_pos FROM vector_results
        )
        SELECT id, content,
               SUM(1.0 / (60 + rank_pos)) AS rrf_score
        FROM combined
        GROUP BY id, content
        ORDER BY rrf_score DESC
        LIMIT %s;
    """, (query_text, query_text, top_k, embedding, top_k, top_k))

    results = cur.fetchall()
    conn.close()
    return results

The key advantage: everything runs in-database. No need to manage separate search services.
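For the query above to run efficiently, the table needs both a full-text index and a vector index. The following is a sketch of one possible setup, not the author's actual schema: the table name `documents` and its columns match the query, while the embedding dimension (384), the index names, and the `create_hybrid_schema` helper are assumptions. The HNSW index type requires a reasonably recent pgvector release.

```python
# Assumed schema for illustration; adjust the vector dimension to your model
SETUP_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text NOT NULL,
        embedding vector(384)
    )
    """,
    # GIN index on the same expression the full-text WHERE clause uses
    """
    CREATE INDEX IF NOT EXISTS documents_fts_idx
    ON documents USING gin (to_tsvector('english', content))
    """,
    # HNSW index for cosine distance, which is what the <=> operator computes
    """
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    """,
]


def create_hybrid_schema(cur):
    """Run the setup statements on an open DB-API cursor."""
    for statement in SETUP_STATEMENTS:
        cur.execute(statement)
```

The expression in the GIN index has to match the expression in the query (`to_tsvector('english', content)`) for the planner to use it.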
Elasticsearch/OpenSearch
Elasticsearch 8.x and OpenSearch both support hybrid search natively:
from elasticsearch import Elasticsearch

def hybrid_search_elasticsearch(query_text, query_vector, index="documents", top_k=10):
    """
    Hybrid search using Elasticsearch RRF pipeline.
    Requires Elasticsearch 8.9+ with kNN enabled.
    """
    es = Elasticsearch(["http://localhost:9200"])

    # Elasticsearch has built-in RRF support since 8.9
    response = es.search(
        index=index,
        size=top_k,
        query={
            "bool": {
                "should": [
                    # BM25 query
                    {
                        "match": {
                            "content": {
                                "query": query_text,
                                "boost": 1.0
                            }
                        }
                    }
                ]
            }
        },
        knn={
            "field": "embedding",
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": top_k * 10
        },
        rank={
            "rrf": {
                "window_size": top_k * 2,
                "rank_constant": 60
            }
        }
    )

    return response['hits']['hits']

Elasticsearch handles the RRF fusion internally. You only configure both query types and specify the rank constant.
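The search above assumes an index whose mapping has a text field for BM25 and a dense vector field for kNN. A minimal sketch of such a mapping follows. The field names `content` and `embedding` match the query; the dimension of 384 and the helper name `build_hybrid_mapping` are assumptions for illustration.

```python
def build_hybrid_mapping(dims=384):
    """Mapping with a text field for BM25 and a dense_vector field for kNN."""
    return {
        "properties": {
            "content": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,            # must equal the embedding model's output size
                "index": True,           # make the field searchable with kNN
                "similarity": "cosine",
            },
        }
    }


mapping = build_hybrid_mapping(dims=384)

# With a running cluster and the 8.x Python client, this could be applied as:
# es.indices.create(index="documents", mappings=mapping)
```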
Dedicated Vector Databases (Pinecone, Weaviate, Qdrant)
Most vector databases now support hybrid search:
import weaviate

def hybrid_search_weaviate(query_text, query_vector=None, alpha=0.5, top_k=10):
    """
    Hybrid search using Weaviate (v3 Python client syntax).
    Alpha controls BM25/vector balance (0=BM25 only, 1=vector only).
    """
    client = weaviate.Client("http://localhost:8080")

    response = (
        client.query
        .get("Document", ["content", "title"])
        .with_hybrid(
            query=query_text,
            alpha=alpha,  # 0.5 = equal balance
            vector=query_vector  # optional, auto-generated if omitted
        )
        .with_limit(top_k)
        .do()
    )

    return response['data']['Get']['Document']

The alpha parameter controls the blend: 0 is pure BM25, 1 is pure vector, 0.5 is equal weight.
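Conceptually, alpha acts as a linear weight between the two signals. The sketch below shows that idea only; it assumes both scores are already normalized to [0, 1], the `blend` helper is hypothetical, and Weaviate's internal fusion algorithms differ in detail.

```python
def blend(bm25_score, vector_score, alpha=0.5):
    """Weighted blend: alpha=0 keeps only BM25, alpha=1 keeps only vector."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * bm25_score


keyword_only = blend(0.2, 0.8, alpha=0.0)   # the BM25 score passes through
semantic_only = blend(0.2, 0.8, alpha=1.0)  # the vector score passes through
balanced = blend(0.2, 0.8, alpha=0.5)       # midpoint of the two scores
```

Lower alpha values suit corpora where exact identifiers matter most, and higher values suit conceptual queries.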
Common Pitfalls I Encountered
1. Score Normalization Hell
When I tried weighted fusion initially, my results were terrible. BM25 scores dominated everything because they weren’t normalized. The fix:
# WRONG: Raw scores
final = 0.5 * bm25_score + 0.5 * vector_score
# BM25 might be 0-30, vector is 0-1. BM25 dominates.

# RIGHT: Normalize first
bm25_normalized = (bm25_score - bm25_min) / (bm25_max - bm25_min)
vector_normalized = (vector_score - vector_min) / (vector_max - vector_min)
final = 0.5 * bm25_normalized + 0.5 * vector_normalized

2. Ignoring Document Length Bias
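BM25 normalizes for document length, but embeddings don’t. Every chunk is mapped to a vector of the same size, so a long chunk that covers several topics produces a diluted embedding, and text beyond the embedding model’s token limit may be truncated. Chunks of very different lengths are therefore not scored on equal footing. Consider chunking uniformly so that every chunk is of comparable length.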
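Uniform chunking can be as simple as a fixed-size word window with some overlap. This is a minimal sketch under that assumption. `chunk_words` and its default sizes are illustrative choices, and real pipelines often split on tokens or sentences instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear intact in one of the chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = ' '.join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
```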
3. Not Tuning the RRF Constant
The default k=60 works well, but I found lower values (k=20-40) sometimes help when you want top ranks to matter more:
# k=60: Smooth ranking, less emphasis on top position
# k=20: Top positions have more influence
# k=10: Very sensitive to rank order

# Test different k values on your validation set.
# evaluate_retrieval and ground_truth are placeholders for your own
# metric function and labeled relevance data.
for k in [10, 20, 40, 60, 100]:
    results = reciprocal_rank_fusion(all_results, k=k)
    score = evaluate_retrieval(results, ground_truth)
    print(f"k={k}: {score}")

Advanced Pattern: Multi-Tool Retrieval Agents
For complex RAG systems, I’ve seen teams move beyond simple hybrid search to multi-tool retrieval:
┌─────────────────────────────────────────────────────────────┐
│                    AGENT-BASED RETRIEVAL                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    User Query                                               │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────┐                                            │
│  │     LLM     │ ─── Analyzes query intent                  │
│  │    Agent    │                                            │
│  └─────────────┘                                            │
│         │                                                   │
│         ├──────────► BM25 Tool (exact terms)                │
│         ├──────────► Vector Tool (semantic)                 │
│         ├──────────► Graph Tool (relationships)             │
│         ├──────────► SQL Tool (structured data)             │
│         └──────────► Web Search Tool (external)             │
│                                                             │
│         ▼                                                   │
│  Synthesized Response                                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

From the Reddit discussion:
“ReAct, single agent with tools. Each tool is a type of retrieval”
This pattern treats each retrieval method as a tool the agent can invoke based on query analysis. More complex but more flexible.
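The tool-per-retriever idea can be sketched without any agent framework. Everything in this block is hypothetical: the tool functions return placeholder strings, and `choose_tools` is a crude rule that stands in for the LLM's query analysis.

```python
import re


def bm25_tool(query):
    return [f"[bm25] exact-term hits for: {query}"]


def vector_tool(query):
    return [f"[vector] semantic hits for: {query}"]


TOOLS = {'bm25': bm25_tool, 'vector': vector_tool}


def choose_tools(query):
    """Placeholder for the LLM's decision about which tools to call.

    Crude rule: queries with digits or quoted phrases also get the BM25 tool.
    """
    selected = ['vector']
    if re.search(r'\d|"', query):
        selected.insert(0, 'bm25')
    return selected


def agent_retrieve(query, tools=TOOLS, chooser=choose_tools):
    """Call each selected tool and collect the evidence for synthesis."""
    evidence = []
    for name in chooser(query):
        evidence.extend(tools[name](query))
    return evidence


exact_evidence = agent_retrieve("error code 404")
semantic_evidence = agent_retrieve("why is my page missing")
```

In a real ReAct-style agent, the model would pick tools from their descriptions and could call them repeatedly. The registry shape (name mapped to callable) stays the same.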
Key Takeaways
After implementing hybrid search across multiple production RAG systems:
- Default to hybrid for any serious RAG application. The combination of BM25 precision and vector semantic understanding covers most query patterns.
- Use RRF for fusion. It’s simple, requires no score normalization, and works reliably across different domains.
- Tune on your data. Default parameters work well, but evaluate on your specific query patterns and documents.
- Consider your infrastructure. PostgreSQL users can stay with a single database. Large-scale systems might benefit from dedicated search engines.
- Don’t over-engineer. Start with simple hybrid search. Move to multi-tool agent retrieval only if your use case demands it.
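For the "tune on your data" takeaway, a simple metric such as recall@k is often enough to compare configurations. The sketch below is one possible way to score a retrieval run. The function names are hypothetical, and it assumes you have a labeled set of relevant document IDs per test query.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant docs that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = set(ranked_ids[:k]) & relevant
    return len(found) / len(relevant)


def mean_recall_at_k(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per test query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Two labeled test queries with made-up IDs
runs = [
    (['doc1', 'doc4', 'doc2'], {'doc1', 'doc3'}),  # finds 1 of 2 relevant docs
    (['doc7', 'doc8', 'doc9'], {'doc8'}),          # finds 1 of 1 relevant docs
]
score = mean_recall_at_k(runs, k=2)
```

Running the same labeled queries against each fusion setting gives a like-for-like comparison instead of relying on spot checks.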
Hybrid search isn’t experimental anymore—it’s the production standard. If you’re building RAG systems today, this should be your default retrieval approach.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit: Anyone actually using Vectorless RAG?
- 👨💻 PostgreSQL pgvector Documentation
- 👨💻 Elasticsearch Hybrid Search
- 👨💻 Reciprocal Rank Fusion Paper