Hybrid Search vs Reranker in RAG: Which Should You Use First?
Problem
When I tried adding a Cohere reranker to my RAG system, I was disappointed. The retrieval quality didn’t improve much, but my response latency jumped from 80ms to 280ms.
I asked in a forum: “Is Adding a Reranker to My RAG Stack Actually Worth the Extra Latency?”
The response I got surprised me: “If the first-stage retrieval is the bottleneck, would you recommend switching to hybrid search before even touching a reranker?”
I realized I’d been optimizing the wrong thing. I was trying to improve precision (ranking order) when my real problem was recall (missing documents entirely).
What happened?
I had a RAG system using pure vector search with OpenAI embeddings. When users asked questions, the system would search the vector database and pass the top 20 results to the LLM.
Here’s my retrieval code:
```python
# My original setup: pure vector search
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    # Generate embedding
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Vector search
    results = vector_db.search(
        query_vector=query_embedding,
        top_k=top_k
    )

    return results
```

The problem? Users were complaining that the system was “missing stuff.” If they asked about a specific technical term that appeared in the documents, the system wouldn’t find it. Vector search captures semantic meaning, but it doesn’t match exact keywords well.
So I added a reranker:
```python
# WRONG: Adding reranker before fixing recall
# (assumes `cohere` is an initialized Cohere client)
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Get more candidates for reranking
    results = vector_db.search(
        query_vector=query_embedding,
        top_k=50  # Fetch more for reranker
    )

    # Rerank
    reranked = cohere.rerank(
        model="rerank-v3",
        query=query,
        documents=[r.text for r in results],
        top_n=top_k
    )

    # Each rerank result carries the index of the document it refers to
    return [results[r.index] for r in reranked.results]
```

I reran my tests. The accuracy improved slightly, maybe 5-10% better on my eval set. But the latency penalty was huge: +200ms per query. And users still complained about missing information.
That’s when someone explained the core issue to me.
The reason
The key issue is the difference between recall and precision:
- Recall: Did we retrieve the relevant documents at all?
- Precision: Are the relevant documents ranked at the top?
A reranker can only reorder what your retrieval system already found. If the relevant document is at position 200 in your vector search results, and you only pass the top 50 to the reranker, the reranker never sees it. You can’t rerank what you didn’t retrieve.
Here’s what was happening in my system:
```
Pure Vector Search (top 50):
  Position 1-10:  Semantically similar but not exactly what I need
  Position 11-50: Mixed relevance
  Position 200:   The exact document with the technical term I need

Reranker (top 20):
  Only sees positions 1-50
  Reorders them, but the relevant document is still missing
```

My recall@50 was poor: I wasn’t fetching the right documents in the first place. The reranker was just polishing the top of a list that didn’t contain what I needed.
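To make that concrete, here is a minimal sketch of the situation above using toy data. The document IDs, the `oracle_rerank` helper, and the positions are hypothetical; the "oracle" stands in for a best-case reranker, not for any real API.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant IDs that appear in the top k of a ranked list."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def oracle_rerank(candidate_ids, relevant_ids, top_n):
    """Hypothetical perfect reranker: moves every relevant candidate to the front."""
    # sorted() is stable, so non-relevant candidates keep their original order.
    ordered = sorted(candidate_ids, key=lambda doc_id: doc_id not in relevant_ids)
    return ordered[:top_n]


# Toy first-stage ranking: 300 document IDs, best match first.
first_stage = [f"doc-{i}" for i in range(1, 301)]

# One relevant document sits at position 8, the other at position 200.
relevant = {"doc-8", "doc-200"}

candidates = first_stage[:50]                       # what the reranker is given
reranked = oracle_rerank(candidates, relevant, 20)  # best possible reordering

recall_before = recall_at_k(first_stage, relevant, k=50)
recall_after = recall_at_k(reranked, relevant, k=20)

print(reranked[0])                  # the retrieved relevant doc moves to the top
print(recall_before, recall_after)  # recall stays capped by the candidate set
```

Even a perfect reranker leaves recall at 0.5 here, because `doc-200` was never in the candidate set it was given.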
How to solve it?
The solution is to improve recall first with hybrid search, then add a reranker for precision.
Step 1: Hybrid Search (Recall Boost)
Hybrid search combines BM25 (keyword search) with vector search (semantic search). You fetch results from both and merge them.
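For intuition on the keyword half, the sketch below scores a few toy documents with a basic BM25 variant written in plain Python. The documents, the error code `E4012`, and the `bm25_scores` helper are made up for illustration, and the IDF smoothing used here is one common choice that may differ from what a given library implements.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1" inside the log keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents need more hits to score the same.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: only the second document contains the exact term.
docs = [
    "how to tune the connection pool size",
    "fixing error E4012 in the ingestion worker",
    "general troubleshooting guide for workers",
]
scores = bm25_scores("E4012", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the document containing the literal token gets a non-zero score, which is the kind of exact match an embedding model can blur. In a hybrid setup, a keyword ranking like this is one of the two inputs that get merged.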
I implemented it like this:
```python
# CORRECT: Start with hybrid search
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query: str, top_k: int = 100) -> list[Document]:
    # 1. BM25 keyword search
    # (bm25_index is assumed to be a thin wrapper around BM25Okapi that
    # returns ranked Documents; rank_bm25 itself exposes get_scores / get_top_n)
    bm25_results = bm25_index.search(query, top_k=top_k)

    # 2. Vector semantic search
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    vector_results = vector_db.search(
        query_vector=query_embedding,
        top_k=top_k
    )

    # 3. Reciprocal Rank Fusion (RRF)
    hybrid_results = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        weights=[0.3, 0.7],  # Weight vector results more heavily
        top_k=top_k
    )

    return hybrid_results[:top_k]


def reciprocal_rank_fusion(
    result_lists: list[list[Document]],
    weights: list[float],
    top_k: int = 100,
    k: int = 60  # RRF constant
) -> list[Document]:
    """
    Combine multiple ranked lists using Reciprocal Rank Fusion.
    k is a constant (typically 60) to prevent high-ranked items from dominating.
    """
    scores = {}

    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results):
            doc_id = doc.id

            if doc_id not in scores:
                scores[doc_id] = {
                    'doc': doc,
                    'score': 0
                }

            # RRF formula: weight / (k + rank), with rank starting at 1.
            # enumerate() is 0-based, hence the "+ 1".
            scores[doc_id]['score'] += weight * (1 / (k + rank + 1))

    # Sort by combined score
    ranked = sorted(
        scores.values(),
        key=lambda x: x['score'],
        reverse=True
    )

    return [item['doc'] for item in ranked[:top_k]]
```

The results were dramatic:
```
Before (Pure Vector):
  Recall@50:    72%
  Precision@10: 48%
  Latency:      80ms

After (Hybrid Search):
  Recall@50:    94%   ← +22 points
  Precision@10: 61%
  Latency:      95ms  ← only +15ms
```

The hybrid search found the documents I was missing. BM25 caught the exact keyword matches that vector search missed. Vector search caught the semantic concepts that BM25 missed. Together, they covered both bases.
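A small worked example shows how weighted Reciprocal Rank Fusion brings a keyword-only hit into the candidate list. This is a sketch on plain ID lists with made-up document names (`spec`, `faq`, `intro`, `guide`); it uses the same 0.3 / 0.7 weights and `k = 60` as above, with ranks counted from 1.

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked ID lists: each list adds weight / (k + rank), rank starting at 1."""
    scores = {}
    for ranked_ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused, scores


# Hypothetical rankings: "spec" contains the exact keyword, so BM25 ranks it
# first, but vector search does not return it at all.
bm25_ranking = ["spec", "faq", "intro"]
vector_ranking = ["intro", "faq", "guide"]

fused, scores = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.3, 0.7])

for doc_id in fused:
    print(f"{doc_id}: {scores[doc_id]:.5f}")
# Fused order: intro, faq, guide, spec
```

`spec` ends up last because only the lower-weighted list contains it, but it is now in the candidate set. That is the recall gain; moving it up the list is a precision problem, which is what the next step addresses.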
Step 2: Add Reranker (Precision Boost)
Only after my recall@50 was solid (above 90%) did I add the reranker to improve precision:
```python
# CORRECT: Add reranker after recall is solid
# (assumes `cohere` is an initialized Cohere client)
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    # 1. Hybrid search with high top_k for recall
    hybrid_results = hybrid_search(query, top_k=100)

    # 2. Rerank top 50 for precision
    reranked = cohere.rerank(
        model="rerank-v3",
        query=query,
        documents=[r.text for r in hybrid_results[:50]],
        top_n=top_k
    )

    # Each rerank result carries the index of the document it refers to
    return [hybrid_results[r.index] for r in reranked.results]
```

Final results:
```
Hybrid + Reranker:
  Recall@50:    94%    (unchanged)
  Precision@10: 78%    ← +17 points
  Latency:      285ms  ← +190ms for reranker
```

Now I can make an informed tradeoff:
- If I need speed: Use hybrid search alone (95ms, 61% precision)
- If I need quality: Use hybrid + reranker (285ms, 78% precision)
But the key insight is that the reranker only helps because the hybrid search already finds the relevant documents. If I’d stuck with pure vector search, the reranker would still be missing key information.
When to use each approach
Based on what I learned, here’s when to use each approach:
Use hybrid search first when:
- You’re using pure vector search or pure keyword search
- Recall@50 is below 80-90%
- Users complain about missing information
- You want quick wins with minimal latency impact
- Your documents have both technical terms and semantic concepts
Add a reranker when:
- Recall@50 is solid (>90%) but precision@10 needs improvement
- You have latency budget (can afford 50-200ms extra)
- Ranking quality matters more than speed (e.g., research assistants)
- You’ve already optimized hybrid search weights
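The two checklists above can be sketched as a small decision helper. The function name, the returned labels, and the `precision_target` default are assumptions made for illustration; the 90% recall threshold and the 200ms reranker cost come from the numbers in this post.

```python
def recommend_next_step(
    recall_at_50: float,
    precision_at_10: float,
    latency_budget_ms: float,
    uses_hybrid_search: bool,
    recall_target: float = 0.90,      # the ">90%" threshold from the checklist
    precision_target: float = 0.75,   # assumption: choose a target for your use case
    reranker_cost_ms: float = 200.0,  # upper end of the 50-200ms range
) -> str:
    """Suggest the next retrieval optimization: recall first, then precision."""
    if recall_at_50 < recall_target:
        # Recall problem: a reranker cannot help yet.
        if not uses_hybrid_search:
            return "add hybrid search"
        return "tune hybrid search weights"
    if precision_at_10 >= precision_target:
        return "no change needed"
    if latency_budget_ms >= reranker_cost_ms:
        return "add reranker"
    return "stay with hybrid search"


# The three states this post went through, assuming a 300ms latency budget:
print(recommend_next_step(0.72, 0.48, 300, uses_hybrid_search=False))
print(recommend_next_step(0.94, 0.61, 300, uses_hybrid_search=True))
print(recommend_next_step(0.94, 0.78, 300, uses_hybrid_search=True))
```

The ordering of the checks is the point: the recall check runs first, so the reranker branch is only reachable once first-stage retrieval is good enough.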
Latency comparison
| Approach | Added Latency | When to Use |
|---|---|---|
| Pure Vector Search | 0ms | Baseline, quick prototype |
| Hybrid Search | +10-20ms | First optimization step |
| Hybrid + Reranker | +60-220ms | After recall is solid |
| Pure Vector + Reranker | +50-200ms | Never (worse latency, same recall issue) |
Common mistakes
I made several mistakes that you can avoid:
- Reranking before fixing retrieval: I added a reranker to a pure vector search system, which just added latency without fixing the underlying recall problem.
- Ignoring recall metrics: I only tracked final answer quality, not recall@50. I should have measured whether relevant documents were in my top 50 results.
- Not measuring impact: I didn’t baseline my system before adding the reranker. I couldn’t tell if the +200ms was worth it.
- Skipping hybrid search: I went straight from pure vector search to a reranker, missing the middle step that gives the biggest recall boost.
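For the "not measuring impact" mistake, a latency baseline does not need much code. The sketch below times any retrieval callable and reports median and p95 latency; `measure_latency_ms` and `fake_retrieve` are hypothetical names, and the fake function only sleeps to simulate work.

```python
import statistics
import time


def measure_latency_ms(retrieve, queries, repeats=3):
    """Call retrieve(query) for each query and return p50 / p95 latency in ms."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            retrieve(query)
            samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Nearest-rank style p95, clamped to the last sample for small sample counts.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "n": len(samples),
    }


def fake_retrieve(query):
    """Stand-in for a real retrieval function; sleeps briefly to simulate work."""
    time.sleep(0.002)
    return []


stats = measure_latency_ms(fake_retrieve, ["q1", "q2", "q3"])
print(stats)
```

Running this once per pipeline variant (pure vector, hybrid, hybrid plus reranker) on the same queries gives the before/after numbers needed to judge whether an added stage is worth its cost.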
Evaluation framework
To measure recall vs precision, I set up this evaluation:
```python
import numpy as np

def evaluate_recall_vs_precision(test_queries, ground_truth):
    """Compare recall vs precision for different approaches.

    test_queries is a list of query strings; ground_truth[i] is the list of
    relevant documents for test_queries[i].
    """
    before_recall = []
    after_recall = []
    reranker_precision = []

    for query, relevant_docs in zip(test_queries, ground_truth):
        # Pure vector search (embed the query first, as in retrieve_documents)
        query_embedding = openai.embeddings.create(
            model="text-embedding-3-small",
            input=query
        ).data[0].embedding
        vector_results = vector_db.search(query_vector=query_embedding, top_k=50)
        before_recall.append(recall_at_k(vector_results, relevant_docs, k=50))

        # Hybrid search
        hybrid_results = hybrid_search(query, top_k=50)
        after_recall.append(recall_at_k(hybrid_results, relevant_docs, k=50))

        # Hybrid + reranker
        reranked = cohere.rerank(
            model="rerank-v3",
            query=query,
            documents=[r.text for r in hybrid_results[:50]],
            top_n=20
        )
        # Map reranker indices back to documents before scoring
        reranked_docs = [hybrid_results[r.index] for r in reranked.results]
        reranker_precision.append(precision_at_k(reranked_docs, relevant_docs, k=10))

    print(f"Pure Vector Recall@50: {np.mean(before_recall):.1%}")
    print(f"Hybrid Search Recall@50: {np.mean(after_recall):.1%}")
    print(f"Hybrid + Reranker Precision@10: {np.mean(reranker_precision):.1%}")


def recall_at_k(results, relevant_docs, k):
    """Percentage of relevant documents found in top k."""
    relevant_ids = {doc.id for doc in relevant_docs}
    retrieved_ids = {doc.id for doc in results[:k]}
    return len(relevant_ids & retrieved_ids) / len(relevant_ids)


def precision_at_k(results, relevant_docs, k):
    """Percentage of top k results that are relevant."""
    relevant_ids = {doc.id for doc in relevant_docs}
    retrieved_ids = {doc.id for doc in results[:k]}
    return len(relevant_ids & retrieved_ids) / k
```

This evaluation framework showed me exactly where my system was weak and whether each optimization was worth the latency cost.
Summary
In this post, I showed why you should always use hybrid search before adding a reranker to your RAG system. The key point is that recall must come before precision—a reranker can only reorder what you’ve already retrieved.
If relevant documents aren’t in your top 50-100 results, a reranker won’t help. Hybrid search (BM25 + vector) improves recall with minimal latency overhead. Only after recall@50 is solid (>90%) should you layer in a reranker for better precision.
The optimization order matters:
- Start with hybrid search for recall boost (+10-20ms latency)
- Add reranker for precision boost (+50-200ms latency)
Don’t rerank bad retrieval—fix the retrieval first.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: Is Adding a Reranker to My RAG Stack Actually Worth the Extra Latency?
- Reciprocal Rank Fusion (RRF)
- Cohere Rerank API
- Information Retrieval: Recall vs Precision