
Hybrid Search vs Reranker in RAG: Which Should You Use First?

Problem

When I tried adding a Cohere reranker to my RAG system, I was disappointed. The retrieval quality didn’t improve much, but my response latency jumped from 80ms to 280ms.

I asked in a forum: “Is Adding a Reranker to My RAG Stack Actually Worth the Extra Latency?”

The response I got surprised me: “If the first-stage retrieval is the bottleneck, would you recommend switching to hybrid search before even touching a reranker?”

I realized I’d been optimizing the wrong thing. I was trying to improve precision (ranking order) when my real problem was recall (missing documents entirely).

What happened?

I had a RAG system using pure vector search with OpenAI embeddings. When users asked questions, the system would search the vector database and pass the top 20 results to the LLM.

Here’s my retrieval code:

retrieval.py
# My original setup: pure vector search
import openai

# Assumes `vector_db` (the vector store client) and `Document` are defined elsewhere.
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    # Generate the query embedding
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    # Vector search
    results = vector_db.search(
        query_vector=query_embedding,
        top_k=top_k
    )
    return results

The problem? Users were complaining that the system was “missing stuff.” If they asked about a specific technical term that appeared in the documents, the system wouldn’t find it. Vector search captures semantic meaning, but it doesn’t match exact keywords well.

So I added a reranker:

retrieval_with_reranker.py
# WRONG: Adding reranker before fixing recall
import openai

# Assumes `co` is an initialized cohere.Client and `vector_db` is defined elsewhere.
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    # Get more candidates for reranking
    results = vector_db.search(
        query_vector=query_embedding,
        top_k=50  # Fetch more for reranker
    )
    # Rerank
    reranked = co.rerank(
        model="rerank-english-v3.0",  # use whichever rerank model your account offers
        query=query,
        documents=[r.text for r in results],
        top_n=top_k
    )
    # Each rerank result carries the index of the document in the input list
    return [results[r.index] for r in reranked.results]

I reran my tests. The accuracy improved slightly—maybe 5-10% better on my eval set. But the latency penalty was huge: +200ms per query. And users still complained about missing information.

That’s when someone explained the core issue to me.

The reason

The key issue is the difference between recall and precision:

  • Recall: Did we retrieve the relevant documents at all?
  • Precision: Are the relevant documents ranked at the top?

A reranker can only reorder what your retrieval system already found. If the relevant document is at position 200 in your vector search results, and you only pass the top 50 to the reranker, the reranker never sees it. You can’t rerank what you didn’t retrieve.

Here’s what was happening in my system:

Pure Vector Search (top 50):
  Position 1-10: Semantically similar but not exactly what I need
  Position 11-50: Mixed relevance
  Position 200: The exact document with the technical term I need
Reranker (top 20):
  Only sees positions 1-50
  Reorders them, but the relevant document is still missing

My recall@50 was poor—I wasn’t fetching the right documents in the first place. The reranker was just polishing the top of a list that didn’t contain what I needed.
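To make that ceiling concrete, here is a minimal sketch with made-up document IDs. The ranking, the `relevant` set, and the "perfect reranker" are all hypothetical stand-ins, not my production code. It shows that recall over the candidate pool stays the same no matter how well the candidates are reordered.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the first k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


# Hypothetical first-stage ranking: 300 document IDs, best match first.
first_stage = [f"doc{i}" for i in range(1, 301)]
relevant = {"doc3", "doc200"}  # doc200 holds the exact technical term

# Only the top 50 candidates are handed to the reranker.
candidates = first_stage[:50]

# An idealized reranker: moves every relevant candidate to the front.
# It can still only reorder the candidates it was given.
perfect_rerank = sorted(candidates, key=lambda doc_id: doc_id not in relevant)

print(recall_at_k(candidates, relevant, k=50))      # 0.5
print(recall_at_k(perfect_rerank, relevant, k=50))  # 0.5
print(perfect_rerank[0])                            # doc3
```

The reranker puts `doc3` first, which helps precision, but `doc200` never entered the pool, so recall is stuck at 0.5.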

How to solve it?

The solution is to improve recall first with hybrid search, then add a reranker for precision.

Step 1: Hybrid Search (Recall Boost)

Hybrid search combines BM25 (keyword search) with vector search (semantic search). You fetch results from both and merge them.

I implemented it like this:

hybrid_search.py
# CORRECT: Start with hybrid search
import openai
from rank_bm25 import BM25Okapi

# Assumes these are built once at index time:
#   documents  - list[Document] in corpus order
#   bm25_index - BM25Okapi([doc.text.lower().split() for doc in documents])
# `vector_db` and `Document` are defined elsewhere.

def hybrid_search(query: str, top_k: int = 100) -> list[Document]:
    # 1. BM25 keyword search (tokenize the query the same way as the corpus)
    bm25_results = bm25_index.get_top_n(
        query.lower().split(), documents, n=top_k
    )
    # 2. Vector semantic search
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    vector_results = vector_db.search(
        query_vector=query_embedding,
        top_k=top_k
    )
    # 3. Reciprocal Rank Fusion (RRF)
    hybrid_results = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        weights=[0.3, 0.7],  # Favor vector results
        top_k=top_k
    )
    return hybrid_results


def reciprocal_rank_fusion(
    result_lists: list[list[Document]],
    weights: list[float],
    top_k: int = 100,
    k: int = 60  # RRF constant
) -> list[Document]:
    """
    Combine multiple ranked lists using Reciprocal Rank Fusion.
    k is a constant (typically 60) to prevent high-ranked items from dominating.
    """
    scores = {}
    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results):
            doc_id = doc.id
            if doc_id not in scores:
                scores[doc_id] = {
                    'doc': doc,
                    'score': 0.0
                }
            # RRF formula: 1 / (k + rank) with 1-based ranks.
            # enumerate() is 0-based, hence the + 1.
            scores[doc_id]['score'] += weight * (1 / (k + rank + 1))
    # Sort by combined score
    ranked = sorted(
        scores.values(),
        key=lambda x: x['score'],
        reverse=True
    )
    return [item['doc'] for item in ranked[:top_k]]

The results were dramatic:

Before (Pure Vector):
  Recall@50: 72%
  Precision@10: 48%
  Latency: 80ms
After (Hybrid Search):
  Recall@50: 94% ← +22 points
  Precision@10: 61%
  Latency: 95ms ← Only +15ms

The hybrid search found the documents I was missing. BM25 caught the exact keyword matches that vector search missed. Vector search caught the semantic concepts that BM25 missed. Together, they covered both bases.
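Here is a small worked example of the fusion arithmetic, as a sketch on plain string IDs rather than my real `Document` objects. The IDs and the helper name `rrf_scores` are made up for illustration. `kw_doc` is an exact keyword match that BM25 ranks first and the vector search does not return at all.

```python
from collections import defaultdict


def rrf_scores(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion over lists of document IDs (best first)."""
    scores = defaultdict(float)
    for ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ids, start=1):  # 1-based ranks
            scores[doc_id] += weight / (k + rank)
    return dict(scores)


bm25_ids = ["kw_doc", "a", "b"]  # kw_doc: exact keyword match, BM25 rank 1
vector_ids = ["a", "b", "c"]     # kw_doc is missing from the vector results

scores = rrf_scores([bm25_ids, vector_ids], weights=[0.3, 0.7])
fused = sorted(scores, key=scores.get, reverse=True)

print(fused)  # ['a', 'b', 'c', 'kw_doc']
```

With these weights `kw_doc` lands at the bottom of the fused list, but it is in the list. That is the recall gain: the document is now a candidate, and moving it up is a ranking problem rather than a retrieval problem.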

Step 2: Add Reranker (Precision Boost)

Only after my recall@50 was solid (above 90%) did I add the reranker to improve precision:

hybrid_with_reranker.py
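# CORRECT: Add reranker after recall is solid
# Assumes `co` is an initialized cohere.Client and hybrid_search() is the function above.
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    # 1. Hybrid search with high top_k for recall
    hybrid_results = hybrid_search(query, top_k=100)
    # 2. Rerank top 50 for precision
    reranked = co.rerank(
        model="rerank-english-v3.0",  # use whichever rerank model your account offers
        query=query,
        documents=[r.text for r in hybrid_results[:50]],
        top_n=top_k
    )
    # r.index points into the list passed as `documents` (the first 50 hybrid results)
    return [hybrid_results[r.index] for r in reranked.results]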

Final results:

Hybrid + Reranker:
  Recall@50: 94% (unchanged)
  Precision@10: 78% ← +17 points
  Latency: 285ms ← +190ms for reranker

Now I can make an informed tradeoff:

  • If I need speed: Use hybrid search alone (95ms, 61% precision)
  • If I need quality: Use hybrid + reranker (285ms, 78% precision)

But the key insight is that the reranker only helps because the hybrid search already finds the relevant documents. If I’d stuck with pure vector search, the reranker would still be missing key information.

When to use each approach

Based on what I learned, here’s when to use each approach:

Use hybrid search first when:

  • You’re using pure vector search or pure keyword search
  • Recall@50 is below 80-90%
  • Users complain about missing information
  • You want quick wins with minimal latency impact
  • Your documents have both technical terms and semantic concepts

Add a reranker when:

  • Recall@50 is solid (>90%) but precision@10 needs improvement
  • You have latency budget (can afford 50-200ms extra)
  • Ranking quality matters more than speed (e.g., research assistants)
  • You’ve already optimized hybrid search weights
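The two checklists above can be folded into a rough decision helper. This is only a sketch: the function name, the 75% precision target, and the default reranker cost are my own illustrative assumptions, while the 90% recall target comes from the rule of thumb above.

```python
def next_optimization(
    recall_at_50: float,
    precision_at_10: float,
    spare_latency_ms: float,
    recall_target: float = 0.90,      # from the rule of thumb above
    precision_target: float = 0.75,   # assumed target, tune for your product
    reranker_cost_ms: float = 200.0,  # assumed cost, measure your own
) -> str:
    """Suggest the next retrieval optimization to try."""
    # Recall comes first: a reranker cannot surface documents that were never fetched.
    if recall_at_50 < recall_target:
        return "hybrid search"
    # Recall is solid; rerank only if precision is lacking and the budget allows it.
    if precision_at_10 < precision_target and spare_latency_ms >= reranker_cost_ms:
        return "reranker"
    return "none"


print(next_optimization(0.72, 0.48, spare_latency_ms=300))  # hybrid search
print(next_optimization(0.94, 0.61, spare_latency_ms=300))  # reranker
print(next_optimization(0.94, 0.61, spare_latency_ms=100))  # none
```

The first two calls use the numbers from my own before and after measurements; the third shows the same metrics with a tighter latency budget.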

Latency comparison

| Approach | Added Latency | When to Use |
|---|---|---|
| Pure Vector Search | 0ms | Baseline, quick prototype |
| Hybrid Search | +10-20ms | First optimization step |
| Hybrid + Reranker | +60-220ms | After recall is solid |
| Pure Vector + Reranker | +50-200ms | Never (worse latency, same recall issue) |

Common mistakes

I made several mistakes that you can avoid:

  1. Reranking before fixing retrieval: I added a reranker to a pure vector search system, which added latency without fixing the underlying recall problem.

  2. Ignoring recall metrics: I only tracked final answer quality, not recall@50. I should have measured whether relevant documents were in my top 50 results.

  3. Not measuring impact: I didn’t baseline my system before adding the reranker. I couldn’t tell if the +200ms was worth it.

  4. Skipping hybrid search: I went straight from pure vector search to a reranker, missing the middle step that gives the biggest recall boost.
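For mistake 3, a latency baseline does not need much code. The sketch below times any retrieval callable with `time.perf_counter` and reports median and an approximate p95. The helper name `measure_latency_ms` and the `fake_retrieve` stand-in are hypothetical; in practice you would pass your real retrieval function and a representative set of queries.

```python
import statistics
import time


def measure_latency_ms(retrieve, queries, warmup=1):
    """Time retrieve(query) for each query; return median and ~p95 in milliseconds."""
    # Warm-up calls are not timed (connection setup, caches, lazy imports).
    for query in queries[:warmup]:
        retrieve(query)

    timings = []
    for query in queries:
        start = time.perf_counter()
        retrieve(query)
        timings.append((time.perf_counter() - start) * 1000)

    timings.sort()
    # Simple nearest-rank style p95; good enough for a before/after comparison.
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {"p50": statistics.median(timings), "p95": timings[p95_index]}


def fake_retrieve(query):
    """Stand-in for a real retrieval function; sleeps about 1ms."""
    time.sleep(0.001)
    return []


stats = measure_latency_ms(fake_retrieve, ["q1", "q2", "q3", "q4"])
print(f"p50={stats['p50']:.1f}ms p95={stats['p95']:.1f}ms")
```

Run it once before a change and once after, on the same queries, so the latency delta can be compared against the quality delta.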

Evaluation framework

To measure recall vs precision, I set up this evaluation:

evaluation.py
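import numpy as np
import openai

# Assumes `vector_db`, `co` (cohere.Client) and hybrid_search() are defined as above.

def evaluate_recall_vs_precision(test_queries, ground_truth):
    """Compare recall vs precision for different approaches.

    test_queries: list of query strings
    ground_truth: dict mapping each query to its list of relevant documents
    """
    before_recall = []
    after_recall = []
    reranker_precision = []
    for query in test_queries:
        relevant_docs = ground_truth[query]
        # Pure vector search (embed the query first, as in retrieval.py)
        query_embedding = openai.embeddings.create(
            model="text-embedding-3-small",
            input=query
        ).data[0].embedding
        vector_results = vector_db.search(query_vector=query_embedding, top_k=50)
        before_recall.append(recall_at_k(vector_results, relevant_docs, k=50))
        # Hybrid search
        hybrid_results = hybrid_search(query, top_k=50)
        after_recall.append(recall_at_k(hybrid_results, relevant_docs, k=50))
        # Hybrid + reranker
        reranked = co.rerank(
            model="rerank-english-v3.0",  # use whichever rerank model your account offers
            query=query,
            documents=[r.text for r in hybrid_results[:50]],
            top_n=20
        )
        # Map rerank results back to documents before scoring them
        reranked_docs = [hybrid_results[r.index] for r in reranked.results]
        reranker_precision.append(precision_at_k(reranked_docs, relevant_docs, k=10))
    print(f"Pure Vector Recall@50: {np.mean(before_recall):.1%}")
    print(f"Hybrid Search Recall@50: {np.mean(after_recall):.1%}")
    print(f"Hybrid + Reranker Precision@10: {np.mean(reranker_precision):.1%}")


def recall_at_k(results, relevant_docs, k):
    """Fraction of relevant documents found in the top k."""
    relevant_ids = {doc.id for doc in relevant_docs}
    retrieved_ids = {doc.id for doc in results[:k]}
    if not relevant_ids:
        return 0.0  # avoid dividing by zero for queries with no labeled documents
    return len(relevant_ids & retrieved_ids) / len(relevant_ids)


def precision_at_k(results, relevant_docs, k):
    """Fraction of the top k results that are relevant."""
    relevant_ids = {doc.id for doc in relevant_docs}
    retrieved_ids = {doc.id for doc in results[:k]}
    return len(relevant_ids & retrieved_ids) / k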

This evaluation framework showed me exactly where my system was weak and whether each optimization was worth the latency cost.

Summary

In this post, I showed why you should always use hybrid search before adding a reranker to your RAG system. The key point is that recall must come before precision—a reranker can only reorder what you’ve already retrieved.

If relevant documents aren’t in your top 50-100 results, a reranker won’t help. Hybrid search (BM25 + vector) improves recall with minimal latency overhead. Only after recall@50 is solid (>90%) should you layer in a reranker for better precision.

The optimization order matters:

  1. Start with hybrid search for recall boost (+10-20ms latency)
  2. Add reranker for precision boost (+50-200ms latency)

Don’t rerank bad retrieval—fix the retrieval first.
