Hybrid Search vs Reranker in RAG: Which Should You Use First?

Feb 25, 2026

Problem

When I tried adding a Cohere reranker to my RAG system, I was disappointed. The retrieval quality didn’t improve much, but my response latency jumped from 80ms to 280ms.

I asked in a forum: “Is Adding a Reranker to My RAG Stack Actually Worth the Extra Latency?”

The response I got surprised me: “If the first-stage retrieval is the bottleneck, would you recommend switching to hybrid search before even touching a reranker?”

I realized I’d been optimizing the wrong thing. I was trying to improve precision (ranking order) when my real problem was recall (missing documents entirely).

What happened?

I had a RAG system using pure vector search with OpenAI embeddings. When users asked questions, the system would search the vector database and pass the top 20 results to the LLM.

Here’s my retrieval code:

# My original setup: pure vector search
import openai  # vector_db and Document are defined elsewhere (not shown)
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    # Generate embedding
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Vector search
    results = vector_db.search(
        query_vector=query_embedding,
        top_k=top_k
    )

    return results

The problem? Users were complaining that the system was “missing stuff.” If they asked about a specific technical term that appeared in the documents, the system wouldn’t find it. Vector search captures semantic meaning, but it doesn’t match exact keywords well.

So I added a reranker:

# WRONG: Adding reranker before fixing recall
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # Get more candidates for reranking
    results = vector_db.search(
        query_vector=query_embedding,
        top_k=50  # Fetch more for reranker
    )

    # Rerank
    reranked = cohere.rerank(
        model="rerank-v3",
        query=query,
        documents=[r.text for r in results],
        top_n=top_k
    )

    # The rerank response keeps its ranked items under .results
    return [results[r.index] for r in reranked.results]

I reran my tests. The accuracy improved slightly—maybe 5-10% better on my eval set. But the latency penalty was huge: +200ms per query. And users still complained about missing information.

That’s when someone explained the core issue to me.

The reason

The key issue is the difference between recall and precision:

Recall: Did we retrieve the relevant documents at all?
Precision: Are the relevant documents ranked at the top?

A reranker can only reorder what your retrieval system already found. If the relevant document is at position 200 in your vector search results, and you only pass the top 50 to the reranker, the reranker never sees it. You can’t rerank what you didn’t retrieve.
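
To make this concrete, here is a minimal sketch with made-up document IDs (`d1`, `d2`, `d3`, and `d200` are illustrative, not from my system). It assumes the reranker returns some permutation of the candidates it was given, which is the only thing a reranker can do, and shows that recall is identical before and after.

```python
def recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the relevant documents present in the retrieved list."""
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)


# Toy first-stage output: the relevant document "d200" was never retrieved.
candidates = ["d1", "d2", "d3"]
relevant_ids = {"d2", "d200"}

# A "perfect" reranker for this toy case: move the relevant candidate to the top.
# sorted() is stable, so the remaining candidates keep their original order.
reranked = sorted(candidates, key=lambda doc_id: doc_id not in relevant_ids)

recall_before = recall(candidates, relevant_ids)
recall_after = recall(reranked, relevant_ids)

print(reranked)                      # ['d2', 'd1', 'd3']
print(recall_before, recall_after)   # 0.5 0.5
```

The ordering improves (`d2` moves to the top), but `d200` is still absent, so recall stays at 0.5 no matter how good the reranker is.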

Here’s what was happening in my system:

Pure Vector Search (top 50):
  Position 1-10:    Semantically similar but not exactly what I need
  Position 11-50:   Mixed relevance
  Position 200:     The exact document with the technical term I need

Reranker (top 20):
  Only sees positions 1-50
  Reorders them, but the relevant document is still missing

My recall@50 was poor—I wasn’t fetching the right documents in the first place. The reranker was just polishing the top of a list that didn’t contain what I needed.

How to solve it?

The solution is to improve recall first with hybrid search, then add a reranker for precision.

Step 1: Hybrid Search (Recall Boost)

Hybrid search combines BM25 (keyword search) with vector search (semantic search). You fetch results from both and merge them.

I implemented it like this:

# CORRECT: Start with hybrid search
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query: str, top_k: int = 100) -> list[Document]:
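    # 1. BM25 keyword search
    # (bm25_index is assumed to be a wrapper exposing search(); rank_bm25's
    # BM25Okapi itself provides get_scores() and get_top_n() rather than search())
    bm25_results = bm25_index.search(query, top_k=top_k)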

    # 2. Vector semantic search
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    vector_results = vector_db.search(
        query_vector=query_embedding,
        top_k=top_k
    )

    # 3. Reciprocal Rank Fusion (RRF)
    hybrid_results = reciprocal_rank_fusion(
        [bm25_results, vector_results],
        weights=[0.3, 0.7],  # Weight vector results above BM25
        top_k=top_k
    )

    return hybrid_results[:top_k]

def reciprocal_rank_fusion(
    result_lists: list[list[Document]],
    weights: list[float],
    top_k: int = 100,
    k: int = 60  # RRF constant
) -> list[Document]:
    """
    Combine multiple ranked lists using Reciprocal Rank Fusion.
    k is a constant (typically 60) to prevent high-ranked items from dominating.
    """
    scores = {}

    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results):
            doc_id = doc.id

            if doc_id not in scores:
                scores[doc_id] = {
                    'doc': doc,
                    'score': 0
                }

            # RRF formula: 1 / (k + rank) with 1-based rank; enumerate() is 0-based, hence rank + 1
            scores[doc_id]['score'] += weight * (1 / (k + rank + 1))

    # Sort by combined score
    ranked = sorted(
        scores.values(),
        key=lambda x: x['score'],
        reverse=True
    )

    return [item['doc'] for item in ranked[:top_k]]
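
To see what the fusion does to actual numbers, here is a small worked example. It is a sketch that works on plain document IDs instead of `Document` objects; the helper `rrf_scores` and the IDs in the two toy rankings are hypothetical and exist only for illustration. The weights and `k=60` match the values used above.

```python
def rrf_scores(rankings: list[list[str]], weights: list[float], k: int = 60) -> dict[str, float]:
    """Weighted Reciprocal Rank Fusion over lists of document IDs (best first)."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return scores


# Toy rankings (illustrative IDs only).
bm25_ranking = ["exact_term_doc", "shared_doc", "keyword_noise"]
vector_ranking = ["semantic_doc", "shared_doc", "exact_term_doc"]

scores = rrf_scores([bm25_ranking, vector_ranking], weights=[0.3, 0.7])
fused = sorted(scores, key=scores.get, reverse=True)

for doc_id in fused:
    print(f"{doc_id:15s} {scores[doc_id]:.6f}")
# shared_doc      0.016129   (0.3/62 + 0.7/62)
# exact_term_doc  0.016029   (0.3/61 + 0.7/63)
# semantic_doc    0.011475   (0.7/61)
# keyword_noise   0.004762   (0.3/63)
```

Documents that appear in both lists collect two contributions, so they rise above documents that only one retriever found, even one that a single retriever ranked first.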

The results were dramatic:

Before (Pure Vector):
  Recall@50:  72%
  Precision@10:  48%
  Latency: 80ms

After (Hybrid Search):
  Recall@50:  94%  ← +22 percentage points
  Precision@10:  61%
  Latency: 95ms   ← Only +15ms

The hybrid search found the documents I was missing. BM25 caught the exact keyword matches that vector search missed. Vector search caught the semantic concepts that BM25 missed. Together, they covered both bases.
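
The code above never shows how `bm25_index` is built, so here is one way it could look. This is a self-contained sketch of Okapi BM25 in pure Python, not the index I actually ran: the class name `SimpleBM25Index`, the naive tokenizer, the `k1`/`b` defaults, and the toy corpus are all assumptions. It also returns `(doc_id, score)` pairs, so you would need to map the IDs back to `Document` objects before fusing.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep runs of letters, digits, and underscores (deliberately naive)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class SimpleBM25Index:
    """Hypothetical stand-in for bm25_index: Okapi BM25 over an in-memory corpus."""

    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        if not docs:
            raise ValueError("corpus must not be empty")
        self.k1, self.b = k1, b
        self.n_docs = len(docs)
        self.term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.doc_len = {doc_id: sum(tf.values()) for doc_id, tf in self.term_freqs.items()}
        self.avg_len = sum(self.doc_len.values()) / self.n_docs
        # Document frequency: in how many documents each term occurs.
        self.doc_freq = Counter(term for tf in self.term_freqs.values() for term in tf)

    def _idf(self, term: str) -> float:
        df = self.doc_freq.get(term, 0)
        return math.log((self.n_docs - df + 0.5) / (df + 0.5) + 1.0)

    def search(self, query: str, top_k: int = 10) -> list[tuple[str, float]]:
        """Return up to top_k (doc_id, score) pairs with a positive score, best first."""
        query_terms = tokenize(query)
        scored = []
        for doc_id, tf in self.term_freqs.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len)
            score = sum(
                self._idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in query_terms if t in tf
            )
            if score > 0:
                scored.append((doc_id, score))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


corpus = {
    "a": "Approximate nearest neighbor search speeds up vector retrieval",
    "b": "HNSW builds a layered graph; tune ef_construction for index quality",
    "c": "Embeddings capture the meaning of a sentence",
}
index = SimpleBM25Index(corpus)
hits = index.search("How do I set ef_construction?", top_k=3)
```

Only document `b` contains the literal token `ef_construction`, so it is the only hit. This is the kind of exact-term query where an embedding model may rank a loosely related document higher.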

Step 2: Add Reranker (Precision Boost)

Only after my recall@50 was solid (above 90%) did I add the reranker to improve precision:

# CORRECT: Add reranker after recall is solid
def retrieve_documents(query: str, top_k: int = 20) -> list[Document]:
    # 1. Hybrid search with high top_k for recall
    hybrid_results = hybrid_search(query, top_k=100)

    # 2. Rerank top 50 for precision
    reranked = cohere.rerank(
        model="rerank-v3",
        query=query,
        documents=[r.text for r in hybrid_results[:50]],
        top_n=top_k
    )

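    # Indices refer to the list passed in (hybrid_results[:50]); ranked items live under .results
    return [hybrid_results[r.index] for r in reranked.results]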

Final results:

Hybrid + Reranker:
  Recall@50:  94%  (unchanged)
  Precision@10:  78%  ← +17 percentage points
  Latency: 285ms   ← +190ms for reranker

Now I can make an informed tradeoff:

If I need speed: Use hybrid search alone (95ms, 61% precision)
If I need quality: Use hybrid + reranker (285ms, 78% precision)

But the key insight is that the reranker only helps because the hybrid search already finds the relevant documents. If I’d stuck with pure vector search, the reranker would still be missing key information.
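
One way to keep both options available is to treat the reranker as an optional stage. The sketch below is an assumption about how that could be wired, not my production code: `make_pipeline`, `fake_retriever`, and `fake_reranker` are hypothetical names, and the stubs stand in for `hybrid_search` and the Cohere call so the example runs without any external service.

```python
from typing import Callable, Optional, Sequence

Retriever = Callable[[str, int], Sequence[str]]                 # (query, top_k) -> candidates
Reranker = Callable[[str, Sequence[str], int], Sequence[str]]   # (query, candidates, top_n) -> reordered


def make_pipeline(retriever: Retriever,
                  reranker: Optional[Reranker] = None,
                  candidate_k: int = 50) -> Callable[[str, int], list[str]]:
    """Build a retrieval function; the rerank stage only runs if a reranker is given."""
    def run(query: str, top_k: int = 20) -> list[str]:
        if reranker is None:
            return list(retriever(query, top_k))        # fast path: first stage only
        candidates = retriever(query, candidate_k)      # fetch a wider candidate set
        return list(reranker(query, candidates, top_k))
    return run


# Stubs for illustration only.
def fake_retriever(query: str, top_k: int) -> list[str]:
    corpus = ["intro to rag", "reranker latency notes", "hybrid search guide", "bm25 tuning"]
    return corpus[:top_k]


def fake_reranker(query: str, candidates: Sequence[str], top_n: int) -> list[str]:
    query_words = set(query.split())
    # Order by word overlap with the query; ties keep their first-stage order.
    return sorted(candidates, key=lambda c: len(query_words & set(c.split())), reverse=True)[:top_n]


fast = make_pipeline(fake_retriever)
accurate = make_pipeline(fake_retriever, fake_reranker, candidate_k=4)

print(fast("hybrid search", 2))      # ['intro to rag', 'reranker latency notes']
print(accurate("hybrid search", 2))  # ['hybrid search guide', 'intro to rag']
```

With this shape, a latency-sensitive endpoint can call `fast` while a quality-sensitive one calls `accurate`, and both share the same first stage.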

When to use each approach

Based on what I learned, here’s when to use each approach:

Use hybrid search first when:

You’re using pure vector search or pure keyword search
Recall@50 is below 80-90%
Users complain about missing information
You want quick wins with minimal latency impact
Your documents have both technical terms and semantic concepts

Add a reranker when:

Recall@50 is solid (>90%) but precision@10 needs improvement
You have latency budget (can afford 50-200ms extra)
Ranking quality matters more than speed (e.g., research assistants)
You’ve already optimized hybrid search weights
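
The two checklists can be condensed into a small decision helper. Treat it as a rough sketch: the 90% recall threshold comes from the criteria above, while `precision_target` and `reranker_cost_ms` are hypothetical defaults you should replace with your own numbers.

```python
def next_step(recall_at_50: float,
              precision_at_10: float,
              latency_budget_ms: float,
              recall_target: float = 0.90,        # from the checklist above
              precision_target: float = 0.75,     # hypothetical target
              reranker_cost_ms: float = 200.0) -> str:  # assumed worst-case reranker cost
    """Suggest the next optimization, fixing recall before precision."""
    if recall_at_50 < recall_target:
        return "hybrid_search"
    if precision_at_10 < precision_target and latency_budget_ms >= reranker_cost_ms:
        return "reranker"
    return "keep_current"


print(next_step(0.72, 0.48, latency_budget_ms=300))  # hybrid_search
print(next_step(0.94, 0.61, latency_budget_ms=300))  # reranker
print(next_step(0.94, 0.61, latency_budget_ms=100))  # keep_current
```

The order of the checks is the point: precision is only considered once recall has cleared its target.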

Latency comparison

Approach	Added Latency	When to Use
Pure Vector Search	0ms	Baseline, quick prototype
Hybrid Search	+10-20ms	First optimization step
Hybrid + Reranker	+60-220ms	After recall is solid
Pure Vector + Reranker	+50-200ms	Never (worse latency, same recall issue)

Common mistakes

I made several mistakes that you can avoid:

Reranking before fixing retrieval: I added a reranker to a pure vector search system, which added latency without fixing the underlying recall problem.
Ignoring recall metrics: I only tracked final answer quality, not recall@50. I should have measured whether relevant documents were in my top 50 results.
Not measuring impact: I didn’t baseline my system before adding the reranker. I couldn’t tell if the +200ms was worth it.
Skipping hybrid search: I went straight from pure vector search to a reranker, missing the middle step that gives the biggest recall boost.
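
For the "not measuring impact" mistake, a small timing harness is enough to get a baseline before and after each change. This is a minimal sketch: `measure_latency_ms` and `fake_retrieve` are hypothetical names, the nearest-rank p95 is one of several possible percentile definitions, and the sleep-based stub only stands in for a real retrieval call.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def measure_latency_ms(retrieve: Callable[[str], object],
                       queries: Iterable[str],
                       warmup: int = 1) -> dict[str, float]:
    """Time retrieve(query) for each query and report p50/p95 in milliseconds."""
    query_list = list(queries)
    if not query_list:
        raise ValueError("need at least one query")
    for query in query_list[:warmup]:
        retrieve(query)  # warm caches and connections; not timed
    samples = []
    for query in query_list:
        start = time.perf_counter()
        retrieve(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = math.ceil(0.95 * len(samples)) - 1  # nearest-rank percentile
    return {"p50": statistics.median(samples), "p95": samples[p95_index]}


# Stub that stands in for a real retrieval function.
def fake_retrieve(query: str) -> list[str]:
    time.sleep(0.001)
    return [query]


stats = measure_latency_ms(fake_retrieve, ["q1", "q2", "q3"])
print(f"p50={stats['p50']:.1f}ms p95={stats['p95']:.1f}ms")
```

Running the same harness against each variant (pure vector, hybrid, hybrid plus reranker) with the same query set gives directly comparable numbers.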

Evaluation framework

To measure recall vs precision, I set up this evaluation:

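import numpy as np

def evaluate_recall_vs_precision(test_queries, ground_truth):
    """Compare recall vs precision for different approaches.

    test_queries: list of query strings.
    ground_truth: list of relevant-document lists, aligned with test_queries.
    """

    before_recall = []
    after_recall = []
    reranker_precision = []

    for query, relevant_docs in zip(test_queries, ground_truth):
        # Pure vector search (embed the query first, as in the retrieval code)
        query_embedding = openai.embeddings.create(
            model="text-embedding-3-small",
            input=query
        ).data[0].embedding
        vector_results = vector_db.search(query_vector=query_embedding, top_k=50)
        before_recall.append(recall_at_k(vector_results, relevant_docs, k=50))

        # Hybrid search
        hybrid_results = hybrid_search(query, top_k=50)
        after_recall.append(recall_at_k(hybrid_results, relevant_docs, k=50))

        # Hybrid + reranker
        reranked = cohere.rerank(
            model="rerank-v3",
            query=query,
            documents=[r.text for r in hybrid_results[:50]],
            top_n=20
        )
        # Map reranker output back to documents so precision_at_k can read doc.id
        reranked_docs = [hybrid_results[r.index] for r in reranked.results]
        reranker_precision.append(precision_at_k(reranked_docs, relevant_docs, k=10))

    print(f"Pure Vector Recall@50: {np.mean(before_recall):.1%}")
    print(f"Hybrid Search Recall@50: {np.mean(after_recall):.1%}")
    print(f"Hybrid + Reranker Precision@10: {np.mean(reranker_precision):.1%}")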

def recall_at_k(results, relevant_docs, k):
    """Percentage of relevant documents found in top k."""
    relevant_ids = {doc.id for doc in relevant_docs}
    retrieved_ids = {doc.id for doc in results[:k]}
    return len(relevant_ids & retrieved_ids) / len(relevant_ids)

def precision_at_k(results, relevant_docs, k):
    """Percentage of top k results that are relevant."""
    relevant_ids = {doc.id for doc in relevant_docs}
    retrieved_ids = {doc.id for doc in results[:k]}
    return len(relevant_ids & retrieved_ids) / k

This evaluation framework showed me exactly where my system was weak and whether each optimization was worth the latency cost.
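
If you want to try the comparison without API keys, the same idea can be run offline against canned results. The sketch below is an assumption-laden toy: `compare`, the ID-based metric helpers, and the hard-coded rankings are hypothetical, and `recall_k=3` and `precision_k=2` are shrunk to fit the tiny lists.

```python
from statistics import mean
from typing import Callable


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k."""
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return len(set(retrieved_ids[:k]) & relevant_ids) / k


def compare(retrievers: dict[str, Callable[[str], list[str]]],
            test_set: dict[str, set[str]],
            recall_k: int = 50,
            precision_k: int = 10) -> dict[str, dict[str, float]]:
    """Average recall@k and precision@k per retriever over a labeled test set."""
    report = {}
    for name, retrieve in retrievers.items():
        recalls, precisions = [], []
        for query, relevant_ids in test_set.items():
            ids = retrieve(query)
            recalls.append(recall_at_k(ids, relevant_ids, recall_k))
            precisions.append(precision_at_k(ids, relevant_ids, precision_k))
        report[name] = {"recall": mean(recalls), "precision": mean(precisions)}
    return report


# Canned toy data: query -> relevant IDs, and query -> ranked IDs per approach.
test_set = {"q1": {"a", "b"}, "q2": {"c"}}
vector_only = {"q1": ["a", "x", "y"], "q2": ["x", "y", "z"]}
hybrid = {"q1": ["a", "b", "x"], "q2": ["c", "x", "y"]}

report = compare(
    {"vector": lambda q: vector_only[q], "hybrid": lambda q: hybrid[q]},
    test_set, recall_k=3, precision_k=2,
)
print(report)
# {'vector': {'recall': 0.25, 'precision': 0.25}, 'hybrid': {'recall': 1.0, 'precision': 0.75}}
```

Passing retrievers in as plain callables keeps the metrics code identical across approaches, so any difference in the report comes from retrieval and not from the evaluation.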

Summary

In this post, I showed why you should always use hybrid search before adding a reranker to your RAG system. The key point is that recall must come before precision—a reranker can only reorder what you’ve already retrieved.

If relevant documents aren’t in your top 50-100 results, a reranker won’t help. Hybrid search (BM25 + vector) improves recall with minimal latency overhead. Only after recall@50 is solid (>90%) should you layer in a reranker for better precision.

The optimization order matters:

Start with hybrid search for recall boost (+10-20ms latency)
Add reranker for precision boost (+50-200ms latency)

Don’t rerank bad retrieval—fix the retrieval first.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Reddit: Is Adding a Reranker to My RAG Stack Actually Worth the Extra Latency?
👨‍💻 Reciprocal Rank Fusion (RRF)
👨‍💻 Cohere Rerank API
👨‍💻 Information Retrieval: Recall vs Precision
