
Does Adding a Reranker to RAG Increase Latency? The Counterintuitive Truth

The Misconception

When I was building my RAG pipeline, I assumed adding a reranker would slow everything down. It seems obvious—adding another model inference step adds latency, right?

I measured the reranker alone: 100-200ms per query. So I thought:

Without reranker: Vector search (50ms) + LLM (4000-8000ms) = 4050-8050ms
With reranker: Vector search (50ms) + Reranker (150ms) + LLM (4000-8000ms) = 4200-8200ms
❌ WRONG: Reranker adds 150ms latency

But when I actually implemented the full pipeline and measured end-to-end latency, I got this:

Without reranker: 6500ms total
With reranker: 980ms total
✅ 6.6x faster with reranker

The reranker didn’t add latency—it dramatically reduced it. Here’s why.

What Happened?

The key mistake I made was measuring the reranker in isolation. I only looked at the additional 150ms inference time, but I completely missed what happens downstream: the LLM generation step.

In a RAG pipeline, the LLM receives retrieved context chunks. More chunks means:

  • More tokens to process (self-attention cost grows as O(n²) in the number of tokens n)
  • More irrelevant information to filter through
  • Slower token generation rates

When I send 50 chunks to the LLM without reranking, it takes 4000-8000ms to generate a response. But with a reranker filtering those 50 chunks down to the 5 most relevant, the LLM only needs 600-1200ms.

Here’s what my actual pipeline looks like:

rag_pipeline.py
import time

# WITHOUT RERANKER
def rag_pipeline_no_reranker(query):
    start = time.time()

    # Vector search returns top 50 chunks
    chunks = vector_search(query, top_k=50)  # ~50ms
    search_time = time.time() - start

    # LLM generates with 50 chunks (lots of irrelevant context)
    gen_start = time.time()
    response = llm.generate(query, context=chunks)  # ~4000-8000ms
    llm_time = time.time() - gen_start

    total = time.time() - start
    return {
        "total_ms": total * 1000,
        "search_ms": search_time * 1000,
        "llm_ms": llm_time * 1000,
        "chunks_used": len(chunks)
    }

# WITH RERANKER
def rag_pipeline_with_reranker(query):
    start = time.time()

    # Vector search returns top 50 chunks
    chunks = vector_search(query, top_k=50)  # ~50ms
    search_time = time.time() - start

    # Reranker filters to top 5 most relevant
    rerank_start = time.time()
    relevant_chunks = reranker.rerank(query, chunks, top_k=5)  # ~100-200ms
    rerank_time = time.time() - rerank_start

    # LLM generates with only 5 highly relevant chunks
    gen_start = time.time()
    response = llm.generate(query, context=relevant_chunks)  # ~600-1200ms
    llm_time = time.time() - gen_start

    total = time.time() - start
    return {
        "total_ms": total * 1000,
        "search_ms": search_time * 1000,
        "rerank_ms": rerank_time * 1000,
        "llm_ms": llm_time * 1000,
        "chunks_used": len(relevant_chunks)
    }

When I run both pipelines with the same query:

no_reranker = rag_pipeline_no_reranker("How does reranking affect latency?")
# => {'total_ms': 6500, 'search_ms': 52, 'llm_ms': 6448, 'chunks_used': 50}
with_reranker = rag_pipeline_with_reranker("How does reranking affect latency?")
# => {'total_ms': 980, 'search_ms': 48, 'rerank_ms': 142, 'llm_ms': 790, 'chunks_used': 5}
print(f"Speedup: {no_reranker['total_ms'] / with_reranker['total_ms']:.1f}x faster")
# => Speedup: 6.6x faster

The Numbers

Here’s the complete latency breakdown:

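Stage                      | Without Reranker | With Reranker | Difference
---------------------------|------------------|---------------|------------------
Vector Search              | 50ms             | 50ms          | 0ms
Reranking                  | 0ms              | 100-200ms     | +100-200ms
LLM Generation (50 chunks) | 4000-8000ms      | -             | -
LLM Generation (5 chunks)  | -                | 600-1200ms    | -3400 to -6800ms
TOTAL                      | 4050-8050ms      | 750-1450ms    | -3300 to -6600ms
Speedup                    | -                | 3-5x faster   | -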

The reranker adds 100-200ms, but saves 3400-6800ms in LLM generation. Net result: 60-80% latency reduction.
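The net saving can be checked with a few lines of arithmetic. This is a minimal sketch that only adds up the stage ranges quoted above, pairing the best cases together and the worst cases together; the helper name `total_latency_ms` is my own illustration, not part of the pipeline.

```python
def total_latency_ms(search_ms, llm_ms, rerank_ms=0):
    """Sum the sequential stage latencies of one RAG request (milliseconds)."""
    return search_ms + rerank_ms + llm_ms


# Best-case and worst-case stage timings taken from the breakdown above.
cases = {
    "best": {
        "no_rerank": total_latency_ms(50, 4000),
        "rerank": total_latency_ms(50, 600, rerank_ms=100),
    },
    "worst": {
        "no_rerank": total_latency_ms(50, 8000),
        "rerank": total_latency_ms(50, 1200, rerank_ms=200),
    },
}

for name, c in cases.items():
    saved = c["no_rerank"] - c["rerank"]
    print(f"{name}: {c['no_rerank']}ms -> {c['rerank']}ms (saved {saved}ms)")
```

The reranker's own cost shows up as the 100-200ms gap between the LLM saving and the total saving.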

Why This Works

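LLM processing time doesn’t scale linearly with context. Self-attention over the prompt costs roughly O(n²) in the number of input tokens, so doubling the input can up to quadruple the attention work. Other parts of the model scale linearly, and generating the output tokens costs about the same either way, so the end-to-end effect is smaller than the pure quadratic figure suggests.

That’s why reducing context from 50 chunks to 5 chunks (a 90% reduction) pays off so heavily. With equally sized chunks, the attention matrix over the context has 10x fewer tokens on each side, which makes it roughly 100x smaller. In my measurements, LLM time dropped from 4000-8000ms to 600-1200ms, a reduction of about 85%.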
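To get a feel for the scaling, here is a toy model, not a measurement. It assumes that some share of the context-dependent compute scales quadratically with token count (attention) and the rest scales linearly. The function name and the `quadratic_share` parameter are assumptions I made up for illustration, and the model ignores output generation and fixed overhead.

```python
def remaining_fraction(shrink, quadratic_share):
    """Fraction of context-dependent compute left after shrinking the input.

    shrink: factor by which the token count is reduced (10 means 10x fewer tokens).
    quadratic_share: assumed share of the original compute that scales as n^2;
        the remainder is assumed to scale linearly with n.
    """
    if shrink <= 0:
        raise ValueError("shrink must be positive")
    if not 0.0 <= quadratic_share <= 1.0:
        raise ValueError("quadratic_share must be between 0 and 1")
    quadratic_part = quadratic_share / shrink**2
    linear_part = (1.0 - quadratic_share) / shrink
    return quadratic_part + linear_part


# 50 chunks -> 5 chunks is a 10x reduction in context tokens (equal-sized chunks assumed).
for share in (0.0, 0.5, 1.0):
    left = remaining_fraction(10, share)
    print(f"quadratic share {share:.0%}: {left:.1%} of the compute remains")
```

With a 10x reduction, the remaining context-dependent compute lands between 1% (all quadratic) and 10% (all linear). Real end-to-end savings are smaller because output generation and fixed overhead don't shrink with the context.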

Visually, here’s what’s happening:

┌─────────────────────────────────────────────────────────────────┐
│ WITHOUT RERANKER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Query ──► Vector Search ──► Top 50 Chunks ──► LLM Generation │
│ (50ms) (4000-8000ms) │
│ │
│ Total: 4050-8050ms │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ WITH RERANKER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Query ──► Vector Search ──► Reranker ──► Top 5 Chunks ──► LLM │
│ (50ms) (100-200ms) (600-1200ms) │
│ │
│ Total: 750-1450ms │
│ Speedup: 3-5x faster │
└─────────────────────────────────────────────────────────────────┘

Common Mistakes

Mistake 1: Measuring only reranker latency

I benchmarked the reranker in isolation and saw 150ms. I thought “this adds 150ms to every query.” But I didn’t measure the downstream effect on LLM generation.

Fix: Measure end-to-end pipeline latency with real queries, not individual components.
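A small harness makes end-to-end measurement less tedious. The sketch below is one possible shape, assuming the pipeline is a plain function that takes a query. `benchmark` and `fake_pipeline` are hypothetical names, and the p95 is a simple nearest-rank approximation.

```python
import statistics
import time


def benchmark(pipeline, queries, repeats=3):
    """Run pipeline(query) end to end and summarize wall-clock latency in ms."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            pipeline(query)
            samples.append((time.perf_counter() - start) * 1000)
    if not samples:
        raise ValueError("need at least one query and one repeat")
    samples.sort()
    # Nearest-rank style index; good enough for a quick comparison.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "runs": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Stand-in pipeline so the sketch runs on its own; swap in the real function.
def fake_pipeline(query):
    time.sleep(0.01)
    return f"answer to {query}"


stats = benchmark(fake_pipeline, ["q1", "q2"], repeats=2)
print(stats)
```

Running the same harness against both pipeline variants with the same query set gives numbers that are directly comparable.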

Mistake 2: Skipping reranker to “optimize” performance

I considered removing the reranker thinking it would speed things up. That would have actually made my system 3-5x slower.

Fix: A/B test with and without reranker using real workloads.

Mistake 3: Not tuning top-k before reranking

I initially sent 100+ chunks to the reranker, which added unnecessary overhead. The optimal balance is typically 20-50 chunks before reranking.

Fix: Find the right top-k for your vector search (I use 30-40 chunks).
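One way to look for that balance is to sweep a few candidate counts and time each. This is a rough sketch: `sweep_top_k` and `fake_pipeline` are hypothetical names, and it assumes the pipeline can be called as `run_pipeline(query, top_k)`. It only measures latency, so answer quality at each setting still has to be checked separately.

```python
import time


def sweep_top_k(run_pipeline, query, candidate_counts):
    """Time run_pipeline(query, top_k) once per candidate count.

    Returns a list of (top_k, elapsed_ms) pairs in the order given.
    """
    results = []
    for top_k in candidate_counts:
        start = time.perf_counter()
        run_pipeline(query, top_k)
        results.append((top_k, (time.perf_counter() - start) * 1000))
    return results


# Stand-in pipeline whose cost grows with top_k, so the sketch runs on its own.
def fake_pipeline(query, top_k):
    time.sleep(top_k * 0.0002)


timings = sweep_top_k(fake_pipeline, "example query", [20, 50, 100])
for top_k, ms in timings:
    print(f"top_k={top_k}: {ms:.1f}ms")
```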

Mistake 4: Using slow reranker models

Some rerankers are unnecessarily heavy. I use optimized models like Cohere Rerank 3 or Jina Reranker v2 that balance speed and quality.

The Trade-offs

Rerankers aren’t free—they do add complexity:

  • Additional model to manage: You need to host or pay for a reranker service
  • Tuning required: Top-k values need optimization for your use case
  • Another failure point: More components means more potential issues

But the benefits are substantial:

  • 3-5x faster responses: Users notice the difference
  • Up to ~90% fewer context tokens: Processing 5 chunks instead of 50 slashes input token costs
  • Better answers: Rerankers improve relevance and reduce hallucinations
  • Higher throughput: Faster per-query latency means more queries per second
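The token saving is easy to estimate. The numbers below are assumptions for illustration only (300 tokens per chunk and 200 tokens of fixed prompt and query); plug in your own chunk size and prompt length.

```python
def input_tokens(num_chunks, tokens_per_chunk, fixed_prompt_tokens):
    """Estimate input tokens for one request: context chunks plus fixed prompt."""
    return num_chunks * tokens_per_chunk + fixed_prompt_tokens


# Assumed sizes, not measurements: 300 tokens per chunk, 200 fixed prompt tokens.
before = input_tokens(50, 300, 200)
after = input_tokens(5, 300, 200)
reduction = 1 - after / before

print(f"{before} -> {after} input tokens ({reduction:.0%} fewer)")
```

Because the fixed prompt and the output tokens don't shrink, the reduction in total cost is somewhat below the 90% reduction in chunk count.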

Implementation Tips

When I implemented this in production, I learned a few things:

1. Use async reranking

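Within a single request the stages are sequential: the reranker needs the vector search results, and the LLM needs the reranked chunks. Making each call async doesn’t shorten one request, but it keeps the event loop free so other requests can make progress while this one waits on I/O:

async def async_rag_pipeline(query):
    # Vector search (awaited, so other requests can run in the meantime)
    chunks = await async_vector_search(query, top_k=50)

    # Rerank: depends on the search results, so it must run after them
    relevant_chunks = await reranker.rerank_async(query, chunks, top_k=5)

    # Generate with only the reranked chunks
    response = await llm.generate_async(query, context=relevant_chunks)
    return response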
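The benefit of async shows up when several requests are in flight at once. The sketch below uses made-up stand-ins (`fake_search`, `fake_rerank`, `fake_generate`) that sleep instead of calling real services, and `asyncio.gather` to run ten requests concurrently. The sleep durations are arbitrary.

```python
import asyncio
import time


async def fake_search(query, top_k=50):
    await asyncio.sleep(0.01)  # stand-in for vector search I/O
    return [f"chunk-{i}" for i in range(top_k)]


async def fake_rerank(query, chunks, top_k=5):
    await asyncio.sleep(0.02)  # stand-in for the reranker call
    return chunks[:top_k]


async def fake_generate(query, context):
    await asyncio.sleep(0.05)  # stand-in for LLM generation
    return f"{query}: answered from {len(context)} chunks"


async def handle(query):
    # Stages stay sequential inside one request.
    chunks = await fake_search(query)
    relevant = await fake_rerank(query, chunks)
    return await fake_generate(query, relevant)


async def main():
    queries = [f"question {i}" for i in range(10)]
    start = time.perf_counter()
    # Requests overlap with each other while each one waits on I/O.
    answers = await asyncio.gather(*(handle(q) for q in queries))
    elapsed_ms = (time.perf_counter() - start) * 1000
    return answers, elapsed_ms


answers, elapsed_ms = asyncio.run(main())
print(f"{len(answers)} requests in {elapsed_ms:.0f}ms")
```

Run one after another, the ten simulated requests would take about 800ms. Run concurrently, they finish in roughly the time of a single request.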

2. Cache reranker results

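For repeated queries, cache the reranked results. lru_cache only matches identical arguments, and they must be hashable, so pass the chunks as a tuple:

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_rerank(query, chunks):
    # chunks must be a tuple of hashable items (for example strings)
    # so that it can be part of the cache key
    return tuple(reranker.rerank(query, list(chunks), top_k=5))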

3. Monitor latency at each stage

Track vector search, reranker, and LLM generation separately:

import time

class LatencyTracker:
    def __init__(self):
        self.metrics = []

    def track(self, stage, duration_ms):
        self.metrics.append({"stage": stage, "duration_ms": duration_ms})

    def report(self):
        for m in self.metrics:
            print(f"{m['stage']}: {m['duration_ms']:.0f}ms")
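To avoid scattering start and stop timestamps through the pipeline, the same idea can be wrapped in a context manager. This is a sketch of one possible design; the `StageTimer` class and its `stage` method are hypothetical names, and the sleeps stand in for real work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage wall-clock durations in milliseconds."""

    def __init__(self):
        self.metrics = []

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raises.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.metrics.append({"stage": name, "duration_ms": elapsed_ms})

    def report(self):
        for m in self.metrics:
            print(f"{m['stage']}: {m['duration_ms']:.0f}ms")


timer = StageTimer()
with timer.stage("vector_search"):
    time.sleep(0.005)  # stand-in for the real search call
with timer.stage("rerank"):
    time.sleep(0.002)  # stand-in for the real reranker call
timer.report()
```

Recording in a `finally` block means a failing stage still shows up in the report, which helps when a slow stage is also the one that times out.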

Summary

In this post, I showed why adding a reranker to RAG actually decreases latency by 60-80%. The key point is that rerankers reduce LLM generation time by filtering irrelevant context—saving 3400-6800ms in LLM processing far outweighs the 100-200ms reranking overhead.

The counterintuitive result is that adding a model inference step speeds up the entire pipeline. LLM generation dominates RAG latency, and reducing context from 50 chunks to 5 chunks cuts generation time by 75-90%.

If you’re optimizing RAG performance, start with a reranker. It’s the rare optimization that simultaneously improves speed, cost, and quality.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
