How Rerankers Reduce LLM Hallucinations in RAG Systems
Purpose
This post explains how rerankers reduce LLM hallucinations in RAG systems by filtering out irrelevant context before it reaches the language model.
The Problem
When I built my first RAG system, I got answers that sounded plausible but were completely wrong. I noticed this happening most with medical and technical questions.
For example, when I asked “What are the side effects of aspirin?”, the system would sometimes list side effects from completely different medications, or mix up contraindications with actual side effects. The answers sounded confident but were factually incorrect.
I tracked down the issue to the retrieval step. The vector database was returning chunks that shared vocabulary with my query but didn’t actually answer it.
Here’s what was happening in my system:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Query about aspirin side effects
query = "What are the side effects of aspirin?"

# Documents in database
documents = [
    "Aspirin is a common medication used for pain relief.",
    "Diabetes patients should monitor their blood sugar levels regularly.",
    "Aspirin can cause stomach ulcers and bleeding in some patients.",  # Actual answer
    "Diabetes treatment includes insulin therapy and lifestyle changes.",
]

# Encode and search
query_embedding = embedder.encode(query)
doc_embeddings = embedder.encode(documents)

# Calculate similarity scores
similarities = np.dot(query_embedding, doc_embeddings.T)

# Top 2 results by similarity
top_indices = np.argsort(similarities)[::-1][:2]
for i in top_indices:
    print(f"Similarity {similarities[i]:.3f}: {documents[i]}")

# Output might retrieve:
# Similarity 0.754: "Aspirin is a common medication..." (doesn't answer)
# Similarity 0.681: "Aspirin can cause stomach ulcers..." (actual answer)
# But often retrieves irrelevant docs with shared vocabulary
```
The problem: Vector search ranks by embedding similarity, not relevance. These are not the same thing.
A chunk that shares vocabulary with your query will score high even if it doesn’t actually answer it. Your LLM then receives this misleading context and tries to synthesize an answer, leading to hallucinations.
Why Vector Search Fails
Embedding similarity measures shared vocabulary and semantic concepts, not whether a document actually answers the question.
Consider this scenario:
- Query: “What are the side effects of aspirin?”
- Vector search retrieves: A medical textbook chapter mentioning “aspirin” and “side effects” separately, but the chapter is about diabetes treatment
- LLM receives irrelevant context, tries to synthesize an answer, and hallucinates aspirin side effects based on partial information
This happens because:
- Embeddings capture word co-occurrence and topic proximity
- Chunks with similar vocabulary score high even if they don’t answer the specific question
- LLMs are trained to be helpful and will generate answers even from poor context
- Result: Fabricated information that sounds plausible but is factually incorrect
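The failure mode in the list above can be reproduced without any model. The sketch below uses bag-of-words counts as a deliberately crude stand-in for embeddings (real embedding models capture far more meaning, so this exaggerates the effect), and the two documents are made up for illustration. It shows a chunk that shares vocabulary with the query outscoring the chunk that actually answers it.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Crude stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "What are the side effects of aspirin?"
docs = {
    # Shares most of the query's words but is about diabetes drugs.
    "shares_vocabulary": (
        "The side effects of diabetes drugs are listed here "
        "and aspirin is discussed elsewhere."
    ),
    # Actually answers the question, with almost no word overlap.
    "answers_question": (
        "Aspirin can cause stomach ulcers and bleeding in some patients."
    ),
}

query_vec = bow_vector(query)
scores = {name: cosine(query_vec, bow_vector(text)) for name, text in docs.items()}

# Print the documents from most to least similar.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {name}")
```

With these toy inputs the vocabulary-sharing document ranks first, which is the ordering a similarity-only retriever would hand to the LLM.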
The Solution: Add a Reranker
I tried adding a cross-encoder reranker between the vector search and the LLM. The reranker does a deep query-document comparison. It reads both the query and the chunk together and scores true relevance.
Here’s how I implemented it:
```python
from sentence_transformers import CrossEncoder

# Load cross-encoder reranker (the Hugging Face model ID includes the "cross-encoder/" prefix)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Same query as in the vector search example
query = "What are the side effects of aspirin?"

# Step 1: Vector search retrieves top candidates (e.g., top 50)
candidates = [
    "Aspirin is a common medication used for pain relief.",
    "Aspirin can cause stomach ulcers and bleeding in some patients.",
    "Diabetes patients should monitor their blood sugar levels regularly.",
    "Aspirin may interact with blood thinners and increase bleeding risk.",
]

# Step 2: Rerank by true relevance
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by relevance score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

# Top 2 most relevant documents
for doc, score in reranked[:2]:
    print(f"Relevance {score:.3f}: {doc}")

# Output:
# Relevance 8.234: "Aspirin can cause stomach ulcers..." (highest relevance)
# Relevance 7.891: "Aspirin may interact with blood thinners..." (also relevant)
# Irrelevant docs are filtered out
```
How Rerankers Prevent Hallucinations
Unlike vector search’s bi-encoder approach (which encodes query and document independently), rerankers use cross-encoders that:
- Process query and document together
- Perform deep attention-based comparison
- Score true relevance based on query-document interaction
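Scores like the 8.234 in the example output above are unbounded, which suggests the model returns raw logits rather than probabilities (whether a given cross-encoder does this depends on the model and its configuration). One way to turn such scores into an actual filter, rather than only a sort order, is to squash them with a sigmoid and drop anything under a cutoff. This is a minimal sketch: `filter_by_relevance` and the 0.5 cutoff are hypothetical, the logits are made up, and the squashed values are not calibrated probabilities.

```python
import math
from typing import List, Tuple


def sigmoid(logit: float) -> float:
    """Map an unbounded logit to the (0, 1) range without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def filter_by_relevance(
    scored_docs: List[Tuple[str, float]],
    min_probability: float = 0.5,  # hypothetical cutoff; tune it on your own data
) -> List[Tuple[str, float]]:
    """Keep documents whose squashed score clears the cutoff, best first."""
    squashed = [(doc, sigmoid(score)) for doc, score in scored_docs]
    kept = [(doc, prob) for doc, prob in squashed if prob >= min_probability]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration, not real model output.
example_scores = [
    ("Aspirin can cause stomach ulcers and bleeding in some patients.", 8.2),
    ("Aspirin is a common medication used for pain relief.", -1.3),
    ("Diabetes patients should monitor their blood sugar levels regularly.", -9.5),
]

kept_docs = filter_by_relevance(example_scores)
for doc, prob in kept_docs:
    print(f"{prob:.3f}  {doc}")
```

A cutoff like this means the LLM can receive fewer chunks than requested, or none at all, which is often preferable to padding the prompt with weak matches.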
The filtering mechanism works like this:
```text
QUERY: "What are the side effects of aspirin?"
         │
         ▼
┌──────────────────┐
│  VECTOR SEARCH   │
│   (Similarity)   │
└────────┬─────────┘
         │
         │ Top 50 candidates by similarity
         ▼
┌────────────────────────────────────────────────────────────┐
│ CANDIDATES (mixed quality)                                 │
│                                                            │
│ ✓ "Aspirin causes stomach ulcers..." (relevant)            │
│ ✓ "Aspirin interacts with blood thinners..." (relevant)    │
│ ✗ "Aspirin is used for pain relief..." (generic)           │
│ ✗ "Diabetes patients monitor blood sugar..." (off-topic)   │
│ ✗ "Aspirin mentioned in cardiovascular study..." (vague)   │
└────────────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────┐
│     RERANKER     │
│   (Relevance)    │
│  Cross-encoder   │
│   scores each    │
│  query-doc pair  │
└────────┬─────────┘
         │
         │ Top 5 most relevant
         ▼
┌────────────────────────────────────────────────────────────┐
│ FINAL CONTEXT (high quality)                               │
│                                                            │
│ ✓ "Aspirin causes stomach ulcers and bleeding..."          │
│ ✓ "Aspirin may interact with blood thinners..."            │
│ ✓ "Long-term aspirin use can cause stomach irritation..."  │
└────────────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────┐
│  LLM GENERATION  │
│                  │
│   High-quality   │
│    context →     │
│ Accurate answer  │
│ No hallucination │
└──────────────────┘

RESULT: LLM receives relevant context → No need to hallucinate
```
Production Implementation
I put together a production RAG pipeline with the reranker:
```python
from typing import List

from sentence_transformers import CrossEncoder


class RAGWithReranker:
    def __init__(self, vector_db, reranker_model, top_k_candidates=50, top_k_final=5):
        self.vector_db = vector_db
        self.reranker = reranker_model
        self.top_k_candidates = top_k_candidates
        self.top_k_final = top_k_final

    def retrieve(self, query: str) -> List[str]:
        """Retrieve and rerank documents for RAG"""

        # Stage 1: Fast vector search (similarity-based)
        candidates = self.vector_db.search(query, top_k=self.top_k_candidates)

        # Stage 2: Slow but accurate reranking (relevance-based)
        pairs = [(query, doc) for doc in candidates]
        relevance_scores = self.reranker.predict(pairs)

        # Stage 3: Select top-k most relevant
        scored_docs = list(zip(candidates, relevance_scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)

        final_context = [doc for doc, score in scored_docs[:self.top_k_final]]

        return final_context

    def generate_answer(self, query: str, llm) -> str:
        """Generate RAG answer with reranked context"""

        # Get high-quality context
        context = self.retrieve(query)

        # Generate answer with relevant context only
        prompt = f"Context: {' '.join(context)}\n\nQuestion: {query}\n\nAnswer:"
        answer = llm.generate(prompt)

        return answer


# Usage (my_vector_db and my_llm are placeholders for your own vector store and LLM client)
rag = RAGWithReranker(
    vector_db=my_vector_db,
    reranker_model=CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2'),
    top_k_candidates=50,
    top_k_final=5,
)

answer = rag.generate_answer("What are the side effects of aspirin?", my_llm)
```
The Results
After adding the reranker, I noticed a significant improvement in answer quality. The hallucinations dropped dramatically because the LLM was no longer receiving irrelevant context.
On ranking metrics like NDCG@10, reranking typically improves retrieval quality by 15-30%, and better retrieval feeds directly into better answers. The real-world impact is even more significant for high-stakes applications.
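For readers who have not met NDCG@10 before, here is a minimal sketch of how it can be computed for a single query. It assumes linear gain with a log2 position discount (some libraries use an exponential gain instead) and normalizes against the best ordering of the retrieved list only. The relevance labels are toy values chosen to show the mechanics, not measurements from my system.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: linear gain, log2 position discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy labels in ranked order: 1 = answers the query, 0 = does not.
before_rerank = [0, 1, 0, 1, 0]  # relevant chunks buried at ranks 2 and 4
after_rerank = [1, 1, 0, 0, 0]   # relevant chunks moved to the top

print(f"NDCG@10 before reranking: {ndcg_at_k(before_rerank):.3f}")
print(f"NDCG@10 after reranking:  {ndcg_at_k(after_rerank):.3f}")
```

The metric rewards placing relevant chunks early, which is exactly what matters when only the top few chunks make it into the prompt.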
Trade-offs to Consider
The reranker adds 50-200ms of latency depending on the model size and the number of candidates you score. This is the main cost, so it helps to know when it is worth paying.
When rerankers are essential:
- High-stakes domains (healthcare, finance, legal)
- Complex queries requiring precise answers
- Large document collections where similarity search is noisy
- Applications where user trust is critical
When you might skip rerankers:
- Simple factual queries with clear answers
- Latency-sensitive applications
- Small, well-curated document sets
- Prototype/MVP stage
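Before deciding which list your application belongs to, it may help to measure the rerank step on your own hardware instead of relying on a general range. The sketch below times any scoring callable with `time.perf_counter`. `measure_latency_ms` is a hypothetical helper, and `fake_score_fn` is a stand-in so the example runs without downloading a model.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def measure_latency_ms(
    score_fn: ScoreFn,
    pairs: Sequence[Tuple[str, str]],
    repeats: int = 5,
) -> float:
    """Return the median wall-clock time of score_fn(pairs) in milliseconds."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        score_fn(pairs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_score_fn(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (word overlap) so the sketch runs without a model."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


query = "What are the side effects of aspirin?"
candidates = ["Aspirin can cause stomach ulcers."] * 50
latency = measure_latency_ms(fake_score_fn, [(query, doc) for doc in candidates])
print(f"median rerank latency: {latency:.2f} ms for {len(candidates)} candidates")
```

In a real pipeline you would pass the reranker's `predict` method as `score_fn` and rerun the measurement with the candidate count you plan to use, since latency grows with the number of pairs scored.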
Common Mistakes
I’ve seen developers run into several misconceptions when implementing rerankers:
Myth: More context is always better
Reality: Sending 20 chunks with mediocre relevance causes more hallucinations than sending 5 highly relevant chunks. Quality > quantity.
Myth: Better embeddings solve the problem
Reality: Even with advanced embedding models (OpenAI, Cohere), similarity ≠ relevance. Rerankers provide orthogonal improvement.
Myth: Rerankers are too slow for production
Reality: Two-stage retrieval (vector search + reranker) is standard practice. The latency cost is acceptable for most applications given the quality gain.
Mistake: Skipping reranking for cost reasons
Reality: Hallucinations damage user trust and can be catastrophic in some domains. The cost of bad answers often exceeds the latency cost of reranking.
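The two-stage pattern discussed in this section is easy to experiment with if the search and scoring steps are passed in as plain functions. The sketch below is one possible shape, not the pipeline from my production code: `two_stage_retrieve`, `stub_search`, and `stub_score` are hypothetical names, and the stub scorer counts hand-picked cue words purely so the example runs without a vector database or model.

```python
import re
from typing import Callable, List, Sequence, Tuple

SearchFn = Callable[[str, int], List[str]]
ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def two_stage_retrieve(
    query: str,
    search_fn: SearchFn,
    score_fn: ScoreFn,
    top_k_candidates: int = 50,
    top_k_final: int = 5,
) -> List[str]:
    """Stage 1: cheap candidate search. Stage 2: rerank and keep the best."""
    candidates = search_fn(query, top_k_candidates)
    if not candidates:
        return []  # nothing retrieved, so there is nothing to score
    scores = score_fn([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k_final]]


CORPUS = [
    "Aspirin is a common medication used for pain relief.",
    "Diabetes patients should monitor their blood sugar levels regularly.",
    "Aspirin can cause stomach ulcers and bleeding in some patients.",
    "Aspirin may interact with blood thinners and increase bleeding risk.",
]


def stub_search(query: str, top_k: int) -> List[str]:
    """Stand-in for a vector database: returns the first top_k documents."""
    return CORPUS[:top_k]


def stub_score(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in for a cross-encoder: counts hand-picked cue words per document."""
    cue_words = {"cause", "ulcers", "bleeding", "interact"}
    return [
        float(len(cue_words & set(re.findall(r"[a-z]+", doc.lower()))))
        for _, doc in pairs
    ]


context = two_stage_retrieve(
    "What are the side effects of aspirin?",
    stub_search,
    stub_score,
    top_k_candidates=4,
    top_k_final=2,
)
for doc in context:
    print(doc)
```

Swapping the stubs for a real vector search and a real reranker leaves `two_stage_retrieve` unchanged, which makes it straightforward to compare answers with and without the second stage.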
The Reason
I think the key reason rerankers reduce hallucinations so effectively is that they prevent the LLM from ever seeing misleading context. When the LLM receives high-quality, relevant chunks, it doesn’t need to “fill gaps” with fabricated information.
The cross-encoder architecture is what makes this possible. Unlike bi-encoders that encode query and document separately, cross-encoders process them together through deep attention mechanisms. This allows the model to capture complex query-document interactions that simple cosine similarity misses.
Vector search is semantic matching, not relevance matching. High similarity ≠ high relevance. Rerankers bridge this gap by scoring true relevance instead of just similarity.
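The contrast between the two scoring schemes can be written down compactly. The notation here is my own shorthand rather than anything taken from a specific library: $f_\theta$ is the embedding model, $g_\phi$ is the cross-encoder, and $[q\,;d]$ is the query and document concatenated into one input.

```latex
% Bi-encoder: query and document are encoded independently, then compared.
s_{\text{bi}}(q, d)
  = \cos\bigl(f_\theta(q),\, f_\theta(d)\bigr)
  = \frac{f_\theta(q)^\top f_\theta(d)}
         {\lVert f_\theta(q) \rVert \, \lVert f_\theta(d) \rVert}

% Cross-encoder: a single model reads the pair jointly and emits one score.
s_{\text{cross}}(q, d) = g_\phi\bigl([q \,;\, d]\bigr)
```

Because $f_\theta(d)$ does not depend on the query, document vectors can be computed once and indexed, which is what makes vector search fast. $g_\phi$ has to run once per query-document pair at query time, which is why it is applied only to the shortlist from the first stage.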
Summary
In this post, I showed how rerankers reduce LLM hallucinations in RAG systems by filtering out irrelevant context before it reaches the language model. The key point is that vector search ranks by embedding similarity (shared vocabulary), while rerankers use cross-encoders to score true relevance through deep query-document comparison.
This two-stage approach typically improves retrieval quality by 15-30% on ranking metrics such as NDCG@10 and is essential for high-stakes RAG applications where accuracy matters more than latency.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Cross-Encoders vs Bi-Encoders
- MS MARCO Reranking Model
- Cohere Rerank API
- Sentence-Transformers Documentation
- Vector Database Limitations