How Rerankers Reduce LLM Hallucinations in RAG Systems

Feb 25, 2026

Purpose

This post explains how rerankers reduce LLM hallucinations in RAG systems by filtering out irrelevant context before it reaches the language model.

The Problem

When I built my first RAG system, I got answers that sounded plausible but were completely wrong. I noticed this happening most with medical and technical questions.

For example, when I asked “What are the side effects of aspirin?”, the system would sometimes list side effects from completely different medications, or mix up contraindications with actual side effects. The answers sounded confident but were factually incorrect.

I traced the issue to the retrieval step. The vector database was returning chunks that shared vocabulary with my query but didn’t actually answer it.

Here’s what was happening in my system:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Query about aspirin side effects
query = "What are the side effects of aspirin?"

# Documents in database
documents = [
    "Aspirin is a common medication used for pain relief.",
    "Diabetes patients should monitor their blood sugar levels regularly.",
    "Aspirin can cause stomach ulcers and bleeding in some patients.",  # Actual answer
    "Diabetes treatment includes insulin therapy and lifestyle changes.",
]

# Encode and search
query_embedding = embedder.encode(query)
doc_embeddings = embedder.encode(documents)

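# Calculate similarity scores. all-MiniLM-L6-v2 returns unit-length
# embeddings, so the dot product here equals cosine similarity.
similarities = np.dot(query_embedding, doc_embeddings.T)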

# Top 2 results by similarity
top_indices = np.argsort(similarities)[::-1][:2]
for i in top_indices:
    print(f"Similarity {similarities[i]:.3f}: {documents[i]}")

# Illustrative output (exact scores depend on the model version):
# Similarity 0.754: Aspirin is a common medication... (on topic, but doesn't answer)
# Similarity 0.681: Aspirin can cause stomach ulcers... (the actual answer)
# On a real corpus, chunks that merely share vocabulary often outrank the answer

The problem: Vector search ranks by embedding similarity, not relevance. These are not the same thing.

A chunk that shares vocabulary with your query will score high even if it doesn’t actually answer it. Your LLM then receives this misleading context and tries to synthesize an answer, leading to hallucinations.

Why Vector Search Fails

Embedding similarity measures shared vocabulary and semantic concepts, not whether a document actually answers the question.

Consider this scenario:

Query: “What are the side effects of aspirin?”
Vector search retrieves: A medical textbook chapter mentioning “aspirin” and “side effects” separately, but the chapter is about diabetes treatment
LLM receives irrelevant context, tries to synthesize an answer, and hallucinates aspirin side effects based on partial information

This happens because:

Embeddings capture word co-occurrence and topic proximity
Chunks with similar vocabulary score high even if they don’t answer the specific question
LLMs are trained to be helpful and will generate answers even from poor context
Result: Fabricated information that sounds plausible but is factually incorrect
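
To make the gap between similarity and relevance concrete, here is a toy sketch. It uses bag-of-words cosine similarity as a stand-in for an embedding model, which is an assumption made to keep the example dependency-free; real embeddings are much stronger, but they can fail in the same direction. Both chunks are invented for illustration.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "what are the side effects of aspirin"

# Hypothetical chunks: the first shares vocabulary, the second actually answers.
off_topic = "this chapter on diabetes mentions aspirin and lists side effects of insulin"
answer = "aspirin can cause stomach ulcers and bleeding"

sim_off_topic = bow_cosine(query, off_topic)
sim_answer = bow_cosine(query, answer)

print(f"off-topic chunk: {sim_off_topic:.3f}")
print(f"answer chunk:    {sim_answer:.3f}")
```

The off-topic chunk scores higher because it repeats “side effects of aspirin” word for word, while the chunk that answers the question shares only the word “aspirin” with the query.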

The Solution: Add a Reranker

I added a cross-encoder reranker between the vector search and the LLM. Instead of comparing two precomputed vectors, the reranker reads the query and the chunk together and scores how well the chunk actually answers the query.

Here’s how I implemented it:

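from sentence_transformers import CrossEncoder

# Load cross-encoder reranker (full Hugging Face model ID, including the org prefix)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Same query as in the vector search example
query = "What are the side effects of aspirin?"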

# Step 1: Vector search retrieves top candidates (e.g., top 50)
candidates = [
    "Aspirin is a common medication used for pain relief.",
    "Aspirin can cause stomach ulcers and bleeding in some patients.",
    "Diabetes patients should monitor their blood sugar levels regularly.",
    "Aspirin may interact with blood thinners and increase bleeding risk.",
]

# Step 2: Rerank by true relevance
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by relevance score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

# Top 2 most relevant documents
for doc, score in reranked[:2]:
    print(f"Relevance {score:.3f}: {doc}")

# Illustrative output (scores are unbounded logits; exact values will differ):
# Relevance 8.234: Aspirin can cause stomach ulcers... (highest relevance)
# Relevance 7.891: Aspirin may interact with blood thinners... (also relevant)
# Lower-scoring docs fall outside the top 2 and never reach the LLM

How Rerankers Prevent Hallucinations

Unlike vector search’s bi-encoder approach (which encodes query and document independently), rerankers use cross-encoders that:

Process query and document together
Perform deep attention-based comparison
Score true relevance based on query-document interaction
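
The practical consequence of these three points is a different call pattern at query time. The sketch below uses toy stand-ins (toy_encode and toy_pair_score are hypothetical placeholders, not real models) and only counts simulated forward passes: a bi-encoder embeds documents once, ahead of time, while a cross-encoder needs one pass per query-document pair.

```python
from typing import List

calls = {"encode": 0, "pair_score": 0}  # simulated model forward passes


def toy_encode(text: str) -> List[float]:
    """Hypothetical bi-encoder: maps ONE text to a vector, never sees the other text."""
    calls["encode"] += 1
    words = text.lower().split()
    return [float(len(words)), float(sum(len(w) for w in words))]


def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical cross-encoder: sees the query AND the document in a single pass."""
    calls["pair_score"] += 1
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = [
    "aspirin can cause stomach ulcers",
    "aspirin is used for pain relief",
    "diabetes patients monitor blood sugar",
    "aspirin may interact with blood thinners",
]
queries = ["aspirin side effects", "what is aspirin used for"]

# Bi-encoder: document vectors are computed once, offline.
doc_vectors = [toy_encode(d) for d in docs]
calls["encode"] = 0  # reset so that only query-time work is counted below

for q in queries:
    # Bi-encoder at query time: one encode call, then cheap vector math.
    q_vec = toy_encode(q)
    bi_scores = [sum(a * b for a, b in zip(q_vec, d_vec)) for d_vec in doc_vectors]
    # Cross-encoder at query time: one model call per (query, doc) pair.
    cross_scores = [toy_pair_score(q, d) for d in docs]

print(calls)  # {'encode': 2, 'pair_score': 8}
```

This is why the cross-encoder is applied only to a short candidate list instead of the whole collection.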

The filtering mechanism works like this:

QUERY: "What are the side effects of aspirin?"
     │
     ▼
┌──────────────────┐
│ VECTOR SEARCH    │
│ (Similarity)     │
└────────┬─────────┘
         │
         │ Top 50 candidates by similarity
         ▼
┌──────────────────────────────────────────────────────────┐
│ CANDIDATES (mixed quality)                                │
│                                                           │
│ ✓ "Aspirin causes stomach ulcers..." (relevant)          │
│ ✓ "Aspirin interacts with blood thinners..." (relevant)  │
│ ✗ "Aspirin is used for pain relief..." (generic)        │
│ ✗ "Diabetes patients monitor blood sugar..." (off-topic)│
│ ✗ "Aspirin mentioned in cardiovascular study..." (vague)│
└──────────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────┐
│ RERANKER         │
│ (Relevance)      │
│ Cross-encoder    │
│ scores each      │
│ query-doc pair   │
└────────┬─────────┘
         │
         │ Top 5 most relevant
         ▼
┌──────────────────────────────────────────────────────────┐
│ FINAL CONTEXT (high quality)                              │
│                                                           │
│ ✓ "Aspirin causes stomach ulcers and bleeding..."         │
│ ✓ "Aspirin may interact with blood thinners..."           │
│ ✓ "Long-term aspirin use can cause stomach irritation..."  │
└──────────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────┐
│ LLM GENERATION   │
│                  │
│ High-quality     │
│ context →        │
│ Accurate answer  │
│ No hallucination │
└──────────────────┘

RESULT: LLM receives relevant context → far less reason to hallucinate

Production Implementation

I put together a production RAG pipeline with the reranker:

from typing import List

from sentence_transformers import CrossEncoder

class RAGWithReranker:
    def __init__(self, vector_db, reranker_model, top_k_candidates=50, top_k_final=5):
        self.vector_db = vector_db
        self.reranker = reranker_model
        self.top_k_candidates = top_k_candidates
        self.top_k_final = top_k_final

    def retrieve(self, query: str) -> List[str]:
        """Retrieve and rerank documents for RAG"""

        # Stage 1: Fast vector search (similarity-based).
        # vector_db is assumed to expose search(query, top_k) -> list of strings.
        candidates = self.vector_db.search(query, top_k=self.top_k_candidates)
        if not candidates:
            return []

        # Stage 2: Slow but accurate reranking (relevance-based)
        pairs = [(query, doc) for doc in candidates]
        relevance_scores = self.reranker.predict(pairs)

        # Stage 3: Select top-k most relevant
        scored_docs = list(zip(candidates, relevance_scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)

        final_context = [doc for doc, score in scored_docs[:self.top_k_final]]

        return final_context

    def generate_answer(self, query: str, llm) -> str:
        """Generate RAG answer with reranked context"""

        # Get high-quality context
        context = self.retrieve(query)

        # Generate answer with relevant context only
        prompt = f"Context: {' '.join(context)}\n\nQuestion: {query}\n\nAnswer:"
        answer = llm.generate(prompt)

        return answer

# Usage (my_vector_db and my_llm are placeholders for your own vector store and LLM client)
rag = RAGWithReranker(
    vector_db=my_vector_db,
    reranker_model=CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2'),
    top_k_candidates=50,
    top_k_final=5
)

answer = rag.generate_answer("What are the side effects of aspirin?", my_llm)
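
One caveat about top-k selection: it always returns k chunks, even when none of the candidates is relevant. Here is a minimal sketch of one way to handle that, assuming a score cutoff. The min_score parameter is hypothetical and has to be tuned for the reranker in use, since score scales differ between models.

```python
from typing import List, Sequence


def select_context(
    docs: Sequence[str],
    scores: Sequence[float],
    top_k: int = 5,
    min_score: float = 0.0,  # hypothetical cutoff; tune per reranker
) -> List[str]:
    """Keep at most top_k docs, dropping any whose score is below min_score."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked[:top_k] if score >= min_score]


# Illustrative scores, not real model output.
docs = ["ulcers and bleeding", "pain relief", "blood sugar monitoring"]
scores = [7.9, -1.2, -9.5]

print(select_context(docs, scores, top_k=2, min_score=0.0))
# ['ulcers and bleeding']
```

An empty result is a useful signal in itself: the caller can tell the user that nothing relevant was found instead of asking the LLM to answer from weak context.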

The Results

After adding the reranker, I noticed a significant improvement in answer quality. The hallucinations dropped dramatically because the LLM was no longer receiving irrelevant context.

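On ranking metrics like NDCG@10, reranking typically gives you a 15-30% improvement in retrieval quality, though the exact gain depends on the dataset and the baseline retriever. NDCG scores the ranking rather than the generated answers, but better-ranked context is what drives answer quality. The real-world impact is even more significant for high-stakes applications.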
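
For readers unfamiliar with NDCG@10, here is a small sketch of how the metric is computed. It uses the linear-gain form of DCG (one of two common variants), and the relevance labels are made up for illustration, not measured on my system.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each label is discounted by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels in ranked order (1 = answers the query, 0 = does not).
vector_search_order = [0, 1, 0, 1, 0]
reranked_order = [1, 1, 0, 0, 0]

print(f"NDCG@10 before reranking: {ndcg_at_k(vector_search_order, 10):.3f}")
print(f"NDCG@10 after reranking:  {ndcg_at_k(reranked_order, 10):.3f}")
```

The same two relevant chunks are present in both orderings; the score rises only because reranking moves them to the top.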

Trade-offs to Consider

The main cost is latency: the reranker adds roughly 50-200ms, depending on model size, hardware, and how many candidates you score.

When rerankers are essential:

High-stakes domains (healthcare, finance, legal)
Complex queries requiring precise answers
Large document collections where similarity search is noisy
User trust is critical

When you might skip rerankers:

Simple factual queries with clear answers
Latency-sensitive applications
Small, well-curated document sets
Prototype/MVP stage
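
Because the latency cost depends on your model, hardware, and candidate count, it is worth measuring before deciding. Here is a minimal timing sketch, assuming the reranker is any callable that takes a list of (query, document) pairs. The fake_reranker function is a placeholder to swap for your real call, such as reranker.predict.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def median_latency_ms(
    score_fn: Callable[[Sequence[Pair]], Sequence[float]],
    pairs: Sequence[Pair],
    runs: int = 5,
) -> float:
    """Median wall-clock time of score_fn(pairs) in milliseconds over several runs."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        score_fn(pairs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_reranker(pairs: Sequence[Pair]) -> List[float]:
    """Placeholder scorer (word overlap); replace with your real reranker call."""
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]


query = "What are the side effects of aspirin?"
pairs = [(query, f"candidate chunk number {i}") for i in range(50)]

latency = median_latency_ms(fake_reranker, pairs)
print(f"median latency for {len(pairs)} candidates: {latency:.2f} ms")
```

Running this with the candidate count you plan to use in production (50 in this post) gives a number you can compare against your latency budget.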

Common Mistakes

I’ve seen developers run into the same misconceptions when implementing rerankers:

Myth: More context is always better

Reality: Sending 20 chunks with mediocre relevance causes more hallucinations than sending 5 highly relevant chunks. Quality > quantity.

Myth: Better embeddings solve the problem

Reality: Even with advanced embedding models (OpenAI, Cohere), similarity ≠ relevance. Rerankers provide a complementary improvement on top of whichever embedding model you use.

Myth: Rerankers are too slow for production

Reality: Two-stage retrieval (vector search + reranker) is standard practice. The latency cost is acceptable for most applications given the quality gain.

Myth: Reranking isn’t worth the cost

Reality: Hallucinations damage user trust and can be catastrophic in some domains. The cost of bad answers often exceeds the compute and latency cost of reranking.

The Reason

I think the key reason rerankers reduce hallucinations so effectively is that they prevent the LLM from ever seeing misleading context. When the LLM receives high-quality, relevant chunks, it doesn’t need to “fill gaps” with fabricated information.

The cross-encoder architecture is what makes this possible. Unlike bi-encoders that encode query and document separately, cross-encoders process them together through deep attention mechanisms. This allows the model to capture complex query-document interactions that simple cosine similarity misses.

Vector search is semantic matching, not relevance matching. High similarity ≠ high relevance. Rerankers bridge this gap by scoring true relevance instead of just similarity.

Summary

In this post, I showed how rerankers reduce LLM hallucinations in RAG systems by filtering out irrelevant context before it reaches the language model. The key point is that vector search ranks by embedding similarity (shared vocabulary), while rerankers use cross-encoders to score true relevance through deep query-document comparison.

This two-stage approach typically improves ranking metrics such as NDCG@10 by 15-30%, and it is essential for high-stakes RAG applications where accuracy matters more than latency.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
