How to Evaluate RAG Systems with RAGAS: Faithfulness, Relevancy, and Recall
I deployed my RAG system to production last month. It worked great on my test questions. Users loved it.
Then I added a new chunking strategy. Response times improved by 40%. I was happy.
Until I noticed users were getting answers that sounded confident but were completely wrong.
The problem? I had no evaluation metrics. I couldn’t tell if my “optimization” actually degraded quality. I was flying blind.
The Problem With RAG Development
Most RAG tutorials end with “it works!” They show you how to retrieve documents and generate answers. They don’t show you how to measure whether those answers are actually correct.
Without evaluation, you can’t:
- Detect quality regressions from code changes
- Compare different retrieval strategies objectively
- Know if your chunking is hurting or helping
- Justify architecture decisions with data
I spent weeks tuning my RAG pipeline. Every change felt like a gamble. “Does this help? I think so? Maybe?”
Then I discovered RAGAS.
What RAGAS Actually Measures
RAGAS (Retrieval Augmented Generation Assessment) provides automated evaluation through four core metrics:
```text
┌───────────────────────────────────────────────────────────────┐
│                         RAGAS METRICS                         │
├───────────────────┬───────────────────────────────────────────┤
│ Metric            │ Question It Answers                       │
├───────────────────┼───────────────────────────────────────────┤
│ Faithfulness      │ Is the answer grounded in retrieved docs? │
│ Answer Relevancy  │ Does the answer address the question?     │
│ Context Recall    │ Did we retrieve all relevant info?        │
│ Context Precision │ Is retrieved info actually relevant?      │
└───────────────────┴───────────────────────────────────────────┘
```
Each metric gives you a score between 0 and 1. Higher is better.
Faithfulness catches hallucinations. If your LLM says “The API returns JSON” but the retrieved documentation only mentions XML, faithfulness will be low.
Answer Relevancy catches off-topic responses. If someone asks “How do I authenticate?” and your system explains rate limits instead, relevancy will be low.
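Context Recall catches missing information. If the correct answer requires information you didn’t retrieve, recall will be low.
Context Precision catches noisy retrieval. If most of the retrieved chunks have nothing to do with the question, precision will be low.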
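To make the scores less abstract, here is a toy sketch of how faithfulness and context recall reduce to simple fractions once every statement has been judged. In RAGAS an LLM extracts the statements and decides whether each one is supported; the boolean lists and function names below are hypothetical stand-ins for that step, not the library's API.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption for this toy: no claims means nothing was verified
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(truth_statement_found: list[bool]) -> float:
    """Fraction of ground-truth statements that can be found in the retrieved context."""
    if not truth_statement_found:
        return 0.0
    return sum(truth_statement_found) / len(truth_statement_found)


# Hypothetical judgments for one answer: 3 claims, 2 supported by the context.
answer_claims = [True, True, False]
# Hypothetical judgments for one ground truth: 4 statements, 3 found in the context.
truth_statements = [True, True, True, False]

print(round(faithfulness_score(answer_claims), 2))        # 0.67
print(round(context_recall_score(truth_statements), 2))   # 0.75
```

One unsupported claim out of three is enough to pull faithfulness down to 0.67, which is why a single hallucinated sentence shows up clearly in the score.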
My First RAGAS Evaluation
I started with something simple. I wrote down five questions I knew the answers to, ran them through my RAG system, and recorded the outputs.
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# I manually wrote these based on my documentation
eval_data = {
    "question": [
        "What is RAG?",
        "How does chunking affect retrieval quality?",
        "What is the recommended chunk size?",
        "How does HNSW indexing work?",
        "What is the difference between semantic and keyword search?",
    ],
    "answer": [],  # Will be filled by my RAG system
    "contexts": [],  # Will be filled by my RAG system
    "ground_truth": [  # The correct answers I expect
        "RAG combines retrieval with LLM generation to access current information.",
        "Chunking affects retrieval by determining how much context is available per chunk.",
        "1000 tokens with 200 overlap is commonly recommended for general use.",
        "HNSW creates a graph structure for fast approximate nearest neighbor search.",
        "Semantic search understands meaning while keyword search matches exact terms.",
    ],
}

# Run my RAG pipeline for each question
for q in eval_data["question"]:
    result = my_rag_system.query(q)
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append(result["contexts"])

# Evaluate
dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)

print(results)
```
The results were sobering:
```text
faithfulness: 0.65
answer_relevancy: 0.72
context_recall: 0.58
context_precision: 0.71
```
My “working” RAG system was barely passing. Context recall of 0.58 meant I was missing relevant information nearly half the time.
Setting Up a Proper Evaluation Pipeline
I realized I needed something more systematic. Here’s what I built:
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset
import pandas as pd
from datetime import datetime
import json
import os


class RAGEvaluator:
    def __init__(self, rag_pipeline, eval_questions_path: str):
        self.pipeline = rag_pipeline
        self.questions = self._load_questions(eval_questions_path)

    def _load_questions(self, path: str) -> list:
        with open(path) as f:
            return json.load(f)

    def run_evaluation(self) -> dict:
        eval_data = {
            "question": [],
            "answer": [],
            "contexts": [],
            "ground_truth": [],
        }

        for item in self.questions:
            result = self.pipeline.query(item["question"])

            eval_data["question"].append(item["question"])
            eval_data["answer"].append(result["answer"])
            eval_data["contexts"].append(result["contexts"])
            eval_data["ground_truth"].append(item["ground_truth"])

        dataset = Dataset.from_dict(eval_data)
        results = evaluate(
            dataset,
            metrics=[faithfulness, answer_relevancy, context_recall],
        )

        return results

    def track_results(self, results: dict, notes: str = ""):
        timestamp = datetime.now().isoformat()
        row = {
            "timestamp": timestamp,
            "notes": notes,
            **results,
        }

        history_path = "eval_history.csv"
        df = pd.DataFrame([row])
        # Write the header only when the file is first created,
        # so the CSV has column names but no repeated header rows
        df.to_csv(
            history_path,
            mode="a",
            header=not os.path.exists(history_path),
            index=False,
        )

        return row


# Usage
evaluator = RAGEvaluator(
    rag_pipeline=my_rag_system,
    eval_questions_path="eval_questions.json",
)

results = evaluator.run_evaluation()
evaluator.track_results(results, notes="Baseline before optimization")
```
Now I could run evaluations consistently and track results over time.
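The evaluator quietly depends on two shapes: the contents of eval_questions.json and the dict returned by the pipeline's query() method. Here is a minimal sketch of both, assuming each question item carries a "question" and a "ground_truth" key and that "contexts" is a list of strings. `StubPipeline` and its canned answer are hypothetical and exist only to illustrate the interface.

```python
import json
import os
import tempfile


class StubPipeline:
    """Hypothetical stand-in for a real RAG pipeline (illustration only)."""

    def query(self, question: str) -> dict:
        # A real pipeline would retrieve chunks and call an LLM here.
        return {
            "answer": "RAG combines retrieval with LLM generation.",
            "contexts": ["RAG retrieves documents and passes them to an LLM."],
        }


# Each item pairs a question with the answer you expect (the ground truth).
sample_questions = [
    {
        "question": "What is RAG?",
        "ground_truth": "RAG combines retrieval with LLM generation to access current information.",
    },
]

# Write the file to a temp directory so the sketch leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), "eval_questions.json")
with open(path, "w") as f:
    json.dump(sample_questions, f, indent=2)

with open(path) as f:
    questions = json.load(f)

# Build the same column-oriented dict that run_evaluation() assembles.
pipeline = StubPipeline()
eval_data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for item in questions:
    result = pipeline.query(item["question"])
    eval_data["question"].append(item["question"])
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append(result["contexts"])
    eval_data["ground_truth"].append(item["ground_truth"])

print(eval_data["contexts"])
```

Note that "contexts" ends up as a list of lists: one list of retrieved chunks per question.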
What Good Scores Look Like
After running evaluations for a few weeks, here’s what I learned:
```text
┌───────────────────┬──────────────┬───────────────────────────────────┐
│ Metric            │ Target Score │ What It Means                     │
├───────────────────┼──────────────┼───────────────────────────────────┤
│ Faithfulness      │ > 0.80       │ Answers are grounded in context   │
│ Answer Relevancy  │ > 0.70       │ Answers address the question      │
│ Context Recall    │ > 0.70       │ Retrieved all relevant info       │
│ Context Precision │ > 0.70       │ Retrieved info is actually needed │
└───────────────────┴──────────────┴───────────────────────────────────┘
```
These aren’t hard rules. Your targets depend on your use case. For a medical diagnosis assistant, you want faithfulness above 0.95. For a casual chatbot, 0.70 might be fine.
The key is tracking trends, not absolute numbers. If faithfulness drops from 0.85 to 0.75 after a change, you have a problem.
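Trend tracking can be as simple as comparing the latest run with the previous one. The sketch below flags any metric that dropped by more than a tolerance; the function name and the 0.05 tolerance are assumptions chosen for illustration, not values from RAGAS.

```python
def find_regressions(previous: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return metrics whose score dropped by more than `tolerance` between runs."""
    regressions = {}
    for metric, old_score in previous.items():
        new_score = current.get(metric)
        if new_score is None:
            continue  # metric missing from the new run; nothing to compare
        drop = old_score - new_score
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical scores from two consecutive evaluation runs.
baseline = {"faithfulness": 0.85, "answer_relevancy": 0.72, "context_recall": 0.70}
after_change = {"faithfulness": 0.75, "answer_relevancy": 0.73, "context_recall": 0.68}

print(find_regressions(baseline, after_change))  # {'faithfulness': 0.1}
```

The small dip in context recall stays under the tolerance, so only the faithfulness drop is reported. Pick a tolerance wider than the run-to-run noise you see on an unchanged pipeline.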
Common Evaluation Mistakes I Made
Mistake 1: Evaluating only on synthetic questions
I generated questions with GPT-4 from my documentation. They were too clean, too predictable. Real user questions are messy.
```python
# DON'T: Only use AI-generated questions
synthetic_questions = generate_questions_from_docs(docs)  # Too clean

# DO: Include real user questions
real_questions = [
    "why does it keep saying error when i try to upload",
    "how to make it work with s3?",
    "the docs are confusing, is there an example?",
]
```
Mistake 2: Not having ground truth answers
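Some RAGAS metrics need a reference to compare against. Context recall checks whether the retrieved context covers your ground truth answer, so without ground truth it cannot be computed. Faithfulness and answer relevancy work without it, but they can’t tell you what your retriever missed.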
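A cheap guard is to check the evaluation set for missing ground truth before spending API calls on it. This is a minimal sketch that assumes each item is a dict with "question" and "ground_truth" keys; the function name is hypothetical.

```python
def find_missing_ground_truth(items: list[dict]) -> list[str]:
    """Return the questions that have no usable ground-truth answer."""
    missing = []
    for item in items:
        ground_truth = item.get("ground_truth")
        # Treat absent, non-string, and whitespace-only values as missing.
        if not isinstance(ground_truth, str) or not ground_truth.strip():
            missing.append(item.get("question", "<no question>"))
    return missing


items = [
    {"question": "What is RAG?", "ground_truth": "RAG combines retrieval with LLM generation."},
    {"question": "how to make it work with s3?", "ground_truth": ""},
    {"question": "the docs are confusing, is there an example?"},
]

print(find_missing_ground_truth(items))
```

Real user questions are the ones most likely to arrive without an answer attached, so this check matters most for exactly the questions worth adding.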
Mistake 3: Running evaluation once
I ran evaluation, got decent scores, and moved on. Then I made changes without re-evaluating. Weeks later, I realized quality had degraded.
Now I evaluate on every pull request.
Integrating Evaluation Into CI/CD
Here’s how I automated this:
```yaml
name: RAG Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py
      - name: Check scores
        run: |
          python scripts/check_thresholds.py --min-faithfulness 0.75 --min-relevancy 0.70
```
The threshold check lives in scripts/check_thresholds.py:
```python
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--min-faithfulness", type=float, default=0.75)
parser.add_argument("--min-relevancy", type=float, default=0.70)
args = parser.parse_args()

# Scores written by the evaluation step
with open("eval_results.json") as f:
    results = json.load(f)

failed = False

if results["faithfulness"] < args.min_faithfulness:
    print(f"❌ Faithfulness {results['faithfulness']:.2f} below threshold {args.min_faithfulness}")
    failed = True

if results["answer_relevancy"] < args.min_relevancy:
    print(f"❌ Answer Relevancy {results['answer_relevancy']:.2f} below threshold {args.min_relevancy}")
    failed = True

if failed:
    sys.exit(1)  # Non-zero exit code fails the CI job
else:
    print("✅ All evaluation checks passed")
```
Now my CI pipeline fails if evaluation scores drop below thresholds.
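The threshold script reads eval_results.json, but the article does not show how scripts/evaluate_rag.py produces it. One plausible way, sketched under the assumption that the evaluation step ends up with a mapping of metric names to scores, is to dump that mapping as plain floats. The `write_results` helper and the scores are hypothetical.

```python
import json
import os
import tempfile


def write_results(scores: dict, path: str) -> None:
    """Save metric scores as plain floats so the threshold script can read them."""
    # float() also converts NumPy scalars, which json.dump cannot serialize directly.
    serializable = {name: float(value) for name, value in scores.items()}
    with open(path, "w") as f:
        json.dump(serializable, f, indent=2)


# Hypothetical scores; in CI these would come from the evaluation run.
scores = {"faithfulness": 0.82, "answer_relevancy": 0.74}

# A temp directory keeps the sketch side-effect free; CI would write to the repo root.
path = os.path.join(tempfile.mkdtemp(), "eval_results.json")
write_results(scores, path)

with open(path) as f:
    loaded = json.load(f)
print(loaded)
```

The key names have to match what the threshold script looks up, here "faithfulness" and "answer_relevancy".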
What I Learned
After three months of consistent evaluation:
- Start evaluating early. Before you have users complaining about wrong answers.
- Include real user questions. Synthetic questions don’t reflect reality.
- Track trends, not absolute numbers. A drop from 0.85 to 0.75 is a signal.
- Automate. If evaluation isn’t automatic, you won’t do it consistently.
- Ground truth matters. Invest in curating good evaluation datasets.
The best part? When I made my “optimization” last month, RAGAS caught the faithfulness drop from 0.82 to 0.68. I rolled back and investigated. Turns out my new chunking strategy was splitting related information across chunks, causing the LLM to hallucinate connections.
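To see how a chunk boundary can separate related sentences, here is a generic word-based chunker with optional overlap. It is not the chunking code from my pipeline; the function name, the word-level splitting (rather than tokens), and the tiny sizes are assumptions chosen to keep the example readable.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "Upload the file first. Then call the process endpoint with the returned id."
no_overlap = chunk_words(text, chunk_size=6, overlap=0)
with_overlap = chunk_words(text, chunk_size=6, overlap=3)

print(no_overlap)
print(with_overlap)
```

Without overlap, "Then call" and "the process endpoint" land in different chunks, so retrieving either one alone gives the LLM half an instruction. With overlap, one chunk contains "first. Then call the process endpoint" in one piece.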
Without evaluation, I would have shipped broken code to production.
Quick Start Checklist
If you’re new to RAG evaluation:
- Create 10-20 evaluation questions with ground truth answers
- Set up RAGAS with faithfulness, answer_relevancy, context_recall
- Run your first evaluation (expect low scores)
- Track results in a CSV file
- Add real user questions as they come in
- Set up automated evaluation in CI/CD
The initial setup took me a weekend. It saved me weeks of debugging later.
Final Words + More Resources
I wrote this article to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.