How to Evaluate RAG Systems with RAGAS: Faithfulness, Relevancy, and Recall

Mar 22, 2026

I deployed my RAG system to production last month. It worked great on my test questions. Users loved it.

Then I added a new chunking strategy. Response times improved by 40%. I was happy.

Until I noticed users were getting answers that sounded confident but were completely wrong.

The problem? I had no evaluation metrics. I couldn’t tell if my “optimization” actually degraded quality. I was flying blind.

The Problem With RAG Development

Most RAG tutorials end with “it works!” They show you how to retrieve documents and generate answers. They don’t show you how to measure whether those answers are actually correct.

Without evaluation, you can’t:

Detect quality regressions from code changes
Compare different retrieval strategies objectively
Know if your chunking is hurting or helping
Justify architecture decisions with data

I spent weeks tuning my RAG pipeline. Every change felt like a gamble. “Does this help? I think so? Maybe?”

Then I discovered RAGAS.

What RAGAS Actually Measures

RAGAS (Retrieval Augmented Generation Assessment) provides automated evaluation through four core metrics:

┌───────────────────────────────────────────────────────────────┐
│                         RAGAS METRICS                         │
├───────────────────┬───────────────────────────────────────────┤
│ Metric            │ Question It Answers                       │
├───────────────────┼───────────────────────────────────────────┤
│ Faithfulness      │ Is the answer grounded in retrieved docs? │
│ Answer Relevancy  │ Does the answer address the question?     │
│ Context Recall    │ Did we retrieve all relevant info?        │
│ Context Precision │ Is retrieved info actually relevant?      │
└───────────────────┴───────────────────────────────────────────┘

Each metric gives you a score between 0 and 1. Higher is better.

Faithfulness catches hallucinations. If your LLM says “The API returns JSON” but the retrieved documentation only mentions XML, faithfulness will be low.

Answer Relevancy catches off-topic responses. If someone asks “How do I authenticate?” and your system explains rate limits instead, relevancy will be low.

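Context Recall catches missing information. If the correct answer requires information you didn’t retrieve, recall will be low.

Context Precision catches noisy retrieval. If the relevant chunks are buried under irrelevant ones in your retrieved results, precision will be low.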
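
To make the ratios behind two of these scores concrete, here is a minimal sketch. It assumes the claims have already been extracted and judged (RAGAS delegates that step to an LLM). The function names and the hand-labelled verdicts are illustrative and not part of the RAGAS API.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the generated answer that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("need at least one claim")
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(truth_claim_found: list[bool]) -> float:
    """Fraction of ground-truth claims that can be attributed to the retrieved context."""
    if not truth_claim_found:
        raise ValueError("need at least one claim")
    return sum(truth_claim_found) / len(truth_claim_found)


# Hypothetical verdicts for one answer: 3 claims, 2 backed by the context.
answer_verdicts = [True, True, False]
# Hypothetical verdicts for the ground truth: 4 claims, 3 present in the context.
truth_verdicts = [True, True, True, False]

print(f"faithfulness:   {faithfulness_score(answer_verdicts):.2f}")
print(f"context_recall: {context_recall_score(truth_verdicts):.2f}")
```

One unsupported claim out of three is enough to pull faithfulness down to about 0.67, which is why a single hallucinated sentence shows up clearly in the score.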

My First RAGAS Evaluation

I started with something simple. I wrote down five questions I knew the answers to, ran them through my RAG system, and recorded the outputs.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# I manually wrote these based on my documentation
eval_data = {
    "question": [
        "What is RAG?",
        "How does chunking affect retrieval quality?",
        "What is the recommended chunk size?",
        "How does HNSW indexing work?",
        "What is the difference between semantic and keyword search?",
    ],
    "answer": [],  # Will be filled by my RAG system
    "contexts": [],  # Will be filled by my RAG system
    "ground_truth": [  # The correct answers I expect
        "RAG combines retrieval with LLM generation to access current information.",
        "Chunking affects retrieval by determining how much context is available per chunk.",
        "1000 tokens with 200 overlap is commonly recommended for general use.",
        "HNSW creates a graph structure for fast approximate nearest neighbor search.",
        "Semantic search understands meaning while keyword search matches exact terms.",
    ],
}

# Run my RAG pipeline for each question
for q in eval_data["question"]:
    result = my_rag_system.query(q)
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append(result["contexts"])

# Evaluate
dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)

print(results)

The results were sobering:

faithfulness: 0.65
answer_relevancy: 0.72
context_recall: 0.58
context_precision: 0.71

My “working” RAG system was barely passing. Context recall of 0.58 meant that roughly 40% of the information in my ground-truth answers never showed up in the retrieved context.

Setting Up a Proper Evaluation Pipeline

I realized I needed something more systematic. Here’s what I built:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset
import pandas as pd
from datetime import datetime
import json

class RAGEvaluator:
    def __init__(self, rag_pipeline, eval_questions_path: str):
        self.pipeline = rag_pipeline
        self.questions = self._load_questions(eval_questions_path)

    def _load_questions(self, path: str) -> list:
        with open(path) as f:
            return json.load(f)

    def run_evaluation(self) -> dict:
        eval_data = {
            "question": [],
            "answer": [],
            "contexts": [],
            "ground_truth": [],
        }

        for item in self.questions:
            result = self.pipeline.query(item["question"])

            eval_data["question"].append(item["question"])
            eval_data["answer"].append(result["answer"])
            eval_data["contexts"].append(result["contexts"])
            eval_data["ground_truth"].append(item["ground_truth"])

        dataset = Dataset.from_dict(eval_data)
        results = evaluate(
            dataset,
            metrics=[faithfulness, answer_relevancy, context_recall],
        )

        return results

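    def track_results(self, results: dict, notes: str = ""):
        timestamp = datetime.now().isoformat()
        row = {
            "timestamp": timestamp,
            "notes": notes,
            **results,  # assumes results is dict-like: metric name -> score
        }

        df = pd.DataFrame([row])
        # Append to the history file; write the header only if the file is still empty
        with open("eval_history.csv", "a", newline="") as f:
            df.to_csv(f, header=f.tell() == 0, index=False)

        return row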

# Usage
evaluator = RAGEvaluator(
    rag_pipeline=my_rag_system,
    eval_questions_path="eval_questions.json",
)

results = evaluator.run_evaluation()
evaluator.track_results(results, notes="Baseline before optimization")

Now I could run evaluations consistently and track results over time.
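
The loader above expects eval_questions.json to hold a list of objects with question and ground_truth keys, since those are the fields run_evaluation reads. Here is a minimal sketch of creating and validating such a file. The validate_questions helper is hypothetical, and the sample entries are reused from the first evaluation earlier in this post.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"question", "ground_truth"}


def validate_questions(items: list) -> list:
    """Return the items unchanged, or raise if any entry is unusable."""
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"item {i} is missing keys: {sorted(missing)}")
        if not str(item["ground_truth"]).strip():
            raise ValueError(f"item {i} has an empty ground_truth")
    return items


# Sample entries reused from the first evaluation.
sample = [
    {
        "question": "What is RAG?",
        "ground_truth": "RAG combines retrieval with LLM generation to access current information.",
    },
    {
        "question": "How does HNSW indexing work?",
        "ground_truth": "HNSW creates a graph structure for fast approximate nearest neighbor search.",
    },
]

path = Path("eval_questions.json")
path.write_text(json.dumps(sample, indent=2), encoding="utf-8")

loaded = validate_questions(json.loads(path.read_text(encoding="utf-8")))
print(f"{len(loaded)} evaluation questions loaded")
```

Validating up front means a missing ground truth fails fast, before any LLM calls are spent on the evaluation run.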

What Good Scores Look Like

After running evaluations for a few weeks, here’s what I learned:

┌─────────────────────┬──────────────┬───────────────────────────────────┐
│ Metric              │ Target Score │ What It Means                     │
├─────────────────────┼──────────────┼───────────────────────────────────┤
│ Faithfulness        │ > 0.80       │ Answers are grounded in context   │
│ Answer Relevancy    │ > 0.70       │ Answers address the question      │
│ Context Recall      │ > 0.70       │ Retrieved all relevant info       │
│ Context Precision   │ > 0.70       │ Retrieved info is actually needed │
└─────────────────────┴──────────────┴───────────────────────────────────┘

These aren’t hard rules. Your targets depend on your use case. For a medical diagnosis assistant, you want faithfulness above 0.95. For a casual chatbot, 0.70 might be fine.

The key is tracking trends, not absolute numbers. If faithfulness drops from 0.85 to 0.75 after a change, you have a problem.
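
One way to act on trends rather than absolute values is to compare each run against the previous one and flag any metric that falls by more than a tolerance. The sketch below is an assumption about how that might look: find_regressions and the 0.05 tolerance are illustrative choices, not RAGAS features.

```python
def find_regressions(previous: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: drop} for every metric that fell by more than `tolerance`."""
    regressions = {}
    for metric, old_score in previous.items():
        new_score = current.get(metric)
        if new_score is None:
            continue  # metric not measured in the current run
        drop = old_score - new_score
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# The faithfulness drop mentioned above; the relevancy values are made up.
before = {"faithfulness": 0.85, "answer_relevancy": 0.78}
after = {"faithfulness": 0.75, "answer_relevancy": 0.77}

for metric, drop in find_regressions(before, after).items():
    print(f"{metric} dropped by {drop:.2f}")
```

The tolerance is there because LLM-judged scores vary a little from run to run, so a one-point wobble in relevancy should not raise an alarm while a ten-point drop in faithfulness should.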

Common Evaluation Mistakes I Made

Mistake 1: Evaluating only on synthetic questions

I generated questions with GPT-4 from my documentation. They were too clean, too predictable. Real user questions are messy.

# DON'T: Only use AI-generated questions
synthetic_questions = generate_questions_from_docs(docs)  # Too clean

# DO: Include real user questions
real_questions = [
    "why does it keep saying error when i try to upload",
    "how to make it work with s3?",
    "the docs are confusing, is there an example?",
]

Mistake 2: Not having ground truth answers

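Context recall and context precision compare what you retrieved against a ground truth answer. Without ground truth, you can still score faithfulness and answer relevancy, but you have no way to tell whether retrieval found the right information.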

Mistake 3: Running evaluation once

I ran evaluation, got decent scores, and moved on. Then I made changes without re-evaluating. Weeks later, I realized quality had degraded.

Now I evaluate on every pull request.

Integrating Evaluation Into CI/CD

Here’s how I automated this:

name: RAG Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/evaluate_rag.py
      - name: Check scores
        run: |
          python scripts/check_thresholds.py --min-faithfulness 0.75 --min-relevancy 0.70

The threshold check lives in scripts/check_thresholds.py:

import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--min-faithfulness", type=float, default=0.75)
parser.add_argument("--min-relevancy", type=float, default=0.70)
args = parser.parse_args()

with open("eval_results.json") as f:
    results = json.load(f)

failed = False

if results["faithfulness"] < args.min_faithfulness:
    print(f"❌ Faithfulness {results['faithfulness']:.2f} below threshold {args.min_faithfulness}")
    failed = True

if results["answer_relevancy"] < args.min_relevancy:
    print(f"❌ Answer Relevancy {results['answer_relevancy']:.2f} below threshold {args.min_relevancy}")
    failed = True

if failed:
    sys.exit(1)
else:
    print("✅ All evaluation checks passed")

Now my CI pipeline fails if evaluation scores drop below thresholds.
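
The threshold script reads eval_results.json, so the evaluation step has to write that file. The contents of scripts/evaluate_rag.py are not shown here. The sketch below is one hypothetical way to persist the scores, assuming they are available as a mapping from metric name to a numeric value. save_scores is an illustrative helper, not a RAGAS function.

```python
import json
import math
from pathlib import Path


def save_scores(scores: dict, path: str = "eval_results.json") -> dict:
    """Write metric scores as plain JSON floats so a later CI step can read them."""
    clean = {}
    for name, value in scores.items():
        value = float(value)  # also converts NumPy scalars to built-in floats
        if math.isnan(value):
            raise ValueError(f"{name} is NaN; refusing to write an unusable score")
        clean[name] = value
    Path(path).write_text(json.dumps(clean, indent=2), encoding="utf-8")
    return clean


# In a real script, `scores` would come from the evaluation run.
# The values here repeat the first evaluation from this post.
scores = {"faithfulness": 0.65, "answer_relevancy": 0.72, "context_recall": 0.58}
save_scores(scores)

print(json.loads(Path("eval_results.json").read_text(encoding="utf-8")))
```

Rejecting NaN at write time matters because a comparison like `nan < 0.75` is always false in Python, so a missing score would otherwise slip past the threshold check.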

What I Learned

After three months of consistent evaluation:

Start evaluating early. Before you have users complaining about wrong answers.
Include real user questions. Synthetic questions don’t reflect reality.
Track trends, not absolute numbers. A drop from 0.85 to 0.75 is a signal.
Automate. If evaluation isn’t automatic, you won’t do it consistently.
Ground truth matters. Invest in curating good evaluation datasets.

The best part? When I made my “optimization” last month, RAGAS caught the faithfulness drop from 0.82 to 0.68. I rolled back and investigated. Turns out my new chunking strategy was splitting related information across chunks, causing the LLM to hallucinate connections.

Without evaluation, I would have shipped broken code to production.

Quick Start Checklist

If you’re new to RAG evaluation:

Create 10-20 evaluation questions with ground truth answers
Set up RAGAS with faithfulness, answer_relevancy, context_recall
Run your first evaluation (expect low scores)
Track results in a CSV file
Add real user questions as they come in
Set up automated evaluation in CI/CD

The initial setup took me a weekend. It saved me weeks of debugging later.
