
How to Evaluate a RAG Pipeline with RAGAS

Chapter 7 of 8
Level: Engineer · ~20 min

Building a RAG system is easy. Knowing if it actually works is the hard part — and the part most tutorials skip entirely.


After this chapter, you will be able to: measure your RAG pipeline’s quality with four specific metrics, diagnose whether problems come from retrieval or generation, and build a test set to track improvements over time.


Why “It Seems to Work” Is Not Good Enough


You built a RAG chatbot. You asked it a few questions. The answers looked reasonable. Ship it?

No. Here is why. “Looked reasonable” is not a metric. It is a vibe. And vibes do not catch the failure modes that matter most:

  • The answer sounds confident and well-written, but one of the three facts in it is made up. You did not notice because it was mixed in with two real facts.
  • The retrieval found 5 chunks, but 3 of them were irrelevant. The model used them anyway and the answer drifted off-topic.
  • The answer is correct, but it missed the most important piece of information because retrieval did not find the chunk that contained it.

Every one of these failures looks fine on a quick glance. They only show up when you measure systematically. This chapter teaches you how.


The Four Questions

RAG evaluation breaks down into four questions. Each one targets a different part of your pipeline.


Metric 1 — Faithfulness (The Hallucination Check)


Faithfulness asks: is every claim in the answer actually supported by the retrieved chunks?

This is your hallucination detector. An LLM can write a beautiful, fluent answer that sounds completely authoritative — but if half of it came from the model’s training data instead of your retrieved context, you have a faithfulness problem.

How it works: take every individual claim in the generated answer and check whether the retrieved chunks contain evidence for it. If the answer says “The project deadline is March 15th” but none of the retrieved chunks mention March 15th, that claim is unfaithful. The score is the fraction of claims that are supported. [src: shahul2023ragas]

Real example: Your chatbot answers “The quarterly revenue was $2.3M, up 15% from last quarter.” If the retrieved chunk says “$2.3M revenue” but never mentions 15% growth, your faithfulness score drops because the model added a claim it could not support from context.

PLAIN ENGLISH
Faithfulness measures one thing: can every sentence in the answer be traced back to the retrieved chunks? If not, the model is making things up.
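
The arithmetic is easy to see in code. Below is a toy sketch, not RAGAS itself: RAGAS uses an LLM both to split the answer into claims and to judge support, whereas here the claims are pre-split and "support" is a naive word-overlap test, just to make the score concrete.

```python
def claim_supported(claim: str, chunks: list[str], threshold: float = 0.5) -> bool:
    """Naive stand-in for the LLM judgment: a claim counts as supported
    if enough of its words appear in a single retrieved chunk."""
    words = {w.lower().strip(".,") for w in claim.split()}
    for chunk in chunks:
        chunk_words = {w.lower().strip(".,") for w in chunk.split()}
        if len(words & chunk_words) / max(len(words), 1) >= threshold:
            return True
    return False

def faithfulness(claims: list[str], chunks: list[str]) -> float:
    """Fraction of the answer's claims supported by the retrieved chunks."""
    if not claims:
        return 0.0
    return sum(claim_supported(c, chunks) for c in claims) / len(claims)

chunks = ["The quarterly revenue was $2.3M, according to the finance report."]
claims = ["Quarterly revenue was $2.3M", "Revenue grew 15% from last quarter"]
print(faithfulness(claims, chunks))  # the 15% claim is unsupported -> 0.5
```

The revenue claim matches the chunk almost word for word and passes; the growth claim does not, so the score is 1 supported claim out of 2.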

Metric 2 — Answer Relevancy (The Off-Topic Check)

Answer Relevancy asks: does the answer actually address the question that was asked?

This catches a subtle failure mode. The answer might be entirely faithful to the chunks — every fact is real — but it does not answer the user’s question. The user asked “What was the Q1 revenue?” and the model responded with a detailed summary of the company’s hiring plans. Faithful? Yes, if the chunks discussed hiring. Relevant? Not at all.

Answer relevancy is measured by reverse-engineering questions from the generated answer and comparing them to the question that was actually asked. If the answer stays on topic, the reverse-engineered questions closely match the original; if the answer drifts off-topic, they will not. [src: shahul2023ragas]
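
Here is a toy sketch of that idea, with two big simplifications: the reverse-engineered questions (which RAGAS generates with an LLM) are supplied by hand, and the "embedding" is a bag-of-words count vector rather than a real embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevancy(question: str, reverse_questions: list[str]) -> float:
    """Mean similarity between the original question and the questions
    reverse-engineered from the answer (here: supplied by hand)."""
    q = embed(question)
    return sum(cosine(q, embed(r)) for r in reverse_questions) / len(reverse_questions)

on_topic = answer_relevancy(
    "What was the Q1 revenue?",
    ["What was the Q1 revenue?", "How much revenue in Q1?"])
off_topic = answer_relevancy(
    "What was the Q1 revenue?",
    ["What are the hiring plans?", "Who is being hired?"])
print(on_topic > off_topic)  # True: the hiring answer yields mismatched questions
```

An on-topic answer reverse-engineers into questions close to the original; the hiring-plans answer from the example above does not, and its score drops accordingly.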

Metric 3 — Context Precision (The Retrieval Noise Check)

Context Precision asks: were the retrieved chunks actually relevant to the question?

This is a retrieval metric, not a generation metric. If you retrieved 5 chunks but only 2 of them were relevant, your precision is low. The other 3 chunks are noise — they waste context window space and can mislead the model.

Think of it like a search engine. If you Google a question and 8 of the top 10 results are relevant, that is high precision. If only 2 are relevant and the rest are junk, precision is low. Same concept, applied to your chunk retrieval.
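
In code, a rank-weighted formulation (mean precision@k taken at each relevant rank) looks roughly like this. Two caveats: in the real framework an LLM produces the per-chunk relevance verdicts, which are hand-supplied here, and the exact weighting should be treated as an assumption about RAGAS internals.

```python
def context_precision(relevant: list[bool]) -> float:
    """relevant[k] marks whether the chunk at rank k+1 was judged relevant.
    Returns the mean precision@k at each relevant rank, so relevant chunks
    near the top of the ranking count for more."""
    hits, weighted = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            weighted += hits / k  # precision@k at this relevant rank
    return weighted / hits if hits else 0.0

print(context_precision([True, True, False, False, False]))   # 1.0
print(context_precision([False, False, False, True, True]))   # 0.325
```

Note the rank sensitivity: the same two relevant chunks score far higher when they appear at the top of the ranking than when they are buried at the bottom.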

Metric 4 — Context Recall (The Coverage Check)

Context Recall asks: did retrieval find all the relevant chunks, or did it miss some?

This is the other side of the retrieval coin. Precision asks “was what we found useful?” Recall asks “did we find everything useful?” You can have perfect precision (every chunk you found was relevant) but terrible recall (you missed the three most important chunks in your database).

To measure recall, you need to know what a good answer looks like. That means you need a test set with reference answers — which is why building a test set is so important.
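
A toy sketch of the recall computation: the reference answer is split into sentences, and each sentence is checked for support in the retrieved chunks. RAGAS uses an LLM for this attribution judgment; the word-overlap test here is a crude stand-in.

```python
def attributable(sentence: str, chunks: list[str], threshold: float = 0.5) -> bool:
    """Naive stand-in for the LLM judgment: enough of the sentence's
    words must appear in a single retrieved chunk."""
    words = {w.lower().strip(".,") for w in sentence.split()}
    for chunk in chunks:
        chunk_words = {w.lower().strip(".,") for w in chunk.split()}
        if len(words & chunk_words) / max(len(words), 1) >= threshold:
            return True
    return False

def context_recall(reference_sentences: list[str], chunks: list[str]) -> float:
    """Fraction of reference-answer sentences the retrieved chunks support."""
    if not reference_sentences:
        return 0.0
    return sum(attributable(s, chunks) for s in reference_sentences) / len(reference_sentences)

reference = ["The deadline is March 15th", "The total budget is $50k for Q2"]
chunks = ["Project plan: the deadline is March 15th."]
print(context_recall(reference, chunks))  # budget info was never retrieved -> 0.5
```

Retrieval found the deadline but never surfaced a budget chunk, so only half of the reference answer is covered. This is exactly why recall needs a reference answer: without one, there is nothing to check coverage against.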


The RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is the industry-standard open-source framework for measuring all four metrics. It was introduced by Shahul Es et al. (2023) and has become the default evaluation tool in the RAG ecosystem. [src: shahul2023ragas]

Here is how RAGAS works at a high level:

  1. You provide a question, the retrieved context (chunks), the generated answer, and optionally a reference answer (ground truth).
  2. RAGAS uses an LLM to evaluate each metric. For faithfulness, it breaks the answer into individual claims and checks each one against the context. For relevancy, it reverse-engineers questions from the answer and compares them to the original.
  3. Each metric returns a score between 0 and 1. Higher is better.

You do not need to understand the internal mechanics to use it. You need to understand what the scores mean and what to do when they are low.

Score Range    What It Means
0.9 – 1.0      Excellent. Your pipeline is working well for this query.
0.7 – 0.9      Good. Acceptable for most use cases, but room to improve.
0.5 – 0.7      Mediocre. Users will notice problems. Investigate.
Below 0.5      Poor. Something is clearly broken. Diagnose immediately.
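
Concretely, one evaluation record (the inputs from step 1) and the score bands above can be sketched in a few lines. The field names are illustrative rather than an exact RAGAS schema, and the sample values are invented.

```python
# One evaluation record: the four inputs an evaluator consumes.
record = {
    "question": "What was the Q1 revenue?",
    "contexts": ["Q1 revenue came in at $2.3M, per the finance report."],
    "answer": "Q1 revenue was $2.3M.",
    "ground_truth": "Q1 revenue was $2.3M.",  # optional reference answer
}

def interpret(score: float) -> str:
    """Map a 0-1 metric score onto the rough quality bands in the table above."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "mediocre: investigate"
    return "poor: diagnose immediately"

print(interpret(0.95))  # excellent
print(interpret(0.45))  # poor: diagnose immediately
```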

Building a Test Set

You cannot evaluate your pipeline with one question. You need a test set — a collection of question-answer pairs that you run your pipeline against every time you make a change.

Here is how to build one efficiently:

Start with 20 questions. Not 5 (too few to see patterns), not 200 (too many to maintain). Twenty gives you enough to spot systematic problems.

Cover different question types. Include simple factual lookups (“What is the project deadline?”), questions requiring synthesis across multiple chunks (“How did the budget change between Q1 and Q2?”), and questions where the answer is not in your documents (“What was the competitor’s revenue?”). The last category tests whether your system gracefully says “I don’t know.”

Write reference answers. For each question, write the answer you expect. This is your ground truth. RAGAS uses it to compute context recall — it checks whether the retrieved chunks contain the information needed to produce this answer.

Use AI to help, but verify. You can prompt an LLM to generate candidate questions from your documents. But verify every question-answer pair yourself. An LLM-generated test set that contains errors will give you meaningless scores.
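
A starting test set might look like the sketch below. The questions, reference answers, and dollar figures are invented for illustration, and the `type` tag is just a convenience for checking that all three question categories are covered.

```python
test_set = [
    {"type": "factual",        # simple lookup in one chunk
     "question": "What is the project deadline?",
     "reference": "The project deadline is March 15th."},
    {"type": "synthesis",      # requires combining multiple chunks
     "question": "How did the budget change between Q1 and Q2?",
     "reference": "The budget grew from $40k in Q1 to $50k in Q2."},
    {"type": "not_in_docs",    # tests the graceful "I don't know" path
     "question": "What was the competitor's revenue?",
     "reference": "The documents do not contain this information."},
    # ...extend toward ~20 questions, keeping all three types represented
]

missing = {"factual", "synthesis", "not_in_docs"} - {c["type"] for c in test_set}
print("missing categories:", missing)  # set() when all three are covered
```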


The Improvement Loop — Diagnosing What Is Wrong


Your RAGAS scores are in. Some are low. Now what? The diagnostic question is: is this a retrieval problem or a generation problem?


If context precision is low: your retrieval is returning irrelevant chunks. Try increasing the relevance threshold, switching to hybrid search, or adding a re-ranker (Chapter 5).

If context recall is low: your retrieval is missing relevant chunks. Try a different chunking strategy — your chunks might be too large (burying the relevant sentence in noise) or too small (splitting the relevant information across two chunks that do not individually match). Also check your embedding model — a better model may capture the semantic match.

If faithfulness is low but retrieval looks fine: the model is hallucinating. Your prompt needs stronger grounding instructions. Add explicit rules like “Do not use any information outside the provided context” and “If the context does not contain the answer, say so.”

If answer relevancy is low but faithfulness is fine: the model is giving accurate but off-topic answers. Your prompt may need clearer instructions about what the user is actually asking for, or your retrieval may be surfacing chunks that are tangentially related but pull the answer in the wrong direction.
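
The four diagnostic rules above can be collapsed into a small triage helper. A hedged sketch: the 0.7 cut-off is this chapter's "investigate" threshold, not a RAGAS constant, and the advice strings simply restate the rules.

```python
LOW = 0.7  # the chapter's "investigate" threshold

def diagnose(scores: dict[str, float]) -> list[str]:
    """Map low metric scores to the pipeline component to fix first.
    Missing metrics default to 1.0 (assumed healthy)."""
    advice = []
    if scores.get("context_precision", 1.0) < LOW:
        advice.append("retrieval: raise the relevance threshold, "
                      "try hybrid search or a re-ranker")
    if scores.get("context_recall", 1.0) < LOW:
        advice.append("retrieval: revisit chunk size and the embedding model")
    if scores.get("faithfulness", 1.0) < LOW:
        advice.append("generation: add stronger grounding rules to the prompt")
    if scores.get("answer_relevancy", 1.0) < LOW:
        advice.append("generation: clarify the task in the prompt, "
                      "check for tangential chunks")
    return advice or ["all metrics above threshold: no action needed"]

# Only context recall is low, so the fix is on the retrieval side.
print(diagnose({"faithfulness": 0.95, "context_recall": 0.4}))
```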


Setting Quality Targets

Perfection is not the goal. Different use cases have different quality bars.

Use Case                                 Target Scores                           Why
Internal knowledge base for engineers    Faithfulness > 0.8, Precision > 0.7     Engineers can verify and tolerate occasional misses
Customer-facing support chatbot          Faithfulness > 0.95, Relevancy > 0.9    Wrong answers erode customer trust fast
Research assistant (draft quality)       Recall > 0.85, Relevancy > 0.7          Missing information is worse than noise — users will edit
Legal or medical document Q&A            All metrics > 0.95                      Errors have real consequences

The key insight: measure first, then decide your threshold. Most teams do not know their scores at all. Even measuring badly is better than not measuring. You can always improve the test set and tighten the thresholds later.

TIP
Start by measuring anything at all. A rough test set of 20 questions with approximate reference answers is infinitely more useful than no evaluation. You can refine the test set over time.

Try It: Your RAG Score

Paste a question, the generated answer, and the retrieved chunks your pipeline used. The calculator evaluates faithfulness and answer relevancy and shows you a score with an explanation of what it means.

Try it with a good answer first — one where every fact comes from the chunks. Then try it with a bad answer — add a made-up fact that is not in any chunk. Watch the faithfulness score drop.


Open the Playground and run your chatbot against 5 test questions. For each one, note the question, the retrieved chunks, and the generated answer. Use the RAGAS Score Calculator above to evaluate faithfulness and relevancy for each. Write down your scores. If any are below 0.7, use the diagnostic decision tree to identify whether the problem is in retrieval or generation — then go back to the relevant chapter and adjust.


Check Your Understanding

Q1

Your RAG pipeline has high faithfulness (0.95) but low context recall (0.4). What does this tell you?

Q2

What are the four metrics in the RAGAS evaluation framework?


You now have the ability to measure your RAG pipeline instead of guessing. You understand the four metrics that matter, you know how to build a test set, and you have a diagnostic framework for figuring out what to fix when scores are low. This is the chapter that separates people who build demos from people who build products.

Your chatbot is now functional and measured. The next question is: what do you do when standard RAG is not enough? When queries are vague, complex, or when retrieval sometimes fails completely? That is where advanced patterns come in.





← Chapter 6: The Prompt Layer | Chapter 8: Advanced RAG Patterns →