How to evaluate a RAG system: the metrics that matter

To evaluate a RAG system, split it in two and measure each half: did retrieval fetch the right context, and did generation answer faithfully from it? Measure retrieval with context precision and recall; measure generation with faithfulness and answer relevancy. Score all of it against a fixed set of question-and-answer cases, run on every change.
That split is the whole discipline. A RAG system that gives a bad answer failed at one of two jobs: either it retrieved the wrong information, or it answered badly from the information it was given. You cannot fix what you cannot localize. Here is how to measure both in practice.
Why "it feels better" is not evaluation
LLM output is non-deterministic: the same question can produce different wordings, and a change that helps one case can quietly break another. "It seems better now" is how teams ship regressions. The only way to know a change actually improved things — and to catch the day a model upgrade silently degrades quality — is to put numbers on it. Evaluation turns a subjective demo into an engineering artifact you can defend and regression-test.
Build a golden dataset first
Everything starts with a golden dataset: a fixed set of representative questions, each paired with a known-good answer and the source passage(s) that should answer it. Twenty to fifty real cases are enough to begin, drawn from actual user questions and weighted toward the hard, high-stakes, and historically broken ones. This is the ruler you measure against; without it, every metric below is meaningless.
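One way to pin that dataset down is as plain, versioned records. The sketch below is a minimal illustration, not a required schema: the `GoldenCase` fields, the `tags` idea, and the passage ID `refund-policy#window` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One fixed test case: a question, its known-good answer, and its sources."""
    question: str
    expected_answer: str
    source_ids: tuple[str, ...]   # IDs of the passages that should answer it
    tags: tuple[str, ...] = ()    # hypothetical labels, e.g. "high-stakes"


GOLDEN_SET = [
    GoldenCase(
        question="What's our refund window?",
        expected_answer="14 days",
        source_ids=("refund-policy#window",),  # hypothetical passage ID
        tags=("high-stakes",),
    ),
]


def validate(cases):
    """Reject cases that cannot be scored: empty fields or duplicate questions."""
    seen = set()
    for case in cases:
        if not case.question.strip() or not case.expected_answer.strip():
            raise ValueError(f"empty question or answer: {case!r}")
        if not case.source_ids:
            raise ValueError(f"no source passages for: {case.question!r}")
        if case.question in seen:
            raise ValueError(f"duplicate question: {case.question!r}")
        seen.add(case.question)
    return len(cases)
```

Keeping the set in the repository next to the code means a change to the ruler shows up in review like any other change.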
Retrieval metrics — did we fetch the right context?
Most RAG failures live in retrieval, so measure it on its own, before you even look at the final answer:
- Context precision — of the chunks you retrieved, how many are actually relevant? Low precision means you are padding the prompt with noise, forcing the model to find signal in clutter.
- Context recall — did you retrieve all the information needed to answer? Low recall means the right passage never made it into the prompt, and no amount of prompting can recover it.
- Ranking metrics like recall@k (what fraction of the relevant passages appear in the top k?) and MRR (how high up, averaged across queries, was the first relevant result?) tell you whether your ordering is good, not just your set.
If retrieval recall is poor, stop and fix retrieval — chunking, embeddings, the number of chunks you pull — before touching anything downstream.
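When each golden case lists the IDs of its relevant passages, the retrieval metrics above reduce to a few lines of set arithmetic. This is a simplified sketch using ID matching. Frameworks that judge relevance with an LLM, or that weight precision by rank, will produce different numbers. The chunk IDs are made up.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunk IDs that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant chunk IDs that were retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & set(relevant)) / len(relevant)


def recall_at_k(retrieved, relevant, k):
    """Recall computed over only the top-k retrieved chunks."""
    return context_recall(retrieved[:k], relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR: the average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical query: four chunks retrieved in ranked order, three relevant.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}
print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```

Here `c5` never appears in the results, which is exactly the kind of miss that no prompt change can repair.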
Generation metrics — did we answer well from that context?
Once the model is reliably getting the right context, measure what it does with it:
- Faithfulness — is the answer grounded in the retrieved passages, or did the model add things that are not there? This is your hallucination check. Note an important subtlety: faithfulness measures grounding, not correctness. An answer can be perfectly faithful to a passage that was itself wrong — which is why retrieval quality comes first.
- Answer relevancy — does the answer actually address the question that was asked, or does it wander off into related-but-unasked territory?
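Faithfulness is commonly computed as the share of claims in the answer that the retrieved context supports. The sketch below assumes that definition and takes the support check as a pluggable function. The `toy_is_supported` judge is a deliberately crude word-overlap stand-in so the example runs offline. A real harness would call an LLM or an entailment model there.

```python
import re


def split_claims(answer):
    """Naive claim extraction: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def faithfulness(answer, context, is_supported):
    """Fraction of the answer's claims that the context supports."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_is_supported(claim, context):
    """Stand-in judge (assumption): every word of the claim occurs in the context."""
    claim_words = re.findall(r"[a-z0-9]+", claim.lower())
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return bool(claim_words) and all(word in context_words for word in claim_words)


context = "Refunds are accepted within 14 days of purchase."
answer = "Refunds are accepted within 14 days. Shipping is free."
# The first claim is grounded; the second was added by the model.
print(faithfulness(answer, context, toy_is_supported))  # 0.5

# Grounding is not correctness: a wrong passage still yields a perfect score.
wrong_context = "Refunds are accepted within 30 days of purchase."
print(faithfulness("Refunds are accepted within 30 days.", wrong_context, toy_is_supported))  # 1.0
```

The second call shows the subtlety in miniature: the answer is fully faithful to a passage that is itself wrong.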
How scoring works in practice
You score these in two ways. Reference-based evaluation compares the output to your golden answer. Reference-free evaluation uses an LLM-as-judge: a separate model that scores relevance or faithfulness on a 0-to-1 scale, which you then compare against a threshold (for example, flag any case whose context relevance drops below ~0.75). Frameworks like RAGAS automate these four metrics (faithfulness, answer relevancy, context precision, context recall) without needing a human to label every run, which is what makes continuous evaluation practical.
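The two scoring modes can be sketched side by side. Everything here is illustrative: `matches_reference` is a simple normalized-containment check rather than a standard metric, `stub_judge` returns canned scores in place of a real model call, and the case dictionary keys are assumptions.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def matches_reference(output, golden_answer):
    """Reference-based: does the golden answer appear as whole words in the output?"""
    return f" {normalize(golden_answer)} " in f" {normalize(output)} "


def flag_below_threshold(cases, judge, threshold=0.75):
    """Reference-free: collect cases whose judge score falls below the threshold."""
    flagged = []
    for case in cases:
        score = judge(case["question"], case["context"], case["answer"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge returned a score outside [0, 1]: {score}")
        if score < threshold:
            flagged.append((case["question"], score))
    return flagged


def stub_judge(question, context, answer):
    """Placeholder for a real LLM call: returns a canned score per question."""
    canned = {"What's our refund window?": 0.4}
    return canned.get(question, 0.9)


cases = [
    {"question": "What's our refund window?", "context": "...", "answer": "30 days"},
    {"question": "Do you ship abroad?", "context": "...", "answer": "Yes"},
]
print(matches_reference("Our refund window is 14 days.", "14 days"))  # True
print(flag_below_threshold(cases, stub_judge))
```

Judge scores vary between runs, so treat a threshold as a starting point to tune per metric rather than a fixed rule.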
What the split tells you when something is wrong
The reason to measure retrieval and generation separately is that the two failure signatures point to completely different fixes:
- Low context recall, everything else fine → the right passage is not being retrieved. Fix retrieval: revisit chunking, try a better embedding model, or retrieve more chunks. No prompt change will help.
- Good recall, low faithfulness → the model is getting the right context but adding things that are not in it. Fix generation: tighten the prompt to answer only from context, lower the temperature, or add a grounding check.
- Good faithfulness, low answer relevancy → the answer is grounded but does not address the question. Usually a prompting or query-understanding issue.
- Everything scores well but users still complain → your golden dataset is missing the cases that actually matter. Add them.
Without the split, all of these look the same from the outside — "the answer was bad" — and teams waste days tuning the wrong component.
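The signatures above can be encoded as a small triage function that checks retrieval first and generation second. This is a sketch under assumptions: the metric key names and the single shared threshold of 0.75 are illustrative, and real thresholds usually differ per metric.

```python
def diagnose(scores, threshold=0.75):
    """Map one case's metric scores to the component most likely at fault."""
    def low(metric):
        return scores[metric] < threshold

    # Retrieval comes first: a missing passage makes downstream scores unreliable.
    if low("context_recall"):
        return "retrieval: the right passage is not being retrieved"
    if low("faithfulness"):
        return "generation: the answer adds things that are not in the context"
    if low("answer_relevancy"):
        return "prompting or query understanding: grounded but off-question"
    return "no metric is low: if users still complain, extend the golden dataset"


scores = {"context_recall": 0.40, "faithfulness": 0.95, "answer_relevancy": 0.90}
print(diagnose(scores))  # retrieval: the right passage is not being retrieved
```

The order of the checks is the point: low recall is reported even when other metrics are also low, because fixing anything downstream first would be wasted effort.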
A quick worked example
Suppose your assistant answers "What's our refund window?" with "30 days," but the policy says 14. You check the metrics. Context recall is low: the refund-policy chunk never made it into the top results. So this is a retrieval failure, not a model failure. You discover that your chunk boundaries fall in the wrong places, splitting the refund clause across two chunks so that neither half scores as a strong match. You re-chunk by section, recall jumps, and the answer corrects itself without touching the prompt or the model. That is the entire value of measuring the two halves apart.
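A toy version of that failure shows how a clause ends up split. The policy text, the 100-character window, and the `holds_refund_clause` check are all invented for illustration. Real chunkers and real retrieval scoring are more involved.

```python
# Made-up policy text with two sections separated by a blank line.
POLICY = (
    "Shipping\n"
    "Orders ship within two business days of payment.\n"
    "\n"
    "Refunds\n"
    "Customers may request a refund within 14 days of delivery.\n"
)


def chunk_fixed(text, size):
    """Fixed-size chunking: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_section(text):
    """Section chunking: split on blank lines so each section stays whole."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def holds_refund_clause(chunk):
    """True if one chunk carries both the topic and the number that answers it."""
    lowered = chunk.lower()
    return "refund" in lowered and "14 days" in lowered


fixed_chunks = chunk_fixed(POLICY, size=100)
section_chunks = chunk_by_section(POLICY)

# The fixed window cuts mid-sentence, so no single chunk can answer the question.
print(any(holds_refund_clause(chunk) for chunk in fixed_chunks))    # False
print(any(holds_refund_clause(chunk) for chunk in section_chunks))  # True
```

With the fixed window, the word "refund" lands in one chunk and "14 days" in the next. Splitting on section boundaries keeps them together.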
Run it on every change — and in production
Evaluation is not a one-time gate; it is a standing harness:
- Wire the eval suite into CI, so a prompt tweak, a new embedding model, or a base-model upgrade gets scored before it ships. A change that drops a metric turns the build red.
- Watch the same signals in production, where context relevance can decay over time from index drift (your documents changed) or query-intent shift (users started asking new things).
- Every real-world failure becomes a new eval case, so the same mistake can never ship twice.
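A CI gate for the first point can be as small as comparing the current run's metrics to a stored baseline. In this sketch the `TOLERANCE` value, the metric names, and the numbers are assumptions. The tolerance is there because LLM-judged scores vary slightly between runs.

```python
TOLERANCE = 0.02  # assumed allowance for run-to-run noise in judged scores


def find_regressions(baseline, current, tolerance=TOLERANCE):
    """Return metrics that are missing or dropped more than `tolerance` below baseline."""
    regressions = {}
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None or new < old - tolerance:
            regressions[metric] = (old, new)
    return regressions


def gate(baseline, current):
    """Print each regression and return a process exit code: 0 passes, 1 fails."""
    regressions = find_regressions(baseline, current)
    for metric, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {metric}: {old} -> {new}")
    return 1 if regressions else 0


# Hypothetical numbers: recall fell after a change; faithfulness held steady.
baseline = {"context_recall": 0.88, "faithfulness": 0.92}
current = {"context_recall": 0.80, "faithfulness": 0.93}
exit_code = gate(baseline, current)
```

In a pipeline, the script would end with `sys.exit(exit_code)` so that a non-zero code turns the build red. Each production failure added to the golden set then automatically becomes part of this check.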
This is the difference between a team that is guessing and a team that is engineering — and measurement, not the model, is the part competitors cannot copy by swapping an API key. Everyone has the same models; the teams that win are the ones who can measure quality continuously and catch regressions before users do.

Written by
Nabeel Ghafoor
Senior Engineer
Senior engineer at ByteTuned, leading production AI builds and modernizations.


