How to Eval a RAG Agent's Groundedness

Retrieval-augmented generation agents fail in a specific, dangerous way: they sound confident while citing sources that don't say what the agent claims, or that don't exist at all. Standard output-quality metrics (BLEU, ROUGE, even most LLM-judge rubrics) don't catch this, because a fabricated but fluent answer can score well on fluency and still be false. Evaluating groundedness means testing whether every claim in the answer traces back to the retrieved context, not whether the answer merely sounds plausible.

Quick answer: Measure groundedness by checking that every factual claim in the agent's answer is directly supported by its retrieved passages, verify that cited sources actually exist and say what's attributed to them, and test whether the agent abstains (says "I don't know") when the retrieved evidence doesn't cover the question. A groundedness eval that skips abstention testing will miss the agent's most common and most dangerous failure mode.

How do you detect hallucination in a RAG agent?

Run three checks against every answer, not just one:

  1. Claim-level attribution. Break the answer into individual factual claims and check each one against the retrieved passages. An answer can be 80% grounded and 20% fabricated — a single "unsupported" verdict on the whole answer hides that.
  2. Citation verification. If the agent cites a source (a document ID, a page, a quote), confirm the citation exists in the retrieved set and that the cited passage actually contains the claimed information. Agents frequently cite real-looking but non-existent sources, or attribute a true claim to the wrong document.
  3. Abstention testing. Deliberately ask questions the corpus cannot answer. A well-behaved agent says so. An agent that fabricates an answer anyway is the single most common and most costly RAG failure — it looks identical to a correct answer until someone checks.
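
As a rough illustration of the first check, the sketch below splits an answer into sentence-level claims and records a verdict per claim instead of one verdict for the whole answer. Everything here is an assumption made for the example: the function names are hypothetical, the sentence splitter is naive, and `overlap_support` is a crude word-overlap stand-in for whatever judge you actually use (an LLM judge or an entailment model given the retrieved passages).

```python
import re
from typing import Callable, List, Tuple


def split_claims(answer: str) -> List[str]:
    """Naive claim splitter: treats each sentence as one claim (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def overlap_support(claim: str, passages: List[str], threshold: float = 0.6) -> bool:
    """Placeholder judge: is enough of the claim's vocabulary found in one passage?

    This is NOT a real groundedness judge; swap in an LLM or NLI check.
    """
    words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3}
    if not words:
        return True  # nothing checkable in this claim
    for passage in passages:
        passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
        if len(words & passage_words) / len(words) >= threshold:
            return True
    return False


def attribute_claims(
    answer: str,
    passages: List[str],
    judge: Callable[[str, List[str]], bool] = overlap_support,
) -> List[Tuple[str, bool]]:
    """Return one (claim, supported) verdict per claim."""
    return [(claim, judge(claim, passages)) for claim in split_claims(answer)]


def grounded_fraction(verdicts: List[Tuple[str, bool]]) -> float:
    """Share of claims judged supported; 1.0 if the answer made no claims."""
    if not verdicts:
        return 1.0
    return sum(ok for _, ok in verdicts) / len(verdicts)


passages = ["The warranty covers parts and labor for two years from purchase."]
answer = (
    "The warranty covers parts and labor for two years. "
    "It also includes free international shipping."
)
verdicts = attribute_claims(answer, passages)
for claim, ok in verdicts:
    print("SUPPORTED  " if ok else "UNSUPPORTED", claim)
```

The per-claim list is the useful output: it shows which sentence was fabricated, not just that something was. Note that a placeholder judge like this one would also flag an honest "I don't know" as unsupported, so abstentions are better routed to the abstention check than scored as claims.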

What's the difference between groundedness and correctness?

They measure different things, and an eval needs both:

Metric            | Question it answers                                | Catches
------------------|----------------------------------------------------|------------------------------------------------
Groundedness      | Is the claim supported by the retrieved context?   | Fabrication, unsupported extrapolation
Correctness       | Is the claim actually true?                        | Retrieval of wrong/outdated sources
Citation accuracy | Does the cited source say what's attributed to it? | Misattribution, hallucinated citations
Abstention rate   | Does the agent decline when evidence is missing?   | Confident fabrication on out-of-scope questions

A RAG agent can be perfectly grounded in a wrong source (the retriever pulled an outdated document) — grounded but not correct. It can also be correct by coincidence while citing nothing — correct but not grounded. Testing only one axis leaves a real gap in the eval.

How do you build a test set that actually discriminates good from broken agents?

This is the part most homegrown RAG evals skip, and it's the difference between an eval that looks rigorous and one that actually works: run the candidate eval against agents of known quality before trusting its verdicts.

If your eval can't separate agents whose quality you already know, it isn't measuring groundedness; it's measuring something correlated with it, like answer length or fluency. This is the same principle as mutation testing in software: you don't trust a test suite until you've confirmed it fails on a planted bug. See the RAG agents benchmark for a worked example of this discriminating-power check on a real reference panel.
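
One way to make this check mechanical is sketched below, under stated assumptions: you already have one eval score per reference agent, and you know the order in which those agents should rank. The function name, the agent labels, the scores, and the `min_gap` margin are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def discriminates(
    scores: Dict[str, float],
    expected_order: List[str],
    min_gap: float = 0.1,
) -> Tuple[bool, List[str]]:
    """Check that the eval ranks known agents best-to-worst with a clear margin.

    expected_order lists agent labels from best to worst. Each adjacent pair
    must be separated by at least min_gap, otherwise the eval is not telling
    them apart.
    """
    problems = []
    for better, worse in zip(expected_order, expected_order[1:]):
        gap = scores[better] - scores[worse]
        if gap < min_gap:
            problems.append(f"{better} vs {worse}: gap {gap:.2f} < {min_gap:.2f}")
    return (not problems, problems)


# Made-up scores: the eval barely separates a reference agent from one
# that was deliberately broken to fabricate answers.
scores = {"reference_agent": 0.92, "planted_fabricator": 0.88}
ok, problems = discriminates(scores, ["reference_agent", "planted_fabricator"])
print(ok, problems)
```

If this returns a failure on agents you broke on purpose, fix the eval before reading anything into its scores for real candidates.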

What should a groundedness rubric actually score?

At minimum, score the sub-checks described above (claim-level attribution, citation verification, and abstention) independently rather than collapsing them into one number.
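
A minimal sketch of what "independently" can look like in code, assuming three sub-scores named after the checks discussed earlier in this guide. The class, field names, and thresholds are hypothetical; the point is only that each sub-score is gated on its own floor rather than averaged.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GroundednessReport:
    claim_support: float      # share of claims supported by retrieved passages
    citation_accuracy: float  # share of citations that exist and match their source
    abstention: float         # share of out-of-scope questions the agent declined


def failing_checks(report: GroundednessReport, minimums: Dict[str, float]) -> List[str]:
    """Return the names of sub-checks that fall below their own minimum."""
    scores = asdict(report)
    return [name for name, floor in minimums.items() if scores[name] < floor]


report = GroundednessReport(claim_support=0.95, citation_accuracy=0.98, abstention=0.40)
minimums = {"claim_support": 0.9, "citation_accuracy": 0.9, "abstention": 0.9}

mean_score = sum(asdict(report).values()) / 3
print(f"mean: {mean_score:.2f}")            # one blended number
print(failing_checks(report, minimums))     # which sub-check actually failed
```

The blended mean gives no hint that abstention is the problem, while the per-check gate names it directly.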

FAQ

Do I need a human-labeled golden dataset to test groundedness?

No — you need retrieved passages and the claims made against them, not a pre-written "correct answer." Groundedness compares the answer to its own cited context, which sidesteps needing an oracle for the "right" answer.

Can an LLM judge groundedness reliably?

Yes, for claim-level attribution, if the judge is given the retrieved passages and asked to check each claim against them specifically — not asked "is this answer good?" in general. Deterministic checks (does the cited source exist?) should be code, not a judge call, wherever possible.
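
As an illustration of the deterministic side, the sketch below checks a citation without any model call. It assumes citations arrive as a document ID plus a quoted span, and that the retrieved set is a mapping from document ID to text; both shapes and all names are hypothetical, so adapt them to whatever your agent actually emits.

```python
import re
from typing import Dict, NamedTuple


class Citation(NamedTuple):
    doc_id: str
    quote: str


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks don't cause misses."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def check_citation(citation: Citation, retrieved: Dict[str, str]) -> str:
    """Return 'ok', 'missing_source', or 'quote_not_found' for one citation."""
    if citation.doc_id not in retrieved:
        return "missing_source"  # cited a document that was never retrieved
    quote = _normalize(citation.quote)
    if not quote or quote not in _normalize(retrieved[citation.doc_id]):
        return "quote_not_found"  # source exists but does not contain the quote
    return "ok"


retrieved = {"doc-7": "Refunds are issued within 14 days\nof the return being received."}
print(check_citation(Citation("doc-7", "within 14 days of the return"), retrieved))
print(check_citation(Citation("doc-9", "within 14 days"), retrieved))
print(check_citation(Citation("doc-7", "within 30 days"), retrieved))
```

Exact-substring matching only covers verbatim quotes. A paraphrased attribution still needs the judge, but the existence check is cheap enough to run on every answer.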

What's a good abstention rate to target?

It depends on how often your corpus genuinely lacks coverage for expected queries — there's no universal number. The useful test is relative: the agent should abstain on your deliberately out-of-scope questions close to 100% of the time, and abstain rarely on in-scope questions.
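
A small sketch of that relative test, assuming each test question is labeled in-scope or out-of-scope ahead of time. The phrase list used to detect an abstention is a simplifying assumption, as are the function names and sample answers; a real agent may need a structured "no answer" signal or a judge instead of string matching.

```python
from typing import Iterable, Tuple

# Hypothetical abstention phrases; tune these to how your agent declines.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "i could not find")


def is_abstention(answer: str) -> bool:
    """Crude detector: does the answer contain a known abstention phrase?"""
    text = answer.replace("\u2019", "'").casefold()  # treat curly apostrophes as straight
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rates(results: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Take (answer, in_scope) pairs; return (out_of_scope_rate, in_scope_rate)."""
    out_total = out_abstained = in_total = in_abstained = 0
    for answer, in_scope in results:
        abstained = is_abstention(answer)
        if in_scope:
            in_total += 1
            in_abstained += abstained
        else:
            out_total += 1
            out_abstained += abstained
    out_rate = out_abstained / out_total if out_total else 0.0
    in_rate = in_abstained / in_total if in_total else 0.0
    return out_rate, in_rate


results = [
    ("I don't know; the documents do not mention pricing.", False),
    ("The price is $40 per seat.", False),  # fabricated on an out-of-scope question
    ("Refunds are issued within 14 days.", True),
    ("Shipping takes 3 to 5 business days.", True),
]
out_rate, in_rate = abstention_rates(results)
print(f"out-of-scope abstention: {out_rate:.0%}, in-scope abstention: {in_rate:.0%}")
```

Reporting the two rates side by side matters: an agent that declines everything would score perfectly on the out-of-scope set alone.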

Does groundedness testing slow down CI?

Claim-extraction and citation-verification add LLM calls, but the deterministic checks (citation exists, page number valid) are cheap and can run on every commit; save the full claim-level judge pass for pre-release runs if latency matters.

Is groundedness the same as RAGAS's faithfulness metric?

Faithfulness in RAGAS is one implementation of the groundedness idea (checking generated claims against retrieved context). It's a reasonable building block, but on its own it doesn't test abstention or citation accuracy — treat it as one sub-check, not the whole rubric.

What's the single biggest mistake teams make evaluating RAG groundedness?

Testing only on questions the corpus can answer. Almost every real groundedness failure shows up on questions it can't — that's where fabrication happens, and it's the check most homegrown test sets skip entirely.
