How to Eval a RAG Agent's Groundedness

By Vincent | Published 2026-06-15 | Updated 2026-07-03

Retrieval-augmented generation (RAG) agents fail in a specific, dangerous way: they sound confident while citing sources that don't say what the agent claims, or that don't exist at all. Standard metrics miss this. Overlap scores such as BLEU and ROUGE, and most general-purpose LLM-judge rubrics, reward surface similarity and fluency, so a fabricated but fluent answer can score well and still be false. Evaluating groundedness means testing whether every claim in the answer traces back to the retrieved context, not whether the answer merely sounds plausible.

Quick answer: Measure groundedness by checking that every factual claim in the agent's answer is directly supported by its retrieved passages, verify that cited sources actually exist and say what's attributed to them, and test whether the agent abstains (says "I don't know") when the retrieved evidence doesn't cover the question. A groundedness eval that skips abstention testing will miss the agent's most common and most dangerous failure mode.

How do you detect hallucination in a RAG agent?

Run three checks against every answer, not just one:

Claim-level attribution. Break the answer into individual factual claims and check each one against the retrieved passages. An answer can be 80% grounded and 20% fabricated — a single "unsupported" verdict on the whole answer hides that.
Citation verification. If the agent cites a source (a document ID, a page, a quote), confirm the citation exists in the retrieved set and that the cited passage actually contains the claimed information. Agents frequently cite real-looking but non-existent sources, or attribute a true claim to the wrong document.
Abstention testing. Deliberately ask questions the corpus cannot answer. A well-behaved agent says so. An agent that fabricates an answer anyway is the single most common and most costly RAG failure — it looks identical to a correct answer until someone checks.
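The three checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Passage` and `Claim` types, the `check_answer` function, and the abstention phrase list are hypothetical names introduced here, and `is_supported` is a naive substring match standing in for an LLM judge or entailment model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Passage:
    doc_id: str
    text: str


@dataclass
class Claim:
    text: str
    cited_doc_id: Optional[str] = None  # None: the claim carries no citation


def is_supported(claim_text: str, passage_text: str) -> bool:
    # Placeholder (assumption): naive case-insensitive substring match.
    # A real eval would call an LLM judge or an entailment model here.
    return claim_text.lower() in passage_text.lower()


# Hypothetical abstention phrases; tune these to the agent's actual wording.
ABSTAIN_MARKERS = ("i don't know", "the sources do not cover")


def check_answer(answer_text: str, claims: List[Claim], retrieved: List[Passage]) -> Dict:
    by_id = {p.doc_id: p for p in retrieved}
    per_claim = []
    for claim in claims:
        cited = by_id.get(claim.cited_doc_id) if claim.cited_doc_id else None
        per_claim.append({
            "claim": claim.text,
            # Claim-level attribution: does any retrieved passage support it?
            "grounded": any(is_supported(claim.text, p.text) for p in retrieved),
            # Citation verification, part 1: is the cited source in the retrieved set?
            "citation_exists": cited is not None,
            # Part 2: does the cited passage itself contain the claim?
            "citation_supports": cited is not None and is_supported(claim.text, cited.text),
        })
    # Abstention: did the agent decline to answer?
    abstained = any(marker in answer_text.lower() for marker in ABSTAIN_MARKERS)
    return {"claims": per_claim, "abstained": abstained}


retrieved = [
    Passage("doc-1", "The warranty lasts 24 months."),
    Passage("doc-2", "Returns are accepted within 30 days."),
]
claims = [
    Claim("The warranty lasts 24 months", "doc-1"),         # grounded, cited correctly
    Claim("Returns are accepted within 30 days", "doc-1"),  # grounded, wrong document
    Claim("Shipping is free", "doc-9"),                     # unsupported, citation missing
]
report = check_answer("The warranty lasts 24 months [doc-1]. ...", claims, retrieved)
```

In this example the second claim is grounded but misattributed, and the third cites a document that is not in the retrieved set. A single verdict on the whole answer would not distinguish these cases, which is why the sketch keeps one record per claim.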

What's the difference between groundedness and correctness?

They measure different things, and an eval pack needs both. The table also lists the two related checks, citation accuracy and abstention rate:

| Metric | Question it answers | Catches |
| --- | --- | --- |
| Groundedness | Is the claim supported by the retrieved context? | Fabrication, unsupported extrapolation |
| Correctness | Is the claim actually true? | Retrieval of wrong/outdated sources |
| Citation accuracy | Does the cited source say what's attributed to it? | Misattribution, hallucinated citations |
| Abstention rate | Does the agent decline when evidence is missing? | Confident fabrication on out-of-scope questions |

A RAG agent can be perfectly grounded in a wrong source (the retriever pulled an outdated document) — grounded but not correct. It can also be correct by coincidence while citing nothing — correct but not grounded. Testing only one axis leaves a real gap in the eval.

How do you build a test set that actually discriminates good from broken agents?

This is the part most homegrown RAG evals skip, and it's the difference between an eval that looks rigorous and one that actually works: run the candidate eval against agents of known quality before trusting its verdicts.

A known-good reference agent (grounded, cites correctly, abstains appropriately) should score high.
A known-broken reference agent (loose citation discipline, occasional extrapolation beyond the source) should score visibly lower.
A known-sabotaged reference agent (fabricates citations on purpose) should score near zero.

If your eval can't separate these three, it isn't measuring groundedness — it's measuring something correlated with it, like answer length or fluency. This is the same principle as mutation testing in software: you don't trust a test suite until you've confirmed it fails on a planted bug. See the RAG agents benchmark for a worked example of this discriminating-power check on a real reference panel.
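One way to automate this discriminating-power check is sketched below. The function name `panel_discriminates`, the panel keys, and both thresholds (`min_gap`, `sabotaged_ceiling`) are assumptions chosen for illustration; the source does not prescribe specific numbers, so pick values that fit your own score scale.

```python
from statistics import mean
from typing import Dict, List, Tuple


def panel_discriminates(
    scores: Dict[str, List[float]],
    min_gap: float = 0.15,            # hypothetical: minimum separation between tiers
    sabotaged_ceiling: float = 0.10,  # hypothetical: what counts as "near zero"
) -> Tuple[bool, Dict[str, bool]]:
    """scores maps 'good', 'broken', 'sabotaged' to per-question eval scores in [0, 1]."""
    good = mean(scores["good"])
    broken = mean(scores["broken"])
    sabotaged = mean(scores["sabotaged"])
    checks = {
        "good_above_broken": good - broken >= min_gap,
        "broken_above_sabotaged": broken - sabotaged >= min_gap,
        "sabotaged_near_zero": sabotaged <= sabotaged_ceiling,
    }
    return all(checks.values()), checks


# An eval that separates the three reference agents.
ok, detail = panel_discriminates({
    "good": [0.90, 0.95, 1.00],
    "broken": [0.60, 0.70, 0.65],
    "sabotaged": [0.00, 0.05, 0.10],
})

# An eval that scores everything about the same (likely measuring fluency or length).
flat_ok, flat_detail = panel_discriminates({
    "good": [0.80, 0.82],
    "broken": [0.79, 0.80],
    "sabotaged": [0.78, 0.80],
})
```

Here `ok` is true and `flat_ok` is false. Returning the individual checks alongside the overall verdict makes it easier to see which separation failed.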

What should a groundedness rubric actually score?

At minimum, score these sub-checks independently rather than collapsing them into one number:

No data leakage — the golden/reference answers used to build the test set must not leak into the retrieval corpus in a way the agent can shortcut.
Ungameable — an agent that pattern-matches the rubric's surface form (e.g., always citing something) shouldn't score well without genuine grounding.
Determinism — the same input should produce a consistent verdict across runs; non-deterministic scoring makes regressions invisible.
Discrimination — verified against the reference panel above.
Coverage — for regulated use cases, whether the pack's checks map onto an external standard's requirements (see the EU AI Act standard for how this applies to high-risk RAG deployments).
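Two of these sub-checks, determinism and data leakage, lend themselves to cheap code-level tests. The sketch below is one possible approach under stated assumptions: `is_deterministic` and `leaked_answers` are hypothetical helper names, and the leakage test only catches verbatim copies after case and whitespace normalization, so paraphrased leakage would need a stronger method.

```python
from typing import Callable, Iterable, List


def is_deterministic(score_fn: Callable[[str], float], eval_input: str, runs: int = 5) -> bool:
    """True if score_fn returns the same verdict on every run for the same input."""
    verdicts = {score_fn(eval_input) for _ in range(runs)}
    return len(verdicts) == 1


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't hide a match.
    return " ".join(text.lower().split())


def leaked_answers(golden_answers: Iterable[str], corpus_docs: Iterable[str]) -> List[str]:
    """Golden answers that appear verbatim (after normalization) in a corpus document.

    Crude proxy (assumption): only exact-copy leakage is detected.
    """
    docs = [_normalize(d) for d in corpus_docs]
    return [g for g in golden_answers if any(_normalize(g) in d for d in docs)]


# A scorer that depends only on its input is deterministic.
stable = is_deterministic(lambda text: len(text) / 100, "some answer")

# A scorer whose verdict changes between runs is not.
drifting_values = iter([0.7, 0.7, 0.8, 0.7, 0.7])
unstable = is_deterministic(lambda text: next(drifting_values), "some answer")

leaks = leaked_answers(
    ["The limit is 50 mg per day."],
    ["FAQ:   the limit is 50 mg per day. Ask your doctor."],
)
```

Keeping each sub-check as its own function, with its own result, follows the advice above to score them independently instead of collapsing them into one number.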

Related

What is a reference-panel harness: the discriminating-power method referenced above, explained in depth.
How to test RAG agent abstention: a deep dive on the abstention-testing check.
How to detect a gameable eval: a deep dive on the leakage/ungameable checks.
Capability packs: the category this rubric applies to.
The medical RAG groundedness pack: a verified pack implementing this exact rubric.
RAG agents benchmark: the discriminating-power results referenced above.
EU AI Act - Art. 53 / high-risk: the standard this approach is anchored to for regulated deployments.

FAQ

Do I need a human-labeled golden dataset to test groundedness?

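Not for groundedness itself. You need the retrieved passages and the claims made against them, not a pre-written "correct answer." Groundedness compares the answer to its own retrieved context, which sidesteps needing an oracle for the "right" answer. You do still need to know which test questions are deliberately out of scope for abstention testing, and a reference answer if you also score correctness.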

Can an LLM judge groundedness reliably?

Yes, for claim-level attribution, if the judge is given the retrieved passages and asked to check each claim against them specifically — not asked "is this answer good?" in general. Deterministic checks (does the cited source exist?) should be code, not a judge call, wherever possible.
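A per-claim judge call might be framed as in the sketch below. The prompt wording, the `SUPPORTED`/`UNSUPPORTED` labels, and the function names are assumptions for illustration; no particular model or API is implied, and the call to the judge itself is left out.

```python
from typing import List


def build_claim_prompt(claim: str, passages: List[str]) -> str:
    """Build a prompt that asks about one claim against the retrieved passages only."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "You are checking one claim against retrieved passages.\n"
        f"Passages:\n{numbered}\n\n"
        f"Claim: {claim}\n\n"
        "Answer with exactly one word: SUPPORTED or UNSUPPORTED. "
        "Use only the passages above, not outside knowledge."
    )


def parse_verdict(judge_output: str) -> bool:
    """Strict parsing: anything other than the exact label SUPPORTED counts as unsupported."""
    return judge_output.strip().upper() == "SUPPORTED"


prompt = build_claim_prompt(
    "The warranty lasts 24 months",
    ["The warranty lasts 24 months.", "Returns are accepted within 30 days."],
)
```

Constraining the judge to a fixed label and parsing it strictly keeps ambiguous replies from being counted as support.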

What's a good abstention rate to target?

It depends on how often your corpus genuinely lacks coverage for expected queries — there's no universal number. The useful test is relative: the agent should abstain on your deliberately out-of-scope questions close to 100% of the time, and abstain rarely on in-scope questions.
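The relative test described here can be computed from labeled results, as in this minimal sketch. The `abstention_rates` function and the `(in_scope, abstained)` pair format are assumptions introduced for the example.

```python
from typing import Dict, Iterable, Optional, Tuple


def abstention_rates(results: Iterable[Tuple[bool, bool]]) -> Dict[str, Optional[float]]:
    """results: (in_scope, abstained) pairs, one per test question.

    Returns the abstention rate on out-of-scope questions (should be near 1.0)
    and on in-scope questions (should be low). A rate is None if that group is empty.
    """
    rows = list(results)
    out_scope = [abstained for in_scope, abstained in rows if not in_scope]
    in_scope_rows = [abstained for in_scope, abstained in rows if in_scope]

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return {
        "out_of_scope_abstention_rate": rate(out_scope),
        "in_scope_abstention_rate": rate(in_scope_rows),
    }


rates = abstention_rates([
    # Out-of-scope questions: the agent abstained on 3 of 4.
    (False, True), (False, True), (False, True), (False, False),
    # In-scope questions: the agent abstained on 1 of 5.
    (True, False), (True, False), (True, True), (True, False), (True, False),
])
# rates == {"out_of_scope_abstention_rate": 0.75, "in_scope_abstention_rate": 0.2}
```

Reporting the two rates separately matters: a single blended abstention number cannot tell an agent that refuses everything from one that refuses only when it should.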

Does groundedness testing slow down CI?

It can, but the cost is controllable. Claim extraction and checking whether a cited passage supports its claim add LLM calls. The deterministic checks (citation exists, page number valid) are cheap and can run on every commit. If latency matters, save the full claim-level judge pass for pre-release runs.

Is groundedness the same as RAGAS's faithfulness metric?

Faithfulness in RAGAS is one implementation of the groundedness idea (checking generated claims against retrieved context). It's a reasonable building block, but on its own it doesn't test abstention or citation accuracy — treat it as one sub-check, not the whole rubric.

What's the single biggest mistake teams make evaluating RAG groundedness?

Testing only on questions the corpus can answer. Almost every real groundedness failure shows up on questions it can't — that's where fabrication happens, and it's the check most homegrown test sets skip entirely.
