How to Test Whether a RAG Agent Knows When to Say 'I Don't Know'

Ask a RAG agent something its retrieved documents don't cover, and a surprising number of agents answer anyway: confidently, fluently, and wrong. This failure mode is among the most damaging RAG defects, because a fabricated-but-plausible answer is indistinguishable from a correct one until someone checks the source. Abstention testing exists to catch exactly this.

Quick answer: Abstention testing means deliberately asking a RAG agent questions its retrieved context cannot answer, and checking that it says so instead of guessing. A well-built test set includes a meaningful share of genuinely unanswerable questions alongside the answerable ones. Most homegrown RAG evals skip them entirely, even though unanswerable questions are exactly where fabrication happens.

Why do most RAG evals skip abstention testing?

Because it's easier to write questions the corpus can answer and check the answer's quality than to deliberately construct questions it can't answer and verify the agent notices. Building good unanswerable cases takes more care: the question must be plausible enough that a shortcut-taking agent would be tempted to guess, not so obviously out of scope that abstaining is trivial.

How do you construct a good unanswerable test case?

Three approaches, in increasing order of difficulty to build (and value):

  1. Out-of-corpus questions — plausible questions about the domain that simply aren't covered by any retrieved document. Easiest to build, still useful.
  2. Adjacent-but-different questions — the corpus covers a related topic, but not the specific fact asked about, tempting an agent to extrapolate.
  3. Partially-covered questions — the context supports part of the answer but not all of it; the correct behavior is a hedged, partial answer that flags the gap, not silence and not full fabrication.
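
One way to keep the three kinds of case distinct in a test pack is to tag each question with its kind and derive the expected behavior from the tag. The following is a minimal sketch, not a prescribed schema: the names `CaseKind`, `AbstentionCase`, and `EXPECTED`, and the refund-policy corpus, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CaseKind(Enum):
    ANSWERABLE = "answerable"
    OUT_OF_CORPUS = "out_of_corpus"
    ADJACENT = "adjacent_but_different"
    PARTIALLY_COVERED = "partially_covered"


# Expected behavior per kind, following the three approaches listed above.
EXPECTED = {
    CaseKind.ANSWERABLE: "answer",
    CaseKind.OUT_OF_CORPUS: "abstain",
    CaseKind.ADJACENT: "abstain",
    CaseKind.PARTIALLY_COVERED: "partial_answer_flagging_gap",
}


@dataclass(frozen=True)
class AbstentionCase:
    question: str
    kind: CaseKind
    note: str = ""  # why the context cannot (fully) answer; for reviewers

    @property
    def expected(self) -> str:
        return EXPECTED[self.kind]


# Hypothetical corpus: one document stating the refund window for annual plans.
cases = [
    AbstentionCase("What is the refund window for annual plans?",
                   CaseKind.ANSWERABLE),
    AbstentionCase("Which payment processors are supported in Brazil?",
                   CaseKind.OUT_OF_CORPUS, "payments are not covered at all"),
    AbstentionCase("What is the refund window for monthly plans?",
                   CaseKind.ADJACENT, "only annual plans are documented"),
    AbstentionCase("What is the annual refund window, and who approves exceptions?",
                   CaseKind.PARTIALLY_COVERED, "approver is not documented"),
]

for case in cases:
    print(f"{case.kind.value:24} -> {case.expected}")
```

Recording the `note` alongside each case makes it easier for a reviewer to confirm that a question really is unanswerable from the context, rather than merely hard.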

What's the actual pass/fail check?

For each unanswerable test case: did the agent abstain (explicitly declined, or produced no answer and no citation), or did it produce an answer as if the evidence supported it? There's no partial credit for a fabricated answer that happens to be true by coincidence — the failure is answering without support, regardless of outcome.

Behavior on an unanswerable case                 | Verdict
Explicitly states insufficient evidence          | Pass
Hedges with a stated caveat, no confident claim  | Partial pass
Answers confidently with a citation              | Fail (fabrication)
Answers confidently with no citation             | Fail (fabrication)
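
The verdicts above can be expressed as a small grading function. This is a sketch under stated assumptions: it presumes each response has already been labeled with two flags, `declined` and `hedged`, by some upstream step (pattern match, judge, or human), and the names `LabeledResponse`, `Verdict`, and `grade_unanswerable` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    PARTIAL_PASS = "partial pass"
    FAIL = "fail (fabrication)"


@dataclass
class LabeledResponse:
    answer: str
    citations: list[str] = field(default_factory=list)
    declined: bool = False  # explicitly states insufficient evidence
    hedged: bool = False    # stated caveat, no confident claim


def grade_unanswerable(resp: LabeledResponse) -> Verdict:
    """Grade a response to a question the context cannot answer."""
    if resp.declined:
        return Verdict.PASS
    # No answer and no citation also counts as abstaining.
    if not resp.answer.strip() and not resp.citations:
        return Verdict.PASS
    if resp.hedged:
        return Verdict.PARTIAL_PASS
    # A confident answer fails with or without a citation, even if it
    # happens to be true: the failure is answering without support.
    return Verdict.FAIL


declined = LabeledResponse("The documents don't cover this.", declined=True)
hedged = LabeledResponse("Possibly 14 days, but the sources don't say.", hedged=True)
cited = LabeledResponse("It is 14 days.", citations=["doc-3"])
uncited = LabeledResponse("It is 14 days.")

for r in (declined, hedged, cited, uncited):
    print(grade_unanswerable(r).value)
```

Note that correctness of the answer is never an input: the function only looks at whether the agent answered without support.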

How does this fit into a full groundedness rubric?

Abstention testing is one of at least four things a groundedness rubric needs — alongside claim-level attribution, citation verification, and testing against a reference panel of known-quality agents. See the full breakdown in the main groundedness guide.

FAQ

What counts as a good abstention test question?

A question that is plausible for the domain but genuinely uncovered by the retrieved context — not a nonsense question the agent would obviously reject. The test is meaningful only if a fabricated answer would look believable to someone who didn't check the sources.

Should an agent abstain on every ambiguous question?

No — ambiguous but partially answerable questions should get a partial, hedged answer with the gap stated, not a blanket refusal. Reserve hard abstention for questions the context genuinely cannot answer at all; over-abstaining is its own failure mode and hurts usefulness.
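
Because over-abstaining is a failure in its own right, it can help to report two rates side by side: how often the agent failed to abstain on unanswerable cases, and how often it abstained on answerable ones. The sketch below assumes each result has been reduced to a pair of booleans; the function name and metric names are hypothetical, and partially covered cases are left out because they need their own check for a flagged gap.

```python
from typing import Iterable, Optional


def abstention_rates(
    results: Iterable[tuple[bool, bool]],
) -> dict[str, Optional[float]]:
    """results: (answerable, abstained) pairs, one per test case.

    Returns None for a rate when there are no cases of that type.
    """
    on_answerable: list[bool] = []
    on_unanswerable: list[bool] = []
    for answerable, abstained in results:
        (on_answerable if answerable else on_unanswerable).append(abstained)

    missed = (
        on_unanswerable.count(False) / len(on_unanswerable)
        if on_unanswerable else None
    )
    over = (
        on_answerable.count(True) / len(on_answerable)
        if on_answerable else None
    )
    return {"missed_abstention_rate": missed, "over_abstention_rate": over}


results = [
    (False, True),   # unanswerable, abstained: correct
    (False, False),  # unanswerable, answered: fabrication
    (True, False),   # answerable, answered: correct
    (True, True),    # answerable, abstained: over-abstention
    (True, False),
    (True, False),
]
rates = abstention_rates(results)
print(rates)
```

An agent that refuses everything scores zero on the first rate and one on the second, which is why neither number is meaningful alone.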

How many abstention test cases should a pack include?

Enough to be a real signal, not an afterthought — a pack with a handful of token unanswerable cases next to dozens of answerable ones is still mostly testing the wrong thing. Treat unanswerable cases as a first-class portion of the test set, not a footnote.
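
A simple guard can flag packs where unanswerable cases are only a token presence. The sketch below is illustrative: the kind labels are hypothetical, and the `min_share` value in the usage example is an arbitrary assumption, not a recommended threshold, since this guide deliberately does not prescribe a number.

```python
from collections import Counter


def unanswerable_share(kinds: list[str]) -> float:
    """Fraction of cases in the pack whose kind is not 'answerable'."""
    if not kinds:
        raise ValueError("test pack is empty")
    counts = Counter(kinds)
    return (len(kinds) - counts["answerable"]) / len(kinds)


def check_pack(kinds: list[str], min_share: float) -> bool:
    """True if unanswerable cases make up at least min_share of the pack."""
    return unanswerable_share(kinds) >= min_share


# Hypothetical pack: 6 answerable cases and 4 unanswerable ones.
pack = ["answerable"] * 6 + ["out_of_corpus"] * 2 + ["adjacent", "partial"]
share = unanswerable_share(pack)
print(f"unanswerable share: {share:.0%}")
print("ok" if check_pack(pack, min_share=0.25) else "too few unanswerable cases")
```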

Can this be tested without a human reviewing every response?

Mostly yes. Whether an agent abstained is usually detectable by pattern matching (phrases like 'I don't have enough information') or by checking that it produced no citation; both are cheap, deterministic checks. Save human or LLM-judge review for edge cases where the abstention language is ambiguous.
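
The two cheap signals can be combined so that a response is auto-labeled only when they agree and is sent to review otherwise. This is a minimal sketch: the phrase list is a small illustrative assumption that would need tuning to a real agent's wording, and `detect_abstention` and its return labels are hypothetical names.

```python
import re

# Illustrative phrases only; extend to match the agent's actual wording.
ABSTENTION_PATTERNS = [
    r"\bI (?:don't|do not) have enough information\b",
    r"\binsufficient (?:evidence|information)\b",
    r"\b(?:cannot|can't) (?:answer|determine)\b",
]
_ABSTAIN_RE = re.compile("|".join(ABSTENTION_PATTERNS), re.IGNORECASE)


def detect_abstention(answer: str, citations: list[str]) -> str:
    """Return 'abstained', 'answered', or 'needs_review'."""
    # Normalize curly apostrophes so "don’t" matches "don't".
    text = answer.replace("\u2019", "'").strip()
    if not text and not citations:
        return "abstained"  # no answer and no citation
    phrase_hit = bool(_ABSTAIN_RE.search(text))
    no_citation = not citations
    if phrase_hit and no_citation:
        return "abstained"
    if not phrase_hit and not no_citation:
        return "answered"
    # Signals disagree: abstention wording with a citation, or a plain
    # answer without one. Escalate to a human or LLM judge.
    return "needs_review"


print(detect_abstention("I don't have enough information to answer that.", []))
print(detect_abstention("The refund window is 30 days.", ["doc-12"]))
print(detect_abstention("The refund window is 30 days.", []))
```

Routing disagreements to review keeps the deterministic path conservative: an uncited but confident answer is not silently counted as an abstention just because it lacks a citation.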

Is abstention testing specific to RAG agents?

The general principle — a good agent should signal uncertainty rather than fabricate — applies to any agent that can be asked something outside its actual capability, but RAG agents are where it's most directly testable, because you control exactly what's in the retrieved context.

What's the relationship between abstention and the EU AI Act?

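Abstention testing is one of the few RAG properties that lines up closely with a regulatory obligation. The Act's requirements for high-risk systems include transparency about the system's capabilities and limitations, and abstention test results are concrete evidence of how the system behaves when a question falls outside what its sources support. Treat the results as supporting evidence for that obligation rather than as proof of compliance. See the mapping guide for the full picture.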
