How to Test Whether a RAG Agent Knows When to Say 'I Don't Know'
Ask a RAG agent something its retrieved documents don't cover, and a surprising number of agents answer anyway: confidently, fluently, and wrong. This failure mode is among the most damaging RAG defects, because a fabricated but plausible answer looks the same as a correct one until someone checks the source. Abstention testing exists to catch exactly this.
Quick answer: Abstention testing means deliberately asking a RAG agent questions its retrieved context cannot answer, and checking that it says so instead of guessing. A well-built test set includes a meaningful share of genuinely unanswerable questions alongside the answerable ones. Most homegrown RAG evals skip these entirely, even though unanswerable questions are exactly where fabrication happens.
Why do most RAG evals skip abstention testing?
Because it's easier to write questions the corpus can answer and check the answer's quality than to deliberately construct questions it can't answer and verify the agent notices. Building good unanswerable cases takes more care: the question must be plausible enough that a shortcut-taking agent would be tempted to guess, not so obviously out of scope that abstaining is trivial.
How do you construct a good unanswerable test case?
Three approaches, in increasing order of difficulty to build (and value):
- Out-of-corpus questions — plausible questions about the domain that simply aren't covered by any retrieved document. Easiest to build, still useful.
- Adjacent-but-different questions — the corpus covers a related topic, but not the specific fact asked about, tempting an agent to extrapolate.
- Partially-covered questions — the context supports part of the answer but not all of it; the correct behavior is a hedged, partial answer that flags the gap, not silence and not full fabrication.
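One way to keep the three kinds of case distinct in a test set is to tag each question with its kind and derive the expected behavior from the tag. The sketch below is illustrative only: the names `CaseKind`, `AbstentionCase`, and `not_fully_answerable_share` are hypothetical, and the example questions are invented for an imaginary support corpus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class CaseKind(Enum):
    ANSWERABLE = "answerable"
    OUT_OF_CORPUS = "out_of_corpus"
    ADJACENT = "adjacent_but_different"
    PARTIAL = "partially_covered"


# Expected behavior per kind, following the three approaches described above.
EXPECTED_BEHAVIOR = {
    CaseKind.ANSWERABLE: "answer",
    CaseKind.OUT_OF_CORPUS: "abstain",
    CaseKind.ADJACENT: "abstain",
    CaseKind.PARTIAL: "hedged_partial_answer",
}


@dataclass(frozen=True)
class AbstentionCase:
    question: str
    kind: CaseKind

    @property
    def expected_behavior(self) -> str:
        return EXPECTED_BEHAVIOR[self.kind]


def not_fully_answerable_share(cases: Sequence[AbstentionCase]) -> float:
    """Fraction of cases the retrieved context cannot fully answer."""
    if not cases:
        return 0.0
    count = sum(1 for case in cases if case.kind is not CaseKind.ANSWERABLE)
    return count / len(cases)


# Invented questions; assume the corpus documents password resets and
# monthly-plan refunds, and nothing else.
CASES = [
    AbstentionCase("How do I reset my password?", CaseKind.ANSWERABLE),
    AbstentionCase("Do you offer on-site installation?", CaseKind.OUT_OF_CORPUS),
    AbstentionCase("What is the refund window for annual plans?", CaseKind.ADJACENT),
    AbstentionCase("How do I reset my password and recover deleted files?", CaseKind.PARTIAL),
]
```

Tracking the share of cases that are not fully answerable makes it easy to notice when a test set has drifted back toward answerable questions only.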
What's the actual pass/fail check?
For each unanswerable test case: did the agent abstain (explicitly declined, or produced no answer and no citation), or did it produce an answer as if the evidence supported it? There's no partial credit for a fabricated answer that happens to be true by coincidence — the failure is answering without support, regardless of outcome.
| Behavior on an unanswerable case | Verdict |
|---|---|
| Explicitly states insufficient evidence | Pass |
| Hedges with a stated caveat, no confident claim | Partial pass |
| Answers confidently with a citation | Fail (fabrication) |
| Answers confidently with no citation | Fail (fabrication) |
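The verdict table can be encoded directly as a lookup, which keeps grading consistent across runs. This is a minimal sketch that assumes each response on an unanswerable case has already been classified into one of the four behaviors; the names `Behavior`, `VERDICTS`, and `summarize` are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Sequence


class Behavior(Enum):
    STATES_INSUFFICIENT_EVIDENCE = "states insufficient evidence"
    HEDGES_WITH_CAVEAT = "hedges with a stated caveat"
    CONFIDENT_WITH_CITATION = "answers confidently with a citation"
    CONFIDENT_NO_CITATION = "answers confidently with no citation"


# One entry per row of the verdict table.
VERDICTS = {
    Behavior.STATES_INSUFFICIENT_EVIDENCE: "pass",
    Behavior.HEDGES_WITH_CAVEAT: "partial pass",
    Behavior.CONFIDENT_WITH_CITATION: "fail",
    Behavior.CONFIDENT_NO_CITATION: "fail",
}


def summarize(behaviors: Sequence[Behavior]) -> Dict[str, float]:
    """Return the share of each verdict across a set of unanswerable cases."""
    if not behaviors:
        return {}
    counts = Counter(VERDICTS[b] for b in behaviors)
    total = len(behaviors)
    return {v: counts.get(v, 0) / total for v in ("pass", "partial pass", "fail")}
```

Note that the lookup never consults whether a fabricated answer happened to be true: a confident answer on an unanswerable case fails either way.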
How does this fit into a full groundedness rubric?
Abstention testing is one of at least four things a groundedness rubric needs — alongside claim-level attribution, citation verification, and testing against a reference panel of known-quality agents. See the full breakdown in the main groundedness guide.
Related
- How to eval a RAG agent's groundedness — the full rubric abstention testing is part of.
- What is a reference-panel harness — how abstention gets scored against known-good vs. sabotaged agents.
- How to map an eval to the EU AI Act — how abstention testing maps to a transparency obligation.
- The customer-support RAG groundedness pack — a verified pack where abstention on account-specific questions is a core check.
- Customer-support RAG agents benchmark — that pack's discriminating-power results.
- Capability packs
FAQ
A good unanswerable test question is plausible for the domain but genuinely uncovered by the retrieved context; it is not a nonsense question the agent would obviously reject. The test is meaningful only if a fabricated answer would look believable to someone who didn't check the sources.
Abstention is not the right response to every ambiguous question. Ambiguous but partially answerable questions should get a partial, hedged answer with the gap stated, not a blanket refusal. Reserve hard abstention for questions the context genuinely cannot answer at all; over-abstaining is its own failure mode and hurts usefulness.
The share of unanswerable cases should be large enough to be a real signal, not an afterthought: a pack with a handful of token unanswerable cases next to dozens of answerable ones is still mostly testing the wrong thing. Treat unanswerable cases as a first-class portion of the test set, not a footnote.
Abstention checks can mostly be automated. Whether an agent abstained is usually detectable by pattern (phrases like 'I don't have enough information') or by checking that it produced no citation, both of which are cheap, deterministic checks. Save human or LLM-judge review for edge cases where the abstention language is ambiguous.
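A pattern-plus-citation check of this kind might look like the sketch below. The phrase list is an assumption and would need tuning to the agent's actual refusal wording; `detect_abstention` and its three return labels are hypothetical names, and the response format (answer text plus a list of citation IDs) is assumed.

```python
import re
from typing import Sequence

# Assumed abstention phrasings; extend to match the agent under test.
ABSTENTION_PATTERNS = [
    r"\bi don['’]?t have enough information\b",
    r"\bi don['’]?t know\b",
    r"\b(?:insufficient|not enough) (?:evidence|information|context)\b",
    r"\b(?:cannot|can['’]?t) (?:answer|find|determine)\b",
]
_ABSTAIN_RE = re.compile("|".join(ABSTENTION_PATTERNS), re.IGNORECASE)


def detect_abstention(answer: str, citations: Sequence[str]) -> str:
    """Classify a response as 'abstained', 'answered', or 'needs_review'."""
    text = answer.strip()
    has_phrase = bool(_ABSTAIN_RE.search(text))
    if not text and not citations:
        return "abstained"  # no answer and no citation
    if has_phrase and not citations:
        return "abstained"  # explicit refusal with nothing cited
    if has_phrase and citations:
        # Refusal language next to cited claims is ambiguous:
        # route it to a human or LLM judge.
        return "needs_review"
    return "answered"
```

For example, `detect_abstention("I don't have enough information to answer that.", [])` returns `"abstained"`, while a refusal phrase that appears alongside citations is routed to review rather than scored automatically.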
The general principle — a good agent should signal uncertainty rather than fabricate — applies to any agent that can be asked something outside its actual capability, but RAG agents are where it's most directly testable, because you control exactly what's in the retrieved context.
Abstention testing is one of the few RAG properties that maps directly onto a regulatory obligation under the EU AI Act: it's concrete evidence of 'transparency about the system's limitations,' one of the Act's obligations for high-risk systems. See the mapping guide for the full picture.