How to Eval a Legal Contract Q&A Agent for Groundedness
A contract Q&A agent that grounds its answer in the wrong clause is worse than one that admits uncertainty — someone may act directly on a cited clause number without re-reading the source. Legal-document RAG needs the same core groundedness rubric as any RAG eval, with one addition: citation precision matters more, because clause numbers are the interface a human actually checks.
Quick answer: Eval a legal contract Q&A agent on the same groundedness rubric as any RAG agent (grounding, citation accuracy, abstention), with extra weight on exact clause-level citation rather than topical correctness alone. Add dedicated test cases for questions the specific contract version doesn't cover; the agent should abstain on these rather than answer from generic contract-law knowledge.
What's different about grounding in a contract versus a general document?
Contracts are structured and cross-referential: a clause can be modified, carved out, or superseded elsewhere in the same document. A groundedness check that only verifies "the cited clause exists and is topically related" misses the more common real failure — citing a clause that's been modified by another one the agent didn't retrieve or didn't weigh correctly.
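One way to make this failure testable is to record, for each clause, which other clauses modify it, and then flag answers that cite a clause without citing its modifiers. The following is a minimal sketch under that assumption; the `Clause` type, the `modified_by` field, and the sample clause numbers are all hypothetical, and the check only follows one level of modification.

```python
from dataclasses import dataclass, field


@dataclass
class Clause:
    clause_id: str
    text: str
    # Hypothetical annotation: IDs of clauses that modify, carve out,
    # or supersede this clause elsewhere in the same contract.
    modified_by: list[str] = field(default_factory=list)


def missing_modifiers(cited: set[str], contract: dict[str, Clause]) -> set[str]:
    """Return clauses that modify a cited clause but were not cited themselves."""
    required: set[str] = set()
    for clause_id in cited:
        clause = contract.get(clause_id)
        if clause is not None:
            required.update(clause.modified_by)
    return required - cited


# Illustrative contract fragment (invented for this example).
contract = {
    "9.1": Clause("9.1", "Either party may terminate on 30 days' notice.",
                  modified_by=["14.3"]),
    "14.3": Clause("14.3", "Notwithstanding Clause 9.1, notice is 90 days "
                           "during the initial term."),
}

print(missing_modifiers({"9.1"}, contract))          # {'14.3'}
print(missing_modifiers({"9.1", "14.3"}, contract))  # set()
```

A non-empty result does not prove the answer is wrong, but it marks the answer for closer review: the cited clause exists and is on topic, yet a clause that changes its effect was left out.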
What should a contract-QA test set specifically include?
| Test case type | What it catches |
|---|---|
| Single-clause factual questions | Basic grounding and citation accuracy |
| Multi-clause questions (rule + carve-out) | Whether the agent integrates related clauses instead of citing only the first match |
| Questions about a clause absent from this contract version | Whether the agent abstains instead of answering from generic contract-law knowledge |
| Questions with a plausible but wrong clause number in the prompt | Whether the agent corrects a mistaken premise rather than confirming it |
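The four case types in the table could be encoded roughly as follows. This is a sketch, not a prescribed schema: the `ContractCase` type, the `CaseType` names, and the sample questions and clause numbers are invented for illustration. An empty `expected_clauses` set is used here to mean "the agent should abstain".

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CaseType(Enum):
    SINGLE_CLAUSE = "single_clause"
    MULTI_CLAUSE = "multi_clause"
    ABSENT_CLAUSE = "absent_clause"
    WRONG_PREMISE = "wrong_premise"


@dataclass(frozen=True)
class ContractCase:
    case_type: CaseType
    question: str
    # Clauses a correct answer must cite; empty means abstention is expected.
    expected_clauses: frozenset[str]


CASES = [
    ContractCase(CaseType.SINGLE_CLAUSE,
                 "What is the termination notice period?",
                 frozenset({"9.1"})),
    ContractCase(CaseType.MULTI_CLAUSE,
                 "What notice period applies during the initial term?",
                 frozenset({"9.1", "14.3"})),
    ContractCase(CaseType.ABSENT_CLAUSE,
                 "What does the non-compete clause require?",
                 frozenset()),
    ContractCase(CaseType.WRONG_PREMISE,
                 "Clause 7.2 sets the termination notice period. How long is it?",
                 frozenset({"9.1"})),
]


def should_abstain(case: ContractCase) -> bool:
    """A case with no expected clauses is one the agent should decline to answer."""
    return not case.expected_clauses


def uncovered_types(cases: list[ContractCase]) -> set[CaseType]:
    """Return case types from the table that the test set does not exercise."""
    counts = Counter(case.case_type for case in cases)
    return {t for t in CaseType if counts[t] == 0}


print(uncovered_types(CASES))  # set()
```

Running `uncovered_types` before an eval is a cheap guard against a test set that quietly drops one row of the table, most often the abstention or wrong-premise cases.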
How strict should citation checking be?
Strict: require the exact clause identifier, not a paraphrase of its content. "The termination notice is 30 days" without a clause reference is not verifiable the same way "Clause 9.1: 30 days" is — and clause-number precision is exactly what makes an answer checkable against the source document at all.
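A strict check of this kind can be sketched as extracting clause identifiers from the answer text and comparing them with the expected set. The citation format assumed by the regular expression ("Clause 9.1", "clause 14.3(b)") and the function names are assumptions for illustration; a real contract's numbering scheme may need a different pattern.

```python
import re

# Assumed citation format: the word "Clause" followed by a dotted number,
# optionally with a lettered sub-paragraph, e.g. "Clause 9.1" or "clause 14.3(b)".
CLAUSE_REF = re.compile(r"\b[Cc]lause\s+(\d+(?:\.\d+)*(?:\([a-z]\))?)")


def cited_clauses(answer: str) -> set[str]:
    """Extract every clause identifier the answer explicitly cites."""
    return set(CLAUSE_REF.findall(answer))


def citation_is_exact(answer: str, expected: set[str]) -> bool:
    """True only if the answer cites exactly the expected clause identifiers."""
    return cited_clauses(answer) == expected


print(citation_is_exact("The termination notice is 30 days.", {"9.1"}))  # False
print(citation_is_exact("Clause 9.1: 30 days", {"9.1"}))                 # True
print(citation_is_exact("Clause 12.4: 30 days", {"9.1"}))                # False
```

Exact set equality also fails answers that cite extra clauses alongside the right one. That is a deliberate choice in this sketch; a looser variant could accept any superset of the expected clauses, at the cost of rewarding agents that cite everything they retrieved.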
How does this connect to standard coverage?
Contract Q&A agents used in regulated contexts (procurement, employment, data-processing agreements) often intersect with obligations like the EU AI Act's transparency and logging requirements. See the standard-mapping guide for how citation-level traceability maps directly onto "logging capable of enabling traceability".
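As an illustration of what citation-level traceability might look like in practice, each answer could be logged with the clauses retrieved and the clauses cited. The record below is a sketch only: the `CitationTrace` type and its field names are hypothetical, and nothing here is taken from the EU AI Act or from the standard-mapping guide.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CitationTrace:
    timestamp: str                 # ISO 8601 string supplied by the caller
    contract_version: str          # which version of the contract was queried
    question: str
    retrieved_clauses: list[str]   # everything the retriever returned
    cited_clauses: list[str]       # what the answer actually cited
    abstained: bool


def to_log_line(trace: CitationTrace) -> str:
    """Serialize one trace as a single JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = CitationTrace(
    timestamp="2025-01-15T10:30:00Z",
    contract_version="msa-v3",
    question="What is the termination notice period?",
    retrieved_clauses=["9.1", "14.3", "2.5"],
    cited_clauses=["9.1"],
    abstained=False,
)
print(to_log_line(trace))
```

Logging retrieved and cited clauses separately lets a reviewer see later whether a missed carve-out was never retrieved or was retrieved and then ignored.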
Related
- How to eval a RAG agent's groundedness — the base rubric this extends.
- How to detect a gameable eval — relevant here because clause ordering can create positional shortcuts.
- How to map an eval to the EU AI Act — the standard-coverage angle for regulated contract use.
- The legal contract RAG groundedness pack — a verified pack implementing this rubric.
- Legal contract RAG agents benchmark — that pack's discriminating-power results.
- Capability packs
FAQ
Compared with a general RAG eval, the core rubric (grounding, citation accuracy, abstention) is the same, but contracts add domain-specific risk: clauses interact (a termination clause can be modified by an amendment elsewhere), and getting the exact clause number wrong is a more serious failure than in general-knowledge RAG, because someone may act on that citation directly.
A contract-specific failure that is easy to miss is citing the right general topic but the wrong specific clause, for example answering a termination-notice question correctly in substance but attributing it to the liability clause instead of the termination clause. This passes casual review far more often than an outright wrong answer.
Include multi-clause questions where realistic: questions whose answer depends on reading two related clauses together (a general rule plus a carve-out) are common in real contracts and a frequent source of dropped nuance in agent answers.
Treat a question about a clause not present in the specific contract version being queried as an unanswerable case — the correct behavior is abstaining or flagging the mismatch, not answering from a different contract's typical language.
A citation needs the actual clause identifier (e.g. 'Clause 9.1'), not a paraphrase of its content. A paraphrase can be vague enough to seem right regardless of which clause the agent actually pulled from, which defeats the point of checking citation accuracy at all.
This approach carries over beyond contracts: any RAG agent operating over structured, clause-like documents (insurance policies, internal compliance policies, technical standards) faces the same failure modes and benefits from the same test design.