How to Eval a Legal Contract Q&A Agent for Groundedness

A contract Q&A agent that grounds its answer in the wrong clause is worse than one that admits uncertainty — someone may act directly on a cited clause number without re-reading the source. Legal-document RAG needs the same core groundedness rubric as any RAG eval, with one addition: citation precision matters more, because clause numbers are the interface a human actually checks.

Quick answer: Eval a legal contract Q&A agent on the same groundedness rubric as any RAG agent (grounding, citation accuracy, abstention), but weight exact clause-level citation more heavily than topical correctness. Add dedicated test cases for questions the specific contract version doesn't cover; the agent should abstain on these rather than answer from generic contract-law knowledge.

What's different about grounding in a contract versus a general document?

Contracts are structured and cross-referential: a clause can be modified, carved out, or superseded elsewhere in the same document. A groundedness check that only verifies "the cited clause exists and is topically related" misses the more common real failure — citing a clause that's been modified by another one the agent didn't retrieve or didn't weigh correctly.
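One way to catch the dropped-modifier failure is to annotate, per contract, which clauses modify which, and then flag any answer that cites a clause without its modifiers. The sketch below assumes a hand-built map; the name `MODIFIED_BY` and the clause numbers are hypothetical, not taken from any real contract or library.

```python
# Hypothetical annotation: clause id -> ids of clauses that modify, carve out,
# or supersede it in this specific contract. Assumed to be built by a reviewer.
MODIFIED_BY = {
    "9.1": {"9.3", "14.2"},
    "5.2": set(),
}


def missing_modifiers(cited, modified_by):
    """Map each cited clause to the modifying clauses the answer did not cite."""
    cited = set(cited)
    gaps = {}
    for clause in sorted(cited):
        dropped = modified_by.get(clause, set()) - cited
        if dropped:
            gaps[clause] = sorted(dropped)
    return gaps
```

An empty result means every cited clause came with its annotated modifiers. A non-empty result is a prompt for review, not an automatic failure, since not every question depends on every modifier.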

What should a contract-QA test set specifically include?

| Test case type | What it catches |
| --- | --- |
| Single-clause factual questions | Basic grounding and citation accuracy |
| Multi-clause questions (rule + carve-out) | Whether the agent integrates related clauses instead of citing only the first match |
| Questions about a clause absent from this contract version | Whether the agent abstains instead of answering from generic contract-law knowledge |
| Questions with a plausible but wrong clause number in the prompt | Whether the agent corrects a mistaken premise rather than confirming it |
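The four case types above can be encoded so the test set itself is checked for coverage before any agent is run against it. This is a minimal sketch: `CaseType`, `ContractCase`, and the sample questions and clause numbers are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CaseType(Enum):
    SINGLE_CLAUSE = "single-clause factual question"
    MULTI_CLAUSE = "rule plus carve-out"
    ABSENT_CLAUSE = "clause absent from this contract version"
    WRONG_PREMISE = "wrong clause number in the prompt"


@dataclass(frozen=True)
class ContractCase:
    case_type: CaseType
    question: str
    expected_clauses: frozenset = frozenset()  # empty when the agent should abstain
    should_abstain: bool = False


# Illustrative cases only; clause numbers and wording are assumptions.
CASES = [
    ContractCase(CaseType.SINGLE_CLAUSE, "What is the termination notice period?",
                 frozenset({"9.1"})),
    ContractCase(CaseType.MULTI_CLAUSE, "Can either party terminate without notice?",
                 frozenset({"9.1", "9.3"})),
    ContractCase(CaseType.ABSENT_CLAUSE, "What does the non-compete clause require?",
                 should_abstain=True),
    ContractCase(CaseType.WRONG_PREMISE, "Clause 12.4 sets the notice period at 30 days, right?",
                 frozenset({"9.1"})),
]


def missing_case_types(cases):
    """Return the case types that have no test case in the set."""
    present = {case.case_type for case in cases}
    return [t for t in CaseType if t not in present]
```

Running `missing_case_types(CASES)` before an eval run is a cheap guard against a test set that quietly contains only single-clause questions.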

How strict should citation checking be?

Strict: require the exact clause identifier, not a paraphrase of its content. "The termination notice is 30 days" without a clause reference is not verifiable the same way "Clause 9.1: 30 days" is — and clause-number precision is exactly what makes an answer checkable against the source document at all.
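A strict check of this kind can be sketched with a regular expression that pulls clause identifiers out of the answer and compares them to the expected set. The pattern assumes citations are written as "Clause 9.1" or "Section 12.3(a)"; a real contract's numbering scheme may need a different pattern, and the function names here are hypothetical.

```python
import re

# Assumed citation format: "Clause 9.1" or "Section 12.3(a)".
CLAUSE_REF = re.compile(
    r"\b(?:Clause|Section)\s+(\d+(?:\.\d+)*(?:\([a-z]\))?)",
    re.IGNORECASE,
)


def cited_clauses(answer):
    """Extract the set of clause identifiers explicitly cited in an answer."""
    return set(CLAUSE_REF.findall(answer))


def citation_check(answer, expected):
    """Compare cited clause identifiers to the expected ones, exactly."""
    cited = cited_clauses(answer)
    expected = set(expected)
    return {
        "has_citation": bool(cited),
        "exact_match": cited == expected,
        "missing": sorted(expected - cited),
        "unexpected": sorted(cited - expected),
    }
```

Under this check, an answer that is right in substance but attributes the rule to another clause shows up in `unexpected`, and an answer with no clause reference at all fails `has_citation` instead of passing on topical similarity.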

How does this connect to standard coverage?

Contract Q&A agents used in regulated contexts (procurement, employment, data-processing agreements) often intersect with obligations like the EU AI Act's transparency and logging requirements. See the standard-mapping guide for how citation-level traceability maps directly onto "logging capable of enabling traceability".

FAQ

Why does contract Q&A need a different eval than general RAG groundedness?

The core rubric (grounding, citation accuracy, abstention) is the same, but contracts add domain-specific risk: clauses interact (a termination clause can be modified by an amendment elsewhere), and getting the exact clause number wrong is a more serious failure than in general-knowledge RAG, because someone may act on that citation directly.

What's the biggest failure mode specific to contract Q&A agents?

Citing the right general topic but the wrong specific clause — for example, answering a termination-notice question correctly in substance but attributing it to the liability clause instead of the termination clause. This passes casual review far more often than an outright wrong answer.

Should a contract Q&A eval test multi-clause questions?

Yes, where realistic — questions whose answer depends on reading two related clauses together (a general rule plus a carve-out) are common in real contracts and a frequent source of dropped nuance in agent answers.

How should the eval handle jurisdiction- or version-specific contract language?

Treat a question about a clause not present in the specific contract version being queried as an unanswerable case — the correct behavior is abstaining or flagging the mismatch, not answering from a different contract's typical language.
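Scoring these unanswerable cases can be as simple as comparing what the agent should have done with what it did. The sketch below assumes each result has already been reduced to a `(should_abstain, did_abstain)` pair; how abstention is detected in the agent's output is left open. It also tracks abstention on answerable questions, an addition beyond the rubric above, included because an agent that abstains on everything would otherwise score perfectly.

```python
def abstention_scores(results):
    """Score abstention behavior.

    results: iterable of (should_abstain, did_abstain) boolean pairs.
    Returns rates in [0, 1], or None when a group has no cases.
    """
    results = list(results)
    unanswerable = [did for should, did in results if should]
    answerable = [did for should, did in results if not should]
    return {
        "correct_abstention_rate": (
            sum(unanswerable) / len(unanswerable) if unanswerable else None
        ),
        "false_abstention_rate": (
            sum(answerable) / len(answerable) if answerable else None
        ),
    }
```

For example, two unanswerable cases with one abstention and four answerable cases with one abstention give a correct-abstention rate of 0.5 and a false-abstention rate of 0.25.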

Does a citation-accuracy check need clause numbers, or is a paraphrase enough?

It needs the actual clause identifier (e.g. 'Clause 9.1'), not a paraphrase of its content — a paraphrase can be vague enough to seem right regardless of which clause the agent actually pulled from, which defeats the point of checking citation accuracy at all.

Is this eval useful outside of legal-tech products?

Yes — any RAG agent operating over structured, clause-like documents (insurance policies, internal compliance policies, technical standards) faces the same failure modes and benefits from the same test design.
