Original data · robustness across the catalog

Robustness benchmark

How every verification pack holds its discriminating power under semantics-preserving perturbation of its own test items — reordered docs/tools, injected distractors. A pack keyed to the exact surface form of its frozen test set collapses here.

Every pack · discrimination retained under input perturbation

Pack	Method	Risk tier	Robustness	Grade
Medical RAG — groundedness & abstention	RAG groundedness	High risk	1.00	B
Legal contract RAG — groundedness & abstention	RAG groundedness	Medium risk	1.00	B
Financial reporting RAG — groundedness & abstention	RAG groundedness	High risk	1.00	B
Customer-support RAG — groundedness & abstention	RAG groundedness	Low risk	1.00	A
General-knowledge RAG — groundedness & abstention	RAG groundedness	Low risk	1.00	B
Tool-calling correctness	Tool-calling correctness	Medium risk	1.00	B
General RAG groundedness — draft submission	RAG groundedness	Low risk	0.00	F
Browser agent — prompt-injection red-team	Prompt-injection defense	High risk	skip	B
EU AI Act — high-risk conformance	AI Act obligation checklist	High risk	skip	A

skip = robustness is a documented skip for this cell: its reference agents decide on structural flags, not perturbable content, so there's no honest perturbation to run (browser-injection, ai-act-checklist). Not a failure — the axis simply doesn't apply to that item shape.

Robustness is one of the six authority axes — robustness across the input space. A pack that only ranks the panel correctly on the exact wording and order of its frozen items was measuring surface form, not capability.

← Back to benchmarks