Original data · robustness across the catalog
Robustness benchmark
How every verification pack holds its discriminating power under semantics-preserving perturbation of its own test items — reordered docs/tools, injected distractors. A pack keyed to the exact surface form of its frozen test set collapses here.
Every pack · discrimination retained under input perturbation
| Pack | Method | Risk tier | Robustness | Grade |
|---|---|---|---|---|
| Medical RAG — groundedness & abstention | RAG groundedness | High risk | 1.00 | B |
| Legal contract RAG — groundedness & abstention | RAG groundedness | Medium risk | 1.00 | B |
| Financial reporting RAG — groundedness & abstention | RAG groundedness | High risk | 1.00 | B |
| Customer-support RAG — groundedness & abstention | RAG groundedness | Low risk | 1.00 | A |
| General-knowledge RAG — groundedness & abstention | RAG groundedness | Low risk | 1.00 | B |
| Tool-calling correctness | Tool-calling correctness | Medium risk | 1.00 | B |
| General RAG groundedness — draft submission | RAG groundedness | Low risk | 0.00 | F |
| Browser agent — prompt-injection red-team | Prompt-injection defense | High risk | skip | B |
| EU AI Act — high-risk conformance | AI Act obligation checklist | High risk | skip | A |
skip = robustness is a documented skip for this cell: its reference agents decide on structural flags, not perturbable content, so there's no honest perturbation to run (browser-injection, ai-act-checklist). Not a failure — the axis simply doesn't apply to that item shape.
Robustness is one of the six authority axes — robustness across the input space. A pack that only ranks the panel correctly on the exact wording and order of its frozen items was measuring surface form, not capability.