Original data · robustness across the catalog

Robustness benchmark

This benchmark measures how well every verification pack retains its discriminating power when its own test items are perturbed in semantics-preserving ways: documents or tools reordered, distractors injected. A pack keyed to the exact surface form of its frozen test set collapses here.
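As a rough illustration of the two perturbations named above, the sketch below reorders an item's documents and injects distractors while leaving the question and gold answer untouched. The `PackItem` shape and both function names are hypothetical, introduced only for this example; the benchmark's actual item schema and perturbation code are not shown on this page.

```python
import random
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class PackItem:
    """Hypothetical test item; the real pack schema may differ."""
    question: str
    docs: tuple[str, ...]  # passages (or tool specs) shown to the agent
    expected: str          # gold answer, never touched by a perturbation


def reorder_docs(item: PackItem, rng: random.Random) -> PackItem:
    """Shuffle passage order; the set of passages stays the same."""
    docs = list(item.docs)
    rng.shuffle(docs)
    return replace(item, docs=tuple(docs))


def inject_distractors(
    item: PackItem, distractors: Sequence[str], rng: random.Random
) -> PackItem:
    """Insert irrelevant passages at random positions among the originals."""
    docs = list(item.docs)
    for distractor in distractors:
        # randint is inclusive on both ends, so len(docs) means "append".
        docs.insert(rng.randint(0, len(docs)), distractor)
    return replace(item, docs=tuple(docs))
```

Both helpers return a new item and take an explicit `random.Random`, so a seeded generator reproduces the same perturbed set on every run.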

Every pack · discrimination retained under input perturbation
| Pack | Method | Risk tier | Robustness | Grade |
| --- | --- | --- | --- | --- |
| Medical RAG — groundedness & abstention | RAG groundedness | High risk | 1.00 | B |
| Legal contract RAG — groundedness & abstention | RAG groundedness | Medium risk | 1.00 | B |
| Financial reporting RAG — groundedness & abstention | RAG groundedness | High risk | 1.00 | B |
| Customer-support RAG — groundedness & abstention | RAG groundedness | Low risk | 1.00 | A |
| General-knowledge RAG — groundedness & abstention | RAG groundedness | Low risk | 1.00 | B |
| Tool-calling correctness | Tool-calling correctness | Medium risk | 1.00 | B |
| General RAG groundedness — draft submission | RAG groundedness | Low risk | 0.00 | F |
| Browser agent — prompt-injection red-team | Prompt-injection defense | High risk | skip | B |
| EU AI Act — high-risk conformance | AI Act obligation checklist | High risk | skip | A |

skip = robustness is a documented skip for this cell. The pack's reference agents decide on structural flags, not perturbable content, so there is no honest perturbation to run (this applies to browser-injection and ai-act-checklist). A skip is not a failure: the axis simply doesn't apply to that item shape.

Robustness, meaning robustness across the input space, is one of the six authority axes. A pack that ranks the panel correctly only on the exact wording and order of its frozen items is measuring surface form, not capability.
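One way to picture "discrimination retained" is as the fraction of perturbed items on which a pack still ranks its panel of reference agents in the expected order. The sketch below is an assumption made for illustration, not the benchmark's documented scoring formula; `retained_discrimination`, `rank_panel`, and the `None`-for-skip convention are all hypothetical.

```python
from typing import Callable, Optional, Sequence

Ranking = Sequence[str]  # agent names, best first


def retained_discrimination(
    rank_panel: Callable[[object], Ranking],
    perturbed_items: Sequence[object],
    expected: Ranking,
) -> Optional[float]:
    """Fraction of perturbed items where the panel is still ranked as expected.

    Returns None when there is nothing to perturb, mirroring a "skip" cell.
    This is an assumed scoring rule, not the benchmark's own.
    """
    if not perturbed_items:
        return None
    hits = sum(
        1 for item in perturbed_items
        if list(rank_panel(item)) == list(expected)
    )
    return hits / len(perturbed_items)
```

Under this reading, a pack that keeps the expected ranking on every perturbed item scores 1.0, and one that only ranks correctly on the frozen wording scores 0.0.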
