Original data · discriminating power

RAG groundedness benchmark

Discriminating-power results across every RAG-groundedness pack — same test method, different domains and risk levels, compared side by side.

Reference panel · known quality vs. pack score
| Pack | Risk tier | Scope | Grade | Good | Broken | Sabotaged |
| --- | --- | --- | --- | --- | --- | --- |
| Medical RAG — groundedness & abstention | High risk | 15 test cases | C | 1.00 | 0.63 | 0.40 |
| Legal contract RAG — groundedness & abstention | Medium risk | 15 test cases | B | 1.00 | 0.57 | 0.25 |
| Financial reporting RAG — groundedness & abstention | High risk | 15 test cases | B | 1.00 | 0.63 | 0.33 |
| Customer-support RAG — groundedness & abstention | Low risk | 15 test cases | A | 1.00 | 0.57 | 0.31 |
| General-knowledge RAG — groundedness & abstention | Low risk | 15 test cases | A | 1.00 | 0.60 | 0.31 |
| General RAG groundedness — draft submission (failed) | Low risk | 2 test cases | F | 1.00 | 0.50 | 0.60 |

The General RAG groundedness draft submission did not meet the verification bar, so it was held back rather than approved for sale. It is shown here for transparency alongside the packs that passed.

A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?
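
As a rough illustration of that gap, the sketch below recomputes it from the scores in the reference panel. The `PackResult` class, the gap definition (good score minus sabotaged score), and the ordering check are assumptions made for this example; the page does not state how grades or the verification bar are actually computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackResult:
    """One row of the reference panel (hypothetical container for this sketch)."""
    name: str
    good: float       # pack score for the known-good agent
    broken: float     # pack score for the broken agent
    sabotaged: float  # pack score for the sabotaged agent

    def gap(self) -> float:
        # Assumed definition: known-good score minus sabotaged score.
        return round(self.good - self.sabotaged, 2)

    def ordered(self) -> bool:
        # Assumed sanity check: scores should fall as agent quality falls.
        return self.good > self.broken > self.sabotaged


# Scores copied from the reference panel; pack names shortened for display.
RESULTS = [
    PackResult("Medical RAG", 1.00, 0.63, 0.40),
    PackResult("Legal contract RAG", 1.00, 0.57, 0.25),
    PackResult("Financial reporting RAG", 1.00, 0.63, 0.33),
    PackResult("Customer-support RAG", 1.00, 0.57, 0.31),
    PackResult("General-knowledge RAG", 1.00, 0.60, 0.31),
    PackResult("General RAG groundedness (draft)", 1.00, 0.50, 0.60),
]


def summarize(results):
    """Return (name, gap, ordered) for each pack."""
    return [(r.name, r.gap(), r.ordered()) for r in results]


if __name__ == "__main__":
    for name, gap, ordered in summarize(RESULTS):
        print(f"{name:34s} gap={gap:.2f} ordered={ordered}")
```

Under these assumptions, the packs that passed show gaps between 0.60 and 0.75, while the draft submission shows a gap of 0.40 and scores the sabotaged agent above the broken one.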
