Original data · discriminating power
RAG groundedness benchmark
Discriminating-power results across every RAG-groundedness pack — same test method, different domains and risk levels, compared side by side.
Reference panel · known quality vs. pack score
| Pack | Risk tier | Scope | Grade | Good | Broken | Sabotaged |
|---|---|---|---|---|---|---|
| Medical RAG — groundedness & abstention | High risk | 15 test cases | C | 1.00 | 0.63 | 0.40 |
| Legal contract RAG — groundedness & abstention | Medium risk | 15 test cases | B | 1.00 | 0.57 | 0.25 |
| Financial reporting RAG — groundedness & abstention | High risk | 15 test cases | B | 1.00 | 0.63 | 0.33 |
| Customer-support RAG — groundedness & abstention | Low risk | 15 test cases | A | 1.00 | 0.57 | 0.31 |
| General-knowledge RAG — groundedness & abstention | Low risk | 15 test cases | A | 1.00 | 0.60 | 0.31 |
| General RAG groundedness — draft submission (failed) | Low risk | 2 test cases | F | 1.00 | 0.50 | 0.60 |
The "General RAG groundedness — draft submission" pack did not meet the verification bar and was held back rather than approved for sale. It is shown here for transparency alongside the packs that passed.
A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?
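As a rough illustration of that gap, the sketch below recomputes it from the scores in the table above. The `PackResult` class, the `gap` and `ordered` properties, and the shortened pack names are hypothetical and introduced only for this example. The page does not state how letter grades are derived, so this is not the grading formula. It only checks two things a reader can verify from the table: how far apart the good and sabotaged scores are, and whether the scores fall in the expected order (good above broken above sabotaged).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackResult:
    """One row of the reference panel (pack names shortened for display)."""
    name: str
    good: float
    broken: float
    sabotaged: float

    @property
    def gap(self) -> float:
        # Distance between the known-good and the sabotaged agent's score.
        # Assumption: a plain difference, rounded to two decimals.
        return round(self.good - self.sabotaged, 2)

    @property
    def ordered(self) -> bool:
        # True when the pack ranks the agents good > broken > sabotaged.
        return self.good > self.broken > self.sabotaged


# Scores copied from the table above.
PANEL = [
    PackResult("Medical RAG", 1.00, 0.63, 0.40),
    PackResult("Legal contract RAG", 1.00, 0.57, 0.25),
    PackResult("Financial reporting RAG", 1.00, 0.63, 0.33),
    PackResult("Customer-support RAG", 1.00, 0.57, 0.31),
    PackResult("General-knowledge RAG", 1.00, 0.60, 0.31),
    PackResult("General RAG groundedness (draft)", 1.00, 0.50, 0.60),
]

for pack in PANEL:
    print(f"{pack.name:34s} gap={pack.gap:.2f} ordered={pack.ordered}")
```

On these numbers the draft submission is the only pack that scores the sabotaged agent above the broken one, and it also has the smallest gap (0.40), which is consistent with it being held back.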