Original data · discriminating power
RAG agents benchmark
Discriminating-power results across known-good, broken and sabotaged medical-RAG reference agents. Measured with the Medical RAG — groundedness & abstention pack.
Reference panel · known quality vs. pack score
| Reference agent | Domain | Known quality | Pack score |
|---|---|---|---|
| Grounded-RAG-ref | medical-rag | good | 0.94 |
| Loose-RAG-ref | medical-rag | broken | 0.41 |
| Fabricator-ref | medical-rag | sabotaged | 0.07 |
A good pack scores the known-good agent high and the sabotaged one near zero. Here the scores fall from 0.94 (good) through 0.41 (broken) to 0.07 (sabotaged), and that gap is the evidence the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?
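The check described above can be sketched in a few lines of Python. This is only an illustration built from the three scores in the panel, not the pack's actual implementation: the names `PANEL`, `QUALITY_ORDER`, `discriminating_gap` and `ranks_in_quality_order` are hypothetical, and the assumption that "broken" should land between "good" and "sabotaged" is ours.

```python
# Pack scores copied from the reference panel table.
PANEL = [
    ("Grounded-RAG-ref", "good", 0.94),
    ("Loose-RAG-ref", "broken", 0.41),
    ("Fabricator-ref", "sabotaged", 0.07),
]

# Assumed ordering of the known-quality labels, best to worst (hypothetical).
QUALITY_ORDER = ["good", "broken", "sabotaged"]


def discriminating_gap(panel):
    """Score of the known-good agent minus the score of the sabotaged one."""
    by_quality = {quality: score for _, quality, score in panel}
    return by_quality["good"] - by_quality["sabotaged"]


def ranks_in_quality_order(panel):
    """True if pack scores strictly decrease as known quality gets worse."""
    by_quality = {quality: score for _, quality, score in panel}
    scores = [by_quality[q] for q in QUALITY_ORDER]
    return all(a > b for a, b in zip(scores, scores[1:]))


if __name__ == "__main__":
    print(f"gap = {discriminating_gap(PANEL):.2f}")       # gap = 0.87
    print(f"ordered = {ranks_in_quality_order(PANEL)}")   # ordered = True
```

A wide gap together with a correct ordering is what "catching the planted bug" looks like in numbers. A pack that scored the sabotaged agent above the broken one would fail the ordering check even if its good-to-sabotaged gap stayed large.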