Original data · discriminating power

RAG agents benchmark

Name: RAG agents benchmark
Creator: AgentGrading.ai
License: https://creativecommons.org/licenses/by/4.0/

Discriminating-power results across known-good, broken and sabotaged medical-RAG reference agents. Measured with the Medical RAG — groundedness & abstention pack.

Reference panel · known quality vs. pack score

Reference agent	Domain	Known quality	Pack score
Grounded-RAG-ref	medical-rag	good	0.94
Loose-RAG-ref	medical-rag	broken	0.41
Fabricator-ref	medical-rag	sabotaged	0.07

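A good pack scores the known-good agent high and the sabotaged one near zero. Here Grounded-RAG-ref scores 0.94 and Fabricator-ref scores 0.07, a gap of 0.87, with the broken Loose-RAG-ref in between at 0.41. That gap is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?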
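The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the pack's actual scoring code: the names `PANEL`, `QUALITY_RANK`, `discrimination_gap` and `ranks_agree` are hypothetical, and the ranking good > broken > sabotaged is an assumption read off the table's labels. Only the three scores come from the reference panel.

```python
# Scores copied from the reference panel; everything else is illustrative.
PANEL = [
    ("Grounded-RAG-ref", "good", 0.94),
    ("Loose-RAG-ref", "broken", 0.41),
    ("Fabricator-ref", "sabotaged", 0.07),
]

# Assumed ordering of the known-quality labels (higher is better).
QUALITY_RANK = {"good": 2, "broken": 1, "sabotaged": 0}


def discrimination_gap(panel):
    """Lowest known-good score minus highest sabotaged score."""
    good = [score for _, quality, score in panel if quality == "good"]
    sabotaged = [score for _, quality, score in panel if quality == "sabotaged"]
    return min(good) - max(sabotaged)


def ranks_agree(panel):
    """True if every better-quality agent outscores every worse-quality one."""
    return all(
        score_a > score_b
        for _, quality_a, score_a in panel
        for _, quality_b, score_b in panel
        if QUALITY_RANK[quality_a] > QUALITY_RANK[quality_b]
    )


if __name__ == "__main__":
    print(f"gap = {discrimination_gap(PANEL):.2f}")
    print(f"ranking preserved = {ranks_agree(PANEL)}")
```

A gap near zero, or a `ranks_agree` result of `False`, would suggest that the pack fails to separate the planted defects from the known-good agent. What counts as a large enough gap is a judgment call that the source does not specify.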