Original data · discriminating power

RAG agents benchmark

Discriminating-power results across known-good, broken and sabotaged medical-RAG reference agents. Measured with the Medical RAG — groundedness & abstention pack.

Reference panel · known quality vs. pack score
Reference agent    Domain       Known quality  Pack score
Grounded-RAG-ref   medical-rag  good           0.94
Loose-RAG-ref      medical-rag  broken         0.41
Fabricator-ref     medical-rag  sabotaged      0.07

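A good pack scores the known-good agent high and the sabotaged one near zero. Here the known-good agent scores 0.94 and the sabotaged one 0.07, a gap of 0.87, with the broken agent in between at 0.41. That gap is the evidence the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?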
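The check described above can be sketched in a few lines of Python. This is only an illustration, not the pack's actual scoring code: the scores are copied from the reference panel, while the names `PANEL`, `QUALITY_ORDER`, `discriminating_gap` and `ranks_match` are hypothetical helpers introduced here.

```python
# Scores copied from the reference panel; everything else is illustrative.
PANEL = [
    ("Grounded-RAG-ref", "good", 0.94),
    ("Loose-RAG-ref", "broken", 0.41),
    ("Fabricator-ref", "sabotaged", 0.07),
]

# Assumed ranking of the known-quality labels, best first.
QUALITY_ORDER = ["good", "broken", "sabotaged"]


def discriminating_gap(panel):
    """Pack score of the known-good agent minus that of the sabotaged one."""
    scores = {quality: score for _, quality, score in panel}
    return round(scores["good"] - scores["sabotaged"], 2)


def ranks_match(panel):
    """True if sorting agents by pack score reproduces the known-quality order."""
    by_score = sorted(panel, key=lambda row: row[2], reverse=True)
    return [quality for _, quality, _ in by_score] == QUALITY_ORDER


gap = discriminating_gap(PANEL)
ordered = ranks_match(PANEL)
print(f"gap={gap}, ordering preserved={ordered}")
```

A wide gap together with a preserved ordering is what one would hope to see from a pack that catches the planted bug. A narrow gap or a swapped ordering would suggest the pack is not discriminating between the agents.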
