Original data · discriminating power

RAG groundedness benchmark

Discriminating-power results across every RAG-groundedness pack — same test method, different domains and risk levels, compared side by side.

Reference panel · known quality vs. pack score

Pack	Risk tier	Scope	Grade	Good	Broken	Sabotaged
Medical RAG — groundedness & abstention	High risk	15 test cases	C	1.00	0.63	0.40
Legal contract RAG — groundedness & abstention	Medium risk	15 test cases	B	1.00	0.57	0.25
Financial reporting RAG — groundedness & abstention	High risk	15 test cases	B	1.00	0.63	0.33
Customer-support RAG — groundedness & abstention	Low risk	15 test cases	A	1.00	0.57	0.31
General-knowledge RAG — groundedness & abstention	Low risk	15 test cases	A	1.00	0.60	0.31
General RAG groundedness — draft submission (failed)	Low risk	2 test cases	F	1.00	0.50	0.60

The "General RAG groundedness — draft submission" pack did not meet the verification bar and was held back rather than approved for sale. It is shown here for transparency alongside the packs that passed.

A good pack scores the known-good agent high and the sabotaged one near zero. The gap between those two scores is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?
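As a rough illustration of that gap, the sketch below recomputes it from the scores in the reference panel. It assumes the gap is simply the good score minus the sabotaged score, and adds a hypothetical ordering check (good > broken > sabotaged). The page does not say how grades are actually derived, so `PackResult`, `gap`, and `ordered` are illustrative names and the pack names are shortened, not part of the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackResult:
    """One row of the reference panel (pack names shortened for display)."""
    name: str
    good: float
    broken: float
    sabotaged: float

    @property
    def gap(self) -> float:
        # Assumption: discriminating power read as good minus sabotaged.
        return round(self.good - self.sabotaged, 2)

    @property
    def ordered(self) -> bool:
        # Hypothetical sanity check: scores should fall as quality falls.
        return self.good > self.broken > self.sabotaged


# Scores copied from the table above.
RESULTS = [
    PackResult("Medical RAG", 1.00, 0.63, 0.40),
    PackResult("Legal contract RAG", 1.00, 0.57, 0.25),
    PackResult("Financial reporting RAG", 1.00, 0.63, 0.33),
    PackResult("Customer-support RAG", 1.00, 0.57, 0.31),
    PackResult("General-knowledge RAG", 1.00, 0.60, 0.31),
    PackResult("General RAG groundedness (draft)", 1.00, 0.50, 0.60),
]

# Print packs from widest to narrowest gap.
for r in sorted(RESULTS, key=lambda r: r.gap, reverse=True):
    print(f"{r.name:34s} gap={r.gap:.2f} ordered={r.ordered}")
```

Under these assumptions the draft submission is the only row that fails the ordering check, because its sabotaged agent (0.60) outscored its broken agent (0.50).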
