Original data · discriminating power

AI Act obligation-checklist benchmark

How the EU AI Act obligation checklist ranks three reference submissions of known quality: compliant (Good), partially compliant (Broken), and non-compliant (Sabotaged).

Reference panel · known quality vs. pack score

Pack	Risk tier	Scope	Grade	Good	Broken	Sabotaged
EU AI Act — high-risk conformance	High risk	8 obligations	A	1.00	0.72	0.05

Only one pack tests this method today. This table will grow into a real leaderboard as more domain packs are added for it (see the AI Act obligation-checklist packs).

A good pack scores the known-good reference high and the sabotaged one near zero. In the panel above, that is 1.00 against 0.05. The gap is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?
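As a rough illustration of the gap described here, the sketch below computes it from the three scores in the reference panel. The names `PanelScores`, `discrimination_gap`, and `ranks_in_order` are hypothetical and chosen for this example. Defining the gap as the Good score minus the Sabotaged score is an assumption about how it might be summarized, not the benchmark's documented formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelScores:
    """Pack scores for the three reference submissions (hypothetical type)."""
    good: float
    broken: float
    sabotaged: float


def discrimination_gap(scores: PanelScores) -> float:
    # Assumed summary: known-good score minus sabotaged score.
    # Rounded to two decimals to match the precision shown in the table.
    return round(scores.good - scores.sabotaged, 2)


def ranks_in_order(scores: PanelScores) -> bool:
    # The pack discriminates if it orders the references by known quality.
    return scores.good > scores.broken > scores.sabotaged


# Scores copied from the reference panel row for the high-risk conformance pack.
panel = PanelScores(good=1.00, broken=0.72, sabotaged=0.05)

gap = discrimination_gap(panel)
ordered = ranks_in_order(panel)
print(f"gap={gap:.2f} ordered={ordered}")
```

A gap close to 1 with the references in the expected order suggests the pack catches the planted defect. A small gap, or a broken reference that outscores the good one, would suggest it does not.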
