Original data · discriminating power
AI Act obligation-checklist benchmark
How the EU AI Act obligation checklist ranks compliant, partially compliant, and non-compliant reference submissions.
Reference panel · known quality vs. pack score
| Pack | Risk tier | Scope | Grade | Good | Broken | Sabotaged |
|---|---|---|---|---|---|---|
| EU AI Act — high-risk conformance | High risk | 8 obligations | A | 1.00 | 0.72 | 0.05 |
Only one pack tests this method today. This table will grow into a real leaderboard as more domain packs are added for it (see the AI Act obligation-checklist packs).
A good pack scores the known-good agent high and the sabotaged one near zero; here the scores are 1.00 and 0.05. That gap is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?
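The gap described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `PackResult` class and its `gap` and `ranks_in_order` methods are hypothetical names, not part of any published tooling, and the only data used are the three scores from the reference panel table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackResult:
    """Scores (0.0 to 1.0) one pack gave to the three reference submissions.

    Hypothetical container for illustration; not the benchmark's own API.
    """
    good: float
    broken: float
    sabotaged: float

    def gap(self) -> float:
        # Distance between the known-good score and the sabotaged score.
        return self.good - self.sabotaged

    def ranks_in_order(self) -> bool:
        # True when the pack orders the submissions by their known quality.
        return self.good > self.broken > self.sabotaged


# Values copied from the reference panel table.
ai_act = PackResult(good=1.00, broken=0.72, sabotaged=0.05)

print(f"gap = {ai_act.gap():.2f}")                    # gap = 0.95
print(f"ranks in order: {ai_act.ranks_in_order()}")   # ranks in order: True
```

A pack with a small gap, or one that scored the broken submission above the good one, would fail these checks. In mutation-testing terms, that would be a planted bug that goes undetected.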