Original data · discriminating power

AI Act obligation-checklist benchmark

How the EU AI Act obligation checklist ranks a compliant, partially-compliant, and non-compliant reference submission.

Reference panel · known quality vs. pack score
Pack | Risk tier | Scope | Grade | Good | Broken | Sabotaged
EU AI Act — high-risk conformance | High risk | 8 obligations | A | 1.00 | 0.72 | 0.05

Only one pack tests this method today. This table will grow into a real leaderboard as more domain packs are added for it (see the AI Act obligation-checklist packs).

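A good pack scores the known-good reference submission high and the sabotaged one near zero. Here the good submission scores 1.00, the broken one 0.72, and the sabotaged one 0.05, a gap of 0.95 between the two extremes. That gap is the evidence the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?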
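
The check described here can be sketched in a few lines of Python. This is only an illustration using the scores from the reference panel; the names `PanelRow`, `discrimination_gap`, and `ranks_correctly` are hypothetical and not part of any pack's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelRow:
    """One row of the reference panel (hypothetical structure)."""
    pack: str
    good: float       # score of the compliant reference submission
    broken: float     # score of the partially compliant submission
    sabotaged: float  # score of the non-compliant submission


def discrimination_gap(row: PanelRow) -> float:
    """Distance between the known-good and the sabotaged score."""
    return row.good - row.sabotaged


def ranks_correctly(row: PanelRow) -> bool:
    """True when the pack orders the submissions by their known quality."""
    return row.good > row.broken > row.sabotaged


# Scores taken from the reference panel table.
row = PanelRow("EU AI Act — high-risk conformance", 1.00, 0.72, 0.05)

gap = discrimination_gap(row)
print(f"gap = {gap:.2f}, ranking preserved = {ranks_correctly(row)}")
```

A pack that scored the sabotaged submission as high as the good one would show a gap near zero, which is the eval equivalent of a test suite that lets a mutant survive.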
