Original data · discriminating power
Tool-calling correctness benchmark
Discriminating-power results for tool-calling packs: correct tool selection, argument accuracy, and asking for clarification instead of guessing.
Reference panel · known quality vs. pack score
| Pack | Risk tier | Scope | Grade | Good | Broken | Sabotaged |
|---|---|---|---|---|---|---|
| Tool-calling correctness | Medium risk | 15 test cases | A | 1.00 | 0.70 | 0.00 |
Only one pack tests this method today. This table will grow into a leaderboard as more domain packs are added for it (see the Tool-calling correctness packs).
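A good pack scores the known-good agent high and the sabotaged one near zero. In the panel above, the known-good agent scores 1.00, the broken agent 0.70, and the sabotaged agent 0.00. That gap is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?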
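The check described above can be sketched in a few lines. This is an illustrative sketch only, not the benchmark's actual scoring code: the names `PanelScores`, `discrimination_gap`, and `ranks_in_order` are hypothetical, and the only data used is the row from the reference panel table. It assumes scores lie in [0, 1] and treats the gap as the simple difference between the known-good and sabotaged scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelScores:
    """Pack scores for the three reference agents (hypothetical container)."""
    good: float       # agent known to be correct
    broken: float     # agent with partial defects
    sabotaged: float  # agent with a deliberately planted bug


def discrimination_gap(scores: PanelScores) -> float:
    """Difference between the known-good and sabotaged scores.

    Assumption: a larger gap means the pack separates the two agents better.
    """
    return scores.good - scores.sabotaged


def ranks_in_order(scores: PanelScores) -> bool:
    """True if the pack ranks the agents good > broken > sabotaged."""
    return scores.good > scores.broken > scores.sabotaged


# Values taken from the "Tool-calling correctness" row of the table.
panel = PanelScores(good=1.00, broken=0.70, sabotaged=0.00)
gap = discrimination_gap(panel)
ordered = ranks_in_order(panel)
print(f"gap={gap:.2f} ordered={ordered}")  # prints "gap=1.00 ordered=True"
```

A pack that scored the sabotaged agent as high as the known-good one would yield a gap near zero, which is the eval equivalent of a surviving mutant in mutation testing.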