Original data · discriminating power

Tool-calling correctness benchmark

Discriminating-power results for tool-calling packs: correct tool selection, argument accuracy, and asking for clarification instead of guessing.

Reference panel · known quality vs. pack score
Pack                      Risk tier    Scope          Grade  Good  Broken  Sabotaged
Tool-calling correctness  Medium risk  15 test cases  A      1.00  0.70    0.00

Only one pack tests this method today. This table will grow into a real leaderboard as more domain packs are added for it (see the Tool-calling correctness packs).

A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence that the meter works. This is mutation testing applied to evals: does the pack catch the planted bug?
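As a rough illustration of that gap, the sketch below takes the three reference-panel scores from the table and computes two simple checks. The names `PanelScores`, `discrimination_gap`, and `ranks_panel_correctly` are hypothetical, and defining the gap as good minus sabotaged is an assumption made for this example. It is not the formula behind the Grade column, which this page does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelScores:
    """Pack scores for the three reference agents (hypothetical container)."""
    good: float
    broken: float
    sabotaged: float


def discrimination_gap(scores: PanelScores) -> float:
    # Assumed definition: distance between the known-good and sabotaged agents.
    return scores.good - scores.sabotaged


def ranks_panel_correctly(scores: PanelScores) -> bool:
    # The pack should order the agents good > broken > sabotaged.
    return scores.good > scores.broken > scores.sabotaged


# Values copied from the reference panel table above.
tool_calling = PanelScores(good=1.00, broken=0.70, sabotaged=0.00)

gap = discrimination_gap(tool_calling)
ordered = ranks_panel_correctly(tool_calling)
print(f"gap={gap:.2f} ordered={ordered}")  # prints "gap=1.00 ordered=True"
```

Under these assumptions, a pack that gave all three agents the same score would have a gap of 0 and fail the ordering check, which is the eval equivalent of a test suite that lets a planted mutation survive.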
