Original data · discriminating power

Tool-calling correctness benchmark

Discriminating-power results for tool-calling packs: correct tool selection, argument accuracy, and asking for clarification instead of guessing.
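To make the three checks concrete, here is a minimal sketch of how a single test case might be scored. The `score_case` function, the `tool`/`args` dictionary layout, and the partial-credit rule are all assumptions made for illustration; the actual pack may score cases differently.

```python
from typing import Any, Optional


def score_case(expected: Optional[dict[str, Any]],
               actual: Optional[dict[str, Any]]) -> float:
    """Score one hypothetical case in [0, 1].

    A value of None means "ask for clarification and call no tool".
    """
    if expected is None or actual is None:
        # Clarification check: full credit only when both sides made no call.
        return 1.0 if expected is actual else 0.0
    if actual["tool"] != expected["tool"]:
        return 0.0  # tool selection check failed
    want = expected["args"]
    if not want:
        return 1.0  # right tool, no arguments to verify
    got = actual["args"]
    # Argument accuracy: fraction of expected arguments reproduced exactly.
    matched = sum(1 for key, value in want.items() if got.get(key) == value)
    return matched / len(want)


expected_call = {"tool": "get_weather", "args": {"city": "Oslo", "unit": "C"}}
actual_call = {"tool": "get_weather", "args": {"city": "Oslo", "unit": "F"}}
print(score_case(expected_call, actual_call))  # 0.5
print(score_case(None, actual_call))           # 0.0, guessed instead of asking
```

This sketch does not penalize extra arguments the agent adds; a stricter scorer could compare the two argument dictionaries for exact equality instead.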

Reference panel · known quality vs. pack score

Pack	Risk tier	Scope	Grade	Good agent	Broken agent	Sabotaged agent
Tool-calling correctness	Medium risk	15 test cases	A	1.00	0.70	0.00

Only one pack tests this method today. This table will grow into a leaderboard as more domain packs are added for it (see the Tool-calling correctness packs).

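A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence that the pack discriminates. It is mutation testing applied to evals: does the pack catch the planted bug? Here the gap spans the full range, with 1.00 for the known-good agent and 0.00 for the sabotaged one.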
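As a rough illustration of that gap, the sketch below computes it from the reference panel row. The `PanelRow` type and the `discrimination_gap` and `ranks_in_order` helpers are hypothetical names introduced for this example, not part of the benchmark's own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelRow:
    """Scores one pack gave to three reference agents of known quality."""
    good: float       # known-good agent
    broken: float     # agent from the Broken column (assumed partially degraded)
    sabotaged: float  # agent with a planted bug


def discrimination_gap(row: PanelRow) -> float:
    """Spread between the known-good and sabotaged scores (hypothetical metric)."""
    return row.good - row.sabotaged


def ranks_in_order(row: PanelRow) -> bool:
    """True when the pack orders the agents good >= broken >= sabotaged."""
    return row.good >= row.broken >= row.sabotaged


# Values taken from the reference panel table.
tool_calling = PanelRow(good=1.00, broken=0.70, sabotaged=0.00)

gap = discrimination_gap(tool_calling)
print(f"gap={gap:.2f} ordered={ranks_in_order(tool_calling)}")
# prints: gap=1.00 ordered=True
```

A pack that scored the sabotaged agent close to the known-good one would show a gap near zero, which suggests it is not catching the planted bug.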
