capability
Tool-calling correctness
Tool-calling · general · medium-risk
A · Verified
Verification report
Verifies function/tool selection, argument correctness, and recovery from tool errors. Checks are deterministic where possible; a judge is used only for free-form fields.
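As a rough illustration of the deterministic part, the sketch below compares one emitted tool call against an expected one. It is not the pack's actual checker: the `ToolCall` shape, the `check_tool_call` function, and the `free_form` parameter are hypothetical names introduced here, and error recovery is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass
class ToolCall:
    """One tool invocation emitted by an agent (hypothetical shape)."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def check_tool_call(
    expected: ToolCall,
    actual: ToolCall,
    free_form: FrozenSet[str] = frozenset(),
) -> Dict[str, bool]:
    """Compare a tool call to its expected form without any judge.

    Argument names listed in `free_form` are skipped here; in this sketch
    they are assumed to be scored separately by a judge.
    """
    strict_keys = set(expected.arguments) - free_form
    arguments_ok = (
        # No missing or unexpected strictly checked argument names.
        set(actual.arguments) - free_form == strict_keys
        # Exact value match on every strictly checked argument.
        and all(actual.arguments[k] == expected.arguments[k] for k in strict_keys)
    )
    return {
        "tool_selected": actual.name == expected.name,
        "arguments_correct": arguments_ok,
    }
```

Exact equality keeps the result reproducible from run to run, which is the property the "Deterministic" score refers to; anything that cannot be compared this way would fall to the judge.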
| Criterion | Score |
|---|---|
| No data leakage | 0.99 |
| Ungameable | 0.92 |
| Deterministic | 1.00 |
| Discriminating power | 0.95 |
| Standard coverage | 0.88 |
Discriminating power · reference panel
| Reference agent | Known quality | Pack score |
|---|---|---|
| Strict-Tool-ref | good | 0.96 |
| Sloppy-Tool-ref | broken | 0.52 |
A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence the meter works.
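The gap for the two panel rows can be computed directly. The snippet below is a minimal sketch using the scores from the reference panel table; `score_gap` is a hypothetical helper, and this simple difference is not claimed to be the formula behind the 0.95 "Discriminating power" score, which the report does not specify.

```python
# Pack scores copied from the reference panel table: agent -> (known quality, pack score).
panel = {
    "Strict-Tool-ref": ("good", 0.96),
    "Sloppy-Tool-ref": ("broken", 0.52),
}


def score_gap(panel):
    """Worst known-good score minus best known-broken score (hypothetical metric)."""
    good = [score for quality, score in panel.values() if quality == "good"]
    broken = [score for quality, score in panel.values() if quality == "broken"]
    return min(good) - max(broken)


gap = score_gap(panel)
print(f"{gap:.2f}")  # prints 0.44
```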