capability

Tool-calling correctness

Tool-calling · general · medium-risk
framework · promptfoo
author · Vincent
cert · AG-26-0143
Verification report

Verifies function/tool selection, argument correctness, and recovery from tool errors. Deterministic where possible; judge only for free-form fields.
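As an illustration of what a deterministic check of this kind can look like, here is a minimal Python sketch. It is not the pack's actual implementation: the call shape (a dict with "name" and "arguments"), the function name check_tool_call, and the free_form parameter are all assumptions made for the example. It covers tool selection and argument correctness only; recovery from tool errors would need a multi-turn harness that is not shown here.

```python
from typing import Any


def check_tool_call(
    actual: dict[str, Any],
    expected: dict[str, Any],
    free_form: frozenset[str] = frozenset(),
) -> list[str]:
    """Return a list of failure messages; an empty list means the call passed.

    Hypothetical call shape: {"name": str, "arguments": dict}.
    """
    failures: list[str] = []

    # 1. Tool selection: an exact match on the tool name.
    if actual.get("name") != expected["name"]:
        failures.append(f"wrong tool: {actual.get('name')!r}")
        return failures  # argument checks are meaningless for the wrong tool

    actual_args = actual.get("arguments", {})
    expected_args = expected["arguments"]

    # 2. Argument correctness: exact comparison for every structured field.
    for key, want in expected_args.items():
        if key in free_form:
            continue  # free-form fields are left to a judge, not compared here
        if key not in actual_args:
            failures.append(f"missing argument: {key}")
        elif actual_args[key] != want:
            failures.append(f"bad argument {key}: {actual_args[key]!r}")

    # 3. Arguments the expected call does not define at all.
    for key in sorted(actual_args.keys() - expected_args.keys()):
        failures.append(f"unexpected argument: {key}")

    return failures
```

Calling check_tool_call with a matching tool name and matching structured arguments returns an empty list. Any field named in free_form is skipped, which mirrors the split described above: exact checks where possible, a judge only where the value is free text.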

No data leakage: 0.99
Ungameable: 0.92
Deterministic: 1.00
Discriminating power: 0.95
Standard coverage: 0.88
Discriminating power · reference panel
Reference agent    Known quality    Pack score
Strict-Tool-ref    good             0.96
Sloppy-Tool-ref    broken           0.52

A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence the meter works.
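The gap for this reference panel can be computed directly from the two pack scores listed above. The short Python sketch below does only that subtraction. The function name discrimination_gap is hypothetical, and the page does not say how the 0.95 "Discriminating power" figure is derived, so this is not a reconstruction of that metric.

```python
def discrimination_gap(scores: dict[str, float], good: str, broken: str) -> float:
    """Difference between the known-good and known-broken agent scores."""
    # Rounded to two decimals to match the precision of the reported scores.
    return round(scores[good] - scores[broken], 2)


# Pack scores as reported in the reference panel.
panel = {"Strict-Tool-ref": 0.96, "Sloppy-Tool-ref": 0.52}
gap = discrimination_gap(panel, "Strict-Tool-ref", "Sloppy-Tool-ref")
print(gap)  # 0.44
```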
