capability

Tool-calling correctness

Tool-calling · general · medium-risk

framework · promptfoo
author · Vincent
cert · AG-26-0143

A · Verified

Verification report

Verifies function/tool selection, argument correctness, and recovery from tool errors. Deterministic where possible; judge only for free-form fields.
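A deterministic check of tool selection and argument correctness can be sketched as below. This is only an illustration, not the pack's actual implementation: the tool name `get_weather`, its arguments, and the helper `check_tool_call` are hypothetical, and the call format (a dict with `name` and `arguments`) is an assumption.

```python
from typing import Any

# Hypothetical expected call; the tool name and fields are illustrative only.
EXPECTED = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}


def check_tool_call(actual: dict[str, Any], expected: dict[str, Any]) -> dict[str, bool]:
    """Compare one tool call against the expected call using exact matching."""
    # Tool selection: the agent must have picked the expected function.
    name_ok = actual.get("name") == expected["name"]
    # Argument correctness: arguments must be a dict equal to the expected one.
    args = actual.get("arguments")
    args_ok = isinstance(args, dict) and args == expected["arguments"]
    # Arguments only count as correct when the right tool was selected.
    return {"tool_selected": name_ok, "arguments_correct": name_ok and args_ok}


result = check_tool_call(
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}},
    EXPECTED,
)
wrong_tool = check_tool_call({"name": "search", "arguments": {}}, EXPECTED)
```

Exact matching like this gives the same verdict on every run, which is why it suits structured fields; free-form fields would need a separate judge step that is not shown here.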

Criterion	Score
No data leakage	0.99
Ungameable	0.92
Deterministic	1.00
Discriminating power	0.95
Standard coverage	0.88

Discriminating power · reference panel

Reference agent	Known quality	Pack score
Strict-Tool-ref	good	0.96
Sloppy-Tool-ref	broken	0.52

A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence the meter works. Here the known-good agent scores 0.96 and the sabotaged one 0.52, a gap of 0.44.
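The gap between the two reference agents can be computed directly from the panel. The sketch below assumes the gap is the lowest known-good score minus the highest known-broken score; it does not reproduce how the 0.95 discriminating-power figure is derived, which the report does not state.

```python
# Reference panel from the table: agent -> (known quality, pack score).
panel = {
    "Strict-Tool-ref": ("good", 0.96),
    "Sloppy-Tool-ref": ("broken", 0.52),
}

good_scores = [score for quality, score in panel.values() if quality == "good"]
broken_scores = [score for quality, score in panel.values() if quality == "broken"]

# Assumed definition: worst good agent minus best broken agent.
gap = round(min(good_scores) - max(broken_scores), 2)
```

A positive gap means every known-good agent outscores every sabotaged one; a gap at or below zero would mean the pack cannot tell them apart.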