
Capability packs

Does the agent do the task well? These packs cover RAG correctness, tool calling, and retrieval quality.

Catches confident fabrication with fake citations. Scores groundedness, citation accuracy, and whether the agent abstains when evidence is missing.
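As a rough illustration of two of these scores, the sketch below computes citation accuracy and an abstention check in Python. It is not the pack's implementation: the `Answer` type, the field names, and both function names are hypothetical, and it assumes each citation is a document id that can be compared against the set of ids actually retrieved. It also assumes abstaining when evidence is available counts as a miss. Groundedness is left out because it usually needs an entailment model or a judge rather than a set comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    """Hypothetical container for one agent response."""
    text: str
    cited_ids: list[str] = field(default_factory=list)
    abstained: bool = False


def citation_accuracy(answer: Answer, retrieved_ids: set[str]) -> float:
    """Fraction of cited ids that appear in the retrieved set.

    A cited id that was never retrieved is treated as a fabricated citation.
    """
    if not answer.cited_ids:
        return 1.0  # nothing cited, so nothing fabricated
    real = sum(1 for cid in answer.cited_ids if cid in retrieved_ids)
    return real / len(answer.cited_ids)


def abstention_correct(answer: Answer, evidence_available: bool) -> bool:
    """True when the agent abstains exactly when evidence is missing.

    Assumption: abstaining despite available evidence is also scored as wrong.
    """
    return answer.abstained == (not evidence_available)
```

For example, an answer citing `["d1", "d9"]` when only `{"d1", "d2"}` was retrieved gets a citation accuracy of 0.5, since `d9` does not correspond to any retrieved document.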

Verifies function/tool selection, argument correctness, and recovery from tool errors. Checks are deterministic where possible; a judge scores only free-form fields.