The measuring stick
for AI evals.
Anyone can upload a test suite. We certify that an eval can be trusted — reproducible, ungameable, and anchored to real standards — before it grades anything. Then we let you buy the ones that pass.
There is no oracle for the perfect eval. So we measure packs against three things that need no oracle.
Structural validity
Is the suite reproducible? Does the golden dataset leak into the criteria? Is the rubric gameable? Provable without knowing the ‘best’ eval.
Discriminating power
We run the pack against a panel of agents whose quality we already know — good, broken, sabotaged. A real eval ranks them correctly and fails the sabotaged one.
Standard coverage
For conformance packs, we measure coverage against an external standard — EU AI Act, OWASP-agentic, NIST, ISO 42001 — not against our own opinion.
The authority is the ground. The marketplace runs on top.
Verification authority
The trusted method that decides whether an eval deducts trust or earns it. Reference panels, adversarial audit, standard anchoring. This is what can’t be cloned — because trust can’t be cloned.
Eval-pack marketplace
Buy packs that carry the stamp, ready to drop into your CI (Promptfoo, DeepEval, Ragas). Even if a pack goes free, “verified by the meter the market trusts” keeps its value.
One method, many meters — a different pack per task, domain, and risk.
Medical RAG — groundedness & abstention
Catches confident fabrication with fake citations. Scores groundedness, citation accuracy, and whether the agent abstains when evidence is missing.
Tool-calling correctness
Verifies function/tool selection, argument correctness, and recovery from tool errors. Deterministic where possible; judge only for free-form fields.
Browser agent — prompt-injection red-team
Adversarial web content that tries to make a computer-use agent exfiltrate data or take destructive actions. The test set is the attack, not a Q&A.
EU AI Act — high-risk conformance
Coverage checklist for high-risk obligations: logging, human oversight hooks, transparency, and robustness evidence. Anchored to the Act, not to opinion.