Verification authority · AI evaluations

The measuring stick
for AI evals.

Anyone can upload a test suite. We certify that an eval can be trusted — reproducible, ungameable, and anchored to real standards — before it grades anything. Then we let you buy the ones that pass.

Reference panel · discriminating power
Grounded-RAG-ref · good · 0.94
Loose-RAG-ref · broken · 0.41
Fabricator-ref · sabotaged · 0.07
A pack that lets the sabotaged agent pass is a bad pack. We measure the meter, not the agent.

The authority: the foundation

There is no oracle for the perfect eval. So we measure packs against three things that need no oracle.

Axis 01

Structural validity

Is the suite reproducible? Does the golden dataset leak into the criteria? Is the rubric gameable? Provable without knowing the ‘best’ eval.
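
One of these checks, golden-dataset leakage, can be sketched in a few lines. This is a minimal illustration, not our audit itself: it assumes golden answers and rubric criteria are plain strings, and it only catches verbatim (case- and whitespace-insensitive) leaks. The function name `leaked_answers` is hypothetical.

```python
from typing import List


def leaked_answers(golden_answers: List[str], criteria: List[str]) -> List[str]:
    """Return the golden answers that appear verbatim inside any rubric criterion.

    Assumption: a substring match after lowercasing and collapsing whitespace
    is a crude but useful leakage signal. Paraphrased leaks are not detected.
    """
    normalized_criteria = [" ".join(c.lower().split()) for c in criteria]
    leaks = []
    for answer in golden_answers:
        needle = " ".join(answer.lower().split())
        # Skip empty answers: an empty string is a substring of everything.
        if needle and any(needle in c for c in normalized_criteria):
            leaks.append(answer)
    return leaks
```

A pack whose criteria return a non-empty list here would be flagged for review, since an agent could pass by echoing the rubric rather than answering the question.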

Axis 02

Discriminating power

We run the pack against a panel of agents whose quality we already know — good, broken, sabotaged. A real eval ranks them correctly and fails the sabotaged one.
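
As a rough sketch of that rule, assume a pack produces one score between 0 and 1 for each reference agent, as in the panel shown earlier. The pass threshold of 0.5 and the function name `discriminates` are illustrative assumptions, not published values or a real API.

```python
from typing import Dict, List

# Reference agents ordered from best to worst known quality.
# The names mirror the reference panel on this page.
PANEL_ORDER: List[str] = ["Grounded-RAG-ref", "Loose-RAG-ref", "Fabricator-ref"]


def discriminates(scores: Dict[str, float],
                  panel_order: List[str] = PANEL_ORDER,
                  pass_threshold: float = 0.5) -> bool:
    """Return True if the pack ranks the panel correctly and fails the worst agent.

    pass_threshold is an assumed cut-off chosen for illustration only.
    """
    ordered = [scores[name] for name in panel_order]
    # Scores must strictly decrease from the best agent to the worst.
    ranked_correctly = all(a > b for a, b in zip(ordered, ordered[1:]))
    # The sabotaged agent (last in the panel) must score below the pass mark.
    sabotaged_fails = ordered[-1] < pass_threshold
    return ranked_correctly and sabotaged_fails
```

With the panel scores above (0.94, 0.41, 0.07) this check passes. A pack that scored all three agents at 0.7 or higher would fail it, even though its ranking might look plausible.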

Axis 03

Standard coverage

For conformance packs, we measure coverage against an external standard — EU AI Act, OWASP-agentic, NIST, ISO 42001 — not against our own opinion.
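
Coverage of this kind reduces to set arithmetic. The sketch below assumes each test in a pack is tagged with the requirement IDs it exercises. The IDs (`REQ-1` and so on) are placeholders, not real clause numbers from any of the standards named above, and `standard_coverage` is a hypothetical name.

```python
from typing import Dict, Set


def standard_coverage(required: Set[str], pack_tests: Dict[str, Set[str]]) -> float:
    """Fraction of the standard's requirement IDs exercised by at least one test.

    Tags that are not part of the standard are ignored, so a pack cannot
    inflate its coverage by citing requirements that do not exist.
    """
    if not required:
        raise ValueError("standard has no requirements to cover")
    # Union of every requirement ID claimed by any test in the pack.
    claimed = set().union(*pack_tests.values())
    covered = claimed & required
    return len(covered) / len(required)


# Placeholder requirement IDs, for illustration only.
required_ids = {"REQ-1", "REQ-2", "REQ-3", "REQ-4"}
tests = {"test_a": {"REQ-1"}, "test_b": {"REQ-2", "REQ-9"}}
coverage = standard_coverage(required_ids, tests)  # 0.5
```

The denominator comes from the external standard, not from the pack, which is what keeps the number independent of the pack author's own framing.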

One platform, two layers

The authority is the ground. The marketplace runs on top.

Layer 1 · Foundation

Verification authority

The trusted method that decides whether an eval earns trust or loses it. Reference panels, adversarial audit, standard anchoring. This is the part that can’t be cloned, because trust can’t be cloned.

Layer 2 · Distribution

Eval-pack marketplace

Buy packs that carry the stamp, ready to drop into your CI (Promptfoo, DeepEval, Ragas). Even if a pack goes free, “verified by the meter the market trusts” keeps its value.