Verification authority · AI evaluations

The measuring stick
for AI evals.

Anyone can upload a test suite. We certify that an eval can be trusted — reproducible, ungameable, and anchored to real standards — before it grades anything. Then we let you buy the ones that pass.

Browse verified packs · How grading works

Reference panel · discriminating power

Grounded-RAG-ref · good · 0.94

Loose-RAG-ref · broken · 0.41

Fabricator-ref · sabotaged · 0.07

A pack that lets the sabotaged agent pass is a bad pack. We measure the meter, not the agent.

The authority — the foundation

There is no oracle for the perfect eval. So we measure packs against three things that need no oracle.

Axis 01

Structural validity

Is the suite reproducible? Does the golden dataset leak into the criteria? Is the rubric gameable? Each of these can be checked without knowing what the ‘best’ eval would be.
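A minimal sketch of what two of these checks could look like, assuming a pack exposes its golden answers, its criteria text, and a callable that runs the suite. The function names, the `min_len` cutoff, and the three-run default are illustrative assumptions, not the platform's actual audit.

```python
import hashlib
import json


def find_leaks(golden_answers, criteria_text, min_len=12):
    """Return golden answers that appear verbatim inside the grading criteria.

    Hypothetical helper: whitespace and case are normalised before matching,
    and very short answers are skipped to avoid trivial hits.
    """
    haystack = " ".join(criteria_text.lower().split())
    leaks = []
    for answer in golden_answers:
        needle = " ".join(answer.lower().split())
        if len(needle) >= min_len and needle in haystack:
            leaks.append(answer)
    return leaks


def run_fingerprint(results):
    """Hash a run's per-case results so two runs can be compared exactly."""
    canonical = json.dumps(results, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproducible(run_suite, n_runs=3):
    """True if repeated runs of the same suite yield identical results."""
    fingerprints = {run_fingerprint(run_suite()) for _ in range(n_runs)}
    return len(fingerprints) == 1
```

Neither check needs a reference to an ideal eval: a leak is a property of the pack's own files, and reproducibility is a property of its own repeated runs.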

Axis 02

Discriminating power

We run the pack against a panel of agents whose quality we already know — good, broken, sabotaged. A real eval ranks them correctly and fails the sabotaged one.
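The panel check can be sketched as follows, using the three reference scores shown on this page. The `PanelResult` type, the `discriminates` function, and the 0.5 pass threshold are assumptions made for illustration; the real threshold and ranking rule may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelResult:
    agent: str
    known_rank: int  # 1 = best known quality
    score: float     # score the pack under test assigned to this agent


def discriminates(results, sabotaged_agent, pass_threshold=0.5):
    """True if the pack orders the panel by known quality and fails the sabotaged agent.

    pass_threshold is an assumed value, not a documented one.
    """
    ordered = sorted(results, key=lambda r: r.known_rank)
    scores = [r.score for r in ordered]
    # Scores must strictly decrease as known quality decreases.
    ranked_correctly = all(a > b for a, b in zip(scores, scores[1:]))
    sabotaged = next(r for r in results if r.agent == sabotaged_agent)
    return ranked_correctly and sabotaged.score < pass_threshold


panel = [
    PanelResult("Grounded-RAG-ref", 1, 0.94),
    PanelResult("Loose-RAG-ref", 2, 0.41),
    PanelResult("Fabricator-ref", 3, 0.07),
]
print(discriminates(panel, "Fabricator-ref"))
```

A pack that scored Fabricator-ref above the threshold, or above either of the other two agents, would fail this check regardless of how it scored anything else.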

Axis 03

Standard coverage

For conformance packs, we measure coverage against an external standard — EU AI Act, OWASP-agentic, NIST, ISO 42001 — not against our own opinion.
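One simple way to express coverage is the share of a standard's required items that a pack actually tests, plus the list of items it misses. The sketch below assumes each requirement has a stable identifier; the identifiers used here are placeholders borrowed from the obligations named in the catalog, not official clause numbers.

```python
def standard_coverage(required, covered):
    """Return (coverage_ratio, missing_items) for a pack against a standard.

    `required` and `covered` are iterables of requirement identifiers.
    Items a pack covers that the standard does not require are ignored.
    """
    required_set = set(required)
    if not required_set:
        raise ValueError("the standard must list at least one requirement")
    hit = required_set & set(covered)
    missing = sorted(required_set - hit)
    return len(hit) / len(required_set), missing


# Placeholder requirement identifiers, for illustration only.
required = ["logging", "human-oversight", "transparency", "robustness"]
covered = ["logging", "transparency", "latency"]
ratio, missing = standard_coverage(required, covered)
print(ratio, missing)
```

Because the denominator comes from the external standard, a pack cannot raise its coverage by adding checks the standard never asked for.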

One platform, two layers

The authority is the ground. The marketplace runs on top.

Layer 1 · Foundation

Verification authority

The trusted method that decides whether an eval earns trust or forfeits it. Reference panels, adversarial audit, standard anchoring. This is the part that can’t be cloned, because trust can’t be cloned.

Layer 2 · Distribution

Eval-pack marketplace

Buy packs that carry the stamp, ready to drop into your CI (Promptfoo, DeepEval, Ragas). Even if a pack becomes free, the stamp “verified by the meter the market trusts” keeps its value.

Verified catalog

One method, many meters — a different pack per task, domain, and risk.

capability

A · Verified

Medical RAG — groundedness & abstention

Catches confident fabrication with fake citations. Scores groundedness, citation accuracy, and whether the agent abstains when evidence is missing.

RAG · medical literature · zero-hallucination

EU AI Act — Art. 53 / high-risk

€79 · AG-26-0142
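As a rough sketch of two of the signals this pack describes, the function below scores citation accuracy and abstention for a single case. It assumes the agent's citations and the retrieved documents share identifiers; the function name, field names, and the `pmid-…` style IDs are hypothetical, and groundedness of the answer text itself is not covered here.

```python
def grade_rag_case(cited_ids, retrieved_ids, abstained, evidence_available):
    """Score one RAG case for citation accuracy and correct abstention.

    A citation counts as fabricated if it does not match any retrieved document.
    Abstention is correct only when it happens exactly when evidence is missing.
    """
    retrieved = set(retrieved_ids)
    cited = list(cited_ids)
    fabricated = [c for c in cited if c not in retrieved]
    # None means the answer cited nothing, so accuracy is undefined.
    citation_accuracy = (len(cited) - len(fabricated)) / len(cited) if cited else None
    abstention_correct = abstained == (not evidence_available)
    return {
        "citation_accuracy": citation_accuracy,
        "fabricated": fabricated,
        "abstention_correct": abstention_correct,
    }


# Hypothetical identifiers: one real citation, one invented.
result = grade_rag_case(["pmid-1", "pmid-9"], ["pmid-1", "pmid-2"], False, True)
print(result)
```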

capability

A · Verified

Tool-calling correctness

Verifies function/tool selection, argument correctness, and recovery from tool errors. Deterministic where possible; judge only for free-form fields.

Tool-calling · general · medium-risk

€49 · AG-26-0143
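The "deterministic where possible; judge only for free-form fields" split can be sketched like this. The call format (a dict with `name` and `arguments`), the function name, and the example tool are assumptions for illustration, not the pack's actual schema.

```python
def grade_tool_call(expected, actual, free_form_fields=()):
    """Deterministically grade a tool call; return (passed, fields_for_judge).

    Tool name, argument keys, and all structured argument values must match
    exactly. Fields listed in free_form_fields are not compared here and are
    returned so a judge can assess them separately.
    """
    if actual.get("name") != expected["name"]:
        return False, []
    exp_args = expected.get("arguments", {})
    act_args = actual.get("arguments", {})
    if set(exp_args) != set(act_args):
        return False, []
    needs_judge = []
    for key, exp_value in exp_args.items():
        if key in free_form_fields:
            needs_judge.append(key)
        elif act_args[key] != exp_value:
            return False, []
    return True, needs_judge


# Hypothetical tool and arguments.
expected = {"name": "create_ticket", "arguments": {"priority": 2, "summary": "Login fails"}}
actual = {"name": "create_ticket", "arguments": {"priority": 2, "summary": "Users cannot log in"}}
print(grade_tool_call(expected, actual, free_form_fields=("summary",)))
```

Keeping the judge out of the structured fields means a wrong tool or a wrong argument value fails the case deterministically, whatever the free-form text says.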

safety

A · Verified

Browser agent — prompt-injection red-team

Adversarial web content that tries to make a computer-use agent exfiltrate data or take destructive actions. The test set is the attack, not a Q&A.

Browser/computer-use · web · high-risk

OWASP Top 10 for Agentic Apps

€89 · AG-26-0144

conformance

B · Verified

EU AI Act — high-risk conformance

Coverage checklist for high-risk obligations: logging, human oversight hooks, transparency, and robustness evidence. Anchored to the Act, not to opinion.

Any agent · EU market · high-risk

EU AI Act — Art. 53 / high-risk · ISO/IEC 42001

€99 · AG-26-0145