What Is a Reference-Panel Harness for AI Eval Verification?

By VincentPublished 2026-06-19Updated 2026-07-03

Most eval packs are graded on how they look: clear rubric, plausible-sounding test cases, a nice report. None of that tells you whether the eval can actually distinguish a good agent from a bad one. A reference-panel harness answers a narrower, harder question directly: given agents of known quality, does this eval rank them correctly?

Quick answer: A reference-panel harness runs a candidate eval against a small set of agents whose quality is already known — one deliberately good, one realistically broken, one deliberately sabotaged — and checks whether the eval's scores rank them in that order. This is mutation testing applied to evals: you don't trust a test suite until you've confirmed it fails on a planted bug, and you shouldn't trust an eval until you've confirmed it fails a planted bad agent.

Why can't you just trust an eval that "looks right"?

Because looking right and discriminating correctly are unrelated properties. An eval built from plausible-sounding questions and a sensible rubric can still fail to separate quality levels if:

the scoring rewards something correlated with quality (answer length, confident tone) rather than quality itself
the test cases are all easy enough that every agent — good or bad — scores similarly
the rubric has a shortcut an agent can exploit without genuine capability

None of these show up by reading the eval. They only show up by running it against agents you already know the answer for.

What does a reference panel actually consist of?

At minimum, three behavioral archetypes:

Agent	Behavior	Expected result
Known-good	Grounded, cites correctly, abstains when evidence is missing	Scores high
Known-broken	Realistically sloppy — inconsistent citation, occasional overreach	Scores in the middle, clearly below good
Known-sabotaged	Deliberately fabricates: confident answers regardless of evidence, invented citations	Scores near zero

These don't need to be real production agents. They need to be controllable and reproducible, so the same pack run against the same panel produces the same ranking every time — see the determinism requirement below.

What's the pass/fail rule for a pack?

The non-negotiable check is ordering: good > broken > sabotaged. If a pack's scores don't preserve that order — and especially if the sabotaged agent scores anywhere close to the good one — the pack fails, independent of every other quality signal. A margin matters too: barely-separated scores (say, 0.61 vs. 0.58) suggest weak discriminating power even if the ordering is technically correct.

How is this different from a benchmark leaderboard?

A leaderboard ranks real agents against each other and assumes the eval itself is trustworthy. A reference-panel harness checks that assumption first, using agents of pre-established quality rather than agents you're trying to compare. See the RAG agents benchmark for a worked example of the panel's actual output on a live pack.

How to eval a RAG agent's groundedness — the concrete rubric this harness is applied to.
How to detect a gameable eval — the structural-validity checks that run alongside the panel.
How to map an eval to the EU AI Act — how coverage of an external standard fits alongside discriminating power.
The general-knowledge RAG groundedness pack — the control-cell pack this harness is demonstrated on.
Capability packs — packs verified with this method.
General-knowledge RAG agents benchmark — the panel's results on that pack.

FAQ

Do I need real production agents to build a reference panel?

No — the panel's job is to represent known behavior classes (good, broken, sabotaged), not to be realistic production traffic. A small set of deterministic, controllable agents that reproduce specific failure modes is more useful than a handful of real agents whose quality you'd have to establish by some other, circular method.

How many reference agents do I need in a panel?

Three is the practical minimum: one known-good, one known-broken (realistic sloppiness, not malice), and one known-sabotaged (a deliberate planted failure). More agents let you test finer-grained failure modes, but three is enough to prove basic discriminating power.

Isn't this just a fancy way of saying 'test the tests'?

Yes, and that's the point. An eval is a measurement instrument; instruments need calibration. A ruler that reports the same length for a 10cm and a 20cm object isn't a ruler, no matter how official it looks — the same logic applies to an eval that can't separate a good agent from a broken one.

What happens if a pack fails to catch the sabotaged agent?

It should fail outright, regardless of how well it scores on other axes like structural validity. A pack that a sabotaged agent can pass isn't measuring the thing it claims to measure — polish elsewhere doesn't compensate for that.

Can this replace human review of eval quality entirely?

No. It replaces the unfalsifiable claim 'trust me, this is a good eval' with a falsifiable one: 'here's the ranking it produced on agents of known quality.' A human still decides whether the panel itself represents realistic failure modes for the target domain.

Does the reference panel need to match the exact domain of the pack being tested?

The behavioral archetypes (good/broken/sabotaged) are largely domain-agnostic — grounding, citation accuracy, and abstention are the same underlying skills whether the domain is medical, legal, or general knowledge. The panel is reused across packs in the same cell rather than rebuilt per domain.

← Back to guides

What Is a Reference-Panel Harness for AI Eval Verification?

Why can't you just trust an eval that "looks right"?

What does a reference panel actually consist of?

What's the pass/fail rule for a pack?

How is this different from a benchmark leaderboard?

Related

FAQ