What Is a Reference-Panel Harness for AI Eval Verification?

Most eval packs are graded on how they look: clear rubric, plausible-sounding test cases, a nice report. None of that tells you whether the eval can actually distinguish a good agent from a bad one. A reference-panel harness answers a narrower, harder question directly: given agents of known quality, does this eval rank them correctly?

Quick answer: A reference-panel harness runs a candidate eval against a small set of agents whose quality is already known — one deliberately good, one realistically broken, one deliberately sabotaged — and checks whether the eval's scores rank them in that order. This is mutation testing applied to evals: you don't trust a test suite until you've confirmed it fails on a planted bug, and you shouldn't trust an eval until you've confirmed it fails a planted bad agent.
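
As a rough illustration of that loop, the sketch below runs a toy pack against three stand-in agents and sorts them by score. Everything here is hypothetical: the `PACK` items, the exact-match grader in `score`, and the three lookup-based agents are assumptions made for the example, not part of any real harness.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]  # question -> answer

# Toy pack (hypothetical): (question, expected answer) pairs graded by exact match.
PACK: List[Tuple[str, str]] = [
    ("capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("author of Hamlet?", "Shakespeare"),
]
ANSWERS = dict(PACK)


def good(question: str) -> str:
    # Known-good stand-in: always returns the expected answer.
    return ANSWERS[question]


def broken(question: str) -> str:
    # Known-broken stand-in: gets one specific item wrong.
    return "5" if question == "2 + 2?" else ANSWERS[question]


def sabotaged(question: str) -> str:
    # Known-sabotaged stand-in: same confident answer regardless of the question.
    return "Definitely 42"


def score(agent: Agent, pack: List[Tuple[str, str]]) -> float:
    """Fraction of pack items the agent answers exactly right."""
    correct = sum(1 for question, expected in pack if agent(question) == expected)
    return correct / len(pack)


def rank_panel(pack: List[Tuple[str, str]], panel: Dict[str, Agent]) -> List[str]:
    """Return agent names ordered from highest to lowest score."""
    scores = {name: score(agent, pack) for name, agent in panel.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


ranking = rank_panel(PACK, {"sabotaged": sabotaged, "good": good, "broken": broken})
print(ranking)  # ['good', 'broken', 'sabotaged']
```

A pack whose ranking came back in any other order would be the eval equivalent of a test suite that stays green on a planted bug.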

Why can't you just trust an eval that "looks right"?

Because looking right and discriminating correctly are unrelated properties. An eval built from plausible-sounding questions and a sensible rubric can still fail to separate quality levels if:

None of these show up by reading the eval. They only show up by running it against agents you already know the answer for.

What does a reference panel actually consist of?

At minimum, three behavioral archetypes:

| Agent | Behavior | Expected result |
| --- | --- | --- |
| Known-good | Grounded, cites correctly, abstains when evidence is missing | Scores high |
| Known-broken | Realistically sloppy: inconsistent citation, occasional overreach | Scores in the middle, clearly below good |
| Known-sabotaged | Deliberately fabricates: confident answers regardless of evidence, invented citations | Scores near zero |

These don't need to be real production agents. They need to be controllable and reproducible, so the same pack run against the same panel produces the same ranking every time — see the determinism requirement below.

What's the pass/fail rule for a pack?

The non-negotiable check is ordering: good > broken > sabotaged. If a pack's scores don't preserve that order — and especially if the sabotaged agent scores anywhere close to the good one — the pack fails, independent of every other quality signal. A margin matters too: barely-separated scores (say, 0.61 vs. 0.58) suggest weak discriminating power even if the ordering is technically correct.
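
The rule above can be sketched as a small check that treats ordering as a hard failure and a thin margin as a warning. The `PanelScores` type, the `check_pack` function, and the default `min_margin` of 0.15 are assumptions for this example; a real threshold would depend on how noisy the pack's scoring is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelScores:
    good: float
    broken: float
    sabotaged: float


def check_pack(scores: PanelScores, min_margin: float = 0.15) -> str:
    """Return 'fail', 'weak', or 'pass' for one pack's panel scores.

    min_margin is an illustrative placeholder, not a recommended value.
    """
    # Hard rule: good > broken > sabotaged, strictly.
    if not (scores.good > scores.broken > scores.sabotaged):
        return "fail"
    # Soft rule: the smallest gap between adjacent agents should be meaningful.
    smallest_gap = min(
        scores.good - scores.broken,
        scores.broken - scores.sabotaged,
    )
    if smallest_gap < min_margin:
        return "weak"
    return "pass"


print(check_pack(PanelScores(good=0.90, broken=0.50, sabotaged=0.05)))  # pass
print(check_pack(PanelScores(good=0.61, broken=0.58, sabotaged=0.10)))  # weak
print(check_pack(PanelScores(good=0.70, broken=0.80, sabotaged=0.10)))  # fail
```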

How is this different from a benchmark leaderboard?

A leaderboard ranks real agents against each other and assumes the eval itself is trustworthy. A reference-panel harness checks that assumption first, using agents of pre-established quality rather than agents you're trying to compare. See the RAG agents benchmark for a worked example of the panel's actual output on a live pack.

FAQ

Do I need real production agents to build a reference panel?

No — the panel's job is to represent known behavior classes (good, broken, sabotaged), not to be realistic production traffic. A small set of deterministic, controllable agents that reproduce specific failure modes is more useful than a handful of real agents whose quality you'd have to establish by some other, circular method.

How many reference agents do I need in a panel?

Three is the practical minimum: one known-good, one known-broken (realistic sloppiness, not malice), and one known-sabotaged (a deliberate planted failure). More agents let you test finer-grained failure modes, but three is enough to prove basic discriminating power.

Isn't this just a fancy way of saying 'test the tests'?

Yes, and that's the point. An eval is a measurement instrument, and instruments need calibration. A ruler that reports the same length for a 10 cm and a 20 cm object isn't a ruler, no matter how official it looks. The same logic applies to an eval that can't separate a good agent from a broken one.

What happens if a pack fails to catch the sabotaged agent?

It should fail outright, regardless of how well it scores on other axes like structural validity. A pack that a sabotaged agent can pass isn't measuring the thing it claims to measure — polish elsewhere doesn't compensate for that.
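
In scoring terms, this means the panel result acts as a gate, not as one more term in an average. The sketch below contrasts the two; the function names and the axis names (`structural_validity`, `rubric_clarity`) are hypothetical and only illustrate the design choice.

```python
from typing import Dict


def averaged_quality(panel_passed: bool, other_axes: Dict[str, float]) -> float:
    """Anti-pattern: the panel result is just one more term in the mean."""
    terms = [1.0 if panel_passed else 0.0] + list(other_axes.values())
    return sum(terms) / len(terms)


def gated_quality(panel_passed: bool, other_axes: Dict[str, float]) -> float:
    """Gate: a failed panel check zeroes the pack, whatever the other axes say."""
    if not panel_passed:
        return 0.0
    if not other_axes:
        return 1.0
    return sum(other_axes.values()) / len(other_axes)


# A polished pack that nevertheless lets the sabotaged agent through.
polished_axes = {"structural_validity": 1.0, "rubric_clarity": 1.0}

# Averaging still rewards the polish; gating does not.
assert averaged_quality(False, polished_axes) > 0.5
assert gated_quality(False, polished_axes) == 0.0
```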

Can this replace human review of eval quality entirely?

No. It replaces the unfalsifiable claim 'trust me, this is a good eval' with a falsifiable one: 'here's the ranking it produced on agents of known quality.' A human still decides whether the panel itself represents realistic failure modes for the target domain.

Does the reference panel need to match the exact domain of the pack being tested?

The behavioral archetypes (good/broken/sabotaged) are largely domain-agnostic — grounding, citation accuracy, and abstention are the same underlying skills whether the domain is medical, legal, or general knowledge. The panel is reused across packs in the same cell rather than rebuilt per domain.
