What Is a Reference-Panel Harness for AI Eval Verification?
Most eval packs are graded on how they look: clear rubric, plausible-sounding test cases, a nice report. None of that tells you whether the eval can actually distinguish a good agent from a bad one. A reference-panel harness answers a narrower, harder question directly: given agents of known quality, does this eval rank them correctly?
Quick answer: A reference-panel harness runs a candidate eval against a small set of agents whose quality is already known — one deliberately good, one realistically broken, one deliberately sabotaged — and checks whether the eval's scores rank them in that order. This is mutation testing applied to evals: you don't trust a test suite until you've confirmed it fails on a planted bug, and you shouldn't trust an eval until you've confirmed it fails a planted bad agent.
Why can't you just trust an eval that "looks right"?
Because looking right and discriminating correctly are unrelated properties. An eval built from plausible-sounding questions and a sensible rubric can still fail to separate quality levels if:
- the scoring rewards something correlated with quality (answer length, confident tone) rather than quality itself
- the test cases are all easy enough that every agent — good or bad — scores similarly
- the rubric has a shortcut an agent can exploit without genuine capability
None of these show up by reading the eval. They only show up by running it against agents you already know the answer for.
What does a reference panel actually consist of?
At minimum, three behavioral archetypes:
| Agent | Behavior | Expected result |
|---|---|---|
| Known-good | Grounded, cites correctly, abstains when evidence is missing | Scores high |
| Known-broken | Realistically sloppy — inconsistent citation, occasional overreach | Scores in the middle, clearly below good |
| Known-sabotaged | Deliberately fabricates: confident answers regardless of evidence, invented citations | Scores near zero |
These don't need to be real production agents. They need to be controllable and reproducible, so the same pack run against the same panel produces the same ranking every time — see the determinism requirement below.
What's the pass/fail rule for a pack?
The non-negotiable check is ordering: good > broken > sabotaged. If a pack's scores don't preserve that order — and especially if the sabotaged agent scores anywhere close to the good one — the pack fails, independent of every other quality signal. A margin matters too: barely-separated scores (say, 0.61 vs. 0.58) suggest weak discriminating power even if the ordering is technically correct.
How is this different from a benchmark leaderboard?
A leaderboard ranks real agents against each other and assumes the eval itself is trustworthy. A reference-panel harness checks that assumption first, using agents of pre-established quality rather than agents you're trying to compare. See the RAG agents benchmark for a worked example of the panel's actual output on a live pack.
Related
- How to eval a RAG agent's groundedness — the concrete rubric this harness is applied to.
- How to detect a gameable eval — the structural-validity checks that run alongside the panel.
- How to map an eval to the EU AI Act — how coverage of an external standard fits alongside discriminating power.
- The general-knowledge RAG groundedness pack — the control-cell pack this harness is demonstrated on.
- Capability packs — packs verified with this method.
- General-knowledge RAG agents benchmark — the panel's results on that pack.
FAQ
No — the panel's job is to represent known behavior classes (good, broken, sabotaged), not to be realistic production traffic. A small set of deterministic, controllable agents that reproduce specific failure modes is more useful than a handful of real agents whose quality you'd have to establish by some other, circular method.
Three is the practical minimum: one known-good, one known-broken (realistic sloppiness, not malice), and one known-sabotaged (a deliberate planted failure). More agents let you test finer-grained failure modes, but three is enough to prove basic discriminating power.
Yes, and that's the point. An eval is a measurement instrument; instruments need calibration. A ruler that reports the same length for a 10cm and a 20cm object isn't a ruler, no matter how official it looks — the same logic applies to an eval that can't separate a good agent from a broken one.
It should fail outright, regardless of how well it scores on other axes like structural validity. A pack that a sabotaged agent can pass isn't measuring the thing it claims to measure — polish elsewhere doesn't compensate for that.
No. It replaces the unfalsifiable claim 'trust me, this is a good eval' with a falsifiable one: 'here's the ranking it produced on agents of known quality.' A human still decides whether the panel itself represents realistic failure modes for the target domain.
The behavioral archetypes (good/broken/sabotaged) are largely domain-agnostic — grounding, citation accuracy, and abstention are the same underlying skills whether the domain is medical, legal, or general knowledge. The panel is reused across packs in the same cell rather than rebuilt per domain.