How to Detect a Gameable Eval (and Why Most Eval Rubrics Are)

By Vincent | Published 2026-06-23 | Updated 2026-07-03

An eval can produce a clean, confident-looking grade while rewarding something other than real capability. Gameability is the property of an eval that lets an agent (or a shortcut-taking version of one) score well without doing the thing the eval claims to measure. It's a structural defect in the eval itself, not in whatever it's testing.

Quick answer: A gameable eval can be passed through a shortcut instead of genuine capability — most commonly a positional bias (the correct answer is predictably in the same place), templated question phrasing that's easy to pattern-match, or leaked answers sitting where the agent can find them without real reasoning. Detecting it means checking the test set's structure directly, not just reading it and judging by eye.

What are the three most common ways an eval gets gamed?

Leakage — the answer (or something close enough to it) appears verbatim somewhere the agent can find it without genuine grounding, most commonly the question itself quoting the source passage.
Positional bias — a correct answer that predictably sits in the same position (the first retrieved document, the first multiple-choice option) lets an agent guess structurally rather than reason substantively.
Templated phrasing — if most test questions share the same syntactic shape, an agent can learn the template rather than the underlying task.

How do you check for each one, concretely?

Check	What to compute	Red flag threshold
Leakage	Does the question text appear verbatim in the supporting context?	Any occurrence across the set
Positional bias	Fraction of correct citations that are the first document	Above ~85%
Phrasing diversity	Fraction of unique first-N-word prefixes across questions	Below ~50%

None of these require running the eval against an agent — they're static checks against the test set's own structure, which makes them cheap to run on every submission.
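As a rough illustration of how cheap these static checks can be, here is a minimal sketch of the leakage and phrasing-diversity rows from the table. The test-case shape (a dict with `question` and `context_docs` keys), the function names, and the default prefix length of three words are all assumptions made for this example, not part of any particular eval format.

```python
def leakage_cases(cases):
    """Return indices of cases whose question text appears verbatim in a context doc."""
    flagged = []
    for i, case in enumerate(cases):
        question = case["question"].strip().lower()
        # Assumed layout: context_docs is a list of plain-text documents.
        if question and any(question in doc.lower() for doc in case["context_docs"]):
            flagged.append(i)
    return flagged


def prefix_diversity(cases, n_words=3):
    """Fraction of distinct first-n-word prefixes across all questions."""
    if not cases:
        return 1.0
    prefixes = {tuple(case["question"].lower().split()[:n_words]) for case in cases}
    return len(prefixes) / len(cases)


# Hypothetical test set, made up for this sketch.
cases = [
    {"question": "What is the termination notice period?",
     "context_docs": ["Either party may terminate with 30 days notice."]},
    {"question": "What is the governing law?",
     "context_docs": ["What is the governing law? This agreement is governed by Delaware law."]},
    {"question": "Who bears the cost of arbitration?",
     "context_docs": ["Each party bears its own arbitration costs."]},
    {"question": "What is the liability cap?",
     "context_docs": ["Liability is capped at fees paid in the prior 12 months."]},
]

print(leakage_cases(cases))     # any non-empty list is a red flag
print(prefix_diversity(cases))  # values below roughly 0.5 are a red flag
```

Exact substring matching only catches verbatim leakage; near-verbatim leaks would need a fuzzier comparison, which this sketch does not attempt.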

Why does gameability matter more than it looks like it should?

Because it silently inflates every score computed downstream. A pack on which an agent passes 95% of test cases because of a positional shortcut looks identical, on the surface, to a pack on which the agent passes 95% because it is genuinely excellent. The whole point of an eval is to tell those two situations apart, and a gameable eval can't.

How does this relate to discriminating power?

Gameability and discriminating power are related but distinct: an eval can be ungameable (no shortcuts exist) and still fail to discriminate (if it's simply too easy or too hard for every agent). Both checks are needed; neither substitutes for the other.

Related

What is a reference-panel harness: the discriminating-power check that runs alongside this one.
How to eval a RAG agent's groundedness: where gameability fits into a full rubric.
How to eval a legal contract Q&A agent: a worked example where positional bias is a real risk (clause ordering).
The legal contract RAG groundedness pack: a verified pack audited for exactly this shortcut.
Capability packs
Legal contract RAG agents benchmark

FAQ

What's the difference between a gameable eval and a leaky one?

Leakage means the answer is directly present somewhere it shouldn't be (like the question itself), letting an agent 'answer' without real work. Gameability is broader — it includes any statistical or positional shortcut, like the correct citation always being the first document, that an agent can exploit without leakage.

Can an LLM-based agent exploit gameability even if it wasn't trained to?

Yes — LLMs are pattern-matchers, and they'll pick up on positional or phrasing regularities in a test set the same way they pick up on any other statistical regularity, without anyone deliberately engineering a cheat. This is why gameability needs to be checked structurally, not assumed away because 'nobody would do that on purpose.'

How do you check for positional bias without manually reviewing every test case?

Compute it directly: for a citation-based eval, calculate what fraction of correct citations are the first document in the context set. If that fraction is implausibly high (say, above 85%), the eval likely rewards a 'guess the first doc' shortcut rather than real retrieval-following.
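The computation described above can be sketched in a few lines. This assumes each test case records a single correct citation as a zero-based index into its context set; the `correct_doc_index` field name is hypothetical.

```python
def first_position_rate(cases):
    """Fraction of cases whose correct citation is the first document (index 0)."""
    if not cases:
        return 0.0
    hits = sum(1 for case in cases if case["correct_doc_index"] == 0)
    return hits / len(cases)


# Hypothetical set: nine cases cite the first document, one cites the third.
cases = [{"correct_doc_index": 0}] * 9 + [{"correct_doc_index": 2}]

rate = first_position_rate(cases)
print(rate, rate > 0.85)  # 0.9 True
```

An eval with several correct citations per case would need a different denominator (correct citations rather than cases), which this sketch leaves out.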

Is templated question phrasing really a security-relevant issue?

It's not adversarial in the security sense, but it has the same effect: an agent (or a lazy pattern-matcher standing in for one) can learn to recognize the template rather than reason about the actual content, inflating scores without inflating real capability.

Should gameability checks block a pack from shipping, or just flag it?

A failed check should meaningfully lower the grade, and a severe case (like near-total positional bias) should block a passing grade. A gameable eval doesn't just have a minor flaw; it isn't reliably measuring the thing it claims to measure.
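One way to encode that policy is sketched below. The 0.85 and 0.50 thresholds come from the red-flag table earlier in this guide; the 0.98 cutoff for "near-total" positional bias, the function name, and the return shape are assumptions chosen for illustration, not a fixed rule.

```python
def gameability_verdict(first_position_rate, prefix_diversity, leaked_cases,
                        severe_position_rate=0.98):
    """Map static-check results to flags, a grade penalty, and a block decision."""
    flags = []
    if leaked_cases > 0:
        flags.append("leakage")
    if first_position_rate > 0.85:
        flags.append("positional_bias")
    if prefix_diversity < 0.50:
        flags.append("templated_phrasing")
    return {
        "flags": flags,
        # Any flag lowers the grade; how much is left to the grader.
        "lower_grade": bool(flags),
        # Assumed cutoff: near-total positional bias blocks a passing grade.
        "block_pass": first_position_rate >= severe_position_rate,
    }


print(gameability_verdict(0.99, 0.80, 0))
print(gameability_verdict(0.40, 0.90, 0))
```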

Does fixing gameability require rewriting the whole test set?

Usually not — targeted fixes work: shuffle document order per test case so citation position isn't predictable, vary question phrasing so there's no dominant template, and re-run the audit to confirm the shortcut is gone.
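The document-shuffling fix might look like the sketch below. It assumes a test case is a dict with a `context_docs` list and a zero-based `correct_doc_index` (both hypothetical names), and it seeds the shuffle per case so the reordered set is reproducible between runs.

```python
import random


def shuffle_case(case, seed):
    """Return a copy of the case with documents shuffled and the answer index remapped."""
    rng = random.Random(seed)
    order = list(range(len(case["context_docs"])))
    rng.shuffle(order)
    return {
        **case,
        # New position j holds the document that used to sit at order[j].
        "context_docs": [case["context_docs"][i] for i in order],
        # The correct document moves to wherever its old index landed.
        "correct_doc_index": order.index(case["correct_doc_index"]),
    }


# Hypothetical case where the correct citation is always first.
example = {
    "question": "What is the liability cap?",
    "context_docs": ["cap clause", "notice clause", "law clause", "fees clause"],
    "correct_doc_index": 0,
}

shuffled = shuffle_case(example, seed=7)
print(shuffled["context_docs"], shuffled["correct_doc_index"])
```

Across a whole set, something like `[shuffle_case(c, seed=1000 + i) for i, c in enumerate(cases)]` gives each case its own order; after that, re-running the positional-bias audit confirms whether the first-document rate has actually dropped.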
