How to Detect a Gameable Eval (and Why Most Eval Rubrics Are)

An eval can produce a clean, confident-looking grade while rewarding something other than real capability. Gameability is the property of an eval that lets an agent (or a shortcut-taking version of one) score well without doing the thing the eval claims to measure. It's a structural defect in the eval itself, not in whatever it's testing.

Quick answer: A gameable eval can be passed through a shortcut instead of genuine capability — most commonly a positional bias (the correct answer is predictably in the same place), templated question phrasing that's easy to pattern-match, or leaked answers sitting where the agent can find them without real reasoning. Detecting it means checking the test set's structure directly, not just reading it and judging by eye.

What are the three most common ways an eval gets gamed?

  1. Leakage — the answer (or something close enough to it) appears verbatim somewhere the agent can find it without genuine grounding, most commonly the question itself quoting the source passage.
  2. Positional bias — a correct answer that predictably sits in the same position (the first retrieved document, the first multiple-choice option) lets an agent guess structurally rather than reason substantively.
  3. Templated phrasing — if most test questions share the same syntactic shape, an agent can learn the template rather than the underlying task.

How do you check for each one, concretely?

| Check | What to compute | Red flag threshold |
| --- | --- | --- |
| Leakage | Does the question text appear verbatim in the supporting context? | Any occurrence across the set |
| Positional bias | Fraction of correct citations that are the first document | Above ~85% |
| Phrasing diversity | Fraction of unique first-N-word prefixes across questions | Below ~50% |

None of these require running the eval against an agent — they're static checks against the test set's own structure, which makes them cheap to run on every submission.
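As one illustration of how cheap such a static check can be, the leakage check can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the test-case shape (a dict with `question` and `context_docs` keys) is a hypothetical layout, and the lowercase-and-collapse-whitespace normalization is an assumption about what should count as "verbatim".

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting doesn't hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_leaky_cases(test_cases):
    """Return indices of test cases whose question appears verbatim in their context.

    Assumes each case is a dict with hypothetical keys "question" (str)
    and "context_docs" (list of str).
    """
    leaky = []
    for i, case in enumerate(test_cases):
        question = normalize(case["question"])
        context = normalize(" ".join(case["context_docs"]))
        if question and question in context:
            leaky.append(i)
    return leaky

cases = [
    {"question": "What is the refund window?",
     "context_docs": ["Refunds are accepted within 30 days."]},
    {"question": "Refunds are accepted within 30 days",
     "context_docs": ["Refunds are accepted within  30 days. Shipping is free."]},
]
print(find_leaky_cases(cases))  # [1]
```

Because the red flag here is any occurrence across the set, a non-empty result is enough to fail the check.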

Why does gameability matter more than it looks like it should?

Because it silently inflates every score computed downstream. An eval pack on which an agent passes 95% of test cases because of a positional shortcut looks identical, on the surface, to one on which the agent passes 95% because it is genuinely excellent. The whole point of an eval is to tell those two situations apart, and a gameable eval can't.

How does this relate to discriminating power?

Gameability and discriminating power are related but distinct: an eval can be ungameable (no shortcuts exist) and still fail to discriminate (if it's simply too easy or too hard for every agent). Both checks are needed; neither substitutes for the other.

FAQ

What's the difference between a gameable eval and a leaky one?

Leakage means the answer is directly present somewhere it shouldn't be (like the question itself), letting an agent 'answer' without real work. Gameability is broader — it includes any statistical or positional shortcut, like the correct citation always being the first document, that an agent can exploit without leakage.

Can an LLM-based agent exploit gameability even if it wasn't trained to?

Yes — LLMs are pattern-matchers, and they'll pick up on positional or phrasing regularities in a test set the same way they pick up on any other statistical regularity, without anyone deliberately engineering a cheat. This is why gameability needs to be checked structurally, not assumed away because 'nobody would do that on purpose.'

How do you check for positional bias without manually reviewing every test case?

Compute it directly: for a citation-based eval, calculate what fraction of correct citations are the first document in the context set. If that fraction is implausibly high (say, above 85%), the eval likely rewards a 'guess the first doc' shortcut rather than real retrieval-following.
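A minimal sketch of that computation, assuming each test case records the position of its correct citation under a hypothetical `correct_doc_index` key (0 meaning the first document). The 0.85 cutoff mirrors the rough threshold above and is a judgment call, not a hard rule.

```python
POSITION_THRESHOLD = 0.85  # rough cutoff from the text; tune for your own eval

def first_doc_fraction(test_cases):
    """Fraction of cases whose correct citation is the first document (index 0)."""
    if not test_cases:
        return 0.0
    hits = sum(1 for case in test_cases if case["correct_doc_index"] == 0)
    return hits / len(test_cases)

# Nine cases cite the first document, one cites the third.
cases = [{"correct_doc_index": 0}] * 9 + [{"correct_doc_index": 2}]
fraction = first_doc_fraction(cases)
print(f"{fraction:.0%} of correct citations are the first document")
if fraction > POSITION_THRESHOLD:
    print("red flag: likely positional bias")
```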

Is templated question phrasing really a security-relevant issue?

It's not adversarial in the security sense, but it has the same effect: an agent (or a lazy pattern-matcher standing in for one) can learn to recognize the template rather than reason about the actual content, inflating scores without inflating real capability.
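One way to put a number on templating is the prefix-diversity ratio: the count of distinct first-N-word prefixes divided by the number of questions. The sketch below assumes N = 3 and case-insensitive matching, both illustrative choices, and treats a ratio below roughly 50% as the red flag.

```python
def prefix_diversity(questions, n_words=3):
    """Distinct first-n_words prefixes divided by the number of questions."""
    if not questions:
        return 1.0  # nothing to flag in an empty set
    prefixes = {" ".join(q.lower().split()[:n_words]) for q in questions}
    return len(prefixes) / len(questions)

questions = [
    "What is the capital of France?",
    "What is the capital of Peru?",
    "What is the boiling point of water?",
    "What is the tallest mountain in Asia?",
    "How many moons does Mars have?",
]
diversity = prefix_diversity(questions)
print(f"prefix diversity: {diversity:.0%}")  # prefix diversity: 40%
if diversity < 0.50:
    print("red flag: one template dominates the question set")
```

Four of the five questions share the prefix "what is the", so only two distinct prefixes exist across five questions.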

Should gameability checks block a pack from shipping, or just flag it?

A failed gameability check should meaningfully lower the grade, and a severe case (like near-total positional bias) should block a passing grade. A gameable eval doesn't just have a minor flaw; it isn't reliably measuring the thing it claims to measure.
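The policy could be sketched roughly as follows. Every number here is a hypothetical placeholder (the per-finding penalty, the "near-total" cutoff, and the passing score are not taken from any real grading scheme), and `AuditResult` and `grade_pack` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class AuditResult:
    leaky_cases: int           # number of test cases with question-in-context leakage
    first_doc_fraction: float  # share of correct citations that are the first document
    prefix_diversity: float    # distinct first-N-word prefixes / number of questions

PENALTY_PER_FINDING = 0.15  # hypothetical: how much each red flag lowers the grade
SEVERE_POSITION = 0.95      # hypothetical cutoff for "near-total" positional bias
PASSING_SCORE = 0.70        # hypothetical passing grade

def grade_pack(base_score, audit):
    """Return (adjusted_score, passed) after applying gameability findings."""
    findings = 0
    if audit.leaky_cases > 0:
        findings += 1
    if audit.first_doc_fraction > 0.85:
        findings += 1
    if audit.prefix_diversity < 0.50:
        findings += 1
    score = max(0.0, base_score - PENALTY_PER_FINDING * findings)
    # A severe positional shortcut blocks a pass regardless of the numeric score.
    blocked = audit.first_doc_fraction >= SEVERE_POSITION
    return score, (score >= PASSING_SCORE and not blocked)

score, passed = grade_pack(0.90, AuditResult(0, 0.97, 0.80))
print(round(score, 2), passed)  # 0.75 False
```

In the example the adjusted score would still clear the passing line, but the severe positional bias blocks the pass anyway.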

Does fixing gameability require rewriting the whole test set?

Usually not — targeted fixes work: shuffle document order per test case so citation position isn't predictable, vary question phrasing so there's no dominant template, and re-run the audit to confirm the shortcut is gone.
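The document-order fix might look like the sketch below. It assumes the same hypothetical test-case layout (`context_docs` plus `correct_doc_index`) and uses a seeded `random.Random` so the shuffled set is reproducible; the key detail is updating the correct index along with the order.

```python
import random

def shuffle_documents(case, rng):
    """Return a copy of the case with documents shuffled and the correct index remapped."""
    order = list(range(len(case["context_docs"])))
    rng.shuffle(order)  # order[new_position] == old_position
    return {
        **case,
        "context_docs": [case["context_docs"][old] for old in order],
        "correct_doc_index": order.index(case["correct_doc_index"]),
    }

rng = random.Random(42)  # fixed seed so the shuffled test set can be regenerated
original = {
    "question": "Which document describes the refund policy?",
    "context_docs": ["refund policy", "shipping policy", "privacy policy"],
    "correct_doc_index": 0,
}
shuffled = shuffle_documents(original, rng)

# The correct citation still points at the same document after shuffling.
assert shuffled["context_docs"][shuffled["correct_doc_index"]] == "refund policy"
```

After shuffling the whole set, re-running the positional-bias computation should show the first-document fraction falling toward one over the number of documents per case.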
