Guides
How to test whether an AI agent is actually good — methodology, not marketing.
How to Detect a Gameable Eval (and Why Most Eval Rubrics Are)
A gameable eval can be passed through a shortcut instead of genuine capability — positional bias, templated phrasing, and leaked answers are the three most common ways evals get gamed.
How to Eval a RAG Agent's Groundedness
How to measure whether a RAG agent's answers are actually supported by its retrieved sources — groundedness scoring, citation verification, and abstention testing.
How to Map an AI Agent Eval to the EU AI Act's High-Risk Obligations
The EU AI Act's high-risk obligations (logging, transparency, human oversight, accuracy evidence) can be operationalized as concrete, checkable properties of an eval's test set — here's how, without inventing a private rubric.
How to Test Whether a RAG Agent Knows When to Say 'I Don't Know'
Abstention testing checks whether a RAG agent declines to answer when its retrieved context doesn't cover the question — the single most commonly skipped and most dangerous gap in RAG evals.
How to Eval a Legal Contract Q&A Agent for Groundedness
Contract Q&A agents fail by citing the wrong clause or answering questions the contract doesn't cover — here's what to test specifically for legal-document RAG, beyond generic groundedness checks.
What Is a Reference-Panel Harness for AI Eval Verification?
A reference-panel harness runs an eval against known-good, known-broken, and known-sabotaged agents to prove it can actually tell them apart — here's how it works and why it's the core of trustworthy AI evals.
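As a rough illustration of the reference-panel idea, the sketch below scores three toy agents against a test set and reports whether the eval separates the good one from the others. All names here (`run_eval`, `panel_check`, `build_panel`, the `margin` threshold, the exact-match scoring) are hypothetical choices made for this example, not the harness described in the guide.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]
Case = Tuple[str, str]  # (question id, expected answer)


def run_eval(agent: Agent, cases: List[Case]) -> float:
    """Return the fraction of cases the agent answers correctly (exact match)."""
    passed = sum(
        1 for question, expected in cases
        if agent(question).strip().lower() == expected.lower()
    )
    return passed / len(cases)


def panel_check(cases: List[Case], panel: Dict[str, Agent], margin: float = 0.5) -> dict:
    """Score every panel agent; the eval 'separates' only if the good agent
    beats each other agent by at least `margin` (an assumed threshold)."""
    scores = {name: run_eval(agent, cases) for name, agent in panel.items()}
    separates = all(
        scores["good"] - score >= margin
        for name, score in scores.items() if name != "good"
    )
    return {"scores": scores, "separates": separates}


def build_panel(key: Dict[str, str]) -> Dict[str, Agent]:
    """Toy panel: a known-good agent, a known-broken one, and a sabotaged
    one that exploits positional bias by always answering 'A'."""
    return {
        "good": lambda q: key[q],
        "broken": lambda q: "",
        "sabotaged": lambda q: "A",
    }


# A balanced answer key versus one where every correct answer is 'A'.
balanced_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
gameable_key = {"q1": "A", "q2": "A", "q3": "A", "q4": "A"}

balanced = panel_check(list(balanced_key.items()), build_panel(balanced_key))
gameable = panel_check(list(gameable_key.items()), build_panel(gameable_key))

print(balanced["separates"], gameable["separates"])
```

On the balanced key the always-"A" agent scores well below the good agent, so the eval separates them. On the all-"A" key the sabotaged agent ties the good one, which is the signal that the test set is gameable.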