Content hub

Guides

How to test whether an AI agent is actually good — methodology, not marketing.

How to Detect a Gameable Eval (and Why Most Eval Rubrics Are)

A gameable eval can be passed by exploiting a shortcut rather than by demonstrating genuine capability. Positional bias, templated phrasing, and leaked answers are the three most common ways evals get gamed.

How to Eval a RAG Agent's Groundedness

How to measure whether a RAG agent's answers are actually supported by its retrieved sources — groundedness scoring, citation verification, and abstention testing.

How to Map an AI Agent Eval to the EU AI Act's High-Risk Obligations

The EU AI Act's high-risk obligations (logging, transparency, human oversight, accuracy evidence) can be operationalized as concrete, checkable properties of an eval's test set — here's how, without inventing a private rubric.

How to Test Whether a RAG Agent Knows When to Say 'I Don't Know'

Abstention testing checks whether a RAG agent declines to answer when its retrieved context doesn't cover the question. It is the most commonly skipped check in RAG evals, and skipping it leaves the most dangerous gap.

How to Eval a Legal Contract Q&A Agent for Groundedness

Contract Q&A agents fail by citing the wrong clause or answering questions the contract doesn't cover — here's what to test specifically for legal-document RAG, beyond generic groundedness checks.

What Is a Reference-Panel Harness for AI Eval Verification?

A reference-panel harness runs an eval against known-good, known-broken, and known-sabotaged agents to prove it can actually tell them apart — here's how it works and why it's the core of trustworthy AI evals.
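As a rough illustration of the reference-panel idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration rather than the harness the guide describes: the toy exact-match eval `run_eval`, the three stand-in agents, the `panel_check` helper, and the 0.2 score margin are all hypothetical. The sabotaged agent exploits positional bias by always answering "A", so an eval whose answer key is all "A" cannot tell it apart from the known-good agent.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]
Case = Tuple[str, str]  # (question id, expected answer)

# Hypothetical answer key that the known-good agent has genuinely "learned".
ANSWERS = {"q1": "A", "q2": "B", "q3": "C", "q4": "D",
           "b1": "A", "b2": "A", "b3": "A", "b4": "A"}

def known_good(question: str) -> str:
    return ANSWERS[question]

def known_broken(question: str) -> str:
    return ""  # never produces a usable answer

def known_sabotaged(question: str) -> str:
    return "A"  # shortcut: always pick the first option

PANEL: Dict[str, Agent] = {
    "known_good": known_good,
    "known_broken": known_broken,
    "known_sabotaged": known_sabotaged,
}

def run_eval(agent: Agent, cases: List[Case]) -> float:
    """Toy eval (assumed): fraction of exact-match answers."""
    return sum(agent(q) == expected for q, expected in cases) / len(cases)

def panel_check(eval_fn, panel: Dict[str, Agent], cases: List[Case],
                margin: float = 0.2):
    """Score every panel agent; the eval 'separates' the panel only if the
    known-good agent beats each other agent by at least `margin`."""
    scores = {name: eval_fn(agent, cases) for name, agent in panel.items()}
    good = scores["known_good"]
    separated = all(good - s >= margin
                    for name, s in scores.items() if name != "known_good")
    return scores, separated

# Answer key spread across options: the shortcut does not pay off.
BALANCED_CASES = [("q1", "A"), ("q2", "B"), ("q3", "C"), ("q4", "D")]
# Answer key that is always "A": the shortcut scores as well as real skill.
BIASED_CASES = [("b1", "A"), ("b2", "A"), ("b3", "A"), ("b4", "A")]

if __name__ == "__main__":
    for label, cases in [("balanced", BALANCED_CASES), ("biased", BIASED_CASES)]:
        scores, separated = panel_check(run_eval, PANEL, cases)
        print(label, scores, "separates panel:", separated)
```

In this sketch the balanced test set separates the panel, while the biased one does not, which is the signal that the eval itself, not the agent, needs fixing.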