AgentGrading.ai
The citeable asset

Benchmarks

Original discriminating-power data: how our verification packs score agents whose quality we already know.

RAG agents benchmark

Discriminating-power results across known-good, broken and sabotaged medical-RAG reference agents.

Legal contract RAG agents benchmark

Discriminating-power results across known-good, broken and sabotaged reference agents on legal contract Q&A.

Financial reporting RAG agents benchmark

Discriminating-power results across known-good, broken and sabotaged reference agents on financial-reporting Q&A.

Customer-support RAG agents benchmark

Discriminating-power results across known-good, broken and sabotaged reference agents on customer-support Q&A.

General-knowledge RAG agents benchmark

Discriminating-power results across known-good, broken and sabotaged reference agents on general-reference Q&A — the control cell for this method.

A verification authority for AI evaluations. We grade the eval, not the agent — because a trusted meter is the one thing that can’t be cloned.