Type · capability · RAG groundedness · General · Low risk · ragas

General RAG groundedness — draft submission

An early-draft submission that hasn't cleared verification yet — held back rather than approved for sale. The report below shows exactly which checks it didn't pass.

Not for sale: failed verification
Reference agents: 3
Scope: 2 test cases
Standards anchored: 0
Price: Not for sale
Who this is for. Not available for purchase — this submission hasn't passed verification. Shown here for transparency alongside the packs that did.
What's tested, and how much. Covers 2 test cases — 2 with a clear, checkable correct answer and 0 deliberately unanswerable, to test whether the agent abstains instead of guessing.
2 answerable · 0 unanswerable

Verification report

Four axes, one grade

Every pack is measured on the same four axes — including whether there's actually enough testing here for its risk level. No oracle, no self-reporting — reproducible measurement only.

10.0 / 10

Structural validity

Reproducible, seed-stable scoring with no leakage from the golden set into the rubric — provable without knowing the ‘best’ eval.

No data leakage 1.00 · Ungameable 1.00 · Deterministic 1.00

5.0 / 10

Discriminating power

Cleanly separates strong from weak agents on the reference panel; the sabotaged reference agent should fail.

2.5 / 10

Standard coverage

Capability pack — measured for internal consistency, not an external standard.

2.0 / 10

Test thoroughness

Test volume weighed against the minimum expected for a 'low risk' pack — not a flat headcount.

2 test cases

Known limitations

What this grade does and doesn’t prove — published on purpose, not hidden in fine print.

  • Reference agents are controlled behavioral archetypes (good/broken/sabotaged), not live production models — validate against your own agent's real outputs before treating this grade as a guarantee for your specific deployment.
  • The test set covers 2 cases — enough to establish discriminating power, not exhaustive coverage of every question shape in this domain. Treat a pass as evidence the pack works, not proof your exact corpus is fully covered.
  • Only 2 test cases for a 'low-risk' pack — 10 is the bar for full thoroughness credit here. Treat this grade as a starting signal, not exhaustive proof for a deployment at this risk level.
  • Discriminating power is 0.50 — the pack ranks known-good above known-sabotaged, but the margin isn't maximal. Borderline agents near the pass threshold deserve a closer manual look, not just the number.
  • Does not yet cover: Transparency and instructions for use, Accuracy / robustness evidence, Human-oversight measures (flags an untrustworthy system). See the EU AI Act obligation coverage checklist below.
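
The thoroughness figure in this report is consistent with a simple linear ratio of test cases to the full-credit bar (2 of 10 cases, 2.0 / 10). The sketch below assumes that linear rule with a cap at full marks; the function name and the cap are hypothetical, and the verifier's actual formula is not published on this page.

```python
def thoroughness_score(n_cases: int, full_credit_bar: int, scale: float = 10.0) -> float:
    """Assumed rule: linear credit up to the bar, capped at full marks."""
    if full_credit_bar <= 0:
        raise ValueError("full_credit_bar must be positive")
    return round(min(n_cases / full_credit_bar, 1.0) * scale, 1)


# Values from this report: 2 test cases against a bar of 10 for a low-risk pack.
print(thoroughness_score(2, 10))
```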

Reference panel

How the panel performed

Reference agents — from strong to deliberately sabotaged — ran the full pack. A trustworthy eval must rank them correctly.

Reference agent | Tier | Pack score
Grounded-RAG-ref | good | 100%
Loose-RAG-ref | broken | 50%
Fabricator-ref | sabotaged | 60%

A good pack scores the known-good agent high and the sabotaged one near zero. That gap is the evidence the meter works.
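
One way to check that ordering yourself is sketched below, using the pack scores from the panel above. The tier order, the function names, and the pairwise comparison are assumptions made for illustration, not the verifier's own method. The good-to-sabotaged gap it prints is a plain score difference and should not be read as the published discriminating-power figure, whose formula is not given here.

```python
# Pack scores from the reference panel above, as fractions of 1.
panel = {
    "Grounded-RAG-ref": ("good", 1.00),
    "Loose-RAG-ref": ("broken", 0.50),
    "Fabricator-ref": ("sabotaged", 0.60),
}

# Expected order, strongest tier first (an assumption drawn from the tier names).
TIER_ORDER = ["good", "broken", "sabotaged"]


def ranking_violations(panel):
    """Return (stronger_tier_agent, weaker_tier_agent) pairs where the weaker tier scored at least as high."""
    agents = sorted(panel.items(), key=lambda kv: TIER_ORDER.index(kv[1][0]))
    violations = []
    for i, (name_hi, (_, score_hi)) in enumerate(agents):
        for name_lo, (_, score_lo) in agents[i + 1:]:
            if score_lo >= score_hi:
                violations.append((name_hi, name_lo))
    return violations


def gap(panel, strong_tier="good", weak_tier="sabotaged"):
    """Score difference between the strong-tier and weak-tier agents."""
    by_tier = {tier: score for tier, score in panel.values()}
    return by_tier[strong_tier] - by_tier[weak_tier]


print(ranking_violations(panel))
print(round(gap(panel), 2))
```

On this panel the check flags one pair: the sabotaged agent (60%) outscores the broken one (50%), even though the known-good agent still ranks first.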

Output

What you get when you run it

A per-test-case score table (exportable to JSON/CSV) plus an aggregate mean per metric.

  • question, retrieved_contexts, and your agent's answer for every test case
  • a faithfulness score (0-1) and an answer_correctness score (0-1) per test case
  • an aggregate mean per metric across the whole test set
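
As a rough picture of that output, the sketch below builds a per-test-case table, computes the aggregate mean per metric, and serializes both to JSON and CSV with the standard library. The rows and scores are placeholders rather than results from this pack, and the helper names are hypothetical.

```python
import csv
import io
import json
from statistics import mean

METRICS = ("faithfulness", "answer_correctness")

# Hypothetical per-test-case rows; the scores are placeholders, not results from this pack.
rows = [
    {"question": "q1", "retrieved_contexts": ["ctx a", "ctx b"], "answer": "a1",
     "faithfulness": 1.0, "answer_correctness": 0.5},
    {"question": "q2", "retrieved_contexts": ["ctx c"], "answer": "a2",
     "faithfulness": 0.5, "answer_correctness": 1.0},
]


def aggregate(rows):
    """Mean of each metric across all test cases."""
    return {m: mean(r[m] for r in rows) for m in METRICS}


def to_json(rows):
    """Serialize per-case rows plus the aggregate block."""
    return json.dumps({"cases": rows, "aggregate": aggregate(rows)}, indent=2)


def to_csv(rows):
    """One CSV line per test case; contexts are joined into a single cell."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for r in rows:
        writer.writerow({**r, "retrieved_contexts": " / ".join(r["retrieved_contexts"])})
    return buf.getvalue()


print(aggregate(rows))
print(to_csv(rows))
```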

Assert the aggregate faithfulness/answer_correctness meets your threshold in your CI script; a non-zero exit on failure blocks the deploy the same way a failing unit test would.
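
A minimal sketch of such a gate follows. The threshold values and function names are hypothetical; the only idea taken from the text above is that a metric below its threshold should produce a non-zero exit code.

```python
import sys

# Hypothetical thresholds; pick values that suit your own deployment.
THRESHOLDS = {"faithfulness": 0.90, "answer_correctness": 0.85}


def failing_metrics(aggregates, thresholds=THRESHOLDS):
    """Return the metrics whose aggregate is missing or below its threshold."""
    return [
        name for name, minimum in thresholds.items()
        if aggregates.get(name, 0.0) < minimum
    ]


def gate(aggregates, thresholds=THRESHOLDS):
    """Exit code for CI: 0 when every metric clears its threshold, 1 otherwise."""
    failed = failing_metrics(aggregates, thresholds)
    for name in failed:
        print(f"FAIL {name}: {aggregates.get(name)} < {thresholds[name]}", file=sys.stderr)
    return 1 if failed else 0


# Placeholder aggregates; in practice these come from your eval run.
exit_code = gate({"faithfulness": 0.95, "answer_correctness": 0.80})
print(exit_code)
# In a CI script, finish with: sys.exit(exit_code)
```

A missing metric is treated as a failure here, so a misconfigured run cannot pass silently.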

The known-good reference agent on this pack scores 1.00 — if your agent's aggregate score lands well below that, treat it as a signal to investigate before shipping, not a false alarm.

Sample scorecard

A real sample of the scorecard this pack produces — actual numbers from the known-good and known-sloppy reference agents, not a mockup.

Category | Known-good agent | Known-sloppy agent
Groundedness (answerable) | 1.00 | 0.50

0-1 score, per category — this is the shape of your own report once you run the pack.

Sample question, in detail

Real content from this pack's own test set, run through the known-good reference agent:

Question
What color is the sky on a clear day?
Sources
Style guide note: prefer active voice in customer-facing copy. / Reference note: the sky appears blue on a clear day due to Rayleigh scattering. / Internal changelog: v2.3 improved logging latency by 12ms on average.
Agent answer
blue.
grounded, correct citation

When you run the pack yourself, frameworks like ragas additionally compute judge-model scores (e.g. faithfulness, answer correctness) from your actual agent's output.

How to read your score

Your aggregate score breaks into two things worth checking separately, not just one number.

Category | What it measures | A low score here means
Groundedness (answerable questions) | Whether the answer contains the right facts and cites the source that actually supports them. | The agent has the right evidence but gets facts wrong or cites the wrong source: a factual-accuracy problem.
Abstention (unanswerable questions) | Whether the agent says it doesn't know when the sources don't support an answer, instead of guessing. | The agent fabricates answers when it shouldn't. This is worse than a groundedness miss, since it looks confident while being wrong.
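
The split can be sketched as a per-category mean. The field names (`answerable`, `score`) and the sample values are hypothetical; the point is only that the two categories are averaged separately, and that a category with no test cases reports no score rather than zero.

```python
from statistics import mean

# Hypothetical per-case results: `answerable` marks whether the sources support an answer,
# `score` is the 0-1 result for that case.
results = [
    {"id": "case-1", "answerable": True, "score": 1.0},
    {"id": "case-2", "answerable": True, "score": 0.5},
]


def by_category(results):
    """Mean score per category; None when a category has no test cases."""
    groups = {"groundedness": [], "abstention": []}
    for r in results:
        key = "groundedness" if r["answerable"] else "abstention"
        groups[key].append(r["score"])
    return {name: (mean(scores) if scores else None) for name, scores in groups.items()}


print(by_category(results))
```

For a test set like this pack's, with 2 answerable and 0 unanswerable cases, the abstention entry comes back empty: the pack says nothing about abstention either way.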

Integration

One line in CI

Runs locally or in your pipeline. Fail the build the same way a failing test would.

$ pip install ragas
# Drop the pack's test-case set (question / retrieved context / gold citation) into your eval script.
# Run ragas.evaluate(dataset, metrics=[faithfulness, answer_correctness]) in CI on every deploy that touches retrieval or the prompt.
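
A sketch of what that eval script might look like, using the sample question from this pack. Several things here are assumptions: the column names (`user_input`, `response`, `retrieved_contexts`, `reference`) follow recent ragas releases and differ in older ones, the `reference` text is a made-up gold answer, and `run_pack` is a hypothetical wrapper. Scoring also needs a judge model configured for ragas, so the call itself is left commented out.

```python
# Sample test case from this pack, in the column names used by recent ragas releases
# (assumption: older releases expect question / answer / contexts / ground_truth instead).
rows = [
    {
        "user_input": "What color is the sky on a clear day?",
        "retrieved_contexts": [
            "Style guide note: prefer active voice in customer-facing copy.",
            "Reference note: the sky appears blue on a clear day due to Rayleigh scattering.",
            "Internal changelog: v2.3 improved logging latency by 12ms on average.",
        ],
        "response": "blue.",  # replace with your agent's actual answer
        "reference": "The sky is blue on a clear day.",  # hypothetical gold answer
    },
]


def run_pack(rows):
    """Score the rows with ragas; needs `pip install ragas` and a configured judge model."""
    # Imported lazily so the rest of the script loads without ragas installed.
    from ragas import EvaluationDataset, evaluate
    from ragas.metrics import answer_correctness, faithfulness

    dataset = EvaluationDataset.from_list(rows)
    return evaluate(dataset, metrics=[faithfulness, answer_correctness])


# In CI, once a judge model is configured:
# result = run_pack(rows)
# print(result)
```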

Anchored standards

Capability pack — no external standard.

EU AI Act obligation coverage

A diagnostic signal, not a formal claim — checkable EU AI Act obligations this pack's test structure happens to exercise:

  • Logging capable of enabling traceability
  • Transparency and instructions for use — not covered
  • Accuracy / robustness evidence — not covered
  • Human-oversight measures (flags an untrustworthy system) — not covered

Not legal advice — see the EU AI Act standard page for the source text these checks are drawn from.

Certification

Framework: ragas
Author: Vincent
Not for sale: This pack failed verification; see "Known limitations" above for why.