What metrics does the 2026-Q2 RAG benchmark report?

Six metrics on the same dated run: recall@5, MRR + NDCG@10, faithfulness (Ragas-scored per Es et al. 2023), citation accuracy per claim, latency p95, and dollars per 1k queries on each provider's list-API pricing as of 2026-05.

What is in the 1,840-document RAG corpus?

A mixed-format corpus designed to mirror real client engagements: 40% long-form PDFs (regulatory documents, contracts, white papers), 35% HTML (product docs, knowledge bases), and 25% transcripts (support calls, meeting notes). 240 queries across factual, comparative, multi-hop, and no-answer categories with chunk-level gold labels. MIT-licensed throughout.

Which LLMs does the RAG benchmark test?

Four model families on identical retrieval pipelines and prompt templates: Anthropic (Claude Opus 4.7, Sonnet 4.6), OpenAI (GPT-4o and a cost-efficient variant), Google (Gemini 2.5 Pro and Flash), and Llama 3.1 70B Instruct as the open-source baseline. Tool use is disabled to isolate retrieval and generation quality.

When do RAG benchmark results publish?

2026-06. The slug, dataset @id, and citation pattern are stable now so anyone forward-linking gets a permanent URL. The dataset will mirror to huggingface.co/datasets/paiteq-ai/rag-bench-2026q2 at publish.

benchmark · 2026-Q2 · in flight

RAG retrieval benchmark, 2026-Q2
Claude, GPT, Gemini, open-source. Same prompts, same corpus.

Name: RAG retrieval benchmark 2026-Q2
Creator: Paiteq
Published: 2026-06-01
License: https://opensource.org/licenses/MIT

A dated benchmark on retrieval-augmented generation. Recall@5, faithfulness, latency p95, and $/1k queries across the four model families that buyers ask us about most. Run end-to-end with the open-source paiteq/ai-eval-harness. Results target 2026-06.


what we measure

The rubric. Locked before the run. Reported on the same dated axis.

Every model is scored on the same six metrics on the same corpus. We report quality and cost on one page so you can't read one without the other.

Recall@5

Of the gold-labelled chunks for a query, what fraction appear in the top-5 retrieved chunks? BEIR-style scoring (Thakur et al. 2021, arXiv:2104.08663).
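
A minimal sketch of recall@k under the usual definition: the fraction of a query's gold chunks that show up in the top k retrieved chunks. The function name, the chunk ids, and the choice to raise on an empty gold set are illustrative assumptions, not the harness's actual code; how the harness handles 'no-answer' queries is not stated here.

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold chunk ids that appear in the top-k retrieved ids."""
    if not gold:
        # Assumption: queries without gold chunks are scored separately.
        raise ValueError("recall is undefined for a query with no gold chunks")
    top_k = set(retrieved[:k])
    return len(top_k & set(gold)) / len(gold)


# Hypothetical ranking for one query, best match first.
retrieved = ["c7", "c2", "c9", "c4", "c1", "c3"]
gold = {"c2", "c3", "c4"}

# c2 and c4 are in the top 5; c3 is ranked sixth and so is missed.
score = recall_at_k(retrieved, gold)
```

Here two of the three gold chunks are retrieved in the top 5, so the query scores 2/3.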
MRR + NDCG@10

Mean reciprocal rank and normalised discounted cumulative gain. Standard IR ranking metrics on a held-out gold set.
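
A short sketch of both ranking metrics, assuming binary relevance (a chunk is either in the gold set or not). Graded relevance would change the gain term; the function names and sample rankings are hypothetical.

```python
import math


def reciprocal_rank(retrieved, gold):
    """1 / rank of the first gold chunk, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, gold, k=10):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Two hypothetical queries: (ranked chunk ids, gold chunk ids).
runs = [
    (["a", "b", "c"], {"b"}),
    (["d", "e", "f"], {"d", "f"}),
]
mrr = mean(reciprocal_rank(r, g) for r, g in runs)
ndcg = mean(ndcg_at_k(r, g) for r, g in runs)
```

MRR rewards only the first relevant hit, while NDCG@10 also credits later hits with a logarithmic discount, which is why the two are reported side by side.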
Faithfulness

Does the generated answer stay inside the retrieved context? Ragas-scored (Es et al. 2023, arXiv:2309.15217).
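
As a rough illustration, Ragas-style faithfulness reduces to a ratio once each claim in the answer has been judged against the retrieved context. The sketch below assumes those per-claim verdicts already exist (in Ragas they come from an LLM judge, which is out of scope here); the function name and the choice to return None for an answer with no claims are assumptions.

```python
def faithfulness(verdicts):
    """Share of answer claims supported by the retrieved context.

    verdicts: one bool per claim extracted from the answer,
    True when the claim can be inferred from the context.
    """
    if not verdicts:
        return None  # Assumption: no claims means nothing to score.
    return sum(verdicts) / len(verdicts)


# Hypothetical answer with four claims, one of them unsupported.
score = faithfulness([True, True, False, True])
```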
Citation accuracy

When the model cites a source chunk, is the supporting span actually there? Per-claim, per-citation scoring.
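
One way to picture per-citation scoring is a check that each cited span actually occurs in the chunk it points to. The sketch below uses case- and whitespace-insensitive substring matching, which is an assumption; the real scorer may use fuzzy matching or a judge model. The `Citation` type and the sample chunks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    claim: str      # the sentence in the answer that carries the citation
    chunk_id: str   # the source chunk the model cited
    span: str       # the text the model says supports the claim


def normalise(text):
    """Lower-case and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def citation_accuracy(citations, chunks):
    """Fraction of citations whose span is found in the cited chunk."""
    if not citations:
        return 0.0
    supported = 0
    for c in citations:
        span = normalise(c.span)
        chunk_text = normalise(chunks.get(c.chunk_id, ""))
        if span and span in chunk_text:
            supported += 1
    return supported / len(citations)


chunks = {
    "c1": "The notice period is 30 days.",
    "c2": "Refunds are issued within 14 days.",
}
citations = [
    Citation("Notice is 30 days.", "c1", "notice period is 30 days"),
    Citation("Refunds take a week.", "c2", "refunds are issued within 7 days"),
]
accuracy = citation_accuracy(citations, chunks)
```

The first citation is backed by its chunk and the second is not, so this answer scores 0.5.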
Latency p95

Wall-clock time to first token and total response time, each at p95 over the full query set. Reported per provider, with network time included.
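
A small sketch of a p95 calculation over per-query latencies. It uses the nearest-rank method, which is an assumption; the page does not say which percentile convention the harness applies. The sample timings are made up for illustration.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical wall-clock measurements in milliseconds, one per query.
ttft_ms = [120, 95, 110, 400, 105, 98, 130, 101, 99, 115]
total_ms = [900, 850, 1200, 2600, 950, 880, 1100, 910, 870, 990]

p95_ttft = percentile_nearest_rank(ttft_ms, 95)
p95_total = percentile_nearest_rank(total_ms, 95)
```

With only ten samples the p95 is simply the slowest query; on a 240-query set the nearest-rank p95 is the 228th fastest.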
$ / 1k queries

Per model on the same dated run. API list pricing as of run date, not promotional credits.
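
A sketch of the cost arithmetic, assuming per-token list pricing quoted per million input and output tokens. The prices and token counts below are placeholders chosen for easy arithmetic, not any provider's real rates.

```python
def cost_per_1k_queries(usage, input_price_per_mtok, output_price_per_mtok):
    """Average dollar cost of 1,000 queries.

    usage: list of (input_tokens, output_tokens), one pair per query.
    Prices are in dollars per million tokens.
    """
    if not usage:
        raise ValueError("need at least one query")
    total_dollars = sum(
        inp * input_price_per_mtok + out * output_price_per_mtok
        for inp, out in usage
    ) / 1_000_000
    return total_dollars / len(usage) * 1000


# Placeholder token counts and prices, for illustration only.
usage = [(4000, 300), (6000, 500)]
dollars = cost_per_1k_queries(usage, input_price_per_mtok=1.00,
                              output_price_per_mtok=2.00)
```

In RAG workloads the retrieved context usually dominates input tokens, so the number of chunks passed to the model tends to drive this figure more than answer length does.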

the corpus

1,840 documents, mixed-format.
PDFs, HTML, transcripts. Synthetic + permissively-licensed.

A corpus that mirrors the document mix we see on client engagements. Permissively licensed so we can publish it in full. Mirrored to huggingface.co/paiteq-ai when the run completes.

Document mix

40% long-form PDFs (regulatory, contracts, white papers), 35% HTML (product docs, knowledge bases), 25% transcripts (support calls, meeting notes).

Query set

240 queries across factual, comparative, multi-hop, and 'no-answer' categories. Gold-labelled with chunk-level provenance.

License

MIT for the corpus. Original source materials are either synthetic or already MIT / CC-BY licensed.

models under test

Four families, same prompts.
No model-family loyalty. Same rubric on each.

We score across the four model families our clients ask about most. Each model runs the same prompt templates, the same retrieval pipeline, and the same evaluation rubric.

anthropic

Claude

Opus 4.7 and Sonnet 4.6, current GA models as of run date (2026-05). Tool-use disabled for RAG runs to isolate retrieval + generation quality.

openai

OpenAI

GPT-4o and a current cost-efficient variant. Same retrieval pipeline; same prompt templates; same eval rubric.

google

Gemini

Gemini 2.5 Pro and Flash. Run against the same corpus and queries as every other model family.

open-source

Llama 3.1 70B

Llama 3.1 70B Instruct served via standard hosting. One open-source baseline so the comparison isn't proprietary-only.

reproduce-rag-2026-q2.sh bash

git clone https://github.com/paiteq/ai-eval-harness
cd ai-eval-harness

# pre-v0.1; commands finalise with the v0.1 release
ai-eval run benchmarks/rag-2026-q2.yaml \
  --provider claude --model claude-opus-4-7

Once results publish, the reproduction commands look approximately like this. Your scores should land inside our 95% confidence intervals.
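
To make the confidence-interval check concrete, here is a sketch of a percentile bootstrap over per-query scores. The page does not say how the published intervals are computed, so the bootstrap method, the resample count, and all names here are assumptions.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-query scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_index = round(alpha / 2 * n_resamples)
    hi_index = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_index], means[hi_index]


# Hypothetical per-query faithfulness scores from a published run.
published_scores = [1.0, 0.0] * 50
lo, hi = bootstrap_ci(published_scores)

# A reproduced mean "lands inside" if it falls between the two bounds.
reproduced_mean = 0.5
inside = lo <= reproduced_mean <= hi
```

Resampling queries with replacement captures how much the headline number would move with a different draw of queries, which is the variation a reproduction is most likely to run into.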

Services this RAG benchmark feeds: AI knowledge base development (RAG over private corpora is the core pattern this harness scores), AI chatbot development (every RAG-grounded chatbot is gated on the same recall + faithfulness metrics), and Claude development (Claude Sonnet 4.6 + Opus 4.7 are among the models under test).

while you wait

See how we work.
Same rubric, on your corpus, in an audit.

The methodology behind this benchmark is the one we run on every client engagement, including every knowledge-base RAG engagement. Engagements move audit → 4-6 week pilot with weekly eval gates → continuous delivery against the same rubric in production. In 2026-Q1, the pilot RAG run logged 88% faithfulness on a 1,840-document corpus, and the agent harness logged 71% pass@1 across 100 tool-using tasks. Reliability and reproducibility are one half of the picture; governance considerations (model risk paperwork, audit logs, red-team coverage) are the other half, and we ship the two together.

