RAG retrieval benchmark, 2026-Q2
Claude, GPT, Gemini, open-source. Same prompts, same corpus.
A dated benchmark on retrieval-augmented generation. Recall@5, faithfulness, latency p95, and $/1k queries across the four model families that buyers ask us about most. Run end-to-end with the open-source paiteq/ai-eval-harness. Results target 2026-06.
The rubric. Locked before the run. Reported on the same dated axis.
Every model is scored on the same six metrics on the same corpus. We report quality and cost on one page so you can't read one without the other.
-
Recall@5
What fraction of a query's gold-set chunks appear among the top-5 retrieved chunks? BEIR-style scoring (Thakur et al. 2021, arXiv:2104.08663).
-
MRR + NDCG@10
Mean reciprocal rank and normalised discounted cumulative gain. Standard IR ranking metrics on a held-out gold set.
-
Faithfulness
Does the generated answer stay inside the retrieved context? Ragas-scored (Es et al. 2023, arXiv:2309.15217).
-
Citation accuracy
When the model cites a source chunk, is the supporting span actually there? Per-claim, per-citation scoring.
-
Latency p95
Wall-clock time to first token and total response time, each at p95 over the full query set. Reported per provider, network time included.
-
$ / 1k queries
Cost per 1,000 queries, per model, on the same dated run. Uses API list pricing as of the run date, not promotional credits.
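For readers who want to sanity-check the numbers, here is a minimal Python sketch of the ranking, latency, and cost metrics in the rubric above. It is not the harness's implementation. The function names are hypothetical, relevance is assumed to be binary (a chunk is either in the gold set or not), chunk IDs are assumed unique within a ranking, and p95 uses the nearest-rank method, which may differ from the interpolation the harness applies.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """Fraction of gold chunks that appear in the top-k retrieved chunks."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance (assumption: gold chunk = 1, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked[:k], start=1)
        if chunk_id in gold
    )
    # Ideal ranking puts every gold chunk first, capped at k positions.
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def p95(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_s)
    # Integer ceiling of 0.95 * n, to avoid floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_1k_queries(total_cost_usd: float, n_queries: int) -> float:
    """Total API spend for a run, scaled to 1,000 queries."""
    return total_cost_usd * 1000 / n_queries


# Toy example with made-up chunk IDs.
ranked = ["c7", "c2", "c9", "c4", "c1"]
gold = {"c2", "c4", "c8"}
print(recall_at_k(ranked, gold))      # 2 of the 3 gold chunks are in the top 5
print(reciprocal_rank(ranked, gold))  # first gold chunk sits at rank 2
print(ndcg_at_k(ranked, gold))
```

Per-query values are averaged over the query set to give the reported figures. Faithfulness and citation accuracy are left out of this sketch because both depend on a judge model.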
1,840 documents, mixed-format.
PDFs, HTML, transcripts. Synthetic + permissively-licensed.
A corpus that mirrors the document mix we see on client engagements. It is permissively licensed so we can publish it in full, and it will be mirrored to huggingface.co/paiteq-ai when the run completes.
Document mix
40% long-form PDFs (regulatory, contracts, white papers), 35% HTML (product docs, knowledge bases), 25% transcripts (support calls, meeting notes).
Query set
240 queries across factual, comparative, multi-hop, and 'no-answer' categories. Gold-labelled with chunk-level provenance.
License
MIT for the corpus. Original source materials are either synthetic or already MIT / CC-BY licensed.
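To make "gold-labelled with chunk-level provenance" concrete, here is one possible shape for a query record, sketched in Python. The published corpus may use a different schema. Every class and field name below is hypothetical. The rule that 'no-answer' queries carry no gold chunks is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Category(str, Enum):
    # The four query categories named in the query-set description.
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    MULTI_HOP = "multi-hop"
    NO_ANSWER = "no-answer"


@dataclass(frozen=True)
class GoldChunk:
    doc_id: str    # hypothetical: identifier of the source document
    chunk_id: str  # hypothetical: identifier of the chunk within that document


@dataclass
class GoldQuery:
    query_id: str
    text: str
    category: Category
    gold_chunks: List[GoldChunk] = field(default_factory=list)

    def validate(self) -> None:
        """Assumed rule: 'no-answer' queries have no gold chunks,
        every other category has at least one."""
        is_no_answer = self.category is Category.NO_ANSWER
        if is_no_answer and self.gold_chunks:
            raise ValueError(f"{self.query_id}: no-answer query has gold chunks")
        if not is_no_answer and not self.gold_chunks:
            raise ValueError(f"{self.query_id}: missing gold chunks")


# Illustrative record; the IDs and text are made up.
example = GoldQuery(
    query_id="q-001",
    text="Which notice period applies to early termination?",
    category=Category.MULTI_HOP,
    gold_chunks=[GoldChunk("doc-12", "doc-12#3"), GoldChunk("doc-40", "doc-40#7")],
)
example.validate()
```

Keeping provenance at chunk level rather than document level is what lets retrieval metrics and citation accuracy be scored against the same labels.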
Four families, same prompts.
No model-family loyalty. Same rubric on each.
We score across the four model families our clients ask about most. Each model runs the same prompt templates, the same retrieval pipeline, and the same evaluation rubric.
Claude
Opus 4.7 and Sonnet 4.6, current GA models as of run date (2026-05). Tool-use disabled for RAG runs to isolate retrieval + generation quality.
OpenAI
GPT-4o and a current cost-efficient variant. Same retrieval pipeline; same prompt templates; same eval rubric.
Gemini
Gemini 2.5 Pro and Flash. Run against the same corpus and queries as every other model family.
Llama 3.1 70B
Llama 3.1 70B Instruct served via standard hosting. One open-source baseline so the comparison isn't proprietary-only.
git clone https://github.com/paiteq/ai-eval-harness
cd ai-eval-harness
# pre-v0.1; commands finalise with the v0.1 release
ai-eval run benchmarks/rag-2026-q2.yaml \
  --provider claude --model claude-opus-4-7