benchmark · 2026-Q2 · in flight

RAG retrieval benchmark, 2026-Q2
Claude, GPT, Gemini, open-source. Same prompts, same corpus.

A dated benchmark on retrieval-augmented generation. Recall@5, faithfulness, latency p95, and $/1k queries across the four model families that buyers ask us about most. Run end-to-end with the open-source paiteq/ai-eval-harness. Results target 2026-06.

▸ what we measure

The rubric. Locked before the run. Reported on the same dated axis.

Every model is scored against the same six-part rubric on the same corpus. We report quality and cost on one page so you can't read one without the other.

  • Recall@5

    Of the gold-set chunks for a query, what fraction appears in the top-5 retrieved? BEIR-style scoring (Thakur et al. 2021, arXiv:2104.08663).

  • MRR + NDCG@10

    Mean reciprocal rank and normalised discounted cumulative gain. Standard IR ranking metrics on a held-out gold set.

  • Faithfulness

    Does the generated answer stay inside the retrieved context? Ragas-scored (Es et al. 2023, arXiv:2309.15217).

  • Citation accuracy

    When the model cites a source chunk, is the supporting span actually there? Per-claim, per-citation scoring.

  • Latency p95

    Wall-clock time to first token and total response time, each at p95 over the full query set. Reported per provider, network time included.

  • $ / 1k queries

    Per model on the same dated run. API list pricing as of run date, not promotional credits.
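
A minimal sketch of how the deterministic rubric lines could be computed, assuming binary relevance (a chunk is either in the gold set or not), a nearest-rank percentile, and per-million-token pricing. Every function name and the prices in the comments are illustrative assumptions, not the paiteq/ai-eval-harness API. Faithfulness and citation accuracy are left out because they need a judge model.

```python
import math

def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold chunks that appear in the top-k retrieved chunks."""
    gold = set(gold)
    if not gold:
        return 0.0  # assumption: 'no-answer' queries are scored separately
    return len(set(retrieved[:k]) & gold) / len(gold)

def reciprocal_rank(retrieved, gold):
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    gold = set(gold)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gold, k=10):
    """Binary-relevance NDCG@k: DCG of this ranking over DCG of an ideal one."""
    gold = set(gold)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def cost_per_1k_queries(input_tokens, output_tokens,
                        usd_per_m_input, usd_per_m_output, n_queries):
    """Total list-price spend for a run, normalised to 1,000 queries."""
    total_usd = (input_tokens / 1e6 * usd_per_m_input
                 + output_tokens / 1e6 * usd_per_m_output)
    return total_usd / n_queries * 1000

# Hypothetical single query: chunk ids in ranked order, plus its gold set.
retrieved = ["c7", "c2", "c9", "c4", "c1", "c3"]
gold = ["c2", "c3", "c4"]
print(recall_at_k(retrieved, gold))      # 2 of 3 gold chunks are in the top 5
print(reciprocal_rank(retrieved, gold))  # first gold chunk sits at rank 2
print(ndcg_at_k(retrieved, gold))
```

MRR is the mean of `reciprocal_rank` over the query set. Recall@5 and NDCG@10 are averaged the same way.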

the corpus

1,840 documents, mixed-format.
PDFs, HTML, transcripts. Synthetic + permissively licensed.

A corpus that mirrors the document mix we see on client engagements. Permissively licensed so we can publish it in full. Mirrored to huggingface.co/paiteq-ai when the run completes.

Document mix

40% long-form PDFs (regulatory, contracts, white papers), 35% HTML (product docs, knowledge bases), 25% transcripts (support calls, meeting notes).
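
As a quick worked check, the stated shares can be turned into approximate document counts. This sketch assumes the percentages are exact rather than rounded, so the published corpus may differ by a few documents. The format labels are illustrative.

```python
CORPUS_SIZE = 1840
MIX_PERCENT = {"pdf": 40, "html": 35, "transcript": 25}  # shares from the mix above

def documents_per_format(total, mix_percent):
    """Split a corpus size by integer percentage shares."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("shares must sum to 100")
    # Floor division; exact here because each share divides evenly.
    return {fmt: total * pct // 100 for fmt, pct in mix_percent.items()}

counts = documents_per_format(CORPUS_SIZE, MIX_PERCENT)
print(counts)  # {'pdf': 736, 'html': 644, 'transcript': 460}
```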

Query set

240 queries across factual, comparative, multi-hop, and 'no-answer' categories. Gold-labelled with chunk-level provenance.
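
One way a gold-labelled query could be represented, as a sketch only. The field names and the chunk id format are hypothetical and not the published schema. The one rule it encodes is that a 'no-answer' query carries no gold chunks, and every other category carries at least one.

```python
from dataclasses import dataclass

CATEGORIES = {"factual", "comparative", "multi-hop", "no-answer"}

@dataclass(frozen=True)
class GoldQuery:
    query_id: str
    category: str
    text: str
    gold_chunk_ids: tuple = ()  # chunk-level provenance; empty for 'no-answer'

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        has_gold = bool(self.gold_chunk_ids)
        if self.category == "no-answer" and has_gold:
            raise ValueError("'no-answer' queries must not list gold chunks")
        if self.category != "no-answer" and not has_gold:
            raise ValueError("answerable queries need at least one gold chunk")

# Hypothetical records; the ids and texts are made up for illustration.
multi_hop = GoldQuery(
    query_id="q-001",
    category="multi-hop",
    text="Which contract clause overrides the default notice period?",
    gold_chunk_ids=("doc-0412#chunk-03", "doc-0977#chunk-11"),
)
unanswerable = GoldQuery("q-002", "no-answer", "What is the CEO's shoe size?")
```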

License

MIT for the corpus. Original source materials are either synthetic or already MIT / CC-BY licensed.

models under test

Four families, same prompts.
No model-family loyalty. Same rubric on each.

We score across the four model families our clients ask about most. Each model runs the same prompt templates, the same retrieval pipeline, and the same evaluation rubric.

anthropic

Claude

Opus 4.7 and Sonnet 4.6, current GA models as of run date (2026-05). Tool-use disabled for RAG runs to isolate retrieval + generation quality.

openai

OpenAI

GPT-4o and a current cost-efficient variant. Same retrieval pipeline; same prompt templates; same eval rubric.

google

Gemini

Gemini 2.5 Pro and Flash. Run against the same corpus and queries as every other model family.

open-source

Llama 3.1 70B

Llama 3.1 70B Instruct served via standard hosting. One open-source baseline so the comparison isn't proprietary-only.

reproduce-rag-2026-q2.sh

```bash
git clone https://github.com/paiteq/ai-eval-harness
cd ai-eval-harness

# pre-v0.1; commands finalise with the v0.1 release
ai-eval run benchmarks/rag-2026-q2.yaml \
  --provider claude --model claude-opus-4-7
```

Once results publish, the reproduction commands look approximately like this. Your scores should land inside our 95% confidence intervals.

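
The page does not say how the 95% confidence intervals are computed. One common choice is a percentile bootstrap over per-query scores, sketched here as an assumption rather than as the harness's method. The function name and defaults are illustrative.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-query scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]

# Hypothetical per-query Recall@5 values for a 240-query run.
per_query = [1.0] * 120 + [0.0] * 120
low, high = bootstrap_ci(per_query)
print(low, high)
```

A reproduced score `s` agrees with a published interval when `low <= s <= high`.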
while you wait

See how we work.
Same rubric, on your corpus, in an audit.

The methodology behind this benchmark is the same methodology we run on every client engagement. The audit is where it starts.