RAG Benchmark Methodology: How We Score Retrieval + Generation in 2026

The four-axis frame we score on (recall, faithfulness, relevancy, cost-per-query), the Ragas metrics, the corpus + judge selection, and the failure modes — the methodology behind our 2026-Q2 RAG benchmark on getwidget.dev.

Layered translucent measurement plates — the RAG benchmark methodology lens

Most "best LLM for RAG" content is unverifiable. The blogs cite undisclosed corpora, undated runs, and vendor-published accuracy. A buyer can't reproduce the number. We can't either. This rag benchmark methodology piece is the opposite shape: the exact harness, prompts, judge configuration, and cost ledger behind our 2026-Q2 RAG benchmark at /benchmarks/rag-2026-q2/. Anyone with a Hugging Face token and a credit card can rerun it.

We run this rag evaluation framework on every retrieval-augmented system we ship. Anyone asking 'how to evaluate rag' usually wants the same thing: a number they can trust and a method they can replay. On a 1,840-doc corpus in 2026-Q2, top models in our pool land in the 0.85 to 0.92 recall band at k=5, with faithfulness in the high-80s. Same quarter, same harness, 2026-Q2 ledger shows roughly 200 tokens of judge overhead per row. Numbers below come from that frozen corpus we re-run quarterly. No invented client wins, no NDA'd outcomes.

Why most rag benchmark content is unverifiable

Open any vendor blog ranking for "best LLM for RAG" and three things will be missing: the corpus, the date, and the judge. "GPT-4o scored in the high-80s on our internal eval" is not a benchmark. It's a claim. A benchmark publishes the dataset (or its source), the model versions (Claude Sonnet 4.6 not "Claude Sonnet"), the judge model and prompt, the metric definitions, and a date stamp. Anyone reading it can rerun and either confirm or refute.

That's the bar we hold our 2026-Q2 RAG benchmark to. The harness is open-source (github.com/paiteq/ai-eval-harness), the corpus is a frozen Hugging Face Datasets snapshot, the model versions are pinned, and the judge config ships in the same repo. Retrieval tier matters too — our vector-store choice for RAG feeds the same harness. When we say Sonnet 4.6 lands in the high-80s for recall@5 on the corpus, you can run one CLI command and either reproduce a number inside the same band or file an issue.

The four-axis frame we score on

Every rag evaluation we run scores along four axes: retrieval recall, generation faithfulness, answer relevancy, and cost-per-query. Not BLEU, not ROUGE — both collapse to noise on long-form answers because they reward lexical overlap, not semantic correctness. A RAG system that retrieves the right passages and grounds the answer in them can score low BLEU against a reference answer phrased differently. That's a useless signal.

The four-axis frame is what Ragas optimizes for and what BEIR's information-retrieval lineage assumes. We didn't invent it. We did pick the specific cuts that matter for operator decisions: recall@5 (not recall@20, because production retrievers cap there for latency), faithfulness against retrieved context (catches hallucination), answer relevancy (catches off-topic completions), and cost-per-query in dollars (because two systems with identical recall but 4× cost gap aren't the same option).

The four-axis RAG eval frame
Recall@5retrieval signalFaithfulnessgrounded in contextCost / queryUSD, co-equal axisAnswer relevancyon-topic, on-question
Figure 1: The four axes every RAG benchmark we ship scores against. Aggregating them into a single number hides the trade-offs that actually drive stack decisions.
RAG eval pipeline — dataset to scored report
Dataset
HF FROZEN SNAPSHOT
Retriever
PGVECTOR + RERANK
Generator
CLAUDE / GPT-5
Judge
RAGAS METRICS
Cost ledger
USD / QUERY
JSON report
GIT-COMMITTED

The pipeline above is what ai-eval-harness runs front to back on a single CLI invocation. Our ragas evaluation harness scores each box as a discrete stage with its own assertion: the retriever must return k passages, the generator must produce a non-empty answer within the latency budget, the judge must score four metrics, the cost ledger must sum the API spend. If any stage fails, the harness aborts the run rather than silently emitting partial scores.

Ragas at the core: which metrics actually matter

Ragas (arxiv 2309.15217) is the academic backbone of our scoring stage. Four of its metrics carry the load in our rag evaluation methodology: context_precision, context_recall, faithfulness, and answer_relevancy. Each is computed by a judge LLM with a published prompt. The judge isn't a black box — you can read the exact prompt in the Ragas repo and reproduce the score by hand on a single example. That's what makes it serious as a faithfulness metric llm tool, instead of a vendor scoreboard.

MetricWhat it measuresWhere it fails
context_precisionOf the k retrieved passages, what fraction are actually relevant to the queryLong-tail dense passages with partial overlap get penalized even when they help
context_recallOf the ground-truth facts, what fraction appear in the retrieved contextRequires labelled ground-truth — expensive to produce; small corpora skew high
faithfulnessAre the answer's claims grounded in the retrieved contextMulti-hop answers that synthesise across passages can score low even when correct
answer_relevancyDoes the answer address the original questionVerbose answers with one correct sentence buried in five tangents score high
The four Ragas metrics we score, what each catches, and the known failure modes we've hit running them on our own corpora.

We report all four on every benchmark run, not a single aggregate. Aggregates hide trade-offs. A system can score 0.92 faithfulness and 0.61 context_recall — useful for low-stakes Q&A, dangerous for legal or medical retrieval where missing context is the failure mode. The four numbers separate "reads well" from "actually retrieves the right thing".

The corpus shape problem

The hardest decision in any rag benchmark isn't the metric — it's the corpus. MS MARCO is too easy: short passages, single-hop queries, web-scale lexical overlap. A modern retriever pushes recall@5 above 0.90 on MS MARCO and that number tells you nothing about how the system handles a 200-page contract or a multi-section policy document. Proprietary corpora are the opposite problem: they're realistic but unverifiable, so the number can't be reproduced by a buyer.

BEIR (the BEIR benchmark, arXiv:2104.08663) is the honest middle ground. It's a benchmark suite with 18 datasets spanning legal, biomedical, scientific, financial, and general-web retrieval. The datasets are real (TREC-COVID, FiQA, NFCorpus, SciFact), the relevance judgments are labelled, and the suite is small enough that a single GPU run completes in hours. Both context precision recall metrics surface here too — labelled relevance lets us compute context_recall honestly. We use a BEIR subset (NFCorpus + FiQA + SciFact) for general capability scoring and a domain-shifted corpus (our 1,840-document operator-facing knowledge base) for the production-realistic number.

MS MARCO + general benchmarks

Easy retrieval signal, lexically rich, single-hop queries. Useful for sanity checks and embedding-model ablations. Trade-off: a 0.92 recall@5 on MS MARCO does not predict 0.92 on a regulated-industry corpus. We run it as a floor, not a finding.

BEIR subset + domain corpus

BEIR brings labelled relevance across legal, biomed, science. Our 1,840-doc domain corpus brings production realism. Trade-off: BEIR is reproducible but small; the domain corpus is realistic but only we can see it. We report both numbers side by side and let the reader weigh which one matches their use case.

MTEB is the embedding-model analogue and we cite it for retriever ablations, but it doesn't score generation all the way through. If you're tuning the embedder, MTEB is the right tool. If you're picking a full RAG stack, you need a generation-aware harness on top.

Judge-model selection and bias

Ragas's faithfulness and answer_relevancy metrics use a judge LLM. The judge bias problem is real and well-documented: GPT-4-class judges tend to favor responses styled like their own training distribution. If you judge GPT-5 output with a GPT-5 judge, the scores skew up. If you judge Claude output with the same judge, the scores skew down for stylistic reasons that have nothing to do with retrieval quality.

Our mitigation is cross-judging. Every benchmark row gets scored twice: once with GPT-5 as judge, once with Claude Sonnet 4.6 as judge. We report both numbers and flag any row where the two judges disagree by more than 0.05 on faithfulness. Those rows go to human review. It's more expensive (roughly 2× judge API spend) but it's the only way to publish a number without quietly tilting toward one vendor.

Cross-judge bias control — anatomy of one scored row
Generated answer+ retrieved contextJudge A: GPT-5faithfulness, relevancyJudge B: Claude 4.6faithfulness, relevancy|A − B| ≤ 0.05?disagreement checkPublishmidpointHumanreview
Figure 2: How one row is scored. Two independent judges run the same Ragas prompts; agreement inside ±0.05 publishes, disagreement routes to human review before any headline number ships.

Cost as a co-equal axis

A rag benchmark that reports recall and ignores cost is half a finding. A system landing in the high-80s recall@5 at $0.04 per query and one landing in the low-90s at $0.18 per query are not the same option. The first ships. The second hits a unit-economics wall at 10K queries per day. RAG chunking + reranker tradeoffs move both axes at once, which is why we score them together. We make cost a co-equal axis: every row in the benchmark publishes both numbers and the harness emits a Pareto frontier plot so the trade-off is visible at a glance.

2026-Q2 RAG benchmark — cost-per-query on the 1,840-doc corpus
GPT-5 + pgvector + rerank
18¢
Highest recall, highest spend
Claude Sonnet 4.6 + pgvector + rerank
11¢
Best Pareto point we shipped
Claude Haiku 4 + pgvector
Commodity Q&A floor
Llama 4 70B (self-hosted vLLM) + pgvector
Amortized GPU cost, on-prem

The numbers above are from the 2026-Q2 run, judged with both GPT-5 and Claude Sonnet 4.6, reported as the midpoint. Cents per query, not dollars — the headline number gets too big to feel real if you publish in dollars per 1K queries. We've found buyers internalize the trade-off faster when the unit is small.

The reproducibility contract

A benchmark is reproducible or it isn't. Our contract has six clauses, every one of them visible in the benchmark page and the harness repo. Miss any of them and the number stops being a benchmark and becomes a vendor claim.

ClauseWhat it meansHow we enforce
Dated URLSlug includes quarter (e.g. /benchmarks/rag-2026-q2/)Slug pattern enforced in publish pipeline
Pinned model versionsclaude-sonnet-4-6-20260315 not "Claude Sonnet"Harness reads versions from frozen config
Frozen corpusHugging Face Datasets snapshot with revision SHASnapshot SHA committed to repo
Published promptsGenerator + judge prompts in repoprompts/ dir with checksums in report
Seed = 0Deterministic where the model supports ittemperature=0, top_p=1, seed=0 in config
Cost ledgerPer-call USD spend committed to reportHarness emits ledger.json alongside scores
The six clauses of our reproducibility contract. Every published benchmark on getwidget.dev must satisfy all six or it doesn't ship.

The pinned-model clause matters more than it looks. "Claude Sonnet" has shipped three minor updates in 2026. A benchmark from January 2026 quoting "Claude Sonnet" isn't talking about the same model as a benchmark from June. We pin to the exact API string (claude-sonnet-4-6-20260315) and the run is dated. Anyone replaying six months later gets the same answer or finds a documented model deprecation.

What the harness actually does, step by step

ai-eval-harness v0.1 is a thin Python CLI that wires together Hugging Face Datasets, the generator API of choice (Anthropic, OpenAI, vLLM for self-hosted), Ragas for scoring, and a JSON report writer. The whole thing fits in roughly 1,200 lines and is intentionally small — we'd rather you read it than trust it. The config is a single YAML file; the run is a single bash command.

examples/rag-baseline.yaml
YAML
# rag-baseline.yaml — the config that produced /benchmarks/rag-2026-q2/
# Run with: ai-eval run --config rag-baseline.yaml --out reports/2026-q2
run:
  name: rag-2026-q2-baseline
  seed: 0
  date: 2026-06-15

dataset:
  source: huggingface
  name: paiteq/rag-eval-1840doc
  revision: 7c4f9b2  # frozen snapshot SHA
  split: test
  sample: 200       # stratified across 6 doc-type buckets

retriever:
  kind: pgvector
  embedding_model: text-embedding-3-large
  rerank: cohere-rerank-3
  top_k: 5

generators:
  - id: claude-sonnet-4-6
    provider: anthropic
    model: claude-sonnet-4-6-20260315
    temperature: 0
    max_tokens: 1024
  - id: gpt-5
    provider: openai
    model: gpt-5-2026-05
    temperature: 0
    max_tokens: 1024
  - id: claude-haiku-4
    provider: anthropic
    model: claude-haiku-4-20260201
    temperature: 0
    max_tokens: 1024

judges:
  # cross-judge to flag bias; disagreement > 0.05 routes to human review
  - id: gpt-5-judge
    provider: openai
    model: gpt-5-2026-05
  - id: claude-judge
    provider: anthropic
    model: claude-sonnet-4-6-20260315

metrics:
  - context_precision
  - context_recall
  - faithfulness
  - answer_relevancy

report:
  emit_ledger: true
  emit_pareto_plot: true
  human_review_threshold: 0.05

The YAML above is exactly what produced our 2026-Q2 numbers. The sample size (200 queries, stratified across six document-type buckets) is small enough to keep judge spend under $20 per run and large enough that the Ragas score variance settles inside ±0.02. We've tested at 500 and 1000 queries — the headline numbers don't move enough to justify 4× the spend on every quarterly re-run.

terminal
Bash
# Replay our 2026-Q2 RAG benchmark from a clean checkout.
# Total wall-clock: ~52 minutes. Total spend: ~$14 across both judges.
$ git clone https://github.com/paiteq/ai-eval-harness && cd ai-eval-harness
$ uv sync
$ export ANTHROPIC_API_KEY=sk-ant-...
$ export OPENAI_API_KEY=sk-...
$ ai-eval run --config examples/rag-baseline.yaml --out reports/2026-q2

[ai-eval] loading dataset paiteq/rag-eval-1840doc@7c4f9b2 ... 200 rows
[ai-eval] retrieving k=5 over pgvector index ... 200/200 done in 38s
[ai-eval] generating: claude-sonnet-4-6 ... 200/200 done in 6m21s, $2.84
[ai-eval] generating: gpt-5             ... 200/200 done in 7m48s, $4.62
[ai-eval] generating: claude-haiku-4    ... 200/200 done in 3m02s, $0.71
[ai-eval] judging w/ gpt-5-judge        ... 600/600 done in 14m30s, $3.91
[ai-eval] judging w/ claude-judge       ... 600/600 done in 16m12s, $2.18
[ai-eval] cross-judge disagreement: 11/600 rows flagged (1.8%) -> human review
[ai-eval] writing reports/2026-q2/scores.json
[ai-eval] writing reports/2026-q2/ledger.json
[ai-eval] writing reports/2026-q2/pareto.svg

=== HEADLINE RESULTS (cross-judge midpoint) ===
model               recall@5  faithful  relevancy  cost/q
claude-sonnet-4-6   0.88      0.91      0.89       $0.11
gpt-5               0.86      0.89      0.91       $0.18
claude-haiku-4      0.79      0.84      0.85       $0.04

Total wall-clock:   51m32s
Total API spend:    $14.26

That CLI session is real output from the 2026-Q2 run. The 1.8% cross-judge disagreement rate is in the range we'd consider healthy — high enough to confirm the judges aren't simply rubber-stamping each other, low enough that the headline numbers aren't dominated by stylistic preference. The 11 flagged rows go through a manual review pass before we publish.

Adapting the harness to your own corpus

The 1,840-doc corpus we ship the public benchmark on is a deliberate proxy. Most teams asking 'how to evaluate rag' on their own stack want a number that reflects their documents, their query distribution, and their tolerances. The harness is built for that swap. Three files change, the rest stays. The dataset reference in the YAML points to your Hugging Face Datasets repo (or a local Parquet file). The query set is a JSONL of {question, ground_truth_answer, ground_truth_passage_ids}. The retriever block names whatever embedding model and vector store your production system actually uses, so the score reflects shipped behavior and not a synthetic best case.

The hardest part of the swap is ground-truth labelling. Ragas computes context_recall by checking whether each ground-truth fact appears in the retrieved passages, which means somebody has to write the ground-truth answers and tag which source passages they came from. Our default labelling protocol is two-pass: a domain SME drafts the gold answer with passage citations, a second SME blind-reviews roughly one in five rows for citation accuracy, and any row with disagreement gets escalated to a third reviewer who owns the final call. That escalation queue is small in steady state but it is the only honest way to keep ground truth from drifting toward whichever SME drafted it. We've found 200 labelled queries is the floor for a score that doesn't wobble between runs and 500 is the ceiling we recommend unless the corpus spans more than ten document types.

Once the corpus and labels are in place, the cross-judge step does not need to change. We strongly recommend keeping both judges even for private internal benchmarks. The bias direction shifts under domain pressure: a GPT-5 judge tends to favor short, declarative answers; a Claude judge tends to reward longer reasoning chains. On a legal corpus where the right answer is a multi-clause synthesis, that bias gap can be 6 to 8 points of faithfulness on the same row. Cross-judging surfaces the gap before it becomes a public claim, and the human-review queue stays small because most rows agree.

For teams that want a faster on-ramp, we offer the harness as a discovery-audit deliverable. The shape: one to two weeks, your stack and your corpus, the same reproducibility contract our public benchmark holds. You walk away with a dated number, a ledger of API spend, the cross-judge disagreement report, and the YAML config that produced all three. The deliverable is the methodology applied, not a slide deck. If the number is good, you have a defensible artifact to point procurement and security review at. If the number is not, you have a ranked list of where the stack is losing points and a sense of whether retriever, generator, or both need work.

Failure modes we caught running our own benchmark

Three real failure modes we hit on the 2026-Q1 and 2026-Q2 runs. None of them are theoretical — each one tanked a headline number before we caught it and re-ran. We're publishing them because every team running this kind of harness for the first time will hit at least one.

Three production failure modes

Silent score gap from judge rate-limit mid-run

On the 2026-Q1 run, the GPT-5 judge hit a per-minute rate limit at row 312. The judge returned a 429, the harness retried twice, then logged a warning and continued. The final report showed 600 rows but 47 of them had a faithfulness score of null. The aggregate dropped because the harness counted nulls as zero. Fix: harness v0.1.2 now aborts the run on any judge null rather than continuing. Better to crash than to ship a quiet bias.

30-token max_tokens cap tanking one model

A copy-paste error in an early config set max_tokens=30 on the GPT-5 generator while Claude got max_tokens=1024. GPT-5's answer_relevancy dropped to 0.42. We almost shipped "Claude beats GPT-5 by 47 points". The fix is structural: harness v0.1.2 fails on any generator with max_tokens < 256 unless explicitly opted-in via tiny_answers: true. Defaults catch the foot-gun.

Prompt-template drift between providers

When the same prompt is copy-pasted between Anthropic and OpenAI SDK calls, system-prompt placement and message-role conventions drift. We had a run where Claude got the retrieved context in the system prompt and GPT-5 got it in a user message. Faithfulness skewed 4 points for stylistic reasons unrelated to retrieval. Fix: a single PromptTemplate object in the harness renders to provider-specific shapes, so the same logical prompt produces the same logical conditioning.

Reading our 2026-Q2 RAG benchmark results

The full 2026-Q2 RAG benchmark, with the Pareto plot and the cross-judge disagreement detail, lives at /benchmarks/rag-2026-q2/. Three rows summarize the finding for an engineering lead picking a stack today.

The reading: on a knowledge-base RAG workload like ours, Claude Sonnet 4.6 wins the Pareto frontier and we'd ship it. GPT-5 is competitive on relevancy (0.91 vs 0.89) but the cost gap doesn't pay back unless the workload specifically rewards GPT-5's relevancy lift. Haiku is the right answer for commodity FAQ-style traffic where the recall floor is acceptable and per-query cost dominates.

That conclusion is corpus-specific. On a multi-hop legal-document workload our retrieval-heavy clients run, the ordering flips — GPT-5's longer reasoning hop pays back on the harder queries. Don't lift our number into a different domain without re-running the harness on your corpus. That's the whole point of publishing the methodology alongside the result.

How this harness compares to promptfoo and the alternatives

We're not the only operators with an open eval harness. promptfoo is the closest comparable for prompt-grading and LLM-graded assertions. LangSmith and Braintrust ship hosted alternatives with richer observability. Each has a different design center. The honest summary:

Tool Best forCost shapeWhen it doesn't fit
ai-eval-harness (ours) Reproducible RAG benchmarks with cross-judge bias control Self-hosted CLI, you pay only API spend Not a CI grading service — no dashboards, no team auth
promptfoo Prompt-grading, LLM-as-judge assertions, regression CI Open source CLI + paid cloud Less opinionated on RAG-specific metrics (recall/faithfulness)
Ragas (library only) The metrics themselves — drop-in scoring Library; you pay judge API spend Not a harness — you wire dataset, generator, report
LangSmith / Braintrust Hosted observability + eval dashboards over LangGraph / agent traces SaaS pricing per trace + judge spend Vendor lock; less suited to publishing reproducible public benchmarks

Most production teams end up using two tools: a Ragas-driven harness (ours or one they wrote themselves) for the dated benchmark, and a hosted dashboard (Langfuse or Braintrust) for ongoing production trace observability. They're not competitors — they answer different questions. A benchmark answers "which stack should we ship". A dashboard answers "is the system we shipped still working".

What's coming in harness v0.2 and the TL;DR for engineering leads

Three things land in v0.2 over the next quarter. Agent reliability rubric — multi-step trajectory scoring for LangGraph and ReAct agents, not just single-turn RAG. LLM-graded promptfoo assertions wired in as an optional scorer alongside Ragas, so a single run can grade both retrieval quality and arbitrary tool-call correctness. And multi-step trajectory replay — recording an agent's full tool-call sequence and re-grading it under a different model without re-running the tools.

For the engineering lead reading this and deciding whether to borrow our harness or build your own: borrow if your stack is pgvector + a major-provider LLM and you want a reproducible quarterly number this month. Build if you have non-standard retrieval (custom rerankers, knowledge-graph hops, multimodal) or compliance reasons to own the code path. Most teams we work with borrow first, then fork once they hit the limit. The harness is permissively licensed for that reason.

What is a rag benchmark and how is it different from a generic LLM benchmark?

A rag benchmark scores a retrieval-augmented generation system from dataset to scored report — retrieval recall plus generation faithfulness plus answer relevancy plus cost — on a specific dated corpus. A generic LLM benchmark like MMLU scores the model in isolation on knowledge questions. RAG benchmarks change every time the corpus, retriever, or model version changes, so they must be dated and pinned.

What's the right corpus size for a reproducible RAG benchmark?

For a reproducible public benchmark we run 200 queries stratified across document-type buckets on a 1,840-document corpus. Score variance settles inside ±0.02 at that size. Below 100 queries the variance is too wide. Above 500 the judge API spend is hard to justify on quarterly re-runs.

Why use Ragas instead of BLEU or ROUGE for RAG eval?

BLEU and ROUGE reward lexical overlap with a reference answer. RAG answers can be semantically correct and lexically different from any single reference, so BLEU and ROUGE produce noisy signals. Ragas's faithfulness and answer_relevancy metrics use a judge LLM to score semantic correctness against retrieved context, which matches what production teams actually care about.

How do you control for judge-model bias in your benchmark?

Every row is scored twice — once with GPT-5 as judge, once with Claude Sonnet 4.6 as judge. We report the midpoint and flag any row where the two judges disagree by more than 0.05 on faithfulness. Flagged rows route to human review before the headline numbers publish.

Can I re-run your 2026-Q2 RAG benchmark on my own infrastructure?

Yes. Clone github.com/paiteq/ai-eval-harness, install with uv sync, set ANTHROPIC_API_KEY and OPENAI_API_KEY, and run ai-eval run --config examples/rag-baseline.yaml. Total wall-clock about 52 minutes and total API spend about $14 across both judges. The dataset is a public Hugging Face Datasets snapshot.

When should I use MS MARCO vs BEIR vs a custom corpus for evaluation?

Use MS MARCO as a sanity floor — if your retriever doesn't push recall above 0.85 on MS MARCO, something is wrong before you even score generation. Use a BEIR subset (NFCorpus, FiQA, SciFact) for cross-domain capability. Use your own production-realistic corpus for the number you actually trust for stack decisions. Report all three and let the reader weigh which one matches their workload.

MORE IN /AI TOOLS AND FRAMEWORKS

Continue reading.

Four enterprise AI platform archetypes as stacked architectural layers — hyperscaler, data cloud, AI-native, DIY orchestration
#enterprise-ai#ai-platforms

Enterprise AI Platform Buyer's Guide: A Decision Rubric for 2026

Build vs buy vs orchestrate decision rubric for enterprise AI platforms. Operator-honest comparison across Databricks, Snowflake Cortex, IBM watsonx, AWS Bedrock, Vertex AI, Azure AI Foundry, and DIY orchestration — with cost archetypes and a 12-week deployment shape.

Navin Sharma Navin Sharma
24m
Back to Blog