Why publish dated LLM benchmarks?

Models, prices, and providers move every month. An undated benchmark rots inside a quarter. We date every benchmark in the URL and the H1 so a reader can tell at a glance which Claude, GPT, or Gemini model was tested and which 2026 pricing the run used.

How are these RAG and agent benchmarks reproducible?

The code lives in paiteq/ai-eval-harness (MIT) and the corpora mirror to huggingface.co/paiteq-ai. Run the same harness on the same dataset and your scores should land inside our published 95% confidence intervals. Recall@5, faithfulness, latency p95, and dollars per 1k queries are reported on the same dated run.
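As an illustration of what "inside our published 95% confidence intervals" can mean in practice, here is a minimal sketch of a percentile-bootstrap interval over per-query scores. It is not taken from paiteq/ai-eval-harness; the function names, the resample count, and the choice of the percentile bootstrap are assumptions made for the example.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-query scores.

    Hypothetical helper: the real harness may compute its intervals differently.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so a rerun gives the same interval
    n = len(scores)
    # Resample the scores with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def within_interval(score, interval):
    """True if a reproduced score falls inside a published (low, high) interval."""
    low, high = interval
    return low <= score <= high
```

A reproduction check would then compare your own mean score against the interval published for that dated run, for example `within_interval(my_mean, published_interval)`.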

Which models do you score on each benchmark?

Four model families on the same rubric: Anthropic (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5), OpenAI (GPT-4o, o-series reasoning), Google (Gemini 2.5 Pro and Flash), and one open-source baseline (Llama 3.1 70B). Identical prompts, identical corpora, identical scoring.

How often are these benchmarks updated?

Quarterly. The 2026-Q2 RAG run ships in 2026-06; the 2026-Q3 agent reliability run ships in 2026-09. Each new quarter gets its own slug and dataset @id so prior runs stay citable forever.
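A per-quarter slug and dataset @id could be derived from the publish date along these lines. This is only a sketch: the slug pattern, the `/benchmarks/` path, the `#dataset` fragment, and the `example.com` base URL are all hypothetical and not the site's actual scheme.

```python
from datetime import date


def quarter_label(d):
    """Return a label such as '2026-Q2' for the quarter containing date d."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def dated_slug(benchmark, d):
    """Hypothetical slug pattern: benchmark name followed by the lowercase quarter."""
    return f"{benchmark}-{quarter_label(d).lower()}"


def dataset_id(base_url, benchmark, d):
    """Hypothetical JSON-LD @id for the dataset behind one dated run."""
    return f"{base_url.rstrip('/')}/benchmarks/{dated_slug(benchmark, d)}#dataset"


# Each quarter yields a distinct identifier, so earlier runs keep their own URL.
print(dataset_id("https://example.com", "rag-retrieval", date(2026, 6, 15)))
```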

benchmarks · model-agnostic

Model benchmarks
Dated, reproducible, eval-first.

We publish dated benchmarks on RAG retrieval and agent reliability, with LLM-selection runs alongside. Every result is reproducible with the open-source paiteq/ai-eval-harness on a corpus you can inspect. Model-agnostic on principle: Claude and GPT alongside Gemini and open-source models, all scored on the same rubric. Our pilot RAG run logged 88% faithfulness on a 1,840-document corpus (2026-Q1); the agent harness logged 71% pass@1 across 100 tool-using tasks (2026-Q1).

Eval harness on GitHub · How we benchmark

upcoming benchmarks

What's shipping next.
Two benchmarks landing through 2026.

Each benchmark gets its own dated page with the full dataset, prompts, scores, and per-model cost. The methodology is published now so anyone forward-linking gets a stable URL.

rag · 2026-Q2 in flight

RAG retrieval benchmark

Recall@5, faithfulness, latency p95, and $/1k queries across Claude, GPT, Gemini, and open-source models on a 1,840-document corpus. Results target 2026-06.
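For readers unfamiliar with two of these metrics, the sketch below shows one common way to compute Recall@5 for a single query and a p95 latency over a run. The definitions are assumptions made for illustration: Recall@k is taken as the fraction of relevant documents found in the top k, and p95 uses the nearest-rank method. The harness may define either differently.

```python
import math


def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One query: two relevant documents, only one of them ranked in the top 5.
print(recall_at_k(["d1", "d2", "d3", "d4", "d5", "d6"], ["d2", "d6"]))  # 0.5
```

Per-query Recall@5 values would then be averaged over the corpus, and `percentile_nearest_rank(latencies_ms, 95)` applied to the latencies recorded on the same run.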

View benchmark page →

agent · 2026-Q3 planned

Agent reliability benchmark

100-task harness covering tool-calling and multi-step execution, plus error-recovery scoring. Reports pass@1, pass@5, mean steps, and mean cost. Results target 2026-09.
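pass@k can be estimated in more than one way, and this page does not say which the harness uses. As a sketch, the version below uses the unbiased estimator common in code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n attempts with c passes, averaged over tasks. The function names and the input layout are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n attempts, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n_attempts, n_passed) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single attempt per task (n = 1, k = 1), this reduces to the plain fraction of tasks solved on the first try.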

View benchmark page →

methodology

How we benchmark.
Four rules. No exceptions.

Every benchmark we publish meets four rules. Anything that fails them doesn't get published.

Dated in the URL and H1

Undated benchmarks rot. Each slug carries the publish quarter so readers can tell what's current at a glance.

Reproducible by anyone

Code lives in paiteq/ai-eval-harness (MIT). Corpora mirror to huggingface.co/paiteq-ai. Run the harness; you should land inside our published confidence intervals.

Model-agnostic on one rubric

Claude, GPT, Gemini, and open-source models score on identical prompts and corpora. No model-family favouritism. Where one shines, we say why.

Cost on the same axis

Recall@5 and pass@1 are meaningless without $/1k queries on the same dated run. Every benchmark reports cost alongside quality.
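To make the cost axis concrete, here is a rough sketch of how $/1k queries could be derived from per-query token usage. The `Pricing` fields and the per-million-token pricing model are assumptions for illustration; real provider price sheets and the harness's own accounting may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price sheet for one model on one dated run (USD per 1M tokens)."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_1k_queries(usage, pricing):
    """Mean cost per query scaled to 1,000 queries.

    usage is a list of (input_tokens, output_tokens) pairs, one per query.
    """
    if not usage:
        raise ValueError("usage must not be empty")
    total = sum(
        (inp * pricing.input_per_mtok + out * pricing.output_per_mtok) / 1_000_000
        for inp, out in usage
    )
    return total / len(usage) * 1000
```

Because the prices are an input rather than a constant, the same token logs can be re-costed whenever a provider changes its rates, which is one reason to pin the price sheet to the dated run.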

under the hood

Open-source harness + open datasets.
So you can verify the numbers without taking our word.

The benchmarks here are produced by the same eval harness our delivery team runs on client engagements. The code is MIT-licensed, the datasets are mirrored to Hugging Face, and the methodology is published as a TechArticle.

open-source MIT

paiteq/ai-eval-harness

Built on Ragas, promptfoo, and Inspect AI, plus custom agent rubrics. MIT-licensed. Trace data flows to Langfuse and Braintrust for inspection. This is the same harness our delivery team uses on client engagements and for the public benchmarks here.

View on GitHub →

datasets