Model benchmarks
Dated, reproducible, eval-first.
We publish dated benchmarks on RAG retrieval, agent reliability, and LLM selection. Every result is reproducible with the open-source paiteq/ai-eval-harness on a corpus you can inspect. Model-agnostic on principle: Claude, GPT, Gemini, and open-source models score on the same rubric.
What's shipping next.
Two benchmarks landing through 2026.
Each benchmark gets its own dated page with the full dataset, prompts, scores, and per-model cost. The methodology is published now, so anyone linking ahead of the results gets a stable URL.
RAG retrieval benchmark
Recall@5, faithfulness, latency p95, and $/1k queries across Claude, GPT, Gemini, and open-source models on a 1,840-document corpus. Results target 2026-06.
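For readers unfamiliar with two of the metrics named above, here is a minimal sketch of how Recall@5 and latency p95 are commonly computed. It is not taken from paiteq/ai-eval-harness: the function names are hypothetical, and the nearest-rank percentile method is an assumption, since the page does not say which percentile definition the harness uses.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant document IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def percentile_nearest_rank(values: Sequence[float], pct: float = 95.0) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    samples at or below it. (Assumed method; other definitions interpolate.)"""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100.0)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical query: three relevant documents, two of them in the top 5.
score = recall_at_k(["d1", "d2", "d3", "d4", "d5", "d6"], {"d1", "d3", "d6"})

# Hypothetical per-query latencies in milliseconds.
p95 = percentile_nearest_rank([120.0, 95.0, 410.0, 130.0, 101.0])
```

A benchmark-level Recall@5 would typically be the mean of the per-query values over the whole query set.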
View benchmark page →
Agent reliability benchmark
100-task harness covering tool-calling, multi-step execution, and error recovery. Reports pass@1, pass@5, mean steps, mean cost. Results target 2026-09.
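As an illustration of the pass@1 and pass@5 figures, the sketch below uses the unbiased pass@k estimator popularised by code-generation evals: given n attempts per task with c successes, pass@k = 1 - C(n-c, k) / C(n, k). This is an assumption; the page does not state how the harness aggregates attempts, and the function names are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without replacement
    from n recorded attempts with c successes, is a success."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures, so every draw of k contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (attempts, successes)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no task results")
    return sum(scores) / len(scores)


# Hypothetical results for three tasks, five attempts each.
tasks = [(5, 5), (5, 1), (5, 0)]
p1 = mean_pass_at_k(tasks, k=1)
p5 = mean_pass_at_k(tasks, k=5)
```

With one attempt per task, pass@1 reduces to the plain success rate, which is why the two metrics are worth reporting side by side.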
View benchmark page →
How we benchmark.
Four rules. No exceptions.
Every benchmark we publish meets four rules. Anything that fails them doesn't get published.
Dated in the URL and H1
Undated benchmarks rot. Each slug carries the publish quarter so readers can tell what's current at a glance.
Reproducible by anyone
Code lives in paiteq/ai-eval-harness (MIT). Corpora mirror to huggingface.co/paiteq-ai. Run the harness; you should land inside our published confidence intervals.
Model-agnostic on one rubric
Claude, GPT, Gemini, and open-source models score on identical prompts and corpora. No model-family favouritism. Where one shines, we say why.
Cost on the same axis
Recall@5 and pass@1 are meaningless without $/1k queries on the same dated run. Every benchmark reports cost alongside quality.
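To make the cost rule concrete, here is a rough sketch of deriving $/1k queries from token usage on a single run. Everything in it is an assumption for illustration: the RunUsage type, the per-million-token pricing model, and the numbers are hypothetical and are not taken from the harness or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunUsage:
    """Token usage recorded for one dated benchmark run (hypothetical shape)."""
    queries: int
    input_tokens: int
    output_tokens: int


def cost_per_1k_queries(
    usage: RunUsage, usd_per_m_input: float, usd_per_m_output: float
) -> float:
    """USD per 1,000 queries, assuming per-million-token pricing."""
    if usage.queries <= 0:
        raise ValueError("queries must be positive")
    total_usd = (
        usage.input_tokens / 1_000_000 * usd_per_m_input
        + usage.output_tokens / 1_000_000 * usd_per_m_output
    )
    return total_usd * 1000 / usage.queries


# Hypothetical run: 2,000 queries at placeholder prices.
usage = RunUsage(queries=2000, input_tokens=4_000_000, output_tokens=1_000_000)
cost = cost_per_1k_queries(usage, usd_per_m_input=1.0, usd_per_m_output=2.0)
```

Pairing this figure with the quality score from the same run keeps the comparison honest: a price change or a model update between runs would otherwise distort the trade-off.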
Open-source harness + open datasets.
So you can verify the numbers without taking our word.
The benchmarks here are produced by the same eval harness our delivery team runs on client engagements. The code is MIT-licensed, the datasets are mirrored to HuggingFace, and the methodology is published as a TechArticle.
paiteq/ai-eval-harness
Ragas + promptfoo + custom agent rubrics. MIT. The same harness our delivery team uses on client engagements and for the public benchmarks here.
View on GitHub →
huggingface.co/paiteq-ai
Eval corpora mirrored so anyone can reproduce our scores on their own infrastructure. Dataset cards link back to the benchmark page.
View on HuggingFace →
Eval-driven delivery
How we benchmark mirrors how we ship. The same rubric anchors client engagements, scored on a weekly cadence in production.
Read the methodology →