benchmarks · model-agnostic

Model benchmarks
Dated, reproducible, eval-first.

We publish dated benchmarks on RAG retrieval, agent reliability, and LLM selection. Every result is reproducible with the open-source paiteq/ai-eval-harness on a corpus you can inspect. Model-agnostic on principle: Claude, GPT, Gemini, and open-source models are scored against the same rubric.

methodology

How we benchmark.
Four rules. No exceptions.

Every benchmark we publish meets four rules. Anything that fails them doesn't get published.

Dated in the URL and H1

Undated benchmarks rot. Each slug carries the publish quarter so readers can tell what's current at a glance.
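
As an illustration only, a quarter-stamped slug could be built as in the sketch below. The function name `dated_slug` and the `<topic>-<year>-q<quarter>` layout are assumptions made for this example; the slug format actually used on published benchmarks is not specified here.

```python
from datetime import date


def dated_slug(topic: str, published: date) -> str:
    """Build a slug that carries the publish quarter.

    Hypothetical format: "<topic>-<year>-q<quarter>".
    """
    # Months 1-3 map to Q1, 4-6 to Q2, 7-9 to Q3, 10-12 to Q4.
    quarter = (published.month - 1) // 3 + 1
    return f"{topic}-{published.year}-q{quarter}"


if __name__ == "__main__":
    print(dated_slug("rag-retrieval", date(2025, 5, 14)))
```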

Reproducible by anyone

Code lives in paiteq/ai-eval-harness (MIT licensed). Corpora are mirrored at huggingface.co/paiteq-ai. Run the harness and your scores should land inside our published confidence intervals.
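
One way to check a reproduction against a published interval is sketched below. It assumes a percentile bootstrap over per-query scores; how the harness itself computes its intervals is not stated here, and `bootstrap_ci` and `within_interval` are hypothetical helpers, not part of paiteq/ai-eval-harness.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def within_interval(value, interval):
    """True if a reproduced score falls inside a (low, high) interval."""
    low, high = interval
    return low <= value <= high


if __name__ == "__main__":
    # Toy data: 80 hits and 20 misses out of 100 queries (made up for illustration).
    toy_scores = [1.0] * 80 + [0.0] * 20
    interval = bootstrap_ci(toy_scores)
    print(interval, within_interval(0.8, interval))
```

If your own run falls outside the published interval, differences in corpus version, prompts, or model snapshot are the first things worth comparing.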

Model-agnostic on one rubric

Claude, GPT, Gemini, and open-source models are scored on identical prompts and corpora. No model-family favouritism. Where one shines, we say why.

Cost on the same axis

Recall@5 and pass@1 say little on their own: a quality score only supports a decision when it sits next to the cost in $ per 1k queries from the same dated run. Every benchmark reports cost alongside quality.
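
For readers unfamiliar with these metrics, here is a minimal sketch of how they can be computed from one run's records. The `QueryResult` record and its field names are assumptions for this example, not the harness's data model, and recall@k is taken here as the mean per-query fraction of relevant documents found in the top k.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    retrieved_ids: list[str]  # ranked retrieval output, best first
    relevant_ids: set[str]    # gold-labelled relevant documents
    cost_usd: float           # spend attributed to this query


def recall_at_k(results: list[QueryResult], k: int = 5) -> float:
    """Mean fraction of relevant documents that appear in the top k."""
    scores = []
    for r in results:
        if not r.relevant_ids:
            continue  # skip queries with no gold labels
        hits = len(set(r.retrieved_ids[:k]) & r.relevant_ids)
        scores.append(hits / len(r.relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0


def pass_at_1(first_attempt_passed: list[bool]) -> float:
    """Fraction of tasks solved on the first attempt."""
    return sum(first_attempt_passed) / len(first_attempt_passed)


def cost_per_1k_queries(results: list[QueryResult]) -> float:
    """Average cost per query, scaled to 1,000 queries."""
    return 1000 * sum(r.cost_usd for r in results) / len(results)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    run = [
        QueryResult(["a", "b", "c"], {"a", "z"}, 0.002),
        QueryResult(["x", "y"], {"y"}, 0.004),
    ]
    print(recall_at_k(run), cost_per_1k_queries(run))
```

Computing quality and cost from the same records keeps both numbers tied to a single dated run.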

for buyers

Why we publish these.
Eval-first delivery, in public.

Most agencies pick a model because the founder likes it. We pick a model because the eval said so. These benchmarks are how we work, published so you can audit the methodology before you hire us.