No model-family loyalty
- We rejected model-family loyalty. We score on the rubric, on your corpus.
- We don't pick Claude because we like Claude. We pick it when the eval on your corpus says so. Same for GPT, Gemini, and open-source models.
This is how we ship AI systems. Three phases, each with a measurable exit criterion that the eval suite enforces. No model goes to production without passing the same rubric we publish on our benchmarks. The harness that scores it is open-source.
Buyers self-qualify through the audit, not a price tag. The pilot earns the right to continuous delivery by passing the rubric the audit locked. Nothing gets skipped.
1-2 weeks, fixed scope. We map your current AI surface, pick the highest-impact bet, and write the eval rubric. Ends with a 4-8 page written prioritisation, a draft rubric, and a go/no-go recommendation. If we recommend no-go, we say so on the page.
4-6 weeks. Working system in production behind a feature flag. The rubric drafted in phase 1 becomes a runnable suite in the repo. Every Friday the team walks the deltas; the rubric decides what ships. Model-agnostic checkpoint at week 3 re-scores Claude, GPT, Gemini, and one open-source baseline.
Ongoing. Dedicated engineering team — the same people who shipped the pilot maintain it. Eval suite versioned with the code and re-run on every PR. Monthly model-selection re-checks. Real on-call rotation. Quarterly governance review.
The rubric is the contract. It's drafted in phase 1, locked at the start of phase 2, and versioned in code for phase 3. The exact metrics depend on the system; the shape is consistent.
Faithfulness + answer relevance per Ragas (Es et al., arXiv:2309.15217). Citation accuracy scored per claim.
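As a rough illustration of per-claim scoring, the sketch below treats faithfulness as the share of answer claims supported by the retrieved context, and citation accuracy as the share of claims whose cited source actually states them. The `Claim` structure, the sample claims, and the choice to score an empty answer as 0.0 are assumptions made for the example. In Ragas the claims are extracted and verified by an LLM; here the verdicts are supplied by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One atomic claim from a generated answer (hypothetical structure)."""
    text: str
    supported: bool          # backed by the retrieved context?
    citation_correct: bool   # does the cited source actually state it?


def faithfulness(claims: list[Claim]) -> float:
    """Share of claims supported by the retrieved context."""
    if not claims:
        return 0.0  # assumption: an answer with no claims scores zero
    return sum(c.supported for c in claims) / len(claims)


def citation_accuracy(claims: list[Claim]) -> float:
    """Share of claims whose citation points at a source that states them."""
    if not claims:
        return 0.0
    return sum(c.citation_correct for c in claims) / len(claims)


# Made-up verdicts for one answer, for illustration only.
claims = [
    Claim("Refunds are processed within 14 days.", True, True),
    Claim("Refunds require a receipt.", True, False),
    Claim("Refunds are available worldwide.", False, False),
    Claim("Store credit is offered as an alternative.", True, True),
]

print(f"faithfulness={faithfulness(claims):.2f}")            # 3 of 4 supported
print(f"citation_accuracy={citation_accuracy(claims):.2f}")  # 2 of 4 correct
```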
Pass@1 + pass@k on a held-out task set. RAG retrieval scored on recall@k, MRR, and NDCG (BEIR-style).
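The metrics named above can be sketched in a few lines. This is a minimal illustration, not the harness itself: the function names and the sample ranking are assumptions. The pass@k formula is the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), and whether the harness uses that exact estimator is also an assumption. MRR is the mean of `reciprocal_rank` over all queries.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the top k results divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per task, of which c passed."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Made-up single query: d1 and d2 are the relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
binary_gains = {doc_id: 1.0 for doc_id in relevant}

print(recall_at_k(ranked, relevant, k=3))       # one of two relevant docs in top 3
print(reciprocal_rank(ranked, relevant))        # first relevant doc at rank 2
print(round(ndcg_at_k(ranked, binary_gains, k=3), 4))
print(round(pass_at_k(n=10, c=3, k=1), 4))
```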
Production-condition p50 / p95 / p99. Voice systems carry a sub-500ms p95 budget; web chat 1.5-2.5s.
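A latency gate of this kind could look like the sketch below. It assumes the nearest-rank percentile definition and a made-up sample of ten request latencies; the function names are hypothetical. The 500 ms figure is the voice budget quoted above.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def within_budget(samples: list[float], p: float, budget_ms: float) -> bool:
    """True when the p-th percentile latency is at or under the budget."""
    return percentile(samples, p) <= budget_ms


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [120, 180, 210, 240, 260, 300, 340, 410, 470, 620]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} p95={p95} p99={p99}")

# Voice budget from the text: p95 under 500 ms.
print("voice p95 within budget:", within_budget(latencies_ms, 95, 500))
```

With only ten samples the p95 and p99 collapse onto the slowest request, which is one reason to measure on production-sized runs.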
$/1k queries or $/task on the same dated run. API list price, not promotional. Quality without cost is half the story.
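The cost axis reduces to simple arithmetic over token counts from a dated run. The sketch below assumes per-million-token list pricing with separate input and output rates. The prices and token counts are placeholders, not real vendor prices, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per one million tokens (placeholder values below)."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_1k_queries(
    pricing: Pricing, input_tokens: int, output_tokens: int, n_queries: int
) -> float:
    """Total spend for the run, normalised to 1,000 queries."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    total_usd = (
        input_tokens / 1_000_000 * pricing.input_per_mtok
        + output_tokens / 1_000_000 * pricing.output_per_mtok
    )
    return total_usd / n_queries * 1000


# Hypothetical run: 2,000 queries, 4M input tokens, 600k output tokens.
placeholder_pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
usd_per_1k = cost_per_1k_queries(placeholder_pricing, 4_000_000, 600_000, 2_000)
print(f"${usd_per_1k:.2f} per 1k queries")  # (12 + 9) USD / 2,000 queries * 1,000
```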
The metric we'd kill the pilot for if it doesn't move. Locked at week zero per NIST AI RMF MEASURE 2.3.
PII redaction, regulated-content refusal, audit-log completeness. Maps to EU AI Act Articles 12 and 13 (record-keeping and transparency).
Not a discovery call, not a strategy deck. A short engagement that ends with a written prioritisation and the rubric phase 2 will run.
Every model call, every prompt, every tool, every retrieval source. Documented so we can score it.
One bet, not five. The one with the clearest business signal and the lowest reversibility cost.
Recall, faithfulness, latency p95, $/1k queries, and any domain-specific gates. Becomes the contract for phase 2.
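One way to make a rubric like this machine-checkable is to express each metric as a gate with a threshold and a direction. The sketch below is an assumed shape, not the published harness: the `Gate` class, the metric names, and every threshold are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Gate:
    """One rubric line: a metric, its threshold, and which direction is good."""
    metric: str
    threshold: float
    direction: Literal["min", "max"]  # "min": value must be >= threshold

    def passes(self, value: float) -> bool:
        if self.direction == "min":
            return value >= self.threshold
        return value <= self.threshold


# Placeholder thresholds; real values would come out of the audit.
RUBRIC = [
    Gate("recall_at_5", 0.85, "min"),
    Gate("faithfulness", 0.90, "min"),
    Gate("latency_p95_ms", 2500, "max"),
    Gate("usd_per_1k_queries", 12.0, "max"),
]


def evaluate(rubric: list[Gate], scores: dict[str, float]) -> dict[str, bool]:
    """Per-gate verdicts; a metric missing from the run counts as a failure."""
    return {
        gate.metric: gate.metric in scores and gate.passes(scores[gate.metric])
        for gate in rubric
    }


# Made-up scores from one run.
scores = {
    "recall_at_5": 0.88,
    "faithfulness": 0.87,
    "latency_p95_ms": 1900,
    "usd_per_1k_queries": 10.5,
}
verdicts = evaluate(RUBRIC, scores)
print(verdicts)
print("ship" if all(verdicts.values()) else "hold")
```

Treating a missing metric as a failure keeps a silently skipped eval from reading as a pass.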
4-8 page deliverable. Draft rubric attached. Go/no-go recommendation. If no-go, we say so.
A working AI system in production behind a feature flag. The rubric decides what ships, not vibes. Cost on the same axis as quality.
Same infrastructure, logging, auth as production. The feature flag is the only difference.
The rubric drafted in phase 1 becomes a runnable suite that lives in the project repo. Wraps Ragas (retrieval + generation) and promptfoo (regression). Every PR re-runs it.
Three deltas (quality, cost+latency, failure modes), two decisions (promote/hold, what changes next week), named owners.
Week 3: re-score Claude, GPT, Gemini, and one open-source model on the locked rubric. If the winner shifts, we adjust.
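The checkpoint can be pictured as re-running the same scoring over several candidates and picking a winner by a fixed rule. The text above does not spell out the selection rule, so the one below (cheapest model among those that clear the quality and latency gates) is an assumption, as are the field names and the anonymised scores.

```python
from typing import Optional


def pick_winner(
    scores: dict[str, dict[str, float]],
    min_quality: float,
    max_p95_ms: float,
) -> Optional[str]:
    """Return the cheapest candidate that clears both gates, or None."""
    eligible = {
        name: s
        for name, s in scores.items()
        if s["quality"] >= min_quality and s["latency_p95_ms"] <= max_p95_ms
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["usd_per_1k"])


# Made-up scores for four anonymised candidates, for illustration only.
week3_scores = {
    "model_a": {"quality": 0.93, "latency_p95_ms": 2100, "usd_per_1k": 14.0},
    "model_b": {"quality": 0.91, "latency_p95_ms": 1800, "usd_per_1k": 9.0},
    "model_c": {"quality": 0.86, "latency_p95_ms": 1200, "usd_per_1k": 4.0},
    "open_baseline": {"quality": 0.90, "latency_p95_ms": 2900, "usd_per_1k": 2.5},
}

winner = pick_winner(week3_scores, min_quality=0.90, max_p95_ms=2500)
print("winner:", winner)
```

Here `model_c` is cheap but misses the quality gate and `open_baseline` misses the latency gate, so the choice falls to the cheaper of the two remaining candidates.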
Once the pilot has cleared its exit criterion, we move to continuous delivery. The same engineers, the same harness, the same rubric — running as part of CI.
The same engineers who shipped the pilot maintain it. Context is the most expensive thing in an AI codebase. We don't churn it.
Every PR re-runs the rubric. Regressions block the merge. The suite grows as the system grows.
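A merge-blocking check of this sort can be sketched as a comparison between the scores on the main branch and the scores on the PR. The metric names, the 2% relative tolerance, and the sample numbers are assumptions for the example. A real CI step would pass the returned code to `sys.exit` so that a non-zero value fails the job.

```python
# Which direction counts as an improvement for each metric (assumed names).
HIGHER_IS_BETTER = {
    "faithfulness": True,
    "recall_at_5": True,
    "latency_p95_ms": False,
}


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Metrics where the candidate is worse than baseline by more than tolerance."""
    failed = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        base, cand = baseline[metric], candidate[metric]
        allowed = abs(base) * tolerance  # relative slack to absorb eval noise
        worse_by = (base - cand) if higher_is_better else (cand - base)
        if worse_by > allowed:
            failed.append(metric)
    return failed


def gate_exit_code(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """0 when the PR may merge, 1 when a regression should block it."""
    failed = regressions(baseline, candidate)
    for metric in failed:
        print(f"REGRESSION {metric}: {baseline[metric]} -> {candidate[metric]}")
    return 1 if failed else 0


# Made-up scores: recall drops well beyond the tolerance.
main_scores = {"faithfulness": 0.92, "recall_at_5": 0.88, "latency_p95_ms": 1800}
pr_scores = {"faithfulness": 0.91, "recall_at_5": 0.80, "latency_p95_ms": 1790}

code = gate_exit_code(main_scores, pr_scores)
print("exit code:", code)
```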
Models ship every month. So does our re-check. We tell you when switching makes sense and when it doesn't.
Drift, audit logs, policy alignment, regulatory mapping. Crosswalked to NIST AI RMF (January 2023), ISO/IEC 42001:2023, and the EU AI Act (Regulation (EU) 2024/1689) where in scope.
The methodology isn't invented. It composes published evaluation research, public standards, and open-source tooling. The harness is open-source so anyone can verify the wiring.
Es, James, Espinosa-Anke, Schockaert (2023). arXiv:2309.15217. The methodology behind our faithfulness (groundedness) and answer-relevance scoring.
Thakur, Reimers, Rücklé, Srivastava, Gurevych (2021). arXiv:2104.08663. Anchors our recall@k + NDCG metric choices for the RAG benchmark.
NIST AI 100-1 (January 2023). Functions: GOVERN, MAP, MEASURE, MANAGE. Our walk-away metric maps to MEASURE 2.3.
ISO/IEC 42001:2023, AI management systems. Drives our phase 3 governance review structure, especially §6.1 (risk) and §8 (operation).
EU AI Act (Regulation (EU) 2024/1689). Article 6 and Annex III drive high-risk classification. Articles 12, 13, and 14 drive the logging, transparency, and human-oversight obligations our audit log answers.
Open-source LLM eval CLI (github.com/promptfoo/promptfoo). Powers the regression layer of our harness alongside Ragas.