methodology · used on every engagement

Eval-driven AI delivery
Audit, pilot, continuous. Gated on real evals.

This is how we ship AI systems. Three phases, each with a measurable exit criterion that the eval suite enforces. No model goes to production without passing the same rubric we publish on our benchmarks. The harness that scores it is open-source.

the three phases

Audit. Pilot. Continuous.
Each phase has a measurable exit criterion.

Buyers self-qualify through the audit, not a price tag. The pilot earns the right to continuous delivery by passing the rubric the audit locked. Nothing gets skipped.

  1. Discovery audit

    1-2 weeks, fixed scope. We map your current AI surface, pick the highest-impact bet, and write the eval rubric. Ends with a 4-8 page written prioritisation, a draft rubric, and a go/no-go recommendation. If we recommend no-go, we say so on the page.

  2. Pilot with weekly eval gates

    4-6 weeks. Working system in production behind a feature flag. The rubric drafted in phase 1 becomes a runnable suite in the repo. Every Friday the team walks the deltas; the rubric decides what ships. Model-agnostic checkpoint at week 3 re-scores Claude, GPT, Gemini, and one open-source baseline.

  3. Continuous delivery

    Ongoing. Dedicated engineering team — the same people who shipped the pilot maintain it. Eval suite versioned with the code and re-run on every PR. Monthly model-selection re-checks. Real on-call rotation. Quarterly governance review.

the rubric

What the eval actually measures. Same shape across RAG, agent, and LLM engagements.

The rubric is the contract. It's drafted in phase 1, locked at the start of phase 2, and versioned in code for phase 3. The exact metrics depend on the system; the shape is consistent.

  • Groundedness

    Faithfulness + answer relevance per Ragas (Es et al., arXiv:2309.15217). Citation accuracy scored per claim.

  • Pass@k

    Pass@1 + pass@k on a held-out task set. RAG retrieval scored on recall@k, MRR, and NDCG (BEIR-style).

  • p95 latency

    Production-condition p50 / p95 / p99. Voice systems carry a sub-500ms p95 budget; web chat 1.5-2.5s.

  • $ / query

    $/1k queries or $/task on the same dated run. API list price, not promotional. Quality without cost is half the story.

  • Walk-away

    The metric we'd kill the pilot for if it doesn't move. Locked at week zero per NIST AI RMF MEASURE 2.3.

  • Audit + domain

    PII redaction, regulated-content refusal, audit-log completeness. Maps to EU AI Act Article 12 (record-keeping and logging) and Article 13 (transparency).
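
To make the retrieval and pass@k lines of the rubric concrete, here is a minimal sketch of how those scores can be computed. It is an illustration, not the harness itself: the function names are hypothetical, relevance is assumed to be binary, document ids are assumed unique within a ranking, and pass@k uses the unbiased estimator popularised by the HumanEval paper (Chen et al., 2021), which the rubric above does not specify. MRR is the mean of `reciprocal_rank` over all queries.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per task, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For a ranking `["d3", "d1", "d7", "d2"]` with relevant set `{"d1", "d2"}`, `recall_at_k(..., k=3)` is 0.5 and `reciprocal_rank(...)` is 0.5, because the first relevant document sits at rank 2. With 10 samples per task and 3 passing, `pass_at_k(10, 3, 1)` is 0.3.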

phase 1 · discovery audit

What you get from the audit.
1-2 weeks, fixed scope, written deliverable.

Not a discovery call, not a strategy deck. A short engagement that ends with a written prioritisation and the rubric phase 2 will run.

Map the current AI surface

Every model call, every prompt, every tool, every retrieval source. Documented so we can score it.

Pick the highest-impact bet

One bet, not five. The one with the clearest business signal and the lowest cost to reverse.

Draft the eval rubric

Recall, faithfulness, latency p95, $/1k queries, and any domain-specific gates. Becomes the contract for phase 2.

Written prioritisation

4-8 page deliverable. Draft rubric attached. Go/no-go recommendation. If no-go, we say so.
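
As an illustration of what a draft rubric can look like once it becomes code, the sketch below expresses each gate as a metric, a threshold, and a direction, then scores one run against it. The metric names and thresholds are hypothetical (only the 2.5s web-chat p95 ceiling echoes a figure from this page), and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    """One rubric line: a metric, a threshold, and which direction is good."""
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1k(total_cost_usd: float, n_queries: int) -> float:
    """Normalise the spend of one dated run to dollars per 1k queries."""
    return total_cost_usd / n_queries * 1000


# Hypothetical thresholds, for illustration only.
RUBRIC = [
    Gate("recall_at_5", 0.85),
    Gate("faithfulness", 0.90),
    Gate("latency_p95_s", 2.5, higher_is_better=False),
    Gate("usd_per_1k_queries", 12.0, higher_is_better=False),
]


def evaluate(rubric: List[Gate], scores: Dict[str, float]) -> Dict[str, bool]:
    """Map each gate to pass/fail; a missing metric counts as a failure."""
    return {
        g.metric: (g.metric in scores and g.passes(scores[g.metric]))
        for g in rubric
    }
```

A run with `recall_at_5` 0.88, `faithfulness` 0.87, a p95 of 3.0s, and $8.40 per 1k queries would pass the recall and cost gates and fail the faithfulness and latency gates. Keeping the rubric as plain data is a design choice: it can be diffed and versioned alongside the code it scores.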

phase 2 · pilot

What the pilot looks like.
4-6 weeks. Working system, weekly gates.

A working AI system in production behind a feature flag. The rubric decides what ships, not vibes. Cost on the same axis as quality.

Working system, not a prototype

Same infrastructure, logging, auth as production. The feature flag is the only difference.

Eval suite in the repo

The rubric drafted in phase 1 becomes a runnable suite that lives in the project repo. Wraps Ragas (retrieval + generation) and promptfoo (regression). Every PR re-runs it.

Friday review, 45 minutes

Three deltas (quality, cost+latency, failure modes), two decisions (promote/hold, what changes next week), named owners.

Model-agnostic checkpoint

Week 3: re-score Claude, GPT, Gemini, and one open-source model on the locked rubric. If the winner shifts, we adjust.
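
The week-3 checkpoint can be pictured as the sketch below: every candidate is scored on the same locked thresholds, and a winner is chosen among those that clear them. The selection rule (cheapest eligible candidate) is an assumption made for illustration, since this page does not define how the winner is picked. The thresholds and metric names are hypothetical too.

```python
from typing import Dict, Optional

# Hypothetical quality floors and budget ceilings; real values come from the locked rubric.
FLOORS = {"faithfulness": 0.90, "recall_at_5": 0.85}
CEILINGS = {"latency_p95_s": 2.5, "usd_per_1k_queries": 12.0}


def clears_rubric(scores: Dict[str, float]) -> bool:
    """True when every floor is met and every ceiling is respected."""
    meets_floors = all(
        scores.get(metric, float("-inf")) >= limit for metric, limit in FLOORS.items()
    )
    under_ceilings = all(
        scores.get(metric, float("inf")) <= limit for metric, limit in CEILINGS.items()
    )
    return meets_floors and under_ceilings


def pick_winner(runs: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Cheapest candidate that clears the rubric, or None if nobody does."""
    eligible = {name: s for name, s in runs.items() if clears_rubric(s)}
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["usd_per_1k_queries"])
```

Calling `pick_winner` with one score dictionary per candidate returns the name of the winner. If that name differs from the model currently behind the feature flag, that is the signal to adjust.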

phase 3 · continuous

What ongoing delivery looks like.
Dedicated team, versioned suite, monthly re-checks.

Once the pilot has cleared its exit criterion, we move to continuous delivery. The same engineers, the same harness, the same rubric — running as part of CI.

Dedicated team, not a pool

The same engineers who shipped the pilot maintain it. Context is the most expensive thing in an AI codebase. We don't churn it.

Eval suite is part of CI

Every PR re-runs the rubric. Regressions block the merge. The suite grows as the system grows.

Monthly model re-check

Models ship every month. So does our re-check. We tell you when switching makes sense and when it doesn't.

Quarterly governance review

Drift, audit logs, policy alignment, regulatory mapping. Crosswalked to NIST AI RMF (Jan 2023), ISO/IEC 42001:2023, and EU AI Act (Regulation 2024/1689) where in scope.
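
The "regressions block the merge" rule from the CI step above can be sketched as a comparison between the scores stored for the main branch and the scores from the PR run. This is a minimal sketch under stated assumptions: the metric names, the 1% relative tolerance, and the function names are all hypothetical, and a real suite would load both score sets from its own result files.

```python
from typing import Dict, List

# Metrics where a lower value is better (hypothetical names).
LOWER_IS_BETTER = {"latency_p95_s", "usd_per_1k_queries"}


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return metrics that got worse than baseline by more than `tolerance` (relative)."""
    regressions = []
    for metric, base in baseline.items():
        if metric not in current:
            regressions.append(f"{metric}: missing from current run")
            continue
        value = current[metric]
        if metric in LOWER_IS_BETTER:
            worse = value > base * (1 + tolerance)
        else:
            worse = value < base * (1 - tolerance)
        if worse:
            regressions.append(f"{metric}: {base:.3f} -> {value:.3f}")
    return regressions


def ci_exit_code(baseline: Dict[str, float], current: Dict[str, float]) -> int:
    """Exit code for the CI job: non-zero blocks the merge."""
    problems = find_regressions(baseline, current)
    for line in problems:
        print("REGRESSION", line)
    return 1 if problems else 0
```

A CI step would pass the result of `ci_exit_code` to `sys.exit`. The tolerance is there so that run-to-run noise does not block merges, and its value is a judgment call per metric.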

what this is not
design decision · 01

No model-family loyalty

what we do instead
Score on the rubric, on your corpus
because
We don't pick Claude because we like Claude. We pick it when the eval on your corpus says so. Same for GPT, Gemini, and open-source.
design decision · 02

No vibes-based 'ship it' calls

what we do instead
The rubric decides at the threshold
because
If the eval fails, we don't ship, regardless of executive pressure. We explain why in the weekly review.
design decision · 03

No gated prototypes

what we do instead
Production infra behind a flag
because
A demo on production infrastructure behind a feature flag is fine. A demo behind ngrok pretending to be a pilot is not.
design decision · 04

No undated benchmarks

what we do instead
Quarter in the slug and the H1
because
Internal or external, every benchmark snapshot is dated. Undated numbers rot inside a quarter.
design decision · 05

No black-box governance

what we do instead
Audit-log answers in one query
because
If a regulator asks how the model behaved on 2026-04-15, the audit log answers in one query. If it can't, we fix that before we ship more features.
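
As a sketch of what "answers in one query" can mean in practice, the example below stores one row per model call and summarises a single day with one SQL statement. The schema, column names, and sample rows are hypothetical, and SQLite stands in for whatever store a real deployment uses.

```python
import sqlite3

# Hypothetical schema: one row per model call. Column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_log (
        ts TEXT NOT NULL,             -- ISO 8601 UTC timestamp
        request_id TEXT NOT NULL,
        model TEXT NOT NULL,
        rubric_version TEXT NOT NULL,
        refused INTEGER NOT NULL,     -- 1 if the system refused the request
        pii_redacted INTEGER NOT NULL -- 1 if PII was redacted from the output
    )
    """
)
# Made-up sample rows spanning three days.
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2026-04-14T23:59:10Z", "r-001", "model-a", "v3", 0, 1),
        ("2026-04-15T08:12:44Z", "r-002", "model-a", "v3", 0, 0),
        ("2026-04-15T17:40:02Z", "r-003", "model-a", "v3", 1, 0),
        ("2026-04-16T00:00:05Z", "r-004", "model-b", "v4", 0, 0),
    ],
)


def behaviour_on(day: str):
    """Summarise every logged call on one UTC day (YYYY-MM-DD) in a single query."""
    return conn.execute(
        """
        SELECT model, rubric_version,
               COUNT(*) AS calls,
               SUM(refused) AS refusals,
               SUM(pii_redacted) AS redactions
        FROM audit_log
        WHERE ts >= ? AND ts < date(?, '+1 day')
        GROUP BY model, rubric_version
        """,
        (day, day),
    ).fetchall()
```

With these sample rows, `behaviour_on("2026-04-15")` returns `[("model-a", "v3", 2, 1, 0)]`: two calls, one refusal, no redactions. Storing timestamps as ISO 8601 UTC strings lets the day filter work as a plain string comparison.
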
sources + further reading

Standards and papers behind this methodology.
Every claim above maps to one of these.

The methodology isn't invented. It composes published evaluation research, public standards, and open-source tooling. The harness is open-source so anyone can verify the wiring.

Ragas — automated RAG evaluation

Es, James, Espinosa-Anke, Schockaert (2023). arXiv:2309.15217. The methodology behind our groundedness + answer-relevance scoring.

BEIR — zero-shot retrieval benchmark

Thakur, Reimers, Rücklé, Srivastava, Gurevych (2021). arXiv:2104.08663. Anchors our recall@k + NDCG metric choices for the RAG benchmark.

NIST AI Risk Management Framework 1.0

NIST AI 100-1 (January 2023). Functions: GOVERN, MAP, MEASURE, MANAGE. Our walk-away metric maps to MEASURE 2.3.

ISO/IEC 42001:2023

AI management systems. Drives our phase 3 governance review structure, especially §6.1 (risk) and §8 (operation).

EU AI Act — Regulation 2024/1689

Article 6 + Annex III drive high-risk classification. Articles 12, 13, and 14 set the logging, transparency, and human-oversight obligations that our audit log is built to answer.

promptfoo

Open-source LLM eval CLI (github.com/promptfoo/promptfoo). Powers the regression layer of our harness alongside Ragas.

next step

Start with an audit.
1-2 weeks, fixed scope, written deliverable.

The audit is the door. If you already know what you need, it ends with a prioritisation and a rubric. If you're earlier than that, we'll tell you so and point you at what to read first.