What is eval-driven AI delivery?

An AI delivery methodology with three phases: a 1-2 week discovery audit that writes the eval rubric, a 4-6 week pilot in production behind a feature flag with weekly eval gates, then continuous delivery with the same engineers and the eval suite running as part of CI. The rubric decides what ships at every phase, not vibes.

What does the discovery audit deliver?

A 4-8 page written prioritisation, a draft eval rubric covering recall, faithfulness, latency p95, and dollars per 1k queries, and a go/no-go recommendation. If we recommend no-go, we say so on the page. Fixed scope, 1-2 weeks.

What does the weekly eval gate actually look like?

A 45-minute Friday review. Three deltas (quality, cost plus latency, failure modes) and two decisions (promote or hold, what changes next week with a named owner). Locked thresholds; any failure blocks the feature-flag rollout. The notes are committed to the repo as markdown.

Which evaluation tools do you run?

The paiteq/ai-eval-harness (MIT) wraps Ragas (Es et al. 2023, arXiv:2309.15217) for retrieval and generation scoring, promptfoo for regression, and Inspect AI for agent rubrics. Trace data flows to Langfuse and Braintrust. Where governance applies, the results are crosswalked to NIST AI RMF, ISO/IEC 42001:2023, and EU AI Act Articles 12-14.

methodology · used on every engagement

Eval-driven AI delivery
Audit, pilot, continuous. Gated on real evals.

This is how we ship AI systems. Three phases, each with a measurable exit criterion that the eval suite enforces. No model goes to production without passing the same rubric we publish on our benchmarks. The harness that scores it is open-source.

Start the audit conversation · View the harness

the three phases

Audit. Pilot. Continuous.
Each phase has a measurable exit criterion.

Buyers self-qualify through the audit, not a price tag. The pilot earns the right to continuous delivery by passing the rubric the audit locked. Nothing gets skipped.

01

Discovery audit

1-2 weeks, fixed scope. We map your current AI surface, pick the highest-impact bet, and write the eval rubric. Ends with a 4-8 page written prioritisation, a draft rubric, and a go/no-go recommendation. If we recommend no-go, we say so on the page.

02

Pilot with weekly eval gates

4-6 weeks. Working system in production behind a feature flag. The rubric drafted in phase 1 becomes a runnable suite in the repo. Every Friday the team walks the deltas; the rubric decides what ships. A model-agnostic checkpoint at week 3 re-scores Claude, GPT, Gemini, and one open-source baseline.

03

Continuous delivery

Ongoing. Dedicated engineering team — the same people who shipped the pilot maintain it. Eval suite versioned with the code and re-run on every PR. Monthly model-selection re-checks. Real on-call rotation. Quarterly governance review.

▸ the rubric

What the eval actually measures. Same shape across RAG, agent, and LLM engagements.

The rubric is the contract. It's drafted in phase 1, locked at the start of phase 2, and versioned in code for phase 3. The exact metrics depend on the system; the shape is consistent.
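One way to make "the rubric is the contract" concrete is to keep it as frozen data next to the code. The sketch below is a minimal illustration, not the harness's actual schema: the `Gate` and `Rubric` names, the metric keys, the version string, and every threshold value are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Gate:
    """One locked threshold. `direction` says which way is better."""
    metric: str
    threshold: float
    direction: Literal["min", "max"]  # "min": value >= threshold, "max": value <= threshold

    def passes(self, value: float) -> bool:
        if self.direction == "min":
            return value >= self.threshold
        return value <= self.threshold


@dataclass(frozen=True)
class Rubric:
    version: str
    gates: tuple[Gate, ...]

    def failures(self, scores: dict[str, float]) -> list[str]:
        """Names of the gates that fail. A missing score counts as a failure."""
        return [
            g.metric
            for g in self.gates
            if g.metric not in scores or not g.passes(scores[g.metric])
        ]


# Hypothetical thresholds, for illustration only.
RUBRIC = Rubric(
    version="2026-Q1.1",
    gates=(
        Gate("faithfulness", 0.85, "min"),
        Gate("recall_at_5", 0.80, "min"),
        Gate("latency_p95_s", 2.5, "max"),
        Gate("usd_per_1k_queries", 12.0, "max"),
    ),
)


def decision(scores: dict[str, float]) -> str:
    """Any single failure holds the rollout."""
    return "hold" if RUBRIC.failures(scores) else "promote"
```

Because the dataclasses are frozen, changing a threshold means committing a new rubric version rather than editing a value mid-pilot.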

Groundedness

Faithfulness + answer relevance per Ragas (Es et al., arXiv:2309.15217). Citation accuracy scored per claim.
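Ragas defines faithfulness as the share of claims in an answer that the retrieved context supports. The sketch below shows only that arithmetic plus a per-claim citation check. It assumes the per-claim verdicts already exist (from a judge model or human annotators); the `ClaimVerdict` structure and its field names are hypothetical and are not the Ragas API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ClaimVerdict:
    claim: str
    supported: bool                      # can the claim be inferred from the retrieved context?
    cited_source: Optional[str]          # source id the answer cited for this claim, if any
    supporting_sources: FrozenSet[str]   # source ids that actually support the claim


def faithfulness(verdicts: List[ClaimVerdict]) -> Optional[float]:
    """Supported claims divided by total claims. None when the answer has no claims."""
    if not verdicts:
        return None
    return sum(v.supported for v in verdicts) / len(verdicts)


def citation_accuracy(verdicts: List[ClaimVerdict]) -> Optional[float]:
    """Among claims that carry a citation, the share whose cited source supports them."""
    cited = [v for v in verdicts if v.cited_source is not None]
    if not cited:
        return None
    return sum(v.cited_source in v.supporting_sources for v in cited) / len(cited)
```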
Pass@k

Pass@1 + pass@k on a held-out task set. RAG retrieval scored on recall@k, MRR, and NDCG (BEIR-style).
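For reference, these metrics reduce to a few lines each. The sketch assumes pass@k is computed with the unbiased estimator from Chen et al. (2021), which the page does not specify, and that NDCG uses linear gains with a log2 discount. Function names and argument shapes are illustrative.

```python
import math
from typing import Dict, List, Set


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct.

    Equals 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, 0.0 if none is retrieved.

    MRR is the mean of this value over all queries.
    """
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ordering of `gains`."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

The suite-level pass@k is the mean of `pass_at_k` over the held-out tasks.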
p95 latency

Production-condition p50 / p95 / p99. Voice systems carry a sub-500ms p95 budget; web chat 1.5-2.5s.
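A minimal sketch of the percentile check, assuming latencies are collected as raw per-request samples in milliseconds. It uses the nearest-rank definition of a percentile, which is one of several common conventions; the function names are illustrative, and the 500 ms figure is the voice budget quoted above.

```python
import math
from typing import Sequence

VOICE_P95_BUDGET_MS = 500  # "sub-500ms" voice budget from the rubric


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def p95_within_budget(samples_ms: Sequence[float], budget_ms: float) -> bool:
    """True when p95 is strictly under the budget."""
    return percentile(samples_ms, 95) < budget_ms
```

With only a handful of samples, p95 and p99 collapse onto the slowest request, so tail percentiles are only meaningful on a run large enough to have a tail.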
$ / query

$/1k queries or $/task on the same dated run. API list price, not promotional. Quality without cost is half the story.
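The cost figure is plain arithmetic over the token counts logged on the run. A sketch, assuming pricing is quoted per million input and output tokens; the `Price` type is hypothetical, and the prices in the usage note are placeholders, not any vendor's list price.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Price:
    usd_per_1m_input: float
    usd_per_1m_output: float


def usd_per_1k_queries(usage: List[Tuple[int, int]], price: Price) -> float:
    """Average cost of 1,000 queries.

    `usage` holds one (input_tokens, output_tokens) pair per query from the run.
    """
    if not usage:
        raise ValueError("no queries in the run")
    total_usd = sum(
        tokens_in * price.usd_per_1m_input + tokens_out * price.usd_per_1m_output
        for tokens_in, tokens_out in usage
    ) / 1_000_000
    return total_usd / len(usage) * 1000
```

For example, with placeholder prices of $3 and $15 per million input and output tokens, two queries using (2000, 500) and (4000, 300) tokens cost $0.03 in total, or $15 per 1k queries.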
Walk-away

The metric we'd kill the pilot for if it doesn't move. Locked at week zero per NIST AI RMF MEASURE 2.3.
Audit + domain

PII redaction, regulated-content refusal, audit-log completeness. Maps to EU AI Act Article 12 (record-keeping) and Article 13 (transparency).

phase 1 · discovery audit

What you get from the audit.
1-2 weeks, fixed scope, written deliverable.

Not a discovery call, not a strategy deck. A short engagement that ends with a written prioritisation and the rubric phase 2 will run.

Map the current AI surface

Every model call, every prompt, every tool, every retrieval source. Documented so we can score it.

Pick the highest-impact bet

One bet, not five. The one with the clearest business signal and the lowest cost to reverse.

Draft the eval rubric

Recall, faithfulness, latency p95, $/1k queries, and any domain-specific gates. Becomes the contract for phase 2.

Written prioritisation

4-8 page deliverable. Draft rubric attached. Go/no-go recommendation. If no-go, we say so.

phase 2 · pilot

What the pilot looks like.
4-6 weeks. Working system, weekly gates.

A working AI system in production behind a feature flag. The rubric decides what ships, not vibes. Cost on the same axis as quality.

Working system, not a prototype

Same infrastructure, logging, auth as production. The feature flag is the only difference.

Eval suite in the repo

The rubric drafted in phase 1 becomes a runnable suite that lives in the project repo. Wraps Ragas (retrieval + generation), promptfoo (regression), and Inspect AI (agent harness). Trace data flows to Langfuse and Braintrust. Every PR re-runs it.

Friday review, 45 minutes

Three deltas (quality, cost+latency, failure modes), two decisions (promote/hold, what changes next week), named owners.

Model-agnostic checkpoint

Week 3: re-score Claude, GPT, Gemini, and one open-source model on the locked rubric. If the winner shifts, we adjust.
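A sketch of what "if the winner shifts" can mean in code. It assumes each candidate has already been scored on the locked rubric, and it picks the candidate that clears every threshold with the highest quality, breaking ties on cost. The threshold values, metric keys, candidate names, and the tie-break rule are all assumptions made for illustration.

```python
from typing import Dict, Optional

Scores = Dict[str, float]

# Hypothetical locked thresholds.
MIN_GATES = {"faithfulness": 0.85}
MAX_GATES = {"latency_p95_s": 2.5, "usd_per_1k_queries": 12.0}


def clears_rubric(scores: Scores) -> bool:
    """True only when every locked threshold is met."""
    return all(scores[m] >= t for m, t in MIN_GATES.items()) and all(
        scores[m] <= t for m, t in MAX_GATES.items()
    )


def pick_winner(candidates: Dict[str, Scores]) -> Optional[str]:
    """Best eligible candidate: highest faithfulness, then lowest cost."""
    eligible = {name: s for name, s in candidates.items() if clears_rubric(s)}
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda n: (-eligible[n]["faithfulness"], eligible[n]["usd_per_1k_queries"]),
    )


def winner_shifted(current: str, candidates: Dict[str, Scores]) -> bool:
    """True when a different eligible model now ranks first."""
    winner = pick_winner(candidates)
    return winner is not None and winner != current
```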

phase 3 · continuous

What ongoing delivery looks like.
Dedicated team, versioned suite, monthly re-checks.

Once the pilot has cleared its exit criterion, we move to continuous delivery. The same engineers, the same harness, the same rubric — running as part of CI.

Dedicated team, not a pool

The same engineers who shipped the pilot maintain it. Context is the most expensive thing in an AI codebase. We don't churn it.

Eval suite is part of CI

Every PR re-runs the rubric. Regressions block the merge. The suite grows as the system grows.
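A minimal sketch of a merge-blocking regression check, assuming the CI job has the baseline scores from the main branch and the scores from the PR run as plain dictionaries. The metric keys, the 2% relative tolerance, and the function names are hypothetical; baselines are assumed to be positive.

```python
from typing import Dict, List

# True: higher is better. False: lower is better.
HIGHER_IS_BETTER = {
    "faithfulness": True,
    "recall_at_5": True,
    "latency_p95_s": False,
    "usd_per_1k_queries": False,
}


def regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02
) -> List[str]:
    """Metrics that moved in the wrong direction by more than `tolerance` (relative)."""
    worse_metrics = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        change = (current[metric] - baseline[metric]) / baseline[metric]
        worsening = -change if higher_is_better else change
        if worsening > tolerance:
            worse_metrics.append(metric)
    return worse_metrics


def exit_code(baseline: Dict[str, float], current: Dict[str, float]) -> int:
    """0 when the PR is clean, 1 when any metric regressed."""
    failed = regressions(baseline, current)
    for metric in failed:
        print(f"REGRESSION {metric}: {baseline[metric]} -> {current[metric]}")
    return 1 if failed else 0
```

A CI step would pass the return value of `exit_code` to `sys.exit`, so a non-zero status fails the job and blocks the merge.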

Monthly model re-check

Models ship every month. So does our re-check. We tell you when switching makes sense and when it doesn't.

Quarterly governance review

Drift, audit logs, policy alignment, regulatory mapping. Crosswalked to NIST AI RMF (Jan 2023), ISO/IEC 42001:2023, and EU AI Act (Regulation 2024/1689) where in scope.

what this is not

design decision · 01

No model-family loyalty

instead: Score on the rubric, on your corpus
because: We don't pick Claude because we like Claude. We pick it when the eval on your corpus says so. Same for GPT, Gemini, and open-source.

design decision · 02

No vibes-based 'ship it' calls

instead: The rubric decides at the threshold
because: If the eval fails, we don't ship, regardless of executive pressure. We explain why in the weekly review.

design decision · 03

No gated prototypes

instead: Production infra behind a flag
because: A demo on production infrastructure behind a feature flag is fine. A demo behind ngrok pretending to be a pilot is not.

design decision · 04

No undated benchmarks

instead: Quarter in the slug and the H1
because: Internal or external, every benchmark snapshot is dated. Undated numbers rot within a quarter.

design decision · 05

No black-box governance

instead: Audit-log answers in one query
because: If a regulator asks how the model behaved on 2026-04-15, the audit log answers in one query. If it can't, we fix that before we ship more features.
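To illustrate "one query", here is a sketch against a SQLite table. The schema, column names, and outcome labels are hypothetical; it assumes every model call is logged with an ISO-8601 UTC timestamp, which is what lets a plain range filter on the timestamp column select a single day.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical audit-log schema.
SCHEMA = """
CREATE TABLE audit_log (
    ts TEXT NOT NULL,              -- ISO-8601 UTC, e.g. 2026-04-15T12:30:00Z
    request_id TEXT NOT NULL,
    model TEXT NOT NULL,
    rubric_version TEXT NOT NULL,
    outcome TEXT NOT NULL          -- e.g. answered, refused, escalated
);
CREATE INDEX idx_audit_ts ON audit_log (ts);
"""


def behaviour_on(conn: sqlite3.Connection, day: str) -> List[Tuple[str, str, int]]:
    """Outcome counts per model for one UTC day, given as 'YYYY-MM-DD'."""
    cursor = conn.execute(
        "SELECT model, outcome, COUNT(*) FROM audit_log "
        "WHERE ts >= ? AND ts < date(?, '+1 day') "
        "GROUP BY model, outcome ORDER BY model, outcome",
        (day, day),
    )
    return cursor.fetchall()
```

Calling `behaviour_on(conn, "2026-04-15")` returns rows of (model, outcome, count) for that day only.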

sources + further reading

Standards and papers behind this methodology.
Every claim above maps to one of these.

The methodology isn't invented. It composes published evaluation research, public standards, and open-source tooling. The harness is open-source so anyone can verify the wiring.

Ragas — automated RAG evaluation

Es, James, Espinosa-Anke, Schockaert (2023). arXiv:2309.15217. The methodology behind our groundedness + answer-relevance scoring.

BEIR — zero-shot retrieval benchmark

Thakur, Reimers, Rücklé, Srivastava, Gurevych (2021). arXiv:2104.08663. Anchors our recall@k + NDCG metric choices for the RAG benchmark.

NIST AI Risk Management Framework 1.0

NIST AI 100-1 (January 2023). Functions: GOVERN, MAP, MEASURE, MANAGE. Our walk-away metric maps to MEASURE 2.3.

ISO/IEC 42001:2023

AI management systems. Drives our phase 3 governance review structure, especially §6.1 (risk) and §8 (operation).

EU AI Act — Regulation 2024/1689

Article 6 + Annex III drive high-risk classification. Articles 12, 13, and 14 set the logging, transparency, and human-oversight obligations that our audit log answers.

promptfoo

Open-source LLM eval CLI (github.com/promptfoo/promptfoo). Powers the regression layer of our harness alongside Ragas.

Services this methodology drives: AI software development (the umbrella: every engagement runs on eval gates), AI consulting (every discovery audit produces the eval-set design that gates the pilot), AI agent development (agent reliability is the rubric this methodology applies), AI knowledge base (RAG recall and faithfulness scored on the same harness), intelligent document processing (per-field accuracy and confidence routing), and AI governance (eval suites in CI are the governance artefact).

next step

Start with an audit.
1-2 weeks, fixed scope, written deliverable.

The audit is the door. If you already know what you need, it ends with a prioritisation and a rubric. If you're earlier than that, we'll tell you so and point you at what to read first. Recent eval runs that back what we ship: 88% faithfulness on a 1,840-document RAG corpus (2026-Q1); 71% pass@1 across 100 tool-using tasks on the agent harness (2026-Q1). This is how we approach <a href='/services/ai-development/'>AI development</a> end-to-end, and it ties directly into our <a href='/services/ai-governance/'>governance practice</a>. Start the conversation through the <a href='/services/ai-consulting/'>strategy phase / consulting</a> engagement. The same rubric carries into <a href='/services/ai-automation/'>evaluated automation</a> workflows where deterministic and LLM steps interleave.

Start the audit conversation · See published benchmarks
