methodology · cadence

Weekly eval gates
What ships, what doesn't, decided by the rubric.

Every AI engagement we run has a weekly eval gate. A week is the longest interval that still catches regressions before they compound, and the shortest that doesn't drown the team in noise. Here's what the Friday review actually looks like.

why weekly

Why a week is the right unit.
Not per-PR, not monthly.

Per-PR runs catch single-change regressions but miss interaction effects. Monthly runs catch interaction effects but let too much rot accumulate. Weekly is the seam.

Catches drift early

Embedding indexes shift. Prompts get edited. Tool registries change. A week is short enough to attribute a regression to a specific change.

Forces a habit

If the gate runs every Friday and the team reviews it together, eval health becomes part of the engineering rhythm, not a quarterly fire drill.

Compounds into trend

52 weekly snapshots a year is a trend line you can actually read. Quarterly is four data points.

the rubric

Locked at week zero.
The contract doesn't change mid-pilot.

The rubric is drafted in the phase 1 audit and locked at the start of the pilot. We don't move thresholds when results are uncomfortable. We document the gap and decide whether to accept it or change the system.
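A minimal sketch of what "locked" can mean in code, assuming every metric is higher-is-better. The `Rubric` class, the metric names, and the threshold values are hypothetical illustrations, not the harness's actual schema.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Rubric:
    """Thresholds fixed at week zero (hypothetical structure)."""
    version: str
    thresholds: MappingProxyType  # metric name -> minimum acceptable score

    def gaps(self, scores: dict[str, float]) -> dict[str, float]:
        """Return metric -> shortfall for each metric below its threshold.

        A metric missing from `scores` counts as 0.0, so it shows up as a gap.
        """
        return {
            name: round(minimum - scores.get(name, 0.0), 4)
            for name, minimum in self.thresholds.items()
            if scores.get(name, 0.0) < minimum
        }


# Illustrative numbers only; real thresholds come out of the audit.
RUBRIC = Rubric(
    version="week-0",
    thresholds=MappingProxyType({"faithfulness": 0.85, "pass_at_1": 0.70}),
)

week_scores = {"faithfulness": 0.84, "pass_at_1": 0.71}
print(RUBRIC.gaps(week_scores))
```

The frozen dataclass and the read-only mapping make an accidental in-process threshold edit raise an error. A deliberate change still has to go through a reviewed commit, and the gap is reported rather than hidden.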

Eval suite in the repo

The suite lives in the project repo. Same git history as the application code. Reviewable, diffable, blameable.

Threshold changes need a reason

Raising a threshold is easier to justify than lowering one. Either way, we document the reason on every threshold edit.

Corpus + prompts versioned

We score this week's system on the same versioned corpus and eval prompts as last week's. No moving the goalposts.
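One way to enforce "same corpus as last week" is to fingerprint the eval set and refuse to compare runs whose fingerprints differ. This is a sketch under assumptions: the example record shape and both function names are hypothetical.

```python
import hashlib
import json


def fingerprint(examples: list[dict]) -> str:
    """SHA-256 over a canonical JSON dump; key order inside a record is ignored."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_corpus(last_week: str, this_week: str) -> None:
    """Fail loudly instead of comparing scores across different corpora."""
    if last_week != this_week:
        raise ValueError("eval corpus changed since last week; scores are not comparable")


# Hypothetical eval example.
corpus = [{"id": "q1", "question": "What is the refund window?", "expected": "30 days"}]
assert_same_corpus(fingerprint(corpus), fingerprint(list(corpus)))
```

Record order within the list still affects the hash, so the corpus file should be stored in a fixed order. Storing the fingerprint next to each week's scores makes a silent corpus edit visible in the diff.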

the gate

What the Friday review looks like.
Forty-five minutes, three deltas, two decisions.

The team meets for 45 minutes every Friday. We walk three deltas, take two decisions, write the notes. That's it.

  1. Delta 1 · quality

     Recall, faithfulness, pass@1, citation accuracy. Side-by-side with last week. Per-segment breakdowns where they matter (PII traffic, regulated-content, long-tail intents).

  2. Delta 2 · cost + latency

     $/1k queries and p95 latency on the same dated run. Quality regressions hidden behind cost savings get flagged. So do cost regressions hidden behind quality gains.

  3. Delta 3 · failure modes

     Top five failed examples this week. Read them out loud. Categorise. Decide whether they're systemic or noise.

  4. Decision 1 · promote?

     Does this week's build qualify for broader exposure: feature-flag percentage up, new segment, new region?

  5. Decision 2 · what changes

     One or two concrete changes for next week, written in the notes, owned by a named engineer.
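The first two deltas can be sketched as a comparison between two dated runs. Everything here is an assumption for illustration: the `WeeklyRun` shape, the function names, and the sample numbers (the faithfulness values reuse the 88% to 84% drift mentioned elsewhere on this page).

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class WeeklyRun:
    quality: dict[str, float]   # metric -> score in [0, 1], higher is better
    total_cost_usd: float       # spend for the whole dated run
    latencies_ms: list[float]   # one entry per query in the run

    @property
    def cost_per_1k(self) -> float:
        return 1000 * self.total_cost_usd / len(self.latencies_ms)

    @property
    def p95_ms(self) -> float:
        # n=100 yields 99 cut points; index 94 is the 95th percentile.
        # "inclusive" keeps the estimate within the observed range.
        return quantiles(self.latencies_ms, n=100, method="inclusive")[94]


def quality_deltas(prev: WeeklyRun, curr: WeeklyRun) -> dict[str, float]:
    """This week's score minus last week's, per metric."""
    return {m: curr.quality[m] - prev.quality[m] for m in prev.quality}


def hidden_tradeoffs(prev: WeeklyRun, curr: WeeklyRun) -> list[str]:
    """Flag quality and cost moving in directions that can mask each other."""
    q = quality_deltas(prev, curr)
    cost_delta = curr.cost_per_1k - prev.cost_per_1k
    flags = []
    if cost_delta < 0 and any(d < 0 for d in q.values()):
        flags.append("quality regression hidden behind a cost saving")
    if cost_delta > 0 and any(d > 0 for d in q.values()):
        flags.append("cost regression hidden behind a quality gain")
    return flags


last_week = WeeklyRun({"faithfulness": 0.88}, 0.040, [410.0, 520.0, 480.0, 900.0])
this_week = WeeklyRun({"faithfulness": 0.84}, 0.032, [400.0, 500.0, 470.0, 880.0])
print(hidden_tradeoffs(last_week, this_week))
```

A real run has far more than four queries. The tiny sample only keeps the sketch readable, and it is also why the p95 method is pinned explicitly.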

promote vs block
design decision · 01

Promote

when
Every threshold passes and no new systemic failure appears
then
Build qualifies for broader exposure: feature-flag percentage up, or new segment, or new region. Pick one, not all three.
design decision · 02

Block

when
Any locked threshold fails
then
Build stays at its current feature-flag percentage. No exceptions for 'we'll fix it next week'. The threshold was the contract.
design decision · 03

Investigate

when
Thresholds pass, but a failure mode looks systemic
then
Don't block, but add a test to the suite so the same failure is caught next time. The suite grows from these moments.
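The three outcomes reduce to a small decision function. A sketch, assuming higher-is-better metrics and that the "looks systemic" judgement is made by the team in the review and passed in as a boolean. The names and threshold values are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    PROMOTE = "promote"
    BLOCK = "block"
    INVESTIGATE = "investigate"


def gate(scores: dict[str, float],
         thresholds: dict[str, float],
         systemic_failure: bool) -> Outcome:
    """Apply the locked thresholds first, then the human failure-mode call."""
    # A metric missing from this week's scores is treated as a failure.
    if any(scores.get(m, float("-inf")) < minimum for m, minimum in thresholds.items()):
        return Outcome.BLOCK
    if systemic_failure:
        return Outcome.INVESTIGATE
    return Outcome.PROMOTE


# Illustrative thresholds only.
thresholds = {"faithfulness": 0.85, "pass_at_1": 0.70}
print(gate({"faithfulness": 0.84, "pass_at_1": 0.71}, thresholds, systemic_failure=False))
```

Checking thresholds before anything else mirrors the "no exceptions" rule: a failing threshold blocks regardless of how the failure-mode discussion went.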
tooling

What's running underneath.
Open-source harness plus a small dashboard.

The technical surface is small. The harness runs the rubric. A dashboard renders the deltas. The notes live in the repo.

paiteq/ai-eval-harness

Wraps Ragas (Es et al. 2023, arXiv:2309.15217), promptfoo (open-source LLM eval CLI), and Inspect AI (UK AISI agent harness). Trace data is routed through Langfuse and Braintrust for inspection. MIT-licensed. The same harness produces our public benchmarks at getwidget.dev/benchmarks/.

Internal dashboard

Quality, cost, and latency on one screen. Per-segment breakdowns. Last 12 weeks visible by default. Exportable to PDF for compliance reviews.

Friday notes in repo

Three deltas, two decisions, named owners. Committed to the repo as plain markdown. Searchable in 12 months when the next regression happens.
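A sketch of how the notes could be rendered to markdown before committing. The layout, the `Change` type, and the rule that rejects an unowned change are assumptions for illustration, not the format actually used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Change:
    description: str
    owner: str  # a named engineer, never a team alias


def render_notes(day: date, deltas: dict[str, str],
                 promote: bool, changes: list[Change]) -> str:
    """Render one review as plain markdown: deltas, then decisions."""
    for change in changes:
        if not change.owner.strip():
            raise ValueError(f"change has no named owner: {change.description!r}")
    lines = [f"# Eval gate {day.isoformat()}", "", "## Deltas"]
    lines += [f"- {name}: {summary}" for name, summary in deltas.items()]
    lines += ["", "## Decisions", f"- Promote: {'yes' if promote else 'no'}"]
    lines += [f"- Change: {c.description} (owner: {c.owner})" for c in changes]
    return "\n".join(lines) + "\n"


# Hypothetical review content.
notes = render_notes(
    date(2026, 1, 9),
    {"quality": "faithfulness 0.88 -> 0.84",
     "cost + latency": "$/1k down, p95 flat",
     "failure modes": "3 of 5 failures share a stale index"},
    promote=False,
    changes=[Change("re-rank retrieval results", "A. Engineer")],
)
print(notes)
```

Plain markdown with stable headings keeps the notes greppable, which is what makes them useful a year later.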

sources + further reading

Where this cadence comes from.
Not invented. Composed from published practice.

The weekly eval gate borrows from LLM evaluation research, AI risk-management standards, and the long history of SRE error-budget practice.

Ragas (Es et al. 2023)

arXiv:2309.15217. Underpins our faithfulness and answer-relevance scoring in the weekly rubric.

promptfoo

Open-source LLM eval CLI (github.com/promptfoo/promptfoo). Powers the regression layer of the Friday gate.

NIST AI RMF — MANAGE function

NIST AI 100-1 (January 2023). MANAGE 1 anchors our weekly cadence: 'AI risks identified… are prioritized, responded to, and managed'.

Google SRE — error budgets

The Friday gate borrows from SRE practice: a measurable threshold, a defined response, and a write-up that survives turnover.

Services this weekly-eval cadence drives: AI software development, AI agent development, AI chatbot development, AI knowledge base, intelligent document processing, and AI voice agents. Each of these ships against a weekly eval gate before cutover.

next step

Want this cadence on your AI work?
Starts in the discovery audit.

The audit writes the rubric. The pilot proves the cadence works on your system. Continuous delivery makes it routine. Recent regressions the weekly gate caught: 88% → 84% faithfulness drift on a RAG corpus (2026-Q1, reverted before promotion); 71% → 67% pass@1 drift on the agent harness (2026-Q1, fixed via retrieval re-ranking). Where eval ends, governance gates beyond evals (/services/ai-governance/) pick up: model risk paperwork, audit logs, red-team cases, regulator readiness.