Why weekly eval gates instead of per-PR or monthly?

Per-PR runs catch single-change regressions but miss interaction effects. Monthly runs catch interaction effects but let too much rot accumulate. A week is short enough to attribute a regression to a specific change and long enough to give the team room to investigate. It also produces 52 trend-line data points a year, where a quarterly review produces four.

What gets reviewed at the Friday eval gate?

Three deltas (quality including recall and faithfulness, cost plus p95 latency, top-five failure modes) and two decisions (promote to a higher feature-flag percentage or hold, plus one or two concrete changes for next week owned by a named engineer). Forty-five minutes, notes committed to the repo as markdown.

What blocks a promotion?

Any locked rubric threshold failing. No exceptions for 'we'll fix it next week' — the threshold was the contract. If thresholds pass but a systemic failure mode appears, the build stays where it is and a new test gets added to the rubric to prevent the next regression.

What tooling powers the weekly gate?

paiteq/ai-eval-harness (MIT) wraps Ragas, promptfoo, and Inspect AI. Trace data routes through Langfuse and Braintrust. A small internal dashboard renders quality, cost, and latency on one screen with the last 12 weeks visible by default. Friday notes live in the project repo as searchable markdown.

methodology · cadence

Weekly eval gates
What ships, what doesn't, decided by the rubric.

Every AI engagement we run has a weekly eval gate. It's the smallest cadence that catches regressions before they compound, and the largest cadence that doesn't drown the team in noise. Here's what the Friday review actually looks like.


why weekly

Why a week is the right unit.
Not per-PR, not monthly.

Per-PR runs catch single-change regressions but miss interaction effects. Monthly runs catch interaction effects but let too much rot accumulate. Weekly is the seam.

Catches drift early

Embedding indexes shift. Prompts get edited. Tool registries change. A week is short enough to attribute a regression to a specific change.

Forces a habit

If the gate runs every Friday and the team reviews it together, eval health becomes part of the engineering rhythm, not a quarterly fire drill.

Compounds into trend

52 weekly snapshots a year is a trend line you can actually read. Quarterly is four data points.

the rubric

Locked at week zero.
The contract doesn't change mid-pilot.

The rubric is drafted in the phase 1 audit and locked at the start of the pilot. We don't move thresholds when results are uncomfortable. We document the gap and decide whether to accept it or change the system.

Eval suite in the repo

The suite lives in the project repo. Same git history as the application code. Reviewable, diffable, blameable.

Threshold changes need a reason

Raising a threshold is easier to justify than lowering one. Either way, we document the reason on every threshold edit.

Corpus + prompts versioned

We score this week's system on the same corpus we scored last week's. No moving the goalposts.
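A minimal sketch of what "locked" can mean in practice, assuming the rubric is a small immutable record stored next to the suite. The names `Threshold`, `Rubric`, `lock_rubric`, and `same_corpus` are hypothetical and are not part of paiteq/ai-eval-harness; the 0.85 floor in the usage note is an illustrative value, not one of our thresholds.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class Threshold:
    metric: str      # e.g. "faithfulness"
    minimum: float   # lowest acceptable score for this metric
    reason: str      # why this value was chosen or last edited


@dataclass(frozen=True)
class Rubric:
    corpus_sha256: str                # fingerprint of the versioned eval corpus
    thresholds: tuple[Threshold, ...]


def fingerprint(examples: list[dict]) -> str:
    """Hash the corpus deterministically so two runs can prove they scored the same data."""
    canonical = json.dumps(examples, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lock_rubric(examples: list[dict], thresholds: list[Threshold]) -> Rubric:
    """Freeze thresholds and corpus fingerprint; refuse thresholds with no documented reason."""
    for t in thresholds:
        if not t.reason.strip():
            raise ValueError(f"threshold for {t.metric!r} has no documented reason")
    return Rubric(fingerprint(examples), tuple(thresholds))


def same_corpus(rubric: Rubric, examples: list[dict]) -> bool:
    """True when this week's corpus is byte-for-byte the one the rubric was locked against."""
    return rubric.corpus_sha256 == fingerprint(examples)
```

Calling `lock_rubric(corpus, [Threshold("faithfulness", 0.85, "phase 1 audit baseline")])` at week zero and `same_corpus(rubric, corpus)` before each Friday run would surface a silently edited corpus before anyone reads the scores. The frozen dataclasses make an in-place threshold edit raise instead of quietly succeeding.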

the gate

What the Friday review looks like.
Forty-five minutes, three deltas, two decisions.

The team meets for 45 minutes every Friday. We walk three deltas, take two decisions, write the notes. That's it.

Delta 1 · quality

Recall, faithfulness, pass@1, citation accuracy. Side-by-side with last week. Per-segment breakdowns where they matter (PII traffic, regulated-content, long-tail intents).

Delta 2 · cost + latency

$/1k queries and p95 latency on the same dated run. Quality regressions hidden behind cost savings get flagged. So do cost regressions hidden behind quality gains.

Delta 3 · failure modes

Top five failed examples this week. Read them out loud. Categorise. Decide whether they're systemic or noise.

Decision 1 · promote?

Does this week's build qualify for broader exposure: feature-flag percentage up, new segment, new region?

Decision 2 · what changes

One or two concrete changes for next week, written in the notes, owned by a named engineer.
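The first two deltas can be sketched as a week-over-week subtraction plus a cross-check between quality and cost. This is an illustration under assumptions: `WeeklyRun`, `deltas`, and `flags` are hypothetical names, a single quality metric stands in for the full set, and the cost and latency figures in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyRun:
    faithfulness: float      # 0..1, higher is better
    cost_per_1k_usd: float   # $/1k queries, lower is better
    p95_latency_ms: float    # lower is better


def deltas(last: WeeklyRun, this: WeeklyRun) -> dict[str, float]:
    """This week minus last week, rounded to keep float noise out of the notes."""
    return {
        "faithfulness": round(this.faithfulness - last.faithfulness, 4),
        "cost_per_1k_usd": round(this.cost_per_1k_usd - last.cost_per_1k_usd, 4),
        "p95_latency_ms": round(this.p95_latency_ms - last.p95_latency_ms, 4),
    }


def flags(d: dict[str, float]) -> list[str]:
    """Call out trade-offs where one delta could mask the other."""
    out = []
    if d["faithfulness"] < 0 and d["cost_per_1k_usd"] < 0:
        out.append("quality regression hidden behind cost savings")
    if d["faithfulness"] > 0 and d["cost_per_1k_usd"] > 0:
        out.append("cost regression hidden behind quality gain")
    return out
```

With `last = WeeklyRun(0.88, 4.00, 1800.0)` and `this = WeeklyRun(0.84, 3.20, 1750.0)`, `deltas(last, this)` reports faithfulness down 0.04 and cost down 0.80, and `flags` returns the "quality regression hidden behind cost savings" warning. A cheaper, faster build that got less faithful is exactly the case the side-by-side view exists to catch.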

promote vs block

design decision · 01

Promote

when: Every threshold passes, no new systemic failure
then: Build qualifies for broader exposure. Feature-flag percentage up, or new segment, or new region. Pick one, not all three.

design decision · 02

Block

when: Any locked threshold fails
then: Build stays at its current feature-flag percentage. No exceptions for 'we'll fix it next week'. The threshold was the contract.

design decision · 03

Investigate

when: Thresholds pass, but a failure mode looks systemic
then: Don't block, but add a test to the rubric to prevent the next regression. The suite grows from these moments.
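The three outcomes above reduce to a short decision function. A minimal sketch, assuming scores and locked minimums arrive as plain dictionaries keyed by metric name; `Outcome` and `gate` are hypothetical names, and treating a missing score as a failure is our assumption for the sketch rather than documented harness behaviour.

```python
from enum import Enum


class Outcome(Enum):
    PROMOTE = "promote"
    BLOCK = "block"
    INVESTIGATE = "investigate"


def gate(
    scores: dict[str, float],
    minimums: dict[str, float],
    systemic_failure: bool,
) -> tuple[Outcome, list[str]]:
    """Return the gate outcome and the names of any failed thresholds."""
    # A metric with no score this week counts as failing (assumption).
    failed = sorted(
        metric
        for metric, floor in minimums.items()
        if scores.get(metric, float("-inf")) < floor
    )
    if failed:
        return Outcome.BLOCK, failed          # any locked threshold fails
    if systemic_failure:
        return Outcome.INVESTIGATE, []        # passes, but a failure mode looks systemic
    return Outcome.PROMOTE, []                # every threshold passes, nothing systemic
```

The threshold check runs first on purpose: a failing threshold blocks regardless of what the failure-mode review concluded, which keeps 'we'll fix it next week' out of the code path.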

tooling

What's running underneath.
Open-source harness plus a small dashboard.

The technical surface is small. The harness runs the rubric. A dashboard renders the deltas. The notes live in the repo.

paiteq/ai-eval-harness

Wraps Ragas (Es et al. 2023, arXiv:2309.15217), promptfoo (open-source LLM eval CLI), and Inspect AI (UK AISI agent harness). Trace data is routed through Langfuse and Braintrust for inspection. MIT-licensed. The same harness produces our public benchmarks at getwidget.dev/benchmarks/.

Internal dashboard

Quality, cost, and latency on one screen. Per-segment breakdowns. Last 12 weeks visible by default. Exportable to PDF for compliance reviews.

Friday notes in repo

Three deltas, two decisions, named owners. Committed to the repo as plain markdown. Searchable in 12 months when the next regression happens.
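One way to keep the notes uniform enough to grep a year later is to render them from the same few fields every week. A sketch only: `render_notes` and the headings it emits are a hypothetical layout, not the format our repos necessarily use.

```python
from datetime import date


def render_notes(
    day: date,
    deltas: dict[str, str],
    promote: bool,
    changes: list[tuple[str, str]],
) -> str:
    """Render one Friday's notes as markdown: the deltas, then the two decisions."""
    if not 1 <= len(changes) <= 2:
        raise ValueError("expected one or two changes for next week")
    lines = [f"# Eval gate {day.isoformat()}", "", "## Deltas"]
    lines += [f"- {name}: {summary}" for name, summary in deltas.items()]
    lines += ["", "## Decisions", f"- Promote: {'yes' if promote else 'hold'}"]
    for change, owner in changes:
        if not owner.strip():
            raise ValueError(f"change {change!r} has no named owner")
        lines.append(f"- Next week: {change} (owner: {owner})")
    return "\n".join(lines) + "\n"
```

The two `ValueError` checks encode the rules of the meeting: one or two changes, each with a named owner. Writing the result to a dated file in the repo gives the notes the same git history as the eval suite.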

sources + further reading

Where this cadence comes from.
Not invented. Composed from published practice.

The weekly eval gate borrows from LLM evaluation research, AI risk-management standards, and the long history of SRE error-budget practice.

Ragas (Es et al. 2023)

arXiv:2309.15217. Underpins our groundedness + answer-relevance scoring in the weekly rubric.

promptfoo

Open-source LLM eval CLI (github.com/promptfoo/promptfoo). Powers the regression layer of the Friday gate.

NIST AI RMF — MANAGE function

NIST AI 100-1 (January 2023). The MANAGE 1 category anchors our weekly cadence: 'AI risks identified… are prioritized, responded to, and managed'.

Google SRE — error budgets

The Friday gate borrows from SRE practice: a measurable threshold, a defined response, and a write-up that survives turnover.

Services this weekly-eval cadence drives: AI software development, AI agent development, AI chatbot development, AI knowledge base, intelligent document processing, and AI voice agents. Each of these services ships against a weekly eval gate before cutover.

next step

Want this cadence on your AI work?
Starts in the discovery audit.

The audit writes the rubric. The pilot proves the cadence works on your system. Continuous delivery makes it routine. Recent regressions the weekly gate caught: 88% → 84% faithfulness drift on a RAG corpus (2026-Q1, reverted before promote); 71% → 67% pass@1 drift on the agent harness (2026-Q1, fixed via retrieval re-rank). Where eval ends, governance gates beyond evals (/services/ai-governance/) pick up: model risk paperwork, audit logs, red-team cases, regulator readiness.



