Promote
- we rejected
- Every threshold passes, no new systemic failure
- because
- Build qualifies for broader exposure. Feature-flag percentage up, or new segment, or new region. Pick one, not all three.
Every AI engagement we run has a weekly eval gate. A week is the longest interval that still catches regressions before they compound, and the shortest that doesn't drown the team in noise. Here's what the Friday review actually looks like.
Per-PR runs catch single-change regressions but miss interaction effects. Monthly runs catch interaction effects but let too much rot accumulate. Weekly is the seam.
Embedding indexes shift. Prompts get edited. Tool registries change. A week is short enough to attribute a regression to a specific change.
If the gate runs every Friday and the team reviews it together, eval health becomes part of the engineering rhythm, not a quarterly fire drill.
Fifty-two weekly snapshots a year give you a trend line you can actually read. Quarterly gives you four data points.
The rubric is drafted in the phase 1 audit and locked at the start of the pilot. We don't move thresholds when results are uncomfortable. We document the gap and decide whether to accept it or change the system.
The suite lives in the project repo. Same git history as the application code. Reviewable, diffable, blameable.
Tightening a threshold is easier to justify than loosening one. Either way, we document the why on every threshold edit.
We score this week's system on the same corpus we scored last week's. No moving the goalposts.
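A minimal sketch of how a locked rubric and a fixed corpus could be checked in code. The `Threshold` type, the function names, and the threshold values are hypothetical and chosen only for illustration; they are not taken from the harness.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a locked threshold cannot be mutated in place
class Threshold:
    metric: str
    minimum: float  # hypothetical value; any edit should carry a documented reason


def corpus_fingerprint(examples: list[str]) -> str:
    """Hash the eval corpus so two runs can show they scored the same examples."""
    digest = hashlib.sha256()
    for example in examples:
        digest.update(example.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()


def failing_metrics(scores: dict[str, float], rubric: list[Threshold]) -> list[str]:
    """Return metrics below their locked minimum; a missing score counts as failing."""
    return [t.metric for t in rubric if scores.get(t.metric, 0.0) < t.minimum]


rubric = [Threshold("faithfulness", 0.90), Threshold("recall", 0.80)]
same_corpus = corpus_fingerprint(["q1", "q2"]) == corpus_fingerprint(["q1", "q2"])
print(same_corpus)                                                      # True
print(failing_metrics({"faithfulness": 0.93, "recall": 0.76}, rubric))  # ['recall']
```

Storing the fingerprint next to each weekly result makes a silently changed corpus show up as a mismatch rather than as a suspicious score jump.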
The team meets for 45 minutes every Friday. We walk three deltas, take two decisions, write the notes. That's it.
Recall, faithfulness, pass@1, citation accuracy. Side-by-side with last week. Per-segment breakdowns where they matter (PII traffic, regulated content, long-tail intents).
$/1k queries and p95 latency on the same dated run. Quality regressions hidden behind cost savings get flagged. So do cost regressions hidden behind quality gains.
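The two flags described above can be sketched as a week-over-week comparison. This is an illustration under stated assumptions: the `Snapshot` shape, the 0.01 tolerance, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass

COST_KEYS = ("cost_per_1k", "p95_latency_ms")


@dataclass(frozen=True)
class Snapshot:
    quality: dict[str, float]  # e.g. recall, faithfulness; higher is better
    cost_per_1k: float         # dollars per 1k queries; lower is better
    p95_latency_ms: float      # lower is better


def week_over_week(prev: Snapshot, curr: Snapshot) -> dict[str, float]:
    """Delta (this week minus last week) for every metric present in both runs."""
    deltas = {name: curr.quality[name] - prev.quality[name]
              for name in curr.quality if name in prev.quality}
    deltas["cost_per_1k"] = curr.cost_per_1k - prev.cost_per_1k
    deltas["p95_latency_ms"] = curr.p95_latency_ms - prev.p95_latency_ms
    return deltas


def tradeoff_flags(prev: Snapshot, curr: Snapshot, tolerance: float = 0.01) -> list[str]:
    """Flag quality drops that ride along with savings, and the reverse."""
    deltas = week_over_week(prev, curr)
    quality = {k: v for k, v in deltas.items() if k not in COST_KEYS}
    got_worse = sorted(k for k, v in quality.items() if v < -tolerance)
    got_better = sorted(k for k, v in quality.items() if v > tolerance)
    saved = any(deltas[k] < 0 for k in COST_KEYS)
    spent = any(deltas[k] > 0 for k in COST_KEYS)
    flags = []
    if got_worse and saved:
        flags.append("quality regression behind cost/latency savings: " + ", ".join(got_worse))
    if got_better and spent:
        flags.append("cost/latency regression behind quality gains: " + ", ".join(got_better))
    return flags


last_week = Snapshot({"recall": 0.82, "faithfulness": 0.94}, 4.10, 1800.0)
this_week = Snapshot({"recall": 0.82, "faithfulness": 0.90}, 3.20, 1750.0)
print(tradeoff_flags(last_week, this_week))
# ['quality regression behind cost/latency savings: faithfulness']
```

Both snapshots are assumed to come from the same dated run, so a cheaper build cannot hide a quality drop measured on a different day.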
Top five failed examples this week. Read them out loud. Categorise. Decide whether they're systemic or noise.
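One way to make the systemic-or-noise call repeatable is to count failure categories after reading the examples. A minimal sketch, assuming a hypothetical rule that a category seen at least twice in the week's top five is treated as systemic; the category names are invented for illustration.

```python
from collections import Counter


def triage(failures: list[tuple[str, str]], min_repeats: int = 2) -> dict[str, str]:
    """Label each failure category 'systemic' or 'noise'.

    failures: (example_id, category) pairs assigned by the reviewers.
    min_repeats: hypothetical cut-off; tune it to the size of the sample.
    """
    counts = Counter(category for _, category in failures)
    return {category: ("systemic" if n >= min_repeats else "noise")
            for category, n in counts.items()}


top_five = [
    ("ex-101", "missing citation"),
    ("ex-117", "stale index"),
    ("ex-123", "missing citation"),
    ("ex-140", "tool timeout"),
    ("ex-152", "missing citation"),
]
print(triage(top_five))
# {'missing citation': 'systemic', 'stale index': 'noise', 'tool timeout': 'noise'}
```

The count is a starting point for the discussion, not a replacement for it: a single failure in a regulated segment may still deserve the systemic label.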
Does this week's build qualify for broader exposure: feature-flag percentage up, new segment, new region?
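The promote decision can be written down as a small function. This sketch assumes a build is held whenever a threshold fails or a new systemic failure appears, and that a promotion widens exposure along exactly one lever. The `Lever` enum and the "hold" wording are hypothetical.

```python
from enum import Enum
from typing import Optional


class Lever(Enum):
    FLAG_PERCENTAGE = "feature-flag percentage up"
    NEW_SEGMENT = "new segment"
    NEW_REGION = "new region"


def promotion_decision(
    failing_metrics: list[str],
    new_systemic_failures: list[str],
    lever: Optional[Lever],
) -> str:
    """Return 'promote: <lever>' or 'hold: <reason>' for this week's build."""
    if failing_metrics:
        return "hold: below threshold on " + ", ".join(failing_metrics)
    if new_systemic_failures:
        return "hold: new systemic failure in " + ", ".join(new_systemic_failures)
    if lever is None:
        return "hold: no exposure lever chosen"
    # A single Lever argument means only one kind of exposure widens per week.
    return "promote: " + lever.value


print(promotion_decision([], [], Lever.NEW_SEGMENT))           # promote: new segment
print(promotion_decision(["recall"], [], Lever.NEW_SEGMENT))   # hold: below threshold on recall
```

Changing one lever at a time keeps next Friday's deltas attributable to a single change in exposure.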
One or two concrete changes for next week, written in the notes, owned by a named engineer.
The technical surface is small. The harness runs the rubric. A dashboard renders the deltas. The notes live in the repo.
Wraps Ragas (Es et al. 2023, arXiv:2309.15217), promptfoo (open-source LLM CLI), and Inspect AI (UK AISI agent harness). Trace data is routed through Langfuse and Braintrust for inspection. MIT-licensed. The same harness produces our public benchmarks at getwidget.dev/benchmarks/.
Quality, cost, and latency on one screen. Per-segment breakdowns. Last 12 weeks visible by default. Exportable to PDF for compliance reviews.
Three deltas, two decisions, named owners. Committed to the repo as plain markdown. Searchable in 12 months when the next regression happens.
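A minimal sketch of a notes renderer that enforces the three-deltas, two-decisions, named-owner shape before anything is committed. The markdown layout, the function name, and the sample content are hypothetical.

```python
from datetime import date


def render_notes(
    review_date: date,
    deltas: list[str],
    decisions: list[tuple[str, str]],  # (decision text, owner name)
) -> str:
    """Render one Friday review as plain markdown, rejecting incomplete notes."""
    if len(deltas) != 3:
        raise ValueError("expected exactly three deltas")
    if len(decisions) != 2:
        raise ValueError("expected exactly two decisions")
    if any(not owner.strip() for _, owner in decisions):
        raise ValueError("every decision needs a named owner")
    lines = [f"# Eval gate review {review_date.isoformat()}", "", "## Deltas"]
    lines += [f"- {delta}" for delta in deltas]
    lines += ["", "## Decisions"]
    lines += [f"- {text} (owner: {owner})" for text, owner in decisions]
    return "\n".join(lines) + "\n"


notes = render_notes(
    date(2025, 1, 3),
    ["faithfulness down 0.04", "cost per 1k queries down", "p95 latency flat"],
    [("Re-index the help-centre corpus", "Priya"), ("Hold the flag percentage", "Sam")],
)
print(notes)
```

Writing the result to a dated file in the repo (for example a hypothetical `eval-notes/2025-01-03.md`) keeps the notes in the same history as the code they describe.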
The weekly eval gate borrows from LLM evaluation research, AI risk-management standards, and the long history of SRE error-budget practice.
Ragas (Es et al. 2023), arXiv:2309.15217. Underpins our groundedness and answer-relevance scoring in the weekly rubric.
promptfoo: open-source LLM eval CLI (github.com/promptfoo/promptfoo). Powers the regression layer of the Friday gate.
NIST AI Risk Management Framework, NIST AI 100-1 (January 2023). The MANAGE function anchors our weekly cadence: MANAGE 1 asks that AI risks be 'prioritized, responded to, and managed'.
The Friday gate borrows from SRE practice: a measurable threshold, a defined response, and a write-up that survives turnover.
Services this weekly-eval cadence drives: AI software development, AI agent development, AI chatbot development, AI knowledge base, intelligent document processing, and AI voice agents. Every pillar above ships against a weekly eval gate before cutover.