AI engineering at GetWidget
Eval-first delivery, in public.
We design and ship production AI systems: RAG, agents, voice, document processing, and governance programs. Model-agnostic on principle. Eval-first by default. Open-source where it earns trust, paid where it earns its keep. Operator engineering, not strategy decks.
Build AI products.
End-to-end builds for teams who already know they need AI.
From greenfield prototype to production system. Every build runs through the same eval-gated delivery cadence.
AI development
Custom LLM apps shipped on eval gates. From greenfield prototype to production system, model-agnostic by default.
View service →
AI agent development
Tool-using agents with reliability evals. Pass@1, pass@5, mean steps, mean cost on the same dated rubric.
View service →
AI chatbot development
RAG-backed chat for support, sales, and internal use. Citation accuracy gated, refusal behaviour locked, audit logs on.
View service →
AI voice agents
Real-time voice with sub-500ms p95 latency budgets. Speech-to-speech or speech-to-text-to-speech, scored on the same eval shape.
View service →
Integrate AI into existing systems.
When the model is the easy part and the integration is the work.
Wiring LLMs into CRM, ERP, support, and back-office stacks. Reversible, audited, with a human in the loop where it earns trust.
AI integration
Wire LLMs into CRM, ERP, and support stacks. The model is the easy part; the integration is the work.
View service →
AI automation
Replace workflow steps with audited AI calls. Eval-gated, reversible, with a human-in-the-loop tier where it counts.
View service →
Intelligent document processing
PDFs, contracts, claims, statements at scale. OCR + extraction + classification, scored on a held-out set.
View service →
AI knowledge base
Internal RAG over your docs, tickets, transcripts. Citation-accurate, role-aware, audit-logged.
View service →
Choose models. Govern programs.
The strategy + governance layer for teams with five LLM apps in flight.
Audit-first consulting and engineering-led governance. EU AI Act, NIST AI RMF, ISO 42001 mapped to the eval suite we'd actually run.
AI consulting
Audit, prioritise, pick the right next bet. Written deliverables, not strategy decks. Ends with a rubric and a go/no-go.
View service →
AI governance
EU AI Act, NIST AI RMF, ISO 42001 programs. Engineering-led, audit-log-first, regulator-ready.
View service →
Claude development
Claude API + Claude Code engineering depth. Operators using Anthropic's tooling daily in our own delivery.
View service →
OpenAI development
GPT, embeddings, Realtime, structured outputs. Function calling, structured JSON, Realtime voice in production.
View service →
Eval-first delivery.
Audit, pilot, continuous: gated on real evals.
Every engagement runs through three phases. Each phase has a measurable exit criterion that the eval suite enforces. No model goes to production without passing the same rubric we publish on our benchmarks.
1. Discovery audit
1-2 weeks. We map your current AI surface, pick the highest-impact bet, and write the eval rubric. Ends with a written prioritisation and a go/no-go.
2. Pilot with weekly eval gates
4-6 weeks. Working system in production behind a feature flag. Weekly eval gates decide what ships, not vibes. Cost reported alongside quality.
3. Continuous delivery
Ongoing. Dedicated engineering team, eval suite versioned with the code, monthly model-selection re-checks, real on-call rotation.
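To make "eval gate" concrete, here is a minimal sketch of what a weekly gate check could look like. It is illustrative only and is not taken from our harness: the `Run`, `Gate`, and `evaluate` names, the three metrics chosen, and the threshold values are all assumptions for the example.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class Run:
    """One scored eval task (hypothetical record shape)."""
    passed: bool        # did the output meet the rubric?
    latency_ms: float   # end-to-end latency for this task
    cost_usd: float     # model + tool spend for this task


@dataclass
class Gate:
    """Exit criteria for a release (illustrative thresholds)."""
    min_pass_rate: float
    max_p95_latency_ms: float
    max_mean_cost_usd: float


def p95(values: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(values, n=20, method="inclusive")[-1]


def evaluate(runs: list[Run], gate: Gate) -> dict:
    """Compute quality, latency, and cost, then decide whether the build ships."""
    metrics = {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
        "p95_latency_ms": p95([r.latency_ms for r in runs]),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
    }
    failures = []
    if metrics["pass_rate"] < gate.min_pass_rate:
        failures.append("pass_rate")
    if metrics["p95_latency_ms"] > gate.max_p95_latency_ms:
        failures.append("p95_latency_ms")
    if metrics["mean_cost_usd"] > gate.max_mean_cost_usd:
        failures.append("mean_cost_usd")
    # The build ships only when every criterion is met.
    return {**metrics, "failures": failures, "ship": not failures}


if __name__ == "__main__":
    runs = [Run(True, 120.0, 0.01)] * 9 + [Run(False, 900.0, 0.05)]
    gate = Gate(min_pass_rate=0.95, max_p95_latency_ms=500.0, max_mean_cost_usd=0.02)
    print(evaluate(runs, gate))
```

The point of the shape is that cost and latency sit in the same result as quality, so a release can fail the gate on any of them.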
Benchmarks + harness, in public.
So you can audit the methodology before you hire us.
Most agencies pick a model because the founder likes it. We pick a model because the eval said so. The harness and the benchmarks are open-source so anyone can verify the result.
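As a rough illustration of "the eval said so", the sketch below picks the cheapest candidate that clears a quality bar. The `Candidate` fields, the `select_model` helper, the model names, and the numbers are hypothetical and are not the harness's API or published results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    quality: float          # rubric score in [0, 1] on the eval set
    cost_per_1k_usd: float  # spend per 1,000 eval tasks (assumed unit)


def select_model(candidates: list[Candidate], min_quality: float) -> Optional[Candidate]:
    """Return the cheapest candidate meeting the quality bar, or None if none does."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        return None
    # Cheapest first; on a cost tie, prefer the higher rubric score.
    return min(eligible, key=lambda c: (c.cost_per_1k_usd, -c.quality))


if __name__ == "__main__":
    pool = [
        Candidate("model-a", quality=0.92, cost_per_1k_usd=12.0),
        Candidate("model-b", quality=0.90, cost_per_1k_usd=4.0),
        Candidate("model-c", quality=0.80, cost_per_1k_usd=1.0),
    ]
    print(select_model(pool, min_quality=0.88))
```

A real selection would also weigh latency and per-task variance; the sketch only shows quality and cost being read from the same eval run.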
Dated benchmarks
Quarterly, reproducible benchmarks on RAG retrieval, agent reliability, and LLM selection. Cost reported alongside quality.
View benchmarks →
paiteq/ai-eval-harness
MIT-licensed eval harness. Ragas + promptfoo + Inspect AI + custom agent rubrics, with Langfuse and Braintrust as the trace + dashboard surfaces. The same harness our delivery team runs on engagements.
View on GitHub →
Public eval datasets
Eval corpora mirrored to HuggingFace so anyone can reproduce our published numbers on their own infrastructure.
View on HuggingFace →