What does an AI engineering services engagement look like?

Three phases: a 1-2 week discovery audit that writes the eval rubric, a 4-6 week pilot in production behind a feature flag with weekly eval gates, then ongoing continuous delivery with a dedicated engineering team. Each phase has a measurable exit criterion that the eval suite enforces. No model goes to production without passing the rubric.

Which LLM models do you build on?

Model-agnostic by default. We score Claude Sonnet 4.6, Claude Opus 4.7, GPT-5 mini, GPT-4o, Gemini 2.5 Pro, and a Llama 3.1 70B baseline on the same rubric and the same client corpus, then pick on the eval. Where one model shines on your data, we say why on the page.
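As a rough illustration of scoring several models on the same rubric and the same corpus, here is a minimal sketch. The model names, the stub generators, and the exact-match scorer are all hypothetical stand-ins. The page does not say how the real rubric is scored.

```python
from typing import Callable

def exact_match(predicted: str, expected: str) -> float:
    """Toy scorer (an assumption): 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0

def compare_models(
    models: dict[str, Callable[[str], str]],
    corpus: list[tuple[str, str]],
) -> list[tuple[str, float]]:
    """Run every model over the same (question, expected) pairs and rank by mean score."""
    if not corpus:
        raise ValueError("corpus must not be empty")
    results = []
    for name, generate in models.items():
        total = sum(exact_match(generate(q), expected) for q, expected in corpus)
        results.append((name, total / len(corpus)))
    # Highest mean score first; a real rubric would also weigh cost and latency.
    return sorted(results, key=lambda item: item[1], reverse=True)

# Hypothetical stand-ins for real model clients.
stub_models = {
    "model-a": lambda q: "paris" if "France" in q else "unknown",
    "model-b": lambda q: "unknown",
}
stub_corpus = [("Capital of France?", "Paris"), ("Capital of Peru?", "Lima")]
ranking = compare_models(stub_models, stub_corpus)
```

Every model sees identical inputs and an identical scorer, so differences in the ranking come from the models and not from the test setup.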

What is eval-first delivery?

Every AI system we ship has an eval suite versioned with the code. Recall, faithfulness, latency p95, and dollars per 1k queries are scored on every PR. The Friday eval gate decides what gets promoted, not vibes. The same open-source harness that runs our public benchmarks runs on client engagements.
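A gate of this kind can be sketched as a threshold check over the four metrics named above. The threshold values, metric keys, and function name below are illustrative assumptions, not the published rubric.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool

# Hypothetical limits; a real rubric would be set per engagement.
RUBRIC = (
    Threshold("recall", 0.85, True),
    Threshold("faithfulness", 0.90, True),
    Threshold("latency_p95_ms", 1500.0, False),
    Threshold("usd_per_1k_queries", 4.00, False),
)

def eval_gate(scores: dict[str, float], rubric=RUBRIC) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for t in rubric:
        value = scores.get(t.metric)
        if value is None:
            failures.append(f"{t.metric}: missing score")
            continue
        passed = value >= t.limit if t.higher_is_better else value <= t.limit
        if not passed:
            bound = "at least" if t.higher_is_better else "at most"
            failures.append(f"{t.metric}: got {value}, need {bound} {t.limit}")
    return failures
```

In a CI job, a non-empty result would fail the PR check. A missing metric counts as a failure, so a skipped eval cannot pass silently.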

Do you do AI governance and compliance work?

Yes. We map programs to the NIST AI RMF 1.0 (January 2023), ISO/IEC 42001:2023, and the EU AI Act (Regulation (EU) 2024/1689), specifically Articles 12, 13, and 14 on record-keeping, transparency, and human oversight. The audit-log surface is built into the engineering stack, not bolted on for the regulator.
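To make "audit-log surface" concrete, here is a minimal sketch of one log record per model call, with each record chained to the previous one by hash so that tampering is detectable. The field names and the chaining scheme are assumptions for illustration. The regulations do not prescribe this layout.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional

def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def audit_record(
    model: str,
    prompt: str,
    output: str,
    reviewer: Optional[str] = None,
    prev_hash: str = "",
) -> dict:
    """Build one append-only log entry (hypothetical schema)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # Store digests rather than raw text so the log itself holds no user content.
        "prompt_sha256": _sha256(prompt),
        "output_sha256": _sha256(output),
        "human_reviewer": reviewer,  # None when no human was in the loop
        "prev_hash": prev_hash,      # links this entry to the one before it
    }
    record["record_hash"] = _sha256(json.dumps(record, sort_keys=True))
    return record

first = audit_record("model-a", "What is our refund policy?", "30 days.")
second = audit_record("model-a", "And for EU orders?", "Also 30 days.",
                      reviewer="j.doe", prev_hash=first["record_hash"])
```

Writing each record as one JSON line to append-only storage would keep the log simple to replay during an audit.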

ai engineering · model-agnostic

AI engineering at GetWidget
Eval-first delivery, in public.

We design and ship production AI systems: RAG, agents, voice, document processing, and governance programs. Model-agnostic on principle. Eval-first by default. Open-source where it earns trust, paid where it earns its keep. Operator engineering, not strategy decks.

Start an audit conversation · How we benchmark

services · build AI products

Build AI products.
End-to-end builds for teams who already know they need AI.

From greenfield prototype to production system. Every build runs through the same eval-gated delivery cadence.

01 · build

AI development

Custom LLM apps shipped on eval gates. From greenfield prototype to production system, model-agnostic by default.

View service →

02 · build

AI agent development

Tool-using agents with reliability evals. Pass@1, pass@5, mean steps, and mean cost, all scored on the same dated rubric.

View service →
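The agent reliability metrics named in this card can be computed roughly as follows. This sketch uses the standard unbiased pass@k estimator from the code-generation literature (Chen et al., 2021). Whether the published benchmarks use this estimator is an assumption, and the data layout is hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k sampled attempts passes,
    given c passing attempts out of n total."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

def summarise(runs_by_task: dict[str, list[tuple[bool, int, float]]]) -> dict[str, float]:
    """Each run is (passed, steps, cost_usd). Every task needs at least 5 runs for pass@5."""
    p1, p5, steps, costs = [], [], [], []
    for runs in runs_by_task.values():
        n = len(runs)
        c = sum(1 for passed, _, _ in runs if passed)
        p1.append(pass_at_k(n, c, 1))
        p5.append(pass_at_k(n, c, 5))
        steps.extend(s for _, s, _ in runs)
        costs.extend(cost for _, _, cost in runs)
    return {
        "pass@1": sum(p1) / len(p1),
        "pass@5": sum(p5) / len(p5),
        "mean_steps": sum(steps) / len(steps),
        "mean_cost_usd": sum(costs) / len(costs),
    }
```

For example, 3 passing attempts out of 10 gives pass@1 of 0.3 and pass@5 of about 0.917.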

03 · build

AI chatbot development

RAG-backed chat for support, sales, and internal use. Citation accuracy gated, refusal behaviour locked, audit logs on.

View service →

04 · build

AI voice agents

Real-time voice with sub-500ms p95 latency budgets. Speech-to-speech or speech-to-text-to-speech, scored on the same eval shape.

View service →
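A p95 latency budget like the one in this card can be checked with a nearest-rank percentile over measured turn latencies. The function names and the choice of nearest-rank (rather than an interpolated percentile) are assumptions for this sketch.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with at least
    95% of all samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def within_budget(samples_ms: list[float], budget_ms: float = 500.0) -> bool:
    """True when the p95 latency is strictly under the budget (sub-500ms by default)."""
    return p95(samples_ms) < budget_ms
```

With 20 samples the rank is 19, so one slow outlier is tolerated and a second one fails the budget.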

services · integrate into existing systems

Integrate AI into existing systems.
When the model is the easy part and the integration is the work.

Wiring LLMs into CRM, ERP, support, and back-office stacks. Reversible, audited, with human-in-loop where it earns trust.

05 · integrate

AI integration

Wire LLMs into CRM, ERP, support stacks. The model is the easy part; the integration is the work.

View service →

06 · integrate

AI automation

Replace workflow steps with audited AI calls. Eval-gated, reversible, with a human-in-loop tier where it counts.

View service →

07 · integrate

Intelligent document processing

PDFs, contracts, claims, statements at scale. OCR + extraction + classification, scored on a held-out set.

View service →
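Scoring extraction on a held-out set can be sketched as field-level precision and recall against labelled documents. Exact string matching and the field names are simplifying assumptions. A production rubric would likely normalise dates and amounts before comparing.

```python
def score_extraction(predicted: list[dict], gold: list[dict]) -> dict[str, float]:
    """Compare predicted fields to gold labels, document by document."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same documents")
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        for field, value in ref.items():
            if pred.get(field) == value:
                tp += 1          # correct value extracted
            else:
                fn += 1          # gold value missed
                if field in pred:
                    fp += 1      # a wrong value was emitted in its place
        # Fields the extractor produced that the gold label does not contain.
        fp += sum(1 for field in pred if field not in ref)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical held-out example: one invoice with two labelled fields.
gold_docs = [{"total": "100.00", "date": "2024-01-01"}]
pred_docs = [{"total": "100.00", "date": "2024-01-02"}]
report = score_extraction(pred_docs, gold_docs)
```

Here one of two fields is right, so precision, recall, and F1 all come out at 0.5.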

08 · integrate

AI knowledge base

Internal RAG over your docs, tickets, transcripts. Citation-accurate, role-aware, audit-logged.

View service →

services · choose models, govern programs

Choose models. Govern programs.
The strategy + governance layer for teams with five LLM apps in flight.

Audit-first consulting and engineering-led governance. EU AI Act, NIST AI RMF, ISO 42001 mapped to the eval suite we'd actually run.

09 · govern

AI consulting

Audit, prioritise, pick the right next bet. Written deliverables, not strategy decks. Ends with a rubric and a go/no-go.

View service →

10 · govern

AI governance

EU AI Act, NIST AI RMF, ISO 42001 programs. Engineering-led, audit-log-first, regulator-ready.

View service →

11 · model

Claude development

Claude API + Claude Code engineering depth. Operators using Anthropic's tooling daily in our own delivery.

View service →

12 · model

OpenAI development

GPT, embeddings, Realtime, structured outputs. Function calling, structured JSON, Realtime voice in production.

View service →

how we work

Eval-first delivery.
Audit, pilot, continuous — gated on real evals.

Every engagement runs through three phases. Each phase has a measurable exit criterion that the eval suite enforces. No model goes to production without passing the same rubric we publish on our benchmarks.

1. Discovery audit

1-2 weeks. We map your current AI surface, pick the highest-impact bet, and write the eval rubric. Ends with a written prioritisation and a go/no-go.

2. Pilot with weekly eval gates

4-6 weeks. Working system in production behind a feature flag. Weekly eval gates decide what ships, not vibes. Cost reported alongside quality.

3. Continuous delivery

Ongoing. Dedicated engineering team, eval suite versioned with the code, monthly model-selection re-checks, real on-call rotation.
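The pilot phase runs in production behind a feature flag. One common way to do that is deterministic percentage bucketing, sketched below. The flag name, the hashing scheme, and the rollout percentage are hypothetical, and the page does not say which flagging system is used.

```python
import hashlib

def in_pilot(user_id: str, flag: str = "ai-pilot", rollout_percent: int = 10) -> bool:
    """Place each user in a stable bucket from 0 to 99 and enable the flag
    for buckets below rollout_percent."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

The bucket depends only on the flag name and user id, so a user stays in or out of the pilot across requests. Setting `rollout_percent` to 0 acts as an immediate rollback if a weekly gate fails.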

open-source · in public

Benchmarks + harness, in public.
So you can audit the methodology before you hire us.

Most agencies pick a model because the founder likes it. We pick a model because the eval said so. The harness and the benchmarks are open-source so anyone can verify the result.

open-source

Dated benchmarks

Quarterly, reproducible benchmarks on RAG retrieval, agent reliability, and LLM selection. Cost reported alongside quality.

View benchmarks →

open-source MIT

paiteq/ai-eval-harness

MIT-licensed eval harness. Ragas + promptfoo + Inspect AI + custom agent rubrics, with Langfuse and Braintrust as the trace + dashboard surfaces. The same harness our delivery team runs on engagements.

View on GitHub →

open-source

Public eval datasets

Eval corpora mirrored to HuggingFace so anyone can reproduce our published numbers on their own infrastructure.

View on HuggingFace →

Talk to the engineers who'll build it.
Audit conversation, not a discovery call.

Talk to an engineer, not a salesperson.