
AI case studies.
Receipts, not slideware.

Seven production engagements you can verify. Each one ships with a frozen eval set, a published latency budget, a defined kill point, and the math behind the metric — clinical triage on Claude Sonnet 4.6, RAG over 12,000 product-docs pages, OpenAI Realtime voice at a published $0.10 per call, fraud disposition at a US mid-market bank, first-pass MSA review for a law firm, a tier-1 customer service chatbot deflecting 42% at week 8, and a Flutter voice copilot live in a 1.4M-MAU app. Client names are changed at their request; every number on this page is drawn from shadow-mode logs, frozen eval sets, or 30+ day A/B tests on shipped systems.

Definition

What does the GetWidget case studies catalog cover?

The GetWidget case studies catalog covers 7 production AI deployments with published eval data: a HIPAA-safe clinical triage agent (Claude Sonnet 4.6, n=14,200 shadow encounters, 38-62% wait reduction); a Claude RAG over 12,000 product-docs pages (n=3,400, 64% tier-1 deflection); an OpenAI Realtime voice agent (n=11,400 calls, 38% deflection at $0.10/call); a Claude fraud-disposition agent at a US mid-market bank (precision at or above 0.96 at 1% FPR, plus or minus 0.012 CI); a LangChain MSA contract reviewer (n=180, 71% partner time saved); a Flutter voice copilot in a 1.4M-MAU app (n=42,318 sessions, +11.4 percentage-point conversion lift); and a tier-1 customer-service chatbot (42% deflection at $800/mo). Every case study publishes sample size, confidence interval, named stack, and compliance regime, with audit dataset, model version, and eval methodology linked from each case page. Per-case citation cards are at /api/citation-card/:slug. Engagement begins with a fixed-fee discovery audit, then a 4-6 week pilot (fixed-bid with walk-away clause), then continuous monthly delivery scaled to workflow count.

7 published case studies
5 industries shipped: health, SaaS, fintech, legal, ecom
eval-first: every case ships with a frozen eval set + kill points
open math: cost-per-call and groundedness numbers published
38–62% pre-triage wait reduction (n=14,200 shadow encounters, healthcare clinical, 2026-Q1)
Pilot to shadow: 9 wk
Healthcare · Regional health system

HIPAA-safe clinical triage agent, shipped in 9 weeks

Pre-triage queue 38–62 min at peak. Nurse line overflow routing wrong-acuity patients to ER.

  • Claude Sonnet 4.6
  • pgvector 0.7
  • FHIR R4
  • LangGraph 0.2
Read case study
≈ 64% docs-recoverable tickets deflected at conf ≥ 0.8 (95% CI · n=3,400, dev-tools SaaS, 2026-Q1)
Groundedness on eval: 0.92
B2B SaaS · Developer tooling

Claude RAG over 12,000 product-docs pages

Doc search rated 2.3/5; 41% of support tickets were docs-recoverable. Keyword search couldn't reason across modules.

  • Claude Sonnet 4.6
  • Haiku 4.5
  • pgvector
  • bge-reranker
Read case study
≈ 38% tier-1 voice deflection (95% CI · n=11,400 calls, SaaS support, 2026-Q1)
Per call: $0.10 (vs $4 baseline)
SaaS · Customer support

OpenAI Realtime API voice agent at $0.10/call

Tier-1 voice queue 4-min wait at peak; 5 questions = 62% of volume. IVR bouncing 80% to human.

  • gpt-realtime-2
  • Whisper-large-v3
  • pgvector
  • Twilio Voice
Read case study
≥ 0.96 precision @ 1% FPR (n=412 eval + 1,840 production · ±0.012 CI)
Deployment posture: AWS PrivateLink
Fintech · Mid-market US bank

Claude Sonnet 4.6 fraud agent at a US mid-market bank

Rules-engine bleeding 18% false-positive rate on 1.2B/yr transactions across card, wire, ACH, RTP.

  • Claude Sonnet 4.6
  • Haiku 4.5
  • pgvector
  • XGBoost
Read case study
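
For readers unfamiliar with the headline metric on the fraud card, here is a minimal sketch of how a precision-at-fixed-FPR figure can be computed from scored, labeled transactions. It is an illustration only: the function name `precision_at_fpr`, the score convention (higher means more likely fraud), and the threshold sweep are assumptions, not the scoring pipeline used in the engagement.

```python
from typing import Sequence, Tuple


def precision_at_fpr(
    scores: Sequence[float],
    labels: Sequence[int],
    max_fpr: float = 0.01,
) -> Tuple[float, float]:
    """Return (threshold, precision) at the most permissive threshold
    whose false-positive rate stays within max_fpr.

    labels: 1 = fraud, 0 = legitimate (hypothetical convention).
    A transaction is flagged when its score >= threshold.
    """
    negatives = sum(1 for y in labels if y == 0)
    if negatives == 0:
        raise ValueError("need at least one legitimate example")

    best = None
    # Sweep thresholds from strictest to most permissive. FPR can only
    # grow as the threshold drops, so stop once the budget is exceeded.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if fp / negatives > max_fpr:
            break
        best = (t, tp / (tp + fp))

    if best is None:
        raise ValueError("no threshold satisfies the FPR budget")
    return best
```

Fixing the false-positive budget first and then reading off precision keeps the comparison honest against a rules engine whose main cost is false positives.
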
≈ 71% first-pass MSA review time saved · partner-signed-off (95% CI · n=180 MSAs)
Clauses post-reconciliation: 1,420
Legal · Mid-market law firm

First-pass MSA review for a mid-market law firm

Partners spending 6–9 hours per MSA on first-pass review; clause-library drift across 4 practice groups.

  • Claude Sonnet 4.6
  • LangChain
  • LangGraph
  • pgvector
Read case study
42% tier-1 deflection at week 8 (escalation-draft accept rate 78%)
First-response time on deflected tickets: 6 hr → 12 min
B2B SaaS · Customer support · Series C

Tier-1 customer service chatbot: 42% deflection in 8 weeks

Zendesk queue 6-hr FRT at peak. Tier-1 tickets burning cycles. Off-the-shelf chatbots failed on tone + product depth.

  • Claude Sonnet 4.6
  • Haiku 4.5
  • pgvector
  • Zendesk API
Read case study
+11.4 pts mobile conversion · voice-engaged sessions (n=42,318 · ±1.6pt CI · 30d A/B)
App scale: 1.4M MAU
E-commerce · DTC apparel · Flutter mobile

Flutter voice copilot in a DTC apparel app

Mobile-app conversion lagging desktop by 18pp on a 1.4M-MAU Flutter app. Two prior voice A/B tests failed.

  • gpt-realtime-2
  • Flutter 3.24
  • GetWidget OSS
  • Algolia
Read case study
▸ what we measure in every case

Six dimensions, on every page, not just the ones that look good.

The 'we measured X' line in most case studies hides three other measurements that didn't move. We publish all six. If one is missing on a case-study page below, it's because the client asked us not to publish it — never because the number was bad.

  • Groundedness

    Fraction of answers traceable to a retrieval span (RAGAS). Hallucination's inverse.

  • p95 latency

    First-token AND full-reply, per channel. Voice has a different budget than web chat.

  • $ / unit

    Per turn, per call, per MSA. Published with the formula — not hand-waved.

  • Eval pass %

    Frozen golden set + regression-gated in CI. Drift catches us before it catches the user.

  • Walk-away

    The single metric we'd kill the pilot for if it doesn't move. Defined before week 1.

  • Audit log

    Every call, every retrieval, every tool invocation logged for replay and dispute.
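
The measurable dimensions above reduce to simple arithmetic over logged turns. The sketch below shows one way to compute them. It is an illustration under stated assumptions: the `Turn` record and every field and function name are hypothetical, groundedness is taken from a pre-labeled flag rather than computed with RAGAS, and the nearest-rank percentile is one of several common p95 definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One logged interaction; all field names are hypothetical."""
    grounded: bool                      # answer traceable to a retrieval span
    latency_ms: float                   # full-reply latency for this turn
    cost_usd: float                     # spend attributed to this turn
    passed_eval: Optional[bool] = None  # set only for frozen golden-set items


def groundedness(turns: List[Turn]) -> float:
    """Fraction of answers flagged as traceable to a retrieval span."""
    return sum(t.grounded for t in turns) / len(turns)


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic avoids float drift)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def cost_per_unit(turns: List[Turn], units: int) -> float:
    """Total spend divided by the unit being priced (turn, call, MSA)."""
    return sum(t.cost_usd for t in turns) / units


def eval_pass_rate(turns: List[Turn]) -> float:
    """Pass rate over the golden-set items only."""
    graded = [t for t in turns if t.passed_eval is not None]
    return sum(t.passed_eval for t in graded) / len(graded)


def regression_gate(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True when the pass rate has not dropped below baseline - tolerance."""
    return current >= baseline - tolerance


def walk_away(observed: float, kill_point: float) -> bool:
    """True when the agreed metric failed to reach the pre-defined kill point."""
    return observed < kill_point
```

In a CI pipeline, `regression_gate` would typically fail the build when it returns `False`. The audit-log dimension is a storage concern and is not modeled here.
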

on anonymization

Why most names are changed, and how to get a named reference.

Three reasons clients stay anonymous. We share 2–3 named references under NDA inside the audit call, and co-publish a fully named case with a client roughly once a quarter.

01 · competitive window

Naming clients tips off their competitors

Healthcare, fintech and law-firm clients commonly sit inside a window where a public reference helps a rival decide where to invest next. Naming them is a strategic gift we won't make on a marketing page.

02 · regulated-buyer posture

HIPAA, FFIEC and privilege-aware buyers gate references

Named references typically require a paid intro call with counsel or compliance present. We respect that. One regulated client trusting us for a decade beats a logo on a landing page.

03 · eval over logos

The eval table is more useful than the brand name

A buyer should be able to tell from the case study alone whether we picked the right model, whether retrieval is defensible, and whether cost math closes.

frequently asked

Questions case-study readers ask most.
Real answers, no hedging.

Why are most of these case studies anonymized?
Three reasons. (1) Some clients are still in a competitive window where naming them helps a competitor decide where to invest next. (2) Regulated buyers (health, fintech, legal) often only allow named references under NDA after a paid intro call. (3) We'd rather publish the eval math + architecture than a brand-name logo with no detail. If you want named references, ask — we can share 2-3 under NDA after the audit call.
Are these numbers real or capability examples?
Every metric on this page is drawn from shadow-mode logs, frozen eval sets, or 30+ day A/B tests on shipped engagements. We mark capability-example numbers explicitly with the words 'capability example' or 'illustrative' when used. The cost math (e.g. $0.10/call) is published with the formula. The CIs (95%, ±values) come from the actual eval runs.
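
To sanity-check a published proportion against its sample size, a reader can compute an approximate 95% interval themselves. The sketch below is a minimal illustration assuming simple binomial sampling; the page does not state which interval method the eval runs use, and the function names `wald_half_width` and `wilson_interval` are ours.

```python
import math
from typing import Tuple

Z_95 = 1.96  # two-sided 95% normal quantile


def wald_half_width(p: float, n: int, z: float = Z_95) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


def wilson_interval(successes: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval, better behaved than Wald near 0 or 1."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: interval half-width for a 38% rate observed over 11,400 calls,
# under the binomial assumption stated above.
half = wald_half_width(0.38, 11_400)
print(f"38% ± {100 * half:.1f} pts")
```

Intervals from A/B tests on differences between two arms need both arms' rates and sizes, so they cannot be reproduced from a single headline number.
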
Can I cite GetWidget in my own deck or earnings call?
Yes — every published case study is citation-safe as written. Use the page URL + the metric + CI as published. If you need a different framing (custom NDA carve-out, named-reference letter, or a co-marketing piece), email us; we co-publish with clients ~every quarter.
Do you publish negative results?
Yes, where the client agrees. Two examples on the site: the legal contract-review case documents a 'paused for reconciliation' phase in its eval table (clause-library drift forced a rebuild), and the Flutter voice copilot case publishes the two prior A/B tests that failed before the third one shipped. We don't publish negative results that would identify a specific client without consent.
What's the typical pilot length for a case-study-shaped engagement?
Pilots run 4–9 weeks depending on the corpus and the integration surface. Most pilots are fixed-bid with a walk-away point at the eval baseline. After a successful pilot, engagements transition to continuous monthly delivery scaled to workflow count and on-call posture. See /services/ai-development/ for the full engagement model.
Why don't you have more case studies in [my industry]?
We publish a new case study roughly every 4–6 weeks. The 7 published today are the ones with client co-publish sign-off; we have 11+ active engagements that aren't published yet, for the reasons given in the anonymization answer above. If you're scoping in healthcare, legal, fintech, ecommerce, or SaaS, the patterns transfer: book an audit and we'll show you the closest comparable engagement under NDA.
Ready to ship

Want a case study like this for your stack?

Book a free audit. We review your highest-ROI candidate workflow, recommend a model + retrieval recipe, project token + run-cost, and tell you whether it's case-study-shaped (or whether you should buy an off-the-shelf platform). No deck, no obligation to build.

See pricing
30 min, async or live · Eval-first scoping · Walk-away point in the pilot