B2B SaaS · Series C · anonymized
Customer service chatbot · RAG-grounded · tier-1

Claude Sonnet 4.6, Claude Haiku 4.5, pgvector + Postgres 16, Zendesk function-calling, Langfuse self-hosted

case study · enterprise ai chatbot · 2026 · anonymized

Enterprise AI chatbot, in production.
42% tier-1 deflection in 8 weeks.

A Series C B2B SaaS company was running a 6-hour median first-response time on its tier-1 support queue. We shipped a RAG-grounded enterprise AI chatbot over the help center and 90 days of resolved tickets. Claude Haiku 4.5 classifies intent in 280 ms (p95). Claude Sonnet 4.6 returns the grounded reply with every claim cited to a retrieved chunk. A confidence gate at 0.7 routes the output into one of four lanes: autonomous reply, escalate with draft, refuse and queue, or out-of-scope. The build ran as a five-week pilot and a three-week staged rollout, with two weeks of internal-note shadow mode before any customer saw a reply. By week 8 the chatbot was deflecting 42% of tier-1 traffic at an all-in run cost of $800 per month (both figures 2026-Q1).

42%

tier-1 deflection · steady-state · resolution-counted across 4 reporting cycles (n≈12,400 tickets)

0.95

groundedness · LLM-judge eval calibrated against 80 human-judged samples

78%

escalation-draft acceptance · share of AI drafts the agent kept or lightly edited

$800/mo

all-in run cost · Claude API + pgvector + Langfuse · observed-volume math

shipped

8 weeks · 2 engineers · 1 support engineering owner · 14 agents in scope

Summary

What this case study shows

A Series C B2B SaaS company shipped a customer service chatbot built on Claude Sonnet 4.6 with a Claude Haiku 4.5 intent classifier, grounded in 480 help-center articles and 90 days of resolved Zendesk tickets. By week 8 the chatbot deflected 42% of tier-1 traffic at steady state (resolution-counted, not first-touch routing), with groundedness at 0.95 and escalation-draft acceptance at 78%. First-response time on deflected tickets dropped from a 6-hour median to 12 minutes. Run cost: about $800 per month all-in. Stack: Claude Sonnet 4.6, Claude Haiku 4.5, pgvector, Zendesk function-calling, Langfuse. Five-week pilot plus three-week staged rollout, with a shadow-mode pause that caught three production failure modes before any customer saw a reply. The KB layer underneath is the same engagement shape we ship as a standalone AI knowledge base when retrieval is the primary surface.

schema-validated

every Sonnet reply parses against the Zod schema or fails closed to a human ticket

zero auto-replies shipped without a grounded citation · enforced by the validator

500-item eval

frozen · ground-truth labelled by the customer's support engineering team

shadow-first

two weeks of internal-note shadow mode before any customer saw a reply

6 hr

median first-response on tier-1 Zendesk · pre-build baseline

76%

of tier-1 volume mapped to 5 repeating ticket shapes (n=500 sampled)

3 of 3

off-the-shelf vendors evaluated and rejected before us

18-24 hr

first-response long tail during US-morning peak · pre-build

the problem

Where the support queue
actually was.

A 6-hour median first-response time. 14 agents on follow-the-sun rotation. Three-quarters of tier-1 inbound was the same five ticket shapes. Generic chatbots had been tried; all three were turned down.

The client is a Series C B2B SaaS company. Their support floor sat at 14 agents on a follow-the-sun rotation, with an inbound Zendesk queue running a 6-hour median first-response time on tier-1 traffic and an 18 to 24-hour tail during US-morning peak. The Head of Support Engineering ran the numbers in our discovery week and named the binding constraint. 76% of tier-1 ticket volume mapped to five repeating shapes: password resets, billing questions like "why was I charged on the 4th, not the 1st", feature how-tos lifted directly from the help center, plan-tier entitlement questions, and SSO connection errors with a known knowledge-base remedy.

The team had previously evaluated three off-the-shelf AI customer support software vendors and turned all of them down. Objections were consistent. Tone did not match the brand voice. Product knowledge was weak or wrong on entitlement edge cases the marketing site did not cover. And there was no honest measurement methodology behind the deflection-rate claims those vendors had put in their pitch decks. The Head of Support Engineering said it plainly: we will buy a customer service chatbot when we can audit what it knows, see where it draws its answers from, and measure how often it gets the answer right on tickets we already have ground truth on.

before vs · with the chatbot

today · without the chatbot

Customer opens ticket → Zendesk inbox (no triage · time-ordered) → 6-hour wait (longer at US peak) → Agent reads cold (context-switch tax)

outcome: 6 hr median first-response time · 18–24 hr peak tail · 14 agents on follow-the-sun rotation

with the chatbot

Customer asks → Haiku 4.5 classifies (9 intents · 280 ms p95) → pgvector retrieves (help center + 90d tickets) → Sonnet 4.6 grounded reply (every claim cites a chunk)

outcome: Autonomous reply · ≥0.7 conf · 42% of inbound
outcome: Escalate with draft · 0.4–0.7
outcome: Refuse + queue · <0.4

We will buy a customer service chatbot when we can audit what it knows, see where it draws its answers from, and measure how often it gets the answer right on tickets we already have ground truth on. Until then, every vendor pitch is just a deflection-rate number with no methodology under it.

Head of Support Engineering · B2B SaaS · Series C

the approach · confidence-gated chatbot, eval-first

What we built,
and why.

Two models. One confidence gate. Four routing outcomes. One eval set the customer's own support engineering team labelled. Every reply has to cite the chunk it pulled from, or the schema validator rejects.

We shipped a customer service chatbot that an engineering team will recognize as eval-disciplined rather than vendor-pitched. The intent-classify step runs on Claude Haiku 4.5: every inbound message routes into one of 9 ticket categories or out-of-scope. The reply step runs on Claude Sonnet 4.6 with retrieval over the help center plus 90 days of resolved Zendesk tickets. Every Sonnet reply must cite the chunk it retrieved the answer from, or the Zod validator rejects the output and the agent retries once with a stricter prompt, then fails closed to a human ticket.
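The validate, retry once, then fail closed loop described above can be sketched roughly as follows. This is a plain-Python stand-in for the Zod validator, not the production code: the reply shape (`claims`, `text`, `chunk_id`, `confidence`) and the `generate` callback are assumptions made for illustration.

```python
import json
from typing import Callable, Optional


def validate_reply(raw: str, retrieved_ids: set[str]) -> Optional[dict]:
    """Return the parsed reply only if every claim cites a chunk that was retrieved."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(reply, dict):
        return None
    claims = reply.get("claims")            # hypothetical field name
    confidence = reply.get("confidence")    # hypothetical field name
    if not isinstance(claims, list) or not claims:
        return None
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    for claim in claims:
        if not isinstance(claim, dict):
            return None
        text, chunk_id = claim.get("text"), claim.get("chunk_id")
        if not isinstance(text, str) or not text.strip():
            return None
        if not isinstance(chunk_id, str) or chunk_id not in retrieved_ids:
            return None  # uncited, or cites a chunk that was never retrieved
    return reply


def reply_or_fail_closed(generate: Callable[[bool], str], retrieved_ids: set[str]) -> dict:
    """Attempt once normally, once with the stricter prompt, then hand off to a human."""
    for strict in (False, True):
        reply = validate_reply(generate(strict), retrieved_ids)
        if reply is not None:
            return {"status": "ok", "reply": reply}
    return {"status": "human_ticket", "reply": None}
```

The point of the shape is that there is no third path: a reply either passes validation or the ticket goes to a person.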

The confidence gate sits at 0.7. Above threshold, the chatbot replies to the customer directly. Below 0.7 and above 0.4, the chatbot does not reply to the customer at all. Instead it escalates the ticket to the human queue with the AI-drafted reply attached as an internal Zendesk note. The agent reads the draft, edits or rewrites it, and sends. We measured that draft-attached escalation path as its own metric because we suspected (correctly) that it would carry a meaningful share of the productivity gain even on tickets the chatbot could not close autonomously. Below 0.4 the chatbot refuses outright; the ticket queues for a human with retrieval candidates pre-attached so the agent does not start cold.
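As a minimal sketch, the gate reduces to a small routing function. The 0.7 and 0.4 thresholds come from the text; the lane names, the `out_of_scope` intent label, and the choice to treat a score of exactly 0.4 as an escalation are assumptions.

```python
from enum import Enum
from typing import Optional

REPLY_THRESHOLD = 0.7    # at or above: reply to the customer directly
REFUSE_THRESHOLD = 0.4   # below: refuse and queue for a human
OUT_OF_SCOPE_INTENT = "out_of_scope"  # hypothetical classifier label


class Lane(Enum):
    AUTONOMOUS_REPLY = "autonomous_reply"
    ESCALATE_WITH_DRAFT = "escalate_with_draft"
    REFUSE_AND_QUEUE = "refuse_and_queue"
    OUT_OF_SCOPE = "out_of_scope"


def route(intent: str, confidence: Optional[float]) -> Lane:
    """Pick the lane for one turn from the Haiku intent and the Sonnet confidence."""
    if intent == OUT_OF_SCOPE_INTENT:
        return Lane.OUT_OF_SCOPE            # decided at the classifier step
    if confidence is None or confidence < REFUSE_THRESHOLD:
        return Lane.REFUSE_AND_QUEUE        # a missing score fails closed
    if confidence >= REPLY_THRESHOLD:
        return Lane.AUTONOMOUS_REPLY
    return Lane.ESCALATE_WITH_DRAFT         # 0.4 up to, but not including, 0.7
```

For example, `route("sso", 0.55)` lands in the escalate-with-draft lane, so the customer sees nothing until an agent sends the edited draft.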

Zendesk function-calling handles every write action. The chatbot can create a ticket, update ticket status, attach the conversation transcript to a parent ticket, and tag the case for routing. It cannot issue refunds. It cannot change plan tiers. It cannot read or write to billing systems. Every write tool has a policy file the agent runtime imports at startup, and the agent refuses to call a tool whose policy preconditions fail. The widget itself lives inside the customer dashboard, not the marketing site. This is a customer support chatbot for authenticated users with an account context the agent reads at session start. The retrieval scope is narrower than a marketing chatbot, entitlement questions are answerable from authenticated context, and the failure modes are different.

three decisions that shaped the build

design decision · 01

Two-model pipeline · Haiku 4.5 classifies, Sonnet 4.6 replies

we rejected: Single Sonnet model on every inbound message
because: Roughly 60% of tier-1 inbound resolves to one of 9 narrow intents we classify in 280 ms p95 with Haiku at about 5% of Sonnet's per-call cost. The classifier also lets us refuse out-of-scope (refunds, plan changes, billing disputes) at the cheap step, before retrieval ever fires. Routing-only queries terminate at Haiku and never pay for grounded reasoning they do not need.

design decision · 02

Confidence gate at 0.7 with escalate-with-draft below

we rejected: Single threshold at 0.9 · reply only when very confident, otherwise human cold-start
because: We started at 0.9 in shadow mode and watched it kick autonomous-resolvable tickets into the escalation lane. Shadow data showed the model was strong-and-safe on password resets, SSO errors, and feature how-tos at the 0.7 band; the conservative threshold cost real resolutions. The bigger productivity gain was below the gate: AI drafts attached as internal Zendesk notes were kept or lightly edited by agents 78% of the time, and first-response on the escalation lane dropped from 6 hr to 1 hr 40.

design decision · 03

Authenticated dashboard widget · not a marketing-site bot

we rejected: Public marketing-site chat with anonymous visitor context
because: Authenticated session at start means we read entitlement context from the customer's plan tier and Salesforce contract before the model generates anything. The bot answers 'why was I charged on the 4th, not the 1st' against the customer's real billing schedule, not a generic pricing page. Marketing widgets cannot do this; their retrieval scope, failure modes, and visitor intent are all different.

chatbot anatomy · this build

Six stages of one
customer service chatbot turn.

From inbound message to logged trace. Each stage carries its own latency budget, model pick, and failure mode. Skip any one and you ship the demo bot competitors do.

01	Classify	Intent routing	Claude Haiku 4.5 · 9-way classifier · 280 ms p95 · 60-token system prompt	$0.0006 / turn
02	Retrieve	RAG over help center + ticket history	pgvector top-k 6 · 480 articles + 18,000 resolved tickets · 512-token chunks	~1,200 in tokens
03	Tool call	Zendesk function execution	create ticket · update status · tag · attach transcript · 4-second timeout	0–2 calls / turn
04	Generate	Grounded reply with citations	Claude Sonnet 4.6 · forced JSON · every claim cites a chunk · streamed	~420 out tokens
05	Guardrail	Confidence gate + policy check	Zod schema validate · ≥0.7 reply · 0.4–0.7 escalate with draft · <0.4 refuse	fail-closed
06	Log	Trace + nightly eval replay	Langfuse self-hosted · 500-item eval re-runs · alert on >1.5 pt drop	evaluated nightly

Latency and token numbers are 30-day rolling p95 from production traces. The 0.7 confidence gate is the policy threshold tuned in shadow week 2.
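A rough per-turn cost check, using the token figures above and the Sonnet 4.6 list prices quoted in this case study ($3 per 1M input tokens, $15 per 1M output tokens). It is a back-of-envelope sketch only: it ignores system-prompt and tool-call tokens, and it uses p95 token counts, so it leans high.

```python
def turn_cost_usd(input_tokens: int, output_tokens: int,
                  input_usd_per_mtok: float, output_usd_per_mtok: float) -> float:
    """Cost of one model call in USD, given per-million-token prices."""
    return (input_tokens * input_usd_per_mtok
            + output_tokens * output_usd_per_mtok) / 1_000_000


# ~1,200 retrieved input tokens and ~420 output tokens per Sonnet turn (stage table).
sonnet_turn = turn_cost_usd(1_200, 420, 3.00, 15.00)   # about $0.0099

# The classifier cost is taken directly from the stage table.
haiku_turn = 0.0006

# Share of a Sonnet turn that a Haiku classification costs (about 6% here).
haiku_share = haiku_turn / sonnet_turn
```

On these inputs a Haiku classification costs roughly 6% of a Sonnet turn, the same order as the "about 5%" figure used in design decision 01.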

confidence gate · 0.7 threshold

One model output.
Four routing outcomes.

Every Sonnet response carries a confidence score. The score plus the intent category from Haiku decides what the customer actually sees. Autonomous reply on the high-confidence path, escalate-with-draft in the middle band, refusal queue at the bottom, out-of-scope intents never reach Sonnet at all. The router is the customer-experience contract.

Confidence gate · 0.7 reply · 0.4 refuse

Autonomous reply · ≥ 0.7: schema-valid grounded answer · streamed to customer · ticket auto-tagged + resolved · ≈ 42% of inbound

Escalate with draft · 0.4 – 0.7: AI draft posted as internal Zendesk note · agent reads, edits, sends · 78% acceptance rate

Refuse + queue · < 0.4: no customer-facing reply · ticket routed to human queue · retrieval candidates pre-attached for the agent

Out-of-scope · direct to human: refunds · plan changes · billing disputes · never reach Sonnet · routed at the Haiku step

Threshold tuned in shadow mode week 2. 0.9 was too conservative (kicked safe autonomous-resolvables into the escalation lane). 0.5 was too aggressive (surfaced low-grounded replies to customers). 0.7 sits where draft-attached acceptance and autonomous-resolution both stay healthy.
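One way a threshold comparison like this could be run over shadow-mode data is a simple sweep: for each candidate gate, measure how much traffic would be answered autonomously and how often those answers were right. This is a sketch under assumptions; the `(confidence, is_correct)` sample format is hypothetical, and correctness labels would come from human review of shadow replies.

```python
from typing import Optional


def sweep_thresholds(samples: list[tuple[float, bool]],
                     thresholds: list[float]) -> list[dict]:
    """For each gate value, report autonomous share and precision of autonomous replies.

    samples: (confidence, is_correct) pairs from reviewed shadow-mode replies.
    """
    if not samples:
        raise ValueError("need at least one reviewed sample")
    rows = []
    for threshold in thresholds:
        auto = [ok for confidence, ok in samples if confidence >= threshold]
        share = len(auto) / len(samples)
        precision: Optional[float] = sum(auto) / len(auto) if auto else None
        rows.append({"threshold": threshold,
                     "auto_share": share,
                     "auto_precision": precision})
    return rows
```

A gate that is too high shows up as a small `auto_share` with no precision gain; a gate that is too low shows up as falling `auto_precision`.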

what the agent can and cannot do

The tool policy file, in a table

action	can do?	audit log	policy gate
Reply to customer with cited answer	yes · ≥0.7 conf	Langfuse + Zendesk	schema validates every claim
Create internal Zendesk note (draft)	yes	Zendesk audit	always logged · never auto-sent
Update ticket status (open → resolved)	yes · post auto reply	Zendesk audit	agent override is 1-click
Tag for routing	yes	Zendesk audit	9-tag vocabulary · hard-coded
Attach conversation transcript to parent	yes	Zendesk audit	PII scrub before attach
Issue refund	no	—	billing-system tool not exposed
Change plan tier or entitlement	no	—	Salesforce write-path not exposed
Read or write billing system	no	—	out of scope by policy file
Send email outside Zendesk	no	—	no SMTP tool available to runtime

Policy file lives in the customer's repo. Agent runtime imports it at startup. Any tool call whose preconditions fail is refused before the call is made: no dry-runs, no soft failures. Tier-2 actions (refund, plan change, billing write) are not in this build by design; we scope them as a separate engagement with a stricter risk model.
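A policy file of this kind could look something like the sketch below. It mirrors the table (allowed tools with preconditions, everything else not exposed), but the tool names, context keys, and the tag subset are hypothetical; the real 9-tag vocabulary is not listed in this case study.

```python
from typing import Callable

Context = dict
Precondition = Callable[[Context], bool]

# Placeholder subset only: the real hard-coded vocabulary has 9 tags.
ALLOWED_TAGS = frozenset({"sso", "billing", "password_reset"})

# Tools absent from this mapping are simply not exposed to the runtime
# (refunds, plan changes, billing reads and writes, outbound email).
POLICIES: dict[str, list[Precondition]] = {
    "reply_to_customer": [
        lambda ctx: ctx.get("schema_valid") is True,
        lambda ctx: ctx.get("confidence", 0.0) >= 0.7,
    ],
    "create_internal_note": [],
    "update_ticket_status": [lambda ctx: ctx.get("auto_reply_sent") is True],
    "tag_for_routing": [lambda ctx: ctx.get("tag") in ALLOWED_TAGS],
    "attach_transcript": [lambda ctx: ctx.get("pii_scrubbed") is True],
}


def authorize(tool: str, ctx: Context) -> bool:
    """True only if the tool is exposed and every precondition holds."""
    preconditions = POLICIES.get(tool)
    if preconditions is None:
        return False
    return all(check(ctx) for check in preconditions)


def call_tool(tool: str, ctx: Context, execute: Callable[[Context], object]) -> dict:
    """Refuse before execution; the underlying call never runs on a failed check."""
    if not authorize(tool, ctx):
        return {"tool": tool, "status": "refused"}
    return {"tool": tool, "status": "ok", "result": execute(ctx)}
```

Because the check happens before `execute` is invoked, a refused call leaves no side effects to roll back.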

the stack · named tools, named versions

Stack we shipped,
all of it auditable.

Claude Sonnet 4.6 · Anthropic API · forced JSON
Claude Haiku 4.5 · Anthropic API
pgvector 0.7 · Postgres 16
voyage-3-large · 1,024 dim
Zendesk API · function-calling layer
Langfuse v3 · self-hosted
Postgres 16 · managed

corpus + retrieval · numbers

What the chatbot reads from and how

surface	count	chunking / treatment	refresh cadence
Help center articles	480 articles	512-token chunks · 80-token overlap · sentence-anchored	daily incremental on CMS publish
Resolved Zendesk tickets (90d window)	≈ 18,000 tickets	subject + first agent reply + final resolution · dedup by canonical resolution	nightly rolling window
Entitlement context (Salesforce)	live · per-session	read at session start · authenticated user id only	per-session · no batch refresh
Intent categories	9 + out-of-scope	password reset · billing · feature how-to · plan/entitlement · SSO · refund · cancellation · escalation · feature-flag	frozen · labelled by client's support engineering
Confidence threshold	0.7 reply / 0.4 refuse	tuned in shadow mode week 2 · drift-monitored nightly	review quarterly

Embeddings: voyage-3-large at 1,024 dimensions. Retrieval: pgvector top-k 6, sentence-anchored chunks. Ticket dedup is canonical-resolution based (we group tickets that resolve to the same root cause and keep the cleanest exemplar). Salesforce entitlement context is read live per-session and never cached.
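Sentence-anchored chunking with overlap can be sketched as below. Two assumptions are worth flagging: whitespace-separated words stand in for model tokens (a real build would count with the embedding model's tokenizer), and the sentence splitter is a simple punctuation regex. The 512 and 80 defaults come from the text.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())


def chunk_sentences(text: str, max_tokens: int = 512, overlap_tokens: int = 80) -> list[str]:
    """Pack whole sentences into chunks, carrying trailing sentences forward as overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # sentences in `current` that are not carried-over overlap
    for sentence in sentences:
        if fresh and count_tokens(" ".join(current + [sentence])) > max_tokens:
            chunks.append(" ".join(current))
            # Carry whole trailing sentences forward, up to the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if count_tokens(" ".join([prev] + carried)) > overlap_tokens:
                    break
                carried.insert(0, prev)
            # Drop the overlap if it would push the next chunk over budget.
            if count_tokens(" ".join(carried + [sentence])) > max_tokens:
                carried = []
            current, fresh = carried, 0
        current.append(sentence)
        fresh += 1
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_tokens` is kept whole as its own oversized chunk, since the rule here is never to split mid-sentence.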

8 weeks · honest version

The timeline,
including the week we almost paused.

Five stages, milestone-billed. The week-5 shadow run caught three production failure modes (entitlement edge cases, billing-dispute auto-replies, stale release-notes answers), all before any customer ever saw a reply. The honest version of '8 weeks' includes the days we spent fixing them.

Week 1

Discovery + 500-ticket eval set

One week with the Head of Support Engineering and two senior agents. We sampled 500 tier-1 tickets from the previous 90 days, stratified across the 9 intent categories. Their team labelled the correct reply, the correct ticket disposition (resolve, escalate, route), and the help-center article ids that should have grounded the answer. We wrote the eval harness. They wrote the answers. That organizational split was load-bearing for trust when results were reviewed two months later.

Frozen 500-item eval set · intent rubric · groundedness calibration corpus
Weeks 2–3

Corpus build + retrieval tuning

Indexed 480 help-center articles and 90 days of resolved Zendesk tickets (about 18,000 after dedup) into pgvector on the customer's existing Postgres. voyage-3-large embeddings at 1,024 dimensions. Chunked at 512 tokens with 80-token overlap, sentence-anchored, never splitting mid-quote. Week 2 cost us four days on a macro-library cleanup we had not scoped: the Zendesk macro library had drifted across two reorgs and 'resolved' tickets had been closed with the wrong macro applied. Future projects: we audit historical resolution data before the corpus build, not during.

Hybrid retrieval at recall@5 of 0.92 on the eval set
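For reference, recall@k against the labelled article ids in the eval set can be computed as in this sketch. It uses one common definition (the fraction of labelled-relevant ids that appear in the top k); the case study does not state which variant was used, so treat the definition as an assumption.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the labelled-relevant ids found in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each eval item needs at least one labelled id")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per eval item."""
    if not runs:
        raise ValueError("need at least one eval item")
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)
```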
Week 4

Two-model pipeline + confidence gate

Claude Haiku 4.5 wired as the intent classifier (9 categories + out-of-scope). Claude Sonnet 4.6 generating cited replies under forced JSON. Confidence gate at 0.7 with escalate-with-draft routing below. Zendesk function-calling for ticket create, update, tag, attach. Tool policy file enforced at runtime: agent refuses to call any tool whose policy preconditions fail.

End-to-end pipeline behind a feature flag · CI green on the frozen eval
Weeks 5–6

Shadow mode · three failure modes caught

Two full weeks where the chatbot generated a reply for every tier-1 ticket, but the reply landed in an internal-note field, never the customer. A senior agent reviewed 10% of replies daily. Three production failures surfaced. One: confident replies on enterprise entitlement edge cases, fixed by reading Salesforce at session start. Two: auto-replies to billing disputes that should have escalated, tightened with adversarial examples in the classifier. Three: stale release-notes answers on feature-flag questions, fixed by excluding release notes from retrieval and routing those questions to the human queue.

Three retrieval-scope changes shipped · groundedness lifted 0.91 → 0.94

Walk-away point
Weeks 7–8

Staged rollout · 42% holds at steady state

Three-week staged rollout. 10% of tier-1 traffic in week 6. 50% in week 7. 100% in week 8. Deflection held at 41 to 42% across all three stages. Escalation-draft acceptance climbed from 71% to 78% as agents learned to trust and rely on the draft. Nightly Langfuse eval replay caught one regression: a Sonnet model update added subtle verbosity that hurt groundedness by 1.8 points. We held the promotion, rolled the prompt, and re-promoted two days later.
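A percentage rollout like this is often implemented with deterministic hash bucketing, sketched below. Whether the production build bucketed by ticket, customer, or account is not stated here, so the `ticket_id` key and the `salt` value are assumptions.

```python
import hashlib


def in_rollout(ticket_id: str, percent: int, salt: str = "tier1-chatbot") -> bool:
    """Deterministically place a ticket in or out of the rollout cohort.

    The same id always maps to the same bucket, so raising `percent`
    only ever adds tickets to the cohort and never removes them.
    """
    digest = hashlib.sha256(f"{salt}:{ticket_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return bucket < percent
```

Moving from 10% to 50% to 100% is then a one-line config change, and any ticket served by the bot at 10% is still served at 50%.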

42% tier-1 deflection sustained · 0.95 groundedness · 78% draft accept rate

eval results · enterprise chatbot deflection · 500 frozen tickets

How we know
42% deflection is real.

The eval set is frozen. Every prompt change, retrieval change, threshold change, and model upgrade replays the full 500. Nothing promotes if any metric drops more than 1.5 points from the prior cut. Numbers below are the current production cut against the four staged-rollout reference points.
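The promotion rule can be sketched as a comparison between the prior cut and the candidate cut. The sketch assumes every metric is expressed on a 0 to 100 point scale (so groundedness 0.95 is 95.0), and that metrics where lower is better, such as reopen rate, are named explicitly; the metric names are hypothetical.

```python
MAX_DROP_POINTS = 1.5  # from the text: nothing promotes past a 1.5-point drop


def blocked_metrics(prior: dict[str, float], candidate: dict[str, float],
                    lower_is_better: frozenset = frozenset(),
                    max_drop: float = MAX_DROP_POINTS) -> list[str]:
    """Names of metrics that regress by more than `max_drop` points."""
    blocked = []
    for name, before in prior.items():
        after = candidate.get(name)
        if after is None:
            blocked.append(name)   # a missing metric fails closed
            continue
        regression = (after - before) if name in lower_is_better else (before - after)
        if regression > max_drop:
            blocked.append(name)
    return blocked


def can_promote(prior: dict[str, float], candidate: dict[str, float],
                lower_is_better: frozenset = frozenset()) -> bool:
    return not blocked_metrics(prior, candidate, lower_is_better)
```

Under these assumptions, a candidate that moves groundedness from 95.0 to 93.2 is blocked (a 1.8-point drop), while 95.0 to 93.6 would pass.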

Tier-1 deflection rate (resolution-counted): baseline 0% (no bot) · shadow (wk 5) 39% · 10% rollout (wk 6) 38% · 100% (wk 8) 42% · target ≥ 35%
Groundedness (LLM-judge · cited chunk supports claim): baseline n/a · shadow (wk 5) 0.93 · 10% rollout (wk 6) 0.94 · 100% (wk 8) 0.95 · target ≥ 0.90
Escalation-draft acceptance: baseline n/a · 71% rising to 78% · target ≥ 70%
First-response on escalation lane: baseline 6 hr · shadow (wk 5) n/a · 10% rollout (wk 6) 1 hr 55 · 100% (wk 8) 1 hr 40 · target ≤ 2 hr
14-day reopen rate on auto-replied tickets: baseline n/a · shadow (wk 5) 8.4% · 10% rollout (wk 6) 6.9% · 100% (wk 8) 6.2% · target ≤ 8%
Nightly eval drift caught (>1.5 pt drop): baseline n/a · 1 rolled · target no silent drift

Deflection is resolution-counted: share of inbound tier-1 where the chatbot replied autonomously AND the ticket did not return within 14 days for the same root cause. Denominator is full tier-1 inbound, not just chatbot-attempted. Groundedness is LLM-as-judge on the 500-item eval, calibrated against 80 human-judged samples. Escalation-draft acceptance is the share of AI drafts the human agent kept as-is or lightly edited (versus rewriting from scratch). 14-day reopen is the share of auto-replied tickets that reopened for the same root cause within 14 days.
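The two definitions above translate into a short calculation. The `Ticket` record and its field names are hypothetical; the logic follows the stated definitions, with the full tier-1 inbound as the deflection denominator and the auto-replied subset as the reopen denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    auto_replied: bool          # the chatbot answered without a human
    reopened_within_14d: bool   # came back for the same root cause


def deflection_rate(tickets: list[Ticket]) -> float:
    """Resolution-counted deflection over ALL tier-1 inbound tickets."""
    if not tickets:
        raise ValueError("no tickets in the measurement window")
    resolved = sum(1 for t in tickets if t.auto_replied and not t.reopened_within_14d)
    return resolved / len(tickets)


def reopen_rate(tickets: list[Ticket]) -> float:
    """Share of auto-replied tickets that reopened within 14 days."""
    auto = [t for t in tickets if t.auto_replied]
    if not auto:
        raise ValueError("no auto-replied tickets in the measurement window")
    return sum(1 for t in auto if t.reopened_within_14d) / len(auto)
```

A reopened ticket lowers the deflection numerator and raises the reopen rate, so the two numbers cannot be improved by closing tickets prematurely.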

watch the chatbot work · one real ticket shape

An SSO error,
end to end.

One of the 9 intent categories: SSO connection failure with a known help-center remedy. In this turn Haiku 4.5 classifies the intent in 240 ms, pgvector retrieves three cited chunks, Sonnet 4.6 generates a grounded reply, Zendesk function-calling creates a tagged ticket, and Langfuse logs the whole trace for nightly eval replay.

Illustrative per-decision economics — typical engagement-band figures, not literal client numbers.

Customer: Hi — I cannot log in. The dashboard keeps showing 'SSO connection failed'.
Bot: Looking at your SSO config — one moment.
Bot: Your IdP signing cert expired on Wednesday. Easiest fix: rotate the cert in your Okta admin (steps 1–4 in our SSO renewal guide). Want me to attach the article to a ticket so you have it on hand?
Customer: Yes please, and tag urgent.
Bot: Creating the ticket and attaching the article…
Bot: Ticket #4421 created, tagged urgent, SSO renewal article attached. If you hit a snag at step 3, reply here and I will loop in a human agent.
Bot: Logged · trace lf_8e3a · 4 turns · total cost $0.014 · eval queued for nightly replay

how 42% compares

The honest read on
the numbers you have seen elsewhere.

You have probably seen bigger numbers. Klarna said its OpenAI assistant handled two-thirds of customer service chats. Intercom Fin publishes a 67% resolution rate across 40 million conversations. Vendor pitch decks regularly promise 80% and up. Our 42% sits below all of them. Here is the read.

Handled is not resolved. The most-cited Klarna number — two-thirds of chats — measures which channel touched the customer first, not whether the customer's problem ended there. Klarna's CEO acknowledged in 2025 that the cost-first organization of support produced lower quality, and the company moved back to a hybrid AI-plus-human staffing model. We measure resolution on a frozen eval set with ground-truth disposition labelled by the customer's own support engineering team. A reply that lands but generates a second ticket next week does not count toward our 42%.

Industry RAG band is 40 to 60% for resolution. Across published benchmarks for retrieval-augmented chatbots in 2026, the strong-performance range is 40 to 60% resolution at confidence-gated routing. Below that band, the chatbot is doing rule-based matching at premium model cost. Above it, you are usually looking at routing metrics rather than resolution. Our 42% sits at the lower end of the strong band on a deliberately narrow tier-1 scope. Adding tier-2 paths (refunds, plan changes, escalation reasoning) is the route to lifting the number; we have scoped that as a separate 6-week build, not a feature add to this one.

Different ticket mix means different ceilings. Klarna's volume is consumer fintech with high-frequency, narrow-shape queries: delivery status, refund status, BNPL repayment dates. Our client is B2B SaaS with technical entitlement questions, SSO config, billing edge cases, and feature-flag rollout questions. The intent distribution is structurally different. A 42% number on B2B SaaS tier-1 is not directly comparable to a 67% number on consumer fintech. The share of inherently-resolvable inbound is itself a function of the customer's product surface, and consumer-fintech inbound has a much higher share of queryable-from-records intents than B2B SaaS technical inbound does.

Deflected does not mean resolved. The most expensive failure mode for any AI customer-service chatbot is the confident-wrong autonomous reply that closes a ticket the customer then reopens. We measured the 14-day reopen rate on autonomously-replied tickets at 6.2% and excluded those reopens from the 42% numerator; the denominator stays the full tier-1 inbound. Vendor pitch decks rarely report this number. When you evaluate any chatbot vendor, the second question after "what is your deflection rate" should be "what is your 14-day reopen rate on the auto-replied subset?" The two together are the honest measurement.

CSAT held through rollout. First-response time on deflected tickets dropped from a 6-hour median to 12 minutes (most of those 12 minutes is the customer's own reading and follow-up latency, not the model). On escalated tickets, the agent starts from a drafted reply rather than reading the ticket cold; first-response on the escalation lane dropped from 6 hours to 1 hour 40. Customer-satisfaction score on deflected tickets held within 0.3 points of the all-human baseline through the staged rollout. We are not chasing a deflection number at the cost of CSAT, which is the Klarna lesson rendered into engineering practice.

The 42% in our title is the honest version of the number. The bigger numbers in other vendors' decks are usually a different number wearing the same name.

when NOT to build this · kill points

The four shapes
we turn down.

A customer service chatbot built on this architecture will mislead users in any of the following situations. We turn down the engagement before a pilot is scoped, and we mean it.

Ticket volume under 500 per month

RAG infrastructure plus a 500-ticket eval set is overkill below this volume. Measured against the cost of adding a tier-1 agent, the pilot takes 12+ months to pay back. Below 500/month an off-the-shelf vendor's flat fee usually wins.

Help center is thin or stale

If under 30% of inbound is recoverable from existing docs, the model has nothing to ground on. We turn this engagement down. The right answer is: hire a docs writer for two months and we revisit. An agent over thin docs is a hallucination engine.

Tier-2 write-paths required on day one

Refunds, plan changes, billing writes need a stricter risk model than tier-1 deflection. We scope tier-2 as a separate 6-week pilot with per-action audit and dollar caps. Buyers who insist on day-one tier-2 are buyers we cannot serve well.

Support engineering will not own the eval set

If the client cannot commit one support engineer to weekly eval review and ground-truth labelling for the first six months, the deflection number drifts. The eval is the contract. Without an owner on the customer side, the number degrades silently and nobody catches it.

engagement shape · audit → pilot → continuous

How an engagement

stage	duration	what it covers
Discovery audit	1–2 weeks	500-ticket eval set + retrieval architecture memo + model + run-cost projection · client keeps every artifact regardless of next step
Pilot	4–6 weeks + 2 wk shadow	End-to-end chatbot · two-model pipeline · Zendesk function-calling · confidence gate · escalate-with-draft · Langfuse · CI-wired eval harness
Staged rollout	3 weeks	10% → 50% → 100% rollout · daily eval review · drift monitoring · prompt and retrieval tuning
Continuous delivery	ongoing · 30-day notice	Embedded engineer · weekly eval review · drift monitoring · scope extensions · prompt and retrieval updates

Audit deliverable (eval set, architecture memo, model + run-cost projection) is the client's to keep regardless of whether they continue. Pilot is milestone-billed against the CI-wired eval set. Continuous engagement runs on 30-day notice on either side. Run cost lands around the Claude API + pgvector + Langfuse spend at the observed volume — token math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing (Sonnet 4.6 at $3/1M input, $15/1M output; Haiku 4.5 at $1/1M input, $5/1M output) as of May 2026. Talk to our team for engagement scoping.

frequently asked · customer service chatbot

What buyers ask first.
Real answers, no hedging.

Will this customer service chatbot work on Intercom, Salesforce Service Cloud, or Help Scout instead of Zendesk?

Yes. The function-calling pattern is portable. Intercom's API, Salesforce REST, and Help Scout's mailbox API all expose the same primitives we use against Zendesk: create ticket, update status, tag, attach internal note. We budget one week for the integration spike when swapping support platforms. The model layer, the retrieval layer, the confidence gate, and the eval harness all stay identical. Only tool definitions and policy preconditions change.

How long for our team to reach 42% deflection on tier-1?

Five to eight weeks for most engagements. The pace is set by two factors. One: how clean your help center is. A stale or thin knowledge base extends the corpus-build phase, sometimes by two weeks. Two: how fast your support engineering team can label the 500-ticket eval set. We budget one week of their time, parallel to our discovery week. Clients with well-curated help centers and a dedicated eval owner have hit 42% in 5 weeks. Clients with messier corpora typically land at 8.

What about EU data residency and GDPR?

We deploy Claude on AWS Frankfurt for EU clients, pgvector in EU-Central, and Langfuse self-hosted inside the client's VPC. Ticket data and customer identifiers never leave the EU region. Anthropic's EU API endpoint is GDPR-compliant; we sign DPAs and route all customer-data traffic through region-pinned Claude endpoints. Audit logs ship to your CloudTrail or equivalent log store, not ours.

Can this chatbot handle tier-2 — refunds, plan changes, billing writes?

Not in this build. The chatbot documented here has zero write-path to billing or entitlement systems by design. Tier-2 is a separate engagement we scope as a 6-week pilot with a stricter risk model: per-action audit, human approval queue on the first 1,000 autonomous actions, dollar-cap on autonomous refunds, hard ceiling on the number of plan-change calls per day. We do not ship tier-2 as a feature add-on to a tier-1 chatbot. The policy model is fundamentally different, and we have seen vendors who blur this line cause real refund-fraud incidents.
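To make the four controls in that risk model concrete, here is a minimal sketch of a guard that sits in front of every tier-2 action. It is an assumption about how such a policy could be encoded, not part of this build. The 1,000-action approval threshold comes from the text; the refund cap, the daily plan-change ceiling, and the names `Tier2Policy` and `Tier2Guard` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tier2Policy:
    approval_first_n: int = 1_000        # from the text: first 1,000 actions need approval
    refund_cap_usd: float = 50.0         # hypothetical value
    max_plan_changes_per_day: int = 20   # hypothetical value


@dataclass
class Tier2Guard:
    policy: Tier2Policy
    actions_seen: int = 0
    plan_changes_today: int = 0          # a real system would reset this daily
    audit_log: list[tuple[str, float, str]] = field(default_factory=list)

    def decide(self, action: str, amount_usd: float = 0.0) -> str:
        p = self.policy
        if action == "refund" and amount_usd > p.refund_cap_usd:
            decision = "deny"
        elif action == "plan_change" and self.plan_changes_today >= p.max_plan_changes_per_day:
            decision = "deny"
        elif self.actions_seen < p.approval_first_n:
            decision = "needs_human_approval"
        else:
            decision = "allow"

        if decision != "deny":
            self.actions_seen += 1
            if action == "plan_change":
                self.plan_changes_today += 1
        # Per-action audit: every decision is recorded, including denials.
        self.audit_log.append((action, amount_usd, decision))
        return decision


guard = Tier2Guard(Tier2Policy())
print(guard.decide("refund", amount_usd=20.0))   # early actions go to the approval queue
print(guard.decide("refund", amount_usd=500.0))  # over the cap, denied outright
```

The hard limits are checked before the approval queue, so an over-cap refund is denied even while humans are still approving every action.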

How is the 42% measured? Is this deflection or resolution?

Resolution, not first-touch deflection. The 42% is the share of inbound tier-1 tickets where the chatbot replied autonomously AND the ticket did not return within the next 14 days for the same root cause. The denominator is the full tier-1 inbound volume during the measurement window, not just tickets the chatbot tried to answer. If we counted routing-only, as some published industry numbers do, our number would land higher. We do not count it that way because a deflected-but-reopened ticket is still a ticket; counting it as a win is dishonest.
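The definition above can be sketched in a few lines. This is an illustrative reading of the metric, not the reporting code used on the engagement: the `Ticket` fields and function names are hypothetical, and deciding whether a return shares the same root cause is assumed to happen upstream.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

REOPEN_WINDOW = timedelta(days=14)  # from the text


@dataclass
class Ticket:
    ticket_id: str
    autonomous_reply: bool               # chatbot replied without a human
    replied_on: date
    reopened_on: Optional[date] = None   # same-root-cause return, if any


def is_resolved_deflection(t: Ticket) -> bool:
    """Autonomous reply AND no same-cause return within the 14-day window."""
    if not t.autonomous_reply:
        return False
    if t.reopened_on is None:
        return True
    return t.reopened_on - t.replied_on > REOPEN_WINDOW


def resolution_rate(tickets: list[Ticket]) -> float:
    """Denominator is all tier-1 inbound, not just tickets the bot answered."""
    if not tickets:
        return 0.0
    return sum(is_resolved_deflection(t) for t in tickets) / len(tickets)


def routing_only_rate(tickets: list[Ticket]) -> float:
    """The looser count: any autonomous reply is a win, reopened or not."""
    if not tickets:
        return 0.0
    return sum(t.autonomous_reply for t in tickets) / len(tickets)


day0 = date(2026, 1, 5)
sample = [
    Ticket("a", True, day0),
    Ticket("b", True, day0),
    Ticket("c", True, day0, reopened_on=day0 + timedelta(days=3)),   # came back
    Ticket("d", True, day0, reopened_on=day0 + timedelta(days=20)),  # outside window
    Ticket("e", False, day0),                                        # escalated
]
print(resolution_rate(sample), routing_only_rate(sample))
```

On this toy sample the resolution-counted rate is 0.6 and the routing-only rate is 0.8, which shows why the stricter count always lands at or below the looser one.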

What happens when Anthropic releases a new Claude model?

The 500-ticket eval set replays nightly on the current production prompt. When Anthropic ships a new model, we run the eval against it before promoting. Two outcomes are common. One: the new model lifts groundedness or routing accuracy by 1 to 2 points and we promote after a one-week shadow. Two: it regresses on a subtle behavior (verbosity, refusal aggressiveness, tool-call latency) and we hold. The eval is the gate, not the press release. The same logic applies if you swap to a different vendor entirely.
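A promotion gate of this kind can be sketched as a comparison between the production baseline and the candidate model on the same eval set. The metric names, the 2% relative tolerance, and the example scores below are assumptions for illustration; only the baseline groundedness of 0.95 comes from this case study.

```python
HIGHER_IS_BETTER = ("groundedness", "routing_accuracy")
# Hypothetical proxies for verbosity, refusal aggressiveness, and tool-call latency.
LOWER_IS_BETTER = ("mean_reply_tokens", "refusal_rate", "tool_call_latency_ms")


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list[str]:
    """Metrics where the candidate is worse than baseline by more than the tolerance."""
    worse = []
    for metric in HIGHER_IS_BETTER:
        if candidate[metric] < baseline[metric] * (1 - tolerance):
            worse.append(metric)
    for metric in LOWER_IS_BETTER:
        if candidate[metric] > baseline[metric] * (1 + tolerance):
            worse.append(metric)
    return worse


def promotion_decision(baseline: dict, candidate: dict) -> str:
    """Hold on any regression; otherwise shadow for a week before promoting."""
    return "hold" if regressions(baseline, candidate) else "shadow_one_week_then_promote"


baseline = {
    "groundedness": 0.95, "routing_accuracy": 0.90,
    "mean_reply_tokens": 180, "refusal_rate": 0.08, "tool_call_latency_ms": 400,
}
# Small lift on quality, nothing else moved much.
candidate_a = {**baseline, "groundedness": 0.96, "routing_accuracy": 0.91}
# Better groundedness, but replies got much longer.
candidate_b = {**baseline, "groundedness": 0.97, "mean_reply_tokens": 260}

print(promotion_decision(baseline, candidate_a))
print(promotion_decision(baseline, candidate_b))
```

The second candidate is held despite the groundedness lift, because a gain on one metric does not offset a regression on another.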

Do we own the eval set, prompts, and retrieval config after the engagement ends?

Yes. The eval set is yours from day one (your team labels it; it sits in your repo). Prompts, intent rubric, retrieval config, tool policy file, and the Langfuse trace store are all checked into your repository at the end of the pilot. If you continue with us under the continuous engagement, we maintain them; if you do not, your platform team has everything needed to operate the chatbot in-house. No proprietary artifacts of ours live in critical-path production.

What is the difference between this and Zendesk Answer Bot or Intercom Fin?

The difference is custom RAG over your knowledge base, your ticket history, and your eval set, in your brand voice, versus a generic NLU layer over a vendor's connector to your help center. Off-the-shelf vendors price per resolution (multiple dollars per autonomous resolution at our client's volume); the custom build documented here runs at the API + infrastructure cost of the underlying Claude + pgvector + Langfuse stack, which is materially lower at sustained tier-1 volume. The trade-off: vendors give you faster setup (days versus weeks) and standardized telemetry; custom gives you full eval ownership, full tool policy control, and a chatbot that sounds like your brand. The right pick depends on whether you value time-to-deploy or long-term control more.

keep reading

Where this case study
points back to.

Pillars and sibling case studies that share architecture, model stack, or distribution surface with this customer service chatbot build.

01 Service

AI Chatbot Development

The chatbot pillar. RAG-grounded support bots, confidence gates, escalation-with-draft patterns. Same eval-first loop used on this build.

02 Service

Claude Development

Sonnet 4.6 + Haiku 4.5 integration patterns. Forced JSON, intent-classify routing, prompt cache shapes we used on this customer service chatbot.

03 Service

OpenAI Development

Alternative model stack. GPT-4o + GPT-4o-mini routing if your platform team has an OpenAI commitment or function-calling preference.

04 Service

AI Integration Services

Zendesk, Salesforce, Intercom, and ticketing-system function-calling. The integration layer behind every tier-1 deflection build.

05 Case study

Claude RAG for Product Docs

Sibling RAG case study. Same retrieval shape, different distribution surface — docs search instead of chat widget. 64% deflection on docs-recoverable queries.

06 Case study

OpenAI Realtime Voice Agent

Voice-channel sibling. Same intent-classify routing, different modality and latency budget.

07 Service

AI App Development Services

How a chatbot-deflection build fits inside a broader AI development engagement — widget shell + retrieval + classifier + Zendesk integration + dashboard.

08 Service

AI Consulting

This engagement started with our standard discovery audit — workflow inventory, deflection-rate baseline, eval-set design, and the kill point that gated the pilot.

09 Service

Customer Support Automation

Customer-support deflection in production. The workflow our customer-support automation pattern was built from.

Ready to ship

Want a chatbot case study like this
for your support stack?

Start with a discovery audit. We will sample 100 of your tier-1 tickets, scope the eval set, recommend a model + retrieval recipe, project token + run cost, and tell you honestly whether your support volume is chatbot-shaped. About one audit in five ends with 'you do not need this — buy the platform, here is the SOW for integration.'

Read the chatbot pillar

30 min, async or live · Eval-first scoping · Walk-away point in the pilot

Updated May 22, 2026 · By Navin Sharma

