Tier-1 customer service chatbot.
42% deflection in 8 weeks.
A Series C B2B SaaS company was running a 6-hour first-response time on its Zendesk tier-1 queue. We shipped a RAG-grounded customer service chatbot over the help center and 90 days of ticket history, with Claude Haiku 4.5 on intent classify, Claude Sonnet 4.6 on grounded reply, and Zendesk function-calling for ticket actions. Five-week pilot, two weeks of shadow mode, three weeks of staged rollout. By week 8 the chatbot was deflecting 42% of tier-1 traffic at $800/mo all-in run cost.
Where the support queue
actually was.
6-hour median first-response time on tier-1 Zendesk. Three-quarters of inbound was the same five ticket shapes. Generic chatbots had been tried and turned down.
The client is a Series C B2B SaaS company. Their support floor sat at 14 agents on a follow-the-sun rotation, with an inbound Zendesk queue running a 6-hour median first-response time on tier-1 traffic and a long tail of 18 to 24 hours during their US morning peak. The Head of Support Engineering ran the numbers in our discovery week and named the binding constraint: 76% of tier-1 ticket volume mapped to five repeating shapes. Those were password resets, billing questions ("why was I charged on the 4th, not the 1st"), feature how-tos lifted directly from the help center, plan-tier entitlement questions, and SSO connection errors with a known knowledge-base remedy.
The team had previously evaluated three off-the-shelf AI customer support software vendors and turned all of them down. The objections were consistent. Tone didn't match the brand voice. Product knowledge was weak or wrong, especially on entitlement edge cases the marketing site didn't cover. And there was no honest measurement methodology behind the deflection-rate claims those vendors put in their pitch decks. The Head of Support Engineering said it plainly: we will buy a customer service chatbot when we can audit what it knows, see where it draws its answers from, and measure how often it gets the answer right on tickets we already have ground truth on.
What we built,
and why.
A two-model RAG chatbot grounded on the help center plus 90 days of resolved tickets, with a confidence gate that escalates with the AI's draft reply attached for the human agent.
We shipped a RAG chatbot with a two-model pipeline. The intent-classify step runs on Claude Haiku 4.5: every inbound message gets routed into one of nine ticket categories (or "out of scope" for the human queue). The reply step runs on Claude Sonnet 4.6, with retrieval over a pgvector index built on the help center plus the last 90 days of resolved Zendesk tickets. Every reply has to cite the chunk it pulled the answer from, or the schema validator rejects the output. This is not a personality bot. It is a grounded customer support chatbot that refuses to answer when it cannot ground its response.
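A minimal sketch of the cite-or-reject rule described above. The JSON field names (`reply`, `citations`), the chunk-id format, and the function and class names are assumptions for illustration, not the schema used on this build.

```python
import json
from dataclasses import dataclass


class UngroundedReplyError(ValueError):
    """Raised when a model reply does not cite a retrieved chunk."""


@dataclass(frozen=True)
class GroundedReply:
    text: str
    citations: tuple[str, ...]


def validate_reply(raw_json: str, retrieved_chunk_ids: set[str]) -> GroundedReply:
    """Parse the model's JSON output; reject it unless it cites retrieved chunks."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        raise UngroundedReplyError("reply must be a JSON object")
    text = payload.get("reply")            # hypothetical field name
    citations = payload.get("citations")   # hypothetical field name
    if not isinstance(text, str) or not text.strip():
        raise UngroundedReplyError("missing reply text")
    if not isinstance(citations, list) or not citations:
        raise UngroundedReplyError("reply cites no chunk")
    # A citation only counts if it points at a chunk retrieval actually returned.
    unknown = [c for c in citations if not isinstance(c, str) or c not in retrieved_chunk_ids]
    if unknown:
        raise UngroundedReplyError(f"cited chunks were not retrieved: {unknown}")
    return GroundedReply(text=text, citations=tuple(citations))
```

Checking citations against the retrieved set, rather than only checking that a citation field is present, is what stops a reply from citing a chunk id the model made up.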
The confidence gate sits at 0.7. Below threshold, the RAG chatbot does not reply to the customer at all. Instead, it escalates the ticket to the human queue with the AI-drafted reply attached as an internal note. The agent reads the draft, edits or rewrites it, and sends. We measured that draft-attached escalation path as its own metric because we suspected (correctly) that it would carry a meaningful chunk of the productivity gain even on tickets the chatbot couldn't close autonomously.
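The gate logic can be sketched as below. The 0.7 threshold comes from the text; how the confidence score is produced is not specified here, so the sketch takes it as an input, and the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal, Optional

CONFIDENCE_THRESHOLD = 0.7  # value stated in the case study


@dataclass(frozen=True)
class GateDecision:
    action: Literal["reply_to_customer", "escalate_with_draft"]
    customer_message: Optional[str]  # what the customer sees, if anything
    internal_note: Optional[str]     # what the human agent sees on escalation


def apply_confidence_gate(
    draft: str, confidence: float, threshold: float = CONFIDENCE_THRESHOLD
) -> GateDecision:
    """Send the draft to the customer at or above the threshold; otherwise escalate with it attached."""
    if confidence >= threshold:
        return GateDecision("reply_to_customer", customer_message=draft, internal_note=None)
    # Below threshold the customer sees nothing; the draft travels as an internal note.
    return GateDecision("escalate_with_draft", customer_message=None, internal_note=draft)
```

Returning both fields on every decision keeps the two lanes easy to count separately, which matters when draft-attached escalation is tracked as its own metric.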
Zendesk function-calling handles every write action. The customer service chatbot can create a ticket, update ticket status, attach the conversation transcript to a parent ticket, and tag the case for routing. It cannot issue refunds. It cannot change plan tiers. It cannot read or write to billing systems. Every write tool has a policy file the agent runtime imports at startup, and the agent refuses to call a tool that fails its policy preconditions.
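A rough sketch of a policy check in front of tool calls, under stated assumptions: the tool names mirror the four write actions listed above, but the individual preconditions, the status values, and every identifier are illustrative rather than taken from the build's policy files.

```python
from typing import Callable

Precondition = Callable[[dict], bool]


class ToolRefused(Exception):
    """Raised when a tool call is not allowed or fails a precondition."""


# Allow-list: anything absent from this mapping (refunds, plan changes,
# billing reads or writes) is refused outright. Preconditions are hypothetical.
TOOL_POLICIES: dict[str, list[Precondition]] = {
    "create_ticket": [lambda args: bool(args.get("subject"))],
    "update_ticket_status": [lambda args: args.get("status") in {"open", "pending", "solved"}],
    "attach_transcript": [lambda args: "parent_ticket_id" in args],
    "tag_ticket": [lambda args: bool(args.get("tags"))],
}


def authorize_tool_call(tool_name: str, args: dict) -> None:
    """Raise ToolRefused unless the tool is allow-listed and every precondition holds."""
    policy = TOOL_POLICIES.get(tool_name)
    if policy is None:
        raise ToolRefused(f"{tool_name} is not an allowed tool")
    for index, check in enumerate(policy):
        if not check(args):
            raise ToolRefused(f"{tool_name} failed precondition #{index}")
```

The design choice worth noting is default-deny: a tool the policy table has never heard of is refused, so adding a new write path requires an explicit policy entry.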
The web widget lives inside the customer dashboard, not the marketing site. This is a customer support chatbot for authenticated users with an account context the agent can read at session start, not a marketing chatbot trying to convert anonymous visitors. The distinction matters: the retrieval scope is narrower, the entitlement questions are answerable from authenticated context, and the failure modes are different.
Named tools,
named versions.
- Claude Sonnet 4.6 — reply generation with forced JSON and citation-pinned rationale on every response.
- Claude Haiku 4.5 — intent classify into one of nine categories, plus ticket categorization downstream of the reply.
- pgvector — RAG retrieval over the help center (480 articles, chunked at 512 tokens) and 90 days of resolved Zendesk tickets (~18,000 tickets, deduplicated by canonical resolution).
- Zendesk API — function-calling for ticket create, status update, tag, and conversation-transcript attachment.
- Langfuse — per-conversation trace store, eval-set replay, drift monitoring on the nightly held-out eval.
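For illustration, fixed-size chunking at 512 tokens might look like the sketch below. It uses whitespace splitting as a stand-in for a real tokenizer, has no overlap between chunks, and invents a `article_id#index` chunk-id format; all three are assumptions, since the text only states the 512-token chunk size.

```python
def chunk_article(article_id: str, text: str, max_tokens: int = 512) -> list[dict]:
    """Split one help-center article into consecutive chunks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()  # stand-in tokenizer: whitespace-separated words
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(
            {
                # Hypothetical id format, stable enough to be cited in a reply.
                "chunk_id": f"{article_id}#{start // max_tokens}",
                "text": " ".join(tokens[start : start + max_tokens]),
            }
        )
    return chunks
```

Stable chunk ids are what make citation pinning checkable: a reply can name the exact chunk it drew from.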
How we measured
42% deflection.
Pre-built eval set, three metrics that gate every release, two weeks of shadow mode before any live customer interaction.
The Head of Support Engineering pulled a representative sample of 500 tier-1 tickets from the previous 90 days, stratified across the nine intent categories. Each ticket carried a labelled ground-truth resolution: the correct reply, the correct ticket disposition (resolve / escalate / route), and the help-center article IDs that should have grounded the answer. We wrote the eval harness; the support engineering team wrote the answers. That split mattered for organizational trust later, when the model-evaluation results were reviewed.
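One way to draw a stratified sample like this is sketched below. It assumes proportional allocation per category (the text does not say whether the strata were proportional or equal-sized), a `category` key on each ticket, and a fixed seed for repeatability; all names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(tickets: list[dict], sample_size: int, seed: int = 0) -> list[dict]:
    """Sample tickets so each category keeps roughly its share of the population."""
    total = len(tickets)
    if not 0 < sample_size <= total:
        raise ValueError("sample_size must be between 1 and the number of tickets")
    by_category: dict[str, list[dict]] = defaultdict(list)
    for ticket in tickets:
        by_category[ticket["category"]].append(ticket)
    # Largest-remainder allocation in integer arithmetic: floor each quota,
    # then give the leftover slots to the categories with the biggest remainders.
    shares = {c: divmod(sample_size * len(rows), total) for c, rows in by_category.items()}
    quota = {c: whole for c, (whole, _) in shares.items()}
    leftover = sample_size - sum(quota.values())
    for c in sorted(shares, key=lambda c: shares[c][1], reverse=True)[:leftover]:
        quota[c] += 1
    rng = random.Random(seed)
    sample: list[dict] = []
    for c, rows in by_category.items():
        sample.extend(rng.sample(rows, quota[c]))
    return sample
```

Called with `sample_size=500` over 90 days of categorized tickets, this would return exactly 500 tickets while keeping small categories represented.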
Three metrics gate every release of the customer service chatbot. Deflection rate: the percentage of eval tickets the chatbot resolved autonomously without escalating, measured against the ground-truth disposition. Groundedness: the share of replies where every claim cited a help-center chunk that supported it, reported on a 0-to-1 scale and judged by a separate LLM-as-judge evaluator we calibrated against 80 human-judged samples. Escalation quality: on sub-threshold escalations, the percentage of AI-drafted replies the human agent accepted as-is or with minor edits, versus rewriting from scratch.
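The three metrics could be computed from per-ticket eval records roughly as follows. The record fields, the verdict labels, and the reading of "deflected" as "resolved by the chatbot where ground truth also says resolve" are assumptions made for this sketch; all three values are returned as percentages for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

ACCEPTED_VERDICTS = {"as_is", "minor_edits"}  # hypothetical labels


@dataclass(frozen=True)
class EvalResult:
    expected_disposition: str     # ground truth: "resolve", "escalate" or "route"
    bot_resolved: bool            # chatbot answered without escalating
    grounded: Optional[bool]      # judge verdict on the reply; None if no reply was sent
    draft_verdict: Optional[str]  # "as_is", "minor_edits" or "rewritten"; None if not escalated


def pct(numerator: int, denominator: int) -> float:
    """Percentage, with 0.0 for an empty denominator."""
    return 100.0 * numerator / denominator if denominator else 0.0


def release_metrics(results: list[EvalResult]) -> dict[str, float]:
    """Compute deflection rate, groundedness and escalation quality as percentages."""
    deflected = sum(r.bot_resolved and r.expected_disposition == "resolve" for r in results)
    replies = [r for r in results if r.grounded is not None]
    drafts = [r for r in results if r.draft_verdict is not None]
    return {
        "deflection_rate": pct(deflected, len(results)),
        "groundedness": pct(sum(bool(r.grounded) for r in replies), len(replies)),
        "escalation_quality": pct(
            sum(r.draft_verdict in ACCEPTED_VERDICTS for r in drafts), len(drafts)
        ),
    }
```

Note the different denominators: deflection is over all eval tickets, groundedness over replies only, and escalation quality over escalated drafts only.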
Before any live customer ever saw the chatbot, we ran two weeks of shadow mode. The agent generated a reply for every inbound tier-1 ticket, but the reply went into an internal-note field, not to the customer. A senior support engineer reviewed a 10% sample daily. That shadow window is where the three biggest production issues surfaced (covered in the retrospective below). Nightly, we re-run the full 500-item eval against the current production prompt and retrieval scope. Any metric drop greater than 1.5 points from the previous cut triggers a Langfuse alert and blocks the next deploy.
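The nightly regression rule can be sketched as a small comparison between two metric snapshots. The 1.5-point limit is from the text; the sketch assumes every metric is expressed in percentage points on a 0-to-100 scale, and the function names are hypothetical. Wiring the result to a Langfuse alert or a CI gate is left out.

```python
REGRESSION_LIMIT_POINTS = 1.5  # value stated in the case study


def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    limit: float = REGRESSION_LIMIT_POINTS,
) -> dict[str, float]:
    """Return every metric that dropped by more than `limit` points, with the size of the drop."""
    drops = {}
    for name, before in previous.items():
        drop = before - current[name]  # a missing metric raises KeyError on purpose
        if drop > limit:
            drops[name] = round(drop, 2)
    return drops


def deploy_allowed(previous: dict[str, float], current: dict[str, float]) -> bool:
    """A deploy is blocked as soon as any gated metric regresses past the limit."""
    return not find_regressions(previous, current)
```

Comparing against the previous cut rather than a fixed floor means slow drift still has to clear the bar night by night.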
What shipped,
week by week.
Two weeks shadow, three weeks staged rollout, steady-state at week 8. Deflection rate held.
Weeks 0 to 2 (shadow mode). The chatbot generated replies for every tier-1 ticket but never reached a customer. The shadow review surfaced three failure modes we hadn't anticipated. First, entitlement edge cases: the chatbot was confidently citing the public pricing page on plan-tier questions, but enterprise customers on negotiated contracts had different entitlements that lived in Salesforce, not the help center. We expanded retrieval scope to read entitlement context from the authenticated session at runtime. Second, billing dispute escalations were getting auto-replied to as if they were billing-information questions; we tightened the intent classifier with adversarial examples. Third, feature-flag rollout questions ("when will feature X be on for us") were being answered from stale release notes; we excluded the release-notes corpus from retrieval and routed feature-flag questions straight to the human queue.
Week 3 (10% rollout). Deflection rate 38% on the live sample. Groundedness 0.94. Escalation-draft acceptance rate 71%.
Week 4 (50% rollout). Deflection 41%. Groundedness 0.94. Escalation-draft acceptance 74%.
Week 6 (100% rollout). Deflection 42%, holding. Groundedness 0.95. Escalation-draft acceptance 76%.
Week 8 (steady-state). 42% tier-1 deflection sustained across a full reporting cycle. Escalation-draft acceptance rate climbed to 78% as the agents learned to trust and rely on the draft as a starting point. First-response time on deflected tickets dropped from a 6-hour median to 12 minutes (the 12 minutes is mostly the customer's own reading and follow-up latency, not the model). On escalated tickets, first-response time dropped from 6 hours to 1 hour 40 minutes, since the human agent now starts with a drafted reply rather than reading the ticket cold.
What we'd do
differently.
Three things we got wrong on the first pass, plus what comes next. We document them so the next chatbot build benefits from these mistakes.
We shipped the confidence gate too conservatively. Initial threshold was 0.8, which kicked too much volume into the escalation lane and made the deflection number look worse than it actually was. We tuned to 0.7 in week 2 of shadow, after seeing that the model was over-cautious on intent categories where it had strong grounding (password resets, SSO errors) and the conservative threshold was costing us real autonomous resolutions.
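A threshold comparison like the 0.8 versus 0.7 decision can be framed as a sweep over scored eval tickets. This is a sketch under assumptions: each record is a `(confidence, reply_was_correct)` pair, and the two reported columns are illustrative choices rather than the calibration procedure used on this build.

```python
from typing import Optional


def sweep_thresholds(
    scored: list[tuple[float, bool]], thresholds: list[float]
) -> list[dict[str, Optional[float]]]:
    """For each candidate threshold, report how much volume is answered and how accurate it is."""
    if not scored:
        raise ValueError("need at least one scored ticket")
    rows = []
    for threshold in thresholds:
        # Tickets at or above the threshold would be answered autonomously.
        answered = [correct for confidence, correct in scored if confidence >= threshold]
        rows.append(
            {
                "threshold": threshold,
                "autonomous_share": len(answered) / len(scored),
                "accuracy_when_answering": sum(answered) / len(answered) if answered else None,
            }
        )
    return rows
```

Running the sweep per intent category, rather than over the whole eval set, would show where a conservative threshold is discarding well-grounded answers.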
We underestimated how messy the Zendesk macro library was before we indexed it. The macro library had drifted across two organizational restructures, and "resolved" tickets often resolved with the wrong macro applied — meaning our 90-day ticket-history corpus had subtle ground-truth noise in it. We spent four days in week 2 cleaning the macro mappings before re-indexing. Future projects: we audit and clean the historical resolution data before the corpus build, not during.
We should have shipped the draft-attached escalation behaviour before going to the 50% rollout. We had the draft-attach feature behind a feature flag and turned it on at the 100% rollout step, which meant the entitlement edge cases discovered in shadow took two extra weeks to actually improve the human agents' workflow. If we ran this back, we would ship draft-attach at the 10% rollout step and let the human agents catch the model's failure modes faster, since the draft is in front of them by default.
Next iteration of the customer service chatbot extends scope to tier-2 with refund-policy reasoning. Current build is tier-1 only, by design: refunds require reading and writing across billing systems, and we wanted the tier-1 deflection number stable before adding the more sensitive write path. That extension is now scoped as a separate 6-week pilot, not a feature add to this build.
What it cost
to ship.
Fixed-fee discovery and pilot. Run cost under $1K/mo at the observed volume.
$3K discovery audit (2 weeks). Eval set design with the support engineering team, retrieval architecture decisions, model selection between Claude and an OpenAI alternative, intent-classify rubric drafted, confidence-gate strategy. Deliverable is a written scope memo plus the labelled eval set, which the client owns and keeps regardless of whether they continue with us.
$18K pilot (5 weeks, fixed-fee). End-to-end customer service chatbot shipped: pgvector index, two-model pipeline, Zendesk function-calling, confidence gate, escalation with draft, Langfuse traces, the eval harness wired to CI, and two weeks of shadow mode. The fixed fee includes the calibration sweeps and the macro-library cleanup we did not initially scope.
$5K/mo continuous (post-launch). Embedded engineer plus ops support. Weekly eval reviews with the support engineering team, drift monitoring on Langfuse, scope extensions, prompt and retrieval-corpus updates as the product evolves.
For comparison, the $800/mo run cost of this 42%-deflection chatbot is well below the $2K-$5K/mo per-seat licensing of the off-the-shelf AI customer support software vendors the client had previously priced.
$800/mo run cost. All-in: Claude API spend (Sonnet 4.6 + Haiku 4.5 split), pgvector on a managed Postgres instance, Langfuse self-hosted. The bulk of run cost is Sonnet 4.6 on the reply step; Haiku 4.5 on intent classify is roughly 5% of the model bill.
Where this case study
points back to.
Pillars and sibling case studies that share architecture, model stack, or distribution surface with this customer service chatbot.
AI Chatbot Development
The chatbot pillar. RAG-grounded support bots, confidence gates, escalation-with-draft patterns. Same eval-first loop used on this build.
Claude Development
Sonnet 4.6 + Haiku 4.5 integration. Forced JSON, intent classify routing, prompt-cache patterns we used on this customer support chatbot.
OpenAI Development
Alternative model stack. GPT-4o + GPT-4o-mini routing if the client has an existing OpenAI commitment or function-calling preference.
AI Integration Services
Zendesk, Salesforce, Intercom, and ticketing-system function-calling. The integration layer behind every tier-1 deflection build.
Claude RAG for Product Docs
Sibling RAG case study. Same retrieval shape, different distribution surface — docs search instead of chat widget.
OpenAI Realtime Voice Agent
Voice-channel sibling. Same intent-classify routing, different modality and latency budget.