Ecommerce Chatbot: Shopify + RAG Build Guide (2026)
A working ecommerce chatbot build: Shopify Admin API, RAG over product catalog, WISMO, abandoned-cart recovery, Claude routing, and a 500-conversation eval.
Baymard Institute measured 70.19% global cart abandonment across ecommerce in their 2025 rolling average. The math on recovered-cart revenue alone justifies an ecommerce chatbot build for any Shopify store doing meaningful volume. Add deflected WISMO tickets and grounded product Q+A on top and the case stops being a marketing claim.
We ship ecommerce chatbots on Shopify the way we ship the rest of our work. Real GraphQL queries against the Admin API, RAG over the product catalog in pgvector, model-per-task routing across Claude Sonnet 4.6, Haiku 4, and GPT-4o, an Aftership-backed WISMO subsystem, abandoned-cart recovery on Inngest, and a 500-conversation golden set evaluated weekly with Ragas. No vendor pitch. This guide is the build we use on client engagements, written for the engineer who has to ship the thing. It is the AI chatbot development engagement we run on the Shopify side of the practice.
What you'll get below: a 5-layer reference architecture, working Shopify Admin API code in three languages, a full TypeScript request handler that wires retrieval to Claude with grounded SKU citations, two flow diagrams (WISMO and abandoned cart), a confidence-gate threshold we tune in production, and a 2026-Q1 benchmark table that puts our deflection and cart-recovery numbers next to the vendor-listicle claims you've already read. If you searched "ai chatbot for ecommerce" or "best ecommerce chatbot" and landed on a ranked vendor listicle, this is the opposite: a build write-up, not a product roundup. You'll see real ecommerce chatbot examples, a working Shopify AI chatbot wired to live order data, and the conversational commerce patterns we use to make abandoned-cart recovery read like brand voice rather than template blasts.
What an ecommerce chatbot actually does
Strip the marketing and an ecommerce chatbot does five things. It answers product questions with citations to your real catalog. It looks up order status and tracking without a human picking up a ticket. It recovers abandoned carts with personalized messages, not template blasts. It triages refund and return requests against your policy doc. It hands off to a human cleanly when confidence drops below a threshold you set.
Every metric that matters maps to one of those five. Deflection rate measures the order-status and FAQ paths. Average order value lift comes from grounded product recommendations. Recovered cart revenue is the cart-abandon pipeline. CSAT is the confidence gate. Anything a vendor pitches that does not map to one of these is a feature, not a number.
The framing matters because the SERP for this query is dominated by vendor listicles and definitional glossaries. IBM owns the top slot on brand authority. ProProfsChat and ChatBot.com fill the middle with vendor reviews, naming the same six tools in different orders. ScienceSoft has the only result with an architecture diagram in the top ten, and it ships a 2020-vintage stack (IBM Watson, Dialogflow, Microsoft Bot Framework). None of them ship code. None of them write for a working Shopify engineer. That is the gap this post is built to fill, and the reason every section below names tools and trade-offs you can actually evaluate.
A working definition: an ecommerce chatbot is a conversational interface that grounds every answer in your live commerce backend (catalog, orders, customers, policies) and decides per turn whether to answer, take a tool action, or escalate to a human. The grounding is what separates this from the FAQ bots of 2020. The per-turn routing is what separates it from the script-tree chatbots most platforms still ship by default.
Five ecommerce chatbot use cases that move revenue
Vendor pages list 20 use cases. In our delivery, five do real work and the rest are framing. Order status (WISMO) is the highest-volume ticket category in ecommerce by a wide margin, and the cheapest to deflect with a Shopify-backed bot. Product Q+A with grounded citations to your catalog is the second-biggest, and the most prone to hallucinated SKUs if you skip RAG. Abandoned-cart recovery is the revenue path. Refund and return triage is where confidence-gate tuning earns its keep. Post-purchase upsell sits between marketing and support, and our customer service channel guide covers how each surface (web widget, email, SMS, WhatsApp) handles it differently.
Take WISMO first because it pays back the engineering investment fastest. On a typical Shopify store between 30% and 45% of inbound contact is some flavour of "where is my order." Those tickets are low-value work for a human agent (open the order, copy the tracking, paste into the reply) and they arrive in predictable windows after each shipment. A working bot answers them in under a second, branded with the carrier logo, with a one-click escalation button when the customer wants to talk to a person. We see WISMO deflection numbers between 65% and 80% on the buckets where Aftership has live tracking data; the low end of that range comes from carriers that report tracking late or not at all.
Product Q+A is the second-biggest bucket and the one most prone to the failure mode that gets ecommerce chatbots in the press for the wrong reason. If you call Claude without giving it the retrieved product set as ground truth, and without parsing the SKU citations out of the response, you ship product cards that 404. We have seen this on three audits of inherited bots, and the root cause is always the same: the prompt asks the model to be helpful instead of constraining it to the catalog. The fix is structural, not a prompt tweak: retrieved set in the system prompt, cited-SKU regex in the parser, render only what was cited.
Abandoned-cart recovery is the only one of the five with a direct dollar number on the other end. Refund triage and post-purchase upsell sit at the edges; the first reduces handle time, the second lifts AOV when the post-order timing is tight. We rank them honestly below by revenue impact and build complexity, with the failure mode named in the last column. Pick two to ship in the pilot, not five.
| Use case | Revenue impact | Build complexity | Where it fails |
|---|---|---|---|
| WISMO (order status) | High volume, low margin lift | Low: Shopify Order API + Aftership | Carrier delay > SLA without escalation path |
| Product Q+A (RAG) | Direct AOV lift (11.8% 2026-Q1) | Med: pgvector + embedding sync | Hallucinated SKUs if Claude not constrained to retrieved set |
| Abandoned-cart recovery | Highest single-message ROI | Med: Inngest delay + Klaviyo send | Duplicate webhook fires create duplicate sends |
| Refund / return triage | Indirect (handle time saved) | High: policy RAG + tool-use | Confidence gate too tight escalates everything |
| Post-purchase upsell | Medium AOV lift, channel-dependent | Med: needs grounded recommendations | Reads as spam outside the 24h post-order window |
Reference architecture: RAG over product catalog + Shopify Admin API
Five layers, top to bottom: channels, orchestrator, retrieval, model router, integrations. The order matters. Channels are dumb adapters. The orchestrator (we use LangGraph) holds state and routes calls. Retrieval runs before model selection because the retrieval score drives the routing decision. The model router picks Sonnet 4.6, Haiku 4, or GPT-4o per task. Integrations (Shopify Admin, Storefront, Aftership, Klaviyo) are tool definitions the model can call.
Layer one is the channel adapter. Web widget, WhatsApp Business API, voice via Vapi or Twilio. Each adapter normalizes inbound (text, audio transcript, attached image) into a common message envelope with sessionId, customerId where available, plus the channel and message body. The adapter also owns channel-specific outbound contracts: WhatsApp template approval, voice TTS handoff, web widget product-card rendering. Keeping these contracts at the edge means the layers below never have to know which surface they are answering to.
Layer two is the orchestrator. We use LangGraph because it gives us a typed state machine with retries, parallel tool execution, and a clean place to put the confidence gate. The orchestrator holds conversation state (last 8 turns, current intent, retrieved context if any), runs the intent classifier on every turn, and decides whether the next step is retrieval, a tool call, or a human handoff. Teams that skip the orchestrator and call the model directly end up with stateless conversations and a confidence gate scattered across three files.
Layer three is retrieval. pgvector over the product catalog with text-embedding-3-large at 3072 dims. We index title, description, tags, option values, and the variant title concatenated, and we re-index on every products/update webhook. A second pgvector index holds the policy corpus (returns, shipping, sizing, FAQ docs) chunked at 600 tokens with 80-token overlap. Top-5 retrieval feeds both the model context and the confidence gate. If the top-1 similarity score is below 0.72 on a product question, we skip the model call and escalate.
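The policy-corpus chunking described above can be sketched as a sliding window. This is a minimal illustration, not the production indexer: the function name `chunkWithOverlap` is hypothetical, and tokens are approximated by whitespace-separated words, where a real build would count tokens with the embedding model's tokenizer.

```typescript
// Sliding-window chunker: fixed-size windows that share `overlap` units
// with the previous window.
// Assumption: "tokens" are approximated by whitespace-separated words.
function chunkWithOverlap(text: string, size: number = 600, overlap: number = 80): string[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = size - overlap; // each window starts `step` units after the last
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // this window already reached the end
  }
  return chunks;
}
```

With the defaults, a 1,200-word policy doc yields three chunks, and the last 80 words of each chunk repeat at the start of the next, so a sentence that straddles a boundary is still retrievable from one chunk.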
Layer four is the model router. Claude Sonnet 4.6 as the default for reasoning and tool use. Haiku 4 for the FAQ path and the intent classifier (roughly 12x cheaper, fast enough for the classifier). GPT-4o for image product search where vision is the input modality. The router reads the intent label and the retrieval score and picks the cheapest model that meets the confidence floor for that task class. Latency budgets are per-task: 800ms p95 for FAQ, 2.5s for product Q+A, 3.5s for image search.
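As a rough sketch of the routing table, assuming the intent classifier emits one of four labels: the names `Intent`, `ModelRoute`, and `routeFor` are hypothetical, the model strings are labels rather than provider model IDs, and the latency budgets and confidence floors are the ones quoted in this guide.

```typescript
// Per-task routing table. The router only selects configuration; the
// confidence decision itself happens elsewhere in the orchestrator.
type Intent = 'faq' | 'product_qa' | 'image_search' | 'refund';

interface ModelRoute {
  model: string;        // illustrative label, not a provider model ID
  p95BudgetMs: number;  // latency budget for this task class
  confidenceFloor: number;
}

const MODEL_ROUTES: Record<Intent, ModelRoute> = {
  faq:          { model: 'haiku',  p95BudgetMs: 800,  confidenceFloor: 0.78 },
  product_qa:   { model: 'sonnet', p95BudgetMs: 2500, confidenceFloor: 0.72 },
  image_search: { model: 'gpt-4o', p95BudgetMs: 3500, confidenceFloor: 0.70 },
  refund:       { model: 'sonnet', p95BudgetMs: 3000, confidenceFloor: 0.82 },
};

function routeFor(intent: Intent): ModelRoute {
  return MODEL_ROUTES[intent];
}
```

Keeping the table as data rather than branching logic means a threshold change is a one-line diff that the eval suite can re-score.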
Layer five is integrations expressed as tool definitions. getOrderStatus calls the Shopify Admin GraphQL Order API. searchProducts hits pgvector first and falls back to Shopify Storefront search if retrieval is weak. addToCart manipulates the Storefront cart token. trackShipment calls Aftership. sendRecovery dispatches through Klaviyo or Postscript depending on the customer's prior channel. Every tool definition is typed (we use Zod for runtime validation), and every tool result is logged with its trace ID so the eval suite can replay any conversation later.
How to build a Shopify ecommerce chatbot — the 6-step playbook
Six steps, soup to nuts. This is the procedure we walk through on a kickoff and the spine of the HowTo schema we ship with this post. Reference docs for the API surface live in the Shopify Admin API GraphQL reference.
Step 1. Authenticate your Shopify app and subscribe to the webhooks you need. OAuth offline access token. HMAC-verify every incoming webhook payload. Subscribe to carts/update, orders/create, orders/fulfilled, and products/update at minimum.
Step 2. Embed the product catalog into pgvector. Trigger on every products/update webhook. We use text-embedding-3-large (3072 dims) with cosine similarity; one row per SKU with title, description, tags, and option values concatenated.
Step 3. Build the message handler. Embed the user query, retrieve top-5 products from pgvector, call Claude Sonnet 4.6 with tool definitions for getOrderStatus, searchProducts, and addToCart, parse tool calls, return the response with cited SKU IDs.
Step 4. Build the WISMO subsystem. Intent-classify with Haiku 4. Look up the order via Shopify Order API. Resolve the carrier. Call Aftership for tracking. Render an ETA card. Escalate to human if the delay exceeds your SLA.
Step 5. Build the abandoned-cart pipeline on Inngest. The carts/update webhook fires an Inngest function with a 45-minute scheduled delay. Enrich the cart with product context. Generate a personalized recovery message with Claude Sonnet 4.6. Send through Klaviyo, Postscript, or WhatsApp Business depending on customer channel preference.
Step 6. Run the 500-conversation eval and tune the confidence gate. Sample real conversations from your live store, label by intent bucket, score with Ragas + a custom intent-classification judge, and re-run on every prompt change. Set the confidence gate at 0.72 to start; tune from there based on CSAT vs handoff-rate tradeoff.
Shopify Admin API integration depth
The competitor SERP names Shopify five times across the top ten results and ships zero code against it. Below is the auth and webhook subscription pattern we use on every Shopify build, in three flavours. The TypeScript variant is what we ship by default; the Python and GraphQL versions exist because some teams run their orchestrator outside Node.
Three things to flag before you copy the snippet. First, the scope list (read_products, read_orders, read_customers, write_draft_orders) is the minimum surface for a working ecommerce chatbot. write_draft_orders lets the bot create a checkout link mid-conversation. If you also want to apply discount codes programmatically, add write_discounts. Stay off write_orders unless you have a strong reason; the audit trail for direct order mutation is not worth the few use cases that need it.
Second, the webhook topic list. CARTS_UPDATE drives the abandoned-cart pipeline. ORDERS_CREATE and ORDERS_FULFILLED keep the WISMO subsystem fresh. PRODUCTS_UPDATE keeps the pgvector index in sync. If you support refunds, add REFUNDS_CREATE so the bot does not promise a refund that has already processed. Every webhook hits an idempotent handler keyed on the resource ID; Shopify retries on 5xx and occasionally fires duplicates even on 200, so idempotency is not optional.
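A minimal sketch of the idempotent handler, under stated assumptions: `IdempotencyGuard` and `handleWebhook` are hypothetical names, and the store is an in-memory map for a single process. A production build would back it with Redis or a unique-key table so the guard survives restarts and works across instances.

```typescript
// In-memory idempotency guard with a TTL per key.
class IdempotencyGuard {
  private readonly seen = new Map<string, number>(); // key -> expiry (ms epoch)
  private readonly ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // True the first time a key is seen inside the TTL, false on repeats.
  firstSeen(key: string, now: number = Date.now()): boolean {
    const expiry = this.seen.get(key);
    if (expiry !== undefined && expiry > now) return false;
    this.seen.set(key, now + this.ttlMs);
    return true;
  }
}

// Hypothetical handler wrapper: `topic` and `resourceId` are read from the
// HMAC-verified payload; `work` is whatever the webhook should trigger.
function handleWebhook(
  guard: IdempotencyGuard,
  topic: string,
  resourceId: string,
  work: () => void,
): 'processed' | 'duplicate' {
  if (!guard.firstSeen(`${topic}:${resourceId}`)) return 'duplicate';
  work();
  return 'processed';
}
```

Keying on topic plus resource ID suits the cart-recovery path, where one scheduled message per cart is the goal. For topics where every update matters, the X-Shopify-Webhook-Id delivery header is one candidate for the key instead.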
Third, the HMAC verification. Every inbound webhook carries an X-Shopify-Hmac-Sha256 header. Compute the HMAC of the raw request body using your app secret and compare with a timing-safe equal. Reject any request that fails verification. Log the source IP and rate-limit repeat offenders. Skipping HMAC verification is the single most common Shopify integration vulnerability we see on audits.
```typescript
// Offline access token exchange + HMAC-verified webhook subscription.
// Runs once during app install; the access token persists.
import { shopifyApi, ApiVersion } from '@shopify/shopify-api';
import crypto from 'node:crypto';

const shopify = shopifyApi({
  apiKey: process.env.SHOPIFY_API_KEY!,
  apiSecretKey: process.env.SHOPIFY_API_SECRET!,
  scopes: ['read_products', 'read_orders', 'read_customers', 'write_draft_orders'],
  hostName: process.env.HOST!,
  apiVersion: ApiVersion.January26,
  isEmbeddedApp: false,
});

// Subscribe to the webhooks we need for WISMO + cart recovery + catalog sync.
const TOPICS = ['CARTS_UPDATE', 'ORDERS_CREATE', 'ORDERS_FULFILLED', 'PRODUCTS_UPDATE'];

export async function subscribeWebhooks(shop: string, accessToken: string) {
  const client = new shopify.clients.Graphql({ session: { shop, accessToken } as any });
  for (const topic of TOPICS) {
    // callbackUrl is a URL scalar in the Admin schema, so the variable is typed URL!.
    await client.request(`mutation Sub($topic: WebhookSubscriptionTopic!, $endpoint: URL!) {
      webhookSubscriptionCreate(topic: $topic, webhookSubscription: { callbackUrl: $endpoint, format: JSON }) {
        userErrors { field message }
      }
    }`, { variables: { topic, endpoint: `https://${process.env.HOST}/webhooks/${topic.toLowerCase()}` } });
  }
}

// HMAC verification on every incoming webhook. Reject if it does not match.
export function verifyHmac(rawBody: Buffer, headerHmac: string) {
  const digest = crypto.createHmac('sha256', process.env.SHOPIFY_API_SECRET!).update(rawBody).digest('base64');
  const expected = Buffer.from(digest);
  const received = Buffer.from(headerHmac);
  // timingSafeEqual throws when the lengths differ, so compare lengths first.
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}
```
```python
# Same flow, Python. We use this when the orchestrator lives in FastAPI.
import hmac, hashlib, base64, os, httpx

SHOP_SECRET = os.environ['SHOPIFY_API_SECRET']
TOPICS = ['CARTS_UPDATE', 'ORDERS_CREATE', 'ORDERS_FULFILLED', 'PRODUCTS_UPDATE']

async def subscribe_webhooks(shop: str, access_token: str, host: str):
    url = f'https://{shop}/admin/api/2026-01/graphql.json'
    headers = {'X-Shopify-Access-Token': access_token, 'Content-Type': 'application/json'}
    async with httpx.AsyncClient() as c:
        for topic in TOPICS:
            resp = await c.post(url, headers=headers, json={
                # callbackUrl is a URL scalar in the Admin schema, hence URL!.
                'query': 'mutation Sub($t: WebhookSubscriptionTopic!, $e: URL!) { webhookSubscriptionCreate(topic: $t, webhookSubscription: { callbackUrl: $e, format: JSON }) { userErrors { field message } } }',
                'variables': {'t': topic, 'e': f'https://{host}/webhooks/{topic.lower()}'},
            })
            resp.raise_for_status()  # surface HTTP-level failures instead of ignoring them

def verify_hmac(raw_body: bytes, header_hmac: str) -> bool:
    digest = base64.b64encode(hmac.new(SHOP_SECRET.encode(), raw_body, hashlib.sha256).digest()).decode()
    return hmac.compare_digest(digest, header_hmac)
```
```graphql
# Raw subscription mutation; useful for testing in Shopify GraphiQL.
# callbackUrl is a URL scalar, so the variable is declared as URL!.
mutation SubscribeCartsUpdate($endpoint: URL!) {
  webhookSubscriptionCreate(
    topic: CARTS_UPDATE
    webhookSubscription: { callbackUrl: $endpoint, format: JSON }
  ) {
    webhookSubscription { id topic callbackUrl }
    userErrors { field message }
  }
}
```
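```graphql
# productRecommendations query for grounded upsell suggestions.
# Note: productRecommendations is a Storefront API query, so it runs against
# the Storefront endpoint, where the price field is priceRange (not priceRangeV2).
query Recs($productId: ID!) {
  productRecommendations(productId: $productId, intent: RELATED) {
    id
    title
    handle
    priceRange { minVariantPrice { amount currencyCode } }
  }
}
```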
Building the WISMO + abandoned-cart subsystem
WISMO is the highest-volume ticket category in ecommerce. The flow is short and the failure mode is specific: stale tracking data, or a carrier delay your SLA promised would not happen. The escalation step is the part most builds skip.
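The escalation step can be sketched as a small decision function. Everything here is illustrative: `TrackingSnapshot`, `WismoAction`, and `decideWismo` are hypothetical names rather than Aftership or Shopify types, and the 72-hour staleness cutoff is an assumed default to tune per carrier.

```typescript
// Decide between rendering an ETA card and escalating to a human.
interface TrackingSnapshot {
  estimatedDelivery: Date | null; // null when the carrier has reported no ETA
  lastCheckpointAt: Date | null;  // null when no scan has been recorded
}

type WismoAction =
  | { kind: 'eta_card'; eta: Date }
  | { kind: 'escalate'; reason: 'no_tracking' | 'stale_tracking' | 'sla_breach' };

function decideWismo(
  snapshot: TrackingSnapshot,
  promisedBy: Date,
  currentTime: Date,
  staleAfterHours: number = 72, // assumed cutoff, not a value from this guide
): WismoAction {
  if (snapshot.estimatedDelivery === null || snapshot.lastCheckpointAt === null) {
    return { kind: 'escalate', reason: 'no_tracking' };
  }
  const hoursSinceScan = (currentTime.getTime() - snapshot.lastCheckpointAt.getTime()) / 3_600_000;
  if (hoursSinceScan > staleAfterHours) {
    return { kind: 'escalate', reason: 'stale_tracking' };
  }
  if (snapshot.estimatedDelivery.getTime() > promisedBy.getTime()) {
    return { kind: 'escalate', reason: 'sla_breach' };
  }
  return { kind: 'eta_card', eta: snapshot.estimatedDelivery };
}
```

Both failure modes named above get their own reason code, so the handoff ticket tells the agent why the bot stepped aside.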
Abandoned-cart recovery is its own pipeline. Shopify fires carts/update. Inngest holds for 45 minutes (we also tested 15, 30, 60, and 90; 45 was the sweet spot). The cart is enriched with full product context, Claude generates the message, and the channel adapter sends through Klaviyo email, Postscript SMS, or WhatsApp Business depending on the customer's prior channel. The recovery rate jumped from 7.8% on our prior static-template Klaviyo flow to 14.2% on Claude-generated personalized messages, same audience and same send window, measured 2026-Q1 across 12,400 abandoned carts.
A note on the 45-minute delay. The shape of cart abandonment is bimodal: a fast bucket of customers who get distracted by something off the page (kid, doorbell, tab switch) and a slow bucket who left to comparison-shop. Recovery messages sent too early hit the fast bucket while they are still on the site; too late and the slow bucket has already bought elsewhere. We A/B tested five windows (15, 30, 45, 60, 90 minutes) over six weeks on a single store. The 45-minute window won on recovered revenue per send, with 60 a close second. Your store may differ; the test is cheap to re-run.
On the message content, the win is being specific. Claude generates a two-sentence message that names the abandoned product, references one detail from the product description, and ends with a checkout link tagged with UTM parameters. No emoji. No exclamation marks. We tested both and the conservative tone outperformed every variant with extra punctuation. The prompt is in the recover_cart.py snippet below. The audit log shows every generated message, so a CX lead can spot-check the output without reading raw model traces.
Model-per-task routing + confidence gate
Default to Claude Sonnet 4.6 for reasoning and tool use. Drop to Haiku 4 for FAQ and intent classification (it is roughly 12x cheaper and fast enough for the classifier path). Route to GPT-4o for image product search where vision is needed. Sonnet 4.6 also handles refund decisions because the cost of a wrong call is high enough to justify the better model. We go deeper on this routing pattern in our Claude tool-use patterns writeup for service teams.
| Task class | Model | Latency budget p95 | Cost / 1k turns | Confidence gate |
|---|---|---|---|---|
| FAQ + policy | Haiku 4 | 800ms | $0.42 | ≥ 0.78 to auto-reply |
| Product Q+A (RAG) | Sonnet 4.6 | 2.5s | $5.10 | ≥ 0.72 retrieval score |
| Image product search | GPT-4o vision | 3.5s | $8.40 | ≥ 0.70 visual match |
| Refund / return decision | Sonnet 4.6 + tool use | 3.0s | $5.10 | ≥ 0.82 or human |
| Escalation | Human via Gorgias | live | n/a | any gate failure |
The confidence gate is the lever that decides the deflection-rate vs CSAT tradeoff. Set it loose and deflection looks great in the dashboard while CSAT tanks. Set it tight and you escalate every other turn. Our default opening threshold is 0.72 on the RAG retrieval score; we move it down on FAQ buckets where the cost of a wrong answer is small, and up on refund decisions where it is not.
Two signals feed the gate, not one. The first is the top-1 retrieval score from pgvector (cosine similarity to the query embedding). The second is the model's own stop_reason on the response. If Claude returns stop_reason=tool_use we trust the answer; if it returns max_tokens or refusal we treat that as low confidence regardless of the retrieval score. Combining both signals catches the edge case where retrieval looks fine but the model still hedges, which is the case that produces "helpful but wrong" answers customers remember.
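The two-signal gate can be sketched as follows. The names `confidenceGate` and `GateDecision` are hypothetical. The text above only specifies the tool_use, max_tokens, and refusal cases, so falling back to the retrieval score for any other stop reason (such as end_turn) is an assumption of this sketch.

```typescript
// Combine the retrieval score with the model's stop_reason.
type StopReason = 'end_turn' | 'tool_use' | 'max_tokens' | 'stop_sequence' | 'refusal';

interface GateDecision {
  answer: boolean; // true = send the reply, false = hand off to a human
  reason: string;
}

function confidenceGate(topScore: number, floor: number, stopReason: StopReason): GateDecision {
  // Truncated or refused responses are low confidence regardless of retrieval.
  if (stopReason === 'max_tokens' || stopReason === 'refusal') {
    return { answer: false, reason: `stop_reason=${stopReason}` };
  }
  // A tool call is trusted as-is.
  if (stopReason === 'tool_use') {
    return { answer: true, reason: 'tool_use' };
  }
  // Assumption: every other stop reason defers to the retrieval score.
  return topScore >= floor
    ? { answer: true, reason: 'retrieval_ok' }
    : { answer: false, reason: 'low_retrieval_score' };
}
```

Returning a reason string alongside the boolean gives the eval judge something concrete to grade when it labels an escalation as correct or avoidable.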
Tuning happens off the eval set, not in production. We re-run the 500-conversation golden set every time we change the gate threshold and look at deflection rate and CSAT proxy together. The proxy is a separate Claude judge that scores each escalation reason as "correct escalation" or "should have answered"; we hand-label a 50-conversation subset to validate the judge once a quarter. Without that loop the gate becomes a knob someone turns based on the last support escalation that crossed their desk, which is how you get systems that drift on confidence over the course of a quarter.
Full build: product search + Claude response with grounded citations
The handler below is what runs on every inbound message. Roughly 80 lines of TypeScript. It embeds the user query, retrieves top-5 products from pgvector, calls Claude with tool definitions for order lookup and add-to-cart, and returns a response payload with cited SKU IDs the front-end renders as product cards.
```typescript
// Full request handler. Embed query → retrieve from pgvector → call Claude
// with tool defs → return response payload with grounded SKU citations.
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';
import { Pool } from 'pg';
import { getOrderById, searchProducts, addToCart } from './shopify-tools';

const anthro = new Anthropic();
const oai = new OpenAI();
const pg = new Pool({ connectionString: process.env.DATABASE_URL });

// Typed as Anthropic.Tool[] so input_schema.type keeps its 'object' literal type.
const tools: Anthropic.Tool[] = [
  { name: 'getOrderStatus', description: 'Look up order status by order number.',
    input_schema: { type: 'object', properties: { orderNumber: { type: 'string' } }, required: ['orderNumber'] } },
  { name: 'searchProducts', description: 'Search the catalog by free-form query.',
    input_schema: { type: 'object', properties: { query: { type: 'string' } }, required: ['query'] } },
  { name: 'addToCart', description: 'Add a variant to the customer cart.',
    input_schema: { type: 'object', properties: { variantId: { type: 'string' }, qty: { type: 'integer' } }, required: ['variantId', 'qty'] } },
];

export async function handleMessage(sessionId: string, userMessage: string) {
  // 1. Embed the query (text-embedding-3-large, 3072 dims).
  const emb = await oai.embeddings.create({ model: 'text-embedding-3-large', input: userMessage });
  const vec = emb.data[0].embedding;

  // 2. Retrieve top-5 products from pgvector by cosine similarity.
  const { rows } = await pg.query(
    `SELECT sku, title, handle, price, description,
            1 - (embedding <=> $1::vector) AS score
     FROM product_embeddings
     ORDER BY embedding <=> $1::vector LIMIT 5`,
    [JSON.stringify(vec)]
  );
  const topScore = rows[0]?.score ?? 0;
  // Guard against NULL descriptions so one sparse row cannot crash the handler.
  const retrievedContext = rows
    .map(r => `[SKU:${r.sku}] ${r.title} - ${(r.description ?? '').slice(0, 240)} - $${r.price}`)
    .join('\n');

  // 3. Confidence gate. If retrieval is weak, route to human.
  if (topScore < 0.72 && !looksLikeOrderLookup(userMessage)) {
    return { type: 'handoff', reason: 'low_confidence', topScore };
  }

  // 4. Call Claude Sonnet 4.6 with tool defs + grounded context.
  const resp = await anthro.messages.create({
    model: 'claude-sonnet-4-6-20260115',
    max_tokens: 1024,
    tools,
    system: `You are a shopping assistant for the store. ONLY recommend products from the retrieved set below; cite each as [SKU:xxxx]. If the answer is not in the retrieved set, say so and offer to escalate.\n\nRetrieved products:\n${retrievedContext}`,
    messages: [{ role: 'user', content: userMessage }],
  });

  // 5. Resolve tool calls (if any), then assemble the response payload.
  // Tool results go back to the caller; a follow-up model turn is not made here.
  const toolCalls = resp.content.filter(b => b.type === 'tool_use');
  const toolResults: any[] = [];
  for (const tc of toolCalls) {
    const tool = tc as Anthropic.ToolUseBlock;
    if (tool.name === 'getOrderStatus') toolResults.push(await getOrderById((tool.input as any).orderNumber));
    if (tool.name === 'searchProducts') toolResults.push(await searchProducts((tool.input as any).query));
    if (tool.name === 'addToCart') toolResults.push(await addToCart(sessionId, (tool.input as any).variantId, (tool.input as any).qty));
  }

  // Only SKUs the model actually cited are rendered as product cards.
  const text = resp.content.find(b => b.type === 'text') as Anthropic.TextBlock | undefined;
  const citedSkus = [...(text?.text || '').matchAll(/\[SKU:([^\]]+)\]/g)].map(m => m[1]);
  return {
    type: 'reply',
    text: text?.text || '',
    citedSkus,
    products: rows.filter(r => citedSkus.includes(r.sku)),
    toolResults,
    stopReason: resp.stop_reason,
  };
}

function looksLikeOrderLookup(msg: string) {
  return /\border\s*#?\s*\d+|tracking|where\s+is\s+my/i.test(msg);
}
```
Two patterns to flag in that handler. The retrieved context is interpolated into the system prompt, not the user turn, so the model treats it as ground truth and not user input. The cited SKU regex parses Claude's response and the front-end renders only those SKUs as product cards (any other product mention is dropped). That second pattern is what kills hallucinated SKUs cold.
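Two more things worth flagging. The query embedding uses text-embedding-3-large at 3072 dimensions because the smaller models trade accuracy for cost in ways that hurt on long-tail product queries ("a soft blanket for a newborn that is not too warm"). The cost delta is small at conversation volume. The pgvector query uses the cosine distance operator (<=>) with an ivfflat or hnsw index; pick hnsw if your catalog is over 50k SKUs and you can afford the build time. One caveat: pgvector's ivfflat and hnsw indexes on the vector type are limited to 2,000 dimensions, so a 3072-dimension column needs a halfvec index (up to 4,000 dimensions) or embeddings shortened with the API's dimensions parameter.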
The confidence gate inside the handler short-circuits before the model call when retrieval is weak and the query does not look like an order lookup. Order lookups are intentionally exempt because they bypass RAG entirely and call the Shopify Order API as a tool. That asymmetry matters: WISMO traffic should not be gated by product retrieval, and product traffic should not be gated by an irrelevant order check.
The companion snippet below is the abandoned-cart message generator. It runs inside an Inngest function on a 45-minute scheduled delay. The system prompt is intentionally short, names the shop, forbids exclamation marks and emoji, and pins the closing phrase. The whole function is roughly 25 lines and produces deterministic message shape with personalized content. We log the generated body and the cart context to Helicone for the CX team to audit.
```python
# Inngest function. Reads cart payload, calls Claude with a recovery prompt,
# returns a 2-sentence message + product image + recovery link with UTM tags.
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    'You are a recovery copywriter for {shop_name}. '
    'Write a 2-sentence message that references one specific product in the cart. '
    'No exclamation marks. No emoji. Tone: concise, warm, never pushy. '
    'End with: "Pick up where you left off →"'
)
UTM = 'utm_source=chatbot&utm_medium=recovery&utm_campaign=cart-abandon'

def recover(cart: dict, shop_name: str) -> dict:
    if not cart.get('items'):
        raise ValueError('cart has no items to reference')
    top = cart['items'][0]
    prompt = f"Customer abandoned cart with: {top['title']} (${top['price']}). Total: ${cart['total']}."
    msg = client.messages.create(
        model='claude-sonnet-4-6-20260115',
        max_tokens=240,
        system=SYSTEM.format(shop_name=shop_name),
        messages=[{'role': 'user', 'content': prompt}],
    )
    body = msg.content[0].text
    # checkout_url may already carry a query string, so pick the right separator.
    sep = '&' if '?' in cart['checkout_url'] else '?'
    return {
        'body': body,
        'image_url': top['image_url'],
        'recovery_link': f"{cart['checkout_url']}{sep}{UTM}",
    }
```
WhatsApp, voice, and cross-channel expansion
Same Shopify backend, three channel adapters. The retrieval and model layers do not change. What changes is the surface contract: WhatsApp needs template approval for outbound messages and respects a 24-hour session window; voice needs an 800ms barge-in latency budget and a different STT-handling layer; the web widget is the most flexible but the most exposed to theme stylesheet bleed. Our WhatsApp AI chatbot build guide covers the WABA template approval workflow in depth.
Voice is the surface most teams misjudge. The latency budget is brutal: anything over 800ms of barge-in delay feels broken to a caller, and STT errors on order numbers (the model hears "one zero four seven" as "ten forty seven") cascade into wrong lookups. We handle order numbers as a structured slot, prompting the customer to spell digits if the first lookup fails, and we cache the spelled version against the caller-ID so the same caller does not re-spell on the next call. Deflection on voice runs lower (around 31% in our 2026-Q1 numbers) because callers self-select for the messier intent buckets, but the cost-per-resolved-call still beats sending the same volume to a contact center.
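A minimal sketch of the spelled-digit slot, assuming English digit words and a transcript that contains only the order number: `parseSpelledOrderNumber` is a hypothetical helper, and returning null is the signal to re-prompt the caller rather than guess.

```typescript
// Map a spelled-out order number ("one zero four seven") to digits.
const DIGIT_WORDS: Record<string, string> = {
  zero: '0', oh: '0', one: '1', two: '2', three: '3', four: '4',
  five: '5', six: '6', seven: '7', eight: '8', nine: '9',
};

// Returns the digit string, or null if any token is not a single digit,
// which deliberately rejects compound readings such as "ten forty seven".
function parseSpelledOrderNumber(transcript: string): string | null {
  const tokens = transcript.toLowerCase().split(/[\s,.-]+/).filter((t) => t.length > 0);
  if (tokens.length === 0) return null;
  let digits = '';
  for (const token of tokens) {
    const digit: string | undefined = /^\d$/.test(token) ? token : DIGIT_WORDS[token];
    if (digit === undefined) return null;
    digits += digit;
  }
  return digits;
}
```

Rejecting anything that is not a single digit keeps an ambiguous transcript from turning into a confident lookup against the wrong order.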
Channel pick order on a typical Shopify build: ship the web widget first because the customer is already authenticated through the storefront and the surface gives you the most control over rendering. Add WhatsApp second because it covers post-purchase notifications and abandoned-cart recovery on a channel with a 20-30x higher open rate than email. Add voice last, only if your support volume justifies it; the team cost of an STT pipeline is non-trivial and the deflection rate is the lowest of the three.
Eval methodology — a 500-conversation ecommerce golden set
Vendor self-reported numbers are not a benchmark. We sampled 500 real conversations from a live mid-market Shopify store (anonymized, customer-consented), split them into six intent buckets, and built a golden set with labelled correct answers per turn. Scoring runs in Ragas for recall@5, faithfulness, and answer relevance, plus a custom intent-classification judge that grades the Haiku 4 classifier against the labelled bucket. The whole eval re-runs on every prompt change. If a metric drops below the gate, deploy blocks until someone investigates.
Intent split on the golden set: WISMO 38%, product Q+A 22%, sizing or fit 14%, refund or return 11%, promo or discount 9%, other 6%. The 2026-Q1 pass numbers we hold to: recall@5 ≥ 0.85 against the ground-truth product set, intent accuracy ≥ 0.90 vs labelled bucket, grounding faithfulness ≥ 0.88 on Ragas, deflection rate ≥ 0.55 on the full session set. We currently sit at 0.86, 0.93, 0.91, and 0.58 respectively.
Building the golden set is the part most teams skip and the part that earns the eval its credibility. Start with two weeks of raw inbound from your support inbox or Gorgias tickets. Strip PII (we use Presidio for the first pass and a human reviewer for the second). Label each conversation with the dominant intent and the correct turn-by-turn response according to your CX policy. Aim for 500 conversations, no fewer; below that the bucket sizes get too small to draw stable conclusions.
Run the eval on every prompt change. Every model upgrade. Every retrieval index rebuild. Tie it to your CI pipeline so a regression on any pass-gate metric blocks the deploy. The cost of the eval run sits under $14 of Claude API spend on the full 500-conversation set at 2026-Q1 prices; we pay it every time. The savings from catching a regression before it ships to customers dwarf the API spend by an order of magnitude.
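The deploy gate can be sketched as a comparison of one eval run against the pass thresholds quoted above. `EvalMetrics`, `PASS_GATES`, and `failedGates` are hypothetical names; how the metrics get out of Ragas and into this shape is left to the pipeline.

```typescript
// Compare an eval run against the 2026-Q1 pass gates.
interface EvalMetrics {
  recallAt5: number;
  intentAccuracy: number;
  faithfulness: number;
  deflection: number;
}

const PASS_GATES: EvalMetrics = {
  recallAt5: 0.85,
  intentAccuracy: 0.90,
  faithfulness: 0.88,
  deflection: 0.55,
};

// Returns the names of every metric that fell below its gate.
function failedGates(run: EvalMetrics, gates: EvalMetrics = PASS_GATES): string[] {
  return (Object.keys(gates) as (keyof EvalMetrics)[]).filter((key) => run[key] < gates[key]);
}
```

A CI step would fail the build whenever `failedGates` returns a non-empty list and print the metric names, so the person investigating knows where to start.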
Dated benchmarks: deflection, conversion lift, recovered cart revenue (2026-Q1)
These are the numbers worth quoting in your board deck. Pre-bot baseline deflection is 0% by definition. The vendor-listicle median claim sits around 30% (we've audited a dozen of these and the methodology rarely survives scrutiny). Our 2026-Q1 Shopify deployment hits 58% across 18,200 sessions. The agentic-only ceiling, where every reply is generated and no human ever touches the queue, runs around 72% before CSAT collapses.
Cart recovery: 14.2% on Claude-generated personalized messages against 12,400 abandoned carts in 2026-Q1, vs 7.8% from the prior Klaviyo static-template flow on the same audience and the same send window. AOV lift on sessions where the bot surfaced at least one RAG-retrieved product card: 11.8% over sessions where it did not, n=8,100. API spend: $0.014 average per 6-turn conversation on the Sonnet 4.6 plus Haiku 4 blend; pgvector and Inngest add roughly $0.002 of infra cost. Multiply by your monthly conversation volume and you have a real number to plan against.
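The planning arithmetic can be sketched as below. The per-conversation costs and recovery rates are the 2026-Q1 figures above; the 20,000 conversations per month is a made-up volume for illustration, and the function names are hypothetical.

```typescript
// Per-conversation costs from the 2026-Q1 numbers above (USD).
const API_COST_PER_CONVERSATION = 0.014;
const INFRA_COST_PER_CONVERSATION = 0.002;

// Monthly run cost for a given conversation volume.
function monthlyRunCost(conversationsPerMonth: number): number {
  return conversationsPerMonth * (API_COST_PER_CONVERSATION + INFRA_COST_PER_CONVERSATION);
}

// Extra carts recovered when the recovery rate moves from baseline to the new rate.
function incrementalRecoveredCarts(
  abandonedCarts: number,
  baselineRate: number,
  newRate: number,
): number {
  return Math.round(abandonedCarts * (newRate - baselineRate));
}

// Hypothetical store volume: 20,000 conversations per month.
console.log(`Monthly run cost: $${monthlyRunCost(20000).toFixed(2)}`);

// 12,400 abandoned carts over the quarter, 7.8% static template vs 14.2% personalized.
console.log(`Extra carts recovered: ${incrementalRecoveredCarts(12400, 0.078, 0.142)}`);
```

Multiply the extra recovered carts by your own average cart value to turn the second figure into revenue; we leave that out because it varies too much by store.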
A note on benchmark honesty. The 58% deflection number is measured on a mid-market Shopify store with a curated catalog (under 5,000 SKUs), a healthy policy doc, and a CX team that signed off on the confidence-gate threshold. Stores with sparser catalogs or messier policy docs run lower in the first quarter and climb as RAG context fills out. The 14.2% recovery number assumes you have Klaviyo plus a working SMS surface; stores running email-only see a lower ceiling, around 9-11%. The 11.8% AOV lift assumes grounded recommendations actually surface; stores where the bot rarely cites a product see closer to flat AOV.
If you want to plan against your own numbers before shipping, instrument three things before the build starts: inbound contact volume by week, split by intent bucket (use Gorgias tags or a manual sample); the current cart-abandonment recovery rate from your Klaviyo flow; and AOV by channel and by sessions-with-support-contact vs sessions-without. Those three series let you size the expected return for each of the five use cases and pick the order to ship them in.
Shopify chatbot vs ecommerce chatbot platform vs custom build
Three real shapes. Shopify-native (Shopify Inbox plus the Magic AI features) wins the first 30 days because it ships with the store and needs zero engineering. Platform tools (Tidio, Gorgias AI, Rep AI) win the 30-to-365-day window because they cover more channels and add some custom flow building. Custom on Claude plus LangGraph wins past 12 months once your training data and eval moat compound. Our take on the broader platform category lives in what is a conversational AI platform, and our scoreboard of named tools sits in best AI chatbots.
The honest case for each. Shopify-native wins when the use case is FAQ and order tracking, the catalog is small enough to live in product titles, and the team has zero capacity to own a custom build. The ceiling is real: no RAG, no custom tool use, no cross-channel parity. We have seen stores happily sit on Shopify Inbox for years and have seen others outgrow it in three months.
Platform tools (Tidio, Gorgias AI, Rep AI) cover the middle. They ship faster than a custom build, hit more channels than Shopify-native, and add some custom flow building on top. The tradeoff: your training data and tuning history live in their stack. Switching costs grow over time. Vendor pricing changes ripple straight into your unit economics. If you are buying for a 9-to-15 month window, this category is usually the right answer. If you are buying for a multi-year roadmap, do the math on switching cost before you sign.
Custom on Claude plus LangGraph wins past 12 months once the eval set, the prompt history, and the audit logs compound into a moat the vendor cannot replicate. The cost is owning the model swaps, the eval gate, and the observability stack. If your team has shipped at least one production LLM system before, this path pays back. If not, hire a consultant or a delivery team that has, and run an audit before committing to the architecture.
| Dimension | Shopify-native (Inbox + Magic) | Platform (Tidio / Gorgias AI / Rep AI) | Custom (Claude + LangGraph) |
|---|---|---|---|
| Time to launch | 1 day | 1-3 weeks | 4-8 weeks |
| Customization ceiling | Low: scripted flows only | Medium: custom flows, limited tool use | High: any tool, any model, any channel |
| RAG over your own catalog | No (Shopify search only) | Limited (vendor index, not yours) | Yes (pgvector, your embeddings) |
| Training-data ownership | Shopify | Vendor (data locks you in) | You |
| Recurring cost shape | Bundled with Shopify plan | Per-seat or per-resolution | API spend + infra (variable, see above) |
| Where it fails | Ceiling at 30 days; no RAG; no custom tool use | Training data locks you in; vendor pricing changes break unit econ | Eng team needs to own model swaps + eval gates |
Production gotchas from real Shopify deployments
Where this post will live and who we wrote it for
This is the post we wrote for the Shopify engineer who has read the vendor listicles, audited a couple of platform demos, and is now trying to scope a real build. Distribution mirrors that. The TypeScript handler and the Shopify auth code go to r/shopifyDev as the seed channel, where engineers actually evaluate whether the snippet would survive their store. The eval methodology section stands alone as a Hacker News front-page candidate (operator-built golden set, 500 conversations, Ragas, named models, reproducible numbers). We mirror the whole thing to dev.to with a canonical tag back here for the long-tail search surface that does not click out from HN.
FAQ
Shopify chatbot vs ecommerce chatbot platform vs custom build: which should I pick?
Decide by time horizon and customization ceiling. Shopify Inbox plus Magic wins in the first 30 days because it ships with the store. Platform tools (Tidio, Gorgias AI, Rep AI) win the 30-to-365-day window. Custom on Claude plus LangGraph wins past 12 months once your training data and eval moat compound.
What is a realistic deflection rate for an ecommerce chatbot in 2026?
35-65% depending on catalog size, intent mix, and confidence-gate tuning. We measure 58% on a mid-market Shopify store in 2026-Q1 across 18,200 sessions. Anything claiming more than 70% is conflating deflection with auto-reply.
Do I need RAG for a Shopify chatbot, or is the Shopify Admin API enough?
You need both. The Admin API covers order lookup, cart manipulation, and product fetch by ID. RAG covers semantic product Q+A, policy questions, and grounded citations. RAG over your catalog plus your policy docs is non-negotiable past 100 SKUs.
How do I evaluate an ecommerce chatbot before going live?
Build a 500-conversation golden set sampled from real customer-service tickets, label it by intent bucket, and score it with Ragas (recall@5, faithfulness, answer relevance) plus a custom intent-classification judge. Re-run it weekly and on every prompt change. Do not trust vendor self-reported numbers.
What does an ecommerce chatbot cost to run in 2026?
Per-conversation API cost depends on model routing. Our blended cost on Claude Sonnet 4.6 + Haiku 4 routing: $0.014 per 6-turn conversation in 2026-Q1. Infrastructure (pgvector, Inngest, observability) adds roughly $0.002. Multiply by your monthly conversation volume to size API spend.
How do WhatsApp and voice fit into a Shopify chatbot architecture?
Same backend, different channel surface. WhatsApp Business API needs template message approval for outbound (abandoned cart, order updates) and respects a 24-hour session window. Voice via Vapi or Twilio adds an 800ms barge-in latency budget. The RAG retrieval and Claude response layer is identical; only the channel adapter changes.
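As a rough sketch of what lives in the channel adapter rather than the backend, the WhatsApp window check and the voice latency budget could be expressed like this. The function and constant names are hypothetical; the 24-hour window and the 800ms budget are the figures from the answer above.

```typescript
// WhatsApp allows free-form replies only within 24 hours of the customer's last message.
const WHATSAPP_SESSION_WINDOW_MS = 24 * 60 * 60 * 1000;

// Voice barge-in latency budget from the answer above.
const VOICE_BARGE_IN_BUDGET_MS = 800;

type WhatsAppSendMode = "free_form" | "approved_template";

// Decide whether an outbound WhatsApp message must use a pre-approved template.
function whatsappSendMode(lastInboundAt: Date | null, now: Date): WhatsAppSendMode {
  // No inbound message yet (e.g. a first abandoned-cart nudge): template required.
  if (lastInboundAt === null) return "approved_template";
  const elapsedMs = now.getTime() - lastInboundAt.getTime();
  return elapsedMs < WHATSAPP_SESSION_WINDOW_MS ? "free_form" : "approved_template";
}

// True when a voice reply was produced inside the barge-in budget.
function withinVoiceBudget(responseLatencyMs: number): boolean {
  return responseLatencyMs <= VOICE_BARGE_IN_BUDGET_MS;
}

const now = new Date("2026-01-15T12:00:00Z");
console.log(whatsappSendMode(new Date("2026-01-15T09:00:00Z"), now)); // inside the window
console.log(whatsappSendMode(new Date("2026-01-13T09:00:00Z"), now)); // window expired
console.log(withinVoiceBudget(650));
```

Neither check touches retrieval or generation, which is the point: the adapter decides how a reply is allowed to go out, not what the reply says.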