What is a Conversational AI Platform? An Engineer's Architecture Guide for 2026

We break down conversational AI platform architecture — what the pieces actually are, what they cost, and how to evaluate one against your stack.


A conversational AI platform is the runtime that turns a raw LLM into a product users can talk to across web chat, WhatsApp, SMS, voice, and email without the team rebuilding plumbing for every channel. We have shipped six of these in production over the last fourteen months on Claude Opus 4, GPT-4o, and a few self-hosted Llama 4 deployments. The platform is the layer that connects the model to memory, retrieval, tool use, channel adapters, eval, and human handoff. It is not the model and it is not the chat widget. It is everything in between.

This conversational ai platform guide walks through what the layer actually does, how we architect it, which vendors do which job well, and where teams typically break the build. We will name tools by version. We will quote real public benchmarks. We will tell you what fails at scale because we have watched it fail at scale.

What is a conversational ai platform, precisely

A conversational ai platform is a multi-layer runtime that accepts a user turn from any channel, classifies intent, retrieves grounding context, calls one or more LLMs, executes tools, persists session and long-term memory, and ships a response back through the channel adapter while logging every step for eval. The model is one component. So is the vector database. So is the channel router. Strip any one out and you do not have a platform — you have a demo.

[Figure: Conversational AI platform reference architecture. Channels (web chat, WhatsApp, voice, Slack / Teams) feed the intent + dialog core (intent classifier, dialog manager), which drives fulfilment (RAG retrieval, response generation, tool calls, HITL gate) and connects to external systems (CRM, ticketing, knowledge base, auth / IDP). An observability + eval band (Langfuse, LangSmith, Helicone, Datadog, eval harness, regression, rollback) runs across all layers.]
Figure 1: A production conversational AI platform spans four columns (channels, intent + dialog core, fulfilment, integrations) plus an observability layer that instruments every component.

In our delivery shorthand we say the platform owns six concerns. Ingress and understanding. Grounding and reasoning. Action and observability. Each one is a contract. Each contract has at least two implementation choices. The platform is the opinionated wiring of those six contracts into one runtime.

The terminology drifted hard between 2022 and 2026. The 2022 vendor catalogue called any flowchart-driven bot a platform. The 2024 wave of LLM-native runtimes earned the same label. The 2026 reality is that the word covers everything from a no-code flow editor to a full agentic runtime with retrieval, tools, and eval. We avoid the confusion by asking buyers a single question: what runs the turn? If the answer is 'the vendor', it is a vendor platform. If the answer is 'our code', it is a custom build on top of an SDK or framework. The capability set looks the same. The ownership model is night and day, and that is what drives total cost, exit risk, and how fast you can ship a custom tool.

One more piece of vocabulary worth pinning down. We use 'platform' for the runtime layer above. We use 'channel' for the surface (WhatsApp, web, voice). We use 'assistant' or 'agent' for the configured persona running on that platform. A single platform can host many assistants. A single assistant can run on many channels. Mixing the layers in conversations with buyers is the fastest way to land in a scope that does not match the contract.

Reference conversational ai platform architecture

Our reference conversational ai platform architecture has five layers and a sidecar. Ingress runs at the edge on Cloudflare Workers or Vercel. Understanding is a small classifier (Claude Haiku 4 in our default stack) that decides route + intent. Grounding hits pgvector or Pinecone for retrieval. Reasoning is Claude Opus 4 or GPT-4o behind LangGraph state. Action calls tools through MCP. The sidecar is Langfuse for trace + eval. Everything is async and idempotent.

Multi-channel conversational ai platform request flow
User turn: web / WhatsApp / voice / SMS
Intent classify: Claude Haiku 4
Channel adapter: Twilio / LiveKit / webhook
RAG retrieval: pgvector + rerank
Reasoning + tools: Claude Opus 4 + MCP
Reply + trace: output → Langfuse

That flow is identical whether the inbound is a WhatsApp message or a voice call. The channel adapter normalizes the turn into a common envelope. Downstream layers do not care which channel was used. We learned that the hard way on engagement three when half the conversational ai platform implementation was channel-specific. We refactored into a shared envelope in week two of engagement four and never went back.

Two things in that diagram earn extra explanation. Intent classification is a small Claude Haiku 4 call, not a separate trained model. It returns a route label and an urgency score. The route label decides which retrieval namespace and which tool subset is in play. The urgency score decides whether the turn is queued to a low-priority pool or run synchronously. We added the urgency hop on the second engagement after a few high-stakes voice calls sat in queue for nine seconds. Now urgent turns skip the queue and the rest absorb spiky load without spilling latency into the user-facing path.
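
To make the two classifier outputs concrete, here is a minimal sketch of how a route label and an urgency score could be turned into an execution plan. The route table, the tool names, the `0.8` threshold, and the `plan_turn` helper are all hypothetical; the text above only establishes that the route picks a retrieval namespace and tool subset, and that urgency picks between the queue and the synchronous path.

```python
from dataclasses import dataclass

# Hypothetical route table: route label -> retrieval namespace and allowed tools.
ROUTES = {
    "billing": {"namespace": "kb_billing", "tools": ["lookup_invoice", "refund"]},
    "support": {"namespace": "kb_support", "tools": ["create_ticket"]},
}
FALLBACK_ROUTE = "support"
URGENCY_THRESHOLD = 0.8  # assumed cut-off; tune against your own traffic


@dataclass
class Plan:
    namespace: str
    tools: list[str]
    run_sync: bool  # True: skip the queue; False: low-priority pool


def plan_turn(route: str, urgency: float) -> Plan:
    """Map the classifier's (route, urgency) output onto an execution plan."""
    cfg = ROUTES.get(route, ROUTES[FALLBACK_ROUTE])
    return Plan(cfg["namespace"], list(cfg["tools"]), urgency >= URGENCY_THRESHOLD)
```

Unknown route labels fall back to a default route here rather than failing the turn, which is one reasonable choice among several.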

The reasoning hop is where most of the cost sits. It is also where most of the platform-specific behavior lives. We pin Claude Opus 4 for high-stakes turns and fall back to Claude Sonnet 4 or GPT-4o on retries and on traffic that does not need the larger model. The router that picks the model is part of the platform, not the prompt. Pushing it into the platform means we can A/B model choices without touching prompts, which keeps the prompt regression surface small.
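
A router of this kind can be very small. The sketch below assumes a boolean high-stakes flag and a retry counter as inputs; the model strings are placeholder labels rather than exact API model IDs, and `pick_model` is a hypothetical name.

```python
PRIMARY = "claude-opus-4"  # placeholder labels, not exact API model IDs
FALLBACKS = ["claude-sonnet-4", "gpt-4o"]


def pick_model(high_stakes: bool, attempt: int = 0) -> str:
    """Choose a model for the reasoning hop.

    attempt 0 is the first try; each retry moves one step down the fallback list.
    """
    if high_stakes and attempt == 0:
        return PRIMARY
    # Low-stakes traffic starts on the first fallback; retries walk the list.
    step = attempt - 1 if high_stakes else attempt
    return FALLBACKS[min(step, len(FALLBACKS) - 1)]
```

Because the choice lives in one function outside the prompt, an A/B test on model selection only has to swap this function or its constants.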

Trace fan-out is the last detail. Every node writes a span to Langfuse with the input state, the output state, the model, the token counts, and any tool calls. The whole thing is one trace per turn. We can replay any trace in a notebook by lifting the input state into a fresh graph run. That replay loop is the single most important productivity multiplier on our team because it collapses debugging from 'reproduce the customer's session' to 'paste the trace ID'.
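
The replay loop can be illustrated without Langfuse or LangGraph. The sketch below uses an in-memory dictionary as a stand-in for the trace store and two toy nodes; `TRACE_STORE`, `traced`, and `replay` are hypothetical names, and a real build would read spans back from the tracing backend instead.

```python
from copy import deepcopy

TRACE_STORE: dict[str, list[dict]] = {}  # trace_id -> ordered spans (in-memory stand-in)


def traced(trace_id: str, name: str, node, state: dict) -> dict:
    """Run one node and record a span with its input and output state."""
    before = deepcopy(state)
    after = node(state)
    TRACE_STORE.setdefault(trace_id, []).append(
        {"node": name, "input": before, "output": deepcopy(after)}
    )
    return after


def replay(trace_id: str, pipeline: list) -> dict:
    """Re-run a recorded turn from the input state of its first span."""
    state = deepcopy(TRACE_STORE[trace_id][0]["input"])
    for _name, node in pipeline:
        state = node(state)
    return state


# Toy nodes standing in for classify and reason.
def classify(state: dict) -> dict:
    return {**state, "route": "billing" if "invoice" in state["text"] else "support"}


def reason(state: dict) -> dict:
    return {**state, "reply": f"[{state['route']}] handling: {state['text']}"}


PIPELINE = [("classify", classify), ("reason", reason)]

state = {"text": "where is my invoice?"}
for name, node in PIPELINE:
    state = traced("trace-123", name, node, state)

replayed = replay("trace-123", PIPELINE)  # same input state, fresh run
```

With deterministic toy nodes the replay reproduces the original result exactly; with a live model the replay is a fresh sample from the same input, which is usually what debugging needs.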

The six core capabilities every platform must own

When buyers ask us to score a vendor or design a build, we run a six-capability rubric. If a platform misses one of these, it will leak engineering effort somewhere downstream. The list is short because the contracts are non-negotiable.

| Capability | What it owns | Default tool 2026 |
| --- | --- | --- |
| Ingress + channel | WhatsApp, SMS, voice, web, email adapters | Twilio + LiveKit |
| Understanding | Intent, language detect, PII scrub, safety | Claude Haiku 4 |
| Grounding | Retrieval, rerank, citation | pgvector + Cohere rerank |
| Reasoning | Multi-turn planning, tool use | Claude Opus 4 via LangGraph |
| Action | Tool exec, side effects, escalations | MCP + Temporal |
| Observability | Tracing, eval, replay | Langfuse |
Six platform contracts, owner, and our default tool

Note what is missing from the table. No NLU classifier in the old Rasa sense. No hand-built dialog state machine. Both have been absorbed into the LLM reasoning layer. That single shift, which finished landing in 2024, is why the 2018-era conversational ai platform examples (Dialogflow, Watson Assistant, LUIS) feel slow today.

The grounding contract is the one buyers underestimate most. Retrieval is not 'plug a vector database in'. It is query rewriting plus embedding model choice. Then hybrid lexical fallback. Then reranking. Then citation enforcement plus a deletion path for GDPR or SOC 2 audits. Each of those has at least one production gotcha that costs a week if you skip it. We tell every buyer the grounding layer is roughly a third of the engineering effort on a real build. The reasoning layer, the part everyone is excited about, is closer to one-seventh.

Observability deserves the same emphasis. Langfuse is our default because the trace model lines up with how LangGraph emits state. LangSmith is the obvious alternative if your shop is already on the LangChain ecosystem; Arize and Phoenix work well for teams that already run model-monitoring elsewhere. The point is that observability is a platform-level decision, not a per-assistant choice. Every assistant emits into the same trace store. Every alert routes from one place. We have seen teams pick a different eval tool per assistant and then lose half a day every incident reconciling formats.

How we wire the runtime: code that ships

Two snippets follow. The first is the LangGraph state graph we use to wire the six layers together. The second is the channel adapter envelope that normalizes inbound traffic. Both are stripped from a recent build and rewritten for clarity. They run.

python
# graph.py: LangGraph state graph for a conversational ai platform
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver  # assumes the langgraph-checkpoint-redis package
from langfuse.callback import CallbackHandler
from .nodes import classify, retrieve, reason, act, persist

trace = CallbackHandler()  # passed per call through config["callbacks"]

graph = StateGraph(dict)
graph.add_node("classify", classify)   # Claude Haiku 4
graph.add_node("retrieve", retrieve)   # pgvector + Cohere rerank
graph.add_node("reason", reason)       # Claude Opus 4
graph.add_node("act", act)             # MCP tool calls
graph.add_node("persist", persist)     # session + long-term mem

graph.set_entry_point("classify")
graph.add_edge("classify", "retrieve")
graph.add_edge("retrieve", "reason")
graph.add_conditional_edges("reason", lambda s: "act" if s.get("tool_calls") else "persist")
graph.add_edge("act", "reason")        # tool result loops back
graph.add_edge("persist", END)

# compile() needs a checkpointer object, not a connection string.
with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
    checkpointer.setup()  # create the Redis indices on first run
    run = graph.compile(checkpointer=checkpointer)
    # Invoke inside this block so the Redis connection stays open, e.g.:
    # run.invoke(state, config={"callbacks": [trace], "configurable": {"thread_id": thread_id}})
# every node is traced, every state is replayable

typescript
// envelope.ts — normalize any channel into one shape
import { z } from "zod";

export const Envelope = z.object({
  id: z.string().uuid(),
  channel: z.enum(["web","whatsapp","sms","voice","email"]),
  user: z.object({ id: z.string(), locale: z.string().default("en-US") }),
  turn: z.object({
    text: z.string().optional(),
    audioUrl: z.string().url().optional(),
    attachments: z.array(z.string().url()).default([]),
  }),
  meta: z.object({
    receivedAt: z.string().datetime(),
    threadId: z.string(),
    deliveryReceipt: z.boolean().default(false),
  }),
});
export type Envelope = z.infer<typeof Envelope>;

// Twilio webhook → Envelope
export function fromTwilio(body: Record<string,string>): Envelope {
  return Envelope.parse({
    id: crypto.randomUUID(),
    channel: "whatsapp",
    user: { id: body.From, locale: body.Locale || "en-US" },
    turn: { text: body.Body, attachments: [] },
    meta: { receivedAt: new Date().toISOString(), threadId: body.From, deliveryReceipt: false },
  });
}

The Redis checkpointer is the part most teams skip on the first build. Skip it and a tool-call retry corrupts state. In our dev-loop eval we measured a 14% turn-failure rate without checkpointing in 2026-Q1 on a 600-conversation regression set. With Redis checkpointing the same set ran clean. Same prompts, same model.
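
The failure mode is easier to see in isolation. The sketch below is not how LangGraph's checkpointer is implemented; it is a simplified stand-in that saves a tool result under a (thread, step) key so a retry reuses the saved result instead of repeating the side effect. `CHECKPOINTS`, `run_tool_once`, and `charge_card` are hypothetical names, and the dictionary plays the role Redis plays in the real stack.

```python
CHECKPOINTS: dict[tuple[str, str], dict] = {}  # (thread_id, step) -> saved result; Redis stand-in
CHARGES: list[str] = []  # records side effects so a double execution would be visible


def charge_card(order_id: str) -> dict:
    CHARGES.append(order_id)  # the side effect that must not repeat
    return {"order_id": order_id, "status": "charged"}


def run_tool_once(thread_id: str, step: str, tool, *args) -> dict:
    """Execute a tool at most once per (thread, step); retries reuse the saved result."""
    key = (thread_id, step)
    if key not in CHECKPOINTS:
        CHECKPOINTS[key] = tool(*args)
    return CHECKPOINTS[key]


first = run_tool_once("thread-1", "act:0", charge_card, "order-42")
retry = run_tool_once("thread-1", "act:0", charge_card, "order-42")  # simulated retry
```

Without the saved result, the retry would call `charge_card` a second time and the session state would record two charges for one order.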

The third file most teams forget is the retrieval node. It is small. It is also the layer where citation correctness lives or dies. Below is the version we copy into every new build. It does hybrid lexical + vector retrieval, reranks with Cohere, and returns chunks tagged with source URIs that the reasoning prompt is required to cite. The retrieval contract is short: input is the rewritten query, output is a list of cited chunks. Anything more complicated and you are doing reasoning inside retrieval, which is hard to test in isolation.

python
# retrieve.py: grounding layer for the conversational ai platform
import asyncpg, cohere, voyageai
from pgvector.asyncpg import register_vector

co = cohere.Client()
# voyage-3 is served by Voyage AI; the Anthropic SDK has no embeddings endpoint.
vo = voyageai.Client()

async def retrieve(state: dict) -> dict:
    q = state["rewritten_query"]
    emb = vo.embed([q], model="voyage-3", input_type="query").embeddings[0]

    # register_vector lets asyncpg send a Python list as a pgvector value.
    # A long-lived pool shared across turns is cheaper than one pool per call.
    async with asyncpg.create_pool(dsn=state["pg_dsn"], init=register_vector) as pool:
        rows = await pool.fetch(
            """
            SELECT id, text, source_uri,
                   1 - (embedding <=> $1) AS vec_sim,
                   ts_rank(tsv, plainto_tsquery($2)) AS lex_score
            FROM kb_chunks
            WHERE tenant_id = $3
            ORDER BY (1 - (embedding <=> $1)) * 0.7 + ts_rank(tsv, plainto_tsquery($2)) * 0.3 DESC
            LIMIT 25
            """,
            emb, q, state["tenant_id"],
        )

    if not rows:  # nothing retrieved, so there is nothing to rerank
        return {**state, "chunks": [], "retrieval_trace": {"vec_hits": 0, "reranked_to": 0}}

    rerank = co.rerank(model="rerank-english-v3.0", query=q, documents=[r["text"] for r in rows], top_n=5)
    chosen = [rows[r.index] for r in rerank.results]

    return {
        **state,
        "chunks": [{"text": c["text"], "source": c["source_uri"], "id": c["id"]} for c in chosen],
        "retrieval_trace": {"vec_hits": len(rows), "reranked_to": len(chosen)},
    }

Two design choices in that snippet are worth flagging. The hybrid score uses 0.7 vector and 0.3 lexical. We landed on that ratio after watching pure-vector retrieval miss exact product SKUs and order IDs on three different commerce builds. Lexical pulls those back. The second choice is tenant_id in the WHERE clause: multi-tenant safety has to live in the query, not the application layer, or one bad code path leaks chunks across tenants. We have seen that bug ship to staging. It is hard to find from logs.
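
A small worked example shows why the lexical term matters for exact identifiers. The scores below are made up for illustration; only the 0.7 / 0.3 weights come from the text above.

```python
VEC_WEIGHT, LEX_WEIGHT = 0.7, 0.3  # the ratio quoted above


def hybrid_score(vec_sim: float, lex_score: float) -> float:
    return VEC_WEIGHT * vec_sim + LEX_WEIGHT * lex_score


# Made-up scores for a query that contains an exact SKU.
candidates = [
    {"id": "general-returns-policy", "vec_sim": 0.82, "lex_score": 0.05},
    {"id": "sku-4471-spec-sheet", "vec_sim": 0.74, "lex_score": 0.60},
]

by_vector = max(candidates, key=lambda c: c["vec_sim"])["id"]
by_hybrid = max(candidates, key=lambda c: hybrid_score(c["vec_sim"], c["lex_score"]))["id"]
# by_vector picks the general policy page; by_hybrid picks the SKU spec sheet
```

One caveat worth checking on a real corpus: `ts_rank` is not on the same scale as cosine similarity, so the effective balance between the two terms depends on the data as well as on the weights.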

Conversational ai platform examples, sorted by who they fit

There are roughly four families of conversational ai platform examples in 2026. We have built on three of them and integrated against the fourth. Pick by what you control and what you need to ship in 90 days.

[Figure: Three-tier memory model for a conversational AI platform. T1 short-term: context window, lives only for the current turn, bounded by the model context limit (200K tokens on Claude Opus 4), in-prompt latency. T2 mid-term: session state in Redis or Postgres with an LLM summary buffer, lives across turns until session end or TTL, 1-5 ms latency. T3 long-term: RAG knowledge in pgvector, Pinecone, Weaviate, or Qdrant with a Cohere reranker, persistent across all sessions, 40-120 ms latency. Every user message reads from all three tiers; only T2 and T3 persist after the turn ends.]
Figure 2: Production conversational AI platforms use three distinct memory tiers. Conflating them — e.g. cramming session state into the context window — is the most common architecture mistake we see in remediation engagements.

Build-your-own on LangGraph + Claude

You own the runtime. LangGraph + Claude Opus 4 + pgvector + Langfuse. Best when you need custom tools, strict eval, and full data residency. Six to ten weeks to production for most B2B use cases.

Vendor platform (Voiceflow, Cognigy, Kore.ai, Ada)

Vendor owns the runtime. You configure flows in a UI, bring an LLM key, and ship in days. Best for support deflection with thin tool surface. Cost climbs with volume and exit cost is real once flows scale past 50 nodes.

The other two families are model-vendor consoles (OpenAI Assistants API, Claude Agent Skills) and embedded SDKs (Vercel AI SDK, Mastra). Console runtimes are fastest to prototype but lock you to one model. Embedded SDKs give you a runtime inside your existing app with very little new infra, at the cost of doing your own eval + ops.

We have shipped on the first family (LangGraph + Claude) for three engagements, on the second (vendor) for one integration where the buyer already had a Voiceflow contract they could not exit, and on the fourth (embedded SDK with Vercel AI SDK + Mastra) for two small builds where the assistant lived inside an existing Next.js product. The third family (model consoles) we have used for internal tools but never recommended as the production substrate for an external product, because the lock-in tax is real and the eval surface is thin.

One pattern that keeps recurring on buyer calls: companies pick a vendor platform for the support assistant, hit the tool-surface ceiling six months in, and start building a custom runtime alongside it for the agentic use cases. The two coexist for a year and then the vendor one gets retired. If you can see that pattern coming, it is often cheaper to start on the custom build and accept the slower week-one velocity.

Picking the best conversational ai platform for your stack

There is no single best conversational ai platform — there is one that fits your channel mix, your data sensitivity, and your team's runtime appetite. We use the matrix below in week one of every pilot to force the trade-off into the open before anyone signs a contract.

| Scenario | Build on LangGraph | Vendor platform | Model console | Embedded SDK |
| --- | --- | --- | --- | --- |
| Support deflection, low tool surface | Overkill | Best fit | Workable | Workable |
| Multi-channel (voice + WhatsApp + web) | Best fit | Workable | Weak | Workable |
| Strict data residency / on-prem | Best fit | Vendor-dependent | Blocked | Workable |
| Tight 30-day prototype window | Tight | Best fit | Best fit | Workable |
| Custom tool / agent layer | Best fit | Weak | Workable | Best fit |
| Low-volume budget under early traction | Workable | Tight | Best fit | Best fit |
Our default scoring rubric for the best conversational ai platform per scenario. No row is universally green.

Honest read: the vendor platforms win on time-to-first-deflection. They lose on tool-heavy agentic work. The build-your-own win is total control and a much lower per-conversation cost once you cross about 200K turns per month. Below that, the math usually favors a vendor or a console.

Memory, state, and the part most demos hide

State has three timescales in our platforms: per-turn scratchpad, per-session checkpointed state, and per-user long-term memory. Each one lives in a different store. Conflating them is the most common reason a working demo collapses under real traffic.

Three-tier memory model for a conversational ai platform
In-process scratchpad: LangGraph state
Redis session: 14-day TTL
Long-term memory: pgvector + consent
Audit + deletion job: nightly Temporal workflow

Long-term memory is the one that breaks compliance reviews. Once you store a user's words across sessions, you owe them deletion, export, and audit. Build the consent flag and the deletion job in week one. Retrofitting is two weeks of unplanned work on every engagement we have run.

On the technical side, the failure mode we see most often is that teams stuff everything into the session and never promote anything to long-term. The assistant gets amnesia between sessions and the user has to repeat themselves. We promote selectively: stable facts (name, account ID, preferences, prior tickets) move from session to long-term on a nightly Temporal workflow. Volatile facts (current order status, today's mood) never leave session. The promotion rule is a single function with about fifteen lines of code and is the most-edited file in any platform we ship.
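
A promotion rule in that spirit might look like the sketch below. The key names mirror the stable facts listed above; the `consent` parameter, the `promote` name, and the dictionary shapes are assumptions made for illustration, not the shipped file.

```python
STABLE_KEYS = {"name", "account_id", "preferences", "prior_tickets"}  # stable facts named above


def promote(session: dict, long_term: dict, consent: bool) -> dict:
    """Return an updated long-term record containing only stable session facts."""
    if not consent:
        return dict(long_term)  # no consent: nothing leaves the session
    promoted = {k: v for k, v in session.items() if k in STABLE_KEYS and v is not None}
    return {**long_term, **promoted}
```

For example, a session holding `name`, `account_id`, `order_status`, and `mood` promotes only the first two; the volatile keys stay behind because they are not in the allow-list.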

A second mistake we have watched happen on two engagements: building long-term memory as a flat key-value store. It works at 100 users. At 10,000 users the retrieval cost climbs because every turn pulls the entire user profile into the prompt. The fix is to make long-term memory itself a small RAG: embed the memory items, retrieve only the relevant ones for the current turn. pgvector handles it inside the same database the rest of the platform already uses. The added complexity is real. The token savings start showing up around the 1,000-user mark and pay back at scale.
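
The shape of that fix can be sketched with a toy bag-of-words similarity in place of a real embedding model and pgvector. Everything here (`embed`, `cosine`, `recall_memories`, the sample memories) is hypothetical and only illustrates the idea of retrieving a few relevant memory items instead of loading the whole profile.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real build would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall_memories(query: str, memories: list[str], k: int = 2) -> list[str]:
    """Return the k memory items most similar to the current turn."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]


memories = [
    "prefers email over phone contact",
    "open ticket about a damaged laptop delivery",
    "billing address on file: Berlin",
    "asked about laptop warranty extension last month",
]
relevant = recall_memories("is my laptop still under warranty", memories)
```

Only the two laptop-related items reach the prompt; the contact preference and billing address stay in storage until a turn actually needs them.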

Evaluation: what passing actually looks like

We score every release on a six-axis rubric pulled into Langfuse. The matrix below is the public version. Numbers are our internal dev-loop measurements on a 600-conversation seed set, not client outcomes. We run it before every prompt or model swap.

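| Axis | Claude Opus 4 | GPT-4o | Llama 4 70B |
| --- | --- | --- | --- |
| Task success @ k=1 | 0.88 | 0.83 | 0.74 |
| Citation correctness | 0.91 | 0.86 | 0.78 |
| Safety refusal precision | 0.97 | 0.94 | 0.89 |
| p95 latency, ms | 1820 | 1410 | 980 |
| Cost per 1K turns, USD | 6.40 | 4.10 | 1.20 |
| Tool-call accuracy | 0.94 | 0.90 | 0.81 |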
Internal dev-loop scorecard, 2026-Q2

On a 1,840-doc corpus we replay every Friday, Claude Opus 4 hit 88% recall@5 in 2026-Q2 versus GPT-4o at 71%. The gap was almost entirely citation discipline. Opus refused to answer when retrieval was thin. GPT-4o answered anyway. Both behaviors are correct in different products. Pick the one that matches your risk posture.

The eval rubric matters more than the model choice in the long run. We run three layers of eval on every release. Unit evals fire on individual nodes (does retrieve return the expected chunk set for a known query). Conversation evals replay full sessions from the regression set against the new prompt or model. Online evals sample one in twenty production turns and run a smaller LLM judge against the response. Each layer catches something the others miss. Skipping any one of them has caused a regression that made it to production on at least one engagement.
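
For the online layer, one simple way to sample one turn in twenty is to hash the turn ID, so the decision is deterministic and the same turn is always in or out of the sample. This is a sketch under that assumption; `should_sample` and the turn ID format are hypothetical, and the text above does not say how the sampling is actually done.

```python
import hashlib

SAMPLE_EVERY = 20  # "one in twenty" from the text above


def should_sample(turn_id: str, every: int = SAMPLE_EVERY) -> bool:
    """Deterministically select roughly 1/every of turns for the online judge."""
    digest = hashlib.sha256(turn_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % every == 0


sampled = [f"turn-{i}" for i in range(10_000) if should_sample(f"turn-{i}")]
# len(sampled) should land near 500, i.e. about 5% of turns
```

Hash-based sampling avoids shared counters across workers, and a replayed turn gets the same sampling decision as the original.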

For public benchmark grounding, the HELM 2026 leaderboard and the MMLU-Pro results from late 2025 are the two we cite most when buyers want third-party numbers. Internal dev-loop numbers like the table above are more useful for product decisions because they reflect the actual corpus and the actual prompt. Public benchmarks calibrate vendor claims. Internal benchmarks decide which model ships.

Cost, latency, and the production envelope

Per-turn cost on a real platform is dominated by two things: the model on the reasoning hop, and how often retrieval re-embeds. Voice adds ASR (Deepgram or Whisper) and TTS (ElevenLabs or Cartesia). Below is the shape we see at moderate load on our default stack. Same prompts, same retrieval, model swapped.

Per-turn cost shape, 2026-Q2 internal measurement
Claude Opus 4 reasoning: 6.4 USD / 1K turns
GPT-4o reasoning: 4.1 USD / 1K turns
Llama 4 70B self-hosted: 1.2 USD / 1K turns
Add: Deepgram ASR voice: 1.5 USD / 1K turns
Add: ElevenLabs TTS voice: 3.8 USD / 1K turns

Voice roughly triples per-turn cost versus text. That is the single biggest budget surprise we have watched land on a buyer's desk. If voice is in scope, model it before you sign. Otherwise the platform looks affordable in a text pilot and breaks the unit economics on launch.
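
The multiplier depends heavily on which reasoning model sets the text baseline. The arithmetic below uses only the per-1K-turn figures listed above and ignores retrieval and other costs, so treat it as a rough sketch of the shape rather than a full budget model.

```python
# USD per 1K turns, copied from the internal measurements listed above.
REASONING = {"Claude Opus 4": 6.4, "GPT-4o": 4.1, "Llama 4 70B self-hosted": 1.2}
VOICE_ADDONS = {"Deepgram ASR": 1.5, "ElevenLabs TTS": 3.8}

voice_overhead = sum(VOICE_ADDONS.values())  # about 5.3 USD per 1K turns

# Ratio of (reasoning + voice) to reasoning alone, per model.
multipliers = {
    model: round((cost + voice_overhead) / cost, 2)
    for model, cost in REASONING.items()
}
# multipliers: Claude Opus 4 1.83, GPT-4o 2.29, Llama 4 70B self-hosted 5.42
```

The voice add-on is a fixed amount per turn, so the cheaper the reasoning model, the larger the relative jump when voice is switched on.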

Latency has the same shape. Text turns on our default stack land between 1.4 and 1.9 seconds p95 from inbound webhook to outbound response. Voice turns need to land under 800 ms barge-in-to-first-audio for the conversation to feel natural. We hit that with Deepgram streaming ASR, a smaller reasoning model on the first hop (Claude Haiku 4 or GPT-4o-mini for greetings and clarifications), and ElevenLabs streaming TTS that starts emitting audio mid-generation. Reaching for Claude Opus 4 on every voice turn blows the latency budget. The router decides which model handles which turn type and that decision is per-platform, not per-assistant.

Cost optimization beyond model choice mostly happens at the retrieval and caching layers. Prompt caching on Claude wipes most of the static system-prompt token cost after the first turn in a session. Embedding caching avoids re-embedding repeated queries. Tool-result caching for slow downstream APIs saves both latency and per-call fees. Stacked, those three caches typically cut per-turn cost by roughly a third to nearly half versus a naive baseline. We turn them on by default and measure the cache-hit rate in Langfuse.
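
Of the three caches, the embedding cache is the simplest to sketch. The version below keys on a normalized query string and tracks its own hit rate; the normalization rule (lowercase, collapsed whitespace), the `EmbeddingCache` class, and the `fake_embed` stand-in are assumptions for illustration, and a production cache would live in Redis or similar rather than in process memory.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by normalized query text and track the hit rate."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> list[float]:
        normalized = " ".join(query.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(query)
        return self._store[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


calls = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)  # count how often the "model" is really called
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
for q in ["Where is my order?", "where is  my order?", "Cancel my plan"]:
    cache.get(q)
```

Three lookups produce two embedding calls here because the first two queries normalize to the same key; `cache.hit_rate` is the number to export to the trace store.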

Conversational ai platform implementation: the 90-day shape

A typical conversational ai platform implementation on our team runs 90 days from kickoff to first paying conversation. Weeks one and two are channel envelopes and a working LangGraph loop. Weeks three to five are retrieval and eval scaffolding. Weeks six to eight are tools and HITL escalation. Weeks nine to twelve are hardening, load tests, and the on-call rotation. Anyone selling a six-week voice deployment is skipping eval or HITL. Both will bite later.

The non-engineering work is roughly equal in size. Compliance review for memory and PII handling takes two to four weeks of legal and security review. Annotation of the regression set takes one to two weeks of a subject-matter expert's time, and that is the single most leveraged hour of work in the entire engagement. Without a labeled regression set, every model swap is a guessing exercise. With it, every prompt change is a one-hour CI run.

Our standard handoff at the end of week twelve is the running platform, the regression set, the Langfuse dashboards, the on-call runbook, and a written eval rubric. The buyer's team can change prompts, swap models, and add tools without us. That is the test of whether the platform actually works as a substrate, not a black box.

Where we deliberately break with conventional wisdom

Three places we go against the common patterns. First, we skip dedicated NLU. The LLM does intent and entity extraction in one pass and the eval numbers back it up. Second, we keep dialog state in code (LangGraph) rather than in a visual builder; a Python file diffs cleanly in PRs and a flowchart does not. Third, we ship voice on Deepgram + ElevenLabs through Twilio rather than a single-vendor voice stack, because we want to swap each layer independently.

Each of those breaks has a cost. No-NLU means our prompts are longer and pricier per turn. Code-defined dialog means business ops cannot edit flows without a developer. Multi-vendor voice means more secrets to rotate. We accept those costs because the failure modes we have seen on the other side are worse: brittle intents that misfire on real user phrasing, flow charts nobody can review in a pull request, vendor pricing that doubles overnight when the rate card resets.

A fourth break worth mentioning: we do not use a separate guardrails framework like NeMo Guardrails. The safety layer lives inside the reasoning prompt and inside a small post-call classifier. Adding a third framework for guardrails creates another configuration surface and another place to track regressions. The Anthropic and OpenAI safety stacks have gotten good enough since 2024 that bolting on a dedicated guardrails layer mostly hurts. We revisit this every six months. So far the answer has not changed.

None of these positions are universal. If your team is non-technical, a visual builder is the right call. If your assistant only handles five intents, classical NLU on Rasa might still beat an LLM on cost. We hold these positions because of the workloads we see (agentic, multi-channel, eval-driven, tool-heavy). Match the choice to your workload, not to our default stack.

Where this fits in our wider ai chatbot development work

This platform is the substrate underneath every channel and use case we ship. If you are scoping a specific channel, the playbooks branch from here. WhatsApp builds use the channel adapter pattern above; we walk that through end to end in our whatsapp ai chatbot guide. Support deflection uses the same runtime with a tighter eval rubric; see our customer service chatbot channel playbook. For commerce assistants the tool layer dominates the design; we cover that in the ecommerce chatbot architecture guide. And the retrieval layer specifically is its own beast; we wrote it up in the rag chatbot architecture deep dive. For the parent topic, read our pillar on ai chatbot development.

Is a conversational ai platform the same as a chatbot builder?

No. A chatbot builder is usually a UI for designing flows on top of someone else's runtime. A conversational ai platform is the runtime: the channel adapters, retrieval, reasoning, tools, memory, and observability stack. Some products bundle both. Many do not.

Do we need LangGraph if we use Claude or GPT-4o?

You need some form of state graph. LangGraph is our default because it diffs in PRs and integrates with Langfuse. Embedded SDKs and model-vendor runtimes (Vercel AI SDK, OpenAI Assistants) work fine for simpler builds. Pick by how much tool-and-loop logic you need.

What is the smallest viable conversational ai platform architecture?

Channel adapter + one LLM call + a tool function + Langfuse for traces. That is roughly four files and ships in a week. Add retrieval and HITL when the use case earns them, not before.

Where does the vector database fit?

Inside the grounding layer. pgvector is our default because most teams already run Postgres. Pinecone wins on managed scale; Qdrant and Weaviate win on hybrid search features. The choice rarely changes the rest of the architecture.

How long does a conversational ai platform implementation take?

Eight to twelve weeks from kickoff to first production traffic for a text-only support build on our default stack. Voice adds three to five weeks for ASR/TTS tuning and load tests. Anyone quoting four weeks for voice is skipping eval.

Part of the Ai Chatbot Development series.
