AI Agent Architecture: Patterns, Loops & Orchestration

The real AI agent architecture patterns: ReAct, plan-and-execute, reflection, routing and multi-agent orchestration, with tradeoffs and failure modes.


The first agent we shipped to production in 2025 was a single ReAct loop with six tools and no supervisor. It worked in the demo. In week two it hit a tool that returned a 40KB JSON payload, the model re-read its history every turn, the context window filled, and the agent looped on the same failed action. That was not a model problem. It was an architecture problem: the loop had no step budget, no observation truncation, and no planner to decompose the task before the executor touched a tool.

That failure is why this post exists. Most AI agent architecture writing stops at the perceive-plan-act-observe diagram and calls it done. The diagram is correct and useless on its own. What ships is a set of named patterns with explicit failure modes, step budgets, and orchestration topologies. That is the gap our AI agent development work fills.

Below: the agent loop as a control flow, the four core single-agent patterns (ReAct, plan-and-execute, reflection, routing), multi-agent orchestration topologies, the tool-calling contract, three-layer memory, guardrails, runnable code, a dated 2026 reliability benchmark, and a decision matrix for picking a pattern by task shape. Operator-voiced, no hand-waving.

What AI agent architecture actually is: the perceive-plan-act-observe loop

An AI agent is a control loop wrapped around a language model. The model is the reasoning engine. The architecture decides what the model sees, what it can do, when it stops, and what happens when it fails. Strip the marketing and an agent has four moving parts: perceive assembles context, plan decides the next action, act executes a tool call, observe feeds the result back. The loop repeats until a stop condition fires.

The part most explainer content omits: the stop condition is architecture, not an afterthought. No step budget and a hard task loops forever. No observation truncation and raw tool output fills the context window. No terminal state and the agent never declares success. The loop below is the skeleton every single-agent pattern in this post specializes.

THE AGENT CONTROL LOOP: PERCEIVE, PLAN, ACT, OBSERVE
Flow: User goal (task + constraints + system prompt) -> 1. Perceive (assemble context: goal + memory + prior observations) -> 2. Plan (model reasons, picks next action: tool + args) -> Stop check (step budget hit? goal satisfied? error ceiling hit?). If the stop check fires: Terminate (return final answer or escalate to human). Otherwise: 3. Act (execute tool call: API, code, retrieval; with timeout) -> 4. Observe (capture result, truncate large payloads, append) -> feed back to Perceive, step++.
Why each guard matters:
Step budget: hard cap (e.g. 12 steps). Without it, a hard task loops the same failed action until the context window fills.
Observation truncate: cap each tool result (e.g. 4KB). A 40KB payload re-read every turn poisons context and burns tokens.
Error ceiling: after N consecutive tool errors, break and escalate. Stops infinite retry loops on a permanently broken tool.
Terminal state: an explicit success signal. Without it the model never knows it is done and keeps acting past the goal.
These four guards are the difference between a demo agent and a production agent. They are architecture, not prompt engineering.
Figure 1: The core agent loop. The step budget and stop check (top right) are the architecture that separates a production agent from a demo.

This loop is what makes an agent an agent rather than a script. A traditional workflow runs a fixed sequence you hard-coded; an agent decides its own next step at runtime from what it observed. That difference is the argument in our piece on agentic AI versus traditional automation. The cost of that flexibility is exactly the failure modes above, which is why the rest of this post is about controlling the loop.

ReAct: the reason-act-observe pattern and where it breaks

ReAct is the default single-agent pattern and what most people mean when they say agent. The model interleaves reasoning with actions: it thinks about what to do, picks a tool, executes it, reads the result, reasons again. No separate planner; it decides each step one at a time. It is the simplest pattern that still earns the word agent, and for tasks under roughly eight steps it is usually the right call.

react_agent.py
Python
"""react_agent.py — minimal ReAct loop on Claude with the four production guards:
step budget, observation truncation, error ceiling, terminal state."""
import anthropic

MODEL = "claude-sonnet-4-6"
MAX_STEPS = 12
MAX_CONSECUTIVE_ERRORS = 3
OBS_TRUNCATE_CHARS = 4000

TOOLS = [
    {"name": "search_docs",
     "description": "Search the internal KB. Returns top matching passages.",
     "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}},
    {"name": "finish",
     "description": "Call when the goal is satisfied. Provide the final answer.",
     "input_schema": {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}},
]

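def execute_tool(name: str, args: dict) -> str:
    # Stub: dispatch to your real retriever / API and return a string result.
    # Raising here (instead of returning None) keeps the slice below type-safe.
    raise NotImplementedError(f"no handler wired for tool {name!r}")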

def run_agent(goal: str) -> str:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": goal}]
    errors = 0
    for step in range(MAX_STEPS):  # step budget
        resp = client.messages.create(model=MODEL, max_tokens=1024, tools=TOOLS, messages=messages)
        messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return "".join(b.text for b in resp.content if b.type == "text")
        results = []
        for tu in tool_uses:
            if tu.name == "finish":
                return tu.input["answer"]  # terminal state
            try:
                out, errors = execute_tool(tu.name, tu.input), 0
            except Exception as e:  # error ceiling
                errors += 1
                out = f"ERROR: {e}"
                if errors >= MAX_CONSECUTIVE_ERRORS:
                    return "[escalated] repeated tool failure, handing to human"
            results.append({"type": "tool_result", "tool_use_id": tu.id,
                            "content": out[:OBS_TRUNCATE_CHARS]})  # observation truncation
        messages.append({"role": "user", "content": results})
    return "[step budget exhausted] no terminal state reached"

Where ReAct breaks: long-horizon tasks. With no global plan it has no commitment device. On a 20-step task it drifts, forgets earlier sub-goals as they scroll out of context, and repeats work. Every step is sequential even when three could run at once. When a ReAct agent stalls on a complex task, the fix is almost never a better model. It is a planner above the loop.

Plan-and-execute: separating the planner from the executor

Plan-and-execute splits the agent into two roles. A planner reads the goal once and produces an ordered list of sub-tasks. An executor works through that list, optionally running independent steps in parallel. The plan is a commitment device: it survives context scrolling because it lives outside the per-step reasoning. On long-horizon tasks this is the biggest win over flat ReAct.

We build most plan-and-execute agents on LangGraph because the plan, executor, and replan step map cleanly to graph nodes with explicit state. The planner node writes a task list into shared state; a conditional edge routes to the executor, then back to a replan node after each result. We walk through the full graph wiring on Claude in our Claude agents on LangGraph guide. The graph below shows the shape.

plan_execute.py
Python
"""plan_execute.py — plan-and-execute on LangGraph.
Planner decomposes once; executor works the list; replan revises on new info."""
from typing import TypedDict, Annotated
import operator
from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-6")

class AgentState(TypedDict):
    goal: str
    plan: list[str]
    done: Annotated[list[str], operator.add]

def planner(state):
    steps = llm.invoke(f"Goal: {state['goal']}\nReturn numbered sub-tasks.").content.splitlines()
    return {"plan": [s.strip() for s in steps if s.strip()]}

def executor(state):
    next_task = state["plan"][len(state["done"])]  # run with tools
    return {"done": [llm.invoke(f"Execute: {next_task}").content]}

def route(state):  # replan hook would revise state['plan'] here
    return END if len(state["done"]) >= len(state["plan"]) else "executor"

g = StateGraph(AgentState)
g.add_node("planner", planner)
g.add_node("executor", executor)
g.set_entry_point("planner")
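# Conditional edge so an empty plan ends the run instead of raising IndexError in executor.
g.add_conditional_edges("planner", route, {"executor": "executor", END: END})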
g.add_conditional_edges("executor", route, {"executor": "executor", END: END})
agent = g.compile()

The tradeoff is rigidity. A plan made up front can be wrong. If step two surfaces information that invalidates steps three through six, a naive executor marches on executing a dead plan. That is why the replan node matters: after each step, a cheap model call asks whether the remaining plan still holds. Skip it and plan-and-execute is brittle. Above ten steps, the extra planner and replan calls are almost always worth it.
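A replan step can be sketched as a plain function, kept separate from the graph so it is easy to test. This is a minimal sketch under assumptions, not the wiring we ship: `replan` and `ask_model` are hypothetical names, and `ask_model` stands in for whatever cheap model call you use. It assumes the model replies either `KEEP` or a fresh list with one sub-task per line.

```python
from typing import Callable


def replan(goal: str, plan: list[str], done: list[str],
           ask_model: Callable[[str], str]) -> list[str]:
    """Return the remaining plan, revised if the latest result invalidated it.

    `ask_model` is a hypothetical callable wrapping a cheap model. It is
    assumed to reply "KEEP" or a newline-separated list of sub-tasks.
    """
    remaining = plan[len(done):]
    if not remaining or not done:
        return remaining  # nothing left to revise, or nothing observed yet
    prompt = (
        f"Goal: {goal}\n"
        f"Latest result: {done[-1]}\n"
        "Remaining plan:\n" + "\n".join(remaining) + "\n"
        "Reply KEEP if the remaining plan still holds; "
        "otherwise reply with a revised list, one sub-task per line."
    )
    reply = ask_model(prompt).strip()
    if reply.upper() == "KEEP":
        return remaining
    revised = [line.strip() for line in reply.splitlines() if line.strip()]
    return revised or remaining  # an empty reply never wipes the plan
```

The function returns only the remaining steps, so a caller that indexes the plan by completed-step count would need to splice the result back in after the finished steps.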

Reflection: self-critique loops that catch the model's own mistakes

Reflection adds a critic step. After the agent produces a candidate, a second pass (same model with a different prompt, or a separate critic) evaluates it against the goal and either accepts it or sends it back with corrections. It is the agent equivalent of code review, and it shines where correctness is checkable but rarely achieved in one shot: code that must pass tests, extraction that must validate against a schema, multi-constraint writing.

The strongest reflection loops ground the critique in something external. A code agent reflects against a real test run, an extraction agent against a JSON schema validator, a research agent against retrieved sources. With a ground-truth signal, reflection reliably improves output. When the critic is just the model judging itself, it can confidently approve a wrong answer, which is reflection theater. We always wire it to a real validator when one exists.

reflection_loop.py
Python
"""reflection_loop.py — generate, critique against an external validator, revise.
The validator is the key: reflection without a ground-truth check is theater."""
import anthropic

client = anthropic.Anthropic()
MODEL, MAX_REVISIONS = "claude-sonnet-4-6", 3

def generate(goal, prior=None, critique=None):
    parts = [f"Goal: {goal}"]
    if prior:
        parts += [f"Previous attempt:\n{prior}", f"Critique to address:\n{critique}"]
    r = client.messages.create(model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": "\n".join(parts)}])
    return "".join(b.text for b in r.content if b.type == "text")

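def external_validate(candidate):
    # Ground-truth check: run pytest, validate JSON schema, etc. NOT an LLM opinion.
    # e.g. proc = subprocess.run(['pytest', ...], capture_output=True, text=True)
    #      return (proc.returncode == 0, proc.stdout + proc.stderr)
    return (True, "")  # placeholder: always passes until a real validator is wired in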

def reflect_until_valid(goal):
    candidate = generate(goal)
    for _ in range(MAX_REVISIONS):
        passed, critique = external_validate(candidate)
        if passed:
            return candidate
        candidate = generate(goal, candidate, critique)
    return candidate  # best effort; caller may escalate

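Reflection costs latency and tokens roughly linearly with the revision cap: each revision is another full generation call, so a cap of three can mean up to four generations on that step, and every retry prompt also carries the previous attempt and its critique. Use it where a wrong answer is costly and a validator exists. Skip it on cheap, low-stakes, or unverifiable outputs.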
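To make "a real validator" concrete for the extraction case, here is a minimal sketch using only the standard library. The field names and the `validate_extraction` helper are hypothetical; a production setup would more likely use a full JSON Schema validator. It returns the same `(passed, critique)` shape the loop above expects.

```python
import json

# Hypothetical schema for an invoice-extraction task: field name -> accepted type(s).
REQUIRED_FIELDS = {"invoice_id": str, "total": (int, float), "currency": str}


def validate_extraction(candidate: str) -> tuple[bool, str]:
    """Return (passed, critique); the critique is fed into the next revision."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return False, f"Output is not valid JSON: {exc.msg} at position {exc.pos}."
    if not isinstance(data, dict):
        return False, "Top-level JSON value must be an object."
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"Missing required field '{field}'.")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"Field '{field}' has the wrong type.")
    return (not problems, " ".join(problems))
```

The critique strings are deliberately specific, because a vague "invalid output" gives the next generation call nothing to correct.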

Routing: the classifier pattern that keeps simple tasks cheap

Routing is the most underrated pattern because it is barely an agent. A small, fast classifier reads the request and routes it: a flat answer for a simple FAQ, a ReAct loop for a tool task, a plan-and-execute graph for a complex one, a human queue for anything out of scope. It does no reasoning beyond classification. It exists so you do not pay full agent cost on a one-line-answer question.

In production we run a Haiku-class model (Claude Haiku 4, or GPT-4o-mini) as the router and reserve the expensive model for handlers that need it. A router call costs a fraction of a cent and diverts the cheap majority of traffic off the expensive agent path. Routing is also where you enforce scope: anything the classifier cannot confidently bucket goes to a human, which stops agents improvising on requests they were never designed for.

ROUTER PATTERN: CLASSIFY THEN DISPATCH
Incoming request (user goal) -> Router (Haiku 4; classify: SIMPLE / TOOL / COMPLEX / OOS) -> Dispatch (flat · ReAct · plan-exec · human) -> Handler runs (right cost for the task)

The failure mode is router miscalibration. Send complex tasks down the flat-answer path and quality drops silently; send simple tasks to the full agent and cost climbs. We treat the router as an evaluated component: a labelled set with known buckets, scored on routing accuracy, with a confidence threshold below which everything escalates to a human.
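A minimal sketch of that routing contract follows, with the classifier injected as a callable so the dispatch logic can be scored without a model. The names (`route`, `routing_accuracy`, the handler labels) and the 0.8 threshold are assumptions for illustration; the classifier is assumed to return a bucket and a confidence between 0 and 1.

```python
from typing import Callable

BUCKETS = ("SIMPLE", "TOOL", "COMPLEX", "OOS")
CONFIDENCE_THRESHOLD = 0.8  # assumed value; tune it on your labelled set

# request -> (bucket, confidence); stands in for the small router model
Classifier = Callable[[str], tuple[str, float]]


def route(request: str, classify: Classifier) -> str:
    """Return the handler name for a request, escalating when unsure."""
    bucket, confidence = classify(request)
    # Unknown bucket, out-of-scope, or low confidence all go to a human.
    if bucket not in BUCKETS or bucket == "OOS" or confidence < CONFIDENCE_THRESHOLD:
        return "human"
    return {"SIMPLE": "flat", "TOOL": "react", "COMPLEX": "plan_execute"}[bucket]


def routing_accuracy(labelled: list[tuple[str, str]], classify: Classifier) -> float:
    """Share of labelled (request, expected_handler) pairs routed correctly."""
    if not labelled:
        return 0.0
    hits = sum(route(req, classify) == expected for req, expected in labelled)
    return hits / len(labelled)
```

Scoring `route` rather than the raw classifier means the threshold is part of what gets evaluated, so raising it shows up as more escalations in the same number.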

Multi-agent orchestration: supervisor, swarm, and hierarchical topologies

When one agent is not enough, you reach for multiple. The mistake is reaching too early. Multi-agent multiplies the failure surface: every handoff is a place for context to drop, coordination to deadlock, cost to multiply. We move to multi-agent only when a single context window cannot hold the task, or sub-tasks need genuinely different tools and prompts. Three topologies are worth knowing.

MULTI-AGENT ORCHESTRATION TOPOLOGIES: SUPERVISOR, SWARM, HIERARCHICAL
Supervisor: a supervisor (routes + merges) over Research (web + RAG), Writer (drafts), and Critic (reviews). Central coordinator. Easy to debug: one place owns routing + merge. Coordinator is the bottleneck.
Swarm (peer handoff): Agent A (triage), Agent B (billing), Agent C (refunds). Peers hand off directly, no hub. Flexible but hard to trace + can loop.
Hierarchical: Top lead -> Sub-lead 1, Sub-lead 2 -> workers. Nested supervisors. Scales to many workers; deepest to debug.
Pick by coordination need:
Supervisor: one coordinator owns task decomposition, routing, and result merge. Frameworks: LangGraph supervisor, CrewAI hierarchical process, OpenAI Agents SDK handoffs through a triage agent. Default choice. Single trace point makes incidents tractable.
Swarm: peers hand off to each other directly with no central hub (AutoGen GroupChat, OpenAI Agents SDK peer handoffs). Maximum flexibility, but handoff loops and lost context are common. Add a turn cap and a shared scratchpad or it deadlocks.
Hierarchical: supervisors of supervisors. Scales to dozens of workers across domains. Each layer adds latency and a debugging hop. Use only when one supervisor genuinely cannot hold the coordination state for all workers.
Rule of thumb: start with one agent. Move to supervisor only when context or tool divergence forces it. Reach hierarchical last.
Figure 2: Three orchestration topologies. Supervisor (left) routes through a central coordinator; swarm (center) lets peers hand off directly; hierarchical (right) nests supervisors. Pure swarm is hardest to debug.

The supervisor topology is our default. A coordinator owns task decomposition and result merging; specialist sub-agents (research, writing, review, code) each get their own tools and prompt. Because all routing flows through one node, you get a single trace point for debugging. LangGraph's supervisor pattern, CrewAI's hierarchical process, and the OpenAI Agents SDK triage-and-handoff model all implement this shape. AutoGen's GroupChat leans swarm, more flexible and harder to keep from looping.
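Stripped of any framework, the supervisor shape is small enough to sketch in plain Python. This is an illustration under assumptions, not how LangGraph or CrewAI implement it: `run_supervisor`, `decompose`, and the worker signature are hypothetical, and the workers here are ordinary callables standing in for sub-agents.

```python
from typing import Callable

# (sub-task, shared scratchpad) -> result; stands in for a specialist sub-agent
Worker = Callable[[str, list[str]], str]


def run_supervisor(goal: str,
                   decompose: Callable[[str], list[tuple[str, str]]],
                   workers: dict[str, Worker],
                   max_handoffs: int = 8) -> dict:
    """Route each (specialist, sub-task) pair through one coordinator."""
    scratchpad: list[str] = []  # shared context every worker can read
    trace: list[str] = []       # single trace point for debugging
    assignments = decompose(goal)
    for specialist, subtask in assignments[:max_handoffs]:  # turn cap
        worker = workers.get(specialist)
        if worker is None:
            trace.append(f"{specialist}: no such worker, skipped")
            continue
        result = worker(subtask, scratchpad)
        scratchpad.append(f"[{specialist}] {result}")
        trace.append(f"{specialist}: ok")
    return {"answer": "\n".join(scratchpad), "trace": trace,
            "truncated": len(assignments) > max_handoffs}
```

Every handoff passes through the one loop, which is what gives the single trace point. The merge step here is a plain join; a real supervisor would usually make one more model call to synthesize the scratchpad.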

Tool calling: the contract that decides whether your agent scales

Tools are how an agent affects the world: read a database, call an API, run code, search a corpus. The model never executes anything itself. It emits a structured request to call a named tool with arguments, your runtime executes it, and the result returns as an observation. Tool-schema quality is the biggest reliability lever nobody talks about, because most tutorials hand the model two clean tools and never stress it.

Three rules we hold to. First, fewer sharper tools beat many overlapping ones; a model handed twenty fuzzy-boundary tools picks the wrong one constantly, where six well-scoped tools rarely misfire. Second, every tool description is a prompt; the model decides whether to call a tool from its description alone, so it must say what it does and when not to use it. Third, outputs must be bounded; an unbounded result poisons context, the exact failure that started this post.

tool_schema.py
Python
"""tool_schema.py — a well-designed tool contract (Anthropic / OpenAI / MCP shape):
sharp boundary, when-to-use guidance in the description, bounded output."""

LOOKUP_ORDER = {
    "name": "lookup_order",
    # The description IS a prompt: state what, when, and when-not.
    "description": (
        "Look up a single order by exact order ID like 'ORD-12345'. "
        "Do NOT use to search by name or email (use search_customer). "
        "Returns status, total, and the 20 most recent line items."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string", "pattern": "^ORD-[0-9]{4,8}$"}},
        "required": ["order_id"],
        "additionalProperties": False,  # reject hallucinated extra args
    },
}

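def run_lookup_order(order_id: str) -> dict:
    # `db` is your data-access layer (not defined in this snippet);
    # fetch_order is expected to return None when the order does not exist.
    order = db.fetch_order(order_id)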
    if order is None:
        return {"error": "order_not_found", "order_id": order_id}
    return {
        "order_id": order.id, "status": order.status, "total": str(order.total),
        "line_items": [li.summary() for li in order.line_items[:20]],  # bounded!
        "truncated": len(order.line_items) > 20,
    }
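The schema only helps if the runtime enforces it before the handler runs, since a model can still emit arguments that ignore it. A minimal sketch of that check for `lookup_order`, using only the standard library; `validate_lookup_args` is a hypothetical helper, and a real runtime would more likely validate against the JSON schema generically.

```python
import re

ORDER_ID_PATTERN = re.compile(r"^ORD-[0-9]{4,8}$")  # same pattern as the schema


def validate_lookup_args(args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    problems = []
    extra = set(args) - {"order_id"}  # mirrors additionalProperties: False
    if extra:
        problems.append(f"unexpected arguments: {sorted(extra)}")
    order_id = args.get("order_id")
    if not isinstance(order_id, str):
        problems.append("order_id is required and must be a string")
    elif not ORDER_ID_PATTERN.fullmatch(order_id):
        problems.append("order_id must look like 'ORD-12345'")
    return problems
```

Returning the problems as an observation, rather than raising, lets the model correct its own call on the next step.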

The Model Context Protocol (MCP) standardizes this contract across hosts. Expose a tool once as an MCP server and any MCP-aware host (Claude, the OpenAI Agents SDK, Cursor, Claude Code) can call it, turning N agents times M tools of bespoke integrations into M definitions reused everywhere.

Memory architecture: working, episodic, and semantic layers

Agent memory is three layers with different lifetimes. Working memory is the current context window: the goal, recent observations, the running plan, capped by the model's context limit. Episodic memory is the history of past sessions. Semantic memory is the durable knowledge base the agent retrieves from, exactly the retrieval layer we cover in our RAG chatbot architecture deep-dive. Conflate these three and you get an agent that either forgets everything between turns or tries to stuff its entire history into every prompt.

AGENT MEMORY LAYERS: LIFETIME AND STORE
Working memory: context window, this turn
Episodic memory: past runs, KV / DB store
Semantic memory: knowledge, vector store
Summarize + persist: compact working -> episodic

The architecture decision is how working memory ages out, because a long-running agent will overflow its context window. The standard pattern is summarization: at a threshold, a cheap model call compresses the oldest turns into a summary, that summary persists to episodic memory (Redis, Postgres, a managed store), and the raw turns drop out. On the next relevant turn, the agent retrieves the right summary rather than replaying everything. Same retrieve-and-ground move as semantic memory, applied to the agent's own history.
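The summarize-and-persist move can be sketched in a few lines. Everything named here is an assumption for illustration: `compact`, the two thresholds, and a plain dict standing in for Redis or Postgres. The sketch triggers on turn count for simplicity, where a token count is the more usual trigger.

```python
from typing import Callable

MAX_WORKING_TURNS = 6  # assumed threshold; a token count is the usual trigger
KEEP_RECENT = 3        # raw turns that stay in working memory


def compact(turns: list[str], episodic: dict[str, list[str]], session_id: str,
            summarize: Callable[[list[str]], str]) -> list[str]:
    """Summarize the oldest turns, persist the summary, return the trimmed window."""
    if len(turns) <= MAX_WORKING_TURNS:
        return turns
    old, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    summary = summarize(old)  # hypothetical cheap-model call
    episodic.setdefault(session_id, []).append(summary)  # stand-in for Redis/Postgres
    return [f"[summary of {len(old)} earlier turns] {summary}"] + recent
```

Keeping the summary at the head of the window means the agent still sees what happened earlier in the session, while the episodic store holds the copy that survives across sessions.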

Where teams get this wrong: they treat the context window as memory and never persist, so the agent is stateful within a session and amnesiac across sessions. Or they over-persist, writing every turn to a vector store and retrieving noisy, half-relevant history that crowds out the goal. The discipline is deciding, per agent, what is worth remembering across runs and what is ephemeral state.

Guardrails: the safety layer that wraps every agent action

Guardrails are not a content filter sprinkled on at the end. They are an architectural layer around the loop with checks at three points: input before the agent sees it, every tool call before it executes, output before it returns. No input guardrails, you are exposed to prompt injection. No tool-call guardrails, the agent can be talked into a destructive action.

The highest-leverage guardrail is on tool calls, specifically read versus write. Read tools (look up an order, search docs) can run autonomously. Write tools (issue a refund, send an email, delete a record) should pass through an approval gate first. We wire human-in-the-loop (HITL) confirmation on any write above a defined blast radius: the agent proposes, a human approves. The model should never be the last line of defense on an irreversible action.
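A minimal sketch of the read-versus-write gate, assuming the tool names and the `approve` callback, both of which are hypothetical. For brevity it gates every write tool rather than only writes above a blast-radius threshold, and it denies any tool that is on neither list.

```python
from typing import Callable

READ_TOOLS = {"lookup_order", "search_docs"}
WRITE_TOOLS = {"issue_refund", "send_email", "delete_record"}  # hypothetical names


def guarded_call(name: str, args: dict,
                 execute: Callable[[str, dict], str],
                 approve: Callable[[str, dict], bool]) -> str:
    """Run read tools directly; run write tools only after human approval."""
    if name in READ_TOOLS:
        return execute(name, args)
    if name in WRITE_TOOLS:
        if approve(name, args):  # e.g. a review queue or chat confirmation
            return execute(name, args)
        return f"DENIED: {name} was not approved by a human reviewer"
    return f"BLOCKED: {name} is not on the allow-list"  # default deny
```

The denial comes back as an observation string, so the agent can tell the user the action is pending or refused instead of retrying it.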

Choosing a pattern: a decision matrix by task shape

Pattern selection follows task shape, not preference. The matrix below scores the four single-agent patterns plus multi-agent against the properties that decide the fit: task length, checkable correctness, latency, and cost. Read the failure note on each row, because every pattern has a shape where it is the wrong call.

| Pattern | Short tool task (under 8 steps) | Long-horizon task (10+ steps) | Checkable correctness | Tight latency / cost budget |
| --- | --- | --- | --- | --- |
| ReAct (flat loop) | Best fit. Simplest, lowest overhead. Default for modest tool tasks. | Poor. No plan means drift, repeated work, lost sub-goals as context scrolls. | Moderate. Catches errors only if it re-checks. No structured critique. | Good. One model call per step. Cap MAX_STEPS to bound cost. |
| Plan-and-execute | Overkill. The upfront planning call is wasted on a three-step task. | Best fit. Plan survives context scroll. Add a replan node or a bad plan runs on. | Good. Pairs with a per-step validator; plan structure makes checks easy. | Moderate. Planner plus replan add cost. Replan can be a cheap model. |
| Reflection | Situational. Worth it only if the short task is high-stakes with a real validator. | Good as a layer on top of plan-and-execute, not the whole architecture. | Best fit. Home turf. Ground critique in tests or a schema, not self-opinion. | Poor. Each revision multiplies latency and tokens. Theater without ground truth. |
| Routing | Best fit. A cheap classifier diverts simple traffic off the full agent path. | Indirect. Routes the complex task to a heavier pattern; not a solver on its own. | Moderate. Router accuracy is itself checkable; evaluate on a labelled set. | Best fit. Keeps cheap tasks cheap. Miscalibration silently drops quality. |
| Multi-agent (supervisor) | Poor. Coordination overhead and lossy handoffs dwarf any benefit. | Good when one context window cannot hold the task or tools diverge. | Good. A dedicated critic agent is a clean reflection layer. | Poor. Every agent multiplies cost. Reach here last. |
Score each pattern against your task shape. Each row names its failure mode; pick the column matching your dominant constraint.

Frameworks: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK

The framework choice matters less than the pattern choice, but it is not neutral; each one opinionates toward a topology. None removes the need to understand the patterns above; they give you primitives so you are not hand-rolling the loop. Whether you adopt a framework at all is part of the broader custom build versus off-the-shelf decision.

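LangGraph models the agent as an explicit graph of nodes and typed state, the strongest fit for long-horizon and supervisor multi-agent systems. CrewAI uses role-based crews and is the fastest route to a working multi-agent demo. The OpenAI Agents SDK has first-class handoffs, built-in guardrails, and tracing. AutoGen rounds out the set with its GroupChat abstraction, which leans swarm: agents converse as peers, which is powerful for research-style collaboration and hard to keep from looping without a turn cap. Our default stack: LangGraph for control-heavy single and supervisor agents, the OpenAI Agents SDK when the client is OpenAI-native and wants built-in guardrails, CrewAI when the task is cleanly role-shaped. We stay model-agnostic underneath: Claude, GPT-4o, GPT-5, and self-hosted Llama 4 all sit behind the same loop.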

What the 2026 data says about agent reliability

Agent reliability compounds badly across steps, which makes the case for the guards in this post. Per-step success rates do not add; they multiply. A loop that is ninety-five percent reliable per step is only about sixty percent reliable across ten steps, because 0.95 to the tenth power is roughly 0.60. That is why long-horizon tasks need plans, reflection, and step budgets rather than a bigger model.
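The arithmetic is worth running for your own numbers. The sketch below assumes every step succeeds independently with the same probability, which is a simplification: real steps are correlated and some failures are recoverable. The function name is ours, not from any library.

```python
def whole_task_reliability(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


for steps in (1, 5, 10, 20):
    print(steps, round(whole_task_reliability(0.95, steps), 2))
# prints:
# 1 0.95
# 5 0.77
# 10 0.6
# 20 0.36
```

At twenty steps the same ninety-five percent loop finishes cleanly only about a third of the time, which is the quantitative case for step budgets and intermediate validation.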

Two operator takeaways. First, frontier-model gains help but do not erase compounding error; architecture that bounds steps, validates intermediate results, and gates risky actions does more for whole-task reliability than the latest model. Second, the tasks benchmarks find hardest are exactly the long-horizon, many-tool tasks where flat ReAct breaks and plans, reflection, and supervisor coordination earn their keep. We treat published numbers as directional, then re-run on our own eval set per task, because your corpus and toolset decide the fit.

FAQ

What is AI agent architecture?

AI agent architecture is the control loop and supporting structure wrapped around a language model that decides what the model sees, what tools it can call, when it stops, and what happens when it fails. The core is a perceive-plan-act-observe loop guarded by a step budget, observation truncation, an error ceiling, and a terminal state. On top sit named patterns (ReAct, plan-and-execute, reflection, routing), a three-layer memory architecture, a tool-calling contract, and a guardrail layer. The model is the reasoning engine; the architecture is what makes it reliable in production rather than just in a demo.

What are the main AI agent architecture patterns?

The four core single-agent patterns are ReAct (interleave reasoning and tool calls one step at a time), plan-and-execute (a planner decomposes the goal into a task list an executor works through, with a replan step), reflection (a critic evaluates output against an external validator and sends it back for revision), and routing (a cheap classifier dispatches each request to the right handler). Above these sit multi-agent topologies: supervisor (a central coordinator routes to specialists), swarm (peers hand off directly), and hierarchical (nested supervisors). Pick by task shape, not by what is trending.

When should I use a multi-agent architecture instead of a single agent?

Move to multi-agent only when a single context window cannot hold the task, or when sub-tasks need truly different toolsets and prompts. Multi-agent multiplies the failure surface: every handoff is a lossy serialization where context leaks, and every agent multiplies cost. We stay on a single agent with a longer context window and a plan-and-execute loop for longer than feels comfortable. When you do go multi-agent, start with a supervisor topology and a shared scratchpad so context is not trapped inside individual agents, and reach hierarchical only when one supervisor cannot hold the coordination state.

How does memory work in an AI agent architecture?

Agent memory is three layers with different lifetimes. Working memory is the current context window (goal, recent observations, running plan). Episodic memory is the history of past sessions, persisted to a store like Redis or Postgres. Semantic memory is the durable knowledge base the agent retrieves from, typically a vector store, the same retrieval layer as a RAG pipeline. The key decision is how working memory ages out: at a threshold, a cheap model call summarizes the oldest turns, persists the summary to episodic memory, and drops the raw turns. Treating the context window as the only memory makes the agent stateful within a session and amnesiac across sessions.

Which framework should I use to build an AI agent?

The framework matters less than the pattern, but each opinionates toward a topology. LangGraph models the agent as an explicit graph of nodes and typed state, the strongest fit for long-horizon and supervisor multi-agent systems. CrewAI uses role-based crews and is fastest to a working multi-agent demo. The OpenAI Agents SDK has first-class handoffs, built-in guardrails, and tracing, the cleanest path on an OpenAI-native stack. AutoGen's GroupChat leans swarm for research-style collaboration but needs a turn cap. We default to LangGraph for control, the OpenAI Agents SDK for OpenAI-native guardrails, and CrewAI for role-shaped speed, staying model-agnostic across Claude, GPT-4o, GPT-5, and self-hosted Llama 4.
