AI Agent Architecture: Patterns, Loops & Orchestration
The real AI agent architecture patterns: ReAct, plan-and-execute, reflection, routing and multi-agent orchestration, with tradeoffs and failure modes.
The first agent we shipped to production in 2025 was a single ReAct loop with six tools and no supervisor. It worked in the demo. In week two it hit a tool that returned a 40KB JSON payload, the model re-read its history every turn, the context window filled, and the agent looped on the same failed action. That was not a model problem. It was an architecture problem: the loop had no step budget, no observation truncation, and no planner to decompose the task before the executor touched a tool.
That failure is why this post exists. Most AI agent architecture writing stops at the perceive-plan-act-observe diagram and calls it done. The diagram is correct and useless on its own. What ships is a set of named patterns with explicit failure modes, step budgets, and orchestration topologies. That is the gap our AI agent development work fills.
Below: the agent loop as a control flow, the four core single-agent patterns (ReAct, plan-and-execute, reflection, routing), multi-agent orchestration topologies, the tool-calling contract, three-layer memory, guardrails, runnable code, a dated 2026 reliability benchmark, and a decision matrix for picking a pattern by task shape. Operator-voiced, no hand-waving.
What AI agent architecture actually is: the perceive-plan-act-observe loop
An AI agent is a control loop wrapped around a language model. The model is the reasoning engine. The architecture decides what the model sees, what it can do, when it stops, and what happens when it fails. Strip the marketing and an agent has four moving parts: perceive assembles context, plan decides the next action, act executes a tool call, observe feeds the result back. The loop repeats until a stop condition fires.
The part most explainer content omits: the stop condition is architecture, not an afterthought. With no step budget, a hard task loops forever. With no observation truncation, raw tool output fills the context window. With no terminal state, the agent never declares success. A loop with those guards in place is the skeleton every single-agent pattern in this post specializes.
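The four parts and their stop conditions can be sketched without any model at all. The following is a minimal illustration, not production code: `LoopState`, `run_loop`, `toy_plan`, and `toy_act` are hypothetical names, the limits are arbitrary, and a toy policy stands in for the language model.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_STEPS = 8      # step budget (illustrative value)
OBS_LIMIT = 200    # observation truncation, in characters (illustrative value)


@dataclass
class LoopState:
    goal: str
    observations: list[str] = field(default_factory=list)


def run_loop(goal: str,
             plan: Callable[[LoopState], tuple[str, str]],
             act: Callable[[str, str], str]) -> str:
    """Perceive, plan, act, observe until a stop condition fires."""
    state = LoopState(goal)                       # perceive: assemble context
    for _ in range(MAX_STEPS):                    # stop condition: step budget
        action, arg = plan(state)                 # plan: decide the next action
        if action == "finish":                    # stop condition: terminal state
            return arg
        result = act(action, arg)                 # act: execute the tool call
        state.observations.append(result[:OBS_LIMIT])  # observe: bounded feedback
    return "[step budget exhausted]"


# Toy policy standing in for the model: look something up once, then finish.
def toy_plan(state: LoopState) -> tuple[str, str]:
    if not state.observations:
        return ("lookup", state.goal)
    return ("finish", f"answer based on: {state.observations[-1]}")


def toy_act(action: str, arg: str) -> str:
    return f"{action} result for {arg!r}"


print(run_loop("refund policy", toy_plan, toy_act))
```

Swapping `toy_plan` for a policy that never returns `finish` shows the step budget doing its job: the loop ends with the exhausted message instead of spinning.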
This loop is what makes an agent an agent rather than a script. A traditional workflow runs a fixed sequence you hard-coded; an agent decides its own next step at runtime from what it observed. That difference is the argument in our comparison of agentic AI versus traditional automation. The cost of that flexibility is exactly the failure modes above, which is why the rest of this post is about controlling the loop.
ReAct: the reason-act-observe pattern and where it breaks
ReAct is the default single-agent pattern and what most people mean when they say agent. The model interleaves reasoning with actions: it thinks about what to do, picks a tool, executes it, reads the result, reasons again. No separate planner; it decides each step one at a time. It is the simplest pattern that still earns the word agent, and for tasks under roughly eight steps it is usually the right call.
"""react_agent.py — minimal ReAct loop on Claude with the four production guards:
step budget, observation truncation, error ceiling, terminal state."""
import anthropic

MODEL = "claude-sonnet-4-6"
MAX_STEPS = 12
MAX_CONSECUTIVE_ERRORS = 3
OBS_TRUNCATE_CHARS = 4000

TOOLS = [
    {"name": "search_docs",
     "description": "Search the internal KB. Returns top matching passages.",
     "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}},
    {"name": "finish",
     "description": "Call when the goal is satisfied. Provide the final answer.",
     "input_schema": {"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}},
]

def execute_tool(name: str, args: dict) -> str:
    # Placeholder: dispatch to your real retriever / API and return a string.
    raise NotImplementedError(f"no handler wired for tool {name!r}")

def run_agent(goal: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    messages = [{"role": "user", "content": goal}]
    errors = 0  # consecutive tool failures; reset on any success
    for step in range(MAX_STEPS):  # step budget
        resp = client.messages.create(model=MODEL, max_tokens=1024, tools=TOOLS, messages=messages)
        messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            # The model answered in plain text without calling a tool.
            return "".join(b.text for b in resp.content if b.type == "text")
        results = []
        for tu in tool_uses:
            if tu.name == "finish":
                return tu.input["answer"]  # terminal state
            try:
                out, errors = execute_tool(tu.name, tu.input), 0
            except Exception as e:  # error ceiling
                errors += 1
                out = f"ERROR: {e}"
                if errors >= MAX_CONSECUTIVE_ERRORS:
                    return "[escalated] repeated tool failure, handing to human"
            results.append({"type": "tool_result", "tool_use_id": tu.id,
                            "content": out[:OBS_TRUNCATE_CHARS]})  # observation truncation
        messages.append({"role": "user", "content": results})
    return "[step budget exhausted] no terminal state reached"

Where ReAct breaks: long-horizon tasks. With no global plan it has no commitment device. On a 20-step task it drifts, forgets earlier sub-goals as they scroll out of context, and repeats work. Every step is sequential even when three could run at once. When a ReAct agent stalls on a complex task, the fix is almost never a better model. It is a planner above the loop.
Plan-and-execute: separating the planner from the executor
Plan-and-execute splits the agent into two roles. A planner reads the goal once and produces an ordered list of sub-tasks. An executor works through that list, optionally running independent steps in parallel. The plan is a commitment device: it survives context scrolling because it lives outside the per-step reasoning. On long-horizon tasks this is the biggest win over flat ReAct.
We build most plan-and-execute agents on LangGraph because the plan, executor, and replan step map cleanly to graph nodes with explicit state. The planner node writes a task list into shared state; a conditional edge routes to the executor, then back to a replan node after each result. We walk through the full graph wiring on Claude in our Claude agents on LangGraph guide. The graph below shows the shape.
"""plan_execute.py — plan-and-execute on LangGraph.
Planner decomposes once; executor works the list; replan revises on new info."""
from typing import TypedDict, Annotated
import operator

from langgraph.graph import StateGraph, END
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-6")

class AgentState(TypedDict):
    goal: str
    plan: list[str]
    done: Annotated[list[str], operator.add]  # reducer: each executor result is appended

def planner(state):
    steps = llm.invoke(f"Goal: {state['goal']}\nReturn numbered sub-tasks.").content.splitlines()
    return {"plan": [s.strip() for s in steps if s.strip()]}

def executor(state):
    next_task = state["plan"][len(state["done"])]  # run with tools
    return {"done": [llm.invoke(f"Execute: {next_task}").content]}

def route(state):  # replan hook would revise state['plan'] here
    return END if len(state["done"]) >= len(state["plan"]) else "executor"

g = StateGraph(AgentState)
g.add_node("planner", planner)
g.add_node("executor", executor)
g.set_entry_point("planner")
# Route out of the planner too, so an empty plan ends the run instead of
# raising IndexError in the executor.
g.add_conditional_edges("planner", route, {"executor": "executor", END: END})
g.add_conditional_edges("executor", route, {"executor": "executor", END: END})
agent = g.compile()

The tradeoff is rigidity. A plan made up front can be wrong. If step two surfaces information that invalidates steps three through six, a naive executor marches on executing a dead plan. That is why the replan node matters: after each step, a cheap model call asks whether the remaining plan still holds. Skip it and plan-and-execute is brittle. Above ten steps, paying for the planner and replan calls is almost always worth it.
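The replan step is easier to reason about outside a framework. Below is a minimal sketch of the control flow, assuming a hypothetical `replan` callable that takes the goal, the results so far, and the remaining steps, and returns the revised remaining steps. In a real agent that callable would wrap a cheap model call; here a toy function stands in so the behavior is easy to trace.

```python
from typing import Callable

# Hypothetical signature: (goal, results so far, remaining steps) -> revised remaining steps.
Replanner = Callable[[str, list[str], list[str]], list[str]]


def execute_with_replan(goal: str, plan: list[str],
                        run_step: Callable[[str], str],
                        replan: Replanner, max_steps: int = 20) -> list[str]:
    """Work through `plan`, asking `replan` after each step whether the rest still holds."""
    remaining, done = list(plan), []
    while remaining and len(done) < max_steps:      # the step budget still applies
        task = remaining.pop(0)
        done.append(run_step(task))
        remaining = replan(goal, done, remaining)   # may drop, reorder, or add steps
    return done


# Toy example: step two discovers the account is closed, which invalidates the rest.
def toy_run(task: str) -> str:
    return "account closed" if task == "check account" else f"did {task}"


def toy_replan(goal: str, done: list[str], remaining: list[str]) -> list[str]:
    if done[-1] == "account closed":
        return ["notify customer"]                  # dead plan replaced
    return remaining


result = execute_with_replan(
    "issue refund",
    ["find order", "check account", "compute refund", "send refund"],
    toy_run, toy_replan)
print(result)
```

The refund steps never run because the replanner swaps them out as soon as the invalidating observation arrives.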
Reflection: self-critique loops that catch the model's own mistakes
Reflection adds a critic step. After the agent produces a candidate, a second pass (same model with a different prompt, or a separate critic) evaluates it against the goal and either accepts it or sends it back with corrections. It is the agent equivalent of code review, and it shines where correctness is checkable but not in one shot: code that must pass tests, extraction that must validate against a schema, multi-constraint writing.
The strongest reflection loops ground the critique in something external. A code agent reflects against a real test run, an extraction agent against a JSON schema validator, a research agent against retrieved sources. With a ground-truth signal, reflection reliably improves output. When the critic is just the model judging itself, it can confidently approve a wrong answer, which is reflection theater. We always wire it to a real validator when one exists.
"""reflection_loop.py — generate, critique against an external validator, revise.
The validator is the key: reflection without a ground-truth check is theater."""
import anthropic

client = anthropic.Anthropic()
MODEL, MAX_REVISIONS = "claude-sonnet-4-6", 3

def generate(goal, prior=None, critique=None):
    parts = [f"Goal: {goal}"]
    if prior:
        parts += [f"Previous attempt:\n{prior}", f"Critique to address:\n{critique}"]
    r = client.messages.create(model=MODEL, max_tokens=2048,
                               messages=[{"role": "user", "content": "\n".join(parts)}])
    return "".join(b.text for b in r.content if b.type == "text")

def external_validate(candidate):
    # Ground-truth check: run pytest, validate JSON schema, etc. NOT an LLM opinion.
    # e.g. proc = subprocess.run(["pytest", ...], capture_output=True, text=True)
    #      return (proc.returncode == 0, proc.stdout + proc.stderr)
    return (True, "")  # placeholder: always passes until a real check is wired in

def reflect_until_valid(goal):
    candidate = generate(goal)
    for _ in range(MAX_REVISIONS):
        passed, critique = external_validate(candidate)
        if passed:
            return candidate
        candidate = generate(goal, candidate, critique)
    return candidate  # best effort: the last revision is unvalidated; caller may escalate

Reflection costs latency and tokens roughly linearly with the revision cap. With a cap of three, the worst case is four generation calls (one draft plus three revisions), and each revision prompt also carries the previous attempt, so budget up to about four times the single-shot spend on that step. Use it where a wrong answer is costly and a validator exists. Skip it on cheap, low-stakes, or unverifiable outputs.
Routing: the classifier pattern that keeps simple tasks cheap
Routing is the most underrated pattern because it is barely an agent. A small, fast classifier reads the request and routes it: a flat answer for a simple FAQ, a ReAct loop for a tool task, a plan-and-execute graph for a complex one, a human queue for anything out of scope. It does no reasoning beyond classification. It exists so you do not pay full agent cost on a one-line-answer question.
In production we run a Haiku-class model (Claude Haiku 4.5, or GPT-4o mini) as the router and reserve the expensive model for handlers that need it. A router call costs a fraction of a cent and diverts the cheap majority of traffic off the expensive agent path. Routing is also where you enforce scope: anything the classifier cannot confidently bucket goes to a human, which stops agents improvising on requests they were never designed for.
The failure mode is router miscalibration. Send complex tasks down the flat-answer path and quality drops silently; send simple tasks to the full agent and cost climbs. We treat the router as an evaluated component: a labelled set with known buckets, scored on routing accuracy, with a confidence threshold below which everything escalates to a human.
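Both halves of that discipline, the confidence threshold and the labelled evaluation set, fit in a few lines. This is a sketch under stated assumptions: the bucket names, the `0.75` floor, and the keyword-based `toy_classify` are illustrative stand-ins for a small-model classifier, not values from a real deployment.

```python
from typing import Callable

CONFIDENCE_FLOOR = 0.75   # illustrative threshold; tune it on your labelled set
BUCKETS = {"faq", "tool_task", "complex_task"}


def route(request: str, classify: Callable[[str], tuple[str, float]]) -> str:
    """Return a handler bucket, or 'human' when the classifier is unsure."""
    bucket, confidence = classify(request)
    if bucket not in BUCKETS or confidence < CONFIDENCE_FLOOR:
        return "human"
    return bucket


def routing_accuracy(labelled: list[tuple[str, str]],
                     classify: Callable[[str], tuple[str, float]]) -> float:
    """Fraction of (request, expected_bucket) pairs routed as expected."""
    hits = sum(route(req, classify) == expected for req, expected in labelled)
    return hits / len(labelled)


# Keyword stub standing in for the small-model classifier.
def toy_classify(request: str) -> tuple[str, float]:
    text = request.lower()
    if "opening hours" in text:
        return ("faq", 0.95)
    if "order" in text:
        return ("tool_task", 0.85)
    if "migrate" in text:
        return ("complex_task", 0.80)
    return ("faq", 0.30)   # low confidence: should escalate


LABELLED = [
    ("What are your opening hours?", "faq"),
    ("Where is my order ORD-12345?", "tool_task"),
    ("Migrate our billing data to the new schema", "complex_task"),
    ("Can you write me a poem about tax law?", "human"),
]
accuracy = routing_accuracy(LABELLED, toy_classify)
print(f"routing accuracy: {accuracy:.2f}")
```

Including out-of-scope requests in the labelled set, with `human` as the expected bucket, is what lets the same score cover both misrouting and failure to escalate.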
Multi-agent orchestration: supervisor, swarm, and hierarchical topologies
When one agent is not enough, you reach for multiple. The mistake is reaching too early. Multi-agent multiplies the failure surface: every handoff is a place for context to drop, coordination to deadlock, cost to multiply. We move to multi-agent only when a single context window cannot hold the task, or sub-tasks need genuinely different tools and prompts. Three topologies are worth knowing: supervisor (a central coordinator routes to specialists), swarm (peers hand off directly to each other), and hierarchical (nested supervisors).
The supervisor topology is our default. A coordinator owns task decomposition and result merging; specialist sub-agents (research, writing, review, code) each get their own tools and prompt. Because all routing flows through one node, you get a single trace point for debugging. LangGraph's supervisor pattern, CrewAI's hierarchical process, and the OpenAI Agents SDK triage-and-handoff model all implement this shape. AutoGen's GroupChat leans swarm, more flexible and harder to keep from looping.
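Stripped of any framework, the supervisor shape is a coordinator that decomposes, dispatches, and merges while recording every handoff in one place. The sketch below is illustrative only: `supervise`, `toy_decompose`, and the specialist lambdas are hypothetical, and in practice each specialist would be its own agent loop with its own tools and prompt.

```python
from typing import Callable

Specialist = Callable[[str], str]


def supervise(goal: str,
              decompose: Callable[[str], list[tuple[str, str]]],
              specialists: dict[str, Specialist],
              max_handoffs: int = 10) -> tuple[str, list[dict]]:
    """Coordinator: decompose, dispatch each sub-task to a specialist, merge results."""
    trace, outputs = [], []
    for role, subtask in decompose(goal)[:max_handoffs]:   # cap the number of handoffs
        if role not in specialists:
            trace.append({"role": role, "task": subtask, "status": "no_specialist"})
            continue
        result = specialists[role](subtask)
        trace.append({"role": role, "task": subtask, "status": "ok"})
        outputs.append(f"[{role}] {result}")
    return "\n".join(outputs), trace    # every handoff is visible in one trace


def toy_decompose(goal: str) -> list[tuple[str, str]]:
    return [("research", f"find sources on {goal}"),
            ("writing", f"draft summary of {goal}"),
            ("legal", "check compliance")]   # no such specialist is registered


SPECIALISTS = {
    "research": lambda task: f"3 sources for: {task}",
    "writing": lambda task: f"draft for: {task}",
}

merged, trace = supervise("agent loops", toy_decompose, SPECIALISTS)
print(merged)
print([t["status"] for t in trace])
```

Because all dispatch goes through `supervise`, the unhandled `legal` sub-task shows up in the trace rather than disappearing silently.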
Tool calling: the contract that decides whether your agent scales
Tools are how an agent affects the world: read a database, call an API, run code, search a corpus. The model never executes anything itself. It emits a structured request to call a named tool with arguments, your runtime executes it, and the result returns as an observation. Tool-schema quality is the biggest reliability lever nobody talks about, because most tutorials hand the model two clean tools and never stress it.
Three rules we hold to. First, fewer sharper tools beat many overlapping ones; a model handed twenty fuzzy-boundary tools picks the wrong one constantly, where six well-scoped tools rarely misfire. Second, every tool description is a prompt; the model decides whether to call a tool from its description alone, so it must say what it does and when not to use it. Third, outputs must be bounded; an unbounded result poisons context, the exact failure that started this post.
"""tool_schema.py — a well-designed tool contract (Anthropic / OpenAI / MCP shape):
sharp boundary, when-to-use guidance in the description, bounded output."""
LOOKUP_ORDER = {
    "name": "lookup_order",
    # The description IS a prompt: state what, when, and when-not.
    "description": (
        "Look up a single order by exact order ID like 'ORD-12345'. "
        "Do NOT use to search by name or email (use search_customer). "
        "Returns status, total, and the 20 most recent line items."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string", "pattern": "^ORD-[0-9]{4,8}$"}},
        "required": ["order_id"],
        "additionalProperties": False,  # reject hallucinated extra args
    },
}

def run_lookup_order(order_id: str) -> dict:
    # `db` stands in for your data-access layer; it is not defined in this snippet.
    order = db.fetch_order(order_id)
    if order is None:
        return {"error": "order_not_found", "order_id": order_id}
    return {
        "order_id": order.id, "status": order.status, "total": str(order.total),
        # bounded! Assumes line_items is ordered newest first.
        "line_items": [li.summary() for li in order.line_items[:20]],
        "truncated": len(order.line_items) > 20,
    }

The Model Context Protocol (MCP) standardizes this contract across hosts. Expose a tool once as an MCP server and any MCP-aware host (Claude, the OpenAI Agents SDK, Cursor, Claude Code) can call it, turning N agents times M tools into M definitions reused everywhere.
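One runtime-side detail of the tool contract is worth sketching: the `pattern` and `additionalProperties` constraints in a schema only help if something checks them before the tool runs. Whether the model provider enforces them depends on the API and mode, so the check below is a cheap safety net rather than a description of any provider's behavior. `validate_args` is a hypothetical helper that covers only the keywords used in the order-lookup schema, not full JSON Schema; a real system would more likely use a schema-validation library.

```python
import re

SCHEMA = {
    "type": "object",
    "properties": {"order_id": {"type": "string", "pattern": "^ORD-[0-9]{4,8}$"}},
    "required": ["order_id"],
    "additionalProperties": False,
}


def validate_args(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed.

    Handles only: required, string type, pattern, additionalProperties."""
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected argument: {name}")
            continue
        spec = props[name]
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} must be a string")
        elif "pattern" in spec and isinstance(value, str) \
                and not re.search(spec["pattern"], value):
            problems.append(f"{name} does not match {spec['pattern']}")
    return problems


print(validate_args({"order_id": "ORD-12345"}, SCHEMA))
print(validate_args({"order_id": "12345", "email": "a@example.com"}, SCHEMA))
```

Returning the problem list to the model as the tool result, instead of raising, gives it a chance to correct the arguments on the next step.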
Memory architecture: working, episodic, and semantic layers
Agent memory is three layers with different lifetimes. Working memory is the current context window: the goal, recent observations, the running plan, capped by the model's context limit. Episodic memory is the history of past sessions. Semantic memory is the durable knowledge base the agent retrieves from, exactly the retrieval layer we cover in our RAG chatbot architecture deep-dive. Conflate these three and you get an agent that either forgets everything between turns or tries to stuff its entire history into every prompt.
The architecture decision is how working memory ages out, because a long-running agent will overflow its context window. The standard pattern is summarization: at a threshold, a cheap model call compresses the oldest turns into a summary, that summary persists to episodic memory (Redis, Postgres, a managed store), and the raw turns drop out. On the next relevant turn, the agent retrieves the right summary rather than replaying everything. Same retrieve-and-ground move as semantic memory, applied to the agent's own history.
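The age-out step can be sketched as a single compaction function. The thresholds, the `compact` helper, and the use of a plain dict as the episodic store are all assumptions for illustration; the `summarize` callable stands in for the cheap model call, and the dict stands in for Redis or Postgres.

```python
from typing import Callable

MAX_TURNS = 6     # illustrative working-memory threshold, in turns
KEEP_RECENT = 3   # raw turns kept after compaction


def compact(turns: list[str],
            summarize: Callable[[list[str]], str],
            episodic_store: dict[str, str],
            session_id: str) -> list[str]:
    """When working memory exceeds MAX_TURNS, summarize the oldest turns,
    persist the summary, and keep only the summary plus recent raw turns."""
    if len(turns) <= MAX_TURNS:
        return turns
    old, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    summary = summarize(old)
    episodic_store[session_id] = summary      # stand-in for a persistent store
    return [f"[summary] {summary}"] + recent


store: dict[str, str] = {}
turns = [f"turn {i}" for i in range(8)]
compacted = compact(turns, lambda old: f"{len(old)} earlier turns", store, "s1")
print(compacted)
```

On a later compaction the previous summary line is part of the old turns, so it is folded into the new summary rather than accumulating.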
Where teams get this wrong: they treat the context window as memory and never persist, so the agent is stateful within a session and amnesiac across sessions. Or they over-persist, writing every turn to a vector store and retrieving noisy, half-relevant history that crowds out the goal. The discipline is deciding, per agent, what is worth remembering across runs and what is ephemeral state.
Guardrails: the safety layer that wraps every agent action
Guardrails are not a content filter sprinkled on at the end. They are an architectural layer around the loop with checks at three points: input before the agent sees it, every tool call before it executes, output before it returns. No input guardrails, you are exposed to prompt injection. No tool-call guardrails, the agent can be talked into a destructive action. No output guardrails, whatever the agent produced, including leaked data or an off-policy answer, goes straight to the user.
The highest-leverage guardrail is on tool calls, specifically read versus write. Read tools (look up an order, search docs) can run autonomously. Write tools (issue a refund, send an email, delete a record) should pass through an approval gate first. We wire human-in-the-loop (HITL) confirmation on any write above a defined blast radius: the agent proposes, a human approves. The model should never be the last line of defense on an irreversible action.
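A read-versus-write gate can be sketched as a thin wrapper around tool execution. The tool names and the `approve` callable below are hypothetical; in a real system `approve` would block on a human decision (a ticket, a chat prompt, a review queue), and a blast-radius rule could let small writes through automatically.

```python
from typing import Callable

READ_TOOLS = {"lookup_order", "search_docs"}
WRITE_TOOLS = {"issue_refund", "send_email", "delete_record"}


def guarded_call(tool: str, args: dict,
                 execute: Callable[[str, dict], str],
                 approve: Callable[[str, dict], bool]) -> str:
    """Run read tools directly; route write tools through an approval gate."""
    if tool in READ_TOOLS:
        return execute(tool, args)
    if tool in WRITE_TOOLS:
        if approve(tool, args):                 # human-in-the-loop decision
            return execute(tool, args)
        return f"DENIED: {tool} was not approved"
    return f"BLOCKED: {tool} is not an allowed tool"   # default deny


executed: list[str] = []


def toy_execute(tool: str, args: dict) -> str:
    executed.append(tool)
    return f"ran {tool}"


print(guarded_call("lookup_order", {"order_id": "ORD-12345"}, toy_execute, lambda t, a: False))
print(guarded_call("issue_refund", {"amount": 50}, toy_execute, lambda t, a: False))
```

The default-deny branch matters as much as the approval gate: a tool name the runtime does not recognize is refused rather than passed through.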
Choosing a pattern: a decision matrix by task shape
Pattern selection follows task shape, not preference. The matrix below scores the four single-agent patterns plus multi-agent against the properties that decide the fit: task length, checkable correctness, latency, and cost. Read the failure note on each row, because every pattern has a shape where it is the wrong call.
| Pattern | Short tool task (under 8 steps) | Long-horizon task (10+ steps) | Checkable correctness | Tight latency / cost budget |
|---|---|---|---|---|
| ReAct (flat loop) | Best fit. Simplest, lowest overhead. Default for modest tool tasks. | Poor. No plan means drift, repeated work, lost sub-goals as context scrolls. | Moderate. Catches errors only if it re-checks. No structured critique. | Good. One model call per step. Cap MAX_STEPS to bound cost. |
| Plan-and-execute | Overkill. The upfront planning call is wasted on a three-step task. | Best fit. Plan survives context scroll. Add a replan node or a bad plan runs on. | Good. Pairs with a per-step validator; plan structure makes checks easy. | Moderate. Planner plus replan add cost. Replan can be a cheap model. |
| Reflection | Situational. Worth it only if the short task is high-stakes with a real validator. | Good as a layer on top of plan-and-execute, not the whole architecture. | Best fit. Home turf. Ground critique in tests or a schema, not self-opinion. | Poor. Each revision multiplies latency and tokens. Theater without ground truth. |
| Routing | Best fit. A cheap classifier diverts simple traffic off the full agent path. | Indirect. Routes the complex task to a heavier pattern; not a solver on its own. | Moderate. Router accuracy is itself checkable; evaluate on a labelled set. | Best fit. Keeps cheap tasks cheap. Miscalibration silently drops quality. |
| Multi-agent (supervisor) | Poor. Coordination overhead and lossy handoffs dwarf any benefit. | Good when one context window cannot hold the task or tools diverge. | Good. A dedicated critic agent is a clean reflection layer. | Poor. Every agent multiplies cost. Reach here last. |
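The matrix can be read as a small decision function. The encoding below is a rough sketch, not a rule: `pick_pattern` and its parameters are hypothetical, the eight-to-ten-step band is a judgment call that this version resolves in favor of ReAct, and routing is left out because it sits in front of whichever pattern is chosen.

```python
def pick_pattern(steps: int, has_validator: bool = False,
                 high_stakes: bool = False, fits_one_context: bool = True,
                 divergent_tools: bool = False) -> list[str]:
    """Rough encoding of the matrix: returns patterns to combine, base first."""
    if not fits_one_context or divergent_tools:
        stack = ["multi-agent (supervisor)"]    # reach here last
    elif steps >= 10:
        stack = ["plan-and-execute"]            # long horizon needs a plan
    else:
        stack = ["react"]                       # short tool task
    # Reflection is a layer, and only pays off with a ground-truth validator.
    if has_validator and (high_stakes or steps >= 10):
        stack.append("reflection")
    return stack


print(pick_pattern(steps=4))
print(pick_pattern(steps=15, has_validator=True))
print(pick_pattern(steps=4, has_validator=True, high_stakes=True))
```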
Frameworks: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK
The framework choice matters less than the pattern choice, but it is not neutral; each one opinionates toward a topology. None removes the need to understand the patterns above; they give you primitives so you are not hand-rolling the loop. Whether you adopt a framework at all is part of the broader custom build versus off-the-shelf decision.
LangGraph models the agent as an explicit graph of nodes and typed state, CrewAI organizes role-based crews, and the OpenAI Agents SDK ships first-class handoffs, built-in guardrails, and tracing. AutoGen rounds out the set with its GroupChat abstraction, which leans swarm, with agents conversing as peers: powerful for research-style collaboration, hard to keep from looping without a turn cap. Our default stack: LangGraph for control-heavy single and supervisor agents, the OpenAI Agents SDK when the client is OpenAI-native and wants built-in guardrails, CrewAI when the task is cleanly role-shaped. We stay model-agnostic underneath: Claude, GPT-4o, GPT-5, and self-hosted Llama 4 all sit behind the same loop.
What the 2026 data says about agent reliability
Agent reliability compounds badly across steps, which makes the case for the guards in this post. Per-step success rates do not add; they multiply. A loop that is ninety-five percent reliable per step is only about sixty percent reliable across ten steps, because 0.95 to the tenth power is roughly 0.60. That is why long-horizon tasks need plans, reflection, and step budgets rather than a bigger model.
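The compounding arithmetic is worth running against your own numbers. The sketch below assumes step failures are independent and every step has the same success rate, which is a simplification; `whole_task_success` and `max_steps_for` are hypothetical helper names.

```python
import math


def whole_task_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def max_steps_for(per_step: float, target: float) -> int:
    """Largest step count that keeps whole-task success at or above target."""
    return math.floor(math.log(target) / math.log(per_step))


for p in (0.90, 0.95, 0.99):
    print(p, round(whole_task_success(p, 10), 3), max_steps_for(p, 0.60))
```

Read the other way round, `max_steps_for` turns a reliability target into a step budget: at ninety-five percent per step, a sixty percent whole-task target allows only nine steps.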
Two operator takeaways. First, frontier-model gains help but do not erase compounding error; architecture that bounds steps, validates intermediate results, and gates risky actions does more for whole-task reliability than the latest model. Second, the tasks benchmarks find hardest are exactly the long-horizon, many-tool tasks where flat ReAct breaks and plans, reflection, and supervisor coordination earn their keep. We treat published numbers as directional, then re-run on our own eval set per task, because your corpus and toolset decide the fit.
FAQ
What is AI agent architecture?
AI agent architecture is the control loop and supporting structure wrapped around a language model that decides what the model sees, what tools it can call, when it stops, and what happens when it fails. The core is a perceive-plan-act-observe loop guarded by a step budget, observation truncation, an error ceiling, and a terminal state. On top sit named patterns (ReAct, plan-and-execute, reflection, routing), a three-layer memory architecture, a tool-calling contract, and a guardrail layer. The model is the reasoning engine; the architecture is what makes it reliable in production rather than just in a demo.
What are the main AI agent architecture patterns?
The four core single-agent patterns are ReAct (interleave reasoning and tool calls one step at a time), plan-and-execute (a planner decomposes the goal into a task list an executor works through, with a replan step), reflection (a critic evaluates output against an external validator and sends it back for revision), and routing (a cheap classifier dispatches each request to the right handler). Above these sit multi-agent topologies: supervisor (a central coordinator routes to specialists), swarm (peers hand off directly), and hierarchical (nested supervisors). Pick by task shape, not by what is trending.
When should I use a multi-agent architecture instead of a single agent?
Move to multi-agent only when a single context window cannot hold the task, or when sub-tasks need truly different toolsets and prompts. Multi-agent multiplies the failure surface: every handoff is a lossy serialization where context leaks, and every agent multiplies cost. We stay on a single agent with a longer context window and a plan-and-execute loop for longer than feels comfortable. When you do go multi-agent, start with a supervisor topology and a shared scratchpad so context is not trapped inside individual agents, and reach for hierarchical only when one supervisor cannot hold the coordination state.
How does memory work in an AI agent architecture?
Agent memory is three layers with different lifetimes. Working memory is the current context window (goal, recent observations, running plan). Episodic memory is the history of past sessions, persisted to a store like Redis or Postgres. Semantic memory is the durable knowledge base the agent retrieves from, typically a vector store, the same retrieval layer as a RAG pipeline. The key decision is how working memory ages out: at a threshold, a cheap model call summarizes the oldest turns, persists the summary to episodic memory, and drops the raw turns. Treating the context window as the only memory makes the agent stateful within a session and amnesiac across sessions.
Which framework should I use to build an AI agent?
The framework matters less than the pattern, but each opinionates toward a topology. LangGraph models the agent as an explicit graph of nodes and typed state, the strongest fit for long-horizon and supervisor multi-agent systems. CrewAI uses role-based crews and is fastest to a working multi-agent demo. The OpenAI Agents SDK has first-class handoffs, built-in guardrails, and tracing, the cleanest path on an OpenAI-native stack. AutoGen's GroupChat leans swarm for research-style collaboration but needs a turn cap. We default to LangGraph for control, the OpenAI Agents SDK for OpenAI-native guardrails, and CrewAI for role-shaped speed, staying model-agnostic across Claude, GPT-4o, GPT-5, and self-hosted Llama 4.