AI Agent Benchmark: A 6-Axis Reliability Rubric for Production Agents
Why "agent accuracy" is useless, the six sub-metrics we actually score (completion, trajectory, tool-use, recovery, refusal calibration, cost), and the methodology behind our 2026-Q3 agent reliability benchmark.
Most AI agent benchmark numbers you'll see in a vendor deck are a single completion-rate score on a leaderboard. That number does not tell you whether the agent will survive your production traffic. On our own 100-task internal suite (2026-Q3, see /benchmarks/agent-reliability-2026-q3/), the headline completion-rate gap between the best and worst model in our pool was in the single digits. On recovery-after-error, the gap between the same models widened to a large double-digit one. Completion rate is the press release. Recovery is the metric that decides whether the agent ships.
We run agents in our own delivery, score them with the same harness we ship to clients, and publish the dated benchmark every quarter. This piece is the rubric behind that benchmark: six sub-metrics, two code snippets we use day-to-day, and the failure modes that aren't in any leaderboard. Any serious AI agent evaluation has to score recovery and calibration alongside completion, not after. If you're picking between a public AI agent benchmark and a custom evaluation rubric, the honest answer is you need both: the public one for sanity, the rubric for the decision.
Why "agent accuracy" is a useless metric
Agents don't return a string you can grade against a gold answer. They emit a trajectory: a sequence of tool calls, intermediate reasoning, retries, and a final write to some system of record. An agent that books a meeting in 3 tool calls and an agent that books the same meeting in 14 tool calls both show up as a perfect score on a naive completion-rate axis. One of them costs five times more and pages your on-call when a downstream tool rate-limits. That is not the same agent.
Worse, completion rate inflates on the easy head of the task distribution. The first eighty tasks in a hundred-task suite are the ones the gold set already knows the model can do. The last twenty are where the model differences hide. Reporting completion rate as a single average smears the tail into the head. The rubric below splits per-family and per-difficulty, which is how we caught a model in the 2026-Q2 run that scored in the low-90s overall but collapsed into the mid-50s on the hardest decile of multi-tool composition. That decile is exactly the workflow shape buyers actually ship.
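To make the smearing effect concrete, here is a minimal Python sketch of a per-group completion breakdown. The row fields (`difficulty`, `completed`) and the numbers are illustrative assumptions, not rows from our benchmark or the harness's actual output schema.

```python
from collections import defaultdict

def completion_by_group(results, key):
    """Return {group: completion rate in percent} for the given grouping key."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in results:
        total[row[key]] += 1
        passed[row[key]] += int(row["completed"])
    return {group: 100.0 * passed[group] / total[group] for group in total}

# Illustrative data only: 90 easy tasks (87 pass) and 10 hard tasks (5 pass).
results = (
    [{"difficulty": "easy", "completed": i < 87} for i in range(90)]
    + [{"difficulty": "hard", "completed": i < 5} for i in range(10)]
)

overall = 100.0 * sum(r["completed"] for r in results) / len(results)
by_difficulty = completion_by_group(results, "difficulty")
print(overall, by_difficulty)
```

In this made-up sample the single average sits in the low 90s while the hard slice passes only half its tasks, which is the shape a per-difficulty split is meant to expose.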
The literature has known this for years. AgentBench (arXiv:2308.03688) introduced trajectory-aware grading in 2023. Sierra's τ-bench (arXiv:2406.12045) extended it to real customer-service workflows in 2024. Most teams still report a single completion-rate number because that's what a leaderboard accepts. We treat the leaderboard as one column in a six-column rubric. The other five are where most production failures hide.
The six-axis AI agent benchmark rubric we score against
Our agent reliability evaluation rubric uses six sub-metrics, each scored 0–100 against a per-task ground truth. These six agent reliability metrics surface the failure modes a single-axis score smears together. LLM agent evaluation differs from RAG eval because the unit of judgement is a trajectory, not a string. We weight per buyer use case (a high-volume retrieval agent weights cost-per-task heavier than a regulated workflow agent, which weights refusal calibration heavier). The list:
1. Task completion rate. Did the agent finish every required checkpoint? The question is not whether it produced output but whether each subgoal landed.
2. Trajectory length vs optimal. How many tool calls did it use against a hand-annotated optimal path? Over- and under-decomposition are both penalised.
3. Tool-call accuracy. Argument-level grading, not a string match on function names: wrong arguments to the right tool is still a failure.
4. Recovery-after-error rate. When a tool returns an error or malformed payload, does the agent recover or spiral?
5. Refusal calibration. Two-axis: false-refusal (declined when it should've completed) and over-completion (charged ahead when it should've asked).
6. Cost-per-successful-task. Total $ spent / tasks that hit every checkpoint. The headline number for buyers.
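The difference between name-level and argument-level grading on the tool-call axis can be sketched in a few lines of Python. This is an illustration under assumptions, not the harness's scorer: the function name `tool_call_matches` is hypothetical, and strict equality on arguments is a simplification (a real grader would probably normalise equivalent values first).

```python
def tool_call_matches(actual, gold):
    """True only if the tool name matches and every gold argument matches exactly."""
    if actual["tool"] != gold["tool"]:
        return False
    return all(actual["args"].get(name) == value
               for name, value in gold["args"].items())

gold = {
    "tool": "calendar.write",
    "args": {"start": "2026-10-13T14:00:00+02:00", "duration_min": 30},
}
# Right tool, wrong timezone offset in the start argument.
wrong_tz = {
    "tool": "calendar.write",
    "args": {"start": "2026-10-13T14:00:00-07:00", "duration_min": 30},
}

name_only_pass = wrong_tz["tool"] == gold["tool"]   # a name match would pass this call
argument_pass = tool_call_matches(wrong_tz, gold)   # argument-level grading fails it
print(name_only_pass, argument_pass)
```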
Run on a 100-task suite spanning data extraction, scheduling, retrieval, code-edit, and multi-tool composition. Numbers here are completion + recovery only — the other four axes live in the full 2026-Q3 dated benchmark. Every row in that table includes its trajectory-length ratio, tool-call accuracy, and cost-per-successful-task. We don't average across axes — averaging hides exactly the failure modes that matter.
What the literature gets right and wrong
AgentBench (Tsinghua + collaborators, arXiv:2308.03688) is the methodological anchor. It introduced eight task families with trajectory-level grading and ground-truth annotation. The weakness: every task family is synthetic. There's no equivalent of the half-malformed JSON your CRM partner returns at 2am. τ-bench (Sierra, arXiv:2406.12045) goes the other way — real customer-service workflows with a simulated user — but it's English-only, customer-service-tilted, and the user simulation itself becomes a model dependency you can't fully audit.
Terminal-Bench (tbench.ai, 2025) and LiveBench cover code-edit and reasoning, both more honest than MMLU-style evals because they're refreshed against contamination. The Princeton "AI Agents That Matter" working paper (2024) is the strongest meta-critique we've seen: it argues that most agent benchmarks reward verbose trajectories and underweight cost. We agree. Our rubric addresses both gaps. The buyer-side framing of how this rubric maps to real automation engagements lives in our agentic automation comparison piece.
Public benchmark. Strength: clean ground truth, reproducible, scoreable on a leaderboard. Weakness: doesn't model real tool failure modes, no rate-limit pressure, no ambiguous user input. Useful for model selection at the top of the funnel; not enough to ship.
Custom rubric. Strength: scores recovery-after-error and refusal calibration on traces drawn from real workflows. Weakness: ground-truth annotation costs more, gold sets need quarterly refresh, scores are not directly comparable across teams. Best used in parallel with a public benchmark, not instead of one.
The 100-task suite: families, annotation, and ground truth
We split the 100 tasks across five families: 25 data-extraction (PDF + structured-source agents), 20 scheduling (multi-calendar, conflict resolution), 20 retrieval (hybrid RAG over a 1,840-doc corpus we also use in our RAG benchmark), 20 code-edit (small repo, AST-aware), 15 multi-tool composition (CRM + payment + ticketing chained). Each family gets its own gold annotation rubric because the failure modes differ: scheduling agents fail on timezone math, retrieval agents fail on context-window overflow, code-edit agents fail on AST-invalid patches.
| Family | Tasks | Annotation cost (min/task) | Headline failure mode |
|---|---|---|---|
| Data extraction | 25 | 12 | Schema drift on partial pages |
| Scheduling | 20 | 18 | Timezone + DST math |
| Retrieval | 20 | 9 | Context overflow on long traces |
| Code edit | 20 | 22 | AST-invalid patches |
| Multi-tool composition | 15 | 35 | Mid-trajectory tool failure cascade |
Gold annotation is the expensive part. A senior engineer spends 12 to 35 minutes per task writing the optimal trajectory and the acceptable variants. We tried LLM-generated gold sets first. They were only fair on the easy families and worse than coin-flip on multi-tool composition. Human annotation is the cost we can't engineer away yet. The rubric earns its keep because that annotation work scores every model release for the next two quarters. Amortised, it's cheap.
Why five families and not fifteen? Diminishing returns kick in fast. After our second-quarter run we tried adding a sixth family (long-running research agents) and a seventh (image-aware agents). The new families surfaced two specific failure modes the existing five missed, but the per-family noise stayed above the model-difference signal for the first 30 tasks. We pulled them out of the headline suite and run them as quarterly supplements instead. If your buyer use case lives in one of those edge families, build a focused supplement. Don't dilute the headline.
Ground truth has a half-life. Scheduling tasks written against a 2025 calendar API need refresh when the API revs. Code-edit tasks tied to a specific repo go stale as the repo evolves. We re-validate a rotating fifth of the gold set every quarter, so each task is checked at least once every five quarters. The CrewAI and LangGraph examples we use in tool-orchestration training get the same treatment: rebuild the gold trajectory if the framework's tool-call shape changes between major versions.
Recovery-after-error: the metric no one publishes
Production agents fail mid-task on a meaningful share of runs — typical-shape rates land in the high-twenties to mid-thirties range across the agents we audit. Rate limits, malformed tool responses, ambiguous user follow-ups, transient 5xx from a partner API. The reliability number that matters isn't whether the happy path works — every model passes the happy path. It's what happens after the error response: does the agent recover, retry with adjustment, escalate, or spiral into a loop chasing the same broken call.
We score each branch outcome against the gold-annotated correct response. Retry-with-jitter on a rate-limit is correct; retry-immediately is a fail. Switching to a fallback tool on a malformed payload is usually correct; switching on a rate-limit is wasteful. Asking the user on a clearly recoverable error is over-escalation. Giving up cleanly with a logged failure is acceptable; giving up silently is the worst outcome (production debugging nightmare). The four-branch shape generalises across our task families.
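The branch rules above can be written down as a small lookup table. The sketch below is one possible encoding in Python, not the harness's `branch_outcome` scorer: the error and action labels, the verdict names, and the `needs_review` fallback are all assumptions made for illustration.

```python
# (error type, agent action) -> verdict, following the rules described above.
VERDICTS = {
    ("rate_limit", "retry_with_jitter"): "correct",
    ("rate_limit", "retry_immediately"): "fail",
    ("rate_limit", "fallback_tool"): "wasteful",
    ("malformed_payload", "fallback_tool"): "correct",  # "usually correct" in the text
}

# Verdicts that depend only on the action, assuming the error was recoverable.
ACTION_DEFAULTS = {
    "ask_user": "over_escalation",
    "give_up_logged": "acceptable",
    "give_up_silent": "worst",
}

def grade_branch(error_type, action):
    """Return the verdict label for one injected error and the agent's response."""
    verdict = VERDICTS.get((error_type, action))
    if verdict is not None:
        return verdict
    # Anything the table does not cover goes to a human annotator.
    return ACTION_DEFAULTS.get(action, "needs_review")

print(grade_branch("rate_limit", "retry_with_jitter"))
print(grade_branch("http_5xx", "give_up_silent"))
```

Keeping the table explicit means a disputed verdict is a one-line change that shows up in review, rather than logic buried in a grader.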
Trajectory length and the "agent that does 80 steps" trap
Long trajectories aren't intelligence. They're cost. An agent that decomposes a three-step task into eighty steps is leaking dollars and latency on every run. The fix is to score trajectory length against a manually annotated optimal path. We normalise: an agent that uses 1.0× the optimal tool calls scores 100; 1.5× scores 85; 2× scores 65; 3× and over scores 20. Under-decomposition (skipping a required subgoal) is also penalised because shortcuts that happen to work on the gold set don't generalise.
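A minimal Python sketch of that normalisation follows. The four anchor points come from the paragraph above; the linear interpolation between them is an assumption, and so is the choice to clamp ratios below 1.0 to 100 and leave under-decomposition to checkpoint scoring.

```python
# (ratio of actual to optimal tool calls, score) anchors from the text.
ANCHORS = [(1.0, 100.0), (1.5, 85.0), (2.0, 65.0), (3.0, 20.0)]

def trajectory_score(actual_calls, optimal_calls):
    """Score trajectory length against the annotated optimal path (0-100)."""
    ratio = actual_calls / optimal_calls
    if ratio <= ANCHORS[0][0]:
        return ANCHORS[0][1]   # assumption: shortcuts are penalised elsewhere
    if ratio >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]  # 3x and over scores 20
    # Assumption: interpolate linearly between neighbouring anchors.
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= ratio <= x1:
            return y0 + (y1 - y0) * (ratio - x0) / (x1 - x0)

print(trajectory_score(2, 2), trajectory_score(3, 2), trajectory_score(14, 3))
```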
Notice that no single layer scores all six axes. Multi-step agent eval requires per-step grading across orchestration, tool execution, and model layers: completion and refusal calibration are model properties, tool-call accuracy and cost-per-task are execution properties, trajectory length and recovery are orchestration properties. If your harness only instruments one layer, you're missing four axes. For LangGraph users, evaluation hooks into the same OpenTelemetry spans the orchestrator already emits, so the scoring runner reads the same trace store that production agent monitoring writes to. We standardised on OpenTelemetry across the stack so the same span shows up in Langfuse and our custom scoring runner. Production agent monitoring picks up where the offline rubric ends: same axes, live traffic, alert thresholds tuned per workflow. The orchestration patterns we use in production are covered in our LangGraph orchestration walkthrough.
Refusal calibration: false-refusal and over-completion
Over-refusing is just as bad as under-refusing. An agent that says "I can't help with that" to every ambiguous request is technically safe and operationally useless. An agent that charges ahead and books a flight to the wrong city because the user said "Portland" without specifying Maine or Oregon is the other failure. Calibration is a two-axis score: false-refusal rate (declined when it should've completed) and over-completion rate (charged ahead when it should've asked).
Our gold set tags each task as "should complete", "should ask", or "should refuse". A calibrated agent gets all three roughly right. In the 2026-Q3 run, Claude Opus 4.7 had the lowest combined miscalibration shape (single-digit false-refusal and a low single-digit over-completion band). GPT-5 was tighter on refusal but worse on over-completion (very low single-digit refusal, low-teens over-completion). GPT-5-mini was the most over-eager (near-zero refusal but a high-teens over-completion band). The mini-model speed comes with a calibration tax. The published benchmark has the full per-model breakdown.
There's a subtle scoring gotcha we tripped on twice in v0.1 of the harness. A model that refuses on every task will score zero on over-completion and look great on a single-axis chart. The two-axis score guards against that, but only if the gold set has a real mix of should-complete, should-ask, and should-refuse tasks. We aim for a majority-should-complete mix with smaller should-ask and should-refuse slices in the calibration subset. Skew that mix and the calibration numbers stop generalising. Workflows with stricter compliance needs (legal review, medical advice) should push the should-refuse share higher to surface false-completes the model would only attempt under pressure.
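The refuse-everything gotcha is easy to reproduce. In the Python sketch below the row fields (`label`, `action`) and the choice of denominators are assumptions: false-refusal is taken over should-complete tasks, over-completion over should-ask and should-refuse tasks. It is not the harness's `three_class` scorer.

```python
def calibration_rates(rows):
    """Return (false_refusal_rate, over_completion_rate) as fractions."""
    should_complete = [r for r in rows if r["label"] == "should_complete"]
    should_hold = [r for r in rows if r["label"] in ("should_ask", "should_refuse")]
    if not should_complete or not should_hold:
        raise ValueError("calibration subset needs a mix of task labels")
    false_refusal = sum(r["action"] == "refuse" for r in should_complete) / len(should_complete)
    over_completion = sum(r["action"] == "complete" for r in should_hold) / len(should_hold)
    return false_refusal, over_completion

# Illustrative subset: majority should-complete, smaller ask and refuse slices.
labels = ["should_complete"] * 6 + ["should_ask"] * 2 + ["should_refuse"] * 2

# A model that refuses every task looks perfect on over-completion alone.
refuse_everything = [{"label": label, "action": "refuse"} for label in labels]
rates = calibration_rates(refuse_everything)
print(rates)
```

Reading both numbers together is what catches this model: zero over-completion, but a false-refusal rate of 100% on the should-complete slice.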
System prompt tuning shifts these numbers more than people expect. The same Sonnet 4.6 model with a permissive system prompt landed in the low-teens over-completion band. With a structured "ask first when ambiguous" prompt it dropped into the mid-single-digit band. Refusal calibration is not a static model property. It's a model-plus-prompt property, and the benchmark only generalises if you publish the prompt alongside the score. We commit the prompts that produced each row of the dated benchmark to the public ai-eval-harness repo. If a vendor reports calibration without the prompt, treat the number as marketing. The same goes for completion-rate numbers stripped of their tool schema and their retry budget configuration. Reproducibility is the whole point of a benchmark, and reproducibility means every input the agent saw, not just the model id.
Cost-per-successful-task: the headline number
RAG benchmarks report recall@k. Code benchmarks report pass@1. Agent benchmarks should report cost-per-successful-task. The formula is simple: total API + tool spend / count of tasks that completed every checkpoint. Failed tasks still cost money — that goes in the numerator. Partial completions don't count toward the denominator. The result is the only number that maps a model choice to a per-workflow operating expense.
That spread matters. A workflow at 10,000 successful tasks per month spends in the low-thousands of dollars on Opus, low-hundreds on Sonnet, and roughly a hundred dollars on mini per month. The buyer's question isn't which model is best, it's which model is best for this workflow at this margin. Sonnet 4.6 wins our default routing for most production agents. Opus 4.7 wins regulated and high-stakes routes. Mini wins high-volume read-only retrieval. The rubric makes that decision tractable instead of vibes-based.
Cost-per-successful-task also exposes the hidden tax of retry budgets. An agent configured with three retries per tool will, on a malformed-payload run, spend up to four times the happy-path cost of that call (the first attempt plus three retries) before giving up. If recovery is poor, those wasted calls compound into a real bill. We've seen production agents where the failed-task spend exceeded the successful-task spend during a downstream outage. That's the number to monitor in your observability dashboard, not just throughput. We instrument every Langfuse trace with a per-task cost tag so the daily report shows successful-cost and wasted-cost as separate lines.
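The formula and the successful-versus-wasted split described above fit in a few lines of Python. The field names (`cost_usd`, `passed_all_checkpoints`) and the sample figures are illustrative assumptions, not the harness's output schema or benchmark data.

```python
def cost_per_successful_task(runs):
    """Total spend (failed runs included) divided by runs that passed every checkpoint."""
    total_spend = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["passed_all_checkpoints"])
    return total_spend / successes if successes else float("inf")

def wasted_spend(runs):
    """Spend on runs that did not pass every checkpoint."""
    return sum(r["cost_usd"] for r in runs if not r["passed_all_checkpoints"])

# Illustrative runs: two successes and one failure that burned retries.
runs = [
    {"cost_usd": 0.10, "passed_all_checkpoints": True},
    {"cost_usd": 0.10, "passed_all_checkpoints": True},
    {"cost_usd": 0.20, "passed_all_checkpoints": False},
]

unit_cost = cost_per_successful_task(runs)
wasted = wasted_spend(runs)
print(unit_cost, wasted)
```

Here the failed run doubles the unit cost from 0.10 to 0.20 per successful task, even though it never appears in the denominator.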
How the harness scores all of this
The scoring runner is a config-driven module in ai-eval-harness (open-source, paiteq/ai-eval-harness). v0.1 ships the scaffold and the RAG modules; v0.2 lands the agent rubric module that wires the six axes; v0.3 publishes the 100-task suite to HuggingFace. The config block is plain YAML so non-engineers can tweak weights per workflow without touching code.
# ai-eval-harness v0.2: agent reliability rubric config
# Workflow-specific weights. Sum of weights = 1.0.
# Run: ai-eval-harness run --config agent_reliability.yaml --suite suite-2026-q3.jsonl
suite:
  name: agent-reliability-2026-q3
  tasks_file: suite-2026-q3.jsonl  # 100 tasks across 5 families
  gold_dir: gold/2026-q3/          # one .json per task with optimal trajectory
models:
  - id: claude-opus-4-7
    provider: anthropic
    max_tool_retries: 3
  - id: claude-sonnet-4-6
    provider: anthropic
    max_tool_retries: 3
  - id: gpt-5
    provider: openai
    max_tool_retries: 3
  - id: gpt-5-mini
    provider: openai
    max_tool_retries: 3
rubric:
  # Six axes. Adjust weights per workflow.
  completion_rate:      { weight: 0.25, scorer: checkpoint_pass }
  trajectory_length:    { weight: 0.15, scorer: optimal_ratio, over_penalty: 0.6 }
  tool_call_accuracy:   { weight: 0.20, scorer: argument_match }
  recovery_after_error: { weight: 0.20, scorer: branch_outcome }
  refusal_calibration:  { weight: 0.10, scorer: three_class }
  cost_per_successful:  { weight: 0.10, scorer: usd_per_pass }
observability:
  exporter: otel
  endpoint: https://otel.langfuse.com
  service_name: agent-reliability-2026-q3
output:
  format: jsonl
  publish_to: benchmarks/agent-reliability-2026-q3/
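Because non-engineers edit the weights, a sanity check before a run is cheap insurance. The Python sketch below copies the weights from the config above into a plain dict. The `weights_are_valid` helper is hypothetical; whether the harness performs a check like this is not something the config tells us.

```python
import math

# Weights copied from the rubric block above.
rubric_weights = {
    "completion_rate": 0.25,
    "trajectory_length": 0.15,
    "tool_call_accuracy": 0.20,
    "recovery_after_error": 0.20,
    "refusal_calibration": 0.10,
    "cost_per_successful": 0.10,
}

def weights_are_valid(weights):
    """True if every weight is non-negative and they sum to 1.0 within float tolerance."""
    non_negative = all(w >= 0 for w in weights.values())
    return non_negative and math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9)

# A hand edit that bumps one weight without rebalancing the others.
edited = dict(rubric_weights, completion_rate=0.35)
print(weights_are_valid(rubric_weights), weights_are_valid(edited))
```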
Each task is a JSONL row with the user goal, the available tools, the gold trajectory, and the acceptable variants. The format is intentionally close to AgentBench's so existing task sets port with a small adapter. Here's a real task from the scheduling family — light enough to read in this margin, complete enough to score against:
{
  "id": "sched-014",
  "family": "scheduling",
  "user_goal": "Book a 30-minute meeting with Sara next Tuesday afternoon. Prefer 2-3pm her local time.",
  "context": {
    "user_tz": "America/Los_Angeles",
    "sara_tz": "Europe/Berlin",
    "sara_calendar_window": "2026-10-13T13:00:00+02:00/2026-10-13T17:00:00+02:00"
  },
  "available_tools": ["calendar.read", "calendar.write", "user.ask"],
  "gold_trajectory": [
    { "tool": "calendar.read", "args": { "user": "sara", "date": "2026-10-13" } },
    { "tool": "calendar.write", "args": { "start": "2026-10-13T14:00:00+02:00", "duration_min": 30, "attendees": ["sara", "user"] } }
  ],
  "acceptable_variants": [
    "any slot in [14:00,15:00,15:30] Sara local on 2026-10-13",
    "explicit user.ask call if both 14:00 and 15:00 are booked"
  ],
  "checkpoints": [
    "reads sara calendar in correct tz",
    "writes invite in sara local time",
    "duration matches user goal"
  ],
  "refusal_label": "should_complete",
  "optimal_tool_calls": 2,
  "max_acceptable_tool_calls": 4
}
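One of the checks a grader could run against this task is whether the written slot falls inside Sara's calendar window once offsets are taken into account. The Python sketch below uses only the standard library. The `slot_fits_window` helper is hypothetical and covers a single checkpoint, not the harness's full scoring.

```python
from datetime import datetime, timedelta

def slot_fits_window(start_iso, duration_min, window):
    """True if the meeting starts and ends inside an ISO 'start/end' window."""
    window_start, window_end = (datetime.fromisoformat(p) for p in window.split("/"))
    start = datetime.fromisoformat(start_iso)
    end = start + timedelta(minutes=duration_min)
    # Offset-aware datetimes compare as instants, so mixed offsets are handled.
    return window_start <= start and end <= window_end

window = "2026-10-13T13:00:00+02:00/2026-10-13T17:00:00+02:00"

# The gold write: 14:00 in Sara's offset.
gold_ok = slot_fits_window("2026-10-13T14:00:00+02:00", 30, window)

# Same wall-clock time written with the user's Pacific offset: 23:00 for Sara.
wrong_offset_ok = slot_fits_window("2026-10-13T14:00:00-07:00", 30, window)
print(gold_ok, wrong_offset_ok)
```

The second call is the timezone failure the scheduling family is built to catch: the digits look right, but the instant is nine hours off.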
The runner replays every model against every task, records the full trajectory in OpenTelemetry, scores against checkpoints + variants, and writes a JSONL row per (model, task). The output drops straight into the dated benchmark page. We don't pre-filter or clean the trajectories. Buyers should see the raw scoring data, including the embarrassing rows.
A few harness-engineering notes that took us longer than we'd like to figure out. Concurrency: agents that talk to real tools need rate-limit awareness during eval runs, not just production. We default to 4 parallel agents and back off on 429s. Determinism: set temperature to 0 where the model supports it, but record the full sampler config in the JSONL output so re-runs are reproducible. Cost tracking: pull the actual token counts from the provider response, not your estimate. Anthropic and OpenAI report cached and uncached tokens differently, and getting the cost-per-task wrong by a meaningful fraction is easy if you trust your own estimator. Trace fidelity: store the raw tool-call arguments alongside the scored result so you can re-grade a task months later without re-running the model.
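For the concurrency note, a backoff wrapper is the piece most teams end up writing. The Python sketch below assumes exponential backoff with full jitter. The `RateLimited` exception and `call_with_backoff` helper are hypothetical names, and the harness's actual retry policy may differ.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical exception a tool wrapper raises on an HTTP 429."""

def call_with_backoff(fn, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying on RateLimited with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the failure
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))

# Demo: a tool that is rate-limited twice, then succeeds.
attempts = {"count": 0}

def flaky_tool():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return "ok"

delays = []  # capture the delays instead of actually sleeping
result = call_with_backoff(flaky_tool, sleep=delays.append)
print(result, len(delays))
```

Injecting `sleep` keeps the eval suite fast in tests while the production path still waits for real.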
Reading the 2026-Q3 agent reliability benchmark
The full results live at /benchmarks/agent-reliability-2026-q3/ with per-family per-model per-axis breakdowns. The TL;DR for a buyer scanning it for the first time: don't read the headline completion column. Read the recovery and refusal columns. That's where the models separate. Claude Opus 4.7 leads on recovery (mid-80s band vs GPT-5 sitting in the low-to-mid-70s). Sonnet 4.6 is the value-pick at a high-70s recovery band and a fraction of the cost. GPT-5 wins code-edit completion but spirals more on malformed tool responses.
Sub-metrics interact in non-obvious ways. A model with a strong refusal posture often pays for it on completion rate, because it asks more clarifying questions on borderline tasks. A model with a tight cost-per-task often pays for it on recovery, because it cuts retries to save tokens. The honest read on the 2026-Q3 numbers is that no single model dominates all six axes. Opus 4.7 leads on three, Sonnet 4.6 leads on cost-adjusted versions of two, GPT-5 leads on code-edit completion, and GPT-5-mini leads on raw unit cost. Routing across models is the only setup that gets you all four leaders on the workflows that matter.
Look for the variance column when you read the benchmark page, not just the mean. We publish standard deviation per axis per model. A model with a high-70s mean recovery and a tight low-single-digit sigma is more predictable than one with a low-80s mean and a double-digit sigma. Predictability matters at the production tier because it determines how much retry budget and HITL safety net you need to provision. The buyer cost equation is total expected spend plus variance budget, not just mean spend.
| Workflow shape | Claude Opus 4.7 | Claude Sonnet 4.6 | GPT-5 | GPT-5-mini |
|---|---|---|---|---|
| Regulated workflow (legal, finance, healthcare) where refusal calibration matters most | Yes: best fit | Considered with strict prompt | Risky: over-completion rate | No: highest over-completion |
| High-volume retrieval (read-only RAG agents) | Overspend | Strong fit | Considered | Yes: best unit cost |
| Code-edit / developer-tool agent | Considered | Considered | Yes: code-edit edge | No: AST recovery weak |
| Multi-tool composition (CRM + payment + ticket) | Yes: best recovery | Strong fit | Risky on cascades | No: spirals on cascades |
| Customer-service handoff agent (English) | Overspend | Yes: best fit | Considered | Risky: refusal calib weak |
Common eval-design mistakes to avoid
We've reviewed a lot of internal agent evals in client audits. The same five mistakes show up. None of them are deep — they're failure-by-default, the kind that happens when the team copies a leaderboard and calls it done.
Mistake one (single-judge) is the most common and easiest to fix. Mistake four (no error injection) is the most expensive to skip. It's how a strong-looking completion-rate agent ends up paging on-call in week three. Mistake five (no gold trajectory) is the one most teams resist because annotation is real engineering work. It pays back across two quarters of model releases.
Two more failure modes show up less often but are worth flagging. First, scoring on the final state of the system instead of the trajectory: the agent updated the record correctly, so it passes, even though it did dozens of tool calls and burned a multiple of the optimal-path spend to get there. Final-state-only scoring rewards exactly the wrong behaviour and is the easiest mistake to inherit from a CRUD test suite. Second, treating the eval as a one-shot model bake-off instead of a regression gate. We've seen teams run a clean eval at launch, ship, and never re-run. Six weeks later a model provider pushes a silent update, recovery drops by a high-single-digit band, and no one notices until the support inbox does.
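A regression gate can be as small as a comparison between the last accepted run and the current one. The Python sketch below is illustrative: the axis names mirror the rubric, the scores and the 2-point tolerance are made up, and it only covers axes where higher is better.

```python
def find_regressions(baseline, current, tolerance=2.0):
    """Return {axis: drop} for every axis that fell by more than the tolerance."""
    return {
        axis: baseline[axis] - current[axis]
        for axis in baseline
        if baseline[axis] - current[axis] > tolerance
    }

# Illustrative scores only (0-100, higher is better).
baseline = {"completion_rate": 91.0, "recovery_after_error": 84.0, "tool_call_accuracy": 88.0}
current = {"completion_rate": 90.5, "recovery_after_error": 76.0, "tool_call_accuracy": 87.0}

drops = find_regressions(baseline, current)
print(drops)
```

In this made-up run the completion rate barely moves while recovery drops eight points, which is exactly the silent-update shape a completion-only check would miss.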
TL;DR for engineering leads
Yes, you need a rubric. Completion-rate alone is press-release math. Score six axes: completion, trajectory length, tool-call accuracy, recovery-after-error, refusal calibration, cost-per-successful-task. Weight them per workflow. Annotate gold trajectories for at least your top five families. Use a public benchmark for sanity, your rubric for the decision. The harness we use is open-source at github.com/paiteq/ai-eval-harness; the dated quarterly run is at /benchmarks/agent-reliability-2026-q3/.
If you only do one thing this quarter: add recovery-after-error scoring to your existing eval. It's the single highest-impact change. Inject five rate-limits, five malformed payloads, five 5xx responses across your task set, and score the agent's branch outcome. You'll see model differences your current numbers are hiding. From there, the rest of the rubric is incremental.
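A starting point for that injection step is a deterministic plan that assigns each fault to a task. The Python sketch below is an assumption-laden illustration: the fault labels, the `plan_injections` helper, and the seed-based sampling are ours for this example, and wiring the plan into your tool layer is left to your harness.

```python
import random
from collections import Counter

# Five of each fault type, as suggested above.
FAULTS = ["rate_limit"] * 5 + ["malformed_payload"] * 5 + ["http_5xx"] * 5

def plan_injections(task_ids, seed=0):
    """Map 15 distinct task ids to one fault each, reproducibly for a given seed."""
    rng = random.Random(seed)
    chosen = rng.sample(task_ids, len(FAULTS))
    return dict(zip(chosen, FAULTS))

task_ids = [f"task-{i:03d}" for i in range(100)]
plan = plan_injections(task_ids)
print(Counter(plan.values()))
```

Fixing the seed means the same tasks get the same faults on every run, so a score change reflects the model rather than the injection lottery.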
If you have a quarter of runway: stand up the full six-axis rubric, annotate 40 gold trajectories across your top two families, and run it against three candidate models. Budget 8-12 engineer-days for annotation and harness wiring. You'll come out with the cost-per-successful-task math that decides your default routing, and you'll have a regression suite to catch model-version drift the next time a provider ships an update. The teams we've seen succeed treat this as a one-time capital investment that compounds across every release.
The reliability rubric is not the moat. Anyone with senior engineers and a corpus can build it. The moat is the operating discipline: actually running it on every release, actually rolling back when a number regresses, actually retiring tasks when their gold trajectory goes stale. We've seen well-funded teams build excellent harnesses and then stop running them after the launch. Six months later they're back to vibes-based routing. The harness only pays back if it's the gate, not the project.
What's the difference between an AI agent benchmark and an agent reliability rubric?
A public AI agent benchmark (LiveBench, AgentBench, τ-bench, Terminal-Bench) scores models on a fixed task set with a published methodology. A reliability rubric is your scoring contract: the six axes, the weights, and the gold annotations for your workflows. Use the benchmark for model triage, the rubric for the routing decision. Most teams need both.
Why isn't completion rate enough on its own?
An agent that completes a task in dozens of tool calls when a handful would do still scores at the top on completion. It can also cost an order of magnitude more, page your on-call, and double-charge a customer because retry handling was wrong, all without denting the completion score. Completion rate measures the happy path. Recovery, refusal calibration, and cost-per-successful-task measure what production actually does to the agent.
How big should the task suite be?
100 tasks is our default — enough to span five families with 15-25 tasks each, small enough that gold annotation stays tractable. We've seen useful internal suites at 40 tasks (single family) and 250 (multi-product agent platforms). Below 40, the per-family signal is too noisy; above 300, annotation cost compounds faster than the insight.
Can the gold trajectories be LLM-generated?
We tried. On easy families (data extraction, retrieval) LLM-generated gold sets were only fair. On multi-tool composition they were worse than coin-flip. Use an LLM to draft, a senior engineer to review and correct. Plan 12-35 minutes of human time per task, which amortises across two quarters of model releases.
Where does this fit relative to LangSmith, Langfuse, or Braintrust?
Those are observability + trace platforms — they capture what the agent did. The reliability rubric is the scoring contract — how to judge what the agent did. We pipe OpenTelemetry traces from LangGraph to Langfuse for trace storage, then run the ai-eval-harness scoring module against the trace store. The platforms are complementary, not alternatives.
How often should we re-run the benchmark?
Quarterly minimum. Model providers ship meaningful updates roughly every 6-10 weeks. Tool-call accuracy and refusal calibration are the most volatile axes — they can shift by a mid-single-digit to low-double-digit band between model versions without notice. Cost-per-task drifts on pricing changes. Recovery numbers move slowly. We run the full 100-task suite each quarter and publish the dated page.