Agent reliability benchmark, 2026-Q3
100 tasks. Tool use, multi-step, error recovery.
A dated benchmark of agent reliability: 100 tasks covering tool-calling, multi-step execution, and error recovery. We report pass@1, pass@5, mean steps, and mean cost per task across Claude, GPT, Gemini, and an open-source baseline. Results are targeted for 2026-09.
Quality and cost on the same axis. Pass@1 alone is half the story.
An agent that solves 90% of tasks at $0.50 each is not the same as one that solves 92% at $4.00 each. We report quality and cost side-by-side on every dated run.
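One way to see why the two agents above differ is to divide cost per attempted task by the pass rate, which gives the cost of each solved task. The sketch below is illustrative arithmetic only: `cost_per_solved_task` is a hypothetical helper and a derived figure, not one of the metrics this benchmark reports.

```python
def cost_per_solved_task(pass_rate: float, cost_per_attempted_task: float) -> float:
    """Dollars spent per successfully solved task (hypothetical derived figure)."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempted_task / pass_rate


# The two agents from the paragraph above: (pass rate, cost per attempted task).
agents = {"agent_a": (0.90, 0.50), "agent_b": (0.92, 4.00)}

for name, (pass_rate, cost) in agents.items():
    print(f"{name}: ${cost_per_solved_task(pass_rate, cost):.2f} per solved task")
# agent_a: $0.56 per solved task
# agent_b: $4.35 per solved task
```

Under this framing, the two extra points of pass rate cost roughly eight times as much per solved task, which is the trade-off the side-by-side reporting is meant to expose.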
Pass@1
Did the agent succeed on the first attempt? Strict scoring against a deterministic gold answer per task.
Pass@5
Did the agent succeed within five attempts? Useful where retries are cheap (e.g. background workers).
Mean steps
Tool calls + reasoning turns per successful run. Separates efficiency from raw success rate.
Cost per task
Total tokens × API list price, averaged across the task set. Computed from the same dated run as the quality numbers.
Recovery rate
Of tasks where a tool call failed mid-execution, how often did the agent recover without human input? Borrows the structure of AgentBench (Liu et al. 2023, arXiv:2308.03688).
Latency p95
End-to-end wall-clock time, p95 across the full task set including tool round-trips.
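The six metrics above can be sketched as one pass over per-attempt records. This is a minimal illustration, not the harness's implementation: the `Attempt` record, its field names, and `summarize` are hypothetical. It also assumes that cost, latency, and recovery rate are computed over first attempts only, which the definitions above do not specify, and that pass@5 is the plain "any success within five attempts" reading rather than a statistical estimator.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:  # hypothetical record shape, one row per attempt
    task_id: str
    attempt: int       # 1-based attempt index
    success: bool
    steps: int         # tool calls + reasoning turns
    cost_usd: float    # tokens x list price for this attempt
    latency_s: float   # end-to-end wall-clock seconds
    tool_failed: bool  # a tool call failed mid-execution


def pass_at_k(attempts: list[Attempt], k: int) -> float:
    """Fraction of tasks with at least one success in attempts 1..k."""
    solved: dict[str, bool] = defaultdict(bool)
    for a in attempts:
        solved[a.task_id] |= a.success and a.attempt <= k
    return sum(solved.values()) / len(solved)


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(attempts: list[Attempt]) -> dict[str, float]:
    # Assumption: cost, latency, and recovery are measured on first attempts.
    first = [a for a in attempts if a.attempt == 1]
    successes = [a for a in attempts if a.success]
    tool_failures = [a for a in first if a.tool_failed]
    return {
        "pass@1": pass_at_k(attempts, 1),
        "pass@5": pass_at_k(attempts, 5),
        "mean_steps": sum(a.steps for a in successes) / len(successes),
        "cost_per_task": sum(a.cost_usd for a in first) / len(first),
        "recovery_rate": sum(a.success for a in tool_failures) / len(tool_failures),
        "latency_p95": percentile_nearest_rank([a.latency_s for a in first], 95),
    }


# Made-up sample data, for illustration only.
SAMPLE = [
    Attempt("t1", 1, True, 4, 0.10, 2.0, False),
    Attempt("t2", 1, False, 6, 0.20, 5.0, True),
    Attempt("t2", 2, True, 5, 0.15, 4.0, False),
    Attempt("t3", 1, True, 7, 0.30, 8.0, True),
    Attempt("t4", 1, False, 9, 0.40, 3.0, False),
]

for metric, value in summarize(SAMPLE).items():
    print(f"{metric}: {value:.3f}")
```

The sketch raises `ZeroDivisionError` when there are no successes or no tool failures; a real report would need to decide how to present those empty cases.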
100 tasks, four categories.
Synthetic + permissively-licensed real-world.
Tasks are drawn from the patterns we see most often on agent engagements. Each task has a deterministic success criterion so scoring isn't subjective.
Information retrieval
Multi-step lookup across structured + unstructured sources. Success = correct answer + correct citation. Each query has a deterministic gold answer.
Workflow automation
Tool-using sequences (CRM update, ticket triage, document processing). Success = correct end-state on the test environment, verified by direct DB inspection.
Code + analysis
Read a repo or dataset, answer a question, optionally write a small fix. Success = passing test or correct numeric answer within tolerance.
Error recovery
One or more tool calls are rigged to fail. Success = the agent recovers without escalating to the user and completes the original task.
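The error-recovery criterion can be sketched as a fault-injecting tool wrapper plus a deterministic check on the run trace. Everything here is hypothetical and for illustration only: `rig_to_fail`, `RunTrace`, `toy_agent`, and `score_error_recovery` are not part of the harness, and the retry loop stands in for whatever recovery strategy a real agent uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ToolError(RuntimeError):
    """Raised by a rigged tool to simulate a mid-execution failure."""


def rig_to_fail(tool: Callable[[str], str], fail_on_calls: set[int]) -> Callable[[str], str]:
    """Wrap a tool so that the listed call numbers (1-based) raise ToolError."""
    calls = 0

    def wrapper(query: str) -> str:
        nonlocal calls
        calls += 1
        if calls in fail_on_calls:
            raise ToolError(f"injected failure on call {calls}")
        return tool(query)

    return wrapper


@dataclass
class RunTrace:  # hypothetical trace of one agent run
    final_answer: Optional[str] = None
    tool_failures: int = 0
    asked_user: bool = False


def toy_agent(lookup: Callable[[str], str], query: str, max_retries: int = 2) -> RunTrace:
    """Stand-in agent: retries the tool, then escalates to the user."""
    trace = RunTrace()
    for _ in range(max_retries + 1):
        try:
            trace.final_answer = lookup(query)
            return trace
        except ToolError:
            trace.tool_failures += 1
    trace.asked_user = True  # gave up and broke out to the user
    return trace


def score_error_recovery(trace: RunTrace, gold_answer: str) -> bool:
    """Pass only if a failure occurred, nobody was asked, and the task was completed."""
    return (
        trace.tool_failures > 0
        and not trace.asked_user
        and trace.final_answer == gold_answer
    )


facts = {"capital of France": "Paris"}

# First call fails, the retry succeeds: counts as a recovery.
recovered = toy_agent(rig_to_fail(facts.__getitem__, {1}), "capital of France")
print(score_error_recovery(recovered, "Paris"))  # True

# Every call fails, so the agent escalates: does not count.
escalated = toy_agent(rig_to_fail(facts.__getitem__, {1, 2, 3}), "capital of France")
print(score_error_recovery(escalated, "Paris"))  # False
```

Requiring `tool_failures > 0` keeps a run that never hit the injected fault from being counted as a recovery; whether the benchmark scores it that way is an assumption of this sketch.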
Same four families as the RAG benchmark.
Same rubric shape. Same publishing cadence.
The same four model families are scored on the same harness. Where a model shines on agent tasks but underperforms on RAG, reading the two benchmarks together makes that visible.
Claude
Opus 4.7 and Sonnet 4.6 (current GA at run date), with tool-use enabled. Same agent scaffold and same tool registry as every other model family.
OpenAI
GPT-4o and a current cost-efficient variant. Function calling enabled. Identical task set, identical success criteria.
Gemini
Gemini 2.5 Pro and Flash, with tool use. Same scaffolding so the comparison reflects the model, not the harness.
Llama 3.1 70B
Llama 3.1 70B Instruct with a standard agent scaffold. The open-source baseline is the reference point for measuring the gap between proprietary and open models.
git clone https://github.com/paiteq/ai-eval-harness
cd ai-eval-harness
# The harness is pre-v0.2 today; the agent reliability benchmark ships with v0.2.
ai-eval run benchmarks/agent-reliability-2026-q3.yaml \
  --provider claude --model claude-opus-4-7