benchmark · 2026-Q3 · planned

Agent reliability benchmark, 2026-Q3
100 tasks. Tool use, multi-step, error recovery.

A dated benchmark on agent reliability: 100 tasks covering tool-calling, multi-step execution, and error recovery. We report pass@1, pass@5, mean steps, mean cost per task, recovery rate, and p95 latency across Claude, GPT, Gemini, and an open-source baseline. Results target 2026-09.

▸ what we measure

Quality and cost on the same axis. Pass@1 alone is half the story.

An agent that solves 90% of tasks at $0.50 each is not the same as one that solves 92% at $4.00 each. We report quality and cost side-by-side on every dated run.
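One way to read the two example agents above on a single scale is expected cost per solved task: mean cost divided by success rate. This is an illustrative derived figure, not one of the metrics this benchmark reports, and the function name is hypothetical. It assumes every attempt is billed whether or not it succeeds.

```python
def cost_per_solved_task(pass_rate: float, mean_cost_usd: float) -> float:
    """Expected spend per solved task, assuming failed attempts are billed too."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return mean_cost_usd / pass_rate


# The two agents from the paragraph above.
cheap = cost_per_solved_task(0.90, 0.50)
pricey = cost_per_solved_task(0.92, 4.00)
print(f"{cheap:.2f} {pricey:.2f}")  # prints "0.56 4.35"
```

Under that assumption, the two-point gain in success rate costs roughly eight times as much per solved task, which is why the two numbers are reported together.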

  • Pass@1

    Did the agent succeed on the first attempt? Strict scoring, deterministic gold answer per task.

  • Pass@5

    Did the agent succeed within five attempts? Useful where retries are cheap (e.g. background workers).

  • Mean steps

    Tool calls + reasoning turns per successful run. Separates efficiency from raw success rate.

  • Cost per task

    Total tokens (input and output) × API list price, averaged across the task set. Measured on the same dated run as the quality numbers.

  • Recovery rate

    Of tasks where a tool call failed mid-execution, how often did the agent recover without human input? Borrows the structure of AgentBench (Liu et al. 2023, arXiv:2308.03688).

  • Latency p95

    End-to-end wall-clock time, p95 across the full task set, including tool round-trips.
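The metrics above could be computed from per-attempt records along these lines. This is a minimal sketch, not the harness's implementation: the `Attempt` record and its field names are hypothetical, pass@k is taken as "any success within the first k attempts", recovery rate is computed over runs where a tool call failed, and p95 uses the nearest-rank method.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Attempt:  # hypothetical record shape, one per agent run
    task_id: str
    attempt: int       # 1-based attempt index
    success: bool
    steps: int         # tool calls + reasoning turns
    latency_s: float   # end-to-end wall-clock time
    tool_failed: bool  # a tool call failed mid-execution


def pass_at_k(runs, k):
    """Fraction of tasks with at least one success in the first k attempts."""
    solved = {}
    for r in runs:
        if r.attempt <= k:
            solved[r.task_id] = solved.get(r.task_id, False) or r.success
    return sum(solved.values()) / len(solved)


def mean_steps(runs):
    """Mean step count over successful runs only."""
    return mean(r.steps for r in runs if r.success)


def recovery_rate(runs):
    """Share of runs with a failed tool call that still succeeded."""
    hit = [r for r in runs if r.tool_failed]
    return sum(r.success for r in hit) / len(hit)


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


runs = [
    Attempt("t1", 1, True, 4, 10.0, False),
    Attempt("t2", 1, False, 9, 30.0, True),
    Attempt("t2", 2, True, 6, 20.0, True),
    Attempt("t3", 1, False, 12, 40.0, False),
]
print(pass_at_k(runs, 1), pass_at_k(runs, 5))
print(mean_steps(runs), recovery_rate(runs), p95([r.latency_s for r in runs]))
```

In this toy data, one of three tasks passes on the first attempt and two of three pass within five, so pass@5 can only match or exceed pass@1.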

the task set

100 tasks, four categories.
Synthetic + permissively-licensed real-world.

Tasks are drawn from the patterns we see most often in agent engagements. Each task has a deterministic success criterion so scoring isn't subjective.
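A deterministic success criterion might look like the checks below: an exact match on answer plus citation for a lookup task, and a numeric answer within tolerance for an analysis task. This is a sketch only. The function names, the whitespace and case normalization, and the default tolerance are assumptions, not the benchmark's actual rubric.

```python
import math


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(text.lower().split())


def score_retrieval(answer: str, citation: str,
                    gold_answer: str, gold_citation: str) -> bool:
    """Pass only if both the answer and the cited source match the gold record."""
    return _normalize(answer) == _normalize(gold_answer) and citation == gold_citation


def score_numeric(answer: float, gold: float, rel_tol: float = 1e-6) -> bool:
    """Pass if the answer is within a relative tolerance of the gold value."""
    return math.isclose(answer, gold, rel_tol=rel_tol)


print(score_retrieval("  Paris ", "doc-7", "paris", "doc-7"))  # True
print(score_retrieval("Paris", "doc-8", "paris", "doc-7"))     # False: wrong citation
print(score_numeric(3.1400001, 3.14))                          # True
print(score_numeric(3.2, 3.14))                                # False
```

Because each check is a pure function of the agent output and the gold record, re-scoring the same run always gives the same result.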

25 tasks

Information retrieval

Multi-step lookup across structured + unstructured sources. Success = correct answer + correct citation. Each query has a deterministic gold answer.

30 tasks

Workflow automation

Tool-using sequences (CRM update, ticket triage, document processing). Success = correct end-state on the test environment, verified by direct DB inspection.

25 tasks

Code + analysis

Read a repo or dataset, answer a question, optionally write a small fix. Success = passing test or correct numeric answer within tolerance.

20 tasks

Error recovery

One or more tool calls are rigged to fail. Success = the agent recovers without escalating to the user and completes the original task.
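Rigging a tool call to fail can be sketched as a thin wrapper that raises on chosen call numbers. Everything here is hypothetical: `RiggedTool`, the `lookup` tool, and the retry loop standing in for an agent are illustrations of the idea, not the harness's fault-injection mechanism.

```python
class RiggedTool:
    """Wraps a tool callable and raises on the chosen 1-based call numbers."""

    def __init__(self, tool, fail_on):
        self.tool = tool
        self.fail_on = set(fail_on)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on:
            raise TimeoutError(f"injected failure on call {self.calls}")
        return self.tool(*args, **kwargs)


def lookup(key):
    """Toy tool: returns an order status from an in-memory table."""
    return {"order-1": "shipped"}.get(key, "unknown")


def run_with_retry(tool, arg, max_attempts=3):
    """Stand-in for an agent that retries the tool instead of asking the user."""
    for _ in range(max_attempts):
        try:
            return tool(arg)
        except TimeoutError:
            continue
    return None  # gave up: would count as a failed recovery


rigged = RiggedTool(lookup, fail_on={1})
result = run_with_retry(rigged, "order-1")
print(result, rigged.calls)  # prints "shipped 2"
```

The first call fails by design, the second goes through, and the task still reaches its original end state, which is the behavior this category rewards.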

models under test

Same four families as the RAG benchmark.
Same rubric shape. Same publishing cadence.

All four model families run on the same harness. Where a model shines on agent tasks but underperforms on RAG, comparing the two benchmarks makes that visible.

anthropic

Claude

Opus 4.7 and Sonnet 4.6 (current GA at run date), with tool-use enabled. Same agent scaffold and same tool registry as every other model family.

openai

GPT

GPT-4o and a current cost-efficient variant. Function calling enabled. Identical task set, identical success criteria.

google

Gemini

Gemini 2.5 Pro and Flash, with tool use. Same scaffolding so the comparison reflects the model, not the harness.

open-source

Llama 3.1 70B

Llama 3.1 70B Instruct with a standard agent scaffold. This baseline anchors the gap between proprietary and open models.

reproduce-agent-reliability-2026-q3.sh bash
git clone https://github.com/paiteq/ai-eval-harness
cd ai-eval-harness

# The harness is currently pre-v0.2; the agent reliability benchmark ships with v0.2
ai-eval run benchmarks/agent-reliability-2026-q3.yaml \
  --provider claude --model claude-opus-4-7

Once results publish, the reproduction commands look approximately like this. Same scaffold, same task set, same scoring.

while you wait

See the parent benchmark + methodology.
Same harness. Same rubric shape.

The first dated benchmark in this series (RAG retrieval) and the methodology page behind it are both live now.