What does the agent reliability benchmark measure?

Six metrics on the same dated run: pass@1, pass@5, mean steps per successful task, cost per task in dollars, recovery rate on rigged tool failures, and end-to-end latency p95. Quality and cost are reported side-by-side; pass@1 alone is half the story.

Which models are tested in the 2026-Q3 agent benchmark?

Four model families on identical scaffolding: Anthropic (Claude Opus 4.7 and Sonnet 4.6), OpenAI (GPT-4o and a current cost-efficient variant), Google (Gemini 2.5 Pro and Flash), and one open-source baseline (Llama 3.1 70B Instruct). Same agent harness, same tool registry.

How is the 100-task set constructed?

Four categories: 25 information-retrieval tasks, 30 workflow-automation tasks (CRM, ticketing, document processing), 25 code-and-analysis tasks, and 20 error-recovery tasks where tool calls are rigged to fail. Each task has a deterministic gold answer so scoring is not subjective.

Can I reproduce these agent benchmark results?

Yes, once results publish in 2026-09. The harness is paiteq/ai-eval-harness (MIT) and the task set will be mirrored at huggingface.co/datasets/paiteq-ai. Run the YAML at benchmarks/agent-reliability-2026-q3.yaml with your provider key; your scores should land inside our published confidence intervals.

benchmark · 2026-Q3 · planned

Agent reliability benchmark, 2026-Q3
100 tasks. Tool use, multi-step, error recovery.

Name: Agent reliability benchmark 2026-Q3
Creator: Paiteq
Published: 2026-09-01
License: https://opensource.org/licenses/MIT

A dated benchmark on agent reliability. 100 tasks covering tool-calling, multi-step execution, and error recovery. We report pass@1, pass@5, mean steps, mean cost per task, recovery rate, and latency p95 across Claude, GPT, Gemini, and an open-source baseline. Results target 2026-09.

View the harness on GitHub How we benchmark

what we measure

Quality and cost on the same axis. Pass@1 alone is half the story.

An agent that solves 90% of tasks at $0.50 each is not the same as one that solves 92% at $4.00 each. We report quality and cost side-by-side on every dated run.
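To make the comparison above concrete, here is a minimal arithmetic sketch. It uses "cost per solved task" (cost per task divided by pass rate), which is an illustrative derived figure and not one of the six reported metrics; it assumes failed attempts cost as much as successful ones.

```python
def cost_per_solved_task(pass_rate: float, cost_per_task: float) -> float:
    """Expected spend to obtain one solved task (illustrative, not a reported metric)."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_task / pass_rate


# The two agents from the paragraph above.
cheap = cost_per_solved_task(0.90, 0.50)   # 90% at $0.50 per task
pricey = cost_per_solved_task(0.92, 4.00)  # 92% at $4.00 per task

print(f"${cheap:.2f} vs ${pricey:.2f} per solved task ({pricey / cheap:.1f}x)")
```

Under these assumptions the two-point quality gain comes at roughly eight times the spend per solved task, which is why the numbers are only meaningful together.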

Pass@1

Did the agent succeed on the first attempt? Strict scoring, deterministic gold answer per task.

Pass@5

Did the agent succeed within five attempts? Useful where retries are cheap (e.g. background workers).

Mean steps

Tool calls + reasoning turns per successful run. Anchors efficiency vs raw success rate.

Cost per task

Input and output tokens priced at each provider's API list rates, averaged across the task set. Same dated run as quality numbers.

Recovery rate

Of tasks where a tool call failed mid-execution, how often did the agent recover without human input? Borrows the structure of AgentBench (Liu et al. 2023, arXiv:2308.03688).

Latency p95

End-to-end wall-clock, p95 across the full task set including tool round-trips.
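The metrics above can be sketched from per-attempt run records. This is a rough illustration, not the harness's implementation: the `Attempt` record, the two prices, and the choice to compute steps, cost, and latency from each task's first attempt are all assumptions, and p95 uses the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Attempt:  # hypothetical record shape, not the harness schema
    success: bool
    steps: int          # tool calls + reasoning turns
    input_tokens: int
    output_tokens: int
    latency_s: float    # end-to-end wall-clock, including tool round-trips


# Placeholder list prices in dollars per million tokens (not real provider prices).
PRICE_IN, PRICE_OUT = 3.00, 15.00


def attempt_cost(a: Attempt) -> float:
    return (a.input_tokens * PRICE_IN + a.output_tokens * PRICE_OUT) / 1_000_000


def pass_at_k(runs: dict[str, list[Attempt]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    solved = sum(any(a.success for a in attempts[:k]) for attempts in runs.values())
    return solved / len(runs)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


# Toy data: four tasks, keyed by task id.
runs = {
    "t1": [Attempt(True, 4, 1000, 200, 2.0)],
    "t2": [Attempt(False, 9, 3000, 500, 6.0), Attempt(True, 6, 2000, 400, 4.0)],
    "t3": [Attempt(False, 12, 4000, 1000, 9.0)] * 5,
    "t4": [Attempt(True, 2, 500, 100, 1.0)],
}

first = [attempts[0] for attempts in runs.values()]
ok_steps = [a.steps for a in first if a.success]

pass1 = pass_at_k(runs, 1)
pass5 = pass_at_k(runs, 5)
mean_steps = sum(ok_steps) / len(ok_steps)
mean_cost = sum(attempt_cost(a) for a in first) / len(first)
latency_p95 = p95([a.latency_s for a in first])

print(pass1, pass5, mean_steps, round(mean_cost, 6), latency_p95)
```

Task t2 fails its first attempt and succeeds on the second, so it counts toward pass@5 but not pass@1; t3 never succeeds and counts toward neither.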

the task set

100 tasks, four categories.
Synthetic + permissively-licensed real-world.

Tasks are drawn from the patterns we see most often on agent engagements. Each task has a deterministic success criterion so scoring isn't subjective.
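A deterministic success criterion can be as simple as a pure function of the agent's output and the gold answer. The sketch below shows two hypothetical scorers: a normalized exact match, and a numeric match within tolerance of the kind the code + analysis category calls for. The normalization rules and the default tolerance are assumptions, not the harness's settings.

```python
import math


def score_exact(predicted: str, gold: str) -> bool:
    """Exact match after trimming whitespace and ignoring case (assumed normalization)."""
    return predicted.strip().casefold() == gold.strip().casefold()


def score_numeric(predicted: float, gold: float, rel_tol: float = 1e-3) -> bool:
    """Numeric match within a relative tolerance (the default is a placeholder)."""
    return math.isclose(predicted, gold, rel_tol=rel_tol)


print(score_exact("  Paris ", "paris"))   # same answer modulo whitespace and case
print(score_numeric(3.1416, 3.14159))     # inside the tolerance
print(score_numeric(3.2, 3.14159))        # outside the tolerance
```

Because both functions depend only on their inputs, re-scoring the same transcript always yields the same verdict.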

25 tasks

Information retrieval

Multi-step lookup across structured + unstructured sources. Success = correct answer + correct citation. Each query has a deterministic gold answer.

30 tasks

Workflow automation

Tool-using sequences (CRM update, ticket triage, document processing). Success = correct end-state on the test environment, verified by direct DB inspection.

25 tasks

Code + analysis

Read a repo or dataset, answer a question, optionally write a small fix. Success = passing test or correct numeric answer within tolerance.

20 tasks

Error recovery

One or more tool calls are rigged to fail. Success = recovers without breaking out to the user, completes the original task.
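One way to picture a rigged tool call is a wrapper that forces the first few invocations to fail. The sketch below is illustrative only: `RiggedTool`, `ToolError`, `lookup`, and the fixed-retry policy standing in for the agent are all hypothetical, and the real tasks may inject failures differently.

```python
class ToolError(Exception):
    """Raised by a rigged tool to simulate a failed call."""


class RiggedTool:
    """Wraps a tool function and forces its first `fail_first` calls to raise."""

    def __init__(self, fn, fail_first: int):
        self.fn = fn
        self.remaining_failures = fail_first
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ToolError("injected failure")
        return self.fn(*args, **kwargs)


def run_with_retries(tool, arg, max_retries: int):
    """Stand-in for an agent's recovery policy: retry, then give up."""
    for _ in range(max_retries + 1):
        try:
            return tool(arg)
        except ToolError:
            continue
    return None  # giving up is treated as breaking out to the user


def lookup(ticket_id: str) -> str:
    return f"status-of-{ticket_id}"


outcomes = []
for fail_first in (1, 2, 5):
    tool = RiggedTool(lookup, fail_first)
    result = run_with_retries(tool, "T-17", max_retries=2)
    outcomes.append(result == "status-of-T-17")

recovery_rate = sum(outcomes) / len(outcomes)
print(outcomes, round(recovery_rate, 2))
```

With two retries allowed, the stand-in agent recovers from one or two injected failures but not from five, so two of the three toy tasks count as recovered.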

models under test

Same four families as the RAG benchmark.
Same rubric shape. Same publishing cadence.

The same four model families are scored on the same harness. Where a model shines on agent tasks but underperforms on RAG, comparing the two benchmarks makes that visible.

anthropic

Claude

Opus 4.7 and Sonnet 4.6 (current GA at run date), with tool-use enabled. Same agent scaffold and same tool registry as every other model family.

openai

OpenAI

GPT-4o and a current cost-efficient variant. Function calling enabled. Identical task set, identical success criteria.

google

Gemini

Gemini 2.5 Pro and Flash, with tool use. Same scaffolding so the comparison reflects the model, not the harness.

open-source

Llama 3.1 70B

Llama 3.1 70B Instruct with a standard agent scaffold. The open-source baseline anchors the proprietary vs open gap.

reproduce-agent-reliability-2026-q3.sh bash

git clone https://github.com/paiteq/ai-eval-harness
cd ai-eval-harness

# The harness is currently pre-v0.2; the agent reliability benchmark ships with v0.2
ai-eval run benchmarks/agent-reliability-2026-q3.yaml \
  --provider claude --model claude-opus-4-7

Once results publish, the reproduction commands will look approximately like the ones above. Same scaffold, same task set, same scoring.

Services this benchmark feeds: AI agent development (the core service this harness scores), Claude development (Sonnet 4.6 is one of the models under test), OpenAI development (GPT-4o is another), and AI voice agents (voice agents run the same multi-step planning loop, so the rubric carries over directly).

while you wait

See the parent benchmark + methodology.
Same harness. Same rubric shape.

The first dated benchmark in this series (RAG retrieval) and the methodology page behind it are both live now. This same harness anchors every client engagement: discovery audit, then a 4-6 week pilot with weekly eval gates, then continuous delivery in production. The pilot RAG run logged 88% faithfulness on a 1,840-document corpus (2026-Q2); the prior agent harness logged 71% pass@1 across 100 tool-using tasks (2026-Q1). This is how production-grade AI development looks when reliability is treated as a first-class metric.

RAG benchmark, 2026-Q2 Read the methodology
