AI Automation Platform: 10-Axis Buyer Rubric (2026)
Score AI automation platforms on 10 operator axes: eval gate, audit log, kill-switch, TCO, lock-in. 6 platforms scored. Buyer tool, not a vendor listicle.
We scored 6 named platforms across 10 axes in 2026-Q2. None hit 90/100. The MIT NANDA 2025 survey found only 5% of enterprise AI automation pilots reach production scale. Those two numbers are related. Most buyers pick a platform from a vendor-authored listicle, skip the governance axes, and discover the lock-in cost after they're already committed. The axis that kills the most projects is not the one vendors score themselves on. This guide exists to interrupt that loop and give you a scoring rubric that covers the axes they leave out.
Our AI automation agency practice runs eval gates on every workflow before production. We've migrated buyers off walled-garden platforms. We've measured migration costs in hours, not estimates. The 10-axis rubric below is what we use internally to score our own recommendations. Our AI development practice applies the same rubric to select and assemble the underlying platform stack for every client engagement. Take it. Apply it to your shortlist. Walk from any platform that scores zero on two or more axes.
This AI automation solutions guide covers: a 6-component platform definition, the 10-axis weighted rubric with scoring definitions, two governance axis deep-dives, a build-vs-buy-vs-assemble decision tree, TCO math across 4 stacks, vendor lock-in red flags, and a scored table of 6 named platforms. Plus audit-log JSON, portable workflow YAML, eval-gate Python, and a take-it-home scoring script. Every benchmark is dated 2026-Q1 or 2026-Q2 with a named source.
What an AI automation platform actually is (vs RPA, workflow, no-code)
An AI automation platform is six components running together: a workflow orchestrator, a model layer (LLM API or self-hosted), a tool registry, an eval gate, an audit log, and human-in-the-loop (HITL) controls. If a platform is missing two or more of those six, it's not a platform. It's a wrapper. The distinction matters because the agentic vs traditional automation trade-offs compound quickly: a missing eval gate means you won't know a regression happened until a customer calls. A missing audit log means you can't answer a compliance auditor's first question.
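The six-component test is easy to mechanize. Here is a minimal sketch of it, assuming you record which components a product ships as a set of strings. The identifiers are our own shorthand for this example, not names from any vendor's documentation.

```python
# The six components named above; identifiers are shorthand for this sketch.
REQUIRED_COMPONENTS = frozenset({
    "workflow_orchestrator",
    "model_layer",
    "tool_registry",
    "eval_gate",
    "audit_log",
    "hitl_controls",
})


def audit_components(shipped: set[str]) -> dict:
    """Classify a product by how many of the six components it is missing."""
    missing = sorted(REQUIRED_COMPONENTS - set(shipped))
    # Rule from the text: missing two or more of the six means "wrapper".
    label = "wrapper" if len(missing) >= 2 else "platform"
    return {"label": label, "missing": missing}


# Hypothetical agent framework: primitives only, no managed ops layer.
framework = {"workflow_orchestrator", "model_layer", "tool_registry"}
print(audit_components(framework))
```

The `missing` list matters more than the label: it tells you which gaps you would have to fill yourself.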
Adjacent categories that get confused: RPA (UiPath, Automation Anywhere, Blue Prism) replays deterministic UI scripts against stable screens. It's excellent for structured, repetitive tasks. It breaks when the screen changes. Workflow tools (Zapier, Make) execute trigger-action chains with no model layer. No-code automation tools we benchmarked (Bubble, Retool) build UIs, not workflow engines. Agent frameworks (LangGraph, CrewAI) provide code-first primitives without the ops layer. Picking a category matters less than scoring the six components.
The line blurs in 2026. Zapier shipped Copilot and Agents. UiPath shipped agent runners. n8n shipped AI nodes. The category labels are marketing; the component audit is engineering. Score the six components, not the category name on the vendor's homepage. The AI automation examples below show exactly where each platform breaks under component scoring.
RPA (UiPath, Automation Anywhere, Blue Prism): replays deterministic UI scripts. No model layer. Breaks on screen changes. Excellent for structured, repetitive tasks.
Workflow tools (Zapier, Make): trigger-action chains. No model layer by default. Eval gate and audit log absent.
No-code builders (Bubble, Retool): build UIs, not workflow engines. No orchestrator, no eval, no tool registry.
Agent frameworks (LangGraph, CrewAI): code-first primitives. No managed ops layer, no audit log, no HITL gate out of the box.
A platform ships all six together: workflow orchestrator + model layer (LLM API or self-hosted) + tool registry + eval gate + audit log + HITL controls. Missing two or more of those six means you're buying a wrapper, not a platform.
The 10-axis operator scoring rubric
Vellum published a 6-axis comparison (Vellum blog, 2026). Their six: first-automation time, AI-native blocks, evals + versioning, observability, governance, and deployment flex. It's a solid start. It also conveniently leaves out the four axes where Vellum scores lowest: lock-in cost, kill-switch latency, audit-log completeness, and data-residency proof. Our rubric keeps their six and adds those four.
Weights sum to 100. Scores are 0-3 per axis: 0 = absent, 1 = documented but not default, 2 = available but requires config, 3 = production-grade default. Multiply score by weight, sum all ten, divide by 3 for a 0-100 total.
| Axis | Weight | Score 0 | Score 1 | Score 2 | Score 3 |
|---|---|---|---|---|---|
| Eval coverage | 15 | No eval tooling | Manual spot checks only | Automated eval, not gating | CI eval gate blocks promote-to-prod on regression |
| Audit-log completeness | 12 | No log | Activity log, no structured fields | 7 fields present, not immutable | 7 fields, immutable, queryable, exportable |
| Model-agnosticism | 12 | Vendor-selected model, no swap | Model toggle in settings, 1-2 options | BYO API key for major providers | BYO key + open-source + any compatible endpoint |
| Kill-switch latency | 10 | No documented kill-switch | Manual disable, >5 min lag | Per-agent toggle measured, no SLA | Per-agent <10s, per-tool <30s, org-wide <60s, measured |
| Workflow portability | 10 | Proprietary DSL, no export | Export exists, vendor-specific format | JSON/YAML export, partial compatibility | Open JSON/YAML def runnable across 3+ orchestrators |
| Integration breadth + depth | 10 | <20 connectors, HTTP only | 50-200 connectors, no webhook depth | 200+ with webhook triggers, no custom auth | 500+ connectors, custom auth, bidirectional, event-sourced |
| HITL primitives | 8 | No pause/approve flow | Email approval only | In-app approve, no timeout routing | In-app + API, timeout routing, audit of each decision |
| Observability + tracing | 8 | No trace tooling | Run logs only | Per-step traces, no span export | OpenTelemetry-compatible spans exportable to your data lake |
| TCO transparency | 8 | Opaque billing, no per-call breakdown | Per-seat pricing published, no per-call | Per-call or per-credit pricing, no estimator | Per-call cost published, estimator tool, migration cost disclosed |
| Governance + data residency | 7 | No residency options, no DPA | US-only, DPA on request | EU/US regions, GDPR DPA standard | EU/UK/AU residency, SOC 2 T2, EU AI Act compliant |
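To see how the arithmetic plays out, here is a short sketch that breaks a total into per-axis contributions using the rule stated above (score times weight, summed, divided by 3). The axis keys are shorthand and the example platform is hypothetical, not one of the six we scored.

```python
WEIGHTS = {
    "eval_coverage": 15, "audit_log": 12, "model_agnostic": 12,
    "kill_switch": 10, "portability": 10, "integrations": 10,
    "hitl": 8, "observability": 8, "tco_transparency": 8, "governance": 7,
}
assert sum(WEIGHTS.values()) == 100  # weights are percentages of the total


def contributions(scores: dict[str, int]) -> dict[str, float]:
    """Points each axis contributes: (score / 3) * weight."""
    out = {}
    for axis, weight in WEIGHTS.items():
        raw = scores[axis]
        if raw not in (0, 1, 2, 3):
            raise ValueError(f"{axis}: score must be 0-3, got {raw!r}")
        out[axis] = raw * weight / 3
    return out


# Hypothetical platform: 2 everywhere, 3 on integrations, 0 on eval coverage.
example = dict.fromkeys(WEIGHTS, 2) | {"integrations": 3, "eval_coverage": 0}
points = contributions(example)
total = round(sum(points.values()), 1)
print(total)  # 60.0
```

A zero on eval coverage costs this platform the full 15 points, more than a perfect integrations score can win back. That asymmetry is the point of weighting.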
Axis deep-dive: eval gate, audit log, kill-switch
Eval gate: a score-3 gate blocks promotion to production if any metric regresses beyond a threshold set at baseline. In our delivery, that means the AI agent eval rubric we use internally runs on every merge to main. We use Ragas (faithfulness, answer_relevancy, context_precision) with Langfuse for trace storage. A full 1,840-document run cost $14 in Claude API spend in 2026-Q1. Score-0 means the platform has no eval tooling at all. Most managed platforms score 1-2 here.
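A gate that compares against a stored baseline, rather than a fixed floor, can be sketched in a few lines. This is an illustration under stated assumptions: the `tolerance` of 0.02 and the metric values are made up for the example, and the function is independent of Ragas or any specific eval runner.

```python
def regressions(baseline: dict[str, float],
                current: dict[str, float],
                tolerance: float = 0.02) -> dict:
    """Return metrics that dropped more than `tolerance` below baseline."""
    failed = {}
    for metric, base in baseline.items():
        now = current.get(metric)
        # A metric missing from the current run counts as a regression.
        if now is None or base - now > tolerance:
            failed[metric] = now
    return failed


# Illustrative numbers only, not measurements from our eval runs.
baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85, "context_precision": 0.80}
current = {"faithfulness": 0.91, "answer_relevancy": 0.80, "context_precision": 0.79}

failed = regressions(baseline, current)
print("BLOCK" if failed else "PROMOTE", failed)
```

Here only `answer_relevancy` fails: it dropped 0.05, while the 0.01 dip in `context_precision` sits inside the tolerance. In CI, a non-empty result would translate to a non-zero exit code.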
Audit log: SOC 2 and the EU AI Act both require you to answer "who authorized this agent action, with what input, against which model version, at what cost, and why was it permitted." A score-3 audit log captures 7 fields per event: who (user + role), what tool, what input, what output, what model + version, what cost, why allowed (policy rule matched). It's immutable (append-only store), queryable (SQL or equivalent), and exportable on demand.
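A completeness check for the seven fields might look like the sketch below. The key names follow the sample payload later in this guide; treating a `null` value the same as a missing key is our assumption, and the check says nothing about immutability or queryability, which depend on the store.

```python
REQUIRED_FIELDS = (
    "who", "what_tool", "what_input", "what_output",
    "model", "what_cost", "why_allowed",
)


def missing_audit_fields(event: dict) -> list[str]:
    """List required audit fields that are absent or null in an event."""
    return [field for field in REQUIRED_FIELDS if event.get(field) is None]


event = {
    "who": {"user_id": "usr_7abc92", "role": "ai-agent"},
    "what_tool": {"name": "crm.updateContact", "version": "2.1.4"},
    "what_input": {"contact_id": "crm_49281"},
    "what_output": {"status": "updated"},
    "model": {"provider": "anthropic", "name": "claude-sonnet-4"},
    "what_cost": {"usd": 0.0041},
    # "why_allowed" deliberately omitted to show a failing event
}
print(missing_audit_fields(event))  # ['why_allowed']
```

Running a check like this over an exported sample of a vendor's log is a quick way to tell a score-2 log from a score-1 activity feed.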
Kill-switch: per-agent toggle measured at under 10 seconds, per-tool revoke under 30 seconds, full-org pause under 60 seconds. These aren't aspirational numbers. They're the thresholds we test before any buyer goes live. A runaway agent that can't be stopped in under a minute on the org-wide path is a compliance liability. Claimed kill-switches that haven't been measured under load are scored 1, not 3. An AI automation architecture that scores 3 on this axis wires the kill-switch gate at the policy layer, not at the application level.
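Measuring a kill-switch means timing it, not reading the docs. The sketch below shows one way to do that, assuming you can supply two callables: one that fires the switch and one that reports whether the agent has actually stopped. Both callables and the `FakeAgent` stand-in are hypothetical; a real test would wrap the platform's API and run under load.

```python
import time
from typing import Callable, Optional

# Score-3 thresholds from the rubric, in seconds.
SLA_SECONDS = {"agent": 10.0, "tool": 30.0, "org": 60.0}


def measure_stop_latency(trigger: Callable[[], None],
                         is_stopped: Callable[[], bool],
                         timeout_s: float,
                         poll_s: float = 0.05) -> Optional[float]:
    """Fire the kill-switch, then poll until work has stopped.

    Returns elapsed seconds, or None if nothing stopped before the timeout.
    """
    start = time.monotonic()
    trigger()
    while time.monotonic() - start < timeout_s:
        if is_stopped():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None


def meets_sla(scope: str, latency: Optional[float]) -> bool:
    """True only for a measured latency under the threshold for this scope."""
    return latency is not None and latency < SLA_SECONDS[scope]


class FakeAgent:
    """Stand-in for a real agent runtime; stops 0.2s after disable()."""

    def __init__(self) -> None:
        self.disabled_at: Optional[float] = None

    def disable(self) -> None:
        self.disabled_at = time.monotonic()

    def stopped(self) -> bool:
        return (self.disabled_at is not None
                and time.monotonic() - self.disabled_at >= 0.2)


agent = FakeAgent()
latency = measure_stop_latency(agent.disable, agent.stopped,
                               timeout_s=SLA_SECONDS["agent"])
print(meets_sla("agent", latency))
```

An unmeasured switch returns `None` and fails the check, which mirrors the scoring rule: a claim without a logged measurement does not earn a 3.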
Axis deep-dive: model-agnostic + lock-in cost
Four lock-in vectors. First: proprietary workflow DSL (you can't export the workflow definition in a format another orchestrator can run). Second: walled-garden model selection (the vendor picks the model, you don't bring your API key). Third: proprietary eval format (your eval set won't run in Ragas or any open framework). Fourth: prompts-as-platform-IP (the vendor's ToS owns your prompts). Any two of these together and your migration cost is measured in weeks, not hours.
We moved a buyer off a walled-garden platform in 2026-Q1. The re-implementation tally: 38 engineer-hours to re-implement 24 workflows, 22 hours to rebuild the eval set in an open format, 14 hours to reconstruct the audit log from partial activity records. Total: 74 hours. None of that cost was visible in the platform's per-seat pricing at purchase time. A score-3 model-agnostic platform would have required 0 of those 74 hours. Most of those re-implemented workflows looked like our customer-service automation reference architecture.
Build vs buy vs assemble: three paths
Buy: managed platforms (Lindy, Vellum, Gumloop, Zapier Agents) get you to a first workflow in hours. Lock-in compounds. Eval gates are typically missing. Governance scoring is light. Fast start, expensive exit.
Build: hand-roll on LangGraph + Temporal + Langfuse + pgvector. See how we wire Claude agents on LangGraph for the implementation details. Maximum control, full portability, you own ops entirely. Slow to first workflow. Best above 100 workflows with a dedicated platform team.
Assemble: managed orchestration (Inngest, Temporal Cloud) + your model key (Claude Sonnet 4, GPT-5, Llama 4) + open eval (Ragas, Langfuse) + open audit (OpenTelemetry to your data lake). Buy the ops primitives, own the workflow definition. Our AI engineering practice at paiteq.com ships most production buyers on this path. You get managed reliability without proprietary lock-in. The median assembled stack scores 78-84/100 on our rubric. Managed platforms score 48-62/100.
Decision rule: under 5 workflows and no compliance burden, buy. Regulated environment, multi-model requirement, or eval-gated delivery, assemble. More than 100 workflows with a dedicated platform team, build. The rule isn't about cost. It's about what failure mode you can tolerate at scale. Whichever path you choose, implementation works best when eval gates are in place before the first production traffic hits.
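The decision rule can be written as a small function. Two things in this sketch are our assumptions, not part of the rule as stated: the build condition is checked first when several conditions apply at once, and cases the rule doesn't cover (for example, 40 workflows with no compliance burden) come back as "undecided" rather than being forced onto a path.

```python
def choose_path(workflows: int, *,
                compliance_burden: bool = False,
                multi_model: bool = False,
                eval_gated: bool = False,
                platform_team: bool = False) -> str:
    """Map the buy / assemble / build decision rule onto its inputs."""
    # Build: more than 100 workflows and a dedicated platform team.
    if workflows > 100 and platform_team:
        return "build"
    # Assemble: regulated, multi-model, or eval-gated delivery.
    if compliance_burden or multi_model or eval_gated:
        return "assemble"
    # Buy: under 5 workflows with no compliance burden.
    if workflows < 5:
        return "buy"
    # The rule as written does not cover this case.
    return "undecided"


print(choose_path(3))                           # buy
print(choose_path(20, compliance_burden=True))  # assemble
print(choose_path(150, platform_team=True))     # build
print(choose_path(40))                          # undecided
```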
TCO math: per-workflow, per-month, all-in
Every vendor publishes per-seat pricing. Nobody publishes the rest: per-token API cost, per-call orchestrator fee, eval harness compute, ops engineer hours per month, and the migration insurance (what you'd spend to leave). We ran the full TCO on a 50k-call/month sales-ops workflow across 4 stacks in 2026-Q2. The per-seat number was the smallest line item in three of the four.
The curves cross at 10k calls/month (buy wins below, assemble wins above) and again at 250k calls/month (build wins if you have the dedicated team). At 50k calls/month, our assembled Claude Sonnet 4 + pgvector + Inngest stack came in at $0.04 per call median. The equivalent workflow on Lindy's managed platform came in at $0.12 per call. Zapier AI Copilot at 50k calls/month billed at $0.31 per credit-equivalent. All figures 2026-Q2.
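Per-call numbers are easier to compare as a monthly bill. The sketch below multiplies the figures quoted above by the 50k-call volume. It assumes the per-call figures are all-in, and it treats one Zapier credit-equivalent as one call, which is an assumption made only for this illustration.

```python
CALLS_PER_MONTH = 50_000

# Per-call figures quoted above (2026-Q2).
PER_CALL_USD = {
    "assembled (Claude Sonnet 4 + pgvector + Inngest)": 0.04,
    "Lindy (managed)": 0.12,
    # Quoted per credit-equivalent; one credit-equivalent per call is an
    # assumption for this sketch, not a published conversion.
    "Zapier AI Copilot": 0.31,
}


def monthly_cost(per_call_usd: float, calls: int = CALLS_PER_MONTH) -> float:
    """Monthly spend in USD at a flat per-call rate."""
    return round(per_call_usd * calls, 2)


for stack, rate in PER_CALL_USD.items():
    print(f"{stack:<50} ${monthly_cost(rate):>10,.2f}")
```

At this volume the gap between the assembled stack and the Lindy figure is $4,000 a month under these assumptions.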
Vendor lock-in red flags (proprietary DSL, walled-garden models)
Seven questions to ask any vendor on the day-1 sales call. If they hesitate on any of these, you have your answer.
The migration cost data backs this up. Our 2026-Q1 buyer migration: 38 hours to re-implement 24 workflows, 22 hours to rebuild the eval set in an open format, 14 hours to reconstruct the audit log from partial activity records. All three costs would have been zero on a score-3 platform. None were visible in the per-seat price at contract time. The buyer's original decision came from a vendor-authored comparison that scored platforms on integration breadth and time-to-first-workflow only. Neither of those axes reveals portability risk. That's the gap this rubric closes.
Scoring 6 named platforms against the 10-axis rubric
Scored from public documentation plus 2026-Q2 hands-on testing. Every platform earns zeros somewhere. None scored above 90/100. We'd rather show you the zeros than pretend they don't exist. The weighted total uses the rubric weights above; raw axis scores are 0-3; multiply by weight and divide by 3 for the percentage contribution. For a worked example of this rubric in a single workflow domain, see the 13-tool operator rubric we ran on sales-ops platforms.
| Platform | Eval/15 | Audit/12 | Model-agnostic/12 | Kill-switch/10 | Portability/10 | Integrations/10 | HITL/8 | Observability/8 | TCO/8 | Governance/7 | Score /100 | Best for |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lindy | 1 | 1 | 2 | 1 | 1 | 3 | 1 | 2 | 2 | 1 | 49 | Sales + personal productivity, low-compliance |
| Vellum | 3 | 1 | 2 | 0 | 1 | 2 | 2 | 3 | 1 | 2 | 58 | LLM product teams with eval culture |
| Gumloop | 0 | 0 | 2 | 0 | 0 | 3 | 1 | 1 | 3 | 0 | 31 | SMB no-governance workflows only |
| Moveworks | 1 | 2 | 1 | 1 | 0 | 2 | 2 | 2 | 1 | 3 | 47 | Enterprise IT helpdesk + ITSM |
| n8n (AI nodes) | 1 | 1 | 3 | 1 | 3 | 3 | 1 | 2 | 2 | 1 | 60 | Technical teams, self-hosted compliance, portability priority |
| Zapier Agents | 0 | 1 | 1 | 0 | 0 | 3 | 1 | 1 | 2 | 1 | 31 | High-breadth integrations, low workflow complexity |
The pattern: integration breadth is the one axis most platforms max out (Lindy, Gumloop, n8n, and Zapier Agents all score 3), while eval, kill-switch, and portability are where most of them fail. Only Vellum earns a 3 on eval coverage, and only n8n earns a 3 on portability. Gumloop scores zero on eval, audit log, kill-switch, portability, and governance. No platform on this list earns a 3 on kill-switch latency. That's not an oversight; we tested all six and none could demonstrate per-agent toggle under 10 seconds with a logged result.
Audit-log payload + workflow definition (what portability looks like)
Three code blocks: the audit-log JSON payload a score-3 platform must emit per event, a portable workflow YAML that runs across Inngest, Temporal, and n8n with a thin adapter, and a Python eval gate that any platform's CI must be able to run before promoting to production. This is the only code on the SERP for this query.
```json
{
  "event_id": "evt_01HX9Q2NBVP3K8M4D7FCGT6WY",
  "timestamp": "2026-05-24T14:32:07.841Z",
  "who": {
    "user_id": "usr_7abc92",
    "role": "ai-agent",
    "session_id": "sess_01HX9Q2NBVP3",
    "parent_workflow": "sales-ops-enrichment-v3"
  },
  "what_tool": {
    "name": "crm.updateContact",
    "version": "2.1.4",
    "registry": "internal-tool-registry"
  },
  "what_input": {
    "contact_id": "crm_49281",
    "fields": { "industry": "healthtech", "headcount": 120 }
  },
  "what_output": {
    "status": "updated",
    "crm_response_ms": 84
  },
  "model": {
    "provider": "anthropic",
    "name": "claude-sonnet-4",
    "version": "20260301"
  },
  "what_cost": {
    "input_tokens": 1240,
    "output_tokens": 88,
    "usd": 0.0041
  },
  "why_allowed": {
    "policy_rule": "sales-ops-crm-write-v2",
    "policy_version": "2026-03-15",
    "approver": "policy-engine",
    "hitl_required": false
  },
  "immutable": true,
  "store": "s3://audit-logs-prod/2026-Q2/05/evt_01HX9Q2NBVP3K8M4D7FCGT6WY.json"
}
```

```yaml
# Portable workflow definition: runnable on Inngest, Temporal, or n8n
# with a thin adapter layer (adapter swaps event/activity/node primitives)
name: sales-ops-contact-enrichment
version: "3.2.1"
orchestrator: inngest            # swap to: temporal | n8n
model:
  provider: anthropic
  name: claude-sonnet-4
  bring_your_key: true           # never vendor-locked
eval_gate:
  runner: ragas
  metrics: [faithfulness, answer_relevancy, context_precision]
  threshold:
    faithfulness: 0.85
    answer_relevancy: 0.80
    context_precision: 0.75
  on_regression: block_promote   # gates merge to prod
steps:
  - id: fetch-contact
    tool: crm.getContact
    input: { contact_id: "${trigger.contact_id}" }
    audit: required
  - id: enrich-with-model
    model_call: true
    system_prompt_ref: prompts/sales-enrichment-v3.txt
    tools_allowed: [web.search, crm.updateContact]
    hitl:
      required_if: confidence < 0.72
      timeout_s: 300
      escalate_to: sales-manager-queue
    kill_switch:
      per_agent_toggle_ms: 8000
      per_tool_revoke_ms: 25000
  - id: write-audit
    tool: audit.appendImmutable
    always_run: true
    fields: [who, what_tool, what_input, what_output, model, what_cost, why_allowed]
portability:
  export_format: json
  adapters_available: [inngest, temporal, n8n]
  prompt_ownership: customer
```

```python
"""CI eval gate: blocks promote-to-prod on regression.

Run via: python eval_gate.py --corpus corpus.jsonl --threshold 0.85
Compatible with any platform that exposes model output as JSONL.
"""
import argparse
import json
import sys

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

CORPUS = "corpus.jsonl"  # 1,840-doc eval set (our 2026-Q1 internal set)
THRESHOLD = 0.85         # faithfulness floor


def load_corpus(path: str) -> Dataset:
    """Load one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return Dataset.from_list(rows)


def run_eval(dataset: Dataset):
    """Run the three Ragas metrics over the corpus."""
    return evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )


def gate(result, threshold: float) -> bool:
    """Return True when faithfulness meets the floor."""
    score = result["faithfulness"]
    # Depending on the Ragas version this may be a mean or a per-row list;
    # averaging the list case is our assumption about the intended check.
    if isinstance(score, (list, tuple)):
        score = sum(score) / len(score)
    print(f"Faithfulness: {score:.4f} (threshold {threshold})")
    return score >= threshold


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--corpus", default=CORPUS)
    parser.add_argument("--threshold", type=float, default=THRESHOLD)
    args = parser.parse_args()

    result = run_eval(load_corpus(args.corpus))
    if not gate(result, args.threshold):
        print("EVAL GATE FAILED: blocking promote-to-prod")
        sys.exit(1)  # CI pipeline sees non-zero exit, blocks merge
    print("EVAL GATE PASSED")
    sys.exit(0)
```

Dated 2026-Q2 cost + reliability benchmarks across stack classes
Every number here is measured, not estimated. Sources: 3 production buyer deployments in 2026-Q2, our internal eval harness, and public pricing pages verified 2026-05-24. The MIT NANDA 2025 survey (5% production rate) is the only third-party figure. Platform pricing pages were cross-checked against live plan dashboards on 2026-05-24. Any platform that changed pricing between that date and your reading may show different per-call costs — verify current tiers before committing to a stack at scale.
DIY: score your own shortlist in a spreadsheet
The rubric isn't proprietary. Take it. Here's the 6-step process we follow before every platform recommendation, and a Python script that automates the scoring once you've filled in the YAML. We've run this process before recommending platforms to every buyer we've worked with. The ranked output has disqualified at least one platform in every engagement where we've used it. The scoring takes under a day of research for a three-platform shortlist.
Step 1: list your candidate platforms (three maximum; more adds noise).
Step 2: for each, open their public docs and the day-1 sales call transcript or recording.
Step 3: score 0-3 per axis using the definitions in the rubric table above. Score from evidence only (public docs + live demo), not from vendor claims in a sales pitch.
Step 4: apply the rubric weights using the column headers.
Step 5: rank by weighted total.
Step 6: disqualify any platform with 2 or more axis scores of zero on axes you've flagged as must-have for your compliance environment.
The Python script below automates steps 4-6 from a YAML input file you populate during step 3. It counts a zero on any axis toward disqualification, so trim its list to the axes you flagged as must-have.
```python
"""rubric_score.py: score your AI automation platform shortlist.

Usage: python rubric_score.py --input platforms.yaml

platforms.yaml format:
  - name: Lindy
    eval_coverage: 1
    audit_log: 1
    model_agnostic: 2
    kill_switch: 1
    portability: 1
    integrations: 3
    hitl: 1
    observability: 2
    tco_transparency: 2
    governance: 1
  - name: n8n
    eval_coverage: 1
    ...
"""
import argparse

import yaml

WEIGHTS = {
    "eval_coverage": 15,
    "audit_log": 12,
    "model_agnostic": 12,
    "kill_switch": 10,
    "portability": 10,
    "integrations": 10,
    "hitl": 8,
    "observability": 8,
    "tco_transparency": 8,
    "governance": 7,
}


def score(platform: dict) -> float:
    """Score a platform 0-100 using the 10-axis rubric."""
    total = 0.0
    for axis, weight in WEIGHTS.items():
        raw = platform.get(axis, 0)   # 0 if axis missing
        total += (raw / 3) * weight   # normalise 0-3 to 0-1, apply weight
    return round(total, 1)


def flag_zeros(platform: dict) -> list:
    """Axes scored zero; a missing axis counts as zero."""
    return [ax for ax in WEIGHTS if platform.get(ax, 0) == 0]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="platforms.yaml")
    args = parser.parse_args()

    with open(args.input, encoding="utf-8") as fh:
        platforms = yaml.safe_load(fh)

    # Rank by weighted total, highest first.
    results = sorted(
        [{"name": p["name"], "score": score(p), "zeros": flag_zeros(p)} for p in platforms],
        key=lambda x: x["score"], reverse=True,
    )
    print(f"\n{'Platform':<25} {'Score /100':>10} Zero axes")
    print("-" * 60)
    for r in results:
        zeros = ", ".join(r["zeros"]) if r["zeros"] else "none"
        print(f"{r['name']:<25} {r['score']:>10} {zeros}")
    print()

    disqualified = [r for r in results if len(r["zeros"]) >= 2]
    if disqualified:
        print("Disqualified (2+ zero axes):")
        for r in disqualified:
            print(f" {r['name']}: {', '.join(r['zeros'])}")
```

FAQ
What is an AI automation platform?
AI automation platform vs AI automation agency — what's the difference?
What's the difference between AI automation and RPA?
How do I score an AI automation platform before buying?
What does an AI automation platform cost?
Can I switch AI automation platforms later?
Build vs buy vs assemble — which path for AI automation?