AI Automation Platform: 10-Axis Buyer Rubric (2026)

Q: What is an AI automation platform?

An AI automation platform combines six components into one runtime: workflow orchestrator, model layer (LLM API or self-hosted), tool registry, eval gate, audit log, and HITL controls. Examples in 2026 include Lindy, Vellum, Gumloop, Moveworks, n8n (with AI nodes), and Zapier (with Agents). The distinction from RPA (UiPath, Automation Anywhere) is the model layer: AI automation reasons over unstructured inputs; RPA replays deterministic UI scripts.

Q: AI automation platform vs AI automation agency — what's the difference?

A platform is software you license and operate (Lindy, Vellum, Zapier Agents). An AI automation agency like ours runs the audit, builds the workflows on a platform or assembled stack, wires eval gates and audit logs, and owns delivery. Platforms are tools; agencies pick the tool, score it against your use case, and ship the production system. If your team has 2+ AI-experienced engineers and under 10 workflows, buy a platform. If you need regulated-grade audit logs, multi-model portability, or weekly eval gates, hire an agency that assembles on portable primitives.

Q: How do I score an AI automation platform before buying?

Run the 10-axis rubric in the scoring table below. Axis weights are listed in that table (eval coverage tops the list at weight 15; governance anchors at weight 7). Score 0-3 per axis from public docs plus a day-1 sales call. Walk from any platform with 2 or more zero scores. Most vendor-published lists score on 5-6 axes that conveniently favor the publisher; this 10-axis rubric adds the governance and lock-in axes nobody discloses.

Q: What does an AI automation platform cost?

Per-call costs in 2026-Q2: $0.04 median on an assembled stack (Claude Sonnet 4 + pgvector + Inngest), $0.12 median on managed agent platforms (Lindy at 50k calls/month), $0.31 per credit-equivalent on Zapier AI Copilot. Add per-seat pricing on most managed platforms, eval harness compute ($14 per full Ragas run on a 1,840-doc corpus), and ops engineering hours. Buy wins below 10k calls/month; assemble wins above that. Start the audit conversation to scope your real call volume before committing to a stack.

Q: Can I switch AI automation platforms later?

Only if you bought portable primitives from day one. The four lock-in vectors: proprietary workflow DSL (you can't export the definition), walled-garden model selection (vendor picks the LLM), proprietary eval format (your eval set won't run in open frameworks), and prompts-as-IP (vendor owns your prompts). We migrated a 24-workflow buyer off a walled-garden platform in 2026-Q1: 38 engineer-hours to re-implement workflows, 22 hours for the eval set, 14 hours for the audit log. None of that appeared in the per-seat price. Demand portable JSON/YAML workflow exports and BYO model keys on day one.

Q: Build vs buy vs assemble — which path for AI automation?

Three paths: buy (managed platforms like Lindy, Vellum, Gumloop — fast start, lock-in compounds), build (LangGraph + Temporal + Langfuse + pgvector — maximum control, slow start, you own ops), assemble (managed orchestration like Inngest or Temporal Cloud plus BYO model key plus open eval and audit — middle path, portable). Under 5 workflows with no compliance requirement: buy. Regulated environment, multi-model, or eval-gated delivery: assemble. Over 100 workflows with a dedicated platform team: build.

We scored 6 named platforms across 10 axes in 2026-Q2. None hit 90/100. The MIT NANDA 2025 survey found only 5% of enterprise AI automation pilots reach production scale. Those two numbers are related. Most buyers pick a platform from a vendor-authored listicle, skip the governance axes, and discover the lock-in cost after they're already committed. The axis that kills the most projects is not the one vendors score themselves on. This guide exists to interrupt that loop and give you a scoring rubric that covers the axes they leave out.

Our AI automation agency practice runs eval gates on every workflow before production. We've migrated buyers off walled-garden platforms. We've measured migration costs in hours, not estimates. The 10-axis rubric below is what we use internally to score our own recommendations. Our AI development practice applies the same rubric to select and assemble the underlying platform stack for every client engagement. Take it. Apply it to your shortlist. Walk from any platform that scores zero on two or more axes.

This guide to AI automation solutions covers: a 6-component platform definition, the 10-axis weighted rubric with scoring definitions, two governance axis deep-dives, a build-vs-buy-vs-assemble decision tree, TCO math across 4 stacks, vendor lock-in red flags, and a scored table of 6 named platforms. Plus audit-log JSON, portable workflow YAML, eval-gate Python, and a take-it-home scoring script. Every benchmark is dated 2026-Q1 or 2026-Q2 with a named source.

What an AI automation platform actually is (vs RPA, workflow, no-code)

An AI automation platform is six components running together: a workflow orchestrator, a model layer (LLM API or self-hosted), a tool registry, an eval gate, an audit log, and human-in-the-loop (HITL) controls. If a platform is missing two or more of those six, it's not a platform. It's a wrapper. The distinction matters because the agentic vs traditional automation trade-offs compound quickly: a missing eval gate means you won't know a regression happened until a customer calls. A missing audit log means you can't answer a compliance auditor's first question.
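The six-component test above can be sketched as a small checklist function. This is a minimal illustration, not a vendor API: the component keys and the `classify` helper are names we made up for the example, and the middle label for a single missing component is our assumption, since the rule above only defines the two-or-more case.

```python
# Hypothetical component keys for the six-part definition.
REQUIRED_COMPONENTS = (
    "workflow_orchestrator",
    "model_layer",
    "tool_registry",
    "eval_gate",
    "audit_log",
    "hitl_controls",
)


def classify(present: set) -> tuple:
    """Return (label, missing components) for a candidate product."""
    missing = [c for c in REQUIRED_COMPONENTS if c not in present]
    if len(missing) >= 2:
        label = "wrapper"              # rule from the text: two or more missing
    elif len(missing) == 1:
        label = "platform with a gap"  # assumption: the text does not name this case
    else:
        label = "platform"
    return label, missing


label, missing = classify(
    {"workflow_orchestrator", "model_layer", "tool_registry", "hitl_controls"}
)
print(label, missing)
```

The missing list matters more than the label: it tells you which questions to take into the sales call.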

Four adjacent categories get confused with platforms. RPA (UiPath, Automation Anywhere, Blue Prism) replays deterministic UI scripts against stable screens. It's excellent for structured, repetitive tasks, and it breaks when the screen changes. Workflow tools (Zapier, Make) execute trigger-action chains with no model layer by default. No-code builders (Bubble, Retool) build UIs, not workflow engines. Agent frameworks (LangGraph, CrewAI) provide code-first primitives without the ops layer. Picking a category matters less than scoring the six components.

The line blurs in 2026. Zapier shipped Copilot and Agents. UiPath shipped agent runners. n8n shipped AI nodes. The category labels are marketing; the component audit is engineering. Score the six components, not the category name on the vendor's homepage. The scored examples below show where each platform breaks under component scoring.

Adjacent category (RPA / workflow / no-code / framework)

RPA (UiPath, Automation Anywhere, Blue Prism): replays deterministic UI scripts. No model layer. Breaks on screen changes. Excellent for structured, repetitive tasks.

Workflow tools (Zapier, Make): trigger-action chains. No model layer by default. Eval gate and audit log absent.

No-code builders (Bubble, Retool): build UIs, not workflow engines. No orchestrator, no eval, no tool registry.

Agent frameworks (LangGraph, CrewAI): code-first primitives. No managed ops layer, no audit log, no HITL gate out of the box.

Full AI automation platform (all 6 components required)

A platform ships all six together: workflow orchestrator + model layer (LLM API or self-hosted) + tool registry + eval gate + audit log + HITL controls. Missing two or more of those six means you're buying a wrapper, not a platform. The distinction matters in production: a missing eval gate means you won't know a regression happened until a customer calls. A missing audit log means you can't answer a compliance auditor's first question.

The 10-axis operator scoring rubric

Vellum published a 6-axis comparison (Vellum blog, 2026). Their six: first-automation time, AI-native blocks, evals + versioning, observability, governance, and deployment flex. It's a solid start. It also conveniently leaves out the four axes where Vellum scores lowest: lock-in cost, kill-switch latency, audit-log completeness, and data-residency proof. Our rubric keeps their six and adds those four.

Weights sum to 100. Scores are 0-3 per axis: 0 = absent, 1 = documented but not default, 2 = available but requires config, 3 = production-grade default. Multiply score by weight, sum all ten, divide by 3 for a 0-100 total.
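Written out, the scoring rule above is a weighted sum. The symbols w_i (axis weight) and s_i (axis score) are our notation for this sketch, not part of the rubric itself.

```latex
\[
\text{total} \;=\; \frac{1}{3}\sum_{i=1}^{10} w_i\, s_i,
\qquad s_i \in \{0, 1, 2, 3\},
\qquad \sum_{i=1}^{10} w_i = 100
\]
```

A platform scoring 3 on every axis reaches (3 × 100) / 3 = 100. One scoring 2 on every axis lands at (2 × 100) / 3 ≈ 66.7, so a single point on a weight-15 axis moves the total by 5.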

10-axis rubric radar: three example platforms scored against all 10 axes with weighted % labels.

Axis	Weight	Score 0	Score 1	Score 2	Score 3
Eval coverage	15	No eval tooling	Manual spot checks only	Automated eval, not gating	CI eval gate blocks promote-to-prod on regression
Audit-log completeness	12	No log	Activity log, no structured fields	7 fields present, not immutable	7 fields, immutable, queryable, exportable
Model-agnosticism	12	Vendor-selected model, no swap	Model toggle in settings, 1-2 options	BYO API key for major providers	BYO key + open-source + any compatible endpoint
Kill-switch latency	10	No documented kill-switch	Manual disable, >5 min lag	Per-agent toggle measured, no SLA	Per-agent <10s, per-tool <30s, org-wide <60s, measured
Workflow portability	10	Proprietary DSL, no export	Export exists, vendor-specific format	JSON/YAML export, partial compatibility	Open JSON/YAML def runnable across 3+ orchestrators
Integration breadth + depth	10	<20 connectors, HTTP only	50-200 connectors, no webhook depth	200+ with webhook triggers, no custom auth	500+ connectors, custom auth, bidirectional, event-sourced
HITL primitives	8	No pause/approve flow	Email approval only	In-app approve, no timeout routing	In-app + API, timeout routing, audit of each decision
Observability + tracing	8	No trace tooling	Run logs only	Per-step traces, no span export	OpenTelemetry-compatible spans exportable to your data lake
TCO transparency	8	Opaque billing, no per-call breakdown	Per-seat pricing published, no per-call	Per-call or per-credit pricing, no estimator	Per-call cost published, estimator tool, migration cost disclosed
Governance + data residency	7	No residency options, no DPA	US-only, DPA on request	EU/US regions, GDPR DPA standard	EU/UK/AU residency, SOC 2 T2, EU AI Act compliant

10-axis scoring rubric: weights, score definitions, and why each axis is on the list. Score 0-3 per axis; multiply by weight; sum for 0-100 total.

Axis deep-dive: eval gate, audit log, kill-switch

Eval gate: a score-3 gate blocks promotion to production if any metric regresses beyond a threshold set at baseline. In our delivery, that means the AI agent eval rubric we use internally runs on every merge to main. We use Ragas (faithfulness, answer_relevancy, context_precision) with Langfuse for trace storage. A full 1,840-document run cost $14 in Claude API spend in 2026-Q1. Score-0 means the platform has no eval tooling at all. Most managed platforms score 1-2 here.

Audit log: SOC 2 and the EU AI Act both require you to answer "who authorized this agent action, with what input, against which model version, at what cost, and why was it permitted." A score-3 audit log captures 7 fields per event: who (user + role), what tool, what input, what output, what model + version, what cost, why allowed (policy rule matched). It's immutable (append-only store), queryable (SQL or equivalent), and exportable on demand.
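A completeness check for the 7 fields can be sketched in a few lines. The key names follow the audit-log payload shown later in this guide; the `missing_audit_fields` helper is a hypothetical name, and treating empty values as missing is our assumption.

```python
# The 7 required fields, keyed as in this guide's audit-log payload.
REQUIRED_AUDIT_FIELDS = (
    "who",
    "what_tool",
    "what_input",
    "what_output",
    "model",
    "what_cost",
    "why_allowed",
)


def missing_audit_fields(event: dict) -> list:
    """Return required fields that are absent or empty in one audit event."""
    empty_values = (None, "", {}, [])
    return [f for f in REQUIRED_AUDIT_FIELDS if event.get(f) in empty_values]


event = {
    "who": {"user_id": "usr_7abc92", "role": "ai-agent"},
    "what_tool": {"name": "crm.updateContact"},
    "what_input": {"contact_id": "crm_49281"},
    "what_output": {"status": "updated"},
    "model": {"provider": "anthropic", "name": "claude-sonnet-4"},
    "what_cost": {"usd": 0.0041},
    # "why_allowed" deliberately left out to show a failing event
}
print(missing_audit_fields(event))
```

Running a check like this over a 30-day export is a quick way to tell a score-2 log from a score-3 log before you test immutability.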

Kill-switch: per-agent toggle measured at under 10 seconds, per-tool revoke under 30 seconds, full-org pause under 60 seconds. These aren't aspirational numbers. They're the thresholds we test before any buyer goes live. A runaway agent that can't be stopped in under a minute on the org-wide path is a compliance liability. Claimed kill-switches that haven't been measured under load are scored 1, not 3. An architecture that scores 3 on this axis wires the kill-switch at the policy layer, not at the application level.
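The three thresholds above translate directly into a pass/fail check on measured latencies. This is a sketch under our own assumptions: the path names and the `kill_switch_failures` helper are hypothetical, and an unmeasured path counts as a failure because a claim without a measurement does not earn a 3.

```python
# Limits in seconds, taken from the thresholds above ("under" means strictly less).
KILL_SWITCH_LIMITS_S = {"per_agent": 10, "per_tool": 30, "org_wide": 60}


def kill_switch_failures(measured_s: dict) -> list:
    """Return the paths that are unmeasured or not strictly under their limit."""
    failures = []
    for path, limit in KILL_SWITCH_LIMITS_S.items():
        latency = measured_s.get(path)
        if latency is None or latency >= limit:
            failures.append(path)
    return failures


# Example: per-tool revoke was never measured, org-wide pause is too slow.
print(kill_switch_failures({"per_agent": 8.0, "org_wide": 75.0}))
```

An empty list means all three paths were measured and came in under their limits.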

Policy gate architecture: the orchestrator → policy gate → tool call → audit log → revoke-token path that every score-3 platform must implement.
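The orchestrator → policy gate → tool call → audit log → revoke path can be sketched in memory. Everything here is illustrative: `PolicyGate`, its fields, and the decision strings are hypothetical names, the in-memory list stands in for an append-only store, and a real gate would also record model, cost, and the matched policy rule.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class PolicyGate:
    """Every tool call passes through here; nothing calls a tool directly."""
    allowed: dict                                    # agent_id -> set of tool names
    revoked_agents: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)    # stand-in for an append-only store

    def revoke(self, agent_id: str) -> None:
        """Kill-switch: takes effect on the agent's next tool call."""
        self.revoked_agents.add(agent_id)

    def call(self, agent_id: str, tool_name: str,
             tool: Callable[..., Any], **kwargs: Any) -> Any:
        if agent_id in self.revoked_agents:
            decision = "denied:revoked"
        elif tool_name not in self.allowed.get(agent_id, set()):
            decision = "denied:no_policy_rule"
        else:
            decision = "allowed"
        output = tool(**kwargs) if decision == "allowed" else None
        # Denied calls are logged too, so the log answers "why was it blocked".
        self.audit_log.append({
            "who": agent_id, "what_tool": tool_name, "what_input": kwargs,
            "what_output": output, "why_allowed": decision,
        })
        if decision != "allowed":
            raise PermissionError(f"{agent_id} -> {tool_name}: {decision}")
        return output


def update_contact(contact_id: str) -> dict:
    return {"status": "updated", "contact_id": contact_id}


gate = PolicyGate(allowed={"agent-1": {"crm.updateContact"}})
gate.call("agent-1", "crm.updateContact", update_contact, contact_id="crm_49281")
gate.revoke("agent-1")
```

Because the revoke check sits in the gate rather than in each agent, stopping an agent does not depend on the agent's own code cooperating.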

Axis deep-dive: model-agnostic + lock-in cost

Four lock-in vectors. First: proprietary workflow DSL (you can't export the workflow definition in a format another orchestrator can run). Second: walled-garden model selection (the vendor picks the model, you don't bring your API key). Third: proprietary eval format (your eval set won't run in Ragas or any open framework). Fourth: prompts-as-platform-IP (the vendor's ToS owns your prompts). Any two of these together and your migration cost is measured in weeks, not hours.
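The two-vector rule above is easy to turn into a screening check. This is a sketch with hypothetical key names; counting an unanswered question as a present vector is our assumption, on the grounds that a vendor who won't answer hasn't cleared the risk.

```python
# Hypothetical keys for the four lock-in vectors described above.
LOCK_IN_VECTORS = (
    "proprietary_workflow_dsl",
    "walled_garden_models",
    "proprietary_eval_format",
    "prompts_owned_by_vendor",
)


def lock_in_risk(answers: dict) -> tuple:
    """Return (vector count, migration scale) for one vendor."""
    # Assumption: an unanswered vector counts as present.
    count = sum(1 for v in LOCK_IN_VECTORS if answers.get(v, True))
    scale = "weeks" if count >= 2 else "hours"
    return count, scale


print(lock_in_risk({
    "proprietary_workflow_dsl": True,
    "walled_garden_models": False,
    "proprietary_eval_format": True,
    "prompts_owned_by_vendor": False,
}))
```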

We moved a buyer off a walled-garden platform in 2026-Q1. The re-implementation tally: 38 engineer-hours to re-implement 24 workflows, 22 hours to rebuild the eval set in an open format, 14 hours to reconstruct the audit log from partial activity records. Total: 74 hours. None of that cost was visible in the platform's per-seat pricing at purchase time. A score-3 model-agnostic platform would have required 0 of those 74 hours. Most of those re-implemented workflows looked like our customer-service automation reference architecture.

Model latency + cost benchmarks — 2026-Q2, 50k calls/month

Stack class	Model	p50 latency	Per-call cost	Notes
Assembled stack	Claude Sonnet 4	840ms	$0.04	Claude Sonnet 4 + pgvector + Inngest. 3 production buyers, 2026-Q2.
Assembled stack	GPT-5	920ms	$0.06	Same Inngest + pgvector stack, OpenAI API key. 2026-Q2.
Assembled stack	Llama 4 Scout (self-hosted)	680ms	$0.009	Self-hosted on H100. Lower latency at cost of infra overhead. 2026-Q2.
Managed platform	Claude Sonnet 4	1,120ms	$0.12	Lindy managed platform at 50k calls/month tier. All-in per-call including platform margin. 2026-Q2.
Managed platform	GPT-5	1,340ms	$0.15	Managed platform median. Higher latency due to platform routing layer. 2026-Q2.
Workflow hybrid	Claude Sonnet 4	980ms	$0.07	Zapier Agents / n8n AI nodes at 50k calls/month. Includes workflow platform overhead. 2026-Q2.

Build vs buy vs assemble: three paths

Buy: managed platforms (Lindy, Vellum, Gumloop, Zapier Agents) get you to a first workflow in hours. Lock-in compounds. Eval gates are typically missing. Governance scoring is light. Fast start, expensive exit.

Build: hand-roll on LangGraph + Temporal + Langfuse + pgvector. See how we wire Claude agents on LangGraph for the implementation details. Maximum control, full portability, you own ops entirely. Slow to first workflow. Best above 100 workflows with a dedicated platform team.

Assemble: managed orchestration (Inngest, Temporal Cloud) + your model key (Claude Sonnet 4, GPT-5, Llama 4) + open eval (Ragas, Langfuse) + open audit (OpenTelemetry to your data lake). Buy the ops primitives, own the workflow definition. Our AI engineering practice at paiteq.com ships most production buyers on this path. You get managed reliability without proprietary lock-in. The median assembled stack scores 78-84/100 on our rubric. Managed platforms score 48-62/100.

Decision rule: under 5 workflows and no compliance burden, buy. Regulated environment, multi-model requirement, or eval-gated delivery, assemble. More than 100 workflows with a dedicated platform team, build. The rule isn't about cost. It's about what failure mode you can tolerate at scale. Whichever path you choose, put eval gates in place before the first production traffic hits.
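The decision rule can be written as one function. This is a sketch, not a substitute for the audit: `choose_path` and its parameter names are hypothetical, the order of the checks is our reading of the rule, and the fallback for more than 100 workflows without a dedicated team is an assumption the rule does not cover.

```python
def choose_path(workflows: int, regulated: bool = False, multi_model: bool = False,
                eval_gated: bool = False, dedicated_platform_team: bool = False) -> str:
    """Return "buy", "assemble", or "build" for a planned workflow count."""
    if workflows > 100 and dedicated_platform_team:
        return "build"
    if regulated or multi_model or eval_gated:
        return "assemble"
    if workflows < 5:
        return "buy"
    # 5-100 workflows without compliance needs: assemble.
    # Assumption: over 100 workflows without a platform team also lands here.
    return "assemble"


print(choose_path(3))                                   # small, unregulated shortlist
print(choose_path(3, regulated=True))                   # compliance overrides size
print(choose_path(150, dedicated_platform_team=True))   # large estate with a team
```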

Build vs Buy vs Assemble — AI automation platform decision tree

How many workflows?

Start here: count distinct automation workflows to ship.

Under 5 + no compliance

No SOC 2 / HIPAA / EU AI Act. BUY: Lindy, Vellum, Gumloop.

5-100, regulated or multi-model

ASSEMBLE: Inngest + BYO model key + Ragas + OpenTelemetry audit.

5-100, no compliance

ASSEMBLE: same stack, lower governance overhead. Wins above 10k calls/mo.

Over 100 workflows

BUILD: LangGraph + Temporal + Langfuse + pgvector. Dedicated team required.

Weekly eval gate

All paths except BUY. Ragas on every merge. $14 per 1,840-doc run. Gate blocks prod promote on regression.

Kill-switch wired at policy gate

All assembled and build paths. Per-agent toggle under 10s. Audit log immutable.

Code ownership transferred day 1

Standard on assembled and build paths. Portable workflow YAML, BYO model key, open eval suite.

TCO math: per-workflow, per-month, all-in

Every vendor publishes per-seat pricing. Nobody publishes the rest: per-token API cost, per-call orchestrator fee, eval harness compute, ops engineer hours per month, and the migration insurance (what you'd spend to leave). We ran the full TCO on a 50k-call/month sales-ops workflow across 4 stacks in 2026-Q2. The per-seat number was the smallest line item in three of the four.

The curves cross at 10k calls/month (buy wins below, assemble wins above) and again at 250k calls/month (build wins if you have the dedicated team). At 50k calls/month, our assembled Claude Sonnet 4 + pgvector + Inngest stack came in at $0.04 per call median. The equivalent workflow on the Lindy managed platform came in at $0.12. Zapier AI Copilot at 50k calls/month billed at $0.31 per credit-equivalent. All figures 2026-Q2.
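The crossover logic is simple arithmetic once the line items are separated into fixed and per-call costs. The sketch below shows the shape of that calculation. Every number in it is a placeholder chosen for the example, not a measured figure from this guide, and `StackCosts`, `monthly_tco`, and `break_even_calls` are hypothetical names. Migration insurance is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StackCosts:
    """Monthly cost inputs for one stack. Fill these from your own invoices."""
    per_call_usd: float          # model tokens + orchestrator fee, per call
    seats: int
    per_seat_usd: float
    eval_runs_per_month: int
    eval_run_usd: float
    ops_hours_per_month: float
    ops_hourly_usd: float

    def fixed_usd(self) -> float:
        """Costs that do not scale with call volume."""
        return (self.seats * self.per_seat_usd
                + self.eval_runs_per_month * self.eval_run_usd
                + self.ops_hours_per_month * self.ops_hourly_usd)


def monthly_tco(stack: StackCosts, calls_per_month: int) -> float:
    """All-in monthly cost at a given call volume."""
    return stack.fixed_usd() + stack.per_call_usd * calls_per_month


def break_even_calls(a: StackCosts, b: StackCosts) -> Optional[float]:
    """Call volume where a and b cost the same; None if the lines never cross."""
    if a.per_call_usd == b.per_call_usd:
        return None
    volume = (b.fixed_usd() - a.fixed_usd()) / (a.per_call_usd - b.per_call_usd)
    return volume if volume > 0 else None


# Placeholder inputs only: low fixed cost vs low per-call cost.
managed = StackCosts(0.10, seats=3, per_seat_usd=50.0, eval_runs_per_month=0,
                     eval_run_usd=0.0, ops_hours_per_month=2, ops_hourly_usd=100.0)
assembled = StackCosts(0.05, seats=0, per_seat_usd=0.0, eval_runs_per_month=4,
                       eval_run_usd=14.0, ops_hours_per_month=8, ops_hourly_usd=100.0)
print(break_even_calls(managed, assembled))
```

The stack with the higher fixed cost and lower per-call cost wins above the break-even volume, which is why the buy-versus-assemble answer depends on your real call count.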

All-in monthly TCO at 50k calls/month — 4 stacks compared (2026-Q2)

Stack	All-in monthly TCO	Notes
Claude + Inngest assembled stack	$760/mo	API + orchestrator + Ragas eval harness + ops time. Best TCO above 10k calls/mo. 2026-Q2.
n8n (self-hosted)	$940/mo	Self-hosted n8n + AI nodes + pgvector. Includes server/infra cost estimate. 2026-Q2.
Lindy managed platform	$1,450/mo	All-in at 50k calls/month tier. Per-seat + per-call + platform overhead. Public pricing, 2026-Q2.
Zapier AI Copilot	$1,900/mo	$0.31 per credit-equivalent at 50k calls/month. Public pricing, 2026-Q2.

Vendor lock-in red flags (proprietary DSL, walled-garden models)

Seven questions to ask any vendor on the day-1 sales call. If they hesitate on any of these, you have your answer.

Engineer note: 7 red-flag questions for day-1 vendor calls. Ask these before any demo or POC commitment.

1. "Export my 24 workflows as portable JSON or YAML right now." Score-0 platforms say they need 2 weeks to build an export feature.

2. "Can I bring my own Claude or OpenAI API key?" Walled-garden platforms say no or add markup to the per-call cost when you do.

3. "Export my full audit log for the past 30 days in CSV." If this takes more than 5 minutes, the log isn't production-grade.

4. "Export my eval suite and results so I can run them in Ragas." Proprietary eval formats are the most expensive lock-in vector.

5. "Does your VPC deploy or self-host option exist in writing?" If it's on the roadmap, it doesn't exist.

6. "Where does my data reside? Show me the DPA." No DPA means no EU/UK deployment for regulated buyers.

7. "Show me your OpenTelemetry trace export." Platform-native dashboards are useful. Exportable spans into your data lake are necessary.

Operator note (2026-Q1): we ran this checklist against 8 platforms pre-engagement. Five failed on questions 1 and 4. Three failed on question 6. Two passed all 7.

The migration cost data backs this up. Our 2026-Q1 buyer migration: 38 hours to re-implement 24 workflows, 22 hours to rebuild the eval set in an open format, 14 hours to reconstruct the audit log from partial activity records. All three costs would have been zero on a score-3 platform. None were visible in the per-seat price at contract time. The buyer's original decision came from a vendor-authored comparison that scored platforms on integration breadth and time-to-first-workflow only. Neither of those axes reveals portability risk. That's the gap this rubric closes.

Scoring 6 named platforms against the 10-axis rubric

Scored from public documentation plus 2026-Q2 hands-on testing. Four of the six platforms earn at least one zero. None scored above 90/100. We'd rather show you the zeros than pretend they don't exist. The weighted total uses the rubric weights above; raw axis scores are 0-3; multiply each by its axis weight, sum the ten products, and divide by 3 for the total out of 100. For a worked example of this rubric in a single workflow domain, see the 13-tool operator rubric we ran on sales-ops platforms.

Platform	Eval/15	Audit/12	Model-agnostic/12	Kill-switch/10	Portability/10	Integrations/10	HITL/8	Observability/8	TCO/8	Governance/7	Score /100	Best for
Lindy	1	1	2	1	1	3	1	2	2	1	49.3	Sales + personal productivity, low-compliance
Vellum	3	1	2	0	1	2	2	3	1	2	57.7	LLM product teams with eval culture
Gumloop	0	0	2	0	0	3	1	1	3	0	31.3	SMB no-governance workflows only
Moveworks	1	2	1	1	0	2	2	2	1	3	47.3	Enterprise IT helpdesk + ITSM
n8n (AI nodes)	1	1	3	1	3	3	1	2	2	1	60.0	Technical teams, self-hosted compliance, portability priority
Zapier Agents	0	1	1	0	0	3	1	1	2	1	31.0	High-breadth integrations, low workflow complexity

6 platforms scored against the 10-axis rubric (2026-Q2). Axis scores 0-3; weighted total /100. Honest zeros included.

The pattern: integration breadth is the one axis most platforms max out (Lindy, Gumloop, n8n, and Zapier Agents all score 3), while eval, kill-switch, and portability scores stay low. Vellum's 3 on eval and n8n's 3 on portability are the only top marks across those three axes. Gumloop scores zero on eval, audit log, kill-switch, portability, and governance. No platform on this list earns a 3 on kill-switch latency. That's not an oversight; we tested all six and none could demonstrate a per-agent toggle under 10 seconds with a logged result.

Audit-log payload + workflow definition (what portability looks like)

Three code blocks: the audit-log JSON payload a score-3 platform must emit per event, a portable workflow YAML that runs across Inngest, Temporal, and n8n with a thin adapter, and a Python eval gate that any platform's CI must be able to run before promoting to production. This is the only code on the SERP for this query.

json

{
  "event_id": "evt_01HX9Q2NBVP3K8M4D7FCGT6WY",
  "timestamp": "2026-Q2T14:32:07.841Z",
  "who": {
    "user_id": "usr_7abc92",
    "role": "ai-agent",
    "session_id": "sess_01HX9Q2NBVP3",
    "parent_workflow": "sales-ops-enrichment-v3"
  },
  "what_tool": {
    "name": "crm.updateContact",
    "version": "2.1.4",
    "registry": "internal-tool-registry"
  },
  "what_input": {
    "contact_id": "crm_49281",
    "fields": { "industry": "healthtech", "headcount": 120 }
  },
  "what_output": {
    "status": "updated",
    "crm_response_ms": 84
  },
  "model": {
    "provider": "anthropic",
    "name": "claude-sonnet-4",
    "version": "20260301"
  },
  "what_cost": {
    "input_tokens": 1240,
    "output_tokens": 88,
    "usd": 0.0041
  },
  "why_allowed": {
    "policy_rule": "sales-ops-crm-write-v2",
    "policy_version": "2026-03-15",
    "approver": "policy-engine",
    "hitl_required": false
  },
  "immutable": true,
  "store": "s3://audit-logs-prod/2026-Q2/05/evt_01HX9Q2NBVP3K8M4D7FCGT6WY.json"
}

yaml

# Portable workflow definition — runnable on Inngest, Temporal, or n8n
# with a thin adapter layer (adapter swaps event/activity/node primitives)
name: sales-ops-contact-enrichment
version: "3.2.1"
orchestrator: inngest  # swap to: temporal | n8n
model:
  provider: anthropic
  name: claude-sonnet-4
  bring_your_key: true  # never vendor-locked
eval_gate:
  runner: ragas
  metrics: [faithfulness, answer_relevancy, context_precision]
  threshold:
    faithfulness: 0.85
    answer_relevancy: 0.80
    context_precision: 0.75
  on_regression: block_promote  # gates merge to prod
steps:
  - id: fetch-contact
    tool: crm.getContact
    input: { contact_id: "${trigger.contact_id}" }
    audit: required
  - id: enrich-with-model
    model_call: true
    system_prompt_ref: prompts/sales-enrichment-v3.txt
    tools_allowed: [web.search, crm.updateContact]
    hitl:
      required_if: confidence < 0.72
      timeout_s: 300
      escalate_to: sales-manager-queue
    kill_switch:
      per_agent_toggle_ms: 8000
      per_tool_revoke_ms: 25000
  - id: write-audit
    tool: audit.appendImmutable
    always_run: true
    fields: [who, what_tool, what_input, what_output, model, what_cost, why_allowed]
portability:
  export_format: json
  adapters_available: [inngest, temporal, n8n]
  prompt_ownership: customer

python

"""CI eval gate — blocks promote-to-prod on regression.
Run via: python eval_gate.py --corpus corpus.jsonl --threshold 0.85
Compatible with any platform that exposes model output as JSONL."""
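import argparse
import json
import sys

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

CORPUS = "corpus.jsonl"  # 1,840-doc eval set (our 2026-Q1 internal set)
THRESHOLD = 0.85         # faithfulness floor


def load_corpus(path: str) -> Dataset:
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return Dataset.from_list(rows)


def run_eval(dataset: Dataset):
    """Run the three Ragas metrics and return the Ragas result object."""
    return evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )


def gate(result, threshold: float) -> bool:
    """Pass only if mean faithfulness meets the floor."""
    score = result["faithfulness"]
    # Depending on the Ragas version, this lookup may return per-row scores
    # instead of a single float, so average a list before comparing.
    if isinstance(score, (list, tuple)):
        score = sum(score) / len(score)
    print(f"Faithfulness: {score:.4f} (threshold {threshold})")
    return score >= threshold


if __name__ == "__main__":
    # Parse the flags the usage line in the docstring advertises.
    parser = argparse.ArgumentParser()
    parser.add_argument("--corpus", default=CORPUS)
    parser.add_argument("--threshold", type=float, default=THRESHOLD)
    args = parser.parse_args()

    result = run_eval(load_corpus(args.corpus))
    if not gate(result, args.threshold):
        print("EVAL GATE FAILED: blocking promote-to-prod")
        sys.exit(1)  # CI pipeline sees non-zero exit, blocks merge
    print("EVAL GATE PASSED")
    sys.exit(0)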
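The script above gates on faithfulness only, while the workflow YAML sets a floor for all three metrics. A gate that checks every floor can be sketched without touching Ragas at all: it takes a plain dict of mean scores. The `failing_metrics` helper is a hypothetical name, the thresholds are copied from the YAML, and treating a missing metric as a failure is our assumption.

```python
# Floors copied from the eval_gate section of the workflow YAML.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
}


def failing_metrics(scores: dict, thresholds: dict = THRESHOLDS) -> dict:
    """Return {metric: score} for every metric below its floor.

    A metric missing from `scores` is reported with a score of None.
    """
    failures = {}
    for metric, floor in thresholds.items():
        score = scores.get(metric)
        if score is None or score < floor:
            failures[metric] = score
    return failures


# Example with made-up scores: relevancy regressed, the other two hold.
failures = failing_metrics(
    {"faithfulness": 0.91, "answer_relevancy": 0.78, "context_precision": 0.80}
)
print(failures or "all floors met")
```

In CI, a non-empty result would map to the same non-zero exit the single-metric gate uses.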

Dated 2026-Q2 cost + reliability benchmarks across stack classes

Every number here is measured, not estimated. Sources: 3 production buyer deployments in 2026-Q2, our internal eval harness, and public pricing pages verified 2026-05-24. The MIT NANDA 2025 survey (5% production rate) is the only third-party figure. Platform pricing pages were cross-checked against live plan dashboards on 2026-05-24. Any platform that changed pricing between that date and your reading may show different per-call costs — verify current tiers before committing to a stack at scale.

Dated 2026-Q2 benchmarks: per-call cost, reliability, eval cadence, and lock-in migration hours. All measured figures.

$0.04/call

ASSEMBLE STACK COST

Median per-agent call. Claude Sonnet 4 + pgvector + Inngest. 3 production buyers, 2026-Q2.

$0.12/call

MANAGED PLATFORM COST

Lindy at 50k calls/month tier. Public pricing, 2026-Q2.

$0.31/credit-equiv

WORKFLOW HYBRID COST

Zapier AI Copilot at 50k calls/month. Public pricing, 2026-Q2.

$14 total

EVAL HARNESS COST

Full 1,840-doc Ragas run. Claude API spend only. getwidget.dev internal, 2026-Q1.

74 hrs

MIGRATION COST

24-workflow buyer off walled-garden platform: 38 workflow + 22 eval + 14 audit-log. 2026-Q1.

5%

PRODUCTION RATE

Enterprise AI automation pilots reaching production scale. MIT NANDA 2025.

0 pct

AGENT RUN SUCCESS

Assembled stack, p50 across 3 buyers, retry depth 3, 2026-Q2. Task completion rate at retry depth 3.

0 ,340ms

P99 LATENCY

Claude Sonnet 4, assembled stack, 50k calls/mo workload, 2026-Q2.

DIY: score your own shortlist in a spreadsheet

The rubric isn't proprietary. Take it. Here's the 6-step process we follow before every platform recommendation, and a Python script that automates the scoring once you've filled in the YAML. We've run this process before recommending platforms to every buyer we've worked with. The ranked output has disqualified at least one platform in every engagement where we've used it. The scoring takes under a day of research for a three-platform shortlist.

1. List your candidate platforms (three maximum; more adds noise).

2. For each, open their public docs and the day-1 sales call transcript or recording.

3. Score 0-3 per axis using the definitions in the rubric table above. Score from evidence only (public docs + live demo), not from vendor claims in a sales pitch.

4. Apply the rubric weights from the Weight column.

5. Rank by weighted total.

6. Disqualify any platform with 2 or more axis scores of zero on axes you've flagged as must-have for your compliance environment.

The Python script below automates steps 4-6 from a YAML input file you populate during step 3.

"""rubric_score.py — score your AI automation platform shortlist.
Usage: python rubric_score.py --input platforms.yaml

platforms.yaml format:
  - name: Lindy
    eval_coverage: 1
    audit_log: 1
    model_agnostic: 2
    kill_switch: 1
    portability: 1
    integrations: 3
    hitl: 1
    observability: 2
    tco_transparency: 2
    governance: 1
  - name: n8n
    eval_coverage: 1
    ...
"""
import argparse

import yaml  # PyYAML (pip install pyyaml)

WEIGHTS = {
    "eval_coverage": 15,
    "audit_log": 12,
    "model_agnostic": 12,
    "kill_switch": 10,
    "portability": 10,
    "integrations": 10,
    "hitl": 8,
    "observability": 8,
    "tco_transparency": 8,
    "governance": 7,
}

def score(platform: dict) -> float:
    """Score a platform 0-100 using the 10-axis rubric."""
    total = 0.0
    for axis, weight in WEIGHTS.items():
        raw = platform.get(axis, 0)  # 0 if axis missing
        total += (raw / 3) * weight  # normalise 0-3 to 0-1, apply weight
    return round(total, 1)

def flag_zeros(platform: dict) -> list:
    """Return the axes scored zero; a missing axis counts as zero."""
    return [ax for ax in WEIGHTS if platform.get(ax, 0) == 0]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="platforms.yaml")
    args = parser.parse_args()

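    # Use a context manager so the input file is closed after parsing.
    with open(args.input, encoding="utf-8") as fh:
        platforms = yaml.safe_load(fh)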
    results = sorted(
        [{"name": p["name"], "score": score(p), "zeros": flag_zeros(p)} for p in platforms],
        key=lambda x: x["score"], reverse=True
    )

    print(f"\n{'Platform':<25} {'Score /100':>10}  Zero axes")
    print("-" * 60)
    for r in results:
        zeros = ", ".join(r["zeros"]) if r["zeros"] else "none"
        print(f"{r['name']:<25} {r['score']:>10}  {zeros}")
    print()
    disqualified = [r for r in results if len(r["zeros"]) >= 2]
    if disqualified:
        print("Disqualified (2+ zero axes):")
        for r in disqualified:
            print(f"  {r['name']}: {', '.join(r['zeros'])}")
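
Step 6 limits disqualification to must-have axes, while the script above checks all ten. The sketch below shows one way to narrow the check. It is a minimal illustration under stated assumptions, not part of the original script: the `must_have` set and the `ExampleVendor` scores are hypothetical, and the Lindy scores are the sample values from the script's docstring.

```python
"""Sketch: weighted score plus a must-have zero check (illustrative only)."""

WEIGHTS = {
    "eval_coverage": 15, "audit_log": 12, "model_agnostic": 12,
    "kill_switch": 10, "portability": 10, "integrations": 10,
    "hitl": 8, "observability": 8, "tco_transparency": 8, "governance": 7,
}

def score(platform: dict) -> float:
    """Weighted 0-100 score: sum(score * weight) / 3, since axis scores run 0-3."""
    weighted = sum(platform.get(ax, 0) * w for ax, w in WEIGHTS.items())
    return round(weighted / 3, 1)

def disqualifying_zeros(platform: dict, must_have: set) -> list:
    """Zero-scored axes, restricted to the axes the buyer marked must-have."""
    return [ax for ax in WEIGHTS if ax in must_have and platform.get(ax, 0) == 0]

# Sample values from the rubric_score.py docstring.
lindy = {
    "name": "Lindy", "eval_coverage": 1, "audit_log": 1, "model_agnostic": 2,
    "kill_switch": 1, "portability": 1, "integrations": 3, "hitl": 1,
    "observability": 2, "tco_transparency": 2, "governance": 1,
}
# Hypothetical platform, invented here to exercise the zero check.
example_vendor = {
    "name": "ExampleVendor", "eval_coverage": 2, "audit_log": 0, "model_agnostic": 2,
    "kill_switch": 0, "portability": 1, "integrations": 0, "hitl": 2,
    "observability": 2, "tco_transparency": 1, "governance": 1,
}
# Hypothetical must-have set for a regulated buyer.
must_have = {"audit_log", "kill_switch", "governance"}

for p in (lindy, example_vendor):
    zeros = disqualifying_zeros(p, must_have)
    verdict = "disqualified" if len(zeros) >= 2 else "kept"
    print(f"{p['name']:<15} {score(p):>5}  {verdict}  {', '.join(zeros) or 'none'}")
```

The weights sum to 100, so a platform scoring 3 on every axis lands at exactly 100. Whether the two-zero rule should cover all axes or only the must-have subset is a buyer decision; the stricter all-axes version is what the script above implements.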

FAQ

What is an AI automation platform?

An AI automation platform combines six components into one runtime: workflow orchestrator, model layer (LLM API or self-hosted), tool registry, eval gate, audit log, and HITL controls. Examples in 2026 include Lindy, Vellum, Gumloop, Moveworks, n8n (with AI nodes), and Zapier (with Agents). The distinction from RPA (UiPath, Automation Anywhere) is the model layer: AI automation reasons over unstructured inputs; RPA replays deterministic UI scripts.

AI automation platform vs AI automation agency — what's the difference?

A platform is software you license and operate (Lindy, Vellum, Zapier Agents). An AI automation agency like ours runs the audit, builds the workflows on a platform or assembled stack, wires eval gates and audit logs, and owns delivery. Platforms are tools; agencies pick the tool, score it against your use case, and ship the production system. If your team has 2+ AI-experienced engineers and under 10 workflows, buy a platform. If you need regulated-grade audit logs, multi-model portability, or weekly eval gates, hire an agency that assembles on portable primitives.

What's the difference between AI automation and RPA?

RPA (UiPath, Automation Anywhere, Blue Prism) replays deterministic UI scripts against legacy apps. It works well for stable, structured workflows but breaks when the screen changes. AI automation reasons over unstructured inputs (emails, PDFs, voice, ambiguous user intent) using an LLM, then calls tools to execute. Its failure mode is different: it fails when the eval gate isn't wired. Many platforms now blend both: UiPath ships agent runners, n8n ships AI nodes.

How do I score an AI automation platform before buying?

Run the 10-axis rubric above. Axis weights are listed in the rubric table (eval coverage tops the list at weight 15; governance anchors at weight 7). Score 0-3 per axis from public docs plus a day-1 sales call. Walk from any platform with 2 or more zero scores. Most vendor-published lists score on 5-6 axes that conveniently favor the publisher; this 10-axis rubric adds the governance and lock-in axes nobody discloses.

What does an AI automation platform cost?

Per-call costs in 2026-Q2: $0.04 median on an assembled stack (Claude Sonnet 4 + pgvector + Inngest), $0.12 median on managed agent platforms (Lindy at 50k calls/month), $0.31 per credit-equivalent on Zapier AI Copilot. Add per-seat pricing on most managed platforms, eval harness compute ($14 per full Ragas run on a 1,840-doc corpus), and ops engineering hours. Buy wins below 10k calls/month; assemble wins above 25k calls/month. Start the audit conversation to scope your real call volume before committing to a stack.
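
To make the volume thresholds concrete, the sketch below computes how much per-call spend an assembled stack saves over a managed platform at a given monthly volume, using the two medians quoted above. It is a rough model, not a TCO calculator: the `extra_fixed_cost` parameter (added monthly ops and eval overhead on the assembled side) is a hypothetical input the figures above do not quantify, and per-seat pricing is ignored.

```python
# Medians quoted above, held in whole cents to avoid float rounding surprises.
ASSEMBLED_CENTS_PER_CALL = 4   # $0.04, assembled stack
MANAGED_CENTS_PER_CALL = 12    # $0.12, managed agent platform

def monthly_per_call_savings(calls_per_month: int) -> float:
    """Dollars saved per month on per-call spend by assembling instead of buying."""
    gap_cents = MANAGED_CENTS_PER_CALL - ASSEMBLED_CENTS_PER_CALL
    return gap_cents * calls_per_month / 100

def break_even_calls(extra_fixed_cost: float) -> float:
    """Monthly volume at which per-call savings cover a given extra fixed cost.

    extra_fixed_cost is a hypothetical dollar figure for the assembled stack's
    added monthly overhead; it does not come from the benchmarks above.
    """
    gap_cents = MANAGED_CENTS_PER_CALL - ASSEMBLED_CENTS_PER_CALL
    return extra_fixed_cost * 100 / gap_cents

for volume in (10_000, 25_000, 50_000):
    saved = monthly_per_call_savings(volume)
    print(f"{volume:>6} calls/month: ${saved:,.2f} saved on per-call spend")
```

On these figures the per-call gap is worth $800 a month at 10k calls and $2,000 a month at 25k. One way to read the buy-below-10k, assemble-above-25k guidance is that the assembled stack's added overhead has to fit inside that gap before assembling pays off.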

Can I switch AI automation platforms later?

Only if you bought portable primitives from day one. The four lock-in vectors: proprietary workflow DSL (you can't export the definition), walled-garden model selection (vendor picks the LLM), proprietary eval format (your eval set won't run in open frameworks), and prompts-as-IP (vendor owns your prompts). We migrated a 24-workflow buyer off a walled-garden platform in 2026-Q1: 38 engineer-hours to re-implement workflows, 22 hours for the eval set, 14 hours for the audit log. None of that appeared in the per-seat price. Demand portable JSON/YAML workflow exports and BYO model keys on day one.
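
The migration figures above can be turned into a rough planning number. The sketch below totals the three line items and extrapolates to other workflow counts. The linear scaling is an assumption made here for illustration; it rests on a single 24-workflow migration and will not hold for every stack.

```python
# Engineer-hours from the 2026-Q1 migration described above.
MIGRATION_HOURS = {"workflows": 38, "eval_set": 22, "audit_log": 14}
WORKFLOWS_MIGRATED = 24

def total_hours() -> int:
    """Total engineer-hours across all three migration line items."""
    return sum(MIGRATION_HOURS.values())

def hours_per_workflow() -> float:
    """Average engineer-hours per migrated workflow, rounded to one decimal."""
    return round(total_hours() / WORKFLOWS_MIGRATED, 1)

def estimated_hours(workflow_count: int) -> float:
    """Linear extrapolation from one data point (an assumption, not a benchmark)."""
    return total_hours() * workflow_count / WORKFLOWS_MIGRATED

print(f"Total: {total_hours()} engineer-hours")
print(f"Per workflow: {hours_per_workflow()} engineer-hours")
print(f"Estimate for 12 workflows: {estimated_hours(12)} engineer-hours")
```

Eval-set and audit-log work may not shrink in proportion to workflow count, so treat the extrapolated figure as a floor for smaller migrations rather than a quote.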

Build vs buy vs assemble — which path for AI automation?

Three paths: buy (managed platforms like Lindy, Vellum, Gumloop — fast start, lock-in compounds), build (LangGraph + Temporal + Langfuse + pgvector — maximum control, slow start, you own ops), assemble (managed orchestration like Inngest or Temporal Cloud plus BYO model key plus open eval and audit — middle path, portable). Under 5 workflows with no compliance requirement: buy. Regulated environment, multi-model, or eval-gated delivery: assemble. Over 100 workflows with a dedicated platform team: build.
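
The three rules above can be written down as a small function. This is a sketch of the stated thresholds only. The parameter names are hypothetical, "regulated environment" is mapped to a single `compliance_required` flag as a simplification, and the function returns every path whose rule matches rather than picking a winner, because the rules above do not state a precedence.

```python
def matching_paths(
    workflows: int,
    compliance_required: bool,
    multi_model: bool,
    eval_gated: bool,
    dedicated_platform_team: bool,
) -> list:
    """Return every path (buy / assemble / build) whose stated rule matches."""
    paths = []
    # Under 5 workflows with no compliance requirement: buy.
    if workflows < 5 and not compliance_required:
        paths.append("buy")
    # Regulated environment, multi-model, or eval-gated delivery: assemble.
    if compliance_required or multi_model or eval_gated:
        paths.append("assemble")
    # Over 100 workflows with a dedicated platform team: build.
    if workflows > 100 and dedicated_platform_team:
        paths.append("build")
    return paths

print(matching_paths(3, False, False, False, False))
print(matching_paths(40, True, False, False, False))
print(matching_paths(150, False, False, False, True))
print(matching_paths(40, False, False, False, False))
```

An empty list means none of the three rules applies (for example, 40 workflows with no compliance, multi-model, or eval-gate requirement), and more than one entry means the rules overlap. Both cases call for scoring the shortlist with the rubric rather than relying on the rule of thumb.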
