Custom AI Solutions vs Off-the-Shelf: 2026 Decision Guide

A 500-seat ops team came to us last quarter with a clean question: license ChatGPT Enterprise at $60/seat/mo ($360K/yr) or commission a custom RAG stack? They'd already heard the vendor pitch. They wanted the math. We gave them the math, a sequencing recommendation, and an honest answer: start with the off-shelf tool for 90 days before spending a dollar on custom. That advice lost us the immediate engagement. Three months later they called back, having hit exactly the ceiling we predicted, and we scoped a hybrid build that covered the 30% of surface area off-shelf couldn't handle.

This guide is that conversation, written as a decision tool. We're an AI software development company that sells custom AI builds. We are explicitly not the neutral party here. What we can offer is what the top SERP results don't: named tools, real per-seat math, a sequencing framework built from watching custom AI fail in the wrong context, and a hybrid architecture diagram our delivery team actually ships.

Below: a 3-path comparison with named products, a 6-criterion decision rubric with scored thresholds, a named-tools matrix across 20 products, per-seat TCO math at 500 and 2,000 seats, the buy-first sequencing pattern (including why we recommend it before anyone hires us), two SVG architecture diagrams, 2026-Q1 cost benchmarks across every layer of the custom stack, an eval methodology for diagnosing off-shelf ceilings, and a Python/TypeScript DIY scorecard you can run on your own shortlist.

Custom AI vs off-the-shelf vs hybrid: working definitions

Three distinct paths exist and most teams conflate two of them, usually the wrong two.

Off-the-shelf AI

SaaS you license per seat. The vendor owns the model, prompts, retention policy, and roadmap. You rent capability, not software. Examples: ChatGPT Enterprise ($60/seat/mo), Microsoft 365 Copilot ($30/seat/mo), Glean ($40/seat/mo enterprise), Notion AI, GitHub Copilot. Time to deploy: hours. You can't call your internal APIs, you don't own the prompts, and your data transits the vendor's infrastructure.

Custom AI

Software you commission and own. A stack built on Claude or OpenAI APIs, pgvector or Pinecone for retrieval, LangGraph or Mastra for orchestration, deployed on your infrastructure with your prompts, your eval gates, and your audit logs. Time to deploy: weeks to months. You own the stack. You can call your internal APIs. You control data residency. You carry the engineering maintenance cost.

Hybrid is the third path, and the one most production teams actually ship in 2026. Off-shelf foundation for the generic surface area (Copilot for code generation, ChatGPT Enterprise for doc drafting, Glean for cross-team search) plus a custom orchestration layer for the proprietary surface area (LangGraph agents calling your internal ERP + RAG over your proprietary corpus + Langfuse traces + Ragas eval gates). The seam between generic and proprietary is the auth boundary. Off-shelf sits outside it; custom sits inside.

The "build or buy" framing is wrong in 2026 because it treats off-shelf and custom as mutually exclusive. They're not. The real question is: which surface area needs which path? That question has a scoreable answer.

The decision rubric: when off-shelf wins, when custom wins, when hybrid wins

For each of the six dimensions, score every column from 0 to 3 by how closely its description matches your situation (3 = strong fit). Sum each column. The column with the highest total wins. This is the same rubric we walk through in the generative-AI build-vs-consult decision, applied specifically to off-shelf vs custom vs hybrid.

Dimension	Off-shelf wins (score 3)	Hybrid wins (score 3)	Custom wins (score 3)
Data residency	Generic productivity data OK transiting vendor infra	Some regulated data; can isolate the proprietary layer	Regulated data must never leave your VPC (HIPAA, SOC 2, GDPR Article 28)
Domain accuracy required	Generic writing/coding/search quality is sufficient	Off-shelf covers 70%; 30% needs proprietary corpus	Recall@5 on your internal corpus must exceed 80%; off-shelf scores 50-64% on specialized domains
Workflow orchestration depth	Productivity assistance only; no internal API calls needed	Needs some internal API calls; custom agent wraps the off-shelf core	Must write to ERP/ticketing/workflow; off-shelf can't reach your systems
Seat count (TCO crossover)	Below 500 seats; off-shelf license math is cheaper than custom build + run	500-2,000 seats; hybrid splits the license spend vs custom investment	Above 2,000 seats; custom build + run cost undercuts stacked off-shelf licenses at scale
Time-to-value	Need productivity gains within 30 days; off-shelf deploys in hours	Can wait 6-8 weeks for the custom orchestration layer on top of off-shelf core	6-12 month build timeline acceptable; accuracy + ownership worth the wait
Regulatory audit needs	No formal AI audit required; internal use only	Audit required for the proprietary layer; off-shelf layer is out of scope	Full AI audit required across all paths; need kill switch + detailed trace logs you own

Score each row 0-3 for your situation. Highest column total = recommended path.

A quick read on the thresholds: off-shelf wins when all six are low-stakes. Custom wins when data residency or accuracy is non-negotiable. Hybrid wins in the common middle ground. Most production teams we audit score highest on hybrid.

Named-tools matrix: what off-shelf and custom actually look like in 2026

The top SERP competitor for this query (Eleks at 6,500 words) names zero off-shelf products. D3Clarity names AWS SageMaker and Vertex AI, which are developer platforms, not buyer-facing products. We name them all. For what AI software development actually involves at the technical layer, see our companion piece.

The best custom AI development stacks in 2026 share a common pattern: a reasoning model at the top, a vector retrieval layer in the middle, and an orchestration framework wiring the two together. The custom rows in the matrix below follow that pattern at the product level. Later sections cover the scoring and TCO model.

Category	Product	Pricing (2026-Q1)	Best for	Ceiling
Off-shelf horizontal	ChatGPT Enterprise	$60/seat/mo	Generic writing, drafting, Q&A at scale	Can't call your internal APIs; data leaves your VPC
Off-shelf horizontal	Microsoft 365 Copilot	$30/seat/mo	Office productivity, Teams, Outlook workflows	Microsoft ecosystem only; no external API calls
Off-shelf horizontal	Gemini for Workspace	~$20/seat/mo	Google Workspace users; Docs/Sheets/Gmail	Best inside Google stack; limited external orchestration
Off-shelf horizontal	Claude for Enterprise	Custom pricing	Policy-constrained orgs needing Constitutional AI guardrails	No orchestration beyond Anthropic's API surface
Off-shelf vertical	Glean	~$40/seat/mo enterprise	Enterprise-wide knowledge search over SaaS tools	Read-only; can't write to your systems
Off-shelf vertical	Notion AI	~$16/seat/mo add-on	Docs and wikis; writing assistance in Notion only	Notion-scoped only; no external data
Off-shelf vertical	Harvey	Custom (legal enterprise)	Legal contract review, regulatory research	Legal domain only; high per-seat cost at scale
Off-shelf vertical	Hippocratic AI	Custom (clinical)	Patient-facing clinical Q&A with safety guardrails	Clinical domain only; regulatory overhead
Off-shelf vertical	GitHub Copilot	$19-39/seat/mo	In-IDE code completion and refactor suggestions	Suggestions only; no custom context injection
Off-shelf vertical	Cursor	$40/seat/mo teams	AI-native IDE for greenfield code writing	IDE-scoped; no production orchestration
Custom: reasoning	Claude Opus 4 / Sonnet 4	$15 / $3 per 1M output tokens	Complex reasoning, document analysis, multi-step agents	You build and maintain the stack
Custom: reasoning	GPT-5-mini	~$2/1M output tokens	High-volume low-cost tasks (classification, extraction)	You build and maintain the stack
Custom: retrieval	pgvector (Postgres)	$50-200/mo self-hosted	Vector similarity search on your proprietary corpus	Requires Postgres ops expertise
Custom: retrieval	Pinecone	$70+/mo managed	Serverless vector DB; no ops overhead	Cost scales with index size + query volume
Custom: orchestration	LangGraph	Open source	Stateful multi-agent workflows with cycle-safe graphs	Requires Python or TypeScript expertise
Custom: orchestration	Mastra	Open source	TypeScript-native agent orchestration; Vercel-friendly	Newer ecosystem; smaller community
Custom: observability	Langfuse	Open source / cloud	Traces, spans, prompt versions, cost tracking	Self-hosted has ops overhead; cloud has data-residency considerations
Custom: eval	Ragas	Open source	RAG eval metrics (recall@5, context precision, faithfulness)	Requires golden-set curation; not zero-effort
Custom: serving	Modal	Usage-based (~$0.04-0.12/GPU-hr)	GPU-accelerated agent runs; ephemeral compute	Cold starts; GPU pricing varies
Custom: serving	Cloudflare Workers	$5/mo + usage	Low-latency edge serving; global distribution	CPU-bound only; no GPU inference

20 named products across off-shelf and custom stack layers, with 2026-Q1 pricing where published.

Real per-seat math: off-shelf license stack vs custom build and run cost

Every competitor says "off-shelf licensing compounds while custom costs stabilize after year one" without writing a single number. Here are the numbers at 500 seats (2026-Q1 list prices).

Annual cost at 500 seats — 2026-Q1 list prices

Scenario	Annual cost (USD)	Calculation
ChatGPT Enterprise only ($60/seat/mo)	$360K/yr	500 × $60 × 12 = $360K/yr recurring. No cap.
Copilot + ChatGPT Enterprise stacked ($90/seat/mo)	$540K/yr	500 × $90 × 12 = $540K/yr. Stacking 2 tools is common for ops teams.
Triple stack: ChatGPT + Copilot + Glean ($130/seat/mo)	$780K/yr	500 × $130 × 12 = $780K/yr. Enterprise IT reality when each team picks its own tool.
Custom RAG + LangGraph agent (build + run, yr 1)	$210K/yr	$100-150K one-time build (6-week pilot shape) + $2-8K/mo runtime ($24-96K/yr) on Claude API + pgvector + Vercel = ~$124-246K in yr 1.
Custom RAG + LangGraph agent (run only, yr 2+)	$60K/yr	$2-8K/mo ongoing = $24-96K/yr. Build cost amortized. Crossover vs stacked off-shelf tools at ~600 seats.

The crossover math: custom beats stacked off-shelf at roughly 600 seats in year two (when the build cost is amortized). Custom beats a single off-shelf tool at roughly 3,000 seats. Below 500 seats with generic productivity needs, off-shelf almost always wins on total cost of ownership. These are 2026-Q1 list-price estimates. Enterprise agreements discount off-shelf tools 15-30%, which pushes the crossover seat count higher.
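The arithmetic in this section can be sketched as a small cost model. This is a simplified illustration under stated assumptions, not the model behind the crossover figures above: it counts only license fees, a one-time build, monthly runtime, and an optional maintenance line. The function names and the midpoint inputs ($125K build, $5K/mo runtime) are assumptions chosen from the ranges quoted here. Because it ignores enterprise discounts and defaults maintenance to zero, treat its break-even as a lower bound rather than a reproduction of the crossover seat counts.

```python
"""Sketch: off-shelf license stack vs. custom build + run, at list prices."""


def offshelf_annual(seats: int, seat_monthly: float) -> float:
    """Annual license cost: seats x per-seat monthly price x 12."""
    return seats * seat_monthly * 12


def custom_total(build: float, run_monthly: float, years: int,
                 maintenance_annual: float = 0.0) -> float:
    """One-time build plus runtime (and optional maintenance) over `years`."""
    return build + (run_monthly * 12 + maintenance_annual) * years


def break_even_seats(seat_monthly: float, build: float, run_monthly: float,
                     years: int, maintenance_annual: float = 0.0) -> float:
    """Seat count where cumulative license spend equals cumulative custom spend."""
    if seat_monthly <= 0 or years <= 0:
        raise ValueError("seat_monthly and years must be positive")
    custom = custom_total(build, run_monthly, years, maintenance_annual)
    return custom / (seat_monthly * 12 * years)


if __name__ == "__main__":
    # Per-seat stack prices quoted in this section.
    for label, price in [("single tool", 60), ("double stack", 90), ("triple stack", 130)]:
        print(f"{label}: ${offshelf_annual(500, price):,.0f}/yr at 500 seats")
    # Assumed midpoints: $125K build, $5K/mo runtime, 3-year horizon.
    seats = break_even_seats(60, 125_000, 5_000, years=3)
    print(f"3-yr break-even vs single tool: {seats:.0f} seats")
```

Pass your negotiated per-seat rate and a `maintenance_annual` figure for internal engineering time to see how far each one moves the break-even.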

The buy-first sequencing pattern (and why most vendors won't tell you)

We sell custom AI builds. We are financially incentivized to tell you to commission custom on day one. We don't.

80% of teams should start with ChatGPT Enterprise plus Copilot for 60-90 days, measure where the off-shelf ceiling actually hits, then commission custom only for the provable gap. The reason: most teams don't know what their ceiling is until they've hit it in production. Spending $120K on a custom RAG stack before you've proven the off-shelf accuracy ceiling is a failure mode we see in 60-seat startups regularly. Glean at $30K/yr would have covered them.

Buy-first sequencing: 0 to 90 days

Day 0: Off-shelf rollout (ChatGPT Enterprise + Copilot)
Day 60: Usage telemetry review (who's using it, how, for what)
Day 90: Ceiling diagnosis (data residency / accuracy / orchestration)
If ceiling hit: custom scope (define the proprietary 30%)
If no ceiling: extend off-shelf (add seats / add vertical tool)

Three ceiling signals worth waiting for before scoping custom: (1) data residency is blocked by your IT team because off-shelf vendor retention policies don't satisfy your compliance requirements; (2) domain accuracy on your internal eval stays below 70% after 60 days of prompt tuning with the off-shelf tool; (3) workflow orchestration is impossible because the off-shelf tool can't write to your ERP, ticketing system, or internal APIs. If you don't hit any of these in 90 days, you don't need custom yet. Buy more seats.
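The three ceiling signals above can be written down as an explicit day-90 check. The sketch below is illustrative only: the `PilotReview` fields and the returned strings are hypothetical names, and the 70% floor is the threshold named in the text.

```python
from dataclasses import dataclass


@dataclass
class PilotReview:
    """Day-90 findings from the off-shelf pilot (field names are illustrative)."""
    residency_blocked: bool    # IT/compliance rejected the vendor's retention terms
    eval_accuracy: float       # internal eval score, 0.0-1.0, after prompt tuning
    needs_write_access: bool   # workflow must write to ERP/ticketing/internal APIs
    tool_can_write: bool       # the off-shelf tool can reach those systems


ACCURACY_FLOOR = 0.70  # threshold named in the text


def ceiling_signals(r: PilotReview) -> list[str]:
    """Return the signals that fired; an empty list means no ceiling yet."""
    signals = []
    if r.residency_blocked:
        signals.append("data_residency")
    if r.eval_accuracy < ACCURACY_FLOOR:
        signals.append("domain_accuracy")
    if r.needs_write_access and not r.tool_can_write:
        signals.append("orchestration")
    return signals


def next_step(r: PilotReview) -> str:
    """No signal: buy more seats. Any signal: scope custom for that gap only."""
    fired = ceiling_signals(r)
    if not fired:
        return "extend off-shelf"
    return "scope custom for: " + ", ".join(fired)
```

Returning the list of fired signals, rather than a yes/no, keeps the custom scope tied to the provable gap instead of the whole surface area.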

We routinely tell prospects: don't hire us yet. Run Copilot for 90 days first. That recommendation loses some immediate engagements. It wins the right ones, because clients who hire us after running the off-shelf pilot have a concrete accuracy gap and a defined orchestration requirement. Those builds ship cleaner and land better outcomes.

Hybrid pattern: off-shelf foundation plus custom orchestration layer

The production reality that nobody covers in the SERP: hybrid is not a compromise. It's the rational allocation of each path to the surface area it's good at. From the generative AI use cases we've shipped, roughly 60% run hybrid: off-shelf for the generic productivity surface, custom for the proprietary accuracy and orchestration surface.

Hybrid architecture: 4-layer platform model

Off-shelf handles the generic surface; custom handles the proprietary surface. The seam is the auth boundary.

Layer 1 is off-shelf productivity (ChatGPT, Copilot, Glean, GitHub Copilot) sitting outside the auth boundary. Layers 2-4 are custom and sit inside. The auth boundary is the seam. Off-shelf handles generic drafting, search, and code suggestions at scale. Custom handles proprietary corpus retrieval, internal API orchestration, and the audit/eval plane your compliance team requires.
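One way to picture the seam is as a routing rule applied to each unit of work. This is a minimal sketch, assuming a hypothetical `Task` description. The field names and the rule itself are illustrative, not the routing logic of any shipped system.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    OFF_SHELF = "off-shelf"  # Layer 1, outside the auth boundary
    CUSTOM = "custom"        # Layers 2-4, inside the auth boundary


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a unit of work; fields are illustrative."""
    kind: str                       # e.g. "drafting", "search", "erp_update"
    touches_proprietary_data: bool  # needs retrieval over the internal corpus
    calls_internal_api: bool        # must read or write internal systems
    needs_audit_trail: bool         # compliance requires traces you own


def route(task: Task) -> Route:
    """Send anything that must stay inside the auth boundary to the custom layers."""
    if (task.touches_proprietary_data
            or task.calls_internal_api
            or task.needs_audit_trail):
        return Route.CUSTOM
    return Route.OFF_SHELF
```

Under this rule, generic drafting goes to the off-shelf layer, while a task that writes to an ERP or needs an owned audit trail goes to the custom layers.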

Reference architecture: hybrid RAG and agent stack we ship

The 6-layer stack our delivery team deploys on production hybrid engagements, with named products and real version IDs at 2026-Q1.

Reference stack: 6-layer hybrid RAG + agent architecture

Layer numbering matches the data flow: request enters at Edge, answer exits at Audit log.

Dated 2026-Q1 cost benchmarks across off-shelf and custom paths

All numbers are 2026-Q1 list prices or our internal measurement on production deployments. Enterprise agreements discount off-shelf tools 15-30%; API pricing may change.

Item	Price (2026-Q1)	Notes
ChatGPT Enterprise	$60/seat/mo	List price. 500 seats = $360K/yr recurring.
Microsoft 365 Copilot	$30/seat/mo	List price. 500 seats = $180K/yr recurring.
Glean Enterprise	~$40/seat/mo	Approximate enterprise tier. Negotiated pricing varies.
Claude Opus 4 output	$15/1M tokens	API list price. Complex reasoning + analysis tasks.
Claude Sonnet 4 output	$3/1M tokens	API list price. Production-grade reasoning at one-fifth the cost of Opus 4.
GPT-5-mini output	~$2/1M tokens	Approximate API list price. High-volume classification and extraction.
pgvector (self-hosted)	$50-200/mo	Postgres RDS / Cloud SQL with pgvector extension. Compute-dependent.
Custom RAG runtime	$2-8K/mo	500-user production deployment. Claude API + pgvector + Vercel + Langfuse.
Full Ragas eval set	$14	Total Claude API spend to run the 1,840-doc Ragas eval suite.
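To turn the per-token prices above into a monthly figure, multiply expected output tokens by the per-1M price. The sketch below is a rough estimate only: the workload numbers are hypothetical, and it covers output tokens alone, so input tokens, embeddings, and hosting are not included.

```python
def output_token_cost(output_tokens: int, price_per_million: float) -> float:
    """Dollar cost of `output_tokens` at a per-1M-output-token list price."""
    return output_tokens / 1_000_000 * price_per_million


def monthly_output_cost(requests_per_day: int, avg_output_tokens: int,
                        price_per_million: float, days: int = 30) -> float:
    """Monthly output-token spend for a steady workload."""
    total_tokens = requests_per_day * avg_output_tokens * days
    return output_token_cost(total_tokens, price_per_million)


if __name__ == "__main__":
    # Per-1M output-token prices as listed in this section.
    prices = {"Claude Opus 4": 15.0, "Claude Sonnet 4": 3.0, "GPT-5-mini": 2.0}
    # Hypothetical workload: 10,000 requests/day, 600 output tokens each.
    for model, price in prices.items():
        print(f"{model}: ${monthly_output_cost(10_000, 600, price):,.2f}/mo")
```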

Eval methodology: how we measure when off-shelf hits the ceiling

"Off-shelf accuracy isn't good enough" is not a scope-of-work argument. It's a measurement. We run four tests before recommending custom. The full AI agent reliability eval methodology covers the agentic layer; here's the ceiling-diagnosis version applied to the off-shelf vs custom question.

Off-shelf ceiling diagnosis — 4-test framework

Test 1: Data residency audit

Where do your prompt, your context, and the completion actually land? Many Enterprise agreements still allow retention for safety fine-tuning. Map the data flow before assuming compliance.

Test 2: Domain accuracy eval

Run Ragas recall@5 on a golden set of 200+ documents from your internal corpus. Off-shelf typically scores 50-64% on specialized domains (legal, clinical, internal docs). If you score >70%, off-shelf is fine.

Test 3: Orchestration audit

Can the off-shelf tool call your internal ERP, ticketing, or workflow APIs? ChatGPT Enterprise cannot; Claude for Enterprise's API can if you build the integration. Determine the API surface gap before scoping custom.

Test 4: TCO model at seat count

Build the 3-year cost curve at your projected seat count. Include the amortized build cost, monthly runtime, and internal engineering maintenance at 0.25 FTE/yr. Off-shelf wins below ~600 seats (stacked) or ~3,000 seats (single tool).
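Test 2 depends on recall@5, so it helps to see what that number measures. The sketch below is plain Python, not the Ragas API. It assumes the common definition (the fraction of a query's relevant documents that appear in the top k retrieved results, averaged over the golden set). The document IDs are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each golden-set query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        raise ValueError("golden set is empty")
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical two-query golden set: ranked retrieval results + labelled answers.
    golden_runs = [
        (["d1", "d7", "d3", "d9", "d2", "d4"], {"d1", "d2"}),  # both found in top 5
        (["d5", "d6"], {"d6", "d8"}),                          # one of two found
    ]
    print(f"recall@5 = {mean_recall_at_k(golden_runs):.2f}")
```

Run the same golden set and the same query list against both the off-shelf tool and the custom retriever so the two scores are comparable.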

Benchmark from our own Ragas eval harness, 2026-Q1: Claude Opus 4 with custom RAG scored 88% recall@5 on a 1,840-document internal corpus. ChatGPT Enterprise scored 64% on the same corpus with identical prompts. Same evaluation harness, same document set, same query distribution. The 24-point gap at that corpus size is well above the 15-point threshold where custom pays off on TCO. Total Claude API spend to run the full 1,840-doc eval set: $14 (2026-Q1).

When the gap between off-shelf and custom recall@5 is greater than 15 points on your corpus, custom RAG pays off. When it's below 5 points, off-shelf wins on total cost of ownership. Between 5 and 15 points, hybrid is the call: off-shelf for the generic surface, custom RAG for the proprietary surface where accuracy matters most.
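The gap rule above reduces to a three-way threshold check. A minimal sketch, assuming recall scores are given in percentage points. The function name is hypothetical, and gaps of exactly 5 or 15 points are treated as hybrid, following the "between 5 and 15" wording.

```python
def path_from_recall_gap(custom_pct: float, offshelf_pct: float) -> str:
    """Map the recall@5 gap (custom minus off-shelf, in points) to a path."""
    gap = custom_pct - offshelf_pct
    if gap > 15:
        return "custom"      # accuracy gain is large enough to pay off on TCO
    if gap < 5:
        return "off-shelf"   # gain is too small to justify a build
    return "hybrid"          # custom RAG only where accuracy matters most


if __name__ == "__main__":
    # Scores from the benchmark quoted above: 88% custom vs 64% off-shelf.
    print(path_from_recall_gap(88, 64))
```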

Operator take: where we've watched off-shelf break in production

Engineer note:

Three patterns we've seen repeatedly, without client names.

First: a regulated client had ChatGPT Enterprise deployed across 400 seats. Their vendor updated the retention-policy terms mid-contract. IT flagged a potential HIPAA concern. The team had to rip out the off-shelf deployment and go custom with a six-month timeline they hadn't planned for. The lesson: read the retention policy before you sign, not after.

Second: a legal team rolled out a generic chatbot to assist with contract review. After 60 days of measurement, their Ragas recall@5 on the internal contracts corpus was 52%. We built a custom RAG stack (Claude Sonnet 4 + pgvector over their contract archive) that landed at 84%.

Third: a manufacturing ops team hit the orchestration ceiling when their ChatGPT Enterprise instance couldn't write back to their ERP. A custom LangGraph agent calling the ERP API unblocked the workflow in 4 weeks.

The honest counter: we've also watched custom AI fail when off-shelf would have worked. A 60-seat startup spent $120K on a custom RAG stack before they'd proven the off-shelf accuracy ceiling. Glean at $30K/yr would have covered their use case. The failure mode isn't custom AI being bad; it's commissioning custom before you have proof the off-shelf ceiling is real.

DIY: score your own build-vs-buy decision in a spreadsheet

The six-criterion rubric above is more useful as running code than as a table you read once. Below: a Python implementation that loads your shortlist from a YAML file, applies weights per criterion, computes the 3-year TCO at your seat count, and prints a verdict. Then the same logic in TypeScript for teams running Notion or Airtable integrations, followed by a standalone Python script that prints the TCO curve across seat counts.

build_vs_buy.py python

"""Build-vs-buy scorecard — weighted decision rubric.

YAML input format:
  paths:
    - name: off-shelf
      data_residency: 1  # 0-3 per criterion (3 = strong fit for this path)
      domain_accuracy: 1
      orchestration: 0
      seat_count: 3
      time_to_value: 3
      audit_needs: 1
    - name: custom
      ...same keys...
    - name: hybrid
      ...same keys...
  weights:
    data_residency: 2.0
    domain_accuracy: 1.5
    orchestration: 1.5
    seat_count: 1.0
    time_to_value: 1.0
    audit_needs: 2.0
  seat_count: 500          # your actual seat count
  custom_build_cost: 125000  # one-time build cost estimate
  custom_run_monthly: 5000   # monthly runtime at your seat count
  offshelf_seat_monthly: 60  # blended per-seat/mo for your off-shelf stack
"""

import sys
from dataclasses import dataclass

import yaml  # third-party: PyYAML (pip install pyyaml)

CRITERIA = [
    "data_residency",
    "domain_accuracy",
    "orchestration",
    "seat_count",
    "time_to_value",
    "audit_needs",
]


@dataclass
class Path:
    name: str
    scores: dict[str, int]
    weighted_score: float = 0.0


def score_path(path_data: dict, weights: dict) -> Path:
    p = Path(name=path_data["name"], scores={c: path_data.get(c, 0) for c in CRITERIA})
    p.weighted_score = sum(p.scores[c] * weights.get(c, 1.0) for c in CRITERIA)
    return p


def tco_3yr(seats: int, seat_mo: float, build: float, run_mo: float) -> tuple[float, float]:
    """Return (off-shelf, custom) 3-year TCO in dollars."""
    offshelf_tco = seats * seat_mo * 36   # 36 months of per-seat licensing
    custom_tco = build + (run_mo * 36)    # one-time build plus 36 months of runtime
    return offshelf_tco, custom_tco


def main(config_file: str) -> None:
    with open(config_file) as f:
        cfg = yaml.safe_load(f)

    weights = cfg["weights"]
    paths = [score_path(p, weights) for p in cfg["paths"]]
    paths.sort(key=lambda p: p.weighted_score, reverse=True)

    print("\n=== BUILD-VS-BUY VERDICT ===")
    for i, p in enumerate(paths):
        marker = "  <<< RECOMMENDED" if i == 0 else ""
        print(f"  {p.name}: {p.weighted_score:.1f} weighted score{marker}")

    seats = cfg["seat_count"]
    offshelf_tco, custom_tco = tco_3yr(
        seats,
        cfg["offshelf_seat_monthly"],
        cfg["custom_build_cost"],
        cfg["custom_run_monthly"],
    )
    print(f"\n=== 3-YEAR TCO AT {seats} SEATS ===")
    print(f"  Off-shelf: ${offshelf_tco:,.0f}")
    print(f"  Custom:    ${custom_tco:,.0f}")
    # Break-even seat count: seats * seat_monthly * 36 == custom 3-yr TCO.
    # The custom side of this model does not scale with seats.
    seat_3yr = cfg["offshelf_seat_monthly"] * 36
    crossover = custom_tco / seat_3yr if seat_3yr > 0 else None
    if crossover is not None:
        print(f"  Crossover: {crossover:.0f} seats (above this, custom 3-yr TCO < off-shelf 3-yr TCO)")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "shortlist.yaml")

build-vs-buy.ts typescript

/**
 * Build-vs-buy scorecard — TypeScript edition.
 * Designed for Notion/Airtable integration or a Next.js API route.
 */

type Criterion =
  | "dataResidency"
  | "domainAccuracy"
  | "orchestration"
  | "seatCount"
  | "timeToValue"
  | "auditNeeds";

type PathName = "off-shelf" | "custom" | "hybrid";

interface PathInput {
  name: PathName;
  scores: Record<Criterion, 0 | 1 | 2 | 3>;
}

interface Weights {
  dataResidency: number;
  domainAccuracy: number;
  orchestration: number;
  seatCount: number;
  timeToValue: number;
  auditNeeds: number;
}

interface TcoParams {
  seats: number;
  offshelfSeatMonthly: number;
  customBuildCost: number;
  customRunMonthly: number;
}

interface Verdict {
  recommended: PathName;
  scores: Record<PathName, number>;
  tco3yr: { offshelf: number; custom: number };
  crossoverSeats: number | null;
}

function weightedScore(path: PathInput, weights: Weights): number {
  return (Object.keys(path.scores) as Criterion[]).reduce(
    (sum, k) => sum + path.scores[k] * (weights[k] ?? 1),
    0
  );
}

function tco3yr(p: TcoParams): { offshelf: number; custom: number } {
  return {
    offshelf: p.seats * p.offshelfSeatMonthly * 36,
    custom: p.customBuildCost + p.customRunMonthly * 36,
  };
}

function crossoverSeats(p: Omit<TcoParams, "seats">): number | null {
  // Break-even seat count over 36 months: seats * seatMonthly * 36 = build + run * 36.
  const seatCost3yr = p.offshelfSeatMonthly * 36;
  const custom3yr = p.customBuildCost + p.customRunMonthly * 36;
  if (seatCost3yr <= 0) return null; // no per-seat cost, so no break-even exists
  return Math.ceil(custom3yr / seatCost3yr);
}

export function buildVsBuy(
  paths: PathInput[],
  weights: Weights,
  tcoParams: TcoParams
): Verdict {
  const scored = paths
    .map((p) => ({ name: p.name, score: weightedScore(p, weights) }))
    .sort((a, b) => b.score - a.score);

  const scores = Object.fromEntries(scored.map((s) => [s.name, s.score])) as Record<PathName, number>;
  const { offshelf, custom } = tco3yr(tcoParams);
  const cs = crossoverSeats(tcoParams);

  return {
    recommended: scored[0].name,
    scores,
    tco3yr: { offshelf, custom },
    crossoverSeats: cs,
  };
}

seat-tco-model.py python

"""3-year TCO curve across seat counts — off-shelf vs custom vs hybrid.

Outputs a TSV you can paste into Google Sheets / Excel for the crossover chart.
"""

OFFSHELF_SEAT_MO = 60.0      # blended (e.g. ChatGPT Enterprise $60/seat/mo)
CUSTOM_BUILD_COST = 125_000   # one-time build (midpoint estimate)
CUSTOM_RUN_MO = 5_000         # monthly runtime at steady state
HYBRID_SEAT_MO = 30.0         # off-shelf fraction (e.g. Copilot $30/seat/mo)
HYBRID_CUSTOM_RUN_MO = 3_000  # smaller custom layer runtime


def offshelf_tco(seats: int, years: int = 3) -> float:
    return seats * OFFSHELF_SEAT_MO * 12 * years


def custom_tco(years: int = 3) -> float:
    return CUSTOM_BUILD_COST + CUSTOM_RUN_MO * 12 * years


def hybrid_tco(seats: int, years: int = 3) -> float:
    return seats * HYBRID_SEAT_MO * 12 * years + CUSTOM_BUILD_COST + HYBRID_CUSTOM_RUN_MO * 12 * years


def main() -> None:
    print("Seats\tOff-shelf 3yr\tCustom 3yr\tHybrid 3yr")
    for seats in range(100, 5001, 100):
        print(
            f"{seats}\t${offshelf_tco(seats):,.0f}\t${custom_tco():,.0f}\t${hybrid_tco(seats):,.0f}"
        )


if __name__ == "__main__":
    main()

Run the TCO curve model at your seat count. The output is a TSV you can paste into Google Sheets to visualize the crossover point. Adjust `OFFSHELF_SEAT_MO` to your negotiated enterprise rate (typically $42-51/seat at volume for ChatGPT Enterprise) and `CUSTOM_BUILD_COST` to your pilot scope estimate.

Here's a quick YAML example config to get you started:

paths:
  - name: off-shelf
    data_residency: 2   # vendor has acceptable retention policy
    domain_accuracy: 1  # generic model scores 64% on your corpus
    orchestration: 0    # can't call your internal ERP
    seat_count: 3       # 500 seats — off-shelf cheaper year 1
    time_to_value: 3    # deploys in hours
    audit_needs: 1      # basic logging only
  - name: custom
    data_residency: 3
    domain_accuracy: 3  # custom RAG scored 88% on same corpus
    orchestration: 3    # LangGraph agent calls ERP API
    seat_count: 1       # build cost high in year 1
    time_to_value: 1    # 4-6 week pilot timeline
    audit_needs: 3      # full trace log + kill switch
  - name: hybrid
    data_residency: 2
    domain_accuracy: 2  # off-shelf for generic, custom for proprietary 30%
    orchestration: 2    # custom agent wraps off-shelf core
    seat_count: 2       # split spend
    time_to_value: 2    # off-shelf up in days, custom layer in 4-6 weeks
    audit_needs: 2      # audit covers custom layer only
weights:
  data_residency: 2.0
  domain_accuracy: 1.5
  orchestration: 1.5
  seat_count: 1.0
  time_to_value: 1.0
  audit_needs: 2.0
seat_count: 500
custom_build_cost: 125000
custom_run_monthly: 5000
offshelf_seat_monthly: 60
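To see how those scores and weights combine, here is a minimal sketch of a weighted-sum ranking over the example config. It mirrors the YAML values as Python dicts to avoid a parser dependency. The function names and the assumption that each criterion is scored on a 0-3 scale are ours for illustration, and a full scorecard script may aggregate differently.

```python
# Weights and per-path scores copied from the example config.
WEIGHTS = {
    "data_residency": 2.0,
    "domain_accuracy": 1.5,
    "orchestration": 1.5,
    "seat_count": 1.0,
    "time_to_value": 1.0,
    "audit_needs": 2.0,
}

PATHS = {
    "off-shelf": {"data_residency": 2, "domain_accuracy": 1, "orchestration": 0,
                  "seat_count": 3, "time_to_value": 3, "audit_needs": 1},
    "custom": {"data_residency": 3, "domain_accuracy": 3, "orchestration": 3,
               "seat_count": 1, "time_to_value": 1, "audit_needs": 3},
    "hybrid": {"data_residency": 2, "domain_accuracy": 2, "orchestration": 2,
               "seat_count": 2, "time_to_value": 2, "audit_needs": 2},
}

MAX_CRITERION_SCORE = 3  # assumption: every criterion is scored 0-3


def weighted_score(scores: dict, weights: dict) -> float:
    """Sum of weight * score over every weighted criterion."""
    return sum(weights[name] * scores[name] for name in weights)


def rank_paths(paths: dict, weights: dict) -> list:
    """Return (path name, total) pairs, highest total first."""
    totals = {name: weighted_score(scores, weights) for name, scores in paths.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ceiling = MAX_CRITERION_SCORE * sum(WEIGHTS.values())
    for name, total in rank_paths(PATHS, WEIGHTS):
        print(f"{name}\t{total:.1f} / {ceiling:.1f}")
```

With the example values, custom ranks first, hybrid second, and off-shelf third. Doubling the weight on `time_to_value` or `seat_count` is a quick way to check how sensitive that ordering is to your priorities.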

Custom AI solutions: what the audit conversation looks like

If you've run the scorecard and it points toward custom or hybrid, the next step is a 1-2 week discovery audit. We map your data residency requirements, run a domain accuracy eval on a sample of your internal corpus, diagram the orchestration surface, and model the TCO at your seat count. The audit produces a scoped recommendation, not a generic proposal.

FAQ: custom AI solutions vs off-the-shelf

What is the difference between custom AI and off-the-shelf AI?

Off-the-shelf AI is SaaS you license per seat. ChatGPT Enterprise, Microsoft 365 Copilot, Glean, and Notion AI are off-shelf. The vendor owns the model, the prompts, and the retention policy. You rent the capability. Custom AI is software you commission and own: a stack built on Claude or OpenAI APIs, pgvector for retrieval, LangGraph for orchestration, deployed on your infrastructure with your prompts, your eval gates, and your audit logs. The core trade-off is rent vs own, generic vs proprietary, hours-to-deploy vs weeks-to-build.

When does custom AI pay off vs buying off-the-shelf?

Custom AI pays off when one of four conditions hits: (1) data residency rules block off-shelf vendors from processing your data; (2) domain accuracy on your internal Ragas eval stays below 70% after 60 days of prompt tuning; (3) workflow orchestration requires writing to your internal APIs that off-shelf can't reach; or (4) seat count crosses roughly 600 seats for stacked off-shelf licenses or roughly 3,000 seats for a single off-shelf tool, where the license math flips (2026-Q1 list prices). Below those thresholds, off-shelf wins on time-to-value.
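The four conditions can be written down as a simple check you run against your pilot data. This is a sketch under stated assumptions: the `PilotSignals` fields and the `custom_triggers` function are hypothetical names, the Ragas score is expressed as a 0.0-1.0 fraction, and the seat thresholds are the rough 2026-Q1 figures quoted above, not hard limits.

```python
from dataclasses import dataclass


@dataclass
class PilotSignals:
    residency_blocked: bool          # data residency rules block off-shelf vendors
    ragas_score: float               # internal eval score, 0.0-1.0
    days_of_prompt_tuning: int       # how long you have been tuning prompts
    needs_internal_api_writes: bool  # workflow must write to internal APIs
    seats: int
    stacked_licenses: bool           # True if you pay for several off-shelf tools per seat


def custom_triggers(signals: PilotSignals) -> list[str]:
    """Return the names of the conditions that favor custom; an empty list means stay off-shelf."""
    triggers = []
    if signals.residency_blocked:
        triggers.append("data_residency")
    if signals.days_of_prompt_tuning >= 60 and signals.ragas_score < 0.70:
        triggers.append("domain_accuracy")
    if signals.needs_internal_api_writes:
        triggers.append("orchestration")
    # Rough thresholds from the text: ~600 seats stacked, ~3,000 seats single tool.
    seat_threshold = 600 if signals.stacked_licenses else 3_000
    if signals.seats >= seat_threshold:
        triggers.append("seat_count")
    return triggers


if __name__ == "__main__":
    pilot = PilotSignals(
        residency_blocked=False,
        ragas_score=0.64,
        days_of_prompt_tuning=60,
        needs_internal_api_writes=True,
        seats=500,
        stacked_licenses=True,
    )
    print(custom_triggers(pilot))
```

Returning the list of triggered conditions, instead of a single yes/no, keeps the scope of any custom build tied to the specific gap that was proven.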

How much does custom AI development cost vs ChatGPT Enterprise?

ChatGPT Enterprise lists at $60/seat/mo (2026-Q1). At 500 seats that's $360K/yr recurring. Microsoft 365 Copilot is $30/seat/mo ($180K/yr at 500 seats). A custom RAG + LangGraph agent stack runs $2-8K/mo in API and infrastructure costs at steady state ($24-96K/yr), plus a one-time build investment. From year two onwards, custom usually undercuts stacked off-shelf licensing at the same seat count. In year one it depends on your build timeline.

What is a hybrid AI architecture?

Hybrid is an off-shelf foundation for the generic surface area (Copilot for code, ChatGPT for drafting, Glean for cross-team search) plus a custom orchestration layer for the proprietary surface (LangGraph agents calling your internal APIs + RAG over your proprietary corpus + Langfuse traces + Ragas eval gates). Off-shelf handles generic; custom handles proprietary. The seam is the auth boundary. Most production AI teams in 2026 run hybrid, not pure off-shelf or pure custom.
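One way to picture the seam is as a routing decision made per request. The sketch below is purely illustrative: the `Request` fields and the `route` function are hypothetical, and a production router would make this decision from your auth and data-classification systems, not from hand-set flags.

```python
from dataclasses import dataclass


@dataclass
class Request:
    text: str
    needs_internal_auth: bool = False         # must call an internal API behind your auth
    touches_proprietary_corpus: bool = False  # needs retrieval over proprietary documents


def route(request: Request) -> str:
    """Send anything that crosses the auth boundary to the custom layer."""
    if request.needs_internal_auth or request.touches_proprietary_corpus:
        return "custom"
    return "off-shelf"


if __name__ == "__main__":
    print(route(Request("Draft a follow-up email to a vendor")))
    print(route(Request("Update the ERP purchase order", needs_internal_auth=True)))
```

Keeping the rule this narrow is deliberate: the custom layer stays small, and everything generic remains on the licensed tool.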

Should we buy off-the-shelf AI first or commission custom?

Buy first for 60-90 days. Roll out ChatGPT Enterprise plus Copilot, measure where the ceiling hits (data residency blocked, domain accuracy below 70% on your internal eval, orchestration impossible), then commission custom only for the proven gap. This sequencing avoids the $120K-spent-on-custom-when-Glean-would-have-worked failure mode we see in 60-seat startups. Vendors that push custom-first without an off-shelf pilot are optimizing for their margin.

What does a custom AI solution include?

A production custom AI stack we ship includes: a named reasoning model (Claude Opus 4, Sonnet 4, or GPT-5-mini), a retrieval layer (pgvector or Pinecone), an orchestration framework (LangGraph or Mastra), observability and traces (Langfuse or LangSmith), a CI eval gate (Ragas or Braintrust against your corpus), an audit log with kill switch (agent access revoked in under 60 seconds), and serving infra (Modal, Vercel, or Cloudflare Workers). Engagement shape: 1-2 week discovery audit, 4-6 week pilot with weekly eval gates, then continuous delivery.

What are the risks of off-the-shelf AI?

Three risks worth pricing in before signing an off-shelf contract: (1) vendor retention-policy drift — terms can change mid-contract, and regulated buyers have ripped out off-shelf deployments when retention rules updated; (2) accuracy ceiling on proprietary domains — generic models score 50-64% on specialized corpora (legal, clinical, internal docs) where a custom RAG hits 80-90%; (3) orchestration limits — off-shelf tools can't write to your internal ERP, ticketing, or workflow systems without a custom agent layer you'd need to build anyway.
