AI Strategy Consulting: What to Expect

What to expect from AI strategy consulting: the 4 engagement phases, deliverables, maturity models, opportunity scoring, and the red flags of bad strategy work.


An AI strategy consulting engagement should end with a roadmap you can hand to an engineering team and a use-case backlog scored on a number, not a hunch. If it ends with a 60-slide deck and a recommendation to 'establish a center of excellence,' you bought theater. We've run discovery audits where the highest-scoring opportunity was not the one the executive team wanted to fund, and saying so out loud was the most valuable thing in the engagement. That is the bar. As an AI consulting company that ships production systems, we judge strategy work by whether it survives contact with a build team.

This guide walks through what a buyer should expect at each phase: the discovery audit, the AI maturity assessment, opportunity mapping with a real scoring formula, the roadmap deliverable, what 'good' looks like measured against dated numbers, and the red flags that tell you a firm sells strategy as a product instead of a tool. If you are still deciding whether to hire externally or build the strategy in-house, start with the generative AI consulting vs build decision, then come back here once you've committed to a strategy phase.

We write this in the first person because we deliver these engagements. Founded in 2017, our team runs out of Dallas and Bengaluru, and our default register is eval-first and model-agnostic. The frameworks below are the public versions of what we use internally. Names you'll recognize appear throughout: the McKinsey AI maturity model, the Gartner AI Hype Cycle, Claude Opus 4, GPT-4o, pgvector, Ragas, Langfuse. None of them are magic. The discipline is in how you sequence them.

What AI strategy consulting actually is (and what it is not)

AI strategy consulting is the work of deciding what to build before you build it: which use cases clear an ROI threshold, what data and infrastructure you need, what the sequencing and dependencies are, and how you measure success. It is a different discipline from AI implementation, which is the build-and-ship work. The two are adjacent, and the best engagements connect them tightly, but they answer different questions. Strategy answers 'what should we do and in what order.' Implementation answers 'how do we make it run in production.'

The failure mode of pure strategy firms is well known: a deck, a maturity grade, three slides on 'organizational readiness,' and an invoice. No artifact an engineer can act on. The failure mode of pure implementation shops is the inverse: they'll happily build whatever you point at, including the use case that fails a basic value test. Good AI strategy consulting sits in the seam. It produces a scored backlog and a roadmap precise enough to start a pilot, and it is honest about which opportunities should never be funded.

AI Strategy Consulting

Answers what to build and in what order. Deliverables: scored use-case backlog, AI maturity assessment, data + infrastructure gap analysis, 12-month roadmap with dependencies, eval methodology design. Buyer-grade ROI test applied to each candidate. Should end with an artifact an engineering team can start a pilot from. Weakness when sold alone: no working software, and a roadmap rots fast if no one builds against it. Watch for firms that stop at the deck.

AI Implementation Consulting

Answers how to make it run in production. Deliverables: working pilot, RAG or agent pipeline, eval gates in CI, audit logging, weekly releases, code ownership transfer. Strength: shippable software, measured against real corpora. Weakness when sold alone: will build whatever you point it at, including the use case that fails a value test. Watch for shops that skip the strategy step entirely and start coding the first idea in the room.

The 4-phase engagement arc you should expect

A credible AI strategy engagement moves through four phases, each with a hard deliverable and a decision point where you can stop. We structure ours as a 1 to 2 week discovery audit, a 2 to 3 week assessment and opportunity-mapping phase, a roadmap synthesis week, and then a hand-off into a 4 to 6 week pilot if the data justifies it. The phases are gates, not a contract you sign once. You should be able to walk after the audit if the numbers don't support a build.

AI STRATEGY ENGAGEMENT ARC

| Phase | Duration and activities | Hard deliverable |
|---|---|---|
| 1. Discovery Audit | 1-2 weeks. Data inventory, stakeholder interviews, constraints. | Data + infra gap map. Candidate use-case longlist. Constraint register (PII, regulation, latency, budget). |
| GATE: fund an assessment? | | |
| 2. Assess + Map | 2-3 weeks. Maturity model scored, use cases scored on ROI formula. | Maturity grade (1-5). Scored backlog ranked by weighted value. Eval methodology design per use case. |
| 3. Roadmap Synthesis | 1 week. Sequence, dependencies, staffing, build-vs-buy per item. | 12-month roadmap with dependency graph. Quarter-1 pilot spec. Cost + infra sizing. Risk + compliance plan. |
| GATE: greenlight a pilot? | | |
| 4. Pilot Hand-off | 4-6 weeks, weekly eval gates. | Working pilot, eval in CI, audit log, code ownership transferred day one. |
Figure 1: The four phases of an AI strategy engagement. Each phase produces a hard artifact (right column) and ends at a decision gate. A buyer can stop after discovery if the data inventory shows no fundable opportunity.

Phase 1: the discovery audit and what it inventories

The discovery audit is a 1 to 2 week inventory of what you actually have, as opposed to what the org chart says you have. It is the cheapest phase to get right and the most expensive to skip. We've seen six-figure roadmaps built on the assumption of clean, accessible data, only for the pilot to discover the data lived in three systems with no shared key and a retention policy that deleted the labels. The audit catches that before anyone writes a roadmap.

A good audit inventories four things. Data: where it lives, who owns it, what shape it's in, what you can legally send to a third-party model. Infrastructure: cloud posture, identity, whether you can run pgvector on existing Postgres or need a managed vector store. People: who will own the system after the consultant leaves. Constraints: regulation, latency budgets, PII handling, the question of whether customer data can touch OpenAI or Claude at all. Each one becomes a node in the gap map.

Figure: Discovery Audit Inputs to Gap Map. Inputs: data inventory, infrastructure scan, stakeholder interviews, constraint register. Outputs: gap map, use-case longlist.

AI maturity models: how the assessment grades you

Most firms anchor the assessment on a maturity model. The McKinsey AI maturity framework, the Gartner AI maturity model, and a dozen consultancy variants all describe roughly the same five-stage progression: from ad-hoc experimentation to AI embedded in core operations. The value of a maturity grade is not the grade itself. It is the gap it exposes between where you are and where the roadmap needs you to be. A grade-2 organization should not be funding an autonomous multi-agent system. The maturity model tells you that before the budget does.

Here is the version we use, collapsed to the signals that actually predict delivery success. Read down the levels and locate yourself honestly. The biggest mistake buyers make is rating themselves a 4 because a few teams use ChatGPT. Tool adoption is a level-1 signal. Eval discipline and production audit logging are what separate a 4 from a 2.

| Level | Data + infra signal | Eval + governance signal | Org signal |
|---|---|---|---|
| 1. Ad-hoc | Individuals use ChatGPT or Copilot. No shared data pipeline for AI. | No eval. Quality judged by reading outputs and feeling good. | No owner. Shadow AI projects scattered across teams. |
| 2. Piloting | One or two PoCs on real data. Data access still manual and ad-hoc. | One-time PoC eval with a named tool (Ragas, LangSmith). No cadence. | A champion exists but no budget line or mandate. |
| 3. Operationalizing | Reusable retrieval + serving infra. pgvector or managed store in production. | CI eval gate on at least one system. Results logged with dates. | Funded AI function. Clear ownership for one or two systems. |
| 4. Scaling | Shared platform: model gateway, vector infra, observability across teams. | Weekly regression eval on multiple systems. Audit logging by default. | AI embedded in multiple product lines with accountable owners. |
| 5. Embedded | AI infra is core platform. New use cases ship in days on shared rails. | Eval, drift detection, and kill-switch revocation wired by default. | AI is a P&L input, not a project. Governance board with teeth. |
A condensed AI maturity model. Locate your organization by the highest level where you meet every signal in the row. Tool adoption alone is level 1, not level 4.
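
The locating rule in the caption can be read as a minimum across the three signal columns. The sketch below is one way to write that down, assuming each column is graded 1 to 5 on its own and that levels are cumulative (meeting level 3 implies meeting levels 1 and 2). The function and key names are illustrative and do not come from any published maturity model.

```python
from typing import Dict

LEVEL_NAMES = {
    1: "Ad-hoc",
    2: "Piloting",
    3: "Operationalizing",
    4: "Scaling",
    5: "Embedded",
}


def overall_level(signals: Dict[str, int]) -> int:
    """Return the highest level at which every signal column is met.

    Assumes cumulative levels, so the overall grade is the weakest column.
    """
    if not signals:
        raise ValueError("at least one signal is required")
    for name, level in signals.items():
        if level not in LEVEL_NAMES:
            raise ValueError(f"{name} must be 1-5, got {level}")
    return min(signals.values())


def gaps_to(signals: Dict[str, int], target: int) -> Dict[str, int]:
    """Number of levels each lagging column must climb to reach the target."""
    return {name: target - level for name, level in signals.items() if level < target}


if __name__ == "__main__":
    # Hypothetical organization: decent infra, weaker eval discipline.
    org = {"data_infra": 3, "eval_gov": 2, "org": 3}
    level = overall_level(org)
    print(level, LEVEL_NAMES[level])
    print(gaps_to(org, target=4))
```

The design choice worth noting is that a single weak column caps the grade, which is what keeps widespread tool adoption from being mistaken for a level 4.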

Phase 2: opportunity mapping and use-case scoring

This is the phase where strategy earns its fee. A longlist of 15 to 30 candidate use cases comes out of discovery. Opportunity mapping scores each one on a weighted formula so the ranking is defensible to a CFO, not a popularity contest in a workshop. The formula matters more than the workshop. We score on five factors: business value, data readiness, technical feasibility, time-to-value, and risk. The output is a single number per use case and a clear top quartile.

Below is a real scoring function, simplified to the public version. It takes a use case with five 1-to-5 inputs and returns a weighted score plus a flag for any hard blocker. A hard blocker (data you legally cannot use, a regulation you cannot meet) zeroes the score regardless of business value. That single rule kills more board-favorite projects than any other line in the engagement, and it is the most useful thing you'll get from an honest strategy firm.

opportunity_score.py
Python
"""opportunity_score.py

Score AI use cases on a weighted formula so the ranking is defensible.
A hard blocker (illegal data, unmeetable regulation) zeroes the score
regardless of business value. Run: python opportunity_score.py
"""
from dataclasses import dataclass
from typing import List

# Weights sum to 1.0. Tune per organization, but write them down
# BEFORE scoring so the ranking can't be reverse-engineered to a favorite.
WEIGHTS = {
    "business_value":    0.35,  # revenue, cost, or risk reduction
    "data_readiness":    0.25,  # is the data accessible, labeled, legal?
    "tech_feasibility":  0.15,  # can current models + infra ship it?
    "time_to_value":     0.15,  # weeks to first measurable result
    "risk":              0.10,  # inverse: lower risk scores higher
}

@dataclass
class UseCase:
    name: str
    business_value: int    # 1-5
    data_readiness: int    # 1-5
    tech_feasibility: int  # 1-5
    time_to_value: int     # 1-5 (5 = fastest)
    risk: int              # 1-5 (5 = highest risk)
    hard_blocker: bool = False  # illegal data / unmeetable regulation

    def score(self) -> float:
        if self.hard_blocker:
            return 0.0  # the rule that kills board favorites
        risk_inv = 6 - self.risk  # invert: low risk -> high contribution
        raw = (
            self.business_value   * WEIGHTS["business_value"]
            + self.data_readiness * WEIGHTS["data_readiness"]
            + self.tech_feasibility * WEIGHTS["tech_feasibility"]
            + self.time_to_value  * WEIGHTS["time_to_value"]
            + risk_inv            * WEIGHTS["risk"]
        )
        return round(raw / 5 * 100, 1)  # normalize to /100


def rank(cases: List[UseCase]) -> List[UseCase]:
    return sorted(cases, key=lambda c: c.score(), reverse=True)


if __name__ == "__main__":
    backlog = [
        UseCase("Support deflection RAG", 5, 4, 5, 5, 2),
        UseCase("Autonomous contract drafting", 5, 2, 2, 2, 5),
        UseCase("Sales-call summarization", 3, 5, 5, 5, 1),
        UseCase("Fraud co-pilot (PII offshore)", 5, 4, 4, 3, 4, hard_blocker=True),
    ]
    for c in rank(backlog):
        flag = "  [BLOCKED]" if c.hard_blocker else ""
        print(f"{c.score():>6}  {c.name}{flag}")
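
To make the arithmetic concrete, the sketch below recomputes the first backlog entry by hand and adds two guards the simplified function above leaves out: a check that the weights sum to 1.0 and a check that every input stays in the 1-to-5 range. The `weighted_score` and `top_quartile` helpers are illustrative names, and treating the top quartile as the best `ceil(n / 4)` items is an assumption rather than the engagement's actual cut.

```python
import math
from typing import Dict, List, Tuple

WEIGHTS = {
    "business_value": 0.35,
    "data_readiness": 0.25,
    "tech_feasibility": 0.15,
    "time_to_value": 0.15,
    "risk": 0.10,
}


def weighted_score(inputs: Dict[str, int], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted 1-5 inputs normalized to /100; risk is inverted (6 - risk)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    total = 0.0
    for factor, weight in weights.items():
        value = inputs[factor]
        if not 1 <= value <= 5:
            raise ValueError(f"{factor} must be 1-5, got {value}")
        if factor == "risk":
            value = 6 - value  # low risk contributes more
        total += value * weight
    return round(total / 5 * 100, 1)


def top_quartile(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Best ceil(n / 4) items by score (an assumed definition of 'top quartile')."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranked[: math.ceil(len(ranked) / 4)]


if __name__ == "__main__":
    # Support deflection RAG: inputs 5, 4, 5, 5 and risk 2.
    # 5*0.35 + 4*0.25 + 5*0.15 + 5*0.15 + (6-2)*0.10 = 4.65
    # 4.65 / 5 * 100 = 93.0
    support = {
        "business_value": 5,
        "data_readiness": 4,
        "tech_feasibility": 5,
        "time_to_value": 5,
        "risk": 2,
    }
    print(weighted_score(support))
```

Validating the weights at call time is cheap insurance: a ranking built on weights that sum to 0.9 or 1.1 still sorts, but the /100 scale no longer means what the CFO thinks it means.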

Phase 3: the roadmap deliverable, in machine-readable form

The roadmap is the artifact that justifies the whole engagement. A bad roadmap is a Gantt chart of buzzwords with quarters labeled 'AI maturity' and 'scale.' A good one is specific enough that an engineering lead can read it and start a pilot on Monday. It names the use cases in priority order, states the model and infra choice per item with reasoning, declares dependencies, and sets eval thresholds the pilot must clear. We deliver it as a document and as a structured config, because the config is what survives the consultant leaving.

Below are two views of the same roadmap deliverable. The YAML is the quarter-1 plan an engineering team executes against. The JSON is the maturity scorecard that travels to the board. If your strategy firm cannot hand you something this concrete, ask why the deliverable is a slide deck instead of a spec.

yaml
# roadmap-q1.yaml
# Quarter-1 plan from an AI strategy engagement. Engineering executes
# against this. Eval thresholds gate the pilot before any scale spend.

maturity:
  current_level: 2        # piloting
  target_level_12mo: 4    # scaling

strategy_principles:
  - model_agnostic: true        # choose model by eval, not partnership
  - eval_first: true            # no scale spend before eval gate clears
  - code_ownership: client      # transferred day one of pilot

q1_pilot:
  use_case: support-deflection-rag
  opportunity_score: 93         # from opportunity_score.py
  stack:
    reasoning_model: claude-sonnet-4
    intent_model: claude-haiku-4   # cheaper path for classification
    fallback_model: gpt-4o
    vector_store: pgvector-0.7     # already on Postgres 16
    eval_harness: ragas
    observability: langfuse
  eval_gates:
    recall_at_5: 0.80
    faithfulness: 0.82
    fail_on_regression: true
  dependencies:
    - data-access-api          # blocker surfaced in discovery audit
    - pii-redaction-layer
  engagement_shape:
    discovery_audit: 1-2 weeks
    pilot: 4-6 weeks
    cadence: weekly eval gate
json
{
  "maturity_scorecard": {
    "as_of": "2026-Q1",
    "current_level": 2,
    "target_level_12mo": 4,
    "dimensions": {
      "data_infra":   { "score": 2, "gap": "no shared retrieval layer" },
      "eval_gov":     { "score": 1, "gap": "no CI eval gate yet" },
      "org_owner":    { "score": 2, "gap": "champion, no funded function" }
    },
    "top_use_cases": [
      { "name": "support-deflection-rag",    "score": 93, "quarter": "Q1" },
      { "name": "sales-call-summarization",  "score": 86, "quarter": "Q2" },
      { "name": "autonomous-contract-draft", "score": 59, "quarter": "defer" }
    ],
    "blocked": [
      { "name": "fraud-copilot-pii-offshore", "reason": "PII cannot leave region" }
    ],
    "recommended_next_gate": "greenlight Q1 pilot"
  }
}
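
One way a build team might consume the `eval_gates` block is a small check that runs after each eval and reports any metric that misses its threshold. This is a sketch under assumptions: the thresholds are copied from the YAML above, the `results` and `previous` dictionaries stand in for whatever the eval harness emits, `fail_on_regression` is interpreted here as "fail when a metric drops versus the prior run", and `check_gates` is a hypothetical helper rather than part of Ragas or Langfuse.

```python
from typing import Dict, List, Optional

# Thresholds mirrored from q1_pilot.eval_gates in roadmap-q1.yaml.
EVAL_GATES = {"recall_at_5": 0.80, "faithfulness": 0.82}
FAIL_ON_REGRESSION = True


def check_gates(
    results: Dict[str, float],
    gates: Dict[str, float] = EVAL_GATES,
    previous: Optional[Dict[str, float]] = None,
    fail_on_regression: bool = FAIL_ON_REGRESSION,
) -> List[str]:
    """Return failure messages; an empty list means the gate clears."""
    failures = []
    for metric, threshold in gates.items():
        if metric not in results:
            failures.append(f"{metric}: missing from eval results")
            continue
        value = results[metric]
        if value < threshold:
            failures.append(f"{metric}: {value:.2f} below threshold {threshold:.2f}")
        elif fail_on_regression and previous and value < previous.get(metric, value):
            # Above threshold, but worse than the prior run (assumed meaning).
            failures.append(f"{metric}: {value:.2f} regressed from {previous[metric]:.2f}")
    return failures


if __name__ == "__main__":
    # Hypothetical weekly run compared against the prior week.
    this_week = {"recall_at_5": 0.84, "faithfulness": 0.79}
    last_week = {"recall_at_5": 0.86, "faithfulness": 0.85}
    for line in check_gates(this_week, previous=last_week):
        print("FAIL", line)
```

In CI, a non-empty list would typically translate to a non-zero exit code so the pilot cannot promote a build that missed its gate.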

Every deliverable you should walk away with

Here is the full deliverable set, mapped to phase and to who consumes it. Use this as a checklist when you scope the engagement. If a firm's proposal is missing the scored backlog or the eval methodology, those are the two omissions that most often signal a deck-only engagement. The board scorecard and the executive summary are fine; they are just not sufficient on their own.

| Deliverable | Phase | Primary consumer | Format |
|---|---|---|---|
| Data + infra gap map | 1 Discovery | Eng lead, data owner | Document + diagram |
| Constraint register (PII, regulation, latency) | 1 Discovery | Legal, security, eng | Structured list |
| AI maturity grade (1-5) with gap analysis | 2 Assess | C-suite, board | Scorecard |
| Scored use-case backlog (ranked /100) | 2 Assess | Product, eng, CFO | Spreadsheet + formula |
| Eval methodology per top use case | 2 Assess | Eng lead | Named metrics + thresholds |
| 12-month roadmap + dependency graph | 3 Synthesis | C-suite, eng | Document + config |
| Quarter-1 pilot spec + stack choices | 3 Synthesis | Eng team | YAML / machine-readable |
| Cost + infra sizing (technical, not fees) | 3 Synthesis | Finance, eng | Token + infra estimate |
| Risk + compliance plan (audit, kill switch) | 3 Synthesis | Security, legal | Architecture doc |

What 'good' looks like, measured against dated numbers

Good strategy work attaches numbers to its claims. When we recommend a model for a retrieval use case, we cite the eval. On a 1,840-document RAG corpus we ran internally in 2026-Q1 with the Ragas harness, Claude Opus 4 scored 88% recall@5 against GPT-4o at 71%, same prompt and same corpus. The full Ragas regression run cost $14 in Claude API spend. Those numbers belong in the roadmap because they let a buyer size their eval budget before they sign a build. A firm that recommends a model without showing you the eval is recommending a vibe. Governance belongs here too: responsible AI practices should be designed into the roadmap, not bolted on after the pilot.

Reference numbers a strategy roadmap should cite (2026-Q1)

| Value | Metric | Context |
|---|---|---|
| 88% | Claude Opus 4 recall@5 | 1,840-doc corpus, Ragas harness, 2026-Q1. Used to justify the model choice in the roadmap. |
| 71% | GPT-4o recall@5 (same corpus) | Same prompt, same retrieval depth, 2026-Q1. The gap is why eval beats logo-wall stack claims. |
| $14 | Full Ragas regression run cost | Claude API spend to run the complete 1,840-doc eval set, 2026-Q1. Lets you size eval infra before a SOW. |
| $0.04 | Median per-agent-call cost | Claude Sonnet 4 + pgvector retrieval across 3 production agents, 2026-Q1. A real unit-economics anchor. |
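
The $14 figure can be turned into a rough eval budget. The arithmetic below is a back-of-envelope sketch: it assumes cost scales linearly with corpus size and that a weekly eval gate means one full regression run per week, neither of which the numbers above guarantee. The 5,000-document corpus is a hypothetical input, and `eval_budget` is an illustrative helper.

```python
def eval_budget(run_cost_usd: float, ref_docs: int, target_docs: int,
                runs_per_week: float, weeks: int) -> dict:
    """Scale a measured per-run eval cost to a different corpus and cadence.

    Linear scaling by document count is an assumption; real cost depends on
    tokens per document, prompt length, and model pricing.
    """
    if ref_docs <= 0 or target_docs <= 0:
        raise ValueError("document counts must be positive")
    per_doc = run_cost_usd / ref_docs
    per_run = per_doc * target_docs
    return {
        "per_doc_usd": round(per_doc, 4),
        "per_run_usd": round(per_run, 2),
        "total_usd": round(per_run * runs_per_week * weeks, 2),
    }


if __name__ == "__main__":
    # Measured: $14 for the 1,840-document run. One run per week for 52 weeks.
    print(eval_budget(14.0, 1840, 1840, runs_per_week=1, weeks=52))
    # Hypothetical 5,000-document corpus at the same cadence.
    print(eval_budget(14.0, 1840, 5000, runs_per_week=1, weeks=52))
```

At the measured corpus size, a year of weekly runs comes to 52 × $14 = $728, which is the kind of line item a roadmap can state before a build is signed.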

Red flags of bad AI strategy consulting

Six tells separate strategy theater from strategy that ships. No scored backlog, only a workshop ranking. No eval methodology in the deliverable set. A roadmap with no model or infra specificity, just 'deploy generative AI.' A maturity grade with no gap analysis attached. A pricing structure that locks you into the build before the audit data is in. And a single-vendor stack recommendation that conveniently matches the firm's partnership. Any two of these together, and you should walk.

RED-FLAG DECISION TREE: STRATEGY THEATER VS STRATEGY THAT SHIPS

1. Is there a SCORED backlog? (Ranked /100, written formula.) If no: RED, workshop popularity ranking only.
2. Eval METHODOLOGY included? (Named harness, metrics, thresholds.) If no: RED, no eval; quality = reading outputs.
3. Model + infra NAMED in roadmap? (Versioned, with reasoning.) If no: RED, 'deploy GenAI' with no specificity.
4. Maturity grade has GAP analysis? (Where you are vs roadmap need.) If no: RED, grade only; a vanity number with no action.
5. Stack VENDOR-NEUTRAL? (Chosen by eval, not partnership.) If no: RED, partner lock; single-vendor by margin.

PASS (yes at every node): strategy that ships. Scored backlog + eval + named stack + gap analysis + vendor-neutral choice.
Figure 2: Walk the tree against a firm's proposal. Each red node is a documented failure mode. Two or more red nodes, and the engagement is likely a deck, not a roadmap.
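
The walk-away rule can be written down as a checklist so two reviewers scoring the same proposal reach the same verdict. This is a minimal sketch, assuming each of the six tells is answered yes or no from the proposal text. The field names are illustrative, and the "probe further" verdict for a single red flag is an assumption, since the rule above only covers two or more.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Proposal:
    """True means the proposal passes that check (hypothetical field names)."""
    scored_backlog: bool          # ranked /100 with a written formula
    eval_methodology: bool        # named harness, metrics, thresholds
    named_model_and_infra: bool   # versioned stack with reasoning
    maturity_gap_analysis: bool   # grade comes with a gap analysis
    no_build_lock_in: bool        # pricing does not commit the build pre-audit
    vendor_neutral: bool          # stack chosen by eval, not partnership


def red_flags(p: Proposal) -> List[str]:
    """Names of the checks the proposal fails, in declaration order."""
    return [f.name for f in fields(p) if not getattr(p, f.name)]


def verdict(p: Proposal) -> str:
    flags = red_flags(p)
    if len(flags) >= 2:
        return "walk"
    # One red flag is not covered by the rule; treated here as a prompt to dig.
    return "probe further" if flags else "pass"


if __name__ == "__main__":
    # Hypothetical proposal: workshop ranking only, and build locked in up front.
    proposal = Proposal(
        scored_backlog=False,
        eval_methodology=True,
        named_model_and_infra=True,
        maturity_gap_analysis=True,
        no_build_lock_in=False,
        vendor_neutral=True,
    )
    print(red_flags(proposal), verdict(proposal))
```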

How strategy connects to implementation without a handoff cliff

The most expensive failure in AI consulting is the handoff cliff: a strategy firm delivers a deck, walks away, and a separate build team starts from scratch, re-discovering the constraints the audit already found. The roadmap config exists to prevent exactly this. When the pilot team picks up the YAML, the model choice, the eval thresholds, and the dependency blockers are already specified. The first week of the pilot is wiring, not re-litigating the strategy. This is also where the custom build vs off-the-shelf decision gets settled per use case, not as a blanket policy.

Figure: Strategy to Implementation, No Cliff. Scored backlog -> Roadmap config (YAML) -> Pilot week 1 -> Weekly eval gate -> Production + audit log -> Roadmap item 2.

Engagement timeline and where the weeks actually go

Buyers consistently underestimate discovery and overestimate the roadmap-writing. The roadmap is a week of synthesis on top of good assessment data. Discovery is where the real time goes, because data access requests, security reviews, and stakeholder scheduling are slow in any organization above 200 people. The chart below shows where the weeks land across the engagement classes we've run. Tier-1 strategy firms run longer not because the thinking is deeper but because procurement and multi-stakeholder sign-off add weeks before discovery even starts.

Median weeks to a usable roadmap, by engagement class (2026-Q1, 11 engagements observed)

| Engagement class | Median weeks | Note |
|---|---|---|
| Boutique operator (gated audit to roadmap) | 4.5 | Discovery + assessment + synthesis. Fastest because procurement is light and the build team is in the room. |
| Vertical specialist | 6 | Compliance-gated data access in regulated verticals adds 1-2 weeks to discovery. |
| Tier-1 strategy firm | 11 | Procurement, security review, and multi-stakeholder workshops add weeks before discovery starts. |

Questions to ask before you sign a strategy engagement

Send these five questions to any firm on your shortlist before you sign. They take a firm 30 minutes to answer well and an hour to dodge. The dodge is the answer. A firm that cannot show you a redacted opportunity-scoring sheet or a roadmap config from a past engagement is asking you to trust that the deliverable exists. We hand prospective clients a redacted version of both before they sign anything.

strategy-firm-questions.md
MARKDOWN
# 5 questions to ask before signing an AI strategy engagement

## 1. Scoring transparency
Show us a redacted use-case scoring sheet from a past engagement.
Required: a written weighting formula, per-use-case scores, at least
one board-favorite that scored low and why.

## 2. The deliverable, concretely
What does the roadmap look like as a file? Show a redacted YAML or
spec, not a deck. If the only artifact is slides, ask what an
engineering team executes against on day one.

## 3. Eval methodology
Which eval harness do you design into the roadmap (Ragas, LangSmith,
Braintrust)? What thresholds gate the pilot? Show a dated run.

## 4. Model + stack neutrality
Do you recommend models by eval result or by partnership? Show a case
where the eval pushed you off your default vendor.

## 5. Handoff mechanics
How does the strategy transfer to a build team? Do you hand a
machine-readable config, and is code ownership with the client from
day one of the pilot?

# Scoring: a clean answer to 4 of 5 is a strong signal.
# A dodge on questions 1 or 3 is a red flag worth two of the others.

Frameworks a strategy firm should name (and how to read them)

Frameworks are scaffolding, not answers. A firm that quotes the Gartner AI Hype Cycle to justify a recommendation is using it as decoration. A firm that uses a maturity model to expose a specific gap, then closes it in the roadmap, is using it as a tool. The frameworks named in this guide come up in most credible engagements. The test is whether the firm uses them to produce a number or a decision, or just to fill slides. Real eval work names tools too: Ragas for retrieval metrics, Langfuse or LangSmith for production traces, pgvector for retrieval, and a model gateway across Claude, GPT-4o, and open-source options like Llama 4 chosen by corpus result.

FAQ

What should I expect from an AI strategy consulting engagement?

Expect four phases with hard deliverables and decision gates: a 1 to 2 week discovery audit (data + infra gap map, constraint register), a 2 to 3 week assessment and opportunity-mapping phase (AI maturity grade plus a scored use-case backlog ranked /100), a roadmap synthesis week (12-month roadmap, quarter-1 pilot spec, cost and infra sizing), and a hand-off into a 4 to 6 week pilot with weekly eval gates. If the engagement ends with a slide deck and no scored backlog or eval methodology, you bought theater, not strategy.

What deliverables should AI strategy consulting produce?

A complete set includes: a data and infrastructure gap map, a constraint register (PII, regulation, latency), an AI maturity grade with gap analysis, a scored use-case backlog with a written weighting formula, an eval methodology per top use case (named harness and thresholds), a 12-month roadmap with a dependency graph, a machine-readable quarter-1 pilot spec, technical cost and infra sizing, and a risk and compliance plan covering audit logging and kill-switch revocation. Missing the scored backlog or the eval methodology is the most common tell of a deck-only engagement.

How long does an AI strategy engagement take?

A boutique operator with a gated audit-to-roadmap process reaches a usable roadmap in about 4.5 weeks median (2026-Q1, across 11 engagements observed). Vertical specialists run about 6 weeks because compliance-gated data access adds time to discovery. Tier-1 strategy firms run around 11 weeks, mostly because procurement, security review, and multi-stakeholder workshops add weeks before discovery even starts. Discovery, not roadmap-writing, is where the time actually goes.

What AI maturity model do consultants use?

Most firms use a five-level model, whether badged as the McKinsey AI maturity framework, the Gartner AI maturity model, or a house variant. The levels run from ad-hoc experimentation (level 1) through piloting, operationalizing, and scaling, to AI embedded in core operations (level 5). The value is the gap analysis, not the grade. Tool adoption (people using ChatGPT or Copilot) is a level-1 signal. Eval discipline and production audit logging are what separate a level 4 from a level 2.

What are the red flags of bad AI strategy consulting?

Six tells: no scored backlog (only a workshop popularity ranking), no eval methodology in the deliverable set, a roadmap with no model or infrastructure specificity (just 'deploy generative AI'), a maturity grade with no gap analysis, a contract that locks the full build before discovery-audit data is in, and a single-vendor stack recommendation that conveniently matches the firm's own partnership. Any two together, and you should walk.

How does AI strategy connect to implementation?

Through a machine-readable roadmap config, not a deck. The config carries the model choice (for example Claude Sonnet 4 for reasoning, Haiku 4 for the cheap intent path), the eval thresholds (recall@5, faithfulness), and the dependency blockers surfaced during discovery directly into the pilot. That lets pilot week one be wiring rather than re-discovery. The expensive failure is the handoff cliff, where a strategy firm delivers prose and a separate build team starts from scratch. Insist that code ownership sits with you from day one of the pilot.

Should AI strategy and implementation come from the same firm?

Not necessarily, but the handoff has to be clean either way. Many boutique operators (including us) bundle strategy and implementation so the team that scored the backlog also builds the pilot, which removes re-discovery cost. Tier-1 firms often split them across separate practice areas, which adds handoff risk. If you do split the firms, require the strategy deliverable to ship as a machine-readable config so the build team executes against a spec, not an interpretation of slides.
