AI Strategy Consulting: What to Expect
What to expect from AI strategy consulting: the 4 engagement phases, deliverables, maturity models, opportunity scoring, and the red flags of bad strategy work.
An AI strategy consulting engagement should end with a roadmap you can hand to an engineering team and a use-case backlog scored on a number, not a hunch. If it ends with a 60-slide deck and a recommendation to 'establish a center of excellence,' you bought theater. We've run discovery audits where the highest-scoring opportunity was not the one the executive team wanted to fund, and saying so out loud was the most valuable thing in the engagement. That is the bar. As an AI consulting company that ships production systems, we judge strategy work by whether it survives contact with a build team.
This guide walks through what a buyer should expect at each phase: the discovery audit, the AI maturity assessment, opportunity mapping with a real scoring formula, the roadmap deliverable, what 'good' looks like measured against dated numbers, and the red flags that tell you a firm sells strategy as a product instead of a tool. If you are still deciding whether to hire externally or build the strategy in-house, start with the generative AI consulting vs build decision first, then come back here once you've committed to a strategy phase.
We write this in the first person because we deliver these engagements. Founded in 2017, our team runs out of Dallas and Bengaluru, and our default register is eval-first and model-agnostic. The frameworks below are the public versions of what we use internally. Names you'll recognize appear throughout: the McKinsey AI maturity model, the Gartner AI Hype Cycle, Claude Opus 4, GPT-4o, pgvector, Ragas, Langfuse. None of them are magic. The discipline is in how you sequence them.
What AI strategy consulting actually is (and what it is not)
AI strategy consulting is the work of deciding what to build before you build it: which use cases clear an ROI threshold, what data and infrastructure you need, what the sequencing and dependencies are, and how you measure success. It is a different discipline from AI implementation, which is the build-and-ship work. The two are adjacent, and the best engagements connect them tightly, but they answer different questions. Strategy answers 'what should we do and in what order.' Implementation answers 'how do we make it run in production.'
The failure mode of pure strategy firms is well known: a deck, a maturity grade, three slides on 'organizational readiness,' and an invoice. No artifact an engineer can act on. The failure mode of pure implementation shops is the inverse: they'll happily build whatever you point at, including the use case that fails a basic value test. Good AI strategy consulting sits in the seam. It produces a scored backlog and a roadmap precise enough to start a pilot, and it is honest about which opportunities should never be funded.
Strategy consulting answers what to build and in what order. Deliverables: scored use-case backlog, AI maturity assessment, data + infrastructure gap analysis, 12-month roadmap with dependencies, eval methodology design. Buyer-grade ROI test applied to each candidate. Should end with an artifact an engineering team can start a pilot from. Weakness when sold alone: no working software, and a roadmap rots fast if no one builds against it. Watch for firms that stop at the deck.
Implementation answers how to make it run in production. Deliverables: working pilot, RAG or agent pipeline, eval gates in CI, audit logging, weekly releases, code ownership transfer. Strength: shippable software, measured against real corpora. Weakness when sold alone: will build whatever you point it at, including the use case that fails a value test. Watch for shops that skip the strategy step entirely and start coding the first idea in the room.
The 4-phase engagement arc you should expect
A credible AI strategy engagement moves through four phases, each with a hard deliverable and a decision point where you can stop. We structure ours as a 1 to 2 week discovery audit, a 2 to 3 week assessment and opportunity-mapping phase, a roadmap synthesis week, and then a hand-off into a 4 to 6 week pilot if the data justifies it. The phases are gates, not a contract you sign once. You should be able to walk after the audit if the numbers don't support a build.
Phase 1: the discovery audit and what it inventories
The discovery audit is a 1 to 2 week inventory of what you actually have, as opposed to what the org chart says you have. It is the cheapest phase to get right and the most expensive to skip. We've seen six-figure roadmaps built on the assumption of clean, accessible data, only for the pilot to discover the data lived in three systems with no shared key and a retention policy that deleted the labels. The audit catches that before anyone writes a roadmap.
A good audit inventories four things. Data: where it lives, who owns it, what shape it's in, what you can legally send to a third-party model. Infrastructure: cloud posture, identity, whether you can run pgvector on existing Postgres or need a managed vector store. People: who will own the system after the consultant leaves. Constraints: regulation, latency budgets, PII handling, the question of whether customer data can touch OpenAI or Claude at all. Each one becomes a node in the gap map.
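One way to keep audit findings actionable is to record each one as a structured node rather than a paragraph. The sketch below is a minimal illustration, not a format we prescribe: the `GapNode` record, its field names, and the example entries are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four things the audit inventories.
CATEGORIES = ("data", "infrastructure", "people", "constraints")


@dataclass
class GapNode:
    """One audit finding (hypothetical record shape)."""
    category: str   # one of CATEGORIES
    finding: str    # what the audit observed
    owner: str      # who can close the gap
    blocking: bool  # True if a pilot cannot start until it is closed

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def blockers_by_category(nodes):
    """Group the blocking findings by audit category."""
    grouped = defaultdict(list)
    for node in nodes:
        if node.blocking:
            grouped[node.category].append(node.finding)
    return dict(grouped)


# Illustrative entries only.
gap_map = [
    GapNode("data", "labels deleted by retention policy", "data owner", True),
    GapNode("infrastructure", "Postgres available for pgvector", "platform team", False),
    GapNode("people", "no post-engagement system owner", "CTO", True),
    GapNode("constraints", "customer PII may not reach third-party models", "legal", True),
]

for category, findings in blockers_by_category(gap_map).items():
    print(category, "->", findings)
```

Keeping the `blocking` flag on each node means the roadmap phase can filter for pilot blockers directly instead of re-reading the audit report.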
AI maturity models: how the assessment grades you
Most firms anchor the assessment on a maturity model. The McKinsey AI maturity framework, the Gartner AI maturity model, and a dozen consultancy variants all describe roughly the same five-stage progression: from ad-hoc experimentation to AI embedded in core operations. The value of a maturity grade is not the grade itself. It is the gap it exposes between where you are and where the roadmap needs you to be. A grade-2 organization should not be funding an autonomous multi-agent system. The maturity model tells you that before the budget does.
Here is the version we use, collapsed to the signals that actually predict delivery success. Read down the levels and locate yourself honestly. The biggest mistake buyers make is rating themselves a 4 because a few teams use ChatGPT. Tool adoption is a level-1 signal. Eval discipline and production audit logging are what separate a 4 from a 2.
| Level | Data + infra signal | Eval + governance signal | Org signal |
|---|---|---|---|
| 1. Ad-hoc | Individuals use ChatGPT or Copilot. No shared data pipeline for AI. | No eval. Quality judged by reading outputs and feeling good. | No owner. Shadow AI projects scattered across teams. |
| 2. Piloting | One or two PoCs on real data. Data access still manual and ad-hoc. | One-time PoC eval with a named tool (Ragas, LangSmith). No cadence. | A champion exists but no budget line or mandate. |
| 3. Operationalizing | Reusable retrieval + serving infra. pgvector or managed store in production. | CI eval gate on at least one system. Results logged with dates. | Funded AI function. Clear ownership for one or two systems. |
| 4. Scaling | Shared platform: model gateway, vector infra, observability across teams. | Weekly regression eval on multiple systems. Audit logging by default. | AI embedded in multiple product lines with accountable owners. |
| 5. Embedded | AI infra is core platform. New use cases ship in days on shared rails. | Eval, drift detection, and kill-switch revocation wired by default. | AI is a P&L input, not a project. Governance board with teeth. |
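Because the value sits in the gap rather than the grade, the assessment can be reduced to a per-dimension gap list. This is a sketch under stated assumptions: the three dimension names mirror the signal columns of the table, and the sample scores and target level are illustrative inputs, not a benchmark.

```python
def maturity_gaps(scores, target):
    """Return {dimension: levels short of target}, largest gap first.

    Dimensions already at or above the target are omitted.
    """
    for dimension, level in scores.items():
        if not 1 <= level <= 5:
            raise ValueError(f"{dimension}: level must be 1-5, got {level}")
    gaps = {d: target - s for d, s in scores.items() if s < target}
    return dict(sorted(gaps.items(), key=lambda kv: kv[1], reverse=True))


# Illustrative scores for a level-2 organization targeting level 4.
sample_scores = {"data_infra": 2, "eval_gov": 1, "org_owner": 2}

for dimension, gap in maturity_gaps(sample_scores, target=4).items():
    print(f"{dimension}: {gap} level(s) short of target")
```

Sorting by gap size puts the weakest dimension at the top, which is usually where the roadmap should spend its first dependency.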
Phase 2: opportunity mapping and use-case scoring
This is the phase where strategy earns its fee. A longlist of 15 to 30 candidate use cases comes out of discovery. Opportunity mapping scores each one on a weighted formula so the ranking is defensible to a CFO, not a popularity contest in a workshop. The formula matters more than the workshop. We score on five factors: business value, data readiness, technical feasibility, time-to-value, and risk. The output is a single number per use case and a clear top quartile.
Below is a real scoring function, simplified to the public version. It takes a use case with five 1-to-5 inputs plus a hard-blocker flag and returns a weighted score normalized to 100. A hard blocker (data you legally cannot use, a regulation you cannot meet) zeroes the score regardless of business value. That single rule kills more board-favorite projects than any other line in the engagement, and it is the most useful thing you'll get from an honest strategy firm.
"""opportunity_score.py
Score AI use cases on a weighted formula so the ranking is defensible.
A hard blocker (illegal data, unmeetable regulation) zeroes the score
regardless of business value. Run: python opportunity_score.py
"""
from dataclasses import dataclass
from typing import List

# Weights sum to 1.0. Tune per organization, but write them down
# BEFORE scoring so the ranking can't be reverse-engineered to a favorite.
WEIGHTS = {
    "business_value": 0.35,    # revenue, cost, or risk reduction
    "data_readiness": 0.25,    # is the data accessible, labeled, legal?
    "tech_feasibility": 0.15,  # can current models + infra ship it?
    "time_to_value": 0.15,     # weeks to first measurable result
    "risk": 0.10,              # inverse: lower risk scores higher
}


@dataclass
class UseCase:
    name: str
    business_value: int         # 1-5
    data_readiness: int         # 1-5
    tech_feasibility: int       # 1-5
    time_to_value: int          # 1-5 (5 = fastest)
    risk: int                   # 1-5 (5 = highest risk)
    hard_blocker: bool = False  # illegal data / unmeetable regulation

    def score(self) -> float:
        if self.hard_blocker:
            return 0.0  # the rule that kills board favorites
        risk_inv = 6 - self.risk  # invert: low risk -> high contribution
        raw = (
            self.business_value * WEIGHTS["business_value"]
            + self.data_readiness * WEIGHTS["data_readiness"]
            + self.tech_feasibility * WEIGHTS["tech_feasibility"]
            + self.time_to_value * WEIGHTS["time_to_value"]
            + risk_inv * WEIGHTS["risk"]
        )
        return round(raw / 5 * 100, 1)  # normalize to /100


def rank(cases: List[UseCase]) -> List[UseCase]:
    # Highest score first; blocked cases score 0.0 and sink to the bottom.
    return sorted(cases, key=lambda c: c.score(), reverse=True)


if __name__ == "__main__":
    backlog = [
        UseCase("Support deflection RAG", 5, 4, 5, 5, 2),
        UseCase("Autonomous contract drafting", 5, 2, 2, 2, 5),
        UseCase("Sales-call summarization", 3, 5, 5, 5, 1),
        UseCase("Fraud co-pilot (PII offshore)", 5, 4, 4, 3, 4, hard_blocker=True),
    ]
    for c in rank(backlog):
        flag = " [BLOCKED]" if c.hard_blocker else ""
        print(f"{c.score():>6} {c.name}{flag}")
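Two guards are worth adding around a scoring function like this: a check that the written-down weights really sum to 1.0, and an explicit rule for what counts as the top quartile. The sketch below is standalone and makes its own assumptions: `check_weights` and `top_quartile` are hypothetical helper names, blocked cases (score 0) are excluded before the cut, the quartile is rounded up, and the use-case names and scores are placeholders rather than engagement data.

```python
import math


def check_weights(weights, tol=1e-9):
    """Raise if the scoring weights do not sum to 1.0."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total}, expected 1.0")


def top_quartile(scored):
    """Return the top 25% of (name, score) pairs, rounded up.

    Zero-scored (blocked) cases are dropped before the cut.
    """
    live = sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
    if not live:
        return []
    return live[: math.ceil(len(live) / 4)]


weights = {
    "business_value": 0.35,
    "data_readiness": 0.25,
    "tech_feasibility": 0.15,
    "time_to_value": 0.15,
    "risk": 0.10,
}
check_weights(weights)

# Placeholder backlog: names and scores are illustrative only.
scored_backlog = [
    ("uc-01", 91.0), ("uc-02", 84.5), ("uc-03", 77.0), ("uc-04", 72.0),
    ("uc-05", 66.5), ("uc-06", 58.0), ("uc-07", 49.0), ("uc-08", 0.0),
]
print(top_quartile(scored_backlog))
```

Excluding blocked cases before taking the quartile keeps a hard blocker from shrinking or padding the shortlist.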
Phase 3: the roadmap deliverable, in machine-readable form
The roadmap is the artifact that justifies the whole engagement. A bad roadmap is a Gantt chart of buzzwords with quarters labeled 'AI maturity' and 'scale.' A good one is specific enough that an engineering lead can read it and start a pilot on Monday. It names the use cases in priority order, states the model and infra choice per item with reasoning, declares dependencies, and sets eval thresholds the pilot must clear. We deliver it as a document and as a structured config, because the config is what survives the consultant leaving.
Below are two views of the same roadmap deliverable. The YAML is the quarter-1 plan an engineering team executes against. The JSON is the maturity scorecard that travels to the board. If your strategy firm cannot hand you something this concrete, ask why the deliverable is a slide deck instead of a spec.
# roadmap-q1.yaml
# Quarter-1 plan from an AI strategy engagement. Engineering executes
# against this. Eval thresholds gate the pilot before any scale spend.
maturity:
  current_level: 2        # piloting
  target_level_12mo: 4    # scaling
strategy_principles:
  - model_agnostic: true    # choose model by eval, not partnership
  - eval_first: true        # no scale spend before eval gate clears
  - code_ownership: client  # transferred day one of pilot
q1_pilot:
  use_case: support-deflection-rag
  opportunity_score: 88     # from opportunity_score.py
  stack:
    reasoning_model: claude-sonnet-4
    intent_model: claude-haiku-4   # cheaper path for classification
    fallback_model: gpt-4o
    vector_store: pgvector-0.7     # already on Postgres 16
    eval_harness: ragas
    observability: langfuse
  eval_gates:
    recall_at_5: 0.80
    faithfulness: 0.82
    fail_on_regression: true
  dependencies:
    - data-access-api       # blocker surfaced in discovery audit
    - pii-redaction-layer
engagement_shape:
  discovery_audit: 1-2 weeks
  pilot: 4-6 weeks
  cadence: weekly eval gate
{
  "maturity_scorecard": {
    "as_of": "2026-Q1",
    "current_level": 2,
    "target_level_12mo": 4,
    "dimensions": {
      "data_infra": { "score": 2, "gap": "no shared retrieval layer" },
      "eval_gov": { "score": 1, "gap": "no CI eval gate yet" },
      "org_owner": { "score": 2, "gap": "champion, no funded function" }
    },
    "top_use_cases": [
      { "name": "support-deflection-rag", "score": 88, "quarter": "Q1" },
      { "name": "sales-call-summarization", "score": 81, "quarter": "Q2" },
      { "name": "autonomous-contract-draft", "score": 47, "quarter": "defer" }
    ],
    "blocked": [
      { "name": "fraud-copilot-pii-offshore", "reason": "PII cannot leave region" }
    ],
    "recommended_next_gate": "greenlight Q1 pilot"
  }
}
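The eval gates in the quarter-1 plan only matter if something enforces them each week. Here is one way that check might look, as a sketch rather than our production harness: `check_gates` is a hypothetical function, the weekly metric values are invented for illustration, and reading `fail_on_regression` as "fail when a metric drops below the previous run" is our assumption about the flag's meaning.

```python
def check_gates(measured, thresholds, previous=None, fail_on_regression=True):
    """Return a list of gate failures; an empty list means the gate clears."""
    failures = []
    for metric, floor in thresholds.items():
        value = measured.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
            continue
        if value < floor:
            failures.append(f"{metric}: {value:.2f} is below the {floor:.2f} threshold")
        prior = (previous or {}).get(metric)
        if fail_on_regression and prior is not None and value < prior:
            failures.append(f"{metric}: regressed from {prior:.2f} to {value:.2f}")
    return failures


# Thresholds taken from the roadmap config; run results are hypothetical.
THRESHOLDS = {"recall_at_5": 0.80, "faithfulness": 0.82}
last_week = {"recall_at_5": 0.88, "faithfulness": 0.79}
this_week = {"recall_at_5": 0.84, "faithfulness": 0.79}

for failure in check_gates(this_week, THRESHOLDS, previous=last_week):
    print("GATE FAIL:", failure)
```

In CI, a non-empty failure list would typically become a non-zero exit code so the release stops before any scale spend.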
Every deliverable you should walk away with
Here is the full deliverable set, mapped to phase and to who consumes it. Use this as a checklist when you scope the engagement. If a firm's proposal is missing the scored backlog or the eval methodology, those are the two omissions that most often signal a deck-only engagement. The board scorecard and the executive summary are fine; they are just not sufficient on their own.
| Deliverable | Phase | Primary consumer | Format |
|---|---|---|---|
| Data + infra gap map | 1 Discovery | Eng lead, data owner | Document + diagram |
| Constraint register (PII, regulation, latency) | 1 Discovery | Legal, security, eng | Structured list |
| AI maturity grade (1-5) with gap analysis | 2 Assess | C-suite, board | Scorecard |
| Scored use-case backlog (ranked /100) | 2 Assess | Product, eng, CFO | Spreadsheet + formula |
| Eval methodology per top use case | 2 Assess | Eng lead | Named metrics + thresholds |
| 12-month roadmap + dependency graph | 3 Synthesis | C-suite, eng | Document + config |
| Quarter-1 pilot spec + stack choices | 3 Synthesis | Eng team | YAML / machine-readable |
| Cost + infra sizing (technical, not fees) | 3 Synthesis | Finance, eng | Token + infra estimate |
| Risk + compliance plan (audit, kill switch) | 3 Synthesis | Security, legal | Architecture doc |
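The token side of the cost and infra sizing row is simple arithmetic, and it helps to see it written out. The sketch below is illustrative only: `estimate_run_cost` is a hypothetical helper, and the question count, token counts, and per-million-token prices are placeholder inputs, not any vendor's actual rates. Substitute current published pricing before using the result.

```python
def estimate_run_cost(n_calls, input_tokens_per_call, output_tokens_per_call,
                      price_in_per_mtok, price_out_per_mtok):
    """Estimate API spend in dollars for a batch of model calls.

    Prices are expressed per million tokens.
    """
    input_mtok = n_calls * input_tokens_per_call / 1_000_000
    output_mtok = n_calls * output_tokens_per_call / 1_000_000
    return input_mtok * price_in_per_mtok + output_mtok * price_out_per_mtok


# Placeholder inputs: 500 eval questions, 4,000 input and 300 output tokens each,
# at assumed prices of $3 (input) and $15 (output) per million tokens.
cost = estimate_run_cost(500, 4_000, 300, price_in_per_mtok=3.0, price_out_per_mtok=15.0)
print(f"estimated eval run cost: ${cost:.2f}")
```

Multiplying the per-run figure by the eval cadence (weekly, in the plan above) gives the recurring eval budget line.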
What 'good' looks like, measured against dated numbers
Good strategy work attaches numbers to its claims. When we recommend a model for a retrieval use case, we cite the eval. On a 1,840-document RAG corpus we ran internally in 2026-Q1 with the Ragas harness, Claude Opus 4 scored 88% recall@5 against GPT-4o at 71%, same prompt and same corpus. The full Ragas regression run cost $14 in Claude API spend. Those numbers belong in the roadmap because they let a buyer size their eval budget before they sign a build. A firm that recommends a model without showing you the eval is recommending a vibe. Governance belongs here too: responsible AI practices should be designed into the roadmap, not bolted on after the pilot.
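For buyers who have not met the metric before, recall@5 asks what fraction of the documents that should have been retrieved actually show up in the top five results. The sketch below uses the textbook information-retrieval definition; it is an assumption that this matches how a given harness computes it, since tools such as Ragas define their own retrieval metrics. The document ids are made up.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant document ids found in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, k=5):
    """Average recall@k over (retrieved, relevant) pairs, one per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Two illustrative queries with made-up document ids.
runs = [
    (["d1", "d7", "d3", "d9", "d2", "d4"], {"d1", "d4"}),  # d4 is ranked 6th: missed
    (["d5", "d6"], {"d5"}),
]
print(f"mean recall@5: {mean_recall_at_k(runs):.2f}")
```

A dated run of a metric like this, on a named corpus, is what turns a model recommendation into something a buyer can check.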
Red flags of bad AI strategy consulting
Six tells separate strategy theater from strategy that ships. No scored backlog, only a workshop ranking. No eval methodology in the deliverable set. A roadmap with no model or infra specificity, just 'deploy generative AI.' A maturity grade with no gap analysis attached. A pricing structure that locks you into the build before the audit data is in. And a single-vendor stack recommendation that conveniently matches the firm's partnership. Any two of these together, and you should walk.
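If you are comparing several proposals, the two-flag rule is easy to apply consistently as a checklist. A minimal sketch, with hypothetical flag identifiers that paraphrase the six tells:

```python
# Hypothetical identifiers for the six tells.
RED_FLAGS = (
    "no_scored_backlog",
    "no_eval_methodology",
    "no_model_or_infra_specificity",
    "maturity_grade_without_gap_analysis",
    "build_lock_in_before_audit",
    "single_vendor_matches_partnership",
)


def should_walk(observed, threshold=2):
    """True when at least `threshold` distinct red flags were observed."""
    observed = set(observed)
    unknown = observed - set(RED_FLAGS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return len(observed) >= threshold


print(should_walk(["no_scored_backlog"]))                         # one flag
print(should_walk(["no_scored_backlog", "no_eval_methodology"]))  # two flags
```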
How strategy connects to implementation without a handoff cliff
The most expensive failure in AI consulting is the handoff cliff: a strategy firm delivers a deck, walks away, and a separate build team starts from scratch, re-discovering the constraints the audit already found. The roadmap config exists to prevent exactly this. When the pilot team picks up the YAML, the model choice, the eval thresholds, and the dependency blockers are already specified. The first week of the pilot is wiring, not re-litigating the strategy. This is also where the custom build vs off-the-shelf decision gets settled per use case, not as a blanket policy.
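Declared dependencies also give the pilot team a build order for free. The sketch below uses Python's standard-library `graphlib` (3.9+) to sort a dependency map so that prerequisites come first. The first entry mirrors the dependencies listed in the quarter-1 plan; the second entry is a hypothetical addition for illustration.

```python
from graphlib import TopologicalSorter


def build_order(dependencies):
    """Return items ordered so every prerequisite precedes what depends on it.

    `dependencies` maps each item to the list of items it depends on.
    graphlib raises CycleError if the declared dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


roadmap_dependencies = {
    "support-deflection-rag": ["data-access-api", "pii-redaction-layer"],
    "sales-call-summarization": ["pii-redaction-layer"],  # hypothetical dependency
}

for step, item in enumerate(build_order(roadmap_dependencies), start=1):
    print(step, item)
```

A circular dependency surfacing here is a roadmap defect worth catching before the pilot starts, not during week three.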
Engagement timeline and where the weeks actually go
Buyers consistently underestimate discovery and overestimate roadmap-writing. The roadmap is a week of synthesis on top of good assessment data. Discovery is where the real time goes, because data access requests, security reviews, and stakeholder scheduling are slow in any organization above 200 people. Across the engagement classes we've observed (2026-Q1), a boutique operator with a gated audit-to-roadmap process reaches a usable roadmap in about 4.5 weeks median, vertical specialists take about 6 weeks, and tier-1 strategy firms run around 11 weeks. Tier-1 strategy firms run longer not because the thinking is deeper but because procurement and multi-stakeholder sign-off add weeks before discovery even starts.
Questions to ask before you sign a strategy engagement
Send these five questions to any firm on your shortlist before you sign. They take a firm 30 minutes to answer well and an hour to dodge. The dodge is the answer. A firm that cannot show you a redacted opportunity-scoring sheet or a roadmap config from a past engagement is asking you to trust that the deliverable exists. We hand prospective clients a redacted version of both before they sign anything.
# 5 questions to ask before signing an AI strategy engagement
## 1. Scoring transparency
Show us a redacted use-case scoring sheet from a past engagement.
Required: a written weighting formula, per-use-case scores, at least
one board-favorite that scored low and why.
## 2. The deliverable, concretely
What does the roadmap look like as a file? Show a redacted YAML or
spec, not a deck. If the only artifact is slides, ask what an
engineering team executes against on day one.
## 3. Eval methodology
Which eval harness do you design into the roadmap (Ragas, LangSmith,
Braintrust)? What thresholds gate the pilot? Show a dated run.
## 4. Model + stack neutrality
Do you recommend models by eval result or by partnership? Show a case
where the eval pushed you off your default vendor.
## 5. Handoff mechanics
How does the strategy transfer to a build team? Do you hand a
machine-readable config, and is code ownership with the client from
day one of the pilot?
# Scoring: a clean answer to 4 of 5 is a strong signal.
# A dodge on questions 1 or 3 is a red flag worth two of the others.
Frameworks a strategy firm should name (and how to read them)
Frameworks are scaffolding, not answers. A firm that quotes the Gartner AI Hype Cycle to justify a recommendation is using it as decoration. A firm that uses a maturity model to expose a specific gap, then closes it in the roadmap, is using it as a tool. The frameworks named in this guide, the McKinsey and Gartner maturity models and the Gartner AI Hype Cycle, come up in most credible engagements. The test is whether the firm uses them to produce a number or a decision, or just to fill slides. Real eval work names tools too: Ragas for retrieval metrics, Langfuse or LangSmith for production traces, pgvector for retrieval, and a model gateway across Claude, GPT-4o, and open-source options like Llama 4 chosen by corpus result.
FAQ
What should I expect from an AI strategy consulting engagement?
Expect four phases with hard deliverables and decision gates: a 1 to 2 week discovery audit (data + infra gap map, constraint register), a 2 to 3 week assessment and opportunity-mapping phase (AI maturity grade plus a scored use-case backlog ranked /100), a roadmap synthesis week (12-month roadmap, quarter-1 pilot spec, cost and infra sizing), and a hand-off into a 4 to 6 week pilot with weekly eval gates. If the engagement ends with a slide deck and no scored backlog or eval methodology, you bought theater, not strategy.
What deliverables should AI strategy consulting produce?
A complete set includes: a data and infrastructure gap map, a constraint register (PII, regulation, latency), an AI maturity grade with gap analysis, a scored use-case backlog with a written weighting formula, an eval methodology per top use case (named harness and thresholds), a 12-month roadmap with a dependency graph, a machine-readable quarter-1 pilot spec, technical cost and infra sizing, and a risk and compliance plan covering audit logging and kill-switch revocation. Missing the scored backlog or the eval methodology is the most common tell of a deck-only engagement.
How long does an AI strategy engagement take?
A boutique operator with a gated audit-to-roadmap process reaches a usable roadmap in about 4.5 weeks median (2026-Q1, across 11 engagements observed). Vertical specialists run about 6 weeks because compliance-gated data access adds time to discovery. Tier-1 strategy firms run around 11 weeks, mostly because procurement, security review, and multi-stakeholder workshops add weeks before discovery even starts. Discovery, not roadmap-writing, is where the time actually goes.
What AI maturity model do consultants use?
Most firms use a five-level model, whether badged as the McKinsey AI maturity framework, the Gartner AI maturity model, or a house variant. The levels run from ad-hoc experimentation (level 1) through piloting, operationalizing, and scaling, to AI embedded in core operations (level 5). The value is the gap analysis, not the grade. Tool adoption (people using ChatGPT or Copilot) is a level-1 signal. Eval discipline and production audit logging are what separate a level 4 from a level 2.
What are the red flags of bad AI strategy consulting?
Six tells: no scored backlog (only a workshop popularity ranking), no eval methodology in the deliverable set, a roadmap with no model or infrastructure specificity (just 'deploy generative AI'), a maturity grade with no gap analysis, a contract that locks the full build before discovery-audit data is in, and a single-vendor stack recommendation that conveniently matches the firm's own partnership. Any two together, and you should walk.
How does AI strategy connect to implementation?
Through a machine-readable roadmap config, not a deck. The config carries the model choice (for example Claude Sonnet 4 for reasoning, Haiku 4 for the cheap intent path), the eval thresholds (recall@5, faithfulness), and the dependency blockers surfaced during discovery directly into the pilot. That lets pilot week one be wiring rather than re-discovery. The expensive failure is the handoff cliff, where a strategy firm delivers prose and a separate build team starts from scratch. Insist that code ownership sits with you from day one of the pilot.
Should AI strategy and implementation come from the same firm?
Not necessarily, but the handoff has to be clean either way. Many boutique operators (including us) bundle strategy and implementation so the team that scored the backlog also builds the pilot, which removes re-discovery cost. Tier-1 firms often split them across separate practice areas, which adds handoff risk. If you do split the firms, require the strategy deliverable to ship as a machine-readable config so the build team executes against a spec, not an interpretation of slides.