How to Run an AI Readiness Assessment

We've watched AI pilots stall for the same reason a building inspection fails: the foundation was never checked. A team picks a flashy use case, wires up a model, demos something impressive, then hits production and finds the data is dirty, nobody owns the audit log, and the one engineer who understood the retrieval layer just left. An AI readiness assessment is the inspection you run before you pour the concrete. Done right, it tells you in two weeks what would otherwise take a six-month failed pilot to learn.

Most readiness tools online are self-serve marketing funnels: answer 12 questions, get a vague maturity score, talk to sales. We're going to hand you the actual artifact instead: a six-dimension scorecard, a weighted scoring formula in Python, the rubric as a decision matrix, and a 14-day process you can run with your own team.

If you're still deciding whether to build in-house or hire help at all, read the consulting-vs-build decision first. If you've decided AI is coming and you need to know how exposed you are, this is the inspection.

The output of a good assessment is not a number. It's a prioritized gap list with owners and a 90-day roadmap. The number is what executives ask for; the gap list is what changes what your team does on Monday. We'll show you both, and we'll be honest about the trap: a readiness score of 82 means nothing if the assessor scored every dimension generously to win the follow-on engagement. Score yourself before you trust anyone else's score.

What an AI readiness assessment actually is (and what it isn't)

An AI readiness assessment is a structured evaluation of whether your organization can build, deploy, and operate AI systems safely and profitably. It scores six dimensions: data readiness, infrastructure, talent and skills, governance and risk, use-case viability, and change management. Each dimension gets a 0 to 4 maturity score against documented evidence, the scores are weighted by delivery risk, and the result is a single readiness index plus a ranked list of the gaps that will hurt you first.

Here's what it isn't. It isn't an AI maturity model where you slot yourself into 'crawl, walk, run' and feel vaguely behind. It isn't a vendor questionnaire designed to qualify you for a sales call. And it isn't a one-time event you file away. The assessment is a baseline. You re-run it every two or three quarters because data drifts, the team turns over, and the EU AI Act compliance bar moves.

The six dimensions of AI readiness

We score six dimensions because each is a distinct way an AI initiative dies. Skip data readiness and your model hallucinates on dirty inputs. Skip infrastructure and you can't ship past a notebook demo. Skip talent and the system rots the moment the one person who understands it leaves. Skip governance and your first regulated deployment gets blocked by legal. Skip use-case viability and you build something impressive that nobody needed. Skip change management and the people meant to use it route around it.

AI READINESS SCORECARD — SIX DIMENSIONS

Figure 1: The six-dimension readiness radar. Each axis is scored 0 to 4. The shaded polygon shows an illustrative organization: strong on infrastructure and use-case viability, weak on governance and change management. The dashed ring at level 3 is the minimum-to-ship threshold for any single dimension.

Notice the weighting. Data readiness carries the most because in 2026 the bottleneck is almost never the model and almost always the data feeding it. Infrastructure is next because shipping past a demo requires real MLOps. The exact weights are yours to adjust, and we show you the code to do it, but start here and change them only when you can defend the change with evidence.
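If you do change the weights, a cheap safeguard is to confirm they still cover all six dimensions and sum to 100% before anything gets scored. The sketch below is one way to do that, assuming the dimension keys used in the scoring code later in this post; `validate_weights` is a hypothetical helper, not part of any library.

```python
from fractions import Fraction

# Default weights from the scorecard, written as whole percentages so the
# sum check is exact (no floating-point drift).
DEFAULT_WEIGHTS = {
    "data_readiness": 25,
    "infrastructure": 20,
    "talent_skills": 15,
    "governance_risk": 15,
    "use_case_viability": 15,
    "change_management": 10,
}


def validate_weights(weights_pct):
    """Return weights as exact fractions of 1, or raise ValueError.

    Hypothetical helper: rejects missing dimensions, non-positive
    weights, and weight sets that do not add up to 100.
    """
    missing = set(DEFAULT_WEIGHTS) - set(weights_pct)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    if any(w <= 0 for w in weights_pct.values()):
        raise ValueError("every dimension needs a positive weight")
    total = sum(weights_pct.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected 100")
    return {d: Fraction(w, 100) for d, w in weights_pct.items()}
```

Bumping one weight without lowering another now fails loudly instead of silently inflating the index.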

The weighted scoring rubric: how to score each dimension 0 to 4

A score is only as honest as its rubric. Below is the rubric we run. Each dimension has explicit evidence requirements at each level, so two assessors scoring the same organization land within a point of each other. The discipline is in level 3, the ship bar: it requires the practice to be tested and enforced, not just documented. Anyone can write a data-governance policy. Far fewer can show you the CI job that fails a deploy when PII leaks into the embedding pipeline.

Dimension (weight)	Score 0-1	Score 2	Score 3 (ship bar)	Score 4 (optimizing)
Data readiness (25%)	No data catalog. Quality unknown. PII unmapped. Sources scattered across spreadsheets.	Catalog exists, ownership documented, but quality is unmeasured and access is broad.	Profiled quality metrics, lineage tracked, PII classified, access controls enforced in CI.	Continuous quality monitoring with drift alerts, automated lineage, golden eval sets versioned.
Infrastructure (20%)	Notebooks only. No deployment path. No model serving. Manual everything.	Some cloud capacity (AWS Bedrock, Azure OpenAI) but no CI/CD for models, no observability.	Model serving in place, CI/CD with eval gates, tracing via Langfuse or LangSmith, cost tracked.	Autoscaling inference, canary releases, full OpenTelemetry traces, per-token cost dashboards.
Talent + skills (15%)	No in-house ML or applied-AI skill. Single hero engineer or full reliance on a vendor.	Small skilled team but no redundancy. Knowledge in heads, not docs. No upskilling plan.	Cross-trained team, documented runbooks, named owners per system, active upskilling cadence.	Deep bench, internal enablement program, contributors to OSS frameworks (LangGraph, Ragas).
Governance + risk (15%)	No AI policy. No risk register. No audit log. No human-in-the-loop on consequential actions.	Policy drafted, risk register exists, but controls are manual and not enforced on every call.	Immutable audit log, kill switch under 60s, HITL gates, EU AI Act risk-tier mapping done.	Automated policy gates on every tool call, red-team cadence, third-party assurance in place.
Use-case viability (15%)	Use cases chosen by hype. No ROI test. No baseline. Success undefined.	Use cases listed with rough value estimates but no eval-able success metric or baseline.	Each use case has a measurable success metric, a baseline, and a buyer-grade ROI test.	Portfolio scored and ranked, value tracked post-launch against the original ROI thesis.
Change management (10%)	AI built in a corner. Affected teams not consulted. No adoption or training plan.	Stakeholders informed but not involved. Training planned but not resourced.	Affected teams co-design the workflow, training is funded, adoption metrics are tracked.	Champions network in place, feedback loop to product, adoption measured against targets.

Score each dimension against documented evidence, not intentions. Level 3 (enforced, tested, gated) is the minimum-to-ship bar. Spend roughly half a day per dimension gathering evidence.
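The "within a point of each other" claim is easy to check mechanically once two assessors have scored independently. A minimal sketch, assuming each assessor's scores arrive as a plain dict keyed by dimension; the function name and the sample scores are illustrative, not from a real engagement.

```python
def disagreements(assessor_a, assessor_b, tolerance=1):
    """Return dimensions where two assessors differ by more than `tolerance`.

    A dimension scored by only one assessor is also flagged, because a
    missing score cannot be reconciled.
    """
    flagged = {}
    for dim in sorted(set(assessor_a) | set(assessor_b)):
        a, b = assessor_a.get(dim), assessor_b.get(dim)
        if a is None or b is None or abs(a - b) > tolerance:
            flagged[dim] = (a, b)
    return flagged


# Illustrative scores: the talent gap is the classic "hero engineer" inflation.
a = {"data_readiness": 2, "talent_skills": 4, "governance_risk": 2}
b = {"data_readiness": 3, "talent_skills": 1, "governance_risk": 1}
print(disagreements(a, b))  # {'talent_skills': (4, 1)}
```

Anything this flags goes back to the evidence: the assessors re-read the artifact together rather than averaging the two numbers.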

Dimension 1: scoring data readiness (the 25% that breaks most pilots)

Data readiness is the highest-weighted dimension because it's the one that quietly fails every pilot. A model is a commodity in 2026; you can swap Claude Sonnet 4 for GPT-4o in a config line. Your data is not a commodity. To score data readiness, gather evidence on four things: do you have a catalog that knows where your data lives, do you measure quality, have you classified PII, and are access controls enforced rather than aspirational. The choice between custom AI versus an off-the-shelf product often comes down to exactly this: off-the-shelf tools assume your data is clean and accessible, and most organizations discover during the pilot that it isn't. Across enterprise AI surveys published through 2025, roughly 70% to 80% of AI projects stall before production, with data quality and integration cited as the top recurring blocker.

Dated industry signals on why data readiness dominates the score. Figures reflect widely reported 2025-2026 enterprise AI surveys, used here as directional benchmarks for sizing your own gap.

~70-80%

Enterprise AI projects that stall before production

Reported across multiple 2025 enterprise surveys; data quality and integration cited as the top recurring blocker.

Up to 80%

Data-prep share of a typical AI project timeline

Long-standing data-science benchmark, restated in 2025 MLOps reports. The model is the cheap part.

Score 3

Our minimum data-readiness ship bar

Profiled quality metrics, tracked lineage, PII classified, access controls enforced in CI. Below this, do not pilot in production.

100%

PII classification coverage we require before any RAG index build

Every field that enters an embedding index is classified and access-gated. Unclassified fields block the build, 2026 standard.
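As a rough illustration of what "unclassified fields block the build" can mean in practice, here is a small gate that refuses to proceed when any field headed for the index lacks a classification. The field names and the `pii` / `non_pii` labels are assumptions made for the example; a real catalog will have its own taxonomy.

```python
def unclassified_fields(schema_fields, classifications):
    """Fields that would enter the index with no classification on record."""
    return [f for f in schema_fields if f not in classifications]


def pii_gate(schema_fields, classifications):
    """Raise RuntimeError if any field is unclassified; otherwise return
    the classified fields grouped by label."""
    missing = unclassified_fields(schema_fields, classifications)
    if missing:
        raise RuntimeError(f"index build blocked, unclassified fields: {missing}")
    grouped = {}
    for field in schema_fields:
        grouped.setdefault(classifications[field], []).append(field)
    return grouped


# Hypothetical schema and labels for a support-ticket index.
fields = ["ticket_body", "customer_email", "product_sku"]
labels = {"ticket_body": "pii", "customer_email": "pii", "product_sku": "non_pii"}
print(pii_gate(fields, labels))
```

Run as a CI step ahead of the index build, a raised error is what turns the classification policy into an enforced control.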

Dimension 2: scoring infrastructure and MLOps readiness

Infrastructure readiness is the gap between a demo and a system. A demo runs in a notebook. A system runs behind a serving layer, ships through CI/CD with an eval gate that fails the build when recall drops, emits traces to Langfuse or LangSmith, and tracks per-token cost. To score this dimension, ask your team how a model change reaches production today. If the answer involves a person copying a file, you're at a 1. If there's an eval gate that can block a deploy, you're at a 3.

# Infrastructure dimension — score it by checking each item.
# Each TRUE moves you up. Score 3 (ship bar) requires every item in the
# ship_bar block. Run this as a literal checklist during the assessment.

infrastructure_dimension:
  weight: 0.20

  ship_bar:           # all must be true to score 3+
    model_serving:      true   # vLLM / AWS Bedrock / Azure OpenAI behind an API
    ci_cd_for_models:   true   # model + prompt changes ship through a pipeline
    eval_gate_in_ci:    true   # build fails if recall@5 or faithfulness drops
    tracing:            true   # Langfuse or LangSmith on every production call
    cost_tracking:      true   # per-token + per-request spend on a dashboard
    rollback_path:      true   # one command reverts a bad model/prompt deploy

  optimizing:         # level 4 — measured + improving each quarter
    autoscaling_inference: false
    canary_releases:       false
    otel_full_traces:      false   # OpenTelemetry spans across the agent graph

  scoring:
    0_1: "notebooks only, manual deploy, no observability"
    2:   "cloud capacity exists, no CI/CD for models, no eval gate"
    3:   "all ship_bar items true"
    4:   "all ship_bar + all optimizing items true"
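The `eval_gate_in_ci` item is the one teams most often claim without being able to show. A minimal sketch of such a gate follows, assuming your eval harness already produces a dict of metric values for the baseline and for the candidate build. The metric names mirror the checklist comment; the 0.02 allowed drop is an arbitrary placeholder, not a recommendation.

```python
def eval_gate(baseline, current, max_drop=0.02):
    """Return metrics that regressed by more than `max_drop` (absolute),
    or that are missing from the current run."""
    failures = {}
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None or base - now > max_drop:
            failures[metric] = (base, now)
    return failures


def ci_exit_code(baseline, current, max_drop=0.02):
    """Print each failure and return 1 if the gate fails, else 0.
    A CI step would pass this value to sys.exit()."""
    failures = eval_gate(baseline, current, max_drop)
    for metric, (base, now) in failures.items():
        print(f"FAIL {metric}: baseline={base} current={now}")
    return 1 if failures else 0


baseline = {"recall_at_5": 0.80, "faithfulness": 0.90}
candidate = {"recall_at_5": 0.70, "faithfulness": 0.91}
print(ci_exit_code(baseline, candidate))  # prints the recall failure, then 1
```

The evidence an assessor wants is a dated pipeline run where this step returned non-zero and the deploy did not happen.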

Dimension 3: scoring talent, skills, and the bus-factor problem

The talent dimension is where readiness scores most often get inflated. Leadership sees one brilliant engineer shipping impressive demos and scores talent a 4. The honest score is a 1, because the bus factor is one. The right question isn't 'do we have skilled people?' It's 'if our best AI engineer left next week, would the system keep running and could someone else change it safely?' Score on redundancy, documentation, named ownership, and a real upskilling cadence. A team that contributes to LangGraph or maintains a Ragas eval config in public is a 4. A team where the knowledge lives in one person's head is a 1, no matter how good that person is.

Talent readiness: bus-factor distribution across teams we've assessed (illustrative pattern, 2025-2026)

Bus factor = 1 (single hero engineer)

41% of teams

Most common pattern. Scores a 1 on talent regardless of demo quality. Highest hidden risk.

Bus factor = 2-3 (small cross-trained team)

38% of teams

Scores 2-3. Redundancy exists but documentation is usually thin. The realistic target for a first pilot.

Bench + enablement program (score 4)

21% of teams

Documented runbooks, internal training, OSS contribution. Rare, and usually the orgs that least need an assessment.
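To keep the talent score tied to evidence, it helps to compute the bus factor from an ownership map rather than estimate it in a meeting. A small sketch, assuming you can list who is able to change each system safely; the system and person names are made up.

```python
def bus_factor_report(owners_by_system):
    """Return (bus factor per system, systems with at most one capable owner)."""
    factors = {name: len(set(people)) for name, people in owners_by_system.items()}
    single = sorted(name for name, count in factors.items() if count <= 1)
    return factors, single


# Hypothetical ownership map gathered during evidence collection.
owners = {
    "retrieval_layer": ["dana"],
    "eval_harness": ["dana", "lee"],
    "serving_api": ["lee", "sam", "dana"],
}
factors, single_owner = bus_factor_report(owners)
print(factors)       # {'retrieval_layer': 1, 'eval_harness': 2, 'serving_api': 3}
print(single_owner)  # ['retrieval_layer']
```

Any system in the single-owner list is a direct input to the talent score, whatever the demo looked like.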

Dimension 4: scoring governance, risk, and compliance readiness

Governance readiness is the dimension legal and risk teams care about and engineering teams forget until it blocks a launch. In 2026 the EU AI Act risk tiering is in force for many enterprise deployments, SOC 2 auditors ask for AI-specific event trails, and any agent touching customer data needs an audit log and a revocation path wired by default. Score this dimension on whether the controls are enforced on every call, not whether a policy document exists. Our guide to what responsible AI means in practice covers the full control set; for the assessment, the two non-negotiables are an immutable audit log and a kill switch that revokes an agent in under 60 seconds.

GOVERNANCE READINESS GATE — WHAT 'ENFORCED' LOOKS LIKE

Figure 2: The governance evidence flow. A score of 3 requires every consequential agent action to pass through a policy gate, write to an immutable audit log, and be revocable in under 60 seconds. A policy document with none of this wired scores a 1.
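The flow in Figure 2 can be sketched in a few lines to show what an assessor is looking for. This is an in-memory illustration under stated assumptions, not a production design: the hash chain only makes tampering detectable, real immutability needs storage-level guarantees, and the class names `AuditLog` and `PolicyGate` are hypothetical.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(prev_hash, event):
        payload = json.dumps(event, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, event):
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry = {"event": event, "prev": prev, "hash": self._digest(prev, event)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != self._digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True


class PolicyGate:
    """Every action is checked against the revocation set and logged."""

    def __init__(self, log):
        self.log = log
        self.revoked = set()

    def revoke(self, agent_id):
        # The kill switch: takes effect on the agent's very next call.
        self.revoked.add(agent_id)
        self.log.append({"type": "revoke", "agent": agent_id})

    def allow(self, agent_id, action):
        allowed = agent_id not in self.revoked
        self.log.append({"type": "action", "agent": agent_id,
                         "action": action, "allowed": allowed})
        return allowed
```

The design point is that the gate sits on the call path: an agent cannot act without producing a log entry, and revocation takes effect on the next call instead of waiting for a redeploy.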

Dimensions 5 and 6: use-case viability vs change management

The last two dimensions are the ones executives most often conflate, so we score them separately on purpose. Use-case viability asks whether the thing you want to build is worth building and measurable. Change management asks whether the people it touches will actually use it. A use case can be perfectly viable on paper and die because the support team it was built for was never consulted and quietly kept the old process. Score viability on whether each candidate has a baseline and a buyer-grade ROI test. Score change management on whether affected teams co-designed the workflow.

Use-case viability (15%)

Asks: is this worth building, and can we measure success? Score 3 requires each use case to have a measurable success metric, a documented baseline, and a buyer-grade ROI test run before any code ships. Failure mode at score 0: use cases chosen because a competitor announced one, with no baseline and no definition of done. The fix is a prioritization pass that ranks candidates by value over effort and kills the ones with no measurable metric. This is engineering and product economics. It does not require any organizational change to assess.

Change management (10%)

Asks: will the people this touches actually adopt it? Score 3 requires affected teams to co-design the workflow, funded training, and tracked adoption metrics. Failure mode at score 0: AI built in a corner by a central team, shipped to a department that was never consulted and routes around it within a month. The fix is involving the end users as co-designers from week one. This is organizational and political, not technical, which is exactly why technical teams under-score it and why we weight it as a distinct dimension.
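The prioritization pass described under use-case viability can be sketched as a filter followed by a sort. The field names and the 1 to 10 value and effort scales below are assumptions for illustration; use whatever estimates your ROI test produces.

```python
def prioritize(candidates):
    """Drop candidates with no success metric or baseline, then rank the
    rest by value over effort, highest first."""
    viable = [
        c for c in candidates
        if c.get("metric") and c.get("baseline") is not None
    ]
    return sorted(viable, key=lambda c: c["value"] / c["effort"], reverse=True)


# Hypothetical candidates; value and effort are relative scores from 1 to 10.
candidates = [
    {"name": "ticket_triage", "metric": "median handle time",
     "baseline": 14.5, "value": 8, "effort": 2},
    {"name": "contract_review", "metric": "review hours per contract",
     "baseline": 6.0, "value": 9, "effort": 6},
    {"name": "ai_chat_widget", "metric": None,
     "baseline": None, "value": 7, "effort": 3},
]
for c in prioritize(candidates):
    print(c["name"])  # ticket_triage, then contract_review
```

The candidate with no metric never reaches the ranking, which is the point: it cannot be scored above 1 on viability until someone defines success.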

The 14-day process: how to run the assessment end to end

Two weeks is enough. Longer than that and the assessment becomes a project that competes with the work it's supposed to enable. Here is the sequence we run as a discovery audit: scope and stakeholder kickoff, then evidence gathering across the six dimensions, then scoring against the rubric, then a gap workshop, then the readiness report and roadmap. Each phase is a decision point. If data readiness scores a zero on day five, you can stop and fix data before spending another nine days assessing infrastructure you won't be allowed to use yet.

14-Day AI Readiness Assessment Process

Days 1-2: Scope + kickoff

Days 3-7: Evidence gathering

Days 8-9: Score against rubric

Days 10-11: Gap workshop

Days 12-13: Readiness report

Day 14: Roadmap readout

The scorecard as code: weighted scoring formula in Python

Here is the scorecard as something you can actually run. The Python computes the weighted readiness index and enforces the hard rule that any single dimension below 2 caps the verdict at Not Ready, no matter how high the weighted total. The YAML is the scorecard your assessors fill in. The JSON is the report shape your roadmap tooling consumes. Copy these, adjust the weights to your context, and you have a defensible scoring artifact instead of a vibe.

Python (scoring engine), YAML (the scorecard), JSON (report output)

python

"""readiness_score.py

Compute a weighted AI readiness index from a filled scorecard.
Enforces the hard floor: any dimension below 2 caps the verdict
at NOT_READY regardless of the weighted total.
    python readiness_score.py --input scorecard.yaml
"""
import argparse, json, yaml
from dataclasses import dataclass
from typing import Dict

WEIGHTS = {
    "data_readiness":    0.25,
    "infrastructure":    0.20,
    "talent_skills":     0.15,
    "governance_risk":   0.15,
    "use_case_viability":0.15,
    "change_management": 0.10,
}
MAX_SCORE = 4          # each dimension scored 0-4
HARD_FLOOR = 2         # any dimension below this caps verdict

@dataclass
class Scorecard:
    org: str
    scores: Dict[str, int]

    @property
    def index(self) -> float:
        total = sum((self.scores.get(d, 0) / MAX_SCORE) * w * 100
                    for d, w in WEIGHTS.items())
        return round(total, 1)

    @property
    def verdict(self) -> str:
        if any(self.scores.get(d, 0) < HARD_FLOOR for d in WEIGHTS):
            return "NOT_READY"   # hard floor breached
        if self.index >= 75:
            return "READY"
        if self.index >= 50:
            return "PARTIALLY_READY"
        return "NOT_READY"

    @property
    def gaps(self):
        # rank dimensions by weighted points lost, biggest first
        ranked = sorted(
            WEIGHTS,
            key=lambda d: (MAX_SCORE - self.scores.get(d, 0)) * WEIGHTS[d],
            reverse=True,
        )
        return [d for d in ranked if self.scores.get(d, 0) < 3]

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", default="scorecard.yaml")
    args = ap.parse_args()
    # Close the file handle deterministically instead of leaking it.
    with open(args.input, encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)
    card = Scorecard(org=raw["org"], scores=raw["scores"])
    report = {
        "org": card.org,
        "readiness_index": card.index,
        "verdict": card.verdict,
        "prioritized_gaps": card.gaps,
    }
    # ensure_ascii=False keeps non-ASCII characters in the org name readable.
    print(json.dumps(report, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    main()

yaml

# scorecard.yaml
# Filled during the assessment. Each dimension scored 0-4 against
# documented evidence. Do not score from opinion — attach the artifact.

org: "Example Corp — Q2 2026 baseline"

scores:
  data_readiness:     2   # catalog exists, quality unmeasured, access broad
  infrastructure:     3   # serving + CI/CD + eval gate + tracing wired
  talent_skills:      1   # bus factor of 1, no runbooks
  governance_risk:    1   # policy doc only, no audit log, no kill switch
  use_case_viability: 3   # each use case has baseline + ROI test
  change_management:  2   # stakeholders informed, training unfunded

evidence:
  data_readiness:     "link: data-catalog; gap: no quality metrics"
  infrastructure:     "link: ci-pipeline + langfuse dashboard"
  talent_skills:      "gap: single owner, no runbook repo"
  governance_risk:    "link: policy.md; gap: nothing wired in code"
  use_case_viability: "link: usecase-roi-sheet"
  change_management:  "gap: support team not yet co-designing"

json

{
  "org": "Example Corp — Q2 2026 baseline",
  "readiness_index": 51.2,
  "verdict": "NOT_READY",
  "_why": "talent_skills and governance_risk both below the hard floor of 2",
  "prioritized_gaps": [
    "data_readiness",
    "talent_skills",
    "governance_risk",
    "change_management"
  ]
}

What to do with your score: ready, partially ready, or not ready

A score with no decision attached is a slide, not an assessment. Each verdict maps to a concrete next action. The hard floor stops organizations from averaging away a fatal gap: a 4 on data and infrastructure with a 0 on governance is not 'mostly ready,' it's a launch legal will block. Below is what each band means.

Verdict	What the score means	What to do next	Timeline
READY (index 75+, no dimension below 2)	Foundations are enforced, not aspirational. The gaps that remain are optimization, not existential.	Run a scoped 4-6 week production pilot with weekly eval gates. Track the remaining gaps as roadmap items, not blockers.	Pilot now; production in one quarter if the pilot clears eval gates.
PARTIALLY READY (index 50-74)	Some dimensions are solid, others are documented but not enforced. Real but containable risk.	Pilot in a sandbox or low-stakes internal use case. Close the top two weighted gaps in parallel before any customer-facing launch.	Sandbox pilot now; production in two quarters once top gaps clear.
NOT READY (index below 50, or any dimension below 2)	A foundational dimension is missing. Building AI on top of it now multiplies the risk, it does not reduce it.	Stop new use-case work. Run a focused gap-closure sprint on the lowest-scoring weighted dimension first. Re-assess that dimension before resuming.	No production work until the floor is cleared. Re-score the failed dimension in 4-8 weeks.

Each readiness band maps to a specific action. The hard floor (any dimension below 2) overrides the index every time.
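To see the override in action, consider an organization that scores 4 everywhere except a 0 on governance. The sketch below recomputes the index with exact fractions, so no rounding is involved. It assumes the weights and the 75 and 50 thresholds from this post; `readiness` is a simplified stand-in for the scoring engine above, not a replacement for it.

```python
from fractions import Fraction

WEIGHTS_PCT = {
    "data_readiness": 25,
    "infrastructure": 20,
    "talent_skills": 15,
    "governance_risk": 15,
    "use_case_viability": 15,
    "change_management": 10,
}


def readiness(scores, floor=2, max_score=4):
    """Return (index out of 100, verdict) with the hard floor applied first."""
    index = sum(Fraction(scores[d] * w, max_score) for d, w in WEIGHTS_PCT.items())
    if any(scores[d] < floor for d in WEIGHTS_PCT):
        verdict = "NOT_READY"          # floor breach overrides the index
    elif index >= 75:
        verdict = "READY"
    elif index >= 50:
        verdict = "PARTIALLY_READY"
    else:
        verdict = "NOT_READY"
    return float(index), verdict


strong_but_ungoverned = dict.fromkeys(WEIGHTS_PCT, 4) | {"governance_risk": 0}
print(readiness(strong_but_ungoverned))         # (85.0, 'NOT_READY')
print(readiness(dict.fromkeys(WEIGHTS_PCT, 3)))  # (75.0, 'READY')
```

An index of 85 would clear the READY band on its own; the governance zero is what drags the verdict down, which is exactly the case averaging would hide.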

From gaps to a 90-day roadmap: the mapping that makes the score useful

The deliverable that changes behavior is the gap-to-roadmap mapping. Take the prioritized gap list from the scorecard, attach an owner and a sized action to each gap, and sequence them so hard-floor breaches close first and the remaining gaps follow in order of weighted points lost. The Python ranks the gaps by weighted points lost; the roadmap turns that ranking into a 90-day plan where each gap becomes a concrete first action, not a theme.

# gap-to-roadmap.yaml
# Generated from prioritized_gaps in readiness_report.json.
# Hard-floor breaches first, then highest weighted-points-lost. Each gap
# gets ONE owner and a sized, dated first action. No theme without an action.

roadmap:
  org: "Example Corp"
  horizon_days: 90

  gaps:
    - dimension: governance_risk        # score 1, weight 15% — hard-floor breach
      action: "Wire immutable audit log + policy gate + <60s kill switch"
      owner: "Platform lead"
      first_milestone: "Audit log append-only in staging"
      sized: "2-3 weeks"
      unblocks: "any customer-facing pilot"

    - dimension: talent_skills          # score 1, weight 15% — hard-floor breach
      action: "Document runbooks, cross-train a second owner per system"
      owner: "Eng manager"
      first_milestone: "Runbook repo + named backup owner"
      sized: "3-4 weeks"
      unblocks: "removes bus-factor-of-1 risk"

    - dimension: data_readiness         # score 2, weight 25%
      action: "Add data-quality metrics + PII classification gate in CI"
      owner: "Data lead"
      first_milestone: "Quality dashboard + PII gate blocking bad builds"
      sized: "4-6 weeks"
      unblocks: "moves data_readiness 2 -> 3 (ship bar)"

    - dimension: change_management      # score 2, weight 10%
      action: "Bring support team in as co-designers, fund training"
      owner: "Product lead"
      first_milestone: "Co-design workshop + funded training plan"
      sized: "2 weeks"
      unblocks: "adoption risk on the target use case"

  rescore_trigger: "re-run readiness_score.py after governance + talent clear"
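The sequencing used in this roadmap (hard-floor breaches first, then weighted points lost) is simple enough to automate. A minimal sketch, assuming whole-percentage weights and the example scores from the scorecard; `roadmap_order` is a hypothetical helper.

```python
WEIGHTS_PCT = {
    "data_readiness": 25,
    "infrastructure": 20,
    "talent_skills": 15,
    "governance_risk": 15,
    "use_case_viability": 15,
    "change_management": 10,
}


def roadmap_order(scores, floor=2, ship_bar=3, max_score=4):
    """Order the gaps (scores below the ship bar) for the roadmap.

    Sort key: floor breaches first, then weighted points lost, biggest
    first. Ties keep the order of WEIGHTS_PCT because sorted() is stable.
    """
    gaps = [d for d in WEIGHTS_PCT if scores[d] < ship_bar]
    return sorted(
        gaps,
        key=lambda d: (scores[d] >= floor,                       # False sorts first
                       -(max_score - scores[d]) * WEIGHTS_PCT[d]),
    )


example = {
    "data_readiness": 2, "infrastructure": 3, "talent_skills": 1,
    "governance_risk": 1, "use_case_viability": 3, "change_management": 2,
}
print(roadmap_order(example))
# ['talent_skills', 'governance_risk', 'data_readiness', 'change_management']
```

Talent and governance tie on weighted points lost (3 points times 15% each), so the order between those two is a judgment call; the roadmap above leads with governance because it unblocks any customer-facing pilot.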

Frameworks and tools to run the assessment with

You don't need to invent the vocabulary from scratch. Borrow framing from the McKinsey AI maturity model and the Gartner AI maturity framework for the board narrative, and use the NIST AI Risk Management Framework and the EU AI Act risk tiers for the governance dimension. For the technical evidence, the tools you score against are the same ones a competent team already runs: Ragas and LangSmith for eval gates, Langfuse for tracing, OpenTelemetry for spans, pgvector or Pinecone for retrieval, LangGraph or CrewAI for the agent layer, and AWS Bedrock or Azure OpenAI for serving. If you'd rather have a neutral party run the assessment and reconcile the scores, that's what our AI consulting company does in the discovery-audit phase.

What you score each dimension against — the evidence tools (2026 stack)

NIST AI RMF + EU AI Act

Governance framing

Map each use case to a risk tier. High-risk tiers raise the HITL and logging bar for the governance dimension.

Ragas / LangSmith

Eval-gate evidence

A named harness with a dated run is the difference between infrastructure score 2 and 3 (2026 ship bar).

Langfuse / OpenTelemetry

Tracing evidence

Production traces on every call. No traces means no observability, which caps infrastructure at 2.

McKinsey / Gartner models

Maturity narrative

Useful for the board readout. Not a substitute for evidence-scored readiness. Pair them, don't confuse them.

FAQ

What is an AI readiness assessment?

An AI readiness assessment is a structured evaluation of whether an organization can build, deploy, and operate AI systems safely and profitably. It scores six dimensions, data readiness, infrastructure, talent and skills, governance and risk, use-case viability, and change management, each from 0 to 4 against documented evidence. The scores are weighted by delivery risk into a single readiness index out of 100, and the output is a prioritized gap list plus a 90-day roadmap. Unlike an AI maturity model, which asks how advanced your practice is, a readiness assessment is a go or no-go decision tied to evidence.

How do you run an AI readiness assessment?

Run it in 14 days. Days 1-2: scope, stakeholders, agree the dimension weights. Days 3-7: gather evidence across the six dimensions, roughly half a day each, evidence not opinions. Days 8-9: two assessors score independently against the rubric and reconcile differences. Days 10-11: a gap workshop converts scores into owned, sized gaps. Days 12-13: write the readiness report with the scorecard, index, and prioritized gap list. Day 14: present the 90-day roadmap and the go or no-go verdict. Each phase is a decision point; you can stop early if a foundational dimension scores zero.

What dimensions should an AI readiness assessment cover?

Six dimensions, because each is a distinct way an AI initiative fails. Data readiness (weight 25%): catalog, quality, lineage, PII classification, enforced access. Infrastructure (20%): model serving, CI/CD with eval gates, tracing, cost tracking. Talent and skills (15%): redundancy and the bus-factor problem, documentation, named ownership. Governance and risk (15%): immutable audit log, kill switch, EU AI Act risk tiering, HITL. Use-case viability (15%): measurable success metric, baseline, ROI test. Change management (10%): whether affected teams co-design the workflow and adopt it.

What is the difference between an AI readiness assessment and an AI maturity model?

A maturity model (McKinsey, Gartner) asks how advanced your AI practice is and slots you into a stage like crawl, walk, or run. It's useful for board narrative but it's a vanity metric. A readiness assessment asks whether you can ship a specific class of use case without it failing in production, and ties every score to evidence. Maturity tells you where you sit relative to peers; readiness tells you whether to greenlight the pilot. Use the maturity model for the board story and the readiness scorecard for the decision.

How long does an AI readiness assessment take?

Two weeks is the right length when scoped to a target use case or two. Longer and the assessment becomes a project that competes with the work it's meant to enable. We run it as a 1-2 week discovery audit: a few days of scoping and stakeholder alignment, a week of evidence gathering and scoring across the six dimensions, then a gap workshop and roadmap readout. Re-run the full assessment every two to three quarters, since data drifts, teams turn over, and the compliance bar moves.

What do you do if your AI readiness score is low?

A low score (index below 50, or any single dimension below 2) means a foundational dimension is missing, and building AI on top of it multiplies risk rather than reducing it. Stop new use-case work and run a focused gap-closure sprint on the lowest-scoring weighted dimension first. Use the gap-to-roadmap mapping: attach one owner and a sized first action to each gap, sequence the highest-weighted gaps first, and re-score the failed dimension in four to eight weeks before resuming. The hard rule is that any dimension below 2 caps the verdict at not ready regardless of how high the weighted index looks.

Can we run an AI readiness assessment ourselves or do we need a consultant?

You can run it yourself, and you should at least once. The scorecard, the rubric, and the scoring code in this post are everything you need. The catch is honesty: teams routinely over-score their own readiness, especially on talent and governance, because they score the policy rather than the enforcement. We've seen a self-score of 79 reconcile to 51 against the same rubric two days later. If you run it internally, use two independent assessors and require an artifact for every score. A neutral party (this is what our discovery audit does) mainly buys you that independence and a reconciled score you can defend to a board.
