AI Use-Case Prioritization Framework

An AI implementation consulting framework that scores AI use cases on value, feasibility, data readiness, and risk, then turns the scores into a sequenced delivery roadmap.


Most AI programs stall in the same place: a list of 30 ideas and no defensible reason to start with any of them. The CFO wants the cost win, the head of support wants the deflection bot, the data team wants the forecasting model, and the board wants "something with agents." Everyone is right, and the program goes nowhere because there is no scoring model to break the tie. The deliverable that breaks it is a use-case prioritization framework. That framework, not a slide deck of trends, is what AI implementation consulting actually produces in the first two weeks of work.

We run this scoring exercise on every discovery audit before a single line of code ships. It scores each candidate use case on four axes: business value, technical feasibility, data readiness, and risk. The axes get weighted, each use case gets plotted on a value-by-feasibility matrix, and the matrix produces a sequenced roadmap: quick wins first, foundational bets staged behind them, and the long-tail "someday" items parked with a clear reason. This post hands you the full method: the 0-to-3 scoring rubric per axis, the weighted-scoring function in Python, a worked example scoring six candidate use cases, the sequencing logic, and how the output feeds the implementation plan.

Frameworks like RICE, ICE, and WSJF already exist for product prioritization, and we borrow their math. What they miss for AI specifically is data readiness and reversibility. An AI use case can score high on value and feasibility and still be a trap if the training data is a swamp or the failure mode is a regulatory incident. We add those axes. If you are still deciding whether to hire externally for this work, the rubric for scoring AI consulting firms is a useful companion. This post is the deliverable those firms should be handing you.

What AI implementation consulting actually produces

Buyers hear "AI implementation consulting" and picture a strategy deck. The useful version produces an artifact your engineers can act on the next morning: a scored, sequenced list of use cases with the reasoning attached. Each entry carries its four axis scores, its weighted total, its quadrant on the prioritization matrix, and a one-line disposition: build now, build after the data work, or park with a reason. That artifact is the spine of the implementation plan.

The cost of skipping this step is documented. The widely cited industry figure across 2024 and 2025 holds that the majority of enterprise AI pilots never reach production, and the most common autopsy is the same every time: the team picked a hard, low-value, data-starved use case because it was the loudest in the room, not because it scored well. A 2025 MIT Media Lab analysis of enterprise generative-AI initiatives found that roughly 95% delivered no measurable return, with use-case selection and integration depth cited as the dominant failure factors rather than model quality. Prioritization is the cheapest intervention with the highest leverage in the whole program.

The four axes of a use-case prioritization framework

Product prioritization frameworks score on a handful of factors that all reduce to value against effort. RICE uses reach, impact, confidence, and effort. ICE uses impact, confidence, and ease. WSJF divides cost of delay by job size. These are fine for shipping features into an existing product. AI use cases carry two extra risks that those models do not capture: the data may not exist in usable form, and the failure mode may be unrecoverable. So we score on four axes.

Business value (weight 35%) is the size of the prize, scored in the buyer's own units: hours saved per week, deflected tickets per month, basis points of margin, revenue at risk. Technical feasibility (weight 25%) is whether current models and your stack can actually do the task at acceptable quality and cost. Data readiness (weight 25%) is whether the data exists, is labeled or labelable, is accessible without a six-month legal review, and is clean enough to train or ground a model. Risk and reversibility (weight 15%) is what happens when the system is wrong, and how fast you can undo it.

THE FOUR-AXIS SCORING MODEL
Input: four axes, each scored 0-3.
Axis | What it measures | Weight
Business Value | Hours saved, tickets deflected, margin bps, revenue at risk | 35%
Technical Feasibility | Can current models + your stack hit quality at acceptable cost | 25%
Data Readiness | Exists, accessible, labelable, clean enough to ground or train | 25%
Risk + Reversibility | Blast radius when wrong, and how fast you can undo it | 15%
Weighted sum: (score / 3) x weight x 100, summed across the four axes.
Output: score out of 100, plus matrix quadrant, plus disposition.
WHY FOUR AXES, NOT TWO
RICE / ICE / WSJF score value and effort. They were built for shipping features into an existing product. AI use cases add two failure modes those frameworks miss: the data may not exist in usable form (data readiness), and the failure may be unrecoverable (risk + reversibility). A high value-and-feasibility score on a data-starved, irreversible use case is the single most common reason a pilot dies after the demo. Score all four or skip the gate.
Figure 1: Each candidate use case is scored 0-3 on four axes. Axes are weighted, summed to a 0-100 total, and the total feeds the value-by-feasibility matrix. Data readiness and risk are the two axes generic product frameworks (RICE, ICE, WSJF) omit.

How to score business value (the 35% axis)

Score business value in the buyer's vocabulary, never in "improves efficiency." If you cannot state the value as a number with a unit, the use case is not ready to score. The rule we apply: convert the claimed benefit into hours per week, dollars per month, percentage points of a tracked metric, or revenue at risk, then size it against a real baseline. A support-ticket deflection bot that saves 400 agent-hours a month is a 3. A "smarter internal search" with no measured baseline and no owner who can name the saving is a 0, regardless of how exciting the demo looks.

Two adjustments keep this axis honest. First, discount for adoption: a use case worth 1,000 hours a month that only 20% of the team will actually use is worth 200. Second, count only the value that survives the eval gate. A model that hits the target accuracy 70% of the time captures roughly 70% of the theoretical value, not 100%. Score the realistic capture, not the slide-deck headline.
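
The two discounts are easy to apply by hand, but a small helper keeps the arithmetic consistent across a workshop. This is a minimal sketch, not part of the scorer: the function name `realistic_value` and its parameters are ours for illustration, and it assumes the adoption and eval-capture discounts are independent and therefore multiply.

```python
def realistic_value(headline: float, adoption: float, eval_capture: float = 1.0) -> float:
    """Discount a headline benefit to the value that realistically gets captured.

    headline      claimed benefit in the buyer's own unit (e.g. hours per month)
    adoption      share of the team expected to actually use it, 0.0 to 1.0
    eval_capture  share of cases where the model meets the eval target, 0.0 to 1.0
    """
    for name, share in (("adoption", adoption), ("eval_capture", eval_capture)):
        if not 0.0 <= share <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {share}")
    # Assumption: the two discounts are independent, so they multiply.
    return headline * adoption * eval_capture


# The example from the text: 1,000 hours a month at 20% adoption.
adoption_only = realistic_value(1000, adoption=0.20)
# The same use case if the model also hits its target only 70% of the time.
with_eval_gate = realistic_value(1000, adoption=0.20, eval_capture=0.70)
print(round(adoption_only), round(with_eval_gate))  # 200 140
```

Score the axis on the discounted figure, sized against the baseline the owner signed off on.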

Score | What it means | Illustrative example
0 | No measurable value stated, no baseline, no owner who can name the saving. | "Smarter internal search" with no metric attached.
1 | Value plausible but unquantified, or quantified but small relative to program cost. | Auto-tagging support tickets, saves ~10 agent-hours/month.
2 | Quantified against a baseline, adoption-discounted, meaningful but not transformational. | FAQ deflection bot, ~150 tickets/month deflected at 60% adoption.
3 | Large quantified value against a hard baseline, surviving adoption + eval-capture discount. | Claims-triage assistant, ~400 adjuster-hours/month saved, owner signs off on baseline.

How to score technical feasibility (the 25% axis)

Technical feasibility asks one question: can a current model plus your existing stack do this task at acceptable quality, latency, and cost, with a known integration path? The honest way to score it is a thin spike, not an opinion. Take 50 real examples, run them through Claude Sonnet 4 or GPT-4o with a realistic prompt, and measure. A retrieval-grounded Q&A task that hits 85% faithfulness on the spike is a 3. A task that needs a fine-tuned multimodal model your team has never trained, on a deadline, is a 1. The custom build vs off-the-shelf decision sits inside this axis: an off-the-shelf API path scores higher on feasibility than a from-scratch model for the same task.

High feasibility (score 2-3)

Off-the-shelf model via API (Claude, GPT-4o, Gemini) hits target quality on a 50-example spike. Retrieval grounding with pgvector or Pinecone covers the knowledge gap, no fine-tuning needed. Latency and per-call cost are inside budget at production volume. Clear integration point into an existing system. Failure modes are bounded and observable with Langfuse or LangSmith traces. The team has shipped a similar pattern before.

Low feasibility (score 0-1)

Requires a fine-tuned or custom-trained model the team has not built. Task needs reliable multi-step agentic reasoning that current models still fail intermittently. Latency budget is sub-second but the workload needs a large reasoning model. Per-call cost at production volume blows the unit economics. No clean integration point; needs a new data pipeline first. Failure modes are silent and hard to detect without heavy eval investment.
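
The thin spike described in this section can be sketched as a small harness that runs real examples through a model and reports the pass rate. Everything named here is hypothetical: `predict` stands in for whatever model call you use (no vendor SDK is assumed), `is_acceptable` stands in for your quality check, and the toy data exists only to make the sketch runnable.

```python
from typing import Callable, Sequence, Tuple


def spike_pass_rate(
    examples: Sequence[Tuple[str, str]],
    predict: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
) -> float:
    """Run each (prompt, reference) pair through the model and return the pass rate."""
    if not examples:
        raise ValueError("a spike needs at least one real example")
    passed = sum(
        1 for prompt, reference in examples if is_acceptable(predict(prompt), reference)
    )
    return passed / len(examples)


# Toy stand-ins so the sketch runs without a model or an API key.
def fake_model(prompt: str) -> str:
    return prompt.upper()


toy_examples = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "x")]
rate = spike_pass_rate(toy_examples, fake_model, lambda output, ref: output == ref)
print(f"{rate:.0%}")  # 75%
```

In a real spike the list holds about 50 examples pulled from production data, and the pass rate is one input to the score alongside latency, per-call cost, and the integration path.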

How to score data readiness (the 25% axis)

Data readiness is where most AI roadmaps quietly die, and it is the axis generic frameworks skip entirely. A use case can score a 3 on value and a 3 on feasibility and still be unbuildable this quarter because the data lives in a vendor system nobody can export from, or it exists but is unlabeled, or it is labeled but the labels are wrong. Score data readiness on four sub-questions: does it exist, is it accessible without a multi-month legal or integration project, is it labeled or cheaply labelable, and is it clean enough to ground or train against.

We encode the rubric as a YAML scorecard so the discovery-audit team scores it the same way every time and the result is auditable. The config below is the actual shape we hand to a client team to self-assess between sessions. Anything that scores 0 on accessibility is parked until the data work lands, no matter how high it scores elsewhere.

data-readiness-rubric.yaml
YAML
# data-readiness-rubric.yaml
# Score each use case 0-3 on data readiness using these four sub-checks.
# The axis score is the FLOOR of the four (a single 0 caps the axis at 0).
# Hand this to the client team to self-assess between discovery sessions.

data_readiness:
  exists:
    0: "Data does not exist; would need to be generated or collected from scratch."
    1: "Partial data exists; gaps require new capture or third-party purchase."
    2: "Most data exists across known systems; some consolidation needed."
    3: "Complete data exists in a known, queryable location."
  accessible:
    0: "Locked in a vendor system with no export; or blocked on a multi-month legal review."
    1: "Accessible but requires a new integration project (>4 weeks)."
    2: "Accessible via existing connectors with light engineering."
    3: "Already in the warehouse / object store; query today."
  labeled:
    0: "Unlabeled and no cheap path to labels."
    1: "Labelable but needs significant human annotation budget."
    2: "Partially labeled, or weak-labels available via heuristics / LLM-assisted labeling."
    3: "Fully labeled, or task is retrieval-grounded so no labels needed."
  clean:
    0: "Known-bad: duplicates, drift, wrong values, no schema."
    1: "Messy; needs a real cleaning pipeline before use."
    2: "Mostly clean; spot-fixes and validation rules suffice."
    3: "Clean, schema-validated, monitored."

scoring_rule:
  method: floor   # axis score = min(exists, accessible, labeled, clean)
  rationale: "A single unfixable sub-check sinks the use case this quarter."
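
The floor rule in the scorecard can be sketched in a few lines of Python. The sub-check names come from the rubric; the function name and the sample sub-scores are illustrative assumptions, not figures from an engagement.

```python
SUB_CHECKS = ("exists", "accessible", "labeled", "clean")


def data_readiness_score(sub_scores: dict) -> int:
    """Return the axis score: the minimum of the four sub-checks (the floor rule)."""
    missing = [name for name in SUB_CHECKS if name not in sub_scores]
    if missing:
        raise ValueError(f"missing sub-checks: {missing}")
    for name in SUB_CHECKS:
        if sub_scores[name] not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be 0-3, got {sub_scores[name]!r}")
    return min(sub_scores[name] for name in SUB_CHECKS)


# Illustrative sub-scores (assumed): a knowledge base already in the warehouse,
# and a fraud dataset with no cheap path to labels.
faq_bot = {"exists": 3, "accessible": 3, "labeled": 3, "clean": 3}
fraud_model = {"exists": 2, "accessible": 2, "labeled": 0, "clean": 1}

print(data_readiness_score(faq_bot))      # 3
print(data_readiness_score(fraud_model))  # 0
```

Taking the minimum rather than the average is the point: three strong sub-checks cannot buy back one that blocks the build.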

How to score risk and reversibility (the 15% axis)

Risk is scored inverted: a high score means low risk and high reversibility. The two questions are blast radius and recovery time. What is the worst plausible outcome when the system is wrong, and how fast can a human catch and undo it? A draft-email assistant where every output passes a human before sending is high-reversibility, low-risk, score 3. An autonomous pricing agent that pushes changes to a live storefront with no review gate is low-reversibility, score 0, until a human-in-the-loop gate and a kill switch are wired in. The responsible-AI controls that govern high-risk deployments belong on this axis, not as an afterthought.

The weighted scoring function

With four axis scores in hand, the math is deliberately simple so it survives a skeptical CFO. Each axis is scored 0-3, normalized to a fraction of its max, multiplied by its weight, and summed to a 0-100 total. The same function applies the risk veto: any use case with a raw risk score of 0 is flagged regardless of its weighted total. Here it is in Python you can run on your own shortlist, and as the JSON scorecard shape we store per use case.

python
"""prioritize.py

Score AI use cases on the 4-axis weighted rubric and print a ranked roadmap.
    python prioritize.py --input usecases.json
"""
import argparse
import json
from dataclasses import dataclass
from typing import Dict, List

WEIGHTS = {
    "business_value":      0.35,
    "technical_feasibility": 0.25,
    "data_readiness":      0.25,
    "risk_reversibility":  0.15,
}
MAX_PER_AXIS = 3

@dataclass
class UseCase:
    name: str
    scores: Dict[str, int]   # each axis 0-3

    @property
    def weighted_total(self) -> float:
        total = 0.0
        for axis, weight in WEIGHTS.items():
            raw = self.scores.get(axis, 0)
            total += (raw / MAX_PER_AXIS) * weight * 100
        return round(total, 1)

    @property
    def risk_veto(self) -> bool:
        # A raw risk score of 0 = irreversible failure mode, no review gate.
        # Park it for remediation regardless of weighted total.
        return self.scores.get("risk_reversibility", 0) == 0

    def quadrant(self) -> str:
        v = self.scores.get("business_value", 0)
        f = self.scores.get("technical_feasibility", 0)
        if v >= 2 and f >= 2:
            return "QUICK WIN"
        if v >= 2 and f < 2:
            return "FOUNDATIONAL BET"
        if v < 2 and f >= 2:
            return "FILL-IN"
        return "PARK"

def load(path: str) -> List[UseCase]:
    with open(path) as fh:
        raw = json.load(fh)
    return [UseCase(name=u["name"], scores=u["scores"]) for u in raw["use_cases"]]

def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", default="usecases.json")
    args = ap.parse_args()

    cases = load(args.input)
    ranked = sorted(cases, key=lambda c: c.weighted_total, reverse=True)

    print(f"{'Use case':<34}{'Total':<8}{'Quadrant':<18}{'Flag'}")
    print("-" * 70)
    for c in ranked:
        flag = "RISK-VETO: remediate first" if c.risk_veto else ""
        print(f"{c.name:<34}{c.weighted_total:<8}{c.quadrant():<18}{flag}")

if __name__ == "__main__":
    main()

json
{
  "use_cases": [
    {
      "name": "Support FAQ deflection bot",
      "scores": {
        "business_value": 2,
        "technical_feasibility": 3,
        "data_readiness": 3,
        "risk_reversibility": 3
      },
      "notes": "Retrieval-grounded, human-reviewable, ~150 tickets/mo deflected."
    },
    {
      "name": "Claims-triage assistant",
      "scores": {
        "business_value": 3,
        "technical_feasibility": 2,
        "data_readiness": 2,
        "risk_reversibility": 2
      },
      "notes": "~400 adjuster-hours/mo; HITL gate on every decision."
    },
    {
      "name": "Autonomous dynamic pricing",
      "scores": {
        "business_value": 3,
        "technical_feasibility": 2,
        "data_readiness": 2,
        "risk_reversibility": 0
      },
      "notes": "Pushes live store prices, no review gate yet -> RISK VETO."
    }
  ]
}

Worked example: scoring six candidate use cases

Here is the framework applied to six candidate use cases for an illustrative mid-market insurer. These are composite examples for teaching, not a real engagement; the scores show how the math separates a loud favorite from the actual right first move. Each axis is scored 0-3, weighted, and totaled. The risk veto flags the one that cannot ship as scoped.

Candidate use cases (illustrative)
Use case | Value (35%) | Feasibility (25%) | Data (25%) | Risk (15%) | Total /100
Support FAQ deflection bot | 2: ~150 tickets/mo deflected, adoption-discounted | 3: retrieval-grounded, off-shelf API, 50-ex spike at 86% faithfulness | 3: KB already in warehouse, no labels needed | 3: human-reviewable, fully reversible | 88.3: QUICK WIN
Claims-triage assistant (HITL) | 3: ~400 adjuster-hours/mo, baseline signed off | 2: needs custom extraction + grounding, spike viable | 2: exists across two systems, light consolidation | 2: HITL gate on every decision, recoverable | 78.3: QUICK WIN
Underwriting risk-model copilot | 3: margin impact, high prize | 1: needs fine-tuned model team has not built | 1: historic data exists but labels are inconsistent | 1: regulated decision, heavy oversight needed | 56.7: FOUNDATIONAL BET
Autonomous dynamic pricing agent | 3: large revenue prize | 2: agentic loop feasible with current models | 2: pricing + demand data accessible | 0: pushes live prices, no review gate (VETO) | 68.3 on paper: RISK VETO
Agent-call quality scorer | 1: modest QA-time saving | 3: straightforward classification | 2: call transcripts available | 3: internal-only, low blast radius | 68.3: FILL-IN
Fraud-pattern discovery model | 2: meaningful but uncertain saving | 1: needs anomaly model + tuning | 0: labeled fraud examples scarce, capped at 0 | 2: flags for human review | 41.7: PARK (data work first)
Worked scorecard for an illustrative insurer. Weights: value 35%, feasibility 25%, data 25%, risk 15%. Totals are out of 100. The autonomous pricing agent plots as a quick win on value and feasibility but takes a risk veto.

Read the result. The autonomous pricing agent plots in the quick-win quadrant on value and feasibility and totals 68.3 even with a zero on risk, but it takes the risk veto and drops out of the immediate queue until a review gate and kill switch are designed. The fraud model, despite real value, is capped by a data-readiness floor of 0 (no labeled fraud examples), so it goes to a data-work track instead of the build queue. The two quick wins that actually start are the FAQ bot and the claims-triage assistant. That is the entire value of scoring before building: the loudest idea in the room, dynamic pricing, is not the first thing you ship. Notice also that the total alone does not decide the order. The pricing agent and the call-quality scorer tie at 68.3, yet one is held in a governance track and the other is a fill-in for slack capacity. When candidates land within a few points of each other, the weighted total stops being the decider and the matrix quadrant plus the dependency graph take over. A quick win with no blockers beats a slightly higher-scoring foundational bet that needs a quarter of data work, because shipped value compounds and parked value does not. The ranking opens the conversation; it does not close it.
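
To see where a total comes from, it helps to break it into per-axis points. The sketch below applies the same formula as the scorer, (score / 3) x weight x 100, to two of the illustrative candidates; the helper name `contributions` is ours, and the scores are the ones from the worked example.

```python
WEIGHTS = {
    "business_value": 0.35,
    "technical_feasibility": 0.25,
    "data_readiness": 0.25,
    "risk_reversibility": 0.15,
}


def contributions(scores: dict) -> dict:
    """Points each axis adds to the /100 total: (score / 3) x weight x 100."""
    return {axis: (scores[axis] / 3) * weight * 100 for axis, weight in WEIGHTS.items()}


def total(scores: dict) -> float:
    return round(sum(contributions(scores).values()), 1)


pricing_agent = {"business_value": 3, "technical_feasibility": 2,
                 "data_readiness": 2, "risk_reversibility": 0}
call_scorer = {"business_value": 1, "technical_feasibility": 3,
               "data_readiness": 2, "risk_reversibility": 3}

for name, scores in (("Pricing agent", pricing_agent), ("Call-quality scorer", call_scorer)):
    parts = {axis: round(points, 1) for axis, points in contributions(scores).items()}
    print(name, parts, "total:", total(scores))
# Pricing agent: 35.0 + 16.7 + 16.7 + 0.0 -> 68.3
# Call-quality scorer: 11.7 + 25.0 + 16.7 + 15.0 -> 68.3
```

Two very different profiles can land on the same total, which is why the quadrant and the veto are read alongside the number.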

The prioritization matrix: plotting value against feasibility

The weighted total ranks the list, but a 2x2 matrix of business value against technical feasibility makes the sequencing visible to a room. Plot each use case by its value score on the vertical axis and its feasibility score on the horizontal. Four quadrants fall out. High value and high feasibility are quick wins, the things you ship in the first pilot. High value and low feasibility are foundational bets, worth doing but staged behind a data or capability investment. Low value and high feasibility are fill-ins for slack capacity. Low value and low feasibility get parked.

VALUE-BY-FEASIBILITY PRIORITIZATION MATRIX
Axes: business value (vertical, low to high) against technical feasibility (horizontal, low to high).
Quadrants: FOUNDATIONAL BET (high value, low feasibility), QUICK WIN (high value, high feasibility), PARK (low value, low feasibility), FILL-IN (low value, high feasibility).
Plotted use cases: FAQ deflection, Claims triage (HITL), Underwriting copilot, Pricing agent (risk veto), Call-quality scorer, Fraud discovery (data 0).
Legend: a solid dot enters sequencing on weighted score; a hollow dot is held by risk veto until review gate + kill switch are scoped.
Figure 2: The six illustrative use cases plotted by business value (vertical) against technical feasibility (horizontal). Quick wins ship first. Foundational bets stage behind capability or data work. The dynamic-pricing dot is hollow: it plots in the quick-win quadrant on raw score but is held by the risk veto.

Sequencing logic: quick wins versus foundational bets

Ranking is not sequencing. The matrix tells you which quadrant each use case sits in, but the order you actually ship in follows three rules. Ship a quick win first to bank a credible result and earn the political capital for the harder work. Stage foundational bets behind the specific capability or data investment they depend on, so the dependency work has a paying customer. Run the data-readiness track in parallel from day one, because data work is the long pole and waiting until you need it guarantees a stall.

For the illustrative insurer, the sequence reads: the FAQ deflection bot ships in pilot one because it is the cleanest quick win and proves the eval-gate discipline. The claims-triage assistant follows as pilot two, slightly higher value but needing a little more integration. The underwriting copilot and the fraud model both feed a shared data-and-capability track that runs alongside the pilots, so by the time the quick wins are in production the foundational bets are unblocked. The pricing agent sits in a governance track until its review gate and kill switch are designed and its risk score clears the veto.
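
The routing above can be sketched as a small function that assigns each use case to a track. The order of the checks (risk veto first, then the data floor, then the quadrant) is our assumption about precedence, and the track labels and the name `assign_track` are illustrative.

```python
from typing import Dict


def assign_track(scores: Dict[str, int]) -> str:
    """Route a scored use case to a delivery track."""
    if scores["risk_reversibility"] == 0:
        return "governance track"       # held until review gate + kill switch exist
    if scores["data_readiness"] == 0:
        return "data-work track"        # parked until the data work lands
    high_value = scores["business_value"] >= 2
    high_feasibility = scores["technical_feasibility"] >= 2
    if high_value and high_feasibility:
        return "pilot"                  # quick win
    if high_value:
        return "staged behind data + capability track"  # foundational bet
    if high_feasibility:
        return "fill-in"
    return "park"


def scores(value: int, feasibility: int, data: int, risk: int) -> Dict[str, int]:
    return {"business_value": value, "technical_feasibility": feasibility,
            "data_readiness": data, "risk_reversibility": risk}


insurer = {
    "FAQ deflection bot": scores(2, 3, 3, 3),
    "Claims-triage assistant": scores(3, 2, 2, 2),
    "Underwriting copilot": scores(3, 1, 1, 1),
    "Autonomous pricing agent": scores(3, 2, 2, 0),
    "Call-quality scorer": scores(1, 3, 2, 3),
    "Fraud discovery model": scores(2, 1, 0, 2),
}
tracks = {name: assign_track(s) for name, s in insurer.items()}
for name, track in tracks.items():
    print(f"{name:<28}{track}")
```

The function only assigns tracks. The order of pilots inside a track is still a judgment call about blockers and dependencies.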

Illustrative sequencing: weeks to first production value per track
Track | Weeks | Notes
Quick win (FAQ deflection bot) | 4 | Pilot one. Retrieval-grounded, off-shelf API, weekly eval gate from sprint 1.
Quick win (claims-triage HITL) | 7 | Pilot two. Slightly more integration; HITL gate adds a sprint.
Foundational bet (underwriting copilot) | 14 | Staged behind the shared data + capability track.
Parked (fraud model) | 20 | Data-work track first; re-scored once labeled fraud examples exist.

From prioritization to the implementation plan

The scored, sequenced roadmap is the input to the implementation plan, not a replacement for it. Each quick win becomes a pilot with a named stack, a weekly eval gate, and a clear definition of done. The feasibility score sets the technical approach: a 3 means an off-the-shelf API path with retrieval grounding, a 1 means a capability-building spike comes first. The data-readiness score sets the data-track scope. The risk score sets the governance work: anything that scored low on reversibility gets its human-in-the-loop gate, audit log, and kill switch designed before it ships.

We tie each pilot to a named-tool stack at this point. A retrieval-grounded quick win like the FAQ bot runs Claude Sonnet 4 for generation, Claude Haiku 4 for intent classification on the cheaper path, pgvector for retrieval, Ragas for the eval gate, and Langfuse for production traces. A foundational bet that needs multi-step orchestration adds LangGraph for the agent control flow, and a heavier-retrieval use case might swap pgvector for Weaviate or Pinecone if the corpus outgrows Postgres. The eval gate is wired before the first sprint ends, so the weekly cadence reports the use case's value capture against the baseline you scored. That closes the loop: the value score you assigned in prioritization becomes a tracked metric in delivery, and a use case that fails to capture its scored value gets re-prioritized rather than quietly continued.
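
One way to make the re-prioritization trigger explicit is a check that compares recent measured value against the value you scored. This is a sketch under stated assumptions: the 70% capture threshold, the three-period window, the function name, and the sample numbers are all hypothetical and should be set per engagement.

```python
from typing import Sequence


def needs_rescore(
    measured: Sequence[float],
    scored_value: float,
    min_capture: float = 0.70,  # hypothetical threshold
    window: int = 3,            # hypothetical look-back, in reporting periods
) -> bool:
    """True when the recent average falls below min_capture of the scored value."""
    if scored_value <= 0:
        raise ValueError("scored_value must be positive")
    recent = list(measured)[-window:]
    if not recent:
        raise ValueError("need at least one measurement")
    capture = (sum(recent) / len(recent)) / scored_value
    return capture < min_capture


# Illustrative: a bot scored at 150 deflected tickets per period.
on_track = needs_rescore([140, 150, 160], scored_value=150)
falling_short = needs_rescore([90, 95, 100], scored_value=150)
print(on_track, falling_short)  # False True
```

A `True` sends the use case back to the scoring table with real eval data in place of the estimate.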

Scoring to roadmap to implementation
Inventory use cases -> Score 4 axes -> Weighted total + matrix -> Sequence -> Pilot plan per use case -> Re-score on eval data

Five ways the framework gets gamed (and how to stop it)

A scoring framework is only as honest as the people filling it in, and an executive who wants their pet project to win will find a way to inflate its score. We watch for five specific failure modes in every workshop.

Failure mode | The tell | The fix
Inflated value | Value claimed with no baseline, no owner, no unit. "Huge productivity gains." | Cap value at 1 with no measurable baseline. Require an owner to sign off on the number.
Optimistic feasibility | "The model can do this" with no spike run on real examples. | Require a 50-example spike before any feasibility score above 1. Measure, don't assert.
Hand-waved data readiness | "We have tons of data" without checking access, labels, or quality. | Score the four sub-checks. Take the floor. A single 0 caps the axis.
Ignored reversibility | Autonomous action scored as if a human were in the loop when none is. | Risk score of 0 is a hard veto, not a weighted input. No exceptions for high totals.
Weight tuning to win | Someone proposes re-weighting the axes after seeing the scores, to lift their project. | Lock weights before scoring any use case. Change weights only with a documented rationale, then re-score every candidate.
The five most common ways a prioritization score gets gamed, the tell, and the fix we apply in the room.
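
Several of these fixes can be enforced mechanically before a score is accepted. The sketch below is one possible shape, not our production tooling: the entry fields `baseline`, `owner`, and `spike_examples` are hypothetical, and the weights are frozen with the standard library's `MappingProxyType` so they cannot be edited after the session starts.

```python
from types import MappingProxyType

# Read-only view: assigning to a key raises TypeError.
LOCKED_WEIGHTS = MappingProxyType({
    "business_value": 0.35,
    "technical_feasibility": 0.25,
    "data_readiness": 0.25,
    "risk_reversibility": 0.15,
})


def audit_entry(entry: dict) -> list:
    """Return the list of gaming checks this scorecard entry fails."""
    problems = []
    s = entry["scores"]
    if s["business_value"] > 1 and not (entry.get("baseline") and entry.get("owner")):
        problems.append("value above 1 without a measured baseline and a named owner")
    if s["technical_feasibility"] > 1 and entry.get("spike_examples", 0) < 50:
        problems.append("feasibility above 1 without a 50-example spike")
    if s["risk_reversibility"] == 0:
        problems.append("risk veto: hold regardless of weighted total")
    return problems


pet_project = {
    "name": "Huge productivity gains",
    "scores": {"business_value": 3, "technical_feasibility": 3,
               "data_readiness": 2, "risk_reversibility": 0},
}
for problem in audit_entry(pet_project):
    print("-", problem)
```

The data-readiness floor is left out here because it is applied when the axis itself is scored.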

Tooling: run the framework on your own backlog

You do not need a consultant to run this once. The framework is a spreadsheet and a 90-minute workshop. Put every candidate use case in a row, score the four axes 0-3 in a cross-functional room (a product owner, an engineer who will run the spikes, a data lead, and someone who owns the risk question), and let the weighted formula rank them. The CSV below drops straight into Google Sheets or Excel with the weighted formula pre-built; the Python scorer above runs the same math on a JSON export if you keep your backlog in a tool.

usecase-scorecard.csv
CSV
# Import into Google Sheets / Excel starting at the header row. Leave out these
# notes so the header lands in row 1 and the formulas' row references line up.
# One row per use case. Axes scored 0-3. Weights: value=0.35, feasibility=0.25, data=0.25, risk=0.15.
# The Total column normalizes each axis (/3), applies its weight, sums to /100.
# RiskVeto flags any row where Risk = 0 (irreversible, no review gate).
# The RiskVeto formula is quoted because it contains commas.

UseCase,Value,Feasibility,Data,Risk,Total/100,RiskVeto
FAQ deflection bot,2,3,3,3,=((B2/3)*35)+((C2/3)*25)+((D2/3)*25)+((E2/3)*15),"=IF(E2=0,""VETO"","""")"
Claims-triage (HITL),3,2,2,2,=((B3/3)*35)+((C3/3)*25)+((D3/3)*25)+((E3/3)*15),"=IF(E3=0,""VETO"","""")"
Underwriting copilot,3,1,1,1,=((B4/3)*35)+((C4/3)*25)+((D4/3)*25)+((E4/3)*15),"=IF(E4=0,""VETO"","""")"
Autonomous pricing agent,3,2,2,0,=((B5/3)*35)+((C5/3)*25)+((D5/3)*25)+((E5/3)*15),"=IF(E5=0,""VETO"","""")"
Call-quality scorer,1,3,2,3,=((B6/3)*35)+((C6/3)*25)+((D6/3)*25)+((E6/3)*15),"=IF(E6=0,""VETO"","""")"
Fraud discovery model,2,1,0,2,=((B7/3)*35)+((C7/3)*25)+((D7/3)*25)+((E7/3)*15),"=IF(E7=0,""VETO"","""")"

FAQ

What is an AI use-case prioritization framework?

It is a scoring model that ranks candidate AI use cases on four weighted axes: business value (35%), technical feasibility (25%), data readiness (25%), and risk and reversibility (15%). Each axis is scored 0-3, normalized, weighted, and summed to a 0-100 total. Each use case is then plotted by its value and feasibility scores on a matrix that sorts use cases into quick wins, foundational bets, fill-ins, and parked items. The output is a sequenced roadmap that feeds the implementation plan. It is the core deliverable of AI implementation consulting.

How is this different from RICE or ICE prioritization?

RICE (reach, impact, confidence, effort), ICE (impact, confidence, ease), and WSJF (cost of delay over job size) were built for shipping features into an existing product. They score value and effort. AI use cases add two failure modes those frameworks miss: the data may not exist in usable form, and the failure mode may be unrecoverable. This framework keeps the weighted-scoring math from RICE and ICE but adds a data-readiness axis and a risk-and-reversibility axis with a hard veto, because those two are where most AI pilots die.

What weights should each axis get?

Our default is value 35%, feasibility 25%, data readiness 25%, risk 15%. The weights are a starting point, not gospel; a heavily regulated buyer might push risk to 25% and trim feasibility. The one rule that matters: lock the weights before you score any use case, and if you change them, document why and re-score every candidate. Re-weighting after seeing the scores to lift a favorite project is the most common way the framework gets gamed.
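
Re-scoring every candidate after a weight change is cheap to automate. The sketch below compares the default weights with a hypothetical regulated-buyer profile; the exact split (risk at 25%, feasibility trimmed to 15%) is an assumption for illustration, and the candidates are the six from the worked example.

```python
import math

DEFAULT = {"business_value": 0.35, "technical_feasibility": 0.25,
           "data_readiness": 0.25, "risk_reversibility": 0.15}
# Hypothetical profile for a heavily regulated buyer.
REGULATED = {"business_value": 0.35, "technical_feasibility": 0.15,
             "data_readiness": 0.25, "risk_reversibility": 0.25}
AXES = tuple(DEFAULT)

CANDIDATES = {  # (value, feasibility, data, risk), each scored 0-3
    "FAQ deflection bot": (2, 3, 3, 3),
    "Claims-triage assistant": (3, 2, 2, 2),
    "Underwriting copilot": (3, 1, 1, 1),
    "Autonomous pricing agent": (3, 2, 2, 0),
    "Call-quality scorer": (1, 3, 2, 3),
    "Fraud discovery model": (2, 1, 0, 2),
}


def rescore_all(candidates: dict, weights: dict) -> dict:
    """Re-score every candidate under one weight set; weights must sum to 1."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    return {
        name: round(sum((s / 3) * weights[a] * 100 for a, s in zip(AXES, raw)), 1)
        for name, raw in candidates.items()
    }


before = rescore_all(CANDIDATES, DEFAULT)
after = rescore_all(CANDIDATES, REGULATED)
for name in CANDIDATES:
    print(f"{name:<28}{before[name]:>6}{after[name]:>8}")
```

Under this profile the pricing agent's total falls from 68.3 to 61.7 while the call-quality scorer stays at 68.3, which is the kind of shift that should be visible to the whole room, not negotiated for one project.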

Why does data readiness get its own axis?

Because a use case can score high on value and feasibility and still be unbuildable this quarter if the data is locked in a vendor system, unlabeled, or known-bad. We score data readiness on four sub-checks (exists, accessible, labeled, clean) and take the floor, so a single unfixable sub-check caps the axis at 0. In the worked example, a fraud-discovery model with real value scores 0 on data because labeled fraud examples are scarce, so it goes to a data-work track rather than the build queue.

What does the risk veto do?

A raw risk score of 0 means an irreversible failure mode with no human review gate. We treat that as a hard veto, not just a low weighted input. A use case can plot as a quick win on value and feasibility and still be pulled from the immediate queue until its human-in-the-loop gate, audit log, and kill switch are designed. The EU AI Act phases in its obligations in stages from 2025 onward, with most high-risk requirements originally scheduled to apply from August 2026, so a use case that lands in a high-risk category needs its governance scoped before it is sequenced.

How long does a prioritization exercise take?

The scoring workshop is about 90 minutes with the right people in the room: a product owner, an engineer who will run the feasibility spikes, a data lead, and someone who owns the risk question. The feasibility spikes (50 real examples through Claude Sonnet 4 or GPT-4o) add a few days. In a consulting engagement this sits inside a 1-2 week discovery audit alongside the data inventory. The output is the scored, sequenced roadmap, produced before any production code ships.

Can we run this ourselves without a consultant?

Yes. The framework is a spreadsheet and a workshop. The CSV in this post drops into Google Sheets with the weighted formula pre-built, and the Python scorer runs the same math on a JSON export. The value a consultant adds is neutrality (no internal politics inflating a pet project's value), the discipline of running real feasibility spikes instead of asserting, and an honest read on the risk axis. If you run it yourself, the biggest risk is grading your own homework: appoint someone whose job is to challenge every score.
