decision tool · 8 questions

RAG or fine-tune?
Answer 8 questions. Get a defended recommendation.

The honest answer is usually 'it depends'. This tool replaces that with a procedure. The logic is wired to published guidance from Lewis et al. 2020 (RAG), Hu et al. 2021 (LoRA), Khattab et al. 2023 (DSPy), OpenAI, and Anthropic. Each recommendation cites which of your answers drove it.

interactive · client-side only

Score your use case. The procedure picks the path.

Nothing leaves your browser. The form runs in JavaScript on this page. We store no answers, run no analytics on the inputs, and make no server round trip.

head-to-head

RAG vs Fine-tuning vs Hybrid.
Same dimensions we score in our audit phase.

No 'best overall' row. Best is a function of your freshness, latency, corpus, and citation requirements. This table lays out the trade-offs on each axis.

| Dimension | RAG | Fine-tuning | Hybrid |
| --- | --- | --- | --- |
| Knowledge freshness | Reindex on demand. Same-day updates trivial. | Stale at training time. New facts require a new run. | Fresh facts via retrieval, learned style baked in. |
| Cost per query | Higher. Retrieval + longer prompts. | Lower. Smaller prompts after training. | Between the two; depends on retrieval depth. |
| Cost per training run | Zero. No training step. | Real. From a few dollars (LoRA on 7B) to four figures on large models. | Real. Same as fine-tuning plus retrieval setup. |
| Citation support | Native. Each chunk has a source. | Not honest. Cannot trace weights to a doc. | Citations come from the retrieval half. |
| Latency | +80-300ms for retrieval depending on index. | Lower. No retrieval step. Smaller models possible. | Same as RAG, plus the inference path. |
| Best-fit corpus size | Tens to millions of docs. | Knowledge-agnostic. Tunes behaviour, not facts. | Any corpus size. |
| Domain adaptation | Limited. Model still reasons in its base style. | Strong. Adapts tone, schema, persona, refusal policy. | Best of both. Most-used pattern in production. |
| Time to ship | Days for a baseline; weeks to harden eval. | Weeks. Includes data collection + label QA. | Weeks. Both pipelines must land. |
| When NOT to use it | Fixed style requirements, sub-200ms budgets. | Daily-changing facts, citation requirements. | Tiny corpora or single-purpose prompts. |

Dated 2026-05-22. Costs move quickly; re-check the numbers on a fresh eval each quarter.
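
One way to read the 'When NOT to use it' row is as a set of veto rules. The sketch below is illustrative only and is not the logic the form on this page runs. The field names (`facts_change_daily`, `needs_citations`, `fixed_style`, `latency_budget_ms`, `corpus_docs`) and the `TINY_CORPUS_DOCS` cutoff are assumptions made for the example; the 'single-purpose prompts' condition is left out because the table does not define it.

```python
from dataclasses import dataclass

TINY_CORPUS_DOCS = 10  # hypothetical cutoff; the table only says "tiny corpora"


@dataclass
class UseCase:
    # All field names are illustrative assumptions, not the tool's real inputs.
    facts_change_daily: bool
    needs_citations: bool
    fixed_style: bool
    latency_budget_ms: int
    corpus_docs: int


def vetoes(uc):
    """Map each approach to the answers that argue against it."""
    against = {"RAG": [], "Fine-tuning": [], "Hybrid": []}
    if uc.fixed_style:
        against["RAG"].append("fixed style requirements")
    if uc.latency_budget_ms < 200:
        against["RAG"].append("sub-200ms latency budget")
    if uc.facts_change_daily:
        against["Fine-tuning"].append("daily-changing facts")
    if uc.needs_citations:
        against["Fine-tuning"].append("citation requirements")
    if uc.corpus_docs < TINY_CORPUS_DOCS:
        against["Hybrid"].append("tiny corpus")
    return against


def recommend(uc):
    """Return (approaches with no veto, the full veto map for the write-up)."""
    against = vetoes(uc)
    viable = [name for name, reasons in against.items() if not reasons]
    return viable, against


if __name__ == "__main__":
    uc = UseCase(facts_change_daily=True, needs_citations=True,
                 fixed_style=False, latency_budget_ms=1000, corpus_docs=5000)
    viable, against = recommend(uc)
    print("viable:", viable)
    for name, reasons in against.items():
        if reasons:
            print(f"{name} ruled out by: {', '.join(reasons)}")
```

Returning the veto map alongside the shortlist keeps the recommendation traceable to the answers that drove it. A veto list is a starting hypothesis, not a substitute for the eval.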

anti-patterns we see weekly
design decision · 01

Don't fine-tune for facts

do instead
Use RAG for factual recall
because
Fine-tuning shifts the output distribution; it doesn't reliably encode new facts. Lewis et al. 2020 motivated RAG by the limits of parametric memory: knowledge stored in weights is hard to update and offers no provenance.
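
A minimal sketch of why retrieval suits factual recall: every chunk carries its source, so the prompt can ask for citations that trace back to a document. The `Chunk` type, the file names, and the sample texts are made up for illustration, and the word-overlap scorer is a stand-in for a real embedding index, not a recommended retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str  # where the text came from; this is what makes citation possible
    text: str


def tokens(s):
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in s.split()}


def retrieve(query, chunks, k=2):
    """Rank chunks by naive word overlap with the query (stand-in for an index)."""
    q = tokens(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c.text)), reverse=True)
    return ranked[:k]


def build_prompt(query, hits):
    """Number each retrieved chunk and keep its source next to the text."""
    context = "\n".join(
        f"[{i + 1}] ({c.source}) {c.text}" for i, c in enumerate(hits)
    )
    return (
        "Answer using only the numbered context and cite it.\n"
        f"{context}\nQuestion: {query}"
    )


if __name__ == "__main__":
    corpus = [
        Chunk("policy.md", "Refunds are issued within 14 days."),
        Chunk("faq.md", "Support is open on weekdays."),
        Chunk("pricing.md", "The Pro plan costs 40 dollars."),
    ]
    question = "How many days do refunds take?"
    print(build_prompt(question, retrieve(question, corpus, k=1)))
```

Updating a fact here means replacing a chunk and reindexing; no weights change.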
design decision · 02

Don't RAG for style

do instead
Fine-tune or use a LoRA adapter for tone, format, and persona
because
Retrieval gives the model context, not personality. A LoRA adapter (Hu et al. 2021) is the cheap, modular path to consistent voice and schema.
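
The 'cheap' part of LoRA is arithmetic. For a frozen weight matrix of shape d x k, LoRA trains two low-rank factors, B (d x r) and A (r x k), so the trainable count is r(d + k) instead of d * k. The sketch below works that out for one layer; the dimensions (4096 x 4096, rank 8) are illustrative assumptions, not figures from this page.

```python
def full_finetune_params(d, k):
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


if __name__ == "__main__":
    d = k = 4096  # illustrative layer size
    r = 8         # illustrative adapter rank
    full = full_finetune_params(d, k)
    lora = lora_params(d, k, r)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

At these sizes the adapter trains 1/256 of the parameters of the full matrix, about 0.39%, which is why an adapter per tone or schema is practical.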
design decision · 03

Don't skip prompt engineering

do instead
Drive a real eval against the base model first
because
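Anthropic and OpenAI both publish the same advice: exhaust prompting before reaching for heavier tools. A tight system prompt with examples often closes a large share of the gap, commonly in the 60-80% range; treat that figure as something to verify on your own eval. Reach for fine-tuning or RAG only after that ceiling is measured.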
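
'Closing the gap' can be made concrete with one ratio: how far prompting moved the score from the baseline toward the target. A minimal sketch, assuming all three scores come from the same eval on the same test set; the function name and the numbers in the example are hypothetical.

```python
def gap_closed(baseline, prompted, target):
    """Fraction of the baseline-to-target gap closed by prompt engineering."""
    if target <= baseline:
        raise ValueError("target must exceed the baseline score")
    return (prompted - baseline) / (target - baseline)


if __name__ == "__main__":
    # Hypothetical eval scores on a 0-1 scale.
    share = gap_closed(baseline=0.50, prompted=0.74, target=0.90)
    print(f"prompting closed {share:.0%} of the gap")
```

Whatever remains after this step is the gap that fine-tuning or RAG has to justify its cost against.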
design decision · 04

Don't pick before you have evals

do instead
Lock the rubric, then run the bake-off
because
Without a held-out test set, every approach looks good on cherry-picked demos. Ragas (arXiv:2309.15217) and DSPy (arXiv:2310.03714) give you the wiring; use them.
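
A minimal sketch of the lock-then-compare idea in plain Python; it does not use the Ragas or DSPy APIs. The split is seeded so the held-out set stays fixed between runs, and every candidate is scored on the same examples with the same metric. The helper names, the exact-match metric, and the toy candidates are assumptions for illustration.

```python
import random


def split(examples, n_holdout, seed=0):
    """Shuffle with a fixed seed, then reserve the last n_holdout examples."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:-n_holdout], shuffled[-n_holdout:]


def exact_match(pred, gold):
    """1.0 if the prediction equals the gold answer, ignoring case and padding."""
    return float(pred.strip().lower() == gold.strip().lower())


def bake_off(candidates, heldout, metric=exact_match):
    """Score every candidate on the same held-out set with the same metric."""
    return {
        name: sum(metric(fn(q), a) for q, a in heldout) / len(heldout)
        for name, fn in candidates.items()
    }


if __name__ == "__main__":
    # Toy task: the "answer" is the question doubled.
    examples = [(str(i), str(i * 2)) for i in range(1, 11)]
    dev, heldout = split(examples, n_holdout=3)
    candidates = {
        "doubler": lambda q: str(int(q) * 2),  # stands in for one approach
        "echo": lambda q: q,                   # stands in for a weaker one
    }
    print(bake_off(candidates, heldout))
```

The dev split is for iterating on prompts and pipelines; the held-out split is touched only when comparing approaches.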
sources behind the logic

The decision tree is wired to published guidance.
Every branch maps to one of these.

We don't invent the rules. The tool encodes what Lewis, Hu, Khattab, OpenAI, and Anthropic have already published. Our job is the wiring and the eval that confirms it on your corpus.

RAG paper — Lewis et al. 2020

arXiv:2005.11401. The original retrieval-augmented generation paper. Frames RAG against parametric-memory baselines and shows the freshness + citation argument.

LoRA — Hu et al. 2021

arXiv:2106.09685. Low-rank adapters. The reason fine-tuning is now cheap enough to run alongside RAG instead of as a competing path.

DSPy — Khattab et al. 2023

arXiv:2310.03714. Programming, not prompting. Treats prompt engineering as compilation against a metric, which is what our decision tool encodes.

OpenAI fine-tuning guide

platform.openai.com/docs/guides/fine-tuning. Source for our labeled-data thresholds (50-100 examples as a working minimum, hundreds to low thousands for production).

Anthropic prompt engineering guide

anthropic.com/engineering. Source for the 'prompt engineering first' rule. Their published advice is to exhaust prompting before reaching for fine-tuning.

Ragas — Es et al. 2023

arXiv:2309.15217. Automated RAG evaluation. The metric scaffolding our recommendation cites when it suggests groundedness scoring.
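
Groundedness scoring asks what fraction of the claims in an answer are supported by the retrieved context. Ragas uses an LLM to extract and verify claims; the sketch below swaps in a crude lexical check so it runs with no model, and should be read as an illustration of the ratio rather than the Ragas implementation. The function names and the `threshold` parameter are assumptions.

```python
import re


def sentences(text):
    """Split on sentence-ending punctuation; each sentence is one 'claim'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    """Lowercased alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(claim, context_words, threshold):
    """Crude stand-in for LLM verification: enough claim words appear in context."""
    claim_words = words(claim)
    if not claim_words:
        return False
    return len(claim_words & context_words) / len(claim_words) >= threshold


def groundedness(answer, context, threshold=0.8):
    """Supported claims divided by total claims, in [0, 1]."""
    claims = sentences(answer)
    if not claims:
        return 0.0
    ctx = words(context)
    supported = sum(is_supported(c, ctx, threshold) for c in claims)
    return supported / len(claims)


if __name__ == "__main__":
    context = "Refunds are issued within 14 days of purchase."
    answer = "Refunds are issued within 14 days. Shipping is free worldwide."
    print(groundedness(answer, context))
```

In the example, the refund sentence is supported by the context and the shipping sentence is not, so half the claims are grounded.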

Services this decision tool feeds: AI knowledge base development (RAG over private corpora is the canonical answer for most knowledge-base builds), AI chatbot development (the RAG vs domain-tuned distinction matters when chatbot accuracy is gated on retrieval recall), and Claude development (Claude's long-context window and prompt caching change the RAG-vs-fine-tune economics, so we score both on your eval).

next step

The recommendation is the hypothesis. The eval is the test.
Two weeks, fixed scope, written deliverable.

If the tool pointed you at RAG, fine-tuning, or a hybrid, the audit is how we confirm it on your data. Engagements move audit → 4-6 week pilot with weekly eval gates → continuous delivery against the same rubric in production. We lock the rubric, run the bake-off, and ship a written prioritisation. If we recommend not building, we say so. A recent RAG pilot logged 88% faithfulness on a 1,840-document corpus (2026-Q1); a sibling fine-tune logged a lift of +6.4 points on a 12K-example task (2026-Q1). <a href='/services/ai-consulting/'>Architecture decision consulting</a> is the entry point to the audit, where this tool's hypothesis becomes a real eval on your corpus.