Should I always use RAG over fine-tuning?

No. RAG wins when knowledge changes often, when citations are required, or when the corpus is large. Fine-tuning wins when the model must reliably adopt a style, format, or persona, or when you need to compress a working prompt into a smaller, cheaper model. The two solve different problems.

Can I do RAG and fine-tuning together?

Yes. A common pattern is RAG for fresh knowledge plus a LoRA adapter (Hu et al. 2021) tuned on your tone, schema, or refusal policy. Retrieval handles facts; the adapter handles voice. Score both contributions on the same eval before shipping.

How much labeled data do I need to fine-tune?

OpenAI's fine-tuning guide reports that a clear lift typically appears at around 50-100 high-quality examples; a few hundred to low thousands is the usual range for production-grade style adoption. Quality matters more than volume; bad labels cost more than missing labels.

What if my knowledge changes daily?

Use RAG. Fine-tuning bakes facts into weights at training time. By the time the run finishes, the weights are already stale. Reindex retrieval on a schedule that matches how often the source-of-truth updates.
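One way to keep the reindex schedule honest is to reindex only what actually changed. The sketch below is a minimal illustration, assuming documents are available as an id-to-text mapping and that the index stores a content hash per document; `changed_doc_ids` and both dictionaries are hypothetical names, not part of any particular vector store.

```python
import hashlib


def changed_doc_ids(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose content differs from the indexed copy."""
    changed = []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # A missing id (new document) or a different hash (edited document) both need reindexing.
        if indexed_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return sorted(changed)


# Hypothetical state: "pricing" was edited and "faq" is new since the last index build.
indexed = {
    "pricing": hashlib.sha256("old price list".encode("utf-8")).hexdigest(),
    "terms": hashlib.sha256("terms v1".encode("utf-8")).hexdigest(),
}
current = {"pricing": "new price list", "terms": "terms v1", "faq": "first draft"}
print(changed_doc_ids(current, indexed))
```

The sketch does not handle deletions; a real pipeline would also drop index entries whose ids no longer appear in the source of truth.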

Is RAG cheaper than fine-tuning?

It depends on volume. RAG has higher per-query cost (retrieval plus longer prompts) and lower one-time cost. Fine-tuning inverts that: a single training run, then cheaper inference per call. At high QPS with stable knowledge, fine-tuning or distillation often wins on total cost; at low QPS with churning knowledge, RAG wins.
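The volume argument can be made concrete with a break-even calculation. This is a minimal sketch: every dollar figure is a hypothetical placeholder rather than a measured price, and it assumes knowledge is stable enough that the model is trained once.

```python
def break_even_queries(train_cost: float, rag_cost_per_query: float, ft_cost_per_query: float) -> float:
    """Number of queries after which a one-time training cost is paid back.

    Assumes fine-tuning is the option with the lower per-query cost.
    """
    saving_per_query = rag_cost_per_query - ft_cost_per_query
    if saving_per_query <= 0:
        return float("inf")  # fine-tuning never pays back on cost alone
    return train_cost / saving_per_query


# Hypothetical inputs: $500 training run, $0.004 per RAG query, $0.0015 per fine-tuned query.
queries = break_even_queries(500.0, 0.004, 0.0015)
print(f"break-even after about {queries:,.0f} queries")
```

With these placeholder numbers the training run pays for itself after roughly 200,000 queries. Each retraining run forced by changed knowledge adds another `train_cost`, which is why churning knowledge pushes the answer back toward RAG.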

Does fine-tuning let the model 'know' new facts?

Not reliably. Fine-tuning shifts the distribution of outputs; it teaches style, format, and behaviour far better than it teaches facts. For factual recall use RAG, where the retrieved passage is part of the prompt and can be cited.

decision tool · 8 questions

RAG or fine-tune?
Answer 8 questions. Get a defended recommendation.

The honest answer is usually 'it depends'. This tool replaces that with a procedure. The logic is wired to published guidance: the RAG paper (Lewis et al. 2020), LoRA (Hu et al. 2021), DSPy (Khattab et al. 2023), and the OpenAI and Anthropic guides. Each recommendation cites which of your answers drove it.


interactive · client-side only

Score your use case. The procedure picks the path.

Nothing leaves your browser. The form runs in JavaScript on this page. We keep no answers, no analytics on the inputs, no server roundtrip.

01 · knowledge freshness: How often does the underlying knowledge change?
02 · corpus size: How big is the knowledge corpus?
03 · latency budget: What's the p95 latency budget?
04 · data sensitivity: PII or regulated content?
05 · style / format / persona: Does the model need a specific style, JSON shape, or persona?
06 · labeled data: Do you have labeled training examples?
07 · citation need: Do users need source attribution?
08 · cost sensitivity: How tight is the per-query budget?
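A procedure of this kind can be sketched as a small rule table that returns both a recommendation and the answers that drove it. The code below is an illustration only, not the logic that runs on this page: the field names, the subset of five questions, and the 50-example threshold are assumptions chosen to mirror the guidance in the FAQ above.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    # Hypothetical schema covering five of the eight questions.
    knowledge_changes_often: bool
    large_corpus: bool
    needs_citations: bool
    needs_style_or_schema: bool
    labeled_examples: int


def recommend(a: Answers) -> tuple[str, list[str]]:
    """Return a recommendation plus the answers that drove it."""
    rag_reasons: list[str] = []
    ft_reasons: list[str] = []
    if a.knowledge_changes_often:
        rag_reasons.append("knowledge changes often")
    if a.large_corpus:
        rag_reasons.append("large corpus")
    if a.needs_citations:
        rag_reasons.append("citations required")
    # Assumed threshold: style tuning only counts once there is enough labeled data.
    if a.needs_style_or_schema and a.labeled_examples >= 50:
        ft_reasons.append("style/format need with enough labeled data")

    if rag_reasons and ft_reasons:
        return "hybrid", rag_reasons + ft_reasons
    if rag_reasons:
        return "rag", rag_reasons
    if ft_reasons:
        return "fine-tune", ft_reasons
    return "prompt engineering first", ["no strong signal for RAG or fine-tuning"]


choice, reasons = recommend(Answers(True, True, True, True, 300))
print(choice, reasons)
```

Returning the reasons alongside the label is what lets each recommendation cite the answers behind it; latency, data sensitivity, and cost would add further rules in the same shape.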

head-to-head

RAG vs Fine-tuning vs Hybrid.
Same dimensions we score in our audit phase.

No 'best overall' row. Best is a function of your freshness, latency, corpus, and citation requirements. This table lays out the trade-off on each axis.

Dimension	RAG	Fine-tuning	Hybrid
Knowledge freshness	Reindex on demand. Same-day updates trivial.	Stale at training time. New facts require a new run.	Fresh facts via retrieval, learned style baked in.
Cost per query	Higher. Retrieval + longer prompts.	Lower. Smaller prompts after training.	Between the two; depends on retrieval depth.
Cost per training run	Zero. No training step.	Real. From a few dollars (LoRA on 7B) to four figures on large models.	Real. Same as fine-tuning plus retrieval setup.
Citation support	Native. Each chunk has a source.	Not reliable. An answer cannot be traced from weights back to a source document.	Citations come from the retrieval half.
Latency	+80-300ms for retrieval depending on index.	Lower. No retrieval step. Smaller models possible.	Same as RAG, plus the inference path.
Best-fit corpus size	Tens to millions of docs.	Knowledge-agnostic. Tunes behaviour, not facts.	Any corpus size.
Domain adaptation	Limited. Model still reasons in its base style.	Strong. Adapts tone, schema, persona, refusal policy.	Best of both. Most-used pattern in production.
Time to ship	Days for a baseline; weeks to harden eval.	Weeks. Includes data collection + label QA.	Weeks. Both pipelines must land.
When NOT to use it	Fixed style requirements, sub-200ms budgets.	Daily-changing facts, citation requirements.	Tiny corpora or single-purpose prompts.

Dated 2026-05-22. Costs move quickly; re-check the numbers on a fresh eval each quarter.

anti-patterns we see weekly

design decision · 01

Don't fine-tune for facts

we chose: Use RAG for factual recall
because: Fine-tuning shifts output distribution; it doesn't reliably encode new facts. Lewis et al. 2020 motivated RAG by the limits of parametric memory: knowledge stored in weights is hard to update and cannot show its provenance.

design decision · 02

Don't RAG for style

we chose: Fine-tune or use a LoRA adapter for tone, format, and persona
because: Retrieval gives the model context, not personality. A LoRA adapter (Hu et al. 2021) is the cheap, modular path to consistent voice and schema.
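The "cheap" part follows from simple arithmetic. LoRA freezes a d × k weight matrix and trains two low-rank factors, B (d × r) and A (r × k), so the trainable parameter count per adapted matrix is r · (d + k) instead of d · k. The sketch below works one example; the layer size and rank are hypothetical values picked for illustration.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted d x k weight matrix.

    The frozen weight W is d x k; the learned update is B @ A with
    B of shape d x r and A of shape r x k.
    """
    return r * (d + k)


d = k = 4096  # hypothetical projection size
r = 8         # hypothetical adapter rank
full = d * k
lora = lora_trainable_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For this layer the adapter trains 65,536 parameters against 16,777,216 for a full update, about 0.39% of the matrix.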

design decision · 03

Don't skip prompt engineering

we chose: Drive a real eval against the base model first
because: Anthropic and OpenAI both publish the same advice: a tight system prompt with examples often closes 60-80% of the gap. Fine-tune or RAG only after that ceiling is measured.
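Measuring that ceiling takes three scores on the same eval: the unprompted baseline, the prompted model, and the target you need to ship. A minimal sketch, with hypothetical scores standing in for real eval results:

```python
def gap_closed(baseline: float, candidate: float, target: float) -> float:
    """Fraction of the baseline-to-target gap that the candidate closes."""
    gap = target - baseline
    if gap <= 0:
        raise ValueError("target must exceed baseline")
    return (candidate - baseline) / gap


# Hypothetical eval scores, all measured on the same held-out set.
closed = gap_closed(baseline=0.52, candidate=0.73, target=0.82)
print(f"prompting closed {closed:.0%} of the gap")
```

Here prompting closes about 70% of the gap, so RAG or fine-tuning would only need to justify itself against the remaining 30%.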

design decision · 04

Don't pick before you have evals

we chose: Lock the rubric, then run the bake-off
because: Without a held-out test set, every approach looks good on cherry-picked demos. Ragas (arXiv:2309.15217) and DSPy (arXiv:2310.03714) give you the wiring; use them.
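The shape of a bake-off is small enough to sketch without either library. The code below is a toy harness under stated assumptions: `is_held_out`, `bake_off`, the exact-match scorer, and the stand-in systems are all hypothetical, and a real run would swap in Ragas or DSPy metrics for the scorer.

```python
import hashlib
from typing import Callable


def is_held_out(example_id: str, fraction: float = 0.2) -> bool:
    """Deterministically assign an example to the held-out set by hashing its id."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def bake_off(
    systems: dict[str, Callable[[str], str]],
    test_set: list[tuple[str, str]],
    score: Callable[[str, str], float],
) -> dict[str, float]:
    """Mean score per system over the same held-out (question, reference) pairs."""
    return {
        name: sum(score(system(q), ref) for q, ref in test_set) / len(test_set)
        for name, system in systems.items()
    }


def exact_match(output: str, reference: str) -> float:
    return float(output == reference)


# Toy stand-ins for real candidates (prompted base model, RAG, fine-tune).
test_set = [("2+2", "4"), ("capital of France", "Paris")]
systems = {"always_4": lambda q: "4", "echo": lambda q: q}
print(bake_off(systems, test_set, exact_match))
```

Hashing the example id keeps the split stable across runs, so no candidate is ever tuned on an example it is later scored on, and every candidate faces the same questions under the same rubric.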

sources behind the logic

The decision tree is wired to published guidance.
Every branch maps to one of these.

We don't invent the rules. The tool encodes what Lewis, Hu, Khattab, OpenAI, and Anthropic have already published. Our job is the wiring and the eval that confirms it on your corpus.

RAG paper — Lewis et al. 2020

arXiv:2005.11401. The original retrieval-augmented generation paper. Frames RAG against parametric-memory baselines and shows the freshness + citation argument.

LoRA — Hu et al. 2021

arXiv:2106.09685. Low-rank adapters. The reason fine-tuning is now cheap enough to run alongside RAG instead of as a competing path.

DSPy — Khattab et al. 2023

arXiv:2310.03714. Programming, not prompting. Treats prompt engineering as compilation against a metric, which is what our decision tool encodes.

OpenAI fine-tuning guide

platform.openai.com/docs/guides/fine-tuning. Source for our labeled-data thresholds (50-100 minimum, hundreds to low-thousands for production).

Anthropic prompt engineering guide

anthropic.com/engineering. Source for the 'prompt engineering first' rule. Their published advice is to exhaust prompting before reaching for fine-tuning.

Ragas — Es et al. 2023

arXiv:2309.15217. Automated RAG evaluation. The metric scaffolding our recommendation cites when it suggests groundedness scoring.

Services this decision tool feeds: AI knowledge base development (RAG-over-private-corpora is the canonical answer for most knowledge-base builds), AI chatbot development (RAG vs domain-tuned distinction matters when chatbot accuracy is gated on retrieval recall), and Claude development (Claude's long-context window + prompt caching change the RAG-vs-fine-tune economics — we score both on your eval).

next step

The recommendation is the hypothesis. The eval is the test.
Two weeks, fixed scope, written deliverable.

If the tool pointed you at RAG, fine-tuning, or a hybrid, the audit is how we confirm it on your data. Engagements move from the audit, to a 4-6 week pilot with weekly eval gates, to continuous delivery against the same rubric in production. We lock the rubric, run the bake-off, and ship a written prioritisation. If we recommend not building, we say so. A recent pilot RAG run logged 88% faithfulness on a 1,840-document corpus (2026-Q1); a sibling fine-tune logged a +6.4-point lift on a 12K-example task (2026-Q1). Architecture decision consulting (/services/ai-consulting/) is the audit door where this tool's hypothesis becomes a real eval on your corpus.

