
Fine-tuning vs RAG vs Prompting

Apr 18, 2023 · 6 min read · Utso Sarkar

Every team building on LLMs hits the same question: how do we make the model good at our specific task? The three options on the table are prompting, retrieval-augmented generation (RAG), and fine-tuning. Conference talks treat them as interchangeable. They are not. Each comes with its own cost profile, latency characteristics, maintenance burden, and failure modes.

I have seen teams fine-tune when prompting would suffice, wasting months and tens of thousands of dollars. I have seen teams prompt-engineer endlessly when a small fine-tune would have solved the problem in a week. The choice is not about which approach is “best.” It is about which approach fits your constraints.

The Decision Framework

flowchart TD
    Start([Adapt LLM to your task]) --> Q1{Knowledge changes frequently?}
    Q1 -->|Yes, daily/weekly| RAG[RAG + prompting]
    Q1 -->|No, stable domain| Q2{Need specific output format/style?}

    Q2 -->|Yes, rigid schema| Q3{Have 500+ quality examples?}
    Q2 -->|No, flexible outputs| Q4{Context fits in window?}

    Q4 -->|Yes| Prompt[Prompt engineering + few-shot]
    Q4 -->|No| RAG

    Q3 -->|Yes| FT[Fine-tuning]
    Q3 -->|No| Q5{Can you generate synthetic examples?}

    Q5 -->|Yes| Synth[Synthetic data + fine-tune]
    Q5 -->|No| Prompt

    RAG --> Q6{Retrieval quality sufficient?}
    Q6 -->|No| FixRAG[Fix chunking / embeddings first]
    Q6 -->|Yes| Ship[Ship and iterate]

    FT --> Q7{Base model updates break you?}
    Q7 -->|Yes| RAG
    Q7 -->|No| Ship

    Prompt --> Ship
    Synth --> Ship
    FixRAG --> RAG

This is a starting point, not gospel. Your specific task may violate every assumption here. But it beats the default approach of “let’s fine-tune because it sounds serious.”
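The main branches of the flowchart can be written down as a small function, which makes the assumptions easy to argue about in code review. This is only a sketch: the `Task` fields and the `recommend` name are hypothetical, the 500-example threshold is taken from the diagram, and the follow-up checks (retrieval quality, base model updates) are left out.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a task, mirroring the flowchart questions."""
    knowledge_changes_frequently: bool
    needs_rigid_format: bool
    context_fits_in_window: bool
    quality_examples: int
    can_generate_synthetic: bool


def recommend(task: Task) -> str:
    """Walk the main branches of the flowchart and return a starting point."""
    if task.knowledge_changes_frequently:
        return "RAG + prompting"
    if not task.needs_rigid_format:
        if task.context_fits_in_window:
            return "Prompt engineering + few-shot"
        return "RAG + prompting"
    # Rigid schema: the question becomes whether you have enough data.
    if task.quality_examples >= 500:
        return "Fine-tuning"
    if task.can_generate_synthetic:
        return "Synthetic data + fine-tune"
    return "Prompt engineering + few-shot"


# Example: stable domain, rigid output schema, only 120 labeled examples.
suggestion = recommend(Task(False, True, True, 120, True))
print(suggestion)
```

Treat the return value as the first thing to try, not a verdict. The eval set still decides.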

Prompting: The Underrated Baseline

Prompt engineering gets mocked as “not real engineering.” This is stupid. Prompting is the fastest iteration loop available. Change a system prompt, run your eval set, see results in minutes. No training pipeline. No GPU cluster. No data labeling budget.

Prompting works well when:

Your task fits in the context window with room for examples
Output format is flexible or enforceable via structured prompting
You need to ship this week, not next quarter
Your domain knowledge can be expressed as instructions, not implicit in thousands of examples

Prompting fails when:

The model consistently ignores instructions despite careful prompt design
You need behavior that requires internalizing patterns too complex for in-context demonstration
Latency and cost from long prompts with many few-shot examples exceed fine-tuning inference costs
You are stuffing 50 examples into every request and calling it a “prompt strategy”

Cost profile: API inference only. Scales linearly with prompt length.
Maintenance: Low. Update prompts in code, redeploy.
Latency: Depends on prompt size. Long few-shot prompts hurt.
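To make the linear scaling concrete, here is a minimal sketch of a few-shot prompt builder. The `Input:`/`Output:` layout and both function names are assumptions for illustration, and the token count is a crude whitespace proxy, not a real tokenizer.

```python
def build_few_shot_prompt(instructions, examples, query):
    """Assemble instructions, (input, output) example pairs, and the new query."""
    parts = [instructions.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


def rough_token_count(text):
    """Very rough proxy: one token per whitespace-separated word."""
    return len(text.split())


examples = [
    ("The checkout page crashes on submit", "bug"),
    ("Please add dark mode", "feature_request"),
    ("How do I export my data?", "question"),
    ("App freezes when I upload a photo", "bug"),
]
instructions = "Classify each support message as bug, feature_request, or question."

short_prompt = build_few_shot_prompt(instructions, examples[:2], "Login button does nothing")
long_prompt = build_few_shot_prompt(instructions, examples, "Login button does nothing")

# Every extra example is paid for again on every request.
print(rough_token_count(short_prompt), rough_token_count(long_prompt))
```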

RAG: When Your Knowledge Is External and Dynamic

RAG separates the model’s reasoning capability from your domain knowledge. You retrieve relevant documents at query time and inject them into the prompt. The model answers using provided context.

RAG works well when:

Your knowledge base changes frequently (product docs, support articles, legal regulations)
The total knowledge exceeds any context window
You need citations and traceability for answers
You want to update knowledge without retraining

RAG fails when:

Retrieval returns wrong or incomplete context (most common failure)
The task requires synthesizing information across many documents in non-obvious ways
Your documents are poorly structured for chunking
You use RAG as a substitute for fixing a model that lacks basic reasoning capability

Cost profile: Embedding API + vector storage + inference. Re-indexing costs when documents change.
Maintenance: Medium-high. Pipeline for ingestion, chunking, embedding, index updates.
Latency: Retrieval step adds 50-200ms depending on infrastructure.

The dirty secret of RAG in 2023: most teams should spend 80% of their effort on document quality and chunking strategy and 20% on retrieval infrastructure. Most teams do the reverse.
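The retrieve-then-inject loop is small enough to sketch in plain Python. To stay self-contained, this example swaps in bag-of-words cosine similarity where a real system would use an embedding model and a vector index. The function names and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vec, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(query, chunks, k=2):
    """Inject the retrieved chunks into the prompt, numbered for citation."""
    retrieved = retrieve(query, chunks, k)
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "To reset your password, open account settings.",
]
top = retrieve("How do I reset my password?", chunks, k=1)
prompt = build_rag_prompt("How do I reset my password?", chunks, k=1)
```

Note how little of this is "infrastructure." If the chunks are badly cut or the documents are messy, no index will rescue the answer.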

Fine-tuning: When You Need Behavior, Not Knowledge

Fine-tuning adapts model weights to your task. The model internalizes patterns from training examples rather than reading them at inference time.

Fine-tuning works well when:

You have hundreds to thousands of high-quality input-output pairs
You need consistent output format, tone, or style the base model resists via prompting
Inference cost matters and a fine-tuned model with a short prompt can replace long few-shot prompts
The task is stable and will not change with every product release

Fine-tuning fails when:

You have fewer than 200 quality examples (results will be unreliable)
Your knowledge changes frequently (model goes stale)
You fine-tune for knowledge injection instead of behavior shaping (use RAG)
You do not have an eval set to detect regression when the base model updates

Cost profile: Training compute (one-time per version) + inference. OpenAI fine-tuning API charges for training tokens and inference at a premium.
Maintenance: Medium. Retrain when base model updates or task distribution shifts.
Latency: Lower than long-prompt RAG for equivalent quality on behavior tasks.
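Whether the shorter prompt pays for the training run is simple arithmetic. The sketch below computes a break-even request count. Every number in it is a made-up placeholder, not a real vendor price, and the function names are hypothetical.

```python
def cost_per_request(prompt_tokens, output_tokens, price_per_1k_tokens):
    """Inference cost of one request at a flat per-1k-token price."""
    return (prompt_tokens + output_tokens) / 1000 * price_per_1k_tokens


def break_even_requests(few_shot_cost, fine_tuned_cost, training_cost):
    """Requests needed before the per-request saving repays training.

    Returns None when the fine-tuned model is not cheaper per request.
    """
    saving = few_shot_cost - fine_tuned_cost
    if saving <= 0:
        return None
    return training_cost / saving


# Placeholder numbers: a long few-shot prompt on a cheap base model
# versus a short prompt on a fine-tuned model billed at a premium.
few_shot = cost_per_request(prompt_tokens=3000, output_tokens=200, price_per_1k_tokens=0.002)
fine_tuned = cost_per_request(prompt_tokens=300, output_tokens=200, price_per_1k_tokens=0.012)

requests_needed = break_even_requests(few_shot, fine_tuned, training_cost=100.0)
print(requests_needed)
```

With a large enough inference premium the saving per request shrinks or disappears, so plug in your own prices and volumes before assuming fine-tuning is the cheaper path.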

The Combinations That Actually Work

Real production systems combine approaches:

Prompt + RAG: The default stack for most B2B AI products in 2023. RAG for knowledge, prompting for behavior and format control.

Prompt + fine-tuning: Fine-tune for style and format, prompt for task-specific instructions that change frequently.

RAG + fine-tuning: Fine-tune a model to better use retrieved context (train on query-document-answer triples). Expensive to get right but powerful when retrieval alone is insufficient.

All three: Justified only at scale with dedicated ML infrastructure. Most startups should not start here.
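For the RAG + fine-tuning combination, the training data is a set of query-document-answer triples. One way to serialize them is JSONL with a prompt and a completion per line. The field names and prompt layout here are assumptions. Check the schema your training provider or framework actually expects.

```python
import json


def to_training_record(query, documents, answer):
    """Turn one (query, documents, answer) triple into a prompt/completion pair."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return {"prompt": prompt, "completion": " " + answer}


def to_jsonl(triples):
    """Serialize triples as one JSON object per line."""
    return "\n".join(json.dumps(to_training_record(*triple)) for triple in triples)


triples = [
    (
        "How long do refunds take?",
        ["Refunds are processed within 5 business days.", "Refunds go to the original payment method."],
        "Refunds take up to 5 business days [1].",
    ),
    (
        "Is the office open on holidays?",
        ["Our office is closed on public holidays."],
        "No, the office is closed on public holidays [1].",
    ),
]
jsonl = to_jsonl(triples)
```

Use the same context layout at training time and at inference time. A model fine-tuned on one format and served another tends to lose much of the benefit.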

Common Mistakes I See Repeatedly

Fine-tuning for facts. If the answer is in your docs, RAG is cheaper and more maintainable. Fine-tuning factual knowledge into weights is how you get a model that confidently states outdated information.

RAG without evaluation. Teams ship retrieval pipelines without measuring whether the right chunks are retrieved. Answer quality is a downstream symptom of retrieval quality.
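Measuring retrieval separately from answer quality can be as simple as recall@k over a labeled set of queries. A minimal sketch, assuming each case records the chunk IDs your retriever returned (in rank order) and the IDs a human marked as relevant. The dictionary keys are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(cases, k):
    """Average recall@k across all labeled queries."""
    return sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / len(cases)


cases = [
    {"retrieved": ["a", "b", "c"], "relevant": ["a", "c"]},
    {"retrieved": ["d", "e", "f"], "relevant": ["d"]},
]
score = mean_recall_at_k(cases, k=2)
print(score)
```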

Prompt engineering without an eval set. You cannot improve what you do not measure. Ten examples in a spreadsheet is not an eval set. Build 100+ labeled cases before declaring prompting insufficient.
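An eval harness does not need to be elaborate. The sketch below scores a prompt template against labeled cases using exact match. `fake_model` is a stand-in for whatever client you actually call, and the case format is an assumption.

```python
def evaluate(model, prompt_template, cases):
    """Return exact-match accuracy of a prompt template over labeled cases."""
    correct = 0
    for case in cases:
        prompt = prompt_template.format(input=case["input"])
        prediction = model(prompt).strip().lower()
        if prediction == case["expected"].strip().lower():
            correct += 1
    return correct / len(cases)


def fake_model(prompt):
    """Stand-in for a real LLM call: a keyword rule, for demonstration only."""
    return "positive" if "great" in prompt.lower() else "negative"


cases = [
    {"input": "Great product, works as described", "expected": "positive"},
    {"input": "Terrible support experience", "expected": "negative"},
    {"input": "Not great, broke after a week", "expected": "negative"},
]
template = "Classify the sentiment as positive or negative.\nReview: {input}\nLabel:"

accuracy = evaluate(fake_model, template, cases)
print(accuracy)
```

Three cases are only enough to show the mechanics. The same loop over 100+ cases is what lets you compare two prompt variants honestly.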

Chasing base model updates. When GPT-4 improves, your fine-tuned GPT-3.5 gets relatively worse. Have a migration plan or accept that fine-tuning ties you to a model version.

My Default Recommendation

Start with prompting. Add RAG if knowledge is external or dynamic. Fine-tune only when you have evidence that prompting and RAG cannot achieve required quality, and you have the data and eval infrastructure to do it properly.

This sequence minimizes wasted effort. Each step teaches you something about your task that informs the next step. Skipping straight to fine-tuning because it sounds more “serious” is how startups burn three months and learn nothing.

Closing

There is no universally correct choice. There is a correct choice for your task, your data, your timeline, and your team’s capabilities. The framework above is how I think about it when advising founders.

The LLM infrastructure vendors want you to believe fine-tuning is always the answer because fine-tuning generates training revenue. Vector DB vendors want you to believe RAG is always the answer because RAG generates storage revenue. Prompting generates nothing for vendors, which is exactly why you should start there.

Build the eval set. Run the experiments. Let metrics decide.
