Introduction#
When a base LLM does not know enough about your domain to be useful, you have two primary options: Retrieval-Augmented Generation (RAG) and fine-tuning. Both improve model output quality, but they solve fundamentally different problems. Choosing the wrong one wastes months of effort.
This post explains what each approach actually does, where each fails, and how to decide which to use — or whether you need both.
What RAG Does#
RAG adds knowledge retrieval to inference. Before the model generates a response, relevant documents are fetched from an external store and injected into the context window.
```
User query
  → Embed query
  → Search vector database
  → Retrieve top-k relevant chunks
  → Inject into prompt: "Given the following context: [...] answer: [query]"
  → LLM generates response grounded in retrieved context
```
The model’s weights are unchanged. The model learns nothing. You are extending what it can see, not what it knows.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SimpleRAG:
    def __init__(self):
        self.chunks: list[dict] = []  # {"text": ..., "embedding": ...}

    def index(self, documents: list[str]) -> None:
        for doc in documents:
            self.chunks.append({
                "text": doc,
                "embedding": embed(doc)
            })

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        query_embedding = embed(query)
        scored = [
            (cosine_similarity(query_embedding, chunk["embedding"]), chunk["text"])
            for chunk in self.chunks
        ]
        # Sort by score only, so ties don't fall back to comparing text
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]

    def answer(self, query: str) -> str:
        context_chunks = self.retrieve(query)
        context = "\n\n".join(context_chunks)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Answer questions using only the provided context. If the answer is not in the context, say so."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {query}"
                }
            ]
        )
        return response.choices[0].message.content
```
What Fine-Tuning Does#
Fine-tuning continues training a pre-trained model on your own dataset, adjusting the model’s weights to internalize patterns from your data.
The model learns:
- Domain-specific vocabulary and facts
- A particular tone, format, or style
- Task-specific reasoning patterns
- How to structure output in your preferred schema
Unlike RAG, there is no retrieval step at inference. The knowledge is baked into the weights.
```python
# Preparing a fine-tuning dataset in OpenAI format
import json

training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme Corp."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security > Reset Password. You'll receive an email within 2 minutes."}
        ]
    },
    # ... hundreds or thousands more examples
]

# Write JSONL file for upload
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

# Upload and start fine-tuning job (simplified)
# response = client.fine_tuning.jobs.create(
#     training_file="file-abc123",
#     model="gpt-4o-mini-2024-07-18"
# )
```
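Before uploading, it is worth sanity-checking the JSONL file: a single malformed example can fail the whole job. Below is a minimal validator sketch; the function name `validate_jsonl` and the specific checks are illustrative, not part of any official API.

```python
import json

def validate_jsonl(path: str) -> list[str]:
    """Return a list of problems found in a chat-format fine-tuning file."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: not valid JSON")
                continue
            messages = example.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            for m in messages:
                if m.get("role") not in {"system", "user", "assistant"}:
                    errors.append(f"line {i}: unexpected role {m.get('role')!r}")
                if not isinstance(m.get("content"), str):
                    errors.append(f"line {i}: message content must be a string")
            # The model learns from the final assistant turn
            if messages[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message should be from the assistant")
    return errors
```

Run it on `training_data.jsonl` before creating the job; an empty list means the basic structure is sound.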
Direct Comparison#
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge freshness | Real-time updates | Static until retrained |
| Training data required | None | Hundreds–thousands of examples |
| Compute cost | Low (inference only) | High (training) + inference |
| Latency | Adds retrieval step (~50–200ms) | No retrieval overhead |
| Transparency | Retrieved docs are inspectable | Model internals are opaque |
| Knowledge scope | Bounded by indexed documents | Bounded by training distribution |
| Format/style control | Weak | Strong |
| Factual grounding | High (explicit source) | Lower (hallucination risk) |
| Setup complexity | Medium | High |
When to Use RAG#
RAG is the right default for most production use cases. Use it when:
1. Your knowledge base changes frequently
A customer support bot that answers questions about your product must reflect today’s documentation, not last month’s. Re-running a fine-tuning job every week is slow and expensive; updating a vector index is cheap and immediate.
2. You need source attribution
Regulated industries, legal, and finance use cases require citing sources. RAG retrieves and surfaces the exact document the answer is grounded in. Fine-tuning cannot do this.
3. You have large or specialized knowledge
Fine-tuning cannot reliably memorize thousands of pages of documentation. Retrieval scales to corpora of arbitrary size, because only the few most relevant chunks ever enter the context window.
4. You want to reduce hallucination
Grounding answers in retrieved documents significantly reduces confabulation compared to relying on model memory alone.
5. You are prototyping or iterating quickly
RAG can be operational in days. Fine-tuning requires curating training data, running jobs, evaluating results — typically weeks.
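Point 2 above deserves a concrete illustration: attribution works because retrieved chunks can carry their source labels all the way into the prompt, so the model can cite them. A minimal sketch follows; the `Chunk` dataclass and `build_cited_prompt` helper are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # e.g. a document title, URL, or file path

def build_cited_prompt(query: str, chunks: list[Chunk]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    numbered = "\n\n".join(
        f"[{i}] (source: {c.source})\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\n"
        "Cite the numbered source for every claim, e.g. [1]."
    )
```

Because the numbering is deterministic, the application can map a citation like `[2]` back to the exact document that grounded the claim, something a fine-tuned model's weights can never provide.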
When to Use Fine-Tuning#
Fine-tuning solves problems that retrieval cannot. Use it when:
1. You need consistent output format or style
If every response must follow a precise JSON schema, use a specific voice, or adopt a domain-specific communication style, fine-tuning is more reliable than prompt engineering.
```python
# A fine-tuned model can consistently output structured data
# without brittle "respond only in JSON" instructions
expected_output = {
    "intent": "order_status",
    "confidence": 0.97,
    "entities": {"order_id": "ORD-12345"},
    "response": "Your order is out for delivery."
}
```
2. You want to reduce prompt length
If your system prompt is 2000 tokens of instructions repeated on every request, baking those instructions into the weights through fine-tuning cuts per-request token cost.
3. You are building a specialized reasoning task
Tasks like code review in a specific style, medical entity extraction, or classifying support tickets into your custom taxonomy benefit from fine-tuning when you have labelled examples.
4. Latency is critical and retrieval overhead is unacceptable
Eliminating the retrieval step saves 50–200ms per request. At scale, this matters.
5. The knowledge is stable and bounded
Legal clause analysis, medical coding, internal process workflows — domains where the knowledge set is relatively fixed and well-documented are good fine-tuning candidates.
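The prompt-length argument from point 2 is easy to put numbers on. The sketch below does the arithmetic; the per-token price is a placeholder, so substitute your provider's current rates.

```python
def monthly_prompt_savings(
    requests_per_day: int,
    prompt_tokens_removed: int,
    price_per_million_tokens: float,  # input-token price; placeholder, check current pricing
) -> float:
    """Dollars saved per 30-day month by removing a repeated system prompt."""
    tokens_per_month = requests_per_day * 30 * prompt_tokens_removed
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# Hypothetical: 100k requests/day, a 2000-token system prompt,
# $0.15 per million input tokens → roughly $900/month
savings = monthly_prompt_savings(100_000, 2000, 0.15)
```

At small scale the savings are trivial; at high request volumes they can justify the fine-tuning effort on their own.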
Common Mistakes#
Mistake 1: Fine-tuning for knowledge injection
Fine-tuning does not reliably memorize facts. If you fine-tune on “our product costs $49/month” and that changes to $59, the model will confidently say the wrong price. Use RAG for facts.
Mistake 2: Using RAG for style/format consistency
No matter how detailed your retrieval prompt, a base model will occasionally ignore formatting instructions. Fine-tuning enforces output structure far more reliably.
Mistake 3: Skipping evaluation
Neither RAG nor fine-tuning is “set and forget.” Build an eval suite before you start. Measure quality on a held-out test set and track it over time.
Mistake 4: Under-investing in chunking strategy for RAG
The quality of RAG is largely determined by retrieval quality, which depends on how documents are chunked. Fixed-size chunking with no semantic awareness is a common bottleneck.
```python
import re

# Naive chunking (often insufficient)
def naive_chunk(text: str, chunk_size: int = 500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# Better: sentence-aware chunking with overlap
def smart_chunk(text: str, max_tokens: int = 400, overlap_sentences: int = 2) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        sentence_len = len(sentence.split())
        if current_len + sentence_len > max_tokens and current:
            chunks.append(" ".join(current))
            # Carry the last few sentences into the next chunk for continuity
            current = current[-overlap_sentences:]
            current_len = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_len += sentence_len
    if current:
        chunks.append(" ".join(current))
    return chunks
```
Using Both Together#
RAG and fine-tuning are not mutually exclusive. Many production systems combine them:
- Fine-tune for format and style — the model learns your output schema, tone, and domain vocabulary
- RAG for facts and freshness — retrieved context grounds the model in current, accurate information
The fine-tuned model handles structure; the retrieval pipeline handles knowledge. This combination reduces both hallucination and prompt engineering overhead.
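In code, the split is small: the request targets the fine-tuned model while the user message carries retrieved context. The sketch below assumes a hypothetical fine-tuned model id (`ft:gpt-4o-mini-2024-07-18:acme::abc123`) and a retrieval step like `SimpleRAG.retrieve` above; it only assembles the request, it does not call the API.

```python
def build_hybrid_request(query: str, retrieved_chunks: list[str]) -> dict:
    """Fine-tuned model supplies format and style; retrieval supplies the facts."""
    context = "\n\n".join(retrieved_chunks)
    return {
        # Hypothetical fine-tuned model id
        "model": "ft:gpt-4o-mini-2024-07-18:acme::abc123",
        "messages": [
            # The system prompt can stay short: tone and schema live in the weights
            {"role": "system", "content": "Answer from the provided context only."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    }
```

Passing this dict to `client.chat.completions.create(**request)` would then combine both approaches in a single call.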
Decision Checklist#
Before choosing an approach, answer these:
- Does the knowledge change frequently? → RAG
- Do you need source attribution? → RAG
- Is the knowledge set too large to memorize? → RAG
- Do you need rigid output format? → Fine-tuning
- Is latency critical? → Fine-tuning
- Is the domain vocabulary highly specialized? → Fine-tuning (or both)
- Do you have labelled examples? If not → RAG first
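The checklist above can be encoded as a simple decision function, useful as a starting point in design discussions. This is a heuristic sketch of this post's rules of thumb, not an authoritative decision procedure; the function name and argument names are my own.

```python
def recommend_approach(
    knowledge_changes_often: bool,
    needs_attribution: bool,
    corpus_too_large_to_memorize: bool,
    needs_rigid_format: bool,
    latency_critical: bool,
    has_labelled_examples: bool,
) -> str:
    """Heuristic encoding of the decision checklist; a starting point, not a verdict."""
    rag_signals = knowledge_changes_often or needs_attribution or corpus_too_large_to_memorize
    ft_signals = needs_rigid_format or latency_critical
    if ft_signals and not has_labelled_examples:
        return "RAG first (collect labelled examples, then revisit fine-tuning)"
    if rag_signals and ft_signals:
        return "both"
    if ft_signals:
        return "fine-tuning"
    return "RAG"
```

Note how the function defaults to RAG whenever the fine-tuning signals are absent, mirroring the conclusion below.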
Conclusion#
Default to RAG. It is faster to build, cheaper to iterate, and safer for factual accuracy. Reach for fine-tuning when you have a well-defined format requirement, a stable knowledge domain, or measurable evidence that RAG alone is insufficient.
The worst outcome is spending weeks fine-tuning when a well-tuned retrieval pipeline would have solved the problem in days.