Introduction#
When a base LLM does not know enough about your domain to be useful, you have two primary options: Retrieval-Augmented Generation (RAG) and fine-tuning. Both improve model output quality, but they solve fundamentally different problems. Choosing the wrong one wastes months of effort.
This post explains what each approach actually does, where each fails, and how to decide which to use — or whether you need both.
What RAG Does#
RAG adds knowledge retrieval to inference. Before the model generates a response, relevant documents are fetched from an external store and injected into the context window.
```
User query
  → Embed query
  → Search vector database
  → Retrieve top-k relevant chunks
  → Inject into prompt: "Given the following context: [...] answer: [query]"
  → LLM generates response grounded in retrieved context
```
The model’s weights are unchanged. The model learns nothing. You are extending what it can see, not what it knows.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SimpleRAG:
    def __init__(self):
        self.chunks: list[dict] = []  # {"text": ..., "embedding": ...}

    def index(self, documents: list[str]) -> None:
        for doc in documents:
            self.chunks.append({
                "text": doc,
                "embedding": embed(doc)
            })

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        query_embedding = embed(query)
        scored = [
            (cosine_similarity(query_embedding, chunk["embedding"]), chunk["text"])
            for chunk in self.chunks
        ]
        # Sort by score only, so ties don't fall back to comparing text
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]

    def answer(self, query: str) -> str:
        context_chunks = self.retrieve(query)
        context = "\n\n".join(context_chunks)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Answer questions using only the provided context. If the answer is not in the context, say so."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {query}"
                }
            ]
        )
        return response.choices[0].message.content
```
What Fine-Tuning Does#
Fine-tuning continues training a pre-trained model on your own dataset, adjusting the model’s weights to internalize patterns from your data.
The model learns:
- Domain-specific vocabulary and facts
- A particular tone, format, or style
- Task-specific reasoning patterns
- How to structure output in your preferred schema
Unlike RAG, there is no retrieval step at inference. The knowledge is baked into the weights.
```python
# Preparing a fine-tuning dataset in OpenAI format
import json

training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme Corp."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security > Reset Password. You'll receive an email within 2 minutes."}
        ]
    },
    # ... hundreds or thousands more examples
]

# Write JSONL file for upload
with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

# Upload and start fine-tuning job (simplified)
# response = client.fine_tuning.jobs.create(
#     training_file="file-abc123",
#     model="gpt-4o-mini-2024-07-18"
# )
```
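Before uploading, it is worth sanity-checking the JSONL file: a single malformed example can fail the whole job. Below is a minimal validator sketch; the function name `validate_jsonl` and the specific checks are illustrative, not part of any official API.

```python
import json

def validate_jsonl(path: str) -> list[str]:
    """Return a list of problems found in a chat-format fine-tuning file."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: not valid JSON")
                continue
            messages = example.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            for m in messages:
                if m.get("role") not in {"system", "user", "assistant"}:
                    errors.append(f"line {i}: unexpected role {m.get('role')!r}")
                if not isinstance(m.get("content"), str):
                    errors.append(f"line {i}: message content must be a string")
            # The model learns from the final assistant turn
            if messages[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message should be from the assistant")
    return errors
```

Run it on `training_data.jsonl` before creating the job; an empty list means the basic structure is sound.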
Direct Comparison#
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge freshness | Real-time updates | Static until retrained |
| Training data required | None | Hundreds–thousands of examples |
| Compute cost | Low (inference only) | High (training) + inference |
| Latency | Adds retrieval step (~50–200ms) | No retrieval overhead |
| Transparency | Retrieved docs are inspectable | Model internals are opaque |
| Knowledge scope | Bounded by indexed documents | Bounded by training distribution |
| Format/style control | Weak | Strong |
| Factual grounding | High (explicit source) | Lower (hallucination risk) |
| Setup complexity | Medium | High |
When to Use RAG#
RAG is the right default for most production use cases. Use it when:
1. Your knowledge base changes frequently
A customer support bot that answers questions about your product must reflect today’s documentation, not last month’s. Re-running a fine-tuning job every week is slow and expensive; updating a vector index is cheap and immediate.
2. You need source attribution
Regulated industries, legal, and finance use cases require citing sources. RAG retrieves and surfaces the exact document the answer is grounded in. Fine-tuning cannot do this.
3. You have large or specialized knowledge
Fine-tuning cannot reliably memorize thousands of pages of documentation. Retrieval scales to corpora of arbitrary size, because only the few most relevant chunks ever enter the context window.
4. You want to reduce hallucination
Grounding answers in retrieved documents significantly reduces confabulation compared to relying on model memory alone.
5. You are prototyping or iterating quickly
RAG can be operational in days. Fine-tuning requires curating training data, running jobs, evaluating results — typically weeks.
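Point 2 above deserves a concrete illustration: attribution works because retrieved chunks can carry their source labels all the way into the prompt, so the model can cite them. A minimal sketch follows; the `Chunk` dataclass and `build_cited_prompt` helper are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # e.g. a document title, URL, or file path

def build_cited_prompt(query: str, chunks: list[Chunk]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    numbered = "\n\n".join(
        f"[{i}] (source: {c.source})\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\n"
        "Cite the numbered source for every claim, e.g. [1]."
    )
```

Because the numbering is deterministic, the application can map a citation like `[2]` back to the exact document that grounded the claim, something a fine-tuned model's weights can never provide.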
When to Use Fine-Tuning#
Fine-tuning solves problems that retrieval cannot. Use it when:
1. You need consistent output format or style
If every response must follow a precise JSON schema, use a specific voice, or adopt a domain-specific communication style, fine-tuning is more reliable than prompt engineering.
```python
# A fine-tuned model can consistently output structured data
# without brittle "respond only in JSON" instructions
expected_output = {
    "intent": "order_status",
    "confidence": 0.97,
    "entities": {"order_id": "ORD-12345"},
    "response": "Your order is out for delivery."
}
```
2. You want to reduce prompt length
If your system prompt is 2000 tokens of instructions repeated on every request, baking those instructions into the weights through fine-tuning cuts per-request token cost.
3. You are building a specialized reasoning task
Tasks like code review in a specific style, medical entity extraction, or classifying support tickets into your custom taxonomy benefit from fine-tuning when you have labelled examples.
4. Latency is critical and retrieval overhead is unacceptable
Eliminating the retrieval step saves 50–200ms per request. At scale, this matters.
5. The knowledge is stable and bounded
Legal clause analysis, medical coding, internal process workflows — domains where the knowledge set is relatively fixed and well-documented are good fine-tuning candidates.
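The prompt-length argument from point 2 is easy to put numbers on. The sketch below does the arithmetic; the per-token price is a placeholder, so substitute your provider's current rates.

```python
def monthly_prompt_savings(
    requests_per_day: int,
    prompt_tokens_removed: int,
    price_per_million_tokens: float,  # input-token price; placeholder, check current pricing
) -> float:
    """Dollars saved per 30-day month by removing a repeated system prompt."""
    tokens_per_month = requests_per_day * 30 * prompt_tokens_removed
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# Hypothetical: 100k requests/day, a 2000-token system prompt,
# $0.15 per million input tokens → roughly $900/month
savings = monthly_prompt_savings(100_000, 2000, 0.15)
```

At small scale the savings are trivial; at high request volumes they can justify the fine-tuning effort on their own.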
Common Mistakes#
Mistake 1: Fine-tuning for knowledge injection
Fine-tuning does not reliably memorize facts. If you fine-tune on “our product costs $49/month” and that changes to $59, the model will confidently say the wrong price. Use RAG for facts.
Mistake 2: Using RAG for style/format consistency
No matter how detailed your retrieval prompt, a base model will occasionally ignore formatting instructions. Fine-tuning enforces output structure far more reliably.
Mistake 3: Skipping evaluation
Neither RAG nor fine-tuning is “set and forget.” Build an eval suite before you start. Measure quality on a held-out test set and track it over time.
Mistake 4: Under-investing in chunking strategy for RAG
The quality of RAG is largely determined by retrieval quality, which depends on how documents are chunked. Fixed-size chunking with no semantic awareness is a common bottleneck.
```python
import re

# Naive chunking (often insufficient)
def naive_chunk(text: str, chunk_size: int = 500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# Better: sentence-aware chunking with overlap
def smart_chunk(text: str, max_tokens: int = 400, overlap_sentences: int = 2) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        sentence_len = len(sentence.split())
        if current_len + sentence_len > max_tokens and current:
            chunks.append(" ".join(current))
            # Carry the last few sentences into the next chunk for continuity
            current = current[-overlap_sentences:]
            current_len = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_len += sentence_len
    if current:
        chunks.append(" ".join(current))
    return chunks
```
Using Both Together#
RAG and fine-tuning are not mutually exclusive. Many production systems combine them:
- Fine-tune for format and style — the model learns your output schema, tone, and domain vocabulary
- RAG for facts and freshness — retrieved context grounds the model in current, accurate information
The fine-tuned model handles structure; the retrieval pipeline handles knowledge. This combination reduces both hallucination and prompt engineering overhead.
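In code, the split is small: the request targets the fine-tuned model while the user message carries retrieved context. The sketch below assumes a hypothetical fine-tuned model id (`ft:gpt-4o-mini-2024-07-18:acme::abc123`) and a retrieval step like `SimpleRAG.retrieve` above; it only assembles the request, it does not call the API.

```python
def build_hybrid_request(query: str, retrieved_chunks: list[str]) -> dict:
    """Fine-tuned model supplies format and style; retrieval supplies the facts."""
    context = "\n\n".join(retrieved_chunks)
    return {
        # Hypothetical fine-tuned model id
        "model": "ft:gpt-4o-mini-2024-07-18:acme::abc123",
        "messages": [
            # The system prompt can stay short: tone and schema live in the weights
            {"role": "system", "content": "Answer from the provided context only."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    }
```

Passing this dict to `client.chat.completions.create(**request)` would then combine both approaches in a single call.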
Decision Checklist#
Before choosing an approach, answer these:
- Does the knowledge change frequently? → RAG
- Do you need source attribution? → RAG
- Is the knowledge set too large to memorize? → RAG
- Do you need rigid output format? → Fine-tuning
- Is latency critical? → Fine-tuning
- Is the domain vocabulary highly specialized? → Fine-tuning (or both)
- Do you have labelled examples? If not → RAG first
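The checklist above can be encoded as a simple decision function, useful as a starting point in design discussions. This is a heuristic sketch of this post's rules of thumb, not an authoritative decision procedure; the function name and argument names are my own.

```python
def recommend_approach(
    knowledge_changes_often: bool,
    needs_attribution: bool,
    corpus_too_large_to_memorize: bool,
    needs_rigid_format: bool,
    latency_critical: bool,
    has_labelled_examples: bool,
) -> str:
    """Heuristic encoding of the decision checklist; a starting point, not a verdict."""
    rag_signals = knowledge_changes_often or needs_attribution or corpus_too_large_to_memorize
    ft_signals = needs_rigid_format or latency_critical
    if ft_signals and not has_labelled_examples:
        return "RAG first (collect labelled examples, then revisit fine-tuning)"
    if rag_signals and ft_signals:
        return "both"
    if ft_signals:
        return "fine-tuning"
    return "RAG"
```

Note how the function defaults to RAG whenever the fine-tuning signals are absent, mirroring the conclusion below.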
Conclusion#
Default to RAG. It is faster to build, cheaper to iterate, and safer for factual accuracy. Reach for fine-tuning when you have a well-defined format requirement, a stable knowledge domain, or measurable evidence that RAG alone is insufficient.
The worst outcome is spending weeks fine-tuning when a well-tuned retrieval pipeline would have solved the problem in days.