Introduction#
If you have been operating traditional ML models in production, you might assume LLMs are just bigger models that fit into the same operational framework. They do not. Large language models introduce a different set of failure modes, deployment patterns, cost structures, and evaluation challenges that require a rethought operational approach.
This post maps the differences between MLOps and LLMOps across the key operational dimensions: deployment, versioning, testing, monitoring, and rollback.
The Core Difference in Mindset#
Traditional MLOps operates on a narrow-output paradigm: a model takes structured input and produces a prediction, score, or classification. The output space is bounded and measurable. You define accuracy, precision, recall — metrics with clear ground truth.
LLMs operate on an open-output paradigm: the output is natural language, potentially thousands of tokens long, with correctness that is often subjective, context-dependent, and requires human judgment to evaluate at scale. There is no single “accuracy” metric.
This changes almost everything downstream.
Deployment#
Traditional MLOps#
Models are serialized artifacts (ONNX, pickle, TensorFlow SavedModel, scikit-learn joblib) and served via a model server (TorchServe, TensorFlow Serving, BentoML, Seldon). Deployments are typically small — a few hundred MB to a few GB. GPU is optional for many models.
```yaml
# Traditional model deployment (BentoML example)
apiVersion: serving.bentoml.org/v1alpha2
kind: BentoDeployment
metadata:
  name: fraud-detector
spec:
  bentoTag: fraud-detector:v1.2.0
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
```
LLMOps#
LLMs are either:
- API-hosted (OpenAI, Anthropic, Azure OpenAI) — you deploy a client, not a model
- Self-hosted open models (Llama, Mistral, Phi) — require significant GPU infrastructure
For self-hosted LLMs, the infrastructure requirements are fundamentally different:
```yaml
# Self-hosted LLM deployment with vLLM on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference  # must match the selector above
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "8192"
          resources:
            requests:
              nvidia.com/gpu: "2"
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      nodeSelector:
        accelerator: nvidia-a100
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-weights-pvc
```
GPU scarcity, cost, and VRAM constraints dominate self-hosted LLM operations in a way that has no equivalent in traditional ML.
For API-hosted LLMs, “deployment” means deploying your integration layer (prompts, RAG pipeline, business logic) rather than the model itself.
Versioning#
Traditional MLOps#
Model versioning is well-understood. You version the model artifact in a registry (MLflow, W&B, DVC). The model itself is the primary artifact.
```python
# MLflow model registry (traditional)
import mlflow

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "fraud_detector",
        registered_model_name="fraud-detector",  # registers the version
    )
    mlflow.log_metrics({"accuracy": 0.94, "f1": 0.91})
    mlflow.log_params({"n_estimators": 100, "max_depth": 5})

# Promote the registered version
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production",
)
```
LLMOps#
For API-hosted LLMs, the model is managed by the provider. Your versioned artifacts are:
- Prompts: System prompts, few-shot examples, prompt templates
- RAG configuration: Chunk size, embedding model, retrieval strategy, top-k
- Chains/agents: The orchestration logic connecting LLM calls
- Evaluation datasets: Your test cases and expected outputs
These must be version-controlled with the same rigor as code:
```python
# Prompt versioning with structured metadata
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PromptVersion:
    version: str              # semver
    template: str             # the actual prompt template
    model: str                # which model this was designed for
    author: str
    created_at: datetime
    eval_score: float | None  # score on your eval suite

# Store prompts in version control alongside code
# prompts/support-bot/v2.1.0.yaml
```
```yaml
# prompts/support-bot/v2.1.0.yaml
version: "2.1.0"
model: "gpt-4o"
system_prompt: |
  You are a support agent for Acme Corp.
  Always respond in a professional tone.
  If you cannot answer a question, escalate to a human agent.
  Never share pricing information — redirect to the sales team.
few_shot_examples:
  - user: "How do I reset my password?"
    assistant: "Navigate to Settings > Security > Reset Password..."
eval_score: 0.87
created_at: "2026-03-28"
```
The key insight: when you upgrade from gpt-4o-2024-08-06 to gpt-4o-2025-01-xx, your prompts may need adjustment. A model version bump is a deployment event that requires evaluation.
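One way to operationalize that: gate promotion on both the prompt version and the pinned model id. A minimal sketch — the `deploy_gate` helper and its dict shape are illustrative, not from any library:

```python
# Hypothetical gate: any change to the prompt version OR the pinned model id
# requires a fresh pass of the eval suite before the release can ship.
def deploy_gate(current: dict, candidate: dict, eval_passed: bool) -> bool:
    changed = (
        current["prompt"] != candidate["prompt"]
        or current["model"] != candidate["model"]
    )
    # Unchanged releases ship freely; changed ones only with a passing eval
    return (not changed) or eval_passed

live = {"prompt": "2.1.0", "model": "gpt-4o-2024-08-06"}
bump = {"prompt": "2.1.0", "model": "gpt-4o-2024-11-20"}  # model bump, same prompt

assert deploy_gate(live, bump, eval_passed=False) is False  # blocked until re-evaluated
```

The point of the sketch is that an identical prompt does not exempt a release from evaluation when the model underneath it moved.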
Testing and Evaluation#
Traditional MLOps#
Evaluation is quantitative: hold out a test set, compute metrics, compare against the baseline. If accuracy >= threshold, promote. Automated pass/fail.
```python
# Traditional: deterministic evaluation
from sklearn.metrics import accuracy_score, f1_score

def evaluate_model(model, X_test, y_test) -> dict:
    predictions = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions, average="weighted"),
    }

# Gate promotion: if f1 < 0.90, reject
```
LLMOps#
LLM evaluation is fundamentally harder because output quality is subjective. You need multiple evaluation strategies:
1. Automated rule-based checks (fast, cheap, limited)
```python
def basic_eval(output: str, expected_keywords: list[str]) -> dict:
    return {
        "contains_keywords": all(k.lower() in output.lower() for k in expected_keywords),
        "not_too_short": len(output.split()) > 20,
        "no_refusal": not any(p in output for p in ["I cannot", "I'm unable"]),
    }
```
2. LLM-as-judge (scalable, reasonable quality)
```python
import json

from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, answer: str, reference: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": 'You are an evaluator. Score the answer on correctness and completeness (1-5). Respond in JSON: {"score": N, "reason": "..."}.',
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nReference answer: {reference}\n\nAnswer to evaluate: {answer}",
            },
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
3. Human evaluation (ground truth, expensive)
Maintain a golden dataset — a set of questions with human-validated correct answers. Run every prompt version against it before promoting.
```python
# Eval pipeline structure (PromptVersion and llm_judge are defined above;
# run_prompt executes the prompt template against its target model)
@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str  # human-validated
    tags: list[str]       # "factual", "reasoning", "format", etc.

def run_eval_suite(
    prompt_version: PromptVersion,
    cases: list[EvalCase],
    judge_model: str = "gpt-4o",
) -> dict:
    results = []
    for case in cases:
        actual = run_prompt(prompt_version, case.input)
        score = llm_judge(case.input, actual, case.expected_output)
        results.append({"case_id": case.id, **score})
    avg_score = sum(r["score"] for r in results) / len(results)
    return {
        "prompt_version": prompt_version.version,
        "avg_score": avg_score,
        "pass": avg_score >= 3.5,  # your threshold
        "results": results,
    }
```
Monitoring#
Traditional MLOps#
Monitor prediction distribution, feature drift, and business outcome labels. Ground truth often arrives within hours or days (e.g., fraud confirmed/rejected).
LLMOps#
Ground truth rarely arrives automatically. You must monitor proxies:
| Signal | Traditional ML | LLM |
|---|---|---|
| Output quality | Accuracy, F1 against labels | Refusal rate, LLM-judge scoring, user feedback |
| Input distribution | Feature drift (KS test) | Embedding drift, topic distribution shift |
| Ground truth | Typically available | Rarely available in real-time |
| Failure mode | Wrong prediction | Hallucination, harmful output, policy violation |
| Latency | Milliseconds | Seconds to tens of seconds |
| Cost | Compute (fixed) | Per-token billing (variable with input/output length) |
The key operational difference: in LLMOps, cost is a primary monitoring concern. Token usage varies with input length and user behavior. A single prompt change can double or halve your monthly bill.
```python
# Cost anomaly detection
class CostAnomalyDetector:
    def __init__(self, baseline_daily_cost: float, threshold_multiplier: float = 2.0):
        self.baseline = baseline_daily_cost
        self.threshold = baseline_daily_cost * threshold_multiplier

    def check(self, today_cost: float) -> bool:
        # Alert when today's spend exceeds the multiple of baseline
        return today_cost > self.threshold
```
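The input-distribution row in the table above can be proxied the same way. A rough sketch, assuming you already log one embedding vector per request — pure-Python cosine to stay dependency-free, with illustrative function names and an alert threshold you would tune per workload:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def centroid(vectors: list[list[float]]) -> list[float]:
    # Mean embedding of a traffic window
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def input_drifted(
    baseline: list[list[float]],   # embeddings from a reference traffic window
    recent: list[list[float]],     # embeddings from the current window
    min_similarity: float = 0.95,  # assumed alert threshold; tune per workload
) -> bool:
    return cosine_similarity(centroid(baseline), centroid(recent)) < min_similarity
```

Comparing window centroids is the crudest usable drift signal; production systems often add per-topic clustering on top, but the alerting shape is the same.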
Rollback#
Traditional MLOps#
Rollback means swapping the model artifact in the registry and redeploying the previous version. Relatively clean because inputs and outputs are structured.
LLMOps#
Rollback is more nuanced because there are multiple versioned components:
- Prompt rollback: Revert `prompts/support-bot/v2.1.0.yaml` to `v2.0.0`
- RAG config rollback: Revert chunk strategy or embedding model
- Model version pin: If the provider upgraded `gpt-4o` and behavior changed, pin to an older snapshot (most providers support dated model versions)
- Application code rollback: Revert orchestration changes
```python
# Pin to a specific model snapshot to prevent surprise behavior changes
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # Pinned snapshot, not the "gpt-4o" alias
    messages=[...]
)
```
Always use dated model versions in production. The gpt-4o alias resolves to the latest snapshot and can change behavior without notice.
CI/CD Pipeline Comparison#
Traditional MLOps Pipeline#
```text
Code commit
  → Unit tests
  → Train model
  → Evaluate on test set (accuracy >= threshold?)
  → Register model artifact
  → Deploy to staging
  → Integration tests
  → Deploy to production
  → Monitor predictions
```
LLMOps Pipeline#
```text
Code/prompt commit
  → Unit tests (application code)
  → Run eval suite on golden dataset (avg score >= threshold?)
  → Version prompt/config artifacts
  → Deploy to staging
  → Integration tests + canary user traffic
  → Deploy to production
  → Monitor: cost, latency, quality proxies, user feedback
  → Schedule regular eval re-runs against new traffic samples
```
The eval suite replaces test-set evaluation: the golden dataset stands in for the labeled test set, and judge scores stand in for accuracy metrics. The pipeline structure is similar; what runs inside each stage is different.
Key Operational Principles for LLMOps#
1. Treat prompts as code. Version them in git, review them in PRs, test them before promoting. A prompt is a first-class artifact.
2. Build eval before you ship. Define your golden dataset before writing the first prompt. Eval-driven development prevents regressions from going unnoticed.
3. Pin model versions in production. Never use a floating alias like gpt-4o; pin to dated snapshots.
4. Cost is a metric. Set cost budgets per service and alert on anomalies. Token costs are unpredictable without active monitoring.
5. Expect non-determinism. LLM outputs are not deterministic even at temperature=0 with the same input. Build tests that tolerate semantic equivalence rather than exact string matching.
6. Human feedback is a first-class signal. Instrument your application to collect thumbs-up/down or ratings. Route a sample to human review. This is your ground truth loop.
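Principle 5 in practice means assertions that tolerate paraphrase. In this sketch, token overlap stands in for a real semantic scorer (embedding similarity or an LLM judge in production); the helper name and the 0.6 threshold are assumptions to tune:

```python
# Tolerance-based check instead of exact string match. Jaccard overlap of
# lowercased tokens is a crude but dependency-free stand-in for semantic
# similarity — swap in an embedding model or LLM judge for real suites.
def roughly_equivalent(actual: str, expected: str, min_overlap: float = 0.6) -> bool:
    a, b = set(actual.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) >= min_overlap

# Passes despite different phrasing; an exact-match assert would fail here
assert roughly_equivalent(
    "Go to Settings > Security and click Reset Password",
    "Go to Settings > Security then click Reset Password",
)
```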
Conclusion#
LLMOps shares the operational goals of MLOps — reliable deployment, quality monitoring, safe rollback — but the tools and techniques differ significantly. The biggest shifts are in evaluation (from quantitative metrics to LLM-as-judge and human eval), versioning (prompts and configs become the primary artifact), and monitoring (cost and quality proxies replace prediction accuracy).
If you are bringing MLOps experience to LLM systems, the mindset transfers well. The specific practices need rethinking.