MLOps vs LLMOps: What Changes When You Operate LLMs

Introduction#

If you have been operating traditional ML models in production, you might assume LLMs are just bigger models that fit into the same operational framework. They do not. Large language models introduce a different set of failure modes, deployment patterns, cost structures, and evaluation challenges that require a rethought operational approach.

This post maps the differences between MLOps and LLMOps across the key operational dimensions: deployment, versioning, testing, monitoring, and rollback.

The Core Difference in Mindset#

Traditional MLOps operates on a narrow-output paradigm: a model takes structured input and produces a prediction, score, or classification. The output space is bounded and measurable. You define accuracy, precision, recall — metrics with clear ground truth.

LLMs operate on an open-output paradigm: the output is natural language, potentially thousands of tokens long, with correctness that is often subjective, context-dependent, and requires human judgment to evaluate at scale. There is no single “accuracy” metric.

This changes almost everything downstream.

Deployment#

Traditional MLOps#

Models are serialized artifacts (ONNX, pickle, TensorFlow SavedModel, scikit-learn joblib) and served via a model server (TorchServe, TensorFlow Serving, BentoML, Seldon). Deployments are typically small — a few hundred MB to a few GB. GPU is optional for many models.

# Traditional model deployment (BentoML example)
apiVersion: serving.bentoml.org/v1alpha2
kind: BentoDeployment
metadata:
  name: fraud-detector
spec:
  bentoTag: fraud-detector:v1.2.0
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"

LLMOps#

LLMs are either:

  1. API-hosted (OpenAI, Anthropic, Azure OpenAI) — you deploy a client, not a model
  2. Self-hosted open models (Llama, Mistral, Phi) — require significant GPU infrastructure

For self-hosted LLMs, the infrastructure requirements are fundamentally different:

# Self-hosted LLM deployment with vLLM on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "8192"
          resources:
            requests:
              nvidia.com/gpu: "2"
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      nodeSelector:
        accelerator: nvidia-a100
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-weights-pvc

GPU scarcity, cost, and VRAM constraints dominate self-hosted LLM operations in a way that has no equivalent in traditional ML.

For API-hosted LLMs, “deployment” means deploying your integration layer (prompts, RAG pipeline, business logic) rather than the model itself.
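To make that concrete, here is a minimal sketch of what the deployable unit can look like in the API-hosted case. The class and field names are illustrative, not any particular framework's API:

```python
# Sketch: for API-hosted LLMs, the deployable unit is configuration plus
# glue code, not model weights. All names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMServiceConfig:
    """Everything that defines this 'deployment' of an API-hosted LLM."""
    model: str                 # pinned snapshot, e.g. "gpt-4o-2024-08-06"
    prompt_version: str        # points at a versioned prompt file
    temperature: float = 0.0
    max_output_tokens: int = 1024

def build_request(config: LLMServiceConfig, system_prompt: str, user_input: str) -> dict:
    """Assemble the payload your integration layer sends to the provider."""
    return {
        "model": config.model,
        "temperature": config.temperature,
        "max_tokens": config.max_output_tokens,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }

config = LLMServiceConfig(model="gpt-4o-2024-08-06", prompt_version="2.1.0")
payload = build_request(config, "You are a support agent.", "How do I reset my password?")
```

The point is that config plus glue code is what you version, stage, and roll back; the weights never leave the provider.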

Versioning#

Traditional MLOps#

Model versioning is well-understood. You version the model artifact in a registry (MLflow, W&B, DVC). The model itself is the primary artifact.

# MLflow model registry (traditional)
import mlflow

with mlflow.start_run():
    mlflow.sklearn.log_model(model, "fraud_detector")
    mlflow.log_metrics({"accuracy": 0.94, "f1": 0.91})
    mlflow.log_params({"n_estimators": 100, "max_depth": 5})

# Register and promote
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production"
)

LLMOps#

For API-hosted LLMs, the model is managed by the provider. Your versioned artifacts are:

  • Prompts: System prompts, few-shot examples, prompt templates
  • RAG configuration: Chunk size, embedding model, retrieval strategy, top-k
  • Chains/agents: The orchestration logic connecting LLM calls
  • Evaluation datasets: Your test cases and expected outputs

These must be version-controlled with the same rigor as code:

# Prompt versioning with structured metadata
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PromptVersion:
    version: str           # semver
    template: str          # the actual prompt template
    model: str             # which model this was designed for
    author: str
    created_at: datetime
    eval_score: float | None  # score on your eval suite

# Store prompts in version control alongside code
# prompts/support-bot/v2.1.0.yaml
# prompts/support-bot/v2.1.0.yaml
version: "2.1.0"
model: "gpt-4o"
system_prompt: |
  You are a support agent for Acme Corp.
  Always respond in a professional tone.
  If you cannot answer a question, escalate to a human agent.
  Never share pricing information — redirect to the sales team.
few_shot_examples:
  - user: "How do I reset my password?"
    assistant: "Navigate to Settings > Security > Reset Password..."
eval_score: 0.87
created_at: "2026-03-28"

The key insight: when you upgrade from gpt-4o-2024-08-06 to gpt-4o-2025-01-xx, your prompts may need adjustment. A model version bump is a deployment event that requires evaluation.
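One way to enforce that, sketched here as a plain score comparison (the tolerance value is an assumption you would tune):

```python
# Hypothetical promotion gate for a model version bump: re-run the eval
# suite against the candidate snapshot, then compare to the current baseline.
def should_promote(baseline_score: float, candidate_score: float,
                   max_regression: float = 0.02) -> bool:
    """Allow the bump only if the candidate's eval score does not fall
    more than max_regression below the baseline."""
    return candidate_score >= baseline_score - max_regression

# e.g. the current prompt scores 0.87 on the pinned snapshot; the new
# snapshot scores 0.86 -- within tolerance, so the upgrade can proceed
should_promote(0.87, 0.86)  # True
should_promote(0.87, 0.80)  # False
```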

Testing and Evaluation#

Traditional MLOps#

Evaluation is quantitative. Hold out a test set, compute metrics, compare to the baseline. If accuracy >= threshold, promote. Automated pass/fail.

# Traditional: deterministic evaluation
from sklearn.metrics import accuracy_score, f1_score

def evaluate_model(model, X_test, y_test) -> dict:
    predictions = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions, average="weighted"),
    }

# Gate promotion: if f1 < 0.90, reject

LLMOps#

LLM evaluation is fundamentally harder because output quality is subjective. You need multiple evaluation strategies:

1. Automated rule-based checks (fast, cheap, limited)

def basic_eval(output: str, expected_keywords: list[str]) -> dict:
    return {
        "contains_keywords": all(k.lower() in output.lower() for k in expected_keywords),
        "not_too_short": len(output.split()) > 20,
        "no_refusal": not any(p in output for p in ["I cannot", "I'm unable"]),
    }

2. LLM-as-judge (scalable, reasonable quality)

import json

from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, answer: str, reference: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are an evaluator. Score the answer on correctness and completeness (1-5). Respond in JSON: {\"score\": N, \"reason\": \"...\"}."
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nReference answer: {reference}\n\nAnswer to evaluate: {answer}"
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

3. Human evaluation (ground truth, expensive)

Maintain a golden dataset — a set of questions with human-validated correct answers. Run every prompt version against it before promoting.

# Eval pipeline structure
@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str       # human-validated
    tags: list[str]            # "factual", "reasoning", "format", etc.

def run_eval_suite(
    prompt_version: PromptVersion,
    cases: list[EvalCase],
    judge_model: str = "gpt-4o"
) -> dict:
    results = []
    for case in cases:
        actual = run_prompt(prompt_version, case.input)
        score = llm_judge(case.input, actual, case.expected_output)
        results.append({"case_id": case.id, **score})

    avg_score = sum(r["score"] for r in results) / len(results)
    return {
        "prompt_version": prompt_version.version,
        "avg_score": avg_score,
        "pass": avg_score >= 3.5,  # your threshold
        "results": results,
    }

Monitoring#

Traditional MLOps#

Monitor prediction distribution, feature drift, and business outcome labels. Ground truth often arrives within hours or days (e.g., fraud confirmed/rejected).

LLMOps#

Ground truth rarely arrives automatically. You must monitor proxies:

| Signal | Traditional ML | LLM |
|---|---|---|
| Output quality | Accuracy, F1 against labels | Refusal rate, LLM-judge scoring, user feedback |
| Input distribution | Feature drift (KS test) | Embedding drift, topic distribution shift |
| Ground truth | Typically available | Rarely available in real time |
| Failure mode | Wrong prediction | Hallucination, harmful output, policy violation |
| Latency | Milliseconds | Seconds to tens of seconds |
| Cost | Compute (fixed) | Per-token billing (variable with input/output length) |
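As a concrete example of one quality proxy, refusal rate can be computed directly from logged outputs. The marker phrases below are illustrative; tune them to your model and domain:

```python
# Sketch: refusal-rate proxy computed over a window of logged outputs.
# The phrase list is illustrative, not exhaustive.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable", "i am unable")

def refusal_rate(outputs: list[str]) -> float:
    """Fraction of responses that look like refusals."""
    if not outputs:
        return 0.0
    refusals = sum(
        1 for out in outputs
        if any(marker in out.lower() for marker in REFUSAL_MARKERS)
    )
    return refusals / len(outputs)

logs = [
    "Navigate to Settings > Security > Reset Password.",
    "I'm unable to help with that request.",
    "Your invoice is available under Billing.",
    "I cannot share pricing information.",
]
refusal_rate(logs)  # 0.5
```

A sudden jump in this number after a prompt or model change is an early warning long before any ground truth arrives.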

The key operational difference: in LLMOps, cost is a primary monitoring concern. Token usage varies with input length and user behavior. A single prompt change can double or halve your monthly bill.

# Cost anomaly detection
class CostAnomalyDetector:
    def __init__(self, baseline_daily_cost: float, threshold_multiplier: float = 2.0):
        self.baseline = baseline_daily_cost
        self.threshold = baseline_daily_cost * threshold_multiplier

    def check(self, today_cost: float) -> bool:
        if today_cost > self.threshold:
            # Alert: cost 2x above baseline
            return True
        return False
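Pairing the anomaly check with a per-request estimate makes the effect of a prompt change visible before deployment. The per-million-token rates below are placeholder assumptions; substitute your provider's current pricing:

```python
# Sketch: per-request cost estimate from token counts. The per-1M-token
# rates are placeholders -- look up your provider's actual pricing.
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_m: float = 2.50,
                 output_rate_per_m: float = 10.00) -> float:
    """Estimated USD cost of one request at the given per-million-token rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# A prompt change that doubles average output length sharply raises
# the output-side cost:
request_cost(1_000, 500)    # 0.0075
request_cost(1_000, 1_000)  # 0.0125
```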

Rollback#

Traditional MLOps#

Rollback means swapping the model artifact in the registry and redeploying the previous version. Relatively clean because inputs and outputs are structured.

LLMOps#

Rollback is more nuanced because there are multiple versioned components:

  1. Prompt rollback: Revert prompts/support-bot/v2.1.0.yaml to v2.0.0
  2. RAG config rollback: Revert chunk strategy or embedding model
  3. Model version pin: If the provider upgraded gpt-4o and behavior changed, pin to an older snapshot (most providers support dated model versions)
  4. Application code rollback: Revert orchestration changes
# Pin to a specific model snapshot to prevent surprise behavior changes
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # Pinned snapshot, not "gpt-4o" alias
    messages=[...]
)

Always use dated model versions in production. The gpt-4o alias resolves to the latest snapshot and can change behavior without notice.

CI/CD Pipeline Comparison#

Traditional MLOps Pipeline#

Code commit
→ Unit tests
→ Train model
→ Evaluate on test set (accuracy >= threshold?)
→ Register model artifact
→ Deploy to staging
→ Integration tests
→ Deploy to production
→ Monitor predictions

LLMOps Pipeline#

Code/prompt commit
→ Unit tests (application code)
→ Run eval suite on golden dataset (avg score >= threshold?)
→ Version prompt/config artifacts
→ Deploy to staging
→ Integration tests + canary user traffic
→ Deploy to production
→ Monitor: cost, latency, quality proxies, user feedback
→ Schedule regular eval re-runs against new traffic samples

The eval suite replaces model training, and judge scores on the golden dataset replace test-set accuracy metrics. The pipeline structure is similar; what runs inside each stage is different.
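In CI, the eval stage reduces to a gate that blocks the deploy below threshold. A minimal sketch (the 3.5 threshold matches the run_eval_suite example above; wire the return value to your CI script's exit code):

```python
# Sketch of the CI gate: block the pipeline when the average judge score
# from the eval suite drops below the promotion threshold.
def gate(avg_score: float, threshold: float = 3.5) -> int:
    """Return a process exit code: 0 promotes, 1 blocks the deploy."""
    if avg_score >= threshold:
        print(f"eval passed: {avg_score:.2f} >= {threshold}")
        return 0
    print(f"eval failed: {avg_score:.2f} < {threshold}")
    return 1

gate(3.6)  # returns 0 (promote)
gate(3.1)  # returns 1 (block)
```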

Key Operational Principles for LLMOps#

1. Treat prompts as code. Version them in git, review them in PRs, test them before promoting. A prompt is a first-class artifact.

2. Build eval before you ship. Define your golden dataset before writing the first prompt. Eval-driven development prevents regressions from going unnoticed.

3. Pin model versions in production. Never use a floating alias like gpt-4o; pin to dated snapshots.

4. Cost is a metric. Set cost budgets per service and alert on anomalies. Token costs are unpredictable without active monitoring.

5. Expect non-determinism. LLM outputs are not deterministic, even at temperature=0 with the same input. Build tests that tolerate semantic equivalence rather than demanding exact string matches.

6. Human feedback is a first-class signal. Instrument your application to collect thumbs-up/down or ratings, and route a sample to human review. This is your ground-truth loop.
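Principle 5 is worth making concrete. The sketch below uses difflib's sequence similarity as a cheap stand-in; an embedding-based similarity check is the stronger (but heavier) option, and the 0.6 threshold is an assumption to tune:

```python
# Sketch: a regression test that tolerates paraphrase instead of demanding
# exact string equality. Lexical similarity via difflib is a cheap proxy.
from difflib import SequenceMatcher

def roughly_equivalent(actual: str, expected: str, threshold: float = 0.6) -> bool:
    """True if the two strings are lexically similar enough to pass."""
    ratio = SequenceMatcher(None, actual.lower(), expected.lower()).ratio()
    return ratio >= threshold

roughly_equivalent(
    "Go to Settings > Security and click Reset Password.",
    "Navigate to Settings > Security > Reset Password.",
)  # True -- a paraphrase still passes

roughly_equivalent(
    "I cannot help with that.",
    "Navigate to Settings > Security > Reset Password.",
)  # False -- a refusal does not
```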

Conclusion#

LLMOps shares the operational goals of MLOps — reliable deployment, quality monitoring, safe rollback — but the tools and techniques differ significantly. The biggest shifts are in evaluation (from quantitative metrics to LLM-as-judge and human eval), versioning (prompts and configs become the primary artifact), and monitoring (cost and quality proxies replace prediction accuracy).

If you are bringing MLOps experience to LLM systems, the mindset transfers well. The specific practices need rethinking.
